Abstract
Accurate ancestry inference is critical for identifying genetic contributors to cancer disparities among populations. Although methods to infer genetic ancestry have historically relied upon genome-wide markers, the adaptation to targeted clinical sequencing panels presents an opportunity to incorporate ancestry inference into routine diagnostic workflows. We show that global ancestral contributions and admixture of continental populations can be quantitatively inferred using markers captured by the MSK-IMPACT clinical panel. In a pan-cancer cohort of 45,157 patients, we observed differences by ancestry in the frequency of somatic alterations, recapitulating known associations and revealing novel associations. Despite the comparable overall prevalence of driver alterations by ancestry group, the proportion of patients with clinically actionable alterations was lower for African (30%) compared with European (33%) ancestry. Although this result is largely explained by population-specific cancer subtype differences, it reveals an inequity in the degree to which different populations are served by existing precision oncology interventions.
We performed a comprehensive analysis of ancestral associations with somatic mutations in a real-world pan-cancer cohort, including >5,000 non-European individuals. Using an FDA-authorized tumor sequencing panel and an FDA-recognized oncology knowledge base, we detected differences in the prevalence of clinically actionable alterations, potentially contributing to health care disparities affecting underrepresented populations.
This article is highlighted in the In This Issue feature, p. 2483
INTRODUCTION
Cancer health disparities between racial and ethnic groups remain a key public health challenge in the United States (1–3). Differences in cancer incidence and mortality are due to a complex interplay of nongenetic factors such as access to health care, socioeconomic status, diet and lifestyle, and genetic factors including ancestral differences between populations. Although highly correlated with race, ancestry is a distinct attribute that specifically refers to inherited genetic variation correlated with human migration patterns. Moreover, unlike race, ancestry can be inferred quantitatively, including in recently admixed populations, such as African Americans and Latin Americans, in which individuals exhibit a mixture of alleles inherited from multiple ancestral groups (4).
Genetic contributions to observed differences in cancer incidence and clinical outcomes in different ancestral populations have been identified in many cancer types. For example, genome-wide mapping studies have revealed specific risk loci (5) that may contribute to the higher incidence and mortality rates for prostate cancer among African American men compared with European Americans (6). Similarly, population-based studies have consistently identified higher incidence rates of triple-negative breast cancer (TNBC) in women of African ancestry compared with Europeans (7, 8). Population-specific differences in the observed rates of specific somatic alterations have also been reported—for instance, the higher rate of EGFR mutations in patients of Asian descent with non–small cell lung cancer even after accounting for smoking history (9, 10).
In recent years, data from large-scale cancer genomics efforts have provided the opportunity to understand the relation between genetic ancestry and somatic features across cancers (11, 12). However, the limited number of patients of non-European ancestry in these studies has precluded a thorough evaluation of ancestry-specific associations across all cancer types. These studies have typically relied on genome-wide ancestry markers from broad genome-scale next-generation sequencing (NGS), often available only in the research setting. As targeted NGS panels are increasingly utilized clinically, the adaptation of these methods to more focused sets of markers presents an opportunity to prospectively incorporate ancestry inference as part of routine clinical care (13). The application to large cohorts of clinically sequenced patients may enable the identification of additional novel associations of ancestry with molecular and clinical features (14). At Memorial Sloan Kettering Cancer Center (MSK), we have compiled an enterprise-scale resource of tumor and matched normal sequencing data from over 500 cancer histologies using MSK-IMPACT, an FDA-authorized targeted NGS panel encompassing up to 505 cancer-associated genes (15).
Here, we demonstrate that ancestral contributions of African (AFR), European (EUR), East Asian (EAS), Native American (NAM), and South Asian (SAS) populations can be robustly inferred using more than 3,000 common single-nucleotide polymorphism (SNP) markers captured by MSK-IMPACT. We also show that Ashkenazi Jewish (ASJ) ancestry can be inferred with high sensitivity and specificity using an additional set of ASJ ancestry informative markers. Applying these methods to infer the genetic ancestry of 45,157 patients who underwent prospective sequencing at MSK, we report differences by ancestry in the frequency of somatic features and the prevalence of clinically actionable targets of FDA-approved therapies.
RESULTS
Genetic Ancestry Inference from MSK-IMPACT Data
We first sought to determine whether SNP markers within captured regions of the targeted MSK-IMPACT panel were sufficient to accurately infer genetic ancestry. To this end, we selected a reference dataset of 2,129 nonadmixed individuals from 24 geographic populations who were sequenced using whole-genome sequencing in the 1000 Genomes Project (1KGP; ref. 16) representing five different continental groups: AFR, NAM, EAS, EUR, and SAS (Supplementary Fig. S1A and S1B; see Methods). Principal component analysis of the genome-wide SNP markers showed a clear separation of five continental ancestry groups (Supplementary Fig. S1C). To evaluate the power of MSK-IMPACT–specific markers to infer ancestry, we chose an independent dataset of 279 samples from the Simons Genome Diversity Project (SGDP; ref. 17) comprising whole-genome sequences from diverse human populations around the world. We used ADMIXTURE (18), a program that estimates ancestry from large autosomal SNP genotype datasets by modeling the probability of the observed genotypes using ancestry proportions and population allele frequencies, to estimate ancestry contributions from five continental population groups in all SGDP samples. We ran this analysis with both the genome-wide (n = 716,011) and MSK-IMPACT–specific (n = 3,331–5,378) SNP markers, the latter chosen from the different versions of the panel, and evaluated the concordance of the resulting global ancestry inferences and admixture estimates.
Both the genome-wide and MSK-IMPACT–specific analyses revealed high proportions of NAM, AFR, EUR, EAS, and SAS ancestries estimated for samples from America, sub-Saharan Africa, West Eurasia, East Asia, and South Asia, respectively (Fig. 1A; Supplementary Fig. S2A). We observed near-perfect concordance in the estimated ancestry proportion of admixed individuals between the two marker sets (Pearson correlation coefficient = 0.997, P < 2.2e−16; Fig. 1B; Supplementary Fig. S2B–S2E).
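The concordance measure used above is a Pearson correlation between per-sample ancestry proportions estimated from the two marker sets. The following is a minimal sketch of that calculation, not the analysis code used in this study; the function name `pearson_r` and the proportions are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up ancestry proportions for five samples (NOT data from the study):
# one estimate from genome-wide markers, one from panel-restricted markers.
genome_wide = [0.02, 0.25, 0.50, 0.75, 0.98]
panel_only = [0.03, 0.24, 0.52, 0.73, 0.99]

r = pearson_r(genome_wide, panel_only)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 indicates that the panel-restricted estimates track the genome-wide estimates almost linearly. It does not by itself rule out a systematic offset between the two.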
Having established that MSK-IMPACT markers are sufficient to accurately infer ancestry and admixture in the whole-genome SGDP data, we repeated the same supervised ADMIXTURE analysis to infer ancestry contributions of AFR, EAS, EUR, NAM, and SAS populations in 45,157 individuals sequenced on MSK-IMPACT panels (Fig. 1C; Supplementary Table S1). We assigned discrete ancestry labels to each patient in whom the contribution to ancestry of a single population was inferred to be ≥80%; the remaining were considered admixed or of other ancestry not represented in the reference panel. Altogether, 76% of patients were labeled EUR, 5% AFR, 6% EAS, 2% SAS, <1% NAM, and 11% admixed/other (ADM).
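The labeling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function name `assign_label` and the example proportions are hypothetical, and only the ≥80% cutoff and the population codes come from the text.

```python
POPULATIONS = ("AFR", "EAS", "EUR", "NAM", "SAS")
THRESHOLD = 0.80  # single-population cutoff stated in the text

def assign_label(proportions):
    """Return the population contributing >= THRESHOLD, else 'ADM' (admixed/other)."""
    top_pop = max(POPULATIONS, key=lambda pop: proportions.get(pop, 0.0))
    if proportions.get(top_pop, 0.0) >= THRESHOLD:
        return top_pop
    return "ADM"

# Made-up ADMIXTURE-style proportions for three hypothetical patients.
patients = {
    "patient_1": {"AFR": 0.02, "EAS": 0.01, "EUR": 0.95, "NAM": 0.01, "SAS": 0.01},
    "patient_2": {"AFR": 0.66, "EAS": 0.00, "EUR": 0.30, "NAM": 0.04, "SAS": 0.00},
    "patient_3": {"AFR": 0.00, "EAS": 0.80, "EUR": 0.05, "NAM": 0.00, "SAS": 0.15},
}

labels = {pid: assign_label(props) for pid, props in patients.items()}
print(labels)
```

Because the proportions sum to 1, at most one population can reach the 80% cutoff, so taking the largest contribution and testing it against the threshold is sufficient.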
Five different versions of MSK-IMPACT were used during the accumulation of this dataset, comprising 341, 400, 410, 468, or 505 genes. We compared inferred ancestry proportions from samples run on different MSK-IMPACT versions and show that inferred ancestral proportions were highly concordant (Pearson correlation coefficient >0.997, P < 2.2e−16) across different versions and samples (Supplementary Fig. S3A–S3D). We also demonstrate that admixture estimates can be accurately inferred from tumor-only sequencing (Pearson correlation coefficient = 1, P < 2.2e−16; Supplementary Fig. S3E–S3G).
As an independent comparator, both self-reported race and ethnicity were available for 40,321 individuals (89%). Of the self-reported non-Latinx white individuals, 96% had inferred EUR ancestry (Fig. 1D). Of the self-reported non-Latinx Black/African American individuals, 68% had inferred AFR ancestry and 31% were labeled ADM with mean AFR admixture contribution of 66%. On the other hand, the vast majority of self-reported Hispanic/Latinx patients (72%) were ADM, which is consistent with known admixtures of EUR, AFR, and NAM ancestries in Latin American populations.
In some cases, the inferred genetic ancestry was able to provide a more granular level of detail than the categories available for self-identification of race and ethnicity. For instance, although our genetic analysis specifically distinguishes individuals with either SAS or EAS ancestry, individuals were only able to self-report as a single broad “Asian” category. Indeed, of those self-reported as Asian, 64% had EAS, 20% SAS, and 14% ADM inferred ancestries (Supplementary Fig. S4A and S4B). We were also able to assign ancestry to the 3,014 patients for whom self-identified race information was missing and who would therefore have been excluded from analyses relying on self-reported race (Supplementary Fig. S4C).
Differences between self-reported race and inferred genetic ancestry were observed in 2% of the patient cohort (739 patients; Supplementary Fig. S4D). Approximately 22% of these cases were individuals who self-reported as Hispanic/Latinx compared with 6% of the entire cohort. Anticipating that a higher proportion of patients from such recently admixed populations may not self-identify with the available race categories, we focused on discrepancies in self-reported non-Hispanic/Latinx patients. We reasoned that these discrepancies may represent reporting or database errors. To further investigate this, we leveraged independent manually curated family histories collected for a subset of individuals as part of cancer predisposition testing. For 160 patients with discrepancies (27, 41, 43, 22, and 27 with EUR, AFR, EAS, SAS, and NAM inferred ancestry, respectively), we identified compelling evidence in support of the inferred ancestry in 136/160 (85%) cases, compelling evidence contradicting the inferred ancestry in 1/160 (<1%), and inconclusive data available for the remainder. Thirty-eight of 43 EAS individuals were identified through detailed family history as either Chinese, Filipino, Korean, Vietnamese, or Singaporean. Sixteen of 22 SAS individuals were manually identified as either Indian or Pakistani, with the remaining patients identified as either from Guyana or Trinidad, where Indo-Caribbeans (individuals of Indian origin in the Caribbean) represent one of the largest ethnic groups. Ten of 15 patients predicted to be EUR and who self-reported as Asian originated from Middle Eastern countries and were expected to share more ancestry with Europeans than with South or East Asians. Taken together, these results reaffirm the accuracy of genetic ancestry inference from targeted panel data and demonstrate its utility and granularity when self-reported race may differ.
ASJ Ancestry Inference
ASJ individuals are of Eastern and Central European descent but tend to carry some genetic variants at a different frequency than other European populations (19). Among those variants are pathogenic germline mutations in certain cancer-predisposing genes, such as BRCA1 and BRCA2, which are associated with the risk of breast, ovarian, prostate, and pancreatic cancers (20). Before dedicated germline testing for pathogenic variants is obtained, ancestry inference may thus have value in clinical genetic risk assessment.
Although the MSK-IMPACT markers were chosen to assess continental ancestries, to infer ASJ ancestry we identified 282 additional SNP markers captured by MSK-IMPACT that had higher minor allele frequency (>1%) in ASJ compared with other populations (<0.1%). To evaluate the effectiveness of these markers for identifying ASJ individuals, we extracted genotypes at these marker sites from the MSK-IMPACT data in 8,217 patients for whom ASJ status was manually curated based on detailed family histories obtained through genetic counseling. Of the 1,257 patients whose ancestry was manually annotated as ASJ, 88% carried the minor allele of at least three of the 282 ASJ ancestry informative markers compared with only 1% of the 6,960 patients manually annotated as non-ASJ (Fig. 2A), indicating an accuracy of 97% in ASJ ancestry inference. This threshold of three markers had the highest accuracy and was therefore used for subsequent analyses. Additionally, our inference of ASJ ancestry was concordant across MSK-IMPACT panel versions and between tumor and matched normal for assessment of continental ancestry (Supplementary Fig. S5A–S5D).
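As a rough check of the reported accuracy, the threshold rule and the arithmetic can be sketched as below. This is a sketch, not the authors' code: the names `is_asj`, `sensitivity`, and `false_positive_rate` are hypothetical, and because the input percentages are rounded in the text, the result is approximate.

```python
MIN_MARKERS = 3  # marker-count threshold reported in the text

def is_asj(n_minor_allele_markers):
    """Call ASJ ancestry when at least MIN_MARKERS informative markers carry the minor allele."""
    return n_minor_allele_markers >= MIN_MARKERS

# Figures reported in the text (percentages are rounded).
n_asj, n_non_asj = 1257, 6960
sensitivity = 0.88          # annotated ASJ patients meeting the threshold
false_positive_rate = 0.01  # annotated non-ASJ patients meeting the threshold

true_pos = sensitivity * n_asj
true_neg = (1 - false_positive_rate) * n_non_asj
accuracy = (true_pos + true_neg) / (n_asj + n_non_asj)
print(f"approximate accuracy = {accuracy:.3f}")
```

The large non-ASJ group dominates the denominator, so overall accuracy is driven mainly by the low false-positive rate rather than by the 88% sensitivity.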
We inferred genotypes for the ASJ informative markers using MSK-IMPACT data on the entire cohort of 45,157 individuals and assigned ASJ ancestry to patients who met the threshold of three nonreference markers. Altogether, 16% of the cohort was identified as ASJ, 98% of whom were labeled EUR by ADMIXTURE. We therefore separated patients with EUR ancestry by ADMIXTURE into Ashkenazi Jewish European (hereafter referred to as ASJ) and non–Ashkenazi Jewish European (hereafter referred to as EUR) based on their inferred ASJ ancestry (Supplementary Table S1).
Of the patients with self-reported religion information, 92% of those with ASJ inferred ancestry were self-identified Jewish (Fig. 2B), and 84% of self-reported Jewish individuals had ASJ inferred ancestry. Considering that not all Jewish individuals are of Ashkenazi descent and that admixture is common in this population, the observed concordance between self-reported religion and ASJ ancestry suggests that our method is accurate and informative.
As an independent indicator of accuracy, we examined the prevalence of cancer-predisposing BRCA1 (c.68_69delAG, c.5266dupC) and BRCA2 (c.5946delT) founder mutations in patients with ASJ and non-ASJ inferred ancestry (Supplementary Table S2). These mutations were neither included in nor in linkage disequilibrium with the ASJ informative marker set. We found that the frequency of observing at least one of these three germline mutations was significantly higher in patients with ASJ inferred ancestry compared with non-ASJ (5.1% vs. 0.2%, Fisher exact test P = 8.97e−200; Fig. 2C). Moreover, the prevalence of these BRCA1/2 founder mutations increased with the number of heterozygous (het) or homozygous alternate (hom-alt) ASJ markers (Fig. 2D). Overall, 83% of patients with at least one of the BRCA1/2 founder mutations had ASJ inferred ancestry by our analysis. A majority (64%) of these patients had at least six het/hom-alt genotyped ASJ marker sites. Of the BRCA1/2 founder mutation–positive patients with ASJ inferred ancestry, 33% did not self-report as Jewish (3% reported a different religion, and for 30% there was no report of a religion) and, therefore, may not have been considered for germline risk screening.
Ancestral Differences in Cancer Type Prevalence and Somatic Mutation Patterns
We next examined the ancestry distribution within each cancer type for our cohort of 45,157 patients. AFR patients were overrepresented among patients with breast cancer, endometrial cancer, gastrointestinal stromal tumors, and uterine sarcoma, and underrepresented among patients with melanoma. We also observed an overrepresentation of EAS patients among patients with upper gastrointestinal cancers (esophagogastric and hepatobiliary), head and neck cancer, and salivary gland carcinoma (Fig. 3A). These differences were frequently associated with differences in the prevalence of specific subtypes within each general cancer type (Supplementary Fig. S6). For instance, although cutaneous melanoma represents by far the most common type of melanoma in patients with EUR ancestry (60%), it accounted for only 14% of melanomas in patients with AFR ancestry as opposed to acral (45%) and mucosal (36%) melanoma, consistent with prior reports (21). Additionally, uterine endometrioid carcinoma accounted for over 56% of all endometrial cancers in patients with EUR ancestry but only 23% in patients with AFR ancestry, with the more aggressive serous and carcinosarcoma/mixed Mullerian subtypes representing 35% and 24%, respectively, of all endometrial cancers in patients with AFR ancestry compared with 14% and 10%, respectively, in patients with EUR ancestry.
We next sought to identify associations between ancestry and somatic driver alterations. Recognizing that the different subtype frequencies might lead to spurious associations between ancestry and somatic mutations within general cancer types, we restricted our analyses to subtype-level comparisons, examining all cancer subtypes with 15 or more patients in at least one of the non-EUR populations. Because patients of EUR ancestry represent the majority of the cohort, we systematically compared the alteration frequencies of genes in EUR relative to other populations after adjusting for sex, age, and disease status (primary vs. metastatic). Altogether, we identified 18 significant associations (FDR-adjusted P < 0.05; Fig. 3B; Table 1; Supplementary Table S3).
Cancer subtype | Code | Gene | Ancestry group 2 | Alteration frequency in EUR | Alteration frequency in group 2 | Citations if previously reported |
---|---|---|---|---|---|---|
Breast invasive ductal carcinoma | IDC | TP53 | AFR | 43.7% | 61.5% | (8, 11, 12, 25, 26) |
Colon adenocarcinoma | COAD | KRAS | AFR | 45.1% | 62.9% | (13, 30) |
Glioblastoma multiforme | GBM | TERT | EAS | 87.4% | 56.5% | |
Hepatocellular carcinoma | HCC | RB1 | EAS | 3.4% | 23.5% | (31) |
Hepatocellular carcinoma | HCC | TERT | EAS | 67.2% | 35.3% | |
Lung adenocarcinoma | LUAD | EGFR | EAS | 23.2% | 67.5% | (9, 10, 23, 45) |
Lung adenocarcinoma | LUAD | KRAS | EAS | 40.9% | 10.3% | (10, 23, 46) |
Lung adenocarcinoma | LUAD | EGFR | SAS | 23.2% | 65.5% | (22, 45) |
Lung adenocarcinoma | LUAD | STK11 | EAS | 17.2% | 3.1% | (23) |
Lung adenocarcinoma | LUAD | EGFR | AFR | 23.2% | 44.7% | (47) |
Lung adenocarcinoma | LUAD | KEAP1 | EAS | 8.8% | 2.6% | (23) |
Lung adenocarcinoma | LUAD | EGFR | ASJ | 23.2% | 29.4% | |
Lung adenocarcinoma | LUAD | MDM2 | SAS | 5.9% | 16.4% | |
Lung adenocarcinoma | LUAD | KRAS | SAS | 40.9% | 12.7% | (46) |
Lung adenocarcinoma | LUAD | KRAS | AFR | 40.9% | 24.4% | (46) |
Lung squamous cell carcinoma | LUSC | TP53 | EAS | 88.7% | 60% | |
Prostate adenocarcinoma | PRAD | FOXA1 | EAS | 14.1% | 35.1% | (27, 28) |
Prostate adenocarcinoma | PRAD | MYC | AFR | 6.9% | 12.7% | (29) |
NOTE: Eighteen significant differences (FDR-adjusted P < 0.05) were identified between the EUR cohort and another ancestral group. Alteration frequencies are displayed for both groups.
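The text reports FDR-adjusted P values but does not name the adjustment procedure here. As an illustration only, the sketch below uses the Benjamini-Hochberg step-up procedure, a common choice, which is an assumption rather than a description of the study's method; the raw P values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up raw p-values for five hypothetical gene-ancestry tests.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
print(adj_p, significant)
```

In this toy example, two tests with raw P < 0.05 no longer pass after adjustment, which shows why the number of reported associations depends on how many comparisons were made.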
We found a number of significant associations within lung adenocarcinoma (LUAD), including a higher frequency of EGFR mutations in both EAS and SAS compared with EUR, consistent with prior reports (9, 22). Other associations in LUAD include lower frequency of KRAS mutations in both EAS and SAS as well as lower frequency of STK11 and KEAP1 mutations in EAS (23). Given the influence of smoking status on somatic mutation profiles of lung tumors (24) and differences in smoking status by ancestry observed in our cohort (Supplementary Fig. S7A–S7C), we restricted our analysis to never-smokers (n = 838 patients) and noted that although EGFR mutations remained significantly enriched in Asian populations, most other associations were no longer significant (Fig. 3C). This remained true when including all patients in the analysis and adjusting for smoking status (Supplementary Fig. S8). Among never-smokers, we also observed a higher frequency of EGFR mutations in AFR compared with EUR (69% vs. 49%, Fisher exact test P = 0.014), though we cannot exclude the possibility of ascertainment bias in our cohort. Although tumor mutation burden (TMB) was significantly lower in ASJ, EAS, and SAS patients with LUAD overall, these differences were not observed when restricted to only never-smokers (Supplementary Fig. S7C).
Additionally, we observed a higher frequency of TP53 mutations in women of AFR ancestry with breast invasive ductal carcinoma (IDC; 61% vs. 44% in EUR), as has been previously reported (25). As previously shown (7), this was driven in part by the higher incidence of TNBC in patients of AFR ancestry (32% vs. 17% in EUR; Supplementary Fig. S9A), the majority of whom carried concomitant TP53 alterations. However, even within the HR+/HER2− group, TP53 mutations were more common in patients with AFR ancestry (39%) compared with those of EUR ancestry (24%; Fisher exact test P = 0.005) or other populations (Supplementary Fig. S9B), as has been previously reported (26). This trend was also reflected in patients with admixed ancestry, with women exhibiting 20% to 50% AFR ancestry and 50% to 80% AFR ancestry harboring intermediate but increasing rates of TP53 alterations (Supplementary Fig. S10A and S10B).
Consistent with previous studies, we found a higher alteration frequency of FOXA1 in patients of EAS ancestry with prostate adenocarcinoma (PRAD; refs. 27, 28), MYC in patients of AFR ancestry with PRAD (27, 29), KRAS in patients of AFR ancestry with colon adenocarcinoma (COAD; ref. 30), and RB1 in patients of EAS ancestry with hepatocellular carcinoma (HCC; ref. 31). We also observed a lower frequency of TERT (primarily promoter) mutations in patients of EAS ancestry with glioblastoma multiforme (GBM) and in patients of EAS ancestry with HCC compared with those of EUR ancestry, which to our knowledge has not been described before.
Differences in Prevalence of Clinically Actionable Mutations by Ancestry
Finally, we sought to compare the rates of clinically actionable alterations in different ancestral populations. Although differences in actionable alterations partly reflect different cancer type and subtype distributions by ancestry, they may contribute to cancer health disparities and reveal opportunities for targeted drug development. We annotated mutations using OncoKB (32), an in-house, FDA-recognized clinical knowledge base, and considered mutations that were level 1 (FDA-recognized biomarker of FDA-approved treatment response), level 2 (standard care biomarker of treatment response), and level 3A (investigational treatment biomarker supported by compelling clinical evidence from the corresponding cancer type) to be actionable. Included in OncoKB level 1 are second-order features that predict response to immune-checkpoint inhibition: microsatellite instability (MSI) and TMB >10 mutations per megabase (TMB-high, or “TMB-H”; Supplementary Fig. S11A–S11C).
The proportion of patients with solid tumors harboring at least one oncogenic driver alteration was nearly identical across EUR, ASJ, AFR, EAS, and SAS populations (93.5%–94.5%) and slightly less in NAM (90.6%), although we are underpowered to detect differences of less than 10% in NAM due to its small sample size. The number of total driver mutations (median: 4.0–4.0; mean: 4.13–4.36) and somatic variants of unknown significance (median: 3.0–3.0; mean: 4.18–4.64) were also consistent across the five main ancestral groups, excluding TMB-high and MSI-high tumors. However, the proportion of patients with clinically actionable somatic alterations in our large real-world cohort was lower in AFR (29.6%) compared with EUR (33.1%; Fisher exact test P = 1.06e−03; Fig. 4A). Whereas the rates of actionable mutations in ASJ (34.3%) and EAS (34.4%) were comparable with EUR, we also observed lower rates of actionable alterations in NAM (29.5%) and SAS (29.8%) patients, though these differences were not statistically significant due to fewer individuals in these groups. MSI was most common in NAM and SAS (5.4% and 3.6%) and least common in EAS (2.4%; Supplementary Fig. S11C).
The frequency of actionable alterations varied across cancer types, with population-specific differences (Fig. 4B; Supplementary Fig. S12). For example, melanoma, non-melanoma skin cancer, and endometrial cancer exhibited lower rates of actionable alterations in patients of AFR ancestry compared with those of EUR ancestry. In melanoma, only 12% of patients with AFR ancestry harbored actionable mutations (vs. 73% of EUR) due to the much lower prevalence of cutaneous melanomas, in which level 1 BRAF V600E mutations and TMB-high signatures are enriched (Supplementary Fig. S13A–S13C). Similarly, the low rate of actionable mutations in patients with AFR ancestry with non-melanoma skin cancer (17% vs. 48% in EUR) is explained by the very low prevalence of cutaneous squamous cell carcinomas (n = 2), which exhibit TMB-high signatures in 76% of patients across all ancestries (Supplementary Fig. S14A–S14C). However, these cancer types are relatively rare and account for only a small fraction of the difference in actionability observed between the EUR and AFR cohorts. In order to determine the contribution of cancer type and subtype differences to the higher rate of actionable mutations in patients with EUR ancestry, we estimated the frequency of actionable mutations in the EUR cohort if the distribution of cancer subtypes was adjusted to match the AFR cohort. In this case, the actionability rate decreased to 31.4%, almost exactly the midpoint between the two cohorts, suggesting that cancer subtype differences account for approximately half of the difference in actionability between EUR and AFR patients and that the remainder relates to differences in the somatic mutation profiles.
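The subtype adjustment described above amounts to reweighting subtype-specific rates from one cohort by the subtype mix of another. The sketch below illustrates that idea under stated assumptions: the function `standardized_rate`, the subtype names, and the per-subtype rates and counts are hypothetical, and only the three cohort-level percentages in the final step come from the text.

```python
def standardized_rate(rates_by_subtype, target_counts):
    """Weight subtype-specific rates by the subtype distribution of a target cohort."""
    total = sum(target_counts.values())
    return sum(rates_by_subtype[s] * n / total for s, n in target_counts.items())

# Made-up subtype-level actionability rates and patient counts (NOT study data).
eur_rates = {"subtype_A": 0.70, "subtype_B": 0.40, "subtype_C": 0.10}
eur_counts = {"subtype_A": 300, "subtype_B": 500, "subtype_C": 200}
afr_counts = {"subtype_A": 50, "subtype_B": 250, "subtype_C": 200}

eur_observed = standardized_rate(eur_rates, eur_counts)  # EUR rates, EUR subtype mix
eur_adjusted = standardized_rate(eur_rates, afr_counts)  # EUR rates, AFR subtype mix

# Cohort-level percentages reported in the text.
eur_rate, afr_rate, eur_reweighted = 33.1, 29.6, 31.4
share_from_subtype_mix = (eur_rate - eur_reweighted) / (eur_rate - afr_rate)
print(f"share of gap attributable to subtype mix = {share_from_subtype_mix:.2f}")
```

With the reported values, the reweighted rate accounts for roughly 49% of the gap, which matches the statement that subtype differences explain approximately half of it. The sketch requires a rate for every subtype present in the target cohort; how subtypes absent from one cohort were handled is not stated in the text.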
After adjusting for cancer subtype distributions, the predominant contributors to the lower rate of actionable mutations among patients with AFR ancestry were IDC and COAD. In breast cancer overall, 47% of patients with AFR ancestry had actionable mutations compared with 58% with EUR ancestry. Moreover, the rate of actionable mutations in IDC (80% of breast cancers in AFR), especially level 1 alterations such as ERBB2 and PIK3CA, was lower in AFR compared with EUR. This imbalance is largely driven by the higher rate of TNBC in AFR, as well as a significantly lower frequency of PIK3CA mutations in AFR (26% vs. 42% in EUR, Fisher exact test P = 0.001) in HR+/HER2− IDC samples (Supplementary Fig. S15A–S15F), as has been previously reported (21, 22). In COAD, the imbalance is due to a lower prevalence of BRAF mutations (7% vs. 12%) and an MSI signature (5% vs. 13%) in the AFR cohort compared with the EUR cohort, whereas mutations in KRAS were significantly more common (Supplementary Fig. S16A–S16D; ref. 13).
In contrast to the above cancer types, thyroid cancer exhibited a greater proportion of actionable mutations in patients with AFR ancestry (63% vs. 35% in EUR, Fisher exact test P = 0.003) that was primarily driven by an enrichment of level 3A mutations in NRAS (Supplementary Fig. S17A–S17C). BRAF mutations by contrast are level 1 in anaplastic thyroid cancer (THAP) and level 3B in all other subtypes and occurred more frequently in non-AFR populations. Recent studies have described that NRAS- and BRAF-mutant thyroid cancers appear to be etiologically different entities (33, 34). In primary papillary thyroid cancer (THPA), mutations in NRAS and BRAF are usually mutually exclusive, with NRAS mutations enriched in the follicular variant of papillary thyroid carcinoma (FVPTC), whereas BRAF mutations are associated with the classic type and tall cell variant of THPA (33). The dichotomous relationship between BRAF and NRAS is maintained in advanced poorly differentiated thyroid cancer (THPD) and THAP (34). A cross-sectional analysis demonstrated that FVPTC were more common among Black Americans than white Americans with papillary thyroid carcinoma (35). This observation led us to examine the relative rates of activating NRAS and BRAF mutations across all thyroid cancers. Patients of AFR ancestry exhibited the highest rate of NRAS mutations of any ancestral population and the lowest rate of BRAF mutations of any ancestral population in every subtype examined: THPA, THAP, and THPD (Supplementary Fig. S17D). Overall, NRAS mutations were enriched 2.9-fold and BRAF mutations were depleted 2.5-fold in patients of AFR ancestry compared with those of EUR. This trend was also observed when comparing self-reported white and self-reported Black patients from MSK, as well as from other prospectively sequenced clinical cohorts represented in AACR Project GENIE (ref. 36; Supplementary Fig. S17E).
DISCUSSION
Here we show that genetic ancestry can be reliably inferred from targeted sequencing data using only a few thousand markers. We developed a framework for ancestry inference using the FDA-authorized MSK-IMPACT panel, the results of which remained consistent across different panel versions and demonstrated high concordance with self-reported race. Because targeted NGS panels continue to be used widely, especially in the clinical setting, this enables the interrogation of large prospective cancer cohorts (13). MSK-IMPACT alone is used to sequence more than 12,000 patients per year, representing an important and growing resource to study cancer disparities and explore potential genetic contributions to cancer prevalence and outcomes in diverse populations.
Although genetically inferred ancestry is not equivalent to race, nor does it capture the nongenetic environmental and socioeconomic factors that contribute to cancer health disparities, the high overall concordance between genetic ancestry and self-reported race in our cohort of 45,157 patients provides important validation of the accuracy of our method. Moreover, the direct inference of genetic ancestry from NGS data holds key advantages for enhancing population-based cancer research, as self-reported race and ethnicity can be incomplete or inaccurate in cancer registries. For example, in our real-world cohort, self-reported race was unknown or unavailable for 3,014 (7%) individuals. Categorical variables for reporting race also typically match census designations established by the U.S. Office of Management and Budget and can be overly broad (for example, “Asian”), whereas SAS and EAS ancestries are distinguishable by this method. Additionally, in contrast to self-reported race, ancestry is a continuous variable and can be used to quantitatively capture proportional contributions from different populations in admixed individuals.
Apart from continental-level ancestry, we also identified genetic markers that determine ASJ ancestry with high accuracy. This is important clinically, as individuals of ASJ ancestry are at increased risk of developing several types of cancer, and determination of ASJ status is an integral feature of germline risk assessment. With the selection of specific markers and additional testing, we anticipate the inference of additional subpopulations, such as Chinese and Japanese ancestral groups within the EAS population, as a future direction. As granular ancestry inference becomes more precise, the ethical and privacy issues around sharing this information with patients who may prefer not to know, especially when related to germline cancer risk, must be considered.
We also observed that ancestry inference was comparable between tumors and their matched normal samples, a finding with important implications for the majority of academic and commercial NGS laboratories that perform unmatched tumor testing. Standard strategies such as the use of population databases to filter out known germline variants are less accurate for individuals of non-European ancestries (37, 38). The inference of genetic ancestry from tumor-only sequencing could inform the probability that a given variant is somatic or germline in origin, enabling more accurate filters for diverse populations and guiding posttest genetic counseling and confirmatory testing when variants that may be germline in etiology are detected. Additionally, given bottlenecks associated with the clinical interpretation of germline variants of unknown significance (VUS), which often incorporates knowledge of frequency differences in diverse populations, our uniform and reproducible approach to ancestry inference could help to accelerate variant interpretation and reclassification, especially in non-European patients in whom germline VUSs occur at a higher rate (39, 40).
Due to the large and increasing number of patients receiving targeted clinical NGS testing, the inclusion of ancestry enables the identification of differences in somatic features by ancestry. In our cohort of 45,157 patients sequenced by MSK-IMPACT, we were able to recapitulate known gene–ancestry associations such as a higher frequency of somatic mutations in EGFR in patients of EAS and SAS ancestries with LUAD, FOXA1 in patients of EAS ancestry with PRAD, and TP53 in patients of AFR ancestry with IDC. Additionally, we observed novel putative associations, such as a lower frequency of TERT mutations in patients of EAS ancestry with GBM. We also observed fewer BRAF and more NRAS mutations in patients of AFR ancestry with thyroid cancers. Although we report 18 significant somatic alterations associated with ancestry, we also find nearly 10 of them to be explained by clinical attributes such as receptor status (breast) and smoking (lung). Further studies are required to determine if some of the remaining associations could be explained by other clinical, demographic, and disease-specific covariates that are not available in our study cohort. For example, it remains to be seen if the lower frequency of TERT alterations in HCCs of EAS ancestry could be explained by factors such as hepatitis infection and alcohol consumption (41).
Using our institutional, FDA-recognized knowledge base, OncoKB, we found that although the rates of oncogenic driver mutations were comparable across the different populations, the overall rate of therapeutically targetable mutations was lower in AFR, NAM, and SAS (30% in each) compared with EAS (34%) and EUR (33%) populations. Several factors may contribute to this difference, including differences in cancer type and subtype distributions, stage at cancer diagnosis, environmental factors, and access to health care. Moreover, this difference in overall actionability is modest, and its implications for diagnosis are unclear. Nonetheless, these findings represent the real-world clinical experience at our cancer center over an 8-year period, highlighting an urgent need for the development and approval of drugs targeting genomic alterations more common among underrepresented populations.
One limitation of this study is the ascertainment bias of our cohort recruited from a tertiary cancer care center. Although all patients at MSK with advanced disease are generally eligible to receive MSK-IMPACT testing, the composition of our cohort may not faithfully represent the distribution of cancer types or tumor stages observed in the broader community. Additionally, non-European populations are underrepresented in our cohort. Similar to The Cancer Genome Atlas (TCGA), individuals with EUR ancestry (including ASJ) make up more than 76% of this study. However, the absolute numbers of these underrepresented populations in this study are greater (2,193 AFR, 2,505 EAS, 821 SAS, 160 NAM, and 5,012 ADM) compared with TCGA and other studies. Moreover, this center-based patient cohort is growing at a rate of >12,000 patients per year, with community-based partnerships and ongoing initiatives to provide access to testing to underserved populations, which promise to increase the diversity of our institutional cohort.
Another limitation of this study is that we excluded data from patients with ADM ancestry for gene–cancer subtype association and clinical actionability analyses. ADM patients make up 11% of the total cohort. Moreover, 34% of self-reported Black/African American patients and 72% of self-reported Hispanic/Latinx patients exhibited genetic admixture. Ultimately, accurate estimates of local (locus-specific) ancestry will be necessary to understand the contribution of genetic ancestry in these patients. Although local ancestry is typically inferred using genome-wide marker data, Carrot-Zhang and colleagues (10) have shown that local ancestry may be inferred from targeted panel data by utilizing data from off-target reads. Future efforts to infer local ancestry from MSK-IMPACT data promise to further elucidate genetic contributions to somatic processes, particularly in these ADM populations.
In summary, by establishing a workflow for accurate global ancestry inference from targeted NGS data and applying it to a large real-world cohort profiled using an FDA-authorized clinical sequencing test and annotated using an FDA-recognized clinical knowledge base, we have identified population-specific differences in somatic patterns and actionable genomic alterations. We anticipate that this dataset will serve as an important resource for cancer disparities research.
METHODS
Marker Selection
For ADMIXTURE analyses, we first selected over 12 million genome-wide autosomal biallelic SNP markers with a minor allele frequency of greater than 1% in the 1000 Genomes cohort using PLINK v1.9 (42). For selecting MSK-IMPACT markers, we further restricted the markers to IMPACT505 probe regions. This resulted in 10,013 markers (Supplementary Table S4).
For selecting markers for ASJ ancestry determination, we selected biallelic autosomal SNPs from gnomAD (43) genomes r2.1.1 that were within regions covered by IMPACT468, and for which
AF_asj in gnomAD genome >0.01
AF, AF_nfe, AF_eas, AF_afr in gnomAD genomes <0.001
For markers seen with AF_asj >0 in gnomAD exomes r2.1.1, AF_asj/(max(AF, AF_nfe, AF_eas, AF_afr, AF_sas)) >2 in gnomAD exomes
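The three criteria above can be sketched as a single predicate. This is a minimal illustration rather than the pipeline used in the study: the function name and the dictionary layout are hypothetical, with keys mirroring the gnomAD allele-frequency fields named above. The exome ratio is taken against the other frequencies only, which is an assumption, because a maximum that includes AF_asj itself could never be exceeded twofold.

```python
from typing import Mapping, Optional

# Frequencies that must stay rare in gnomAD genomes (criterion 2).
GENOME_BACKGROUND = ("AF", "AF_nfe", "AF_eas", "AF_afr")
# Frequencies used as the denominator of the exome ratio (criterion 3, assumed).
EXOME_BACKGROUND = ("AF", "AF_nfe", "AF_eas", "AF_afr", "AF_sas")


def is_asj_marker(genome: Mapping[str, float],
                  exome: Optional[Mapping[str, float]] = None) -> bool:
    """Return True if a SNP passes the ASJ marker criteria (hypothetical helper)."""
    # Criterion 1: common in the ASJ genomes.
    if genome["AF_asj"] <= 0.01:
        return False
    # Criterion 2: rare overall and in the listed populations.
    if any(genome[key] >= 0.001 for key in GENOME_BACKGROUND):
        return False
    # Criterion 3: only checked when the marker is seen in ASJ exomes.
    if exome is not None and exome.get("AF_asj", 0.0) > 0:
        background = max(exome[key] for key in EXOME_BACKGROUND)
        if background > 0 and exome["AF_asj"] / background <= 2:
            return False
    return True
```

A marker that is absent from the exome data is judged on the genome criteria alone in this sketch.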
Reference Sample Selection
We used samples from 1KGP as reference. 1KGP comprises data from 26 population groups from five different continental populations: AFR, admixed American (AMR), EAS, EUR, and SAS (Supplementary Table S6). However, certain groups in this dataset include individuals with recent ancestral admixture. For example, AMR individuals represent recent admixtures of AFR, EUR, and NAM populations. Similarly, African Caribbean in Barbados (ACB) and Americans of African ancestry in Southwest USA (ASW) within the AFR superpopulation are recent admixtures of EUR and AFR. In order to create a clean reference panel of individuals representing single ancestral populations, we identified and removed recently admixed individuals from the 1KGP dataset.
For this, we first ran unsupervised ADMIXTURE (16) v1.3 using 841,707 SNP markers that were a subset of the genome-wide SNP markers (see “Marker Selection”) in linkage equilibrium (PLINK v1.9 --indep-pairwise 1000 100 0.2), and with the number of ancestral populations (K) set to 5.
The five populations (P1, P2, P3, P4, and P5) determined by ADMIXTURE captured the EAS, AFR, SAS, NAM, and EUR ancestries (Supplementary Fig. S1). Any sample from the EAS, AFR, SAS, AMR, or EUR superpopulation of 1KGP that was estimated by ADMIXTURE to have a fraction of less than 0.8 of P1, P2, P3, P4, or P5, respectively, was excluded from our reference for all subsequent supervised ADMIXTURE analyses.
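The reference filter described above amounts to a per-sample threshold on the component matched to the sample's superpopulation. The sketch below is illustrative only: the function name and input layout are hypothetical, and the pairing of superpopulations with P1 to P5 follows the order given in the text.

```python
# Pairing of 1KGP superpopulations with unsupervised ADMIXTURE components,
# following the order given in the text (AMR samples are matched to the
# component that captured NAM ancestry).
SUPERPOP_TO_COMPONENT = {
    "EAS": "P1",
    "AFR": "P2",
    "SAS": "P3",
    "AMR": "P4",
    "EUR": "P5",
}


def keep_reference_sample(superpop, fractions, threshold=0.8):
    """Keep a 1KGP sample only if its matched component fraction is >= threshold.

    `fractions` maps component names ("P1".."P5") to ADMIXTURE estimates.
    """
    return fractions[SUPERPOP_TO_COMPONENT[superpop]] >= threshold
```

For example, an ASW sample with an estimated 0.75 fraction of P2 would be dropped from the reference panel, whereas one with 0.92 would be retained.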
The resulting 1KGP reference panel consisted of 2,129 samples from 24 populations. We also ran a principal component analysis on the data and observed distinct separation among all five different ancestral groups (Supplementary Fig. S1C).
ADMIXTURE Analysis on SGDP
We downloaded 279 VCF files from SGDP that contained genotype information at every single position. For each SGDP sample, we extracted genotypes for the selected genome-wide SNP marker sets (see “Marker Selection”) using Genome Analysis Toolkit (GATK; ref. 44) v4.1.9.0 SelectVariants.
We ran GATK v4.1.9.0 Pileup on 100 random samples sequenced on different MSK-IMPACT versions to genotype the selected 10,013 IMPACT SNP markers. For this, we required a minimum mapping quality (MQ) of 10, minimum base quality (BQ) of 20, and minimum read depth (for reads that met the MQ and BQ thresholds) of 10. For each MSK-IMPACT version, we selected markers that could be genotyped in at least 80% of the samples. We then extracted genotypes for each of these MSK-IMPACT–specific SNP sets using GATK's SelectVariants.
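The per-version marker selection can be illustrated with a short call-rate filter. This is a sketch under stated assumptions: the function name and input layout are hypothetical, and a marker counts as genotyped in a sample only if it already passed the MQ, BQ, and depth thresholds described above.

```python
from collections import Counter


def callable_markers(genotyped_by_sample, all_markers, min_call_rate=0.8):
    """Return markers genotyped in at least `min_call_rate` of the samples.

    `genotyped_by_sample` maps a sample ID to the set of marker IDs that
    could be genotyped in that sample (after MQ/BQ/depth filtering).
    """
    n_samples = len(genotyped_by_sample)
    if n_samples == 0:
        return []
    counts = Counter()
    for markers in genotyped_by_sample.values():
        counts.update(set(markers))
    # Preserve the order of `all_markers` in the output.
    return [m for m in all_markers if counts[m] / n_samples >= min_call_rate]
```

Running this once per panel version on that version's samples would yield the version-specific SNP sets.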
We then merged genotype calls from SGDP and 1KGP using PLINK v1.9 and ran linkage-disequilibrium pruning (--indep-pairwise 1000 100 0.2). This resulted in 716,011 genome-wide markers, 5,378 IMPACT505 markers, 4,336 IMPACT468 markers, 3,990 IMPACT410 markers, 3,602 IMPACT341 markers, and 3,331 IMPACT-HEME markers (Supplementary Table S4).
We ran supervised ADMIXTURE with K = 5 and selected reference samples from 1KGP to estimate ancestral fractions of EAS, AFR, SAS, NAM, and EUR for all SGDP samples.
ADMIXTURE Analysis on MSK-IMPACT Data
We analyzed available MSK-IMPACT data from 45,157 patients who underwent prospective sequencing at MSK as part of their routine clinical care and whose data are also included in the AACR Project GENIE v10.1 public cohort (36). We used data from the matched normal that was available for 99.2% of the patients in the cohort and from the tumor sample for the remaining 357 patients. In case there were multiple matched normal samples for a patient, we picked the most recent sample. Six percent of all analyzed samples were sequenced on MSK's IMPACT-HEME panel, 5% on IMPACT341, 19% on IMPACT410, 68% on IMPACT468, and 2% on IMPACT505.
For each sample, we ran GATK v4.1.9.0 Pileup to genotype the selected 10,013 IMPACT SNP markers. For this we required an MQ ≥10, BQ ≥20, and a minimum read depth (for reads that met the MQ and BQ thresholds) of 10. We excluded markers that were not covered in the sample, merged genotype calls on remaining markers with those from 1KGP reference samples, and performed linkage-disequilibrium pruning using PLINK v1.9 (--indep-pairwise 1000 100 0.2). We ran supervised ADMIXTURE to estimate ancestral proportions of AFR, EUR, EAS, NAM, and SAS for the patients. Finally, we assigned ancestry labels to each patient. If the ancestral fraction of any of the populations was ≥0.8, the patient was assigned that population label. The ADMIXTURE run on SGDP showed that individuals from populations not represented in the reference panel are inferred to be admixed. Therefore, if the ancestral fraction of all populations was less than 0.8, the patient was labeled ADM. However, we think that the majority of such patients in the MSK-IMPACT cohort would be admixed and not from a different population.
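The labeling rule reduces to a threshold on the largest ancestral fraction. A minimal sketch follows; the function name and the dictionary input are hypothetical.

```python
def assign_ancestry_label(fractions, threshold=0.8):
    """Return the population whose fraction is >= threshold, else "ADM".

    `fractions` maps population codes (AFR, EAS, EUR, NAM, SAS) to the
    supervised ADMIXTURE estimates for one patient. Because the fractions
    sum to 1, at most one population can reach a threshold above 0.5.
    """
    population, fraction = max(fractions.items(), key=lambda item: item[1])
    return population if fraction >= threshold else "ADM"
```

A patient with fractions of 0.55 AFR and 0.45 EUR would be labeled ADM, whereas one with 0.85 AFR would be labeled AFR.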
ASJ Ancestry Determination
We ran GATK v4.1.9.0 Pileup to call genotypes on 282 markers selected for ASJ ancestry determination. We enumerated the markers genotyped as heterozygous (het) or homozygous alternate (hom-alt) for each patient. If the number of het/hom-alt genotyped markers was ≥3, the patient was assigned an ASJ label.
Ancestry Label Assignment
The patients labeled EUR in the ADMIXTURE analysis were further divided into ASJ and non-ASJ (EUR) based on their ASJ labels. All other patients were assigned their ADMIXTURE analysis labels.
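The ASJ call and the final label assignment can be sketched together. The function names and the genotype encoding ("ref", "het", "hom_alt", or None for no call) are hypothetical choices made for illustration.

```python
def is_asj(genotypes, min_markers=3):
    """Return True if at least `min_markers` ASJ markers are het or hom-alt.

    `genotypes` maps marker IDs to "ref", "het", "hom_alt", or None (no call).
    """
    n_carried = sum(1 for g in genotypes.values() if g in ("het", "hom_alt"))
    return n_carried >= min_markers


def final_ancestry_label(admixture_label, asj):
    """Split EUR into ASJ and non-ASJ; leave every other label unchanged."""
    if admixture_label == "EUR" and asj:
        return "ASJ"
    return admixture_label
```

Under this rule, a patient labeled ADM or AFR by ADMIXTURE keeps that label even if three or more ASJ markers are detected.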
Comparison of ADMIXTURE Results
To compare two different runs of ADMIXTURE for the same patient (e.g., with different marker sets), we calculated cosine similarity (cossim) between the two results:
$$\mathrm{cossim}(A, B) = \frac{\sum_{i \in S} A_i B_i}{\sqrt{\sum_{i \in S} A_i^2}\,\sqrt{\sum_{i \in S} B_i^2}}$$
where
S = {AFR, EAS, EUR, NAM, SAS}
A_i = ancestral fraction of i in ADMIXTURE run 1
B_i = ancestral fraction of i in ADMIXTURE run 2
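Cosine similarity between two ADMIXTURE results can be computed directly from the five ancestral fractions. The sketch below uses the standard definition of cosine similarity; the function name and the dictionary input are illustrative choices.

```python
import math

POPULATIONS = ("AFR", "EAS", "EUR", "NAM", "SAS")


def cossim(run1, run2):
    """Cosine similarity between two ADMIXTURE results for the same patient.

    Each argument maps the five population codes to ancestral fractions.
    """
    dot = sum(run1[p] * run2[p] for p in POPULATIONS)
    norm1 = math.sqrt(sum(run1[p] ** 2 for p in POPULATIONS))
    norm2 = math.sqrt(sum(run2[p] ** 2 for p in POPULATIONS))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("ancestral fractions must not all be zero")
    return dot / (norm1 * norm2)
```

Values near 1 indicate that the two runs assign nearly the same ancestral proportions; two runs that place a patient entirely in different populations give 0.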
Sample Selection for Subtype Prevalence, Gene–Ancestry Associations, and Actionable Mutation Analyses
There were sometimes multiple tumor samples from the same patient. In that case, we used only one sample. For this, we preferred samples with the highest tumor purity, primary sample over metastatic sample, and later time point sample over earlier time points.
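One way to apply these preferences is as an ordered tie-breaking key. The sketch below assumes, without confirmation from the text, that the criteria are applied in the order listed (purity first, then primary over metastatic, then the later time point); the field names are hypothetical.

```python
def pick_sample(samples):
    """Choose one tumor sample per patient using ordered preferences.

    Each sample is a dict with hypothetical fields:
      "purity"      - estimated tumor purity (float)
      "sample_type" - "Primary" or "Metastasis"
      "collected"   - ISO date string (YYYY-MM-DD), so string order is date order
    Assumes the preferences are applied lexicographically in the listed order.
    """
    return max(
        samples,
        key=lambda s: (s["purity"], s["sample_type"] == "Primary", s["collected"]),
    )
```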
Gene–Ancestry Associations in Different Cancer Subtypes
We used multivariate logistic regression models to identify genetic alterations associated with ancestry in each cancer subtype. The binary alteration status of a gene is defined as 1 when the gene has any somatic SNV, indel, focal copy-number alteration, or fusion event and 0 otherwise. For each cancer subtype, we excluded samples that were MSI-H (MSI score >10), had TMB scores of >20, or were labeled as ADM. Additionally, we excluded populations that had fewer than 15 samples for that cancer subtype.
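The exclusion criteria for one cancer subtype can be sketched as a two-step filter. The field names are hypothetical, and counting population sizes after the sample-level exclusions is an assumption, since the text does not state the order.

```python
from collections import Counter


def eligible_samples(samples, min_per_population=15):
    """Apply the per-subtype exclusions to a list of sample dicts.

    Hypothetical fields: "msi_score", "tmb", and "ancestry".
    """
    # Step 1: drop MSI-H, high-TMB, and admixed samples.
    kept = [
        s for s in samples
        if s["msi_score"] <= 10 and s["tmb"] <= 20 and s["ancestry"] != "ADM"
    ]
    # Step 2: drop populations with too few remaining samples (assumed order).
    counts = Counter(s["ancestry"] for s in kept)
    return [s for s in kept if counts[s["ancestry"]] >= min_per_population]
```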
In each cancer subtype, we tested the alteration status of each gene while controlling for age at diagnosis, disease status (primary or metastasis), and sex (where applicable). EUR served as the reference level (coded as 0) in all regressions. P values were adjusted for multiple hypothesis testing with the Benjamini–Hochberg procedure (Supplementary Table S3).
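For reference, the Benjamini–Hochberg adjustment can be written in a few lines. This is a generic sketch of the standard procedure, not the code used in the study; the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values monotone in the raw ones.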
We repeated the analysis, this time additionally controlling for smoking status for LUAD and small cell lung cancer as well as HR status for IDC, breast invasive lobular carcinoma, and breast mixed ductal and lobular carcinoma.
Clinically Actionable Mutation Frequency Analysis
We defined samples with clinically actionable mutations as those with at least one level 1 to 3A mutation as defined in OncoKB (32). Solid tumor samples with MSI (MSI-high) were considered to have level 1 biomarkers. Similarly, solid tumor samples with a TMB score of >10 were considered to have level 1 biomarkers but were labeled as “Level 1 (TMB-H).” For each sample, we used the highest OncoKB level for actionable mutation comparisons. We ran exact binomial tests to compute 95% confidence intervals for the rates of clinically actionable mutations in each population.
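The per-sample actionability call can be sketched as follows. The function names and level strings are hypothetical; the ordering of levels (1, 2, 3A, 3B, 4) follows the OncoKB therapeutic levels of evidence, and the MSI-H and TMB rules are applied only to solid tumors, as described above.

```python
# Therapeutic levels from most to least actionable.
LEVEL_ORDER = ("1", "2", "3A", "3B", "4")
ACTIONABLE_LEVELS = {"1", "2", "3A"}


def highest_level(mutation_levels, msi_high=False, tmb=0.0, solid_tumor=True):
    """Return the highest level for a sample, or None if it has none.

    `mutation_levels` lists the level of each annotated mutation; entries
    that are not therapeutic levels (e.g. None) are ignored.
    """
    levels = [lvl for lvl in mutation_levels if lvl in LEVEL_ORDER]
    # MSI-H and TMB > 10 count as level 1 biomarkers in solid tumors.
    if solid_tumor and (msi_high or tmb > 10):
        levels.append("1")
    if not levels:
        return None
    return min(levels, key=LEVEL_ORDER.index)


def is_actionable(mutation_levels, msi_high=False, tmb=0.0, solid_tumor=True):
    """True if the sample's highest level is 1, 2, or 3A."""
    return highest_level(mutation_levels, msi_high, tmb, solid_tumor) in ACTIONABLE_LEVELS
```

A sample whose only annotated mutation is level 3B is not counted as actionable unless it is also MSI-H or has a TMB score above 10.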
Data Availability
The human sequence raw data generated in this study are protected and not publicly available due to patient privacy requirements but are available upon reasonable request from the corresponding author subject to institutional approvals. Tumor somatic mutations and associated clinical data for all patients in this study are available through AACR Project GENIE (v10.1 public cohort). Other data generated in this study are available within the article and its supplementary data files.
Authors’ Disclosures
M. Mehine reports grants from the Sigrid Jusélius Foundation during the conduct of the study. Y.L. Liu reports grants from Repare Therapeutics, GSK, and AstraZeneca outside the submitted work. A.R. Brannon reports other support from Johnson & Johnson outside the submitted work. P. Razavi reports grants and personal fees from Novartis, AstraZeneca, Inivata, and Epic Sciences, grants from Grail/Illumina, Guardant Health, Tempus, Invitae, and Biotheranostics, and personal fees from Natera, Daiichi Sankyo, and Pfizer outside the submitted work. H.A. Rizvi reports employment with AstraZeneca subsequent to the completion of this work. M.D. Hellmann reports grants and personal fees from Bristol Myers Squibb, personal fees from Achilles, Adagene, Adicet, Arcus, Blueprint, DaVolterra, Eli Lilly, Genentech/Roche, Genzyme/Sanofi, Janssen, Instil Bio, Mana Therapeutics, Merck, Mirati, Natera, Pact Pharma, Shattuck Labs, and Regeneron, personal fees and other support from AstraZeneca and Immunai, and other support from Factorial and Avail Bio during the conduct of the study, as well as a patent filed by Memorial Sloan Kettering related to the use of tumor mutational burden to predict response to immunotherapy (PCT/US2015/062208) pending and licensed to PGDx. D.B. Solit reports personal fees from Pfizer, Loxo/Lilly Oncology, Vividion Therapeutics, Scorpion Therapeutics, Fore Therapeutics, BridgeBio, and Fog Pharma outside the submitted work. C.L. Brown reports grants from the NCI during the conduct of the study and that the clinical cancer screening test used in this study, MSK-IMPACT, was developed at Memorial Sloan Kettering, the institution at which C.L. Brown is a full-time employee. C.L. Brown does not receive any direct or indirect financial benefit from the use of this test in this study or outside of this study. Z.K. Stadler reports other support from Genentech/Roche, Adverum, Gyroscope Therapeutics, Neurogene, Optos, Outlook Therapeutics, Regeneron, and REGENXBIO outside the submitted work. M.F. Berger reports personal fees from Eli Lilly, AstraZeneca, and PetDx outside the submitted work. No disclosures were reported by the other authors.
Authors’ Contributions
K. Arora: Conceptualization, formal analysis, writing–original draft. T.N. Tran: Formal analysis, writing–review and editing. Y. Kemel: Data curation, writing–review and editing. M. Mehine: Formal analysis, writing–review and editing. Y.L. Liu: Data curation, writing–review and editing. S. Nandakumar: Formal analysis, writing–review and editing. S.A. Smith: Formal analysis, writing–review and editing. A.R. Brannon: Formal analysis, writing–review and editing. I. Ostrovnaya: Formal analysis, writing–review and editing. K.H. Stopsack: Formal analysis, writing–review and editing. P. Razavi: Data curation, writing–review and editing. A. Safonov: Data curation, writing–review and editing. H.A. Rizvi: Data curation, writing–review and editing. M.D. Hellmann: Data curation, writing–review and editing. J. Vijai: Data curation, writing–review and editing. T.C. Reynolds: Data curation, writing–review and editing. J.A. Fagin: Investigation, writing–review and editing. J. Carrot-Zhang: Methodology, writing–review and editing. K. Offit: Supervision, writing–review and editing. D.B. Solit: Resources, writing–review and editing. M. Ladanyi: Resources, writing–review and editing. N. Schultz: Formal analysis, writing–review and editing. A. Zehir: Formal analysis, writing–review and editing. C.L. Brown: Data curation, writing–review and editing. Z.K. Stadler: Data curation, writing–review and editing. D. Chakravarty: Conceptualization, formal analysis, writing–original draft. C. Bandlamudi: Conceptualization, formal analysis, writing–original draft. M.F. Berger: Conceptualization, supervision, funding acquisition, writing–original draft.
Acknowledgments
We gratefully acknowledge members of the Marie Josée and Henry R. Kravis Center for Molecular Oncology, the Molecular Diagnostics Service in the Department of Pathology and Laboratory Medicine, and the Berger lab for their contributions. This work was supported by NIH awards P30 CA008748 and R01 CA227534 (to M.F. Berger) and the Sigrid Jusélius Foundation (to M. Mehine).
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Discovery Online (http://cancerdiscovery.aacrjournals.org/).