Abstract
Somatic mutations drive cancer development and are relevant to patient responses to treatment. Emerging evidence shows that variations in the somatic genome can be influenced by the germline genetic background. However, the mechanisms underlying these germline–somatic associations remain largely obscure. We hypothesized that germline variants can influence somatic mutations in a nearby cancer gene (“local impact”) or a set of recurrently mutated cancer genes across the genome (“global impact”) through their regulatory effect on gene expression. To test this hypothesis, tumor targeted sequencing data from 12,413 patients across 11 cancer types in the Dana-Farber Profile cohort were integrated with germline cancer gene expression quantitative trait loci (eQTL) from the Genotype-Tissue Expression Project. Variants that upregulate ATM expression were associated with a decreased risk of somatic ATM mutations across 8 cancer types. GLI2, WRN, and CBFB eQTL were associated with global tumor mutational burden of cancer genes in ovarian cancer, glioma, and esophagogastric carcinoma, respectively. An EPHA5 eQTL was associated with mutations in cancer genes specific to colorectal cancer, and eQTL related to expression of APC, WRN, GLI1, FANCA, and TP53 were associated with mutations in genes specific to endometrial cancer. These findings provide evidence that germline–somatic associations are mediated through expression of specific cancer genes, opening new avenues for research on the underlying biological processes.
Analysis of associations between the germline genetic background and somatic mutations in patients with cancer suggests that germline variants can influence local and global tumor mutations by altering expression of cancer-related genes.
Introduction
Cancer is a genetic disease driven by somatic events occurring in the genome over time. Identifying genes carrying driver mutations (cancer driver genes) and elucidating their roles in the related signaling pathways have become primary goals in cancer genomic research (1–4). Many of these driver genomic alterations have been found to be clinically actionable drug or therapeutic targets for precision medicine (5–8).
Different cancers are characterized by different patterns of somatic mutations (9, 10). Even patients with the same cancer may have substantial heterogeneity in the overall tumor mutational burden (TMB), mutation patterns characterized by mutational signatures, or the cancer genes and oncogenic signaling pathways altered (4, 11–14). These heterogeneities in the somatic mutational profile can lead to differential cancer progression, prognosis, and treatment response (15–17). Mounting evidence suggests that somatic variations in tumors can have a germline genetic basis (12, 18–23). This germline–somatic relationship has been established at different levels, from the impact of a single germline variant on somatic mutation rate of a cancer gene (20), to the associations between germline polygenic risk scores and somatic mutational signatures (23). However, the study of germline–somatic relationship in cancer is still at an early stage and the mechanisms responsible for these observed associations are still largely uncovered.
Germline variants may affect somatic mutations through gene expression (19, 22, 24). In the well-established example of the APOBEC mutational process, rs17000526-A allele in the APOBEC3B region is associated with higher expression of this gene, which contributes to somatic mutagenesis of APOBEC signatures in bladder tumor (19). Many underlying mechanisms may coexist, but an intuitive and interpretable hypothesis would be that the germline cancer gene expression quantitative trait loci (eQTL) alter the propensity of acquiring somatic mutations in those specific genes or globally through their regulatory effect on gene expression. Although prior studies included gene expression information in the analysis of germline–somatic associations, there is no systematic study focusing on both the local and global impact of eQTL on somatic mutations in cancer genes across multiple cancers. Many latent associations and mechanisms may thus have been missed.
Here, we performed a pan-cancer analysis of the germline genetic impacts on both the local and global tumor mutations, incorporating regulatory information of germline variants on gene expression. Specifically, we evaluated the associations between germline cancer gene eQTL and (i) somatic mutation status of those cancer genes or any hotspot mutation in those genes, (ii) tumor mutation counts (TMC) of all recurrently mutated cancer genes for a cancer type, and (iii) TMB of all targeted cancer genes from the OncoPanel sequencing platform across 11 cancer types in the Dana-Farber Profile cohort. Our results demonstrate evidence for germline–somatic associations that are potentially mediated through cancer gene expression and provide insights into the mechanisms of mutagenesis in somatic cells.
Materials and Methods
Study population
The Dana-Farber Profile, initiated in 2011, is a cohort study of unselected patients with cancer who presented at the Dana-Farber Cancer Institute, Brigham and Women's Hospital, or Boston Children's Hospital, received genomic profiling and provided written informed consent prior to inclusion in this study. Tumor specimens, mainly formalin-fixed, paraffin-embedded tissues, were retrieved from all consented patients for targeted sequencing. Comprehensive clinical and pathologic data were collected along with the genomic data (6, 25). The study protocol was approved by the Institutional Review Board (IRB) of Dana-Farber/Partners Cancer Care Office for the Protection of Research Subjects (11–104/17–000). Secondary analyses of previously collected data were approved by the Dana-Farber IRB (19–033/19–025). This study was conducted in accordance with recognized ethical guidelines (e.g., Declaration of Helsinki, CIOMS, Belmont Report, U.S. Common Rule).
Tumor targeted sequencing
A workflow of the full data generating and processing pipeline is present in Fig. 1. All collected tumor samples were sequenced on OncoPanel, a targeted sequencing platform designed for detecting somatic variations in a panel of actionable cancer genes. There are three versions of the panel targeting the exon and/or intron regions of 304, 326, and 447 genes, respectively; each patient in the cohort was sequenced on one of the panels (Supplementary Table S1). All targeted genes were previously identified oncogenes or tumor suppressor genes involved in cancer-related signaling pathways (26). Sequencing was performed using an Illumina HiSeq 2500 with 2 × 100 paired-end reads followed by somatic mutation calling using MuTect (for single-nucleotide variants; ref. 27) and Indelocator (for indels; http://www.broadinstitute.org/cancer/cga/indelocator) from reads aligned to the targeted genome regions with >50× reads (“on-target reads”). More details about the tumor sequencing pipeline can be found in prior studies (6, 26).
We collected somatic mutation data from the tumor sequences of 18,472 primary cancer samples spanning over 60 cancer types and subtypes. Some tumors exhibit microsatellite instability (MSI) with high mutational burden; the germline–somatic relationship for those hypermutable subtypes might be substantially different from the microsatellite stable (MSS) tumors. We thus further classified each sample as MSI or MSS using MSIDetect (28). Cancer types with >500 samples were selected; for each selected cancer, we removed those rare subtypes with <3 samples. The remaining 12,413 samples across 11 cancer types were included in the downstream analysis (Supplementary Table S2).
Germline imputation from tumor sequences
Details of inferring common germline variants from the OncoPanel tumor sequencing data are described elsewhere (25) and briefly summarized here. Tumor targeted sequencing generated both high-coverage “on-target reads” aligned to the targeted regions and low-coverage “off-target reads” aligned to the rest of the genome (Fig. 1). Common germline variants with minor allele frequency (MAF) > 1% in the European population were imputed from these tumor sequences (mainly relied on off-target reads) using linkage disequilibrium (LD) information with the 1000 Genomes Phase III release as the haplotype reference panel. Imputation accuracies from several algorithms designed for imputing germline variants from low coverage data were evaluated by comparing the imputed allele dosage to the gold standard germline data generated from genotyping array. The STITCH algorithm (29) yielded the highest overall accuracy and the resulting imputed germline data were used for the downstream analysis. The imputed variants were subsequently restricted to an imputation INFO score of greater than 0.4, which produced a mean imputation correlation of 0.86 between tumor imputed and germline SNP array variants (25).
Genetic ancestry was inferred by projecting the imputed germline genetic data into the genetic ancestry principal components using weights derived for European, African, and Asian populations from the 1000 Genomes Project reference data (30). We further restricted our analysis to samples with <10% inferred non-European ancestry.
Identifying recurrently mutated cancer genes and hotspot mutations
We identified recurrently mutated cancer genes, defined as genes with ≥5% carriers of missense mutations, for each selected cancer type from the somatic data. We included additional genes that were identified as highly significantly mutated or significantly mutated genes among known cancer genes for each selected cancer type from the TumorPortal (http://www.tumorportal.org/; ref. 31); genes from TumorPortal may not necessarily have mutation frequency ≥ 5% in our data. A total of 135 cancer genes and 342 gene-cancer pairs were identified, with the mutation frequency ranging from 0.0036 to 0.73 (Fig. 2A; Supplementary Table S3). Mutation status for each sample and gene is defined as whether this sample carries at least one functional mutation (frame_shift_del, frame_shift_ins, frameshift, initiator_codon, missense and splice_region, missense_mutation, nonsense_mutation, protein_altering, splice_site, start_lost, stop_lost, and translation_start_site) in this gene and is considered to capture the “local” tumor mutation (mutation in one cancer gene).
For each selected cancer type, we further identified specific mutations with ≥5% carriers in the somatic data as hotspot mutations. Seven of the 11 cancer types harbor at least one hotspot mutation. A total of 17 hotspot mutations and 25 mutation-cancer pairs were identified, with the mutation frequency ranging from 0.051 to 0.33 (Fig. 2B; Supplementary Table S4).
Quantifying TMB of all panel genes and TMC of recurrently mutated cancer genes
TMB is defined as the total number of missense mutations per megabase based on the targeted sequencing data of all panel genes (Fig. 2C). It captures the total mutations in all targeted cancer genes and is considered as a refined “global” mutational burden restricting to a set of cancer-related genes rather than the genome-wide mutational burden. In addition to TMB, we also calculated TMC for each sample, which is defined as the count of recurrently mutated cancer genes (specific to each cancer type) that harbor at least one missense mutation (Fig. 2D). Compared with TMB, TMC is a more refined measure of the mutational burden in likely driver genes for a cancer. Moreover, by counting the genes instead of the mutations, the TMC analysis would be less sensitive to hypermutable outliers compared with TMB analysis.
Identifying eQTL from the Genotype-Tissue Expression Project for all selected genes
We obtained the eQTL and gene expression association results in normal tissue for all selected genes from the meta-analyzed multi-tissue eQTL results using METASOFT (32) from the Genotype-Tissue Expression Project (GTEx) Analysis V8 release. We selected those genome-wide significant eQTL with P < 5 × 10−8 from any of the fixed effect (FE), random effect (RE), or Han and Eskin's random effect (RE2) models. Variants with MAF < 1% were further removed. A total of 28,486 eQTL for 114 genes with imputed germline data available were included in the analysis. To determine the number of effective tests in multiple testing corrections, we performed LD clumping on the final list of eQTL for each gene at various LD threshold (Supplementary Table S5) and used r2 = 0.3 in the final analysis (33).
Assessing the associations of cancer gene eQTL with TMB and TMC
We assessed the association between each selected cancer gene eQTL and TMB of all panel genes for each cancer by fitting a linear model adjusting for age, gender (if applicable), panel version, and tumor purity. MSI status was also adjusted as a covariate for the models of colorectal and endometrial cancer where a substantial proportion of the cases display hypermutability (34, 35). TMB was Winsorized to 98% within each cancer type to reduce the impact of potential outliers on the association results. The associations between cancer gene eQTL and TMC were evaluated for recurrently mutated cancer genes for each cancer type by fitting a negative binomial model with the same covariates as the TMB models. Sensitivity analysis was performed to assess the impacts of potential TMB or TMC outliers on the association results by varying the Winsorization thresholds and using standardized TMB. For TMC, we further evaluated the impacts of using count of missense mutations instead of count of mutated genes on the germline–somatic associations.
Assessing the associations of cancer gene eQTL with recurrently mutated cancer genes and hotspot mutations
The local impact of each cancer gene eQTL on the risk of having somatic mutations in that gene or a nearby hotspot mutation was assessed using logistic regression. These analyses further adjusted for TMB along with all the covariates included in the TMB or TMC models. Meta-analysis was performed to evaluate the impact of a cancer gene eQTL on the mutation status of one gene or mutation across cancers.
Data availability statement
The individual-level data used in this study are not publicly available due to patient privacy requirements but are available upon reasonable request from the corresponding author subject to institutional approvals. Tumor somatic data and clinical data for all patients in this study are available through AACR Project GENIE (Version 12.0-public). Summary data generated in this study are available within the article and its Supplementary Data files. The GTEx Analysis V8 release data used in this study are available on the GTEx Portal (https://gtexportal.org/home/datasets).
Results
Germline cancer gene eQTL influence global tumor mutations
We analyzed 28,486 eQTL for 114 cancer genes and assessed their associations with TMB of all cancer genes sequenced on the panel across cancers. We identified 22 significant eQTL–TMB associations representing 3 independent gene-cancer pairs that passed the Bonferroni correction threshold accounting for the number of effective tests (P < 3.45 × 10−6; Supplementary Tables S5 and S6). Table 1 summarizes the results for the most significant association at each locus. There exists heterogeneity in the effects of these eQTL on TMB across cancers (Supplementary Table S7). Sensitivity analysis on the impacts of potential outliers showed that the association of the GLI2 eQTL and TMB in ovarian cancer was sensitive to the Winsorization threshold (Supplementary Table S8). This association also became nonsignificant if we use standardized TMB as the outcome (beta = 0.26, P = 0.43) while the other two top associations remained nominally significant (beta = −2.33, P = 1.57 × 10−3 for rs139944315 (WRN) and TMB in glioma; beta = −0.23, P = 0.04 for rs11075646 (CBFB) and TMB in esophagogastric carcinoma).
. | eQTL . | Association results with TMB or TMC . | Association results with gene expression from GTExa . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cancer type . | Region . | Lead SNP . | Posb . | Effect allele . | Other allele . | EAF . | Beta . | SE . | P . | Gene . | Beta (FE) . | P (FE) . | Beta (RE) . | P (RE) . | P (RE2) . |
TMB of all cancer genes on OncoPanel | |||||||||||||||
Ovarian Cancer | 2q14.2 | rs1530578 | 121702128 | T | C | 0.01 | 17.61 | 3.02 | 1.26E-08 | GLI2 | −0.11 | 2.41E-16 | −0.11 | 2.00E-10 | 1.33E-16 |
Glioma | 8p12 | rs139944315 | 30332577 | T | A | 0.99 | −16.36 | 2.65 | 1.21E-09 | WRN | −0.30 | 6.35E-44 | −0.28 | 4.16E-20 | 6.79E-45 |
Esophagogastric Carcinoma | 16q22.1 | rs11075646 | 66969176 | C | G | 0.90 | −2.57 | 0.53 | 1.48E-06 | CBFB | 0.05 | 2.06E-12 | 0.05 | 1.34E-08 | 3.03E-12 |
TMC of recurrently mutated cancer genes | |||||||||||||||
Colorectal Cancer | 4q13.2 | rs10031417 | 66700879 | A | G | 0.55 | −0.18 | 0.04 | 2.04E-06 | EPHA5 | 0.03 | 5.63E-05 | 0.03 | 4.12E-02 | 6.36E-13 |
Endometrial Cancer | 5q22.2 | rs397768 | 112181576 | G | A | 0.38 | 0.24 | 0.05 | 1.43E-05 | APC | 0.01 | 7.91E-02 | 0.00 | 7.36E-01 | 3.77E-10 |
Endometrial Cancer | 8p12 | rs11782945 | 31082006 | G | A | 0.41 | 0.58 | 0.12 | 4.95E-07 | WRN | 0.06 | 1.06E-26 | 0.06 | 4.95E-21 | 3.14E-26 |
Endometrial Cancer | 12q13.3 | rs73115907 | 57754701 | G | A | 0.76 | −0.31 | 0.06 | 4.61E-07 | GLI1 | −0.02 | 2.75E-03 | −0.02 | 1.31E-01 | 2.99E-08 |
Endometrial Cancer | 16q24.3 | rs7201264 | 90036122 | C | G | 0.20 | 0.41 | 0.07 | 1.10E-08 | FANCA | −0.07 | 4.70E-18 | −0.09 | 1.32E-07 | 8.17E-41 |
Endometrial Cancer | 17p13.1 | rs17884306 | 7572101 | C | T | 0.94 | −0.60 | 0.13 | 4.36E-06 | TP53 | 0.09 | 2.41E-16 | 0.08 | 4.45E-12 | 8.15E-16 |
. | eQTL . | Association results with TMB or TMC . | Association results with gene expression from GTExa . | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cancer type . | Region . | Lead SNP . | Posb . | Effect allele . | Other allele . | EAF . | Beta . | SE . | P . | Gene . | Beta (FE) . | P (FE) . | Beta (RE) . | P (RE) . | P (RE2) . |
TMB of all cancer genes on OncoPanel | |||||||||||||||
Ovarian Cancer | 2q14.2 | rs1530578 | 121702128 | T | C | 0.01 | 17.61 | 3.02 | 1.26E-08 | GLI2 | −0.11 | 2.41E-16 | −0.11 | 2.00E-10 | 1.33E-16 |
Glioma | 8p12 | rs139944315 | 30332577 | T | A | 0.99 | −16.36 | 2.65 | 1.21E-09 | WRN | −0.30 | 6.35E-44 | −0.28 | 4.16E-20 | 6.79E-45 |
Esophagogastric Carcinoma | 16q22.1 | rs11075646 | 66969176 | C | G | 0.90 | −2.57 | 0.53 | 1.48E-06 | CBFB | 0.05 | 2.06E-12 | 0.05 | 1.34E-08 | 3.03E-12 |
TMC of recurrently mutated cancer genes | |||||||||||||||
Colorectal Cancer | 4q13.2 | rs10031417 | 66700879 | A | G | 0.55 | −0.18 | 0.04 | 2.04E-06 | EPHA5 | 0.03 | 5.63E-05 | 0.03 | 4.12E-02 | 6.36E-13 |
Endometrial Cancer | 5q22.2 | rs397768 | 112181576 | G | A | 0.38 | 0.24 | 0.05 | 1.43E-05 | APC | 0.01 | 7.91E-02 | 0.00 | 7.36E-01 | 3.77E-10 |
Endometrial Cancer | 8p12 | rs11782945 | 31082006 | G | A | 0.41 | 0.58 | 0.12 | 4.95E-07 | WRN | 0.06 | 1.06E-26 | 0.06 | 4.95E-21 | 3.14E-26 |
Endometrial Cancer | 12q13.3 | rs73115907 | 57754701 | G | A | 0.76 | −0.31 | 0.06 | 4.61E-07 | GLI1 | −0.02 | 2.75E-03 | −0.02 | 1.31E-01 | 2.99E-08 |
Endometrial Cancer | 16q24.3 | rs7201264 | 90036122 | C | G | 0.20 | 0.41 | 0.07 | 1.10E-08 | FANCA | −0.07 | 4.70E-18 | −0.09 | 1.32E-07 | 8.17E-41 |
Endometrial Cancer | 17p13.1 | rs17884306 | 7572101 | C | T | 0.94 | −0.60 | 0.13 | 4.36E-06 | TP53 | 0.09 | 2.41E-16 | 0.08 | 4.45E-12 | 8.15E-16 |
Abbreviation: EAF, effect allele frequency.
aMeta-analysis results of the associations between the eQTL and normalized gene expression levels across 49 tissues.
bPosition based on GRCh37/hg19.
To further investigate the relationship between the observed germline–somatic associations and gene expression, we compared our results with the association results between the identified top eQTL and expression level of the specific cancer genes in normal tissue in GTEx (Table 1; Fig. 3); we also performed statistical colocalization between TMB and gene expression for the identified top genes in Table 1 in individual GTEx tissues using the coloc R package (Supplementary Table S9 and Supplementary Fig. S1; ref. 36). The T allele of rs1530578 was associated with elevated TMB in ovarian cancer and reduced expression of GLI2 across tissues (Fig. 3A and D). The largest effect of rs1530578 on GLI2 expression was observed in ovary with beta = −0.55 and P = 3.93 × 10−5 (Fig. 3A), and the strongest colocalization signal for GLI2 expression and TMB was also in ovarian tissue (posterior probability that expression and TMB share a causal variant PPH4 = 0.66). rs139944315 was associated with TMB in glioma and expression of WRN across tissues in a consistent direction (Fig. 3B and E). While the largest effect of this variant on WRN expression was observed in subcutaneous adipose tissue, there was also an association in putamen of basal ganglia with beta = −0.51 for the T allele (P = 0.05; Fig. 3B). Colocalization signals for WRN and TMB were weak: PPH4 < 0.08 for every tissue. Finally, we found that the C allele of rs11075646 was associated with decreased TMB in esophagogastric carcinoma and slightly increased expression of CBFB across tissues (Fig. 3C and F). This variant had a nominally significant impact on CBFB expression in both gastroesophageal junction (beta = 0.10 and P = 0.02; Fig. 3C) and mucosa of esophagus (beta = 0.08 and P = 0.01) while the most significant effect was observed for thyroid tissue (Fig. 3F). The strongest colocalization signal was also for thyroid tissue (PPH4 = 0.34). None of these three top variants or variants in high LD (r2 > 0.1) with them has been linked to cancer incidence in genome-wide association studies (GWAS) from GWAS catalog (https://www.ebi.ac.uk/gwas/home; Supplementary Table S10; ref. 37).
We next assessed the impacts of cancer gene eQTL on TMC, which quantifies the mutational burden of a set of genes that are recurrently mutated in the specific cancer. There were 145 significant eQTL-TMC associations after Bonferroni correction (P < 1.73 × 10−5; Supplementary Table S11), representing six independent gene-cancer pairs (Table 1). Sensitivity analysis showed that all top TMC associations were robust to a wide range of Winsorization thresholds (Supplementary Table S12). Replacing count of mutated genes with count of mutations also yielded similar results compared with the main analysis (Supplementary Table S13). Given that all top eQTL-TMC associations were identified for colorectal and endometrial cancer, we further performed a stratified analysis by MSI status. There was no substantial deviation in the effect estimates for MSS or MSI subgroup from the main analysis though the subgroup results were less significant, especially for MSI samples, which was likely due to the reduced sample sizes (Supplementary Table S14). Finally, we compared these top eQTL-TMC associations to the previous eQTL–TMB results and found that all these top germline variants were associated with TMB in the corresponding cancers in consistent directions with TMC with nominal significance (Supplementary Table S15).
There exists substantial heterogeneity in the associations with gene expression level across tissues for many of the top variants in the TMC associations (Fig. 4). Two of the tissue-specific associations have both P < 0.05 and m-value > 0.8: rs10031417 and EPHA5 expression in sigmoid colon and rs7201264 and FANCA expression in uterus (Fig. 4A and E). The A allele of rs10031417 was associated with lower somatic mutational burden in recurrently mutated cancer genes in colorectal cancer and slightly higher expression of EPHA5 across tissues (Fig. 4A and G); this positive effect on EPHA5 expression was larger in sigmoid colon with beta = 0.17 and P = 1.11 × 10−3 (Fig. 4A). It is worth noting that a variant that is in LD with rs10031417 (rs13104357, r2 = 0.18) has also been reported to be associated with EPHA5 expression in colorectal tumor samples in The Cancer Genome Atlas (TCGA; ref. 38); the direction of this association in tumor was consistent with in normal tissue. The rs7201264-C allele was associated with both increased TMC in endometrial cancer and decreased FANCA expression across tissues (Fig. 4E and K); it had a specific significant impact on FANCA expression in uterus (Fig. 4E; beta = −0.28 for the C allele, P = 0.02). Evidence for colocalization between gene expression and TMC was modest for most of the TMC-associated genes in Table 1 (PPH4 < 0.30), with the exception of GLI1 and FANCA and endometrial cancer TMC (GLI1 max PPH4 = 0.72 in fibroblasts and FANCA max PPH4 = 0.61 in pituitary tissue; Supplementary Table S9 and Supplementary Fig. S1).
Variants in LD with rs7201264 have been linked to melanoma and a few other cancers, but not endometrial cancer (Supplementary Table S10). rs78378222, which is in LD with the top variant identified for TMC in endometrial cancer and TP53 expression (rs17884306, r2 = 0.21 with rs78378222), has been previously associated with the risk of uterine fibroids and several cancers in but not the risk of endometrial cancer specifically (39, 40).
Local impacts of germline eQTL on somatic mutations in cancer genes
None of the individual associations between somatic mutation status for recurrently mutated genes and their eQTL passed the Bonferroni correction threshold (P < 1.73 × 10−5). The most significant association observed was between a TSC2 eQTL and somatic TSC2 mutation status in endometrial cancer (beta = −1.81 for rs12918530-C allele, P = 1.56 × 10−4; Supplementary Table S16). Looking across all cancers, there was a significant (P < 6.91 × 10−5) association between an ATM eQTL (lead SNP: rs4753834 at 11q22.3) and somatic ATM mutations from a meta-analysis of 8 cancers (Fig. 5; Supplementary Table S17). The G allele of rs4753834 was associated with a lower risk of having somatic mutations in ATM (beta = −0.35, P = 3.43 × 10−5 across cancers from FE model) and increased expression of ATM in normal tissues (beta = 0.05, P = 1.03 × 10−20 across tissues from RE model). This variant also had specific effects on ATM expression in many tissues related the 8 cancers, including mammary tissue (beta = 0.06), sigmoid colon (beta = 0.09), hypothalamus (beta = 0.12), lung (beta = 0.07), and prostate (beta = 0.11), all with P < 0.05 and m-value > 0.9. Moreover, variants that are in LD with rs4753834 have also been associated with ATM expression in tumor samples of breast cancer (rs673281, r2 = 0.21, beta = −0.08 for the T allele, P = 1.98 × 10−4) and glioma (rs1003623, r2 = 0.21, beta = −0.11 for the T allele, P = 4.56 × 10−4; ref. 38); the directions were also consistent with those in normal tissues. We additionally tested the associations of ATM eQTL and TMB or TMC of cancer genes and found that variants in LD with rs4753834 (lead SNP: rs672964, r2 = 0.21 with rs4753834) were associated with TMB (beta = −0.69 for rs672964-C, P = 2.97 × 10−5) and TMC (beta = −0.07 for rs672964-C, P = 0.02) in non–small cell lung cancer in the consistent direction with ATM mutation status. No association with cancer risk was found for rs4753834 in GWAS Catalog. A few variants that are in LD with rs4753834 have previously been linked to melanoma, but not any of these 8 cancers (Supplementary Table S10).
We also identified nominal associations between eQTL for cancer genes identified in the global tumor mutation analysis with the somatic mutation status of that gene in the corresponding cancer. We found that rs1897693 (r2 = 0.42 with rs10031417) was associated with both the expression of EPHA5 in normal tissues (beta = 0.03 for the C allele, P = 0.03 across tissues from RE model) and the somatic mutation status of EPHA5 in colorectal cancer (beta = −0.66 for the C allele, P = 0.01). Another variant rs55671402 was associated with FANCA expression in normal tissues (beta = −0.13 for the C allele, P = 9.67 × 10−13 across tissues from RE model; beta = −0.54 for the C allele, m-value = 0.98, P = 1.35 × 10−3 in uterus) and somatic mutations in FANCA in endometrial tumors (beta = −1.23 for the C allele, P = 8.61 × 10−3).
We further assessed the impacts of eQTL for a cancer gene on each identified hotspot mutation in that gene. None of the associations passed the Bonferroni correction threshold (P < 3.40 × 10−4) with the most significant association observed for rs1867930 with p.S249C in FGFR3 in bladder cancer (beta = 0.60 for the G allele, P = 3.54 × 10−3; Supplementary Table S18). Only one nominally significant (P < 0.05) association from the meta-analysis across cancers was found for rs11047823 with p.G12D in KRAS across colorectal cancer, endometrial cancer, non–small cell lung cancer, and pancreatic cancer (beta = 0.24 for the G allele, P = 0.01 across cancers from the FE model), though it still did not pass the Bonferroni correction threshold for significance (P < 5 × 10−3).
Discussion
In this study, we evaluated the influence of germline variants that are associated with cancer gene expression on somatic mutations in specific cancer genes across 11 cancer types. Our analysis revealed novel associations of germline eQTL with local mutation status of a single cancer gene or the global mutational burden. These findings provide preliminary evidence for the hypothesis that germline variants can influence local and global tumor mutations by altering the expression level of specific cancer genes.
Nevertheless, there are also other possible scenarios that can yield the same results (Fig. 6). First, given that there exists a causal impact of eQTL on somatic mutations, we still cannot conclude that this is only mediated by the transcript abundance of the specific eQTL gene (Fig. 6A). Finding an eQTL signal in the cancer-related tissue can provide a further support for mediation via expression. Second, we are studying somatic mutations in developed tumor (S’) rather than in normal or precancerous tissue (S; Fig. 6A). S’ can serve as a proxy for S, though it was measured after tumorigenesis and might be further influenced by other factors such as the tumor microenvironment (41). Here, we are studying mutations in cancer genes that have been identified as potential drivers for carcinogenesis. Even if some mutations in those genes occurred after cancer initiation, our results could still inform us of the role of germline variants in inducing somatic mutations during cancer progression. Finally, consider the three possible scenarios in Fig. 6B: under the situation that the germline variants only influence cancer diagnosis through other pathways and there is no causal effect on somatic mutations, we may still observe this germline–somatic association among patients with cancer due to collider bias. We are unable to distinguish between these three scenarios based on our data, but we can leverage information from other sources (e.g., association results of the germline variants with cancer incidence from GWAS) to weigh these possible scenarios for each identified association.
The germline–somatic associations identified here were consistent with prior evidence, and many may be related to the biological mechanisms that underlie patients’ response to immunotherapy. Among all the identified eQTL genes, APC, ATM, CBFB, and TP53 have been predicted as pan-cancer tumor suppressor genes across 33 cancer types in TCGA (1). We observed that the germline variants associated with reduced expression of these tumor suppressor genes were associated with increased tumor mutations, except for APC (Table 1; Figs. 3,4–5). The APC protein plays an important role in the Wnt signaling pathway (42) and interacts with E-cadherin, which regulates cell adhesion (43). Mutations that inactivate APC lead to disruption of β-catenin degradation, resulting in its translocation into the nucleus and activation of the transcription of multiple genes, which triggers cancer development, including endometrial carcinogenesis (44). Active β-catenin signaling has been linked to resistance to anti–programmed death-ligand 1 (PD-L1)/anti–CTLA-4 monoclonal antibody therapy in melanoma (45). A recent study found that germline pathogenic variants in APC are associated with elevated TMB (46). Intuitively, we would assume a variant that downregulates the expression of a tumor suppressor gene to be associated with elevated risk of cancer and somatic mutational burden, but this assumption might be oversimplified as the oncogenic or tumor suppressive effect of a gene on carcinogenesis and on somatic mutational burden would depend on the signaling pathway that the gene involved in and may vary substantially across cancer types (47). In our study, the major allele of rs397768 slightly downregulates APC expression across tissues. If this indicates activation of β-catenin signaling in endometrial carcinogenesis, then it should be associated with resistance to immunotherapy and reduced tumor mutations as we observed (Table 1; Fig. 4). However, this interpretation depends upon many variable components involved in this complex biological process; further study is needed to elucidate the molecular mechanisms underlying these associations.
ATM germline and somatic mutations have been linked to multiple cancers. Activated ATM protein kinase phosphorylates a few key proteins, which activates DNA damage checkpoint, leading to its main tumor suppressive effect of cell-cycle arrest and apoptosis (48). A study of pathogenic germline variants in cancer identified two-hit events for ATM where a germline variant in ATM is coupled with a somatic mutation in the other copy of the gene in multiple cancers (49). They also found that ATM pathogenic variant carriers had lower ATM expression, which is in line with our finding that the minor allele of rs4753834 is associated with lower expression of ATM but higher risk of having somatic mutation in the gene (Fig. 5). Recent studies also reported that ATM mutations were associated with improved response to immune checkpoint blockade therapy (50, 51). We observed this inverse relationship of ATM expression with both somatic ATM mutations across cancers and TMB in non–small cell lung cancer, which may support the potential role of ATM as a therapeutic target for promoting the response to immunotherapy.
TP53, which encodes protein p53, is one of the most frequently mutated genes in cancer. Genetic alterations in the p53 stress response pathway can affect the tumor suppressive role of TP53 and promote tumorigenesis (52). Results from a recent study demonstrated evidence for the interactive effects of a germline cancer risk variant, rs78378222, and somatic mutation status of TP53 on cancer risk, prognosis, and drug responses (53). The C allele of rs78378222 has been linked to lower expression level of wild-type TP53 in both normal tissue and tumor, which in turn reduce p53 cellular activity and lead to poorer overall survival of patients. In our analysis, we found that the minor allele of rs17884306, which is correlated with the C allele of rs78378222, was associated with higher TMC and lower TP53 expression (Table 1; Fig. 4). One study highlighted the predictive value of somatic TP53 mutations for benefit from anti–PD-1/PD-L1 immunotherapy in lung cancer, which may also be related to our findings (54).
Increased expression of GLI2 in the hedgehog signaling pathway has been found to induce PD-L1 expression in gastric cancer and promote tumor resistance to immunotherapy (55). We identified a germline eQTL at 2q14.2 that upregulates GLI2 and is associated with lower TMB in ovarian cancer; nominally significant association was also found in esophagogastric carcinoma in the same directions, which is in line with previous study and may explain mechanisms (Table 1; Fig. 3; Supplementary Table S7).
Reduced EPHA5 expression has been linked to lymph node metastasis, advanced TNM stage, and poor survival outcome in colorectal cancer, supporting its tumor suppressive role in this cancer (56). Recent work showed that having somatic EPHA5 mutations is positively associated with TMB and response to immune checkpoint inhibitor therapy in lung cancer (57). We also identified consistent associations of an EPHA5 eQTL at 4q13.2 with both somatic EPHA5 mutations and the global tumor mutations in colorectal cancer. This eQTL influences EPHA5 expression in colorectal cancer and normal colon tissue (Table 1; Fig. 4); the allele that was associated with reduced expression was also associated with increased somatic mutations.
Our study has several limitations. First, as mentioned above, we cannot easily distinguish between several possible scenarios of the causal relationships that may be consistent with the observed associations between germline eQTL and tumor mutations. We suggest future studies to further investigate these associations in normal tissue or precancerous lesions and incorporating haplotype-level information. Experimental validation is also necessary to confirm the putative mechanisms through gene expression for the identified associations. Second, the use of germline data imputed from off-target reads in tumor sequencing provides only a probabilistic estimate of the imputed variant. It would still be important to validate these findings using direct germline genotyping. Future studies would also benefit from having germline, gene expression, and somatic data all from the same large cohort. Third, our analysis focused on somatic mutations in the tumor, but we only included eQTL identified from normal tissue, which may miss tumor-specific eQTL effects (19). However, using normal tissue eQTL as the genetic instrument is more consistent with our hypothesis that eQTL alter gene expression in normal tissue, which contributes to somatic mutagenesis and tumor initiation. Where available, we also cross-referenced the eQTL results to those in corresponding tumor tissue and found the results had consistent direction with those in normal tissue. Finally, for somatic mutations, we only focused on missense and a few other functional mutations; future studies can further investigate the germline impact on somatic copy number alteration or structural variation through gene regulation. For germline variants, it would also be important to systematically assess the impact of rare pathogenic variants on gene expression and tumor mutations in future studies.
Our study focused only on those genome-wide significant variants from multi-tissue eQTL results, thus had more power to detect specific germline–somatic associations compared with a standard single-tissue colocalization analysis given that the eQTL effects tend to be correlated across tissues (58). Moreover, this approach makes less assumption about the true causal tissues, which we often do not know, for a certain cancer and gene association. Colocalization signals between global tumor mutation and gene expression were observed for some of the identified top genes, e.g., TMB in ovarian cancer and GLI2 expression in ovary, but these tissue-specific analyses using GTEx eQTL results tend to be underpowered and the results should be interpreted with caution (59). We also used very stringent Bonferroni thresholds to reduce false positives. Given that the tests are not independent across outcomes, we used per outcome Bonferroni thresholds as the results would not be substantially different from using a global threshold counting all effective tests performed across outcomes. Even using a Bonferroni threshold counting all tests (not independent) performed in the study (P = 2.37 × 10−6), most of the identified associations remain significant, except for the APC eQTL and TMC association in endometrial cancer (P = 1.43 × 10−5), TP53 eQTL and TMC association in endometrial cancer (P = 4.36 × 10−6), and ATM eQTL and ATM somatic mutation association across cancers (P = 3.43 × 10−5), all of which are still close to the threshold (Supplementary Table S19).
In conclusion, we systematically investigated the impacts of 28,486 germline cancer gene eQTL on somatic mutations in 114 cancer genes among 12,413 patients across 11 cancer types. Our results indicate that germline variants that regulate the expression of cancer genes also influence local and global tumor mutations. These findings provide further evidence for the important role of gene expression regulation in germline–somatic associations and open avenues for future research on the molecular mechanisms underlying these associations that confer cancer risk and sensitize cancer to immunotherapy.
Authors' Disclosures
P. Kraft reports grants from NIH during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
Y. Liu: Conceptualization, formal analysis, investigation, visualization, methodology, writing–original draft, writing–review and editing. A. Gusev: Conceptualization, resources, data curation, supervision, funding acquisition, methodology, writing–review and editing. P. Kraft: Conceptualization, resources, supervision, funding acquisition, methodology, writing–review and editing.
Acknowledgments
This work was supported by NCI grants R01CA227237 and R01CA244569 (to A. Gusev), and R01CA260352 (to P. Kraft), Phi Beta Psi Sorority, and Emerson Collective.
The GTEx was supported by the Common Fund of the Office of the Director of the NIH, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 08/04/2021.
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).