Abstract
Tumor mutational burden (TMB) is an emerging biomarker of response to immunotherapy in solid tumors. However, the extent to which variation in TMB between patients is attributable to germline genetic variation remains elusive. Here, using 7,004 unrelated patients of European descent across 33 cancer types from The Cancer Genome Atlas, we show that pan-cancer TMB is polygenic with approximately 13% of its variation explained by approximately 1.1 million common variants altogether. We identify germline variants that affect TMB in stomach adenocarcinoma through altering the expression levels of BAG5 and KLC1. Further analyses provide evidence that TMB is genetically associated with complex traits and diseases, such as smoking, rheumatoid arthritis, height, and cancers, and some of the associations are likely causal. Overall, these results provide new insights into the genetic basis of somatic mutations in tumors and may inform future efforts to use genetic variants to stratify patients for immunotherapy.
This study provides evidence for a polygenic architecture of tumor mutational burden and opens an avenue for the use of whole-genome germline genetic variations to stratify patients with cancer for immunotherapy.
Introduction
Somatic mutations, in particular those that confer selective growth advantage to cells, contribute to cancer development (1). Patterns and functions of somatic mutations in human cancer genome have been extensively analyzed in a wide variety of cancers, leading to numerous biological insights (2–4). Particularly, an important somatic event termed tumor mutational burden (TMB), determined by the total number of nonsynonymous somatic mutations in a cancer genome, has been shown to have great practical relevance. Notably, lines of evidence suggested that TMB is a predictive biomarker to identify patients with cancer who are more likely to have response to immune checkpoint inhibitor (ICI) therapies (5–9).
Several studies have revealed the association of germline genetic variants with somatic events in tumors, including mutational signatures (10–12), and somatic mutations in specific cancer-related genes (13, 14). In addition, Zhu and colleagues uncovered a negative association between germline cancer risk alleles and somatic mutation burden in breast cancer (15). Robles-Espinoza and colleagues found that germline MC1R variants were associated with somatic mutation burden of melanoma (16). Despite these progresses, our understanding of the germline genetics of TMB is still largely limited. A key missing piece is the elucidation of the genetic architecture of TMB (e.g., the overall contribution of common genetic variants to TMB and the distribution of effect sizes of the common variants) and the genetic and causal relationships of complex traits (including diseases) with TMB. Answering these questions will be instrumental for guiding future genome-wide association studies (GWAS) for TMB, identifying risk factors for TMB, and assessing the potential clinical utility of the genetic characteristics of TMB in ICI therapies.
Here, we study the genetic architecture of TMB from a pan-cancer analysis of 7,004 patients of European ancestry across 33 cancer types in The Cancer Genome Atlas (TCGA). Although there are likely differences in the genetic architecture of TMB between the cancer types, the sample sizes in TCGA are insufficient for methods that use genome-wide common variants to depict the genetic architecture of TMB for individual cancer types. Hence, we focus this study on pan-cancer TMB while acknowledging that there is likely genetic heterogeneity in TMB between cancer types (see below for more discussion). We estimate that approximately 13% of variance in pan-cancer TMB can be explained by all common genetic variants together. We identify germline loci associated with TMB in specific cancer types, as well as the underlying putative causal genes that drive the associations. Moreover, our findings show a shared genetic component between pan-cancer TMB and complex traits (including diseases), such as tobacco smoking, height, rheumatoid arthritis, and cancers. We further investigate the causal relationships between TMB and complex traits by Mendelian randomization (MR) analyses, which provide evidence for putative causal effects of tobacco smoking on pan-cancer TMB.
Materials and Methods
TCGA cohort
TCGA included both tumor and matched normal biospecimens from 10,224 human samples with written informed consent under authorization of local institutional review boards (IRB; https://cancergenome.nih.gov/abouttcga/policies/informedconsent). We excluded 344 highly mutated samples likely reflecting technical artefacts as detected by a previous study (4). Samples that were marked by the Analysis Working Group on the basis of pathology were also discarded, but those with “RNA degradation” flag were retained because no RNA data were used in this study (4). All samples in public TCGA database have been collected and utilized following strict human subjects protection guidelines, written informed consent, and IRB review of protocols (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/history/policies). Our study was approved by the Ethics Committees of The University of Queensland (Brisbane, Queensland, Australia, ID: 2011001173), Wenzhou Medical University (Wenzhou, Zhejiang, P.R. China, ID: 2019094), and Westlake University (Hangzhou, Zhejiang, P.R. China, ID: 20200722YJ001).
Quantification of tumor mutational burden
TMB is defined as the total number of nonsynonymous variants (including single-nucleotide variants and indels) per Mb of an exome examined on the basis of MC3 calls from PanCanAtlas (mc3.v0.2.8.PUBLIC.maf; https://www.synapse.org/#!Synapse:syn7214402/files/).
SNP array data
We used genotype data of 522,606 SNPs derived from Affymetrix 6.0 SNP arrays. Birdseed genotype files (n = 11,459, across 33 cancer types) were downloaded from the Genome Data Commons legacy (GRCh37/hg19) archive (https://portal.gdc.cancer.gov/legacy-archive). Birdseed2vcf (https://github.com/ding-lab/birdseed2vcf) was used to transform birdseed genotype files into individual VCF files, which were then merged into a single combined VCF file. We chose one single file according to the following criteria, if 1 patient had multiple aliquot barcodes, as described in detail elsewhere (17): (i) sample type: blood > solid (sample type code, 10 > 11), (ii) molecular type of analyte: D analytes (native DNA) were preferred over G, W, or X (whole-genome amplified); (iii) order of portion: portion with a higher number was selected, (iv) order of plate: plate with a higher number was selected, and (v) sequencing or analysis center: center with a higher number was selected. Finally, a total of 10,406 nonredundant samples for 33 cancer types were obtained for downstream analysis.
Principal component analysis
To remove ancestral outliers from TCGA sample, we performed a principal component analysis (PCA) in a combined genotype dataset of TCGA and 1000 Genomes Project (1000GP, RRID:SCR_008801; ref. 18). Only the autosomal SNPs with missingness rate < 5% and minor allele frequency (MAF) > 1% and individuals with missingness rate < 5% (446,018 SNPs on 12,909 individuals in total) were included in the PCA. After the PCA, we removed TCGA individuals whose principal component 1 (PC1) or PC2 deviated more than six SDs from the mean of the corresponding PC of the 1000GP individuals of European ancestry. Finally, a total of 8,179 TCGA individuals of European ancestry were retained for further analysis.
Genotype imputation
We imputed a set of cleaned genotype data consisting of 423,239 autosomal SNPs (missingness rate < 5%, MAF > 1%, and Hardy Weinberg equilibrium P > 10−6) on 8,179 individuals of European ancestry to the Haplotype Reference Consortium (19) using the Sanger Imputation Server (https://imputation.sanger.ac.uk/; ref. 20). The genotype posterior probabilities from imputation were converted to hard-call genotypes using PLINK2 (–hard-call-thresh 0.1) (21). We filtered out SNPs with missingness rate > 5%, MAF < 1%, Hardy Weinberg equilibrium P < 10−6, or imputation INFO score < 0.3. Note that because the sample size of TCGA cohort was too small to analyze rare variants, we included only the common variants (i.e., MAF > 1%) in this study.
Genome-wide association analysis
We first removed individuals with missing TMB or covariate information (e.g., age, sex, or tumor purity). We then exploited genome-wide complex trait analysis (GCTA; ref. 22) to compute the genetic relationships between all pairs of the remaining subjects using variants in common with those from HapMap 3 (referred to as HM3 SNPs hereafter) and excluded one of each pair of individuals with estimated genetic relatedness > 0.05, resulting in a set of 7,004 unrelated individuals. We regressed TMB against age and standardized the residuals to z-scores within each sex group of each cancer type. For pan-cancer TMB, we first combined the standardized residuals of all the cancer types and then applied a rank-based inverse normal transformation (RINT) to the combined residuals. We also applied RINT to TMB in each cancer type.
For the imputed genotype data, we performed additional quality control steps in the unrelated individuals: excluding variants with missingness rate > 5%, MAF < 1%, or Hardy Weinberg equilibrium P < 10−6 and samples with missingness rate > 5%. Finally, a total of 7,567,484 common variants, with 1,139,665 in common with HM3 SNPs, and 7,004 unrelated individuals were used for the analyses below.
Association tests were performed for pan-cancer TMB and TMB of each cancer type based on a linear regression model implemented in fastGWA (23). We included the first 20 PCs and tumor purity as covariates in the regression model. Variants with P < 5 × 10−8 were regarded as genome-wide significant.
SNP-based heritability estimation
We used 1,139,665 HM3 SNPs and 7,004 unrelated samples to estimate the SNP-based heritability (|h_{SNP}^2$|) for pan-cancer TMB. We opted to use the HM3 SNPs because they are a trimmed set of variants optimized to capture common genetic variation of the human genome (24), and to improve the comparability of this study with previous |h_{SNP}^2$| estimation analyses. We used the following methods for |h_{SNP}^2$| estimation.
Genomic relatedness matrix-restricted maximum likelihood analysis
Genomic relatedness matrix-restricted maximum likelihood (GREML; also known as single-component GERML or GREML-SC) is a mixed linear model method that leverages a genetic relatedness matrix (GRM) to estimate the proportion of phenotypic variance in a trait that can be explained by all SNPs in unrelated individuals. GREML is implemented in the GCTA software package (22). Details of the GREML methodology have been described elsewhere (25). We built a GRM of the 7,004 unrelated individuals using the HM3 SNPs and then estimated |\hat{h}_{SNP}^2$| with the first 20 PCs and tumor purity fitted as covariates in the GREML model.
MAF-stratified GREML analysis
It has been shown in previous studies that |\hat{h}_{SNP}^2$| obtained from GREML-SC can be biased if MAF distribution of causal variants is different from that of SNPs used in the analysis (26, 27). MAF-stratified GREML (GREML-MS) was thus developed to address this gap by stratifying SNPs by MAF into multiple GRMs and fitting the GRMs jointly in a model. We stratified the HM3 SNPs used into five MAF bins (i.e., 0.01 < MAF < 0.1, 0.1 < MAF < 0.2, 0.2 < MAF < 0.3, 0.3 < MAF < 0.4, and 0.4 < MAF < 0.5) and generated the corresponding GRMs. GCTA was then used to perform a GREML-MS analysis with adjustment for the first 20 PCs and tumor purity to estimate |\hat{h}_{SNP}^2$| for pan-cancer TMB.
Linkage disequilibrium and MAF-stratified GREML analysis
Linkage disequilibrium and MAF-stratified GREML (GREML-LDMS) is another variant of the GREML method developed to account for nonrandom distribution of causal variants with respect to both MAF and linkage disequilibrium (LD; refs. 27–29). In this analysis, each of the five MAF bins in the GREML-MS analysis above was split into a low and a high LD bin based on LD scores of the SNPs, resulting in 10 MAF and LD bins in total. We then ran GCTA to conduct a GREML-LDMS analysis (27) for pan-cancer TMB with the first 20 PCs and tumor purity included as covariates.
Bayesian mixed linear model with an S parameter analysis
The Bayesian mixed linear model with an S parameter method (termed BayesS), implemented in the GCTB software (http://cnsgenomics.com/software/gctb/), was employed to estimate three genetic architecture parameters simultaneously, that is, |h_{SNP}^2$|, polygenicity (proportion of SNPs with nonzero effects), and relationship between SNP effect size and MAF (30). We performed a BayesS analysis for pan-cancer TMB with a correction for the first 20 PCs and tumor purity.
LD score regression analysis
LD score regression (LDSC) is a summary data–based method that estimates |h_{SNP}^2$| by regressing GWAS |{\chi ^2}$| statistics of all SNPs on their LD scores (https://github.com/bulik/ldsc; ref. 31). LDSC is computationally more efficient than the conventional |h_{SNP}^2$| estimation methods that rely on individual-level data, especially for datasets with large sample sizes. We conducted an LDSC analysis using the GWAS summary statistics for pan-cancer TMB and LD scores provided by the LDSC package (computed from the 1000GP individuals of European ancestry based on HM3 SNPs). We excluded variants in the MHC region due to the complexity of this region.
Genetic correlation estimation using bivariate LDSC
Similar to the estimation of |h_{SNP}^2$| by LDSC for a single trait, bivariate LDSC estimates the genetic covariance by regressing the product of z-scores for two traits on LD score across SNPs (32). We used the bivariate LDSC to estimate genetic correlation (rg) between pan-cancer TMB and a complex trait. GWAS summary data for 46 complex traits (including diseases) were collected from published studies (Supplementary Table S3).
Polygenic risk score analysis
SBayesR is a summary data–based Bayesian multiple regression method, which was developed mainly for polygenic risk score (PRS) analysis and has been shown to improve prediction accuracy over other widely used PRS approaches (21, 33, 34). SBayesR is available in the GCTB software package (30). We ran SBayesR to estimate the joint effects of all the HM3 SNPs using GWAS summary data for a complex trait and LD information from a reference sample with individual-level genotypes. The MCMC chain was performed with 50,000 iterations, 20,000 burn in, and other default parameters. The shrinkage of the SNP effects toward zero comes from two components, one from a probability of a point mass at zero and the other from the prior distribution (a mixture of normal distributions centered at zero) for the nonnull effects. We used a set of 50,000 unrelated individuals of European ancestry randomly selected from the UK Biobank (UKB) as the LD reference for the UKB GWAS summary data and another set of 50,000 unrelated individuals of European ancestry sampled from the GERA (Resource for Genetic Epidemiology Research on Aging) cohort (35) as the LD reference for the other GWAS summary datasets. Similar as above, SNPs in the MHC regions were excluded.
After obtaining the SBayesR estimates of SNP effects, we computed the PRS of the trait in TCGA cohort using the following equation:
where m is the number of all the HM3 SNPs without selection, |\hat{\beta }$| is the estimated effect of the |i$|th SNP from SBayesR, and |{x_i}$|, coded as 0, 1, or 2, is the number of effect alleles of the |i$|th SNP. We then correlated the PRS of each of the 46 traits with pan-cancer TMB, as well as TMB of individual cancer types.
MR analysis
Generalized summary data–based MR (GSMR) is an MR method that uses multiple genetic variants as instruments to infer causal relationship between two phenotypes using GWAS summary statistics (36). Here, we employed GSMR to explore the causal associations between 11 traits, which showed nominally significant associations with pan-cancer TMB in either the PRS or rg analysis, and pan-cancer TMB. Following Zhu and colleagues' study (36), we selected quasi-independent SNPs (LD r2 < 0.05) with P < 5 × 10−8 as genetic instruments and only analyzed traits with at least 10 genetic instruments. We ran the GSMR analysis with the heterogeneity in dependent instrument (HEIDI)-outlier filtering step (default setting) to exclude genetic instruments that are associated with both the exposure and outcome phenotypes because of horizontal pleiotropy. We also repeated the MR analyses using the inverse-variance weighted approach (MR-IVW) implemented in the R package “MR-base” (37). Similar as in the GSMR analyses, quasi-independent SNPs associated with an exposure with P < 5 × 10−8 were chosen as instruments, which were then pruned for LD following the default settings in MR-base.
Summary data–based MR analysis
Summary data–based MR (SMR) is a method that integrates summary statistics from a GWAS for a trait with those from an expression quantitative trait locus (eQTL) study to identify genes whose expression levels are the mediators of SNP effects on the trait (38). We used SMR to identify putative causal genes for TMB in breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma. We used cis-eQTL summary data from the eQTLGen consortium, an eQTL meta-analysis of 31,684 individuals in blood (39). We included in the SMR analysis only the genes with at least a cis-eQTL with P < 5 × 10−8 (the default setting in SMR) and excluded genes in the MHC regions, resulting in 15,487 genes in total. We used a genome-wide significance threshold of P < 3.23 × 10−6 (i.e., 0.05/15,487) to claim significant SMR associations and the HEIDI test to distinguish causality or vertical pleiotropy from linkage (PHEIDI > 0.05).
Results
The polygenic architecture of pan-cancer TMB
We defined TMB as the total number of nonsynonymous variants per Mb of an exome examined (Materials and Methods). We first estimated the proportion of variance in pan-cancer TMB explained by all common SNPs (i.e., |h_{SNP}^2$|) in 7,004 unrelated European patients across 33 cancer types in TCGA (hereafter termed pan-cancer TMB; Supplementary Figs. S1 and S2). The TMB phenotype was adjusted for covariates, such as age, gender, and cancer type, and standardized to z-score by RINT (Materials and Methods). We employed five commonly used |h_{SNP}^2$| estimation methods, with additional covariates fitted to control for the effects due to population stratification and tumor purity (Materials and Methods). The five methods used were: (i) GREML-SC analysis (25), (ii) GREML-MS (five GRMs; ref. 28), (iii) GREML-LDMS (10 GRMs; ref. 28), (iv) BayesS (30), and (v) LDSC (31). GREML-MS and GREML-LDMS can mitigate biases caused by nonrandom distribution of causal variants with respect to MAF and LD (27–29, 40), and hence were included in this analysis. Unless stated otherwise, all the analyses in this study were performed on the basis of approximately 1.1 million imputed HM3 SNPs with genome-wide coverage and MAF > 0.01 (Materials and Methods). There were no major differences between the results from the five methods (Table 1). All the estimates of |h_{SNP}^2\ $|were statistically significant with a mean of 12.8% across methods (Table 1). The estimates of |h_{SNP}^2$| remained largely unchanged when the TMB phenotype was not corrected for age (Supplementary Table S1). Analyses using TMB z-score without RINT showed a substantially smaller |\hat{h}_{SNP}^2$| on average (Supplementary Figs. S3A–S3G and S4; Supplementary Table S2), owing to the influence of a small number of outliers on the phenotypic variance as demonstrated by simulation (Supplementary Fig. S5A and S5B), suggesting that RINT was necessary. We, therefore, performed the following analyses using RINT TMB. The BayesS method is capable of simultaneously estimating three genetic architecture parameters, that is, |\hat{h}_{SNP}^2$|, polygenicity (proportion of SNPs with nonzero effects), and relationship between SNP effect size and MAF (also known as the S parameter). The polygenicity estimate of pan-cancer TMB from BayesS was approximately 1% [95% confidence interval (CI), 0.84%–1.16%], meaning that approximately 10,000 (95% CI, 9,573–13,220) common SNPs with nonzero effects on pan-cancer TMB. These results indicate that like many other complex human traits, such as height (25), pan-cancer TMB is affected by a large number of genetic variants each with a small effect, consistent with a polygenic architecture.
Method . | |{\hat{\bi h}\bf ^2}_{{\rm{SNP}}}$| . | SE . | P . |
---|---|---|---|
GREML-SC | 0.129 | 0.0468 | 0.0058 |
GREML-MS | 0.125 | 0.0491 | 0.0100 |
GREML-LDMS | 0.132 | 0.0580 | 0.0229 |
BayesS | 0.0987 | 0.0383 | 0.0098 |
LDSC | 0.153 | 0.0651 | 0.0180 |
Method . | |{\hat{\bi h}\bf ^2}_{{\rm{SNP}}}$| . | SE . | P . |
---|---|---|---|
GREML-SC | 0.129 | 0.0468 | 0.0058 |
GREML-MS | 0.125 | 0.0491 | 0.0100 |
GREML-LDMS | 0.132 | 0.0580 | 0.0229 |
BayesS | 0.0987 | 0.0383 | 0.0098 |
LDSC | 0.153 | 0.0651 | 0.0180 |
Identifying germline variants and genes associated with TMB
While realizing that the current sample size might not be sufficient for GWAS discovery for a polygenic trait, we performed a genome-wide association (GWA) analysis for pan-cancer TMB (Materials and Methods). No variant reached the genome-wide significance level (P = 5 × 10−8; Fig. 1A), suggesting that the effect sizes of all the pan-cancer–associated variants are very small, in line with the polygenic model. On the other hand, however, GWA analyses of TMB in specific cancer types (Material and Methods) identified genome-wide significant loci for TMB in breast invasive carcinoma (top SNP, rs611414; P = 3.30 × 10−14), kidney chromophobe (top SNP, rs1586868; P = 2.19 × 10−10), and stomach adenocarcinoma (top SNP, rs34026011; P = 2.01 × 10−8; Figs. 1B,–D and 2A,–C; Supplementary Fig. S6), implying tumor-specific genetic architecture of TMB. The loci for TMB in breast invasive carcinoma and kidney chromophobe also passed the Bonferroni-corrected experimental-wise significant threshold (PBonferroni corrected = 1.67 × 10−9) considering a total of 30 cancer types tested with a sample size greater than 45. We then performed an SMR analysis (38) to identify putative causal genes underlying the association signals and prioritized two genes (BAG5 and KLC1; PSMR = 2.10 × 10−7 and PHEIDI = 0.41 for BAG5 and PSMR = 3.09 × 10−7 and PHEIDI = 0.37 for KLC1) for stomach adenocarcinoma (Fig. 2D), consistent with a mechanism that the germline variants affect stomach adenocarcinoma TMB through altering the expression levels of BAG5 and KLC1. Both genes have been reported to play an important role in tumorigenesis (41, 42), but how they augment somatic mutational processes in tumors remains to be addressed in the future. It should be noted that there were multiple independent eQTL signals for KLC1, and the secondary eQTL signal did not seem to have an effect on TMB (Supplementary Fig. S7), suggesting that either KLC1 is not a causal gene or the secondary eQTL signal for KLC1 is tissue specific.
Shared polygenic effects between pan-cancer TMB and 46 complex traits and diseases
Having known that pan-cancer TMB has a substantial polygenic component, we sought to investigate whether there were shared genetic factors between pan-cancer TMB and complex traits (including diseases). To do this, we selected 46 complex traits for which summary statistics from large-scale GWASs were available from prior work (Supplementary Table S3), and used two analytic approaches, the PRS (43) and bivariate LDSC (32), to quantify the extent to which the polygenic effects are shared between traits (Materials and Methods). Both methods use all genome-wide common SNPs and are robust in the presence of a large number nonassociated SNPs.
In the PRS analysis, we used SBayesR (33), a summary data–based Bayesian multiple regression method, to estimate the weights for all the HM3 SNPs to construct a PRS (i.e., a weighted sum of effect alleles across all SNPs) for each of the 46 traits in TCGA and tested the correlation between the PRS and pan-cancer TMB. We opted to use SBayesR for the PRS analysis because it requires only summary statistics from the discovery GWAS and shows improved prediction accuracy over competing approaches (44, 45). We found that pan-cancer TMB was significantly positively correlated with rheumatoid arthritis PRS and negatively correlated with smoking cessation PRS, with an FDR < 0.05 (Fig. 3; Supplementary Table S3).
We next employed the bivariate LDSC method to estimate |{r_g}$| between pan-cancer TMB and each of the 46 traits, where |{r_g}$| can be interpreted as the correlation of polygenic effects between TMB and a trait. The LDSC analyses were performed using all the HM3 SNPs from the GWAS summary data for pan-cancer TMB and 46 traits mentioned above. We found that pan-cancer TMB showed significant rg with height (|{\hat{r}_g}$| = 0.28; SE, 0.081) with FDR < 0.05 (Fig. 3; Supplementary Table S3). We observed good concordances between the estimates from PRS and LDSC in both the magnitude (Pearson correlation = 0.83 and P = 1.39 × 10−12) and the significance level (Pearson correlation = 0.71 and P = 3.53 × 10−8; Fig. 3; Supplementary Fig. S8), with eight nominally significant (i.e., P < 0.05) estimates from the former and nine from the latter, including smoking initiation and rheumatoid arthritis (Supplementary Table S3).
Putative causal associations of tobacco smoking and height with pan-cancer TMB: an MR analysis
We have shown above that TMB is genetically associated with several complex traits and diseases. Then, the question is whether those associations are causal. To address this hypothesis, we performed an MR analysis to test causal effect of a complex trait on pan-cancer TMB (Materials and Methods). We opted to use the GSMR (36) method for the MR analysis because GSMR utilizes multiple SNPs as genetic instruments, accounts for sampling variance in estimated SNP effects on exposure, and accounts for potential residual correlations between the genetic instruments. We denote the MR estimate of causal effect of a trait or disease on TMB by |{\hat{b}_{xy}}$|, which is interpreted as the change of TMB in SD units given one unit increase in the trait (in SD units, if the trait was standard in GWAS) or disease risk (on the logit scale). We included in the GSMR analysis 11 traits (including diseases) that showed a nominally significant genetic overlap with pan-cancer TMB in either the PRS or the LDSC analysis above. We used quasi-independent SNPs associated with each of the 11 traits with P < 5.0 × 10−8 as the genetic instruments in the GSMR analysis (Supplementary Table S4). Unless stated otherwise, a statistical significance was claimed on the basis of an FDR threshold of 0.05 in the analyses below. Note that we only tested one direction of a causal association, that is, the effect of a complex trait on TMB, because testing the other direction requires a very large sample size to detect sufficient genome-wide significant SNPs for TMB and such a large sample is currently unavailable.
We found a significant positive association between height and pan-cancer TMB (|{\hat{b}_{xy}}$| = 0.068; SE = 0.022; P = 1.89 × 10−3; Fig. 4; Supplementary Table S4), in line with the previous finding that increased height is a risk for cancer, as observed in epidemiologic data (46, 47) and recent MR studies (36, 48, 49). We noted a positive effect of smoking initiation on pan-cancer TMB (|{\hat{b}_{xy}}\ = \ 0.14$|; SE = 0.055; P = 0.011; Fig. 4; Supplementary Table S4), in line with a negative effect of smoking cessation on pan-cancer TMB (|{\hat{b}_{xy}}$| = −0.27; SE = 0.13; P = 0.042; Fig. 4; Supplementary Table S4). Although both of them did not pass the FDR threshold, which may be due to the small number of genetic instruments available for smoking cessation (N = 20) or the small sample size of TCGA cohort for TMB GWAS analyses, these results support the causal role of tobacco smoking in somatic mutations, as shown in previous studies (50–55). To determine the robustness of our findings, we conducted MR analyses with the MR-IVW method (56). The results from MR-IVW method were highly consistent with those from GSMR (Supplementary Fig. S9A and S9B; Supplementary Table S4). Of note, we report the GSMR estimates as the main results because GSMR is more powerful than MR-IVW as shown in our previous simulations (36).
Discussion
In this study, our results provided the first molecular genetic evidence for a polygenic architecture of pan-cancer TMB by whole-genome estimation and prediction analyses using data from TCGA. Together with several emerging associations of germline variants with somatic events (11–16), our findings support the hypothesis that TMB is not only heritable, but also polygenic. Using germline genotypes and somatic mutation phenotypes in TCGA, we obtained a moderate and significant proportion of variance explained by all common SNPs (i.e., |h_{SNP}^2$|) of approximately 13% for pan-cancer TMB. Because of the small sample sizes of the individual cancer types, we focused the |h_{SNP}^2$| estimation analysis only on pan-cancer TMB. The genetic variance for pan-cancer TMB can be regarded as the mean of a genetic variance and covariance matrix for TMB of the individual cancer types. In theory, the estimate of |\hat{h}_{SNP}^2$| in the pooled sample is smaller than the mean of the estimates from the subsamples, unless the rg between the subsamples is 1. Hence, if there is genetic heterogeneity in TMB between the individual cancer types, which is very likely to be the case, the |\hat{h}_{SNP}^2$| for pan-cancer TMB provides a lower limit of the mean |\hat{h}_{SNP}^2$| across TMB of individual cancer types. Our GWA analysis for pan-cancer TMB did not identify any genome-wide significant loci for pan-cancer TMB (Fig. 1A), consistent with what is predicted from a polygenic model that the effect sizes of individual SNPs are too small to reach the genome-wide significance level when the sample size is not sufficiently large. Inference of polygenicity from the BayesS analysis further supported the polygenic architecture of pan-cancer TMB with an estimate of approximately 10,000 (95% CI, 9,573–13,220) common germline variants with nonzero effects. GWASs with much larger sample sizes are required in the future for the discovery of individual genetic variants associated with pan-cancer TMB. On the other hand, however, we identified a few genome-wide significant loci for TMB in specific cancer types, such as breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma (Fig. 2B,–D; Supplementary Fig. S6), suggesting genetic heterogeneity among cancer types. One notable example is the variants at the BAG5/KLC1 locus, where we found evidence that the effects of the germline variants on stomach adenocarcinoma TMB are mediated by the expression level of BAG5 or KLC1. These results indicate that cancer type–specific TMB is likely to be less polygenic than pan-cancer TMB because the latter is a compound trait of the former.
Despite the failure of detecting individual SNPs associated with pan-cancer TMB at a genome-wide significance level, whole-genome estimation and prediction analyses that test the overall sharing of polygenic effects between pan-cancer TMB and a wide range of complex traits and diseases revealed several important insights. First, we found a positive rg of pan-cancer TMB with smoking initiation (even though it was only nominally significant) and negative correlation with smoking cessation. Our MR analyses found suggestive evidence for a causal link from tobacco smoking to pan-cancer TMB. These results suggest that tobacco smoking can cause somatic mutations, in line with previous findings (50–55). For example, association of tobacco smoking with the risk of human cancer has been established for more than 60 years by epidemiologic and clinical studies (57–59). Mutational signatures associated with tobacco smoking were detected and well-characterized in human tumors from large-scale sequencing studies (53). Recently, these mutational signatures were experimentally induced in human-induced pluripotent stem cell lines exposed to the polycyclic aromatic hydrocarbon, one of the most important mutagenic components of tobacco smoking (54). A more recent study evaluated the landscape of somatic mutations of 632 single-cell–derived colonies in normal bronchial epithelium by whole-genome sequencing and found that cells for current and ex-smokers revealed a substantially increased mutational burden (55). All these discoveries support a putative causal effect of tobacco smoking on the generation of somatic mutations.
Our work also has several limitations. First, as mentioned above, we did not attempt to estimate the |h_{SNP}^2$| for TMB of individual cancer types due to their small sample sizes; note that the SE of |\hat{h}_{SNP}^2$| is approximately 318/n, with n being the sample size (60). The |\hat{h}_{SNP}^2$| for pan-cancer TMB serves a lower limit of the mean estimate of |h_{SNP}^2$| for TMB of the individual cancer types. It is likely that the |h_{SNP}^2$| for TMB of each cancer type varies, with some being substantially larger than others. Large sample sizes are required in future studies to estimate the |\hat{h}_{SNP}^2$| for TMB of specific cancer types, especially those for which the |h_{SNP}^2$| is small. Second, the difference in TMB variance before and after RINT can lead to a difference in the |\hat{h}_{SNP}^2$|. Hence, the |\hat{h}_{SNP}^2$| needs to be interpreted specifically for the phenotype used in the |h_{SNP}^2\ $|estimation analysis. Nevertheless, regardless of the issue with respect to RINT, the magnitude of |\hat{h}_{SNP}^2$| together with its SE is meaningful because at least it suggests that the |h_{SNP}^2$| for pan-cancer TMB is very unlikely to be large. Third, the genome-wide significant loci for TMB in breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma warrant replication in independent datasets when appropriate samples are available in the future. Fourth, although we have collected the largest available GWASs' data for the complex traits thus far, the power to detect some of the rgs, PRS correlations, or MR associations is still limited. For example, the PRSs of smoking initiation were associated with pan-cancer TMB only at the nominal significance level; these associations need to be confirmed in larger datasets in the future. Last but not the least, we observed a significant negative correlation between pan-cancer TMB and ovarian cancer PRS. One of the possible explanations is the Knudsen two-hit hypothesis (61), which states that individuals with lower germline genetic risk for ovarian cancer may need to accumulate more somatic mutations to develop cancer. However, this hypothesis does not explain why we did not have a similar observation for other cancer types, such as breast and prostate cancers. Further studies are warranted to understand the cause of the negative correlation between pan-cancer TMB and ovarian cancer PRS.
In summary, our pan-cancer analyses show that TMB is a heritable trait with a polygenic architecture. Given that TMB is a surrogate for ICI outcome, these knowledges offer possibility that germline variants associated with TMB may be used to stratify patients for ICI therapies by a simple blood- or saliva-based assay. While the power to identify individual variants associated with a polygenic trait, like pan-cancer TMB, is limited given the current sample size, our GWA analyses for cancer type–specific TMB identified three genome-wide significant loci, and the follow-up functional integration analysis highlighted a putative mechanism whereby the variants at the BAG5/KLC1 locus are associated with stomach adenocarcinoma TMB through their effects on BAG5/KLC1 expression, providing an important lead to design future functional studies to understand the molecular mechanisms underpinning the genetic control of somatic mutations. Using GWASs summary data, we identified putative causal associations of tobacco smoking and height with pan-cancer TMB. These findings confirm the links between tobacco smoking and height with cancer risk and support the somatic mutation theory of carcinogenesis. Our results not only contribute to the understanding of the underlying mechanisms of somatic mutations, but also may stimulate systematic efforts to assess genetic variants as putative noninvasive biomarkers to stratify patients with cancer for ICI therapies.
Authors’ Disclosures
No disclosures were reported.
Authors' Contributions
X. Sun: Conceptualization, data curation, formal analysis, investigation, visualization, methodology, writing-original draft, writing-review and editing. A. Xue: Data curation, software, formal analysis. T. Qi: Data curation, formal analysis. D. Chen: Data curation, formal analysis. D. Shi: Data curation, formal analysis. Y. Wu: Data curation, formal analysis. Z. Zheng: Software, formal analysis. J. Zeng: Data curation, software, formal analysis. J. Yang: Conceptualization, supervision, funding acquisition, methodology, writing-original draft, project administration, writing-review and editing.
Acknowledgments
This research was partly supported by the Sylvia & Charles Viertel Charitable Foundation, the Australian Research Council (FT180100186 to J. Yang), and the Westlake Education Foundation. The results published here are, in part, based upon data generated by TCGA Research Network: https://www.cancer.gov/tcga (dbGaP accession no. phs000178). This project also makes use of data from the UK Biobank (application no. 54336). We thank three anonymous reviewers for their constructive comments on an early version of the article.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.