Tumor mutational burden (TMB) is an emerging biomarker of response to immunotherapy in solid tumors. However, the extent to which variation in TMB between patients is attributable to germline genetic variation remains elusive. Here, using 7,004 unrelated patients of European descent across 33 cancer types from The Cancer Genome Atlas, we show that pan-cancer TMB is polygenic with approximately 13% of its variation explained by approximately 1.1 million common variants altogether. We identify germline variants that affect TMB in stomach adenocarcinoma through altering the expression levels of BAG5 and KLC1. Further analyses provide evidence that TMB is genetically associated with complex traits and diseases, such as smoking, rheumatoid arthritis, height, and cancers, and some of the associations are likely causal. Overall, these results provide new insights into the genetic basis of somatic mutations in tumors and may inform future efforts to use genetic variants to stratify patients for immunotherapy.

Significance:

This study provides evidence for a polygenic architecture of tumor mutational burden and opens an avenue for the use of whole-genome germline genetic variations to stratify patients with cancer for immunotherapy.

Somatic mutations, in particular those that confer selective growth advantage to cells, contribute to cancer development (1). Patterns and functions of somatic mutations in human cancer genome have been extensively analyzed in a wide variety of cancers, leading to numerous biological insights (2–4). Particularly, an important somatic event termed tumor mutational burden (TMB), determined by the total number of nonsynonymous somatic mutations in a cancer genome, has been shown to have great practical relevance. Notably, lines of evidence suggested that TMB is a predictive biomarker to identify patients with cancer who are more likely to have response to immune checkpoint inhibitor (ICI) therapies (5–9).

Several studies have revealed the association of germline genetic variants with somatic events in tumors, including mutational signatures (10–12), and somatic mutations in specific cancer-related genes (13, 14). In addition, Zhu and colleagues uncovered a negative association between germline cancer risk alleles and somatic mutation burden in breast cancer (15). Robles-Espinoza and colleagues found that germline MC1R variants were associated with somatic mutation burden of melanoma (16). Despite these progresses, our understanding of the germline genetics of TMB is still largely limited. A key missing piece is the elucidation of the genetic architecture of TMB (e.g., the overall contribution of common genetic variants to TMB and the distribution of effect sizes of the common variants) and the genetic and causal relationships of complex traits (including diseases) with TMB. Answering these questions will be instrumental for guiding future genome-wide association studies (GWAS) for TMB, identifying risk factors for TMB, and assessing the potential clinical utility of the genetic characteristics of TMB in ICI therapies.

Here, we study the genetic architecture of TMB from a pan-cancer analysis of 7,004 patients of European ancestry across 33 cancer types in The Cancer Genome Atlas (TCGA). Although there are likely differences in the genetic architecture of TMB between the cancer types, the sample sizes in TCGA are insufficient for methods that use genome-wide common variants to depict the genetic architecture of TMB for individual cancer types. Hence, we focus this study on pan-cancer TMB while acknowledging that there is likely genetic heterogeneity in TMB between cancer types (see below for more discussion). We estimate that approximately 13% of variance in pan-cancer TMB can be explained by all common genetic variants together. We identify germline loci associated with TMB in specific cancer types, as well as the underlying putative causal genes that drive the associations. Moreover, our findings show a shared genetic component between pan-cancer TMB and complex traits (including diseases), such as tobacco smoking, height, rheumatoid arthritis, and cancers. We further investigate the causal relationships between TMB and complex traits by Mendelian randomization (MR) analyses, which provide evidence for putative causal effects of tobacco smoking on pan-cancer TMB.

TCGA cohort

TCGA included both tumor and matched normal biospecimens from 10,224 human samples with written informed consent under authorization of local institutional review boards (IRB; https://cancergenome.nih.gov/abouttcga/policies/informedconsent). We excluded 344 highly mutated samples likely reflecting technical artefacts as detected by a previous study (4). Samples that were marked by the Analysis Working Group on the basis of pathology were also discarded, but those with “RNA degradation” flag were retained because no RNA data were used in this study (4). All samples in public TCGA database have been collected and utilized following strict human subjects protection guidelines, written informed consent, and IRB review of protocols (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/history/policies). Our study was approved by the Ethics Committees of The University of Queensland (Brisbane, Queensland, Australia, ID: 2011001173), Wenzhou Medical University (Wenzhou, Zhejiang, P.R. China, ID: 2019094), and Westlake University (Hangzhou, Zhejiang, P.R. China, ID: 20200722YJ001).

Quantification of tumor mutational burden

TMB is defined as the total number of nonsynonymous variants (including single-nucleotide variants and indels) per Mb of an exome examined on the basis of MC3 calls from PanCanAtlas (mc3.v0.2.8.PUBLIC.maf; https://www.synapse.org/#!Synapse:syn7214402/files/).

SNP array data

We used genotype data of 522,606 SNPs derived from Affymetrix 6.0 SNP arrays. Birdseed genotype files (n = 11,459, across 33 cancer types) were downloaded from the Genome Data Commons legacy (GRCh37/hg19) archive (https://portal.gdc.cancer.gov/legacy-archive). Birdseed2vcf (https://github.com/ding-lab/birdseed2vcf) was used to transform birdseed genotype files into individual VCF files, which were then merged into a single combined VCF file. We chose one single file according to the following criteria, if 1 patient had multiple aliquot barcodes, as described in detail elsewhere (17): (i) sample type: blood > solid (sample type code, 10 > 11), (ii) molecular type of analyte: D analytes (native DNA) were preferred over G, W, or X (whole-genome amplified); (iii) order of portion: portion with a higher number was selected, (iv) order of plate: plate with a higher number was selected, and (v) sequencing or analysis center: center with a higher number was selected. Finally, a total of 10,406 nonredundant samples for 33 cancer types were obtained for downstream analysis.

Principal component analysis

To remove ancestral outliers from TCGA sample, we performed a principal component analysis (PCA) in a combined genotype dataset of TCGA and 1000 Genomes Project (1000GP, RRID:SCR_008801; ref. 18). Only the autosomal SNPs with missingness rate < 5% and minor allele frequency (MAF) > 1% and individuals with missingness rate < 5% (446,018 SNPs on 12,909 individuals in total) were included in the PCA. After the PCA, we removed TCGA individuals whose principal component 1 (PC1) or PC2 deviated more than six SDs from the mean of the corresponding PC of the 1000GP individuals of European ancestry. Finally, a total of 8,179 TCGA individuals of European ancestry were retained for further analysis.

Genotype imputation

We imputed a set of cleaned genotype data consisting of 423,239 autosomal SNPs (missingness rate < 5%, MAF > 1%, and Hardy Weinberg equilibrium P > 10−6) on 8,179 individuals of European ancestry to the Haplotype Reference Consortium (19) using the Sanger Imputation Server (https://imputation.sanger.ac.uk/; ref. 20). The genotype posterior probabilities from imputation were converted to hard-call genotypes using PLINK2 (–hard-call-thresh 0.1) (21). We filtered out SNPs with missingness rate > 5%, MAF < 1%, Hardy Weinberg equilibrium P < 10−6, or imputation INFO score < 0.3. Note that because the sample size of TCGA cohort was too small to analyze rare variants, we included only the common variants (i.e., MAF > 1%) in this study.

Genome-wide association analysis

We first removed individuals with missing TMB or covariate information (e.g., age, sex, or tumor purity). We then exploited genome-wide complex trait analysis (GCTA; ref. 22) to compute the genetic relationships between all pairs of the remaining subjects using variants in common with those from HapMap 3 (referred to as HM3 SNPs hereafter) and excluded one of each pair of individuals with estimated genetic relatedness > 0.05, resulting in a set of 7,004 unrelated individuals. We regressed TMB against age and standardized the residuals to z-scores within each sex group of each cancer type. For pan-cancer TMB, we first combined the standardized residuals of all the cancer types and then applied a rank-based inverse normal transformation (RINT) to the combined residuals. We also applied RINT to TMB in each cancer type.

For the imputed genotype data, we performed additional quality control steps in the unrelated individuals: excluding variants with missingness rate > 5%, MAF < 1%, or Hardy Weinberg equilibrium P < 10−6 and samples with missingness rate > 5%. Finally, a total of 7,567,484 common variants, with 1,139,665 in common with HM3 SNPs, and 7,004 unrelated individuals were used for the analyses below.

Association tests were performed for pan-cancer TMB and TMB of each cancer type based on a linear regression model implemented in fastGWA (23). We included the first 20 PCs and tumor purity as covariates in the regression model. Variants with P < 5 × 10−8 were regarded as genome-wide significant.

SNP-based heritability estimation

We used 1,139,665 HM3 SNPs and 7,004 unrelated samples to estimate the SNP-based heritability (⁠h_{SNP}^2$⁠) for pan-cancer TMB. We opted to use the HM3 SNPs because they are a trimmed set of variants optimized to capture common genetic variation of the human genome (24), and to improve the comparability of this study with previous h_{SNP}^2$ estimation analyses. We used the following methods for h_{SNP}^2$ estimation.

Genomic relatedness matrix-restricted maximum likelihood analysis

Genomic relatedness matrix-restricted maximum likelihood (GREML; also known as single-component GERML or GREML-SC) is a mixed linear model method that leverages a genetic relatedness matrix (GRM) to estimate the proportion of phenotypic variance in a trait that can be explained by all SNPs in unrelated individuals. GREML is implemented in the GCTA software package (22). Details of the GREML methodology have been described elsewhere (25). We built a GRM of the 7,004 unrelated individuals using the HM3 SNPs and then estimated \hat{h}_{SNP}^2$ with the first 20 PCs and tumor purity fitted as covariates in the GREML model.

MAF-stratified GREML analysis

It has been shown in previous studies that \hat{h}_{SNP}^2$ obtained from GREML-SC can be biased if MAF distribution of causal variants is different from that of SNPs used in the analysis (26, 27). MAF-stratified GREML (GREML-MS) was thus developed to address this gap by stratifying SNPs by MAF into multiple GRMs and fitting the GRMs jointly in a model. We stratified the HM3 SNPs used into five MAF bins (i.e., 0.01 < MAF < 0.1, 0.1 < MAF < 0.2, 0.2 < MAF < 0.3, 0.3 < MAF < 0.4, and 0.4 < MAF < 0.5) and generated the corresponding GRMs. GCTA was then used to perform a GREML-MS analysis with adjustment for the first 20 PCs and tumor purity to estimate \hat{h}_{SNP}^2$ for pan-cancer TMB.

Linkage disequilibrium and MAF-stratified GREML analysis

Linkage disequilibrium and MAF-stratified GREML (GREML-LDMS) is another variant of the GREML method developed to account for nonrandom distribution of causal variants with respect to both MAF and linkage disequilibrium (LD; refs. 27–29). In this analysis, each of the five MAF bins in the GREML-MS analysis above was split into a low and a high LD bin based on LD scores of the SNPs, resulting in 10 MAF and LD bins in total. We then ran GCTA to conduct a GREML-LDMS analysis (27) for pan-cancer TMB with the first 20 PCs and tumor purity included as covariates.

Bayesian mixed linear model with an S parameter analysis

The Bayesian mixed linear model with an S parameter method (termed BayesS), implemented in the GCTB software (http://cnsgenomics.com/software/gctb/), was employed to estimate three genetic architecture parameters simultaneously, that is, h_{SNP}^2$⁠, polygenicity (proportion of SNPs with nonzero effects), and relationship between SNP effect size and MAF (30). We performed a BayesS analysis for pan-cancer TMB with a correction for the first 20 PCs and tumor purity.

LD score regression analysis

LD score regression (LDSC) is a summary data–based method that estimates h_{SNP}^2$ by regressing GWAS {\chi ^2}$ statistics of all SNPs on their LD scores (https://github.com/bulik/ldsc; ref. 31). LDSC is computationally more efficient than the conventional h_{SNP}^2$ estimation methods that rely on individual-level data, especially for datasets with large sample sizes. We conducted an LDSC analysis using the GWAS summary statistics for pan-cancer TMB and LD scores provided by the LDSC package (computed from the 1000GP individuals of European ancestry based on HM3 SNPs). We excluded variants in the MHC region due to the complexity of this region.

Genetic correlation estimation using bivariate LDSC

Similar to the estimation of h_{SNP}^2$ by LDSC for a single trait, bivariate LDSC estimates the genetic covariance by regressing the product of z-scores for two traits on LD score across SNPs (32). We used the bivariate LDSC to estimate genetic correlation (rg) between pan-cancer TMB and a complex trait. GWAS summary data for 46 complex traits (including diseases) were collected from published studies (Supplementary Table S3).

Polygenic risk score analysis

SBayesR is a summary data–based Bayesian multiple regression method, which was developed mainly for polygenic risk score (PRS) analysis and has been shown to improve prediction accuracy over other widely used PRS approaches (21, 33, 34). SBayesR is available in the GCTB software package (30). We ran SBayesR to estimate the joint effects of all the HM3 SNPs using GWAS summary data for a complex trait and LD information from a reference sample with individual-level genotypes. The MCMC chain was performed with 50,000 iterations, 20,000 burn in, and other default parameters. The shrinkage of the SNP effects toward zero comes from two components, one from a probability of a point mass at zero and the other from the prior distribution (a mixture of normal distributions centered at zero) for the nonnull effects. We used a set of 50,000 unrelated individuals of European ancestry randomly selected from the UK Biobank (UKB) as the LD reference for the UKB GWAS summary data and another set of 50,000 unrelated individuals of European ancestry sampled from the GERA (Resource for Genetic Epidemiology Research on Aging) cohort (35) as the LD reference for the other GWAS summary datasets. Similar as above, SNPs in the MHC regions were excluded.

After obtaining the SBayesR estimates of SNP effects, we computed the PRS of the trait in TCGA cohort using the following equation:

formula

where m is the number of all the HM3 SNPs without selection, \hat{\beta }$ is the estimated effect of the i$th SNP from SBayesR, and {x_i}$⁠, coded as 0, 1, or 2, is the number of effect alleles of the i$th SNP. We then correlated the PRS of each of the 46 traits with pan-cancer TMB, as well as TMB of individual cancer types.

MR analysis

Generalized summary data–based MR (GSMR) is an MR method that uses multiple genetic variants as instruments to infer causal relationship between two phenotypes using GWAS summary statistics (36). Here, we employed GSMR to explore the causal associations between 11 traits, which showed nominally significant associations with pan-cancer TMB in either the PRS or rg analysis, and pan-cancer TMB. Following Zhu and colleagues' study (36), we selected quasi-independent SNPs (LD r2 < 0.05) with P < 5 × 10−8 as genetic instruments and only analyzed traits with at least 10 genetic instruments. We ran the GSMR analysis with the heterogeneity in dependent instrument (HEIDI)-outlier filtering step (default setting) to exclude genetic instruments that are associated with both the exposure and outcome phenotypes because of horizontal pleiotropy. We also repeated the MR analyses using the inverse-variance weighted approach (MR-IVW) implemented in the R package “MR-base” (37). Similar as in the GSMR analyses, quasi-independent SNPs associated with an exposure with P < 5 × 10−8 were chosen as instruments, which were then pruned for LD following the default settings in MR-base.

Summary data–based MR analysis

Summary data–based MR (SMR) is a method that integrates summary statistics from a GWAS for a trait with those from an expression quantitative trait locus (eQTL) study to identify genes whose expression levels are the mediators of SNP effects on the trait (38). We used SMR to identify putative causal genes for TMB in breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma. We used cis-eQTL summary data from the eQTLGen consortium, an eQTL meta-analysis of 31,684 individuals in blood (39). We included in the SMR analysis only the genes with at least a cis-eQTL with P < 5 × 10−8 (the default setting in SMR) and excluded genes in the MHC regions, resulting in 15,487 genes in total. We used a genome-wide significance threshold of P < 3.23 × 10−6 (i.e., 0.05/15,487) to claim significant SMR associations and the HEIDI test to distinguish causality or vertical pleiotropy from linkage (PHEIDI > 0.05).

The polygenic architecture of pan-cancer TMB

We defined TMB as the total number of nonsynonymous variants per Mb of an exome examined (Materials and Methods). We first estimated the proportion of variance in pan-cancer TMB explained by all common SNPs (i.e., h_{SNP}^2$⁠) in 7,004 unrelated European patients across 33 cancer types in TCGA (hereafter termed pan-cancer TMB; Supplementary Figs. S1 and S2). The TMB phenotype was adjusted for covariates, such as age, gender, and cancer type, and standardized to z-score by RINT (Materials and Methods). We employed five commonly used h_{SNP}^2$ estimation methods, with additional covariates fitted to control for the effects due to population stratification and tumor purity (Materials and Methods). The five methods used were: (i) GREML-SC analysis (25), (ii) GREML-MS (five GRMs; ref. 28), (iii) GREML-LDMS (10 GRMs; ref. 28), (iv) BayesS (30), and (v) LDSC (31). GREML-MS and GREML-LDMS can mitigate biases caused by nonrandom distribution of causal variants with respect to MAF and LD (27–29, 40), and hence were included in this analysis. Unless stated otherwise, all the analyses in this study were performed on the basis of approximately 1.1 million imputed HM3 SNPs with genome-wide coverage and MAF > 0.01 (Materials and Methods). There were no major differences between the results from the five methods (Table 1). All the estimates of h_{SNP}^2\ $were statistically significant with a mean of 12.8% across methods (Table 1). The estimates of h_{SNP}^2$ remained largely unchanged when the TMB phenotype was not corrected for age (Supplementary Table S1). Analyses using TMB z-score without RINT showed a substantially smaller \hat{h}_{SNP}^2$ on average (Supplementary Figs. S3A–S3G and S4; Supplementary Table S2), owing to the influence of a small number of outliers on the phenotypic variance as demonstrated by simulation (Supplementary Fig. S5A and S5B), suggesting that RINT was necessary. We, therefore, performed the following analyses using RINT TMB. The BayesS method is capable of simultaneously estimating three genetic architecture parameters, that is, \hat{h}_{SNP}^2$⁠, polygenicity (proportion of SNPs with nonzero effects), and relationship between SNP effect size and MAF (also known as the S parameter). The polygenicity estimate of pan-cancer TMB from BayesS was approximately 1% [95% confidence interval (CI), 0.84%–1.16%], meaning that approximately 10,000 (95% CI, 9,573–13,220) common SNPs with nonzero effects on pan-cancer TMB. These results indicate that like many other complex human traits, such as height (25), pan-cancer TMB is affected by a large number of genetic variants each with a small effect, consistent with a polygenic architecture.

Table 1.

Estimates of h_{SNP}^2$ for pan-cancer TMB by multiple widely used methods from 7,004 unrelated patients of European ancestry across 33 cancer types in TCGA cohort.

Method{\hat{\bi h}\bf ^2}_{{\rm{SNP}}}$SEP
GREML-SC 0.129 0.0468 0.0058 
GREML-MS 0.125 0.0491 0.0100 
GREML-LDMS 0.132 0.0580 0.0229 
BayesS 0.0987 0.0383 0.0098 
LDSC 0.153 0.0651 0.0180 
Method{\hat{\bi h}\bf ^2}_{{\rm{SNP}}}$SEP
GREML-SC 0.129 0.0468 0.0058 
GREML-MS 0.125 0.0491 0.0100 
GREML-LDMS 0.132 0.0580 0.0229 
BayesS 0.0987 0.0383 0.0098 
LDSC 0.153 0.0651 0.0180 

Identifying germline variants and genes associated with TMB

While realizing that the current sample size might not be sufficient for GWAS discovery for a polygenic trait, we performed a genome-wide association (GWA) analysis for pan-cancer TMB (Materials and Methods). No variant reached the genome-wide significance level (P = 5 × 10−8; Fig. 1A), suggesting that the effect sizes of all the pan-cancer–associated variants are very small, in line with the polygenic model. On the other hand, however, GWA analyses of TMB in specific cancer types (Material and Methods) identified genome-wide significant loci for TMB in breast invasive carcinoma (top SNP, rs611414; P = 3.30 × 10−14), kidney chromophobe (top SNP, rs1586868; P = 2.19 × 10−10), and stomach adenocarcinoma (top SNP, rs34026011; P = 2.01 × 10−8; Figs. 1B,D and 2A,C; Supplementary Fig. S6), implying tumor-specific genetic architecture of TMB. The loci for TMB in breast invasive carcinoma and kidney chromophobe also passed the Bonferroni-corrected experimental-wise significant threshold (PBonferroni corrected = 1.67 × 10−9) considering a total of 30 cancer types tested with a sample size greater than 45. We then performed an SMR analysis (38) to identify putative causal genes underlying the association signals and prioritized two genes (BAG5 and KLC1; PSMR = 2.10 × 10−7 and PHEIDI = 0.41 for BAG5 and PSMR = 3.09 × 10−7 and PHEIDI = 0.37 for KLC1) for stomach adenocarcinoma (Fig. 2D), consistent with a mechanism that the germline variants affect stomach adenocarcinoma TMB through altering the expression levels of BAG5 and KLC1. Both genes have been reported to play an important role in tumorigenesis (41, 42), but how they augment somatic mutational processes in tumors remains to be addressed in the future. It should be noted that there were multiple independent eQTL signals for KLC1, and the secondary eQTL signal did not seem to have an effect on TMB (Supplementary Fig. S7), suggesting that either KLC1 is not a causal gene or the secondary eQTL signal for KLC1 is tissue specific.

Figure 1.

Manhattan plot of GWAS for pan-cancer TMB (A) and TMB in breast invasive carcinoma (BRCA; B), kidney chromophobe (KICH; C), and stomach adenocarcinoma (STAD; D). Chromosome positions are shown on the x-axis, and the −log10P values are shown on the y-axis. The dashed line corresponds to P = 5 × 10−8.

Figure 1.

Manhattan plot of GWAS for pan-cancer TMB (A) and TMB in breast invasive carcinoma (BRCA; B), kidney chromophobe (KICH; C), and stomach adenocarcinoma (STAD; D). Chromosome positions are shown on the x-axis, and the −log10P values are shown on the y-axis. The dashed line corresponds to P = 5 × 10−8.

Close modal
Figure 2.

GWAS and SMR locus plots. Shown are the GWAS results at the loci associated with TMB in breast invasive carcinoma (BRCA; A), kidney chromophobe (KICH; B), and stomach adenocarcinoma (STAD; C), and SMR results at the BAG5 and KLC1 locus for TMB in stomach adenocarcinoma (D). AC, Chromosome positions are shown on the x-axis along with the names and sizes of the genes, and the −log10P values are shown on the left y-axis. The purple dot shows the top associated SNP and the other dotted colors indicate pairwise LD (r2) with this SNP. The right y-axis represents the regional recombination rate (cM/Mb). D, The gray dots of the top plot show the −log10P values of SNPs from the GWAS analysis for TMB in stomach adenocarcinoma, and the diamonds indicate the −log10P values of the genes from the SMR analysis. The −log10P values of eQTL from eQTLGen for BAG5 and KLC1 are shown in the bottom plot. Highlighted in red are the genes (BAG5 and KLC1) that passed the SMR (P < 3.23 × 10−6) and HEIDI tests (P > 0.05).

Figure 2.

GWAS and SMR locus plots. Shown are the GWAS results at the loci associated with TMB in breast invasive carcinoma (BRCA; A), kidney chromophobe (KICH; B), and stomach adenocarcinoma (STAD; C), and SMR results at the BAG5 and KLC1 locus for TMB in stomach adenocarcinoma (D). AC, Chromosome positions are shown on the x-axis along with the names and sizes of the genes, and the −log10P values are shown on the left y-axis. The purple dot shows the top associated SNP and the other dotted colors indicate pairwise LD (r2) with this SNP. The right y-axis represents the regional recombination rate (cM/Mb). D, The gray dots of the top plot show the −log10P values of SNPs from the GWAS analysis for TMB in stomach adenocarcinoma, and the diamonds indicate the −log10P values of the genes from the SMR analysis. The −log10P values of eQTL from eQTLGen for BAG5 and KLC1 are shown in the bottom plot. Highlighted in red are the genes (BAG5 and KLC1) that passed the SMR (P < 3.23 × 10−6) and HEIDI tests (P > 0.05).

Close modal

Shared polygenic effects between pan-cancer TMB and 46 complex traits and diseases

Having known that pan-cancer TMB has a substantial polygenic component, we sought to investigate whether there were shared genetic factors between pan-cancer TMB and complex traits (including diseases). To do this, we selected 46 complex traits for which summary statistics from large-scale GWASs were available from prior work (Supplementary Table S3), and used two analytic approaches, the PRS (43) and bivariate LDSC (32), to quantify the extent to which the polygenic effects are shared between traits (Materials and Methods). Both methods use all genome-wide common SNPs and are robust in the presence of a large number nonassociated SNPs.

In the PRS analysis, we used SBayesR (33), a summary data–based Bayesian multiple regression method, to estimate the weights for all the HM3 SNPs to construct a PRS (i.e., a weighted sum of effect alleles across all SNPs) for each of the 46 traits in TCGA and tested the correlation between the PRS and pan-cancer TMB. We opted to use SBayesR for the PRS analysis because it requires only summary statistics from the discovery GWAS and shows improved prediction accuracy over competing approaches (44, 45). We found that pan-cancer TMB was significantly positively correlated with rheumatoid arthritis PRS and negatively correlated with smoking cessation PRS, with an FDR < 0.05 (Fig. 3; Supplementary Table S3).

Figure 3.

Shared polygenic effects between pan-cancer TMB and 46 complex traits. The x-axis shows the estimate of rg between pan-cancer TMB and a complex trait, and the y-axis indicates the correlation between pan-cancer TMB and the trait PRS. The estimates with FDR < 0.05 in either the PRS or rg analysis are highlighted in dark blue and those with P < 0.05 are highlighted in light blue.

Figure 3.

Shared polygenic effects between pan-cancer TMB and 46 complex traits. The x-axis shows the estimate of rg between pan-cancer TMB and a complex trait, and the y-axis indicates the correlation between pan-cancer TMB and the trait PRS. The estimates with FDR < 0.05 in either the PRS or rg analysis are highlighted in dark blue and those with P < 0.05 are highlighted in light blue.

Close modal

We next employed the bivariate LDSC method to estimate {r_g}$ between pan-cancer TMB and each of the 46 traits, where {r_g}$ can be interpreted as the correlation of polygenic effects between TMB and a trait. The LDSC analyses were performed using all the HM3 SNPs from the GWAS summary data for pan-cancer TMB and 46 traits mentioned above. We found that pan-cancer TMB showed significant rg with height (⁠{\hat{r}_g}$ = 0.28; SE, 0.081) with FDR < 0.05 (Fig. 3; Supplementary Table S3). We observed good concordances between the estimates from PRS and LDSC in both the magnitude (Pearson correlation = 0.83 and P = 1.39 × 10−12) and the significance level (Pearson correlation = 0.71 and P = 3.53 × 10−8; Fig. 3; Supplementary Fig. S8), with eight nominally significant (i.e., P < 0.05) estimates from the former and nine from the latter, including smoking initiation and rheumatoid arthritis (Supplementary Table S3).

Putative causal associations of tobacco smoking and height with pan-cancer TMB: an MR analysis

We have shown above that TMB is genetically associated with several complex traits and diseases. Then, the question is whether those associations are causal. To address this hypothesis, we performed an MR analysis to test causal effect of a complex trait on pan-cancer TMB (Materials and Methods). We opted to use the GSMR (36) method for the MR analysis because GSMR utilizes multiple SNPs as genetic instruments, accounts for sampling variance in estimated SNP effects on exposure, and accounts for potential residual correlations between the genetic instruments. We denote the MR estimate of causal effect of a trait or disease on TMB by {\hat{b}_{xy}}$⁠, which is interpreted as the change of TMB in SD units given one unit increase in the trait (in SD units, if the trait was standard in GWAS) or disease risk (on the logit scale). We included in the GSMR analysis 11 traits (including diseases) that showed a nominally significant genetic overlap with pan-cancer TMB in either the PRS or the LDSC analysis above. We used quasi-independent SNPs associated with each of the 11 traits with P < 5.0 × 10−8 as the genetic instruments in the GSMR analysis (Supplementary Table S4). Unless stated otherwise, a statistical significance was claimed on the basis of an FDR threshold of 0.05 in the analyses below. Note that we only tested one direction of a causal association, that is, the effect of a complex trait on TMB, because testing the other direction requires a very large sample size to detect sufficient genome-wide significant SNPs for TMB and such a large sample is currently unavailable.

We found a significant positive association between height and pan-cancer TMB (⁠{\hat{b}_{xy}}$ = 0.068; SE = 0.022; P = 1.89 × 10−3; Fig. 4; Supplementary Table S4), in line with the previous finding that increased height is a risk for cancer, as observed in epidemiologic data (46, 47) and recent MR studies (36, 48, 49). We noted a positive effect of smoking initiation on pan-cancer TMB (⁠{\hat{b}_{xy}}\ = \ 0.14$⁠; SE = 0.055; P = 0.011; Fig. 4; Supplementary Table S4), in line with a negative effect of smoking cessation on pan-cancer TMB (⁠{\hat{b}_{xy}}$ = −0.27; SE = 0.13; P = 0.042; Fig. 4; Supplementary Table S4). Although both of them did not pass the FDR threshold, which may be due to the small number of genetic instruments available for smoking cessation (N = 20) or the small sample size of TCGA cohort for TMB GWAS analyses, these results support the causal role of tobacco smoking in somatic mutations, as shown in previous studies (50–55). To determine the robustness of our findings, we conducted MR analyses with the MR-IVW method (56). The results from MR-IVW method were highly consistent with those from GSMR (Supplementary Fig. S9A and S9B; Supplementary Table S4). Of note, we report the GSMR estimates as the main results because GSMR is more powerful than MR-IVW as shown in our previous simulations (36).

Figure 4.

Evaluation of the causal associations between 11 traits and pan-cancer TMB. The estimated causal effects of the 11 traits on pan-cancer TMB from GSMR analyses (left; blue), the rg estimates from LDSC (middle; red), and the correlations between the trait PRS and pan-cancer TMB (right; yellow). Error bars represent 95% CIs and asterisks indicate significance status, with *, P < 0.05; **, P < 0.01; ***, P < 0.001. The significant results after FDR correction (FDR < 0.05) are labeled as solid diamonds.

Figure 4.

Evaluation of the causal associations between 11 traits and pan-cancer TMB. The estimated causal effects of the 11 traits on pan-cancer TMB from GSMR analyses (left; blue), the rg estimates from LDSC (middle; red), and the correlations between the trait PRS and pan-cancer TMB (right; yellow). Error bars represent 95% CIs and asterisks indicate significance status, with *, P < 0.05; **, P < 0.01; ***, P < 0.001. The significant results after FDR correction (FDR < 0.05) are labeled as solid diamonds.

Close modal

In this study, our results provided the first molecular genetic evidence for a polygenic architecture of pan-cancer TMB by whole-genome estimation and prediction analyses using data from TCGA. Together with several emerging associations of germline variants with somatic events (11–16), our findings support the hypothesis that TMB is not only heritable, but also polygenic. Using germline genotypes and somatic mutation phenotypes in TCGA, we obtained a moderate and significant proportion of variance explained by all common SNPs (i.e., h_{SNP}^2$⁠) of approximately 13% for pan-cancer TMB. Because of the small sample sizes of the individual cancer types, we focused the h_{SNP}^2$ estimation analysis only on pan-cancer TMB. The genetic variance for pan-cancer TMB can be regarded as the mean of a genetic variance and covariance matrix for TMB of the individual cancer types. In theory, the estimate of \hat{h}_{SNP}^2$ in the pooled sample is smaller than the mean of the estimates from the subsamples, unless the rg between the subsamples is 1. Hence, if there is genetic heterogeneity in TMB between the individual cancer types, which is very likely to be the case, the \hat{h}_{SNP}^2$ for pan-cancer TMB provides a lower limit of the mean \hat{h}_{SNP}^2$ across TMB of individual cancer types. Our GWA analysis for pan-cancer TMB did not identify any genome-wide significant loci for pan-cancer TMB (Fig. 1A), consistent with what is predicted from a polygenic model that the effect sizes of individual SNPs are too small to reach the genome-wide significance level when the sample size is not sufficiently large. Inference of polygenicity from the BayesS analysis further supported the polygenic architecture of pan-cancer TMB with an estimate of approximately 10,000 (95% CI, 9,573–13,220) common germline variants with nonzero effects. GWASs with much larger sample sizes are required in the future for the discovery of individual genetic variants associated with pan-cancer TMB. On the other hand, however, we identified a few genome-wide significant loci for TMB in specific cancer types, such as breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma (Fig. 2B,D; Supplementary Fig. S6), suggesting genetic heterogeneity among cancer types. One notable example is the variants at the BAG5/KLC1 locus, where we found evidence that the effects of the germline variants on stomach adenocarcinoma TMB are mediated by the expression level of BAG5 or KLC1. These results indicate that cancer type–specific TMB is likely to be less polygenic than pan-cancer TMB because the latter is a compound trait of the former.

Despite the failure of detecting individual SNPs associated with pan-cancer TMB at a genome-wide significance level, whole-genome estimation and prediction analyses that test the overall sharing of polygenic effects between pan-cancer TMB and a wide range of complex traits and diseases revealed several important insights. First, we found a positive rg of pan-cancer TMB with smoking initiation (even though it was only nominally significant) and negative correlation with smoking cessation. Our MR analyses found suggestive evidence for a causal link from tobacco smoking to pan-cancer TMB. These results suggest that tobacco smoking can cause somatic mutations, in line with previous findings (50–55). For example, association of tobacco smoking with the risk of human cancer has been established for more than 60 years by epidemiologic and clinical studies (57–59). Mutational signatures associated with tobacco smoking were detected and well-characterized in human tumors from large-scale sequencing studies (53). Recently, these mutational signatures were experimentally induced in human-induced pluripotent stem cell lines exposed to the polycyclic aromatic hydrocarbon, one of the most important mutagenic components of tobacco smoking (54). A more recent study evaluated the landscape of somatic mutations of 632 single-cell–derived colonies in normal bronchial epithelium by whole-genome sequencing and found that cells for current and ex-smokers revealed a substantially increased mutational burden (55). All these discoveries support a putative causal effect of tobacco smoking on the generation of somatic mutations.

Our work also has several limitations. First, as mentioned above, we did not attempt to estimate the h_{SNP}^2$ for TMB of individual cancer types due to their small sample sizes; note that the SE of \hat{h}_{SNP}^2$ is approximately 318/n, with n being the sample size (60). The \hat{h}_{SNP}^2$ for pan-cancer TMB serves a lower limit of the mean estimate of h_{SNP}^2$ for TMB of the individual cancer types. It is likely that the h_{SNP}^2$ for TMB of each cancer type varies, with some being substantially larger than others. Large sample sizes are required in future studies to estimate the \hat{h}_{SNP}^2$ for TMB of specific cancer types, especially those for which the h_{SNP}^2$ is small. Second, the difference in TMB variance before and after RINT can lead to a difference in the \hat{h}_{SNP}^2$⁠. Hence, the \hat{h}_{SNP}^2$ needs to be interpreted specifically for the phenotype used in the h_{SNP}^2\ $estimation analysis. Nevertheless, regardless of the issue with respect to RINT, the magnitude of \hat{h}_{SNP}^2$ together with its SE is meaningful because at least it suggests that the h_{SNP}^2$ for pan-cancer TMB is very unlikely to be large. Third, the genome-wide significant loci for TMB in breast invasive carcinoma, kidney chromophobe, and stomach adenocarcinoma warrant replication in independent datasets when appropriate samples are available in the future. Fourth, although we have collected the largest available GWASs' data for the complex traits thus far, the power to detect some of the rgs, PRS correlations, or MR associations is still limited. For example, the PRSs of smoking initiation were associated with pan-cancer TMB only at the nominal significance level; these associations need to be confirmed in larger datasets in the future. Last but not the least, we observed a significant negative correlation between pan-cancer TMB and ovarian cancer PRS. One of the possible explanations is the Knudsen two-hit hypothesis (61), which states that individuals with lower germline genetic risk for ovarian cancer may need to accumulate more somatic mutations to develop cancer. However, this hypothesis does not explain why we did not have a similar observation for other cancer types, such as breast and prostate cancers. Further studies are warranted to understand the cause of the negative correlation between pan-cancer TMB and ovarian cancer PRS.

In summary, our pan-cancer analyses show that TMB is a heritable trait with a polygenic architecture. Given that TMB is a surrogate for ICI outcome, these knowledges offer possibility that germline variants associated with TMB may be used to stratify patients for ICI therapies by a simple blood- or saliva-based assay. While the power to identify individual variants associated with a polygenic trait, like pan-cancer TMB, is limited given the current sample size, our GWA analyses for cancer type–specific TMB identified three genome-wide significant loci, and the follow-up functional integration analysis highlighted a putative mechanism whereby the variants at the BAG5/KLC1 locus are associated with stomach adenocarcinoma TMB through their effects on BAG5/KLC1 expression, providing an important lead to design future functional studies to understand the molecular mechanisms underpinning the genetic control of somatic mutations. Using GWASs summary data, we identified putative causal associations of tobacco smoking and height with pan-cancer TMB. These findings confirm the links between tobacco smoking and height with cancer risk and support the somatic mutation theory of carcinogenesis. Our results not only contribute to the understanding of the underlying mechanisms of somatic mutations, but also may stimulate systematic efforts to assess genetic variants as putative noninvasive biomarkers to stratify patients with cancer for ICI therapies.

No disclosures were reported.

X. Sun: Conceptualization, data curation, formal analysis, investigation, visualization, methodology, writing-original draft, writing-review and editing. A. Xue: Data curation, software, formal analysis. T. Qi: Data curation, formal analysis. D. Chen: Data curation, formal analysis. D. Shi: Data curation, formal analysis. Y. Wu: Data curation, formal analysis. Z. Zheng: Software, formal analysis. J. Zeng: Data curation, software, formal analysis. J. Yang: Conceptualization, supervision, funding acquisition, methodology, writing-original draft, project administration, writing-review and editing.

This research was partly supported by the Sylvia & Charles Viertel Charitable Foundation, the Australian Research Council (FT180100186 to J. Yang), and the Westlake Education Foundation. The results published here are, in part, based upon data generated by TCGA Research Network: https://www.cancer.gov/tcga (dbGaP accession no. phs000178). This project also makes use of data from the UK Biobank (application no. 54336). We thank three anonymous reviewers for their constructive comments on an early version of the article.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1.
Stratton
MR
,
Campbell
PJ
,
Futreal
PA
. 
The cancer genome
.
Nature
2009
;
458
:
719
24
.
2.
Greenman
C
,
Stephens
P
,
Smith
R
,
Dalgliesh
GL
,
Hunter
C
,
Bignell
G
, et al
Patterns of somatic mutation in human cancer genomes
.
Nature
2007
;
446
:
153
8
.
3.
Ng
PK
,
Li
J
,
Jeong
KJ
,
Shao
S
,
Chen
H
,
Tsang
YH
, et al
Systematic functional annotation of somatic mutations in cancer
.
Cancer Cell
2018
;
33
:
450
62
.
4.
Bailey
MH
,
Tokheim
C
,
Porta-Pardo
E
,
Sengupta
S
,
Bertrand
D
,
Weerasinghe
A
, et al
Comprehensive characterization of cancer driver genes and mutations
.
Cell
2018
;
174
:
1034
5
.
5.
Samstein
RM
,
Lee
CH
,
Shoushtari
AN
,
Hellmann
MD
,
Shen
R
,
Janjigian
YY
, et al
Tumor mutational load predicts survival after immunotherapy across multiple cancer types
.
Nat Genet
2019
;
51
:
202
6
.
6.
Hellmann
MD
,
Callahan
MK
,
Awad
MM
,
Calvo
E
,
Ascierto
PA
,
Atmaca
A
, et al
Tumor mutational burden and efficacy of nivolumab monotherapy and in combination with ipilimumab in small-cell lung cancer
.
Cancer Cell
2019
;
35
:
329
.
7.
Hellmann
MD
,
Ciuleanu
TE
,
Pluzanski
A
,
Lee
JS
,
Otterson
GA
,
Audigier-Valette
C
, et al
Nivolumab plus ipilimumab in lung cancer with a high tumor mutational burden
.
N Engl J Med
2018
;
378
:
2093
104
.
8.
Robert
C
,
Schachter
J
,
Long
GV
,
Arance
A
,
Grob
JJ
,
Mortier
L
, et al
Pembrolizumab versus ipilimumab in advanced melanoma
.
N Engl J Med
2015
;
372
:
2521
32
.
9.
Borghaei
H
,
Paz-Ares
L
,
Horn
L
,
Spigel
DR
,
Steins
M
,
Ready
NE
, et al
Nivolumab versus docetaxel in advanced nonsquamous non-small-cell lung cancer
.
N Engl J Med
2015
;
373
:
1627
39
.
10.
Middlebrooks
CD
,
Banday
AR
,
Matsuda
K
,
Udquim
KI
,
Onabajo
OO
,
Paquin
A
, et al
Association of germline variants in the APOBEC3 region with cancer risk and enrichment with APOBEC-signature mutations in tumors
.
Nat Genet
2016
;
48
:
1330
8
.
11.
Chen
ZS
,
Wen
WQ
,
Beeghly-Fadiel
A
,
Shu
XO
,
Diez-Obrero
V
,
Long
JR
, et al
Identifying putative susceptibility genes and evaluating their associations with somatic mutations in human cancers
.
Am J Hum Genet
2019
;
105
:
477
92
.
12.
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium
. 
Pan-cancer analysis of whole genomes
.
Nature
2020
;
578
:
82
93
.
13.
Shu
X
,
Gu
JC
,
Huang
MS
,
Tannir
NM
,
Matin
SF
,
Karam
JA
, et al
Germline genetic variants in somatically significantly mutated genes in tumors are associated with renal cell carcinoma risk and outcome
.
Carcinogenesis
2018
;
39
:
752
7
.
14.
Carter
H
,
Marty
R
,
Hofree
M
,
Gross
AM
,
Jensen
J
,
Fisch
KM
, et al
Interaction landscape of inherited polymorphisms with somatic events in cancer
.
Cancer Discov
2017
;
7
:
410
23
.
15.
Zhu
B
,
Mukherjee
A
,
Machiela
MJ
,
Song
L
,
Hua
X
,
Shi
JX
, et al
An investigation of the association of genetic susceptibility risk with somatic mutation burden in breast cancer
.
Br J Cancer
2016
;
115
:
752
60
.
16.
Robles-Espinoza
CD
,
Roberts
ND
,
Chen
SY
,
Leacy
FP
,
Alexandrov
LB
,
Pornputtapong
N
, et al
Germline MC1R status influences somatic mutation burden in melanoma
.
Nat Commun
2016
;
7
:
12064
.
17.
Yuan
J
,
Hu
ZY
,
Mahal
BA
,
Zhao
SHD
,
Kensler
KH
,
Pi
JJ
, et al
Integrated analysis of genetic ancestry and genomic alterations across cancers
.
Cancer Cell
2018
;
34
:
549
.
18.
Altshuler
DM
,
Durbin
RM
,
Abecasis
GR
,
Bentley
DR
,
Chakravarti
A
,
Clark
AG
, et al
A global reference for human genetic variation
.
Nature
2015
;
526
:
68
.
19.
Das
S
,
Forer
L
,
Schonherr
S
,
Sidore
C
,
Locke
AE
,
Kwong
A
, et al
Next-generation genotype imputation service and methods
.
Nat Genet
2016
;
48
:
1284
7
.
20.
McCarthy
S
,
Das
S
,
Kretzschmar
W
,
Delaneau
O
,
Wood
AR
,
Teumer
A
, et al
A reference panel of 64,976 haplotypes for genotype imputation
.
Nat Genet
2016
;
48
:
1279
83
.
21.
Chang
CC
,
Chow
CC
,
Tellier
LCAM
,
Vattikuti
S
,
Purcell
SM
,
Lee
JJ
. 
Second-generation PLINK: rising to the challenge of larger and richer datasets
.
Gigascience
2015
;
4
:
7
.
22.
Yang
JA
,
Lee
SH
,
Goddard
ME
,
Visscher
PM
. 
GCTA: a tool for genome-wide complex trait analysis
.
Am J Hum Genet
2011
;
88
:
76
82
.
23.
Jiang
LD
,
Zheng
ZL
,
Qi
T
,
Kemper
KE
,
Wray
NR
,
Visscher
PM
, et al
A resource-efficient tool for mixed model association analysis of large-scale data
.
Nat Genet
2019
;
51
:
1749
.
24.
Altshuler
DM
,
Gibbs
RA
,
Peltonen
L
,
Dermitzakis
E
,
Schaffner
SF
,
Yu
FL
, et al
Integrating common and rare genetic variation in diverse human populations
.
Nature
2010
;
467
:
52
8
.
25.
Yang
J
,
Benyamin
B
,
McEvoy
BP
,
Gordon
S
,
Henders
AK
,
Nyholt
DR
, et al
Common SNPs explain a large proportion of the heritability for human height
.
Nat Genet
2010
;
42
:
565
9
.
26.
Lee
SH
,
Yang
J
,
Chen
GB
,
Ripke
S
,
Stahl
EA
,
Hultman
CM
, et al
Estimation of SNP heritability from dense genotype data
.
Am J Hum Genet
2013
;
93
:
1151
5
.
27.
Yang
J
,
Bakshi
A
,
Zhu
Z
,
Hemani
G
,
Vinkhuyzen
AAE
,
Lee
SH
, et al
Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index
.
Nat Genet
2015
;
47
:
1114
20
.
28.
Evans
LM
,
Tahmasbi
R
,
Vrieze
SI
,
Abecasis
GR
,
Das
S
,
Gazal
S
, et al
Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits
.
Nat Genet
2018
;
50
:
737
.
29.
Yang
J
,
Zeng
J
,
Goddard
ME
,
Wray
NR
,
Visscher
PM
. 
Concepts, estimation and interpretation of SNP-based heritability
.
Nat Genet
2017
;
49
:
1304
U243
.
30.
Zeng
J
,
de Vlaming
R
,
Wu
Y
,
Robinson
MR
,
Lloyd-Jones
LR
,
Yengo
L
, et al
Signatures of negative selection in the genetic architecture of human complex traits
.
Nat Genet
2018
;
50
:
746
53
.
31.
Bulik-Sullivan
BK
,
Loh
PR
,
Finucane
HK
,
Ripke
S
,
Yang
J
Schizophrenia Working Group of the Psychiatric Genomics Consortium
et al
LD score regression distinguishes confounding from polygenicity in genome-wide association studies
.
Nat Genet
2015
;
47
:
291
5
.
32.
Bulik-Sullivan
B
,
Finucane
HK
,
Anttila
V
,
Gusev
A
,
Day
FR
,
Loh
PR
, et al
An atlas of genetic correlations across human diseases and traits
.
Nat Genet
2015
;
47
:
1236
41
.
33.
Lloyd-Jones
LR
,
Zeng
J
,
Sidorenko
J
,
Yengo
L
,
Moser
G
,
Kemper
KE
, et al
Improved polygenic prediction by Bayesian multiple regression on summary statistics
.
Nat Commun
2019
;
10
:
5086
.
34.
Vilhjalmsson
BJ
,
Yang
J
,
Finucane
HK
,
Gusev
A
,
Lindstrom
S
,
Ripke
S
, et al
Modeling linkage disequilibrium increases accuracy of polygenic risk scores
.
Am J Hum Genet
2015
;
97
:
576
92
.
35.
Kvale
MN
,
Hesselson
S
,
Hoffmann
TJ
,
Cao
Y
,
Chan
D
,
Connell
S
, et al
Genotyping informatics and quality control for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort
.
Genetics
2015
;
200
:
1051
60
.
36.
Zhu
ZH
,
Zheng
ZL
,
Zhang
FT
,
Wu
Y
,
Trzaskowski
M
,
Maier
R
, et al
Causal associations between risk factors and common diseases inferred from GWAS summary data
.
Nat Commun
2018
;
9
:
224
.
37.
Hemani
G
,
Zheng
J
,
Elsworth
B
,
Wade
KH
,
Haberland
V
,
Baird
D
, et al
The MR-base platform supports systematic causal inference across the human phenome
.
Elife
2018
;
7
:
e34408
.
38.
Zhu
Z
,
Zhang
F
,
Hu
H
,
Bakshi
A
,
Robinson
MR
,
Powell
JE
, et al
Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets
.
Nat Genet
2016
;
48
:
481
7
.
39.
Võsa
U
,
Claringbould
A
,
Westra
H-J
,
Bonder
MJ
,
Deelen
P
,
Zeng
B
, et al
Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis
.
bioRxiv
2018
:
447367
.
DOI:
.
40.
Yang
J
,
Manolio
TA
,
Pasquale
LR
,
Boerwinkle
E
,
Caporaso
N
,
Cunningham
JM
, et al
Genome partitioning of genetic variation for complex traits using common SNPs
.
Nat Genet
2011
;
43
:
519
U44
.
41.
Yue
X
,
Zhao
Y
,
Huang
G
,
Li
J
,
Zhu
J
,
Feng
Z
, et al
A novel mutant p53 binding partner BAG5 stabilizes mutant p53 and promotes mutant p53 GOFs in tumorigenesis
.
Cell Discov
2016
;
2
:
16039
.
42.
Moamer
A
,
Hachim
IY
,
Binothman
N
,
Wang
N
,
Lebrun
JJ
,
Ali
S
. 
A role for kinesin-1 subunits KIF5B/KLC1 in regulating epithelial mesenchymal plasticity in breast tumorigenesis
.
EBioMedicine
2019
;
45
:
92
107
.
43.
Clarke
TK
,
Lupton
MK
,
Fernandez-Pujals
AM
,
Starr
J
,
Davies
G
,
Cox
S
, et al
Common polygenic risk for autism spectrum disorder (ASD) is associated with cognitive ability in the general population
.
Mol Psychiatry
2016
;
21
:
419
25
.
44.
Moser
G
,
Lee
SH
,
Hayes
BJ
,
Goddard
ME
,
Wray
NR
,
Visscher
PMJPG
. 
Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model
.
PLoS Genet
2015
;
11
:
e1004969
.
45.
Ni
G
,
Zeng
J
,
Revez
JR
,
Wang
Y
,
Ge
T
,
Restaudi
R
, et al
A comprehensive evaluation of polygenic score methods across cohorts in psychiatric disorders
.
medRxiv
2020
.
DOI:
.
46.
Green
J
,
Cairns
BJ
,
Casabonne
D
,
Wright
FL
,
Reeves
G
,
Beral
V
, et al
Height and cancer incidence in the Million Women Study: prospective cohort, and meta-analysis of prospective studies of height and total cancer risk
.
Lancet Oncol
2011
;
12
:
785
94
.
47.
Walter
RB
,
Brasky
TM
,
Buckley
SA
,
Potter
JD
,
White
E
. 
Height as an explanatory factor for sex differences in human cancer
.
J Natl Cancer Inst
2013
;
105
:
860
8
.
48.
Qian
F
,
Wang
SF
,
Mitchell
J
,
McGuffog
L
,
Barrowdale
D
,
Leslie
G
, et al
Height and body mass index as modifiers of breast cancer risk in BRCA1/2 mutation carriers: a Mendelian randomization study
.
J Natl Cancer Inst
2019
;
111
:
350
64
.
49.
Ong
JS
,
An
JY
,
Law
MH
,
Whiteman
DC
,
Neale
RE
,
Gharahkhani
P
, et al
Height and overall cancer risk and mortality: evidence from a Mendelian randomisation study on 310,000 UK Biobank participants
.
Brit J Cancer
2018
;
118
:
1262
7
.
50.
Rozek
LS
,
Herron
CM
,
Greenson
JK
,
Moreno
V
,
Capella
G
,
Rennert
G
, et al
Smoking, gender, and ethnicity predict somatic BRAF mutations in colorectal cancer
.
Cancer Epidemiol Biomarkers Prev
2010
;
19
:
838
43
.
51.
Brennan
JA
,
Boyle
JO
,
Koch
WM
,
Goodman
SN
,
Hruban
RH
,
Eby
YJ
, et al
Association between cigarette-smoking and mutation of the P53 gene in squamous-cell carcinoma of the head and neck
.
New Engl J Med
1995
;
332
:
712
7
.
52.
Blackford
A
,
Parmigiani
G
,
Kensler
TW
,
Wolfgang
C
,
Jones
S
,
Zhang
XS
, et al
Genetic mutations associated with cigarette smoking in pancreatic cancer
.
Cancer Res
2009
;
69
:
3681
8
.
53.
Alexandrov
LB
,
Ju
YS
,
Haase
K
,
Van Loo
P
,
Martincorena
I
,
Nik-Zainal
S
, et al
Mutational signatures associated with tobacco smoking in human cancer
.
Science
2016
;
354
:
618
22
.
54.
Kucab
JE
,
Zou
XQ
,
Morganella
S
,
Joel
M
,
Nanda
AS
,
Nagy
E
, et al
A compendium of mutational signatures of environmental agents
.
Cell
2019
;
177
:
821
.
55.
Yoshida
K
,
Gowers
KHC
,
Lee-Six
H
,
Chandrasekharan
DP
,
Coorens
T
,
Maughan
EF
, et al
Tobacco smoking and somatic mutations in human bronchial epithelium
.
Nature
2020
;
578
:
266
72
.
56.
Burgess
S
,
Butterworth
A
,
Thompson
SGJGe
. 
Mendelian randomization analysis with multiple genetic variants using summarized data
.
Genet Epidemiol
2013
;
37
:
658
65
.
57.
Pesch
B
,
Kendzia
B
,
Gustavsson
P
,
Jöckel
KH
,
Johnen
G
,
Pohlabeln
H
, et al
Cigarette smoking and lung cancer—relative risk estimates for the major histological types from a pooled analysis of case–control studies
.
Int J Cancer
2012
;
131
:
1210
9
.
58.
Secretan
B
,
Straif
K
,
Baan
R
,
Grosse
Y
,
El Ghissassi
F
,
Bouvard
V
, et al
A review of human carcinogens-part E: tobacco, areca nut, alcohol, coal smoke, and salted fish
.
Lancet Oncol
2009
;
10
:
1033
4
.
59.
Agudo
A
,
Bonet
C
,
Travier
N
,
Gonzalez
CA
,
Vineis
P
,
Bueno-de-Mesquita
HB
, et al
Impact of cigarette smoking on cancer risk in the European prospective investigation into cancer and nutrition study
.
J Clin Oncol
2012
;
30
:
4550
7
.
60.
Visscher
PM
,
Hemani
G
,
Vinkhuyzen
AA
,
Chen
G-B
,
Lee
SH
,
Wray
NR
, et al
Statistical power to detect genetic (co) variance of complex traits using SNP data in unrelated samples
.
PLoS Genet
2014
;
10
:
e1004269
.
61.
Knudson
AG
 Jr
. 
Mutation and cancer: statistical study of retinoblastoma
.
Proc Natl Acad Sci U S A
1971
;
68
:
820
3
.