Abstract
Genome-wide association studies (GWAS) have found hundreds of single-nucleotide polymorphisms (SNP) associated with increased risk of cancer. However, the amount of heritable risk explained by SNPs is limited, leaving most of the cancer heritability unexplained. Tumor sequencing projects have shown that causal mutations are enriched in genic regions. We hypothesized that SNPs located in protein coding genes and nearby regulatory regions could explain a significant proportion of the heritable risk of cancer. To perform gene-level heritability analysis, we developed a new method, called Bayesian Gene Heritability Analysis (BAGHERA), to estimate the heritability explained by all genotyped SNPs and by those located in genic regions using GWAS summary statistics. BAGHERA was specifically designed for low heritability traits such as cancer and provides robust heritability estimates under different genetic architectures. BAGHERA-based analysis of 38 cancers reported in the UK Biobank showed that SNPs explain at least 10% of the heritable risk for 14 of them, including late onset malignancies. We then identified 1,146 genes, called cancer heritability genes (CHG), explaining a significant proportion of cancer heritability. CHGs were involved in hallmark processes controlling the transformation from normal to cancerous cells. Importantly, 60 of them also harbored somatic driver mutations, and 27 are tumor suppressors. Our results suggest that germline and somatic mutation information could be exploited to identify subgroups of individuals at higher risk of cancer in the broader population and could prove useful to establish strategies for early detection and cancer surveillance.
This study describes a new statistical method to identify genes associated with cancer heritability in the broader population, creating a map of the heritable cancer genome with gene-level resolution.
See related commentary by Bader, p. 2586
Introduction
Decades of research have shown that inherited genomic mutations affect the risk of individuals of developing cancer (1). In cancer syndromes, mutations in susceptibility genes, such as the tumor protein 53 (TP53; ref. 2), and the BRCA1/2 DNA Repair Associated (BRCA1, BRCA2) genes (3, 4), confer up to an 8-fold increase in cancer risk in first-degree relatives (1). However, these inherited mutations are rare and highly penetrant and explain only a small fraction of the relative risk for all cancers (1, 5).
It has been hypothesized that part of cancer risk could be apportioned to high-frequency low-penetrant variants, such as single nucleotide polymorphism (SNP). Genome-wide association studies (GWAS; ref. 6) have been instrumental in identifying SNPs associated with increased risk of cancer in the broader population (1), including breast (7), prostate (8), testicular (9), and blood malignancies (10, 11). However, the vast majority of SNPs account only for a limited increase in cancer risk (1) and are usually filtered out by multiple hypotheses correction procedures applied in GWAS analysis (12), which ultimately leaves most of the cancer risk unexplained (5).
Although most SNPs have only subtle effects, there is mounting evidence suggesting that they still contribute to the risk of developing cancer (13). Recently, we have shown that low-penetrant germline mutations in p53 pathway genes can directly control cancer-related processes, including p53 activity and response to chemotherapies (14). Moreover, the Pan-Cancer Analysis of Whole Genomes (PCAWG) study found that 17% of all patients have rare germline variants associated with cancer (15). It is now becoming apparent that quantifying the contribution of low-penetrance but high-frequency inherited mutations could further improve our understanding on how inherited mutations mediate cancer risk and tumorigenesis.
Heritability analysis provides the statistical framework to estimate the contribution of all common SNPs to cancer risk regardless of their statistical significance and effect size (16). Studying heritability is now becoming a crucial step in cancer GWAS and has provided insights on the risk of developing many malignancies (17), including prostate (18), cervical (19), testicular germ cell tumor (20), and breast cancer (21, 22).
However, because the functional impact of the SNPs is context-dependent (23), it is important to quantify the amount of heritability explained by genomic regions associated with well-characterized biological functions (24, 25). Recently, the PCAWG study has shown that driver mutations are mostly located in protein-coding rather than regulatory regions (26), albeit few mutations in cis-regulatory regions, such as the TERT promoter, can still mediate cancer phenotypes. Thus, we reasoned that estimating the heritability of SNPs in protein-coding genes and proximal regulatory regions could provide novel insights into the etiology of this disease. However, developing analytic methods for estimating heritability at the gene level has been challenging, and current methods allow only the estimation of heritability for large functional regions or SNP categories, such as histone marks or expression quantitative trait loci (eQTL; refs. 25, 27).
Here, we developed a new method, called Bayesian Gene Heritability Analysis (BAGHERA), which, to the best of our knowledge, is the first method to enable heritability analysis both at genome-wide and at gene-level resolution. We performed extensive simulations to validate the robustness of BAGHERA estimates and assess whether our method was prone to false discoveries. Comparison with other state-of-the-art methods (27, 28) clearly showed that BAGHERA provides significantly more accurate heritability estimates for traits with heritability lower than 10%, such as cancer.
We then used BAGHERA to analyze all the 38 histologically different malignancies reported in the UK BioBank cohort (29). Our genome-wide heritability analysis showed that SNPs account for at least 10% of the heritable risk of 14 tumors, including late onset malignancies, such as prostate and bladder, which are not thought to be driven by high-frequency inherited mutations. We then used gene-level heritability analysis to build a panel of 1,146 genes, called cancer heritability genes (CHG), that have a significant contribution to the heritability of at least one cancer. Interestingly, a significant proportion of CHGs are known tumor suppressors or are directly involved in the hallmark processes controlling the transformation from normal to cancer cells.
Our study provides new methods to analyze GWAS data and genetic evidence of a causal role for high-frequency inherited mutations in cancer.
Materials and Methods
Estimation of heritability at the gene level
Narrow sense heritability, |{h^2}$|, is defined as the amount of phenotype variance explained by additive genetic effects. GWAS provide unique opportunities to study heritability of many diseases; in particular, with the advent of high-density arrays, where more than 500,000 SNPs are genotyped, the heritability explained by these variants, |h_{SNP}^2$|, represents a reasonable estimate for |{h^2}$|.
Our goal is to identify the amount of |h_{SNP}^2$| explained by a protein-coding gene and its proximal regulatory regions. To obtain unbiased heritability estimates and control the number of false positives, we require SNPs to be uniquely assigned to genes.
Hereby, we denote as genome-wide heritability the amount of heritability explained by all genotyped SNPs, M, whereas we refer to the amount of heritability explained by the SNPs in a gene as gene-level heritability. In a model where each SNP has equal contribution to the genome-wide heritability, the per-SNP heritability is |{\hbar ^2}\ = \ h_{SNP}^2/M$|. Conversely, if variants can have varying contribution to the genome-wide heritability, we can model the per-SNP heritability as a random variable, |\hbar _M^2$|, whose expectation is |\hbar _M^2\ = \ {\rm{E}}{[ {\hbar _j^2} ]_{j\ = \ 1, \cdots ,M}}$|, where M denotes the number of SNPs used to average the per-SNP contribution to heritability.
We hereby demonstrate that the genome-wide heritability can be expressed as the sum of the gene-level contribution and that the per-SNP genome-wide heritability is the expectation of the per-SNP gene-level heritability. Let K be the number of nonoverlapping genes in the human genome, each of them with |{M_k}$| SNPs, the genome-wide heritability can be expressed as |h_{SNP}^2\ = \ \mathop \sum \limits_{k\ = \ 1}^K \mathop \sum \limits_{j \in k} \hbar _j^2\ = \ \mathop \sum \limits_{k\ = \ 1}^K {M_k}\hbar _{{M_k}}^2$|, where |{M_k}\hbar _{{M_k}}^2$| is the amount of heritability explained by all the SNPs in the k-th gene. Thus, let the number of SNPs in each gene and the gene-level per-SNP heritability be independent random variables, it is straightforward to prove that the expectation of the gene-level per-SNP heritability is the per-SNP genome-wide estimate |h_{SNP}^2/M\ = \ E{[ {\hbar _{{M_k}}^2} ]_K}$|. However, estimating |h_{SNP}^2$| only from SNPs assigned to genes would lead to biased estimates, because the contribution of the SNPs in intergenic regions would be neglected; thus, SNPs outside genic regions are assigned to a single intergenic locus, such that the heritability is correctly estimated from all genotyped SNPs.
A hierarchical Bayesian model for heritability estimation
The estimation of heritability can be modeled as a hierarchical Bayesian regression problem, which provides a robust approach to simultaneously estimate the genome-wide heritability, |h_{SNP}^2$|, and the gene-level heritability, |h_k^2$|, from the observed data Y. Our base Bayesian regression model can be defined as follows:
where |{F_1},{F_2},{\rm{and\ }}{F_3}$| are suitable distributions.
SNP heritability, |h_{SNP}^2$|, is the ratio of the variance of the additive genetic effects, |\sigma _g^2$|, and the phenotypic variance, |\sigma _P^2$|. Let |\sigma _P^2\ = \ \sigma _g^2 + \sigma _e^2$|, where |\sigma _e^2$| are the nonadditive and environmental effects, these quantities can be modeled as random variables with |\sigma _g^2 \sim \Gamma ( {\alpha ,\theta } )$| and |\sigma _e^2 \sim \Gamma ( {\beta ,\theta } )$|, respectively. Because |\Gamma ( {\alpha ,\theta } )/( {\Gamma ( {\alpha ,\theta } ) + \Gamma ( {\beta ,\theta } )} ) \sim$| |{\rm{Beta}}( {\alpha ,\beta } )$|, a suitable distribution for |{F_1}$|, in Eq. A, would be an uninformative Beta distribution, e.g., |{\rm{Beta}}( {1,1} )$|. In practice, the use of a Beta distribution as prior for |h_{SNP}^2$| allows us to obtain accurate heritability estimates in the unit range even for low-heritability diseases, where classical methods are usually inaccurate (28).
The gene-level heritability, |h_k^2$|, can be modeled as a random variable following a Gamma distribution with shape |\alpha \ = \ h_{SNP}^2$| and rate |\beta \ = \ 1$|. It is worth noting that |h_k^2/M$| is the per-SNP heritability of gene k, whereas the amount of heritability explained by the gene is |{M_k}( {h_k^2/M} )$|, where |{M_k}$| are the SNPs in gene k. While theoretically the Gamma distribution is unbounded, in practice, for |{M_k} \ll M$|, the likelihood of obtaining an estimate |h_k^2$| s.t. |{M_k}( {h_k^2/M} ) \gt 1$| is negligible. Therefore, for |{F_2}\ = \ {\rm{\Gamma }}( {h_{SNP}^2,1} )$|, the expectation would be |h_{SNP}^2$|, which is an unbiased estimator of the genome-wide heritability.
Finally, our model requires a suitable estimator to regress |h_k^2$| from the observed data. Recently, many methods have been proposed to estimate heritability from GWAS data (30); however, the vast majority requires genotype data, which are both difficult to obtain, due to privacy concerns, and computationally taxing to analyze, because of high dimensionality. Thus, we adopted the LD-score (LDsc) regression model (28), which allows estimation of heritability from GWAS summary statistics, such as regression coefficients and standard errors, which are readily available (12).
Thus, for |{F_3}$|, we rewrote the LDsc model to estimate gene-level heritability, from summary statistics of M SNPs in a GWAS with N subjects, as follows:
where |\chi _{jk}^2$| and |{l_j}$| are the |{\chi ^2}$| statistic and LD score associated with SNP j in gene k, respectively. The LD score is a quantity defined as |{l_j}\ = \ \mathop \sum \limits_z r_{jz}^2$|, where |r_{jz}^2$| is the linkage disequilibrium between variant j and variant z within a certain genomic window (e.g., 1 Mb) in a given population. Importantly, LD scores can be conveniently computed from large-scale genetic studies, such as the 1000 Genomes project.
Finally, setting the standard deviation to the LD score of the j-th SNP allows us to control for heteroskedasticity of the test statistics due to linkage disequilibrium, somehow similar to the weighting scheme used in LDsc, and a term e accounting for confounding biases, which is modeled using an uninformative normal prior.
BAGHERA software
We implemented our hierarchical model (see Eq. C) as part of the BAGHERA software, which allows simultaneous estimation of genome-wide and gene-level heritability, also called heritability loci, which are genes and proximal regulatory regions with a per-SNP heritability higher than the genome-wide estimate (see Fig. 1). Because fitting the Beta–Gamma model is computationally taxing, we relaxed our requirements by modeling |h_k^2$| as a random variable following a Normal distribution whose mean is the genome-wide heritability, |h_{SNP}^2$|, and the standard deviation is controlled by an uninformative Inverse-Gamma prior. Although this formulation might provide gene-level heritability estimates outside the unit domain, we found this problem to be well controlled in practice.
BAGHERA workflow. Here, we show the four steps required to run gene-level heritability analysis with BAGHERA. A, In the preprocessing step, SNP summary statistics are retrieved, and genes are processed, such that a multigene locus is created when two or more genes are overlapping. B, SNPs are assigned to the closest gene locus within 50 kb. For example, the SNP marked with a star is within 50 kb from both D;E and F, but it is assigned to locus F, which is closer. SNPs farther than 50 kb from any gene locus are considered intergenic. C, BAGHERA uses the No U-Turn Sampler (NUTS; left) to fit our hierarchical Bayesian model to estimate genome-wide and gene-level heritability. The sampler estimates the posterior distributions of the heritability terms (right) and evaluates the indicator function to identify loci explaining a significant amount of heritability. When η > 0.99, the locus is considered significant. D, Finally, results are saved into CSV format to facilitate downstream analyses. It is worth noting that |h_{SNP}^2$| is the estimate for genome-wide heritability, and it is calculated for the malignancy rather than per-locus.
BAGHERA workflow. Here, we show the four steps required to run gene-level heritability analysis with BAGHERA. A, In the preprocessing step, SNP summary statistics are retrieved, and genes are processed, such that a multigene locus is created when two or more genes are overlapping. B, SNPs are assigned to the closest gene locus within 50 kb. For example, the SNP marked with a star is within 50 kb from both D;E and F, but it is assigned to locus F, which is closer. SNPs farther than 50 kb from any gene locus are considered intergenic. C, BAGHERA uses the No U-Turn Sampler (NUTS; left) to fit our hierarchical Bayesian model to estimate genome-wide and gene-level heritability. The sampler estimates the posterior distributions of the heritability terms (right) and evaluates the indicator function to identify loci explaining a significant amount of heritability. When η > 0.99, the locus is considered significant. D, Finally, results are saved into CSV format to facilitate downstream analyses. It is worth noting that |h_{SNP}^2$| is the estimate for genome-wide heritability, and it is calculated for the malignancy rather than per-locus.
BAGHERA predicts heritability genes by computing the posterior distribution of |{\eta _k} \sim I( {h_k^2 \,\gt \,h_{SNP}^2} )$|, where I is a function that returns 1 if the evaluated condition is true, and 0 otherwise. The expectation of the posterior distribution of |{\eta _k}$|, |{\rm{E}}[ {{\eta _k}} ]$|, is the probability of the heritability of a gene k of being higher than the genome-wide estimate; specifically, we report as heritability genes, those with |{\rm{E}}[ {{\eta _k}} ] \,\gt \,0.99$|. For each gene, we also report effect sizes in terms of fold change with respect to the genome-wide heritability estimate, as |f{c_k}\ = \ h_k^2/h_{SNP}^2$|.
We use the No-U-Turn Sampler as implemented in PyMC 3.4 (31) to fit the model, using 4 chains with 104 sweeps each and a burnin step consisting of 2,000 samples. Convergence of the sampling process was assessed based on the Gelman–Rubin convergence criterion.
BAGHERA is released as a Python software package under MIT license, and it is available on GitHub (https://github.com/stracquadaniolab/baghera), as a package on Anaconda, and as a Docker image. BAGHERA also implements the Beta–Gamma model described in the previous section, called BAGHERA-Γ. Alongside the source code, we also provide a Snakemake workflow (https://github.com/stracquadaniolab/workflow-baghera) to run the pipeline presented in our study.
UK BioBank summary statistics processing and curation
We used summary statistics of the UK BioBank GWAS for cancers classified using the ICD10 disease classification (source: https://nealelab.github.io/UKBB_ldsc/); importantly, data are uniformly processed with state-of-the-art methods, which prevents any methodologic bias. Here, we developed a custom pipeline to assign LD scores to SNPs, and SNPs to human genes (see Fig. 1). Specifically, we used precomputed LD scores for SNPs on autosomal chromosomes with minor allele frequency |{\rm{MAF}} \gt 0.01$| in the European population (EUR) of the 1000 Genomes project. We then removed the SNPs on chr6:26,000,000–34,000,000, because this region contains the major histocompatibility complex that have unusual genetic patterns and is known to affect GWAS result interpretation (25, 32). Ultimately, our analysis is conducted on 1,285,620 SNPs over 22 chromosomes.
We then used Gencode v31 to determine the genomic coordinates of protein coding genes in the GRCh37 human genome. First, we merged overlapping genes by creating a new multigene locus, whose name denotes the overlapping genes and whose boundaries are defined as the first and last base-pair of these loci. We then assigned to a locus all SNPs within or no more than |\pm 50$| kb away from its boundaries (Fig. 1); this strategy allows us to account for cis-regulatory elements while retaining gene-level resolution. All other SNPs are assigned to the intergenic locus. Overall, 55% of SNPs were mapped to a locus, while the rest of them are assigned to the intergenic term. Finally, to mitigate false positives due to poorly genotyped regions, we considered only gene-loci harboring at least 10 variants. Ultimately, our dataset consists of 15,025 loci; 12,042 (80.1%) of them are harboring more than 10 SNPs, which were considered in our heritability study. The results of our analyses are deposited in CSV format on Zenodo (doi: 10.5281/zenodo.3968269).
Enrichment analyses
We used a one-tailed Fisher exact test for all enrichment analyses, with P values adjusted using the Benjamini–Hochberg procedure, because we are interested in testing whether genes associated with a given category (e.g., molecular function, gene panel) are overrepresented in our set of significant heritability loci. Importantly, because loci in our analysis might represent overlapping protein-coding regions, we postprocessed our gene lists by converting each multi-gene locus into the set of its genes. For the gene ontology (GO) analysis, we used a GO slim annotation to obtain a high-level view of the processes and functions mediated by a set of genes. All external datasets, with their respective date of download, are detailed in the Supplementary Methods.
Results
Simulations assessing robustness of genome-wide and gene-level estimates for low heritability traits
We performed extensive testing of our method on simulated data to assess (i) the robustness of genome-wide estimates for low heritability traits and (ii) the false discovery rate (FDR) associated with gene-level predictions. All our datasets were calibrated to simulate low heritability traits (|h_{SNP}^2 \le 0.5$|), which is a reasonable assumption for cancer. We generated genotype data for M = 100,000 SNPs of N = 50,000 subjects using haplotypes of chromosome 1 from European populations under different heritability models (see Supplementary Methods).
Our analyses show that BAGHERA provides robust unbiased genome-wide estimates (see Supplementary Methods); interestingly, while extreme values of gene-level heritability might affect genome-wide estimates, we found that BAGHERA returns correct estimates both as the median of the posterior genome-wide heritability distribution and as the sum of gene-level heritability contributions.
We then assessed whether BAGHERA was able to identify heritability loci, that is loci harboring SNPs with a contribution to heritability higher than expected under a constant per-SNP heritability contribution. To do that, we selected 1% of the loci on chromosome 1 (≈ 13) as heritability loci and computed receiver operator characteristic (ROC) and precision recall (PR) curves at varying levels of genome-wide heritability (see Supplementary Methods). For all curves, we evaluated the area under the curve (AUC). Here, we found that BAGHERA correctly identified heritability loci (ROC AUC: 0.89), although precision and recall were consistently higher for higher genome-wide heritability levels (PR AUC: 0.41 for |{h^2}\ = \ 0.01$|, >0.58 for |{h^2} \gt 0.01$|; Supplementary Figs. S1 and S2A–S2C).
However, our simulated datasets have a main limitation; because simulating genotype data is a computationally taxing task, we restricted the number of simulated SNPs to M ≈ 100,000 SNPs from a single chromosome, whereas more than 1 M are routinely genotyped in modern studies.
We addressed this limitation by simulating summary statistics using only linkage disequilibrium information (see Supplementary Methods). This approach provides a tractable framework to test varying levels of heritability enrichment, reported in terms of fold change with respect to the genome-wide estimate, and to simulate SNPs across the entire genome, rather than a single chromosome.
We then assessed the performance by computing ROC and PR curves, the true positive rate (TPR), and the FDR. BAGHERA correctly identifies heritability loci, even with fold changes in heritability as low as |{f_c}\ = \ 5$| (ROC AUC range: 0.70–0.99). Importantly, we found BAGHERA to be conservative with a low FDR across all scenarios (FDR range: 0%–5%); this result suggests that our method is suitable for exploratory analyses, and that significant results are associated to true biological signal (see Supplementary Figs. S3–S5).
Comparison with state-of-the-art methods for genome-wide and local heritability estimation
To the best of our knowledge, BAGHERA is the first method specifically designed to analyze low heritability traits and to provide heritability estimates with gene-level resolution. However, because our method can estimate both genome-wide and local heritability with gene-level resolution, we decided to compare its performance to state-of-the-art methods designed to estimate genome-wide and local heritability.
Genome-wide estimates were compared with LD score regression (LDsc) results (28). Gold-standard methods require raw data; however, previous studies have shown that LDsc has comparable performance in most scenarios (22). Because LDsc is routinely used to estimate heritability for the traits in the UK BioBank, we retrieved the results for all 38 cancers and compared them with BAGHERA estimates. We found strong consensus between the estimates of the two methods (see Supplementary Fig. S6), consistent with the fact that BAGHERA uses a similar genome-wide estimator. Nonetheless, BAGHERA is more robust for low heritability traits, because our Bayesian formulation guarantees correct heritability estimates in the unit domain, whereas LDsc incorrectly provides negative values.
Performances on local heritability analysis were compared with the heritability estimation from summary statistics (HESS) method (27), which is the only available approach to estimate local heritability from summary statistics. Here, we used BAGHERA to estimate the heritability of 1703 regions, as defined in the HESS original study (see Supplementary Methods). We then restricted our analysis to breast and prostate cancer data, because these malignancies are those with the highest |h_{SNP}^2$| estimates; this was necessary to ensure a fair comparison between the two methods, because HESS is not designed for low heritability traits. Here, we found a statistically significant correlation between HESS and BAGHERA estimates (Pearson ρ : 0.76 for prostate and 0.78 for breast, see Supplementary Figs. S7 and S8). However, because BAGHERA provides robust estimates for as much as 15,000 regions, it enables more detailed analyses compared with HESS.
Taken together, we have shown that BAGHERA provides robust estimates for low heritability traits and can identify loci with heritability enrichment up to gene-level, which represent a 10-fold increase in genomic resolution compared with existing methods.
Genome-wide estimates of cancer heritability in the UK Biobank
We used BAGHERA to analyze 38 cancers in the UK Biobank (29), a large-scale prospective study aiming at systematically screening and phenotyping more than 500,000 individuals, with a reported age at the assessment centre ranging between 37 and 73 years.
We obtained summary statistics for N = 361,194 individuals (see Table 1), including subjects whose tumors were histologically characterized according to the ICD10 classification, where malignant neoplasms are identified with codes ranging from C00 to C97 (see Supplementary Methods). The number of cases varies significantly across cancers, ranging from 102 individuals, for malignant neoplasm of base of tongue (C01), to 9086 individual, for other malignant neoplasms of the skin (C44). In this cohort, cancer prevalence ranges between 0.29% and 2.51%, with higher estimates for common malignancies in European populations, such as breast and prostate cancer (33).
Genome-wide heritability of the 38 cancers in the UK BioBank.
ICD10 . | Malignancy . | Cases . | Prevalence . | |{ \hat{\bi \chi }^{\bf 2}}$| . | |{\bi h}_{\bi SNP}^{\bf 2}$| . | |{\bi h}_{{\bi SN{P_L}}}^{\bf 2}$| . | HL . |
---|---|---|---|---|---|---|---|
C44 | Other malignant neoplasms of skin | 9,086 | 0.0252 | 1.1408 | 0.0341 | 0.2422 | 422 |
C50 | Malignant neoplasm of breast | 8,304 | 0.0230 | 1.0869 | 0.0170 | 0.1285 | 267 |
C61 | Malignant neoplasm of prostate | 4,342 | 0.0120 | 1.0765 | 0.0191 | 0.2320 | 271 |
C18 | Malignant neoplasm of colon | 2,226 | 0.0062 | 1.0399 | 0.0070 | 0.1416 | 33 |
C43 | Malignant melanoma of skin | 1,672 | 0.0046 | 1.0288 | 0.0051 | 0.1293 | 52 |
C15 | Malignant neoplasm of esophagus | 519 | 0.0014 | 1.0236 | 0.0035 | 0.2296 | 24 |
C67 | Malignant neoplasm of bladder | 1,554 | 0.0043 | 1.0222 | 0.0047 | 0.1254 | 39 |
C34 | Malignant neoplasm of bronchus and lung | 1,427 | 0.0040 | 1.0208 | 0.0035 | 0.1010 | 17 |
C20 | Malignant neoplasm of rectum | 1,118 | 0.0031 | 1.0130 | 0.0031 | 0.1091 | 15 |
C62 | Malignant neoplasm of testis | 221 | 0.0006 | 1.0120 | 0.0024 | 0.3158 | 29 |
C71 | Malignant neoplasm of brain | 368 | 0.0010 | 1.0116 | 0.0030 | 0.2578 | 19 |
C45 | Mesothelioma | 150 | 0.0004 | 1.0110 | 0.0012 | 0.2213 | 5 |
C91 | Lymphoid leukemia | 349 | 0.0010 | 1.0109 | 0.0018 | 0.1646 | 11 |
C02 | Malignant neoplasm of other and unspecified parts of tongue | 152 | 0.0004 | 1.0106 | 0.0013 | 0.2475 | 23 |
C16 | Malignant neoplasm of stomach | 388 | 0.0011 | 1.0106 | 0.0010 | 0.0868 | 12 |
C83 | Diffuse non-Hodgkin lymphoma | 587 | 0.0016 | 1.0104 | 0.0014 | 0.0824 | 14 |
C82 | Follicular (nodular) non-Hodgkin lymphoma | 320 | 0.0009 | 1.0101 | 0.0031 | 0.3059 | 21 |
C90 | Multiple myeloma and malignant plasma cell neoplasms | 401 | 0.0011 | 1.0092 | 0.0013 | 0.1020 | 15 |
C56 | Malignant neoplasm of ovary | 693 | 0.0019 | 1.0063 | 0.0012 | 0.0616 | 13 |
C54 | Malignant neoplasm of corpus uteri | 988 | 0.0027 | 1.0063 | 0.0008 | 0.0295 | 14 |
C48 | Malignant neoplasm of retroperitoneum and peritoneum | 122 | 0.0003 | 1.0053 | 0.0009 | 0.2064 | 5 |
C64 | Malignant neoplasm of kidney except renal pelvis | 701 | 0.0019 | 1.0043 | 0.0009 | 0.0455 | 10 |
C01 | Malignant neoplasm of base of tongue | 102 | 0.0003 | 1.0043 | 0.0014 | 0.3596 | 10 |
C73 | Malignant neoplasm of thyroid gland | 278 | 0.0008 | 1.0042 | 0.0011 | 0.1254 | 13 |
C49 | Malignant neoplasm of other connective and soft tissue | 222 | 0.0006 | 1.0040 | 0.0017 | 0.2229 | 28 |
C80 | Malignant neoplasm without specification of site | 398 | 0.0011 | 1.0040 | 0.0016 | 0.1300 | 14 |
C53 | Malignant neoplasm of cervix uteri | 192 | 0.0005 | 1.0039 | 0.0005 | 0.0709 | 14 |
C22 | Malignant neoplasm of liver and intrahepatic bile ducts | 189 | 0.0005 | 1.0031 | 0.0009 | 0.1353 | 7 |
C21 | Malignant neoplasm of anus and anal canal | 139 | 0.0004 | 1.0027 | 0.0007 | 0.1436 | 23 |
C85 | Other and unspecified types of non-Hodgkin lymphoma | 762 | 0.0021 | 1.0023 | 0.0013 | 0.0600 | 9 |
C09 | Malignant neoplasm of tonsil | 162 | 0.0004 | 1.0022 | 0.0006 | 0.1009 | 5 |
C92 | Myeloid leukemia | 328 | 0.0009 | 1.0011 | 0.0008 | 0.0764 | 9 |
C17 | Malignant neoplasm of small intestine | 114 | 0.0003 | 1.0007 | 0.0015 | 0.3596 | 12 |
C19 | Malignant neoplasm of rectosigmoid junction | 498 | 0.0014 | 0.9992 | 0.0006 | 0.0390 | 10 |
C25 | Malignant neoplasm of pancreas | 403 | 0.0011 | 0.9991 | 0.0005 | 0.0402 | 12 |
C81 | Hodgkin's disease | 150 | 0.0004 | 0.9989 | 0.0003 | 0.0597 | 5 |
C69 | Malignant neoplasm of eye and adnexa | 137 | 0.0004 | 0.9970 | 0.0004 | 0.0705 | 14 |
C32 | Malignant neoplasm of larynx | 159 | 0.0004 | 0.9914 | 0.0003 | 0.0450 | 7 |
ICD10 . | Malignancy . | Cases . | Prevalence . | |{ \hat{\bi \chi }^{\bf 2}}$| . | |{\bi h}_{\bi SNP}^{\bf 2}$| . | |{\bi h}_{{\bi SN{P_L}}}^{\bf 2}$| . | HL . |
---|---|---|---|---|---|---|---|
C44 | Other malignant neoplasms of skin | 9,086 | 0.0252 | 1.1408 | 0.0341 | 0.2422 | 422 |
C50 | Malignant neoplasm of breast | 8,304 | 0.0230 | 1.0869 | 0.0170 | 0.1285 | 267 |
C61 | Malignant neoplasm of prostate | 4,342 | 0.0120 | 1.0765 | 0.0191 | 0.2320 | 271 |
C18 | Malignant neoplasm of colon | 2,226 | 0.0062 | 1.0399 | 0.0070 | 0.1416 | 33 |
C43 | Malignant melanoma of skin | 1,672 | 0.0046 | 1.0288 | 0.0051 | 0.1293 | 52 |
C15 | Malignant neoplasm of esophagus | 519 | 0.0014 | 1.0236 | 0.0035 | 0.2296 | 24 |
C67 | Malignant neoplasm of bladder | 1,554 | 0.0043 | 1.0222 | 0.0047 | 0.1254 | 39 |
C34 | Malignant neoplasm of bronchus and lung | 1,427 | 0.0040 | 1.0208 | 0.0035 | 0.1010 | 17 |
C20 | Malignant neoplasm of rectum | 1,118 | 0.0031 | 1.0130 | 0.0031 | 0.1091 | 15 |
C62 | Malignant neoplasm of testis | 221 | 0.0006 | 1.0120 | 0.0024 | 0.3158 | 29 |
C71 | Malignant neoplasm of brain | 368 | 0.0010 | 1.0116 | 0.0030 | 0.2578 | 19 |
C45 | Mesothelioma | 150 | 0.0004 | 1.0110 | 0.0012 | 0.2213 | 5 |
C91 | Lymphoid leukemia | 349 | 0.0010 | 1.0109 | 0.0018 | 0.1646 | 11 |
C02 | Malignant neoplasm of other and unspecified parts of tongue | 152 | 0.0004 | 1.0106 | 0.0013 | 0.2475 | 23 |
C16 | Malignant neoplasm of stomach | 388 | 0.0011 | 1.0106 | 0.0010 | 0.0868 | 12 |
C83 | Diffuse non-Hodgkin lymphoma | 587 | 0.0016 | 1.0104 | 0.0014 | 0.0824 | 14 |
C82 | Follicular (nodular) non-Hodgkin lymphoma | 320 | 0.0009 | 1.0101 | 0.0031 | 0.3059 | 21 |
C90 | Multiple myeloma and malignant plasma cell neoplasms | 401 | 0.0011 | 1.0092 | 0.0013 | 0.1020 | 15 |
C56 | Malignant neoplasm of ovary | 693 | 0.0019 | 1.0063 | 0.0012 | 0.0616 | 13 |
C54 | Malignant neoplasm of corpus uteri | 988 | 0.0027 | 1.0063 | 0.0008 | 0.0295 | 14 |
C48 | Malignant neoplasm of retroperitoneum and peritoneum | 122 | 0.0003 | 1.0053 | 0.0009 | 0.2064 | 5 |
C64 | Malignant neoplasm of kidney except renal pelvis | 701 | 0.0019 | 1.0043 | 0.0009 | 0.0455 | 10 |
C01 | Malignant neoplasm of base of tongue | 102 | 0.0003 | 1.0043 | 0.0014 | 0.3596 | 10 |
C73 | Malignant neoplasm of thyroid gland | 278 | 0.0008 | 1.0042 | 0.0011 | 0.1254 | 13 |
C49 | Malignant neoplasm of other connective and soft tissue | 222 | 0.0006 | 1.0040 | 0.0017 | 0.2229 | 28 |
C80 | Malignant neoplasm without specification of site | 398 | 0.0011 | 1.0040 | 0.0016 | 0.1300 | 14 |
C53 | Malignant neoplasm of cervix uteri | 192 | 0.0005 | 1.0039 | 0.0005 | 0.0709 | 14 |
C22 | Malignant neoplasm of liver and intrahepatic bile ducts | 189 | 0.0005 | 1.0031 | 0.0009 | 0.1353 | 7 |
C21 | Malignant neoplasm of anus and anal canal | 139 | 0.0004 | 1.0027 | 0.0007 | 0.1436 | 23 |
C85 | Other and unspecified types of non-Hodgkin lymphoma | 762 | 0.0021 | 1.0023 | 0.0013 | 0.0600 | 9 |
C09 | Malignant neoplasm of tonsil | 162 | 0.0004 | 1.0022 | 0.0006 | 0.1009 | 5 |
C92 | Myeloid leukemia | 328 | 0.0009 | 1.0011 | 0.0008 | 0.0764 | 9 |
C17 | Malignant neoplasm of small intestine | 114 | 0.0003 | 1.0007 | 0.0015 | 0.3596 | 12 |
C19 | Malignant neoplasm of rectosigmoid junction | 498 | 0.0014 | 0.9992 | 0.0006 | 0.0390 | 10 |
C25 | Malignant neoplasm of pancreas | 403 | 0.0011 | 0.9991 | 0.0005 | 0.0402 | 12 |
C81 | Hodgkin's disease | 150 | 0.0004 | 0.9989 | 0.0003 | 0.0597 | 5 |
C69 | Malignant neoplasm of eye and adnexa | 137 | 0.0004 | 0.9970 | 0.0004 | 0.0705 | 14 |
C32 | Malignant neoplasm of larynx | 159 | 0.0004 | 0.9914 | 0.0003 | 0.0450 | 7 |
Note: For each cancer, we report the number of cases, the prevalence in the cohort, the average |{\chi ^2}$| of the SNPs considered in the GWAS analysis (|{\hat{\chi }^2}$|), the genome-wide estimates of heritability, both on the observed (|h_{SNP}^2$|) and the liability (|h_{SN{P_L}}^2$|) scale, and the number of heritability loci (HL) reported by BAGHERA as significant for |\eta\ \gt\ 0.99$|. In bold, we denote the 16 cancers that we used for the downstream analysis and functional characterization.
Estimating heritability from nontargeted cohorts can be challenging, due to the small prevalence of the disease. To test whether we had sufficient signal for each cancer, we reasoned that if the SNP test statistic follows a |{\chi ^2}$| distribution with 1 degree of freedom, under the null hypothesis of no association, its expected value is |E[ {{\chi ^2}} ]\ = \ 1$|; thus, similarly to other studies, we expected to have sufficient polygenic signal for our analysis if the average |{\chi ^2}$| was greater than 1 (25). Here, we found the vast majority of cancers to have an average |{\chi ^2} \approx 1$|, with only 17 having a deviation greater than 1% from the expected value of the test statistic. We also did not consider cancers assigned to other malignant neoplasm of the skin (C44), because (i) most tumors belong to unspecified anatomic regions (C44.3, C44.9); (ii) are predominantly caused by sun exposure in Europeans; and (iii) and includes poorly characterized rare skin cancers. Ultimately, we restricted our study to 16 cancers for which we had sufficient power to perform our analysis. Nonetheless, all our results are consistent with those we obtained when considering all 38 cancer types (see Supplementary Figs. S9–S12D and S13A–S13C; Supplementary Tables S1–S3).
We then estimated genome-wide heritability of each cancer by computing the median of the posterior distribution of |h_{SNP}^2$| and transforming this value on to the liability scale, |h_{SN{P_L}}^2$|, to obtain estimates independent from prevalence and comparable across malignancies. We found cancer heritability to be |h_{SN{P_L}}^2\ = \ 14.7{\rm{\% }}$| on average, ranging from 8% for non-Hodgkin lymphoma and up to 31% for testis (see Table 1) consistent with other available estimates for this cohort (see Supplementary Materials and Supplementary Figs. S14, 15A–S15D, and S16A–S16C; Supplementary Table S4). While comparison between cancer heritability estimates is usually difficult across studies, due to differences in histologic classification and genetic confounders, we found our heritability estimates on the liability scale to be consistent with those reported for other cohorts, in particular for breast, prostate, testes, and bladder (17, 18, 20, 34). The heritability of testicular cancer is the highest among all malignancies (|h_{SN{P_L}}^2\ = \ 0.3158$|), consistent with the hypothesis that germline variants have stronger effects in early onset and young adult cancers. However, early onset cancers are underrepresented in the UK Biobank, because children and young adults were not enrolled in the study, and thus, an accurate estimation of the correlation between age of onset and heritability is not possible. Nonetheless, it is interesting to note that many malignancies with onset in late adulthood, such as prostate or bladder, still display a significant heritable component, ranging from |h_{SN{P_L}}^2\ = \ 0.25$| for brain tumors (age of onset:59) to |h_{SN{P_L}}^2\ = \ 0.08$| for diffuse non-Hodgkin lymphoma (age of onset: 60). Overall, 14 of 16 cancers (87%) show heritability higher than 10%, suggesting a consistent contribution of SNPs to the heritable risk of cancer.
Heritability loci across 16 malignancies
We identified 783 heritability loci (η > 0.99), harboring 1,146 protein-coding genes, across 16 cancers (see Fig. 2), with 53 heritability loci per malignancy on average, ranging from 5 loci in mesothelioma, to 271 loci for prostate (see Table 1; Fig. 3A); here, we are using the term heritability loci when referring to the nonoverlapping genomic regions tested by BAGHERA, which might also include multigene loci. Gene-level heritability across the selected 16 cancers has a long-tail distribution (Fig. 3B), with a median 16-fold increase compared with the genome-wide estimate, ranging from 4.4-fold for the phosphodiesterase 4D (PDE4D) gene locus to 276-fold for the fibroblast growth factor receptor 2 (FGFR2) gene locus in breast cancer. Interestingly, 87% of heritability loci show per-SNP heritability 10-fold higher than the genome-wide estimate. Only 3 loci have fold changes below 5 and more than 99% of loci with fold changes below 10 are found in the breast and prostate datasets, which have |h_{SNP}^2 \,\gt \,0.01$|. On the basis of our simulations, our set of heritability loci is expected to have a limited number of false positives.
Cancer heritability loci across the human genome. For each chromosome, we report all cancer heritability loci with heritability enrichment in the top 1%. In case of a multigene locus, we report only the first gene name of the locus.
Cancer heritability loci across the human genome. For each chromosome, we report all cancer heritability loci with heritability enrichment in the top 1%. In case of a multigene locus, we report only the first gene name of the locus.
Heritability loci across cancers in the UK Biobank. A, For each malignancy, we report the observed heritability (|\textstyle {h_{SNP}^2,}$|left box), the heritability on the liability scale (|\textstyle {h_{SN{P_L}}^2}$|, dark barplot, between 0 and 0.5), the percentage of |h_{SNP}^2$| explained by heritability loci (middle barplot; the percentage explained by heritability loci (HL) is highlighted with a darker shade), and the number of heritability loci (right barplot). B, Gene-level heritability distribution across heritability loci, expressed as fold change with respect to the genome-wide estimate. The x-axis is bound to the minimum and maximum values of fold change. We highlighted the top locus (FGFR2) and the median (15.9) fold change across all cancers. C, Percentage of cancer heritability loci associated with multiple cancers. Approximately 8% of heritability loci are common to multiple malignancies. D, Cancer heritability loci associated with multiple cancers. We report the 59 heritability loci common to at least two cancers; here, the size of the dot is proportional to the fold change of the locus in the specific cancer.
Heritability loci across cancers in the UK Biobank. A, For each malignancy, we report the observed heritability (|\textstyle {h_{SNP}^2,}$|left box), the heritability on the liability scale (|\textstyle {h_{SN{P_L}}^2}$|, dark barplot, between 0 and 0.5), the percentage of |h_{SNP}^2$| explained by heritability loci (middle barplot; the percentage explained by heritability loci (HL) is highlighted with a darker shade), and the number of heritability loci (right barplot). B, Gene-level heritability distribution across heritability loci, expressed as fold change with respect to the genome-wide estimate. The x-axis is bound to the minimum and maximum values of fold change. We highlighted the top locus (FGFR2) and the median (15.9) fold change across all cancers. C, Percentage of cancer heritability loci associated with multiple cancers. Approximately 8% of heritability loci are common to multiple malignancies. D, Cancer heritability loci associated with multiple cancers. We report the 59 heritability loci common to at least two cancers; here, the size of the dot is proportional to the fold change of the locus in the specific cancer.
Interestingly, heritability loci represent less than 1% of all the loci in the genome, but they are significantly more than those harboring genome-wide significant SNPs (see Supplementary Material, Supplementary Figs. S17 and S18, and Supplementary Table S5); this result is consistent with cancer being polygenic. Although we identified a polygenic signal, heritability loci account for up to 38% of all the heritable risk (breast cancer), suggesting that a significant amount of heritability could be explained by only few loci across the genome (Fig. 3A). Consistent with our hypotheses, when we looked at the contribution of SNPs in intergenic regions, we did not find any heritability enrichment.
We then tested whether heritability loci were shared among multiple cancers to identify any potential genomic hotspot for pan-cancer heritability. We found that only 59 (≈ 8%) of the 783 heritability loci show a significant heritability enrichment in at least 2 cancers, and 8 (≈ 1%) in 3 or more (Fig. 3C and D). This observation is consistent with results from tumor sequencing studies, which have shown that pleiotropic effects are limited to few master regulators, such as TP53 (35). Nonetheless, after performing literature curation, we found evidence for a cancer-mediating role for 7 of the 11 unique protein coding genes found in at least 3 cancers (see Supplementary Table 6), including 4 genes (CLPTM1L, APAF1, THADA, AGBL1) involved in apoptosis and 3 genes (PCDH15, DLG2, POU5F1B) involved in cell division, migration, and tumorigenesis (36, 37). It is important to note that the cisplatin resistance-related protein 9 (CLPTM1L) is the heritability locus found in most cancers (4/16) and is one of the gene in the 5p15.33 locus (the other being TERT), which has been consistently associated with different cancer types (38).
Taken together, our analysis found 783 loci, harboring 1,146 protein-coding genes, having a significant contribution to the heritable risk of at least 1 cancer. We denoted these 1,146 genes as CHGs.
CHGs are recurrently mutated in tumors
Tumor sequencing projects, including The Cancer Genome Atlas program and the PCAWG project, have identified a number of driver genes, which promote tumorigenesis when acquiring a somatic mutation.
There is also increasing evidence that genes harboring germline and somatic mutations can mediate cancer phenotypes (14, 39); thus, we tested whether CHGs are significantly enriched among known cancer driver genes. To do that, we obtained a curated list of driver genes using the COSMIC Cancer Gene Census (Supplementary Table 7). Interestingly, we found that a significant proportion of CHGs, 60 of 1,146 (≈ 5%), are also known cancer driver genes (|{\rm{OR}}\ = \ 1.75$|; |P:1.3 \times {10^{ - 4}}$|). These genes include members of the p53 pathway, such as the cyclin-dependent kinase inhibitor 2A (CDKN2A), the tumor protein 63 (TP63), and MDM4 regulator of p53 (MDM4), as well as genes mutated across multiple types of cancer, including FGFR2 and the anaplastic lymphoma kinase (Ki-1; ALK) gene (Fig. 4A and B).
Functional characterization of cancer heritability genes. A, List of CHGs reported as cancer driver genes across multiple annotations. With the blue hue (first three columns), we report the genes annotated by OncoKB, specifying whether they are tumor suppressors (TSG) or oncogenes (OG). With red and orange, 4-th and 5-th columns, we report the genes that are included in the COSMIC annotation as drivers and whether the reported mutation is somatic and germline. In the last four columns, we annotate each gene to the cancer type for which is denoted as driver in COSMIC. B, Enrichment of CHGs across cancer driver genes annotations; here, we report OncoKB (purple), COSMIC database (light blue), different cancer driver sets (dark blue), and other sets (green) like DNA repair genes and known actionable targets. Stars indicate statistical significance, with multiple terms having P <10−4. C, Gene ontology enrichment analysis using Fisher exact test. For each significant term, we report the OR (x-axis) and −log10 (FDR; color gradients). D, CHGs associated with the hallmark of cancers; genes in darker gray are tumor suppressors. Each gene is connected to the hallmarks that it mediates according to the Cancer Gene Census. E, Tumor suppressor and oncogene CHGs across cancers. For each cancer type (y-axis), we report the number of genes (x-axis) reported as tumor suppressors (TSG) and/or oncogenes in OncoKB (color codes, cancer genes are known to be drivers, but their specific role is not reported).
Functional characterization of cancer heritability genes. A, List of CHGs reported as cancer driver genes across multiple annotations. With the blue hue (first three columns), we report the genes annotated by OncoKB, specifying whether they are tumor suppressors (TSG) or oncogenes (OG). With red and orange, 4-th and 5-th columns, we report the genes that are included in the COSMIC annotation as drivers and whether the reported mutation is somatic and germline. In the last four columns, we annotate each gene to the cancer type for which is denoted as driver in COSMIC. B, Enrichment of CHGs across cancer driver genes annotations; here, we report OncoKB (purple), COSMIC database (light blue), different cancer driver sets (dark blue), and other sets (green) like DNA repair genes and known actionable targets. Stars indicate statistical significance, with multiple terms having P <10−4. C, Gene ontology enrichment analysis using Fisher exact test. For each significant term, we report the OR (x-axis) and −log10 (FDR; color gradients). D, CHGs associated with the hallmark of cancers; genes in darker gray are tumor suppressors. Each gene is connected to the hallmarks that it mediates according to the Cancer Gene Census. E, Tumor suppressor and oncogene CHGs across cancers. For each cancer type (y-axis), we report the number of genes (x-axis) reported as tumor suppressors (TSG) and/or oncogenes in OncoKB (color codes, cancer genes are known to be drivers, but their specific role is not reported).
However, the number of cancer driver genes is extremely variable across malignancies and studies; thus, we tested whether the enrichment of CHGs in cancer driver genes was independent from the cancer driver gene annotation used. To do that, we collected lists of cancer driver genes from multiple studies, including the PCAWG project (15), the Precision Oncology Knowledge Base (OncoKB; ref. 40), Memorial Sloan Kettering Impact and Heme gene panels (41), and the curated list of cancer genes by Vogelstein and colleagues (42). Here, we found that CHGs are significantly enriched in each cancer driver gene annotation analyzed, with an enrichment ranging from OR = 1.55 for the PCAWG annotation to OR = 2.47 for OncoKB tumor suppressors (Supplementary Table S7). Interestingly, we did not find any enrichment of CHGs in genes carrying germline driver mutations; this is consistent with the fact that most germline driver mutations are rare, and thus are unlikely to be genotyped in GWAS studies.
Taken together, we found 60 cancer heritability loci that are also recurrently mutated in multiple tumors; this result suggests that SNPs in CHGs might affect the same biological programs altered by somatic mutations in tumors.
CHGs underpin biological processes affecting tumorigenesis
Our gene-level heritability analysis identified 1,146 genes explaining a significant proportion of the heritable risk of at least 1 cancer. We then showed that CHGs are enriched in known cancer driver genes, suggesting that loci recurrently mutated in tumors also harbor high-frequency inherited mutations that could mediate cancer risk. Thus, we hypothesized that CHGs could be involved in molecular functions and biological processes affecting tumorigenesis.
To do that, we characterized CHGs by GO enrichment analysis (see Table 2). We found a statistically significant enrichment for 21 terms (Fisher exact test; FDR < 10%, Fig. 4C), with an average OR of 1.31 and up to 1.55 for growth. CHGs are genes predominantly involved in biological processes driving cell morphogenesis, differentiation, proliferation, and growth, which include the mammalian target of rapamycin (mTOR) and the Poly [ADP-ribose] polymerase 1 (PARP1) genes. We also observed a significant enrichment of genes associated with cytoskeleton organization and anatomic structure development, which include the Mothers against decapentaplegic homolog 2 (SMAD2) gene.
Gene ontology enrichment analysis of cancer heritability genes.
GO term . | No. CHGs . | OR . | P value . | FDR . |
---|---|---|---|---|
Anatomic structure development | 352 | 1.31 | 0.000044 | 0.006133 |
Kinase activity | 126 | 1.44 | 0.000237 | 0.012169 |
Growth | 84 | 1.55 | 0.000263 | 0.012169 |
DNA metabolic process | 82 | 1.53 | 0.000481 | 0.016723 |
Cytoskeleton organization | 120 | 1.39 | 0.000861 | 0.023924 |
Ion binding | 431 | 1.22 | 0.001248 | 0.028903 |
Biosynthetic process | 361 | 1.21 | 0.002711 | 0.041872 |
Biological_process | 505 | 1.20 | 0.002224 | 0.041872 |
Cell morphogenesis | 81 | 1.43 | 0.002419 | 0.041872 |
Cell proliferation | 146 | 1.30 | 0.003404 | 0.047312 |
Cytoskeleton | 141 | 1.28 | 0.005851 | 0.054216 |
Cellular protein modification process | 275 | 1.21 | 0.004476 | 0.054216 |
Cell–cell signaling | 123 | 1.30 | 0.005097 | 0.054216 |
Peptidase activity | 103 | 1.33 | 0.005513 | 0.054216 |
DNA binding transcription factor activity | 160 | 1.27 | 0.005068 | 0.054216 |
Enzyme binding | 178 | 1.24 | 0.006568 | 0.057059 |
Cell differentiation | 268 | 1.20 | 0.007776 | 0.063577 |
Embryo development | 77 | 1.36 | 0.009437 | 0.069042 |
Cytoskeletal protein binding | 77 | 1.36 | 0.009173 | 0.069042 |
Nucleus | 347 | 1.16 | 0.014507 | 0.097916 |
DNA binding | 174 | 1.21 | 0.014793 | 0.097916 |
GO term . | No. CHGs . | OR . | P value . | FDR . |
---|---|---|---|---|
Anatomic structure development | 352 | 1.31 | 0.000044 | 0.006133 |
Kinase activity | 126 | 1.44 | 0.000237 | 0.012169 |
Growth | 84 | 1.55 | 0.000263 | 0.012169 |
DNA metabolic process | 82 | 1.53 | 0.000481 | 0.016723 |
Cytoskeleton organization | 120 | 1.39 | 0.000861 | 0.023924 |
Ion binding | 431 | 1.22 | 0.001248 | 0.028903 |
Biosynthetic process | 361 | 1.21 | 0.002711 | 0.041872 |
Biological_process | 505 | 1.20 | 0.002224 | 0.041872 |
Cell morphogenesis | 81 | 1.43 | 0.002419 | 0.041872 |
Cell proliferation | 146 | 1.30 | 0.003404 | 0.047312 |
Cytoskeleton | 141 | 1.28 | 0.005851 | 0.054216 |
Cellular protein modification process | 275 | 1.21 | 0.004476 | 0.054216 |
Cell–cell signaling | 123 | 1.30 | 0.005097 | 0.054216 |
Peptidase activity | 103 | 1.33 | 0.005513 | 0.054216 |
DNA binding transcription factor activity | 160 | 1.27 | 0.005068 | 0.054216 |
Enzyme binding | 178 | 1.24 | 0.006568 | 0.057059 |
Cell differentiation | 268 | 1.20 | 0.007776 | 0.063577 |
Embryo development | 77 | 1.36 | 0.009437 | 0.069042 |
Cytoskeletal protein binding | 77 | 1.36 | 0.009173 | 0.069042 |
Nucleus | 347 | 1.16 | 0.014507 | 0.097916 |
DNA binding | 174 | 1.21 | 0.014793 | 0.097916 |
Note: We report the gene ontology terms significantly associated with cancer heritability genes, at 10%. For each term, we report the number of annotated CHGs, the odds ratio, the P value from the Fisher exact test, and the adjusted P value after applying the Benjamini–Hochberg procedure.
Although these molecular processes drive normal cell fate, survival, and proliferation, they are recurrently hijacked by cancer cells to gain growth advantage and spread through the body through metastases (43), a process that is considered an hallmark of cancer. We then tested whether CHGs are associated with any other hallmark of cancer, which are processes, common to all malignancies, controlling the transformation of normal into cancer cells (44). These lists of biological processes include proliferative signaling, suppression of growth, escaping immune response, cell replicative immortality, promoting inflammation, invasion and metastasis, angiogenesis, genome instability and mutation, and escaping cell death. Interestingly, we found 33 CHGs associated with at least one hallmark (|{\rm{OR}}:2.062;P:3 \times {10^{ - 4}}$|). Consistent with our previous analysis, cancer heritability loci are involved in escaping cell death, mediating proliferative signaling, invasion and metastasis (Fig. 4D; Supplementary Table S8). We then went further to understand whether CHGs mediate these cancer processes by acting either as tumor suppressor genes (TSG) or oncogenes (see Fig. 4E). To do that, we used the Precision Oncology Knowledge Base (OncoKB; ref. 40), a curated list of 519 cancer genes, including 197 TSGs, 148 oncogenes, and other cancer genes of unknown function. We found that 27 CHGs are tumor suppressors (OR: 2.47, |P:7.9 \times {10^{ - 5}}$|), whereas 17 are reported as oncogenes (OR: 1.83; P:0.0198), of which, 4 can function both as TSGs and as oncogenes (Fig. 4A, D, and E; Supplementary Tables S7 and S8); importantly, this result has been also confirmed when using the COSMIC Cancer Gene Census TSG annotation (OR: 2.036; |P:2.07 \times {10^{ - 4}}$|). Tumor suppressor CHGs include well-known cancer driver genes, such as CDKN2A and SMAD2, which regulate cell growth, and DNA repair genes, such as MUTYH and FANCA (45).
Taken together, we found evidence that CHGs directly mediate processes underpinning tumorigenesis; interestingly, while we did not observe pleiotropic effects at genomic level, we found that CHGs are involved in biological processes common to all cancers. It is then conceivable that inherited mutations in genes controlling these biological programs could provide a selective advantage to cancer cells, once they acquire a driver somatic mutation. Our results suggest a functional role for CHGs consistent with a two-hit model (46); while inherited mutations associated with oncogene activation are likely to be under purifying selection, mutations in TSGs can be observed at higher frequency because deleterious effects are only observed upon complete loss of function.
Discussion
Our study provides new fundamental evidence demonstrating a strong contribution of high-frequency inherited mutations to the heritable risk of cancer. We found that SNPs account for at least 10% of the heritable risk of 14 malignancies, and their contribution is not only limited to early onset cancers, but also malignancies with a late age of onset, such as bladder and prostate.
We then went further and built a high-resolution map of the heritable cancer genome consisting of 1,146 genes showing a significant contribution to cancer heritability. We then showed that CHGs are responsible for controlling growth, cell morphogenesis, and proliferation, which are fundamental processes required for tumorigenesis. Interestingly, we found that a significant proportion of CHGs (60/1,146) are also recurrently mutated across many tumors, including well-known driver genes such as FGR2, CDKN2A, and SMAD2. Importantly, 27 CHGs are known TSGs, suggesting that SNPs might support cancer by hijacking tumor suppressor functions. Ultimately, our results suggest that inherited mutations in TSGs could create a favorable genetic background for tumorigenesis. It is conceivable that SNPs make normal cells more likely to evade the cell–cell contact inhibition of proliferation, to elude the anatomic constrains of their tissue and to achieve more easily independent motility in the presence of other early oncogenic events; evidence supporting these mechanisms has been recently found in advanced urothelial cancer (47). Thus, combining germline and somatic genetic information of key cancer genes could facilitate the identification of subpopulations of patients at higher risk, differential response to treatment, and risk of relapse. Nonetheless, determining the heritability threshold to justify the integration of genes carrying low-penetrant mutations into clinical cancer genetics will require further investigation.
However, a causal role for many CHGs cannot be ascertained only by genetic analysis and will require further experimental validation. Of particular interest is the subset of CHGs belonging to the Solute carrier (SLC) family (48). SLCs might support cancer metabolism, and polymorphisms in these loci could provide a strong basis for interaction with environmental risk factors such as fats, carcinogens, metal ion deficiencies, and thus could be integrated with future dietary studies, because risk factors may be greater in subgroups of patients.
Obtaining a genomic map with gene-level resolution required the development of a new method we called Bayesian Gene Heritability Analysis (BAGHERA), for estimating heritability of low heritability traits at the gene level; to the best of our knowledge, BAGHERA is the first method to enable heritability analysis with gene-level resolution. We performed extensive simulations to show that our method provides robust genome-wide and gene-level heritability estimates across different genetic architectures and outperforms existing methods when used to analyze low heritability traits, such as cancer.
We also recognize the limitations of our work. Although our method provides accurate estimates of genome-wide heritability, extremely low heritability diseases could lead to negative gene-level heritability estimates; this was a trade-off to ensure reasonable computational efficiency, although a rigorous model is provided as part of our software. Our analysis does not incorporate functional information, such as gene expression or stratified effects for synonymous/nonsynonymous variants, which limits our power of detecting tissue-specific contributions and single causal variants. Finally, because BAGHERA works at single-gene level using summary statistics, analyzing tumors triggered by multihit events might still require genotype data.
Taken together, our study provides new insights on the genetic architecture of cancer with gene-level resolution. We expect that integrating heritability information of cancer genes, along with other cancer heritability genes linked to environmental risk factors and somatic information, will help define more effective early detection and surveillance strategies for the broader population.
Authors’ Disclosures
No disclosures were reported.
Authors' Contributions
V. Fanfani: Conceptualization, data curation, software, formal analysis, validation, investigation, methodology, writing–original draft, writing–review and editing. L. Citi: Validation, methodology, writing–review and editing. A.L. Harris: Validation, investigation, writing–review and editing. F. Pezzella: Validation, investigation, writing–review and editing. G. Stracquadanio: Conceptualization, software, formal analysis, supervision, validation, investigation, methodology, writing–original draft, writing–review and editing.
Acknowledgments
The authors would like to thank Dr. Yongjin Park at the University of British Columbia for the useful discussions about GWAS data simulation.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.