Abstract
To identify rare variants associated with prostate cancer susceptibility and better characterize the mechanisms and cumulative disease risk associated with common risk variants, we conducted an integrated study of prostate cancer genetic etiology in two cohorts using custom genotyping microarrays, large imputation reference panels, and functional annotation approaches. Specifically, 11,984 men (6,196 prostate cancer cases and 5,788 controls) of European ancestry from Northern California Kaiser Permanente were genotyped and meta-analyzed with 196,269 men of European ancestry (7,917 prostate cancer cases and 188,352 controls) from the UK Biobank. Three novel loci, including two rare variants (European ancestry minor allele frequency < 0.01, at 3p21.31 and 8p12), were significant genome wide in a meta-analysis. Gene-based rare variant tests implicated a known prostate cancer gene (HOXB13), as well as a novel candidate gene (ILDR1), which encodes a receptor highly expressed in prostate tissue and is related to the B7/CD28 family of T-cell immune checkpoint markers. Haplotypic patterns of long-range linkage disequilibrium were observed for rare genetic variants at HOXB13 and other loci, reflecting their evolutionary history. In addition, a polygenic risk score (PRS) of 188 prostate cancer variants was strongly associated with risk (90th vs. 40th–60th percentile OR = 2.62, P = 2.55 × 10−191). Many of the 188 variants exhibited functional signatures of gene expression regulation or transcription factor binding, including a 6-fold difference in log-probability of androgen receptor binding at the variant rs2680708 (17q22). Rare variant and PRS associations, with concomitant functional interpretation of risk mechanisms, can help clarify the full genetic architecture of prostate cancer and other complex traits.
This study maps the biological relationships between diverse risk factors for prostate cancer, integrating different functional datasets to interpret and model genome-wide data from over 200,000 men with and without prostate cancer.
See related commentary by Lachance, p. 1637
Introduction
For a number of diseases, including prostate cancer, there has been limited success in detecting associated rare genetic variants, some of which may have larger effect sizes than common variants (1). This is in part due to the difficulty of typing or imputing rare variants in adequately powered studies. Still, some rare germline variants associated with prostate cancer have been detected, such as in the DNA damage repair gene BRCA2 (2, 3) and the developmental transcription factor HOXB13 (4). While relatively few rare variants have been discovered, in aggregate, they may comprise a substantial portion of prostate cancer risk heritability (5). In contrast, genome-wide association studies (GWAS) of more common variants have identified nearly 200 independent genetic variants associated with prostate cancer (6). Each variant is typically associated with only a modest increase in prostate cancer risk, and thus individually not of sufficient magnitude to be clinically significant. However, combining all associated variants together into a single polygenic risk score (PRS) may distinguish men with a meaningfully increased risk of prostate cancer.
To investigate the impact of rare and common variants on prostate cancer risk, we undertook a large-scale GWAS of over 200,000 male subjects from two large cohorts: Kaiser Permanente (KP) in California (7) and the UK Biobank (UKB; ref. 8). Genotype microarrays, including GWAS backbones and custom rare variant content, were assayed in both cohorts, and unmeasured genotypes were imputed using a reference panel of over 27,000 phased Haplotype Reference Consortium (HRC) genomes (9). We evaluated associations between individual rare and common variants and prostate cancer risk and interpreted the evolutionary origin and functional mechanisms of novel findings using multiomics data. We also performed PRS modeling and functional characterization of the known common risk variants.
Patients and Methods
Study populations
We studied two cohorts of prostate cancer cases and cancer-free controls of European ancestry: (i) KP subjects from the Research Program on Genes, Environment, and Health (RPGEH) Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort, the California Men's Health Study (CMHS), and the ProHealth Study; and (ii) the UKB. The KP cohort included 6,196 male cases and 5,788 male controls of European ancestry (mean age at diagnosis for cases = 68.1 years, mean age at enrollment into GERA among controls = 71.5; Supplementary Table S1). Controls were selected to be roughly matching in age and based on the availability of DNA for genotyping. The UKB cohort included 7,917 cases and 188,352 controls of European ancestry (mean age at diagnosis = 64.1, mean age at study enrollment among controls = 57.1; Supplementary Table S1).
Custom microarray design and genotyping of KP subjects
To directly assay or tag potentially functional rare variation and to expand genome-wide coverage of less common variation in samples from KP, we collaborated with Affymetrix Inc. on the design of a custom Axiom DNA microarray (Supplementary Fig. S1A) that was complementary to a GWAS array previously used to genotype the RPGEH GERA population (10–12). The algorithm used to select variants on the custom array (Supplementary Fig. S1B) resulted in 416,047 variant probesets comprising 54 distinct modules, including missense and loss-of-function mutations, rare exonic mutations from The Cancer Genome Atlas and dbGaP prostate cancer tumor exomes (13, 14), and variants to supplement the previously genotyped GWAS array (Supplementary Table S2; refs. 7, 10–12). Many modules and most of the design content overlapped with the probesets on the UKB Affymetrix Axiom array, for which the array design, sample processing, and genotyping have been detailed previously (8).
Saliva biospecimens from KP participants were processed for DNA extraction using a protocol reported previously (10). DNA samples from KP were processed using Samasy (15), a sample management system providing a visual and machine interface to facilitate robot liquid handling automation from source plates to destination plates matched by age, case status, and ethnicity. The algorithm implemented for destination plate randomization is described in the Supplementary Materials. A total of 173 96-well destination plates were amplified to increase DNA yields, and 200 ng of input DNA per well were array hybridized for 48 hours at 48°C and genotyped using an Affymetrix GeneTitan Multi-Channel instrument. All of the reagents used for the genotyping assay are from the Axiom 2.0 Reagent Kit (Thermo Fisher Scientific; details given at https://tinyurl.com/y3yfse29).
Quality control and imputation
Detailed descriptions of the sample and genotype quality control (QC) procedures are given in the Supplementary Materials. Briefly, for the KP samples, we excluded specimens with poor resolution fluorescent measurements (Dish QC < 0.75) or call rate < 0.95 (Supplementary Fig. S2A). On the basis of heterozygosity rate, call rate, and plate call rate, samples were further stratified into three tiers that were used to guide genotype QC. Specifically, genotype calls and posterior cluster locations from higher tier samples (as a consequence of higher input DNA quantities) were prioritized and used as empirical priors for resolving genotypes of lower tier samples using the Affymetrix AxiomGT1 algorithm (Supplementary Fig. S2B; ref. 16). Genotypes were also filtered on the basis of batch differences across the RPGEH GERA, CMHS, and ProHealth, and based on the fold difference in minor allele frequency (MAF) relative to the HRC and 1000 Genomes Project reference panels. These genotypes were then merged with previously assayed GWAS genotypes for the KP subjects (dbGaP Study Accession: phs001221.v1.p1), whose QC was described in a prior publication (7).
The KP data were phased using Eagle v2.3 (cohort-based; ref. 17), and single-nucleotide variant calls were imputed using Minimac3 using a reference panel consisting of a subpopulation of 27,165 HRC genomes accessible via the European Genome Archive (EGAS00001001710, which includes the 1000 Genomes Project Phase III samples). Indel polymorphisms were imputed only using the 1000 Genomes Project Phase III reference panel (2,504 genomes; indels were not included in first release of HRC due to additional difficulty with harmonization; Supplementary Fig. S3). Variants with r2INFO < 0.3 or with MAF less than 1/NREF (where NREF represents the total number of samples in the respective reference panel) were removed from the imputed genotypes.
For the UKB data, preimputation QC protocols have been described previously (8). Genotypes were imputed by the UKB using two reference panels: the complete HRC reference (32,488 genomes; ref. 9), and the combined UK10K plus 1000 Genomes Project Phase III reference panels (9,746 haplotypes). We excluded poorly imputed (r2INFO < 0.3) and excessively rare (MAF < 3 × 10−5) genotypes from the UKB.
Association analyses
Associations between variant genotypes and prostate cancer were evaluated separately for the UKB and KP cohorts, and later combined in a meta-analysis. Association testing used logistic regression, with adjustment for age (for prostate cancer cases, age at diagnosis, vs. age at time of entry into the GERA study for controls), body mass index, principal components (PC) of ancestry, and for the KP cohort genotyping array (because a limited number of the GERA EUR samples were typed with a different array; Supplementary Fig. S3). The KP models controlled for 20 ancestry PCs using PLINK v2.00 (18), and the UKB models were adjusted for 10 PCs. The KP and UKB results were combined by fixed-effect meta-analysis using Metasoft v2.0.0 (19). Gene-based rare variant tests (observed MAF < 1%) were conducted with the sequence kernel association test (SKAT) using the rvtests package (v20171009; ref. 20), and meta-analyzed by Fisher's method (21) using R v3.3.3. Exome-wide significance (P < 2.27 × 10−6) was evaluated by Bonferroni correction for 21,984 genes, and suggestive genes were designated as 2.27 × 10−6 < P < 2.27 × 10−5.
Evolutionary history of rare variants
To estimate the recency in origin of rare prostate cancer risk variants and further quantitate their evolutionary selection, we examined the extended haplotype homozygosity (EHH), or the length of a haplotype on which a variant allele resides, using the reference panel of 27,165 phased HRC genomes and the selscan package (22). We also quantified the integrative haplotype score (iHS), or log ratio between a variant's major and minor alleles of the area under the EHH curves for each allele (22), to reflect differences in allelic age or selective pressure between the derived and ancestral alleles. The iHS was computed using an EHH cutoff of 0.05, both upstream (iHSL) and downstream (iHSR) of the query position.
PRS analyses
Each individual's PRS was computed by multiplying the out-of-sample effect sizes (6, 7) for each of the 188 previously reported prostate cancer risk loci (log ORs) by their risk allele dosages, and then summing the resulting 188 values together (Supplementary Table S3). The OR and 95% confidence intervals (CI) for associations between standardized PRS values (mean = 0, SD = 1) and prostate cancer case–control status were estimated using logistic regression with adjustment for the same covariates modeled in our association analyses, with the exception of genotyping array for the purpose of consistency between UKB and KP PRS values. In addition, we developed a PRS using summary statistics generated from the fixed-effect meta-analysis of UKB and KP. PRS were constructed from significant (P < 5 × 10−8) and suggestive (P < 1 × 10−6) independent variants (linkage disequilibrium, LD r2 < 0.1 within a 3 Mb window). In sample weights were adjusted for bias through a lasso-type winner's curse correction (23).
Functional annotation
To consider the functional relevance of the known prostate cancer risk variants, we integrated two different analyses and sources of data. We trained elastic net regression models of normal prostatic gene expression (24), with a linear combination of germline genotypes as the predictor, using GLMNet (25) and a dataset of 471 subjects with normal prostate tissue RNA expression and genotype data (26). Among the 188 previously reported prostate cancer risk variants, as well as the novel genome-wide significant variants identified here, we reported those that were included—or were in linkage disequilibrium (LD r2 > 0.5) with a variant—in our expression models. We used a χ2 test to assess whether the number of such expression quantitative trait (eQTL) variants was statistically significantly different from expected, where the latter was estimated from a set of 1000 MAF matched variants randomly selected from the HRC. For the same set of previously reported and newly identified variants, allele-specific differential transcription factor binding affinity was also estimated using sTRAP transcription factor affinity prediction (27) with the major and minor alleles.
Results
GWAS
Among 188 SNPs previously implicated in prostate cancer GWAS, our association meta-analysis found that: 135 replicated with nominal significance (P < 0.05) and effect estimates in the same direction as reported previously; 87 of these SNPs had P < 0.05/188 (Bonferroni correction); 4 had suggestive P values (5 × 10−7 > P > 5 × 10−8); and 42 replicated with genome-wide significance (P < 5 × 10−8; Fig. 1; Supplementary Table S3). Genome-wide significant associations (meta-analysis P < 5 × 10−8) were observed at three loci not previously associated in prostate cancer GWAS (>3 Mb away and LD r2 < 0.005 in all 1000 Genomes Phase III populations, relative to 188 known prostate cancer loci). Among the three, noncoding lead variants at these loci, rs557046152 (Chr 8p12; proximal to DUSP4 and KIF13B), rs555778703 (4q31.21; intronic to TBC1D9), and rs62262671 (3p21.31; intronic to BSN), two were imputed variants that are rare in European (non-Finnish) ancestry populations: rs557046152 (gnomAD v2.1.1 MAF = 0.001) and rs555778703 (gnomAD v2.1.1 MAF = 0.009). Three additional novel loci had genome-wide significant P values in the meta-analysis (Table 1), but lacked nominal significance (rs80242938 and rs149892036) and directional consistency (rs139191981) in both cohorts.
. | KP (6,196 cases and 5,788 controls) . | UKB (7,917 cases and 188,352 controls) . | Meta-analysis KP + UKB Subjects . | |||||
---|---|---|---|---|---|---|---|---|
Risk variant dbSNP rsid, Genomic locus, Risk allele (Ref) . | OR (95% CI) . | P . | KP MAF (r2INFO) . | OR (95% CI) . | P . | UKB MAF (r2INFO) . | OR (95% CI) . | P . |
rs557046152, 8p12, G (GTT) | 2.26 (1.72–2.96) | 3.70 × 10−9 | 0.015 (0.94) | 1.40 (1.06–1.85) | 0.019 | 0.0037 (0.85) | 1.79 (1.47–2.17) | 4.50 × 10−9 |
rs555778703, 4q31.21, G (A) | 1.54 (1.08–2.17) | 0.016 | 0.0046 (0.50) | 2.00 (1.54–2.58) | 1.64 × 10−7 | 0.0044 (0.74) | 1.82 (1.48–2.24) | 1.65 × 10−8 |
rs62262671, 3p21.31, G (A) | 1.18 (1.09–1.27) | 3.47 × 10−5 | 0.062 (0.98) | 1.10 (1.05–1.15) | 7.56 × 10−5 | 0.14 (1.0) | 1.12 (1.07–1.16) | 3.55 × 10−8 |
Significantly associated variants in meta-analysis, absent nominal significance in both cohorts | ||||||||
rs80242938, 16p13.3, G (A) | 7.10 (9.5 × 10−5–5.3 × 106) | 0.73 | 0.0067 (0.67) | 11.7 (5.17–26.7) | 4.18 × 10−9 | 8.9 × 10−5 (0.80) | 11.7 (5.16–26.6) | 3.95 × 10−9 |
rs149892036, 8q12.1, T (C) | 1.53 (0.81–2.88) | 0.19 | 0.0033 (0.80) | 2.31 (1.71–3.12) | 5.37 × 10−8 | 0.0015 (0.85) | 2.14 (1.63–2.81) | 4.32 × 10−8 |
rs139191981, 3q26.33, A (G) | 0.88 (0.19–4.09) | 0.88 | 0.013 (0.90) | 7.62 (3.93–14.8) | 1.87 × 10−9 | 0.00015 (0.92) | 5.43 (2.95–9.97) | 4.96 × 10−8 |
. | KP (6,196 cases and 5,788 controls) . | UKB (7,917 cases and 188,352 controls) . | Meta-analysis KP + UKB Subjects . | |||||
---|---|---|---|---|---|---|---|---|
Risk variant dbSNP rsid, Genomic locus, Risk allele (Ref) . | OR (95% CI) . | P . | KP MAF (r2INFO) . | OR (95% CI) . | P . | UKB MAF (r2INFO) . | OR (95% CI) . | P . |
rs557046152, 8p12, G (GTT) | 2.26 (1.72–2.96) | 3.70 × 10−9 | 0.015 (0.94) | 1.40 (1.06–1.85) | 0.019 | 0.0037 (0.85) | 1.79 (1.47–2.17) | 4.50 × 10−9 |
rs555778703, 4q31.21, G (A) | 1.54 (1.08–2.17) | 0.016 | 0.0046 (0.50) | 2.00 (1.54–2.58) | 1.64 × 10−7 | 0.0044 (0.74) | 1.82 (1.48–2.24) | 1.65 × 10−8 |
rs62262671, 3p21.31, G (A) | 1.18 (1.09–1.27) | 3.47 × 10−5 | 0.062 (0.98) | 1.10 (1.05–1.15) | 7.56 × 10−5 | 0.14 (1.0) | 1.12 (1.07–1.16) | 3.55 × 10−8 |
Significantly associated variants in meta-analysis, absent nominal significance in both cohorts | ||||||||
rs80242938, 16p13.3, G (A) | 7.10 (9.5 × 10−5–5.3 × 106) | 0.73 | 0.0067 (0.67) | 11.7 (5.17–26.7) | 4.18 × 10−9 | 8.9 × 10−5 (0.80) | 11.7 (5.16–26.6) | 3.95 × 10−9 |
rs149892036, 8q12.1, T (C) | 1.53 (0.81–2.88) | 0.19 | 0.0033 (0.80) | 2.31 (1.71–3.12) | 5.37 × 10−8 | 0.0015 (0.85) | 2.14 (1.63–2.81) | 4.32 × 10−8 |
rs139191981, 3q26.33, A (G) | 0.88 (0.19–4.09) | 0.88 | 0.013 (0.90) | 7.62 (3.93–14.8) | 1.87 × 10−9 | 0.00015 (0.92) | 5.43 (2.95–9.97) | 4.96 × 10−8 |
Gene-based rare variant analysis
An additional gene-based rare variant meta-analysis of KP and the UKB data, using SKAT and rare variants with MAF less than 0.01, yielded an exome-wide significant association at HOXB13 (P = 1.72 × 10−7; Fig. 2; Supplementary Table S4), a well-characterized prostate cancer risk locus harboring the rare yet highly penetrant missense founder mutation G84E rs138213197 (4). The association at HOXB13 was primarily driven by the rs138213197 risk SNP (Supplementary Fig. S4), which was highly significant in the GWAS meta-analysis (P = 2.96 × 10−43). The SNP within HOXB13 with the second smallest P value, rs116931900, was located in the C-terminal untranslated region of the HOXB13 open reading frame (P = 2.95 × 10−4). SKAT also identified a suggestive P value for ILDR1 (P = 7.46 × 10−6), a gene primarily expressed in prostate tissue (28). The ILDR1 association appeared to be primarily driven by a variant with a suggestive association in in the KP cohort; this was not associated with prostate cancer in the UKB (Supplementary Fig. S4; Supplementary Table S5).
Evolutionary characterization
We observed atypically long-range LD for the HOXB13 G84E rare variant rs138213197-T, extending beyond a 1 Mb window from its chromosomal position (Supplementary Fig. S5). This observation was substantiated by considerable EHH for the rare missense allele (Fig. 3A). In particular, rs138213197 had an integrated haplotype score (iHS) equal to 2.87 (iHSL: 3.53, iHSR: 2.54) in our HRC haplotype data, greater than the |iHS| > 2.5 threshold corresponding to the most extreme 1% of values (29), and reflecting the recent origin and/or selective constraint at the rs138213197 locus. Likewise, for the novel rare variant rs555778703 in the intron of gene TBC1D9, the rare G risk allele (Fig. 3B) had an iHS equal to 2.31 (iHSL: 2.00, iHSR: 2.79). For a proxy SNP rs57029021 (LD r2 = 0.666 in 1000 Genomes Project Phase III EUR) of the novel rare SNP rs557046152 (which was unmeasured in the EGA HRC reference genomes), the rare A allele had an iHS equal to 0.87 (iHSL: 1.60, iHSR: 0.77; Fig. 3C).
PRSs
For subjects in KP and the UKB, there was a strong association between being in the top decile versus the referent percentiles (40%–60%) of the PRS and prostate cancer risk [Supplementary Fig. S6; Supplementary Table S6; meta-analysis OR (95% CI) = 2.62 (2.46–2.79), P = 2.55 × 10−191; KP OR = 2.40 (2.06–2.80), P = 4.38 × 10−29; UKB OR = 2.71 (2.52–2.93), P = 2.25 × 10−150]. There was also a significant decrease in risk as a result of being in the bottom decile [Supplementary Fig. S6; Supplementary Table S6; meta-analysis OR (95% CI) = 0.34 (0.30–0.37), P = 1.69 × 10−98; KP OR = 0.36 (0.31–0.42), P = 6.78 × 10−40; UKB OR = 0.33 (0.29–0.38), P = 1.31 × 10−54]. No significant heterogeneity between KP and UKB cohorts was detected for any PRS decile association with prostate cancer risk (I2 < 4%; PCochran's Q > 0.84). The top decile included 9.40% of all controls, but 18.3% of the cases and the bottom decile 10.4% of controls and 4.02% of cases. When limiting our PRS to only the 87 prostate cancer risk variants that replicated in our data (P < 0.05/188) we saw little impact to risk as a result of being in the top decile [meta-analysis OR (95% CI) = 2.60 (2.44–2.77)] or the bottom decile [meta-analysis OR (95% CI) = 0.34 (0.31–0.38)]. However, when the three novel variants (rs557046152, rs555778703, and rs62262671) discovered in our study were incorporated into the PRS, following in-sample bias correction (30) of variant effects, there was an increase in the risk in the top decile [meta-analysis OR (95% CI) = 2.74 (2.57–2.92)], with risk in the bottom decile remaining unchanged [meta-analysis OR (95% CI) = 0.34 (0.31–0.38)].
We also generated PRS from our meta-analysis summary statistics using independent (LD r2 < 0.1) and genome-wide significant (P < 5 × 10−8) or suggestive (P < 1 × 10−6) variants and adjusted weights from our study, where a lasso-type in-sample bias correction procedure (23) was applied to prevent overfitting due to winner's curse (Supplementary Table S7). The GWAS significant and suggestive PRS developed in our study had similar risk from being in the top decile [meta-analysis OR (95% CI) = 2.34 (2.19–2.49) and 2.55 (2.39–2.72); Supplementary Table S8] and bottom decile [meta-analysis OR (95% CI) = 0.48 (0.44–0.52) and 0.44 (0.40–0.49)] as the PRS constructed from the 188 previously identified prostate cancer SNPs. Although we corrected our PRS variant effects, results may be inflated because of in-sample prediction and therefore should be interpreted with caution; further application to an independent cohort is needed to test the validity of this PRS.
Functional interpretation
To characterize the functional consequences of common prostate cancer variants, we examined their effects on gene expression and transcription factor binding. Among the 188 previously reported prostate cancer risk variants and three novel risk variants identified, 80 were in linkage disequilibrium (LD r2 > 0.5 in 1000 Genomes Project Phase III EUR) with an eQTL variant in our regularized models of normal prostatic expression levels, which was significantly higher than expected (P = 3.07 × 10−5; Supplementary Table S9). This included one of the three novel variants implicated in our meta-analysis, rs62262671, which is predicted to alter expression of two genes, RMB6 and UBA7.
Furthermore, 32 variants were predicted to significantly alter transcription factor binding site (TFBS) affinities (Supplementary Table S10), with greater than 3-fold predicted differences in TFBS log-probabilities. rs2680708 (17q22) showed the greatest fold change in predicted binding affinity (P = 3.91 × 10−7) of any variant-transcription factor pair analyzed (Supplementary Table S10), and was predicted to abrogate binding of the androgen receptor (AR) transcription factor, a sentinel of prostatic gene expression (31). In keeping with this AR-mediated link between genetic variation and prostate cancer risk, a pattern of androgen-mediated influence, and/or prostate cancer–related expression, also emerged among the remaining variant-transcription factor pairings, including DBP (rs2680708; ref. 32), HMGIY (rs5799921; ref. 33), AP1 (rs2660753; refs. 32, 34), DELTAEF1 (also known as ZEB1; rs7210100; ref. 35), STAT5A (rs742134; ref. 36), HNF1B (rs742134; ref. 37), and MAFB (rs9625483; ref. 38), among others. Moreover, the rs62262671 risk variant associated in the meta-analysis was predicted to impact binding of OCT1 (P = 5.04 × 10−3), a transcription factor closely intertwined with AR signaling in prostate cancer cell lines (39–42).
Discussion
We undertook a large-scale association analysis of genotyped and imputed common and rare variants, and implicated three novel loci, including two rare variants rs557046152 (Chr 8p12; proximal to DUSP4 and KIF13B at 8p12) and rs555778703 (4q31.21; intronic to TBC1D9), and one common variant rs62262671 (3p21.31; intronic to BSN). Gene-based rare variant tests, in our meta-analysis of 14,113 prostate cancer cases and 201,722 controls across the KP and UKB cohorts, also revealed significant and suggestive associations, respectively, for HOXB13, a well-studied prostate cancer risk gene, and a novel candidate gene ILDR1, which encodes a B7-like receptor protein related to the immune checkpoint blockade immuno-oncology pathway. Evolutionary analysis revealed patterns of haplotypic natural selection for both the novel rare variants associated in our study and also a known prostate cancer rare variant rs138213197 (HOXB13 G84E). Functional analyses suggested novel mechanistic hypotheses of transcription factor binding and gene expression modulation for dozens of prostate cancer risk loci, including previously reported variants and the novel variant rs62262671. Finally, a PRS analysis of 188 prostate cancer variants indicated that men in the highest PRS decile have a substantially increased risk that may be of clinical importance.
The two rare variants associated in our meta-analysis, rs557046152 [meta OR (95% CI) = 1.79 (1.47–2.17)] and rs555778703 [meta OR (95% CI) = 1.82 (1.48–2.24)], both exhibited OR effect sizes greater than what is most often seen for common GWAS variants (OR > 1.2), in keeping with their rare allele frequencies. Because rare allele frequencies with large effect sizes can be explained by purifying selection in the human genome (43), we examined the haplotypic patterns of rare variants associated in our study for hallmarks of natural selection. Indeed, evolutionary analyses revealed signs of selection at rs555778703 (iHS = 2.31), suggested the possibility of selection at an LD proxy for rs557046152 (rs57029021 iHS = 0.87), and confirmed the presence of selective forces at a known prostate cancer rare variant rs138213197 (HOXB13 G84E; iHS = 2.87). While previous studies have identified the haplotype for the G84E variant (44) and estimated its recent date of origin in the 18th century (45), our study is the first demonstration, to our knowledge, of evolutionary forces of natural selective pressure at the HOXB13 G84E haplotype.
Among the two genes associated in our gene-based rare variant tests, ILDR1 (immunoglobulin like domain containing receptor 1) encodes a cell-surface receptor protein ILDR1 that is a gene highly expressed in prostate tissue (28), and which localizes at tricellular tight junctions while sealing extracellular regions, along with its closely related homologs ILDR2 and ILDR3 (46). Recent studies have revealed that ILDR2 is a related to the B7 family of proteins (47), which includes the PD-L1 (B7-H1) and PD-L2 (B7-H2) immune checkpoint ligands (48). The success of anti-PD-1 and anti-PD-L1 immunotherapies has recently inspired preclinical testing of a novel immune checkpoint inhibitor targeting ILDR2 (49). Our finding of a rare variant association in ILDR1, as well as the sequence similarity between ILDR1 and ILDR2 (39% overall sequence identity) and structural similarity of ILDR2 to the B7 protein family (47), motivates study of the potential involvement of ILDR1 in prostate cancer and cancer immunology.
Our constructed PRS for European ancestry men exhibited a large magnitude of effect. Comparing men in the top 1% of the PRS distribution to the median (40%–60%) yielded an OR of 4.12 (95% CI, 3.60–4.72), which is only of somewhat lower magnitude as high-risk genes such as BRCA1 for breast cancer, where comparing mutation carriers to noncarriers have an OR = 5.91 (95% CI, 5.25–6.67; ref. 50). Although the PRS effect is of relatively large magnitude, the scores may not be fully transferable to individuals of non-European descent (7) and need to be examined in other ethnic groups (51, 52). In spite of these limitations, over a decade of GWAS efforts (53) have advanced the genetic characterization of prostate cancer considerably. Our construction of a PRS model for prostate cancer with high discrimination demonstrates this remarkable progress and the predictive power of aggregating prostate cancer risk loci. Interestingly, our meta-analysis did not replicate many of the previously associated SNPs included in the PRS and removing the nonreplicated SNPs had very little impact on the effect of the PRS. This may reflect winner's curse for the original findings or differences in populations studied. Nevertheless, the PRS remained strongly associated with prostate cancer even when including all of the SNPs.
Using eQTL models of prostatic gene expression, we developed functional annotations for germline variants implicated in our analysis, or previously implicated and present in our PRS. Among the genes putatively targeted by 80 such prostate cancer risk variants and their LD proxies, many have been previously described as prostate cancer risk genes in transcriptome wide association studies (TWAS; refs. 24, 54, 55): AGAP7, BHLHA15, C2orf43, C9orf78, CTBP2, EHBP1, FAM57A, FOXP4, GEMIN4, IRX4, MMP7, MSMB, NCOA4, PPP1R14A, RAB7L1, RGS17, SLC22A3, UHRF1BP1. Among the remaining fraction not directly implicated by prostate cancer TWAS, several additional genes have been genetically or transcriptionally linked to prostate cancer pathogenesis, including ACVR2A (56), NUDT11 (57), and PPFIBP2 (58).
Integration of gene expression and TFBS affinity data suggested novel mechanisms for many of the common prostate cancer variants reported previously. One example is a highly significant change in transcription factor binding affinity at rs2680708. This finding is especially interesting given that the risk allele rs2680708-G abrogates a binding site for AR, a master regulator of prostatic gene expression. The magnitude of the rs2680708 TFBS effect we predicted exceeds, by several fold, the magnitude of effect for TFBS variants that have been previously reported and validated by in vitro functional assays, including rs8134378, rs11084033, and rs2659051 (24, 59–61). While our eQTL analyses did not nominate any gene whose expression may be affected as a consequence of eliminating this particular binding site, further study may reveal the effect of rs2680708 on the dysregulation of gene expression or additional molecular processes that potentially link its reported effect on prostate cancer risk with its putative influence on AR binding.
The newly implicated rs62262671 risk variant (3p21) was also predicted to have an impact on binding affinity for OCT1, a transcription factor with a known impact on prostate cancer and AR signaling (39–42). Given that rs62262671 was also identified as an eQTL affecting the expression of RBM6 and UBA7, these findings suggest that OCT1 may be involved in the regulation of the expression of these two genes, and provides a hypothesis for future functional follow-up regarding the involvement of these genes in prostate cancer development. A potential limitation of in silico TFBS prediction using the sTRAP algorithm is that, for certain transcription factor profiles, very similar and significant P values are sometimes observed for different transcription factors at the same SNP. While this observation could simply reflect similarity between particular transcription factor binding profile models, it may also reflect limitations in the models that force highly significant differences to converge to the same P value upon reaching the tail of an empirical background distribution. Nevertheless, our approach provides a convenient and valid method for mechanistic interpretation and hypothesis generation in research involving allele-specific transcription factor binding at genetic polymorphisms.
To improve detection of rare variant associations our study design prioritized directly genotyping variants of putative functional significance, rare variants from trait-specific whole-exome sequencing cohorts, and rare variants with proximity to trait-associated loci, all on a custom microarray. While our use of two large, population-based cohorts empowered univariate association testing of prostate cancer loci driven by rare genetic variants genome-wide, multivariate testing of rare variants colocalizing at particular genes remains an effective means of revealing loci driven by rare variants, which lack marginal, univariate significance, but which significantly contribute to disease risk when considered in aggregate. Furthermore, due in part to limited statistical power, the mechanisms through which the rare, noncoding variants we identified are associated with prostate cancer remain somewhat unclear, with a lack of precise functional evidence regarding mechanism of action or close proximity to genes or known risk loci in cis. We used biophysical models of transcription factor binding to identify variant alleles that may introduce or interrupt a TFBS that are independent of allele frequency. Nevertheless, there remains a challenge of not only detecting—but also interpreting—how noncoding rare variants impact the genetic etiology of complex traits using existing gene-based methodologies and functional genomic datasets.
By undertaking a GWAS in the large KP and UKB population-based cohorts, we detected several novel prostate cancer risk loci, a novel risk gene, and a polygenic signature of prostate cancer risk. Functional characterization of prostate cancer risk variants using gene expression and transcription factor binding affinity data revealed putative mechanisms of disease risk. However, further study is needed to more fully illuminate the biological mechanisms that underpin the influence of prostate cancer risk loci, particularly for rare variants.
Authors' Disclosures
N.C. Emami reports other funding from Avail Bio, Inc. and personal fees from Illumina Inc. outside the submitted work. R.E. Graff reports grants from NIH during the conduct of the study. C.L. Cario reports other funding from Avail Bio outside the submitted work. J. Shan reports grants from NIH during the conduct of the study. L.C. Sakoda reports grants from NCI during the conduct of the study. J.S. Witte reports grants from NIH during the conduct of the study and other funding from Avail Bio outside the submitted work. No disclosures were reported by the other authors.
Authors' Contributions
N.C. Emami: Conceptualization, software, formal analysis, visualization, methodology, writing–original draft, writing–review and editing. T.B. Cavazos: Software, formal analysis, visualization, methodology, writing–original draft, writing–review and editing. S.R. Rashkin: Formal analysis, writing–review and editing. C.L. Cario: Data curation, software, formal analysis, writing–review and editing. R.E. Graff: Data curation, writing–original draft, writing–review and editing. C.G. Tai: Data curation, formal analysis, writing–review and editing. J.A. Mefford: Data curation, software, writing–review and editing. L. Kachuri: Formal analysis, writing–review and editing. E. Wan: Data curation, writing–review and editing. S. Wong: Data curation, writing–review and editing. D. Aaronson: Data curation, writing–review and editing. J. Presti: Data curation, writing–review and editing. L.A. Habel: Data curation, writing–review and editing. J. Shan: Data curation, writing–review and editing. D.K. Ranatunga: Data curation, writing–review and editing. C.R. Chao: Data curation, writing–review and editing. N.R. Ghai: Data curation, writing–review and editing. E. Jorgenson: Resources, writing–review and editing. L.C. Sakoda: Resources, writing–review and editing. M.N. Kvale: Supervision, writing–review and editing. P.-Y. Kwok: Data curation, writing–review and editing. C. Schaefer: Resources, funding acquisition, investigation, writing–review and editing. N. Risch: Resources, supervision, funding acquisition, investigation, writing–review and editing. T.J. Hoffmann: Resources, software, formal analysis, supervision, investigation, writing–original draft, writing–review and editing. S.K. Van Den Eeden: Conceptualization, resources, investigation, writing–review and editing. J.S. Witte: Conceptualization, resources, data curation, formal analysis, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
The authors thank Elad Ziv, Nadav Ahituv, and Ryan Hernandez for their guidance and feedback. We are also grateful to the KP Northern California members who have generously agreed to participate in the KP Research Program on Genes, Environment, and Health, the ProHealth Study and the California Men's Health Study. This research has been conducted using the UKB Resource under application number 14105, and has been supported by the Robert Wood Johnson Foundation (to J.S. Witte), the NIH R01CA088164, R01CA201358, R25CA112355, T32GM067547, and U01CA127298 (to J.S. Witte), the UCSF Goldberg-Benioff Program in Cancer Translational Biology (to J.S. Witte), the UCSF Discovery Fellows Program (to N.C. Emami and T.B. Cavazos), the Microsoft Azure for Research Program (to C.L. Cario and N.C. Emami), and the Amazon AWS Cloud Credits for Research Program (to C.L. Cario and N.C. Emami). Support for participant enrollment, survey completion, and biospecimen collection for the RPGEH was provided by the Robert Wood Johnson Foundation, the Wayne and Gladys Valley Foundation, the Ellison Medical Foundation, and KP national and regional community benefit programs. Genotyping of the GERA cohort was funded by a grant from the National Institute on Aging, the National Institute of Mental Health, and the NIH Common Fund (RC2 AG036607). The sponsors played no role in the study.