Abstract
Prostate cancer is one of the most heritable human cancers. Genome-wide association studies have identified at least 185 prostate cancer germline risk alleles, most noncoding. We used integrative three-dimensional (3D) spatial genomics to identify the chromatin interaction targets of 45 prostate cancer risk alleles, 31 of which were associated with the transcriptional regulation of target genes in 565 localized prostate tumors. To supplement these 31, we verified transcriptional targets for 56 additional risk alleles using linear proximity and linkage disequilibrium analysis in localized prostate tumors. Some individual risk alleles influenced multiple target genes; others specifically influenced only distal genes while leaving proximal ones unaffected. Several risk alleles exhibited widespread germline–somatic interactions in transcriptional regulation, having different effects in tumors with loss of PTEN or RB1 relative to those without. These data clarify functional prostate cancer risk alleles in large linkage blocks and outline a strategy to model multidimensional transcriptional regulation.
Many prostate cancer germline risk alleles are enriched in the noncoding regions of the genome and are hypothesized to regulate transcription. We present a 3D genomics framework to unravel risk SNP function and describe the widespread germline–somatic interplay in transcription control.
This article is highlighted in the In This Issue feature, p. 2711
INTRODUCTION
Prostate cancer remains a broad health concern because of the advancing global population age. It has a heritability of 57%, well above the 33% for all cancers (1). Both rare variants in DNA damage repair genes like BRCA2 and common variants contribute to prostate cancer heritability (2, 3). Genome-wide association studies (GWAS) have identified at least 185 common prostate cancer germline risk alleles (4, 5). Almost all of these reside in noncoding intergenic (75/185) or intronic regions (99/185).
The mechanisms by which these common noncoding variants influence prostate cancer risk are poorly understood. They are enriched in prostate lineage-specific enhancers and promoters (6–8). Many are thought to contribute to the transcriptional reprogramming of the prostate, akin to the recurrent somatic ERG gene fusions seen in prostate cancer (9). Prostate cancer risk alleles have been associated with transcriptional regulation through multiple mechanisms, including DNA methylation, transcription factor binding, and enhancer/promoter activity (5, 7, 10–18). For example, the prostate cancer risk allele rs684232 and the ERG transcription factor regulate the expression of three neighboring genes in a combinatorial manner in multiple patient cohorts (7, 14). Similarly, the prostate cancer risk SNP rs11672691 influences enhancer activity and upregulates the transcription of PCAT19 and CEACAM21 (12, 13). Noncoding SNPs influence tumor-specific DNA methylation in the human prostate (19, 20). Taken together, these studies suggest that unraveling the genomic and regulatory features of risk alleles can provide novel insights into the biology of prostate cancer.
Identification of the transcriptional targets of risk alleles has been difficult for myriad reasons. First, in many cases it is unclear which of multiple alleles in linkage disequilibrium (LD) is functional. Second, the functional interrogation of risk alleles is complicated by their relatively small effect sizes which can compound over decades to subtly influence tumorigenesis (2). Third, individual risk alleles may regulate multiple genes via chromatin interactions, leading to effects in genes distal in the linear genome. Fourth, germline risk alleles occur in the context of an evolving somatic genome and transcriptome, leading to joint germline–somatic influences in the regulation of target gene transcription.
To help resolve these problems, we quantified the two-dimensional (2D) and three-dimensional (3D) spatial structures of the prostate cancer genome and integrated them with germline, somatic, and transcriptome data across 565 patients with localized prostate cancer. First, we defined the transcriptional targets and chromatin features of prostate cancer risk alleles, including those in LD with tag SNPs. Second, we integrated tumor chromatin immunoprecipitation sequencing (ChIP-seq), whole-genome sequencing, transcriptome, and proteome data from patients with prostate cancer to quantify the germline–somatic interplay in transcriptional regulation between germline risk alleles and somatic mutations. Finally, we demonstrated how prostate cancer risk allele targets divergently influence the major cell lineages of the prostate gland. These studies present a multidimensional view of transcriptional dysregulation in prostate cancer.
RESULTS
Cis-Regulatory Targets of Germline Prostate Cancer Risk Alleles
We examined the DNA sequence and epigenetic features of 185 prostate cancer germline risk alleles (4, 17, 21) identified by GWAS (Fig. 1A and B). About one third (63/185) were located outside promoters and gene bodies, whereas only ∼6% (11/185) altered protein coding sequences—implying mechanisms beyond primary protein structure changes (Fig. 1C; Supplementary Table S1). We hypothesized that many prostate cancer risk alleles influence transcription via long-range chromatin interactions. To test this, we focused on the androgen receptor (AR) and FOXA1—master transcription factors in the prostate that are expressed in the LNCaP and VCaP luminal prostate cancer cell lines. We used ChIP-seq and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) data to evaluate the enrichment of prostate cancer risk loci in various chromatin features, including AR or FOXA1 transcription factor binding sites, insulated neighborhood boundaries marked by the cohesin component RAD21, active enhancers marked by H3K27ac, and RNA Polymerase II (RNA Pol II) binding sites. Enrichment analysis was performed using genome shuffling, controlling for peak number and width (see Supplementary Methods).
The overlap between prostate cancer risk alleles and both AR and FOXA1 peaks was significantly greater than expected by chance (Fig. 1D; Supplementary Fig. S1A). Prostate cancer risk alleles preferentially occurred in active chromatin marked by H3K27ac or RNA Pol II occupancy. ChIA-PET analysis inferred the subset of peaks that interacted with other peaks associated with the same protein, termed anchor peaks. The enrichment of prostate cancer risk alleles in chromatin landmarks was more pronounced in the benign RWPE-1 cells and AR-positive LNCaP and VCaP prostate cancer cells than in the AR-negative DU145 prostate cancer cells. Prostate cancer risk alleles were more enriched than random SNPs in 2D and 3D chromatin features in RWPE-1, LNCaP, and VCaP cells (Fig. 1E; Supplementary Fig. S1B). This enrichment was absent for most chromatin features in DU145 cells. Prostate cancer typically begins in a cell that expresses AR; DU145 cells lack AR expression and have undergone a lineage differentiation.
We next sought to identify specific transcriptional targets of prostate cancer risk alleles. Using RNA Pol II and RAD21 ChIA-PET data, we classified targets based on anchor peak location: Type I targets occur when risk loci have 3D chromatin interactions with gene promoters; type II targets occur when risk loci have 3D chromatin interactions with gene bodies; type III targets occur when risk loci have 3D chromatin interaction peaks proximal to genes but outside their body or promoter (Fig. 1A). A single risk allele can have all of type I, type II, and type III targets.
Using this methodology, we identified transcriptional targets of 24.3% (45/185) of the prostate cancer risk alleles. These included 13,384 total SNP–target pairs (107 type I targets, 111 type II targets, and 13,166 type III targets). This was significantly more than expected by chance for each target class: 68 additional type I targets, 72 additional type II targets, and 5,601 additional type III targets (Supplementary Fig. S1C; Supplementary Table S2).
To quantify the relationship between risk alleles and transcript abundance, we leveraged three cohorts of localized prostate cancer—the Canadian Prostate Cancer Genome Network (CPC-GENE; 127 samples), The Cancer Genome Atlas (TCGA; 338 samples), and the Porto cohort (100 samples)—each with germline genotypes and tumor transcriptomes (22–24). The allele frequencies for risk loci were well correlated across cohorts (Supplementary Fig. S1D). We stratified tumors according to their risk allele genotype to evaluate transcriptional effects of 13,384 SNP–target pairs representing 45 risk alleles. We identified transcriptional changes for 36/45 risk alleles, comprising 589 SNP–target pairs (14 type I SNP–target pairs, 12 type II SNP–target pairs, and 563 type III SNP–target pairs; Fig. 1F; Supplementary Table S3). These 589 SNP–target pairs met two criteria: (i) P < 0.05 in one or more cohorts and (ii) the same direction of effect (β value) in all three cohorts. The direction of β values (positive and negative β values represent upregulation and downregulation, respectively) was more concordant across cohorts for type I SNP–target pairs, followed by types II and III. We called these high-sensitivity SNP–target pairs. By applying an FDR <0.2 on the high-sensitivity SNP–target pairs, we retained 272 and 283 SNP–target pairs in the TCGA and CPC-GENE cohorts, respectively, with an overlap of 43 SNP–target pairs (Supplementary Fig. S1E and S1F; Supplementary Table S4).
We next performed a meta-analysis across the three patient cohorts using the Fisher method. This prioritized 104 high-specificity SNP–target pairs representing 31 risk alleles (Fisher combined FDR <0.2; Fig. 1F and G; Supplementary Fig. S1G; Supplementary Table S3). Using this criterion, we retained 32 of the 43 shared SNP–target pairs (between TCGA and CPC-GENE) as high-specificity SNP targets. The presence of an individual risk allele was associated with transcriptional changes in a median of three genes (Supplementary Fig. S1H). Taken together, these data highlight the utility of 3D genome maps in deciphering the transcriptomic targets of prostate cancer risk alleles.
Germline Risk Alleles Shape the Transcriptome of Primary Prostate Cancer
Two-dimensional proximity in the linear genome is widely used to assign transcriptional targets to risk SNPs (25). We therefore applied the same criteria used above for identifying high-specificity targets (Fisher combined FDR <0.2) to the nearest gene in the linear genome for each of the 185 prostate cancer risk alleles. For 48 of 185, the nearest gene was transcriptionally validated; 13 of these 48 SNP–nearest gene target pairs were rediscovered by our 3D genome approach, 8 of which were intragenic targets. Intriguingly, 10 of these 13 SNPs were associated with transcriptional regulation of both proximal and distal genes in primary tumors; only three were associated with transcriptional regulation of proximal genes (Fig. 2A). These results demonstrate that the transcriptional regulation by risk alleles extends beyond its nearest gene and highlights the complementarity of 2D and 3D approaches.
One example of the complex influence of the 3D genome architecture on transcriptional regulation by germline prostate cancer risk alleles is the SNP rs4962416, which lies within an intron of CTBP2 (26). The locus harboring the risk SNP rs4962416 exhibited RAD21 and RNA Pol II–associated chromatin interactions in LNCaP cells, along with AR occupancy and H3K27ac enrichment (Fig. 2B). The presence of the risk allele was associated with increased CTBP2 transcript abundance in primary patient tumors (FDR = 9.93 × 10−5; Fig. 2C). The presence of the risk allele was associated with increased H3K27ac (Fig. 2D, left). In 37 tumors heterozygous for rs4962416, the risk allele was more associated with H3K27ac than the wild-type (WT) allele (Fig. 2D, right). This allelic imbalance in an enhancer mark suggests rs4962416 may facilitate enhancer activation in tumors. Consistent with a tumor-promoting role of CTBP2 via modulation of AR signaling (27), our results suggest that upregulation of CTBP2 is a potential mechanism linking the risk SNP with prostate cancer etiology.
Although rs4962416 resides in the intron of CTBP2, surprisingly, such regulatory relationships are not generalizable in the absence of 3D genome information. For example, the risk SNP rs636291 is located within an intron of the gene PEX14. DFFA and PEX14 are bidirectional genes with their promoters in a head-to-head orientation. From a 3D genome perspective, the locus harboring the SNP rs636291 is bivalent as it has both the H3K27ac active enhancer mark and the CTCF insulator mark. Consistently, the locus exhibited both RNA Pol II– and RAD21-associated chromatin interactions with the promoter of the distal gene APITD1 (Supplementary Fig. S2A). The presence of rs636291 was associated with a reduced RNA abundance of APITD1 in patient tumors (FDR = 2.05 × 10−3; Supplementary Fig. S2B). However, we did not observe a relationship between rs636291 and RNA abundance of either the host PEX14 gene or its bidirectional partner gene, DFFA.
A second example of this phenomenon of intronic risk alleles not regulating their host genes is rs10486567, located within an intron of JAZF1 (28). This locus showed AR, FOXA1, and ERG co-occupancy and the H3K27ac active enhancer mark. We observed RNA Pol II–associated long-range chromatin interactions between this locus and the distal HOXA13–HOTTIP bidirectional gene cluster (Fig. 2E). The rs10486567 genotype was associated with increased RNA abundance of both HOXA13 and HOTTIP expression in patient tumors (FDR = 0.11; Fig. 2F). However, we did not observe a relationship between s10486567 and RNA abundance of the host gene JAZF1 (FDR = 0.277; Supplementary Fig. S2C). Taken together, these results indicate that transcriptional targets of risk alleles are not necessarily dictated by linear 2D proximity.
Our strategy elucidated transcription control by both intragenic and intergenic risk alleles. As an example of the latter, the risk SNP rs5759167 is located in an intergenic region between BIK and TTLL1. Its locus exhibited RNA Pol II–associated chromatin interaction with the promoter of BIK, but not TTLL1 (Supplementary Fig. S2D). Consistent with these 3D genome data, rs5759167 was associated with increased BIK transcript abundance in patient tumors (FDR = 7.66 × 10−4; Supplementary Fig. S2E), but not TTLL1 transcript abundance (FDR = 0.551; Supplementary Fig. S2F). We replicated several additional expression quantitative trait loci (eQTL) in patient cohorts (29), including rs10845943 with C1QL4 upregulation (FDR = 1.30 × 10−4; Supplementary Fig. S2G and S2H) and rs12621278 with ITGA6 downregulation (FDR = 3.27 × 10−4; Supplementary Fig. S2I and S2J). These results indicate that 3D genome information can improve the identification of the transcriptional targets of risk alleles.
Germline–Somatic Interplay in Prostate Cancer Transcriptional Regulation
We hypothesized that somatic driver mutations can modulate the transcriptional effects of germline risk alleles, perhaps enhancing the effect of tumor-promoting alleles and diminishing that of tumor-suppressing ones. To test this, we analyzed the transcriptional effects of a subset of 283 high-sensitivity SNP–target pairs which were also significant in the CPC-GENE cohort (FDR <0.2) by comparing patient tumors with and without specific somatic driver mutations (somatic mutation status is available only in the CPC-GENE cohort). Because prostate cancer is a C-class disease primarily driven by structural variants (30), we focused on four common driver structural variations that typically occur early and clonally in localized prostate cancer evolution (31): loss of PTEN, loss of RB1, loss of NKX3-1, and ERG gene fusion.
Of these four, PTEN-mutant tumors were systematically associated with larger germline effects on RNA (Fig. 3A). Loss of PTEN potentiated both transcriptional upregulation and downregulation by germline risk alleles. Loss of RB1 was also associated with enhanced germline effects on RNA. By contrast, loss of NKX3-1 and ERG gene fusion did not reach statistical significance. RB1 loss potentiated genome-wide transcriptional downregulation by germline risk alleles, and several individual SNP–target pairs were influenced by NKX3-1 loss or ERG fusion status (Supplementary Table S5). As an example of these effects, the rs4962416 risk allele was associated with increased CTBP2 transcript abundance in tumors with RB1 loss, but with little or no effect in tumors WT for RB1 status (Fig. 3B). The same risk allele was associated with increased CTBP2 transcript abundance in ERG-negative tumors, but with little or no effect in ERG-positive tumors. By contrast, loss of PTEN or NKX3-1 was not associated with changes in the effects of this risk allele on CTBP2 transcript.
Consistent with the patient tumor data, we observed that siRNA-based knockdown of RB1 resulted in the transcriptional upregulation of CTBP2 in LNCaP cells (Fig. 3C). Next, we studied the WT and rs4962416 in transcriptional regulation by conducting dual-luciferase reporter assays in LNCaP cells. In comparison with the vector control, the WT allele increased the reporter luciferase activity, and the magnitude of the effect was further increased in the presence of the risk allele (Fig. 3D). Given the observation of AR occupancy in the risk SNP locus (Fig. 2B), we tested the effect of androgen signaling on the activity of WT and rs4962416. Depletion of androgens from the media blocked the activity of WT and rs4962416 (Fig. 3E). This effect was reversed by the exogenous administration of the AR ligand dihydrotestosterone (DHT; Fig. 3E; Supplementary Fig. S3). Overall, these results indicate that driver somatic mutations influence gene-expression regulation by prostate cancer risk alleles and thereby contribute to the dynamic transcriptional reprogramming of the prostate epithelium.
We analyzed the chromatin binding profiles of RB1 (in LNCaP and VCaP cells) and ERG (in VCaP cells; refs. 32, 33). By overlapping RB1 and ERG binding peaks with RNA Pol II ChIA-PET anchor peaks, we created virtual chromatin contact maps of RB1 and ERG occupancy and traced the target genes. Importantly, the expression of target genes discovered by our new approach was significantly higher than the expression of the nearest neighboring genes as well as control genes in all our comparisons (Fig. 3F and G). These results suggest that RB1 and ERG regulate transcription via long-range chromatin interactions. Furthermore, the chromatin interaction gene targets (expressed genes; FPKM >1) of RB1 and ERG significantly overlapped with targets (277 genes corresponding to 283 SNP–target pairs) of prostate cancer risk SNPs (Fig. 3H and I).
Prostate Cancer Risk SNP rs8102476 and Contextual Transcriptional Regulation
The SNP rs8102476 is an example of simultaneous upregulation and downregulation of two adjacent genes by a risk allele. It is located in an intergenic region in chromosome 19q13.2 (34) that exhibited both RNA Pol II–associated and RAD21-associated chromatin interactions encompassing the proximal PPP1R14A and distal SPINT2 genes (Fig. 4A). These chromatin interactions displayed characteristic CTCF binding and/or the H3K27ac mark in the anchor regions. The presence of the risk allele was associated with the downregulation of the PPP1R14A transcript (FDR = 2.85 × 10−8) and upregulation of the SPINT2 transcript (FDR = 0.01) in patient tumors (Fig. 4B and C). PPP1R14A protein abundance showed a similar effect in a cohort of 63 patients (ref. 35; Fig. 4D). Therefore, rs8102476 is both an eQTL and a protein QTL (pQTL) for PPP1R14A. We suggest that changes in the boundaries of the insulated neighborhood and associated changes in enhancer–promoter contacts are likely mechanisms that link the presence of rs8102476 with gene-expression changes.
Cis-Regulatory Targets of SNPs in LD with the Tag Risk Alleles
LD represents the nonrandom association of alleles at different loci in the population. We adopted the approach used above to identify the chromatin interaction targets of alleles in LD to the tag risk SNPs. We identified a total of 494 LD SNPs for the 185 tag SNPs (r2 > 0.8; Supplementary Table S6). The median distance between the tag and LD SNPs was 8,492 bp (Supplementary Fig. S4A). We identified candidate targets for 99 LD SNPs, resulting in 35,836 LD SNP–target pairs (250 type I SNP–target pairs, 224 type II SNP–target pairs, and 35,362 type III SNP–target pairs). We identified 158 additional type I targets, 133 additional type II targets, and 17,064 additional type III targets than expected by chance (P < 0.01; Supplementary Fig. S4B; Supplementary Table S7).
We identified transcriptional changes for 82 of 99 LD SNPs with 3D chromatin linkages, including 70 LD SNPs with high-specificity targets and 12 LD SNPs with high-sensitivity targets (Supplementary Fig. S4C). A total of 247 high-specificity LD SNP targets were discovered (24 type I LD SNP–target pairs, 31 type II LD SNP–target pairs, and 192 type III LD SNP–target pairs; Fig. 5A; Supplementary Table S8). The presence of an individual LD SNP was associated with transcriptional changes in a median of two genes (Supplementary Fig. S4D). By applying an FDR <0.2 on the high-sensitivity SNP–target pairs, we retained 691 and 773 SNP–target pairs in the TCGA and CPC-GENE cohorts, respectively, with an overlap of 122 SNP–target pairs (Supplementary Fig. S4E; Supplementary Table S9). Of the 247 high-specificity LD SNP–target pairs: (i) 45 gene targets were shared between the tag and LD SNPs and were transcriptionally associated with both; (ii) 7 gene targets were common to the tag and LD SNPs, but were transcriptionally associated with LD SNPs only; and (iii) 195 gene targets were unique to the LD SNPs only (Fig 5B; Supplementary Fig. S4F). In comparison with the unique targets identified only for LD SNPs, a greater fraction of the common targets was enriched for direct targets (type I and type II LD SNP–target pairs; Fig. 5C).
We identified LD SNPs that were enriched for 2D chromatin features (histone modifications and transcription factor binding) and 3D chromatin interactions (RNA Pol II ChIA-PET and RAD21 ChIA-PET; Fig. 5D and E; Supplementary S4G and S4H). The LD SNPs were enriched in the enhancers of the benign RWPE-1 cells, and the AR-positive LNCaP and VCaP prostate cancer cells, but not in the AR-negative DU145 prostate cancer cells. We studied rs461251, an allele in LD with rs684232, one of the best-characterized prostate cancer GWAS risk alleles (7, 14, 36, 37). The two SNPs are located ∼200 bp apart. Both the tag SNP rs684232 and the LD SNP rs461251 were associated with loss of enhancer activity (decreased H3K27ac) and downregulation of VPS53, TLCD3A, and GEMIN4 transcript in patient tumors (Supplementary Fig. S4I–S4M).
Two parallel prostate cancer GWAS reported the tag SNPs rs7584330 and rs2292884 (38, 39). Given shared ancestry among study subjects and LD between these two SNPs, the association perhaps reflected the same signal—although rs7584330 gave a stronger signal. Both these lead SNPs had a paucity of 2D chromatin features and 3D chromatin interactions and lacked clear gene targets. We therefore hypothesized that other LD alleles may be functionally relevant. The SNP rs6760842 is in LD with rs7584330 and its locus exhibited an abundance of chromatin features; it lies in an H3K27ac-marked intronic enhancer of MLPH in LNCaP and VCaP cells. This enhancer participated in RNA Pol II–associated chromatin interactions with the promoter of MLPH, suggesting MLPH as the target gene (Fig. 5F). rs6760842 was associated with downregulation of the MLPH transcript in patient tumors (FDR = 7.52 × 10−5; Fig. 5G). rs6760842 was associated with decreased H3K27ac at the locus, indicating a loss of enhancer activity (Fig. 5H, left). Tumors heterozygous for rs6760842 exhibited an enrichment of the WT allele in the H3K27ac region (Fig. 5H, right). Together, these data support the idea that rs6760842 is the functional SNP in this GWAS region, influencing prostate cancer etiology through changing MLPH enhancer activity.
3D Genomic Architecture and Transcriptome-Wide Association Study Hits
Transcriptome-wide association studies (TWAS) have identified both candidate risk genes at known prostate cancer risk regions and novel candidate risk genes outside of them (40, 41). To determine the genomic features underlying transcription control of TWAS genes, we considered 222 published SNP–TWAS target pairs (Supplementary Table S10). Of these, 69 SNP–TWAS target pairs shared the prostate cancer risk SNPs included in this study, and for 27 of 69, we provided additional 3D genomics support. Six out of these 27 were transcriptionally validated high-specificity SNP–target pairs (Fig. 6A).
The 222 published SNP–TWAS target pairs include 86 SNPs from TCGA prostate tumors (40). We extended our analysis pipeline to identify the transcriptional targets of these 86 SNPs, as well as the 371 SNPs in LD with these SNPs. Many TWAS SNPs and LD SNPs were in transcriptionally active regions (Supplementary Fig. S5A and S5B; Supplementary Table S11). By applying an FDR <0.2, we observed an overlap of 9 TWAS SNP targets and 80 TWAS LD SNP targets, respectively, between the CPC-GENE and TCGA cohorts (Supplementary Fig. S5C and S5D; Tables S12 and S13). A total of 33 TWAS SNPs were assigned high-specificity targets by ChIA-PET directly or pairing with LD SNPs followed by ChIA-PET (Supplementary Fig. S5E). The overlap between the 86 TWAS SNPs and 185 prostate cancer risk SNPs was 23; seven of these were assigned high-specificity targets (Supplementary Fig. S5F). Eight out of the 12 TWAS SNPs with high-specificity targets identified by ChIA-PET could be rediscovered by TWAS (Supplementary Fig. S5G).
As a representative example, the prostate cancer risk SNP rs6465657 is in an intron of LMTK2 (42) and has been associated with three target genes via TWAS—LMTK2, BHLHA15, and TECPR1 (40). The locus harboring rs6465657 had a paucity of 2D chromatin features and 3D chromatin interactions, and therefore we hypothesized that other LD alleles may be functionally relevant. rs6965016 is in perfect LD with the tag SNP rs6465657—the patient tumors with the AA, AB, and BB genotypes were identical for both these SNPs in all cohorts. The locus harboring the LD SNP rs6965016 exhibited an abundance of 2D and 3D genome features, including co-occupancy of AR, FOXA1, and ERG, and the establishment of H3K27ac marked active enhancer (Fig. 6B). This locus also exhibited RNA Pol II–associated chromatin interactions encompassing the LMTK2, BHLHA15, and TECPR1 genes. rs6965016 was associated with the downregulation of LMTK2 (FDR = 0.164), BHLHA15 (FDR = 0.160), and TECPR1 (FDR = 0.049) in patient tumors (Fig. 6C). rs6965016 was also associated with reduced AR occupancy (Fig. 6D, top) and decreased H3K27ac levels—indicating a reduction in enhancer activity—in patient tumors (Fig. 6D, bottom). These results suggest that rs6965016 impedes the recruitment of AR, thereby reducing enhancer activity and suppressing the transcription of LMTK2, BHLHA15, and TECPR1.
Cell Lineages Associated with the Transcriptomic Targets of Prostate Cancer Risk Alleles
The prostate gland is composed of diverse cellular lineages (43, 44). Although prostate cancer is an epithelial cancer, the microenvironment surrounding the epithelial cells contributes to carcinogenesis (45). To define the expression of the prostate cancer risk allele target genes in the major cellular lineages of the human prostate gland, we examined the mRNA abundance of the transcriptionally validated high-sensitivity target genes of both tag SNPs and LD SNPs in the flow-sorted luminal epithelial, basal epithelial, other epithelial, and stromal cells of the normal human prostate gland (43). Unsupervised analysis grouped target genes into four clusters (Fig. 7A; Supplementary Table S14). Cluster 1 (153 genes) was associated with the stromal cells and was enriched in apoptotic pathways (Fig. 7B). Cluster 2 (171 genes) was associated with the luminal epithelial cells and was enriched in membrane tethering processes and transcriptional regulation by RNA Pol II. Cluster 3 (323 genes) was associated with the basal and other epithelial cells and was predominantly enriched in developmental pathways. The largest of these, cluster 4 (464 genes), was expressed in all the major cell types of the prostate gland and was enriched in metabolic processes.
We also examined the transcript abundance of a subset of 20 high-specificity target genes (Fisher combined FDR <0.05) in single-cell RNA sequencing (RNA-seq) of the normal human prostate gland (43). We detected transcript abundance for 15 of these 20 genes at the single-cell level (Supplementary Fig. S6A; Supplementary Table S15). Consistent with bulk RNA-seq analysis, transcriptional targets of prostate cancer risk alleles were expressed in multiple cell lineages of the prostate gland (Supplementary Fig. S6B–S6G). This analysis enabled us to resolve cell type–specific aspects of transcription control by prostate cancer risk alleles. For example, the transcriptional targets of the prostate cancer risk SNP rs8102476—PPP1R14A and SPINT2 genes—were expressed in distinct cellular lineages. SPINT2 expression was restricted to the epithelial cell lineages (luminal, basal, hillock, club, and neuroendocrine cells), whereas PPP1R14A expression was predominantly in smooth muscle cells. By integrating ChIA-PET, single-cell RNA-seq, and bulk RNA-seq data from cell lines, normal prostate, and patient tumors, respectively, we hypothesize that rs8102476 is associated with the downregulation of PPP1R14A in smooth muscle cells and the concomitant upregulation of SPINT2 in epithelial cell lineages (Fig. 4; Supplementary Fig. S6A–S6D).
We associate 87 risk SNPs with high-specificity transcriptionally validated targets in this study: 31 prostate cancer risk SNPs for which targets were defined by ChIA-PET, 48 prostate cancer risk SNPs for which targets were defined by linear proximity, and 40 prostate cancer risk SNPs for which targets were defined by pairing with LD SNPs, followed by ChIA-PET analysis (Fig. 7C). We hypothesize that closely spaced risk alleles may perhaps represent the same signal. As described earlier, rs7584330 and rs2292884 are less than 100 kbp apart and are likely to represent the same signal (Fig. 5F–H). Therefore, by setting distance cutoffs of 1 Mbp, 100 kbp, and 10 kbp, the 185 risk SNPs analyzed in this study were binned into 140, 157, and 177 clusters, respectively (Fig. 7D). Remarkably, regardless of the bin size, we have identified high-specificity SNP targets in >50% of the clusters (85/140, 1 Mbp clusters; 92/157, 100 kbp clusters; 106/177, 10 kbp clusters; Fig. 7D; Supplementary Table S16).
We have summarized the molecular and cellular features of the 104 high-specificity SNP–target pairs that were transcriptionally validated in patient cohorts (Supplementary Table S17). A large majority of high-specificity SNP–target pairs do not involve the nearest genes (91/104). We further filtered the 104 high-specificity SNP–target pairs to identify a subset of 20 SNP–target pairs (Fisher combined FDR <0.05; Fig. 7E; Supplementary Table S17). Of these 20 SNP–target pairs, (i) 4 SNP–target pairs were previously discovered by TWAS analysis (40), and (ii) 9 SNP–target pairs do not involve the nearest genes. The summary heat maps show how multiple lines of orthogonal evidence converge to illuminate transcription control by prostate cancer germline risk alleles. In conclusion, we have summarized the SNPs that overlapped between published data sets and ours; we have reported the SNP–target pairs that were rediscovered in our data set; we also report the number of novel SNP–target pairs discovered by our 3D genomics analysis (Supplementary Tables S18–S20; refs. 10, 14, 25, 29, 40, 46–49).
DISCUSSION
It has been long recognized that germline risk alleles are enriched in gene-regulatory elements, yet the specific transcriptional targets of most disease-associated variants have remained elusive. We detect a hitherto unknown pervasive germline–somatic interplay between the prostate cancer risk alleles and somatic loss of PTEN or RB1. These data suggest that the transcriptional effects of prostate cancer risk alleles are not static, but are dynamically tuned by the mutation profile of the cancer. Our discoveries build on prior studies where 3D genome interaction data were used to resolve complex genetic puzzles, and, in so doing, provide a conceptual framework for transcription control by noncoding variants (50, 51). We suggest that 3D genome information is critical to deconvolving the relationship between risk SNP location and risk SNP function.
A limitation of our study is the lack of evidence on several prostate cancer risk SNPs. This can be partially addressed by (i) analyses of larger better-powered cohorts, and (ii) conducting ChIA-PET analysis of the human prostate and its distinct cell types. Additional mechanisms that are not directly linked to genome architecture or transcription may be in play (some SNPs could be hotspots for DNA breaks or rearrangements). The effect sizes were modest in many cases. This is presumably because of an inherent selection against common strong effect variants. Mendelian variants have a high effect size but are present in low frequency in the population. GWAS hits have a relatively lower effect size but are present in higher frequency in the population.
Although trans-acting proteins, histone modifications, and DNA methylation are commonly linked to transcriptional regulation, we suggest that DNA sequence variations can also function as transcriptional activators/repressors by promoting/preventing the recruitment of master transcription factors (tuning enhancer activity) and/or adjusting the boundaries of insulated neighborhoods (tuning enhancer connectivity). We hypothesize that akin to the combinatorial binding of transcription factors, such as AR and FOXA1, the combinatorial interaction among cis-regulatory DNA elements can also influence transcription.
Integrating 3D genome data with polygenic risk scores (PRS) can provide insights into genetic pathways associated with an individual's lifetime prostate cancer risk and can usher in the development of intervention strategies to prevent or delay the disease (21). The data from analyses such as these may be used as priors when developing PRS moving forward. In this way, variants with regulatory influences on known prostate cancer pathways may be upweighted while also balancing the fact that risk variants may not all work directly on known pathways and may have biology still to uncover. Studies such as ours may help prioritize variants that fall slightly below typical GWAS significance thresholds but may still inform on risk. Understanding risk SNP function can not only help distinguish cancer risk but also provide further insights into the tumor biology and potential vulnerabilities.
Our study has filled in a key knowledge gap in the field and can stimulate the transition of the GWAS field from association to function. The enhancer–target connectome data emanating from cell line 3D genome studies (representing prostatic lineage) can in principle be used to interpret prostate cancer GWAS signals across all ancestries and will have tremendous value in addressing prostate cancer health disparities. The accuracy of PRS greatly varies across ancestry populations. Mechanistic insights into the function of risk variants along with an increasing understanding of the molecular differences in prostate tumors across diverse populations may help us develop more accurate and generalizable PRS.
Although our study is limited to prostate cancer, our approach can be broadly applied to study other complex diseases with an underlying genetic basis. Integration of GWAS data with population-based LD maps, cell type–based 3D genome maps, and clinical genomics measurements shows significant promise to advance our understanding of the etiology of complex diseases.
METHODS
Preprocessing of Omics Data
The omics data analyzed in this study include ChIA-PET, ChIP-seq, and RNA-seq from benign RWPE-1 cells, AR-positive LNCaP and VCaP prostate cancer cells, and AR-negative DU145 prostate cancer cells. ChIA-PET data were processed using the ChIAPET2 tool, and all peaks and intrachromosome interactions were used in the downstream analysis (52). ChIP-seq data were processed using bowtie for mapping and MACS2 for peak calling (53). RNA-seq data were processed by TopHat for mapping (54). All the details about the data preprocessing are described elsewhere (7).
Summary of Cohort Data
In this study, three cohorts were used for target validation.
TCGA cohort
The TCGA PRAD (prostate adenocarcinoma) was used as a validation cohort (22). We included samples with high concordance (>80%) between SNP6 array and whole-exome sequencing genotypes, and of European descent (338 samples). Genotypes were imputed using the Sanger Imputation Service—prephasing using Shapeit2 (55), imputation using PBWT (56), and the Haplotype Reference Consortium (release 1.1) panel (57).
CPC-GENE cohort
We interrogated a previously published cohort of 127 intermediate-risk prostate cancer samples that underwent RNA-seq described previously (23). Germline SNPs were first identified using GATK (v3.4.0-3.7.0) for each patient individually using HaplotypeCaller followed by VariantRecalibration and ApplyRecalibration. Individual variant call formats (VCF) were merged using bcftools (v.1.8) assuming SNPs not present in an individual VCF were a homozygous reference. The minor allele frequency (MAF) of all SNPs within the merged VCF was calculated and filtered to consider only SNPs with MAF >0.01 (n = 10,058,344). Next, all patients were regenotyped using GATK (v.4.0.2.1) at these sites to produce genomic variant call formats (gVCF; i.e., with option -ERC GVCF). Individual gVCFs were merged using GenomicsDBImport, and joint genotyping was run using GenotypeGVCFs. Finally, SNPs were recalibrated using VariantRecalibrator and ApplyVQSR. Ancestry was determined based on genetic similarity to populations from the 1000 Genomes Project as reported previously (58). Only individuals of European ancestry were included in subsequent analyses (n = 127).
Porto cohort
RNA-seq and ChIP-seq for AR, H3K27ac, H3K27me3, and H3K4me3 profiles from 100 prostate cancer samples in the Porto cohort were integrated in this study (24). RNA-seq processed data were downloaded from the Gene Expression Omnibus (GEO) database and raw ChIP-seq data were downloaded from Sequence Read Archive data (GSE120741).
Definition of Peak and Genomic Regions
All RNA Pol II or RAD21 binding peaks and intrachromosome interactions were defined from the preprocessing results of ChIA-PET. Peaks that interact with other peaks are called anchor peaks or anchor regions. Binding peaks of other TFs and epigenetic marks, e.g., AR, FOXA1, CTCF, and H3K27ac, were identified by MACS2 with default parameters. The definition of promoters, gene bodies, and other regions was based on the annotation file Gencode V19. The DNA sequence between the upstream 500 bp and downstream 250 bp from transcription start sites was defined as promoters. Gene bodies include both exons and introns. After extracting the coordinates of genes from the annotation file, promoter regions were subtracted from whole gene regions, and then the remaining regions were defined as gene body regions. The regions that were not defined as promoters or gene bodies across the whole genome were defined as other regions. H3K27ac peaks are marks to define enhancer regions, apart from peaks that have overlaps with promoters.
Stratification of Patient Tumors Based on Somatic Driver Mutations
To assess the impact of somatic PTEN loss, RB1 loss, NKX3-1 loss, or ERG gene fusions on SNP–target association, the CPC-GENE samples were separated into somatic mutated and matched WT groups (30). For each SNP–target pair, eQTLs within somatic mutated and corresponding WT groups were quantified separately using the same linear regression model described above. Then two P values were calculated for each SNP–target pair, including P value for somatic mutated and matched WT groups.
Allelic Imbalance Analysis of ChIP-seq Data
Allelic imbalance analysis was conducted on ChIP-seq data of H3K27ac and AR from the Porto cohort. First, raw reads were mapped to the human genome by bwa. Second, the mapped bam files were processed by the WASP pipeline to remove the allele imbalance mapping biases (59). Third, reader groups were added by AddOrReplaceReadGroups (picard 2.10.3). Finally, allele-specific reads were counted by ASEReadCounter (GATK 4.1.4.0). Then, a paired Wilcox test was conducted to compare read counts mapped on reference allele and alternative allele for heterozygous patients on each SNP site.
RNA Extraction, cDNA Synthesis, and Quantitative RT-PCR
Total RNA was extracted from LNCaP cells using the RNeasy Plus Mini Kit (Qiagen, #74136), and cDNA synthesis was performed using the SuperScript VILO cDNA Synthesis Kit (Invitrogen, #11754050) following the manufacturer's instructions. Real-time quantitative reverse transcription PCR (qPCR) was carried out on QuantStudio 6 Flex Real-Time PCR System (Thermo Fisher) using PowerUp SYBR Green Master Mix (Thermo Fisher, #A25776) to verify the knockdown efficiency and relative gene expression. The fold change in CTBP2, RB1, or KLK3 mRNA expression levels was calculated by the comparative Ct method, using the formula 2−(ΔΔCt). The primer sequences for the PCR were RB (forward AGACCCAGAAGCCATTGAAA, reverse CTGGAAAAGGGTCCAGATGA), CTBP2 (forward GTCTGGGACACGATGAACCT, reverse CTTGTTCTTCCTTGGGGTCA), KLK3 (forward CATCAGGAACAAAAGCGTGA, reverse ATATCGTAGAGCGGGTGTGG), GAPDH (forward TGCACCACCAACTGCTTAGC, reverse GGCATGGACTGTGGTCATGAG).
siRNA Transfection
ON-TARGETplus Human RB (#L-003296-02-0005) and negative control small interfering RNAs (siRNA) were purchased from Horizon Discovery. siRNAs were transfected into cells using Lipofectamine RNAiMAX Transfection Reagent (Thermo Fisher, #13778-150) as described in the manufacturer's protocol.
Luciferase Reporter Assay
The 1 kilobase pair enhancer region containing the WT or risk allele rs4962416 was cloned into pGL4 luciferase reporter plasmid (Promega, #E665A) and verified by Sanger sequencing. LNCaP cells were seeded on 12-well plates and transfected with the indicated plasmids. For DHT treatment, 5a-Androstan-17b-OL-3-one was purchased from Sigma-Aldrich (#A8380) and dissolved in ethanol. The plated cells were washed with PBS two times, replaced with 10% charcoal-stripped FBS media for 2 days, and treated with 10 nmol/L of DHT for 24 hours. The Firefly luciferase/Renilla luciferase activities were measured using the Dual-Glo luciferase assay system as described in the manufacturer's protocol (Promega, #E2940).
Predicting the Targets of Transcription Factors
We integrated ChIP-seq of RB1 and ERG with RNA Pol II ChIA-PET from the same cell line to predict targets of these factors. ChIP-seq peaks were used to define RB1/ERG binding sites, and if the peak was overlapping with one peak anchor from RNA Pol II ChIA-PET, the genes located in the paired peak anchors would be defined as targets. These genes in the nearest to ChIP-seq binding peaks and randomly selected genes were taken as control groups.
Data Visualization
The track files for multiple factors around different gene loci were visualized by the Integrative Genomics Viewer (IGV; ref. 60). A heat map was drawn with R package BoutrosLab.plotting.general. Scatter plot and balloon plot were drawn with R package ggpubr. Jensen–Shannon score was used to measure the expression distance between transcripts in Fig. 7A, and then k-means clustering was performed (k = 4).
Statistical Analysis
The significance of SNP enrichment was estimated by an empirical test (Fig. 1D and E). The significance of mRNA abundance changes was estimated by the linear regression model, and P value and β value were from the linear model. The significance of the ChIP-seq binding signal intensity difference was estimated by the Mann–Whitney test considering a recessive model. The significance of transcriptional changes after somatic mutation stratification was estimated by the Wilcoxon test (Fig. 3A). The significance of target position and number of validated cohorts was estimated by the χ2 test (Fig. 5C). The significance of allele-specific binding was estimated by the paired Wilcoxon test (Figs. 2D and 5H). P values for qPCR analysis were obtained using a two-tailed Student t test. Error bars indicate the SD of 3 technical replicates. P values for the luciferase reporter assay were obtained using the Benjamini–Hochberg procedure. Error bars indicate the SD of 6 technical replicates.
Data Availability
The RNA Pol II ChIA-PET data are available in the GEO under accession number GSE121020, and RAD21 ChIA-PET was downloaded from GEO by accession numbers GSE127018 and GSE127041. ERG ChIP-seq data were downloaded from GEO by accession number GSM1328979 and RB1 ChIP-seq data were downloaded from GEO by accession numbers GSM1974982 and GSM1974981.
Authors’ Disclosures
P.C. Boutros reports grants from the NCI, NIH, and the Prostate Cancer Foundation during the conduct of the study; other support from Sage Bionetworks, Intersect Diagnostics Inc., and BioSymetrics Inc. outside the submitted work; and a patent for germline determinants of prostate cancer aggression pending and a patent for germline determinants of somatic mutations pending. No disclosures were reported by the other authors.
Authors’ Contributions
J. Yuan: Conceptualization, data curation, formal analysis, investigation, methodology, writing–original draft, writing–review and editing. K.E. Houlahan: Resources, formal analysis, investigation, methodology, writing–review and editing. S.G. Ramanand: Conceptualization, investigation, methodology. S. Lee: Formal analysis, investigation, methodology. G. Baek: Conceptualization, investigation, methodology. Y. Yang: Formal analysis, investigation, methodology. Y. Chen: Conceptualization, methodology. D.W. Strand: Data curation, formal analysis. M.Q. Zhang: Conceptualization, resources, investigation. P.C. Boutros: Conceptualization, resources, formal analysis, investigation, methodology, writing–original draft, writing–review and editing. R.S. Mani: Conceptualization, formal analysis, supervision, funding acquisition, investigation, writing–original draft, project administration, writing–review and editing.
Acknowledgments
We thank James Malter, Diego Castrillon, and Ashish Kapoor for their insightful comments and discussions. R.S. Mani acknowledges funding support from National Cancer Institute (NCI)/NIH grant (R01CA245294), Cancer Prevention and Research Institute of Texas (CPRIT) Individual Investigator Research Award (RP190454), US Department of Defense Prostate Cancer Research Program (PCRP) Impact Award (W81XWH-17-1-0675), and US Department of Defense Breakthrough Award (W81XWH-21-1-0114). K.E. Houlahan was supported by a Vanier Canada Graduate Scholarship. This work was also supported by the NIH/NCI under awards P30CA016042, U01CA214194, and U24CA248265 and by the Prostate Cancer Foundation Special Challenge Award to P.C. Boutros (Award ID #: 20CHAS01) made possible by the generosity of Mr. Larry Ruvo. The results described here are in part based on data generated by TCGA (http://cancergenome.nih.gov), managed by the NCI and the National Human Genome Research Institute (NHGRI).
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Discovery Online (http://cancerdiscovery.aacrjournals.org/).