Abstract
Prostate cancer is a biologically heterogeneous disease with variable molecular alterations underlying cancer initiation and progression. Despite recent advances in understanding prostate cancer heterogeneity, better methods for classification of prostate cancer are still needed to improve prognostic accuracy and therapeutic outcomes. In this study, we computationally assembled a large virtual cohort (n = 1,321) of human prostate cancer transcriptome profiles from 38 distinct cohorts and, using pathway activation signatures of known relevance to prostate cancer, developed a novel classification system consisting of three distinct subtypes (named PCS1–3). We validated this subtyping scheme in 10 independent patient cohorts and 19 laboratory models of prostate cancer, including cell lines and genetically engineered mouse models. Analysis of subtype-specific gene expression patterns in independent datasets derived from luminal and basal cell models provides evidence that PCS1 and PCS2 tumors reflect luminal subtypes, while PCS3 represents a basal subtype. We show that PCS1 tumors progress more rapidly to metastatic disease in comparison with PCS2 or PCS3, including PSC1 tumors of low Gleason grade. To apply this finding clinically, we developed a 37-gene panel that accurately assigns individual tumors to one of the three PCS subtypes. This panel was also applied to circulating tumor cells (CTC) and provided evidence that PCS1 CTCs may reflect enzalutamide resistance. In summary, PCS subtyping may improve accuracy in predicting the likelihood of clinical progression and permit treatment stratification at early and late disease stages. Cancer Res; 76(17); 4948–58. ©2016 AACR.
Introduction
Prostate cancer is a heterogeneous disease. Currently defined molecular subtypes are based on gene translocations (1, 2), gene expression (3, 4), mutations (5–8), and oncogenic signatures (9, 10). In other cancer types, such as breast cancer, molecular classifications predict survival and are routinely used to guide treatment decisions (11, 12). However, the heterogeneous nature of prostate cancer, and the relative paucity of redundant genomic alterations that drive progression, or that can be used to assess likely response to therapy, have hindered attempts to develop a classification system with clinical relevance (13).
Recently, molecular lesions in aggressive prostate cancer have been identified. For example, overexpression of the androgen receptor (AR) due to gene amplification has been observed in castration-resistant prostate cancer (CRPC) (14). Presence of AR variants (AR-V) that do not require ligand for activation have been reported in a large percentage of CRPCs and have been correlated with resistance to AR-targeted therapy (15). The oncogenic function of enhancer of zeste homolog 2 (EZH2) was found in cells of CRPC, and recurrent mutations in the speckle-type POZ protein (SPOP) gene occur in approximately 15% of prostate cancers (16, 17). Expression signatures related to these molecular lesions have also been developed to predict patient outcomes. While, in principle, signature-based approaches could be used independently in small cohorts (4, 10), there is a potential for an increase in diagnostic or prognostic accuracy if signatures reflecting gene expression perturbations relevant to prostate cancer could be applied to large cohorts containing thousands of clinical specimens.
Here we present the results of an integrated analysis of an unprecedentedly large set of transcriptome data, including from over 4,600 clinical prostate cancer specimens. This study revealed that RNA expression data can be used to categorize prostate cancer tumors into 3 distinct subtypes, based on molecular pathway representation encompassing molecular lesions and cellular features related to prostate cancer biology. Application of this subtyping scheme to 10 independent cohorts and a wide range of preclinical prostate cancer models strongly suggest that the subtypes we define originate from inherent differences in prostate cancer origins and/or biological features. We provide evidence that this novel prostate cancer classification scheme can be useful for detection of aggressive tumors using tissue as well as blood from patients with progressing disease. It also provides a starting point for development of subtype-specific treatment strategies and companion diagnostics.
Materials and Methods
Merging transcriptome datasets and quality control
To assemble a merged dataset from diverse microarray and high-throughput sequencing platforms, we applied a median-centering method followed by quantile scaling (MCQ; ref. 18). Briefly, each dataset was normalized using the quantile method (19). Probes or transcripts were assigned to unique genes by mapping NCBI entrez gene IDs. Redundant replications for each probe and transcript were removed by selecting the one with the highest mean expression. Log2 intensities for each gene were centered by the median of all samples in the dataset. Each of the matrices was then transformed into a single vector. The vectors for the matrices were scaled by the quantile method to avoid a bias toward certain datasets or batches with large variations from the median values. These scaled vectors were transformed back into the matrices. Finally, the matrices were combined by matching the gene IDs in the individual matrices, resulting in a merged dataset of 2,115 samples by 18,390 human genes. To evaluate the MCQ-based normalization strategy, we applied the XPN (cross platform normalization; ref. 20) method to the same datasets and compared it with the merged data from MCQ. Multidimensional scaling (MDS) between samples was performed to assess batch effects. The same MCQ approach with the quantile method, or the single channel array normalization (SCAN) method (21), was also applied for normalization and batch correction of data from the independent cohorts.
Computing pathway activation score
We used the Z-score method to quantify pathway activation (22). Briefly, the Z-score was defined by the difference between the error-weighted mean of the expression values of the genes in a gene signature and the error-weighted mean of all genes in a sample after normalization. Z-scores were computed using each signature in the signature collection for each of the samples, resulting in a matrix of pathway activation scores.
Determination of the optimal number of clusters
Non-negative matrix factorization (NMF) clustering with a consensus approach is useful to elucidate biologically meaningful classes (23). Thus, we applied the consensus NMF clustering method (24) to identify the optimal number of clusters. NMF was computed 100 times for each rank k from 2 to 6, where k was a presumed number of subtypes in the dataset. For each k, 100 matrix factorizations were used to classify each sample 100 times. The consensus matrix with samples was used to assess how consistently sample-pairs cluster together. We then computed the cophenetic coefficients and silhouette scores for each k, to quantitatively assess global clustering robustness across the consensus matrix. The maximum peak of the cophenetic coefficient and silhouette score plots determined the optimal number of clusters.
Classification using a 14-pathway classifier
We constructed a classifier, where a set of predictors consists of 14 pathways, using a naïve Bayes machine learning algorithm. For training the classifier, we used the pathway activation scores and subtype labels of the result of the NMF clustering process. We then computed the misclassification rate using stratified 10-fold cross validation. To assess performance, we adopted a 3-class classification as a 2-class classification (e.g., PCS1 vs. others) and computed the average area under the receiver operating characteristic (ROC) curves from all 3 of 2-class classifications. Finally, we applied the 14-pathway classifier to assign subtypes to the specimens.
Identifying subtype-enriched genes
Wilcoxon rank-sum test and subsequent false discovery rate (FDR) correction with Storey's method (25) were employed to identify differentially expressed genes between the subtypes. Genes were selected with FDR < 0.001 and fold change ≥ 1.5, resulting in 428 subtype-enriched genes (SEG).
Development of a 37-gene diagnostic panel
A random forest machine learning algorithm was employed to develop a diagnostic gene panel. For parameter estimation and training the model, we used the merged dataset. Initially, the model comprised of the 428 SEGs as a set of predictors and subtype label of the merged dataset was used as a response variable for model training. To verify the optimal leaf size, we compared the mean squared errors (MSE) obtained by classification of leaf sizes of 1 to 50 with 100 trees, resulting in an optimal leaf size of 1 for model training. We then permuted the values for each gene across every sample and measured how much worse MSE became after the permutation. Imposing a cutoff of importance score at 0.5, we selected the 37 genes for subtyping. From the computation of MSE growing 100 trees on 37 genes and on the 428 SEGs, the 37 genes we chose gave the same MSE as the full set of 428 genes. ROC curve analyses and 10-fold cross-validation were also conducted to assess the performance of a classification ensemble.
Statistical analysis
We performed principal component analysis (PCA) and MDS for visualizing the samples to assess their distribution using pathway activation profiles. Wilcoxon rank-sum statistics were used to test for significant differences in pathway activation scores between the subtypes. Kaplan–Meier analysis, Cox proportional hazard regression, and the χ2 test were performed to examine the relationship(s) between clinical variables and subtype assignment. The OR test using dichotomized variables was conducted to investigate relationships between different subtyping schemes. The MATLAB package (Mathworks) and the R package (v.3.1 http://www.r-project.org/) were used for all statistical tests.
Results
A prostate cancer gene expression atlas
To achieve adequate power for a robust molecular classification of prostate cancer, we initially collected 50 prostate cancer datasets from three public databases: Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo), ArrayExpress (http://www.ebi.ac.uk/arrayexpress), and the UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu) and selected 38 datasets (Supplementary Table S1), in which the numbers of samples are larger than 10 and where over 10,000 genes were measured (Fig. 1A). This collection contains datasets consisting of 2,790 expression profiles of benign prostate tissue, primary tumors, and metastatic or CRPC (CRPC/Met; Fig. 1B). We then removed a subset of samples with ambiguous clinical information and generated a single merged dataset by cross study normalization, based on median-centering and the quantile normalization method (MCQ; ref. 18). The merged dataset consists of 1,321 tumor specimens that we named the Discovery (DISC) cohort. The merged gene expression profiles showed a significant reduction of systematic, dataset-specific bias in comparison with the same dataset corrected by the XPN method, which is also used for merging data from different platforms (20) (Fig. 1C). Biological differences between tumors and benign tissues were also maintained while minimizing batch effects (Fig. 1D).
Integration of prostate cancer transcriptome and quality control. A, schematic showing the process of collecting and merging prostate cancer transcriptomes. B, clinical composition of 2,115 prostate cancer cases. C, MDS of merged expression profiles after MCQ or XPN correction in the DISC cohort. Dots with different colors represent different batches or datasets. D, hierarchical clustering illustrates the sample distribution of uncorrected (top), corrected by MCQ (middle), and corrected by XPN (bottom). Different colors on “Batches” rows represent different batches or datasets from the individual studies. E, MDS of pathway activation profiles in the DISC cohort shows distribution of the samples from same batches. Dots with different colors represent different batches or datasets.
Integration of prostate cancer transcriptome and quality control. A, schematic showing the process of collecting and merging prostate cancer transcriptomes. B, clinical composition of 2,115 prostate cancer cases. C, MDS of merged expression profiles after MCQ or XPN correction in the DISC cohort. Dots with different colors represent different batches or datasets. D, hierarchical clustering illustrates the sample distribution of uncorrected (top), corrected by MCQ (middle), and corrected by XPN (bottom). Different colors on “Batches” rows represent different batches or datasets from the individual studies. E, MDS of pathway activation profiles in the DISC cohort shows distribution of the samples from same batches. Dots with different colors represent different batches or datasets.
As validation datasets, we assembled another collection of 12 independent cohorts consisting of 2,728 tumors from primary and CRPC/Met samples (Table 1). From this collection, 3 datasets, the Swedish watchful waiting cohort (SWD), the Emory cohort (EMORY), and the Health Study Prostate Tumor cohort (HSPT), were obtained from GEO. The gene expression profiles and clinical annotations of The Cancer Gnome Atlas (TCGA) cohort of 333 prostate cancer and SU2C/PCF Dream Team cohort (SU2C) of 118 CRPC/Mets were obtained from cBioPortal (http://www.cbioportal.org/). Seven additional cohorts were obtained from the Decipher GRID database (GRID). The expression datasets from the GRID were generated using a single platform, the Affymetrix Human Exon 1.0 ST Array, using primary tumors for the purpose of developing outcomes and treatment response signatures. We used these 7 cohorts to investigate associations of clinical outcomes with subtype assignment in this study.
List of independent cohorts for validation of the subtypes
Cohort name . | Number of samples . | Disease status . | Available clinical outcomes . | Data from GRID . | Abbreviation . | PubMed ID . |
---|---|---|---|---|---|---|
Swedish Watchful-Wainting Cohort | 281 | Localized | OS | No | SWD | 20233430 |
The Cancer Genome Anatomy | 333 | Localized | N.A. | No | TCGA | 26000489 |
Emory University | 106 | Localized | N.A. | No | EMORY | 24713434 |
Health Professionals Follow-up Study and Physicians' Health Study Prostate Tumor Cohort | 264 | Localized | N.A. | No | HSPT | 25371445 |
Stand Up To Cancer/Prostate Cancer Foundation Dream Team Cohort | 118 | CRPC/Met | N.A. | No | SU2C | 26000489 |
Mayo Clinic Cohort 1 | 545 | Localized | PMS, TMP, PCSM | Yes | MAYO1 | 23826159 |
Mayo Clinic Cohort 2 | 235 | Localized | PMS, TMP, PCSM | Yes | MAYO2 | 23770138 |
Thomas Jefferson University cohort | 130 | Localized | PMS, TMP, PCSM | Yes | TJU | 25035207 |
Cleveland Clinic Foundation Cohort | 182 | Localized | PMS, TMP, PCSM | Yes | CCF | 25466945 |
Memorial Sloan Kettering Cancer Center cohort | 131 | Localized | PMS, PCSM | Yes | MSKCC | 20579941 |
Erasmus Medical Centre Cohort | 48 | Localized | PMS, PCSM | Yes | EMC | 23319146 |
Johns Hopkins Medicine Cohort | 355 | Localized | PMS, TMP, PCSM | Yes | JHM | 25466945 |
Cohort name . | Number of samples . | Disease status . | Available clinical outcomes . | Data from GRID . | Abbreviation . | PubMed ID . |
---|---|---|---|---|---|---|
Swedish Watchful-Wainting Cohort | 281 | Localized | OS | No | SWD | 20233430 |
The Cancer Genome Anatomy | 333 | Localized | N.A. | No | TCGA | 26000489 |
Emory University | 106 | Localized | N.A. | No | EMORY | 24713434 |
Health Professionals Follow-up Study and Physicians' Health Study Prostate Tumor Cohort | 264 | Localized | N.A. | No | HSPT | 25371445 |
Stand Up To Cancer/Prostate Cancer Foundation Dream Team Cohort | 118 | CRPC/Met | N.A. | No | SU2C | 26000489 |
Mayo Clinic Cohort 1 | 545 | Localized | PMS, TMP, PCSM | Yes | MAYO1 | 23826159 |
Mayo Clinic Cohort 2 | 235 | Localized | PMS, TMP, PCSM | Yes | MAYO2 | 23770138 |
Thomas Jefferson University cohort | 130 | Localized | PMS, TMP, PCSM | Yes | TJU | 25035207 |
Cleveland Clinic Foundation Cohort | 182 | Localized | PMS, TMP, PCSM | Yes | CCF | 25466945 |
Memorial Sloan Kettering Cancer Center cohort | 131 | Localized | PMS, PCSM | Yes | MSKCC | 20579941 |
Erasmus Medical Centre Cohort | 48 | Localized | PMS, PCSM | Yes | EMC | 23319146 |
Johns Hopkins Medicine Cohort | 355 | Localized | PMS, TMP, PCSM | Yes | JHM | 25466945 |
Abbreviations: N.A., not available; OS, overall survival; PMS, progression to metastatic state; PCSM, PC-specific mortality; TMP, time-to-metastatic progression.
Pathway activations describing prostate cancer biology
Recent studies have demonstrated the advantage of pathway-based analysis in clinical stratification for prostate and other cancer types (10, 26, 27), However, to date, there has been no study of prostate cancer using pathway activation profiles in which thousands of patient specimens were used. In addition, the utility of recently characterized molecular lesions such as AR amplification/overexpression, AR-V expression, transcriptional activation of EZH2 and forkhead box A1 (FOXA1), and SPOP mutation have not been fully exploited for classification. Therefore, we employed 22 pathway activation gene expression signatures encompassing prostate cancer–relevant signaling and genomic alterations (Supplementary Tables S2 and S3) in the DISC cohort (n = 1,321). These were ultimately collapsed into 14 pathway signatures that were grouped into 3 categories: (i) prostate cancer–relevant signaling pathways, including activation of AR, AR-V, EZH2, FOXA1, and rat sarcoma viral oncogene homolog (RAS) and inactivation by polycomb repression complex 2 (PRC); (ii) genetic and genomic alterations, including mutation of SPOP, TMPRSS2–ERG fusion (ERG), and deletion of PTEN; and (iii) biological features related to aggressive prostate cancer progression, including stemness (ES), cell proliferation (PRF), epithelial–mesenchymal transition (MES), proneural (PN), and aggressive prostate cancer with neuroendocrine differentiation (AV). Pathway activation scores were computed in each specimen in the DISC cohort using the Z-score method (22). The conversion of gene expression to pathway activation showed a further reduction of batch effects, while preserving biological differences that are particularly evident in the clustering of metastatic and nonmetastatic samples (Fig. 1E).
Identification and validation of molecular subgroups
We performed unsupervised clustering based on consensus NMF clustering (24) using the 14 pathway activation profiles in the DISC cohort. A consensus map of the NMF clustering results shows clear separation of the samples into three clusters (Fig. 2A). To identify the optimal number of clusters and to assess robustness of the clustering result, we computed the cophenetic coefficient and silhouette score using different numbers of clusters (2–6). These results indicate that 3 clusters is a statistically optimal representation of the data (Fig. 2B). A heatmap of 3 sample clusters demonstrates highly consistent pathway activation patterns within each group (Fig. 2C). These analyses suggest that the clusters correspond to three prostate cancer subtypes. We compared the magnitude of activation of each pathway across the 3 clusters evident in Fig. 2C using the Wilcoxon rank-sum test for pairwise comparisons (Supplementary Fig. S1). The PCS1 subtype exhibits high activation scores for EZH2, PTEN, PRF, ES, AV, and AR-V pathways. In contrast, ERG pathway activation predominates in PCS2, which is also characterized by high activation of AR, FOXA1, and SPOP. PCS3 exhibits high activation of RAS, PN, MES, while AR and AR-V activation are low.
Identification and validation of novel prostate cancer subtypes. A, consensus matrix depicts robust separation of tumors into three subtypes. B, changes of cophenetic coefficient and silhouette score at rank 2 to 6. C, pathway activation profiles of 1,321 tumors defines three prostate cancer subtypes. D, score plot of PCA for benign and three subtypes. E and F, the three subtypes were recognized in 10 independent cohorts. G and H, correlation of pathway activation profiles in 8 prostate cancer cell lines from the CCLE and 11 prostate cancer mouse models and probability from the pathway classifier.
Identification and validation of novel prostate cancer subtypes. A, consensus matrix depicts robust separation of tumors into three subtypes. B, changes of cophenetic coefficient and silhouette score at rank 2 to 6. C, pathway activation profiles of 1,321 tumors defines three prostate cancer subtypes. D, score plot of PCA for benign and three subtypes. E and F, the three subtypes were recognized in 10 independent cohorts. G and H, correlation of pathway activation profiles in 8 prostate cancer cell lines from the CCLE and 11 prostate cancer mouse models and probability from the pathway classifier.
High enrichment of PRC and low AR within PCS3 raises the question of whether this subtype is an artifact of contaminating nontumor tissues. However, PCA demonstrates that samples in PCS3 are as distinct from benign tissues as samples in the other subtypes (Fig. 2D). To further confirm the difference from benign tissue, we made use of a gene signature shown to discriminate benign prostate tissue from cancer in a previous study (28) and found a significant difference (P < 0.001) in all the tumors in the subtypes compared with benign tissues (Supplementary Fig. S2). These results demonstrate that prostate cancers retain distinct gene expression profiles between subtypes, which are not related to the amount of normal tissue contamination.
To validate the PCS classification scheme, a 14-pathway classifier was developed using a naïve Bayes machine learning algorithm (see details in Materials and Methods). This classifier was applied to 9 independent cohorts of localized tumors (i.e., SWD, TCGA, EMORY, HSPT, MAYO1/2, CCF, TJU, and JHM) and the SU2C cohort of CRPC/Met tumors. Out of these 10 independent cohorts, 5 cohorts (i.e., MAYO1/2, TJU, CCF, and JHM) were from the GRID (Fig. 2E; Table 1; ref. 29). The 14-pathway classifier reliably categorized tumors in the DISC cohort into 3 subtypes, with an average classification performance = 0.89 (P < 0.001). The 3 subtypes were identified in all cohorts. Their proportions were similar across the localized disease cohorts, demonstrating the consistency of the classification algorithm across multiple practice settings (Fig. 2E). The 2 cohorts consisting of CRPC/Met tumors (DISC and SU2C) showed some differences in the frequency of PCS1 and PCS3; the most frequent subtype in the DISC CRPC/Met cohort was PCS1 (66%), while the most frequent subtype in SU2C was PCS3 (45%; Fig. 2F). PCS2 was the minor subtype in both CRPC/Met cohorts.
To determine whether the PCS classification is relevant to laboratory models of prostate cancer, we analyzed 8 human prostate cancer cell lines from The Cancer Cell Line Encyclopedia (CCLE; GSE36133; ref. 30) and 11 prostate cancer mouse models (31, 32). There are two datasets for mouse models. The first dataset (GSE53202) contains transcriptome profiles of 13 genetically engineered mouse models, including normal epithelium (i.e., wild-type), low-grade PIN (i.e., Nkx3.1 and APT), high-grade PIN, and adenocarcinoma (i.e., APT-P, APC, Myc, NP, Erg-P, and NP53), CRPC (i.e., NP-Ai), and metastatic prostate cancer (i.e., NPB, NPK, and TRAMP). Because of no available data for samples without drug treatment, the Nkx3.1 and APC models were excluded from this analysis. The second dataset (GSE34839) contains transcriptome profiles from mice with PTEN-null/KRAS activation mutation-driven high-grade, invasive prostate cancer and mice with only the PTEN-null background. This analysis revealed that all 3 prostate cancer subtypes were represented in the 8 human prostate cancer cell lines (Fig. 2G), while only 2 subtypes (PCS1 and PCS2) were represented in the mouse models (Fig. 2H). This result provides evidence that the subtypes are recapitulated in genetically engineered mouse models and persist in human cancer cells in cell culture.
Evaluation of PCS subtypes in comparison with other subtypes
Several categorization schemes of prostate cancer have been described, based mostly on tumor-specific genomic alterations and in some cases with integration of transcriptomic and other profiling data (10, 29, 33). This prompted us to compare the PCS classification scheme with the genomic subtypes derived by TCGA (34), because comprehensive genomic categorization was recently made available (35). We also compared the PCS classification with the subtypes recently defined by Tomlins and colleagues from RNA expression data (29). The Tomlins subtyping scheme is defined using the 7 GRID cohorts (i.e., MAYO1/2, TJU, CCF, MSKCC, EMC, and JHM) that we used for validating the PCS system. The large number of cases in the 7 GRID cohorts (n = 1,626) is comparable with our DISC cohort in terms of heterogeneity and complexity. TCGA identified several genomic subtypes, named ERG, ETV1, ETV4, FLI1, SPOP, FOXA1, IDH1, and “other.” Tomlins and colleagues described 4 subtypes based on microarray gene expression patterns that are related to several genomic aberrations [i.e., ERG+, ETS+, SPINK1+, and triple negative (ERG−/ETS−/SPINK1−)].
A comparison of the PCS categories with the TCGA genomic subtypes showed that the tumors classified as ERG, ETV1/4, SPOP, FOXA1, and “other” were present across all the PCS categories in the TCGA dataset (n = 333; Fig. 3A). SPOP cancers were enriched in PCS1 (OR: 3.53), while PCS2 tumors were overrepresented in TCGA/ERG cancers (OR: 1.82) and TCGA/“other” cancers were enriched in PCS3 (OR: 1.79; Fig. 3B). In the GRID cohorts, we observed all PCS categories in all classification groups as defined by Tomlins and colleagues (Fig. 3C and D). We found a high frequency of the Tomlins/ERG+ subtype in PCS2, but not in PCS1. PCS1 was enriched for Tomlins/ETS+ and Tomlins/SPINK1+ subtypes, while PCS3 was enriched for the triple-negative subtype but not the ERG+ or ETS+ subgroups. Finally, we compared the Tomlins classification method with the PCS classification using 5 of 7 GRID cohorts. PCS1 demonstrated significantly shorter metastasis-free survival compared with PCS2 and PCS3 (P < 0.001; Fig. 3E). In contrast, no difference in metastatic progression was seen among the Tomlins categories (Fig. 3F).
Comparison of the PCS subtypes with previously described subtypes. A, distribution of TCGA tumors (n = 333) using the PCS subtypes compared with TCGA subtypes. B, relationship between PCS subtyping and TCGA subtypes. C, distribution of GRID tumors (n = 1,626) using PCS categories compared with Tomlins subtypes. D, relationship between PCS subtyping and Tomlins subtypes. E and F, association of metastasis-free survival using Tomlins subtypes and using the PCS subtypes in the GRID tumors. G, metastasis-free survival in tumors of GS ≤ 7 (left) and GS ≥ 8 (right).
Comparison of the PCS subtypes with previously described subtypes. A, distribution of TCGA tumors (n = 333) using the PCS subtypes compared with TCGA subtypes. B, relationship between PCS subtyping and TCGA subtypes. C, distribution of GRID tumors (n = 1,626) using PCS categories compared with Tomlins subtypes. D, relationship between PCS subtyping and Tomlins subtypes. E and F, association of metastasis-free survival using Tomlins subtypes and using the PCS subtypes in the GRID tumors. G, metastasis-free survival in tumors of GS ≤ 7 (left) and GS ≥ 8 (right).
PCS1 contained the largest number of prostate cancers with GS ≥ 8 (Fig. 2C). Given the overall poorer outcomes seen in PCS1 tumors, we tested whether this result was simply a reflection of the enrichment of high-grade disease in this group (i.e., GS ≥ 8). For this analysis, we merged 5 GRID cohorts (i.e., MAYO1/2, TJU, CCF, and JHM) into a single dataset and separately analyzed low and high-grade disease. We observed a similarly significant (P < 0.001) association between subtypes and metastasis-free survival in GS ≤ 7 and in GS ≥ 8 (Fig. 3G). Thus, tumors in the PCS1 group exhibit the poorest prognosis, including in tumors with low Gleason sum score. Finally, in the DISC cohort, although CRPC/Met tumors were present in all PCS categories, PCS1 predominated (66%), followed by PCS3 (27%) and PCS2 (7%) tumors. To confirm whether this clinical correlation is replicated in individual cohorts, we also assessed association with time to metastatic progression, prostate cancer–specific mortality (PCSM), and overall survival (OS) in 5 individual cohorts in the GRID (i.e., MAYO1/2, CCF, TJU, and JHM) and in the SWD cohorts. PCS1 was seen to be the most aggressive subtype, consistent with the above results (Supplementary Fig. S3).
PCS categories possess characteristics of basal and luminal prostate epithelial cells
Prostate cancer may arise from oncogenic transformation of different cell types in glandular prostate epithelium (36–38). Breast cancers can be categorized into luminal and basal subtypes, which are associated with different patient outcomes (39). It is unknown whether this concept applies to human prostate cancer. To examine whether the 3 PCS categories are a reflection of different cell types, we identified 428 SEGs (SEG1–3; 86 for PCS1, 123 for PCS2, and 219 for PCS3; Supplementary Table S4) in each subtype. As expected, these genes are involved in pathways that are enriched in each subtype (Fig. 4A) and that define the perturbed cellular processes of the subtype. We then identified the cellular processes that are associated with the SEGs. Proliferation and lipid/steroid metabolism are characteristic of SEG1 and SEG2, while extracellular matrix organization, inflammation, and cell migration are characteristic of SEG3 (Fig. 4B). This result suggests that distinct biological functions are associated with the PCS categories.
Genes enriched in each of the three subtypes are associated with luminal and basal cell features. A, relative gene expression (left) and pathway inclusion (right) of SEGs are displayed. B, cellular processes enriched by each of the three SEGs (P < 0.05). C, expression of the luminal and basal markers in the three subtypes. D, enrichment of basal stem cell signature. E, correlation of pathway activities between samples from human and mouse prostate (left) and probability from the pathway classifier (right).
Genes enriched in each of the three subtypes are associated with luminal and basal cell features. A, relative gene expression (left) and pathway inclusion (right) of SEGs are displayed. B, cellular processes enriched by each of the three SEGs (P < 0.05). C, expression of the luminal and basal markers in the three subtypes. D, enrichment of basal stem cell signature. E, correlation of pathway activities between samples from human and mouse prostate (left) and probability from the pathway classifier (right).
To determine whether the PCS categories reflect luminal or basal cell types of the prostatic epithelium, we analyzed the mean expression of genes known to be characteristic of luminal (EZH2, AR, MKI67, NKX3-1, KLK2/3, and ERG) or basal (ACTA2, GSTP1, IL6, KRT5, and TP63) prostatic cells (Fig. 4C). We observed a strong association (FDR < 0.001; fold change > 1.5) between luminal genes and PCS1 and PCS2, and basal genes and PCS3. To verify this observation, we used two independent datasets derived from luminal and basal cells from human (40) and mouse (GSE39509; ref. 37) prostates. The assignment of a basal designation to PCS3 is further supported by the highly significant enrichment in PCS3, in comparison with the other two subtypes, of a recently described prostate basal cell signature derived from CD49f-Hi versus CD49f-Lo benign and malignant prostate epithelial cells (Fig. 4D; ref. 41). In addition, using the 14-pathway classifier, mouse basal tumors and human basal cells from benign tissues were classified as PCS3, while mouse luminal tumors and benign prostate human luminal cells were classified into PCS2 (Fig. 4E). These results are consistent with the conclusion that the PCS categories can be divided into luminal and basal subtypes.
A gene expression classifier for assignment to subtypes
Given the potential advantages of the PCS system to classify tumor specimens, we constructed a classifier that can be applied to an individual patient specimen in a clinical setting (Supplementary Fig. S4A). First, of 428 SEGs, 93 genes were selected on the basis of highly consistent expression patterns in 10 cohorts (i.e., SWD, TCGA, EMORY, HSPT, SU2C, MAYO1/2, CCF, TJU, and JHM). Second, using a random forest machine learning algorithm, we selected 37 genes with feature importance scores >0.5, showing a comparable level of error with the full model based on 428 SEGs (Supplementary Fig. S4B). Performance of the classifier was assessed in the GRID cohort (AUC = 0.97). The 37-gene panel displays significantly different expression patterns between the three subtypes in the DISC cohort (Fig. 5A).
A 37-gene classifier employed in patient tissues and CTCs. A, heatmap displays the mean expression pattern of the 37-gene panel in the three subtypes from the DISC cohort. B, hierarchical clustering of 77 CTCs obtained from CRPC patients by gene expression of the 37-gene panel. Bar plot in the bottom displays probability of PCS assignment from application of the classifier.
A 37-gene classifier employed in patient tissues and CTCs. A, heatmap displays the mean expression pattern of the 37-gene panel in the three subtypes from the DISC cohort. B, hierarchical clustering of 77 CTCs obtained from CRPC patients by gene expression of the 37-gene panel. Bar plot in the bottom displays probability of PCS assignment from application of the classifier.
The robust performance of the gene panel led us to determine whether it could be used to profile circulating tumor cells (CTC) from patients with CRPC. We analyzed single-cell RNA-seq data from 77 intact CTCs isolated from 13 patients (42). Prior to the clustering analysis to investigate the expression patterns of these CTC data, the normalized read counts as read-per-million (RPM) mapped reads were transformed on a log2 scale for each gene. The 77 CTCs were largely clustered into two groups using median-centered expression profiles corresponding to the 37-gene PCS panel by the hierarchical method (Fig. 5B). One group (GROUP I), consisting of 67 CTCs displays low expression of PCS1-enriched genes, while the other group (GROUP II) consisting of 10 CTCs has high expression of PCS1-enriched genes. In addition, we observed that PCS3-enriched genes in the panel were not detected or have very low expression changes across all CTCs as shown in the heatmap of Fig. 5B. The results suggest that CTCs can be divided into two groups with the 37-gene PCS panel. Given this result, we hypothesized that the 37-gene classifier might assign CTCs to PCS1 or PCS2, consistent with the clustering result. The bar graph below the heatmap illustrates the probability of likelihood of PCS assignment, with the result that all the CTCs were assigned to PCS1 (n = 12) or PCS2 (n = 65), while no PCS3 CTCs were assigned on the basis of the largest probability score. By comparing with the CTC group assignment, 7 (70%) of 10 CTCs in the GROUP II were assigned to PCS1 by the 37-gene classifier and 62 (95%) of 65 CTCs in the GROUP I were assigned to PCS2 by the classifier. We then tested whether GROUP I and II exhibit any difference in terms of therapeutic responses. Of note, 5 of the 7 CTCs in GROUP II (OR: 1.74; 95% confidence interval: 0.49–6.06) were from patients whose cancer exhibited radiographic and/or PSA progression during enzalutamide therapy, suggesting that the 37-gene PCS panel can potentially identify patients with resistance to enzalutamide therapy.
Collectively, the results demonstrate that the 37-gene classifier has a potential to assign individual prostate cancers to PCS1 using both prostate tissues and blood CTCs, suggesting that the classifier can be applied to subtype individual prostate cancers using clinically relevant technology platforms (43, 44), including by noninvasive methods.
Discussion
In this study, we describe a novel classification system for prostate cancer, based on an analysis of over 4,600 prostate cancer specimens, which consists of only 3 distinct subtypes, designated PCS1, PCS2, and PCS3. PCS1 exhibits the highest risk of progression to advanced disease, even for low Gleason grade tumors. Although sampling methods across the cohorts we studied were different, classification into the 3 subtypes was reproducible. For example, the SWD cohort consists of specimens that were obtained by transurethral resection of the prostate rather than radical prostatectomy; however, subtype assignment and prognostic differences between the subtypes were similar to the other cohorts we examined (Supplementary Fig. S3J). Genes that are significantly enriched in the PCS1 category were highly expressed in the subset of CTCs (58%, 7 CTCs out of 12) from patients with enzalutamide-resistant tumors. This proportion of resistant cases in PCS1 CTCs is very high compared with PCS2 CTCs (8%, 5 CTCs out of 65). The characteristics of the PCS categories are summarized in Table 2.
Summary of PCS characteristics
Sample type . | Features . | PCS1 . | PCS2 . | PCS3 . |
---|---|---|---|---|
Patient tumors | Proportion | 6% | 47% | 47% |
Pathology | Enriched GS ≥ 8 | Enriched GS ≤ 7 | Enriched GS ≤ 7 | |
Prognosis | Poor | Variable | Variable | |
Subtypes- TCGA | SPOP | ERG | ‘other’ | |
Subtypes- Tomlins | ETS+, SPINK+ | ERG+ | Triple Negative | |
Pathway signatures | AR-V, ES, PTEN, PRF, EZH2, NE | AR, FOXA1, SPOP, ERG | PRC, RAS, PN, MES | |
Cell lineage | Luminal-like | Luminal-like | Basal-like | |
Patient CTCs | Proportion | 16% | 84% | 0% |
Enzalutamide resistance | Yes (58%) | No (8%) | Unknown |
Sample type . | Features . | PCS1 . | PCS2 . | PCS3 . |
---|---|---|---|---|
Patient tumors | Proportion | 6% | 47% | 47% |
Pathology | Enriched GS ≥ 8 | Enriched GS ≤ 7 | Enriched GS ≤ 7 | |
Prognosis | Poor | Variable | Variable | |
Subtypes- TCGA | SPOP | ERG | ‘other’ | |
Subtypes- Tomlins | ETS+, SPINK+ | ERG+ | Triple Negative | |
Pathway signatures | AR-V, ES, PTEN, PRF, EZH2, NE | AR, FOXA1, SPOP, ERG | PRC, RAS, PN, MES | |
Cell lineage | Luminal-like | Luminal-like | Basal-like | |
Patient CTCs | Proportion | 16% | 84% | 0% |
Enzalutamide resistance | Yes (58%) | No (8%) | Unknown |
Previously published prostate cancer classifications have defined subtypes largely based on the presence or absence of genomic alterations (e.g., TMPRSS2-ERG translocations). Tumors with ERG rearrangement (ERG+) are overrepresented in PCS2; however, it is not the presence or absence of an ERG rearrangement that defines the PCS2 subtype, but rather ERG pathway activation features based on coordinate expression levels of genes in the pathway. Our findings provide evidence for biologically distinct forms of prostate cancer that are independent of Gleason grade, currently the gold standard for clinical decision-making. In addition, by comparing prognostic profiles between the PCS categories and the Tomlins and colleagues categories, prognostic information was evident only from the PCS classification scheme in the same cohort. Taken together, this indicates that the PCS classification is unique.
Although the current report has provided evidence that PCS classification can assign subtypes within groups of “indolent” as well as aggressive tumors, and in a wide range of preclinical models, it remains to be determined whether the PCS categories might be stable during tumor evolution in an individual patient. An interesting alternative possibility is that disease progression results in phenotypic diversification with respect to the PCS assignment. We have shown that preclinical model systems, including genetically engineered mouse models (GEMM), can be assigned with high statistical confidence to the PCS categories. We believe the simplest explanation for this finding is that these subtypes reflect distinct epigenetic features of chromatin that are potentially stable, even in the setting of genomic instability associated with advanced disease. This possibility needs to be formally tested. The human prostate cancer cell lines we evaluated could be assigned to all 3 subtypes; however, the GEMMs we tested could only be assigned to PCS1 and PCS2. This finding suggests that approximately 1 of 3 of human prostate cancers are not being modeled in widely used GEMMs. It should be feasible to generate mouse models for PCS3 through targeted genetic manipulation of pathways that are deregulated in PCS3 and through changing chromatin structure, such as by altering the activity of the PRC2 complex.
A major clinical challenge remains the early recognition of aggressive disease, in particular, due to the multifocal nature of prostate cancer (45). The classification scheme we describe predicts the risk of progression to lethal prostate cancer in patients with a diagnosis of low-grade localized disease (Fig. 3G). It is possible that in these cancers, pathway activation profiles are independent of Gleason grade and that pathways indicating high risk of progression are manifested early in the disease process and throughout multiple cancer clones in the prostate. In addition to predicting the risk of disease progression, PCS subtyping might also assist with the selection of drug treatment in advanced cancer by profiling CTCs in patient blood. With the 37-gene classifier we present here, it will be possible to assign individual tumors to PCS categories in a clinical setting. This new classification method may provide novel opportunities for therapy and clinical management of prostate cancer.
Disclosure of Potential Conflicts of Interest
N. Erho is a bioinformatics group lead at GenomeDx Biosciences Inc. M. Alshalalfa is a bioinformatician at GenomeDx Biosciences Inc. H. Al-deen Ashab is a data scientist at Genomedx Biosciences Inc. E. Davicioni has ownership interest (including patents) in GenomeDx Biosciences Inc. R.J. Karnes reports receiving other commercial research support from GenomeDx Biosciences Inc. E.A. Klein has received speakers bureau honoraria from GenomeDx Biosciences Inc. A.E. Ross has ownership interest (including patents) in GenomeDx Biosciences Inc. Mandeep Takhar is a bioinformatician at GenomeDX. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: S. You, B.S. Knudsen, J. Kim, M.R. Freeman
Development of methodology: S. You, M. Alshalalfa, E.M. Schaeffer, J. Kim
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S. You, B.S. Knudsen, E. Davicioni, R.J. Karnes, E.A. Klein, R.B. Den, A.E. Ross, E.M. Schaeffer, I.P. Garraway
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. You, N. Erho, M. Alshalalfa, H. Al-deen Ashab, E. Davicioni, E.M. Schaeffer, J. Kim, M.R. Freeman
Writing, review, and/or revision of the manuscript: S. You, B.S. Knudsen, N. Erho, M. Takhar, E. Davicioni, R.J. Karnes, E.A. Klein, R.B. Den, A.E. Ross, E.M. Schaeffer, I.P. Garraway, J. Kim, M.R. Freeman
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. You, M. Takhar, E. Davicioni, E.A. Klein, M.R. Freeman
Study supervision: S. You, M.R. Freeman
Acknowledgments
The authors are grateful to Drs. Francesca Demichelis, Felix Feng, and Benjamin Berman for helpful discussions during the course of this study.
Grant Support
This study was supported by grants from NIH (R01DK087806, R01CA143777, G20 RR030860, and 2P01 CA098912), the Department of Defense PCRP (W81XWH-14-1-0152; W81XWH-14-1-0273), the Urology Care Foundation Research Scholars Program, the Prostate Cancer Foundation Creativity Award, the Jean Perkins Foundation, and the Spielberg Family Discovery Fund.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
References
Supplementary data
Differential enrichment of a benign prostate specific gene signature.
List of gene expression datasets included in the analysis of the DISC cohort.
Publications from which the pathway activation gene sets were obtained.
The genes in the collection of pathway signatures used in this study.
List of 428 SEGs.