Abstract
Breast cancer classification has been the focus of numerous worldwide efforts, analyzing the molecular basis of breast cancer subtypes and aiming to associate them with clinical outcome and to improve the current diagnostic routine. Genomic and transcriptomic profiles of breast cancer have been well established, however the proteomic contribution to these profiles has yet to be elucidated. In this work, we utilized mass spectrometry–based proteomic analysis on more than 130 clinical breast samples to demonstrate intertumor heterogeneity across three breast cancer subtypes and healthy tissue. Unsupervised analysis identified four proteomic clusters, among them, one that represents a novel luminal subtype characterized by increased PI3K signaling. This subtype was further validated using an independent protein-based dataset, but not in two independent transcriptome cohorts. These results demonstrate the importance of deep proteomic analysis, which may affect cancer treatment decision making.
Significance: These findings utilize extensive proteomics to identify a novel luminal breast cancer subtype, highlighting the added value of clinical proteomics in breast cancer to identify unique features not observable by genomic approaches. Cancer Res; 78(20); 6001–10. ©2018 AACR.
Introduction
Breast cancer is the most prevalent cancer type in women and one of the most common death causes in women worldwide. Tumors are routinely classified based on the expression pattern of three receptors: estrogen receptor (ER), progesterone receptor (PR), and HER2 (1–3). This classification provides the basis for treatment decision making. The luminal, ER positive subtype, is the most prevalent one, with a more favorable prognosis (1). As such, endocrine treatments such as tamoxifen, efficiently target this subtype (4). HER2+ tumors are likely to respond to Trastuzumab (Herceptin), an inhibitory antibody targeting this receptor (1, 4). The triple-negative (TN) subtype, which suffers from the absence of known drug targets, has less favorable prognosis (5). The molecular mechanisms underlying its tumorigenicity are less clear and therefore it is often treated using nonspecific chemotherapy (4–6).
Beyond the routine clinical classification, genomic and transcriptomic studies defined four "intrinsic" breast cancer subtypes based on their gene expression patterns: luminal A, luminal B, basal and HER2+ (3, 7). The Cancer Genome Atlas (TCGA) Network (2012) further associated the gene expression profiles of each subtype with DNA copy number variations and mutation profiles. For example, basal tumors presented a high frequency of TP53 mutations, whereas ER+ tumors (both luminal A and B) showed a high frequency of PIK3CA mutations. The Metabric consortium analyzed 2,000 breast tumors and based on the integration of mRNA and DNA copy-number analysis divided the samples into 10 different subgroups (2). Other groups studied triple-negative tumors more deeply, further dividing them to more specific subgroups (8, 9). However, classification ambiguities still exist, and although multiple studies have been focusing on molecular characterization of breast cancer profiles, the clinical routine still relies primarily on immunostaining of ER, PR, and HER2 as a basis for classification (10).
In contrast to the genomic studies, protein expression levels in breast cancer and their association with classification have not yet been thoroughly studied. Previously, the TCGA performed reversed-phase protein arrays (RPPA) in order to evaluate how genetic alterations affect the protein level. However, this method is limited to hundreds of pre-selected proteins and several phosphorylation sites, and does not provide a system-wide view of the proteome. Analyzing the cancer proteome in an unbiased manner, two recent mass spectrometry (MS)-based proteomics studies analyzed dozens of tumors and evaluated the relevance of gene-expression levels to the proteome level. In the first study, we analyzed a cohort of 40 breast tumors derived from different subtypes, quantified the proteins using the super-SILAC technology (11), and reached the depth of more than 10,000 proteins (12). The second study, which was conducted by the Clinical Proteomic Tumor Analysis Consortium (CPTAC; ref. 13), analyzed MS-based proteomics and phosphoproteomics data from a cohort containing 77 samples, used iTRAQ quantification and reached the depth of 12,553 proteins and 33,239 phosphorylation sites. They further integrated the proteomic with the genomic profiles of the same cohort from the TCGA, and associated the mutational landscape with the corresponding proteomic and phosphoproteomic profiles. Both studies showed limited concordance between the overall protein and mRNA profiles. This was evident both in the 19-protein signature identified by Tyanova and colleagues, in which a third was found to be regulated exclusively on the proteomic level, and also in the global protein–mRNA correlation analysis of the CPTAC, which resulted in a median Pearson correlation of 0.39.
The larger scope of the CPTAC study enabled them to perform unsupervised classification of the tumors and identify novel proteomic and phosphoproteomic subgroups. Compared with the four expected mRNA-based subtypes, they identified only three proteomic groups, out of which two corresponded to the known classification (a basal cluster and a luminal cluster), whereas the third one was a new "stromal-enriched" cluster. This cluster was characterized by samples originating from all subtypes, and thus may be associated with stroma-rich regions within the tumors.
In this study, in order to explore intertumor heterogeneity and examine proteomics-based classification, we integrated three datasets from both published and unpublished resources, all generated using the same approach. The entire cohort consisted of 212 formalin-fixed paraffin-embedded (FFPE) samples originating from different breast cancer subtypes, as well as healthy tissue. Out of these, 131 samples were selected for analysis, based on the data quality and sample annotation. Unsupervised analysis of the integrated cohort enabled us to identify a novel luminal breast cancer subtype. Comparing our findings to independent published data, we found that this classification is reproduced only on the proteomic level. This proteomic cohort enabled us to both support and challenge the known breast cancer classification, and thus to facilitate the analysis of similarities and differences between distinct omics levels.
Materials and Methods
Sample preparation and super-SILAC mix
The proteomic data integrated three datasets, including two published ones (12, 14) and a novel dataset, Dataset-III, which was generated using the same techniques. For Dataset-III, 70 FFPE samples were obtained from the Department of Oncology, Hadassah Medical Center, Jerusalem, Israel, upon approval of the Institutional Review Board committee and in accordance with the Declaration of Helsinki ethical guidelines. Samples were pathologically defined based on the IHC staining of ER, PR and HER2. Samples were macro-dissected from tissue slices. Each sample was enriched to include cellular areas and exclude stromal components. We used the super-SILAC standard to serve as a reference sample for relative quantification and for normalization between the three datasets (11). The standard was prepared as previously described (11). Briefly, four breast cancer cell lines of different origin, stage and receptor status (HCC1599, MCF7, HCC1937 and HCC2218) and normal mammary epithelial cells (HMEC) were SILAC labeled with heavy lysine and arginine and their lysates were combined at equal ratios (protein amount). Super-SILAC was generated from cell lines within 6 months from their purchase from the ATCC or ECACC, or within 6 months from their authentication at the Genomics Core Facility of BioRap Technologies and the Rappaport Research Institute in the Technion, Israel. All tissues and cells were solubilized in buffer containing 4% SDS, 0.1 M dithiothreitol in 0.1 M Tris-HCl pH 7.6. FFPE samples were incubated for 1 h at 95°C followed by sonication. Prior to protein digestion, proteins from each clinical sample were combined with equal protein amounts of the super-SILAC mix. The joint lysates were then trypsin-digested following the FASP protocol, on top of 30 kDa cutoff Amicon filters (15). To increase analytical depth, the resulting peptides were fractionated using pH-based strong anion exchange chromatography (SAX) in a StageTip format (15). Each sample was separated into six fractions using buffers of different pH values (11, 8, 6, 5, 4, 3) followed by desalting and concentration on C18 StageTips (16). In 25 cases, when the protein amount was insufficient, we performed in-solution trypsin digestion and/or did not fractionate the samples. Prior to MS analysis, peptides were eluted from the StageTips with 80% acetonitrile, vacuum-concentrated and diluted in MS loading buffer (2% acetonitrile, 0.1% formic acid).
Liquid chromatography – mass spectrometry analysis
Peptides were separated by reverse-phase chromatography using the nano- ultra high performance liquid chromatography system (UHPLC; Easy-nLC1000, Thermo Fisher Scientific), coupled to the Q-Exactive or Q-Exactive Plus mass spectrometers (Thermo Fisher Scientific). Peptides were separated with 220–240 min linear gradients of water/acetonitrile. MS resolution was 70,000 with an automatic gain control (AGC) target of 3e6, and MS/MS resolution was 17,500 with AGC target of 1e5. MS analysis was performed using a top10 method in which every high resolution MS scan is followed by fragmentation of the 10 most abundant peaks by higher-energy collisional dissociation.
Protein identification and quantification
MS raw files of all three datasets [Dataset-I (12), Dataset-II (14), and Dataset-III] were jointly analyzed by MaxQuant (17, 18) version 1.5.3.16 and the integrated Andromeda search engine (19). MS/MS spectra were referenced to the Uniprot human proteome (http://www.uniprot.org/) released September 2015. False discovery rate (FDR) of 0.01 was used on both the peptide and protein levels using the target-decoy approach (20). The search included methionine oxidation and protein N-terminal acetylation as variable modifications, and carbamidomethyl cysteine as a fixed modification. SILAC was employed for protein quantification, defined as two MS1 labels reflecting the SILAC heavy labels (Lysine8 and Arginine10). Maximum number of allowed missed peptide cleavages was set to 2, and the minimum peptide length was set to 7 amino-acids. An initial search followed by mass recalibration was performed with a mass tolerance of 20 ppm.
Data preprocessing and statistics
Analyses were performed using the Perseus software (21) and R environment (where indicated). Initial dataset consisted of 212 samples from all three datasets, and these were then filtered based on their origin and data quality. After exclusion of samples as described in the Results section, the analyzed cohort consisted of 131 samples, including 109 high quality tumor samples and 22 high quality healthy samples. The excluded samples were 30 lymph node samples (healthy and metastases), one male patient, samples lacking clinical/pathological annotation and samples with less than 3,000 quantified proteins.
All data analyses were performed on the logarithmized (log2) ratios of the unlabeled samples towards the same heavy labeled super-SILAC mix. For all pairwise comparisons we filtered the data to retain only proteins with valid quantification values in at least 50% of the samples. Missing values were imputed by drawing from a normal distribution with a width of 0.3 of the overall data distribution and a down-shift of 1.5 standard deviations. The initial filtration of 50% valid quantification values per protein, ensures that all of them are seen in the heavy channel, and missing values represent lowly expressed proteins.
Batch effects originating from the dataset integration-related variability were corrected for using principal component analysis (PCA) and component subtraction. We found that the second component of a PCA separates between datasets (9% of the variation), and therefore subtracted this component before integrated downstream analysis. Student t tests were performed using a permutation-based FDR cutoff of 0.05, except for the tumor-healthy analysis where we used a cutoff of 0.1. Enrichment analyses were performed using Fisher exact test with a Benjamini–Hochberg FDR (22) of 0.02. 2D Annotation enrichment test was performed in Perseus software as previously described (23), based on gene sets provided by the Molecular Signature Database (MSigDB; ref. 24). Venn diagrams were plotted using Biovenn web tool (25).
Clustering and network analysis
For the consensus clustering analysis, ratios across samples and across proteins were z-score normalized followed by their clustering using the consensus cluster algorithm (26). The algorithm performs multiple iterations of hierarchical clustering, whereas in each iteration only a subset of the data is considered and the range of the number of output clusters is determined by the user. This results in a separate consensus matrix for each number of clusters, describing the frequency of two samples clustering together out of the total number of iterations. We used R environment (version 3.2.3) and the ConsensusClusterPlus package (27) to run this analysis (parameters used: max. number of clusters, 10; number of iterations, 1,000; subsampling fraction, 0.8; clustering algorithm, hierarchical; distance matrix, Pearson correlation).
The significantly upregulated proteins in each consensus cluster were applied to the String database (http://string-db.org) in order to derive the protein-protein interaction information within each group. Next, we performed a betweeness centrality analysis of each consensus cluster network, in which a node is considered central if it appears in many shortest paths between two other nodes. The top 10% central nodes from each network were used for further analysis. Centrality analysis was performed using Cytoscape and the CytoNCA plugin (28).
TCGA RPPA data
The TCGA RPPA data matrix was downloaded from The Cancer Proteome Atlas (TCPA) download portal (http://tcpaportal.org/tcpa/download.html).
Data availability
The entire dataset of 212 samples is available through the PRIDE repository (Dataset-I accession: PXD002619; Dataset-II accession: PXD000815; Dataset-III accession: PXD009766).
Results
Integration of three breast cancer datasets
We generated an integrated breast cancer proteomic dataset by combining three datasets from published and unpublished resources, all assembled in our laboratory, and all analyze macrodissected tumors. The first cohort (Dataset-I) was obtained from Tyanova and colleagues (12), and consisted of 40 breast tumor samples representative of invasive ductal carcinoma (IDC) of the three main subtypes (ER+, HER2+, and TN). The second cohort (Dataset-II), from Pozniak and colleagues (14), consisted of 88 samples including luminal IDC, matching adjacent healthy tissue or matched lymph node metastases. For the third dataset (Dataset-III), 70 FFPE blocks were pathologically defined based on the expression of ER, PR, and HER2. Samples were analyzed by MS-based proteomics using super-SILAC as an internal standard. The three datasets were integrated, generating a cohort of 212 FFPE samples in total, originating from the different subtypes (131 ER+, 30 TN, 15 HER2, and 9 with no annotation) and from adjacent healthy tissue (27 samples). Computational analysis of the integrated cohort using MaxQuant software resulted in the identification of 10,819 proteins in total. Next, we filtered out tumor samples with less than 3,000 identifications and excluded samples of metastatic origin or of missing sample annotation. Altogether, the final cohort included 131 samples, including 22 healthy controls (for a detailed list of the samples before and after filtration see Supplementary Table S1; see Supplementary Fig. S1A and S1B for summary of the sample processing workflow). Accurate comparison of samples was possible because all datasets were quantified relative to the same super-SILAC reference sample, in which each sample is quantified against a heavy-labeled mix of healthy and cancerous breast cell lines (11). To remove minor batch effects associated with the source of tumor samples and the sample preparation, we identified and removed the discriminating component using PCA (see Materials and Methods and Supplementary Fig. S1C). The entire cohort of 131 samples was then analyzed using supervised approaches, to validate the relevance to the known breast cancer subtypes; and using unsupervised approaches to uncover novel breast cancer classification.
Supervised comparison between breast cancer subtypes
Aiming to validate whether the known subtype characteristics were evident in the proteomics data, analysis of the changes of known subtype markers (ERBB2/Her2, MKI67, KRT18, EGFR, CD44, and FOXA1; refs. 29, 30) between the healthy tissues and each subtype reproduced most of the expected subtype patterns (Fig. 1A). For example, HER2 was higher in HER2 samples compared with ER+ and TN; as expected, KRT18 and FOXA1 were significantly lower in TN tumors; however, EGFR and CD44 were only slightly higher in TN than the other subtypes. Interestingly, comparison to the healthy tissue showed that some of these cancer markers are not exclusively expressed in the transformed tissue.
Supervised analysis of breast cancer subtypes and healthy tissue. A, Strip plots show the protein expression distribution of selected subtype markers. For significance test of each pairwise comparison, we used Student t test, FDR < 0.05. B, Venn diagram of the numbers of significantly upregulated proteins in each subtype when compared with healthy samples and processes enriched within the commonly upregulated proteins (Fisher exact test, FDR < 0.02). C, Selected biological processes (from gene ontology biological processes and molecular functions) that were enriched in the different subtypes when compared with healthy samples. q values after Benjamini–Hochberg FDR correction are indicated next to each process in each subtype. *, q < 0.02. D, Proteomaps of the significantly different proteins show that each subtype is characterized with specific metabolic and signaling pathways. Each polygon represents a KEGG pathway and its size corresponds to the t test difference of the proteins it contains.
Supervised analysis of breast cancer subtypes and healthy tissue. A, Strip plots show the protein expression distribution of selected subtype markers. For significance test of each pairwise comparison, we used Student t test, FDR < 0.05. B, Venn diagram of the numbers of significantly upregulated proteins in each subtype when compared with healthy samples and processes enriched within the commonly upregulated proteins (Fisher exact test, FDR < 0.02). C, Selected biological processes (from gene ontology biological processes and molecular functions) that were enriched in the different subtypes when compared with healthy samples. q values after Benjamini–Hochberg FDR correction are indicated next to each process in each subtype. *, q < 0.02. D, Proteomaps of the significantly different proteins show that each subtype is characterized with specific metabolic and signaling pathways. Each polygon represents a KEGG pathway and its size corresponds to the t test difference of the proteins it contains.
A global subtype comparison resulted in hundreds of significant proteins upregulated in each subtype (Fig. 1B; Student t test FDR <0.1; Supplementary Tables S2A–S2C). Enrichment analyses identified subtype-specific processes, for example, galactose metabolism in HER2 tumors and oxidative phosphorylation in ER+ tumors—corresponding to the well-characterized metabolic activity of breast cancer subtypes (31, 32). Upregulated immune activity, demonstrated by type I interferon and antigen presentation processes, was significantly enriched in TN tumors, as previously described (Fig. 1C, see Supplementary Tables S2D–S2F for the full list; refs. 8, 33). To obtain a more general functional and subtype-related view of the samples, we used lower stringency criteria for subtype specificities, (one-sided t test, P < 0.05) and constructed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway proteomaps (Fig. 1D; ref. 34). In agreement with previous findings (35–37), these maps show the subtype-specificity of signaling pathways. For example, MAPK signaling was prominent in ER+ tumors, whereas ERBB2 signaling activity was prominent, as expected, in HER2 tumors. Other than subtype-specific processes, we identified significantly enriched processes in the group of proteins that are common to all subtypes when compared with healthy. These include actin cytoskeleton organization, extracellular vesicular exosome, and type I interferon signaling (Fig. 1B; Supplementary Table S2G and S2H). Taken together, these results indicate that proteomic data capture many of the well-established subtype characteristics, both globally and with regard to specific markers.
Unsupervised analysis of the cancer proteomes identifies four tumor clusters
In the second step of the analysis we aimed to examine unsupervised classification of the tumors. To eliminate masking of subtype differences, we excluded the 22 healthy controls, and classified the 109 tumor samples using the consensus clustering algorithm (26). Optimization of the number of clusters in the dataset resulted in four clusters that best separated the samples. This was evaluated quantitatively using the area under the cumulative distribution curve, which indicated that generating four clusters results in more than 15% improvement in accurate grouping of the data, compared with three clusters; but more than four clusters adds less than 10% (Supplementary Fig. S2A and S2B). A visual inspection of the consensus matrix also provided support of a clear cluster separation (Fig. 2A). Examination of the distribution of the classical subtypes between the consensus clusters showed partial concordance with the known classification; most TN samples belonged to consensus cluster (CC) 4; HER2 samples spread across clusters; and ER+ samples (LumA or LumB) were divided into three clusters (Fig. 2B). These analyses, as well as t-SNE analysis (Fig. 2C), revealed that although CC1, CC2, and CC3 are all associated with ER+ samples, CC2 and CC3 are closer to the TN-enriched cluster, CC4; and CC1 is relatively segregated from them. Notably, running the consensus clustering algorithm with k = 3, the ER+ CC2 was co-clustered with the ER− CC4, thus indicating their higher global proximity (Supplementary Fig. S2C). To examine whether contaminating proteins of the extracellular compartment affect this classification, we repeated the analysis and omitted all extracellular matrix (ECM) proteins from the dataset using Gene Ontology Cellular Component annotation. Re-analyzing with consensus clustering resulted in three clusters, reproducing CC1 and CC3, and creating a combined cluster of CC2 and CC4. This suggests that ECM proteins do not play a key role in the separation of the luminal samples (for the sample-wise comparison, see Supplementary Fig. S2D).
The proteomic clusters identified using consensus clustering. A, Heatmap representing the consensus matrix of clustered samples. Blue, higher consensus between each pair of samples. Visual inspection of the consensus matrix identifies four different groups. B, Alluvial plot shows the distribution of clinically defined subtypes across consensus clusters. Consensus clusters 1, 2, and 3 included mostly ER+ samples, whereas consensus cluster four included mostly TN ones. HER2 samples were spread and not clustered together. C, T-distributed stochastic neighbor embedding (t-SNE) analysis of 109 tumor samples, based on the ANOVA-significant proteins segregating the consensus clusters. Consensus cluster 1 is more easily distinguished, whereas 2, 3, and 4 appear relatively closer on the t-SNE axes.
The proteomic clusters identified using consensus clustering. A, Heatmap representing the consensus matrix of clustered samples. Blue, higher consensus between each pair of samples. Visual inspection of the consensus matrix identifies four different groups. B, Alluvial plot shows the distribution of clinically defined subtypes across consensus clusters. Consensus clusters 1, 2, and 3 included mostly ER+ samples, whereas consensus cluster four included mostly TN ones. HER2 samples were spread and not clustered together. C, T-distributed stochastic neighbor embedding (t-SNE) analysis of 109 tumor samples, based on the ANOVA-significant proteins segregating the consensus clusters. Consensus cluster 1 is more easily distinguished, whereas 2, 3, and 4 appear relatively closer on the t-SNE axes.
Next, we explored the characteristics of the newly identified proteomic groups and their relation to the known subtype markers. Plotting six selected subtype markers (as mentioned above) showed the subtype-expected pattern according to the CC distribution and the known classification (Fig. 3). For example, KRT18 was significantly higher in all ER+ clusters compared to the TN-enriched CC4; and the TN markers EGFR and CD44 were significantly higher in CC4 when compared to CC3 and CC1, respectively. Examination of the LumB marker MKI67 suggested that CC1 represents this subtype. This was further supported by an enrichment analysis that identified gene sets of RB1-regulated and MYC-related targets, with which LumB subtype is known to be associated (1), as significantly enriched within CC1 samples when compared to either CC2 or CC3 (Supplementary Fig. S3A). This analysis also showed that the majority of significantly enriched processes are higher in CC1 in both comparisons. Very few processes are higher in CC2 and CC3, and the common ones include mostly interferon related pathways (Supplementary Fig. S3B; for the full lists see Supplementary Table S3A).
Selected subtype markers across consensus clusters. Selected subtype markers exhibit the subtype-expected pattern (for each pairwise comparison, we used Student t test, FDR < 0.05). MKI67 expression was significantly higher in CC1, suggesting its association with the LumB subtype, and the luminal KRT18 was significantly higher in all ER+ clusters (CC1–CC3) compared with the TN CC4.
Selected subtype markers across consensus clusters. Selected subtype markers exhibit the subtype-expected pattern (for each pairwise comparison, we used Student t test, FDR < 0.05). MKI67 expression was significantly higher in CC1, suggesting its association with the LumB subtype, and the luminal KRT18 was significantly higher in all ER+ clusters (CC1–CC3) compared with the TN CC4.
Having associated CC4 with TN subtype and CC1 with LumB subtype, CC2 and CC3 remained as two possible LumA clusters. In order to characterize them, we first analyzed each one of the two LumA clusters compared to the TN CC4. We speculated that the comparison to the distinct cluster will highlight the specific luminal features of each one of the clusters. First, to ensure that both clusters are indeed luminal, in agreement with their pathological annotation, we performed a 2D-enrichment analysis looking for gene sets from the Molecular Signatures Database (MSigDB) that behave similarly in both CC2 and CC3. We found gene sets such as YANG_BREAST_CANCER_ESR1_UP and SMID_BREAST_CANCER_BASAL_DN to be significantly enriched in CC2 and CC3 when compared to CC4, while gene sets such as SMID_BREAST_CANCER_BASAL_UP were significantly de-enriched. In addition, both CC2 and CC3 were enriched for bone metastasis gene sets (Fig. 4A; Supplementary Table S3B), recapitulating the known ER+ metastatic pattern (38, 39).
Comparison between the LumA consensus clusters. A, 2D enrichment analysis of subtype-specific gene sets revealed common enrichment of ER+ -related gene sets as well as ER+ metastasis-related gene sets in both CC2 and CC3. Barcode plots show proteins ranked according to fold change of CC2 or CC3 relative to CC4, where each vertical line represents a single protein. High rank indicates higher expression in either CC2 or CC3. B, Network centrality analysis of CC2 and CC4 upregulated proteins resulted in 21 overlapping proteins among the top 10% central in each group. TN-related signaling proteins are represented as dash-lined nodes. C, Density and rug-plots of proteins whose significantly high expression is common to CC2 and CC4, but not to CC3.
Comparison between the LumA consensus clusters. A, 2D enrichment analysis of subtype-specific gene sets revealed common enrichment of ER+ -related gene sets as well as ER+ metastasis-related gene sets in both CC2 and CC3. Barcode plots show proteins ranked according to fold change of CC2 or CC3 relative to CC4, where each vertical line represents a single protein. High rank indicates higher expression in either CC2 or CC3. B, Network centrality analysis of CC2 and CC4 upregulated proteins resulted in 21 overlapping proteins among the top 10% central in each group. TN-related signaling proteins are represented as dash-lined nodes. C, Density and rug-plots of proteins whose significantly high expression is common to CC2 and CC4, but not to CC3.
In order to further characterize the differences between the LumA clusters, we examined the proteins significantly upregulated in each cluster. Student's t-test resulted in hundreds of significantly different proteins between CC2 and CC3, out of which, 539 were higher in CC2 and 127 were higher in CC3 (for the list of significant proteins see Supplementary Table S3C). Interestingly, 265 proteins were commonly and significantly upregulated in both CC2 (luminal) and CC4 (TN) when compared to CC3. To identify the key proteins in this comparison, we plotted the protein-protein interaction networks of the significantly higher proteins in CC2 and CC4 (compared to CC3), and performed a betweeness centrality analysis of each network. Selecting the 10% top central proteins in each group showed 21 overlapping proteins between the CC2 and the CC4 networks. Among these proteins, we detected signaling proteins such as NFKB1, MTOR and STAT3 (Fig. 4B and C). Collectively, these results indicate that CC2 shows high expression of key TN signaling markers, but simultaneously still maintains an intrinsic-ER+ signature.
Clinical relevance of the new LumA cluster (CC2)
In order to further validate our results on an independent sample cohort and based on a distinct technological platform, we reanalyzed the reversed phase protein array (RPPA) data provided by the TCGA (1). These data included examination of 217 antibodies on 848 tumor samples, of those we selected only IDC samples (n = 621), which correspond to the histological subtype included in our study. Interestingly, similar to the MS-based data, k-means clustering of the TCGA RPPA data identified two LumA clusters. Defining k-means clustering with four clusters resulted in one HER2 cluster, one basal cluster, and two clusters enriched for LumA samples, one of which was also enriched for LumB samples (Supplementary Fig. S4A). Running the same algorithm with optional five clusters gave similar results, but separated the combined LumA/LumB cluster into two different clusters (Fig. 5A). Comparison of the two LumA RPPA clusters also reproduced the differences in specific proteins (Fig. 5B; Supplementary Table S4), including significant differences in PI3K-related proteins and their phosphorylated forms. Interestingly, the cluster with upregulated PI3K-related proteins was also characterized by increased ER signaling, demonstrated by significant upregulation of both nonphosphorylated and phosphorylated ER. Phosphorylation of ER on Ser118 is known to be mediated by AKT signaling and was shown to increase ER transcriptional activity (40, 41). Additionally, ER-involved MTOR/PI3K activity is a known resistance mechanism to endocrine therapy such as tamoxifen (40, 42, 43). To further compare the RPPA results to the original MS data, we examined the connection between PI3K and ER on the original dataset. Since the ER protein was below signal detection level in many of the samples, we searched for ER downstream proteins to serve as a proxy for ER expression. We gathered a list of 1346 genes that are annotated as ER targets in MSigDB, out of which 411 were present in our data. In agreement with the RPPA results, also the proteomics results show significantly higher predicted ER activity in CC2 (P = 7.71E-11, see Supplementary Fig. S4B).
Independent validation of the proteomic classification. A, K-means clustering of TCGA RPPA data, based on 621 IDC tumor samples, reproduces the proteomic classification. Two LumA-enriched clusters are identified when data are divided into five clusters, whereas the other three are each enriched (Fisher exact test, FDR < 0.02) for the known intrinsic subtypes (LumB, Basal, and HER2). B, Comparison of k-means clusters 2 and 3, which correspond to the proteomic CC2 and CC3, reveals a similar expression pattern of signaling proteins such as MTOR, AKT, and MAPK14. Alongside these, ER itself (both phosphorylated and nonphosphorylated) is upregulated in k-means cluster 2 (Student t test, FDR < 0.05). C, Plotting the 666 proteins distinguishing CC2 and CC3 creates a clear differentiation between the samples in our proteomic data. Clustering of the mRNA levels of same genes from the TCGA RNA-sequencing data (D) or the Metabric data (E) shows no clear separation of subtypes or integrative clusters.
Independent validation of the proteomic classification. A, K-means clustering of TCGA RPPA data, based on 621 IDC tumor samples, reproduces the proteomic classification. Two LumA-enriched clusters are identified when data are divided into five clusters, whereas the other three are each enriched (Fisher exact test, FDR < 0.02) for the known intrinsic subtypes (LumB, Basal, and HER2). B, Comparison of k-means clusters 2 and 3, which correspond to the proteomic CC2 and CC3, reveals a similar expression pattern of signaling proteins such as MTOR, AKT, and MAPK14. Alongside these, ER itself (both phosphorylated and nonphosphorylated) is upregulated in k-means cluster 2 (Student t test, FDR < 0.05). C, Plotting the 666 proteins distinguishing CC2 and CC3 creates a clear differentiation between the samples in our proteomic data. Clustering of the mRNA levels of same genes from the TCGA RNA-sequencing data (D) or the Metabric data (E) shows no clear separation of subtypes or integrative clusters.
LumA subtypes distinction enabled specifically by proteomic data
Previous studies highlighted major differences in unsupervised classification based on proteomic compared to the genomic approaches (1–3, 7). We therefore sought to examine whether the proteomic signature of differentially expressed proteins in CC2 and CC3 also separates the LumA samples based on their RNA profiles. In contrast to the separation based on the protein levels (Fig. 5C), hierarchical clustering of the RNA data of the same genes showed no clear differentiation between the intrinsic subtypes, even when sorted according to subtype (Fig. 5D). We applied the same analysis to an independent gene expression dataset, from the Metabric study (2012). Sorting the samples according to the 10 integrative clusters identified in the Metabric analysis and plotting the gene expression of the same signature yielded results similar to the TCGA RNA analysis, and did not manage to differentiate clearly between the clusters (Fig. 5E). This analysis indicates that the distinction between LumA sub-subtypes is potentially unique to protein-based data.
Discussion
We integrated three proteomic datasets of breast cancer samples originating from the three main clinical subtypes of IDC and from healthy tissue. Supervised analysis of the tumor samples validated many of the previous findings regarding their specific subtype-related pathways and functionalities, including immune activity in TN tumors and elevated MAPK signaling in ER+ tumors. The size of the cohort, despite being significantly smaller than current genomic studies, is still larger than the microarray studies that initially identified the breast cancer intrinsic subtypes (3, 7). Additionally, the careful tumor dissection enabled focusing on the cancer cells rather than stroma, which affected past classifications. Unsupervised tumor classification using the consensus clustering algorithm separated the data into four clusters, consisting of one TN-enriched group and three ER+ enriched groups, while HER2 samples were distributed across groups.
The inability to identify a distinct group of HER2 samples in proteomics data might stem from the diverse phenotype of these tumors beyond HER2 expression. This result was previously shown by the CPTAC consortium (13) using similar techniques and a smaller cohort, and also by the Metabric study (2), which shows that HER2 is high in several integrative clusters based on copy number aberrations and mRNA expression. Additionally, a mass cytometry-based proteomic analysis of breast cancer also suggested that HER2 expressing cells do not represent a single phenotype, and are instead characterized by variable expression of other proteins (44). The RPPA results did separate the HER2 subtype, presumably due to the pre-selection of a large number of HER2 related proteins in the arrays. In the current work, we consistently assessed HER2 and its associated proteins in an unbiased manner. They were indeed high in the HER2 clinical subtype, but showed variable expression across consensus clusters. While the importance of HER2 is not questionable given the efficacy of HER2-directed therapies, our results suggest that this pathway affects only a subset of the proteome and does not represent a distinct subtype when examining the global proteome profiles. Thus, it is useful as a therapeutic marker rather than a representative of a defined tumor subtype. Additional classification ambiguities between studies may originate from the pre-selection of features (genes/proteins) based on their standard deviation, as done by previous studies (2, 13). Furthermore, major discrepancies have been identified between all ’omics’-based classifications (including ours), which are based on averaging of large tumor regions, as compared to IHC approaches. Apart from potentially overlooking the effect of tumor microenvironment components (e.g., ECM, immune infiltrates, and fibroblasts), “averaging” the cancer cells masks the intratumor heterogeneity that is well documented in breast cancer (45, 46). This intratumor heterogeneity is likely to underlie some of the observed inconsistencies between classification using the clinical IHC approaches and the “omics” approaches.
Comprehensive investigation of the three ER+ clusters identified one LumB cluster and two LumA clusters. One of the two LumA clusters was characterized with increased activity of TN-associated signaling proteins, thus suggesting a novel, sub-subtype of LumA tumors. Our classification was validated using an independent dataset obtained from TCGA, which also assisted in the characterization of this novel subtype as potentially more aggressive and therapy resistant. The mechanism underlying this resistance relies on the coordinated interaction between ER itself and growth factor signaling proteins. ER can be phosphorylated on several sites by MAPK as well as PI3K downstream kinases, such as AKT, P70S6K1, and MAPK14. Phosphorylation leads to increased transcriptional activity of ER, which in turn upregulates the transcription of genes belonging to those pathways (47, 48). One of the common phosphorylation sites is Serine118, which alongside other sites (such as Serine167), has been implicated in resistance of ER+ tumors to tamoxifen (42, 49). Resistance is suggested to be due to reduced affinity to the drug, accompanied by increased binding of estrogen (40, 42). The signaling proteins that constitute the profile of the proteomic CC2, together with upregulation of both phosphorylated and nonphosphorylated ER, reflect this therapy-resistant phenotype. Thus, the new LumA cluster potentially represents a group of more aggressive tumors, more likely to be resistant to treatment.
The identification of a novel LumA subtype has several implications. Our findings suggest that classifying luminal breast tumors based on protein expression may set the ground for a more comprehensive clinical evaluation of breast tumors, and thus benefit treatment decision making. This work may therefore represent a starting point for a directed study of response to hormonal treatment. A focused study of this sort will include patient survival information as well as clinical data regarding patient course of treatment and response. These could increase the understanding and interpretation of newly identified tumor profiles, and may provide insights on their biological nature and clinical implications.
The work presented here shows the added value of cancer proteomics, both independently as a robust methodology and as a complementary layer of biological information to genomics and transcriptomics. The integration of multiple samples yielded high-coverage and high-accuracy data, shedding light on the complex diversity of breast cancer, and altogether demonstrating the importance of a deep, systemic, and multilevel point of view in breast cancer research.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: A. Sonnenblick, T. Peretz, T. Geiger
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): H. Agmon, M. Harel, A. Sonnenblick, T. Peretz
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): G. Yanovich, A. Sonnenblick, T. Geiger
Writing, review, and/or revision of the manuscript: G. Yanovich, M. Harel, A. Sonnenblick, T. Peretz, T. Geiger
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): A. Sonnenblick
Study supervision: T. Geiger
Acknowledgments
We thank the members of the Geiger laboratory for fruitful discussions and technical assistance. We thank Dana Silverbush and Dvir Netanely for insightful bioinformatics assistance. We specifically thank Diana Nemichenitzer for assisting in sample preparation of Dataset III. This work was supported by the Israel Cancer Research Fund; G. Yanovich, M. Harel, and T. Geiger were supported by the European Research Council (ERC; PROTEOMICAN 639534).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
References
Supplementary data
Sample summary table
Enrichment analyses
RPPA data