Reverse-phase protein arrays (RPPA) represent a powerful functional proteomic approach to elucidate cancer-related molecular mechanisms and to develop novel cancer therapies. To facilitate community-based investigation of the large-scale protein expression data generated by this platform, we have developed a user-friendly, open-access bioinformatic resource, The Cancer Proteome Atlas (TCPA, http://tcpaportal.org), which contains two separate web applications. The first focuses on RPPA data of patient tumors and contains >8,000 samples of 32 cancer types from The Cancer Genome Atlas and other independent patient cohorts. The second focuses on RPPA data of cancer cell lines and contains >650 independent cell lines across 19 lineages. Many of these cell lines have publicly available, high-quality DNA, RNA, and drug screening data. TCPA provides various analytic and visualization modules to help cancer researchers explore these datasets and generate testable hypotheses in an effective and intuitive manner. Cancer Res; 77(21); e51–54. ©2017 AACR.
Proteins comprise the basic functional units in biological processes and represent major targets for cancer therapy. However, large-scale cancer proteomic data have been relatively limited, in contrast to the recent explosion of next-generation sequencing data at both DNA and RNA levels. Importantly, because DNA and RNA levels correlate relatively poorly with protein levels, and with levels of posttranslationally modified proteins in particular, protein levels must be assessed directly. Reverse-phase protein arrays (RPPA) represent a powerful functional proteomic approach that can assess a sizable number of selected protein markers across many samples in a cost-effective, sensitive, and high-throughput manner (1–3). This quantitative antibody-based assay has been widely used to investigate molecular events that drive tumor initiation/progression and to evaluate biomarkers and mechanisms that underlie sensitivity/resistance to cancer therapy (4, 5). More recently, we have employed this platform to characterize >8,000 patient samples through The Cancer Genome Atlas (TCGA; ref. 6), and the current RPPA platform contains approximately 300 protein markers, covering all major cancer signaling pathways. However, accessing and analyzing RPPA data effectively remains a major informatic challenge for researchers, which limits the translation of these valuable functional proteomic data into clinical utility. To better serve the cancer research community, we have developed an open-access bioinformatic resource, The Cancer Proteome Atlas (TCPA). This resource is publicly available at http://tcpaportal.org and substantially lowers the barrier that biomedical researchers face in analyzing such complex RPPA data.
Materials and Methods
As shown in Fig. 1A, the proteins profiled by the RPPA platform are first extracted from either patient tumor tissues or cultured cell lines and then serially diluted. The diluted proteins are arrayed on nitrocellulose-coated slides and probed with validated antibodies. Antibody validation is an ongoing process, with continuous reassessment of antibodies used on the platform as well as new antibodies. The data obtained from the arrays are then collected and normalized using a bioinformatic pipeline that consists of several cutting-edge algorithms for data normalization, curve fitting, and image data processing (7, 8). The standardized and normalized data are then curated and linked with other data in the TCPA portal.
Currently, the TCPA portal contains two separate web applications for exploring and analyzing the proteomic data. One focuses on patient cohort data (9), and the other is for cell line data. For the web application of patient cohort data, the latest release contains the data from approximately 8,000 samples of 32 TCGA cancer types and another approximately 500 samples from independent patient cohorts (Fig. 1B). For the web application of cell line data, the current release contains RPPA data of >650 independent cancer cell lines from the MD Anderson Cell Lines Project (10). Notably, the RPPA data of cell lines were generated by the same platform as the patient cohort data, providing independent sets for validating patterns observed in patient cohorts. TCPA routinely releases the latest RPPA data with a strict versioning approach and also allows for the efficient exploration of these datasets. Both patient cohort and cell line web applications provide three major types of modules: (i) “Datasets” modules provide detailed information on the samples and protein markers curated in TCPA, as well as a tree-view interface for data downloading; (ii) “Analysis” modules enable users to perform various common protein-centered analyses, including correlation, differential expression, patient survival, and drug sensitivity analyses; and (iii) “Visualization” modules allow users to examine the global RPPA data patterns interactively. The TCPA portal provides detailed online tutorials on how to use these modules (see Supplementary Video S1).
Users can examine sample information from the “Summary” module, such as data source, cancer type, number of samples, and number of antibodies (Fig. 1C). Moreover, by clicking the “show” button for each dataset, users can obtain more detailed information about how data were processed and normalized. Through the “Download” module, users can select datasets of interest in a tree-view menu (Fig. 1D). For each dataset, we provide both level 3 and level 4 data, with level 3 data representing the normalized data from independent batches (slides) and level 4 data representing merged data across multiple batches. These data have been precompressed in a zip file. Level 3 and level 4 data are equivalent for analyses of data generated in a single batch; level 4 data should be used when analyzing data across multiple batches. After downloading and decompressing the data, users can find the dataset metadata as well as the RPPA data matrix in the same folder. The “My protein” module provides detailed information about each protein marker assessed, including the corresponding gene(s), validation status, antibody source, and catalog number. This module also allows for visualization of the protein expression level across different cancer types (Fig. 1E). For example, by clicking the green “plus” sign in the view column of the protein PD-L1, users can see the boxplots of the protein expression levels in 19 TCGA cancer types. To check whether the protein has a similar pattern across cell lines, one can go to the same module in the cell line application and search for PD-L1. The sorted boxplots in different cell line lineages show the same order as those in the patient tumors (median values from high to low): head and neck (HNSC), lung (LUSC and LUAD), and uterine (UCEC) cancers (Fig. 1F).
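The median-based ordering behind the sorted boxplots can be reproduced offline from a downloaded data matrix. The following is a minimal sketch, not the portal's implementation: the column names (sample_id, cancer_type, PDL1), the CSV layout, and all values are hypothetical stand-ins, and the layout of an actual TCPA download should be checked before adapting it.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Toy stand-in for a downloaded level 4 matrix.
# Column names and values are hypothetical, for illustration only.
TOY_CSV = """sample_id,cancer_type,PDL1
S1,HNSC,0.9
S2,HNSC,0.7
S3,LUSC,0.5
S4,LUSC,0.3
S5,UCEC,-0.2
S6,UCEC,-0.4
S7,UCEC,0.0
"""


def rank_groups_by_median(csv_text, group_col, protein_col):
    """Return (group, median) pairs sorted from highest to lowest median."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = row[protein_col].strip()
        if cell == "" or cell.upper() == "NA":
            continue  # skip missing measurements
        values[row[group_col]].append(float(cell))
    medians = {group: median(vals) for group, vals in values.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


ranking = rank_groups_by_median(TOY_CSV, "cancer_type", "PDL1")
for group, med in ranking:
    print(f"{group}\t{med:.2f}")
```

Running the same function on the patient cohort matrix and on the cell line matrix gives two rankings that can be compared directly, which mirrors the cross-check between Fig. 1E and Fig. 1F.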
We implemented several commonly used approaches in TCPA, including correlation, differential expression, patient survival, and drug sensitivity analysis. For “Correlation Analysis,” users can examine whether the expression levels of two proteins correlate with each other. Through this module, users can select an RPPA dataset by cancer type/lineage and obtain all pairwise correlations in a table view (Fig. 1G). The first two columns show protein pairs, followed by the correlation coefficient and the corresponding P value. Furthermore, by clicking the plus sign, one can visualize the data in a scatter plot. For example, the plot shows that IGFBP2 and PD-L1 have a negative correlation in HNSC, and such a correlation can also be confirmed in cell line data (Fig. 1H). This independent validation increases the confidence of the patterns observed in patient cohorts. Another useful analysis is “Differential Expression.” For patient tumor samples, users can identify which proteins are most differentially expressed between any two tumor types or subtypes. In Fig. 1I, PD-L1 is elevated in HNSC compared with UCEC (t test P = 1.42 × 10⁻³⁵). For cancer cell lines, users can identify which proteins are differentially expressed between cell lines with and without a mutated gene. In addition, the patient cohort application contains a unique module of “Survival Analysis,” and the cell line application contains a unique module of “Protein-Drug Analysis.” These two modules can be used to identify potential prognostic and predictive protein markers. As shown in Fig. 1J, the expression levels of several proteins significantly correlate with patient survival times. For example, patients with high ECADHERIN expression show better survival than those with low expression (11). To further check whether this protein has some predictive power for drug sensitivity, “Protein-Drug Analysis” can be used to assess which drugs correlate with this protein. For example, the expression of ECADHERIN shows negative correlations with sensitivity to afatinib, lapatinib, and saracatinib (Fig. 1K). Similarly, one can query by the drug to find which proteins show correlations with sensitivity to a specific drug (Fig. 1L).
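For readers who wish to repeat the correlation and differential expression analyses on downloaded data, the sketch below shows one possible approach. It rests on assumptions not stated in the text: Pearson correlation, a permutation-based P value, and the Welch form of the t statistic are our choices for illustration, and the portal may use different statistics. All expression values are invented toy numbers.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided P value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def correlation_table(expr):
    """expr maps protein name -> values over the same samples.

    Returns rows of (protein_a, protein_b, r, p), one per protein pair.
    """
    rows = []
    for a, b in combinations(sorted(expr), 2):
        rows.append((a, b, pearson_r(expr[a], expr[b]),
                     permutation_p(expr[a], expr[b])))
    return rows


def welch_t(group1, group2):
    """Welch t statistic (unequal variances); positive if group1 mean is higher."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((v - m1) ** 2 for v in group1) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in group2) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)


# Hypothetical toy expression values for eight samples of one cancer type.
toy_expr = {
    "IGFBP2": [1.2, 0.8, 0.5, 0.1, -0.3, -0.6, -1.0, -1.1],
    "PDL1": [-0.9, -0.7, -0.2, 0.1, 0.2, 0.6, 0.8, 1.1],
}
table = correlation_table(toy_expr)
for a, b, r, p in table:
    print(f"{a}\t{b}\tr={r:.3f}\tP={p:.4f}")
```

The permutation P value is bounded below by 1/(n_perm + 1), so it cannot resolve very small P values such as the one reported for Fig. 1I; an analytic t distribution from a statistics library would be needed for that. The welch_t function takes two lists of expression values, for example one protein measured in two tumor types.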
TCPA provides two innovative visualization modules for exploring global patterns of RPPA datasets. One is “Network” visualization, in which protein markers are nodes and the interactions are colored edges linking two nodes. The red/green color represents a positive/negative correlation between two proteins. Users can search any protein using the search box to further focus on the neighbors of that protein. For example, Fig. 1M shows correlations between ECADHERIN and its neighbors. As in the other analytic modules, users can perform parallel analysis in cell line samples (Fig. 1N). Here, most of the correlations observed in Fig. 1M can be recaptured in Fig. 1N. The other module is “NG-CHM” visualization (Fig. 1O). This dynamic, interactive heatmap module enables users to visually check the global protein expression pattern for each cancer type or cell line lineage. In each heatmap, users can zoom in/out, check the expression level for each dot and related external resources, and obtain high-resolution figures for further analysis or publication.
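The structure behind the “Network” view can be approximated from a pairwise correlation table. The sketch below is illustrative only: the cutoff min_abs_r, the placeholder names PROTEIN_A and PROTEIN_B, and the correlation values are hypothetical, and the criteria the portal uses to draw an edge are not described in the text. The red/green coloring follows the convention stated above.

```python
def build_network(correlations, min_abs_r=0.3):
    """Build an undirected graph from (protein_a, protein_b, r) triples.

    Edges with |r| below min_abs_r (a hypothetical cutoff) are dropped.
    Each edge is colored red for a positive and green for a negative correlation.
    """
    graph = {}
    for a, b, r in correlations:
        if abs(r) < min_abs_r:
            continue
        color = "red" if r > 0 else "green"
        graph.setdefault(a, {})[b] = color
        graph.setdefault(b, {})[a] = color
    return graph


def neighbors(graph, protein):
    """Return the neighbors of one protein with their edge colors, sorted by name."""
    return dict(sorted(graph.get(protein, {}).items()))


# Hypothetical correlation values, for illustration only.
toy_correlations = [
    ("ECADHERIN", "PROTEIN_A", 0.62),
    ("ECADHERIN", "PROTEIN_B", -0.45),
    ("PROTEIN_A", "PROTEIN_B", 0.10),
]
graph = build_network(toy_correlations)
print(neighbors(graph, "ECADHERIN"))
```

Building one graph from patient data and another from cell line data, and then comparing the neighbor sets of the same protein, is one simple way to quantify how many correlations are shared between the two settings.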
We have developed an open-access, user-friendly bioinformatic resource, TCPA, which contains expression levels of key cancer proteins from approximately 10,000 samples of both patient tumors and cancer cell lines. TCPA is not only a one-stop data repository for obtaining high-quality RPPA data, but also a powerful web platform for making sense of these data through user-friendly, intuitive analyses. In the future, we will add more cancer RPPA data and integrate other genomic and clinical data into this resource. We expect that TCPA will continue to serve as a highly valuable resource that helps researchers generate testable hypotheses, validate findings of interest, and ultimately facilitate the development of novel cancer therapies.
Disclosure of Potential Conflicts of Interest
G.B. Mills reports receiving commercial research grants from Adelson Medical Research Foundation, AstraZeneca, Breast Cancer Research Foundation, Critical Outcome Technologies, Komen Research Foundation, Nanostring, Karus, Illumina, Takeda/Millenium Pharmaceuticals, and Pfizer, has received speakers bureau honoraria from Allostery, AstraZeneca, ImmunoMet, ISIS Pharmaceuticals, Lilly, MedImmune, Novartis, Pfizer, Symphogen, and Tarveda, has ownership interest (including patents) in Catena Pharmaceuticals, ImmunoMet, Myriad Genetics, PTV Ventures, and Spindletop Ventures, and is a consultant/advisory board member for Adventist Health, Allostery, AstraZeneca, Catena Pharmaceuticals, Critical Outcome Technologies, ImmunoMet, ISIS Pharmaceuticals, Lilly, MedImmune, Novartis, Precision Medicine, Provista Diagnostics, Signalchem Lifesciences, Symphogen, Takeda/Millenium Pharmaceuticals, Tarveda, Tau Therapeutics. Han Liang is a consultant/advisory board member for INFOTECH Soft Inc. No potential conflicts of interest were disclosed by the other authors.
Conception and design: J. Li, G.B. Mills, H. Liang
Development of methodology: J. Li, R. Akbani, H. Liang
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J. Li, Y. Lu, G.B. Mills, H. Liang
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J. Li, R. Akbani, W. Zhao, G.B. Mills, H. Liang
Writing, review, and/or revision of the manuscript: J. Li, R. Akbani, G.B. Mills, H. Liang
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Li, R. Akbani, J.N. Weinstein, H. Liang
Study supervision: Y. Lu, H. Liang
Other (performed functional proteomics study by RPPA analysis): Y. Lu
This study was supported by the NIH (CA168394, CA098258, CA143883 and HG008100 to G.B. Mills; CA175486 to H. Liang; CA209851 to H. Liang and G.B. Mills; CA201949 and CA210950 to G.B. Mills, R. Akbani, and J.N. Weinstein; and the MD Anderson Cancer Center Support grant, CA016672), a grant from the Cancer Prevention and Research Institute of Texas (RP140462 to H. Liang), the Lorraine Dell Program in Bioinformatics for Personalization of Cancer Medicine (to J. N. Weinstein), and the Adelson Medical Research Foundation (to G.B. Mills).