P-MartCancer is an interactive web-based software environment that enables statistical analyses of peptide or protein data, quantitated from mass spectrometry–based global proteomics experiments, without requiring in-depth knowledge of statistical programming. P-MartCancer offers a series of statistical modules associated with quality assessment, peptide and protein statistics, protein quantification, and exploratory data analyses driven by the user via customized workflows and interactive visualization. Currently, P-MartCancer offers access and the capability to analyze multiple cancer proteomic datasets generated through the Clinical Proteomics Tumor Analysis Consortium at the peptide, gene, and protein levels. P-MartCancer is deployed as a web service (https://pmart.labworks.org/cptac.html), alternatively available via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/). Cancer Res; 77(21); e47–50. ©2017 AACR.

The use of mass spectrometry (MS)-based technologies for global protein profiling of cancer-related tissues and bodily fluids has become a major focus in research centered on biomarker discovery to better detect and treat cancer. Global proteomic technologies are of interest in the field of cancer research because they provide a vital source of information regarding biological functions at the protein level (1–4). Global MS-based proteomics often allow hundreds of thousands of peptides mapping to tens of thousands of proteins to be measured, offering scientists an unprecedented view into processes involved in cancer development and progression. However, as with many newer high-throughput molecular technologies, the data are complex with multiple sources of variability (1); consequently, the postpeptide identification and quantification (i.e., downstream) analyses of these datasets require significant specialized expertise due to inherent data challenges, such as missing values and isoform quantification (5–7). Data processing is generally performed via custom and unpublished scripts, leading to continued issues in reproducibility of results (8). Thus, challenges in access to both the data and statistical methods in an easy-to-understand format for biomedical researchers have led to underutilization of publicly available databases, such as those generated through the Clinical Proteomics Tumor Analysis Consortium (CPTAC) funded through the NCI (Rockville, MD). The development of software that enables continued exploration and evaluation of existing data could increase the potential for new discovery from these comprehensive public datasets.

Existing software capabilities associated with CPTAC can broadly be placed in four categories for global MS-based data (https://proteomics.cancer.gov/proteomics/background): (i) spectral preprocessing; (ii) peptide and protein identification; (iii) specific methods for qualitative and quantitative differential statistics; and (iv) network analyses. P-MartCancer broadly fits in the third category; however, unlike these targeted analyses on data that have already been processed into a particular form, P-MartCancer offers a holistic approach to data analysis, allowing all steps of analysis from quality control through pattern discovery to be performed in a workflow-based manner. In addition, the way in which P-MartCancer offers direct access to the data for statistical analysis at multiple levels (peptide, gene, and protein), with the clinical information aligned, is unique within the set of tools supporting CPTAC. P-MartCancer provides an open, web-based interactive platform for performing quality control processing, statistics, protein quantification, and exploratory data analysis tasks in a manner that is reproducible (see Supplementary Video S1).

Software development

P-MartCancer functions are developed in R or Rcpp; Rserve (https://www.rforge.net/Rserve/) is used to communicate between R and the web service; and the interface is developed in Java. A subset of the R functions is currently available via GitHub (https://github.com/pmartR/), and the web service can also be installed via Docker Hub (https://hub.docker.com/r/pnnl/pmart-web/). Because of this design, adding functionality to P-MartCancer is straightforward through a standardized pipeline.

Cancer proteomics data

P-MartCancer currently accesses multiple proteomic datasets generated through the CPTAC available on the Data Portal (https://cptac-data-portal.georgetown.edu/cptacPublic/). Data are available at the peptide, protein, and/or gene levels, where the protein and gene data are based on a defined Common Data Analysis Pipeline (CDAP; ref. 9). P-MartCancer offers flexibility to the user to either perform statistical processing on the peptide data (before or after parsimony) with P-MartCancer functions for gene or protein quantitation, or to use the data as supplied by CPTAC via the CDAP. Currently, multiple datasets are available for ovarian cancer and breast cancer (2); however, new datasets are being added as they become available. Each of these datasets contains metadata about the experiment, and the user selects the clinical variable of interest (e.g., vital status, tumor stage) for data processing, allowing various hypotheses to be explored.

P-MartCancer is a modular workflow tool (Fig. 1A) with four primary capabilities: (i) quality control processing; (ii) gene or protein quantification; (iii) statistics; and (iv) exploratory data analysis. These are further divided into six key modules under which new functionality can be easily added through the R/Rserve to Java framework described previously. The modules and functions available depend upon the type of data being evaluated: peptide, protein, or gene. The output of each function called is displayed as a table or figure, and the modules are listed in sequential order in green text at the top of the P-MartCancer screen, with the current module shown in blue (Fig. 1B, top). Finally, the entire process is documented for the user at the end of the workflow (Fig. 1C), and all datasets and statistical results are available for download as .CSV files.

Figure 1.

Screenshots from P-MartCancer. A, The user selects the workflow to be implemented on the basis of the data type selected. B, The exploratory evaluation capability allows users to find proteins or genes of interest and evaluate the associated data visually. C, The log of all of the steps in the analysis performed by the user for use in publication or to facilitate reproducible analyses.

Quality control processing

A challenge with proteomics data is preprocessing in a manner that does not ignore the different sources of variability that contribute to the complexity of these datasets. For example, peptides are not uniformly identified for all samples, and thus, large quantities of missing values are common, due to both random and nonrandom mechanisms. P-MartCancer offers a suite of preprocessing capabilities that handle issues of peptide and protein coverage, as well as the identification of potential outlier samples (7, 10, 11). At the peptide level, researchers can choose to (i) remove proteins with inadequate coverage (peptide filter); (ii) remove samples with outlier behavior (sample outlier filter; ref. 10); (iii) remove peptides too sparse for statistical analysis (peptide coverage filter); (iv) remove peptides with extremely high variability in the context of a coefficient of variation (CV filter); and (v) perform normalization. The quality control processing generates a high-quality dataset for continued statistical analysis. For gene- and protein-level datasets, functions (ii)–(iv) above are available, evaluating outlier behavior and removing genes or proteins that are too sparse or variable to add value to downstream statistics.
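As an illustration of filters (iii) and (iv), the following is a minimal Python sketch, not the P-MartCancer implementation (which is written in R). The function names, the thresholds, and the toy abundance table are all hypothetical, and the CV is computed on untransformed abundances as an assumption.

```python
import statistics

def coverage_filter(data, min_observed):
    """Keep peptides with at least `min_observed` non-missing values.

    `data` maps a peptide name to one abundance per sample (None = missing).
    The threshold is a hypothetical parameter chosen for illustration.
    """
    return {
        pep: vals
        for pep, vals in data.items()
        if sum(v is not None for v in vals) >= min_observed
    }

def cv_filter(data, max_cv):
    """Keep peptides whose coefficient of variation (sd / mean) is <= `max_cv`.

    Only observed values are used; nothing is imputed. Peptides with fewer
    than two observations are dropped because their CV is undefined.
    """
    kept = {}
    for pep, vals in data.items():
        observed = [v for v in vals if v is not None]
        if len(observed) < 2:
            continue
        cv = statistics.stdev(observed) / statistics.mean(observed)
        if cv <= max_cv:
            kept[pep] = vals
    return kept

# Toy table (made-up values): three peptides measured in four samples.
abundances = {
    "PEP_A": [10.0, 11.0, 9.5, 10.5],   # well covered, low variability
    "PEP_B": [5.0, None, None, None],   # too sparse
    "PEP_C": [1.0, 20.0, 3.0, 40.0],    # well covered, highly variable
}

after_coverage = coverage_filter(abundances, min_observed=3)
after_cv = cv_filter(after_coverage, max_cv=0.5)
print(sorted(after_cv))
```

In this toy table, the coverage filter removes the sparse peptide and the CV filter then removes the highly variable one, so only the first peptide is passed on to the statistics step.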

Differential statistics

Statistical analysis of peptide, gene, or protein-level data is currently focused on quantitative ANOVA-based methods and qualitative G-test methods (11). The ANOVA method allows the comparison of any number of groups, performing a multiple test correction when more than two groups are compared (e.g., clinical variable “tumor residual disease” is separated into four categories by size). A Tukey adjustment is performed when the user compares all groups with one another, and a Dunnett adjustment is performed when the user compares back to a single control group. Data are not imputed; statistical results are generated based only on the observed data to ensure that accurate estimates of variance are used for these tests. To identify qualitative changes, a G-test is also performed for each biomolecule to evaluate whether the number of nonmissing observations in one group is more than expected by chance. Multiple test adjustments for the G-test are performed using a Holm–Bonferroni correction. The total significance is given in the context of a bar graph, and to facilitate exploratory data analysis capabilities, a P value threshold (default of 0.05) is used to move forward only a subset of the peptides, genes, or proteins for further evaluation.
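The qualitative test and its multiple test adjustment can be sketched as follows. This is a generic two-group G-test of independence on observed versus missing counts, followed by a Holm–Bonferroni step-down adjustment; it is an illustrative sketch rather than the P-MartCancer code, and the function names and counts are hypothetical. Restricting to two groups gives one degree of freedom, so the chi-square tail probability can be written with the complementary error function; more groups would need a general chi-square survival function.

```python
import math

def g_test_presence(obs_a, n_a, obs_b, n_b):
    """G-test of independence on a 2x2 table (group x observed/missing).

    obs_a, obs_b: number of non-missing observations in each group.
    n_a, n_b: number of samples in each group.
    Returns (G, p) with p taken from a chi-square distribution with 1 df.
    """
    table = [[obs_a, n_a - obs_a], [obs_b, n_b - obs_b]]
    total = n_a + n_b
    row_sums = [n_a, n_b]
    col_sums = [obs_a + obs_b, total - obs_a - obs_b]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:  # a zero cell contributes nothing to G
                expected = row_sums[i] * col_sums[j] / total
                g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # For 1 df, P(chi2 > g) = erfc(sqrt(g / 2)).
    return g, math.erfc(math.sqrt(g / 2.0))

def holm_adjust(p_values):
    """Holm-Bonferroni adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical counts: (observed in A, samples in A, observed in B, samples in B).
counts = {
    "PEP_1": (9, 10, 2, 10),
    "PEP_2": (5, 10, 5, 10),
    "PEP_3": (8, 10, 6, 10),
}
raw = {pep: g_test_presence(*c)[1] for pep, c in counts.items()}
adj = dict(zip(raw, holm_adjust(list(raw.values()))))
significant = [pep for pep, p in adj.items() if p <= 0.05]  # default threshold from the text
print(significant)
```

Only the biomolecules whose adjusted P value passes the threshold would be carried forward to the exploratory step.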

Protein quantification

There are numerous approaches to quantify proteins from the measured peptide-level data (12). P-MartCancer currently offers a standard reference–based approach that scales all peptides to the most abundant, or most reproducible, peptide and gives the median signal (13).
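One possible reading of this reference-based rollup is sketched below in Python. It assumes log-scale abundances, picks the reference as the peptide with the highest median observed abundance, shifts every other peptide by its median difference from the reference over the samples they share, and reports the per-sample median. These choices, the function name, and the toy values are assumptions for illustration; the procedure actually used is the one described in ref. 13.

```python
import statistics

def reference_rollup(peptides):
    """Roll peptide abundances up to one protein-level value per sample.

    `peptides` maps a peptide name to log-scale abundances, one per sample
    (None = missing). Returns (reference peptide name, protein values).
    """
    def observed(vals):
        return [v for v in vals if v is not None]

    # Only peptides with at least one observation can serve as the reference.
    candidates = [p for p, vals in peptides.items() if observed(vals)]
    ref_name = max(candidates, key=lambda p: statistics.median(observed(peptides[p])))
    ref = peptides[ref_name]

    scaled = {}
    for name, vals in peptides.items():
        # Differences from the reference where both peptides were observed.
        diffs = [r - v for r, v in zip(ref, vals) if r is not None and v is not None]
        if not diffs:
            continue  # no overlap with the reference, so it cannot be scaled
        shift = statistics.median(diffs)
        scaled[name] = [None if v is None else v + shift for v in vals]

    protein = []
    for s in range(len(ref)):
        column = [vals[s] for vals in scaled.values() if vals[s] is not None]
        protein.append(statistics.median(column) if column else None)
    return ref_name, protein

# Toy protein with three peptides measured in three samples (made-up values).
toy = {
    "pep1": [20.0, 21.0, 22.0],
    "pep2": [18.0, 19.0, None],
    "pep3": [15.0, None, 17.0],
}
ref_name, protein = reference_rollup(toy)
print(ref_name, protein)
```

Selecting the most reproducible peptide instead (for example, the one with the fewest missing values) would only change how `ref_name` is chosen; the scaling and median steps stay the same.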

Exploratory data analysis

P-MartCancer offers two exploratory data analysis capabilities. The first is probabilistic principal component analysis, which allows P-MartCancer to perform PCA without imputing missing values, an approach demonstrated to be valuable in proteomics (7). The scores from the first two principal components are displayed in a standard scatter plot, a format familiar to most cancer and biomedical researchers, allowing visual exploration of clustering across samples.
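For readers unfamiliar with the method, the textbook form of the probabilistic PCA model is given below. The notation is an assumption introduced here and is not taken from the P-MartCancer documentation: x_i is the d-dimensional abundance vector for sample i, z_i its q-dimensional latent score, W the loading matrix, and o_i the set of features actually observed in sample i. The second line shows why no imputation is needed: the likelihood of each sample can be evaluated on its observed entries alone.

```latex
\begin{aligned}
\mathbf{x}_i &= \mathbf{W}\mathbf{z}_i + \boldsymbol{\mu} + \boldsymbol{\epsilon}_i,
\qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_q),
\quad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_d), \\
\mathbf{x}_{i,o_i} &\sim \mathcal{N}\!\left(\boldsymbol{\mu}_{o_i},\;
\mathbf{W}_{o_i}\mathbf{W}_{o_i}^{\top} + \sigma^2 \mathbf{I}_{|o_i|}\right).
\end{aligned}
```

Here the subscript o_i selects the rows of W and the entries of the mean vector that correspond to the observed features. How P-MartCancer estimates the parameters is not specified in this article.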

P-MartCancer also offers an interactive and customizable plotting capability called Trelliscope that allows sorting and querying across the peptides, genes, or proteins (Fig. 1B). Trelliscope uses the statistical results to plot each peptide, gene, or protein via either a boxplot of differential abundance or a bar graph of the number of observations, to view quantitative and qualitative changes, respectively. The entire space of the biomolecules being explored can be reduced by selecting various thresholds, such as P value or fold change, or the user can search for specific genes or proteins of interest. For each plot, the gene and protein information can be selected, and the associated information can be viewed at http://www.genecards.org and http://www.uniprot.org, respectively. In the example in Fig. 1B, the user has searched for BRAF and can rapidly view the associated protein-level quantified information based on the clinical variable selected, “macroscopic disease,” along with other information, such as the P value of approximately 0.04 for BRAF in the specific comparison selected.

P-MartCancer offers a new online platform to access CPTAC datasets to enable new analyses. There is a wealth of capabilities that could be extremely useful to the proteomics community, many of which are under active development. For example, proteoform discovery, which is the identification of proteins with multiple forms, is also an important component of protein quantification (6). Additional future work is focused on adding new capabilities in statistical testing, machine learning, and gene set enrichment analysis, as well as the development of a user-upload capability to enable all researchers with MS-based peak-intensity data to create reproducible statistical downstream processing pipelines.

No potential conflicts of interest were disclosed.

Conception and design: B.-J.M. Webb-Robertson, J.L. Jensen, K.G. Stratton

Development of methodology: B.-J.M. Webb-Robertson, L.M. Bramer, J.L. Jensen, M.A. Kobold, K.G. Stratton

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): B.-J.M. Webb-Robertson, L.M. Bramer, K.G. Stratton, A.M. White

Writing, review, and/or revision of the manuscript: B.-J.M. Webb-Robertson, L.M. Bramer, K.G. Stratton, K.D. Rodland

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): L.M. Bramer, M.A. Kobold, K.G. Stratton, K.D. Rodland

P-MartCancer was developed at Pacific Northwest National Laboratory (PNNL), a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RL01830.

This work was supported by NCI grant U01-1CA184783 to B.-J.M. Webb-Robertson.

1. Gajadhar AS, Johnson H, Slebos RJ, Shaddox K, Wiles K, Washington MK, et al. Phosphotyrosine signaling analysis in human tumors is confounded by systemic ischemia-driven artifacts and intra-specimen heterogeneity. Cancer Res 2015;75:1495–503.
2. Mertins P, Mani DR, Ruggles KV, Gillette MA, Clauser KR, Wang P, et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 2016;534:55–62.
3. Slebos RJ, Wang X, Wang X, Zhang B, Tabb DL, Liebler DC. Proteomic analysis of colon and rectal carcinoma using standard and customized databases. Sci Data 2015;2:150022.
4. Zhang H, Liu T, Zhang Z, Payne SH, Zhang B, McDermott JE, et al. Integrated proteogenomic characterization of human high-grade serous ovarian cancer. Cell 2016;166:755–65.
5. Choi M, Eren-Dogu ZF, Colangelo C, Cottrell J, Hoopmann MR, Kapp EA, et al. ABRF proteome informatics research group (iPRG) 2015 study: detection of differentially abundant proteins in label-free quantitative LC-MS/MS experiments. J Proteome Res 2017;16:945–57.
6. Webb-Robertson BJ, Matzke MM, Datta S, Payne SH, Kang J, Bramer LM, et al. Bayesian proteoform modeling improves protein quantification of global proteomic measurements. Mol Cell Proteomics 2014;13:3639–46.
7. Webb-Robertson BJ, Wiberg HK, Matzke MM, Brown JN, Wang J, McDermott JE, et al. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. J Proteome Res 2015;14:1993–2001.
8. Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86.
9. Markey SP, Rudnick PA, Mirokhin YI, Roth J, Stein SE. Common Data Analysis Pipeline (CDAP). Rockville, MD: NCI; 2014. Available from: https://cptac-data-portal.georgetown.edu/cptac/aboutData/show?scope=dataLevels.
10. Matzke MM, Waters KM, Metz TO, Jacobs JM, Sims AC, Baric RS, et al. Improved quality control processing of peptide-centric LC-MS proteomics data. Bioinformatics 2011;27:2866–72.
11. Webb-Robertson BJ, McCue LA, Waters KM, Matzke MM, Jacobs JM, Metz TO, et al. Combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from MS-based proteomics data. J Proteome Res 2010;9:5748–56.
12. Matzke MM, Brown JN, Gritsenko MA, Metz TO, Pounds JG, Rodland KD, et al. A comparative analysis of computational approaches to relative protein quantification using peptide peak intensities in label-free LC-MS proteomics experiments. Proteomics 2013;13:493–503.
13. Polpitiya AD, Qian WJ, Jaitly N, Petyuk VA, Adkins JN, Camp DG II, et al. DAnTE: a statistical tool for quantitative analysis of -omics data. Bioinformatics 2008;24:1556–8.