Proteogenomics has emerged as a valuable approach in cancer research. It integrates genomic and transcriptomic data with mass spectrometry–based proteomics data to directly identify expressed, variant protein sequences that may have functional roles in cancer. This approach is computationally intensive and requires integration of disparate software tools into sophisticated workflows, which hinders its adoption by nonexpert bench scientists. To address this need, we have developed an extensible, Galaxy-based resource aimed at providing more researchers access to, and training in, proteogenomic informatics. Our resource brings together software from several leading research groups to address two foundational aspects of proteogenomics: (i) generation of customized, annotated protein sequence databases from RNA-Seq data; and (ii) accurate matching of tandem mass spectrometry data to putative variants, followed by filtering to confirm their novelty. Directions for accessing software tools and workflows, along with instructional documentation, can be found at z.umn.edu/canresgithub. Cancer Res; 77(21); e43–46. ©2017 AACR.
Proteogenomics integrates genomic, transcriptomic, and mass spectrometry (MS)–based proteomics data to verify the expression of protein sequence variants resulting from sequence variations at the DNA or RNA level (1). Most commonly, assembled RNA-Seq data containing potential sequence variants are translated in silico, generating possible expressed protein variants. Tandem mass spectrometry (MS/MS) spectra of peptides are acquired from proteolytic digestion of proteins isolated from the same sample. These MS/MS spectra are matched to a database containing the protein variants as well as known reference protein sequences. Peptide spectrum matches (PSM) of MS/MS spectra to variant sequences within the database confirm the expression of novel protein sequences, helping to distinguish important variants and improve genomic annotation (1). Recently, high-profile studies demonstrated the value of proteogenomics for discovery of protein variants that may be drivers of cancer (2–5).
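The in silico translation step can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a made-up coding sequence carrying a single-nucleotide substitution; the function names `translate` and `apply_snv` are hypothetical and are not part of any tool in this resource.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence (DNA, frame 0) up to the first stop codon."""
    protein = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snv(cds, pos, alt):
    """Return the sequence with the base at 0-based position `pos` replaced."""
    return cds[:pos] + alt + cds[pos + 1:]


# Hypothetical reference transcript and variant (illustrative data only).
reference_cds = "ATGGCCAAGGAATAA"
variant_cds = apply_snv(reference_cds, 7, "C")  # codon AAG -> ACG

print(translate(reference_cds))  # MAKE
print(translate(variant_cds))    # MATE (single amino acid variant K3T)
```

The variant protein sequence produced this way is what would be added to the search database alongside the reference sequence.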
Despite its value, adoption of the proteogenomics approach by the wider research community remains a challenge. This is primarily due to its intensive informatics requirements (6). Proteogenomics requires integration of disparate software from different omic domains, optimally within a single, user-friendly environment. The software and supporting hardware must scale to accommodate the memory- and compute-intensive needs presented by the large-scale datasets encountered. Finally, the workflow must address possible pitfalls, such as false positives and the need to confirm the novelty of putative variants identified (1). Although some described software platforms meet at least some of these requirements (7, 8), most proteogenomics studies to date have been accomplished using relatively inaccessible, in-house informatics solutions.
Here, we present an informatics resource aimed at expanding the use of proteogenomics in cancer research. The resource is built upon the Galaxy bioinformatics platform (6), and is an extension of the Galaxy for proteomics (Galaxy-P) project (see galaxyp.org for more details). Galaxy enables integration of disparate, multi-omic tools in a single, user-friendly environment, as required for proteogenomics (6, 9, 10). The new resource described here provides workflows and training in the most critical aspects of proteogenomics: generation of customized protein sequence databases from RNA-Seq data, matching of MS/MS data to putative variant peptide sequences, and confirmation of the novelty of these identified sequences.
Description of Resource
Figure 1 describes the workflows that make up this resource. Each workflow is also detailed in turn below. The page z.umn.edu/canresgithub provides directions to access workflows and related instructional material, including on-screen, interactive Galaxy Tours tutorials. We have also generated a narrated instructional video that gives an overview of this resource and its operation (see Supplementary Video S1).
Customized database generation workflow
This workflow, in part, takes advantage of well-documented, mature software tools for RNA-Seq data analysis that are long-standing, core components of the Galaxy platform. The workflow's inputs are raw RNA-Seq data (.FASTQ) and a genomic annotation file (.GTF), which are analyzed by a series of tools to identify and assemble potential sequence variants. The current workflow focuses on insertion-deletion (Indel) variants and single amino acid variants (SAV). These tools generate a variant call format (.VCF) file that summarizes all potential variants identified from the starting RNA-Seq data. Along with a .BAM file (RNA sequence alignment information), the .VCF file acts as an input to the tool CustomProDB (11). CustomProDB creates a customized protein sequence database in the common .FASTA format, which contains potential variant protein sequences and an annotation of the type of variant (e.g., SAV, Indel). The possible variant sequences are merged with reference protein sequences for the organism being studied to create a comprehensive, sample-specific sequence database. We have developed workflows (accessed through z.umn.edu/canresgithub) for analyzing single-end RNA-Seq data (from a mouse sample) and paired-end RNA-Seq data (from human MCF7 cells).
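The final merging step, in which variant sequences are combined with reference sequences into one FASTA database, might look roughly like the following Python sketch. The header layout (for example `type=SAV`) and the function names are assumptions made for illustration; they do not reproduce the actual output format of CustomProDB or the Galaxy tools.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_databases(reference, variants):
    """Concatenate reference and variant records, dropping duplicate sequences.

    Reference entries come first, so a "variant" whose sequence is identical
    to a reference protein is not added a second time.
    """
    merged, seen = [], set()
    for header, seq in list(reference) + list(variants):
        if seq not in seen:
            seen.add(seq)
            merged.append((header, seq))
    return merged


def to_fasta(records, width=60):
    """Serialize (header, sequence) tuples as FASTA with wrapped lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical example data; accession numbers and headers are invented.
reference_text = ">sp|P00001|REF1\nMAKELLK\n>sp|P00002|REF2\nMTTPSQR\n"
variant_text = ">P00001_K3T type=SAV\nMATELLK\n>P00002_dup type=SAV\nMTTPSQR\n"

merged = merge_databases(parse_fasta(reference_text), parse_fasta(variant_text))
print(to_fasta(merged))
```

Keeping the variant type in the header is what later allows matches to variant entries to be picked out of the search results by annotation.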
Sequence database searching and variant confirmation workflow
We have deployed the software SearchGUI (compomics.github.io/projects/searchgui.html; ref. 12), which bundles several of the most popular sequence database searching programs to match MS/MS spectra to peptide sequences contained in the sequence database. The use of complementary searching programs provides more comprehensive and higher confidence PSM identification (13).
Inputs for the workflow are the customized protein sequence database and Mascot generic format (.MGF) files, which contain peak lists from the raw MS/MS data. Often, MS-based proteomics data are generated from the fractionation of a single sample, with each fraction generating a separate MS raw file (and .MGF file). Galaxy can define a group of such files as a “Dataset Collection” (see z.umn.edu/canresgithub for a Dataset Collection Galaxy Tour). For this workflow, a Dataset Collection of .MGF files acts as an input to SearchGUI, where each separate .MGF file is analyzed in turn using the same parameters. A single output from SearchGUI is produced, aggregating the results from each sequence database search program on each .MGF file.
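As a rough illustration of what a collection of .MGF files contains, the Python sketch below parses the basic MGF layout (`BEGIN IONS`, `KEY=value` parameter lines, peak lines, `END IONS`) and applies the same function to every fraction in turn. The dictionary standing in for a Dataset Collection, the file names, and the spectra are all hypothetical; this is not the Galaxy or SearchGUI implementation.

```python
def parse_mgf(text):
    """Parse MGF text into a list of spectra with parameters and peaks."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z and intensity (extra columns are ignored)
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Two invented fractions from one sample (illustrative values only).
FRACTION_1 = """BEGIN IONS
TITLE=fraction1_scan1
PEPMASS=445.12
CHARGE=2+
100.1 200.0
150.2 350.5
END IONS
BEGIN IONS
TITLE=fraction1_scan2
PEPMASS=512.30
CHARGE=3+
110.0 90.0
END IONS
"""

FRACTION_2 = """BEGIN IONS
TITLE=fraction2_scan1
PEPMASS=630.45
CHARGE=2+
120.5 75.0
END IONS
"""

# A plain dict stands in for a Galaxy Dataset Collection here.
collection = {"fraction_1.mgf": FRACTION_1, "fraction_2.mgf": FRACTION_2}

# Each file is processed in turn by the same function with the same settings.
spectrum_counts = {name: len(parse_mgf(text)) for name, text in collection.items()}
print(spectrum_counts)
```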
The results file from SearchGUI acts as the input to the companion program PeptideShaker (14). PeptideShaker further processes PSM information from SearchGUI. This processing includes PSM quality control, statistical analysis and false discovery rate (FDR) estimation, post-translational modification localization scoring, protein inference from PSMs, as well as organization and annotation for viewing of the output. In its Galaxy implementation, users are offered a number of output options for PeptideShaker, including a PSM report, inferred protein identities, and a zipped .cpsx file. The zipped .cpsx file contains all results and can be downloaded from the Galaxy web interface and viewed using the free PeptideShaker viewer (compomics.github.io/projects/peptide-shaker.html).
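To give a sense of what FDR estimation involves, the following Python sketch applies a generic target-decoy calculation to a list of scored PSMs. It is a simplified illustration under the assumption that each PSM is flagged as a target or decoy hit and that higher scores are better; it is not PeptideShaker's statistical model, and the function name and example scores are hypothetical.

```python
def accept_at_fdr(psms, max_fdr=0.01):
    """Return target PSMs passing a simple target-decoy FDR threshold.

    `psms` is a list of (score, is_decoy) tuples, higher score = better match.
    The FDR at each rank is estimated as (#decoys / #targets) among all PSMs
    scoring at least as well; the largest accepted set within `max_fdr` wins.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for rank, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cut = rank
    return [p for p in ranked[:best_cut] if not p[1]]


# Invented scores for illustration: (score, is_decoy)
example_psms = [
    (90, False), (85, False), (80, False),
    (70, True), (65, False), (60, True),
]

strict = accept_at_fdr(example_psms, max_fdr=0.0)
relaxed = accept_at_fdr(example_psms, max_fdr=0.25)
print([score for score, _ in strict])   # [90, 85, 80]
print([score for score, _ in relaxed])  # [90, 85, 80, 65]
```

With real data, thousands of PSMs are ranked this way, so thresholds such as 1% become meaningful; the six-PSM example only shows the mechanics.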
The final part of the workflow acts on the PSM report from PeptideShaker to confirm novel variants. PSMs to putatively novel peptide sequences are selected via their annotation from the FASTA protein sequence database and submitted for BLAST-P analysis. BLAST-P compares the sequences to known sequences of the organism being studied. Putative variant sequences that do not perfectly match known sequences after BLAST-P analysis are selected and output as confirmed, novel peptide sequences, ready for further analysis. The output of this workflow is a tabular list of confirmed, novel peptide sequences present in the sample. Instructions on workflow operation and results interpretation are at z.umn.edu/canresgithub.
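The logic of this confirmation step can be sketched in Python as follows. The sketch substitutes a plain exact-substring comparison for BLAST-P, and it assumes that variant database entries carry the word "variant" in their annotation; the field names, the tag, and the example sequences are all hypothetical rather than taken from the actual PSM report format.

```python
def confirm_novel(psms, reference_proteins, variant_tag="variant"):
    """Return sorted peptides that match variant entries and no reference protein.

    psms: list of dicts with "peptide" and "protein" (database annotation) keys.
    reference_proteins: iterable of known protein sequences for the organism.
    A peptide is kept only if it is not an exact substring of any reference
    sequence. This stands in for the BLAST-P comparison described above.
    """
    references = list(reference_proteins)
    # Step 1: select PSMs to putative variants via the database annotation.
    candidates = {p["peptide"] for p in psms if variant_tag in p["protein"]}
    # Step 2: discard candidates that perfectly match a known sequence.
    return sorted(
        pep for pep in candidates
        if not any(pep in ref for ref in references)
    )


# Invented example data for illustration.
reference_proteins = ["MAKELLKQR", "MTTPSQRLV"]
psm_report = [
    {"peptide": "MATELLK", "protein": "P00001_K3T variant type=SAV"},
    {"peptide": "TTPSQR", "protein": "P00002_x variant type=SAV"},
    {"peptide": "AKELLK", "protein": "sp|P00001|REF1"},
]

print(confirm_novel(psm_report, reference_proteins))  # ['MATELLK']
```

Here "TTPSQR" is rejected despite its variant annotation because it occurs unchanged in a reference protein, which is the kind of false lead the confirmation step is meant to remove.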
We have made this proteogenomics informatics resource available in multiple ways. Our public Galaxy instance (usegalaxyp.org) is a training site for use of these workflows, including small-scale data for users to access and use with published workflows. These workflows are also available on a larger capacity instance housed on the cloud-based Jetstream infrastructure (15). Instructions on accessing usegalaxyp.org and Jetstream are provided at z.umn.edu/canresgithub. In addition, our workflows and software have been published in the Galaxy Tool Shed. Galaxy users can directly import and use these on their own instance. The archived workflows track and store all operating parameters and version information for the software used in the analysis pipeline.
The resource described here provides foundational tools and workflows for proteogenomics analysis, implemented in the extensible Galaxy platform to facilitate further enhancements. For example, customized workflows for multi-stage database searching to facilitate variant-specific FDR estimates (1) are being developed. We are also working on a Galaxy plugin for visualizing proteogenomic results, enabling further viewing of PSM and protein identifications. Functionality for converting PSM information to a SAM file (7) for downstream viewing in the Integrative Genomics Viewer (software.broadinstitute.org/software/igv) is also in progress. Although not the focus here, Galaxy-based tools for quantifying RNA-Seq and MS-based proteomics data are available for quantitative proteogenomic analysis. In addition, we expect that the active and collaborative community of Galaxy users and developers will continue to add to the proteogenomic resource described here.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: M.C. Chambers, P.D. Jagtap, L. Martens, T.J. Griffin
Development of methodology: M.C. Chambers, P.D. Jagtap, L. Martens, B.A. Grüning, I.R. Cooke, T.J. Griffin
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C.R. Guerrero, I.R. Cooke, M. Heydarian, K.L. Reddy
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): M.C. Chambers, P.D. Jagtap, J.E. Johnson, P. Kumar, G. Onsongo, H. Barsnes, L. Martens, I.R. Cooke, T.J. Griffin
Writing, review, and/or revision of the manuscript: M.C. Chambers, P.D. Jagtap, T. McGowan, P. Kumar, H. Barsnes, M. Vaudel, L. Martens, B.A. Grüning, K.L. Reddy, T.J. Griffin
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): M.C. Chambers, J.E. Johnson, T. McGowan, P. Kumar, H. Barsnes
Study supervision: T.J. Griffin
Other (bioinformatic tool development): M. Vaudel
We thank Jeremy Fischer and Tom Doak for Jetstream assistance. We also thank John Chilton for his initial assistance wrapping tools in Galaxy and Subina Mehta for assistance with generating online materials.
This work was supported by Ghent University Concerted Research Action BOF12/GOA/014 to L. Martens; Bergen Research Foundation and the Research Council of Norway (H. Barsnes); BMBF grant 031 A538A RBC (de.NBI; B. Grüning); NCI-ITCR grant 1U24CA199347 and NSF (U.S.) grant 1458524 to M.C. Chambers, P.D. Jagtap, J.E. Johnson, T. McGowan, P. Kumar, G. Onsongo, C.R. Guerrero, and T.J. Griffin.