Abstract
High-throughput genomic data that measure RNA expression, DNA copy number, mutation status, and protein levels provide insights into the molecular pathway structure of cancer. Genomic lesions (amplifications, deletions, mutations) and epigenetic modifications disrupt biochemical cellular pathways. Although the number of possible lesions is vast, different genomic alterations may result in concordant expression and pathway activities, producing common tumor subtypes that share similar phenotypic outcomes. How can these data be translated into medical knowledge that provides prognostic and predictive information? First-generation mRNA expression signatures such as Genomic Health's Oncotype DX already provide prognostic information, but do not provide therapeutic guidance beyond the current standard of care, which is often inadequate in high-risk patients. Rather than building molecular signatures based on gene expression levels alone, evidence is growing that signatures based on higher-level quantities, such as those derived from genetic pathways, may provide important prognostic and diagnostic cues. We provide examples of how activities for molecular entities can be predicted from pathway analysis and how the composite of all such activities, referred to here as the “activitome,” helps connect genomic events to clinical factors to predict the drivers of poor outcome. Clin Cancer Res; 19(12); 3114–20. ©2013 AACR.
Disclosure of Potential Conflicts of Interest
M.J. Ellis is employed (other than primary affiliation; e.g., consulting) as a board member and has ownership interest (including patents) in Bioclassifier LLC and University Genomics. J.M. Stuart has ownership interest (including patents) in and is a consultant/advisory board member for Five3 Genomics. No potential conflicts of interest were disclosed by the other authors.
CME Staff Planners Disclosures
The members of the planning committee have no real or apparent conflict of interest to disclose.
Learning Objectives
Upon completion of this activity, the participant should understand how genetic pathway-based medical knowledge extracted from high-throughput genomic analysis can help with diagnosis and prognosis in clinical practice.
Acknowledgment of Financial or Other Support
This activity does not receive commercial support.
Background
Tumor subtypes define clinically relevant and molecularly recognizable classifications of cancer
Cancers manifest in different subtypes defined by a set of characteristic attributes such as mutations, cell lineage markers, and histology. Classifying tumors into clinically relevant subtypes is a major step in identifying therapeutic strategies. The distinctions between subtypes may reflect differences in the originating cells transformed by oncogenesis. For example, luminal breast cancers are often more differentiated than basal breast tumors and have a higher proportion of estrogen receptor expression. Subtype distinctions may also reflect different etiologies at work in similar cells due to the nature of the genomic damage. For example, colorectal tumors can exhibit a global DNA methylation phenotype thought to silence DNA repair genes such as MLH1, which then leads to an associated higher background mutation rate compared with other colorectal cancer subtypes. Tumors respond variably to small-molecule inhibition, and the differences in drug sensitivity between subtypes persist even when the tumors are transformed into cell line models (1). New high-throughput technologies will aid in the characterization and recognition of established and novel subtypes to better tailor therapeutics.
Genome-wide expression levels, organized as mathematical vectors of statistically differential gene levels, can be used as signatures of tumor subtypes. Signatures allow the detection of correlations between tumor characteristics such as the possibility that two different mutations may affect the same cellular wiring or that a particular mutation is associated with a clinical outcome. Signatures based only on gene expression may overlook signals from other cis- and trans-regulatory logic. Thus, we seek to find a comprehensive cellular pathway activity description we call the activitome. Just as the genome is the comprehensive description of the genetic information of a cell, the activitome is a comprehensive description of the functional and dysfunctional activity of a cell based on expression, methylation, copy number, and other high-throughput assay technologies. Here, we give a set of examples of data-driven approaches for predicting patient therapy using signatures based on the activitome inferred from global pathway analysis.
Inferring the activitome using global pathway analysis
Increases in computational power and the availability of comprehensive genetic networks make possible a systematic pathway analysis of tumor cells. Rather than focusing on one or a few known pathways, developments in probabilistic graphical models allow all known pathways of a cell to be computationally represented and used for multiplatform data analysis. We developed an integrated pathway approach called pathway recognition algorithm using data integration on genomic models (PARADIGM; ref. 2). In this framework, each type of omics measurement is mapped to a graphical model based on the central dogma of molecular biology (DNA is transcribed to RNA, which is translated into protein, and a protein may exist in passive and active forms). We enrich the model with the knowledge that proteins and RNA may regulate DNA. PARADIGM uses a merged set of constituent pathways from various databases called the SuperPathway. PARADIGM then infers the maximum-likelihood integrated pathway levels of pathway elements, including genes, proteins, and protein complexes. The algorithm currently incorporates four types of high-throughput gene-level data: mRNA expression levels (including microarray and RNA-Seq), genomic copy number measures, epigenetic methylation data, and protein-level data (such as from the new reverse-phase protein arrays [RPPA] or from mass spectrometry approaches).
Figure 1 illustrates how gene activities can be inferred for a “small toy” pathway, that is, a pared-down model, simpler than reality. The PARADIGM graphical model centered on a particular gene is shown in detail in Figure 1A. Multiple different data measurements of a tumor sample are connected into a graphical model as observed variables (teal ellipses). Unobserved states of gene expression and activity are connected into the graph as hidden variables (orange ellipses). A computationally intensive method called Bayesian belief propagation is then used on the underlying factor graph to set the internal probabilities of the graphical model to a configuration that has a high likelihood according to the observed data (3).
The toy pathway in Figure 1B shows a small pathway involving a single kinase, PAK2, that posttranslationally inhibits the MYC–MAX complex. The transcription factor complex MYC–MAX in turn activates CCNB1 and ENO1 and represses WNT5A. Figure 1C illustrates how belief propagation could set the internal “active state” of PAK2 based on the downstream evidence. In this example, it finds that the PAK2 kinase is inactivated, using the downstream evidence that MYC–MAX, a complex that is inhibited by PAK2, is active because one of its known activated targets (CCNB1) is highly expressed, whereas one of its repressed targets (WNT5A) has lower expression. Note that even though ENO1 is an activated target of the complex and is not highly expressed, the model can explain away this apparent discrepancy using the information that the promoter of ENO1 seems to be epigenetically silenced; therefore, the lower expression of ENO1 does not reflect lower activity of the MYC–MAX transcription factor complex. The set of all inferred quantities of gene-encoded features in a complex wiring diagram (right-hand side of Fig. 1C) together forms a quantitative state description of tumor cells that we refer to here as the activitome. The integrated measure of the activitome represents not only expression states of genes but also multimeric protein complexes, gene family roles, and higher-level cellular processes, which encapsulate both the molecular function and the information transmission aspect of genes and proteins.
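The reasoning in the toy pathway can be made concrete with a minimal sketch. It is not the PARADIGM implementation: it assumes binary (rather than multistate) variables, made-up conditional probabilities (`EPS` and the 0.05 silencing value are illustrative assumptions), and exact enumeration over the two hidden variables in place of belief propagation. Because this toy graph is a tree, enumeration yields the same marginals that sum-product belief propagation would compute.

```python
from itertools import product

EPS = 0.1  # hypothetical noise level; not a PARADIGM parameter


def p_mycmax(mycmax, pak2):
    """P(MYC-MAX state | PAK2 state); PAK2 inhibits the complex."""
    p_active = EPS if pak2 else 1 - EPS
    return p_active if mycmax else 1 - p_active


def p_target(high, regulator_active, sign):
    """P(target expression | regulator); sign=+1 activated, -1 repressed."""
    drives_high = regulator_active if sign > 0 else not regulator_active
    p_high = 1 - EPS if drives_high else EPS
    return p_high if high else 1 - p_high


def p_eno1(high, mycmax, methylated):
    """A methylated promoter silences ENO1 regardless of MYC-MAX."""
    if methylated:
        p_high = 0.05  # assumed residual expression when silenced
        return p_high if high else 1 - p_high
    return p_target(high, mycmax, +1)


def posterior_pak2_active(ccnb1_high, wnt5a_high, eno1_high, eno1_methylated):
    """Posterior P(PAK2 active | evidence), summing out MYC-MAX."""
    weights = {0: 0.0, 1: 0.0}
    for pak2, mycmax in product((0, 1), repeat=2):
        w = 0.5  # flat prior on PAK2
        w *= p_mycmax(mycmax, pak2)
        w *= p_target(ccnb1_high, mycmax, +1)
        w *= p_target(wnt5a_high, mycmax, -1)
        w *= p_eno1(eno1_high, mycmax, eno1_methylated)
        weights[pak2] += w
    return weights[1] / (weights[0] + weights[1])


# Evidence from the toy example: CCNB1 high, WNT5A low, ENO1 low.
with_methylation = posterior_pak2_active(1, 0, 0, eno1_methylated=1)
without_methylation = posterior_pak2_active(1, 0, 0, eno1_methylated=0)
```

Under these assumed numbers, the posterior probability that PAK2 is active is below 0.5 in both cases, and it is lower when the ENO1 promoter is methylated, because the low ENO1 expression no longer counts as evidence against MYC–MAX activity.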
Activitome signatures reveal the signaling layer that interlinks genomic perturbations and transcriptional changes that characterize tumor subtypes
If activitome signatures provide an accurate view of tumor cell circuitry, then one can ask whether they explain why the observed genomic lesions are associated with concomitant transcriptional changes. For example, basal breast cancers are often associated with TP53 mutations and are also characterized by major transcriptional “hubs” involving transcription factors such as FOXM1, MYC/MAX, and HIF1A. What is the regulatory logic that leads to alterations in these major programs as a result of loss of TP53? If we can determine the network that resolves this question, then such a network may provide an accurate representation of the cellular wiring of the tumor that can be used as a surrogate to identify potential targets.
One approach would be to use a heat diffusion method such as HotNet (4) to identify pathways that connect the observed genomic aberrations to activated or deactivated transcription factors. As an extension of the heat-diffusion approach, a linking set of genes also can be identified by applying multiple input gene sets to identify nodes that interconnect a set of “sources” (genomic perturbations) to a distinct set of “targets” (transcription factors). One can then find essential paths that resolve the effect of genomic alterations with phenotypic changes in the tumor state. Subnetworks can then be identified that interconnect protein-level activitome data to gene expression–level data using protein–protein interactions, predicted transcription factor to target connections, and curated interactions from literature. Permutation-based analysis can then be used to gauge the significance of the solutions.
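The linking idea can be sketched on a toy network. The code below is an illustration under stated assumptions, not HotNet or the published method: it substitutes a random walk with restart for the heat kernel, treats the network as undirected, and scores a candidate linker by the smaller of the heat it receives from the sources and from the targets. The function names, the `restart` value, and the `toy_graph` are all hypothetical.

```python
def diffuse(adjacency, seeds, restart=0.3, iterations=100):
    """Random walk with restart from `seeds` on an undirected graph.

    `adjacency` maps each node to a list of neighbors. Total heat is
    conserved and sums to 1.
    """
    nodes = list(adjacency)
    initial = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    heat = dict(initial)
    for _ in range(iterations):
        updated = {n: restart * initial[n] for n in nodes}
        for n in nodes:
            neighbors = adjacency[n]
            if not neighbors:
                # An isolated node keeps whatever heat it holds.
                updated[n] += (1 - restart) * heat[n]
                continue
            share = (1 - restart) * heat[n] / len(neighbors)
            for m in neighbors:
                updated[m] += share
        heat = updated
    return heat


def linkers(adjacency, sources, targets, top_k=2):
    """Rank non-seed nodes by the heat they receive from BOTH seed sets."""
    from_sources = diffuse(adjacency, sources)
    from_targets = diffuse(adjacency, targets)
    score = {
        n: min(from_sources[n], from_targets[n])
        for n in adjacency
        if n not in sources and n not in targets
    }
    return sorted(score, key=score.get, reverse=True)[:top_k]


# Toy network: S - A - B - T is the connecting path; X and Y are dead ends.
toy_graph = {
    "S": ["A", "X"],
    "A": ["S", "B"],
    "B": ["A", "T"],
    "T": ["B", "Y"],
    "X": ["S"],
    "Y": ["T"],
}
toy_linkers = linkers(toy_graph, sources={"S"}, targets={"T"})
```

On this toy network the two nodes on the path between the source and the target outrank the dead-end nodes, which receive heat from only one side.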
As an example of this linking diffusion approach, the method was applied to The Cancer Genome Atlas (TCGA) breast cancer dataset, which included patient tumor/matched-normal samples for 533 patients, each with genomic sequencing data and microarray expression. To find the significant pathway differences between luminal and basal cancer subtypes, we conducted a differential analysis between 99 basal and 235 luminal A samples. As a noise-filtering step, we used mutually exclusive modules (MEMo; ref. 5) calls to get 117 genes with significant numbers of amplification, deletion, methylation, and mutation events. We then used a χ2 proportions test to find the genomic perturbations that occur with significantly different frequency between tumor subtypes. Significance analysis of microarrays (SAM; ref. 6) was used to compare the differential expression between basal and luminal A tumor subtypes. A test network that included curated transcriptional, protein-level and complex interactions for nearly 5,000 genes, and abstract concepts with roughly 100,000 interactions was used for the analysis. The interlinking diffusion approach was run using the 12 differentially occurring genomic perturbations as the first “source” set, and the 370 differentially transcribed transcription factors as the second or “target” set. One thousand random permutations of the upstream sources were conducted while keeping the downstream set fixed to assess the significance of the networks. A subnetwork connecting 11 of the 12 genomic perturbations to 336 of the 370 differentially expressed transcription factors, with 57 intermediate “linker” nodes and 5,238 network edges was found to be the most significant (Fig. 2).
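Two of the statistical steps in this workflow can be sketched with the Python standard library. The first function is a Pearson χ2 test on a 2 × 2 table of altered versus unaltered samples in two subtypes, without continuity correction; whether the original analysis applied a correction is not stated, so that choice is an assumption. The second is a common form of the empirical permutation P value, assuming that larger scores indicate a more significant network. The counts in the usage lines are invented for illustration and are not TCGA results.

```python
import math


def chi2_two_proportions(hits_a, n_a, hits_b, n_b):
    """Pearson chi-square test (1 df) comparing two proportions."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one sample")
    table = [[hits_a, n_a - hits_a], [hits_b, n_b - hits_b]]
    rows = [n_a, n_b]
    cols = [hits_a + hits_b, (n_a - hits_a) + (n_b - hits_b)]
    total = n_a + n_b
    if 0 in cols:
        # Every sample (or none) is altered: no difference to test.
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def empirical_p_value(observed, null_scores):
    """Fraction of permuted scores at least as large as the observed one.

    The +1 terms keep the estimate away from exactly zero.
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))


# Invented counts: a lesion seen in 80 of 99 samples of one subtype
# and in 30 of 235 samples of another.
chi2_stat, p_value = chi2_two_proportions(80, 99, 30, 235)
```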
Clinical–Translational Advances
Without translational applicability, inferring gene activities with pathway knowledge would be no more than an academic exercise. We describe how the inferences encoded in the activitome can be used to predict patient outcomes.
Activitome-derived features predict patient outcomes more reliably than gene expression
Specific pathway components in an activitome signature can reveal important aspects of tumor biology. For example, PARADIGM uncovered the FOXM1 transcription factor network in ovarian serous cancers, published in Nature with the TCGA marker paper (7). This indicated previously unknown cross-talk between proliferation and DNA damage repair regulated by distinct isoforms of the FOXM1 transcription factor. This cross-talk may explain in part why these tumor cells proliferate in response to DNA repair signals.
Evidence suggests activitomes may improve our ability to predict patient outcomes. When PARADIGM was applied to the TCGA glioblastoma multiforme dataset, higher accuracy predictors of overall survival in the patients could be obtained compared with those using gene expression signatures (2). This strongly suggests that the pathway-level information provides biologically relevant clues about the intrinsic state of tumor cells. Thus, using pathway-inferred levels to build activitome signatures shows promise for predicting biomedical outcomes.
Activitome signatures provide clues about cellular targets
Gene expression signatures have been used to identify putative drug targets. Some examples include the Connectivity-Map project pursued by the Golub laboratory at the Broad Institute (Cambridge, MA), the Ailun project by the Butte laboratory at Stanford University School of Medicine (Stanford, CA; ref. 8), and the Disease Diagnostic Gene Expression database by the Zhou laboratory at University of Southern California (Los Angeles, CA; ref. 9). In the same way, activitome-based signatures can be used to build predictive signatures. Activitomes provide potentially much richer information because the pathway interactions can reveal cryptic signals such as active transcription factors and signaling molecules that could go unnoticed by looking at the gene expression alone.
The merit of using pathway-based signatures for prediction was tested in a proof-of-concept demonstration in cell lines (1). In this case, gene expression and copy number data were from 36 breast cancer cell lines, 17 of which were of the basal (more aggressive) subtype and 19 were of the luminal (less aggressive) subtype. PARADIGM was run on all of the in vitro data and produced inferred pathway activity levels. A two-class (dichotomized) SAM test (6) was used to produce an association score for every feature in the SuperPathway. In this application, positive association scores reflect higher activity in basal tumors, whereas negative associations reflect higher activity in luminal tumors. An activitome signature contrasting basal from luminal tumors was then constructed as the vector of all association scores across the entire SuperPathway.
The activitome signature and SuperPathway together were used to identify significantly large subnetworks that connect high-scoring pathway components to “druggable” biomarkers. Subnetworks were created by retaining any interaction that connected two features, both of which had absolute association scores higher than the average absolute association. Among the largest of the hubs in the resulting network were a central DNA damage hub with the second highest connectivity (55 regulatory interactions; 1% of the network) and TP53 with the 14th highest connectivity (26 connections; 0.5% of the network). The subnetwork identified several pathways of interest, including the FOXM1-related network. Several genes upstream of FOXM1 are known targets of available drugs, including PLK3, suggesting that Polo kinase inhibitors may disrupt basal tumors. Indeed, Polo kinase inhibitors on the cell lines were found to sensitize basal cells to a higher degree compared with luminal cells, consistent with the prediction encoded in the network.
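The subnetwork filter described above is simple enough to sketch directly. The function name, the feature names, and the scores below are hypothetical placeholders; only the rule itself (keep an interaction when both endpoints exceed the mean absolute association score) comes from the text.

```python
def high_scoring_subnetwork(scores, interactions):
    """Keep interactions whose two endpoints both score above average.

    `scores` maps feature name -> association score (sign indicates the
    subtype with higher activity); `interactions` is a list of
    (feature, feature) pairs. Returns the retained interactions and the
    retained features ordered from most to least connected.
    """
    cutoff = sum(abs(s) for s in scores.values()) / len(scores)
    kept = [
        (u, v)
        for u, v in interactions
        if abs(scores[u]) > cutoff and abs(scores[v]) > cutoff
    ]
    degree = {}
    for u, v in kept:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    hubs = sorted(degree, key=lambda n: (-degree[n], n))
    return kept, hubs


# Placeholder features with invented scores (mean absolute score = 1.6).
toy_scores = {"f1": 3.0, "f2": 2.0, "f3": -2.5, "f4": -0.2, "f5": 0.3}
toy_interactions = [("f2", "f1"), ("f3", "f1"), ("f4", "f5"), ("f4", "f1")]
toy_edges, toy_hubs = high_scoring_subnetwork(toy_scores, toy_interactions)
```

In this toy input, only the two interactions among `f1`, `f2`, and `f3` survive, and `f1` emerges as the most connected hub.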
Comparing activitome signatures reveals novel connections between mutations and drug response in luminal breast cancers
Activitome signatures can be used to connect mutations, clinical outcomes, and other “events” present in tumor samples. It is often of interest to know whether a particular mutation is associated with elevated risk or the possibility of developing resistance to a particular treatment option. Molecular signatures derived from such events can be used as proxies to predict such tendencies. As an example, pathway-based activitome signatures were used to analyze a set of patient tumors of the luminal breast cancer subtype (both luminal A and luminal B subtypes; ref. 10). In this study, clinical and genomic data on samples were assessed from a neoadjuvant aromatase inhibitor clinical trial designed to assess the responsiveness of samples to these estrogen-lowering agents (11). PARADIGM was used to build a predictive model for aromatase inhibitor therapy and to develop links between gene mutations and clinical outcomes. PARADIGM analysis revealed that multiple pathways are affected by a phalanx of mutations, including caspase/apoptosis, ErbB signaling, Akt/phosphoinositide 3-kinase/mTOR signaling, TP53/RB signaling, and mitogen-activated protein kinase/c-jun-NH2-kinase pathways. Several “hubs” such as ESR1 and FOXA1 were activated cohort wide, whereas other hubs exhibited high but differential changes in aromatase inhibitor-resistant tumors, including MYC, FOXM1, and MYB.
A method called differential pathway signature correlation (DiPSC) was developed for this study to compare signatures while accounting for the confounding that stems from sample overlap. Mutations in different genes may cause disruptions in the same pathway, which may lead to similar disruptions in the activitome. By comparing the vectors of activitome signatures of different mutations with clinical outcomes, intrinsic connections between these events may be uncovered. DiPSC randomly splits the patient cohort in half. In each half, two different activitome signatures are calculated from two distinct contrasts. A contrast corresponds to the dichotomy defined by the presence versus the absence of a particular “event,” such as a mutation or a clinical outcome. The event is used as a dichotomous variable in a two-sample SAM analysis to derive an activitome signature. The activitome signatures computed from each disjoint half are compared with one another. This guarantees that the comparison of the signatures is not polluted by any overlapping samples. The procedure of randomly splitting the cohort, rederiving the activitome signatures with SAM, and comparing the signatures is repeated 1,000 times. The final correlation is then computed as a mean and SD across the random trials.
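The split-and-compare loop can be sketched as follows. This is a simplified illustration rather than the published DiPSC code: a Welch-style statistic with a small fudge constant `s0` stands in for SAM, Pearson correlation is assumed as the comparison measure, splits in which either event group has fewer than two samples are skipped, and the synthetic cohort generator `toy_cohort` exists only for demonstration.

```python
import math
import random
import statistics


def signature(samples, activity, has_event, s0=0.1):
    """Per-feature contrast (event vs. no event); a stand-in for SAM."""
    pos = [s for s in samples if has_event[s]]
    neg = [s for s in samples if not has_event[s]]
    if len(pos) < 2 or len(neg) < 2:
        return None  # too few samples to estimate a variance
    sig = []
    for j in range(len(activity[samples[0]])):
        a = [activity[s][j] for s in pos]
        b = [activity[s][j] for s in neg]
        se = math.sqrt(statistics.variance(a) / len(a)
                       + statistics.variance(b) / len(b))
        sig.append((statistics.fmean(a) - statistics.fmean(b)) / (se + s0))
    return sig


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def dipsc(activity, event_a, event_b, trials=200, seed=0):
    """Mean and SD of signature correlations over disjoint random halves."""
    rng = random.Random(seed)
    samples = sorted(activity)
    correlations = []
    for _ in range(trials):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        # Event A is scored in one half and event B in the other,
        # so no sample contributes to both signatures.
        sig_a = signature(shuffled[:half], activity, event_a)
        sig_b = signature(shuffled[half:], activity, event_b)
        if sig_a is not None and sig_b is not None:
            correlations.append(pearson(sig_a, sig_b))
    if len(correlations) < 2:
        raise ValueError("too few usable splits")
    return statistics.fmean(correlations), statistics.stdev(correlations)


def toy_cohort(n_samples=40, seed=1, shift=3.0):
    """Synthetic data: the event raises three features and lowers three."""
    rng = random.Random(seed)
    activity, event = {}, {}
    for i in range(n_samples):
        name = f"sample{i:02d}"
        event[name] = (i % 2 == 0)
        activity[name] = [
            rng.gauss(0, 1) + (shift * d if event[name] else 0.0)
            for d in (1, 1, 1, -1, -1, -1)
        ]
    return activity, event
```

Comparing an event with itself on the synthetic cohort gives a mean correlation near 1, whereas comparing it with its complement gives a mean near −1, which is the behavior one would expect from two events that perturb the same pathway features in the same or opposite directions.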
DiPSC was applied to the 77 luminal samples using the PARADIGM-derived activitome signatures to uncover phenomena that underlie the resistance of some cancers to aromatase inhibitors. All pairs of associations were scored across all of the cohorts. An example of the association of mutations to subtype is illustrated in the DiPSC (dipstick) shown in Figure 3, which plots the correlation of all activitome signatures against the luminal B versus luminal A activitome signature. From this visualization, one can immediately see which patient groups lead to common signatures. The analysis revealed, for example, that mutated MALAT1 (a long noncoding RNA) had activitome signatures similar to TP53 mutations and is also associated with both high Ki-67 and high preoperative endocrine prognostic index scores, which are indicators of resistance to drug treatment. Ki-67 is a prognostic indicator of proliferation in breast cancers. Because MALAT1 is mutated in only a handful of samples, the low sample size precluded several analyses used to detect such relations, whereas DiPSC was able to leverage the robustness of the pathway signatures to find a significantly indicative pattern. On the other hand, PIK3CA, MLL3, and CDH1 do not enrich for either luminal subtype. ATR and MAP2K4 are slightly enriched for luminal A, and MAP3K1 mutations are overwhelmingly enriched for luminal A. Thus, relating cancer outcomes to those based on pathway-inferred signatures teased out novel connections not available to standard approaches.
Conclusion: Toward Patient-Specific Models
Clearly, the development of activitome informatics tools for clinical care is at an early stage. The data-driven discovery approach requires pooling data from many patients and data sources to build functional inferences. The challenge ahead of us is to develop single-sample predictors that can guide therapeutic decisions in individual cases. One can imagine building a database of activitome signatures to represent all known cancer subtypes. Each will span the range of possible genomic, epigenomic, and transcriptomic activities that characterize all samples of a particular subtype. Identifying a patient-specific model would then require two conceptual steps: (i) from the database of subtype signatures, identify the most representative for a particular patient sample and (ii) refine the model to best fit the particular set of genomic, epigenomic, and proteomic changes observed in the data of the patient. The approach not only leverages the statistical power of multiple samples to define the starting subtype models but also encompasses the flexibility to adapt to a particular form of the disease. Just as in gene expression–based models, activitome-based models will require the careful acquisition of samples from well-conducted clinical trials that are sufficiently powered for the full suite of genomic and proteomic analysis pipelines executed at the clinical-grade testing level. Although we are years away from its clinical use, the pathway-based approaches we describe provide the basis for a discussion on progress toward this goal and underscore the value of the deeply collaborative environment provided by our rapidly growing bioinformatics and computational biology discipline and the many teams of clinicians and genome and proteome centers that provide us with data to analyze.
Authors' Contributions
Conception and design: T.C. Goldstein, M.J. Ellis, J.M. Stuart
Development of methodology: T.C. Goldstein, E.O. Paull, M.J. Ellis, J.M. Stuart
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): E.O. Paull, M.J. Ellis
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): T.C. Goldstein, E.O. Paull, M.J. Ellis, J.M. Stuart
Writing, review, and/or revision of the manuscript: T.C. Goldstein, E.O. Paull, M.J. Ellis, J.M. Stuart
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): T.C. Goldstein, M.J. Ellis
Grant Support
M.J. Ellis is supported by NIH grants U24 CA160035, R01 CA095614, P30 CA91842, a Susan G Komen for the Cure Promise Grant, a BCRP-Idea Award-BC112014, and the Breast Cancer Research Fund. J.M. Stuart is supported by an NSF CAREER Award and coleads a genome data analysis center supported by the TCGA project.