Existing cancer driver prediction methods are based on very different assumptions and each of them can detect only a particular subset of driver genes. Here we perform a comprehensive assessment of 18 driver prediction methods on more than 3,400 tumor samples from 15 cancer types, all to determine their suitability in guiding precision medicine efforts. We categorized these methods into five groups: functional impact on proteins in general (FI) or specific to cancer (FIC), cohort-based analysis for recurrent mutations (CBA), mutations with expression correlation (MEC), and methods that use gene interaction network-based analysis (INA). The performance of driver prediction methods varied considerably, with concordance with a gold standard varying from 9% to 68%. FI methods showed relatively poor performance (concordance <22%), while CBA methods provided conservative results but required large sample sizes for high sensitivity. INA methods, through the integration of genomic and transcriptomic data, and FIC methods, by training cancer-specific models, provided the best trade-off between sensitivity and specificity. As the methods were found to predict different subsets of driver genes, we propose a novel consensus-based approach, ConsensusDriver, which significantly improves the quality of predictions (20% increase in sensitivity) in patient subgroups or even individual patients. Consensus-based methods like ConsensusDriver promise to harness the strengths of different driver prediction paradigms.
Significance: These findings assess state-of-the-art cancer driver prediction methods and develop a new and improved consensus-based approach for use in precision oncology. Cancer Res; 78(1); 290–301. ©2017 AACR.
Cancers result from the accumulation of various types of DNA mutations including point mutations, indels, large-scale copy number aberrations (CNA), and structural variations (1). During tumor development, in addition to mutations that confer functional advantages to tumor cells (i.e., driver mutations; ref. 2), a large number of passenger mutations with no or little functional impact may arise, confounding our ability to identify the key events in oncogenesis for understanding and treating cancers (3).
Recent large-scale cancer genome sequencing efforts such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have harnessed technologic advances in DNA/RNA sequencing to provide comprehensive mutation catalogs and associated omics profiles in tumors. These compendiums provide a rich resource for the development of integrative cancer driver prediction methods (genes and mutations; refs. 4–6). In addition, they further highlight the challenges that still remain in driver prediction. In particular, due to the heterogeneity of cancer types, often only a few frequently mutated (and likely driver) genes were identified in these studies, with many more genes being rarely mutated and thus indistinguishable from noise due to passenger mutations (7, 8). Despite this, the ability to identify cancer drivers (genes and mutations) may be key for improved targeted therapy (9, 10). For example, breast cancer patients with ERBB2 driver mutations can respond successfully to the ERBB2 inhibitor trastuzumab (11), but similar therapy may also benefit patients with other cancers where ERBB2 mutations are rare (12). After the initial wave of large-scale cancer studies, different cohorts of patients continue to be sequenced with more distinct phenotypes (e.g., previously unprofiled disease sites or disease states such as tumors characterized by primary or acquired drug resistance). In the growing paradigm of precision oncology, individual patients are also sequenced either broadly (whole exome) or with targeted sequencing panels of genes selected by the above studies and existing knowledge about the patient's disease and available treatments, to gain insights into biology and to match the right patient to the right drug at the right time. There is thus a deep biological and clinical need to identify the mutation that drives the tumor of a single patient.
Because of the biological and practical importance of driver identification, a range of different approaches has been proposed for inferring the impact of mutations on genes and their likely role in cancer. These methods differ widely in the information they require as input (e.g., point mutations, indels, CNAs, expression data, etc.), in the models/assumptions that they use, and in what they can predict (driver gene or mutation; refs. 13, 14). For example, many methods are based on using information about protein structure and evolution to detect point mutations that may have a functional impact in general (FI; 15–18), or specifically in the context of cancer (FIC; refs. 19–21). These methods predict functional/driver mutations in each sample independently and their relative strengths have been studied in previous work (22, 23). With the availability of large and heterogeneous cancer genomic datasets, newer methods have focused on cohort-based analysis to search for biases in mutation frequency indicative of positive selection in driver genes (CBA; refs. 24–29; compared in ref. 30), or mined for mutation–expression correlations to highlight driver CNAs (MEC; refs. 31–33; jointly evaluated in ref. 34). Finally, a few methods have sought to incorporate information about gene interaction networks in their analysis with the aim of providing more sensitive predictions (35, 36), or to enable driver prediction based on the integrative analysis of genomic and transcriptomic data (interaction network-based analysis, INA; refs. 4–6).
Despite the diversity of driver prediction methods, a comprehensive evaluation of the strengths and weaknesses of different classes of methods on a diverse range of cancer types has not been conducted. We sought to address this by evaluating the performance of a panel of 18 different computational methods, covering a wide variety of models and input data types, on >3,400 tumor datasets from 15 TCGA cancer types. Methods were evaluated systematically for their concordance with gold standard lists of driver and passenger genes as well as mutations, for their robustness to noise in the input, for their utility for working with data from small patient cohorts, and for their ability to provide accurate and actionable patient-specific predictions for precision medicine applications. The overall predictive power for driver genes was found to be moderate, highlighting the need for novel approaches and improved methods. In addition, predictions from different classes of methods were found to be orthogonal to each other, motivating the development of a consensus-based approach (ConsensusDriver) to increase sensitivity and specificity of driver predictions across cancer types. Consensus-based approaches such as ConsensusDriver provide a systematic way to combine the strengths of different driver prediction algorithms in building an analytic toolbox for precision oncology.
Materials and Methods
Data source and preprocessing
CNA and exome point mutation data for all cancer types was obtained from GDAC via Firehose (https://gdac.broadinstitute.org). All point mutations excluding synonymous mutations (i.e., indels, missense, nonsense, and splice site variants) and CNAs with a value of 2 (focal amplification) or −2 (focal deletion) were used for downstream analysis. Expression data for tumor and normal samples for all cancer types was downloaded from the TCGA website (level 3; https://tcga-data.nci.nih.gov). For a detailed description of expression data analysis, see Supplementary Methods. Protein expression data was downloaded from the TCPA portal (level 4; http://www.tcpaportal.org/tcpa).
Assessment of driver prediction methods
In total, we evaluated 18 methods that could be used for driver prediction (Fig. 1A), classifying these methods into (i) methods that belong to the FI category (primarily designed to identify function altering mutations but have been used for predicting driver mutations; refs. 22, 23) such as SIFT (15), PolyPhen2 (PP2; ref. 16), MutationTaster (MT; ref. 17), and MutationAssessor (MA; ref. 18), (ii) methods that tailor this idea to cancer by learning specific models (Functional Impact in Cancer, FIC) such as CHASM (19), transFIC (TF; ref. 20), and fathmm (FH; ref. 21), (iii) methods that use cohort based analysis to detect genes with signals of positive selection (CBA) such as ActiveDriver (AD; ref. 29), MutSigCV (MCV; ref. 24), MuSiC (MUS; ref. 25), OncodriveCLUST (OCL; ref. 26), and OncodriveFM (OF; ref. 27; all point mutation based), (iv) methods that integrate mutation data with transcriptomic data by looking for mutation–expression correlations (MEC) such as Conexic (CON; ref. 31), OncodriveCIS (OCI; ref. 32), and S2N (33), and finally (v) methods that use information from gene/protein interaction networks to analyze the effect of mutations, such as NetBox (NB; ref. 35), HotNet2 (HN2; ref. 36), DriverNet (DN; ref. 4), DawnRank (DR; ref. 5), and OncoIMPACT (OI; ref. 6).
A few methods were excluded from this benchmark for the following reasons: (i) they could not be run without further data processing, complex prefiltering steps or inclusion of additional data [Genome MuSiC (25), Conexic (31)] or (ii) provided incompatible predictions [Gistic2 (28) with region-level predictions].
For each method, we used default parameters or the set of recommended parameters provided in the method's manual or corresponding publication. In cases where methods required a threshold for candidate driver selection (e.g., on the P value or score for candidates), we used the value indicated in the method's publication or manual (see Supplementary Methods for a detailed description of the parameters and threshold used).
For analysis of patient-specific predictions, for most methods, mutated genes in each patient (with mutation types matching the expected input for the method) were ordered according to their rank on the full dataset. For FI and FIC methods, and for OncoIMPACT, mutation/patient specific scores were used to order genes (best score in the case of multiple mutations; ties broken by average gene score).
Comparison with cancer gene gold standard.
We assessed the performance of all methods against a gold standard list of cancer driver genes [union of Cancer Gene Census (37), a manually curated list of CNA driver genes (38), oncogenes from UniProt (http://www.uniprot.org/), gene list from the Vogelstein 20/20 rule (7), and a gene list from literature mining (39)] based on three different measures (on the top N predictions), precision (P), recall (R), and the F1 score (that combines both precision and recall):
$$P = \frac{T}{N}, \qquad R = \frac{T}{G}, \qquad F_1 = \frac{2PR}{P + R},$$
where T is the number of top N predictions in the gold standard and G is the total number of predictions.
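The three measures can be illustrated with a short Python sketch. This is a minimal illustration and not the authors' evaluation code: the function name, the gene lists, and the choice to pass the recall denominator G as an explicit argument (so that it can be counted exactly as defined in the text) are all assumptions made for this example.

```python
def top_n_scores(ranked_genes, gold_standard, n, g):
    """Precision, recall and F1 over the top-n predictions.

    ranked_genes  : list of gene symbols, best prediction first
    gold_standard : set of gold standard driver genes
    n             : number of top predictions to evaluate (N in the text)
    g             : recall denominator (G in the text), supplied by the caller
    """
    top = ranked_genes[:n]
    # T: number of top-n predictions that belong to the gold standard
    t = sum(1 for gene in top if gene in gold_standard)
    precision = t / n
    recall = t / g
    if precision + recall == 0:
        f1 = 0.0  # avoid dividing by zero when nothing is recovered
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with hypothetical inputs
ranked = ["TP53", "KRAS", "GENE_X", "PIK3CA"]
gold = {"TP53", "KRAS", "PIK3CA", "BRAF"}
p, r, f1 = top_n_scores(ranked, gold, n=3, g=4)
```

In the toy example two of the top three predictions are in the gold standard, so precision is 2/3; the recall and F1 values depend on the value chosen for G.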
Robustness to subsampling.
Subsampling analysis was performed for each of the 7 cancer types with more than 200 tumor datasets. Two different measures were used to evaluate the robustness of results from a method on a subsample (S) when compared with its results on the full dataset (F): stability as a measure of precision when comparing the top N predictions of S ($S_N$) to truth as defined by F,
$$\text{stability} = \frac{|S_N \cap F|}{|S_N|},$$
and recovery as a measure of sensitivity when comparing predictions in S to the top N predictions in F ($F_N$),
$$\text{recovery} = \frac{|S \cap F_N|}{|F_N|}.$$
To make the comparison between S and F reasonable, we excluded from F genes that were not mutated in the subsampled dataset. To avoid penalizing sensitive or conservative methods, we chose N to be 20 as a majority of the methods provided >20 predictions.
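A possible implementation of the two subsampling measures is sketched below in Python. It assumes that stability is the fraction of the top N subsample predictions that also appear among the full-dataset predictions, and that recovery is the fraction of the top N full-dataset predictions that appear anywhere in the subsample predictions; the normalization and all names are assumptions of this sketch.

```python
def stability_and_recovery(sub_ranked, full_ranked, mutated_in_sub, n=20):
    """Compare predictions on a subsample (S) with those on the full data (F).

    sub_ranked     : ranked predictions on the subsample, best first
    full_ranked    : ranked predictions on the full dataset, best first
    mutated_in_sub : set of genes mutated in the subsampled dataset
    n              : number of top predictions considered (20 in the text)
    """
    # Exclude from F the genes that are not mutated in the subsample
    full = [g for g in full_ranked if g in mutated_in_sub]
    s_n = sub_ranked[:n]  # top-n predictions of S
    f_n = full[:n]        # top-n predictions of the filtered F
    stability = len(set(s_n) & set(full)) / len(s_n) if s_n else 0.0
    recovery = len(set(sub_ranked) & set(f_n)) / len(f_n) if f_n else 0.0
    return stability, recovery


# Toy example with hypothetical gene labels
stab, rec = stability_and_recovery(
    sub_ranked=["A", "X", "B"],
    full_ranked=["A", "B", "C", "D", "E"],
    mutated_in_sub={"A", "B", "D", "E", "X"},
    n=2,
)
```

Here gene C is dropped from F because it is not mutated in the subsample, one of the two top subsample predictions is found in F (stability 0.5), and both top full-dataset predictions are found in S (recovery 1.0).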
Generation of decoy missense mutations.
For each patient, we introduced n false positive/decoy mutations, where n = 2%, 5%, or 10% of the number of mutations in a tumor. Point mutations were randomly placed in coding regions of unmutated genes with probability proportional to the coding length, and missense mutations were selected using ANNOVAR (to avoid bias against methods that cannot analyze nonsense or splice-site mutations). For consistency, this analysis was restricted to the 12 cancer types annotated using the hg19 genome (i.e., COAD, OV, and READ samples, annotated using hg18, were excluded).
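The length-weighted placement of decoys can be sketched as follows. This Python example only covers the choice of which unmutated genes receive a decoy; it does not place a specific nucleotide change or perform the missense filtering step. The function name, the rounding of n, and the sampling with replacement are assumptions of this sketch.

```python
import random


def sample_decoy_genes(mutated_genes, coding_lengths, n_mutations, fraction, seed=0):
    """Pick genes to receive decoy mutations for one patient.

    mutated_genes  : set of genes already mutated in this tumor
    coding_lengths : dict mapping gene symbol -> coding length in bp
    n_mutations    : number of mutations observed in the tumor
    fraction       : 0.02, 0.05 or 0.10 in the text
    seed           : fixed seed so that the sketch is reproducible
    """
    rng = random.Random(seed)
    # Only unmutated genes are eligible
    candidates = [g for g in coding_lengths if g not in mutated_genes]
    weights = [coding_lengths[g] for g in candidates]
    n_decoys = round(fraction * n_mutations)  # rounding rule is an assumption
    if not candidates or n_decoys == 0:
        return []
    # Probability of a gene is proportional to its coding length;
    # a long gene may be drawn more than once
    return rng.choices(candidates, weights=weights, k=n_decoys)


# Toy example with hypothetical genes and lengths
lengths = {"GENE_A": 1000, "GENE_B": 9000, "GENE_C": 500, "TP53": 1182}
decoys = sample_decoy_genes({"TP53"}, lengths, n_mutations=40, fraction=0.05)
```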
Construction of an actionable gene list.
We downloaded gene lists from IntOGen (https://www.intogen.org/) and OncoKB (http://oncokb.org/), and took the union of the actionable genes reported in them. We excluded drugs associated with a nonmutated gene from OncoKB, off-target genes from the IntOGen list, drugs targeting fusion genes, gene therapy targets, and genes associated with drug resistance. Each drug/gene association was classified into three levels in the following order of preference: approved drug (Level 1 and 2A from OncoKB and “FDA approved drug” from IntOGen), investigational target (Level 3A of OncoKB and “Drug in Clinical Trials” from IntOGen), and research target (all other genes).
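The three-level classification amounts to a small lookup with an order of preference, as in the Python sketch below. The label strings mirror the level names quoted in the text, but how the two databases actually encode them in their download files is not stated here, so the string format and all names are assumptions.

```python
# Hypothetical label encoding; only the level names come from the text
APPROVED = {"Level 1", "Level 2A", "FDA approved drug"}
INVESTIGATIONAL = {"Level 3A", "Drug in Clinical Trials"}


def classify_gene(labels):
    """Return the best tier for a gene given all its drug/gene association labels."""
    # Tiers are checked in order of preference, best first
    if any(label in APPROVED for label in labels):
        return "approved drug"
    if any(label in INVESTIGATIONAL for label in labels):
        return "investigational target"
    return "research target"


# A gene with both an investigational and an approved association
tier = classify_gene(["Level 3A", "FDA approved drug"])
```

Because the tiers are tested best first, a gene that has several associations is assigned the most favorable one.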
Analysis of driver mutations.
For this analysis, we only studied missense mutations, as many benchmarked methods only predict drivers for this mutation type. We obtained a list of >2,100 known missense driver mutations and 227 likely nononcogenic mutations from the OncoKB database (http://oncokb.org/), and merged these with a list of >110,000 missense mutations (population allele frequency ≥ 1%, with no known clinical association) from the dbSNP database (https://www.ncbi.nlm.nih.gov/projects/SNP/, build 138) that are unlikely to be cancer drivers.
All methods were then evaluated using the measures Recall (R) and Accuracy (A):
$$R = \frac{T_D}{D}, \qquad A = \frac{T_D + T_P}{D + P},$$
where D represents the number of known driver mutations (1,435), P represents the number of known passenger mutations (1,101 in all genes and 78 in the cancer genes gold standard) in the whole cohort, and $T_D$ and $T_P$ represent the number of correctly predicted driver and passenger mutations, respectively, in the whole cohort based on the top 5 patient-specific predictions.
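The mutation-level measures can be sketched in Python as below. The sketch assumes that a known driver mutation counts as correctly predicted when it appears among the predicted drivers, and that a known passenger mutation counts as correctly predicted when it does not; the function name and the mutation labels are hypothetical.

```python
def mutation_recall_accuracy(predicted_drivers, known_drivers, known_passengers):
    """Recall and accuracy for driver mutation prediction.

    predicted_drivers : set of mutations called drivers (e.g., from the
                        top 5 patient-specific predictions)
    known_drivers     : set of known driver mutations (size D)
    known_passengers  : set of known passenger mutations (size P)
    """
    t_d = len(predicted_drivers & known_drivers)     # correctly predicted drivers
    t_p = len(known_passengers - predicted_drivers)  # passengers not called drivers
    d, p = len(known_drivers), len(known_passengers)
    recall = t_d / d
    accuracy = (t_d + t_p) / (d + p)
    return recall, accuracy


# Toy example with hypothetical mutation labels
recall, accuracy = mutation_recall_accuracy(
    predicted_drivers={"NRAS:G12C", "BRAF:V600E", "GENE1:A10T"},
    known_drivers={"NRAS:G12C", "BRAF:V600E", "PIK3CA:H1047R"},
    known_passengers={"GENE1:A10T", "GENE2:R5K"},
)
```

In the toy example two of three drivers are recovered and one of two passengers is correctly left out, giving a recall of 2/3 and an accuracy of 3/5.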
ConsensusDriver is based on the Borda approach, where each gene was given a score equal to the sum over all methods of either its rank, if the gene was ranked, or the maximum number of predictions in that data set (M), if the gene was unranked.
Genes were then reranked according to this score. To select the best set of methods for a particular cancer type, we used the following procedure (equivalent to a leave-one-out): (i) exhaustively compute the Borda consensus score on the 262,125 possible method combinations; (ii) select the method combination that obtained the best average combined score on the other 14 cancer types. For sample/mutation-specific predictions, we integrated the patient/mutation-specific predictions of the six methods identified (fathmm, CHASM, OncodriveFM, MutSigCV, DriverNet, and OncoIMPACT; Supplementary Fig. S1). BORDAall used all the methods in constructing a Borda-based ranking.
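The rank aggregation step can be sketched in Python as follows. This is an illustrative reading of the description rather than the ConsensusDriver implementation: M is taken to be the length of the longest prediction list in the dataset, a lower summed score is treated as better, and ties are broken alphabetically, all of which are assumptions. The helper at the end shows that 262,125 equals the number of subsets of at least two methods out of 18.

```python
def borda_consensus(rankings):
    """Aggregate ranked gene lists from several methods.

    rankings : dict mapping method name -> list of genes, best first
    Returns the genes reranked by summed rank (lowest score first).
    """
    # M: maximum number of predictions made by any method on this dataset
    m = max(len(genes) for genes in rankings.values())
    all_genes = set()
    for genes in rankings.values():
        all_genes.update(genes)
    scores = {}
    for gene in all_genes:
        total = 0
        for genes in rankings.values():
            # 1-based rank if the method ranked the gene, otherwise M
            total += genes.index(gene) + 1 if gene in genes else m
        scores[gene] = total
    # Lower score is better; alphabetical tie-break is an assumption
    return sorted(all_genes, key=lambda g: (scores[g], g)), scores


def n_method_combinations(n_methods):
    """Number of subsets containing at least two of n_methods methods."""
    return 2 ** n_methods - n_methods - 1


# Toy example with hypothetical method names
consensus, scores = borda_consensus({
    "method_1": ["TP53", "KRAS", "BRAF"],
    "method_2": ["KRAS", "TP53"],
    "method_3": ["KRAS"],
})
```

In the toy example KRAS scores 2 + 1 + 1 = 4, TP53 scores 1 + 2 + 3 = 6, and BRAF scores 3 + 3 + 3 = 9, so the consensus order is KRAS, TP53, BRAF.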
Different cancer types represent diverse driver prediction challenges
For the purpose of this study, we selected 15 different cancer types from TCGA for which exome sequencing, copy number, and expression data (RNA-seq or arrays) were available (BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; COAD, colon adenocarcinoma; GBM, glioblastoma multiforme; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; OV, ovarian serous cystadenocarcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; STAD, stomach adenocarcinoma; THCA, thyroid carcinoma; see Materials and Methods). The cancer types selected vary widely in cohort sizes, mutational burden per patient, and distribution of mutation types, thus representing a diverse set of challenges for driver prediction methods (Fig. 1B). For example, we noted that while some cancer types are predominantly affected by point mutations (KIRP) or CNAs (OV), others have similar numbers of genes affected by both point mutations and CNAs (GBM). In addition, certain cancer types exhibited a bimodal distribution for mutational burden (READ, COAD, PRAD, and KIRC) and this could impact the distributional assumptions of some methods. The distribution of mutation frequencies across genes also showed high variation between cancer types (Supplementary Fig. S2). For example, while LUSC and OV have many genes with mutation frequency above 25%, THCA has only 3 genes with frequency above 5%, potentially impacting the sensitivity of methods that are dependent on mutation frequency for driver prediction. We additionally noted that most tumors exhibited both point mutations and CNAs (Supplementary Fig. S3A) and thus methods that take only a subset of mutation types as input may be at a disadvantage in terms of sensitivity (e.g., FI and FIC methods that only consider missense variants; Supplementary Fig. S3B).
In terms of driver prediction methods, we attempted to be as comprehensive as possible. In total, we studied 18 methods, covering five different classes of driver prediction methods (Fig. 1A; Methods), and evaluated their performance in predicting cancer driver genes in patient cohorts and in individual patients, as well as driver mutations in patients.
Driver gene prediction identifies many novel drivers but sensitivity is still a bottleneck
To evaluate the ability of various driver prediction methods to accurately differentiate between driver and passenger genes in a dataset, we compiled gold standard lists for both. Specifically, we took the union of 5 different curated lists of driver genes that have been reported before, including the widely used Cancer Gene Census list (37), a manually curated list of driver genes affected by copy number alterations (38), genes annotated as oncogenes by UniProt (http://www.uniprot.org/), a gene list derived from the Vogelstein 20/20 rule (7), and a gene list derived from literature mining (Supplementary Table S1; ref. 39). Passenger genes were defined by taking the union of two manually curated lists of known passengers from NCG4 (40) and Rubio-Perez and colleagues (Supplementary Table S1; ref. 41). These gold standards are limited in that they are not cancer type or sample specific [although drivers are frequently shared (42) and targeted (43) across cancer types], but represent an attempt to construct as comprehensive a list as possible such that novel cancer driver genes can be more effectively demarcated. The methods were evaluated on how well their predictions identified cancer driver genes based on three standard measures (as well as others as detailed below): precision (fraction of predictions that belong to the gold standard), recall (fraction of the gold standard contained in the predictions), and the F1 score that combines both precision and recall (see Materials and Methods for a more detailed description).
Because of the wide variation in the number of driver predictions from different methods (median of 10 for MutSigCV to >8,000 for MutationTaster; Supplementary Fig. S4), we restricted our analysis to either the top 10 or top 50 predictions from each method (see Supplementary Note S1; Supplementary Fig. S5 for further details). An overview of the top 50 predictions for each method can be seen in Fig. 2A. In general, most methods report a low number of passenger genes in their top 50 predictions except for FI methods (∼20% of predictions). This is as expected as FI methods are not designed to specifically exclude function altering mutations that may not be linked to cancer, unlike the FIC methods.
The number of known cancer-associated genes reported in the top 50 predictions of different methods varied widely, from a mean of 4 for OncodriveCIS to 27 for fathmm (a majority of these belong to the Cancer Gene Census). In general, the highest sensitivity was provided by methods in the FIC and INA categories, reporting >15 known driver genes in the top 50 list. Note that the FIC methods use a machine learning approach with training sets that substantially intersect our gold standard (Fig. 2A), and thus their sensitivity to predict new driver genes may not be accurately captured here. On the other hand, methods in the CBA category were most concordant with the list of gold standard driver genes (0.5 and 0.6 for OncodriveFM and MutSigCV, respectively; Fig. 2B; Supplementary Fig. S6A and S6B). Selecting the best method in each category, we observed that all methods were more enriched for driver genes in their top predictions as expected, and methods such as fathmm and OncoIMPACT retained high precision even for predictions lower down the list (Supplementary Fig. S7). A striking aspect of the results in Fig. 2A is the large number of predicted genes that are neither passengers nor known driver genes. The majority of these genes are predicted by a single method and are likely enriched in false positives (Fig. 2C). However, genes predicted by multiple methods were strongly enriched in cancer-related functions (Supplementary Fig. S8), highlighting the fact that many more driver genes remain to be discovered, and consistent with recent work showing that more driver genes exist even in extensively studied TCGA cancer types (8).
We used the F1 score that combines precision and recall to rank methods and compare them against a “baseline” method that simply orders genes based on mutation frequency (Fig. 2B; Supplementary Fig. S9A and S9B; see Materials and Methods). The methods fathmm, CHASM, NetBox, DawnRank, DriverNet, and OncoIMPACT provided significantly better results than baseline for precision and F1 score, while ActiveDriver, OncodriveFM, and MutSigCV showed significant improvement in precision (Wilcoxon rank sum test P < 0.1; Supplementary Fig. S10A and S10B). The lower scores observed for MEC methods were not explained solely by their restriction to CNAs (Supplementary Fig. S11A and S11B).
To evaluate how driver gene predictions are affected by cohort size, we tested the different methods for robustness and power using a subsampling approach that compares predictions for a method to those on the full dataset (stability = precision and recovery = recall compared to results from full dataset; see Materials and Methods; ref. 6). Many methods exhibited high stability (>70%) at least for the 50 and 100 sample comparisons (ActiveDriver, MutSigCV, S2N, DriverNet, OncoIMPACT; Fig. 2B; Supplementary Fig. S12A). However, few methods exhibited recovery >50% (NetBox, OncoIMPACT), highlighting challenges in uncovering driver genes when cohort sizes are limited (Fig. 2B; Supplementary Fig. S12B). Overall, as summarized in Fig. 2B, no single method outperformed the others in all metrics.
Most methods predict no driver genes for 10% of patients but many provide robust patient-specific predictions
We next evaluated methods for their ability to accurately identify driver genes in a patient-specific manner to assess their utility for precision medicine applications. Note that not all methods provide predictions per patient and for such methods, we assumed that nominated driver genes are drivers in all patients in which they were mutated. We began by computing statistics for the number of driver genes nominated in each patient by various methods, under the assumption that reporting too few (<1) or too many driver genes (>15) may make them less useful (Fig. 3A; number of drivers per patient is generally expected to be <10; ref. 1). Interestingly, with the exception of FI methods that call a large number of driver mutations for the majority of the patients, nearly all the other methods report no drivers for >10% of patients. This could be an indication of low sensitivity but could also be due to driver events having other origins (e.g., copy-number neutral rearrangements, large translocations, regulatory, or noncoding mutations or methylation and other epigenetic events) that were not considered by these methods. The method OncoIMPACT was found to be unusual in this aspect (even compared with INA methods) as it identified at least 1 driver gene in nearly all patients. Methods belonging to the CBA and MEC categories typically identified <2 driver genes in a large fraction of the cohorts (∼40%). On the other end, some methods frequently (in >50% of cohort) identified >15 driver genes in patients, suggesting that they may be overcalling at the patient-specific level (MutationTaster, MutationAssessor, SIFT, PolyPhen2, and S2N; Fig. 3A).
Considering the top 5 patient-specific predictions, most methods provided similar precision and F1 score as in the cohort-level evaluation, with the network-based methods (INA) generally outperforming other approaches (Fig. 3B; Supplementary Fig. S13A and S13B). As before, CBA methods such as OncodriveFM and MutSigCV provided the best precision (Fig. 3B; Supplementary Fig. S14A and S14B).
To test robustness to noise and to estimate the specificity of the predictions at the patient-specific level, we introduced decoy passenger mutations in genes with probability weighted by the gene length (see Materials and Methods). Most methods exhibited good robustness to such noise with specificity generally higher than 95% (except for FI methods; Fig. 3B; Supplementary Fig. S15A). In particular, methods in the FIC category performed well in identifying decoy function altering mutations, improving significantly over methods in the FI category. Also, as CBA methods explicitly model such sources of noise, they were found to have the best control over them. We also noted that most patients (>80%) do not have any of the decoy mutations in their predictions even when the overall specificity of a method is approximately 95% (Fig. 3B; Supplementary Fig. S15B). This is even more the case when only the top 5 or 10 predictions are considered, highlighting the robustness of many methods at the patient-specific level. As summarized in Fig. 3B, no single method uniformly outperformed the others at the patient-specific level either.
Prioritization of actionable driver genes is still a challenge for most individual methods
The prioritization of driver genes and mutations that are actionable is a key requirement for decision support systems to aid in precision oncology. We sought to evaluate the performance of the various methods studied here based on curated lists of actionable genes (genes that can be targeted by a drug under certain conditions) from the OncoKB (http://oncokb.org/#/) and IntOGen (41) databases (see Materials and Methods; Supplementary Table S2). Analyzing the top 5 driver predictions per patient from each method, we observed significant variability in performance, with the fraction of patients with a predicted actionable driver gene varying from 6% for OncodriveCIS to >60% for DriverNet and OncoIMPACT (Fig. 4A). We observed that the different methods provided largely nonoverlapping predictions, enabling the union to predict actionable driver genes for up to 81% of patients. A breakdown of the predictions by cancer type (Fig. 4B) highlighted that six of them (LIHC, PRAD, KIRP, OV, KIRC, BRCA) have a much lower fraction of patients with predicted actionable driver genes. This could in part be due to the lack of sensitivity in driver prediction methods, but in most cases it is explained by the cancer types being enriched for nontargetable driver genes, highlighting the need for further drug discovery efforts in these cancer types. Finally, as a positive control test, we assessed the sensitivity of the methods in predicting two known actionable oncogenes, BRAF (various drugs are FDA approved for treating melanoma with V600 mutations; ref. 44) and PIK3CA (the inhibitor alpelisib is currently undergoing a clinical trial for breast cancer; ref. 45) in patients harboring known oncogenic mutations (BRAF V600 and 19 PIK3CA mutations located in the domains of the catalytic subunit; see Supplementary Table S3). We observed notable variation in the numbers of patients where the mutations were flagged as drivers by different methods (Fig. 4C), with multiple methods that did not report the genes for any patient (similar results were observed with top 10 predictions; Supplementary Fig. S16). These results highlight that the differences in the underlying model of various methods can lead to dramatically different abilities in predicting actionable drivers and that care should be exercised in interpreting and integrating results from different driver prediction systems.
Low concordance across methods enables the construction of a better consensus-based approach
A comparison of driver predictions across methods revealed that in addition to the expected differences across categories, many methods had a significant number of calls that were unique to them (Supplementary Fig. S17A). This was particularly the case for FIC methods such as fathmm and CHASM, and network-based methods (INA) such as DawnRank, DriverNet, and OncoIMPACT. In addition, for the more sensitive methods (e.g., fathmm and OncoIMPACT), many predictions were shared by >4 methods suggesting that this could provide additional confidence for many of their calls. To evaluate whether consensus approaches could improve over predictions from individual methods, we evaluated a rank aggregation–based approach using all methods (BORDAall) as well as a subset of methods identified using cross-validation (ConsensusDriver; see Materials and Methods). We found that the same methods were consistently selected by ConsensusDriver across cancer types, covering a wide range of methods across categories (Supplementary Fig. S1), including CHASM, fathmm (FIC), OncodriveFM, MutSigCV (CBA), DriverNet, and OncoIMPACT (INA).
Across cancer types, while ConsensusDriver was able to improve over the best individual methods (1.4× improvement compared with fathmm in median F1 score, one-sided Wilcoxon rank sum test P value = 10⁻³; Fig. 5A; Supplementary Fig. S17B), BORDAall did not show a significant improvement in precision (Supplementary Fig. S18A and S18B) or in the F1 score (Supplementary Fig. S17B; see Supplementary Note S2; Supplementary Fig. S19A and S19B for comparisons with other machine learning approaches). Comparing ConsensusDriver to two consensus-based gene lists, we noted that it improved recall and F1 performance over both of them [MutSig (8) and DriverDB (46); one-sided Wilcoxon rank sum test P value = 10⁻³ and 2 × 10⁻⁴ for F1 improvement]. Overall, ConsensusDriver is a consistent improvement over individual methods and consensus-based gene lists, exhibiting a precision of 0.9 for its top 10 predictions and 0.63 over its top 50 predictions (Fig. 5B).
At the sample-specific level, ConsensusDriver is largely better than individual methods across metrics [e.g., it provides a 1.5× improvement over OncoIMPACT in recall (one-sided Wilcoxon rank sum test P < 2 × 10⁻¹⁶) and a 1.35× improvement in F1 score (one-sided Wilcoxon rank sum test P = 2.4 × 10⁻¹⁶); Fig. 5C], with the exception of precision (versus MutSigCV and OncodriveFM) and the fraction of patients without false positive predictions (versus MutSigCV). It arguably provides a better tradeoff though, by improving the fraction of samples with a predicted driver gene (from 0.8 for MutSig to 0.99) and predicted actionable driver genes (from 0.36 for MutSigCV to 0.67). This improved sensitivity is also accompanied by high specificity (0.99) for ConsensusDriver (Fig. 5C).
Evaluation of methods for their ability to predict driver mutations showed that despite being trained with a driver gene list, ConsensusDriver exhibits high recall (91% vs. 79% for DriverNet), as well as accuracy (93% vs. 87% for DriverNet; Fig. 5C). For genes mutated at a low frequency in the cohort (<2%), ConsensusDriver is able to correctly leverage the predictions of some of its constituent methods to identify known driver mutations (Supplementary Figs. S20 and S21). In addition, it retains higher discriminatory power in identifying driver and passenger mutations when restricted to known cancer genes compared to other methods (accuracy = 0.89 vs. 0.76 for DriverNet; Supplementary Fig. S22). An illustration of this can be seen in the NRAS gene in the TCGA gastric cancer cohort (mutation frequency < 2%), where ConsensusDriver predicts the known oncogenic mutation G12C as a driver for one patient and Q61R as a passenger for another patient. Deeper analysis of patient transcriptomes and proteomes reveals that the G12C mutation is accompanied by NRAS upregulation, significant deregulation of the RAS signaling pathway (12/71 genes from associated OncoIMPACT module, hypergeometric test P < 0.05) and on average a 2.7-fold increase in AKT phosphorylation over NRAS wild-type samples. In contrast, Q61R is accompanied by NRAS downregulation and little or no effect on RAS pathway expression and AKT phosphorylation. The notion that G12C and Q61R may be driver and passenger mutations, respectively, in the context of gastric cancer is further supported by analysis in an independent cohort of 167 patients that showed poor prognosis associated with NRAS codon 12/13 mutations and no patients with NRAS codon 61 mutations (47).
The additional sensitivity of ConsensusDriver helped establish that, with the exception of THCA, PRAD, and KIRP, most of the patients analyzed here have at least one known cancer gene in their predictions (Fig. 5D). The fraction of patients with actionable predicted driver genes is, however, lower, as many known driver genes are still not targetable (e.g., ovarian cancer, where most patients harbor a TP53 mutation, exhibits the lowest fraction of patients with a predicted actionable gene).
We provide the first systematic evaluation of different classes of driver prediction methods over a large number of cancer types. As the community still lacks standard evaluation protocols, we identified various criteria to evaluate predictions at the cancer-type level (concordance and sensitivity over known cancer genes, and stability/recovery of predictions upon sub-sampling) and at the patient-specific level (number of driver genes per patient, concordance with gold standard lists, robustness to noise mutations). The availability of our preformatted datasets, predictions from evaluated methods, and a package of tools to study new predictions provides a useful resource and a standardized framework to evaluate any newly developed method against a diverse panel of state-of-the-art methods and on a large number of cancer types.
A key result of our analysis is that there is no single method (or category of methods) that generally outperforms other methods; instead, there are specific pros and cons that need to be taken into consideration when selecting a method for analyzing new datasets. For example, FIC methods are more appropriate for the analysis of a small number of samples when only exome data is available, while CBA methods should be selected for large-scale exome sequencing datasets and INA methods provide greater sensitivity when genomic and transcriptomic data is available. In general, our study highlights the value of integrative methods: for example, methods that are restricted to point mutations, not surprisingly, have a large drop in sensitivity in cancer types with a significant number of CNA events. In the ovarian cancer dataset, the best CBA method only predicts 3 known cancer genes compared to 18 using the best INA method. Furthermore, INA methods that integrate expression data (DriverNet, DawnRank, and OncoIMPACT) show, in most analyses, better results than methods that analyze only genomic data (NetBox and HotNet2). Further work is thus needed in this area, particularly in developing methods that incorporate information from other data types (e.g., miRNA-seq) and other mutation types (e.g., noncoding mutations).
Our study also provides a detailed analysis of the driver predictions at the patient level. It highlights the robustness (low false positive rate, high concordance with the gold standard) of driver gene predictions, but also the lack of sensitivity (significant fraction of patients with 0 to 1 driver predicted) of the vast majority of methods, with methods integrating expression having the best performance. In terms of prioritizing actionable genes or predicting driver mutations, most methods have even more severe limitations and integrating methods with different underlying models could help ameliorate this problem.
There are several limitations to our work. First, we limited our analysis to a single data source (TCGA, which currently provides the most comprehensive coverage of cancer types with genomic and transcriptomic data) and to a set of well-cited methods with software implementations that we were able to use successfully. Second, our evaluations were based on gold standard lists of driver genes and driver mutations that are not cancer-type or patient specific. They thus do not necessarily reflect the heterogeneity of cancers and lack direct evidence that a specific mutation has a functional role in a particular tumor. Other large-scale initiatives have tried to bridge this gap and provide cell line-specific shRNA (48) or drug resistance profiles (49). The results of these studies could potentially be used to generate more refined gold standards. However, such analysis will not come without drawbacks as (i) the cell lines used typically do not have normal controls and thus mutation calls can be error-prone and (ii) the experiments are limited to measuring cell growth and thus miss other relevant phenotypes (e.g., motility and invasiveness). Nevertheless, large, experimentally derived, and patient-specific gold standard driver mutation lists are needed to further advance the development and evaluation of new driver prediction methods.
Overall, our study highlights that while existing driver prediction methods can have limited sensitivity as a function of data types and modeling assumptions used, their diversity in fact provides an avenue for better consensus methods, as demonstrated by the novel consensus method proposed here (ConsensusDriver). Development of methods that harness new sources of information thus might provide greater benefits than refinement of existing paradigms for driver discovery. The design of a consensus-based approach requires careful selection of methods, as demonstrated by the poor performance of Borda using all methods. The methods that are combined should be orthogonal enough to produce a different set of false positives and ideally sensitive enough to provide an intersecting set of true positives. The leave-one-out based selection approach used for ConsensusDriver allowed us to automatically perform this task, and helped select a set of methods with complementary strengths: high specificity of CBA methods, high sensitivity of FIC methods, and integration of different mutation types in addition to high sensitivity in INA methods. Our extensive evaluations suggest that ConsensusDriver provides a good tradeoff between sensitivity and specificity not only for cohort-level cancer gene prediction, but also for the prediction of patient-specific driver mutations. In the context of precision oncology, ConsensusDriver's ability to integrate information across methods and accurately differentiate oncogenic from nononcogenic mutations, even for genes mutated in a single patient (Supplementary Fig. S21), should be very useful. Note that ConsensusDriver is not fundamentally limited to the analysis of bulk tumors at a single time-point, and can be applied to longitudinal as well as spatially related data, including single-cell datasets for which mutation and transcriptome information are available (50).
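To make the ideas of rank aggregation and method selection more concrete, the sketch below shows a simple Borda count over ranked gene lists together with one possible leave-one-out style selection of methods. This is an illustrative simplification under stated assumptions, not the ConsensusDriver algorithm: the scoring rule, the greedy removal loop, the top-k hit criterion, and all function names are hypothetical choices made for the example.

```python
from collections import defaultdict


def borda(rankings):
    """Aggregate ranked gene lists with a Borda count.

    `rankings` maps a method name to its genes, best first. A gene at 0-based
    position i earns (number of distinct genes - i) points from that method,
    and 0 points from methods that do not list it.
    """
    universe = {g for ranked in rankings.values() for g in ranked}
    scores = defaultdict(int)
    for ranked in rankings.values():
        for i, gene in enumerate(ranked):
            scores[gene] += len(universe) - i
    # Highest score first; ties broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))


def hits_at_k(ranked, gold, k):
    """Number of gold-standard genes among the top k of a ranked list."""
    return sum(1 for g in ranked[:k] if g in gold)


def leave_one_out_select(rankings, gold, k):
    """Greedily drop any method whose removal improves the consensus top-k hits.

    Hypothetical selection criterion, used here only to illustrate the idea.
    """
    selected = dict(rankings)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        base = hits_at_k(borda(selected), gold, k)
        for name in sorted(selected):
            trial = {m: r for m, r in selected.items() if m != name}
            if hits_at_k(borda(trial), gold, k) > base:
                selected, improved = trial, True
                break
    return sorted(selected)
```

In this toy setting, a method whose predictions are mostly false positives lowers the number of gold-standard genes in the consensus top k and is therefore removed, which mirrors the observation that combining all methods indiscriminately can hurt performance.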
From a practical point of view, we provide an easy-to-use package to run 18 different driver prediction methods, as well as to aggregate their results into consensus predictions that are largely superior to the individual methods, thus serving as a valuable toolbox for precision oncology efforts.
Availability of Supporting Data
A toolbox that contains scripts to reproduce results presented in this paper and to evaluate results from newly developed methods is available at https://github.com/CSB5/driver_evaluation. The site also contains results for each driver prediction method on all fifteen cancer types and the necessary input files (such as normalized expression, differential expression, mutations, and copy number alteration lists). The ConsensusDriver package is freely available under the MIT license at https://hub.docker.com/r/csb5gis/consensusdriver and allows users to run individual driver prediction methods as well as the consensus algorithm.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: D. Bertrand, S. Drissler, N. Nagarajan
Development of methodology: D. Bertrand, S. Drissler, B.K. Chia, N. Nagarajan
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): D. Bertrand, S. Drissler, B.K. Chia, I.B. Tan, N. Nagarajan
Writing, review, and/or revision of the manuscript: D. Bertrand, B.K. Chia, J.Y. Koh, C. Suphavilai, I.B. Tan, N. Nagarajan
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): B.K. Chia, J.Y. Koh, C. Li
Study supervision: N. Nagarajan
We thank Dr. Anders Skanderup and Dr. Asif Javed for insightful comments and suggestions on the manuscript. This work was supported by funding from the Agency for Science, Technology and Research (A*STAR), Singapore.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.