Missense variants are commonly identified in genomic sequence but only a small fraction directly contribute to oncogenesis. The ability to distinguish those missense changes that contribute to cancer progression from those that do not is a difficult problem usually only accomplished through functional in vivo analyses. Using two computational algorithms, Sorting Intolerant from Tolerant (SIFT) and the Pfam-based LogR.E-value method, we have identified features that distinguish cancer-associated missense mutations from other classes of missense change. Our data reveal that cancer mutants behave similarly to Mendelian disease mutations, but are clearly distinct from either complex disease mutations or common single-nucleotide polymorphisms. We show that both activating and inactivating oncogenic mutations are predicted to be deleterious, although activating changes are likely to increase protein activity. Using the Gene Ontology and data from the SIFT and LogR.E-value metrics, a classifier was built that predicts cancer-associated missense mutations with a very low false-positive rate. The classifier does remarkably well in a number of different experiments designed to distinguish polymorphisms from true cancer-associated mutations. We also show that recurrently observed mutations are much more likely to be predicted to be cancer-associated than rare mutations, suggesting that our classifier will be useful in distinguishing causal from passenger mutations. In addition, from an expressed sequence tag–based screen, we identified a previously unknown germ line change (P1104A) in tumor tissues that is predicted to disrupt the function of the TYK2 protein. The data presented here show that this novel bioinformatics approach to classifying cancer-associated variants is robust and can be used for large-scale analyses. [Cancer Res 2007;67(2):465–73]
A central focus of cancer genetics is the study of mutations that are causally implicated in tumorigenesis. The identification of such causal mutations not only provides insight into cancer biology but also presents anticancer therapeutic targets and diagnostic markers. For example, recent work has provided details of the biology underlying cancer, including information about the types of gene families involved in various stages of cancer (1) as well as the complex nature of the mutational spectra associated with different cancers (2). In clinical settings, these mutations have proved to be extremely valuable in distinguishing patient populations that are responsive to a particular therapy (3–7). In addition to somatic mutations, which are more prevalent in cancers, germ line mutations can confer a predisposition to cancer risks (8, 9). Further study of both somatic and germ line mutations associated with cancer is likely to lead to a deeper understanding of the biology of cancer and possibly will reveal additional targets for therapeutic design.
Targeted sequencing has been done to characterize novel cancer-associated mutations by identifying variants found in tumor tissue (1, 2, 10–16). However, the identification of true cancer-associated variants from such approaches is a challenging problem as these studies often yield large numbers of changes that are not necessarily causally associated with cancer (2, 16). This is partly a result of non–cancer-causing somatic variants, termed passenger mutations (2, 16), which accumulate in cancer tissue due to the high mutation rate and multiple cell divisions seen during tumor growth (17). Different types of tumors may display different point mutation rates. Cells with microsatellite instability, for example, are known to exhibit much higher mutation rates than those with chromosomal instability (reviewed in ref. 18), although tumors with DNA copy number alteration may also have a large number of point mutations (19). In large-scale screening for mutations, passenger mutations were estimated to account for two thirds of all variants identified (16). The identification of cancer-associated mutations can also be hindered by tumor heterogeneity as oncogenic changes may not be present equally throughout a tumor (20). As a consequence, such variants may be present only in a small fraction of the tissue sample used in a targeted sequencing experiment and are not easily detected (20). Thus, infrequently occurring cancer-associated changes can be difficult to distinguish in a directed sequencing approach. Aside from functional analysis, there is no method available to differentiate those variants that are responsible for tumor progression from other changes.
In addition to targeted sequencing approaches, expressed sequence tag (EST) sequences have also been used to identify missense changes overrepresented in cancer libraries (21, 22). However, similar to the targeted sequencing work, these analyses yielded large numbers of variants of which only a small number would likely contribute to cancer progression. This is not only due to factors such as passenger mutations but is also a result of poor data quality from high-throughput sequencing of EST libraries. Perhaps for this reason, none of the novel variants identified from such studies have been shown to exist in independent tumor samples. Although some of the putative cancer variants identified from EST data are expected to be real, such screens are hindered by a lack of additional filtering mechanisms that can enhance true signals from the high level of background noise.
Aside from missense cancer mutations and common polymorphisms, other classes of missense change have been identified. Missense mutations known to cause Mendelian disease have been annotated in Swiss-Prot and studied for their deleterious effects on protein function (23). In a recent work, missense mutations associated with complex disease were collected and shown to have characteristics distinct from Mendelian disease variants (24). The ability to distinguish missense changes by their disease phenotype suggests that comparison of cancer-associated variants to other classes of variant could be a valuable tool in understanding cancer progression. Furthermore, the availability of data sets of missense mutations associated with complex disease (24), Mendelian disease (23), and cancer (25) make such an analysis possible. Several computational methods, including Sorting Intolerant from Tolerant (SIFT; ref. 26), Polymorphism Phenotyping (PolyPhen; ref. 27), the Pfam-based LogR.E-value (28), large-scale annotation of coding nonsynonymous SNPs (LS-SNP; ref. 29), statistical geometry methods (30), support vector machine methods (31), decision trees (32), and random forest (RF) classifiers (33), have been developed to identify deleterious variants. However, it remains to be determined if these algorithms are able to distinguish cancer-associated variants from other types of change.
Here, we have used a number of computational methods to define characteristics of known cancer mutations and subsequently developed a novel approach for predicting cancer-associated mutations from among a large set of missense variants. This method provides a means of analyzing large-scale data sets and will likely prove to be increasingly relevant to genome-scale efforts currently under way to identify mutations involved in cancer progression.
Materials and Methods
Data sets. Variants were assembled as follows: (a) common variants were downloaded from National Center for Biotechnology Information (NCBI)56725), include only variants in those genes most likely to be involved in oncogenesis (25); (c) Mendelian disease–associated variants were obtained from identifying those records in the file uniprot_sprot.dat from the Swiss-Prot ftp site824). All data sets will be available online.9
SIFT and Pfam-based LogR.E-value scores. The SIFT program was downloaded1028). Note that for the discussion and display of data (Figs. 1 and 2; Table 1), we have used a negative version of this value (see ref. 28).
|Mutation type .||Gene .||Mutation .||SIFT score .||Pfam-based LogR.E-value .|
|Mutation type .||Gene .||Mutation .||SIFT score .||Pfam-based LogR.E-value .|
Abbreviation: NA, not applicable.
Gene Ontology log-odds analysis and RF classification. The log-odds scores were calculated to represent the relative frequency with which a Gene Ontology (GO) term was used to annotate cancer or noncancer gene sets. All genes represented in the COSMIC database associated with an oncogenic phenotype were used as the cancer gene data set. Further details of the GO analysis are presented as Supplementary Data.
The RF classifier was built using the package randomForest 4.5-16111212), tyrosine phosphatome (34), and serine-threonine kinase data (10). Technical details of this method as well as details of the pathway analysis are presented as Supplementary Data. All training data are freely available online.9
EST-based identification of variants. A set of 22,332 RefSeq mRNAs was downloaded from NCBI, and each mRNA was aligned to ESTs from public and Incyte collections. ESTs were aligned to the genome using GMAP (35) to determine overlaps with RefSeq mRNA genomic coordinates. Those ESTs that overlapped with a particular gene region were aligned to the corresponding RefSeq mRNA using NCBI BLAST run with default parameters. All mismatches were required to be flanked by 30 bp of sequence identical to the RefSeq mRNAs. Rare events that occurred in only one EST or were seen in only one library were eliminated, and variants were required to be observed in at least 3% of the total number of cancer ESTs. Any variant that overlapped with a record in dbSNP was eliminated to avoid identifying known or common SNPs. For similar reasons, variants identified in normal libraries were also eliminated. Lastly, a z-score was calculated to test the significance of the difference in the number of ESTs contributed from cancer or normal libraries. Variants were eliminated for P > 0.1. This list of putative computationally defined cancer variants was then further enriched by selecting those changes that were identified as cancer-associated by the RF classifier. Variants identified from the screen are presented online.9
Validation of missense mutations. For traditional sequencing, PCR products were amplified from genomic DNA extracted from cell lines. Sequence reactions were done using conventional Sanger sequencing methods for both sense and antisense directions on an Applied Biosystems 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA). Traces were analyzed using Sequencher (Gene Codes, Ann Arbor, MI). For mass spectrometry analysis, variation validation assays were designed using MassARRAY 184.108.40.206 software (Sequenom, San Diego, CA). See Supplementary Data for further details. The cancer sequence data were obtained from the following tumor tissue types (and numbers) of samples: lung (64), breast (37), stomach (10), colon (9), and liver (8). The oligonucleotide primer used for TYK2 P1104A variant was 5′-TGACTCCAGCCAGAGC-3′.
Comparison of variant data sets. We collected four groups of missense variants as an initial step in understanding differences among variant groups. A set of 5,747 common polymorphisms used as a baseline data set was collected from dbSNP by extracting those polymorphisms with an overall minor allele frequency (MAF) >20%. Mendelian disease mutations were identified by isolating variants from the Swiss-Prot database annotated as disease but not cancer. These 11,456 noncancer disease missense mutations represent changes to single genes that follow standard Mendelian rules of inheritance and are thought to contribute to disease progression (23). From a previous study (24), we collected a set of 27 complex disease–associated variants that are known to contribute to disease progression but do not follow standard Mendelian rules of inheritance. Last, a set of 1,091 cancer-associated somatic mutations were gathered from the COSMIC database (25) in which cancer-associated mutations manually curated from literature sources are compiled. Because some mutations in COSMIC may be passenger or bystander mutations, we attempted to enrich for true cancer-associated mutations by using variants identified in genes likely to be involved in cancer progression (25).
Two distinct approaches were used to measure the effect of each variant on protein function. First, the SIFT program (26) was used to predict if a variant is likely to affect protein function. SIFT uses sequence homology between closely related protein species to measure this effect, and low SIFT scores (<0.05) are predictive of intolerant changes. The SIFT scores from all variants were calculated and plotted by class as cumulative distributions (Fig. 1A). The distributions of SIFT scores suggest that these variants can be divided into at least two classes: Mendelian/cancer–associated and complex/common. The differences between these groups are most apparent for the intolerant SIFT scores between 0.0 and 0.05 where 83% and 77% of the cancer and Mendelian variants are predicted to be intolerant compared with 42% and 12% of the complex and common missense changes. Not surprisingly, the distinction between cancer and normal variants becomes smaller for cumulative distributions of SIFT scores derived from SNPs with an overall MAF of <20% (not shown). Although cancer is often thought of as a complex disease involving changes to multiple loci, in this assay, cancer mutations seem more similar to Mendelian disease variants than complex disease variants.
The second approach used to predict the effect of a variant on protein function is the Pfam-based LogR.E-value, which calculates the difference between a wild-type and variant protein by measuring their fit to a Pfam model (28). The LogR.E-value score is derived from the E value provided by HMMER2 software (36) and was used previously to predict whether specific changes to a protein were likely to affect protein activity or stability (28). The underlying scoring systems behind LogR.E-value and SIFT are different (28), and it has been shown that these algorithms produce distinct metrics that can be used to independently analyze variant data (28).
Based on previous work with the LogR.E-value, scores that are less than −0.5 are predicted to alter protein function (ref. 28; see Materials and Methods). The values generated from analysis of all four variant sets using this method were plotted as cumulative distributions (Fig. 1B). As seen with the analysis of SIFT data, the distribution of the cancer and Mendelian disease–associated variants are fairly similar to one another, but both are significantly different from the distributions of either the complex or common variant data sets (Fig. 1B). Moreover, similar to the SIFT analysis, the cancer and Mendelian data sets have a larger percentage of changes predicted to have a significant effect on protein function (70% and 77%, respectively) compared with the complex and common variant data sets (33% and 26%, respectively). These data once again point to differences between cancer-associated variants and complex disease variants. Further, and perhaps even more apparent than the SIFT analysis, these data reveal similarities between Mendelian and cancer-associated variants that underscore that these changes may affect protein function in a similar manner.
Characterization of variants in different classes of genes. As the SIFT and Pfam-based LogR.E-value metrics proved useful in studying differences between populations of variants, we used these tools to identify characteristics of variants defined by biochemical analysis as activating or inactivating. Well-characterized inactivating mutations in the genes p53 (37), PTEN (38), and p16 (39) were analyzed. Each of these mutations was predicted to impair protein function in at least one of the assays (Table 1). This result is consistent with the debilitating characteristics of these mutations. Next, activating mutations in BRAF (40), K-Ras (41), Kit (42), PDGFR α (43), JAK2 (44), and epidermal growth factor receptor (EGFR; refs. 3, 5, 6) were analyzed in the same manner. Although these activating mutations increase the output of different signaling pathways and might not be considered intolerant, it was striking that all mutations were predicted to impair protein function in at least one of the assays (Table 1). Together, these results suggest that both activating and inactivating oncogenic mutations seem to be untolerated by wild-type proteins although the activating mutations are able to increase the signaling ability of these molecules.
As many activating changes have been identified in kinase genes, we were interested in examining a larger pool of such mutations. Using data from the COSMIC database, mutations were collected either from genes with kinase domains or genes without kinase domains. The distribution of SIFT scores for these two variant data sets suggests that these changes are largely predicted to be intolerant (78% and 85%, for the variants in kinase and nonkinase genes, respectively; Fig. 2A). The Pfam-based LogR.E-value scores follow a similar trend to the SIFT data in that 54% of the kinase variants and 87% of the nonkinase variants are predicted to affect protein function (Fig. 2B). A recent analysis of cancer-associated variants in genes with kinase domains (45) revealed that a majority of such changes are likely to be activating as they often affect residues that regulate the catalytic activity of the kinase (45). Thus, the results presented in Fig. 2 not only may reflect the similarities between kinase and nonkinase genes but also may suggest that, in general, activating and inactivating changes are not tolerated.
GO analysis of cancer genes and the development of a cancer-variant predictor. We next attempted to identify an additional metric that could be used to distinguish the cancer-associated variants by classifying the genes in which they reside. Using a standard set of GO annotations, a log-odds score was calculated to measure the difference in frequency that a particular GO term was used to annotate genes that were either known to be involved in cancer (25), or, not known to be involved in cancer and represented by a RefSeq mRNA. For example, genes annotated with the term “ion transport” are underrepresented in the cancer data set, and this term has a negative log-odds score of −2.30. On the other hand, genes annotated with the term “cell cycle” are overrepresented in the cancer data set, and this term has a positive log-odds score of 2.06 (Fig. 3A). For a gene of interest annotated with a set of GO terms, one can sum the log-odds score for each GO term. This cumulative score reveals inherent differences between genes containing common variants and genes containing cancer variants as shown in Fig. 3B. The different distributions are not a result of the varying levels to which these gene sets were annotated, as scores that were normalized by the number of annotated terms produced very similar data (not shown). Although these data could reflect an inherent bias in the types of genes chosen for analysis, it is likely that this approach reveals distinguishing features of the cancer and noncancer gene sets.
Based on the above analysis, it seemed possible that a computational method could be developed that would combine the information from the SIFT, Pfam-based LogR.E-value, and GO log-odds metrics to predict whether a variant was likely to be cancer associated. The RF classifier provides such a method by dividing a large pool of data into smaller subsets based on characteristics of each datum. This method has been used successfully in many biological applications, including distinguishing harmful SNPs from harmless SNPs (33) and classifying microarray data (46). In our analysis, variants included in the training set used to construct the classifier were assigned a measurement from each of the three different algorithms described above (SIFT, LogR.E-value, and the GO log-odds score). During the construction of the classifier using these sets of training data, an internal measurement of the rate at which training data are misclassified was determined to be 3.19% [this value is termed the out-of-bag (OOB) error rate; for further details, see Materials and Methods].
Further validation of the classifier was achieved by performing a cross-validation experiment in which a group of known variants was entirely excluded from the training process during the construction of the classifier. These excluded variants were then analyzed using the RF predictor to determine how often they were misclassified. From this analysis of 730 variants, only 10 of 581 (1.7%) normal variants were misclassified as cancer and only 13 of 149 (8.7%) cancer variants were misclassified as normal. Together with the OOB error rate, these data illustrate that this is a strong classifier with a very low rate of false-positive predictions. It should be noted that the noncancer variant training data set consists only of a subset of variants from dbSNP with a MAF of >20%. We expect to observe a somewhat higher rate of misclassification if more rare SNPs are included.
Validation and practical application of the classifier. A useful application of our classifier is in distinguishing relevant cancer-associated mutations from the expected polymorphic variants often identified during sequencing projects to discover novel somatic mutations. We attempted to model this process by collecting data from several protein family–based studies, including variants in the tyrosine kinome (12), tyrosine phosphatome (34), and serine/threonine kinase gene family (10). For each group of genes, two sets of variants were identified: the set of somatic cancer–associated mutations reported in each publication and a set of all missense SNPs isolated from dbSNP, which represent those changes that would be expected as background polymorphisms identified in sequence data. The variants were then classified by our predictor. It should be noted that none of these variants were used to train the RF classifier. As shown in Fig. 4A, although 64% of the somatic mutations in the tyrosine kinome were predicted to be cancer-associated, a significantly smaller percentage of the changes from dbSNP (23%) were predicted to be cancer-associated (P = 3.9e−4; two-tailed Fisher's exact test). Analysis of the tyrosine phosphatase data set reveals that the somatic mutations were also more often predicted to be cancer associated than the variants isolated from dbSNP (P = 0.034; Fig. 4B). A similar trend is observed for the serine/threonine data set in which 36% of the somatic variants and only 19% of the variants from dbSNP were predicted to be cancer associated (Fig. 4C). Thus, our predictor is capable of distinguishing true cancer mutations from among large sets of sequence data even in the absence of extensive knowledge of SNPs. It is worth noting that the percentage of cancer-associated variants from these data sets was lower than that observed during the analysis of the training data. This likely reflects the fact that the experimental data described above presumably contain fewer true cancer causal variants than the training data mutations, which were derived from a select group of well-studied mutants.
Whereas true causal mutations are under positive selection and are therefore observed recurrently, bystander or passenger mutations occur in a stochastic fashion and are expected at lower frequencies. We were interested in determining if those variants that occur infrequently are less likely to be predicted to be cancer associated by our classifier. To address this issue, the entire set of cancer-associated variants in COSMIC was divided into groups according to the number of times that a change was observed. The percentage of variants predicted to be cancer associated was determined for each class using the RF classifier. As shown in Fig. 5A, whereas 58% of variants observed at least 10 times are predicted to be cancer associated, only 43% of variants occurring only a single time are predicted to be cancer associated (P = 0.018, two-tailed Fisher's exact test). This analysis not only shows the usefulness of our classifier but also suggests that a large fraction of infrequent variants in COSMIC could be passenger mutations.
As an additional test of the usefulness of our classifier, we attempted to predict cancer-associated mutations from a recent large-scale sequencing effort of human colorectal and breast tumors (47). In this recent analysis, genes were classified into two groups based on mutation frequencies: those likely to be involved in tumor progression (CAN genes) and those less likely to be involved in tumor progression (non-CAN genes; ref. 47). Variants identified in either CAN genes or non-CAN genes were classified by our predictor. As shown in Fig. 5B, variants identified in CAN genes are much more likely to be classified as cancer associated (26.3%) than variants in non-CAN genes (13.3%; P = 8.8e−6; two-tailed Fisher's exact test). The larger number of cancer-associated variants predicted from the CAN genes is consistent with the suggestion that these mutations were under selection during tumor progression (47). Importantly, independent of the method used for deriving the CAN genes, our classifier provides a novel method that prioritizes mutations for in vivo functional analysis.
EST-based identification of novel variants. We designed an EST-based screen to collect novel variants that could then be classified using our predictor. In this approach, ESTs from libraries derived from either normal or cancer tissue were aligned to RefSeq mRNAs to identify mismatches. From 2,600 candidate variants that were specifically present in ESTs from cancer libraries, 494 variants (19%) were identified as cancer associated based on the RF predictor. This list includes a number of known cancer-related mutations, such as the C135F change in p53.
Such an EST-based approach is likely to be complicated by the sequence noise commonly associated with EST data. To address this issue, eight novel variants were selected for validation by determining if the changes were in fact present in genomic sequence or were identified as a consequence of noisy EST sequences. Two of the variants, a V1304M change in MAST2 and a Y11H change RIOK2, were identified in genomic DNA from their respective EST tissue sources (Supplementary Fig. S1), which show that some of these events indeed occur in genomic sequences and are not a result of EST sequencing artifacts. Although it was not surprising to identify these changes in their respective EST tissue sources, it was important to determine whether some of the predicted mutations were present in unrelated tumor DNA samples. This was addressed by performing mass spectrometry genotyping for 65 predicted cancer variants over a collection of 128 tumor tissue samples that were independent of any EST libraries used in the screen. From this analysis, we identified one novel variant, the P1104A variants in the kinase domain of the TYK2 gene, which was present in four independent tumor tissues (Supplementary Fig. S2A and S2B). This variant was classified as a germ line change as it was also identified in matched normal samples.
It is likely that the P1104A variant is a novel cancer-associated germ line mutation that affects TYK2 function. This change has not been found in other normal samples and is absent from the comprehensive dbSNP. Analysis of sequence data covering the entire TYK2 gene of >47 normal individuals by the Seattle SNP project (48) and 102 normal individuals catalogued at the SNP500Cancer project failed to reveal any samples with this change. The P1104A variant mutates a conserved proline that underlies the substrate-binding groove in the COOH-terminal, helical lobe of the TYK2 kinase domain. In addition, the proline is positioned under a key tryptophan residue in a ring-stacking interaction that stabilizes the inactive conformation of the activation loop. The mutation of proline to an alanine could help precipitate an activated catalytic state of TYK2, which may lead to an oncogenic phenotype.
Missense sequence variants can exhibit a wide range of effects on proteins, producing diverse phenotypic outcomes. From the analysis in this article, we have shown that it is possible to use a combination of computational metrics to predict those changes that are more likely to contribute to oncogenesis. This approach becomes increasingly relevant in light of the recent initiation of large-scale cancer genome initiatives (11), which are likely to produce large numbers of variants of which only a small proportion will be implicated in tumorigenesis.
The contribution of multiple somatic changes makes it difficult to classify cancer as either a Mendelian or a complex disease. Oncogenesis often requires the accumulation of mutations in multiple genes, and one might be tempted to consider cancer mutations as more similar to those involved in complex diseases. Thus, it is somewhat surprising to observe that cancer mutations behave more similarly to Mendelian disease mutations than to complex disease mutations. Although one could argue that these results simply reflect the fact that variants resembling Mendelian mutations are those most easily studied in the laboratory and subsequently collected in databases such as COSMIC, it is more likely that these results point to cancer as a disease that progresses through a series of stepwise deleterious mutations (49). Such mutations are likely to lie in the class of genes that, when altered, mediate tumor progression by providing cancer cells growth advantages, metastatic capabilities, and the ability to escape from homeostasis control (49). Our analysis does not imply that variants similar to those found in complex disease are less important for tumor progression. However, those changes may be less frequent and more diverse than the changes we identified.
An interesting observation from our analysis is that activating mutations seem to impair protein function. Although one might hypothesize that such changes would be desirable because they increase the output of the molecule, our data suggest that proteins that function at a wild-type level are selectively maintained. Our analysis of variants in kinase and nonkinase genes supports the notion that activating mutations in kinases primarily affect residues involved in the control of their enzymatic activity (12, 15, 50). An apparent gain-of-function phenotype could be manifested as a damaging change to a regulatory domain that controls kinase activity. In fact, this theory has been supported by known activating mutations in BRAF and EGFR (3, 5, 6, 40).
With the aid of our computational classifier, we were able to identify a novel germ line variant from EST sequences, the P1104A change in the kinase domain of TYK2. This variant is present in four different tumor samples independent of the EST libraries where it was originally found. As this variant was not found in other normal samples, it is likely that P1104A is a germ line mutation associated with increased cancer risk. Although we are encouraged by this finding, we also observed many limitations intrinsic to EST-based mutation screening. First, a majority of the observed variants from ESTs are likely sequence artifacts. In our hands, only two of eight expected variants were found in the genomic DNA from the same tissue sources where the variant ESTs were derived. Second, the ability to detect mutations is heavily influenced by EST coverage and library bias. Although we were able to observe some of the known cancer mutations, many of them were missed due to insufficient EST coverage at expected locations. For example, the V600E mutation of BRAF, a prevalent somatic mutation in melanoma, was missed owing to a lack of EST sequences from melanoma libraries covering the expected location. Furthermore, true mutations identified in the EST libraries may not be easily found in unrelated tumor samples if they occur at a low frequency. Despite these difficulties, we show here that it is still possible to identify novel cancer-associated variants from the EST data. Given the reduced cost of genotyping, it is becoming feasible to validate large numbers of putative cancer variants.
The approach described here will likely facilitate the identification of new mutations associated with oncogenesis. With efforts such as The Cancer Genome Atlas project under way, a large volume of sequence variation data is expected to accumulate in the next few years. Such data will be most useful if it is possible to distinguish the rare, meaningful mutations from among the large number of missense variants. As shown in our analysis, using a variety of different data sets, the method we developed provides statistically significant enrichment of cancer-associated changes. Consistent with these data, a preliminary analysis of an unpublished large sequence data set using this classifier revealed that 47% of variants identified as somatic or germ line were predicted to be cancer associated, whereas only 19% of polymorphisms known in dbSNP were predicted to be cancer associated.13
J. Kaminker and S. Seshagiri, unpublished observations.
Although our classifier did well on different sets of variants, the frequencies of predicted cancer and noncancer mutations vary from those suggested by the OOB error and cross-validation data. It is likely that this is in part due to the requirement that common SNPs used to train the classifier have a MAF of >20%. This high MAF was chosen to ensure that the majority of SNPs used to train the classifier were indeed common and unlikely to be deleterious. In the experiments described above, no attempt was made to exclude SNPs based on their MAF and it would not be surprising to find that some of the variants described as common are rare and possibly deleterious. In addition, while the training data were derived from a highly studied set of variants, much of the large-scale experimental data consist of novel changes that are not yet well studied for their role in cancer progression. Although some of the changes identified by Sjoblom et al. (47) are likely to be tumorigenic, it is possible that others will be classified as passenger or background changes.
It may be desirable in the future to attempt to separate cancer mutations from other types of variants using additional structural characteristics of amino acid changes, such as those presented in the recent work by Furney et al. (51). However, the approach here is an important first step toward differentiating cancer-associated variants from other types of change. The novelty and usefulness of this algorithm provides a needed method that will enable the prioritization of large data sets of variants for further study.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
We thank Shiuh-Ming Luoh, Colin Watanabe, Jerry Tang, Lawrence Hon, Reece Hart, and Brian Desany for helpful discussions; Thomas Wu and Guy Cavet for their detailed comments on the manuscript; and William Wood for guidance and support throughout the project.