The NCI-60 cell lines are the most frequently studied human tumor cell lines in cancer research. This panel has generated the most extensive cancer pharmacology database worldwide. In addition, these cell lines have been intensely investigated, providing a unique platform for hypothesis-driven research focused on enhancing our understanding of tumor biology. Here, we report a comprehensive analysis of coding variants in the NCI-60 panel of cell lines identified by whole exome sequencing, providing a list of possible cancer specific variants for the community. Furthermore, we identify pharmacogenomic correlations between specific variants in genes such as TP53, BRAF, ERBBs, and ATAD5 and anticancer agents such as nutlin, vemurafenib, erlotinib, and bleomycin showing one of many ways the data could be used to validate and generate novel hypotheses for further investigation. As new cancer genes are identified through large-scale sequencing studies, the data presented here for the NCI-60 will be an invaluable resource for identifying cell lines with mutations in such genes for hypothesis-driven research. To enhance the utility of the data for the greater research community, the genomic variants are freely available in different formats and from multiple sources including the CellMiner and Ingenuity websites. Cancer Res; 73(14); 4372–82. ©2013 AACR.
The NCI-60 human tumor cell line panel (1) is used by a broad range of cancer investigators and by the NCI Developmental Therapeutics Program (DTP) to discover novel anticancer drugs (2). This panel represents an invaluable and publicly accessible platform of pharmacological, genomic, metabolomic, biochemical, and molecular datasets (3–8). This study reports findings from whole exome sequencing (WES) of the NCI-60 panel of cell lines. In addition, pharmacogenomic analyses provide examples of a few of the many ways the variant data could be used to generate novel hypotheses. Our study complements two recently published large-scale cancer cell line sequencing studies, which used a limited number of genes (9, 10), because our work provides the whole exome variants for the entire NCI-60 cell lines. The data are made available through the CellMiner, NCI DTP and Ingenuity Systems' websites (11).
Materials and Methods
The list of cell lines in the NCI-60 panel and their tissue origins are given in Supplementary Fig. S8. DNA was extracted from cells and fingerprinted as described before (12).
Exome capture and sequencing
Briefly, 38 Mb of coding region for each cell line was captured using the Agilent SureSelect All Exon v1.0 Kit (Agilent). Genomic DNA (3 μg) was sheared using the Covaris S2 ultra-sonicator (Covaris) using the settings duty cycle 10%, intensity 5%, cycle/burst 200, and time 60s, which yielded a fragment size distribution with a mean at 200 bp. Libraries were generated using standard Illumina library protocol (Illumina) followed by size selection using ChromaSpin TE200 spin columns (Clonetech). Pre- and postcapture steps were conducted following the manufacturers' protocol (Agilent). The samples were sequenced as paired-end 80-mer reads on an Illumina Genome Analyzer IIx instrument (Illumina) following the manufacturers' protocol.
Data processing and variant calls
Fastq files were aligned against the reference human genome build 19 (hg19) using the Burrows-Wheeler Aligner (13). Alignment files were base quality score recalibrated and locally realigned around indels with GATK (14) and marked for duplicates using PICARD tools (picard.sourceforge.net). Alignment files and variant calls can be accessed from the links provided (11). Consensus genotype calls were generated using samtools mpileup (15) and annotated using the Annovar package (16). Variants were further filtered for the SureSelect bait region, a minimum read depth of 6 and a minimum quality score of 30 for single nucleotide variant (SNV) and 60 for indels, producing the final variant calls.
Drug activity determination
Gene expression and other NCI-60 molecular characterization
The x-axis of a volcano plot depicts the difference in mean log GI50 between the cell lines containing a mutation in the specified gene and the cell lines not containing such a mutation. The y-axis depicts the statistical significance level for the comparison of log GI50 for those 2 groups of cell lines with larger values indicating smaller P values. On a volcano plot for a gene, the points represent the compounds. On a volcano plot for a compound, the points represent the genes. For a volcano plot representing a gene, the false discovery rate can be limited to 0.2 or less by restricting attention to the 310 clinical and investigational compounds with P values no greater than 0.0005. When examining all of the screening compounds, the false discovery rate will be greater unless attention is restricted by a more stringent significance cut-off (e.g., 10−4) and an imposed cut-off on difference in log GI50 between mutated and wild-type groups (e.g., ± 0.5). In general, however, the volcano plots are used either to confirm previously identified hypotheses or to generate hypotheses that require independent validation.
Super Learner prediction models
Using GI50 data on the NCI-60 for 103 U.S. Food and Drug Administration (FDA)-approved and 207 investigational oncology drugs and the 711 genes with at least 5 cell lines containing a type II variant in the gene, we estimated a predictor for each drug using the Super Learner algorithm (19). The predictor uses the gene-level mutation profile to predict the log GI50 for each drug. The Super Learner is an ensemble-based prediction methodology that combines different machine-learning predictors into a single optimal predictor based on minimizing the cross-validated risk. The base algorithms for the Super Learner include elastic net regression, gradient-boosting regression, bagging, CART, random forests, neural networks, and support vector machines. In total, 35 prediction algorithms were combined for the Super Learner ensemble. We do not expect a single prediction algorithm (e.g., elastic net regression) to be optimal across all 310 drugs, and the Super Learner allows the final predictor to data-adaptively up-weight the best algorithms for the final predictor. Examining the weights for each algorithm across the 310 drugs (data not shown) shows great variability, indicating we should see a benefit with the Super Learner ensemble approach. Within a drug, the Super Learner predicts the log GI50 based on the gene-level mutation profiles. To compare across the drugs with different potencies, the log GI50 values need to be normalized. We define the normalized log GI50 for a cell line as the log GI50 minus the mean log GI50 for that drug in all the other cell lines. For ROC analysis, we classified a cell line as sensitive to a drug if its true-normalized log GI50 was less than −0.5, and insensitive if the value was greater than 0.5.
The variant calls were generated as described in Materials and Methods, where we filtered variants with a minimum quality of 30 (60 for small insertions/deletions) and a minimum depth of 6 with at least 3 alternate alleles over the targeted 38 Mb coding region. Because matched normals are not available for cell lines, we conducted a more stringent filtering to identify potential cancer-specific variants. Using this filtering, the variants were divided into 2 groups: type I variants corresponding to common (and possibly germline) variants and type II variants enriched for acquired cancer-specific variants (Supplementary Figs. S1 and S2). We obtained more than 1.2 million type I and 60,005 type II variants in the NCI-60 cell lines.
Although a limitation of cell line sequencing is the lack of available normal-matched tissue for comparison, the NCI-60 panel does allow comparisons between cell lines from 9 distinct tissues of origin. NCI-60 cell lines with known microsatellite instability (MSI; Supplementary Fig. S3) have very high type II variant counts (Fig. 1A). However, HCC2998, a colon cancer cell line not known to have MSI, has the highest number of type II variants. In contrast to the known MSI cell lines, more than 98% of HCC2998 type II variants are SNVs (Supplementary Fig. S4), suggesting that this hypermutator phenotype arises from a mechanism other than MSI. Of interest, HCC2998 carries a POLE exonuclease domain missence variant coding for a P286R mutation in POLϵ (Supplementary Fig. S5). Previous reports indicate that impaired POLϵ proofreading results in a high rate of single nucleotide substitutions and increased tumor formation (20) and POLE mutations in colorectal cancer has recently been reported (21). HCC2998 seems to exemplify this phenomenon, providing a reagent for further investigation and illustrating the utility of the NCI-60 WES data.
Given the diversity in the NCI-60 panel based on the tissue of origin, the WES data reveal important information about the etiology of each subgroup. As is evident from Fig. 1B, there is a wide range of transition-to-transversion ratios (ti/tv) among the NCI-60 panel. Melanoma cell lines have the highest ti/tv (3.93) with higher C:G to T:A transitions, which is the major mode of change for UV-induced DNA damage (22). In contrast, lung cancer cell lines have a ti/tv (0.67) indicative of tobacco smoke-induced DNA damage (23). Thus, the WES data supports the prior notion that the NCI-60 panel retains disease etiology signatures (7).
Figure 2A shows a map of the 10 most frequently mutated genes in the Catalogue of Somatic Mutations in Cancer (COSMIC) database (24). We annotated the WES variant calls as those present in the COSMIC database (v59) and those that are absent in COSMIC but predicted to be deleterious by the Sorting Tolerant From Intolerant (SIFT; ref. 25) or PolyPhen (26) algorithms. TP53 is the most frequently mutated gene overall, whereas BRAF is the most frequently mutated gene among melanoma cell lines (Fig. 2A). Although most of the variants identified in these 10 genes are already annotated in COSMIC, novel variants in these 10 genes were also observed. Although, the lack of normal tissue makes it almost impossible to validate these as somatic changes, these variants were not observed in either the 1,000 Genomes Project (27), or in the 5,600 normal whole exomes available through the NHLBI Exome Sequencing Project (28). Besides the many well-defined cancer genes such as those in Fig. 2A, large-scale tumor sequencing efforts by others continue to lead to the discovery of novel cancer genes, such as the 16 genes listed in Fig. 2B. Because the NCI-60 cell lines are so well characterized and readily available, they are ideal tools for hypothesis-driven research of these novel cancer genes/mutations identified by large-scale sequencing efforts. Details for these particular mutations or for any other gene mutation can be downloaded from the public domains, including CellMiner (Fig. 3) or Ingenuity website (Supplementary Fig. S6).
To show the utility of this unique dataset and illustrate one of many ways to apply these data in hypothesis-driven research, we carried out an integrated pharmacogenomic investigation. The fact that the NCI-60 panel has been used to screen thousands of compounds provides a rich resource for testing the relationship of variants in genes to drug response. Among 43,225 compounds screened for activity against the NCI-60 cell lines (as of September 2012), 15,898 showed high dynamic range in their GI50 estimates across all cell lines. For each gene with at least 5 cell lines containing a type II variant, we evaluated the association of log GI50 to variants in genes for all of the screened compounds. TP53, the most frequently mutated gene in the NCI-60 panel, shows strong correlation with drug response. MDM2 inhibitors are effective agents in cell lines with wild-type p53 (Fig. 4A), where they can induce cell death. Of the 15,898 compounds and 310 FDA-approved or investigational oncology drugs, the activities of 2 clinically relevant MDM2 inhibitors show strong negative correlation with mutant p53 (Fig. 4B). Nutlin-3 gives the highest statistical significance score for its activity in p53 wild-type cell lines (Fig. 4B and C). MI-219, a known MDM2 inhibitor, exhibits a similar strong negative correlation with mutant p53 (Fig. 4B). In contrast, National Service Center (NSC)-670177 (Supplementary Table S1) shows significant selectivity for the p53 mutant cells. However, the proposed p53-specific compound reactivation of p53 and induction of tumor cell apoptosis (RITA; NSC-652287; ref. 29), initially identified as a DNA cross-linking agent (30), showed little evidence of selective activity for cell lines with p53 wild-type status and only limited correlation with nutlin-3 (Supplementary Fig. S7A), questioning the claim that RITA acts specifically as a p53-reactivating compound. As for comparison, RITA displays far less selectivity for p53 wild-type cells than the classical DNA-targeted agent mithramycin. As expected, expression of the well-known components of the p53 pathway, MDM2 and miR-34a (31) correlate with p53 wild-type cell lines (Fig. 4E and F). Additional pharmacogenomic correlations between TP53 mutational status, miRNAs, mRNA transcripts, or other agents are listed in Supplementary Fig. S7B. Integrating additional genomic datasets, such as gene and miRNA expression data (18) strengthens the value in all these comprehensive datasets for the NCI-60 panel.
We further supplemented this work with cross-validated multivariate analyses. For each of the 310 FDA-approved or investigational oncology drugs, we developed a Super Learner ensemble machine-learning model predicting log GI50 based on variants in genes. We included genes with type II variants in 5 or more cell lines across the NCI-60 panel. Leave-one-out cross-validation was used to evaluate the ability of such modeling to distinguish sensitive from insensitive cell lines for individual drugs and to select active drugs for individual cell lines. We developed these 310 models for each loop of a cross-validation in which one cell line was omitted and the remaining cell lines were used as a training set. Those models were then used to predict the log GI50 values for all drugs for the omitted cell line thereby predicting the most active drugs (smallest normalized log GI50; see Materials and Methods) against this cell line (Supplementary Table S2). Using these models, we generated cross-validated receiver operating characteristic (ROC) curves for each cell line (Supplementary Fig. S8). The ROC curve plots sensitivity versus one minus specificity for identifying active drugs. The area under the curve (AUC) between the ROC curve and the diagonal line is a measure of the predictive accuracy of the WES-based models. A large AUC value for a cell line indicates that the mutation spectrum of the cell line is informative for discriminating active from inactive drugs. The set of drugs analyzed, however, contains many cytotoxics, for which the predictive model based only on mutation spectrum was poorly informative. Our models included only mutation status and did not attempt to distinguish the confounding between mutation status and cell line lineage. Further studies with comprehensive models that include copy number, transcript abundance, and methylation status should yield more accurate predictions.
The ROC curves provide valuable insight into cancer biology. For instance, among the NCI-60 melanoma cell lines, SK-MEL-2 has the lowest AUC value (Fig. 5A). This is particularly interesting because SK-MEL-2 is the only non-BRAF-V600E mutant melanoma cell line with an activating NRAS-Q61R mutation. As shown with the volcano plot in Fig. 5B, the 3 BRAF-V600E–specific inhibitors PLX-4720, vemurafenib (Fig. 5C) and SB-590885 stand out with extremely high significance and differential mean GI50 in the BRAF-mutant cell lines. All the MEK inhibitors (blue font) including selumetinib (Fig. 5D) and hypothemycin (Fig. 5E) show highly significant selectivity and differential GI50, indicating their therapeutic value in cancer cells with activated mitogen-activated protein kinase (MAPK) pathway. Notably, one compound, NSC-678518 showed extreme selectivity for the BRAF-mutated cells. NSC-678518, the anthrax lethal factor, was identified in a screen for agents with similar inhibitory profiles to another MAPK kinase inhibitor, PD098059, and shown to proteolytically inactivate such kinases (32).
Parallel studies support the value of correlating genomics and targeted agents (2, 9, 10). Figure 5C to E exemplifies that mutations in protein kinase target genes are strong indicators of response to clinically relevant targeted drugs. In addition, such observation could be generalized to key signaling pathways. Ten distinct kinase inhibitors from 3 major target classes cluster separately depending on the mutations in 6 genes: BRAF, NRAS, PIK3CA, PIK3R1, PTEN, and ERBB2 (Fig. 5F). These effects can be viewed in the context of the MAPK and phosphoinositide 3 kinase (PI3K) pathways downstream of receptor tyrosine kinases (RTK).
One of the most clinically relevant RTK is the epidermal growth factor receptor (EGFR). However, as showed by Garnett and colleagues (10), it is critical to integrate genomic mutation data with transcript levels to correlate and possibly predict drug responses. The NCI-60 provides a solid background for studying gene expression (see MDM2 example in Fig. 4E; ref. 18), and its large drug database offers unique opportunities to query drug response parameters. To test this possibility, we examined the EGFR inhibitor, erlotinib, whose activity is highly correlated with gefitinib and lapatinib in the NCI-60 (see Fig. 6 in ref. 18). Overall, high expression of EGFR (ERBB1) and ERBB2 are determinants of cellular response to erlotinib (Fig. 6B). However, the colon and central nervous system (CNS) cell lines are generally insensitive to erlotinib in spite of high EGFR and ERBB2 expression. This can be rationalized by taking into account mutations in the MAPK or PI3K pathways, a common mechanism of resistance (33), which are present in all 7 colon and 4 of 6 CNS cell lines (Fig. 6B).
Additional examples of correlations between type II variants and the 16,208 compounds, including the 310 FDA-approved or investigational oncology drugs are included in Supplementary Figs. 9 and 10. Supplementary Fig. 9 contains volcano plots for type II variants in 44 other genes of interest with the corresponding list of significant NSC numbers in supplementary Table S3. Supplementary Fig. S10 shows volcano plots for 28 selected drugs that are in clinical use or clinical trials. Together, these data again show the potential value of the NCI-60 drug and genomic databases for systems pharmacology.
The power of WES, instead of focused sequencing of preselected genes as published (9, 10), was revealed when we coincidentally found a significant correlation between a germline in-frame deletion (delCAATGT) in ATAD5 (rs72427574) in certain cell lines and their increased sensitivity to DNA-damaging agent bleomycin. In addition, zorbamycin (NSC-146208), and peplomycin (NSC-276382), which are both bleomycin analogues, show strong activity toward these cell lines. ATAD5, the human homolog of yeast ELG1, is essential for maintaining genome stability through its functions in deubiqitinating proliferating cell nuclear antigen and is known to be mutated in endometrial cancer (34–36). Genotype calls revealed 10 cell lines where 5 are heterozygous and 5 are homozygous for delCAATGT. Of the 10 cell lines, 3 are renal (ACHN, CAKI-1, RXF-393), where earlier work suggests dimethane sulfonate analogues, such as DMS612, as effective agents against renal cancer (37) and are being investigated phase I trials in renal cancer patients (#09-C-0111). Interestingly, there are additional germline variants in ATAD5, that are also present exclusively in the same set of 10 cell lines. When we looked for possible haplotypes in the Hapmap database, we discovered a region of linkage-disequilibrium spanning more than 300 kb (Fig. 7B). Therefore, this particular haplotype could be a response modifier during chemotherapy with DNA-damaging agents. These results illustrate the discovery potential of exonic variant data when integrated with previously available NCI-60 databases.
In this study, we provide WES analysis of the widely used NCI-60 cell line panel. We show that the overall pattern of mutation is strikingly divergent between cell lines, ranging from 172 to 9205 type II variants. As expected, higher variant rates are observed in MSI cell lines; but remarkably, the highest number of SNVs was observed in HCC2998, a colon cancer cell line in which we discovered a defect in the proofreading domain of POLϵ. The signature of specific carcinogens is readily discernible in lung cancer and melanoma, which show very low (0.67) and high (3.93) ti/tv ratio, respectively. Variants in established cancer genes are abundantly represented in the NCI-60, and numerous examples of variants in recently implied cancer genes are also present.
In addition to the mutational data provided in this article, substantial drug sensitivity data for tens of thousands of compounds and multiple other types of biological data are available for the NCI-60. Using straightforward approaches (see Fig. 3) together with more sophisticated analyses, we were able to show the influence of specific variants for TP53, BRAF, KRAS, NRAS, PIK3CA, PTEN, and ERBBs on the response to clinically relevant targeted agents (nutlin, vemurafenib, selumetinib, hypothemycin, rapamycin, wortmannin, perifosine, erlotinib, afatinib, lapatinib, and neratinib) and to identify aspects of those results that may merit further study. For example, even though targeted inhibitors of activated BRAF-V600E have been widely studied, the comprehensive NCI-60 datasets offers a unique opportunity to identify additional mechanisms of resistance and possibly offer novel means to overcome acquired resistance. The power of the NCI-60 WES variants is apparent from the observation that common variants in the human population may have a profound effect on drug response. Of course, our observation regarding the ATAD5 gene locus requires further studies; however, it opens up a completely new perspective on common variants and their phenotypes in the context of DNA damaging agents and the ongoing clinical trials with DMS612 (37).
In comparison to the 2 recent studies conducted with more cell lines (947 in ref. 9 and 639 in ref. 10), our study integrates far more drugs (approximately 20,000 vs. 24 in ref. 9 and 130 in ref. 10; see volcano plots in Figs. 4, 5, 7, and Supplementary Figures) and provides a comprehensive dataset of all exonic variants for the NCI-60 cell lines, whereas 1,600 genes were sequenced in ref. 9 and 64 cancer-related genes in ref. 10. Given the availability of extensive biological and pharmacological data and the vast number of NCI-60 variants identified in this study, such comprehensive analyses as performed by these 2 studies offer enormous opportunities. The WES data that we are providing for the NCI-60 also enables the vast compound activity database to be used as a resource for drug development to complement genomic studies conducted using larger cell line panels. That is, when one discovers a genomic variant as a molecular target using other cell line resources, using the WES data for the NCI-60 one can potentially identify screened compounds with selective activity for that target. We have limited our work to the exploration of certain aspects of this invaluable data, and made this dataset public for the greater community to use and analyze. This is critical for expanding our knowledge in understanding tumorigenesis and the genomic bases of drug sensitivity in years to come as many more cancer-related gene aberrations are discovered.
Importantly, the availability of this sequencing data will allow increased precision in the use of these common cell lines as experimental models and, as indicated above, expand the utility of other cell line panels for drug development. To enable this important step forward, the complete dataset is readily accessible in 2 forms, the easily searchable CellMiner database and a prefiltered, annotated Ingenuity Systems database. Through these portals, cancer investigators will be able to select precisely the cell line models most genetically suited to their research. The availability of the variant information allows the formulation and testing of hypotheses arising from the entire range of projects using the NCI-60 or its components. In conclusion, our datasets add substantial depth to the already extensive characterization of the NCI-60 tumor cell panel and provide an invaluable resource for ongoing investigations in cancer cell biology and pharmacology.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: O.D. Abaan, S. Davis, J.H. Doroshow, Y. Pommier, P.S. Meltzer
Development of methodology: O.D. Abaan, S. Davis, R. Walker, Y. Jiang, R.M. Simon, Y. Pommier, P.S. Meltzer
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): O.D. Abaan, M. Pineda
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): O.D. Abaan, E.C. Polley, S. Davis, Y.J. Zhu, S. Bilke, Y. Gindin, S.L. Holbeck, R.M. Simon, J.H. Doroshow, Y. Pommier, P.S. Meltzer
Writing, review, and/or revision of the manuscript: O.D. Abaan, E.C. Polley, S. Davis, S.L. Holbeck, R.M. Simon, J.H. Doroshow, Y. Pommier, P.S. Meltzer
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): O.D. Abaan, S. Davis, R. Walker, M. Pineda, W.C. Reinhold, J.H. Doroshow, P.S. Meltzer
Study supervision: S. Davis, J.H. Doroshow, P.S. Meltzer
The authors thank B. Kopp, NCI-Frederick, for DNA purification and validation. The authors thank the NHLBI GO Exome Sequencing Project and its ongoing studies that produced and provided exome variant calls for comparison: the Lung GO Sequencing Project (HL-102923), the WHI Sequencing Project (HL-102924), the Broad GO Sequencing Project (HL-102925), the Seattle GO Sequencing Project (HL-102926), and the Heart GO Sequencing Project (HL-103010).
This study was supported by the Division of Cancer Treatment and Diagnosis (DCTD), and the Center for Cancer Research (CCR) of the National Cancer Institute, NIH.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.