Abstract
The genome era provides two sources of knowledge to investigators whose goal is to discover new cancer therapies: first, information on the 20,000 to 40,000 genes that comprise the human genome, the proteins they encode, and the variation in these genes and proteins in human populations that places individuals at risk or that occurs in disease; second, genome-wide analysis of cancer cells and tissues, which leads to the identification of new drug targets and the design of new therapeutic interventions. Using genome resources requires the storage and analysis of large amounts of diverse information on genetic variation, gene and protein functions, and interactions in regulatory processes and biochemical pathways. Cancer bioinformatics deals with organizing and analyzing the data so that important trends and patterns can be identified. Specific gene and protein targets on which cancer cells depend can be identified. Therapeutic agents directed against these targets can then be developed and evaluated. Finally, molecular and genetic variation within a population may become the basis of individualized treatment.
Completion of the human genome sequence (1) has led to a postgenome era of a detailed exploration of the DNA sequence (the genome), DNA sequences that are transcribed into mRNA (the transcriptome), and the translated and posttranslationally modified protein sequences (the proteome). Instead of analyzing and targeting one particular gene or protein, the new paradigm in cancer research is to analyze entire sets of genes and proteins in order to identify a reasonable drug target. Multiple experimental and data analysis methodologies are being used to achieve this goal. Bioinformatics support is an essential component of this research effort.
This role for the human genome in anticancer drug design is based on the premise, well supported by experiments, that genetic variation plays a key role in cancer risk and disease outcome. New drugs are designed based on the resulting metabolic variation; analysis of the genome and transcriptome is used to identify molecular changes in cancer tissues and to find candidate proteins for therapeutic drug development. Genome analysis includes comparative genomic hybridization (2, 3) and single nucleotide polymorphism (SNP) analyses. Transcriptome analysis of cancer cells includes genome-wide gene expression profiling methods using microarrays, RNAi knockdown of gene expression (4, 5), and analysis of alternative splicing. All of these experimental approaches require the management of large data sets. Cancer bioinformatics provides support in the form of experimental design, data collection and storage, tools to display and visualize the results, and methods for data analysis. For example, our bioinformatics group at the Arizona Cancer Center has developed a public web site (http://www.biorag.org). (A comprehensive list of web sites referred to appears in the Appendix.)
Considerable effort has been made to bring molecular data to researchers in a reliable and timely manner (6–8). The main molecular sequence data repositories, concurrently updated on a daily basis, are GenBank at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/), EMBL at the European Molecular Biology Laboratory (http://www.ebi.ac.uk/embl/), and DDBJ, the DNA database of Japan (http://www.ddbj.nig.ac.jp/). The Genome browser (http://genome.ucsc.edu), Ensembl (http://www.ensembl.org), and the so-called Golden Path server (http://genome.ucsc.edu/) are also excellent resources for information on the human genome. Other resources such as UniGene and RefSeq at the National Center for Biotechnology Information web site support new gene discovery. There has also been a large community effort on database and online resource development for gene prediction, comparative genomics, sequence analysis, protein structural analysis, gene expression, transcription factors, protein-protein interactions, protein domains, and other molecular signatures (9). Special initiatives related to cancer research include the Cancer Genome Anatomy Project (http://cgap.nci.nih.gov) and the Cancer Biomedical Informatics Grid (http://cabig.nci.nih.gov) developed by the National Cancer Institute Computational Biology Group (http://ncicb.nci.nih.gov/). Examples of other useful resources for cancer research include (a) the Biomolecular Interaction Network Database (http://www.bind.ca/), (b) the Human Proteome Organisation (http://www.hupo.org), and (c) the Protein Structure Initiative (http://www.nigms.nih.gov/psi/). Drug development support and tracking of gene-drug-disease associations are provided by an interactive database at http://www.pharmgkb.org/index.jsp. Among the most important initiatives for investigating genes and disease are those that explore the relationships between disease and SNPs, discussed below.
Genetics and Cancer
As cancer develops, the diseased cells undergo a series of genetic changes that drastically alter their metabolism. Many genes are lost and others take over new roles in promoting tumor growth, invasion, metastases, and angiogenesis. Genetic variation between the same tumor type in different individuals may result in metabolic differences between one cancer tissue and another that can affect disease progression or response to therapy. Further data analysis can assist in identifying these genetic differences and designing simple tests, e.g., tissue histochemical stains, to discriminate among them. Individual tissues may then be screened to devise an optimal therapy. The role of bioinformatics in this process is to analyze sequence and molecular data in order to define these differences.
An additional role for bioinformatics in genome-based therapies is to collect information on all of the human genes and proteins. Comparative genomics is an analysis of the types of genes, gene families, and the location of genes on the chromosomes of various organisms. These studies can include the history of evolutionary rearrangements and gene duplications that are the source of genetic variation between species. Variations among these genes can be used to predict gene and protein relationships and new biochemical pathways. A recent analysis of the human genome has revealed the presence of tandem duplicated regions that represent regions of genetic instability. These regions are often associated with human disease (10). Another analysis of sequence variation between genomes, genome shadowing (11), has examined sequence variation within gene promoters of primate species in order to discover which regions are least variable and therefore most important for function. Sequence polymorphisms within such conserved regions alert the investigator to the possibility of abnormal regulation or protein function as a cause of disease.
SNP Analysis and Cancer
SNPs and other sequence variations inherited from parents in a few key genes, interacting with environmental factors, put individuals at increased cancer risk. The units of inheritance are chromosomal regions called haplotype blocks, ranging in size from a few kilobases to thousands of kilobases. During meiosis, the blocks in two parental chromosomes are combined to make a new chromosome with some blocks derived from one parent and others from the second parent. Sequence variation within these blocks provides genetic diversity in the human population. To identify these blocks, SNPs within one gene or chromosomal region are located and those SNPs that remain together within the population are identified; these define the haplotype blocks. The challenge is to design experiments, high-throughput data collection methods, and bioinformatics tools to discover the SNPs, those SNPs that impair protein function, and SNPs that influence gene regulation and expression. Once population haplotypes are known, those that are associated with cancer risk or response to therapy may be identified through biological studies and clinical trials.
SNPs are predicted to provide a major contribution to variability in disease manifestation or drug side effects. More than 2 million SNPs have been documented by the SNP consortium (12, 13) and the total number is estimated to be >10 million (14). Allele frequencies already analyzed indicate the presence of large blocks of genes in linkage disequilibrium (i.e., they are found to be associated in the human population more often than expected by chance) in the human genome (15, 16). A large effort is under way to generate a “HapMap,” a map of haplotype blocks in which particular tag SNPs are identified for each block, making genome scanning for regions that affect disease more efficient. A recently released 3.9 cM resolution human SNP linkage map and screening set promises quick genome scans to find genetic variations that affect disease (17). Because of their dense distribution across the genome, SNPs are viewed as ideal markers for large-scale genome-wide association studies to discover genes in common complex diseases, such as cancer. SNPs are also being used for the detection of loss of heterozygosity, a common form of allelic imbalance, to identify genomic regions that harbor tumor suppressor genes and to characterize different tumor types, pathologic stages, and cancer progression. This genome-wide characterization or genotyping of SNPs is done using high-density SNP arrays (18, 19) or matrix-assisted laser desorption ionization-time of flight mass spectrometry (20).
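The notion of linkage disequilibrium described above can be made concrete with a small calculation. The following is a minimal sketch, assuming phased haplotype counts are already available for two biallelic SNPs; the function name and argument layout are illustrative and do not come from any of the cited tools. It computes the disequilibrium coefficient D together with the two commonly reported normalized measures, |D'| and r².

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    Each argument is the number of phased haplotypes carrying the named
    allele pair (A/a at the first SNP, B/b at the second). These names are
    hypothetical labels chosen for this sketch.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at SNP 2
    variance_term = p_A * (1 - p_A) * p_B * (1 - p_B)
    if variance_term == 0:
        raise ValueError("both SNPs must be polymorphic")
    # D measures the excess of the AB haplotype over random association.
    d = p_AB - p_A * p_B
    # D is bounded by the allele frequencies; D' rescales it to [0, 1].
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r_squared = d * d / variance_term
    return d, d_prime, r_squared


# Two SNPs that always travel together versus two that assort independently.
linked = linkage_disequilibrium(50, 0, 0, 50)
unlinked = linkage_disequilibrium(25, 25, 25, 25)
```

Pairs of SNPs with high r² carry largely redundant information, which is the reason a single tag SNP can stand in for a whole haplotype block in a genome scan.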
In a complex disease such as cancer, a common approach to SNP analysis is to choose a candidate gene, screen for SNPs, and then determine haplotypes, haplotype frequencies, and the risk (disease association or drug response) associated with each haplotype. Some SNPs may influence risk differently in different tissues. For example, genetic polymorphisms in the NAT2 gene are known to modify aromatic amine metabolism and thereby affect susceptibility to colon cancer and urinary bladder cancer; a NAT2 rapid acetylation phenotype is associated with a high risk for colon cancer and a slow acetylation phenotype is associated with a risk of bladder cancer (21). In such studies, care must be taken to ensure that the results are statistically significant.
Several considerations must be taken into account in designing a SNP experiment. First, the sample size must be large enough and an appropriate statistical analysis is needed. Because these needs are often not met, SNP results frequently fail to be reproduced (14, 22). Second, SNP identification should include a randomized scan of a large population in genetic equilibrium across all genes that might affect risk. Third, when SNPs in a single candidate gene are analyzed for risk effects, SNPs in other genes that interact with the candidate gene may also influence outcome and can become uncontrolled sources of variation in the risk analysis. These difficulties in SNP analysis underscore the need for careful experimental design and data analysis.
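The dependence of a SNP association result on sample size can be illustrated with a simple allelic test. The sketch below is an assumption made for illustration, not the analysis used in the cited studies: it applies a Pearson chi-square test with one degree of freedom to a 2 x 2 table of allele counts in cases and controls. The function name and the example counts are hypothetical.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df) on a 2 x 2 table of allele counts.

    Returns (chi_square, p_value). No continuity correction is applied.
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom the upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Same allele frequencies (60% vs 40% minor allele), different sample sizes.
large_study = allelic_chi_square(60, 40, 40, 60)   # 100 alleles per group
small_study = allelic_chi_square(6, 4, 4, 6)       # 10 alleles per group
```

With identical allele frequencies, the larger study yields a chi-square statistic ten times that of the smaller one, so the same underlying effect can appear significant in one study and not in another.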
Appropriate statistical analysis of SNP data and identification of signature SNPs for a particular haplotype block are active areas of research in bioinformatics. Dynamic programming, a method also used for finding the optimal alignment of sequences, is used for this purpose (23, 24). Public data repositories and informatics tools for display, visualization, linkage analysis, and haplotyping are available, for example, dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), the SNP consortium (http://snp.cshl.org/), linkage analysis tools (http://biosun1.harvard.edu/complab/dchip/snp.htm), and web pages with extensive links (http://hgvbase.cgb.ki.se/databases.htm); new resources are being developed and published regularly. To find additional web sites that support these types of analyses, we recommend performing web searches with terms such as linkage disequilibrium analysis, haplotype blocks, and SNP analysis, or with more specific terms such as the name of a program, e.g., SNPsim.
Alternative Splicing
Alternative splicing is another mechanism that introduces variation in gene expression and alters proteins in cancer tissues, in this case by changing the positions of exon-intron and intron-exon junctions. More than 40% of genes in the human genome are believed to undergo alternative splicing (25, 26). Genes associated with cancer (27) can produce multiple splice variants. The process of alternative splicing is highly regulated and disruption of a splicing pattern can produce splice variants that have different functions. There is much interest in finding which genes show alternative splicing, analyzing the tissue specificity of alternative splice forms, and identifying the resulting protein isoforms. Evidence that alternative splicing might have an impact on drug efficacy or toxicity (28) in chemotherapy has made alternative splice variants potential drug targets (27, 29). Tissue-specific mRNA variants can also act as regulatory RNAs that influence expression of other genes. Gene expression profiling using microarrays (30) is used for identification of these variants and other types of molecular profiling of cancer cells and tissues. Databases of splice variants (e.g., http://www.bioinf.mdc-berlin.de/igms/; and see the current January database issue of Nucleic Acids Research) and bioinformatics tools (31) have been developed for use in exploring relationships between alternative splicing and cancer.
Microarrays
Microarrays have emerged as a powerful tool for global analysis of gene expression, comparative genomic hybridization, DNA-protein interactions, CpG island methylation, and SNP analysis (2). Gene expression profiling presents a new approach to cancer therapeutics. The common goal is to understand and examine the biological responses and processes that can discriminate tumor cells and tissues, classify tumor stages, and predict differential responses in patients.
There are two basic approaches towards analysis of microarray data: an exploratory approach in which the data are analyzed to determine whether genes or biological samples exhibit a similar pattern of expression, and a classifier approach in which there is a search for genes that can distinguish known classes, e.g., cancer tissues from normal ones. The former approach is often referred to as an unsupervised analysis because there is no constraint on which samples should be treated as similar; the latter is a supervised analysis because the samples are assigned to known classes. The usual computational approach is to discover the optimal number of classes using an unsupervised learning procedure, i.e., to group cells or tissues with similar patterns of gene expression based on a correlation or distance score (32). There are various methods for partitioning samples into groups based on their gene expression profiles. A common unsupervised method is hierarchical pairwise clustering (33) based on average linkage between clusters to identify the most closely related classes in a tree representation of the relationships (33–35). Hierarchical clustering has been successful in the molecular classification of signature expression profiles of distinct types of large B cell lymphoma (36), benign and malignant prostate cancer tissues (37), and kidney tumors that can predict survival in patients with metastatic renal cell cancer (38). There are many additional clustering methods that have been applied to cancer-related data sets; these methods provide the means to search for different types of patterns based on experimental objectives (39). Clustering patterns should be evaluated to determine how well they are supported by the data and whether or not other clustering patterns would fit the data as well. New methods for evaluating and improving clusters (40, 41) have been introduced.
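Average-linkage hierarchical clustering can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the implementation used in the cited studies: it uses one minus the Pearson correlation as the distance score, recomputes linkages naively at every step, and expects every profile to have non-zero variance. The sample names and expression values are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(profiles):
    """Cluster samples; return the merges as (cluster_a, cluster_b, distance).

    `profiles` maps a sample name to its list of expression values.
    """
    names = list(profiles)
    # Pairwise distance between samples: 1 - correlation.
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Average of all sample-to-sample distances between two clusters.
        pair_sum = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return pair_sum / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical profiles: four genes measured in four tissue samples.
example = {
    "tumorA": [1.0, 2.0, 3.0, 4.0],
    "tumorB": [2.0, 4.0, 6.0, 8.5],
    "normalA": [4.0, 3.0, 2.0, 1.0],
    "normalB": [8.0, 6.0, 4.5, 2.0],
}
tree = average_linkage(example)
```

The list of merges, read from first to last, describes the tree from its leaves to its root; the merge distances give the branch heights at which clusters join.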
Other unsupervised methods that have been used to discover patterns in gene expression profiles from various tumor tissues are principal component analysis (42), self-organizing maps (43), multidimensional scaling (44, 45), and singular value decomposition (46). Although clustering and other data reduction and pattern discovery methods can separate distinct samples into clusters based on their expression and are useful in class discovery, they have not been found to be effective for class comparison and prediction (47). Class prediction is inherent to cancer research and prognosis, and supervised learning methods have proven to be efficient in predicting various classes and types of cancer.
Supervised learning methods involve building a model to classify samples. They require a training set of samples of predefined classes from which gene classifiers are derived and a test set that is used to validate the classifiers. Sometimes, the same set is used for both training and validation, with one or more members of the set being left out of the training step and used for validation (47). The prototype supervised method, linear discriminant analysis, has been used by statisticians for tumor classification using gene expression data (48). Other methods, including neighborhood analysis and weighted voting (49, 50), have been used for classification of tissue samples. Support vector machines, which provide data transformation methods for improved separation of samples (51, 52), have been used to predict classes in ovarian cancer tissues and diffuse large B cell lymphomas. K-nearest neighbor prediction (53–55) has also been used for multiclass cancer expression analysis. Trained neural network models have shown promise in tumor classification and diagnosis prediction (56, 57), and decision trees have been built for discriminating among distinct colon cancer tissues (58). Genetic algorithms represent another tool for exploring classification of biological phenotypes (59). A novel, computationally intense method, called the small features set algorithm, searches for patterns of small numbers of genes that discriminate between classes (60, 61).
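The leave-one-out validation scheme mentioned above can be sketched with a k-nearest neighbor classifier. This is a minimal illustration with invented data and hypothetical function names, assuming Euclidean distance between expression vectors and a simple majority vote; it is not the procedure of any cited study.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict a class label by majority vote among the k nearest samples.

    `training` is a list of (expression_vector, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the guess."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if knn_predict(training, vector, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical two-gene expression values for tumor and normal tissues.
tissues = [
    ([8.1, 1.0], "tumor"), ([7.9, 1.4], "tumor"),
    ([8.4, 0.8], "tumor"), ([7.6, 1.1], "tumor"),
    ([2.0, 6.9], "normal"), ([2.3, 7.2], "normal"),
    ([1.8, 7.5], "normal"), ([2.1, 6.6], "normal"),
]
accuracy = leave_one_out_accuracy(tissues, k=3)
```

If genes are selected before classification, the selection step belongs inside the loop so that the held-out sample never influences which genes are chosen; otherwise the accuracy estimate is optimistically biased.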
A general problem with supervised learning methods is that of “overfitting the data” to the training samples (47, 62) so that the classifiers found are only useful for classifying those samples but not other, more variable, samples. Several classification methods have been compared (48, 52, 54, 63) without a clear consensus on the best methods. The underlying problem is that the number of samples is much lower than the number of potential predictors; hence, there is seldom a single, unique set of classifiers.
The Complexity of Microarray Data and the Need for Appropriate Statistical Analysis
There are many sources of noise and experimental variation in expression microarray data. Noise and signal variation are introduced at almost every stage of the experiment: in sample preparation, RNA labeling, hybridization, and signal and background measurement, as well as through variation between biological samples. The combined effect is to make the task of obtaining a list of significantly varying genes a difficult one, with both false-negative predictions (varying genes that are missed) and false-positive predictions (genes that are incorrectly predicted as varying) being reported. The expected proportion of false positives among the genes reported as varying is called the false discovery rate. There is no one simple solution to these problems and there is an ongoing debate among statisticians regarding the most suitable experimental design and data analysis.
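One widely used way to control the false discovery rate when thousands of genes are tested at once is the Benjamini-Hochberg step-up procedure. The text above does not name a specific method, so the sketch below is offered as an illustrative assumption; the function name and the example p-values are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR.

    Step-up rule: find the largest rank r (1-based, in ascending p-value
    order) with p_(r) <= fdr * r / m, then reject the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical per-gene p-values from a differential expression test.
gene_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                 0.060, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(gene_p_values, fdr=0.05)
```

A plain cutoff of p < 0.05 would report five of these ten genes; the procedure is more conservative because it accounts for the number of tests performed.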
Most importantly, microarrays require carefully planned and executed experiments and a competently done statistical and computational analysis. Experimental design and data analysis should be based on a clear statement of the experimental goals. Design should identify and measure sources of variation so that appropriate statistical analyses can be done and the genes that are varying significantly can be identified. A general guideline is to use at least three to four biological replicates and, if cost is a consideration, to reduce the number of treatments or time points in favor of biological replication. In experiments with tissues, samples from patients in groups that differ in drug sensitivity, clinical outcome, or clinical diagnosis can assist in understanding the underlying variations in gene expression within each group and whether differences between groups are greater than this within-group amount. RNA extracted from each tissue may be used in a separate hybridization reaction. For experiments with cancer cell lines, biological replication involves repeating the experiment to produce multiple RNA preparations from the same treatment. With the inclusion of sufficient replication in an experiment and an appropriate statistical analysis, a list of genes that are varying most significantly can be obtained along with a statistical probability of being correct.
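The comparison of between-group differences with within-group variation described above is what a two-sample t statistic formalizes. The following is a minimal sketch, assuming log-scale expression values for one gene and using Welch's unequal-variance form as an illustrative choice; the function name and data are hypothetical, and the p-value lookup from the t distribution is left out.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two groups.

    Each group is a list of replicate measurements for one gene. At least
    two biological replicates per group are needed to estimate variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two replicates")
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("no within-group variation to compare against")
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    t = difference / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Three hypothetical biological replicates per treatment for one gene.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With only three replicates per group the degrees of freedom are small, which is why additional biological replicates add more statistical power than additional treatments or time points.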
One of the most important contributions to microarray data analysis has been the application of tried and true statistical methods including linear and nonlinear models, ANOVA, and the application of sophisticated background and data normalization methods (64–67). For example, the limma (Linear Models for Microarray Data) package in Bioconductor (http://www.bioconductor.org), which is based on the open source R statistical programming language, supports many of these statistical methods (68). The Bioconductor web site includes many other useful packages for the analysis of microarray data. There are many other microarray data analysis packages that are readily found by web searching (69). BRB-ArrayTools (http://linus.nci.nih.gov/BRB-ArrayTools.html), SAM (http://www-stat.stanford.edu/~tibs/SAM/), and GeneCluster (http://www.broad.mit.edu/cancer/software/genecluster2/gc2.html) are examples of commonly used ones.
Extraction of Biological Information from Microarray Experiments
One of the most difficult tasks faced by the biologist is deciding what to do with a long list of genes from an expression microarray experiment. What is the biological relevance of these results? One approach is to take the list of genes and examine them for gene ontology, biochemical function, known biochemical and regulatory relationships, and known protein-protein and gene-gene interactions. If genes that show increased or decreased expression in cancer cells act in a known pathway, regulatory circuit, or protein complex, then that function may be altered. This expectation is increased if two or more genes are in the same pathway or complex. A number of tools have been developed for this purpose (69). One such tool is Pathway Miner (70), which produces spreadsheets of gene and pathway information and interactive graphs that depict known regulatory relationships among the up-regulated and down-regulated genes. An example of a gene network produced by this tool is shown in Fig. 1.
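The expectation that a pathway is altered when several of its genes appear in the list can be quantified with a hypergeometric test. The sketch below is an illustration under stated assumptions and does not describe how Pathway Miner or any other cited tool works: it asks how likely it is to see at least the observed number of pathway genes in a list of the given size drawn at random from all genes on the array. All names and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes_on_array, n_in_pathway, n_in_list, n_overlap):
    """Upper-tail hypergeometric probability of at least n_overlap hits.

    n_genes_on_array: all genes measured (the background)
    n_in_pathway:     background genes annotated to the pathway
    n_in_list:        genes in the differentially expressed list
    n_overlap:        list genes that belong to the pathway
    """
    max_overlap = min(n_in_pathway, n_in_list)
    n_outside = n_genes_on_array - n_in_pathway
    total_ways = comb(n_genes_on_array, n_in_list)
    tail_ways = sum(
        comb(n_in_pathway, k) * comb(n_outside, n_in_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail_ways / total_ways


# Hypothetical case: 40 pathway genes on a 10,000-gene array, and 6 of them
# appear in a list of 200 differentially expressed genes.
p_enriched = enrichment_p_value(10000, 40, 200, 6)
```

When many pathways are tested in this way, the resulting p-values need correction for multiple testing before any pathway is singled out.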
Extraction of biological information from gene expression experiments should reveal those genes, proteins, and biochemical pathways in cancer cells and tissues that have altered expression compared with normal cells and tissues. Similar information may be found from the altered expression of genes between different subphenotypes or prognosis categories. Some specific examples of the rationale behind using these genes as drug targets are shown in Fig. 2. The main goal in these methods is to use high-throughput technologies to discover and then exploit genetic alterations in cancer cells that lead to their dependence on certain genes and proteins for altered function and metabolism. The bioinformatics tools described in this minireview can be used effectively to help identify and exploit these vulnerabilities.
Other more complex methods for the analysis of microarray data have been used in an attempt to discover previously unknown gene relationships that may be of future use for finding drug targets. One totally different approach towards microarray data analysis is to start with known information about genes before performing an analysis. Prediction methods can discover complex gene relationships if more information on genes is incorporated in the analysis. As a simple example, a list of genes that might be coregulated based on the presence of conserved DNA sequence patterns or transcription factor binding sites may first be produced. The promoter list may then be compared with a second list of genes produced by statistical analysis of microarray data to determine if the genes are varying in expression to a similar degree. This analysis might assist in finding abnormal regulatory changes that can be further exploited in drug design. More complex analyses have been done. Molecular interactions and relevance networks have been extracted from clinical data using mutual information from genes (71), genome data, and probabilistic graph models learned by integrating prior information about genes (72, 73). Inferring complex regulatory networks from expression data has been an area of bioinformatics research (72–76) but has proven to be a difficult undertaking. The network models that are produced often are in agreement with experimental data but provide very little new insight into new regulatory relationships. This result is largely due to the underlying complexity of the regulatory interactions among genes. These models are useful, however, and their improvement and eventual application should lead to a better understanding of the systemic response of cells to drugs.
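The mutual information score used to build relevance networks can be sketched as follows. This is a simplified illustration, assuming equal-width binning of expression values and an arbitrary number of bins; it is not the published procedure, and the function names and data are invented. Gene pairs whose score exceeds a chosen threshold would be linked by an edge in the network.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=3):
    """Mutual information (in bits) between two expression profiles."""
    bins_x, bins_y = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bins_x)
    count_x, count_y = Counter(bins_x), Counter(bins_y)
    count_xy = Counter(zip(bins_x, bins_y))
    score = 0.0
    for (i, j), joint in count_xy.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with counts.
        score += (joint / n) * math.log2(joint * n / (count_x[i] * count_y[j]))
    return score


# Hypothetical expression of two genes across six samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
shared_information = mutual_information(gene_a, gene_b)
```

Unlike a correlation coefficient, mutual information also detects nonlinear dependence, but the estimate is sensitive to the number of bins and to small sample sizes.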
Need for Future Bioinformatics Tools and Resources
The above approaches in cancer research can only be realized to their fullest potential with the continued development of bioinformatics tools and resources. The present availability of information on the human genome and proteome has depended heavily on previous progress in bioinformatics. Future development in these areas will include the efficient management and organization of biological and medical information into well-defined data objects that can be readily integrated and mined for relationships. These technologies will greatly assist in the development of new anticancer drugs. For example, in the Arizona Cancer Center, we have established cancer tissue repositories as relational databases in which clinical (subject to the appropriate regulatory requirements), genetic, and molecular data can be stored. Data from any experiment or analysis with tissues may then be entered into or referenced by this database. This information environment supports all subsequent tissue data analysis including information on the genetic, molecular, and biochemical changes in cancer cells. The Cancer Biomedical Informatics Grid (http://cabig.nci.nih.gov/) is being developed to support the development and sharing of these types of complex data objects between cancer centers.
Drug discovery also relies heavily on well-integrated data management in order to identify drug targets and determine drug-gene interactions. A good example is the G protein-coupled receptors that are used as therapeutic targets and represent almost 50% of all drug targets. For some of these receptors, the ligands remain unknown. The challenge is to evaluate the role of these G protein-coupled receptors in normal physiology and disease, identify their specific ligands, and develop new therapeutics based on this information (77). Integrated bioinformatics approaches for analyzing these target genes, their nucleotide polymorphisms, protein structures, protein-protein interactions, and protein modification sites for degradation, activation, and sorting of the receptors are fundamental to understanding the response of individuals to such drugs. By providing integrated resources and knowledge databases, bioinformatics will greatly assist in the development and application of new anticancer drugs.
The connection between genes, proteins, and disease can best be understood using a systems biology approach in which genomic, proteomic, molecular, and genetic data are integrated with laboratory research. Cross-fertilization between data analysis and experiment will help to define the molecular processes involved in carcinogenesis in greater detail than is presently available and will ultimately lead to new therapeutic targets and treatments.
Appendix
List of web sites cited in article
http://biosun1.harvard.edu/complab/dchip/snp.htm, SNP analysis software.
http://cabig.nci.nih.gov/, Cancer Biomedical Informatics Grid.
http://cgap.nci.nih.gov, Cancer Genome Anatomy Project.
http://genome.ucsc.edu/, Golden Path human genome resource.
http://hgvbase.cgb.ki.se/databases.htm, list of SNP-related databases.
http://linus.nci.nih.gov/BRB-ArrayTools.html, statistical tools for microarray analysis.
http://ncicb.nci.nih.gov/, National Cancer Institute Computational Biology.
http://snp.cshl.org/, SNP consortium.
http://www.bioconductor.org, statistical analysis software.
http://www.bioinf.mdc-berlin.de/igms/, database of splice variants.
http://www.biorag.org, genome tools and databases.
http://www.broad.mit.edu/cancer/software/genecluster2/gc2.html, clustering tools for microarray data.
http://www.ddbj.nig.ac.jp/, Japanese sequence database.
http://www.ebi.ac.uk/embl/, European sequence database.
http://www.ensembl.org, genome browser.
http://www.hupo.org, human proteome project.
http://www.ncbi.nlm.nih.gov/, Genbank home.
http://www.ncbi.nlm.nih.gov/projects/SNP/, dbSNP database.
http://www.nigms.nih.gov/psi/, protein structure initiative.
http://www.pharmgkb.org/index.jsp, pharmacogenomics database.
http://www-stat.stanford.edu/~tibs/SAM/, statistical tools for microarray analysis.
Grant support: National Cancer Institute grants, particularly the Arizona Cancer Center Core Grant and startup funds provided by Dr. Daniel Von Hoff; Southwest Environmental Health Sciences Center Core Grant and startup funds from Dr. Serrine Lau.