The genome era provides two sources of knowledge to investigators whose goal is to discover new cancer therapies: first, information on the 20,000 to 40,000 genes that comprise the human genome, the proteins they encode, and the variation in these genes and proteins in human populations that places individuals at risk or that occurs in disease; second, genome-wide analyses of cancer cells and tissues that lead to the identification of new drug targets and the design of new therapeutic interventions. Using genome resources requires the storage and analysis of large amounts of diverse information on genetic variation, gene and protein functions, and interactions in regulatory processes and biochemical pathways. Cancer bioinformatics deals with organizing and analyzing the data so that important trends and patterns can be identified. Specific gene and protein targets on which cancer cells depend can be identified. Therapeutic agents directed against these targets can then be developed and evaluated. Finally, molecular and genetic variation within a population may become the basis of individualized treatment.

Completion of the human genome sequence (1) has led to a postgenome era of a detailed exploration of the DNA sequence (the genome), DNA sequences that are transcribed into mRNA (the transcriptome), and the translated and posttranslationally modified protein sequences (the proteome). Instead of analyzing and targeting one particular gene or protein, the new paradigm in cancer research is to analyze entire sets of genes and proteins in order to identify a reasonable drug target. Multiple experimental and data analysis methodologies are being used to achieve this goal. Bioinformatics support is an essential component of this research effort.

This role for the human genome in anticancer drug design is based on the premise, well supported by experiments, that genetic variation plays a key role in cancer risk and disease outcome. New drugs are designed based on the resulting metabolic variation; analysis of the genome and transcriptome is used to identify molecular changes in cancer tissues and to find candidate proteins for therapeutic drug development. Genome analysis includes comparative genome hybridization (2, 3) and single nucleotide polymorphism (SNP) analyses. Transcriptome analysis of cancer cells includes genome-wide gene expression profiling methods using microarrays, RNAi knockdown of gene expression (4, 5), and analysis of alternative splicing. All of these experimental approaches require the management of large data sets. Cancer bioinformatics provides support in the form of experimental design, data collection and storage, tools to display and visualize the results, and methods for data analysis. For example, our bioinformatics group at the Arizona Cancer Center has developed a public web site (http://www.biorag.org; a comprehensive list of web sites referred to appears in the Appendix) with interactive databases and information on the sequence and function of all known human genes and proteins, and their homologues from other model organisms. This site may be used to probe the biological significance of experiments that generate large data sets involving changes in many genes, as in an expression microarray experiment.

Considerable effort has been made to bring molecular data to researchers in a reliable and timely manner (6–8). The main molecular sequence data repositories, concurrently updated on a daily basis, are GenBank at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/), EMBL at the European Molecular Biology Laboratory (http://www.ebi.ac.uk/embl/) and DDBJ, the DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/). The Genome browser (http://genome.ucsc.edu), Ensembl (http://www.ensembl.org), and the so-called Golden Path server (http://genome.ucsc.edu/) are also excellent resources for information on the human genome. Other resources such as UniGene and RefSeq at the National Center for Biotechnology Information web site support new gene discovery. There has also been a large community effort on database and online resource development for gene prediction, comparative genomics, sequence analysis, protein structural analysis, gene expression, transcription factors, protein-protein interactions, protein domains, and other molecular signatures (9). Special initiatives related to cancer research include the Cancer Genome Anatomy Project (http://cgap.nci.nih.gov) and the Cancer Biomedical Informatics Grid (http://cabig.nci.nih.gov) developed by the National Cancer Institute Computational Biology Group (http://ncicb.nci.nih.gov/). Examples of other useful resources for cancer research include (a) the Biomolecular Interaction Network Database (http://www.bind.ca/), (b) the Human Proteome Organisation (http://www.hupo.org), and (c) the Protein Structure Initiative (http://www.nigms.nih.gov/psi/). Drug development support and tracking of gene-drug-disease associations are provided by an interactive database at http://www.pharmgkb.org/index.jsp. Among the most important initiatives for investigating genes and disease are those that explore the relationships between disease and SNPs discussed below.

As cancer develops, the diseased cells undergo a series of genetic changes that drastically alter their metabolism. Many genes are lost and others take over new roles in promoting tumor growth, invasion, metastases, and angiogenesis. Genetic variation between the same tumor type in different individuals may result in metabolic differences between one cancer tissue and another that can affect disease progression or response to therapy. Further data analysis can assist in identifying these genetic differences and designing simple tests, e.g., tissue histochemical stains, to discriminate among them. Individual tissues may then be screened to devise an optimal therapy. The role of bioinformatics in this process is to analyze sequence and molecular data in order to define these differences.

An additional role for bioinformatics in genome-based therapies is to collect information on all of the human genes and proteins. Comparative genomics is an analysis of the types of genes, gene families, and the location of genes on the chromosomes of various organisms. These studies can include the history of evolutionary rearrangements and gene duplications that are the source of genetic variation between species. Variations among these genes can be used to predict gene and protein relationships and new biochemical pathways. A recent analysis of the human genome has revealed the presence of tandem duplicated regions that represent regions of genetic instability. These regions are often associated with human disease (10). Another analysis of sequence variation between genomes, genome shadowing (11), has examined sequence variation within gene promoters of primate species in order to discover which regions are least variable and therefore most important for function. Sequence polymorphisms within such conserved regions alert the investigator to the possibility of abnormal regulation or protein function as a cause of disease.

SNPs and other sequence variations inherited from parents in a few key genes, interacting with environmental factors, put individuals at increased cancer risk. The units of inheritance are chromosomal regions called haplotype blocks ranging in size from a few kilobases to thousands of kilobases. During meiosis, the blocks in two parental chromosomes are combined to make a new chromosome with some blocks derived from one parent and others from the second parent. Sequence variation within these blocks provides genetic diversity in the human population. To identify these blocks, SNPs within one gene or chromosomal region are located and those SNPs that remain together within the population are identified; these define the haplotype blocks. The challenge is to design experiments, high-throughput data collection methods, and bioinformatics tools to discover the SNPs, those SNPs that impair protein function, and SNPs that influence gene regulation and expression. Once population haplotypes are known, those that are associated with cancer risk or response to therapy may be identified through biological studies and clinical trials.

SNPs are predicted to provide a major contribution to variability in disease manifestation or drug side effects. More than 2 million SNPs have been documented by the SNP consortium (12, 13) and the total number is estimated to be >10 million (14). Allele frequencies already analyzed indicate the presence of large blocks of genes in linkage disequilibrium (i.e., they are found to be associated in the human population more often than expected by chance) in the human genome (15, 16). A large effort is under way to generate a “Hapmap,” a map of haplotype blocks with particular SNP tags for each block being identified which make genome scanning for regions that affect disease more efficient. A recently released 3.9 cM resolution human SNP linkage map and screening set promises quick genome scans to find genetic variations that affect disease (17). Because of their dense distribution across the genome, SNPs are viewed as ideal markers for large-scale genome-wide association studies to discover genes in common complex diseases, such as cancer. SNPs are also being used for the detection of loss of heterozygosity, a common form of allelic imbalance, to identify genomic regions that harbor tumor suppressor genes and to characterize different tumor types, pathologic stages and cancer progression. This genome-wide characterization or genotyping of SNPs is done using high-density SNP arrays (18, 19) or matrix-assisted laser desorption ionization-time of flight mass spectrometry (20).
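The notion of linkage disequilibrium used above can be made concrete with a small calculation. The following is a minimal sketch, assuming phased haplotypes for two biallelic SNPs are already available; the function name, the allele labels "A"/"a" and "B"/"b", and the input format are illustrative assumptions and are not taken from the cited studies. It computes the standard measures D, D' and r², which quantify how far the observed two-SNP haplotype frequency departs from the frequency expected if the alleles were independent.

```python
from collections import Counter

def linkage_disequilibrium(haplotypes):
    """Return (D, D_prime, r_squared) for two biallelic SNPs.

    `haplotypes` is a list of 2-character strings such as "AB" or "ab":
    the allele at SNP 1 followed by the allele at SNP 2 on one phased
    chromosome. The allele labels are placeholders for illustration.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n  # freq. of allele A
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n  # freq. of allele B
    p_ab = counts["AB"] / n                                     # freq. of haplotype AB

    # D is the excess of the observed AB frequency over the independence expectation.
    d = p_ab - p_a * p_b

    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared
```

Two SNPs that always travel together give r² = 1, so one of them can serve as a tag for the other; SNPs in equilibrium give D = 0.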

In a complex disease such as cancer, a common approach to SNP analysis is to choose a candidate gene, screen for SNPs, and then determine haplotypes, haplotype frequencies, and the risk (disease association or drug response) associated with each haplotype. Some SNPs may influence risk differently in different tissues. For example, genetic polymorphisms in the NAT2 gene are known to modify aromatic amine metabolism and thereby affect susceptibility to colon cancer and urinary bladder cancer; a NAT2 rapid acetylation phenotype is associated with a high risk for colon cancer and a slow acetylation phenotype is associated with a risk of bladder cancer (21). In such studies, care must be taken to ensure that the results are statistically significant.

Several considerations must be taken into account in designing a SNP experiment. First, the sample size must be large enough and an appropriate statistical analysis is needed. Due to failure to meet these needs, SNP results are often not reproduced successfully (14, 22). Second, SNP identification should include a randomized scan of a large population in genetic equilibrium across all genes that might affect risk. Third, when SNPs in a single candidate gene are analyzed for risk effects, SNPs in other genes that interact with the candidate gene may also influence outcome and can become uncontrolled sources of variation in the risk analysis. These difficulties in SNP analysis underscore the need for careful experimental design and data analysis.
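The dependence of a SNP association result on sample size can be illustrated with a simple allelic case-control test. The sketch below is an assumption-laden illustration rather than a method from the cited studies: it applies a Pearson chi-square test with 1 degree of freedom to a 2 x 2 table of allele counts, and uses a Bonferroni correction as one simple (and conservative) way to account for testing many SNPs. The function names and the example counts are hypothetical.

```python
import math

def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    Assumes every row and column total is non-zero.
    Returns (chi_square, p_value).
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests

# Same allele frequencies (30% vs 10%), different sample sizes.
large = allelic_chi_square(30, 70, 10, 90)   # 100 alleles per group
small = allelic_chi_square(3, 7, 1, 9)       # 10 alleles per group
print("large study:", large)
print("small study:", small)
```

With identical allele frequencies, the larger hypothetical study yields a much smaller p-value than the smaller one, which is the practical reason underpowered SNP studies often fail to replicate.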

Appropriate statistical analysis of SNP data and identification of signature SNPs for a particular haplotype block are active areas of research in bioinformatics. Dynamic programming, a method used for finding the optimal alignment of sequences, is used for this purpose (23, 24). Public data repositories and informatics tools for display, visualization, linkage analysis, and haplotyping are available. Examples include dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), the SNP consortium (http://snp.cshl.org/), linkage analysis tools (http://biosun1.harvard.edu/complab/dchip/snp.htm), and web pages with extensive links (http://hgvbase.cgb.ki.se/databases.htm); new resources are being developed and published regularly. To find additional web sites that support these types of analyses, we recommend performing web searches with terms such as linkage disequilibrium analysis, haplotype blocks, and SNP analysis, or with more specific terms such as the name of a program, e.g., SNPsim.
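To show how dynamic programming applies to haplotype blocks, the following sketch partitions a row of SNPs into the fewest contiguous blocks such that each block shows limited haplotype diversity. This is a deliberately simplified objective chosen for illustration; it does not reproduce the algorithms of the cited work (23, 24), and the function name, the `max_distinct` parameter, and the 0/1 allele coding are assumptions.

```python
def partition_into_blocks(haplotypes, max_distinct=3):
    """Split SNP positions into the fewest contiguous low-diversity blocks.

    `haplotypes` is a list of equal-length strings, one per chromosome,
    with one character per SNP. A block [i, j) is acceptable when the
    chromosomes show at most `max_distinct` different patterns over it.
    Returns a list of (start, end) half-open intervals.
    """
    n_snps = len(haplotypes[0])
    inf = float("inf")
    best = [0] + [inf] * n_snps   # best[j]: fewest blocks covering SNPs 0..j-1
    back = [0] * (n_snps + 1)     # back[j]: start of the last block in that optimum

    for j in range(1, n_snps + 1):
        for i in range(j - 1, -1, -1):
            distinct = {h[i:j] for h in haplotypes}
            # A single-SNP block is always allowed so that a solution exists.
            if len(distinct) > max_distinct and j - i > 1:
                break  # extending the block leftwards can only add patterns
            if best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i

    # Trace back from the last SNP to recover the block boundaries.
    blocks, j = [], n_snps
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    return blocks[::-1]
```

For four chromosomes formed by shuffling two ancestral segments, e.g. `["00000000", "00001111", "11110000", "11111111"]` with `max_distinct=2`, the recursion places the block boundary at the point where the segments were recombined.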

Another mechanism that introduces variation in gene expression and alters proteins in cancer tissues by changing the positions of exon-intron and intron-exon junctions is alternative splicing. More than 40% of genes in the human genome are believed to undergo alternative splicing (25, 26). Genes associated with cancer (27) can produce multiple splice variants. The process of alternative splicing is highly regulated and disruption of a splicing pattern can produce splice variants that have different functions. There is much interest in finding which genes are showing alternative splicing, analyzing the tissue specificity of alternative splice forms, and identifying the resulting protein isoforms. Evidence that alternative splicing might have an impact on drug efficacy or toxicity (28) in chemotherapy has made alternative splice variants potential drug targets (27, 29). Tissue-specific mRNA variants can also act as regulatory RNAs that influence expression of other genes. Gene expression profiling using microarrays (30) is used for identification of these variants and other types of molecular profiling of cancer cells and tissues. Databases of splice variants (e.g., http://www.bioinf.mdc-berlin.de/igms/; and see the current January database issue of Nucleic Acids Research) and bioinformatics tools (31) have been developed for use in exploring relationships between alternative splicing and cancer.

Microarrays have emerged as a powerful tool for global analysis of gene expression, cancer genome hybridization, DNA-protein interactions, CpG island methylation, and SNP analysis (2). Gene expression profiling presents a new way of conducting cancer therapeutics. The common goal is to understand and examine the biological responses and processes that can discriminate tumor cells and tissues, classify tumor stages, and predict differential responses in patients.

There are two basic approaches towards analysis of microarray data: an exploratory approach, in which the data are analyzed to determine whether genes or biological samples exhibit a similar pattern of expression, and a classifier approach, in which there is a search for genes that can distinguish known classes, e.g., cancer tissues from normal ones. The former approach is often referred to as an unsupervised analysis because there is no constraint on which samples should be treated as similar; the latter is a supervised analysis because the samples are treated as members of known classes. The usual computational approach is to discover the optimal number of classes using an unsupervised learning procedure, i.e., to group cells or tissues with similar patterns of gene expression based on a correlation or distance score (32). There are various methods for partitioning samples into groups based on their gene expression profiles. A common unsupervised method is hierarchical pairwise clustering (33) based on average-linkage between clusters to identify the most closely related classes in a tree representation of the relationships (33–35). Hierarchical clustering has been successful in the molecular classification of signature expression profiles of distinct types of large B cell lymphoma (36), benign and malignant prostate cancer tissues (37), and kidney tumors that can predict survival in patients with metastatic renal cell cancer (38). There are many additional clustering methods that have been applied to cancer-related data sets; these methods provide the means to search for different types of patterns based on experimental objectives (39). Clustering patterns should be evaluated to determine how well they are supported by the data and whether or not other clustering patterns would fit the data as well. New methods for evaluating and improving clusters (40, 41) have been introduced.
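Average-linkage hierarchical clustering with a correlation-based distance can be sketched in a few lines. The code below is a minimal, unoptimized illustration and not the implementation used in the cited studies; the function names, the choice of 1 minus the Pearson correlation as the distance, and the toy profiles are assumptions. At each step it merges the two clusters whose members have the smallest average pairwise distance and records the merge, which is the information needed to draw the tree.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

def average_linkage(profiles):
    """Agglomerative clustering of named expression profiles.

    `profiles` maps a gene or sample name to a list of expression values.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for idx, a in enumerate(keys):
            for b in keys[idx + 1:]:
                # Average-linkage: mean distance over all cross-cluster pairs.
                pairs = [(m, n) for m in clusters[a] for n in clusters[b]]
                dist = sum(pearson_distance(profiles[m], profiles[n])
                           for m, n in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist))
    return merges

toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
for merge in average_linkage(toy):
    print(merge)
```

This version recomputes all pairwise averages at every step, which is acceptable for a handful of profiles; real data sets with thousands of genes call for an established library implementation.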

Other unsupervised methods that have been used to discover patterns and gene expression profiling from various tumor tissues are principal component analysis (42), self-organizing maps (43), multidimensional scaling (44, 45), and singular value decomposition (46). Although clustering and other data reduction and pattern discovery methods can separate distinct samples into clusters based on their expression and are useful in class discovery, they are not found to be effective for class comparison and prediction (47). Class prediction is inherent to cancer research and prognosis, and supervised learning methods have proven to be efficient in predicting various classes and types of cancer.

Supervised learning methods involve building a model to classify samples. They require a training set of samples of predefined classes from which gene classifiers are predicted and a test set that is used to validate the classifiers. Sometimes, the same set is used for both training and validation, with one or more members of the set being left out of the training step and used for validation (47). The prototype supervised method, linear discriminant analysis, has been used by statisticians for tumor classification using gene expression data (48). Other methods, including neighborhood analysis and weighted voting (49, 50), have been used for classification of tissue samples. Support vector machines, which provide data transformation methods for improved separation of samples (51, 52), have been used to predict classes in ovarian cancer tissues and diffuse large B cell lymphomas. K-nearest neighbor prediction (53–55) has also been used for multiclass cancer expression analysis. Trained neural network models have shown promise in tumor classification and diagnosis prediction (56, 57), and decision trees have been built for discriminating among distinct colon cancer tissues (58). Genetic algorithms represent another tool for exploring classification of biological phenotypes (59). A novel, computationally intense method, called the small features set algorithm, searches for patterns of small numbers of genes that discriminate between classes (60, 61).
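The leave-one-out validation scheme described above can be sketched with a k-nearest neighbor classifier. This is a minimal illustration under stated assumptions: Euclidean distance between expression profiles, majority voting among the k closest training samples, and hypothetical function names and toy data. It is not the procedure of any specific cited study.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `training` is a list of (profile, label) pairs; profiles are
    equal-length sequences of expression values.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if knn_predict(training, profile, k) == label:
            correct += 1
    return correct / len(samples)

# Toy two-gene profiles; values and labels are invented for illustration.
samples = [
    ((5.0, 1.0), "tumor"), ((5.5, 0.8), "tumor"), ((4.8, 1.2), "tumor"),
    ((1.0, 4.0), "normal"), ((0.8, 4.5), "normal"), ((1.2, 3.9), "normal"),
]
print(leave_one_out_accuracy(samples, k=3))
```

If genes are first selected for their ability to separate the classes, that selection step belongs inside the loop, repeated on each training subset; selecting genes on the full data set before cross-validation gives an optimistic accuracy estimate.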

A general problem with supervised learning methods is that of “overfitting the data” to the training samples (47, 62) so that the classifiers found are only useful for classifying those samples but not other, more variable, samples. Several classification methods have been compared (48, 52, 54, 63) without a clear consensus of the best methods. The underlying problem is that the number of samples is much lower than the number of potential predictors; hence, there is seldom a single, unique set of classifiers.

There are many sources of noise and experimental variation in expression microarray data. Noise and signal variation are introduced at almost every stage of the experiment: sample preparation, RNA labeling, hybridization, signal and background measurement, and variation between biological samples. The combined effect is to make the task of obtaining a list of significantly varying genes a difficult one, with both false-negative predictions (varying genes that are missed) and false-positive predictions (genes that are incorrectly predicted as varying) being reported; the expected proportion of false positives among the reported genes is called the false discovery rate. There is no one simple solution to these problems and there is an ongoing debate among statisticians regarding the most suitable experimental design and data analysis.
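One widely used way to control the false discovery rate across many genes is the Benjamini-Hochberg step-up procedure. The text above does not name a specific procedure, so the sketch below should be read as one common option rather than a recommendation drawn from the source; the function name and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR.

    Step-up rule: sort the m p-values, find the largest rank r such that
    p_(r) <= r * fdr / m, and report the r smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank   # keep the largest rank that passes
    return sorted(order[:cutoff_rank])

# Hypothetical per-gene p-values from a differential expression test.
p_values = [0.039, 0.001, 0.216, 0.008, 0.041]
print(benjamini_hochberg(p_values, fdr=0.05))
```

Because the rule is step-up, a gene whose p-value misses its own rank threshold is still reported when a gene ranked after it passes.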

Most importantly, microarrays require carefully planned and executed experiments and a competently done statistical and computational analysis. Experimental design and data analysis should be based on a clear statement of the experimental goals. Design should identify and measure sources of variation so that appropriate statistical analyses can be done and the genes that are varying significantly can be identified. A general guideline is to use at least three to four biological replicates and, if cost is a consideration, to reduce the number of treatments or time points in favor of biological replication. In experiments with tissues, samples from patients in groups that differ in drug sensitivity, clinical outcome, or clinical diagnosis can assist in understanding the underlying variations in gene expression within each group and whether differences between groups are greater than this within-group amount. RNA extracted from each tissue may be used in a separate hybridization reaction. For experiments with cancer cell lines, biological replication involves repeating the experiment to produce multiple RNA preparations from the same treatment. With the inclusion of sufficient replication in an experiment and an appropriate statistical analysis, a list of genes that are varying most significantly can be obtained along with a statistical probability of being correct.
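The comparison of between-group differences with within-group variation can be expressed as a one-way ANOVA F statistic computed gene by gene. The following is a minimal sketch, assuming replicate measurements for one gene are grouped by treatment or patient class; the function name and the numbers are illustrative only, and a full analysis would also convert F into a p-value and correct for the number of genes tested.

```python
def one_way_f_statistic(groups):
    """One-way ANOVA F statistic for a single gene.

    `groups` is a list of lists, each holding the replicate expression
    values for one treatment or class. Assumes at least two groups and
    some variation within groups (otherwise the denominator is zero).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of replicates around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Three biological replicates per group (invented values).
print(one_way_f_statistic([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```

The within-group term is only estimable when each group has replicates, which is why biological replication is favored over additional treatments or time points.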

One of the most important contributions to microarray data analysis has been the application of tried and true statistical methods including linear and nonlinear models, ANOVA, and the application of sophisticated background and data normalization methods (64–67). For example, the Linear Models for Microarray Analysis package in BioConductor (http://www.bioconductor.org), which is based on the open source R statistical programming language, supports many of these statistical methods (68). The BioConductor web site includes many other useful packages for the analysis of microarray data. There are many other microarray data analysis packages that are readily found by web searching (69). BRB-array tools (http://linus.nci.nih.gov/BRB-ArrayTools.html), SAM (http://www-stat.stanford.edu/~tibs/SAM/), and GeneCluster (http://www.broad.mit.edu/cancer/software/genecluster2/gc2.html) are examples of commonly used ones.

One of the most difficult tasks faced by the biologist is what to do with a long list of genes from an expression microarray experiment. What is the biological relevance of these results? One approach is to take the list of genes and examine them for gene ontology, biochemical function, known biochemical and regulatory relationships, and known protein-protein and gene-gene interactions. If genes that are showing increased or decreased expression in cancer cells act in a known pathway, regulatory circuit or protein complex, then that function may be altered. This expectation is increased if two or more genes are in the same pathway or complex. A number of tools have been developed for this purpose (69). One such tool is Pathway Miner (70), which produces spreadsheets of gene and pathway information and interactive graphs that depict known regulatory relationships among the up-regulated and down-regulated genes. An example of a gene network produced by this tool is shown in Fig. 1.
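The intuition that finding two or more changed genes in the same pathway strengthens the case for altered function can be quantified with a hypergeometric test. The sketch below is a generic over-representation calculation and is not a description of how Pathway Miner works; the function name, parameter names, and the small numbers in the example are assumptions.

```python
from math import comb

def pathway_enrichment_p(n_genome, n_pathway, n_list, n_overlap):
    """Probability of seeing at least `n_overlap` pathway genes by chance.

    n_genome  : number of genes measured on the array
    n_pathway : number of those genes annotated to the pathway
    n_list    : number of genes in the differentially expressed list
    n_overlap : number of listed genes that belong to the pathway
    """
    total = comb(n_genome, n_list)
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_genome - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# 10 genes measured, 4 in the pathway, 3 genes changed, 2 of them in the pathway.
print(pathway_enrichment_p(10, 4, 3, 2))
```

When many pathways are tested against the same gene list, the resulting p-values need a multiple-testing correction before any pathway is singled out.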

Figure 1.

Pathway miner analysis of microarray gene expression data. Interactive graph of the genes that are overexpressed (red circles with gene name) or underexpressed (green circles) shown as nodes. Genes are joined by lines (edges) of thickness proportional to the number of regulatory or biochemical pathways in which they interact. Clicking the mouse on a gene gives detailed lists of information about that gene, and clicking on an edge gives a list of pathways. The graph can be filtered in various ways. This analysis can readily reveal changes in biochemical pathways due to disease or drug treatment. Pathway miner may be accessed at http://www.biorag.org using the search option.

Extraction of biological information from gene expression experiments should reveal those genes, proteins, and biochemical pathways in cancer cells and tissues that have altered expression compared with normal cells and tissues. Similar information may be found from the altered expression of genes between different subphenotypes or prognosis categories. Some specific examples of the rationale behind using these genes as drug targets are shown in Fig. 2. The main goal in these methods is to use high-throughput technologies to discover and then exploit genetic alterations in cancer cells that lead to their dependence on certain genes and proteins for altered function and metabolism. The bioinformatics tools described in this minireview can be used effectively to help identify and exploit these vulnerabilities.

Figure 2.

Choosing drugs based on genetic and molecular analysis. 1, the product of a gene that is overexpressed in a microarray analysis of cancer cells or tissues is the target for drug development. Normal cells are assumed to depend less on the product than cancer cells. 2, a pair of genes, A and B, that are a synthetic lethal combination, meaning that the product of at least one must be present for cell viability, have been identified by bioinformatics and genetic analysis. It has been determined by microarray analysis that gene A is not expressed in cancer cells so that the cancer cell depends heavily on B for normal metabolism. Directing a new drug against B will preferentially kill the cancer cell because the normal cell has the gene A backup function. 3, it has been determined that a particular gene such as a tumor suppressor gene is defective in the cancer cell. The target for drug development is then cells with the B mutation. Alternatively, small interfering RNA constructs can be used to find a vulnerability of B-defective cancer cells. Finally, in (4), cells with the B mutation are targeted as in (3), except that a drug is chosen that enhances the effects of an already worthwhile treatment.
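The selection logic of case 2 can be written as a short filter over candidate gene pairs. This is a toy sketch under strong assumptions: the synthetic lethal pairs are taken as already known, "expressed" is reduced to a yes/no call from the microarray data, and the function name and gene symbols are hypothetical placeholders.

```python
def synthetic_lethal_targets(lethal_pairs, expressed_in_tumor):
    """Suggest targets from synthetic lethal pairs and tumor expression calls.

    For each pair, if exactly one partner is still expressed in the tumor,
    the tumor depends on that partner, so it is proposed as the target.
    Pairs with both or neither partner expressed are skipped.
    """
    targets = set()
    for gene_a, gene_b in lethal_pairs:
        a_on = gene_a in expressed_in_tumor
        b_on = gene_b in expressed_in_tumor
        if a_on != b_on:  # exactly one partner remains
            targets.add(gene_a if a_on else gene_b)
    return sorted(targets)

pairs = [("A", "B"), ("C", "D"), ("E", "F")]
print(synthetic_lethal_targets(pairs, expressed_in_tumor={"B", "C", "D"}))
```

Normal cells are assumed to express both partners, which is what would make an inhibitor of the remaining gene selective for the tumor.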

Other more complex methods for the analysis of microarray data have been used in an attempt to discover previously unknown gene relationships that may be of future use for finding drug targets. One totally different approach towards microarray data analysis is to start with known information about genes before performing an analysis. Prediction methods can discover complex gene relationships if more information on genes is incorporated in the analysis. As a simple example, a list of genes that might be coregulated based on the presence of conserved DNA sequence patterns or transcription factor binding sites may first be produced. The promoter list may then be compared with a second list of genes produced by statistical analysis of microarray data to determine if the genes are varying in expression to a similar degree. This analysis might assist in finding abnormal regulatory changes that can be further exploited in drug design. More complex analyses have been done. Molecular interactions and relevance networks have been extracted from clinical data using mutual information from genes (71), genome data, and probabilistic graph models learned by integrating prior information about genes (72, 73). Inferring complex regulatory networks from expression data has been an area of bioinformatics research (72–76) but has proven to be a difficult undertaking. The network models that are produced often are in agreement with experimental data but provide very little new insight into new regulatory relationships. This result is largely due to the underlying complexity of the regulatory interactions among genes. These models are useful, however, and their improvement and eventual application should lead to a better understanding of the systemic response of cells to drugs.
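The idea of a relevance network built from mutual information can be illustrated as follows. This sketch assumes expression values have already been discretized into a few levels (here "lo" and "hi") and links two genes when their mutual information reaches a chosen threshold; the discretization, the threshold, and the function names are assumptions and do not reproduce the procedure of the cited work (71).

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discretized expression vectors."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi

def relevance_network(discrete_profiles, threshold):
    """Return edges (gene_1, gene_2, MI) for gene pairs with MI >= threshold."""
    genes = sorted(discrete_profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            mi = mutual_information(discrete_profiles[g], discrete_profiles[h])
            if mi >= threshold:
                edges.append((g, h, mi))
    return edges

profiles = {
    "geneA": ["lo", "lo", "hi", "hi"],
    "geneB": ["lo", "lo", "hi", "hi"],
    "geneC": ["lo", "hi", "lo", "hi"],
}
print(relevance_network(profiles, threshold=0.5))
```

With only a handful of samples, mutual information estimates are noisy, so in practice a threshold is usually calibrated against permuted data rather than fixed in advance.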

The above approaches in cancer research can only be realized to their fullest potential with the continued development of bioinformatics tools and resources. The present availability of information on the human genome and proteome has depended heavily on previous progress in bioinformatics. Future development in these areas will include the efficient management and organization of biological and medical information into well-defined data objects that can be readily integrated and mined for relationships. These technologies will greatly assist in the development of new anticancer drugs. For example, in the Arizona Cancer Center, we have established cancer tissue repositories as relational databases into which clinical (subject to the appropriate regulatory requirements), genetic, and molecular data can be stored. Data from any experiment or analysis with tissues may then be entered into or referenced by this database. This information environment supports all subsequent tissue data analysis including information on the genetic, molecular, and biochemical changes in cancer cells. The Cancer Biomedical Informatics Grid (http://cabig.nci.nih.gov/) is being developed to provide development and sharing of these types of complex data objects between cancer centers.

Drug discovery also relies heavily on well-integrated data management in order to identify drug targets and determine drug-gene interactions. A good example is the G protein-coupled receptors that are used as therapeutic targets and represent almost 50% of all drug targets. For some of these receptors, the ligands remain unknown. The challenge is to evaluate the role of these G protein-coupled receptors and identify their specific ligands in normal physiology and disease and develop new therapeutics based on this information (77). Integrated bioinformatics approaches for analyzing these target genes, their nucleotide polymorphisms, protein structures, protein-protein interactions and protein modification sites for degradation, activation, and sorting the receptors is fundamental to understanding the response of individuals to such drugs. By providing integrated resources and knowledge databases, bioinformatics will greatly assist in the development and application of new anticancer drugs.

The connection between genes, proteins, and disease can best be understood using a systems biology approach in which genomic, proteomic, molecular, and genetic data are integrated with laboratory research. Cross-fertilization between data analysis and experiment will help to define the molecular processes involved in carcinogenesis in greater detail than is presently available and will ultimately lead to new therapeutic targets and treatments.

List of web sites cited in article

http://biosun1.harvard.edu/complab/dchip/snp.htm, SNP analysis software.

http://cabig.nci.nih.gov/, Cancer Biomedical Informatics Grid.

http://cgap.nci.nih.gov, Cancer Genome Anatomy Project.

http://genome.ucsc.edu/, Golden Path human genome resource.

http://hgvbase.cgb.ki.se/databases.htm, list of SNP-related databases.

http://linus.nci.nih.gov/BRB-ArrayTools.html, statistical tools for microarray analysis.

http://ncicb.nci.nih.gov/, National Cancer Institute Computational Biology.

http://snp.cshl.org/, SNP consortium.

http://www.bioconductor.org, statistical analysis software.

http://www.bioinf.mdc-berlin.de/igms/, database of splice variants.

http://www.biorag.org, genome tools and databases.

http://www.broad.mit.edu/cancer/software/genecluster2/gc2.html, clustering tools for microarray data.

http://www.ddbj.nig.ac.jp/, Japanese sequence database.

http://www.ebi.ac.uk/embl/, European sequence database.

http://www.ensembl.org, genome browser.

http://www.hupo.org, human proteome project.

http://www.ncbi.nlm.nih.gov/, GenBank home.

http://www.ncbi.nlm.nih.gov/projects/SNP/, dbSNP database.

http://www.nigms.nih.gov/psi/, protein structure initiative.

http://www.pharmgkb.org/index.jsp, pharmacogenomics database.

http://www-stat.stanford.edu/~tibs/SAM/, statistical tools for microarray analysis.

Grant support: National Cancer Institute grants, particularly the Arizona Cancer Center Core Grant and startup funds provided by Dr. Daniel Von Hoff; Southwest Environmental Health Sciences Center Core Grant and startup funds from Dr. Serrine Lau.

1. Finishing the euchromatic sequence of the human genome. Nature 2004;431:931–45.
2. Pollack JR, Iyer VR. Characterizing the physical genome. Nat Genet 2002;32 Suppl:515–21.
3. Pollack JR, Perou CM, Alizadeh AA, et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 1999;23:41–6.
4. Napoli C, Lemieux C, Jorgensen R. Introduction of a chimeric chalcone synthase gene into petunia results in reversible co-suppression of homologous genes in trans. Plant Cell 1990;2:279–89.
5. Hammond SM, Caudy AA, Hannon GJ. Post-transcriptional gene silencing by double-stranded RNA. Nat Rev Genet 2001;2:110–9.
6. Wolfsberg TG, Wetterstrand KA, Guyer MS, Collins FS, Baxevanis AD. A user's guide to the human genome. Nat Genet 2002;32 Suppl:1–79.
7. Baxevanis AD. Using genomic databases for sequence-based biological discovery. Mol Med 2003;9:185–92.
8. Baxevanis AD. The molecular biology database collection: 2003 update. Nucleic Acids Res 2003;31:1–12.
9. Galperin MY. The molecular biology database collection: 2005 update. Nucleic Acids Res 2005;33:D5–24.
10. Bailey JA, Gu Z, Clark RA, et al. Recent segmental duplications in the human genome. Science 2002;297:1003–7.
11. Boffelli D, McAuliffe J, Ovcharenko D, et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 2003;299:1391–4.
12. Sachidanandam R, Weissman D, Schmidt SC, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001;409:928–33.
13. Thorisson GA, Stein LD. The SNP Consortium website: past, present and future. Nucleic Acids Res 2003;31:124–7.
14. Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 2003;33 Suppl:228–37.
15. Reich DE, Cargill M, Bolk S, et al. Linkage disequilibrium in the human genome. Nature 2001;411:199–204.
16. Gabriel SB, Schaffner SF, Nguyen H, et al. The structure of haplotype blocks in the human genome. Science 2002;296:2225–9.
17. Matise TC, Sachidanandam R, Clark AG, et al. A 3.9-centimorgan-resolution human single-nucleotide polymorphism linkage map and screening set. Am J Hum Genet 2003;73:271–84.
18. Matsuzaki H, Dong S, Loi H, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 2004;1:109–11.
19. Matsuzaki H, Loi H, Dong S, et al. Parallel genotyping of over 10,000 SNPs using a one-primer assay on a high-density oligonucleotide array. Genome Res 2004;14:414–25.
20. Tang K, Oeth P, Kammerer S, et al. Mining disease susceptibility genes through SNP analyses and expression profiling using MALDI-TOF mass spectrometry. J Proteome Res 2004;3:218–27.
21. Hein DW. Molecular genetics and function of NAT1 and NAT2: role in aromatic amine metabolism and carcinogenesis. Mutat Res 2002;506–7:65–77.
22. Colhoun HM, McKeigue PM, Davey Smith G. Problems of reporting genetic associations with complex outcomes. Lancet 2003;361:865–72.
23. Bafna V, Gusfield D, Hannenhalli S, Yooseph S. A note on efficient computation of haplotypes via perfect phylogeny. J Comput Biol 2004;11:858–66.
24. Zhang K, Qin Z, Chen T, Liu JS, Waterman MS, Sun F. HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 2005;21:131–4.
25. Modrek B, Lee C. A genomic view of alternative splicing. Nat Genet 2002;30:13–9.
26. Modrek B, Lee CJ. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet 2003;34:177–80.
27. Mercatante D, Kole R. Modification of alternative splicing pathways as a potential approach to chemotherapy. Pharmacol Ther 2000;85:237–43.
28. Veuger MJ, Heemskerk MH, Honders MW, Willemze R, Barge RM. Functional role of alternatively spliced deoxycytidine kinase in sensitivity to cytarabine of acute myeloid leukemic cells. Blood 2002;99:1373–80.
29. Bracco L, Kearsey J. The relevance of alternative RNA splicing to pharmacogenomics. Trends Biotechnol 2003;21:346–53.
30. Xu L, Hui L, Wang S, et al. Expression profiling suggested a regulatory role of liver-enriched transcription factors in human hepatocellular carcinoma. Cancer Res 2001;61:3176–81.
31. Lee C, Atanelov L, Modrek B, Xing Y. ASAP: the Alternative Splicing Annotation Project. Nucleic Acids Res 2003;31:101–5.
32. Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 2001;17:405–14.
33. Iyer VR, Eisen MB, Ross DT, et al. The transcriptional program in the response of human fibroblasts to serum. Science 1999;283:83–7.
34. Hartigan J. Clustering algorithms. New York: John Wiley & Sons; 1975.
35. Jain AK, Dubes RC. Algorithms for clustering data. 1988.
36. Alizadeh AA, Eisen MB, Davis RE, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000;403:503–11.
37. Dhanasekaran SM, Barrette TR, Ghosh D, et al. Delineation of prognostic biomarkers in prostate cancer. Nature 2001;412:822–6.
38. Vasselli JR, Shih JH, Iyengar SR, et al. Predicting survival in patients with metastatic kidney cancer by gene-expression profiling in the primary tumor. Proc Natl Acad Sci U S A 2003;100:6958–63.
39. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics 2001;17:977–87.
40. McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 2002;18:1462–9.
41. Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003;19:1090–9.
42. Hastie T, Tibshirani R, Eisen MB, et al. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 2000;1:RESEARCH0003, Epub 2000 Aug 4.
43. Sultan M, Wigle DA, Cumbaa CA, et al. Binary tree-structured vector quantization approach to clustering and visualizing microarray data. Bioinformatics 2002;18 Suppl 1:S111–9.
44. Bittner M, Meltzer P, Chen Y, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000;406:536–40.
45. Mischel PS, Shai R, Shi T, et al. Identification of molecular subtypes of glioblastoma by gene expression profiling. Oncogene 2003;22:2361–73.
46. Kluger Y, Basri R, Chang JT, Gerstein M. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 2003;13:703–16.
47. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95:14–8.
48. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97:77–87.
49. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–7.
50. Shipp MA, Ross KN, Tamayo P, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002;8:68–74.
51. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000;16:906–14.
52. Yeang CH, Ramaswamy S, Tamayo P, et al. Molecular classification of multiple tumor types. Bioinformatics 2001;17 Suppl 1:S316–22.
53. Ramaswamy S, Tamayo P, Rifkin R, et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A 2001;98:15149–54.
54. Pomeroy SL, Tamayo P, Gaasenbeek M, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002;415:436–42.
55. Nutt CL, Mani DR, Betensky RA, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res 2003;63:1602–7.
56. Khan J, Wei JS, Ringner M, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001;7:673–9.
57. O'Neill MC, Song L. Neural network analysis of lymphoma microarray data: prognosis and diagnosis near-perfect. BMC Bioinformatics 2003;4:13.
58. Zhang H, Yu CY, Singer B, Xiong M. Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci U S A 2001;98:6730–5.
59. Ooi CH, Tan P. Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 2003;19:37–44.
60. Kim S, Dougherty ER, Barrera J, Chen Y, Bittner ML, Trent JM. Strong feature sets from small samples. J Comput Biol 2002;9:127–46.
61. Kim S, Dougherty ER, Shmulevich I, et al. Identification of combination gene sets for glioma classification. Mol Cancer Ther 2002;1:1229–36.
62. Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 2002;99:6562–6.
63. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z. Tissue classification with gene expression profiles. J Comput Biol 2000;7:559–83.
64. Churchill GA. Fundamentals of experimental design for cDNA microarrays. Nat Genet 2002;32 Suppl:490–5.
65. Quackenbush J. Microarray data normalization and transformation. Nat Genet 2002;32 Suppl:496–501.
66. Simon R, Radmacher MD, Dobbin K. Design of studies using DNA microarrays. Genet Epidemiol 2002;23:21–36.
67. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003;4:210.
68. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3:Article 3.
69. Mount D. Bioinformatics: sequence and genome analysis. 2nd ed. New York: Cold Spring Harbor Laboratory Press; 2004.
70. Pandey R, Guru RK, Mount DW. Pathway Miner: extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data. Bioinformatics 2004;20:2156–8.
71. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000;5:418–29.
72. Segal E, Wang H, Koller D. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 2003;19 Suppl 1:i264–71.
73. Segal E, Yelensky R, Koller D. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics 2003;19 Suppl 1:i273–82.
74. Pe'er D, Regev A, Elidan G, Friedman N. Inferring subnetworks from perturbed expression profiles. Bioinformatics 2001;17 Suppl 1:S215–24.
75. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput 2002;7:437–49.
76. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002;18 Suppl 1:S233–40.
77. Shaaban S, Benton B. Orphan G protein-coupled receptors: from DNA to drug targets. Curr Opin Drug Discov Devel 2001;4:535–47.