Abstract
The increasing availability and maturity of DNA microarray technology has led to an explosion of cancer profiling studies. To extract maximum value from the accumulating mass of publicly available cancer gene expression data, methods are needed to evaluate, integrate, and intervalidate multiple datasets. Here we demonstrate a statistical model for performing meta-analysis of independent microarray datasets. Implementation of this model revealed that four prostate cancer gene expression datasets shared significantly similar results, independent of the method and technology used (i.e., spotted cDNA versus oligonucleotide). This interstudy cross-validation approach generated a cohort of genes that were consistently and significantly dysregulated in prostate cancer. Bioinformatic investigation of these genes revealed a synchronous network of transcriptional regulation in the polyamine and purine biosynthesis pathways. Beyond the specific implications for prostate cancer, this work establishes a much-needed model for the evaluation, cross-validation, and comparison of multiple cancer profiling studies.
INTRODUCTION
DNA microarray technology offers the ability to screen thousands of genes in parallel. This genome-wide approach has been used successfully to classify tumors and identify novel biomarkers associated with cancer (1, 2, 3, 4, 5). Before drawing conclusions about the expression of a specific gene, it is thought necessary to demonstrate independent experimental validation (6) using techniques such as reverse transcription-PCR, Northern blots, or tissue microarrays. Therefore, it is commonplace to use the microarray as a screening tool, then to validate a few intriguing genes for additional investigation. Whereas this model has proved fruitful, it under-uses the “depth” inherent in the original microarray dataset. With the increasing number of publicly available datasets, we and others (7) have proposed a meta-analysis of multiple datasets that address similar hypotheses to validate and statistically assess all of the positive results simultaneously. This analysis would potentially yield sets of significant, differentially expressed genes, void of the artifacts of individual studies. From these confirmed gene sets, cellular pathways or transcription factor networks could be reconstructed, ultimately leading to a more reliable understanding of the underlying biology. This proposition poses unique challenges both statistically and computationally, for whereas the hypotheses in microarray studies are often similar (i.e., identify genes differentially expressed in cancer), individual investigators use distinct protocols, microarray platforms, and analysis techniques. Here we demonstrate a robust statistical model for performing meta-analysis of gene expression data across independent studies and platforms. The model treats each gene in each study as an independent hypothesis. Significance is assigned to a gene based on random permutations of the expression values of the gene, and then the similarity of significance values across studies is assessed with meta-analysis methods combined with multiple inference statistical testing.
The model was implemented on four publicly available prostate cancer gene expression data sets generated by independent laboratories (3, 8, 9, 10). All four of the studies made comparisons between the gene expression profiles of clinically localized prostate cancer and benign prostate tissue with the goal of identifying genes differentially expressed between the two sample groups (summarized in Table 1). Two of the groups used spotted cDNA technology (3, 8), whereas two other groups used commercial oligonucleotide-based technology (9, 10). Considerable overlap in the genes assayed (Table 2) and similar experimental objectives among the studies allowed us to address four important questions, both technological and biological: (a) do the individual studies generate statistically significant results; (b) if so, do the studies identify the same genes (more than would be expected by chance); (c) is the significance of interstudy validation independent of the experimental platform (spotted cDNA versus oligonucleotide); and (d) do intervalidated sets of dysregulated genes provide clues into prostate carcinogenesis?
MATERIALS AND METHODS
Data Collection.
Data used in this study was publicly available or provided on request. Cluster ID and gene names were assigned to all of the cDNA clones and Affymetrix probes based on Unigene Build 139. The two sample groups considered in this meta-analysis were clinically localized prostate cancer (independent of tumor grade) and benign prostate tissue (for the purposes of this study, benign prostatic hyperplasia was not distinguished from other benign prostate tissue). All of the available data were used except for negative values from the oligonucleotide-based studies. The expression ratios from Luo et al. (8) were transformed by taking the reciprocal, so that they would be analogous to those of Dhanasekaran et al. (3). All of the expression values were base-two log-transformed. Missing data were allowed.
Individual Study Analysis.
Analysis was performed with custom software written in Perl. Statistical tests were one-sided and performed independently for the two null hypotheses that no genes are overexpressed and no genes are underexpressed in prostate cancer. Study-specific, gene-specific Ps were calculated using one-sided random permutation t tests. A t statistic (t) for an individual gene was calculated and compared with 10,000 t statistics generated by randomly assigning the sample labels to the expression values of the gene.
The P then equaled the fraction of random t statistics that were greater than or equal to the actual t statistic. Ps equal to zero were set to 0.0001. If a gene was present more than once in a study, the P was set to the smallest value. To calculate the q value (or gene-specific false discovery rate), genes were sorted by P, and then the ratio of the expected number of occurrences at or better than each P to the actual number of occurrences was computed.
Meta-Analysis.
For each possible combination of studies, a meta-analysis was performed to test the null hypothesis that positive results from individual studies do not correspond to the same genes. For each gene that was present in all of the studies within a given meta-analysis, a P summary statistic (S) was computed using the Ps from the random permutation t tests.
Summary statistic Ps were calculated by a comparison to 100,000 summary statistics generated by randomly selecting a P from each individual study contributing to the respective meta-analysis. The summary statistic P then equaled the fraction of random summary statistics that were greater than or equal to the actual. For each meta-analysis, we sorted genes by summary P, and then calculated the summary q value of a gene as the ratio of the expected number of occurrences at or better than the P of the gene to the actual number of occurrences (same formula as individual study analysis). To assimilate results from all of the meta-analyses, we selected the minimum summary q value for each gene and then ranked the gene lists by this variable. To visualize meta-analysis results, we normalized each gene in each study independently so that the mean gene expression of the benign samples equaled zero and the SD = 1. Data were visualized in TreeView (11). The color scale represents the level of over- or underexpression based on the number of SDs an expression value is above or below the mean benign sample.
Random Simulation.
In each study, samples were randomly assigned to prostate cancer or benign groups, maintaining the actual sample group sizes. The meta-analysis described previously was then performed.
RESULTS AND DISCUSSION
Individual Study Analysis of Prostate Cancer Gene Expression Datasets.
Before performing the interstudy analysis, we evaluated the results of the individual prostate cancer gene expression studies. For each gene in each study, we tested the null hypothesis that no relationship exists between the expression values of the gene and the cancer versus benign distinction. Ps were assigned to each gene with permutation t tests (4, 12) and, thus, represent the likelihood that expression values of a gene were associated with samples at random. If we were only investigating one gene, thus one hypothesis, the P would be a sufficient measure of significance; however, the simultaneous multiple inferences inherent in microarray analysis require special statistical consideration. For example, if we surveyed 1000 genes, a P of 0.05 may not be significant, because we would expect, on average, to randomly find 50 genes with Ps < 0.05. To address this, we adopted the method of false discovery rates (13). Estimated lowest gene-specific false discovery rates (or q values) have been suggested as a measure of significance analogous to Ps but adapted to multiple inference scenarios.3 Q values are determined by ranking genes by their Ps and then calculating the ratio of expected random occurrences at or better than a P of a gene to the actual number of occurrences. Fig. 1 depicts the q value plots of the four studies analyzed. The X axis is the rank index for genes sorted by P. If the null hypothesis were true, we would expect the gene q-values (represented on the Y axis) to remain nearly constant and equal to 1.0. As anticipated, all four of the studies had genes with q values much less than one, suggesting that many genes can differentiate prostate cancer from benign tissue. We suspect that the number of genes at a particular q value in a given study reflects the sample size of the study (and, thus, statistical power), as well as quality of the samples, arrays, and signals. The disparity among the number of positive “calls” for overexpression in prostate cancer is evident at a q value of 0.10 where Dhanasekaran et al. (3) called 665 genes, Luo et al. (8) called 758 genes, Welsh et al. (10) called 1194 genes, and Magee et al. (Ref. 9; the smallest study) called zero genes (Fig. 1,A). A similar phenomenon was observed for down-regulated genes (Fig. 1 B).
Meta-Analysis of Prostate Cancer Gene Expression.
We then implemented a meta-analysis model to assess the similarity of the findings between studies to ultimately identify statistically reliable sets of over- and underexpressed genes in prostate cancer (schematically described in Fig. 2). In all four of the published studies, molecular validation was performed on small sets of identified genes (maximum of 4) by Northern blotting (3), reverse transcription-PCR (8, 9, 10), in situ hybridization (9), and immunohistochemistry (for protein; Refs. 3, 10). We performed in-silico interstudy validation and significance analysis for all of the genes present in at least two, studies by integrating the methods of meta-analysis summary statistics (14) and false discovery rates (described previously). The null hypothesis we tested was that the positive findings in the four individual studies do not correspond to the same genes. Because different combinations of 2, 3, and 4 studies have different sets of overlapping genes, we performed an analysis for each possible combination and then assimilated the results. In each analysis, summary statistics were computed for genes represented in all of the included studies, and their significance was evaluated based on a distribution of randomly generated summary statistics (see “Materials and Methods”). Each gene was then assigned a summary q value, which was an estimate of the likelihood that the Ps of the gene from the individual studies were assigned to the gene by random selection from the respective studies. If the expression values of a gene significantly differentiated cancer from benign in multiple studies, the meta-analysis summary statistic of a gene would also be significant (i.e., represented by a low q value). However, if a gene was significant in only one study, the summary statistic would not be significant. The q value plot for overexpressed genes in all of the two-study analyses showed that all of the pair-wise combinations of studies yielded sets of significantly similar genes, although certain pairs of studies yielded more similarly “called” genes than others (Fig. 3,A). Interestingly, the spotted studies (i.e., Dhanasekaran et al. and Luo et al.; Refs. 3, 8) shared approximately equal similarity with the Welsh et al. (10) oligonucleotide study as they did with each other, suggesting that the two technologies generate comparable results. As expected, we found that increasing the number of studies in the meta-analysis increases the significance and number of called genes, with the exception of including the Magee et al. study (Ref. 9; Fig. 3,B). To formally assess the utility of our statistical model, we randomly designated samples in each study as benign or prostate cancer and carried out the meta-analysis as described. As anticipated, the randomly generated q value plots maintained values near 1.0 (supplementary Fig. 1).
Results from the various combination analyses were assimilated by selecting the lowest q value for each gene and then sorting genes based on the q value. We found 50 genes to be reliably overexpressed and 103 genes to be reliably underexpressed in prostate cancer at a q value of 0.10, suggesting that the studies identified a cohort of shared differentially expressed genes. We suspect that more genes would be called significant if additional studies were available. Several of the genes validated or commented on in the individual studies scored high in the meta-analysis including hepsin, myc, AMACR, and fatty acid synthase (3, 8, 9, 10). All four of the studies identified hepsin, a transmembrane serine protease, as being overexpressed in prostate cancer. Importantly, 11 other genes scored equally high in the meta-analysis (Fig. 4). An equal number of intriguing underexpressed genes were identified by meta-analysis including SPARCL1, a candidate tumor suppressor gene (15), and TIMP3, an inhibitor of metalloproteinases that is silenced by methylation in cancers (16), among others. Fig. 4 depicts an Eisen matrix representation (11) of the top 40 over- and underexpressed genes across the meta-analysis. As apparent from the figure, many genes reliably discriminate prostate cancer from benign prostate, independent of the particular dataset. The complete listing of genes identified by meta-analysis is available in our supplementary materials.
Delineation of Pathway Dysregulation by Interstudy Validation of Prostate Cancer Gene Expression.
We hypothesized that this cohort of validated, differentially expressed genes contained concurring clues into the underlying mechanisms of prostate carcinogenesis. To investigate this using a bioinformatic approach, we performed directed Entrez PubMed literature searches (17) and performed a KEGG4 pathway database query (18) using significant differentially expressed genes. Of the 6538 genes present in at least two studies, 631 were present in the KEGG database. Because of the decrease in the number of genes (hypotheses) in this analysis, summary q values were adjusted accordingly. A q value threshold of 0.02, which approximately corresponds to a q value of 0.2 in the preceding analysis, was adopted. The KEGG database batch entry and color-coded pathway tool allowed us to evaluate a large set of differentially expressed genes and simultaneously evaluate the involvement of a number of metabolic pathways. Pathways that were most populated by the database query were investigated.
We identified significant transcriptional involvement of the polyamine biosynthesis pathway at a number of enzymatic steps, some of which are confirmed by previous reports (Ref. 19; Fig. 5 A). Polyamines, including spermidine, spermine, and their precursor putrescine, have been implicated in cancer cell proliferation, protection from apoptosis, and DNA-protein binding (20). Interestingly, polyamines have also been found at high levels in the urine of prostate cancer patients (21). We found that enzymes directing substrates toward polyamines were overexpressed in prostate cancer. By contrast, the enzyme that directs ornithine (a urea cycle intermediate and polyamine precursor) to its alternate destination, proline synthesis, was underexpressed in prostate cancer. One gene that we identified was ODC, which is the rate-limiting enzyme in polyamine synthesis converting ornithine to putrescine. ODC has been shown previously to be overexpressed in prostate cancer and is the target of the chemotherapeutic agent DFMO (22). A recent clinical trial demonstrated that DFMO caused nearly complete depletion of putrescine (97.6%) but not of spermidine and spermine (73.6% and 50.8%, respectively). This result is consistent with the differential expression of polyamine biosynthesis enzymes revealed by meta-analysis. When ODC is inhibited by DFMO, excess ornithine (ODC substrate) may normally be shunted through OAT. However, the consistent underexpression of OAT in prostate cancer prevents this, thus driving ornithine through any uninhibited ODC to produce putrescine. The rapid conversion of putrescine to spermidine is facilitated by increased levels of spermidine synthase, which leads to high levels of spermidine and spermine but not putrescine. This proposed explanation warrants additional experimentation and may facilitate the design of better strategies directed at controlling polyamine levels in prostate cancer.
In addition to the polyamine biosynthesis pathway, we also detected a dysregulation of the purine biosynthesis pathway in prostate cancer. It is well known that nucleotide biosynthesis is necessary for DNA synthesis in mitotic cell division. Consistent with this notion, we found that the adenine monophosphate biosynthesis pathways, both salvage and de novo, are increased in prostate cancer by transcriptional up-regulation of multiple enzymes (Fig. 5 B). One enzyme identified by the meta-analysis, adenylosuccinate lyase, has been reported previously to be an indicator of breast and prostate malignancy (23). Perhaps the other enzymes in this pathway, implicated by our meta-analysis, could provide insight into the metabolic regulation of prostate cancer or indirectly serve as biomarkers. These examples of pathway dysregulation discovered by meta-analysis and KEGG demonstrate potential sets of synchronous clues that exist in a cohort of validated, differentially expressed genes.
Our model for microarray meta-analysis used a statistical and computational approach to facilitate the validation and significance analysis of four prostate cancer gene expression profiling studies. The resulting validated genes, coupled with interrogation of the KEGG database, enabled us to reconstruct the transcriptional events of two metabolic pathways important in prostate cancer. As in-silico molecular modeling evolves to contain more complete cell signaling and enzymatic pathways, as well as transcription factor networks, more valuable inferences will be possible. Beyond the analysis of the prostate cancer gene expression datasets, our work established a statistically rigorous model for evaluating and comparing multiple microarray datasets. As microarray technology continues to mature and an increasing amount of data becomes publicly available, larger scale meta-analyses will facilitate maximum usation and warehousing of microarray data.
Evaluating individual prostate cancer gene expression studies by estimated lowest false discovery rates (q value). The X axis represents the gene index as ranked by P, and the Y axis represents q value. A, q value plot for overexpressed genes and (B) under-expressed genes in prostate cancer. Gene expression data sets are indicated by the first author (3, 8, 9, 10). See text for details.
Evaluating individual prostate cancer gene expression studies by estimated lowest false discovery rates (q value). The X axis represents the gene index as ranked by P, and the Y axis represents q value. A, q value plot for overexpressed genes and (B) under-expressed genes in prostate cancer. Gene expression data sets are indicated by the first author (3, 8, 9, 10). See text for details.
Comparison of summary q value plots for analyses of different combinations of studies. A, two-study meta-analyses q value plot. B, three-study meta-analyses and four-study meta-analysis. See Fig. 1 for details.
Comparison of summary q value plots for analyses of different combinations of studies. A, two-study meta-analyses q value plot. B, three-study meta-analyses and four-study meta-analysis. See Fig. 1 for details.
A cohort of cross-validated genes identified by meta-analysis of prostate cancer gene expression profiles. Eisen matrix representation of genes consistently differentially expressed between clinically localized prostate cancer (P) and benign prostate tissue (B) across four independent microarray studies (3, 8, 9, 10). Each column represents an individual sample (number of samples is in parentheses), and each row represents a specific gene. Within each study, the data were normalized so that the mean expression level of the genes in the benign prostate specimens equaled zero and the SD = 1. A, 40 genes with the lowest q value for overexpression. Red intensity level indicates degree of overexpression, whereas black indicates equal or lower expression than the mean benign sample (see scale). B, 40 genes with the lowest q value for underexpression. Green intensity level indicates degree of underexpression, whereas black indicates equal or higher expression than the mean benign sample (see scale). Gray signifies technically inadequate or not present in a particular study. q value thresholds are provided to the left of the matrix. See supplementary data for full listing of calculated q values, and gene- and study-specific Ps for the genes delineated by meta-analysis.
A cohort of cross-validated genes identified by meta-analysis of prostate cancer gene expression profiles. Eisen matrix representation of genes consistently differentially expressed between clinically localized prostate cancer (P) and benign prostate tissue (B) across four independent microarray studies (3, 8, 9, 10). Each column represents an individual sample (number of samples is in parentheses), and each row represents a specific gene. Within each study, the data were normalized so that the mean expression level of the genes in the benign prostate specimens equaled zero and the SD = 1. A, 40 genes with the lowest q value for overexpression. Red intensity level indicates degree of overexpression, whereas black indicates equal or lower expression than the mean benign sample (see scale). B, 40 genes with the lowest q value for underexpression. Green intensity level indicates degree of underexpression, whereas black indicates equal or higher expression than the mean benign sample (see scale). Gray signifies technically inadequate or not present in a particular study. q value thresholds are provided to the left of the matrix. See supplementary data for full listing of calculated q values, and gene- and study-specific Ps for the genes delineated by meta-analysis.
Meta-analysis of microarrays coupled with bioinformatic resources revealed metabolic pathway dysregulation in prostate cancer. A, polyamine (spermidine and spermine) biosynthesis pathway. B, adenine monophosphate synthesis pathway. Genes identified as over- or underexpressed in prostate cancer by meta-analysis are shown in red and green, respectively. See text for details.
Meta-analysis of microarrays coupled with bioinformatic resources revealed metabolic pathway dysregulation in prostate cancer. A, polyamine (spermidine and spermine) biosynthesis pathway. B, adenine monophosphate synthesis pathway. Genes identified as over- or underexpressed in prostate cancer by meta-analysis are shown in red and green, respectively. See text for details.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
D. R. R. is a Fellow of the Medical Scientist Training Program. A. M. C. is a Pew Foundation Scholar. This work was supported by the CapCURE Foundation, Wendy Will Case Cancer Foundation, a pilot grant from the University of Michigan Bioinformatics Program, and a Career Development Award from the Specialized Program of Research Excellence in Prostate Cancer (P50 CA69568), National Cancer Institute. Supplementary information will be available at the author’s website (http://www.pathology.med.umich.edu/chinnaiyan/).
Internet address: http://www-stat.stanford.edu/∼jstorey/.
The abbreviations used are: KEGG, Kyoto Encyclopedia of Genes and Genomes; ODC, ornithine decarboxylase; DFMO, α-difluoromethylornithine; OAT, ornithine aminotransferase.
Prostate cancer gene expression data sets used in meta-analysis of microarrays
Authors . | Journal . | Volume:page . | Array type . | Total clones reported . | Total clones provided . | Unique clones . | Number of samples . | . | . | ||
---|---|---|---|---|---|---|---|---|---|---|---|
. | . | . | . | . | . | . | Benign prostate . | Localized prostate cancer . | Metastatic prostate cancer . | ||
Dhanasekaran, S.M., et al. | Nature | 412:822 | cDNA | 9984 | 9984 | 8352 | 19 | 14 | 20 | ||
Luo, J., et al. | Cancer Research | 61:4683 | cDNA | 6500 | 6500 | 5119 | 9 | 16 | 0 | ||
Magee, J.A., et al. | Cancer Research | 61:5692 | oligo | 7068 | 3350 | 2780 | 4 | 8 | 3 | ||
Welsh, J.B., et al. | Cancer Research | 61:5974 | oligo | 8900 | 6812 | 5485 | 9 | 23 | 1 |
Authors . | Journal . | Volume:page . | Array type . | Total clones reported . | Total clones provided . | Unique clones . | Number of samples . | . | . | ||
---|---|---|---|---|---|---|---|---|---|---|---|
. | . | . | . | . | . | . | Benign prostate . | Localized prostate cancer . | Metastatic prostate cancer . | ||
Dhanasekaran, S.M., et al. | Nature | 412:822 | cDNA | 9984 | 9984 | 8352 | 19 | 14 | 20 | ||
Luo, J., et al. | Cancer Research | 61:4683 | cDNA | 6500 | 6500 | 5119 | 9 | 16 | 0 | ||
Magee, J.A., et al. | Cancer Research | 61:5692 | oligo | 7068 | 3350 | 2780 | 4 | 8 | 3 | ||
Welsh, J.B., et al. | Cancer Research | 61:5974 | oligo | 8900 | 6812 | 5485 | 9 | 23 | 1 |
Percentage of genes shared by studies used in meta-analysis
For each study (rows) the percentage of genes found in all other studies (columns). The actual number of genes shared is given in parentheses.
Authors . | Dhanasekaran, S.M., et al. . | Luo, J., et al. . | Magee, J.A., et al. . | Welsh, J.B., et al. . |
---|---|---|---|---|
Dhanasekaran, S.M., et al. | 100 | 61.1 (5106) | 23 (1919) | 34.8 (2906) |
Luo, J., et al. | 99.7 (5106) | 100 | 30.5 (1560) | 41.6 (2132) |
Magee, J.A., et al. | 69 (1919) | 56.1 (1560) | 100 | 79.9 (2221) |
Welsh, J.B., et al. | 53 (2906) | 38.9 (2132) | 40.5 (2221) | 100 |
Authors . | Dhanasekaran, S.M., et al. . | Luo, J., et al. . | Magee, J.A., et al. . | Welsh, J.B., et al. . |
---|---|---|---|---|
Dhanasekaran, S.M., et al. | 100 | 61.1 (5106) | 23 (1919) | 34.8 (2906) |
Luo, J., et al. | 99.7 (5106) | 100 | 30.5 (1560) | 41.6 (2132) |
Magee, J.A., et al. | 69 (1919) | 56.1 (1560) | 100 | 79.9 (2221) |
Welsh, J.B., et al. | 53 (2906) | 38.9 (2132) | 40.5 (2221) | 100 |
Acknowledgments
We thank the authors of the prostate cancer gene expression studies used in this meta-analysis for making their data publicly available. We also thank Sarvana Dhanasekaran and Sobryangrayana Varambally (Department of Pathology, University of Michigan, Ann Arbor, MI) for helpful discussions, and Robin Kunkel (Department of Pathology, University of Michigan, Ann Arbor, MI) for assistance in preparation of the figures.