Abstract
4265
As DNA microarray technology matures, the use of gene expression profiling is gradually moving from bench-side to bedside. Many classifying and predictive gene signatures have been reported for different types of cancer. However, it has been noted that different signatures for the same disease process often share few common genes. The exact reasons for this phenomenon are not clear (NEJM 355:560, 2006). Because a large number of genes is often associated with a disease process, we hypothesized that only a small fraction of these genes is needed to establish a signature for successful classification or prediction. Consequently, many equally effective classifiers and predictors that share few commonalities can be generated from a single dataset. To demonstrate this point, we used two well-established gene expression datasets of breast cancer patients from the Netherlands Cancer Institute. First, we conducted unsupervised hierarchical cluster analysis on all cases using increasing numbers of genes (10 to 640) randomly selected from 4433 genes that were filtered from 24481 genes using the rule of van’t Veer et al. The two major groups from each hierarchical cluster analysis were compared by logrank test for metastasis-free survival. The analysis was repeated 1000 times. When 80 or more randomly selected genes were used, the gene number became sufficient to classify breast cancers into high and low risk for distant metastasis (p<0.05) 80% of the time. When we drew from genes significantly associated with metastasis (n=1262), 80 or more randomly selected genes successfully classified patients into high and low risk for metastasis 100% of the time. These findings indicate that only a limited number of genes from a relatively large pool of significant genes is sufficient for classification. This conclusion is further supported by the results of a supervised approach.
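The group comparison in the unsupervised analysis relies on a two-group logrank test. The following is a minimal sketch of that test in plain Python, not the code used in the study: the function name `logrank_test` and the toy follow-up times are hypothetical, and the clustering step that would produce the two groups is assumed to have been run separately.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi-square statistic, p-value).

    times_*  : follow-up time for each patient in the group
    events_* : 1 if distant metastasis was observed, 0 if censored
    """
    # Pool both groups, tagging each patient with a group label (0 or 1).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected events in group A
    variance = 0.0       # hypergeometric variance summed over event times
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))

# Toy data for illustration only (not from the study): group A relapses early.
toy_a_times, toy_a_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
toy_b_times, toy_b_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
toy_stat, toy_p = logrank_test(toy_a_times, toy_a_events,
                               toy_b_times, toy_b_events)
```

In a resampling loop of the kind described, one would presumably call such a function once per random gene draw and record the fraction of draws with p<0.05.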
In the supervised approach, linear discriminant analysis was used to build predictive models in a training set of 78 cases using increasing numbers of genes (10 to 100) randomly selected from 394 genes that were statistically associated with metastasis (Cox regression p<0.001). Each predictive model was then applied to an independent test set of 234 patients, and the procedure was repeated 1000 times. When the number of predictor genes was 60 or more, >99% of predictors were effective. The average accuracies were 62% to 63%, higher than the 59% accuracy obtained with the 70-gene signature of van’t Veer et al. The established predictors shared only 2 to 38 genes. Our results support the view that the lack of commonality is largely due to the large number of genes that can be used for classification or prediction. It is important to recognize this inherent property of DNA microarray studies; one should not expect a high degree of commonality between different classifiers or predictors developed for the same disease process.
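As a back-of-envelope illustration of why random predictors overlap so little, the sketch below computes the expected number of shared genes between two gene sets of equal size. It assumes, purely for illustration, that both sets are drawn independently and uniformly from the 394-gene pool, in which case the overlap follows a hypergeometric distribution. The function names are hypothetical and this calculation is not part of the reported analysis.

```python
from math import comb

def overlap_pmf(pool, k, shared):
    """P(exactly `shared` genes in common) for two independent random
    k-gene sets drawn without replacement from `pool` genes."""
    return comb(k, shared) * comb(pool - k, k - shared) / comb(pool, k)

def expected_overlap(pool, k):
    """Mean of the hypergeometric overlap distribution: k * k / pool."""
    return k * k / pool

POOL = 394  # size of the metastasis-associated gene pool in the abstract
for k in (10, 60, 100):
    print(k, round(expected_overlap(POOL, k), 2))
```

Under this assumption, two 100-gene predictors would be expected to share about 25 of their genes and two 10-gene predictors fewer than one, so low commonality is what random selection from a large pool of informative genes predicts.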
99th AACR Annual Meeting-- Apr 12-16, 2008; San Diego, CA