Purpose: The requirement of frozen tissues for microarray experiments limits the clinical usage of genome-wide expression profiling by using microarray technology. The goal of this study is to test the feasibility of developing lung cancer prognosis gene signatures by using genome-wide expression profiling of formalin-fixed paraffin-embedded (FFPE) samples, which are widely available and provide a valuable rich source for studying the association of molecular changes in cancer and associated clinical outcomes.

Experimental Design: We randomly selected 100 non–small cell lung cancer (NSCLC) FFPE samples with annotated clinical information from the UT-Lung SPORE Tissue Bank. We microdissected the tumor area from the FFPE specimens and used Affymetrix U133 plus 2.0 arrays to obtain gene expression data. After strict quality control and analysis procedures, a supervised principal component analysis was used to develop a robust prognosis signature for NSCLC. Three independent published microarray datasets were used to validate the prognosis model.

Results: This study showed that the robust gene signature derived from genome-wide expression profiling of FFPE samples is strongly associated with lung cancer clinical outcomes and can be used to refine the prognosis for stage I lung cancer patients, and the prognostic signature is independent of clinical variables. This signature was validated in several independent studies and was refined to a 59-gene lung cancer prognosis signature.

Conclusions: We conclude that genome-wide profiling of FFPE lung cancer samples can identify a set of genes whose expression level provides prognostic information across different platforms and studies, which will allow its application in clinical settings. Clin Cancer Res; 17(17); 5705–14. ©2011 AACR.

Translational Relevance

This article is the first study to develop a robust prognosis signature for non–small cell lung cancer (NSCLC) on the basis of genome-wide expression profiling of clinically available formalin-fixed and paraffin-embedded (FFPE) samples. Although clinical FFPE tumor samples are widely available, the genome-wide expression profiling of FFPE samples has been hampered because of the degradation of RNAs extracted from them. In this article, we show that NSCLC FFPE-derived signature is strongly associated with clinical outcome of the patients, is independent of clinical prognostic variables, and can be validated in several independent studies. We showed that, after strict quality control and analysis procedures, genome-wide profiling of FFPE samples can actually provide a unique opportunity to identify a set of genes whose expression level is less sensitive to the environmental changes. This gene signature is more robust across different platforms and studies, which is critical for the successful application of gene signatures in real clinical settings.

Lung cancer is the leading cause of death from cancer for both men and women in the United States and in most parts of the world, with a 5-year survival rate of 15% (1). Non–small-cell lung cancer (NSCLC) is the most common cause of lung cancer death, accounting for up to 85% of such deaths (2). Clinicopathologic staging is the standard prognosis factor for lung cancer used in clinical practice but does not capture the complexity of the disease so that heterogeneous clinical outcomes within the same stage are commonly seen. Several randomized clinical trials showed that adjuvant chemotherapy improves survival in resected NSCLC (3–7). The effect of adjuvant chemotherapy on prolonging survival is modest—only 4% to 15% improvement in 5-year survival—although such treatment is associated with serious adverse effects (6, 8). Therefore, it is of considerable clinical importance to have a robust and accurate prognostic signature for lung cancer, especially in early stage lung cancer to improve the current clinical decisions on whether an individual lung cancer patient should receive adjuvant chemotherapy or not.

Genome-wide expression profiles have been used to identify gene signatures to classify lung cancer patients with different survival outcomes (9–16). However, the requirement of frozen tissues for microarray experiments limits the clinical usage of these gene signatures. Furthermore, prognostic gene signatures for NSCLC developed by different groups show minimal overlap and are often difficult to reproduce by independent groups (17, 18). To address the requirement for frozen tissues, we designed this study to test the feasibility of developing lung cancer prognosis gene signatures by using genome-wide expression profiling of formalin-fixed paraffin-embedded (FFPE) samples, which are widely available and provide a valuable, rich source for studying the association of molecular changes in cancer and associated clinical outcomes. We derived a prognosis signature for NSCLC from FFPE samples and validated it in several independent studies. To help other researchers reproduce all results in this study, we have provided a literate programming R package.

Tissue specimens

The overall study design and the flow chart of the derivation and validation of the robust gene signature are described in Figure 1. We randomly selected 100 NSCLC FFPE samples with annotated clinical information from the UT-Lung SPORE Tissue Bank from 2001 to 2005. From these samples, 75 samples passed the mRNA quality control criteria (Supplementary Methods). Among these 75 samples, 48 samples are adenocarcinomas and 27 are squamous cell carcinomas. The median follow-up time is 2.8 years and the maximum follow-up time is 6.9 years; the characteristics of these patients are summarized in Supplementary Table S1. The samples were obtained under approval of the Institutional Review Boards at MD Anderson Cancer Center.

Figure 1.

A, flow chart of the derivation and validation of the robust gene signature from FFPE samples collected from MD Anderson UT-Lung Cancer SPORE tissue bank (MDACC). B, flow chart of the derivation and validation of 59-gene prognosis signature.


Sample microdissection and RNA extraction

FFPE tumor specimens were cut into serial sections with a thickness of 10 μm. For the pathologic diagnosis, one slide was stained with H&E and evaluated by a pathologist. Other sections were stained with nuclear fast red (NFR; American MasterTech Scientific Inc.) to enable visualization of histology. Tumor tissue was isolated by using manual macrodissection when the tumor area was more than 0.5 × 0.5 mm² or laser capture microdissection (P.A.L.M. Microlaser Technologies AG) in cases of smaller tumor areas. At least 50 mm² of tumor tissue was collected from each FFPE block. The extraction of RNA from tissue samples was done by a proprietary procedure of Response Genetics, Inc. (United States Patent Application 20090092979) designed to optimize the yield of higher molecular weight RNA fragments from FFPE specimens.

Microarray data preprocessing and quality control

Total RNA was processed for analysis on the Affymetrix U133 plus 2.0 arrays according to Affymetrix protocols for first- and second-strand synthesis, biotin labeling, and fragmentation. The quality control procedure for microarray data analysis was based on the percentage of present calls calculated by the MAS5 package. We selected arrays with at least 15% of probe sets present; 55 of 75 arrays passed this quality control criterion and were used for the analysis. We then selected probe sets that were present on all 55 arrays; 1,400 genes passed this criterion. These 1,400 genes are referred to as the robust gene set (RGS), because the mRNA expression of these genes is robust to FFPE processing. The 55 samples and the 1,400 genes were used to develop gene signatures.
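The two filtering steps described above (keep arrays with at least 15% present calls, then keep probe sets present on every retained array) can be sketched as follows. This is an illustrative sketch only, not the authors' R code: the function names, the dictionary layout, and the use of "P" to denote a present call are assumptions made for the example.

```python
def percent_present(calls):
    """Fraction of probe sets with a present ('P') detection call on one array."""
    return sum(1 for c in calls if c == "P") / len(calls)


def robust_gene_set(calls_by_array, min_present=0.15):
    """Return (passing_arrays, robust_probes).

    calls_by_array maps an array id to {probe id: detection call}.
    The layout and the 'P' label are assumptions for this sketch.
    """
    # Step 1: array-level QC on the percentage of present calls.
    passing = sorted(
        array_id
        for array_id, calls in calls_by_array.items()
        if percent_present(list(calls.values())) >= min_present
    )
    if not passing:
        return [], []
    # Step 2: keep probe sets called present on every passing array.
    probes = set(calls_by_array[passing[0]])
    robust = sorted(
        p for p in probes
        if all(calls_by_array[a].get(p) == "P" for a in passing)
    )
    return passing, robust


# Toy input: three arrays, four probe sets (hypothetical data).
toy_calls = {
    "array1": {"g1": "P", "g2": "P", "g3": "A", "g4": "A"},
    "array2": {"g1": "P", "g2": "A", "g3": "A", "g4": "A"},
    "array3": {"g1": "A", "g2": "A", "g3": "A", "g4": "A"},
}
passing_arrays, robust_probes = robust_gene_set(toy_calls, min_present=0.15)
```

In the toy input, array3 fails the array-level filter, and only g1 is present on both remaining arrays.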

After microarray analysis QC, we used the RMA background correction algorithm (19) to remove nonspecific background noise. A robust regression model (20) was fitted to the probe level data, and the fitted expression values for the probes at the 3′ end were used to summarize the probe set expression values. Quantile–quantile normalization was used to normalize all the arrays. Consortium microarray raw data (13) was downloaded from caArray database of the National Cancer Institute (NCI) and preprocessed by RMA background correction and quantile–quantile normalization. All gene expression values were log transformed (on a base 2 scale).
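As a rough illustration of the normalization and log transformation steps, the sketch below applies a basic quantile normalization to a few arrays and then takes base 2 logarithms. It is a simplified stand-in, not the RMA implementation used in the study: it ignores ties, omits background correction and probe-level summarization, and all names and values are hypothetical.

```python
import math


def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    arrays is a list of equal-length lists (one list per array).
    Ties are not handled specially in this sketch.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def log2_transform(arrays):
    """Log transform all (positive) expression values on a base 2 scale."""
    return [[math.log2(v) for v in a] for a in arrays]


# Toy intensities for two arrays and three probes (hypothetical data).
raw = [[4.0, 16.0, 64.0], [32.0, 2.0, 8.0]]
normalized = quantile_normalize(raw)
logged = log2_transform(normalized)
```

After normalization, both toy arrays contain the same set of values (3, 12, and 48), assigned according to each probe's rank within its own array.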

Supervised classification by using supervised principal component analysis

Classification was done by using supervised principal component analysis (21, 22), a widely used classification method in biomedical research (23–26). Because this is a supervised method, each prediction model was trained in a training dataset and its performance was then tested in an independent test dataset. We used the R (version 2.81) package Superpc (version 1.05) with default parameters to implement the prediction algorithm. The implementation details can be found in the Supplementary Sweave Report. The training and testing sets for each prediction model are summarized in Supplementary Table S2.
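The general idea of supervised principal component analysis can be sketched in three steps: score each gene by its univariate association with survival, keep genes whose score exceeds a threshold, and use the first principal component of the retained genes as a risk score. The sketch below is an assumption-laden illustration and not the Superpc package: the score is a univariate Cox score statistic without tie correction, the threshold is supplied by hand rather than chosen by cross-validation, the component is found by power iteration, and all names and data are hypothetical.

```python
import math


def cox_score(x, time, event):
    """Univariate Cox score statistic U / sqrt(V) (no tie correction)."""
    u = v = 0.0
    n = len(x)
    for i in range(n):
        if not event[i]:
            continue
        risk = [x[j] for j in range(n) if time[j] >= time[i]]
        mean = sum(risk) / len(risk)
        u += x[i] - mean
        v += sum((r - mean) ** 2 for r in risk) / len(risk)
    return u / math.sqrt(v) if v > 0 else 0.0


def first_component(rows, iters=200):
    """Leading loading vector of gene-by-sample rows via power iteration."""
    means = [sum(r) / len(r) for r in rows]
    centered = [[val - m for val in r] for r, m in zip(rows, means)]
    g, n = len(rows), len(rows[0])
    w = [1.0 / (k + 1) for k in range(g)]  # arbitrary non-uniform start
    for _ in range(iters):
        s = [sum(centered[k][j] * w[k] for k in range(g)) for j in range(n)]
        nxt = [sum(centered[k][j] * s[j] for j in range(n)) for k in range(g)]
        norm = math.sqrt(sum(a * a for a in nxt))
        if norm == 0:
            break
        w = [a / norm for a in nxt]
    return w, means


def risk_score(model, sample):
    """Project one sample (list indexed by gene) onto the component."""
    return sum(
        w * (sample[k] - m)
        for k, w, m in zip(model["genes"], model["loadings"], model["means"])
    )


def fit_superpc(expr, time, event, threshold):
    """expr is a gene-by-sample list of lists; returns a small model dict."""
    scores = [cox_score(row, time, event) for row in expr]
    genes = [k for k, s in enumerate(scores) if abs(s) >= threshold]
    if not genes:
        raise ValueError("no gene passed the threshold")
    w, means = first_component([expr[k] for k in genes])
    model = {"genes": genes, "loadings": w, "means": means}
    # The sign of a principal component is arbitrary: orient it so that
    # a higher score corresponds to a higher hazard in the training data.
    train = [risk_score(model, [row[j] for row in expr]) for j in range(len(time))]
    if cox_score(train, time, event) < 0:
        model["loadings"] = [-a for a in w]
    return model


# Toy data: genes 0 and 2 track survival, gene 1 is noise (hypothetical).
expr = [
    [6, 5, 4, 3, 2, 1],
    [1, 3, 2, 3, 1, 2],
    [5.5, 5, 4.5, 3, 2.5, 1],
]
time = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 1, 1]
model = fit_superpc(expr, time, event, threshold=1.5)
high = risk_score(model, [row[0] for row in expr])
low = risk_score(model, [row[5] for row in expr])
```

In a train-and-test design such as the one described above, the model would be fitted on the training samples only and `risk_score` would then be applied to each test sample, with samples split into high- and low-risk groups by a cutoff on the score.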

Survival analysis

Overall survival time was calculated from the date of surgery until death or the last follow-up contact. Survival curves were estimated by using the product-limit method of Kaplan–Meier (27) and were compared by using the log-rank test. The maximum follow-up time for the FFPE patient cohort is less than 7 years, whereas some patients in the consortium cohort have been followed for up to 17 years. To avoid extrapolation of the prediction model, the comparison of survival time between predicted groups was truncated at 7 years. The analysis results without truncation can be seen in the Supplementary Sweave Report. Univariate and multivariate Cox proportional hazards analyses (28) were also done, with survival as the dependent variable.
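A minimal sketch of the product-limit estimate with follow-up truncated at a fixed horizon is given below. It assumes that truncation means administrative censoring at 7 years (patients still under observation at the horizon are treated as censored there); the function names and toy data are hypothetical.

```python
def truncate_followup(time, event, horizon):
    """Censor all follow-up beyond `horizon` (same units as `time`)."""
    new_time = [min(t, horizon) for t in time]
    new_event = [bool(e) and t <= horizon for t, e in zip(time, event)]
    return new_time, new_event


def kaplan_meier(time, event):
    """Return [(t, S(t))] at each distinct event time (product-limit)."""
    curve = []
    surv = 1.0
    for t in sorted({x for x, e in zip(time, event) if e}):
        at_risk = sum(1 for x in time if x >= t)
        deaths = sum(1 for x, e in zip(time, event) if e and x == t)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Toy follow-up in years; 1 = death, 0 = censored (hypothetical data).
years = [1, 2, 3, 8, 10]
died = [1, 0, 1, 1, 0]
t7, e7 = truncate_followup(years, died, horizon=7)
curve = kaplan_meier(t7, e7)
```

In the toy data, the death at 8 years falls after the horizon and is therefore counted as a censored observation at 7 years, so the curve only drops at years 1 and 3.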

The robust gene set defines two tumor groups

The expression of these 1,400 genes divided the 55 patients into 2 groups on the basis of unsupervised clustering analysis (with Euclidean distance and complete linkage for the hierarchical clustering algorithm; Fig. 2). Interestingly, group 1 has significantly shorter survival time compared with group 2 (Fig. 2B; HR = 3.6, P = 0.017), and multivariate Cox proportional hazards analysis showed that the association between RGS groups and survival (P = 0.012) is independent of stage. Notably, group 1 was dominated by squamous cell carcinomas (23/28), whereas group 2 was dominated by adenocarcinomas (25/27; P < 0.0001; Supplementary Table S3). The other clinical characteristics, including gender, age, and smoking status, were not significantly different between the 2 groups. To explore whether the association between RGS groups and survival is due to the histologic difference between the 2 groups, we drew Kaplan–Meier curves by both histology and RGS groups (Supplementary Fig. S1). These curves show clearly that the RGS can distinguish high- and low-risk groups within both the adenocarcinoma and squamous groups, indicating that the association between RGS groups and survival is independent of histology.
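For readers unfamiliar with the clustering settings named above, the following sketch shows agglomerative clustering with Euclidean distance and complete linkage, stopped when 2 clusters remain. It is a naive illustration for small inputs, not the software used in the study, and the toy profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(profiles, n_clusters=2):
    """Merge clusters bottom-up until `n_clusters` remain.

    Returns a list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(
                    euclidean(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy profiles for five samples and two genes (hypothetical data).
profiles = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [0.2, 0.1]]
groups = complete_linkage(profiles, n_clusters=2)
```

Here samples 0, 1, and 4 form one group and samples 2 and 3 the other. The resulting group labels could then be compared for survival with a log-rank test.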

Figure 2.

Microarray analysis of the gene expression profiles from FFPE lung tumor samples. A, unsupervised cluster analysis of the 55 FFPE lung cancer patient cohort by using the expression profile of 1,400 robust genes that pass the microarray quality control criterion. Vertical and horizontal axes represent robust genes and lung cancer patient clusters, respectively. B, Kaplan–Meier plot showing the association of the expression of robust genes with patient survival. P values were obtained by using the log-rank test. Red color represents sample cluster I and black color represents sample cluster II defined by unsupervised clustering algorithm by using robust gene profiling data. • indicates censored samples. Gene set enrichment analysis found that the ER-negative signature derived from breast cancer patients is enriched in group 1 defined by RGS expression (C), and the ER-positive signature derived from breast cancer patients is enriched in group 2 defined by RGS expression (D). The y-axis shows running enrichment scores for the specific gene set on the 1,400 preranked genes. The x-axis shows the rank in the ordered dataset. The vertical lines represent the locations of the genes that are in the specific gene set.


We used gene set enrichment analysis to identify the enriched gene sets in both RGS groups. Interestingly, an estrogen receptor (ER)–negative signature in breast cancer (29) is enriched in RGS group 1, whereas an ER-positive signature in breast cancer (29) is enriched in RGS group 2 (Fig. 2C and D), indicating a relationship between the ER signatures and the RGS groups. The other enriched gene sets are summarized in Supplementary Table S4; notably, genes enriched in group 1 are also enriched in mouse neural stem cells and embryonic stem cells.

Construct and validate RGS prognosis signatures

FFPE samples training to testing.

The strong associations between RGS groups and survival outcomes motivated us to explore whether the RGS expression profile can be used to construct a prognosis signature. We randomly divided the 55 patients into training (25 samples) and testing (30 samples) sets and constructed a prediction model by using the 1,400 robust gene expression values in the training set through a supervised principal component approach (21). Figure 3A shows that the predicted low-risk group has significantly longer survival time than the predicted high-risk group (P = 0.013) in the testing set. To test whether this association could have arisen by chance, we randomly split the data into training and testing sets 200 times, repeated the same prediction and testing procedures for each split, and found that the prognosis performance of the RGS signature is significantly better than random (P = 0.02).

Figure 3.

Kaplan–Meier plots showing the predictive power of the robust gene signatures. Fifty-five FFPE tumor samples from MD Anderson Cancer Center were randomly divided into training (25 samples) and testing (30 samples) sets (A). Independent validation of the robust gene signature in the 442 frozen sample cohort from the multi-institute consortium. The microarray datasets were divided into 2 groups, one for the training and the other for the testing cohort, according to the original paper (B). The training data were the 55 FFPE tumor samples and the testing data were the 442 frozen sample cohort from the multi-institute consortium. The testing was done for all patients (C), stage I patients (E), stage II patients (F), and stage III patients (G) separately. The training data were the consortium dataset with 442 frozen samples and the testing data were the 55 FFPE samples from MD Anderson Cancer Center (D). P values were obtained by the log-rank test. Red and black lines represent predicted high- and low-risk groups, respectively. • indicates censored samples.


Frozen samples training to testing.

We then tested whether this robust gene set can be used to construct a prognosis signature in frozen samples. The largest independent publicly available lung cancer microarray dataset is the recently published NCI Director's Consortium for study of lung cancer involving 442 resected adenocarcinomas (13). From that study, Affymetrix U133A microarray data for 1,012 robust genes were extracted; this is 388 fewer genes than in our FFPE data because of the microarray platform difference. We used the same training and testing strategy as in the original analyses of these data (13) for constructing and validating a prognosis signature through the supervised principal component approach. The training set included samples from University of Michigan Cancer Center (UM) and Moffitt Cancer Center (HLM), and the testing set included the Memorial Sloan-Kettering Cancer Center (MSKCC) and Dana-Farber Cancer Institute (DFCI) samples. This analysis revealed that the predicted low-risk group has significantly longer survival time than the predicted high-risk group (HR = 2.44, P = 0.00014) in the testing dataset (Fig. 3B).

FFPE to frozen samples and vice versa.

Next, we used our FFPE dataset and the consortium dataset (frozen samples) to investigate whether the prediction model built from one type of sample can be validated in the other type of sample. Again, the same supervised principal component method was used to construct the prediction model. The prediction model built from FFPE samples can significantly distinguish the high- and low-risk groups in frozen samples (Fig. 3C; HR = 1.95, P = 5.4 × 10⁻⁷), and the prediction model built from frozen samples can also distinguish the high- and low-risk groups in FFPE samples but with marginal significance (Fig. 3D; HR = 3.59, P = 0.068). We also tested the performance of the FFPE prediction model on the 4 individual datasets in the consortium study and found that the predicted low-risk groups have longer survival time compared with the predicted high-risk groups for all sets: MSKCC dataset (median survival time 6.5 vs. 3.3 years; HR = 2.31, P = 0.0093), DFCI dataset (median survival time 5.9 vs. 0.9 years; HR = 2.62, P = 0.0076), HLM dataset (median survival time 3.4 vs. 2.2 years; HR = 1.25, P = 0.4), and UM dataset (median survival time 5.4 vs. 2.2 years; HR = 1.98, P = 0.0011; Supplementary Fig. S2). Next, we compared the performance of the RGS signature with previously published lung cancer prognosis signatures by using the same consortium dataset as the testing set. Shedden and colleagues (13) showed that the HRs for the Method A signature (the best signature in their study) and the Chen and colleagues (11) signatures range from 1.10 to 1.83 for the MSKCC test set, whereas the HR for our RGS signature is 2.89 on the same MSKCC test set. For the DFCI test set, the HRs range from 1.76 to 2.30 by using the published signatures, whereas the HR for our RGS signature on the same DFCI test set is 2.39. Therefore, the prognosis performance of the RGS signature is at least as good as that of other published signatures in the microarray dataset.

The RGS prognosis signature is independent of clinical variables

To test whether the RGS is an independent prognosis signature, we fitted a multivariate Cox regression model including RGS risk scores, age, gender, stage, smoking status, adjuvant chemotherapy usage, and clinical sites as covariables for the consortium dataset. The RGS risk scores were calculated from the prediction model built from the FFPE sample set. Table 1 shows that the RGS signature is significantly associated with survival time after adjusting for other clinical variables (HR = 1.3, P = 0.007). Pathologic stage based on the international staging system is the most widely used and important prognosis variable for lung cancer patients (30); here we tested whether the RGS signature can further refine the prognosis within each stage. The RGS prognosis signature from FFPE samples was tested within each stage of the consortium dataset. The results show clearly that the RGS signature is significantly associated with survival outcome within each stage (Fig. 3E–G; HR = 1.54, P = 0.036 for stage I; HR = 1.81, P = 0.022 for stage II; and HR = 1.90, P = 0.021 for stage III), indicating that the RGS signature can refine the prognosis for lung cancer patients. The RGS prognosis signature from FFPE samples was further tested for patients with or without adjuvant chemotherapy separately, and the results show that the RGS signature is significantly associated with survival for both groups (Supplementary Fig. S3A and B; HR = 1.95, P = 0.015 for patients with chemotherapy; HR = 1.99, P = 0.00062 for patients without chemotherapy).

Table 1.

The association between characteristics of patients and RGS risk scores and survival time for consortium patients on the basis of multivariate Cox regression model

Variables                              HR (95% CI)            P
RGS risk scores                        1.300 (1.074–1.574)    0.0070
Gender (female vs. male)               0.803 (0.576–1.119)    0.19
Age (continuous in unit of 10 y)       1.571 (1.321–1.868)    <0.0001
Smoking (current/former vs. never)     1.356 (0.791–2.322)    0.27
Stage
 Stage II vs. stage I                  2.116 (1.433–3.126)    0.0002
 Stage III vs. stage I                 4.855 (3.164–7.449)    <0.0001
Adjuvant chemotherapy (yes vs. no)     1.688 (1.172–2.431)    0.0049
Study sites
 DFCI vs. UM                           1.295 (0.741–2.264)    0.36
 HLM vs. UM                            1.632 (1.094–2.434)    0.016
 MSKCC vs. UM                          0.657 (0.419–1.031)    0.068

NOTE: RGS scores were calculated from the prediction model built from MD Anderson Cancer Center FFPE samples.
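As a reading aid for Table 1, the sketch below shows how an HR, its 95% CI, and its P value relate to one another under the usual Wald approximation on the log-hazard scale. Whether the reported P values are Wald P values is an assumption here, and the function name is hypothetical; the input values are taken from the RGS and age rows of the table.

```python
import math


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the standard error, z statistic, and two-sided P value.

    Assumes the CI was built as exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided tail area of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# RGS risk scores row: HR 1.300 (1.074-1.574), reported P = 0.0070.
se_rgs, z_rgs, p_rgs = wald_from_ci(1.300, 1.074, 1.574)

# Age row: HR 1.571 (1.321-1.868), reported P < 0.0001.
se_age, z_age, p_age = wald_from_ci(1.571, 1.321, 1.868)
```

For the RGS row, this approximation gives a z statistic of about 2.69 and a two-sided P value close to the reported 0.0070.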

Refine to 59-gene prognosis signature

Among all the RGS genes, 131 genes are associated with survival (P < 0.05) in the FFPE dataset, and 365 genes are associated with overall survival (P < 0.05) in the consortium dataset by univariate Cox regression analysis. There is significant overlap between these two gene lists (Fig. 4A; 59 common genes; P = 0.0008, hypergeometric test). More significant genes were found in the consortium data compared with the FFPE data, which is likely due to the larger sample size (n = 442) of the consortium dataset compared with the FFPE dataset sample size (n = 55). Surprisingly, HRs from the two datasets are very consistent with each other. All 59 genes have the same direction of effect (positive or negative) on survival in the 2 datasets, and the HRs from the 2 datasets are highly correlated (Pearson's correlation = 0.86; Fig. 4B), indicating the high consistency of expression of these genes across datasets. These results motivated us to hypothesize that these 59 genes (Supplementary Table S5) alone can be used for lung cancer prognosis. To test this hypothesis, we applied supervised principal component analysis to these 59 genes by using the FFPE dataset to construct a 59-gene prognosis signature. Because the selection of these 59 genes used information from both the FFPE and consortium datasets, we used another 2 independent lung cancer datasets, the Bild and colleagues dataset (n = 111; ref. 9) and the Bhattacharjee and colleagues dataset (n = 117; ref. 31), downloaded from the literature to validate our 59-gene signature. The 59-gene prediction model built from FFPE samples can significantly distinguish the high- and low-risk groups for both the Bhattacharjee and colleagues and Bild and colleagues datasets (Fig. 5A; HR = 1.81, P = 0.016 and Fig. 5C; HR = 2.10, P = 0.02, respectively). Furthermore, this signature can also significantly distinguish the high- and low-risk groups within stage I patients for both datasets (Fig. 5B and D), indicating that this 59-gene signature can refine the prognosis for stage I lung cancer patients. Because of the small sample size for stage II and stage III patients in the Bild and colleagues and Bhattacharjee and colleagues studies, the 59-gene prognosis signature was not tested for stage II and stage III patients. We also found that the 59-gene prediction model built from the consortium dataset can distinguish the high- and low-risk groups for the Bild and colleagues and Bhattacharjee and colleagues datasets (Supplementary Fig. S4A–D).
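The overlap test mentioned above can be illustrated with a short hypergeometric tail calculation. This is a generic sketch, not the authors' code: the function name is hypothetical, and the size of the gene universe (the number of genes tested in both datasets) is an input that must be taken from the analysis itself, so the example call uses small toy numbers rather than the values in the text.

```python
from math import comb


def hypergeom_upper_tail(universe, list_a, list_b, overlap):
    """P(X >= overlap) for the overlap of two gene lists.

    universe: number of genes that could appear in either list
    list_a:   number of significant genes in the first dataset
    list_b:   number of significant genes in the second dataset
    overlap:  observed number of genes common to both lists
    """
    total = comb(universe, list_b)
    upper = min(list_a, list_b)
    favorable = sum(
        comb(list_a, k) * comb(universe - list_a, list_b - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Toy example: 10 genes, two lists of 5, at least 4 in common.
p_toy = hypergeom_upper_tail(universe=10, list_a=5, list_b=5, overlap=4)
```

For the comparison described in the text, `list_a`, `list_b`, and `overlap` would be 131, 365, and 59, with `universe` set to the number of robust genes evaluated in both datasets.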

Figure 4.

Comparison of individual gene effect across FFPE samples from MD Anderson Cancer Center and 442 frozen samples from the consortium. A, Venn diagram of genes associated with overall survival (P < 0.05 in univariate Cox regression models). It shows that 59 genes are significantly associated with survival in both the FFPE data and the consortium data. B, the HRs from univariate Cox regression models for the 59 genes common to both sets are consistent between the FFPE set and the consortium set. C, regulatory gene and protein interaction networks defined by the 59 predictors. Computational molecular interaction network prediction on the basis of genes and proteins associated with the significant pathways in the Ingenuity Pathways Knowledge Base (IPKB) by IPA. Interactions between the different nodes are given as solid (direct interaction) and dashed (indirect interaction) lines (edges). This network received the highest score by IPA and is mostly centered on the transcription factors HNF4A, HNF1A, and ONECUT1. The shaded genes are the genes belonging to the 59-gene signature.

Figure 5.

Kaplan–Meier plots showing the predictive power of the 59-gene signature for 2 independent validation sets. The training data were 55 FFPE tumor samples from MD Anderson Cancer Center and the testing dataset was frozen samples from lung cancer patients from Bhattacharjee and colleagues (31) dataset (A), the stage I patients from Bhattacharjee and colleagues dataset (B), frozen samples from lung cancer patients from Bild and colleagues (9) dataset (C), and the stage I patients from Bild and colleagues dataset (D). P values were obtained by the log-rank test. Red and black lines represent predicted high- and low-risk groups, respectively. • indicates censored samples.


To understand the potential biological relevance of these 59 genes significantly associated with survival in the FFPE and consortium datasets, we used Ingenuity Pathway Analysis (IPA) to explore which known regulatory networks are enriched in this 59-gene set. IPA analysis revealed the most significant molecular networks to be cancer, tumor morphology, and respiratory disease. This network (Fig. 4C) includes 14 genes of the 59-gene set and is centered on transcription factors HNF4A, HNF1A, and ONECUT1 (HNF6A). This hepatocellular network has been implicated in hepatocellular carcinoma as determined by in vitro study (32) and molecular interactions in this network are putatively involved in lung cancer survival.

In this study, we tested the feasibility of deriving a lung cancer prognosis gene signature from FFPE tumor samples on the basis of genome-wide mRNA expression profiling. Although reverse transcriptase PCR methods have been used to measure gene expression level from FFPE samples (33–35), the selection of genes for testing is limited to the current knowledge base, which is incomplete and inconsistent (36). Because of degradation and chemical alteration of RNA extracted from FFPE samples, the use of microarray analysis of gene expression from FFPE samples has been hampered (36). New technology and methodologies developed to extract RNA from FFPE samples, coupled with new array platforms, have made it possible to measure gene expression from FFPE samples (33, 37–40). A recent study showed the feasibility of using DNA-mediated annealing, selection, extension, and ligation arrays with 6,100 preselected genes to profile mRNA expression from hepatocellular carcinoma tissue (41). No prognosis signature for other types of cancer has been developed by using microarray analysis of gene expression from FFPE extracted RNA. In this study, we built a robust gene signature for NSCLC on the basis of microarray analysis of FFPE samples. We claim that this is a robust gene signature because it has been validated in 6 independent published datasets, including 4 sets from the consortium study and 2 additional studies from DFCI and Duke. We also built a prediction model by using the same set of robust genes from frozen samples and validated the model in both frozen and FFPE samples.

Published gene signatures identified in different studies usually differ substantially and show little overlap. However, we found significant overlap among the robust genes associated with survival outcomes between the FFPE dataset and the consortium dataset (P = 0.008). More impressively, the HRs, which indicate the strength of the association between gene expression and survival time, are highly consistent between the 2 independent datasets. Our interpretation of this consistency is that variation in gene expression across studies is a major contributor to the differences among published signatures. In this study, we used strict quality control steps to exclude genes that were not expressed in our FFPE samples. This restricted the analysis to the remaining genes, which had more stable expression patterns and were more robust to environmental changes. Validation of our novel 59-gene signature, which is prognostic for NSCLC survival, in 2 additional independent datasets further confirmed the robustness of these genes.
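
The significance of an overlap between two gene lists, such as the one reported above, is often assessed with a hypergeometric tail probability. The text does not state which test produced P = 0.008, so the following is only a minimal sketch of one common approach. The function name `overlap_p_value` and all numbers in the usage lines are hypothetical and are not taken from this study.

```python
from math import comb


def overlap_p_value(universe: int, set_a: int, set_b: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: number of genes eligible for selection in both studies (assumed shared)
    set_a:    number of survival-associated genes in the first dataset
    set_b:    number of survival-associated genes in the second dataset
    overlap:  number of genes observed in both lists
    """
    if not (0 <= set_a <= universe and 0 <= set_b <= universe):
        raise ValueError("set sizes must lie between 0 and the universe size")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    max_overlap = min(set_a, set_b)
    total = comb(universe, set_b)
    # Sum the probability of every overlap size at least as large as the observed one.
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical illustration only: 10,000 eligible genes, lists of 500 and 400
    # genes, 40 genes shared (about 20 would be expected by chance).
    print(overlap_p_value(10_000, 500, 400, 40))
```

A small P value under this model indicates that the two lists share more genes than random selection from the common gene universe would be expected to produce. The result depends strongly on how the universe is defined, for example all probes on the array versus only those passing quality control.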

By grouping our RGS of 1,400 genes by gene expression, we found that the group expression levels correlated with survival. Interestingly, group 1 had shorter survival and contained an ER-negative breast cancer signature, whereas group 2 had longer survival and contained an ER-positive breast cancer signature. This correlation between ER status and survival has been reported previously in breast cancer and shown to have predictive power for prognosis (29). In addition to ER status, the RGS groups were separated by the presence of stem cell signatures (an embryonic stem cell signature and a neural stem cell signature): group 1 (shorter survival) contained both stem cell signatures, whereas group 2 (longer survival) contained neither. The embryonic stem cell signature has previously been shown to be associated with poor prognosis in NSCLC (42). In addition, in mouse models, a hematopoietic and neural stem cell–like signature in primary tumors has been shown to predict poor prognosis in 11 types of cancer, including lung cancer (43). These ER status and stem cell signature data support our RGS expression groupings and their correlation with survival prognosis.
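
A common way to compare survival between two expression groups such as these is the Kaplan-Meier product-limit estimator (27). The sketch below is a minimal pure-Python version for illustration only. The function name `kaplan_meier` and the toy follow-up data are hypothetical and do not reproduce the analysis or data of this study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times:  follow-up time for each patient
    events: 1 if the event (death) was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= removed
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up times (months) and event indicators for two groups.
    group1 = kaplan_meier([5, 8, 12, 20, 24], [1, 1, 0, 1, 1])
    group2 = kaplan_meier([10, 30, 36, 48, 60], [1, 0, 1, 0, 0])
    print("group 1:", group1)
    print("group 2:", group2)
```

Censored patients contribute to the risk set until their last follow-up but do not trigger a drop in the curve. A formal comparison of the two curves would additionally need a test such as the log-rank test, which is not shown here.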

In addition to prognostic signatures, predictive signatures that determine the optimal chemotherapy regimen for individual patients would have tremendous clinical benefit. Tumor samples from clinical trials are important for developing predictive signatures because they reduce selection bias when treatment efficacy is evaluated within signature groups. However, very few frozen tumor samples are available from completed clinical trials. Our study showed the feasibility of using FFPE samples for genome-wide mRNA profiling. Therefore, this study provides an important step toward constructing and validating predictive signatures for chemotherapy response by using the FFPE samples available from clinical trials.

No potential conflicts of interest were disclosed.

This study was supported in part by grants from the Department of Defense (W81XWH-07-1-0306 03 to J.D. Minna and I.I. Wistuba), the Specialized Program of Research Excellence in Lung Cancer Grant (P50CA70907 to J.D. Minna, J. Roth, and I.I. Wistuba), the NCI (1R01CA152301-01 to Y. Xie and I.I. Wistuba, Cancer Center Support Grant CA-16672), the NIH (5R21DA027592 to G. Xiao), and the NSF (DMS0907562 to G. Xiao).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Thun MJ. Cancer statistics, 2009. CA Cancer J Clin 2009;59:225–49.
2. Tsuboi M, Ohira T, Saji H, Miyajima K, Kajiwara N, Uchida O, et al. The present status of postoperative adjuvant chemotherapy for completely resected non-small cell lung cancer. Ann Thorac Cardiovasc Surg 2007;13:73–7.
3. Douillard JY, Rosell R, De Lena M, Carpagnano F, Ramlau R, Gonzales-Larriba JL, et al. Adjuvant vinorelbine plus cisplatin versus observation in patients with completely resected stage IB-IIIA non-small-cell lung cancer (Adjuvant Navelbine International Trialist Association [ANITA]): a randomised controlled trial. Lancet Oncol 2006;7:719–27.
4. Kato H, Ichinose Y, Ohta M, Hata E, Tsubota N, Tada H, et al. A randomized trial of adjuvant chemotherapy with uracil-tegafur for adenocarcinoma of the lung. N Engl J Med 2004;350:1713–21.
5. Arriagada R, Bergman B, Dunant A, Le Chevalier T, Pignon JP, Vansteenkiste J. Cisplatin-based adjuvant chemotherapy in patients with completely resected non-small-cell lung cancer. N Engl J Med 2004;350:351–60.
6. Winton T, Livingston R, Johnson D, Rigas J, Johnston M, Butts C, et al. Vinorelbine plus cisplatin vs. observation in resected non-small-cell lung cancer. N Engl J Med 2005;352:2589–97.
7. Strauss GM, Herndon JE 2nd, Maddaus MA, Johnstone DW, Johnson EA, Harpole DH, et al. Adjuvant paclitaxel plus carboplatin compared with observation in stage IB non-small-cell lung cancer: CALGB 9633 with the Cancer and Leukemia Group B, Radiation Therapy Oncology Group, and North Central Cancer Treatment Group Study Groups. J Clin Oncol 2008;26:5043–51.
8. Olaussen KA, Mountzios G, Soria JC. ERCC1 as a risk stratifier in platinum-based chemotherapy for nonsmall-cell lung cancer. Curr Opin Pulm Med 2007;13:284–9.
9. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006;439:353–7.
10. Boutros PC, Lau SK, Pintilie M, Liu N, Shepherd FA, Der SD, et al. Prognostic gene signatures for non-small-cell lung cancer. Proc Natl Acad Sci U S A 2009;106:2824–8.
11. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, Yuan A, et al. A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med 2007;356:11–20.
12. Lu Y, Lemon W, Liu PY, Yi Y, Morrison C, Yang P, et al. A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med 2006;3:e467.
13. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008;14:822–7.
14. Sun Z, Wigle DA, Yang P. Non-overlapping and non-cell-type-specific gene expression signatures predict lung cancer survival. J Clin Oncol 2008;26:877–83.
15. Hsu DS, Balakumaran BS, Acharya CR, Vlahovic V, Walters KS, Garman K, et al. Pharmacogenomic strategies provide a rational approach to the treatment of cisplatin-resistant patients with advanced cancer. J Clin Oncol 2007;25:4350–7.
16. Hayes DN, Monti S, Parmigiani G, Gilks CB, Naoki K, Bhattacharjee A, et al. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. J Clin Oncol 2006;24:5079–90.
17. Coombes KR, Wang J, Baggerly KA. Microarrays: retracing steps. Nat Med 2007;13:1276–7; author reply 7–8.
18. Ioannidis JP, Allison DB, Ball CA, Coulibaly I, Cui X, Culhane AC, et al. Repeatability of published microarray gene expression analyses. Nat Genet 2009;41:149–55.
19. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19:185–93.
20. Huber PJ. 1972 Wald Lecture - Robust Statistics - Review. Ann Math Stat 1972;43:1041–67.
21. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2004;2:E108.
22. Breiman L, Friedman J, Stone JC, Olshen RA. Classification and regression trees. New York: Chapman & Hall/CRC; 1984.
23. Garzotto M, Beer TM, Hudson RG, Peters L, Hsieh YC, Barrera E, et al. Improved detection of prostate cancer using classification and regression tree analysis. J Clin Oncol 2005;23:4322–9.
24. Hess KR, Abbruzzese MC, Lenzi R, Raber MN, Abbruzzese JL. Classification and regression tree analysis of 1000 consecutive patients with unknown primary carcinoma. Clin Cancer Res 1999;5:3403–10.
25. Koziol JA, Zhang JY, Casiano CA, Peng XX, Shi FD, Feng AC, et al. Recursive partitioning as an approach to selection of immune markers for tumor diagnosis. Clin Cancer Res 2003;9:5120–6.
26. Valera VA, Walter BA, Yokoyama N, Koyama Y, Iiai T, Okamoto H, et al. Prognostic groups in colorectal carcinoma patients based on tumor cell proliferation and classification and regression tree (CART) survival analysis. Ann Surg Oncol 2007;14:34–40.
27. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457–81.
28. Collett D. Modelling survival data in medical research. Boca Raton: Chapman & Hall/CRC; 2003.
29. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415:530–6.
30. Mountain CF. The new International Staging System for Lung Cancer. Surg Clin North Am 1987;67:925–35.
31. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci U S A 2001;98:13790–5.
32. Hatzis P, Talianidis I. Regulatory mechanisms controlling human hepatocyte nuclear factor 4alpha gene expression. Mol Cell Biol 2001;21:7320–30.
33. Farragher SM, Tanney A, Kennedy RD, Paul Harkin D. RNA expression analysis from formalin fixed paraffin embedded tissues. Histochem Cell Biol 2008;130:435–45.
34. Cronin M, Pho M, Dutta D, Stephans JC, Shak S, Kiefer MC, et al. Measurement of gene expression in archival paraffin-embedded tissues: development and performance of a 92-gene reverse transcriptase-polymerase chain reaction assay. Am J Pathol 2004;164:35–42.
35. Gianni L, Zambetti M, Clark K, Baker J, Cronin M, Wu J, et al. Gene expression profiles in paraffin-embedded core biopsy tissue predict response to chemotherapy in women with locally advanced breast cancer. J Clin Oncol 2005;23:7265–77.
36. van 't Veer LJ, Bernards R. Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature 2008;452:564–70.
37. Loudig O, Milova E, Brandwein-Gensler M, Massimi A, Belbin TJ, Childs G, et al. Molecular restoration of archived transcriptional profiles by complementary-template reverse-transcription (CT-RT). Nucleic Acids Res 2007;35:e94.
38. Penland SK, Keku TO, Torrice C, He X, Krishnamurthy J, Hoadley KA, et al. RNA expression analysis of formalin-fixed paraffin-embedded tumors. Lab Invest 2007;87:383–91.
39. Ravo M, Mutarelli M, Ferraro L, Grober OM, Paris O, Tarallo R, et al. Quantitative expression profiling of highly degraded RNA from formalin-fixed, paraffin-embedded breast tumor biopsies by oligonucleotide microarrays. Lab Invest 2008;88:430–40.
40. Roberts RA, Sabalos CM, LeBlanc ML, Martel RR, Frutiger YM, Unger JM, et al. Quantitative nuclease protection assay in paraffin-embedded tissue replicates prognostic microarray gene expression in diffuse large-B-cell lymphoma. Lab Invest 2007;87:979–97.
41. Hoshida Y, Villanueva A, Kobayashi M, Peix J, Chiang DY, Camargo A, et al. Gene expression in fixed tissues and outcome in hepatocellular carcinoma. N Engl J Med 2008;359:1995–2004.
42. Hassan KA, Chen G, Kalemkerian GP, Wicha MS, Beer DG. An embryonic stem cell-like signature identifies poorly differentiated lung adenocarcinoma but not squamous cell carcinoma. Clin Cancer Res 2009;15:6386–90.
43. Glinsky GV, Berezovska O, Glinskii AB. Microarray analysis identifies a death-from-cancer signature predicting therapy failure in patients with multiple types of cancer. J Clin Invest 2005;115:1503–21.