Purpose: This study aimed to identify and validate a gene expression signature for squamous cell carcinoma of the lung (SQCC).

Experimental Design: A published microarray dataset from 129 SQCC patients was used as a training set to identify the minimal gene set prognostic signature. This was selected using the MAximizing R Square Algorithm (MARSA), a novel heuristic signature optimization procedure based on goodness-of-fit (R square). The signature was tested internally by leave-one-out-cross-validation (LOOCV), and then externally in three independent public lung cancer microarray datasets: two datasets of non–small cell lung cancer (NSCLC) and one of adenocarcinoma (ADC) only. Quantitative-PCR (qPCR) was used to validate the signature in a fourth independent SQCC cohort.

Results: A 12-gene signature that passed the internal LOOCV validation was identified. The signature was independently prognostic for SQCC in two NSCLC datasets (total n = 223) but not in ADC. The lack of prognostic significance in ADC was confirmed in the Director's Challenge ADC dataset (n = 442). The prognostic significance of the signature was validated further by qPCR in another independent cohort containing 62 SQCC samples (hazard ratio, 3.76; 95% confidence interval, 1.10-12.87; P = 0.035).

Conclusions: We identified a novel 12-gene prognostic signature specific for SQCC and showed the effectiveness of MARSA to identify prognostic gene expression signatures. Clin Cancer Res; 16(20); 5038–47. ©2010 AACR.

Translational Relevance

Lung squamous cell cancer (SQCC) is the second most frequent lung cancer. Treatment for this malignancy is determined by stage, although this system lacks precision at an individual patient level with respect to predicting likelihood of recurrence and death from cancer. Tumor phenotype has been shown to be driven by the underlying gene expression. The present study aimed to identify a gene expression signature capturing the altered key pathways in cancer biology for SQCC patients. We identified a 12-gene signature prognostic for SQCC that is independent of stage. Importantly, its prognostic independence has been consistently validated either in silico or by quantitative PCR in multiple independent SQCC cohorts. This signature enables us to identify low- and high-risk patients with distinct survival outcomes that potentially may direct postoperative therapy. The signature has the potential to play an important role in refining lung cancer management.

Identifying gene expression signatures that capture altered key pathways/regulators in carcinogenesis may lead to the discovery of molecular subclasses and predict patient outcomes (1). Several prognostic gene expression signatures have been published for non–small cell lung cancer (NSCLC; refs. 28) and its adenocarcinoma (ADC) subtype (912). Fewer studies have been done to identify prognostic signatures specific for lung squamous cell carcinoma (SQCC; refs. 13, 14), but their validation in independent cohorts or datasets has been limited.

Factors such as patient/sample heterogeneity, small sample size, variation in microarray platforms, and RNA preparation and hybridization protocols could all contribute to difficulties in the validation of gene expression signatures. In addition, the loss of information through arbitrary exclusion of patients or genes prior to analysis may play an important role. Supervised data mining methodology assigns cases into good and poor prognosis subgroups at specified time points (13, 15). This arbitrary assignment of a cutoff to split good and poor prognosis cases could be problematic due to the nonlinear relationships between gene expression and patient survival. Other investigators have compared two extremes in outcome (very early death versus long survival; refs. 3, 12); however, this approach may result in significant information loss, for almost half of the cases with intermediate survival are excluded from analysis, thereby leading to high finite sample variation (16) and making the cohort under study less representative. This variability explains, therefore, why the validation of the identified signature published to date has been challenging.

It is estimated that most tissues express only 30% to 40% of genes (17) or 10,000 to 15,000 genes (18). Furthermore, among the expressed genes from similar tissue types, only a small fraction is differentially expressed. Only these differentially expressed genes distinguish one phenotype from another. In an attempt to compensate for this in genome-wide microarray studies, some investigators have excluded genes with low expression or low variation prior to signature selection (3, 810). This approach may result in the exclusion of potentially important low-expression but key regulatory genes, leading to another potential source of information loss. In addition, signatures are generated using a forced forward inclusion procedure predetermined by the rank of significance of the gene (8, 9) or the bootstrap score (13), regardless of whether the included gene contributes to the classification ability of the signature. The lack of heuristic measures in these methods potentially reduces the robustness of these signatures.

We applied a new heuristic MAximizing R Square Analysis (MARSA) algorithm to identify a novel 12-gene prognostic signature from a microarray dataset of 129 SQCC patient samples (13). To show the robustness of this signature, we tested it in multiple independent cohorts of histology-defined patients with microarray or quantitative reverse transcriptase-PCR (qPCR).

Datasets

Four large, publicly available NSCLC microarray datasets were used: 129 SQCC samples collected between October 1991 and July 2002 from Molecular Diagnostics, Veridex LLC (UM; ref. 13), 85 NSCLC samples (44 SQCC and 41 ADC) samples from Duke Lung Cancer Prognostic Laboratory at Duke University (Duke; ref. 3), 138 NSCLC samples (76 SQCC and 62 ADC) from Sungkyunkwan University (SKKU) without preoperative therapy (7), and 327 ADC samples from the National Cancer Institute Director's Challenge Consortium for the Molecular Classification of ADC (DCC; ref. 11). UM was used as the training set, whereas the remaining three datasets served as independent test sets. In addition, qPCR validation of the signature was carried out in 62 SQCC samples (tumor cellularity ≥60%) from the University Health Network (UHN) without preoperative or postoperative adjuvant therapy. Patient demographics of the five independent datasets are shown in Table 1. The primary survival end point was 5-year survival (in UM, Duke, DCC) and overall survival for UHN or disease-free survival (SKKU). Additional details are provided online as “Supplementary Materials.”

Table 1.

Demographic data for patients in the five datasets

UMDukeSKKUDCC*UHN
n 129 89 138 327 62 
Age, years 
    <65 52 (40.3) 33 (37.1) 79 (57.2) 152 (46.5) 20 (32.3) 
    ≥65 77 (59.7) 56 (62.9) 59 (42.8) 175 (53.5) 42 (67.7) 
Sex 
    Male 82 (63.6) 54 (60.7) 104 (75.4) 172 (52.6) 41 (66.1) 
    Female 47 (36.4) 35 (39.3) 34 (24.6) 155 (47.4) 21 (33.9) 
Stage 
    IA 27 (20.9) 37 (41.6) 16 (11.6) 108 (33.0) 12 (19.4) 
    IB 46 (35.7) 30 (33.7) 72 (52.2) 120 (36.7) 25 (40.3) 
    IIA 6 (4.7) 5 (5.6) 6 (4.3) 17 (5.2) 4 (6.5) 
    IIB 27 (20.9) 13 (14.6) 18 (13.0) 42 (12.8) 16 (25.8) 
    IIIA 17 (13.1) 3 (3.4) 16 (11.6) 31 (9.5) 5 (8.0) 
    IIIB 6 (4.7)  10 (7.2) 8 (2.4) 
    IV 1 (1.1) 
Histology 
    AD 43 (48.3) 62 (44.9) 327 (100) 
    SQ 129 (100) 46 (51.7) 76 (55.1) 62 (100) 
Platform U133A U133 +2 U133 +2 U133A qPCR 
UMDukeSKKUDCC*UHN
n 129 89 138 327 62 
Age, years 
    <65 52 (40.3) 33 (37.1) 79 (57.2) 152 (46.5) 20 (32.3) 
    ≥65 77 (59.7) 56 (62.9) 59 (42.8) 175 (53.5) 42 (67.7) 
Sex 
    Male 82 (63.6) 54 (60.7) 104 (75.4) 172 (52.6) 41 (66.1) 
    Female 47 (36.4) 35 (39.3) 34 (24.6) 155 (47.4) 21 (33.9) 
Stage 
    IA 27 (20.9) 37 (41.6) 16 (11.6) 108 (33.0) 12 (19.4) 
    IB 46 (35.7) 30 (33.7) 72 (52.2) 120 (36.7) 25 (40.3) 
    IIA 6 (4.7) 5 (5.6) 6 (4.3) 17 (5.2) 4 (6.5) 
    IIB 27 (20.9) 13 (14.6) 18 (13.0) 42 (12.8) 16 (25.8) 
    IIIA 17 (13.1) 3 (3.4) 16 (11.6) 31 (9.5) 5 (8.0) 
    IIIB 6 (4.7)  10 (7.2) 8 (2.4) 
    IV 1 (1.1) 
Histology 
    AD 43 (48.3) 62 (44.9) 327 (100) 
    SQ 129 (100) 46 (51.7) 76 (55.1) 62 (100) 
Platform U133A U133 +2 U133 +2 U133A qPCR 

NOTE: The values represent number of patients and comparative percentage in parentheses.

Abbreviations: AD, adenocarcinoma of lung; SQ, squamous cell carcinoma of lung.

*One case in DCC has no stage.

Not included in analysis.

Data preprocessing

Microarray data were obtained directly from the authors (13) or downloaded from the web (3, 7). Raw .cel files from UM and Duke were preprocessed by the Robust Multichip Average (RMA) algorithm using RMAexpress v0.5 (19), and log2 transformed. Because there were no raw data under the accession number GSE8894 (7), the GCRMA preprocessed data from SKKU were downloaded instead. Probe-sets (ps) were annotated using NetAffx v4.2 annotation tool (20). Only probe-sets with “grade A” annotation were used for signature optimization. Preprocessed data were standardized by Z-score transformation, which centered the expression level to mean zero and SD of 1 (21); risk score was the Z-score weighted by the coefficient of the Cox proportional model (4, 9).

Univariate analysis and signature selection

The association of the expression of individual probe-sets with 5-year survival was evaluated by Cox proportional hazards regression. A stringent threshold of P < 0.005 was set for preselecting the candidate probe-sets chosen for signature optimization (22).

Signature optimization was conducted by an exclusion followed by an inclusion selection procedure (Fig. 1A). The exclusion procedure took all probe-sets (96 ps) that met preselection criteria. Each probe-set was excluded one at a time, and this procedure was repeated until there was only one probe-set left. The inclusion procedure started with the probe-set left by the exclusion procedure. Each probe-set was added one at a time, and this procedure was repeated until all probe-sets were included. Then the probe set against R2 was plotted and a set of minimum number of probe-sets (12 ps) having the largest R2 was identified (Fig. 1B). Principal component analysis (PCA) and a stepwise selection in the Cox model were then used to synthesize gene expression information of the selected signature for dimensionality reduction, removing possible collinear expression of genes, and constructing a parsimonious multivariate model. The risk score for validation was the product of Z-score of gene expression and the coefficient of selected principal components (PC) in the multivariate model (Supplementary Table S1). The signature was then verified by an internal validation by the leave-one-out-cross-validation (LOOCV).

Fig. 1.

Selection of the prognostic signature. A, pipeline of the identification and validation of the prognostic signature. Ninety-six probe sets from 19,619 probe sets with grade A annotations were preselected by univariate analysis at P < 0.005. The signature was selected sequentially by exclusion and inclusion procedures. B, plot of the exclusion/inclusion selection. C, survival curves of the low- and high-risk groups classified by the 12-gene signature in the training set.

Fig. 1.

Selection of the prognostic signature. A, pipeline of the identification and validation of the prognostic signature. Ninety-six probe sets from 19,619 probe sets with grade A annotations were preselected by univariate analysis at P < 0.005. The signature was selected sequentially by exclusion and inclusion procedures. B, plot of the exclusion/inclusion selection. C, survival curves of the low- and high-risk groups classified by the 12-gene signature in the training set.

Close modal

Validation of the expression signature

The signature generated from the training set and verified by LOOCV was then validated in silico in three independent datasets from Duke (3), SKKU (7), and DCC (11), and then by qPCR in our own UHN cohort. The independent prognostic value of the signature was tested in the multivariate Cox proportional hazard model, and was adjusted for stage, age, and sex. Statistical analyses were done using SAS v9.1 (SAS Institute).

Quantitative reverse transcriptase-PCR

qPCR was done on the ABI 7900HT real-time PCR system (AppliedbioSystems). Primers were designed primarily within the target sequences of the selected probe-sets; however, if no primer could be found in this area, primers were designed in the CDS of the target gene (Supplementary Table S2). PCR reaction optimization has been described previously (21). Five nanograms of cDNA were used for each reaction, and expression level was quantified using 2-ΔΔCt method after normalization to three housekeeping genes (TBP, BAT1, and ACTB; Supplementary Table S3).

Protein-protein interaction network construction and analysis

Protein-protein interaction (PPI) data were obtained by querying the Interologous Interaction Database (http://ophid.utoronto.ca/i2d; I2D v1.71; ref. 23). The interacting proteins were then used to query the same database to determine whether any interactions were present among them. The PPI network was visualized and annotated using NAViGaTOR v2.1.13 (http://ophid.utoronto.ca/navigator; ref. 24).

Gene Ontology term and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enrichment analysis

GoStat (25) was used to evaluate Gene Ontology (GO) term representation enrichment. For KEGG pathways (http://www.genome.jp/kegg/; ref. 26) representation enrichment analysis, Fisher's exact test was used and the significance was corrected by the Bonferroni method. Representation enrichment in the PPI network of the three signature probe-sets against 1,000 randomly generated PPI networks was tested by Student's t-test.

New prognostic gene expression signature for lung SQCC

The steps leading to signature identification and subsequent validation are represented schematically in Fig. 1A. In total there were 22,215 ps on the U133A chip, 19,619 with grade A annotation. Univariate analysis identified 96 ps that were significantly associated with overall survival at P < 0.005. The exclusion selection procedure started with these 96 ps and by stepwise exclusion, probe set 211514_at was identified as its last one. This was followed by the inclusion procedure using 211514_at as its starting probe-set. The procedure included one probe-set at a time until all 96 ps were included. The exclusion procedure identified the largest R2 of 0.77 with a combination of 12 ps (12-gene; Fig. 1B). PCA analysis and the multivariate Cox proportional hazard model with stepwise selection revealed that four PCs were significantly associated with survival at P < 0.05 (Supplementary Table S1). Subsequent LOOCV identified a predicting error of the signature of only 4.7% (6 cases). Thus, the 12-gene combination was established as the prognostic gene signature (Table 3).

When the risk score was dichotomized at the optimal cutoff (-0.056, Supplementary Table S1), the 12-gene signature classified 63 and 66 SQCC patients into low- and high-risk groups, respectively, with a significant difference in overall survival [hazard ratio (HR), 11.47; 95% confidence interval (95% CI), 4.78-27.49; P < 0.0001; Fig. 1C]. No significant association of the 12-gene signature with T and N stages, age, and sex was noted (data not shown). Multivariate analysis revealed that the signature was an independent prognostic factor after adjustment for stage, age, and sex (HR, 15.18; 95% CI, 6.04-38.11; P < 0.0001; Supplementary Table S4).

In silico validation of the new 12-gene signature

We first tested the 12-gene signature in the Duke 89 NSCLC dataset (46 SQCC and 43 ADC). Four patients with stage III-IV (2 ADC and 1 SQCC in stage III and 1 SQCC in stage IV) were excluded from further analysis (Table 1). When the risk score was dichotomized at -0.056, the signature classified 25 and 19 of 44 SQCC and 13 and 28 of 41 ADC into low- and high-risk groups, respectively. High-risk SQCC had significantly poorer survival than the low-risk group (HR, 2.91; 95% CI, 1.17-7.24; P = 0.022; Fig. 2A), whereas the survival difference between the different risk groups for the ADC patients was not significant (HR, 1.87; 95% CI, 0.92-3.82; P = 0.54; Supplementary Fig. S1A). Stratified analysis by stage showed that the high-risk group classified by the signature had poorer survival in both stage I (HR, 1.87; 95% CI, 0.65-5.43; P = 0.247; Fig. 2B) and II SQCC (HR, 7.69; 95% CI, 0.87-67.67; P = 0.066; Fig. 2C). The high-risk group was associated with older patients (age ≥65 years, P = 0.02) but not with stage or sex (data not shown). Furthermore, multivariate analysis showed that the signature was an independent prognostic factor in SQCC (HR, 3.05; 95% CI, 1.14-8.21; P = 0.027) but not in ADC (HR, 1.73; 95% CI, 0.59-5.12; P = 0.322; Table 2 and Supplementary Table S5) after adjustment for stage, age, and sex.

Fig. 2.

In silico and qPCR validation of the 12-gene signature in SQCC samples from Duke (A-C), SKKU (D-F), and UHN (G-I). Recurrence-free survival was used for SKKU.

Fig. 2.

In silico and qPCR validation of the 12-gene signature in SQCC samples from Duke (A-C), SKKU (D-F), and UHN (G-I). Recurrence-free survival was used for SKKU.

Close modal
Table 2.

Validation of the 12-gene signature

Squamous cell carcinomaAdenocarcinoma
nHR (95% CI)PnHR (95% CI)P
In silico validation 
Duke 44 3.05 (1.14-8.21) 0.027 41 1.73 (0.59-5.12) 0.322 
SKKU 76 2.77 (1.34-5.73) 0.006 62 1.92 (0.91-4.05) 0.086 
DCC    327 1.23 (0.85-1.78) 0.267 
Quantitative reverse transcriptase PCR validation 
UHN 62 3.76 (1.10-12.87) 0.035    
Squamous cell carcinomaAdenocarcinoma
nHR (95% CI)PnHR (95% CI)P
In silico validation 
Duke 44 3.05 (1.14-8.21) 0.027 41 1.73 (0.59-5.12) 0.322 
SKKU 76 2.77 (1.34-5.73) 0.006 62 1.92 (0.91-4.05) 0.086 
DCC    327 1.23 (0.85-1.78) 0.267 
Quantitative reverse transcriptase PCR validation 
UHN 62 3.76 (1.10-12.87) 0.035    

NOTE: The prognostic effect of the MARSA 12-gene signature was adjusted for stage, patients' age, and sex.

Abbreviation: n, number of patients.

The SKKU dataset (7) included 138 stage I-III NSCLC (76 SQCC and 62 ADC) patients profiled using the U133 plus 2 chip. This is the only NSCLC microarray dataset from Asia. Validation of our signature used recurrence-free survival as this is the only end point reported for this study. Because the GEO database has no raw data, we downloaded the expression data which was already GCRMA-preprocessed and log2 transformed. Gene expression level was Z-score transformed and risk score was derived using the formula listed in Supplementary Table S1. The 12-gene signature classified 41 and 35 of 76 SQCC and 27 and 35 of 62 ADC into low- and high-risk groups, respectively. Significantly shortened recurrence-free survival was observed in the high-risk group in the SQCC (HR, 2.46; 95% CI, 1.26-4.79; P = 0.008; Fig. 2D) but not in the ADC (HR, 1.43; 95% CI, 0.70-2.90; P = 0.323; Supplementary Fig. S1B). Stratified analysis by stage showed that the signature worked in stage I (HR, 2.52; 95% CI, 0.93-6.78; P = 0.068; Fig. 2E) and stage II and III (HR, 6.20; 95% CI, 1.84-20.86; P = 0.003; Fig. 2F). The risk groups were not significantly associated with stage, sex, and age (data not shown). Multivariate analysis showed that the signature was independently prognostic in SQCC (HR, 2.77; 95% CI, 1.34-5.73; P = 0.006) but not in ADC (HR, 1.92; 95% CI, 0.91-4.05; P = 0.086; Table 2 and Supplementary Table S5) after adjustment for stage, age, and sex.

To determine further whether the signature was prognostic in ADC, the 12-gene signature was tested in the largest available ADC microarray dataset from the NIH Director's Challenge Consortium study (11), which included 442 samples. Among them, 327 patients did not receive any adjuvant chemotherapy or radiotherapy and had follow-up longer than 1 month. The 12-gene signature was not prognostic (HR, 1.26; 95% CI, 0.87-1.81; P = 0.221; Supplementary Fig. S1C). Multivariate analysis showed that it was not an independent prognostic factor in ADC (HR, 1.23; 95% CI, 0.85-1.78; P = 0.267; Table 2). These data confirm that the signature was not prognostic in ADC.

qPCR validation in UHN SQCC cohort

qPCR validation of the 12-gene signature was done in an independent set of 62 snap-frozen SQCC samples from UHN without preoperative or postoperative therapy. Fold change was calculated using 2-ΔΔCt method and then Z-score transformed. Risk score was generated using parameters listed in Supplementary Table S1. When risk score was dichotomized at -0.056, the 12-gene signature was able to separate 41 and 21 SQCC into low- and high-risk groups with significant difference in 5-year overall survival (HR, 4.00; 95% CI, 1.20-13.31; P = 0.024; Fig. 2G). Stratified analysis by stage revealed that the signature was able to separate low- and high-risk groups with different survival outcomes; however, the significance was marginal due to the small sample size (stage I: HR, 3.39; 95% CI, 0.66-17.47; P = 0.145; Fig. 2H; and stage II and III: HR, 5.33; 95% CI, 0.88-32.19; P = 0.069; Fig. 2I). The risk groups were not associated with stage, age, and sex (data not shown). Nevertheless, multivariate analysis again showed that the signature was an independent prognostic factor (HR, 3.76; 95% CI, 1.10-12.87; P = 0.035; Table 2).

Because smoking influences airway gene expression, we assessed the association of the signature with smoking status. There were 46 past and 15 current smokers, and only 1 lifetime nonsmoker in the UHN cohort. The χ2 test indicated that the association between the signature and patients' past/current smoking status was not significant (P = 0.81). A technical factor that may have an impact on the gene expression level is the tumor cellularity. We did a Pearson correlation analysis of the risk score (as a continuous variable) and tumor cellularity in the UHN cohort. It showed that the risk score was not significantly correlated with tumor percentage (Pearson r = -0.08, P = 0.71).

The composition of the 12-gene signature

Table 3 shows the members of the12-gene signature and their ranks of expression level, variance, and significance in the Veridex dataset (in decreasing order of importance). Notably, the expression level of individual genes varies greatly, from very high levels as for RPL22 (ranked in the top 0.6%) to extremely low levels for PTPN20A/B (ranked at 99.7%). The SD value also varies greatly, from very large as for G0S2 (ranked at 1.9% of the total) to very small for RIPK5 (ranked at 97.5% of the total). These data showed that the low-expression and low-variabity genes were as important as those with higher expression and higher variability.

Table 3.

Composition of the 12-gene signature

Probe setGene symbolGene titleRank of exp. (%)Rank of SD (%)Rank of sig. (%)
n = 19,619n = 19,619n = 96
221775_x_at RPL22 Ribosomal protein L22 117 (0.6) 12,095 (61.7) 79 (82.3) 
211527_x_at VEGFA Vascular endothelial growth factor A 3,660 (18.7) 910 (4.6) 48 (50.0) 
213524_s_at G0S2 G0-G1switch 2 4,403 (22.4) 365 (1.9) 69 (71.9) 
218678_at NES Nestin 4,504 (23.0) 4,749 (24.2) 64 (66.7) 
211282_x_at TNFRSF25 Tumor necrosis factor receptor superfamily, member 25 7,582 (38.7) 6,614 (33.7) 59 (61.5) 
36552_at DKFZP586P0123 Hypothetical protein 9,094 (46.4) 11,934 (60.8) 31 (32.3) 
221900_at COL8A2 Collagen, type VIII, α 2 10,236 (52.2) 1,574 (8.0) 66 (68.8) 
219604_s_at ZNF3 Zinc finger protein 3 15,673 (79.9) 18,300 (93.3) 71 (74.0) 
211514_at RIPK5 Receptor interacting protein kinase 5 15,976 (81.4) 19,129 (97.5) 2 (2.1) 
221909_at RNFT2 Ring finger protein, transmembrane 2 16,306 (83.1) 2,740 (14.0) 3 (3.1) 
201335_s_at ARHGEF12 Rho guanine nucleotide exchange factor (GEF) 12 17,123 (87.3) 18,491 (94.3) 21 (21.9) 
215172_at PTPN20A/B Protein tyrosine phosphatase, non-receptor type 20A/B 19,558 (99.7) 17,956 (91.5) 65 (67.7) 
Probe setGene symbolGene titleRank of exp. (%)Rank of SD (%)Rank of sig. (%)
n = 19,619n = 19,619n = 96
221775_x_at RPL22 Ribosomal protein L22 117 (0.6) 12,095 (61.7) 79 (82.3) 
211527_x_at VEGFA Vascular endothelial growth factor A 3,660 (18.7) 910 (4.6) 48 (50.0) 
213524_s_at G0S2 G0-G1switch 2 4,403 (22.4) 365 (1.9) 69 (71.9) 
218678_at NES Nestin 4,504 (23.0) 4,749 (24.2) 64 (66.7) 
211282_x_at TNFRSF25 Tumor necrosis factor receptor superfamily, member 25 7,582 (38.7) 6,614 (33.7) 59 (61.5) 
36552_at DKFZP586P0123 Hypothetical protein 9,094 (46.4) 11,934 (60.8) 31 (32.3) 
221900_at COL8A2 Collagen, type VIII, α 2 10,236 (52.2) 1,574 (8.0) 66 (68.8) 
219604_s_at ZNF3 Zinc finger protein 3 15,673 (79.9) 18,300 (93.3) 71 (74.0) 
211514_at RIPK5 Receptor interacting protein kinase 5 15,976 (81.4) 19,129 (97.5) 2 (2.1) 
221909_at RNFT2 Ring finger protein, transmembrane 2 16,306 (83.1) 2,740 (14.0) 3 (3.1) 
201335_s_at ARHGEF12 Rho guanine nucleotide exchange factor (GEF) 12 17,123 (87.3) 18,491 (94.3) 21 (21.9) 
215172_at PTPN20A/B Protein tyrosine phosphatase, non-receptor type 20A/B 19,558 (99.7) 17,956 (91.5) 65 (67.7) 

NOTE: Rank of expression level (rank of exp.) is from high to low, rank of SD is from large to small, and rank of signature level (rank of sig.) is from high to low.

The GO (27) and KEGG pathway (26, 28) annotations revealed the involvement of several of the prognostic genes in signal transduction (e.g., VEGFA, TNFRSF25), cell cycle (e.g., VEGFA, G0S2), apoptosis (e.g., TNFRSF25), adhesion (e.g., COL8A2), and transcription and translation (ZNF3 and RPL22, respectively; Supplementary Table S6).

PPI network analysis

To assess the potential SQCC-specific biological relevance of the 12-gene signature genes further, we evaluated the functional relationship between our 12-gene signature and the reported Raponi (13) and Sun (8) 50-gene signatures (mapped to 12, 48, and 42 genes, respectively) through their corresponding PPI networks. We mapped 8 of 12 genes of the 12-gene signature, 35 of 48 and 31 of 42 for the Raponi and Sun signatures, respectively, to PPIs in the I2D 1.7 (23). Whereas the Raponi and Sun signatures had 10 overlapping probe-sets (9 genes), the 12-gene signature had no probe-sets/genes overlapping with either of the 50-gene signatures. However, direct interactions between the signature genes/proteins or via shared interacting proteins were seen among these signatures, implying a rich shared functional milieu (Fig. 3). Annotation of the resulting PPI network with KEGG pathways indicated significant enrichment for proteins from the mitogen-activated protein kinase (MAPK) signaling pathway (P = 0.019; 80 of 1,075 proteins), which form direct interactions with 3, 14, and 9 genes/proteins of ours and the Raponi and Sun signatures, respectively (Supplementary Tables S6-S8).

Fig. 3.

Genes of the 12-gene signature, and of the Sun 50-gene and Raponi 50-gene SQCC prognostic signatures mapped to PPI data form a connected PPI network. Genes of the 12-gene and the two previously published prognostic signatures for SQCC were mapped to PPI data in I2D (http://ophid.utoronto.ca/i2d; v.1.7) and visualized in NaVIGaTOR v.2.08 (http://ophid.utoronto.ca/navigator; ref. 24). The network comprises 1,075 proteins and 14,651 interactions. Shapes/nodes, proteins; lines/edges, interactions. Node color corresponds to biological function according to GO annotation as indicated in the legend. In the 12-gene signature, 8 of 12 genes were mapped to PPI data. In the Sun 50-gene signature, 31 of 42 targets were mapped. In the Raponi 50-gene signature, 35 of 48 targets were mapped. Eight of 9 genes overlapping between Sun 50-gene and Raponi 50-gene signatures were mapped to PPI data. Direct interaction between the 12-gene signature gene ARHGEF12 and IGF-IR, a therapeutic target in SQCC, is indicated by turquoise edge color (top right). Faded-out nodes and edges, interactions of individual signature genes that which do not contribute to the interaction among the three signatures.

Fig. 3.

Genes of the 12-gene signature, and of the Sun 50-gene and Raponi 50-gene SQCC prognostic signatures mapped to PPI data form a connected PPI network. Genes of the 12-gene and the two previously published prognostic signatures for SQCC were mapped to PPI data in I2D (http://ophid.utoronto.ca/i2d; v.1.7) and visualized in NaVIGaTOR v.2.08 (http://ophid.utoronto.ca/navigator; ref. 24). The network comprises 1,075 proteins and 14,651 interactions. Shapes/nodes, proteins; lines/edges, interactions. Node color corresponds to biological function according to GO annotation as indicated in the legend. In the 12-gene signature, 8 of 12 genes were mapped to PPI data. In the Sun 50-gene signature, 31 of 42 targets were mapped. In the Raponi 50-gene signature, 35 of 48 targets were mapped. Eight of 9 genes overlapping between Sun 50-gene and Raponi 50-gene signatures were mapped to PPI data. Direct interaction between the 12-gene signature gene ARHGEF12 and IGF-IR, a therapeutic target in SQCC, is indicated by turquoise edge color (top right). Faded-out nodes and edges, interactions of individual signature genes that which do not contribute to the interaction among the three signatures.

Close modal

We describe here the MARSA algorithm, a heuristic signature selection method that includes only genes contributing to the separation ability of the signature. By applying the algorithm to the UM dataset, we identified a 12-gene prognostic signature. The prognostic value of the 12-gene signature was validated in silico in two independent SQCC microarray datasets (Duke HR, 3.05, 95% CI, 1.14-8.21; P = 0.027; SKKU HR, 2.73; 95% CI, 1.32-5.64; P = 0.007; Table 2) but not in the corresponding ADC datasets (Table 2). Further, we confirmed the absence of the prognostic value of the 12-gene signature in the largest available ADC dataset from DCC containing 442 ADC samples (Table 2). Importantly, qPCR validation in another independent cohort confirmed that the signature was an independent prognostic factor in SQCC (Table 2). Combined, our data strongly suggested that the 12-gene signature is a valuable prognostic factor for SQCC.

A prognostic factor provides information about a patient's overall outcome that is independent of therapy, whereas a predictive biomarker gives information about the effect of a therapeutic intervention. When establishing a prognostic factor, it is best to evaluate outcome in patients who have not received systemic adjuvant therapy after surgery because chemotherapy has been shown to change the course of the disease and to improve overall survival significantly. In the present study, samples in the UM dataset were collected from patients between October 1991 and July 2002 who were not likely to have received adjuvant chemotherapy. Importantly, the signature was computationally confirmed in the two validation datasets (Duke and SKKU), which were used to explore prognostic signatures, and validated by qPCR in 62 patients in the UHN cohort of patients who had not received adjuvant therapy. Therefore, in these untreated patients our 12-gene signature is prognostic, and not predictive.

The cellular origin and pathogenesis of SQCC and ADC remain controversial. In contrast to ADC, SQCC tends to arise in the epithelium of large airways and its etiology is clearly linked to smoking, suggesting different pathogenetic differences between the two lung cancer types (29). This is supported by differences in the occurrence of key genetic alterations in the two types of cancer (30). Although frequently mutated in ADC, KRAS (31, 32) and EGFR (33) mutations occur very infrequently in SQCC. In contrast, P53 mutation (32), TIMP3 (34) and HIF-1α (35) overexpressions occur more frequently in SQCC than in ADC of the lung. Moreover, gene expression profiling has shown distinctive patterns among the subtypes of NSCLC (36). Therefore, it may not be surprising that there could be gene signatures that are prognostic in SQCC but not in ADC patients.

Cancer phenotype is characterized by underlying gene expression. Thus, gene expression signatures may predict clinical outcome. The fact that our signature could be validated consistently in multiple independent SQCC cohorts supports a notion that it might have captured a key gene expression program in squamous cancer biology. Indeed, many members of the 12-gene signature have been reported to be involved in processes underlying tumorigenesis, including tumor necrosis factor receptor superfamily member 25 (TNFRSF25), triggering apoptosis and activating the transcription factor NF-κB in HEK293 or HeLa cells (37), RIPK5, a cell death inducer (38). Vascular endothelial growth factor (VEGF or VEGFA) has been extensively studied (39) and is a major regulator of tumor angiogenesis (40). ARHGEF4 (Rho guanine nucleotide exchange factor 4) is involved in G-protein mediated signaling, which has been implicated in regulating cell morphology and invasion (41). It has also been shown to interact directly with insulin-like growth factor receptor I (IGF-IR), providing a link between G protein-coupled and IGF-IR signaling pathways (ref. 42; Fig. 3). Inhibitors of IGF-IR are being studied in clinical trials in combination with chemotherapy and EGFR therapy, and preliminary results show high response rates in advanced NSCLC patients, especially of the SQCC subtype (43), although randomized trials to date have been negative. In addition, our PPI analysis reveals significant enrichment in representation of genes involved in the MAPK signaling pathway (P = 0.019), which has been shown to be active in SQCC (4446). These support the functional relevance of the 12-gene signature in SQCC. However, further biological and clinical validation of the signature is warranted.

Previous approaches to the identification of prognostic signatures filtered out low-expression or low-variance genes prior to signature selection. However, this might lead to the exclusion of low-expression but important genes in the signatures. In fact, one third of the genes (ARHGEF4, RIPK5, PTPN20A, and ZNF3) in the 12-gene signature had expression levels in the lowest 20% (from 79.9-99.7%), whereas their variation (SD) was in the lowest 10% (from 91.5-97.5%; Table 3) of all probe-sets. The consistent performance of the 12-gene signature in the training and test cohorts implied that these low-expressed and low-variable genes might have played important roles in tumor progression, and thus these genes must be included in signature selection.

In summary, MARSA is an effective approach to identify prognostic gene expression signatures, and this novel 12-gene prognostic signature seems specific for SQCC.

No potential conflicts of interest were disclosed.

We thank Dr. Raponi and Veridex for providing the raw data of UM dataset, Dr. Keyue Ding of the National Cancer Institute of Canada Clinical Trials Group, and Dr. Geoffrey Liu and Ms. Melania Pintilie for helpful suggestions.

Grant Support: Canadian Cancer Society (NCIC 015184); Ontario Research Fund-Research Excellence (#03-020); Canada Foundation for Innovation (#12301, #203383); Canada Research Chair program; the Natural Science & Engineering Research Council of Canada (Grant #104105); Canadian Institutes for Health Research (CIHR 202370); and IBM. This research was funded in part by the Ontario Ministry of Health and Long Term Care (OMOHLTC). The views expressed do not necessarily reflect those of the OMOHLTC.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1
Ramaswamy
S
,
Tamayo
P
,
Rifkin
R
, et al
. 
Multiclass cancer diagnosis using tumor gene expression signatures
.
Proc Natl Acad Sci U S A
2001
;
98
:
15149
54
.
2
Tomida
S
,
Koshikawa
K
,
Yatabe
Y
, et al
. 
Gene expression-based, individualized outcome prediction for surgically treated lung cancer patients
.
Oncogene
2004
;
23
:
5360
70
.
3
Potti
A
,
Mukherjee
S
,
Petersen
R
, et al
. 
A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer
.
N Engl J Med
2006
;
355
:
570
80
.
4
Chen
HY
,
Yu
SL
,
Chen
CH
, et al
. 
A five-gene signature and clinical outcome in non-small-cell lung cancer
.
N Engl J Med
2007
;
356
:
11
20
.
5
Lu
Y
,
Lemon
W
,
Liu
PY
, et al
. 
A gene expression signature predicts survival of patients with stage I non-small cell lung cancer
.
PLoS Med
2006
;
3
:
e467
.
6
Ikehara
M
,
Oshita
F
,
Sekiyama
A
, et al
. 
Genome-wide cDNA microarray screening to correlate gene expression profile with survival in patients with advanced lung cancer
.
Oncol Rep
2004
;
11
:
1041
4
.
7
Lee
ES
,
Son
DS
,
Kim
SH
, et al
. 
Prediction of recurrence-free survival in postoperative non-small cell lung cancer patients by using an integrated model of clinical information and gene expression
.
Clin Cancer Res
2008
;
14
:
7397
404
.
8
Sun
Z
,
Wigle
DA
,
Yang
P
. 
Non-overlapping and non-cell-type-specific gene expression signatures predict lung cancer survival
.
J Clin Oncol
2008
;
26
:
877
83
.
9
Beer
DG
,
Kardia
SL
,
Huang
CC
, et al
. 
Gene-expression profiles predict survival of patients with lung adenocarcinoma
.
Nat Med
2002
;
8
:
816
24
.
10
Bhattacharjee
A
,
Richards
WG
,
Staunton
J
, et al
. 
Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses
.
Proc Natl Acad Sci U S A
2001
;
98
:
13790
5
.
11
Shedden
K
,
Taylor
JM
,
Enkemann
SA
, et al
. 
Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study
.
Nat Med
2008
;
14
:
822
7
.
12
Larsen
JE
,
Pavey
SJ
,
Passmore
LH
,
Bowman
RV
,
Hayward
NK
,
Fong
KM
. 
Gene expression signature predicts recurrence in lung adenocarcinoma
.
Clin Cancer Res
2007
;
13
:
2946
54
.
13
Raponi
M
,
Zhang
Y
,
Yu
J
, et al
. 
Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung
.
Cancer Res
2006
;
66
:
7466
72
.
14
Larsen
JE
,
Pavey
SJ
,
Passmore
LH
, et al
. 
Expression profiling defines a recurrence signature in lung squamous cell carcinoma
.
Carcinogenesis
2007
;
28
:
760
6
.
15
Bianchi
F
,
Nuciforo
P
,
Vecchi
M
, et al
. 
Survival prediction of stage I lung adenocarcinomas by expression of 10 genes
.
J Clin Invest
2007
;
117
:
3436
44
.
16
Schumacher
M
,
Binder
H
,
Gerds
T
. 
Assessment of survival prediction models based on microarray data
.
Bioinformatics
2007
;
23
:
1768
74
.
17
Su
AI
,
Cooke
MP
,
Ching
KA
, et al
. 
Large-scale analysis of the human and mouse transcriptomes
.
Proc Natl Acad Sci U S A
2002
;
99
:
4465
70
.
18
Jongeneel
CV
,
Iseli
C
,
Stevenson
BJ
, et al
. 
Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing
.
Proc Natl Acad Sci U S A
2003
;
100
:
4702
5
.
19
Bolstad
BM
,
Irizarry
RA
,
Astrand
M
,
Speed
TP
. 
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias
.
Bioinformatics
2003
;
19
:
185
93
.
20
Affymetrix
.
Transcript assignment for NetAffxTM annotation
.
Affymetrix
2006
.
21
Lau
SK
,
Boutros
PC
,
Pintilie
M
, et al
. 
Three-gene prognostic classifier for early-stage non small-cell lung cancer
.
J Clin Oncol
2007
;
25
:
5562
9
.
22
Simon
R
. 
Roadmap for developing and validating therapeutically relevant genomic classifiers
.
J Clin Oncol
2005
;
23
:
7332
41
.
23
Brown
KR
,
Jurisica
I
. 
Unequal evolutionary conservation of human protein interactions in interologous networks
.
Genome Biol
2007
;
8
:
R95
.
24
Brown
KR
,
Otasek
D
,
Ali
M
, et al
. 
NAViGaTOR: Network Analysis, Visualization and Graphing Toronto
.
Bioinformatics
2009
;
25
:
3327
9
.
25
Beissbarth
T
,
Speed
TP
. 
GOstat: find statistically overrepresented Gene Ontologies within a group of genes
.
Bioinformatics
2004
;
20
:
1464
5
.
26
Kanehisa
M
,
Goto
S
. 
KEGG: Kyoto encyclopedia of genes and genomes
.
Nucleic Acids Res
2000
;
28
:
27
30
.
27
Ashburner
M
,
Ball
CA
,
Blake
JA
, et al
. 
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium
.
Nat Genet
2000
;
25
:
25
9
.
28
Ogata
H
,
Goto
S
,
Sato
K
,
Fujibuchi
W
,
Bono
H
,
Kanehisa
M
. 
KEGG: Kyoto Encyclopedia of Genes and Genomes
.
Nucleic Acids Res
1999
;
27
:
29
34
.
29
Ishikawa
H
,
Nakayama
Y
,
Kitamoto
Y
, et al
. 
Effect of histologic type on recurrence pattern in radiation therapy for medically inoperable patients with stage I non-small-cell lung cancer
.
Lung
2006
;
184
:
347
53
.
30
Zhu
CQ
,
Shih
W
,
Ling
CH
,
Tsao
MS
. 
Immunohistochemical markers of prognosis in non-small cell lung cancer: a review and proposal for a multiphase approach to marker evaluation
.
J Clin Pathol
2006
;
59
:
790
800
.
31
Salgia
R
,
Skarin
AT
. 
Molecular abnormalities in lung cancer
.
J Clin Oncol
1998
;
16
:
1207
17
.
32
Tsao
MS
,
Aviel-Ronen
S
,
Ding
K
, et al
. 
Prognostic and predictive importance of p53 and RAS for adjuvant chemotherapy in non small-cell lung cancer
.
J Clin Oncol
2007
;
25
:
5240
7
.
33
Tsao
MS
,
Sakurada
A
,
Cutz
JC
, et al
. 
Erlotinib in lung cancer - molecular and clinical predictors of outcome
.
N Engl J Med
2005
;
353
:
133
44
.
34
Mino
N
,
Takenaka
K
,
Sonobe
M
, et al
. 
Expression of tissue inhibitor of metalloproteinase-3 (TIMP-3) and its prognostic significance in resected non-small cell lung cancer
.
J Surg Oncol
2007
;
95
:
250
7
.
35
Lee
CH
,
Lee
MK
,
Kang
CD
, et al
. 
Differential expression of hypoxia inducible factor-1 alpha and tumor cell proliferation between squamous cell carcinomas and adenocarcinomas among operable non-small cell lung carcinomas
.
J Korean Med Sci
2003
;
18
:
196
203
.
36
Hofmann
HS
,
Bartling
B
,
Simm
A
, et al
. 
Identification and classification of differentially expressed genes in non-small cell lung cancer by expression profiling on a global human 59.620-element oligonucleotide array
.
Oncol Rep
2006
;
16
:
587
95
.
37
Marsters
SA
,
Sheridan
JP
,
Donahue
CJ
, et al
. 
Apo-3, a new member of the tumor necrosis factor receptor family, contains a death domain and activates apoptosis and NF-kappa B
.
Curr Biol
1996
;
6
:
1669
76
.
38
Zha
J
,
Zhou
Q
,
Xu
LG
, et al
. 
RIP5 is a RIP-homologous inducer of cell death
.
Biochem Biophys Res Commun
2004
;
319
:
298
303
.
39
Leung
DW
,
Cachianes
G
,
Kuang
WJ
,
Goeddel
DV
,
Ferrara
N
. 
Vascular endothelial growth factor is a secreted angiogenic mitogen
.
Science
1989
;
246
:
1306
9
.
40
Folkman
J
. 
Angiogenesis in cancer, vascular, rheumatoid and other disease
.
Nat Med
1995
;
1
:
27
31
.
41
Kitzing
TM
,
Sahadevan
AS
,
Brandt
DT
, et al
. 
Positive feedback between Dia1, LARG, RhoA regulates cell morphology and invasion
.
Genes Dev
2007
;
21
:
1478
83
.
42
Taya
S
,
Inagaki
N
,
Sengiku
H
, et al
. 
Direct interaction of insulin-like growth factor-1 receptor with leukemia-associated RhoGEF
.
J Cell Biol
2001
;
155
:
809
20
.
43
Karp
DD
,
Paz-Ares
LG
,
Novello
S
, et al
. 
High activity of the anti-IGF-IR antibody CP-751,871 in combination with paclitaxel and carboplatin in squamous NSCLC
.
J Clin Oncol
2008
;
26
:
Abstr 8015
.
44
Sekido
Y
,
Fong
KM
,
Minna
JD
. 
Molecular genetics of lung cancer
.
Annu Rev Med
2003
;
54
:
73
87
.
45
Fong
KM
,
Sekido
Y
,
Gazdar
AF
,
Minna
JD
. 
Lung cancer. 9: Molecular biology of lung cancer: clinical implications
.
Thorax
2003
;
58
:
892
900
.
46
Scagliotti
GV
,
Selvaggi
G
,
Novello
S
,
Hirsch
FR
. 
The biology of epidermal growth factor receptor in lung cancer
.
Clin Cancer Res
2004
;
10
:
4227
32s
.