Gene-Expression Profile Changes Correlated with Tumor Progression and Lymph Node Metastasis in Esophageal Cancer

Purpose: The purpose of this research was to identify molecular clues to tumor progression and lymph node metastasis in esophageal cancer and to test their value as predictive markers. Experimental Design: We explored the gene expression profiles in cDNA array data of a 36-tissue training set of esophageal squamous cell carcinoma (ESCC) by using generalized linear model-based regression analysis and a feature subset selection algorithm. By applying the identified optimal feature sets (predictive gene sets), we trained and developed ensemble classifiers consisting of multiple probabilistic neural networks combined with AdaBoosting to predict tumor stages and lymph node metastasis. We validated the classifier abilities with 18 independent cases of ESCC. Results: We identified 71 genes of 1289 cancer-related genes of which the expression correlated with tumor stages. Of the 71 genes, 47 significantly differed between the Tumor-Node-Metastasis pT1/2 and pT3/4 stages. Cell cycle regulators and transcriptional factors possibly promoting the growth of tumor cells were highly expressed in the early stages of ESCC, whereas adhesion molecules and extracellular matrix-related molecules possibly promoting invasiveness increased in the later stages. For lymph node metastasis, we identified 44 genes with predictive values, which included cell adhesion molecules and cell membrane receptors showing higher expression in node-positive cases and cell cycle regulators and intracellular signaling molecules showing higher expression in node-negative cases. The ensemble classifiers trained with the selected features predicted tumor stage and lymph node metastasis in the 18 validation cases with respective accuracies of 94.4% and 88.9%. This demonstrated the reproducibility and predictive value of the identified features. Conclusion: We suggest that these characteristic genes will provide useful information for understanding the malignant nature of ESCC as well as information useful for personalizing the treatments.


INTRODUCTION
Esophageal cancer shows the poorest prognosis among the malignant tumors of the digestive tract. Although advances in diagnostic methods and health education have contributed to discovery of the disease in its early stages, many cases are still not detected until the advanced stage. Despite the use of modern surgical techniques in conjunction with multitreatment modalities such as radio-and chemotherapy, the overall 5-year survival rate remains 40 -60% (1)(2)(3)(4). In an effort to better understand this disease and predict its clinical outcome, numerous molecular markers for tumor progression and prognosis, e.g., STMY3 (5), interleukin 6 (6), C1orf10 (7), and EC45 (8), and for lymph node metastasis, e.g., caveolin-1 (9), EphA2 (10), FAK (11), cystain B (12), and MMP-12 (13), have been identified. This limited information, however, is not enough to clarify the carcinogenesis, tumor progression, and invasiveness of esophageal cancer; just as in the allegory about the five blind men and the elephant, the complete picture of the pathophysiology of the disease remains elusive.
As distinct from these "single gene-oriented" approaches, expression profiling that collectively analyzes the expression of many genes using a cDNA array has drawn attention as a promising approach for uncovering molecular mechanisms independently of previous knowledge (14 -16). In the current study, we analyzed gene expression profiles in a total of 36 esophageal cancers and their relation to pathological features based on the Tumor-Node-Metastasis staging. By using a generalized linear model-based regression analysis that allowed us to extract information on features relevant to discrete graded categories, we identified genes of which the expression was altered in association with tumor progression. Also, we identified genes of which the expressions were associated with lymph node metastasis by a feature-subset selection algorithm: this algorithm enabled us to optimize the statistical explanatory model for distinction, which we demonstrated in the classification of 18 independent cases for validation.

MATERIALS AND METHODS
Patients and Samples. We collected the primary cancer tissues obtained from a total of 54 patients with histologically verified esophageal squamous cell carcinomas (ESCCs) who underwent surgery from January 2001 to September 2003 in the Hokkaido University Hospital and 15 affiliated hospitals in Hokkaido prefecture, Japan. Only patients who agreed with the aim and contents of this study and who provided their written informed consent were included. One to five bulk tumor tissue samples of ϳ5 mm-size were immediately cut from the esophagus resected by a standard surgical procedure, snap frozen in liquid nitrogen, and stored at Ϫ80°C until use. Part of each sample piece was cut and stained with H&E for verification of the presence of squamous cell carcinoma cells, and the sample pieces, of which the evaluated areas were composed of at least 50% tumor cells, were used for RNA extraction. All of the procedures in this portion of the study were approved by the Ethics Committee of Hokkaido University and the independent internal ethics committees of the affiliated hospitals.
Clinicopathological Parameters. Histological subclassification and staging of the tumors was done by reviewing the specimens taken for pathological diagnosis, according to the Tumor-Node-Metastasis classification (17). The tumor status of each case was categorized based on the Tumor-Node-Metastasis classification (Unio Internationale Contra Cancrum, 6th edition) for the pT, pN, and pM stages. Pertinent major clinicopathological parameters are shown in Table 1.
Analysis of Gene Expression. Each frozen tissue was crushed in liquid nitrogen to powder by using a CRYO-PRESS compressor (Microtec Nition, Chiba, Japan). Total RNA was extracted using TRIzol reagent (Invitrogen, Tokyo, Japan). Before mRNA extraction, total RNA was treated with 1 unit/l DNase I and 10 units/l RNase inhibitor (TOYOBO, Osaka, Japan) at 4°C for 15 min. Polyadenylated mRNA was extracted from total RNA using oligodeoxythymidylic acid magnetic beads (MagExtracter-mRNA-kit; TOYOBO). The mRNA (1.0 g) was then reverse-transcribed using ReverTraAce reverse transcriptase (TOYOBO). The quality of the sample cDNA was checked by amplifying ␤-actin, G3PDH, and ␣-tubulin with PCR amplification. The cDNA was ethanol-precipitated and substituted to polyadenylated tailing with terminal deoxynucleotidyl transferase (TOYOBO). The Polyadenylated cDNA was then amplified by PCR (25 cycles) for biotin-16-dUTP labeling with KOD dash DNA polymerase (TOYOBO). The labeled cDNA was ethanol-precipitated, denatured at 68°C, and used as a hybridization probe in 10 ml of PerfectHyb solution (TOYOBO). Hybridization was done overnight at 68°C on a GeneticLab in-house cDNA array of 1289 genes plus 11 housekeeping genes (GeneticLab, Sapporo, Japan). The membrane was washed three times with 2ϫSSC/0.1% SDS at 68°C and then three times with 0.1ϫ SSC/0.1% SDS at 68°C. Chemiluminescence was done using a streptavidin-biotinylated alkaline phosphatase system (Imaging High -Chemilumi-Gene Navigator; TOYOBO) using CDP-Star as the luminogen, and the signals were photographed by a Fluor-S MultiImager (Nippon Bio-Rad Laboratories, Tokyo, Japan). Image analysis and quantification were performed with Imagine 4.2 software (BioDiscovery Inc., Los Angeles, CA), averaging the signals from the adjusted regions of two symmetrically arranged spots for each gene, relative to background values.
Statistical Analyses and Array Data Analysis. In this study we investigated changes in the gene-expression profile during tumor progression. Analysis of gene expression profiles and related statistical analyses were performed using programs originally developed in the MATLAB language, ver. 6.5 (Math-Works, Tokyo, Japan). To determine confounding relations among the histopathological parameters, a contingency table analysis ( 2 test) was performed between pN status and pT status.
The array data in the present study consisted of 1289 dimensional variables (genes) that showed an exponential distribution. The data distribution of each sample exhibited a cumulative distribution curve of different shape, reflecting a biased size-shift (nonlinear offset effect) because of a different condition in each hybridization. Normalizations using expression values of housekeeping genes were not considered appropriate, because they often lost linearity due to overhybridization. Thus, we normalized the data by dividing the expression value of each sample by the mean value of the genes of which the values were within 0.1-1.2-fold of the median expression. We set this interval because of the stability of data values observed Fig. 1 The cumulative distributions of cDNA array data of the 54 esophageal squamous cell carcinomas. Each OOO in the cascade-like picture represents a cumulative distribution curve of the expression values of 1289 genes in a case. A, before normalization, note that distributions are uneven across the cases, forming an irregularly shaped "cascade"; B, after normalization, the shape of the cascade was smoothed, especially in the lower 80 percentiles of genes. Brighter color indicates higher expression. B, heat-map view of the 88 genes of which the expressions significantly differed between the pT1/2 (n ϭ 16) and pT3/4 (n ϭ 20) groups in the 36-case training set. The normalized relative expression for each transcript (rows) in each sample (columns) is shown in color (brighter color indicates higher expression). For comparison, the expressions of the genes in the pT1/2 (n ϭ 7) and pT3/4 (n ϭ 11) groups in the 18-case validation set are shown on the right of the panel. Note that the similar gene-expression patterns in the validation cases demonstrate the reproducibility of the expression profile in the independent set of samples. C, the process of feature-subset gene selection with predictive value for pT stages (pT1/2 versus pT3/4). Sequential addition of the most significant genes at each step, with a total of 3916 models tested, resulted in a minimal error rate of 0.0%. in a number of empirical cumulative distribution curves of the array data. We divided the 54 cases into a training set of 36 cases and a validation set of 18. The pertinent clinicopathological data of the constituents of each set are shown in Table 1. They indicate an unbiased division of cases. We used the data of the training set to explore the characteristics of tumor progression (advancing pT stages) and metastatic potential (pN stages). We then used the validation set to evaluate the predictive ability of the selected features.
To detect genes that increased or decreased in correlation with the graded depth of invasion, we adopted a regression analysis on a generalized linear model (18). This model allows a reduction of the relation between an explanatory variable X (the expression value of each gene) and a response variable Y (4-categorized pT stages given dummy integer numbers 1-4) into a linear relation if the distribution of Y belongs to the exponential distribution function, by transforming Y with an appropriate link (kernel) function: where ␣ is the intercept and ␤ is the regression coefficient. Probit function (inverse Gaussian distribution function) was used for the link function. In the estimation of ␤, the maximum likelihood method is used to minimize the residuals (the weighted least-square fit). We applied this method to each of 1289 genes and extracted those that showed Ps of regression coefficient ␤ Ͻ 0.05.
On the other hand, to determine the expression profiles that are characteristic of the presence or absence of lymph node metastasis and those that are characteristic of the pT1/2 versus pT3/4 stages, we selected genes on the basis of statistical difference between mean expressions of the classes (two-sided t test, P Յ 0.05) in the 36-case training set. To determine whether the selected genes were classified accurately as samples in the various classes, we performed a classification using an expectation maximization algorithm that separates a mixture of different data distributions in iterative steps of maximum likelihood estimation (19). We also tested the separability of the distributions with a k-means clustering (20). In addition, we tested the differences of gene expressions between the classes by a nonparametric bootstrap resampling test that permitted us to check the stability of statistical differences independent of the data distribution (21). This procedure randomly and multiply resamples the data with replacement (allowing overlap) and compares the means of the resampled data with the means of the original samples (pivotal test, two-sided, ␣ level 0.05). This tests the apparent by-chance difference attributable to the presence of outlier values or biased distributions of values, to which the parametric t test is insensitive.
We then performed a feature-subset selection that allows identification of a set of genes optimal for pattern classification from the selected genes to extract the essential part of the gene set. As the algorithm for the feature-subset selection, we used a sequential forward selection, which extensively explored a combination of genes minimizing the leave-one-out error rate of a k-nearest neighbor classifier. This algorithm, proposed by Whitney (22), adds the most significant feature (gene) that gives a minimal error rate for each step, from an empty set to the full set of the given data, and finally selects a series of gene sets yielding the grand minimal error rate. As a classifier we adopted the k-nearest neighbor method, which classifies each sample according to the class memberships of its k nearest points based on Euclidean distance in the N-dimensional space (20,23). The leave-one-out error cross-validation procedure designates one sample as a test set and the remaining N -1 samples as a training set. The k-nearest neighbor rule developed on the training set was repeatedly applied for each left-one sample, and a cumulative training error was calculated. The feature subset selection yields a number of feature (gene) sets that give the same grand minimal error rate. The gene sets were then analyzed for their selection tree structure with Dijkstra and depth-first search algorithms to extract, by pruning leaves and branches, the connected parts (vertices) that form the tree trunk (24,25). This procedure based on the graph theory permits extraction of the essential part of the feature subsets, by discarding the possible local minimal solutions (prediction models) that are considered the major cause of deterioration of the generalization ability of a classifier. The obtained sets of genes were then used as the final diagnostic sets to form an ensemble classifier consisting of multiple probabilistic neural networks (PNN; Ref. 26) in which each component classifier was weighted with the AdaBoost algorithm (27). PNN is a space-dividing classifier like k-nearest neighbor, but uses the radial basis function to simulate Bayesian posterior probabilities for each data point instead of evaluating simple Euclidean distances as in k-nearest neighbor. With the use of AdaBoosting, each classifier (PNN) was weighted with leave-one-out insample errors in the training procedure to optimize the training and voting. Finally, to determine the performance of the ensemble classifier, we tested it with the 18-case validation set independent of the 36-case training set and evaluated the overall classification error.

RESULTS
In this study we attempted to extract gene expression profiles characteristic of the invasiveness and progression status of ESCCs, with a particular focus on lymph node metastasis (pN stage) and depth of tumor invasion (pT stage). A contingency table analysis ( 2 test) showed no significant relationship between the pN stages and the pT stages (pN0: pT1, 7; pT2, 3; pT3, 14; pT4, 1; pN1: pT1, 3; pT2, 10; pT3, 14; pT4, 2 cases; 2 ϭ 5.436, P ϭ 0.1425). The cumulative distribution curves of the array data of the 54 cases before and after normalization are shown in Fig. 1. After normalization, the different shapes in the exponential distributions were well smoothed-over in all of the cases.
We then investigated genes that altered their expression in association with tumor progression using a generalized linear model-based regression analysis in the 36-case training set data (pT1, 6; pT2, 10; pT3, 17; pT4, 3 cases). The analysis detected 47 genes up-regulated and 24 genes down-regulated in association with pT stages ( Fig. 2A; Table 2). As we noted above, the most conspicuous changes in gene expression occurred between the pT2 and pT3 stages. Therefore, we extracted a total of 88 genes that each showed a significant difference (two-sided t test, P Ͻ 0.05) in expression between pT1/2 cases (n ϭ 16) and pT3/4 cases (n ϭ 20) in the training set (Fig. 2B). It was of note that the expression patterns of the 88 genes in the independent validation set (18 cases: pT1/2 7 cases, pT3/4 11 cases) were similar to those in the 36-case training set. Among these 88 genes, 47 were in common with the above 71 genes selected by the regression analysis. We confirmed the stability of statistical differences for these 88 genes with a nonparametric bootstrap test (10,000 times resampling). We tested the separability of the data subspaces (distributions) formed by the expression values of the 88 genes by using an expectation maximization algorithm. This algorithm misclassified 1 of the 36 cases (error rate, 2.8%). Another classifier, k-means clustering, misclassified 11 of 36 cases (error rate, 30.6%). To identify the essential features to optimally classify the training set data, we applied a sequential forward-selection algorithm that sequentially selected a better combination of genes based on leave-one-out error rates of a k-nearest neighbor classifier. The algorithm and the subsequent tree analysis using the Dijkstra and depth-first search algorithms finally selected 33 combinations of features (genes), consisting of 26 -62 genes that classified the two classes with 100% accuracy (Fig. 2C).
As the genes characteristic of lymph node metastasis, we extracted a total of 87 genes of which the expression showed a significant difference (two-sided t test, P Ͻ 0.05) between pN 0 cases (n ϭ 16) and pN 1 cases (n ϭ 20; Fig. 3A). Again, the independent validation set (pN 1 , 9 cases; pN 0 , 9 cases) showed expression patterns similar to those of the training set, demonstrating the reproducibility of the expression profiles across the independent sets of samples. We confirmed the statistical differences by the bootstrap resampling test. A test of the separability of the data subspaces (87 dimensions) using the expectation maximization algorithm classified 31 of the 36 in-sample cases correctly (error rate, 13.9%). Also, the k-means clustering misclassified 11 of 36 cases (error rate, 30.6%). In the same way as described above, to prune features that made the margin between pN 0 and pN 1 classes and obtain an optimal set of features that characterize the lymph node status, we performed a feature-subset selection. The algorithm and the subsequent tree analysis finally selected 19 sets consisting of 11-44 genes that classified the two classes with 97.2% accuracy ( Fig. 3B; Table 3).
The 71 genes of which the expressions were altered in association with T stages (Table 2) and the 87 genes that were differentially expressed between node-positive and node-negative cases shared 14 common genes (NOL3, CAV2, CD47, CYP2D6, DRD2, DRD3, EPHB6, ITGB6, MDM2, MMP14, MUC2, GRIN2C, P2RY1, and TGFBI). We estimated the probability of the coincidental selection of these 14 genes with a Monte-Carlo simulation (100,000 times), which gave a P of 1.0 ϫ 10 Ϫ5 . The 57 genes that excluded these 14 genes from the 71 genes may specifically reflect the molecular characteristics of depth of tumor invasion (pT stage). Only 4 (CYP2D6, EPHB6, ITGB6, and MDM2) of the 71 genes were also included among the 44 genes that characterized lymph node metastasis (P ϭ 0.07695, Monte-Carlo simulation). Whereas these may represent the common molecular events between graded depth of invasion and lymph node metastasis, most genes were different between the two categories, suggesting that depth of invasion and lymph node metastasis had distinct underlying molecular pathophysiologies.
To test the validity of the selected features, we tested the feature subsets composed of 33 combinations of genes (26 -62) for pT1/2 versus pT3/4 classification, as well as 19 combinations (11-44 genes) for pN1 and pN 0 , by using a PNN ensemble classifier. For each subset of features, a PNN classifier was formed, and its leave-one-out classification error was evaluated with the 36-case training set. The AdaBoost algorithm increased the weight of the misclassified cases and accordingly modified the successive training. Finally, the ensemble classification was done by summing up the weighted votes of all of the PNN classifiers for the independent validation-set cases and then by classifying each test case to a class given a larger vote-sum. As shown in Table 4, the overall prediction error rates for pT stages and pN stages were respectively 5.6% (17 of 18 cases correctly classified) and 11.1% (16 of 18 cases). Furthermore, we tested whether or not the 71 genes selected by the regression analysis could discriminate between the pT1/2 and pT3/4 classes in the 18-case validation set. The feature subset selection algorithm selected a total of 20 subsets consisting of 10 -29 genes, giving an in-sample error rate of 2.8%. The final performance of the formed ensemble classifier was a 16.7% overall error rate (16 of 18 cases correctly classified, 2 of 7 pT1/2 and 1 of 11 pT3/4 misclassified). The finally selected 29 genes contained 18 in common with the above 62 finally selected features.

DISCUSSION
Multiple genes are involved in the complex multistep process of carcinogenesis, tumor progression, and acquisition of invasiveness in esophageal cancer. The cDNA array technology has enabled us to analyze the expression profiles of thousands of genes simultaneously and to investigate the correlation between clinicopathological phenotypes and gene expression status. This technique provides a powerful means to stratify (or personalize) tumors into classes with distinct molecular pathophysiologies that has not been completely proven. In esophageal cancer, some authors have reported that the gene expression profile of ESCC is different from those of pericancerous, normal, or dysplastic epithelia, implying the potential usefulness of the technique for the differential diagnosis of early ESCC from precancerous benign lesions (28,29). To date, however, there has been no report that investigates expression profiles associated with specific malignant characters of ESCC. In the present study, we for the first time identified genes of which the expression was altered in correlation with the graded depth of invasion (pT stages) by applying a generalized linear modelbased regression analysis to cDNA array data. The changes in gene expression were most conspicuous between pT 2 and pT 3 (invasion beyond the musculus proprius), suggesting that a major alteration in molecular pathophysiology occurs at a rather late stage of tumor progression. The validity of the identified genes was confirmed by the reproducibility of the expression patterns in the independent set of samples, and by the overall performance of the classifier starting from both feature sets selected by the regression analysis and binary-class comparison.
As shown in Table 2, the identified genes that increased their expression in association with advance of pT stages had a wide spectrum of functions, including cell adhesion molecules, extracellular matrix-related proteins, cell death regulators, and growth/differentiation factors. Among them, for instance, integrin ␣5 is known to activate signal transduction, enhance cell proliferation, suppress apoptosis, and thereby contribute to tumor progression in lung squamous cell carcinoma (30). ITGB6 also promotes the progression of oral squamous cell carcinoma by activating Fyn on binding to fibronectin (31), which showed higher expression in the progressed tumors in our study. Among cell death regulators, NOL3 inhibits apoptosis (32). LGALS1 is involved in neoplastic progression and proliferative activities (33). The matrix metalloproteinase MMP1 is associated with depth of invasion in esophageal cancer (34), whereas MMP7 and MMP10 have been shown to be prognostic factors in esophageal cancer (35,36). TIMP1, which is an inhibitor of matrix metalloproteinases, correlated with the tumor progression in the present results, as one previous report indicated a positive correlation with tumor progression (37). VEGFC, which is a growth/ differentiation factor, is known to promote esophageal tumor progression via lymphangiogenesis and angiogenesis (38). Taken together, these results indicate that tumor progression of ESCC is associated with factors that are relevant to the invasive characters of tumors.
The genes of which the expressions were decreased in Sequential addition of the most significant genes at each step, testing a total of 3828 models, resulted in a minimal error rate of 2.8%. association with advancing pT stage included several cell cycle regulators and transcription factors. Among the cell cycle regulators, CDC2L1 is known to induce Fas-mediated apoptosis in malignant melanoma cells (39), and RBL2, which is a member of the retinoblastoma gene family, is known to be a tumor suppressor gene that is positively correlated with the prognosis of ESCC (40). NCOA1 and NCOA2 are ligand-dependent transcription factors involved in a wide array of biological processes, including cell differentiation, reproduction, and homeostasis (41). PAX8 promotes thyroid follicular carcinoma progression by forming fusion proteins with PPAR␥ (42). These results suggest that the expression of cell cycle regulators acting as a tumor suppressor is maintained in the early pT stages of ESCC. Also, the transcription factors are implicated in the promotion of tumor cell proliferation through activation of multiple signal transductions. Additional investigation of these genes may provide a clue for interpretation of the mechanism by which malignancy is acquired in concert with tumor progression.
In regard to lymph node metastasis, which is another important factor for determination of clinical stage, we also confirmed the validity of the identified genes according to the reproducibility and performance of a classifier beyond the sample sets. It was here shown that cell adhesion proteins (CD33, GJB1, ITGA2B, ITGAX, and ITGB6) and cell membrane receptors (CD2, ERBB4, and CSF2RB) showed higher expression in node-positive cases. In contrast, cell cycle regulators (CDC16, CDK8, CCNF, and PPM1D) and intracellular signaling molecules (APOD, APOH, FRAP1, GAB1, NCK1, and MAPK14) were expressed in lower levels in node-positive cases. These results might indicate that the invasive character of tumor cells is determined by cellular interaction with the extracellular environment rather than the proliferative potentials. It is interesting that the spectrum of expressed genes in relation to lymph node metastasis is distinct from that for tumor progression.
In the present study, we used feature-subset selection to extract genes relevant to specific characteristics of esophageal cancers. This approach has been successfully used to extract essential features to classify cDNA array samples into categories (28,43). If one merely selects features according to statistical differences, the selected data still contain noise that is not well separated as distinct spatial distributions, as was seen in the present study using expectation maximization-algorithm and k-means clustering (respective separation rates: 86.1% and 69.4% for lymph node status). This may often be the case in cDNA array data, which are rather susceptible to randomness in expressions and their measurements. Feature-subset selection that searches for optimal features in separation of the feature subspaces can extract genes of which the expression is consistently different in the given classes. In this regard, we used the k-nearest neighbor classifier rather than the more sophisticated classifiers that transform feature subspaces or unequally place weights on the feature values. We consider that such transformation or weighting can result in a local minimal solution that overestimates trivial features that do not have true biological significance. For the final validation of the selected features, we used a similar strategy, the use of a PNN that divides the feature space with simulated Bayesian posterior probabilities. The ensemble classification using the AdaBoost algorithm combined the multiple feature sets (models) and performed well. The good classification performance obtained for the independent set of samples, by the use of a classifier system different from that used in feature selection, indeed demonstrates the validity of the selected features. We propose that the presently shown method provides a useful means to extract biologically significant differences from noisy cDNA array data.
Although the selected genes that are characteristic of tumor progression and invasiveness in ESCC displayed some interesting trends, these features do not uncover the whole appearance of the malignant nature underlying complex gene network, because these were selected from only a part (3-4%; 1289 genes) of the whole human genes. Hence, additional investigation should be warranted to better understand the malignant nature of esophageal SCC by making the selected features as a clue.