Abstract
Whole genome association studies of complex human diseases represent a new paradigm in the postgenomic era. In this study, we report application of the Affymetrix, Inc. (Santa Clara, CA) high-density single nucleotide polymorphism (SNP) array containing 11,555 SNPs in a pilot case-control study of esophageal squamous cell carcinoma (ESCC) that included the analysis of germ line samples from 50 ESCC patients and 50 matched controls. The average genotyping call rate for the 100 samples analyzed was 96%. Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease, assuming a recessive mode of transmission; similarly, 48 SNPs were identified assuming a dominant mode and 53 SNPs in a continuous mode. When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls. Among all the SNPs selected from GLMs for the three modes of transmission, 39 could be mapped to 1 of 33 genes. Many of these genes are involved in various cancers, including GASC1, shown previously to be amplified in ESCCs, and EPHB1 and PIK3C3. In conclusion, we have shown the feasibility of the Affymetrix 10K SNP array in genome-wide association studies of common cancers and identified new candidate loci to study in ESCC.
Introduction
Esophageal squamous cell carcinoma (ESCC) is one of the most common malignancies in the Chinese population. The standardized incidence rate of esophageal cancer in Shanxi Province, China is >100 per 100,000 person-years (1–3), although both incidence and mortality rates have declined slowly over the past 10 years in this area (4). People in high-risk regions, such as Shanxi Province, are much more likely to develop this cancer than individuals residing in low-risk areas of the world. Within the high-risk regions, there is a strong tendency toward familial aggregation, suggesting that genetic susceptibility, in conjunction with environmental exposures, plays a role in the etiology of ESCC. ESCC is most likely a complex disease caused by mutations or risk alleles in multiple genes, each with a small contribution to overall risk. In the past several years, we and others have tried to identify susceptibility genes and biomarkers involved in ESCC that could be used to screen high-risk populations in north central China, using approaches that include genome-wide loss of heterozygosity testing, candidate tumor suppressor gene mutation testing, and analysis of expression arrays (5–10). The results from these studies indicate that several genes may play an important role in development of this tumor, but we have not yet found a classifier that can be used to screen high-risk populations.
Two approaches, linkage analysis and association studies, are commonly used to identify susceptibility genes involved in tumorigenesis. Linkage analysis involves genotyping of individuals from affected families, whereas association studies are done using subjects from population-based or family studies. In one example of such an association study, Sun et al. found that polymorphisms in the apoptosis pathway genes Fas and FasL were associated with increased risk of developing ESCC (11). However, most studies have been limited to a few single nucleotide polymorphisms (SNP; refs. 12–14). It is estimated that SNPs occur about once in every 1,000 bp. Several genotyping studies on the chromosome-wide level using high-density SNPs have already been reported (15, 16). Recently, the GeneChip Mapping 10K Array for whole genome SNP analysis became available (Affymetrix, Inc., Santa Clara, CA) and a few initial reports of allelic imbalance or loss in cancers as well as cancer cell lines using the 10K SNP array have been published (17–23).
Here, we report the results of a pilot ESCC case-control study using the 10K SNP array. We had two primary aims and one secondary aim in this study. Our primary aims were to identify SNPs and genes that are associated with ESCC and to develop initial approaches appropriate for the analysis and interpretation of genome-wide association studies, including describing limitations and applications of such studies. Our secondary aim was to begin development of a classification method that combines multiple genotypes and environmental factors to predict susceptibility to ESCC.
Materials and Methods
Patients and Controls
The study was approved by the institutional review boards of the Shanxi Cancer Hospital and the National Cancer Institute.
ESCC patients selected. Patients diagnosed with ESCC between 1998 and 2000 in the Shanxi Cancer Hospital in Taiyuan, Shanxi Province, People's Republic of China and considered candidates for curative surgical resection were identified and recruited to participate in this study. None of the patients had prior therapy and Shanxi was the ancestral home for all. After obtaining informed consent, patients were interviewed to obtain information on demographic and lifestyle cancer risk factors (smoking, alcohol drinking, and family history of cancer) and clinical data. We selected 50 males by identifying the first 25 with a positive family history of esophageal cancer and the first 25 without a family history of esophageal cancer from our roster ordered by study identification number.
Controls. Age-, sex-, and neighborhood-matched controls were selected and evaluated within 6 months of the case being diagnosed. The “neighborhood” in China refers to the residence blocks within communities. The ancestral home for all controls was also in Shanxi Province.
Biological Specimen Collection and Processing
Venous blood (10 mL) was taken from patients before surgery and from controls after interview. Germ line DNA was extracted and purified using standard methods.
GeneChip Mapping 10K Array
The 10K SNP array provides comprehensive coverage of the genome for genotyping studies. Each array contained 11,555 biallelic polymorphic sequences randomly distributed throughout the genome, except for the Y chromosome. The median physical distance between SNPs is ∼105 kb and the mean distance between SNPs is 210 kb. The average heterozygosity for these SNPs is 0.37, with an average minor allele frequency of 0.25. The algorithm used for making genotype calls was described previously by Affymetrix (24, 25).
Target preparation. DNA samples, including two control DNA samples from Affymetrix, were assayed according to the protocol (GeneChip Mapping Assay manual) supplied by Affymetrix. The procedure was similar to the one described previously (24). Briefly, a total of 250 ng germ line DNA was digested with XbaI and then ligated to XbaI adaptor before subsequent PCR amplification. All the steps mentioned above were carried out in the pre-PCR clean room. Cycling was conducted as follows: 95°C for 3 minutes followed by 35 cycles of 95°C for 20 seconds, 59°C for 15 seconds, and 72°C for 15 seconds. Final extension was done at 72°C for 7 minutes (DNA Engine Tetrad PTC-225, MJ Research, Waltham, MA). To evaluate PCR products, 3 μL of each PCR product was mixed with 3 μL of 2× gel loading dye, loaded on a 2% Tris-borate EDTA gel, and run at 120 V for 1 hour to check for the expected product (bands) between 250 and 1,000 bp. After purification and elution of the PCR products using Qiagen MinElute 96 (Qiagen, Valencia, CA), quantification of purified PCR product was done using spectrophotometric analysis. A final 20 μg of PCR product was fragmented with DNase I. An aliquot of the fragmented PCR product was run on a 4% Tris-borate EDTA gel at 120 V for 30 minutes to 1 hour. Successful fragmentation was confirmed by the presence of a smear with the darkest region corresponding to 50 to 100 bp. The fragmented PCR product was end labeled with biotin and hybridized to the array. Arrays were incubated at 48°C for 18 hours in the Affymetrix GeneChip system hybridization oven. Microarrays were washed and stained in the GeneChip Fluidics Station 450 (Affymetrix) following the manufacturer's instructions.
Scanning and genotype generation. The 10K SNP arrays were scanned with the Affymetrix GeneChip Scanner 3000 using GeneChip Operating System 1.0 (Affymetrix). Data files were generated automatically. Genotype assignments (i.e., calls) were made automatically by GeneChip DNA Analysis Software 2.0 (Affymetrix). The genetic map used in the analysis was obtained from GeneChip Mapping 10K library files: Mapping10K_Xba131. “Signal Detection Rate” is the percentage of SNPs that pass the discrimination filter. “Call Rate” is the percentage of SNPs called on the array. The genotype calls are defined as AA, AB, or BB; “no call” means the SNP does not pass the discrimination filter.
Statistical analyses. All statistical analyses were developed using R and S-PLUS packages. We applied the generalized linear model (GLM) implemented in the function glm to evaluate the risk associated with each SNP that did not deviate from Hardy-Weinberg equilibrium (P > 0.01). Three numerical coding schemes were used to represent genotypes: (a) (AA, AB, BB) = (1, 0, 0), (b) (AA, AB, BB) = (1, 1, 0), and (c) (AA, AB, BB) = (1, 0.5, 0). The first scheme corresponds to the assumption that allele A is recessive (equivalently, allele B is dominant), the second scheme assumes that allele A is dominant (equivalently, allele B is recessive), and the third scheme assumes a continuous mode.
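The three coding schemes can be illustrated with a short sketch. The names `CODING_SCHEMES` and `encode_genotypes` are hypothetical, and treating a "no call" as a missing value (`None`) is an assumption, not something stated in the text.

```python
# Numerical codes for (AA, AB, BB) under the three schemes described above.
CODING_SCHEMES = {
    "recessive": {"AA": 1.0, "AB": 0.0, "BB": 0.0},   # scheme (a): allele A recessive
    "dominant": {"AA": 1.0, "AB": 1.0, "BB": 0.0},    # scheme (b): allele A dominant
    "continuous": {"AA": 1.0, "AB": 0.5, "BB": 0.0},  # scheme (c): continuous mode
}


def encode_genotypes(calls, scheme):
    """Map genotype calls to numbers; unrecognized calls (e.g. "NoCall") become None."""
    codes = CODING_SCHEMES[scheme]
    return [codes.get(call) for call in calls]


if __name__ == "__main__":
    example_calls = ["AA", "AB", "BB", "NoCall"]  # hypothetical calls for one SNP
    for name in CODING_SCHEMES:
        print(name, encode_genotypes(example_calls, name))
```

Only the heterozygote code differs among the schemes, so a SNP with few AB calls would give nearly identical results under all three.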
GLM was applied to model the probability of being a case based on each SNP plus five potential explanatory variables, including x1 (family history positive, yes/no), x2 (alcohol use, yes/no), x3 (tobacco use, yes/no), x4 (pickled vegetable consumption, yes/no), and x5 (age, continuous):
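One form consistent with this description, assuming a binomial GLM with a logit link (the link function and the symbols g, β0, βg, and βj are assumptions and are not taken from the source), is:

```latex
\operatorname{logit}\,\Pr(\text{case}) \;=\; \beta_0 + \beta_g\, g + \sum_{j=1}^{5} \beta_j\, x_j
```

Here g denotes the numerical genotype code of the SNP under one of the three coding schemes, and x1 through x5 are the explanatory variables listed above.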
Three variables (age, tobacco use, and pickled vegetable consumption) were not statistically significant for nearly all SNPs and were dropped from further consideration. Using a GLM for each SNP and the remaining variables, we computed the P of the GLM from the difference between the null deviance D0 and the residual deviance D1 using the χ2 goodness-of-fit test. The χ2 statistic is D0 − D1 with 3 df (one each for the SNP, family history, and alcohol use). To account for multiple comparisons, we used the Bonferroni-adjusted significance level to select our GLMs.
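The deviance test and the Bonferroni cutoff can be sketched as follows. This is a minimal illustration that uses only the closed-form upper tail of the χ2 distribution with 3 df. The function names are hypothetical and the residual deviance is an invented value. The null deviance shown is what an intercept-only binomial logit model would give for 50 cases and 50 controls, which is itself an assumption about the model.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    if x <= 0:
        return 1.0
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


def deviance_test(null_deviance, residual_deviance):
    """Return (statistic, P) for the statistic D0 - D1 on 3 df."""
    statistic = null_deviance - residual_deviance
    return statistic, chi2_sf_3df(statistic)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    threshold = bonferroni_threshold(0.05, 10264)
    d0 = 2 * 100 * math.log(2)  # null deviance, 50 cases vs 50 controls (assumed logit model)
    d1 = 105.0                  # hypothetical residual deviance for one SNP
    statistic, p_value = deviance_test(d0, d1)
    print(f"threshold={threshold:.3e} statistic={statistic:.2f} P={p_value:.3e}")
    print("selected" if p_value < threshold else "not selected")
```

With roughly 10,000 tests the per-SNP cutoff is on the order of 5e−06, so a SNP is selected only when the three retained terms reduce the deviance substantially.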
We used principal components analysis (PCA) to visualize similarity and variability among individuals. We applied PCA to each of the three numerical genotype coding schemes for all 100 case/control samples. The 100 samples were projected into the space defined by the first and second principal components. When case and control samples form two distinct clusters in this space, one or two principal components can be used to construct a classifier that separates cases from controls. The classifier was based on the genotypes of selected SNPs and its performance was evaluated for accuracy = (Tp + Tn) / 100, sensitivity = Tp / (Tp + Fn), and specificity = Tn / (Fp + Tn), where Tp and Tn are the numbers of true positives and true negatives and Fp and Fn are the numbers of false positives and false negatives. The odds ratio of the classifier is defined as Tp * Tn / [(50 − Tp) * (50 − Tn)]. Although developing and testing predictors using the same data is acknowledged to result in upward bias of predictor estimates (i.e., sensitivity, specificity, and accuracy), we calculated these values as a frame of reference only and not for clinical application without further confirmation (26).
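The performance measures defined above can be computed as in the sketch below. The function name `classifier_metrics` and the example counts are hypothetical. The group sizes default to the 50 cases and 50 controls of this study.

```python
def classifier_metrics(tp, tn, n_cases=50, n_controls=50):
    """Accuracy, sensitivity, specificity, and odds ratio from true-positive/true-negative counts."""
    fn = n_cases - tp      # cases classified as controls
    fp = n_controls - tn   # controls classified as cases
    return {
        "accuracy": (tp + tn) / (n_cases + n_controls),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        # The odds ratio is undefined when either error count is zero.
        "odds_ratio": (tp * tn) / (fn * fp) if fn and fp else float("inf"),
    }


if __name__ == "__main__":
    # Hypothetical counts: 40 of 50 cases and 45 of 50 controls classified correctly.
    for name, value in classifier_metrics(tp=40, tn=45).items():
        print(f"{name}: {value:.3f}")
```

Because the odds ratio divides by the product of the two error counts, it grows very quickly as misclassifications approach zero and is best read together with the other three measures.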
Results and Discussion
In the present study, 50 male ESCC patients and 50 matched controls were examined using 10K SNP chips. Signal detection rates were high in both cases and controls (average of 98.9% and 99.1%, respectively), as were average SNP call rates (95.8% in both cases and controls; Table 1). The overall distributions of genotypes and allele frequencies in the two groups are also shown in Table 1.
Based on National Center for Biotechnology Information (NCBI) Build 34, we summarized characteristics of the 11,555 SNPs and mapped these SNPs to chromosomes and genes. Thirty-four percent (3,947 of 11,555) of the SNPs were mapped in or near (within 1 kb of either 3′ or 5′ end) 2,187 different genes, including 108 SNPs in exons and 3,689 SNPs in introns. One hundred and thirty SNPs were removed because they could not be mapped to the human genome with NCBI Build 34. We removed another 953 SNPs that were homozygous in either case or control groups. We also removed 208 SNPs that did not satisfy Hardy-Weinberg equilibrium in the control group (P < 0.01). Following application of these filters, 10,264 SNPs remained for further analysis.
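The Hardy-Weinberg filter can be illustrated with a one-degree-of-freedom χ2 test on genotype counts. The text does not say which test was used (χ2 or exact), so this is a sketch under that assumption, and the function name and example counts are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a chi-square (1 df) test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    observed = (n_aa, n_ab, n_bb)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AB, BB) among 50 controls.
    for counts in [(12, 26, 12), (25, 0, 25)]:
        p_value = hwe_chi2_pvalue(*counts)
        status = "keep" if p_value >= 0.01 else "remove"
        print(counts, f"P={p_value:.3g}", status)
```

In this sketch a SNP would be removed when the P value in the control group falls below 0.01, matching the cutoff stated above.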
We first compared cases and controls for each of the 10,264 SNPs individually using multivariate analyses in the GLM assuming each of the three different modes of transmission described above (i.e., recessive, dominant, and continuous). Potential explanatory variables that might influence the analysis were adjusted for in the GLM. Because 10,264 separate analyses were done, multiple comparisons were a major concern. We corrected for multiple comparisons using Bonferroni-adjusted significance levels, which, for 10,264 analyses, means that we accepted as significant only Ps < 4.87187e−06 (which corresponds to a single test with α level of 0.05). Using multivariate GLMs with Bonferroni adjustment as described, we identified 37 statistically significant SNPs under the recessive transmission mode assumption, 48 SNPs for the dominant mode, and 53 SNPs assuming a continuous mode.
A secondary aim of this study is to begin developing a method to predict individual risk of ESCC based on genotypes and explanatory variables. As a first step, we combined the 37 SNPs selected from the recessive mode GLM to classify samples using PCA (Fig. 1A). With few exceptions, the cases and controls were clearly separated into two different clusters. As a comparison, we also did a PCA using all available SNPs for which there were no missing genotype data (n = 3,369 SNPs; Fig. 1B). The PCA using all available SNPs showed no segregation between cases and controls, which suggests that cases and controls came from the same population and that there were no major genotype differences between cases and controls at the population level. Given the good separation between cases and controls in the PCA using the 37 SNPs identified from the GLM in the recessive mode, we developed a classifier to predict individual risk of esophageal cancer. Our classifier was defined by the first principal component (PC1), which contains weighted combinations of genotypes from these 37 SNPs. A person was classified as a case if PC1 was ≤0 or as a control if PC1 was >0. Using PC1, we were able to correctly classify 46 of 50 cases and 47 of 50 controls. The accuracy, sensitivity, and specificity for this PCA classification were 0.93, 0.92, and 0.94, respectively (Table 2), and the odds ratio for being a case was 180.2. Similar results were also obtained when SNPs selected from the dominant or continuous mode GLMs were used (Table 2). We also did PCA loading analyses to assess discrimination when smaller numbers of SNPs were used for classification. This analysis indicated that we could predict individual cancer risk using just 10 SNPs with an overall accuracy of 80%, sensitivity of 76%, and specificity of 84%; the odds ratio for these 10 SNPs was 16.6 (Table 2).
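A PC1-based classifier of this kind can be sketched in a few lines. The example below is an illustration only: it uses power iteration rather than whatever PCA routine was actually used, the toy genotype matrix is invented, and the function names are hypothetical. Because the sign of a principal component is arbitrary, the sketch orients PC1 with the known labels so that cases fall at or below zero, which is an assumption about how the sign convention was fixed.

```python
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (scores, loadings) for PC1 of column-centered data via power iteration."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    for _ in range(n_iter):
        # Apply X^T X to v without forming the covariance matrix explicitly.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sum(w * w for w in v) ** 0.5
        if norm == 0:
            raise ValueError("data have no variance along the current direction")
        v = [w / norm for w in v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return scores, v


def classify_by_pc1(scores):
    """Call a sample a case if its PC1 score is <= 0, otherwise a control."""
    return ["case" if s <= 0 else "control" for s in scores]


if __name__ == "__main__":
    # Invented recessive-coded genotypes (rows: individuals, columns: SNPs).
    cases = [[1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]]
    controls = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
    labels = ["case"] * len(cases) + ["control"] * len(controls)
    scores, loadings = first_principal_component(cases + controls)
    case_scores = [s for s, lab in zip(scores, labels) if lab == "case"]
    if sum(case_scores) / len(case_scores) > 0:
        scores = [-s for s in scores]  # flip so that cases lie at or below zero
    print(classify_by_pc1(scores))
```

The loadings returned alongside the scores indicate how strongly each SNP contributes to PC1, which is the kind of information a loading analysis would use to pick a smaller SNP subset.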
We also did permutation tests (1,000 tests) in which a randomly selected two thirds of the samples were used for training and the remaining one third for testing in the PCA. The permutation tests indicated that our PCA classification can be generalized. Hierarchical cluster analysis using the 37 SNPs selected from the GLMs in recessive mode was also able to classify cases and controls with similar performance (data not shown).
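The procedure described reads as repeated random splitting into training and test sets. A minimal sketch of that resampling loop follows; the function names are hypothetical, the `evaluate` callback stands in for whatever training and scoring step was actually used, and rounding two thirds of 100 samples to 67 is an assumption.

```python
import random


def random_split(n_samples, train_fraction, rng):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def repeated_holdout(n_samples, evaluate, n_repeats=1000, train_fraction=2 / 3, seed=0):
    """Average the test-set score returned by evaluate(train, test) over random splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        train, test = random_split(n_samples, train_fraction, rng)
        results.append(evaluate(train, test))
    return sum(results) / len(results)


if __name__ == "__main__":
    # Placeholder evaluation that only reports the share of samples held out;
    # a real analysis would fit the classifier on `train` and score it on `test`.
    def held_out_share(train, test):
        return len(test) / (len(train) + len(test))

    print(repeated_holdout(100, held_out_share))
```

Holding out the test samples in each repeat gives performance estimates that are less affected by the upward bias of training and testing on the same data.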
One alternative, and perhaps preferable, approach to reduce false-positive SNPs, beyond statistical adjustment, is to focus on SNPs that are in or near genes. When we combined results from our GLMs for all three modes of transmission with information from NCBI Build 34 to identify SNPs in or near genes, we identified a total of 39 SNPs in 33 genes (Table 3). Twenty of these 33 genes are named genes, many of which are involved in cancer. For example, EPHB1 encodes a receptor tyrosine kinase and PIK3C3 encodes a class 3 phosphoinositide-3-kinase. Receptor tyrosine kinases and phosphoinositide-3-kinases are commonly encoded by oncogenes. GASC1 maps to 9p24, a region frequently amplified in ESCCs (27). Yang et al. (27) cloned GASC1, which stands for gene amplified in squamous cell carcinoma-1, and showed that GASC1 was overexpressed in several cell lines. GASC1 protein contains 2 PHD finger motifs and a PX domain. PHD finger motifs are zinc finger-like sequences found in nuclear proteins that function in chromatin-mediated transcriptional regulation and are present in some oncogenes. The SNP rs951998 was also identified by GLM, although it was not included in Table 3 because it is located 25 kb upstream of CDK8 and 241 kb downstream of RNF6. Interestingly, somatic mutations in RNF6 in ESCC tumor samples were reported previously (8).
Herein, we have described our initial efforts using genome-wide SNP arrays applied to germ line DNA in a population-based epidemiologic case-control association study to explore genetic susceptibility to ESCC. We have addressed two of the major methodologic concerns in such studies, potential confounding and multiple comparisons, by adjusting for numerous potential confounders in our statistical models and by accepting SNP associations as real only under very stringent and conservative statistical conditions. The SNPs we identified in our GLMs as associated with ESCC seem to be robust in their ability to separate cases from controls. All of the discriminatory methods we applied (GLM with different modes of transmission, PCA with various numbers of SNPs, and hierarchical clustering) distinguished cases from controls. We are encouraged that PCA using multilocus genotyping may provide a valuable new tool for assessing risk of developing ESCC at the level of the individual. In the meantime, several different but complementary approaches remain to be pursued: further family-based linkage analysis will be used to confirm a subset of the loci that are genetically linked to ESCC; additional genotyping using higher-density arrays, such as the Affymetrix 100K chip, and more detailed examination of SNPs across the 33 loci identified here will permit identification of haplotype block structures that will further refine the mapping and cloning of genes important for the etiology of ESCC; case-control studies involving more subjects will permit testing and refinement of SNP profiles for risk prediction; and molecular genetic studies of the 33 genes reported here will provide additional evidence for the role of these genes in ESCC.
Acknowledgments
We thank Jenny Kelley for critical reading of the article.