To the Editor:
A recent genome-wide association study of esophageal cancer by Hu and colleagues (1) identified 37 statistically significant single nucleotide polymorphisms (SNPs) and reported nearly perfect classification of cancer cases and controls on the basis of these SNPs alone. Taken at face value, this implies that esophageal cancer is a purely genetic disease, although the literature suggests that environmental factors make a major contribution to susceptibility to many cancer types (2). To shed light on this issue, we reanalyzed the data of Hu and colleagues (1) and identified two data analysis pitfalls that led to overoptimistic conclusions in the original article.
First, the SNP selection method of Hu and colleagues (1) was severely biased toward claiming significance for SNPs that are not truly associated with the disease. The P value calculated in the published generalized linear model (GLM)–based SNP selection method reflects not the significance of the SNP under consideration but the combined significance of three variables (the SNP, family history of esophageal cancer, and alcohol consumption). Because family history and alcohol consumption are strong risk factors for esophageal cancer, this P value will be biased toward zero even when the SNP has nothing to do with esophageal cancer. When an unbiased GLM-based procedure is used instead, no SNP is significant at the Bonferroni-adjusted α level of 0.05. See Fig. 1 for details and for histograms of the distributions of SNP P values produced by the previously published and the unbiased SNP screening procedures.
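The distinction between the two P values can be written out as a pair of nested-model comparisons. The following is only a sketch under stated assumptions: it supposes a logistic link, a likelihood-ratio test, and a genotype coded as a single numeric term, none of which is specified above, and the symbols are introduced here for illustration.

```latex
% Assumed notation (illustrative only): y = case status, g = SNP genotype
% coded as one numeric term, f = family history, a = alcohol consumption.
% \ell_{\text{full}}, \ell_{f,a}, \ell_{0} are the maximized log-likelihoods of
% the full model, the covariates-only model, and the intercept-only model.
\begin{align*}
\text{Full model:}\quad
  & \operatorname{logit} P(y = 1) = \beta_0 + \beta_g g + \beta_f f + \beta_a a \\
\text{Combined test:}\quad
  & H_0\colon \beta_g = \beta_f = \beta_a = 0, \qquad
    \Lambda_{\text{comb}} = 2\,(\ell_{\text{full}} - \ell_{0}) \sim \chi^2_3 \\
\text{SNP-specific test:}\quad
  & H_0\colon \beta_g = 0, \qquad
    \Lambda_{\text{SNP}} = 2\,(\ell_{\text{full}} - \ell_{f,a}) \sim \chi^2_1 \\
\text{Relation:}\quad
  & \Lambda_{\text{comb}} = \Lambda_{\text{SNP}} + 2\,(\ell_{f,a} - \ell_{0})
\end{align*}
```

Under these assumptions, the last line shows why the combined statistic is large for every SNP: the term contributed by family history and alcohol consumption is present whether or not the SNP itself carries any signal.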
Second, Hu and colleagues (1) performed both SNP selection and construction of the principal component analysis–based classifier on the same 100 subjects that were used to estimate the final classification accuracy. Because neither cross-validation nor validation in an independent sample was performed, the resulting estimate of classification performance is overoptimistic, as explained by Simon and colleagues (3). To obtain an unbiased performance estimate for the SNP selection method and the classifier of Hu and colleagues (1), we applied both within a repeated 10-fold cross-validation procedure (4). The resulting estimate of classification performance was an area under the receiver operating characteristic curve (AUC) of 0.68, whereas the original procedure of Hu and colleagues (1) yielded an AUC of 0.98, an overestimate of 0.30.
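The selection bias described above can be reproduced on pure noise. The sketch below is not the procedure of Hu and colleagues (1) and does not use their data: it assumes synthetic Gaussian features with no true signal, a simple nearest-centroid classifier in place of the principal component analysis–based model, a single 10-fold split rather than repeated cross-validation, and accuracy rather than AUC. All function and variable names are hypothetical.

```python
import random


def select_top_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def centroid_predict(X, y, train, test, feats):
    """Nearest-centroid classifier fitted on `train` and applied to `test`."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    preds = []
    for i in test:
        dist = {
            label: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[label]))
            for label in (0, 1)
        }
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def cv_accuracy(X, y, k, n_folds, select_inside):
    """Cross-validated accuracy with feature selection inside or outside the folds."""
    rows = list(range(len(y)))
    # Selecting on all subjects lets the held-out subjects influence the model.
    global_feats = None if select_inside else select_top_features(X, y, rows, k)
    correct = 0
    for fold in range(n_folds):
        test = [i for i in rows if i % n_folds == fold]
        train = [i for i in rows if i % n_folds != fold]
        feats = select_top_features(X, y, train, k) if select_inside else global_feats
        preds = centroid_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(y)


# Synthetic data: 40 subjects, 1000 noise features, labels unrelated to features.
rng = random.Random(0)
n_subjects, n_features = 40, 1000
y = [0] * 20 + [1] * 20  # interleaved folds (i % 10) stay class-balanced
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_subjects)]

biased = cv_accuracy(X, y, k=10, n_folds=10, select_inside=False)
unbiased = cv_accuracy(X, y, k=10, n_folds=10, select_inside=True)
print(f"selection on all subjects: accuracy = {biased:.2f}")
print(f"selection inside each fold: accuracy = {unbiased:.2f}")
```

Because the features carry no information about the labels, any apparent accuracy well above chance in the first estimate arises solely from selecting features on the same subjects that are later used for testing.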
These findings suggest that the data analysis of Hu and colleagues (1) identified SNPs that are not statistically significant and produced a severely biased estimate of classification performance for distinguishing esophageal cancer patients from healthy controls. For a study of the effects of environment and genetics versus data analysis pitfalls, see Statnikov and colleagues (5). The present case study also underscores the importance of sound data analysis in genome-wide association studies.