Abstract
We have validated differences in DNA methylation levels of candidate genes previously reported to discriminate between normal colon mucosa of patients with colon cancer and normal colon mucosa of individuals without cancer. Here, we report that CpG sites in 16 of the 30 candidate genes selected show significant differences in mean methylation level in normal colon mucosa of 24 patients with cancer and 24 controls. A support vector machine trained on these data and data for an additional 66 CpGs yielded an 18-gene signature, composed of ten of the validated candidate genes plus eight additional candidates. This model exhibited 96% sensitivity and 100% specificity in a 40-sample training set and classified all eight samples in the test set correctly. Moreover, we found a moderate–strong correlation (Pearson coefficients r = 0.253–0.722) between methylation levels in colon mucosa and methylation levels in peripheral blood for seven of the 18 genes in the support vector model. These seven genes, alone, classified 44 of the 48 patients in the validation set correctly and five CpGs selected from only two of the seven genes classified 41 of the 48 patients in the discovery set correctly. These results suggest that methylation biomarkers may be developed that will, at minimum, serve as useful objective and quantitative diagnostic complements to colonoscopy as a cancer-screening tool. These data also suggest that it may be possible to monitor biomarker methylation levels in tissues collected much less invasively than by colonoscopy. Cancer Prev Res; 7(7); 717–26. ©2014 AACR.
Introduction
Colorectal cancer is the second leading cause of cancer deaths in both men and women, despite the availability of an effective screening test (1). Only about half of adults for whom screening colonoscopy is recommended (those older than 50 years) comply with these guidelines (2). Although highly effective, colonoscopy is both invasive and subjective, depending on the unaided eye of the endoscopist to detect cancer and precancerous lesions. Up to 12% of precancerous lesions are not detected, either as a result of polyp morphology (so-called “flat” or “serrated” polyps) or failure to visualize the entire colon (3), and approximately 10% of colorectal cancers occur in individuals within 3 years of a screening colonoscopy (4).
Because epigenetic changes play a strong role in colorectal cancer (e.g., 5–10; reviewed in refs. 11, 12), we (13, 14) and others (15–17) have suggested that it may be possible to use quantitative, objective measures of epigenetic change in normal tissues to detect colorectal cancer or precancerous lesions. In fact, we identified significant differences in methylation level at CpGs in 114 to 874 genes between the normal colon mucosa of 30 patients with cancer and 18 controls by array-based methylation profiling (13). The practical utility of such differences as an independent screening tool or as a complement to current screening methods depends on whether they are reproducible across test populations. In this report, we describe our validation study of methylation differences across 30 candidate genes selected from our previous study in an independent population of patients with cancer and controls. We have also used a combination of validated candidates and additional small-scale methylation profiling to build support vector machines (SVM) that are effective at discriminating patients with cancer from controls.
Materials and Methods
Description of control patients
We collected biologic specimens from patients undergoing routine screening colonoscopy at Temple University Medical Center (Philadelphia, PA) to serve as the control arm of the study. We excluded patients with a personal or first-degree family history of cancer of any kind. We also excluded any patients with a previous colonoscopic finding of polyps. Patients who were not excluded at this point underwent a complete colonoscopic evaluation by a board-certified gastroenterologist. If the colonoscope could not be passed to the appendiceal orifice, the patient was excluded. If the complete colon was visualized, two cold forceps biopsies were performed at each of two sites. Two biopsies of normal colonic epithelium from the ascending colon (proximal to the hepatic flexure) were pooled as “right colon,” and two biopsies of normal colonic epithelium from the descending colon (distal to the splenic flexure, but proximal to the rectum) were pooled as “left colon.” Specimens were placed into RNALater RNA Stabilization Reagent (Ambion) and stored at 4°C before DNA isolation. Peripheral blood samples were also collected at this time and DNA was extracted by standard procedures (13).
Description of patients with cancer
We also collected biologic specimens from patients undergoing colon resection for presumed or biopsy-proven colon cancers. Patients were considered eligible if they had no personal or family history of colon cancer before this encounter. Patients with known hereditary cancer syndromes, or clinical features of such syndromes (specifically, hereditary nonpolyposis colorectal cancer or familial adenomatous polyposis syndrome), were excluded. Patients with any personal history of chemotherapy or radiation therapy were also excluded. Patients who remained eligible underwent colon resection at a single National Cancer Institute-designated Comprehensive Cancer Center (Temple/Fox Chase Cancer Center). Specimens were processed immediately, and normal-appearing colon mucosa, well away (∼10 cm) from the lesion in question, was obtained. These specimens were classified as “right colon” or “left colon” accordingly.
Patient data for samples used in this study are presented in Supplementary Table S1. Patients were matched for sex and as closely as possible for age, although patients with cancer were, on average, almost 6 years older than controls. None of the patients examined in the present study was examined in the previous study (13). All patient materials were collected with the approval of Temple University Institutional Review Board (IRB) protocol 11910 or Fox Chase Cancer Center IRB protocol 11–866.
Sample preparation, bisulfite conversion
Tissue samples were rinsed with sterile saline and blotted dry before nucleic acid extraction. DNA was extracted using standard phenol-chloroform techniques. The isolated DNA was dissolved in 10 mmol/L TrisCl (pH 8.0). Samples were quantified by spectrophotometry and stored at −80°C until ready for use. The EZ DNA Methylation-Gold Kit (Zymo Research) was used to convert unmethylated genomic DNA cytosine to uracil. Site-specific CpG methylation was analyzed in the converted DNA template (5 μL at 50 ng/μL) as described below.
Veracode array analysis
Site-specific CpG methylation was analyzed in the converted DNA template (5 μL at 50 ng/μL) using a custom Veracode Array (Illumina, Inc.) at the Center for Applied Genomics at the Children's Hospital of Philadelphia (Philadelphia, PA). Methylation levels (β-values: fraction of methyl CpG at each site tested) were assessed at 96 CpGs. Thirty of the CpG sites were selected because they fulfilled statistical criteria as significantly different between patients with cancer and controls in our original study (13). The additional 66 CpGs were selected because they were of interest for other studies (18–20), but many of these sites also differed significantly between patients with cancer and controls in our original study (13).
Statistical analysis
Because the present study is a validation experiment, in which we have a prior expectation for the direction of the difference between the mean methylation levels of patients with cancer versus controls, one-sided, paired t tests were performed. A P value of <0.05 was considered significant for the validation study because the candidates were identified after correction for multiple testing in the discovery study (13). Solely for comparison purposes, we also assessed which candidates would remain significant after a second correction for multiple testing, using a Benjamini–Hochberg FDR of 0.05.
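The second correction described above can be illustrated with the Benjamini–Hochberg step-up adjustment. The following is a minimal pure-Python sketch, not the authors' analysis code: the function name `bh_adjust` is our own, and the inputs are the rounded one-sided P values printed in Table 1, so the last digit of an adjusted value may differ slightly from the published FDR column.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Rounded one-sided P values as printed in Table 1 (30 candidate CpGs).
table1_p = {
    "AHNAK": 0.17, "BCDIN3": 0.21, "BCL2": 0.32, "CASP8": 5.2e-4,
    "DDX49": 0.41, "ENPEP": 0.041, "FXD7": 0.22, "GABRR2": 0.1,
    "GALR1": 0.046, "GATA4": 0.33, "GP1BB": 0.04, "GPX4": 0.041,
    "GRB10": 0.002, "HSPA5": 0.44, "IGF1": 0.023, "IGF2": 0.058,
    "INS": 0.019, "KCNQ1": 0.08, "KRTHB6": 0.009, "MEST": 0.26,
    "MGC9712": 9e-4, "MTHFR": 0.39, "OSBPL5": 0.044, "RASSF5": 0.022,
    "SERPINB5": 0.34, "SLC16A3": 1.7e-4, "SULT1C2": 0.002, "VAV1": 1.7e-4,
    "VHL": 5e-5, "ZNF512": 0.29,
}

genes = list(table1_p)
fdr = dict(zip(genes, bh_adjust([table1_p[g] for g in genes])))

# Candidates that remain significant at an FDR of 0.05.
significant = sorted(g for g in genes if fdr[g] < 0.05)
print(len(significant), significant)
```

With these inputs, the candidates that pass the FDR threshold are the same eight that the Results section reports as significant after the second correction.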
Binary classification of cancer versus control samples
An SVM with recursive feature elimination (21) was used to classify samples. Random 10-fold cross-validation was repeated 10 times, and a score was calculated for each tested sample at each reduction step. Average accuracy was calculated at each step, and all the genes at the point of maximal accuracy were used as the initial discriminator. A subsequent step was then applied to reduce the final discriminator to the minimum number of genes for which the accuracy remained the same (22).
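The resampling scheme described above (random 10-fold cross-validation, repeated 10 times) can be sketched as follows. This illustrates only how samples might be partitioned; the SVM fit and the recursive feature elimination are not shown. The function name `repeated_kfold`, the seed, and the choice of unstratified folds are assumptions of this sketch, and the sample count of 40 is the training-set size reported in the Results.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated random k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        for fold in range(n_folds):
            # Every n_folds-th shuffled sample forms the held-out fold.
            test_idx = indices[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx


splits = list(repeated_kfold(n_samples=40))
print(len(splits))        # 10 repeats x 10 folds
print(len(splits[0][3]))  # held-out samples per fold
```

In a full analysis, a classifier would be trained on `train_idx` and scored on `test_idx` for each split, and the accuracies averaged at each feature-reduction step.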
Validation of the SVM classifier
Twenty percent of the original dataset was not included in the creation of the discriminator, but we used these samples, together with other samples coming from an independent platform, to test the quality of the signature.
The signature was applied as an equation of the form:
X = a[A] + b[B] + c[C]… + z[Z] + constant
where A, B, C…Z are the methylation levels and a, b, c…z are the coefficients associated with each value. If the classification score (X) calculated for a sample is higher than 0, the sample is declared cancer; if it is less than 0, the sample is declared control. The higher the score, the greater the confidence that the sample is cancer; the lower and more negative the score, the greater the confidence that the sample is control (22).
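The scoring rule above can be written as a short function. This is a minimal sketch: the coefficients, the constant, and the β-values below are hypothetical numbers chosen for illustration, not the fitted values of the signature, and the handling of a score of exactly 0 is our own assumption because the text does not specify it.

```python
def classification_score(levels, coefficients, constant):
    """Linear score X = sum(coefficient * methylation level) + constant."""
    return sum(coefficients[cpg] * levels[cpg] for cpg in coefficients) + constant


def classify(score):
    """Positive scores are called cancer, negative scores control."""
    if score > 0:
        return "cancer"
    if score < 0:
        return "control"
    return "undetermined"  # assumption: the text does not define this case


# Hypothetical coefficients and constant, for illustration only.
coefficients = {"cg16869108": 4.0, "cg03366382": 2.5, "cg17966192": -3.0}
constant = -1.0

# Hypothetical beta-values (fraction methylated, between 0 and 1) for one sample.
sample = {"cg16869108": 0.5, "cg03366382": 0.25, "cg17966192": 0.75}

score = classification_score(sample, coefficients, constant)
print(score, classify(score))
```

The magnitude of the score plays the role of the confidence described in the text: scores far from 0 in either direction indicate a more confident call.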
Results
Validation of previously identified candidates
In our previous study (13), we profiled methylation levels across 27,578 CpG sites in DNA extracted from normal colon mucosa (see Materials and Methods) of 30 patients with colon cancer and 18 controls. We identified significant differences in mean methylation level between patients with cancer and controls using three different statistical thresholds (13): 119 CpGs in 114 genes differed after Bonferroni correction for multiple testing; 909 sites in 874 genes differed after applying the Benjamini–Hochberg FDR of 0.05; and 299 sites in 65 genes met the ad hoc criterion that three or more CpGs in the same gene differed significantly at P < 0.05 (for a nominal significance of 0.05 × 0.05 × 0.05 = 1.25 × 10⁻⁴). From these gene/CpG lists, we selected 30 CpGs in 30 genes (Table 1) to test in an independent sample of normal colon mucosa from 24 cancer cases and 24 case-matched controls (see Materials and Methods and Supplementary Table S1). The 30 CpGs were selected from those having the largest magnitude of difference between means in the original study (Tables 1 and 2 in ref. 13), as well as a selection of CpGs in genes of additional interest that were significantly different in the original study (13).
Table 1. The 30 CpGs selected for independent validation whose methylation levels differed significantly between normal colon mucosa of patients with cancer versus controls in ref. 13
Gene selected for independent validation | aCancer mucosa hypo (<), hyper (>), or no diff (−) | CpG ID | Selection criterion from Silviera et al. (13) | P in validation setb | FDR P |
---|---|---|---|---|---|
AHNAK | </> | cg19902569 | Bonferroni | 0.17 | 0.255 |
BCDIN3 | </< | cg17607973 | Bonferroni | 0.21 | 0.3 |
BCL2 | </> | cg08554462 | 3 CpGs<0.05 | 0.32 | 0.3778 |
CASP8 | >/> | cg05130485 | 3 CpGs<0.05 | 5.2 × 10−4 | 0.0039 |
DDX49 | >/− | cg14757492 | Bonferroni | 0.41 | 0.4241 |
ENPEP | </< | cg17854440 | Bonferroni | 0.041 | 0.0863 |
FXD7 | >/> | cg22392666 | Bonferroni | 0.22 | 0.3 |
GABRR2 | </> | cg06445611 | Bonferroni | 0.1 | 0.1579 |
GALR1 | </< | cg15343119 | 3 CpGs<0.05 | 0.046 | 0.0863 |
GATA4 | </< | cg24646414 | 3 CpGs<0.05 | 0.33 | 0.3778 |
GP1BB | >/> | cg07359545 | Bonferroni | 0.04 | 0.0863 |
GPX4 | </< | cg18061485 | Benjamini–Hochberg | 0.041 | 0.0863 |
GRB10b | >/> | cg01720588 | 3 CpGs<0.05 | 0.002 | 0.0086 |
HSPA5 | </− | cg01733783 | Benjamini–Hochberg | 0.44 | 0.44 |
IGF1 | </< | cg01305421 | Benjamini–Hochberg | 0.023 | 0.0627 |
IGF2b | </< | cg10649864 | 3 CpGs<0.05 | 0.058 | 0.1024 |
INS | >/> | cg03366382 | 3 CpGs<0.05 | 0.019 | 0.0627 |
KCNQ1 | </< | cg06719391 | Bonferroni | 0.08 | 0.1334 |
KRTHB6 | </< | cg04123507 | Bonferroni | 0.009 | 0.0338 |
MEST | >/< | cg01888566 | 3 CpGs<0.05 | 0.26 | 0.3391 |
MGC9712 | >/> | cg06194808 | Bonferroni | 9 × 10−4 | 0.0054 |
MTHFR | >/− | cg03427831 | Benjamini–Hochberg | 0.39 | 0.4179 |
OSBPL5b | </< | cg06282660 | 3 CpGs<0.05 | 0.044 | 0.0863 |
RASSF5 | >/> | cg17558126 | Bonferroni | 0.022 | 0.0627 |
SERPINB5 | >/> | cg20837735 | 3 CpGs<0.05 | 0.34 | 0.3778 |
SLC16A3 | >/> | cg18345635 | Bonferroni | 1.7 × 10−4 | 0.0017 |
SULT1C2 | </< | cg17966192 | Bonferroni | 0.002 | 0.0086 |
VAV1 | >/> | cg13470920 | Bonferroni | 1.7 × 10−4 | 0.0017 |
VHL | >/> | cg16869108 | 3 CpGs<0.05 | 5 × 10−5 | 0.0015 |
ZNF512 | </> | cg18611281 | 3 CpGs<0.05 | 0.29 | 0.3625 |
NOTE: The 16 genes/CpGs with P < 0.05 in the validation set differed significantly in the validation.
aNote that only five of 30 CpGs tested varied in the direction opposite that expected from the discovery profile (13) and none of the five variant CpGs was significant in the validation.
bOne-tailed t test was used because there was an expected direction of difference in this validation. Eight of 14 Bonferroni candidates, two of four Benjamini–Hochberg candidates, and six of 12 “3 CpGs<0.05” candidates validated (16/30 validated overall).
Table 2. The top 18 CpGs/genes selected by an SVM to classify cancer samples and controls
Gene name | CpG ID | P cancer vs. control in Veracode array | P in Infinium array |
---|---|---|---|
ANKRD15 | cg17694279 | 0.007 | 0.196 |
CASP8 | cg05130485 | 5.2 × 10−4 | 0.007 |
EDA2R | cg14372520 | 0.003 | 0.045 |
ENPEP | cg17854440 | 0.041 | 1.26E−06 |
GRB10 | cg01720588 | 0.002 | N/A |
IGFBP5 | cg24617085 | 0.016 | 0.262 |
INS | cg03366382 | 0.019 | 1.28E−05 |
ITGB4 | cg12146151 | 0.006 | 0.451 |
LGALS2 | cg11081833 | 0.045 | 0.994 |
MGC9712 | cg06194808 | 9 × 10−4 | 4 × 10−4 |
NMUR1 | cg10642330 | 0.017 | 0.72 |
RASSF5 | cg17558126 | 0.022 | 1.19E−05 |
SLC16A3 | cg18345635 | 1.7 × 10−4 | 7.28E−05 |
SULT1C2 | cg17966192 | 0.002 | 0.016 |
TIMP4 | cg25982743 | 0.004 | 0.019 |
VAV1 | cg13470920 | 1.7 × 10−4 | 1.76E−05 |
VHL | cg16869108 | 5 × 10−5 | 5 × 10−4 |
VMD2 | cg09726693 | 0.001 | N/A |
NOTE: Ten of the CpGs/genes (CASP8, ENPEP, GRB10, INS, MGC9712, RASSF5, SLC16A3, SULT1C2, VAV1, and VHL) are from the validated candidates in Table 1.
Methylation levels at the individual CpG sites in the normal colon mucosa of the 24 patients with cancer and 24 matched controls (Supplementary Table S1) were assayed on bisulfite-converted DNA using a custom-designed Illumina high-throughput “Veracode” array (see Materials and Methods). Comparison of mean methylation levels between the normal colon mucosa of patients with cancer and controls revealed that mean methylation levels of 16 of the 30 CpG sites tested (53%) differed significantly in the independent population (Table 1, next to last column). In this analysis, a P value of <0.05 was considered significant because the candidate genes were selected for validation on the basis of having passed multiple statistical selection criteria in the discovery population, including correction for multiple testing (13). The success rate for validation was approximately the same regardless of whether the original statistical threshold was very stringent (Bonferroni; 8/14 candidates validated), moderately stringent (Benjamini–Hochberg; 2/4 candidates validated), or our ad hoc statistical threshold requiring a modest level of significance (P < 0.05) for at least three CpGs (nominal P < 0.000125) in the candidate gene (6/12 candidates validated). If the candidates are subjected to correction for multiple testing a second time in the present study, eight of the 16 candidates that are significant at P < 0.05 are also significant at an FDR of 0.05 (Table 1). Notably, the FDR-adjusted P values for the other eight candidates are all <0.087 (Table 1).
Support vector machine model
The Illumina Veracode array used for validation contained 66 CpGs in addition to the 30 shown in Table 1. Although these CpGs were not selected by using the criteria outlined in our previous study, almost all of the genes containing these additional CpGs were profiled on the Infinium array in our original study and a number of them exhibited statistically significant mean methylation differences between patients with cancer and controls (see below).
An SVM was trained on the 96 CpG array data using 20 of the 24 patients with cancer and 20 of the 24 controls (Supplementary Table S1). The SVM identified 18 CpGs with optimum performance (96% sensitivity, 100% specificity; Fig. 1A) in classifying patients with cancer and controls correctly (one patient with cancer in the training set was misclassified as a control). These 18 CpGs (Table 2) consist of 10 of the original 16 validated candidates (Table 1) and eight additional candidates. Two of the eight additional candidates (TIMP4 and EDA2R) also exhibited significant methylation differences between patients with cancer and controls in our original study (Table 2, last column). Methylation levels at these 18 CpGs classified all eight patients in the test set correctly (Fig. 1B).
Figure 1. A, performance of 18 CpG SVM in validation population training set of 20 cancer cases and 20 controls. B, performance of 18 CpG SVM in validation population test set of four cancer cases and four controls. C, performance of the optimum 39 CpG SVM in classifying patients with cancer and controls in the discovery population of patients from our original study (13). This SVM was selected from the 66 CpGs interrogated (13) in the 18 candidate genes described in Table 2.
We devised a further test of the utility of this 18-gene model by selecting all 65 CpGs interrogating these 18 genes on the original 27K discovery array (on which 30 patients with cancer and 18 controls were profiled; ref. 13) from which the 10 validated genes were selected (Table 1). The SVM selected 39 CpGs in 16 of the genes (Supplementary Table S2) as the optimum model (93% sensitivity, 94% specificity; Fig. 1C). Of note, only four CpGs in the INS gene were interrogated on the array and all four were selected for the SVM model. Multiple CpGs in RASSF5 (five selected of nine interrogated), VHL (five selected of seven interrogated), GRB10 (four selected of 12 interrogated), and CASP8 (three selected of six interrogated) were also selected. In addition, seven genes (VAV1, MGC9712, SULT1C2, SLC16A3, ITGB4, ANKRD15, and ENPEP) were interrogated by only two CpGs on the array and both CpGs in each of the seven genes were selected for the SVM.
Correlation between methylation levels in colon mucosa and methylation levels in peripheral blood
We had the opportunity to profile the methylation levels of the 96 CpGs on the Veracode array in DNA extracted from peripheral blood of 15 of the patients without cancer, as well as from normal colon mucosa of the same 15 patients. We compared the CpG site-specific methylation levels between the two tissues and identified a number of genes in which methylation levels in the two tissues were strongly correlated. In fact, 14 of the 96 CpGs showed strong positive correlation (Pearson correlation, r > 0.5) between methylation levels in normal colon mucosa and methylation levels in peripheral blood. Of the CpGs selected by the original 18 CpG SVM (Fig. 1; Table 2), seven exhibit moderate to strong methylation correlations between tissues (Table 3; Fig. 2), and use of colon methylation levels of these seven CpGs alone results in correct classification of 44 of 48 patients (sensitivity 92%, specificity 87%; Fig. 3A) profiled on the Veracode array. Use of a seven CpG SVM (data for the identical CpG or nearest CpG interrogated on the Infinium 27 K array) to classify patients profiled on the discovery array (13) results in a sensitivity of 83% and a specificity of 61% (not shown). Allowing an SVM to be optimized from all 31 CpGs interrogated in these seven genes resulted in a model with 87% sensitivity and 83% specificity (Fig. 3B).
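The between-tissue comparison rests on the Pearson correlation coefficient computed per CpG across patients. The sketch below shows that calculation in plain Python; the β-values are hypothetical and are not data from the study, and the helper name `pearson_r` is our own. The r > 0.5 cutoff for "strong" correlation is the one quoted in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical beta-values for one CpG in five patients (not data from the study).
blood = [0.10, 0.20, 0.30, 0.40, 0.50]
colon = [0.15, 0.22, 0.28, 0.45, 0.48]

r = pearson_r(blood, colon)
print(round(r, 3), "strong" if r > 0.5 else "not strong")
```

Repeating this calculation for each of the 96 CpGs, with one blood and one colon value per patient, would give the per-site coefficients that are then compared against the cutoff.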
Figure 2. Correlation between methylation levels in normal colon mucosa (y-axis) and peripheral blood (x-axis) of 15 of the 24 control patients. A, INS cg03366382. B, LGALS2 cg11081833. C, ANKRD15 cg17694279. D, VHL cg16869108. Trend lines were drawn by “lm” function in R.
Figure 3. A, performance of SVM using seven CpGs showing correlation between methylation levels in normal colon and peripheral blood in classifying patients with cancer and controls in the validation population. B, performance of SVM using seven CpGs showing correlation between methylation levels in normal colon and peripheral blood in classifying patients with cancer and controls in the discovery population (13).
Table 3
Gene name | CpG ID | Pearson r | SVM score |
---|---|---|---|
INS | cg03366382 | 0.72 | 0.79 |
LGALS2 | cg11081833 | 0.68 | 0.54 |
ANKRD15 | cg17694279 | 0.64 | 0.84 |
VHL | cg16869108 | 0.61 | 0.93 |
EDA2R | cg14372520 | 0.57 | 0.30 |
NMUR1 | cg10642330 | 0.30 | 0.34 |
GRB10 | cg01720588 | 0.25 | 0.26 |
Discussion
We have validated multiple site-specific DNA methylation differences in normal colon mucosa between patients with colon cancer and patients without cancer. Our success rate of validation was 53% in a sample of 48 patients, none of whom were examined in the original discovery study (13). The fact that approximately half of the CpGs tested exhibit significant differences in mean methylation level in an independent set of samples is a substantial success rate, even for discovery genes judiciously selected (Table 1) from whole-genome profiles. Methylation differences at two additional candidate CpGs, one in IGF2 and one in KCNQ1 (Table 1), approached but did not achieve significance (P = 0.06 and P = 0.08, respectively).
Two independent samples are not sufficient to determine whether the failure to validate additional CpGs reflects a spurious result in the original study or sampling variation between small subsets of patients with colon cancer and controls drawn from an overall population in which true differences in means exist. However, we observed that the degree of overlap in CpG site methylation level between cancer and control populations was substantial for many loci (ref. 13; data not shown). As a result, many statistically significant differences between population means are expected to be difficult to validate in independent populations with similar variance, even if there are true differences in population mean.
There are two potential weaknesses that could affect the conclusions of our study: (i) the average age of the patients with cancer (mean age = 65.4) is greater than the average age of the controls (mean age = 59.6, P = 0.03) and (ii) the anatomical site of biopsy within the colon varies between patients with cancer and controls. Both of these differences are potentially important. Methylation levels at some CpGs have been demonstrated to change with age (23–28). Overall, the percentage of CpGs that change in an age-related way has been estimated as between 15% (using the Illumina HumanMethylation450 Bead Chip array; ref. 26) and 28% (using the Illumina HumanMethylation27 Bead Chip array; ref. 28). The major concern, with respect to our candidates, is whether they are more likely to be in the age-related group than CpGs selected at random and whether the differences we observe are the result of differences in age between the cancer population and the controls. Because our original 30 candidates were selected from the same Illumina HumanMethylation27 Bead Chip array (13) used in one of the aging studies (28), we queried whether any of the candidates in Table 1 corresponded to the more than 700 aging-related CpGs identified in that study. Only three CpGs in Table 1 (in INS, OSBPL5, and SLC16A3) appear among the aging-related CpGs in ref. 28. Of note, in addition, the estimated average rate of methylation change is 0.07% to 0.2% per year in blood (26, 28) and 0.2% per year in colon (29). Given the 6-year average difference between the patients with cancer and the controls, we might expect age to account for an approximately 1.2% average difference in colon methylation between the two groups. Only three of the 16 validated candidate genes in Table 1 (GALR1, GPX4, and RASSF5) differ by less than this amount between patients with cancer and controls and none of these three candidates were identified as having age-related changes (28).
Of the three genes in which CpGs have been demonstrated to incur age-related changes (INS, OSBPL5, and SLC16A3), the differences between groups are substantially larger than 1.2% (5% at INS, 3% at OSBPL5, and 10% at SLC16A3). Therefore, we do not believe that the 6-year difference in average ages of patients with cancer and controls could be the explanation for the larger and consistent differences we see in methylation levels in two independent populations.
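The arithmetic behind this argument can be checked in a few lines. The rate, age gap, and between-group differences are the figures quoted in the text; the variable names are our own, and the ratio is only a rough guide because the quoted rate is an average across CpGs.

```python
RATE_PER_YEAR = 0.2  # % methylation change per year in colon, as quoted in the text
AGE_GAP_YEARS = 6    # approximate difference in mean age (65.4 vs. 59.6 years)

# Expected contribution of age alone, in percentage points (about 1.2).
expected_age_effect = RATE_PER_YEAR * AGE_GAP_YEARS

# Between-group differences (%) reported in the text for the age-related genes.
observed = {"INS": 5, "OSBPL5": 3, "SLC16A3": 10}

for gene, diff in observed.items():
    ratio = diff / expected_age_effect
    print(f"{gene}: observed {diff}% is {ratio:.1f}x the expected age effect")
```

Each observed difference is at least 2.5 times the expected age effect, which is the basis for the conclusion that age alone is unlikely to explain the differences.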
The second source of potential bias in our study is variability in the site of the normal colon biopsy being compared between the patients with cancer and controls. In our previous study, we compared right side control biopsies with right side biopsies from patients with cancer who had right side tumors (13). In this study, we used left side biopsies from the patients with cancer because all of the cancers were left side cancers and only left side biopsies were available. However, we used right side biopsies from the controls because we were attempting to validate candidate genes identified from right side biopsies, also reasoning that biomarkers that could distinguish cancer, independent of site within the colon would be more valuable, clinically. Although site-specific DNA methylation has been reported by us and others (14, 25, 27) to vary as a function of biopsy site within the colon for a number of genes, biopsy site does not seem to be a major factor at most CpG sites. Genome-wide comparison of methylation in right-side biopsies (from the ascending colon) and biopsies from the left side or rectum has identified both hyper- and hypomethylation differences between the two anatomical sites (27). Although some of these differences were substantial in magnitude, significant differences between the two anatomical sites were found in <2% of CpG sites (8,388 sites out of >430,000) examined. Only three of the CpGs among our 30 candidates or the eight additional CpGs comprising the SVM (Table 2) are present among the 8,388 right side/rectum differing CpGs in Kaz and colleagues (27). One of these was among the 14 CpGs that were not validated in Table 1, one was among the 16 validated CpGs (in GP1BB) but not selected for the SVM and one (in LGALS2) was not among the original candidates but was selected by the SVM. Leaving this CpG out of the SVM does not substantially alter the sensitivity or specificity of the model. 
In summary, neither age differences nor biopsy site differences within the colon are likely to account for the reproducible differences we observe between patients with cancer and controls in two independent population samples.
Looking forward, it is likely that multigene models like those presented here will be required to classify patients with the level of accuracy required to be useful in the clinic. The number of methylation biomarker genes that will be required will be dependent on the discriminatory power of the markers but clinically useful distinctions for some clinical outcomes are made currently on the basis of measuring transcript levels of only 12 to 23 genes (30). With respect to methylation biomarkers, there are significant advantages to using DNA, rather than RNA, as a diagnostic molecule (31). In addition to the relative stability of DNA compared with RNA, DNA methylation level, like mRNA level, is a continuous variable. However, it exhibits considerably lower variance than population mRNA levels (e.g., 32, reviewed in ref. 33), in part, because DNA methylation levels are constrained between 0 and 1.
When assembling a panel of methylation biomarkers, an additional consideration is whether interrogation of a single or a small number of sites in a larger number of genes (as was the case with early array-based methods; ref. 34; methylation-sensitive restriction endonuclease-based methods, e.g., ref. 35; and most bisulfite pyrosequencing assays; ref. 36) or interrogation of a greater number of sites in a smaller number of genes (larger or custom arrays; refs. 37, 38; or multiple bisulfite pyrosequencing assays) would be superior. Although we have observed that methylation levels of different CpG sites within the same CpG island are often highly correlated (13, 18, 39) and would, therefore, be predicted to add little additional information, it is possible that interrogation of additional CpGs will add predictive power if the additional information is not completely redundant. We are able to make a preliminary assessment of whether single CpG sites in our 18-gene model perform as well as the 39 CpG sites in 16 of these genes. Supplementary Figure S1A shows the result of using a single CpG in 17 of the 18 genes (VMD2/BEST1 is not interrogated on the Illumina 27 K array used in the original study) in classifying the 30 patients with cancer and 18 controls in our original study. Comparing this result with that in Fig. 1C, we see that the specificity is the same (94% success in classifying controls) but the sensitivity drops from 93% to 83% (five patients with cancer misclassified vs. two patients with cancer misclassified). Thus, for this particular set of candidate genes, additional precision is gained by assessing methylation at more than one site per gene. Even if we expand the number of individual CpGs/individual genes (38 CpGs/38 individual genes; again, VMD2/BEST1 is not present) to the same number used in Fig. 1C, specificity remains at 94% but sensitivity rises to only 86% (Supplementary Fig. S1B), suggesting that interrogating multiple CpGs per gene is a superior approach to interrogating individual CpGs in a larger number of genes.
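The sensitivity and specificity figures quoted for the original-study comparison follow directly from the misclassification counts. The sketch below is illustrative only: the class and variable names are hypothetical, and the count of one misclassified control is an assumption inferred from the reported 94% specificity among 18 controls rather than a number stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationCounts:
    """Misclassification counts for a two-class (cancer vs. control) model."""

    cancer_total: int
    cancer_missed: int
    control_total: int
    control_missed: int

    def sensitivity(self) -> float:
        # Fraction of patients with cancer classified as cancer.
        return (self.cancer_total - self.cancer_missed) / self.cancer_total

    def specificity(self) -> float:
        # Fraction of controls classified as controls.
        return (self.control_total - self.control_missed) / self.control_total


# Counts from the text: 30 patients with cancer, 18 controls.
# control_missed=1 is an assumption inferred from the reported 94% specificity.
multi_cpg = ClassificationCounts(
    cancer_total=30, cancer_missed=2, control_total=18, control_missed=1
)
single_cpg = ClassificationCounts(
    cancer_total=30, cancer_missed=5, control_total=18, control_missed=1
)

for label, counts in (
    ("multiple CpGs per gene", multi_cpg),
    ("single CpG per gene", single_cpg),
):
    print(
        f"{label}: sensitivity {counts.sensitivity():.1%}, "
        f"specificity {counts.specificity():.1%}"
    )
```

With only 30 patients with cancer, each misclassified patient shifts sensitivity by roughly 3 percentage points, so differences of this size should be read as preliminary.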
An additional consideration for whether candidate methylation biomarkers such as those identified here will be clinically useful is whether they can be assessed in tissues collected less invasively than by colonoscopy. Although there are multiple factors associated with uptake of the test, including education, insurance coverage, and ethnicity (40), the fact that fewer than half of those patients recommended to have a screening colonoscopy comply (2) suggests that uptake of a less invasive test, such as might be performed on peripheral blood or saliva, might be substantially greater. There is much interest and some progress in developing such biomarkers (41, 42, reviewed in ref. 43). As to whether such biomarkers give a realistic picture of methylation levels in the organ of interest, it is often assumed that methylation levels are tissue specific, but there are many sites for which methylation levels vary between individuals but do not vary substantially between tissues of the same individual (44, 45). Our finding that a significant fraction of candidate biomarkers (seven out of 18; Tables 2 and 3; Fig. 2) shows a moderate-to-strong correlation between methylation levels in colon mucosa and methylation levels in peripheral blood offers the possibility that methylation levels of candidate biomarkers in tissues collected less invasively could serve as a proxy not only for colon, but also for any other tissue or organ that might be difficult or impossible to biopsy.
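The colon-to-blood comparison rests on a Pearson correlation between paired methylation measurements from the same individuals. The following is a minimal sketch of that calculation, not the analysis pipeline used in this study; the function name is hypothetical and the paired values are made-up placeholders, not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient for paired observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical methylation fractions (0 to 1) at one CpG for five individuals,
# measured in colon mucosa and in peripheral blood. Illustrative values only.
colon = [0.10, 0.20, 0.30, 0.40, 0.50]
blood = [0.15, 0.18, 0.35, 0.38, 0.55]

print(f"r = {pearson_r(colon, blood):.3f}")
```

A high r at a given CpG indicates that individuals rank similarly in the two tissues, which is the property needed for blood to act as a proxy; it does not require the absolute methylation levels to be equal.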
One clear advantage of cancer screening by colonoscopy is that premalignant lesions that are detected can be removed before they have a chance to progress to cancer. We have described two methylation biomarkers that show some promise in distinguishing individuals without cancer or polyps from cancer-free individuals who have polyps (14). In this study, we attempted to define an SVM to classify which individuals in our group of controls carried polyps, but its performance was mediocre. Twelve of the controls had polyps (two with hyperplastic polyps, eight with tubular adenomas, and two with polyps that were not described histopathologically) and 12 did not (Supplementary Table S1). An SVM using six CpGs selected from the 96 on the array was able to classify eight of the 12 polyp-carrying individuals correctly. The two individuals with hyperplastic polyps (which are not thought to give rise to malignancy) were classified as controls, but four of the controls were also classified with the polyp group. It seems likely that additional large-scale screens such as that performed by Lange and colleagues (41) will be necessary to identify markers of sufficient discriminatory power to make the more subtle distinction between individuals without cancer but with premalignant lesions and individuals who are cancer free and polyp free. However, the potential benefits of noninvasive, large-scale screening for relative cancer risk could be highly significant in reducing disease burden.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: C. Sapienza
Development of methodology: M. Cesaroni, C. Sapienza
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C. Sapienza
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): M. Cesaroni, J. Powell, C. Sapienza
Writing, review, and/or revision of the manuscript: M. Cesaroni, C. Sapienza
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Powell, C. Sapienza
Study supervision: C. Sapienza
Acknowledgments
The authors thank Dr. Andrew Kaz for providing information on the 8,388 CpGs whose methylation levels vary between right side colon and rectal biopsies (cited in Kaz and colleagues; ref. 27) and Drs. Noor Dawany and Andrew Kossenkov for providing SVM scripts and for the helpful discussion about statistical methods.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Grant Support
This work was supported by a grant from the National Institutes of Health, NIH R03 CA180533-01.