Abstract
Aberrant CpG island methylation is associated with transcriptional silencing of regulatory genes in human cancer. Although most CpG islands remain unmethylated, a subset accrues aberrant methylation in cancer via unknown mechanisms. Previously, we showed that CpG islands differ in their intrinsic propensity towards hypermethylation. We developed a classifier (PatMAn) based on the frequencies of seven DNA sequence patterns that discriminated methylation-prone (MP) and methylation-resistant (MR) CpG islands. Here, we report on the genome-wide application and direct testing of PatMAn in cancer. Although trained on data from a cell culture model of de novo methylation involving the overexpression of DNMT1, PatMAn accurately predicted CpG islands at increased risk of hypermethylation in cancer cell lines and primary tumors. Analysis of CpG islands predicted to be MP revealed a strong association with embryonic targets of polycomb-repressive complex 2 (PRC2), indicating that PatMAn predicts not only aberrant methylation, but also PRC2 binding. A second classifier (SUPER-PatMAn) that integrates the seven PatMAn DNA patterns with SUZ12 enriched regions as a marker of PRC2 occupancy showed improved performance (prediction accuracy, 81–88%). In addition to many non-PRC2 targets, SUPER-PatMAn identified a subset of PRC2 targets that were more likely to be hypermethylated in cancer. Genome-wide, CpG islands predicted to be MP were enriched in genes known to undergo hypermethylation in cancer, genes functioning in transcriptional regulation, and components of developmental pathways. These findings show that hypermethylation of certain gene loci is controlled in part by an underlying susceptibility influenced by both local sequence context and trans-acting factors. [Cancer Res 2009;69(1):282–91]
Introduction
CpG island hypermethylation is associated with local changes in chromatin architecture and serves as one mechanism for silencing tumor suppressor gene transcription in human cancer. It is estimated that individual tumors exhibit aberrant de novo DNA methylation of 1% to 5% of the nearly 38,000 CpG islands in the human genome (1–3). Although there is significant variation in the methylation profile from one tumor to the next, a subset of CpG islands is reproducibly methylated across multiple tumors and cancer types (1). However, the mechanisms by which specific CpG islands are targeted for aberrant methylation in cancer cells remain unclear. One hypothesis suggests that the DNA methyltransferase enzymes may be aberrantly targeted to specific loci by transcription factors or other DNA binding proteins. For example, the oncogenic PML-RAR transcription factor has been shown to bind the DNMTs and direct de novo methylation to a downstream target gene in acute promyelocytic leukemias (4). More recently, the DNMTs have been shown to interact with components of the polycomb-repressive complex 2 (PRC2) and to be recruited to sites of polycomb-mediated repression in cancer cells (5–8). PRC2 consists of SUZ12, EED, RbAp46/48, and the histone methyltransferase, EZH2, which mediates the trimethylation of histone H3 lysine 27 (H3K27me3) at ∼10% of genes in human embryonic stem cells (9). Interestingly, a fraction of these PRC2 target genes undergo aberrant DNA methylation in human cancers (8, 10, 11), suggesting that the mark imposed by PRC2 during development may predispose some genes to later de novo methylation.
In previous work, we identified CpG islands with different propensities for aberrant methylation in response to stable overexpression of the DNMT1 DNA methyltransferase (12, 13). Of 1,749 CpG islands analyzed, the majority (70%; n = 1,223) were methylation-resistant (MR) and remained unmethylated in multiple cell clones regardless of DNMT1 expression. However, a distinct subset (3%; n = 66) was found to be methylation-prone (MP) in that they were consistently hypermethylated in multiple independent DNMT1-overexpressing clones (13). Using pattern recognition and supervised machine learning techniques, we established a classifier based on seven short DNA patterns (TCCCCCNC, TTTCCTNC, TCCNCCNCCC, GGAGNAAG, GAGANAAG, GCCACCCC, and GAGGAGGNNG) that was capable of accurately discriminating MP and MR CpG islands in cross-validation and blind tests (Fig. 1; ref. 13). We refer to this sequence-based classifier as PatMAn (pattern-based methylation analysis). These initial findings indicated that individual CpG islands differ in their inherent susceptibility to aberrant DNA methylation and suggested that this susceptibility is conferred in part by local features encoded in the DNA sequence. These data support the concept of an “instructive” mechanism of de novo DNA methylation in cancer wherein the risk of methylation is a predetermined intrinsic property of some CpG islands (3).
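As an illustration of the kind of sequence feature involved, the following minimal sketch counts occurrences of the seven patterns listed above in a DNA sequence. It is not the published PatMAn implementation: treating N as any single base, scanning the forward strand only, counting overlapping matches, and normalizing per kilobase are all assumptions made here, and the function names are hypothetical.

```python
import re

# The seven PatMAn patterns as listed in the text.
PATMAN_PATTERNS = [
    "TCCCCCNC", "TTTCCTNC", "TCCNCCNCCC", "GGAGNAAG",
    "GAGANAAG", "GCCACCCC", "GAGGAGGNNG",
]


def pattern_to_regex(pattern):
    """Compile a pattern, treating N as any base (assumption).

    A zero-width lookahead is used so that overlapping occurrences
    are counted; this is also an assumption.
    """
    body = pattern.replace("N", "[ACGT]")
    return re.compile("(?=" + body + ")")


def count_patterns(sequence):
    """Return {pattern: count} for the forward strand of `sequence`."""
    seq = sequence.upper()
    return {p: len(pattern_to_regex(p).findall(seq)) for p in PATMAN_PATTERNS}


def pattern_frequencies(sequence):
    """Counts normalized per kilobase of sequence (hypothetical normalization)."""
    counts = count_patterns(sequence)
    kb = len(sequence) / 1000.0
    return {p: (c / kb if kb else 0.0) for p, c in counts.items()}


example_counts = count_patterns("TCCCCCACAGCCACCCC")
```

A feature vector of this kind (one value per pattern) would then be passed to a classifier; the classification engine itself (DAMIP) is not reproduced here.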
Figure 1. Developing and testing sequence-based computational tools for predicting susceptibility to aberrant methylation. MP and MR CpG islands were identified by NotI RLGS in human lung fibroblasts stably overexpressing DNMT1 or a control plasmid (NeoR). A training set of nine MP and nine MR CpG islands was used in a three-stage computational approach involving DNA pattern recognition, feature selection, and an optimization-based discrete support vector machine (DAMIP) to arrive at a classifier based on seven DNA patterns with maximal discrimination potential. This pattern-based methylation analysis classifier, termed PatMAn, was applied to all human CpG islands. Classifier performance was assessed by testing the actual methylation status of a subset of CpG islands from chromosomes 21 and 22 in DNMT1-overexpressing cells, a series of normal and cancer cell lines, as well as primary lung tumor samples. Improvements were made to the classifier through the incorporation of an additional feature (i.e., SUZ12 binding). Re-training based on actual methylation status of tested CpG islands allows for additional refinement.
We now report on the genome-wide application and biological validation of the PatMAn classifier. We find that PatMAn predicts, with high confidence, CpG islands at increased risk of de novo methylation not only in DNMT1-overexpressing cells, but also in cancer cell lines and primary tumors. Furthermore, we find a significant enrichment of PRC2 target genes among the MP CpG island class suggesting that the algorithm and the sequence patterns that define it are predictive not only of aberrant methylation, but also of polycomb binding. The development of a second classifier that integrates PRC2 occupancy data as an additional biological feature increased the accuracy and specificity of methylation-susceptibility predictions. These findings show that aberrant CpG island methylation is influenced by both local sequence context and at least one trans-acting factor.
Materials and Methods
Cell lines and primary tumor specimens. The generation of human fibroblasts overexpressing DNMT1 and matched controls expressing vector alone (NeoR) has been previously described (12). Maintenance of the human mammary epithelial cells (HMEC), MCF10A, SKBR3, Hs578t, T47D, MDA-MB-468, MCF7, ZR75-1, MDA-MB-435s, MDA-MB-453, and MDA-MB-231 cell lines has been previously described (14). All other cell lines (A549, Calu-1, H157, H1792, H226, and H460) were obtained from American Type Culture Collection and maintained in DMEM containing 10% fetal bovine serum. Primary human bronchial epithelial cells (HBEC) were generated from autopsy samples after enzymatic dissociation of epithelium and stroma with collagenase. Twenty snap-frozen non–small cell lung cancer specimens (16 adenocarcinoma and 4 squamous cell carcinoma) and paired adjacent normal tissue were obtained from the Emory University School of Medicine Tissue Procurement and Banking Service.
Methylation analyses. Genomic DNA was extracted from cell lines and primary tumors using the DNeasy Tissue Kit (Qiagen) and was bisulfite-modified as previously described (15). For primary tissue samples, 1 μg of DNA was bisulfite-modified with the EZ DNA Methylation-Gold kit (Zymo Research) according to the recommendations of the manufacturer. Methylation-specific PCR (MSP) was performed with ∼80 ng of bisulfite-modified DNA as previously described (14). MSP primers are listed in Supplementary Table S1. As a methylation-positive control, genomic DNA was in vitro methylated with the bacterial DNA methyltransferase M.SssI (New England Biolabs) according to the manufacturer's recommendations.
CpG island extraction. A database of CpG island genomic coordinates from the HG17 freeze of the human genome (NCBI Build 35) was generated by applying a modified version of the CpG Island Searcher PERL program using the criteria of length (≥500 bp), GC content (≥55%), and CpG Obs/Exp (≥0.65) established by Takai and Jones (16).
Annotation of CpG islands to genome-wide ChIP data sets. Raw SUZ12 ChIP-chip data from human embryonic stem cells (9) was obtained from ArrayExpress (ID: E-WMIT-7). Data processing and identification of SUZ12-enriched, -nonenriched, and uninformative regions are described in detail in the Supplementary Methods. Briefly, SUZ12-enriched and -nonenriched genomic regions were identified with a modified version of the PERL-implementation of the ChIPOTle program (17). This analysis identified 4,350 SUZ12-enriched regions (average length, 1,313 bp). CpG islands were then assessed for proximity (within 1 kb) to SUZ12-enriched and -nonenriched regions. This allowed for the annotation of SUZ12 binding status for 93% of CpG islands in the genome. There were 3,642 SUZ12 (+) CpG islands, 31,238 SUZ12 (−) CpG islands, and the remaining 2,650 had insufficient data.
ChIP-Seq datasets for H3K27me3, H3K4me3, RNA PolII, and H3K27Ac in CD4+ T cells were obtained from National Heart, Lung, and Blood Institute web sites (18, 19). Spatial mapping was performed with a custom PERL program which aligns CpG islands by their centers and then calculates the average number of ChIP-Seq tags at each base within a specified window. A 500 bp centered moving average was applied to highlight larger trends and smooth out short-range fluctuations.
Classifier generation and application. A supervised learning strategy was used to develop a predictive rule based on a set of sequence attributes that discriminate MP and MR sequences (see Supplementary Methods for a detailed description; ref. 13). Briefly, pattern recognition was first used on a training set of nine MP and nine MR CpG islands to identify common short DNA patterns. Feature selection and a novel optimization-based discrete support vector machine (DAMIP; ref. 20) were then applied. This machine-learning approach returned a classifier based on a set of seven short discriminatory DNA patterns that achieved an accuracy of 89% in 10-fold cross-validation tests. This DNA sequence-based classifier is herein referred to as PatMAn, and has been previously reported (13). A second classifier, termed SUPER-PatMAn, was developed based on the same training set of MP and MR sequences using the seven discriminatory patterns from PatMAn and SUZ12 enrichment status (scored as positive, negative, or uninformative; see Supplementary Methods) as input into the DAMIP classification engine. To compare their predictive power, the PatMAn and SUPER-PatMAn classifiers were then applied to all human CpG islands.
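The CpG island criteria used for extraction earlier in this section (length ≥500 bp, GC content ≥55%, CpG Obs/Exp ≥0.65) can be made concrete with a short sketch. It only evaluates a candidate sequence against the three thresholds, using the commonly used definition Obs/Exp = (number of CpG × length)/(number of C × number of G). It does not reproduce the window-based search performed by the CpG Island Searcher program, and the function names are hypothetical.

```python
def cpg_island_stats(sequence):
    """Return (length, GC fraction, CpG observed/expected) for a sequence."""
    seq = sequence.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Obs/Exp = (#CpG * N) / (#C * #G); defined as 0 when C or G is absent.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def meets_takai_jones(sequence, min_len=500, min_gc=0.55, min_obs_exp=0.65):
    """Check a candidate region against the three thresholds quoted in the text."""
    n, gc_fraction, obs_exp = cpg_island_stats(sequence)
    return n >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp


# Toy sequences (not real genomic data) to exercise the thresholds.
cg_rich = "CG" * 250      # 500 bp, GC fraction 1.0, Obs/Exp 2.0
at_only = "AT" * 250      # 500 bp, no C or G
too_short = "CG" * 249    # 498 bp
```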
Classifier performance calculations. Accuracy, specificity, and sensitivity were calculated as follows:
Accuracy = (TP + TN)/(TP + FP + TN + FN)
Specificity = TN/(TN + FP)
Sensitivity = TP/(TP + FN)
where TP = true positive, FP = false positive, TN = true negative, and FN = false negative. For scoring purposes, a CpG island was scored as MP if it exhibited increased methylation in at least two of three DNMT1 clones compared with the average methylation among NeoR clones or higher methylation in at least 20% of cancer cell lines relative to the highest methylation event observed in control cells.
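The three formulas above can be written as a small helper. As a worked example, the sketch below uses the counts reported in the Results for PatMAn in DNMT1-overexpressing cells: 12 of 22 predicted MP islands were hypermethylated (TP = 12, FP = 10) and 3 of 20 predicted MR islands were hypermethylated (FN = 3, TN = 17). The function name is hypothetical.

```python
def classifier_performance(tp, fp, tn, fn):
    """Accuracy, specificity, and sensitivity as defined in the Methods."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
    }


# Counts derived from the Results text (PatMAn, DNMT1-overexpressing cells).
dnmt1 = classifier_performance(tp=12, fp=10, tn=17, fn=3)
for name, value in dnmt1.items():
    print(f"{name}: {value:.1%}")
```

With these counts the helper gives values that round to the reported accuracy of 69%, specificity of 63%, and sensitivity of 80%.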
Results
Using methylation data from a cell culture model in which de novo methylation is induced by overexpression of DNMT1, we previously generated a classifier involving 7 novel DNA sequence patterns that could discriminate MP and MR CpG islands (13). To further evaluate the predictive potential of the PatMAn classifier and to determine the extent to which PatMAn is predictive of aberrant methylation in other settings, we applied it to all 37,530 CpG islands from the NCBI build 35 (UCSC HG17) meeting the criteria of Takai and Jones (length ≥500 bp, GC content ≥55%, and CpG Obs/Exp ≥0.65; ref. 16). PatMAn predicted 1,535 (4.1%) CpG islands to be MP (Supplementary Table S2). The chromosomal distribution of predicted MP CpG islands showed no apparent clustering after CpG island density was considered (Fig. 2A). The accuracy of these predictions was then validated experimentally by assessing the actual methylation status of 44 randomly selected CpG islands from chromosomes 21 and 22 (23 predicted to be MP, 21 predicted to be MR) in normal IMR90 fibroblasts, DNMT1-overexpressing cells, and vector-only controls (NeoR; Fig. 2B and C). CpG islands exhibiting increased methylation in at least 2 of 3 DNMT1 clones compared with the average methylation among NeoR clones were considered to be truly MP (i.e., true-positive if predicted MP and false-negative if predicted MR). Two CpG islands (MGC16635, RIPK4) were methylated in all samples examined, including normal fibroblasts and primary tissues (Figs. 2 and 3) and thus their potential for aberrant de novo methylation could not be assessed accurately. After excluding these, more than half (12 of 22; 54.5%) of the CpG islands predicted to be MP by PatMAn were indeed hypermethylated in DNMT1-overexpressing cells. In contrast, only 3 of 20 (15%) CpG islands predicted to be MR were hypermethylated (Fig. 2C; P = 0.01, Fisher's exact). 
Therefore, PatMAn was capable of predicting the actual methylation status of CpG islands in DNMT1-overexpressing cells with an accuracy of 69% (specificity = 63%; sensitivity = 80%; see Methods for calculation details).
Figure 2. Classification and biological testing of CpG islands from chromosomes 21 and 22. A, 52 of 1,358 chromosome 21/22 CpG islands were classified as MP by the PatMAn classifier (black ticks to the right of each chromosome), and central moving average of CpG island density (CpGi/100 kb; left of each chromosome). B, the methylation status of a subset of CpG islands from chromosomes 21/22 was assessed by MSP in normal fibroblasts (IMR90), three independent vector-only clones (NeoR), and three independent DNMT1-overexpressing clones (DNMT1). DNA methylated in vitro with M.SssI is included as a positive control. U, unmethylated; M, methylated. C, heat map representation of MSP results. Each MSP was performed at least thrice. The degree of methylation was estimated from the relative abundance of the methylated and unmethylated products and is scored on a 5-point scale ranging from completely unmethylated (white) to completely methylated (blue). *, CpG islands scored as truly MP were defined as those that exhibited higher levels of methylation in at least two DNMT1-overexpressing clones compared with the average NeoR methylation.
Figure 3. Methylation status of CpG islands classified by PatMAn as MP or MR in control and cancer cell lines. The methylation status of predicted MP (n = 23; left) and predicted MR (n = 21; right) CpG islands was assessed by MSP in normal HMECs, primary HBECs from a nonsmoker and a smoker, immortalized nontransformed breast epithelial cells (MCF10A), nine breast cancer cell lines (Hs578t, MCF7, MDA-MB-231/435s/453/468, SKBR3, T47D, and ZR75-1), and six lung cancer cell lines (A549, Calu-1, H157, H226, H460, and H1792). Each MSP was performed at least thrice. The degree of methylation was estimated from the relative abundance of the methylated and unmethylated products and is scored on a 5-point scale ranging from completely unmethylated (white) to completely methylated (blue). SUZ12 occupancy status (white, negative; black, positive). *, CpG islands scored as truly MP were defined as those that exhibited higher levels of methylation in at least 20% of cancer cell lines compared with the highest methylation level observed in the control cells.
To determine the extent to which PatMAn is predictive of aberrant methylation in human cancer, we next analyzed the aforementioned 44 CpG islands in primary HMEC and HBEC, immortalized, nontransformed breast epithelial cells (MCF10A), and a panel of nine breast and six lung cancer cell lines (Fig. 3). In general, those CpG islands predicted to be MP by PatMAn were more frequently methylated in the cancer cell lines than in cultured primary cells (HMEC, HBEC) or a nontumorigenic (MCF10A) cell line. Again, two CpG islands (MGC16635, RIPK4) were methylated in all samples examined, including primary mammary and bronchial epithelial cultures (Fig. 3) and were excluded from performance calculations. If we consider CpG islands that exhibit higher methylation in at least 20% of cancer cell lines relative to the highest methylation event observed in control cells to be true-positives, then the accuracy of the classifier was 76.2% (specificity, 69.2%; sensitivity, 87.5%). CpG islands predicted to be MP that were actually hypermethylated in DNMT1-overexpressing cells also tended to be hypermethylated in breast and lung cancer cell lines (Figs. 2C and 3).
Although based on a limited data set, it is also interesting to note that HBEC isolated from a smoker exhibited hypermethylation of several predicted MP CpG islands as compared with HBEC from a nonsmoker (Fig. 3). A similar phenomenon was observed in immortalized, yet nontransformed mammary epithelial cells (MCF10A) compared with cultured primary HMECs. These data suggest that the aberrant methylation of some CpG islands predicted to be MP may be an early event in the tumorigenic process. Thus, the PatMAn classifier seems to identify a class of CpG islands that are prone to aberrant methylation across multiple cell (fibroblast and epithelial) and tumor types (breast and lung cancer), and in response to other premalignant stress conditions (carcinogen exposure, immortalization).
The identification of CpG islands with different propensities for aberrant DNA methylation provides an opportunity to examine other biological characteristics that might correlate with methylation susceptibility. To this end, gene ontology analyses were performed on a data set of MP and MR CpG islands previously identified by restriction landmark genomic scanning (RLGS) in the DNMT1 overexpression model (13). These studies revealed that MP CpG islands were significantly enriched in genes functioning in transcriptional regulation (Fig. 4A), whereas the MR class was enriched in genes functioning in protein binding, phosphorylation, and metal binding (data not shown). In particular, the homeobox class was the most significantly enriched category among the MP genes (16%; P = 2.73 × 10⁻⁸), whereas none of the MR genes encoded homeodomains.
Figure 4. Predicted MP CpG islands are enriched in targets of PRC2. A, annotation of gene ontology terms among MP and MR CpG islands identified by RLGS in DNMT1-overexpressing cells using the DAVID Bioinformatics Database. B, analysis of SUZ12 binding, EED binding, or the H3K27me3 modification (9) at MP and MR CpG islands identified by RLGS in DNMT1-overexpressing cells (13). Pie charts indicate the fraction of CpG islands enriched for 0, 1, 2, or all three of these factors. C, representative SUZ12-enriched (TBX1) and nonenriched (ADRBK2) CpG islands. Plotted are the normalized SUZ12 enrichment ratios for each probe within the window (red bars). Regions scored as SUZ12-enriched (red boxes) were compared with the genomic positions of CpG islands (green boxes), and RefSeq genes (blue). D, methylation status of CpG islands predicted to be MP (left) or MR (right) by SUPER-PatMAn. Each MSP was performed at least thrice. The degree of methylation was estimated from the relative abundance of the methylated and unmethylated products and is scored on a 5-point scale ranging from completely unmethylated (white) to completely methylated (blue). *, CpG islands scored as truly MP were defined as described in Fig. 3. #, CpG islands re-classified by SUPER-PatMAn; NS, nonsmoker; S, smoker. SUZ12 occupancy status (white, negative; black, positive).
Homeobox genes and other developmental regulators are frequent targets of polycomb-mediated repression. One complex that mediates this repression is PRC2, which consists of SUZ12, EED, RbAp46/48, and the histone methyltransferase EZH2, which catalyzes the trimethylation of H3K27 (H3K27me3; ref. 21). PRC2 components are up-regulated in cancers (22), and interactions between EZH2 and DNMTs have been reported (5–8). Genome-wide studies have characterized the distribution of H3K27me3, SUZ12, and EED in human embryonic stem cells (9). Analysis of these data revealed a striking relationship between loci enriched for PRC2 components and/or marked by H3K27me3 and those CpG islands determined by us to be MP by RLGS in DNMT1-overexpressing cells (Fig. 4B). Approximately half (50.9%) of the MP CpG islands were enriched for SUZ12, EED, and/or H3K27me3, whereas only 17.6% of MR CpG islands were similarly enriched (P = 7.2 × 10⁻⁷, Fisher's exact). A similar analysis of genome-wide binding data for the chromatin insulator CTCF (23) showed no relationship with methylation propensity (Supplementary Fig. S1).
There was also a striking relationship between SUZ12 occupancy and CpG islands predicted by PatMAn to be MP (Fig. 3). Indeed, those CpG islands predicted to be MP that were actually hypermethylated in cancer cells tended to be those bound by SUZ12. Of the nine CpG islands predicted to be MP that were also bound by SUZ12, all were hypermethylated in cancer cells. On the other hand, only 5 (38%) of the 13 CpG islands predicted to be MP that were negative for SUZ12 were hypermethylated (P = 0.006, Fisher's exact). In contrast, there was no correlation between SUZ12 binding and actual methylation status among the predicted MR CpG islands. Only two predicted MR CpG islands were bound by SUZ12 and neither was hypermethylated. Thus, the PatMAn classifier, which is based solely on DNA sequence, is capable of distinguishing those PRC2 occupancy events that are associated with aberrant DNA methylation in cancer cells from those that are not.
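The Fisher's exact comparison above can be reproduced from the counts given in the text: among predicted MP CpG islands, 9 of 9 SUZ12-positive and 5 of 13 SUZ12-negative islands were hypermethylated. The sketch below is a plain-Python two-sided test that sums the probabilities of all tables no more likely than the observed one. Whether the original analysis used this particular two-sided convention is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of every table (with the same
    margins) that is as probable as, or less probable than, the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * tolerance
    )


# Rows: SUZ12-positive, SUZ12-negative; columns: hypermethylated, not.
p_suz12 = fisher_exact_two_sided(9, 0, 5, 8)
print(f"P = {p_suz12:.4f}")
```

Under these assumptions the result is approximately 0.0055, consistent with the reported P = 0.006.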
Based on these observations, we next sought to determine whether the inclusion of polycomb occupancy data in combination with DNA sequence features might aid in the discrimination of MP and MR CpG islands. We used the genome-wide SUZ12 ChIP-chip data from human embryonic stem cells (9) to annotate CpG islands for PRC2 occupancy status (Fig. 4C; Supplementary Methods). This analysis allowed for the annotation of 93% of CpG islands genome-wide. Using the same training set of CpG islands and supervised learning approach used to generate PatMAn, we generated a new classifier in which SUZ12 occupancy status was considered as a discriminatory feature in combination with the frequencies of the original seven PatMAn DNA patterns. The accuracy of this new classifier, which we refer to as SUPER-PatMAn (for SUZ12 protein–enriched regions and pattern–based methylation analysis), was then estimated by 10-fold cross-validation. Results of the cross-validation showed that MP CpG islands were classified with an accuracy of 88% (one of nine misclassified) and MR CpG islands with an accuracy of 78% (two of nine misclassified) for an overall rate of correct classification of 83%.
When applied to all 37,530 human CpG islands, SUPER-PatMAn predicted 1,232 (3.3%) to be MP (Supplementary Table S2). Analysis of the same 44 CpG islands used to assess PatMAn performance showed that prediction accuracy improved from 69% to 81% in DNMT1-overexpressing cells and from 76.2% to 88.1% in cancer cell lines. This improved performance was due to a reduced rate of false-positives resulting from the re-classification of six CpG islands originally classified as MP by PatMAn (Fig. 4D). As a result, specificity increased from 62.9% to 81.5% in DNMT1-overexpressing cells and from 69.2% to 88.5% in cancer cell lines. Thus, this classification algorithm based on DNA sequence patterns plus PRC2 occupancy exhibits increased predictive power for the classification of methylation susceptibility.
We next evaluated the ability of PatMAn and SUPER-PatMAn to identify cancer-associated hypermethylation in primary tumors. We analyzed the methylation status of the 44 test CpG islands in a collection of non–small cell lung tumors (T) and paired adjacent normal (N) tissues from the same patient (Fig. 5A). CpG islands that exhibited tumor-specific hypermethylation in a preliminary screen of five N-T pairs were further assessed in 15 additional N-T pairs (Fig. 5B). Six CpG islands (TBX1, OLIG2, ADAMTS5, KCNJ6, MGC16635, and RIPK4) exhibited some methylation in normal adjacent tissues (data not shown) and were not considered further. Of the remaining 38 CpG islands, 9 of 18 (50%) CpG islands predicted to be MP by PatMAn exhibited tumor-specific hypermethylation. In contrast, only 2 of 20 (10%) CpG islands predicted to be MR exhibited any hypermethylation (P = 0.01, Fisher's exact). SUPER-PatMAn showed improved performance, with 69.2% (9 of 13) predicted MP CpG islands exhibiting tumor-specific methylation, whereas only 8% (2 of 25) predicted MR CpG islands showed any methylation (P = 0.0002, Fisher's exact; Fig. 5B). Taking into consideration total hypermethylation events among all genes and tumors tested, CpG islands predicted to be MP by the SUPER-PatMAn and PatMAn classifiers were methylated 9.1 and 6.3 times more frequently than the predicted MR CpG islands, respectively. Considering that our analysis was limited to a single tumor type, the observed sensitivity of these classifiers is likely an underestimate. Thus, the PatMAn/SUPER-PatMAn classifiers trained on methylation data from DNMT1-overexpressing cells are also capable of identifying CpG islands that are prone to hypermethylation in primary lung tumors.
Figure 5. Methylation status of CpG islands classified by SUPER-PatMAn as MP or MR in primary lung tumors. The methylation status of predicted MP and predicted MR CpG islands was assessed by MSP in a panel of five paired normal and cancerous primary lung samples. CpG islands exhibiting hypermethylation in this sample set were further tested in 15 additional normal-tumor (N-T) pairs. A, representative MSP data for three predicted MP CpG islands (ERG, PP2447, and SIM2) and three predicted MR CpG islands (TPST2, C21orf33, and CRYZL1/ITSN1) in five N-T pairs. U, unmethylated; M, methylated. B, summary of methylation frequencies of all CpG islands tested.
Genome-wide, there were 1,535 (4.1%) and 1,232 (3.3%) CpG islands predicted to be MP by the PatMAn and SUPER-PatMAn classifiers, respectively. There was considerable overlap between the two sets with 1,128 CpG islands being common between them (Fig. 6A and B). However, 407 CpG islands predicted to be MP by PatMAn were re-classified as MR by SUPER-PatMAn. Based on our direct testing of chromosome 21/22 CpG islands, these CpG islands likely represent false-positives misclassified by PatMAn. Additionally, 104 CpG islands predicted to be MR by PatMAn were re-classified as MP by SUPER-PatMAn. Thus, the combinatorial contribution of DNA sequence features and PRC2 occupancy predicts a unique set of MP CpG islands that differs from those identified by either DNA sequence patterns or PRC2 binding alone.
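The set relationships described above follow directly from the reported counts, as the following bookkeeping sketch shows. The helper `overlap_summary` is a hypothetical name introduced for illustration; the total of 37,530 CpG islands is taken from the legend to Fig. 6.

```python
def overlap_summary(total, n_a, n_b, n_shared):
    """Summarize the overlap between two prediction sets from their sizes.

    total    : number of CpG islands scored genome-wide
    n_a, n_b : number of islands called MP by classifier A and by classifier B
    n_shared : number of islands called MP by both classifiers
    """
    return {
        "pct_a": round(100 * n_a / total, 1),
        "pct_b": round(100 * n_b / total, 1),
        "a_only": n_a - n_shared,          # MP by A, re-classified as MR by B
        "b_only": n_b - n_shared,          # MR by A, re-classified as MP by B
        "union": n_a + n_b - n_shared,     # MP by at least one classifier
    }


if __name__ == "__main__":
    # A = PatMAn, B = SUPER-PatMAn, using the counts reported in the text.
    print(overlap_summary(total=37530, n_a=1535, n_b=1232, n_shared=1128))
```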
Figure 6. Genome-wide comparison of PatMAn and SUPER-PatMAn predictions with PRC2 occupancy. PatMAn and SUPER-PatMAn were applied to all 37,530 CpG islands in the human genome. A, CpG islands classified as MP by PatMAn (green) or SUPER-PatMAn (blue) are indicated to the right of each chromosome. Red ticks, regions enriched for the PRC2 component SUZ12 in human embryonic stem cells (9). B, Venn diagram representing the overlap between CpG islands classified as MP by PatMAn and/or SUPER-PatMAn, and those bound by SUZ12. C, spatial analysis of the relationship between CpG island predictions and H3K27me3 ChIP-Seq data. All human CpG islands were aligned by their centers and the average number of H3K27me3 ChIP-Seq tags (500 bp centered moving average) was calculated extending out 15 kb in each direction. D, analysis of KEGG pathways significantly enriched among CpG islands predicted to be MP by PatMAn and SUPER-PatMAn using Ingenuity Pathways software. Dashed line, P = 0.05. E, comparison of observed/expected frequencies of functional terms significantly enriched among SUPER-PatMAn predictions using the PANTHER classification system (38).
As expected, a significant fraction (n = 471, 38.2%) of the CpG islands predicted to be MP by SUPER-PatMAn exhibited SUZ12 binding in human embryonic stem cells (χ² = 1,059, P < 0.00001; Fig. 6B) and were flanked by regions enriched in the polycomb-mediated H3K27me3 modification relative to predicted MR CpG islands in an independent data set of histone H3 modifications from CD4+ T cells (ref. 18; Fig. 6C). However, it should be noted that even in the absence of the additional SUZ12 occupancy feature, there was a highly significant association between CpG islands predicted to be MP by the sequence-based PatMAn and those bound by SUZ12 in human embryonic stem cells (n = 370, 24.1%; χ² = 386, P < 0.0001; Fig. 6B). Similarly, PatMAn-predicted MP CpG islands were surrounded by H3K27me3-enriched regions in CD4+ T cells relative to predicted MR CpG islands (Fig. 6C). This observation suggests that the seven DNA sequence patterns that define the PatMAn classifier capture information that is predictive not only of methylation, but also of polycomb binding.
Interestingly, the spatial analysis of H3K27me3 in CD4+ T cells showed that CpG islands predicted to be MP by either classifier were flanked by this modification when compared with predicted MR CpG islands. This relative enrichment of H3K27me3 seemed to be greatest at the edges of the CpG islands and spanned several kilobases in either direction (Fig. 6C). The relative depletion of H3K27me3 over the center of the CpG islands may be explained in part by the presence of a peak of acetylated H3K27 (Supplementary Fig. S2), as these two marks have been reported to be mutually exclusive (19). In contrast, when H3K4me3 or RNA polymerase II were similarly analyzed, no difference was observed between the CpG islands predicted to be MP or MR by either classifier, suggesting that this is not a general correlation with all chromatin-associated features, and that overall, there is little difference in transcriptional activity between the MP and MR classes (Supplementary Fig. S3).
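The spatial analysis described for Fig. 6C (islands aligned by their centers, tag counts smoothed with a 500 bp centered window out to 15 kb on either side) can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: it assumes island centers and tag positions are integer coordinates on a single chromosome, ignores strand, and reports the mean tag density per base pair in each window. The function name `average_tag_profile` and the `step` parameter are hypothetical.

```python
from bisect import bisect_left


def average_tag_profile(centers, tag_positions, flank=15000, window=500, step=100):
    """Mean tag density around aligned island centers.

    For each offset from -flank to +flank (in increments of `step`), count the
    tags falling in a `window`-bp interval centered on (center + offset),
    then average over islands and divide by the window width. The result is
    a centered moving average of per-base tag counts, averaged over islands.
    """
    tags = sorted(tag_positions)
    half = window // 2
    offsets = list(range(-flank, flank + 1, step))
    profile = []
    for off in offsets:
        total = 0
        for c in centers:
            lo = c + off - half
            hi = c + off + half
            # Number of tags with lo <= position < hi.
            total += bisect_left(tags, hi) - bisect_left(tags, lo)
        profile.append(total / (len(centers) * window))
    return offsets, profile


if __name__ == "__main__":
    # Toy coordinates, for illustration only.
    toy_centers = [1000, 5000]
    toy_tags = [1000, 1010, 2000, 4990, 5000]
    offs, prof = average_tag_profile(toy_centers, toy_tags,
                                     flank=1000, window=100, step=500)
    for o, p in zip(offs, prof):
        print(o, p)
```

Computing one such profile for predicted MP islands and another for predicted MR islands, then comparing the two curves, corresponds to the comparison described in the text.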
In order to further investigate the genes that may be affected by aberrant methylation of CpG islands predicted to be MP, CpG islands were assessed for proximity to RefSeq genes. Automated literature searches followed by manual confirmation showed that at least 100 genes known to be hypermethylated in cancer were predicted to be MP by SUPER-PatMAn, including CCND2, GATA4, GATA6, HIC-1, and TIMP3 (Supplementary Table S3). Furthermore, pathway analysis of genes predicted to be MP by PatMAn and/or SUPER-PatMAn revealed significant associations with components of the WNT, Notch, Hedgehog, cell cycle, and transforming growth factor-β pathways (Fig. 6D), many of which are known to be regulated by PRC2 (9, 24, 25) and are reported to be methylated in cancer (Supplementary Table S4). Molecular function analysis of SUPER-PatMAn predictions also revealed significant enrichment of homeobox genes and other DNA-binding proteins among the MP genes (Fig. 6E). In addition to being targets of PRC2, homeobox genes are frequently aberrantly methylated in human cancer (26). Indeed, 28 (60%) of the 47 homeobox genes predicted by SUPER-PatMAn to be MP were recently reported to be hypermethylated in lung cancer cells (26). Thus, the genes associated with CpG islands predicted to be MP by our classifiers constitute a unique fraction of the genome that is enriched for SUZ12 binding, developmental signaling pathways, and molecular functions related to DNA binding and transcriptional regulation.
Discussion
This study shows that aberrant de novo DNA methylation is in part dictated by the underlying sequence context of CpG islands and reveals a role for additional trans-acting chromatin regulators. We have used these features to develop two classifiers capable of predicting CpG island methylation susceptibility with high confidence. Although other methylation prediction tools have been developed, these have focused primarily on the methylation states of individual CpG dinucleotides or methylation of CpG islands in normal cells (27–30). Our PatMAn and SUPER-PatMAn classifiers represent some of the first computational approaches to identify CpG islands at increased risk of aberrant hypermethylation. Genome-wide application of these classifiers predicted 3% to 4% of CpG islands to be MP, including genes known to be methylated in cancer and many others that have not yet been reported to be methylated. Thus, our predicted MP CpG islands provide a rich resource for the identification of novel targets of aberrant methylation.
Interestingly, although none of the MP CpG islands from the training set encoded homeobox genes or were otherwise specifically selected for polycomb occupancy, there was still a striking relationship between the CpG islands predicted to be MP by PatMAn and PRC2 occupancy. Thus, the PatMAn classifier, and the seven DNA sequence patterns that define it, are not only predictive of aberrant methylation but also of PRC2 occupancy. At present, the mechanism by which PRC2 is directed to specific loci during mammalian development is largely unknown. In Drosophila, PRC2 is directed by its interaction with complex enhancer elements known as polycomb response elements (31). Such elements are several hundred base pairs in length, can act at great distances from the target gene, and do not conform to a particular consensus sequence. Rather, these elements have been functionally defined through the study of known polycomb target genes in flies. No such element has been identified in mammalian systems. Therefore, our studies may have uncovered sequence features contributing to the mammalian equivalent of a polycomb response element. In this regard, several of the DNA patterns identified by our computational model (GGAGNAAG, GAGANAAG, GAGGAGGNNG) resemble the consensus binding motifs for the ZESTE (YGAGYG) and GAF (GAGAG) transcription factors which are thought to act as polycomb recruiting factors in Drosophila (32).
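To make the notion of pattern frequency concrete, the sketch below counts occurrences of the three patterns named above in a DNA sequence, treating N as any base. It is an illustration under stated assumptions, not the PatMAn implementation: only the forward strand is scanned, overlapping matches are counted, and frequencies are normalized by sequence length. Whether the classifier used these conventions is not stated here, and the function names are hypothetical.

```python
import re

# IUPAC codes needed for the patterns discussed in the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}


def pattern_to_regex(pattern):
    """Compile an IUPAC pattern into a regex that finds overlapping matches."""
    body = "".join(IUPAC[base] for base in pattern.upper())
    # A zero-width lookahead lets finditer report overlapping occurrences.
    return re.compile("(?=" + body + ")")


def count_pattern(sequence, pattern):
    """Number of (possibly overlapping) forward-strand matches of pattern."""
    return sum(1 for _ in pattern_to_regex(pattern).finditer(sequence.upper()))


def pattern_frequencies(sequence, patterns):
    """Occurrences per base pair for each pattern (assumed normalization)."""
    length = len(sequence)
    return {p: count_pattern(sequence, p) / length for p in patterns}


if __name__ == "__main__":
    patterns = ["GGAGNAAG", "GAGANAAG", "GAGGAGGNNG"]
    # Toy sequence constructed to contain one instance of each pattern.
    toy = "TTGGAGCAAGTTGAGAGAAGGAGGAGGTTG"
    for name, freq in pattern_frequencies(toy, patterns).items():
        print(name, freq)
```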
Several recent studies, including ours, have shown a strong association between genes hypermethylated in human cancers and those targeted by PRC2 in embryonic stem cells (8, 10, 11). This finding suggests that PRC2 or the H3K27me3 mark imposed by this complex may predispose certain CpG islands to aberrant DNA methylation during tumorigenesis. However, the molecular mechanism linking the two processes is yet to be determined. Importantly, whereas genome-wide studies estimate that as many as 10% of genes are marked by PRC2 in embryonic cells (9), only a fraction of these are further targeted for de novo DNA methylation in cancer cells, suggesting that additional factors are involved. Consequently, the use of SUZ12 binding alone as an indicator of aberrant methylation would result in a high rate of false positives. Our classifiers, on the other hand, predict only a small subset (10.3–12.9%) of SUZ12-bound CpG islands as MP. For example, on chromosomes 21 and 22, only 11 (19.6%) of the 56 SUZ12-enriched CpG islands were predicted to be MP and, of the 9 tested in this study, all were hypermethylated in cancer cell lines. Conversely, none of the examined SUZ12-bound CpG islands that were predicted to be MR were hypermethylated. These data suggest that our classifiers combine DNA sequence and SUZ12 binding information to identify a subset of genes marked by PRC2 in embryonic cells that are more likely to be aberrantly methylated in cancer.
Despite the strong relationship between polycomb-mediated repression and methylation susceptibility, PRC2 binding alone cannot account for all CpG island methylation in human cancers. Indeed, similar to other studies (8, 10, 11), only half (9 of 18) of the CpG islands found to be hypermethylated in this study were bound by SUZ12 in embryonic stem cells. These findings imply the existence of PRC2-independent mechanisms in cancer-associated methylation. Only 24% to 38% of genes predicted to be MP by either classifier are SUZ12 targets, suggesting that the DNA sequence patterns (PatMAn) or combined DNA sequence and SUZ12-based (SUPER-PatMAn) signatures associated with our classifiers effectively identify many of these PRC2-independent events. Indeed, there were several CpG islands predicted to be MP by both classifiers that were hypermethylated in cancer cells that are SUZ12-negative (see Figs. 3 and 4). It is possible that the sequence patterns that define the PatMAn classifier pick up information that is reflective of sequence-specific DNA binding proteins that might target DNMTs to MP CpG islands. The best such example is the PML-RAR fusion protein which targets DNA methylation to its target genes (4). However, this is a rare oncogenic event and few similar cases have been reported. Alternatively, it is possible that the DNA patterns/chromatin signature are reflective of a particular secondary structure that is prone to de novo methylation by DNA methyltransferases. For example, Bock and colleagues (29) determined that the rise and roll of the DNA helix correlate with CpG island methylation status in normal lymphocytes. Additional work will be necessary to determine the contribution of PRC2-dependent and -independent mechanisms in CpG island methylation in human cancers.
Although it has been known for over a decade that trithorax group and polycomb group proteins have important roles in gene regulation, studies are only now beginning to reveal the considerable role that these complexes and the histone modifications (i.e., H3K4me3 and H3K27me3) they impart have on the establishment and fate of DNA methylation. Recent studies have identified bivalent domains that are marked simultaneously by both H3K4me2/3 and H3K27me3 in embryonic stem cells (33, 34). During differentiation, these domains resolve to be marked by either H3K4me3 or H3K27me3, and are either permissive or repressive for gene expression (34, 35). Those that resolve into H3K27me3-only domains are associated with increased DNA methylation during differentiation perhaps due to the ability of DNMT3L to bind an unmethylated H3K4 (34, 36). Conversely, those domains that lose H3K27me3 and retain H3K4me3 remain unmethylated (34). Similar changes may precipitate alterations in DNA methylation during human tumorigenesis. The dynamics of histone modification have not yet been thoroughly assessed genome-wide during tumorigenesis. Nevertheless, it is noteworthy that a subset of the predicted MP CpG islands reported in this study are flanked by H3K27me3, yet enriched for H3K4me3 within the island in normal CD4+ T cells. Thus, aberrant DNA methylation may be induced at these CpG islands through spreading of H3K27me3 into the island, perhaps stimulating H3K4 demethylation through the recently reported recruitment of the Rbp2 (JARID1A) H3K4me3 demethylase by PRC2 (37).
The PatMAn and SUPER-PatMAn classifiers were trained on methylation data derived from DNMT1-overexpressing fibroblasts, but nevertheless can predict with some accuracy CpG islands at increased risk of methylation in cancer cell lines and primary tumors, and across cancer types. The sequence signatures associated with these classifiers thus likely reflect features that are common to CpG island methylation in multiple settings. Previous studies have shown that human tumors exhibit both shared and tumor type–specific methylation profiles (1). In this regard, current efforts are focused on the development of tumor type–specific classifiers based on methylation data from primary tumors which may uncover novel features reflecting the contribution of tissue type–specific factors to aberrant methylation.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).
Acknowledgments
Grant support: National Cancer Institute grants CA077337 and CA116676 (P.M. Vertino), funds from the National Science Foundation and NIH grant U54 RR 024380-01 (E.K. Lee), and an American Cancer Society grant PF-07-130-01-MGO and Frederick Gardner Cottrell Postdoctoral Fellowship (M.T. McCabe). P.M. Vertino is a Georgia Cancer Coalition Distinguished Cancer Scholar.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The authors thank Drs. Joseph Costello, Christoph Plass, Martin Brena, and Dominic Smiraglia for sharing sequence information for methylation events identified by RLGS, Dr. Paul Wade for his thoughtful critique of the manuscript, and Dr. Colleen McCabe for bioinformatics advice and support. We thank the Emory University Histology Core for technical assistance.