Background: DNA methylation microarrays have become an increasingly popular means of studying the role of epigenetics in cancer, although the methods used to analyze these arrays are still being developed and existing methods are not always widely disseminated among microarray users.

Methods: We investigated two problems likely to confront DNA methylation microarray users: (i) batch effects and (ii) the use of widely available pathway analysis software to analyze results. First, DNA taken from individuals exposed to low and high levels of drinking water arsenic were plated twice on Illumina's Infinium 450 K HumanMethylation Array, once in order of exposure and again following randomization. Second, we conducted simulations in which random CpG sites were drawn from the 450 K array and subjected to pathway analysis using Ingenuity's IPA software.

Results: The majority of differentially methylated CpG sites identified in Run One were due to batch effects; few sites were also identified in Run Two. In addition, the pathway analysis software reported many significant associations between our data, randomly drawn from the 450 K array, and various diseases and biological functions.

Conclusions: These analyses illustrate the pitfalls of not properly controlling for chip-specific batch effects as well as using pathway analysis software created for gene expression arrays to analyze DNA methylation array data.

Impact: We present evidence that (i) chip-specific effects can simulate plausible differential methylation results and (ii) popular pathway analysis software developed for expression arrays can yield spurious results when used in tandem with methylation microarrays. Cancer Epidemiol Biomarkers Prev; 22(6); 1052–60. ©2013 AACR.

DNA methylation arrays have become a popular tool in cancer research (1–4). The use of arrays and the popularity of epigenetic studies have naturally attracted many investigators to this relatively new technology. Methylation array studies are particularly prone to several problems that, while well known to researchers who specialize in the analysis of array data, are often not appreciated by investigators who are less familiar with array analysis.

It has been understood for some time that batch effects represent a significant problem in array analysis that must be both avoided, as much as possible, and accounted for (5–9). For example, it was found that confounding by batch effects along with several other problems accounted for an association reported between gene expression profiles and response to ovarian cancer treatment (10); subsequently, the majority of authors of the original article (11) published a retraction. Similarly, another team of researchers found that a surprisingly high estimate of differentially expressed genes between individuals with European and Yoruban ancestry (12) may have been influenced by confounding by the year in which samples were processed (13). Finally, re-analysis of a study of gene expression in Plasmodium falciparum (14), the malaria parasite, found that confounding by batch complicated the evaluation of transcriptional diversity (15). Because many array studies are biomedical in nature, the findings often have important practical implications. Therefore, understanding and controlling for batch effects requires particular attention.

Thus far, attention has focused on relatively large differences in batch, such as differences in the laboratory in which samples are run, differences in the day of run, or differences in sample preparation methods. The implications of batch effects due to more subtle differences, such as the location of samples on different chips run concurrently, may not be sufficiently appreciated. Here, we use actual array data to illustrate how chip-specific effects can simulate plausible and interesting differential methylation results.

Moreover, because software programs used in the study of gene expression arrays are widely available and user-friendly, many of these programs have been “borrowed” for applications involving DNA methylation array data. In some cases, however, this is problematic. We present evidence that popular pathway analysis software, developed for expression array results but commonly used for methylation arrays as well (see, for example, refs. 16, 17–22), can yield spurious results.

Our hope is that this pair of illustrations will help investigators using DNA methylation arrays to avoid two common problems that could yield biased results.

450 K array study

Sample collection and DNA extraction.

Peripheral blood mononuclear cells (PBMCs) were obtained from a randomized, controlled trial intended to look at the effect of folate supplementation on arsenic (As) metabolism. This trial was approved by both the Bangladesh Medical Research Council (BMRC) and the Institutional Review Board at Columbia University Medical Center. Blood was collected in EDTA-containing tubes at our field clinic in Araihazar. Plasma was separated from roughly 8 mL of blood, and the remainder was diluted with PBS to 15 mL. Subsequently, 7.5 mL of the diluted blood was gently layered atop 4 mL of Ficoll. The tube containing blood and Ficoll was centrifuged at 400 × g for 30 minutes, and the mononuclear layer along with half of the underlying Ficoll was transferred to a new tube. PBS was added in equal volume, and the contents were centrifuged for 10 minutes at 200 × g. The cells were washed once more with PBS, then the supernatant was removed and the cell pellet was resuspended in 3.5 mL 5 PRIME ArchivePure DNA Cell Lysis solution, after which proteinase K was added. The cells were incubated at room temperature for 20 minutes, then the samples were stored at −80°C. Samples were transported on dry ice, first to Dhaka by car, then to Columbia University for analysis.

After arriving at Columbia, 48 samples were chosen from the PBMCs taken during the baseline visit based on membership in either a “low” or “high” As exposure group, members of the first having water As concentrations between 50 and 100 μg/L and members of the second having water As concentrations over 100 μg/L. DNA was extracted from the PBMCs using the 5 PRIME ArchivePure DNA Blood Kit. DNA was visualized on gels to confirm quality and quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Life Technologies).

The PBMC DNA samples were run twice on Illumina Infinium 450 K HumanMethylation Arrays at Roswell Park Cancer Institute. Each run was identical except for the order in which samples were plated. In the first run, Run One, “low” As group samples were plated, then “high” As group samples were plated (Fig. 1). In the second run, Run Two, samples from the “low” and “high” As groups were randomly assigned to positions on the plates (Fig. 1).

Figure 1.

Plating scheme: Run One (samples plated by exposure group) versus Run Two (samples randomized on chips). Samples are labeled by the same number in both runs.

Figure 1.

Plating scheme: Run One (samples plated by exposure group) versus Run Two (samples randomized on chips). Samples are labeled by the same number in both runs.

Close modal

Statistical analysis.

The β values, or the percentage of CpGs at a given site that were methylated, were calculated for every sample at each CpG site. Observations with detection P > 0.05 were set to missing, and any CpG site with missing data was omitted from the analysis. The R program ComBat (8), included in the SVA package (23), was used to adjust for the chip on which samples were run. This method uses a parametric empirical Bayes framework to adjust data for batch effects; it is a particularly robust method for dealing with small samples sizes. To determine whether ComBat effectively removed batch effects, principal component analysis (PCA) was used to determine the top 5 principal components (PCs) present in the pre- and post-ComBat β values for Runs One and Two. Afterward, the association between each PC and (i) the chip on which samples were run and (ii) “low” or “high” As group was determined using linear regression. In addition, Spearman's correlation coefficients between β values in Runs One and Two were calculated for each CpG site, pre- and post-ComBat; we also calculated pre-Combat, between-run correlations stratified by Run One chip.

To investigate whether As exposure status was associated with differential DNA methylation in Runs One or Two, β values were transformed into M values (log2(β/(1 − β)) before statistical analysis (24). SVA was used to calculate F-statistics and their associated P values by comparing two nested linear models. The larger model contained As group, coded as a factor variable, as well as any other variables of interest, whereas the smaller model contained only an intercept and the non-As variables of interest. Q values were then calculated, also in SVA. In this way, the number of differentially methylated sites (q < 0.05) for Run One and Two were determined, both before and after ComBat was used to adjust for the chips on which samples were run. A clustering analysis was conducted on the Run One data, using pre-ComBat β values and the heatmap function in R. The top 100 differentially methylated sites, determined by q value, were included in the analysis. Color coding for both exposure group and chip was implemented. The same process was conducted for Run Two data, after adjusting models for chip, using ComBat, as well as two potential confounders: betelnut use and land ownership.

Pathway analysis simulations

One hundred simulations were conducted, in which random CpG sites from the 450 K array were chosen and entered into Ingenuity's IPA software. For each simulation, 100 CpG sites were randomly drawn from those included on the 450 K array. To randomize the choice of sites, all CpGs on the chip were assigned a random number between 0 and 1 and the 100 sites assigned the smallest values were chosen. The genes associated with these random sites were determined, and the gene lists were then uploaded into the IPA software. The Ingenuity Knowledge Base used in this analysis consisted of genes only, and both direct and indirect relationships were considered. “Diseases and Disorders,” “Molecular and Cell Function,” “Physiological System Development and Function,” and “Top Canonical Pathway” results linked to a given dataset at P < 0.05 were tallied.

Deconstructing Illumina's 450 K HumanMethylation Array

The number of CpG sites associated with each refgene accession number (i.e., gene transcript) included on the 450 K array was determined using the data file generated by the GenomeStudio software for Run One. Refgene accession number duplicates in each row of the data file were removed and then all remaining accession numbers were tallied, yielding the number of probes associated with each transcript covered by the chip. Refgene accession numbers were divided into 12 “bins”: these consisted of transcripts that had (i) <10; (ii) 11–20; (iii) 21–30; (iv) 31–40; (v) 41–50; (vi) 51–60; (vii) 61–70; (viii) 71–80; (ix) 81–90; (x) 91–100; (xi) 101–200; or (xii) 201–1,300 CpG sites included on the array. The distribution of transcripts in the histogram was then checked using the data contained in the GenomeStudio data file generated for Run Two.

For each bin, 500 refgene accession numbers at a time were uploaded into the IPA software, representing one set. The number of sets included in each bin varied from 1 to 29, according to the number of transcripts it included. Each set was loaded into a “My Pathway” analysis, and the “overlay functions and diseases” tool was used to determine the number of genes within it that were associated with (i) cancer, (ii) connective tissue disorders, and (iii) developmental disorders.

The median number of CpG sites included in each set was determined. This number was transformed using the natural logarithm, and a square root transformation of the percentage of genes in each set associated with a given disease/disorder was also performed. The Pearson correlation coefficient between the log-transformed median number of associated CpG sites in each set and the square root–transformed percentage of cancer, connective tissue disorder, and developmental disorder–associated genes was calculated.

Batch effects in Illumina 450 K methylation array analysis

PCA showed that ComBat effectively removed batch effects in Run Two, in which samples were randomized (Fig. 2). Before using ComBat, the first PC in Run Two was significantly associated with the chip on which samples were run (P < 0.001). After using ComBat, the first and third PCs were significantly associated with As exposure group (P < 0.05), and batch effects were not associated with any of the first 5 PCs. By contrast, in nonrandomized Run One, both chip and exposure group were significantly associated with PC 2 before using ComBat (P < 0.05), and both remained significantly associated with PC 3 after using ComBat (P < 0.001).

Figure 2.

Principal Component Analysis shows that ComBat effectively removed batch effects from randomized Run Two but not from nonrandomized Run One. Before ComBat, the first PC in Run Two was significantly associated with the chip on which samples were run. After using ComBat, the first and third PCs were significantly associated with exposure group, and batch effects were not associated with any of the major PCs. The percent variance explained by each PC is indicated.

Figure 2.

Principal Component Analysis shows that ComBat effectively removed batch effects from randomized Run Two but not from nonrandomized Run One. Before ComBat, the first PC in Run Two was significantly associated with the chip on which samples were run. After using ComBat, the first and third PCs were significantly associated with exposure group, and batch effects were not associated with any of the major PCs. The percent variance explained by each PC is indicated.

Close modal

Before correction for batch effects, the results from Run One included 1,203 CpG sites that were differentially methylated according to As exposure group (Table 1). A clustering analysis showed that by using the top 100 differentially methylated sites by q value, all but two of the study participants could be assigned to the correct exposure group (Fig. 3). However, the same analysis showed that samples clustered by the chip on which they were run (Fig. 3). The two samples which could not be assigned to the correct exposure group, 25 and 26, included a low-As sample that was run on a chip that otherwise contained all high-As samples and a high-As sample run on a chip that contained predominately low-As samples (Fig. 1).

Figure 3.

A seemingly strong association between exposure group and DNA methylation patterns in Run One (in which samples were plated by exposure group) appears to be due to batch effects. A clustering analysis was conducted in which 48 samples were grouped on the basis of methylation patterns observed in the top 100 differentially methylated CpG sites (as determined by q value). In Run One: (A) all but 2 individuals (samples 25 and 26) could be correctly assigned to the correct exposure group on the basis of β values at these sites, but (B) there was a strong relationship between the chip on which a sample was run and the other samples with which it clustered. Later analysis showed that the two misclassified samples in (A) were run on chips containing predominantly samples from the other exposure group. By contrast, in Run Two, in which the same samples were plated after being randomized and the models used to determine significance were adjusted for batch effects as well as several possible confounders: (C) 41 of 48 samples were correctly assigned to the correct exposure group, and (D) no relationship between the chip on which a sample was run and the clustering pattern was evident.

Figure 3.

A seemingly strong association between exposure group and DNA methylation patterns in Run One (in which samples were plated by exposure group) appears to be due to batch effects. A clustering analysis was conducted in which 48 samples were grouped on the basis of methylation patterns observed in the top 100 differentially methylated CpG sites (as determined by q value). In Run One: (A) all but 2 individuals (samples 25 and 26) could be correctly assigned to the correct exposure group on the basis of β values at these sites, but (B) there was a strong relationship between the chip on which a sample was run and the other samples with which it clustered. Later analysis showed that the two misclassified samples in (A) were run on chips containing predominantly samples from the other exposure group. By contrast, in Run Two, in which the same samples were plated after being randomized and the models used to determine significance were adjusted for batch effects as well as several possible confounders: (C) 41 of 48 samples were correctly assigned to the correct exposure group, and (D) no relationship between the chip on which a sample was run and the clustering pattern was evident.

Close modal
Table 1.

Number of differentially methylated CpG sites (q < 0.05) in samples run on the Illumina 450 K HumanMethylation Array two times: first, plated by exposure status, and then with samples randomized to chip

Run One (nonrandomized)Run Two (randomized)
Before adjusting for chip 1,203 
After adjusting for chipa 24,184 187 
Number of sites identified in both runs one and two, after adjusting for chip 25 
Run One (nonrandomized)Run Two (randomized)
Before adjusting for chip 1,203 
After adjusting for chipa 24,184 187 
Number of sites identified in both runs one and two, after adjusting for chip 25 

aThe program ComBat was used to adjust for the chip on which samples were run (8).

After adjusting for chip using the ComBat program, the results from Run One included more than 24,000 CpG sites that were differentially methylated according to As exposure group (Table 1). However, when the same DNA samples were re-analyzed after being plated in randomly assigned order on new 450 K array chips (Run Two), only 449 differentially methylated sites were identified (Table 1). Only 25 differentially methylated sites overlapped between the batch-adjusted results from both Run One and Run Two (Table 1). They are associated with a number of genes involved in cancer pathogenesis, as well as DNA/RNA binding and the production of oxidative stress (data not shown); in the future, we plan to verify these differentially methylated sites using NextGen bisulfite sequencing.

The pre-ComBat Spearman's correlations between β values from Runs One and Two were poor overall (mean r = 0.30) with a bimodal distribution, the larger peak centered around zero (Supplementary Fig. S1). The correlation between runs did not improve after using ComBat and remained biomodal (mean r = 0.29; Supplementary Fig. S1). The peak around zero was not present when correlations between samples run on a single chip in Run One and the corresponding samples in Run Two were calculated (Supplementary Fig. S2); it emerged only when correlations were calculated across chips in Run One (Supplementary Fig. S1) and reflected the prominence of batch effects in this nonrandomized run.

Pathway analysis simulations

In this study, we ran 100 simulations, each consisting of all of the genes linked to 100 CpG sites drawn randomly from those occurring on the 450 K chip. Each list of genes was entered into the IPA software to determine whether it was significantly associated with any diseases or biological functions in the IPA database. In our trials, the software assigned each of the simulations a results profile that contained five significantly associated (P < 0.05) “Diseases and Disorders,” “Molecular and Cellular Functions,” and “Physiological System Development and Functions,” as well as a varying number of significantly linked “Top Canonical Pathways.” Cancer (71% of simulations) was the most common disease/disorder to be reported as statistically associated with the gene lists, followed by developmental disorders (50% of simulations), and connective tissue disorders (37% of simulations; Fig. 4). The most common molecular/cellular functions were cellular development (55% of simulations), cell morphology (52% of simulations), and cellular assembly and organization (45% of simulations; Fig. 4). In terms of physiological system development and function, embryonic development (65% of simulations), tissue development (41% of simulations), and nervous system development and function (37% of simulations) were most frequently encountered (Fig. 4). Finally, more than 200 canonical pathways were significantly linked to the simulation results, although no single pathway was linked to more than 6% of the simulations. Complete tables of the significant results from these simulations can be found in Supplementary Tables S1 to S4.

Figure 4.

The results of 100 simulations in which 100 random CpG sites were drawn from Illumina's 450 K array and the associated genes were entered into Ingenuity's IPA software for Pathway Analysis. The 10 most frequent (A) diseases and disorders, (B) molecular and cellular functions, and (C) physiological system development and function results that were significantly associated with the random simulation data (P < 0.05) are presented here. Significant links to cancer, developmental disorder, cellular development, cell morphology, and embryonic development were found in half or more of the simulations.

Figure 4.

The results of 100 simulations in which 100 random CpG sites were drawn from Illumina's 450 K array and the associated genes were entered into Ingenuity's IPA software for Pathway Analysis. The 10 most frequent (A) diseases and disorders, (B) molecular and cellular functions, and (C) physiological system development and function results that were significantly associated with the random simulation data (P < 0.05) are presented here. Significant links to cancer, developmental disorder, cellular development, cell morphology, and embryonic development were found in half or more of the simulations.

Close modal

Deconstructing Illumina's 450 K HumanMethylation Array

To identify the reason why CpG sites chosen at random from the chip would yield such results, we first examined the distribution of transcript-associated CpG sites on the 450 K chip (Fig. 5). We found that the number of sites associated with a given transcript varied from 1 to 1,299. In addition, we found that transcripts associated with genes of particular interest to cancer researchers, such as TP53 or KRAS, were disproportionally represented on the chip, possessing a greater number of associated sites than the Illumina-reported average of 17 (25). We also examined the relationship between the number of CpG sites associated with a given gene and the likelihood that the gene would be associated with some of the most common diseases/disorders in our IPA results. To do this, we examined the correlation between the median number of gene-associated CpG sites and the percentage of genes associated with a given disease/disorder in 69 “sets” of 500 transcripts or less, which incorporated every refgene sequence represented on the 450 K chip. We found that the median number of CpG sites in a set was highly correlated with the percentage of genes within that set that IPA linked to cancer (P < 0.0001), connective tissue disorders (P = 0.005), and developmental disorders (P < 0.0001; Fig. 6).

Figure 5.

Distribution of transcript-associated CpG sites on Illumina's 450 K HumanMethylation Array. While some transcripts were associated with only a single CpG site included on the array, others were associated with many more. For example, some cancer-linked genes such as TP53 and KRAS have more than 30 associated CpG sites on the chip, whereas another gene, PTPRN2, is associated with 1,299 CpG sites.

Figure 5.

Distribution of transcript-associated CpG sites on Illumina's 450 K HumanMethylation Array. While some transcripts were associated with only a single CpG site included on the array, others were associated with many more. For example, some cancer-linked genes such as TP53 and KRAS have more than 30 associated CpG sites on the chip, whereas another gene, PTPRN2, is associated with 1,299 CpG sites.

Close modal
Figure 6.

Genes with more CpG probes on Illumina's 450 K HumanMethylation Array were more likely to be associated with disease in Ingenuity's IPA software. Here, we present the correlation between the median number of CpG sites in a “set” of transcripts (see Materials and Methods) and the percent of genes associated with (A) cancer, (B) connective tissue disorders, and (C) developmental disorders, which represent the disease categories most commonly linked to our data in 100 random-draw trials. Each dot in the graph represents a set of 500 or fewer transcripts.

Figure 6.

Genes with more CpG probes on Illumina's 450 K HumanMethylation Array were more likely to be associated with disease in Ingenuity's IPA software. Here, we present the correlation between the median number of CpG sites in a “set” of transcripts (see Materials and Methods) and the percent of genes associated with (A) cancer, (B) connective tissue disorders, and (C) developmental disorders, which represent the disease categories most commonly linked to our data in 100 random-draw trials. Each dot in the graph represents a set of 500 or fewer transcripts.

Close modal

In this article, we provided illustrations of two problems that researchers studying gene-specific methylation frequently encounter: batch effects in DNA methylation arrays and difficulties associated with applying pathway analysis software that was originally created to work with gene expression arrays to DNA methylation arrays.

Although an entire book has been written about dealing with batch effects in array data (26), investigators new to the field often underestimate this challenge, which may have serious ramifications upon their research. While it may be clear to some that running samples in different laboratories or preparing them in different ways invites batch effects, the effect of simply plating samples on separate chips run at the same time may not be adequately appreciated (Fig. 3). Some researchers may accidentally confound exposure group and batch, as in the example presented here; if they are not aware of the power of batch effects, they may publish spurious findings with real health implications (See Fig. 3 and Run One Results in Table 1). Especially when samples are being run by a commercial or partner laboratory, it is crucial for the researcher supplying the samples to assign them run IDs that mask their identity and to specify the order in which they should be plated. If samples are not properly randomized across chips and batch and the phenotype of interest are confounded, even a program that is very effective at removing batch effects, such as ComBat, will not be able to salvage the data (Fig. 2, Supplementary Figs. S1 and S2). Conversely, other researchers may properly randomize their samples but fail to appreciate the potential of batch effects to swamp the signal in which they are interested. If this is the case, they may find few significant results in their data and abandon a project, not realizing that by simply controlling for a factor such as the chip on which samples were run, they could unmask real differences between the 2 groups they are comparing (See Run Two Results in Table 1 and Fig. 2). It is our hope that the examples presented in this article, based on real data, will provide convincing evidence that carefully adjusting for batch effects is a necessary part of methylation array analysis. Batch effects are likely to be particularly problematic when researchers anticipate that any observed differences in methylation will be subtle, as when low-level exposures are studied in healthy people. In these situations, the signals of interest are more easily masked by batch effects.

Unlike the perils of batch effects, which have been appreciated for some time, to the best of our knowledge, no one has discussed publicly the problems associated with using commercial pathway analysis software to analyze DNA methylation array data. This is such a common practice in the literature that some reviewers may request that it be done, even though these software packages were created for use with gene expression arrays. Here, we show that when used with Illumina HumanMethylation 450 K array data, the IPA software consistently results in spurious, statistically significant links to various diseases, disorders, and biological processes such as cancer, developmental disorders, and cellular and embryonic development. These problems appear to stem from the failure of commercially available pathway analysis software to account for the differential representation of genes on an array. That is, some genes are associated with many more CpG probes on the array than others (Fig. 5), and genes that are interesting to researchers due to their links to disease are disproportionately represented on the chip (Fig. 6). These problems do not appear to be limited to Illumina's 450 K array or Ingenuity's IPA software. For example, Affymetrix states of their popular GeneChip Human Promoter 1.0R Array, “for over 1,300 cancer-associated genes, coverage of promoter regions was expanded to include additional genomic content” (27). Similarly, the commonly used pathway analysis program DAVID (28) allows users to specify the expression array used but does not provide this option for methylation arrays, resulting in an inability to control for the number of probes associated with each gene. Unless a pathway analysis package has been modified for use with gene methylation arrays, so that differential representation of genes is considered when calculating the observed versus expected ratios of genes associated with diseases or gene networks, it is unlikely that the program will yield reliable results.

In this article, we highlighted two problems associated with DNA methylation array analysis. However, there are additional methodologic challenges that should be noted. First, although we demonstrated that pathway analysis can yield spurious associations when genes are differentially represented on an array, it is important to highlight another potential limitation inherent to this approach. Pathway analysis works by detecting the statistical overrepresentation of genes belonging to a certain pathway in a dataset. However, hypermethylation in the promoter region of a single upstream regulator could shut down an entire biologic pathway without being detected by pathway analysis. By concentrating on pathway analysis results, single-gene alterations in methylation associated with important biological changes could be missed. Second, many methylation studies are carried out on PBMCs or peripheral blood leukocyte (PBL) samples. Such samples are composed of a variety of white cell types, which have distinct DNA methylation patterns (29). If a factor being studied influences the distribution of these cell types, alterations in DNA methylation due to a change in cell-type composition may be mistakenly attributed to the direct effects of the factor of interest. To avoid this problem, cell-type distribution may be determined at the time of the blood draw via a differential cell count. Alternately, a computational method of estimating the cell mixture distribution in a sample has been developed by Houseman and colleagues (29), and recently, it was used to determine differentially methylated genes in the blood of patients with rheumatoid arthritis (30). Third, single-nucleotide polymorphisms (SNPs) in the sites targeted by an array's probes can prove problematic. Because the Illumina array quantifies the level of methylation at each target site, a polymorphism can result in a decline in methylation levels that is not due to fewer unmethylated CpG sites (31). Thus, the issues associated with population stratification that complicate genome-wide association studies could also complicate DNA methylation analyses. Finally, the best method of controlling for multiple comparisons when analyzing array data varies based on the type of study, and under certain circumstances, it may be helpful to reduce the dimensionality of array data before analyzing association with a phenotype of interest (32). All of these considerations make methylation array data analysis a challenging endeavor.

Investigation of the role of DNA methylation in disease holds much promise, and many scientists are currently exploring epigenetic influences on a multitude of outcomes. Using a study that exploits the availability of repeated measures and a set of in silico experiments, we have provided illustrations of (i) the insidious nature of batch effects and (ii) problems encountered when using commercially available gene expression pathway analysis software for the analysis of DNA methylation array data. Collectively, these analyses highlight two common pitfalls associated with the analysis and interpretation of DNA microarray data.

No potential conflicts of interest were disclosed.

Conception and design: K.N. Harper, M.V. Gamble

Development of methodology: K.N. Harper, B.A. Peters

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): K.N. Harper, B.A. Peters, M.V. Gamble

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): K.N. Harper, M.V. Gamble

Writing, review, and/or revision of the manuscript: K.N. Harper, B.A. Peters, M.V. Gamble

Study supervision: M.V. Gamble

The authors thank Jeff Conroy, Shuang Wang, Hokeun Sun, Julie Herbstman, and Megan Hall for their help; Greg Gibson and John Storey for their summer course on working with array data; the members of the Epigenetics Working Group at CUMC; and two anonymous reviewers whose comments helped improve this manuscript.

This work was supported by funding from a Robert Wood Johnson Health & Society Scholars seed grant and Columbia University's Cancer Epidemiology Training Program (5-R25-CA 094061; to K.N. Harper) and NIH grants R01CA133595, R01ESO17875, and P42ES10349 (to M.V. Gamble).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1.
Marsit
C
,
Koestler
D
,
Christensen
B
,
Karagas
M
,
Houseman
E
,
Kelsey
K
. 
DNA methylation array analysis identifies profiles of blood-derived DNA methylation associated with bladder cancer
.
J Clin Oncol
2011
;
29
:
1133
9
.
2.
Teschendorff
A
,
Menon
U
,
Gentry-Maharaj
A
,
Ramus
S
,
Weisenberger
D
,
Shen
H
, et al
Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer
.
Genome Res
2010
;
20
:
440
6
.
3.
Hansen
K
,
Timp
W
,
Bravo
H
,
Sabunciyan
S
,
Langmead
B
,
McDonald
O
, et al
Increased methylation variation in epigenetic domains across cancer types
.
Nat Genet
2011
;
43
:
768
75
.
4.
Breitling
L
,
Yang
R
,
Korn
B
,
Burwinkel
B
,
Brenner
H
. 
Tobacco-smoking-related differential DNA methylation: 27K discovery and replication
.
Am J Hum Genet
2011
;
88
:
450
7
.
5.
Leek
J
,
Scharpf
R
,
Bravo
H
,
Simcha
D
,
Langmead
B
,
Johnson
W
, et al
Tackling the widespread and critical impact of batch effects in high-throughput data
.
Nat Rev Genet
2010
;
11
:
733
9
.
6.
Sun
Z
,
Chai
H
,
Wu
Y
,
White
W
,
Donkena
K
,
Klein
C
, et al
Batch effect correction for genome-wide methylation data with Illumina Infinium platform
.
BMC Med Genomics
2011
;
4
:
84
.
7.
Wang
D
,
Zhang
Y
,
Huang
Y
,
Li
P
,
Wang
M
,
Wu
R
, et al
Comparison of different normalization assumptions for analyses of DNA methylation data from the cancer genome
.
Gene
2012
;
506
:
36
42
.
8.
Johnson
W
,
Rabinovic
A
,
Li
C
. 
Adjusting batch effects in microarray expression data using empirical Bayes methods
.
Biostatistics
2006
;
8
:
118
27
.
9.
Yan
L
,
Ma
C
,
Wang
D
,
Hu
Q
,
Qin
M
,
Conroy
J
, et al
OSAT: a tool for sample-to-batch allocations in genomics experiments
.
BMC Genomics
2012
;
13
:
689
.
10.
Baggerly
K
,
Coombes
K
,
Neeley
E
. 
Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer
.
J Clin Oncol
2008
;
26
:
1186
7
.
11.
Dressman
H
,
Berchuck
A
,
Chan
G
,
Zhai
J
,
Bild
A
,
Sayer
R
, et al
An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer
.
J Clin Oncol
2007
;
25
:
517
25
.
12.
Spielman
R
,
Bastone
L
,
Burdick
J
,
Morley
M
,
Ewens
W
,
Cheung
V
. 
Common genetic variants account for differences in gene expression among ethnic groups
.
Nat Genet
2007
;
39
:
226
31
.
13.
Akey
J
,
Biswas
S
,
Leek
J
,
Storey
J
. 
On the design and analysis of gene expression studies in human populations
.
Nat Genet
2007
;
39
:
807
8
.
14.
Daily
J
,
Scanfeld
D
,
Pochet
N
,
Roch
KL
,
Plouffe
D
,
Kamal
M
, et al
Distinct physiological states of Plasmodium falciparum in malaria-infected patients
.
Nature
2007
;
450
:
1091
5
.
15.
Lemieux
J
,
Feller
A
,
Holmes
C
,
Newbold
C
. 
Reply to Wirth et al.: in vivo profiles show continuous variation between two cellular populations
.
PNAS
2009
;
106
:
E71
E2
.
16.
Einstein
F
,
Thompson
R
,
Bhagat
T
,
Fazzari
M
,
Verma
A
,
Barzilai
N
, et al
Cytosine methylation dysregulation in neonates following intrauterine growth restriction
.
PLoS ONE
2010
;
5
:
e8887
.
17.
Kim
S
,
Kelly
W
,
Fu
A
,
Haines
K
,
Hoffman
A
,
Zheng
T
, et al
Genome-wide methylation analysis identifies involvement of TNF-alpha mediated cancer pathways in prostate cancer
.
Cancer Lett
2011
;
302
:
47
53
.
18.
Sadikovic
B
,
Yoshimoto
M
,
Al-Romaih
K
,
Maire
G
,
Zielenska
M
,
Squire
J
. 
In vitro analysis of integrated global high-resolution DNA Methylation profiling with genomic imbalance and gene expression in osteosarcoma
.
PLoS One
2008
;
3
:
e2834
.
19.
Thompson
R
,
Atzmon
G
,
Gheorghe
C
,
Liang
H
,
Lowes
C
,
Greally
J
, et al
Tissue-specific dysregulation of DNA methylation in aging
.
Aging Cell
2010
;
9
:
506
18
.
20.
Zhu
Y
,
Stevens
R
,
Hoffman
A
,
Tjonneland
A
,
Vogel
U
,
Zheng
T
, et al
Epigenetic impact of long-term shiftwork: pilot evidence from circadian genes and whole-genome methylation analysis
.
Chronobiol Int
2011
;
28
:
852
61
.
21.
Lokk
K
,
Vooder
T
,
Kolde
R
,
Välk
K
,
Võsa
U
,
Roosipuu
R
, et al
Methylation markers of early-stage non-small cell lung cancer
.
PLoS ONE
2012
;
7
:
e39813
.
22.
Novakovic
B
,
Yuen
R
,
Gordon
L
,
Penaherrera
M
,
Sharkey
A
,
Moffett
A
, et al
Evidence for widespread changes in promoter methylation profile in human placenta in response to increasing gestational age and environmental/stochastic factors
.
BMC Genomics
2011
;
12
:
529
.
23.
Leek
J
,
Johnson
W
,
Parker
H
,
Jaffe
A
,
Storey
J
. 
SVA: surrogate variable analysis. R package version 3.2.1
. 
2012
.
24.
Du
P
,
Zhang
X
,
Huang
C
,
Jafari
N
,
Kibbe
W
,
Hou
L
, et al
Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis
.
BMC Bioinformatics
2010
;
11
:
587
.
25.
Illumina
. 
Infinium HumanMethylation450 BeadChip Kit
. 
2012
;
Available from
: http://www.illumina.com/products/methylation_450_beadchip_kits.ilmn
26.
Scherer
A
. 
Batch effects and noise in microarray experiments: sources and solutions
.
Hoboken, NJ
:
John Wiley & Sons
; 
2009
.
27.
Affymetrix
. 
GeneChip Human Promoter 1.0R Array
.
[cited 2013 Mar 18]; Available from
: http://www.affymetrix.com/estore/browse/products.jsp?productId=131461#1_1.
28.
Huang
D
,
Sherman
B
,
Lempicki
R
. 
Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources
.
Nat Protoc
2009
;
4
:
44
57
.
29.
Houseman
E
,
Accomando
W
,
Koestler
D
,
Christensen
B
,
Marsit
C
,
Nelson
H
, et al
DNA methylation arrays as surrogate measure of cell mixture distribution
.
BMC Bioinformatics
2012
;
13
:
86
.
30.
Liu
Y
,
Aryee
M
,
Padyukov
L
,
Fallin
M
,
Hesselberg
E
,
Runarsson
A
, et al
Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis
.
Nat Biotechnol
2013
;
31
:
142
7
.
31.
Chen
Y
,
Choufani
S
,
Ferreira
J
,
Grafodatskaya
D
,
Butcher
D
,
Weksberg
R
. 
Sequence overlap between autosomal and sex-linked probes on the Illumina HumanMethylation27 microarray
.
Genomics
2011
;
97
:
214
22
.
32.
Houseman
E
. 
Biostatisical methods in epigenetic epidemiology
. In:
Michels
K
,
editor
. 
Epigenetic epidemiology
.
New York, NY
:
Springer Verlag
; 
2012
.
p.
57
76
.