## Abstract

**Background:** Metabolite levels within an individual vary over time. This within-individual variability, coupled with technical variability, reduces the power for epidemiologic studies to detect associations with disease. Here, the authors assess the variability of a large subset of metabolites and evaluate the implications for epidemiologic studies.

**Methods:** Using liquid chromatography/mass spectrometry (LC/MS) and gas chromatography-mass spectroscopy (GC/MS) platforms, 385 metabolites were measured in 60 women at baseline and year-one of the Shanghai Physical Activity Study, and observed patterns were confirmed in the Prostate, Lung, Colorectal, and Ovarian Cancer Screening study.

**Results:** Although the authors found high technical reliability (median intraclass correlation = 0.8), reliability over time within an individual was low. Taken together, variability in the assay and variability within the individual accounted for the majority of variability for 64% of metabolites. Given this, a metabolite would need, on average, a relative risk of 3 (comparing upper and lower quartiles of “usual” levels) or 2 (comparing quartiles of observed levels) to be detected in 38%, 74%, and 97% of studies including 500, 1,000, and 5,000 individuals. Age, gender, and fasting status factors, which are often of less interest in epidemiologic studies, were associated with 30%, 67%, and 34% of metabolites, respectively, but the associations were weak and explained only a small proportion of the total metabolite variability.

**Conclusion:** Metabolomics will require large, but feasible, sample sizes to detect the moderate effect sizes typical for epidemiologic studies.

**Impact:** We offer guidelines for determining the sample sizes needed to conduct metabolomic studies in epidemiology. *Cancer Epidemiol Biomarkers Prev; 22(4); 631–40. ©2013 AACR*.

## Introduction

Metabolomics is the assessment of small molecules (1), often defined to be only those molecules participating in cellular metabolism, within a given biologic system (2). Modern methods, such as nuclear magnetic resonance (NMR) and mass spectroscopy coupled with liquid chromatography or gas chromatography (3), can identify and quantify a large number of metabolites simultaneously within a biospecimen, capturing its metabolomic profile. These profiles have been used to predict the risk of diabetes (4, 5), diagnose prostate cancer (6), and identify biomarkers of Crohn's disease (7). While these initial studies have shown the potential of metabolomics, several important issues need to be resolved before considering metabolomics as a tool for large epidemiologic studies.

A common goal in epidemiology is to relate a “usual” (8) level of an exposure, such as blood pressure, vitamin D levels, or smoking status, with the risk of disease. Usual is an ambiguous term, but it might be loosely translated as the average level over the last month or, perhaps, year. To assess the potential association, epidemiologic studies often rely on only a single measurement in time as an estimate or surrogate for an individual's usual level. For characteristics that have large day-to-day variation or are measured with low technical reliability, the surrogate may poorly reflect the desired quantity. Given that it is the usual level that is likely to be associated with the disease, within-individual and technical variability will reduce the study's power to detect and quantify the tested association (9–11).

There is a potential concern that a single metabolomic profile may poorly reflect usual levels. Several metabolites are already known to vary within an individual over time. For example, vitamin D levels vary with the seasons (12), estrogen levels vary with the menstrual cycle in premenopausal women (13, 14), and aldosterone, cortisol, and renin levels follow a circadian rhythm (15). On a shorter time scale, carbohydrate, lipid, and amino acid levels in the blood respond to dietary patterns, spiking sharply in the postprandial period (1–2 hours after eating; ref. 16). However, recent studies have suggested that metabolomic profiles may be relatively stable (11, 17–20). Floegel and colleagues found that the median intraclass correlation (ICC), over a 4-month interval, of 163 serum metabolites measured by mass spectroscopy was 0.57 (20), and Nicholson and colleagues found that the stable proportion of biologic variation, over a similar period, for 38 annotated plasma metabolites measured by NMR was, on average, 0.68 (11), with similar ICCs for a larger number of spectral peaks. Here, we extend this research by studying a larger set of 385 metabolites, by including nonfasting samples so as to represent samples typically collected in epidemiologic studies, and by considering measurements separated by 1 year so as to capture the variability around persistent exposures, which are more likely to affect the risk of many diseases.

Our overarching goal is to provide key information needed to design metabolomics analyses in the context of large-scale epidemiologic studies. Our first objective is to estimate the within-individual, technical, and between-individual variability in 385 plasma metabolites measured by LC/MS and GC/MS, when samples are collected as part of an epidemiologic study. Higher between-individual variability is desirable, because that encompasses the measurable differences that can be associated with disease. We assess these 3 sources of variability in 184 individuals from the Shanghai Physical Activity (SPA) study and confirm our observations in a smaller subsample from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial. Our second objective is to translate these estimates of variability into estimates of the study power (11) that can be expected for epidemiologic studies, with a specific focus on large case–control and case–cohort studies. While our conclusions are based on our results observed for LC/GC mass spectroscopy, our methodologic framework can evaluate other metabolomic platforms, such as NMR.

## Materials and Methods

### Studies and sample collection

The Shanghai Women's Health Study (SWHS) and Shanghai Men's Health Study (SMHS) are prospective cohort studies that include 74,943 women (ages 40–70 years at study baseline) and 61,582 men (ages 40–75 years at baseline) from 8 communities in Shanghai, China, between 1997 and 2006. The SPA study included a randomly selected subcohort residing in 2 communities (21, 22). Participants were each enrolled for 1 year and provided EDTA plasma samples at the beginning (T0) and end of the study year (T1). Samples were stored at −70°C (23). Our analysis includes all 106 women and 78 (out of 100) men who were enrolled in the first wave of recruitment, donated T1 plasma samples and have a valid Actigraph accelerometer measurement, a requirement for a complementary study of physical activity. The study included the 60 men with the most extreme levels of physical activity (30 high; 30 low) and 18 randomly selected men. The median age at T1 was 55 and 52 years for men and women, respectively and 55% of the women were postmenopausal.

Metabolite levels were measured for all 184 individuals at T1 and a randomly selected subset of 60 women at T0. Although fasting was not required, 6, 4, and 21 of these 60 women reported fasting on the morning of sample collection at both T0 and T1, at T0 only, and at T1 only, respectively. Two replicate samples, needed to assess technical variability, were measured on 8 of the T0 samples.

The PLCO screening trial is a large randomized trial, starting in 1993, which examines the effects of screening on cancer-related outcomes in the United States (24). Biologic specimens were collected under a uniform protocol and placed in long-term storage at −70°C in a common PLCO Biorepository at Frederick, MD (25). Our analysis focused on 254 individuals, collected as healthy age and gender-matched controls for 254 colorectal cancer cases. At baseline (T0), the median ages for the 143 men and 111 women were 65 and 63 years, respectively, and 98% of the women reported being postmenopausal. Metabolites were measured in serum samples from all 254 individuals at T0 and a randomly selected group of 30 individuals (14 women and 16 men) at year 1 (T1). To evaluate technical variability in PLCO, we used EDTA plasma samples collected as part of a separate pilot study that measured replicate samples collected during the fourth year (T4) of follow-up from 15 randomly selected healthy men. Previous studies have already shown that plasma and serum metabolite profiles behave similarly (26). Because of differences in the study populations, we do not conduct a combined analysis and instead use our observations from PLCO, which has fewer individuals with multiple measurements, to confirm the SPA results.

### Metabolite measurement

Study samples were analyzed at the laboratory of Metabolon Inc. using ultra high-performance liquid-phase chromatography and gas chromatography coupled with mass spectrometry and tandem mass spectrometry, as described previously (27, 28). A nontargeted single extraction was used, followed by protein precipitation, to recover a diversity of metabolites. Relative quantities were obtained from mass spectroscopy peaks, and peaks were linked by informatics methods to metabolite identities. The list of measured metabolites includes, but is not limited to, amino acids, carbohydrates, fatty acids, androgens, and xenobiotics (Supplementary Table S1). Metabolites were individually normalized according to test-day.

### Between-individual, within-individual, and technical variability

For each metabolite, we can estimate the variance across all measurements. We consider the variance of the transformed quantity, the log of the peak intensity, as it is the quantity most commonly used in association studies. After log-transformation, metabolites were approximately normally distributed (Supplementary Fig. S4). One goal is to decompose this total variance, |$\sigma _{\rm T}^2$|, into 3 different components: the between subject variance, |$\sigma _{\rm B}^2$|, which can also be considered the variance of the “usual” level in a population; the within subject variance, |$\sigma _{\rm W}^2$|, which reflects the true year-to-year variability around the “usual” level within an individual; the technical variance or laboratory reproducibility, |$\sigma _{\rm E}^2$|, which is the expected variance from 2 identical samples.

These 3 variance components can be combined into other quantities of interest.

The biologic variance (29, 30) is |$\sigma _{\rm B}^2 + \sigma _{\rm W}^2$|.

The technical ICC is the proportion of the total variation that is attributable to biologic variance, as opposed to random laboratory error: |${\rm ICC} = (\sigma _{\rm B}^2 + \sigma _{\rm W}^2)/\sigma _{\rm T}^2$|, where |$\sigma _{\rm T}^2 = \sigma _{\rm B}^2 + \sigma _{\rm W}^2 + \sigma _{\rm E}^2$|. The technical ICC is a common measure of laboratory accuracy or reproducibility.

We denote the proportion of the population's biologic variability that is due to the variation across individuals by |$\pi _{{\rm BW}}^{\rm B} = \sigma _{\rm B}^2/(\sigma _{\rm B}^2 + \sigma _{\rm W}^2)$|.

The usual measurement “error,” or the variation around an individual's “usual” level, is |$\sigma _{\rm W}^2 + \sigma _{\rm E}^2$|. Larger values of this “error” in the usual measurement often imply lower power for an epidemiologic study to detect associations. Therefore, we want the proportion of total variability attributable to between-subject differences, |$\pi _{\rm T}^{\rm B} = \sigma _{\rm B}^2/\sigma _{\rm T}^2$|, to be large. This proportion has also been called the ICC (11, 20) in previous literature, but we use |$\pi _{\rm T}^{\rm B}$| here to avoid confusion with the technical ICC.
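As a small illustration of how these derived quantities follow from the 3 variance components, the sketch below computes them for one metabolite. The function name and the example variances are hypothetical and are not taken from the study data.

```python
def variance_summaries(var_between, var_within, var_technical):
    """Derived quantities for one metabolite from its 3 variance components.

    Inputs are variances of the log-transformed peak intensity:
    between-subject, within-subject, and technical.
    """
    biologic = var_between + var_within          # biologic variance
    total = biologic + var_technical             # total variance
    return {
        "technical_icc": biologic / total,       # share of total that is biologic
        "pi_bw_b": var_between / biologic,       # between-subject share of biologic variance
        "pi_t_b": var_between / total,           # between-subject share of total variance
        "usual_error": var_within + var_technical,
    }


# Illustrative values only (not study estimates).
example = variance_summaries(var_between=0.4, var_within=0.3, var_technical=0.3)
```

Because the technical ICC and the between-subject share of biologic variance multiply to give the between-subject share of total variance, a metabolite can have excellent laboratory reproducibility and still be a poor single-sample surrogate for its usual level.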

We can estimate each of the 3 variance components, and the other relevant quantities, by using linear mixed models with the normalized log-transformed metabolite level, Y, as the outcome and random effects for subject, S, and year, T (nested within subject; ref. 9). Estimates of |$\pi _{\rm T}^{\rm B}$| in PLCO are based only on individuals with samples collected at year T0 and T1, whereas estimates of technical ICC are based only on the 15 individuals with samples collected at T4.
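One way to write the mixed model of Eq. (1) that is consistent with this description is sketched below. The indices (*i* for subject, *j* for year, *k* for replicate), the intercept |$\mu$|, and the normality of the random effects are assumptions made for illustration; only the outcome Y and the random effects S and T are named in the text.

```latex
Y_{ijk} = \mu + S_i + T_{ij} + \varepsilon_{ijk},
\qquad
S_i \sim N(0, \sigma_{\rm B}^2), \quad
T_{ij} \sim N(0, \sigma_{\rm W}^2), \quad
\varepsilon_{ijk} \sim N(0, \sigma_{\rm E}^2)
```

Under this reading, the variance of the subject effect is the between-subject variance, the variance of the year-within-subject effect is the within-subject variance, and the residual variance among replicates is the technical variance.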

### Evaluating associations with age, fasting status, and gender

For each metabolite, we can further partition the sources of variation by expanding upon Eq. (1). We now consider covariates for subject *i*: *G*_{i} = gender, *F*_{i} = fasting status, and *A*_{i} = age quartile, with both gender and fasting status being binary (0 or 1) variables.

We expand Eq. (1) by including fixed effects for age quartile, gender, and fasting status, which are respectively represented by *α*, *γ*, and *ϕ* in Eq. (2). The subscripts, *G*_{i}, *F*_{i}, and *A*_{i}, indicate that we are including the fixed effect appropriate for subject *i*. By adjusting out these factors, we can assess the percentage of variability attributable to between-subject differences within specific demographic subcohorts.
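A form of Eq. (2) consistent with this description can be sketched as follows. The indices *j* (year) and *k* (replicate) and the intercept |$\mu$| are assumptions made for illustration; the fixed effects *α*, *γ*, and *ϕ* and their subscripts follow the text.

```latex
Y_{ijk} = \mu + \alpha_{A_i} + \gamma_{G_i} + \phi_{F_i} + S_i + T_{ij} + \varepsilon_{ijk}
```

Here the random effects for subject and year and the residual term play the same roles as in the unadjusted model, so their variances are the between-subject, within-subject, and technical components that remain after the covariates are adjusted out.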

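We can now define the proportion of unexplained variability attributed to between-subject differences, using the variance components estimated from Eq. (2), as |${\pi ^\prime} _{\rm T}^{\rm B} = \sigma _{\rm B}^2/(\sigma _{\rm B}^2 + \sigma _{\rm W}^2 + \sigma _{\rm E}^2)$|.

We will also identify the variance attributable to age, gender, and fasting, denoted by *σ*^{2}(age), *σ*^{2}(gender), and *σ*^{2}(fasting), to get a better idea of their overall influence on the metabolomic profile. Exact definitions of these variance components are provided in Appendix A, but we can now define the total variance by |$\sigma _{\rm T}^2 = \sigma ^2({\rm age}) + \sigma ^2({\rm gender}) + \sigma ^2({\rm fasting}) + \sigma _{\rm B}^2 + \sigma _{\rm W}^2 + \sigma _{\rm E}^2$| and examine the proportion of the variance attributable to each of the 3 covariates, for example |$\sigma ^2({\rm age})/\sigma _{\rm T}^2$| for age.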

Furthermore, we can assess whether the covariates are significantly associated with metabolite levels and obtain *P* values by conducting an ANOVA on the mixed models described by Eq. (2).

### Global summaries

Until now, we have focused on statistics that describe the behavior of a single metabolite. For each metabolite, we have described calculating |$\hat \pi _{{\rm BW}}^{\rm B}$|, |$\hat \pi _{\rm T}^{\rm B}$|, and ICC^{est}, our estimates for |$\pi _{{\rm BW}}^{\rm B}$|, |$\pi _{\rm T}^{\rm B}$|, and ICC, by fitting linear mixed models. However, we are also interested in statistics that can describe the global behavior of all metabolites that were reported in at least 90% of the samples. We will therefore report the proportion of these metabolites where the estimated parameters exceed 0.2, 0.5, or 0.8 and treat these proportions as estimates of the proportion of metabolites exceeding the corresponding threshold. As a global summary, we also estimate the proportion of these metabolites that are associated with age, gender, and fasting status by the maximum false discovery rate estimated among all metabolites.
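The threshold-based global summary can be sketched in a few lines. The function name and the toy estimates below are hypothetical and serve only to show the calculation.

```python
def proportion_exceeding(estimates, thresholds=(0.2, 0.5, 0.8)):
    """Fraction of per-metabolite estimates that exceed each threshold."""
    n = len(estimates)
    return {t: sum(e > t for e in estimates) / n for t in thresholds}


# Hypothetical estimates for five metabolites (illustrative values only).
toy_estimates = [0.15, 0.35, 0.55, 0.62, 0.85]
toy_summary = proportion_exceeding(toy_estimates)
```

The same function applies unchanged to the estimated technical ICCs or to either of the between-subject proportions, which is how a table with one row per parameter and one column per threshold could be assembled.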

### Estimates of power

Our objective is to estimate the expected power for a case–control study focused on a single disease. Specifically, we estimate the average power, or the proportion of true metabolite-disease associations that are expected to be discovered, accounting for the 3 sources of variability and the testing of multiple metabolites.

We assume that a study will collect *n* individuals equally split between cases and controls and use a *t* test with the appropriate Bonferroni-corrected significance threshold to test for an association between the disease and each metabolite. We then estimate the power to detect each metabolite given its variance components measured in SPA and an assumed effect size. Study-level power is the average of these values over all metabolites. We also consider the scenario where there are 1 to 5 samples per individual.

For purposes of interpretation and to enable comparisons with previously reported studies, we define the effect size for a given metabolite to be the relative risk of disease for an individual in the top quartile of the usual metabolite levels, as compared with the bottom quartile. Note, we still presume the metabolites are normally distributed and assume a *t* test is used in the study (Appendix B).
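To make the logic of the power calculation concrete, the following is a minimal sketch under simplifying assumptions and is not the procedure of Appendix B. It takes the effect size as a standardized case–control mean difference in the usual level (`d_usual`, a hypothetical input) rather than a quartile relative risk, and it uses a normal approximation in place of the *t* test. All function and parameter names are illustrative.

```python
from statistics import NormalDist


def approx_power(d_usual, pi_t_b, n_total, n_tests=385, fwer=0.05):
    """Normal-approximation power of a two-sided, two-sample comparison.

    d_usual: case-control mean difference in the usual level, in units of the
             between-subject SD (assumed input, not a relative risk).
    pi_t_b:  proportion of total variance that is between-subject.
    n_total: total sample size, split equally between cases and controls.
    """
    z = NormalDist()
    alpha = fwer / n_tests                    # Bonferroni-corrected threshold
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Within-subject and technical variance inflate the observed SD,
    # shrinking the observed standardized difference by sqrt(pi_t_b).
    d_observed = d_usual * pi_t_b ** 0.5
    # Standard error of the difference in means is 2 / sqrt(n_total).
    ncp = d_observed * (n_total / 4) ** 0.5
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


def average_power(d_usual, pi_values, n_total, **kwargs):
    """Average power over metabolites, each with its own pi_t_b."""
    powers = [approx_power(d_usual, p, n_total, **kwargs) for p in pi_values]
    return sum(powers) / len(powers)
```

Averaging over metabolites, rather than reporting power for a single "typical" metabolite, matters because power is a nonlinear function of the between-subject proportion; metabolites with low values contribute almost nothing at moderate effect sizes.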

## Results

### Measurement/technical variability

Within the 252 SPA samples, there were 567 observed metabolites. Of those 567 metabolites, 385 metabolites were observed in at least 90% of all samples and 341 were observed in 95% of all samples. We consider only those 385 most common metabolites for the remainder of this article. Of those 385 metabolites, the identities of 254 had already been determined.

The majority of technical ICCs, a measure of the similarity between replicate samples, were high (Fig. 1A). With the SPA samples, the estimated ICCs for 57%, 85%, and 97% of the metabolites exceeded 0.8, 0.5, and 0.2, respectively (Table 1). The distribution of ICCs was similar for all categories of metabolites (Supplementary Table S2). The distribution of the coefficients of variation (CV) is illustrated in Supplementary Fig. S3. When the analysis was repeated using the T4 samples from 15 men in PLCO, the distribution of estimated ICCs was nearly identical and is depicted by the red line in Fig. 1A. Among metabolites common to both studies, the reported ICCs were highly correlated (*ρ* = 0.48) but far from identical (Supplementary Fig. S1), as expected for 2 distinct populations, 1 from China and 1 from the United States, and given the limited sample size.

| Parameter | Threshold 0.2 | Threshold 0.5 | Threshold 0.8 |
|---|---|---|---|
| ICC | 97% | 85% | 57% |
| $\pi _{{\rm BW}}^{\rm B}$ | 93% | 61% | 23% |
| $\pi _{\rm T}^{\rm B}$ | 87% | 36% | 3.6% |

NOTE: The first row lists the percentage of metabolites in the SPA study with an estimated ICC coefficient that exceeds thresholds of 0.2, 0.5, and 0.8. The second row lists the percentages of metabolites where the estimated proportion of biologic variability (|$\pi _{{\rm BW}}^{\rm B}$|) attributable to between-subject differences exceeds these same thresholds. The third row lists the percentages of metabolites where the estimated proportion of total variability (|$\pi _{\rm T}^{\rm B}$|) attributable to between-subject differences exceeds these same thresholds.

### Within and between-individual variability

Given only a single measurement, a study's power to detect long-term epidemiologic associations tends to be higher when |$\pi _{\rm T}^{\rm B}$|, the proportion of total variability attributed to between subject differences, is larger. The estimates of |$\pi _{\rm T}^{\rm B}$| were generally lower than the estimated ICCs, with only 3.6%, 36%, and 87% of metabolites having |$\hat \pi _{\rm T}^{\rm B}$| exceeding 0.8, 0.5, and 0.2, respectively (Fig. 1B and Table 1). The distribution of |$\hat \pi _{\rm T}^{\rm B}$| was not unimodal, with 23 identified and 13 unidentified metabolites having high values of |$\hat \pi _{\rm T}^{\rm B}$| above 0.7. The majority of these metabolites were in the biosynthesis pathway of androsterone or markers of specific dietary habits. The metabolites with the lowest values of |$\hat \pi _{\rm T}^{\rm B}$|, among all metabolites with an ICC^{est} > 0.8, were more heterogeneous and included multiple markers for episodically consumed foods (Supplementary Table S1).

Within specific age and gender demographic groups, study power will be limited by |${\pi ^\prime} _{\rm T}^{\rm B}$|, the proportion of variability attributed to between subject differences after adjusting for the covariates in Eq. (2). Similar to the distribution of |$\hat \pi _{\rm T}^{\rm B}$|, 3.1%, 31%, and 83% of metabolites had values |$\hat \pi {^\prime} _{\rm T}^{\rm B}$| exceeding 0.8, 0.5, and 0.2. When estimating |$\pi _{\rm T}^{\rm B}$| in a subgroup of only women and among only metabolites measured in women, results again were nearly unchanged with 3.7%, 33%, and 88% of metabolites having values of |$\hat \pi {^\prime} _{\rm T}^{\rm B}$| exceeding 0.8, 0.5, and 0.2. Even among the 36 metabolites with the highest estimates of |$\hat \pi _{\rm T}^{\rm B}$|, many of which were strongly associated with age and gender, the proportion of variation attributable to between subject differences after these adjustments only decreased minimally (Table 2).

| Metabolite | $\hat \pi _{\rm T}^{\rm B}$ | $\hat \pi _{\rm T}^{\rm B}$ (A.G.A.) | $\hat \pi _{\rm T}^{\rm B}$ (women) | *P* value (age) | *P* value (gender) |
|---|---|---|---|---|---|
| 1,5-Anhydroglucitol (1,5-AG) | 0.91 | 0.91 | 0.92 | 0.88 | 0.32 |
| 4-Androsten-3β,17β-diol disulfate 1 | 0.9 | 0.86 | 0.88 | 0.085 | <0.0001 |
| Pregnen-diol disulfate* | 0.9 | 0.87 | 0.89 | 0.018 | <0.0001 |
| DHEA-S | 0.89 | 0.86 | 0.89 | 0.00039 | <0.0001 |
| 4-Androsten-3β,17β-diol disulfate 2 | 0.85 | 0.82 | 0.86 | 0.076 | <0.0001 |
| Pyroglutamine | 0.83 | 0.68 | 0.74 | <0.0001 | <0.0001 |
| Androsterone sulfate | 0.82 | 0.77 | 0.82 | 0.022 | <0.0001 |
| Andro steroid monosulfate 2 | 0.81 | 0.81 | 0.84 | 0.92 | 0.19 |
| 5α-Androstan-3β,17β-diol disulfate | 0.8 | 0.7 | 0.72 | 0.0064 | <0.0001 |
| Epiandrosterone sulfate | 0.79 | 0.73 | 0.78 | 0.033 | <0.0001 |
| Pseudouridine | 0.78 | 0.68 | 0.8 | <0.0001 | 0.43 |
| Pregn steroid monosulfate | 0.76 | 0.72 | 0.75 | 0.014 | <0.0001 |
| 3-(4-Hydroxyphenyl)lactate | 0.76 | 0.7 | 0.69 | 0.0044 | <0.0001 |
| 21-Hydroxypregnenolone disulfate | 0.76 | 0.74 | 0.77 | 0.13 | 0.00067 |
| α-Hydroxyisovalerate | 0.76 | 0.72 | 0.67 | 0.4 | <0.0001 |
| C-Glycosyltryptophan | 0.74 | 0.63 | 0.75 | <0.0001 | 0.15 |
| Urate | 0.74 | 0.72 | 0.73 | 0.14 | <0.0001 |
| Glutaroyl carnitine | 0.72 | 0.69 | 0.65 | 0.0037 | 0.00081 |
| Creatine | 0.72 | 0.62 | 0.55 | 0.00026 | <0.0001 |
| 3-Dehydrocarnitine | 0.72 | 0.71 | 0.71 | 0.87 | 0.32 |
| 1-Arachidonoylglycerophosphocholine | 0.72 | 0.7 | 0.74 | 0.3 | 0.15 |
| 2-Hydroxybutyrate (AHB) | 0.71 | 0.71 | 0.71 | 0.43 | 0.24 |
| Undecanoate (11:0) | 0.7 | 0.65 | 0.64 | 0.53 | <0.0001 |

NOTE: Columns list the metabolite name, |$\hat \pi _{\rm T}^{\rm B}$|, the equivalent value from the age and gender adjusted (A.G.A.) model, the equivalent from a female-only model, *P* value for the metabolite's association with age, and *P* value for the metabolite's association with gender.

We also measured |$\hat \pi _{\rm T}^{\rm B}$| in the individuals from PLCO. Although these measurements were from serum samples, the distribution of |$\hat \pi _{\rm T}^{\rm B}$| was nearly identical in this population (Fig. 1B), and there was high correlation (*ρ* = 0.49) when comparing the estimates of between-subject variability among metabolites common to both groups. Supplementary Fig. S2 confirms that those metabolites with high values of |$\hat \pi _{\rm T}^{\rm B}$| in SPA have similarly high values in PLCO.

Our study design permits the estimation of |$\pi _{{\rm BW}}^{\rm B}$|, the proportion of biologic variability that can be attributed to between-individual differences. Although these results are limited by our ability to distinguish technical and within-individual variability using only 8 replicate samples, the majority of natural variability seemed to be attributable to between-subject differences: 23%, 61%, and 93% of the metabolites having estimated values of |$\pi _{{\rm BW}}^{\rm B}$| exceeding 0.8, 0.5, and 0.2 (Table 1). Again, adjusting for age and gender did not alter our estimates much: 22%, 62%, and 92% of metabolites had estimates of |$\pi {^\prime} _{{\rm BW}}^{\rm B}$| exceeding 0.8, 0.5, and 0.2.

### Age, gender, and fasting status

Covariates suspected to be associated with metabolite levels explained small, but statistically significant, proportions of the variation in many of the metabolites. We found that age, fasting status, and gender were correlated with 30%, 34%, and 67% of metabolites, respectively. Using the Bonferroni-adjusted α-level of 0.05/385, we find that 9.1%, 14.3%, and 7.3% of metabolites have a statistically significant association with age, fasting status, and gender, respectively. However, the proportion of the variability attributable to each covariate was small, explaining why |$\pi _{\rm T}^{\rm B}$| changed little after adjusting for covariates. Figure 2 shows the proportion of total variability attributed to these covariates.

### Power

We quantified the effect size as the relative risk of disease when comparing individuals in the top and bottom quartiles of the usual metabolite level. However, when calculating power, we presumed a *t* test comparing cases and controls. Given this definition of effect size and the assumption that all measured metabolites are equally likely to be associated with the disease, a case–control study with a total of 500 individuals is expected to detect less than 1%, 38%, and 75% of the metabolites with a relative risk of 1.5, 3.0, and 5.0 (Fig. 3A). Similarly, a study with 1,000 individuals should detect 3%, 74%, and 92% and a study with 5,000 individuals should detect 55%, 97%, and 98% of metabolites with a relative risk of 1.5, 3.0, and 5.0. All estimates assume a conservative Bonferroni-adjusted α-level of 0.00013 = 0.05/385 (Fig. 3A and Table 3). Although these relative risks are larger than typically reported in epidemiologic studies, the naïve or observed relative risks would be lower and in line with typical values. When the true relative risks are 1.5, 3.0, and 5.0, the naïve relative risks are expected to be 1.3, 2.0, and 2.8.

| N | Relative risk 1.5 | Relative risk 3.0 | Relative risk 5.0 |
|---|---|---|---|
| 500 | <1% | 38% | 75% |
| 1,000 | 2.9% | 74% | 92% |
| 5,000 | 55% | 97% | 98% |

NOTE: The entries list the average power to detect associations between metabolites and disease in a case–control study that has 500, 1,000, and 5,000 individuals and where the metabolites have true relative risks of 1.5, 3.0, and 5.0. These true values translate to naïve estimates for relative risk of 1.3, 2.0, and 2.8.

We will detect higher proportions of those metabolites that have higher ICCs. Considering only those 36 metabolites with a |$\hat \pi _{\rm T}^{\rm B}$| more than 0.7, a case–control study with 1,000 individuals should detect 25%, 50%, and 80% of metabolites with a true relative risk of 1.7, 1.9, and 2.2. Focusing on only the 287, 142, and 36 metabolites with a |$\hat \pi _{\rm T}^{\rm B}$| exceeding 0.3, 0.5, and 0.7 would be equivalent to setting the α-threshold at 0.00017, 0.00035, and 0.0014.
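The α-thresholds quoted in this section follow from dividing a family-wise error rate of 0.05 by the number of metabolites tested. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
def bonferroni_alpha(n_tests, fwer=0.05):
    """Per-test significance threshold for a given family-wise error rate."""
    return fwer / n_tests


# All 385 metabolites, then the subsets of 287, 142, and 36 metabolites.
thresholds = {m: bonferroni_alpha(m) for m in (385, 287, 142, 36)}
```

Restricting testing to the 36 most stable metabolites therefore relaxes the per-test threshold roughly tenfold relative to testing all 385, in addition to the gain from their lower measurement error.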

If the most promising set of metabolites can be evaluated in a second stage of the study or if a complementary study can limit candidate pathways, requiring a family-wise error rate of 0.05 would be unnecessarily strict. If we raise the α-level to 0.001, a case–control study with 1,000 individuals should detect 25%, 50%, and 80% of metabolites with a true relative risk of 1.8, 2.1, and 2.8. The corresponding naïve relative risk would be respectively 1.5, 1.6, and 2.0. Figure 3B compares the power for studies with an α-threshold of 0.01, 0.001, and 0.00013.

Power would be improved by collecting multiple samples from each individual. Additional samples reduce the within-individual and technical variability. Figure 3C illustrates the gains in power from taking 2, 3, or 5 samples throughout the year for a 1,000-subject study, where we assume that the correlation between any 2 measures is independent of the time separating them. Collecting a second sample increases the study's power by 1.84×, 1.15×, and 1.05× when the relative risks are 2, 3, and 4, respectively.
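The reason repeat samples help can be sketched with a simple averaging argument: if the within-individual and technical errors of the samples from one person are independent, averaging *k* samples divides that error variance by *k*. The function below is a sketch under that assumption, with a hypothetical name.

```python
def pi_t_b_with_repeats(pi_t_b_single, k):
    """Between-subject share of variance for the mean of k samples per person.

    Assumes the within-individual and technical errors are independent across
    a person's samples, so their combined variance shrinks by a factor of k.
    """
    error_share = 1.0 - pi_t_b_single
    return pi_t_b_single / (pi_t_b_single + error_share / k)


# A single-sample value of 0.43, with 1, 2, and 5 samples per person.
gains = {k: pi_t_b_with_repeats(0.43, k) for k in (1, 2, 5)}
```

For a single-sample value of 0.43, a second sample raises the between-subject share to about 0.60 and five samples raise it to about 0.79, so the first repeat gives the largest gain per additional sample.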

## Discussion

Our objective was to assess the potential role of metabolomics in large epidemiologic studies, with a specific focus on case–control studies. We first showed that although LC/MS and GC/MS produced reliable and reproducible results, there was also considerable within-individual variability. Approximately 40% of the biologic variability, on average, could be attributed to variation occurring within an individual over time. Using our estimates of technical and within-individual variability, we then estimated the power for detecting metabolite-disease associations in epidemiologic studies.

Although we assume associations will be tested by a *t* test or linear regression, we quantify a metabolite's effect size by the relative risk comparing individuals within the top and bottom quartiles of the metabolite's distribution. We show the need for a large number of samples in case–control studies, and expect our figures relating relative risk to power, as well as the distributions of |$\hat \pi _{\rm T}^{\rm B}$|, |$\hat \sigma ^2$|, and ICC^{est} used to calculate that relationship, to serve as a guide for studies considering metabolomic profiling of samples. If laboratory variability can be reduced, perhaps by using a targeted approach of only a few metabolites, similar levels of power could be achieved with fewer individuals.

Our results corroborate and expand upon previous studies measuring metabolomic variability and estimating power for epidemiologic studies (11, 20). Our median 1-year |$\hat \pi _{\rm T}^{\rm B}$| of 0.43 was similar to an earlier targeted analysis of 163 metabolites that found a median 4-month |$\hat \pi _{\rm T}^{\rm B}$| (or ICC using their definition) to be 0.57 and an NMR analysis of 38 metabolites that also found a median 4-month |$\hat \pi _{\rm T}^{\rm B}$| around 0.57 (0.68/1.19), after accounting for technical variability. Our |$\hat \pi _{\rm T}^{\rm B}$| are likely lower, but perhaps more pertinent for planning epidemiologic studies, because of the 8 additional months between measurements and the fewer requirements imposed on study participants (e.g., no fasting). Unlike many metabolomic focused studies that control for diet (11, 20) or behavior (31), our samples were collected as part of SPA and PLCO, and therefore the observed variability will likely be more similar to that reported in future epidemiologic studies. Even with our slightly lower |$\hat \pi _{\rm T}^{\rm B}$| values, we reached the same qualitative conclusion as Nicholson and colleagues (11) that studies will require large sample sizes, upwards of 1,000 subjects, to detect metabolomic associations. We have further expanded upon Nicholson and colleagues' results by relating power to relative risk, considering different significance thresholds, and discussing the power from repeat measurements.

Studies should plan for, but not be discouraged by, the potentially high intraindividual variability. Strong associations between usual exposure levels and disease risk have allowed previous epidemiologic studies to overcome imprecision of this magnitude. For example, in postmenopausal women, insulin and estradiol have |$\hat \pi _{\rm T}^{\rm B}$| of 0.68 and 0.59, respectively, over a span of 1 to 3 years (32, 33), but studies have nonetheless successfully detected their associations with breast cancer (34, 35). Similarly, for heart disease (36) and diabetes (4), studies have recently identified metabolites related to branched-chain amino acids as important predictors of disease risk, with ORs ranging from 2 to 4 for top versus bottom quartile comparisons. For smoking-related cancers, a comparison of high versus low cotinine levels would be expected to yield ORs as high as 20 or more (37).

Moreover, we showed that although a reasonably high proportion of metabolites were associated with age, fasting status, and gender, these 3 covariates accounted for only a small proportion of the total variability. Therefore, metabolomic profiles can still be useful for distinguishing risks within specific demographic cohorts, because within-cohort variation remains high. Similarly, metabolomic profiles can still be useful even when epidemiologic studies do not impose dietary restrictions (e.g., fasting) before blood draws. These factors do little to affect our overall conclusions about detectable relative risks.

Our study had 5 main limitations. First, the SPA dataset only contained measurements on women at 2 time points, and therefore within-subject variability results are gender specific. However, similar results were seen in the PLCO dataset, which includes equal numbers of men and women. Second, we only calculated power for identifying disease associations with individual metabolites. There is also interest in identifying metabolic profiles, such as those created by partial least squares regression (PLS; ref. 38), that can differentiate cases and controls and be used as a diagnostic tool. Third, LC/MS and GC/MS platforms do not report the actual metabolite levels, but rather peak intensities, and our parameter estimates can be slightly sensitive to scale. Fourth, we only had a total of 23 replicates for assessing technical variability. While this may introduce imprecision for estimating the |$\hat \sigma _{\rm E}^{\rm 2}$| and |$\hat \pi _{{\rm BW}}^{\rm B}$| for individual metabolites, their distributions, across all metabolites, should be more accurate. Also, our power calculations combined within-individual and technical variability, and therefore were based on the larger sample set of 60 samples repeated over time. Finally, the |$\hat \pi _{\rm T}^{\rm B}$| may be underestimated if storage at −70°C had a variable effect on samples or if the additional year of storage had a significant impact on biomarker levels (39–41).

Even given the limitations of the study, we were able to assess the magnitude of each of the 3 variance components and estimate the power for large case–control studies. Because the likely relative risks will depend on the disease, the time between sample collection and disease diagnosis, and the specimen type, we do not offer any universal conclusion about the use of metabolomics in epidemiologic studies. We do, however, strongly suggest considering our analysis of power when planning such studies.

## Appendix A

We discuss the “variance” attributable to age, fasting status, and gender. However, we caution against any literal interpretation of their values, which is our reason for the quotes. We define the “variance” as the proportion of the total variance that can be explained by that covariate in a population where all categories are equally represented (e.g., 50% men/50% women and 50% fasting/50% nonfasting) and there is 1 sample per individual. Note that the variances are highly dependent on how we chose to categorize the variables and their distribution within SPA.

Dropping the quotes, we now define the variances for the 3 covariates to be

where *α*_{k} is the fixed effect for age quartile *k*, *γ*_{1} and *γ*_{0} are the fixed effects for gender, *ϕ*_{1} and *ϕ*_{0} are the fixed effects for fasting, and:

These variances and their corresponding proportions *π*(Age), *π*(Fasting), and *π*(Gender), provide a measure of the influence of these 3 covariates on metabolite profiles.
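
As a rough illustration of the definition above, the following sketch computes the variance of a covariate's fixed effects in a population where every category is equally represented, and the resulting share of the total variance. The effect values, the total variance, and the function names are hypothetical and chosen only for illustration; this is one reading of the definition, not the authors' code.

```python
from statistics import pvariance

def balanced_variance(effects):
    """Variance of fixed effects when every category is equally represented."""
    return pvariance(effects)

def proportion_explained(effects, total_variance):
    """Share of an assumed total variance attributable to the covariate."""
    return balanced_variance(effects) / total_variance

# Hypothetical fixed effects on a standardized metabolite scale.
age_effects = [-0.15, -0.05, 0.05, 0.15]  # one effect per age quartile
gender_effects = [0.0, 0.3]               # two-level covariate
fasting_effects = [0.0, 0.2]              # two-level covariate
total_variance = 1.0                      # assumed total variance of the metabolite

for name, effects in [("Age", age_effects),
                      ("Gender", gender_effects),
                      ("Fasting", fasting_effects)]:
    print(name, proportion_explained(effects, total_variance))
```

For a two-level covariate with equal representation, this variance reduces to the squared difference between the two fixed effects divided by 4.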

## Appendix B

This appendix provides the details for estimating power. If we defined the effect of a metabolite in terms of its SD and the mean difference between cases and controls, this calculation would be trivial. The appendix is only needed because we choose to define the effect by the more interpretable relative risk. Again, the relative risk (RR) is defined as the ratio of the probability of disease for an individual in the top quartile of the usual metabolite levels to that for an individual in the bottom quartile:

where *D* and *X* are random variables respectively indicating disease status and metabolite level, *t*_{0} is the threshold for the top quartile of *X*, and *t*_{1} is the threshold for the bottom quartile of *X*, i.e.:

We assume that the usual metabolite levels within cases and controls are each normally distributed with respective means at μ_{RR}/2 and −μ_{RR}/2 and a common variance, |$\sigma _{\rm B}^2$|. Thus Eq. (3) can be reformulated as

where *Z* is a normal variable with mean = 0 and variance = 1. We further assume that the disease has a prevalence of 0.1 and Eq. (4) can be reformulated as:

We can thus solve for *μ*_{RR}, *t*_{0}, and *t*_{1} from the set of 3 equations earlier. For a given value of *μ*_{RR}, |$\sigma _{\rm T}^2$| and false-positive rate, *α*, the power for a case–control study is the probability that a chi-squared variable with noncentrality parameter |$n{\frac{{\mu _{{\rm RR}}^2}}{{\sigma _{\rm T}^2}}}$| and 1 degree of freedom (df) exceeds the 1 − *α* quantile of a central chi-squared distribution with 1 df.
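
The power statement above can be sketched numerically. The example below is a minimal illustration, assuming the noncentrality parameter exactly as written in the text and taking `n`, `mu_rr`, `sigma2_t`, and `alpha` as inputs; the function name and the input values are hypothetical. It relies on the fact that a noncentral chi-squared variable with 1 df and noncentrality λ has the same distribution as (*Z* + √λ)², where *Z* is standard normal, so only the normal distribution is needed.

```python
from math import sqrt
from statistics import NormalDist

def power_1df(n, mu_rr, sigma2_t, alpha):
    """P(noncentral chi-squared, 1 df, ncp = n * mu_rr^2 / sigma2_t) exceeds
    the (1 - alpha) quantile of a central chi-squared with 1 df."""
    std = NormalDist()
    ncp = n * mu_rr ** 2 / sigma2_t
    shift = sqrt(ncp)
    # The (1 - alpha) quantile of a central chi-squared with 1 df is z_{1 - alpha/2}^2.
    z_crit = std.inv_cdf(1 - alpha / 2)
    # (Z + shift)^2 > z_crit^2  <=>  Z > z_crit - shift  or  Z < -z_crit - shift
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

# Hypothetical inputs: a standardized mean difference of 0.1 and two study sizes.
for n in (500, 1000):
    print(n, power_1df(n, mu_rr=0.1, sigma2_t=1.0, alpha=0.05))
```

Lowering `alpha`, for example to account for testing several hundred metabolites, raises the critical value and reduces power for the same `n`.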

We have defined the effect size for a given metabolite in terms of the “usual” metabolite level. The listed relative risks will therefore be substantially higher than those reported in previous epidemiologic studies that did not correct for measurement error. To assess whether the listed relative risks are reasonably in line with previous studies, we also report the naïve relative risk, RR′, or the uncorrected estimate. For each metabolite and each specified relative risk, we estimate the naïve relative risk by solving the following set of equations for RR′, *t*_{0}′, and *t*_{1}′, where *μ*_{RR} is calculated earlier and *σ*_{B} is replaced by *σ*_{T}:

We then estimate the average naïve relative risk for a given true relative risk across all metabolites.
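
The two steps described in this appendix, solving for the mean difference that yields a given relative risk and then recomputing the relative risk on the observed scale, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: levels are standardized so that `delta` stands for *μ*_{RR}/*σ*_{B}, the prevalence is 0.1 as in the text, the equations are solved by simple bisection, and `icc` is assumed to equal |$\sigma _{\rm B}^2/\sigma _{\rm T}^2$|. All function names are hypothetical.

```python
from math import erfc, sqrt

PREVALENCE = 0.1  # disease prevalence assumed in the text

def phi(z):
    """Standard normal CDF, computed with erfc for accuracy in the lower tail."""
    return 0.5 * erfc(-z / sqrt(2.0))

def bisect(f, lo, hi, tol=1e-12):
    """Root of an increasing function f on [lo, hi] by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def population_cdf(x, delta):
    """CDF of the level in the whole population: cases at +delta/2, controls at -delta/2."""
    return PREVALENCE * phi(x - delta / 2) + (1 - PREVALENCE) * phi(x + delta / 2)

def relative_risk(delta):
    """Top versus bottom quartile relative risk for a standardized mean difference."""
    span = abs(delta) / 2 + 10.0
    t0 = bisect(lambda x: population_cdf(x, delta) - 0.75, -span, span)  # top-quartile threshold
    t1 = bisect(lambda x: population_cdf(x, delta) - 0.25, -span, span)  # bottom-quartile threshold
    p_top_given_case = phi(delta / 2 - t0)
    p_bottom_given_case = phi(t1 - delta / 2)
    # Each quartile has probability 0.25, so those terms and the prevalence cancel in the ratio.
    return p_top_given_case / p_bottom_given_case

def solve_delta(rr):
    """Standardized mean difference mu_RR / sigma_B that gives relative risk rr (rr >= 1)."""
    return bisect(lambda d: relative_risk(d) - rr, 0.0, 10.0)

def naive_relative_risk(rr, icc):
    """Relative risk on the observed scale, with sigma_B replaced by sigma_T."""
    delta_usual = solve_delta(rr)              # mu_RR / sigma_B
    delta_observed = delta_usual * sqrt(icc)   # mu_RR / sigma_T
    return relative_risk(delta_observed)

# Hypothetical example: a true relative risk of 3 and an assumed icc of 0.43.
print(naive_relative_risk(3.0, 0.43))
```

Because added within-individual and technical variability shrinks the standardized difference, the naïve relative risk lies between 1 and the true relative risk whenever `icc` is below 1.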

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** J.N. Sampson, X.O. Shu, R.Z. Stolzenberg-Solomon, A.W. Hsing, B.-T. Ji, R. Sinha, A.J. Cross, S.C. Moore

**Development of methodology:** J.N. Sampson, A.W. Hsing, R. Sinha, S.C. Moore

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** X.O. Shu, R.Z. Stolzenberg-Solomon, C.E. Matthews, A.W. Hsing, Y.T. Tan, B.-T. Ji, W.-H. Chow, Q. Cai, D.K. Liu, G. Yang, Y.B. Xiang, W. Zheng, R. Sinha, A.J. Cross, S.C. Moore

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** J.N. Sampson, S.M. Boca, X.O. Shu, R.Z. Stolzenberg-Solomon, C.E. Matthews, A.W. Hsing, Y.T. Tan, B.-T. Ji, R. Sinha, A.J. Cross, S.C. Moore

**Writing, review, and/or revision of the manuscript:** J.N. Sampson, S.M. Boca, X.O. Shu, R.Z. Stolzenberg-Solomon, C.E. Matthews, A.W. Hsing, Y.T. Tan, B.-T. Ji, W.-H. Chow, G. Yang, Y.B. Xiang, W. Zheng, R. Sinha, A.J. Cross, S.C. Moore

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** Y.T. Tan, B.-T. Ji, D.K. Liu, Y.B. Xiang, W. Zheng, R. Sinha, A.J. Cross, S.C. Moore

**Study supervision:** X.O. Shu, B.-T. Ji, R. Sinha

## Acknowledgments

The authors thank Dr. Mitch Gail (National Cancer Institute) for valuable discussions.

## Grant Support

This study is, in part, supported by the Intramural Research Program of the NIH and the Breast Cancer Research Stamp Fund, awarded through competitive peer review. SWHS was supported by R37CA070867.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.