Abstract
Incidence of early-onset colorectal cancer (EOCRC; e.g., diagnosed before age 50) in the United States has increased substantially since the 1990s, but the underlying reasons remain unclear.
We examined the ecologic associations between dietary factors and EOCRC incidence in adults aged 25–49 during 1977–2016 in the United States, using negative binomial regression models, accounting for age, period, and race. The models also incorporated an age-mean centering (AMC) approach to address potential confounding by age. We stratified the analysis by sex and computed the incidence rate ratio (IRR) for each study factor. Study factor data (for 18 variables) came from repeated national surveys; EOCRC incidence data came from the Surveillance, Epidemiology, and End Results Program.
Results suggest that confounding by age on the association with EOCRC likely existed for certain study factors (e.g., calcium intake), and that AMC can alleviate the confounding. EOCRC incidence was positively associated with smoking [IRR (95% confidence interval, CI): 1.17 (1.10–1.24) for men; 1.15 (1.09–1.21) for women] and alcohol consumption [IRR (95% CI): 1.08 (1.04–1.12) for men; 1.08 (1.04–1.11) for women]. No strong associations were found for most other study factors (e.g., fiber and calcium).
Alcohol consumption was positively associated with EOCRC and has increased among young adults since the 1980s, which may have contributed to the EOCRC incidence increases since the 1990s. The AMC approach may help alleviate age confounding in similar ecologic analyses.
Increases in alcohol consumption may have contributed to the recent increases in colorectal cancer incidence among young adults.
Introduction
Recent studies have shown increases in early-onset colorectal cancer (EOCRC, e.g., those diagnosed before age 50) incidence in the United States since roughly the 1990s (1–3). Studies have also projected a further increase in EOCRC incidence (e.g., >90% higher by 2030 compared with 2010; ref. 4) if this trend continues. Thus, accurate identification of modifiable risk factors of EOCRC is urgently needed to inform effective prevention in younger adults.
While there is a body of risk factor research on colorectal cancer primarily based on cases 50 years and older (5, 6), important research gaps on EOCRC exist. Exposures during early life and critical developmental periods are widely believed to be important in EOCRC development (7, 8), yet studies of such exposures are largely absent. In typical cohorts, exposure measurements start in the 40s, the age of cohort recruitment. Moreover, risk factors of EOCRC and older cases may differ. Compared with older colorectal cancer cases, EOCRC is associated with more aggressive pathology and late diagnosis (7, 8). As such, current risk-classification tools based on family history and inflammatory bowel disease could wrongly classify many EOCRC cases as average risk, resulting in late diagnosis (7). It is also challenging to study risk factors of EOCRC using traditional cohort and/or nested case–control designs. Because the absolute EOCRC risk is relatively low, prohibitively large sample sizes would be needed to provide sufficient statistical power. For example, assuming an incidence rate equal to that among U.S. women aged 25–49 during 2011–2016 (i.e., 12.9/100,000; ref. 9), a cohort of 0.78 million would be needed to observe 500 cases over five years.
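The cohort-size figure quoted above can be checked with a short calculation. The sketch below assumes a constant incidence rate and no loss to follow-up; the function name is illustrative and not taken from the study.

```python
def required_cohort_size(target_cases, rate_per_100k, years):
    """Cohort size whose expected case count over `years` equals `target_cases`.

    Assumes a constant annual incidence rate and no attrition (simplifications).
    """
    person_years = target_cases / (rate_per_100k / 100_000)
    return person_years / years

# Values from the text: 500 cases, 12.9 per 100,000 per year, five years.
n = required_cohort_size(500, 12.9, 5)
print(f"{n / 1e6:.2f} million")
```

The result rounds to the 0.78 million stated in the text.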
Given the above research gap and challenges, we conducted an ecologic analysis to examine the association of EOCRC incidence with a range of dietary factors, which are of major interest in EOCRC etiology and amenable to public health interventions. We focused on the U.S. population aged 25–49 [i.e., age groups shown to experience substantial EOCRC incidence increases (2, 3, 10, 11)] during 1977–2016. We also proposed a set of regression models to address two common challenges in similar ecologic analyses (i.e., time lag from exposure to disease and confounding by age). The proposed ecologic approach allows efficient and low-cost investigations of various exposures at different life stages and could be used to study other early-onset cancers with similar rapid increases in recent decades (10).
Materials and Methods
Study design
In a previous study, Pfeiffer and colleagues examined ecologic associations between concurrent exposures and breast cancer incidence in the United States across population groups defined by age, period, race, and sex (12). Here, to further account for the potential latent period from exposure to cancer diagnosis, we propose two strategies: regress the outcome on (i) the exposure 10 years ago (equivalent to lagging the outcome) or (ii) the cumulative exposure over the 10 years before the outcome.
Another challenge in our study is potential confounding by age, because both the outcome (here, cancer incidence) and the exposure can be associated with age, sometimes in opposite directions (see, e.g., fat intake in Fig. 1A). For such exposures, including age as a covariate in the model may not be able to handle the discordant association of age with the outcome versus the exposure. Moreover, for exposures whose association with age is positive, as it is for colorectal cancer, residual age confounding is also possible. To address this, we propose an age-mean centering (AMC) approach. Briefly, we remove the association between an exposure and age by subtracting the age-specific mean exposure for each age, and use these age-removed exposure data in the models (see details below). In so doing, we decouple the age association with the exposure and allow the covariate age to account for its association with the outcome alone (Fig. 1). This approach is similar to a strategy in behavioral sciences that disaggregates between-person and within-person effects (13). Here, we tested five regression models combining the above strategies.
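The centering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy values are hypothetical, and taking the age-specific mean over periods within a single sex and race stratum is an assumption.

```python
from collections import defaultdict

def age_mean_center(cells):
    """Return {(age, period): exposure minus the mean exposure of that age group}."""
    by_age = defaultdict(list)
    for (age, _period), value in cells.items():
        by_age[age].append(value)
    age_mean = {age: sum(values) / len(values) for age, values in by_age.items()}
    return {key: value - age_mean[key[0]] for key, value in cells.items()}

# Hypothetical exposure levels for two age groups in two periods.
toy = {
    ("25-29", "1977-1981"): 10.0,
    ("25-29", "1982-1986"): 14.0,
    ("45-49", "1977-1981"): 20.0,
    ("45-49", "1982-1986"): 26.0,
}
centered = age_mean_center(toy)
print(centered)
```

After centering, each age group has mean zero, so the remaining variation reflects differences across periods rather than across ages.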
Study factors
Data source
We obtained study factor data from the National Health and Nutrition Examination Surveys (NHANES; ref. 14), the National Health Interview Surveys (NHIS; ref. 15), and the Behavioral Risk Factor Surveillance System (BRFSS; ref. 16). These three programs have conducted repeated national cross-sectional studies in the United States over several decades (see Table 1 for survey designs and included survey cycles; refs. 17–19). We included the following dietary factors: smoking; the intake of alcohol, tea, coffee, caffeine, whole fruit, fruit juice, total fruit (whole fruit and fruit juice combined), cholesterol, protein, fiber, calcium, magnesium, fat, saturated fat, total energy, and carbohydrate; and serum folate (see Supplementary Table S1 for the availability of the study factors in the surveys and sample sizes; see Supplementary Table S2 for the measurements). Further details on compiling study factor data and handling of periods with no data are described in the Supplementary Methods and Supplementary Figs. S1 and S2.
 | NHANES | NHIS | BRFSS |
---|---|---|---|
Target population | Civilian, noninstitutionalized US population | Civilian, noninstitutionalized US population | US adult population |
Sampling approach | Multistage, probability sampling | Multistage, probability sampling | Sampling based on landline (and also cellular telephone numbers since 2011) |
Data collection | Household interviews and physical examinations | Household interviews | Telephone interviews |
Included surveys | NHANES I (1971–1974), NHANES II (1976 to 1980), NHANES III (1988 to 1994), continuous NHANES (1999–2016) | NHIS (1976, 1977, 1985, 1987, 1988, 1990–1995, and 1997–2016) | BRFSS (1984–2016) |
Abbreviations: BRFSS, Behavioral Risk Factor Surveillance System; NHANES, National Health and Nutrition Examination Surveys; NHIS, National Health Interview Surveys.
Computing study factor levels
We harmonized study factor data from the different surveys and computed the weighted prevalence for each study factor for each population group defined by age, period, race, and sex (20, 21). The study population (whites and blacks aged 25–49 during 1977–2016) was divided into 160 subgroups: five 5-year age groups (25–29, 30–34, …, 45–49) × eight 5-year periods (1977–1981, 1982–1986, …, 2012–2016) × two race groups (whites and blacks) × two sexes (men and women). In addition, we computed the weighted prevalence for the population groups aged 20–24 and for the period 1972–1976 for use in the lagged or cumulative models. See Supplementary Table S3 for specific age and period groups used in each model. Because of small sample sizes, we did not include races other than whites and blacks; we also did not stratify by ethnicity (Hispanic/non-Hispanic), as such information was unavailable from the surveys (e.g., NHANES I and II) or cancer surveillance programs (see below) for earlier periods.
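The 160-subgroup grid described above can be enumerated directly. The sketch below only illustrates the stratification; the label strings are illustrative choices.

```python
from itertools import product

# Five 5-year age groups: 25-29, 30-34, ..., 45-49.
age_groups = [f"{lo}-{lo + 4}" for lo in range(25, 50, 5)]
# Eight 5-year periods: 1977-1981, 1982-1986, ..., 2012-2016.
periods = [f"{start}-{start + 4}" for start in range(1977, 2017, 5)]
races = ["white", "black"]
sexes = ["men", "women"]

subgroups = list(product(age_groups, periods, races, sexes))
print(len(subgroups))  # 5 x 8 x 2 x 2
```

Each sex therefore contributes 80 subgroups, matching the count given for the quintile definitions.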
For the no-lag and lagged models (see below), all exposures were categorized by quintiles, as done in Pfeiffer and colleagues (12). The quintiles were determined on the basis of all population groups (i.e., 80 subgroups for men/women when data were complete). For the other three models (AMC no-lag, AMC lagged, and AMC cumulative; see below), exposures were analyzed as continuous variables, because the AMC-processed exposures no longer spanned a wide range of quintile categories for different age groups and could lead to unstable model estimates using quintiles.
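As an illustration of the quintile categorization, the sketch below assigns each group-level exposure value to one of five categories. The cut-point method (`method="inclusive"`) and the handling of ties are assumptions; the source does not specify them.

```python
import statistics
from bisect import bisect_left

def quintile_labels(values):
    """Label each value 1-5 according to the quintile cut points of `values`."""
    cuts = statistics.quantiles(values, n=5, method="inclusive")  # four cut points
    return [bisect_left(cuts, v) + 1 for v in values]

# Hypothetical exposure levels for ten population groups.
toy_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
labels = quintile_labels(toy_levels)
print(labels)
```

In a regression, the first category would then serve as the reference level, as in Table 2.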
EOCRC incidence
We obtained EOCRC incidence data from the Surveillance, Epidemiology, and End Results (SEER) Program using SEER*Stat (9, 22, 23). To match with the exposures, the EOCRC incidence data were aggregated to the same 160 groups specified above. As the coverage of SEER expanded over time, we used SEER data in two ways. In the main analysis, we used SEER 9, which included nine registries, covered 9.4% of the U.S. population, and provided EOCRC incidence throughout our study period (1977–2016). As a sensitivity analysis, we combined SEER 9 with SEER 13 (13 registries; 13.5% coverage; 1992–2016) and SEER 18 (18 registries; 27.8% coverage; 2000–2016). The SEER program, albeit covering a subset of the U.S. population, is representative of the general U.S. population (24); in addition, SEER started in 1973, earlier than many other national cancer surveillance programs (vs. e.g., the National Program of Cancer Registries starting in 1992).
Statistical analysis
Using the population groups defined above, we applied five negative binomial regression models to examine the association between EOCRC incidence and each study factor for men and women, separately.
No-lag model
Lagged model
The lagged model used the same structure as Eq (A), except that $Z_{a,p,r,q}$ was replaced by $Z_{a-2,p-2,r,q}$. That is, the exposure occurred 10 years before EOCRC diagnosis (i.e., two 5-year periods ago, hence $p-2$ in the subscript) when the EOCRC cases were 10 years younger (i.e., two 5-year age intervals ago, hence $a-2$). The 10-year lag was chosen given the likely induction time (5) and data availability (note that the youngest age group, i.e., 25–29, can no longer be included due to a lack of earlier measurements; details in Supplementary Table S3). In addition, we tested models with a 5-year or 15-year lag to explore patterns across different lags.
AMC no-lag model
AMC lagged model
The AMC lagged model extends the AMC no-lag model to include the time lag from exposure to cancer diagnosis. The AMC lagged model equation is the same as Eq (C) except that $R_{a,p,r}$ is replaced by $R_{a-2,p-2,r}$.
AMC cumulative model
The AMC cumulative model uses exposures summed over the 10 years before cancer diagnosis. The model equation is the same as Eq (C) except that $R_{a,p,r}$ is replaced by $R_{a-1,p-1,r} + R_{a-2,p-2,r}$.
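The index shifts used by the lagged and cumulative models can be sketched as a lookup over cells keyed by the starting year of each age group and period. This is a minimal illustration under assumed data structures; the function names and toy values are hypothetical and do not come from the study.

```python
STEP_YEARS = 5  # width of each age group and period in the source

def shifted_key(age_start, period_start, steps):
    """Key of the cell `steps` five-year intervals earlier in both age and period."""
    return (age_start - steps * STEP_YEARS, period_start - steps * STEP_YEARS)

def lagged_exposure(exposure, age_start, period_start, steps=2):
    """Exposure 10 years earlier when steps=2; None if that cell was not measured."""
    return exposure.get(shifted_key(age_start, period_start, steps))

def cumulative_exposure(exposure, age_start, period_start):
    """Sum of the exposures one and two intervals earlier; None if either is missing."""
    parts = [lagged_exposure(exposure, age_start, period_start, s) for s in (1, 2)]
    if any(p is None for p in parts):
        return None
    return sum(parts)

# Hypothetical values keyed by (age-group start, period start).
toy = {(25, 1977): 0.5, (30, 1982): 0.25}
print(lagged_exposure(toy, 35, 1987))      # cell for ages 25-29 in 1977-1981
print(cumulative_exposure(toy, 35, 1987))  # sum over the two preceding intervals
print(lagged_exposure(toy, 25, 1987))      # no measurement at ages 15-19
```

Returning `None` for unmeasured cells mirrors the point made in the text that some groups drop out of the lagged analyses for lack of earlier measurements.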
Examine the association between age and each study factor
For men and women, separately, we regressed each study factor upon age using 12 groups defined by age and race: six 5-year age groups (20–24, 25–29, …, 45–49) × two race groups (whites and blacks).
Assess the association between each study factor and EOCRC
All models estimated the incidence rate ratio (IRR) of EOCRC in relation to each study factor, including the mean, 95% confidence interval (95% CI), and P value (see Table 2 and Supplementary Table S4). In addition, we used the Bayesian information criterion (BIC) to assess the strength of estimated associations (25). Specifically, for each study factor and model (one of the five described above), we also tested a corresponding null model with all covariates but the study factor. We calculated the BIC for both models and the difference ∆BIC = BIC0 − BICf (BICf for the full model including the study factor and BIC0 for the null model). ∆BIC>0 indicates the EOCRC data are better explained when the study factor is included, thus supporting the association between the study factor and EOCRC. The evidence was deemed weak, positive, strong, and very strong for ∆BICs in the ranges of 0–2, 2–6, 6–10, and >10, respectively (25). ∆BIC<0 implies an absence of such evidence. All data processing and analyses were conducted using R (https://www.r-project.org).
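The ∆BIC grading described above can be expressed as a small helper. The thresholds come from the text; the treatment of values that fall exactly on a boundary (half-open intervals, with 0 counted as weak) is an assumption, since the source does not specify it. The BIC values in the example are hypothetical.

```python
def delta_bic(bic_null, bic_full):
    """Difference BIC0 - BICf; positive values favor including the study factor."""
    return bic_null - bic_full

def evidence_label(d):
    """Grade a delta-BIC value; the boundary handling is an assumption."""
    if d < 0:
        return "No improvement"
    if d < 2:
        return "Weak"
    if d < 6:
        return "Positive"
    if d < 10:
        return "Strong"
    return "Very strong"

# Hypothetical BIC values for a null model and a full model.
d = delta_bic(bic_null=512.0, bic_full=500.0)
print(d, evidence_label(d))
```

Applied to non-boundary values reported in Table 2 (e.g., 0.3, 3.3, 7.7, and 21.4), this grading reproduces the labels shown there.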
 | | | Men | Women | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Study factor | Method | Quintile | IRR (95% CI) | P (quintiles) | P (cont.)d | ΔBIC | Evidence | IRR (95% CI) | P (quintiles) | P (cont.)d | ΔBIC | Evidence |
Smoking | No-lag | 1 | 1 | <0.001 | 2 | Positive | 1 | <0.001 | -5.4 | No improvement | ||
2 | 0.92 (0.82–1.04) | 0.21 | 1.13 (0.99–1.28) | 0.07 | ||||||||
3 | 0.98 (0.87–1.10) | 0.71 | 1.20 (1.06–1.37)b | 0.005 | ||||||||
4 | 1.11 (0.95–1.29) | 0.19 | 1.28 (1.11–1.48)c | <0.001 | ||||||||
5 | 1.26 (1.04–1.53)a | 0.02 | 1.37 (1.13–1.66)b | 0.002 | ||||||||
Lagged | 1 | 1 | <0.001 | -0.9 | No improvement | 1 | <0.001 | 7.7 | Strong | |||
2 | 1.12 (1.02–1.24)a | 0.02 | 1.15 (1.07–1.24)c | <0.001 | ||||||||
3 | 1.13 (1.01–1.27)a | 0.03 | 1.19 (1.07–1.32)b | 0.002 | ||||||||
4 | 1.21 (1.06–1.37)b | 0.004 | 1.26 (1.10–1.43)c | <0.001 | ||||||||
5 | 1.33 (1.14–1.55)c | <0.001 | 1.39 (1.18–1.62)c | <0.001 | ||||||||
AMC no-lag | 1.14 (1.05–1.23)c | <0.001 | 6.1 | Strong | 1.12 (1.03–1.21)b | 0.005 | 3.3 | Positive | ||||
AMC lagged | 1.17 (1.10–1.24)c | <0.001 | 21.4 | Very strong | 1.15 (1.09–1.21)c | <0.001 | 21 | Very strong | ||||
AMC cumulative | 1.20 (1.13–1.29)c | <0.001 | 24.9 | Very strong | 1.19 (1.12–1.27)c | <0.001 | 27.8 | Very strong | ||||
Alcohol | No-lag | 1 | 1 | 0.5 | -15.5 | No improvement | 1 | 0.81 | -4.9 | No improvement | ||
2 | 0.98 (0.88–1.09) | 0.72 | 0.87 (0.78–0.96)b | 0.009 | ||||||||
3 | 1.02 (0.91–1.15) | 0.69 | 0.87 (0.77–0.97)a | 0.01 | ||||||||
4 | 0.99 (0.89–1.11) | 0.92 | 0.98 (0.89–1.09) | 0.76 | ||||||||
5 | 1.03 (0.92–1.16) | 0.57 | 0.92 (0.83–1.03) | 0.14 | ||||||||
Lagged | 1 | 1 | <0.001 | 8.1 | Strong | 1 | <0.001 | 16.3 | Very strong | |||
2 | 1.10 (0.97–1.24) | 0.13 | 1.04 (0.94–1.15) | 0.45 | ||||||||
3 | 1.17 (1.05–1.31)b | 0.005 | 1.02 (0.91–1.13) | 0.77 | ||||||||
4 | 1.27 (1.13–1.43)c | <0.001 | 1.15 (1.05–1.26)b | 0.003 | ||||||||
5 | 1.28 (1.13–1.46)c | <0.001 | 1.23 (1.11–1.37)c | <0.001 | ||||||||
AMC no-lag | 1.03 (1–1.06)a | 0.03 | 0.3 | Weak | 1 (0.97–1.04) | 0.81 | -4.2 | No improvement | ||||
AMC lagged | 1.08 (1.04–1.12)c | <0.001 | 11.3 | Very strong | 1.08 (1.04–1.11)c | <0.001 | 15.6 | Very strong | ||||
AMC cumulative | 1.06 (1.03–1.09)c | <0.001 | 14.5 | Very strong | 1.07 (1.04–1.11)c | <0.001 | 17.7 | Very strong | ||||
Calcium | No-lag | 1 | 1 | 0.44 | 6 | Positive | 1 | 0.008 | 16.2 | Very strong | ||
2 | 0.91 (0.81–1.01) | 0.08 | 0.87 (0.78–0.98)a | 0.02 | ||||||||
3 | 0.85 (0.74–0.97)a | 0.01 | 0.81 (0.71–0.92)b | 0.001 | ||||||||
4 | 0.88 (0.76–1.03) | 0.1 | 0.74 (0.62–0.87)c | <0.001 | ||||||||
5 | 0.96 (0.77–1.19) | 0.69 | 0.80 (0.63–1.01) | 0.06 | ||||||||
Lagged | 1 | 1 | 0.29 | -3.7 | No improvement | 1 | 0.04 | 13.5 | Very strong | |||
2 | 0.86 (0.77–0.96)b | 0.006 | 0.92 (0.83–1.02) | 0.12 | ||||||||
3 | 0.84 (0.74–0.97)a | 0.02 | 0.80 (0.70–0.91)c | <0.001 | ||||||||
4 | 0.85 (0.73–0.99)a | 0.04 | 0.81 (0.69–0.95)b | 0.01 | ||||||||
5 | 0.85 (0.70–1.04) | 0.12 | 0.90 (0.73–1.10) | 0.31 | ||||||||
AMC no-lag | 1 (0.96–1.04) | 0.86 | -4.2 | No improvement | 0.90 (0.84–0.97)b | 0.008 | 2.2 | Positive | ||||
AMC lagged | 0.98 (0.95–1.01) | 0.12 | -1.4 | No improvement | 1.03 (0.95–1.11) | 0.46 | -3.3 | No improvement | ||||
AMC cumulative | 0.98 (0.94–1.02) | 0.33 | -3.1 | No improvement | 0.98 (0.90–1.06) | 0.6 | -3.8 | No improvement |
Note: Five models (2nd column) were used to estimate incidence rate ratio (IRR) of EOCRC for each unit/quintile change in each study factor (1st column). Of the five models, the no-lag and lagged models used study factor levels defined by quintile (3rd column) as input and estimated the IRR for each quintile, relative to the first quintile (i.e., the reference); the age-mean centering (AMC) models used continuous values as input and estimated the IRR for each unit change of the study factor. Of note, the estimated IRRs before and after AMC are not comparable because the exposure was on different scales. P values, and ΔBIC and assessment of the strength of evidence (see details in main text) are also shown for each model. IRRs were adjusted for age, period, and race.
aP < 0.05;
bP < 0.01;
cP < 0.001;
dFor the no-lag and lagged models, we generated P- (cont.) by treating the median of the quintiles as a continuous variable in regression.
Method validation
To test the models, we performed two sets of model validation. First, we tested the models on model-generated synthetic data, for which the underlying associations are known and thus can be compared with model estimates. Second, we applied the models to older age groups (i.e., 35–59-year-olds) and a subset of well-studied exposures (smoking, alcohol consumption, and calcium intake; refs. 6, 26). For details, see Supplementary Methods, Supplementary Tables S5–S7, and Supplementary Figs. S3–S5.
Data availability statement
Results
Method validation
As detailed in the Supplementary Methods, synthetic testing showed that both the lagged and AMC-lagged models were able to accurately identify the true direction of association in most tests (overall accuracy: 79% and 80% by the lagged and AMC-lagged models, respectively; Supplementary Figs. S4 and S5). When the association between EOCRC and exposure was close to the null (i.e., IRRs close to 1), the AMC-lagged model was more accurate than the lagged model (71% vs. 65% accuracy; Supplementary Fig. S5), suggesting that the AMC approach may alleviate potential biases and thus estimate the true association more accurately. Furthermore, model results for those aged 35–59 were generally consistent with findings in the literature [i.e., positive associations of CRC with smoking and alcohol consumption and a negative association with calcium intake, primarily based on cases 50 years and older (6, 26)]; see the red cells (representing positive association) for smoking and alcohol and blue cells (negative association) for calcium in Supplementary Fig. S6 and Supplementary Table S7 for specific estimates.
Effect of AMC on estimated associations
We designed AMC to address potential age confounding between study factors and EOCRC. In multiple instances, changes in the estimated associations after AMC were consistent with expectations. For instance, as calcium intake decreased with age (Supplementary Fig. S7) while CRC increased with age, age confounding could bias the estimated association between calcium intake and EOCRC towards the negative (Supplementary Table S5). Indeed, without removing the negative association between calcium intake and age, the no-lag and lagged models estimated negative associations (see blue cells in Fig. 2) with larger ∆BICs (∆BIC>6 except for men using the lagged model; Table 2), indicating stronger evidence for this association. In comparison, the AMC models, designed to remove the age association with calcium intake, generally estimated negative associations with lower ∆BICs, indicating weaker evidence for this association.
For tea, coffee, and caffeine, intake generally increased with age (Supplementary Fig. S7), which could nudge the estimated association with EOCRC towards the positive (Supplementary Table S5). Indeed, without removing the age association with these exposures, in multiple instances, the no-lag and lagged models estimated positive associations for these exposures (see red cells in Fig. 2). In contrast, with AMC, the models in general estimated negative or no association (see light blue or white cells in Fig. 2).
Association between study factors and EOCRC
For smoking, the models found a positive association with EOCRC for both men and women (Fig. 2). For 25–49-year-old men, the no-lag model estimated that IRRs were 1.11 (95% CI, 0.95–1.29) and 1.26 (95% CI, 1.04–1.53) for the top two quintiles (Table 2). When smoking prevalence 10 years before EOCRC diagnosis was used, the lagged model estimated that IRRs increased from 1.12 (95% CI, 1.02–1.24) for the second quintile to 1.33 (95% CI, 1.14–1.55) for the fifth. Consistently, estimated IRRs were 1.14 (95% CI, 1.05–1.23) per the AMC no-lag, 1.17 (95% CI, 1.10–1.24) per the AMC lagged, and 1.20 (1.13–1.29) per the AMC cumulative models. For the three AMC models, comparison with the corresponding null models showed strong to very strong support for this association (∆BIC ranged from 6.1 to 24.9; Table 2).
For alcohol consumption, the models also generally found a positive association with EOCRC for both young men and women (Fig. 2). For 25–49-year-old men, the lagged model estimated the IRRs increased from 1.10 (95% CI, 0.97–1.24) for the second quintile to 1.28 (95% CI, 1.13–1.46) for the fifth; for the AMC lagged and AMC cumulative models, estimated IRRs were 1.08 (95% CI, 1.04–1.12) and 1.06 (95% CI, 1.03–1.09), respectively (Table 2). The three models incorporating the time-lag also outperformed their corresponding null models (∆BICs ranged from 8.1 to 14.5; Table 2), further supporting the association. Models without the time-lag generally found no association for alcohol consumption (except the AMC no-lag model for men).
For the intake of whole fruit, fruit juice, and total fruit, the estimated associations with EOCRC tended to be negative, but the overall evidence was not strong (Fig. 2). For the intake of cholesterol, protein, fiber, and magnesium, the estimated associations with EOCRC were either nonsignificant or too inconsistent across models for meaningful interpretation (Fig. 2).
The estimated associations between a few study factors and EOCRC were unexpected: negative associations for fat, total energy, and carbohydrate intake, and a positive association for serum folate (Fig. 2).
Model results using EOCRC data combining SEER 9, 13, and 18 were similar to those above using SEER 9 data alone (Supplementary Fig. S8; Supplementary Table S8). Results from models using different lags were also similar to the main analyses using a 10-year lag; we did not find any clear pattern (Supplementary Fig. S9) except for alcohol, for which the IRRs were the largest with a 10-year lag.
Discussion
To explore reasons underlying the recent increases in EOCRC incidence, we have examined the ecologic association between EOCRC and 18 dietary factors. Given the ecologic nature of the study, model results represent a first assessment to generate hypotheses regarding potential risk factors to inform more in-depth investigation. Overall, we found that smoking and alcohol consumption starting in young adulthood were positively associated with EOCRC. While these exposures are long-established carcinogens for many cancers including colorectal cancer (26, 27), most studies are based on older populations and mid to late life exposure (26, 28). Given the likely long induction time (5), our findings suggest that primary prevention strategies for EOCRC, which are urgently needed, should incorporate tobacco and alcohol control measures targeting younger populations. The findings also suggest smoking and alcohol consumption may be important risk factors for identifying young adults for early screening and detection of EOCRC in clinical settings.
The contributions of smoking and alcohol consumption to the recent increases in EOCRC, however, likely differ. As shown in Fig. 3, smoking prevalence has been decreasing significantly in recent decades (see details of the break-point trend analysis in the Supplementary Methods), suggesting changes in smoking are likely not the reason behind the recent EOCRC increases. In contrast, alcohol consumption decreased significantly from 1971 to around 1980, consistent with the decrease of EOCRC incidence from 1973 to the early 1990s; alcohol consumption then increased from the 1980s onwards, albeit not statistically significantly, followed by the increases in EOCRC incidence since the 1990s (Fig. 3). These lagged, concordant trends of alcohol consumption and EOCRC incidence resemble the parallel trends in smoking and lung cancer that have strongly supported smoking as a main cause of lung cancer (26). Consistently, using the approach in Fig. 3 of Pfeiffer and colleagues (12), we showed that, compared with the adjusted EOCRC incidence setting alcohol consumption at the lowest quintile, for both men and women, the observed EOCRC incidence was higher from 1992 onwards and the gap reached the maximum during recent periods (e.g., 2012–2016), when alcohol consumption levels were the highest (see Supplementary Fig. S10 and details in Supplementary Methods). Given these analyses, we hypothesize that the increase in alcohol consumption is a key contributor to the recent EOCRC incidence increases. Further investigation is warranted while teasing out the effect of other potential risk factors.
We found some, albeit weak, evidence for negative associations of caffeine, whole fruit, fruit juice, and total fruit intake with EOCRC (18/24 of the IRRs in the range of 0.95–0.99 after AMC). The literature on biological effects of these dietary factors also suggests negative associations (29–31). More in-depth investigation into the potential role of fruit and caffeine using stronger epidemiologic designs may thus prove fruitful for EOCRC prevention.
For fiber, calcium, and magnesium intake, we found either no or weak negative association with EOCRC. In contrast, epidemiologic studies among older adults suggest these nutrients are protective against colorectal cancer (6, 32). For instance, an umbrella review of meta-analyses of cohort studies found convincing evidence for a negative association of colorectal cancer with fiber and calcium intake, separately, and some evidence of a negative association with magnesium (6). Unlike previous studies using cohorts, we used aggregated population-level data, due to the challenges in studying EOCRC noted in the Introduction. This ecologic design may be less powered to identify milder risk factors, especially for younger populations (e.g., aged 25–49 here). Moreover, unlike other study factors (e.g., smoking), fiber and magnesium data were unavailable during 1972–1987, further reducing the sample sizes and statistical power. Nonetheless, the direction of our estimates for calcium and magnesium (see the blue cells indicating negative associations in Fig. 2) is consistent with previous findings.
Importantly, we note that fiber, calcium, and magnesium intake among blacks was significantly lower than that among whites (P < 0.001, paired t test; Supplementary Figs. S11–S13), and also considerably lower than the recommended levels per current dietary guidelines (33). Supporting the disparities in intake of these nutrients and their potential impact on EOCRC, models including these nutrients partly explained the higher incidence for blacks than whites (e.g., estimated IRRs for black compared with white men: 1.03–1.16 vs. 1.19–1.25 using the lagged model with vs. without one of these nutrients; Supplementary Table S9). Given the higher EOCRC incidence among blacks and the multiple health benefits of these nutrients, these findings suggest that increasing the intake of these nutrients may help mitigate EOCRC risk among blacks.
Some of our findings were at odds with the literature. In particular, for colorectal cancer, past studies found positive associations with high-fat diet and total energy intake (32, 34, 35), no association with carbohydrate intake (36), and negative associations with folate intake (37, 38). Model estimates here were inconsistent with these previous findings, particularly for young men, which highlights limitations in this ecologic analysis. Nonetheless, we note that while EOCRC increased during the latter part of our study period (from the 1990s onwards), fat intake had been decreasing among young men (Supplementary Fig. S14). Similar time trends were observed for total energy and carbohydrate (Supplementary Figs. S15 and S16). These trends suggest that, at the population level, the changes in fat, total energy, and carbohydrate intake are likely not associated with the recent increases in EOCRC. For folate intake, serum folate concentration increased during 1987–2016, likely due to the folic acid fortification program implemented in 1998 (ref. 39; Supplementary Fig. S17; see Supplementary Table S2 for reasons for excluding earlier serum folate data); this coincided with the increases in EOCRC during the same period. The positive association between serum folate and EOCRC may have been an artefact of such concurrent changes. We thus caution readers about the above limitations, even though ecologic studies can be invaluable for examining potential risk factors by taking advantage of long-term population data. Furthermore, we advocate for comprehensive result interpretation combining ecologic modeling results, findings from the literature, and careful inspection of underlying data, as demonstrated here.
We note several study limitations, apart from the ecologic design. First, while our analysis included cigarette smoking, other forms of tobacco consumption were not included due to a lack of long-term data. For example, e-cigarettes have gained popularity among youth and young adults in the United States in the 2010s. The potential impacts of such exposure, particularly during critical development periods, warrant future investigations. Second, due to challenges in converting and harmonizing intake of various vegetable items (e.g., inconsistent classification/inclusion schemes and definitions of serving size; refs. 40, 41), we were unable to analyze the association of EOCRC with total vegetable intake. Third, this study focused on testing the proposed methods and dietary factors. Future work will extend to non-dietary factors, including those that have been found to affect colorectal cancer risks among older adults (e.g., body weight and physical exercise; ref. 5). Fourth, this study estimated the marginal effect of each study factor, as done in Pfeiffer and colleagues (12). Future work considering potential interactions among various study factors is under way. Fifth, while our models accounted for and estimated the age and period effects, they were formulated to incorporate the risk factor data and estimate their associations with EOCRC, rather than as conventional age-period-cohort models (42), and thus do not enable estimation of birth cohort effects.
In sum, we found that alcohol consumption was strongly associated with EOCRC incidence and has increased since the 1980s, which may have contributed to recent EOCRC increases among U.S. adults aged 25–49. We have also proposed an AMC approach, which may be applied in ecologic studies of risk factors and other diseases where large-cohort data are unavailable.
Authors' Disclosures
J. Chen reports grants from Data Science Institute and Irving Institute for Cancer Dynamics Seed Funds Program at Columbia University and grants from NCI during the conduct of the study. I.L. Zhang reports grants from Data Science Institute and Irving Institute for Cancer Dynamics Seed Funds Program and grants from NCI during the conduct of the study. W. Yang and M.B. Terry report grants from the Data Science Institute and Irving Institute for Cancer Dynamics Seed Funds Program at Columbia University and NCI during the conduct of the study.
Authors' Contributions
J. Chen: Resources, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. I.L. Zhang: Resources, data curation, software, formal analysis, validation, investigation, visualization, writing–original draft, writing–review and editing. M.B. Terry: Conceptualization, methodology, writing–review and editing. W. Yang: Conceptualization, resources, data curation, supervision, funding acquisition, investigation, methodology, project administration, writing–original draft.
Acknowledgments
This study was supported by the Data Science Institute and Irving Institute for Cancer Dynamics Seed Funds Program at Columbia University and the NCI (grant number: R01CA257971).
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Epidemiology, Biomarkers & Prevention Online (http://cebp.aacrjournals.org/).