## Abstract

Obtaining detailed individual-level data on both exposure and cancer outcomes is challenging, and it is difficult to understand and characterize how temporal aspects of exposures translate into cancer risk. We show that, in lieu of individual-level information, population-level data on cancer incidence and etiologic agent prevalence can be leveraged to investigate cancer mechanisms and to better characterize and predict cancer trends. We use mechanistic carcinogenesis models [multistage clonal expansion (MSCE) models] and data on smoking, *Helicobacter pylori* (*H. pylori*), and HPV infection prevalence to investigate trends of lung, gastric, and HPV-related oropharyngeal cancers. MSCE models are based on the initiation–promotion–malignant conversion paradigm and allow for interpretation of trends in terms of general biological mechanisms. We assumed the rates of initiation depend on the prevalence of the corresponding risk factors. We performed two types of analysis, using the agent prevalence and cancer incidence data to estimate the model parameters and using cancer incidence data to infer the etiologic agent prevalence as well as the model parameters. By including risk factor prevalence, MSCE models with as few as three parameters closely reproduced 40 years of age-specific cancer incidence data. We recovered trends of *H. pylori* prevalence in the United States and demonstrated that cohort effects can explain the observed bimodal, age-specific pattern of oral HPV prevalence in men. Our results demonstrate the potential for joint analyses of population-level cancer and risk factor data through mechanistic modeling. This approach can be a first step in systematically testing relationships between exposures and cancer risk when individual-level data is lacking.

**Significance:** Analysis of trends in risk-factor prevalence and cancer incidence can shed light on cancer mechanisms and the way that carcinogen exposure through time shapes the risk of cancer at different ages.

**Graphical Abstract:** http://cancerres.aacrjournals.org/content/canres/78/12/3386/F1.large.jpg. *Cancer Res; 78(12); 3386–96. ©2018 AACR*.

Cancer arises from a series of rare, stochastic events, and the numbers of premalignant and malignant cells is modeled as a continuous-time Markov chain (Fig. 1).

The first initiating mutation is modeled as a nonhomogeneous Poisson process. We assume that this rate is dependent on the age-specific prevalence of the etiologic agent in the population.

Clonal expansion and malignant conversion of initiated cells is modeled as a birth–death–mutation process.

The model hazard corresponds to age-specific cancer incidence data.

Modeled cancer survival *x*_{1}(*t*) is defined as the probability that an individual has no malignant cells at age *t*. The model hazard is defined as

and corresponds to age-specific cancer incidence data. The per-cell initiating mutation rate *ν*(*t*) can be written as

where *ν*_{0} is the baseline initiation rate, *P*(*t*) is the prevalence of the etiologic agent at age *t*, and *ρ* is the relative risk of initiation given exposure to the etiologic agent. If *α* is the cell growth rate, *β* the cell death rate, *μ*_{1} the malignant conversion mutation rate, and *X* the number of normal cells in the tissue, then the survival and hazard of the two-stage clonal expansion model at age *t* are found by numerically solving for *x*_{1}(*t*) and *x*_{2}(*t*), respectively, in the following system of equations, in which *x*_{3}(*t*) and *x*_{4}(*t*) can be treated as dummy variables,

with initial conditions

## Introduction

Risk factors for cancer can be difficult to assess because of the long time that it takes for cancer to develop and the long lag time between exposures to etiologic agents and cancer onset and detection. It is even more challenging to further characterize how different temporal patterns of exposure translate into age-specific cancer risk. For instance, how does the risk associated with an exposure at one age compare with that for the same exposure at a different age? Or, how does risk compare for two people with the same cumulative exposures but very different temporal distributions of that exposure? For most risk factors, especially those whose association or causal relation with cancer are not well established, the answers to these questions are usually unknown, with very limited individual data available to study these issues. We must then resort to analyses of population-level data of cancer incidence and risk factor prevalence as a first step to study the temporal relationship between exposures and cancer risk.

Modeling the connection between population-level prevalence of etiologic agents and population-level incidence of cancer is particularly challenging because of the multiple spatial and temporal scales involved (e.g., from cell level to population level and exposure timescale to carcinogenesis timescale). In particular, time of exposure, duration of exposure, dependence on age, and the dose–response relationship may all impact the risk of cancer onset and progression. Multistage clonal expansion (MSCE) models are a family of mechanistic models that provide a framework to integrate time-varying exposures into the analysis of cancer epidemiologic data. The initiation–promotion–malignant conversion hypothesis (1, 2) posits that cancer is the accumulation of rare events, nominally mutations; after accumulating a sufficient number of mutations (initiation), the tumor expands clonally (promotion), and a final mutation transforms a cell to malignancy (malignant conversion). On the basis of this framework, MSCE models seek to explain cancer incidence patterns in terms of the modeled, underlying biological carcinogenesis mechanism. Several studies have leveraged MSCE models to look at time-varying exposures in prospective individual-level data (e.g., of radiation, smoking, and benzene exposure; refs. 3–5). One recent study has used an MSCE model to connect population-level risk factor data to cancer-incidence: Hazelton and colleagues (6) investigated the connection between prevalence of symptomatic gastroesophageal reflux disease (sGERD) to incidence of esophageal adenocarcinoma (EAC). Estimation of model parameter values from cancer incidence data has been a focus of MSCE research both because parameter estimates are necessary for cancer incidence rate prediction and the evaluation of interventions such as screening (e.g., 7, 8) and because they can help explain the underlying mechanistic reasons for health disparities (e.g., 9–12).

To show the potential of combining population-level risk factor and cancer data through mechanistic modeling, we consider two kinds of time-varying cancer risk factors for which data at the population level is sometimes available: infectious agents and tobacco use. Approximately 15%–20% of cancer-related deaths worldwide are attributed to infectious agents, including the human papillomavirus (HPV), *Helicobacter pylori* (*H. pylori*), hepatitis B and C viruses, Epstein–Barr viruses, Kaposi sarcoma herpes virus, and liver flukes (*Opisthorchis viverrini*), among others (13). Some infectious agents, particularly viruses, can cause cancer through direct destabilization of normal cell controls, while other agents cause chronic inflammation or suppress the immune system in ways that can indirectly lead to carcinogenesis (14). Another 25%–30% of cancer-related deaths can be attributed to tobacco use (13); although known to be the etiologic agent of most lung cancers, tobacco use is also a risk factor for many other cancers (liver, head and neck, and colorectal cancer in particular; ref. 15).

In this article, we investigate how much can be learned by coupling population-level risk factor prevalence data to cancer incidence data through mechanistic modeling of carcinogenesis and statistical analyses of cancer trends. In particular, we evaluate whether the use of risk factor data improves the estimation and inference of mechanistic model cancer parameters that represent the rates of cancer initiation, promotion, and malignant conversion. By assuming that the tumor initiation rate depends on the risk factor prevalence, we connect population-level data on the etiologic agent prevalence to population-level data on cancer incidence in three case studies: *H. pylori* and intestinal-type noncardia gastric adenocarcinoma (GAC), smoking and malignant neoplasms of the bronchus and lung (LC), and HPV and HPV-related oral (oropharyngeal and oral cavity) squamous cell carcinoma (OSCC). We not only use prevalence data to inform models of cancer incidence but determine whether cancer incidence can help us to estimate historic agent prevalence patterns.

## Materials and Methods

### Data

#### Cancer incidence.

We consider cancers reported to the Surveillance, Epidemiology, and End Results (SEER) cancer registries (16), using SEER 9 data from 1973–2014. We use the International Classification of Diseases (ICD) and histologic subtype codes to identify cases of GAC (C16.1–16.6, C16.8–16.9 of type M8010, M8140, M8211, or M8144), HPV-related OSCC (C01.9, C02.4, C09.0, C09.1, C09.8–09.9, C10.0–10.4, C10.8, C10.9, C14.2), and LC (C34).

#### Risk factor prevalence.

Yeh and colleagues (17) reported cohort prevalence of *H. pylori* at age 20 for men in the National Health and Nutrition Examination Survey (specifically in NHANES III and Continuous NHANES; ref. 18); we smooth this data using natural splines. Smoking prevalence in the United States by cohort was estimated using the data generated by Holford and colleagues (19) and available here (20). Age-specific cervicogenital prevalence of HPV was previously reported by Brouwer and colleagues (21). Prevalences of *H. pylori*, smoking, and cervicogenital HPV are shown in the Supplementary Material (Supplementary Fig. S1A–S1D).

### Mechanistic models

#### MSCE models with population-level risk factors.

Multistage clonal expansion models, a class of continuous-time Markov chain models, were developed under the initiation–promotion–malignant conversion paradigm and have been extended to include multiple preinitiation steps (1, 2, 22, 23). We modify both two-stage and three-stage models by making the rates of the first destabilizing mutations dependent on the population-level prevalence of the associated etiologic agent (an assumption first used by Hazelton and colleagues; ref. 6). The model variables and parameters are given in Table 1, and a schematic of the two-stage model is shown in Fig. 1A. Modeled cancer survival *x*_{1}(*t*) is defined as the probability that an individual has no malignant cells at age *t*. We do not include an explicit reporting rate to model the time between the first malignant cell and tumor detection. Although modeling such a rate is useful when modeling individual-level data, it is not separately identifiable in practice for population-level data. In this model, the tumor detection rate is wrapped into the malignant conversion rate.

The model hazard is defined as

and corresponds to age-specific cancer incidence data. The survival and hazard of the two-stage clonal expansion model at age *t* are found by numerically solving for *x*_{1}(*t*) and *x*_{2}(*t*) in Eq. A, where *x*_{3}(*t*) and *x*_{4}(*t*) can be treated as dummy variables,

with initial conditions |${x_1}( 0 ) = 1,{x_2}( 0 ) = {x_3}( 0 ) = 1,{x_4}( 0 ) = - {\mu _1}.$| The equations for the three-stage clonal expansion model, technical details of the models, and theoretical proofs are left to the Supplementary Material.

Parameters . | |
---|---|

X | Number of normal cells, X = X(0) |

ν_{0} | Baseline per-cell mutation rate for normal cells (asymmetric division) |

ρ | Relative risk of an initiating mutation in the presence of the etiologic agent |

σ | Initiation rate due to the etiologic agent, σ = ν_{0}(ρ − 1) |

P(t) | Population-level, age-specific prevalence of the etiologic agent |

α | Initiated cell clonal expansion rate (symmetric division) |

β | Initiated cell death rate |

μ_{1} | Initiated cell malignant conversion rate (asymmetric division) |

Identifiable parameter combinations | |

p | Net cell proliferation (Eqs. D and G) |

q | Malignant conversion (Eqs. D and G) |

r | Background tumor initiation (Eqs. E and H) |

s | Agent-related tumor initiation (Eqs. F and I) |

Parameters . | |
---|---|

X | Number of normal cells, X = X(0) |

ν_{0} | Baseline per-cell mutation rate for normal cells (asymmetric division) |

ρ | Relative risk of an initiating mutation in the presence of the etiologic agent |

σ | Initiation rate due to the etiologic agent, σ = ν_{0}(ρ − 1) |

P(t) | Population-level, age-specific prevalence of the etiologic agent |

α | Initiated cell clonal expansion rate (symmetric division) |

β | Initiated cell death rate |

μ_{1} | Initiated cell malignant conversion rate (asymmetric division) |

Identifiable parameter combinations | |

p | Net cell proliferation (Eqs. D and G) |

q | Malignant conversion (Eqs. D and G) |

r | Background tumor initiation (Eqs. E and H) |

s | Agent-related tumor initiation (Eqs. F and I) |

NOTE: Parameters and identifiable parameter combinations of a multistage clonal expansion model with two-stages and population-level etiologic agent prevalence

One way to formulate the initiation rate is as the sum of the baseline mutation rate times the probability of not being exposed to the etiologic agent and the baseline mutation rate times the probability of being exposed to the etiologic agent times the relative risk given exposure:

We can reparameterize this equation as

where |$\sigma = {\nu _0}( {\rho - {\rm{ }}1} ).$| This equation also represents the alternate formulation where the initiation rate is the sum of an initiation rate related to the etiologic agent and the baseline initiation rate related to all other factors. This formulation allows us to estimate the fraction of cases resulting from the etiologic agent versus other factors. When estimating *P*(*t*) instead of using data, *σ/ν*_{0} and *P*(*t*) form an identifiable combination, meaning that their product can be estimated from the incidence data but that their values cannot individually be estimated (see Corollary 1 in the Supplementary Material). Thus, when estimating *P*(*t*), we determine the relative prevalence between different ages or birth-cohorts rather than the absolute prevalence. The error structure for estimates of *P*(*t*) will depend on its parameterization (e.g., number of spline knots).

For the two-stage model, four parameters can be estimated (i.e., are identifiable) from prevalence and incidence data, namely

For rare mutations (μ_{1} << 1) and biologically reasonable growth rates, parameter *p*_{2} is approximately equal to −(*α*−*β*−*μ*_{1}), that is, the negative of the net cell proliferation rate, and *q*_{2} is on the order of the malignant conversion rate *μ*_{1}. Parameters *r*_{2} and *s*_{2} are related to the background and agent-related tumor initiation rates.

We also consider the three-stage model, which is identical to the two-stage model except for an additional preinitiation stage before clonal expansion (Fig. 1B, details in the Supplementary Material; ref. 24). In this case, we can also estimate four parameters, namely,

Age-specific cancer incidence data corresponds to the model hazard or incidence function (details in the Supplementary Material). The age-specific etiologic agent prevalence *P*(*t*) can be either be taken as known through data or it can be treated as an unknown to be estimated.

The two- and three-stage models with constant parameters are associated with distinct age-specific model hazard shapes (23). The two-stage model flattens out with increasing age and can display peaks when there are cohort effects. The three-stage model, on the other hand, is characterized by a linear phase and does not decrease (on the time-span of human lives) without cohort effects. The choice of model is typically based on incidence shape, and likelihood-based model selection is used when needed. Biological evidence can be a useful corroboration.

In this formulation, we model the etiologic agent as impacting the initiation rate. In previous work, we considered time-varying promotion or malignant conversion (12), and here we considered a model where the etiologic agent impacts the promotion rate instead. However, that model generally did not fit the data as well and we will not discuss the results in detail here (selected results are presented in the Supplementary Material).

### Analytic framework

We conduct two types of analyses, frameworks for which are depicted in Fig. 2. In the first (forward framework), we assume that the agent prevalence is known. We then use the agent prevalence and cancer incidence data to estimate the MSCE model parameters, including the background cancer initiation (*r*), the initiation rate due to the etiologic agent (*s*), and the tumor promotion (*p*), and malignant conversion rates (*q*). In the second type of analysis (inverse framework), we use only cancer incidence data to infer the etiologic agent prevalence (*P*(*t*)) as well as the MSCE model parameters, including the relative background and agent-dependent initiation rates.

### Case studies

We use three case studies as illustrative examples: *H. pylori* and gastric cancer, smoking and lung cancer, and HPV and oral cancer.

### Case study 1: *H. pylori* and gastric cancer

As *H. pylori* infections are known to occur in childhood and persist unless treated, we assume that *H. pylori* infection status does not vary over the lifetime (*P*(*t*) ≡ *P*), as in ref. 17, but that prevalence varies by birth cohort. Gastric cancer is modeled with the three-stage MSCE model with age-independent parameters and prevalence of *H. pylori* affecting both the preinitiation and initiation rates (24). The three-stage model was chosen to model gastric cancer because its age-specific incidence is consistent with the three-stage model hazard shape, that is, the incidence does not peak or flatten out (23). We first estimate the model parameters using the NHANES *H. pylori* data for both men and women, and the SEER gastric cancer incidence data (forward framework). Later, *H. pylori* prevalence is estimated along with model parameters from cancer incidence alone (inverse framework).

### Case study 2: Smoking and lung cancer

Smoking prevalence varies by age and birth cohort. Lung cancer is modeled with the three-stage MSCE model with age-independent biological parameters and with prevalence of smoking affecting the initiation and preinitiation rates. Comparison of two- and three-stage model fits indicated that lung cancer incidence was more consistent with the three-stage model. Carcinogenesis model parameters are estimated using the modeled smoking prevalence (forward framework). We do not use the inverse framework for lung cancer because the shape of age-specific smoking prevalence effects changes significantly over the available data (Supplementary Fig. S1B and S1C), making the estimation problem computationally infeasible for this case study.

### Case study 3: HPV and HPV-related oral cancer

Studies of age-specific oral HPV prevalence, which have only recently begun, do not span enough years to disentangle age and cohort effects. Hence, we assume that the age-specific prevalence of oral HPV has the same shape as female cervicogenital HPV prevalence (1970 reference cohort). Birth cohort patterns on oral HPV prevalence are allowed to be different for men and women and be different from the genital cohort effects. HPV-related oral cancer is modeled with a two-stage MSCE model because the cancer incidence peaks and comes down, consistent with a two-stage model with cohort effects. Birth cohort effects on HPV prevalence, in addition to carcinogenesis model parameters, are estimated for men and women (inverse framework). Because oral HPV has only been tested for since 2009, we cannot use the forward framework for the HPV-related oral cancer case study.

### Model fitting and parameter estimation

Following the usual age–period–cohort framework (12, 25–28), we assume that the prevalence of the etiologic agent is related to both the age *t* and birth cohort *c*,

where *g _{A}* and

*g*are functions (natural splines, here) of age

_{C}*t*and cohort

*c*and (spline) parameters

*θ*and

_{t}*θ*, respectively. The age- and cohort-specific incidence rate

_{c}*λ*is given by the model hazard,

where *π* represents all of the parameters of the MSCE model (see formulation of model hazard *h* in the Supplementary Material). Hence, in this formulation, age and cohort effects are on the prevalence of the etiologic agent, not on the carcinogenesis parameters directly (as they were in our previous work; ref. 12).

We assume that cancers cases are independent and Poisson-distributed, so that the likelihood for observed cases {*x _{i}*} with corresponding population-at-risk sizes {

*n*} under these models is given by

_{i}where

We minimized the negative log-likelihood of the model with a Davidson–Fletcher–Powell algorithm in the R (v3.3.1) and gFortran versions of the Bhat package (29).

## Results

Model fits to age-specific cancer incidence data by birth cohort are presented in Fig. 3A–F and are discussed in the following sections. Corresponding plots of predicted cancer incidence by birth cohort are available in the Supplementary Material (Supplementary Fig. S2A–S2F).

### Gastric cancer

#### Estimation of cancer parameters using cancer incidence and *H. pylori* data (forward framework).

A likelihood-ratio test (*p* < 0.01) indicated that the model was fit best by assuming that all GAC cases were related to the *H. pylori* pathway (i.e., *r*_{3} ≡ 0) for both men and women. Estimated parameters *p*_{3}, *q*_{3}, *s*_{3} are listed in Table 2.

Cancer . | Model . | Prevalence . | Sex . | p ×10^{1} (95% CI)
. | q × 10^{4} (95% CI)
. | r × 10^{3} (95% CI)
. | s × 10^{2} (95% CI)
. |
---|---|---|---|---|---|---|---|

Gastric | 3-stage | Assumed | Male | −1.29 (−1.37 to −1.21) | 1.42 (1.16 to 1.75) | ^{a} | 2.07 (1.95 to 2.19) |

Female | −0.81 (−0.89 to −0.74) | 3.49 (3.07 to 3.97) | ^{a} | 2.35 (2.03 to 2.72) | |||

Fit | Male | −1.31 (−1.43 to −1.19) | 2.42 (1.77 to 3.31) | ^{a} | 1.83 (1.69 to 1.97) | ||

Female | −0.71 (−0.82 to −0.61) | 6.05 (4.99 to 7.33) | ^{a} | 2.34 (1.90 to 2.90) | |||

Lung | 3-stage | Assumed | Male | −1.84 (−1.87 to −1.82) | 2.63 (2.49 to 2.78) | 2.21 (2.06 to 2.38) | 4.98 (4.92 to 5.04) |

Female | −1.87 (−1.91 to −1.84) | 1.89 (1.73 to 2.07) | 4.04 (3.86 to 4.22) | 5.33 (5.21 to 5.46) | |||

HPV-related oral | 2-stage | Fit | Male | −2.10 (−2.15 to −2.05) | 0.44 (0.38 to 0.51) | ^{a} | 0.30 (0.29 to 0.31) |

Female | −1.97 (−2.07 to −1.88) | 1.13 (0.87 to 1.47) | ^{a} | 0.051 (0.048 to 0.055) |

Cancer . | Model . | Prevalence . | Sex . | p ×10^{1} (95% CI)
. | q × 10^{4} (95% CI)
. | r × 10^{3} (95% CI)
. | s × 10^{2} (95% CI)
. |
---|---|---|---|---|---|---|---|

Gastric | 3-stage | Assumed | Male | −1.29 (−1.37 to −1.21) | 1.42 (1.16 to 1.75) | ^{a} | 2.07 (1.95 to 2.19) |

Female | −0.81 (−0.89 to −0.74) | 3.49 (3.07 to 3.97) | ^{a} | 2.35 (2.03 to 2.72) | |||

Fit | Male | −1.31 (−1.43 to −1.19) | 2.42 (1.77 to 3.31) | ^{a} | 1.83 (1.69 to 1.97) | ||

Female | −0.71 (−0.82 to −0.61) | 6.05 (4.99 to 7.33) | ^{a} | 2.34 (1.90 to 2.90) | |||

Lung | 3-stage | Assumed | Male | −1.84 (−1.87 to −1.82) | 2.63 (2.49 to 2.78) | 2.21 (2.06 to 2.38) | 4.98 (4.92 to 5.04) |

Female | −1.87 (−1.91 to −1.84) | 1.89 (1.73 to 2.07) | 4.04 (3.86 to 4.22) | 5.33 (5.21 to 5.46) | |||

HPV-related oral | 2-stage | Fit | Male | −2.10 (−2.15 to −2.05) | 0.44 (0.38 to 0.51) | ^{a} | 0.30 (0.29 to 0.31) |

Female | −1.97 (−2.07 to −1.88) | 1.13 (0.87 to 1.47) | ^{a} | 0.051 (0.048 to 0.055) |

NOTE: The number of stages used for the multistage clonal expansion model used and whether the etiologic agent prevalence is assumed (forward framework) or fit (inverse framework) are indicated.

^{a}Likelihood-ratio test indicates *r* ≡ 0.

We compare GAC incidence-by-cohort data (points) with the resulting model predictions (lines) in Fig. 3A and B. The figures demonstrate that GAC incidence has decreased considerably between the birth cohorts of the 1900s and the 1970s in a pattern that is consistent with the reduced prevalence of *H. pylori* over the same time frame. The three-stage MSCE model with three parameters and initiation rate varying by birth-cohort (driven by the relative *H. pylori* prevalence by cohort) fits the GAC trends well.

#### Estimation of both cancer parameters and *H. pylori* prevalence by cohort (inverse framework).

Next, we consider the extent to which *H. pylori* prevalence can be estimated from incidence data. In addition to parameters *p*_{3}, *q*_{3}, and *s*_{3}, we estimate *H. pylori* prevalence by birth cohort, parameterized as a natural spline with five degrees of freedom. Because prevalence and *s*_{3} form an identifiable product, we can only estimate relative prevalence; here, we use 1925 as the reference birth cohort (when prevalence was approximately 50% (Supplementary Fig. S1A). Table 2 shows the estimated cancer parameters in this framework; differences between biological parameter estimates when estimating *H. pylori* prevalence (here) or not (previous framework) are relatively small. The estimated prevalence for men and women is plotted in Fig. 4 with the smoothed NHANES data for comparison. The estimated prevalences both match the data well, with only minor discrepancies for those born in 1950s and 1960s (note that the discrepancies would have manifested for different birth cohorts had a different reference cohort been chosen). The corresponding GAC incidence data by cohort and the model predictions under this framework are given in the Supplementary Material (Supplementary Fig. S3A and S3B).

### Lung cancer

#### Estimation of cancer parameters using cancer incidence and smoking prevalence data (forward framework).

Estimated parameters *p*_{3}, *q*_{3}, *r*_{3}, *s*_{3} are listed in Table 2. The models, which have only four degrees of freedom and are driven by smoking prevalence by cohort and gender, fit the data well up until the late-life period, where some deviations are seen (Fig. 3C and D). The model estimates that the majority of lung cancer cases are due to smoking (or other effects captured by the cohort trends); the exact estimates depend on the smoking prevalence, but, for example, for the 1960 birth cohort, the model estimates 97.3% (97.1%–97.4%) of lung cancers in men and 92.2% (91.2%–92.5%) in women were caused by smoking. Over all of the cohorts, the model estimates that the relative risk of an initiating mutation if one is smoking is *ρ* = 23.5 (22.1–25.0) for men and 14.2 (13.9–14.5) for women. The incidence-by-cohort plots suggest that the incidence of LC has varied dramatically in a cohort fashion, consistent with the patterns of smoking in the U.S. Predicted cancer incidence by cohort is available in the Supplementary Material (Supplementary Fig. S2C and S2D).

### HPV-related oral cancer

#### Estimation of both cancer parameters and relative oral HPV prevalence by cohort using cancer incidence and genital HPV age-specific prevalence (inverse framework).

Because oral HPV-testing is a recent development, we must estimate the cohort effects of oral HPV from the cancer incidence data. As a first step, we assume that the age-specific prevalence of oral HPV in men and women in 1970 is a scaling of the age-specific prevalence of cervicogenital HPV in 1970. A likelihood-ratio test (*p* < 0.01) preferences the model without background initiation (*r*_{2} ≡ 0) for both men and women, so we estimate only *p*_{2}, *q*_{2}, *s*_{2} (Table 2) and the parameters for a natural spline with eight degrees of freedom. Model fits are plotted in Fig. 3E and F. Predicted cancer incidence by cohort is available in the Supplementary Material (Supplementary Fig. S2E and S2F).

There is more spread in the data for this cancer than the two previously considered, resulting in larger model residuals, particularly for women, but the overall patterns in the age-specific incidence by birth cohort are captured. A comparison of the estimated relative birth cohort effects is given in the Supplementary Material (Supplementary Fig. S4). We plot the modeled cross-sectional, age-specific prevalence of oral HPV for men (Fig. 5A) and women (Fig. 5B) between 1975 and 2010. Because prevalence is identifiable only up to a scalar, we scale the estimated prevalence to the 2009–2010 estimates of U.S. age-specific oral HPV prevalence from the NHANES survey (30). A bimodal pattern emerges for men when plotting estimated oral HPV prevalence by calendar year; this pattern is qualitatively consistent with the 2009–2010 NHANES data, although shifted by approximately a decade. For women, the prevalence modeled from the cancer incidence does not capture the second peak in oral HPV prevalence seen in the data. Instead, the model suggests that the second prevalence mode may be aging out. This interpretation is driven by the cancer data (Fig. 3F), where incidence for women, unlike for men, has been decreasing since the 1920–1929 cohort. Discrepancies between the estimates based on cancer incidence and the data may be due to misspecification of the age effects, unaccounted-for period effects, or other factors affecting the temporal relationship between oral HPV and cancer (e.g., smoking, alcohol use, or other oral cancer risk factors).

## Discussion

In this study, we assessed whether population-level cancer and risk factor data can be leveraged to infer cancer mechanisms and improve cancer rate projections, and, conversely, whether cancer risk factor patterns can be inferred from age-specific cancer incidence trends. Despite the analysis simplifications, modeling exposures by age and birth-cohort and relying on the MSCE model formulation to integrate and synthesize the information from both scales has proven an effective way to explore the complex relationship between risk factor prevalence and cancer incidence data at the population level. This result is particularly helpful when one considers that, in general, individual-level data for assessing risk factors are rare.

### Advantages and challenges of incorporating risk factor data into analyses of cancer incidence data

A strength of this approach is the ability to achieve good fits parsimoniously, that is, with few parameters. The GAC and LC models are able to reproduce forty years of cancer incidence data with only three and four parameters, respectively, and the risk factor prevalence data. Our analyses demonstrate that, when considering risk factors with a strong causal link to a type of cancer, incorporating data from risk factor prevalence into analyses and projections of cancer incidence might be preferable to nonparametric approaches such as age–period–cohort analyses, which suffer from identifiability problems and over-fitting concerns (11, 12, 22). By using population-level data, however, we lose the ability to assess heterogeneities in risk. For example, the risk of gastric cancer from *H. pylori* infection varies widely depending on some virulence factors of the pathogen and differences in host susceptibility (31). Nevertheless, we can capture the overall temporal trend in incidence from the prevalence as well as risk profiles that average over the existing heterogeneities. We note, then, that the estimated biological parameters represent population averages that likely have significant individual variability depending on the pathogen and host characteristics.

Modeling GAC and LC incidence using *H. pylori* or smoking prevalence data gave good fits to the data in general, largely thanks to the strong relationship of these risk factors to the corresponding cancer. However, there is more uncertainty in the OSCC model. There are at least two major reasons for this. First, the SEER cancer registry does not record tumor HPV testing. We must rely, as previous studies have done (12, 32–34), on a presumed HPV-related status based on the location of the cancer. Because not all HPV-related cancers are HPV-positive, some error is introduced into the data. Second, because oral HPV infection has only been tested for at the population level in recent years (NHANES began oral HPV testing in 2009; ref. 18), it is not yet possible to determine which signals in the data are related to the age of the participant and which are due to cohort effects. Several studies of cervicogenital HPV prevalence and HPV serum antibodies have concluded that cohort effects, likely related to changes in sexual behaviors, have driven patterns of genital HPV infection (21, 35, 36), and it is likely that cohort effects have similarly played a role in patterns of HPV oral prevalence. In this study, we assumed that the age-specific prevalence of oral HPV in 1970 was proportional to cervicogenital prevalence. While unlikely to be exactly true, this assumption is not completely unreasonable as HPV infection for both sites is largely driven by sexual contact and as there is a correlation between individual (and population-level) oral and genital infections (21, 37). Still, any misspecification of the shape of the age effects in the model will propagate errors. Moreover, there may be period effects (i.e., calendar year dependent effects) that we did not account for in the model. Hence, the exact quantitative estimates of past oral HPV infection should not be given much weight. Nevertheless, we were able to broadly capture the bimodal prevalence pattern of oral HPV prevalence in the United States observed in 2009–2010 in men (30). At the time, it was not clear whether this bimodal pattern was an age effect or a cohort effect. Our results demonstrate that the data can be explained as resulting from cohort variations in HPV infection risk.

A major limitation of our study is that we restricted the analyses to a single risk factor per cancer. It is known, for instance, that smoking, alcohol consumption, and diet are also important covariates for OSCC (38). Besides smoking, other environmental and occupational exposures, such as radon and asbestos, are relevant lung cancer risk factors (39). Similarly diet, smoking, genetic factors, and medical conditions such as pernicious anemia have been associated with GAC (40). However, the risk factors considered in all three case studies here are likely responsible for the large majority of the corresponding cancers, and here we show that accounting for their temporal patterns can improve analyses of cancer incidence trends and that one could also potentially estimate trends of the major risk factor for a specific cancer, if one exists, directly from cancer incidence data. Future work will consider two or more exposures simultaneously.

In this analysis, we assumed that the mechanism of action of the etiologic agents was on tumor initiation. Individual-level analyses of prospective cohort data, however, have suggested that this not always the case and that exposure-related promotion is a relevant mechanism in multiple cancers, such as smoking-related lung cancer (3, 4, 41), as well as in other exposure–cancer pairs (5, 42, 43). We found that an MSCE model where the etiologic agent increased the net cell proliferation did not fit the cancer incidence data as well (see Supplementary Data). This finding does not necessarily mean that promotion is not the mechanistic pathway; there may be nonlinear or exposure magnitude effects that we cannot capture with the population-level data available. Nevertheless, the excellent fits and predictions obtained with as few as four parameters, suggest that, as a first approximation, assuming the effects on initiation is adequate for predicting population-level trends.

### Estimation of risk factor prevalence from cancer incidence data only

Our analyses indicate that when there is a strong causal link between a risk factor and a cancer outcome, it is possible to extrapolate and estimate risk factor prevalence for time periods without direct data. A major barrier to this kind of estimation has been the long and highly variable temporal distance between the risk factor and the cancer outcome. Here, by leveraging MSCE models, we use basic carcinogenesis mechanisms to understand the temporal structure of these delays, allowing us to both predict future cancer incidence knowing current risk factor prevalence and to estimate previous risk factor prevalence from current cancer incidence.

### Estimation of the relative contribution of background versus agent-related initiation

Our approach explicitly differentiates between tumor initiation due to the etiologic agent versus due to background causes. We find, however, that in practice the models often neglect the background initiation pathway (i.e., we estimate the background rate to be *r* = 0). This phenomenon occurs in part because the model calibration will wrap any other pathway effects with similar temporal patterns into the agent pathway, resulting in a potential overestimation of the fraction of cases due to the etiologic agent of interest. Hence, while the LC model estimated that more than 90% of the LC cases were due to smoking, attributable fraction estimates suggest that the true values are likely somewhat less than this—for example, the U.S. Surgeon General estimates that 80%–90% of LC is caused by smoking (15), although this varies by gender and year. The data used covers the period of time where smoking-related lung cancer incidence was at its highest, which likely biases our results to higher attributable fraction estimates. Our analysis could be refined if data on case exposure were available (e.g., identification of LC cases as never, former and current smokers, or HPV status of OSCC tumors). That being said, our analyses, which are based on models that account for not only random or agent-caused mutations but also tumor cell clonal expansion, long-term cancer registry data, and temporal patterns of leading risk factors, suggest that, at least for these cancers, a majority of cases can be attributed to these environmental exposures.

Interest and discussion concerning the proportion of cancer cases (or cancer types) attributable to environmental factors, hereditary factors, or “bad luck” mutations has increased in recent years following the publication of Tomasetti and Vogelstein's analysis (44, 45) that suggested that two-thirds of the variation in cancer risk between cancers of different tissues could be explained by differences in stem cell division rates. This work has been very controversial, in part due to misinterpretation of their conclusions, and many others have weighed in on this topic (e.g., refs. 46–51). Our analyses of specific risk factors and cancers using mechanistic models, which explicitly account for clonal expansion and evolution in addition to random and environmentally driven mutations, show the relevance of both environmental and random effects in the carcinogenesis process.

## Conclusion

Relating risk factor patterns to cancer incidence is difficult: the temporal relationships between exposures and outcomes are complex. We have demonstrated that integrating population-level data on both cancer and exposures though mechanistic, mathematical models can be a first step to systematically testing relationships between exposures and cancer risk, particularly when individual-level data representative of a population is lacking.

## Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

## Authors' Contributions

**Conception and design:** A.F. Brouwer, M.C. Eisenberg, R. Meza

**Development of methodology:** A.F. Brouwer, M.C. Eisenberg, R. Meza

**Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.):** R. Meza

**Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis):** A.F. Brouwer, M.C. Eisenberg

**Writing, review, and/or revision of the manuscript:** A.F. Brouwer, M.C. Eisenberg, R. Meza

**Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases):** A.F. Brouwer

**Study supervision:** M.C. Eisenberg, R. Meza

## Acknowledgments

All authors were supported by NIH grant U01CA182915.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.