Abstract
Standard descriptive methods for the analysis of cancer surveillance data include canonical plots based on the Lexis diagram, directly age-standardized rates (ASR), estimated annual percentage change (EAPC), and joinpoint regression. The age-period-cohort (APC) model has been used less often. Here, we argue that it merits much broader use. First, we describe close connections between estimable functions of the model parameters and standard quantities such as the ASR, EAPC, and joinpoints. Estimable functions have the added value of being fully adjusted for period and cohort effects, and they are generally more precise. Second, the APC model provides the descriptive epidemiologist with powerful new tools, including rigorous statistical methods for comparative analyses and the ability to project the future burden of cancer. We illustrate these principles by using invasive female breast cancer incidence in the United States, but these concepts apply equally well to other cancer sites for incidence or mortality. Cancer Epidemiol Biomarkers Prev; 20(7); 1263–8. ©2011 AACR.
Introduction
Cancer incidence and mortality rates are closely monitored to track the burden of cancer and its evolution in populations (1–4), provide etiologic clues (5–11), reveal disparities (12–14), and gauge the dissemination of screening modalities (15–17) and therapeutic innovations (18, 19). A standard “toolbox” of graphical and quantitative methods has evolved to handle the needs of cancer surveillance researchers. Perhaps the most widely used methods are classical descriptive plots based on the Lexis diagram (20–22), directly age-standardized rates (ASR; ref. 23), estimated annual percentage change (EAPC; ref. 24), and the joinpoint regression method (25). The underlying philosophy is agnostic and empirical; hence, standard tools are particularly well suited to descriptive, exploratory, and hypothesis-generating studies.
At the same time, the age-period-cohort (APC) model has been developed in the statistics literature as a mathematical counterpoint to purely descriptive approaches (20, 26–33). The APC model is based on fundamental generalized linear model theory (34); in principle, it allows the descriptive epidemiologist to both generate and test hypotheses. However, although the APC model is generally accepted, our sense is that it remains more of a niche methodology than an integral part of mainstream practice.
We believe 2 misunderstandings have slowed the uptake of the APC approach. First, there are concerns about the “identifiability problem” of the APC model (27, 28). Second, close connections between the classical toolbox and the APC model have not been clearly spelled out in the literature. In this commentary, we will attempt to clarify both misunderstandings and thereby make the case that the APC model merits much wider use.
Data, Methods, and Results
Example: breast cancer incidence data
We will develop this commentary using as a concrete example the incidence of invasive female breast cancers in the United States. For this purpose, we obtained age-specific case and population data from the National Cancer Institute's Surveillance, Epidemiology, and End Results 9 Registries (SEER 9) database for the 36-year time period from 1973 through 2008 (November 2010 submission; ref. 35).
In general, for any given cancer and population group, the matrix |$Y = [Y_{pa},\;p = 1, \ldots ,P,\;a = 1, \ldots ,A]$| contains the number of cancer diagnoses in calendar period p and age group a, and the matrix |$O = [O_{pa},\;p = 1, \ldots ,P,\;a = 1, \ldots ,A]$| contains the corresponding person-years. The observed incidence rates per 100,000 person-years are |$\lambda _{pa} = 10^5 \,Y_{pa} /O_{pa}$|, and the expected log rates are |$\rho _{pa} = \log [E(Y_{pa})/O_{pa}]$|.
It is instructive to think of the rate matrix in terms of its corresponding Lexis diagram (Fig. 1), which makes visually clear how the diagonals of matrices Y and O, from top right to bottom left, represent successive birth cohorts indexed by c = p − a + A, from the oldest (c = 1) to the youngest (c = C ≡ P + A − 1). From this perspective, it becomes clear that a new cohort enters prospective follow-up with each consecutive calendar period. For this reason, one can think of a registry as a “cohort of cohorts.” Because cancer registries are operated in perpetuity, over time, a substantial number of birth cohorts are followed. Our example includes C = 24 nominal 8-year cohorts born from 1892 through 1984 (referred to by midyear of birth).
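The indexing just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the 3 × 4 matrices of counts and person-years are invented toy values, and the function names `observed_rates` and `cohort_index` are hypothetical.

```python
# Toy example: P = 3 calendar periods (rows) by A = 4 age groups (columns).
# The counts and person-years below are invented for illustration only.
Y = [[12, 30, 55, 80],
     [15, 33, 60, 90],
     [18, 35, 66, 99]]
O = [[100000, 90000, 80000, 70000],
     [105000, 95000, 82000, 72000],
     [110000, 99000, 85000, 75000]]

P, A = len(Y), len(Y[0])


def observed_rates(Y, O, per=100_000):
    """Observed rates per `per` person-years for each (period, age) cell."""
    return [[per * y / o for y, o in zip(y_row, o_row)]
            for y_row, o_row in zip(Y, O)]


def cohort_index(p, a, A):
    """1-based cohort index c = p - a + A for 1-based period p and age a."""
    return p - a + A


rates = observed_rates(Y, O)

# Cells on the same diagonal of the Lexis diagram share one cohort index.
cohorts = [[cohort_index(p, a, A) for a in range(1, A + 1)]
           for p in range(1, P + 1)]
C = P + A - 1  # number of distinct birth cohorts

for row in cohorts:
    print(row)
```

In this layout the oldest cohort (c = 1) sits in the top-right cell (first period, oldest age group) and the youngest (c = C) in the bottom-left cell.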
The APC model: formulation
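The APC analysis is based on a log-linear model for the expected rates with additive effects for age, period, and cohort:
|$\rho _{pa} = \mu + \alpha _a + \pi _p + \gamma _c \quad ({\rm A})$|
The generic additive effects in Equation (A) can be partitioned into linear and nonlinear components (28). There are a number of equivalent ways to make this partition while incorporating the fundamental constraint that c = p − a + A. Two of the most useful (36) are the age-period form
|$\rho _{pa} = \mu + \left({\alpha _L - \gamma _L } \right)\left({a - \bar a} \right) + \left({\pi _L + \gamma _L } \right)\left({p - \bar p} \right) + \tilde \alpha _a + \tilde \pi _p + \tilde \gamma _c \quad ({\rm B})$|
and the age-cohort form
|$\rho _{ca} = \mu + \left({\alpha _L + \pi _L } \right)\left({a - \bar a} \right) + \left({\pi _L + \gamma _L } \right)\left({c - \bar c} \right) + \tilde \alpha _a + \tilde \pi _p + \tilde \gamma _c \quad ({\rm C})$|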
Notation and parameters are summarized in Table 1. Importantly, all the parameters in Equations (B) and (C) can be estimated from the data without imposing additional constraints, and fitted rates from both forms are identical.
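The statement that the two forms give identical fitted rates can be checked numerically for the linear components. The sketch below assumes the linear part of the model is αL(a − ā) + πL(p − p̄) + γL(c − c̄), with referents defined as in the footnote to Table 1; the slope values and the grid size (P = 9, A = 16) are arbitrary illustrative choices, not estimates from the breast cancer data.

```python
# Arbitrary illustrative slopes on the log scale; not estimates from data.
alpha_L, pi_L, gamma_L = 0.05, 0.02, -0.01
P, A = 9, 16  # hypothetical numbers of period and age groups

a_bar = (A + 1) // 2          # referent age group, [(A + 1)/2]
p_bar = (P + 1) // 2          # referent period, [(P + 1)/2]
c_bar = p_bar - a_bar + A     # referent cohort


def linear_age_period(p, a):
    """Linear part written on the age and period scales."""
    return (alpha_L - gamma_L) * (a - a_bar) + (pi_L + gamma_L) * (p - p_bar)


def linear_age_cohort(c, a):
    """Linear part written on the age and cohort scales."""
    return (alpha_L + pi_L) * (a - a_bar) + (pi_L + gamma_L) * (c - c_bar)


max_gap = 0.0
for p in range(1, P + 1):
    for a in range(1, A + 1):
        c = p - a + A  # cohort index implied by period and age
        gap = abs(linear_age_period(p, a) - linear_age_cohort(c, a))
        max_gap = max(max_gap, gap)

print(f"largest difference between the two forms: {max_gap:.2e}")
```

The difference is zero up to floating-point rounding because c − c̄ = (p − p̄) − (a − ā) in every cell.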
Quantity^a | Nomenclature |
---|---|
μ | Grand mean |
|$\tilde \alpha _a,\tilde \pi _p,\;{\rm and}\;\tilde \gamma _c $| | Age, period, and cohort deviations |
(αL + πL) | Longitudinal age trend |
(αL − γL) | Cross-sectional age trend |
(πL + γL) | Net drift ≈ EAPC of the ASR^b |
|$\mu + \left({\alpha _L + \pi _L } \right)\left({a - \bar a} \right) + \tilde \alpha _a$| | Fitted longitudinal age-at-event curve |
|$\mu + \left({\alpha _L - \gamma _L } \right)\left({a - \bar a} \right) + \tilde \alpha _a$| | Fitted cross-sectional age-at-event curve |
|$\mu + \left({\pi _L + \gamma _L } \right)\left({p - \bar p} \right) + \tilde \pi _p$| | Fitted temporal trends |
^a The APC model is defined over a P × A event matrix Y and corresponding matrix of person-years O. The referent age, period, and cohort are |$\bar a = [(A + 1)/2]$|, |$\bar p = [(P + 1)/2]$|, and |$\bar c = \bar p - \bar a + A$|, respectively, where P and A are the total numbers of period and age groups and [.] is the greatest integer function.
^b From Last (23).
The APC parameters and estimable functions in Table 1 correspond closely to fundamental aspects of the data that are investigated using the standard descriptive toolbox. Before highlighting some of these connections, we hope to shed further light on the much-discussed identifiability problem.
Identifiability: “problem” or uncertainty principle?
The aspect of identifiability in question concerns whether log-linear trends in rates can uniquely be attributed to the influences of age, period, or cohort, quantified by parameters |$\alpha _L,\;\pi _L,\;{\rm and}\;\gamma _L $|. Mathematically, it has been shown by Holford (28) that one cannot do this without imposing additional unverifiable assumptions, because the 3 time scales are collinear (cohort equals period minus age, c = p − a). This issue has often implicitly been held out as a unique and unfortunate limitation of the APC model. In fact, the same issue affects time-to-event analysis of any cohort study.
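A small numerical sketch may make the collinearity concrete. It assumes the linear part of the log rate is αL(a − ā) + πL(p − p̄) + γL(c − c̄) with c = p − a + A; the slope values, the shift t, and the grid size are arbitrary, and the function names are hypothetical. Shifting the three slopes by (+t, −t, +t) leaves every cell unchanged, while the sums and differences listed in Table 1 stay the same.

```python
P, A = 9, 16  # hypothetical grid size
a_bar, p_bar = (A + 1) // 2, (P + 1) // 2
c_bar = p_bar - a_bar + A


def linear_predictor(alpha_L, pi_L, gamma_L, p, a):
    """Linear trend contribution to the log rate in cell (p, a)."""
    c = p - a + A
    return (alpha_L * (a - a_bar) + pi_L * (p - p_bar)
            + gamma_L * (c - c_bar))


def estimable(alpha_L, pi_L, gamma_L):
    """Slope combinations that are unaffected by the arbitrary shift."""
    return (alpha_L + pi_L, alpha_L - gamma_L, pi_L + gamma_L)


base = (0.05, 0.02, -0.01)   # arbitrary illustrative slopes
t = 0.3                      # arbitrary shift
shifted = (base[0] + t, base[1] - t, base[2] + t)

cells = [(p, a) for p in range(1, P + 1) for a in range(1, A + 1)]
max_gap = max(abs(linear_predictor(*base, p, a)
                  - linear_predictor(*shifted, p, a)) for p, a in cells)

print(f"largest change in any cell: {max_gap:.2e}")
print("estimable functions (base):   ", estimable(*base))
print("estimable functions (shifted):", estimable(*shifted))
```

Because the data cannot distinguish `base` from `shifted`, only the three combinations returned by `estimable` can be estimated without extra assumptions.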
To see this, consider the following thought experiment. Suppose one enrolls a cohort of exchangeable persons of identical age (e.g., the 1956 birth cohort in Fig. 1) and follows them longitudinally over a decade for cancer. At the end of the study, one observes that the log incidence rate increases linearly with age. It is natural to attribute this trend entirely to the effects of aging and equate the age-associated slope to the value of a parameter αL.
However, suppose one had also assembled an identical cohort of persons of the same age, but this study had been conducted 10 years earlier. It is possible that the age-associated slopes of the 2 studies would be very different if disease-causing exposures outside experimental control had been increasing or decreasing in prevalence over time. Hence, the observed age-associated slope actually estimates the parameter (αL + πL), or longitudinal age trend [(LAT) in Fig. 1; ref. 32], where αL is the component of the trend that is attributable to aging and πL is the component of the trend due to the net impact of unknown and uncontrollable exposures over successive calendar periods.
A similar issue affects any cross-sectional analysis. To “control” for the effects of aging, suppose one studied in succession over time an event rate in persons of the same age (e.g., age group 65–69 years in Fig. 1) to estimate the slope of the time-trend πL. By definition, each successive group in this cross-sectional study was born a year later. Hence, both unknown factors and factors out of experimental control associated with birth cohort could also play a role. Therefore, the observed slope over time actually estimates a parameter (πL + γL) or net drift in Figure 1 (29, 30), where πL is the component of the trend that is attributable to calendar time and γL is the component of the trend attributable to the successive cohorts enrolled in the study.
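The two thought experiments can be mimicked with a purely linear log-rate surface. The sketch below uses arbitrary values for μ, αL, πL, and γL and a hypothetical grid size; nonlinear deviations are omitted, and slopes are expressed per age-group or per period step rather than per year.

```python
mu, alpha_L, pi_L, gamma_L = -7.0, 0.05, 0.02, -0.01  # arbitrary values
P, A = 9, 16  # hypothetical grid size
a_bar, p_bar = (A + 1) // 2, (P + 1) // 2
c_bar = p_bar - a_bar + A


def log_rate(p, a):
    """Log rate with linear age, period, and cohort trends only."""
    c = p - a + A
    return (mu + alpha_L * (a - a_bar) + pi_L * (p - p_bar)
            + gamma_L * (c - c_bar))


# Follow one birth cohort: age and period advance together, cohort is fixed.
longitudinal_slope = log_rate(5, 9) - log_rate(4, 8)

# Compare adjacent age groups within one period: cohort moves back one step.
cross_sectional_slope = log_rate(4, 9) - log_rate(4, 8)

# Follow one age group over adjacent periods: cohort moves forward one step.
net_drift = log_rate(5, 8) - log_rate(4, 8)

print(longitudinal_slope, alpha_L + pi_L)
print(cross_sectional_slope, alpha_L - gamma_L)
print(net_drift, pi_L + gamma_L)
```

In each case the observable slope equals a sum or difference of two of the underlying parameters, never a single parameter on its own.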
These simple thought experiments, Figure 1, and Table 1 illustrate an important “uncertainty principle” regarding the measurement of absolute rates in cohorts. Interestingly, this principle is seldom considered in the context of most epidemiologic cohort and case–control studies, perhaps because these studies have a fairly narrow accrual window and often focus on relative rates rather than absolute rates. In contrast, this issue is often central in the analysis of registry data, because the follow-up has sufficient breadth and depth to reveal long-term secular trends in the population associated with age, period, and cohort. Indeed, a unique role of registry studies is to identify and quantify such trends, thereby providing direction and guidance regarding the needs for targeted analytic studies.
Estimable functions: separating signal from noise
The APC model provides a unique set of best-fitting log incidence rates, |$\hat \rho _{pa}$| or equivalently |$\hat \rho _{ca}$|, obtained by plugging the maximum likelihood estimates into Equation (B) or (C), respectively. The corresponding variances are readily calculated. In our experience, the fitted rates have an appealing amount of smoothing, and we use them routinely in our studies (36–45), especially for rare cancer outcomes. Experience suggests that for rate matrices of “moderate” size (in terms of A and P), the APC model smooths the data conservatively, about as much as a 3-point moving average, yielding around a 40% to 60% reduction in the width of the CIs. Of course, the precise amount of noise reduction depends on a number of technical details, including whether overdispersion is present or accounted for.
This application of the APC model is illustrated in Figure 2 for the breast cancer data. The ASRs over time calculated using the observed rates are nearly identical to the ASRs calculated using the APC-fitted rates. However, the pointwise CIs for the fitted rates are substantially narrower, by around 40% averaged over the study period.
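For readers less familiar with direct standardization, the ASR for one period is simply a weighted average of that period's age-specific rates, and the same calculation applies whether the inputs are observed or APC-fitted rates. The sketch below is illustrative only: the rates and the standard population weights are invented, and `age_standardized_rate` is a hypothetical helper.

```python
def age_standardized_rate(age_specific_rates, standard_weights):
    """Directly standardized rate: weighted mean of age-specific rates."""
    if len(age_specific_rates) != len(standard_weights):
        raise ValueError("rates and weights must have the same length")
    total = sum(standard_weights)
    weighted = sum(r * w for r, w in zip(age_specific_rates, standard_weights))
    return weighted / total


# Invented rates per 100,000 person-years for 4 age groups in 2 periods.
rates_by_period = [
    [20.0, 80.0, 200.0, 350.0],
    [22.0, 85.0, 210.0, 360.0],
]
# Hypothetical standard population weights (they need not sum to 1).
weights = [40, 30, 20, 10]

asr = [age_standardized_rate(r, weights) for r in rates_by_period]
print(asr)
```

Passing fitted rates, that is, the exponentiated fitted log rates scaled to the same denominator, in place of `rates_by_period` gives the APC-based ASR curve.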
Estimable functions: connections to the classical approaches
The APC parameter called the net drift [Table 1 and Equations (B) and (C)] estimates the same quantity as the EAPC of the ASR, that is, the overall long-term secular trend. The point estimates for these quantities are almost identical for the breast cancer data in Figure 2 [net drift = 0.83% per year (95% CI: 0.78–0.85) and EAPC = 0.78% per year (95% CI: 0.18–1.39)]. However, for this example, the estimated confidence bands are much narrower for the net drift.
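The correspondence between the two quantities can be seen by putting both on the percent-per-year scale. The sketch below assumes the usual definition of the EAPC as 100 × (exp(slope) − 1), with the slope taken from a least-squares regression of log(ASR) on calendar year, and it assumes the net drift is already expressed per year on the log scale. The ASR series is synthetic, and the function names are hypothetical.

```python
import math


def eapc(years, rates):
    """EAPC: 100 * (exp(slope) - 1), slope from OLS of log(rate) on year."""
    n = len(years)
    x_bar = sum(years) / n
    logs = [math.log(r) for r in rates]
    y_bar = sum(logs) / n
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, logs))
    return 100.0 * (math.exp(sxy / sxx) - 1.0)


def drift_to_percent(net_drift_log_scale):
    """Express a per-year log-scale net drift (pi_L + gamma_L) as % per year."""
    return 100.0 * (math.exp(net_drift_log_scale) - 1.0)


# Synthetic ASR series growing by exactly 0.8% per year (not real data).
years = list(range(1973, 2009))
asr = [100.0 * 1.008 ** (y - 1973) for y in years]

print(round(eapc(years, asr), 3))
print(round(drift_to_percent(math.log(1.008)), 3))
```

If the drift is estimated per period step rather than per year, it would need to be divided by the interval width before this conversion.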
We introduced a novel estimable function called the fitted age-at-onset curve to summarize the longitudinal (i.e., cohort-specific) age-associated natural history (Table 1 and Fig. 3; ref. 46). By construction, the fitted curve extrapolates from observed age-specific rates over the full range of birth cohorts to estimate past, current, and future rates for the referent cohort (e.g., the 1932 cohort in this example). The fitted age-at-onset curve provides a longitudinal age-specific rate curve that is adjusted for both calendar-period and birth-cohort effects. We view it as an improved version of the cross-sectional age-specific rate curve, improved because the cross-sectional curve is not adjusted for period and cohort effects (47). The fitted curve has proven very useful in practice (38–40, 42–44, 46, 48).
APC analysis: beyond the basics
There are many useful extensions to the basic APC model. Estimable functions are amenable to formal hypothesis tests (29, 30). Parameters associated with age, period, and cohort can be smoothed (49). Parametric assumptions about the shape of the age incidence curve derived from mathematical models of carcinogenesis can be incorporated (50). Other extensions have included parametric (33) and nonparametric (51, 52) assessments of changes in period and cohort deviations, as well as simultaneous modeling of a moderate or large number of strata, such as geographic areas, using Bayes and empirical Bayes methods (53).
Recently, we developed novel methods to compare age-related natural histories and time trends between distinct event rates assuming that separate APC models hold for each (36). Using this approach, one can formally contrast the incidence of a given tumor such as breast cancer in 2 populations, say black versus white women (46), or the incidence of 2 tumor subtypes in the same population, say estrogen receptor (ER)-positive versus ER-negative breast cancers [(46), Supplementary Figure]. We showed that 2 event rates are proportional over age, period, or cohort if and only if certain sets of APC parameters are all equal across the respective event-specific models (36). We also developed corresponding tests of proportionality and estimators of rate ratios.
A number of authors have forecast future cancer rates by using the APC model (54–58). Projections quantify the future implications of current trends, for example, the impact of a net drift of 1% versus 2% over time, or the future impact of recent changes in birth cohort patterns.
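The arithmetic behind such comparisons is simple compounding. The sketch below is not the projection method of the cited studies, which also carry forward age and cohort patterns; it only shows how a constant net drift of 1% versus 2% per year would scale a hypothetical baseline rate over 20 years.

```python
def project_rate(current_rate, annual_percent_change, years_ahead):
    """Rate after compounding a constant annual percentage change."""
    return current_rate * (1.0 + annual_percent_change / 100.0) ** years_ahead


baseline = 100.0  # hypothetical ASR per 100,000 person-years
for drift in (1.0, 2.0):
    projected = project_rate(baseline, drift, 20)
    print(f"net drift {drift:.0f}%/year -> {projected:.1f} after 20 years")
```

Under these assumptions, doubling the drift from 1% to 2% per year raises the 20-year increase from about 22% to about 49%.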
Discussion
Successful technological evolution builds on effective design. This is just as true for statistical methods as for computers and cellular phones. We have argued here that the APC model provides a useful evolutionary extension to the standard armamentarium of methods available to the descriptive epidemiologist. The APC model is not a replacement for existing methods, which are popular and successful. Rather, it provides a refined means of estimating the same quantities while also adding useful new capabilities, such as formal methods for comparing 2 sets of rates or projecting the future cancer burden.
Using the APC model, cancer registry data can be analyzed in the same spirit as any other epidemiologic cohort using the same concepts, such as proportional hazards, confounding, and effect modification/interaction. Importantly, because cancer registries follow a cohort of cohorts, analysis of registry data can reveal fundamental changes in population rates that are not usually discernible in standard cohort or case–control studies.
Currently, the software for APC analysis is available only through fairly specialized packages (SAS, R, Matlab). Development of good stand-alone software, in addition to education and training, is needed if the full potential of the APC model is to be exploited by descriptive epidemiologists.
Disclosure of Potential Conflicts of Interest
All of the authors had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. No potential conflicts of interest were disclosed.
Grant Support
This research was supported by the Intramural Research Program of NIH, National Cancer Institute.