Abstract
Risk prediction models that estimate an individual's risk of developing colon cancer could be used for a variety of clinical and public health interventions, including offering high-risk individuals enhanced screening or lifestyle interventions. However, if risk prediction models are to be translated into actual clinical and public health practice, they must not only be valid and reliable, but also be easy to use. One way of accomplishing this might be to simplify the information that users of risk prediction tools have to enter, but it is critical to ensure no resulting detrimental effects on model performance. We compared the performance of a simplified, largely categorized exposure-based colon cancer risk model against a more complex, largely continuous exposure-based risk model using two prospective cohorts. Using data from the Nurses’ Health Study and the Health Professionals Follow-up Study we included 816 incident colon cancer cases in women and 412 in men. The discrimination of models was not significantly different comparing a categorized risk prediction model with a continuous prediction model in women (c-statistic 0.600 vs. 0.609, Pdiff = 0.07) and men (c-statistic 0.622 vs. 0.618, Pdiff = 0.60). Both models had good calibration in men [observed case count/expected case count (O/E) = 1.05, P > 0.05] but not in women (O/E = 1.19, P < 0.01). Risk reclassification was slightly improved using categorized predictors in men [net reclassification index (NRI) = 0.041] and slightly worsened in women (NRI = −0.065). Categorical assessment of predictor variables may facilitate use of risk assessment tools in the general population without significant loss of performance.
Introduction
Colon cancer is the third most common cancer diagnosis and the third leading cause of cancer-related death in the United States (1). Approximately 95,500 new colon cancer cases were diagnosed in 2017, and 50,260 patients died of the disease (1). Despite advances in clinical management, only 63.7% of patients with colon cancer survive 5 years beyond diagnosis (2). Population screening and prevention programs have significantly reduced colon cancer incidence and mortality (3, 4). Risk assessments may improve personalized colon cancer prevention by facilitating targeted screening and prevention strategies for high-risk individuals and minimizing overdiagnosis-related harms to very low-risk individuals (5, 6).
More than 10 models for colon cancer risk prediction in asymptomatic individuals from the general population have been developed (for review, see ref. 7). They assess various sociodemographic and modifiable lifestyle and dietary risk factors. These tools vary in the number of risk factors included (e.g., three vs. more than a dozen), the level of specificity required for reporting the risk factors (e.g., age vs. detailed dietary history), the extent to which a layperson is likely to know his/her status on a given risk factor (e.g., age vs. total serum cholesterol), and the extent to which medical testing is required (e.g., tobacco use vs. polygenetic profile).
If risk prediction models are to be translated into risk assessment tools that improve the health of the most people possible, they need to be in a form that laypeople and clinicians are willing and able to use. For example, an internet-based risk assessment tool that aims to educate laypeople about their colon cancer risk and about strategies that they can take to reduce their risk could potentially have extraordinary reach. However, if an individual is not able to remember specific inputs required for a model, for example their total serum cholesterol, the model results will be inaccurate. If the individual becomes bored with completing a detailed dietary history and leaves the website before receiving their risk results and behavior recommendations, they will not receive the full intervention. Similar decrements in model accuracy and intervention receipt will occur in the clinical setting if a patient's insurance will not pay for a test the model requires, or if the clinician does not have the time to complete an extensive risk assessment.
One possible solution could be to adapt a validated risk prediction model by categorizing continuous risk factors and by excluding risk factors that participants are unlikely to be able to answer (e.g., results of medical tests). Categorization of continuously distributed predictors may have some disadvantages (8), but it may also make completion of risk assessment tools significantly easier for the user. In this study, we used data from two large prospective cohorts with long-term follow-up through 2010 to compare two colon cancer risk prediction models with different forms of dietary and lifestyle variables, one with simplified categorized predictor variables and the other with continuous predictor variables.
We began the categorization process with the colon cancer risk prediction model used in the Your Disease Risk (YDR) website (http://www.yourdiseaserisk.wustl.edu). YDR is a publicly accessible, web-based risk assessment tool for cancer and other chronic diseases. Its performance is among the highest of the self-completed questionnaires and is comparable to that of self-completed questionnaires combined with biomarkers or genetic factors (7). Drawing on self-reported risk factors and consensus-based relative risk estimates for risk factors (9), YDR has acceptable validity for predicting the risk of colon cancer (10).
Materials and Methods
Study population
The Nurses’ Health Study (NHS) was established in 1976 when 121,702 female registered nurses ages 30 to 55 years enrolled. The Health Professionals Follow-up Study (HPFS) includes 51,529 U.S. male health professionals ages 40 to 75 years at enrollment in 1986. At baseline and subsequently every 2 years, participants in both cohorts completed a mailed questionnaire about their medical history and lifestyle. Participants implied consent with the completion and return of questionnaires. The NHS protocol was approved by the Human Subjects Committees at the Harvard T.H. Chan School of Public Health and Brigham and Women's Hospital (Boston, MA). The HPFS protocol was approved at the Harvard T.H. Chan School of Public Health (Boston, MA).
This analysis included the participants who completed the 1986 questionnaire (n = 102,170 in the NHS; n = 51,468 in the HPFS) to keep the baseline year of analysis (1986) consistent in both cohorts. We excluded those (n = 38,951 in the NHS; n = 11,438 in the HPFS) who had a cancer diagnosis prior to 1986, had missing information on at least one of the colon cancer risk factors used in YDR, reported an unusual total energy intake (<500 kcal/day or >3,500 kcal/day for women; <800 kcal/day or >4,200 kcal/day for men), or were missing at least 10 items on the 1986 food frequency questionnaire (FFQ; Supplementary Figure). The analysis includes 63,219 women and 40,030 men.
Assessment of risk factors
We used information on risk factors that were collected on the 1986 questionnaire [with the exception of family history of colorectal cancer in the NHS and endoscopic colorectal cancer screening in both cohorts (see below)] and were updated during follow-up through 2008. The risk factors for colon cancer included height, body mass index (BMI), hormone replacement therapy (for women), physical activity, smoking, calcium intake from dairy food, alcohol and multivitamin intake, regular aspirin use, colonoscopy/sigmoidoscopy use, and family history of colorectal cancer in first-degree relatives. Height and weight were collected on the baseline questionnaire, and weight was updated biennially. Cigarette smoking was assessed at baseline and was updated biennially. Dietary history was assessed in 1986 and was updated every 4 years using a validated 130-item FFQ (11, 12). Total calcium intake was estimated by summing calcium from all related dietary sources that were included in the FFQ. We also estimated intake of calcium from only dairy products including whole milk, cream, sour cream, ice cream, cottage/ricotta cheese, cream cheese, other cheese, yogurt, and butter. Intake of these dairy products was used as an approximate estimate of total calcium intake, which has been associated with a lower risk of colon cancer (13). A high level of dairy calcium intake (≥1,000 mg/day) was defined as at least three servings of dairy products each day. Physical activity was assessed in 1986, 1988, 1992, and biennially after that. Participants reported the average time per week they spent in the following activities over the past year: walking or hiking outdoors, jogging, running, bicycling, racquet sports, swimming laps, and calisthenics or aerobics. Physical activity in YDR includes walking, jogging, and running and thus was analyzed as hours per week of these three types of activities combined.
Aspirin use was first collected in the NHS in 1980, and frequency of use was first asked in 1984. We calculated cumulative averages of frequency, dose, and duration of aspirin use in each of the follow-up years. Because this information was lacking in the HPFS, duration of regular aspirin use was not analyzed for men. In the NHS, family history of colon or rectal cancer in first-degree relatives was assessed in 1982 and was updated in 1988 and subsequently every 4 years. In the HPFS, family history of colorectal cancer in first-degree relatives was asked in 1986 and was updated in 1990, 1992, 1996, and 2008. The NHS collected colorectal cancer screening in 1988 and subsequently every other year. The participants in the NHS reported their history of endoscopic colorectal cancer screening between 1980 and 1990 on the 1990 questionnaire. The history of endoscopic colorectal cancer screening among the HPFS participants was collected in 1988 and updated biennially. Menopausal status and hormone replacement therapy were asked at baseline and updated biennially.
Cancer case ascertainment
Incident cancer cases were identified through participants’ self-reports on biennial follow-up questionnaires and confirmed through medical record review. The National Death Index was used to identify the participants whose death was attributed to colon cancer. A diagnosis of colon cancer was confirmed through medical record review for over 98% of nonrespondents who died of colon cancer (14).
Statistical analysis
The model with predominantly categorized predictor variables included height (≥68 inches vs. <68 inches), BMI (25–29.9 kg/m² and ≥30 kg/m² vs. <25 kg/m²), postmenopausal hormone use (premenopausal, postmenopausal and past use, and postmenopausal and current use vs. postmenopausal and never use), physical activity (0.5–2.9 hours/week and ≥3 hours/week vs. <0.5 hours/week), pack-years of smoking (>0–<40 and ≥40 vs. 0), calcium intake from dairy products (≥1,000 mg/day vs. <1,000 mg/day), intakes of alcohol and multivitamin (alcohol <1 drink/day or alcohol ≥1 drink/day and multivitamin ≥5 times/week, alcohol ≥1 drink/day and multivitamin <5 times/week vs. no alcohol), years of daily aspirin use (≥15 vs. <15), history of sigmoidoscopy/colonoscopy (yes vs. no), family history of colorectal cancer (yes vs. no), and continuous age. In the model with continuous predictor variables, height, BMI, physical activity, pack-years of smoking, total calcium intake, years of daily aspirin use, and age were analyzed as continuous variables and all the other variables were analyzed as categorical variables as defined in the first model.
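As an illustration of the categorization step, the following minimal Python sketch maps continuous exposures onto the cut-points listed above. The class name, field names, and category labels are hypothetical and chosen only for this example; the study itself was analyzed in SAS, and the alcohol/multivitamin, hormone use, screening, and family history variables are omitted because they were categorical in both models.

```python
from dataclasses import dataclass


@dataclass
class Exposures:
    """Continuous inputs; field names are illustrative, not taken from the study."""
    height_in: float         # height in inches
    bmi: float               # body mass index, kg/m^2
    activity_hr_wk: float    # walking, jogging, and running, hours/week
    pack_years: float        # pack-years of smoking
    dairy_calcium_mg: float  # calcium from dairy products, mg/day
    aspirin_years: float     # years of daily aspirin use


def categorize(x: Exposures) -> dict:
    """Map continuous exposures to the categories of the simplified model."""
    if x.bmi < 25:
        bmi = "<25"
    elif x.bmi < 30:
        bmi = "25-29.9"
    else:
        bmi = ">=30"

    if x.activity_hr_wk < 0.5:
        activity = "<0.5"
    elif x.activity_hr_wk < 3:
        activity = "0.5-2.9"
    else:
        activity = ">=3"

    if x.pack_years == 0:
        smoking = "0"
    elif x.pack_years < 40:
        smoking = ">0-<40"
    else:
        smoking = ">=40"

    return {
        "height": ">=68" if x.height_in >= 68 else "<68",
        "bmi": bmi,
        "activity": activity,
        "pack_years": smoking,
        "dairy_calcium": ">=1000" if x.dairy_calcium_mg >= 1000 else "<1000",
        "aspirin_years": ">=15" if x.aspirin_years >= 15 else "<15",
    }


example = categorize(Exposures(70, 27.3, 1.0, 0, 1200, 3))
print(example)
```

A user-facing tool would ask for the category directly rather than the underlying number, which is the simplification the categorized model is meant to enable.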
Person-years were calculated from the return date of the 1986 questionnaire until the date of colorectal cancer diagnosis, death, loss to follow-up, other self-reported cancer (except nonmelanoma skin cancer), or June 1, 2010, whichever occurred first. We used Cox proportional hazards regression models to estimate relative risks (RR) and 95% confidence intervals (CI).
The accuracy of risk prediction models was evaluated by means of calibration and discrimination. To assess calibration, we estimated RRs for individuals and then absolute risks by combining RR scores with 1986–2010 Surveillance, Epidemiology, and End Results (SEER) data. We then grouped the participants by deciles of estimated absolute risks, compared observed and expected counts of incident colon cancer, and tested for trend using Poisson regression approaches (15). The c-statistic (equivalent to the area under the receiver operating characteristic curve) was estimated to reflect the discriminative capability of a model. The c-statistics within 5-year age groups were calculated and an overall c-statistic was obtained by averaging the age-specific c-statistics weighted by their inverse variance (16). We compared the c-statistic for the model with categorized predictors versus the model with continuous predictors using the Wilcoxon rank-sum test (16). The c-statistic is not sensitive to small incremental differences between prediction models and has little direct clinical relevance (17). Therefore, reclassification of participants was also assessed by categorical net reclassification improvement (NRI) using quartiles of estimated absolute risk of developing colon cancer during the follow-up from 1986 to 2010 (18). To estimate the incidence rates of colon cancer among participants who were reclassified upwards, downwards, and for the whole sample, we used the Kaplan–Meier approach. NRI was calculated for colon cancer cases and controls separately; a positive NRI for colon cancer cases suggested a net percentage of participants with colon cancer correctly classified upward and a positive NRI for controls suggested a net percentage of participants without a cancer history correctly classified downward (19). An overall NRI, the sum of net percentages of correctly reclassified participants with and without colon cancer, was also reported.
All analyses were performed using SAS version 9.3 for UNIX (SAS Institute). All statistical tests were two-sided using a significance level of P < 0.05.
Results
The mean age of participants at baseline was 52.8 years for women and 54.0 years for men. Among 103,249 participants (63,219 in the NHS and 40,030 in the HPFS), 1,228 (816 in the NHS; 412 in the HPFS) developed colon cancer between 1986 and 2010. Median (25th, 75th percentile) follow-up time was 24 (23, 25) years for the NHS and 24 (24, 24) years for the HPFS.
Supplementary Table S1 shows the age-standardized baseline characteristics of participants who developed colon cancer and those who remained free from cancer during follow-up. Supplementary Table S2 presents the regression coefficients and relative risks for colon cancer for risk factors for the continuous and categorical models. In both models, pack-years of smoking, family history of colorectal cancer, and age were associated with significantly higher risk of colon cancer among women. Postmenopausal hormone use, duration of daily aspirin use, and endoscopic colorectal cancer screening were associated with significantly lower risk in both models. Continuous, but not dichotomized, height was significantly related to a higher risk of colon cancer in women. Similarly, continuous total calcium intake, but not dichotomized calcium intake from dairy products, was significantly related to a lower risk. While continuous BMI was not significantly related to the risk of colon cancer, the risk significantly increased in women with BMI ≥ 30 kg/m² compared with women with BMI < 25 kg/m². Among men, both models showed a significantly higher risk of colon cancer for higher BMI, more pack-years of smoking, higher alcohol consumption with lower levels of multivitamin use, a family history of colon cancer, and older age. Colonoscopy or sigmoidoscopy screening was related to a lower risk of colon cancer in men.
Table 1 summarizes the age-specific c-statistics of each model. The age-adjusted c-statistic of the model with categorized predictor variables was 0.600 in women, which was comparable with that of the model with continuous predictor variables (c-statistic = 0.609, Pdiff = 0.07). The performance of the model with categorized predictor variables was better in women under 70 years than in those over 70. A secondary analysis of women under 70 years showed an improvement in the discrimination of both models; the age-adjusted c-statistic was 0.623 for the model with categorized predictor variables and 0.626 for the model with continuous predictor variables (Pdiff = 0.59). In men, the age-adjusted c-statistic of the model with categorized predictor variables was slightly higher than that of the model with continuous predictor variables, although the difference was not statistically significant (0.622 vs. 0.618, Pdiff = 0.60).
Table 1. Age- and sex-specific c-statistics of two colon cancer risk prediction models in the NHS and the HPFS: one with simplified categorized predictor variables and the other with continuous predictor variables
Age group | Number of cases | Categorical AUC | Categorical SE | Continuous AUC | Continuous SE | Differenceᵃ AUC | Differenceᵃ SE | Differenceᵃ P |
---|---|---|---|---|---|---|---|---|
Women in the NHS | ||||||||
<50 | 19 | 0.633 | 0.063 | 0.646 | 0.063 | −0.013 | 0.031 | 0.69 |
50–54 | 47 | 0.619 | 0.041 | 0.594 | 0.041 | 0.024 | 0.020 | 0.23 |
55–59 | 79 | 0.635 | 0.031 | 0.661 | 0.030 | −0.026 | 0.016 | 0.10 |
60–64 | 163 | 0.598 | 0.022 | 0.608 | 0.022 | −0.010 | 0.011 | 0.40 |
65–69 | 169 | 0.640 | 0.021 | 0.633 | 0.021 | 0.006 | 0.011 | 0.57 |
70–74 | 167 | 0.577 | 0.022 | 0.603 | 0.022 | −0.026 | 0.012 | 0.03 |
≥75 | 172 | 0.551 | 0.022 | 0.562 | 0.022 | −0.011 | 0.011 | 0.34 |
Age-adjusted | 816 | 0.600 | 0.010 | 0.609 | 0.010 | −0.009 | 0.005 | 0.07 |
Men in the HPFS | ||||||||
<50 | 15 | 0.600 | 0.073 | 0.626 | 0.072 | −0.026 | 0.031 | 0.40 |
50–54 | 19 | 0.734 | 0.057 | 0.699 | 0.059 | 0.036 | 0.026 | 0.17 |
55–59 | 37 | 0.673 | 0.044 | 0.657 | 0.044 | 0.017 | 0.020 | 0.40 |
60–64 | 74 | 0.619 | 0.032 | 0.626 | 0.032 | −0.006 | 0.015 | 0.67 |
65–69 | 90 | 0.580 | 0.030 | 0.585 | 0.030 | −0.005 | 0.014 | 0.69 |
70–74 | 92 | 0.619 | 0.029 | 0.607 | 0.029 | 0.012 | 0.014 | 0.36 |
≥75 | 85 | 0.618 | 0.030 | 0.616 | 0.030 | 0.002 | 0.013 | 0.91 |
Age-adjusted | 412 | 0.622 | 0.014 | 0.618 | 0.014 | 0.003 | 0.006 | 0.60 |
Abbreviations: AUC, area under the curve; SE, standard error.
ᵃThe continuous model is the reference model.
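The age-adjusted row of Table 1 can be approximately reproduced by inverse-variance weighting of the age-specific values. The sketch below is only an illustration in Python (the study used SAS); the function name is hypothetical, and the inputs are the rounded AUC and SE values for the categorical model in women, so the result matches the reported 0.600 (SE 0.010) only to within rounding.

```python
import math

# (AUC, SE) by age group for the categorical model in women, copied from Table 1.
women_categorical = [
    (0.633, 0.063),  # <50
    (0.619, 0.041),  # 50-54
    (0.635, 0.031),  # 55-59
    (0.598, 0.022),  # 60-64
    (0.640, 0.021),  # 65-69
    (0.577, 0.022),  # 70-74
    (0.551, 0.022),  # >=75
]


def pool_inverse_variance(estimates):
    """Return the inverse-variance weighted mean and its standard error."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * auc for w, (auc, _) in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


pooled_auc, pooled_se = pool_inverse_variance(women_categorical)
print(f"pooled AUC = {pooled_auc:.3f}, SE = {pooled_se:.3f}")
```

Because the weights are proportional to 1/SE², the age groups with the most cases (60 years and older) dominate the pooled estimate, which helps explain why the age-adjusted value in women sits near the lower AUCs observed at older ages.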
The reclassification of participants on the basis of their predicted risks is presented in Supplementary Tables S3 and S4. Overall, the NRI of the simplified categorical predictor model was −0.065 (95% CI, −0.085 to −0.037) in women and 0.041 (95% CI, −0.048 to 0.106) in men. Among women who developed colon cancer, using categorized lifestyle and dietary predictors instead of continuous predictors incorrectly reclassified a net 4.4% of women into lower risk quartiles. In contrast, among women who did not develop colon cancer, using categorized lifestyle and dietary predictors instead of continuous predictors incorrectly reclassified a net 2.1% of participants into higher risk quartiles. Among men who developed colon cancer, using categorized, rather than continuous, lifestyle and dietary predictors correctly reclassified a net 0.5% of men into higher risk quartiles. Among men who did not develop colon cancer, categorization of continuous lifestyle and dietary predictors correctly reclassified a net 3.7% of participants into lower risk quartiles.
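To make the NRI arithmetic concrete, the sketch below computes a categorical NRI from risk quartile assignments under two models. It is a simplified illustration: it uses crude proportions rather than the Kaplan–Meier estimates used in the analysis, the function name is hypothetical, and the eight-person data set is invented purely to exercise the code. The last lines show that the net percentages reported for women sum to the overall NRI.

```python
def categorical_nri(old_cat, new_cat, event):
    """Return (NRI for cases, NRI for controls, overall NRI).

    old_cat/new_cat: risk category (e.g., quartile 1-4) under the reference
    and the new model; event: 1 for a case, 0 for a control.
    """
    up = {0: 0, 1: 0}
    down = {0: 0, 1: 0}
    n = {0: 0, 1: 0}
    for old, new, e in zip(old_cat, new_cat, event):
        n[e] += 1
        if new > old:
            up[e] += 1
        elif new < old:
            down[e] += 1
    nri_cases = (up[1] - down[1]) / n[1]        # moving up is correct for cases
    nri_controls = (down[0] - up[0]) / n[0]     # moving down is correct for controls
    return nri_cases, nri_controls, nri_cases + nri_controls


# Hypothetical toy data: four cases followed by four controls.
old = [1, 2, 3, 4, 2, 1, 3, 4]
new = [2, 3, 3, 3, 1, 1, 2, 4]
event = [1, 1, 1, 1, 0, 0, 0, 0]
toy_cases, toy_controls, toy_overall = categorical_nri(old, new, event)
print(toy_cases, toy_controls, toy_overall)

# Net percentages reported above for women (cases, then non-cases).
reported_women = -0.044 + -0.021
print(round(reported_women, 3))
```

The same sum for men (0.005 + 0.037) gives 0.042, which differs from the reported 0.041 only by rounding of the component percentages.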
Table 2 shows the calibration of risk prediction models. Both models had poor calibration in women and good calibration in men. The expected counts of colon cancer were estimated by deciles of absolute risks. The observed case count was significantly higher than the expected case count at lower event rates in women. In contrast, the observed case count was significantly lower than the expected case count at higher event rates in women. Overall, the model fit was significantly different from the SEER data and underestimated the risk of colon cancer in women [observed case count/expected case count (O/E) = 1.19 for both models, P < 0.01]. Among men, the observed case count was slightly higher than the predicted case count at lower risk deciles and was slightly lower than the predicted case count at higher risk deciles. Overall, the model fit was not significantly different from the SEER data and slightly underestimated the risk of colon cancer in men (O/E = 1.05, P = 0.34–0.36).
Table 2. Calibration of the models in the NHS and the HPFS against the 1986–2010 SEER data
Risk decile (categorized model) | Observed number of cases (categorized model) | Expected number of cases (categorized model) | Risk decile (continuous model) | Observed number of cases (continuous model) | Expected number of cases (continuous model) |
---|---|---|---|---|---|
Women in the NHS | |||||
1 | 69 | 12.2 | 1 | 65 | 11.8 |
2 | 32 | 22.3 | 2 | 35 | 21.3 |
3 | 59 | 29.2 | 3 | 51 | 28.5 |
4 | 68 | 37.0 | 4 | 77 | 36.2 |
5 | 85 | 46.5 | 5 | 84 | 45.9 |
6 | 83 | 58.6 | 6 | 80 | 57.6 |
7 | 108 | 74.7 | 7 | 99 | 74.0 |
8 | 98 | 95.6 | 8 | 103 | 94.1 |
9 | 96 | 124.1 | 9 | 106 | 124.5 |
10 | 118 | 188.3 | 10 | 116 | 194.1 |
Total | 816 | 688.5 | Total | 816 | 688.0 |
α = 0.1700; SE(α) = 0.0350; P < 0.01 | α = 0.1706; SE(α) = 0.0350; P < 0.01 | ||||
O/E = exp(α) = 1.19 | O/E = exp(α) = 1.19 | ||||
Men in the HPFS | |||||
1 | 20 | 4.0 | 1 | 20 | 4.0 |
2 | 20 | 10.5 | 2 | 20 | 10.7 |
3 | 27 | 15.6 | 3 | 26 | 15.7 |
4 | 31 | 21.1 | 4 | 32 | 21.0 |
5 | 47 | 27.6 | 5 | 42 | 27.6 |
6 | 38 | 35.8 | 6 | 42 | 35.8 |
7 | 44 | 44.7 | 7 | 43 | 45.0 |
8 | 39 | 56.6 | 8 | 38 | 56.2 |
9 | 47 | 72.4 | 9 | 58 | 71.8 |
10 | 99 | 105.6 | 10 | 91 | 104.9 |
Total | 412 | 393.9 | Total | 412 | 392.7 |
α = 0.0451; SE(α) = 0.0493; P = 0.36 | α = 0.0474; SE(α) = 0.0493; P = 0.34 | ||||
O/E = exp(α) = 1.05 | O/E = exp(α) = 1.05 |
Abbreviations: exp, exponential function; O/E, ratio of the observed case count to the expected case count.
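The overall O/E ratios in Table 2 follow directly from the column totals. The sketch below recomputes them in Python and adds a rough Wald-type test; the test is an assumption made for illustration (it treats the observed count as Poisson, so that SE(α) is approximately 1/√O) and is not necessarily the Poisson regression procedure used in the analysis. The function name is hypothetical.

```python
import math


def calibration_summary(observed, expected):
    """Return (O/E, alpha, SE(alpha), two-sided P) for H0: O/E = 1.

    Assumption for this sketch: the observed count is Poisson, so
    alpha = ln(O/E) has standard error of about 1/sqrt(O).
    """
    alpha = math.log(observed / expected)
    se = 1.0 / math.sqrt(observed)
    z = alpha / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return math.exp(alpha), alpha, se, p


# Totals for the model with categorized variables, copied from Table 2.
oe_women, alpha_women, se_women, p_women = calibration_summary(816, 688.5)
oe_men, alpha_men, se_men, p_men = calibration_summary(412, 393.9)
print(f"women: O/E = {oe_women:.2f}, alpha = {alpha_women:.4f}, P = {p_women:.2g}")
print(f"men:   O/E = {oe_men:.2f}, alpha = {alpha_men:.4f}, P = {p_men:.2f}")
```

Under this approximation the values agree closely with the α, SE(α), and P entries in Table 2, which suggests that the reported overall test is driven by the total case counts rather than by the decile-level pattern.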
Table 3 shows incidence rates of colon cancer by deciles of predicted risk. Both models showed a trend of higher incidence rates of colon cancer with higher risk deciles in women. Among women in the highest decile of risk derived from the model with categorized predictors, 118 incident cases were identified, which accounted for 14.5% of all incident colon cancer cases. Compared with the lowest decile of risk, the incidence rate ratio of colon cancer was 1.39 (95% CI, 1.03–1.87) for the highest decile of risk derived from the model with categorized predictor variables and 1.49 (95% CI, 1.10–2.02) for the highest decile of risk derived from the model with continuous predictor variables. Both models showed that men in the higher deciles of risk tended to have a higher incidence rate of colon cancer. Among men in the highest decile of risk, 99 incident cases were identified, which accounted for 24.0% of all incident colon cancer cases. Compared with men in the lowest decile of risk, the incidence rate ratio of colon cancer among those in the highest decile of risk was 4.20 (95% CI, 2.60–6.79) for the model with categorized predictor variables and 3.90 (95% CI, 2.41–6.33) for the model with continuous predictor variables.
Table 3. Incidence rates of colon cancer by deciles of predicted risk
Risk decile | Cases (categorized model) | Person-years (categorized model) | Incidence rate per 100,000 person-years (categorized model) | Incidence rate ratio, 95% CI (categorized model) | Cases (continuous model) | Person-years (continuous model) | Incidence rate per 100,000 person-years (continuous model) | Incidence rate ratio, 95% CI (continuous model) |
---|---|---|---|---|---|---|---|---|
Women in the NHS | ||||||||
1 (lowest) | 69 | 106,642 | 65 | 1.00 (reference) | 65 | 108,870 | 60 | 1.00 (reference) |
2 | 32 | 132,504 | 24 | 0.37 (0.25–0.57) | 35 | 131,114 | 27 | 0.45 (0.30–0.67) |
3 | 59 | 131,621 | 45 | 0.69 (0.49–0.98) | 51 | 131,581 | 39 | 0.65 (0.45–0.94) |
4 | 68 | 128,896 | 53 | 0.82 (0.58–1.14) | 77 | 129,050 | 60 | 1.00 (0.72–1.39) |
5 | 85 | 127,057 | 67 | 1.03 (0.75–1.42) | 84 | 127,739 | 66 | 1.10 (0.80–1.52) |
6 | 83 | 126,052 | 66 | 1.02 (0.74–1.40) | 80 | 125,885 | 64 | 1.06 (0.77–1.48) |
7 | 108 | 126,536 | 85 | 1.32 (0.98–1.78) | 99 | 127,195 | 78 | 1.30 (0.95–1.78) |
8 | 98 | 128,819 | 76 | 1.18 (0.86–1.60) | 103 | 127,900 | 81 | 1.35 (0.99–1.84) |
9 | 96 | 131,189 | 73 | 1.13 (0.83–1.54) | 106 | 130,848 | 81 | 1.36 (1.00–1.85) |
10 (highest) | 118 | 131,190 | 90 | 1.39 (1.03–1.87) | 116 | 130,326 | 89 | 1.49 (1.10–2.02) |
Men in the HPFS | ||||||||
1 (lowest) | 20 | 42,107 | 48 | 1.00 (reference) | 20 | 42,378 | 47 | 1.00 (reference) |
2 | 20 | 65,249 | 31 | 0.65 (0.35–1.20) | 20 | 66,776 | 30 | 0.63 (0.34–1.18) |
3 | 27 | 68,175 | 40 | 0.83 (0.47–1.49) | 26 | 68,132 | 38 | 0.81 (0.45–1.45) |
4 | 31 | 67,105 | 46 | 0.97 (0.55–1.71) | 32 | 66,669 | 48 | 1.02 (0.58–1.78) |
5 | 47 | 65,543 | 72 | 1.51 (0.89–2.55) | 42 | 65,320 | 64 | 1.36 (0.80–2.32) |
6 | 38 | 63,808 | 60 | 1.25 (0.73–2.15) | 42 | 63,641 | 66 | 1.40 (0.82–2.38) |
7 | 44 | 60,793 | 72 | 1.52 (0.90–2.59) | 43 | 61,101 | 70 | 1.49 (0.88–2.53) |
8 | 39 | 59,150 | 66 | 1.39 (0.81–2.38) | 38 | 58,690 | 65 | 1.37 (0.80–2.36) |
9 | 47 | 56,628 | 83 | 1.75 (1.04–2.95) | 58 | 56,041 | 104 | 2.19 (1.32–3.65) |
10 (highest) | 99 | 49,602 | 200 | 4.20 (2.60–6.79) | 91 | 49,411 | 184 | 3.90 (2.41–6.33) |
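The incidence rate ratios in Table 3 can be recomputed from the case counts and person-years. The sketch below uses the standard large-sample confidence interval on the log scale, with SE(ln IRR) = √(1/a + 1/b) for case counts a and b; this is an assumption about how the intervals were derived, although it reproduces the tabulated values for the highest versus lowest decile. The function name is hypothetical.

```python
import math


def incidence_rate_ratio(cases, person_years, ref_cases, ref_person_years, z=1.96):
    """Return (IRR, lower, upper) with a Wald interval on the log scale."""
    irr = (cases / person_years) / (ref_cases / ref_person_years)
    se_log = math.sqrt(1.0 / cases + 1.0 / ref_cases)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper


# Highest vs. lowest decile, model with categorized variables (Table 3).
irr_w, lo_w, hi_w = incidence_rate_ratio(118, 131_190, 69, 106_642)  # women, NHS
irr_m, lo_m, hi_m = incidence_rate_ratio(99, 49_602, 20, 42_107)     # men, HPFS
print(f"women: {irr_w:.2f} ({lo_w:.2f}-{hi_w:.2f})")
print(f"men:   {irr_m:.2f} ({lo_m:.2f}-{hi_m:.2f})")
```

The wide interval for men reflects the small number of cases (20) in the reference decile.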
Discussion
Continuous variables are often converted into categorical variables in epidemiologic research by grouping values into two or more categories. Categorization of continuously distributed variables relies on assumed homogeneity of risk within categories, leading to both a loss of statistical power and residual confounding (8, 20). In this analysis, we categorized measures of height, BMI, duration of daily aspirin use, and lifestyle and diet factors and observed that, except for calcium intake, their associations with colon cancer risk were similar to those of the continuous forms. Compared with the model including continuous lifestyle and dietary factors, the discriminatory ability of the model with categorical variables was not significantly different in terms of the c-statistic. Both models had good calibration in men but not in women; there was no notable difference in calibration between the models with different forms of dietary and lifestyle predictors. In models that used categorized instead of continuous predictor variables, risk reclassification was slightly improved in men and slightly worsened in women, with an overall reclassification rate of 4.1%–6.5%. These data support the use of categorized predictor variables in colon cancer risk prediction models and more broadly in primary care settings.
More than 10 risk prediction models for colon cancer have been developed (7, 13). Most models have acceptable to good discrimination with a c-statistic of 0.60–0.77. Some of the models with questionnaire variables and nongenetic biomarkers showed better discrimination than those with questionnaire variables only (7). However, measurement and availability of biomarkers might not be cost effective or feasible and thus might be inappropriate for stratifying screening recommendations in the general population. It is also unlikely that individuals in the general population will have access to the detailed medical information for biomarker-based risk assessment tools, so their potential for translation outside the clinical setting is further limited.
The discriminatory ability of our categorical model for men was acceptable and better than that for women. Our model for women was comparable to a recently updated and expanded Rosner-Wei model of colon cancer risk in terms of risk factor profiles and c-statistics (13). That model added folate intake, which was significantly associated with a lower risk of colon cancer. The value of adding folate intake should be assessed in future studies. Consistent with the Rosner-Wei model, we observed that the c-statistic of our model was lower among older women (≥70 years). A better understanding of the colon cancer risk factors specifically relevant to people beyond 70 years might improve the performance of the model in this population.
The colon cancer risk prediction models we examined did not distinguish among low-risk groups of women. This indicates that potentially relevant risk factors might have been missed. However, this lack of distinction was not due to categorization of continuous predictor variables. Identifying the population at extremely low risk of colon cancer was beyond the scope of this study.
The primary limitation of our analysis was related to the accuracy of self-reported risk factors for colon cancer. However, validation studies in the NHS and the HPFS demonstrated acceptable validity of self-reported risk factors. Self-reported weight from 123 men in the HPFS and 140 women in the NHS was highly correlated with objective measures (r = 0.97; ref. 21). The FFQ was validated in the NHS and HPFS using food records, and estimated intakes were reasonably correlated with actual intakes (11, 22, 23). Leisure-time physical activity measures have been demonstrated to have moderate reliability and validity in a subgroup of participants in the NHS II, a prospective study in a younger cohort of U.S. female nurses with data collection similar to the NHS (24). Second, a large number of participants with incomplete risk factor information were excluded from the analysis, which might reduce the generalizability of the results. Third, due to lack of information on daily aspirin use in the HPFS, we could not assess the contribution of long-term aspirin use to colon cancer risk in men or whether it improves the performance of the model. Colon cancer risk prediction may be improved by taking into consideration the latency between exposures and cancer diagnosis (25). However, the purpose of the current study was to approximate a clinical encounter in which information regarding cancer latency periods may not be available. Processed red meat intake has been associated with a higher risk of colon cancer (13). Future categorized models should consider adding processed red meat intake. The inconsistent associations of continuous and dichotomized calcium intake with colon cancer risk suggest that the defined cut-off value of calcium intake might not be appropriate and should be modified. Finally, the type of endoscopic colorectal cancer screening was not collected during early follow-up periods through 2002. Our defined variable of endoscopic colorectal cancer screening within a 10-year interval might be underestimated because a 5-year interval is recommended for sigmoidoscopy.
Our results suggest that categorization of continuously distributed lifestyle and dietary factors did not significantly affect the discrimination and calibration of the model for colon cancer risk prediction. The performance of the model with categorized predictor variables was acceptable for men and women under age 70. Considering the potentially greater acceptability of categorized risk factor ascertainment among users, future risk prediction modelers should consider developing models based on categorical risk factors.
Disclosure of Potential Conflicts of Interest
E. Wei is a senior staff clinical scientist and has ownership interest (including stock, patents, etc.) at GRAIL, Inc. No potential conflicts of interest were disclosed by other authors.
Disclaimer
The study sponsors had no role in the design of the study; the collection, analysis, and interpretation of the data; the writing of the manuscript; or the decision to submit the manuscript for publication.
Authors' Contributions
Conception and design: Y. Liu, G.A. Colditz, H. Dart, E.A. Waters
Development of methodology: Y. Liu, G.A. Colditz
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): Y. Liu, G.A. Colditz
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): Y. Liu, G.A. Colditz, B.A. Rosner, H. Dart, E. Wei
Writing, review, and/or revision of the manuscript: Y. Liu, G.A. Colditz, B.A. Rosner, H. Dart, E. Wei, E.A. Waters
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): Y. Liu
Study supervision: E.A. Waters
Acknowledgments
We would like to thank the participants and staff of the Nurses’ Health Study and the Health Professionals Follow-up Study, the Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School for their valuable contributions. We also thank the following state cancer registries for their help: AL, AZ, AR, CA, CO, CT, DE, FL, GA, ID, IL, IN, IA, KY, LA, ME, MD, MA, MI, NE, NH, NJ, NY, NC, ND, OH, OK, OR, PA, RI, SC, TN, TX, VA, WA, WY. The authors assume full responsibility for analyses and interpretation of these data. This study was supported by a research grant from the NIH (R01CA190391). Y. Liu, G. Colditz, and E. Waters were supported by the NCI (R01CA190391 and P30 CA091842) and the Foundation for Barnes Jewish Hospital (St. Louis, MO). The Nurses’ Health Study (UM1CA186107; P01CA87969) and the Health Professionals Follow-up Study (UM1CA167552) were supported by the NIH.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.