Abstract
Identifying risk factors for early-onset colorectal cancer (EOCRC) could help reverse its rising incidence through risk factor reduction and/or early screening. We sought to identify EOCRC risk factors that could be used for decisions about early screening. Using electronic databases and medical record review, we compared male veterans ages 35 to 49 years diagnosed with sporadic EOCRC (2008–2015) with clinic and colonoscopy controls without colorectal cancer, matched 1:4, excluding those with established inflammatory bowel disease, high-risk polyposis and nonpolyposis syndromes, prior bowel resection, or high-risk family history. We ascertained sociodemographic and lifestyle factors, family and personal medical history, physical measures, vital signs, medications, and laboratory values 6 to 18 months prior to case diagnosis. In the derivation cohort (75% of the total sample), univariate and multivariate logistic regression models were used to derive a full model and a more parsimonious model. Both models were tested using a validation cohort. Among 600 cases of sporadic EOCRC [mean (SD) age 45.2 (3.5) years; 66% White], 1,200 primary care clinic controls [43.4 (4.2) years; 68% White], and 1,200 colonoscopy controls [44.7 (3.8) years; 63% White], independent risk factors included age, cohabitation and employment status, body mass index (BMI), comorbidity, colorectal cancer or other visceral cancer in a first- or second-degree relative (FDR or SDR), alcohol use, exercise, hyperlipidemia, and use of statins, NSAIDs, and multivitamins. Validation c-statistics were 0.75–0.76 for the full model and 0.74–0.75 for the parsimonious model. These independent risk factors for EOCRC may identify veterans for whom colorectal cancer screening prior to age 45 or 50 years should be considered.
Screening 45- to 49-year-olds for colorectal cancer is relatively new with uncertain uptake thus far. Furthermore, half of EOCRC occurs in persons < 45 years old. Using risk factors may help 45- to 49-year-olds accept screening and may identify younger persons for whom earlier screening should be considered.
Introduction
Although the United States has had clear and sustained declines in colorectal cancer incidence and mortality in persons aged 50 years and older, there has been a steady increase in both incidence and mortality for persons under the age of 50 (1). Nearly 10% of all colorectal cancers occur in persons younger than 50 years, half of which occur in persons younger than 45 years of age (2).
On the basis of trends in U.S. population–based data, the American Cancer Society (ACS) made a qualified recommendation in 2018 to begin average-risk screening at age 45, deviating from nearly all other professional guideline organizations’ recommendations at that time (3). In 2021, the American College of Gastroenterology made a conditional recommendation based on very low-quality evidence to begin average-risk screening at age 45, updating its previous guideline to begin average-risk screening at age 45 only for Blacks based on higher colorectal cancer incidence and mortality (4). Most recently, the U.S. Preventive Services Task Force recommended screening in adults ages 45 to 49 years with moderate certainty of a moderate net benefit (5). However, even if screening at 45 is eventually well-accepted among guideline organizations and by patients and providers, half of early-onset colorectal cancer (EOCRC) will remain undetected prior to the onset of signs and symptoms. Clearly, better detection is needed in the shorter term.
Two measures can be used to improve EOCRC detection. One measure is to identify persons who have familial risk, for whom screening prior to age 50 is well accepted. Individuals with a first-degree relative (FDR) younger than age 60 or with two FDRs regardless of age should be screened with colonoscopy before the age of 50 years (6). Using data from the Colon Cancer Family Registry, Gupta and colleagues determined that 25% of persons 40 to 49 years with EOCRC met the criteria for family history–based early screening (7). The other measure involves lowering the patient threshold for reporting lower gastrointestinal symptoms, particularly blood per rectum, and lowering the provider threshold to act on these symptoms. Although both measures would have an immediate impact on disease detection, the proportion of younger persons identified among the susceptible population is likely to be low.
An intermediate measure to improve detection is to identify risk factors for EOCRC and to examine these factors to make decisions about whom to screen early and perhaps how to screen them, with incidence and mortality reduction through detection of early-stage colorectal cancer and prevention through detection and removal of advanced, precancerous lesions. Furthermore, identifying risk factors for EOCRC could also help reverse current incidence and mortality trends through risk factor reduction. Because veterans are a high-risk group for colorectal cancer for both male predominance and factors independent of sex, we undertook this study to identify risk factors for EOCRC, with the longer-term goal of determining who should be considered for early screening.
Materials and Methods
This National, Veterans Affairs (VA)-based, retrospective case–control study was approved by the Institutional Review Board (IRB) at Indiana University Purdue University (Indianapolis, IN) and by the Research and Development Committee at the Richard L. Roudebush VA Medical Center (Indianapolis, IN), which follow the Declaration of Helsinki, Belmont Report, and U.S. Common Rule. The study was funded by Health Services Research and Development, Veterans Health Administration (IIR 14–011), which had no role in the conduct, analysis, or interpretation of the study findings. Waiver of consent was granted given that the data were deidentified. We followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines in preparing this manuscript.
Using National VA datasets, including the Corporate Data Warehouse (CDW) and the Computerized Patient Record System (CPRS), we identified a National sample of male veterans with EOCRC and conducted chart reviews to ensure study eligibility. The VA CDW is a National repository of clinical data from electronic medical records (EMR) and other sources (e.g., billing). The VA CDW Oncology Raw registry contains direct extracts from CDW of oncology-specific data, all of which are ultimately located in the VA Central Cancer Registry and were used to initially identify veterans with colorectal cancer.
Using the VA CDW Oncology registry, we identified 956 men ages 35 to 49 years diagnosed with sporadic (i.e., nonhereditary) adenocarcinoma of the colon or rectum between 2008 and 2015. VA CDW data were then used to identify and exclude 356 cases with a history of inflammatory bowel disease, previous bowel resection, or a prior colorectal cancer diagnosis. A total of 850 colorectal cancer cases met study inclusion criteria, and chart reviews were then conducted to determine final eligibility. Chart review was used to confirm the colorectal cancer diagnosis and to exclude those with established inflammatory bowel disease, established high-risk syndromes (i.e., polyposis syndromes, Lynch syndrome, or others), previous bowel resection for any reason, or established high-risk family history (defined as one FDR with colorectal cancer diagnosed prior to age 60 or two FDRs with colorectal cancer regardless of age). The final analytic case population included N = 600 males with EOCRC. Cases were matched for the year of diagnosis and VA facility to two control groups, with two controls obtained from each group. The colonoscopy-negative controls included male veterans who had no advanced neoplasia (colorectal cancer or advanced, precancerous polyp) found on a colonoscopy performed for diagnostic (i.e., for symptoms or signs) indications. The clinic controls did not have a diagnosis of colorectal cancer or advanced adenoma and were seen in the primary care outpatient setting ≥ 1 times per year for the 2 years prior to the date of diagnosis of the matched case.
CDW data identified 50,939 potentially eligible colonoscopy controls and 982,412 potentially eligible clinic controls. As with the case population, additional exclusion criteria were applied to controls using CDW data and via chart review. Eligible controls from both control groups were randomly selected for EMR review from the same VA facility as the matched case. EMR reviews were conducted until there were two controls who met study inclusion criteria from each control group for each eligible case. Thus, each eligible case was matched on VA facility to two eligible colonoscopy-negative controls and two eligible clinic controls. Ultimately, the EMRs of 2,144 colonoscopy controls and 1,843 clinic controls were reviewed until 1,200 eligible colonoscopy controls and 1,200 eligible clinic controls were identified. The most common reasons for exclusion from the colonoscopy-negative control group included incomplete or prior colonoscopy and colonoscopy for high-risk screening or surveillance (i.e., prior colorectal neoplasia) indications. The most common reasons for exclusion from the clinic control group also included prior colonoscopy or inability to confirm prior primary care visits. Agreement for selection as a case or control among research assistants who abstracted data from the medical record was assessed in small samples of subjects with adjudication by the project manager or first author.
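The sequential review procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`facility`, `year`, `prior_colonoscopy`), the function name `select_controls`, and the eligibility rule are hypothetical and are not the study's actual data structures or criteria.

```python
import random

def select_controls(case, candidates, is_eligible, per_case=2, rng=None):
    """Review candidates matched on facility and year, in random order,
    until `per_case` of them pass the (hypothetical) eligibility check."""
    rng = rng or random.Random()
    pool = [c for c in candidates
            if c["facility"] == case["facility"] and c["year"] == case["year"]]
    rng.shuffle(pool)                      # random order of chart review
    selected, reviewed = [], 0
    for cand in pool:
        if len(selected) == per_case:
            break
        reviewed += 1                      # one chart review performed
        if is_eligible(cand):
            selected.append(cand)
    return selected, reviewed

# Hypothetical toy data: 40 candidate controls at two facilities.
candidates = [
    {"id": i, "facility": "A" if i % 2 else "B", "year": 2010,
     "prior_colonoscopy": i % 3 == 0}
    for i in range(40)
]
case = {"id": "case-1", "facility": "A", "year": 2010}
selected, reviewed = select_controls(
    case, candidates, lambda c: not c["prior_colonoscopy"],
    rng=random.Random(0),
)
```

The number of charts reviewed exceeds the number of controls selected whenever ineligible candidates are encountered, which is consistent with the larger counts of reviewed EMRs reported above.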
Covariates
After verifying study eligibility, research assistants abstracted the following data categories and variables from the EMR for the N = 600 cases and N = 2,400 controls: sociodemographic factors (age, sex, race/ethnicity, occupation, and marital and employment status), lifestyle factors (cigarette smoking, ethanol use, regular physical activity), medical history (diagnoses, prescription, and over-the-counter medications), family medical history of colorectal cancer and any visceral cancer, physical measures [height, weight, and body mass index (BMI)], vital signs, and basic laboratory test results. For variables that could change with the onset of colorectal cancer, including vital signs, physical measures, and laboratory tests, we abstracted values identified between 6 and 18 months prior to colorectal cancer diagnosis for cases and closest to the date of colorectal cancer diagnosis for the matched controls. If data were unavailable within the 6- to 18-month timeframe, the variable was considered missing. Agreement among research assistants for abstraction of certain variables was measured for 10% of cases and controls, with agreement between pairs ranging from 77% to 100% and with adjudication by either the project manager or first author.
On the basis of the same timeframe used for electronic medical record review and abstraction, certain variables [e.g., Charlson comorbidity index score (CCI), medication use, and service-connected disability] were extracted from the VA CDW. The CCI was originally developed with 19 medical conditions with different clinical weights to predict mortality (8). The total CCI score consists of the sum of the weights, with higher scores indicating more severe comorbid conditions. For this study, we identified clinical conditions coded in the medical record using International Classification of Diseases (ICD)-9 diagnosis codes in the 1-year period prior to the index date (colorectal cancer diagnosis date for cases or corresponding date for controls) and used the Deyo adaptation (9) of the CCI score to calculate a total CCI. Veterans with service-connected disabilities do not have copayments for VA healthcare. Veterans without service-connected disability may or may not have a copayment for care; this status is determined by income, coinsurance, and other factors, and was considered a proxy variable for socioeconomic status.
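As a rough illustration of how a total CCI is computed as a sum of weights, the sketch below uses the weights commonly cited for the 17 Deyo comorbidity categories. The category names are hypothetical labels, the ICD-9 code mapping is omitted, and no hierarchy rule (counting only the more severe of two related conditions) is applied, so this should not be read as the study's exact scoring procedure.

```python
# Commonly cited Charlson weights for the 17 Deyo categories
# (illustrative labels; the ICD-9 mapping itself is not shown).
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "dementia": 1,
    "chronic_pulmonary_disease": 1,
    "rheumatologic_disease": 1,
    "peptic_ulcer_disease": 1,
    "mild_liver_disease": 1,
    "diabetes": 1,
    "diabetes_with_complications": 2,
    "hemiplegia_or_paraplegia": 2,
    "renal_disease": 2,
    "any_malignancy": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "aids": 6,
}

def charlson_score(conditions):
    """Sum the weights of the distinct conditions coded in the look-back window."""
    return sum(CHARLSON_WEIGHTS[c] for c in set(conditions))

def charlson_category(score):
    """Collapse the score into the three levels used in Table 1."""
    return "0" if score == 0 else "1" if score == 1 else "2 or more"

score = charlson_score(["diabetes", "renal_disease"])
category = charlson_category(score)
```

In this sketch a veteran coded with uncomplicated diabetes and renal disease has a score of 1 + 2 = 3 and falls into the "2 or more" category.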
Analysis
As the objective of the analysis was to identify variables that could help clinicians determine a patient's risk for EOCRC, the two control groups were combined, as they differed only in the presence of prior diagnostic colonoscopy. The dataset was then randomly partitioned into a derivation set of 2,250 subjects (450 cases; 1,800 controls) and a validation set of 750 subjects (150 cases; 600 controls), while maintaining the proportion of cases to controls at 1 to 4 and with equal numbers of colonoscopy and clinic controls (Fig. 1). Sixty-seven variables were considered as candidate independent variables. Those with missing values greater than 30% (n = 12) or with low frequency or sparseness (i.e., variable characteristics less than 5%; n = 8) were excluded from subsequent analyses, leaving 47 candidate variables. Full details on multiple imputation and model selection are included in Supplementary Methods S1. Briefly, because there were 47 candidate variables, 27 of which had some level of missing data, we started with a screening phase in the derivation cohort to reduce the number of potential variables. The screening phase resulted in 15 candidate variables for model selection, as described in Supplementary Methods S1. None of the 20 excluded variables, except for education level, has been shown or is believed to be associated with colorectal cancer.
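A stratified random partition of this kind can be sketched as follows. The sketch splits each of the three groups 75%/25% independently, which preserves the 1:4 case-to-control ratio and the equal numbers of clinic and colonoscopy controls; whether the study kept matched sets together is not stated, so that detail is not modeled. The record layout and the name `stratified_split` are assumptions for illustration.

```python
import random

def stratified_split(records, frac=0.75, seed=0):
    """Randomly split each group separately so group proportions are preserved."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec["group"], []).append(rec)
    derivation, validation = [], []
    for members in by_group.values():
        members = members[:]               # copy before shuffling
        rng.shuffle(members)
        cut = round(len(members) * frac)
        derivation.extend(members[:cut])
        validation.extend(members[cut:])
    return derivation, validation

# Group sizes taken from the study: 600 cases, 1,200 + 1,200 controls.
records = (
    [{"id": f"case-{i}", "group": "case"} for i in range(600)]
    + [{"id": f"clinic-{i}", "group": "clinic"} for i in range(1200)]
    + [{"id": f"colo-{i}", "group": "colonoscopy"} for i in range(1200)]
)
derivation, validation = stratified_split(records)

def count(rows, group):
    return sum(1 for r in rows if r["group"] == group)
```

With these inputs the derivation set contains 450 cases and 900 controls of each type, and the validation set contains 150 cases and 300 controls of each type.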
Figure 1. Diagram for model development, describing how the study population was divided into derivation and test (validation) groups and how the analysis was conducted.
To determine the best model using the screened-in 15 variables, the BOOT MI method (10–12) was implemented, in which 200 bootstrap samples of the derivation cohort (n = 2,250) were created (including data with missing values; Fig. 1). Multiple imputations (five imputations) were then performed on each bootstrap sample using chained-equations yielding 1,000 (200 × 5) bootstrap-MI datasets for model selection. Model selection using the bestglm package in R (RRID:SCR_001905) was conducted on each dataset. The top 10 most frequently selected models from the search algorithm were then fitted to the validation data, and the models were ranked by the area under the receiver operating characteristic curve (AUC). Multiple logistic regression models for the full model, including all 15 variables, and the final top-ranked AUC model were provided for the derivation cohort based on the bootstrap-MI estimates. The model estimates account for both the uncertainty due to model selection (of the screened variables) and the uncertainty in the estimates due to multiple imputation.
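The nesting of bootstrap resampling, imputation, and model selection can be sketched as below. This is a structural illustration only: the random hot-deck imputer stands in for chained-equations imputation, `toy_selector` stands in for the bestglm search, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Resample rows with replacement, keeping missing values (None) as they are."""
    return [rng.choice(rows) for _ in rows]

def hot_deck_impute(rows, rng):
    """Stand-in imputer: replace each None with a random observed value
    of the same variable (NOT chained equations)."""
    observed = {}
    for row in rows:
        for key, value in row.items():
            if value is not None:
                observed.setdefault(key, []).append(value)
    return [
        {k: v if v is not None
            else (rng.choice(observed[k]) if observed.get(k) else None)
         for k, v in row.items()}
        for row in rows
    ]

def toy_selector(rows, variables=("alcohol", "nsaid"), threshold=0.1):
    """Stand-in for model selection: keep variables whose case/control
    means differ by more than `threshold`."""
    cases = [r for r in rows if r["case"] == 1]
    controls = [r for r in rows if r["case"] == 0]
    if not cases or not controls:
        return ()
    return tuple(
        v for v in variables
        if abs(mean(r[v] for r in cases) - mean(r[v] for r in controls)) > threshold
    )

def boot_mi(rows, select_model, n_boot=200, n_imp=5, seed=0):
    """Tally which model is selected on each of n_boot x n_imp datasets."""
    rng = random.Random(seed)
    tally, n_datasets = Counter(), 0
    for _ in range(n_boot):
        sample = bootstrap_sample(rows, rng)
        for _ in range(n_imp):
            tally[select_model(hot_deck_impute(sample, rng))] += 1
            n_datasets += 1
    return tally, n_datasets

# Hypothetical toy data with about 15% missingness in one variable.
data_rng = random.Random(1)
rows = []
for i in range(200):
    case = 1 if i % 5 == 0 else 0
    alcohol = 1 if data_rng.random() < (0.75 if case else 0.55) else 0
    nsaid = 1 if data_rng.random() < (0.20 if case else 0.40) else 0
    if data_rng.random() < 0.15:
        alcohol = None
    rows.append({"case": case, "alcohol": alcohol, "nsaid": nsaid})

tally, n_datasets = boot_mi(rows, toy_selector, n_boot=20, n_imp=5)
```

With the study's settings (`n_boot=200`, `n_imp=5`) the loop would produce the 1,000 bootstrap-MI datasets described above, and `tally.most_common(10)` would correspond to the ten most frequently selected models.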
The full and top-ranked AUC models were also fitted to the five imputed (13) datasets in the validation cohort. The model metrics of discrimination and calibration are reported. The observed and expected rates of EOCRC were plotted for the models using validation data. All variables were reverse-coded, if necessary, such that higher values indicated increased odds of EOCRC. All analyses were conducted using R statistical software. All authors had access to the study data and reviewed and approved the final manuscript.
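For readers less familiar with the two summary metrics reported in the Results, the following minimal sketch computes the AUC (as the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control, with ties counted as one half) and the Brier score (mean squared difference between predicted probability and outcome). The example values are invented and the function names are illustrative; the study's analyses were run in R.

```python
def auc(outcomes, predictions):
    """Area under the ROC curve via pairwise case/control comparisons."""
    cases = [p for y, p in zip(outcomes, predictions) if y == 1]
    controls = [p for y, p in zip(outcomes, predictions) if y == 0]
    wins = sum(
        1.0 if pc > pn else 0.5 if pc == pn else 0.0
        for pc in cases for pn in controls
    )
    return wins / (len(cases) * len(controls))

def brier_score(outcomes, predictions):
    """Mean squared error of predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)

# Invented example: two controls followed by two cases.
y = [0, 0, 1, 1]
p = [0.10, 0.40, 0.35, 0.80]
example_auc = auc(y, p)          # 3 of 4 case/control pairs ranked correctly
example_brier = brier_score(y, p)
```

As a point of reference, with one case per four controls (prevalence 0.20), assigning every subject a predicted probability of 0.20 yields a Brier score of 0.20 × 0.80 = 0.16, so values below 0.16 indicate some gain over that uninformative prediction.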
Data availability
The dataset supporting this article is not available. According to Department of VA policy, these data are stored behind the VA firewall and cannot be shared even after deidentification. Investigators interested in the analyses of existing data are encouraged to contact the corresponding author.
Results
Figure 2 shows how patients were identified, excluded, and enrolled as study subjects into the study's case and control groups. The baseline characteristics of the cases and the two control groups are shown in Table 1. The three groups were comparable in mean age (45.2 years for cases, 44.7 years for colonoscopy controls, and 43.4 years for clinic controls). Most study subjects were White (65.7%), followed by Black (30.9%), and Other (3.5%). Clinic controls were more likely to be White than colonoscopy controls (68.4% vs. 62.7%). Cases were more likely to indicate current alcohol use (75.1%) compared with colonoscopy (65%) and clinic (50.2%) controls, and they were less likely to exercise (6.8%) compared with the colonoscopy and clinic controls, who reported exercise in 9.3% and 13.6%, respectively. The American Joint Committee on Cancer (AJCC) stage distribution among the cases was as follows: stage I, 17.5%; stage II, 16.7%; stage III, 33.0%; stage IV, 30.0%; and stage unknown, 2.8%.
Figure 2. Flow diagram for derivation of cases and controls. CONSORT-style diagram showing how cases and controls were screened, excluded, and enrolled into the study.
Table 1. Descriptive characteristics for EOCRC cases, clinic controls, and colonoscopy controls.
Characteristic | CRC cases (N = 600) | Clinic controls (N = 1,200) | Colonoscopy controls (N = 1,200) | Total (N = 3,000) |
---|---|---|---|---|
Age at index date | ||||
Mean (SD) | 45.2 (3.5) | 43.4 (4.2) | 44.7 (3.8) | 44.3 (4.0) |
Range | 35.0–49.0 | 35.0–50.0 | 35.0–50.0 | 35.0–50.0 |
Sex | ||||
Male | 600 (100.0%) | 1,200 (100.0%) | 1,200 (100.0%) | 3,000 (100.0%) |
Race | ||||
Black | 184 (30.7%) | 334 (27.8%) | 408 (34.0%) | 926 (30.9%) |
White | 397 (66.2%) | 821 (68.4%) | 752 (62.7%) | 1,970 (65.7%) |
Other | 19 (3.2%) | 45 (3.8%) | 40 (3.3%) | 104 (3.5%) |
Marital status | ||||
Living alone | 355 (59.3%) | 535 (49.2%) | 613 (52.1%) | 1,503 (52.5%) |
Living with partner | 244 (40.7%) | 552 (50.8%) | 563 (47.9%) | 1,359 (47.5%) |
Missing (n) | 1 | 113 | 24 | 138 |
Employment history | ||||
Employed | 396 (79.7%) | 1,163 (97.6%) | 765 (79.0%) | 2,324 (87.5%) |
Unemployed/retired/disabled | 101 (20.3%) | 28 (2.4%) | 203 (21.0%) | 332 (12.5%) |
Missing (n) | 103 | 9 | 232 | 344 |
Smoking | ||||
Current | 184 (30.8%) | 418 (35.6%) | 391 (32.6%) | 993 (33.4%) |
Former | 113 (18.9%) | 234 (19.9%) | 233 (19.4%) | 580 (19.5%) |
Never used | 300 (50.3%) | 523 (44.5%) | 575 (48.0%) | 1,398 (47.1%) |
Missing (n) | 3 | 25 | 1 | 29 |
Smoking, number of pack years | ||||
Mean (SD) | 8.5 (15.3) | 7.4 (13.2) | 7.2 (13.1) | 7.6 (13.7) |
Range | 0.0–92.0 | 0.0–105.0 | 0.0–90.0 | 0.0–105.0 |
Missing (n) | 111 | 379 | 306 | 796 |
Alcohol use | ||||
Current | 346 (75.1%) | 592 (50.2%) | 561 (65.0%) | 1,499 (59.9%) |
Former | 92 (20.0%) | 383 (32.5%) | 279 (32.3%) | 754 (30.1%) |
None | 23 (5.0%) | 205 (17.4%) | 23 (2.7%) | 251 (10.0%) |
Missing (n) | 139 | 20 | 337 | 496 |
Personal cancer history | 15 (2.5%) | 24 (2.0%) | 27 (2.25%) | 66 (2.2%) |
Exercise | 41 (6.8%) | 163 (13.6%) | 112 (9.3%) | 316 (10.5%) |
Family Hx – CRC in FDR | 30 (5.0%) | 5 (0.4%) | 48 (4.0%) | 83 (2.8%) |
Family Hx – CRC in FDR or SDR | 101 (16.8%) | 25 (2.1%) | 142 (11.8%) | 268 (8.9%) |
Family Hx – Any visceral cancer in FDR | 172 (28.7%) | 191 (15.9%) | 225 (18.8%) | 588 (19.6%) |
Family Hx – Any visceral cancer in FDR or SDR | 224 (37.3%) | 258 (21.5%) | 292 (24.3%) | 774 (25.8%) |
CCI | ||||
Mean (SD) | 1.1 (2.2) | 0.5 (1.1) | 0.7 (1.3) | 0.7 (1.5) |
CCI–categorized | ||||
0 | 378 (63.0%) | 877 (73.1%) | 818 (68.2%) | 2,073 (69.1%) |
1 | 75 (12.5%) | 218 (18.2%) | 209 (17.4%) | 502 (16.7%) |
2 or more | 147 (24.5%) | 105 (8.8%) | 173 (14.4%) | 425 (14.2%) |
Metformin | 41 (6.8%) | 103 (8.6%) | 87 (7.2%) | 231 (7.7%) |
Statin | 96 (16.0%) | 330 (27.5%) | 329 (27.4%) | 755 (25.2%) |
NSAIDs | 117 (19.5%) | 481 (40.1%) | 484 (40.3%) | 1,082 (36.1%) |
Aspirin | 57 (9.5%) | 170 (14.2%) | 184 (15.3%) | 411 (13.7%) |
Multivitamin | 47 (7.8%) | 165 (13.8%) | 171 (14.2%) | 383 (12.8%) |
Vascular disease | 64 (10.7%) | 72 (6.0%) | 137 (11.4%) | 273 (9.1%) |
Hyperlipidemia | 217 (36.2%) | 609 (50.8%) | 553 (46.1%) | 1,379 (46.0%) |
BMI | ||||
Mean (SD) | 30.4 (6.6) | 31.1 (6.2) | 30.7 (6.1) | 30.8 (6.2) |
Missing (n) | 58 | 1 | 3 | 62 |
Systolic BP | ||||
Mean (SD) | 128.1 (12.6) | 127.3 (11.8) | 127.1 (11.5) | 127.4 (11.9) |
Missing (n) | 52 | 1 | 2 | 55 |
Diastolic BP | ||||
Mean (SD) | 80.9 (9.6) | 80.2 (8.7) | 79.9 (8.6) | 80.2 (8.8) |
Missing (n) | 52 | 1 | 2 | 55 |
Waist circumference | ||||
Mean (SD) | 105.1 (16.8) | 106.8 (15.6) | 105.8 (15.5) | 106.1 (15.8) |
Missing (n) | 58 | 1 | 3 | 62 |
Abbreviations: BP, blood pressure; CRC, colorectal cancer; Hx, history.
Variables with notable univariate numerical differences between clinic and colonoscopy controls include the proportion of subjects 45 to 49 years (45.9% vs. 59.1%, respectively), employment at the time of enrollment (93.8% vs. 75.7%, respectively), and the proportion with an FDR or second-degree relative (SDR) with colorectal cancer (2.1% vs. 11.8%, respectively). Otherwise, the baseline features between the two control groups were comparable, providing justification for combining them into one control group.
On the basis of the derivation subgroup of 450 cases and 1,800 controls, logistic regression results from the screening phase for each candidate covariate are reported in Supplementary Table S1. The 15 candidate variables identified on the basis of univariate P values were age, cohabitation status, employment status, service-connected disability status, BMI < 20 kg/m², CCI, any visceral cancer in an FDR or SDR, colorectal cancer in an FDR or SDR, current alcohol use, any reported exercise, hyperlipidemia, and reported regular use of multivitamins, aspirin, NSAIDs, and statins.
The full multiple logistic regression model, which included 15 variables, is presented in Table 2, which displays adjusted odds ratios (OR) for each variable in both models. This model showed good discrimination, as measured by an AUC of 0.748 [95% confidence interval (CI), 0.720–0.773], and a Brier score of 0.138 (95% CI, 0.133–0.144). Increased odds of EOCRC in the derivation cohort were associated with older age (OR, 1.08; 95% CI, 1.05–1.11), unemployment/retirement (OR, 1.44; 95% CI, 1.10–1.89), not having a service-connected disability (OR, 1.52; 95% CI, 1.20–1.90), low BMI (<20 kg/m²; OR, 3.42; 95% CI, 1.37–9.51), higher comorbidity (OR, 1.15; 95% CI, 1.06–1.23), current alcohol use (OR, 1.74; 95% CI, 1.40–2.18), no exercise (OR, 2.05; 95% CI, 1.34–3.20), having an FDR or SDR with colorectal cancer (OR, 2.28; 95% CI, 1.66–3.34) or visceral cancer (OR, 1.70; 95% CI, 1.29–2.18), and not using multivitamins (OR, 1.76; 95% CI, 1.16–2.80), NSAIDs (OR, 2.44; 95% CI, 1.93–3.11), or statins (OR, 1.56; 95% CI, 1.15–2.23).
Table 2. Adjusted ORsᵃ in the final models estimated from bootstrap estimates in the derivation cohort (N = 2,250).
Variable | 15-variable model, OR (95% CI)ᵇ | 7-variable model, OR (95% CI)ᵇ |
---|---|---|
Age | 1.08 (1.05–1.11) | 1.09 (1.06–1.12) |
Living alone | 1.15 (0.90–1.44) | — |
Unemployed or retired | 1.44 (1.10–1.89) | — |
Nonservice connected or copay | 1.52 (1.20–1.90) | 1.61 (1.27–1.97) |
BMI < 20 kg/m² (vs. 20–25 kg/m²) | 3.42 (1.37–9.51) | — |
CCI | 1.15 (1.06–1.23) | 1.15 (1.06–1.23) |
Any visceral cancer FDR or SDR | 1.70 (1.29–2.18) | — |
CRC in FDR or SDR | 2.28 (1.66–3.34) | 2.50 (1.78–3.60) |
Current alcohol use (vs. never / former use) | 1.74 (1.40–2.18) | 1.75 (1.43–2.18) |
No exercise | 2.05 (1.34–3.20) | — |
No aspirin | 1.36 (0.88–2.05) | — |
No hyperlipidemia | 1.19 (0.93–1.64) | — |
No multivitamin use | 1.76 (1.16–2.80) | — |
No NSAID use | 2.44 (1.93–3.11) | 2.42 (1.92–3.07) |
No statin use | 1.56 (1.15–2.23) | 1.94 (1.50–2.48) |
AUC | 0.748 (0.720–0.773) | 0.718 (0.687–0.743) |
Brier score | 0.138 (0.133–0.144) | 0.143 (0.139–0.148) |
ᵃ ORs are adjusted for all other variables in the model.
ᵇ CIs obtained from the 2.5% and 97.5% quantiles of bootstrap estimates.
The top-ranked model based on AUC in the derivation cohort, which was a more parsimonious model, contained seven risk factors (Table 2). This model also showed good discrimination, with an AUC of 0.718 (0.687–0.743) and Brier score = 0.143 (0.139–0.148). Increased odds of EOCRC were associated with older age (OR, 1.09; 95% CI, 1.06–1.12), not having a service-connected disability (OR, 1.61; 95% CI, 1.27–1.97), higher comorbidity (OR, 1.15; 95% CI, 1.06–1.23), having an FDR or SDR with colorectal cancer (OR, 2.50; 95% CI, 1.78–3.60), current alcohol use (OR, 1.75; 95% CI, 1.43–2.18), and no report of regular use of NSAIDs (OR, 2.42; 95% CI, 1.92–3.07) or statins (OR, 1.94; 95% CI, 1.50–2.48). Although the indices for the 7-variable model were slightly lower than those of the 15-variable model, the 95% bootstrap CIs overlapped for the two models; thus, both models were similar in terms of calibration and discrimination.
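To illustrate how the adjusted ORs of the 7-variable model combine, the sketch below multiplies ORs (adds log-odds) for a hypothetical profile relative to a reference profile. It assumes that the age and CCI ORs apply per year and per point, respectively, and that there are no interaction terms; these are assumptions, not statements from the model specification. Because the data come from a case–control design, the result is a relative odds, not an absolute risk.

```python
import math

# Adjusted ORs from the 7-variable model in Table 2.
OR_7_VARIABLE = {
    "age_years": 1.09,             # assumed to apply per year of age
    "nonservice_connected": 1.61,
    "cci_points": 1.15,            # assumed to apply per CCI point
    "crc_in_fdr_or_sdr": 2.50,
    "current_alcohol": 1.75,
    "no_nsaid": 2.42,
    "no_statin": 1.94,
}

def relative_odds(profile, reference=None):
    """Odds of EOCRC for `profile` relative to `reference` (default: all zeros)."""
    reference = reference or {}
    log_odds = 0.0
    for name, odds_ratio in OR_7_VARIABLE.items():
        delta = profile.get(name, 0) - reference.get(name, 0)
        log_odds += delta * math.log(odds_ratio)
    return math.exp(log_odds)

# Hypothetical comparison: a 45-year-old current drinker not using NSAIDs
# versus a 40-year-old nondrinker using NSAIDs, all else equal.
example = relative_odds(
    {"age_years": 45, "current_alcohol": 1, "no_nsaid": 1},
    {"age_years": 40},
)
```

In this hypothetical comparison the relative odds equal 1.09⁵ × 1.75 × 2.42; the wide bootstrap CIs in Table 2 mean such products carry considerable uncertainty.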
On the basis of the validation cohort, the 15- and 7-variable models showed good calibration. The Hosmer–Lemeshow goodness-of-fit test was not statistically significant for any of the imputed datasets, indicating that neither model showed a significant lack of fit (Table 3). Both models also exhibited good discrimination, with AUC values ranging from 0.753 to 0.760 for the 15-variable model and from 0.744 to 0.750 for the 7-variable model. Although the AUC values in each imputed testing dataset were slightly lower for the 7-variable model, the 95% CIs overlapped with those of the 15-variable model. Figure 3 shows larger observed versus expected differences between the two models for some levels of predicted probability, although the overall model fit was both good and consistent among iterations for both models.
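The Hosmer–Lemeshow statistic and its P value can be sketched as follows. The sketch assumes the conventional grouping into 10 bins of predicted risk, which gives the 8 degrees of freedom reported in Table 3; the grouping actually used is not stated in the text. For an even number of degrees of freedom the chi-square upper-tail probability has a closed form, so no external library is needed. Function names are illustrative.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k                   # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def hosmer_lemeshow(outcomes, predictions, groups=10):
    """Hosmer-Lemeshow chi-square over equal-sized groups of sorted predicted risk."""
    order = sorted(range(len(outcomes)), key=lambda i: predictions[i])
    n, chi2 = len(order), 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        observed = sum(outcomes[i] for i in idx)
        expected = sum(predictions[i] for i in idx)
        denom = expected * (1 - expected / len(idx))
        if denom > 0:
            chi2 += (observed - expected) ** 2 / denom
    return chi2, groups - 2

# Reproduce a P value in Table 3 from its reported chi-square (df = 8).
p_imputation_1_full = chi2_upper_tail_even_df(6.676, 8)
```

Applied to the first row of Table 3 (χ2 = 6.676, df = 8), the closed form returns a P value that agrees with the reported 0.572 to three decimal places.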
Table 3. Model metrics of discrimination and calibration in the validation cohort (N = 750).
Imputation dataset | Model | Area under ROC | 95% CI | Hosmer–Lemeshow GOF test χ2 | df | P |
---|---|---|---|---|---|---|
1 | 15-variable model | 0.760 | 0.717–0.803 | 6.676 | 8 | 0.572 |
1 | 7-variable model | 0.747 | 0.704–0.789 | 8.112 | 8 | 0.423 |
2 | 15-variable model | 0.755 | 0.711–0.799 | 7.422 | 8 | 0.492 |
2 | 7-variable model | 0.747 | 0.704–0.790 | 7.556 | 8 | 0.478 |
3 | 15-variable model | 0.759 | 0.715–0.802 | 4.243 | 8 | 0.834 |
3 | 7-variable model | 0.747 | 0.704–0.790 | 8.682 | 8 | 0.370 |
4 | 15-variable model | 0.753 | 0.709–0.797 | 5.306 | 8 | 0.725 |
4 | 7-variable model | 0.744 | 0.701–0.787 | 7.410 | 8 | 0.493 |
5 | 15-variable model | 0.756 | 0.712–0.799 | 5.577 | 8 | 0.695 |
5 | 7-variable model | 0.750 | 0.707–0.792 | 9.339 | 8 | 0.315 |
Abbreviations: df, degrees of freedom; GOF, goodness of fit; ROC, receiver operating characteristic.
Figure 3. Comparison of model fit in the validation cohort. Plots of the observed (dotted lines) and expected (solid lines) probabilities of colorectal cancer on the test (validation) group data for both the 15-variable and 7-variable models for each of 5 imputations.
Discussion
The Veterans Health Administration has endorsed lowering the screening age to 45 years. However, even if these new guideline recommendations are implemented within the VA system and are followed by high adherence, half of sporadic EOCRC will be missed (2). Knowing the risk factors for EOCRC could help providers identify those persons for whom screening prior to age 45 or 50 years is most needed. In this case–control study, we identified several factors independently associated with an increased risk of colorectal cancer among male veterans. These factors may be used to guide discussions between patients and providers about whether and how to screen for colorectal cancer, and for targeting those at high risk for interventions that increase the uptake of screening.
Given the limitations of case–control studies, we believe ours is reasonably and robustly constructed. We chose cases that most closely reflected sporadic colorectal cancer. We chose two control groups with distinct sampling frames to minimize bias in the spectrum of possible risk factors associated with either control group alone. The colonoscopy controls may be considered the more fastidious or valid control group that some would consider less generalizable, while the clinic control group (some of whom may have had colonoscopy) may be considered the more generalizable but less fastidious control group. For case and control groups, we excluded persons with established risk for colorectal cancer (e.g., strong family history of colorectal cancer, inflammatory bowel disease) because (early) screening for these indications is established. We further excluded from colonoscopy controls persons who underwent colonoscopy for an indication of “screening” because it is likely that they had a high-risk family history (for which screening is established). We used a highly structured and robust strategy for analysis, including consideration of variables with missing data of less than 30% by using multiple imputation; we excluded from consideration variables with missing data of 30% or more. To pare down the number of candidate variables, we excluded factors with no clear link to colorectal cancer (e.g., red cell distribution width or white blood cell count) or that were represented by other factors (e.g., triglyceride level represented by a diagnosis of hyperlipidemia). We pursued model building using stepwise variable selection instead of relying on established risk factors for colorectal cancer because of uncertainty about whether risk factors for usual-onset and early-onset colorectal cancer are the same.
We do not believe that any of the variables in our dataset are unique or specific to this population, except for the service-connection/copay variable, which we consider to be a proxy for income and/or socioeconomic status, albeit only an approximate one.
Although we identified 15 factors associated with EOCRC, a reduced model of 7 factors provided similar discrimination (c-statistic = 0.718). All seven factors are readily collectable either from the EMR or by asking patients (about their family history of colorectal cancer). On the basis of model fit, the additional complexity of the 15-variable model may not necessarily translate into better discrimination between cases and controls. The simpler 7-variable model has metrics comparable to those of the 15-variable model and may be easier to use in clinical practice to estimate the relative risk for EOCRC.
Several studies have identified risk factors for EOCRC (14–19). Recent among them is a case–control study by Gausman and colleagues from a tertiary academic hospital comparing sociodemographic and medical features in 269 individuals with EOCRC, 2,802 usual-onset colorectal cancer (i.e., occurring in individuals age 50 or older), and 1,122 age-matched (to the EOCRC group) controls without colorectal cancer (20). Male sex, inflammatory bowel disease, and family history of colorectal cancer were more prevalent in EOCRC cases than in controls. Along with Asian ethnicity, these same factors were more prevalent in EOCRC than in late-onset colorectal cancer. Consistent with other studies, Gausman and colleagues found that EOCRC is more likely to be detected in the distal colon and at a later stage than late-onset colorectal cancer.
Among these, perhaps the most pertinent is the VA-based study by Low and colleagues (21). The study compared EOCRC cases identified by colonoscopy in 18- to 49-year-olds to controls who were colorectal cancer–free at their baseline colonoscopy and through 3 years of follow-up, relying solely on data in the VA's Corporate Data Warehouse. Independent risk factors included older age (within the 18- to 49-year-old range) and male sex, while both regular aspirin use and overweight or obesity were protective against EOCRC. In a post hoc analysis, weight loss of 5 kg or more within the 5-year period preceding colonoscopy was associated with a 2.23-fold higher risk of EOCRC. Despite differences in case and control age range and sex, the study by Low and colleagues and this study have common findings. Both studies found that older age is associated with an increased risk of EOCRC. Because our study included only men, we were unable to examine the effect of sex on risk. The study by Low and colleagues found that aspirin was protective. In our study, aspirin was protective in univariate analysis only. Finally, both studies found that lower BMI was associated with an increased risk of EOCRC. BMI was significant only in the univariate analysis in our study. In both studies, the association between low BMI and EOCRC may be due to protopathic bias, where an early manifestation of the disease (in this case, EOCRC) “causes” the exposure (in this case, weight loss and resulting low BMI). Several studies on risk factors for EOCRC support a higher BMI as a risk factor (15). As with our study, we believe that the weight loss associated with EOCRC in the Low study is likely a true finding and an early manifestation of EOCRC. In this setting, we would consider weight loss (presumed to be unintentional) a symptom that should prompt consideration of diagnostic colonoscopy, rather than a risk factor that best applies to “asymptomatic” individuals.
At least two studies supporting risk prediction models come from East Asian countries and used phenotypic factors as the predictor variables. Jung and colleagues used age, sex, BMI, family history of colorectal cancer, and cigarette smoking to predict the risk of advanced neoplasia (the combination of colorectal cancer and advanced precancerous polyps) in 96,235 Korean men and women ages 30 to 49 years who underwent screening colonoscopy (17). With an overall 1.2% prevalence of advanced neoplasia, the optimal cutoff for the derivation cohort of 57,635 was 1.14%. In the validation cohort of 38,600 participants, the model had a c-statistic of 0.67, suggesting modest discrimination, which was better than the Asia–Pacific Colorectal Screening score (0.59), the Korean Colorectal Screening score (0.60), and the score by Kaminski and colleagues (0.59; ref. 22). Similarly, Park and colleagues created a scoring system on the basis of age, sex, cholesterol, triglycerides, and H. pylori infection in 2,781 Korean individuals ages 40 to 49 years who underwent screening colonoscopy (18). With a cutoff of ≥ 4, the scoring system discriminated between those with and without advanced neoplasia, with a sensitivity of 79%, specificity of 58%, and c-statistic of 0.72.
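To put the reported sensitivity and specificity in more clinical terms, they can be converted to likelihood ratios. The sketch below is an illustration added here, not a calculation reported by Park and colleagues; the function name `likelihood_ratios` is hypothetical, and the inputs are the 79% sensitivity and 58% specificity quoted above.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (positive LR, negative LR) for a binary test.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Values quoted in the text for a score cutoff of >= 4
lr_pos, lr_neg = likelihood_ratios(0.79, 0.58)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With these inputs, the positive likelihood ratio is about 1.9 and the negative likelihood ratio about 0.36, consistent with the modest discrimination implied by a c-statistic of 0.72.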
One strength of this study is the large, system-wide sample size, which provided adequate power to identify risk factors with potentially broad applicability within VHA and potentially to nonveteran males. Another strength is the use of two control groups, colonoscopy controls and clinic controls, each with its unique potential for bias, which, when combined, mitigated the bias of using either control group alone. A third strength was the careful review of the EMR by trained research personnel to identify candidate clinical and personal lifestyle risk factors and family history factors. The last was a careful analysis that included identification of a parsimonious model with metrics comparable with the full model but with greater potential for implementation.
There are several study limitations that require comment, one of which is the uncertainty of the “optimal” control group for comparison, a critical issue in a case–control study, as the choice of controls may affect study validity. For this reason, we chose a highly valid, fastidious, but limited control group of persons who underwent colonoscopy for diagnostic reasons and one with less precision for what is in the colorectum but greater generalizability and perhaps greater comparability to cases. A second limitation is missing data, which precluded use of certain variables and required a more complex model-selection technique because multiple imputation was conducted. The amount of missing data in the candidate variables was less than 30%; thus, the multiple imputation approach was valid and appropriately captured the additional variability in parameter estimates due to the imputation process. A third limitation is the potential for measurement bias, as medical records of cases may have been more likely to have had more complete or accurate data than controls on lifestyle factors and family history, information that is often recorded after the cancer diagnosis. A fourth limitation is the potential for protopathic bias, in which early manifestations of colorectal cancer affect exposure (or risk factor). Vital signs, anthropometric variables, and laboratory parameters may have been most susceptible to this bias. We attempted to mitigate the effect of protopathic bias by examining and recording these categories of variables no closer than six months prior to the cancer diagnosis for cases and, in most instances, between 6 and 18 months prior to the diagnosis. However, the extent to which this strategy mitigated this bias remains unclear. A fifth limitation is that medication use was determined from electronic pharmacy records, which may be incomplete because they do not always include regularly used non-VA prescription medications or over-the-counter medications.
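The "additional variability" attributable to imputation can be made concrete with Rubin's rules, the standard way to combine estimates across imputed datasets. The sketch below is an assumption-laden illustration, not the authors' analysis: the text does not state how estimates were pooled, the function name `pool_rubin` is hypothetical, and the standard errors are back-calculated from the tabulated 95% confidence intervals for the 15-variable model under an assumed symmetric normal interval.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates using Rubin's rules.

    Returns (pooled_estimate, within_var, between_var, total_var), where
    total_var = within_var + (1 + 1/m) * between_var for m imputations.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, within, between, total


# Validation AUCs and 95% CIs for the 15-variable model (from the table)
aucs = [0.760, 0.755, 0.759, 0.753, 0.756]
cis = [(0.717, 0.803), (0.711, 0.799), (0.715, 0.802), (0.709, 0.797), (0.712, 0.799)]
# Assumption: symmetric normal CI, so SE = (upper - lower) / (2 * 1.96)
ses = [(hi - lo) / (2 * 1.96) for lo, hi in cis]

pooled, within, between, total = pool_rubin(aucs, ses)
print(f"pooled AUC = {pooled:.4f}, total SE = {total ** 0.5:.4f}")
```

In this illustration the between-imputation variance is small relative to the within-imputation variance, so the pooled standard error is only slightly larger than that of any single imputed dataset.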
A final limitation is the uncertain generalizability of our findings, especially to women, whose colorectal cancer risk is about half that of men in any age category, and to nonveteran males, who may differ from veterans receiving care within the VA healthcare system. Independent validation of both models in a nonveteran male (and perhaps female) population is required to determine generalizability.
We recognize that since this study was conducted, the age threshold for starting colorectal cancer screening has been lowered from 50 years to 45 years. Although our results might have greater current clinical relevance if we had used age 45 as the threshold for separating early-onset from “usual” onset colorectal cancer, we suggest that it is early in the period following the updated guidelines, and that uptake of colorectal cancer screening among 45- to 49-year-olds is not yet clear (23). We expect that a substantial proportion of persons in this age group may not embrace colorectal cancer screening prior to age 50, in which case risk factors may be useful for the patient-provider discussion about colorectal cancer screening.
Considering its strengths and limitations, we conclude that this study identified phenotypic and lifestyle factors for sporadic EOCRC and derived and validated a well-calibrated model with moderately good discrimination. A more parsimonious model was identified, with model metrics comparable with those of the larger model. These risk factors and models may be useful for identifying veterans at a higher than average risk for EOCRC for whom screening prior to age 45 or 50 should be considered.
Authors' Disclosures
T.F. Imperiale reports grants from VA during the conduct of the study. J.K. Daggy reports grants from VA HSR&D during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
T.F. Imperiale: Conceptualization, formal analysis, supervision, funding acquisition, investigation, methodology, writing–original draft, writing–review and editing. L.J. Myers: Data curation, supervision, investigation, writing–review and editing. B.C. Barker: Data curation, supervision, investigation, project administration, writing–review and editing. J. Larson: Data curation, supervision, investigation, writing–review and editing. T.E. Stump: Software, formal analysis, investigation, writing–original draft, writing–review and editing. J.K. Daggy: Software, formal analysis, supervision, investigation, writing–original draft, writing–review and editing.
Acknowledgments
This work was supported by grants from the Health Services Research and Development – IIR 14–011, Veterans Health Administration.
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Prevention Research Online (http://cancerprevres.aacrjournals.org/).