Abstract
Objective: Colorectal cancer (CRC) is the third most common cancer in the United States. Despite declining incidence in patients older than 50, the incidence of CRC in patients under the age of 50 has been increasing. Some risk factors, such as inflammatory bowel disease and inherited genetic syndromes, can lead to early-onset CRC; yet the incidence of these disorders does not track the increase in CRC in patients under 50 years old. This study applies machine learning models to electronic health record (EHR) data from the OneFlorida network, a clinical research network contributing to the national PCORnet, to demonstrate a potential avenue for investigating the rise in incidence. Methods: This study applied four machine learning algorithms to a cohort of 1,227 CRC cases and 34,157 matched controls, all under the age of 50, extracted from the OneFlorida network. Colon cancer (CC) and rectal cancer (RC) were modeled separately. For each case patient, we created a prediction window starting from the first recorded encounter in their EHR and ending 0, 1, 3, or 5 years prior to the case CC/RC index date. Each control patient was matched to a case based on age at an encounter date close to the case index date. The data were split into a training set (80%) for training the models and a testing set (20%) for measuring model performance. SHapley Additive exPlanations (SHAP) was used to analyze significant features. Results: A notable trend in model prediction results was decreased sensitivity across prediction windows as the data available per patient decreased, in both the RC and CC cohorts. Zero-year and 1-year prediction area under the curve (AUC) was significant, ranging from 0.64 to 0.75 for all algorithms across RC and CC. As the end of the prediction window moved further from the index date, prediction performance dropped to as low as 0.35 (for 5-year prediction). The best-performing algorithm across all experiments was the support vector machine (SVM).
Top predictors identified in the CC cohort included hypertension, cough/asthma, chronic sinusitis, anxiety disorder, and atopic dermatitis, while top predictors in the RC cohort included obesity, female gender, HIV, anxiety disorder, and asthma. Conclusions: Disorders involving chronic immunosuppression (e.g., HIV) or inflammation (e.g., obesity, asthma, sinusitis, dermatitis) may represent immune-axis derangements that contribute to a state favorable to CRC. This preliminary study provides early insight into the capacity of artificial intelligence to uncover new risk factors in patients with young-onset CRC; further algorithm refinement and risk factor exploration are underway.
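The prediction windows described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example patient, and the choice to treat the window end as inclusive are all assumptions made for illustration. It shows how moving the window end 0, 1, 3, or 5 years before the index date reduces the data available per patient.

```python
from datetime import date


def window_end(index_date: date, years_prior: int) -> date:
    """Return the date `years_prior` years before the index date."""
    try:
        return index_date.replace(year=index_date.year - years_prior)
    except ValueError:
        # Feb 29 mapped onto a non-leap year: fall back to Feb 28.
        return index_date.replace(year=index_date.year - years_prior, day=28)


def encounters_in_window(encounters, index_date, years_prior):
    """Keep encounters from the first record up to the window end.

    Assumption: the window end is inclusive; the abstract does not say.
    """
    end = window_end(index_date, years_prior)
    return [e for e in encounters if e[0] <= end]


# Hypothetical patient record: (encounter date, diagnosis label) pairs.
encounters = [
    (date(2012, 3, 1), "asthma"),
    (date(2015, 6, 15), "hypertension"),
    (date(2018, 5, 10), "anxiety disorder"),
]
index_date = date(2019, 1, 10)  # hypothetical CC/RC index date

for years_prior in (0, 1, 3, 5):
    kept = encounters_in_window(encounters, index_date, years_prior)
    print(years_prior, len(kept))
```

For this hypothetical patient, the number of usable encounters shrinks from three at the 0-year window to one at the 5-year window, which mirrors the loss of per-patient data noted in the Results.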
Citation Format: Michael B. Quillen, Taylor M. Parker, Jiang Bian, Thomas George. Identifying new risk factors for early-onset CRC in population under 50 years old using EHR-based machine learning [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PR-10.