Colorectal cancer (CRC) is a leading cause of cancer death, yet many CRC deaths are preventable through screening. Currently, only age and family history are used to define screening eligibility, even though CRC risk varies substantially in the population. In recent years, polygenic risk scores (PRS) have gained attention as a powerful risk prediction tool for personalizing interventions. A PRS provides a quantitative measure of an individual's inherited risk based on the cumulative effect of many genetic risk variants. Here, we benchmark several genome-wide PRS techniques to select the best-performing models for CRC risk prediction.
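
The "cumulative effect" behind a PRS is commonly computed as a weighted sum of risk-allele dosages. The sketch below illustrates that general idea only; the function name, the toy weights, and the toy dosages are hypothetical and are not taken from the study.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele dosages.

    dosages: per-variant risk-allele counts for one individual (0 to 2,
             possibly fractional for imputed genotypes).
    weights: per-variant effect sizes, e.g. log odds ratios (assumed here).
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical toy data for three variants (illustration only).
toy_weights = [0.10, 0.05, -0.02]
toy_dosages = [2, 1, 0]
toy_score = polygenic_risk_score(toy_dosages, toy_weights)
```

The benchmarked methods differ mainly in how the per-variant weights are chosen; the final scoring step has this same form.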

We built CRC risk prediction models that incorporate genome-wide genotype data from large-scale research studies (55,105 cases and 65,079 controls of European ancestry), with imputed genetic data on over 40 million variants. The risk prediction models were externally evaluated in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort, which includes 101,987 genotyped individuals within the Kaiser Permanente Northern California (KPNC) integrated healthcare delivery system. We built genome-wide PRS using a range of methods: known CRC risk variants, pruning and thresholding followed by machine learning (ML) approaches, LDpred, the improved LDpred2, SBayesR, PRS-CS, Lassosum, and empirical Bayes (EBPRS).
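
As a rough illustration of the pruning-and-thresholding step, the sketch below applies a greedy variant selection to toy data. It is not the pipeline used in the study: the function name, the p-value and r² cutoffs, and the toy inputs are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def clump_and_threshold(
    pvalues: Sequence[float],
    ld_r2: Sequence[Sequence[float]],
    p_threshold: float = 5e-8,   # assumed cutoff, not from the abstract
    r2_threshold: float = 0.1,   # assumed cutoff, not from the abstract
) -> List[int]:
    """Greedy selection of approximately independent, significant variants.

    Variants with p <= p_threshold are visited from most to least
    significant; a variant is kept only if its pairwise r2 with every
    already-kept variant is at most r2_threshold.
    """
    candidates = sorted(
        (i for i, p in enumerate(pvalues) if p <= p_threshold),
        key=lambda i: pvalues[i],
    )
    kept: List[int] = []
    for i in candidates:
        if all(ld_r2[i][j] <= r2_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy data: variants 0 and 1 are in strong LD (r2 = 0.8).
toy_pvalues = [1e-10, 1e-9, 0.2, 1e-8]
toy_ld_r2 = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
toy_kept = clump_and_threshold(toy_pvalues, toy_ld_r2)
```

In this toy case, variant 1 is dropped because it is correlated with the more significant variant 0, and variant 2 is dropped because it does not pass the p-value cutoff. The selected variants would then be passed to a downstream model to estimate weights.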

Among 55,033 individuals of European ancestry in the GERA cohort, we evaluated model performance in terms of the age- and sex-adjusted AUC. LDpred, LDpred2, LDpred2-sparse, SBayesR, and PRS-CS performed equally well in terms of discriminatory accuracy (AUC = 0.65). In addition, individuals in the top 30% of the PRS distribution developed with these techniques had a hazard ratio of ~2.2 for CRC relative to the rest of the GERA European-ancestry population, which is comparable to that for having an affected first-degree relative. The developed CRC PRSs will pave the way for risk-stratified CRC screening and other targeted interventions.
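
To make the two evaluation quantities concrete, the sketch below computes an unadjusted AUC as the probability that a randomly chosen case has a higher score than a randomly chosen control, and flags the top 30% of a score distribution. It is a simplified illustration: it omits the age and sex adjustment and the hazard ratio estimation used in the study, and the function names and toy scores are hypothetical.

```python
from typing import List, Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Unadjusted AUC: fraction of case-control pairs in which the case
    scores higher than the control (ties count as one half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def top_fraction_flags(scores: Sequence[float], fraction: float = 0.30) -> List[bool]:
    """Flag individuals whose score falls in the top `fraction` of the sample."""
    n_top = round(fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])
    return [i in top for i in range(len(scores))]


# Hypothetical toy scores (illustration only).
toy_cases = [0.9, 0.4]
toy_controls = [0.1, 0.5, 0.3]
toy_auc = auc(toy_cases, toy_controls)

toy_scores = [0.1, 0.9, 0.4, 0.5, 0.3, 0.8, 0.2, 0.7, 0.6, 0.0]
toy_flags = top_fraction_flags(toy_scores)
```

The pairwise loop is quadratic in sample size, so a cohort-scale analysis would use a rank-based AUC implementation instead. The top-30% flag would typically enter a survival model as a binary covariate to obtain a hazard ratio.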

| PRS derivation method | No. of variants | AUC (1,311 cases and 53,722 controls) | Hazard ratio estimate (CI), top 30% of population vs. remaining |
| --- | --- | --- | --- |
| Known variants | 140 | 0.63 | 1.92 (1.75-2.23) |
| PT Clumping + ML (Ridge) | 10,000 | 0.63 | 1.94 (1.72-2.19) |
| LDpred | 1.2M | 0.65 | 2.20 (1.94-2.47) |
| LDpred2 | 1.2M | 0.65 | 2.20 (1.93-2.45) |
| LDpred2-sparse | 530K | 0.65 | 2.20 (1.90-2.41) |
| SBayesR | 1.2M | 0.65 | 2.20 (1.88-2.38) |
| PRS-CS | 1.2M | 0.65 | 2.20 (1.91-2.43) |
| Lassosum | 1.2M | 0.62 | 1.76 (1.56-2.58) |
| EBPRS | 1.2M | 0.62 | 1.81 (1.66-2.11) |

The AUC based on family history in the GERA cohort is 0.54.

Citation Format: Minta Thomas, Lori C Sakoda, Jeffrey K Lee, Mark A Jenkins, Andrea Burnett-Hartman, Heather Hampel, Elisabeth A Rosenthal, Hermann Brenner, Jenny Chang-Claude, Marc J Gunter, Polly A Newcomb, Steven Gallinger, Tabitha A Harrison, Graham Casey, Victor Moreno, Gail P Jarvik, Stephen B Gruber, Robert E Schoen, Andrew T Chan, Richard B Hayes, Douglas A Corley, Ulrike Peters, Li Hsu. Benchmarking genome-wide polygenic risk score development techniques in colorectal cancer risk prediction [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 881.