Abstract
Lung cancer is the leading cause of cancer-related death globally. An improved risk stratification strategy can increase the efficiency of low-dose CT (LDCT) screening. Here we assessed whether an individual's genetic background has clinical utility for risk stratification in the context of LDCT screening. On the basis of 13,119 patients with lung cancer and 10,008 controls of European ancestry in the International Lung Cancer Consortium, we constructed a polygenic risk score (PRS) via 10-fold cross-validation with regularized penalized regression. The performance of the risk model integrating PRS, including calibration and ability to discriminate, was assessed using UK Biobank data (N = 335,931). Absolute risk was estimated on the basis of age-specific lung cancer incidence, with all-cause mortality as a competing risk. To evaluate its potential clinical utility, the PRS distribution was simulated in the National Lung Screening Trial (N = 50,772 participants). The lung cancer OR for individuals in the top decile of the PRS distribution versus those in the bottom decile was 2.39 [95% confidence interval (CI) = 1.92–3.00; P = 1.80 × 10⁻¹⁴] in the validation set (Ptrend = 5.26 × 10⁻²⁰). The OR per SD of PRS increase was 1.26 (95% CI = 1.20–1.32; P = 9.69 × 10⁻²³) for overall lung cancer risk in the validation set. When considering absolute risks, individuals in different PRS deciles showed differential trajectories of 5-year and cumulative absolute risk. The age at which the LDCT screening recommendation threshold is reached can vary by 4 to 8 years, depending on the individual's genetic background, smoking status, and family history. Collectively, these results suggest that an individual's genetic background may inform the optimal lung cancer LDCT screening strategy.
Three large-scale datasets reveal that, after accounting for risk factors, an individual's genetics can affect their lung cancer risk trajectory and thus may inform the optimal timing for LDCT screening.
Introduction
Lung cancer continues to be the leading cause of cancer-related death globally, and the reduction of lung cancer–related deaths remains a public health priority (1). Since the landmark article from the National Lung Screening Trial (NLST; ref. 2), which demonstrated a 20% mortality reduction with low-dose computed tomography (LDCT) screening, how to effectively conduct LDCT screening in high-risk populations has been a topic of debate. More recently, the long-awaited Dutch-Belgian Lung Cancer Screening (NELSON) trial also demonstrated a substantial mortality reduction of up to 25% to 50%, depending on gender and the length of follow-up (3), which solidified the effectiveness of LDCT screening for lung cancer mortality reduction.
With the increasing uptake of LDCT, it is important to identify the high-risk population and determine the best timing to start LDCT screening. Most current LDCT guidelines, including the United States Preventive Services Task Force (USPSTF) guideline (4), were derived from the NLST eligibility criteria and are based simply on age (55–74 or 80 years old) and tobacco smoking history (at least 30 pack-years and, for former smokers, having quit within the past 15 years). It has been suggested that individual risk assessment based on risk prediction models is more effective for selecting high-risk individuals for LDCT screening (5). However, none of the previous risk models has taken individuals' genetic profiles into account at the genome-wide level.
Genome-wide association studies (GWAS) have uncovered multiple lung cancer susceptibility genes, and consortium efforts have greatly increased our ability to investigate the genetic architecture of histologic subtypes (6, 7). However, the clinical utility of these genomic discoveries remains unclear. It is evident that individual susceptibility genes do not adequately represent an individual's background genetic risk. In contrast, polygenic risk scores (PRS) are considered an effective approach for quantifying an individual's inherent risk and have been applied to other common complex diseases, such as cardiovascular diseases and breast and prostate cancer, with some success (8–13). However, no studies have comprehensively investigated risk prediction for lung cancer incorporating PRS beyond a handful of known susceptibility genes (14, 15).
To comprehensively evaluate the predictive performance of a polygenic risk model in lung cancer beyond the known loci identified by previous GWAS, we constructed the PRS based on the OncoArray data of 23,127 individuals using a machine learning approach, and independently validated the PRS based on UK Biobank data with 335,931 individuals. We assessed the performance of the risk model integrating PRS in UK Biobank, including model calibration and ability to discriminate. Finally, to evaluate the potential clinical utility of the polygenic risk model in screening-eligible populations, we simulated the PRS distribution in the National Lung Screening Trial with 50,772 participants. Our objective was to assess whether and how an individual's inherited susceptibility to lung cancer would affect the optimal implementation of LDCT screening in the high-risk population.
Materials and Methods
The lung cancer OncoArray project of the International Lung Cancer Consortium (ILCCO) has been previously published (6). A total of 18,316 histologically confirmed lung cancer cases and 14,025 controls from 26 studies were used for PRS construction (16, 17). A total of 13,119 cases and 10,008 controls had the epidemiologic data needed for the risk modeling and were used for the downstream analysis combining genetic and epidemiologic data (Supplementary Fig. S1A). UK Biobank is a population-based cohort study of over 500,000 participants ages 40–69 years at entry, recruited throughout the United Kingdom between 2006 and 2010 (18, 19). For risk prediction modeling, 1,768 incident lung cancer cases, defined as those who were diagnosed after baseline enrollment, and 334,163 unrelated controls were included (Supplementary Fig. S1B). Additional details of the ILCCO OncoArray Project and UK Biobank are included in the Supplementary Materials. The protocol of the pooled analysis was approved by the Research Ethics Review Board at the Sinai Health System. The recruitment and data collection of all participating research institutes were approved by the local ethics review committees.
Statistical analysis
Construction of PRS
The PRS was constructed as the sum of the number of minor alleles an individual carries, weighted by effect coefficients (the per-allele log-odds ratios), and included two components: (i) the known susceptibility loci of lung cancer and conditions related to lung cancer (such as lung function impairment) previously identified through literature curation and the NHGRI-EBI GWAS Catalog (6, 7, 14, 20–23), and (ii) additional loci that passed the suggestive significance level (P < 5 × 10⁻⁶) and were identified in this analysis through penalized regression using the least absolute shrinkage and selection operator (LASSO) with 10-fold cross-validation. When variants were correlated, those representing independent loci with the strongest statistical significance were retained. The final component of known lung cancer–related loci included 35 variants (PRS-35), and the best performing LASSO model selected 93 variants after accounting for linkage disequilibrium (PRS-93). The final PRS (PRS-128) was constructed by combining both components (Supplementary Table S1). The detailed process of PRS construction is included in the Supplementary Materials.
ORs and 95% confidence intervals (CI) were used to evaluate the association between PRS and lung cancer risk based on logistic regression, adjusting for age, sex, and the top five principal components. We compared effect sizes of PRS for lung cancer risk based on PRS deciles by histologic type, smoking status, and family history of lung cancer in first-degree relatives.
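The weighted-sum definition above can be illustrated with a minimal Python sketch. This is not the authors' pipeline; the function name, variant identifiers, weights, and dosages below are hypothetical placeholders, not values from Supplementary Table S1.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of allele dosages (0-2) weighted by per-allele log-odds ratios.

    Variants that are absent from `dosages` are skipped, which mirrors how a
    reduced score arises when some variants are not genotyped or imputed.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


# Hypothetical per-allele log(OR) weights (assumed values for illustration only).
weights = {"variant_A": 0.25, "variant_B": -0.10, "variant_C": 0.05}
# Hypothetical allele counts for one individual.
dosages = {"variant_A": 2, "variant_B": 1, "variant_C": 0}

prs = polygenic_risk_score(dosages, weights)  # 2*0.25 + 1*(-0.10) + 0*0.05
```

Keeping the weights in a separate mapping makes it straightforward to apply weights derived in one dataset unchanged to another dataset.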
Validation of PRS
The PRS in the UK Biobank was computed based on the same weights derived and applied in the OncoArray dataset to avoid model overfitting. Fourteen variants (2 from PRS-35) were not genotyped or imputed on the basis of the Haplotype Reference Consortium (HRC) panel, which resulted in PRS-114 for the analysis in UK Biobank. PRS-114 and PRS-128 are highly comparable, with a Pearson correlation coefficient of 0.984. All of the variants in the PRS passed the imputation quality threshold (INFO > 0.3). To validate the risk model built in the OncoArray, we used the same effect coefficients for the parameters included in the model (Supplementary Table S2).
Baseline risk model for overall population and never smokers
For the overall population, we built upon the PLCOall2014 model previously developed on the basis of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial (24). The predictors included age, race, education level, body mass index, chronic obstructive pulmonary disease (COPD), personal history of cancer, family history of lung cancer in first-degree relatives, smoking status, smoking intensity, smoking duration, and smoking quit time. To address potential over- or underestimation of the absolute risk when importing the coefficients of a risk model previously developed in a different population, and to integrate PRS into the risk model, we recalibrated and reparametrized the risk model using 50% of the UK Biobank cohort. Recalibration is a statistical approach commonly used to adapt a risk model developed in a different population (25). The remaining 50% of the UK Biobank cohort was kept as a strict hold-out validation set for prospective evaluation (Supplementary Materials). The analysis flow is depicted in Supplementary Fig. S2. The assumption of multiplicative interaction between PRS and the epidemiologic risk factors was assessed (Supplementary Materials).
It is well recognized that lung cancer risk profiles are markedly different for never smokers, but there is currently no established risk model for never smokers. Taking advantage of the risk factor data available in UK Biobank, we adopted an 80% training and 20% testing split of the UK Biobank cohort data to investigate the predictive performance of additional risk factors that might be particularly relevant for never smokers, such as impaired lung function, ambient air pollution, and second-hand smoke. The latter two did not improve the model performance; therefore, the risk factors included in the parsimonious model for never smokers were age, sex, education, family history, personal history of cancer, and impaired lung function (Supplementary Materials).
Risk model evaluation based on the hold-out validation set in the UK Biobank cohort
Model performance in the prospective setting, including calibration and discrimination, was evaluated on the basis of the 50% hold-out set for the overall model and the 20% hold-out set for the never-smoker model in the UK Biobank cohort. Model calibration was assessed by evaluating how much the slope of the calibration line (plotting the predicted vs. the observed probabilities) deviated from the ideal of 1. The 95% confidence intervals (95% CI) of the predicted risk were computed with the percentile-based bootstrap. Calibration was formally tested using the Spiegelhalter z-statistic, and P values are reported (26, 27). The model's ability to discriminate was assessed by the area under the receiver operating characteristic curve (AUC). The improvement in risk discrimination from the developed PRS was evaluated by comparing a base model with epidemiologic risk factors against a model that includes both epidemiologic risk factors and PRS.
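As a hedged illustration of the two evaluation measures named above, the following Python sketch computes the Spiegelhalter z-statistic and the AUC from observed outcomes and predicted probabilities. The function names and the toy data are assumptions made for illustration; this is not the code used in the study, which was performed in R.

```python
import math
from typing import Sequence


def spiegelhalter_z(y: Sequence[int], p: Sequence[float]) -> float:
    """Spiegelhalter z-statistic; approximately N(0, 1) under good calibration.

    Undefined (division by zero) if every predicted probability is exactly 0.5.
    """
    numerator = sum((yi - pi) * (1 - 2 * pi) for yi, pi in zip(y, p))
    variance = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p)
    return numerator / math.sqrt(variance)


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def auc(y: Sequence[int], p: Sequence[float]) -> float:
    """Probability that a random case is ranked above a random non-case (ties count 1/2)."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): observed outcomes and model-predicted probabilities.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.2, 0.9]

z = spiegelhalter_z(y, p)
p_value = two_sided_p(z)
discrimination = auc(y, p)
```

The pairwise AUC shown here is quadratic in sample size and is meant only to make the definition explicit; a rank-based implementation would be preferable for a cohort of realistic size.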
Absolute risk estimation
The 5-year and cumulative absolute risks of developing lung cancer were estimated on the basis of a Cox proportional hazards model, accounting for the competing risk of death from all causes other than lung cancer (28). The absolute risk was estimated in a given time interval by integrating three components: (i) a model of relative risks, (ii) age-specific lung cancer incidence rates, and (iii) distributions of risk factors in the population of interest (9, 28, 29). To estimate the absolute risk trajectories for the overall population in the United Kingdom, we applied the recalibrated PLCOall2014 model (Supplementary Table S2) with PRS, together with the age-specific incidence rates and competing mortality rates obtained from Cancer Research UK, 2012 (29). For never smokers, we applied our never-smoker risk model as reported in Supplementary Table S2, together with age-specific lung cancer rates for never smokers derived from the UK Million Women Cohort (30) and the average male-to-female incidence ratio of lung cancer in never smokers previously reported in population cohorts (31). The detailed estimation process is outlined in the Supplementary Materials.
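The general idea of an absolute risk estimate under competing mortality can be sketched as follows. This is a simplified illustration, assuming piecewise-constant yearly hazards and treating the supplied incidence rates as the baseline hazard for an individual with relative risk 1; the study's actual procedure also uses the risk factor distribution of the population, which is omitted here. All rates are hypothetical.

```python
import math
from typing import Sequence


def absolute_risk(incidence: Sequence[float], mortality: Sequence[float],
                  relative_risk: float) -> float:
    """Probability of developing lung cancer over consecutive one-year intervals.

    incidence[i]  : baseline lung cancer hazard in year i (assumed constant within the year)
    mortality[i]  : hazard of death from other causes in year i (the competing risk)
    relative_risk : multiplier applied to the lung cancer hazard (e.g., from a risk model)
    """
    event_free = 1.0  # probability of being alive and cancer-free at the start of the year
    risk = 0.0
    for lam, mu in zip(incidence, mortality):
        h_cancer = relative_risk * lam
        h_total = h_cancer + mu
        if h_total > 0:
            # Share of exits during this year that are due to lung cancer.
            risk += event_free * (h_cancer / h_total) * (1.0 - math.exp(-h_total))
        event_free *= math.exp(-h_total)
    return risk


# Hypothetical yearly rates for a 5-year window (not Cancer Research UK data).
incidence = [0.0010, 0.0011, 0.0012, 0.0013, 0.0014]
mortality = [0.008, 0.009, 0.010, 0.011, 0.012]

risk_average = absolute_risk(incidence, mortality, relative_risk=1.0)
risk_elevated = absolute_risk(incidence, mortality, relative_risk=2.0)
```

Extending the input sequences to cover all years up to a target age gives a cumulative risk under the same assumptions.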
Projection in the NLST
To assess how the risk model would work in a population that would be eligible for LDCT screening, we projected the absolute risks to the NLST population. There were 1,986 incident lung cancer cases and 48,786 controls in the NLST with the variables needed for the risk modeling available for our analysis. Because this population comprises ever-smokers only, we used PLCOm2012 (designed for ever-smokers only) as the baseline model for this component. Genotype information was not available for the NLST participants, so PRS profiles were simulated conditional on lung cancer status and family history of lung cancer based on the methods described previously (9, 28). The weights of the PRS were based on the coefficients estimated from the independent PRS validation set (UK Biobank) to reduce overfitting. The detailed parameter settings and reference rates are specified in the Supplementary Materials. All tests of statistical significance were two-sided. All analyses were performed in R v.3.5.1.
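One simple way to simulate a PRS conditional on disease status is sketched below. It rests on stated assumptions that may differ from the cited methods: the standardized PRS is taken to be normally distributed with unit variance in non-cases, and the log-odds of disease is assumed linear in the PRS, in which case the distribution in cases is the same normal shifted by the log OR per SD. Conditioning on family history, which the study also applied, is not modeled here. The function name and seed are hypothetical.

```python
import math
import random
from statistics import mean


def simulate_standardized_prs(is_case: bool, or_per_sd: float, rng: random.Random) -> float:
    """Draw one PRS in SD units; non-cases ~ N(0, 1), cases ~ N(log(OR per SD), 1)."""
    shift = math.log(or_per_sd) if is_case else 0.0
    return rng.gauss(shift, 1.0)


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
OR_PER_SD = 1.26  # overall UK Biobank estimate reported in this study

cases = [simulate_standardized_prs(True, OR_PER_SD, rng) for _ in range(20000)]
controls = [simulate_standardized_prs(False, OR_PER_SD, rng) for _ in range(20000)]

# The simulated mean difference should be close to log(1.26).
mean_shift = mean(cases) - mean(controls)
```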
Results
The study characteristics of OncoArray (model training), UK Biobank (validation), and NLST (projection) are summarized in Table 1. In the OncoArray project, age and gender are well matched, as most studies applied frequency matching for these factors. As expected, there were more smokers and more individuals with a family history of lung cancer or a history of COPD among patients with lung cancer compared with controls. In the UK Biobank, a general population cohort, the majority of the study participants are never or former smokers. The NLST is a smoker-only population, as all individuals in this population met the NLST screening criteria.
Table 1. Demographic characteristics of the study populations including ILCCO OncoArray, UK Biobank, and the NLST.
ILCCO OncoArray (N = 23,127) was used for PRS construction, UK Biobank (N = 335,931) for PRS validation and model performance evaluation, and the NLST (N = 50,772) for projection in the screen-eligible population. Values are number (%) unless otherwise indicated.
Characteristic | OncoArray cases | OncoArray controls | UK Biobank cases | UK Biobank controls | NLST cases | NLST controls |
---|---|---|---|---|---|---|
Total | 13,119 | 10,008 | 1,768 | 334,163 | 1,986 | 48,786 |
Mean age (SD) | 65.3 (9.7) | 62.4 (10.3) | 62.0 (5.6) | 56.8 (8.0) | 63.7 (5.3) | 61.3 (5.0) |
Sex | | | | | | |
Men | 8,500 (65) | 6,494 (65) | 921 (52) | 154,571 (46) | 1,195 (60) | 28,748 (59) |
Women | 4,619 (35) | 3,514 (35) | 847 (48) | 179,592 (54) | 791 (40) | 20,038 (41) |
Smoking | | | | | | |
Never | 1,196 (9) | 2,896 (29) | 236 (13) | 183,435 (55) | 0 | 0 |
Former | 4,878 (38) | 3,292 (33) | 808 (46) | 117,502 (35) | 789 (40) | 25,589 (52) |
Current | 6,857 (53) | 3,286 (33) | 724 (41) | 33,226 (10) | 1,197 (60) | 23,197 (48) |
Family history of lung cancer among first-degree relatives | | | | | | |
No | 10,291 (83) | 8,570 (89) | 1,323 (77) | 286,297 (87) | 1,460 (74) | 38,100 (78) |
Yes | 2,153 (17) | 1,042 (11) | 387 (23) | 42,153 (13) | 526 (26) | 10,686 (22) |
COPD history | | | | | | |
No | 7,850 (69) | 6,587 (81) | 1,581 (90) | 328,465 (98) | 1,790 (90) | 46,405 (95) |
Yes | 3,514 (31) | 1,565 (19) | 180 (10) | 5,311 (2) | 196 (10) | 2,381 (5) |
The variants included in PRS-128 are listed in Supplementary Table S1. The distributions of the PRS in OncoArray and UK Biobank are shown in Supplementary Fig. S3A and S3B, where we observed a shift of the PRS distribution toward the right (i.e., higher PRS) for the lung cancer cases. The association between PRS and lung cancer risk based on OncoArray and UK Biobank data is shown in Table 2. Lung cancer risk increased by decile, with an approximately 3.5-fold relative risk when comparing individuals in the highest versus the lowest decile of the PRS distribution in the OncoArray dataset (OR = 3.52; 95% CI = 3.11–3.98; P = 7.34 × 10⁻⁸⁸). A strong association was also observed in the independent validation set, UK Biobank, with increasing risk by PRS decile and an OR of lung cancer for those in the top PRS decile of 2.39 (95% CI = 1.92–3.00; P = 1.80 × 10⁻¹⁴). The statistical significance was diminished in the UK Biobank dataset given the much smaller number of patients with lung cancer available in this analysis. Nonetheless, the dose–response relationship between PRS and lung cancer risk remained prominent in both OncoArray (Ptrend = 1.77 × 10⁻¹²⁷) and UK Biobank (Ptrend = 5.26 × 10⁻²⁰).
Table 2. The ORs and 95% CIs of the PRS and lung cancer risk by decile in OncoArray and UK Biobank.
PRS decile | OncoArray (model building), OR (95% CI)a | OncoArray, P | UK Biobank (validation), OR (95% CI)a | UK Biobank, P |
---|---|---|---|---|
0–10% | 1 (reference) | | 1 (reference) | |
10–20% | 1.30 (1.15–1.46) | 1.39 × 10⁻⁵ | 1.31 (1.02–1.68) | 3.54 × 10⁻² |
20–30% | 1.62 (1.44–1.82) | 1.34 × 10⁻¹⁵ | 1.16 (0.90–1.50) | 2.46 × 10⁻¹ |
30–40% | 1.58 (1.41–1.78) | 1.94 × 10⁻¹⁴ | 1.57 (1.24–2.00) | 2.07 × 10⁻⁴ |
40–50% | 1.77 (1.57–1.99) | 3.18 × 10⁻²¹ | 1.67 (1.32–2.12) | 2.25 × 10⁻⁵ |
50–60% | 2.01 (1.78–2.26) | 1.12 × 10⁻³⁰ | 1.56 (1.23–1.98) | 2.92 × 10⁻⁴ |
60–70% | 2.19 (1.94–2.46) | 7.91 × 10⁻³⁸ | 1.67 (1.32–2.13) | 1.89 × 10⁻⁵ |
70–80% | 2.38 (2.11–2.69) | 1.07 × 10⁻⁴⁵ | 1.69 (1.34–2.15) | 1.27 × 10⁻⁵ |
80–90% | 2.70 (2.39–3.04) | 4.23 × 10⁻⁵⁸ | 2.00 (1.60–2.53) | 2.58 × 10⁻⁹ |
90–100% | 3.52 (3.11–3.98) | 7.34 × 10⁻⁸⁸ | 2.39 (1.92–3.00) | 1.80 × 10⁻¹⁴ |
Ptrend | | 1.77 × 10⁻¹²⁷ | | 5.26 × 10⁻²⁰ |
aAdjusted for age, sex, and top five principal components.
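As a rough consistency check on the reported estimates, a P value can be back-calculated from an OR and its 95% CI. The sketch below assumes a Wald-type interval on the log-odds scale, which is an assumption about how the intervals were computed; because the published OR and CI are rounded, the result is expected to agree with the reported P value only in order of magnitude. The function name is hypothetical.

```python
import math


def p_from_or_ci(or_estimate: float, lower: float, upper: float,
                 z_crit: float = 1.959964) -> tuple:
    """Approximate the Wald z-statistic and two-sided P value from an OR and its 95% CI."""
    # Standard error of log(OR), recovered from the width of the interval.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(or_estimate) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Top versus bottom PRS decile in UK Biobank: OR 2.39 (95% CI 1.92-3.00).
z_top, p_top = p_from_or_ci(2.39, 1.92, 3.00)
```

For this row the back-calculated P value is on the order of 10⁻¹⁴, in line with the reported value.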
The association between PRS and lung cancer risk per SD in major risk strata defined by smoking, family history of lung cancer, COPD, and histology is shown in Table 3. The effect estimates were slightly higher in the OncoArray dataset, as expected for the model-building set. Despite slightly reduced statistical significance, the PRS showed robust associations across all major risk strata in the UK Biobank population, which served as the independent validation set.
Table 3. The ORs and 95% CIs of the PRS and lung cancer risk by smoking status, family history, COPD history, and histology in OncoArray and UK Biobank.
Risk strata | OncoArray (PRS building), ORa per SD (95% CI) | OncoArray, P | UK Biobank (PRS validation), ORa per SD (95% CI) | UK Biobank, P |
---|---|---|---|---|
Overall | 1.43 (1.39–1.47) | 7.77 × 10⁻¹³⁸ | 1.26 (1.20–1.32) | 9.69 × 10⁻²³ |
Histology | | | | |
Adenocarcinoma | 1.44 (1.39–1.49) | 1.22 × 10⁻⁸⁶ | 1.30 (1.23–1.37) | 6.59 × 10⁻²³ |
Squamous cell | 1.42 (1.36–1.48) | 1.75 × 10⁻⁶¹ | 1.23 (1.16–1.30) | 9.58 × 10⁻¹³ |
Small cell | 1.32 (1.24–1.41) | 1.14 × 10⁻¹⁸ | 1.25 (1.18–1.32) | 4.23 × 10⁻¹⁴ |
Smoking | | | | |
Never | 1.29 (1.20–1.38) | 1.57 × 10⁻¹² | 1.28 (1.13–1.46) | 8.86 × 10⁻⁵ |
Former | 1.42 (1.35–1.49) | 3.81 × 10⁻⁴⁷ | 1.25 (1.17–1.34) | 1.44 × 10⁻¹⁰ |
Current | 1.46 (1.39–1.53) | 2.42 × 10⁻⁶⁰ | 1.28 (1.19–1.38) | 3.87 × 10⁻¹¹ |
Family history | | | | |
Yes | 1.38 (1.27–1.49) | 8.94 × 10⁻¹⁶ | 1.16 (1.05–1.27) | 4.03 × 10⁻³ |
No | 1.43 (1.39–1.48) | 7.92 × 10⁻¹¹⁶ | 1.29 (1.22–1.36) | 5.95 × 10⁻²¹ |
COPD diagnosis | | | | |
Yes | 1.37 (1.28–1.46) | 1.22 × 10⁻²⁰ | 1.26 (1.09–1.46) | 1.58 × 10⁻³ |
No | 1.41 (1.36–1.46) | 8.03 × 10⁻⁸¹ | 1.26 (1.20–1.32) | 1.86 × 10⁻²⁰ |
Note: OncoArray SD = 0.54; UK Biobank SD = 0.50.
aOR adjusted for age, sex, and top five principal components.
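The per-SD estimates can be related to the top-versus-bottom decile comparison with a back-of-the-envelope calculation. The sketch below assumes a normally distributed PRS and a log-linear effect, and evaluates the OR at the mean PRS of each tail group; it ignores the spread within each decile and the way decile cut points were actually defined, so it is an approximation rather than a reanalysis. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def expected_top_vs_bottom_or(or_per_sd: float, tail: float = 0.10) -> float:
    """Approximate OR comparing the top and bottom `tail` fractions of a normal PRS."""
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(1 - tail)
    # Mean of a standard normal truncated to values above `cut`.
    mean_top = std_normal.pdf(cut) / tail
    # By symmetry the bottom-tail mean is -mean_top, so the groups differ by 2 * mean_top SDs.
    return math.exp(math.log(or_per_sd) * 2 * mean_top)


oncoarray_expected = expected_top_vs_bottom_or(1.43)   # per-SD OR from OncoArray
ukbiobank_expected = expected_top_vs_bottom_or(1.26)   # per-SD OR from UK Biobank
```

Under these assumptions the expected ORs are about 3.5 for OncoArray and about 2.25 for UK Biobank, broadly in line with the decile estimates of 3.52 and 2.39 in Table 2.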
In the UK Biobank prospective cohort, the risk model for the overall population was reasonably calibrated in the 50% hold-out validation set (Supplementary Fig. S4A). For never smokers, although the observed risk was in general consistent with the predicted risk in the training set, the model was less well calibrated in the hold-out testing set, where the observed risk fluctuated around the calibration line given the limited sample size, although the P value based on the Spiegelhalter z-test was not significant (Supplementary Fig. S4B). Adding PRS did not substantially change the AUC for the overall population (0.832 with PRS vs. 0.828 without), but a modest increase in AUC was observed among never smokers, from 0.670 to 0.687 (Supplementary Table S3). When the AUC was estimated separately by age of onset, the PRS appeared to contribute to the risk model in those with a younger age of onset (<50), albeit with modest added value: the AUC for those with young onset was 0.798 (95% CI = 0.680–0.917) without and 0.811 (95% CI = 0.701–0.902) with PRS terms (Supplementary Table S3).
To evaluate how PRS would affect an individual's absolute risk with increasing age, we estimated the absolute risk of lung cancer by PRS decile. The average risk of the population was estimated on the basis of the final model including all aforementioned risk factors and PRS. We observed a divergence of absolute risk trajectories attributable to an individual's genetic risk background, as encapsulated by PRS decile (Fig. 1A and B). The spread of absolute risk trajectories by PRS became increasingly notable with older age. To illustrate the implications for LDCT screening in populations with different background risks, Fig. 2 shows the 5-year absolute risk estimates stratified by smoking status and family history of lung cancer. For example, in the UK Biobank, among current smokers with a family history of lung cancer, the average risk of lung cancer in the next 5 years at age 60 was approximately 4.29%, whereas the risk was 7.64% for those in the top PRS decile (P for top 10% PRS vs. 40–60% PRS = 8.80 × 10⁻¹⁵). Because absolute risk increases as a function of age, the direct consequence is a difference in the age at which individuals reach the threshold for LDCT screening.
Figure 1. Absolute risk estimates of lung cancer by PRS-114 deciles based on the UK Biobank study. A, Five-year absolute risk. B, Cumulative risk until age 80. The risk factors included in the model are sex, race, education, BMI, tobacco smoking, COPD history, and family history of cancer. The x-axis is the age at cohort entry. The curves depict the average risk of individuals in different PRS deciles as specified by the legends. The dashed curve represents the average risk of the overall population at different ages based on the final model, which includes all risk factors and PRS. The divergence of the risk curves represents the contribution of PRS and increasing age.
Figure 2. Absolute risk estimates of lung cancer by smoking status and family history of lung cancer based on the UK Biobank study. The risk factors included in the model are sex, race, education, BMI, tobacco smoking, COPD history, and family history of lung cancer. The x-axis is the age at cohort entry. The curves depict the average risk of individuals in different PRS deciles as specified by the legends. The dashed curve represents the average risk of the overall population at different ages based on the final model, which includes all risk factors and PRS-114. The divergence of the risk curves represents the contribution of PRS and increasing age. The blue horizontal dotted line represents a 1.5% 5-year absolute risk of lung cancer.
Assuming a 1.5% absolute risk of lung cancer within the next 5 years as the threshold for recommending LDCT screening, never smokers did not reach the risk threshold regardless of their PRS decile. Therefore, the PRS distribution does not appear to have implications for the never-smoker group in general. On the other hand, among ever-smokers, the PRS distribution can affect when individuals reach the absolute risk threshold for LDCT screening. For example, on average, individuals who smoked but had no family history reach the 1.5% 5-year absolute risk at age 61, whereas those in the top 1% of the PRS distribution would reach the threshold at age 53 (Fig. 2; Supplementary Table S4). Among those who smoked and had a positive family history of lung cancer, the average age at reaching the LDCT screening recommendation threshold would be 56, but those in the top 5% of PRS would reach the threshold at age 52, earlier than the starting age in the previous LDCT screening guideline (Fig. 2; Supplementary Table S4; ref. 4). Among current smokers, those with a family history of lung cancer and in the top 10% of the PRS distribution would reach the 1.5% 5-year risk before they turn 50.
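The threshold-crossing logic described above can be sketched in a few lines of Python. The risk trajectories below are synthetic exponential curves chosen only to illustrate the calculation; they are not the estimated trajectories in Fig. 2 or Supplementary Table S4, and the function name and the 1.8-fold multiplier are assumptions.

```python
import math
from typing import Dict, Optional


def age_reaching_threshold(risk_by_age: Dict[int, float],
                           threshold: float = 0.015) -> Optional[int]:
    """Return the youngest age at which the 5-year risk meets the threshold, or None."""
    for age in sorted(risk_by_age):
        if risk_by_age[age] >= threshold:
            return age
    return None


# Synthetic 5-year risk trajectory for an "average" individual (hypothetical values).
average = {age: 0.002 * math.exp(0.1 * (age - 40)) for age in range(40, 81)}
# The same trajectory scaled by a hypothetical 1.8-fold relative risk for a high-PRS group.
high_prs = {age: 1.8 * risk for age, risk in average.items()}

age_average = age_reaching_threshold(average)
age_high_prs = age_reaching_threshold(high_prs)
years_earlier = age_average - age_high_prs
```

With these synthetic curves the high-PRS group crosses the 1.5% threshold several years before the average individual, which is the kind of shift in screening age discussed in the text.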
To show the joint impact of smoking status and PRS, Supplementary Fig. S5 illustrates the absolute risk trajectories for combinations of smoking status and PRS. Smoking cessation reduces the absolute risk of lung cancer regardless of PRS category, with a relative reduction of approximately 45% in lung cancer risk by age 70, which is consistent with previous reports (32, 33). For example, among those in the top 10% of PRS, smoking cessation reduced the 5-year absolute risk from 10.5% to 5.6% by age 70, an absolute risk reduction of 4.9 percentage points; among those with intermediate PRS, smoking cessation reduced the 5-year absolute risk from 5.5% to 3.0%, an absolute risk reduction of 2.5 percentage points.
To evaluate the extent to which absolute risks could be modified by PRS in an LDCT-eligible population (older, heavy smokers), we show the 5-year absolute risks and cumulative risk by age 85 for the NLST population in Fig. 3A and B, with PRS simulated per the methods described. The absolute risk of lung cancer differed by an individual's genetic background in this high-risk population, and the risk differences between PRS deciles increased with increasing age.
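The distinction between absolute and relative risk reduction in the example above can be made explicit with a short calculation. The inputs are the 5-year risks quoted in the text; the function name is hypothetical.

```python
def risk_reduction(risk_before: float, risk_after: float) -> tuple:
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    absolute = risk_before - risk_after
    relative = absolute / risk_before
    return absolute, relative


# 5-year absolute risks (%) at age 70 for continued smoking versus cessation.
top_decile = risk_reduction(10.5, 5.6)      # top 10% of PRS
intermediate = risk_reduction(5.5, 3.0)     # intermediate PRS
```

The relative reductions are similar in the two groups (roughly 47% and 45%), whereas the absolute reduction is about twice as large in the top PRS decile because its starting risk is higher.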
Figure 3. Absolute risk estimates of lung cancer based on the projection in the NLST. A, Five-year absolute risk. B, Cumulative risk until age 80. The risk factors included in the model are race, BMI, education, smoking history, personal history of cancer and COPD, and family history of lung cancer. The x-axis is the age at cohort entry. The curves depict the average risk of individuals in different PRS-114 deciles as specified by the legends. The dashed curve represents the average risk of the overall population at the corresponding age based on the final model including all risk factors and PRS-114. The divergence of the risk curves represents the contribution of PRS and increasing age.
Discussion
In this study, we evaluated whether an individual's genetic background can be used to stratify lung cancer absolute risk when incorporated into well-known lung cancer risk models. Our analysis showed that PRS is associated with an individual's lung cancer risk in a dose–response relationship. Furthermore, an individual's genetic background, as encapsulated by PRS, can further stratify lung cancer absolute risk over the next 5 years or cumulatively over the lifetime. The risk model was developed and validated in two large independent datasets.
The key observation of this analysis is that an individual's genetic background has limited impact on the risk model's ability to discriminate whether individuals eventually develop lung cancer. However, the genetic background is informative regarding the age at which an individual reaches the LDCT screening-eligible threshold, as the absolute risk trajectories diverge by PRS decile and with increasing age. This is clinically relevant, as it could potentially affect when LDCT screening should be recommended. The absolute risk stratified by smoking and family history of lung cancer showed that ever smokers would reach the LDCT screening threshold at very different ages depending on their family history of lung cancer and their genetic makeup, with a difference from the average age as large as 4 years among those with family history and 8 years among those without family history. These differences are clinically meaningful: they would allow more timely detection for those in the top 10% of PRS, who could start screening before the previous official USPSTF recommended age of 55 (4), and would also identify those who do not need to be screened until past age 60, which would reduce healthcare burden and radiation exposure. Most recently, the USPSTF presented a draft recommendation, updated in July 2020, expanding eligibility to an earlier starting age of 50 (uspreventiveservicestaskforce.org), which would help to include some of those with higher genetic risk. On the other hand, our analysis also showed that the vast majority of never smokers would never reach the LDCT screening threshold regardless of their genetic background.
One potential hindrance to implementing genetic testing in the potentially eligible population for more precise LDCT screening recommendations is the cost and feasibility of genotyping. As genotyping costs decline, we expect that they can be offset by the reduction in unnecessary LDCT scans and by the quality-adjusted life years saved when lung cancer is detected earlier. However, a systematic assessment of feasibility and a formal cost-effectiveness analysis, with detailed sensitivity analyses across varying parameters, will be required to provide an in-depth comparison of the different approaches, which is beyond the scope of this study.
The variants that were selected into the model, either through previous work (PRS-35) or the penalized regression applied in this study (PRS-93), were located in several different regions. The 35 variants were predominantly from previously known lung cancer loci (such as TERT, HLA, CHEK2), and their biological implications have been previously reported. The variants selected by the LASSO penalized regression include additional variants from previously known regions that were not sufficiently tagged by those in PRS-35, as well as variants from other genetic regions in pathways related to cytokines and chemokines (e.g., TRIM31, TRIM15, XCL2, IRF4, ILC33, VSTM1) and signaling pathways (MAP3K20, NUMBL; Supplementary Table S1).
There are several potential limitations of this study. First, the PRS assumes multiplicativity among genetic variants. Although we assessed pairwise interactions and did not observe any interactions between the variants, we did not assess higher-order interactions. Nevertheless, this method is considered efficient and reasonable for representing an individual's genetic background (13, 34). We also assessed potential interactions between risk factors and PRS; although nominal interactions were detected between age and smoking status, including interaction terms did not lead to material changes in the results. We therefore consider our parsimonious model (fewer variables with the same predictive accuracy) to be the reasonable one to use in the clinical setting. Second, this analysis was based on a population with European ancestry and thus likely cannot be readily generalized to other racial groups. Additional analyses in other ethnicities will be needed, in particular in populations of Asian and African ancestry. A separate effort to establish a PRS model based on the China Kadoorie Biobank, which contains genetic data on approximately 95,000 individuals, is currently underway. The cohort we used to evaluate the model prospectively, UK Biobank, is a general population cohort, but its socioeconomic status is skewed toward the higher levels, similar to other population cohorts; thus, the prevalence of some related risk factors (such as smoking) might be underrepresented, which can affect the absolute risk estimation. However, this would not affect the model's ability to discriminate. In addition, we addressed this issue by recalibrating the model using 50% of the UK Biobank data, applying the recalibrated coefficients to the absolute risk estimation, and estimating the absolute risks in never smokers separately. Finally, even though we built a de novo model for never smokers, the model's ability to discriminate remained modest. However, we were able to investigate additional risk factors that can be relevant for never smokers, such as second-hand smoke, ambient air pollution, and impaired lung function, albeit with a limited sample size of nonsmoking lung cancer cases in UK Biobank. With increasing availability of data on these factors, it is possible for the model performance to improve; if so, with vastly improved predictive performance, the risk of never smokers may reach a threshold sufficient to warrant CT screening.
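The multiplicativity assumption mentioned in the first limitation means that each risk allele multiplies the odds by a fixed factor, which is equivalent to summing per-variant log odds ratios weighted by allele dosage. The following is a minimal sketch of that equivalence; the three odds ratios and dosages are hypothetical and are not the weights of the PRS in this study.

```python
import math
from typing import Sequence


def polygenic_risk_score(log_odds_ratios: Sequence[float], dosages: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2) using per-allele log odds ratios."""
    if len(log_odds_ratios) != len(dosages):
        raise ValueError("weights and dosages must have the same length")
    return sum(beta * g for beta, g in zip(log_odds_ratios, dosages))


# Hypothetical per-allele odds ratios for three variants (NOT study weights)
odds_ratios = [1.10, 1.25, 0.90]
weights = [math.log(o) for o in odds_ratios]

# Hypothetical genotype: 2, 1, and 0 copies of the effect allele
dosages = [2, 1, 0]

score = polygenic_risk_score(weights, dosages)
combined_or = math.exp(score)  # equals 1.10**2 * 1.25 under multiplicativity
print(f"PRS = {score:.4f}, combined odds ratio = {combined_or:.4f}")
```

Under this model no interaction terms appear: the contribution of one variant does not depend on the genotype at any other, which is why pairwise and higher-order interactions have to be tested separately.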
Our study has several important strengths. We constructed and validated the PRS based on the largest lung cancer germline genomic dataset to date, which provides the most robust estimates currently available. In addition, we conducted multistage model building and validation in large population cohort datasets, with a total of over 350,000 participants across both stages. This supports the validity of the model and minimizes potential over-optimism. Finally, we applied a novel methodology to simulate the PRS distribution in the NLST population to assess the potential clinical utility of PRS in a screening-eligible population.
In summary, our study showed that an individual's genetic background can potentially affect the optimal timing of starting LDCT screening. The risk prediction algorithm could be refined further if sample sizes increase substantially. This is the first study to report the potential clinical utility of PRS in a population of European descent with a comprehensive assessment.
Authors' Disclosures
G. Liu reports grants and personal fees from AstraZeneca and Takeda; personal fees from Roche, Pfizer, and Bristol Myers Squibb; grants from Boehringer Ingelheim; and personal fees from EMD Serono outside the submitted work. M. Johansson reports grants from NIH (U19 CA203654, Integrative Analysis of Lung Cancer Risk and Etiology, INTEGRAL) during the conduct of the study. L. Le Marchand reports grants from NCI during the conduct of the study. S. Lam reports grants from Terry Fox Research Institute, VGH-UBC Hospital Foundation, and BC Cancer Foundation during the conduct of the study. S.M. Arnold reports grants from Merck Sharp & Dohme Corporation, Kura Oncology Incorporated, Stemcentrx Incorporated, Regeneron Pharmaceuticals, AbbVie Incorporated, Nektar Therapeutics, Exelixis, and AstraZeneca Pharmaceuticals outside the submitted work. M.C. Aldrich reports grants from NIH/NCI and Lung Cancer Research Foundation outside the submitted work. A. Risch reports grants from Deutsche Krebshilfe and NIH-U19 during the conduct of the study. P. Brennan reports grants from NIH (U19 CA203654, Integrative Analysis of Lung Cancer Risk and Etiology, INTEGRAL) during the conduct of the study. C.I. Amos reports grants from Baylor College of Medicine during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
R.J. Hung: Conceptualization, resources, data curation, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. M.T. Warkentin: Conceptualization, data curation, formal analysis, validation, investigation, visualization, writing–review and editing. Y. Brhane: Data curation, formal analysis, validation, investigation, visualization, methodology, writing–review and editing. N. Chatterjee: Conceptualization, resources, software, investigation, methodology, writing–review and editing. D.C. Christiani: Resources, data curation, project administration, writing–review and editing. M.T. Landi: Resources, data curation, project administration, writing–review and editing. N.E. Caporaso: Resources, data curation, project administration, writing–review and editing. G. Liu: Resources, data curation, project administration, writing–review and editing. M. Johansson: Resources, data curation, project administration, writing–review and editing. D. Albanes: Resources, data curation, project administration, writing–review and editing. L. Le Marchand: Resources, data curation, project administration, writing–review and editing. A. Tardon: Resources, data curation, project administration, writing–review and editing. G. Rennert: Resources, data curation, writing–original draft, project administration. S.E. Bojesen: Resources, data curation, writing–original draft, project administration. C. Chen: Resources, data curation, project administration, writing–review and editing. J.K. Field: Resources, data curation, project administration, writing–review and editing. L.A. Kiemeney: Resources, data curation, project administration, writing–review and editing. P. Lazarus: Resources, data curation, project administration, writing–review and editing. S. Zienolddiny: Resources, data curation, project administration, writing–review and editing. S. Lam: Resources, data curation, writing–original draft, project administration. A.S. Andrew: Resources, data curation, project administration, writing–review and editing. S.M. Arnold: Resources, data curation, project administration, writing–review and editing. M.C. Aldrich: Resources, data curation, project administration, writing–review and editing. H. Bickeböller: Resources, data curation, project administration, writing–review and editing. A. Risch: Resources, data curation, project administration, writing–review and editing. M.B. Schabath: Resources, data curation, project administration, writing–review and editing. J.D. McKay: Conceptualization, resources, data curation, investigation, writing–original draft, project administration. P. Brennan: Conceptualization, resources, data curation, funding acquisition, investigation, project administration, writing–review and editing. C.I. Amos: Conceptualization, resources, funding acquisition, investigation, project administration, writing–review and editing.
Acknowledgments
This research has been conducted using the UK Biobank Resource under Application Number 23261. We thank all participating studies. The CAPUA study was supported by FIS-FEDER/Spain grant numbers FIS-01/310, FIS-PI03-0365, and FIS-07-BI060604, FICYT/Asturias grant numbers FICYT PB02-67 and FICYT IB09-133, the University Institute of Oncology (IUOPA) of the University of Oviedo, and the Ciber de Epidemiologia y Salud Pública (CIBERESP), Spain. CARET is funded by the NCI, NIH through grants U01-CA063673, UM1-CA167462, and U01-CA167462. The Liverpool Lung project is supported by the Roy Castle Lung Cancer Foundation. The Harvard Lung Cancer Study was supported by the NIH (NCI) grants CA092824, CA090578, CA074386, and 5U01CA209414. The Multiethnic Cohort Study was partially supported by NIH grants CA164973, CA033619, CA63464, and CA148127. The work performed in MSH-PMH study was supported by The Canadian Cancer Society Research Institute (020214), Ontario Institute of Cancer and Cancer Care Ontario Chair Award (to R.J. Hung and G. Liu), and the Alan Brown Chair and Lusi Wong Programs at the Princess Margaret Hospital Foundation. The Norway study was supported by the Norwegian Cancer Society and the Norwegian Research Council. The work in TLC study has been supported in part by the James & Esther King Biomedical Research Program (09KN-15), NIH Specialized Programs of Research Excellence (SPORE) Grant (P50 CA119997), and by a Cancer Center Support Grant (CCSG) at the H. Lee Moffitt Cancer Center and Research Institute, an NCI-designated Comprehensive Cancer Center (grant number P30-CA76292). The Vanderbilt Lung Cancer Study – BioVU dataset used for the analyses described was obtained from Vanderbilt University Medical Center's BioVU, which is supported by institutional funding, the 1S10RR025141-01 instrumentation award, and by the Vanderbilt CTSA grant UL1TR000445 from NCATS/NIH, and K07CA172294.
The Copenhagen General Population Study (CGPS) was supported by the Chief Physician Johan Boserup and Lise Boserup Fund, the Danish Medical Research Council, and Herlev Hospital. The NELCS study was supported by grant number P20RR018787 from the National Center for Research Resources (NCRR), a component of the NIH. Kentucky Lung Cancer Research Initiative was supported by the Department of Defense (Congressionally Directed Medical Research Program, U.S. Army Medical Research and Materiel Command Program) under award number: 10153006 (W81XWH-11-1-0781). Views and opinions of, and endorsements by the author(s) do not reflect those of the US Army or the Department of Defense. It also was supported by NIH grant UL1TR000117 and P30 CA177558 using Shared Resource Facilities: Cancer Research Informatics, Biospecimen and Tissue Procurement, and Biostatistics and Bioinformatics. Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization. This study was funded by the NIH (U19 CA203654, Integrative Analysis of Lung Cancer Risk and Etiology, INTEGRAL), and CIHR Foundation Grant (FDN 167273) and Canada Research Chair (to R.J. Hung). The funding organizations had no role in any aspect of the study, including study design, management, data collection, analysis, result interpretation, or any stage of the manuscript preparation.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.