Abstract
The high disease burden makes it desirable to identify high-risk Asian never-smoking females (NSF) who may benefit from low-dose CT (LDCT) screening. In North America, one is eligible for LDCT screening if one satisfies the U.S. Preventive Services Task Force (USPSTF) criteria or has a model-estimated 6-year risk greater than 0.0151. According to two U.S. reports, only 36.6% of female patients with lung cancer met the USPSTF criteria, while 38% of the ever-smokers ages 55 to 74 years met the USPSTF criteria.
Using data on NSFs in the Taiwan Genetic Epidemiology Study of Lung Adenocarcinoma and the Taiwan Biobank before August 2016, we formed an age-matched case–control study consisting of 1,748 patients with lung cancer and 6,535 controls. Using these and an estimated age-specific lung cancer 6-year incidence rate among Taiwanese NSFs, we developed the Taiwanese NSF Lung Cancer Risk Models using genetic information and simplified questionnaire (TNSF-SQ). Performance evaluation was based on the newer independent datasets: Taiwan Lung Cancer Pharmacogenomics Study (LCPG) and Taiwan Biobank data after August 2016 (TWB2).
The AUC based on the NSFs ages 55 to 70 years in LCPG and TWB2 was 0.714 [95% confidence interval (CI), 0.660–0.768]. For women in TWB2 ages 55 to 70 years, 3.94% (95% CI, 2.95–5.13) had risk higher than 0.0151. For women in LCPG ages 55 to 74 years, 27.03% (95% CI, 19.04–36.28) had risk higher than 0.0151.
TNSF-SQ demonstrated good discriminative power. The ability to identify 27.03% of high-risk Asian NSFs ages 55 to 74 years deserves attention.
TNSF-SQ seems potentially useful in selecting Asian NSFs for LDCT screening.
This article is featured in Highlights of This Issue, p. 265
Introduction
The National Lung Screening Trial (NLST) reported that a 20% decrease in mortality from lung cancer was observed in the arm screened using low-dose CT (LDCT) compared with the arm using chest radiography. Its success depends critically on its application of screening to high-risk individuals. Participants in NLST were 55–74 years of age, smoked no less than 30 pack-years, and had no more than 15 years of smoking quit time (1). Subsequently, the U.S. Preventive Services Task Force (USPSTF) recommended annual LDCT lung cancer screening of high-risk populations; namely, those who are 55–80 years of age and have the same smoking experience as those in NLST (2).
Several risk prediction models have since been considered to deal with the important problem of selecting high-risk individuals for LDCT screening (3–6). For example, the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) 2012 model (PLCOM2012) that estimates the probability of a smoker developing lung cancer in a 6-year period was used to show that the mortality rates in the NLST LDCT arm were consistently below those in the chest X-ray arm among individuals with PLCOM2012 risk ≥ 0.0151 (7). Model-based 6-year risk higher than 0.0151 and 0.02 have been considered in selecting individuals for LDCT screening for lung cancer (8, 9).
The USPSTF or NLST criteria have been examined continuously in practical situations. For example, only about 36.6% of female patients with lung cancer diagnosed during 2005–2011 met the USPSTF criteria for screening and this proportion had been decreasing (10), while about 38% (28,401/74,218) of the ever-smoking participants, or about 20% [28,401/(74,218 + 65,711)] of all the participants, in the PLCO were eligible for screening according to NLST criteria (4, 7, 9). In addition, selecting lung cancer screenees using PLCOM2012 risk higher than 1.5% or 2% has also been compared extensively with selection using the USPSTF/NLST criteria (4, 7, 9). In particular, PLCOM2012 would select 8.8% fewer persons and identify 12.4% more cases of lung cancer than USPSTF criteria.
The recent NELSON trial observed a 26% mortality reduction in male patients with lung cancer and up to a 61% mortality reduction in female patients with lung cancer among those screened using LDCT compared with those having no screening (11). This, together with a Korean report (12) that, under LDCT screening, only 5% of the lung cancers detected in never-smokers were interval cancers, compared with 43% of those in ever-smokers, suggests that LDCT screening may be effective in reducing lung cancer–related mortality in Asian never-smoking females (NSF).
About 25% of lung cancer cases arise in never-smokers, and lung cancer in never-smokers (LCINS) ranks as the seventh most common cause of cancer-related death worldwide (13–15). The proportion of LCINS has been increasing over time, and about 60%–80% of female patients with lung cancer in Asia are never-smokers, much higher than in the United States and Europe (15, 16). LCINS exhibits distinct molecular characteristics, and the incidence of lung cancer in Asian NSFs is particularly high (17, 18). In Taiwan, 55% of the lung cancers are in never-smokers, lung cancer is the leading cause of cancer-related death among women, and over 90% of female patients with lung cancer are never-smokers (19). Recently, there has been interest in LDCT screening for lung cancer among female never-smokers in China, Japan, Korea, and Taiwan (20–25). In particular, a lung cancer screening program for never-smokers in Taiwan is ongoing (ClinicalTrials.gov Identifier: NCT02611570); its eligibility criteria are never-smokers aged between 55 and 75 years with one of the following risk factors: family history of lung cancer within the third degree, passive smoking exposure, TB/chronic obstructive pulmonary disease, and high cooking index without using a ventilator during cooking (24, 25).
The above observations suggest that it is crucial to be able to identify high-risk Asian NSFs who may benefit from LDCT screening. To this end, developing lung-cancer risk prediction tools for Asian NSFs based on risk factors consistently identified in previous studies becomes a priority (26). However, this is difficult and challenging. Unlike the situation of tobacco-driven lung cancer, there are no established risk factors dominating the development of lung cancer among never-smokers. Numerous risk factors have been suggested and their effects vary greatly by geographic region (15, 18, 27–31). We note that PLCO models do not seem to be useful for Asian NSFs because PLCO included only about 2,000 never-smokers of Asian ethnicity and only seven lung cancers (7, 26, 32). Indeed, none of the never-smokers in the PLCO (n = 65,711) had a 6-year risk >0.0151, using the PLCOM2014 that is analogous to PLCOM2012 and included never-smokers (7).
The main purpose of this study was to propose models to estimate the risk of lung cancer diagnosis over a 6-year period in a Taiwanese NSF and to examine whether they are useful in identifying high-risk never-smoking women who may benefit from LDCT screening. In particular, we evaluated the criteria of using the 6-year risk higher than 0.0151 or 0.02 as thresholds to select NSFs for LDCT screening. The evaluation is based on datasets obtained chronologically later than the datasets used in the model development. We believe this study is useful and timely from both the public and personal perspectives. In particular, it is hoped that this study could improve the implementation of LDCT lung cancer screening programs in Asian NSFs.
The model was developed on the basis of an age-matched case–control study (AMCCS) and the age-specific six-year lung cancer incidence rate (ASSIR) in Taiwanese NSFs. The AMCCS included the Genetic Epidemiology Study of Lung Adenocarcinoma in Taiwan (GELAC), which has been used to study the genetic and environmental risk factors for lung cancer in Asian NSFs (33–39). The core risk factors included in our final model are age, body mass index (BMI), chronic obstructive pulmonary disease (COPD), education, family history of lung cancer, and SNPs reported in the genome-wide association studies (GWAS) of lung cancer in Asian NSFs, although PM2.5 did help in prediction.
Materials and Methods
AMCCS and selection of risk factors
Model construction began using the questionnaire and genetic information of 2,105 patients with lung cancer and 1,405 healthy controls from the NSFs in the case–control study component of the GELAC (33, 35, 38), who were recruited during 2000–2015, and 7,687 healthy NSFs from the Taiwan Biobank data granted before August 2016 (TWB1) having SNP array data. Cases in GELAC were incident patients with lung cancer. Supplementary Text S1 in the Supplementary Materials has more information about GELAC and Taiwan Biobank; some of the Taiwan Biobank participants were administered simplified questionnaires instead of the complete ones. Implementing quality control on these datasets by using both SNP array data and questionnaire information resulted in a dataset, referred to as the initial dataset, consisting of 7,094 females from TWB1 and 2,085 cases and 1,365 controls from GELAC, among which any two individuals genotyped by SNP array were unrelated, with their relatedness coefficient (PI-HAT) less than 0.05. The sex inferred from their SNP arrays was female.
Because 35.6% in TWB1 used the simplified questionnaire (SQ) and the rest used the complete one, and because we considered only risk factors having identical meanings in the Taiwan Biobank and GELAC questionnaires, we included only age, BMI, COPD, education, family history of lung cancer, the GWAS-identified SNPs, and PM2.5*I(Age≥55) in developing our main model.
Because the Taiwan Biobank questionnaire records only the district in which each participant resided at enrollment, and because the Taiwan government has measured PM2.5 exposure at 76 stations over Taiwan main island only since the beginning of 2006, the PM2.5*I(Age≥55) for an individual in this article equals the average PM2.5 exposure during 2006–2008 interpolated at the district office of this individual's residence if aged 55 or more, and 0 otherwise. There are a total of 352 districts in Taiwan main island; see Text S2 for more details. We considered PM2.5 for those aged 55 or more to reflect its long-term effect (40). Its concentration was measured in μg/m³.
On the basis of the initial dataset, we formed the AMCCS as follows. For each case, we age-matched (±1 year, age at recruitment) healthy controls selected from GELAC or TWB1 until each case had 1–5 matched controls. This resulted in the AMCCS that consisted of 1,748 age-matched groups. Figure 1 details these procedures.
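The matching step can be illustrated with a minimal Python sketch. It is not the authors' procedure (Figure 1 gives that); it assumes controls are drawn without replacement in random order and that cases with no available control within ±1 year are dropped. The function name `age_match` and the `(id, age)` tuple layout are hypothetical.

```python
import random

def age_match(cases, controls, max_controls=5, tol=1, seed=0):
    """Greedy age matching: give each case up to `max_controls` unused
    controls whose age differs by at most `tol` years.

    `cases` and `controls` are lists of (id, age) tuples (hypothetical layout).
    Assumption: controls are used at most once; unmatched cases are dropped.
    """
    rng = random.Random(seed)
    pool = list(controls)
    rng.shuffle(pool)              # random order so matching is not tied to input order
    used = set()
    groups = []
    for case_id, case_age in cases:
        matched = []
        for ctrl_id, ctrl_age in pool:
            if ctrl_id in used:
                continue
            if abs(ctrl_age - case_age) <= tol:
                matched.append(ctrl_id)
                used.add(ctrl_id)
                if len(matched) == max_controls:
                    break
        if matched:                # keep only cases with at least one matched control
            groups.append((case_id, matched))
    return groups

# Toy example: the second case has no control within one year and is dropped.
toy_groups = age_match([("c1", 60), ("c2", 40)],
                       [("k1", 59), ("k2", 61), ("k3", 70)])
```

A greedy pass like this is order dependent; it is shown only to make the "1–5 matched controls per case" rule concrete.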
On the basis of the AMCCS, we assessed a variety of risk factors, shown in Supplementary Table S0, Supplementary Materials. Supplementary Table S1, Supplementary Materials, provides information on the 11 SNPs.
Approach to absolute risk
We built our risk prediction models by combining relative risk models with population incidence rates. This approach was used in breast cancer risk models and the Liverpool Lung Project risk model (41–43); additional efforts were made to take into consideration the increase of female lung cancer incidence rate in Taiwan and competing causes of death.
We used a two-stage approach to fit a logistic regression model that includes an intercept, age, and other risk factors. In the first stage, we fitted multivariate logistic regression models, using a conditional likelihood approach (44), to the AMCCS data to obtain the ORs of risk factors other than age. The subset of samples of the AMCCS used in the final step of this fitting procedure, referred to as selected AMCCS (SAMCCS), was used in the second stage.
Taking into account competing causes of death and the increase of female lung cancer incidence rate in Taiwan (Fig. 5b in Chien and colleagues; ref. 45), we constructed the 2011 age-specific rates of developing lung cancer in the next 6 years among Taiwanese NSFs, based on population data prior to 2011, including the Taiwan Cancer Registry, Taiwan Cause of Death Database, Monthly Bulletin of Interior Statistics from the Taiwan Ministry of the Interior, and the 2010 Taiwan life table. Details are in Supplementary Text S2 and Supplementary Tables S2 and S3, which report the ASSIR.
We assume that the healthy controls in the AMCCS are representatives of NSFs in Taiwan conditional on the matching variable, age, and that a logit relationship holds between risk factors and the probability of developing cancer in the subsequent 6-year period. In the second stage, we estimated the effect of age and intercept using 2011 ASSIR, conditional on the ORs of the risk factors from the first stage, to obtain the risk prediction model. Details are in Supplementary Text S2, Supplementary Materials.
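One way to write down the two stages, as a sketch rather than the authors' exact formulation (which is in Supplementary Text S2): the symbols $\beta$, $\alpha_0$, $g$, $x_{sj}$, and $m_s$ are notation introduced here, the cubic form of $g$ is read off the age terms in Table 1, and the calibration condition in the last line is an assumed reading of "using 2011 ASSIR".

```latex
% Stage 1: conditional likelihood over matched sets s, each with one case
% (j = 0) and m_s controls (j = 1, ..., m_s); age is absorbed by the matching.
L(\beta) \;=\; \prod_{s}
  \frac{\exp\!\left(\beta^{\top} x_{s0}\right)}
       {\sum_{j=0}^{m_s} \exp\!\left(\beta^{\top} x_{sj}\right)}

% Stage 2: with \hat\beta fixed, the 6-year risk for age a and risk factors x is
\operatorname{logit}\, r(a, x) \;=\; \alpha_0 + g(a) + \hat\beta^{\top} x,
\qquad g(a) = \alpha_1 a + \alpha_2 a^{2} + \alpha_3 a^{3}

% Assumed calibration condition: averaging over controls of age a
% reproduces the age-specific six-year incidence rate.
\mathbb{E}\!\left[\, r(a, X) \mid \text{age} = a \,\right] \;=\; \mathrm{ASSIR}(a)
```

Under this reading, the ORs in the first stage are $\exp(\hat\beta_k)$, and only $\alpha_0$ and the age coefficients are estimated in the second stage.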
Chronologically later datasets for performance evaluation
We used the data of the NSFs in the Taiwan Lung Cancer Pharmacogenomics Study (LCPG) and those of the NSFs with SNP array genotype in the Taiwan Biobank granted from September 2016 to October 2018 (TWB2) as independent cohorts for performance evaluation. LCPG recruited patients with late-stage lung cancer whose first-line treatments were chemotherapy or targeted therapy in the period 2015–2017 from five hospitals in Taiwan. LCPG recruited 233 NSFs. Detailed information on TWB2 and LCPG is in Supplementary Text S1, Supplementary Materials.
We used LCPG and TWB2 to estimate the percentage of individuals with risk higher than certain thresholds and the AUC ROC (46). Participants in LCPG and lung cancer cases in GELAC are beyond risk, as they had lung cancer at enrollment. However, they serve to estimate the sensitivity of the risk prediction models if screening were available prior to clinical diagnosis. In fact, the percentage of LCPG higher than a threshold approximated the sensitivity of a slightly lower threshold, because their risks were based on questionnaire administered at diagnosis and risk usually increases with age; similarly, the percentage of TWB2 higher than a threshold approximated the 1 − specificity of a slightly lower threshold. With this understanding, we calculated these quantities and the AUC ROC.
We built our models for women ages 30–76 and mainly evaluated on women ages 55–70 or 55–74 because the Taiwan Biobank individuals were under 70 years of age and 55–74 was the age range in the NLST. We focused on two models; one included genetic information and the other did not.
We considered three risk levels: 0.0134, 0.0151, and 0.02. The level 0.0134 is the risk at which PLCOM2012 finds eligible the same number of individuals as the NLST criteria would find eligible in PLCO data (7). The level 0.02 was used to recruit participants for the PanCan study (8). All calculations were implemented in R.
This study was approved by the Institutional Review Board of the National Health Research Institutes (Zhunan, Taiwan) with the approval number EC1000902. Written consent was obtained from each participant in this study.
Results
Characteristics of the AMCCS
The epidemiologic characteristics of the AMCCS are shown in Supplementary Table S0, Supplementary Materials. Supplementary Table S0 is suggestive of risk factors that might be important as potential candidates for entry into the prediction models. On the basis of the univariate analyses at 0.05 significance level, education, family history of lung cancer, COPD, cooking time-year (CTY), CTY without fume extractor when cooking, hormone replacement therapy, and PM2.5*I(Age≥55) were associated with lung cancer risk. Nine of the GWAS-identified 11 SNPs showed P values less than 0.05; the two that did not, rs3817963 and rs2179920, had the largest P values in Seow and colleagues' study (37).
Risk model using the simplified questionnaire
We first obtained the effects of BMI, COPD, education, family history, PM2.5*I(Age≥55), and the 11 SNPs by fitting to the AMCCS data using conditional likelihood. Conditional on these effects and using the 2011 ASSIR, we then obtained the proposed risk prediction model. In fact, we constructed two such risk models; one involved variable selection in the first estimation stage; the other did not. The former is termed the Taiwan NSF Lung Cancer Risk Model using SQ (TNSF-SQ); the latter is termed TNSF-SQ1. Table 1 reports the ORs for TNSF-SQ except for age, whose effects are given in Supplementary Fig. S1. Risk calculators are given in Supplementary Text S2, Supplementary Materials, where the coefficients are shown to more decimal places. These two models are similar, except that PM2.5*I(Age≥55) was included in TNSF-SQ1 only. In both models, education was protective and had a strong effect; both family history of lung cancer and COPD also had large ORs. The impact of BMI on lung cancer susceptibility showed that those having the smallest BMI were at the highest risk. Nine of the 11 SNPs from GWAS were included in the model. We considered TNSF-SQ our main model for reasons to be given in the Discussion.
Variable | OR (95% CI), SAMCCS (6,684 = 1,341 + 5,343) | P^a |
---|---|---|
Education | 0.543 (0.511–0.577) | <2.00E-16 |
BMI (kg/m^2) | | |
BMI < 18.5 | 1.581 (1.115–2.243) | 1.02E-02 |
18.5 ≤ BMI < 24 | 1 | |
24 ≤ BMI < 27 | 0.886 (0.753–1.041) | 1.41E-01 |
27 ≤ BMI < 30 | 0.656 (0.521–0.824) | 3.06E-04 |
BMI ≥ 30 | 0.742 (0.542–1.015) | 6.18E-02 |
Family history of lung cancer | 2.067 (1.643–2.600) | 5.72E-10 |
COPD | 2.204 (1.382–3.513) | 8.96E-04 |
rs10937405 (A)^b | 0.738 (0.663–0.822) | 3.33E-08 |
rs2736100 (C) | 1.360 (1.235–1.498) | 3.79E-10 |
rs2395185 (A) | 1.205 (1.093–1.329) | 1.85E-04 |
rs2495239 (A) | 1.158 (1.053–1.273) | 2.54E-03 |
rs9387478 (A) | 0.867 (0.789–0.953) | 3.11E-03 |
rs72658409 (T) | 0.755 (0.620–0.919) | 5.15E-03 |
rs7086803 (A) | 1.265 (1.144–1.398) | 4.64E-06 |
rs11610143 (G) | 0.857 (0.768–0.956) | 5.61E-03 |
rs7216064 (G) | 0.874 (0.790–0.967) | 9.10E-03 |
Variable | Coefficient | P |
Model constant | −19.45842 | 5.98E-14 |
Age^c | 0.69922 | 7.80E-08 |
(Age)^2 | −0.01018 | 1.74E-05 |
(Age)^3 | 0.00005 | 4.67E-04 |
^a Except for age-related variables (i.e., Age, Age^2, and Age^3), the P values were obtained from the multivariate logistic regression using conditional likelihood. The P values for age-related variables were obtained from a linear regression analysis in the second stage (see Materials and Methods).
^b The genetic variables take values 0, 1, and 2 according to the number of the minor alleles the individuals have at the SNP. Here, the minor alleles, in the parentheses, are those reported in the literature.
^c The age effect can be visualized in Supplementary Fig. S1, Supplementary Materials.
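To make the structure of Table 1 concrete, the following Python sketch combines the tabulated ORs and age coefficients into a 6-year risk on the logit scale. It is not the authors' calculator (that is in Supplementary Text S2): the function name `tnsf_sq_risk` is hypothetical, the numeric coding of the education score is not given in this section and is left to the caller as an assumption, and the coefficients are the rounded values printed in the table, so outputs are only approximate.

```python
import math

# Rounded values copied from Table 1 (TNSF-SQ); results are approximate.
CONST, B_AGE, B_AGE2, B_AGE3 = -19.45842, 0.69922, -0.01018, 0.00005
OR_EDUCATION = 0.543      # per unit of the education score (coding is an assumption)
OR_FAMILY_HISTORY = 2.067
OR_COPD = 2.204
# (exclusive upper BMI bound, OR); reference category is 18.5 <= BMI < 24
OR_BMI = [(18.5, 1.581), (24.0, 1.0), (27.0, 0.886), (30.0, 0.656), (math.inf, 0.742)]
OR_SNP = {  # OR per copy of the minor allele listed in Table 1
    "rs10937405": 0.738, "rs2736100": 1.360, "rs2395185": 1.205,
    "rs2495239": 1.158, "rs9387478": 0.867, "rs72658409": 0.755,
    "rs7086803": 1.265, "rs11610143": 0.857, "rs7216064": 0.874,
}

def bmi_odds_ratio(bmi):
    """Return the OR of the BMI category containing `bmi`."""
    for upper, odds_ratio in OR_BMI:
        if bmi < upper:
            return odds_ratio

def tnsf_sq_risk(age, education, bmi, family_history, copd, minor_allele_counts):
    """Approximate 6-year risk: inverse logit of the linear predictor.
    `minor_allele_counts` maps rsID -> 0, 1, or 2; missing SNPs count as 0."""
    lp = CONST + B_AGE * age + B_AGE2 * age**2 + B_AGE3 * age**3
    lp += education * math.log(OR_EDUCATION)
    lp += math.log(bmi_odds_ratio(bmi))
    lp += math.log(OR_FAMILY_HISTORY) if family_history else 0.0
    lp += math.log(OR_COPD) if copd else 0.0
    for rsid, count in minor_allele_counts.items():
        lp += count * math.log(OR_SNP[rsid])
    return 1.0 / (1.0 + math.exp(-lp))

# Hypothetical individual; the education score of 3 is an arbitrary illustration.
example_risk = tnsf_sq_risk(age=62, education=3, bmi=22.0, family_history=True,
                            copd=False, minor_allele_counts={"rs2736100": 1})
exceeds_threshold = example_risk >= 0.0151
```

Because each OR enters multiplicatively on the odds scale, switching one binary factor on multiplies the odds, not the risk, by its OR.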
Among women ages 55–70, the AUC based on TWB2 and LCPG was 0.714 [95% confidence interval (CI), 0.660–0.768]; among these women in TWB2, 3.94% (95% CI, 2.95–5.13) had risk higher than 0.0151. For women in LCPG ages 55–74, 27.03% (95% CI, 19.04–36.28) had risk higher than 0.0151 (Table 2A).
(A) TNSF-SQ
Risk threshold | TWB1 (ages 55–70): N^a | TWB1 (ages 55–70): 1 − specificity, % (95% CI^b) | GELAC cases (ages 55–70): N | GELAC cases (ages 55–70): sensitivity, % (95% CI) | TWB2 (ages 55–70): N | TWB2 (ages 55–70): 1 − specificity, % (95% CI) | LCPG (ages 55–70): N | LCPG (ages 55–70): sensitivity, % (95% CI) | LCPG (ages 55–74): N | LCPG (ages 55–74): sensitivity, % (95% CI) |
---|---|---|---|---|---|---|---|---|---|---|
≥0 | 2,350 | 100 | 718 | 100 | 1,321 | 100 | 96 | 100 | 111 | 100 |
≥0.0134 | 191 | 8.13 (7.05–9.31) | 301 | 41.92 (38.28–45.63) | 70 | 5.3 (4.15–6.65) | 24 | 25 (16.72–34.88) | 33 | 29.73 (21.43–39.15) |
≥0.0151 | 145 | 6.17 (5.23–7.22) | 259 | 36.07 (32.55–39.71) | 52 | 3.94 (2.95–5.13) | 22 | 22.92 (14.95–32.61) | 30 | 27.03 (19.04–36.28) |
≥0.02 | 58 | 2.47 (1.88–3.18) | 174 | 24.23 (21.14–27.54) | 24 | 1.82 (1.17–2.69) | 10 | 10.42 (5.11–18.32) | 17 | 15.32 (9.18–23.39) |
AUC (95% CI) | 0.770 (0.749–0.791), TWB1 and GELAC cases | | | | 0.714 (0.660–0.768), TWB2 and LCPG | | | | | |

(B) TNSF-SQNG
Risk threshold | TWB1 (ages 55–70): N^a | TWB1 (ages 55–70): 1 − specificity, % (95% CI^b) | GELAC cases (ages 55–70): N | GELAC cases (ages 55–70): sensitivity, % (95% CI) | TWB2 (ages 55–70): N | TWB2 (ages 55–70): 1 − specificity, % (95% CI) | LCPG (ages 55–70): N | LCPG (ages 55–70): sensitivity, % (95% CI) | LCPG (ages 55–74): N | LCPG (ages 55–74): sensitivity, % (95% CI) |
---|---|---|---|---|---|---|---|---|---|---|
≥0 | 2,351 | 100 | 822 | 100 | 1,323 | 100 | 101 | 100 | 117 | 100 |
≥0.0134 | 148 | 6.3 (5.35–7.35) | 289 | 35.16 (31.89–38.53) | 51 | 3.85 (2.88–5.04) | 19 | 18.81 (11.72–27.81) | 28 | 23.93 (16.53–32.7) |
≥0.0151 | 82 | 3.49 (2.78–4.31) | 232 | 28.22 (25.17–31.44) | 23 | 1.74 (1.11–2.60) | 11 | 10.89 (5.56–18.65) | 19 | 16.24 (10.07–24.19) |
≥0.02 | 50 | 2.13 (1.58–2.79) | 151 | 18.37 (15.78–21.19) | 15 | 1.13 (0.64–1.86) | 7 | 6.93 (2.83–13.76) | 14 | 11.97 (6.70–19.26) |
AUC (95% CI) | 0.754 (0.734–0.775), TWB1 and GELAC cases | | | | 0.694 (0.637–0.751), TWB2 and LCPG | | | | | |
^a The number in TWB1 is the number of individuals in box b, Fig. 1, ages 55 to 70 years, with all the variables available for the model. The number in GELAC cases is the number of patients with lung cancer in box a, Fig. 1, ages 55 to 70 years, with all the variables available for the model. The number in TWB2 and LCPG is the number of individuals with all the variables available for the model.
^b The CIs for the proportion of the high-risk group are computed using binomial exact CI.
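The exact binomial interval mentioned in the footnote can be reproduced with the standard library alone. The sketch below assumes that "binomial exact CI" means the Clopper-Pearson interval, and it finds the bounds by bisection on binomial tail probabilities; the function names are hypothetical and this is not the authors' R code.

```python
import math

def binom_logpmf(i, n, p):
    """log P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def binom_cdf(k, n, p):
    """P(X <= k), summed in log space term by term to avoid overflow."""
    return min(1.0, sum(math.exp(binom_logpmf(i, n, p)) for i in range(k + 1)))

def _root(g):
    """Bisection for an increasing function g on (0, 1) with a sign change."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""
    # Lower bound: P(X >= k | p) = alpha/2; this tail probability increases with p.
    lower = 0.0 if k == 0 else _root(lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2; the CDF decreases with p.
    upper = 1.0 if k == n else _root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# 52 of 1,321 TWB2 women at or above 0.0151 (Table 2A reports 3.94%, 2.95–5.13).
ci_low, ci_high = exact_ci(52, 1321)
```

If the assumption about the interval type holds, the bounds should land close to the tabulated 2.95% and 5.13%.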
To compare the discriminative power of the risk factors, we similarly developed five additional models ignoring each of the covariates: education, COPD, family history of lung cancer, BMI, and SNPs, starting with the SAMCCS in box e, Fig. 1. Their AUCs and sensitivities are given in Supplementary Table S4, suggesting that education had the strongest discriminative power and that although some of these models had higher AUCs, their sensitivities were lower than TNSF-SQ's.
Risk model with no genetic variants
Because of its wider applicability, the risk model without using the SNPs, termed TNSF-SQ with no genetic variants (TNSF-SQNG), deserves attention. Among women ages 55–70, the AUC was 0.694 (95% CI, 0.637–0.751); among these women in TWB2, 1.74% (95% CI, 1.11–2.60) had risks higher than 0.0151. For women in LCPG ages 55–74, 16.24% (95% CI, 10.07–24.19) had risks higher than 0.0151 (Table 2B).
Other risk models
To assess the usefulness of other risk factors, we used the same method and the same AMCCS in box d, Fig. 1 to develop the model using genetic variants and age only (TNSF-G), the model using all the risk factors common to both the Taiwan Biobank and GELAC questionnaire (TNSF), and the model using all the risk factors common to both but without SNPs (TNSF-NG). Their performance is presented in Supplementary Table S5.
Comparison of training and validation datasets
Table 2 and Supplementary Table S5D show that the percentages of people with high risks in GELAC cases were higher than those in LCPG under all the models except under TNSF-G. To better understand this phenomenon, we present in Supplementary Table S6 the distributions of risk factors in GELAC cases, LCPG, TWB1, and TWB2. It shows that differences seemed to exist between GELAC cases and LCPG for certain factors but not for 10 of the 11 SNPs; similar remarks hold for TWB1 and TWB2, where none of the 11 SNPs showed difference. The 6-year risk distributions in these four cohorts under TNSF-SQ, TNSF-SQNG, and TNSF-G are given in Supplementary Fig. S2, Supplementary Materials, suggesting that the similarity between the 6-year risk distribution in TWB1 and that in TWB2 is higher than the similarity between that in GELAC cases and that in LCPG; the similarity between that in GELAC cases and that in LCPG is higher under TNSF-G than under the other two models. The number of participants in each of these cohorts under these studies is shown in Supplementary Table S7.
To provide more information on the risk factors, we present in Supplementary Table S8 the characteristics of the AMCCS restricted to GELAC.
Discussion
Performance of TNSF-SQ
TNSF-SQ seems to be the first model in the literature based on standard risk factors to address the need to identify high-risk Asian NSFs who may benefit from LDCT lung cancer screening (26). Our performance evaluation is based on TWB2 and LCPG, which were formed chronologically later than TWB1 and GELAC, and hence is realistic. In addition to an AUC of 0.714, it seems that the percentages of NSFs in TWB2 and LCPG having risks higher than 0.0151 or 0.02 deserve attention.
Given that there are no established risk factors dominating the lung cancer development among never-smokers, the model TNSF-SQ seems to represent a major step in identifying high-risk Asian NSFs for lung cancer screening.
Table 2A suggests that about 3.94% of all the healthy NSFs ages 55–70 have TNSF-SQ risk higher than 0.0151. If we screened all these women at this moment, then among all the healthy NSFs ages 55–70 who will develop lung cancer in the subsequent 6-year period, about 23% would be among the screened. This percentage would be 27% if we screened all those ages 55–74 with risk higher than 0.0151. To put this into perspective, we consider the observations regarding the USPSTF and NLST criteria. First, only about 36.6% of U.S. female patients with lung cancer diagnosed during 2005–2011 met the USPSTF criteria for screening, and the proportion had been decreasing (2, 10). Second, about 38% of all the smokers ages 55–74 in the United States were eligible for screening according to NLST criteria (4, 7, 9). Comparison with the USPSTF or NLST criteria for ever-smokers helps one appreciate the usefulness of TNSF-SQ in selecting NSFs for LDCT screening.
For Asian NSFs, both PLCOM2014 and TNSF-SQ used the same risk factors, except that the former also included a history of cancer and the latter included SNPs. However, the effects of these common risk factors differ. Although the effect of BMI in TNSF-SQ is in line with that in PLCOM2014 and with the literature, it is worth noting that our estimate concerns Asian NSFs (47). That education had a larger effect in TNSF-SQ seems reasonable; our preliminary studies suggest that higher education correlates with shorter CTY, lower environmental tobacco smoke (ETS) exposure at home, and less incense burning in worship. Further studies are needed.
Other risk factors
TNSF-SQ1 performed slightly better than TNSF-SQ (Supplementary Table S5C). Because PM2.5 in TNSF-SQ1 changes only with district, whereas a district may have a population in the hundreds of thousands and real exposure varies greatly within each district, implementing TNSF-SQ1 may raise concerns; hence, we consider TNSF-SQ our main model for practical reasons. In fact, PM2.5 exposure has become a topic of debate in Taiwan's newspapers. Our discussions about PM2.5*I(Age≥55) are preliminary; further studies are needed.
It is worth noting that TNSF-SQNG could be implemented on a wider scale and in lower income regions where SNP genotyping is not available. It can also be used as a preliminary tool; individuals with high TNSF-SQNG risks could be advised to obtain genotype information and calculate their TNSF-SQ risk.
A recent risk prediction model for developing lung cancer among never-smokers in Taiwan had a high AUC of 0.806 (48). Their model included maximum mid-expiratory flow, carcinoembryonic antigen, and alpha fetoprotein. With medical prescriptions, information on these covariates can be obtained through Taiwan's National Health Insurance Program. In general, information on these covariates and that on SNPs can be obtained at an individual's expense through health care providers and hence are not as readily available as that on the risk factors in TNSF-SQNG.
The findings that TNSF-SQ performed better than TNSF-SQNG and that TNSF-G was robust suggest the usefulness of GWAS-identified SNPs. Because the utility of polygenic risk prediction models depends on the training dataset cohort size, underlying genetic architecture, and other risk factors (49), we believe that larger GWAS for lung cancer in Asian NSFs might lead to more predictive models.
Uncertainty
Underestimation of uncertainty in a two-stage estimation is a general concern. One may study the information matrix analytically by considering a probability structure in the second stage of the estimation (50). In this study, we described a bootstrap approach in Supplementary Text S2B; in particular, the CIs of the age effects presented in Supplementary Fig. S1 were obtained by the bootstrap. Further studies to compare the analytic approach and various bootstrap strategies are warranted; see Supplementary Text S2B for details. Because the imputed genotype data used in our model development seemed to be of good quality, we did not consider its uncertainty in this study; see Supplementary Text S1A for more discussion.
It is desirable to assess the calibration of TNSF-SQ in a prospective cohort like Taiwan Biobank (51, 52). However, Taiwan Biobank follow-up data are limited at this moment; see Supplementary Text S1. Here are some remarks on the representativeness of our case–control studies.
In view of the post-1960 industrialization in Taiwan, the distributions of certain risk factors changed with time, as shown in Supplementary Table S6. These changes are in line with the recruitment times of GELAC (2002–2015), LCPG (2015–2017), TWB1 (2008–2015), and TWB2 (2008–2016). They suggest that good calibration could be expected if the recruitment periods of the training and validation sets are comparable, and that there is a need for continuous development and evaluation of our models to take into account changes in risk factor distributions (53, 54).
Limitation
A limitation of this study is that the models were constructed from a two-stage design rather than from a prospective cohort study and that our choices of the thresholds 0.0134, 0.0151, and 0.02 were based on prospective studies of smokers in North America. A prospective study is needed to build and evaluate the model, so as to decide the thresholds and conduct a calibration study.
Another limitation regards information on ETS. The prevalence of ETS at home for GELAC cases and controls were 68% and 56%, respectively, and statistically different (Supplementary Table S8). Compared with GELAC, Taiwan Biobank provides very limited and somewhat different information on ETS, making it hard to use in model development. GELAC provides more solid information like the number of cigarettes consumed daily by a participant's spouse or father and the ETS exposure periods. To make ETS useful in risk prediction models, we suggest collect detailed information and standardize the ETS exposure measurement in future studies.
Conclusions
Our study may be useful for policy makers in screening program design when government budgets are limited. It might also help Taiwanese NSFs or their doctors to gauge their risk of lung cancer and decide whether they might benefit from LDCT lung cancer screening. Like GELAC in Taiwan, several cohorts in mainland China, South Korea, Japan, Hong Kong, and Singapore have jointly taken part in the GWAS of lung cancer in Asian NSFs (35, 37); these cohorts, together with their age-specific local incidence rates, could be utilized to build risk prediction models for the regions in which the cohorts were established. Given that these regions are similar in environment, lifestyle, and genetic architecture and that more cases and controls are available, we expect similar but more useful predictive risk models for lung cancer among Asian NSFs to appear in the near future.
Disclosure of Potential Conflicts of Interest
K.-Y. Chen reports receiving speakers bureau honoraria from AstraZeneca, Roche, Boehringer Ingelheim, Pfizer, Novartis, Merck Sharp & Dohme, Ono Pharmaceutical, and Bristol-Myers Squibb and other remuneration (travel/accommodation/meeting expenses) from Merck Sharp & Dohme, Boehringer Ingelheim, and Pfizer. S.-K. Liang reports receiving speakers bureau honoraria from Roche, AstraZeneca, Pfizer, Merck Sharp & Dohme, Novartis, and Boehringer Ingelheim. No potential conflicts of interest were disclosed by the other authors.
Disclaimer
The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the article; or decision to submit the article for publication.
Authors' Contributions
Conception and design: L.-H. Chien, C.-L. Wang, P.-C. Yang, C.-J. Chen, I-S. Chang, C.A. Hsiung
Development of methodology: L.-H. Chien, C.-H. Chen, T.-Y. Chen, C.-J. Chen, I-S. Chang, C.A. Hsiung
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): G.-C. Chang, Y.-H. Tsai, K.-Y. Chen, W.-C. Su, M.-S. Huang, Y.-M. Chen, C.-Y. Chen, S.-K. Liang, C.-Y. Chen, C.-L. Wang, J.-W. Hu, S.J. Chanock, N. Rothman, Q. Lan, P.-C. Yang, C.-J. Chen, C.A. Hsiung
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): L.-H. Chien, C.-H. Chen, T.-Y. Chen, W.-C. Wang, Y.-M. Chen, S.-K. Liang, R.-H. Chung, F.-Y. Tsai, N. Chatterjee, S.J. Chanock, N. Rothman, Q. Lan, P.-C. Yang, C.-J. Chen, I-S. Chang, C.A. Hsiung
Writing, review, and/or revision of the manuscript: L.-H. Chien, T.-Y. Chen, W.-C. Wang, Y.-M. Chen, M.-H. Lee, H.A. Katki, N. Chatterjee, S.J. Chanock, N. Rothman, Q. Lan, C.-J. Chen, I-S. Chang, C.A. Hsiung
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): C.-F. Hsiao, Y.-M. Chen, P.-C. Yang, C.A. Hsiung
Study supervision: I-S. Chang, C.A. Hsiung
Acknowledgments
The authors thank Dr. Christine D. Berg for her valuable comments on an earlier version of this article. The authors also thank Ms. Hsiao-Han Hung, Wan-Shan Hsieh, and Hsin-Fang Jiang for technical assistance. This study was supported by grants from the Ministry of Health and Welfare (DOH100-TD-PB-111-TM013 to C.A. Hsiung, DOH101-TD-PB-111-TM015 to C.A. Hsiung, DOH102-TD-PB-111-TM024 to C.A. Hsiung, MOHW103-TDU-PB-211-144003 to C.A. Hsiung, and MOHW105-TDU-B-212-134013 to I-S. Chang) and the Ministry of Science and Technology (MOST 103-2325-B-400-023 to C.A. Hsiung, MOST 104-2325-B-400-012 to C.A. Hsiung and I-S. Chang, MOST 105-2325-B-400-010 to C.A. Hsiung and I-S. Chang, MOST 106-2319-B400-001 to C.A. Hsiung and I-S. Chang, and MOST 107-2319-B-400-001 to C.A. Hsiung and I-S. Chang).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.