Abstract
Somatic EGFR mutations define a subset of non–small cell lung cancers (NSCLC) that have clinical impact on NSCLC risk and outcome. However, EGFR-mutation-status is often missing in epidemiologic datasets. We developed and tested pragmatic approaches to account for EGFR-mutation-status based on variables commonly included in epidemiologic datasets and evaluated the clinical utility of these approaches.
Through analysis of the International Lung Cancer Consortium (ILCCO) epidemiologic datasets, we developed a regression model for EGFR-status. We then applied two approaches to ILCCO survival analyses: a clinical-restriction approach using the optimal cut-point, and an epidemiologic, multiple imputation approach. Results were compared between analyses that did and did not account for EGFR-status.
Of 35,356 ILCCO patients with NSCLC, EGFR-mutation-status was available in 4,231 patients. A model regressing known EGFR-mutation-status on clinical and demographic variables achieved a concordance index of 0.75 (95% CI, 0.74–0.77) in the training and 0.77 (95% CI, 0.74–0.79) in the testing dataset. At an optimal cut-point of probability-score = 0.335, sensitivity = 69% and specificity = 72.5% for determining EGFR-wildtype status. In both restriction-based and imputation-based regression analyses of the individual roles of BMI on overall survival of patients with NSCLC, similar results were observed between overall and EGFR-mutation-negative cohort analyses of patients of all ancestries. However, our approach identified some differences: EGFR-mutated Asian patients did not incur a survival benefit from being obese, as observed in EGFR-wildtype Asian patients.
We introduce a pragmatic method to evaluate the potential impact of EGFR-status on epidemiological analyses of NSCLC.
The proposed method is generalizable in the common occurrence in which EGFR-status data are missing.
Introduction
Somatic EGFR mutations define a unique subset of non–small cell lung cancers (NSCLC) and have clinical impact on NSCLC outcomes; further, genetic and environmental risk factors may be different in patients with EGFR-mutated and EGFR-wildtype NSCLCs. Clinico-pathologic factors such as being a lifetime neversmoker, female, of Asian ancestry, and having a histology of adenocarcinoma have each been independently associated with a greater likelihood of having EGFR-mutated NSCLC (1, 2). In contrast, heavy smoking, male sex, and squamous carcinoma histology are associated with NSCLC without EGFR mutations (i.e., EGFR wildtype; refs. 3, 4). Up to 90% of EGFR mutations are sensitizing mutations, therefore being strongly predictive of response to tyrosine kinase inhibitors (TKI) targeting the mutated EGFR protein; EGFR TKIs are used commonly in advanced or metastatic incurable patients with EGFR-mutated NSCLC to improve overall survival (5, 6), and recently to improve disease-free survival in early stage, resected patients (7).
Molecular detection of EGFR mutations itself only became widely available in routine clinical practice after publication of the seminal IPASS study in 2009 (8), which established EGFR-TKIs as the preferred treatment for patients with incurable stage IIIB/IV EGFR-mutated NSCLC; further, the availability of EGFR testing depended on the speed of clinical uptake, which varied across the world (9). Therefore, many epidemiological research databases have not historically collected EGFR mutation data or detailed treatment data. Consequently, interpretation of both risk and survival outcomes could be impacted by this lack of available information, especially for lung adenocarcinoma.
Among NSCLC subgroups, individuals carrying EGFR-mutated tumors represent the largest subgroup whose biology is markedly different from that of typical smoking-related NSCLC; proportions of EGFR-mutated tumors can range from 10% to upwards of 50% (10–13). Thus, epidemiologic studies aiming to better understand the genetic and environmental etiologic factors will likely need to study EGFR-mutated and EGFR-wildtype NSCLCs separately.
To account for missing data, there have been prior efforts to predict EGFR-status based on clinical and demographic variables. Chang and colleagues developed a predictive model for being EGFR-mutated exclusively in an Asian population based on seven variables, namely sex, adenocarcinoma histology, smoking history, N-stage, M-stage, presence of brain metastases, and elevated CYFRA 21–1 serologic levels (14). With a sensitivity of 95% and specificity of 32.3%, their model achieved a positive predictive value (PPV) of 85.1% and a negative predictive value (NPV) of 65.6%. Another nomogram, proposed by Girard and colleagues for adenocarcinomas based on a non-Asian population, incorporated age, sex, smoking pack-years, time interval between smoking cessation and NSCLC diagnosis, disease stage (I–IIA vs. IIIB–IV) and predominant histologic subtype (solid, papillary, or bronchioalveolar); this study achieved a concordance index of 0.84 (15). However, despite acceptable accuracy, these two published predictive models cannot be easily applied in most epidemiologic studies because they incorporate some variables that are not readily available in existing epidemiologic or clinical studies, such as predominant histologic subtypes and CYFRA 21–1 levels.
The overarching aim of this study was to develop and evaluate a pragmatic approach to account for EGFR-status in the analysis of epidemiologic studies, using variables generally included in existing datasets. We developed a regression model for EGFR-status by analyzing International Lung Cancer Consortium (ILCCO) epidemiologic datasets. With this regression model, we applied two approaches, a clinical approach and an epidemiological approach. In the clinical approach, we identified a regression value cut-point with which we dichotomized patients into those who were most likely or least likely to have an EGFR-mutated NSCLC; we have termed this the restriction method because it “restricts” the entire population to a smaller dataset of patients most likely to have, or not to have, an EGFR mutation. The alternative epidemiologic approach used multiple imputation to differentiate patients likely to have EGFR-mutated NSCLCs from those less likely to have them. We chose these two approaches because they are widely familiar to clinicians and epidemiologists, respectively, and to demonstrate that they could yield consistent results. We then applied both approaches to previous survival analyses to assess how much the results would have changed had we used them to separate our datasets into patients most and least likely to carry EGFR mutations.
Materials and Methods
Study design
We first developed a pragmatic multivariable regression model with the outcome of EGFR-status, in an ILCCO subcohort dataset that included only patients with known EGFR mutation-status (EGFR-wildtype vs. EGFR-mutated). We then applied this regression model to predict EGFR-status in patients with NSCLC in the larger ILCCO dataset, using two different approaches: a clinical restriction approach where the probability of having either EGFR-wildtype or EGFR-mutated NSCLC was estimated through an optimal cut-off determined by the multivariable regression model, and an epidemiologic multiple imputation approach utilizing the same regression model for estimating EGFR-status.
Study population
ILCCO harmonizes compatible data from various epidemiologic studies worldwide to facilitate collaborative lung cancer epidemiology research in large combined datasets (details are available on http://ilcco.iarc.fr). Twenty-seven ILCCO studies participated in prior survival analyses, and among the participating studies the majority of patients with lung cancer were male, eversmokers, and of European ancestry, suggesting that the majority of cases would not carry a somatic EGFR mutation. Thus, our primary goal was to identify a subset of patients who are not likely to carry the mutation (i.e., EGFR-wildtype), so that we can perform sensitivity analyses to compare any main results in the entire ILCCO cohort (regardless of EGFR-status) to results generated in a predicted EGFR-wildtype subcohort to better understand possible influence of EGFR-status on survival outcomes. To explore possible utility in an Asian population with higher prevalence of EGFR-mutation, we performed additional analyses in our Asian subgroup accounting for EGFR-status. Ethics approval was obtained by each participating study from local review boards.
Analysis
Summary statistics were provided with continuous and categorical variables presented as median with range and as frequency with percentage (%), respectively. Comparisons of baseline clinico-pathologic profiles among different groups were performed using Kruskal–Wallis and Chi-square tests, as appropriate.
Multivariable regression model development
We first developed a multivariable regression model that incorporated basic clinico-epidemiological variables that are typically captured in most observational studies. We developed this regression model using only patients with known EGFR-status (EGFR-wildtype or EGFR-mutated). To develop the best regression models of clinico-demographic-pathologic variables and EGFR-status, we randomly divided data from patients with known EGFR status into a training set (comprised of two-thirds of the patients), which was used for prediction model development, and a testing set (including the remaining one-third) for model validation. In addition, the selected model was also validated using bootstrap resampling methods. The candidate variables in the regression model for EGFR status included age, gender, ethnicity, stage, smoking history, and histology. We used the backward selection algorithm with the Akaike information criterion to select the variables in the regression model. Odds ratios (OR) and 95% confidence intervals (CI) of each variable in the model were calculated.
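The split and selection steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used in this study (analyses were performed in R). The function names, the `aic_of` callback (a stand-in for fitting a logistic model and returning its Akaike information criterion), the generic variable names, and the toy numbers are all hypothetical.

```python
import random


def split_train_test(patient_ids, train_fraction=2 / 3, seed=0):
    """Randomly split patient IDs into a training set and a testing set."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


def backward_select(candidates, aic_of):
    """Backward elimination guided by AIC.

    At each step, drop the single variable whose removal lowers the AIC
    the most; stop when no removal lowers the AIC any further.
    `aic_of` maps a tuple of variable names to the AIC of the model
    fitted with those variables (hypothetical hook supplied by the caller).
    """
    current = tuple(candidates)
    best_aic = aic_of(current)
    while len(current) > 1:
        trials = [
            (aic_of(tuple(v for v in current if v != drop)), drop)
            for drop in current
        ]
        trial_aic, drop = min(trials)
        if trial_aic >= best_aic:
            break  # no single removal improves the model
        best_aic = trial_aic
        current = tuple(v for v in current if v != drop)
    return list(current), best_aic


# Toy demonstration with invented numbers (not study data).
train_ids, test_ids = split_train_test(range(4231))

toy_gain = {"var_a": 0.0, "var_b": 50.0, "var_c": 1.0, "var_d": 120.0}


def toy_aic(variables):
    # AIC = 2k - 2*logLik; the fit term here is a made-up constant
    # minus an invented per-variable improvement.
    return 2 * len(variables) + 1000.0 - sum(toy_gain[v] for v in variables)


selected, selected_aic = backward_select(toy_gain, toy_aic)
```

In this toy setting, variables whose contribution to fit is smaller than the AIC penalty of 2 per parameter are removed, and the rest are retained.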
Clinical or restriction approach to identify an EGFR-wildtype subcohort (as well as an EGFR-positive subcohort in Asian population-specific subanalyses)
As this regression model served to predict EGFR-status, the discriminatory ability of the model was quantified using the AUC of the ROC. The probability score (PS) was defined on the basis of the weighted summary of the variables in the model weighted by the corresponding regression coefficients. The optimal cut-point value of the PS for distinguishing high probability EGFR-wildtype lung cancers from others was determined using the ROC curve. The ROC of a perfect test passes through the left-upper corner of the ROC plot, the point where both sensitivity and specificity are equal to 1; the optimal cut-off point is the point on the ROC curve that has the smallest distance to this left-upper corner (16–18).
Patients whose PS for a specific EGFR-status was greater than the optimal cut-off point were assigned that EGFR-status.
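The cut-point rule described above (the ROC point closest to the upper-left corner) can be illustrated with a short Python sketch. It assumes a logistic form for the probability score and a convention in which a patient is called EGFR-mutated when the score is at or above the cut-point; the function names and the toy scores are hypothetical and are not study data.

```python
import math


def probability_score(intercept, coefficients, covariates):
    """Logistic-model probability from a coefficient-weighted covariate sum."""
    linear = intercept + sum(
        coefficients[name] * covariates[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))


def closest_to_corner_cutpoint(scores, is_mutated):
    """Return (cutpoint, sensitivity, specificity) for the threshold whose
    ROC point lies closest to the perfect-test corner (sens = spec = 1).

    Assumed convention: score >= cutpoint means predicted EGFR-mutated.
    """
    n_pos = sum(is_mutated)
    n_neg = len(is_mutated) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, is_mutated) if y and s >= cut)
        tn = sum(1 for s, y in zip(scores, is_mutated) if not y and s < cut)
        sens, spec = tp / n_pos, tn / n_neg
        dist = math.hypot(1 - sens, 1 - spec)  # distance to the corner
        if best is None or dist < best[0]:
            best = (dist, cut, sens, spec)
    return best[1:]


# Toy scores and known statuses (invented for illustration only).
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.9]
toy_status = [0, 0, 0, 0, 1, 0, 1, 1, 1]
cut, sens, spec = closest_to_corner_cutpoint(toy_scores, toy_status)

# Restriction step: keep only patients predicted to be EGFR-wildtype.
predicted_wildtype = [s for s in toy_scores if s < cut]
```

The distance criterion weighs sensitivity and specificity equally; other criteria, such as the Youden index, can select a different threshold on the same curve.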
Epidemiologic or multiple imputation approach to identify an EGFR-wildtype subcohort (as well as an EGFR-positive subcohort in Asian population-specific subanalyses)
As a second approach, we used a multiple imputation algorithm to generate HRs (by applying the multivariable regression model). For each patient with unknown EGFR status, we compared the probability of being EGFR-positive predicted by the model with a random number drawn from a uniform distribution; if the predicted probability was greater, the patient was assigned as EGFR-positive, otherwise as EGFR-negative. The association between predicted EGFR status and overall survival was examined by using Cox regression. The above procedure was repeated 100 times and we summarized the data as mean HRs and 95% CIs (19, 20).
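The stochastic assignment described above can be sketched in Python as follows. This is an illustration under stated assumptions: the random draw is taken from Uniform(0, 1), and the `estimate` callback is a hypothetical stand-in for the Cox regression step, which is not implemented here. How the 95% CIs were derived is not reproduced; only the mean across repetitions is shown.

```python
import random
import statistics


def impute_once(predicted_probs, rng):
    """One imputation: a patient is labelled EGFR-positive when the
    model-predicted probability exceeds a Uniform(0, 1) draw."""
    return [p > rng.random() for p in predicted_probs]


def repeat_imputation(predicted_probs, estimate, n_repeats=100, seed=0):
    """Repeat the imputation and average a caller-supplied estimate.

    `estimate` takes the list of imputed statuses and returns a number;
    in a real analysis it would fit a survival model and return an HR.
    """
    rng = random.Random(seed)
    estimates = [
        estimate(impute_once(predicted_probs, rng)) for _ in range(n_repeats)
    ]
    return statistics.mean(estimates), estimates


# Toy check: with every predicted probability equal to 0.3, the imputed
# EGFR-positive fraction should average close to 0.3.
mean_fraction, fractions = repeat_imputation(
    [0.3] * 2000, estimate=lambda status: sum(status) / len(status)
)
```

Because each patient is assigned at random in proportion to the predicted probability, the imputed datasets preserve the expected prevalence implied by the model rather than forcing a hard threshold.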
Application of both restriction and imputation approaches to prior ILCCO outcome analyses
Data on the relationship between BMI and survival outcomes from the ILCCO dataset were utilized for these assessments. For each sensitivity analysis, the clinical-restriction and epidemiologic-imputation approaches to identifying an EGFR-wildtype subcohort were individually compared with the analysis of the entire ILCCO cohort as previously published based on patient data from 16 centers; as some centers had since provided additional patient data and additional centers (now up to 27) had provided data, we analyzed this updated version of the dataset, as we found no logical reason to exclude these additional patients. As most EGFR-mutated tumors are adenocarcinomas, we conducted an additional sensitivity analysis exclusively in the adenocarcinoma subset of our cohort. In the Asian subgroup, restriction and imputation were also applied to generate a predicted EGFR-positive subgroup to be compared with the analyses of the entire Asian population of the ILCCO cohort. Kaplan–Meier curves and Cox proportional hazards regression models were used as illustrative examples to demonstrate the potential impact of taking EGFR-status into account (restriction approach, imputation approach) when compared with previous analyses that did not consider EGFR-status, for the following two associations: BMI and overall survival (OS; ref. 21) and interaction of BMI with smoking, gender, and ethnicity on OS as measured through subset analyses (22).
In the restriction approach, we estimated hazard ratios on a restricted dataset that analyzed only predicted EGFR-wildtype patients based on the optimal PS cut-point as determined from the generated ROC curves. In the multiple imputation approach, after 100 HRs were generated, we summarized the data as mean HRs and 95% CIs. For the Asian subgroup, analyses were also performed using both approaches to identify both EGFR-wildtype and EGFR-mutated patient subgroups.
All statistical analyses were performed using R 4.0.1 (http://CRAN.R-project.org, The R Foundation for Statistical Computing). All P values were based on two-sided tests and considered statistically significant at P < 0.05.
Results
Baseline characteristics
Overall, there were 35,356 patients with lung cancer in the ILCCO database, of which EGFR-status was available in a subset of 4,231 patients across five studies, whereas 31,125 patients across 27 studies had unknown EGFR-status (Fig. 1). The majority of studies included in this analysis had completed the major part of their recruitment before 2009; however, EGFR testing became more available as standard of care only after 2009 (Supplementary Table S1). The characteristics of those with known and unknown EGFR-status are presented in Supplementary Table S2. Of the patients with known EGFR status, 1,481 were EGFR-mutated whereas 2,750 were EGFR-wildtype (EGFR-mutation prevalence of 35%). Studies from Asia had higher prevalence of EGFR-mutated patients (NCCRI-Japan 48%; Shanghai 56%) whereas American studies had lower prevalence (LCS 21%; Barretos-Brazil 19%); the multicultural Toronto MSH-PMH study had an intermediate prevalence of 42% (Supplementary Table S3). As expected, baseline characteristics differed significantly between EGFR-mutated and EGFR-wildtype patients with respect to age, sex, ethnicity, and smoking status (Table 1; Supplementary Table S4).
Patients with EGFR mutation-tested tumors: N (%)

| Covariate | Category | Full sample | EGFR-mutated | EGFR-wildtype | P value |
|---|---|---|---|---|---|
| Total count (100%) | | 4,231 | 1,481 | 2,750 | |
| Age | Median [Min–Max] | 63 [18–95] | 62 [22–95] | 63 [18–93] | 0.008 |
| Sex | Male | 1,976 (47) | 510 (34) | 1,466 (53) | <0.001 |
| | Female | 2,255 (53) | 971 (66) | 1,284 (47) | |
| Ethnicity | White | 1,513 (43) | 371 (28) | 1,142 (53) | <0.001 |
| | Asian | 1,727 (49) | 892 (67) | 835 (39) | |
| | Black/other | 252 (7) | 65 (5) | 187 (9) | |
| | Unknown | 739 | 153 | 586 | |
| BMI (kg/m²) | <18.5 | 1,201 (53) | 432 (57) | 769 (50) | 0.0083 |
| | 18.5–<25 | 159 (7) | 46 (6) | 113 (7) | |
| | ≥25 | 925 (40) | 278 (37) | 647 (42) | |
| | Unknown | 1,946 | 725 | 1,221 | |
| Smoking status | Never | 1,686 (40) | 964 (66) | 722 (27) | <0.001 |
| | Former | 1,348 (32) | 349 (24) | 999 (37) | |
| | Current | 1,133 (27) | 155 (11) | 978 (36) | |
| | Unknown | 64 | 13 | 51 | |
| Pack-years (a) | ≤20 | 410 (26) | 179 (48) | 231 (19) | <0.001 |
| | >20 | 1,151 (74) | 195 (52) | 956 (81) | |
| | Unknown | 920 | 130 | 790 | |
| NSCLC histology | Adeno | 3,974 (94) | 1,455 (98) | 2,519 (92) | <0.001 |
| | Squamous | 149 (4) | 11 (1) | 138 (5) | |
| | Large cell | 33 (1) | 3 (0) | 30 (1) | |
| | Not specified | 75 (2) | 12 (1) | 63 (2) | |
| Stage | I | 1,372 (32) | 565 (38) | 807 (29) | <0.001 |
| | II | 326 (8) | 106 (7) | 220 (8) | |
| | III | 784 (19) | 227 (15) | 557 (20) | |
| | IV | 1,749 (41) | 583 (39) | 1,166 (42) | |
(a) Only among eversmokers.
Multivariable regression model development
In univariable analysis, being female and being Asian were associated with a higher chance of being EGFR-mutated, whereas non-adenocarcinoma histology, BMI ≥25 kg/m2, and having any smoking history were inversely associated with being EGFR-mutated, as was heavy smoking (Supplementary Table S4). In this dataset, earlier stage was more likely to be associated with being EGFR-mutated, which was due to ascertainment bias, as the Asian studies were mostly from thoracic surgeon practices of early stage, resected lung cancers (Supplementary Table S4).
Multivariable regression models were primarily assessed for their ability to create accurate EGFR-wildtype cohorts, using different combinations of variables that had been shown to be significant in univariable analyses; we also evaluated several models that contained interaction terms (Ethnicity × smoking status; ethnicity × stage; ethnicity × sex; Supplementary Table S5) based on known associations between several key clinico-demographic factors and presence/absence of EGFR mutation. Concordance indices (C-indices) were very similar across models containing different variables: all between 0.740 and 0.778 (Supplementary Table S5). Therefore, we selected a pragmatic model that included only variables available for most ILCCO patients to maximize statistical power. Our final model included age, sex, ethnicity, histology, and smoking status (see parameters and estimates of final model in Supplementary Table S6), which achieved a C-index of 0.75 (95% CI, 0.74–0.77) in the training dataset and 0.77 (95% CI, 0.74–0.79) in the testing dataset (Fig. 2). Model performance was also confirmed using bootstrap resampling methods (Supplementary Table S7; Supplementary Fig. S1).
Choosing a clinically relevant probability score cut-point from the multivariable regression model for being EGFR-wildtype
On the basis of the ROC-curve generated by our model (Fig. 2) and distribution of PS (Supplementary Fig. S2), we evaluated various possible cut-points to determine which patients should be classified as EGFR-mutated versus EGFR-wildtype. With a PS cut-point of 0.335 (optimal cut-point from a statistical standpoint determined from the ROC curves generated by the regression model), there was a sensitivity of 69% and specificity of 72.5%. Lower PS cut-points would have resulted in decreased specificity.
The EGFR status-known dataset of 4,231 patients had a 35% EGFR mutation prevalence, which corresponded to an EGFR-mutated PPV of 57% and NPV of 81%; the NPV was thus reasonably high for identifying patients with EGFR-wildtype NSCLC while retaining 2,453 patients who would be considered EGFR-wildtype in the analysis. With a more conservative probability-score cut-point of 0.25, NPV increased to 85%, but at the expense of a substantially smaller sample of patients who would be considered EGFR-wildtype (N = 1,879).
When assessing all ILCCO participants (n = 35,356; Supplementary Fig. S2D), the PS distribution was very different from the PS distribution observed in the EGFR-status-known cohort, which was also reflected in different distributions in characteristics associated with EGFR status (Table 1; Supplementary Table S2). This was because there was oversampling of the EGFR-mutated patients among all tested patients: until centers started to perform routine testing for EGFR-status in all patients, patients would often be selected for testing on the basis of being a neversmoker or being of Asian ethnicity. Thus, in our overall ILCCO dataset, we anticipated an EGFR mutation prevalence lower than 35%. As a sensitivity analysis, we artificially reduced the EGFR-mutation prevalence to 15% while keeping the same test sensitivity and specificity and recalculated the following: the NPV increased to 92% at a PS cut-point of 0.335 (n = 23,434), and to 94% (n = 18,484) at a PS cut-point of 0.25.
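The relationship between sensitivity, specificity, prevalence, and predictive values at the 0.335 cut-point can be checked with a short calculation. The sketch below is in Python and applies Bayes' rule to the rounded figures reported in the text (sensitivity 69%, specificity 72.5%, prevalence 35%); the function name is hypothetical, and small differences from published values can arise from rounding of the inputs.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV, fraction predicted negative) via Bayes' rule.

    'Positive' means EGFR-mutated; 'negative' means EGFR-wildtype.
    """
    tp = sensitivity * prevalence              # true positives (fraction)
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv, tn + fn


ppv, npv, frac_negative = predictive_values(0.69, 0.725, 0.35)
# ppv is about 0.57 and npv about 0.81; frac_negative * 4231 is about 2,453,
# the number of patients who would be considered EGFR-wildtype.
```

Passing a different `prevalence` to the same function shows how the predictive values shift when the proportion of EGFR-mutated patients changes while sensitivity and specificity are held fixed.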
OS of EGFR-wildtype patients, as determined by different approaches
As expected, the OS of EGFR-mutated patients was longer, compared with EGFR-wildtype patients (Supplementary Figs. S3A and S3B). We then compared Kaplan–Meier curves of known EGFR-wildtype patients (median OS: 2.67 years) with those defined on the basis of PS <0.335 (median OS: 2.49 years) and PS <0.25 (median OS: 1.91 years), and found that the optimal cut-point of <0.335 selected patients with median OS closer to the known EGFR-wildtype patients (Supplementary Fig. S3C). To avoid confounding by stage, we also performed the same comparisons, but restricted to stage IV patients only (Supplementary Fig. S3D). We then compared Kaplan–Meier curves and median OS of true EGFR-wildtype patients with the predicted EGFR-wildtype patients in all ILCCO patients (Supplementary Figs. S3E and S3F) and demonstrated high concordance. The patterns and relationships of OS were similar across all the different approaches and sensitivity analyses.
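For readers less familiar with how a median OS is read from a Kaplan–Meier curve, the following Python sketch shows the product-limit calculation. It is a simplified illustration with a hypothetical function name and toy inputs, not the survival code used in this study.

```python
def kaplan_meier_median(times, events):
    """Median survival from the Kaplan-Meier (product-limit) estimator.

    `events[i]` is True for a death and False for a censored observation.
    Returns the earliest observed time at which estimated survival falls
    to 0.5 or below, or None if it never does.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all observations that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            if survival <= 0.5:
                return t
        at_risk -= removed  # deaths and censored cases leave the risk set
    return None
```

For example, `kaplan_meier_median([1, 2, 3, 4], [True, False, True, True])` returns 3: the censored patient at time 2 leaves the risk set without lowering the survival estimate.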
Assessing the clinical utility of our clinical-restriction and epidemiological-imputation approaches
We re-analyzed previously published ILCCO-analyses on BMI-OS hypotheses described in the Materials and Methods section. Although test characteristics (sensitivity, specificity) of our model do not change with changes in EGFR prevalence, PPV and NPV, and therefore accuracy (true positives plus true negatives, divided by the total evaluated), will change with changes in EGFR prevalence. As our overall model only had sufficient accuracy to predict patients with EGFR-wildtype status (the dataset being composed largely of Caucasian patients who smoked) but lacked adequate PPV to identify EGFR-mutated patients in the overall population, we focused our re-analysis only on the EGFR-wildtype cohort using both clinical-restriction and epidemiologic-imputation approaches.
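The dependence of accuracy on prevalence can be written out explicitly. In the expression below, the symbols Se (sensitivity), Sp (specificity), and π (EGFR-mutation prevalence) are notation introduced here for illustration; the worked value uses the rounded figures reported for the 0.335 cut-point.

```latex
\mathrm{Accuracy}
  = \frac{TP + TN}{N}
  = Se \cdot \pi + Sp \cdot (1 - \pi)

% Worked value with Se = 0.69, Sp = 0.725, \pi = 0.35:
% 0.69 \times 0.35 + 0.725 \times 0.65 = 0.2415 + 0.47125 \approx 0.71
```

Because Se and Sp are fixed properties of the cut-point, any change in accuracy, PPV, or NPV across populations comes only from the prevalence term.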
When re-analyzing our previous studies on the influence of BMI on OS in patients with NSCLC by clinical-restriction or epidemiologic-imputation approaches, the direction of change remained the same for all BMI levels and interactions. In most cases, the magnitude of HRs was similar too; however, in a few subgroups, the overall effect size varied (Figs. 3 and 4; Supplementary Tables S8 and S9). Results remained comparable in a sensitivity analysis exclusively in patients with known adenocarcinoma histology (Supplementary Tables S10 and S11).
Asian subcohort analyses
In the overall ILCCO dataset, which is of predominantly European ancestry, the anticipated prevalence of EGFR-mutation is low. Thus, there is no cut-point that provides a PPV with sufficiently high accuracy to classify patients confidently as being EGFR-mutated on the basis of our multivariable regression model. However, we did explore both EGFR-mutated and EGFR-wildtype patients in the Asian subcohort because of the higher prevalence of EGFR-mutations in this population, which therefore leads to a higher PPV and accuracy.
When exploring these sensitivity analyses in an exclusively Asian subpopulation, we applied clinical-restriction and epidemiologic-imputation methods to generate predicted EGFR-wildtype and EGFR-mutated cohorts. The relationship between BMI and OS remained similar, when stratified by EGFR status, with one exception. In the subset of Asian patients with BMI >30, the BMI–OS relationship remained comparable with the original study (HR, 0.70) for predicted EGFR-negative patients by both restriction and imputation methods (0.65 and 0.72, respectively); however, the direction and magnitude of the BMI–OS relationship in predicted EGFR-positive patients was quite different (Supplementary Table S12).
Discussion
Leveraging the variables available in the ILCCO datasets, we built a multivariable regression model to identify EGFR-status among patients who had missing EGFR-status data, based exclusively on clinical parameters readily available in most lung cancer epidemiologic studies. We utilized two approaches to predict EGFR-status in individual patients based on the regression model: the first was a clinically focused restriction approach based on identifying an optimal cut-off point to distinguish between EGFR-mutated and EGFR-wildtype subgroups; the second was an alternative epidemiologic, multiple imputation approach. We find these two approaches complementary. Although the multiple imputation approach is preferred in the epidemiologic world, the restriction approach may be more acceptable to clinicians who are uncomfortable with the concept of assigning values to missing data, no matter how scientifically rigorous this process may be.
Given the underlying population of pooled ILCCO patients with NSCLC, we focused on evaluating the utility of defining an EGFR-wildtype subcohort through these two approaches. We then tested the potential clinical utility of our two approaches to compare EGFR-wildtype subcohorts with our original full-cohort analyses on two separate hypotheses on the influence of BMI on survival; here, we confirmed that our prior full-cohort analyses had similar direction and magnitude of associations when compared with the same analyses in our EGFR-wildtype subcohorts. This remained largely true in an exploratory analysis of an exclusively Asian subcohort where we included predicted EGFR-mutated and EGFR-wildtype patients; however, some differences, especially in patients with BMI >30, were observed. We cannot readily explain this difference seen in the Asian compared with the nonselected population; however, there may be residual confounding specifically relevant to the Asian population due to confounder variables we have not collected in our Caucasian-predominant dataset and therefore not adjusted for.
Missing variables are a common problem in epidemiology studies and are commonly classified into three categories depending on their relation to observed and unobserved data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). For MCAR variables, the probability of being missing is the same for all cases; MAR variables are missing in specific subgroups captured by the available data; and NMAR variables are missing because of certain other variables not captured by the available dataset. However, methods to deal with missing data, such as multiple imputation, are rarely utilized to account for these variables, which can introduce bias (23).
In our dataset, EGFR-status was widely missing for several reasons. First, EGFR-testing was not standard clinical practice when most of the studies were designed or had started recruiting. Second, when testing became available, oversampling bias occurred early during EGFR-test implementation, whereby patients selected for testing by clinicians tended to be those who had clinico-epidemiologic characteristics that enhanced the patient's probability of having an EGFR-mutated NSCLC, therefore enriching the population for EGFR-positive patients and in consequence leading to higher prevalence of EGFR-positive NSCLC in the tested group compared with what would be expected in the overall population. Further, availability of testing was very heterogeneous worldwide for some time. Therefore, missing EGFR-status data in our study population were likely a mixture of MCAR (these mutations were only identified in 2004, and broad clinical testing took a number of years and technological advances) and MAR (testing only in selected groups); these are the two types of missing data patterns that can be addressed by multivariable and multiple imputation techniques. When we re-analyzed our previous ILCCO analyses on influence of BMI on OS, by restricting to an EGFR-wildtype subcohort, the overall direction, magnitude, and significance did not change much; this result was expected, given that the majority of our ILCCO patients did not fit the clinico-demographic profile of EGFR-mutated patients with NSCLC. Results in our Asian subcohort including predicted EGFR-positive patients do suggest possible differences between the predicted EGFR-mutated and EGFR-wildtype patients, substantiating our hypothesis that in EGFR-mutated enriched populations, epidemiologic associations may truly vary by EGFR status. However, these exploratory findings will need to be validated in larger datasets of Asian patients.
Several factors should be taken into account. Many of the studies that comprised the ILCCO dataset involved patients diagnosed before 2009, when the seminal IPASS trial was published, and therefore during a time when testing was not standard of care in most places worldwide. Therefore, only a small proportion of these studies actually involved patients with stage IV disease after 2009, for whom a finding of EGFR-mutation would have resulted in treatment with an EGFR TKI, which consequently may lead to markedly improved survival (24). The vast majority of these ILCCO dataset patients would not have been affected.
We thus suggest that our approaches could be most useful when analyzing contemporary datasets, patients with stage IV metastatic NSCLC, or predominantly Asian patients with NSCLC or NSCLCs in other ethnicities with known higher EGFR-mutation prevalence, or in any dataset where a large fraction of patients are expected to be EGFR-mutated and/or treated with TKI. Note that when the proportion of patients with EGFR-mutations is high, even the early stage resected EGFR-mutated patients can influence results, as some of these patients invariably will relapse over time and be treated with EGFR TKIs; already, patients with early stage resected (stage IB–IIIA) EGFR-mutation positive NSCLC will have a new standard of care TKI therapy soon, based on a recent trial (7). In such instances, our approaches to deal with missing EGFR-status may become critical to interpret results properly. Further, etiologic studies of NSCLC also need to determine the potential impact of EGFR-status on results, given that most scientists and clinicians consider EGFR-wildtype and EGFR-mutated NSCLCs to be two separate carcinogenesis pathways (25). Having established approaches to dealing with missing EGFR-status and the use of these approaches in sensitivity analyses provides potential pragmatic solutions to these issues.
Our analysis has several limitations. First, treatment data were only available in a small fraction of study participants, too small to incorporate into our analyses. However, this underlines the importance of accounting for EGFR-status, as EGFR-mutated patients who present with, or later relapse into, late-stage disease will then likely receive TKI therapy, thereby potentially improving survival outcomes when compared with relapsed patients with NSCLC without driver mutations. Second, as our aim was to build a pragmatic model applicable to most epidemiologic studies, we could only include a small number of very basic clinical variables that have been collected in most of the studies; however, we are satisfied that the resultant concordance indices are quite reasonable. Third, in our model we did not consider other lung cancer risk factors such as environmental tobacco exposure (26) or especially radon, for which some association with EGFR mutations has previously been shown (27). Finally, the sample size of our Asian subpopulation analyses was small and potential residual confounding cannot be excluded.
In conclusion, we introduce a pragmatic, step-wise method that uses both restriction and multiple imputation approaches in sensitivity analyses to evaluate the potential impact of EGFR-status on epidemiologic analyses of NSCLC. Our model only incorporates readily available variables and therefore trades off some accuracy for the ability to be applied across a broad set of clinical circumstances in many other populations. This method is generalizable in the common occurrence in which EGFR-status data are missing from epidemiologic studies. With this method, we lay the foundation to refine future epidemiologic studies of NSCLC risk and outcome.
Authors' Disclosures
S. Schmid reports other support from Swiss Cancer Research Foundation, AstraZeneca, MSD, BMS, Boehringer Ingelheim; grants from AstraZeneca and BMS; and personal fees from Boehringer Ingelheim, Takeda, and MSD outside the submitted work. L. Ferro Leal reports grants from AstraZeneca - Brazil outside the submitted work. A.S. Wenzlaff reports grants from NIH during the conduct of the study. L. Le Marchand reports grants from NCI during the conduct of the study. A.G. Schwartz reports grants from NIH during the conduct of the study. L.C. Sakoda reports grants from NCI and California Tobacco-Related Disease Research Program; personal fees from NIH; and other support from National Lung Cancer Roundtable outside the submitted work. G. Liu reports other support from Princess Margaret Cancer Foundation during the conduct of the study as well as grants and personal fees from AstraZeneca, Takeda, and Roche; personal fees from Pfizer, Novartis, Bristol Myers Squibb, EMD Serono, Merck, Amgen, AbbVie, and Jazz Pharmaceuticals; and grants from Boehringer outside the submitted work. No disclosures were reported by the other authors.
Disclaimer
Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization.
Authors' Contributions
S. Schmid: Conceptualization, formal analysis, investigation, methodology, writing–original draft. M. Jiang: Conceptualization, data curation, formal analysis, investigation, methodology, writing–original draft. M.C. Brown: Conceptualization, data curation, formal analysis, investigation, methodology, writing–original draft. A. Fares: Conceptualization, data curation, methodology, writing–review and editing. M. Garcia: Conceptualization, data curation, methodology, writing–review and editing. J. Soriano: Data curation, investigation, writing–review and editing. M. Dong: Data curation, formal analysis, methodology, writing–review and editing. S. Thomas: Data curation, investigation, writing–review and editing. T. Kohno: Data curation, investigation, writing–review and editing. L.F. Leal: Data curation, investigation, writing–review and editing. N. Diao: Data curation, investigation, writing–review and editing. J. Xie: Data curation, investigation, writing–review and editing. Z. Wang: Data curation, investigation, writing–review and editing. D. Zaridze: Data curation, investigation, writing–review and editing. I. Holcatova: Data curation, investigation, writing–review and editing. J. Lissowska: Data curation, investigation, writing–review and editing. B. Świątkowska: Data curation, investigation, writing–review and editing. D. Mates: Data curation, investigation, writing–review and editing. M. Savic: Data curation, investigation, writing–review and editing. A.S. Wenzlaff: Data curation, investigation, writing–review and editing. C.C. Harris: Data curation, investigation, writing–review and editing. N.E. Caporaso: Data curation, investigation, writing–review and editing. H. Ma: Data curation, investigation, writing–review and editing. G. Fernandez-Tardon: Data curation, investigation, writing–review and editing. M.J. Barnett: Data curation, investigation, writing–review and editing. G. Goodman: Data curation, investigation, writing–review and editing. M.P.A. Davies: Data curation, investigation, writing–review and editing. M. Pérez-Ríos: Data curation, investigation, writing–review and editing. F. Taylor: Data curation, investigation, writing–review and editing. E.J. Duell: Data curation, investigation, writing–review and editing. B. Schoettker: Data curation, investigation, writing–review and editing. H. Brenner: Data curation, investigation, writing–review and editing. A. Andrew: Data curation, investigation, writing–review and editing. A. Cox: Data curation, investigation, writing–review and editing. A. Ruano-Ravina: Data curation, investigation, writing–review and editing. J.K. Field: Data curation, investigation, writing–review and editing. L. Le Marchand: Data curation, investigation, writing–review and editing. Y. Wang: Data curation, investigation, writing–review and editing. C. Chen: Data curation, investigation, writing–review and editing. A. Tardon: Data curation, investigation, writing–review and editing. S. Shete: Data curation, investigation, writing–review and editing. M.B. Schabath: Data curation, investigation, writing–review and editing. H. Shen: Data curation, investigation, writing–review and editing. M.T. Landi: Data curation, investigation, writing–review and editing. B.M. Ryan: Data curation, investigation, writing–review and editing. A.G. Schwartz: Data curation, investigation, writing–review and editing. L. Qi: Data curation, investigation, writing–review and editing. L.C. Sakoda: Data curation, investigation, writing–review and editing. P. Brennan: Data curation, investigation, writing–review and editing. P. Yang: Data curation, investigation, writing–review and editing. J. Zhang: Data curation, investigation, writing–review and editing. D.C. Christiani: Data curation, investigation, writing–review and editing. R.M. Reis: Data curation, investigation, writing–review and editing. K. Shiraishi: Data curation, investigation, writing–review and editing. R.J. Hung: Conceptualization, data curation, investigation, writing–review and editing. W. Xu: Conceptualization, formal analysis, methodology, writing–original draft. G. Liu: Conceptualization, data curation, formal analysis, supervision, investigation, methodology, writing–original draft.
Acknowledgments
This study was partially supported by the Public Ministry of Labor Campinas (Research, Prevention, and Education of Occupational Cancer), FINEP - CT-INFRA (02/2010). We thank all members of the GTOP group (Translational Group of Pulmonary Oncology - Barretos Cancer Hospital, Brazil). Caret-Study was funded by the NCI, NIH, through grants U01-CA063673, UM1-CA167462, and U01-CA167462. D.C. Christiani has received funding through a U01 Grant (U01 CA209414). G. Liu was supported by the Alan B. Brown Chair and the Lusi Wong Family Fund, Princess Margaret Cancer Foundation. M.C. Brown is supported by the Alan B. Brown Chair. S. Schmid was supported by the Swiss Cancer Research Foundation.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.