Increasing evidence points to a role for inflammation in lung carcinogenesis. A small number of circulating inflammatory proteins have been identified as showing elevated levels prior to lung cancer diagnosis, indicating the potential for prospective circulating protein concentration as a marker of early carcinogenesis. To identify novel markers of lung cancer risk, we measured a panel of 92 circulating inflammatory proteins in 648 prediagnostic blood samples from two prospective cohorts in Italy and Norway (women only). To preserve the comparability of results and protect against confounding factors, the main statistical analyses were conducted in women from both studies, with replication sought in men (Italian participants). Univariate and penalized regression models revealed for the first time higher blood levels of CDCP1 protein in cases that went on to develop lung cancer compared with controls, irrespective of time to diagnosis, smoking habits, and gender. This association was validated in an additional 450 samples. Associations were stronger for future cases of adenocarcinoma where CDCP1 showed better explanatory performance. Integrative analyses combining gene expression and protein levels of CDCP1 measured in the same individuals suggested a link between CDCP1 and the expression of transcripts of LRRN3 and SEM1. Enrichment analyses indicated a potential role for CDCP1 in pathways related to cell adhesion and mobility, such as the WNT/β-catenin pathway. Overall, this study identifies lung cancer–related dysregulation of CDCP1 expression years before diagnosis.
Prospective proteomics analyses reveal an association between increased levels of circulating CDCP1 and lung carcinogenesis irrespective of smoking and years before diagnosis, and integrating gene expression indicates potential underlying mechanisms.
See related commentary by Itzstein et al., p. 3441.
Growing epidemiologic evidence has indicated the central role of inflammation and chronic inflammation in carcinogenesis, which is now widely recognized as one of the hallmarks of cancer (1, 2). Chronic inflammation is a risk factor for the induction of certain cancers, and cancer itself can induce local inflammation processes that may promote tumor proliferation and metastasis (3–5). In blood, inflammation can be measured by the abundance of cytokines and other circulating proteins, which may therefore potentially serve as biomarkers of early carcinogenesis. In particular, inflammatory markers such as the C-reactive protein and interleukins have previously been identified as putative cancer prognostic markers (1, 6).
Lung cancer is the leading cause of cancer-related mortality worldwide. Inflammation and multiple chronic inflammatory conditions are associated with an increased risk of lung cancer. Although the potential for inflammation to exacerbate the harmful effect of smoking may partially explain this association (4), the detailed mechanisms at play remain unclear (7–9). Recent studies have identified associations linking the level of circulating inflammatory proteins and lung cancer risk (4, 6, 7, 10, 11). Increased levels of IL6 and IL8 have been associated with higher lung cancer risk in prospective cases less than 5 years before diagnosis (7, 11) as well as after 15 years of follow-up (11). These inflammatory proteins were suspected to be involved in pathways for smoking-induced carcinogenesis (7, 11, 12). These studies were, however, based on a very limited number of assayed inflammatory markers (less than 10; refs. 7 and 11) and/or on the analysis of serum from cases with a short time to diagnosis (less than 5 years; refs. 7 and 12). Our study extends these analyses by including a large panel (N = 92) of circulating inflammatory proteins in relation to future risk of lung cancer in participants who were healthy at baseline and were followed-up for up to 16 years. Our analyses include more than 600 prospective participants from two cohorts as a discovery set, and an additional 450 participants for validation. We also adopt an integrative approach combining proteomic and transcriptomic data obtained from the same samples to explore pathways and molecular mechanisms related to the identified protein markers.
Materials and Methods
Plasma samples were collected from participants in two prospective cohorts within the European Prospective Investigation into Cancer and Nutrition (EPIC), as already described: EPIC-Italy (13) and the Norwegian Women and Cancer Study (NOWAC; refs. 14 and 15). Details on participants are available in Supplementary Table S1. Our study population includes 325 lung cancer cases (N = 192 EPIC-Italy; N = 133 NOWAC) and 325 healthy controls matched on age, gender, year of recruitment, season of blood collection, and center. Because of issues with the seal of the straws in which the samples were aliquoted, two EPIC-Italy cases were excluded, leaving 323 lung cancer cases. All study participants gave written informed consent for the study. For EPIC-Italy, the research was approved by the Ethics Committees at the Italian Institute of Genomic Medicine (IIGM, Turin, Italy). For NOWAC, the study was approved by the Regional Committee for Medical and Health Research Ethics in North Norway. Validation of specific results was sought in 450 additional samples including 316 (161 cases and 155 controls) from the EPIC cohort (Centers of Netherlands, United Kingdom, Germany, and Spain; ref. 16) and 134 from the Northern Sweden Health and Disease Study (NSHDS; ref. 17). The characteristics of the validation dataset are reported in Supplementary Table S2. Circulating levels of proteins from these samples were measured using the same platform as in our dataset.
Inflammatory proteins measurements
The levels of 92 inflammatory proteins were measured in citrate plasma samples by multiplex proximity extension assay with the manufacturer kit (Proseek Multiplex Inflammation I panel, Olink Bioscience) using a Fluidigm Biomark reader (Fluidigm Corporation), as described previously (18). The case–control paired samples were randomized over eight 96-well plates. To evaluate the repeatability of the measurements, we included two replicates for 56 EPIC-Italy participants (N = 28 cases and 28 controls) and 16 control samples with the same composition. Protein levels were expressed as normalized protein expression (Ct values with corrections for assay variation), and log2 transformed prior to statistical analysis. After verifying that the proportions of samples below the limit of detection (LoD) were similar in cases and controls, we excluded proteins with levels below the LoD in more than 30% of the samples (N = 21). As replicated measurements were highly consistent (Lin concordance correlation above 0.95), we used the average levels for statistical analyses. Values below the LoD were imputed using the quantile regression imputation of left-censored data (QRILC) algorithm, as implemented in the R package imputeLCMD (19, 20). Finally, we removed samples that (i) did not pass the quality control provided by the manufacturer (N = 15), (ii) showed sample conservation issues by visual inspection (N = 15 EPIC-Italy samples), or (iii) were detected as outliers using the Filzmoser, Maronna, and Werner algorithm for multivariate outlier detection (N = 40 additional samples; ref. 21).
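The two quality checks described above (repeatability of replicates and the 30% LoD filter) can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the analyses were run in R); the function names, the use of population (1/n) moments, and the toy inputs are assumptions made for the example.

```python
from statistics import fmean

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired replicate measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    mx, my = fmean(x), fmean(y)
    # Population (1/n) variances and covariance (an assumption of this sketch).
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Agreement is penalized by both scatter and any systematic shift in means.
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def fraction_below_lod(values, lod):
    """Proportion of samples whose measured level falls below the limit of detection."""
    return sum(v < lod for v in values) / len(values)

def keep_protein(values, lod, max_fraction=0.30):
    """Keep a protein unless more than max_fraction of its samples are below the LoD."""
    return fraction_below_lod(values, lod) <= max_fraction
```

Unlike a Pearson correlation, the concordance coefficient drops when one replicate is systematically shifted relative to the other, which is why it suits a repeatability check.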
To better characterize the functional role of the proteins, these data were integrated with measured levels of 11,610 transcripts, available for NOWAC participants (N = 222). Transcriptomics data were obtained from total RNA extraction from blood cells as described previously (22, 23) and log-transformed prior to statistical analyses. Briefly, RNA was extracted from buffy coats, and miRNA expression profiling was performed on an Agilent Human miRNA Microarray (Release 19.0, 8 × 60 K), representing 2006 human miRNAs. Preprocessing and quality assessment of the data were performed as described previously (22, 23).
Women are overrepresented in our study population notably because the NOWAC study only includes women. We restricted our primary analyses to women participants to (i) maximize sample size while protecting against unmodeled potential gender and center-related confounding, and (ii) to ensure we could integrate gene expression data that were only available in NOWAC participants. We then sought to replicate our findings in men (from the Italian cohort). For completeness, we also included as a sensitivity analysis results from the analyses performed on the full study population (i.e., pooling men and women).
To account for technical variability, the data were denoised by extracting the residuals from linear mixed models where the protein levels were modeled as the outcome and plate number and center of recruitment were included as random intercepts in the model (24). The association between the measured levels of each of the proteins from the inflammatory panel and prospective lung cancer status was evaluated using a series of logistic regression models applied on the denoised data. The disease status (outcome) was regressed against the protein levels, age, gender (for models on the full population only), and body mass index (BMI). We expressed effect size estimates as ORs measuring the risk change for an increase of one SD in protein levels. The strength of the association was evaluated using a likelihood ratio test comparing the fit (as measured by the likelihood) of models with and without the protein levels in the predictor set. Results were corrected for multiple testing using the Benjamini–Hochberg procedure controlling the FDR below 0.05. To investigate the potential confounding role of smoking, analyses further adjusted for pack-years were also conducted. We sought validation of the main findings from our univariate analyses in an independent dataset (N = 450). We used the same logistic model, which was subsequently adjusted for smoking status, the only smoking exposure variable available for all participants (N = 450), and for pack-years in the participants (N = 316) for whom this information was available.
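For readers unfamiliar with the Benjamini–Hochberg step-up procedure used for the multiple testing correction, a minimal sketch follows. It is written in Python for illustration only; the function name and the example P values are assumptions, not values from this study.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per P value: True if the test is declared significant."""
    m = len(pvalues)
    # Indices of the P values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            max_k = rank
    # Every test ranked at or below max_k is rejected, even if its own
    # P value exceeded its own rank-specific threshold.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Illustrative P values only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(example)
```

With these ten illustrative P values, only the two smallest pass, because the third smallest (0.039) exceeds its threshold of 0.015 and no larger rank satisfies the rule.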
The inflammatory protein levels were jointly regressed against the future disease risk using logistic-LASSO (Least Absolute Shrinkage and Selection Operator) models adjusted on age and BMI (unpenalized effects; ref. 25). To investigate the stability of our results, we computed the selection proportions of the proteins by fitting the model on 1,000 subsamples of 80% of the data (26). At each subsampling iteration, the logistic LASSO models were calibrated using 10-fold cross-validation minimizing the binomial deviance using the R package glmnet (25), and subsequently used to estimate variable selection proportion across the 1,000 calibrated models. The effect of adjustment on smoking was investigated by including pack-years in the set of predictors without penalizing this variable. As we observed strong and inseparable cohort and gender effects, analyses were performed on women participants and validated separately on men.
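The subsampling logic behind the selection proportions can be sketched independently of the penalized model itself. In the Python illustration below, `select` stands in for the calibrated logistic LASSO (fitted with glmnet in the study); the toy selector, the feature names, and the data are hypothetical and serve only to make the sketch runnable.

```python
import random

def selection_proportions(rows, select, n_subsamples=1000, fraction=0.8, seed=0):
    """Share of subsamples in which each feature is selected.

    rows:   list of observations (any objects the selector understands)
    select: callable taking a list of rows and returning a set of feature names;
            in the study this role is played by a cross-validated logistic LASSO
    """
    rng = random.Random(seed)
    size = int(round(fraction * len(rows)))
    counts = {}
    for _ in range(n_subsamples):
        subsample = rng.sample(rows, size)  # subsampling without replacement
        for name in select(subsample):
            counts[name] = counts.get(name, 0) + 1
    return {name: c / n_subsamples for name, c in counts.items()}

def toy_select(rows, features=("a", "b"), min_gap=1.0):
    """Hypothetical stand-in selector (NOT a LASSO): keeps a feature when the
    case and control means differ by more than min_gap."""
    cases = [r for r in rows if r["case"] == 1]
    controls = [r for r in rows if r["case"] == 0]
    chosen = set()
    for f in features:
        gap = abs(sum(r[f] for r in cases) / len(cases)
                  - sum(r[f] for r in controls) / len(controls))
        if gap > min_gap:
            chosen.add(f)
    return chosen

# Toy data: feature "a" separates cases from controls, feature "b" does not.
rows = ([{"case": 1, "a": 5.0, "b": 0.0} for _ in range(20)]
        + [{"case": 0, "a": 0.0, "b": 0.0} for _ in range(20)])
props = selection_proportions(rows, toy_select, n_subsamples=200)
```

Features that are never selected do not appear in the returned dictionary, so a missing key should be read as a selection proportion of zero.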
To investigate potential subtype-specific effects, univariate and multivariate models were applied to cases from each histologic subtype separately (adenocarcinoma N = 91, large-cell carcinoma N = 17, squamous cell carcinoma N = 26, and small cell carcinoma N = 32). As we observed strong cohort effects in our data, we performed analyses in EPIC-Italy and NOWAC women separately. To account for the confounding role of smoking, the analyses were also performed in never and current smokers separately. As effect size may vary during the natural history of disease progression, we also ran our models separately in cases diagnosed before and after the median time to diagnosis. Time to diagnosis is defined as the time elapsed from the blood sample collection date (at recruitment) to the date at which lung cancer cases were diagnosed. The median time to diagnosis in women was 4.9 years and ranged from 1 to 16 years (interquartile range of 4.8 years). In these analyses, circulating levels of all assayed proteins in cases from the long and short time to diagnosis subgroups separately were compared with those observed in the full set of controls.
To evaluate the amount of disease-relevant information brought about by the proteins, we performed a series of logistic models with protein levels as predictors and future disease risk as the outcome. Models were fitted on a training set of 80% of the total population size and performances were computed on a test set including the remaining 20% of the observations. Subsamples were controlled such that each training and test set included the same proportion of cases, which was representative of that in the full population. The procedure was repeated 1,000 times. The results were visualized as ROC curves, showing the pointwise average, and a confidence region delimited by the 5th and 95th percentiles of the true and false positive rates (27).
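The repeated stratified 80/20 evaluation can be sketched as below, in Python for illustration. As a simplifying assumption, the sketch scores a single marker directly rather than refitting a logistic model on each training set; for one predictor with a positive coefficient this gives the same test AUC, because the AUC depends only on the ranking of the scores. Function names and data are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a random case outscores a random control
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both cases and controls are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_split(labels, test_fraction=0.2, rng=None):
    """Split indices so that training and test sets keep the case proportion."""
    rng = rng or random.Random(0)
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def repeated_test_auc(marker, labels, n_repeats=1000, seed=0):
    """Mean test-set AUC of a single marker over repeated stratified splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        _, test = stratified_split(labels, 0.2, rng)
        aucs.append(auc([marker[i] for i in test], [labels[i] for i in test]))
    return sum(aucs) / len(aucs)
```

For models with several predictors, the scoring step would be replaced by predictions from a model fitted on the training indices returned by `stratified_split`.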
To gain insight into the functional role of the strongest association we identified, CDCP1 levels were regressed against transcript levels in linear mixed models adjusted for plate using random effects. In addition, the associations between CDCP1 and biological pathways were investigated using individual-level gene expression data (N = 11,610 transcripts matched to 11,485 unique gene symbols). Functionally relevant groups of transcripts were first identified using biological information from two large knowledgebases in Panther (28): Biological Processes and Reactome. For each database, the identified pathways (involving up to 684 genes for Biological Processes and 1,499 for Reactome) were then summarized using principal component analysis (PCA). All principal components (PC) explaining more than 5% of the group's variance were kept for analyses and used as a proxy for the biological pathway. In a second step, the PC of all biological pathways were regressed individually against CDCP1 (as the outcome) using linear models. To account for the overlap in transcript members between different pathways, the effective number of tests (ENT) was computed using a PCA on the entire set of summarized pathways and estimated as the number of PC needed to explain 90% of the variance. Results were corrected for multiple testing using the threshold in P value P = 0.05/ENT. All statistical analyses were performed in R, version 4.0.2.
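The effective number of tests and the resulting significance threshold can be illustrated with a short Python sketch. It assumes that the PCA eigenvalues (variance explained by each PC of the summarized pathways) have already been computed; the function names and the eigenvalues in the example are made up.

```python
def effective_number_of_tests(eigenvalues, variance_explained=0.90):
    """Smallest number of leading PCs whose cumulative share of the total
    variance reaches the requested level."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= variance_explained:
            return k
    return len(eigenvalues)

def ent_threshold(eigenvalues, alpha=0.05, variance_explained=0.90):
    """Per-test P value threshold alpha / ENT."""
    return alpha / effective_number_of_tests(eigenvalues, variance_explained)

# Illustrative eigenvalues: the first three PCs explain 95% of the variance.
ent = effective_number_of_tests([6.0, 2.5, 1.0, 0.5])
threshold = ent_threshold([6.0, 2.5, 1.0, 0.5])
```

Because overlapping pathways share transcripts, ENT is typically smaller than the number of pathways tested, so this threshold is less conservative than a Bonferroni correction over all pathways.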
Availability of data and materials
The data in EPIC that support the findings of this study are available from the corresponding author upon reasonable request. The NOWAC data cannot be shared publicly because of local and national ethical and security policy. Data access for researchers will be conditional on adherence to both the data access procedures of the Norwegian women and Cancer Cohort and the UiT The Arctic University of Norway (Tromsø, Norway; contact via T.M. Sandanger, firstname.lastname@example.org, Tonje Braaten email@example.com, and Arne Bastian Wiik, firstname.lastname@example.org) in addition to the local ethical committee.
The main features of the study population are summarized in Supplementary Table S1 and show that participants from both the Italian (EPIC-Italy) and Norwegian (NOWAC) cohorts share similar characteristics, except smoking habits, and distribution of lung cancer histologic subtypes. Particularly, we observed a slight excess of adenocarcinoma (47.4%) in NOWAC compared with EPIC Italy (44% and 39.6% for women and men, respectively). Small cell carcinoma was the second most prevalent cancer in NOWAC (14.2%) whereas large cell carcinoma was more frequent in EPIC Italy (15.5% in women and 21.7% in men). To maximize the comparability of the population from NOWAC (women only) and EPIC-Italy, we restricted our main analyses to women, and, as sensitivity analysis, investigated separately data from Italian men. Results obtained from these subgroup analyses are compared with models applied on the full population and further adjusted on gender.
Of the 92 inflammatory proteins assayed in our samples, 21 were excluded because of levels falling below the LoD in more than 30% of the samples, leaving 71 proteins for further analyses. The proportion of measurements below the LoD did not depend on the case–control status (Supplementary Fig. S1A). For the 56 technical replicates included in our assays, Lin concordance correlations of the measured protein levels were all above 0.95, indicating a good repeatability of the measurements (Supplementary Fig. S1B). To avoid generating results driven by outlying observations, we used an automatic outlier detection algorithm applied to the first five PCs (see the Materials and Methods for details) and excluded participants (N = 59) from our analyses. We also excluded samples (N = 15) that did not pass the quality control provided by the analyzing laboratory and samples (N = 21) that had a defect in sample vials prior to the analysis (Supplementary Fig. S2A). To correct for nuisance variation and to reduce the potential for technical bias, the data were subsequently denoised by extracting the residuals from linear mixed models with center and plate ID as random intercepts (Supplementary Fig. S2B and S2C).
Univariate analyses reveal higher levels of protein CDCP1 in future cases
Univariate logistic regression models indicated that the circulating levels of 12 proteins: CDCP1 (OR = 1.94, P = 5.49 × 10−9), HGF (OR = 1.43, P = 6.82 × 10−4), IL6 (OR = 1.46, P = 7.63 × 10−4), OSM (OR = 1.41, P = 1.09 × 10−3), MCP1 (OR = 1.38, P = 2.12 × 10−3), IL8 (OR = 1.29, P = 3.84 × 10−3), VEGFA (OR = 1.33, P = 5.39 × 10−3), CD6 (OR = 1.32, P = 7.08 × 10−3), and CD5 (OR = 1.32, P = 7.41 × 10−3) were associated with an increased risk of lung cancer in women after adjustment for multiple testing using the Benjamini–Hochberg procedure (FDR; Table 1). Levels of SCF (OR = 0.62, P = 1.02 × 10−5), TWEAK (OR = 0.76, P = 6.47 × 10−3), and IL12B (OR = 0.75, P = 6.65 × 10−3) were inversely associated with lung cancer risk. All these 12 proteins were also associated with exposure to tobacco smoke as measured by pack-years or smoking status (Supplementary Table S3). SCF, TWEAK, and IL12B were the only proteins showing decreased levels in relation to smoking.
Table 1. Column layout; each column reports OR and P.
| Panel | All LC | Adenocarcinoma | Small cell carcinoma | Large-cell carcinoma | Squamous cell carcinoma |
| A | N = 397 | N = 292 | N = 233 | N = 218 | N = 227 |
| B | N = 388 | N = 286 | N = 227 | N = 214 | N = 223 |
Note: Models are adjusted on age and BMI (A). Center and plate effects were removed from the data by taking the residuals from linear mixed models with protein levels as the outcome and center and plate as random intercepts. Models further adjusted on pack-years are also reported (B). The P values of association with future disease status are derived from likelihood ratio tests comparing the fit of the model with that of the model without protein levels in the set of predictors. Results are presented for pooled lung cancer and for each histologic subtype, for proteins found associated with at least one of the lung cancer subtypes considered after Benjamini–Hochberg correction for multiple testing. Variables for which the P value is below the per-test significance level, ensuring a family-wise error rate control below 0.05, are presented in bold.
Associations between all proteins and future lung cancer risk were attenuated when adjusting for smoking (as measured by pack-years; Table 1), and only CDCP1 (OR = 1.58, P = 3.09 × 10−4) remained clearly associated with the risk of lung cancer independently of smoking, albeit with a partly attenuated effect size. Analyses restricted to women (N = 132) who never smoked showed that CDCP1 (OR = 1.46, P = 7.78 × 10−2) and IL8 (OR = 1.61, P = 1.69 × 10−2) were the most dysregulated proteins in relation to future lung cancer, but neither survived correction for multiple testing (Supplementary Table S4). All three proteins that were associated with lung cancer in the analyses restricted to current smokers survived adjustment on pack-years (CDCP1, OR = 2.16, P = 2.00 × 10−4; SCF, OR = 0.53, P = 1.91 × 10−3; and IL6, OR = 2.14, P = 6.07 × 10−4; Supplementary Table S4).
Levels of CDCP1 in men (88 cases and 88 controls) were also associated with future risk of lung cancer (OR = 1.68, P = 1.51 × 10−3, and OR = 1.86, P = 7.42 × 10−4 for the unadjusted and the model adjusted for smoking, respectively; Supplementary Table S5).
Models applied to the full population and adjusted on gender yielded consistent results (Supplementary Table S6). Eleven of the 12 proteins (all except TWEAK) identified in women were also associated with future risk of lung cancer in the full population (Supplementary Table S6). CDCP1 was the only protein associated with the risk of lung cancer in the model adjusted for pack-years in the full population and in current smokers (OR > 1.83, P < 5.91 × 10−6; Supplementary Tables S6 and S7).
Validation of the association involving blood levels of CDCP1 was sought in samples of participants (N = 450) from the EPIC study (Centers of Netherlands, United Kingdom, Germany, and Spain) and the NSHDS, including 225 cases and 225 healthy controls. Results consistently showed elevated levels of CDCP1 in prospective cases (Supplementary Table S8). The associations were attenuated upon adjustment for pack-years (available only for EPIC samples), and to a lesser extent for smoking status (available for both studies), but CDCP1 remained associated with lung cancer outcome at a nominal significance level of 0.05. We obtained consistent results in the full population, and in women and men separately.
Analyses of CDCP1 and lung cancer by time-to-diagnosis subgroups and by cohort
We compared the levels of all assayed proteins in two subgroups of cases based on the time between blood draw and clinical onset (Supplementary Table S9) to those of all controls. We found that levels of CDCP1 were higher in cases (and in both subgroups of cases) than those observed in controls (Supplementary Fig. S3). Levels of CDCP1 were associated with future risk of lung cancer both in cases diagnosed before and in cases diagnosed after the median time to diagnosis (4.9 years). The association survived adjustment for smoking in the longer time to diagnosis group (OR = 1.91, P = 1.67 × 10−5) and was borderline significant in the shorter time to diagnosis group (OR = 1.35, P = 5.45 × 10−2).
Circulating levels of SCF were also found to be associated with the future risk of lung cancer irrespective of the time to diagnosis in the base model (OR < 0.64, P < 3.75 × 10−4). Two other proteins were found to be associated with lung cancer risk in the shorter (OSM, OR = 1.53, P = 9.08 × 10−4) and in the longer (MCP1, OR = 1.49, P = 2.03 × 10−3) time to diagnosis groups, but none of these associations survived correction for multiple testing in models adjusted for smoking.
In analyses by cohort, only CDCP1 was associated with lung cancer risk in both NOWAC and EPIC-Italy (OR > 1.73, P < 8.04 × 10−5; Supplementary Table S10). This association was attenuated when adjusting for smoking (P = 2.26 × 10−2 and 2.22 × 10−3 in NOWAC and EPIC-Italy, respectively), and while it did not survive correction for multiple testing, it was suggestive of a consistent increased risk of lung cancer for higher levels of CDCP1 (at a nominal significance level of 0.05).
To account for the complex correlation patterns across proteins in cases and controls (Supplementary Fig. S4A and S4B), and to identify a sparse set of proteins jointly and complementarily contributing to lung cancer risk, we adopted a penalized logistic regression model using the LASSO penalty to allow for variable selection. Penalized regression was coupled with stability assessment based on feature selection proportion, which was calculated over 1,000 subsamples of the full population. Our logistic LASSO models consistently selected CDCP1 as well as three other proteins (MCP1, SCF, and IL10, each selected in over 90% of subsamples) that were independently associated with lung cancer risk (Fig. 1A). Our logistic LASSO also selected ST1A1, CXCL11, and CD8A (with selection proportions greater than 75%), although these proteins were not found to be associated with lung cancer risk in univariate analyses (P > 7.30 × 10−2). Possibly due to their correlation with MCP1 (ρ > 0.27), HGF, VEGFA, and CD6 were found to be associated with lung cancer risk in the univariate models (P < 7.08 × 10−3) but were not frequently selected in our LASSO analyses (selection proportions ranging from 31% to 42%). Selection proportions of all proteins were attenuated in models adjusted for pack-years, and only CDCP1 and IL10 remained selected with selection proportions over 80% in the adjusted model (Fig. 1A).
Results of the LASSO in men (N = 173) from the EPIC-Italy study (Supplementary Fig. S5) highlight CDCP1 as the most frequently selected protein in both the unadjusted and smoking-adjusted models (with selection proportions of 81% and 70%, respectively). Five other proteins (SCF, CD5, CXCL10, FGF21, and AXIN1) were selected in over 60% of the subsamples in the base model, but their selection proportions dropped below 40% in the model adjusted for smoking. These results were consistent with those obtained from the full population, where CDCP1 is the only protein with a selection proportion above 0.8 for analyses of all lung cancer, adenocarcinoma, and small cell carcinoma cases (Supplementary Fig. S6A–S6C, respectively).
Inflammatory proteins and cancer subtypes
To investigate the role of CDCP1 in association with specific subtypes and to account for histologic heterogeneity of lung cancer, we ran our analyses on the four most common subtypes represented in our study: adenocarcinoma (N = 91 cases), small cell carcinoma (N = 32), squamous cell carcinomas (N = 26), and large cell carcinoma (N = 17), and compared them with all controls (N = 201). CDCP1 was found associated with the risk of adenocarcinoma (OR = 1.84, P = 5.24 × 10−5) and small cell carcinoma (OR = 2.73, P = 7.82 × 10−5) in the model adjusted for pack-years (Table 1). Results obtained from models applied to the full population also suggest an association between CDCP1 and adenocarcinoma (OR = 1.98, P = 2.52 × 10−8) and small cell carcinoma (OR = 4.1, P = 8.03 × 10−12) in both the unadjusted model and the model adjusted for pack-years (Supplementary Table S6).
Despite the small number of observations in the validation set, results also suggested an association between higher levels of CDCP1 and the risk of adenocarcinoma (P = 2.24 × 10−4, 2.28 × 10−4, and 1.42 × 10−2 for the base model and for the models adjusted for smoking status or pack-years, respectively). Results were consistent but weaker in the analyses by gender (Supplementary Table S8).
In our population, the levels of eight other inflammatory proteins were increased in future small cell carcinoma cases (HGF, MCP1, CD6, IL18, CCL11, IL10RB, TRAIL, and CCL3, P < 0.005). Blood levels of SCF were inversely associated with the risk of squamous cell carcinoma (OR = 0.47, P = 6.39 × 10−5; Table 1). Of these, five (IL18, CCL11, IL10RB, TRAIL, and CCL3) were not associated with adenocarcinoma, large-cell or squamous cell carcinoma (P > 0.09).
Subtype-specific penalized regression models consistently selected CDCP1 (selection proportion of 0.99 and 0.98 in the models unadjusted and adjusted for pack-years, respectively) and IL10 (selection proportion of 0.8 in the unadjusted model and 0.95 in the model adjusted on pack-years) as jointly explaining the risk of adenocarcinoma (Fig. 1B). For the risk of small cell carcinoma (Fig. 1C), the penalized regression model selected CDCP1, IL12B, and CCL11 (selection proportion > 0.8) as jointly contributing to risk in the unadjusted model. CDCP1 remained highly selected (selection proportion of 0.81) in the model adjusted for smoking, while selection proportions of IL12B and CCL11 dropped below 0.1.
Quantification of the explanatory abilities of CDCP1
We conducted ROC analyses to quantify the abilities of the circulating proteins to discriminate between future lung cancer cases and controls (Fig. 2A) in all women. The model including CDCP1 alone yielded a mean AUC of 0.65. The amount of disease-related information added by CDCP1 over and above that of pack-years was modest (mean AUC of 0.74 in the model with pack-years alone, and 0.75 with pack-years and CDCP1). The inclusion of additional proteins selected in the LASSO (N = 10 proteins with selection proportions ≥ 0.8) improved the explanatory performance over that of the model only including CDCP1 on top of pack-years (mean AUC = 0.78). This suggests that selected proteins capture complementary disease-relevant information and slightly improve the discriminatory performance of pack-years only.
For adenocarcinoma, CDCP1 yielded a slightly higher explanatory performance (mean AUC = 0.68; Fig. 2B), which was comparable with that of the model with pack-years alone (mean AUC = 0.69). Including both CDCP1 and pack-years improved the performance of the model (mean AUC = 0.73), suggesting that CDCP1 provided additional risk-relevant information to pack-years for adenocarcinoma. Conversely, the risk of small cell carcinoma is more accurately explained by pack-years alone (AUC = 0.88), and neither CDCP1, nor the set of proteins selected by the LASSO (N = 3) improved risk explanation (Fig. 2C).
Correlation between CDCP1 levels and full resolution gene expression data suggests a role of the WNT signaling pathway
To better characterize the functional role of CDCP1, we explored the correlations between blood levels of CDCP1 and the levels of 11,610 transcripts previously assayed in the same NOWAC participants (N = 222). Univariate linear models regressing CDCP1 levels (as the outcome) against transcript levels identified significant associations linking levels of CDCP1 and the expression of LRRN3 and SEM1 (Supplementary Fig. S7) after FDR control using the Benjamini–Hochberg approach. To ease the interpretation of results, we defined functional groups of transcripts using the Reactome and Biological Processes (Gene Ontology) knowledgebases (29, 30). We identified 1,545 and 3,600 functional groups for the Reactome and Biological Processes databases, respectively, and each was summarized using PCA (we included all PCs explaining at least 5% of the variance of each pathway). In models regressing these summary variables against CDCP1, three Reactome pathways were significantly associated with CDCP1 after correction for multiple testing: Initial triggering of complement, Defective GALNT3 causes familial hyperphosphatemic tumoral calcinosis, and Deactivation of the β-catenin transactivating complex (P < 4.17 × 10−4; Fig. 3A; Supplementary Table S11A). These involved 6 to 30 transcripts, none of which were detected in univariate regressions. Using biological processes for the grouping, we identified 11 significant pathways, including protein localization to nucleus, regulation of cell–cell adhesion, and regulation of chemotaxis (Fig. 3B; Supplementary Table S11B).
We assessed the association of a panel of circulating inflammation proteins with risk of subsequent lung cancer in two prospective cohorts as training set and validated our main finding in independent samples. We adopted complementary statistical approaches, which consistently identified CDCP1 as being directly associated with future lung cancer risk irrespective of time to diagnosis and smoking habits. In our univariate models, 12 proteins from our panel were found to be associated with lung cancer status, including CDCP1, SCF, IL6, and IL8. Consistently with previous studies, we observed increased levels of IL6 and IL8 associated with future lung cancer cases, although these associations were weaker than previously described, and were attenuated when accounting for tobacco exposure (7, 31). From our analyses, CDCP1 stood out due to its strong and consistent association with prospective lung cancer risk, irrespective of smoking habits (as measured by pack-years) and time to diagnosis. Overall, our results point toward elevated blood levels of CDCP1 in prospective lung cancer cases compared with controls (10.6%, 8.4%, and 9.9% increase in women, men, and the full population, respectively), irrespective of smoking habits and time to diagnosis.
Subtype analyses identified several differentially expressed proteins (in particular, for small cell carcinoma cases). While this may indicate subtype-specific dysregulation of inflammation, these results should be interpreted with caution, as they are based on a very limited number of observations.
CUB domain-containing protein 1 (CDCP1) is a transmembrane noncatalytic receptor involved in the loss of anchorage in epithelial cells during mitosis (32). CDCP1 has been shown to be highly expressed in different types of cancer cells, particularly in human colorectal and lung cancers (33). In lung cancer, it was shown to be associated with higher proliferation, metastasis, poor prognosis, and reduced survival (34–37). Our findings demonstrated higher levels of circulating CDCP1 many years prior to lung cancer diagnosis, suggesting that CDCP1 is indicative of mechanisms important for lung cancer etiology, in addition to its potential role as a prognostic marker. Our ROC analyses indicated modest explanatory abilities of CDCP1 for lung cancer and its main histologic subtypes. When combined with smoking, CDCP1 yielded moderate improvements in the explanation of lung cancer, suggesting that part of the CDCP1–lung cancer association we observe is not directly related to smoking and captures some other aspect of lung cancer risk. Blood levels of CDCP1 may therefore have the potential to inform on the mechanisms of both smoking-related and smoking-unrelated lung carcinogenesis.
Pathways related to CDCP1 and prediagnostic lung cancer have been poorly described to date. To better understand the potential pathways linking CDCP1 and early carcinogenesis, we integrated gene expression data measured in the same individuals. We detected two transcripts associated with CDCP1: LRRN3 (leucine-rich repeat neuronal protein 3) and SEM1 (26S proteasome complex subunit). In previous work, we identified LRRN3 to be positively associated with smoking exposure (38). Its positive association with CDCP1 suggests a potential implication of CDCP1 in smoking-induced lung cancer–related pathways. SEM1 has, to our knowledge, not been reported as linked to lung cancer or to the expression of CDCP1.
Reactome enrichment analysis revealed an association between CDCP1 and deactivation of the β-catenin transactivating complex. Inappropriate activation of WNT/β-catenin signaling has been linked to a wide range of cancers (39, 40). β-Catenin forms a complex with the T-cell factor transcription factor family, resulting in the activation of genes implicated in tumor development. In vitro, WNT/β-catenin signaling has been identified as a critical pathway in human lung carcinogenesis (41). Here we show that higher levels of CDCP1 are negatively associated with the β-catenin deactivation pathway. In accordance with our findings, recent work in colorectal cancer cell lines has shown that CDCP1 is an important regulator of WNT signaling and that, similar to what we observe for lung cancer, elevated levels of CDCP1 were predictive of colorectal cancer (42). Consistent with these observations, pathway analysis with the PANTHER biological processes database also indicated that CDCP1 was associated with protein localization to the nucleus, as observed by Hu and colleagues (42). An additional pathway of interest associated with CDCP1 was cell–cell adhesion. Dong and colleagues observed that a disruption of CDCP1 in vitro was associated with an interference in EGF/EGFR-induced cell migration and suggested CDCP1 as a potential target for EGFR-driven cancers (43). EGFR is an important therapeutic target for lung cancer, as over 60% of non–small cell lung carcinomas express EGFR. In lung cancer metastasis, it was shown that EGF stimulation increases CDCP1 expression and that an EGFR inhibitor reduces the level of CDCP1 in lung cancer cells (44).
Strengths of our study include the validation of our association involving CDCP1 in three separate cohorts (EPIC, NOWAC, NSHDS) and with two distinct statistical methods, which enabled us to triangulate the evidence and demonstrate a robust link between prediagnostically measured blood levels of CDCP1 and risk of subsequent lung cancer. Our work is also the first study to use prediagnostic data on both a broad panel of inflammatory proteins and gene expression in hundreds of lung cancer cases from the same population for integrated analyses, allowing for a more comprehensive assessment of potential mechanisms underlying the risk associations.
The limitations of our study include the relatively small sample sizes for the validation of our results in men and for the analysis of histologic subtypes. In addition, our measures of inflammatory proteins were cross-sectional, and other prospective studies with multiple measures on participants would be instrumental to further investigate our findings. Information on EGFR mutation status was not available for our cohort, and future studies investigating the role of CDCP1 in lung cancer would benefit from the inclusion of this information.
The survival of patients with lung cancer is highly dependent on accurate and early diagnosis. The U.S. National Lung Screening Trial suggested that diagnosis by low-dose CT could reduce lung cancer mortality by up to 20%, but that it is also associated with a high level of false positives, subjecting a large number of potentially benign cases to unnecessary and costly follow-up (45). The identification of early biomarkers of susceptibility could significantly improve diagnosis by improving the identification of individuals at high risk who are more likely to benefit from screening (46). The development of OMICs technologies has opened new grounds for biomarker identification. Quantitative proteomics measures differences in protein abundance between samples from controls and cases, allowing the identification of biomarkers, pathway perturbations, and molecular interactions (47). Our study suggests that circulating serum levels of CDCP1 provide additional information on future lung cancer risk, over and above that afforded by information on tobacco exposure, and may therefore help in the identification of molecular pathways involved in lung carcinogenesis, years before diagnosis.
M.B. Schulze reports grants from German Cancer Aid, German Federal Ministry of Education and Research (BMBF), and European Union during the conduct of the study. No disclosures were reported by the other authors.
Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization.
S. Dagnino: Conceptualization, data curation, formal analysis, investigation, writing–original draft, writing–review and editing. B. Bodinier: Conceptualization, data curation, formal analysis, investigation, writing–original draft, writing–review and editing. F. Guida: Formal analysis, validation, writing–review and editing. K. Smith-Byrne: Formal analysis, validation, writing–review and editing. D. Petrovic: Writing–review and editing. M.D. Whitaker: Data curation, writing–review and editing. T. Haugdahl Nøst: Data curation, writing–review and editing. C. Agnoli: Data curation. D. Palli: Data curation. C. Sacerdote: Data curation. S. Panico: Data curation. R. Tumino: Data curation. M.B. Schulze: Data curation. M. Johansson: Data curation, validation. P. Keski-Rahkonen: Resources, writing–review and editing. A. Scalbert: Resources, writing–review and editing. P. Vineis: Data curation, writing–review and editing. M. Johansson: Validation, writing–review and editing. T.M. Sandanger: Investigation, writing–review and editing. R.C. Vermeulen: Conceptualization, resources, data curation, supervision, writing–review and editing. M. Chadeau-Hyam: Conceptualization, supervision, investigation, writing–original draft, writing–review and editing.
This work was supported by Cancer Research UK Population Research Committee “Mechanomics” project grant (grant #22184 to M. Chadeau-Hyam). The NOWAC post-genome cohort study was funded by the ERC advanced grant and Transcriptomics in Cancer Epidemiology (ERC-2008-AdG-232997). M. Chadeau-Hyam, F. Guida, K. Smith-Byrne, T. Haugdahl Nøst, M. Johansson, and T.M. Sandanger acknowledge support from the Research Council of Norway (Id-Lung project FRIPRO 262111 to T.M. Sandanger). This research was supported by Institut National Du Cancer (France, PI: Mattias Johansson) and Cancerforskningsfonden i Norrland (Sweden, PI: Mikael Johansson). M. Chadeau-Hyam and R.C.H. Vermeulen acknowledge support from the H2020-EXPANSE project (Horizon 2020 grant no. 874627 to R.C.H. Vermeulen). S. Dagnino acknowledges support from Horizon 2020 Marie Skłodowska-Curie fellowship EXACT Identifying biomarkers of EXposure leading to Lung Cancer with AdduCTomics (Grant # 708392 to S. Dagnino). B. Bodinier received a PhD studentship from the MRC Centre for Environment and Health. R. Tumino acknowledges A.I.R.E. - O.N.L.U.S. Ragusa Italy. EPIC-Italy was funded by the Italian Association for Research on Cancer (AIRC). The EPIC-Norfolk study (DOI 10.22025/2019.10.105.00004) has received funding from the Medical Research Council (MR/N003284/1 and MC-UU_12015/1) and Cancer Research UK (C864/A14136). The authors are grateful to all the participants who have been part of the project and to the many members of the study teams at the University of Cambridge (Cambridge, United Kingdom) who have enabled this research. The authors would like to acknowledge principal investigators of the EPIC cohort for allowing validation of data in their respective cohorts, namely, Rudolf Kaaks for EPIC Heidelberg, Antonio Agudo for EPIC Spain, and Nick Wareham for EPIC Norfolk.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.