Abstract
Purpose: Lung cancer remains the leading cause of cancer-related death worldwide, mainly due to late diagnosis. Cytology is the gold-standard method for lung cancer diagnosis in minimally invasive respiratory samples, despite its low sensitivity. We aimed to identify epigenetic biomarkers with clinical utility for cancer diagnosis in minimally invasive and noninvasive specimens to improve the accuracy of current technologies.
Experimental Design: The identification of novel epigenetic biomarkers in stage I lung tumors was accomplished using an integrative genome-wide restrictive analysis of two different large public databases. DNA methylation levels for the selected biomarkers were validated by pyrosequencing in paraffin-embedded tissues and minimally invasive and noninvasive respiratory samples in independent cohorts.
Results: We identified nine cancer-specific hypermethylated genes in early-stage lung primary tumors. Four of these genes presented consistent CpG island hypermethylation compared with nonmalignant lung and were associated with transcriptional silencing. A diagnostic signature was built using a multivariate logistic regression model based on the combination of four genes: BCAT1, CDO1, TRIM58, and ZNF177. Clinical diagnostic value was also validated in multiple independent cohorts and yielded a remarkable diagnostic accuracy in all cohorts tested. The calibrated and cross-validated epigenetic model predicts with high accuracy the probability of detecting cancer in minimally invasive and noninvasive samples. We demonstrated that this epigenetic signature achieved higher diagnostic efficacy in bronchial fluids than conventional cytology for lung cancer diagnosis.
Conclusions: Minimally invasive epigenetic biomarkers have emerged as promising tools for cancer diagnosis. The herein obtained epigenetic model in combination with current diagnostic protocols may improve early diagnosis and outcome of lung cancer patients. Clin Cancer Res; 22(13); 3361–71. ©2016 AACR.
Lung cancer is the leading cause of cancer mortality worldwide. Patient outcome is closely linked to tumor stage at diagnosis and unfortunately, most lung cancer patients are diagnosed at late stages when a curative treatment is no longer possible. Using an integrative genome-wide experimental method whereby hundreds of stage I patients from two independent lung cancer datasets were examined, we identified an epigenetic four-gene model with diagnostic value for detecting lung cancer. This DNA methylation signature was validated with a gene locus–specific technique in three minimally and noninvasive independent cohorts. The combination of this highly sensitive and specific epigenetic model with standard clinical markers may help to improve lung cancer diagnosis and therefore decrease current mortality rates.
Introduction
Lung cancer is the main cause of death from cancer worldwide (1). Several factors are associated with the poor outcome of lung cancer patients. One of them is, despite recent advances, the scarcity of effective therapies achieving durable responses. Another, and even more important, factor is late diagnosis, as most lung tumors are detected at advanced stages of the disease (2). This is crucial, taking into account that survival rates drop substantially from early to late stages.
In this context, the data reported by early observational studies and by the randomized National Lung Screening Trial (NLST) have shown that lung cancer screening with low-dose helical computed tomography (LDCT) is able to reduce lung cancer mortality, as significantly more cases can be detected in earlier stages (3). Last year, the United States Preventive Services Task Force (USPSTF) issued the recommendation to implement annual lung cancer screening for smokers with the inclusion criteria of the NLST. Nevertheless, there are still a significant number of open questions and areas for optimization of the different aspects related to this screening strategy. For example, there is a need for risk models and markers to improve the screening cost-benefit ratio by better selecting the screened population. Moreover, although CT-based imaging is a very sensitive technique, its specificity is low, and it yields a large proportion of cases with indeterminate nodules. These nodules may require further follow-up or invasive procedures that turn out to be futile in the frequent case of the nodules being benign. Biomarkers for the correct classification of indeterminate nodules and as an adjunct to the diagnostic procedure are a clear unmet clinical need (4, 5).
Epigenetic biomarkers, mainly DNA methylation, have emerged as one of the most promising approaches to improve cancer diagnosis and present several advantages as compared with other markers, such as gene expression or genetic signatures. DNA methylation alterations are covalent modifications that are remarkably stable and often occur early during carcinogenesis. In addition, DNA methylation can be detected by a wide range of sensitive and cost-efficient techniques even in samples with low tumor purity. This epigenetic modification can also be detected in different biologic fluids, which makes it a promising tool for minimally invasive and noninvasive cancer detection (6). In recent years, different epigenetic candidates have been proposed, but none has reached the clinic yet, mainly due to the lack of large validation studies or the use of analytical methods that are difficult to standardize. In addition, most studies followed a single candidate-gene, hypothesis-driven approach (7–11), although incipient genome-wide approaches are also appearing (12). High-throughput epigenomic studies, which permit unbiased data-driven research, have become a powerful tool for systematically dissecting the role of epigenetic variation in cancer, with the potential of identifying novel and more robust biomarkers (13).
Bronchoscopic examination with pathologic assessment of cytologic specimens is currently the most widely used diagnostic method. However, almost half of the cases remain occult, especially in peripherally located tumors (14). This leads to additional invasive procedures, such as surgical lung biopsy or transthoracic needle biopsy, which are associated with significant morbidity (15). The implementation of molecular biomarkers, including epigenetic and gene expression classifiers, in bronchial aspirates or sputum represents a promising approach to improve the accuracy of minimally invasive and noninvasive neoplasm diagnosis (16, 17). Such biomarkers can also be used to develop clinical tools such as nomograms, which allow the probability of a clinical event to be calculated. These predictive models can improve individualized risk assessment compared with risk groups, leading to more personalized medicine (18).
Here, we have identified and validated a signature of DNA methylation biomarkers already present in early-stage lung cancer and globally absent in normal tissue. For this purpose, we used two different datasets: the CURELUNG FP7 Consortium and the Cancer Genome Atlas (TCGA). Subsequently, we tested by pyrosequencing the selected biomarkers in several independent case–control datasets (formalin-fixed paraffin-embedded tissues, bronchial aspirates, bronchioalveolar lavages, and sputum samples). This study provides a novel epigenetic predictive model that may help to improve lung cancer diagnosis.
Materials and Methods
Study design and participants
This is a collaborative, retrospective study including data from publicly available datasets, formalin-fixed paraffin-embedded (FFPE) tissues, bronchial aspirates/lavages, and sputum samples obtained from lung cancer patients and cancer-free individuals; samples were included as they arrived at the laboratory and passed the technical quality checks. Genome-wide DNA methylation data (Infinium 450K array) were downloaded from our previously published lung cancer dataset deposited at the Gene Expression Omnibus (GSE39279; ref. 19) for the discovery cohort and from the TCGA data repository (lung adenocarcinoma, LUAD, or lung squamous cell carcinoma, LUSC) for the TCGA validation cohort. Biologic validation of the selected methylation biomarkers was conducted by pyrosequencing in four independent cohorts obtained from different institutions in Spain: (i) a total of 201 FFPE samples from the Health Institute Carlos III (ISCIII), Madrid, and the Centre for Applied Medical Research/Hospital of the University of Navarre (CIMA/CUN), Pamplona; (ii) 80 bronchial aspirates (BAS) and (iii) 98 sputum samples from the Catalan Institute of Oncology and Bellvitge University Hospital, Barcelona; and (iv) 111 bronchioalveolar lavages (BAL) from CIMA, Pamplona, and the Hospital of Talavera de la Reina, Talavera de la Reina. All DNA extractions from the different specimens were performed by the same technicians to avoid interlaboratory variation.
The study was approved by the corresponding institutional review boards, and patients signed informed consent to participate. The main clinical characteristics of the different cohorts are described in Table 1; those of the discovery cohort have also been described previously.
Clinical characteristics of the invasive [Discovery (A), Validation (B) and Paraffin (C)] and minimally invasive [BAS (D), BALs (E) and Sputum (F)] cohorts
Cohort | A. Discovery cohort | | B. TCGA validation cohort | | C. Paraffin cohort | |
---|---|---|---|---|---|---|
Patients | Tumor (n = 237) | Nontumoral (n = 25) | Tumor (n = 350) | Nontumoral (n = 62) | Tumor (n = 122) | Nontumoral (n = 79) |
Age (years) | 68 (38–90) | 63.5 (39–86) | 66.9 (33–90) | 66.9 (40–86) | 63.8 (40–80) | 62.5 (42–85) |
Gender | ||||||
Male | 131 (55.3%) | 20 (80%) | 190 (54.1%) | 36 (58.0%) | 108 (88.5%) | 66 (83.5%) |
Female | 106 (44.7%) | 5 (20%) | 160 (45.9%) | 26 (42.0%) | 14 (11.5%) | 13 (16.5%) |
Smoking history | ||||||
Current or former smoker | 190 (80.1%) | 24 (96%) | 313 (89.4%) | 55 (89.4%) | 70 (57.4%) | 71 (89.9%) |
Nonsmoker | 25 (10.5%) | 1 (4%) | 32 (9.1%) | 2 (9.1%) | 36 (29.5%) | 8 (10.1%) |
Unknown | 22 (9.4%) | 0 (0%) | 5 (1.5%) | 5 (1.5%) | 16 (13.1%) | 0 (0%) |
Stage | ||||||
I | 237 (100%) | | 350 (100%) | | 122 (100%) | |
Histology | ||||||
Adenocarcinoma | 181 (76.4%) | | 217 (62.1%) | | 62 (50.8%) | |
Squamous cell carcinoma | 56 (23.6%) | | 133 (37.9%) | | 60 (49.2%) | |
Pack-years | 40 (0–180) | 54.4 (0–184) | 46.7 (1–94.5) | 46.4 (5–192) | 35.9 (0–130) | 46.7 (0–141) |

Cohort | D. BAS cohort | | E. BALs cohort | | F. Sputum cohort | |
---|---|---|---|---|---|---|
Patients | Lung cancer patient (n = 51) | Cancer-free donor (n = 29) | Lung cancer patient (n = 82) | Cancer-free donor (n = 29) | Lung cancer patient (n = 72) | Cancer-free donor (n = 26) |
Age (years) | 65.6 (47–85) | 64.0 (35–87) | 62.1 (38–83) | 57.5 (30–82) | 65.1 (40–83) | 52.7 (29–69) |
Gender | ||||||
Male | 46 (90.2%) | 16 (55.2%) | 66 (80.4%) | 19 (65.5%) | 62 (86.1%) | 17 (65.4%) |
Female | 5 (9.8%) | 10 (34.5%) | 16 (19.6%) | 9 (31.0%) | 7 (9.7%) | 9 (34.6%) |
Unknown | 0 (0%) | 3 (10.3%) | 0 (0%) | 1 (3.5%) | 3 (4.2%) | |
Smoking history | ||||||
Current or former smoker | 45 (88.2%) | 16 (55.2%) | 42 (51.2%) | 15 (51.7%) | 62 (86.1%) | 20 (76.9%) |
Nonsmoker | 4 (7.8%) | 8 (27.6%) | 39 (47.5%) | 12 (41.3%) | 7 (9.7%) | 6 (23.1%) |
Unknown | 2 (4.0%) | 5 (17.2%) | 1 (1.3%) | 2 (7%) | 3 (4.2%) | 0 (0%) |
Stage | ||||||
I | 5 (9.8%) | | 17 (20.7%) | | 12 (16.7%) | |
II | 6 (11.8%) | | 8 (9.8%) | | 13 (18.0%) | |
III | 21 (41.2%) | | 20 (24.4%) | | 23 (32.0%) | |
IV | 18 (35.3%) | | 18 (22.0%) | | 19 (26.4%) | |
Unknown | 1 (1.9%) | | 19 (23.1%) | | 5 (6.9%) | |
Histology | ||||||
Adenocarcinoma | 17 (33.3%) | | 25 (30.5%) | | 38 (52.7%) | |
Squamous cell carcinoma | 19 (37.3%) | | 40 (48.8%) | | 24 (33.3%) | |
Large cell carcinoma | 2 (4.0%) | | 2 (2.4%) | | 2 (3%) | |
Small cell carcinoma | 2 (4.0%) | | 12 (14.6%) | | 5 (7%) | |
NSCLC (NOS) | 11 (21.4%) | | 3 (3.7%) | | 3 (4%) | |
Pack-years | 49.6 (0–120) | 32.4 (0–100) | 45.5 (0–120) | 26.3 (0–90) | 49.1 (0–120) | 24.1 (0–114) |
NOTE: Data are average (range) or number (%).
Abbreviations: NSCLC (NOS), not otherwise specified.
Procedures
Preparation of lung specimens.
DNA was extracted from minimally invasive and noninvasive specimens using a standard phenol-chloroform extraction method. DNA from FFPE tissue blocks was extracted from two sequential unstained sections, each 10 μm thick. For each sample of tumor tissue, subsequent sections were stained with hematoxylin and eosin for histologic confirmation of the presence (>50%) of tumor cells. Unstained tissue sections were deparaffinized, and DNA was extracted using the same protocol as for minimally invasive specimens. Extracted DNA was checked for integrity by 1.3% agarose gel electrophoresis and for quantity by PicoGreen quantification. Bisulfite conversion of 500 ng of DNA for each sample was performed according to the manufacturer's recommendations.
Data prefiltering.
The DNA methylation status of approximately 450,000 CpG sites, measured with the Infinium 450K Methylation array, was available from our previous study (19). The methylation score of each CpG was represented as a beta (β) value; values had previously been normalized by color bias adjustment, background level adjustment, and quantile normalization across arrays. Probe and sample filtering involved a two-step process to remove SNP-associated probes and unreliable β values with a high detection P value (P > 0.001). Sex chromosome probes were also removed. After filtering, the remaining 409,219 CpGs were considered valid for the study. Stage I patients were then selected, yielding 237 lung tumor samples and 25 histologically nontumoral lung tissue samples (Fig. 1).
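The probe-level part of this prefiltering can be illustrated with a minimal sketch. The probe identifiers, the dictionary keys (`chrom`, `overlaps_snp`, `detection_p`), and the rule that a probe is dropped as soon as any sample exceeds the detection P value cutoff are assumptions made for illustration; the text does not specify how failures were aggregated across samples.

```python
def keep_probe(probe, p_max=0.001):
    """Return True if a probe survives the three prefilters.

    `probe` is a hypothetical record with keys:
      chrom        - chromosome name, e.g. "chr12"
      overlaps_snp - True if the probe is SNP-associated
      detection_p  - list of detection P values, one per sample
    """
    if probe["chrom"] in ("chrX", "chrY"):   # sex chromosome probes are removed
        return False
    if probe["overlaps_snp"]:                # SNP-associated probes are removed
        return False
    # Assumption: one unreliable measurement (P > p_max) discards the probe.
    return all(p <= p_max for p in probe["detection_p"])


# Illustrative records only; not taken from the study data.
probes = {
    "cg_example_a": {"chrom": "chr12", "overlaps_snp": False, "detection_p": [1e-6, 2e-5, 1e-4]},
    "cg_example_b": {"chrom": "chrX", "overlaps_snp": False, "detection_p": [1e-6, 1e-6, 1e-6]},
    "cg_example_c": {"chrom": "chr5", "overlaps_snp": True, "detection_p": [1e-6, 1e-6, 1e-6]},
    "cg_example_d": {"chrom": "chr1", "overlaps_snp": False, "detection_p": [1e-6, 0.02, 1e-5]},
}
valid = [name for name, record in probes.items() if keep_probe(record)]
print(valid)
```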
Epigenetic signature in lung primary tumor patients using genome-wide DNA methylation datasets. A, DNA methylation levels of selected genes (Branched Chain Aminoacid Transaminase 1 -BCAT1-, Cysteine Dioxygenase type 1 -CDO1-, Tripartite Motif Containing 58 -TRIM58-, zinc finger protein 177 -ZNF177- and Crystallin, Gamma D -CRYGD-) in primary tumor samples from patients with lung cancer and nontumoral specimens using our FP7 Curelung dataset. B, validation of DNA methylation values using publicly available dataset from TCGA. C, expression values for the gene candidates using the TCGA database. P values for all the analyses were calculated using the two-sided Mann–Whitney U test. NT (light gray circle dots) stands for nontumoral and T (dark gray square dots) for tumor. ***, P < 0.001.
Data filtering for hypomethylated CpGs in nontumoral tissues.
The choice of the region to be studied is one of the critical challenges in establishing a clinically useful DNA methylation biomarker. The investigated region should ideally fulfill the following criteria: first, the region should be unmethylated in nontumor cases and methylated in lung cancer cases; and second, the methylation levels of this region should clearly allow the classification of a sample as noncancerous or cancerous (20, 21). We set thresholds to select CpGs that were homogeneously unmethylated in nontumor cases: (i) average and median β values lower than 0.1 and (ii) 90th percentile of β values in control donors lower than 0.2. Using these restrictive thresholds, we obtained 133,444 filtered CpGs.
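As a rough illustration of thresholds (i) and (ii), the following sketch applies them to the nontumor β values of a single CpG. The probe names and β values are invented, and the linear-interpolation percentile is an assumption, since the text does not state which percentile definition was used.

```python
from statistics import mean, median


def percentile(values, q):
    """Percentile with linear interpolation (q in [0, 100]); definition assumed."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def is_unmethylated_in_controls(betas, center_max=0.1, p90_max=0.2):
    """Threshold (i): mean and median < 0.1; threshold (ii): 90th percentile < 0.2."""
    return (mean(betas) < center_max
            and median(betas) < center_max
            and percentile(betas, 90) < p90_max)


# Hypothetical nontumor beta values, for illustration only.
nontumor_betas = {
    "cg_example_1": [0.02, 0.03, 0.05, 0.04, 0.06],  # uniformly low
    "cg_example_2": [0.02, 0.03, 0.05, 0.04, 0.60],  # one methylated control
    "cg_example_3": [0.30, 0.35, 0.28, 0.40, 0.33],  # methylated in controls
}
kept = [cpg for cpg, betas in nontumor_betas.items()
        if is_unmethylated_in_controls(betas)]
print(kept)
```

Requiring both a low center (mean and median) and a low 90th percentile rejects CpGs such as the second example, where a single methylated control would otherwise be hidden by a low median.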
Differentially methylated CpG identification between tumor and nontumor samples.
Differentially methylated CpGs between tumor and nontumor groups were identified using the following procedure: for each probe/CpG, the sets of methylation β values T (belonging to the tumor samples: first group) and NT (belonging to the nontumor lung tissue samples: second group) were compared. The following three measures were calculated:
(1) Difference in average β values between groups higher than a set threshold: MD = |μT − μNT| > 0.20.
(2) Multiple-testing-corrected P value, at 95% confidence, to assign significantly differentially methylated sites: false discovery rate (FDR)-adjusted P value < 0.05.
(3) To maximize the separation between the tumor and nontumor groups, the difference between quartile 1 in the tumor group and quartile 3 in the nontumor group should be higher than a selected cut-off: P25 (tumor) − P75 (nontumor) > 0.1.
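The three criteria above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the raw P values are taken as given (the Statistical analysis section names the Wilcoxon rank sum test and the Benjamini and Hochberg procedure), the percentile uses linear interpolation, and all β values are invented.

```python
from statistics import mean


def percentile(values, q):
    """Percentile with linear interpolation (q in [0, 100]); definition assumed."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def passes_criteria(tumor, nontumor, adj_p, md_min=0.20, fdr_max=0.05, gap_min=0.1):
    """Criteria (1) mean difference, (2) FDR-adjusted P, (3) quartile gap."""
    mean_diff = abs(mean(tumor) - mean(nontumor))
    quartile_gap = percentile(tumor, 25) - percentile(nontumor, 75)
    return mean_diff > md_min and adj_p < fdr_max and quartile_gap > gap_min


# Hypothetical beta values for two CpGs sharing the same control profile.
nontumor = [0.03, 0.05, 0.04, 0.06, 0.02]
homogeneous_tumor = [0.55, 0.60, 0.70, 0.65, 0.50]
heterogeneous_tumor = [0.05, 0.08, 0.90, 0.95, 0.85]

adj = bh_adjust([0.01, 0.04, 0.03, 0.005])  # illustrative raw P values
print(passes_criteria(homogeneous_tumor, nontumor, adj[0]))
print(passes_criteria(heterogeneous_tumor, nontumor, adj[0]))
```

The second CpG has a large mean difference but fails criterion (3), because a quarter of its tumor values overlap the control range; this is the kind of probe the quartile rule is meant to exclude.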
To identify early-stage cancer-related epigenetic markers with diagnostic value for both subtypes together, these three criteria (1,2,3) were evaluated in three distinct comparisons based on histologic subtypes.
(a) Adenocarcinoma (ADC) group (n = 181) versus nontumor group (n = 25): 29 differentially methylated CpGs (DMCpG) specific for ADC were identified (Supplementary Table S1A).
(b) Squamous cell carcinoma (SCC) group (n = 56) versus nontumor group (n = 25): 78 DMCpGs specific for SCC were identified (Supplementary Table S1B).
(c) Lung cancer (ADC and SCC; n = 237) versus nontumor group (n = 25): 24 significant DMCpGs were identified when both groups were analyzed together (Supplementary Table S1C).
Differentially methylated CpGs were selected using an integrative approach to rank the Infinium probes based on their methylation status and the fulfillment of all the criteria (1,2,3) and in the three different comparisons (a, b, c). Finally, 12 CpGs corresponding to 9 genes were common in the three comparisons and ranked by averaged z-score (Supplementary Fig. S1; Supplementary Table S2).
Pyrosequencing
Pyrosequencing analyses to determine CpG methylation status were performed as described previously (19) to validate the results obtained from the arrays. Briefly, a set of primers for PCR amplification and sequencing was designed using dedicated software (PyroMark assay design version 2.0.01.15). Primer sequences were designed to hybridize with CpG-free sites to ensure methylation-independent amplification (Supplementary Table S2A). DNA was converted using the EZ DNA Methylation Gold (ZYMO RESEARCH) bisulfite conversion kit following the manufacturer's recommendations and used as a template for the subsequent PCR step. PCR was performed under standard conditions with biotinylated primers to allow conversion of the PCR product to single-stranded DNA templates. We used the Vacuum Prep Tool (Biotage) to prepare single-stranded PCR products according to the manufacturer's instructions. PCR products were checked on 2% agarose gels before pyrosequencing. Pyrosequencing reactions and methylation quantification were performed in a PyroMark Q24 System version 2.0.6 (Qiagen) using appropriate reagents and protocols, and the methylation value was obtained from the average of the CpG dinucleotides included in the sequence analyzed, with a minimum of 3 valid CpGs per primer. Only average methylation values within the analyzed region with a coefficient of variation lower than 1 were accepted as valid. Controls to assess correct bisulfite conversion of the DNA were included in each run, as well as sequencing controls to ensure the fidelity of the measurements.
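The acceptance rule for a pyrosequenced region (at least 3 valid CpGs and a coefficient of variation below 1) can be sketched as below. The use of the sample standard deviation, the `None` marker for a failed CpG, and the handling of an all-zero region are assumptions made for illustration; the values are invented.

```python
from statistics import mean, stdev


def region_methylation(cpg_values, min_valid=3, cv_max=1.0):
    """Mean methylation (%) of a region, or None if the region fails QC.

    cpg_values: per-CpG methylation percentages; None marks a CpG that
    failed the instrument's quality check (hypothetical convention).
    """
    valid = [v for v in cpg_values if v is not None]
    if len(valid) < min_valid:
        return None                      # fewer than 3 valid CpGs
    avg = mean(valid)
    if avg == 0:
        return 0.0                       # assumption: CV undefined, region fully unmethylated
    cv = stdev(valid) / avg              # assumption: sample SD divided by mean
    return avg if cv < cv_max else None


print(region_methylation([40, 50, 60, None]))   # consistent CpGs
print(region_methylation([40, None, None, 50])) # too few valid CpGs
print(region_methylation([1, 0, 30]))           # CpGs disagree strongly
```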
Statistical analysis
Data were summarized by mean, SD, median, and first and third quartiles in the case of continuous variables and by relative and absolute frequencies in the case of categorical variables. Differences in expression values and methylation levels among groups were assessed using the nonparametric Wilcoxon rank sum test. Receiver operating characteristic (ROC) curves were used to assess the predictive capacity of each marker. The area under the curve (AUC) was computed for each ROC curve, and 95% confidence intervals (CI) were estimated by bootstrapping with 1,000 iterations. A predictive model for each sample type was built by including all selected markers in a multivariable logistic regression model. ROC curves and AUCs were also computed for the predictive models. Calibration of the models was assessed by plotting predicted versus observed values obtained by bootstrap resampling of the original data. Internal validation of the models was performed using 10-fold cross-validation. The final predictive models were represented as nomograms to facilitate their use by clinicians. Sensitivity and specificity were estimated at the optimal cut-off point according to the Youden criterion. In addition, the sensitivity and specificity curves were estimated for the whole range of predictions of the model to allow for personalized decisions in different clinical scenarios. Globally, a two-tailed P value of less than 0.05 was considered to indicate statistical significance. P values were adjusted for multiple comparisons using the FDR procedure of Benjamini and Hochberg. All statistical analyses were performed using R software (version 3.2.0) and the pROC R package (version 1.7.3).
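The ROC-related quantities named above (AUC, a bootstrap confidence interval, and the Youden cut-off) can be illustrated with a short standard-library sketch. It is not a reimplementation of the pROC analysis: the stratified percentile bootstrap, the tie handling, and the rule that a score at or above the cut-off counts as positive are assumptions, and the scores are invented.

```python
import random


def auc(scores, labels):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Cut-off maximizing sensitivity + specificity - 1 (positive if score >= cut-off)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        sens = sum(s >= c and y == 1 for s, y in zip(scores, labels)) / n_pos
        spec = sum(s < c and y == 0 for s, y in zip(scores, labels)) / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:   # first maximum wins on ties
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI from a stratified bootstrap (cases and controls resampled separately)."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(n_boot):
        bp = rng.choices(pos, k=len(pos))
        bn = rng.choices(neg, k=len(neg))
        stats.append(auc(bp + bn, [1] * len(bp) + [0] * len(bn)))
    stats.sort()
    return stats[round(n_boot * alpha / 2)], stats[round(n_boot * (1 - alpha / 2)) - 1]


# Hypothetical model outputs (1 = lung cancer, 0 = cancer-free).
scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(auc(scores, labels))            # 0.9375
print(youden_cutoff(scores, labels))  # (0.35, 1.0, 0.75)
print(bootstrap_auc_ci(scores, labels))
```

With only eight samples the interval is wide; the example is meant to show the mechanics, not a realistic precision.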
Results
Identification and validation of cancer-related methylated genes
The discovery cohort consisted of 237 stage I non–small cell lung cancer (NSCLC) primary tumors and 25 matched nontumoral lung tissues from the publicly available CURELUNG FP7 dataset (19). DMCpGs were identified by genome-wide DNA methylation analysis. In this cohort (Table 1A), lung ADC (n = 181, 76.4%) was the most frequent histologic subtype, followed by SCC (n = 56, 23.6%). To obtain highly cancer-specific biomarkers, we focused our analysis on regions deeply hypomethylated in nontumoral tissues. After data filtering and analysis with restrictive criteria (Supplementary Fig. S1; Supplementary Tables S1 and S2), we obtained 12 significant DMCpGs common to both subtypes of NSCLC, corresponding to 9 different genes. In cancer cells, hypermethylation of CpG islands (CGI) is a principal epigenetic mechanism of gene regulation that has been proposed as a relevant biomarker with diagnostic value (22). Therefore, the top 5 hypermethylated CGI-containing genes were selected as candidate biomarkers for further validation in NSCLC: BCAT1, CDO1, TRIM58, ZNF177, and CRYGD (for an extended explanation, see Materials and Methods, Fig. 1A, and Supplementary Table S2B).
To confirm these results, we evaluated the DMCpGs of the 5 selected biomarkers in an independent cohort (350 stage I NSCLC patients; 62 nontumoral lung samples) from TCGA public database. The clinical characteristics of this cohort (Table 1B) resembled the previous discovery cohort, including 217 (62.1%) ADCs and 133 (37.9%) SCCs. As expected, the methylation levels of the 5 selected genes were similar to the discovery cohort with difference in median values for each gene (ΔBCAT1: 59%; ΔCDO1: 40%; ΔTRIM58: 50%; ΔZNF177: 46%; ΔCRYGD: 40%) and all with P values lower than 0.001 (Fig. 1B). In addition, no significant differences were found between ADCs and SCCs (Supplementary Fig. S2A). These data confirmed our previous results, suggesting that the methylation of the 5 selected biomarkers is a common feature for both NSCLC subtypes despite their differences at histologic and molecular level.
Epigenetic silencing of the cancer-specific hypermethylated genes in lung cancer primary tumors
Gene expression analysis of the TCGA cohort samples showed significantly decreased expression of BCAT1, CDO1, TRIM58, and ZNF177 (Fig. 1C). However, no expression values were detected for CRYGD, and this gene was discarded from further analysis. Interestingly, expression results were also obtained for ADCs and SCCs separately (Supplementary Fig. S2B). Moreover, promoter hypermethylation of multiple consecutive CpGs is recognized as an important mechanism by which genes may be silenced in both physiologic and pathologic conditions (23). This mechanism of gene silencing has also been shown to play a relevant functional role in the development and progression of many common human tumors (24). In this regard, analyzing the CURELUNG and TCGA datasets, we observed a similar methylation pattern between the significant DMCpGs of the selected biomarkers and their surrounding CpGs (Supplementary Fig. S3). These results reinforced the role of DNA methylation in the functional regulation of BCAT1, CDO1, TRIM58, and ZNF177. Importantly, the data obtained suggest that the methylation values of these four genes represent an epigenetic signature that may be relevant in early steps of lung carcinogenesis.
Diagnostic utility of the epigenetic signature to detect lung cancer in primary tumors
Once the epigenetic signature was established (BCAT1, CDO1, TRIM58, and ZNF177), we evaluated the ability of each individual biomarker of the four-gene panel to detect lung cancer in primary tumors by using pyrosequencing. This technique is a suitable approach in a clinical setting because it is a quantitative and reproducible method able to measure multiple CpGs not only in FFPE tissues but also in minimally invasive and noninvasive samples such as biologic fluids. Therefore, an independent cohort of FFPE primary tumors (122 stage I NSCLC and 79 nonmalignant lung samples) was recruited, and DNA methylation levels for all selected genes were determined by pyrosequencing. Clinical characteristics for this cohort are described in Table 1C. The four biomarkers had significantly higher levels of DNA methylation in tumor samples as compared with nontumoral controls (Fig. 2A). Next, ROC analysis was performed to assess the diagnostic value of each individual biomarker to detect lung cancer. Importantly, all the genes of the signature showed significant areas under the ROC curve (AUC) greater than 0.8 (AUC for BCAT1 = 0.94, CDO1 = 0.84, TRIM58 = 0.97, and ZNF177 = 0.94), suggesting high accuracy of these biomarkers for NSCLC diagnosis (Fig. 2B). Similarly, when samples were classified on the basis of histologic subtype (ADC and SCC), we observed significant differences in methylation status for all the biomarkers (Supplementary Fig. S4A) and AUCs close to 1.0 (Supplementary Fig. S4B and S4C). These results confirmed the diagnostic value of evaluating DNA methylation levels by locus-specific PCR-based techniques, such as pyrosequencing.
Epigenetic signature in paraffin samples using pyrosequencing. A, DNA methylation levels of candidate genes in paraffin-embedded sections from patients with lung cancer and control donors. P values for all the analyses were calculated using the two-sided Mann–Whitney U test. NT (light gray circle dots) stands for nontumoral and T (dark gray square dots) for tumor. ***, P < 0.001. B, ROC curves and area under the curve (AUC) with 95% confidence intervals for the candidate genes.
Validation of the epigenetic signature for lung cancer diagnosis using minimally invasive respiratory samples: bronchial aspirates and bronchioalveolar lavages
One of the most important aspects of early diagnostics is to identify markers associated with cancer using minimally invasive methods for sample collection (25). In line with this, we collected an independent cohort of BAS from patients diagnosed with lung cancer (n = 51) and cancer-free patients (n = 29; Table 1D). This cohort included different lung cancer subtypes, especially ADC and SCC. We compared the median methylation levels by pyrosequencing and generated ROC curves to assess the performance of each marker independently. Airway fluids from lung cancer patients presented significant differences in DNA methylation levels and high AUCs for all four genes (Fig. 3A and B). Combination of BCAT1, CDO1, TRIM58, and ZNF177 in a logistic regression model yielded a significant AUC of 0.91 [95% CI (0.83–0.98), P < 0.001; Fig. 3C]. Calibration of the model showed no evident deviations from the ideal identity slope (data not shown). Internal validation of the AUC estimate for this model yielded an optimism-corrected AUC of 0.90, indicating that the predictive capacity of the model generalizes well to future samples. There were also no evident differences in prediction accuracy between early and late tumor stages.
Epigenetic signature in bronchial aspirates using pyrosequencing. A, DNA methylation levels in bronchial aspirates from patients with lung cancer and control donors. NT (light gray circle dots) stands for nontumoral and T (dark gray square dots) for tumor. P values for all the analyses were calculated using the two-sided Mann–Whitney U test. ***, P < 0.001. B, ROC curves and areas under the curve (AUC) for the selected genes. C, the AUC for the combined signature using a logistic regression model. D, sensitivity and specificity profiles for the different possible cut-off values of the results from the logistic regression model.
A visual representation of the methylation profile for the genes included in the model is provided as a heatmap (Supplementary Fig. S5A). A nomogram based on the results of this model is proposed as a predictive tool for clinical diagnostic use. The nomogram provides each patient with an individual probability (0%–100%) of having lung cancer (Supplementary Fig. S5B and Materials and Methods). Sensitivity and specificity at the optimal cutoff (probability of cancer; POC = 63%) were 84.6% and 81.0%, respectively (Fig. 3D). Evaluation of the full range of model predictions shows that shifting the cutoff to POC = 30% would yield a sensitivity of 100% and a specificity of 65.4%, whereas shifting it to POC = 80% would yield a sensitivity of 71.4% and a specificity of 92.3%. It is important to point out that current protocols for lung cancer diagnosis are based mainly on bronchioalveolar cytology followed by lung biopsy. In some cases the cytology is doubtful or inconclusive. Moreover, in a notable number of cases cytology and biopsy are negative for cancer cells despite a high suspicion of cancer. Our results not only improve on the overall prediction accuracy of BAS cytology in this cohort (sensitivity = 43.8%, specificity = 100%), but also permit a flexible and personalized approach for clinicians in every possible scenario by simply adapting the cut-off value of the probabilistic model. In our cohort, 24 of 51 tumor samples were misclassified as nontumoral by the cytology test. However, with our predictive epigenetic model and a threshold of 50% probability of cancer, 19 of the 24 false-negative cytologies (79%) would have been considered positive (Supplementary Table S3). Of note, the majority of them (16/24) had a predicted probability of cancer higher than 80%. In addition, three were classified as borderline nontumor, with a predicted probability of cancer between 40% and 50%. In these three doubtful cases, clinical patient management would require closer follow-up. This led us to propose our epigenetic signature as a useful clinical diagnostic tool in BAS specimens, especially in doubtful cases.
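The trade-off described above, in which a lower cutoff raises sensitivity at the cost of specificity, can be illustrated with a short Python sketch. The predicted probabilities and labels below are hypothetical and do not reproduce the cohort figures; `sens_spec` is a helper name introduced only for this example.

```python
def sens_spec(probs, labels, cutoff):
    """Sensitivity and specificity when a sample is called positive
    if its predicted probability of cancer is >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted probabilities (not study data): 1 = cancer, 0 = control.
probs = [0.95, 0.85, 0.70, 0.45, 0.35, 0.60, 0.25, 0.10, 0.05, 0.40]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for cutoff in (0.30, 0.50, 0.80):
    sensitivity, specificity = sens_spec(probs, labels, cutoff)
    print(f"cutoff {cutoff:.2f}: sensitivity {sensitivity:.2f}, "
          f"specificity {specificity:.2f}")
```

Scanning every possible cutoff in this way produces the kind of sensitivity and specificity profile shown in panel D of the figures.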
In addition, we evaluated DNA methylation levels in BAL from patients with lung cancer (n = 82) as compared with patients with nonmalignant lung diseases (n = 29; Table 1E). The methylation levels of the four markers were significantly higher in BAL fluid from cancer patients than from noncancer patients (Fig. 4A). AUCs were significant for all four genes, with the following values: BCAT1, 0.80; CDO1, 0.65; TRIM58, 0.72; and ZNF177, 0.66 (Fig. 4B). Combining the four genes in a logistic regression model achieved a significant AUC of 0.85 [95% CI (0.78–0.93); P < 0.001], with an optimism-corrected value of 0.83 (Fig. 4C). Evaluation of the full range of predictions of the model is also shown (Fig. 4D). As with BAS specimens, our epigenetic signature may be highly valuable in doubtful cases with negative cytology.
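An optimism-corrected AUC is commonly obtained by bootstrap resampling: the model is refitted on each resample, and the average drop in AUC when that refitted model is applied back to the original data is subtracted from the apparent AUC. The Python sketch below illustrates the idea under stated assumptions and is not the authors' procedure. The data are synthetic, and a simple mean-difference score (`fit_mean_difference`, a hypothetical stand-in) replaces logistic regression to keep the example dependency-free.

```python
import random


def auc(scores, labels):
    """ROC AUC: fraction of (case, control) pairs ranked correctly, ties = 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in cases for b in controls)
    return wins / (len(cases) * len(controls))


def fit_mean_difference(X, y):
    """Stand-in model (NOT logistic regression): weight each marker by the
    difference between its mean in cases and its mean in controls."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    weights = [
        sum(r[j] for r in cases) / len(cases)
        - sum(r[j] for r in controls) / len(controls)
        for j in range(len(X[0]))
    ]
    return lambda row: sum(w * v for w, v in zip(weights, row))


def optimism_corrected_auc(X, y, fit, n_boot=200, seed=0):
    """Return (apparent AUC, apparent AUC minus mean bootstrap optimism)."""
    rng = random.Random(seed)
    n = len(y)
    model = fit(X, y)
    apparent = auc([model(r) for r in X], y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # resample lacks one class; draw again
            continue
        Xb = [X[i] for i in idx]
        boot_model = fit(Xb, yb)
        auc_boot = auc([boot_model(r) for r in Xb], yb)  # on the resample
        auc_orig = auc([boot_model(r) for r in X], y)    # on the original data
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / len(optimism)


# Synthetic two-marker data (not study data): 30 cases, 20 controls.
data_rng = random.Random(1)
X = ([[data_rng.gauss(30, 10), data_rng.gauss(20, 10)] for _ in range(30)]
     + [[data_rng.gauss(10, 10), data_rng.gauss(15, 10)] for _ in range(20)])
y = [1] * 30 + [0] * 20

apparent, corrected = optimism_corrected_auc(X, y, fit_mean_difference)
print(f"apparent AUC = {apparent:.3f}, optimism-corrected AUC = {corrected:.3f}")
```

A small gap between the apparent and corrected values, such as the 0.85 versus 0.83 reported here, suggests limited overfitting.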
Epigenetic signature in bronchioalveolar lavages using pyrosequencing. A, DNA methylation levels in bronchioalveolar lavages from patients with lung cancer and control donors. NT (light gray circle dots) stands for nontumoral and T (dark gray square dots) for tumor. P values for all the analyses were calculated using the two-sided Mann–Whitney U test. ***, P < 0.001; *, P < 0.05. B, ROC curves and AUC for the selected genes. C, the AUC for the combined signature using a logistic regression model. D, sensitivity and specificity profiles for the different possible cut-off values of the results from the logistic regression model.
Validation of epigenetic biomarkers in noninvasive sputum samples
Finally, the methylation levels of these four markers were examined in additional noninvasive samples. Sputum samples from 72 lung cancer patients and 26 cancer-free individuals were evaluated (Table 1F). Methylation levels were significantly higher in individuals with lung cancer for all genes tested except CDO1 (Fig. 5A). Individual AUC values were as follows: BCAT1, 0.92; CDO1, 0.67; TRIM58, 0.67; and ZNF177, 0.69 (Fig. 5B). The multivariable logistic regression model yielded an AUC of 0.93 [95% CI (0.86–1.0); P < 0.001; Fig. 5C]. Sensitivity and specificity for the different threshold values of the model are depicted in Fig. 5D. These results suggest that our markers may be of high value for detecting lung cancer even in noninvasive specimens such as sputum.
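For readers unfamiliar with how a multivariable logistic regression model turns four methylation values into a single probability of cancer, the Python sketch below shows the general form. The intercept and coefficients are hypothetical placeholders chosen for illustration; they are not the fitted values from this study, which are not given in this section.

```python
import math

# Hypothetical intercept and per-gene coefficients (log-odds per methylation
# percentage point). These are NOT the fitted values from the study.
INTERCEPT = -3.0
COEFS = {"BCAT1": 0.08, "CDO1": 0.03, "TRIM58": 0.04, "ZNF177": 0.05}


def probability_of_cancer(methylation):
    """Map methylation percentages for the four genes to a probability
    through the logistic function applied to a linear predictor."""
    linear_predictor = INTERCEPT + sum(
        COEFS[gene] * methylation[gene] for gene in COEFS
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Two invented samples: one unmethylated, one with elevated methylation.
p_low = probability_of_cancer({"BCAT1": 0, "CDO1": 0, "TRIM58": 0, "ZNF177": 0})
p_high = probability_of_cancer({"BCAT1": 40, "CDO1": 20, "TRIM58": 30, "ZNF177": 25})
print(f"unmethylated sample: {p_low:.3f}; methylated sample: {p_high:.3f}")
```

A nomogram is a graphical rendering of the same calculation: each gene contributes points in proportion to its coefficient, and the total is read off against a probability scale.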
Epigenetic signature in sputum samples using pyrosequencing. A, DNA methylation levels in sputum samples from patients with lung cancer and control donors. NT (light gray circle dots) stands for nontumoral and T (dark gray square dots) for tumor. P values for all the analyses were calculated using the two-sided Mann–Whitney U test. ***, P < 0.001; **, P < 0.01; *, P < 0.05. B, ROC curves and AUC for the selected genes. C, the AUC for the combined signature using a logistic regression model. D, sensitivity and specificity profiles for the different possible cut-off values of the results from the logistic regression model.
Discussion
Lung cancer is the leading cause of cancer-related death worldwide, with 1.3 million deaths annually according to 2011 data from the World Health Organization (WHO). Late diagnosis is one of the main reasons for the extremely high mortality of this disease. On the one hand, screening by means of low-dose helical CT (LDCT) has been shown to reduce mortality in a large randomized trial (26); however, its positive predictive value is still low. On the other hand, the low sensitivity of minimally invasive cytology is also a hurdle for the accurate diagnosis of lung cancer. Thus, lung cancer diagnosis using minimally invasive and noninvasive strategies remains a major challenge, and its refinement is urgently needed to reduce lung cancer mortality worldwide. Here, we searched for powerful biomarkers using the two largest publicly available databases (FP7 Curelung and TCGA; ref. 19) with high-throughput data from Infinium 450k arrays. Only stage I cancer cases were selected, to identify the molecular changes associated with earlier steps of cancer evolution. We developed an integrative approach to identify the most discriminative marks, leading to a final epigenetic signature consisting of the top four selected genes: CDO1, BCAT1, TRIM58, and ZNF177. We conducted several validation steps in minimally invasive and noninvasive cohorts to define a consistent epigenetic model useful for early lung cancer diagnosis and valid for both major histologic subtypes. This signature yielded notably high specificity, one of the Achilles heels of LDCT and other methylation biomarkers (27,28), and also improved sensitivity, which is generally limited when cytology is used for early lung cancer diagnosis.
The current results highlight the relevance of DNA methylation changes in the natural history of lung cancer. CpG island hypermethylation of MGMT and GSTP1 has already proven useful for predicting chemotherapy response in gliomas (29–31) and for prostate cancer screening (32,33), respectively. DNA methylation biomarkers have been proposed as promising candidates for early diagnosis (20,21) for several reasons: they are covalent, stable marks, and they occur as early events in carcinogenesis, even in pretumoral stages such as adenomatous hyperplasia of the lung (34). Great efforts have been made to identify suitable DNA methylation markers to improve lung cancer diagnosis. However, only one biomarker, SHOX2 methylation, has been commercialized to date (10,35), although it is not routinely used in the clinic.
It is noteworthy that cancer-specific DNA methylation of our selected biomarkers correlated with gene silencing in lung primary tumors. This suggests a potential functional role with biologic implications in the early stages of this pathologic process (36). To our knowledge, one recent study has addressed this issue with a different approach that also took advantage of the TCGA database: Wrangle and colleagues identified a three-gene panel (CDO1, HOXA9, and TAC1) for detecting NSCLC (12). Among other differences, they focused on genes reexpressed after treatment with demethylating agents and used TCGA as the only database, incorporating tumors of all stages rather than stage I only. Interestingly, despite the different strategies, CDO1 methylation was common to both studies. In addition, a study combining miRNA and gene expression arrays in three lung squamous cell carcinoma patients also identified methylation-deregulated CDO1 (37). CDO1 (cysteine dioxygenase type 1) has been postulated as a tumor suppressor gene silenced by promoter methylation in multiple human cancers, including breast, esophagus, lung, bladder, and stomach (38). As for the other genes, BCAT1 (branched chain amino-acid transaminase 1) is a cytosolic enzyme that promotes cell proliferation through amino acid catabolism (39), and a high frequency of BCAT1 promoter methylation has been reported in colorectal cancer (40). ZNF177 is a zinc finger transcription factor reported to be methylation-silenced in gastric cancer cell lines (41). TRIM58 (tripartite motif containing 58) is an E3 ubiquitin ligase superfamily member that has already been patented as a potential epigenetic marker for detecting neoplastic cells originating from lung tissue of NSCLC patients (PCT/EP2012/061852). Moreover, hypermethylation of the TRIM58 promoter has been reported in hepatocytes derived from hepatitis B virus–related hepatocellular carcinoma (42). It is also worth noting that we were very stringent in the selection of these genes, so alternative analyses of the same dataset may identify new DNA methylation biomarkers for lung cancer diagnosis in the future.
The diagnostic value of the epigenetic signature was first validated by pyrosequencing in FFPE samples from non–small cell lung primary tumors. Results from our four-gene methylation signature showed high diagnostic accuracy and were extremely similar to those obtained from public databases. Importantly, we analyzed a total of 79 nontumoral control tissues, and DNA methylation was almost negligible in the vast majority of samples, confirming previous results and supporting the potential of the selected markers. Of note, in the study by Wrangle and colleagues, methylation status was assessed using the MSP technique in a smaller number of FFPE nontumoral samples (12). We chose pyrosequencing as the targeted-region validation technique because it is an affordable and quantitative method that offsets some weaknesses of previous, extensively used methods, owing to its easy standardization and lower false-positive rate (43). Moreover, it is a robust method able to interrogate multiple CpGs not only in FFPE tissues but also in minimally invasive and noninvasive samples such as biologic fluids, with potential for use in routine clinical settings.
The performance of the epigenetic model in these types of specimens (BAS, BAL, and sputum) was outstanding despite the limited number of tumoral cells compared with FFPE samples. Improving the diagnosis of lung cancer patients represents a major challenge. Our epigenetic model provides a balanced and flexible approach able to cater to both extreme scenarios: the high sensitivity and low specificity of low-dose CT in screening programs, and the high specificity and low sensitivity of cytology (44,45) in respiratory specimens routinely used for lung cancer diagnosis. Our signature improves on the predictions of cytology by providing continuous predictions. Cytology is a useful dichotomous classifier that produces only two types of predictions: 100% positive or 0% positive (100% negative). Therefore, each output is either a complete success or a total failure. In contrast, our signature is based on a logistic regression model, represented by a nomogram, and can therefore produce a continuous range of predictions between 0% and 100% positive (46). In this way, not every prediction is a complete success or a total failure, uncertainty can be measured for each prediction, and errors are almost always lower (47). A clinician could take different actions according to the (un)certainty of the predictions, for example performing additional tests in borderline cases. Consider a hypothetical situation in which our model classifies two samples as negative with different probabilities of being positive, such as 5% and 49%: the binary classifier (cytology) would have output only absolute responses, negative and negative. No information about uncertainty or about the chances of being positive for patient 1 (very low) and patient 2 (almost 50%) would have been delivered. Combining the current diagnostic protocol with new epigenetic nomograms will be of great help for the diagnosis of lung cancer and, consequently, for improving the outcome of lung cancer patients (48).
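One common way to quantify the error of probabilistic versus all-or-nothing predictions is the Brier score, the mean squared difference between the predicted probability and the observed outcome (lower is better). The Python sketch below applies it to the two hypothetical patients above plus two invented positive cases. The outcomes are assumptions made for illustration, and the Brier score is offered as one possible metric, not necessarily the one used in ref. 47.

```python
def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


# Hypothetical predicted probabilities of cancer; the first two match the
# 5% and 49% example in the text. Outcomes are invented (1 = cancer).
continuous = [0.05, 0.49, 0.90, 0.60]
outcomes = [0, 1, 1, 1]

# A dichotomous classifier reports only 0 or 1; here the same predictions
# are thresholded at 50% to mimic that behavior.
dichotomous = [1.0 if p >= 0.5 else 0.0 for p in continuous]

brier_continuous = brier_score(continuous, outcomes)
brier_dichotomous = brier_score(dichotomous, outcomes)
print(f"continuous: {brier_continuous:.4f}, dichotomous: {brier_dichotomous:.4f}")
```

In this toy case the patient predicted at 49% who turns out to have cancer counts as a total failure for the dichotomous output but only a partial error for the continuous one, so the continuous predictions score lower.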
In summary, we have identified and independently validated a powerful epigenetic signature for the diagnosis of lung cancer in minimally invasive and noninvasive samples. Genome-wide DNA methylation analyses led us to identify four candidates that have been tested not only in publicly available datasets but also in extensive, independent cohorts of respiratory samples. Once validated in intended-use samples, such as those from patients with suspicious indeterminate lung nodules, the epigenetic model identified here might be extremely helpful in addressing these limitations of current diagnostic protocols and in defining more precise screening programs for lung cancer. In addition, novel and more sensitive methods currently in development, such as Methyl-Beaming or droplet digital PCR (49), could enhance the diagnostic value of these markers for the management of suspicious lung nodules in the clinic or within a lung cancer screening program.
Disclosure of Potential Conflicts of Interest
J. Zulueta has ownership interest (including patents) in VisionGate, Inc. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: A. Diaz-Lagares, J. Mendez-Gonzalez, J. Sandoval
Development of methodology: A. Diaz-Lagares, J. Mendez-Gonzalez, A.B. Crujeiras, M. Esteller, J. Sandoval
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J. Mendez-Gonzalez, M. Saigi, M.J. Pajares, R. Pio, L.M. Montuenga, J. Zulueta, E. Nadal, A. Rosell, M. Esteller, J. Sandoval
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): A. Diaz-Lagares, J. Mendez-Gonzalez, D. Hervas, D. Garcia, E. Nadal
Writing, review, and/or revision of the manuscript: A. Diaz-Lagares, J. Mendez-Gonzalez, D. Hervas, L.M. Montuenga, E. Nadal, A. Rosell, J. Sandoval
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): M. Saigi, M.J. Pajares, D. Garcia, R. Pio, J. Zulueta, E. Nadal, A. Rosell
Study supervision: J. Mendez-Gonzalez, J. Sandoval
Acknowledgments
The authors thank Ana Remírez, Ana Moreno, Carles Arribas, Laia Seto, Lola González, Alejandro Fernández, and Cesar García for technical support.
Grant Support
This work was supported by the “Miguel Servet” (CP00055) grant, FIS grant (PI08/1042) from the FEDER, FSE and Carlos III Health Institute (ISCIII), and PR185/13 research grant from the Mutua Madrileña Foundation. A. Diaz-Lagares was supported by a Río Hortega (CM14/00067). A.B. Crujeiras and J. Mendez-Gonzalez are “Sara Borrell” researchers (C09/00365 and CD13/00335 from the “ISCIII”). E. Nadal is supported by a Juan Rodés fellowship from the ISCIII (JR13/0002). L.M. Montuenga, R. Pio, and M.J. Pajares' work was supported by the “Instituto de Salud Carlos III” projects PI13/00806 and RD12/0036/0040. M. Esteller is an ICREA Research Professor and J. Sandoval is a “Miguel Servet” researcher (CP13/00055). J. Sandoval, L.M. Montuenga, J. Zulueta, E. Nadal, and M. Esteller are supported by the “Red temática de investigación cooperativa en cancer” (RTICCS); groups: RD 12/0036/0062, RD12/0036/0040, RD12/0036/0039, and RD12/0036/0045.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.