Abstract
Purpose: We propose a systematic methodology to quantify incidentally identified pulmonary nodules based on observed radiological traits (semantics) quantified on a point scale and a machine-learning method using these data to predict cancer status.
Experimental Design: We investigated 172 patients who had low-dose CT images, with 102 and 70 patients grouped into training and validation cohorts, respectively. On the images, 24 radiological traits were systematically scored and a linear classifier was built to relate the traits to malignant status. The model was formed both with and without size descriptors to remove bias due to nodule size. The multivariate pairs formed on the training set were tested on an independent validation data set to evaluate their performance.
Results: The best 4-feature set that included a size measurement (set 1) was short axis, contour, concavity, and texture, which had an area under the receiver operating characteristic curve (AUROC) of 0.88 (accuracy = 81%, sensitivity = 76.2%, specificity = 91.7%). If size measures were excluded, the four best features (set 2) were location, fissure attachment, lobulation, and spiculation, which had an AUROC of 0.83 (accuracy = 73.2%, sensitivity = 73.8%, specificity = 81.7%) in predicting malignancy in primary nodules. The validation test AUROC was 0.80 (accuracy = 74.3%, sensitivity = 66.7%, specificity = 77.6%) and 0.74 (accuracy = 71.4%, sensitivity = 61.9%, specificity = 75.5%) for sets 1 and 2, respectively.
Conclusions: Radiological image traits are useful in predicting malignancy in lung nodules. These semantic traits can be used in combination with size-based measures to enhance prediction accuracy and reduce false-positives. Clin Cancer Res; 23(6); 1442–9. ©2016 AACR.
Radiological image traits have been investigated extensively for prognostication in lung CT, both in the context of lesions and nodules. Most of the studies relied on patient survival–based models using single or multiple traits. In this study, we have taken a systematic approach to describe traits on a point scale and score the patient scans for the appearance of each trait. An agnostic learning method was used in a cross-validation setting to find the relationships of the traits to malignancy status. The feature combinations were tested for reliability on a validation cohort. These combinations of radiological traits (semantics) could readily be used by the practicing clinician to provide risk assessment for pulmonary nodules, which will help to standardize radiologist inference and improve patient care. Certainly, any inference on biomarkers needs to be used with caution; one needs to account for system-level variability at the clinic, the parameter settings of the scanner, and operator precision.
Introduction
Lung cancer is the leading cause of cancer-related deaths globally and in the United States (1). Low-dose CT (LDCT) has evolved to be a sensitive imaging modality to detect pulmonary nodules. The National Lung Screening Trial (NLST), which compared LDCT and standard chest radiography (CXR) for three annual screens, found a 20% reduction in lung cancer mortality for CT compared with CXR (2). In the NLST trial, at least 39% of LDCT study participants had a nodule-positive scan during the study, and 96.4% of these were noncancerous (i.e., false-positive). The CXR arm had a lower rate of positive detection (16%) with a comparatively lower rate of false-positives (2, 3). In the NLST, it is estimated that about 18.5% [95% confidence interval (CI), 5.4%–30.6%] of the lung cancers detected in the study population were clinically insignificant and hence, overdiagnosed (4). Other studies have shown that overdiagnosis of lung cancer can be as high as 96% (5–7). Despite the high false-positive rate, the United States Preventive Services Task Force recommended lung cancer screening for high-risk individuals (8). However, debate on the effectiveness of LDCT screening still continues. The availability of abundant data has helped the development of clinical assessment models to predict probabilities of malignancy (9, 10).
Identification of malignancy continues to be a challenge even in the screening setting; patients with indeterminate pulmonary nodules (IPN) are typically monitored with scheduled follow-up scans (11). Advancements in image acquisition and improved computer-aided diagnostic tools, coupled with effective treatment strategies, have been shown to improve patient survival (12). In recent work, nodule characteristics coupled with clinical risk factors have been widely used to predict malignancy (13–15). Although the NLST shows that <4% of the subjects with noncalcified nodules were diagnosed with lung cancer within a year (16), at present, there is limited ability to provide individual patient-level risk (17).
In this work, we focus on quantifying radiological imaging characteristics of nodules (shape, location, and texture), including associated structures of the lung, and relate them to cancer status. We followed a rigorous approach to find the optimal imaging characteristics in a training cohort using cross-validation methodology and validated them in a test cohort dataset. These quantitative predictive scores [accuracy and/or area under the receiver operating characteristic curve (AUROC)] obtained from images were used to develop classifier models to identify risk of malignancy (see Fig. 1).
Study design to find discriminant semantic features. The blocks describe the methodology followed in the article. Radiological traits observed by an expert were related to outcome in a training and validation setting.
Material and Methods
Patient cohort
In this study, we collected two cohorts (training and validation) from the Vanderbilt University Medical Center (VUMC) and the Veterans Affairs (VA) Medical Center (Nashville, TN). The training cohort had 102 patients: 42 with lung cancer and 60 with a positive scan that was not lung cancer (i.e., benign nodule). The 102 patients in the training set had 206 nodules (84 malignant, 122 benign). The validation set had 70 patients: 21 with lung cancer and 49 with a positive scan that was not lung cancer. The 70 patients in the validation cohort had 102 nodules (26 malignant, 76 benign). The patients had 2 years of follow-up from the time of CT scans, and the biopsy results confirmed their cancer status. The median patient age was 67 years (σ = 8.6) in the training set and 65 years (σ = 9.3) in the validation set. The patient samples were retrospectively curated; the first batch of patients was used in the training set and the later batch in the validation set; the cohorts had a collection time difference of 6 to 9 months.
Table 1 describes the patient demographics, and Supplementary Table S1 describes the nodule dimensions in the two cohorts. This study was approved by the Institutional Review Board (IRB) at the collecting institution (VUMC/VA Hospital, Nashville, TN) and as a retrospective study to review deidentified patient records at the collaborating institution (Moffitt Cancer Center, Tampa, FL). The requirement for patients' informed consent was waived.
Patient demographics and tumor stage for the training and testing samples
Characteristic | Training set: overall | Training set: lung cancer cases | Training set: patients with benign nodules | Validation set: overall | Validation set: lung cancer cases | Validation set: patients with benign nodules
---|---|---|---|---|---|---
Patients | 102 | 42 | 60 | 70 | 21 | 49
Nodules | 206 | 84 | 122 | 102 | 26 | 76
Age: median, mean (σ) | 67, 66.7 (8.6) | 69, 68.4 (8.7) | 65, 65.6 (8.3) | 65, 65.5 (9.3) | 69, 70.1 (7.4) | 64, 63.5 (9.5)
Gender n (%) (male/female) | 102 | 95.2%/4.7% | 98%/1.6% | 70 | 28.6%/0% | 67.1%/4.3%
Male | 99 | 40 | 59 | 68 | 20 | 47
Female | 3 | 2 | 1 | 3 | 0 | 3
Smoking, pack-years: median, mean (σ) | 46, 58.22 (42.15) | 50, 68 (53.53) | 45, 51.37 (30.57) | 45, 48.1 (37.5) | 57, 61.4 (34.4) | 40, 41.9 (37.5)
Race (Caucasian/African American/Native/others) | 92/9/0/1 | 35/7/0/0 | 57/2/0/1 | 63/2/2/3 | 21/0/0/0 | 42/2/2/3
LDCT protocol
Chest LDCT scans were performed with a single deep breath hold by using a Discovery VCT (GE Healthcare) from the base of the neck to the posterior lung gutters. The patients were scanned without intravenous contrast, and images were obtained by filtered back projection image reconstruction with a soft tissue filter to obtain a 512 × 512 matrix. CT energy was 120 kVp, with variable mAs (range, 30–400) to minimize radiation. Helical data were acquired using collimation of 40 mm, pitch 1.375:1, table speed 55 cm/sec, reconstructed as 1.25-mm pixels at contiguous 1.25-mm intervals, producing isotropic voxels. The field of view was based on patient body habitus, typically 35 to 43 cm.
Image analysis and reader agreement
All LDCT images were reviewed by a clinical radiologist (Y. Liu) with more than 6 years of experience in LDCT imaging of thoracic malignancies, who was blinded to the clinical details and final diagnosis for the nodules at the time of image interpretation. Thin-section LDCT images were displayed using both standard mediastinal (width, 350 HU; level, 40 HU) and lung (width, 1,500 HU; level, −600 HU) window-width and window-level settings. In total, 24 CT image descriptors were developed to characterize the pulmonary nodules, and these were classified into eight categories: (i) location; (ii) size; (iii) shape; (iv) margin; (v) density; (vi) internal features; (vii) external features; and (viii) associated findings; example cases are shown in Supplementary Fig. S1. Each CT descriptor was rated using either an ordinal scale or a binary categorical variable (see Supplementary Tables S2 and S3).
To measure the reproducibility of the semantic scoring metric, we randomly selected 80 patients (40 malignant and 40 benign) from the cohort and provided the scoring sheet, with the approximate anatomical location of the nodules, to two different radiologists (Y. Liu and Q. Li, who are resident radiologists with 3 years of clinical experience). The scoring sheet was used to compute the concordance of these discrete scores between readers using kappa statistics (18, 19). In radiological observations, a kappa coefficient greater than 0.8 is considered perfect agreement, 0.61 to 0.8 substantial agreement, 0.41 to 0.6 moderate agreement, 0.21 to 0.4 fair agreement, and below 0.2 poor agreement (20). Supplementary Table S4 shows the kappa coefficient and the confidence limits in different sampling cohorts. Of the 24 features, 10 semantic features had a kappa >0.95, of which 9 features exhibited a perfect score. Eight semantic features had a kappa coefficient between 0.85 and 0.95, whereas three were between 0.7 and 0.8. One feature (distribution) could not be scored due to limited examples in the sample population. The repeatability of the two size-based features (long and short axis) was evaluated by computing the intraclass correlation coefficient, which was 0.94 (0.89–0.97) and 0.96 (0.834–0.99), respectively.
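As an illustration of the agreement statistic, the following minimal Python sketch computes an unweighted Cohen's kappa for two readers scoring the same trait. The unweighted form, the function name, and the toy scores are assumptions made for illustration; the text states only that kappa statistics were used (18, 19).

```python
from collections import Counter


def cohen_kappa(reader1, reader2):
    """Unweighted Cohen's kappa for two equal-length lists of category scores."""
    n = len(reader1)
    if n == 0 or n != len(reader2):
        raise ValueError("both readers must score the same non-empty set of nodules")
    # Observed agreement: fraction of nodules given the same score by both readers.
    p_observed = sum(a == b for a, b in zip(reader1, reader2)) / n
    # Chance agreement: product of the readers' marginal frequencies per category.
    c1, c2 = Counter(reader1), Counter(reader2)
    p_chance = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both readers used one identical category throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ordinal scores (1-3) for one trait on eight nodules.
scores_reader1 = [1, 1, 2, 2, 3, 3, 1, 2]
scores_reader2 = [1, 1, 2, 2, 3, 3, 2, 2]
kappa = cohen_kappa(scores_reader1, scores_reader2)
print(round(kappa, 3))
```

With one disagreement in eight nodules, the observed agreement is 0.875, but kappa is lower because part of that agreement is expected by chance.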
Reader variability and prediction outcome.
We evaluated the effect of reader variability on the classification outcome in a subset of 80 patients, which was further divided into training and testing sets (40 patients in each) with equal numbers of patients with benign and malignant nodules. Two radiologists independently scored the semantic metrics as described in the previous section. We compared the prediction results (AUC) of the classifier when the semantic scores for the training and test samples came from the same radiologist with the results when the test scores came from a different radiologist. Supplementary Table S5 shows the AUC (with sensitivity and specificity) of the interreader classification carried out in both directions. The semantic metrics contour and concavity showed differences of 10.2% and 6.6%, respectively, in the AUC derived from different radiologists. Notably, the other semantic metrics showed less than 5% difference in the AUC.
Statistical analysis
Discriminatory analysis was conducted using a linear classifier to find the best predictive features of cancer status. The error of classification was estimated using the hold-out cross-validation method, where 80% of the sample was selected for training and 20% for testing. The process was randomized and repeated a large number of times (more than 200), and the average test accuracy (or error) was reported. For each combination, the AUROC was computed and compared with the clinical model proposed by Gould and colleagues (9). To make a comparison to the cross-validation method, clinical data were resampled using the bootstrap method, and the clinical model prediction was computed for each random partition (21–23). The average AUROC with deviation across multiple runs was reported, along with sensitivity and specificity. The classifier model was first built on the training cohort using the cross-validation method described and independently applied to the validation cohort to find the most promising feature combination.
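The repeated 80/20 hold-out procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not specify the linear classifier, so a simple centroid-difference linear score stands in for it, and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_direction(X, y):
    """Stand-in linear model: weights = mean(positives) - mean(negatives)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [mean(p[j] for p in pos) - mean(n[j] for n in neg)
            for j in range(len(X[0]))]


def repeated_holdout_auc(X, y, n_repeats=200, train_frac=0.8, seed=0):
    """Average test AUROC over random train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        if len(set(y_train)) < 2 or len(set(y_test)) < 2:
            continue  # skip degenerate splits that lack one class
        w = fit_centroid_direction([X[i] for i in train], y_train)
        scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in test]
        aucs.append(auroc(scores, y_test))
    return mean(aucs), aucs


# Synthetic cohort mirroring only the training-set class sizes (42 vs. 60).
data_rng = random.Random(1)
X = ([[data_rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(42)]
     + [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(60)])
y = [1] * 42 + [0] * 60
mean_auc, aucs = repeated_holdout_auc(X, y)
print(round(mean_auc, 2))
```

The test samples of each partition never contribute to the fitted weights, which is what keeps the averaged AUROC an estimate of out-of-sample performance.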
The feature combination that exhibited the highest sensitivity and specificity (Youden J index; refs. 24, 25) in the training set was then selected to be tested for performance on the test cohort. The final lists of top candidate features were selected on the basis of their performance on both cohorts (training and test). This approach provides an additional validation step to address typical concerns with cross-validation methods (26–28). As such, the larger cohort sample size provided better performance capabilities and hence the elevated AUCs.
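For reference, the Youden J index combines the two rates as J = sensitivity + specificity − 1. The short Python sketch below computes it from a confusion matrix. The counts are hypothetical: they were back-calculated here from the validation rates reported for feature set 1 (21 cancer and 49 benign patients) and are not taken from the source.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Youden J from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden_j = sensitivity + specificity - 1.0
    return sensitivity, specificity, accuracy, youden_j


# Assumed counts consistent with sensitivity 0.667 and specificity 0.776
# in a cohort of 21 cancer and 49 benign patients.
sens, spec, acc, j = classification_metrics(tp=14, fn=7, tn=38, fp=11)
print(round(sens, 3), round(spec, 3), round(100 * acc, 1), round(j, 3))
```

Because J weights sensitivity and specificity equally, it does not depend on the ratio of benign to malignant cases, unlike raw accuracy.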
Integrity of training and testing samples was independently maintained without mixing the samples. We evaluated the overall survival (OS) difference between the classifier-discriminated patient population using Kaplan–Meier survival plots and P value by the log-rank test.
Finding the best set of features is a challenge. Various feature reduction methods have been proposed in the past; most often these methods have a range of performance, typically dependent on the complexity of the datasets (29). We used an exhaustive search to find the best-performing feature set, evaluating all possible feature combinations (up to four dimensions, more than 12,650 combinations). The top discriminating features are reported in Table 2 and Supplementary Table S6 (all nodules in Supplementary Tables S7 and S8) for the training data, along with results of the clinical model. Accuracy (1 − error), AUROC, sensitivity, and specificity were all considered in identifying the best discriminant combinations. We then took the top discriminating feature combinations shortlisted from the training set and applied the discriminator blindly to the validation data. The error rate with sensitivity and specificity is reported in Table 3.
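The size of this search space is easy to check. The sketch below counts the subsets of 24 descriptors with one to four members and shows how such an exhaustive search could be organized. The scoring function is a placeholder assumption; in the study the score would come from the cross-validated classifier.

```python
from itertools import combinations
from math import comb

N_FEATURES = 24  # number of CT descriptors scored per nodule
MAX_DIM = 4      # largest feature-set size searched

# Number of candidate feature sets of each size, and in total.
counts = {k: comb(N_FEATURES, k) for k in range(1, MAX_DIM + 1)}
total = sum(counts.values())
print(counts, total)


def best_subsets(feature_names, score_fn, max_dim=MAX_DIM, top=5):
    """Score every subset of up to max_dim features and return the top few."""
    scored = []
    for k in range(1, max_dim + 1):
        for subset in combinations(feature_names, k):
            scored.append((score_fn(subset), subset))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]


# Toy usage with a placeholder score that simply rewards larger subsets.
toy_top = best_subsets(["a", "b", "c"], score_fn=len, max_dim=2, top=1)
print(toy_top)
```

With 24 descriptors the count is 24 + 276 + 2,024 + 10,626 = 12,950 candidate sets, which is consistent with the "more than 12,650" figure and small enough to evaluate exhaustively.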
Prediction error rate for test and train set using four best semantic features with primary nodule
A. Including size-based features (training)
No. | Features | Accuracy (error), % | Sensitivity | Specificity | E[AUC]: μ, σ (95% CI) | E[AUC], Gould (9): μ, σ (95% CI)
---|---|---|---|---|---|---
1 | Short axis (cm), contour, concavity, texture | 81.02 (18.98) | 0.762 | 0.917 | 0.88, 0.08 (0.69–0.98) | 0.57, 0.04 (0.49–0.65)
2 | Long axis (cm), border definition, vascular convergence, lymphadenopathy | 79.6 (20.4) | 0.762 | 0.95 | 0.87, 0.1 (0.53–0.98) | 0.58, 0.05 (0.48–0.69)
3 | Short axis (cm), contour, concavity, nodules in nontumor lobes | 79.02 (20.98) | 0.786 | 0.917 | 0.86, 0.09 (0.7–0.98) | 0.58, 0.06 (0.42–0.66)
4 | Short axis (cm), contour, concavity, spiculation | 82.42 (17.58) | 0.762 | 0.917 | 0.86, 0.08 (0.66–0.98) | 0.58, 0.05 (0.47–0.66)
5 | Short axis (cm), contour, spiculation, nodules in nontumor lobes | 80.9 (19.1) | 0.762 | 0.917 | 0.81, 0.1 (0.62–0.98) | 0.59, 0.05 (0.5–0.67)
B. No size-based features (training)
No. | Features | Accuracy (error), % | Sensitivity | Specificity | E[AUC]: μ, σ (95% CI) | E[AUC], Gould (9): μ, σ (95% CI)
---|---|---|---|---|---|---
1 | Location, fissure attachment, lobulation, spiculation | 73.2 (26.8) | 0.738 | 0.817 | 0.83, 0.09 (0.68–0.98) | 0.59, 0.049 (0.48–0.66)
2 | Location, fissure attachment, spiculation, vascular convergence | 73.6 (26.4) | 0.738 | 0.817 | 0.82, 0.09 (0.65–0.96) | 0.578, 0.06 (0.46–0.69)
3 | Concavity, border definition, spiculation, perinodule fibrosis | 69.3 (30.7) | 0.714 | 0.8 | 0.81, 0.09 (0.57–0.95) | 0.586, 0.056 (0.47–0.7)
4 | Concavity, border definition, spiculation, texture | 67.3 (32.7) | 0.738 | 0.733 | 0.8, 0.1 (0.57–0.98) | 0.567, 0.05 (0.49–0.67)
5 | Location, pleural attachment, spiculation, vascular convergence | 71.5 (28.5) | 0.714 | 0.817 | 0.8, 0.08 (0.7–0.97) | 0.587, 0.047 (0.51–0.69)
Validation using 70 patients with 49 being normal and 21 identified cancerous
A. Validation dataset (with size-based features), testing
No. | Features | Coefficient (a0–a5) | Accuracy (error), % | AUC | Sensitivity | Specificity
---|---|---|---|---|---|---
1 | Short axis (cm), contour, concavity, texture | (0.401033, 0.141517, 0.436219, 0.042036) | 74.3 (25.7) | 0.8 | 0.667 | 0.776
2 | Long axis (cm), border definition, vascular convergence, lymphadenopathy | (0.319822, 0.035229, 0.552036, 0.595986) | 64.3 (35.7) | 0.73 | 0.667 | 0.633
3 | Short axis (cm), contour, concavity, nodules in nontumor lobes | (0.396477, 0.144428, 0.455541, −0.028359) | 74.3 (25.7) | 0.80 | 0.667 | 0.776
4 | Short axis (cm), contour, concavity, spiculation | (0.372755, 0.100117, 0.363608, 0.206085) | 78.6 (21.4) | 0.81 | 0.714 | 0.816
5 | Short axis (cm), contour, spiculation, nodules in nontumor lobes | (0.414338, 0.203542, 0.283427, −0.031426) | 80 (20) | 0.81 | 0.714 | 0.837
 | Combined voting (all five combinations) | | 77.2 | | 0.714 | 0.796
B. Validation dataset (no size-based features), testing
No. | Features | Coefficient (a0–a5) | Accuracy (error), % | AUC | Sensitivity | Specificity
---|---|---|---|---|---|---
1 | Location, fissure attachment, lobulation, spiculation | (−2.106962, 0.108965, 0.635492, 0.425675, 0.429703) | 71.4 (28.6) | 0.74 | 0.619 | 0.755
2 | Location, fissure attachment, spiculation, vascular convergence | (−1.515249, 0.094082, 0.768446, 0.428057, 0.698817) | 65.7 (34.3) | 0.68 | 0.619 | 0.673
3 | Concavity, border definition, spiculation, perinodule fibrosis | (−1.892709, 0.557655, 0.033625, 0.258149, 0.103441) | 71.4 (28.6) | 0.78 | 0.714 | 0.714
4 | Concavity, border definition, spiculation, texture | (−1.88911, 0.564026, 0.062267, 0.286365, 0.025501) | 68.6 (31.4) | 0.75 | 0.81 | 0.633
5 | Location, pleural attachment, spiculation, vascular convergence | (−1.359628, 0.07856, 0.240703, 0.393194, 0.632097) | 64.3 (35.7) | 0.68 | 0.571 | 0.673
 | Combined voting (all five combinations) | | 77.46 | | 0.619 | 0.694
Clinical predictor (Gould model)
Clinical patient characteristics, including size of the nodules, have been widely used as prognostic factors. There are several models proposed in the past; we used the clinical model proposed by Gould and colleagues (9). In the model, clinical malignancy of a nodule was predicted on the basis of smoking status, age of the patient, diameter of the nodule, and number of years since the patient quit smoking. These factors in the logistic regression model showed a high accuracy of malignancy prediction. We used this model as a baseline to compare the semantic-based predictors.
Results
Prediction of cancer status
We investigated combinations of up to four features, with and without size-based descriptors (long axis, short axis, and size category). Figure 2 shows examples of cancerous and benign nodules across different slices. As expected, size-based features by themselves were good predictors for cancerous nodules, providing an average accuracy of 73% (AUC of 0.89; CI, 0.69–0.98), with low sensitivity (0.476) and high specificity (0.93), reported in Supplementary Table S6. Individual traits, including lymphadenopathy or vascular convergence, provided accuracies of 70% and 72% [AUC range, 0.71 (CI, 0.58–0.84) to 0.72 (CI, 0.55–0.9)], respectively. Using all the nodules identified in the patients gave varied prediction accuracy (see Supplementary Tables S7 and S8). Multivariate analyses of image features improved the accuracy of prediction, as shown in Table 2. For example, using the size-based short axis with contour, concavity, and texture improved prediction accuracy to 81% (AUC of 0.88; CI, 0.69–0.98) using the primary largest nodule. Size-based features are conventionally known to be informative of malignancy; hence, we removed size measurements to avoid bias in the predictions and repeated the process to find the best non-size–based predictors of cancer status. The accuracy for the best non-size–based features (4 dimensions) was in the range of 67.3% to 73.6%; the average AUROC ranged from 0.8 (CI, 0.57–0.98) to 0.83 (CI, 0.68–0.98), with sensitivity in the range of 0.71 to 0.74 and specificity of 0.73 to 0.82. Figure 3 shows the ROC with four semantic features (both with and without size) to predict malignancy. These were compared against the Gould model and a validation dataset. The best semantic model was based on size, concavity, contour, and spiculation. The non-size–based predictor was based on location, fissure attachment, lobulation, and spiculation, which are known to be related to malignancy (30–35).
Representative slices selected on the basis of four radiological traits (lobulation, border definition, texture, and nodules in primary tumor lobe), which was found to be one of the best discriminant pairs to predict malignant nodules. A and B, The slices in A correspond to malignant case (lobulation, 3; border definition, 2; texture, 3; nodules in primary tumor, 0) and in B to benign case (lobulation, 1; border definition, 1; texture, 3; nodules in primary tumor, 0).
ROC for semantic feature–based predictors (blue) compared with conventional clinical parameters using the Gould model (red) and on the independent validation dataset. The panels use feature sets with size-based features (A) and without size-based features (B).
Prediction of OS
These semantic-based predictor models were used to partition the samples into two groups, which showed significant differences in survival based on the Kaplan–Meier survival curves. As expected, the model that included a size-based feature was significantly associated with OS (Supplementary Fig. S2A; P = 0.013), whereas the model without size-based features was borderline significantly associated with OS (Supplementary Fig. S2B; P = 0.048).
The models were then blindly applied to the validation dataset to assess the ability to predict malignancy, and the predictor's discrimination ability was measured using accuracy, sensitivity, and specificity. As noted in Table 3 (and Supplementary Table S9), accuracy in predicting cancer status in the validation set was in the range of 64.3% to 80% (AUROC, 0.73–0.81; sensitivity, 66.7%–71.4%; specificity, 63.3%–83.7%) using a combination of size-based and semantic features. Semantic features by themselves had prediction accuracy in the range of 64.3% to 71.4% (AUROC, 0.68–0.78; sensitivity, 57%–81%; specificity, 63.3%–75.5%).
To improve the reliability of the predictors, the top five discriminating combinations were used to obtain an ensemble decision to predict cancer status. A voting-based predictor built from the top multidimensional feature sets is expected to improve sensitivity and specificity. The accuracy in blindly predicting cancer status in the validation data was 77.2% (sensitivity, 71.4%; specificity, 79.6%) using primary nodules. The ensemble without size-based features provided a comparable accuracy of 77.4% (sensitivity, 61.9%; specificity, 69.4%).
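A voting ensemble of this kind can be sketched in a few lines of Python. The sketch assumes simple majority voting over thresholded linear scores; the text does not state the voting rule or decision threshold, and the weights and trait scores below are hypothetical rather than the fitted coefficients in Table 3.

```python
def predict_linear(weights, intercept, x, threshold=0.0):
    """Binary call from one linear model: 1 (malignant) if the score exceeds the threshold."""
    score = intercept + sum(w * v for w, v in zip(weights, x))
    return 1 if score > threshold else 0


def majority_vote(votes):
    """Return 1 when more than half of the models vote 1, else 0."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def ensemble_predict(models, inputs):
    """Each model sees its own feature subset; the final call is the majority."""
    votes = [predict_linear(w, b, x) for (w, b), x in zip(models, inputs)]
    return majority_vote(votes), votes


# Hypothetical models (weights, intercept) and per-model trait scores for one nodule.
models = [([0.5, 0.5], -1.0), ([1.0, -0.2], -0.5), ([0.3, 0.3], -1.0)]
inputs = [[2, 1], [1, 1], [1, 1]]
decision, votes = ensemble_predict(models, inputs)
print(decision, votes)
```

Using an odd number of models, as with the five combinations in the study, avoids tied votes.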
Discussion
In this study, we used observed radiological traits to systematically characterize the size, shape, and location of indeterminate pulmonary nodules and quantitatively represented these traits on a point scale. Traditionally, these semantics have been used to prognosticate malignancy in lung cancer (36). A linear classification model was applied on these quantified observed image traits to predict malignancy. The training dataset was used to find feature combinations and estimate the accuracy of the predictor in a cross-validation setting, graded based on the accuracy, sensitivity, specificity, and the AUROC. The ability of the predictor was blindly evaluated by applying it on the validation set. The top five, 4-dimensional features were determined, and it was interesting to observe that seven unique features appeared as candidates in both size- and non-size–based models: border definition, vascular convergence, concavity, lobulation, texture, spiculation, and nodules in nontumor lobe (see Supplementary Table S10). The non-size–based feature categories had four additional features selected by the top combinations, namely: location, fissure attachment, pleural attachment, and perinodule fibrosis. Although linear methods may not discriminate nonlinearly separable cases, the limitations are mitigated by using multiple linear fits to derive an ensemble decision (37, 38).
Comparison of various CT features, such as contour, shape, and margin, can be helpful for distinguishing between malignant and benign nodules (30, 39–41). The positive relationship of lesion size to likelihood of malignancy has been clearly demonstrated (31, 32). Zerhouni and colleagues (32) found that more than 80% of benign solitary pulmonary nodules were less than 2 cm in diameter; in contrast, diameters of malignant nodules were nearly uniformly distributed in the range of 1 to 6 cm, and 50% of the malignant nodules were larger than 2 cm in diameter. We similarly observed that larger nodules were more likely to be malignant. Importantly, we also found that some top semantic features are good predictors of malignancy even after size-based features were removed to avoid size-based bias. A spiculated contour occurs significantly more often in malignant lung nodules (30, 34). This was supported by our results, in which the majority of malignant lesions exhibited spiculation. In pathologic studies, spiculated contours were shown to be due to thickened interlobular septa, fibrosis caused by obstruction of peripheral vessels, or lymphatic channels filled with tumor cells (42). Nevertheless, spiculated edges may also be found in benign lung nodules, especially in inflammatory pseudotumors or tuberculomas (30). Our results agree with other publications (30, 43) that reported similar morphologic appearances for the differentiation of benign and malignant IPNs, such as a regular shape for benign lesions and a lobular shape for malignant lesions. A recent study confirmed that the prevalence of lung cancer among current smokers increased from 1.1% for those without emphysema to 2.3% for those with emphysema; among former smokers, the prevalence increased from 0.9% to 1.8%, and among never smokers, it increased from 0.4% to 2.6%. Thus, there was roughly a 2-fold increase for current and former smokers and a more than 6-fold increase among never smokers (44). This is consistent with the current study, in which we observed that severe perinodule emphysema was frequently seen in malignant nodules.
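The fold increases quoted for the emphysema study (44) follow directly from the prevalence figures given in the text. The short calculation below simply divides prevalence with emphysema by prevalence without it; the variable and function names are illustrative.

```python
# Lung cancer prevalence (%) without and with emphysema, as quoted in the text (44).
prevalence = {
    "current smokers": (1.1, 2.3),
    "former smokers": (0.9, 1.8),
    "never smokers": (0.4, 2.6),
}

def fold_increase(without, with_emphysema):
    """Ratio of prevalence with emphysema to prevalence without it."""
    return with_emphysema / without

# Round to one decimal place for reporting.
folds = {group: round(fold_increase(*pair), 1) for group, pair in prevalence.items()}
for group, fold in folds.items():
    print(f"{group}: {fold}-fold")
```

The ratios come out at about 2.1, 2.0, and 6.5, matching the roughly 2-fold and 6-fold increases stated in the text.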
Size (WHO, RECIST: long and short axis), rate of change in size, and volume are among the most important and widely used prognostic metrics (45). In response to the community's need to converge on a standard, the American College of Radiology created guidelines to define a positive scan, Lung-RADS (46). Current clinical guidelines rely heavily on nodule size (NCCN and Lung-RADS; refs. 11, 46). On the basis of nodule size, a wide range of false-positive rates has been reported: 96.4% by the CT arm of the NLST and 25% by others (4, 47, 48). Used in addition to size-based predictors, our semantic model can aid clinical decision support systems, including monitoring of nodule growth.
Designing a predictive method poses several challenges; most often, the cohort contains more benign than cancerous nodules, across a range of nodule sizes. Image traits observed by an expert radiologist have the ability to adjust to system variations (CT parameters) and nodule size differences. It then becomes critical for predictive models to balance positive and negative findings (sensitivity and specificity) rather than relying on a single figure of merit. Our approach allows the feature combinations to be graded according to their predictive performance. We believe our radiological semantics approach is novel in that it represents the observed traits on a quantitative scale and applies a machine learning classification approach in a systematic cross-validation setting. Our results show better performance compared with one of the widely used clinical models (9).
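The need to balance sensitivity and specificity in a benign-heavy cohort can be seen in a small hypothetical example. The counts below are invented for illustration and are not from this study, and Youden's index (sensitivity + specificity - 1) is used here only as one possible balanced summary, not as a metric reported by the authors.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = malignant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def youden_index(metrics):
    """Youden's J: 0 for an uninformative rule, 1 for a perfect one."""
    return metrics["sensitivity"] + metrics["specificity"] - 1

# Hypothetical cohort: 90 benign nodules followed by 10 malignant ones.
y_true = [0] * 90 + [1] * 10

# Rule A calls every nodule benign.
always_benign = [0] * 100
# Rule B raises 18 false alarms but detects 8 of the 10 cancers.
balanced_rule = [1] * 18 + [0] * 72 + [1] * 8 + [0] * 2

metrics_a = confusion_metrics(y_true, always_benign)
metrics_b = confusion_metrics(y_true, balanced_rule)
print(metrics_a, youden_index(metrics_a))
print(metrics_b, youden_index(metrics_b))
```

Rule A reaches 90% accuracy while missing every cancer, whereas rule B has lower accuracy (80%) but a sensitivity and specificity of 80% each, which is why accuracy alone is a poor figure of merit here.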
The semantic approach has practical relevance in nodule classification. In a recent lung nodule classification challenge, our team proposed semantic-based approaches to classify indeterminate pulmonary nodules. The challenge had about 10 samples for calibration or training (known outcome) and about 60 samples for blinded testing. Among more than 15 international participants, the semantic-based approach placed second with a test AUC of 0.66, whereas the computer-aided diagnosis (CAD) feature–based method placed first with an AUC of 0.68 (49–51).
Because of the retrospective study design and small sample size, we collected two independent cohorts from two institutions to avoid overfitting the data (training and validation cohorts from Vanderbilt Medical Center and Veteran Affairs Hospital in Nashville, with samples randomly mixed in the cohorts). Further multi-institutional studies with larger numbers of patients are warranted to replicate these novel results.
Study limitations
Our study has some limitations. First, the number of patients was not large. We have made an effort to reduce false discovery by using separate training and validation cohorts. Despite this approach, there could be biases in the patient population, as the current cohorts were predominantly male, derived as they were from a VA population. This could be mitigated by collecting samples in a larger multi-institutional cohort study. Second, radiologist training and preferences will influence the semantic scoring. Although this is less of a concern in a research setting, it may be an issue in clinical practice. Efforts have been made to standardize the scoring sheet with a descriptive atlas (e.g., Supplementary Fig. S1; Supplementary Table S2) in a way that will be acceptable to the community at large.
Conclusions
We have shown that radiological image traits are useful in differentiating malignant from benign nodules. These semantic features, used along with size measurements, enhance prediction accuracy and reduce false-positives. The usefulness of radiological imaging traits (semantics) in predicting cancer status suggests that they can reduce diagnostic errors compared with clinical models. Together with conventional size-based measures, these traits could be used in the clinical workflow to better diagnose malignancy.
Disclosure of Potential Conflicts of Interest
R.J. Gillies holds ownership interest (including patents) in and is a consultant/advisory board member for Health Myne. No potential conflicts of interest were disclosed by the other authors.
Disclaimer
The content of the article is solely the responsibility of the authors.
Authors' Contributions
Conception and design: Y. Balagurunathan, R.C. Walker, M.B. Schabath, R.J. Gillies, Y. Liu
Development of methodology: Y. Balagurunathan, R.C. Walker, P.P. Massion, R.J. Gillies, Y. Liu
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): T. Atwater, S. Antic, R.C. Walker, P.P. Massion, M.B. Schabath
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): Y. Balagurunathan, Q. Li, R.C. Walker, P.P. Massion, M.B. Schabath, R.J. Gillies, Y. Liu
Writing, review, and/or revision of the manuscript: Y. Balagurunathan, Q. Li, R.C. Walker, G.T. Smith, P.P. Massion, M.B. Schabath, R.J. Gillies, Y. Liu
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): R.J. Gillies
Study supervision: Y. Balagurunathan, R.J. Gillies
Other: Y. Liu
Grant Support
The NIH grant (CA 143062-01) and State of Florida Department of Health (2KT01) grant provided protected time for R.J. Gillies, Y. Balagurunathan, Y. Liu, Q. Li, and M. Schabath to work on the research project.
The NIH grants (CA186145, CA152662) and the Department of Defense grant W81XWH-11-2-0161 provided funding support for P.P. Massion. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.