Purpose: We propose a systematic methodology to quantify incidentally identified pulmonary nodules based on observed radiological traits (semantics) quantified on a point scale and a machine-learning method using these data to predict cancer status.

Experimental Design: We investigated 172 patients who had low-dose CT images, with 102 and 70 patients grouped into training and validation cohorts, respectively. On the images, 24 radiological traits were systematically scored and a linear classifier was built to relate the traits to malignant status. The model was formed both with and without size descriptors to remove bias due to nodule size. The multivariate pairs formed on the training set were tested on an independent validation data set to evaluate their performance.

Results: The best 4-feature set that included a size measurement (set 1) was short axis, contour, concavity, and texture, which had an area under the receiver operator characteristic curve (AUROC) of 0.88 (accuracy = 81%, sensitivity = 76.2%, specificity = 91.7%). If size measures were excluded, the four best features (set 2) were location, fissure attachment, lobulation, and spiculation, which had an AUROC of 0.83 (accuracy = 73.2%, sensitivity = 73.8%, specificity = 81.7%) in predicting malignancy in primary nodules. The validation test AUROC was 0.8 (accuracy = 74.3%, sensitivity = 66.7%, specificity = 75.6%) and 0.74 (accuracy = 71.4%, sensitivity = 61.9%, specificity = 75.5%) for sets 1 and 2, respectively.

Conclusions: Radiological image traits are useful in predicting malignancy in lung nodules. These semantic traits can be used in combination with size-based measures to enhance prediction accuracy and reduce false-positives. Clin Cancer Res; 23(6); 1442–9. ©2016 AACR.

Translational Relevance

Radiological image traits have been investigated extensively for prognosis in lung CT, in the context of both lesions and nodules. Most of the studies relied on patient survival–based models using single or multiple traits. In this study, we have taken a systematic approach to describe traits on a point scale and score the patient scans for the appearance of a trait. An agnostic learning method was used in a cross-validation setting to find the relationships of the traits to the malignancy status. The combination pairs were tested for reliability on a validation cohort. These pairs of radiological traits (semantics) could readily be used by the practicing clinician to provide risk assessment for pulmonary nodules, which will help to standardize radiologist inference and improve patient care. Certainly, any inference on biomarkers needs to be used with caution; one needs to account for system-level variability at the clinic, the parameter settings of the scanner, and operator precision.

Lung cancer is the leading cause of cancer-related deaths globally and in the United States (1). Low-dose CT (LDCT) has evolved to be a sensitive imaging modality to detect pulmonary nodules. The National Lung Screening Trial (NLST), which compared LDCT and standard chest radiography (CXR) for three annual screens, found a 20% reduction in lung cancer mortality for CT compared with CXR (2). In the NLST trial, at least 39% of LDCT study participants had a nodule-positive scan during the study, and 96.4% of these were noncancerous (i.e., false-positive). The CXR arm had a lower rate of positive detection (16%) with a comparatively lower rate of false-positives (2, 3). In the NLST, it is estimated that about 18.5% [95% confidence interval (CI), 5.4%–30.6%] of the lung cancers detected in the study population were clinically insignificant and hence, overdiagnosed (4). Other studies have shown that overdiagnosis of lung cancer can be as high as 96% (5–7). Despite the high false-positive rate, the United States Preventive Services Task Force recommended lung cancer screening for high-risk individuals (8). However, debate on the effectiveness of LDCT screening still continues. The availability of abundant data has helped the development of clinical assessment models to predict probabilities of malignancy (9, 10).

Identification of malignancy continues to be a challenge even in the screening setting; patients with indeterminate pulmonary nodules (IPN) are typically monitored with scheduled follow-up scans (11). Advancements in image acquisition and improved computer-aided diagnostic tools, coupled with effective treatment strategies, have been shown to improve patient survival (12). In recent work, nodule characteristics coupled with clinical risk factors have been widely used to prognosticate malignancy (13–15). Although the NLST shows that <4% of the subjects with noncalcified nodules were diagnosed with lung cancer within a year (16), at present, there is limited ability to provide individual patient-level risk (17).

In this work, we focus on quantifying radiological imaging characteristics of nodules (shape, location, and texture), including associated structures of the lung, and relate them to cancer status. We followed a rigorous approach to find the optimal imaging characteristics in a training cohort using cross-validation methodology and validated them in a test cohort. These quantitative predictive scores [accuracy and/or area under the receiver operator characteristic curve (AUROC)] obtained from images were used to develop classifier models to identify risk of malignancy (see Fig. 1).

Figure 1.

Study design to find discriminant semantic features. The blocks describe the methodology followed in the article. The observed radiological trait by an expert was related to outcome with a train and validation setting.


Patient cohort

In this study, we collected two cohorts (training and validation) from the Vanderbilt University Medical Center (VUMC) and the Veterans Affairs (VA) Medical Center (both in Nashville, TN). The training cohort had 102 patients, consisting of 42 with lung cancer and 60 with a positive scan that was not lung cancer (i.e., benign nodule). The 102 patients in the training set had 206 nodules (84 malignant, 122 benign). The validation set had 70 patients: 21 with lung cancer and 49 with a positive scan that was not lung cancer. The 70 patients in the validation cohort had 102 nodules (26 malignant, 76 benign). The patients had 2 years of follow-up from the time of CT scans, and the biopsy results confirmed their cancer status. The median patient age was 67 years (σ = 8.6) in the training set and 65 years (σ = 9.3) in the validation set. The patient samples were retrospectively curated; the first batch of patients was used in the training set and the later batch in the validation set; the cohorts had a collection time difference of 6 to 9 months.

Table 1 describes the patient demographics, and Supplementary Table S1 describes the nodule dimensions in the two cohorts. This study was approved by the Institutional Review Board (IRB) at the collecting institution (VUMC/VA Hospital, Nashville, TN) and as a retrospective study to review deidentified patient records at the collaborating institution (Moffitt Cancer Center, Tampa, FL). The requirement for patients' informed consent was waived.

Table 1.

Patient demographics and tumor stage for the training and testing samples

| Characteristic | Training set: overall | Training set: lung cancer cases | Training set: patients with benign nodules | Validation set: overall | Validation set: lung cancer cases | Validation set: patients with benign nodules |
| --- | --- | --- | --- | --- | --- | --- |
| Demographics: patients | 102 | 42 | 60 | 70 | 21 | 49 |
| Nodules | 206 | 84 | 122 | 102 | 26 | 76 |
| Age: median, mean (σ) | 67, 66.7 (8.6) | 69, 68.4 (8.7) | 65, 65.6 (8.3) | 65, 65.5 (9.3) | 69, 70.1 (7.4) | 64, 63.5 (9.5) |
| Gender n (%) (male/female) | 102 | 95.2%/4.7% | 98%/1.6% | 70 | 28.6%/0% | 67.1%/4.3% |
| Male | 99 | 40 | 59 | 68 | 20 | 47 |
| Female | | | | | | |
| Smoking: pack years median, mean (σ) | 46, 58.22 (42.15) | 50, 68 (53.53) | 51.365, 45 30.56934781 | 45, 48.1 (37.5) | 57, 61.4 (34.4) | 40, 41.9 (37.5) |
| Race (Caucasian/African American/Native/others) | 92/9/0/1 | 35/7/0/0 | 57/2/0/1 | 63/2/2/3 | 21/0/0/0 | 42/2/2/3 |

LDCT protocol

Chest LDCT scans were performed with a single deep breath hold by using a Discovery VCT (GE Healthcare) from the base of the neck to the posterior lung gutters. The patients were scanned without intravenous contrast, and images were obtained by filtered back projection reconstruction with a soft tissue filter to obtain a 512 × 512 matrix. CT energy was 120 kVp, with variable mAs (range, 30–400) to minimize radiation. Helical data were acquired using collimation of 40 mm, pitch 1.375:1, table speed 55 cm/sec, reconstructed as 1.25-mm pixels at contiguous 1.25-mm intervals, producing isotropic voxels. The field of view was based on patient body habitus, typically 35 to 43 cm.

Image analysis and reader agreement

All LDCT images were reviewed by a clinical radiologist (Y. Liu) with more than 6 years of experience in LDCT imaging of thoracic malignancies, who was blinded to the clinical details and final diagnosis for the nodules at the time of image interpretation. Thin-section LDCT images were displayed using both standard mediastinal (width, 350 HU; level, 40 HU) and lung (width, 1,500 HU; level, −600 HU) window-width and window-level settings. In total, 24 CT image descriptors were developed to characterize the pulmonary nodules, and these were classified into eight categories: (i) location; (ii) size; (iii) shape; (iv) margin; (v) density; (vi) internal features; (vii) external features; and (viii) associated findings; example cases are shown in Supplementary Fig. S1. Each CT descriptor was rated using either an ordinal scale or a binary categorical variable (see Supplementary Tables S2 and S3).

To measure the reproducibility of the semantic scoring metric, we randomly selected 80 patients (40 malignant and 40 benign) from the cohort and provided the scoring sheet with the approximate anatomical location of the nodules to two different radiologists (Y. Liu and Q. Li, who are resident radiologists with 3 years of clinical experience). The scoring sheet was used to compute the concordance of these discrete scores between readers using kappa statistics (18, 19). In radiological observations, a kappa coefficient greater than 0.8 is considered perfect agreement, 0.61 to 0.8 substantial agreement, 0.41 to 0.6 moderate agreement, 0.21 to 0.4 fair agreement, and below 0.2 poor agreement (20). Supplementary Table S4 shows the kappa coefficient and the confidence limits in different sampling cohorts. Of the 24 features, 10 semantic features had a kappa >0.95, of which 9 features exhibited a perfect score. Eight semantic features had a kappa coefficient between 0.85 and 0.95, whereas three were between 0.7 and 0.8. One feature (distribution) could not be scored due to limited examples in the sample population. The repeatability of the two size-based features (long and short axis) was evaluated by computing the intraclass correlation coefficient, which was 0.94 (0.89–0.97) and 0.96 (0.834–0.99), respectively.
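As an illustration of the reader-agreement calculation, the following is a minimal Python sketch of an unweighted Cohen's kappa for two readers. It assumes the unweighted form (the text does not state whether a weighted kappa was used for the ordinal scores), the reader scores are hypothetical, and the band cut-offs simply follow the ranges quoted above.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Unweighted Cohen's kappa for two readers scoring the same nodules."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("expected two non-empty score lists of equal length")
    n = len(reader_a)
    # Observed agreement: fraction of nodules given the same score by both readers.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Chance agreement computed from each reader's marginal score frequencies.
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map kappa to the agreement bands quoted in the text (cut-offs assumed)."""
    if kappa > 0.8:
        return "perfect"
    if kappa > 0.6:
        return "substantial"
    if kappa > 0.4:
        return "moderate"
    if kappa > 0.2:
        return "fair"
    return "poor"


# Hypothetical ordinal scores (1-3) for one trait on eight nodules.
reader_1 = [1, 2, 3, 1, 2, 3, 1, 2]
reader_2 = [1, 2, 3, 1, 2, 3, 1, 3]
kappa = cohen_kappa(reader_1, reader_2)
print(round(kappa, 3), agreement_band(kappa))
```

In this toy case the readers disagree on one of eight nodules, which gives a kappa of about 0.81.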

Reader variability and prediction outcome.

We evaluated reader variability on the classification outcome in a subset of 80 patients, which was further divided into training and testing (40 patients in each) with equal numbers of patients with benign and malignant nodules. Two radiologists independently scored the semantic metrics as described in the previous section. We compared the prediction results (AUC) of the classifier when the semantic scores for the training and test samples came from the same radiologist with the results when the test scores came from a different radiologist. Supplementary Table S5 shows the results of the AUC (sensitivity and specificity) of the interreader classification carried out in both ways. We found that the semantic metrics of contour and concavity showed differences of 10.2% and 6.6%, respectively, in the AUC derived from different radiologists. Notably, other semantic metrics showed less than 5% difference in the AUC.

Statistical analysis

Discriminatory analysis was conducted using a linear classifier to find the best predictive features of cancer status. The error of classification was estimated using the hold-out cross-validation method, where 80% of the sample was selected for training and 20% for testing. The process was randomized and repeated a large number of times (more than 200), and the average test accuracy (or error) was reported. For each combination, the AUROC was computed and compared with the clinical model proposed by Gould and colleagues (9). To make a comparison to the cross-validation method, clinical data were resampled using the bootstrap method, and the clinical model prediction was computed for each random partition (21–23). The average AUROC with deviation across multiple runs was reported, along with sensitivity and specificity. The classifier model was first built on the training cohort using the cross-validation method described and independently applied to the validation cohort to find the most promising feature combination.
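The repeated hold-out procedure can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the study: the linear classifier is replaced by a simple stand-in (a weight vector equal to the difference of the class means), the AUROC is computed with the rank-based (Mann–Whitney) formulation, and all function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def auroc(scores, labels):
    """AUROC as the probability that a malignant case outscores a benign one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_difference(X, y):
    """Stand-in linear classifier (assumption): weights = malignant mean minus benign mean."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [mean(p[d] for p in pos) - mean(q[d] for q in neg) for d in range(len(X[0]))]


def linear_score(w, x):
    """Linear decision score w . x for one nodule."""
    return sum(wi * xi for wi, xi in zip(w, x))


def repeated_holdout_auc(X, y, repeats=200, test_fraction=0.2, seed=0):
    """Average test AUROC over repeated random 80/20 train/test splits."""
    if len(set(y)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = max(2, round(test_fraction * len(y)))
    aucs = []
    while len(aucs) < repeats:
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        if len(set(y_train)) < 2 or len(set(y_test)) < 2:
            continue  # redraw splits that lack one of the classes
        w = fit_mean_difference([X[i] for i in train], y_train)
        aucs.append(auroc([linear_score(w, X[i]) for i in test], y_test))
    return mean(aucs)


# Hypothetical trait scores: benign nodules score low, malignant nodules score high.
X = [[1, 1], [1, 2]] * 10 + [[3, 2], [3, 3]] * 10
y = [0] * 20 + [1] * 20
print(repeated_holdout_auc(X, y))
```

Redrawing splits that contain only one class is a design choice of this sketch; a stratified split would be an alternative.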

The feature combination that exhibited the highest sensitivity and specificity (Youden J index; refs. 24, 25) in the training set was then selected to be tested for performance on the test cohort. The final lists of top candidate features were selected on the basis of their performance on both cohorts (training and test). This approach provides an additional validation step to overcome typical considerations of cross-validation methods (26–28). As such, the larger cohort sample size provided better performance capabilities and hence the elevated AUCs.
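For reference, the Youden J index combines sensitivity and specificity as J = sensitivity + specificity − 1. The short Python sketch below computes J and, as an assumed usage, scans candidate cut-offs on a classifier score to find the one maximizing J; the helper names and the toy scores are hypothetical.

```python
def youden_j(sensitivity, specificity):
    """Youden J index: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def best_cutoff(scores, labels):
    """Return (J, cutoff) maximizing Youden J, calling a case malignant if score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (-1.0, None)
    for cutoff in sorted(set(scores)):
        tp = sum(s >= cutoff and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < cutoff and y == 0 for s, y in zip(scores, labels))
        j = youden_j(tp / n_pos, tn / n_neg)
        if j > best[0]:  # strict comparison keeps the lowest cutoff on ties
            best = (j, cutoff)
    return best


# Sensitivity and specificity reported for set 1 and set 2 in the training cohort.
print(round(youden_j(0.762, 0.917), 3))
print(round(youden_j(0.738, 0.817), 3))

# Hypothetical classifier scores for two benign (0) and two malignant (1) nodules.
print(best_cutoff([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```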

Integrity of training and testing samples was independently maintained without mixing the samples. We evaluated the overall survival (OS) difference between the classifier-discriminated patient population using Kaplan–Meier survival plots and P value by the log-rank test.
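The Kaplan–Meier estimate used for the survival comparison can be illustrated with a minimal Python sketch. It covers only the product-limit survival curve for one group (the log-rank test is not shown), and the follow-up times and event indicators are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # All patients whose follow-up ends at time t (sorted, so they are contiguous).
        same_time = [e for tt, e in data[i:] if tt == t]
        deaths = sum(same_time)
        if deaths:
            # Product-limit step: multiply by the fraction surviving this event time.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= len(same_time)
        i += len(same_time)
    return curve


# Hypothetical follow-up (months) for five patients; 0 marks a censored observation.
for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 2))
```

Applying the estimator separately to the two classifier-defined groups gives the two curves that a log-rank test would then compare.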

Finding the best set of features is a challenge. Various feature reduction methods have been proposed in the past; most often these methods have a range of performance, typically dependent on the complexity of the datasets (29). We used an exhaustive search to find the best-performing features, evaluating all possible feature combinations (up to four dimensions, more than 12,650 combinations). The top discriminating features are reported in Table 2 and Supplementary Table S6 (all nodules in Supplementary Tables S7 and S8) for the training data along with results of the clinical model. Accuracy (1-error), AUROC, sensitivity, and specificity were all considered in identifying the best discriminant combinations. We then took the top discriminating feature pairs shortlisted from the training set and applied the discriminator blindly to the validation data. The error rate with sensitivity and specificity is reported in Table 3.
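The size of the exhaustive search can be checked with a short Python sketch. Assuming all 24 descriptors enter the search, the number of subsets with one to four features is 12,950, which is consistent with the "more than 12,650" figure above. The `score_fn` argument is a hypothetical placeholder for whatever cross-validated figure of merit is used to rank a feature subset.

```python
from itertools import combinations
from math import comb


def count_feature_sets(n_features, k_max):
    """Number of feature subsets of size 1..k_max."""
    return sum(comb(n_features, k) for k in range(1, k_max + 1))


def exhaustive_search(feature_names, score_fn, k_max=4, top=5):
    """Score every subset of up to k_max features and return the top-scoring ones."""
    scored = []
    for k in range(1, k_max + 1):
        for subset in combinations(feature_names, k):
            scored.append((score_fn(subset), subset))
    # Highest score first; ties keep their enumeration order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]


print(count_feature_sets(24, 4))

# Toy usage with a placeholder score (subset size), for illustration only.
print(exhaustive_search(["location", "contour", "texture"], score_fn=len, k_max=3, top=1))
```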

Table 2.

Prediction error rate for test and train set using four best semantic features with primary nodule

A. Including size-based features (training)

| Features | Accuracy (error), % | Sensitivity | Specificity | E[AUC] μ, σ (95% CI) | E[AUC] Gould (9) μ, σ (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Short axis (cm), contour, concavity, texture | 81.02 (18.98) | 0.762 | 0.917 | 0.88, 0.08 (0.69–0.98) | 0.57, 0.04 (0.49–0.65) |
| Long axis (cm), border definition, vascular convergence, lymphadenopathy | 79.6 (20.4) | 0.762 | 0.95 | 0.87, 0.1 (0.53–0.98) | 0.58, 0.05 (0.48–0.69) |
| Short axis (cm), contour, concavity, nodules in nontumor lobes | 79.02 (20.98) | 0.786 | 0.917 | 0.86, 0.09 (0.7–0.98) | 0.58, 0.06 (0.42–0.66) |
| Short axis (cm), contour, concavity, spiculation | 82.42 (17.58) | 0.762 | 0.917 | 0.86, 0.08 (0.66–0.98) | 0.58, 0.05 (0.47–0.66) |
| Short axis (cm), contour, spiculation, nodules in nontumor lobes | 80.9 (19.1) | 0.762 | 0.917 | 0.81, 0.1 (0.62–0.98) | 0.59, 0.05 (0.5–0.67) |

B. No size-based feature (training)

| Features | Accuracy (error), % | Sensitivity | Specificity | E[AUC] μ, σ (95% CI) | E[AUC] Gould (9) μ, σ (95% CI) |
| --- | --- | --- | --- | --- | --- |
| Location, fissure attachment, lobulation, spiculation | 73.2 (26.8) | 0.738 | 0.817 | 0.83, 0.09 (0.68–0.98) | 0.59, 0.049 (0.48–0.66) |
| Location, fissure attachment, spiculation, vascular convergence | 73.6 (26.4) | 0.738 | 0.817 | 0.82, 0.09 (0.65–0.96) | 0.578, 0.06 (0.46–0.69) |
| Concavity, border definition, spiculation, perinodule fibrosis | 69.3 (30.7) | 0.714 | 0.8 | 0.81, 0.09 (0.57–0.95) | 0.586, 0.056 (0.47–0.7) |
| Concavity, border definition, spiculation, texture | 67.3 (32.7) | 0.738 | 0.733 | 0.8, 0.1 (0.57–0.98) | 0.567, 0.05 (0.49–0.67) |
| Location, pleural attachment, spiculation, vascular convergence | 71.5 (28.5) | 0.714 | 0.817 | 0.8, 0.08 (0.7–0.97) | 0.587, 0.047 (0.51–0.69) |

Table 3.

Validation using 70 patients with 49 being normal and 21 identified cancerous

A. Validation dataset (with size-based features), testing

| Features | Coefficient (a0–a5) | Accuracy (error), % | AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| Short axis (cm), contour, concavity, texture | (0.401033, 0.141517, 0.436219, 0.042036) | 74.3 (25.7) | 0.8 | 0.667 | 0.776 |
| Long axis (cm), border definition, vascular convergence, lymphadenopathy | (0.319822, 0.035229, 0.552036, 0.595986) | 64.3 (35.7) | 0.73 | 0.667 | 0.633 |
| Short axis (cm), contour, concavity, nodules in nontumor lobes | (0.396477, 0.144428, 0.455541, −0.028359) | 74.3 (25.7) | 0.80 | 0.667 | 0.776 |
| Short axis (cm), contour, concavity, spiculation | (0.372755, 0.100117, 0.363608, 0.206085) | 78.6 (21.4) | 0.81 | 0.714 | 0.816 |
| Short axis (cm), contour, spiculation, nodules in nontumor lobes | (0.414338, 0.203542, 0.283427, −0.031426) | 80 (20) | 0.81 | 0.714 | 0.837 |
| Combined voting (all five combinations) | | Accuracy: 77.2% | | 0.714 | 0.796 |

B. Validation dataset (no size-based features), testing

| Features | Coefficient (a0–a5) | Accuracy (error), % | AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| Location, fissure attachment, lobulation, spiculation | (−2.106962, 0.108965, 0.635492, 0.425675, 0.429703) | 71.4 (28.6) | 0.74 | 0.619 | 0.755 |
| Location, fissure attachment, spiculation, vascular convergence | (−1.515249, 0.094082, 0.768446, 0.428057, 0.698817) | 65.7 (34.3) | 0.68 | 0.619 | 0.673 |
| Concavity, border definition, spiculation, perinodule fibrosis | (−1.892709, 0.557655, 0.033625, 0.258149, 0.103441) | 71.4 (28.6) | 0.78 | 0.714 | 0.714 |
| Concavity, border definition, spiculation, texture | (−1.88911, 0.564026, 0.062267, 0.286365, 0.025501) | 68.6 (31.4) | 0.75 | 0.81 | 0.633 |
| Location, pleural-attachment, spiculation, vascular convergence | (−1.359628, 0.07856, 0.240703, 0.393194, 0.632097) | 64.3 (35.7) | 0.68 | 0.571 | 0.673 |
| Combined voting (all five combinations) | | Accuracy: 77.46% | | 0.619 | 0.694 |

Clinical predictor (Gould model)

Clinical patient characteristics, including size of the nodules, have been widely used as prognostic factors. There are several models proposed in the past; we used the clinical model proposed by Gould and colleagues (9). In the model, clinical malignancy of a nodule was predicted on the basis of smoking status, age of the patient, diameter of the nodule, and number of years since the patient quit smoking. These factors in the logistic regression model showed a high accuracy of malignancy prediction. We used this model as a baseline to compare the semantic-based predictors.

Prediction of cancer status

We investigated combinations of up to four features, with and without size-based descriptors (long axis, short axis, and size category). Figure 2 shows examples of a cancerous and a benign nodule across different slices. As expected, size-based features by themselves were good predictors for cancerous nodules, providing an average accuracy of 73% (AUC of 0.89; CI, 0.69–0.98), with low sensitivity (0.476) and high specificity (0.93), as reported in Supplementary Table S6. Individual traits, including lymphadenopathy or vascular convergence, provided accuracies of 70% and 72% [AUC range, 0.71 (CI, 0.58–0.84) to 0.72 (CI, 0.55–0.9)], respectively. Using all the nodules identified in the patients gave varied prediction accuracy (see Supplementary Tables S7 and S8). Multivariate analyses of image features improved the accuracy of prediction, as shown in Table 2. For example, using size-based short axis with contour, concavity, and texture improved prediction accuracy to 81% (AUC of 0.88; CI, 0.68–0.98) using the primary largest nodule. Size-based features are conventionally known to be informative of malignancy; hence, we removed size measurements to avoid bias in the predictions and repeated the process to find the best non-size–based predictors of cancer status. The accuracy for the best non-size–based features (4 dimensions) was in the range of 67.2% to 73.6%; the average AUROC was in the range of 0.8 (CI, 0.57–0.98) to 0.83 (CI, 0.68–0.98), with sensitivity in the range of 0.71 to 0.74 and specificity of 0.73 to 0.82. Figure 3 shows the ROC with four semantic features (both with and without size) to predict malignancy. These were compared against the Gould model and a validation dataset. The best semantic model was based on size, concavity, contour, and spiculation. The non-size–based predictor was based on location, fissure attachment, lobulation, and spiculation, which are known to be related to malignancy (30–35).

Figure 2.

Representative slices selected on the basis of four radiological traits (lobulation, border definition, texture, and nodules in primary tumor lobe), which was found to be one of the best discriminant pairs to predict malignant nodules. A and B, The slices in A correspond to malignant case (lobulation, 3; border definition, 2; texture, 3; nodules in primary tumor, 0) and in B to benign case (lobulation, 1; border definition, 1; texture, 3; nodules in primary tumor, 0).

Figure 3.

ROC for semantic feature–based predictors (blue) compared with conventional clinical parameters using the Gould model (red) and on the independent validation dataset. The panels use feature sets with a size feature (A) and without size-based features (B).


Prediction of OS

These semantic-based predictor models were used to partition the samples into two groups, which showed significant differences in survival based on the Kaplan–Meier survival curves. As expected, the model that included a size-based feature was significantly associated with OS (Supplementary Fig. S2A; P = 0.013), whereas the model without size-based features was borderline significantly associated with OS (Supplementary Fig. S2B; P = 0.048).

The models were then blindly applied to the validation dataset to assess the ability to predict malignancy, and the predictor's discrimination ability was measured using accuracy, sensitivity, and specificity. As noted in Table 3 (and Supplementary Table S9), accuracy in predicting cancer status in the validation set was in the range of 64.3% to 80% (with AUROC, 0.73–0.80; sensitivity 66.7%–71.4%; specificity, 63.3%–83.7%) using a combination of size-based and semantic features. Semantic features by themselves had prediction accuracy in the range of 64.3% to 71.4% (AUROC, 0.68–0.78; sensitivity, 57%–81%; specificity 67%–75.5%).

To improve reliability of the predictors, the top five discriminating combinations were used to obtain an ensemble decision to predict cancer status. The voting-based top multidimensional feature predictor should improve the sensitivity and specificity. The accuracy in blindly predicting cancer status in the validation data was 77.2% (sensitivity, 71.4%; specificity, 79.6%) using primary nodules. In contrast, non-size–based features provided a comparable accuracy of 77.4% (sensitivity, 61.9%; specificity, 69.4%).
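The ensemble decision can be sketched in Python as a simple majority vote over the five classifiers. The voting rule is an assumption (the text does not specify how the votes are combined), and the per-classifier predictions below are hypothetical.

```python
def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several classifiers by simple majority (assumed rule).

    predictions_per_model : list of equal-length 0/1 lists, one list per classifier
    """
    n_models = len(predictions_per_model)
    return [1 if 2 * sum(votes) > n_models else 0
            for votes in zip(*predictions_per_model)]


def sensitivity_specificity(predicted, truth):
    """Sensitivity and specificity of 0/1 predictions against 0/1 ground truth."""
    tp = sum(p == 1 and t == 1 for p, t in zip(predicted, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(predicted, truth))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return tp / n_pos, tn / n_neg


# Hypothetical calls from five feature-set classifiers on four nodules (1 = malignant).
model_calls = [
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
]
truth = [1, 0, 1, 0]
ensemble = majority_vote(model_calls)
print(ensemble, sensitivity_specificity(ensemble, truth))
```

With an odd number of classifiers a strict majority always exists, so no tie-breaking rule is needed.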

In this study, we used observed radiological traits to systematically characterize the size, shape, and location of indeterminate pulmonary nodules and quantitatively represented these traits on a point scale. Traditionally, these semantics have been used to prognosticate malignancy in lung cancer (36). A linear classification model was applied on these quantified observed image traits to predict malignancy. The training dataset was used to find feature combinations and estimate the accuracy of the predictor in a cross-validation setting, graded based on the accuracy, sensitivity, specificity, and the AUROC. The ability of the predictor was blindly evaluated by applying it on the validation set. The top five, 4-dimensional features were determined, and it was interesting to observe that seven unique features appeared as candidates in both size- and non-size–based models: border definition, vascular convergence, concavity, lobulation, texture, spiculation, and nodules in nontumor lobe (see Supplementary Table S10). The non-size–based feature categories had four additional features selected by the top combinations, namely: location, fissure attachment, pleural attachment, and perinodule fibrosis. Although linear methods may not discriminate nonlinearly separable cases, the limitations are mitigated by using multiple linear fits to derive an ensemble decision (37, 38).

Comparison of various CT features, such as contour, shape, and margin, can be helpful for distinguishing between malignant and benign nodules (30, 39–41). The positive relationship of lesion size to likelihood of malignancy has been clearly demonstrated (31, 32). Zerhouni and colleagues (32) found that more than 80% of benign solitary pulmonary nodules were less than 2 cm in diameter; in contrast, diameters of malignant nodules were nearly uniformly distributed in the range of 1 to 6 cm, and 50% of the malignant nodules were larger than 2 cm in diameter. We similarly observed that larger nodules were more likely to be malignant. Importantly, we also found that some top semantic features are good predictors of malignancy even after size-based features were removed to avoid size-based bias. A spiculated contour occurs significantly more often in malignant lung nodules (30, 34). This was supported by our results, in which the majority of malignant lesions exhibited spiculation. In pathologic studies, spiculated contours were shown to be due to thickened interlobular septa, fibrosis caused by obstruction of peripheral vessels, or lymphatic channels filled with tumor cells (42). Nevertheless, spiculated edges may also be found in benign lung nodules, especially in inflammatory pseudotumors or tuberculomas (30). Our results agree with other publications (30, 43) that reported similar morphologic appearances for the differentiation of benign and malignant IPNs, such as regular shape for benign lesions and lobular shape for malignant lesions. A recent study confirmed that the prevalence of lung cancer among current smokers increased from 1.1% for those without emphysema to 2.3% for those with emphysema; among former smokers, the prevalence increased from 0.9% to 1.8%, and for never smokers, it increased from 0.4% to 2.6%. Thus, there was a little more than a 2-fold increase for current and former smokers and a 6-fold increase among never smokers (44). This is consistent with the current study, in which we observed that severe perinodule emphysema was frequently seen with malignant nodules.

Size (WHO, RECIST: long and short axis), rate of change in size, and volume are largely the most important prognostic metrics that have been widely used (45). In response to the community's need to converge on a standard, the American College of Radiology created guidelines to define a positive scan, Lung-RADS (46). Current clinical guidelines rely heavily on the nodule size (NCCN and LungRADS; refs. 11, 46). On the basis of nodule size, a wide range of false-positives has been reported, 96.4% by the CT arm of the NLST and 25% by others (4, 47, 48). Our semantic model, in addition to size-based predictors, will aid the clinical decision support system, including monitoring of the nodule growth.

Designing a predictive method poses several challenges. Most often, the cohort contains more benign than cancerous nodules, spanning a range of nodule sizes. Image traits scored by an expert radiologist can accommodate system variations (CT parameters) and differences in nodule size. It is therefore critical for predictive models to balance positive and negative findings (sensitivity and specificity) rather than rely on a single figure of merit. Our approach allows the feature pairs to be graded according to their predictive performance. We believe our radiological semantics approach is novel: it records the observed traits on a quantitative scale and applies a machine-learning classification approach in a systematic cross-validation setting. Our results show better performance than one of the widely used clinical models (9).
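One common way to summarize the balance between sensitivity and specificity is Youden's index (refs. 24, 25), J = sensitivity + specificity − 1. The sketch below applies it, together with balanced accuracy, to the sensitivity and specificity values reported in the abstract for feature sets 1 and 2. It is only an illustration of the trade-off; how the index was applied within the study's own analysis is not shown in this section, so the function and variable names here are assumptions.

```python
# Sensitivity and specificity (as fractions) reported in the abstract.
# Keys are (feature set, cohort); this layout is an illustrative assumption.
reported = {
    ("set 1", "training"): (0.762, 0.917),
    ("set 1", "validation"): (0.667, 0.756),
    ("set 2", "training"): (0.738, 0.817),
    ("set 2", "validation"): (0.619, 0.755),
}


def youden_j(sensitivity, specificity):
    """Youden's J = sensitivity + specificity - 1 (0 = chance level, 1 = perfect)."""
    return sensitivity + specificity - 1.0


def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity; insensitive to class imbalance."""
    return (sensitivity + specificity) / 2.0


# Map each (feature set, cohort) to its (Youden's J, balanced accuracy).
summary = {
    key: (youden_j(*values), balanced_accuracy(*values))
    for key, values in reported.items()
}

for (feature_set, cohort), (j, bal_acc) in summary.items():
    print(f"{feature_set} ({cohort}): J = {j:.3f}, balanced accuracy = {bal_acc:.3f}")
```

Because both measures weight sensitivity and specificity equally, they are less affected than raw accuracy by a cohort in which benign nodules outnumber malignant ones.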

The semantic approach has practical relevance in nodule classification. In a recent lung nodule classification challenge, our team proposed semantic-based approaches to classify indeterminate pulmonary nodules. The challenge provided about 10 samples with known outcome for calibration or training and about 60 samples for blinded testing. Among more than 15 international participants, the semantic-based approach placed second with a test AUC of 0.66, whereas the computer-aided diagnosis (CAD) feature–based method placed first with an AUC of 0.68 (49–51).

Because of the retrospective study design and small sample size, and to avoid overfitting the data, we collected two independent cohorts (training and validation) from two institutions, Vanderbilt Medical Center and the Veterans Affairs Hospital in Nashville, with samples randomly mixed in the cohorts. Further multi-institutional studies with large numbers of patients are warranted to replicate these novel results.

Study limitations

Our study has some limitations. First, the number of patients was not large. We have made an effort to reduce false discovery by using separate training and test cohorts. Despite this approach, there could be biases in the patient population, as the current cohorts were predominantly male, derived as they were from a VA population. This could be mitigated by collecting samples in a larger multi-institutional cohort study. Second, radiologist training and preferences will influence the semantic scoring. Although this is less of a concern in a research setting, it may be an issue in clinical practice. Efforts have been made to standardize the scoring sheet with a descriptive atlas (e.g., Supplementary Fig. S1; Supplementary Table S2) in a way that will be acceptable to the community at large.
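Reader-dependent variability in semantic scoring of the kind described above is commonly quantified with the kappa statistic (refs. 18–20). The sketch below computes unweighted Cohen's kappa for two readers scoring the same nodules. The scores are made-up values on a hypothetical 1-to-5 scale and are not data from this study; the function name `cohens_kappa` is likewise an illustrative choice.

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Unweighted Cohen's kappa for two readers rating the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("Both readers must score the same non-empty set of cases.")
    n = len(reader_a)
    # Observed agreement: fraction of cases with identical scores.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Chance agreement: product of the readers' marginal score frequencies.
    count_a = Counter(reader_a)
    count_b = Counter(reader_b)
    expected = sum(count_a[score] * count_b[score] for score in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both readers gave every case the same single score.
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores for one trait on ten nodules (illustration only, not study data).
reader_a = [1, 1, 2, 3, 3, 4, 5, 5, 2, 1]
reader_b = [1, 2, 2, 3, 4, 4, 5, 5, 2, 1]

kappa = cohens_kappa(reader_a, reader_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

Unweighted kappa treats the scores as unordered categories; for an ordinal point scale, a weighted kappa that penalizes large disagreements more than adjacent-score disagreements may be the more appropriate choice.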

We have shown that radiological image traits are useful in differentiating malignant from benign nodules. These semantic features, along with size measurements, enhance prediction accuracy and reduce false-positives. The usefulness of radiological imaging traits (semantics) in predicting cancer status suggests that they can reduce diagnostic errors compared with clinical models. Together with conventional size-based measures, they could be used in the clinical workflow to better diagnose malignancy.

R.J. Gillies holds ownership interest (including patents) in and is a consultant/advisory board member for Health Myne. No potential conflicts of interest were disclosed by the other authors.

The content of the article is solely the responsibility of the authors.

Conception and design: Y. Balagurunathan, R.C. Walker, M.B. Schabath, R.J. Gillies, Y. Liu

Development of methodology: Y. Balagurunathan, R.C. Walker, P.P. Massion, R.J. Gillies, Y. Liu

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): T. Atwater, S. Antic, R.C. Walker, P.P. Massion, M.B. Schabath

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): Y. Balagurunathan, Q. Li, R.C. Walker, P.P. Massion, M.B. Schabath, R.J. Gillies, Y. Liu

Writing, review, and/or revision of the manuscript: Y. Balagurunathan, Q. Li, R.C. Walker, G.T. Smith, P.P. Massion, M.B. Schabath, R.J. Gillies, Y. Liu

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): R.J. Gillies

Study supervision: Y. Balagurunathan, R.J. Gillies

Other: Y. Liu

The NIH grant (CA 143062-01) and State of Florida Department of Health (2KT01) grant provided protected time for R.J. Gillies, Y. Balagurunathan, Y. Liu, Q. Li, and M. Schabath to work on the research project.

The NIH grants (CA186145, CA152662) and the Department of Defense grant W81XWH-11-2-0161 provided funding support for P.P. Massion. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

1. Siegel R, Ma J, Zou Z, Jemal A. Cancer statistics. CA Cancer J Clin 2014;64:9–29.
2. Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, Fagerstrom RM, et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365:395–409.
3. Aberle DR, Adams AM, Berg CD, Clapp JD, Clingan KL, Gareen IF, et al. Baseline characteristics of participants in the randomized national lung screening trial. J Natl Cancer Inst 2010;102:1771–9.
4. Patz EF Jr, Pinsky P, Gatsonis C, Sicks JD, Kramer BS, Tammemagi MC, et al. Overdiagnosis in low-dose computed tomography screening for lung cancer. JAMA Intern Med 2014;174:269–74.
5. Swensen SJ, Jett JR, Hartman TE, Midthun DE, Mandrekar SJ, Hillman SL, et al. CT screening for lung cancer: five-year prospective experience. Radiology 2005;235:259–65.
6. Croswell JM, Baker SG, Marcus PM, Clapp JD, Kramer BS. Cumulative incidence of false-positive test results in lung cancer screening: a randomized trial. Ann Intern Med 2010;152:505–12.
7. Henschke CI, Yankelevitz DF, Mirtcheva R, McGuinness G, McCauley D, Miettinen OS. CT screening for lung cancer: frequency and significance of part-solid and nonsolid nodules. Am J Roentgenol 2002;178:1053–7.
8. Boiselle PM, Chiles C, Partz E, Tammemagi M, Wood DE. Expert opinion: United States preventive services task force recommendation on screening for lung cancer. J Thorac Imaging 2014;29:197.
9. Gould MK, Ananth L, Barnett PG. A clinical model to estimate the pretest probability of lung cancer in patients with solitary pulmonary nodules. CHEST 2007;131:383–8.
10. Patel VK, Naik SK, Naidich DP, Travis WD, Weingarten JA, Lazzaro R, et al. A practical algorithmic approach to the diagnosis and management of solitary pulmonary nodules: part 2: pretest probability and algorithm. CHEST 2013;143:840–6.
11. National Comprehensive Cancer Network. NCCN Guidelines for Lung cancer screening; 2015. Available from: http://www.nccn.org/patients/guidelines/lung_screening/.
12. El-Baz A, Suri J. Lung imaging and computer aided diagnosis. Boca Raton, FL: CRC Press; 2011.
13. Brandman S, Ko JP. Pulmonary nodule detection, characterization, and management with multidetector computed tomography. J Thorac Imaging 2011;26:90–105.
14. Sayyouh M, Vummidi DR, Kazerooni EA. Evaluation and management of pulmonary nodules: state-of-the-art and future perspectives. Expert Opin Med Diagn 2013;7:629–44.
15. Matsuguma H, Mori K, Nakahara R, Suzuki H, Kasai T, Kamiyama Y, et al. Characteristics of subsolid pulmonary nodules showing growth during follow-up with CT scanning. Chest 2013;143:436–43.
16. Pinsky PF, Nath PH, Gierada DS, Sonavane S, Szabo E. Short- and long-term lung cancer risk associated with noncalcified nodules observed on low-dose CT. Cancer Prev Res 2014;7:1179–85.
17. Massion P, Walker R. Indeterminate pulmonary nodules: risk for having or for developing lung cancer? Cancer Prev Res 2014;7:1173–8.
18. Carletta J. Assessing agreement on classification tasks: the kappa statistic. Comput Linguistics 1996;22:249–54.
19. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 2005;85:257–68.
20. Kundel HL, Polansky M. Measurement of observer agreement. Radiology 2003;228:303–8.
21. Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika 1981;68:589–99.
22. Efron B, Tibshirani RJ. An introduction to the bootstrap. New York, NY: Chapman & Hall; 1993.
23. Good P. Permutation, Parametric and Bootstrap Tests of Hypotheses. New York, NY: Springer-Verlag; 2005.
24. Ruopp MD, Perkins NJ, Whitcomb BW, Schisterman EF. Youden index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom J 2008;50:419–30.
25. Youden WJ. Index for rating diagnostic tests. Cancer 1950;3:32–5.
26. Smialowski P, Frishman D, Kramer S. Pitfalls of supervised feature selection. Bioinformatics 2010;26:440–3.
27. Cawley GC, Talbot NLC. Preventing over-fitting during model selection via bayesian regularisation of the hyper-parameters. J Mach Learn Res 2007;8:841–61.
28. Hemphill E, Lindsay J, Lee C, Mandoiu II, Nelson CE. Feature selection and classifier performance on diverse biological datasets. BMC Bioinformatics 2014;15 Suppl 13:S4.
29. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York, NY: Springer-Verlag; 2013.
30. Zwirewich C, Vedal S, Miller R, Müller N. Solitary pulmonary nodule: high-resolution CT and radiologic-pathologic correlation. Radiology 1991;179:469–76.
31. Henschke CI, Yankelevitz DF, Naidich DP, McCauley DI, McGuinness G, Libby DM, et al. CT screening for lung cancer: suspiciousness of nodules according to size on baseline scans. Radiology 2004;231:164–8.
32. Zerhouni EA, Stitik F, Siegelman S, Naidich D, Sagel S, Proto A, et al. CT of the pulmonary nodule: a cooperative study. Radiology 1986;160:319–27.
33. Hu H, Wang Q, Tang H, Xiong L, Lin Q. Multi-slice computed tomography characteristics of solitary pulmonary ground-glass nodules: differences between malignant and benign. Thorac Cancer 2016;7:80–7.
34. Fan L, Liu SY, Li QC, Yu H, Xiao XS. Multidetector CT features of pulmonary focal ground-glass opacity: differences between benign and malignant. Br J Radiol 2012;85:897–904.
35. Gomez-Saez N, Hernandez-Aguado I, Vilar J, Gonzalez-Alvarez I, Lorente MF, Domingo ML, et al. Lung cancer risk and cancer-specific mortality in subjects undergoing routine imaging test when stratified with and without identified lung nodule on imaging study. Eur Radiol 2015;25:3518–27.
36. Wang H, Schabath M, Liu Y, Berglund A, Bloom A, Kim J, et al. Semiquantitative computed tomography characteristics for lung adenocarcinoma and their association with lung cancer survival. Clin Lung Cancer 2015;16:e141–63.
37. Herman GT, Yeung KTD. On piecewise-linear classification. IEEE Trans Pattern Anal Mach Intell 1992;14:782–6.
38. Sklansky J, Michelotti L. Locally trained piecewise linear classifiers. IEEE Trans Pattern Anal Mach Intell 1980;2:101–11.
39. Yang Z-G, Sone S, Takashima S, Li F, Honda T, Maruyama Y, et al. High-resolution CT analysis of small peripheral lung adenocarcinomas revealed on screening helical CT. AJR Am J Roentgenol 2001;176:1399–407.
40. Li F, Sone S, Maruyama Y, Takashima S, Yang Z-G, Hasegawa M, et al. Correlation between high-resolution computed tomographic, magnetic resonance and pathological findings in cases with non-cancerous but suspicious lung nodules. Eur Radiol 2000;10:1782–91.
41. Li F, Sone S, Abe H, MacMahon H, Doi K. Malignant versus benign nodules at CT screening for lung cancer: comparison of thin-section CT findings. Radiology 2004;233:793–8.
42. Kuriyama K, Tateishi R, Doi O, Kodama K, Tatsuta M, Matsuda M, et al. CT-pathologic correlation in small peripheral lung cancers. Am J Roentgenol 1987;149:1139–43.
43. Webb WR. Radiologic evaluation of the solitary pulmonary nodule. AJR Am J Roentgenol 1990;154:701–8.
44. Henschke CI, Yip R, Boffetta P, Markowitz S, Miller A, Hanaoka T, et al. CT screening for lung cancer: importance of emphysema for never smokers and smokers. Lung Cancer 2015;88:42–7.
45. Prosch H. Implementation of lung cancer screening: promises and hurdles. Trans Lung Cancer Res 2014;3:286–90.
46. Pinsky PF, Gierada DS, Black W, Munden R, Nath H, Aberle D, et al. Performance of Lung-RADS in the National Lung Screening Trial: a retrospective assessment. Ann Intern Med 2015;162:485–91.
47. Yankelevitz DF, Kostis WJ, Henschke CI, Heelan RT, Libby DM, Pasmantier MW, et al. Overdiagnosis in chest radiographic screening for lung carcinoma: frequency. Cancer 2003;97:1271–5.
48. Veronesi G, Maisonneuve P, Bellomi M, Rampinelli C, Durli I, Bertolotti R, et al. Estimating overdiagnosis in low-dose computed tomography screening for lung cancer: a cohort study. Ann Intern Med 2012;157:776–84.
49. SPIE-Medical-Imaging-2015. Challenge press release 2015. Available from: http://spie.org/about-spie/press-room/press-releases/mi15-lungx-wrap-3-24-2015.
50. Armato SG III, Hadjiiski L, Tourassi GD, Drukker K, Giger ML, Li F, et al. LUNGx challenge for computerized lung nodule classification: reflections and lessons learned. J Med Imaging 2015;2:020103.
51. Armato SG III, Hadjiiski L, Tourassi G, Drukker K, Giger M, Li F, et al. SPIE-AAPM-NCI Lung Nodule Classification Challenge Dataset. 2015. Available from: https://wiki.cancerimagingarchive.net/display/Public/LUNGx+SPIE-AAPM-NCI+Lung+Nodule+Classification+Challenge.