Abstract
Identification of early-stage pulmonary adenocarcinomas before surgery, especially in cases of subcentimeter cancers, would be clinically important and could guide clinical decision making. In this study, we developed a deep learning system based on 3D convolutional neural networks and multitask learning, which automatically predicts tumor invasiveness together with a 3D nodule segmentation mask. The system processes a 3D nodule-centered patch of preprocessed CT and learns a deep representation of a given nodule without the need for any additional information. A dataset of 651 nodules with manually segmented voxel-wise masks and pathological labels of atypical adenomatous hyperplasia (AAH), adenocarcinoma in situ (AIS), minimally invasive adenocarcinoma (MIA), and invasive pulmonary adenocarcinoma (IA) was used in this study. We trained and validated our deep learning system on 523 nodules and tested its performance on 128 nodules. An observer study with 2 groups of radiologists (2 senior and 2 junior) was also conducted. We merged AAH and AIS into a single category, AAH-AIS, resulting in a 3-category classification in our study. The proposed deep learning system achieved better classification performance than the radiologists: in terms of 3-class weighted average F1 score, the model achieved 63.3%, whereas the four radiologists achieved 55.6%, 56.6%, 54.3%, and 51.0%. These results suggest that deep learning methods improve the yield of discriminative results and hold promise in the CADx application domain, which could help doctors work efficiently and facilitate the application of precision medicine.
Machine learning tools are beginning to be implemented for clinical applications. This study represents an important milestone for this emerging technology, which could improve therapy selection for patients with lung cancer.
Introduction
Lung cancer is the leading cause of cancer-related deaths in the world. The International Association for the Study of Lung Cancer (IASLC) International Staging Project confirms that survival degrades progressively as tumor size increases (1), indicating that early detection and diagnosis is an effective and crucial way to decrease the mortality of patients with lung cancer. Lung cancer screening with low-dose computed tomography (LDCT) in high-risk patients (age >50 years, smoking history, family history of lung cancer in first-degree relatives, etc.) has remarkably facilitated early-stage pulmonary adenocarcinoma detection and diagnosis, especially for nodules less than 1 cm in diameter (subcentimeter; refs. 2, 3). Previous screening programs have shown that 60% to 70% of detected lung cancers were in stage I, and 56% were subcentimeter lesions (1). However, the management of subcentimeter tumors encountered on screening CT images remains controversial. A diameter of 10 mm is used as a cutoff value to distinguish preinvasive lesions (atypical adenomatous hyperplasia, AAH; adenocarcinoma in situ, AIS) from invasive lesions (minimally invasive adenocarcinoma, MIA; invasive pulmonary adenocarcinoma, IA) on CT images (4). However, previous studies reported that some subcentimeter ground-glass opacity nodules (GGN) may be MIA or IA (5, 6), and many such cases have also been recorded at our institution (see Fig. 1 for examples). Prognosis varies widely among the different pathologic subtypes (7). Therefore, early identification of the invasive characteristics before surgery would be clinically important and could guide clinical decision-making. However, subcentimeter GGNs presented on CT images make the differential diagnosis clinically difficult, because the typical radiographic features of early cancers (bubble lucency, pleural retraction, spiculated margin, etc.) are absent, which may confuse clinical decision-making. In addition, evaluation of a large number of detected GGNs by experts or radiologists can be time-consuming. In this context, computer-aided diagnosis (CADx), a much more efficient and effective way to evaluate detected nodules, is expected to play an important role in this clinical evaluation task and is a focus of current research.
Examples of nodule patches in the dataset in axial, coronal, and sagittal views. The type and volume (in cubic millimeters) of the nodules are shown in the subtitles. The blue contours represent the manually labeled boundaries of the nodules. Each patch is 32 mm in size. H&E, hematoxylin and eosin stain. Magnification, ×200.
In recent years, deep learning has become a powerful method of representation learning (8), reducing the need for hand-crafted feature engineering. With the help of end-to-end deep supervised learning, especially convolutional neural networks, great progress has been made on natural image problems such as image recognition (9), object detection (10), and semantic segmentation (9). The effectiveness of deep learning has been demonstrated in medical image analysis as well, for example in recent progress in skin cancer classification (11), diabetic retinopathy detection (12), and pulmonary nodule detection in chest CT (13). With hierarchical representation learning in 3D views, deep neural networks can discover new patterns beyond the typical radiographic features, which may be invisible or subtle to human eyes and traditional CADx systems.
In this article, a deep learning system is designed to address the problem of automatically predicting the tumor invasiveness of subcentimeter pulmonary adenocarcinomas from CT scans. The main contributions are 3-fold. First, we propose an automatic framework to predict tumor invasiveness, trained with preprocessed chest CTs and the corresponding pathological labels. The proposed method does not require nodule segmentation, an estimate of nodule size, or other predefined features at inference time. To the best of our knowledge, this is the first automatic learning system for this problem. The neural networks are trained and validated on a dataset of 523 nodules, and a hold-out test set of 128 nodules is used to evaluate our system fairly. Instead of classifying the 4 categories (AAH, AIS, MIA, and IA), the problem is formalized as a 3-category classification (merging AAH and AIS into a single class) owing to technical limitations explained in "Materials and Methods." Second, to make full use of the dataset, we propose a "bottom-up and top-down" multitask learning architecture that predicts nodule invasiveness together with the segmentation mask. By jointly training the neural networks on these two related tasks, the model learns to attend to the areas that deserve more attention. In practice, this multitask learning approach improves both accuracy and training convergence and makes the system less prone to overfitting. Finally, to fairly compare performance with humans, an observer study was conducted in which 2 groups of radiologists (2 senior and 2 junior) classified the 128 hold-out nodules. The proposed deep learning system achieves better classification performance than the radiologists in the observer study.
Materials and Methods
Data collection
This retrospective study was approved by the institutional review board of Huadong Hospital affiliated to Fudan University (No. 2017K062), which waived the requirement for patients' written informed consent in accordance with the CIOMS guidelines.
From October 2011 to October 2017, a search of the electronic medical records and the radiology information systems of the hospital for patients with subcentimeter pulmonary nodules identified on chest CT scans was performed by one author (Yingli Sun). A total of 651 subcentimeter nodules from 560 patients [mean age, 54.1 years ± 12.2 (SD); range, 16–82 years] were enrolled in the study. There were 182 men [mean age, 54.6 years ± 12.2 (SD); range, 26–82 years] and 378 women [mean age, 53.9 years ± 12.2 (SD); range, 16–80 years]. The unbalanced gender distribution reflects the female predominance characteristic of these lesions (7, 14). The inclusion criteria were as follows:
The presence of thin-slice chest CT (1–1.25 mm) scan before surgical treatment.
Nodules noted on CT examination with a diameter ≤10 mm.
No treatment before surgery.
Among the 651 subcentimeter nodules (see Table 1), 205 nodules were pathologically identified as preinvasive lesions (39 AAH and 166 AIS), whereas 446 nodules were invasive lesions (316 MIA and 130 IA). On preoperative CT evaluation, 21, 284, and 346 of the 651 nodules were classified as solid, part-solid, and pure GGNs, respectively.
Number of nodules for training, validation, and testing
| | Training and validation | Testing | Total |
|---|---|---|---|
| AAH | 33 | 6 | 39 |
| AIS | 134 | 32 | 166 |
| MIA | 252 | 64 | 316 |
| IA | 104 | 26 | 130 |
| AAH-AIS | 167 | 38 | 205 |
| Total | 523 | 128 | 651 |
NOTE: To make full use of the training data, data augmentation was performed on the fly during the training process.
Preoperative chest CT was performed using the following four scanners: GE Discovery CT750 HD (143 nodules) and 64-slice LightSpeed VCT (199 nodules; GE Medical Systems); Somatom Definition Flash (150 nodules) and Somatom Sensation-16 (159 nodules; Siemens Medical Solutions), with the following parameters: 120 kVp; 100–200 mAs; pitch, 0.75–1.5; and collimation, 1–1.25 mm. All imaging data were reconstructed using a medium-sharp reconstruction algorithm with a thickness of 1–1.25 mm. 259 of the 561 patients were administered contrast material after the non-contrast-enhanced CT scan. In the case of contrast-enhanced CT, a bolus of 80–100 mL of IV contrast medium (350 mg I/mL; Optiray, Mallinckrodt) was administered at a rate of 3–4 mL/s with the use of a power injector via an 18- or 20-gauge cannula in an antecubital vein. The contrast-enhanced CT scan was acquired 60 seconds after the administration of contrast medium. In this study, only the unenhanced images of the latest CT examination before surgery were collected. In all patients, CT images were acquired in the supine position at full inspiration. The mean interval between the latest CT examination and surgery was 13 days (range, 1–132 days; median, 7 days).
Nodule labeling and segmentation
The medical image processing and navigation software 3D Slicer (version 4.8.0, Brigham and Women's Hospital) was used to manually delineate the volume of interest (VOI) of each of the 651 included subcentimeter nodules at the voxel level by one radiologist (Yingli Sun, with 5 years of experience in chest CT interpretation); each VOI was then confirmed by another radiologist (Ming Li, with 12 years of experience in chest CT interpretation). Large vessels and bronchioles were excluded as much as possible from the volume of the nodule. The lung CT images in DICOM (Digital Imaging and Communications in Medicine) format were imported into the software for delineation, and the images with VOI information were then exported in NIfTI (.nii) format for subsequent analysis. Each segmented nodule was given a specific pathological label (AAH, AIS, MIA, or IA) according to the detailed pathological report.
Dataset pretreatment
The data collected for this research are split into 5 parts (Subsets 0, 1, 2, 3, and 4); each subset is formed by randomly selecting 20% of each of the 4 categories. Subset 4 is the hold-out test set and is never used before the final evaluation. Subsets 0–3 are used for training and validation. See Table 1 for the detailed numbers of nodules used for training, validation, and testing. Hyperparameters are selected via cross-validation on Subsets 0–3, and the final model is then trained on all of Subsets 0–3 with the fixed hyperparameters.
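The splitting strategy can be sketched as follows; this is an illustrative reimplementation (the function name and random seed are our own assumptions), not the authors' released code.

```python
import numpy as np
from collections import defaultdict

def stratified_subsets(labels, n_subsets=5, seed=0):
    """Assign each nodule to one of `n_subsets` subsets so that every subset
    receives roughly 20% of each pathological category. `labels` is a list of
    per-nodule category strings (e.g., "AAH", "AIS", "MIA", "IA")."""
    rng = np.random.RandomState(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    assignment = np.empty(len(labels), dtype=int)
    for indices in by_class.values():
        indices = np.array(indices)
        rng.shuffle(indices)
        for subset_id, chunk in enumerate(np.array_split(indices, n_subsets)):
            assignment[chunk] = subset_id
    return assignment  # Subset 4 serves as the hold-out test set
```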
However, the AAH samples are clearly too few to train the deep neural networks fairly. This is inherent to the problem: these lesions are usually considered benign and rarely undergo surgical resection unless obvious malignant signs are present on the CT images, so pathologically confirmed AAH nodules are rare in practice. We therefore merged the samples labeled AAH and AIS into a single class, "AAH-AIS," to avoid the shortage of training samples. This merging is still reasonable in the clinical context, as these two subtypes of lesions (≤3 cm) are reported to have a 100% disease-specific survival if they are completely resected (7). In this way, the invasiveness prediction is treated as a 3-category classification problem in this work.
In the development of the deep learning system, each data sample is defined as:
A 3D patch of 32 mm × 32 mm × 32 mm, cropped from the CT scan at the mass center of the nodule.
The pathologically identified label of invasiveness, in one of AAH-AIS, MIA, and IA.
Manually labeled voxel-wise nodule mask.
For efficient training of the networks, online data augmentation is performed. The details of the hyperparameter settings, the generation of 3D patches, and the neural network design and training are explained below.
Observer study
To compare the deep learning system with human performance, four radiologists (two senior radiologists, Ming Li and Weilan Wu, each with more than ten years of experience in chest CT interpretation, and two junior radiologists, Wei Zhao and Zhiming Yang, each with more than 3 years of experience in chest CT interpretation) independently classified and diagnosed all test-set nodules while blinded to the histopathologic results and clinical data. The four chest radiologists classified the nodules on the basis of the new classification standard for lung adenocarcinoma published in 2011 (7).
Deep learning system
The input of the proposed model is a cubic patch of 32 mm × 32 mm × 32 mm, generated from a (preprocessed) chest CT scan and a position c = [z, y, x], that is, the (rough) mass center of the nodule, which can be marked manually or obtained by an automatic nodule detection system (13). The output of the model is the categorical probability for the 3 categories (AAH-AIS, MIA, IA), together with the model-generated nodule segmentation mask. The framework is based on the proposed 3D convolutional neural networks (CNN), referred to as DenseSharp Networks, which process the input cubes via a "bottom-up and top-down" architecture: the classification head, as the bottom-up path, enforces the network to extract features meaningful for diagnosis, whereas the segmentation head works as the top-down path and teaches the network to attend to the regions of interest (ROI). With multitask learning, DenseSharp Networks can learn the classification and segmentation tasks end-to-end efficiently.
Generation of 3D patches
The 3D patches are generated by cropping the preprocessed volumetric data into a size of 32 × 32 × 32 voxels (1 voxel denotes 1 mm³). The preprocessing follows a "standard" procedure for chest CT: the input CT scans are converted into Hounsfield units, the volumetric data are resampled to a spacing of 1 mm × 1 mm × 1 mm by trilinear interpolation, the voxel intensities are clipped to I_HU ∈ [−1024, 400] and quantized into grayscale, and the values are transformed to I ∈ [−1, 1) by the mapping I = ⌊(I_HU + 1024)/(400 + 1024) × 255⌋/128 − 1.
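A minimal sketch of this preprocessing and patch extraction, assuming the scan is already in Hounsfield units and the nodule mass center is given in millimeter coordinates (boundary padding is omitted for brevity):

```python
import numpy as np
from scipy.ndimage import zoom

def make_patch(volume_hu, spacing_zyx, center_mm, size=32):
    """Preprocess a CT volume and crop a nodule-centered patch (illustrative)."""
    # Resample to isotropic 1 mm x 1 mm x 1 mm spacing (trilinear, order=1).
    iso = zoom(volume_hu.astype(np.float32), np.asarray(spacing_zyx), order=1)
    # Clip to [-1024, 400] HU, quantize to 8-bit grayscale, and map to [-1, 1).
    gray = np.floor((np.clip(iso, -1024, 400) + 1024) / (400 + 1024) * 255)
    normalized = gray / 128.0 - 1.0
    # Crop a size^3 voxel patch centered at the nodule mass center.
    lo = np.round(np.asarray(center_mm)).astype(int) - size // 2
    z, y, x = lo
    return normalized[z:z + size, y:y + size, x:x + size]
```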
The study uses many data augmentation techniques to increase the training data size, including:
Rotation by 90° increments
Left-right flipping
Translation by small amounts within [−3, 3] voxels along each axis
Reordering of axes
Zooming with a ratio in [0.8, 1.15].
For the efficient use of the training data, data augmentation is performed on the fly during the training process, which acts as strong regularization for our models. Other, more sophisticated augmentation techniques such as elastic transformation and salt-and-pepper noise were also tried; however, they yielded no significant improvement. A minimal sketch of the on-the-fly augmentation is given below.
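The sketch below covers the listed operations except zooming (omitted for brevity) and is an illustrative helper, not the authors' exact implementation; `patch` and `mask` are assumed to be 32 × 32 × 32 arrays.

```python
import numpy as np

def augment(patch, mask, rng=np.random):
    """Apply random rotation, flip, axis reordering, and translation jointly
    to a patch and its segmentation mask (illustrative sketch)."""
    # Rotation by a random multiple of 90 degrees in a random plane.
    plane = tuple(rng.choice(3, size=2, replace=False))
    k = rng.randint(4)
    patch, mask = np.rot90(patch, k, axes=plane), np.rot90(mask, k, axes=plane)
    # Random left-right flip.
    if rng.rand() < 0.5:
        patch, mask = patch[:, :, ::-1], mask[:, :, ::-1]
    # Random reordering of the axes.
    order = tuple(rng.permutation(3))
    patch, mask = patch.transpose(order), mask.transpose(order)
    # Small random translation within [-3, 3] voxels along each axis
    # (np.roll wraps around; cropping from a larger patch would avoid this).
    shift = tuple(rng.randint(-3, 4, size=3))
    patch = np.roll(patch, shift, axis=(0, 1, 2))
    mask = np.roll(mask, shift, axis=(0, 1, 2))
    return patch, mask
```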
The DenseSharp architecture
Because of the limited availability of data, the network should be compact so that the training procedure is relatively easy. DenseNets (15) have demonstrated compelling accuracy with more efficient use of parameters on natural image recognition tasks. To leverage the power of dense connectivity, we extended the 2D DenseNets into a 3D variant following the "bottleneck" and "compression" design (15), which naturally becomes the bottom-up classification head for predicting the invasiveness labels. Inspired by DeepMask (16) and SharpMask (17), a top-down segmentation head is used for predicting the mask of the nodule located near the center of the patch, using features extracted by the same network. We emphasize, however, that the segmentation head is mainly used to teach the neural network where to pay more attention; thus, the segmentation head is designed to be lightweight and consists of only transposed convolutions without nonlinear activation.
The architecture of the proposed DenseSharp Networks is illustrated in Fig. 2. Specifically, it consists of stacked densely connected blocks (i.e., Dense Blocks), and each block consists of several convolutional modules (4 Dense Block modules for this task). In each convolutional module, 1 × 1 × 1 convolution kernels with 64 filters followed by 3 × 3 × 3 convolution kernels with 16 filters (growth rate k = 16), the so-called "bottleneck" technique (bottleneck B = 4), are used for efficient 3D representation learning. Batch normalization (18) layers are used to reduce internal covariate shift, and rectified linear units, that is, ReLU(x) = max(0, x), act as the nonlinear transform. The input features and the features transformed by the convolutional module are concatenated before being sent to the next module; consequently, subsequent layers receive feature maps from all their preceding layers. After an entire Dense Block, the feature maps are "compressed" using 1 × 1 × 1 convolution kernels with halved filters (compression C = 2) and then down-sampled by pooling. Finally, the last Dense Block followed by a global pooling layer (19) yields a 120-channel representation, and a fully connected layer with softmax activation, acting as the classification head, outputs the invasiveness labels.
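The building blocks described above can be sketched in Keras as follows; the exact layer ordering and the transition-layer details are assumptions based on the DenseNet design, not taken from the authors' released code.

```python
import keras.backend as K
from keras.layers import (Conv3D, BatchNormalization, Activation,
                          Concatenate, AveragePooling3D)

def dense_block_3d(x, n_modules=4, growth_rate=16, bottleneck=4):
    """3D Dense Block: each module is BN-ReLU-1x1x1 conv (4 x 16 = 64 filters)
    followed by BN-ReLU-3x3x3 conv (16 filters), concatenated with its input."""
    for _ in range(n_modules):
        y = Activation('relu')(BatchNormalization()(x))
        y = Conv3D(bottleneck * growth_rate, 1, use_bias=False)(y)
        y = Activation('relu')(BatchNormalization()(y))
        y = Conv3D(growth_rate, 3, padding='same', use_bias=False)(y)
        x = Concatenate()([x, y])
    return x

def transition_3d(x, compression=2):
    """Compression (1x1x1 conv with halved filters) and down-sampling."""
    filters = K.int_shape(x)[-1] // compression
    x = Activation('relu')(BatchNormalization()(x))
    x = Conv3D(filters, 1, use_bias=False)(x)
    return AveragePooling3D(2)(x)
```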
The architecture of the proposed DenseSharp Network. A, The “classify head” gives the invasiveness labels, whereas the “segment head” gives the nodule mask. B, The illustration of the convolutional module in the Dense Block.
The lightweight segmentation head outputs the nodule mask using the features extracted by the three Dense Blocks. Inspired by ref. 19, rather than restoring the original resolution directly with one large up-sampling transposed convolutional layer, the feature maps are up-sampled gradually, and the high-resolution low-level features and the up-sampled high-level features are summed (shown in the top-down path). In this way, low-level and high-level features are well combined to predict the segmentation masks, and the segmentation head becomes a top-down path for the entire network. The detailed architecture of the 3D DenseSharp Networks is shown in Supplementary Table S1.
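A sketch of this top-down path, assuming three feature maps from the Dense Blocks at decreasing resolutions; the filter counts and the 1 × 1 × 1 channel-matching convolutions are illustrative assumptions, not the exact configuration.

```python
from keras.layers import Conv3D, Conv3DTranspose, Add

def segmentation_head(low_feats, mid_feats, high_feats):
    """Gradually up-sample high-level features with transposed convolutions
    and sum them with higher-resolution features from earlier Dense Blocks."""
    x = Conv3DTranspose(32, 2, strides=2)(high_feats)                 # e.g., 4^3 -> 8^3
    x = Add()([x, Conv3D(32, 1)(mid_feats)])                          # fuse mid-level features
    x = Conv3DTranspose(16, 2, strides=2)(x)                          # 8^3 -> 16^3
    x = Add()([x, Conv3D(16, 1)(low_feats)])                          # fuse low-level features
    return Conv3DTranspose(1, 2, strides=2, activation='sigmoid')(x)  # 16^3 -> 32^3 voxel mask
```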
The networks were implemented in Python 3.6 using the TensorFlow 1.4.0 (20) and Keras 2.1.5 (21) deep learning libraries and were trained on a workstation with 2 NVIDIA TITAN X GPUs.
Code is open source at https://github.com/duducheng/DenseSharp/.
Training
The proposed DenseSharp Networks have two output heads, each trained with its own loss. Stochastic gradient descent was used to minimize the cross-entropy between the classification outputs and the target labels for training the classification head, and to maximize the Dice coefficient between the predicted masks and the manually labeled nodule masks for training the segmentation head. The two heads are trained jointly with a multitask loss
ℓ_joint = ℓ_cls(y_cls, t_cls) + λ · ℓ_seg(y_seg, t_seg),
where ℓ_cls denotes the cross-entropy loss, y_cls is the output of the classification head, and t_cls is the invasiveness label; ℓ_seg denotes the Dice loss, y_seg is the output of the segmentation head, and t_seg is the manually labeled nodule mask. We chose λ = 0.2 because segmentation works as an auxiliary supervised task.
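A minimal sketch of these loss terms in Keras backend notation; the smoothing constant in the Dice loss is our own assumption, not a value quoted from the paper.

```python
import keras.backend as K

def dice_loss(t_seg, y_seg, smooth=1.0):
    """Soft Dice loss (1 - Dice coefficient) between the predicted and
    manually labeled masks; `smooth` is an assumed stabilizing constant."""
    t, y = K.flatten(t_seg), K.flatten(y_seg)
    intersection = K.sum(t * y)
    return 1.0 - (2.0 * intersection + smooth) / (K.sum(t) + K.sum(y) + smooth)

def joint_loss(t_cls, y_cls, t_seg, y_seg, lam=0.2):
    """Multitask objective: cross-entropy for the classification head plus
    lambda times the Dice loss for the auxiliary segmentation head."""
    cls_term = K.mean(K.categorical_crossentropy(t_cls, y_cls))
    return cls_term + lam * dice_loss(t_seg, y_seg)
```

In a two-output Keras model, the same weighting could equivalently be expressed by passing `loss={'clf': 'categorical_crossentropy', 'seg': dice_loss}` and `loss_weights={'clf': 1.0, 'seg': 0.2}` to `model.compile` (the output names here are illustrative).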
To train the ConvNets, all network parameters are initialized using the "He uniform" method (9). We tried to pretrain the neural networks on LIDC-IDRI (22), a widely used lung nodule database, in a multitask learning scheme with radiological benign/malignant labels and nodule segmentation; however, it did not improve classification performance in practice. During optimization, the training data are sampled with a ratio of 1:1:1 for the 3 classes with a batch size of 24, and the Adam optimizer (23) with a fixed learning rate of 10⁻⁴ is used to update the model parameters. Training is stopped early after 60 epochs. Neither weight decay nor dropout (24) is used in the network.
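The 1:1:1 class-balanced sampling can be sketched as follows (an illustrative helper; `labels` is assumed to be an integer array of class ids for the training nodules):

```python
import numpy as np

def balanced_batch(labels, batch_size=24, n_classes=3, rng=np.random):
    """Sample indices for one training batch with a 1:1:1 class ratio."""
    per_class = batch_size // n_classes
    picked = []
    for c in range(n_classes):
        pool = np.where(np.asarray(labels) == c)[0]
        picked.extend(rng.choice(pool, size=per_class, replace=True))
    picked = np.array(picked)
    rng.shuffle(picked)
    return picked
```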
Prediction
Given an input 3D patch of 32 mm × 32 mm × 32 mm from a CT scan, the trained network predicts the 3-class probability of invasiveness together with the nodule mask in a single forward pass. Because of the randomness in the neural network optimization process, a 15-run ensemble is constructed to reduce the error variance, that is, an average over 15 experiments with the same hyperparameter setting. The predicted invasiveness label is assigned by y = argmax_k (1/15) Σ_{i=1}^{15} proba_i, where proba_i is the class probability vector predicted in run i.
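A sketch of this ensembling step, assuming each trained model returns the pair (class probabilities, mask) for a preprocessed patch:

```python
import numpy as np

def ensemble_predict(models, patch):
    """Average the 3-class probabilities of independently trained models
    and take the argmax (15-run ensemble as described above)."""
    x = patch[np.newaxis, ..., np.newaxis]         # add batch and channel axes
    probs = []
    for m in models:
        cls_prob, _mask = m.predict(x)             # use the classification head only
        probs.append(cls_prob[0])
    return int(np.argmax(np.mean(probs, axis=0)))  # 0: AAH-AIS, 1: MIA, 2: IA
```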
Results
Evaluation on three-category classification
After training, all nodules in the test set were processed by the proposed deep network, namely the DenseSharp Network, a multitask architecture for classification and segmentation. As mentioned earlier, instead of classifying the 4 categories AAH, AIS, MIA, and IA, we merged AAH and AIS into one single category, AAH-AIS, yielding a 3-category classification.
We evaluated the classification performance of the best result on the test set, which is an ensemble of 15 experiments with the same setting to reduce the variance of neural network training. Because of the skewed distribution of the 3 nodule types, the classification accuracy [Accuracy = (1/n) Σ_i 1(y_i = t_i)], the per-class F1-score [F1_cls = 2 · Precision_cls · Recall_cls / (Precision_cls + Recall_cls)], and the weighted average F1-score [F1_AVG = (n_AAH-AIS · F1_AAH-AIS + n_MIA · F1_MIA + n_IA · F1_IA) / (n_AAH-AIS + n_MIA + n_IA)] are compared with the radiologists' performance. In addition, the multiclass Matthews correlation coefficient (MCC; ref. 25), a metric less sensitive to class imbalance, is used for evaluation. The results of the comparison are reported in Table 2: the best model (3D DenseSharp Network) achieves better classification performance on all metrics, apart from a minor disadvantage in F1_IA, even compared with the senior radiologists, indicating the effectiveness of the proposed method. The proposed model without multitask learning (3D DenseNet) still achieves classification performance matching or exceeding that of the observers.
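These metrics can be computed with scikit-learn as follows (an evaluation sketch, not the authors' scripts; labels are encoded as 0: AAH-AIS, 1: MIA, 2: IA):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def evaluate(y_true, y_pred):
    """Compute the metrics defined above from integer class labels."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'f1_per_class': f1_score(y_true, y_pred, average=None),       # one score per class
        'f1_weighted': f1_score(y_true, y_pred, average='weighted'),  # support-weighted average
        'mcc': matthews_corrcoef(y_true, y_pred),                     # multiclass MCC
    }
```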
Three-category classification performance for nodule invasiveness, in terms of accuracy, per-class F1-score, weighted average F1-score, and MCC
| | Accuracy | F1_AAH-AIS | F1_MIA | F1_IA | F1_AVG | MCC |
|---|---|---|---|---|---|---|
| 3D DenseSharp Network | 64.1% | 55.7% | 68.1% | 62.7% | 63.3% | 0.407 |
| 3D DenseNet | 59.4% | 45.2% | 62.9% | 66.7% | 58.4% | 0.332 |
| 2D DenseNet | 43.0% | 54.9% | 50.5% | 59.3% | 53.6% | 0.293 |
| Pretrained Inception-v3 | 35.9% | 55.6% | 34.8% | 53.6% | 44.9% | 0.249 |
| Senior 1 | 55.4% | 50.0% | 55.9% | 63.0% | 55.6% | 0.304 |
| Senior 2 | 56.3% | 50.6% | 57.9% | 62.5% | 56.6% | 0.307 |
| Junior 1 | 53.9% | 49.4% | 53.3% | 63.8% | 54.3% | 0.271 |
| Junior 2 | 50.8% | 48.9% | 59.6% | 48.7% | 51.0% | 0.234 |
NOTE: "3D DenseSharp Network" denotes the results of our proposed network, and "3D DenseNet" denotes the performance without multitask learning. A 2D DenseNet with a similar architecture and comparable parameters and an Inception-v3 pretrained on the ImageNet database, both processing 2.5D (multiview) CT images, are shown for comparison. Results for the four observers (2 senior and 2 junior radiologists) are also reported. Higher is better.
Two 2D deep convolutional neural networks are used for comparison. As shown in Table 2, the 2D (or 2.5D) CNNs perform worse than the 3D ones. These 2D networks process 3-channel inputs of 2.5D CT images (axial, coronal, and sagittal views); see Supplementary Fig. S1 for an illustration. The 2D DenseNets follow a design pattern similar to our 3D DenseNets, using 2D convolutions instead of 3D convolutions; in addition, we change the depth and number of filters of the 2D DenseNets to keep the number of trainable parameters comparable with our 3D DenseNets. Inception-v3 networks (26), on the other hand, have achieved great success on both natural images and 2D medical images (11). We use an Inception-v3 network pretrained on ImageNet (27) and fine-tune it on the 2.5D CT images. See Supplementary "The details on the 2D CNNs processing the 2.5D CT images" for more.
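As an illustration, a 3-channel 2.5D input can be built from a 32 × 32 × 32 patch by stacking its three center slices (a sketch of the baseline input, not necessarily the exact view extraction described in the Supplementary):

```python
import numpy as np

def multiview_2p5d(patch):
    """Stack the axial, coronal, and sagittal center slices of a cubic patch
    into a 3-channel 2D image of shape (32, 32, 3)."""
    c = patch.shape[0] // 2
    return np.stack([patch[c, :, :],   # axial
                     patch[:, c, :],   # coronal
                     patch[:, :, c]],  # sagittal
                    axis=-1)
```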
The 3-class confusion matrix is shown in Table 3. The model is not prone to severe mistakes: no nodule labeled AAH-AIS is predicted as IA, and no nodule labeled IA is predicted as AAH-AIS, which suggests the model implicitly learns the relationship among the 3 categories. The observers' results do not share this property.
Confusion matrix of the 3-category classification on the test set
| Ground truth | 3D DenseSharp: AAH-AIS | 3D DenseSharp: MIA | 3D DenseSharp: IA | Radiologists: AAH-AIS | Radiologists: MIA | Radiologists: IA |
|---|---|---|---|---|---|---|
| AAH-AIS | 17 | 21 | 0 | 22.00 ± 0.71 | 14.25 ± 0.43 | 1.75 ± 0.83 |
| MIA | 6 | 49 | 9 | 26.00 ± 3.08 | 32.00 ± 2.55 | 6.00 ± 1.41 |
| IA | 0 | 10 | 16 | 2.50 ± 1.12 | 8.25 ± 1.30 | 15.25 ± 1.09 |
NOTE: The 15-run ensemble results are listed under "3D DenseSharp Network," and the observers' (radiologists') results are reported as the mean ± standard deviation over the four radiologists.
Evaluation on two subtasks of binary classification
To fairly analyze the classification performance of our model trained on the three-category classification, we considered two clinically important subtasks: (a) binary classification of invasive nodules (IA or MIA) versus preinvasive nodules (AIS or AAH), and (b) binary classification of IA nodules versus non-IA nodules (MIA, AIS, or AAH).
Subtask (a) is urgently needed in clinical practice. According to the recently proposed IASLC/ATS/ERS classification (7), lesions corresponding to preinvasive AAH or AIS generally warrant a conservative approach emphasizing long-term CT surveillance, whereas MIA and IA need elective or immediate surgical treatment owing to a worse prognosis than preinvasive lesions. For subtask (a), we merged the output scores of the trained three-category model for IA and MIA by addition, that is, y_invasive = y_IA + y_MIA. In this way, our model achieved an area under the receiver operating characteristic curve (AUC) of 0.788 on subtask (a).
However, patients with MIA have a disease-free survival rate close to 100% if they are treated with limited resection (7), whereas those with IA have a disease-free survival rate of only 60% to 70% (28–30), indicating that more aggressive surgical treatment and subsequent therapy (e.g., chemotherapy) are needed. In addition, a few previous studies have merged AAH, AIS, and MIA into one category as well (31, 32). These are the reasons we addressed subtask (b). For subtask (b), we considered solely the output score for IA, and our model achieved an AUC of 0.880.
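Both subtask scores can be derived from the 3-class output as follows (a sketch; `proba` is assumed to hold the columns [AAH-AIS, MIA, IA] and `y_true` the integer labels 0, 1, 2):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subtask_aucs(proba, y_true):
    """Compute the AUCs of the two binary subtasks from 3-class probabilities."""
    proba, y_true = np.asarray(proba), np.asarray(y_true)
    # Subtask (a): invasive (MIA or IA) vs. preinvasive (AAH-AIS).
    auc_a = roc_auc_score((y_true >= 1).astype(int), proba[:, 1] + proba[:, 2])
    # Subtask (b): IA vs. non-IA, using the IA score alone.
    auc_b = roc_auc_score((y_true == 2).astype(int), proba[:, 2])
    return auc_a, auc_b
```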
As depicted in Fig. 3, the deep models trained with the three-category classification approach produced competitive performance on the two binary classification subtasks, on par with, if not better than, the radiologists in our observer study. In fact, this strategy, that is, training the CNNs on a finer disease partition but running coarser inference, has been successfully applied in a previous study (11). See Supplementary Tables S2 and S3 for the detailed evaluation metrics of accuracy, weighted average F1-score, MCC, and AUC on these two subtasks, and Supplementary Table S4 for the original experimental results.
ROC curves for the two binary classification subtasks. The performance of the radiologists in the observer study is also depicted. A, Binary classification of invasive nodules (IA or MIA) versus preinvasive nodules (AIS or AAH). B, Binary classification of IA nodules versus non-IA nodules (MIA, AIS, or AAH).
Importance of multitask learning
We argue that the top-down segmentation head is critical for training the bottom-up classification head because it teaches the network to attend to the nodule region. The DenseSharp Network with multitask learning outperforms the one without it (see "3D DenseNet" in Table 2) on almost all classification metrics. In addition, the DenseSharp Networks were found to train faster: they achieve their best performance after about 60 epochs of training, whereas the 3D DenseNets need 100 epochs. See Supplementary "Training and inference time cost" for more details.
Although segmentation works only as an auxiliary task, the DenseSharp Networks predict fairly good nodule masks (see Fig. 4). On the test set, we achieved an average Dice coefficient of 74.12% between the manually labeled and the model-predicted masks.
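The reported Dice coefficient corresponds to the standard overlap measure sketched below; the binarization threshold is our own assumption.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, threshold=0.5):
    """Dice coefficient between a predicted probability mask and the manual
    binary mask (the evaluation counterpart of the Dice loss used in training)."""
    p = (np.asarray(pred_mask) > threshold).astype(np.float32)
    t = np.asarray(true_mask).astype(np.float32)
    return 2.0 * (p * t).sum() / (p.sum() + t.sum() + 1e-7)
```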
Examples of the nodule segmentation predicted by the trained model. The blue contours show the manual segmentation, and the orange ones show the predicted segmentation at the center slice of the patches. The manually labeled masks (ground truth) and the predicted masks are also illustrated as 3D contours. The color indicates the depth of each voxel in the coronal dimension. The well-trained neural network predicts the nodule mask and the invasiveness type in a single forward computation.
Discussion
Automatic tumor invasiveness prediction from CT scans can provide important medical insights. In this article, the task is tackled using a novel 3D convolutional neural network based on DenseNet. The approach, with efficient multitask learning, demonstrates promising accuracy on this task and achieves better classification performance than the radiologists in the observer study.
The ability to learn and discover hierarchical features invisible to the human eye is one of a deep learning system's strengths, facilitating better performance in differentiating subtypes of lung cancer. In contrast, radiologists diagnose the lesions mainly on the basis of typical visible radiographic features (size, lesion margin, solid component, etc.), which might make them less sensitive to some local evidence compared with machine learning models. Moreover, the substantial overlap among radiographic features of preinvasive and invasive lesions makes it very challenging for radiologists to assess them correctly. Experience helps radiologists improve their diagnostic accuracy on tumor invasiveness, as shown in this study (see Table 2). However, the incremental improvement is relatively limited, probably because of inadequate training of radiologists in subcentimeter GGN interpretation. Therefore, when radiographic features suggesting malignancy are absent or not fully identified by a radiologist, an inappropriate diagnosis may result, especially for early-stage lung cancers.
Although the proposed deep learning system shows some advantages over the radiologists on this problem, several limitations remain. Biased and insufficient data for training the neural networks could limit performance. The model tends to predict AAH-AIS as MIA, whereas the radiologists tend to label MIA as AAH-AIS (see Table 3), which may also indicate a data collection bias. In addition, the proposed deep learning system uses 32 mm × 32 mm × 32 mm patches of nodules to diagnose tumor invasiveness, whereas, ideally, radiologists can use the entire CT scan together with other information (patient age, smoking history, medical history, etc.) to better estimate tumor invasiveness. Aggregating the global context of the lung and patient information may further boost classification performance.
To the best of our knowledge, this dataset is already the largest for this kind of research on automatically predicting tumor invasiveness of subcentimeter GGNs; it is, however, still insufficient. Pathological subtypes of lung cancer such as mucinous AIS and nonlepidic predominant growth pattern lung adenocarcinomas (i.e., acinar, papillary, micropapillary, and/or solid; ref. 7) are rare and may not be fully learned by the deep neural networks. The training of deep neural networks should benefit from more data. Another limitation is the lack of external validation on an independent dataset from other institutions, regions, and races. However, such a public dataset (subcentimeter lung adenocarcinomas, mainly presenting as GGNs, labeled with the new diagnostic standard of AAH, AIS, MIA, and IA) hardly exists to date. Transferring neural network knowledge trained on larger databases for other related tasks (33), beyond nodule segmentation and invasiveness classification, could also bring further improvements. Alternatively, pretraining on thousands of relatively cheap natural images is still worth further exploration. Inspired by recent advances in video analysis, it is feasible to convert 2D convolution kernels into 3D by "inflating" them (34). Moreover, our 3D neural networks for medical image analysis may also benefit from large 3D neural networks pretrained on large-scale video datasets (35). We leave this as a future direction, which may further boost the discriminative performance of our method.
Another limitation of this study is the interpretability of the deep learning system. Although there has been great progress on the interpretability of machine learning systems (36, 37) and deep learning systems (38, 39), fully understanding the internal mechanisms of deep neural networks is still a nontrivial task. In biomedical analysis in particular, we want to understand how imaging representations are associated with specific molecular patterns, genotypes (e.g., EGFR), and the intratumoral microenvironment, which remains an even harder challenge at present. Such work, explaining the biological processes underlying the deep learning models, was attempted in our study by analyzing the association between the deep-learned representations and EGFR mutation status. However, only 94 of the 651 nodules in this study underwent EGFR mutation testing (see Supplementary Fig. S2), which is not enough for current deep learning methods to produce reasonable results. We will further address interpretability for AI, especially in the medical context, by associating imaging information with genotypes and biomarkers in a probabilistic deep learning framework. Moreover, combining deep learning with radiomics (33) may also help with robustness and interpretability.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: W. Zhao, J. Yang, Y. Hua, M. Li
Development of methodology: W. Zhao, J. Yang, Y. Sun, W. Wu, Z. Yang, B. Ni, Y. Hua, M. Li
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): W. Zhao, J. Yang, Y. Sun, L. Jin, Z. Yang, P. Gao, Y. Hua, M. Li
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): W. Zhao, J. Yang, Y. Sun, C. Li, W. Wu, L. Jin, Z. Yang, Y. Hua, M. Li
Writing, review, and/or revision of the manuscript: W. Zhao, J. Yang, Y. Sun, C. Li, W. Wu, L. Jin, Z. Yang, B. Ni, P. Wang, Y. Hua, M. Li
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Yang, Z. Yang, P. Wang, Y. Hua, M. Li
Study supervision: W. Zhao, B. Ni, P. Wang, Y. Hua, M. Li
Other (algorithm and software development): J. Yang
Acknowledgments
We are grateful to Drs. Dexi Bi and Mengdi Xu for critically revising the manuscript. We thank Yuxiang Ye and Liang Ge of Diannei Technology for their generous help with data and insightful discussion. We also express our sincere appreciation to all the kind and enthusiastic staff from TCIA, NLST, and CDAS, who helped us address the problem of public dataset validation. This study was supported by the Research Program of Shanghai Hospital Development Center SHDC22015025 (to M. Li), the National Key Research and Development Program of China 2017YFC0112800 (to P. Wang) and 2017YFC0112905 (to M. Li), the National Science Foundation of China 61502301 (to B. Ni), and the Medical Imaging Key Program of Wise Information Technology of 120, Health Commission of Shanghai 2018ZHYL0103 (to M. Li). This study was supported by the SJTU-UCLA Joint Center for Machine Perception and Inference (to B. Ni and J. Yang). The study was also partially supported by China's Thousand Youth Talents Plan, STCSM 17511105401, and 18DZ2270700 (to B. Ni).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.