Purpose:

With the increasing incidence of renal masses, pretreatment differentiation between benign renal masses and malignant tumors is important. We aimed to develop a deep learning model that distinguishes benign renal tumors from renal cell carcinoma (RCC) by applying a residual convolutional neural network (ResNet) to routine MR imaging.

Experimental Design:

Preoperative MR images (T2-weighted [T2WI] and T1 postcontrast [T1C] sequences) of 1,162 renal lesions definitively diagnosed by pathology or imaging in a multicenter cohort were divided into training, validation, and test sets (70:20:10 split). An ensemble model based on ResNet was built combining clinical variables and the T1C and T2WI MR images using a bagging classifier to predict renal tumor pathology. Final model performance was compared with expert interpretation and the best-performing radiomics model.

Results:

Among the 1,162 renal lesions, 655 were malignant and 507 were benign. Compared with a baseline zero rule algorithm, the ensemble deep learning model had significantly higher test accuracy (0.70 vs. 0.56, P = 0.004). Compared with all experts averaged, the ensemble deep learning model had higher test accuracy (0.70 vs. 0.60, P = 0.053), sensitivity (0.92 vs. 0.80, P = 0.017), and specificity (0.41 vs. 0.35, P = 0.450). Compared with the radiomics model, the ensemble deep learning model had higher test accuracy (0.70 vs. 0.62, P = 0.081), sensitivity (0.92 vs. 0.79, P = 0.012), and specificity (0.41 vs. 0.39, P = 0.770).

Conclusions:

Deep learning can noninvasively distinguish benign renal tumors from RCC using conventional MR imaging in a multi-institutional dataset, with good accuracy and with sensitivity and specificity comparable with those of experts and radiomics.

Translational Relevance

With the wide use of imaging modalities, the detection of incidental renal tumors has increased rapidly. A substantial number of patients with benign renal tumors undergo unnecessary surgery, with its attendant risk and morbidity. Ultrasound, CT, and MRI have limited sensitivity and specificity in discriminating benign, indolent masses from aggressive malignant renal tumors. Percutaneous biopsy is nondiagnostic in 20% and provides an erroneous diagnosis in another 10% of resected cases. More accurate imaging diagnosis of renal masses is urgently needed. We trained a deep learning model to distinguish benign renal lesions from renal cell carcinoma with good accuracy, sensitivity, and specificity when compared with experts and radiomics. Our algorithm offers broad applicability by using conventional MR imaging sequences. Furthermore, it has the potential to spare patients unnecessary invasive biopsies/surgeries and to help guide clinical management.

Renal cell carcinomas (RCC) account for 85% of renal malignancies, affecting approximately 65,000 new patients each year (1). Resecting renal masses radiographically suspicious for RCC without a tissue diagnosis is within standard management (2, 3), leading to overtreatment of benign renal tumors. About 20% of surgically removed renal masses are reported to be benign (4), which challenges the necessity of surgery for all suspicious lesions given the attendant risk and morbidity to patients (5).

Noninvasive preoperative imaging techniques, such as ultrasound, CT, and MRI, are widely used for the characterization of renal masses. CT and MRI have limited sensitivity and specificity in discriminating benign, indolent masses, such as oncocytoma and angiomyolipoma (AML), from aggressive malignant renal tumors (6–8). Percutaneous biopsy is nondiagnostic in 20% and provides an erroneous diagnosis in another 10% of resected cases (9). There is an unmet need for more accurate imaging diagnosis of small renal masses.

Radiomics features have been used for renal tumor evaluation (10–16). In these studies, predefined features such as shape, intensity, and texture were selected and fed into a machine learning algorithm. However, these manually formulated, or "handcrafted," features may not capture the full range of information contained within the images and are limited by low reproducibility (17–24). Deep learning extracts features directly from the raw images and generates features adaptive to a given problem. Although machine learning and deep learning have been successfully used to predict tumor molecular features, treatment response, and prognosis in oncology, very few studies in the literature have focused on renal tumors (13, 14, 25, 26).

The purpose of our study was to develop a deep learning-based predictive model using routine MR images to distinguish benign from malignant renal lesions in a multicenter cohort.

Patient cohort

Patients with renal lesions confirmed by histology or imaging were retrospectively identified from two large academic centers in the United States (HUP and MAY), two hospitals in the People's Republic of China (SXH and PHH), and The Cancer Imaging Archive (TCIA) from 1984 to 2019. The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Boards of HUP, MAY, SXH, and PHH; under the agreement governing use of TCIA data, separate IRB approval was waived for TCIA. The requirement for informed consent was waived. For histologically confirmed renal lesions, the inclusion criteria were (i) pathologically confirmed renal lesions, (ii) available preoperative MRI including T2-weighted (T2WI) and T1 contrast-enhanced (T1C) sequences, and (iii) image quality adequate for analysis, without motion or other artifacts. For a benign renal lesion to have a definite diagnosis on imaging, it had to have typical imaging features and be stable on imaging follow-up. Our final cohort consisted of 1,162 renal lesions (913 lesions from HUP, 118 lesions from MAY, 56 lesions from TCIA, 25 lesions from SXH, and 50 lesions from PHH; Supplementary Fig. S1).

Among the 1,162 lesions, 655 were malignant based on histopathology and 507 were benign (162 lesions were confirmed by histopathology; the other 345 were diagnosed radiographically). The images were randomly divided into a training set of 816 lesions with 408,000 augmented images, a validation set of 234 lesions, and a test set of 112 lesions. All 345 radiographically diagnosed lesions were kept within the training or validation set. The detailed clinical characteristics of the total cohort and of the training, validation, and test sets are shown in Table 1 and Supplementary Table S1.

Table 1.

Histologic types and clinical characteristics of our cohort.

 Benign (n = 507) Malignant (n = 655) P values
Age, median (range), years 53.0 (5–90) 61.0 (18–92) <0.001a
Gender   <0.001a
 Male 151 (29.8%) 441 (67.3%)
 Female 356 (70.2%) 214 (32.7%)
Tumor size, median (range), cm 2.2 (0.6–14.2) 3.4 (0.2–18.7) <0.001a
Laterality   0.223
 Left 259 (51.1%) 311 (47.5%)
 Right 248 (48.9%) 344 (52.5%)
Location   0.503
 Upper 158 (31.2%) 225 (34.4%)
 Interpolar 210 (41.4%) 255 (38.9%)
 Lower 139 (27.4%) 175 (26.7%)
Approach for diagnosis   <0.001a
 Surgical excision 161 (31.8%) 637 (97.3%)
 Biopsy 1 (0.2%) 18 (2.7%)
 Imaging 345 (68.0%) 0 (0%)
Tumor subtype   –
 Clear cell RCC – 425 (64.9%)
 Papillary RCC – 152 (23.1%)
 Chromophobe RCC – 32 (4.9%)
 Clear cell papillary RCC – 30 (4.6%)
 Multilocular cystic RCC – 7 (1.1%)
 Unclassified RCC – 9 (1.4%)
 Oncocytoma 92 (18.1%) –
 Angiomyolipoma 406 (80.1%) –
 Mixed epithelial and stromal tumor 3 (0.6%) –
 Metanephric adenoma 3 (0.6%) –
 Renal adenoma 3 (0.6%) –
Fuhrman/ISUP grade   –
 Grade 1 – 66 (10.1%)
 Grade 2 – 304 (46.4%)
 Grade 3 – 160 (24.4%)
 Grade 4 – 35 (5.4%)
 Unavailable – 90 (13.7%)

a Statistically significant.

Abbreviation: ISUP, International Society of Urological Pathology.

Image segmentation

MR images of all patients were loaded into 3D Slicer software (v4.6; ref. 27), and regions of interest were manually drawn slice-by-slice on the T2WI and T1C sequences by an abdominal radiologist (Y. Zhao) with 5 years of experience reading abdominal MRI. Two additional radiologists (Y.C. and L.Z.), with 4 and 3 years of experience reading abdominal MRI, respectively, segmented all images in the test set, creating a total of three sets of segmentations for the test set. Dice similarity coefficients (DSC; ref. 28) were calculated between these three sets of segmentations.
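
For reference, a minimal sketch of how a DSC can be computed between two binary segmentation masks (array names are hypothetical):

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    denominator = a.sum() + b.sum()
    if denominator == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * intersection / denominator

# Example: DSC between two segmenters' masks for one lesion
seg1 = np.zeros((64, 64), dtype=np.uint8); seg1[20:40, 20:40] = 1
seg2 = np.zeros((64, 64), dtype=np.uint8); seg2[22:42, 22:42] = 1
print(dice_coefficient(seg1, seg2))
```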

Image processing

All images were downloaded in DICOM format at their original dimensions and resolution. Images were then processed with N4 bias correction (29) using ANTS (30) and with intensity normalization of the T2WI images (31) using SimpleITK (32), using the T1 image of the same lesion as the reference image (33). Segmentations were used to crop the images. To create two-dimensional images suitable for ResNet input, the largest axial, sagittal, and coronal slices were input as the red, green, and blue channels of an image, respectively; we refer to this method as the 2.5D Model (34). This process is visualized in Fig. 1A.
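
A minimal sketch of this 2.5D channel construction, assuming a preprocessed and cropped lesion volume as a NumPy array (function and variable names are hypothetical):

```python
import numpy as np
from skimage.transform import resize  # bilinear interpolation by default

def largest_slice(volume: np.ndarray, axis: int) -> np.ndarray:
    """2D slice with the largest nonzero (lesion) area along the given axis."""
    areas = [np.count_nonzero(np.take(volume, i, axis=axis))
             for i in range(volume.shape[axis])]
    return np.take(volume, int(np.argmax(areas)), axis=axis)

def to_rgb_2p5d(cropped_volume: np.ndarray, size=(200, 200)) -> np.ndarray:
    """Stack the largest axial, sagittal, and coronal slices as R, G, B channels."""
    channels = [resize(largest_slice(cropped_volume, ax), size)
                for ax in (0, 1, 2)]
    return np.stack(channels, axis=-1)  # shape (200, 200, 3), ResNet-ready
```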

Figure 1.

A, An illustration of the 2.5D Model methods of generating RGB images for ResNet input. B, An illustration of the ensemble model architecture.


Model building

A 7:2:1 training:validation:test split was performed as detailed above. Model building was performed on the segmented images. Two separate models were independently trained on T1C and T2WI images. During training, images were scaled up or down to 200 × 200 pixels using bilinear interpolation, then augmented on the fly with horizontal flip, vertical flip, shear, and zoom transformations to add variability to the training set. Models were trained with a batch size of 16. Early stopping was used with the patience parameter set to 50, and each model was trained for a maximum of 500 epochs if never stopped early. After 100 training trials, the model with the best validation accuracy was selected. During training, a predetermined probability threshold of 0.5 was applied to the final sigmoid activation neuron to classify malignancy. The models were trained to maximize prediction accuracy.
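
A minimal Keras sketch of this training setup, assuming `model`, `x_train`, `y_train`, `x_val`, and `y_val` are defined elsewhere; only the transformation types are stated in the text, so the shear and zoom magnitudes below are illustrative:

```python
import tensorflow as tf

# On-the-fly augmentation with the stated transformation types.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    shear_range=0.2,   # illustrative magnitude
    zoom_range=0.2,    # illustrative magnitude
)

# Stop training when validation accuracy plateaus for 50 epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=50, restore_best_weights=True)

# model.fit(augmenter.flow(x_train, y_train, batch_size=16),
#           validation_data=(x_val, y_val),
#           epochs=500,
#           callbacks=[early_stopping])
```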

A 5-fold cross-validation was used to evaluate the model building pipeline. In addition, we ran our pipeline with all 118 lesions from one institution (MAY) placed in a separate test set, with the remaining lesions split into training (811 lesions) and validation (233 lesions) cohorts.

Model architecture

Models trained on T1C and T2WI image input were based on the ResNet50 architecture (35) with the following modifications: the 1,000-class softmax fully connected layer was replaced with five fully connected layers of decreasing width (256, 128, 64, 32, 16) with ReLU activations and a single sigmoid output neuron for probability output and binary classification (benign or malignant); to address class imbalance, the loss was weighted by the reciprocal of the class frequency. Convolutional neural network weights pretrained on ImageNet were used to initialize our model's weights and biases and were left unfrozen during training (36). No pretrained weights from the final dense layer were used, as that layer was omitted.
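
A minimal Keras sketch of this architecture; the paper specifies the layer modifications and loss weighting, so the optimizer choice below is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_resnet_classifier(input_shape=(200, 200, 3)):
    """ResNet50 backbone (ImageNet weights, final dense layer omitted) with a
    narrowing fully connected head and a single sigmoid output neuron."""
    base = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=input_shape)
    base.trainable = True  # pretrained weights left unfrozen during training

    x = base.output
    for width in (256, 128, 64, 32, 16):
        x = layers.Dense(width, activation="relu")(x)
    output = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=base.input, outputs=output)
    model.compile(optimizer="adam",  # optimizer choice is an assumption
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Class imbalance can be handled at fit time with inverse-frequency weights, e.g.:
# model.fit(..., class_weight={0: 1 / freq_benign, 1: 1 / freq_malignant})
```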

Clinical variables (age, gender, and tumor volume) were fed into a separate logistic regression model for prediction of malignancy. The logistic regression was trained with a stochastic gradient descent optimizer with a learning rate of 0.001, momentum of 0.9, L1 regularization of 0.00, and L2 regularization of 0.01.
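
A minimal sketch of this clinical model expressed as a single sigmoid unit in Keras with the stated optimizer and regularization settings:

```python
import tensorflow as tf

# Logistic regression over age, gender, and tumor volume: a single sigmoid
# unit with L1/L2 kernel regularization, trained by SGD with momentum.
clinical_model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        1, activation="sigmoid", input_shape=(3,),
        kernel_regularizer=tf.keras.regularizers.l1_l2(l1=0.00, l2=0.01)),
])
clinical_model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="binary_crossentropy",
    metrics=["accuracy"])
```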

An ensemble model was created using a bagging classifier (37) that combined the outputs of the clinical variable logistic regression model, the T1C model, and the T2WI model. Figure 1B shows the architecture of the final ensemble model. This model was trained on the same training data and evaluated on both the validation and test set data. One hundred ensemble models were trained, and the model with the highest validation accuracy was selected as the final ensemble model. The ensemble model was calibrated with Platt's method using the validation data (38).
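
A minimal scikit-learn sketch of the bagging-plus-Platt-scaling step, using random stand-ins for the three base models' probability outputs; the exact bagging configuration is an assumption, and the `cv="prefit"` calibration API varies across scikit-learn versions:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Columns stand in for the clinical, T1C, and T2WI model probabilities.
X_train_meta = rng.random((816, 3)); y_train = rng.integers(0, 2, 816)
X_val_meta   = rng.random((234, 3)); y_val   = rng.integers(0, 2, 234)

ensemble = BaggingClassifier(LogisticRegression(), n_estimators=10,
                             random_state=0)
ensemble.fit(X_train_meta, y_train)

# Platt scaling (sigmoid calibration) fit on the validation set.
calibrated = CalibratedClassifierCV(ensemble, method="sigmoid", cv="prefit")
calibrated.fit(X_val_meta, y_val)
p_malignant = calibrated.predict_proba(X_val_meta)[:, 1]
```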

Expert evaluation

Four expert radiologists (H.L., T.H., L.S., and D.P.), with 23, 12, 13, and 10 years of experience reading abdominal MRI, respectively, and blinded to histopathologic data, evaluated unsegmented MR images of the renal lesions in the test set for malignancy. They were given the clinical information of each patient. The model's results were compared with these expert evaluations to assess model performance.

Radiomics model building and evaluation

Radiomics features were extracted from each patient's MRI for both the T1C and T2WI sequences. For each image space, 79 non-texture (morphology and intensity-based) and 94 texture features were extracted according to the guidelines defined by the Image Biomarker Standardization Initiative (39). Each of the 94 texture features was computed 32 times using all possible combinations of the following extraction parameters, a process known as "texture optimization" (REF): (i) isotropic voxels of size 1, 2, 3, and 4 mm; (ii) a fixed bin number (FBN) discretization algorithm, with and without equalization; and (iii) 8, 16, 32, and 64 gray levels for FBN. A total of (79 + 32 × 94), or 3,087, radiomics features was thus computed in this study. All features were normalized using unity-based normalization, and features from T1C and T2WI were combined into one dataset. To reduce the dimensionality of the dataset, radiomics features were selected for training using thirteen different feature selection methods. Ten machine learning classifiers were trained and tested on features from the same splits of patients used in the deep learning methods. The names of the classifiers and feature selection methods can be found in Supplementary Fig. S2. Each classifier was trained on the training set thirteen times, once with each feature selection method, and validated through 10-fold cross-validation. Classifiers were trained on 10, 20, 30, 40, 50, and 100 selected features, and performances were compared on the validation set. In addition to performance, the stability of both the classifiers and the feature selection methods was recorded. We calculated the relative standard deviation (RSD%) for classifier stability and used the stability measure proposed by Nogueira and colleagues for feature selection stability (40). The performance of the top-performing classifiers was then compared with that of the automated machine learning pipeline computed by TPOT, a Tree-based Pipeline Optimization Tool that chooses an optimal machine learning pipeline for an input dataset through genetic programming (41). The best-performing models were then tested on the final test set.
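
As an illustration of this kind of pipeline, a minimal scikit-learn sketch with min-max scaling for unity-based normalization and mutual-information selection standing in for one of the thirteen feature selection methods; the feature matrix and labels are synthetic stand-ins:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((200, 3087))  # stand-in for the 3,087 radiomics features
y = rng.integers(0, 2, 200)  # stand-in for benign/malignant labels

pipeline = Pipeline([
    ("normalize", MinMaxScaler()),                       # unity-based normalization
    ("select", SelectKBest(mutual_info_classif, k=50)),  # one of 10/20/.../100
    ("classify", LinearDiscriminantAnalysis()),
])
scores = cross_val_score(pipeline, X, y, cv=10, scoring="roc_auc")  # 10-fold CV
print(scores.mean())
```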

Model assessment and statistical analysis

Each trained model was finalized with weights and biases frozen and then assessed on the validation and test sets by F1 score, accuracy, sensitivity, specificity, and the AUC of its precision-recall curve. In addition, the models were evaluated on two additional sets of the same test images segmented by the two other radiologists. Finally, the model pipeline was run on a train/validation/test split in which the test set comprised lesions from a separate institution explicitly excluded from the training and validation sets.
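
A minimal sketch of these evaluation metrics from predicted probabilities; `average_precision_score` is used here as a standard estimator of the precision-recall AUC, and the names are hypothetical:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the reported metrics from a model's predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "pr_auc": average_precision_score(y_true, y_prob),
    }
```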

Activations from the last convolutional layer of the best-performing models were visualized by t-distributed stochastic neighbor embedding (t-SNE; ref. 42).
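
A minimal sketch of this visualization step, with a random stand-in for the activation matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
activations = rng.random((234, 2048))  # stand-in for last-layer activations

embedding = TSNE(n_components=2, random_state=0).fit_transform(activations)
# embedding[:, 0] and embedding[:, 1] can then be scattered and color-coded
# by benign/malignant label, as in Fig. 3.
```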

Confidence intervals were calculated using the adjusted Wald method (43). P values were calculated using a binomial test, with P < 0.05 considered the threshold for significance. To evaluate the calibration of our final models, calibration scores were calculated using the expected calibration error (44).
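
A minimal sketch of these statistics (adjusted Wald interval, binomial test, and expected calibration error), with illustrative inputs; the binned ECE variant shown is one common formulation:

```python
import numpy as np
from scipy import stats

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted Wald (Agresti-Coull) 95% CI for a binomial proportion."""
    p_tilde = (successes + z**2 / 2) / (n + z**2)
    half_width = z * np.sqrt(p_tilde * (1 - p_tilde) / (n + z**2))
    return max(0.0, p_tilde - half_width), min(1.0, p_tilde + half_width)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-weighted gap between mean predicted probability and outcome."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

# Example: 78 correct of 112 test lesions vs. an illustrative 0.56 baseline.
print(adjusted_wald_ci(78, 112))
print(stats.binomtest(78, 112, p=0.56).pvalue)  # binomtest requires SciPy >= 1.7
```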

Code availability

The implementation of our deep learning model was based on the Keras package (45) with the TensorFlow library as the backend (46). Our models were trained on a computer with two NVIDIA V100 GPUs. To allow other researchers to develop their own models, the code is publicly available on GitHub at https://github.com/intrepidlemon/renal-mri. The radiomics feature extraction was based on the "radiomics-develop" package from the Naqa lab at McGill University (47, 48), which is available for public use on GitHub at https://github.com/mvallieres/radiomics-develop. The radiomics pipeline was developed using Python's sklearn package; this code is publicly available at https://github.com/subhanik1999/Radiomics-ML.

Segmentation similarity

The average DSC across all segmenters was 0.81. Segmenters 1 and 2 had an average DSC of 0.82; segmenters 2 and 3, an average DSC of 0.81; and segmenters 1 and 3, an average DSC of 0.81. Benign lesions had an average DSC of 0.76 across all segmenters, whereas malignant lesions had an average DSC of 0.82.

Model performance

The performance of the T1C, T2WI, clinical variable, and ensemble models in the training, validation, and test sets, compared with expert evaluation and the radiomics model, is summarized in Table 2. Confusion matrices and reliability curves for all models in the training, validation, and test sets are displayed in Supplementary Figs. S3 and S4.

Table 2.

Performance of the T1C, T2WI, clinical variable, and ensemble models in training, validation, and test sets compared with expert evaluation and the radiomics model.

Training modality F1 Score ROC AUC PR AUC Acc (95% CI) TPR (95% CI) TNR (95% CI) PPV NPV FDR
Clinical 0.76 0.73 0.72 0.69 (0.65–0.72) 0.87 (0.83–0.89) 0.46 (0.41–0.51) 0.67 0.72 0.33 
T1C 0.99 0.99 0.99 0.99 (0.99–1.00) 1.00 (0.98–1.00) 0.99 (0.97–1.00) 0.99 0.99 0.01 
T2 0.95 0.99 0.99 0.94 (0.92–0.96) 0.97 (0.95–0.99) 0.90 (0.86–0.93) 0.93 0.96 0.07 
Ensemble 0.96 0.99 0.99 0.95 (0.94–0.97) 0.99 (0.98–1.00) 0.91 (0.87–0.93) 0.93 0.99 0.07 
Validation modality F1 Score ROC AUC PR AUC Acc (95% CI) TPR (95% CI) TNR (95% CI) PPV NPV FDR 
Clinical 0.77 0.64 0.64 0.68 (0.62–0.74) 0.92 (0.86–0.96) 0.36 (0.28–0.46) 0.65 0.79 0.35 
T1C 0.72 0.69 0.73 0.66 (0.60–0.72) 0.77 (0.69–0.84) 0.52 (0.42–0.61) 0.68 0.64 0.33 
T2 0.77 0.73 0.71 0.71 (0.65–0.77) 0.83 (0.75–0.88) 0.57 (0.47–0.66) 0.71 0.72 0.29 
Ensemble 0.80 0.77 0.73 0.75 (0.69–0.80) 0.87 (0.80–0.92) 0.60 (0.50–0.69) 0.74 0.78 0.26 
Test modality F1 Score ROC AUC PR AUC Acc (95% CI) TPR (95% CI) TNR (95% CI) PPV NPV FDR 
Clinical 0.66 0.43 0.54 0.52 (0.43–0.61) 0.83 (0.71–0.90) 0.12 (0.05–0.25) 0.55 0.35 0.45 
T1C 0.73 0.62 0.59 0.64 (0.55–0.73) 0.87 (0.77–0.94) 0.35 (0.23–0.49) 0.63 0.68 0.37 
T2 0.77 0.71 0.70 0.69 (0.60–0.77) 0.90 (0.80–0.96) 0.41 (0.28–0.55) 0.66 0.77 0.34 
Ensemble 0.77 0.73 0.76 0.70 (0.61–0.77) 0.92 (0.82–0.97) 0.41 (0.28–0.55) 0.67 0.80 0.33 
Expert 1 0.73 N/A N/A 0.65 (0.56–0.73) 0.84 (0.73–0.91) 0.41 (0.28–0.55) 0.65 0.67 0.35 
Expert 2 0.70 N/A N/A 0.59 (0.50–0.68) 0.84 (0.73–0.91) 0.27 (0.16–0.40) 0.60 0.57 0.40 
Expert 3 0.68 N/A N/A 0.60 (0.51–0.68) 0.75 (0.63–0.84) 0.41 (0.28–0.55) 0.62 0.56 0.38 
Expert 4 0.68 N/A N/A 0.58 (0.49–0.67) 0.78 (0.66–0.86) 0.33 (0.21–0.47) 0.60 0.53 0.40 
Radiomics 0.70 0.59 0.77 0.62 (0.52–0.70) 0.79 (0.68–0.88) 0.39 (0.26–0.53) 0.62 0.59 0.38 

ROC AUC, area under ROC curve; PR AUC, area under precision–recall curve; TPR, true positive rate or sensitivity; TNR, true negative rate or specificity; Acc, accuracy; PPV, positive predictive value; NPV, negative predictive value; N/A, not applicable; 95% CI, 95% confidence interval.

The clinical variable logistic regression achieved a test accuracy of 0.52 (95% CI, 0.43–0.61), F1 score of 0.66, precision-recall AUC of 0.54, sensitivity of 0.83 (95% CI, 0.71–0.90), and specificity of 0.12 (95% CI, 0.054–0.25). The T1C-trained model achieved a test accuracy of 0.64 (95% CI, 0.55–0.73), F1 score of 0.73, precision-recall AUC of 0.59, sensitivity of 0.87 (95% CI, 0.77–0.94), and specificity of 0.35 (95% CI, 0.23–0.49). The T2WI-trained model achieved a test accuracy of 0.69 (95% CI, 0.60–0.77), F1 score of 0.77, precision-recall AUC of 0.70, sensitivity of 0.90 (95% CI, 0.80–0.96), and specificity of 0.41 (95% CI, 0.28–0.55). The ensemble model achieved a test accuracy of 0.70 (95% CI, 0.61–0.77), F1 score of 0.77, precision-recall AUC of 0.76, sensitivity of 0.92 (95% CI, 0.82–0.97), and specificity of 0.41 (95% CI, 0.28–0.55). The ensemble model achieved comparable performance on the second set of test set segmentations, with a test accuracy of 0.70 (95% CI, 0.61–0.77), F1 score of 0.77, precision-recall AUC of 0.76, sensitivity of 0.92 (95% CI, 0.82–0.97), and specificity of 0.41 (95% CI, 0.28–0.55), and on the third set of test set segmentations, with a test accuracy of 0.61 (95% CI, 0.51–0.69), F1 score of 0.71, precision-recall AUC of 0.58, sensitivity of 0.86 (95% CI, 0.75–0.93), and specificity of 0.29 (95% CI, 0.18–0.43). On average, cross-validation analysis of the ensemble model demonstrated a test accuracy of 0.64 (95% CI, 0.66–0.89), F1 score of 0.88, precision-recall AUC of 0.9, sensitivity of 0.96 (95% CI, 0.83–1), and specificity of 0.22 (95% CI, 0.057–0.54). Supplementary Table S2 summarizes the cross-validation performance of the ensemble model. Supplementary Table S3 summarizes the performance of the model on separate training, validation, and test sets in which the test set represents an independent cohort from a separate institution; the performance on this independent test set was comparable with the main results.

The radiomics analysis showed that the linear discriminant analysis (LDA) classifier had the highest median ROC AUC in predicting the malignancy of renal lesions (0.54). The ROC AUC values for each combination of classifier and feature selection method are shown in Supplementary Fig. S2. The median, mean, and SD of ROC AUC, as well as stability (RSD%), for all classifiers are shown in Supplementary Table S4. The performance of LDA was compared with that of a pipeline exported by TPOT on the test set. The TPOT pipeline specifics are shown in Supplementary Fig. S5, and its test performance is shown in Supplementary Table S5. LDA with Conditional Mutual Information Maximization feature selection achieved a test accuracy of 0.62 (95% CI, 0.52–0.70), F1 score of 0.70, sensitivity of 0.79 (95% CI, 0.68–0.88), and specificity of 0.39 (95% CI, 0.26–0.53). This hand-optimized pipeline outperformed the TPOT pipeline.

In comparison, expert 1 achieved a test accuracy of 0.65 (95% CI, 0.56–0.73), F1 score of 0.73, sensitivity of 0.84 (95% CI, 0.73–0.91), and specificity of 0.41 (95% CI, 0.28–0.55); expert 2, a test accuracy of 0.59 (95% CI, 0.50–0.68), F1 score of 0.70, sensitivity of 0.84 (95% CI, 0.73–0.91), and specificity of 0.27 (95% CI, 0.16–0.40); expert 3, a test accuracy of 0.60 (95% CI, 0.51–0.68), F1 score of 0.68, sensitivity of 0.75 (95% CI, 0.63–0.84), and specificity of 0.41 (95% CI, 0.28–0.55); and expert 4, a test accuracy of 0.58 (95% CI, 0.49–0.67), F1 score of 0.68, sensitivity of 0.78 (95% CI, 0.66–0.86), and specificity of 0.33 (95% CI, 0.21–0.47).

Compared with a baseline zero rule algorithm, the ensemble deep learning model had significantly higher test accuracy (0.70 vs. 0.56, P = 0.004). Compared with all experts averaged, the ensemble deep learning model had higher test accuracy (0.70 vs. 0.60, P = 0.053), sensitivity (0.92 vs. 0.80, P = 0.017), and specificity (0.41 vs. 0.35, P = 0.450). Compared with the radiomics model, the ensemble deep learning model had higher test accuracy (0.70 vs. 0.62, P = 0.081), sensitivity (0.92 vs. 0.79, P = 0.012), and specificity (0.41 vs. 0.39, P = 0.770). Figure 2 shows the precision-recall curves of all models overlaid with expert performance. The expected calibration error of our ensemble deep learning model was 0.19 on the training set, 0.065 on the validation set, and 0.094 on the test set. Reliability diagrams of the ensemble model are plotted in Supplementary Fig. S6. The t-SNE representation of the final dense layer of ResNet demonstrates good separation of malignant and benign lesions by the deep learning model when compared with the histopathologic diagnosis (Fig. 3).

Figure 2.

Precision recall curves (A–C) and ROC curves (D–F) for all models across training, validation, and test cohorts with expert performance.

Figure 3.

t-SNE-transformed representation of the final layer of the T1C (A) and T2WI (B) neural networks before the classification node for every image in the validation dataset, color-coded as benign or malignant.


Today, most patients with a newly found kidney lesion undergo either biopsy or surgery. In this study, we developed a deep learning approach combining conventional MR images and clinical variables that demonstrates good accuracy in differentiating benign from malignant renal lesions. Our model was based on the ResNet architecture, which uses residual connections between convolutional layers and allows models to be trained at much greater depths while maintaining low complexity (35). Our final ensemble model achieved higher accuracy in distinguishing benign from malignant renal lesions compared with experts and radiomics. Ideally, a deep learning model would be implemented as a risk stratification tool, helping clinicians and patients understand the risk of malignancy for a lesion. To this end, specificity is the most important statistic, because with higher specificity fewer people receive unnecessary biopsies; our deep learning model had the highest specificity among the experts and models. Sensitivity is also an important metric, because higher sensitivity leads to fewer false negatives, which may provide false reassurance. To evaluate intersegmenter variation, two additional radiologists segmented the renal lesions in the test set, achieving high Dice similarity coefficients among the three segmenters. The consistently good accuracy, sensitivity, and specificity of the model when evaluated on any of the three test set segmentations demonstrate robustness to intersegmenter variation and support potential clinical applicability. When we evaluated our pipeline on data from a separate institution that the model had never seen, the model continued to discriminate between benign and malignant lesions, suggesting that it can generalize beyond a single institution. Our final ensemble model is adequately calibrated, and its expected calibration error is within expectation for a neural network-based model (49).

Preoperative differentiation between benign and malignant renal lesions using noninvasive imaging modalities is important for treatment planning but challenging. Several radiologic studies have used traditional machine learning techniques, such as support vector machines and random forests, to distinguish benign from malignant renal lesions based on CT radiomics (11–16). However, these radiomic features were low throughput and predefined by radiologists' expert knowledge. Deep learning represents an improvement over radiomics, as it can automatically extract high-throughput features and increase the accuracy needed for clinical utilization (50). Recently, Lee and colleagues combined deep features with hand-crafted features in a random forest classifier to differentiate 39 AMLs without visible fat from 41 clear cell RCCs on abdominal CT; the best model, combining hand-crafted features with deep features from AlexNet using texture image patches, achieved an accuracy of 76.6% (25). That study was limited by its small cohort size. Zhou and colleagues built a transfer learning model on CT with the Inception V3 model pretrained on the ImageNet dataset to differentiate 58 benign from 134 malignant renal lesions (26). Although the reported validation accuracy for the best model was 97%, the small cohort size and the lack of a true test set make generalization questionable.

Compared with previous machine learning studies, our study has several differences. First, we chose MRI instead of CT. MRI provides multiparametric sequences, which theoretically provide more information than the simple attenuation differences measured in Hounsfield units on CT. Second, our cohort included all available benign and malignant renal lesions with MRI that were pathologically or radiographically diagnosed at five institutions, which is more representative of the real clinical scenario; in contrast, most previous studies included only selected tumor types. Third, we compared the model's accuracy with expert radiologist interpretation to assess its clinical usefulness, which none of the previous studies attempted.

In our study, the deep learning model outperformed the best-optimized radiomics model in distinguishing benign from malignant renal tumors, with higher accuracy, sensitivity, and specificity on the test set. Two main factors likely account for the worse performance and lower reproducibility of radiomics compared with deep learning: acquisition parameters and implementation. Radiomics features can vary significantly with acquisition parameters as well as with segmentation, lesion size, and the entropy of the lesion and surrounding tissues. Given that our dataset contains lesions from multiple institutions, there is naturally variation in the MRI scanner models used, and scanner models have been found to influence the reproducibility of radiomics features on CT (19). Regarding implementation, multi-user studies and comparisons of different radiomics software packages have shown inconsistent results for the same kinds of analysis; implementation details that may deviate include image reconstruction, interpolation, resampling, and quantization (51). Our deep learning model was tested with 5-fold cross-validation, achieving an average test accuracy of 0.64 and test PR AUC of 0.84 across all folds, supporting our model-building methodology.

There were several limitations to this study. First, large, deep neural network models are prone to overfitting, especially when trained on small amounts of data. Although we used data augmentation and early stopping to limit overfitting, a large amount of data is essential for training deep neural networks. Although our cohort included 1,162 renal lesions on MRI, more than any previous machine learning study on the topic, algorithm development could benefit further from a larger patient cohort, especially considering the heterogeneity of image acquisition parameters across institutions. Second, only conventional MR sequences were used; inclusion of advanced MR sequences such as diffusion-weighted imaging and arterial spin labeling may further increase model accuracy (52, 53). Third, manual segmentations were performed by a radiologist with 5 years of experience reading abdominal MRI. Automatic segmentation based on deep learning has been performed successfully in other organs (28, 54–56) and for kidney volume on CT with accuracy comparable with expert segmentation (57). However, deep learning segmentation of renal lesions on MRI is challenging: the algorithm must distinguish the lesion of interest from other targets in the kidney (e.g., cysts), from other organs in the body that may have similar intensity, and, for small renal lesions, from background. We have not identified any study in the literature reporting automatic segmentation of kidney lesions on MRI. Finally, only the earliest postcontrast phase was selected for analysis. Compared with a single phase, multiple phases can provide more information about the enhancement pattern of a lesion and may further improve model accuracy. Automatic renal tumor segmentation on MRI and multiphase analysis will be incorporated in future work.

In this study, we developed a deep learning-based model using conventional MR imaging to noninvasively classify benign and malignant renal tumors with good accuracy, sensitivity, and specificity comparable with experts and radiomics. If further validated, the model can spare patients unnecessary biopsy/surgery and help guide management in a clinical setting.

T.P. Gade is an employee/paid consultant for Trisalus Life Sciences. S.W. Stavropoulos is an employee/paid consultant for Becton Dickinson, and reports receiving commercial research grants from Sillajen and Cook Medical. No potential conflicts of interest were disclosed by the other authors.

Conception and design: I.L. Xi, R.Y. Huang, P.J. Zhang, M.C. Soulen, H.X. Bai, S.W. Stavropoulos

Development of methodology: I.L. Xi, R. Wang, S. Purkayastha, K. Chang, R.Y. Huang, Y. Fan, H.X. Bai, S.W. Stavropoulos

Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): I.L. Xi, Y. Zhao, R. Wang, M. Chang, A.C. Silva, P.J. Zhang, Z. Zhang, H.X. Bai, S.W. Stavropoulos

Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): I.L. Xi, Y. Zhao, R. Wang, S. Purkayastha, K. Chang, R.Y. Huang, A.C. Silva, M. Vallières, Y. Fan, H.X. Bai

Writing, review, and/or revision of the manuscript: I.L. Xi, Y. Zhao, R. Wang, M. Chang, S. Purkayastha, K. Chang, R.Y. Huang, A.C. Silva, M. Vallières, P. Habibollahi, Y. Fan, T.P. Gade, M.C. Soulen, H.X. Bai, S.W. Stavropoulos

Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): I.L. Xi, Z. Zhang, H.X. Bai, S.W. Stavropoulos

Study supervision: B. Zou, T.P. Gade, Z. Zhang, H.X. Bai, S.W. Stavropoulos

Other (code development): R. Wang

This study was supported by an RSNA fellow research grant (RF1802), a National Natural Science Foundation of China grant (8181101287), an SIR Foundation Radiology Resident Research Grant, and the National Cancer Institute of the National Institutes of Health under award number R03CA249554 to H.X. Bai. This project was supported by a training grant from the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under award number 5T32EB1680 and by the National Cancer Institute (NCI) of the National Institutes of Health under award number F30CA239407 to K. Chang. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The study was supported by funding from the Nicole Foundation for Kidney Cancer Research to S.W. Stavropoulos. The authors acknowledge the help of Hui Liu (H.L.), Ting Huang (T.H.), Dehong Peng (D.P.), and Quanliang Shang (Q.S.) in evaluating all renal lesions in the test set; Lin Zhu (L.Z.) and Yeyu Cai (Y.C.) in segmenting all renal lesions in the test set; and Sukhdeep Khurana, Aidan McGirr, An Xie, and Jianbin Liu in data collection.

1. Siegel R, Naishadham D, Jemal A. Cancer statistics, 2013. CA Cancer J Clin 2018;68:7–30.
2. Steven CC, Andrew CN, Arie B, Michael LB, George KC, Ithaar HD, et al. Guideline for management of the clinical T1 renal mass. J Urol 2009;182:1271–79.
3. Borje L, Nigel CC, Damian CH, Milan H, Markus AK, Axel SM, et al. EAU guidelines on renal cell carcinoma: the 2010 update. Eur Urol 2010;58:398–406.
4. Lane BR, Denise B, Kattan MW, Novick AC, Gill IS, Ming Z, et al. A preoperative prognostic nomogram for solid enhancing renal tumors 7 cm or less amenable to partial nephrectomy. J Urol 2007;178:429–34.
5. Gill IS, Kavoussi LR, Lane BR, Blute ML, Babineau D, Colombo JR Jr, et al. Comparison of 1,800 laparoscopic and open partial nephrectomies for single renal tumors. J Urol 2007;178:41–6.
6. Kang SK, Huang WC, Pandharipande PV, Hersh C. Solid renal masses: what the numbers tell us. AJR Am J Roentgenol 2014;202:1196.
7. Rosenkrantz AB, Hindman N, Fitzgerald EF, Niver BE, Melamed J, Babb JS. MRI features of renal oncocytoma and chromophobe renal cell carcinoma. AJR Am J Roentgenol 2010;195:W421–7.
8. François C, Anne-Sophie L, Thomas T, Colette D, Jean-Marie F, Yann LB, et al. Combined late gadolinium-enhanced and double-echo chemical-shift MRI help to differentiate renal oncocytomas with high central T2 signal intensity from renal cell carcinomas. AJR Am J Roentgenol 2013;200:830–38.
9. Stuart GS, Yu Unn G, Koenraad JM, Kemal T, Edmund SC. Renal masses in the adult patient: the role of percutaneous biopsy. Radiology 2006;240:6–22.
10. Yan L, Liu Z, Wang G, Huang Y, Liu Y, Yu Y, et al. Angiomyolipoma with minimal fat: differentiation from clear cell renal cell carcinoma and papillary renal cell carcinoma by texture analysis on CT images. Acad Radiol 2015;22:1115–21.
11. Raman SP, Chen Y, Schroeder JL, Huang P, Fishman EK. CT texture analysis of renal masses: pilot study using random forest classification for prediction of pathology. Acad Radiol 2014;21:1587–96.
12. Taryn H, Mcinnes MDF, Nicola S, Flood TA, Leslie L, Thornhill RE. Can quantitative CT texture analysis be used to differentiate fat-poor renal angiomyolipoma from renal cell carcinoma on unenhanced CT images? Radiology 2015;276:787–96.
13. Feng Z, Rong P, Cao P, Zhou Q, Zhu W, Yan Z, et al. Machine learning-based quantitative texture analysis of CT images of small renal masses: differentiation of angiomyolipoma without visible fat from renal cell carcinoma. Eur Radiol 2018;28:1625–33.
14. Yu HS, Scalera J, Khalid M, Touret AS, Bloch N, Li B, et al. Texture analysis as a radiomic marker for differentiating renal tumors. Abdom Radiol 2017;42:1–9.
15. Kunapuli G, Varghese BA, Ganapathy P, Desai B, Cen S, Aron M, et al. A decision-support tool for renal mass classification. J Digit Imaging 2018:1–11.
16. Lee HS, Hong H, Jung DC, Park S, Kim J. Differentiation of fat-poor angiomyolipoma from clear cell renal cell carcinoma in contrast-enhanced MDCT images using quantitative feature classification. Med Phys 2017;44:3604–14.
17. Berenguer R, Pastor-Juan MDR, Canales-Vázquez J, Castro-García M, Villas MV, Legorburo FM, et al. Radiomics of CT features may be nonreproducible and redundant: influence of CT acquisition parameters. Radiology 2018;288:407–15.
18. Shafiq-Ul-Hassan M, Zhang GG, Latifi K, Ullah G, Hunt DC, Balagurunathan Y, et al. Intrinsic dependencies of CT radiomic features on voxel size and number of gray levels. Med Phys 2017;44:1050.
19. Kalpathy-Cramer J, Mamomov A, Zhao B, Lu L, Cherezov D, Napel S, et al. Radiomics of lung nodules: a multi-institutional study of robustness and agreement of quantitative imaging features. Tomography 2016;2:430–37.
20. Emaminejad N, Wahi-Anwar M, Hoffman J, Kim G, Brown M, Mcnitt-Gray M. The effects of variations in parameters and algorithm choices on calculated radiomics feature values: initial investigations and comparisons to feature variability across CT image acquisition conditions. In: SPIE Medical Imaging; 2018 Feb 10–15; Houston, TX. Bellingham (WA): SPIE; 2018.
21. Kim H, Chang MP, Lee M, Sang JP, Song YS, Lee JH, et al. Impact of reconstruction algorithms on CT radiomic features of pulmonary tumors: analysis of intra- and inter-reader variability and inter-reconstruction algorithm variability. PLoS One 2016;11:e0164924.
22. Mackin D, Fave X, Zhang L, Fried D, Yang J, Taylor B, et al. Measuring computed tomography scanner variability of radiomics features. Invest Radiol 2015;50:757–65.
23. Berenguer R, Pastor-Juan MDR, Canales-Vázquez J, Castro-García M, Villas MV, Legorburo FM, et al. Radiomics of CT features may be nonreproducible and redundant: influence of CT acquisition parameters. Radiology 2018;288:407–15.
24. Dercle L, Ammari S, Bateson M, Durand PB, Haspinger E, Massard C, et al. Limits of radiomic-based entropy as a surrogate of tumor heterogeneity: ROI-area, acquisition protocol and tissue site exert substantial influence. Sci Rep 2017;7:7952.
25. Lee H, Hong H, Kim J, Jung DC. Deep feature classification of angiomyolipoma without visible fat and renal cell carcinoma in abdominal contrast-enhanced CT images with texture image patches and hand-crafted feature concatenation. Med Phys 2018;45:1550–61.
26. Zhou L, Zhang Z, Chen YC, Zhao ZY, Yin XD, Jiang HB. A deep learning-based radiomics model for differentiating benign and malignant renal tumors. Transl Oncol 2018;12:292–300.
27. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin JC, Pujol S, et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging 2012;30:1323–41.
28. Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging 2016;35:1240–51.
29. Tustison NJ, Avants BB, Cook PA, Yuanjie Z, Alexander E, Yushkevich PA, et al. N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 2010;29:1310–20.
30. Avants BB, Tustison N, Song G. Advanced normalization tools (ANTS). Or Insight 2009:1–35.
31. Shinohara RT, Sweeney EM, Goldsmith J, Shiee N, Mateen FJ, Calabresi PA, et al. Statistical normalization techniques for magnetic resonance imaging. Neuroimage Clin 2014;6:9–19.
32. Yaniv Z, Lowekamp BC, Johnson HJ, Beare R. SimpleITK image-analysis notebooks: a collaborative environment for education and reproducible research. J Digit Imaging 2017;31:1–14.
33. Sun X, Shi L, Luo Y, Yang W, Li H, Liang P, et al. Histogram-based normalization technique on human brain magnetic resonance images from different acquisitions. Biomed Eng Online 2015;14:73.
34. Chang K, Bai HX, Zhou H, Su C, Bi WL, Agbodza E, et al. Residual convolutional neural network for determination of IDH status in low- and high-grade gliomas from MR imaging. Clin Cancer Res 2018;24:1073–81.
35. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27–30; Las Vegas, NV. Piscataway (NJ): IEEE; 2016.
36. Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: a large-scale hierarchical image database. In: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; 2009 Jun 20–25; Miami, FL. Piscataway (NJ): IEEE; 2009.
37. Breiman L. Bagging predictors. Mach Learn 1996;24:123–40.
38. Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ, Bartlett P, Schölkopf B, editors. Advances in large margin classifiers. Cambridge (MA): MIT Press; 1999. p. 61–74.
39. Zwanenburg A, Leger S, Vallières M, Löck S. Image biomarker standardisation initiative - feature definitions. arXiv [Internet]; 2016 [cited 2016 Dec 21]. Available from: https://arxiv.org/pdf/1612.07003v1.pdf.
40. Nogueira S, Sechidis K, Brown G. On the stability of feature selection algorithms. J Mach Learn Res 2018;18:1–54.
41. Olson RS, Urbanowicz RJ, Andrews PC, Lavender NA, Kidd LC, Moore JH. Automating biomedical data science through tree-based pipeline optimization. In: Squillero G, Burelli P, editors. Applications of evolutionary computation. Cham: Springer International Publishing; 2016. p. 123–37.
42. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
43. Agresti A, Coull B. Approximate is better than "exact" for interval estimation of binomial proportions. Am Stat 1998;52:119–26.
44. Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2015. p. 2901–7.
45. Chollet F. Keras. GitHub repository; 2015.
46. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: a system for large-scale machine learning. arXiv [Internet]; 2016 [cited 2016 May 31]. Available from: https://arxiv.org/pdf/1605.08695.pdf.
47. Vallières M, Freeman CR, Skamene SR, El Naqa I. A radiomics model from joint FDG-PET and MRI texture features for the prediction of lung metastases in soft-tissue sarcomas of the extremities. Phys Med Biol 2015;60:5471–96.
48. Vallières M, Kay-Rivest E, Perrin LJ, Liem X, Furstoss C, Aerts HJWL, et al. Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer. Sci Rep 2017;7:10117.
49. Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning; 2017. p. 1321–30.
50. Truhn D, Schrading S, Haarburger C, Schneider H, Merhof D, Kuhl C. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology 2019;290:290–97.
51. Zhao B, Tan Y, Tsai WY, Qi J, Xie C, Lu L, et al. Reproducibility of radiomics for deciphering tumor phenotype with imaging. Sci Rep 2016;6:23428.
52. Lassel EA, Rao R, Schwenke C, Schoenberg SO, Michaely HJ. Diffusion-weighted imaging of focal renal lesions: a meta-analysis. Eur Radiol 2014;24:241–49.
53. Lanzman RS, Robson PM, Sun MR, Patel AD, Kimiknu M, Wagner AA, et al. Arterial spin-labeling MR imaging of renal masses: correlation with histopathologic findings. Radiology 2012;265:799.
54. Avendi MR, Kheradvar A, Jafarkhani H. A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med Image Anal 2016;30:108–19.
55. Liao S, Gao Y, Oto A, Shen D. Representation learning: a unified deep learning framework for automatic prostate MR segmentation. Med Image Comput Comput Assist Interv 2013;16:254–61.
56. Hu P, Wu F, Peng J, Liang P, Kong D. Automatic 3D liver segmentation based on deep learning and globally optimized surface evolution. Phys Med Biol 2016;61:8676.
57. Sharma K, Rupprecht C, Caroli A, Aparicio MC, Remuzzi A, Baust M, et al. Automatic segmentation of kidneys using deep learning for total kidney volume quantification in autosomal dominant polycystic kidney disease. Sci Rep 2017;7:2049.