We recently read the article by Yokoyama and colleagues (1), in which the authors report a predictive model that integrates the DNA methylation status of three mucin genes to predict overall survival at a designated 5-year interval in pancreatic cancer. They collected samples from 191 patients and compared support vector machines (SVMs) with kernels of varying complexity against a neural network. Models were trained using k-fold cross-validation (with various choices of k), leave-one-out cross-validation, or a 50/50 train/test split to evaluate generalizability. However, common discriminative performance measures, including AUC, F1 score, sensitivity, and specificity, are reported for neither the training nor the testing set; resampling was not used consistently; model selection and model evaluation were not performed separately; model calibration was not assessed; class imbalance and imputation were not considered; and approaches to combat overfitting, such as dropout in neural networks, were not included. In sum, the prognostic value of the model cannot safely be evaluated on the basis of the presented results.
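To make the separation of model selection and evaluation concrete, the following is a minimal sketch of nested cross-validation with scikit-learn. It uses synthetic data of the same sample size as the study, not the authors' cohort, and the kernel and regularization grids are illustrative assumptions; the point is that hyperparameters are chosen in an inner loop while the outer folds, never seen during selection, yield the reported AUC.

```python
# Sketch of nested cross-validation on synthetic stand-in data (not the
# authors' 191-patient methylation cohort; grids are illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary outcome with mild class imbalance, 191 "patients".
X, y = make_classification(n_samples=191, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)

# Inner loop: select kernel and regularization strength by grid search.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(SVC(),
                    {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=inner)

# Outer loop: estimate generalization AUC on folds never used for selection.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
auc_scores = cross_val_score(grid, X, y, scoring="roc_auc", cv=outer)
print(f"Nested-CV AUC: {auc_scores.mean():.2f} +/- {auc_scores.std():.2f}")
```

Reporting the outer-loop mean and spread, rather than a single resampling score, would let readers judge both discrimination and its variability.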
Given that the SVM and the neural network are claimed to perform equally well, the principle of Occam's razor dictates that the simpler of the two models be preferred (2). Interpreted clinically, we argue that simplicity acts as a proxy for comprehensibility and scalability. Deep learning has dramatically advanced many fields, including image and speech recognition. However, there remain domains in which deep models may not be the optimal choice compared with their classical counterparts, given the input structure and available sample size (3). We suspect that the presented tabular data fall into that regime, and that the performance gain from adopting a more elaborate nonlinear model, once overfitting and class imbalance are controlled for, is negligible. Many modern neural networks are poorly calibrated (4), and calibration is especially crucial in clinical practice. Simpler models, such as logistic regression or Cox proportional hazards models, often calibrate better when their complexity is appropriate to the classification problem at hand. It would be interesting to compare the calibration curve of the neural network with that of the simpler linear SVM. Although a neural network, as the more complex model, can produce more accurate predictions, such models continue to be treated as black boxes by many clinicians and inherently lack interpretability compared with simple yet powerful models such as logistic regression (5). For clinicians, it is important to understand the underlying reasoning of a machine-learning approach and how to apply it correctly; otherwise, such approaches are not well accepted into the clinical routine, despite being strong prediction tools.
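The calibration comparison we propose can be sketched as follows, again on synthetic data with hypothetical models (logistic regression standing in for the simpler linear model, a small multilayer perceptron for the neural network). The reliability curve and the Brier score together summarize how closely predicted probabilities track observed outcome frequencies.

```python
# Hedged illustration of a calibration comparison on synthetic data;
# model choices and sizes are assumptions, not the authors' architecture.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=191, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression()),
                    ("neural network", MLPClassifier(hidden_layer_sizes=(16,),
                                                     max_iter=2000,
                                                     random_state=0))]:
    prob = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # frac_pos vs. mean_pred traces the reliability curve; a well-calibrated
    # model stays close to the diagonal frac_pos == mean_pred.
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=5)
    print(f"{name}: Brier score = {brier_score_loss(y_te, prob):.3f}")
```

A lower Brier score and a reliability curve closer to the diagonal would indicate the better-calibrated, and hence clinically more trustworthy, probability estimates.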