Abstract
Artificial intelligence (AI), especially deep learning, has the potential to fundamentally alter clinical radiology. AI algorithms, which excel in quantifying complex patterns in data, have shown remarkable progress in applications ranging from self-driving cars to speech recognition. The AI application within radiology, known as radiomics, can provide detailed quantifications of the radiographic characteristics of underlying tissues. This information can be used throughout the clinical care path to improve diagnosis and treatment planning, as well as assess treatment response. This tremendous potential for clinical translation has led to a vast increase in the number of research studies being conducted in the field, a number that is expected to rise sharply in the future. Many studies have reported robust and meaningful findings; however, a growing number also suffer from flawed experimental or analytic designs. Such errors could not only result in invalid discoveries but also lead others to perpetuate similar flaws in their own work. This perspective article aims to increase awareness of the issue, identify potential reasons why this is happening, and provide a path forward. Clin Cancer Res; 24(3); 532–4. ©2017 AACR.
Translational Relevance
The future of radiology is bright, as novel artificial intelligence (AI) algorithms have the potential to increase the clinical efficacy and quality of radiology, while making it faster and cheaper. Indeed, a large number of research groups are investigating how AI can improve radiologic quantifications, and several compelling discoveries have already been made. Unfortunately, for a variety of reasons, there has also been a large increase in studies with serious experimental design flaws. Prime examples include the use of small datasets that are underpowered or lack independent validation, and flawed statistical analyses such as improper multiple testing correction. These errors can lead to invalid, non-replicable discoveries that serve only to weaken the perception of the field and the credibility of its investigations, and may even slow the clinical introduction of new technologies.
Breakthroughs in artificial intelligence (AI) have the potential to fundamentally alter medical image analysis as well as the clinical practice of radiology. These methods excel at identifying complex patterns in images and in using this information to guide clinical decisions (1–3). AI encompasses quantitative image analysis, also known as radiomics (4–9), which involves either the application of predefined engineered algorithms (that often rely on input from expert radiologists) or the use of deep learning technologies that can automatically learn feature representations from example data (4). Consequently, AI is expected to play a key role in automating clinical tasks that presently can only be done by human experts (1, 2). Such applications may aid the radiologist in making reproducible clinical assessments, thereby increasing the speed and accuracy of radiologic interpretation, and assist the reader in situations that are difficult for human observers to interpret, such as predicting the malignancy of a particular lesion or the response to a particular therapy based on the patient's total tumor burden (10–12).
The potential of AI has resulted in an explosion of investigations that utilize various applications of data science to further radiologic research. The magnitude of this transformation is reflected in the large number of research studies published in recent years. Numerous articles have been published describing the automated detection of abnormalities (also known as CADe; refs. 13–15), others the automated quantification of diseases (also known as CADx; refs. 16–19), and still others that link radiologic data with genomic data (also known as imaging–genomics or radiogenomics) to define genotype–phenotype interactions (6, 20–24). With promising results from these early studies and the increasing availability of imaging data (including large retrospective datasets with clinical endpoints), we expect radiomic investigations to continue to grow rapidly in number and complexity in the coming years.
Many examples of robust discoveries have emerged from studies with stringent analytic and experimental designs. However, a number of studies, including many published in high-impact journals, suffer from (avoidable) flaws in experimental design or analytic methodology, which could potentially invalidate the findings. Among the most egregious examples are studies that employ imaging datasets too small to power a significant finding, for example, evaluating hundreds of parameters in only a couple dozen samples. Others include analyses that lack independent validation and present models that are trained and validated on the same data. Still others suffer from “information leakage” due to improper separation of training and validation datasets; a common example is selecting features from the same data used to evaluate model performance. Errors are also being made in statistical analyses, such as improper correction for multiple testing (or a failure to correct at all), which can lead to overoptimistically low P values, or the reporting of incorrect statistical outcome measures. Such studies give rise to findings that cannot be replicated, ultimately weakening the perception of data science applications in radiology and threatening the credibility of other investigations in the field.
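To make these pitfalls concrete, the sketch below, written in Python using the scikit-learn and statsmodels libraries, illustrates two of the safeguards discussed above: nesting feature selection inside cross-validation so that it never sees the evaluation folds, and correcting univariate P values for multiple testing. The dataset, feature counts, and parameter choices are hypothetical placeholders rather than values from any published study, and the sketch is a minimal illustration, not a complete radiomics analysis pipeline.

```python
# Minimal sketch of leakage-free evaluation and multiple testing correction.
# All data and parameter choices below are hypothetical placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))   # hypothetical: 120 patients x 500 radiomic features
y = rng.integers(0, 2, size=120)  # hypothetical binary clinical endpoint

# 1) Avoid information leakage: place feature selection *inside* the
#    cross-validation pipeline so it is refit on each training fold only.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

# 2) Correct for multiple testing: when screening many features for
#    univariate association, adjust the raw P values (Benjamini-Hochberg here).
_, raw_p = f_classif(X, y)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(f"Features significant after FDR correction: {reject.sum()} of {len(raw_p)}")
```

By contrast, fitting the feature selector on the full dataset before cross-validation would leak information from the evaluation folds into model training and inflate the apparent performance, and reporting the unadjusted P values would overstate the significance of any univariate associations.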
These problems occur for a plethora of reasons, ranging from inadequate training to inadequate example studies to inadequate reporting and sharing. First, many researchers working in the medical imaging and radiology field have little or no formal background in data analysis and biostatistics. This gap in understanding creates a blind spot in which investigators fail to recognize what is required for good study design or what methods can be used most appropriately to arrive at sound analytic results with a high likelihood of being replicated in an independent study. While humans can relate a small number of imaging features to a limited number of pathologic findings, current technology can extract thousands of sometimes non-intuitive parameters from imaging data, and the complex relationships among these parameters require more sophisticated analytic methods.
Furthermore, many of the journal editors and reviewers assigned to evaluate and critique these scientific studies are experts in radiology but have no expertise in data analysis methods. Consequently, potential mistakes in the analytic pipeline may go unnoticed during the editorial and review process, resulting in publications whose findings may not be fully supported by the data or reproducible. This scenario sets up a vicious cycle in which other readers also fail to recognize experimental or analytic flaws, mistake the reported results for truth, and repeat the methodologic errors in their own work. Indeed, a number of proof-of-concept radiomic studies with methodologic shortcomings have been published, and one can now see these same errors repeated by others.
As is true for other scientific fields, current mechanisms for correcting published studies are inadequate. Although study flaws in published works may be quickly recognized by investigators with quantitative training or experience, those studies are rarely publicly challenged. There are few incentives to call into question the results of published articles, as doing so can cause great animosity. Nevertheless, there are some positive examples where a careful review of published studies can help us understand how to do better science. Chalkidou and colleagues (25) performed an independent reanalysis of several published imaging studies and found that the majority had significant issues with the experimental design, potentially leading to incorrect results. It is important for the community to take notice of and support reanalyses such as these and to recognize their value in advancing our science.
A straightforward approach to advance the quality of analytics in quantitative imaging is to create a culture in which there is a willingness to share primary data and software code. Much can be done to stimulate this culture by making data sharing a requirement for publication. Indeed, sharing radiologic data is often technically feasible, and initiatives such as the Cancer Imaging Archive (26) exist to support investigators with this process. Proper sharing ensures that the results of any single study can be recapitulated, as other investigators can test the reproducibility of the main findings. It also facilitates rapid development of new, more robust methods as more data become available for integrated analyses.
As is true in other fields, such as genomics, editors and reviewers should demand the publication of all data (including medical images), code, and results to ensure full reproducibility. This level of disclosure is consistent with the guidelines of many scientific journals for other types of studies and reflects the NIH's requirements for data sharing and reproducible research. Integrating these “best practices” into quantitative imaging will help ensure the quality and reliability of our studies and will increase the strength and influence of our work, as others use and cite our data and software.
As the saying goes, “the devil is in the details.” This is especially true in data science, where confounding errors are easy to make but hard to detect, and require expertise and experience to identify and resolve. The most important steps investigators must take are to acknowledge their limitations, know when to ask for external expert advice, and recognize that a poorly designed and analyzed study is of little value in the long run. Better awareness and education in data science and statistical methodologies will increase the overall quality of discoveries within radiology.
It is also important to establish guidelines that help investigators avoid pitfalls and that recommend analysis strategies for medical imaging data. Guidelines covering data acquisition, data normalization, development of robust features and models, and rigorous statistical analyses will also increase the quality of these studies and allow imaging data to be better evaluated alongside other data sources, such as clinical and genomic data.
Although the points raised here may seem overly critical of radiology research, this is not the first field to face such challenges. The most direct example in my view is the field of genomics, where early studies were underpowered, poorly analyzed, and nonreplicable. As faith in genomic assays began to wane, the community came together and recognized the need for better standards for experimental design, reproducible research, data and code sharing, and good analytic practices (27–35). The widespread institution of these practices by academic leaders and scholarly journals has led to genomic assays that are far more reproducible between laboratories, a necessity for potential clinical application.
Fortunately, there is growing appreciation of these issues and of the importance of better training in quantitative methodologies. Data science is becoming an important subject at leading radiology and image analysis conferences, and educational seminars are stimulating learning and the acquisition of new skills. It is likely the knowledge base will continue to increase for both investigators and editors, improving the overall quality of new research studies.
If we do this right and continue to emphasize the importance of data science training, I believe our field has a bright future. We will never eliminate every mistake; however, by avoiding the major pitfalls, we can improve our credibility. This not only will lead to good science but also could ultimately reshape clinical radiology practice. Most important, it will lead to improved treatment and better outcomes for the patients we serve.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Acknowledgments
We acknowledge financial support from the NIH (U24CA194354 and U01CA190234) and would like to thank Dr. Lawrence H. Schwartz and Dr. John Quackenbush for their insightful and helpful comments.