Abstract
Imaging is a key technology in the early detection of cancers, including X-ray mammography, low-dose CT for lung cancer, and optical imaging for skin, esophageal, or colorectal cancers. Historically, imaging information in early detection schemata was assessed qualitatively. However, the last decade has seen increased development of computerized tools that convert images into quantitative mineable data (radiomics), and their subsequent analyses with artificial intelligence (AI). These tools are improving the diagnostic accuracy of early lesions, helping to define risk and to classify malignant/aggressive versus benign/indolent disease. The first section of this review will briefly describe the various imaging modalities and their use as primary or secondary screens in an early detection pipeline. The second section will describe specific use cases to illustrate the breadth of imaging modalities as well as the benefits of quantitative image analytics. These will include optical (skin cancer), X-ray CT (pancreatic and lung cancer), X-ray mammography (breast cancer), multiparametric MRI (breast and prostate cancer), PET (pancreatic cancer), and ultrasound elastography (liver cancer). Finally, we will discuss the inexorable improvements in radiomics to build more robust classifier models and the significant limitations to this development, including access to well-annotated databases and biological descriptors of the imaged feature data.
See all articles in this CEBP Focus section, “NCI Early Detection Research Network: Making Cancer Detection Possible.”
Introduction
As discussed elsewhere in this special issue, there are compelling medical and social needs for detecting cancers at an early stage while they are still localized and can be treated with curative intent. Ideally, the incipient cancers are detected in an asymptomatic individual because, in many cases, the onset of symptoms is associated with late-stage incurable disease. However, it is also abundantly clear that false positives, overdetection, overdiagnosis, and overtreatment are serious issues that have to be solved before early detection paradigms can be fully successful (1). Consequently, early detection paradigms are necessarily multistep processes with a goal to optimize different tradeoffs between sensitivity and specificity of the diagnosis at each step. Historically, medical imaging has been a key component of early detection pipelines for all cancers (Tables 1 and 2).
In this review, we will present the case that quantitative analyses of these images with "radiomics" provide superior clinical utility to optimize diagnosis and risk assessment compared with qualitative interpretations of images. Furthermore, the resulting predictive machine-learned models can be tuned to optimize sensitivity or specificity, depending upon whether the imaging test is a primary or a secondary filter, and depending on the biology of the incipient lesion (2). Primary screens should have very high sensitivity, with increasing specificity in subsequent (secondary) screens.
Radiomics
Radiomics refers to the conversion of images into structured, mineable data and the subsequent use of these data for prediction, diagnosis, prognosis, and/or longitudinal monitoring. The entire enterprise is predicated on the hypothesis that "Images are Data" and that they reflect the underlying pathobiology of the region of interest (ROI; ref. 3). Radiomics has been an explosive field in the last decade of its existence and has been thoroughly and recently reviewed elsewhere (4–6). The conventional radiomic analytic pipeline proceeds as shown in Fig. 1, wherein image-based features are extracted from the lesion and/or its surrounding tissues (i.e., ROIs). These features can be scored semantically by an experienced radiologist; they can be computer calculated from a hand-crafted set of features, numbering in the many hundreds, that describe the size, shape, and textures within (and around) the lesion; or they can be created by deep-learning (DL) algorithms. In some cases, "delta" features may also be computed by comparing individual feature values obtained at different times across screening intervals, during active surveillance, or during therapy (7).
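To make the hand-crafted texture features mentioned above concrete, the sketch below computes one of the most common, a gray-level co-occurrence matrix (GLCM) and its "contrast" statistic, for a tiny hypothetical ROI already segmented and quantized to four gray levels. This is a minimal pure-Python illustration, not any specific radiomics package.

```python
# Sketch: one hand-crafted radiomic texture feature, assuming a tiny
# hypothetical 2-D ROI already quantized to 4 gray levels (0-3).
def glcm(roi, levels=4):
    """Gray-level co-occurrence matrix for horizontal neighbors (offset (0, 1))."""
    m = [[0.0] * levels for _ in range(levels)]
    for row in roi:
        for a, b in zip(row, row[1:]):
            m[a][b] += 1.0
    total = sum(sum(r) for r in m) or 1.0
    return [[v / total for v in r] for r in m]  # normalize to probabilities

def glcm_contrast(p):
    """Contrast = sum_{i,j} (i - j)^2 * p(i, j); higher for coarser texture."""
    return sum((i - j) ** 2 * p[i][j]
               for i in range(len(p)) for j in range(len(p)))

roi = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 3, 3],
       [2, 2, 3, 3]]  # hypothetical quantized pixel intensities
contrast = glcm_contrast(glcm(roi))  # ~0.33 for this blocky ROI
```

A perfectly uniform ROI yields zero contrast; real pipelines compute hundreds of such statistics over multiple offsets and directions, and "delta" features are simply differences of these values between time points.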
The calculated features are then commonly analyzed with conventional machine learning approaches to develop models predictive of the dependent variable(s) of interest (8). In machine learning, the easiest outcomes to model are binary, and these are often most relevant to early detection paradigms, for example, cancer versus not cancer, aggressive versus indolent, etc. However, machine learning algorithms are also capable of modeling more complex continuous outcomes such as progression-free survival and time to recurrence, although these generally require much larger training sets. To avoid overfitting, the radiomic feature set must be reduced to a manageable number; this is done by sequentially removing unstable and redundant features and then prioritizing those that are the most informative of the outcome of interest. Whenever possible, radiomic models are combined with orthogonal data such as clinical covariates, including demographics, patient risk factors, serum analytes, or genomics. Models that include both radiomics and clinical covariates generally improve predictive capabilities.
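The redundancy-removal step described above can be sketched as a simple correlation-threshold prune. The feature names and values below are hypothetical, and real pipelines also apply stability filters and outcome-association ranking before a feature enters the model.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def prune_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept.

    `features` maps feature name -> per-patient values; in practice the
    iteration order would follow a prioritization (e.g., outcome association)."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, v)) < threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

# Hypothetical feature table: "volume" and "diameter" are nearly collinear.
features = {
    "volume":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "diameter": [1.1, 2.0, 2.9, 4.2, 5.0],   # redundant with volume
    "entropy":  [0.3, 0.1, 0.4, 0.1, 0.5],   # independent texture feature
}
kept = prune_redundant(features)  # drops "diameter", keeps "volume", "entropy"
```

The surviving features would then be joined with clinical covariates (age, smoking history, serum analytes) as additional model inputs.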
In conventional radiomics, accurate segmentation and choice of user-defined features are critical to success. Alternatively, DL approaches can mitigate these limitations, as the algorithms self-train to identify the regions of the image that are most highly correlated to outcome (usually binary) and to identify the characteristics of that region that informed that decision through multiple layers of learning. For medical images, convolutional neural networks (CNN) are the most common models, as they can accept two-dimensional or three-dimensional images as input. DL has only recently been applied to radiomics, but it has already proven valuable in differential diagnosis (9). The major limitation affecting the use of CNNs in diagnostic imaging is the availability of training data that are adequate in terms of both quality and quantity. Approaches to address this include limiting the size of the input images, "transfer learning" that pretrains on an unrelated set of images, and artificially increasing the training set size by image transformation using geometric augmentation (4).
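Of these mitigations, geometric augmentation is the simplest to illustrate: each training image is replaced by several rotated and mirrored variants, multiplying the effective training set size without new data collection. The sketch below uses a hypothetical 2 x 2 "image" represented as nested lists.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def hflip(img):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in img]

def augment(img):
    """Eight geometric variants (4 rotations x optional flip) of one image."""
    out, cur = [], img
    for _ in range(4):
        out.append(cur)
        out.append(hflip(cur))
        cur = rotate90(cur)
    return out

# Hypothetical 2x2 "image": augmentation multiplies the training set by 8.
batch = augment([[1, 2],
                 [3, 4]])
```

These transforms preserve the label (a nodule rotated 90 degrees is still the same nodule), which is what makes them safe "free" training data for a CNN.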
Regardless of which radiomic approach and pipeline is utilized, the greatest single limitation is access to large, high-quality datasets that are well annotated with respect to the relevant outcome(s). The importance of this last criterion cannot be overstated. For many radiomic studies in the cancer care continuum, outcomes such as, inter alia, recurrence, progression-free survival, or time to recurrence are rarely captured in readily accessible structured formats. Hence, these endpoints need to be captured manually through chart review, which is laborious and requires relevant expertise to extract accurately. In an early detection paradigm, however, the important outcome (cancer vs. noncancer; indolent vs. aggressive) is often, or can be, captured in structured formats, mitigating some of the limitations associated with data curation. Nonetheless, ongoing access to large and evolving datasets remains the single biggest limitation to the widespread development of radiomic approaches for prediction, prognosis, and monitoring. One approach to solve this problem is the application of "distributed learning," wherein algorithms, and not data, are shared between institutions for training and eventual testing (10, 11).
A direct consequence of the paucity of large multi-institutional and complex datasets is reduced generalizability and portability of the developed algorithms. In the discovery phase of radiomics, single-institution, moderately sized datasets were acceptable. However, it is axiomatic that clinical utility of such approaches will require that the predictive accuracy of the algorithms be independent of site, scanner, or reconstruction parameters. Hence, there is an absolute requirement for models to be trained on complex multisite, multivendor, and multiplatform datasets. The specific context of use (COU), required for every biomarker, will then be defined by the complexity of the associated test and validation sets (12).
Another serious limitation to the use of radiomic algorithms, whether conventional or deep learned, is identification of the biological underpinnings of the most "informative" (i.e., predictive) radiomic features. For example, recognizing that it will be difficult to make treatment decisions based on a highly informative gray-level co-occurrence matrix (GLCM) feature, it is incumbent upon the radiomics community to identify the biologically relevant drivers of the informative radiomics models. Examples in lung cancer include radiomic predictors of EGFR mutational status (13), immune infiltration (14), or PD-L1 status (15), which could conceivably then be used to trigger treatment decisions.
Imaging Modalities
Optical
Optical approaches have the benefit of being relatively inexpensive and widely available. The major limitation to optical techniques is penetration depth, which is limited to 1 to 2 mm. Although this can be overcome with optical coherence tomography (OCT) or optoacoustic (OA) imaging, these techniques are not yet widely available and have yet to be tested in an early detection pipeline. Similarly, molecular imaging with targeted fluorophores or nanoprobes (16) has great promise, but these are still very much in the preclinical or early clinical research setting. For example, OCT is being tested for its ability to improve detection of colorectal cancer using sigmoidoscopy (17). Currently, the most widespread use of optical imaging is the classification of skin lesions as being worrisome enough to send to biopsy and surgery. This is being greatly improved with artificial intelligence (AI), which will be discussed in the subsequent section.
US
US has the decided advantages of being able to image deep body tissues in real time and of being very widely available, but it has been plagued in early detection schema by being too operator dependent. This is being aggressively addressed with a family of techniques called "elastography," which can quantitatively image tissue stiffness, well known to be increased with hyperplasia. Elastography can potentially be used to detect lymph node metastases (18) or incipient breast lesions, but the current leading application is quantification of liver fibrosis, which will be discussed in the subsequent section.
X-ray
High-energy photons have superior tissue penetration and are absorbed by electron-dense materials, such as calcium deposits or contrast agents. X-rays are also differentially absorbed by soft tissues of different densities, which allows discrimination of, for example, adipose from muscle from connective tissue from lung parenchyma. X-ray absorptivity is inherently quantifiable as Hounsfield units (HU), which permits advanced analytics, as well as ready comparisons across scanners and institutions. A limitation for the use of CT is that the energy deposited by X-rays can damage DNA, leading to double-strand breaks. Even though modern scanners are built to mitigate this by reducing the amount of radiation needed to generate a useable image, the use of ionizing radiation in a potentially otherwise healthy population is a well-recognized limitation. The two major uses of X-rays in primary screening are mammography for early detection of breast cancers and low-dose CT (LDCT) for early detection of lung cancers, both of which will be discussed in the subsequent section.
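The cross-scanner comparability of HU comes from its standardized linear scale (water is 0 HU and air is about -1000 HU by definition). As a minimal sketch, raw CT pixel values are rescaled with slope/intercept parameters in the style of the DICOM RescaleSlope and RescaleIntercept tags; the numeric values here are hypothetical.

```python
def to_hounsfield(raw, slope=1.0, intercept=-1024.0):
    """Linear rescale of raw CT pixel values to Hounsfield units (HU).

    slope/intercept mirror the DICOM RescaleSlope/RescaleIntercept tags;
    the defaults here are common but assumed for this sketch."""
    return [r * slope + intercept for r in raw]

# Hypothetical raw values: air (~24), water (~1024), dense calcium (~2024).
hu = to_hounsfield([24, 1024, 2024])  # -> [-1000.0, 0.0, 1000.0]
```

Because every calibrated scanner maps tissue to the same HU scale, radiomic features computed on HU values are directly comparable across institutions, unlike raw detector output.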
MRI
MRI is advantageous in that it also has excellent tissue penetration and delivers superior soft tissue (and tumor) contrast without ionizing radiation. However, MRI generally requires long acquisition times, which lead to motion artifacts, especially in abdominal tissues. This is being mitigated with advanced compressed-sensing techniques, including MR fingerprinting, which have the potential to render MR images in seconds, compared with the current minutes. A mixed blessing of MRI is the wide variety of available pulse sequences that can generate contrast based on a multitude of physicochemical magnetic relaxation mechanisms. While this makes MRI an extremely powerful technique, it is difficult to achieve uniformity of acquisition parameters across multiple institutions for a qualifying study. These limitations are considered challenges by the MR research community, and meeting these challenges will be discussed in the final section. In current practice, multiparametric MRI (mpMRI) has found a niche in the active surveillance (AS) of prostate cancers to discriminate indolent from progressing disease, and this will be discussed in a subsequent section.
Nuclear
Nuclear imaging with single photon (SPECT) or positron (PET) tracers is not commonly associated with screening in an asymptomatic population. However, it is gaining traction in difficult-to-detect cancers, such as pancreatic cancer, where the αvβ6 integrin has been shown to be an effective marker with highly specific 18F-labeled positron-emitting tracers; this will also be discussed in a subsequent section.
Specific Use Cases
Specific use cases, although not meant to be exhaustive reviews, were chosen to illustrate the breadth of quantitative imaging approaches in cancer screening and early detection. For each cancer site, we identify where imaging currently and potentially fits within early detection paradigms and, whenever possible, how imaging data can be combined with serum biomarkers for improved sensitivity (if a primary screen) or specificity (if a secondary screen).
Cutaneous neoplasms (optical)
The most obvious and impactful early detection setting that results in reduced cancer burden and mortality is optical detection and diagnosis of potentially lethal skin cancers, such as metastatic melanoma and squamous cell carcinomas. The accepted paradigm for assessing the risk of a dysplastic nevus is a validated quantitative scoring scheme known as "ABCD" for Asymmetry, Border irregularity, Color variegation, and Diameter >6 mm (19). Although this is a subjective scoring system, it is nonetheless effective and can be considered "quantitative imaging" (20). Automated or semiautomated quantification of the ABCD scoring system is an active area of research (21). In particular, because of the ready availability of annotated images, this is a particularly ripe area for the application of DL for classification. Deeply learned algorithms, such as CNNs, do not rely on dermatologist-defined ABCD scoring criteria, but instead identify and develop informative features on their own.
In 2016, a large study trained a CNN on 129,450 annotated dermatologic images representing 2,032 different skin conditions, from benign rashes to malignant melanoma (22). Examples of the classification schema and the training images are provided in Fig. 2. In this study, the AI algorithm was then tested against 21 board-certified dermatologists in two clinically relevant binary classification tasks: keratinocytic carcinomas versus benign seborrheic keratoses, and malignant melanoma versus benign nevi. In both tasks, the AI algorithm achieved a performance level indistinguishable from the experts. More recently, deeper networks with further training have been shown to outperform dermatologists in classification accuracy (23). This paradigm of using AI to aid in differential diagnosis of skin lesions has been subjected to two large meta-analyses, including a Cochrane Review (24, 25). These concluded that AI approaches are highly sensitive for identification of invasive melanoma in select populations. In cutaneous lesions, the penalty for false positives is relatively low in that it triggers a simple excisional biopsy that is analyzed histologically. However, classification of the biopsied sample is still limited by the accuracy of dermatopathologists. Even here, DL algorithms have been shown to improve classification accuracy by dermatopathologists (26).
If these enterprises develop to the point of approval and acceptance, it has been imagined that AI applications, perhaps even on a smartphone, will be routinely used by patients, dermatologists, and dermatopathologists to identify skin lesions before they have progressed to an incurable stage.
Lung cancer (CT)
For a number of reasons, radiomics has been developed and applied in the diagnosis, prediction, and prognosis of lung cancer far more than any other disease. These reasons include the high disease incidence and subsequent large number of cases available, the high CT imaging contrast that naturally exists between lung nodules or tumors and the surrounding parenchyma, the ubiquity of CT scans in the workup and management of lung cancers, and the biomedical importance of the problem. In an early detection paradigm, lung nodules are detected using non–contrast-enhanced LDCT screening of high-risk individuals (27). However, because screening programs have not been embraced by high-risk smokers or former smokers, indeterminate pulmonary nodules are most commonly detected incidentally during CT scans for other conditions, such as cardiac calcium scoring, trauma, etc. (28, 29). Screen-detected nodules are managed by Lung CT Screening Reporting & Data System (Lung-RADS) scoring (30) or, increasingly, using the Brock University cancer prediction model (31). Incidental nodules are generally managed using the Fleischner Society guidelines for the management of pulmonary nodules (32). All of these schemes include CT-measured nodule size as an integral component. Size is one of the main indicators, in addition to growth rate over time, used to determine the nature of a pulmonary nodule (33).
In the screening setting, the publicly available data and images from the National Lung Screening Trial (NLST; ref. 34) have enabled dozens of notable radiomic studies. In one of the first well-powered studies, a conventional radiomic score was developed using machine learning based on 10 features (size, shape, and texture) that were extracted from segmented baseline indeterminate (4–12 mm) pulmonary nodules (35). In independent testing, the radiomics score outperformed volume and had virtually the same overall accuracy as the Brock model in predicting subsequent development of cancers within 1 or 2 years after the baseline screen. Notably, the radiomics model outperformed the Brock score with accuracies >0.93 for the extreme cases of very high and very low radiomics scores, which represented approximately 60% of the screening subjects.
More recently, a powerful study from Google Health developed an end-to-end detection and diagnosis system using a deep-learned CNN (36). This model was developed based on 42,290 LDCT scans from 14,851 patients curated from the NLST. Of these, 578 developed biopsy-confirmed lung cancer within 1 year of follow-up. Patients were randomized into training (70%), tuning (15%), and testing (15%) sets. A powerful aspect of machine-learned models is that the cutoffs can be varied to generate different "lung malignancy scores" (LUMAS) for comparison with Lung-RADS scores from 1 to 2 to 4X. In all cases, the AI model outperformed scoring by experienced radiologists using the Lung-RADS schema. Moving forward, it will be important to test this predictive model in a real-world setting. Furthermore, additional datasets from other screening studies [MILD (37), LUSI (38), NELSON (39)] should become available, furthering the ability to train, test, validate, and improve the power of these and other models.
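The tunable-cutoff property noted above can be sketched in a few lines: given model scores and ground-truth labels (both hypothetical here), sweeping the decision threshold trades sensitivity against specificity, exactly the lever that distinguishes a primary screen (permissive cutoff) from a secondary one (strict cutoff).

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when `score >= threshold` is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores (0-1) and outcomes (1 = cancer within 1 year).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]
labels = [0,   0,   0,    1,   0,   1,   1,   1]

lo = sens_spec(scores, labels, 0.3)   # permissive cutoff: high sensitivity
hi = sens_spec(scores, labels, 0.65)  # strict cutoff: high specificity
```

On these toy data, the permissive cutoff catches every cancer at the cost of false positives, while the strict cutoff eliminates false positives but misses one cancer; the same single trained model supplies both operating points.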
An important consideration of radiomics, as described above, is that the prediction is based on images at a single time-point, which may or may not be combined with additional clinical data. Going forward, it will be important to incorporate longitudinal image data, or delta radiomics, much as oncologists do in their decision processes. Temporal changes in radiomic features have great potential to inform changes in nodule biology that have important diagnostic and pathologic consequences (40, 41). It will also be important to incorporate quantitative biochemical information from serum or sputum in these models, as they clearly contain important orthogonal information (42, 43).
Despite the benefits of earlier detection of lung cancer, LDCT screening identifies large numbers of false positives and indeterminate pulmonary nodules, of which only a fraction develops into cancer. LDCT screening also detects indolent neoplasms that may not otherwise cause clinical symptoms or death (i.e., overdiagnosis). Because of these limitations, investigators in the EDRN have been developing image-based biomarkers along with genomic, molecular, and cellular biomarkers to improve diagnosis of nodules and reduce false positives and overtreatment. The Lung Cancer Collaborative Group brings together investigators from EDRN Clinical Validation Centers (CVC) and Biomarker Development Laboratories (BDL), whose primary focus is to develop and validate biomarkers and imaging methods to detect lung cancer among smokers with indeterminate nodules that are detected by LDCT.
Breast cancer (mammography, mpMRI)
Breast cancer screening by mammography has a long history. In screening for this disease, it is critical to have as high an accuracy as possible as the consequences of false positives leading to unnecessary interventions, and false negatives leading to cancer progression, can be disastrous (44, 45). Because of its long history, there are vast repositories of outcomes-annotated mammographic images. Consequently, the use of these images to develop and deploy computer-aided diagnostic (CAD) workstations to aid in the interpretation of mammographic images has a similarly long history. CAD systems have been shown to work very well in controlled research settings, but they generally fell short in “real-world” clinical settings. Despite widespread use of CAD in clinical practice (46), a large multi-institutional study in 2015 of over 600,000 mammograms concluded that “CAD does not improve diagnostic accuracy of mammography” (47). Although breast radiologists are willing to embrace technology, there is skepticism by some regarding the value of newer AI-driven algorithms in routine clinical practice (48).
This may soon change, however, with the publication of a new international study by Google Health (49). This study trained and tested a CNN on mammograms from 28,953 women in the United States and the United Kingdom who had at least 1 year of follow-up to classify mammograms as showing cancer (biopsy-proven) versus noncancer (either biopsy-proven or no lesion). In this study, the AI algorithm outperformed all of the radiologists who had originally read the mammograms, as well as 6 expert radiologists who prospectively over-read 500 cases. The AI algorithm had significantly higher sensitivities and specificities than human readers to predict emergence of cancer within 3 years (United Kingdom) or 2 years (United States). Nonetheless, the AI system did produce some false negatives that were identified as positive by radiologists, and vice versa. Examples of these are given in Fig. 3A and B. This complementarity suggests that combining AI with radiology reads would provide the highest accuracy and likely lead to more ready acceptance.
Similar to the efforts related to lung cancer, EDRN investigators have been combining image-based and circulating biomarkers to improve diagnostic discrimination in breast cancer screening and to discriminate between indolent and aggressive subtypes. The EDRN Breast and Gynecologic Cancers Collaborative Group brings together CVCs and BDLs to conduct research to improve the performance of screening mammography, to distinguish benign from malignant breast lesions, and to improve early detection of different molecular subtypes of breast cancer.
Pancreatic cancer (CT, MRI, PET/CT)
Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal of human cancers and may become the most lethal by 2022 (50). Whereas the overall 5-year survival rate is 6% to 7% (51), the survival rate is virtually zero for disease that is unresectable (52–56). Because of this, pancreatic cancer has a most compelling need for an early detection portfolio, as this cancer is commonly diagnosed at an advanced stage, when it is lethal. As such, this is an area of intense investigation combining serum and image biomarkers. In contrast to other cancers, there are no clearly defined schema for detecting pancreatic lesions in an asymptomatic population. Patients at high risk for development of PDAC include those with family history, symptomatic pancreatitis, new-onset diabetes (NOD), or asymptomatic mucinous cystic lesions [cystic neoplasms and intraductal papillary mucinous neoplasms (IPMN)], and thus are candidates for active surveillance with endoscopic US (EUS) or MR imaging (57). The other precursor lesion, pancreatic intraepithelial neoplasia (PanIN), is rarely, if at all, incidentally detected in asymptomatic individuals. Cystic lesions are incidentally detected in approximately 1.2% of all abdominal CTs, or approximately 150,000 annually (58). Although IPMNs are precursor lesions in only approximately 15% of all PDAC, they are illustrative of the early detection paradigm: they are difficult to manage because <1% of all cystic lesions are destined to progress to PDAC (59). Current management of IPMNs involves surgery, which is associated with both operative mortality and multiple morbidities, as well as overtreatment, as 50% to 70% of resected cystic lesions are nonmalignant (60). The decision to resect is often based on radiographic identification of "worrisome" or "high-risk" features. Patients not electing surgery have an option of active surveillance with biannual EUS or MRI (61).
Historically, there is poor agreement between preoperative diagnosis and pathologic examination, identifying a great need for improved quantitative markers of malignancy risk (62–67).
The current paradigm for early detection of PDAC is primarily based on biomarkers present in serum, and this is an active area of investigation (68, 69). The matrix glycoprotein thrombospondin-2 regulates remodeling of the extracellular matrix (ECM), and hence its presence in serum is indicative of ECM breakdown and resynthesis (70). Recently, it has been proposed as a biomarker to identify IPMNs that are at high risk for progression to PDAC (71), and the power of this prediction was enhanced by combining with CA 19-9 levels and qualitative radiographic features (72). In addition, circulating miRNAs have also been shown to identify IPMNs that are destined to progress (73). In a follow-up study, the "miRNA genomic classifier" (MGC) and quantitatively scored radiologic features were extracted from a cohort of 38 surgically resected, pathologically confirmed IPMN cases. The AUROC of the MGC to predict malignancy was 0.83, and the high-risk radiologic features yielded an AUROC of 0.84. Fourteen textural and nontextural radiomic features differentiated malignant from benign IPMNs (P < 0.05) and collectively had an AUROC of 0.77. Combining these textural radiomic features with the MGC yielded an AUROC of 0.92. Sensitivity (83%), specificity (89%), positive predictive value (88%), and negative predictive value (85%) were superior to conventional clinical pathomic models (74). Significant findings of this article were that radiomics could correctly classify as true negatives cases that would otherwise be false positives (Fig. 4A) and that one true-positive case was correctly classified by radiomics but was a false negative by conventional qualitative imaging (Fig. 4B).
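The AUROCs reported above summarize how well a score separates malignant from benign cases. A minimal sketch with hypothetical scores: AUROC is the probability that a randomly chosen malignant case scores higher than a randomly chosen benign one (ties counted as half), which can be computed directly from the two score lists.

```python
def auroc(pos_scores, neg_scores):
    """Rank-based AUROC: P(random positive outranks random negative)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier scores for malignant vs. benign IPMNs.
malignant = [0.9, 0.8, 0.6, 0.55]
benign    = [0.7, 0.4, 0.3, 0.2]

area = auroc(malignant, benign)  # 0.875 for these toy scores
```

A perfect separator gives 1.0 and a random one about 0.5; combining complementary markers (e.g., a miRNA classifier with radiomic features) improves the AUROC precisely when each fixes misrankings the other makes.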
Currently, there are multiple investigations applying AI to abdominal images with the expectation that DL can identify precursor lesions before they are evident to radiologists (75). As with much of radiomics, this effort suffers from limited availability of multi-institutional, well-curated cohorts of past CT scans from current PDAC patients. Hence, most efforts in AI analyses of PDAC have been focused on discriminating invasive from benign neoplasms. For example, Koay and colleagues (MD Anderson Cancer Center, Houston, TX) have shown that quantitative analysis of the tumor–stromal interface was a significant prognostic indicator in 3 different cohorts (N = 303 total) of patients with PDAC treated with surgery or chemotherapy (76). Chu and colleagues (Felix project at Johns Hopkins, Baltimore, MD) have reported their initial investigations using deep CNNs to identify normal pancreas (and other visceral tissues) as well as classifying PDAC as a case study (77). They trained the network on 575 control subjects and were able to achieve excellent segmentations of most major organs with accuracies >88% (compared with manual segmentations). This is an important first step, as it can be expected that precursor lesions will deviate from the norm. The network was also trained to recognize PDAC as a deviation from the norm, using a cohort of 750 patients with PDAC. Even though the acquisition parameters of this training set were homogeneous, the model performed well only on larger lesions. Hence, its eventual application to the identification and characterization of small precursor lesions remains to be demonstrated. Although this is the right approach, it will require significantly larger and more heterogeneous training and testing sets curated from prior scans of confirmed patients with PDAC.
Because asymptomatic pancreatic cystic lesions are primarily identified incidentally, more structured risk assessment tools are critically needed. A typical screening paradigm is unworkable due to the relatively low prevalence, even though additional risk factors are being identified (57). Because of low prevalence, serum biomarkers have also been elusive, and this is an area of active investigation within the EDRN network. CA 19-9 is the most common PDAC-associated antigen, yet it does not become measurable until disease has progressed. Combining CA 19-9 with autoantibodies, circulating antigens, miRNA, or enzymes such as cathepsins, however, may improve its sensitivity to incipient disease (68, 78–80). It is reasonable to expect that, in the foreseeable future, serum biomarkers will be used to identify patients at risk of incipient pancreatic hyperplasia, creating a cohort with a tractable prevalence that could then trigger assessment by secondary imaging/radiomics-based screens.
When radiomic analyses are incorporated in differential diagnoses, improvements in sensitivity and specificity have been shown in determining risk of malignancy. EDRN investigators have been actively conducting research utilizing imaging and molecular biomarkers to develop a rational, evidence-based strategy to detect pancreatic cancer at an early and resectable stage. The EDRN GI Collaborative Group conducts research on colorectal, pancreatic, and esophageal cancers, and there are two CVCs and one BDL focused on pancreatic cancer. The main goals of this Collaborative Group with respect to pancreatic cancer are to develop and validate biomarkers and imaging methods to detect pancreatic cancer in high-risk groups and to determine which pancreatic cysts are cancerous. In recognition of the importance of images to this enterprise, the PDAC working group is engaged in the construction of an image repository of early-stage PDAC and IPMNs.
Nuclear medicine approaches, such as PET, are usually not included within an early detection pipeline due to high costs and choice of tracers. However, this may be changing for pancreatic cancers, due to the severity of the problem and the need to identify early lesions with high risk of malignancy. Through many studies, the αvβ6 integrin has been shown to be highly associated with incipient pancreatic cancers and to be a driver of PDAC phenotypes (81). As such, PET tracer–labeled ligands specific for integrin αvβ6 have been developed (82, 83). These tracers have been used in humans and are proposed for use in a screening paradigm (84), which would be one of the first such uses of PET imaging.
Prostate cancer (mpMRI and tumor habitats)
Prostate cancer is one of the most commonly diagnosed cancers and accounts for the second largest number of cancer-related deaths among men (85). Most patients with prostate cancer are first identified by rising serum prostate-specific antigen (PSA) coupled with a digital rectal exam (86, 87). In contrast to the other use cases described above, the application of imaging AI to prostate cancer occurs after cancer is detected, usually with elevated or rising PSA, even though this risk marker is limited by low sensitivity and specificity (87, 88). Unlike other cancers, prostate cancer is often detected early and can often be indolent, posing no risk during a subject's lifetime. However, if prostate cancer is high grade, with a Gleason score of 4+3 or above, it has the potential to be life threatening. Therefore, discriminating indolent from aggressive disease is of utmost clinical importance.
Increasingly, imaging occurs in a setting of active surveillance (AS), which has emerged as an alternative for patients with localized, low-grade (Gleason <7) disease, in whom treatments such as radiotherapy or prostatectomy, along with their accompanying morbidity, can be postponed by as much as 5 years without significant change in outcome (89–91). In AS schema, men with low-grade disease are regularly (annually or biannually) imaged with multiparametric MRI (mpMRI), consisting of diffusion-weighted sequences, T2-weighted images for morphology, and a dynamic contrast series following a bolus injection of a contrast agent. Coincident with these imaging sessions, men are routinely subjected to transrectal US (TRUS)–guided biopsy, generally consisting of 12 to 14 random cores along with one to two targeted biopsies of suspicious regions identified on the mpMRI. Current practice is to analyze mpMR images of the prostate using the Pi-RADS scoring system (92–94). However, Pi-RADS suffers from poor inter-reader concordance (95). Hence, there is interest in developing an automated (AI) strategy that offers greater reliability than is achieved under the current paradigm (96).
Advances in imaging technologies have made it possible to obtain high-resolution functional imaging data that reflect tumor cellularity, dynamics of contrast uptake, and the anatomic location and visual characteristics of lesions across the entire gland in three dimensions. In recent years, investigators have been developing methods to combine data from multiple image sets to identify spatially distinct regions with similar physiologies, or "habitats" (4). In early prostate cancer, these habitat imaging approaches have shown promise in identifying "at-risk" volumes within the prostate that can be used to guide TRUS biopsy locations, as shown in Fig. 5 (97–99), or to improve risk stratification compared with Pi-RADS (100). The use of radiomics and habitat imaging in the characterization of early prostate cancer has recently been reviewed (101, 102). The long-term goal of radiomics and habitat imaging in early prostate cancer is to gain enough confidence in the ability of imaging to detect progressive disease that annual core biopsies in an AS schema can be eliminated.
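Conceptually, habitat imaging amounts to unsupervised clustering of co-registered, voxel-wise multiparametric features. The sketch below is a minimal illustration using synthetic ADC, T2, and contrast-uptake values (simulated, not patient data) and k-means clustering; the specific features and the choice of four habitats are illustrative assumptions, not the published methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical co-registered voxel-wise measurements for one gland:
# apparent diffusion coefficient (ADC), T2 signal, and DCE uptake slope.
n_voxels = 5000
adc = rng.normal(1.2, 0.3, n_voxels)        # x10^-3 mm^2/s (synthetic)
t2 = rng.normal(100.0, 20.0, n_voxels)      # arbitrary signal units
dce_slope = rng.normal(0.5, 0.2, n_voxels)  # contrast-uptake slope

# Standardize so no single modality dominates the distance metric.
features = StandardScaler().fit_transform(
    np.column_stack([adc, t2, dce_slope]))

# Partition voxels into k physiologically similar "habitats".
k = 4
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
print({habitat: int((labels == habitat).sum()) for habitat in range(k)})
```

In practice, the habitat labels would be mapped back to voxel coordinates to delineate spatially contiguous "at-risk" volumes that could guide targeted biopsy.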
In the EDRN, investigators have been conducting research to improve the early detection of prostate cancer, to predict prostate cancer upgrading, and to use fusion MRI to improve the accuracy of prostate needle biopsies. The EDRN Prostate and Other Urological Cancers Collaborative Group is developing noninvasive tests to discern indolent cancers from aggressive cancers and to determine whether MRI prostate imaging and biomarkers can improve the prediction of cancer extent and aggressiveness to determine suitability for active surveillance or treatment.
Liver cancer (US elastography)
Liver fibrosis, cirrhosis, nonalcoholic steatohepatitis (NASH), and nonalcoholic fatty liver (NAFL) are increasingly common pathologic conditions, each with its own complications (103, 104). Notably, all of these are predisposing conditions for development of hepatocellular carcinoma (HCC; ref. 105). Historically, HCC has been associated with hepatitis (Hep) B or C virus infections leading to fibrotic liver disease. Thus, Hep B and C status are powerful risk factors for entry into a screening program. However, with the uptake of vaccinations and the increased prevalence of obesity, fatty liver disease is increasingly being recognized as a relevant risk factor as well (106). Obesity-associated NAFL is generally benign, yet a subset can develop into NASH, which carries a high risk of progression to cirrhosis (107). The gold standard to distinguish NASH from NAFL is liver biopsy, but this is associated with morbidity and even mortality. Clinical predictors and serum biomarkers have been developed, yet they have not been validated in multi-institutional case–control studies (107). Hence, noninvasive measures are critically needed.
Because of the prevalence of predisposing conditions, increasing numbers of patients are routinely having their livers examined by acoustic radiation force impulse (ARFI) imaging or shear-wave elastography (SWE), both of which are widely available (108). While these approaches are not generally associated with "radiomics" per se, there is increasing analysis of US with DL (109). Notably, the risk of developing HCC increases with increasing hepatic stiffness, reported quantitatively in kilopascals (kPa). In a well-powered and highly cited study of 866 Japanese patients with Hep C infection, Masuzaki and colleagues observed that the incidence of HCC within 3 years was 38.5% among those with baseline liver stiffness values >25 kPa, compared with 0.4% among subjects with values ≤10 kPa (110). The strengths and limitations of elastography for diagnosis of fibrosis have been extensively reviewed in a consensus conference, with the conclusion that it can eliminate the need for liver biopsy in most patients (111).
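As a minimal illustration of how quantitative stiffness values could feed a triage rule, the sketch below bins a measurement using the two cutoffs from the Masuzaki study (≤10 kPa and >25 kPa); the intermediate band and the function itself are illustrative assumptions for exposition, not a validated clinical rule.

```python
def hcc_risk_band(stiffness_kpa: float) -> str:
    """Bin a liver stiffness measurement (kPa) into an illustrative
    HCC risk band. Cutoffs follow Masuzaki et al. (3-year incidence
    0.4% at <=10 kPa, 38.5% at >25 kPa); the intermediate band is a
    hypothetical placeholder, not a clinically validated threshold."""
    if stiffness_kpa <= 10.0:
        return "low"
    elif stiffness_kpa <= 25.0:
        return "intermediate"
    return "high"

# Example measurements (synthetic values).
for kpa in (8.2, 17.5, 31.0):
    print(kpa, "kPa ->", hcc_risk_band(kpa))
```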
If more detailed examinations are needed prior to interventions, it is also possible to perform elastography using MRI, wherein the body is mechanically stimulated with low-frequency vibrations, and the resulting movement of underlying tissues is measured with the MRI signal (112). Although MR elastography is less available compared with US elastography, it does have the advantage of providing a larger field of view to encompass the entire abdomen during a single imaging session. Figure 6 shows examples of MR elastographic images from patients at low, intermediate, and high risk of cirrhosis and hence HCC.
Although the EDRN does not have CVCs or BDLs focused on liver cancer, the network has begun prospectively collecting samples as part of the Hepatocellular Carcinoma Early Detection Strategy Study (HEDS). The goal of these efforts is better risk stratification, which could help identify additional high-risk patients who need to be triaged to surveillance programs; identifying and improving the surveillance of patients at risk may help reduce mortality due to liver cancer.
Conclusions, Challenges, and Future Directions
Images are data. As described above, quantitative analyses of these data can yield highly predictive models based on a parsimonious set of informative imaging features. Conventional radiomics generates machine-learned models using supervised identification of ROIs and extraction of user-defined features. These have resulted in predictive models with very high accuracies. In nearly all cases, accuracy is improved by combining the radiomics model with orthogonal clinically derived data, such as serum markers, demographics, histopathology, and genomics. More recently, enabled by the availability of large, publicly available, annotated datasets, images are being analyzed using DL approaches. In DL, the computer can analyze the entire image or automatically detect the ROI, or an operator can focus the computer's attention on a general region via, for example, a bounding box. The neural network then identifies the image features that are most informative for classification. More detailed reviews on DL and AI in imaging are available elsewhere (3–6).
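The gain from combining radiomic features with orthogonal clinical data can be shown in a toy experiment. Below, purely simulated "radiomic" and "clinical" feature matrices (no real patient data; the outcome is constructed to depend on both) are fed to a logistic regression, and cross-validated AUC is compared with and without the clinical covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 200
radiomic = rng.normal(size=(n, 5))  # e.g., shape/texture features from ROIs
clinical = rng.normal(size=(n, 2))  # e.g., age, serum marker level

# Synthetic outcome driven by one radiomic and one clinical feature.
logit = radiomic[:, 0] + 0.8 * clinical[:, 1] + rng.normal(scale=0.5, size=n)
y = (logit > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
auc_radiomic = cross_val_score(model, radiomic, y,
                               cv=5, scoring="roc_auc").mean()
auc_combined = cross_val_score(model, np.hstack([radiomic, clinical]), y,
                               cv=5, scoring="roc_auc").mean()
print(f"radiomics only: {auc_radiomic:.2f}  combined: {auc_combined:.2f}")
```

Because the simulated outcome depends on a clinical covariate that the radiomic features cannot see, the combined model recovers higher AUC, mirroring the pattern reported for real radiomics studies.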
In either case, conventional or deep-learned radiomic training requires large amounts of well-curated data. Curation requires identification of clean outcomes, which are then used in the classifier tasks. This raises two challenges: very large datasets are not readily available, and outcomes data (e.g., cancer vs. noncancer; low grade vs. high grade) generally have to be manually defined and curated in a structured format. In cases where very large curated datasets have been generated, for example the NLST, the image acquisition protocols used at the time no longer represent current technology and clinical practice. Correcting for this presents its own challenges.
A future solution to most of these issues is the adoption of a vast distributed learning network, wherein institutions curate and retain their own data, and algorithms are shared to improve training, testing, and validation (11). A simplified version of this would be for journals to require deposition of code (whether compiled or raw) in an accessible repository so that promising models could be tested and replicated at different sites. Notably, two of the highest-profile recent papers used DL to predict lung nodule status from LDCT scans (36) and to detect breast cancers from mammography (49). In both cases, the authors refrained from sharing code, with the rationale that the "code used for training the models has a large number of dependencies on internal tooling, infrastructure and hardware, and its release is therefore not feasible." This is unacceptable, particularly if the training data were generated by public funds.
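The distributed learning idea can be sketched as federated averaging: each institution fits a model on its private data, and only the model weights (never patient records) are pooled. The simulation below is a minimal sketch with synthetic data and a hand-rolled logistic regression; it is not any network's actual protocol.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=20):
    """One institution trains locally (logistic regression via gradient
    descent) starting from the shared weights w; raw data never leaves."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(2)
true_w = np.array([1.5, -2.0, 0.5])

# Three institutions, each holding its own private (synthetic) dataset.
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = (X @ true_w + rng.normal(scale=0.3, size=100) > 0).astype(float)
    sites.append((X, y))

w_global = np.zeros(3)
for _ in range(10):  # federated averaging rounds
    local_ws = [local_update(w_global, X, y) for X, y in sites]
    # Only model weights cross institutional boundaries; they are averaged
    # to form the next shared model.
    w_global = np.mean(local_ws, axis=0)

print(np.round(w_global, 2))
```

After a few rounds, the pooled weights recover the direction of the true effect even though no site ever shared its data, which is the core appeal of the approach for multi-institutional imaging studies.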
Disclosure of Potential Conflicts of Interest
R.J. Gillies reports nonfinancial support from HealthMyne, Inc. during the conduct of the study, other from HealthMyne, Inc. (investor) outside the submitted work, as well as patents 9721340 and 9940709 issued (not licensed). No potential conflicts of interest were disclosed by the other author.
Acknowledgments
This work was funded in part by the NIH grant U01 CA200464 (to R.J. Gillies and M.B. Schabath).