Abstract
Immunotherapy by immune checkpoint inhibitors has become a standard treatment strategy for many types of solid tumors. However, the majority of patients with cancer will not respond, and predicting response to this therapy is still a challenge. Artificial intelligence (AI) methods can extract meaningful information from complex data, such as image data. In clinical routine, radiology or histopathology images are ubiquitously available. AI has been used to predict the response to immunotherapy from radiology or histopathology images, either directly or indirectly via surrogate markers. While none of these methods are currently used in clinical routine, academic and commercial developments are pointing toward potential clinical adoption in the near future. Here, we summarize the state of the art in AI-based image biomarkers for immunotherapy response based on radiology and histopathology images. We point out limitations, caveats, and pitfalls, including biases, generalizability, and explainability, which are relevant for researchers and health care providers alike, and outline key clinical use cases of this new class of predictive biomarkers.
Introduction
Response prediction in immunotherapy—one of the major challenges of oncology
The unprecedented results of immune checkpoint inhibitors (ICI) in patients with melanoma led to the extensive use of immunotherapy in many tumor types (1). However, while some patients achieve excellent responses, many others do not respond but can still suffer from serious toxicity. Therefore, accurate predictive biomarkers are needed to stratify patients more precisely for ICI, optimizing patient selection and minimizing undesirable side effects. Several molecular biomarkers for ICI response prediction have been introduced. These include microsatellite instability (MSI), the tumor mutational burden (TMB), the expression of programmed death ligand-1 (PD-L1), and the number of tumor-infiltrating lymphocytes (TIL; refs. 2, 3). However, the predictive performance of these biomarkers, whether used individually or in combination, is suboptimal: some tumors are resistant despite the presence of the biomarker, and others respond in its absence (4). Hence, there is a need for improved predictive biomarkers for supporting clinical decisions, ideally with high reproducibility, high accuracy, and low costs.
Artificial intelligence in medical imaging
Artificial intelligence (AI) is a branch of computer science that integrates different technologies able to replicate human brain functions such as learning and problem solving. Machine learning (ML) is a subset of AI techniques capable of teaching a computer to detect patterns in datasets. ML has been used in oncology for many years, for example to define gene signatures of treatment response (5). A particularly powerful method within ML is the artificial neural network (ANN). An ANN is an interconnected group of units that collectively can perform complex computations and is loosely modeled after biological neural networks. Multi-layered ANNs are particularly powerful, and using them in ML is called “Deep Learning” (DL). In the last decade, DL has provided strong performance gains in many scientific domains. One of its most common fields of application is computer-based image processing. In image processing, ANNs typically use a mathematical operation called convolution to reduce the raw pixel information to relevant concepts. ANNs relying on convolutions are called convolutional neural networks (CNN). For decades, teaching computer programs to automatically read and interpret digital images was a very hard task, but in the last decade, the use of CNNs, coupled with better hardware and larger data collections, has resulted in methodologic breakthroughs. Today, DL with CNNs is the state-of-the-art method for almost any image processing task in nonmedical and medical applications.
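To make the convolution operation concrete, the following minimal sketch slides a small filter over a toy image and records the filter response at each position. It is an illustration only: the function name `convolve2d`, the 4x4 image, and the hand-picked edge filter are assumptions introduced here, whereas in a real CNN the filter weights are learned from data.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    This is the cross-correlation form that deep learning libraries
    commonly refer to as "convolution".
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kernel_h):
                for b in range(kernel_w):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output


# Toy image (assumption): dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter (assumption) that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The resulting feature map is nonzero only where the filter overlaps the edge between the dark and bright halves, which is the sense in which a convolution reduces raw pixels to a more relevant concept.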
Medical applications of AI
As of 2022, dozens of AI-based systems have received regulatory approval to be used in clinical routine (6). Many of these systems are tools to analyze digital medical images, including radiology, histopathology, dermoscopy, and endoscopy images. In oncology, image data is abundant and instrumental for medical decisions: the suspicion of a malignant tumor is usually confirmed with radiology imaging, which also serves to determine the spread of the disease. The diagnosis and most components of the stage of a malignant tumor are usually obtained through histopathology, i.e., visual examination of a piece of tumor tissue under a microscope by an expert pathologist. Hence, for almost every single patient with a solid tumor, radiology and histopathology images are available. The central paradigm of image-based biomarkers in oncology is that these routinely available images contain much more information than is currently being used in clinical care and that this information can be extracted by AI.
Image-based biomarkers for cancer immunotherapy response prediction
In this work, we will review applications of AI to extract biomarkers for ICI response from radiology and histopathology images. The methods can be classified either as a “surrogate” strategy or an “end-to-end” strategy. The “surrogate” strategy means that AI is used to predict TMB, MSI status, TILs or PD-L1 expression. The “end-to-end” strategy means that AI is used to directly predict treatment response (Fig. 1). Together, these approaches have the potential to yield new diagnostic methods for better treatment decisions in clinical routine. However, before clinical adoption, it is important to discuss potential limitations and enable clinicians to understand and interpret the output of these systems.
Response Prediction from Radiology Imaging
Classical radiomics and DL radiomics
Radiology imaging modalities, such as CT, MRI, and PET, are powerful tools for cancer detection, characterization, and follow-up, providing a comprehensive view of the entire tumor repeatedly throughout the course of the disease. As early as 2014, ML was used to quantify tumor features in CT images that are linked to clinical outcome (7). This method has been termed “radiomics” (7). Classical radiomics follows a two-step approach: first, a quantifiable set of handcrafted features is derived from the imaging data; second, ML methods are trained to predict clinically relevant categories from these features. Handcrafted features provide information about the intensity, shape, and texture of the tumor phenotype. However, a handcrafted feature set limits the type of information that ML models can learn, potentially reducing performance. To overcome this, a more recent approach is “Deep Radiomics”, which uses DL (mostly CNNs) to predict a target category directly from image data, learning features and their combination in one go. CNNs can learn a larger number of features at different levels of abstraction (Fig. 2A). The extracted features are thereby selected and weighted on the basis of the task at hand, which provides more flexibility and can improve performance.
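The two-step logic of classical radiomics can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline of any cited study: the three features (mean intensity, intensity variance as a crude texture proxy, and region area as a crude shape proxy), the nearest-centroid classifier standing in for "an ML model", and the labels `"low"` and `"high"` are all hypothetical, and real pipelines use far richer feature sets plus feature scaling and selection.

```python
from statistics import mean, pvariance


def extract_features(image, mask):
    """Step 1: handcrafted features of the segmented tumor region."""
    voxels = [
        image[r][c]
        for r in range(len(image))
        for c in range(len(image[0]))
        if mask[r][c]
    ]
    # intensity, texture proxy, shape proxy (all illustrative choices)
    return [mean(voxels), pvariance(voxels), float(len(voxels))]


def fit_centroids(features, labels):
    """Step 2: fit a nearest-centroid classifier on the feature vectors."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, feature_vector):
    """Assign the label whose centroid is closest in feature space."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        centroids,
        key=lambda label: squared_distance(centroids[label], feature_vector),
    )


# Toy data (assumption): 2x2 "tumors", fully covered by the segmentation mask.
mask = [[1, 1], [1, 1]]
training_images = [
    [[5, 5], [5, 5]],  # homogeneous
    [[6, 6], [6, 6]],  # homogeneous
    [[1, 9], [9, 1]],  # heterogeneous
    [[2, 8], [8, 2]],  # heterogeneous
]
training_labels = ["low", "low", "high", "high"]  # hypothetical biomarker status

features = [extract_features(img, mask) for img in training_images]
centroids = fit_centroids(features, training_labels)

new_tumor = [[0, 10], [10, 0]]
prediction = predict(centroids, extract_features(new_tumor, mask))
print(prediction)
```

In this toy setting the classifier can only separate the classes along the dimensions that were handcrafted in step 1, which illustrates why a fixed feature set limits what the model can learn.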
Radiomics to predict surrogate biomarkers for immunotherapy response
Several studies have used AI to predict surrogate markers of immunotherapy response from radiology images. These markers include MSI (8–10), TMB (11–13), TILs (14–16), and PD-L1 (Table 1; refs. 17–19). The basic idea behind these studies is that the molecular properties of a tumor change its phenotype, which can be observed in radiology images (Fig. 2B). Many of these studies use classical radiomics to classify a discrete surrogate biomarker, such as microsatellite stable versus MSI-high, high versus low TMB, high versus low CD8-cell infiltration, or PD-L1 positive versus negative. These studies showed promising results: the area under the receiver operating characteristic curve (AUC) usually falls within the range of 0.70 to 0.90. Other studies have used DL methods, showing similar performance for MSI and PD-L1 prediction (17, 20). The ability to predict the status of such biomarkers from radiology images opens exciting opportunities for clinical practice. Compared with invasive tissue-based methods, imaging offers clear advantages: it is noninvasive, it allows changes in these biomarkers to be monitored during treatment, and it captures the whole tumor rather than just a small tissue sample. However, the performance of these imaging-based biomarkers is not perfect, and it remains to be shown whether it is sufficient for clinical decision-making.
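Because the AUC is the main performance metric quoted throughout, a short sketch of how it can be computed may help with interpretation. The AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auroc` and the six example scores are assumptions made for illustration.

```python
def auroc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    `labels` holds 1 for biomarker-positive and 0 for biomarker-negative
    cases; `scores` holds the model output for each case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    correctly_ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ranked += 1.0
            elif p == n:
                correctly_ranked += 0.5  # ties count as half
    return correctly_ranked / (len(positives) * len(negatives))


# Hypothetical scores for three MSI-high (1) and three microsatellite-stable (0) tumors.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(auroc(labels, scores), 3))
```

In this example, eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.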
Modality | Target | References | Summary | Data type | Tumor type | Technology | Validation AUROC
---|---|---|---|---|---|---|---
RADIOLOGY | Surrogate marker | (8–10, 20) | Prediction of MSI status | CT | Colorectal | Radiomics features + ML model | 0.73–0.89ᵃ
RADIOLOGY | Surrogate marker | (11–13) | Prediction of TMB | CT/MRI | Lung, endometrial, glioma | Radiomics features + ML model | 0.81–0.87ᵃ
RADIOLOGY | Surrogate marker | (14–16) | Prediction of TIL number | CT/MRI | Head and neck, lung, hepatocellular, bladder, endothelial | Radiomics features + ML model | 0.67–0.90ᵃ
RADIOLOGY | Surrogate marker | (17–19) | Prediction of PD-L1 expression | CT/PET | Lung, pancreatic | Radiomics features + ML model / CNN | 0.68–0.82ᵃ
RADIOLOGY | Direct response prediction | (23) | Direct prediction of OS under immunotherapy | CT | Melanoma | Radiomics features + RF | 0.92 (95% CI, 0.89–0.95)ᵇ
RADIOLOGY | Direct response prediction | (22) | Direct prediction of response to immunotherapy | CT | Multiple solid tumors | Radiomic features, Elastic-net, GLMM | 0.67 (95% CI, 0.58–0.76)ᵇ
RADIOLOGY | Direct response prediction | (21) | Direct prediction of response to immunotherapy | CT | Melanoma, lung | Radiomics features + RF | 0.64–0.83ᵃ
PATHOLOGY | Surrogate marker | (41–50) | Prediction of MSI status | H&E | Colorectal, gastric, endometrial | CNN | 0.60–0.92ᵃ
PATHOLOGY | Surrogate marker | (45, 61, 80) | Prediction of EBV or HPV status | H&E | Gastric, head and neck | CNN | 0.67–0.87ᵃ
PATHOLOGY | Surrogate marker | (51–54) | Prediction of TMB | H&E | Lung, urogenital, colorectal, gastric, head and neck, and others | CNN | 0.64–0.92ᵃ
PATHOLOGY | Surrogate marker | (57, 58, 81) | Prediction of immune-related gene expression signatures | H&E | Hepatocellular, colorectal | CNN, MIL | 0.81–0.92ᵃ
PATHOLOGY | Surrogate marker | (55, 56) | Prediction of PD-L1 status from H&E | H&E | Lung | CNN | 0.63–0.93ᵃ
PATHOLOGY | Surrogate marker | (37, 38) | Prediction of TIL number from H&E | H&E | Breast, oral | SSL, MIL, CNN | 0.88–0.89ᵃ
PATHOLOGY | Direct response prediction | (64, 65) | Direct prediction of response to immunotherapy | H&E | Lung, melanoma | CNN, GNN | 0.62–0.78ᵃ
IHC | Quantification of established marker | (29–33) | PD-L1 expression scoring assistance | IHC | Lung | GAN, CNN | 0.90–0.99
Note: Representative studies are referenced, and the list may not be exhaustive in all categories.
Abbreviations: AUROC, area under the receiver operating characteristic curve; CI, confidence interval; MIL, multiple instance learning; RF, random forest; FS, feature selection; GAN, generative adversarial network; CPS, combined positivity score; IHC, immunohistochemistry; H&E, hematoxylin and eosin.
ᵃ Whenever multiple studies are reported, the range of AUROCs from the validation/test sets of the studies is listed (lowest and highest value).
ᵇ Whenever only a single study is reported (or only a single study includes detailed performance statistics), the AUROC and 95% CI from the validation/test sets are listed as given in the publication.
Radiomics for end-to-end prediction of immunotherapy response
While “surrogate” approaches predict an established biomarker from image data, “end-to-end” approaches can directly predict immunotherapy response, as measured by the response evaluation criteria in solid tumors (RECIST), progression-free survival, or overall survival (OS) as endpoints (Table 1; refs. 21–23). Successful applications of this approach have been shown in melanoma (23), lung cancer (21, 22), and bladder cancer, among other tumor types, and are generally based on “classical” handcrafted radiomics, not DL radiomics. In general, these approaches have achieved good performance, with AUC values greater than 0.70 for predicting ICI response. Furthermore, in most studies, these predictive radiomics scores were significantly associated with surrogate biomarkers, an analysis intended to explain the biology underlying the radiomics phenotypes of response (21, 22). OS is a less commonly used endpoint because it depends strongly on other factors, such as previous treatments. However, it is the most relevant endpoint when developing models intended for clinical practice (24). In summary, end-to-end imaging biomarkers may play a key role in complementing surrogate biomarkers to improve the understanding of the tumor phenotype.
Response Prediction from Digital Histopathology Images
The rise of computational pathology
Histopathology images reflect properties of tumors that are associated with immunotherapy response. Compared with radiology, the spatial scale of histopathology is much finer, such that phenotypes at cellular and subcellular levels can be directly observed (Fig. 2B). Even routine tissue slides stained with hematoxylin and eosin (H&E) are sufficient to identify many different types of immune cells in the lymphoid and myeloid spectrum. A key disadvantage of histopathology, however, is that it requires invasive procedures. Also, while radiomics biomarkers can be recomputed in serial imaging during the course of the disease, histopathology image biomarkers are usually only measured in the initial tumor sample. In the last 5 to 10 years, AI approaches have been widely used to extract biomarkers from H&E slides of solid tumors (25). AI can extract diagnostic (26), prognostic (27), and predictive (28) information from H&E slides. This use of AI methods in pathology image analysis is referred to as “computational pathology”.
Quantification of IHC
One of the applications of DL methods is to facilitate the quantification of established biomarkers in IHC slides. For example, the expression level of the PD-L1 protein on tumor cells and immune cells (combined positivity score) is assessed by manual observation of IHC slides. Several studies have used DL to automate the subjective scoring of PD-L1 status (29–33). Similarly, the number of TILs in H&E or IHC slides is associated with survival and immunotherapy response in many tumor types (34, 35). As early as 2006, several years before DL was used as a tool in biomedicine, handcrafted image analysis pipelines with classical ML methods were used to count TILs (36). More recently, several studies have used DL for TIL quantification (37, 38). In addition to quantifying lymphocytes, ML methods have also been used to quantify other types of innate and lymphoid immune cells, demonstrating that cell counts are prognostic of survival in multiple tumor types (39, 40).
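As a worked example of the scoring step that such tools automate, the sketch below computes a combined positivity score from cell counts. It assumes the commonly used definition of the score (PD-L1-staining tumor cells, lymphocytes, and macrophages divided by the number of viable tumor cells, multiplied by 100 and capped at 100). The function name and the counts are hypothetical; in practice the counts would come from a pathologist or from a DL cell detector.

```python
def combined_positivity_score(
    positive_tumor_cells, positive_immune_cells, viable_tumor_cells
):
    """Combined positivity score from cell counts (assumed definition).

    positive_tumor_cells:  PD-L1-staining tumor cells
    positive_immune_cells: PD-L1-staining lymphocytes and macrophages
    viable_tumor_cells:    all viable tumor cells (the denominator)
    """
    if viable_tumor_cells <= 0:
        raise ValueError("viable_tumor_cells must be positive")
    score = 100.0 * (positive_tumor_cells + positive_immune_cells) / viable_tumor_cells
    return min(score, 100.0)  # the score is capped at 100


# Hypothetical counts from one slide region.
print(combined_positivity_score(30, 20, 200))  # 25.0
```

Because the numerator includes immune cells while the denominator counts only tumor cells, the raw ratio can exceed 100, which is why the cap is applied.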
Extracting molecular biomarkers from H&E slides
However, the abilities of DL are not limited to simple quantification tasks such as counting cells. DL can solve complex visual pattern recognition tasks and can piece together subtle visual cues related to cell numbers, cellular shapes and textures, relative cell positions, and phenotypes of connective tissue. For example, by training DL systems on raw histopathology slides, it is possible to infer the MSI status in colorectal, gastric, and endometrial cancer (41–50). This is particularly relevant for immunotherapy because MSI is a clinically approved biomarker to select patients for immunotherapy, independent of the tumor type (46). MSI prediction systems have also been made explainable: when they are queried for the visual features driving their classifications, immune cell–rich tumor regions are highlighted (50). Similarly, DL has been used to predict TMB directly from H&E in multiple major tumor types, including lung, breast, and colorectal cancer (51–54). PD-L1 expression levels have also been inferred directly from H&E, without the need for dedicated IHC (55, 56). In addition, the expression level of gene signatures that predict ICI response can be inferred by DL from H&E slides in hepatocellular carcinoma (57), lung cancer (58), and multiple other tumor types (48). This could help to solve some of the practical problems associated with widespread clinical use of these signatures. Likewise, DL can predict immunotherapy-sensitive tumor subtypes. For example, the luminal subtype of urothelial carcinoma, which conveys a better anti–PD-L1 response (59), has been predicted from H&E slides alone (60). Relatedly, virus positivity in head and neck and gastric cancer, which is associated with better ICI response, can be inferred from H&E slides with DL (45, 61). Finally, some mutations in clinically relevant driver genes have been shown to increase the probability of immunotherapy response, and these mutations can be inferred from H&E. A prominent example is the detection of mutant BRAF in melanoma (62), which predicts benefit from anti–PD-1 therapy (63). In summary, routinely available H&E slides of solid tumors seem to harbor a wealth of information, some of which is related to ICI response and can be extracted by DL.
Computational pathology for end-to-end prediction of immunotherapy response
All of the above-mentioned approaches use DL to predict a known biological marker from H&E images. An alternative strategy is to train DL directly on response or outcome data. This has been attempted by at least two studies (64, 65) that used CNNs or graph neural networks (GNN) to predict immunotherapy responses. These studies reported an AUC of 0.778 for prediction of responders in melanoma and an AUC of 0.69 for predicting response in lung cancer. Another report about the prediction of response from the morphology of cancer cell nuclei is publicly available (66). In general, such end-to-end prediction studies are quite difficult to perform for practical reasons: they require the DL system to be trained on clinical outcomes, ideally RECIST data. In practice, it is very difficult to collect a sufficient number of pathology tissue specimens of ICI-treated patients with matched response data. In addition, cancer immunotherapy is often administered in late lines of therapy, after patients have received one or more regimens of chemotherapy and other treatments. However, tissue for histopathology is typically acquired at the initial diagnosis, and re-biopsies at later time points are not commonly performed in most tumor types. This means that DL systems are trained on treatment-naïve tumor tissue, while patients might start immunotherapy only months later, after multiple previous lines of treatment have failed. This is a strong conceptual limitation that could only be resolved by repeated tissue sampling or by a move of immunotherapies towards earlier lines of treatment.
Limitations and Outlook
Key limitations
AI biomarkers have a number of conceptual limitations. The first limitation is data quality. AI models require large amounts of data to achieve accurate, generalizable results, and this data must also be of high quality (67). If a model is trained with noisy or artifact-laden images, many more cases will be necessary for the model to converge and achieve good performance. The second limitation is generalization. If the training data is not representative of real-world populations, DL models can fail to generalize. This is especially an issue in medical contexts, where data distributions vary markedly between different countries or even different hospitals. Without adequate precautions, such batch effects can inflate performance statistics (68). Mitigation strategies include training on diverse datasets (42) and augmenting data (69). The third limitation is bias. AI models can be biased, which means that their performance can depend on patient characteristics such as age, gender, or ethnicity (70). For AI models to be deployed in medicine, large-scale validation studies with predefined performance metrics are required to ensure model performance in the real world (26). The fourth limitation is the quality of the ground truth. When a model is developed using molecular biomarkers as a surrogate of response to immunotherapy, its performance is bounded by the predictive capacity of the established molecular biomarker. This motivates a clear definition of the ground truth for AI biomarker training, and also end-to-end training of biomarkers on clinical outcome data. In all of these efforts, data standardization and quality control systems are paramount before application in clinical routine. Best practice guidelines for such quality aspects are formalized and collected in the Equator Network (https://www.equator-network.org/), which also includes AI-specific guidelines such as STARD-AI (71) and TRIPOD-AI (72). In addition, for radiology, an Image Biomarker Standardization Initiative (73) and a radiomics quality score (74) have been reported with the aim of ensuring reproducible and trustworthy predictive models. Beyond these research-focused guidelines, the FDA has published a list of guiding principles to promote the safe, effective, and high-quality application of AI and ML in the medical field (75).
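Data augmentation, mentioned above as one mitigation strategy for poor generalization, can be illustrated with simple geometric transformations. This sketch is an assumption about what augmentation might look like in the simplest case and is not necessarily the method of the cited work. The function names and the 2x2 patch are hypothetical, and histopathology pipelines often add color or stain variations as well.

```python
def flip_horizontal(patch):
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]


def flip_vertical(patch):
    """Mirror the patch top to bottom."""
    return [row[:] for row in patch[::-1]]


def rotate_90_clockwise(patch):
    """Rotate the patch by 90 degrees clockwise."""
    return [list(column) for column in zip(*patch[::-1])]


def augment(patch):
    """Return the original patch plus three transformed copies."""
    return [
        [row[:] for row in patch],
        flip_horizontal(patch),
        flip_vertical(patch),
        rotate_90_clockwise(patch),
    ]


# Toy 2x2 image patch (assumption); real patches are much larger.
patch = [[1, 2], [3, 4]]
augmented = augment(patch)
for variant in augmented:
    print(variant)
```

Each transformed copy keeps the same label as the original patch, so the model sees more variation without any additional patients. Such transformations are only appropriate when orientation carries no diagnostic meaning, which is usually assumed for histology tiles but should be checked for radiology images.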
Multimodal models
One way to combine the benefits of radiology and histology, as well as other data types, would be to develop multimodal AI models, which can integrate multiple data types. Multimodality also helps with interpretability of the resulting models because many image features only make sense in the light of specific host factors, such as age, immune status, comorbidities or disease, and genomics. ML models have been used to extract immunotherapy biomarkers from non-image data such as serum profiles from liquid biopsies of patients with cancer (76–78). Combining such data with image data could further improve the predictive performance. On the technical side, transformer neural networks have achieved remarkable performance in nonmedical tasks, especially for combinations of different types of image data. However, evaluating such multimodal models exacerbates the practical problems associated with data collection: for most researchers at academic institutions, it is currently very difficult if not impossible to collect radiology, histopathology, and clinical data for the same set of patients. Systematic data collection in clinical trials could be a solution to this problem if the data is made available to researchers with low barriers.
Explainability
The increasing use of AI for developing immune surrogate biomarkers and its potential impact on clinical decision-making have raised the need for humans to understand these algorithms. This has been termed “Explainable AI” (XAI). Many classical ML algorithms, such as logistic regression or decision trees, are intrinsically explainable by their structure. However, models with higher complexity, such as DL models, tend to lack explainability. To address this trade-off between performance and explainability, several post hoc techniques have been developed for understanding the decision-making procedure of these so-called black-box models. XAI aims to comprehend model predictions and explain them in human-understandable terms. This could increase trust among all stakeholders: medical doctors and patients, who need to rely on fair decisions; regulatory agencies, for quality control; and developers, for improving the product.
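One example of a post hoc technique is occlusion sensitivity, chosen here purely for illustration because it treats the model as a black box. Regions of the image are masked one at a time, and the drop in the model output indicates how much the prediction relied on each region. In the sketch below, the function `occlusion_map`, the stand-in `toy_model`, and the 4x4 image are assumptions; any callable that maps an image to a score could be substituted.

```python
def occlusion_map(predict, image, patch_size=2, fill=0.0):
    """Score drop caused by masking each patch of the image.

    `predict` is any function mapping an image (list of rows) to a score.
    Larger values in the returned map mark regions the model relied on more.
    """
    baseline = predict(image)
    height, width = len(image), len(image[0])
    heatmap = [[0.0] * width for _ in range(height)]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = range(top, min(top + patch_size, height))
            cols = range(left, min(left + patch_size, width))
            occluded = [row[:] for row in image]  # copy, then mask one patch
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill
            drop = baseline - predict(occluded)
            for r in rows:
                for c in cols:
                    heatmap[r][c] = drop
    return heatmap


def toy_model(image):
    """Stand-in black box (assumption): only looks at the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heatmap = occlusion_map(toy_model, image)
for row in heatmap:
    print(row)
```

The heatmap is nonzero only in the top-left corner, correctly revealing the only region the toy model uses. For a real DL model, overlaying such a map on the slide or scan lets a clinician check whether the highlighted regions are biologically plausible.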
Regulatory approval
Routine clinical implementation of imaging AI-based tools should be driven by clear demonstration of clinical value and strict ethical and regulatory requirements. An added complexity for validating AI biomarkers, compared with other medical devices, is the ability of AI systems to learn from real-world data in real time. These evolving biomarkers require appropriately tailored regulatory frameworks. AI-based imaging biomarkers are tools embedded in software applications that are intended to be used, alone or in combination, for predicting or monitoring cancer response, and are therefore considered Medical Device Software. The framework to develop, clinically qualify, and implement a software tool as a medical device depends on the local regulations where the device is planned to be used. In the United States, the FDA has defined a framework to enable developers, users, and the Agency itself to evaluate and monitor Clinical Decision Support (CDS) software from its premarket development through post-market performance, accounting for its iterative nature while still ensuring continued evaluation of its safety and effectiveness. In Europe, CDS software should be compliant with the European Union Medical Devices Regulation (MDR 2017/745), which sets the standards of performance, quality, safety, and efficacy (79).
Embedding in routine workflows
Numerous constraints limit the implementation of AI-based tools in clinical practice, in particular the lack of clinical qualification and the complexity of conducting prospective clinical trials that evaluate biomarkers. Moreover, advances in the field are sometimes hampered by the reluctance of part of the medical community to embrace these technologies, owing to a lack of confidence among potential users and their presumed resistance to the heavier workload that high-throughput imaging analysis may involve. Importantly, the implementation of AI-based tools should not increase physician workload but should instead simplify radiologists’ and pathologists’ workflows and enable standardized data reporting. Therefore, efforts are necessary to integrate these software applications within the clinical routine analysis platforms of radiology and pathology departments. In this regard, all the stakeholders (i.e., manufacturers, researchers, clinicians, patients) involved in the development of imaging AI-based tools should work together to accelerate the validation and implementation of these tools in clinical routine and truly impact clinical practice. To be implemented in routine clinical practice, imaging AI-based tools must be accessible, user-friendly, rapid to compute, and able to promote equality in healthcare. Finally, their results must be considered as decision support to assist physicians, rather than a substitute for expert physician decision-making.
Authors' Disclosures
R. Perez-Lopez reports grants from AstraZeneca and Roche Pharma outside the submitted work. J.N. Kather reports personal fees from Owkin, Panakeia, MSD, Eisai, and Bayer outside the submitted work. No disclosures were reported by the other authors.
Acknowledgments
R. Perez-Lopez is supported by La Caixa Foundation, a CRIS Foundation Talent Award (TALENT19–05), the FERO Foundation, the Instituto de Salud Carlos III-Investigacion en Salud (PI18/01395 and PI21/01019), and the Prostate Cancer Foundation (18YOUN19). M. Ligero is supported by PERIS PIF-Salut Grant. J.N. Kather is supported by the German Federal Ministry of Health (DEEP LIVER, ZMVI1–2520DAT111) and the Max-Eder-Programme of the German Cancer Aid (grant #70113864).