Abstract
Intratumoral heterogeneity arising from tumor evolution poses significant biological and clinical challenges. Dissecting this complexity may benefit from deep learning (DL) algorithms, which can infer molecular features from ubiquitous hematoxylin and eosin (H&E)–stained tissue sections. Although DL algorithms have been developed to predict some driver mutations from H&E images, the ability of these DL algorithms to resolve intratumoral mutation heterogeneity at subclonal spatial resolution is unexplored. Here, we apply DL to a paradigm of intratumoral heterogeneity, clear cell renal cell carcinoma (ccRCC), the most common type of kidney cancer. Matched IHC and H&E images were leveraged to develop DL models for predicting intratumoral genetic heterogeneity of the three most frequently mutated ccRCC genes, BAP1, PBRM1, and SETD2. DL models were generated on a large cohort (N = 1,282) and tested on several independent cohorts, including a TCGA cohort (N = 363 patients) and two tissue microarray (TMA) cohorts (N = 118 and 365 patients). These models were also expanded to a patient-derived xenograft (PDX) TMA, affording analysis of homotopic and heterotopic interactions of tumor and stroma. The status of all three genes could be inferred by DL, with BAP1 showing the highest sensitivity and performance within and across tissue samples (AUC = 0.87–0.89 on holdout). BAP1 results were validated on independent human (AUC = 0.77–0.84) and PDX (AUC = 0.80) cohorts. Finally, BAP1 predictions correlated with clinical outcomes such as disease-specific survival. Overall, these data show that DL models can resolve intratumoral heterogeneity in cancer with potential diagnostic, prognostic, and biological implications.
This work demonstrates the potential for deep learning analysis of histopathologic images to serve as a fast, low-cost method to assess genetic intratumoral heterogeneity.
Introduction
Intratumor heterogeneity is a primary challenge in cancer. Indeed, metastasis and treatment resistance are often driven by mutations that are present in only a subpopulation of tumor cells (1–3) and may be missed by traditional bulk-sequencing methods. Multiregion sequencing studies spanning different tumor types (4–6) show that multiple areas of a tumor would need to be sequenced to reliably capture all “driver” events, likely rendering such an approach impractical in the clinic. In contrast, morphologic tumor heterogeneity is far easier to observe in the ubiquitous hematoxylin and eosin (H&E) stained images that are a staple of routine pathologic diagnosis. Surprisingly, the relationship between heterogeneity observed in genetics and morphology—two pillars of modern oncology—is poorly understood, and the ability of H&E images to aid in the deconvolution of genetic heterogeneity remains little studied. Moreover, recent studies suggest that morphology is not simply a rich downstream readout of signaling but may be contributing to it (7, 8). Thus, mechanistic insights and associated vulnerabilities may be uncovered by unraveling the link between morphologic phenotypes and the underlying genetics.
Recent approaches using deep learning (DL) to predict tumor genetics from H&E images have shown great promise. For example, DL models, both from pan-cancer (9, 10) and organ-specific cancer studies (11, 12), predicted the likelihood of genetic mutations in various cancer types including lung, colorectal, breast, and kidney cancer. Similarly, DL models categorized important biomarkers like estrogen receptor status in breast cancer (13), microsatellite instability in gastrointestinal cancer (14), and predicted associations with molecular transcriptomics in various cancer types (9, 10). Although the accuracy of these predictions does not necessarily reach practical utility, they show an association (based on high AUCs) between mutational events and specific morphological phenotypes, which we define here as any histologic property accessible from H&E-stained formalin-fixed paraffin-embedded (FFPE) tissue sections on glass slides. Other relevant studies have shown that DL models can be used to investigate and quantify tumor heterogeneity in terms of cell types (15), tissue architectures (16), molecular subtypes, and expression of specific (typically cell-type specific) genes (17).
Despite these advances, the ability of DL models to unravel genetic intratumor heterogeneity—a powerful, yet more complex application—is unexplored. Such investigation is hindered by the common use of modalities like bulk sequencing, which do not provide spatial assignation for genetic information. In addition, if DL did discover any genotype–morphology connections, it remains unclear how one would probe the underlying mechanisms. Ideally, one would study these relationships in model systems like patient-derived xenografts (PDX), which are more amenable to molecular manipulation. However, it is uncertain whether relationships identified in human patient samples survive the extensive perturbation (including tissue disassociation and replacement of microenvironment) needed to establish PDX models (18).
Motivated by these challenges, we investigate the link between genetic and morphologic heterogeneity in the context of clear cell renal cell carcinoma (ccRCC), the most common type of kidney cancer. We chose ccRCC as our model system for two reasons. First, ccRCC is a paradigm for intratumor heterogeneity, both in terms of genetics and morphology. With respect to genetics, the seminal multiregion sequencing experiments unequivocally connecting genetic heterogeneity to tumor evolution were performed in ccRCC (19). More recent studies have reinforced the widespread genetic heterogeneity in ccRCC: to capture 75% of variants, seven randomly sampled areas are needed on average (4). Similarly, concerning morphology, it has been well documented that individual ccRCC tumors frequently exhibit areas showing different tissue architectures, cytologic states, and microenvironmental features (20–24). Thus, even a single slide may contain multiple “clones” with different states of driver genes or morphologic phenotypes (Supplementary Figs. S1A–S1C). Recent DL studies analyzing data from The Cancer Genome Atlas (TCGA) have demonstrated the possibility of predicting specific driver mutations from H&E images (9–11), but the ability to use such analyses to predict intratumor heterogeneity is unexplored. Moreover, these studies failed to detect a relationship between morphology and mutations in the driver gene BAP1, whose impact on tumor grade (which in ccRCC is primarily based on nuclear morphology) has been established previously (25–27). The second reason for choosing ccRCC is that the main driver genes are two-hit tumor suppressors, and we can use loss of protein expression, which can be resolved at the cellular level using IHC assays, as a proxy for gene status. Specifically, we use validated IHC assays (with positive and negative predictive values >98%; refs. 26, 28, 29) to ascertain genetic heterogeneity (both for training and validation) at much higher spatial resolution than is possible using bulk sequencing. Importantly, these IHC assays likely capture other mechanisms of gene inactivation besides intragenic mutation, such as gene silencing (BAP1/PBRM1), and in the case of SETD2 focus on loss of activity (i.e., histone methylation; ref. 26). In this study, IHC assays will be used as surrogates for “gene loss.”
Leveraging these features of our ccRCC model system, we present the first investigation of H&E-based DL models to: (i) test genetic state predictions at a regional intratumor scale; and (ii) recapitulate these predictions in PDX models. Specifically, we focus on the three most common driver genes in ccRCC that we previously showed impact prognosis: BAP1, PBRM1, and SETD2 (4, 30–32). We attempt to predict the status of these genes from tissue morphology. For each gene, we trained models at two scales: (i) whole-slide level—to parallel previous bulk-sequencing approaches, we predict if the gene is lost anywhere in a slide; and (ii) region level—to resolve intra-tumor heterogeneity, we predict gene loss status of local regions within slides. A defining aspect of this study is the diversity of cohorts used, which allowed us to test the robustness of our models to an unprecedented degree. For training and internal testing of both our region and slide level models, we use a large well annotated cohort (WSI; refs. 28, 29, 33, 34) collected at the Mayo Clinic. For external validation of the slide level models, we use data from TCGA and for the region level models we employ two patient tissue microarray (TMA; refs. 25, 26, 35–39) cohorts collected at UTSW (TMA1 and TMA2). In addition, to interrogate whether the morphologic phenotypes are cell-intrinsic or driven by interactions with stroma we test our predictions on a PDX TMA cohort (PDX1; refs. 26, 40). As described below, we were able to predict gene loss status, specifically that of BAP1, across these cohorts. Our results suggest that DL based prediction of gene status from histopathologic slides has the robustness and spatial resolution to serve as a low-cost in silico method to resolve intratumoral heterogeneity in cancer.
Materials and Methods
The study was conducted with approval from the UT Southwestern Medical Center (UTSW) Institutional Review Board (STU 022013-052) and in concordance with the Health Insurance Portability and Accountability Act (HIPAA) guidelines.
Assays
This study makes use of tissue slides stained using: (i) H&E as a readout of morphology and (ii) IHC as a ground truth readout of genetic status. Our IHC staining made use of previously developed (26) antibodies against BAP1 (26, 28), PBRM1 (26), and H3K36me3 (29), which have been validated and accurately capture the functional status of BAP1, PBRM1, and SETD2, respectively. Note that although, for simplicity, we refer to SETD2 loss, this is in fact based on an assessment of H3K36me3, a histone modification catalyzed by SETD2.
IHC-based definition of genetics ground truth
IHC (for all three proteins) and H&E staining were performed on proximal tissue sections in the WSI, TMA1, TMA2, and PDX1 cohorts. Determinations of loss status made by pathologists on tumor regions of IHC slides were manually mapped to the corresponding H&E slides. Specifically, for each gene, whole slides were classified as either (i) wildtype (WT) if the entire tumor area displayed WT staining, (ii) Universal Loss (UL) if the entire tumor area displayed loss-like staining, or (iii) Localized Loss (LL) if the tumor region exhibited a mix of WT and loss-like staining (Supplementary Fig. S2A). TMA punches in contrast were simply classified as WT or Loss as little heterogeneity was observed in this limited area. For patient level scoring in the TMAs, if any of the punches for a patient displayed gene loss, the patient was considered to exhibit gene loss.
Datasets
In this study, we made use of four of our previously published (25, 26, 28, 29, 33–40) datasets with matched H&E and IHC slides and the TCGA KIRC dataset (41) with matching H&E slides and slide-level mutation calls from cBioPortal (42, 43). Unless explicitly noted, we only analyzed the H&E images from those studies, while using the previously generated pathologist (P. Kapur) scoring of the IHC slides (WSI and TMA cohorts) and cBioPortal mutation calls (TCGA) as ground truth. All five cohorts consisted solely of FFPE samples, which are standard in clinical practice and better preserve tissue morphology than frozen sections. Our four cohorts (WSI, TMA1, TMA2, PDX1) with matching IHCs were imaged using Aperio scanners. WSI and TMA1 were imaged at 20× magnification, whereas PDX1, TMA2, and TCGA were imaged at 40× (but were computationally down-sampled to 20× for analysis).
WSI
This cohort was used primarily as training data for our models. It consists of 1,362 H&E-stained whole-slide images of ccRCC tumors, each representing a single patient treated at the Mayo Clinic (34). As described above, based on previous IHC staining, slides were classified as WT, UL, or LL. In addition, in the case of the 19 BAP1 LL cases, a pathologist (V. Panwar) compared matching IHC and H&E staining and manually annotated the BAP1 Loss areas within the tumor regions of the slides. For quality control purposes, we excluded samples where H&E images were corrupted, our tumor model did not detect enough tumor area for patch generation (see below), or the IHC status for any of the three genes was unreliable. Of the 1,362 whole-slide images, 1,282 were used in this study.
TCGA
This cohort was used primarily as validation for our slide-level model. It consists of 383 H&E stained whole-slide images of ccRCC tumors, with each representing a single patient with genetic sequencing data from the Pan Cancer Atlas study (41). The mutational status of BAP1, PBRM1, and SETD2 in these samples was acquired from the “KIRC TCGA Pan Cancer Atlas” project on cBioPortal (RRID:SCR_014555). Samples with any mutation calls (missense/splice/truncating) for our genes of interest were considered to exhibit gene loss, whereas those lacking any mutation calls were treated as universal WT. The corresponding FFPE whole-slide images were downloaded directly from “NIH GDC Data Portal” under the “TCGA-KIRC” project by selecting only the samples that listed “Diagnostic Slide” as their experimental strategy. Of the 402 samples with mutation data, we only considered the 383 samples with matching diagnostic slides. For quality control purposes, we further excluded 20 samples where H&E images did not contain enough tumor tissue for slide-level predictions (see Slide Level Inference). Of the available 383 FFPE whole-slide images with mutation data, 363 were used in this study.
TMA1
This cohort was used primarily for validation of our region model and evaluation of clinical readouts. It was acquired from a study conducted at UTSW (25). For quality control purposes, the H&E image of each individual core was evaluated by a pathologist and cores containing less than 10% tumor cells or exhibiting tears, staining issues, or poor focus were discarded. In all, 257 TMA cores corresponding to 118 patients passed our quality check and were used in downstream analysis. We only evaluate the BAP1 prediction in this cohort. Nuclear grade was determined at a punch level by pathologist examination. Except when explicitly noted (e.g., patient level predictions), each core in the H&E-stained slide image was treated as an independent sample with corresponding cores in the matching BAP1 IHC slide used for defining BAP1 genotype.
TMA2
This cohort was used primarily for validation of our BAP1 region model and evaluation of clinical variables. Similar to TMA1, it was acquired from a study conducted at UTSW (36, 39). As with our other TMA cohorts, we only evaluated the BAP1 prediction. Key differences from the TMA1 cohort are that BAP1/grade status was available only at a per patient level, and imaging was performed at 40× (and synthetically down-sampled to 20×). Quality control was performed to preserve only high-quality ccRCC punches with both grade and survival information. In all, 679 TMA cores corresponding to 365 patients passed our quality check and were used in downstream analysis. We note that there were 23 patients whose tissue (albeit from different areas) was included both in TMA1 and TMA2.
PDX1
This cohort was used primarily for testing the performance in PDX tissue of our BAP1 region model trained on human patient tissue. It consists of tissue cores extracted from PDX models established at UTSW (40). Like TMA2, imaging was performed at 40×, which was then computationally down-sampled to 20×. As with TMA1, each core in an H&E stained slide was treated as an independent sample whose BAP1 status was determined previously based on a matching IHC slide (40). On the basis of the same eligibility criterion as TMA1, we were able to use 147 tissue cores corresponding to 46 patients.
Tumor region identification
To identify the parts of our H&E slide images that contained tumor cells, we developed a DL model that segmented images into six region classes (tumor, background, normal, fat, immune, and stroma). The ground truth for this model consisted of pathologist-annotated regions, belonging to one of these six classes, in H&E stained ccRCC whole-slide images. We then sampled 256 × 256 pixels image patches, at 5× magnification, from these annotated regions and trained a custom fully convolutional DL model to predict the region-class corresponding to the center pixel of each patch. This region classifier was then applied to the images from the WSI cohort and only areas classified as tumor were used for training our genetic classification models (Supplementary Fig. S2A–S2C). Similarly, inference on TCGA was restricted to areas classified as tumor. For TMA cohorts (TMA1, TMA2 and PDX1), we analyzed all tissue (i.e., not just tumor, identified by ignoring all “background” pixels) as the cores already contain a limited amount of tissue and were curated to focus on tumor regions.
Patch generation
Our genetics models were trained exclusively on patches from the WSI cohort. We aim to randomly sample 1,200 patches of 224 × 224 pixels (at 20× magnification) lying within the tumor area of WSI H&E images. For slides with limited tumor area, to avoid resampling the same tissue, we reduce the number of patches extracted such that the area covered by patches does not exceed roughly a third of the tumor area. For slide-level classification, the patches inherit the ground truth label (UL/LL as Loss, Wild Type as WT) of whole-slide genotype for the corresponding gene as determined from IHC slides as described above.
Data splits in WSI cohort
To allow us to evaluate the performance of the models in the WSI cohort, we split the slides (and correspondingly the patches extracted from them) into two groups: ∼60% of the slides (n = 775) were used in training and the remaining ∼40% (n = 507) for testing, with no slide and thus patient present in both groups. The training data were further divided using a three-fold stratified cross-validation scheme. The same splitting of data was used to train and test all three gene (BAP1, PBRM1, SETD2) models. In making these different splits of the data, we sought to ensure that each split contained a roughly equal proportion of slides assigned as WT, UL, or LL, for all three genes (Supplementary Tables S1 and S2). Each of the folds was used to train a slide-level classification model and a region-level classification model for each driver gene, resulting in three slide models and three region models for each gene. For both slide- and region-level models, the results of the models from the three folds were averaged (see description of Inference) before further analysis, essentially constituting a consensus model prediction.
Slide classification model
The goal of this model is to predict whether a slide contains any areas that exhibit loss (Supplementary Figs. S2 and S3). It aims to classify WT slides as WT, but LL and UL slides as Loss. As the majority of patches in an LL case could be WT, we make use of a multiple instance learning (MIL) framework (44), which makes predictions over a collection of patches rather than on individual patches. Because it is impractical to store a whole slide in memory, we instead use batches of data from a single slide and train a model to predict the genetic status of the batch in MIL fashion. A unique aspect of our MIL implementation is the use of an attention mechanism: rather than predict a fixed number of the most mutant patches (44), our network learns to pay more attention to informative patches. Patches of 224 × 224 pixels feed into a VGG backbone, which then splits into two arms, a feature extractor and an attention arm (Supplementary Fig. S4). We then combine information across all the patches by merging these two branches to calculate a single weighted (by attention outputs) mean output feature for the batch, which is ultimately used to predict whether the batch should be considered Loss or not (Supplementary Fig. S3).
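For illustration, the following is a minimal PyTorch sketch of attention-weighted MIL pooling over a bag of patches; the backbone variant, layer sizes, and pooling details are assumptions and do not reproduce the exact architecture used here (Supplementary Fig. S4).

```python
# Hedged sketch of an attention-weighted MIL bag classifier in PyTorch.
# Backbone choice, layer sizes, and pooling are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim=512, attn_dim=128):
        super().__init__()
        self.backbone = vgg16(weights="IMAGENET1K_V1").features  # patch feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(                 # attention arm
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        self.classifier = nn.Linear(feat_dim, 1)        # bag-level Loss/WT score

    def forward(self, patches):                         # patches: (N, 3, 224, 224)
        feats = self.pool(self.backbone(patches)).flatten(1)    # (N, feat_dim)
        weights = torch.softmax(self.attention(feats), dim=0)   # (N, 1), sums to 1
        bag_feat = (weights * feats).sum(dim=0)         # attention-weighted mean feature
        return torch.sigmoid(self.classifier(bag_feat)) # probability for the whole bag
```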
Training color augmentation
To increase the robustness of our models to staining variation, we subject each patch used in training to "HED" color augmentation (45), which randomly modifies the color of individual patches during training to prevent overfitting. Our HED augmentation splits each RGB image into channels corresponding to hematoxylin (H), eosin (E), and residual DAB (D). For each channel, it rescales the staining intensity across all pixels by a random factor (between 0.93 and 1.08, chosen separately for each channel), and then merges the channels back to produce an augmented RGB image.
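As a concrete illustration, this augmentation could be implemented roughly as follows with scikit-image's stain-separation utilities; the 0.93–1.08 range follows the text, while the function itself is an illustrative assumption.

```python
# Hedged sketch of HED color augmentation using scikit-image stain separation.
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_augment(patch_rgb, low=0.93, high=1.08, rng=None):
    """Randomly rescale the H, E, and residual D channels of an RGB patch in [0, 1]."""
    rng = rng or np.random.default_rng()
    hed = rgb2hed(patch_rgb)                  # split into H, E, D stain channels
    hed *= rng.uniform(low, high, size=3)     # independent random factor per channel
    return np.clip(hed2rgb(hed), 0.0, 1.0)    # merge back into an augmented RGB patch
```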
Slide classification model training
We train independent networks, based on the shared MIL architecture, for each of the three cross-validation folds for our three genes (BAP1, PBRM1, and SETD2). Each batch of training data contains randomly selected patches from a single (randomly selected) sample in the training cohort (augmented as described above). If the sample is UL or LL, the ground truth target is Loss, whereas if it is homogeneous wildtype the target is WT. We initialized a VGG backbone with weights pretrained on ImageNet (46), froze the first four layers after the input layer, and fine-tuned the remainder of the network on the training data. We used a class-weighted binary cross-entropy loss function and stochastic gradient descent (SGD) optimization, with a batch size of 48 patches, a learning rate of 0.00001, and a momentum of 0.9 for training the weights. All models were trained for 15 epochs.
Slide-level inference
Prediction for a slide was performed for both cohorts (WSI and TCGA) by (i) dividing the extracted tissue-region patches from the slide into batches of 128 patches, (ii) predicting the batch-level scores based on the pretrained model, and (iii) averaging the result across all batches and models from all three folds for the gene, that is, a consensus model prediction. We note that at least 128 patches are needed to perform inference on a slide in this way. Thus, samples with fewer patches, which based on our patch generation strategy corresponds to a tumor area of less than about 5 mm2, were not analyzed.
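A schematic of this consensus scoring is sketched below; the fold-model objects and their predict method are hypothetical placeholders.

```python
# Schematic of slide-level consensus inference: tumor patches are grouped into
# batches of 128, each batch is scored by each fold model, and scores are averaged.
import numpy as np

def slide_score(patches, fold_models, batch_size=128):
    n_batches = len(patches) // batch_size
    if n_batches == 0:
        return None                               # too little tumor (< ~5 mm^2); skip slide
    scores = []
    for b in range(n_batches):
        batch = patches[b * batch_size:(b + 1) * batch_size]
        for model in fold_models:                 # three cross-validation fold models
            scores.append(model.predict(batch))   # hypothetical batch-level activation in [0, 1]
    return float(np.mean(scores))                 # consensus prediction for the slide
```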
Receiver operating characteristic curve calculation
For each gene, we used the IHC/sequencing-based status (Loss or WT, with LL being treated as Loss) as ground truth and compared it with the average consensus activation score for the sample. We obtain 95% confidence estimates on the AUC by bootstrapping (n = 10,000), following past work (9).
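A sketch of this bootstrap procedure using scikit-learn is shown below; the resample count matches the text and the variable names are illustrative.

```python
# Sketch of the bootstrap 95% CI on the AUC (n = 10,000 resamples).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, scores, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, scores), (lo, hi)
```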
Region classification model data
We trained this model using patches from WT and UL cases for all genes as their genotypes could be assigned unambiguously. For BAP1, we additionally included patches from LL cases but assigned them labels based on the manual annotations made by a pathologist (by comparison to the corresponding IHC slide). For PBRM1 and SETD2, we did not possess such manual annotations and thus patches from LL cases were not used in training (this is unlikely to have affected performance: Loss/WT patches from LL cases would in any case have appeared independently during classification, LL cases represent a minority of the total, and they were not used for evaluation of PBRM1/SETD2). All regional models were trained on the same three-fold data splits as above.
Region classification model training
All regional classification models used the same network architecture: a VGG19 backbone with a custom final layer to produce a fully convolutional network predicting two classes, WT or Loss (Supplementary Table S3; Supplementary Fig. S2C). We initialized our model with weights pretrained on ImageNet (46), froze the first 4 layers after the input layer, and fine-tuned the remainder of the network with the same augmentation as the slide classification model. We used a class-weighted binary cross-entropy loss function and SGD optimization, with learning rate of 0.0001, and momentum of 0.9 for training the weights. All models were trained for 10 epochs. We trained a different model on each cross-validation fold of each gene (BAP1, PBRM1, SETD2), giving us nine models total.
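For illustration, a fully convolutional two-class classifier on a VGG19 trunk could be assembled roughly as below; the 1 × 1 convolution head is an assumption standing in for the custom final layer (Supplementary Table S3).

```python
# Hedged sketch of a fully convolutional WT-vs.-Loss region classifier on VGG19.
import torch.nn as nn
from torchvision.models import vgg19

class RegionClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = vgg19(weights="IMAGENET1K_V1").features   # convolutional VGG19 trunk
        self.head = nn.Sequential(                                 # 1x1 convolutions keep the
            nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(),         # model fully convolutional,
            nn.Conv2d(256, n_classes, kernel_size=1))              # so tile size can vary

    def forward(self, x):                   # x: (B, 3, 224, 224) during training
        return self.head(self.features(x))  # per-location WT/Loss logits
```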
Region-level predictions in whole-slide images
Inference was performed using a tessellation strategy (11, 14), where patches (224×224px) were staggered in steps of 100 px and a prediction was made on each patch. To generate whole-slide region-level predictions for a gene, we (i) profiled the H&E whole-slide images (by tessellation) using each of the three “fold” models for the gene to generate three activation maps, (ii) averaged these activation maps to form a single final “consensus” activation map for each gene, (iii) partitioned the final activation map into a 1 mm × 1 mm (TMA core size) grid, and (iv) averaged the activation for all grids containing at least 70% tumor tissue (the other grids are treated as background) to produce a region-level prediction map. The tumor locations in this map were treated as independent predictions, on which we measured our performance. The ground truth for each region was acquired by using the tumor masks created by the tumor classifier mentioned above to isolate the tumor regions and assign labels to each region based on corresponding IHC. For WT and UL cases, which were used to train all three genes, the label of all tumor regions was inherited from the assigned slide label. For LL cases, which were only used in BAP1 training/testing, the loss regions were manually annotated by a pathologist and the wildtype regions were determined by the nonloss tumor regions of the tumor mask. Receiver operating characteristic (ROC)–AUC analysis used the regional scores and ground truth for all the samples in the testing cohort.
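The grid-averaging step is sketched below under the assumption of ∼0.5 µm/pixel at 20× (so a 1 mm grid cell spans roughly 2,000 pixels); the resolution and names are assumptions.

```python
# Sketch of 1 mm x 1 mm grid averaging over a consensus activation map.
import numpy as np

def region_predictions(activation_map, tumor_mask, cell_px=2000, min_tumor_frac=0.7):
    """Return mean activation for each grid cell containing >= 70% tumor tissue."""
    h, w = activation_map.shape
    regions = {}
    for i in range(0, h - cell_px + 1, cell_px):
        for j in range(0, w - cell_px + 1, cell_px):
            cell_mask = tumor_mask[i:i + cell_px, j:j + cell_px] > 0
            if cell_mask.mean() < min_tumor_frac:          # treat cell as background
                continue
            cell_act = activation_map[i:i + cell_px, j:j + cell_px]
            regions[(i, j)] = float(cell_act[cell_mask].mean())
    return regions                                          # region-level prediction map
```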
Stain normalization
We performed slide-level stain normalization on our independent TMA cohorts to reduce the impact of staining variation, and to bring the staining intensities in the TMA cohort within the range present in our training data. This was achieved using an established stain normalization method (47) implemented in the StainTools package (https://github.com/Peter554/StainTools), which takes as input RGB values from pixels in source and target images and determines a transform such that the color distributions of the transformed source pixels match that of target pixels. This transformation is then used to normalize all pixels in the source image. We used a single target distribution for all slides, consisting of randomly sampled pixels from tumor patches in the WSI training cohort. For each TMA slide, our source distribution consisted of pixels randomly sampled from areas determined to be tissue (i.e., not in the background class) in that slide. A different transform was determined per slide and applied to all pixels in that slide.
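A minimal usage sketch of the StainTools package is shown below; the choice of the "vahadane" method and the image paths are assumptions, and our actual pipeline fits a per-slide transform from sampled tissue pixels rather than whole images.

```python
# Hedged sketch: normalize a TMA tile's staining toward a WSI-cohort target
# using StainTools (https://github.com/Peter554/StainTools).
import staintools

target = staintools.read_image("wsi_cohort_target_tile.png")   # hypothetical target tile
source = staintools.read_image("tma_slide_tile.png")           # hypothetical TMA tile

# Standardize brightness before fitting/transforming.
target = staintools.LuminosityStandardizer.standardize(target)
source = staintools.LuminosityStandardizer.standardize(source)

normalizer = staintools.StainNormalizer(method="vahadane")      # method choice is an assumption
normalizer.fit(target)                     # learn the target color distribution
normalized = normalizer.transform(source)  # map source colors into the target space
```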
Tests on independent cohorts
We tested the BAP1 region classification model on the TMA1, PDX1, and TMA2 cohorts described in the datasets section. The inference was performed similarly to that for the regional model section above with modifications to best use the limited TMA tissue. Specifically, for each model, instead of profiling on a uniform grid we used QuPath (https://qupath.github.io/) to identify boxes around each TMA core, and calculated the average activation across all non-background (identified as described above) pixels in the box. The final “consensus” activation score for each punch was found by averaging scores for each fold-model. We note that for the TMA1 and PDX1 cohorts, we had punch level BAP1 assignments, and thus the above punch-level activations and BAP1 status were compared. However, for the TMA2 cohort, BAP1 status was only available at a patient level (with each patient being represented by ∼2 punches on average). As a patient was considered to have BAP1 loss if any of the corresponding punches displayed loss, we calculated patient-level activation as the minimum (i.e., more BAP1-loss like) activation across all punches for the patient.
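The punch-to-patient aggregation rule can be illustrated with a small pandas sketch; column names and values are hypothetical.

```python
# Sketch of punch-to-patient aggregation for TMA2: a patient's score is the
# minimum (most BAP1-loss-like) activation across punches, mirroring the
# "loss if any punch shows loss" ground-truth rule.
import pandas as pd

punches = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P2", "P3"],
    "activation": [0.82, 0.35, 0.91, 0.88, 0.12],   # 0 = Loss-like, 1 = WT-like
})
patient_scores = punches.groupby("patient_id")["activation"].min()
print(patient_scores)  # compared against patient-level BAP1 IHC status
```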
Nuclear feature-based classification
We developed a U-Net-based model to identify the location and shape of individual nuclei in our H&E images, as shown previously (48). Ground truth for this model, an assignment of pixels as nucleus/edge/background, was based on annotations of the boundaries of all nuclei in 186 ∼1,000 × 1,000 pixel areas from 31 H&E slides at 20× magnification. The trained U-Net classifier was then used to classify pixels in whole slides, and individual nuclei were identified as connected components of pixels belonging to the “nucleus” class. As quality control, we dropped nuclei smaller than 30 pixels or larger than 5,000 pixels. For each image in the WSI cohort, for each nucleus in the tumor region, we extracted a library of 36 features quantifying aspects of nuclear size, shape, color (in different color spaces), and texture (Supplementary Table S4). We then calculated the median value of each feature across all the (tumor region) nuclei to produce a 36-dimensional profile vector for the sample. Using the WSI Training Cohort, we trained random forest classifiers to predict the mutational status of the slides based on their nuclear features (which we then evaluated on the WSI Testing cohort).
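As an illustration, per-nucleus features could be extracted and aggregated roughly as follows with scikit-image and scikit-learn; only three of the 36 features are shown, and all names beyond the 30/5,000-pixel QC limits are assumptions.

```python
# Illustrative sketch of per-nucleus feature extraction and slide-level
# random forest classification.
import numpy as np
from skimage.measure import regionprops
from sklearn.ensemble import RandomForestClassifier

def slide_profile(nuclei_labels, hematoxylin_img):
    """Median per-nucleus features over tumor-region nuclei in one slide."""
    feats = []
    for p in regionprops(nuclei_labels, intensity_image=hematoxylin_img):
        if p.area < 30 or p.area > 5000:              # drop implausibly small/large nuclei
            continue
        feats.append([p.area, p.eccentricity, p.mean_intensity])
    return np.median(np.array(feats), axis=0)

# profiles: (n_slides, n_features) array of slide profiles; labels: Loss/WT ground truth
# clf = RandomForestClassifier(n_estimators=500).fit(profiles, labels)
```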
Relationship to tumor grade
Two pathologists (V. Panwar and P. Kapur) independently reviewed each core in the human TMA, and then together assigned a consensus grade score (1–4) for each core in the TMA1 cohort, while grade was obtained at the patient level for the TMA2 cohort.
Survival prediction
Using the Lifelines python package, we performed a Cox proportional hazards model analysis relating our BAP1 model's continuous score (range 0–1) to disease-specific survival (based on the time between diagnosis and ccRCC-related death, or censored at last follow-up or death from other causes), and quantified performance in terms of P value, HR, and c-index. We also performed a similar analysis after stratifying patients based on the IHC assessment of their true BAP1 status. Finally, for visualization purposes we used a Kaplan–Meier curve to compare the IHC stratification to one based on predicted BAP1 status: patient predictions were considered Loss if at least one activation score across all cores for that patient was below 0.5, and WT if all activation scores for those cores were above 0.5.
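A minimal sketch of these analyses with the lifelines package is given below; the CSV file and the "time", "event", and "bap1_score" column names are hypothetical.

```python
# Hedged sketch of the Cox and Kaplan-Meier analyses with lifelines.
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

df = pd.read_csv("tma2_survival.csv")        # one row per patient (hypothetical file)

# Cox proportional hazards model on the continuous BAP1 activation score (0-1).
cph = CoxPHFitter()
cph.fit(df[["time", "event", "bap1_score"]], duration_col="time", event_col="event")
print(cph.summary[["exp(coef)", "p"]])       # HR and P value
print("c-index:", cph.concordance_index_)

# Kaplan-Meier curves after thresholding the score at 0.5 (Loss-like vs. WT-like).
kmf = KaplanMeierFitter()
for is_loss, grp in df.groupby(df["bap1_score"] < 0.5):
    kmf.fit(grp["time"], grp["event"], label="predicted Loss" if is_loss else "predicted WT")
    kmf.plot_survival_function()
```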
Data/code availability
Our H&E slide images for the WSI, TMA1, TMA2, and PDX1 cohorts are available on FigShare+ at https://doi.org/10.25452/figshare.plus.c.5983795. The WSI cohort can be found at https://doi.org/10.25452/figshare.plus.19310870 and the TMA cohorts can be found at https://doi.org/10.25452/figshare.plus.19324118. TCGA KIRC can be downloaded from the TCGA GDC portal, whereas the corresponding genetic assignments are available from cBioPortal. The DL models for tumor identification, nuclear segmentation, and gene status prediction (both at local and region level) are available at https://doi.org/10.25452/figshare.plus.19353830. The source code and metadata for UTSW cohorts are available on GitHub at https://github.com/Rajaram-Lab/cancres-2022-intratumoral-heterogeneity-dl-paper. A singularity container image to run the code/models is available on DockerHub at https://hub.docker.com/layers/srajaram/lab_img/v2.14/images/sha256-f307fdf489b509740758813abef603f931993ce9ea1979eb9d980853a4b52415.
Results
In this study, we attempt to predict the status of three commonly mutated genes in ccRCC, BAP1, PBRM1, and SETD2 (4, 30–32) from H&E images (Fig. 1A). For each gene, we trained models at two scales: (i) whole-slide level—to allow comparison to previous bulk-sequencing approaches, we predict if the gene is lost anywhere in a slide; and (ii) region level—to resolve intratumor heterogeneity, we predict gene loss status of local regions within slides (Fig. 1B; Supplementary Fig. S2). The characteristics of different cohorts used are summarized in Table 1 and detailed below.
| Cohort name | System | Assay | Intrasample gene loss | Institution | Samples | Notes |
|---|---|---|---|---|---|---|
| WSI (34) | Human | Slide | From IHC assays | Mayo Clinic | 1,282 WSIs (775 training, 507 holdout) | Subset of this cohort was used to train all gene models; remainder used as holdout to test performance of all gene models |
| TCGA (41) | Human | Slide | Unavailable | TCGA | 363 WSIs | Tissue slides with whole-exome sequencing for all three genes. Used for validation |
| TMA1 (25) | Human | TMA | From IHC assays | UTSW | 257 cores | Used to validate region-level BAP1 predictions on TMA cores |
| PDX1 (40) | PDX | TMA | From IHC assays | UTSW | 147 cores | Used to test BAP1 human-tissue–trained model performance in PDX TMAs |
| TMA2 (36) | Human | TMA | From IHC assays | UTSW | 365 patients (679 cores) | Used to relate patient-level BAP1 prediction to clinical readouts |
Note: Summary of the datasets used in this study. The model is trained on whole-slide images from the WSI cohort and its performance is benchmarked on data from TCGA and three TMA cohorts, two from human tumors and another from PDXs.
Our primary training cohort, the WSI cohort (Table 1), consists of 1,282 whole-slide H&E-stained images (each from a different ccRCC patient; refs. 28, 29, 33, 34) with spatially resolved gene loss status determined by a trained pathologist's inspection (Materials and Methods; Supplementary Fig. S2A) of serial sections stained with IHC antibodies against BAP1/PBRM1/H3K36me3 (SETD2). Consistent with expected frequencies of gene mutation, BAP1, PBRM1, and SETD2 inactivation was observed in approximately 12.40%, 52.18%, and 27.22% of slides, respectively. We split our samples into two separate cohorts (Fig. 2A), a training cohort that contains 60% of the slides (775 samples) and a testing cohort (WSI Testing) that contains the remaining 40% of the data (507 samples; Fig. 2A). All analysis of whole-slide images (which often contain areas of normal/stromal cells) was performed on tumor areas only (Supplementary Fig. S2A), which we identify in H&E WSIs using an independent model trained using pathologist annotation (Materials and Methods). From these tumor regions, we then extract 224px (∼111 μm) square patches at 20× magnification and train models to assign gene loss status based on the collection of these patches. Crucially, because of characteristic ccRCC intratumor heterogeneity, for a significant proportion of slides (∼8%–20% of cases exhibiting loss, depending on gene; Supplementary Table S1), the protein loss was localized to a focal region of the tumor area visualized on the whole slide (Supplementary Fig. S1). Thus, in generating our training data we distinguish between slides that exhibit WT (wildtype-like staining across whole slide), UL (Universal Loss: entire tumor area exhibiting gene loss), or LL (Localized Loss: subset of tumor area exhibiting gene loss) for each gene, which we ensure are spread approximately equally across our data splits (Supplementary Table S2).
DL can predict gene loss at whole-slide level
To enable comparison to published results, we first built a model that called (for example) a slide as BAP1 Loss if any part of the slide (even if in a minority of cells) exhibited this loss (Fig. 1B). Thus, all WT slides should be predicted as WT, but UL and LL slides should be predicted as Loss. To overcome the intraslide heterogeneity in gene loss status, we adopt a custom MIL approach (Fig. 2B; Supplementary Figs. S2B, S3, S4; Materials and Methods; ref. 44). This approach learns to pay more attention to loss-like phenotypes in a collection of patches and is thereby able to identify gene loss at the slide level, even if exhibited only in a minority of patches. We trained “consensus” predictors (i.e., average of multiple models based on the training data; Materials and Methods) to infer, from H&E images, the occurrence of BAP1, PBRM1, and SETD2 loss on whole slides. The output of each predictor is an “activation” score of gene function ranging from 0 (a confident Loss call) to 1 (a confident WT call). In cases where a hard Loss/WT call is needed, an activation below 0.5 is treated as a Loss (and above 0.5 as WT).
We quantified the performance of our classifier on the WSI testing cohort using ROC analysis of the output activation for slides relative to their ground truth gene loss status. We restricted our analysis to 489 (of 507) samples that had adequate tissue for reliable slide-level prediction (Materials and Methods). Following past work (Materials and Methods; ref. 9), we used a bootstrap-based estimation of 95% AUC confidence intervals, which are indicated in parentheses below. All three genes exhibited high AUCs, BAP1 = 0.87 (0.82–0.91), PBRM1 = 0.77 (0.73–0.81), and SETD2 = 0.71 (0.66–0.76; Supplementary Table S5), suggesting that in the hold-out portion of the training cohort (Fig. 2C), it is possible to predict gene loss status at the whole-slide level purely from tissue morphology.
Models trained on one cohort often perform poorly in cohorts that show significant differences in preparation (e.g., staining) and imaging (e.g., different microscopes) from the original training cohort (49). To validate our results, we used the independent KIRC TCGA Pan Cancer Atlas dataset (referred to simply as TCGA; ref. 41), which contains H&E-stained whole-slide images as well as mutational status for our genes of interest (Supplementary Table S1), albeit at the whole sample, rather than regional, level. To serve as ground truth, we used mutational status for BAP1/PBRM1/SETD2 from cBioPortal (42, 43) and all called mutations (missense/splice/truncating) for these genes were treated as representing gene loss for our analysis (Materials and Methods; excluding missense mutations did not have a significant impact on results: Fig. 2D vs. Supplementary Fig. S5A). We found that H&E-based features offered value for gene loss prediction for all three genes, with AUCs of 0.77 (0.70–0.84)/0.69 (0.63–0.76)/0.60 (0.51–0.68) (Supplementary Table S5) for BAP1, PBRM1, and SETD2, respectively. Interestingly, we were able to further improve the performance of the models by restricting predictions to low-grade tumors (Supplementary Figs. S5B–S5D; based on nuclear grade information available for TCGA). This is consistent with low-grade tumors being less evolved (with fewer mutations making it easier to identify the gene-specific phenotypes) and more genetically homogeneous.
Our model rarely misclassifies true BAP1 loss cases as WT [negative predictive value (NPV) of 0.94 on WSI Testing and 0.93 on TCGA], and as suggested by the flat top of the ROC curve, the high-confidence BAP1 WT predictions are especially reliable (Fig. 2C and D). For both cohorts, considering ∼60% of the samples deemed most likely to have BAP1 loss (Fig. 2E, top) captures 96% of samples that truly exhibit BAP1 loss (i.e., sensitivity = 0.96). Moreover, among the 33% of samples deemed most likely to be WT (Fig. 2E, bottom), none exhibited BAP1 loss in TCGA, and the two apparent mistakes in the top 33% (n = 167) in the WSI Testing cohort both exhibited BAP1 loss heterogeneity within the slide (suggesting a more conservative model may be able to do better). Taken together, our results suggest that our models for predicting gene loss status at the whole-slide level, especially for ruling out BAP1 loss, are particularly robust and generalizable.
DL can predict gene loss at regional level
We sought to build a framework to profile intratumor heterogeneity in H&E-stained images by localizing the gene loss calls to small intratumoral regions (Fig. 1B; as opposed to predicting if loss was exhibited anywhere on the slide as above). Here, we used a fixed region size of 1-mm-diameter as our unit of prediction. We reasoned that this size (i) strikes a good balance between capturing enough cells for a stable prediction, while still being small enough for low probability of genetic heterogeneity within a region and (ii) simplifies our validation process because it corresponds to the size of punches of our TMA cores.
For regional classification of gene loss, we built a more traditional convolutional neural network (Supplementary Fig. S2C), which sought to classify individual patches (rather than collections of patches for the MIL whole slide model) to the loss status they inherited from matching IHC regions. Similar to the whole-slide approach, we trained separate consensus predictors [by averaging multiple DL (46) models; Materials and Methods] for BAP1, PBRM1, and SETD2. To achieve TMA scale predictions, we first used the models to generate activation maps for the tumor portions of each slide (Fig. 3A and B; Supplementary Fig. S3), which we then divided into TMA-sized grid regions (Fig. 3C) and finally calculated region scores by averaging activation values within a grid square (Fig. 3D and E).
In this way, we were able to generate regional predictions within all our slides in the WSI Testing cohort (n = 507). To quantify performance, we compared the region-level activation scores to the ground truth for each punch-sized area. For all three genes, we achieved levels of performance (Fig. 3F) similar to their slide-level counterparts, with AUCs of 0.89 (0.88–0.89)/0.76 (0.75–0.77)/0.68 (0.68–0.69) for BAP1, PBRM1, and SETD2, respectively, suggesting that morphology is predictive of gene loss status even at the regional length scale (Fig. 3F). Notably, even at this higher spatial resolution, BAP1 Loss continues to have the strongest relationship with tissue morphology and, as discussed below, BAP1 model predictions are highly sensitive (Supplementary Fig. S6A). Given how much stronger the results are for BAP1 than PBRM1 or SETD2 (Fig. 3F), all subsequent analysis and validations were performed in the context of BAP1 alone.
We next considered the ability to deconvolute heterogeneity in slides that contain instances of both Loss and WT status (BAP1 LL cases). Figure 3 shows an example of how our prediction of BAP1 loss areas in a BAP1 LL case compares to the IHC ground truth (other examples in Supplementary Figs. S7A–S7E). Beyond a hard WT/Loss call, our model output can also be used to guide sequencing, for example to preferentially select areas likely to exhibit BAP1 loss. Among the LL cases not used in training, we sorted the different 1 mm × 1 mm areas based on likelihood of BAP1 loss and found that by picking either the highest or second highest ranked areas we would capture BAP1 loss better than 98.7% of random selections, while choosing the third most confident area would still be better than 86.6% of random selections (Supplementary Figs. S8A and S8B).
Regional model can predict BAP1 status in multiple independent TMAs
We tested the robustness of our BAP1 region model on independent cohorts that (i) consisted of samples not used in training and (ii) used TMAs rather than whole tissue slides. An additional challenge was that the stain color distribution differed from that of WSI and TCGA (Supplementary Figs. S9A–S9C), which we chose to address using slide-level color normalization (although the overall results are largely unaffected by this; Supplementary Fig. S9D). We found that the TMA punches were largely homogeneous and made up of tumor regions, obviating the need for filtering using a tumor classifier. We then used a region-level approach akin to Fig. 3 (as our regions were the same 1 mm size as the TMA punches) to produce a single gene loss activation score for each punch, which was then compared with known gene loss status (Materials and Methods).
Our first independent TMA cohort (TMA1) contained matched IHC and H&E data for 118 human patients with ccRCC assayed as 257 punches (25, 26). This is an ideal dataset for testing the robustness of our predictions because, in addition to differing from our training cohort in preparation (TMA vs. whole slides, generated and assayed at a different institution), the gene loss assignments are of high quality, having been validated by sequencing (26). Although a TMA does not exhibit the spatial heterogeneity of a whole slide, reliable predictions at a punch level will capture some of the genetic variation in a patient as each patient is represented by multiple TMA cores. Applying the BAP1 region classification models to the punches from TMA1 (Fig. 4A and B), we achieved an AUC of 0.77 (0.69–0.85), which, although expectedly lower than our training cohort AUC of 0.89 (0.88–0.89), still represents a well-performing classifier, especially considering the differences from the training cohort.
On both the WSI (NPV = 0.971) cohort and TMA1 (NPV = 0.975), our BAP1 region model is highly sensitive. In both cohorts, the top 25% of samples deemed most likely to be BAP1 WT contain no BAP1 loss cases, and it is possible to capture 96% of BAP1 loss cases by examining just the top 63% of samples deemed most likely to exhibit BAP1 loss (Supplementary Figs. S6A and S6B; see Supplementary Figs. S10A–S10D for a more detailed view of tradeoffs across cohorts). Comparable performance (AUC = 0.84) and similar sensitivity can also be observed in the second human cohort: TMA2 (Supplementary Figs. S6C, S10, and S11). Furthermore, as in our WSI cohort, in TMA1 we find that for patients exhibiting heterogeneity in BAP1 loss, representing the patient by the single punch deemed most likely to exhibit BAP1 Loss by our model does better than a similar random selection 98.9% of the time (Supplementary Fig. S8C). Taken together, these results demonstrate the robustness of our region-level BAP1 classification models and the high sensitivity of their predictions.
Next, we tested whether DL phenotypes learnt on patient tissue are also applicable to tissue from PDX models. Given the dramatic experimental differences between patient tissue and PDXs, this represents a severe test of the robustness of the gene models. Intriguingly, as the generation of a PDX tumor involves replacement of the stroma by mouse tissue (40), the extent to which BAP1 status can be successfully predicted in PDXs highlights the tumor-intrinsic (as opposed to microenvironmentally driven) nature of these tissue phenotypes. We tested our models by applying the same approach we used for TMA1 to 147 punches in the PDX1 cohort from 46 PDX models established at UTSW. Despite the significant differences from the training cohort, which consists of human patient tissue, we were able to achieve performance comparable with TMA1 with AUC of 0.80 (0.69–0.90) in PDX1 (Fig. 4B and C; Supplementary Fig. S6D). These data show that the DL model does not rely on homotypic interactions between tumor and stromal cells, and that the features are preserved after tumor transplantation into a rodent. The ability to achieve such high performance using human trained models suggests that analysis of tissue phenotypes in PDX models could potentially inform future human studies.
Different features predict BAP1, PBRM1, and SETD2 loss
To better understand our models, we first sought to test whether morphology changes were gene-specific or were shared across driver mutations (e.g., reflecting overall tumor progression, which can occur through a variety of mutations). If the latter were true, one might expect that when one gene model (e.g., BAP1) makes a Loss prediction that would increase the likelihood of a different gene model (e.g., PBRM1) also making a Loss prediction on that sample. However, our results indicate that in fact, the likelihoods for BAP1 and PBRM1 loss perceived by our models are negatively correlated (Spearman correlation coefficient = −0.326, P value = 6.73e−13; Fig. 5A, bottom). These results suggest that the features used to predict BAP1 and PBRM1 are distinct, and align with previous studies indicating that mutations in BAP1 and PBRM1 are typically mutually exclusive in ccRCC (26). Moreover, PBRM1 status does not impact our model's perception of BAP1 phenotypes (Fig. 5A, top; gray vs. blue curves): both PBRM1 WT and Loss samples show similar distributions of BAP1 mutation likelihood. Similar results show that BAP1 is only weakly correlated with SETD2 classification status (Supplementary Fig. S12A). In contrast, the correlation between PBRM1 and SETD2 is consistent with previously reported co-occurrence of their loss (Supplementary Fig. S12B; refs. 26, 32). Taken together, predicted co-occurrence appears to mirror true co-occurrence of gene loss, suggesting that our models can identify gene-specific tissue phenotypes.
We next sought to identify the morphologic features that were most impacted by BAP1 and PBRM1 mutations. We focused primarily on nuclear phenotypes based on 36 nuclear features (including quantification of nuclear shape, size, and hematoxylin texture) from all tumor nuclei in the WSI training cohort samples (see Materials and Methods; Supplementary Fig. S13; Supplementary Table S4). Random forest classifiers trained using these features were able to classify BAP1 mutational status (Materials and Methods) with an AUC of 0.72, suggesting that nuclear features can contribute to identification of BAP1 loss (Supplementary Fig. S14A). By analyzing the differences in feature values between Loss and WT for BAP1 (Fig. 5B; Supplementary Figs. S15A–S15I), we were able to determine that BAP1 mutant tumors typically had larger nuclei and lighter nuclear hematoxylin staining than their WT counterparts (also see Supplementary Figs. S15A and S15D), consistent with previous reports (26, 27). These features were significantly less affected by PBRM1 and SETD2 Loss (Fig. 5B; Supplementary Figs. S14B and S14C, S16A–S16I, S17A–S17I).
BAP1 predictions correlate with clinical prognostic factors
Next, we sought to investigate the relationship between our BAP1 model predictions and known clinical prognostic parameters. For these experiments, we used TMA2, which, while overlapping with TMA1 (23 of the 118 patients in TMA1 are shared), contains a larger number of patients (N = 365) with relevant clinical annotations. Because BAP1 mutations are known to be found largely in high-grade tumors (25), we chose to investigate how much value our classifier offers beyond nuclear tumor grade. As expected, higher-grade tissue is more likely to truly exhibit BAP1 loss, and this is mirrored in the confidence of the BAP1 loss prediction. However, even within low- and high-grade tumors, samples with BAP1 loss are predicted as BAP1 loss with higher confidence than WT samples. These results are also supported by the TMA1 cohort, where grade assessments were made at a punch (rather than patient) level, and within individual grades rather than just low/high (Supplementary Figs. S18A and S18B). Similarly, using grade alone to predict BAP1 status would give poorer performance (AUC of 0.71, 0.63–0.79, on TMA1) than our model. Taken together, these results indicate that, although the phenotypes identified by our BAP1 classifier likely share some commonalities with grade, they capture BAP1 signature information that goes beyond grade.
Finally, because previous studies have shown that BAP1 mutations may influence patient outcome (25), we wanted to investigate the impact of our BAP1 predictions on survival. We performed a Cox proportional hazards regression in TMA2 using our predicted BAP1 likelihood as a continuous variable and found that it significantly contributes to prediction of disease-specific survival (P = 0.0087, HR = 5.43, c-index = 0.65). Interestingly, we found our prediction outperformed stratification based on IHC-based BAP1 status (c-index = 0.55). For visualization purposes, we additionally stratified patients in TMA2 by whether their activation was greater than 0.5 (more likely to be WT) or less (more likely BAP1 loss) and compared the disease-specific survival to stratification based on IHC-based determination of BAP1 status (Fig. 5D). On the basis of this 0.5 cutoff, the BAP1 classifier-based stratified survival curves showed an improved HR (HR = 2.53 vs. 1.8) and P value (0.0027 vs. 0.11) compared with stratification based on BAP1 status determined by IHC.
Discussion
In this study, we investigated the ability of DL to analyze H&E images to predict functional status of driver genes in ccRCC along two largely unexplored axes. First, we explored the effect of spatial length scale (slide vs. region) and whether genetic predictions can be made with high confidence at the whole-slide level, as well as at subclonal regions, within tumors. The use of IHCs as ground truth is a unique aspect of our study enabling spatial resolution far surpassing previous sequencing-based approaches. Second, we test the robustness of our models across cohorts that are increasingly divergent from the initial training set, including the first test of human-tissue-trained models on PDX samples. Our results indicate that the status of all three genes evaluated, BAP1, PBRM1, and SETD2, can be gleaned on the basis of morphology. A particularly robust relationship was observed with BAP1.
At the whole-slide level, we report AUCs of 0.87/0.77/0.71 for BAP1, PBRM1, and SETD2, respectively, on the held-out portion of the WSI cohort. These differences likely reflect the fact that BAP1 loss has a large impact on tissue morphology (evidenced by its greater impact on nuclear grade; refs. 25–27), and although differences exist among the three IHC assays, they all had very high positive and negative predictive values (26, 29). In addition, in ccRCCs, PBRM1/SETD2 loss is often associated with increased intratumor heterogeneity (ITH; e.g., in the "PBRM1→SETD2 evolutionary subtype"), in contrast to "BAP1 driven" tumors, which have low ITH with clonal expansion (4), potentially making it harder to identify their specific morphologic signatures. Our performance for all three genes was reduced when the classifier was applied to independent slides from TCGA (AUC = 0.77/0.69/0.60). However, this is not unexpected for several reasons. First, whereas our models were trained on IHC calls easily co-registered with H&E sections and providing superb accuracy, gene loss calls for TCGA were performed on samples spatially disjoint from those stained for H&E analyses. Second, factors other than mutations may result in lack of protein expression. We have previously reported that samples lacking a mutation where IHC showed BAP1 loss did not express BAP1 protein on Western blot analysis (26). Finally, the bulk-sequencing-based calling of mutations performed by TCGA is prone to missing subclonal events, represented by a small number of cells and therefore reads, that could be picked up by our assays. Consistent with these ideas, for the BAP1 gene, high-confidence WT predictions seem to be always correct (NPV: WSI = 0.94, TCGA = 0.93), and it is the lower rate at which sequencing confirms high-confidence BAP1 loss predictions (PPV: WSI = 0.44, TCGA = 0.25) in TCGA that seems to drive the drop in AUC.
Two previous pan-cancer studies based on FFPE (9) and frozen (10) tumor sections as well as a ccRCC-specific study (12) attempted to predict mutation in our genes of interest from H&E images. These studies trained and tested their models on the TCGA KIRC cohort and may be expected to outperform our DL model on this dataset. However, despite training on a different dataset our DL model rivaled their performance with reported AUCs for PBRM1 of 0.59(10)–0.63(9) and SETD2 0.39(10)–0.65(9). The difference is even more striking for BAP1, where our model shows an AUC of 0.77 versus AUC = 0.65 (10). We suspect that our higher scores are probably due to the larger number of examples of BAP1 loss in our cohort (n = 159) compared with the TCGA (n ≈ 33), although our IHC assays that can capture other mechanisms of protein loss with high reliability (26, 29) may also have contributed. It is observed in the field that predictions made on the held-out part of the training cohort are often significantly better than those on an independent cohort. Among mutation predictions from H&E images that have been validated on an independent cohort, our BAP1 model with an AUC of 0.87/0.77 (holdout/independent) is among the strongest. It compares favorably to scores of 0.83/0.75 for EGFR in lung cancer (11), 0.83/0.63 for TP53 in breast cancer (10) and 0.72/0.67 for BRAF in colorectal cancer (9). Similarly, our region-level BAP1 model showed consistent performance across multiple cohorts (AUCs = 0.89 in WSI holdout vs. 0.77 in TMA1 and 0.84 in TMA2). This speaks to the robustness of the BAP1–morphology relationship.
The strong and robust morphology–genetics connection we detect for BAP1 has both practical and conceptual implications. Practically, our slide-level BAP1 score distribution may provide high enough sensitivity to be confidently actionable, although validation on additional cohorts is needed. Studies in other cancers have reported similarly high AUC scores, but our ability to confidently rule out BAP1 loss in a large number of samples has been rivaled only in much smaller cohorts (50). Moreover, the continued performance of the model at shorter length scales (regions of 1 mm diameter) suggests that these predictions can help guide studies dissecting intratumor heterogeneity. To capture 75% of all driver mutations in ccRCC, multiregional sequencing of seven randomly selected regions is needed (4). Because our models can rule out BAP1 loss with high sensitivity, they may help identify areas for sequencing, thereby increasing efficiency. From a more conceptual perspective, our results emphasize the scientific opportunities in studying the strong connection between genetics and morphology. Despite the presence of many other mutations, and no known mechanistic link between BAP1 and morphology, the effect of BAP1 stood out. The fact that this link is recapitulated in PDX models, where the stroma is replaced by mouse tissue (40), shows that the effect is likely independent of homotypic tumor–microenvironment interactions. Moreover, the strength of the association is such that it may be possible to design experiments to understand how a deubiquitinase such as BAP1 ends up producing such reproducible morphologic phenotypes.
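A minimal sketch of how such sequencing triage might work is shown below (illustrative only; the threshold-selection rule, function names, and values are our assumptions, not part of the study): a score cutoff is calibrated for high sensitivity to BAP1 loss, and only regions that cannot be confidently ruled out are flagged for sequencing.

```python
# Illustrative sketch: prioritizing 1-mm regions for multiregion sequencing
# using a region-level BAP1-loss score calibrated to a target sensitivity.
import numpy as np

def sensitivity_threshold(cal_scores, cal_labels, target_sensitivity=0.95):
    """Largest score threshold that still recalls >= target_sensitivity of loss regions."""
    loss_scores = np.sort(np.asarray(cal_scores)[np.asarray(cal_labels) == 1])
    # Allow at most a (1 - target) fraction of loss regions to fall below the threshold.
    idx = int(np.floor((1 - target_sensitivity) * len(loss_scores)))
    return loss_scores[idx]

def triage_regions(region_scores, threshold):
    """Indices of regions in which BAP1 loss cannot be confidently ruled out."""
    return np.flatnonzero(np.asarray(region_scores) >= threshold)

# Hypothetical usage: calibrate on annotated regions, then rank a new tumor's regions.
cal_scores = [0.10, 0.15, 0.40, 0.70, 0.85, 0.90]
cal_labels = [0, 0, 0, 1, 1, 1]
thr = sensitivity_threshold(cal_scores, cal_labels)
print(triage_regions([0.05, 0.20, 0.65, 0.92], thr))  # -> regions worth sequencing
```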
Our results highlight the association of BAP1 loss with larger nuclei and open chromatin (lighter nuclear hematoxylin staining than in their WT counterparts), an association not observed in PBRM1- and SETD2-deficient tumors. These findings are consistent with our prior work demonstrating that BAP1 and PBRM1 drive ccRCC histologic grade (25, 26). We have shown (30) that targeting Bap1 and Pbrm1 in the mouse induces tumors of different nuclear grades, with Bap1-deficient tumors being high grade, a finding consistently observed in human ccRCCs. Our DL model, which was not privy to such information, appears to have rediscovered the relationship between BAP1 and aggressiveness (as measured by grade), yet it also seems able to make predictions within a given tumor grade. Surprisingly, our BAP1 classifier was predictive of disease-specific survival, apparently more so than BAP1 status as measured by IHC. Although we cannot rule out that grade-associated phenotypes picked up by the current BAP1 model influence these predictions, future models that explicitly correct for grade effects could help clarify the interactions among BAP1 loss, BAP1-associated morphologic phenotypes, and prognosis.
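For illustration, a nuclear phenotype of this kind could be quantified along the following lines (a sketch under our own assumptions, not the study's image-analysis pipeline), by measuring per-nucleus area and mean hematoxylin optical density given a precomputed nuclear segmentation mask.

```python
# Sketch: per-nucleus area and hematoxylin intensity from an H&E tile,
# assuming a labeled nuclear segmentation mask is already available.
import numpy as np
from skimage.color import rgb2hed
from skimage.measure import regionprops

def nuclear_features(rgb_tile, nuclei_labels):
    """Return an (N, 2) array of (area_px, mean_hematoxylin_OD) per segmented nucleus."""
    hematoxylin = rgb2hed(rgb_tile)[..., 0]  # hematoxylin channel via color deconvolution
    feats = [(r.area, r.mean_intensity)
             for r in regionprops(nuclei_labels, intensity_image=hematoxylin)]
    return np.array(feats)

# Comparing these distributions between predicted BAP1-loss and WT regions would be
# expected to show larger nuclear areas and lower hematoxylin OD (lighter staining)
# in the BAP1-loss group.
```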
These findings also raise questions and offer avenues for improvement in future studies. Despite the best-in-class AUC for BAP1 prediction, our model is not ready to replace IHC. Our AUC scores are high (0.87 and 0.89 at the slide and region level, respectively), but BAP1 mutations are relatively rare (∼10%–15% of cases), and at this prevalence the scores translate to a precision of only approximately 44%. There are multiple potential explanations for these false positives. They may reflect a limitation of our training process, in which case an even larger training cohort with a broader range of BAP1-mutant examples (our current model is trained on ∼98 BAP1 loss cases) would perform better, as has been demonstrated for metastasis detection (49). A second possibility is that, because morphology is determined by several other factors, including other mutations, (epi)genomics, and environmental factors (7, 8), we have hit an information threshold on how much morphology can tell us about the functional status of BAP1. Perhaps the most intriguing possibility is that the morphologic signature we identify corresponds to a deeper cellular state, of which BAP1 mutation is simply one means of attainment. In other words, there may be mechanisms that deregulate the BAP1 pathway other than BAP1 loss. Representing an endpoint of multiple cellular events, morphology could thus be a better indicator of survival than BAP1 status.
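A back-of-the-envelope calculation illustrates why low prevalence caps precision even for an accurate classifier; the sensitivity and specificity below are hypothetical operating points chosen for illustration, not values from the study.

```python
# Sketch of the prevalence effect: PPV = sens*prev / (sens*prev + (1-spec)*(1-prev)).
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# With a hypothetical operating point of 90% sensitivity and 85% specificity,
# a 12% prevalence of BAP1 loss yields a precision of roughly 45%.
print(round(ppv(0.90, 0.85, 0.12), 2))
```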
This study highlights the convergence of the two pillars of cancer diagnosis: histopathology and molecular genetics. Until recently, it was unclear how morphology correlated with genomics. Multiple studies now demonstrate a connection between morphology and particular gene mutations (9–11, 13, 14). These links are detectable using DL, and our results show that the approach can also be used to profile tumor subclones. Intratumor heterogeneity has been a bane of biomarker studies. It is challenging to capture with sequencing-based approaches, which are also expensive, yet heterogeneity is readily appreciable by visual inspection. Provocatively, it has been proposed that tissue morphology may not simply reflect a downstream readout of cellular state but may actively modulate signaling (7, 8). This emerging field has primarily focused on phenotypes of dissociated single cells, but our results suggest that the links between morphology and molecular state in tissues are robust enough to be studied in more tractable model systems. Taken together, it is apparent that machine learning will have a central role in integrating tumor morphology and molecular biology, a key area of future research. Furthermore, as our data show, machine learning has the potential to provide important diagnostic and prognostic cues. Finally, inasmuch as mutations such as those in BAP1 may also predict therapeutic response, DL models may ultimately influence therapy selection.
Authors' Disclosures
J. Brugarolas reports personal fees from Eisai, Johnson & Johnson, Exelixis, Calithera, and Arrowhead outside the submitted work. No disclosures were reported by the other authors.
Authors' Contributions
P.H. Acosta: Conceptualization, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. V. Panwar: Data curation, validation, investigation, methodology, writing–review and editing. V. Jarmale: Software, formal analysis, validation, visualization, methodology. A. Christie: Data curation, formal analysis, methodology, writing–review and editing. J. Jasti: Data curation, software, validation, methodology. V. Margulis: Resources, data curation, writing–review and editing. D. Rakheja: Resources, data curation, investigation, writing–review and editing. J. Cheville: Resources, data curation, investigation, writing–review and editing. B.C. Leibovich: Resources, data curation, investigation. A. Parker: Resources, data curation, investigation, writing–review and editing. J. Brugarolas: Conceptualization, resources, data curation, funding acquisition, investigation, methodology, project administration, writing–review and editing. P. Kapur: Conceptualization, resources, data curation, formal analysis, supervision, funding acquisition, validation, investigation, methodology, writing–original draft, project administration, writing–review and editing. S. Rajaram: Conceptualization, resources, data curation, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
The authors thank all of the patients who provided tissues that enabled this research project. J. Brugarolas, A. Christie, D. Rakheja, and P. Kapur are supported by NIH (Specialized Program in Research Excellence in Kidney Cancer P50 CA196516). J. Brugarolas, A. Christie, and P. Kapur are supported by the Cancer Prevention and Research Institute of Texas (CPRIT; RP180192). J. Brugarolas and A. Christie are also supported by CPRIT (RP180191). P. Kapur was supported by NIH (R01CA244579, R01CA154475, and R01DK115986), DOD (W81XWH1910710), and CPRIT (RP200233). S. Rajaram was supported by startup funds provided through the Lyda Hill Department of Bioinformatics.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.