Abstract
Virtual staining for digital pathology has great potential to enable spatial biology research, improve efficiency and reliability in the clinical workflow, as well as conserve tissue samples in a nondestructive manner. In this study, we demonstrate the feasibility of generating virtual stains for hematoxylin and eosin (H&E) and a multiplex immunofluorescence (mIF) immuno-oncology panel (DAPI, PanCK, PD-L1, CD3, and CD8) from autofluorescence (AF) images of unstained non–small cell lung cancer tissue by combining high-throughput hyperspectral fluorescence microscopy and machine learning. Using domain-specific computational methods, we evaluated the accuracy of virtual H&E staining for histologic subtyping and virtual mIF for cell segmentation–based measurements, including clinically relevant measurements such as tumor area, T-cell density, and PD-L1 expression (tumor proportion score and combined positive score). The virtual stains reproduce key morphologic features and protein biomarker expressions at both tissue and cell levels compared with real stains, enable the identification of key immune phenotypes important for immuno-oncology, and show moderate to good performance across various evaluation metrics. This study extends our previous work on virtual staining from AF in liver disease and prostate cancer, further demonstrating the generalizability of this deep learning technique to a different disease (lung cancer) and stain modality (mIF).
We extend the capabilities of virtual staining from AF to a different disease and stain modality. Our work includes newly developed virtual stains for H&E and a multiplex immunofluorescence panel (DAPI, PanCK, PD-L1, CD3, and CD8) for non–small cell lung cancer, which reproduce the key features of real stains.
Introduction
Lung cancer is one of the most frequently diagnosed cancers and a leading cause of mortality worldwide. Approximately 85% of patients with lung cancer have non–small cell lung cancer (NSCLC), of which the most common subtypes are lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC; refs. 1, 2). Routine diagnosis of NSCLC requires hematoxylin and eosin (H&E) staining of tissue sections, and treatment decisions are often based on the assessment of various biomarkers. Protein biomarkers on tissue sections can be visualized by immunostaining techniques. Chromogenic IHC is routinely used in clinical practice (3). Multiplex immunofluorescence (mIF) is well-suited for the precise quantification and colocalization of multiple biomarkers and is widely used in research settings (3, 4).
Biomarkers characterizing the tumor and immune microenvironment are valuable for diagnosis, prognosis, and treatment decisions in NSCLC. Key biomarkers of interest include pancytokeratin (PanCK), cluster of differentiation 3 (CD3), cluster of differentiation 8 (CD8), and PD-L1. PanCK is expressed in epithelial cells and is useful for the identification of epithelial tumors (5). CD3 is expressed in T cells, which play an important role in the adaptive immune response. CD8 is predominantly expressed in cytotoxic T cells but can also be found in NK cells and dendritic cells. PD-L1, which may be expressed in both immune cells and tumor cells, suppresses the activity of T cells by binding to the regulatory receptor PD-1 and contributes to the immune evasion of tumors (6).
In recent years, immune checkpoint inhibitors that block PD-1 or PD-L1 have emerged as one of the most successful treatment strategies for NSCLC. PD-L1 expression is a widely used predictive biomarker of anti–PD-1/PD-L1 immunotherapy response for NSCLC. Pathologists determine the tumor proportion score (TPS) by evaluating PD-L1 expression in tumor cells (7–10), whereas the combined positive score (CPS) considers PD-L1 expression in both immune cells and tumor cells (7, 11). The immune cell topography has also been increasingly recognized as an important predictor of immunotherapy response and disease progression, and various methods to characterize immune cell phenotypes have been proposed for various cancers (12, 13). For example, the Immunoscore classification of colorectal cancer provides a scoring system based on CD3 and CD8 cell densities (14, 15).
The number of stains and molecular assays that can be performed is often limited by the availability of tissue. Variability of several factors, such as tissue preparation and staining differences between laboratories, also makes consistent inter-site analysis of IHC or IF results difficult (3, 4, 16). Variability of H&E stains can also be an issue, especially when performing automated downstream analyses (17). Virtual staining is a technique that addresses these problems using computer vision to generate stained images from unstained or differently stained tissue. One advantage of virtual staining from unstained tissue is the nondestructive process, thereby enabling the conservation of tissue samples for other uses such as standard histochemical staining or sequencing. Several methods for imaging unstained tissue have been explored, including hyperspectral autofluorescence (AF), by which endogenous fluorophores are imaged at high spatial resolution (18, 19). Virtual stains generated from AF images have demonstrated impressive qualitative and quantitative performance in several diseases and stains (20, 21). Virtual IHC has recently been shown to be promising for HER2 in breast cancer (22) and PIN-4 in prostate cancer (23). However, there remains a lack of virtual staining applications from unstained tissue for mIF, with recent works only demonstrating stain transfer capabilities from H&E and IHC (24, 25).
In this study, we demonstrate the feasibility of generating virtual stains for H&E and mIF from AF in NSCLC, extending the application of virtual staining of unstained tissue specimens to more diseases and stains. Figure 1 provides an overview of the current pathology workflow and virtual staining workflow.
Materials and Methods
Dataset
Data collection and preparation
Formalin-fixed paraffin-embedded tissue blocks from 448 participants with NSCLC approved for research use were procured from two biosample vendors (Avaden Biosciences, Inc. and Capital Biosciences, Inc.). Institutional Review Board exemption was obtained for this research. Participants were 224 male (50%) and 224 female (50%). The age of participants ranged from 37 to 89 years, with a mean age of 69 years. The diagnosis consisted of 362 LUAD (80.8%), 79 LUSC (17.6%), 4 adenosquamous carcinoma (0.9%), and 3 undetermined (0.7%) NSCLC subtypes.
Two slides from serial sections were prepared from each tissue block. Both slides were scanned with a custom-built hyperspectral fluorescence microscope, which has been previously described (23), to obtain unstained AF images consisting of 20 channels. The first section was then stained with H&E and scanned with the Aperio AT2 scanner (Leica Biosystems GmbH). The second section was stained with a custom mIF panel consisting of 4′,6-diamidino-2-phenylindole (DAPI) to visualize nuclei and primary antibodies paired with Opal fluorophores targeting PanCK (AE1/AE3, Opal 520), PD-L1 (E1L3N, Opal 620), CD3 (LN10, Opal 570), and CD8 (SP239, Opal 480; Akoya Biosciences, Inc.). The stained section was then scanned with the Vectra Polaris Imaging System (Akoya Biosciences, Inc.), and the component channels were spectrally unmixed with inForm tissue analysis software (Akoya Biosciences, Inc.). During spectral unmixing, the residual AF signal was retained as an additional channel.
Four additional slides were also prepared from each tissue block for use as control (two slides) and reserve (two slides) slides, if needed. A subset of control slides was stained for H&E and mIF to validate the staining protocols and examine sources of contamination that were occasionally observed during scanning. Reserve slides were kept unstained for use only in case tissue was damaged during the handling process.
Data alignment and quality control
Because the AF images and the stained images were acquired on different imaging systems, image pairs of the AF and corresponding stains on the same slide were globally aligned with an affine transformation using a previously described iterative approach (20, 23). Image pairs that could not be aligned successfully (for example, because of tissue distortion from the staining process) were excluded from the dataset. Other quality control checks were also performed to exclude images with out-of-focus regions or large regions of missing tissue. Overall, 422 AF–H&E image pairs (94.2%) and 405 AF–mIF image pairs (90.4%) passed all alignment and quality control checks and were used for model development, whereas 26 AF–H&E image pairs (5.8%) and 43 AF–mIF image pairs (9.6%) were excluded.
The antigen retrieval required for mIF staining is inherently harsher than the H&E staining process and is more likely to cause tissue warping, such as minor folds or tears. Therefore, an automatic algorithm to identify local regions with tissue warping was developed to exclude such artifacts. Patches from the globally aligned AF–mIF image pairs at the same location were converted to grayscale and normalized to the [0, 1] range. Using the normalized gradient field (NGF; ref. 26) or normalized total gradient (ref. 27) as the distance metric between the patches, a translational alignment of the patches was performed using the Powell method (28) combined with Gaussian pyramids with three levels (29) to speed up the convergence. The magnitude of the translational vector obtained was then used as an alignment score, whereby a lower score indicated better alignment quality. To determine a suitable threshold, a small subset of the training slides was manually annotated and compared against the alignment score. On this basis, any region with an alignment score greater than 6 μm at 10× magnification was considered misaligned and excluded from model development. No significant differences were observed between using the NGF or normalized total gradient as the distance metric.
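As a rough illustration of the alignment score, the sketch below computes an NGF distance between two grayscale patches and searches for the best translation with the Powell method over a three-level coarse-to-fine pyramid, assuming NumPy and SciPy. The function names, the edge parameter `eps`, the pyramid construction (Gaussian blur followed by decimation), and the `um_per_pixel` conversion are assumptions made for illustration and are not taken from the study's implementation.

```python
import numpy as np
from scipy import ndimage, optimize


def to_unit_gray(patch):
    """Average channels to grayscale and rescale to the [0, 1] range."""
    gray = patch.mean(axis=-1) if patch.ndim == 3 else patch.astype(float)
    span = gray.max() - gray.min()
    return (gray - gray.min()) / span if span > 0 else np.zeros_like(gray)


def ngf_distance(a, b, eps=1e-3):
    """NGF distance: small when image gradients are parallel or antiparallel."""
    ga = np.stack(np.gradient(a))
    gb = np.stack(np.gradient(b))
    dot = (ga * gb).sum(axis=0)
    norm_a = (ga ** 2).sum(axis=0) + eps ** 2
    norm_b = (gb ** 2).sum(axis=0) + eps ** 2
    return 1.0 - float(np.mean(dot ** 2 / (norm_a * norm_b)))


def alignment_score(fixed, moving, levels=3, um_per_pixel=1.0):
    """Magnitude of the translation that best aligns `moving` to `fixed`."""
    shift = np.zeros(2)  # running estimate, in full-resolution pixels
    for level in reversed(range(levels)):  # coarse to fine
        step = 2 ** level
        if step > 1:
            f = ndimage.gaussian_filter(fixed, step / 2)[::step, ::step]
            m = ndimage.gaussian_filter(moving, step / 2)[::step, ::step]
        else:
            f, m = fixed, moving

        def cost(s):
            moved = ndimage.shift(m, s, order=1, mode="nearest")
            return ngf_distance(f, moved)

        result = optimize.minimize(cost, shift / step, method="Powell")
        shift = np.atleast_1d(result.x) * step
    # um_per_pixel is a hypothetical conversion from pixels to micrometers.
    return float(np.hypot(*shift)) * um_per_pixel
```

Under these assumptions, a region would be flagged as misaligned when `alignment_score(...)` exceeds the 6 μm threshold stated above.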
Data splits
Cases were randomly assigned to training, validation, and testing splits for virtual stainer model development. The distributions of sex, age, and diagnosis for each individual split were balanced, and there were no significant deviations from the overall dataset distribution. For H&E model development, the final dataset consisted of 215 training, 85 validation, and 122 testing slides. For mIF model development, the final dataset consisted of 245 training, 65 validation, and 95 testing slides. A larger number of training slides was allocated for mIF model development due to the lower prevalence of certain targets, while still maintaining a reasonable number of validation and testing slides.
Virtual stainer
Overview
Virtual stainer models were developed to generate H&E and mIF virtual stains from unstained AF images. For mIF, four separate models were trained to generate DAPI + PanCK, PD-L1, CD3, and CD8, respectively. All models operated on patches which were then combined into whole-slide images (WSI) for evaluation.
Data sampling
For model development, paired patches of the AF and corresponding stains were sampled from tissue regions of the training and validation slides. Paired patches were sampled at 40× magnification from the same locations, and a padding of 16 pixels was added to each side of the corresponding stains to account for local errors in the global alignment algorithm; these errors were addressed by a shift-invariant regression loss during training, as further described in “Loss.” Therefore, the dimensions of the patches were 128 × 128 × 20 for AF, 160 × 160 × 3 for H&E, and 160 × 160 × 1 for each mIF channel.
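The patch geometry can be sketched in a few lines of NumPy: a 128-pixel AF patch plus 16 pixels of padding on each side of the stain gives 128 + 2 × 16 = 160 pixels. The function name and the assumption that both images are already globally aligned arrays at the same resolution are illustrative.

```python
import numpy as np

AF_SIZE, PAD = 128, 16  # values stated in the text


def sample_pair(af_image, stain_image, top, left):
    """Crop an AF patch and the padded stain patch around the same location.

    Assumes `top` and `left` are at least PAD pixels from the image border.
    """
    af = af_image[top:top + AF_SIZE, left:left + AF_SIZE]
    stain = stain_image[top - PAD:top + AF_SIZE + PAD,
                        left - PAD:left + AF_SIZE + PAD]
    return af, stain
```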
For the H&E model, patches were uniformly sampled across all tissue regions. Overall, 30 million patches were sampled for training and 1,000 patches were sampled for validation. The smaller number of patches chosen for validation was observed to produce comparable metrics with decreased computational times.
For the mIF models, local regions with tissue warping, as determined by the automatic algorithm using the NGF distance metric described in “Data alignment and quality control,” were excluded from sampling. Patches from tumor regions, as determined by the PanCK stain, were sampled with higher probabilities. Overall, for the DAPI + PanCK, CD3, and CD8 models, 10 million patches were sampled for training and 1,000 patches were sampled for validation. For PD-L1, a larger number of patches was sampled for training, and patches from PD-L1–positive regions, as determined by the intensity of the PD-L1 stain, were also sampled with higher probabilities. Due to the heterogeneity of PD-L1 expression in both immune cells and tumor cells, this was found to improve overall model performance. Overall, for the PD-L1 model, 20 million patches were sampled for training and 1,000 patches were sampled for validation.
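A minimal sketch of this kind of weighted sampling is shown below, assuming boolean masks for tissue, tumor, and warped regions on a common grid. The sampling probabilities are not given in the text, so `tumor_weight` is a hypothetical parameter, as are the function and argument names.

```python
import numpy as np


def sample_locations(tissue_mask, tumor_mask, exclude_mask, n,
                     tumor_weight=4.0, seed=0):
    """Draw n (row, col) locations, favoring tumor and skipping warped regions."""
    weights = tissue_mask.astype(float)
    weights[tumor_mask & tissue_mask] *= tumor_weight  # hypothetical up-weighting
    weights[exclude_mask] = 0.0                        # never sample warped regions
    probs = weights.ravel() / weights.sum()
    rng = np.random.default_rng(seed)
    flat = rng.choice(weights.size, size=n, p=probs)
    return np.column_stack(np.unravel_index(flat, weights.shape))
```

The same pattern would extend to PD-L1–positive regions by multiplying in a second weight derived from the PD-L1 stain intensity.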
Pseudo-IHC
Pseudo-IHC (pIHC) refers to IHC-like images rendered from mIF images using an algorithm based on modeling absorption using the Beer–Lambert law (30, 31). Our implementation of the algorithm included two modifications to improve the visual quality of the pIHC images. First, a realistic tissue background was rendered using the unmixed residual AF signal, which adds tissue morphology details to the image. Second, color offsets were added to match the colors typically observed in real IHC images obtained using brightfield microscopy, such as in regions in which no tissue is present.
Three channels from the mIF image were used to render the corresponding pIHC image. First, the DAPI stain was used to render the corresponding hematoxylin stain, which labels nuclei blue-purple. Second, the target stain was used to render the corresponding 3,3′-diaminobenzidine stain, which labels the target brown. Third, the unmixed residual AF was used to render the corresponding tissue background. Figure 2A shows examples of the mIF image and the corresponding pIHC image for different targets.
Note that the pIHC algorithm described in this study is based on pixel-wise image processing and is invertible, so it can also be used to render the mIF image from the corresponding pIHC image. More details are available in Supplementary Material S1 and Supplementary Table S1.
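To illustrate why a Beer–Lambert rendering is pixel-wise invertible, the sketch below maps three intensity channels (DAPI, target stain, residual AF) to RGB through a 3 × 3 matrix of optical-density color vectors and then inverts the mapping. The color vectors, the `white` level, and the function names are illustrative assumptions; the study's actual parameters and color offsets are given in its Supplementary Material and are not reproduced here.

```python
import numpy as np

# Illustrative optical-density color vectors (RGB) for each rendered component.
STAIN_OD = np.array([
    [0.65, 0.70, 0.29],  # hematoxylin, rendered from DAPI
    [0.27, 0.57, 0.78],  # DAB, rendered from the target stain
    [0.10, 0.10, 0.10],  # tissue background, rendered from residual AF
])


def mif_to_pihc(mif, white=1.0):
    """Render (H, W, 3) intensities [DAPI, target, AF] as a brightfield RGB image."""
    optical_density = mif @ STAIN_OD         # absorbance adds across components
    return white * np.exp(-optical_density)  # Beer-Lambert attenuation


def pihc_to_mif(rgb, white=1.0):
    """Invert the rendering pixel by pixel to recover the three intensities."""
    optical_density = -np.log(rgb / white)
    return optical_density @ np.linalg.inv(STAIN_OD)
```

Because the three color vectors are linearly independent, the matrix is invertible and the round trip recovers the input up to floating-point error.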
Initial experiments were conducted to compare the effectiveness of training the mIF models to predict the output as mIF or pIHC virtual stains. The initial experiments showed that training the models to predict pIHC virtual stains, and using the inverse pIHC algorithm to render the corresponding mIF virtual stains, resulted in higher quality of virtual stains compared with training the models to directly predict mIF virtual stains. This was determined by qualitative review of validation patches and WSIs. Using pIHC also enabled a more direct translation of some of the H&E model hyperparameters to the mIF models.
Architecture
The architecture of the models was based on generative adversarial networks, specifically the “pix2pix” paired image-to-image translation approach, and has been previously described (20, 23, 32). Briefly, the architecture consisted of a U-Net–based generator, a conditional discriminator, and two unconditional discriminators which operated at different magnifications of the image. The generator received the AF as the input and was trained to predict the corresponding virtual stain as the output. The discriminators received the real and virtual stains as the input and were trained to differentiate between the real and virtual stains. The unconditional discriminators received only the real and virtual stains as the input, whereas the conditional discriminator also received the AF as the input.
Specific hyperparameters such as kernel sizes, numbers of kernels, dropout rate, and attention gates (33) were different and empirically determined for the H&E and mIF models. More details are available in Supplementary Material S2 and Supplementary Table S2.
Loss
The loss components used to train the models have been previously described (20, 23). Briefly, a shift-invariant regression loss was used to minimize the L1 and L2 errors between the real and virtual stains. Conditional and unconditional adversarial losses were used in a minimax game to force the generator to produce realistic images in order to fool the discriminators. A rotational consistency loss was used to make the output rotation invariant and prevent the model from learning any orientation biases. An L2 regularization loss was used to penalize large model weights and reduce overfitting.
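One way to picture a shift-invariant regression error is to compare the prediction against every window of the padded target and keep the smallest error, so that small residual misalignments are not penalized. The sketch below does this for the L1 error only; taking the minimum over integer shifts is an assumption for illustration, as the exact formulation is given in the cited prior work (20, 23) rather than in this section.

```python
import numpy as np


def shift_invariant_l1(pred, padded_target, pad=16):
    """Smallest mean L1 error over all integer shifts within +/- pad pixels."""
    h, w = pred.shape[:2]
    best = np.inf
    for dy in range(2 * pad + 1):
        for dx in range(2 * pad + 1):
            window = padded_target[dy:dy + h, dx:dx + w]
            best = min(best, float(np.abs(pred - window).mean()))
    return best
```

With `pad=16`, a 128 × 128 prediction is compared against all 33 × 33 windows of the 160 × 160 target patch.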
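[Equation not recoverable: pixel-wise weighting scheme for the mIF models, defined in terms of the weight, the IF intensity of the target stain, and the maximum IF intensity used to clip values.]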
For the DAPI + PanCK model, the pixel-wise weighting scheme was not applied, as no significant improvements in quality were observed for the PanCK target stain which is highly prevalent, and the DAPI stain which labels nuclei was also an output of interest. For the CD8 model, all pixels with an IF intensity of the target stain of less than 15 were given a weight of 1 to reduce the effect of label noise from background and nonspecific fluorescence, which is further described in “Results.”
Specific hyperparameters such as the weights for each loss component were different and empirically determined for the H&E and mIF models. More details are available in Supplementary Material S2 and Supplementary Table S2.
Training and validation
All model weights were randomly initialized using Glorot uniform initialization (34) and optimized using Adam optimization (35) to minimize the total loss on the training set. All models were trained with a fixed batch size. Learning rate schedules were used to dynamically change the learning rate over time. Loss schedules were used to change the relative weights of the shift-invariant regression loss and adversarial loss components over time. Performance was monitored based on the L1 error and Fréchet inception distance (ref. 36) between the real and virtual stain patches to ensure model convergence. The best checkpoint for each model was selected based on a combination of the minimum L1 and Fréchet inception distance on validation patches, as well as qualitative review of the virtual stains on validation WSIs. The best checkpoint for all models was between 320,000 and 360,000 training steps.
Specific hyperparameters such as the batch size, learning rate, learning rate schedules, and loss schedules were different and empirically determined for the H&E and mIF models. More details are available in Supplementary Material S2 and Supplementary Table S2.
Data availability
The data in this study are not available due to legal terms of the agreement between Verily Life Sciences LLC and Genmab. Releasing a trained binary or working code of our internal tooling, infrastructure, and hardware is not feasible. However, all virtual stainer models have been described in sufficient detail in “Materials and Methods” and Supplementary Materials to allow independent replication. The “pix2pix” architecture is open-source and available with tutorials at https://www.tensorflow.org/tutorials/generative/pix2pix. The algorithmic components of our work are built on open-source repositories. Python 3.6 packages (NumPy, SciPy, OpenCV, Pandas, Seaborn, and Matplotlib) were used for feature extraction, preprocessing, training, evaluation, statistical analysis, and plotting. TensorFlow 2.0 with Keras was used to build, train, and evaluate the virtual stainer models.
Evaluation
Domain-specific computational methods were used to evaluate the accuracy of the H&E and mIF virtual stains compared with the real stains on testing slides.
H&E
To evaluate the accuracy of the H&E virtual stains, an independent model which has been previously described (37) was used for automatic histologic subtyping, and the similarity of the model’s segmentations on real and virtual stains was evaluated. Briefly, the model was trained on H&E images of LUAD cases from The Cancer Genome Atlas dataset to segment nine histologic features at 10× magnification. The histologic features include six tumor subtypes (acinar, cribriform, lepidic, micropapillary, papillary, and solid), leukocyte aggregates, necrosis, and an “other” category comprising features such as normal tissue and stroma.
The tumor subtypes are only applicable to LUAD cases. Therefore, the six tumor subtypes were combined into a single “combined tumor” category for any analysis involving non-LUAD cases (23 LUSC, 3 adenosquamous carcinoma, and 1 undetermined). This was considered a reasonable approach even though the model was not specifically trained for non-LUAD cases, given that it was not the accuracy of the histologic subtyping model that was being evaluated but rather the similarity of the model’s segmentations on real and virtual stains.
Additionally, each tumor subtype is typically considered clinically relevant only if its presence exceeds a certain threshold (38). Therefore, an additional analysis for each tumor subtype was performed by limiting to LUAD cases in which that subtype exceeded 5% of the total tumor area (LUAD-5%). This excluded cases with low prevalence and may better represent clinical performance.
Dice scores for the overlap of segmentations on the real and virtual stains were calculated, whereby higher values indicate better performance.
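For reference, a Dice score between two label maps can be computed per category as below. The function names and the convention of returning 1.0 when a category is absent from both maps are assumptions for illustration.

```python
import numpy as np


def dice_score(mask_real, mask_virtual):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two boolean masks."""
    a = np.asarray(mask_real, dtype=bool)
    b = np.asarray(mask_virtual, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumed convention: category absent from both maps
    return 2.0 * np.logical_and(a, b).sum() / total


def dice_per_category(seg_real, seg_virtual, categories):
    """Dice score for each category ID in two integer segmentation maps."""
    return {c: dice_score(seg_real == c, seg_virtual == c) for c in categories}
```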
mIF
To evaluate the accuracy of the mIF virtual stains, several cell segmentation–based measurements were obtained using classic threshold-based automatic image analysis tools developed using Visiopharm software (Visiopharm A/S). Briefly, cells were identified and segmented based on the DAPI stain, and tumor regions were identified and segmented based on the PanCK stain. Positive cell expression of PD-L1, CD3, and CD8 was determined based on the average intensity of the respective stain within each segmented cell. Figure 2B shows an example of the segmentations.
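The per-cell positivity rule can be sketched with SciPy's labeled-image statistics, assuming an integer label image in which 0 is background and each segmented cell has a unique ID. This is not the Visiopharm implementation; the threshold value and function names are hypothetical.

```python
import numpy as np
from scipy import ndimage


def positive_cell_ids(cell_labels, stain, threshold):
    """Return IDs of cells whose mean stain intensity exceeds the threshold."""
    ids = np.unique(cell_labels)
    ids = ids[ids != 0]  # label 0 is background
    means = ndimage.mean(stain, labels=cell_labels, index=ids)
    return ids[np.asarray(means) > threshold]


def cell_density(n_cells, area_mm2):
    """Cells per square millimeter within a region of interest."""
    return n_cells / area_mm2
```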
Measurements of the positive area, positive cell density, positive cell percentage, and computationally derived TPS and CPS were calculated for each stain, where relevant. Specifically, the positive area was calculated only for PanCK, whereas positive cell density and positive cell percentage were calculated only for DAPI, PD-L1, CD3, and CD8. For DAPI, positive cell percentage was not calculated as the calculation is itself a function of the number of cells segmented based on DAPI and would therefore always be 100%. For PD-L1 only, positive cell percentage was calculated as TPS and CPS to align with clinical terminology and practice. More details are available in Supplementary Material S3.
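As a point of reference, the conventional clinical definitions of the two scores can be written as follows. The study's computational definitions are in its Supplementary Material and may differ in detail, so this sketch assumes the standard forms: TPS as the percentage of tumor cells that are PD-L1 positive, and CPS as PD-L1–positive tumor and immune cells relative to tumor cells, capped at 100.

```python
def tumor_proportion_score(pdl1_pos_tumor, total_tumor):
    """TPS (%): PD-L1-positive tumor cells over all tumor cells."""
    return 100.0 * pdl1_pos_tumor / total_tumor


def combined_positive_score(pdl1_pos_tumor, pdl1_pos_immune, total_tumor):
    """CPS: PD-L1-positive tumor and immune cells over tumor cells, capped at 100."""
    score = 100.0 * (pdl1_pos_tumor + pdl1_pos_immune) / total_tumor
    return min(100.0, score)
```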
Additionally, a colocalization analysis which can better represent particular cell subsets was also performed. Specifically, CD3 and CD8 colocalization can better represent the subset of cytotoxic T cells, and CD3 and PD-L1 colocalization can better represent the subset of PD-L1–positive T cells.
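Given the IDs of cells called positive for each stain, colocalization reduces to a set intersection. In the sketch below, using all segmented cells in the region as the denominator for the percentage is an assumption, as are the function and key names.

```python
import numpy as np


def colocalization_summary(positive_a, positive_b, n_cells, area_mm2):
    """Density and percentage of cells positive for both stains."""
    both = np.intersect1d(positive_a, positive_b)
    return {
        "ids": both,
        "density_per_mm2": both.size / area_mm2,
        "percentage": 100.0 * both.size / n_cells,  # assumed denominator
    }
```

For example, passing the CD3-positive and CD8-positive cell IDs yields the subset interpreted above as cytotoxic T cells.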
Analysis was performed according to three different definitions of the region of interest:
Tissue: The entire tissue region, as determined by a tissue detection algorithm. The same region was used on both the real and virtual stains. This analysis was useful to measure the performance on the entire WSI.
Real tumor: The real tumor region, as determined by the real PanCK stain. The same region was used on both the real and virtual stains. This analysis was useful to measure the performance in the tumor region independent of the quality of the virtual PanCK stain.
Respective tumor: The respective tumor regions, as determined by the real PanCK stain on the real stains and the virtual PanCK stain on the virtual stains. This analysis was useful to measure the performance in the tumor region in a fully virtual workflow.
Pearson correlations and absolute differences between the measurements on real and virtual stains were calculated, whereby higher values for correlations indicate better performance and lower values for absolute differences indicate better performance.
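A Pearson correlation with a confidence interval can be sketched with the Fisher z-transformation, as below. Whether the study's intervals were computed this way is an assumption, although the method gives approximately 0.94 to 0.97 for r = 0.96 with the 95 mIF testing slides.

```python
import numpy as np


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation."""
    z = np.arctanh(r)
    half_width = z_crit / np.sqrt(n - 3)
    return float(np.tanh(z - half_width)), float(np.tanh(z + half_width))


def pearson_with_ci(x, y, z_crit=1.96):
    """Pearson r between paired measurements, with its Fisher interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = float(np.corrcoef(x, y)[0, 1])
    low, high = fisher_ci(r, len(x), z_crit)
    return r, low, high
```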
Results
H&E
Qualitative analysis
Figure 3A shows a WSI example of the H&E real and virtual stains. Figure 3B shows a magnification series of concentric regions from the WSI.
Overall, the virtual stains were able to reproduce key morphologic features at tissue and cell levels such as tumor cells, tumor-infiltrating lymphocytes, and tumor-associated stroma. Certain subcellular features such as cell borders, nuclear size and shape, and cytoplasm appearance, as well as mitoses, nuclear lobation, and nucleoli, were also reproduced in the virtual stains.
Quantitative analysis
Tables 1 and 2 show the average performance based on the overlap of segmentations obtained from automatic histologic subtyping of real and virtual stains. Figure 3C shows examples of the segmentations.
Table 1. Dice scores for the overlap of segmentations on real and virtual H&E stains, by category.

Category | All (122 cases) | LUAD (95 cases) | Non-LUAD (27 cases) |
---|---|---|---|
Combined tumor | 0.92 ± 0.05, 0.94 | 0.93 ± 0.05, 0.94 | 0.92 ± 0.04, 0.92 |
Leukocyte aggregates | 0.85 ± 0.06, 0.86 | 0.85 ± 0.06, 0.86 | 0.85 ± 0.06, 0.85 |
Necrosis | 0.66 ± 0.27, 0.76 | 0.62 ± 0.29, 0.71 | 0.78 ± 0.13, 0.82 |
Other | 0.95 ± 0.04, 0.96 | 0.95 ± 0.04, 0.97 | 0.94 ± 0.04, 0.95 |

Table 2. Dice scores for the overlap of segmentations on real and virtual H&E stains, by tumor subtype.

Tumor subtype | LUAD (95 cases) | LUAD-5% (number of cases meeting criteria) |
---|---|---|
Acinar | 0.73 ± 0.17, 0.77 | 0.80 ± 0.09, 0.82 (59) |
Cribriform | 0.56 ± 0.22, 0.63 | 0.70 ± 0.12, 0.71 (36) |
Lepidic | 0.76 ± 0.15, 0.79 | 0.80 ± 0.09, 0.82 (76) |
Micropapillary | 0.62 ± 0.27, 0.69 | 0.82 ± 0.06, 0.84 (23) |
Papillary | 0.68 ± 0.25, 0.77 | 0.81 ± 0.07, 0.82 (56) |
Solid | 0.76 ± 0.14, 0.76 | 0.79 ± 0.12, 0.80 (76) |
Overall, the segmentations on real and virtual stains were very similar, despite patches often exhibiting multiple subtypes or ambiguous categorization. The segmentations for the combined tumor, leukocyte aggregates, and “other” categories showed the best performance, with Dice scores above 0.8, indicating good differentiation of tumor, nontumor, and immune cells. The performance for the necrosis category was lower, which may be attributed to low AF signals in necrotic regions. No significant differences were observed between LUAD and non-LUAD cases. For LUAD cases, the Dice scores ranged from moderate (0.5–0.7) to good (0.7–0.9) for the segmentations of the tumor subtypes. The best performance was observed for the segmentations of solid, acinar, and lepidic tumor subtypes, which are the most established and common tumor subtypes (38). The lowest performance was observed for the segmentations of the cribriform tumor subtype, which is less established and more difficult to characterize than the other tumor subtypes (39). The performance for all tumor subtypes improved for the subset of LUAD-5% cases, with all average Dice scores of 0.70 or higher, demonstrating promising performance when factoring in a clinical threshold.
mIF
Qualitative analysis
Figure 4A shows a WSI example composite of all mIF real and virtual stains. Figure 4B–E shows the individual model outputs at several regions across the WSI. Figure 5A–E shows examples of key immune phenotypes which are important for immuno-oncology. More examples are available in Supplementary Material S4 and Supplementary Figs. S1 and S2.
Overall, the global distribution of the virtual stains was similar to that of the real stains. The virtual stains were able to satisfactorily reproduce key morphologic features and protein biomarker expressions at tissue and cell levels. Importantly, the identification of key immune phenotypes such as PD-L1–positive, PD-L1–negative, immune-inflamed, immune-excluded, and immune-desert tumors was attainable with the virtual stains (12). The virtual stains for morphologic or cell type biomarkers such as DAPI, PanCK, and CD3 were of higher quality and accuracy than those for highly dynamic cell state biomarkers such as PD-L1. For CD8, although the overall spatial distribution and relative density of positive cells were moderately accurate at the tissue level, the positive expression at the cell level was not as accurate. Upon further inspection, high amounts of background and nonspecific fluorescence were observed in many of the CD8 real stains, which added label noise during model training and resulted in poorer performance. More details are available in Supplementary Material S4 and Supplementary Fig. S3.
Quantitative analysis
Tables 3 and 4 show the Pearson correlations between the measurements on real and virtual stains obtained from the cell segmentation–based analysis in Visiopharm software for the single expression and colocalization analysis, respectively. Blank entries indicate measurements that were not relevant for the stain as described in “Materials and Methods.” The average absolute differences and scatterplots of the measurements on real and virtual stains are available in Supplementary Material S4, Supplementary Tables S3 and S4, and Supplementary Figs. S4–S11.
Table 3. Pearson correlations between measurements on real and virtual stains for the single expression analysis (intervals in parentheses).

Region | Measurement | PanCK | DAPI | PD-L1 | CD3 | CD8 |
---|---|---|---|---|---|---|
Tissue | Positive area (mm²) | 0.96 (0.94–0.97) | — | — | — | — |
Tissue | Positive cell density (cells/mm²) | — | 0.85 (0.79–0.90) | 0.63 (0.49–0.74) | 0.89 (0.83–0.92) | 0.41 (0.23–0.56) |
Tissue | Positive cell percentage (%) | — | — | 0.50 (0.33–0.64) | 0.84 (0.77–0.89) | 0.43 (0.25–0.58) |
Real tumor | Positive cell density (cells/mm²) | — | 0.66 (0.52–0.76) | 0.52 (0.36–0.66) | 0.84 (0.76–0.89) | 0.50 (0.33–0.64) |
Real tumor | Positive cell percentage (%) | — | — | See TPS and CPS | 0.79 (0.70–0.86) | 0.45 (0.27–0.60) |
Real tumor | TPS (%) | — | — | 0.46 (0.29–0.61) | — | — |
Real tumor | CPS (%) | — | — | 0.71 (0.60–0.80) | — | — |
Respective tumor | Positive cell density (cells/mm²) | — | 0.53 (0.37–0.66) | 0.52 (0.36–0.66) | 0.84 (0.76–0.89) | 0.49 (0.32–0.63) |
Respective tumor | Positive cell percentage (%) | — | — | See TPS and CPS | 0.76 (0.66–0.83) | 0.45 (0.28–0.60) |
Respective tumor | TPS (%) | — | — | 0.47 (0.29–0.61) | — | — |
Respective tumor | CPS (%) | — | — | 0.71 (0.59–0.80) | — | — |

Table 4. Pearson correlations between measurements on real and virtual stains for the colocalization analysis (intervals in parentheses).

Region | Measurement | CD3 and CD8 | CD3 and PD-L1 |
---|---|---|---|
Tissue | Positive cell density (cells/mm²) | 0.53 (0.38–0.66) | 0.81 (0.73–0.87) |
Tissue | Positive cell percentage (%) | 0.55 (0.39–0.67) | 0.73 (0.62–0.81) |
Real tumor | Positive cell density (cells/mm²) | 0.56 (0.41–0.69) | 0.71 (0.60–0.80) |
Real tumor | Positive cell percentage (%) | 0.56 (0.41–0.69) | 0.66 (0.53–0.76) |
Respective tumor | Positive cell density (cells/mm²) | 0.51 (0.35–0.65) | 0.75 (0.65–0.83) |
Respective tumor | Positive cell percentage (%) | 0.53 (0.36–0.66) | 0.68 (0.55–0.77) |
Overall, the correlations ranged from moderate (0.4–0.7) to good (0.7–1.0) across the different measurements. The performance metrics were consistent with the qualitative observations described above: virtual stains for DAPI, PanCK, and CD3 performed better than those for PD-L1 and CD8. In general, correlations were lower in tumor regions than in tissue regions, suggesting that biomarker expression in tumor regions may be more dysregulated or heterogeneous and therefore more difficult to predict. For PD-L1, the correlations between measurements on real and virtual stains were 0.46 for TPS and 0.71 for CPS; for context, previously reported interobserver agreements (Cohen’s kappa) range from 0.53 to 0.72 for TPS (40–42) and from 0.52 to 0.74 for CPS (40) across various studies. CPS showed better correlations than TPS, indicating that PD-L1 expression can be predicted more accurately in immune cells than in tumor cells. This was consistent with the good performance of the CD3 and PD-L1 colocalization analysis, which represents PD-L1 expression on T cells. For CD8, background AF and nonspecific fluorescence contributed to the false-positive identification of cells. Although CD8 can also be expressed on NK cells and dendritic cells, the primary cell type of interest when staining for CD8 is often cytotoxic T cells. Therefore, the colocalization analysis of CD3 and CD8 enabled the specific analysis of cytotoxic T cells. The high quality of both the real and virtual stains for CD3 also reduced false positives and improved the performance metrics.
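As an illustration of how the cell-level measurements compared above can be derived and correlated, the following is a minimal Python sketch. It assumes each segmented cell already carries boolean marker calls; the `Cell` fields, the use of PanCK positivity as the tumor-cell proxy, the use of CD3 positivity as the immune-cell proxy in the CPS numerator, and the choice of a Pearson correlation are all assumptions made for illustration and may differ from the definitions used in this study.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Cell:
    # Hypothetical per-cell marker calls produced by a segmentation step.
    panck: bool  # assumed proxy for "tumor cell"
    pdl1: bool
    cd3: bool    # assumed proxy for "immune cell" in this sketch
    cd8: bool


def tps(cells):
    """Tumor proportion score: percentage of tumor cells that are PD-L1 positive."""
    tumor = [c for c in cells if c.panck]
    if not tumor:
        return 0.0
    return 100.0 * sum(c.pdl1 for c in tumor) / len(tumor)


def cps(cells):
    """Combined positive score: PD-L1+ tumor and immune cells per 100 tumor cells, capped at 100."""
    n_tumor = sum(c.panck for c in cells)
    if n_tumor == 0:
        return 0.0
    n_positive = sum(c.pdl1 and (c.panck or c.cd3) for c in cells)
    return min(100.0, 100.0 * n_positive / n_tumor)


def colocalized_density(cells, area_mm2, a="cd3", b="cd8"):
    """Density (cells/mm2) of cells positive for both markers a and b."""
    n_double = sum(getattr(c, a) and getattr(c, b) for c in cells)
    return n_double / area_mm2


def pearson(x, y):
    """Pearson correlation between paired measurements (e.g., real vs. virtual stains)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Toy example: 4 tumor cells (1 PD-L1+), 2 T cells (1 PD-L1+, 1 CD8+).
example = [
    Cell(True, True, False, False),
    Cell(True, False, False, False),
    Cell(True, False, False, False),
    Cell(True, False, False, False),
    Cell(False, True, True, False),
    Cell(False, False, True, True),
]
print(tps(example))  # 25.0
print(cps(example))  # 50.0
```

In this toy example CPS exceeds TPS because the numerator also counts the PD-L1-positive T cell, which is why the two scores can respond differently to prediction errors in immune versus tumor cells.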
Discussion
In this study, we present virtual stainer models that can generate virtual stains for H&E and an immuno-oncology mIF panel (DAPI, PanCK, PD-L1, CD3, and CD8) from AF images of unstained NSCLC tissue specimens. This work follows our previous work (20, 23) and demonstrates the generalizability of our technique by extending its application to a different disease and stain modality.
The H&E virtual stains accurately reproduced key morphologic features at tissue, cell, and even subcellular levels compared with real stains. The virtual stains also demonstrated good performance when used for automatic histologic subtyping. Overall, there was good differentiation between tumor and nontumor regions, with Dice scores above 0.8. There was also good characterization of immune cells, with a Dice score of 0.85 for leukocyte aggregates. For LUAD cases, the differentiation between specific tumor subtypes also showed good performance, especially for the solid, acinar, and lepidic tumor subtypes, with Dice scores above 0.7. Therefore, the H&E virtual stains showed potential to be accurate enough for the diagnosis and subtyping of NSCLC cases, for which H&E staining is routinely used in clinical practice. We hypothesize that some of the differences between the segmentations on real and virtual stains may be partly attributed to the variability in pathologist annotations used for training the histologic subtyping model (37): the high rates of interpathologist disagreement observed when subtyping adenocarcinoma (43) may make the histologic subtyping model susceptible to minor and clinically irrelevant variations between the real and virtual stains.
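For readers less familiar with the metric, the Dice score for a class is 2|A∩B| / (|A| + |B|), where A and B are the sets of pixels assigned to that class in the two segmentations. A minimal sketch follows, assuming the segmentations of the real and virtual stains are available as same-sized 2D label masks; the function name and the toy masks are hypothetical.

```python
def dice_score(mask_a, mask_b, label):
    """Dice overlap for one class label between two same-shaped 2D label masks."""
    count_a = count_b = intersection = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            in_a = pixel_a == label
            in_b = pixel_b == label
            count_a += in_a
            count_b += in_b
            intersection += in_a and in_b
    if count_a + count_b == 0:
        # Label absent from both masks: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / (count_a + count_b)


# Toy masks: 1 = tumor, 0 = nontumor.
real_mask = [[1, 1, 0],
             [0, 1, 0]]
virtual_mask = [[1, 0, 0],
                [0, 1, 1]]
print(round(dice_score(real_mask, virtual_mask, label=1), 3))  # 0.667
```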
The mIF virtual stains also showed good performance, reproducing key morphologic features and protein biomarker expressions at tissue and cell levels, especially for DAPI, PanCK, and CD3. The identification of key immune phenotypes important for immuno-oncology was also attainable with the virtual stains. Overall, there was good agreement between clinically relevant measurements derived from the real and virtual stains, especially for measurements of the tumor area, T-cell density and percentage, and CPS, with correlations above 0.7. For PD-L1, the performance of the virtual stains was better in immune cells than in tumor cells, and further investigation would be required to determine the reason for this difference. We hypothesize that the more distinct morphology of immune cells, the heterogeneity of dysregulated PD-L1 expression in tumor cells, and the difference in AF signals between immune cells and tumor cells each contribute in part. For CD8, we observed background and nonspecific fluorescence in the real stains, which added label noise during model training. Therefore, we expect the performance of the CD8 virtual stains to improve considerably with a less noisy dataset, which could be obtained through better quality control and, for example, by using fluorophores at a different wavelength for CD8 staining. We aim to further improve the performance of the PD-L1 and CD8 virtual stains in future work.
We also present a pIHC algorithm that can be used to render IHC-like images from mIF images, introducing modifications to existing methods (30, 31) to improve the visual quality of pIHC images and their similarity to the real IHC images that pathologists are more familiar with reading and evaluating. The pIHC algorithm can be used in several ways. First, it enables the generation of pIHC images for multiple biomarkers from a single mIF image. In contrast, the generation of real IHC images for multiple biomarkers would require the preparation of multiple slides containing consecutive sections obtained from a tissue block, which can be limited by the availability of tissue. The use of consecutive sections also prohibits accurate colocalization analysis of biomarkers, which is required for specific cell subsets and spatial analysis.
Second, the pIHC algorithm enables flexibility in the choice of modality for different applications. For example, pIHC images may be preferred when pathologists review images, whereas mIF images may be preferred for the development of computational analysis workflows and spatial biology research. We also took advantage of this flexibility in our model development process, whereby using pIHC enabled a more direct translation of previous virtual stainer models that were developed for H&E and IHC images, thereby reducing model development and iteration time. Similarly, many existing pathology image analysis models may be developed for IHC images, and the use of pIHC may directly enable mIF images to be applied to those models. The pIHC algorithm may also be used for the generation of multiplex pIHC, whereby multiple pIHCs from the same section can be overlaid and pseudo-colored. Further analysis on whether such images could be beneficial for pathologist image review would be required. Specifically, pathologists may prefer the additional contextual information about the histology that can be provided by the multiplex pIHC but not the mIF images.
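To give a sense of how an IHC-like rendering can be produced from fluorescence channels, the sketch below uses a generic Beer-Lambert-style exponential color mapping. It is not the modified pIHC algorithm described in this study: the function name, the gain parameters `k_nuclear` and `k_marker`, and the optical density vectors (approximate values of the kind commonly used for hematoxylin and DAB color deconvolution) are assumptions chosen for illustration.

```python
from math import exp

# Approximate (R, G, B) optical density vectors; illustrative assumptions only.
HEMATOXYLIN_OD = (0.65, 0.70, 0.29)  # absorbs red and green most, so nuclei look blue
DAB_OD = (0.27, 0.57, 0.78)          # absorbs blue most, so the marker looks brown


def pseudo_ihc_pixel(nuclear, marker, k_nuclear=2.5, k_marker=2.5):
    """Map normalized fluorescence intensities in [0, 1] to an 8-bit RGB pixel.

    `nuclear` is the counterstain channel (e.g., DAPI) and `marker` is the
    biomarker channel to be shown as a brown chromogen-like signal.
    """
    rgb = []
    for od_h, od_d in zip(HEMATOXYLIN_OD, DAB_OD):
        # Transmitted light falls off exponentially with total "stain" amount.
        transmitted = exp(-(k_nuclear * nuclear * od_h + k_marker * marker * od_d))
        rgb.append(round(255 * transmitted))
    return tuple(rgb)


print(pseudo_ihc_pixel(0.0, 0.0))  # (255, 255, 255): white background where there is no signal
```

Applying the function to every pixel of a nuclear channel and one biomarker channel yields one pseudo-IHC image per biomarker from the same mIF image; adding further markers with their own color vectors inside the exponent is one possible way to sketch a multiplex overlay.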
In this study, the accuracy of virtual stains compared with real stains was evaluated primarily using computational methods. Although the results were promising, further validation would be required to establish whether virtual stains can be used in clinical or research settings as a substitute for real stains. Human reader studies should be performed to evaluate the diagnostic accuracy of virtual stains by pathologists. Previous reader studies have shown that H&E virtual stains can be accurate enough for pathologists to grade images using established clinical scoring systems in nonalcoholic steatohepatitis (20) and prostate cancer (23). A similar reader study can be performed for H&E virtual stains in NSCLC based on histologic classification and subtyping. Whereas mIF image analysis is often computational and there are few established scoring systems based on mIF for NSCLC, more in-depth computational spatial analysis comparing real and virtual stains can be performed for further validation. For example, more advanced machine learning–based algorithms (44) could be developed to improve upon the current threshold-based algorithms used for cell segmentation, which can have limited accuracy in separating overlapping adjacent cells. It would also be important to evaluate the association with clinical outcomes or endpoints in order to determine any clinical impact of the differences observed between the measurements on real and virtual stains. Furthermore, it would be valuable to compare the inter-stain variability between real and virtual stains (real–virtual variability) against the intra-stain variability observed between repeated generations of real stains (real–real variability) or virtual stains (virtual–virtual variability). We hope to address these limitations and further validate the accuracy of the virtual stains in future work.
In summary, we demonstrate the feasibility of generating virtual stains for H&E and mIF from AF images of unstained NSCLC tissue specimens, extending the application of our previous work to another disease and stain modality. Virtual staining has great potential to enable spatial biology research, improve efficiency and reliability in the clinical workflow, as well as conserve tissue samples in a nondestructive manner for future analysis. This study reinforces the potential of digital pathology and virtual staining to improve medical decision making and patient outcomes.
Authors’ Disclosures
J. Loo reports a patent to virtual staining pending. M. Robbins reports personal fees from Verily Life Sciences and grants from Genmab during the conduct of the study. C. McNeil reports a patent to “Platform-based predictions using digital pathology information” issued. T. Yoshitake reports other support from Verily Life Sciences LLC during the conduct of the study and outside the submitted work; in addition, T. Yoshitake has a patent to virtual staining pending. C. Santori reports a patent to “Generating virtually stained images of unstained samples” issued, a patent to “Multispectral fluorescence microscope” pending, a patent to “Image-based focusing” pending, a patent to “Image normalization for virtual staining” pending, and a patent to “Flatfield calibration” pending. S. Vyawahare reports a patent to the U.S. patent office pending. D.F. Steiner reports other support from Google during the conduct of the study. A.C. Sanchez reports grants from Genmab during the conduct of the study. L. Scott reports no financial ties other than previous employment at Genmab. P. Cimermancic reports nonfinancial support from Verily Life Sciences outside the submitted work. P.F. Wong reports a patent to virtual immunofluorescence pending. This work was supported by Verily Life Sciences LLC and Genmab. Verily Life Sciences LLC reports patent applications on virtual staining and alignment. J. Loo, M. Robbins, C. McNeil, T. Yoshitake, C. Santori, C.J. Shan, S. Vyawahare, H. Patel, T.C. Wang, R. Findlater, S. Rao, M. Gutierrez, Y. Wang, A.C. Sanchez, R. Yin, V. Velez, J.S. Sigman, S.S. Weaver, E. Rivlin, R. Goldenberg, P. Cimermancic, and P.F. Wong are current or former employees with equity interests during tenure at Verily Life Sciences LLC. D.F. Steiner is a current employee with equity interests at Google LLC. P. Coutinho de Souza, H. Chandrupatla, L. Scott, C.-W. Lee, and S.S. Couto are current or former employees at Genmab.
All authors performed work for this study during their respective tenures.
Authors’ Contributions
J. Loo: Data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. M. Robbins: Data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing. C. McNeil: Data curation, software, investigation, methodology, project administration, writing–review and editing. T. Yoshitake: Data curation, software, formal analysis, validation, investigation, methodology, writing–review and editing. C. Santori: Data curation, software, formal analysis, validation, investigation, methodology, writing–review and editing. C.J. Shan: Data curation, software, investigation, methodology, writing–review and editing. S. Vyawahare: Data curation, software, investigation, methodology, writing–review and editing. H. Patel: Data curation, software, formal analysis, validation, investigation, methodology, writing–review and editing. T.C. Wang: Data curation, software, formal analysis, validation, investigation, methodology, writing–review and editing. R. Findlater: Data curation, software, formal analysis, validation, investigation, methodology, project administration, writing–review and editing. D.F. Steiner: Supervision, validation, writing–review and editing. S. Rao: Supervision, validation, writing–review and editing. M. Gutierrez: Data curation, software, supervision, investigation, methodology, writing–review and editing. Y. Wang: Data curation, software, investigation, methodology, writing–review and editing. A.C. Sanchez: Data curation, software, investigation, methodology, writing–review and editing. R. Yin: Data curation, investigation, writing–review and editing. V. Velez: Data curation, investigation, writing–review and editing. J.S. Sigman: Data curation, investigation, methodology, project administration, writing–review and editing. P. Coutinho de Souza: Data curation, investigation, writing–review and editing. H. Chandrupatla: Data curation, investigation, writing–review and editing. L. Scott: Data curation, investigation, writing–review and editing. S.S. Weaver: Resources, supervision, funding acquisition, writing–review and editing. C.-W. Lee: Resources, supervision, funding acquisition, writing–review and editing. E. Rivlin: Resources, supervision, funding acquisition, writing–review and editing. R. Goldenberg: Resources, supervision, funding acquisition, writing–review and editing. S.S. Couto: Conceptualization, resources, supervision, funding acquisition, writing–review and editing. P. Cimermancic: Conceptualization, resources, supervision, funding acquisition, methodology, project administration, writing–review and editing. P.F. Wong: Conceptualization, resources, formal analysis, supervision, funding acquisition, validation, visualization, methodology, project administration, writing–review and editing.
Acknowledgments
We thank Verily and Genmab for supporting this collaboration. We thank Roopam Rajvanshi, Sebastian Dobon, Bryan Crampton, and Pavithran Ramachandran for data infrastructure support. We thank Byron Bogaert and Amer Jarrah for computer hardware and network support. We thank Janelle Chang Clark, Susan Kram, and Tyler Hassenpflug for program management and operations support. We thank Melissa Miao and Radha Patel for business development support. We thank Fabien Beckers for program leadership support.
Note: Supplementary data for this article are available at Cancer Research Communications Online (https://aacrjournals.org/cancerrescommun/).