The spatial organization of different types of cells in tumor tissues reveals important information about the tumor microenvironment (TME). To facilitate the study of cellular spatial organization and interactions, we developed Histology-based Digital-Staining, a deep learning-based computation model, to segment the nuclei of tumor, stroma, lymphocyte, macrophage, karyorrhexis, and red blood cells from standard hematoxylin and eosin–stained pathology images in lung adenocarcinoma. Using this tool, we identified and classified cell nuclei and extracted 48 cell spatial organization-related features that characterize the TME. Using these features, we developed a prognostic model from the National Lung Screening Trial dataset, and independently validated the model in The Cancer Genome Atlas lung adenocarcinoma dataset, in which the predicted high-risk group showed significantly worse survival than the low-risk group (P = 0.001), with a HR of 2.23 (1.37–3.65) after adjusting for clinical variables. Furthermore, the image-derived TME features significantly correlated with the gene expression of biological pathways. For example, transcriptional activation of both the T-cell receptor and programmed cell death protein 1 pathways positively correlated with the density of detected lymphocytes in tumor tissues, while expression of the extracellular matrix organization pathway positively correlated with the density of stromal cells. In summary, we demonstrate that the spatial organization of different cell types is predictive of patient survival and associated with the gene expression of biological pathways.
These findings present a deep learning-based analysis tool to study the TME in pathology images and demonstrate that the cell spatial organization is predictive of patient survival and is associated with gene expression.
See related commentary by Rodriguez-Antolin, p. 1912
With the advance of technology, hematoxylin and eosin (H&E)-stained tissue slide scanning has become a routine clinical procedure, which produces pathology images that capture histologic details in high resolution. Pathology images of tumor tissues contain not only essential information for tumor grade and subtype classifications (1), but also information on the tumor microenvironment (TME), such as the spatial organization of different types of cells. Cell spatial organization reveals cell growth patterns and the spatial interactions among different types of cells, which provide important insights into tumor progression and metastasis. A recent study by Cheng and colleagues (2) developed an algorithm to segment the cell nucleus and extract the topological features for TME analysis in renal cell carcinoma (RCC), which improved the understanding of cell spatial organization and patient outcome in RCC. However, this study used an unsupervised approach that assigns an identified cell nucleus to one of the clusters without a clear definition, which hampered interpretation of the results. Recent studies (3, 4) showed that the spatial organization and architecture of tumor-infiltrating lymphocytes (TIL) play important roles in the TME. However, these studies focused solely on recognizing lymphocytes and ignored other types of cells, which greatly limited exploration of the interactions among different types of cells. In this study, we developed a deep learning-based algorithm to examine standard H&E pathology images to automatically segment and classify different types of cell nuclei. This algorithm can be used as a tool to “computationally stain” different types of cell nuclei, to facilitate pathologists in examining tissue images, and researchers in studying the TME from these standard clinical materials.
The major cell types in a malignant tissue include tumor cells, stromal cells, lymphocytes, and macrophages. Stromal cells are mainly connective tissue cells such as fibroblasts and pericytes. The interactions between tumor cells and stromal cells play an important role in cancer progression (5–7) and metastasis inhibition (8). Because the cell boundaries of tumor cells and stromal cells are often unclear in standard H&E–stained lung cancer pathology images, we segmented and classified cell nuclei instead of whole cells. TIL are mainly white blood cells that have migrated into a tumor region. They are a mix of different types of cells, in which T cells are the most abundant population. The spatial organization of TILs has been associated with patient outcome and molecular profiles in multiple tumor types (9–12). Macrophages are inflammatory cells, and inflammation in tumor niches has been reported as a prognostic marker and correlated with tumor progression (13, 14). Other tissues and cellular structures existing in the TME include blood vessels (15, 16) and necrosis. In this study, blood cells and karyorrhexis are segmented to represent blood vessels and necrosis, respectively, to quantify blood vessels and necrosis and study their interactions with tumor cells, stromal cells, lymphocytes, and macrophages.
In this study, we developed a deep learning algorithm, histology-based digital (HD)-staining, using the Mask Regional Convolutional Neural Network (Mask-RCNN) architecture (17). We trained this HD-Staining model using pathology images of patients with lung adenocarcinoma from the National Lung Screening Trial (NLST) study with the nuclei of tumor cells, stromal cells, lymphocytes, macrophages, blood cells, and karyorrhexis manually labeled by expert pathologists. Through the training step, the HD-Staining model automatically learned to identify different nuclei based on a wide range of feature maps, including color, size, and texture within neighborhood area. The model accuracy was validated in a different set of images. From the identified cell types and cell spatial locations, we derived cell spatial organization features to characterize the TME. In our analysis, we found that these TME-related image features were significantly associated with patient overall survival. On the basis of these image features, a prognostic model for patients with lung adenocarcinoma was developed from the NLST dataset. This model was independently validated in the pathology image data from The Cancer Genome Atlas (TCGA) lung adenocarcinoma (LUAD) dataset, in which the predicted high-risk group showed significantly worse survival than the low-risk group (P = 0.001), with a HR of 2.23 (1.37–3.65) after adjusting for clinical variables.
Furthermore, the image-derived TME features significantly correlated with the gene expression of biological pathways. For example, transcriptional activation of both the T-cell receptor (TCR) and programmed cell death protein 1 (PD-1) pathways positively correlated with the density of detected lymphocytes in tumor tissues, while expression of the extracellular matrix organization pathway positively correlated with the density of stromal cells.
In this study, we developed a HD-Staining algorithm for nuclei segmentation and cell classification as a tool to study the tumor morphologic microenvironment using tissue pathology images in lung adenocarcinoma. To facilitate usage of this deep learning tool, a user-friendly web portal has been developed and can be accessed at http://lce.biohpc.swmed.edu/maskrcnn/analysis.php. Although this tool was developed in lung adenocarcinoma pathology images, our results showed that the HD-Staining method can be adapted and applied in head and neck cancer, breast cancer, and lung cancer squamous cell carcinoma pathology image datasets. The web portal also provides functions to facilitate researchers in adapting the HD-Staining model for other types of cancers.
Materials and Methods
Pathology images that support the findings of this study are available online in NLST (https://biometry.nci.nih.gov/cdas/nlst/) and TCGA-LUAD, https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD. mRNA expression data for the TCGA dataset are available online at http://firebrowse.org. The H&E–stained pathology images together with the corresponding clinical data were obtained from the NLST and TCGA lung adenocarcinoma cohorts: 208 40× pathology images for 135 patients with lung adenocarcinoma were acquired from the NLST dataset, and 431 40× pathology images for 372 patients with lung adenocarcinoma were acquired from the TCGA-LUAD dataset (there could be multiple pathology images for a single patient). To refine our analysis within TME, a specialized lung cancer pathologist, Lin Yang, labeled the tumor region of interest (ROI) for each of the pathology images (Fig. 1). Another lung cancer pathologist, Dr. Adi Gazdar, confirmed the labeling. Shirley Yan, another lung cancer pathologist, annotated the lung adenocarcinoma histology subtypes for the NLST dataset. Clinical characteristics of the patients in this study are summarized in Supplementary Table S1.
Nuclei segmentation using HD-Staining
Training, validation, and testing sets preparation
To construct the training set for the HD-Staining algorithm, 127 image patches (500 × 500 pixels) from 39 pathologic ROIs (Fig. 1) were extracted from the NLST dataset. In these patches, different types of cell nuclei were labeled. All the pixels with tumor nuclei, stromal nuclei, lymphocyte nuclei, macrophage nuclei, red blood cells, and karyorrhexis were labeled according to their categories and all the remaining pixels were considered “other.” These labels, also collectively called the “mask,” were then used as the ground truth to train the HD-Staining model. The labeled images were randomly divided into training, validation, and testing sets. To ensure independence among these datasets, image patches from the same ROI were assigned together. More than 12,000 cell nuclei were included in the training set (tumor nuclei 24.1%, stromal nuclei 23.9%, lymphocytes 29.5%, red blood cells 5.8%, macrophages 1.5%, and karyorrhexis 15.2%), while 1,227 and 1,086 nuclei were included in the validation and testing sets, respectively.
A deep learning model was developed using the Mask-RCNN architecture. A Keras version by Waleed Abdulla was implemented (https://github.com/matterport/Mask_RCNN, last accessed on Oct 1, 2019). While the native structure of Mask-RCNN was kept, we adapted the implementation for pathologic image analysis by customizing data loader, image augmenter, image centering, and scaling. The model pretrained with Coco dataset (https://github.com/matterport/Mask_RCNN/releases) was fine-tuned on our training dataset from the NLST study. Images were standardized (centered and scaled to have zero mean and unit variance) for each RGB channel. To increase generalizability and avoid bias from different H&E-staining conditions, we performed extensive augmentations on the image patches. In particular, random projective transformations were applied to images and their corresponding masks; each image channel was randomly shifted using linear transformation (18). For the training process, the batch size was set to 2, the learning rate was set to 0.01 and decreased to 0.001 after 500 epochs, the momentum was set to 0.9, and the maximum number of epochs to train was set to 1,000. In the validation set, the model trained at the 707th epoch reached the lowest loss. This model was selected and used in the following analysis to avoid overfitting. Python (version 3.5.2) and python libraries (Keras, version 2.1.5; openslide-python, version 1.1.1; and tensorflow-gpu, version 1.8.0) were used (19).
Segmentation performance evaluation
Because the HD-Staining model simultaneously segments and classifies cell nuclei, three criteria were used to evaluate the segmentation performance in the validation and testing datasets, respectively. First, detection coverage was calculated as the ratio between the detected nuclei and the total ground truth nuclei. Each ground truth nucleus was matched to a segmented nucleus, which generated the maximum intersection over union (IoU). If the IoU for a ground truth nuclei was >0.5, this nuclei was labeled “matched;” otherwise it was labeled “unmatched.” Second, nuclei classification accuracy was determined for the matched nuclei by comparing the predicted nucleus type with the ground truth. Third, segmentation accuracy was evaluated by the IoUs, which were calculated for each detected nucleus and averaged in different nuclei categories.
Image feature extraction to describe nuclei composition and organization
To make the HD-Staining model more computationally efficient while retaining a good representation of each ROI, instead of applying the model to the whole slide, 100 image patches (1,024 × 1,024 pixels) were randomly sampled and analyzed for each pathologist-labeled ROI (Supplementary Fig. S1A). These 100 image patches provided good coverage of each ROI (Supplementary Fig. S1B). Nuclei were then segmented and classified through the HD-Staining model developed from this study (Fig. 1). To characterize the spatial organization of cells using a graph, we calculated the centroids of nuclei and used them as vertices to construct a Delaunay triangle graph for each image patch. The Delaunay triangle graph connects nuclei into a graph, and the number of connections and the average length (i.e., spatial distance) between two types of nuclei summarize the spatial organization of different types of cell. Because six nucleus categories were included in this study, the edges of the graph were classified into 21 categories [i.e., 6 × (6+1)/2 = 21] according to their vertex pairs. For each image patch, the number of connections (i.e., edges) for different categories was counted (which added up to 21 features), the lengths of the connections were averaged for each edge category (yielding another 21 image features), and the density of each type of nucleus was calculated (yielding six image features). In total, 48 image features were extracted. The image features were averaged across the 100 patches for each ROI in the pathology image. When two or more pathology slides were available for 1 patient, the features from the slides were averaged for each patient. Thus, in total 48 image features were extracted for each patient, in both the NLST and TCGA datasets.
Prognostic model development and validation
Overall survival, defined as the date of diagnosis till death or last contact, was used as the response variable for survival analyses. A Cox proportional hazard (CoxPH) prognostic model for overall survival was developed in the patients with lung adenocarcinoma in the NLST dataset and independently validated in the TCGA-LUAD dataset. An elasticnet penalty was used to avoid overfitting for the 22 features selected in the final CoxPH model (Supplementary Table S3). Given a set of the 22 image-derived TME features for each patient, the prognostic model will calculate a risk score for the patient by summarizing the products between features and corresponding coefficients, with a higher risk score indicating worse prognosis. The median risk score was used as the cutoff. The patients with a risk score higher than the median risk score were predicted as the high-risk group, and otherwise as the low-risk group. The survival curves of the predicted high- and low-risk groups were estimated using the Kaplan–Meier method. The survival differences between predicted high- and low-risk groups were compared using a log-rank test. A multivariate CoxPH model was used to determine the prognostic value of predicted risk groups (using image-derived TME features) after adjusting for other clinical characteristics, including age, gender, smoking status, and stage. R software, version 3.4.2, and R packages (survival, version 2.41-3; glmnet, version 2.0-13; and spatstat, version 1.55-1) were used (20, 21). The results were considered significant if the two-tailed P < 0.05.
Association analysis between image features and gene expression of biological pathways
Gene expression data of 372 patients from the TCGA-LUAD dataset were downloaded and preprocessed: the genes whose mRNA expression levels were 0 in more than 20% of patient samples were removed. The correlation between mRNA expression levels and image-derived TME features was evaluated using Spearman rank correlation. Gene set enrichment analysis (GSEA) was performed for each TME feature. All gene sets from the Reactome database were used (22). For multiple testing correction, Benjamini–Hochberg (BH)-adjusted P values were used to detect significantly enriched gene sets. Gene sets with BH-adjusted two-tailed P < 0.05 were regarded as significantly enriched. R packages Hmisc (version 4.1-1), fgsea (version 1.4.1), and gplots (version 3.0.1) were used (23).
HD-Staining simultaneously and accurately classifies and segments cell nuclei
The developed HD-Staining model segments and classifies individual nuclei at the same time (Supplementary Fig. S2A–S2C). Figure 2A demonstrates some of the segmentation and classification results. In total, the segmented cell nuclei were classified into six categories: tumor cell, stromal cell, lymphocyte, macrophage, red blood cell, and karyorrhexis, and all the remaining structures or spaces were considered background. Different nuclei were colored according to the predicted categories (Fig. 2A). For detected objects, the overall classification accuracy was 85% and 85% in the validation set and the testing set, respectively, while the accuracy for tumor nuclei was 88% in validation and 90% in testing, respectively (Fig. 2B). The stability of classification accuracy was further evaluated and validated in the five major LUAD subtypes: lepidic, papillary, acinar, micropapillary, and solid (Supplementary Figs. S3A–S3E and S3; ref. 24). It is noteworthy that the developed HD-Staining model can be applied to the entire digital pathology image to generate a cell spatial organization map across the whole slide, where tumor region and lymphocyte infiltration areas are clearly illustrated (Fig. 3), hinting at its potential ability to assist pathologic diagnosis.
To further validate the cell classification accuracy, nuclei detection results on H&E images were compared with IHC-stained images from the consecutive cut slides of the same sample (Supplementary Figs. S4–S6). From these consecutive cut slides of the same tumor tissue, we observed consistent pattern between the predicted lymphocyte from HD-Staining model and the IHC-stained CD3, a marker for T cells. Similarly, we observed consistency between the predicted macrophage and IHC-stained CD68, a marker for cells in monocyte lineage. These comparisons with IHC staining validated the cell classification accuracy of the HD-Staining model.
Prognostic value of nuclei composition and organization in the TME
A Delaunay triangle graph was constructed for each image patch (2) to extract topological features from nuclei spatial organization to characterize the TME (Supplementary Fig. S7). The nuclei and edges were counted and edge lengths averaged to yield 48 image features (see Materials and Methods Section). Supplementary Table S2 summarizes the TME features that significantly correlated with survival outcome in univariate analysis. It shows that higher karyorrhexis density, more karyorrhexis–karyorrhexis connections and more karyorrhexis–red blood cell connections were associated with worse survival outcome, which was expected as these features indicate a higher rate of tumor necrosis. Furthermore, higher stromal nuclei density and more stromal–stromal connections were associated with better survival outcome, which agreed with our current knowledge that more stromal tissues corresponds to better prognosis.
A prognostic model based on the image features was developed in the NLST dataset and then independently validated in the TCGA-LUAD dataset. Figure 4 shows the survival curves of the predicted high- and low-risk groups in the TCGA cohort, where the patients in the predicted high-risk group show significantly worse survival than those in the predicted low-risk group (log-rank test, P = 0.0011). Furthermore, the risk group defined by the TME features serves as an independent prognostic factor [high- vs. low-risk, HR, 2.23; 95% confidence interval (CI), 1.37–3.65; and P = 0.0013], after adjusting for clinical variables, including age, gender, smoking status, and stage (Table 1).
|TCGA dataset (n = 371) .||HR (95% CI) .||P .|
|High- vs. low-risk||2.23 (1.37–3.65)||0.0013|
|Age (year)||1.02 (0.99–1.05)||0.097|
|Male vs. female||0.90 (0.55–1.49)||0.70|
|Smoker vs. nonsmoker||1.09 (0.66–1.81)||0.73|
|Stage II||2.36 (1.28–5.30)||0.0029|
|Stage III||4.59 (2.44–8.63)||<0.001|
|Stage IV||3.30 (1.30–8.41)||0.012|
|TCGA dataset (n = 371) .||HR (95% CI) .||P .|
|High- vs. low-risk||2.23 (1.37–3.65)||0.0013|
|Age (year)||1.02 (0.99–1.05)||0.097|
|Male vs. female||0.90 (0.55–1.49)||0.70|
|Smoker vs. nonsmoker||1.09 (0.66–1.81)||0.73|
|Stage II||2.36 (1.28–5.30)||0.0029|
|Stage III||4.59 (2.44–8.63)||<0.001|
|Stage IV||3.30 (1.30–8.41)||0.012|
Association between image features and transcriptional activity of biological pathways
GSEA was performed to identify the biological pathways whose mRNA expression profiles significantly correlated with image-derived TME features in the TCGA dataset. Figure 5 and Supplementary Fig. S8 show examples of these biological pathways. For example, the transcriptional activation of both the TCR and PD-1 pathways positively correlated with lymphocyte density in the tumor tissue (Fig. 5A). This observation is consistent with previous reports that genes involved in the TCR and PD-1 pathways are expressed in immune cells (25, 26). In addition, expression of the extracellular matrix organization gene set, for which fibroblasts act as an important source (27), positively correlated with stromal cell density in tumor tissue (Supplementary Fig. S8A). In a negative control experiment where we randomly shuffled the patient IDs and repeated the same analysis, such correlation was no longer observed (Supplementary Fig. S9).
Furthermore, GSEA analysis showed that the cell-cycle pathway was significantly enriched with genes whose expression levels correlated with both the tumor nuclei density (Fig. 5B) and karyorrhexis density in tumor issue (Supplementary Fig. S8B). To look into the relationship between tumor cell density and the gene expression of the cell-cycle pathway, we grouped and sorted the patients in the TCGA-LUAD cohort according to their tumor nuclei density. Figure 5C shows, for each patient group in the TCGA-LUAD dataset, the average expression levels of genes within the cell-cycle pathway and whose expression levels significantly (P < 0.001) correlated with tumor nuclei density. Positive correlations between gene expression and tumor nuclei density can be observed for most of the cell-cycle–related genes, except for one gene, POLD4, which showed an inverse trend. Most of the genes in the cell-cycle pathway have higher expression in tumors with higher tumor nuclei density (may be a higher grade of tumor), while POLD4 shows the opposite pattern. This pattern of POLD4 compared with other genes in the cell-cycle gene set is consistent with a previous study of lung cancer (28): while most cell-cycle genes were upregulated in lung cancer, POLD4 was usually downregulated.
Webserver for publicly accessible pathologic image segmentation model
To facilitate usage of the HD-Staining model developed in this study, we also developed an online tool (http://lce.biohpc.swmed.edu/maskrcnn/analysis.php) for this deep learning-based nuclei segmentation and classification model (Fig. 6). This tool requires only a pathology image (or a patch from the image) as the input (Fig. 6A). Each uploaded input image will be assigned a job ID (Fig. 6B). The segmentation results will be automatically displayed and the spatial coordinates of each nucleus can be downloaded as an Excel table (Fig. 6C). To assist researchers in using this tool to study TME-related features for other cancer types, we also provide a function to automatically generate a mask for other cancer types. The newly generated segmentation mask can greatly reduce the manual work of creating the training sets for other cancer types, and thus accelerate the development of applications for pathology image analysis.
In this study, we developed a deep learning-based analysis tool to study the TME using standard H&E–stained pathology images. This tool successfully visualized and quantified the spatial organization of tumor cells, stromal cells, lymphocytes, inflammatory cells, red blood cells, and karyorrhexis in the tumor tissues of patients with lung adenocarcinoma. The topological features of cell spatial organization were used to characterize the TME. Our results showed that these features were associated with patient survival outcome and the gene expression of biological pathways. From these image-derived TME features, we developed a prognostic model for patients with lung adenocarcinoma and independently validated it in another lung adenocarcinoma patient cohort. The prognostic model predicts patient survival independent of other clinical variables in the validation cohort.
Several previous studies have tried to analyze the TME and discovered prognostic image features. However, these studies involved time-consuming hand labeling by pathologists (29–31). In contrast, we developed a fully automated and objective nuclei segmentation and classification strategy. In addition, this deep learning–based method enables the segmentation of nuclei within a whole-slide image, including small biopsy samples. Because the number of cells in a whole-slide image could be tremendous (∼2 × 106 on average), manually labeling all of them is impractical. Thus, this deep learning method empowers quantification of the TME across the whole-slide image. Furthermore, although developed in lung adenocarcinoma, this method can be easily generalized to other cancer types by retraining the model using the tools provided by our web portal.
In pathology image analysis, three-dimensional tissue structures are captured as two-dimensional images, and the cell nuclei may “touch” and “overlap” each other in the resulting images. This is one of the major challenges for nuclei segmentation in pathology image analysis. In this study, we developed a HD-Staining model to segment and classify different types of cell nuclei to study the spatial interactions among different cell types and tissue structures. Compared with other image segmentation algorithms, the HD-Staining method has several advantages: first, it segments and classifies nuclei at the same time, while traditional nuclei segmentation algorithms relying on color deconvolution cannot classify cell types (2, 32). Second, by using extensive color augmentation during the training process, it adapts to different staining conditions, which makes the algorithm more robust and allows us to avoid the time-consuming color normalization steps (33). Third, compared with traditional statistical approaches, deep learning–based approach does not require handcrafted feature extraction, can be highly parallel, and saves time. With GPU-aided computation, processing (classifying or segmenting) a 1000-by-1000 pixel image usually takes less than one second for HD-Staining, much faster than nondeep learning–based image segmentation methods (34). Fourth, compared with other popular semantic image segmentation neural networks such as Fully Convolutional Network, SegNet, and Deeplab (35–37) that classify each pixel, HD-Staining is intrinsically an instance segmentation algorithm that detects an object bounding box first and assigns pixels as foreground or background within this bounding box (17). In summary, HD-Staining provides a new solution to segmenting closely clustered nuclei in tissue pathology images.
The associations between the extracted TME features and patient prognosis were evaluated in this study. Karyorrhexis, a representative of necrosis, has been reported as an aggressive tumor phenotype in lung cancer (38). Consistently, the density of karyorrhectic cells and numbers of karyorrhexis–karyorrhexis edges were shown as negative prognostic factors in this study. On the other hand, the density of stromal cells and the numbers of stromal cell–stromal cell edges were positive prognostic factors, which is consistent with a recent report on patients with lung adenocarcinoma (8). These consistencies indicate the validity of this Mask-RCNN–based deep neural network and the potentiality of using cell organization features as novel biomarkers for clinical outcomes. Currently there are many lung cancer prognosis models using clinical data, such as nomogram models (39, 40). In this study, we showed that the image-based model has the prediction power to separate patients with significant survival differences after adjusting for clinical variables. It indicates that a prognostic model with both clinical variables and pathology image–based risk score could better predict patient prognosis than using clinical variable alone. In addition, many genomic signatures have been developed in recent years for lung cancer patient prognosis (41). Comparing with genomic signature, the pathology image–based prediction model can be more easily integrated into the current clinical practice, as the pathology image–based diagnosis is a part of routine clinical procedures. Thus, modeling using imaging data has advantages in the aspect of clinical practice. Furthermore, tumor genomic profile and pathology images provide complementary information and characterization of the tumor. Therefore, it will be potentially powerful to integrate pathology images, genomic data, and clinical data for better characterization of tumor and prediction of patient outcomes.
Gene expression patterns have been widely used to study the underlying biological mechanisms of different tumor types and subtypes (42, 43); moreover, genes with abnormal expression could become potential therapeutic targets of cancers (44, 45). However, traditional transcriptome profiling is usually done in bulk tumor (42, 46), which contains multiple cell types, such as stromal cells and lymphocytes, in addition to tumor cells. This bulk tumor–based sequencing could blur or diminish the mRNA expression changes arising from a single-cell type or from different cell compositions in the TME. Currently, the relationship between the transcription activities of biological pathways and the TME remains unclear. In this study, the image-derived TME features show interesting correlations with the transcriptional activities of biological pathways. For example, gene expression levels of TCR and PD-1 pathways positively correlated with the density of lymphocytes detected from tumor tissues. As genes involved in the TCR and PD-1 pathways are expressed in immune cells (25, 26), such correlation illustrates the contribution of lymphocytes to bulk tumor transcriptome profiling and thus validates the accuracy of both image-based nuclei detection and genetic sequencing of bulk tumor. This indicates the image-derived TME features may be used to study or predict immunotherapy response, because several promising cancer immunotherapies rely on activation of tumor-infiltrated immune cells and blocking immune checkpoint pathways (26, 47). In addition, the gene expression extracellular matrix organization pathway is associated with the density of stromal cells in tumor tissues. Because traditional transcriptome sequencing is done in bulk tumor, accurate cell composition derived from pathology images could help to improve the evaluation of gene expression for each individual cell type. Moreover, the correlation between image features and transcriptional patterns of biological pathways hints at the potential usage of image features to study tumor bioprocesses, including cell cycle and metabolism status.
There are some limitations to this HD-Staining model. First, information on the individual nuclei, such as nucleus shape and size, was not considered because this study focused on nuclei organization. Morphologic and intensity features of nuclei have been reported as prognostic factors, which can be automatically extracted using this nuclei segmentation algorithm (48). Second, some special structures, such as bronchi and cartilage, were not included in this algorithm. This study handled this problem by avoiding such structures during ROI annotation. However, a more comprehensive training set would be desirable for whole-slide analysis. Moreover, to distinguish different functional subtypes of lymphocytes, stromal cells (49), and macrophages (50), tissue slides, which are sequentially stained with H&E and IHC of specific markers would be needed as a training set. Distinguishing functional cell subtypes would further illustrate which subtypes play predominant roles in patient prognosis. Third, to represent whole-tumor heterogeneity, more than one slide, such as tissue microarray, should be collected and analyzed per tumor.
Disclosure of Potential Conflicts of Interest
J. Minna has received royalties from the NIH and University of Texas Southwestern Medical Center for Licensing Cell Lines. No potential conflicts of interest were disclosed by the other authors.
Conception and design: S. Wang, J. Fujimoto, L. Yang, X. Zhan, Y. Xie, G. Xiao
Development of methodology: S. Wang, R. Rong, E.R. Parra, B. Yao, I.I. Wistuba, Y. Xie, G. Xiao
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S. Wang, J. Fujimoto, C. Behrens, E.R. Parra, B. Yao, I.I. Wistuba, Y. Xie, G. Xiao
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Wang, L. Cai, E.R. Parra, B. Yao, J. Minna, Y. Xie, G. Xiao
Writing, review, and/or revision of the manuscript: S. Wang, R. Rong, D.M. Yang, L. Yang, E.R. Parra, B. Yao, L. Xu, T. Wang, I.I. Wistuba, J. Minna, Y. Xie, G. Xiao
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Wang, R. Rong, J. Fujimoto, L. Yang, D. Luo, B. Yao, Y. Xie, G. Xiao
Study supervision: D.M. Yang, G. Xiao
Other (review the virtual slides and subtype adenocarcinoma): S. Yan
This work was partially supported by the NIH (5R01CA152301, P50CA70907, 1R01GM115473, and 1R01CA172211), and the Cancer Prevention and Research Institute of Texas (RP190107 and RP180805). We thank the late Dr. Adi Gazdar for his critical inputs and discussion throughout this project, and for confirming the annotation of the pathology images. We would like to thank Jessie Norris for helping us to edit this article and Dr. Justin Bishop for support of the SPORE Pathology Core.