Abstract
The development of immune checkpoint–based immunotherapies has been a major advancement in the treatment of cancer, with a subset of patients exhibiting durable clinical responses. A predictive biomarker for immunotherapy response is the preexisting T-cell infiltration in the tumor immune microenvironment (TIME). Bulk transcriptomics–based approaches can quantify the degree of T-cell infiltration using deconvolution methods and identify additional markers of inflamed/cold cancers at the bulk level. However, bulk techniques are unable to identify biomarkers of individual cell types. Although single-cell RNA sequencing (scRNA-seq) assays are now being used to profile the TIME, to our knowledge there is no method of identifying patients with a T-cell inflamed TIME from scRNA-seq data. Here, we describe a method, iBRIDGE, which integrates reference bulk RNA-seq data with the malignant subset of scRNA-seq datasets to identify patients with a T-cell inflamed TIME. Using two datasets with matched bulk data, we show iBRIDGE results correlated highly with bulk assessments (0.85 and 0.9 correlation coefficients). Using iBRIDGE, we identified markers of inflamed phenotypes in malignant cells, myeloid cells, and fibroblasts, establishing type I and type II interferon pathways as dominant signals, especially in malignant and myeloid cells, and finding the TGFβ-driven mesenchymal phenotype not only in fibroblasts but also in malignant cells. Besides relative classification, per-patient average iBRIDGE scores and independent RNAScope quantifications were used for threshold-based absolute classification. Moreover, iBRIDGE can be applied to in vitro grown cancer cell lines and can identify the cell lines that are adapted from inflamed/cold patient tumors.
Introduction
The Cancer Genome Atlas (TCGA) has been a transformative tool in our understanding of the tumor immune microenvironment (TIME; ref. 1). Using cell-type metagenes/signatures or via deconvolution methods, the levels of T-cell infiltration can be approximated in TCGA and other bulk gene expression data (2–4). We and others have reported several highly correlated gene signatures that measure CD8+ T-cell infiltration in the TIME, including the Immunological Constant of Rejection (ICR), various interferon-γ (IFNG) signatures and the tumor immune signature (TIS; refs. 5–7). Because preexisting T-cell levels have emerged as a predictive marker of response to checkpoint-based immunotherapy (8–13), these methods for quantifying immune cell populations are relevant to understanding the TIME within and across cancer cohorts.
In bulk sampling datasets, however, estimates of many immune cell types correlate with each other and it is a challenge to dissect the contributions of individual subpopulations of the TIME (13). In addition, TCGA and other bulk gene expression datasets measure an average signal of all cells in the sample and are therefore limited in the granular analysis of the TIME. Together, these limitations suggest that bulk sampling misses the heterogeneity of the TIME as a complex system and overlooks the potential contributions of rare or less well-defined immune cell types.
To address these limitations, we have developed a data-driven method, iBRIDGE (Identifying Biological Relationships in multi-Dimensional Genomics Entities), which can identify patients with inflamed or cold tumors by integrating the malignant portion of given single-cell RNA sequencing (scRNA-seq) data with TCGA bulk RNA-seq data for the corresponding cohort. We have tested iBRIDGE in multiple scRNA-seq datasets across cancer cohorts and validated against datasets with matching bulk component. In addition, we show that the iBRIDGE method can be applied to cell lines and can identify those that are adapted from inflamed/cold cancers. In this way, we report the cell type–specific markers of inflamed/cold cancers in myeloid, stromal, and malignant compartments.
Methods
Human studies
All patient data were derived from public sources and other publications. The ethical approvals and declarations are listed in the original publications (Table 1).
Dataset_ID . | Reference . | Dataset_Type . | Input Data . | Cohort . | Patient_Count . | Cell_Count . |
---|---|---|---|---|---|---|
GSE131907 | 25 | scRNAseq-10X | Raw UMI counts | LUAD | 58 | 208506 |
GSE132465 | 26 | scRNAseq-10X | Raw UMI counts | COAD | 23 | 47285 |
GSE176078 | 28 | scRNAseq-10X | Raw UMI counts | BRCA | 26 | 100064 |
GSE176078 | 28 | Bulk RNAseq | Raw counts | BRCA | 24 | NA |
GSE178341 | 33 | scRNAseq-10X | Raw UMI counts | COAD | 62a | 371223a |
GSE85507 | 31 | Bulk Microarray | RMA normalized | Multi | 15 | NA |
GSE85507 | 31 | In vitro Microarray | RMA normalized | Multi | 16 | NA |
CCLE | 32 | In vitro RNAseq | Raw counts | Multi | 838 | NA |
TCGA | 1 | Bulk RNAseq | Raw counts | Multi | 10079 | NA |
Dataset_ID . | Reference . | Dataset_Type . | Input Data . | Cohort . | Patient_Count . | Cell_Count . |
---|---|---|---|---|---|---|
GSE131907 | 25 | scRNAseq-10X | Raw UMI counts | LUAD | 58 | 208506 |
GSE132465 | 26 | scRNAseq-10X | Raw UMI counts | COAD | 23 | 47285 |
GSE176078 | 28 | scRNAseq-10X | Raw UMI counts | BRCA | 26 | 100064 |
GSE176078 | 28 | Bulk RNAseq | Raw counts | BRCA | 24 | NA |
GSE178341 | 33 | scRNAseq-10X | Raw UMI counts | COAD | 62a | 371223a |
GSE85507 | 31 | Bulk Microarray | RMA normalized | Multi | 15 | NA |
GSE85507 | 31 | In vitro Microarray | RMA normalized | Multi | 16 | NA |
CCLE | 32 | In vitro RNAseq | Raw counts | Multi | 838 | NA |
TCGA | 1 | Bulk RNAseq | Raw counts | Multi | 10079 | NA |
Note: Accession IDs, authors, references, weblinks, number of patients/samples, number of cells (if single-cell) and profiled cancer cohort(s), are included in the table.
a9 patient subset with RNAscope data is used.
Data collection
Download and processing of TCGA data were performed as described before (RRID:SCR_00319313; ref. 13). To identify markers of inflamed/cold tumors in bulk, we sorted the samples by single sample gene set enrichment (GSE) values for ICR signature and performed differential expression (DE) analysis between top and bottom tertiles for each cohort using DESeq2 method.
Datasets from Gene Expression Omnibus (GEO) database (RRID:SCR_005012) were downloaded in the form of raw counts or Robust Multiarray Average (RMA) normalized values for scRNA-seq and the microarray platforms, respectively.
RNA-seq raw count data from Cancer Cell Line Encyclopedia (CCLE) were downloaded from DepMap portal (RRID:SCR_013836; https://sites.broadinstitute.org/ccle/datasets/).
Data processing
All data processing was done in R version 4.1.1 except when indicated. The base functions that can replicate the analyses presented here, are published at https://github.com/tolgaturan-github/IBRIDGE in the form of an R-package with a workflow vignette. All R packages used and their versions as well as other supporting methods are included in Supplementary Methods File S1.
scRNA-seq
10X Genomics scRNA-seq datasets were processed starting from the raw unique molecular identifier (UMI) counts into SingleCellExperiment objects. First perCell and perGene QC metrics were calculated using the scater package. After low quality cells and the dropout genes (0 expression for all cells) were filtered, individual patient samples were merged using the Seurat (RRID:SCR_016341) package. The datasets were normalized using SCTransform function with default parameters (14). This step calculated data normalization, scaling and variable gene identification. For iBRIDGE workflow, the variable genes were accessed by VariableFeatures function of Seurat package. The principle component analysis dimension reduction were performed followed by the calculation of Uniform Manifold Approximation and Projection (UMAP) dimension reduction, k-nearest neighbor (KNN) identification (20 principal components were used, for UMAP and KNN) and graph-based clustering using Seurat functions.
RNA-seq and microarray
DE analyses between inflamed and cold TCGA patients or CCLE lines were performed using DESeq2 package (RRID:SCR_000154). To assign classes to cell lines, the RNA-seq data were normalized using EDASeq package (RRID:SCR_006751) followed by GSE calculation using yaGST package (https://github.com/miccec/yaGST; ref. 15). For the microarray dataset, limma package was used on RMA normalized values to identify the in vivo correlates of ICR.
Cell type identification
In the cases where, “cell type” meta-data were unavailable, we performed cell type identification using SingleR package (Blueprint and PrimaryCellAtlas references) and Bioturing CellType prediction algorithm (bioRxiv 2020.12.11.414136, ref. 16). The cell types were assigned to Seurat clusters. Although normal patient samples (i.e., tumor-adjacent) were removed, to exclude normal epithelial cell contamination possibility, we used infercnv R package (RRID:SCR_021140) to cross-check the cell-type assignments with 0.1 as cutoff setting and fibroblasts as reference cell population (Broad Institute; ref. 17).
GSE
GSE calculation for each cell was performed using AUCell package (RRID:SCR_021327; ref. 18). The thresholds were set using the level of top 10% of the features of the cell rankings matrix.
Calculation of bulk or “in vivo” ICR scores
Bulk RNA-seq datasets from TCGA or Microarray dataset from GEO were normalized by TPM (EDASeq) or by RMA, respectively. Single sample GSE scores for 20-gene ICR signature were calculated using yaGST method.
Calculation of single-cell or “in vitro” iBRIDGE scores
iBRIDGE method outputs 2 gene sets, “Inflamed” and “Cold” for each cell or cell line. We calculated GSE scores for each signature using AUCell package for single-cell and using yaGST for “in vitro” cell line datasets. We calculated single-cell iBRIDGE scores by the ratio of “Inflamed GSE” to the “Cold GSE” and collapsed by “mean” function when necessary.
Relative classification (tertile-based)
We calculated GSE scores for each cell or cell line using the two signatures generated by iBRIDGE and divided the continuous range into tertiles for each signature. We assigned a cell as “Inflamed” if the cell was in the top tertile for the “Inflamed” signature and not in the top tertile for the cold signature and vice versa for the “Cold” classification. Any cell that did not satisfy this was classified as “intermediate/unclassified”. When “inflamed” cell percentage for a given patient was at least twofold or more than “cold” cell percentage and the difference between percentages was at least 20% (to exclude low confidence classification), the patient was classified as “inflamed”. “Cold” patient classification was similarly done when “cold” percentages were higher with the same cutoffs. Patients not satisfying these criteria were assigned as “intermediate/unclassified”
Absolute classification
Two datasets with ground-truth were used: GSE176078 (matched single-cell and bulk RNA-seq) and GSE178341 (matched scRNA-seq and CD3E RNAscope). Average iBRIDGE scores were used as the continuous predictor for both datasets. The bottom tertile of bulk RNA-seq ICR GSE values was used as the binary reference for GSE176078. Patients lower than 5% CD3E positive cells were considered as the binary reference for GSE178341. The ROC curve was generated and Youden index for optimal threshold was identified using the ROCit R package.
DE analysis
DE analysis using scRNA-seq data was carried out using Seurat package, ‘FindMarkers’ function between inflamed and cold classes. The genes passing Padj < 0.05 and log2 (FC) > 0.3 in either direction, were considered significant. For bulk RNA-seq and microarray datasets limma package was used. The genes passing Padj < 0.01 and log2 (FC) > 1 in either direction, were considered significant.
Gene ontology analysis
Gene ontology (GO) analyses were performed using clusterProfiler package. Specifically, “Biological Process” annotation was queried with the input DE features using compareCluster function (pval threshold: 0.05). Top 10 (unless otherwise specified) significantly enriched ontologies were visualized using dotplot function.
Data access statement
The source code and installation instructions for iBRIDGE R package can be found publicly at GitHub (https://github.com/tolgaturan-github/IBRIDGE). Raw data availability: All raw data used in the study (weblinks, references, cohort sizes and other meta-data attributes) are listed in Table 1.
Results
scRNA-seq can profile each cell in the TIME and thus has the potential to identify biomarkers of an inflamed TIME at the cellular level. However, identification of patients with an inflamed TIME from scRNA-seq data is not straightforward. Comparison of immune cell counts, or percentages/frequencies might not be accurate due to potential biases in cell dissociation and library preparation efficiencies across patients/cell types. Therefore, a data-driven method to identify patients with inflamed/cold TIMEs is needed.
Integration of scRNA-seq datasets with TCGA bulk data
Malignant cells cluster by patient while other immune/stromal cells cluster by type
To identify inflamed/cold phenotypes from scRNA-seq data we focused on the malignant cells. We and others have observed that among the several components of the TIME, the malignant cell population is the only component that clusters by patient (Fig. 1A; refs. 19–21). In contrast, immune cells including lymphoid and myeloid cells, as well as stromal cells, cluster primarily by cell type not by patient (Fig. 1B). We quantified this and showed that malignant cells have the highest Euclidean or ‘1-Pearson coefficient’ pair-wise distances, in multiple datasets (Fig. 1C; Supplementary Fig. S1A and S1B). This observation was expected, due to the divergent patient-specific somatic alterations each malignant cell has and because of the effects of these alterations on the malignant cell expression profiles. Therefore, we hypothesized that the malignant population in a scRNA-seq dataset would have the potential to differentiate immune phenotypes identified in the bulk datasets.
To quantify tumor-infiltrating T cells in TCGA bulk data, we used ICR gene signature, which marks T helper 1 (Th1)–polarized cytotoxic immune infiltration in bulk gene expression datasets as reported before (Supplementary Fig. S2A and S2B; refs. 5, 7).
When integrating malignant scRNA-seq profiles with bulk gene expression two different approaches were initially considered: focusing on the features “specifically” expressed in malignant cells, and focusing on the features very highly expressed in malignant cells (21). However, both approaches were problematic for our aims. The first approach involves identifying the marker genes of the malignant fraction compared with the rest of single-cell population. Therefore, it requires comparing malignant cells to nonmalignant populations. Because various datasets have different nonmalignant compositions, this approach results in outputs heavily dependent on the composition of the dataset or the cohort. In addition, there are chemokines that mark an inflamed TIME like CCL5, CXCL9, and CXCL10 (22, 23), that are not only expressed in the malignant population but also by other cell types (i.e., myeloid cells). Therefore, selecting for features specifically expressed in malignant cells would miss important drivers of inflamed TIME. The second approach, which focuses on the most abundant transcripts of the malignant population is also suboptimal as it enriches for housekeeping and/or ribosomal genes that are uniformly expressed and therefore cannot differentiate between immune phenotypes.
Given the limitations of these two approaches, we developed the following strategy. Firstly, we focused on malignant scRNA-seq profiles and extracted the most highly variable features (top 3,000 with highest variance) in malignant cells. Then we intersected this list with the bulk (from the respective TCGA cohort) biomarkers of inflamed and cold cancers (Fig. 2A and B). We hypothesized that the resulting overlaps would account for the inflamed/cold phenotypes among the malignant cells (Fig. 2C and E), and this information could then be used to classify patients with scRNA-seq data.
To test and validate this approach, we focused on scRNA-seq datasets from 3 main solid tumor types—lung, colon, and breast—and employed an integrative method with the corresponding bulk TCGA cohort (Table 1).
iBRIDGE workflow
We developed iBRIDGE to act as a bridge between data types with the goal of using markers of inflamed and cold cancers from bulk data and superimposing these features to single-cell data. To identify bulk markers of inflamed and cold cancers, the ICR gene signature was used (5). ICR is composed of twenty transcripts and four functional categories: CXCR3/CCR5 chemokines (including CXCL9, CXCL10, CCL5), Th1 signaling (including IFNG, IL12B, TBX21, CD8A, STAT1, IRF1, CD8B), effector (including GNLY, PRF1, GZMA, GZMB, GZMH) and immune regulatory (including CD274, CTLA4, FOXP3, IDO1, PDCD1) functions. These features are highly correlated in bulk datasets and high levels of ICR expression mark a CD8+ T-cell infiltrated phenotype. To identify high-ICR patients, we calculated single-sample GSE. High- and low-ICR samples were determined by top and bottom tertiles of ICR GSE scores. The inflamed and cold markers were then identified using DE between high- and low-ICR patients.
As described above, to integrate these features with single-cell data, we focused on malignant cells. Specifically, we intersected the “highly variable” gene set from this population with the inflamed/cold markers from bulk RNA-seq data (Fig. 2). These gene sets were then used to apply single-cell GSE for each malignant cell. Using the single-cell GSE scores, we determined top and bottom tertiles for “inflamed” and “cold” directions. We labeled a cell as “Inflamed” if the cell was in the top tertile for the “inflamed” signature and not in the top tertile for the “cold” signature and vice versa for the “cold” classification. Any cell that did not satisfy this was classified as “intermediate/unclassified”. More details of the iBRIDGE workflow are available in the Methods section.
iBRIDGE method can identify patients with inflamed/cold tumors in scRNA-seq data
In the non–small cell lung cancer (NSCLC) dataset by Kim and colleagues, the authors profiled ∼210K cells using the 10X platform (Table 1; ref. 24). We processed data from this study by selecting the malignant cell population and normalizing the data with the SCTransform workflow. As described in Fig. 2 and above, we overlapped the biomarkers of inflamed and cold cancers from the TCGA–lung adenocarcinoma (TCGA-LUAD) dataset, with highly variable gene subset of the malignant scRNA-seq data. We hypothesized that the resulting overlaps (i.e., iBRIDGE gene sets, listed in Supplementary Table S1) defined ‘inflamed’ and ‘cold’ phenotypes in the single-cell space.
The data from individual patients showed minimal assignment of opposite classes (i.e., inflamed and cold malignant cells) for a given patient cluster (Fig. 3A). This enabled classification of patients in an unambiguous way. By assigning malignant clusters from each patient, polarizing in either the inflamed or cold direction above the set threshold, we assigned inflamed and cold phenotypes to patients. Then from the patients we assigned classes to each of the other cell populations (Fig. 2). These classes were then used to identify cell type–specific markers of inflamed/cold phenotypes.
iBRIDGE method can identify cell type–specific markers of inflamed/cold phenotypes
After identifying inflamed/cold patients, we carried out DE analyses between inflamed and cold cells in each subpopulation, (i.e., malignant cells, fibroblasts, myeloid cells, CD8+ T cells, CD4+ T cells, and regulatory T cells). The GO/Pathway analyses performed on significant differentially expressed genes identified ‘response to Type I IFN’ and ‘response to virus’ terms as enriched in inflamed phenotypes of both myeloid and malignant populations (Fig. 3B). In addition, ‘extracellular matrix organization’ GO term and other TGFβ-driven mesenchymal phenotypes were observed not only in the fibroblast subpopulation but also in the malignant cells. In addition, TGFB1 and collagen genes (including COL1A1, COL6A1/A2) were part of the markers that significantly associated with inflamed phenotypes in malignant cells. These findings show an innate-immune driven/IFNG responsive, as well as a mesenchymal phenotype in the malignant cells from inflamed patients and these programs were shared with myeloid cells and fibroblasts, respectively (Supplementary Tables S2–S4).
Next, we applied iBRIDGE to a recently published colorectal cancer (COAD) dataset with 23 patients (GSE132465; ref. 25). Using the iBRIDGE method, we identified 7 patients as inflamed and 2 as cold (Supplementary Fig. S3A). Malignant cells that were identified as ‘inflamed’, showed enrichment of ‘antigen presentation via MHC Class I’ and ‘response to IFNG’ GO terms (Supplementary Fig. S3B). The enrichment of the ‘response to IFNG’ gene set was consistent with the observation that highly inflamed (CD8+ T cell–infiltrated) tumors had abundant IFNG in their TIME (Supplementary Tables S5–S7).
As microsatellite instability (MSI) status is a known predictor immunotherapy response and is associated with highly infiltrated colorectal cancer TIMEs, we evaluated how MSI status related to the iBRIDGE classifications in this dataset (26). None of the 1,813 MSI-High malignant cells were identified as cold, while a majority were labelled as inflamed (1,249 of 1,813 cells) or intermediate/unclassified (564 cells, Supplementary Fig. S3C). This shows that the iBRIDGE method can identify the inflamed phenotype at the malignant single-cell level and reinforces the idea that tumor-intrinsic mechanisms and programs can shape the immune polarization of the TIME.
Validation of iBRIDGE with a dataset with bulk RNA-seq validation
Although GSE131907 and GSE132465 scRNA-seq datasets revealed the profiles of inflamed/cold cancers at single-cell levels, they lacked a comparator where we could validate our approach against paired bulk RNA-seq data. A third dataset, GSE176078 by Wu et al, was a dataset with matched scRNA-seq and bulk RNA-seq components from patients with breast cancer (BRCA; ref. 27). There were 20 patients with paired bulk and single-cell data. First, we applied ICR gene signature to the bulk component and identified patients with inflamed, cold, and intermediate TIMEs.
Then we applied iBRIDGE to the scRNA-seq section of the dataset (using TCGA-BRCA as the bulk reference) and identified inflamed, cold, and intermediate/unclassified malignant cells (Fig. 4A). To compare with bulk ICR enrichment scores we calculated a per-patient scRNA-seq ICR metric by taking the ratio of “inflamed iBRIDGE-GSE” to the “cold iBRIDGE-GSE” and collapsing the average per-cell values by patient. Correlations of scRNA-seq ICR metric with bulk ICR enrichments values resulted in r = 0.85 Pearson correlation coefficient (intermediate patients not included). Except for intermediate/unclassified patients, iBRIDGE identified all inflamed and cold patients faithfully (Fig. 4B).
Pathway analyses of GSE176078 in each cell type indicated that, like the NSCLC dataset (GSE131907), myeloid and malignant cells from the patients with an inflamed phenotype had higher ‘response to interferon-gamma’ and ‘antigen presentation via MHC-I’ GO enrichment scores, compared with the cold patients (Fig. 4C). The patients with an inflamed phenotype showed higher ‘neutrophil activation’ GO enrichment scores not only in the fibroblast population, but also, to a lower extent, in the myeloid and malignant populations as well. In this dataset, distinct from the malignant cells, the myeloid population had significantly higher ‘response to type I interferon’ GO term enrichment scores in the patients with an inflamed phenotype compared with the cold patients (Supplementary Tables S8–S10).
The differential analyses of inflamed/cold classes for each cell type revealed more conserved patterns in the programs that define an “inflamed” phenotype compared with the ones defining a “cold” phenotype (Supplementary Tables S2–S10). For example, across the 3 scRNA-seq datasets described, the overlapping subset among the top100 iBRIDGE-inflamed features were 17 genes (APOE, BIRC3, GBP2, HLA-DRA, HLA-DRB5, IGKC, IGLC2, IRF1, PSMB9, RARRES1, RARRES3, SAA1, SAA2, SOCS1, TNFAIP2, TYMP, WARS). While the only common element for top100 iBRIDGE-Cold genes was NEAT1, a noncoding immunomodulator (Supplementary Fig. S3D; Fig. 4D). In addition, the bulk markers of the “inflamed” cancers were reported to be more conserved compared with the “cold” markers in TCGA cohorts as well (13, 28).
Signatures that quantify T-cell infiltration levels, such as ICR, TIS or IFNG signatures, are well researched in bulk datasets (5–7, 28). To exclude the possibility that the signatures and the methods from bulk analyses may be sufficient to quantify immune infiltration levels in scRNA-seq data, we applied TCGA-BRCA ICR correlates of inflamed/cold cancers directly on the malignant single-cell data from GSE176078 without the application of iBRIDGE (bulk_only signatures). We observed that for a large fraction of cells (30%), the ‘bulk_only’ signatures scores were 0 (Supplementary Fig. S4A and S4B). This was expected as the markers of inflamed/cold cancers in bulk datasets are highly T-cell centric with no expression in the malignant tumor cells. These observations suggest that ‘bulk only’ signatures cannot reliably differentiate immune context in single-cell data.
Per-patient average iBRIDGE scores matched with the iBRIDGE patient classifications in all 3 datasets, while the percentages (or the counts) of CD8+ T cells did not always match with the inflamed/cold patient classes (Supplementary Figures S5A and S5B). This finding confirmed the need for a data-driven identification of patient phenotypes compared with a mere counting of the cells that pass cell dissociation, single-cell library preparation and quality-filtering steps.
Application of iBRIDGE on in vitro tumor cell line datasets
Mouse syngeneic cell line dataset, GSE85507
Next, we hypothesized that in vitro grown cancer cell lines could be analogous to malignant single cells, as they are grown in clones and are not contaminated by other cell types and therefore can be assessed by iBRIDGE. We acquired published bulk microarray profiles of in vitro mouse cell lines as well as in vivo tumors of each model grown in their syngeneic hosts (GEO Accession GSE85507; ref. 29). To apply iBRIDGE to this dataset, we replaced the TCGA bulk component (Fig. 2A) with the in vivo portion of the dataset and replacing the single-cell malignant component of the iBRIDGE with the in vitro grown cell lines. First, we calculated mouse ICR GSE scores and sorted the in vivo models by the level of ICR scores, as a surrogate for T-cell infiltration (Fig. 5A). Then we identified positive and negative gene expression correlates of ICR in the in vivo component. Separately, we identified highly variable feature subsets from the in vitro data. Overlapping the ICR correlates with the in vitro feature set yielded two gene sets as in the original iBRIDGE method. Then we calculated in vitro ICR metric by calculating the ratio of the GSE scores. In vitro ICR scores calculated by iBRIDGE correlated strongly with tumor in vivo ICR scores (Fig. 5B, Pearson correlation coef: 0.9) and identified all cell lines in ‘inflamed’, ‘intermediate’, and ‘cold’ cell line categories, accurately.
These findings not only validated the iBRIDGE approach in this second ground truth dataset but also showed that the iBRIDGE method is independent from the gene expression platform (i.e., RNA-seq vs. microarray) and can be applied to species other than humans.
CCLE
To extend these findings, we focused on the CCLE database (30). Because the CCLE database does not have an in vivo component, we used the corresponding TCGA cohorts as the in vivo data source of markers for inflamed and cold phenotypes. As the subset with the greatest number of cell lines, we started from the CCLE–lung samples (210 cell lines). The iBRIDGE approach resulted in 2 gene sets representing inflamed and cold phenotypes. The GSE values for these 2 gene sets had negative correlation, representing two opposite phenotypes (Fig. 5C). After calculating the GSE values for each gene set, we classified the cell lines into inflamed, intermediate/unclassified, and cold classes. We hypothesized that the cell lines in the inflamed/cold classes were likely to be adapted from cancer patients with matching TIME phenotypes.
The DE analysis between inflamed and cold cell lines resulted in ‘response to interferon-gamma’, ‘defense response to virus’ and ‘positive regulation of cytokine production’ as well as various NF-κB signaling GO terms, enriched in the cell lines adapted from inflamed TIMEs. Supporting the data from single-cell datasets, ‘extracellular matrix organization’ as well as elements of TGFβ signaling pathways were enriched in the cell lines derived from putatively inflamed patients (Fig. 5D). Overall, selected markers enriched in inflamed cell lines included STING1, IRF1, IRF7, TGFB1, STAT1, and CD274 (Supplementary Tables S11–S13 for full lists genes).
Next, we applied iBRIDGE to other CCLE disease cohorts with similar results. The datasets with most replicates were bladder, breast, colon, esophageal, head and neck, kidney, glioma, liver, ovarian, sarcoma, melanoma, stomach, and uterine cancers. When we applied the iBRIDGE method to these cell line cohorts, we observed enriched GO terms including ‘response to type I interferon’, ‘response to interferon−gamma’ as well as ‘extracellular matrix organization’, supporting the findings from scRNA-seq datasets described above (Supplementary Figures S6A and S6B).
Absolute classification using iBRIDGE
Tertile-based approaches result in a relative classification of patients. Therefore, the classified patients are on the poles of the relative T-cell infiltration levels. This approach is useful to identify the markers of inflamed/cold cancers in each cell type and it can sort the patients based on inflammation. However, it can fail to classify the patients correctly if the cohort is composed of exclusively cold or inflamed patients. Therefore, we used 2 datasets that had independent inflamed/cold assessment as ground truths.
GSE178341, a colon cancer scRNA-seq dataset, consists of a subset of 9 patients with matched in situ hybridization (CD3E RNAscope) assessment (31). We applied iBRIDGE to the malignant cells and calculated average iBRIDGE scores (average ratio of inflamed and cold enrichment scores per patient). iBRIDGE scores correlated with fractions of CD3E positive cells (correlation coef:0.7; Supplementary Figures S5C and S5D). In addition, iBRIDGE scores were significantly higher in the MSI-High patients compared with the microsatellite stable patients confirming the findings from the colon cancer dataset (GSE132465), described above.
To set an absolute classification threshold, we merged the above results with the output from GSE176078, where bulk RNA-seq was available (20 patients). Of 29 patients, average iBRIDGE scores performed with an area under the ROC curve (AUC) value of 0.92, illustrating the predictive value of iBRIDGE scores (Fig. 6A). The optimum point of the ROC curve where ‘False Positive Rate’ is minimum and ‘True Positive Rate’ is maximum; is the per-patient iBRIDGE score of 1.425. Therefore, this absolute threshold can classify patients when the cohort is not large enough or skewed to either cold or inflamed phenotypes. The results of the tertile-based classification described for the lung, colon, and breast cancer datasets, matched this threshold and all patients identified as inflamed or cold had average iBRIDGE scores greater or lesser than the threshold, respectively (Fig. 6B; Supplementary Tables S14–S16). This finding also suggests the reliability of the relative method for well-balanced datasets.
Finally, we subset and processed the four MSI-H patients from the GSE132465 colon cancer dataset as an example of an unbalanced inflamed dataset. Although the tertile-based method would fail to classify all 4 patients as inflamed (due to the nature of the tertile method), all four patients had average iBRIDGE scores greater than the threshold (2.45, 3.03, 3.54, and 4.38) assigning them all to the “Inflamed” class, as expected. This illustrates, the reliability of the absolute classification when the input cohort is composed of “inflamed only” or “cold only” patients.
Discussion
iBRIDGE is a data integration method that takes in single-cell expression data (malignant or in vitro cancer cell line) and outputs the cells/patients with inflamed or cold TIMEs. As scRNA-seq profiling of tumor tissues becomes more widespread, the iBRIDGE method will be able to help classify inflamed/cold status of patients especially when matching bulk data is not available. iBRIDGE can also annotate cancer cell line databases with inflamed/cold status. Given that the preexisting T-cell infiltration level is one of the predictive biomarkers for cancer immunotherapy, the iBRIDGE metric can help immune-oncology research and clinical trials with such datasets.
Of the 3 scRNA-seq datasets and 2 cell line cohorts tested, one scRNA-seq and one cell line cohort had matched bulk gene expression data. We and others have been using bulk gene expression–based quantification of immune-infiltrate as a surrogate for IHC assessment of CD8+ T-cell content (32). Therefore, we used the matched bulk gene expression as a surrogate for ground truth to compare to the iBRIDGE results. We found that iBRIDGE classifications agreed with the bulk gene expression–based assessments especially when excluding the intermediate/unassigned patients/cell lines (0.85 and 0.9 Pearson correlation coefficients in the single-cell and cell line datasets, respectively).
One caveat of this study is that the iBRIDGE workflow needs cell type identity metadata. In cases where this information is unavailable, methods such as SingleR and inferCNV would need to be used (16, 17). In addition, we used the same stringent DE cutoffs to identify the real changes for each cell type (detailed in Methods). This approach generated stronger signals in myeloid, fibroblast and malignant populations but not with lymphoid/natural killer (NK) clusters. Testing additional datasets with iBRIDGE will validate these findings. Finally, the iBRIDGE method is applicable to scRNA-seq datasets with malignant single-cell profiles, but it is not applicable to peripheral blood mononuclear cell profiles or sorted scRNA-seq profiles lacking malignant cells.
In TCGA bulk data, we have previously shown that the markers of inflamed cancers are more consistent across cohorts compared with the cold markers, which are more divergent and cohort-specific (13, 28). With the data presented here, we observe that this is true to a certain extent in single-cell data. We tested iBRIDGE in lung, breast, and colon cancer single-cell cohorts and observed that the GO terms involving responses to type I and type II interferons were consistently detected, especially in myeloid and malignant subpopulations. The detection of ‘response to interferon-gamma (type II IFN)’ is expected as IFNG, secreted by T cells and NK cells, is a main transcript and regulator of the inflamed TIME in bulk datasets (IFNG gene is part of the ICR signature we used to assess inflamed microenvironments). However, in T cells and NK cells from the inflamed patients, we did not see a strong IFNG transcript upregulation. This suggests that that high levels of IFNG could arise from the abundance of T cells/NK cells rather than a higher per-cell expression. Regardless we observed that the T-cell exhaustion markers tended to be higher in CD8+ T cells from inflamed patients, including CXCL13 and HAVCR2 (33).
The ‘response to type I interferon’ (induced by IFNα members and IFNβ) signal observed in myeloid and malignant populations is challenging to separate from ‘response to type II interferon’ (induced by IFNγ) as both cytokine families share similar downstream signal transduction pathways such as JAK1/2 and STAT1 and other STAT members. Work is ongoing to differentiate the signal which is dominant. However, we do not preclude a scenario where both pathway activities are present.
In the fibroblast/stroma fractions from inflamed patients, we observed ‘extracellular matrix’ GO terms were enriched. This was expected because extracellular matrix organization (ECM) components are the main secretory outputs of this cell type. We observed the same GO terms were enriched in the malignant cells of the inflamed patients as well. In addition, we observe TGFB1, one of the master regulators secreted by stromal cells, was also significantly higher in inflamed malignant cells and in inflamed CCLE cohorts. We and others have previously observed and reported a moderate but consistent correlation between stromal and immune contents bulk datasets such as TCGA (34). Specifically, the stromal content correlates with ICR levels, and the inflamed patients tend to have more stromal content. The enriched ECM pathways and TGFβ itself in malignant cells, is therefore in line with the observations in bulk data. These findings are therefore compatible with a hypothesis where not only immune, but also stromal content/polarization of the TIME can be a function of the malignant state of the TIME.
The mesenchymal polarization of the tumor cells and the general correlation with stromal content in inflamed patients were intriguing. ‘Immunosuppressive stroma’ as well as ‘immune-exclusion’ by stromal entrapment are two mechanisms by which stroma can affect immune polarization of a TIME (35). Indeed, one of the signatures that predicts immunotherapy response, MIRACLE score, selects patients with higher immune-infiltration levels but with lower stroma content (13). On the other hand, the mesenchymal phenotype of the tumor cells in inflamed patients is likely a general phenotype of the inflamed TIMEs rather than the specific immune-exclusion cases. However, we did not have access to datasets with immune-excluded tumors to test this hypothesis (36). New datasets with matched histology data or with spatial transcriptomics can shed light to this question.
From a wider perspective, these results are in line with a model where the tumor cells with an inflamed phenotype drive a “Type I/II IFN-driven immune” and a “TGFβ-driven mesenchymal” phenotype affecting myeloid and stromal (fibroblasts) cells, respectively. Then, the myeloid population secretes CXCR3 chemokines including CXCL9/10 to recruit T-cell infiltration to the TIME. Tumor-infiltrating T cells then secrete IFNγ which affects all members of the TIME through IFNG receptors, IFNGR1/2, creating an inflamed TIME. The tumor and stromal cells can also affect each other via secretion of TGFβ and ECM (Fig. 6C). IFNγ and TGFβ in turn, induce immunosuppressive molecules including PD-L1 and IDO1 on tumor cells and PDCD1 on T cells, respectively (37), and contribute to the survival of the tumor cells in the T-cell inflamed TIME (28).
Conceptually, the markers we identified can be categorized as either correlative, causative, or compensatory. The correlative markers of inflamed/cold cancers are markers that correlate with the inflamed/cold phenotypes but are not mechanistic modulators of an inflamed or cold TIME. These markers would not affect the TIME when modulated with potential immune-oncology therapies. This set of genes could be used as “biomarkers” but not as drug targets. The causative markers are part of pathways which can change the phenotype of TIME when modulated (i.e., cold → hot). These could be novel drug targets that modulate the TIME. The third category includes “compensatory markers” that are upregulated in inflamed cancers to suppress active immunity. These markers can be as important as the causative markers and include existing immunotherapy targets such as PD-L1/PD-1, CTLA4 and TIGIT.
Although the tertile-based relative classification is useful for identification of marker genes and it can sort the patients relative to one another, it can fail to classify patients when all patients in the cohort are cold or inflamed. Therefore, the per-patient average iBRIDGE scores were used to set a threshold in two datasets with matched ground-truth. The patients identified in the other two datasets using the relative method, were all correctly identified using the threshold as well.
In the previous literature there are reports of data integration between bulk and single-cell transcriptomics. For example, Liang and colleagues published a prognostic stratification method using data integration in ovarian cancer (38). In another integrative approach, Sprooten and colleagues integrated bulk and scRNA-seq data from tumor as well as peripheral patient samples. Focusing on ovarian cancer as well, they identified serum biomarkers that can predict patient survival (39). In another report, Jiang and colleagues integrated a smaller NSCLC scRNA-seq dataset with TCGA-LUAD cohort to identify a prognostic method (40). Finally, Song and colleagues used a 3-patient NSCLC scRNA-seq dataset to identify B-cell marker genes and constructed a prognostic predictor in TCGA-LUAD and other bulk cohorts (41). During the preparation of our manuscript, two additional approaches were also reported. In one of them, Wang and colleagues pseudo-bulk transformed liver cancer scRNA-seq data and integrated with bulk TCGA-LIHC (liver cancer) and identified immune-infiltration levels per patient (42). In the second, Sun and colleagues reported a method, ‘Scissor’, which measures similarities between scRNA-seq and bulk datasets to identify populations of interest that are relevant to metadata (43). Our approach on the other hand, is independent from pseudo-bulk and applicable to all solid cancer cohorts (28 cancer types). In addition, iBRIDGE can identify inflamed/cold classification in any scRNA-seq dataset with malignant component and applicable to cell line or organoid models in mouse or human.
In addition, there are deconvolution methods that use scRNA-seq data to quantify cell types in bulk datasets. These include ‘MuSiC’ (44), ‘Bisque’ (45), and ‘CIBERSORTx’ (46), among others. These methods are very useful in using scRNA-seq data to annotate bulk datasets. Our method is distinct in integrating bulk and single-cell data to annotate single-cell datasets. To document our approach and for wider use of iBRIDGE, we created an R package (https://github.com/tolgaturan-github/IBRIDGE), where the user can apply iBRIDGE to any scRNA-seq or cancer cell line from 28 solid cancer types.
As a conclusion the iBRIDGE method faithfully identifies inflamed patients from single-cell tumor or in vitro cell line expression datasets. We used the malignant subpopulation to assign the cells into inflamed/cold classes. The success of this approach reinforces a tumor-intrinsic model where the TIME is driven by the transcriptional polarity of the malignant cells. We validated our approach in three datasets (two human single-cell and one mouse cell line), where the matched ground truth was available, and then documented the markers of inflamed/cold cancers at the individual cell-type level.
Authors' Disclosures
T. Turan, S. Kongpachith, K. Halliwill, R.T. McLaughlin, X. Zhao, M. Binnewies, R. Mathew, S. Ye, H.J. Jacob, and J. Samayoa are all employees of AbbVie. The design, study conduct, and financial support for this research were provided by AbbVie. AbbVie participated in the interpretation of data, review, and approval of the publication. D. Reddy is a student intern (no funding to disclose) and he has no affiliation with AbbVie beyond the current study. No other disclosures were reported.
Authors' Contributions
T. Turan: Conceptualization, data curation, software, formal analysis, validation, investigation, visualization, methodology, writing–original draft, writing–review and editing. S. Kongpachith: Conceptualization, formal analysis, writing–review and editing. K. Halliwill: Conceptualization, writing–original draft, writing–review and editing. R.T. McLaughlin: Data curation, formal analysis, methodology, writing–original draft; R.T. McLaughlin contributed to the manuscript by locating and downloading CCLE RNA-seq data. In addition, he helped CCLE somatic analyses and the methodology to compare inflamed cell lines to cold cell lines to identify the somatic changes associated with them. R.T. McLaughlin also contributed writing the original draft by giving detailed feedback on the manuscript. M. Binnewies: Writing–original draft, writing–review and editing. D. Reddy: Data curation, formal analysis, validation, visualization. X. Zhao: Supervision, writing–original draft, writing–review and editing. R. Mathew: Supervision, writing–original draft, writing–review and editing. S. Ye: Supervision, writing–original draft, writing–review and editing. H.J. Jacob: Supervision, funding acquisition, writing–original draft, writing–review and editing. J. Samayoa: Conceptualization, supervision, funding acquisition, investigation, writing–original draft, writing–review and editing.
Acknowledgments
The research was funded by AbbVie, Inc.
We thank AbbVie employees Jeffrey Waring, PhD; Jacob Degner, PhD; Michael Flister, PhD; and Sabah Kadri, PhD for reviewing and giving valuable feedback on the manuscript.
We thank AbbVie employees Wei Liu, PhD, and Pingping Zheng, PhD, for guidance regarding the need for data-driven patient annotation in scRNA-seq datasets due to variability in single-cell dissociation and library preparation efficiencies.
We thank AbbVie employee Thomas J. Hudson, MD, and Former AbbVie employees Francesco M. Marincola, MD, and Michele Ceccarelli, PhD, for seeding the “Computational Immunology and Oncology” research at AbbVie.
We are grateful to the authors of the public studies, which are used in this manuscript.
The results published here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Immunology Research Online (http://cancerimmunolres.aacrjournals.org/).