In cancer immunology, somatic missense mutations have been mostly studied with regard to their role in the generation of neoantigens. However, growing evidence suggests that mutations in certain genes, such as CASP8 or TP53, influence the immune response against a tumor by other mechanisms. Identifying these genes and mechanisms is important because, just as the identification of cancer driver genes led to the development of personalized cancer therapies, a comprehensive catalog of such cancer immunity drivers will aid in the development of therapies aimed at restoring antitumor immunity. Here, we present an algorithm, domainXplorer, that can be used to identify potential cancer immunity drivers. To demonstrate its potential, we used it to analyze a dataset of 5,164 tumor samples from The Cancer Genome Atlas (TCGA) and to identify protein domains in which mutation status correlates with the presence of immune cells in cancer tissue (immune infiltrate). We identified 122 such protein regions, including several that belong to proteins with known roles in immune response, such as C2, CD163L1, or FCγR2A. In several cases, we show that mutations within the same protein can be associated with more or less immune cell infiltration, depending on the specific domain mutated. These results expand the catalog of potential cancer immunity drivers and highlight the importance of taking into account the structural context of somatic mutations when analyzing their potential association with immune phenotypes. Cancer Immunol Res; 4(9); 789–98. ©2016 AACR.
The immune system detects abnormal cells and destroys potential tumors in a process called immunosurveillance. Although this process protects the host from many nascent tumors, it also leads to the selection of cells with genetic alterations that provide them with mechanisms to escape or modulate the immune response, leading to development of immune resistant tumors (1). Human cells use multiple mechanisms to interact with their microenvironment, and most of these mechanisms can also be exploited by cancer cells to evade host immune responses. Many somatic molecular alterations in cancer cells can alter the host immune response, such as overexpression as a result of increased somatic copy number (2), somatic missense mutations that create neoantigens that make the tumor more immunogenic and elicit immune responses (3, 4) and can be exploited to treat some cancer patients (5, 6), or mutations that help cancer cells avoid the cytotoxic immune responses (7).
Most extant analyses of mutations in cancer genomes look for correlations between mutations in a gene and some immune-related metric. This approach, although successful in some cases (7), is limited by the assumption that all the mutations in a given gene will lead to the same phenotype. However, it is known that mutations can lead to drastically different phenotypes depending on the specific protein region mutated (8–12). The reason is that genes are not monolithic entities; they consist of different regions, coding for specific protein domains that are usually responsible for different functions. Accounting for this fact in the analysis of cancer-driving mutations and drug-sensitivity biomarkers led to the discovery of many “domain drivers” or “domain biomarkers” (13, 14).
Here, we explore this approach in the immune response to cancer, searching for “cancer immunity drivers” on the domain level. To this end, we analyzed the somatic cancer genomes of 5,164 cancer patients with domainXplorer, an expanded version of our earlier e-Driver algorithm (14) that identifies correlations between any phenotype (including immune responses) and mutations in individual protein regions. We compared the mutation patterns in individual tumors to the predicted presence of the immune infiltrate, as measured by ESTIMATE (15), and found 122 protein regions, mutations which correlated with the presence of immune cells in cancer tissues. Some of these regions were located in proteins involved in immune response–related pathways, such as the antibody receptor FCγR2A, the complement protein C2, or the scavenger receptor CD163L1, confirming that our method is able to identify known cancer immune–evading mechanisms. Others represent potentially novel mechanisms of cancer cells influencing host immune response and would have to be studied in more detail.
Materials and Methods
Code and data availability
Level 3 mutation data were downloaded from The Cancer Genome Atlas (TCGA) data portal (https://tcga-data.nci.nih.gov) for 5,164 tumor samples that belong to 21 different cancer types. Variant Effect Predictor tool was used to map mutations from their genomic to their protein coordinates (18) using gene and protein annotations from ENSEMBL version 72 (19). We identified a total of 636,399 missense mutations in 19,106 proteins. We only analyzed the longest isoform of each gene in order to minimize problems related to multiple testing. Also, we excluded from the analysis proteins that are suspected to be prone to false positives in cancer genomics analysis, such as olfactory receptors or TTN.
We defined protein functional regions as protein domains defined by Pfam (20) or intrinsically disordered regions as identified by Foldindex (21) with a score below −0.1. We have also included a set of 1,300 potential domains identified by AIDA (22), an algorithm based on iterative recognition of domains by homology recognition algorithms with various sensitivities.
We used the ESTIMATE algorithm (15) to evaluate the presence of immune cell infiltrates in each tumor sample. The ESTIMATE score is based on the expression of a gene signature of 140 genes related to the immune system and correlates well with data obtained by cell sorting. We downloaded the pan-cancer normalized gene expression data from the UCSC Cancer Genomics Browser (23) on February 25, 2015. We then analyzed these data with ESTIMATE to evaluate the presence of immune cells in each cancer sample.
This algorithm compares the immune and mutation profiles of cancer samples to identify protein regions that, when mutated, correlate with the presence of immune cells within the tumor infiltrate. In brief, domainXplorer first seeks a correlation between mutations in a specific domain and immune scores using an analysis of variance (ANOVA, type I) test on a multivariate linear modeling (Test 1, Equation 1). The linear model also takes into account potential biases caused by differences in the immune responses between the tissues of origin of the tumors as covariates. The model is as follows:
where “E” is the ESTIMATE score of each sample, “T” is the tissue of origin of each sample, and “D” is a binary variable showing whether the domain is mutated. This first step assesses whether any correlation between the “D” term of the linear model and the ESTIMATE score is statistically significant and gives the P value that we report.
In the next step, domainXplorer compares the E score in tumors with mutations in the domain being analyzed against those with mutations in other regions of the same protein (Test 2) or no mutations in the protein at all (Test 3) using a Wilcoxon test. Finally, domainXplorer identifies all regions in which Tests 1 and 2 have a P value below 0.01 and Test 3 has a P value below 0.05. We are less stringent with Test 3 because it intrinsically has a smaller sample size (only those samples with mutations in the analyzed proteins). We do not use multiple-testing corrections (such as Bonferroni or Benjamini–Hochberg algorithms; ref. 24) because we use protein regions instead of whole genes, and the number of samples with mutations in each region is relatively small. This effectively reduces our statistical power for most regions below that needed to detect significant effect with multiple-hypothesis testing corrections. However, in other applications of a similar protocol, we found that the framework with three different tests is stringent enough to reduce the number of false positives while identifying true positives (10). Another reason not to correct for multiple-hypothesis testing is that this setting leads to a manageable list of candidates for further analysis, which is the goal of domainXplorer. Downstream analysis of such candidates is then up to the users, who should rely upon their biological insights and expert opinion.
We downloaded data on the predicted number of neoantigens for 4,592 TCGA samples from a previous study (7). The HLA alleles of these samples were identified using Polysolver (25) and the neoantigens predicted using NetMHCpan (26). The subset of these samples that was common to our cohort (n = 3,421) was used to repeat the first step of the domainXplorer analysis, but now adding the number of neoantigens per sample in our linear model (Equation 2):
where “E”, “T”, and “D” represent the same variables as in Equation 1 and “N” is the number of predicted neoantigens in each sample. We used the P value of the “domain” variable in this type I ANOVA test to evaluate whether a domain was still significant beyond what can be explained by tissue of origin or the presence of neoantigens.
Gene expression analysis
As explained above, we downloaded normalized gene expression data from the UCSC Cancer Genomics Browser. Then, in the case of the CTNNB1 analysis, we used an ANOVA test on a linear regression model using ESTIMATE as the response and the tissue of origin and CTNNB1 expression as covariates (Equation 3):
where, again, “E” and “T” have the same values as in Equation 1, and C is a variable containing the expression of CTNNB1 in either mRNA or reverse phase protein array (RPPA) experiments.
For the analysis of the STING pathway, we also used an ANOVA test of a linear model. In this test, expression of each gene in the pathway was treated as a response variable. As covariates we used a categorical variable indicating whether or not the sample had a mutation in the POLR3B domains and either (i) the ESTIMATE score of the sample (Equation 4), (ii) its tissue of origin (Equation 5), or (iii) both (Equation 6).
where “E”, “D,” and “T” have the same values as in Equation 1 and “G” represents the expression levels of the different genes in the pathway.
We used the Pancan16 normalized RPPA file from The Cancer Proteome Atlas (TCPA) website (27). This file contains RPPA data from 4,768 samples from 16 different cancer types and is already normalized to correct for batch effects. Again, we used an ANOVA on a linear regression model where either the gene expression or ESTIMATE scores were the response variables and the tissue of origin and β-catenin protein levels were the covariates (Equation 7):
where “T” and “C” have the same value as in Equation 3 and “V” represents the expression levels of the different genes tested (CDH11, CDH1, CD8A, and CD3E) or the ESTIMATE immune score.
Using ESTIMATE to evaluate tumor immune infiltrate
Algorithms that can deduce the cell composition of heterogeneous samples by analysis of gene expression data offer an alternative to direct cell quantification methods and have been benchmarked against methods such as immunostaining or histological analysis (15, 28, 29). Here, we used ESTIMATE (15) to infer the presence of immune cells in 5,164 cancer samples from 21 different projects of TCGA (30).
Although using a single measure of immune response is an obvious oversimplification, ESTIMATE scores correlate well with several important features of host immune response, such as the cytolytic activity (measured as the combined expression of GZMA and PRF1; ref. 7) of the immune infiltrate (P < 1e−16, Fig. 1A), which can be attributed mostly to active CD8 lymphocytes and natural killer cells (7). Therefore, cancer cells from samples with high ESTIMATE immune scores are immunogenic, and yet must have some molecular mechanisms that allow them to survive in the presence of lymphocytes. In contrast, lower immune scores could be due to having few neoantigens and/or some genomic alteration that suppresses the immune response, such as expression of CTNNB1 (31).
The overall results were in line with previous studies (7), with some cancer types showing more immune infiltration than others (Fig. 1B), especially those with a potential viral origin (head and neck or cervical cancers) or a strong inflammatory component (lung cancers) or those considered to be immunogenic (kidney clear cell carcinoma or melanoma). However, the immune scores among different samples of the same cancer type were highly variable (Fig. 1B), implicating unique properties of each specific tumor.
Immune scores derived from ESTIMATE correlate with the survival of cancer patients (Fig. 1C). Patients with higher immune scores had better outcomes in the Pan-cancer dataset, which includes all 5,164 samples (Cox P < 0.01, adjusted by tissue of origin), as well as in some individual cancer types, such as adrenocortical carcinoma, melanoma, or head and neck cancer (Cox P < 0.05 in all three cases).
domainXplorer reveals potential immune drivers
Our goal was to identify potential mechanisms used by cancer cells to influence the immune response, following the hypothesis that mutations in cancer cells influence the immune response in various ways beyond creation of neoantigens. Such effects were demonstrated for some proteins, for instance for the cadherin-associated protein CTNNB1 (31) or CASP8 (7). To expand this list, we used domainXplorer to analyze ESTIMATE immune scores. Our analysis yielded a total of 122 protein regions that are potential cancer immunity drivers. Several of these regions are discussed in detail below, and the full list is presented in Supplementary Table S1.
Given the importance of neoantigens in modulating the immune response against tumors (32, 33), we determined whether cancer immunity driver regions and neoantigens represent independent signals or are mutually redundant. We repeated the linear regression step of domainXplorer on a subset of patients whose neoantigens had been previously predicted (n = 3,421). Because of a smaller sample size of this second group, and due to the inclusion of the neoantigen variable, we lowered the significance threshold to P < 0.05 for this specific analysis. In this second analysis, 64 of the original 122 domains were identified with a standard domainXplorer linear regression model (Equation 1), reflecting the effects of the smaller sample size. Of these, 52 were also identified when adding the number of neoantigens in each sample to the linear regression model (Equation 2; Fig. 2A). Therefore, at least for these 52 protein regions, the correlation between mutations and the immune scores seems to be independent of the presence of neoantigens.
Several of the cancer immunity driver regions identified in our analysis belonged to proteins with known roles in the immune system (such as CD163L1, FCγR2A, or C2). Several others were located in the extracellular matrix (COL11A1, LAMA1, or VWA2, for example) where they could interact with immune cells, immediately suggesting a potential mechanism that could mediate the correlation. The remaining group was remarkably diverse with no apparent dominant function, cellular localization, or pathway.
We tested the hypothesis that proteins we identified may form a network. We analyzed several different protein interaction databases to identify physical or functional interactions either between these proteins themselves or between them and proteins known to modulate the tumor immune infiltrate (13). A tight interaction cluster was formed by 33 proteins (Fig. 2B) and several of them interact with either β-catenin (cadherin-associated protein, CTNNB1) or TP53, two proteins known to alter the anticancer immune response (7, 31). Thus, we hypothesized that some of the regions we identified correlated with immune infiltrates because they alter their interaction with these two proteins.
CDH11 is a potential CTNNB1 inhibitor
CDH11 is one of the proteins in the 33-protein interaction cluster, and its study highlights many of the advantages of domainXplorer over gene-centric analysis. Correlations between mutations in this protein and the ESTIMATE score were not picked up by gene-centric analysis (P > 0.2, Wilcoxon test), but were recognized by a domain-based analysis (Fig. 3A and B). This protein interacts with CTNNB1, through the region identified by domainXplorer (prediction by homology), immediately suggesting a molecular mechanism for the correlation: Mutations in this region correlate with the immune infiltrate because they are altering the interaction between these two proteins.
We tried to validate the association between CTNNB1 and lower immune responses (31) using the ESTIMATE immune score, as this association had been reported for melanoma, but we did not know if the correlation was true for other cancer types. Although the correlation between CTNNB1 and ESTIMATE scores at the mRNA level was not statistically significant (P > 0.2, Fig. 3C, top), using protein expression data revealed a significant correlation (P < 1e−10, Fig. 3C, bottom). The correlation was negative, suggesting that CTNNB1 inhibits the immune response, in agreement with previous results. β-Catenin protein amounts also were negatively correlated with CD3ϵ and CD8α (Fig. 3D). We measured CD3ϵ and CD8α transcripts instead of protein, as neither gene has proteomics data in TCGA; one should be cautious when interpreting these correlations. However, the negative correlation of CTNNB1 protein with CD8α, CD3ϵ, and ESTIMATE immune scores suggests that the effect of β-catenin in suppressing the homing of lymphocytes (and particularly CD8 T cells) at tumor sites seems to extend beyond melanoma and may be a mechanism common to many cancer types.
The expression of CDH11 also negatively correlates with the concentration of CTNNB1 (P < 1e−5, Fig. 3C, red regression line), leading to the hypothesis that CDH11 could be a β-catenin inhibitor. If that were the case, because all the cancer samples with mutations in the disordered region of CDH11 have higher levels of immune infiltrate, the mutations should be strengthening the CDH11/CTNNB1 interaction. To test this hypothesis, we used MECHISMO (34) to predict the consequences of six different CDH11 mutations. MECHISMO is a method that, among other things, uses an empirical score of how likely it is for two amino acids to be together in an interface, to estimate the effect of a specific mutation. All the mutations in the IDR region of CDH11 were located on the predicted interface between CDH11 and CTNNB1 (Fig. 3E) and the MECHISMO interaction score for specific mutations and the ESTIMATE immune scores are strongly correlated (R > 0.9, Fig. 3F), meaning that samples with CDH11 mutations that cause stronger interactions between these two proteins also had more immune infiltration in agreement with our original hypothesis.
Overall, CTNNB1 levels correlated with anticancer immune responses beyond melanoma, CDH11 is likely a CTNNB1 inhibitor, and mutations, specifically in the disordered c-terminal region of CDH11, probably alter the immune infiltrate by influencing the activity of β-catenin.
Mutations in regions of complement cascade proteins
The gene-centric analysis of thrombin, shown in detail in Fig. 4, as in the case of CDH11, compares samples with and without mutations in the entire gene and leads to the wrong conclusion: that mutations in this protein do not correlate with the amount of immune cell infiltrates (Fig. 4A). However, by analyzing each domain individually, domainXplorer identified an association between mutations in the trypsin domain of thrombin responsible for its proteolytic activity and greater immune cell infiltrate (Fig. 4B and C). A potential connection between thrombin and the immune response against cancer may come from the role of thrombin in the complement cascade, whose role in anticancer immunity is beginning to be recognized (35–37). Thrombin is known to cleave C3 and C5 into their corresponding subunits, activating the complement pathway (38, 39).
Given the complexity of the complement pathway and its controversial role in cancer, showing both pro- and anticancer effects (40, 41), it is difficult to decipher the specific mechanism underlying this association. One possibility is that mutations inactivate the catalytic function of thrombin, limiting the local activation of complement. Lower activation could lead to lower cytolytic activity of the immune infiltrate, allowing cancer cells to survive the numerous leukocytes found in these samples. Lower complement-mediated cell death (42, 43), or changes in anaphylatoxins C3a and C5a concentrations, could alter the types of immune cells recruited to the tumor. If mutations instead caused a gain-of-function in thrombin's catalytic activity, more complement would be activated, and the tolerance of these cancer cells to immune cells could be explained by the recruitment of myeloid-derived suppressor cells (MDSC) by C3a and C5a (44). However, the lack of a mutation cluster in the catalytic domain suggests that these are inactivating mutations, although direct experimental evidence is needed to support either scenario.
domainXplorer identified other elements of the complement cascade, such as C2 protein and plasminogen (Fig. 4D). The results for plasminogen were similar to those for thrombin, as the region identified as significantly mutated in cancer cells is also the peptidase domain, and plasminogen activates the complement cascade by cleaving C3 and C5 (refs. 38, 39; Fig. 4E). For the C2 protein, the Von Willebrand domain was the most relevant: TCGA samples with mutations in this region have unusually high numbers of immune cells. This domain is likely responsible for the interaction with C4b; therefore, these mutations could disrupt the formation of C3 convertase by blocking the interaction between C2b and C4b, ultimately leading, like thrombin, to the downregulation or blockade of the complement cascade. Regardless of the specific mechanism underlying each of these three associations, they all suggest a key role of the complement cascade in helping tumor cells evade the immune response (i.e., making them capable of surviving in the presence of high numbers of immune cells), in full agreement with studies that highlight the relevance of the complement cascade in tumor immunology (ref. 37; Fig. 4D).
Mutations in the same protein can be associated with opposite immune phenotypes
Another example of the importance of domain-level data analysis is POLR3B, the second-largest subunit of the RNA polymerase III complex. This complex is responsible for the transcription of noncoding RNAs and affects innate immunity by inducing the expression of type I IFN after detecting cytosolic DNA (45, 46). It also plays a role in the response of patients to cancer drugs and in antitumor immunity (47).
In a gene-centric analysis, patients with POLR3B mutations have higher immune scores than those with no mutations in this protein (P < 0.05). However, results from domainXplorer show that this effect is domain dependent. Patients with mutations in the hybrid-binding domain have more immune cell infiltration than POLR3B wild-type patients (P < 0.01), as well as patients with mutations in other domains of POLR3B (P < 0.01; Fig. 5A and B).
These results suggest that POLR3B mutations should lead to cancer cells that resist the immune response. However, domainXplorer identified mutations in the clamp region (C-terminal domain) that have the opposite effect. Patients with clamp region mutations have less immune cell infiltration (Fig. 5A and B). Thus, the specific position of the POLR3B mutations seems to determine the amount of immune cell infiltration, an effect that would have been missed in gene-centric analyses. The predicted three-dimensional model of the RNA polymerase III, based on PDB coordinates file of Saccharomyces cerevisiae RNA polymerase II elongation complex (PDB ID 4Y52; ref. 48), highlights the positions of the clamp region versus the hybrid-binding domain and the rest of POLR3B (Fig. 5C).
To better understand the role of mutations in the clamp region, we analyzed expression data for the different genes involved in the RNA polymerase III/Interferon pathway (POLR3A, POLR3B, RIG-I, STING, and IRF3) as well as several interferon genes. Only patients with mutations in this region of POLR3B had the equivalent amounts of both POLR3A and POLR3B (Fig. 5D), with significantly less expression of the rest of the proteins in the pathway and several type I interferon genes (IFNA1, IFNE, IFNRA1, and IFNRA2), although we observed no differences in IFNA2 or IFNB1. We observed these differences when either the tissue of origin or the ESTIMATE score of each sample was included as a covariate, but not when including both, probably due to lack of statistical power. Thus, RNA polymerase III with mutations in the POLR3B clamp domain could not activate the full type I interferon response, leading to reduced IFNG expression and a less immunogenic environment (Fig. 5D).
In this work, we have explored the landscape of cancer immunity drivers focusing on protein domains. The role of somatic mutations in cancer immunology has usually been interpreted in the context of neoantigens. Although they play a key role in determining the immunogenicity of a tumor and can be used in clinical settings to try to personalize immune-based therapy (6, 32, 33), few analyses have focused on whether cancer somatic mutations can influence host immune response through other means. Our results suggest that many proteins, when mutated in a specific domain, could affect the host immune response to tumors. Interpreting these associations is not trivial and is additionally complicated because ESTIMATE scores use a single value to measure the presence of immune cells but do not provide any details about the composition of the immune infiltrate. Nevertheless, as shown by the examples discussed, domain analysis not only finds correlations, but can also be used to develop specific hypotheses on the mechanisms of how these mutations affect the immune cells attacking the tumor.
The full list of 122 immune domain drivers identified is given in Supplementary Table S1. Given the limited statistics available in even the largest current databases, this table should be treated as a preliminary list of candidates for mutation immunity drivers that needs further validation using other knowledge of biological data and direct experimentation. Although we have used ESTIMATE scores, this approach is amenable to other cancer immunology data, such as profiles of different types of immune infiltrating cells. Given the importance of the composition of the immune infiltrate in influencing, among other things, the overall survival of patients (49, 50), we foresee that domainXplorer analysis of more detailed immune profiles that include information about specific cell types (29) will reveal further details of tumor immunology. The only requirement is that the samples have data regarding both a certain immune phenotype and protein-coding variations (somatic or germline). Also, the functionality of domainXplorer and similar analysis can be extended by including functional regions that are defined in three dimensions, such as protein interaction interfaces. Just as in the case of cancer-driver genes (13), analyses focusing on three-dimensional regions will likely reveal more features that influence the immune infiltrate in cancer. Finally, these 122 associations between protein domains and the amount of immune infiltration, as well as the algorithm used to discover them, show that domainXplorer could become a valuable resource to improve our understanding of the complex relationships between cancer and immune cells.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: E. Porta-Pardo, A. Godzik
Development of methodology: E. Porta-Pardo
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): E. Porta-Pardo, A. Godzik
Writing, review, and/or revision of the manuscript: E. Porta-Pardo, A. Godzik
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): E. Porta-Pardo, A. Godzik
Study supervision: A. Godzik
We thank the reviewers for their thoughtful comments and suggestions, TCGA team for making their results publicly available for further analysis, as well as our colleagues from the SBP Inflammation and Infectious Diseases Center (IIDC) and the Cancer Center (CC), especially Drs. Carl Ware and Marcus Kaul, for providing advice and guidance on cancer immunity and comments and edits to the manuscript.
This research was partly supported by the SBP CC grant (P30 CA030199 to E. Porta-Pardo).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.