Abstract
Alternative usage of transcript isoforms from the same gene has been hypothesized as an important feature in cancers. However, differential usage of gene transcripts between conditions (isoform switching) has not been comprehensively characterized in and across cancer types. To this end, we developed methods for identification and visualization of isoform switches with predicted functional consequences. Using these methods, we characterized isoform switching in RNA-seq data from >5,500 cancer patients covering 12 solid cancer types. Isoform switches with potential functional consequences were common, affecting approximately 19% of multiple transcript genes. Among these, isoform switches leading to loss of DNA sequence encoding protein domains were more frequent than expected, particularly in pancancer switches. We identified several isoform switches as powerful biomarkers: 31 switches were highly predictive of patient survival independent of cancer types. Our data constitute an important resource for cancer researchers, available through interactive web tools. Moreover, our methods, available as an R package, enable systematic analysis of isoform switches from other RNA-seq datasets.
Implications: This study indicates that isoform switches with predicted functional consequences are common and important in dysfunctional cells, which in turn means that gene expression should be analyzed at the isoform level.
Visual Overview: http://mcr.aacrjournals.org/content/molcanres/15/9/1206/F1.large.jpg. Mol Cancer Res; 15(9); 1206–20. ©2017 AACR.
Introduction
The ability to produce different transcripts (gene isoforms) through alternative splicing (AS), alternative transcription start sites (aTSS), and alternative transcription termination sites (aTTS) is a major determinant of the increased complexity of higher vertebrates (1). A large majority of human genes uses alternative isoforms: approximately 95% of multi-exon genes show evidence of AS (2) and approximately 60% of genes have at least one aTSS (3). Recently, the ENCODE project estimated that, on average, each gene has 6.3 isoforms (3.9 different protein-coding isoforms; ref. 4). It is therefore no surprise that gene isoform usage has an important role in many biological processes, including development, homeostasis, pluripotency, and apoptosis (5–9). Moreover, isoforms are often tissue-specific and may alter the function, cellular localization, and stability of the corresponding RNA or protein (10, 11).
Differential usage of isoforms in different conditions, often referred to as isoform switching (Fig. 1A), can have substantial biological impact, caused by the difference in the functional potential of the two isoforms. Isoform switches are implicated in many diseases (11) and are especially prominent in cancer (12). A well-described example is the isoform switch in the ALK gene, occurring in 11% of melanoma patients, caused by the differential usage of aTSSs. The switch results in the production of a truncated protein lacking extracellular domains (13), which in turn promotes cell proliferation and drives tumorigenesis in vitro (13). Many such examples exist in the literature and include genes central to all eight cancer hallmarks (reviewed in refs. 14, 15). This has resulted in an increasing interest in targeting both general and specific splicing events for therapeutic purposes (16, 17).
The combination of mature methods for genome-wide RNA-sequencing (RNA-seq) and availability of advanced tools for the analysis of the resulting short DNA reads [e.g., Cufflinks (18) or Kallisto (19)] has enabled the quantification of transcriptomes with isoform resolution. This has enabled the genome-wide analysis of isoform usage and thereby the identification of isoform switches (20). Relative isoform usage can be defined as the fraction of gene expression originating from each associated isoform. Here we define these fractions as isoform fractions (IF; Fig. 1B, top). It follows that changes in the usage of individual isoforms can be quantified as the difference in IF values between conditions (dIF values, Fig. 1B, bottom), and by extracting isoforms with opposite dIF values (one going up, one going down), isoform switches may be identified. The validity of such switches can be evaluated with statistical tests across samples (Fig. 1C). For simplicity, we will here refer to isoforms with increased or decreased dIF values, respectively, as upregulated or downregulated.
RNA-seq and related genome-wide RNA profiling methods have been cornerstone for consortia projects profiling the transcriptional states of tissues, cells, and disease states (4, 7, 21, 22). Of particular interest in this context is The Cancer Genome Atlas (TCGA; ref. 7). TCGA has produced RNA-seq data from thousands of cancer patients covering a wide range of cancer types, quantified at both gene and isoform resolution, with the ultimate goal of profiling transcriptional states of different cancer types.
The availability of datasets of this type has spurred the development of statistical tools and methods for finding genes with isoform switches, such as Cufflinks2/Cuffdiff2, IUTA, rSeqNP, and DRIMSeq (23–26). In particular, isoform switches in cancer have been analyzed by several groups (27–29). Of special interest is the study by Sebestyén and colleagues (27), which analyzed TCGA datasets from 9 cancer types and identified 244 genes with isoform switches, of which 59 were found in multiple cancer types (pancancer switches). However, this and other studies have only analyzed isoform switching descriptively, and not estimated the potential impact of switches. Functional consequences of switches such as gain or loss of protein domains or signal peptides, loss of protein coding sequence, and changes in propensity for nonsense-mediated decay (NMD), may all be computationally predicted once the specific isoforms involved in a switch are detected (30–32).
Analyses of RNA-seq data utilizing isoform-level information remain rare, even though RNA-seq is now a standard technique and new RNA-seq datasets are produced on a daily basis. To illustrate this, we examined 100 randomly selected articles from the first 5 months of 2016 analyzing RNA-seq datasets: only 11% of these articles performed analysis at the isoform level. Thus, RNA-seq data are clearly underutilized. We believe this is partially due to the lack of dedicated computational tools for the postanalysis of RNA-seq data at isoform resolution. Specifically, there are to our knowledge, no published tools enabling (i) statistical identification of the specific isoform involved in an isoform switch; (ii) prediction of the potential effects of an isoform switch, (iii) straightforward visualization of isoform switches and their potential consequences.
Here, we describe a set of methods for the identification of isoform switches and the subsequent integration of multiple predictors of isoform function, to identify isoform switches with predicted functional consequences. We applied these methods to TCGA RNA-seq datasets from more than 5,500 cancer patients covering 12 solid cancer types, making the most comprehensive analysis of isoform switches in cancer to date. We found that isoform switches with predicted functional consequences were common, affecting approximately 19% (N = 2,352) of multi-isoform genes. Among these, switches leading to the loss of sequence-encoding protein domains were frequent, particularly in pancancer isoform switches. We also identified a set of 31 pancancer switches as highly predictive of patient survival, independent of cancer types. Our data can be easily explored through interactive online visualization tools and our methods are implemented as an easy-to-use R package, available at http://bioconductor.org/packages/IsoformSwitchAnalyzeR/.
Materials and Methods
All analysis was performed with R > 3.1.1 except if specifically stated. False discovery rate (FDR) was used to correct for multiple testing. P/FDR < 0.05 was considered significant.
TCGA data acquisition
TCGA data were analyzed using the hg19 assembly. Isoform count matrices [RNASeqV2, quantified with RSEM (33) by TCGA], and patient metadata for all available TCGA RNA-seq datasets (all 33 cancer types) were downloaded from the TCGA data matrix on November 3, 2015, specifying no cell line controls (data now hosted at https://portal.gdc.cancer.gov/). The GAF file containing annotation of the quantified isoforms was downloaded from tcga-data.nci.nih.gov but have since been moved to https://api.gdc.cancer.gov/v0/data/a0bb9765-3f03-485b-839d-7dce4a9bcfeb. Clinical data were obtained from https://genome-cancer.ucsc.edu/proj/site/hgHeatmap on August 24, 2016. COSMIC coding point mutations from targeted and genome-wide screens (“CosmicMutantExport”) and copy number alterations (CNA, “CosmicCompleteCNA”) were downloaded from https://cancer.sanger.ac.uk/cosmic on February 22, 2017.
Isoform annotation
We manually constructed strand-specific gene IDs defined as the most up- and downstream genomic coordinates of overlapping isoforms all having the same gene name.
Library filtering
We discarded libraries with <20 million reads as well as multiple libraries from the same patient (only keeping one). Only cancers types with >24 paired samples after filtering were considered.
RNA-seq data processing
For each cancer type, isoform expression was obtained from the downloaded count matrix. To take transcript length into account we calculated relative log expression (RLE) normalized FPKM (fragments per kilobase of transcript per million fragments mapped) values. Gene expression was calculated as the sum of the FPKM values of all isoforms associated with that gene ID. IFs were calculated by dividing the isoform expression with the parent gene expression (FPKMiso/FPKMgene; Fig. 1B).
Analysis of differential isoform usage in TCGA data
For each cancer type, we only considered genes containing multiple isoforms, requiring that the lower boundary of the 95% confidence interval (CI) of the mean expression of a given isoform >1 FPKM in at least one condition from genes where the lower boundary of the 95% CI of the mean gene expression >1 FPKM in both control and tumor samples. The significance of the change in isoform usage for the paired analysis and analysis of all samples was done, respectively, with a paired and an unpaired Mann–Whitney U test (Fig. 2A). dIF values for paired analyses were calculated as the mean of all paired dIF values. dIF values from the analysis of all samples were calculated as meanIF2 – meanIF1. The statistical test was only performed if both conditions had at least 25 defined IF values. We considered a gene as having an isoform switch if it contained at least one isoform where the |dIF| value >0.1 and FDR < 0.05 in both paired analysis and the analysis of all libraries.
IsoformSwitchAnalyzeR
To systematically identify and analyze the potential consequences of isoform switches we developed the R package IsoformSwitchAnalyzeR http://bioconductor.org/packages/IsoformSwitchAnalyzeR/. See Supplementary Text S1–S2 for a detailed introduction.
Analysis of TCGA data with IsoformSwitchAnalyzeR
Isoform switch analysis was performed with IsoformSwitchAnalyzeR v0.98.0. Except for the isoform switch identification, a standard IsoformSwitchAnalyzeR workflow was performed on the TCGA data (see Supplementary Text S1).
As described above, we identified isoforms with a significant change in isoform usage. For each gene, pairs of isoforms having opposite change in usage (requiring a minimum change of |dIF| > 0.1) were identified. Next, we annotated isoforms involved in a switch using computational predictions. We annotated premature stop-codons (PTC) using the 50-nt rule (34). Intron retentions were annotated by comparing each isoform to the hypothetical pre-RNA generated when combining all exons of the parent gene, done by the spliceR (32). Using the known exon structure of each isoform, we obtained the corresponding spliced nucleotide sequence and corresponding coding sequence using annotated open reading frame (ORF) positions. These sequences were used for sequence analysis of protein domains [via Pfam; (30) only using the pfamA database and by specifying –“as”], signal peptides [via SignalP (35) specifying “-f summary”], and coding potential (via CPAT). The cutoff for distinguishing coding and noncoding isoforms based on the CPAT analysis was 0.364 as suggested in ref. 36.
We predicted functional consequences of an isoform pair by comparing the isoforms that were used more (dIF > 0.1) to the isoforms that were used less (dIF < −0.1), and identified differences in the obtained annotation via the analyzeSwitchConsequences() function. We focused on six annotation types: gain/loss of protein domains, signal peptides, introns, or coding potential, changes in NMD sensitivity, and large changes in amino acid sequence. The latter was defined as a Jacard similarity (length of aligned region without gaps)/[(length of sequence a) + (length of sequence b) - (length of aligned region without gaps)] of the pairwise alignment of the sequences <0.9. All plots of isoforms and their annotations were made with the switchPlot() function. The global analysis of changes in isoform features was performed with the IsoformSwitchAnalyzeR's extractGenomeWideAnalysis().
Analysis of potential confounding biases
BRCA samples were divided on the basis of their subtype classification into “Basal-like,” “Luminal A,” and “Luminal B” and all pairwise comparisons were considered, as well as all control samples versus all cancer samples. Isoform switches were identified via IsoformSwitchAnalyzeR (Supplementary Text S2). Predictions of functional consequences were done as described above. The analysis of which transcription start sites (aTSS), alternative splicing (AS), and alternative transcription termination (aTTS) caused the consequences were analyzed via analyzeSwitchConsequences() by specifying consequencesToAnalyze =c(“tss”, “intron_structure”, “tts”).
Analysis of mutation frequency
The COSMIC database (37) was reduced to cancer types analyzed and the genes tested for isoform switches. For each mutation type, the number of times a gene was annotated as mutated was counted and genes were divided into those with and without isoform switches with predicted functional consequences. Only deleterious mutation types found in at least 25 genes in each category were considered.
Enrichment tests
All enrichment analysis was done using a Fisher exact test, except if especially stated. Expressed multi-isoform genes were used as background.
Gene set enrichment analysis
Gene Oontology (GO) gene sets were downloaded (January 6, 2016) from EBI. GO terms from the fifth level were used. MSigDB's gene sets c2, c6, and H were downloaded (January 6, 2016) from http://bioinf.wehi.edu.au/software/MSigDB/ (38). Gene names were converted to Ensembl gene IDs. Only gene sets with >9 genes were considered.
The functional class scoring analysis (Supplementary Fig. S1D), using enrichment P values, was done for each of each gene set as follows: (i) for each cancer type, a Fisher exact test was performed only for testing enrichment. In each cancer type, only expressed multi-isoform genes were used as background. (ii) The average −log10(P) across all 12 cancer types was calculated. (iii) A bootstrap P value was calculated, from the cancer type–specific expressed multi-isoform genes, by sampling 1,000 random gene sets of equal size as the gene set in question and the average −log10(P) was calculated.
Significant gene sets across cancer types were required to have: (i) bootstrap P < 0.001; (ii) average −log10(P) > −log10(0.05) (iii) median odds ratio (OR) > 2; (iv) average genes found in >5 gene-set.
Protein network analysis
The protein network analysis was done via the STRING (39) webserver (http://string-db.org/) only considering “high confidence” interactions (interaction score > 0.7).
Literature searches
All literature searches were performed on PubMed (http://www.ncbi.nlm.nih.gov/pubmed) using the RISmed R package. We used two queries, one for finding RNA-seq articles (“RNA-seq search”) and the other for finding RNA-seq articles analyzing isoforms (“isoform search). Both queries were limited to (i) articles published in the first 5 months of 2016; (ii) article type= “Journal Article;” (iii) english articles; (iv) articles about human or mouse data. The RNA-seq search was performed searching for “RNA-seq” or “RNA-sequencing” specifically excluding single-cell studies, resulting in 518 articles. The isoform search was performed using the search terms “isoform,” “isoforms,” “Alternative Splicing”[Mesh] “transcript” or “transcripts” resulting in 1,000 articles; 51 articles were present in both groups.
We defined an “RNA-seq only” set, containing the RNA-seq search articles not overlapping with the isoform search, and a “RNA-seq and isoform” set containing the articles found in both searches. We randomly selected 50 articles from each set, which were read and categorized as analyzing RNA-seq data with isoform resolution or not; 30 articles were discarded because they were either analyzing nonstandard RNA-seq data, describing tools for analysis, or were not available online. From this categorization, we estimated fraction of articles expected to be analyzing RNA-seq data with isoform resolution.
To identify known human isoform switches in cancer, we constructed a comprehensive set of phrases for describing isoform switches in cancer. We identified 26 different phrases used in the literature we already knew. These were deconstructed into 9 prefixes (e.g., “differential”), 14 feature references (e.g., “isoform”), and 4 suffixes (e.g., “usage”). All possible combinations of a prefix, a feature reference, and a suffix were created. Combinations were used to perform a PubMed search (January 13, 2016) in combination with variations of “human” and “cancer.” All relevant articles were read and genes containing isoform switches with a described functional consequence in cancer were extracted. Only isoform switches in the 12 cancer types analyzed here (Table 1) were considered.
Analysis of patient survival
We identified isoform switches in individual patients as described in the main text. We used a Cox proportional hazards regression model [the coxph() function from the survival R package; ref. 40). To test for differences in the overall survival rates and for differences in Hazard Ratios (HR), two separate tests were needed as the P values for the survival test were not uniformly distributed when covariates were included. We therefore tested each isoform switch both with and without age, gender, and parent gene expression as covariates.
To label an isoform switch as significantly associated with poor patient survival, we required:
(i) The switch had to be identified in at least 5 diseased patients and not be identified in at least other 5 patients.
(ii) HR FDR (with covariates) < 0.05.
(iii) Log-likelihood test FDR (without covariates) < 0.05.
(iv) Median survival rate < 1.
(v) HR > 1.
The pan-cancer analysis was stratified by cancer type via the strata() function and gender was omitted as a cofactor for PRAD due to no females in that cohort.
Results
Identification of isoform switches with predicted consequences from TCGA data
To comprehensively characterize isoform switches in cancer, we utilized the extensive TCGA RNA-seq (7). After filtering, our dataset contained 6,194 RNA-seq libraries from 5,562 cancer patients covering 12 solid cancer types (Table 1). For each cancer type, RNA-seq libraries from tumors and corresponding healthy tissue were available, and for a substantial subset, tumor and healthy tissues were paired (originating from the same patient; Table 1). The availability of paired samples made it possible to identify switches using either a paired approach, comparing RNA-seq profiles from tumor and healthy tissue from the same patient, or an unpaired approach also including tumor and healthy tissue samples that were from different patients (Fig. 2A, top). The advantage of the paired approach is increased statistical power, as potential artifacts such as batch effects and interpatient variance are omitted, while on the other hand, the unpaired analysis includes more samples and thereby allows for better generalization.
To exploit the advantages of both approaches, we tested each isoform for differential usage by comparing the IF values of tumor and healthy tissue samples using a paired Mann–Whitney U test for the paired samples and a standard Mann–Whitney U test for the unpaired analysis. For each such test, we required: (i) at least 25 patients in each group (healthy and tumor), (ii) a difference in average isoform usage larger than 10% (|dIF | > 0.1), and (iii) an FDR <0.05. Isoform switches were called when both tests passed the outlined requirements and where we also found opposite usage |dIF |>0.1 of another isoform from the same gene (Fig. 2A, bottom).
Many isoform switches identified this way may be inconsequential to cells, as the isoforms might have identical biological functions. One example is closely located alternative TSSs in the 5′ UTR, where the effect of an isoform switch will be to slightly change the 5′ UTR length: such a change may in some cases not have a measurable biological effect.
To focus on switches with potential biological impact, we only analyzed isoform switches where we could predict such consequences. To predict consequences, we first collected a set of annotations for each isoform involved in a switch utilizing its annotated ORF. The collected annotation included (i) protein domains predicted by Pfam (30), (ii) signal peptides identified via SignalP (35), (iii) protein coding potential predicted by CPAT (36), (iv) large changes in amino acid sequence (see Materials and Methods), (v) intron retention identified by spliceR (32), and (vi) NMD sensitivity, analyzed by the 50-nt rule (see Materials and Methods; ref. 34). For each gene with a significant isoform switch, we then in a pairwise manner compared the annotations for the isoform used more in a given state (dIF > 0.1) to the isoform used less (dIF < −0.1). Cases where we found differences in the annotations between the two isoforms were considered as isoform switches with predicted functional consequences (workflow illustrated in Fig. 2B, dashed line).
We assessed whether the prediction of isoform switches could be confounded by systematic bias in the dataset due to either differences in GC content or changes in RNA quality between cancer and control and found that such biases have either no or minor effect on our results (Supplementary Text 3).
Extent and verification of isoform switching with potential functional consequences
We found isoform switches with predicted functional consequences to be common in cancers: on average, each cancer type had 357 genes (3.5% of expressed multi-isoform genes) containing at least one isoform switch predicted to result in functionally different RNAs/proteins (Fig. 3A; Supplementary Fig. S1A–S1C; Supplementary Tables S1 and S2). Across all 12 cancer types analyzed, we found 2,352 genes (18.96% of all expressed multi-isoform genes) that had such a switch (Fig. 3B). These genes were generally associated with signal transduction and stemness, and were part of known cancer gene signatures (Supplementary Fig. S1D). Interestingly, these genes were also more frequently mutated in somatic cancers, as annotated in the COSMIC database (37), compared with genes without such switches (Fig. 3C). We reasoned that the high occurrence of isoform switches might be the result of the disruption of the spliceosome. If this was the case, one would expect to see global changes in isoform usage (41). To investigate this, we compared the genome-wide distribution of control and cancer isoform usage for isoforms with a particular feature (e.g., ORF, protein domain etc). With a single exception in KIRP, we found no indications of global disruptions (Supplementary Fig. S1E).
To analyze the quality of the identified isoform switches, we performed a systematic literature review and identified 47 genes with known isoform switches that were described as having functional consequences in their respective studies (Supplementary Table S3). These literature-curated switches were highly enriched among the set of genes where we identified isoform switches with predicted consequences compared with all genes we tested for isoform switching (when only considering the parent gene: OR >4.2, P < 3.8e−12, Fisher exact test, when also considering the cancer type in which the isoform switch is described: OR >16.33, P < 3.4e−35, Fisher exact test). Because we systematically analyzed 12 cancer types, we could in many cases extend the number of cancer types in which a switch was observed, compared with the cancers where the switch was previously described (Fig. 3D; Supplementary Table S3). For example, ZAK (also known as MLTKα), a gene with known isoform-specific oncogenic properties, has an isoform switch which was previously described in three cancer types (BRCA, COAD, and STAD; ref. 29). We identified this isoform switch in three additional cancer types (KIRC, KIRP, and PRAD; Fig. 3D) and found that the isoform switch causes the inclusion of exons encoding the SAM_1 protein domain, which typically facilitates protein–protein interactions (Fig. 3E), illustrating the usefulness of our methods for hypothesis generation.
Consequences of isoform switching
Because we applied the same analysis across 12 cancer types, we were able to systematically investigate the occurrence of different predicted consequences across different cancer types. The most commonly predicted consequences of isoform switching were large changes in amino acid sequence [corresponding to loss and/or gain of amino acid coding exon(s)], change of protein domains, and change of signal peptides (Fig. 4A; Supplementary Fig. S2A). Surprisingly, the distribution of protein domain gain versus domain loss was highly skewed toward domain loss for isoforms upregulated in cancer (Fig. 4A; first column). To investigate what gain/loss ratio would be expected if isoform switching occurred randomly (i.e., if it were not regulated), we calculated, for each cancer-type, the expected distribution of gain/loss ratios by randomly sampling the same number of isoform pairs from nonswitching protein-coding genes 1,000 times. Compared with this background, loss of protein domains was significantly more frequent than domain gain in all 12 cancer types (P < 0.001, by sampling; Fig. 4B; Supplementary Fig. S2B). Importantly, the observed prevalence of protein domain loss was only partially explained by the commonly observed isoform switches resulting in protein-coding to noncoding change (ORF loss; Fig. 4A, third column, Supplementary Fig. S2C).
Switches resulting in ORF loss occurred significantly more often than expected in all 12 cancer types (FDR < 5.82e−10, Fisher exact test; Fig. 4C, black dots). Using a more conservative ORF loss threshold (requiring ORF loss as well as loss of coding potential, as calculated by CPAT; ref. 36), this held true for 10 of the 12 cancer types (FDR < 0.05, Fisher exact test; Fig. 4C, gray dots; Supplementary Fig. S2D). Considering the overrepresentation of ORF loss, and that upregulated isoforms in cancer typically were shorter (Supplementary Fig. S2E), it was particularly interesting that isoform switching resulting in gain of signal peptides occurred significantly more often than expected in 9 of the 12 cancer types (FDR < 0.05, Fisher exact test; Fig. 4D).
The regulatory mechanisms controlling the production of alternative isoforms are often assumed to originate from AS. Thus, the contribution from aTSS or aTTS has not been comprehensively investigated. To analyze the contribution of the individual mechanisms, we decomposed the isoforms involved in isoform switches with predicted consequences into those originating from AS, aTSS, aTTS, and combinations thereof. AS and aTSS were the main contributors: 78.4% and 70.3% of isoform switches involved AS and aTSS, respectively, while 52% of isoform switches involved both (Fig. 4E and F). For switches where only one mechanism was used, aTSS and AS accounted for 40.4% and 51.5%, respectively. In comparison, aTTS contributed much less: globally only 17.3% of switches utilize aTTS, and aTTS could itself only explain 8.1% of the single mechanism switches (Fig. 4E and F).
Pancancer isoform switches with predicted functional consequences
We reasoned that isoform switches, with predicted functional consequences, which were observed across two or more cancer types, referred to as pancancer switches, were of particular interest. We identified 945 genes containing such switches (Fig. 5A; Supplementary Table S4), which were enriched for genes involved in signal transduction and stemness, and were over-represented in known cancer signature gene lists (Fig. 5B). In these pancancer isoform switches, the predicted protein domain gain/loss ratio was particularly skewed toward domain loss (Fig. 5C; 78.26% of events, P < 0.001, bootstrap test), a ratio higher than the domain gain/loss ratio for 10 of the 12 individual cancer types analyzed (Supplementary Fig. S3A). In the pancancer isoform switches, we identified 11 protein domains that, compared with single cancer isoform switches, were lost from significantly more genes than expected by chance (FDR < 0.05, Fisher exact test; Fig. 5D, left). These domains were linked to signal transduction (e.g., protein kinase domains), extracellular space (e.g., immunoglobulin domains), and protein–protein interaction (e.g., PH domains; Fig. 5D, right).
We hypothesized that genes containing pancancer switches may be functionally connected. To assess this, we used the STRING database (39), a resource summarizing protein–protein interaction evidence from multiple sources. Using genes containing pancancer switches observed in at least four cancer types, we identified a high-confidence protein interaction network (7.45× more interactions than expected, P < 5.8e−04, STRING online analysis; Fig. 5E; Supplementary Fig. S3B). The protein interaction network, which mainly consisted of genes involved in cell signaling, contained VEGFA and AKT1 (also known as PKB1 or PKBα) as central nodes (Fig. 5E). The previously uncharacterized isoform switch in AKT1 is of particular interest, as AKT1 is a known tumor suppressor that inhibits tumor invasion (42). We found this isoform switch in five different cancer types (COAD, KICH, KIRC, KIRP, and THCA) and our detailed analysis revealed that the upregulated isoform lacked the PH domain present in the normal isoform (Fig. 5F). As PH domains typically facilitate protein–protein interactions, this could indicate that the ability of the cancer form of AKT1 to colocalize with its normal signal transduction partners is decreased, perhaps decreasing its suppressor effect. However, the causality and functional importance of this switch remains to be tested.
Isoform switches associated with worse pancancer patient survival
As analyses of patient survival rates using exon-level expression data have previously been shown to complement gene-level expression (43), we hypothesized that isoform switches could be predictors of patient survival. To investigate this, we analyzed each cancer patient for the presence of any of the identified 2,792 isoform pairs involved in switches with predicted consequences as follows. First, for each cancer type, we analyzed the average isoform fractions of the isoform pair in all control samples. Next, for each of the 5,232 cancer patients for whom survival data was available (Table 1), we compared these IFs to the distributions from corresponding healthy samples. If the IF in a particular cancer patient was different from healthy samples [defined as a |Z-score| > 3 and a combined dIF score of the two isoforms > 0.2 compared with the average of healthy samples], we classified the individual cancer patient as having the isoform switch. We then compared the survival rates of patients with and without the isoform switch within each cancer type and across all cancers (a pancancer analysis) taking age at diagnosis, cancer type, gender, and the expression of the parent gene into account (see Materials and Methods). This resulted in the identification of 1,195 isoform switches significantly associated with worse patient survival in at least one cancer type. Of these, we found 111 isoform switches that were significantly associated with worse patient survival in the pancancer analysis (Supplementary Table S5).
Encouraged by these findings, and the large number of pancancer isoform switches identified, we hypothesized that for some isoform switches, the associations with worse survival rates might be independent of cancer type. In other words, there may be cases where an isoform switch is associated with worse survival rates regardless of which cancer type it was identified in.
We reasoned that if the confidence (as estimated by P value) of the association between an isoform switch and worse survival rates was larger in the pancancer analysis (smaller P value) compared with any of the corresponding single cancer analysis, the association was independent of cancer types. Importantly, such a result would not be explained by differences in the number of patients analyzed as our pancancer analysis is stratified by cancer type and thereby only reflects cancer types where patients with the switch are identified (Supplementary Fig. S4).
To identify such isoform switches, we compared the P value of the pancancer analysis to the P value of the most significant single cancer analysis. This resulted in the identification of 31 isoform switches that were significantly associated with worse survival rates independent of cancer types (Fig. 6A; Supplementary Table S5).
As an isoform switch is not a binary event, we hypothesized that the magnitude of isoform switches could be an important parameter for survival analysis. To analyze this, we divided cancer patients based on the magnitude of the isoform switch for each of the 31 pancancer predictors: isoform switches with a combined dIF score of the two analyzed isoforms >0.5 were defined as “large” switches and remaining cases as “normal.” On the basis of this stratification, we reanalyzed the survival rates and obtained significant results for the individual analysis of both normal and large switches for 13 of 30 pancancer predictors (details in Materials and Methods). As hypothesized, 9 of 13 (69%) cases showed clear differences in median survival rates with worse outcome for patients with larger switches (Supplementary Table S5).
To illustrate the results of our survival analysis, we here provide an in depth description of four isoform switches with potential functional consequences where each switch was predictive of patient survival independent of cancer type.
The ZNF12 gene is a transcriptional repressor that, through its KRAB domain, suppresses AP-1 and SRE-mediated transcriptional activity (44). An isoform switch in ZNF12 (identified in 5 cancer types, Supplementary Table S5) resulted in the loss of the transcriptional inhibitory KRAB domain (Fig. 6B, top). In the pancancer analysis, this isoform switch was associated with worse survival rates than any of the single-cancer analysis (HR, 1.58; HR P < 0.0013) (Fig. 6B, bottom).
Another isoform switch resulting in the loss of a protein domain was found in ERCC1, a gene with a central role in the nucleotide excision repair system, where mutations have previously been associated with worse survival rates (45). We identified this isoform switch in 1,091 patients (corresponding to an average 23.5% patients in each of the 12 individual cancer types, Supplementary Table S5). This isoform switch was predicted to result in a protein lacking the HHH domain in cancer states (Fig. 6C, top). As the HHH domain is a sequence-nonspecific DNA-binding element, this switch could potentially compromise the function of ERCC1. Patients with this isoform switch consistently had lower survival rates than patients without it, in both the analysis of HNSC (the single cancer type where the survival difference was the most significant) and the pancancer analysis (HR: 1.28; HR P < 0.001; Fig. 6C, bottom).
FAM36A, which encodes an assembly factor important for the last step of the electron transport chain in oxidative phosphorylation, had an isoform switch that was found in more than 1,000 patients across all 12 cancer types analyzed (Supplementary Table S5). This isoform switch, which constitutes a change from a noncoding to a coding isoform (Fig. 6D top), was associated with worse survival rates across cancer types (HR: 1.4; HR P < 1.63e−06; Fig. 6D, bottom), supporting recent findings that oxidative phosphorylation might play an important role in cancer progression (46). Importantly, the size of the isoform switch seems to be associated with worse patient outcome (Fig. 6E).
One of the most extreme pancancer predictors identified was the isoform switch in the SHC1 gene. This isoform switch was found in 11 cancer types (Supplementary Table S5) and associated with a median survival rate of 0.58 (Fig. 6A). The isoform switch was predicted to result in increased usage of the longer isoform, which is often referred to as p66Shc (Fig. 6F, top). This long isoform included an additional 110 N-terminal amino acid region, and is known to play a central role in prostate cancer metastasis as well as many other cellular functions (47). In agreement with these results, we found the isoform switch to be associated with poor patient survival across cancer types (HR: 1.58; HR P < 4.70e−06; Fig. 6F, bottom) a trend worse for patients with large isoform switches (Fig. 6G).
Resources and software for mining isoform switches in cancer
To facilitate the analysis of isoform switches with potentially functional consequences in cancer, we provide access to the data presented here in several ways, aimed at different types of users. For easy and fast exploration of isoform switches in cancer, we generated three interactive online web services, which produce isoform switch plots (similar to Figs. 3D or 5F), where the genes can be selected with specific focuses, either: (i) gene-oriented, (ii) cancer-type oriented, (iii) pancancer oriented. These tools are available at http://www.binf.ku.dk/services/#switch_cancer. We also provide all the results presented here as Supplementary Data for power users (Supplementary Tables S1–S5). To streamline analysis of isoform switches with predicted consequences (Fig. 2B) and for others to be able to apply our methods to other RNA-seq datasets, we implemented our methods as an R package, IsoformSwitchAnalyzeR, which enables statistical identification of isoform switches with predicted functional consequences from RNA-seq data. IsoformSwitchAnalyzeR is available through Bioconductor http://bioconductor.org/packages/IsoformSwitchAnalyzeR/ and is to our knowledge the first published tool enabling statistical identification of the specific isoforms involved in a switch (Supplementary Fig. S5A).
Details about IsoformSwitchAnalyzeR can found in the vignette (Supplementary Text S1) and Supplementary Text S2. Finally, two SwitchAnalyzeRlist objects, which can be explored via IsoformSwitchAnalyzeR, and contains the full switch analysis of TCGA on which all of the above analysis is made, are available via figshare; http://doi.org/10.6084/m9.figshare.4924724.
Discussion
Despite the large amount of RNA-seq data and computational methods available, isoform-based expression analysis is rare. This means that the potential of existing RNA-seq data is untapped, and as a consequence, our general understanding of differential isoform usage is poor. The few efforts at analyzing individual isoform switches have typically dealt with isoforms by describing their frequent occurrence rather than trying to systematically predict their consequence. Overall, this is unsatisfying, as isoform usage is important in disease and especially cancer, where many individual isoform switches have been described.
Here we present methods for the statistical identification and analysis of isoform switches with predicted functional consequences. We utilized these methods to make the most comprehensive analysis of isoform switches in cancer to date, analyzing data for more than 5,500 cancer patients. According to our analysis, isoform switches with predicted functional consequences are common, occur in genes important for cancer, and are often shared between cancer types. Such isoform switches often lead to protein domain loss and/or ORF loss. Conversely, signal peptides are more often gained than lost in cancer-upregulated isoforms. Many of these switches have previously been described as functionally important, and in many instances causal for cancer development and/or progression. Regardless of causality, a subset of such isoform switches was proven to predict patient survival independent of cancer type. As the survival rates were associated with the magnitude of switches, these switches may be useful as biomarkers.
Although we found that isoform switching with predicted functional consequences is common, we believe our results are a conservative estimate for several reasons: First, the analysis presented here is based on UCSC KnownGenes isoform models released in 2009, while the most recent GENCODE annotation have >2.5× more transcripts annotated and contains approximately 50% more multi-isoform genes. Similarly, it is likely that many novel isoforms, which are not in the current annotation, are used in cancer patients. Consequently, we may not be analyzing the full spectrum of isoforms, which limits our ability to detect isoform switches. Second, in the analysis presented here, we did not analyze cancer subtypes because we prioritized a high number of paired samples to account for interpatient and batch effects. This means we only identify isoform switches shared by the majority of patients in each cancer type. Thus, we will not identify isoform switches found in smaller subtypes. One such example is the isoform switch in ERCC1, which we first globally identified in LUAD and KICH, but deeper analysis showed that the isoform switch existed in subsets of patients from all 12 cancer types analyzed. Finally, we have focused on isoform switches with predicted functional consequences. Such predictions are limited, as we cannot predict all functional features of genes. For example, we identified an isoform switch from a short to a long isoform in the insulin receptor (INSR) in eight cancer types (Supplementary Fig. S5B), an isoform switch suggested to have a role both in fetal growth and cancer biology (48). However, because we did not predict a functional consequence of this switch, it was not analyzed further.
Overall, our results strongly indicate that isoform switches with predicted functional consequences are both common and important in dysfunctional cells, illustrating the potential of augmenting gene-level analysis with isoform-level analysis. This idea is especially appealing in the light of recent findings showing that utilization of isoform-level expression data also might lead to more reliable gene-level expression estimates (49). While there is ample room for improvements in algorithms for isoform reconstruction and quantification, our results suggest that current methods are still adequate for isoform-level–based analysis.
We expect our datasets will be a significant resource for cancer biology research and that the methods described here will be useful for analyzing other RNA-seq sets, enabling analysis of isoform switches with predicted functional consequences in other diseases or physiologic states.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: K. Vitting-Seerup, A. Sandelin
Development of methodology: K. Vitting-Seerup
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): K. Vitting-Seerup
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): K. Vitting-Seerup
Writing, review, and/or revision of the manuscript: K. Vitting-Seerup, A. Sandelin
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): K. Vitting-Seerup, A. Sandelin
Study supervision: A. Sandelin
Acknowledgments
The authors thank Sarah Rennie for comments on the manuscript.
Grant Support
This study was supported by grants from the Novo Nordisk Foundation, the Lundbeck Foundation, and Elixir Denmark (to A. Sandelin).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.