Abstract
Increasing evidence has suggested a role for adenosine-to-inosine RNA editing in carcinogenesis. However, the clinical utility of RNA editing remains limited because functions of the vast majority of editing events remain largely unexplored. To help the cancer research community investigate functional consequences of individual editing events, we have developed a user-friendly bioinformatic resource, The Cancer Editome Atlas (TCEA; http://tcea.tmu.edu.tw). TCEA characterizes >192 million editing events at >4.6 million editing sites from approximately 11,000 samples across 33 cancer types in The Cancer Genome Atlas. Clinical information, miRNA expression, and alteration in miRNA targeting modulated through RNA editing are also integrated into TCEA. TCEA supports several modules to search, analyze, and visualize the cancer editome, providing a solid basis for investigating the oncogenic mechanisms of RNA editing and expediting the identification of therapeutic targets in cancer.
This user-friendly bioinformatic resource reduces the barrier to analyzing the huge and complex cancer RNA editome that cancer researchers face and facilitates the identification of novel therapeutic targets in cancer.
Introduction
Adenosine to inosine (A-to-I) conversion, the most prevalent type of RNA editing in human, is catalyzed by adenosine deaminases acting on RNA (1). A-to-I RNA editing can increase transcriptome diversity and alter miRNA regulation (2). Recent studies have highlighted a link between A-to-I editing and carcinogenesis through analyzing data from The Cancer Genome Atlas (TCGA; refs. 3–6). However, functions of the vast majority of editing events remain largely unexplored because the investigation is challenging. First, cancer development is associated with alterations in editing frequency (7), but editing frequency across various histologic grades, stages, or microenvironments is not available in current RNA editing databases (refs. 8–10; Supplementary Table S1). The lack of editing frequency and its integration with clinical information imposes the barrier to exploring functional consequence of individual editing events. Second, the selection of cancer driver–like editing events for testable hypotheses and functional validation is burdensome. More than 4.5 million A-to-I editing sites have been identified in human, and most of them are located in noncoding regions such as introns and 3′UTRs (10). However, no web platform is available for exploratory analysis of the huge data. These challenges that cancer researchers face obscure the understanding of the oncogenic mechanisms of RNA editing, which limits the clinical utility of RNA editing (11).
To facilitate access to cancer editome and support exploratory data analysis and visualization, we have developed a bioinformatics resource, The Cancer Editome Atlas (TCEA; publicly available at http://tcea.tmu.edu.tw; for access from Brazil and Russia please connect to http://13.56.252.19). We functionally annotated >4.6 million editing sites and systematically characterized >192 million editing events in approximately 11,000 samples across 33 cancer types in TCGA. Unlike previous studies (3–6), we utilized unmapped reads to identify events from hyper-edited reads. We also integrated miRNA expression with editing frequency at sites that may alter miRNA targeting to examine the interplay between RNA editing and miRNA regulation. TCEA provides a user-friendly interface to search, analyze, and visualize the cancer RNA editome, which helps researchers identify cancer driver–like editing events and complements well with other TCGA datasets (e.g., mutation and expression).
Materials and Methods
Sequencing, expression, and clinical data
We downloaded RNA-Seq bam files (v.2016), miRNA expression data (counts normalized to reads per million mapped reads), and clinical information across 33 cancer types in TCGA from Genomic Data Commons (GDC; https://gdc.cancer.gov/; Supplementary Table S2). For the glioblastoma multiforme (GBM) cohort, miRNA microarray expression downloaded from GDC legacy archive was used because miRNA-seq data was not available for most GBM samples. To rule out DNA changes that may affect the detection of RNA editing events, for each sample the corresponding somatic mutations and germline single nucleotide polymorphisms (SNP) were downloaded. The somatic mutations detected using whole-exome sequencing was downloaded from GDC (December 2017). Germline SNPs were obtained from: (i) whole-exome sequencing by the TCGA Pan-Cancer analysis project (ref. 12; July 2018 from GDC); and (ii) genotyping SNP array (from GDC legacy archive).
Catalog of A-to-I RNA editing sites
We downloaded a huge catalog of A-to-I RNA editing sites (∼4.5 millions) from REDIportal (ref. 10; January 2018) and converted their positions from hg19 to hg38 using UCSC's liftover.
miRNA mature sequence
We downloaded miRNA mature sequence from miRBase (release 22, http://www.mirbase.org/).
Database architecture and web interface
Data was stored in a MySQL database and web back-end was implemented by Django web framework. User interface, tables, and figures were designed and dynamically generated with Semantic UI, jQuery, and D3.js.
Site annotation
Gene structure (gencode v.23), variant (dbSNP, ExAc, and gnomAD), and disease (ClinVar, InterVar, and COSMIC v.85) information of editing sites were annotated using ANNOVAR (ref. 13; July 08, 2018). Repeat annotation and phyloP conservation score across 100 vertebrates were downloaded from RepeatMasker (http://www.repeatmasker.org/) and UCSC (https://genome.ucsc.edu/), respectively.
Detection of valid RNA editing events
Several methods have been proposed to detect RNA editing events (14–17). Both Zhang and colleagues (15) and Franzén and colleagues (17) identified editing sites (and the associated editing events) directly from RNA-seq data. The strength of Zhang and colleagues' approach is SNP-free and able to detect hyper-edited reads. Peng and colleagues (16) detected editing events only on the sites reported in RADAR (Rigorously Annotated Database of A-to-I RNA editing; http://rnaedit.com/). Because of frequent DNA alterations in cancer genomes, characterization of editing events only on known sites could reduce many false positive calls in the cancer editome. Therefore, similar to Peng and colleagues (16) we used REDItools (14) to characterize editing events on sites reported in REDIportal (10). REDIportal (10) is currently the most comprehensive editing database that includes sites reported in RADAR and other studies. Hyper-edited reads were discovered as described in Porath and colleagues (18) and only those covering known sites were considered. Impacts of hyper-edited reads on samples across cancer types are shown in Supplementary Fig. S1.
Editing events on mapped and unmapped reads.
We took an approach similar to that described in REDIportal (10). Specifically, we focused on known editing sites to reduce false positive calls derived from somatic mutations and germline SNPs. Editing events of individual RNA-seq bam files at approximately 4.5 million A-to-I RNA editing sites were detected by REDItools (ref. 14; REDItoolKnown.py) with default setting (e.g., minima read coverage, 10; minima quality score, 20; and minima mapping quality score, 25). For pair-end data, only paired concordant reads were considered. To capture hyper-edited reads, the pipeline and scripts developed by Porath and colleagues (18) were used. Briefly, unmapped reads with potential sequencing errors were first removed. The remaining reads were remapped to the genome (hg38) with all As converted to Gs in both RNA reads and genome DNA. The original sequences of the remapped reads and the original genome sequence were then compared with identity high-quality clusters of A-to-G mismatches. Reads with dense clusters of A-to-G mismatches were considered hyper-edited.
Excluding somatic and germline mutations.
To rule out DNA changes that may affect the detection of RNA editing events, for each sample we excluded sites overlapped with somatic mutations or germline SNPs.
Excluding editing events with low confidence.
After excluding sites overlapped with DNA changes, we used two filters to identity confident editing events: (i) a valid event is required to be supported by ≥2 reads with the edited G nucleotide and covered by ≥10 reads (A+G); and (ii) editing frequency (edited G reads/A+G reads) is significantly greater than 0.1% (Benjamini–Hochberg procedure corrected binomial test with FDR < 0.05).
Calculation of editing frequency and prevalence of editing sites
Note: EF = 0 (if covered by ≥10 reads and not overlapped with DNA changes but did not pass the low confidence filter); and EF = NaN (all other situations).
Identification of differentially edited sites
For the 16 cancer types with |\ge {\rm{\ }}10$| normal samples (BLCA, BRCA, COAD, ESCA, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, READ, STAD, THCA, and UCEC), we identified sites differentially edited between tumor and normal samples. We examined the difference in editing prevalence and median editing frequency of sites with ≥10 covered tumor and ≥10 covered normal samples:
By editing prevalence.
For each of the 16 cancer types, sites edited in ≥1 tumor or ≥1 normal samples were selected for Fisher exact test (adjusted by Benjamini–Hochberg procedure). Sites with FDR < 0.1 were considered differentially edited (by editing prevalence) between tumor and normal samples.
By editing frequency.
Similarly, sites edited in ≥1 tumor or ≥1 normal samples were selected for Wilcoxon rank sum test (adjusted by Benjamini–Hochberg procedure). Sites with FDR < 0.1 were considered differentially edited (by editing frequency) between tumor and normal samples.
Clustering of differentially edited sites
We identified 34,979 sites differentially edited (by editing frequency) in at least one cancer type. These sites were clustered using K-means based on the 49 median editing frequency from the 33 cancer types (for the mentioned 16 cancer types, both median editing frequency from tumor and normal samples were used; in the other 17 cancer types only median editing frequency from tumor samples were used). First, 4,958 sites were excluded because more than half of their median editing frequency were not available (e.g., no covered sample). Second, we performed Principal Component Analysis for dimension reduction. We selected the first seven components which together could explain >85% of the variance. The Elbow method was used to determine the number of clusters, such that 15 clusters were selected (Supplementary Fig. S2).
Calculation of pan-cancer rank of editing sites
Differentially edited sites.
For sites differentially edited in at least one cancer type, the ranking in editing prevalence (or frequency) is calculated as the sum of −log10 (FDR of editing prevalence or frequency) in cancer types with FDR < 0.1.
Nondifferentially edited sites.
For sites that were not differentially edited in any cancer type, the ranking in editing prevalence (or frequency) was calculated as the sum of percentiles of editing prevalence (or frequency) in 33 cancer types. In KICH and KIRP, only values from normal samples were used for percentile calculation because editing events occurred more often in normal samples. For the other 31 cancer types only tumor samples were used. For sites with no edited sample but with ≥10 covered tumor samples (≥10 covered normal for KICH and KIRP), the percentile was set zero. For sites with <10 covered tumor samples (or normal samples in KICH and KIRP), the percentile was set −10.
Identification of potential prognostic editing events
We used Cox proportional hazards model and Kaplan–Meier (KM) method to analyze the association between editing frequency and overall survival. In Cox regression, age and gender were controlled in all cancer types. In addition, tumor grade was controlled when the information of tumor grade was available, including BLCA, CESC, CHOL, ESCA, HNSC, KIRC, LGG, LIHC, OV, PAAD, STAD, and UCEC. Multiple comparisons were corrected with Benjamini–Hochberg procedure. In KM plot, log-rank test was used for statistical significance.
miRNA target prediction and identification of target loss/gain
Gencode (v.23) was used to obtain 3′ UTR regions from the human genome sequence (hg38; https://genome.ucsc.edu/). Only transcripts whose 3′ UTRs overlapped with editing sites were included for further analysis. For each 3′ UTR (unedited, UN), we first generated edited (ED) 3′ UTRs. Only one nucleotide (A) was replaced with G in each ED 3′ UTR. For example, if a 3′ UTR overlaps with three editing sites, three ED 3′ UTRs will be generated. In total, 9,014 UN and 195,563 ED 3′ UTRs were analyzed. We used three tools, including TargetScan (19), miRanda (20), and MBSTAR (21) to predict miRNA binding sites on UN and ED 3′ UTRs. Binding sites predicted by at least two tools on UN but none on ED were considered to be loss of miRNA targeting. In contrast, binding sites predicted by at least two tools on ED but none on UN were considered to be gain of miRNA targeting.
Rank of gene editing and miRNA target alterations
To measure the significance of editing events of genes and target alterations of miRNAs, we estimated editing ranks of genes and ranks of miRNA target alterations (against genes of similar gene length and miRNAs of comparable number of recognition elements, respectively).
For EdSite module.
(ii) We searched for k genes with similar size of |gen{e_i}$|:
|0.9 \times length\ ( {gen{e_i}} ) \le length\ (gen{e_{1\sim k}}) \le 1.1 \times length\ ( {gen{e_i}} )$|
If k < 30, the gene length threshold (10%) will be extended (20%) to include more genes.
(iii) We calculated Ri* (the rank of |{p_i}$|) based on values of edited percentage of the genes found in ii (|{p_{1\sim k}}$|).
(iv) A bootstrap approach (n = 1,000) was used to obtain the distribution of Ri* by randomly selecting k genes from genes found in ii and calculated Ri as in iii. This approach generated one thousand Ri*.
(v) The median of the 1,000 Ri* is denoted as the rank of genei. Lower and upper quantile of Ri* are also shown on the website.
For EdmiR module.
(i) We calculated the percentage of target loss (PiL) and target gain (PiG) of |mi{R_i}$|:
(ii) We searched for miRs with comparable number of recognition elements (RE) of |mi{R_i}$|:
a. Loss of miRi targeting comparison (based on unedited 3′ UTRs):
|\eqalign{0.8 \times \ \# \ of\ REs( {mi{R_i}} ) \le \# \ of\ REs(mi{R_{1 \sim k}}) \le 1.2\!\times \#\ of\ REs\,\cr( {mi{R_i}} )$|
b. Gain of miRi targeting comparison (based on edited 3′ UTRs):
|\eqalign{0.8 \times \ \# \ of\ REs( {mi{R_i}} ) \le \# \ of\ REs(mi{R_{1 \sim k}}) \le 1.2\! \times \# \,of\ REs\cr( {mi{R_i}} )$|
If k < 30, the threshold (20%) will be extended (30%) to include more miRs.
(iii) The following steps are the same as iii-v of EdSite. The median of 1,000 RiL* and 1,000 RiG* for miRi is denoted as the rank of target loss and target gain of miRi, respectively. Lower and upper quantile of RiL* and RiG* are also shown on the website.
Results
TCEA currently consists of three analytic modules, EdSite, EdCancer, and EdmiR (Fig. 1A). Each module supports user-defined filters and dynamically generates figures and tables useful for generating testable hypotheses and available for download. We demonstrate the utility of TCEA through examples described as below and three videos (Supplementary Videos S1–S3).
Overview of TCEA. A, TCEA data and modules for exploratory data analysis and visualization. B, “Annotation tab.” Ped and Fed of selected editing sites. C, “Differential analysis.” Sites with difference in Fed between tumor and normal samples in at least one cancer type. D, “Survival analysis.” LUAD cases with high editing frequency at chr8:102,829,408 have worse overall survival than cases with low editing frequency (log-rank test, P = 0.04). E, “Shared tab.” Ped and Fed of differential editing sites in BRCA (top, Fed: tumor > normal; bottom, Fed: normal > tumor). F, “High” tab. Ped and Fed of highly edited sites in BRCA tumors. G, “EdmiR module.” miR-128-3p expression. H, “Loss” tab. Cancer type distribution (partial) of sites altering miR-128-3p target recognition. I, “Loss” tab. Ped and Fed of sites differentially editing in H. J, Predicted binding between miR-128-3p and XIAP. Star indicates the editing site that alters miRNA targeting. K, XIAP expression in 16 cancer types with ≥10 normal samples.
Overview of TCEA. A, TCEA data and modules for exploratory data analysis and visualization. B, “Annotation tab.” Ped and Fed of selected editing sites. C, “Differential analysis.” Sites with difference in Fed between tumor and normal samples in at least one cancer type. D, “Survival analysis.” LUAD cases with high editing frequency at chr8:102,829,408 have worse overall survival than cases with low editing frequency (log-rank test, P = 0.04). E, “Shared tab.” Ped and Fed of differential editing sites in BRCA (top, Fed: tumor > normal; bottom, Fed: normal > tumor). F, “High” tab. Ped and Fed of highly edited sites in BRCA tumors. G, “EdmiR module.” miR-128-3p expression. H, “Loss” tab. Cancer type distribution (partial) of sites altering miR-128-3p target recognition. I, “Loss” tab. Ped and Fed of sites differentially editing in H. J, Predicted binding between miR-128-3p and XIAP. Star indicates the editing site that alters miRNA targeting. K, XIAP expression in 16 cancer types with ≥10 normal samples.
EdSite module
The EdSite module allows pan-cancer comparison of RNA editing events at sites of interest. The analysis can be performed using the location, functional consequence, the number of edited samples, or any combination of the above. Each EdSite query returns three tab pages. The “Annotation” tab lists various annotations (e.g., variant and disease) of selected sites in a sortable table. Each site contains five panels visualizing editing details: (i) editing prevalence (Ped) and median editing frequency (Fed) across 33 cancer types (Fig. 1B); (ii) the ranking of Ped and Fed within each cancer type; (iii) links to affected miRNA-3′ UTR pairs; (iv) survival analysis; and (v) read details and clinical information of edited samples. The “Differential” tab visualizes Ped and Fed of sites differentially edited between tumor and normal samples (Fig. 1C). The “Clusters” tab visualizes cancer type distribution of editing events and clusters to which differentially edited sites belong (Supplementary Fig. S2).
For example, users can explore editing events of genes related to cell growth and proliferation. Overexpression of AZIN1 has been shown to increase proliferation and tumorigenesis (22). We used the gene search filter to identity 234 editing sites of AZIN1, and only AZIN1S367G (chr8:102,829,408) exhibited differential editing (FDR < 0.05; Fig. 1C). Edited AZIN1S367G has been found to promote cell proliferation in hepatocellular carcinoma (23) and cell survival in human breast cell line (3). Our survival analysis showed that higher frequency of edited AZIN1S367G was associated with lower overall survival of patients with lung cancer (LUAD; Fig. 1D), consistent with the recent finding that edited AZIN1S367G induced tumor migration in lung cancers (24). Multiple filters are allowed to limit the number of editing sites. For example, we used “the number of edited tumor samples” and “gene” filters to identify sites of GRIA2 that were edited in more than 10 tumor samples. This search returned 10 editing sites and two (GRIA2Q607R and GRIA2R764G) were differentially edited. Aberrant editing of GRIA2Q607R and GRIA2R764G has been shown to promote cell migration in glioblastoma (25) and cell survival in human breast cell line (3), respectively.
EdCancer module
The EdCancer module enables flexible exploration of editing events of a sample or cohort. Cohorts can be generated using TCGA cancer type or any combination of tissue, gender, or disease status filters. Each EdCancer query returns two or three tab pages. The “Sample” tab visualizes and lists clinical information of the selected cohort. For cancers types with <10 normal samples, the “Shared” tab visualizes Ped and Fed of highly edited sites in tumors and lists details of editing sites shared by selected tumors. For cancer types with ≥10 normal samples, the “Shared” tab visualizes Ped and Fed of top differentially edited sites (Fig. 1E) and lists details of editing sites shared by selected tumor and normal samples. Ped and Fed of highly edited sites in tumors are visualized in the “High” tab. (Fig. 1F).
For example, users can search for common editing events of a cohort of interest. We selected breast cancer (BRCA) and focused on sites shared by ≥1,000 tumor and ≥100 normal samples. This query generated a sortable table containing 1,509 sites. We sorted these sites by clicking the “EdSample (T/N)” column and identified 12 sites causing protein recoding through the “AAchange” filter. Eleven of the 12 sites were differentially edited including AZIN1S367G, highlighting the importance of AZIN1 RNA editing in cancer development. The top 3 edited genes were NOP14I779V (chr4:2,938,299), FLNBM2324V (chr3: 58,156,064), and IGFBP7K95R (chr4: 57,110,068), all of which have been suggested to suppress breast cancer progression (26–28). Moreover, DNA mutations of NOP14I779V and FLNBM2324V are reported in COSMIC (29), demonstrating the utility of TCEA to identify driver-like editing events in carcinogenesis.
EdmiR module
The EdmiR module enables easy exploration of the interplay between miRNA regulation and RNA editing. Users can select a miRNA by miRBase ID or expression level in TCGA. If expression filter is used, a sortable table containing miRNAs meeting the criteria is returned. After a miRNA is selected, EdmiR returns a heatmap visualizing expression across 33 cancer types (Fig. 1G) and four tab pages. The “Gain” and “Loss” tabs return information about editing sites predicted to cause gain and loss of miRNA targeting, respectively, including (i) a bar plot visualizing cancer type distribution of sites altering target recognition (Fig. 1H); (ii) a heatmap visualizing Ped and Fed of differentially edited sites (Fig. 1I); and (iii) a sortable table integrating editing events with miRNA expression. The “Top” tab visualizes the number of target alterations of top 10 sites. The “Share” tab lists miRNAs sharing the same target alterations with the selected miRNA and visualizes the number and ratio of shared alterations.
For example, users can identify editing events likely altering targeting of miRNAs involved in cancer-related biological processes. Human miR-128-3p targets genes involved in cell proliferation and apoptosis (30). We identified 44 loss and 159 gain of miR-128-3p target sites. Among the 44 target-loss sites, chrX:123,912,716 (located on the 3′UTR of XIAP, a proto-oncogene) was the most differentially edited site (Fig. 1I). EdmiR analysis suggests that higher editing frequency at this site in tumors likely reduce miR-128-3p–mediated XIAP suppression in tumors (Fig. 1J), consistent with the higher expression of both miR-128-3p (Fig. 1G) and XIAP (Fig. 1K) in tumors relative to normal samples.
Discussion
TCEA helps cancer researchers explore functional impacts of individual editing events on protein recoding, miRNA regulation, and patient survival through interactive analyses and visualization. In the future we intend to add more cancer data (such as the International Cancer Genome Consortium; https://icgc.org/) and novel editing sites identified from hyper-edited reads. We anticipate that TCEA will broaden our understanding of the full impact of nucleic acid sequence changes in cancer development and facilitate the development of new cancer therapies.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: S. C.-C. Chen
Development of methodology: C.-H. Lin, S.C.-C. Chen
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C.-H. Lin, S.C.-C. Chen
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): C.-H. Lin, S.C.-C. Chen
Writing, review, and/or revision of the manuscript: S.C.-C. Chen
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): C.-H. Lin, S.C.-C. Chen
Study supervision: S.C.-C. Chen
Acknowledgments
This study was supported by Ministry of Science and Technology of Taiwan (105-2320-B-038-069, 106-2221-E-038-017, and 107-2311-B-038-003-MY3) and Taipei Medical University (105-AE1-B08 to S.C.-C. Chen). The results shown here are in whole or part based upon data generated by the TCGA Research Network: http://cancergenome.nih.gov/. We thank Dr. Yu-Wei Wu for providing computing resource.
References
Supplementary data
Tutorial on the EdSite module
Tutorial on the EdmiR module
Tutorial on the EdCancer module