Abstract
Long noncoding RNAs (lncRNA) have emerged as essential players in cancer biology. Using recent large-scale RNA-seq datasets, especially those from The Cancer Genome Atlas (TCGA), we have developed “The Atlas of Noncoding RNAs in Cancer” (TANRIC; http://bioinformatics.mdanderson.org/main/TANRIC:Overview), a user-friendly, open-access web resource for interactive exploration of lncRNAs in cancer. It characterizes the expression profiles of lncRNAs in large patient cohorts of 20 cancer types, including TCGA and independent datasets (>8,000 samples overall). TANRIC enables researchers to rapidly and intuitively analyze lncRNAs of interest (annotated lncRNAs or any user-defined ones) in the context of clinical and other molecular data, both within and across tumor types. Using TANRIC, we have identified a large number of lncRNAs with potential biomedical significance, many of which show strong correlations with established therapeutic targets and biomarkers across tumor types or with drug sensitivity across cell lines. TANRIC represents a valuable tool for investigating the function and clinical relevance of lncRNAs in cancer, greatly facilitating lncRNA-related biologic discoveries and clinical applications. Cancer Res; 75(18); 3728–37. ©2015 AACR.
Introduction
The human genome encodes approximately 20,000 protein-coding genes and also a large number of transcriptionally active, noncoding RNAs (∼14,000 according to the ENCODE annotation; ref. 1). Among noncoding RNAs, long noncoding RNAs (lncRNA), typically >200 bp, have increasingly been recognized as playing essential roles in tumor biology, representing a new focus in cancer research (2–4). Emerging evidence has indicated that lncRNAs contribute to tumor initiation and progression through diverse mechanisms ranging from epigenetic regulation of key cancer genes (5, 6) and enhancer-associated activity (7) to posttranscriptional processing of mRNAs (8, 9). Therefore, central tasks in cancer research are the identification of lncRNA components involved in carcinogenesis and elucidation of their functions in specific tumor contexts. That inquiry is expected to lay the foundation for development of novel biomarkers and therapeutic agents.
Recent RNA-seq data over large cancer patient cohorts provide an unprecedented opportunity to pursue that inquiry in a systematic way. In particular, The Cancer Genome Atlas (TCGA) represents a unique resource as it generates multidimensional data at the DNA, RNA, and protein levels for a broad range of human tumor types (10). However, there are several computational challenges for biomedical researchers to make full use of these data and prioritize lncRNAs for further functional investigations. First, the number of expressed lncRNAs in human cancers is large. For example, a very recent pan-cancer analysis reveals approximately 8,000 tumor-specific or lineage-specific lncRNAs (11). In terms of prioritizing lncRNAs with potential clinical relevance and elucidating their mechanisms, it is very informative to perform the correlation analysis of lncRNA expression with clinical variables (e.g., patient survival) or with the molecular characteristics of driver genes or therapeutic targets (e.g., PTEN loss or HER2 status) over large patient cohorts. But because of high dimension and complexity of the data involved, such analyses are often daunting and time-consuming. Second, the annotation of lncRNAs in the human genome is rough, very incomplete and fast evolving, so it is important for researchers to be able to query the expression profiles of user-defined lncRNAs (based on genomic coordinates). This function is not available in current lncRNA-related bioinformatics resources as it requires the calculation directly from a huge amount of raw RNA-seq mapping files. Third, given lncRNA candidates of interest, it is critical to examine their profiles in a variety of cancer cell lines, which allows researchers to choose appropriate model systems for experimental studies. 
Unfortunately, efficient bioinformatics tools with the above functions are still missing, representing a major barrier for the cancer research community to a systems-level understanding of the function and underlying mechanisms of lncRNAs.
To fill the gap, we have developed The Atlas of Noncoding RNAs in Cancer (TANRIC), a user-friendly, open resource for interactive exploration of lncRNAs in the context of TCGA clinical and genomic data. Using TANRIC, we have demonstrated that a large number of lncRNA species show differential expression among known tumor subtypes or in correlation with clinical variables; many lncRNAs show strong correlations with established therapeutic targets and biomarkers across tumor types or with drug sensitivity across cell lines; and the tumor subtypes defined by lncRNA-expression profiles show extensive concordance with established tumor subtypes and provide potential prognostic value.
Materials and Methods
Data resource
We downloaded RNA-seq BAM files of 6,309 patient samples (including 6,083 primary tumor samples and 226 metastasis samples) across 20 TCGA cancer types and their related 564 non-tumor tissue samples (if available; ref. 10) from the UCSC Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/). Included were bladder urothelial carcinoma (BLCA), brain lower-grade glioma (LGG), breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), skin cutaneous melanoma (SKCM), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), ovarian serous cystadenocarcinoma (OV), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), and uterine corpus endometrioid carcinoma (UCEC). We also downloaded 739 BAM files of Cancer Cell Line Encyclopedia (CCLE) cell lines (12) from CGHub. In addition, we obtained the RNA-seq files of 531 samples from another three independent studies, including LUAD (13), clear-cell renal cell carcinoma (14), and glioblastomas (15). In total, the current TANRIC release includes RNA-seq data from 8,143 samples (1,142 billion reads).
Efficient algorithm for expression quantitation of user-defined lncRNAs
To calculate the expression of a user-defined lncRNA, TANRIC accepts the genomic coordinates of multiple segments as the input (e.g., given a lncRNA of 3 exons, the input could be “chr7:27135713–27136007;27138458–27138985;27139398–27139585”). The total exon length of a queried lncRNA should be shorter than 50 kb. To minimize the computation time for quantifying user-defined lncRNA expression, we preprocessed all raw BAM files through five steps: (i) extraction of sequence depth data from raw BAM files using SAMtools; (ii) division of genome-wide depth data into short segments (∼3,000,000 bp); (iii) merging of the data into a single file, thereby minimizing the file input/output time when dealing with hundreds of samples for each cancer type; (iv) compression of the merged depth files using a block compression algorithm; and (v) generation of corresponding index files for quick location and retrieval of queried data. We quantified the lncRNA expression as reads per kilobase per million mapped reads (RPKM; ref. 16) and generated the expression profile in a dynamic table. With data preprocessing, the time for calculating lncRNA expression was reduced by >100-fold compared with that of SAMtools. Currently, TANRIC, operating single threaded, can generate the expression profile for any user-defined lncRNA within a minute. That capability enables rapid analysis of specified lncRNAs through a web interface.
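As a rough illustration of the query format and the RPKM definition above, the following sketch parses a coordinate string and converts a read count to RPKM. The function names, the treatment of coordinates as inclusive, and the acceptance of either a hyphen or an en dash as the range separator are assumptions made for illustration; the preprocessed depth files that TANRIC actually queries are not modeled here.

```python
import re

MAX_EXON_LENGTH = 50_000  # the 50 kb limit on total exon length stated above


def total_exon_length(exons):
    """Sum of segment lengths, assuming inclusive start and end positions."""
    return sum(end - start + 1 for start, end in exons)


def parse_query(query):
    """Parse 'chr7:100-200;300-400' into (chrom, [(start, end), ...]).

    Hypothetical helper: overlapping segments are not detected.
    """
    chrom, _, segments = query.partition(":")
    if not chrom or not segments:
        raise ValueError("expected 'chrom:start-end;start-end;...'")
    exons = []
    for segment in segments.split(";"):
        # Accept an ASCII hyphen or an en dash (U+2013) between start and end.
        match = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", segment)
        if match is None:
            raise ValueError(f"bad segment: {segment!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start > end:
            raise ValueError(f"start > end in segment: {segment!r}")
        exons.append((start, end))
    if total_exon_length(exons) >= MAX_EXON_LENGTH:
        raise ValueError("total exon length must be shorter than 50 kb")
    return chrom, exons


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)
```

Under these assumptions, the three-exon example query above has a total exon length of 1,011 bp, so 1,011 reads in a library of one million mapped reads would correspond to an RPKM of 1,000.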
Implementation of the TANRIC data portal
The expression data on annotated lncRNAs and the precalculated correlations with clinical and genomic data are stored in CouchDB. Correlation, differential expression, and survival analyses were performed in R. The Web interface was implemented in JavaScript; tables were visualized by DataTables; the embedded plots were based on HighCharts; and heat maps were generated using the Next-Generation Clustered Heat Map tool (Broom, Weinstein and colleagues; unpublished data).
Expression quantitation of annotated lncRNAs
To perform a comprehensive survey of human lncRNAs, we obtained the genomic coordinates of 13,870 human lncRNAs from the GENCODE Resource (version 19; ref. 1). We then filtered out those lncRNA exons that overlapped with any known coding genes based on the gene annotations of GENCODE (1) and RefGene. As a result, the analysis focused on the remaining 12,727 lncRNAs. On the basis of the BAM files, we quantified the expression levels of lncRNAs as RPKM, and the lncRNAs with detectable expression were defined as those with an average RPKM ≥ 0.3 across all samples in each cancer type, as defined in the literature (17).
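The detectable-expression criterion can be sketched as a simple filter. The dictionary layout (lncRNA name mapped to per-sample RPKM values within one cancer type) and the function name are hypothetical and chosen only for illustration.

```python
from statistics import mean

RPKM_CUTOFF = 0.3  # average-RPKM threshold stated in the text


def expressed_lncrnas(expression, cutoff=RPKM_CUTOFF):
    """Return sorted names of lncRNAs whose mean RPKM is >= cutoff.

    `expression` maps an lncRNA name to a list of RPKM values, one per
    sample in a single cancer type (an assumed data layout).
    """
    return sorted(
        name
        for name, values in expression.items()
        if values and mean(values) >= cutoff
    )
```

Because the average is taken within each cancer type, the same lncRNA may count as expressed in one sample set and not in another.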
Analysis of expressed lncRNAs for biomedical significance
We obtained the clinical information associated with tumor samples, including the patient's overall survival time, tumor stage, and tumor grade from Synapse TCGA Pan-Cancer data portal (https://www.synapse.org/), with ID syn300013. We also obtained known tumor subtype information from TCGA marker articles (if available). To identify lncRNAs differentially expressed between tumors and matched normal samples, we used the paired Student t test to assess the statistical difference between the two groups. To identify lncRNAs differentially expressed among established tumor subtypes or tumor stages, we used ANOVA to assess the statistical difference. Groups with fewer than five samples were excluded from the analysis.
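A minimal sketch of the paired comparison and the group-size rule follows. It computes only the paired t statistic; converting it to a P value requires a t-distribution function that the Python standard library does not provide. Applying the five-sample minimum to the number of matched pairs is an assumption, as are the function names.

```python
from math import sqrt
from statistics import mean, stdev

MIN_GROUP_SIZE = 5  # groups with fewer samples are excluded, as stated above


def paired_t_statistic(tumor, normal):
    """Paired t statistic for matched tumor/normal values in the same order."""
    if len(tumor) != len(normal):
        raise ValueError("tumor and normal must be matched pairs")
    if len(tumor) < MIN_GROUP_SIZE:
        raise ValueError("too few matched pairs")
    diffs = [t - n for t, n in zip(tumor, normal)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def drop_small_groups(groups, min_size=MIN_GROUP_SIZE):
    """Keep only subtype or stage groups with at least `min_size` samples."""
    return {
        label: values
        for label, values in groups.items()
        if len(values) >= min_size
    }
```

The statistic has n − 1 degrees of freedom, where n is the number of matched pairs.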
Analysis of lncRNA expression related to potential clinical applications
We obtained a list of 121 actionable target genes from Van Allen and colleagues (18) and added two genes that are well-established targets in immune therapy. We downloaded TCGA molecular profiling data of these target genes, including somatic mutations, mRNA expression, miRNA expression, and somatic copy-number alteration (SCNA) data from Synapse TCGA Pan-Cancer data portal. Student t tests were used to assess the statistical difference in lncRNA expression between mutated and wild-type samples given a gene of interest, and Spearman rank correlations were used to assess relationships between lncRNA expression and SCNA or mRNA, with a coefficient (absolute value) cutoff of 0.6. Multiple comparisons correction was performed using the Benjamini–Hochberg method with a corrected false discovery rate (FDR) cutoff of 0.05, and a 2-fold change between at least two groups was also required. To assess the effects of lncRNA expression on drug sensitivity, we downloaded the drug screening data from CCLE (http://www.broadinstitute.org/ccle/home), and calculated the correlations between the expression levels of approximately 1,290 expressed lncRNAs and the IC50 values of 24 drugs across approximately 330 cell lines. Spearman rank correlations were used to detect significant correlations with a coefficient (absolute value) cutoff of 0.3.
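The two statistical ingredients named above, Spearman rank correlation and the Benjamini–Hochberg correction, can be sketched in plain Python as follows. This is an illustrative reimplementation under the standard textbook definitions, not the R code used for the analyses; function names are hypothetical.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A pair would then be retained when the absolute coefficient exceeds the stated cutoff (0.6 for SCNA or mRNA, 0.3 for drug IC50) and, where a test is applied, the adjusted P value falls below the FDR cutoff.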
Analysis of tumor subtypes revealed by lncRNA expression
To classify tumor subtypes based on lncRNA expression, for each cancer type, we selected the 500 lncRNAs with the most variable expression pattern and used ConsensusClusterPlus (19) to classify the tumor samples into sample clusters (subtypes). We then used the χ2 test to determine concordance between lncRNA-expression subtypes and known subtypes, and the log-rank test to examine whether lncRNA-expression subtypes significantly correlated with the overall patient survival times. To understand the molecular mechanisms associated with lncRNA subtypes, we downloaded reverse-phase protein array (RPPA) expression data from The Cancer Proteome Atlas (20). Pathway analysis was conducted as previously described (21). Briefly, the members of each pathway were predefined on the basis of a literature search. RPPA data were median-centered and normalized by SD across all samples for each component to obtain relative protein levels. The pathway score was then taken as the sum of the relative protein levels of all positive regulatory components minus the equivalent sum for the negative regulatory components in a particular pathway. Antibodies targeting different phosphorylated forms of the same protein with Pearson correlation coefficient >0.85 were averaged. We used a Student t test or ANOVA analysis to assess statistical differences in pathway score among groups, using the Benjamini–Hochberg correction, with FDR cutoff of 0.05.
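The pathway score described above can be sketched as follows. The data layout (component name mapped to per-sample RPPA values) and the function names are hypothetical, and the averaging of highly correlated phospho-specific antibodies is omitted for brevity.

```python
from statistics import median, stdev


def standardize(values):
    """Median-center one RPPA component across samples and divide by its SD."""
    center = median(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("component has zero variance across samples")
    return [(v - center) / sd for v in values]


def pathway_scores(rppa, positive, negative):
    """Per-sample score: sum of positive regulators minus negative regulators.

    `rppa` maps a component name to a list of protein levels, one per sample
    (an assumed layout). `positive` and `negative` list the pathway members.
    """
    members = list(positive) + list(negative)
    if not members:
        raise ValueError("pathway has no members")
    relative = {name: standardize(rppa[name]) for name in members}
    n_samples = len(relative[members[0]])
    return [
        sum(relative[p][s] for p in positive)
        - sum(relative[n][s] for n in negative)
        for s in range(n_samples)
    ]
```

The resulting per-sample scores are the values that would be compared among lncRNA-expression subtypes with a t test or ANOVA.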
Results
A user-friendly, interactive, open-access platform for exploring the function of lncRNAs in cancer
To provide a comprehensive lncRNA resource to the cancer research community, we have collected large-scale RNA-seq datasets from TCGA and other, independent studies and have made processed lncRNA expression data plus multiple analysis and visualization modules available through TANRIC (http://bioinformatics.mdanderson.org/main/TANRIC:Overview; Fig. 1). This data release, which covers 8,143 samples, has three parts (Table 1). (i) Part one consists of TCGA tissue sample sets: 6,309 tumor samples from 20 cancer types and 564 normal samples from 11 tissues. Other TCGA cancer sets will be added in the coming months. (ii) Part two consists of independent tumor tissue sample sets: one GBM set (274 samples; ref. 15), one KIRC set (97 samples; ref. 14), and one LUAD set (83 samples; ref. 13). Other independent sample sets will be added when available. (iii) Part three consists of tumor cell lines: 739 cell line samples from CCLE (12). To our knowledge, this represents the largest publicly available collection of lncRNA data with parallel multidimensional cancer genomic data.
Table 1. Summary of the data resources of the current TANRIC release
Data source | Cancer type | #Normal samples | #Tumor samples | Sequencing strategy | Read length | #Expressed lncRNAs^a |
---|---|---|---|---|---|---|
TCGA | Bladder urothelial carcinoma (BLCA) | 19 | 252 | Paired-end | 48 | 1,958 |
TCGA | Brain lower grade glioma (LGG) | 0 | 486 | Paired-end | 48 | 2,301 |
TCGA | Breast invasive carcinoma (BRCA) | 105 | 837 | Paired-end | 50 | 1,960 |
TCGA | Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) | 3 | 196 | Paired-end | 48 | 1,846 |
TCGA | Colon adenocarcinoma (COAD) | 0 | 157 | Single-end | 76 | 714 |
TCGA | Skin cutaneous melanoma (SKCM) | 0 | 226 | Paired-end | 48 | 1,755 |
TCGA | Glioblastoma multiforme (GBM) | 0 | 154 | Paired-end | 76 | 2,369 |
TCGA | Head and neck squamous cell carcinoma (HNSC) | 42 | 426 | Paired-end | 48 | 1,357 |
TCGA | Kidney chromophobe (KICH) | 25 | 66 | Paired-end | 48 | 1,971 |
TCGA | Kidney renal clear cell carcinoma (KIRC) | 67 | 448 | Paired-end | 50 | 2,111 |
TCGA | Kidney renal papillary cell carcinoma (KIRP) | 30 | 198 | Paired-end | 48 | 2,118 |
TCGA | Liver hepatocellular carcinoma (LIHC) | 50 | 200 | Paired-end | 48 | 1,446 |
TCGA | Lung adenocarcinoma (LUAD) | 58 | 488 | Paired-end | 48 | 2,031 |
TCGA | Lung squamous cell carcinoma (LUSC) | 17 | 220 | Paired-end | 50 | 1,883 |
TCGA | Ovarian serous cystadenocarcinoma (OV) | 0 | 412 | Paired-end | 75 | 1,866 |
TCGA | Prostate adenocarcinoma (PRAD) | 52 | 374 | Paired-end | 48 | 2,010 |
TCGA | Rectal adenocarcinoma (READ) | 0 | 71 | Single-end | 76 | 716 |
TCGA | Stomach adenocarcinoma (STAD) | 33 | 285 | Paired-end | 75 | 1,328 |
TCGA | Thyroid carcinoma (THCA) | 59 | 497 | Paired-end | 48 | 1,900 |
TCGA | Uterine corpus endometrioid carcinoma (UCEC) | 4 | 316 | Single-end | 76 | 855 |
CCLE | Tumor cell lines | 0 | 739 | Paired-end | 101 | 2,137 |
Independent | Chinese_GBM | 0 | 274 | Paired-end | 101 | 2,419 |
Independent | Japanese_KIRC | 0 | 97 | Paired-end | 100 | 2,308 |
Independent | Korean_LUAD | 77 | 83 | Paired-end | 101 | 2,569 |
^a Expressed lncRNAs are defined as those with an average RPKM ≥ 0.3 across all samples in each cancer type.
TANRIC integrates lncRNA expression data with clinical and genomic data (Fig. 1) and provides a user-friendly interface consisting of six modules: Summary, Visualization, Download, My lncRNA, Analyze all lncRNAs and lncRNAs in cell lines (Fig. 2,i). The “Summary” module shows an overview of RNA-seq datasets in TANRIC with a detailed description of each set (e.g., source, read length, sequencing platform, and sequencing strategy; Fig. 2,ii). The “Visualization” module offers an innovative way to examine the global patterns of lncRNA expression in a specific sample set through “next-generation clustered heat maps” (Fig. 2,iii). The interactive heat maps allow users to zoom, navigate, and drill down on clustering patterns (subtypes) of samples or lncRNAs and link to relevant biologic information sources. The “Download” module allows users to obtain the expression data of approximately 13,000 annotated lncRNAs for analysis (Fig. 2, iv; Materials and Methods).
Overview of TANRIC data portal. The panel of six modules (i); the “Summary” module (ii); the “next-generation clustered heat map” view in the “Visualization” module (iii); the “Download” module (iv); the three analysis modules provide raw expression data on a lncRNA of interest (v); the analysis modules provide clinical data analysis of lncRNAs (including differential analysis among tumor subtypes, stages, and grades) and analysis of correlation with patient survival (vi); and the analysis modules provide genomic data analysis of lncRNAs, including differential analysis between mutated and wild-type samples for a protein-coding gene of interest and analysis of correlations with SCNA, miRNA, mRNA, and protein expression (vii).
TANRIC provides three analysis modules that enable users to examine the function and underlying mechanisms of lncRNAs in a flexible, interactive way. The “My lncRNA” module provides detailed information about one lncRNA of interest in a user-specified patient sample set. With the module, users can obtain the expression data for any annotated lncRNA (Fig. 2,v) and examine whether the lncRNA shows differential expression between tumor and normal samples or among tumor subgroups (as visualized through the box plots, Fig. 2,vi) or whether it correlates with patient survival time (based on P values from the univariate Cox proportional hazards model and log-rank test and visualization through a Kaplan–Meier plot, Fig. 2,vi). This module also enables users to examine the correlations of the lncRNA with various molecular data for protein-coding and miRNA genes. The data types include SCNAs, mRNA expression, miRNA expression, protein expression (as visualized through the scatter plots in Fig. 2,vii) and somatic mutations (as visualized through the box plots, Fig. 2,vii). For example, elevated BCAR4 expression has been shown to significantly correlate with shorter survival time of breast cancer patients (22); and HOTAIR, a well-studied lncRNA, is known to be coexpressed with HOXC genes (23). Through this module, these findings can be easily confirmed on the basis of TCGA cohorts. Because the annotation of lncRNAs is still rough and incomplete, this module also allows for the query of any user-defined lncRNA or its isoform (based on genomic coordinates) and returns the analysis results.
The “Analyze all lncRNAs” module allows users to analyze approximately 13,000 GENCODE-annotated lncRNAs in a user-specified patient sample set. With this module, users can easily identify the most differentially expressed lncRNAs among tumor subtypes or those with the strongest correlations with patient survival times (Fig. 2,vi). Given a known coding/miRNA gene of interest, this module helps identify those lncRNAs with the strongest associations for various types of molecular data (Fig. 2,vi). The results are presented in a table, and users can search the results by lncRNA name, rank the correlations, and visually examine the details. The “lncRNAs in cell lines” module provides analyses similar to those in “My lncRNA,” but in sets based on cell lines. It can help users identify appropriate cell line models for functional experiments. Through the TANRIC portal, users can perform extensive analyses on lncRNAs, both within and across tumor types, and obtain publication-quality figures in a convenient way.
A large number of lncRNAs with potential biomedical significance across cancer types
Using the data and analysis modules available at TANRIC, we performed a comprehensive survey to assess the potential biomedical significance of lncRNAs. First, for 12 TCGA cancer types with available non-tumor samples, we found large numbers of lncRNAs with significant differential expression between tumor and matched normal samples (paired t test; FDR < 0.05; fold change ≥2; Fig. 3A). As an independent validation, 81% of the differentially expressed lncRNAs identified in the TCGA LUAD set were confirmed in a Korean sample set (13). Second, for 11 TCGA cancer types with established biologic or molecular subtypes (Supplementary Table S1), we found considerable numbers of differentially expressed lncRNAs among the known tumor subtypes (t test or ANOVA; FDR < 0.05; fold change ≥2 in at least two groups; Fig. 3B), and those lncRNAs may play a role in defining tumor heterogeneity within a cancer type. Third, for eight TCGA cancer types with sufficient samples available across different disease stages (tumor stages I–IV), we identified some lncRNAs for which the expression patterns correlated with disease stage. Some showed a monotonic change [e.g., 71 and 41 in kidney clear cell cancer (KIRC) and KIRP, respectively; ANOVA, FDR < 0.05, fold change ≥2 in at least two groups; Fig. 3C]. Those lncRNAs may be involved in tumor progression. Across the above three analyses, we demonstrate an abundance of lncRNAs with potential biomedical relevance, and many of the lncRNAs show significance in more than one cancer type (Fig. 3D).
A large number of lncRNAs with potential biomedical significance in various cancer types. A, the total bars represent the numbers of expressed lncRNAs; red represents the numbers of lncRNAs differentially expressed between tumor and matched normal samples across tumor types. B, the total bars represent the numbers of expressed lncRNAs; blue represents the numbers of differentially expressed lncRNAs among known tumor subtypes. C, the total bars represent the numbers of expressed lncRNAs; green represents the numbers of differentially expressed lncRNAs among clinical stages, among which the light green parts represent those with a pattern of consistent increase or decrease across stages. D, the pie chart showing the numbers of lncRNAs with biomedical significance across tumor types.
To examine the potential impact of lncRNAs on clinical practice, we focused on 123 clinically actionable genes (18). According to their clinical utility, we classified the genes into four groups: (i) therapeutic targets with FDA drugs approved for cancer treatment; (ii) therapeutic targets with drugs in late-stage clinical trials; (iii) therapeutic targets with drugs in early-stage clinical trials; and (iv) other established diagnostic and prognostic biomarkers (Supplementary Table S2). We then examined the correlations between the expressed lncRNAs and the actionable genes, and found considerable numbers of lncRNAs strongly correlated with one or more targets in terms of (i) differential expression between samples with wild-type and mutated genes (t test, FDR < 0.05, and fold change ≥2); (ii) correlation with SCNAs (Spearman rank correlation ∣Rs∣ > 0.6); and (iii) correlation with mRNA expression (Spearman rank correlation ∣Rs∣ > 0.6; Fig. 4A). Focusing on strongly correlated lncRNA-target pairs, we found that many of the pairs are consistently identified in multiple TCGA cancer types (Fig. 4B). These results highlight the potential of lncRNAs as regulators of key therapeutic targets for clinical practice.
Associations of lncRNAs with clinically actionable genes or drug sensitivity. A, numbers of lncRNAs for which the expressed levels are associated with an SCNA, mRNA expression, or somatic mutation of clinically actionable genes in each cancer type. B, numbers of lncRNA–gene pairs across multiple cancer types. The color bars represent the frequencies according to the clinical utility of actionable genes. C, a Manhattan plot showing the correlations of lncRNA expression and drug IC50 across CCLE cell lines. Each dot represents one lncRNA–drug correlation and correlations for different drugs are shown in different colors.
To explore the potential effects of lncRNAs on drug sensitivity, we identified the expressed lncRNAs in the CCLE cell lines (12) and examined their correlations with the sensitivity data (IC50) of the 24 drugs available. Interestingly, we found 202 lncRNA–drug pairs with significant correlations (Spearman rank correlation ∣Rs∣ > 0.3 and FDR < 0.01; Fig. 4C). These results suggest a critical role of some lncRNAs in affecting the response to cancer therapies.
Biologic and clinical relevance of tumor subtypes revealed by lncRNA expression
Finally, we examined the clinical relevance of tumor subtypes revealed by TCGA lncRNA expression profiles. On the basis of the top 500 lncRNAs with the most variable expression, we defined sample subtypes (sample clusters) by ConsensusClusterPlus (Materials and Methods; ref. 19). For each of the TCGA cancer types we studied, lncRNA-expression subtypes show extensive, strong concordance with established subtypes (χ2 test; P < 0.05; FDR < 0.05; Fig. 5A). For example, lncRNA subtype 1 in breast cancer (BRCA) almost exclusively corresponds to the basal subtype; lncRNA subtype 5 in HNSC primarily corresponds to HPV-negative tumors; and lncRNA subtype 1 in endometrial cancer (UCEC) mainly represents the high copy-number molecular subtype (24). We next assessed the prognostic value of lncRNA-expression subtypes. For BRCA, HNSC, KIRC, and brain LGG, the lncRNA-expression subtypes show distinct patient survival profiles (log-rank test, P < 0.05, Fig. 5B). As an independent validation, the three lncRNA-expression subtypes in another independent KIRC cohort (14) also show a significant correlation with the overall patient survival times (Supplementary Fig. S1). Furthermore, given clinical variables (i.e., disease stage and tumor grade), the lncRNA subtypes confer additional prognostic power in BRCA and KIRC (multivariate Cox proportional hazards model, P < 0.05).
lncRNA expression reveals clinically and biologically relevant tumor subtypes. A, lncRNA-expression subtypes show extensive, strong concordance with established tumor subtypes. B, lncRNA-expression subtypes appear to be correlated with overall patient survival times in BRCA, HNSC, KIRC, and LGG. C, key signaling pathways are differentially expressed among tumor subtypes defined by lncRNA expression. The colors in the heat map represent the statistical significance (FDR) of the associations between lncRNA-expression tumor subtypes and the protein-expression pathway scores.
To explore molecular mechanisms associated with the tumor subtypes defined by lncRNA expression, we examined whether biologic pathways were differentially expressed among the tumor subtypes, using pathway scores calculated from TCGA protein expression data (21). We found that the tumor subtypes defined by lncRNA expression (Fig. 5A) are associated with activation or inhibition of some pathways (Fig. 5C). These results suggest that lncRNA expression represents one meaningful dimension of tumor molecular profiles; therefore, integrating lncRNA expression with other molecular data may help characterize the molecular basis of human cancer more fully.
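When many subtype-pathway associations are tested at once, the resulting P values are typically converted to false discovery rates before display. The following minimal Python sketch shows one common way to do this, the Benjamini-Hochberg procedure. That this particular procedure was used here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values (FDR), in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for four pathway-subtype associations.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(fdr)
```

Each adjusted value estimates the smallest FDR at which the corresponding association would be called significant.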
Discussion
We have developed TANRIC, a user-friendly, interactive, open-access web resource for exploring the functions and mechanisms of lncRNAs in cancer. Compared with other available lncRNA-focused bioinformatics resources (11, 25–34), TANRIC has several unique features (Table 2): (i) it provides extensive, intuitive, and interactive analyses of lncRNAs of interest for their interactions with other TCGA genomic/proteomic/epigenomic and clinical data types, both within a tumor type and across tumor types; (ii) it enables users to query expression profiles of user-defined lncRNAs quickly; and (iii) it includes RNA-seq data from well-characterized cell lines and other large, non-TCGA patient cohorts, thereby allowing users to validate a pattern of interest or identify model cell lines for experimental characterization. With its efficient analytic modules, TANRIC substantially lowers the barriers between cancer researchers and complex cancer transcriptomic data (>60 TB and 1,142 billion reads in the current release). Going forward, we will continually incorporate newly available large-scale cancer RNA-seq data into TANRIC.
We have further demonstrated the utility of TANRIC through a comprehensive pan-cancer analysis of expressed lncRNAs. Consistent with previous studies (4, 11, 35, 36), our analysis revealed a large number of tumor-associated lncRNAs. More importantly, we report that some lncRNAs show strong correlations with established therapeutic targets across tumor types or with drug sensitivity across cell lines. Although these correlations do not necessarily indicate direct cause–effect relationships, they highlight the potential of lncRNAs as a novel class of biomarkers or therapeutic targets. TANRIC thus represents a starting point for exploring particular lncRNA species and for generating testable hypotheses for further experimental investigation.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: J. Li, L. Han, H. Liang
Development of methodology: J. Li, L. Han, H. Liang
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): J. Li, L. Han, L. Liu, H. Liang
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J. Li, L. Han, P. Roebuck, L. Diao, L. Liu, Y. Yuan, H. Liang
Writing, review, and/or revision of the manuscript: J. Li, L. Han, P. Roebuck, J.N. Weinstein, H. Liang
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J. Li, P. Roebuck, L. Liu, J.N. Weinstein
Study supervision: H. Liang
Acknowledgments
The authors gratefully acknowledge contributions from the TCGA Research Network and its TCGA Pan-Cancer Analysis Working Group. The authors thank the MD Anderson high-performance computing core facility for computing resources and LeeAnn Chastain for editorial assistance.
Grant Support
This study was supported by the NIH (CA143883 to J.N. Weinstein; CA175486 to H. Liang; and the MD Anderson Cancer Center Support Grant P30 CA016672 to J.N. Weinstein and H. Liang); the R. Lee Clark Fellow Award from The Jeanne F. Shelby Scholarship Fund (H. Liang); a grant from the Cancer Prevention and Research Institute of Texas (RP140462 to H. Liang); the Mary K. Chapman Foundation and the Lorraine Dell Program in Bioinformatics for Personalization of Cancer Medicine to J.N. Weinstein.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.