Abstract
Genes that are commonly deregulated in cancer are clinically attractive as candidate pan-diagnostic markers and therapeutic targets. To globally identify such targets, we compared Cap Analysis of Gene Expression profiles from 225 different cancer cell lines and 339 corresponding primary cell samples to identify transcripts that are deregulated recurrently in a broad range of cancer types. Comparing RNA-seq data from 4,055 tumors and 563 normal tissues profiled in The Cancer Genome Atlas and FANTOM5 datasets, we identified a core transcript set with theranostic potential. Our analyses also revealed enhancer RNAs, which are upregulated in cancer, defining promoters that overlap with repetitive elements (especially SINE/Alu and LTR/ERV1 elements) that are often upregulated in cancer. Lastly, we documented for the first time upregulation of multiple copies of the REP522 interspersed repeat in cancer. Overall, our genome-wide expression profiling approach identified a comprehensive set of candidate biomarkers with pan-cancer potential, and extended the perspective and pathogenic significance of repetitive elements that are frequently activated during cancer progression. Cancer Res; 76(2); 216–26. ©2015 AACR.
Introduction
Successful cancer treatment depends heavily on early detection and diagnosis. Despite decades of research, relatively few biomarkers are routinely used in clinics (e.g., CA-125 and PSA in ovarian and prostate cancers, respectively; refs. 1, 2). There is a need for reliable and clinically applicable new cancer biomarkers for early detection. Cancers originating in the same tissue can be very heterogeneous, often being derived from different cell types and having drastically different mutation profiles (3). At the same time, cancers from different tissues can share common features. For example, The Cancer Genome Atlas (TCGA) has found genes and pathways, DNA copy number alterations, mutations, methylation, and transcriptome changes that recur across 12 different primary tumor types (4).
Here, using Cap Analysis of Gene Expression (CAGE) data collected for the Functional ANnoTation Of Mammalian genome (FANTOM5) project (5), we identified mRNAs, long-noncoding RNAs (lncRNA), enhancer RNAs (eRNA), and RNAs initiating from within repeat elements that are recurrently perturbed in cancer cell lines. To confirm that these transcripts are relevant to tumors, we compared their expression in 4,055 primary tumors and 563 matching tissue samples profiled by RNA-seq by the TCGA (6) and in a set of colorectal tumor samples profiled proteomically (7). Finally, for the most promising biomarker candidates we performed qRT-PCR validations in cancer cell lines and tumor cDNA panels. Taken together, our analyses identified a set of robust pan-cancer biomarker candidates, which have the potential for development as blood biomarkers for early detection and for histological screening of biopsies.
This work is part of the FANTOM5 project. Data download, genomic tools, and copublished manuscripts have been summarized at the FANTOM5 website (8).
Materials and Methods
FANTOM5 data
We used the cap analysis of gene expression (CAGE) data from the FANTOM5 project (libraries sequenced to a median depth of 4 million mapped tags; ref. 5). We used 564 CAGE profiles: 225 cancer cell lines and 339 primary cell samples. We split the data into three datasets: (i) matched solid, (ii) unmatched solid, and (iii) matched blood (see Supplementary Tables S1A and S1C for the list of cancer types and sample annotation). The CAGE tag counts under 184,827 robust decomposition-based peak identification (DPI) clusters (5) were used to represent promoter-level expression. For enhancer activity, we used the CAGE tag counts under 43,011 enhancer regions identified in ref. 9.
FANTOM5 differential expression analysis
To identify up- and downregulated transcripts in cancer cell lines versus normal primary cells, we used genewise negative binomial generalized linear models as implemented in edgeR (10). The cancer versus normal comparison was performed using the glmLRT function. In the matched solid comparison, we set an equal weight for each solid cancer type, so that each type contributed equally to the overall comparison. In the unmatched solid and matched blood datasets, a simple cancer versus normal comparison was performed.
The P-values were adjusted for multiple testing by Benjamini–Hochberg method. The thresholds of fold change >4 and FDR <0.01 were used.
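The adjustment and thresholding step can be illustrated with a minimal Python sketch. It is not the edgeR implementation used in the study: the function names `benjamini_hochberg` and `select_features` are hypothetical, and the inputs are assumed to be per-feature log2 fold changes and raw P-values that have already been computed elsewhere.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_features(log2_fc, pvalues, min_fold=4.0, max_fdr=0.01):
    """Indices of features with |fold change| > min_fold and FDR < max_fdr."""
    fdr = benjamini_hochberg(pvalues)
    cutoff = math.log2(min_fold)  # a 4-fold change is 2 on the log2 scale
    return [i for i in range(len(pvalues))
            if abs(log2_fc[i]) > cutoff and fdr[i] < max_fdr]


# Toy input: four hypothetical features.
selected = select_features([3.0, 1.0, -2.5, 2.5], [1e-6, 1e-6, 1e-5, 0.2])
```

In this toy input, the second feature fails the fold-change cutoff and the fourth fails the FDR cutoff, so only the first and third features are retained.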
ON/OFF analysis
For each feature, we determined the expression status in binary fashion: ON (expressed, count > 0) or OFF (not detected, count = 0). We then calculated the frequency of expression in cancer and normal samples. Features expressed four times more frequently in cancer than in normal samples were selected as “ON in cancer,” whereas features not expressed/lost four times more often in cancer than in normal samples were selected as “OFF in cancer.” The procedure was applied to each dataset (matched solid, unmatched solid, and matched blood). The significance of the association (contingency) between ON/OFF status and cancer/normal status was tested by a two-sided Fisher exact test with adjustment for multiple testing by the Benjamini–Hochberg method. The threshold of FDR < 0.01 was used. The differential expression pipeline described above was applied separately to the DPI/promoter counts and the enhancer counts. The features found differentially expressed in all three datasets were selected as “pan” cancer features, whereas features differentially expressed in the matched and unmatched solid datasets only were selected as “solid only” cancer features.
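The ON/OFF call for a single feature can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the names `fisher_exact_two_sided` and `on_off_call` are hypothetical, "four times more frequently" is interpreted as a ratio of expression frequencies of at least 4 (an assumption), and the resulting P-values would still need Benjamini–Hochberg adjustment across all features.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    total = sum(q for q in (prob(x) for x in range(lo, hi + 1))
                if q <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def on_off_call(cancer_counts, normal_counts, ratio=4.0):
    """Classify one feature as ON/OFF in cancer and return (status, P-value)."""
    on_c = sum(1 for x in cancer_counts if x > 0)
    on_n = sum(1 for x in normal_counts if x > 0)
    off_c = len(cancer_counts) - on_c
    off_n = len(normal_counts) - on_n
    freq_on_c = on_c / len(cancer_counts)
    freq_on_n = on_n / len(normal_counts)
    if freq_on_c > 0 and freq_on_c >= ratio * freq_on_n:
        status = "ON in cancer"
    elif freq_on_c < 1 and (1 - freq_on_c) >= ratio * (1 - freq_on_n):
        status = "OFF in cancer"
    else:
        status = "unchanged"
    return status, fisher_exact_two_sided(on_c, off_c, on_n, off_n)


# Toy tag counts for one hypothetical promoter.
status, p_value = on_off_call([5, 3, 0, 2, 7, 1, 4, 9], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
```

The Fisher test is written with exact integer binomial coefficients to keep the sketch free of external dependencies; a statistics library would normally be used instead.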
TCGA RNA-seq data
We obtained RNA-seq profiles of 4,055 cancer samples and 563 normal tissue samples from The Cancer Genome Atlas (TCGA) Data Portal (data status as of Aug 5, 2013, origin listed in Supplementary Table S1B; ref. 6). The profiles represented 14 solid cancer types for which both tumor and normal tissue samples were available. We downloaded level 3 RNASeqV2, upper quartile normalized RSEM count estimates with expression profiles of 20,531 genes in 4,618 samples.
The counts were log2 transformed and used as an input expression data to LIMMA.
The cancer versus normal comparison was performed using equal weight for each solid cancer type, each type contributing equally to overall comparison. The P-values were adjusted for multiple testing by Benjamini–Hochberg method. The thresholds of fold change >2 and FDR <0.01 were used.
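One way to picture the equal-weight comparison is as the average of per-type tumor-versus-normal differences on the log2 scale, so that a cancer type with many samples does not dominate the estimate. The sketch below illustrates that idea only. It is not the LIMMA model itself; the function name `equal_weight_log2_fc` and the pseudocount of 1 are assumptions introduced here.

```python
import math
from statistics import mean


def equal_weight_log2_fc(samples, pseudocount=1.0):
    """Average per-type log2(tumor/normal) difference for one gene.

    samples: iterable of (cancer_type, is_tumor, count) tuples. Every cancer
    type must have at least one tumor and one normal sample.
    """
    by_group = {}
    for cancer_type, is_tumor, count in samples:
        value = math.log2(count + pseudocount)  # pseudocount is an assumption
        by_group.setdefault((cancer_type, is_tumor), []).append(value)
    cancer_types = sorted({cancer_type for cancer_type, _ in by_group})
    per_type = [mean(by_group[(t, True)]) - mean(by_group[(t, False)])
                for t in cancer_types]
    # Each cancer type contributes one value, regardless of its sample count.
    return mean(per_type)


# Toy data: type A has two tumors and one normal, type B one tumor and four normals.
toy = [("A", True, 7), ("A", True, 7), ("A", False, 1),
       ("B", True, 3), ("B", False, 3), ("B", False, 3),
       ("B", False, 3), ("B", False, 3)]
overall = equal_weight_log2_fc(toy)
```

Here type A shows a log2 difference of 2 and type B a difference of 0, so the equal-weight estimate is 1 even though the two types have different numbers of samples.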
Enrichment for cancer-related genes
We tested for the enrichment by applying a hypergeometric test, using the significance threshold of P < 0.05. The list of oncogenes was a union of oncogenes listed in MSigDB (11) and UniProt (12) databases. For tumor suppressors, we considered genes listed in at least two of three sources: MSigDB (11), UniProt (12), and TSGene (13) tumor suppressor list. For the list of genes frequently mutated in cancer we used the high confidence drivers mutated across 12 cancer types from Tamborero and colleagues (14). We also tested for enrichment of cancer-related genes listed in COSMIC: Cancer Gene census (15).
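The enrichment test can be sketched as an upper-tail hypergeometric probability. The function name `hypergeom_enrichment_p` and the numbers in the usage line are hypothetical; in particular, the background gene universe used in the study is not stated in this section and is treated here as an input.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` are cancer related."""
    denom = comb(population, selected)
    upper = min(annotated, selected)
    numer = sum(comb(annotated, k) * comb(population - annotated, selected - k)
                for k in range(overlap, upper + 1))
    return numer / denom


# Hypothetical example: 10 genes, 4 annotated, 3 selected, 2 in the overlap.
p = hypergeom_enrichment_p(10, 4, 3, 2)
```

A gene set would be called enriched when this probability falls below the P < 0.05 threshold stated above.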
Chromatin Interaction Analysis Paired-End Tag enhancer–promoter pairs
We obtained the Chromatin Interaction Analysis Paired-End Tags (ChIA-PET) data from the ENCODE/GIS-Ruan project (GSE39495, April 21, 2014). Data files from 15 experiments covered five cell lines (Hct-116, Helas3, K562, Mcf7, and Nb4) and three transcription factors (Pol2, Ctcf, ERalpha). We merged the interactions from all experiments. We then extracted the ChIA-PET interaction clusters overlapping the genomic locations of enhancers and checked whether the linked genomic locations overlapped promoters of annotated genes.
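The linking step amounts to an interval-overlap lookup on both anchors of each interaction. The sketch below is a simplified illustration with hypothetical names (`overlaps`, `link_enhancers_to_genes`) and toy coordinates. It assumes half-open (chromosome, start, end) intervals and uses a brute-force scan, whereas a genome-scale analysis would use an interval index.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def link_enhancers_to_genes(interactions, enhancers, promoters):
    """Return sorted (enhancer_id, gene) pairs joined by an interaction.

    interactions: list of (anchor1, anchor2) interval pairs
    enhancers:    dict of enhancer_id -> interval
    promoters:    dict of gene name -> interval
    """
    links = set()
    for left, right in interactions:
        # An enhancer may sit under either anchor, so test both orientations.
        for near, far in ((left, right), (right, left)):
            for enh_id, enh in enhancers.items():
                if not overlaps(near, enh):
                    continue
                for gene, prom in promoters.items():
                    if overlaps(far, prom):
                        links.add((enh_id, gene))
    return sorted(links)


# Toy coordinates, not taken from the study.
toy_interactions = [(("chr1", 100, 200), ("chr1", 5000, 5100)),
                    (("chr2", 10, 20), ("chr2", 900, 950))]
toy_enhancers = {"enh1": ("chr1", 150, 250)}
toy_promoters = {"GENE_A": ("chr1", 5050, 5150), "GENE_B": ("chr2", 900, 950)}
links = link_enhancers_to_genes(toy_interactions, toy_enhancers, toy_promoters)
```

Only the first toy interaction has an enhancer under one anchor and a promoter under the other, so a single enhancer–gene link is reported.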
Quantitative PCR for cell line samples and human cancer/normal tissue cDNA
Five hundred nanograms of total RNA from K562, HepG2, MCF7, and HDF cells was reverse-transcribed using an oligo dT primer, and the product was then diluted 12.5 times with DNA/RNA-free water. Primers for real-time PCR were designed with the Primer3 web tool (Supplementary Table S12). The housekeeping gene ACTB was used to normalize the expression levels. Quantitative PCR (qPCR) was carried out with an ABI 7500 Fast Real-Time PCR System using Power SYBR Green PCR Master Mix. For validation in tumor samples, we performed qPCR reactions on the TissueScan Cancer Survey Panel 96 – I cDNA panel (CSRT501, OriGene, MD). Each reaction was run in triplicate for cell line samples and as a single reaction for the human cancer/normal tissue cDNA panel.
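Normalization to ACTB can be illustrated with the widely used 2^−ΔΔCt calculation. The text above states only that ACTB was used for normalization, so the specific formula, the assumption of 100% amplification efficiency, and the function name `relative_expression` are assumptions made for this sketch.

```python
from statistics import mean


def relative_expression(target_ct, actb_ct, control_target_ct, control_actb_ct):
    """Fold change of a target in a sample relative to a control sample.

    Each argument is a list of replicate Ct values. ACTB serves as the
    reference gene. Assumes perfect doubling per cycle (an assumption).
    """
    delta_sample = mean(target_ct) - mean(actb_ct)
    delta_control = mean(control_target_ct) - mean(control_actb_ct)
    return 2.0 ** -(delta_sample - delta_control)


# Hypothetical triplicate Ct values for a cancer cell line versus fibroblasts.
fold = relative_expression([24.0, 24.0, 24.0], [18.0, 18.0, 18.0],
                           [27.0, 27.0, 27.0], [18.0, 18.0, 18.0])
```

With these toy values the target crosses threshold three cycles earlier in the cancer sample after ACTB normalization, which corresponds to an 8-fold higher relative expression.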
Results
Identification of transcripts recurrently up- or downregulated in cancer cell lines
Using CAGE data collected for the FANTOM5 project (5, 9), we compared expression levels of transcripts from 184,827 promoter and 43,011 enhancer regions between a panel of 225 cancer cell lines and a panel of 339 primary cell samples (sample IDs and their annotations are listed in Supplementary Table S1C).
First, the cancer cell line and primary cell datasets were divided into three subsets (see Supplementary Table S1A). Cell lines and primary cells from solid tissues or blood lineages that could be matched are referred to as matched-solid or matched-blood, respectively. The remaining samples from solid tissue are referred to as unmatched-solid.
In each subset, we identified promoters that were differentially expressed between cancer and normal (edgeR; ref. 10, >4-fold change, FDR < 0.01). We also performed an alternative binary analysis (referred to as the ON/OFF analysis) to identify transcripts that were consistently switched off or switched on in cancer [four times more often expressed (switched ON) or not detected (switched OFF) in the cancer group compared with the normal group, using a significance level of FDR < 0.01 by Fisher exact test (examples in Fig. 1B)]. The results of the ON/OFF and edgeR analyses were then merged to obtain a final selection of up- and downregulated promoters (Fig. 1A and Supplementary Table S2).
In total, 2,108 promoters were differentially expressed in cancer cell lines. Seven hundred and eighty-one were consistently upregulated in all three comparisons and a further 814 were upregulated only in solid cancers. Conversely, 99 were consistently downregulated in all three datasets and a further 414 were downregulated only in solid cancers (Table 1). Sixty-three percent of the differentially expressed peaks overlapped protein-coding genes, 12% overlapped long noncoding genes (GENCODE v19; ref. 16), and 25% were not associated with any known genes (Supplementary Table S3).
Type of genomic region | Upregulated, pan cancer | Upregulated, solid only | Downregulated, pan cancer | Downregulated, solid only | Total | % |
---|---|---|---|---|---|---|
Protein coding | 434 | 455 | 92 | 354 | 1,335 | 63 |
LincRNA | 45 | 38 | 2 | 7 | 92 | 4 |
Antisense | 37 | 28 | 0 | 4 | 69 | 3 |
Pseudogene | 12 | 9 | 2 | 4 | 27 | 1 |
Other ncRNAs | 20 | 33 | 0 | 5 | 58 | 3 |
Unannotated | 233 | 251 | 3 | 40 | 527 | 25 |
Total | 781 | 814 | 99 | 414 | 2,108 | 100 |
In some cases, the CAGE analysis identified alternative promoters. Comparing the gene-wise differential expression (total CAGE signal for the same gene) to the differential expression of individual promoters (Fig. 1C), we found that for 23% of differentially expressed protein coding genes, at least one alternative promoter behaved differently to that of the whole gene, whereas for lncRNAs (which have fewer alternative promoters) it was only 5% (Fig. 1D).
Differentially expressed protein coding genes are enriched in cancer-associated genes
Focusing on CAGE peaks unambiguously at the 5′ end of protein coding genes (±500 bp from the 5′ end of annotated transcripts or located in 5′ UTRs), we identified 911 promoters corresponding to 656 unique genes that were differentially expressed: 435 upregulated and 221 downregulated (Supplementary Table S9). The gene set was significantly enriched for oncogenes (hypergeometric test P = 7.5e−05, 33 genes), tumor suppressors (P = 0.0043, 13 genes), genes frequently mutated in cancer (P = 0.034, 18 genes; ref. 14), and genes listed in the Cancer Gene Census (P = 0.01, 28 genes; see Supplementary Table S3F; ref. 15). Interestingly, eight oncogenes were downregulated and five tumor suppressors were upregulated, changing in the opposite direction to what one would expect (Supplementary Fig. S1A–S1C). This may be caused by regulatory feedback loops responding to neoplastic changes.
We next performed an analogous cancer versus normal analysis on RNA-seq data from 14 tumor-normal pairs (4,055 primary cancer samples and 563 normal tissue samples; Supplementary Table S1B) from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/). The fold changes observed for the TCGA analysis were considerably weaker than those seen for the FANTOM5 analysis, presumably because the mixture of cells in a tumor dilutes the cancer cell signal (Fig. 2). To recover similar numbers of genes from both the TCGA and FANTOM5 analyses, we therefore applied a weaker threshold (abs FC > 2, FDR < 0.01) and identified 490 upregulated genes and 1,661 downregulated genes (Supplementary Table S4). The upregulated genes were enriched for those listed in the Cancer Gene Census (hypergeometric test P = 0.03, 18 genes). Of particular note, many more genes were downregulated in the tumor-normal comparison than in the cancer cell line-primary cell comparison.
Potential pan-cancer biomarkers
We found that 76 (17%) of the upregulated genes identified in the cancer cell line analysis were also upregulated in primary tumors from TCGA (Fig. 2). Among them were oncogenes (HOXC13, MYEOV, MNX1, and CASC5), cancer antigens (PRAME, CD70, CASC5, IGF2BP3) and, somewhat unexpectedly, tumor suppressors (TP73, BLM, BUB1B). The upregulated genes were also enriched in genes involved in cell cycle, DNA metabolism, biopolymer metabolism, and homeobox genes involved in development. This included well-known pan-cancer genes such as TERT, PRAME, and TOP2A (17, 18), as well as MYEOV and MNX1, which are implicated in blood malignancies (19, 20), and FAM111B, which is implicated in prostate cancer (21).
For the downregulated genes, 52 (19%) genes from the FANTOM5 cancer cell line analysis were also downregulated in primary tumors (Fig. 2). Interestingly, the list was enriched for genes related to oxidoreductase activity (five genes: AOX1, PTGS1, ACOX2, COX7A1, and the tumor suppressor GPX3; ref. 22). Because the downregulation is seen in both cancer cell lines and primary tumors, we infer that the changes are caused by a permanent reprogramming of metabolism in cancer cells rather than by a response to the tumor microenvironment or cell culture conditions. Finally, we also observed seven discrepancies: CDKN2A, COL1A1, COL5A2, GJB2, HIST1H2BH, MMP9, and TNFRSF6B were downregulated in the cancer cell lines but upregulated in the primary tumor analysis.
Finally, we used recent proteome data from 90 colorectal cancers and 30 normal tissues published by Zhang and colleagues (7). The spectral count data were available for 239 of our 656 differentially expressed genes. Twenty mRNAs/proteins were upregulated in both the cancer cell lines (CAGE) and colorectal tumors (mass spec data) whereas 16 were upregulated in both the RNA-seq and mass spec data (Supplementary Fig. S2A and Supplementary Table S9C). Notably, four genes were robustly upregulated in all three comparisons: MCM2, TOP2A, ASNS, and MKI67.
There were 108 genes that were downregulated at the protein level and in at least one transcriptome analysis (CAGE or RNA-seq). Strikingly, the top 10 enriched terms within those genes were all related to metabolic processes, either to oxidative processes or lipid metabolism (Supplementary Fig. S2B), thereby confirming the metabolic pathway changes that we have observed from the RNA data.
Pan-cancer long-noncoding RNAs
From the cancer cell line analysis we identified 271 differentially expressed lncRNAs (181 lncRNAs annotated with GENCODE 19, plus a further 90 with the miTranscriptome annotation; ref. 23). The majority (247 lncRNAs) were upregulated whereas 24 were downregulated (Supplementary Table S10A). In total, 39 and five of these were up- and downregulated, respectively, in both the cancer cell line analysis and at least one tumor type in the miTranscriptome study (23). Of those, 21 were consistently upregulated and two consistently downregulated in cancer cell lines and at least two tumor types (Supplementary Table S10B).
For two of these lncRNAs (ENST00000448869 and FOXP4-AS1), we performed qRT-PCR validation in cancer cell lines versus primary cells and also in a cDNA panel covering eight tumor types and matching normal tissues. In both cases, the targets were highly significantly upregulated in both cancer cell lines and tumors (Fig. 4).
We also compared our list of pan-cancer lncRNAs with the 229 “onco-lncRNAs” identified by Cabanski and colleagues (24), which allowed us to confirm three additional upregulated lncRNAs (Supplementary Table S10B and S10D). Our analysis of preprocessed TCGA RNA-seq data also allowed us to confirm deregulation of four lncRNAs: two already confirmed by miTranscriptome and Cabanski (MEG3 and DGCR5) and two that were missed by other reports, namely downregulation of the MT1L pseudogene and, most notably, upregulation of PVT1, which is a well-known lncRNA oncogene (25).
Deregulated long-noncoding RNAs located near cancer-related genes
We next looked at the genomic neighborhood of the differentially expressed lncRNAs. For 27 of the 181 (GENCODE19) differentially expressed lncRNAs, we found 33 cancer-related genes within 100 kb (Table 2, example in Supplementary Fig. S3). For example, PVT1 neighbors MYC; these are consistently cogained in cancer. We also observed RP11-1070N10.5 neighboring the TCL6 (lincRNA), TCL1A, and TCL1B oncogenes, located in a breakpoint cluster region on chromosome 14q32 in T-cell leukemia (26), and HOXA11-AS neighboring HOXA13 and HOXA9 and overlapping the HOXA11 oncogene. Notably, five out of six cancer-related genes located within 1 kb of upregulated lncRNAs were also upregulated; these include the MCF2L, GATA2, and MNX1 oncogenes and the BSG and CSAG1 cancer antigens (Table 2). Possibly linked to cancer metabolism, the upregulated PCAT7 is located antisense to fructose-1,6-bisphosphatase-2 (FBP2; Supplementary Fig. S3), whose decreased expression promotes glycolysis and growth in gastric cancer cells (27).
lncRNA . | lncRNA DE summary . | Neighbor gene name . | Neighbor DE summary . | Neighbor gene info . | Distance from lncRNA . | Overlap . | Strand . |
---|---|---|---|---|---|---|---|
MCF2L-AS1 | Solid UP | MCF2L | Solid UP | Oncogene | <1 kb | Yes | Opposite |
RP11-475N22.4 (GATA2-AS1) | Solid ON | GATA2 | Solid ON | Oncogene, cancer gene census | <1 kb | Yes | Opposite |
MNX1-AS1 | Pan ON | MNX1 | Solid ON | Oncogene | <1 kb | No | Opposite |
AC009005.2 | Solid ON | BSG | Pan UP | Cancer antigen | <1 kb | Yes | Opposite |
CSAG4 | Pan ON | CSAG1 | Pan ON | Cancer antigen | <1 kb | No | Opposite |
HOXA11-AS | Pan UP | HOXA11 | | Oncogene, cancer gene census | <1 kb | Yes | Opposite |
RHPN1-AS1 | Pan UP | MAFA | Solid UP | Tumor suppressor, oncogene | <100 kb | No | Same |
LINC00624 | Pan ON | BCL9 | Solid UP | Oncogene, cancer gene census | <100 kb | No | Opposite |
IFITM9P | Solid DOWN | MYEOV | Pan UP | Oncogene | <100 kb | Yes | Opposite |
RP11-435O5.2 | Pan UP | PTCH1 | Pan ON | Tumor suppressor | <100 kb | No | Same |
LIFR-AS1 | Solid ON | LIFR | | Oncogene, cancer gene census, Mut Driver | <100 kb | Yes | Opposite |
AC079767.4 | Pan ON | CREB1 | | Oncogene, cancer gene census, Mut Driver | <100 kb | No | Same |
RP11-460N16.1 | Pan ON | MITF | | Oncogene, cancer gene census | <100 kb | No | Same |
RP11-1070N10.5 | Solid ON | TCL6 | | Oncogene, cancer gene census | <100 kb | No | Same |
RP11-1070N10.5 | Solid ON | TCL1A | | Oncogene, cancer gene census | <100 kb | No | Opposite |
HOXA11-AS | Pan UP | HOXA9 | | Oncogene, cancer gene census | <100 kb | No | Opposite |
PVT1 | Solid UP | MYC | | Oncogene, cancer gene census | <100 kb | No | Same |
LAMTOR5-AS1 | Solid UP | RBM15 | | Oncogene, cancer gene census | <100 kb | No | Same |
RNU6-781P | Pan UP | ZNF384 | | Oncogene, cancer gene census | <100 kb | No | Opposite |
RP11-284F21.7 | Pan UP | PRCC | | Oncogene, cancer gene census | <100 kb | No | Opposite |
HOXA11-AS | Pan UP | HOXA13 | | Oncogene, cancer gene census | <100 kb | No | Opposite |
RP11-1070N10.5 | Solid ON | TCL1B | | Oncogene | <100 kb | No | Same |
TSPY26P | Solid OFF | HCK | | Oncogene | <100 kb | No | Opposite |
RP5-884M6.1 | Solid ON | PIK3CG | | Mut Driver | <100 kb | No | Same |
RP11-405F3.4 | Pan ON | KIFC3 | | Mut Driver | <100 kb | No | Same |
RP5-991G20.1 | Pan UP | ZFHX3 | | Mut Driver | <100 kb | Yes | Opposite |
CTA-714B7.5 | Pan UP | TOM1 | | Mut Driver | <100 kb | No | Opposite |
LINC00243 | Pan UP | MDC1 | | Mut Driver | <100 kb | No | Same |
CTA-714B7.5 | Pan UP | HMGXB4 | | Mut Driver | <100 kb | No | Opposite |
CTC-338M12.9 | Pan ON | TRIM7 | | Mut Driver | <100 kb | No | Opposite |
SCARNA14 | Solid ON | MAP2K1 | | Cancer gene census | <100 kb | No | Opposite |
AC034193.5 | Solid UP | FANCD2 | | Cancer gene census | <100 kb | No | Same |
RNU6-781P | Pan UP | ING4 | | Tumor suppressor | <100 kb | No | Opposite |
NOTE: For the complete list of genes that are located near differentially expressed lncRNAs, see Supplementary Table S7.
Abbreviation: DE, differentially expressed.
Activation of repeat elements in cancer
Globally about 20% of all FANTOM5 promoters initiate from within repetitive elements and low complexity DNA sequences annotated by RepeatMasker. We observed a simple relationship for promoters that overlapped a repetitive element; the higher the fold change (upregulation in cancer), the higher the probability that the promoter overlapped a repetitive element (Supplementary Fig. S5, see Supplementary Table S13 for the promoter–repeat associations). A more detailed analysis revealed that the upregulated promoters are enriched in retrotransposons (SINE/Alu, LINE/L1, LTR/ERV1, LTR/ERVL). The SINE/Alu and LINE/L1 overlapping promoters tended to be located in intronic regions of protein coding genes (49% and 32%, respectively) and not associated to known RNA transcripts, whereas the upregulated promoters overlapping LTR/ERV1 often initiated the expression of lncRNAs (31 GENCODE lncRNAs and 48 miTranscriptome lncRNAs; Table 3).
Repeat family | Total repeat-overlapping promoters | # down | Odds ratio (down) | P-value (down) | # up | Odds ratio (up) | P-value (up) | 5′UTR (protein coding) | Intron (protein coding) | Exon (protein coding) | 3′UTR (protein coding) | lncRNA | Pseudogene | Not annotated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
REP522 | 72 | 0 | 0 | 1 | 25 | 62.05 | 2.2E−16 | 1 | 0 | 0 | 0 | 9 | 3 | 12 |
Low_complexity | 2,013 | 13 | 2.37 | 4.7E−03 | 18 | 1.04 | 0.81 | 15 | 2 | 2 | 6 | 2 | 0 | 4 |
Simple_repeat | 11,982 | 44 | 1.35 | 0.06 | 204 | 2.13 | 2.2E−16 | 86 | 70 | 4 | 7 | 17 | 1 | 63 |
SINE/Alu | 3,961 | 0 | 0 | 2.4E−05 | 138 | 4.44 | 2.2E−16 | 5 | 67 | 1 | 1 | 11 | 3 | 50 |
LINE/L1 | 3,426 | 1 | 0.1 | 1.5E−03 | 67 | 2.35 | 1.8E−09 | 2 | 22 | 0 | 0 | 12 | 0 | 32 |
LINE/L2 | 3,220 | 2 | 0.22 | 0.01 | 25 | 0.9 | 0.7 | 2 | 4 | 0 | 0 | 4 | 0 | 17 |
LTR/ERVL-MaLR | 3,613 | 0 | 0 | 7.8E−05 | 31 | 0.99 | 1 | 6 | 4 | 0 | 0 | 10 | 0 | 11 |
LTR/ERV1 | 3,932 | 2 | 0.18 | 3.0E−03 | 133 | 4.3 | 2.2E−16 | 7 | 12 | 0 | 0 | 31 | 2 | 83 |
LTR/ERVL | 1,488 | 0 | 0 | 0.04 | 20 | 1.57 | 0.049 | 2 | 2 | 0 | 0 | 8 | 0 | 8 |
NOTE: The table shows the numbers of upregulated and downregulated promoters that overlap nine families of repetitive elements (≥20 promoters) as well as Fisher exact statistics of the enrichments (odds ratios and P-values, two-sided test). The right side of the table shows the available annotation of those promoters.
In contrast, the majority of promoters overlapping simple repeats and low complexity sequences were associated with protein coding transcripts. Simple repeats were enriched among upregulated promoters, whereas low complexity sequences were enriched among downregulated promoters (Table 3).
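The odds ratios in Table 3 can be approximately reproduced from counts given in the text. The sketch below assumes, as an illustration, that the background is all 184,827 FANTOM5 promoters and that the upregulated set is the 1,595 promoters from Table 1 (781 pan cancer plus 814 solid only); the function name `sample_odds_ratio` is hypothetical. It computes the simple cross-product odds ratio, which can differ slightly from the estimate reported by a Fisher exact test implementation.

```python
def sample_odds_ratio(up_in_family, family_total, up_total, promoter_total):
    """Cross-product odds ratio for 'upregulated' versus 'overlaps repeat family'."""
    a = up_in_family                       # in family, upregulated
    b = family_total - up_in_family        # in family, not upregulated
    c = up_total - up_in_family            # outside family, upregulated
    d = promoter_total - family_total - c  # outside family, not upregulated
    return (a * d) / (b * c)


PROMOTERS = 184_827      # robust DPI clusters (Materials and Methods)
UPREGULATED = 781 + 814  # pan cancer + solid only (Table 1)

rep522 = sample_odds_ratio(25, 72, UPREGULATED, PROMOTERS)
sine_alu = sample_odds_ratio(138, 3_961, UPREGULATED, PROMOTERS)
```

Under these assumptions the cross-product estimates come out at roughly 62.06 for REP522 and 4.44 for SINE/Alu, close to the tabulated values of 62.05 and 4.44.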
Bidirectional transcription from REP522 satellite repeat is activated in cancer
Interestingly, a specific repeat family, REP522, was strongly enriched in the most upregulated promoters. REP522, originally called a telomeric satellite, is a largely palindromic, unclassified interspersed repeat of ∼1.8 kb in size (28). We observed that out of 72 promoters overlapping REP522, 25 were upregulated in cancer (odds ratio, 62.05). Twenty out of these 25 promoters were associated with a known transcript (five pseudogenes, nine lncRNAs, and one protein coding gene) including the pseudogene BAGE2 (B melanoma antigen family, member 2) and the lncRNAs PCAT7 and BRCAT95, which were previously implicated in cancer (23). In most cases, transcription is initiated bidirectionally, and in five cases it overlaps regions previously annotated as enhancers. To show that the observed activation of REP522 elements was not due to a mapping artifact, we performed qPCR validation for 11 upregulated, REP522-initiated transcripts from different genomic regions in three cancer cell lines, with dermal fibroblast cells as a control. For eight of these we confirmed upregulation in the cancer cell lines compared with normal fibroblasts (Fig. 4A). In one case, we confirmed the bidirectional transcription of CCDC144NL and CCDC144NL-AS1 from one REP522 element (Fig. 3B). The three transcripts for which the qPCR validations did not yield any results were novel, computationally assembled transcripts from miTranscriptome with very low expression, suggesting that they were either expressed below the detection limit or not correctly assembled (Fig. 3C). To our knowledge, this is the first report implicating REP522 activation in cancer.
Enhancer activation in cancer
Taking advantage of the fact that CAGE data can be used to estimate the activity of enhancers from balanced bidirectional capped transcription (9), we performed differential expression analysis based on CAGE tag counts under 43,011 CAGE-defined enhancers (9), using the same differential expression pipeline as for the promoter regions. We found 28 pan-cancer enhancers upregulated in solid and blood cancers and a further 62 upregulated in solid cancers only (Supplementary Table S5). Enhancers tend to be highly cell-type specific (9); accordingly, we found no broadly downregulated enhancers in cancer.
We found that 23 of the 90 upregulated enhancers could be associated with a miTranscriptome transcript (5′ end within 500 bp of the enhancer; Supplementary Table S11A) and that four of those transcripts were reported to be upregulated in at least one cancer type (Supplementary Table S11B).
We next used Chromatin Interaction Analysis Paired-End Tags (ChIA-PET) data from the ENCODE project to associate these pan-cancer enhancers with their target genes. We found that 55 of the 90 upregulated enhancers can be physically linked to promoters of known genes (228 unique enhancer–gene links, Supplementary Table S6). Seventeen of the enhancers were linked to cancer-related genes, including seven oncogenes (BCL9, CREB1, ZNF384, SALL4, TFRC, BTG1, and oncomir MIR21), two tumor suppressors (ING4, KCTD11), and five Mut-Drivers (PIK3CB, CLIP1, KIFC3, GPS2, and CARM1; Supplementary Table S6). In addition, eight of the upregulated enhancers were linked to promoters found to be upregulated in cancer cell lines, including cancer-linked genes such as TNFSF12 and PIK3R3 (Fig. 3D; ref. 29).
Discussion
By using both the FANTOM5 CAGE expression data from cancer cell lines and primary cells and the TCGA RNA-seq and proteome expression datasets from tumors and normal tissues, we have built an overview of recurrent expression changes in cancer.
These datasets have their own strengths and weaknesses. In the TCGA analysis, both tumors and normal tissues are complex mixtures of cell types (cancer cells, infiltrating lymphocytes, stroma, and blood vessels), which complicates the interpretation of differential expression between normal and cancer samples: differences in gene expression may simply reflect differences in cell composition. To minimize this issue, the TCGA (3) required that profiled tumor samples contain at least 60% tumor cells and less than 20% necrosis. The FANTOM5 cell line and primary cell data avoid this complication because relatively homogeneous, pure cell cultures were profiled. Conversely, artifacts from the long-term culture of cell lines and their artificial in vitro culture conditions could affect our FANTOM5 analysis. The TCGA avoids this by directly profiling fresh tissue.
As expected, the gene sets identified by the two datasets differ. Despite this, we identified a core set of 128 markers that are consistently perturbed in both the FANTOM5 cell and TCGA tissue analyses. Four of the markers, TOP2A, MKI67, MCM2, and ASNS, are also upregulated at the protein level in a colon cancer dataset and are among the most studied cancer biomarkers and drug targets. TOP2A is targeted by etoposide (30), ASNS is targeted in asparaginase therapy of acute lymphoblastic leukemia (31), and both MKI67 and MCM2 have been studied as biomarkers (32, 33) and potential drug targets (34, 35). Because these genes are recurrently upregulated across many cancer types, targeting them is likely to bring therapeutic value to many patients. Our pan-cancer markers also appear to be mostly novel, as comparison with prior work [Rhodes and colleagues, a multicancer meta-signature of 67 genes upregulated in cancer, derived by meta-analysis of 40 published microarray experiments (18); Xu and colleagues, 46 genes upregulated across 21 microarray datasets (36)] found little overlap (Supplementary Table S8).
The FANTOM5 CAGE data also allowed us to examine transcript types rarely studied in prior efforts (long noncoding RNAs, enhancer RNAs, and repetitive element-derived RNAs). We report 271 pan-cancer lncRNAs, including well-known cancer-associated lncRNAs such as PVT1 and many novel cases. Public datasets confirmed the upregulation of 25 and the downregulation of three of these lncRNAs in at least two primary tumor types (23, 24), and we further validated upregulation of two novel lncRNAs by qRT-PCR in a cDNA panel covering eight tumor types. We also identified 90 enhancer RNA-producing regions that are recurrently activated in cancer cell lines. For four of them, a corresponding lncRNA transcript model is upregulated in the TCGA dataset.
The observation that promoters overlapping repetitive elements are often upregulated in cancer is notable, and the link between the little-known REP522 element and cancer is novel. One instance of REP522 near the B melanoma antigen (BAGE) locus has been reported to be marked with H3K9me3 and actively transcribed (37), perhaps suggesting that REP522 transcriptional activation is responsible for upregulation of BAGE in cancer. Other, better-studied elements such as LTR elements have previously been reported to act as alternative promoters of host genes in mouse embryos (38) and to contribute to the complexity of the transcriptome of iPS and stem cells (39). Thus, the reactivation of these elements and the eRNAs identified above may suggest acquisition of stem cell-like properties by cancer cells. This reactivation may occur because repetitive sequences, which are usually suppressed by methylation in somatic cells, are frequently hypomethylated in cancer (40).
In conclusion, our results highlight the transcriptome changes in cancer across protein-coding genes, non-protein-coding transcripts, unannotated promoters, and enhancer RNAs, and represent an important step toward the discovery of potentially useful cancer biomarkers and therapeutic targets. Technologies developed to detect and target these molecules have great potential to be applicable to a broad range of cancers. One last note is that we identify molecules that are consistently up- or downregulated in cancer versus normal comparisons, but that are not necessarily higher in all cancers relative to all normal tissues (although a subset are). Such molecules may not be suitable for plasma- or serum-based diagnostics but would be useful in screening biopsies in a histopathologic setting.
Disclosure of Potential Conflicts of Interest
P. Carninci is founder and CEO for TransSINE Technologies. No potential conflicts of interest were disclosed by the other authors.
Authors' Contributions
Conception and design: B. Kaczkowski, Y. Hayashizaki, P. Carninci, A.R.R. Forrest
Development of methodology: B. Kaczkowski, M. Itoh, P. Carninci
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): Y. Tanaka, M. Itoh, The FANTOM5 Consortium, P. Carninci, A.R.R. Forrest
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): B. Kaczkowski, H. Kawaji, A. Sandelin, R. Andersson, M. Itoh, T. Lassmann
Writing, review, and/or revision of the manuscript: B. Kaczkowski, R. Andersson, P. Carninci, A.R.R. Forrest
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): Y. Tanaka, H. Kawaji, A. Sandelin, T. Lassmann, The FANTOM5 Consortium, P. Carninci
Study supervision: A.R.R. Forrest
Other (supported experimental design of qPCR validation of pan-cancer marker candidates and data analysis, as well as performing experiments): Y. Tanaka
Acknowledgments
The authors thank Erik Arner, Efthymios Motakis, Kosuke Hashimoto, Dave Tang, Chung-Chau Hon, Jordan Ramilowski, and Giovani Pascarella for valuable discussions and comments on the manuscript, and Yuri Ishizu for technical assistance.
Grant Support
B. Kaczkowski was supported by Postdoctoral Fellowship Program from Japan Society for the Promotion of Science (JSPS) and Foreign Postdoctoral Researcher (FPR) program from RIKEN, Japan. Y. Tanaka was supported by Grants-in-Aid for Scientific Research (KAKENHI) from the Ministry of Education, Culture, Sports, Science, and Technology. R. Andersson was supported by funding from the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (grant agreement No. 638273). A. Sandelin was supported by the Novo Nordisk Foundation and the Lundbeck Foundation. FANTOM5 was made possible by a Research Grant for RIKEN Omics Science Center from MEXT to Y. Hayashizaki and a Grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from the MEXT, Japan to Y. Hayashizaki. This study is also supported by Research Grants from the Japanese Ministry of Education, Culture, Sports, Science and Technology through RIKEN Preventive Medicine and Diagnosis Innovation Program to Y. Hayashizaki and RIKEN Centre for Life Science, Division of Genomic Technologies to P. Carninci. A.R.R. Forrest is supported by a Senior Cancer Research Fellowship from the Cancer Research Trust and funds raised by the MACA Ride to Conquer Cancer.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.