tRNA-derived fragments (tRF) are a class of potent regulatory RNAs. We mined the datasets from The Cancer Genome Atlas (TCGA) representing 32 cancer types with a deterministic and exhaustive pipeline for tRNA fragments. We found that mitochondrial tRNAs contribute disproportionally more tRFs than nuclear tRNAs. Through integrative analyses, we uncovered a multitude of statistically significant and context-dependent associations between the identified tRFs and mRNAs. In many of the 32 cancer types, these associations involve mRNAs from developmental processes, receptor tyrosine kinase signaling, the proteasome, and metabolic pathways that include glycolysis, oxidative phosphorylation, and ATP synthesis. Even though the pathways are common to multiple cancers, the association of specific mRNAs with tRFs depends on and differs from cancer to cancer. The associations between tRFs and mRNAs extend to genomic properties as well; specifically, tRFs are positively correlated with shorter genes that have a higher density in repeats, such as ALUs, MIRs, and ERVLs. Conversely, tRFs are negatively correlated with longer genes that have a lower repeat density, suggesting a possible dichotomy between cell proliferation and differentiation. Analyses of bladder, lung, and kidney cancer data indicate that the tRF-mRNA wiring can also depend on a patient's sex. Sex-dependent associations involve cyclin-dependent kinases in bladder cancer, the MAPK signaling pathway in lung cancer, and purine metabolism in kidney cancer. Taken together, these findings suggest diverse and wide-ranging roles for tRFs and highlight the extensive interconnections of tRFs with key cellular processes and human genomic architecture.
Across 32 TCGA cancer contexts, nuclear and mitochondrial tRNA fragments exhibit associations with mRNAs that belong to concrete pathways, encode proteins with particular destinations, have a biased repeat content, and are sex dependent.
tRNA-derived fragments (tRF) are a new molecular category of short ncRNAs that are produced from specific cleavage of precursor and mature tRNA molecules (1, 2). For tRFs that overlap the mature tRNA, four structural categories were reported initially: 5′-tRFs, 3′-tRFs, 5′-tRNA halves (5′-tRHs), and 3′-tRNA halves (3′-tRHs; ref. 3). Our group identified and reported a fifth category, the internal tRFs or i-tRFs, with numerous and abundant members (4, 5).
The production of tRFs changes in response to factors like diet variations (6) and trauma (7). It has also been shown that tRFs are produced in a tissue-dependent manner (5), exhibit differential abundances in cancer compared to normal tissue (8, 9), and are involved in transgenerational inheritance (10).
Mechanistically, tRFs can regulate translation (11) as well as interact with the ribosome and aminoacyl tRNA synthetases (11, 12). The specific category of tRHs can be produced under stress conditions (11, 13) as well as constitutively (5), and its members can have endpoint modifications that render them invisible to standard RNA sequencing (RNA-seq; ref. 14). Some of the shorter tRFs can interact with Argonaute proteins (3, 15, 16) in a cell type–specific manner (5). tRFs that enter the RNAi pathway follow base-pairing rules that match those of miRNAs (3, 17). Interestingly, tRF loading on Argonaute can be both Dicer-dependent and Dicer-independent (3, 17). Moreover, the decoying of an RNA-binding protein (RBP) by tRFs (18) and the potentially extensive binding of tRFs by RBPs (19) can affect cancer molecular biology and metastasis.
Previously, we showed that tRFs have links to precision medicine and hold promise for furthering our understanding of homeostasis and disease. Specifically, we demonstrated that the identity and abundance of tRFs depend on a person's sex, population origin, and race/ethnicity, as well as tissue, tissue state, and disease type (5, 8, 19). We also found that the associations (“wiring”) of tRFs with mRNAs and molecular pathways is race/ethnicity-specific in prostate adenocarcinoma (8) and triple-negative breast cancer (19).
The current literature highlights the heterogeneity of the tRFs' roles, mechanisms of action, and functional impact. Given this heterogeneity, we pursued our studies in an integrated manner. We holistically investigated unexplored areas of tRF expression and patterns of inter-transcript (tRFs-mRNAs) associations in 32 cancer types of The Cancer Genome Atlas (TCGA) cohort. Our analysis distinguished tRFs based on whether they can be traced back to the nuclear or the mitochondrial genome. We also examined the attributes of the mRNAs with which the tRFs are correlated and sought potential dependencies of the tRF-mRNA associations on cancer type. Finally, we examined possible links between tRFs and the repeat-element content of genes, and between tRFs and sex disparities.
Materials and Methods
Data acquisition and tRNA fragment profile generation
We downloaded 11,198 short RNA-seq datasets (sequenced samples) from TCGA's Cancer Genomic Hub and the respective clinical metadata from TCGA's data portal. We converted TCGA's small RNA-seq datasets to FASTQ format using bam2FastQ (http://genome.sph.umich.edu/wiki/BamUtil - v1.0.10). To ensure consistency with previously reported TCGA analyses, we worked with the GRCh37/hg19 genome assembly. We used MINTmap (20, 21) to exhaustively and deterministically mine tRFs in all 11,198 datasets. These tRFs were recently made available through an interactive database (4). We computed an adaptive minimum-support threshold for each dataset using the Threshold-seq algorithm (22), retaining only those tRFs that exceeded threshold and also had mean normalized abundance ≥1 reads per million (RPM) in at least one of the 32 cancer types. We refer to tRFs using the “license plate” naming scheme that we introduced previously (5, 23). We tag tRF sequences as “exclusive” if they exist only within the span of mature tRNAs that contain a CCA (the “tRNA space”) and appear nowhere else on the genome; otherwise, we tag them as ambiguous. In the presented analyses, we included the 10,274 “white-listed” datasets that contain no special annotations in the associated clinical metadata.
We adhere to the NIH/TCGA designations. White refers to person with origins in any of the original peoples of the far Europe, the Middle East, or North Africa. Black or African American refers to persons with origins in any of the black racial groups of Africa.
We define the “normalized abundance” of an isoacceptor (in reads per million or RPM) for a dataset as the sum of the (normalized) abundances of all the tRFs that the isoacceptor produces. We define the “normalized abundance” of an isoacceptor for a cancer type as the average of the isoacceptor's abundances across all the datasets of this cancer type. For the 5′-tRFs from tRNAHisGTG, and separately for each sample/dataset, we computed the ratio of abundance of 5′-tRF ending at consecutive positions, then log2-transformed it, filtered out infinite values in the ratios (divisions by 0), and computed mean and SD separately for tumor and normal samples. We distinguished 5′-tRFs starting at position −1 from 5′-tRFs starting at position +1 of tRNAHisGTG. Univariate statistical comparisons in abundance were carried out with the nonparametric Mann–Whitney U test. For network visualizations, we collapsed all tRFs to the isoacceptor level, for example, a node labeled “nAspGTC” represents all expressed tRFs that overlap the mature nuclear tRNAAspGTC.
tRNA base modifications and mapping with mismatches
We leveraged the known modifications of human tRNAs and the respective sequence alignments contained in MODOMICS (24). Per MODOMICS, a base's “frequency of modification” is defined as the ratio of tRNAs with a modification at that base over the number of all considered tRNAs. Intuitively, had a modified base at position N of the mature tRNA stopped reverse transcription during library preparation, then tRFs from this tRNA would have appeared to possess a “pseudo-5′” terminus at position N+1. We examined this possibility at each mature tRNA position by counting the number of tRFs starting at that position and comparing it with the modification frequency that MODOMICS lists for the position immediately upstream.
In addition, for 3′-tRFs, we also evaluated whether there is benefit in mapping tRFs after allowing a single nucleotide mismatch. Such a mismatch would, in principle, alleviate the potential impact on the sequenced reads of the known m1A58 modification (methylated adenine at position 58 of the mature tRNA). To this end, for all 3′-tRFs, we enumerated all possible sequence variations that result from changing exactly one nucleotide anywhere along the 3′-tRF at hand. We then examined which of these derivative “3′-tRFs” are (i) supported by the available RNA-seq data; and (ii) can now be found identically in the genome outside of the sequence space of tRNAs (when allowing no mismatch).
We computed positive and negative tRF-mRNA Spearman ρ correlation coefficients using only the tumor datasets, and separately for each cancer type. For bladder urothelial carcinoma (BLCA), lung adenocarcinoma (LUAD), or kidney renal clear-cell carcinoma (KIRC), we split the tumor datasets by sex and carried out the correlations separately for each sex. For KIRC, we only considered samples that belonged in the ccA cluster in TCGA analysis (25). For increased stringency, we further required that the median normalized abundance of each tRF be ≥2 RPM within the group of considered samples. For the mRNA profiles, we used TCGA's files of normalized results (“rsem_genes.normalized_results”) filtering out any genes whose average abundance was less than the median of the means of abundances of all mRNAs across the primary tumor samples of the group under consideration. We determined which tRFs and mRNAs enter the correlation analyses separately for each cancer type (or, sex in the case of BLCA, LUAD, and KIRC-ccA). We used Python's numpy (version 1.11.1) and scipy (version 0.18.1) packages. We kept tRF-mRNA pairs with Spearman correlation ≤ −0.333 or ≥ 0.333 and an associated FDR ≤ 5%. For most cancer types, we found tens of thousands of tRF-mRNA correlations satisfying these constraints. To focus on the strongest of the correlations and to balance stringency and specificity, we analyzed only tRF-mRNA pairs corresponding to the 5,000 highest (positive) and 5,000 lowest (negative) correlation values. All retained correlation pairs, correlation values, and FDRs are listed in the Supplementary Data.
We computed commonly and differentially correlated tRF-mRNA pairs as we have done previously (19). Specifically, a tRF-mRNA pair is commonly correlated in two cancer types [e.g., LUAD and lung squamous cell carcinoma (LUSC)] or both sexes (e.g., BLCA male and BLCA female) if it is listed among the significant pairs with the same sign in the two categories being compared. A tRF-mRNA pair is differentially correlated if it is listed among the significant pairs in exactly one of the two categories being considered or listed among the significant pairs in both categories but with opposite signs.
For hierarchical clustering as well as visualizations, we used R and Cytoscape, as described previously (5, 19). Protein–protein interactions were drawn from PICKLE (26). We carried out pathway enrichment analysis with DAVID (27) using as “background” all the genes that passed the expression filtering for the respective cancer type. We examined Gene Ontology (GO) terms for biological processes (GOTERM_BP_FAT), molecular function (GOTERM_MF_FAT), cell compartment (GOTERM_CC_FAT), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (KEGG_PATHWAY) for enrichment filtering at an FDR threshold of 0.05. BP terms in more than 10 cancer types were grouped into clusters based on the pairwise Jaccard index matrix measuring how common two BP GO terms are in terms of the genes they contain and are correlated with tRFs. We reduced and summarized the grouped GO terms with REVIGO (allowed similarity, medium; similarity measure, normalized Resnik; ref. 28).
To construct the network of glycolysis/gluconeogenesis, we connected the genes whose encoded proteins interact with the same metabolite: we downloaded the hsa00010 KEGG pathway structure, and identified the genes in the KEGG modules hsa_M00001, hsa_M00002, and hsa_M00003. We treated two genes as connected if and only if the reactions catalyzed by the encoded proteins included at least one common metabolite as substrate or product, collapsing duplicate edges to a single node. We constructed the network of purine metabolism in a similar manner by connecting all the genes in the biosynthesis of inosine monophosphate (IMP) from ribose 1-phosphate (KEGG pathway hsa00230). The gene collections for ribosomal and proteasomal proteins are from https://www.genenames.org/cgi-bin/genefamilies/.
Protein localization data are from UniProt (October 10, 2017; ref. 29). For each cancer type and the mRNAs participating in correlations with tRFs, we identified the cellular compartments and destinations of the corresponding encoded proteins. We used a χ2 test of homogeneity of proportions to determine enrichment and/or depletion of compartments within this mRNA set compared with the mRNAs not participating in the correlations. We only used enriched or depleted compartments with an FDR threshold of 5% and residual scores ≥ +3 (i.e., enriched and colored gold) or ≤ −3 (i.e., depleted and colored purple), respectively.
We used RepeatMasker (http://www.repeatmasker.org; hg19 version 4.0.5) to find the overlap of human genes with repeat elements. The coordinates of all introns and exons are from ENSEMBL 75. We formed a gene's genomic span by taking the union of all its unspliced variants. We define the exonic portion of a gene as the union of all its exons. The gene's intronic portion is what remains of the gene's span after removing its exonic portion. We ran all Monte–Carlo simulations for 10,000 iterations, choosing genes randomly from the pool of background genes in each iteration, that is, the genes that participated in the correlation analyses. For the randomly-chosen genes, we computed the average span of the chosen genes and the average density of each repeat element family, and did so using (i) only the genes' exonic portion to which we refer as the “mRNA space” of a gene; and (ii) only the genes' intronic portion. We defined a repeat family's density (“repeat content”) in a genomic region as the fraction of the region that is annotated as belonging to the family. Upon completion of all the iterations, we built the “expected” distribution for each parameter and used Z scores to evaluate the enrichment/depletion of the “observed” parameter. We used the mean across the gene set that is correlated (either positively or negatively) with tRFs as the “observed” value. In total, we carried out 128 rounds of Monte–Carlo simulations (each with 10,000 iterations). Shades of gold color represent enrichment (Z score ≥ +2). Shades of purple color represent depletion (Z score ≤ –2). To account for multiple testing, we also carried out Kolmogorov–Smirnov tests checking whether the observed distribution is statistically significantly different than the background one. The resulting P values were corrected to FDR values and are included in the Supplementary Tables.
For the case of BLCA, we carried out two different Monte–Carlo simulations with 10,000 iterations. In the first simulation, we examined whether the deviations from the diagonal (where the number of correlations of an isoacceptor in males equals the number of correlations in females) are random. For each iteration, separately for males and females, we randomly selected (with replacement) the same number of tRFs as counted in the correlations. We then collapsed the tRFs to the isoacceptor level and for each one, we calculated the distance from the diagonal, that is, its deviation from exhibiting the same number of correlations in both sexes, and computed the median distance across them. Thus, we built an expected distribution. We also computed the median distance of the data from the diagonal and used the expected distribution to calculate the Z score. In the second simulation, and separately for each sex, we reassigned to each tRF the mRNAs with which it formed correlated pairs. We chose only among mRNAs that already participated in tRF-mRNA correlations while maintaining the number of correlations for each tRF unchanged: this allows us to estimate the significance of the differentially correlated observed tRF-mRNA pairs. At the end of the simulation (10,000 iterations), we had an expected distribution of the differences in correlation coefficients that we plotted against the observed coefficient differences. In addition, we point out that out of the 10,000 iterations, we found only 12 simulated tRF-mRNA pairs that would be considered as differentially correlated on the basis of their coefficient and FDR values – this is vastly smaller than the 5,520 tRF-mRNA pairs that emerge from our analysis of the BLCA datasets, which emphasizes the statistical significance of the findings.
Repeat element density and its correlation with biological processes
In the context of the recent literature around repeats and tRFs, a central finding in our work was the link of tRFs with mRNAs with specific distribution in repeats. It is important to note that we came to these results by analyzing the relative abundances of transcripts. We did not take into account the absolute magnitude of transcript abundance, or any genomic properties. In the context of genome organization, several publications previously highlighted the nonrandomness of repeat elements and their correlations with other gene properties, like length and GC content (see Results and Discussion). Therefore, it seems reasonable to hypothesize that the tRF-mRNA correlations at the transcriptomic level are the dynamic manifestation of the (static) genomic landscape. We emphasize the dynamic nature of these transcriptomic correlations because we observed that the same genomic properties significantly emerge in multiple contexts (cancer types), even though the tRF-mRNA pairs exhibit a strong context-specific signature. From this perspective, we examined how the distribution of repeat elements is associated with biological processes. Or, in other words, what is the overarching genomic architecture that is intertwined with the transcriptomic results.
We focused on those of the repeat elements that we found to be mostly correlated with tRFs, namely SINE (ALU and MIR), LINE (L1 and L2), LTR (ERVL-MaLR), and DNA transposons (TcMar-Tigger and hAT-Charlie), and their instances that are sense to the whole-genomic span of protein-coding genes. These analyses do not aim at providing a thorough investigation, but rather at demonstrating the links between biological processes and these repeats. For each repeat family, we computed the repeat content as described above and ranked the genes. Then, for each gene, we calculated the average rank and sorted the genes based on this metric. We considered the top 3,000 and the bottom 3,000 genes corresponding to the genes with the highest and lowest, respectively, density in repeats. We note that for the majority of the genes with low repeat density, the mean density value is zero, or close to zero. On these two lists, we computed enrichments using DAVID with the same parameters as described above.
We mined all the datasets (see Materials and Methods) of the TCGA repository and identified 23,413 tRFs that overlap mature tRNAs. The tRFs can be bulk-downloaded from https://cm.jefferson.edu/tcga-mintmap-profiles, or, examined interactively through MINTbase at https://cm.jefferson.edu/MINTbase/ (4, 23). We focus the below analyses on 10,274 white-listed TCGA samples and the corresponding 20,722 tRFs with significant expression in at least one of these datasets (Supplementary Table S1). Of the discovered fragments, 16,133 (78%) belong to the new category of i-tRFs, in complete analogy to what we reported previously (5, 8, 19, 23). We also identified 1,717 5′-tRFs (8% of all the identified tRFs), 2,840 3′-tRFs (14%), and 32 5′-tRHs. Fragments with lengths ≥ 28 nt are likely truncated versions of longer fragments that have been (artificially) shortened as a result of the 30-cycle limitation of the sequencing protocol used by TCGA (30). This limitation results in an underrepresentation of halves among the identified tRFs. Of the 20,722 tRFs, 13,904 (67%) have sequences that can be found only inside the tRNA space (“exclusive” tRFs). The sequences of the remaining 6,818 (33%) tRFs are of ambiguous genomic origin, that is, one third of the identified tRFs may not arise from tRNA genes.
Because the tRFs we analyze are believed to originate from mature tRNAs, they are expected to inherit the parental molecule's base modifications. In principle, such modifications could hinder the reverse-transcription step during sequencing, leading to an artificial 5′ endpoint for some tRFs, or, to misreading the nucleotide at the modified location. Our analysis did not find evidence that the presence of base modifications leads to artificial 5' endpoints or that it distorts the identity of the tRFs that are mined from TCGA (Supplementary Fig. S1A and S1B). Our analysis also confirmed that permitting nucleotide mismatches when mapping reads to the genome greatly hinders one's ability to distinguish among tRFs and non-tRFs (Supplementary Fig. S1C). Thus, we enforced exact matching during read-mapping, which is a key property of MINTmap (see Materials and Methods; refs. 20, 21).
The case of 5′-tRFs from the nuclear tRNAHisGTG
Among the many diverse tRFs, the nuclear tRNAHisGTG stands apart as a notable exception (Supplementary Fig. S1D–S1G). We previously reported in a model human cell line (BT-474) that 5′-tRNA halves from tRNAHisGTG have the expected guanosine added to their 5′ termini (“position −1”) as well as an unexpected uracil (31). We refer to these molecules using the “His(−1G)” and “His(−1U)” qualifier, respectively. Examination of the −1 positions of all 5′-tRFs from tRNAHisGTG across all TCGA samples revealed unexpectedly that His(−1U) is the most abundant modification (Fig. 1A). A smaller portion of the 5′-tRFs from tRNAHisGTG contains an adenine at the −1 position, or no modification. Even fewer 5′-tRFs from tRNAHisGTG contain a guanosine or a cysteine (Fig. 1A). We also found that the His(-1U) 5′-tRFs exhibit a notable property: as nucleotides are progressively added to their 3′ terminus, their abundance levels oscillate through position 23: although the absolute abundances of these 5′-tRFs change among cancer types (Supplementary Fig. S1D), the abundance ratios of His(-1U) 5′-tRFs that differ by 1 base at their 3′ terminus is conserved (see Fig. 1B for an example; full data matrix included in Supplementary Table S2). This “see-saw” pattern persists across all 32 TCGA cancer types and extends to the normal tissue samples as well (Supplementary Table S2; Supplementary Fig. S2A). Note that the His(+1G), that is, the unmodified 5′-tRFs beginning at position +1 of tRNAHisGTG as well as 5′-tRFs from other isoacceptors either do not exhibit this exact pattern or exhibit other patterns (Supplementary Fig. S2B). We note that analogous patterns have been reported for tRNA-derived Piwi-interacting RNAs (piRNA) in Bombyx mori (32).
Interestingly, the mitochondrial tRNAHisGTG does not generate any 5′-tRFs even though it produces a similar number of i-tRFs and 3′-tRFs as the nuclear tRNAHisGTG. Comparison of the relative abundances of nuclear and mitochondrial tRNAHisGTG fragments showed that they are not correlated (Fig. 1C). Moreover, across all TCGA datasets and cancer types, tRFs from the mitochondrial tRNAHisGTG are considerably less abundant than their nuclear counterparts (Fig. 1C, diagonal of Fig. 1D, and Supplementary Fig. S1E and S1F).
tRF lengths and tRNA cleavage patterns depend on the genome of origin
Motivated by the differences between the nuclear and mitochondrial tRNAHisGTG, and analogous differences that we reported previously (5) in healthy and diseased samples, we extended our analyses and comparisons to the rest of the nuclear and mitochondrial isoacceptors.
In terms of unique tRF sequences, the contribution by the 22 mitochondrial tRNAs comparatively eclipses that by the 610 nuclear tRNAs. We stress that this statement is about the diversity of the produced molecules and not about their relative abundance levels. The 22 mitochondrial tRNAs (3.5% of all tRNAs) are responsible for 6,031 (29%) of all distinct tRFs we find in TCGA. We note here that the human nuclear chromosomes are riddled with nuclear mitochondrial DNA segments (NUMT) as well as hundreds of tRNA-lookalikes (33). Thus, it is conceivable that the mitochondrial tRFs are not the exclusive product of the mitochondrial genome.
In terms of length distributions, there are concrete differences between tRFs produced from nucleus-encoded tRNAs and their mitochondria-encoded counterparts. These differences persist in all analyzed cancer types (Supplementary Fig. S1G; Supplementary Table S3), mirror our previous findings in lymphoblastoid cells from healthy individuals (5), and suggest potentially distinct roles for nuclear and mitochondrial tRFs in cancer.
To investigate the relative contribution of the nuclear and mitochondrial genomes to the pool of present RNAs, we calculated the abundance of tRFs across cancer types at the isoacceptor level, doing so separately for mitochondrial and nuclear isoacceptors. Use of unsupervised hierarchical clustering separates nearly all mitochondria-encoded isoacceptors from their nuclear counterparts. However, Fig. 2A also shows that the abundance of mitochondrial tRFs depends on cancer type. We investigated this further by correlating the expression of tRFs per isoacceptor with the mitochondrial DNA copy number in 22 cancer types (34). With the exception of adrenocortical carcinoma and kidney renal papillary cell carcinoma (KIRP), the average correlation coefficient of mitochondrial tRFs with mtDNA copy number was low (Spearman ρ < 0.4; Supplementary Fig. S3).
We also examined the clustering of tRFs when we consider their structural category. Separately for mitochondrial and nuclear fragments, we computed the abundance levels of 5′-tRFs, i-tRFs, and 3′-tRFs. Because i-tRFs represent a more heterogeneous group of molecules, we divided them into six subcategories based on the location along the mature tRNA of an i-tRF's 5′ terminus. Hierarchical clustering reveals some groupings of the structural categories that persist across the 32 cancer types (Fig. 2B): e.g., nuclear and mitochondrial i-tRFs that begin in region A are correlated with those that begin in the D loop; nuclear i-tRFs that begin in region B are correlated with nuclear 3′-tRFs; etc. A detailed cleavage analysis of tRNAs, which we carried separately for each of the 32 cancer types and tRF structural categories, further supports these findings (Supplementary Fig. S4).
These findings suggest that tRF production strongly depends on the parental tRNA's genomic origin, with nuclear and mitochondrial tRNAs producing characteristically different tRFs, in terms of identity and abundance. tRF production and the resulting tRF profiles are the outcome of currently-unknown mechanisms.
The associations between tRFs and mRNAs depend on molecular context
To identify links between tRFs and biological processes, we studied each of the 32 cancers for patterns of statistically significant correlations between tRFs and mRNAs. We filtered these tRF-mRNA correlations using stringent criteria (see Materials and Methods) while examining mitochondrial tRFs separately from nuclear tRFs. We find that the identities of the tRFs and mRNAs that are present in the analyzed samples remain largely unchanged across cancer types (Supplementary Fig. S5A–S5C; Supplementary Tables S4 and S5). However, the identities of the tRFs and mRNAs that are statistically-significantly correlated with one another change dramatically from one cancer type to the next (Supplementary Fig. S5D–S5G). Intriguingly, we find that the correlations with mitochondrial tRFs primarily comprise 3′-tRFs, whereas the ones with nuclear tRFs include a mixture of all structural categories with a preference for 5′-tRFs or i-tRFs in some cancer types (Fig. 2C).
To examine whether the observed correlation patterns reflect tissue-specific (and not cancer type–specific) events, we considered LUAD and kidney renal cell clear carcinoma (KIRC). For these two cancer types, TCGA has analyzed additional types from the same tissue: LUSC; kidney chromophobe (KICH); and, KIRP. Examining the number of commonly and differentially correlated tRF-mRNA pairs (19), we found that LUAD is closer to LUSC than to any other cancer type (Supplementary Fig. S5H). Nonetheless, only a mere 14% of the significantly correlated tRF-mRNA pairs in LUAD are shared with LUSC. For KIRC, we found that it is as far from KIRP and KICH as it is from any other cancer type. These data indicate that in addition to cancer type representing a major contribution to our analyses (Supplementary Fig. S5I), there can also be a tissue-specific contribution for some cancers (Supplementary Fig. S5H). Because decoupling the contribution of each component is not comprehensively feasible with the data that is available in TCGA, we refer to these correlations as “context-specific.” This is a particularly notable observation that echoes our recent report on triple-negative breast cancer (19).
tRFs are positively correlated with shorter mRNAs and negatively correlated with longer mRNAs
With the tRF-mRNA correlation pairs in hand, we examined whether the involved mRNAs exhibit length biases. We used Monte–Carlo simulations and Kolmogorov–Smirnov tests corrected for multiple testing (see Materials and Methods, and Supplementary Table S6) to evaluate the length of the mRNA, defined as the length of the union of the respective exonic sequences. As Fig. 3A shows, the mRNAs that are positively correlated with tRFs are, on average, significantly shorter that the average length of the expressed mRNAs. Also, the mRNAs that are negatively correlated with tRFs are, on average, significantly longer. This holds true for both nuclear and mitochondrial tRFs, and most of the 32 cancer types. To highlight the differences in distributions, we show in Supplementary Fig. S6 boxplots of the length distributions for each of the four combinations: two correlation signs × two genomes-of-origin. These length biases persist when instead of mRNAs we examine the length of the intronic portion for genes whose mRNAs are correlated with tRFs (Supplementary Fig. S7). These results suggest preferential interactions of tRFs with genes of specific genomic architecture.
The localization of the encoded proteins is dependent on the genomic origin of the correlated tRFs
Next, we systematically examined the cellular localization of proteins encoded by mRNAs that participate in tRF-mRNA correlations. We considered seven destinations: nucleus, cytoplasm, endoplasmic reticulum (ER)–Golgi, mitochondrion, cell membrane, secreted, and “other” (e.g., vesicles, endosomes, etc.). Again, we separated positive from negative correlations, and nuclear tRFs from mitochondrial tRFs.
We find that tRFs are correlated with mRNAs whose protein products localize to various combinations of the seven destinations following localization patterns that are context-specific (Fig. 3B and C). As far as positive correlations are concerned, mRNAs encoding nuclear proteins are significantly enriched in some cancer types, for example, breast cancer and testicular germ cell tumors (TGCT; Fig. 3B). But in uveal melanoma (UVM) and thyroid cancer, mRNAs encoding nuclear proteins are significantly depleted (Fig. 3B). We note that mRNAs encoding proteins destined for the mitochondria are consistently enriched among the positive correlations with mitochondrial tRFs in 30 of the 32 cancers.
Analogous observations can be made for the negative correlations (Fig. 3C). In colon adenocarcinoma, rectum adenocarcinoma, esophageal carcinoma, and TGCT, the negative correlations of both mitochondrial and nuclear tRFs are depleted in mRNAs encoding proteins destined for the nucleus or the mitochondria; yet, they are enriched in the mRNAs of secreted or cell membrane proteins. In other cancer types like prostate adenocarcinoma, low-grade glioma, KIRP, and LUAD, the negative tRF-mRNA correlations include more mRNAs whose proteins are destined for the nucleus than expected by chance (Fig. 3C). UVM is another interesting case: the negative correlations involving nuclear tRFs are enriched in mRNAs whose proteins are destined for the nucleus whereas those involving mitochondrial tRFs are depleted in this regard.
These findings suggest extensive and context-specific flow of information across cellular compartments.
The mRNAs that are correlated with tRFs differ by cancer type but often belong to the same biological processes
Having established that the identity of mRNAs that are correlated with nuclear or mitochondrial tRFs depends on the molecular context, we examined whether this dependence extends to biological processes (Supplementary Tables S7 and S8). We identified multiple examples of tRF-mRNA correlations and associated pathways that are enriched in only one cancer type (Supplementary Fig. S8A). For example, mRNAs from bile secretion, several amino acid metabolic pathways, and xenobiotic metabolism are exclusively prevalent among the tRF-mRNA correlations in liver hepatocellular carcinoma. As another example, mRNAs from the steroid biosynthesis pathway are prevalent only among the tRF-mRNA correlations in adrenocortical carcinoma.
We also found pathways that are enriched among the tRF-mRNA correlations in multiple cancers. For example, the KEGG pathway “ribosome” (hsa03010) is significantly overrepresented in 21 cancer types (Supplementary Tables S7 and S8) and the corresponding mRNAs are correlated with both nuclear and mitochondrial tRFs (Fig. 4A). For clarity, each tRF node in this figure represents all tRFs from the corresponding isoacceptor. Note how four mitochondrial tRNAs (mt-tRNAValTAC, mt-tRNALeuTAA, mt-tRNAProTGG, and mt-tRNAGluTTC) have the highest out-degrees and are associated with all three groups of ribosomal proteins, including those forming the cytosolic large ribosomal subunit (LSU) and small ribosomal subunit (SSU). We also note that which tRFs are correlated with ribosomal proteins depends on the considered cancer type (Supplementary Fig. S8B).
We methodically pursued this further by seeking “Biological Process” (BP) GO terms that are enriched in >10 distinct cancer types. To account for the overlap in the included genes, we grouped them into clusters and identified the represented processes (Supplementary Figs. S8C and S9A–S9D). The BP GO terms formed four basic groups with a multitude of correlations involving nuclear and mitochondrial tRFs (Fig. 4B).
The largest group (“red”) of GO terms pertains to development, cell adhesion, and signaling. Heart, blood vessel, and central nervous system development are included, as well as cell–matrix adhesion, and receptor tyrosine kinase signaling. Genes that are exclusive to this cluster include IGF2R, TGFBR2, ELK3, LRP1, ZEB2, and EDF1 and all have positive and negative correlations with nuclear and mitochondrial tRFs in 28 of 32 cancer types. The second largest group (“blue”) pertains to DNA and RNA metabolism, trafficking across compartments, and cell division. Notable genes in this category include TOP3B, RECQL, VDAC2, TRAM2, XPOT, IPO11, PQLC2, BOB1 and RAD1. The third group (“magenta”) pertains to “oxidative phosphorylation and ATP synthesis.” Finally, the “green” group pertains to genes linked to proteasome degradation, protein ubiquitination, antigen presentation, and NF-κB signaling.
We stress that despite the presence of the same pathways in different cancer types, the tRF-mRNA correlations that fuel these findings involve different tRFs and different mRNAs in each cancer type. We demonstrate this with two examples, the proteasome and glycolysis pathways.
For the proteasome, we analyzed the correlations from diffuse large B-cell lymphoma (DLBC) and KIRC. Figure 4C shows the results in each case, with the proteasome genes in exactly the same placement to facilitate comparisons. The shown edges indicate correlations between tRFs and the respective mRNAs. Note how the identities of the correlated partners differ in the two cancers. For example, in DLBC, the abundances of PSMD5 and PSMD9 exhibit the most correlations with tRFs, whereas in KIRC, it is PSMB4 and PSME2. In DLBC, the nuclear tRNAAlaTGC and tRNALeuCAG isoacceptors have the most links with mRNAs, whereas in KIRC, it is mitochondrial tRNAValTAC.
For the glycolysis, we analyzed the correlations from esophageal carcinoma, thyroid cancer, uterine corpus endometrial carcinoma (UCEC), and UVM. Specifically, we dissected the correlations of tRFs with metabolism-related genes. Figure 4D presents a reconstructed network of the genes from this pathway. The genes are connected on the basis of their known interactions with common metabolites as substrates/products of consecutive reactions in the pathway (see Materials and Methods). In esophageal carcinoma, TPI1 and GAPDH attract the attention of positively correlated tRFs from several nuclear and mitochondrial isoacceptors. In thyroid cancer, it is GPI and PFKM that have the most associations (positive and negative) with tRFs. UCEC and UVM exhibit yet different correlations patterns involving genes from this pathway.
These examples highlight the existence of strong associations between tRFs and core cellular processes (ribosome, proteasome, glycolysis, etc.). Also, the findings indicate that the exact details of these associations and underlying putative regulatory links depend strongly on cancer/tissue type.
The genomic spans of genes whose mRNAs are correlated with tRFs contain specific repeat elements
In earlier work, we showed that the distribution of repeat elements in introns and exons is not random and that it captures functional conservation in the absence of sequence conservation (35, 36). More recent efforts linked tRFs from tRNAGlyGCC to mRNAs in the mouse, through the MERVL family (6), and showed that tRFs can interfere with the reverse transcription or coding ability of two active mouse endogenous retrovirus (ERV) families (37). We, thus, posited a link between tRFs and repeat elements at large in human cancers and investigated this possibility by mining the tRF-mRNA correlations at hand. Separately for each cancer type, we examined possible enrichment/depletion of transcripts in repeat elements distinguishing between sense and antisense orientations. We analyzed separately the intronic and exonic portions of the genes (see Materials and Methods). We note here that, as our analyses showed, the abundance of mRNAs is not correlated with repeat element density in any of the cancer types (Supplementary Table S9).
We find multiple repeat families to be specifically enriched or depleted in genes whose mRNAs participate in correlations with tRFs (Supplementary Table S9). Figure 5 shows the most frequently occurring ones. A striking link between the sign of the tRF-mRNA correlations and the repeat content of the corresponding mRNAs is evident. Specifically, the introns and exons of genes whose mRNAs are positively correlated with tRFs are significantly enriched in several types of repeats. For mRNAs that are negatively correlated with tRFs, their genes are significantly depleted in the same types of repeats. ALU and MIR retrotransposons are most significantly enriched or depleted across multiple cancer types. On average, exons show less pronounced enrichment/depletion scores compared with introns.
Mitochondrial tRFs are correlated more strongly with repeats than nuclear tRFs. Generally, mitochondrial tRFs are positively (negatively, respectively) correlated with mRNAs whose introns have high (low, respectively) density in these repeats. This is true for exons as well. Unlike mitochondrial tRFs, nuclear tRFs show less pronounced associations with repeats through their correlated mRNAs (Fig. 5A and B). However, see the links between DLBC, KIRP, colon adenocarcinoma, rectum adenocarcinoma, head and neck squamous cell carcinoma, pancreatic adenocarcinoma, prostate adenocarcinoma, and breast cancer with ALUs and MIRs.
LINE elements warrant special mention. The introns of genes whose mRNAs are positively correlated with mitochondrial tRFs are frequently depleted in antisense L1s and consistently enriched in both sense and antisense L2s. For genes whose mRNAs are correlated (positively or negatively) with nuclear tRFs, their introns are only enriched in antisense L2s.
These results suggest that the links between tRFs and repeat elements are far-ranging and extend well beyond the recently reported links with ERVs. Notably, the findings suggest links between mitochondrial tRFs and extramitochondrial transcripts and have direct implications for the roles of repeats in cancer biology.
The transcriptomic coupling to repeat elements mirrors attributes of the underlying genomic architecture
In our results so far, we focused on elucidating the genomic properties of the mRNAs that are correlated with tRFs. In other words, the mRNAs that were linked with tRFs were singled out based on abundance relationships between mRNAs and tRFs, and not on abundance magnitude or any genomic characteristics selected a priori. As noted above, many of the identified genomic properties are not independent. It is known that gene length and repeat element density follow specific distributions based on the encoded proteins' involvement in biological processes (38), as well as the evolutionary trajectory of the genomic region (39). To link the underlying genome architecture with our results, we took an orthogonal approach and examined biological processes from a genomic perspective.
We considered all human genes, independent of expression, and ranked them based on the density of the repeat element categories of Fig. 5. We then examined which pathways are overrepresented in the most repeat-dense and the least repeat-dense gene sets (Supplementary Table S10). We found a considerable number of enriched pathways with essentially no overlap. The gene set with low repeat density includes ribosomal proteins, homeobox genes, G protein–coupled receptors, keratins, cytokines, as well as FOX proteins. The corresponding enriched GO terms include development, morphogenesis, and differentiation. On the other hand, the gene set with high repeat density includes G proteins, tyrosine, and serine-threonine kinases, as well as proteins with DHR1, DHR2, FERM, and/or EF-hand domains. Moreover, genes with high repeat density belong to signaling pathways, including MAPK, ErbB, Ras, PI3K-Akt, and cGMP-PKG.
These results emphasize that at the genomic level, the architecture of genes and the placement of repeat elements is nonrandom. At the same time, the identified processes have considerable overlap with those shown in Fig. 4. This suggests a coupling between the transcriptomic level, where we uncovered novel interconnections between tRFs and mRNAs, and the architecture of genes at the genomic level.
tRFs are correlated with mRNAs in a sex-dependent manner
We previously showed that tRF-mRNA correlations capture differences between White and Black/African-American patients with triple-negative breast cancer (19) and prostate adenocarcinoma (8). We posited that tRF-mRNA correlations also capture differences between patients of different sex. We are not aware of previous work that examined the possibility of molecular links between sex disparities and tRFs. To this end, we first focused on BLCA for which sex disparities with regard to incidence and survival rates have been documented (40). Because BLCA also depends on a patient's race/ethnicity (41), we restricted our analysis to only samples from White patients (Supplementary Table S11).
A first, rather striking observation pertains to the number of correlations per tRNA isoacceptor in patients of different sex (Fig. 6A). The same isoacceptor is associated with markedly different numbers of mRNAs in male (x-axis) and female (y-axis) patients. In fact, more isoacceptors are correlated with more mRNAs in females than in males. Isoacceptors are labeled if they are associated with ≥2× (or ≤0.5×) mRNAs in one sex (Fig. 6A). This difference is statistically very significant (Supplementary Fig. S10A). We also analyzed tRF-mRNA correlations in BLCA focusing on those that are either present in only one sex or change sign between sexes (Supplementary Table S12; Supplementary Fig. S10B). We find that 36% of the tRF-mRNA correlations found in female patients are absent from the male patients; and 19% of the tRF-mRNA correlations found in male patients are absent from the female patients. Figure 6B highlights BLCA's sex-dependent differences with the help of cyclin-dependent kinases (CDK) or proteins interacting with CDKs (Supplementary Table S11).
We repeated the same analysis for LUAD (42) as well as for KIRC (43), specifically for subtype ccA of KIRC (25). In LUAD, the case of MAP4K4 stands out. Its mRNA has the largest number of differential coexpression links (Supplementary Table S11). We examined the MAPK signaling pathway further and found that many of its components and direct interactors are differentially correlated with tRFs between the two sexes (Fig. 6C). Note the presence of critical gene regulatory nodes, like PTEN, CREB1, and CEBPB. Interestingly, in the original TCGA publication on LUAD, it was pointed out that mutations could explain only part of the activation of the PI(3)K–MAPK pathway (44).
Similarly, in KIRC-ccA, we identified numerous sex-dependent tRF-mRNA correlation pairs (Supplementary Table S11), including mRNAs from the purine metabolism pathway (Supplementary Table S12). Several mRNAs that code for enzymes in the biosynthesis of IMP, the precursor of AMP and GMP, show extensive rewiring with tRFs as a function of sex (Fig. 6D). The sex-dependent correlation differences extend beyond this pathway. Adenylate cyclase genes (ADCY3, ADCY4, and ADCY7) and nucleoside diphosphate kinases (NME1, NME3, and NME7) also exhibit sex-dependent differential correlation with tRFs in KIRC-ccA. We note that poor prognosis in KIRC has been associated with alterations in several metabolic pathways, including the pentose phosphate pathway (25).
These results suggest the possibility of tRFs being involved in the molecular events underlying sex disparities in multiple cancer types. They are in complete analogy to our previous reports that the tRFs are linked to race/ethnicity disparities in disease (8, 19).
We analyzed 20,722 tRFs that we mined from 10,274 white-listed TCGA datasets representing 32 human cancer types. In concordance with our previous results (5, 8), the evidence continues to support the view that tRFs represent a novel and complex layer in posttranscriptional regulation. The structural category of i-tRFs was the richest in terms of the number of distinct tRFs found across the various TCGA datasets. This category, despite having been discovered only recently (5), has been gaining independent validation computationally as well as experimentally (4, 18, 45).
First, we evaluated whether base modifications affect our ability to accurately mine tRFs from TCGA. Base modifications in the mature tRNA have been thought to be a consideration when working with datasets generated using standard RNA-seq protocols (46). Through TCGA-wide analyses, we showed that base modifications have a rather limited impact on our ability to mine tRFs from these cancer datasets (Supplementary Fig. S1A and S1B). This is an expected result when the RNA-seq approach involves ligating both adapters prior to the reverse transcription step, which is the method used by TCGA. Indeed, had a modification caused the reverse transcriptase to stop, then the corresponding molecule would not have been amplified and, therefore, would not have appeared among the sequenced reads. We also evaluated the impact of mapping sequenced reads by allowing mismatches. We found that doing so decreases our ability to establish the tRNA provenance (or lack thereof) of the considered molecules (Supplementary Fig. S1C).
We systematically investigated the relationships between tRFs and mRNAs, motivated by earlier work by others and us, showing that tRFs affect mRNA abundance by acting like miRNAs (3, 5), or by decoying RBPs (18, 19). Among the tRF-mRNA pairs that emerge de novo from our analyses are several interactions that were validated recently in the literature: i-tRFs from tRNAGly and tRNATyr were shown to be linked to the mRNAs of HMGA1, CD151, CD97, and TIMP3 through the RBP known as YBX1 (18). In addition, the tRF-mRNA pairs also included several hundred correlations between tRFs and ribosomal proteins (Fig. 4A), as well as more than one thousand significant correlations with aminoacyl-tRNA synthases, including GARS, IARS, and MARS. These correlations persist across many cancer types and are concordant with recent findings (1, 12, 13).
Among the mRNAs that are correlated with tRFs, we observed several pathways that are consistently present across multiple cancer types (Fig. 4B; Supplementary Tables S4 and S5). In addition, we identified other pathways that are unique to individual cancer types. Not surprisingly, our analyses show a tissue-specific contribution to the discovered correlation patterns. This is in agreement with our previously reported findings of context- and tissue-specific correlations between miRNA isoforms and mRNAs during cancer development (47), as well as other related work on additional levels of biological function (48). Specifically for glycolysis (Fig. 4), although the Warburg effect is a hallmark of cancer metabolism, there is increasing evidence in support of cancer type- and context-specific metabolic signatures (49). From this perspective, the identified links to tRFs could shed light on the molecular events behind their regulation and functions in different tissues and cancer types.
Mitochondrial tRFs are featured prominently in all of our findings. Studies of the tRFs' subcellular distributions are in their early stages (3). From this standpoint, the correlation of mitochondrial tRFs with processes that are not mitochondria-specific suggests possible biogenesis from either mitochondrial tRNAs that exit from the mitochondrion (50) or from the DNA of the mitochondrial “tRNA-lookalikes” that we reported in nuclear chromosomes (33). It is worth recalling that the mitochondria have established roles in cancer, at multiple levels between the genetic (48) and the metabolic processes (51), and that mitochondrial processes can signal and affect the nuclear genome's state (52). Thus, in light of our results, we postulate that tRFs act as mediators of an information exchange between the mitochondria and the nucleus. For example, the localization patterns of Fig. 3B suggest possible roles of tRFs as regulators marshaling the communication exchange between different cell compartments.
Sex disparities are attracting increasing scientific interest. We examined three cancer types, BLCA, LUAD, and KIRC, with documented disparities in TCGA (53) using a differential coexpression approach that provides deeper insights than differential expression (19, 54). In all cases, we identified pathways that are integral components of the molecular biology of each cancer type as well as responsive to the hormone environment. Sex hormones are arguably important contributors to sex disparities in the disease context. In fact, sex hormones have been shown to regulate cell proliferation in the context of BLCA (55), signaling cascades in lung cancer that include the MAPK pathway (56), as well as the pentose phosphate pathway (57). In addition, some tRHs have sex hormone dependencies (58). We posit that the networks of Fig. 6 depict components of the mechanistic contributions to sex disparities in cancer by tRFs.
We also identified links between tRFs and genomic features. Two striking results are the bias in the length of the mRNAs that participate in correlations with tRFs (Fig. 3A), and the enrichment of their genomic span in specific classes of repeats (Fig. 5). We note that the repeat class of “tRNA” was not found enriched among the exons that comprise the mRNAs exhibiting correlations with tRFs (Supplementary Table S9). This indicates that the observed tRF-mRNA correlations are not due to an enrichment of exons in tRFs.
In our earlier work, we described the nonrandom placement of repeat elements within the exons of genes (35), and across the genome (36), as well as outlined interconnections involving short RNAs (piRNAs) produced by exonic regions (35, 59). In parallel studies of genome-wide methylation, we showed that repeat elements become demethylated as stem cell differentiation progresses (60). These previous findings collectively suggested potentially important roles for repeat elements.
More recent work highlighted the potential for regulation of repeat elements by tRFs (6, 37, 61). Consequently, the roles of repeat elements in gene expression regulation have been attracting attention (36, 62). In the context of cancer, repeat elements continue to emerge as important components in genomic rearrangements (63) as well as potent regulators of gene expression (64), and as determinants of overall survival (65). As repeat elements are also associated with chimeric transcripts involving protein-coding exons (66), it is conceivable that tRFs can interact with such transcripts, in complete analogy to what we reported previously for miRNAs (67). Our results also suggest that, in addition to interacting with repeat elements to protect the genome's integrity (61), tRFs are also involved in complex gene regulation. Indeed, short gene length has been associated with highly expressed genes and a proliferative phenotype (39). The state of proliferation is also molecularly unique at additional levels of cellular function. The metabolism of proliferative cells has the characteristic signature of the Warburg effect, whereas mature tRNA abundance profiles are distinct in proliferative cells as compared with differentiated ones (68). In addition, based on our genome-level analysis (Supplementary Table S10), pathways promoting proliferation include genes dense in repeats, whereas pathways promoting differentiation include genes with no, or little, repeat content. These results have been discussed in the literature (36). Our analyses place the tRFs at crucial junctions of the multi-dimensional and complex process of cell proliferation.
In conclusion, our analyses reveal a very extensive array of complex relationships between tRFs and protein-coding mRNAs. These associations suggest the existence of numerous direct molecular interactions that await validation and characterization. The presence of multiple families of repeats in the introns and exons of mRNAs with which the tRFs are correlated adds yet another level of complexity to these complex relationships.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Conception and design: A.G. Telonis, P. Loher, E. Londin, I. Rigoutsos
Development of methodology: A.G. Telonis, I. Rigoutsos
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): A.G. Telonis, P. Loher, R. Magee, I. Rigoutsos
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): A.G. Telonis, P. Loher, R. Magee, V. Pliatsika, E. Londin, I. Rigoutsos
Writing, review, and/or revision of the manuscript: A.G. Telonis, P. Loher, R. Magee, V. Pliatsika, E. Londin, Y. Kirino, I. Rigoutsos
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): A.G. Telonis, P. Loher, V. Pliatsika, I. Rigoutsos
Study supervision: A.G. Telonis, I. Rigoutsos
We are indebted to the NIH for making TCGA data publicly available. We thank Dr. Megumi Shigematsu, Dr. Takuya Kawamura, and the other members of the Center for discussions and input on this manuscript. The work was supported partially by a William M. Keck Foundation grant (I. Rigoutsos), NIH/NCI R21-CA195204 (I. Rigoutsos), and by Institutional Funds.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.