Understanding the vast noncoding cancer genome requires cutting-edge, high-resolution, and accessible strategies. Artificial intelligence is revolutionizing cancer research, enabling advanced models to analyze genome regulation. This review examines illustrative examples of noncoding mutations in cancer, focusing on both key regulatory elements and risk-associated variants that remain poorly understood, and compares key artificial intelligence models developed over the last decade for identifying functional noncoding variants, predicting gene expression impacts, and uncovering cancer-associated mutations. The discussion of the goals, data requirements, features, and outcomes of the models offers practical insights to help cancer researchers integrate these technologies into their work, regardless of computational expertise.

This article is part of a special series: Driving Cancer Discoveries with Computational Research, Data Science, and Machine Learning/AI.

Cancer remains a critical health challenge in the 21st century (1, 2). Concurrently, artificial intelligence (AI) stands as one of the most transformative technological advancements of our era. It is thus unsurprising that a growing number of research groups are using AI to address multiple challenges in cancer research (3, 4). By integrating AI into cancer omics, researchers are now able to examine vast genomic datasets, uncovering new patterns within the noncoding regions that may influence cancer development (3, 5, 6).

The main trigger of oncogenesis is the accumulation of mutations in our cells’ DNA (7, 8). With more than 98% of the human genome being noncoding, it is expected that most mutations occur in these unexplored regions (9). Noncoding DNA was long dismissed as “junk” DNA (11) and considered functionally meaningless because it does not code for proteins. In 2012, with the publication of the Encyclopedia of DNA Elements (10), the scientific community acknowledged the indispensable functions of the noncoding elements in gene regulation, structural chromosome organization, and other cellular processes. In recent years, the informal denomination shifted from “junk” to “dark” genome (10, 12), abandoning the notion of irrelevance and emphasizing our limited understanding of most of our DNA. We now know that mutations affecting regulatory elements may also contribute to tumorigenic processes. This was reinforced in 2013 when two articles demonstrated the oncogenic effect of point mutations in the promoter of the telomerase reverse transcriptase gene (TERT; refs. 13, 14).

Despite significant progress over the past decade, several challenges persist in understanding the role of noncoding variants in cancer. Key issues include distinguishing between functional and nonfunctional alterations, identifying causal variants, and uncovering the underlying mechanisms of mutations in noncoding regions. Redefining the landscape, AI models have emerged, providing powerful tools to overcome these barriers. This review aims to trace the evolution of our understanding from the initial concept of “junk DNA” introduced in the 1970s to the recent publication of the general expression transformer (GET) model (15) for functionalizing cancer genomes (see timeline included in Fig. 1).

Figure 1.

Timeline illustrating significant milestones in human genome research, highlighting key discoveries and projects related to the noncoding genome and regulatory elements (gray) and the main development of AI methods for investigating noncoding elements implicated in cancer (blue). DLBCL, diffuse large B-cell lymphoma; ENCODE, Encyclopedia of DNA Elements; NN, neural network; SVM, support vector machine; TMB, tumor mutational burden.

This review includes two main parts. The first part presents an updated perspective on the role of the noncoding regions in our genome and their implications in oncogenesis through mutations. We will examine key regulatory elements and provide examples of known mutations in cancer affecting these elements, such as those in promoters and enhancers, and also germline risk variants in less studied noncoding regions. In the second part of this review, we will examine state-of-the-art AI models that have significantly advanced our understanding of the biological mechanisms behind noncoding variants in the human genome. We will compare these models based on their research purposes, data requirements, architectures, and outputs, focusing on their applications and implications within the realms of computational biology and cancer research.

The significance of this review is further highlighted by a perspective article published in September 2024 by the Impact of Genomic Variation on Function Consortium in Nature (16). This article details their strategic initiative to integrate experimental and computational methodologies aimed at elucidating how both coding and noncoding variants influence gene regulatory and protein interaction networks in various diseases. This initiative underscores the critical need to deepen our understanding of noncoding regions in cancer genomics, aligning perfectly with our review’s goal to explore their functional roles and the innovative computational tools available to study them.

The specific objectives of this review are as follows: (i) to discuss and exemplify how point mutations in key regulatory elements and other noncoding regions can affect cancer development and (ii) to evaluate the leading AI models used for analyzing noncoding variants with a focus on their application in cancer research and to discuss their evolution over the past decade. We will compare key factors such as model complexity, data integration, cell type specificity, interpretability of results, and overall usability in the research context. Figure 2 provides a schematic overview of the review’s approach, illustrating the connection between these two main objectives. By highlighting how these models contribute to understanding the regulation of genomes and their potential impact on cancer, we aim to provide valuable insights for the biomedical community, particularly those less familiar with computational technologies, thereby facilitating their integration into ongoing research efforts.

Figure 2.

Illustrative scheme of the review’s approach. AI models for studying the regulation of cancer genomes. Left, illustration of how noncoding regions can be altered by mutations in two categories: (i) regulatory regions, with the most common being promoters, enhancers, SEs, and silencers, and (ii) other, less characterized noncoding regions, at both somatic and germline levels. Right, the potential applications of AI models for functionalizing noncoding cancer genomes. Key objectives include identifying causal variants, predicting gene expression changes, detecting cell types of origin, elucidating alterations in regulatory elements, and establishing causation. This approach highlights the pivotal role of AI in advancing the functional interpretation of noncoding mutations in cancer. Created in BioRender. Alvarez Torres, M. (2025) https://BioRender.com/8jsn4ih.

The 2020 publication “A Compendium of Mutational Cancer Driver Genes” identified key genes linked to cancer (17), although it did not address mutations in noncoding elements. That year, a review article highlighted the significance of the noncoding genome in cancer (18), emphasizing the relevance of noncoding regulatory elements in gene regulation (19). Mutations within these regulatory elements can disrupt the regulatory network by promoting uncontrolled cell proliferation, inhibiting tumor suppressors, or impairing DNA repair mechanisms (18, 20, 21). This perspective highlights the imperative for an integrated approach to cancer genetics, encompassing both coding and noncoding mutations.

Noncoding germline variants are frequently identified in genome-wide association studies (GWAS), suggesting potential links to cancer but not confirming causation. Identifying truly causal variants requires extensive functional studies (20), particularly in the noncoding regions. Advanced AI models can enhance the efficiency of these functional studies by guiding researchers in pinpointing relevant variants, genes, and cells of origin or underlying mechanisms and providing insights into their potential roles in gene regulation and carcinogenesis.

The role of noncoding regions in cancer extends beyond association studies. The following sections explore the functional characteristics of the main regulatory elements and highlight specific mutations, underscoring their influence across various cancer types. We present key examples of somatic and germline noncoding mutations associated with cancer, focusing on regulatory elements such as promoters, enhancers, super-enhancers (SE), and silencers, as well as mutations in “other” intronic or intergenic regions whose functions remain unidentified. Although this study focuses on point mutations, we recognize that structural variations, both germline (22) and somatic (23), can also play a significant role in the noncoding genome and contribute to oncogenesis.

Mutations in promoters

At the heart of gene expression regulation lies the promoter, a critical DNA sequence positioned at the transcription start site of both protein-coding and noncoding genes. Promoters support the assembly of the transcription machinery, including RNA polymerase II and a plethora of transcription factors, which collectively initiate the transcriptional process (24, 25). Mutations in these noncoding promoter regions can drastically alter the binding dynamics of these essential components, leading to perturbations in gene transcription and expression (26). The best-known example is the set of mutations in the promoter of the TERT gene, which have profound implications in cancer biology (27–29). TERT encodes the catalytic subunit of telomerase, a ribonucleoprotein enzyme crucial for maintaining telomere length and cellular immortality (28, 30). In cancer, somatic mutations in the TERT promoter commonly occur at specific hotspots, most notably at positions -124 bp (C228T) and -146 bp (C250T) upstream of the translation start site (ATG; refs. 31–33). These point mutations create de novo binding sites for ETS family transcription factors, such as GA-binding protein, leading to enhanced transcriptional activity and higher expression of the TERT gene (34). This aberrant upregulation of TERT enables cancer cells to evade replicative senescence and apoptosis, facilitating unlimited proliferation, an essential hallmark of cancer. These promoter mutations are prevalent across various cancer types, with particularly high frequencies in melanoma (13, 14), glioblastoma (35), and certain carcinomas (32). Despite their high frequency, TERT promoter mutations in cancer remained unnoticed until 2013 (13, 14), likely due to restricted access to whole-genome sequencing data and the absence of appropriate bioinformatic tools. Between 2010 and 2012, leading bioinformatics tools such as GATK (36), MuTect (37), VarScan (38), and ANNOVAR (39) were published, enabling comprehensive analysis of the whole genome.
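
The creation of a de novo ETS binding site by a single base change can be illustrated with a small motif scan. The following is a minimal sketch, not a validated analysis: the 11-bp fragment, the mutated index, and the helper names (`count_motif`, `apply_snv`) are illustrative assumptions, and the ETS core is simplified to the 4-mer GGAA (TTCC on the opposite strand).

```python
# Toy illustration: does a single-base substitution create an ETS core motif?
# The sequence and index below are illustrative placeholders, not verified
# genomic coordinates.

ETS_CORE = "GGAA"  # simplified core motif recognized by ETS-family factors


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif(seq: str, motif: str = ETS_CORE) -> int:
    """Count motif occurrences on either strand (overlaps allowed)."""
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(seq[i:i + k] in targets for i in range(len(seq) - k + 1))


def apply_snv(seq: str, index: int, ref: str, alt: str) -> str:
    """Return seq with the base at 0-based index changed from ref to alt."""
    if seq[index] != ref:
        raise ValueError(f"expected {ref} at index {index}, found {seq[index]}")
    return seq[:index] + alt + seq[index + 1:]


reference = "CCCCTCCCGGG"  # hypothetical promoter fragment
mutant = apply_snv(reference, 5, "C", "T")  # C-to-T substitution

# The reference fragment has no ETS core; the mutant gains one (TTCC).
print(count_motif(reference), count_motif(mutant))
```

In practice, motif gain or loss would be assessed with position weight matrices and experimental validation rather than exact string matching; the sketch only shows why a single C-to-T change can be sufficient to introduce a new transcription factor binding site.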

Another remarkable example of promoter mutations, this time germline, that increase cancer risk is seen in familial adenomatous polyposis. In this hereditary syndrome, multiple deletions and loss-of-function mutations of promoter 1B in the APC (adenomatous polyposis coli) gene have been identified. These alterations disrupt the normal transcriptional regulation of APC, a key tumor suppressor that is essential for the control of cell proliferation and maintenance of genomic integrity (40, 41). The resultant loss of APC function due to these promoter disruptions leads to the development of a high number of colorectal polyps, from hundreds to thousands, significantly heightening the susceptibility to colorectal cancer (41). The progression of these polyps to malignancy is almost inevitable without surgical intervention, highlighting the profound impact of promoter mutations on cancer risk in familial adenomatous polyposis.

Mutations in enhancers

Enhancers are fundamental regulatory DNA elements that modulate gene expression by amplifying the transcriptional activity of target genes through sophisticated interactions with their promoters, often over extensive genomic distances (42, 43). Their ability to integrate and respond to a range of cellular signals and developmental cues is crucial for the precise orchestration of gene networks (43). Therefore, disruptions in enhancer function can contribute to disease progression, underscoring their critical role in maintaining genomic stability and cellular function.

An illustrative case of the oncogenic effect of enhancer mutations is the SNP rs55705857, located on chromosome 8q24, in an enhancer that regulates the MYC oncogene. This germline variant is associated with a sixfold increased risk of developing IDH1-mutant gliomas (44). The risk allele alteration (a G-to-A substitution) disrupts the OCT4 motif, activating this enhancer and leading to increased MYC expression and promoting glioma pathogenesis (44). This study is an excellent example of the substantial time and effort required to unravel these complex interactions using traditional methods. Researchers must meticulously map the interactions between the regulatory elements and motifs disrupted by the SNP and evaluate how such noncoding variants influence gene regulation.

As illustrated in the previous example, GWAS can provide useful genetic evidence for the role of enhancer dysfunction in cancer (45). Furthermore, somatic alterations significantly affect enhancer activity by increasing enhancer copy number, causing structural rearrangements, and introducing point mutations or insertions/deletions that alter transcription factor binding or create new enhancers. For example, in T-cell acute lymphoblastic leukemia, chromosomal translocations can reposition enhancers near critical proto-oncogenes such as TLX1 and MYC, disrupting their normal function (42). In addition, long-range enhancers that regulate MYC via NOTCH1 signaling can become aberrantly activated due to duplications, leading to MYC overexpression and promoting malignant transformation in T-cell acute lymphoblastic leukemia (46).

Together, these findings underscore the complex role of enhancers in cancer and highlight the need for comprehensive studies to unravel their contributions to disease progression.

Hypermutations in SEs

SEs can be defined as large clusters of putative enhancers in close genomic proximity with unusually high levels of mediator binding that drive exceptionally high gene expression, often regulating key genes involved in cell identity and disease (47). Due to their capacity to activate oncogenes, SEs are increasingly recognized for their potential critical role in cancer (47, 48), as they can amplify the expression of genes involved in tumorigenesis. In various cancers, SEs become aberrantly activated, leading to the dysregulation of oncogenes and other crucial genes, thereby promoting malignant transformation and progression (49).

A notable example was presented in 2022 by Bal and colleagues (50), who demonstrated that SEs are frequently somatically hypermutated in diffuse large B-cell lymphoma, the most common form of B-cell non–Hodgkin lymphoma, which remains incurable in 40% of patients (51). SEs were hypermutated in 92% of analyzed samples and exhibited signatures indicative of activation-induced cytidine deaminase activity; the affected SEs were associated with genes encoding B-cell development regulators and oncogenes. Notably, these hypermutated regions were linked to the BCL6, BCL2, and CXCR4 proto-oncogenes, in which the mutations were found to prevent binding of, and transcriptional repression by, factors such as BLIMP1 and NR3C1. The study demonstrated that genetic correction of these mutations in repressor-binding domains led to a decrease in target gene expression and to the counter-selection of cells carrying corrected alleles, emphasizing the oncogenic dependency on these SE mutations.

Mutations in silencers

Silencers are distinct regulatory DNA elements that actively suppress gene expression by recruiting transcriptional repressors. Unlike enhancers, which enhance gene expression, silencers act to inhibit it, ensuring that genes are expressed only under appropriate conditions (52). Mutations in silencers can disrupt this repression mechanism, leading to the aberrant activation of genes that are normally kept in check (52).

GWAS of estrogen receptor (ER)–positive breast cancer identified three independent germline variants within the FGFR2 risk locus (53, 54). Functional reporter assays demonstrated that these variants reside in silencer elements, which regulate FGFR2 expression. Notably, the risk alleles enhance silencer activity, leading to reduced FGFR2 expression, increased estrogen responsiveness, and higher breast cancer risk. Similarly, SNPs linked to both ER-positive and ER-negative breast cancers were mapped to the human ERα gene (ESR1; refs. 55, 56). Fine-mapping and functional studies revealed that five of these SNPs are located in noncoding regulatory elements controlling ESR1 expression, including one in a silencer element whose mutation may elevate the expression of ESR1 (57).

These mutations exemplify how the alteration of these regulatory elements can lead to dysregulation of gene silencing mechanisms and promote oncogenesis (58).

Variants of uncertain significance in noncoding regions

Previously, we have highlighted several mutations within key regulatory elements, including promoters, enhancers, SEs, and silencers, that are critical in the pathogenesis of various types of cancer. However, scientific evidence indicates that only a limited number of noncoding mutations function as true drivers of oncogenesis. This complexity and functional diversity are illustrated by the findings of Dr. Dietlein and colleagues (59), who conducted a comprehensive genome-wide analysis of somatic noncoding mutations across 19 cancer types. This study classified mutations into “regulatory regions,” “tissue-specific genes,” and an additional “other” category, thereby revealing a complex landscape of noncoding alterations.

Mutations within regulatory regions showed significant enrichment in canonical cancer genes, such as TERT promoter mutations and FOXA1 alterations in breast and prostate cancers, underscoring the crucial role of these elements in oncogenesis. In contrast, tissue-specific mutations, while less likely to be direct drivers, were associated with genes such as TMEFF2 and HCN1 in brain tumors and KLK3 and TMPRSS2 in prostate cancer. Moreover, this study, along with findings from multiple cancer GWAS, identified numerous noncoding mutations in the “other” category that do not conform to these classifications. These include mutations in NEAT1 and NEAT2 across various cancers, as well as alterations in genes such as MAD1L1 and MAD2L1 in brain and ovarian cancers, NF1 in breast cancer, and KCNJ15 and ABHD5 in kidney and liver cancers. Additionally, noncanonical splice site mutations in APC and SMAD4 in colorectal cancer were identified, further complicating our understanding of the oncogenic potential of altering these noncoding regions.

The mechanisms underlying these “other” noncoding mutations remain largely unresolved, presenting a critical avenue for future research. Advanced AI models offer a promising approach to decipher these mutations by identifying hidden complex patterns and interactions within the data, hopefully leading to the discovery of novel biomarkers and therapeutic targets.

In the previous section of this review, we have underscored the crucial significance of mutations, both somatic and germline, within key regulatory elements, as well as in less-characterized noncoding regions. It is evident that there is a pressing need to identify novel mutations in these noncoding genomic areas and unravel the mechanisms contributing to oncogenesis. In this context, the advent of advanced AI models has profoundly transformed the study of the noncoding genome in cancer research. However, discrepancies in methodologies among these models can result in variations in predictions and scoring systems. Although numerous models leverage DNA sequence data, some are adept at identifying regulatory motifs or assessing variant impacts, whereas others enhance accuracy by incorporating additional data, such as chromatin accessibility, or predicting gene expression. This methodologic diversity underscores the necessity of understanding each model’s capacities, strengths, and limitations. A comparative analysis of these approaches can refine the interpretation of model outputs and advance our understanding of the noncoding genome in cancer.

We selected the main AI models based on their widespread recognition in the field, considering citation metrics, their performance in predicting gene expression and the effect of noncoding mutations, and their contributions to cancer genomics research from 2015 to 2024. This resulted in the inclusion of the following nine models for comparison: DeepSEA (60), DeepBind (61), DeltaSVM (62), Basset (63), CADD (64), Expecto (65), DeepC (66), Enformer (67), and GET (15). Supplementary Table S1 presents a descriptive overview of the AI models, including their architectural frameworks, training datasets, and key outputs, while highlighting their strengths, limitations, and contributions to cancer genomics.

A detailed examination of each original publication for the AI models (Supplementary Fig. S1; Supplementary Table S1) reveals several key characteristics for further discussion: model complexity, data integration, cell type specificity, interpretability, and usability in cancer research.

Model complexity

The complexity of AI models in genomics has evolved significantly over time (68). Earlier models, such as DeepSEA and DeepBind (both from 2015), used convolutional neural networks of moderate complexity to predict the impact of genetic variants. DeltaSVM (2015) employed a simpler supervised machine learning technique, the support vector machine. Subsequent models, including Basset (2016) and Expecto (2018), introduced deeper architectures to handle cell type variations and regulatory sequences. The complexity reached new heights with DeepC (2020), which used advanced deep learning techniques to predict enhancer-associated variants. The latest models, Enformer (2021) and GET (2024), leverage transformer-based architectures, reflecting a substantial increase in complexity by integrating extensive genomic data using attention mechanisms to capture intricate gene expression interactions. This is illustrated in Supplementary Fig. S1, which shows the increase in model complexity scores from the older models to the newer ones.
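
A common way for sequence-based models to score a noncoding variant is to compare the model output for the reference and alternate alleles. The sketch below illustrates that idea under strong simplifying assumptions: a single hand-written filter stands in for a trained convolutional network, all weights are invented, and the function names (`toy_model`, `variant_effect`) are hypothetical rather than taken from any of the reviewed tools.

```python
# Sketch: variant effect = model(alt) - model(ref).
# One hand-made filter stands in for a trained model; all weights are invented.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


# Hypothetical 4-position filter that rewards the string "TTCC":
# +1 for the preferred base at each position, -1 for any other base.
FILTER = [
    [-1.0, -1.0, -1.0, 1.0],  # prefers T
    [-1.0, -1.0, -1.0, 1.0],  # prefers T
    [-1.0, 1.0, -1.0, -1.0],  # prefers C
    [-1.0, 1.0, -1.0, -1.0],  # prefers C
]


def toy_model(seq):
    """Slide the filter along the sequence and return the maximum activation."""
    x = one_hot(seq)
    width = len(FILTER)
    activations = []
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][j] * FILTER[i][j]
            for i in range(width)
            for j in range(4)
        )
        activations.append(score)
    return max(activations)


def variant_effect(ref_seq, alt_seq, model=toy_model):
    """Difference in model output between alternate and reference sequences."""
    return model(alt_seq) - model(ref_seq)


ref_seq = "CCCCTCCCGGG"  # toy reference sequence
alt_seq = "CCCCTTCCGGG"  # same sequence with one substitution
print(variant_effect(ref_seq, alt_seq))  # 2.0
```

Real models replace the single filter with many learned filters, deeper layers, and, in the most recent architectures, attention over long sequence windows, but the reference-versus-alternate comparison remains a useful mental model for how variant scores are derived.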

Furthermore, these advancements in complexity address several common limitations in the field. Current sequence-based models primarily capture gene expression determinants in promoters but often overlook the crucial role of distal enhancers, leading to incomplete interpretations of gene regulation (69). Additionally, existing genomic deep learning models struggle to adequately account for personal transcriptome variation, resulting in a limited understanding of how individual genetic differences influence gene expression (70). By overcoming these limitations, the new models offer a more comprehensive view of genomic regulation.

Multimodality integration

Data integration in AI models involves combining genomic sequences, transcriptomic profiles, and epigenetic markers to enhance the identification of gene mutations, expression patterns, and regulatory interactions (68). However, integration is challenging due to data heterogeneity, large datasets, and varying formats that require extensive preprocessing (71). Some models, such as DeepSEA, Expecto, Enformer, and GET, utilize extensive datasets. DeepSEA was trained using both accessibility profile features and transcription factor (TF)–binding events; Expecto integrates both TF-binding and chromatin features and cell type–specific genome-wide histone marks to improve its predictive power. Enformer can be trained on different types of genome-wide tracks, including Cap analysis of gene expression (CAGE) measuring transcriptional activity, histone modifications, TF binding, and DNA accessibility. GET combines genome data with chromatin accessibility data (single-cell Assay for Transposase-Accessible Chromatin using sequencing) and gene expression profiles (single-cell RNA sequencing) from more than 200 cell types to allow a better understanding of the mechanisms governing transcriptional regulation. These models represent a trend toward sophisticated integration of diverse biological data to elucidate the complex effects of noncoding variants. However, some of them can present significant computational challenges and demands due to the integration of large-scale datasets. In contrast, other models adopt a more focused approach, specializing in specific data types to address particular aspects of noncoding variant effects. For instance, DeepBind emphasizes DNA–protein interactions by predicting binding affinities based on experimentally derived scores. Likewise, DeepC enhances our understanding of chromatin interactions through high-resolution contact matrices. Therefore, these models do not require extensive data integration.

Cell type specificity

Cell type specificity is a critical factor influencing model performance. DeltaSVM, Expecto, Basset, Enformer, and GET exemplify the impact of cell type–specific training data on prediction accuracy. DeltaSVM shows high performance in identifying causal SNPs when the training data closely match the cell type of interest, yet its generalizability to other cell types is constrained. Expecto similarly excels in cell type–specific predictions but requires extensive biological data integration. Basset, trained on a diverse range of cell types, provides a more generalized perspective and is effective in capturing chromatin accessibility features across various contexts, although it may lack the depth in cell type–specific regulatory insights that more specialized models offer. Enformer improved on the performance of previous models in various cell types and tissues. However, it can only make predictions for the cell types and assays present in its training data and does not generalize to new ones.

The example discussed in the “Mutations in enhancers” section highlights the crucial role of cell type specificity in ensuring accurate predictions by AI models. In this case, the SNP increases the risk of developing isocitrate dehydrogenase (IDH)–mutant gliomas sixfold but does not affect IDH wild-type glioblastoma. Although these tumors were once thought to be closely related, we now recognize significant differences in their behavior and likely their cells of origin. As a result, predicting the effects of such variants or guiding experimental research would only be possible with AI models that account for cell type–specific contexts. In this sense, the newest model, GET, represents a pioneering approach, achieving experimental-level accuracy in predicting gene expression even in previously unseen cell types and including nonphysiologic cell types.

Interpretability

Without sufficient interpretability, model predictions risk being perceived as opaque “black box” outputs, which could limit their applicability in clinical and research settings. In the context of understanding mutations within the noncoding genome in cancer, interpretability is not just beneficial but imperative, as it allows researchers to pinpoint how specific mutations might regulate gene expression or contribute to tumorigenesis. Some models, such as DeepSEA and DeepBind, have strong predictive accuracy but struggle with interpretability. These models, built on convolutional neural networks (CNN), can predict the impact of noncoding variants on gene regulation but often do so without revealing the underlying features driving their predictions.

On the other hand, models such as Expecto, Enformer, and GET emphasize interpretability by integrating multiomic data and employing advanced techniques such as attention mechanisms. Among the models discussed in this review, Enformer and GET stand out for their ability to integrate and learn relationships within a broader sequence and cell type context. This allows for more global and informative interpretations, as their transformer-based architectures facilitate detailed feature attribution, linking predictions to specific genomic elements such as promoters, enhancers, and other regulatory regions. This design enhances the model’s transparency by elucidating which genomic features influence predictions, providing direct insight into the biological implications in gene expression, tailored to each human cell type. This level of interpretability is particularly valuable in cancer research, in which understanding the role of noncoding mutations in regulatory regions can reveal critical insights into tumorigenesis and identify potential targets for therapeutic intervention.
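
Feature attribution can also be approximated in a model-agnostic way by in silico saturation mutagenesis: every base is substituted in turn and the change in the prediction is recorded. The sketch below is an illustration of that generic technique, not the attribution method of any specific model discussed here; `toy_score` is a hypothetical stand-in for a trained model's prediction.

```python
# Sketch: in silico saturation mutagenesis as a model-agnostic attribution method.
# `toy_score` is a hypothetical stand-in for a trained model's prediction.

BASES = "ACGT"


def toy_score(seq):
    """Invented scoring function: number of 'TTCC' occurrences in seq."""
    return float(sum(seq[i:i + 4] == "TTCC" for i in range(len(seq) - 3)))


def saturation_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score change} for every single-base substitution."""
    baseline = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutated = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutated) - baseline
    return effects


def position_importance(effects, length):
    """Summarize each position by its largest absolute effect."""
    importance = [0.0] * length
    for (pos, _alt), delta in effects.items():
        importance[pos] = max(importance[pos], abs(delta))
    return importance


sequence = "ACGTTCCGA"  # toy sequence containing one TTCC site
effects = saturation_mutagenesis(sequence, toy_score)

# Positions inside the TTCC site receive nonzero importance; flanking bases do not.
print(position_importance(effects, len(sequence)))
```

With a real model, the same loop highlights which bases in a promoter or enhancer drive a prediction, which is the kind of information needed to prioritize noncoding variants for functional follow-up.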

Usability in cancer research

The AI models reviewed in this article were selected for their specific strengths in analyzing the effect of noncoding variants within the human genome. Although not all models represent the latest technical advancements, each of them was selected for its demonstrated utility and relevance to human genetics and cancer research. This careful selection ensures a focused discussion on models that are most pertinent to understanding the role of noncoding variants in cancer.

Supplementary Table S2 summarizes specific use cases presented in the original publication of each model. Although most publications emphasize cancer-related examples, Basset and DeepC did not focus on cancer in their original articles; instead, these models were applied to other clinical conditions. A key point is that the functional consequences of the findings presented in Supplementary Table S2 remain uncertain. Although some studies include validation efforts, it is still unclear whether noncoding mutations in specific genes are canonical drivers, which highlights a limitation of AI models.

It is important to emphasize that all models discussed are highly cited, with an average of 1,538 total citations (ranging from 184 to 3,256) and 228 citations per year (ranging from 46 to 562). These substantial citation rates reflect their considerable impact and practical utility in the field of genetic research and cancer studies. Supplementary Table S3 provides the citation metrics for each model as of March 4, 2025.

General overview of model evolution

To provide a general overview of the key features of the nine AI models, we created a radar chart highlighting the discussed features (see Supplementary Fig. S1). The criteria for scoring each model feature are detailed in Supplementary Table S4.

We observe that older models (from 2015 to 2018) consistently score lower in model complexity compared with more recent ones (2018–2024). This difference is easily explained, as earlier models were developed with simpler architectures and fewer parameters, reflecting the computational and algorithmic limitations of their time. In contrast, newer models benefit from advances in computational power and deep learning techniques, enabling larger architectures that can capture intricate biological patterns. A similar trend is seen in data integration and cell type specificity, in which newer models excel due to their ability to incorporate diverse datasets and target specific cell types.

However, usability in practical research does not seem to be correlated with the year of publication. This is likely because, regardless of technological advancements, the practical application of these models heavily depends on factors such as user-friendly interfaces, comprehensive documentation, and community support, which may not always align with the model’s complexity or novelty.

Notably, none of the models achieved a score of 5 in data integration, interpretability, or usability in practical research. This suggests that the refinement of these models in the near future could lead to significant improvements in these areas.
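The observations in this overview can be checked on any feature-score table with a few lines of code. The sketch below uses placeholder models and scores (`model_A` to `model_D` and all numbers are hypothetical and are not the values in Supplementary Table S4), assumes a 1-to-5 scale, and uses a hand-written Spearman rank correlation to avoid external dependencies.

```python
from statistics import mean

MAX_SCORE = 5  # assumption: features are scored on a 1-to-5 scale
FEATURES = ["model_complexity", "data_integration", "cell_type_specificity",
            "interpretability", "usability"]

# Placeholder scores for illustration only; not taken from the article.
models = [
    {"name": "model_A", "year": 2015, "model_complexity": 2, "data_integration": 2,
     "cell_type_specificity": 1, "interpretability": 2, "usability": 4},
    {"name": "model_B", "year": 2018, "model_complexity": 3, "data_integration": 3,
     "cell_type_specificity": 3, "interpretability": 3, "usability": 2},
    {"name": "model_C", "year": 2021, "model_complexity": 5, "data_integration": 4,
     "cell_type_specificity": 4, "interpretability": 4, "usability": 3},
    {"name": "model_D", "year": 2024, "model_complexity": 5, "data_integration": 4,
     "cell_type_specificity": 5, "interpretability": 4, "usability": 3},
]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


years = [m["year"] for m in models]
never_max = [f for f in FEATURES if max(m[f] for m in models) < MAX_SCORE]
trend = {f: spearman(years, [m[f] for m in models]) for f in FEATURES}

print("Features where no model reaches the top score:", never_max)
for feature, rho in trend.items():
    print(f"{feature}: Spearman rho vs. publication year = {rho:.2f}")
```

With real scores, a rank correlation near zero for usability and a clearly positive one for model complexity would correspond to the pattern described above; with only a handful of models, such coefficients should be read as descriptive rather than as statistical evidence.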

This review article represents, to the best of our knowledge, the first analysis of AI models applied to the study of the noncoding genome in cancer research. This investigation is significant due to AI’s potential to identify noncoding mutations, reveal the functional impact of risk variants, clarify their mechanisms, and deepen our understanding of their role in oncogenesis.

In the first part of this review, we explored how the identification of noncoding mutations is becoming a crucial element in cancer biology. These mutations, scattered across regulatory elements, disrupt the intricate network of gene regulation essential for maintaining cellular homeostasis. Mutations in promoters often activate oncogenes or inactivate tumor suppressors, giving cancer cells a growth advantage. Enhancer mutations can amplify oncogenic signals or disrupt cellular differentiation pathways. Similarly, hypermutations in SEs can aberrantly activate gene networks critical for cancer cell identity and tumorigenesis. Mutations in silencers can either enhance or reduce gene repression, leading to inappropriate activation or suppression. Additionally, hundreds of noncoding variants found in cancer GWAS with unknown mechanisms emphasize the need for a comprehensive approach that integrates genomic and computational methods to decode the complex roles of these mutations.

The second section of this review discusses how AI model evolution has enhanced our ability to analyze noncoding mutations in cancer research over the past decade. The progression from early models such as DeepSEA and DeepBind (2015), which used CNNs to predict variant impacts, to more sophisticated architectures such as Basset and Expecto demonstrates significant improvement in model complexity and data integration. Models such as DeltaSVM (2015) and DeepC (2020) refined predictions by incorporating chromatin accessibility and high-resolution contact matrices, respectively. Transformer-based models, notably Enformer (2021) and GET (2024), represent a paradigm shift by offering unprecedented precision in capturing long-range dependencies and complex interactions in gene regulatory networks. These models, along with CADD and Expecto, highlight the power of integrating diverse biological data to improve prediction accuracy. The focus on cell type specificity and interpretability in these newer models not only improves our understanding of how noncoding mutations disrupt gene regulation but also aligns with the need for context-aware insights in cancer research.
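As a rough illustration of how a CNN-style sequence model can score a variant, the sketch below one-hot encodes a reference and an alternate sequence, scans both with a single convolutional filter followed by ReLU and max pooling, and reports the difference. The filter weights, the sequence, and the variant are all hypothetical; published models learn many filters from data and stack several layers, so this is a simplified sketch of the principle rather than any model's implementation.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv_max(seq, kernel):
    """Slide one filter along the sequence, apply ReLU, and max-pool the activations."""
    encoded = one_hot(seq)
    width = len(kernel)
    activations = []
    for start in range(len(seq) - width + 1):
        act = sum(encoded[start + k][c] * kernel[k][c]
                  for k in range(width) for c in range(len(BASES)))
        activations.append(max(act, 0.0))  # ReLU
    return max(activations)


def variant_effect(ref_seq, pos, alt_base, kernel):
    """Score a single-nucleotide variant as alternate-minus-reference filter output."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return conv_max(alt_seq, kernel) - conv_max(ref_seq, kernel)


# Hand-set filter that rewards the motif GGAA (+1 per match, -1 per mismatch).
# A trained model would learn such weights; these values are an assumption.
KERNEL = [
    [-1.0, -1.0, 1.0, -1.0],  # G
    [-1.0, -1.0, 1.0, -1.0],  # G
    [1.0, -1.0, -1.0, -1.0],  # A
    [1.0, -1.0, -1.0, -1.0],  # A
]

REF = "CCCGGAGGCC"  # made-up reference sequence, not a real locus

motif_creating = variant_effect(REF, 6, "A", KERNEL)  # G>A completes the motif
neutral = variant_effect(REF, 0, "T", KERNEL)         # change outside the motif
print("motif-creating variant:", motif_creating)
print("neutral variant:", neutral)
```

A positive score indicates that the alternate allele strengthens the best filter match, and a score of zero indicates no change. Multi-task models apply the same reference-versus-alternate comparison to each predicted track, for example chromatin accessibility or expression in a given cell type.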

Although the effectiveness of AI models will depend on the specific dataset and research goals, this review aims to provide a comprehensive guide to the current state-of-the-art models. By summarizing the applications, major strengths, and limitations, we hope to support researchers in selecting the most suitable tools for studying noncoding mutations in cancer.

Conclusions

In conclusion, advances in AI have demonstrated its potential for illuminating the role of noncoding mutations in cancer, unveiling complex regulatory patterns and unknown mechanisms, as evidenced by more than 11,000 scientific articles citing these models. To unlock further insights into the relationships between genomic variation and phenotype, a systematic catalog of genome function is essential. The Impact of Genomic Variation on Function Consortium aims to integrate single-cell mapping, genomic perturbations, and predictive modeling to elucidate how both coding and noncoding variants influence gene expression and phenotypic diversity across various cellular contexts (16). This approach highlights the importance of predictive models that can generalize across diverse situations, ultimately enhancing our understanding of genomic variation and its implications for human health.

R. Rabadan reports grants from the NIH/NCI and Stand Up To Cancer during the conduct of the study as well as nonfinancial support from GenoTwin and Flahy and personal fees from Diatech Pharmacogenetics outside the submitted work and has a provisional patent with application numbers 63/486,855 and PCT/US2024/017064 pending. No disclosures were reported by the other authors.

We gratefully acknowledge funding from the NIH (R35 CA253126 to R. Rabadan and M.d.M. Alvarez-Torres, P01 CA174653 to R. Rabadan, R01 HL159377 to R. Rabadan, and U01 CA243073 to R. Rabadan) and SU2C Convergence 3.14 to R. Rabadan.

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

1. The global challenge of cancer. Nat Cancer 2020;1:1–2.
2. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2024;74:229–63.
3. Hosny A, Aerts HJWL. Artificial intelligence for global health. Science 2019;366:955–6.
4. AI will transform science — now researchers must tame it. Nature 2023;621:658.
5. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med 2019;11:70.
6. Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023;24:125–37.
7. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144:646–74.
8. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature 2009;458:719–24.
9. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921.
10. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, et al; ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
11. Ohno S. So much “junk” DNA in our genome. Brookhaven Symp Biol 1972;23:366–70.
12. Blaxter M. Genetics. Revealing the dark matter of the genome. Science 2010;330:1758–9.
13. Huang FW, Hodis E, Xu MJ, Kryukov GV, Chin L, Garraway LA. Highly recurrent TERT promoter mutations in human melanoma. Science 2013;339:957–9.
14. Horn S, Figl A, Rachakonda PS, Fischer C, Sucker A, Gast A, et al. TERT promoter mutations in familial and sporadic melanoma. Science 2013;339:959–61.
15. Fu X, Mo S, Buendia A, Laurent A, Shao A, Alvarez-Torres MDM, et al. A foundation model of transcription across human cell types. Nature 2025;637:965–73.
16. IGVF Consortium. Deciphering the impact of genomic variation on function. Nature 2024;633:47–57.
17. Martínez-Jiménez F, Muiños F, Sentís I, Deu-Pons J, Reyes-Salazar I, Arnedo-Pac C, et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 2020;20:555–72.
18. Zhang X, Meyerson M. Illuminating the noncoding genome in cancer. Nat Cancer 2020;1:864–72.
19. Elkon R, Agami R. Characterization of noncoding regulatory DNA in the human genome. Nat Biotechnol 2017;35:732–46.
20. Elliott K, Larsson E. Non-coding driver mutations in human cancer. Nat Rev Cancer 2021;21:500–9.
21. Scacheri CA, Scacheri PC. Mutations in the noncoding genome. Curr Opin Pediatr 2015;27:659–64.
22. Gillani R, Collins RL, Crowdis J, Garza A, Jones JK, Walker M, et al. Rare germline structural variants increase risk for pediatric solid tumors. Science 2025;387:eadq0071.
23. Aaltonen LA, Abascal F, Abeshouse A, Aburatani H, Adams DJ, Agrawal N, et al; ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 2020;578:82–93.
24. Levine M, Tjian R. Transcription regulation and animal diversity. Nature 2003;424:147–51.
25. Venters BJ, Pugh BF. How eukaryotic genes are transcribed. Crit Rev Biochem Mol Biol 2009;44:117–41.
26. Perera D, Poulos RC, Shah A, Beck D, Pimanda JE, Wong JWH. Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 2016;532:259–63.
27. Vinagre J, Almeida A, Pópulo H, Batista R, Lyra J, Pinto V, et al. Frequency of TERT promoter mutations in human cancers. Nat Commun 2013;4:2185.
28. Yuan X, Larsson C, Xu D. Mechanisms underlying the activation of TERT transcription and telomerase activity in human cancer: old actors and new players. Oncogene 2019;38:6172–83.
29. Bell RJA, Rube HT, Xavier-Magalhães A, Costa BM, Mancini A, Song JS, et al. Understanding TERT promoter mutations: a common path to immortality. Physiol Behav 2017;176:139–48.
30. Shay JW, Wright WE. Telomeres and telomerase: three decades of progress. Nat Rev Genet 2019;20:299–309.
31. da Silva EM, Selenica P, Vahdatinia M, Pareja F, Da Cruz Paula A, Ferrando L, et al. TERT promoter hotspot mutations and gene amplification in metaplastic breast cancer. NPJ Breast Cancer 2021;7:43.
32. Boscolo-Rizzo P, Giunco S, Rampazzo E, Brutti M, Spinato G, Menegaldo A, et al. TERT promoter hotspot mutations and their relationship with TERT levels and telomere erosion in patients with head and neck squamous cell carcinoma. J Cancer Res Clin Oncol 2020;146:381–9.
33. Panebianco F, Nikitski AV, Nikiforova MN, Nikiforov YE. Spectrum of TERT promoter mutations and mechanisms of activation in thyroid cancer. Cancer Med 2019;8:5831–9.
34. Bell RJA, Rube HT, Kreig A, Mancini A, Fouse SD, Nagarajan RP, et al. Cancer. The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer. Science 2015;348:1036–9.
35. Olympios N, Gilard V, Marguet F, Clatot F, Di Fiore F, Fontanilles M. TERT promoter alterations in glioblastoma: a systematic review. Cancers (Basel) 2021;13:1147.
36. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
37. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:213–9.
38. Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 2009;25:2283–5.
39. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164.
40. Hankey W, Frankel WL, Groden J. Functions of the APC tumor suppressor protein dependent and independent of canonical WNT signaling: implications for therapeutic targeting. Cancer Metastasis Rev 2018;37:159–72.
41. Rohlin A, Engwall Y, Fritzell K, Göransson K, Bergsten A, Einbeigi Z, et al. Inactivation of promoter 1B of APC causes partial gene silencing: evidence for a significant role of the promoter in regulation and causative of familial adenomatous polyposis. Oncogene 2011;30:4977–89.
42. Sur I, Taipale J. The role of enhancers in cancer. Nat Rev Cancer 2016;16:483–93.
43. Zaugg JB, Sahlén P, Andersson R, Alberich-Jorda M, de Laat W, Deplancke B, et al. Current challenges in understanding the role of enhancers in disease. Nat Struct Mol Biol 2022;29:1148–58.
44. Yanchus C, Drucker KL, Kollmeyer TM, Tsai R, Winick-Ng W, Liang M, et al. A noncoding single-nucleotide polymorphism at 8q24 drives IDH1-mutant glioma formation. Science 2022;378:68–78.
45. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.
46. Herranz D, Ambesi-Impiombato A, Palomero T, Schnell SA, Belver L, Wendorff AA, et al. A NOTCH1-driven MYC enhancer promotes T cell development, transformation and acute lymphoblastic leukemia. Nat Med 2014;20:1130–7.
47. Tang F, Yang Z, Tan Y, Li Y. Super-enhancer function and its application in cancer targeted therapy. NPJ Precis Oncol 2020;4:2.
48. Bacabac M, Xu W. Oncogenic super-enhancers in cancer: mechanisms and therapeutic targets. Cancer Metastasis Rev 2023;42:471–80.
49. Yoshino S, Suzuki HI. The molecular understanding of super-enhancer dysregulation in cancer. Nagoya J Med Sci 2022;84:216–29.
50. Bal E, Kumar R, Hadigol M, Holmes AB, Hilton LK, Loh JW, et al. Super-enhancer hypermutation alters oncogene expression in B cell lymphoma. Nature 2022;607:808–15.
51. Roschewski M, Staudt LM, Wilson WH. Diffuse large B-cell lymphoma - treatment approaches in the molecular era. Nat Rev Clin Oncol 2014;11:12–23.
52. Pang B, Snyder MP. Systematic identification of silencers in human cells. Nat Genet 2020;52:254–63.
53. Meyer KB, O’Reilly M, Michailidou K, Carlebur S, Edwards SL, French JD, et al. Fine-scale mapping of the FGFR2 breast cancer risk locus: putative functional variants differentially bind FOXA1 and E2F1. Am J Hum Genet 2013;93:1046–60.
54. Campbell TM, Castro MAA, De Santiago I, Fletcher MNC, Halim S, Prathalingam R, et al. FGFR2 risk SNPs confer breast cancer risk by augmenting oestrogen responsiveness. Carcinogenesis 2016;37:741–50.
55. Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet 2010;42:504–7.
56. Zheng W, Long J, Gao Y-T, Li C, Zheng Y, Xiang Y-B, et al. Genome-wide association study identifies a new breast cancer susceptibility locus at 6q25.1. Nat Genet 2009;41:324–8.
57. Dunning AM, Michailidou K, Kuchenbaecker KB, Thompson D, French JD, Beesley J, et al. Breast cancer risk variants at 6q25 display different phenotype associations and regulate ESR1, RMND1 and CCDC170. Nat Genet 2016;48:374–86.
58. Pang B, van Weerd JH, Hamoen FL, Snyder MP. Identification of non-coding silencer elements and their regulation of gene expression. Nat Rev Mol Cell Biol 2023;24:383–95.
59. Dietlein F, Wang AB, Fagre C, Tang A, Besselink NJM, Cuppen E, et al. Genome-wide analysis of somatic noncoding mutation patterns in cancer. Science 2022;376:eabg5601.
60. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12:931–4.
61. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8.
62. Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 2015;47:955–61.
63. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26:990–9.
64. Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;47:D886–94.
65. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 2018;50:1171–9.
66. Schwessinger R, Gosden M, Downes D, Brown RC, Oudelaar AM, Telenius J, et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 2020;17:1118–24.
67. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196–203.
68. Vilhekar RS, Rawekar A. Artificial intelligence in genetics. Cureus 2024;16:e52035.
69. Karollus A, Mauermeier T, Gagneur J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol 2023;24:56.
70. Huang C, Shuai RW, Baokar P, Chung R, Rastogi R, Kathail P, et al. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat Genet 2023;55:2056–9.
71. Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights 2020;14:1177932219899051.
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.
