Abstract
G-quadruplexes (G4) are noncanonical secondary genome structures. Aberrant formation of G4s can impair genome integrity. Investigation of the relationship between G4s and somatic structural variants (SV) in cancers could provide a better understanding of the role of G4 formation in cancer development and progression. In this study, we combined bioinformatic approaches and multiomics data to investigate the connection between G4s and the somatic SVs. Somatic SV breakpoints were significantly enriched in G4 regions, regardless of SV subtypes. This enrichment was only observed in regions demonstrated to form G4s in cells (“active quadruplexes”), rather than in regions with a sequence compatible with G4 formation but without confirmed G4 formation (“potential quadruplexes”). Several genomic features affected the connection between G4s and SVs, with the enrichment being notably strengthened at the boundary of topologically associated domains. Somatic breakpoints were also preferentially associated with G4 regions with earlier replication timing and open chromatin status. In patients with cancer with homologous recombination repair defects, G4s and somatic breakpoints were substantially more strongly associated. Machine learning models were constructed that showed that G4 propensity is a potent feature for predicting the density of SV breakpoints. Altogether, these findings suggest that the G4 structures play a critical role in modulating the production of somatic SVs in cancers.
G-quadruplex structure formation constitutes a critical step in the production of somatic structural variants in cancers, suggesting G-quadruplex structures as potential targets for future cancer prevention and treatment strategies.
Introduction
Genome instability is one of the hallmarks of cancers (1), and it usually involves abnormal chromosome organization of large segments, namely structural variants (SV; refs. 2, 3). SVs commonly have an impact on genomes, being more damaging and potentially more catastrophic than single-base variants or small insertions and deletions (4). Somatic SVs are implicated in a variety of cancer-related gene alterations such as gene truncation or fusion (5, 6), gene amplification (7, 8), and altered epigenetic regulation (e.g., enhancer hijacking events; refs. 9, 10). These events may trigger malfunctioning expression of oncogenes or tumor suppressors, leading to cancer development and progression.
Currently, numerous studies have been conducted to explore the causes of somatic SVs in cancers. These variants can be caused by various factors, such as defects in homologous recombination (HR) repair, translocation of mobile elements, microhomology-mediated break-induced replication, etc. (11). However, most studies have focused on the linear sequences rather than secondary structures, and therefore, there is a dearth of systematic studies exploring the impact of secondary structures on SVs. A recent study revealed that DNA secondary structures increase genome replication stress and thus plausibly result in SVs (12–14).
G-quadruplexes (G4; Fig. 1A) are one of the most studied non-canonical secondary structures that typically form in G-rich regions (15, 16). G4s have been shown to be critical in regulating diverse biological processes such as gene transcription (17–20), alternative splicing (21, 22), protein translation (23–25), and the organization of higher-order chromatin structures (26, 27). Notably, G4s play a specific regulatory role in genome replication. On the one hand, G4s serve as essential elements for replication origins, which is crucial for genome replication initialization (28, 29), whereas they can shape the unique landscape of replication timing (RT) due to their ability to recruit certain proteins (e.g., Rif1; ref. 30). On the other hand, as secondary structures that can impede the elongation of DNA polymerases, their inappropriate presence can stall the replication forks, thus increasing replication stress and triggering DNA double-strand breaks (DSB), particularly when HR repair function is defective (31–35).
The rapid development of sequencing technologies and the reduction of sequencing cost have provided us with the opportunity to investigate SVs from a more refined and complete outlook. The Pan-Cancer Analyses of Whole Genomes (PCAWG) is one of the major international pan-cancer studies that aims to carry out thorough analyses of genome-wide mutational patterns in pan-cancer samples by means of a unified framework and stringent quality control (36). In addition, thanks to improvements in G4-sequencing technology (37–39), we can now gain access to accurate maps of in vivo G4 formation for further studies instead of relying on potential G4 motif data with low structural folding confidence. Prior studies suggest that G4 motifs were associated with deletions and translocations (40, 41), whereas another report shows that G4s could control the landscape of somatic copy-number alterations (42). However, it is still unclear whether it is this primary sequence compatible with G4 formation or the structure itself that affects the formation of SVs. In addition, because accurately detecting of all types of SV in cancer samples is still challenging, there has not yet been a systematic evaluation of the relationship between G4 and various types of SVs in a large and high-quality cohort.
In this study, we explored the relationship between G4s and somatic SVs in cancers by integrating PCAWG data and G4 probing technology data. We investigated the distribution pattern of SV breakpoints around active G4s (aG4; see definition below), focusing on the association between the two at the boundary of topologically associated domains (TAD). We also found that genomic properties exhibited a strong effect on shaping the breakpoint-related aG4s. Furthermore, we established a link between aG4s and DNA DSBs and evaluated the relationship with SVs in the presence of HR deficiency. Finally, we used machine-learning methods to assess the contribution of aG4s to the prediction of SV breakpoints in cancers, thereby providing a comprehensive overview of the relationship between G4s and somatic SVs in cancers.
Materials and Methods
aG4 dataset
In this study, we established and used the concept of aG4s for analyses, by merging in vivo G4 formation data from a variety of cancer cell lines. These G4s are the most likely to form quadruplex structures under the intracellular environment of cancer cells, as compared with G4 predicted only or evidenced in vitro but not in cells (see below).
The aG4 dataset construction steps are as follows (see Supplementary Fig. S1A):
(i) First, we downloaded the in vivo G4 coordinate files for K562, HepG2, A549, H1975, and 93T449 cell lines from the Gene Expression Omnibus (GEO) database or publicly available literature (see details in Supplementary Table S1) that were derived from G4 chromatin immunoprecipitation (ChIP)-seq and G4P-seq technologies.
(ii) Second, due to the larger amount of the in vivo G4 peaks generated by the G4P-seq technique, we retained the peaks with an enrichment score greater than 5 and applied Irreproducible Discovery Rate (version 2.0.4.2) software (under the default parameters) to obtain the highly reproducible peaks and to ensure the reliability in vivo G4 peaks.
(iii) Third, we used bedtools (version 2.30.0) to merge all the in vivo G4 peaks from the above cancer cell lines to obtain the final dataset of aG4s (with a peak count around 24,000, Supplementary Fig. S1B). Particularly, in this study only 22 autosomal data were retained for subsequent analyses.
The flowchart steps for aG4 dataset construction can be found in Supplementary Fig. S1A.
Other G4 datasets
In this study, we also used in vitro G4s as well as G4 sequences identified by computational methods for comparative analyses.
The in vitro G4s, also known as observed G4s (oG4), correspond to secondary structures that were captured by G4-seq technique under the specific in vitro ionic environments (19, 43). The oG4s were directly acquired from GSE63874 following the treatment of PDS and K+ (Supplementary Fig. S1B).
The putative G4s (pG4) are the sequences that match the G4 base composition patterns from the computational standpoint, with or without concomitant evidence they can form secondary structures in vitro or in vivo. Therefore, the pG4 dataset is larger than aG4 and oG4. The pG4 dataset was derived from GSE133379 (38), which includes a variety of putative quadruplex sequence types, including canonical and noncanonical motif sequences (Supplementary Fig. S1B).
Somatic breakpoint dataset for cancer SVs
Somatic SV breakpoints were retrieved from the International Cancer Genome Consortium (ICGC) database. To obtain the somatic breakpoint data with high accuracy and quality, the PCAWG project data were used. We set donors as baseline and removed apparently duplicated SVs. As with the aG4 dataset, only the somatic breakpoints that were coming from 22 autosomes were retained. As a result, a total of 570,460 SV breakpoints from 2,508 donors were used for subsequent analyses. These somatic breakpoints were derived from 47 projects of 20 primary sites. There are four main subtypes of SV, which are deletion, duplication, inversion, and translocation. To ensure the reliability of the statistical analyses, when conducting cancer type–specific analyses, only cancer types with a sample size of more than 40 and a total number of breakpoints of more than 10,000 were considered for further analyses.
Classification of samples according to HR pathway mutations
We obtained the consensus SNV (single-nucleotide variant) and INDEL (Insertions and Deletions) annotation data for the pan-cancer samples from the PCAWG project and downloaded the HR pathway–related genes of Kyoto Encyclopedia of Genes and Genomes (KEGG) from the Gene Set Enrichment Analysis website (https://www.gsea-msigdb.org/gsea/index.jsp, login required). If a sample was detected with the following types of mutations in HR-related genes: Splice_Site, Nonsense_Mutation, Frame_Shift_Del, Frame_Shift_Ins, then we considered that the HR pathway contained loss-of-function mutations (LoF), and its function was potentially affected. If no HR pathway–related mutations, including LoF mutations as well as other types, were detected in the samples it would be regarded as a non-mut sample.
Relative somatic breakpoint density calculation
Shuffled genomic regions
In this study, we widely used shuffled regions to compare with aG4s. For the selection of shuffled regions, we used the shuffle function of the bedtoolsr package (version 2.30.0). During the shuffle process, we kept the total number of shuffled regions the same as the original entries, as well as the length of shuffled regions.
To ensure the reproducibility of the results, we fairly selected random seeds. The random seeds were only related to the number of random tasks. That is, if we only need to do one shuffle job, the random seed would be set to 1; if we shuffled the regions 1,000 times, then the random seeds would be set from 1 to 1,000.
TAD dataset, binning, and boundary definition
We obtained TAD coordinate files with 10kb resolution based on the Directionality Index (DI) calling method from the TADKB database (44), which were derived from several cell lines, including K562, GM12878, etc. We integrated those TAD coordinate files into a general file for subsequent analyses. Because the TADs are of different sizes and span large distances, to split the bins as unbiased as possible, we referred to the method of Krefting and colleagues (45) by extending the TAD upstream of the start position and downstream of the end position by 50% of the TAD size each and dividing the TAD together with the extended region into 20 equal bins, where bins 1 to 5 were the upstream regions of TAD, bins 6 to 15 were the TAD regions, and bins 16 to 20 were the downstream regions of TAD.
Herein, to further define the TAD boundary, we extended the start and end coordinates of the TAD upstream and downstream to a specific length N, respectively, as suggested by McArthur and colleagues (46). The extended regions were then used as TAD boundaries. For example, given a TAD (chr1, n1, n2), then both the (chr1, n1-N, n1) and (chr1, n2, n2+N) would be considered as TAD boundaries. In this study, N was assigned as 50 kb.
Relative fold enrichment across TAD bins
Through transformation, we were able to observe the fold enrichment of the remaining bins relative to the bin with the lowest density value. In this application scenario, we tend not to focus on the absolute magnitude of the density, but on the relative magnitude and trend of the density.
Signal calculation for genomic properties
The RT data transformed via the wavelet-smoothed method were obtained from the GEO database under the accession number GSE34399. Other genomic property signals were derived from the roadmap database, and the imputed data were downloaded and used for our subsequent analyses. For the matching of data to primary sites, we drew on the method of Li and colleagues (47). The relevant matching information can be found in Supplementary Table S2. We used several open-source tools, including bigWigMerge and bedGraphToBigWig, to average and merge data originating from multiple cell types.
We applied the bigWigAverageOverBed program to compute the signal values of genomic properties in specific regions, such as the windows of 3 kb in length centered on aG4s. Then, the signal values of these specific regions will be used for comparison. For example, comparing the signal value of H3K24ac for the breakpoint-related aG4 regions with other (non-breakpoint-related) aG4 regions.
Annotate with RT states
DSB data for K562
We retrieved the K562 DSB data derived from sBLISS technology [treated with dimethylsulfoxide (DMSO) and etoposide (ETO), respectively] with the accession number of GSE121740. Both the DMSO- and ETO-treated samples were used in this study. The distribution pattern of DSB signals was computed and visualized by SeqPlots.
Z-score calculation
Effect size calculation for G4 versus shuffled regions in LoF and non-mut groups
We designed the following experiments to unbiasedly compare the relationship between aG4s and breakpoints in the LoF group versus the non-mut group. Effect size was measured by using the R package of rstatix (version 0.7.0).
Step1: Calculate the proportion of breakpoints associated with aG4s (e.g., within 1.5 kb from the center of aG4s) for each sample in the LoF group.
Step2: Randomize the shuffled genomic regions with the same length and the identical number as aG4s and calculate the proportion of somatic breakpoints associated with the shuffled regions in each sample. The randomization was performed 1,000 times, and the effect values could be estimated for each randomization to characterize the magnitude of difference between aG4s and shuffled regions.
Step3: Repeat the above operation in the non-mutated samples as well.
By comparing the difference in the number of somatic breakpoints around aG4s and shuffled regions in the LoF and non-mut groups, we can further infer whether the occurrence of LoF in HR-related genes enhances the association of aG4s with somatic breakpoints.
Calculation of COSMIC SBS3 signals for each sample
We obtained the SNV mutation annotation data from the PCAWG project and used the SigProfiler tool (https://cancer.sanger.ac.uk/signatures/tools) to extract the mutational signatures and decompose COSMIC (Catalogue of Somatic Mutations in Cancer) signatures (49) from the samples based on the SNV mutation data. The relative contribution of SBS3 was defined as the proportion assigned to SBS3 in each sample.
Regression modeling of somatic breakpoint density
We trained the regression models using the R package of randomForest (version 4.7), gbm (version 2.1.8), and e1071 (version 1.7) with default parameter settings, respectively. The human genome was divided into 0.5 Mb nonoverlapping windows, and after removing the Gap regions as well as the blacklisted regions, we finally got 5,029 windows. For different primary sites, we computed and normalized (z-score) the signal values of all features (genomic properties) on these windows. In addition, we added the G4 ChIP-seq signal value of aG4s as features, which were obtained by merging the G4 bigwig signals from K562, A549, 93T449, and H1975 cell lines.
In the light of the method from Cheloshkina and colleagues (50), the somatic breakpoint density of a genomic region was defined as the number of somatic breakpoints in that region divided by the number of all somatic breakpoints in the chromosome on which the genomic region was located and finally multiplied by 1,000.
We randomly split all genomic windows into 70% and 30% fractions, where 70% of the data were used for model training, whereas the remaining 30% was used for model testing.
We evaluated the importance of individual features in regression models for different primary sites using the permutation-based approach, with importance being reflected by mean squared error (MSE). The main idea of this method is as follows, if the prediction error of a model is greatly increased after the values of a certain feature were randomly disrupted, then that feature is of great value for the model. Moreover, the accumulated local effect (ALE) method, which can partially isolate the effect of other features, was taken to explore how G4 affects the prediction of the model. We used the iml package (version 0.11.0) for MSE and ALE analyses.
Data availability
The in vivo G4 data were obtained from the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under the following accession number GSE145090, GSE133379, and GSE145543. The in vitro and pG4 data were downloaded from GEO under the accession number GSE63874 and GSE133379, respectively. The somatic SVs, the SNVs, and the insertions and deletions were publicly retrieved from the PCAWG project (https://dcc.icgc.org/releases/PCAWG/) in the ICGC database. The DSB data for the K562 cell line can be accessed through the accession number GSE121740. The RT datasets were obtained through the accession number GSE34399, whereas other genome property datasets were downloaded from the NIH Roadmap Epigenomics data portal (https://egg2.wustl.edu/roadmap/web_portal/). The TAD annotation file was downloaded from the TADKB database (http://dna.cs.miami.edu/TADKB/). The RNA polymerase and transcription factor peak files were retrieved from the ChIP-Atlas database (https://chip-atlas.org/). The cancer ATAC-seq peak data used in this study were acquired from The Cancer Genome Atlas (TCGA) portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG). Independent validation SV datasets were collected and downloaded through the following links: https://data.mendeley.com/datasets/2mn4ctdpxp/1, https://data.mendeley.com/datasets/6gtrrxrn2c/1, and https://dcc.icgc.org/releases/current/Projects/BRCA-FR.
Results
Breakpoints of somatic SVs in cancers are enriched in the vicinity of aG4s
To comprehensively evaluate the relationship between G4s and somatic SVs in cancer primary sites, we collected in vivo G4 data from previously published high-throughput experiments in multiple cancer cell lines, including K562, MCF7, and others (See Materials and Methods). These in vivo G4 sites will then be merged as long as they were observed in any cell lines and will be used as secondary structure sites that can be formed in the cancer cells for our subsequent analyses. Herein, we termed these merged in vivo G4s as aG4s to indicate their ability to form secondary structures under the intracellular environment of cancer cells (Fig. 1A).
The first question was whether there was an enrichment of SV breakpoints around G4s. We were motivated by the consideration that if G4 structures are tightly related to cancer SV breakpoints, significant enrichment of breakpoints would be observed in their vicinity. To test this, we obtained the somatic breakpoints identified by the PCAWG project, which were derived from a unified detection framework with high quality and accuracy. The somatic SVs included four subtypes, namely deletion (DEL), duplication (DUP), inversion (INV), and translocation (TRA), and include a total of 570,460 somatic breakpoints from 20 primary sites (Supplementary Fig. S1C).
The distribution of somatic breakpoints within 50 kb of aG4s was first evaluated. The results showed a higher density of breakpoints in the vicinity of aG4s relative to the distal flanking sequences (Fig. 1B). We did not observe this unique distribution around random sites after one hundred shuffled experiments (Supplementary Fig. S2A). Furthermore, we investigated whether this phenomenon was only observed for a specific breakpoint type. The distribution of the breakpoints for different SV subtypes centered on aG4s was calculated. We found that such distinctive enrichment persisted across four subtypes of SVs, with translocations being particularly pronounced (Fig. 1B). In addition, the enrichment level of breakpoints around aG4s (±1.5 kb) was compared with the enrichment found in shuffled regions (for 1,000 shuffled rounds, Fig. 1C). Overall, breakpoints of different cancer tissues were enriched in aG4 regions above shuffled regions (Fig. 1C), and this observation was true for all SV subtypes (Supplementary Fig. S3). Because aG4 corresponds to a secondary structure with a high probability of being structured, especially in the physiological state of cancer cells, we investigated the distribution pattern of oG4s and pG4s peripheral breakpoints, which have a lower capacity to form secondary structures in vivo. In contrast with aG4s, only weak enrichment signals were observed in the vicinity of oG4s and pG4s (Supplementary Fig. S2B and S2C), especially for the breakpoints of duplication, inversion, and deletion, where no significant enrichment peaks could be found (Supplementary Fig. S2D and S2E). Given the weakened distribution pattern, we wondered whether the accumulation of breakpoints is structure-specific rather than sequence-specific; that is, is the association of G4s with somatic breakpoints related to their ability to form structures? We further classified the pG4s into two groups according to whether they were supported or not by in vivo experiments. Interestingly, a significant peak was only observed in the pG4 group supported by in vivo experiments rather than in the others (Fig. 1D). This divergence persisted when we investigated the distinct SV subtypes separately (Fig. 1D), implying that the G4 structures rather than the sequences, may influence the formation of somatic SVs.
We compared the somatic breakpoint diversity within windows of different lengths, centered on aG4s to shuffled regions. We noticed an obvious diversity of breakpoints around aG4s (Supplementary Fig. S2F), but as the window length increased, this effect diminished and became less different from shuffled regions, as shown by the error lines (Supplementary Fig. S2F). In addition, we divided the human genome into nonoverlapping 50 kb windows and compared the coverage of aG4s between the windows containing somatic breakpoints and the others. Considering the preferential distribution of aG4s, some specific genomic regions were also encompassed. As a result, the breakpoint containing windows displayed higher aG4 coverage values (Fig. 1E; Wilcoxon one-side test, P = 2.81 × 10−12, P = 3.45 × 10−130, P = 4.77 × 10−25, P = 6.95 × 10−127 for DEL, DUP, INV, and TRA, respectively), whereas the coverage differences of other genomic regions were not as significant as aG4s (Fig. 1E; Wilcoxon one-side test, promoter: P = 0.351, P = 5.74 × 10−60, P = 3.96 × 10−11, P = 9.56 × 10−66 for DEL, DUP, INV, and TRA, respectively; exon: P = 1.57 × 10−10, P = 5.89 × 10−123, P = 4.33 × 10−18, P = 2.21 × 10−92 for DEL, DUP, INV, and TRA, respectively; intron: P = 2.39 × 10−32, P = 2.26 × 10−22, P = 0.219, P = 7.66 × 10−3 for DEL, DUP, INV, and TRA, respectively; UTR: P = 0.779, P = 2.26 × 10−73, P = 4.93 × 10−11, P = 3.01 × 10−70 for DEL, DUP, INV, and TRA, respectively). This was also corroborated in the model when logistic regression was used to determine the presence of breakpoints in genomic windows, with aG4 having the lowest P value (Fig. 1F).
The above results suggest that G4 structures, especially those with higher structure formation ability in cancer cells, are closely related to the somatic SV breakpoints in real-world cancer samples.
The aG4-breakpoint association is strengthened at TAD boundaries
Our previous studies found that G4s are enriched at TAD boundaries and enhance their insulation ability (26). It has also been shown that SVs at TAD boundaries can potentially lead to enhancer hijacking (9) and disrupt the expression of important genes. On the basis of this, we then proceeded to analyze the association pattern of aG4s with cancer somatic SV breakpoints at TAD boundaries. We obtained TADs from the TADKB database (44) for multiple cell lines at 10 kb resolution. We first extended all TADs up and downstream by 50% on each side, and then divided the TADs along with the extended regions into 20 bins equally, namely, bin 1–5 are TAD upstream regions, whereas bin 6–15 and bin 16–20 are inner and downstream of TAD regions, respectively (see Materials and Methods; ref. 45). We found that both aG4s and somatic breakpoints were enriched in the up or downstream regions of TAD boundaries (Fig. 2A and B). In contrast, aG4s and somatic breakpoints were depleted in the inner TADs (Fig. 2A and B). We did not observe the above pattern for shuffled regions (Fig. 2A and B). In addition, it is noteworthy that the relative enrichment fold of oG4s and pG4s at the TAD borders were weaker than that of aG4s (Supplementary Fig. S4A and S4B). When we consider the subtypes of SV, we found that the relative enrichment of deletions was higher within TADs than in boundary regions, which exhibits a unique pattern in contrast with the other three SV subtypes (Supplementary Fig. S4C–S4F).
Although aG4s and somatic breakpoints showed similar distribution patterns inside and flanking of TADs, we cannot assert that aG4s and somatic breakpoints occur in these regions in a synergistic manner. Thus, we further evaluated the correlation between aG4s and somatic breakpoints in the bins mentioned above. Overall, aG4s positively correlated with somatic breakpoints (Fig. 2C; Supplementary Fig. S4G). Moreover, we found that the correlation was higher in the TAD borders than inside the TADs, especially in the borders adjacent to the TADs, implying that the connection between aG4s and somatic breakpoints was strengthened in the boundary regions. The correlation was also observed in the analyses of oG4s and pG4s but attenuated (Supplementary Fig. S4H and S4I). Although the distribution pattern of deletion breakpoints was opposite to that of aG4s, a similar pattern was presented when we observed their correlations (Supplementary Fig. S4J), which is consistent with the other three SV subtypes (Supplementary Fig. S4K-S4M), as well as when regarding the cancer tissues (Supplementary Fig. S5), suggesting that the correlation between aG4s and somatic breakpoints was independent of the distribution of breakpoints themselves.
To further characterize the relationship between aG4s and somatic breakpoints at TAD boundaries, we referred to the method of McArthur and colleagues (46) and defined the TAD boundary as a region of specific length upstream from the TAD start coordinate and downstream from the end coordinate. In this study, the length of TAD boundary was defined as 50 kb. We classified the boundaries into four categories, Q1 through Q4, based on the number of somatic breakpoints at TAD boundaries. We also randomized shuffled regions of the same size from TADs and classified them into four categories as well, with the number of somatic breakpoints increasing sequentially from Q1 to Q4. First, we found that higher levels of breakpoint numbers were always associated with higher aG4 coverage, and the aG4 coverage at TAD boundaries was always higher than that within TADs (Fig. 2D), which is consistent with the findings mentioned above. Consistent observations were observed when the SV subtypes were taken into account (Supplementary Fig. S6A–S6D). We additionally determined the effect sizes of aG4 coverage between TAD boundaries and shuffled regions within TADs. As a result, Q3 and Q4 displayed higher intra-group discrepancy (Fig. 2E), indicating that the accumulation of somatic breakpoints often correlated with a stronger occupation of aG4s. Analogous results were also observed in various subtypes of SV, especially in deletions (Supplementary Fig. S6E–S6H). Moreover, the inter-group effect sizes of aG4 coverage at TAD boundaries were greater than within TADs (Supplementary Fig. S6I–S6M).
Finally, we investigated whether aG4-related deletions (the breakpoint of deletions was within 1.5 kb of the aG4 center) prefer to directly destroy the entire TAD boundaries, which usually results in a devastating blow to the TAD, leading to the creation of new fusion TADs and consequent abnormalities in gene regulation. In comparison with 1,000 randomized experiments, we found that a higher proportion of aG4-related deletions spanned the entire TAD boundaries, implying that aG4-related deletions were more likely to directly delete the entire TAD boundaries (Fig. 2F; P < 0.001). This is also the case when we define the aG4-related deletions as those whose breakpoint was located within 0.5 kb or 2.5 kb distance from aG4 centers (Supplementary Fig. S6N and S6O).
Breakpoint-related G4s exhibit unique genomic properties
We noted that not all aG4s were connected to somatic breakpoints, so we wondered what kind of G4s were more likely to be associated with somatic breakpoints. Because sequences affected by SVs commonly showed unique genomic properties, we next focused on exploring the effect of genomic properties on the connection between aG4s and somatic breakpoints.
We stratified aG4s into breakpoint-related aG4s and non-breakpoint–related aG4s based on whether they were in proximity (±1.5 kb) to somatic breakpoints. We assessed the signal values for each type of genomic property within a window of 3 kb centered on the two types of aG4s (breakpoint-related aG4s and non–breakpoint-related aG4s) and examined the distinctions in genomic signals. First, several genomic properties associated with active chromatin such as RT, DNase-seq, RNA-seq, etc., showed significantly stronger signal values in breakpoint-related aG4 regions (Fig. 3A). In contrast, genomic signals associated with inactive chromatin features, such as H3K27me3, exhibited weak signal values in the breakpoint-related aG4 regions (Fig. 3A). Genomic signal discrepancies were evident in some primary sites, including blood, breast, and ovary, which implied that the magnitude of genomic properties on breakpoint-related aG4s might be primary-site specific (Fig. 3A). To exclude the effect of window size, we additionally defined aG4-centered windows of lengths 1 kb and 5 kb and performed the analyses in the same way, and the results were similar (Supplementary Fig. S7A–S7B). We also performed the same analyses for distinct types of somatic breakpoints separately (Supplementary Fig. S7C–S7F). Although different somatic breakpoints possessed diverse genomic properties, when taking aG4s into account, we found that a more open chromatin state is a prerequisite feature of the regions where these somatic breakpoints were located compared with the aG4 regions that do not contain breakpoints, which represents a distinctive characteristic of aG4-related somatic breakpoints (Supplementary Fig. S7C–S7F).
Because we classified the aG4 regions based on the presence or absence of somatic breakpoints, the differences between the two types of aG4 regions may merely denote the property divergences between breakpoints and aG4s without breakpoints. To explore this issue, we then estimated the Wasserstein distance between the signal value distributions of the two types of aG4 regions and the distribution of all breakpoint-signal values separately, and then determined the ratio of the above distances (See Supplementary Methods). Surprisingly, among vast primary sites and genomic properties, the signal distributions of non-breakpoint-related aG4 regions were closer to all breakpoint signal values (Supplementary Fig. S8) even when we considered the distinct subtype of SV (Supplementary Figs. S9–S12), suggesting that the divergence between above two types of aG4 regions cannot be simply explained by the unique properties of somatic breakpoints.
We noticed that RT was the most significant genomic property, so we investigated the differences in replication “activity” between the two types of aG4s. We used RT states developed by Poulet and colleagues (48) to annotate the two types of aG4s and calculated the proportion of the two types of aG4s annotated to 15 different RT states. We subtracted the annotated proportions of the two aG4 types to obtain their preferences over RT states. We found that breakpoint-related aG4s were enriched in the Early regions, whereas other aG4s showed preference in Dynamic and Late regions (Fig. 3B). We found similar results when we considered the subtypes of SV separately (Supplementary Fig. S13A–S13D). In addition, we compared the number of replication origins on the genomic segments (±0.5 Mb) where the two types of aG4s were positioned. In general, the genomic segment on which breakpoint-related aG4s were positioned displayed more replication origins (Supplementary Fig. S14A). When somatic SV subtypes are considered, such statistical significances occur only in specific primary sites (Supplementary Fig. S14B–S14E).
Finally, we estimated the z-score values of breakpoint-related aG4s and non–breakpoint-related aG4s on different genomic properties via randomization experiments. Z-score can be used to describe the magnitude of which the data deviate from the mean value. The results showed that the breakpoint-related aG4s exhibited elevated z-score values among the genomic properties representing active chromatin state (Fig. 3C), whereas other aG4s were more correlated with negative z-score values (Fig. 3C). Among all genomic properties, we noted that RT has a significant influence on z-score differences (Fig. 3C), which is consistent with the above findings. The results were consistent when we conducted the same analyses for each somatic SV subtype individually (Supplementary Fig. S15A–S15D).
HR deficiency enhances the connection between aG4s and SV breakpoints in cancers
DNA DSBs are one of the major sources of SVs (51, 52). The above analyses suggested that G4 structures were tightly associated with breakpoints of somatic SVs in cancers, which prompts us to question how G4s interacted with DSBs from a genome-wide perspective. Considering the lack of DSB data originating from cancer primary sites, we thus surveyed data from the K562 cell line (GSE145090). Notably, we found a significant accumulation of DSB signals immediately adjacent to G4 peaks, whereas no accumulation was found in the shuffled regions (Fig. 4A; Supplementary Fig. S16A). The DSB signal values within G4 peaks were well correlated between the DMSO- and ETO-treated groups (Supplementary Fig. S16B). In the promoter region of NBPF1, for example, we observed the presence of a DSB peak and a G4 ChIP-seq peak, and the active state of this region was supported by DNase-seq, H3K27ac ChIP-seq, and CAGE technologies (Fig. 4B).
It has been shown that the treatment with a known G4 ligand, Pyridostatin (PDS) for the stabilization of G4 structures reduces the viability of cells lacking BRCA1/2 and RAD51 (33). Thus, we further investigated the relationship between the above repair-related factors and G4s with distinct DSB-signal values. We ranked the in vivo G4s according to the intensity of DSB signals on them. We chose top 25% and bottom 25% of G4s and visualized the occupancy patterns of RAD51 and BRCA1 centered on them. We found that BRCA1 and RAD51 showed enhanced occupancy patterns around the G4s with higher DSB signal values (Supplementary Fig. S16C–S16F), indicating that the above repair-related factors were specifically responsive to G4-related DSBs. Our previous work showed that G4s were enriched at the TAD boundaries in the K562 cell line (26), whereas the above analyses suggests that the association of G4s with SV breakpoints was highlighted at TAD boundaries rather than within TADs. Thus, we set out to analyze the distribution pattern of DSB signals at TAD boundaries and within TADs. Similar to the enrichment pattern of G4s, we also found the enrichment of DSBs at the boundaries, whereas no dominant DSB enrichment was found inside TADs (Fig. 4C; Supplementary Fig. S16G). Moreover, we evaluated the intensity difference of DSB signal values between the G4s at TAD boundaries (50 kb) and other G4s. Stronger DSB signals emerged in G4s harbored at TAD boundaries than in other G4s (Supplementary Fig. S16H). The above findings imply that G4s have the potential to cause genomic stability, particularly at TAD boundaries (Fig. 4D).
Next, we evaluated the relationship between aG4s and somatic SVs in samples with HR deficiencies. HR deficiency can potentially lead to the failure of DSB repair caused by aG4s. We first studied whether there were any LoF mutations that occurred in the HR pathway of the cancer samples. We obtained HR-related gene set from KEGG pathway and divided the samples into LoF and non-mut groups according to the somatic mutation annotation of these genes from the PCAWG project (See Materials and Methods).
We calculated the proportion of samples containing aG4-related breakpoints across two groups, respectively. In line with the previous section, we defined aG4-related breakpoints as those within 1.5 kb away from the center of aG4s. As a result, the LoF group contained significantly more samples with aG4-related breakpoints than the non-mut group (Fig. 4E; χ2 test, P = 2.52 × 10−5). Similar results were observed when we defined the aG4-related breakpoints as 0.5 kb away from the aG4 center and a larger range of 2.5 kb (Supplementary Fig. S17A and S17B), as well as when we considered the SV subtypes separately (Supplementary Fig. S17C–S17F). To rule out the possibility that the difference was simply due to the presence of LoF mutations in the samples themselves, therefore, we estimated the magnitude of divergence in the number of the somatic breakpoints that were around aG4s and shuffled regions in each of the two groups (See Materials and Methods). As expected, we found stronger effect sizes in the LoF group than in the non-mut group (Fig. 4F; Wilcoxon one-side test, P = 2.33 × 10−276), suggesting that the relationship between aG4s and somatic breakpoints was enhanced when some HR-related genes were functionally inactivated. The discrepancy phenomenon persists when the aG4-related breakpoints were defined as being within 0.5 kb and 2.5 kb of the aG4 center (Supplementary Fig. S17G and S17H), and it is independent of SV subtypes as well (Supplementary Fig. S17I–S17L).
In addition, we calculated the relative contribution percentage of SBS3 across the COSMIC mutational signatures (49) for each sample (see Materials and Methods). SBS3 is one of the Single-Base Substitutions signatures whose presence represents a defect in DNA damage repair caused by abnormal HR in patients with cancer. We divided the samples into patients with and without SBS3 signals. We calculated the proportion of somatic breakpoints associated with aG4s (3kb window) in each sample, as well as the proportion of somatic breakpoints associated with shuffled regions in each sample (through 1,000 randomized experiments). As a result, the proportion of aG4-related breakpoints tended to rank higher as compared with shuffled regions in the samples where the SBS3 signals were detected (Fig. 4G; Wilcoxon one-side test, P = 1.94 × 10−6), suggesting that higher levels of somatic breakpoint emerged in the vicinity of aG4s in these samples. Moreover, when we defined the aG4-related breakpoints as 1kb or 5kb centered on aG4 regions, the differences remained significant (Supplementary Fig. S17M and S17N), as is the case when we took the subtype of SV into account (Supplementary Fig. S17O–S17R).
G4 improves the prediction of SV breakpoints
To evaluate the contribution of aG4s in the prediction of the somatic SV breakpoint densities, we built several machine-learning regression models, including gradient boosting machine (GBM), random forest (RF), and support vector machine (SVM). We divided the human genome into nonoverlapping windows of 0.5 Mb in length and calculated the signal values of the genomic properties on them. We split all windows into two parts, with 70% used for training and the remaining 30% for testing. We applied the trained regression models to the testing datasets and the independent validation datasets (See Supplementary Methods). Overall, all three machine-learning models exhibited similar MSEs in the training, testing, and independent validation datasets for a specific primary site, and no obvious overfitting was found (Supplementary Fig. S18A). All three models are more accurate in some primary sites such as breast and ovarian cancers but perform poorly in other primary sites, including uterus and bone cancers (Supplementary Fig. S18A). To further confirm whether there is dataset preference in these models, we also calculated the Wasserstein distances between the distributions of actual and predicted values for the training, testing, and independent validation datasets derived from different models in different cancer types, respectively. As a result, no significant Wasserstein distance differences were found when we observe a certain cancer type, which is consistent with the results obtained using MSE, showing that these models were not subject to dataset biases (Supplementary Fig. S18B).
Next, we measured the feature importance by using the permutation-based approach. We determined the MSE variation caused by the reshuffling of feature values in each primary site for all three models. We ranked the importance of all features in each primary site based on the relative magnitude of the MSE values. We found that DNase was a very stable feature for predicting breakpoint density relative to our regression models, and it was ranking first among all three machine-learning models (Fig. 5A–C). The importance of aG4s in GBM and RF models was second only to DNase (Fig. 5A and B), whereas in SVM regression model, aG4 was not as dominant as in the other two models, but still in the upper-middle range (Fig. 5C). We also trained the models for the independent validation set using the same methods and then calculated the importance of each feature. The importance ranking for aG4 was almost in line with the PCAWG-based model (Supplementary Fig. S18C). The above results suggest that aG4 is an important feature for predicting the density of SV breakpoints.
In addition to evaluating the importance of features, we further analyze how aG4 affects the prediction outcome of the model via the ALE method. Intriguingly, we found that the effect of aG4 on the prediction values could be decomposed into two parts (Fig. 5D). First, when the aG4 signal value within the genomic window was very low, it showed a negative effect on the prediction of somatic breakpoints, whereas the positive effect emerged as the aG4 signal value was enhanced (Fig. 5D). Considering that aG4 is a local feature, we believe that only when there are enough aG4s in a specific genome region, can it have a greater role in shaping the somatic breakpoint landscape of that region. In brief, the above results emphasized that aG4s play a critical role in predicting somatic breakpoints.
Discussion
By integrating genomic data from the PCAWG cohort with experimentally derived high-quality G4s data from multiple cancer cell lines, we have, to the best of our knowledge, conducted the first comprehensive and unbiased study of the relationship between aG4s and somatic SVs (i.e., translocations and inversions) in 2,508 patients with cancer with 20 primary sites. In addition, we revealed that this association could be influenced by multiple different epigenomic statuses and “BRCAness” mutations.
Because of the unique structural properties of G4s, these structures have been widely investigated in cell line models over the past years and have been recognized as one of the causes of genome instability (31, 32, 53, 54). Some, but not all, G4s share special motif patterns (e.g., GxN1–7GxN1–7GxN1–7Gx, where x≥3, and N can be any base). Recent studies have investigated the connections between G4 motifs and SVs, including copy-number alterations, translocations, and deletions based on the cancer genomic variants databases such as COSMIC (40–42). It was unclear, however, whether the presence of SVs was linked either to G4 structure or to the linear motif sequences. Our findings showed that the structures of the G4, rather than the motif sequence, contribute to the generation of SVs in cancers. In addition, the strict quality-controlled calling pipeline for SV detection from the PCAWG project further increased the reliability of our findings, and this panel of cancer types allowed us to check for possible differences between them. In particular, we revealed a shaping effect of genomic properties on breakpoint-related aG4 exhibited cancer-type specificity.
As was reported by other studies that G4 structures can cause genome instability in a replication-dependent manner (31, 55, 56), our results support the notion that the genome fragile sites induced by G4 structures may eventually evolve into SVs in the patients with cancer on a genome-wide scale. Notably, a previous study has confirmed the presence of G4 structures in cancer cells using BG4 antibody–staining techniques (57), and the authors speculated that these G4s may be associated with genomic instability. In our study, the construction and the use of the aG4 dataset enabled us to further answer and support the hypothesis that the G4s formed in cancer cells contributed to the structural variations.
Intriguingly, our study further demonstrated that the above connection between G4 and SVs could be modified by multiple factors, including chromatin structures (i.e., TAD boundaries), genomic and epigenomic signatures (i.e., RT), and “BRCAness” mutations. The TAD boundaries, regardless of the somatic breakpoint subtypes, strengthened the links between G4 and SVs. Our recent work, as well as a subsequent investigation, has shown that G4 structures were critical for the formation and maintenance of high-dimensional chromatin structures, including DNA looping and TADs (26, 27). Because the maintenance of higher order chromatin structures was important to the cellular epigenetic programs (46, 58–61), and together with the fact the boundary-related G4 structures interact frequently with proteins (26), we hypothesized that the presence of G4 structures at the boundary regions may be more persistent, which in turn may result in more opportunities for G4s to influence the generation of SVs. Besides, the aforementioned findings suggest that G4 may impact cancer SVs in a noncoding-like manner, for example, by affecting the TAD boundaries that may dysregulate the expression of genes, although a previous study illustrated that G4s were preferentially localized to cancer-associated gene regions (43).
We also found that the association between aG4s and SV breakpoints is dependent on the epigenetic features near the aG4s. Those aG4s that are harbored in the genomic regions exhibiting more open properties are more likely to be linked to somatic breakpoints, as suggested by the use of cancer ATAC-seq data from the TCGA database (62). Breakpoint-related aG4s tend to exhibit higher open signals than other aG4s and the regions where other breakpoints are located (Supplementary Fig. S19A). The open chromatin provides a more favorable opportunity for the folding of G4 structures (39), and thus destabilizes the genome. In addition, we observed stronger transcription signals around breakpoint-related aG4s as well after analyzing the high confidence peak of RNA polymerase and transcription factors provided by the ChIP-Atlas database (Supplementary Fig. S19B–S19C). This is also evident in the chromatin state annotations, where breakpoint-related aG4s were observed to be more concentrated in regions with higher transcriptional activity than other aG4s (Supplementary Fig. S20). Among all genomic properties we investigated, RT is a significant property that greatly influences the connection between aG4s and SV breakpoints. The genomic segments where the breakpoint-related aG4s are located contain more replication origins. We propose that the presence of aG4s in the genomic regions of earlier RT is more likely to trigger SVs because cancer cells are usually required to undergo rapid replication that is low-fidelity and lacks robust error correction mechanisms (63, 64). RT correlated well with chromatin organizations, especially with compartment A/B, and the boundary of RT domains showed a near one-to-one correspondence with the boundary of TADs (65, 66). Thus, earlier RT represents a more open and active chromatin state, which provides favorable and necessary conditions for the formation of G4 structures. Indeed, RT was also a key regulator for maintaining the epigenetic state of cells (67). A previous study revealed that the chromatin environment can shape the landscape of G4s (68); however, we found that the chromatin environment was not only related to the formation of G4 structures themselves, but also to the relationship between G4s and SVs. In other words, the ability of G4s to induce somatic breakpoints may be controlled by epigenetic signatures. However, the causal relationship still warrants further investigation.
Inappropriately processed DSBs are one of the major causes of SVs (11, 51, 69). Our study indicates that DSBs were enriched in the vicinity of G4 structures and were correlated well with G4 structure signals. So, the failure of repairing G4-related DSBs may lead to somatic SVs. We did observe that the association of aG4s with somatic breakpoints was significantly stronger in patients with cancer with HR deficiencies. This allowed us to propose that the HR deficiencies, especially the “BRCAness,” can enhance the aG4’s ability to induce the formation of somatic breakpoints. Specifically, under normal circumstances, genomic fragility triggered by G4s can be rescued promptly; however, in HR-deficient cancer cells, these unstable regions can only be repaired by other approaches such as non-HR, which often leads to SVs (70). Designing specific G4 ligands to stabilize its structure and thereby destabilize the cancer genome or even lead to the death of cancer cells is a promising approach to cancer therapy and has significant translational potential, especially for patients with "BRCAness."
Our machine-learning models showed that aG4 was a very useful feature for the prediction of somatic breakpoint density, ranking high among numerous genomic features, second only to chromatin accessibility, which, combined with the findings above, suggests that G4s may shape the somatic breakpoint landscape of structural variation in cancers. A recent pan-cancer modeling effort for somatic SVs suggests that non-B DNA structures, particularly G4s, were the major feature used to predict breakpoint hotspot regions (50). Similarly, we observed that the contribution of G4 in predicting the density of somatic breakpoint was also significant. Although heterogeneity varies across cancers, G4 appears to be a very stable predictor in most tissues. Because G4 tends to be located in functionally important regions (71, 72), the breakpoints related to G4s may undergo positive selection and thus are more likely to be associated with SVs genome-wide and, in addition, are more preferentially to be breakpoint hotspot regions. Therefore, we strongly suggest here that G4s should be considered as a key feature when designing SV prediction tools. Notably, the prediction effects of G4s on machine-learning models are primary-site specific. For most of the primary sites, the ALE of the three models increased with the increase of G4 ChIP-seq signal values, showing a positive linear correlation, especially in blood, breast, ovary, lung, and other primary sites. However, in individual primary sites such as skin, uterus, etc., we found that the influences of G4s were negatively correlated, which implies that the SV landscape in those primary sites may be dominated by some other factors. Thus, the contribution of G4s to the landscape of SV was dependent on primary sites. In addition, the models probably be partially influenced by the original G4 dataset because we did not obtain tissue-specific aG4s, and therefore, using the aG4 data derived from multiple cell lines can only represent some general aG4s rather than tissue type–specific aG4s. Even so, the importance ranking of aG4s among regression models was stable when we used other independent datasets for validation (Supplementary Fig. S18C). More importantly, we found that the relationship between aG4s and SV breakpoints was not affected by dataset biases. For example, by analyzing breakpoint data from independent validation datasets, we also found that aG4s were more breakpoint-rich than shuffled regions (Supplementary Fig. S21A), and this enrichment depended on whether the G4 sequences could form secondary structures (Supplementary Fig. S21B) and was enhanced in the vicinity of TAD boundaries (Supplementary Fig. S21C). Similarly, differential genomic properties are shown (Supplementary Fig. S21D–S21E). This suggests that the association of aG4s with SV breakpoints is related to their intrinsic nature rather than to the breakpoint datasets.
In conclusion, our study illustrates that the aG4s are vital modulators of somatic SV breakpoints in cancers from a genome-wide perspective via bioinformatic methods, which can provide supporting information for the screening of G4-based anticancer drug targets.
Authors' Disclosures
J.-L. Mergny reports grants from Institut National du Cancer, Agence Nationale de la Recherche (ANR), and Inserm during the conduct of the study. No disclosures were reported by the other authors.
Authors' Contributions
R. Zhang: Conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, writing–original draft, writing–review and editing. H. Shu: Writing–review and editing. Y. Wang: Writing–review and editing. T. Tao: Writing–review and editing. J. Tu: Funding acquisition, writing–review and editing. C. Wang: Supervision, writing–review and editing. J.-L. Mergny: Supervision, writing–review and editing. X. Sun: Conceptualization, supervision, funding acquisition, writing–review and editing.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61972084), the Natural Science Foundation of Jiangsu Province (no. BK20211513), the Postgraduate Research and Practice Innovation Program of Jiangsu Province (No. KYCX21_0143), China Scholarship Council (202106090125 to R. Zhang), as well as the INCa (PL-BIO 2020–117) and ANR G4Access (ANR-20-CE12–0023) grants.
The publication costs of this article were defrayed in part by the payment of publication fees. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).