Clonal hematopoiesis (CH) is a phenomenon of clonal expansion of hematopoietic stem cells driven by somatic mutations affecting certain genes. Recently, CH has been linked to the development of hematologic malignancies, cardiovascular diseases, and other conditions. Although the most frequently mutated CH driver genes have been identified, a systematic landscape of the mutations capable of initiating this phenomenon is still lacking. In this study, we trained machine learning models for 12 of the most recurrent CH genes to identify their driver mutations. These models outperform expert-curated rules based on prior knowledge of the function of these genes. Moreover, their application to identify CH driver mutations across almost half a million donors of the UK Biobank reproduces known associations between CH driver mutations and age, and the prevalence of several diseases and conditions. We thus propose that these models support the accurate identification of CH across healthy individuals.

Significance: We developed and validated gene-specific machine learning models to identify CH driver mutations, showing their advantage with respect to expert-curated rules. These models can support the identification and clinical interpretation of CH mutations in newly sequenced individuals.

See related commentary by Arends and Jaiswal, p. 1581

In healthy hematopoiesis, a pool of hematopoietic stem cells (HSC) contributes to all blood-related lineages. During aging, this process frequently gives rise to clonal hematopoiesis (CH), a state in which one stem cell–derived population occupies a large fraction of the blood cells and platelets (1–6). This phenomenon of clonal expansion is driven by somatic mutations acquired by the HSCs at some point during life and is highly prevalent in elderly populations (1–3, 5, 7–13). These mutations, which affect an array of genes collectively named CH drivers, confer a growth advantage on the HSCs bearing them with respect to their neighbors and are thus under positive selection in hematopoiesis. In the past decade, a wide array of studies has linked the presence of CH to an increased risk of development of hematologic malignancies, cardiovascular diseases (CVD), all-cause mortality, and more recently, solid tumors and infectious diseases (2, 7, 14–19, 20).

Although intensive research in recent years has led to the identification of some 60 CH driver genes (1, 12, 13, 21), we have only a fragmented understanding of which of their mutations can drive clonal expansions. Across large cohorts of blood donors from the general population (sequenced at ∼40×), such as the UK Biobank (UKB; ref. 22) or TOPMed (23), the ability to detect true CH driver mutations is hindered by the contamination with germline variants and sequencing artifacts (24). However, uncovering the repertoire of CH driver mutations is key to understanding the molecular mechanisms underpinning CH, accurately identifying individuals with this condition (powering epidemiologic studies of its relationship with more serious health conditions), and monitoring the potential impact of CH driver mutations on their health.

Faced with this reality, several research groups and health-related international institutions have summarized the knowledge accumulated on several CH genes in a series of expert-curated rules to select the mutations most likely to drive CH (25–27). These are normally applied together with stringent filters to variants identified in the blood samples of healthy individuals to identify CH cases (24). Despite their practical usefulness, these expert-curated rules have a series of caveats. They cannot be learned, or systematically updated, directly from information on CH mutations (rather, this information must first be sedimented into shared knowledge), and their coverage of genes is heterogeneous because the amount of knowledge available differs between genes and even between domains of one gene.

We reasoned that these hurdles could be overcome by applying a machine learning–based approach that produces explainable models, trained on high-quality available CH mutations, as we recently demonstrated for cancer driver mutations (28). As in the case of cancer, these models would not suffer from any biases from sedimented knowledge, could reveal complex patterns in CH mutations that have not been apparent so far, and would be easy to scale as more datasets of CH mutations become available. We have thus now repurposed this approach to build explainable models of 12 CH driver genes (collectively referred to as boostDM-CH) using bona fide CH mutations identified across known CH driver genes (12) and synthetic sets of neutral hematopoiesis mutations. When tested on CH mutations not included in the training, the models show a performance that is in general terms above that of expert-curated rules. The models also perform on par with deep saturation and experimental base-editing approaches. When applied to mutations identified in the blood of 470,000 donors from the general population (UKB), the mutations identified by boostDM-CH models as CH drivers show a very significant association with age and the development of hematologic malignancies and other diseases, whereas mutations identified as CH nondrivers show no meaningful associations.

BoostDM-CH Models Accurately Identify CH Driver Mutations

Training machine learning models aimed at distinguishing CH driver mutations requires a high-quality dataset of blood somatic mutations identified across individuals. We have shown before that repurposing tumor-blood sequencing data from large cancer cohorts using reverse mutation calling eliminates germline contamination and provides a high-quality set of blood somatic mutations (Supplementary Note; ref. 12). Using these blood somatic mutations across more than 33,000 patients with cancer from three cancer genomics cohorts, we identified 64 CH driver genes (12) through the detection of signals of positive selection in their pattern of mutations (Fig. 1A; ref. 29).

Figure 1.

Building and evaluating boostDM-CH models. A, Blood somatic mutations used to train boostDM-CH models were identified across three cancer genomics cohorts through reverse calling (blood vs. tumor sample); then, CH driver genes were identified through signals of positive selection of their mutational pattern using the IntOGen pipeline; finally, models to identify CH driver mutations were built. B, Training and cross-validation of machine learning–based boostDM-CH models, exemplified through DNMT3A. Mutations with a score of the model equal to or greater than 0.5 are deemed CH drivers. C, Explanation of the classification of the DNMT3A-R882H mutation as a driver based on the contribution of the features used in training the model. The numbers in the radar plot correspond to the SHAP values of each feature. Features with a positive SHAP value (i.e., positive contribution to the classification of a mutation as driver) appear above the “0” line in the radar plot, and the main contributing features are shown in bold. D, Performance (F50 median ± IQR) of the cross-validation of 25 CH models as a function of their number of observed mutations. Blue dots represent the genes with models having median F50 above 0.8 and sufficient discovery index to deem the set of mutations across cancer genomics cohorts representative of their CH driver mutations (high-quality models). E, Precision–recall curves (with the value of the area under the curve indicated) of the 12 high-quality boostDM-CH models. F, Performance (F50 median ± IQR) of the classification of DNMT3A blood somatic mutations of the DNMT3A boostDM-CH model and a DNMT3A experimental base editing assay. G, Performance (F50 median ± IQR) in the classification of blood somatic mutations of boostDM-CH models and three sets of expert-curated rules. Left, overall performance in three CH datasets; right, gene-specific performance in one of the datasets. 
NMD, nonsense-mediated decay; PTM, posttranslational modifications; sgRNA, single-guide RNA; WES, whole-exome sequencing; WGS, whole-genome sequencing; WHO, World Health Organization; MSK-IMPACT, cohort of targeted sequencing samples recruited and analyzed by the Memorial Sloan Kettering Cancer Center.

To accurately identify the CH driver mutations, we built gene-specific machine learning models that capture combinations of features that define CH driver mutations in each gene. These features include, for example, the significant clustering of mutations in specific regions of the linear sequence or the three-dimensional folded structure of the protein, the enrichment of mutations in certain domains, the consequence type of each mutation, and some additional features such as the conservation of the residue and posttranslational amino acid modifications (28).

The most important and challenging step in training these models is to start with a good-quality set of driver/positive and nondriver/negative blood mutations. Using a set of known CH driver mutations as the positive set would produce models that reproduce our current knowledge biases. To solve this problem, we reasoned, as we had done previously in the case of cancer, that somatic mutations observed in human blood across individuals, an unbiased set highly enriched for driver mutations of each CH gene, constitute the best positive training set. The ideal negative set contains mutations that could have occurred through neutral mutagenesis in HSCs. As the negative training set, we thus generated synthetic mutations by simulating neutral mutagenesis in HSCs, following the probabilities of trinucleotide changes observed across blood samples (28). Although these sets are highly enriched in driver and nondriver mutations, respectively, they are imperfect, as the group of observed CH mutations may contain nondrivers, whereas the synthetic set of mutations may contain drivers. These imperfections need to be taken into account in the strategy used to train the models (“Methods” and Supplementary Methods).
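The construction of the negative set can be illustrated as weighted sampling of candidate substitutions according to their trinucleotide context. The following is a minimal sketch only, not the authors' pipeline: the function name, the toy sequence, and the probability table are hypothetical, and it ignores strand and pyrimidine-centered context conventions that a real mutational profile would use.

```python
import random

def sample_neutral_mutations(seq, context_probs, n, seed=0):
    """Draw n synthetic SNVs from seq, weighted by trinucleotide-change probability.

    context_probs maps (trinucleotide, alt_base) -> relative probability.
    The values used below are hypothetical; a real profile would be
    estimated from mutations observed across blood samples.
    """
    rng = random.Random(seed)
    candidates, weights = [], []
    # Enumerate every possible substitution at each interior position.
    for pos in range(1, len(seq) - 1):
        tri = seq[pos - 1:pos + 2]
        for alt in "ACGT":
            if alt == seq[pos]:
                continue
            w = context_probs.get((tri, alt), 0.0)
            if w > 0:
                candidates.append((pos, seq[pos], alt))
                weights.append(w)
    # Sampling with replacement, so a synthetic mutation may recur.
    return rng.choices(candidates, weights=weights, k=n)

# Toy inputs (assumptions for illustration only).
toy_seq = "ACGTACGA"
toy_probs = {("ACG", "T"): 0.7, ("CGT", "A"): 0.2, ("TAC", "G"): 0.1}
synthetic = sample_neutral_mutations(toy_seq, toy_probs, n=5)
```

Each returned tuple is (position, reference base, alternate base). Repeating the draw with different seeds would yield several negative sets of the same size as the positive set.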

For example, to build a model for DNMT3A, we used as the positive set the 2,650 blood somatic mutations identified across all patients with cancer from the three discovery cohorts via reverse calling and created 50 negative sets of 2,650 synthetic mutations each (Fig. 1B; “Methods” and Supplementary Methods). We then trained a robust model able to classify any mutation in the gene as CH driver or nondriver and to provide an explanation of the features used for this classification (boostDM-CH). For example, the R882H mutation in DNMT3A (Fig. 1C) is predicted as a CH driver primarily due to a combination of three salient features: its location within a linear (30) or three-dimensional (31) cluster and its conservation across vertebrate species (32).

We followed the same strategy for the 25 CH driver genes with a sufficient number of mutations to carry out the training (“Methods”). The boostDM-CH models of 12 CH driver genes (ASXL1, CHEK2, DNMT3A, GNAS, IDH2, MDM4, PPM1D, SF3B1, SRSF2, TET2, TP53, and U2AF1) had good performance (F50 above 0.80) and were deemed sufficiently representative of all their potential driver mutations (high discovery index, see “Methods” and Supplementary Methods; genes in blue in Fig. 1D; Supplementary Fig. S1). As a trend, genes with more observed CH mutations (or higher discovery index) yielded models with higher F50 (Fig. 1D; Supplementary Fig. S2A). Nevertheless, some CH driver genes with a relatively low number of mutations found in the three discovery cohorts, with most of them concentrated within significant clusters (e.g., U2AF1 and IDH2), also yielded high-quality models. The area under the cross-validation precision–recall curves (AUC) for these models ranged between 0.86 and 0.99 (Fig. 1E; Supplementary Fig. S2B).
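As a rough illustration of the quality criterion, the sketch below computes an F-score from confusion counts. It assumes that F50 denotes the F-beta score with β = 0.5 (precision weighted more heavily than recall), which this section does not spell out; the function name and the example counts are hypothetical.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from confusion counts; beta < 1 favors precision over recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical cross-validation fold:
# 90 observed mutations called drivers, 10 synthetic mutations called drivers,
# 20 observed mutations missed.
score = f_beta(tp=90, fp=10, fn=20)
passes_quality_bar = score > 0.80  # threshold quoted in the text
```

Under this assumption, a model is penalized more for calling synthetic (presumed neutral) mutations drivers than for missing some observed mutations.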

To benchmark the performance of the boostDM-CH model of DNMT3A, the most recurrent CH driver gene, we compared it with a recently published experimental base editing assay that quantified the reporter methylation activity of mutants at several specific residues (Supplementary Methods; ref. 33). The DNMT3A model showed a better performance than this experimental assay in the separation between observed and neutral (synthetic) CH mutations (Fig. 1F; Supplementary Fig. S3A; Supplementary Note). Moreover, when applied to an independent set of CH mutations identified across the general population (Japanese Biobank; ref. 34), the DNMT3A boostDM-CH model also showed higher F50 than the experimental assay (Fig. 1F; Supplementary Fig. S3A). The TP53 boostDM-CH model exhibited a performance comparable to that attained by four experimental deep mutagenesis studies of this gene (35–38) on the task of identifying CH mutations observed across cohorts of patients with cancer and in the Japanese Biobank (Supplementary Fig. S3B).

We then compared the collective performance of boostDM-CH models with that of three sets of expert-curated rules [referred to as Niroula (25), Bick (26), and World Health Organization (27); Supplementary Table S1] on three different sets of CH mutations. We first evaluated their performance on the classification of cross-validation CH observed and synthetic neutral mutations across patients with cancer in 10 CH genes (MDM4 and CHEK2 are not covered by the three rule sets). Next, we compared the performance of boostDM-CH models and the expert-curated rules in the classification of a set of uncommon CH mutations observed in six genes (10), which were not included in the training nor in the cross-validation of models (“Methods”). Finally, we compared boostDM-CH and the three sets of expert-curated rules in the identification of CH mutations across a fully independent cohort, the Japanese Biobank. Interestingly, the performance of boostDM-CH models is systematically higher than that of the three sets of expert-curated rules, even when analyzing mutations not included in their training [Fig. 1G (left); Supplementary Fig. S3C]. This performance comparison reflects a variability across the 10 CH driver genes [Fig. 1G (right)]. These results demonstrate the potential of machine learning models trained from observed mutations—and thus unbiased by prior knowledge—to capture the genetic mechanisms underlying CH.

The Repertoire of CH Driver Mutations Reveals Mechanisms Underlying CH

We next used boostDM-CH models to classify all possible mutations in each CH driver gene (in silico saturation mutagenesis; Fig. 2A and B; Supplementary Fig. S4). Each mutation is classified on the basis of the boostDM-CH score provided by each model (“Methods”): mutations attaining a score equal to or above 0.5 are classified as CH drivers, whereas those with a score lower than 0.5 are deemed CH nondrivers. The boostDM-CH score can be used to further tier the mutations into high-confidence (score equal to or greater than 0.9) and other (between 0.5 and 0.9) drivers.
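The tiering rule described above can be written as a small function. This is a minimal sketch: the function name and the example scores are hypothetical, and only the 0.5 and 0.9 cutoffs come from the text.

```python
def tier_mutation(score, driver_cutoff=0.5, high_conf_cutoff=0.9):
    """Map a boostDM-CH score in [0, 1] to the tier described in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= high_conf_cutoff:
        return "high-confidence driver"
    if score >= driver_cutoff:
        return "other driver"
    return "nondriver"

# Hypothetical scores, including both boundary values.
tiers = [tier_mutation(s) for s in (0.95, 0.9, 0.62, 0.5, 0.49, 0.03)]
```

Both cutoffs are inclusive on the lower bound, so a score of exactly 0.5 is a driver and a score of exactly 0.9 is a high-confidence driver.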

Figure 2.

In silico saturation mutagenesis of CH genes. Blueprints of CH driver mutations of DNMT3A (A) and TET2 (B). The plots represent the distribution of driver (red) and nondriver (gray) mutations. From top to bottom, the first plot contains observed exonic mutations (i.e., excluding intronic splice mutations, which are included in the training) across cancer genomics cohorts classified as drivers or nondrivers by boostDM-CH, with the height of each needle denoting the recurrence of the mutation across the training cohorts. The plot below the observed mutations presents the boostDM-CH scores of all possible SNVs along the protein sequence (in silico saturation mutagenesis). Vertical ticks (in red) below this plot represent all positions of the protein containing CH drivers according to boostDM-CH. Vertical ticks immediately below (in gold) show either missense or nonsense drivers identified as high-confidence (boostDM-CH score ≥0.9) or other (boostDM-CH score ≥0.5 and <0.9). The values of mutational features used to train the models are shown linearly along the protein sequence in the plot at the bottom of the figure. The histograms at the right show the distribution of boostDM-CH scores of all mutations identified in the three cancer genomics discovery cohorts. The dashed lines indicate the thresholds at boostDM-CH scores of 0.5 and 0.9, which define high-confidence drivers and other drivers. The bars below show the number (and proportion) of nondrivers and drivers (high-confidence and other drivers) observed in these cohorts. The two bars represent the partition of all (top) or unique (i.e., considering each different mutation only once, bottom) observed mutations. C, Radar plots representing SHAP values (decomposition of the boostDM-CH score of mutations into the additive contribution of features; see Supplementary Methods) for several illustrative driver mutations. For each mutation, the main features contributing to the driver prediction are highlighted in bold.
NMD, nonsense-mediated decay; PTM, posttranslational modifications.

Missense driver mutations (and specifically high-confidence ones) identified in DNMT3A cluster at specific regions of the protein, such as the PWWP and DNA methylase domains (Fig. 2A). These clusters likely capture the interference with different aspects of the DNMT3A methylation activity, such as hindrance of the tetramerization (R882; ref. 5), or impairment of the recognition of histone modifications (mutations within the PWWP domain; refs. 33, 39). Driver nonsense mutations, expected to trigger nonsense-mediated decay, are distributed along its entire sequence. Similarly, in TET2, the identified missense CH driver mutations (some of which act through haploinsufficiency; ref. 40) seem enriched at the beginning and the end of the Tet-JBP domain and may interfere with the removal of methyl groups from CpGs (Fig. 2B; refs. 41, 42). A few missense mutations that occur in the N-terminal part of the protein are likely false positives of the model, but interestingly, they are not included in the high-confidence set of driver mutations (Supplementary Note). In contrast, driver nonsense mutations seem distributed along the entire protein sequence, and all of them are tiered as high-confidence (Fig. 2B).

The analysis of the distribution of CH driver mutations within the remaining 10 CH driver genes with boostDM-CH models reveals different landscapes (Supplementary Figs. S4 and S5). A group of CH drivers (such as ASXL1 and TP53) seems to act in a loss-of-function manner, showing a broad distribution of missense and nonsense CH driver mutations along long tracts of the protein sequence. This contrasts with another group of genes (SF3B1, PPM1D, U2AF1, and IDH2) known to act in a gain-of-function manner, in which CH driver mutations seem confined to specific regions of the protein and in extreme cases, to a single mutational hotspot. These clusters of CH driver mutations are related to underlying alterations in the biological function of these genes.

A detailed explanation of the contribution of different features to the classification of four illustrative CH driver mutations across the same number of genes is presented through the radar plots in Fig. 2C. For example, in the case of PPM1D, all nonsense, high-confidence CH driver mutations concentrate within the C-terminal portion of the protein. These mutations cause the escape from nonsense-mediated decay and give rise to a form of the protein lacking a degron sequence and thus with increased stability (43). This, in turn, causes altered phosphorylation of proteins involved in the response to DNA damage, such as TP53, providing mutant cells with an advantage when exposed to certain cytotoxic therapies (11, 43).

Whereas there is a high degree of overlap between mutations identified as drivers by boostDM-CH and by the expert-curated rule sets, clear differences are also apparent between them (Supplementary Fig. S6). There are also similarities and differences in the configuration of CH and cancer driver mutations in three of these CH driver genes for which we have also been able to build boostDM myeloid cancer models (Fig. 3A–C; Supplementary Fig. S7; ref. 28). One of the differences corresponds to a mutational hotspot of IDH2 observed in myeloid malignancies but absent in CH mutations (Fig. 3C). This hotspot is considered a driver CH mutation by the expert-curated rules, which are based in part on data from hematologic malignancies. However, its absence from the driver output of the IDH2 boostDM-CH model suggests it may be an example of incorrect definition of the rules (see section Newly Identified CH Driver Mutations). This is an example of the benefits provided by models trained from observed data over expert-curated rules established from sedimented knowledge.

Figure 3.

Comparison of boostDM-CH models and myeloid boostDM models in three genes. For each gene (A–C), the needle plots represent the distribution of driver (red) and nondriver (gray) observed mutations in CH (top) and myeloid cancer (bottom) cohorts. The internal plots represent the distribution of driver mutation along the sequence of each gene using in silico saturation mutagenesis by boostDM-CH (red) and boostDM-myeloid cancer (blue).

In summary, boostDM-CH models support the exploration of underlying CH mechanisms across genes. To facilitate this process, the results of the in silico saturation mutagenesis provided by boostDM-CH models of the 12 CH driver genes included in this section are available to the research community at www.intogen.org/ch/boostdm.

BoostDM-CH Models Identify CH Driver Mutations in a Large General Population Cohort

One of the main hurdles to exploiting large cohorts of donors (such as the UKB) for population-wide CH epidemiologic studies is the difficulty of accurately identifying blood somatic mutations (from relatively shallow sequencing) without a reference germline sample from the same donor (12, 24). Potential somatic mutations identified across these blood samples may contain a nonnegligible fraction of germline variants and sequencing artifacts, as well as passenger mutations. We reasoned that boostDM-CH models, trained on high-quality blood somatic mutations obtained from the reverse mutation calling across tumor cohorts, could be used to accurately identify CH driver mutations in this setting. We could then use these CH driver mutations to analyze the relationship between CH and several phenotypes across the population.

Thus, we next identified 201,857 potential somatic variants in the 12 genes across 467,202 individuals in the UKB (“Methods”). The boostDM-CH models classified 41,306 of them as CH driver mutations (28,508, or 69.0%, missense; 12,098, or 29.3%, nonsense; and 700, or 1.7%, splice site affecting; Fig. 4A and B; Supplementary Fig. S8A). CH driver mutations occurred in 8.2% (38,124) of the donors of the UKB cohort (92.5% of them bearing a single driver mutation). Differences in the trinucleotide mutational profiles of variants identified as CH drivers or nondrivers by boostDM-CH suggest that the latter contain a mixture of sequencing artifacts, germline variants, and somatic passenger mutations (Supplementary Fig. S8B). Most (72%) driver mutations were identified with high confidence (boostDM-CH score equal to or greater than 0.9), as were most (90%) nondriver mutations (boostDM-CH score below 0.1; Supplementary Fig. S9A and S9B).
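The counts quoted above are internally consistent, as the short check below shows. It uses only the figures stated in the text; the variable names are illustrative.

```python
# Counts of CH driver mutations by consequence type, as quoted in the text.
counts = {"missense": 28_508, "nonsense": 12_098, "splice": 700}
total_drivers = sum(counts.values())

# Percentage share of each consequence type, rounded to one decimal place.
shares = {k: round(100 * v / total_drivers, 1) for k, v in counts.items()}

# Prevalence of CH driver carriers among UKB donors.
carriers, donors = 38_124, 467_202
prevalence = round(100 * carriers / donors, 1)
```

The three categories sum to the 41,306 driver mutations reported, and the rounded shares and prevalence match the percentages in the paragraph.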

Figure 4.

Application of boostDM-CH to identify CH driver mutations across 467,202 donors. A, Identification of CH driver mutations across UKB donors using boostDM-CH models. B, Distribution of the variant allele frequency (VAF) of CH driver and nondriver mutations; inner barplot, number of driver mutations identified in each of the 12 CH driver genes studied. C, Fold increase in the proportion of cases with CH driver and nondriver mutations across age groups (top). Association of CH with age measured via logistic regression (bottom). In this and subsequent regression plots in Fig. 5, the results of several sets of donors in the UKB are shown: “driver,” donors bearing at least one CH driver mutation (according to boostDM-CH); “multiple driver,” donors bearing more than one CH driver mutation; “driver large,” donors bearing at least one CH driver mutation with VAF ≥10%; “driver small,” donors bearing a CH driver mutation(s) with VAF <10%; “nondriver,” donors bearing a potential mutation in a CH driver gene classified as nondriver by boostDM-CH; “potential mutation,” donors bearing any potential mutation (driver or nondriver) in a CH driver gene (see “Methods”). D, Association between carriers of CH mutations in the different CH genes and age. Mutations predicted as drivers by boostDM-CH increase significantly with age in all genes, except CHEK2 and MDM4 (Supplementary Note). CI, confidence interval. In the regressions, ns stands for nonsignificant.

The proportion of individuals with CH driver mutations identified by boostDM-CH models (relative to those in the youngest age bracket, 38–45 years) grows with age, as expected for true CH mutations (Fig. 4C). In contrast, the proportion of donors with nondriver CH mutations remains constant. Although all potential somatic single-nucleotide variants (SNV) in the 12 CH driver genes show a significant association with age (Fig. 4C, black dot), the significance is virtually entirely due to CH driver mutations (top red dot) because nondriver mutations (gray dot) show almost no association. The effect size is larger for individuals with CH mutations with variant allele frequency (VAF) > 10% (driver large) than for those with CH mutations with VAF ≤ 10% (driver small), and it also seems larger for donors bearing several co-occurring CH driver mutations (multiple drivers). The association with age is also clearer when high-confidence CH driver and nondriver mutations are compared (Supplementary Fig. S9C–S9E). CH driver mutations in almost all individual genes show higher association with age than nondriver mutations (Fig. 4D). The significant association with age is also verified across mutations never observed (or observed only once) in the three cancer genomics training cohorts, suggesting that boostDM-CH models are capable of identifying new CH driver mutations (Supplementary Fig. S10A and S10B; Supplementary Note), and it is also apparent across donors with different ethnic backgrounds in UKB (Supplementary Fig. S10C). Among 85 recurrent variants identified in 12 CH genes in UKB, those identified as drivers by boostDM-CH tend to confer high fitness (estimated as described by ref. 44) and show a significant association with age. In contrast, recurrent nondriver variants lack these two features (Supplementary Methods; Supplementary Fig. S11A–S11D; Supplementary Note).
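The fold increase relative to the youngest age bracket can be sketched as follows. The donor counts and the older age brackets are invented for illustration (only the 38–45 baseline bracket is named in the text), and the function name is hypothetical; this is not the regression analysis reported in Fig. 4C.

```python
def fold_increase(carriers_by_bracket, donors_by_bracket, baseline):
    """Proportion of carriers per age bracket, relative to the baseline bracket."""
    props = {
        bracket: carriers_by_bracket[bracket] / donors_by_bracket[bracket]
        for bracket in donors_by_bracket
    }
    return {bracket: p / props[baseline] for bracket, p in props.items()}

# Hypothetical counts, not UKB data.
donors = {"38-45": 1000, "46-55": 1000, "56-65": 1000}
driver_carriers = {"38-45": 20, "46-55": 50, "56-65": 110}
nondriver_carriers = {"38-45": 100, "46-55": 100, "56-65": 100}

driver_fold = fold_increase(driver_carriers, donors, baseline="38-45")
nondriver_fold = fold_increase(nondriver_carriers, donors, baseline="38-45")
```

In this toy example the driver fold increase rises with age while the nondriver fold increase stays at 1, which is the qualitative pattern the paragraph describes.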

We next verified known associations of CH (defined by the presence of CH driver mutations identified by boostDM-CH) with the exposure to external CH promoters across UKB individuals. We corroborated that individuals with a history of smoking have significantly higher likelihood of carrying CH driver mutations than nonsmokers, as have donors who suffered a solid malignancy (as a proxy of exposure to cytotoxic drugs; refs. 1, 12, 13) prior to their enrollment in UKB (Fig. 5A, Supplementary Fig. S12A and S12B). We also tested the association of CH driver mutations identified by boostDM-CH in UKB with subsequent conditions known to be linked to CH (Supplementary Table S2). First, we corroborated the known association of the presence of CH driver mutations with the development of hematologic malignancies, particularly of myeloid origin (Fig. 5B; Supplementary Fig. S13A–S13E). Both the significance of the association and the increase in risk seem higher for donors bearing large CH clones, clones with multiple mutations (Fig. 5B), or high-confidence CH drivers (Supplementary Fig. S13F). Conversely, the risk of donors bearing CH nondriver mutations to subsequently develop a hematologic malignancy is comparable with that of the general UKB population. We also verified that the known associations between CH driver mutations and increased risk of subsequent heart failure, development of a solid malignancy, risk of any type of infection, and all-cause mortality are reproduced (Fig. 5C; Supplementary Figs. S14–S16). Finally, we compared driver mutations identified by boostDM-CH with those identified by the expert-curated rules (Supplementary Fig. S17) on the basis of the aforementioned associations. In all cases, the association with boostDM-CH drivers showed equal or higher significance than that obtained using the established expert-curated rules (Supplementary Fig. S18A–S18C). 
This suggests that boostDM-CH drivers are more specific than those identified through expert-curated rules.

Figure 5.

BoostDM-CH driver mutations are significantly associated with different clinical features. A, Association of CH with known CH promoting factors, including smoking and solid tumor history prior to enrollment in UKB (as a proxy of previous chemotherapy treatment), measured via logistic regression. B, Association of CH with the risk of developing myeloid malignancies, measured via Kaplan–Meier analysis (left) and Cox regression analysis (right). C, Association of CH with all-cause death, heart failure, the occurrence of a solid tumor, and any infection (obtained via a composite variable), measured by logistic regression. Asterisks represent significant associations (FDR < 0.05); ns, nonsignificant. In logistic and Cox regression analyses presented in this and subsequent figures, age, sex, and 10 ancestry principal components are included as covariables (depending on the dependent variable in each case). Tumor-related regression analyses in this and other figures also include smoking history as covariable. Heart failure regression analyses presented in this and other figures also include smoking history, dyslipidemia, BMI, hypertension, and diabetes type II status as covariables. Infectious disease regression analyses also include smoking history and occurrence of hematologic neoplasms as covariables (Supplementary Methods). CI, confidence interval.

Newly Identified CH Driver Mutations

We next focused on mutations identified across the UKB cohort which are classified as drivers by boostDM-CH models but not by the expert-curated rule sets (Supplementary Table S3). To this end, we first selected the mutations that were classified as drivers by the consensus of the three sets of expert-curated rules. These expert-curated rule consensus CH drivers were compared with those identified by the 10 corresponding boostDM-CH models (i.e., excluding MDM4 and CHEK2 not covered by all rule sets), yielding three groups of mutations: rules–boostDM–overlapping (N = 28,901), boostDM-exclusive (N = 10,079), and rules-exclusive (N = 7,436; Fig. 6A; Supplementary Fig. S19A and S19B). We observed a strong association with the age of individuals carrying either rules–boostDM–overlapping or boostDM-exclusive mutations, whereas a very weak association was observed for individuals carrying rules-exclusive mutations (Fig. 6B). A stronger association with age can also be observed for mutations identified uniquely by boostDM-CH than for those identified by each of the three sets of rules (Supplementary Fig. S19C and S19D). Furthermore, although both rules–boostDM–overlapping and boostDM-exclusive mutations showed significant association with all-cause mortality and development of myeloid malignancies, rules-exclusive mutations were not significantly associated with either condition (Fig. 6C). We also analyzed 52 recurrent SNVs identified as drivers by the Bick rule set (26) in the 12 genes included in this study (Supplementary Table S4), which were deemed nondrivers by a recent analysis that demonstrated they are not associated with age across UKB donors (24). Of these, 28 (54%) are correctly classified by boostDM-CH as nondrivers, a number which increases to 34 (65%) if only high-confidence boostDM-CH drivers are considered.
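The three-way comparison described above amounts to set operations on mutation calls. The sketch below is illustrative only: it assumes that "consensus" means a mutation is called a driver by all three rule sets, and the mutation identifiers are toy values, not UKB calls.

```python
def split_by_rule_consensus(boostdm_drivers, rule_sets):
    """Partition driver calls into the three comparison groups."""
    boostdm = set(boostdm_drivers)
    # Assumption: the consensus is the intersection of all rule sets.
    consensus = set.intersection(*(set(calls) for calls in rule_sets))
    return {
        "rules-boostDM-overlapping": consensus & boostdm,
        "boostDM-exclusive": boostdm - consensus,
        "rules-exclusive": consensus - boostdm,
    }


# Toy calls keyed by (gene, label); the labels are placeholders.
niroula = {("IDH2", "R140"), ("IDH2", "R172"), ("DNMT3A", "toy_variant")}
bick = {("IDH2", "R140"), ("IDH2", "R172")}
who = {("IDH2", "R140"), ("IDH2", "R172"), ("DNMT3A", "toy_variant")}
boostdm = {("IDH2", "R140"), ("DNMT3A", "toy_variant")}

groups = split_by_rule_consensus(boostdm, [niroula, bick, who])
```

In this toy input, the DNMT3A variant is boostDM-exclusive because one rule set does not call it, so it falls outside the consensus.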

Figure 6.

CH driver mutations identified de novo by boostDM-CH. A, Number of CH driver mutations from the UKB cohort identified by boostDM-CH and three expert-curated rule sets in the 10 CH genes in common. The comparison of the consensus between the three sets of rules and boostDM-CH defines three groups of mutations: boostDM-exclusive, rules–boostDM–overlapping, and rules-exclusive. The subsequent plots present analyses of these three groups of mutations. B, Fold increase of the proportion of cases bearing CH mutations in each of the three groups of mutations across age bins (top). Association of the presence of CH mutations of each of the three groups and age in the 10 CH genes, measured by logistic regression (bottom). On the right-hand of the plot, the number of donors in the UKB cohort bearing mutations of each of the three groups is shown. C, Association of the presence of the different groups of CH mutations and all-cause mortality (top) and development of different types of myeloid cancer (bottom) in the 10 CH genes, measured by logistic regression. D, Association with age of the presence of CH mutations of each of the three groups in either DNMT3A or TP53. E, Association with age of hotspot mutations (at positions R140 and R172), as well as all mutations identified as nondrivers and all potential mutations identified in IDH2. The bottom panel presents the distribution of effect size (OR) resulting from randomly sampling five disjoint sets of nine mutations (the number of observed IDH2 R172 mutations) from two driver (top) and two nondriver (bottom) hotspots (according to boostDM-CH). See Supplementary Note for details. Asterisks represent significant associations (FDR < 0.05); ns, nonsignificant; WHO, World Health Organization.

We then focused on mutations identified as drivers by boostDM-CH and/or the expert-curated rule sets in several specific CH genes across the UKB. In the exemplary cases of DNMT3A and TP53, 4,605 and 501 mutations, respectively, that were missed by the consensus of the three sets of expert-curated rules but identified by the corresponding boostDM-CH models show a significant association with age and are thus likely drivers of CH (Fig. 6D). DNMT3A driver mutations identified exclusively by boostDM-CH show a distribution of methylation capability very similar to that observed for rules–boostDM–overlapping mutations in the base editing assay mentioned above (Supplementary Fig. S19E; ref. 33). We found that 134 boostDM-exclusive mutations in ASXL1, all nonsense variants located upstream of exon 11 (see Supplementary Fig. S17), are not significantly associated with age or prior smoking [in agreement with a recent experimental report (45)], suggesting that they may be false positives of the boostDM-CH model (Supplementary Fig. S20A and S20B). In the case of TET2, the lack of association with age of 271 boostDM-exclusive missense mutations identified outside the two segments of the protein agreed upon by the three sets of expert-curated rules also implies that there are, among them, false positives of the model (Supplementary Fig. S20C and S20D). Nevertheless, in comparison, 3,380 mutations deemed drivers by the three expert-curated rule sets but not by boostDM-CH seem to be false positives (Supplementary Fig. S20E). Thus, mutations identified as CH drivers in TET2 by expert-curated rule sets have a much higher rate of false positives than those identified by boostDM-CH. Moreover, the boostDM-CH score of the 271 suspected false positive boostDM-exclusive missense mutations is lower than 0.9, and they are thus not included in the high-confidence driver set (Supplementary Fig. S20F). 
Next, we explored the intriguing case of two IDH2 hotspots—one of which (R140) is rules–boostDM–overlapping—whereas the other (R172) is rules-exclusive (see Fig. 3C). We verified that although the mutations in the R140 hotspot are significantly associated with age, those in position R172 are not, suggesting that only mutations in R140 are CH drivers (Fig. 6E; Supplementary Note). Finally, we found that CHEK2 and MDM4 driver mutations also show a significant association with age after removing potential contaminating germline variants (Supplementary Fig. S21A–S21D; Supplementary Note).

The results of all these analyses, taken together, indicate that CH driver mutations identified exclusively by boostDM-CH across UKB are, for the most part, true CH drivers. BoostDM-CH models outperform the expert-curated rule sets in the detection of CH driver mutations in a cohort representing the general population.

Here, we built machine learning-based models to identify all CH driver mutations in 12 well-known CH driver genes. We were primarily motivated by the lack of unbiased methods to identify the mutations responsible for the development of CH, even within well-established CH driver genes (24). One key requirement to train such machine learning models is the availability of a big enough set of high-quality blood somatic mutations. To fulfill this, we resorted to cohorts of patients with cancer in which blood somatic mutations can be reliably identified through a comparison of blood and tumor samples (12). The models were then trained on features that distinguish these blood somatic mutations observed in CH genes across donors from those expected to arise under neutral mutagenesis (28).

We demonstrate that, in general, these models outperform expert-curated rules designed by CH researchers on the basis of years of accumulated knowledge (25–27) in identifying CH driver mutations across cancer genomics cohorts and other cohorts representing the general population. Machine learning models trained unbiasedly on observed blood somatic mutations are thus able to recapitulate sedimented knowledge on CH development. Of note, the similarity in the blueprints of CH driver mutations produced by the models and by expert-curated rules and the applicability of the models to identify CH driver mutations in the UKB indicate that CH across patients with cancer is largely driven by age, which suggests that it follows a similar evolutionary path as in the general population (Supplementary Note).

BoostDM-CH models can be used to gain new insights into the mutational mechanisms underlying CH in different genes. Here, we provide examples with DNMT3A, TET2, and the models of other genes. The 12 boostDM-CH models trained in this work and their associated data are available to the CH research community at www.intogen.org/ch/boostdm. In the future, as more cohorts in which both blood and a second sample are sequenced (as is done in cancer genomics cohorts) become available in the public domain, growing sets of reliable blood somatic mutations will be identified, and good-quality models for more CH genes will be within reach.

Moreover, we have shown a new path for the identification of CH driver mutations across donors in a large cohort, from whom only a blood sample is available, using boostDM-CH models. Applying this rationale to the UKB, we are able to recapitulate known associations of CH with age and the development of several phenotypes. Driver mutations newly identified by boostDM-CH across CH driver genes have higher specificity than those identified uniquely by expert-curated rule sets. In summary, BoostDM-CH models can distinguish CH driver mutations from a combination of sequencing artifacts, germline contamination, and passenger somatic mutations across half a million donors in the UKB (22) with better accuracy than expert-curated rules. This work constitutes a proof of principle for a wider use of boostDM-CH models in large retrospective or prospective clinical studies aimed at discovering such associations. Furthermore, it also illustrates the ways in which the models could assist in detecting CH across large populations to subsequently monitor individuals at risk of developing different conditions.

Blood Somatic Mutations in Three Cancer Genomics Cohorts

Blood somatic mutations identified across two of our discovery cohorts—The Cancer Genome Atlas (TCGA, whole-exome sequencing; ref. 46) and Hartwig Medical Foundation (HMF, whole-genome sequencing; ref. 47)—through reverse calling were obtained from our previous study (12). Briefly, aligned sequencing reads from normal (blood) and tumor BAM files from patients with solid tumors in TCGA and HMF cohorts were compared (using Strelka2) to call blood somatic mutations, which were subjected to a strict filtering postprocess described in detail in the aforementioned study. We thus retrieved all mutations identified in each of these two discovery cohorts. Blood somatic mutations in a third discovery cohort—MSK-IMPACT (in a gene sequencing panel; ref. 48) called by Bolton and colleagues (1)—were obtained from cBioPortal (https://www.cbioportal.org/; ref. 49).

Compendium of Mutational CH Driver Genes

CH driver genes were obtained from our previous work (12). We successfully built boostDM-CH models for 25 of them: ASXL1, ATM, CBL, CHEK2, CTCF, DNMT3A, EZH2, GNAS, IDH2, JAK2, KDM5C, KRAS, MDM4, NF1, PPM1D, RAD21, STAT5B, SF3B1, SH2B3, SRSF2, STAG2, TET2, TP53, U2AF1, and ZRSR2. Of these, we built high-quality boostDM-CH models (with the criteria defined in the main text and Supplementary Methods) for 12 genes (ASXL1, CHEK2, DNMT3A, GNAS, IDH2, MDM4, PPM1D, SF3B1, SRSF2, TET2, TP53, and U2AF1). See details in Supplementary Methods.

BoostDM-CH Implementation

BoostDM-CH is a supervised learning strategy based on mutations observed across blood samples and synthetic mutations generated following neutral mutagenesis. The boostDM-CH pipeline has been implemented in Python and Nextflow (50). The base classifiers were trained using XGBoost v.0.90 (51), and SHAP (52) v.0.28.5 was used to compute the local explanations for the predictions of the base classifiers. The pipeline is available at a GitHub repository (https://github.com/bbglab/boostdm-pipeline/tree/ch). In silico saturation mutagenesis is a term that indicates the assessment of all possible changes in a gene or protein with a computational approach (28). To assess the driver potential of all possible mutations in CH driver genes, we used boostDM-CH models. The training of the models, the derivation of local explanations for each mutation, and their validation are described in Supplementary Methods.
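As an illustration of what "all possible changes" means at the nucleotide level, the sketch below enumerates every single-nucleotide substitution of a DNA sequence. It is not part of the boostDM-CH pipeline; the function name and the 1-based coordinate convention are assumptions made for the example, and each enumerated change would subsequently be scored by a trained model.

```python
BASES = "ACGT"


def saturation_snvs(sequence, start=1):
    """Yield (position, ref, alt) for every possible single-nucleotide change.

    Positions are 1-based by default (an assumption of this sketch).
    Ambiguous bases such as N are skipped.
    """
    for offset, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            continue
        for alt in BASES:
            if alt != ref:
                yield (start + offset, ref, alt)


# A sequence of length L yields 3 * L candidate substitutions.
candidates = list(saturation_snvs("ACGT"))
```

For a sequence of four unambiguous bases, `candidates` holds 12 substitutions, three per position.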

Expert-Curated Rules to Identify CH Driver Mutations

Three sets of expert-curated rules designed by CH researchers [referred to as Niroula (25), Bick (26), and World Health Organization (27)] were used for boostDM-CH performance comparison (details in Supplementary Table S1). For most of the benchmarking, we only consider those mutations in genes included by all three sets of expert-curated rules (10 genes).

Comparison with Myeloid BoostDM Models

The myeloid boostDM models were trained with data from 11 tumor-type–specific cohorts from IntOGen (Release v2023.05.31; ref. 28, 29) comprising acute myeloid leukemias (9 cohorts), myelodysplastic syndromes (1 cohort), and chronic myelogenous leukemias (1 cohort).

Application of BoostDM-CH Models to a Population-Based Cohort (UKB)

All analyses were performed using the UKB Research Analysis Platform (22). The cohort used in the study comprises 467,202 individuals (54% female) with no history of hematologic malignancy prior to enrollment in UKB, with an age range of 37 to 73 years and a median age of 58 years. Briefly, mutations were called in the 12 genes with selected boostDM-CH models from UKB CRAM files using a Nextflow pipeline (ref. 50; v20.10.0) implementing Mutect2 (GATK, v.4.2.2.0) in tumor-only mode. Variants identified were filtered to eliminate potential germline contamination using a Panel of Normals, the 1000 Genomes Project, and the Genome Aggregation Database (53), as well as ad hoc filters to exclude low-quality calls. Details are provided in Supplementary Methods.

Clinical data from the UKB were downloaded in November 2022, and individual traits were extracted from the whole phenotype file, which is organized in data-fields. Basic information from the individuals used for the analyses includes age of recruitment (data-field: 21022), sex (data-field: 31), genetic principal components (data-field: 22009), death status (data-field: 40007), and body mass index (BMI, data-field: 21001). Smoking status was defined as never smoker or ever smoker using smoking status information (data-field: 20116). The presence of cancer was defined from reported occurrences of cancer (data-field: 40009). Years to first cancer was assessed using the age of recruitment and the age of first cancer (data-field: 22008). Specific cancer type, CVD traits, infectious disease, and other conditions such as hypertension or diabetes mellitus type II were generated combining information from different data-fields (Supplementary Table S2) including ICD-10 diagnosis (data-fields: 40006, 41202, 41270), ICD-9 diagnosis (data-fields: 40013, 41203, 41271), self-reported cancer (data-field: 20001), self-reported noncancer illness (data-field: 20002), underlying cause of death (data-field: 40001), contributory cause of death (data-field: 40002), operation (data-field: 20004), and OPCS4 (data-fields: 41272, 41200), similarly to the definitions outlined in refs. 54 and 55. For each definition, the first diagnosis event that occurred was selected. Years to first occurrence of cancer, CVD, and infectious diseases were also calculated using the difference between the date of recruitment and specific diagnosis dates (data-fields: 40005, 40000, 41260, 41262, 41263, 41280, 41281, 41282) and diagnosis age (data-fields: 20007, 20009) based on disease definitions. 
With regard to some of the covariables used for the association, diabetes mellitus type II was defined as its diagnosis or treatment with insulin or oral hypoglycemic medication (data-field: 6177); dyslipidemia was defined as cholesterol ≥240 mg/dL (data-field: 30690), LDL-direct ≥160 mg/dL (data-field: 30780), HDL-cholesterol <40 mg/dL (data-field: 30760), or the use of lipid-lowering drugs (data-field: 6177); hypertension was defined by its diagnosis or by having a systolic blood pressure ≥140 mm Hg (data-field: 4080), diastolic blood pressure ≥90 mm Hg (data-field: 4079), or the use of antihypertensive medication (data-field: 6177).
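The threshold-based covariate definitions above translate directly into boolean rules. The sketch below is a hypothetical illustration, not the authors' implementation: the function and parameter names are invented, inputs are assumed to be already expressed in mg/dL and mm Hg, and a missing measurement (`None`) is assumed not to satisfy its criterion.

```python
def _at_least(value, threshold):
    """True if the measurement exists and is >= threshold."""
    return value is not None and value >= threshold


def _below(value, threshold):
    """True if the measurement exists and is < threshold."""
    return value is not None and value < threshold


def has_dyslipidemia(total_chol, ldl, hdl, on_lipid_lowering_drugs=False):
    # Cholesterol >= 240, LDL >= 160, HDL < 40 (all mg/dL), or treatment.
    return (
        _at_least(total_chol, 240)
        or _at_least(ldl, 160)
        or _below(hdl, 40)
        or on_lipid_lowering_drugs
    )


def has_hypertension(diagnosed, systolic, diastolic, on_antihypertensives=False):
    # Diagnosis, systolic >= 140 or diastolic >= 90 (mm Hg), or treatment.
    return (
        diagnosed
        or _at_least(systolic, 140)
        or _at_least(diastolic, 90)
        or on_antihypertensives
    )
```

Any single criterion is sufficient, so a donor with normal measurements who takes lipid-lowering medication is still classified as having dyslipidemia.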

Unless otherwise specified, all regression models included age, sex, and the first 10 ancestry principal components as covariates. For cancer regressions, we also included smoking status as covariable, whereas for CVD, we included smoking status, BMI, diabetes mellitus type II, dyslipidemia, and hypertension status. Infectious diseases included smoking status and hematologic cancer. For age, cancer, smoking status, CVD, and death association with CH, we performed logistic regression analysis using logit (Python statsmodels package v.0.14.0). To analyze the risk of developing hematologic malignancies and myeloid neoplasms in CH carriers, we implemented a Cox proportional hazards model using CoxPHFitter (Python lifelines package v.0.27.8). We count an event as any reported diagnosis of hematologic or myeloid cancer, after enrollment in the UKB. Individuals without the event who died before the end of the follow-up were censored at the time of death, whereas the rest were censored at the last follow-up reported (June 25, 2021, from data-field 40005). The maximum number of years to an event was restricted to the 97th percentile of the UKB population. Kaplan–Meier curves were built using KaplanMeierFitter and logrank_test functions (Python lifelines package v.0.27.8).
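The event and censoring rules described above can be sketched as a function that derives the duration and event indicator consumed by a survival model. This is a simplified illustration under stated assumptions: the function name and arguments are hypothetical, durations use 365.25-day years, and the 97th-percentile restriction is represented by an optional `max_years` cap that is assumed to censor longer follow-up at the cap.

```python
from datetime import date

# Last follow-up date reported in the text.
LAST_FOLLOW_UP = date(2021, 6, 25)


def time_to_event(enrollment, diagnosis=None, death=None, max_years=None):
    """Return (years_of_follow_up, event_observed) for one donor."""
    if diagnosis is not None and diagnosis > enrollment:
        # Event: diagnosis reported after enrollment.
        end, observed = diagnosis, True
    elif death is not None:
        # No event: censor at death (or at last follow-up if earlier).
        end, observed = min(death, LAST_FOLLOW_UP), False
    else:
        # No event and alive: censor at the last follow-up.
        end, observed = LAST_FOLLOW_UP, False
    years = (end - enrollment).days / 365.25
    if max_years is not None and years > max_years:
        # Assumed handling of the cap: censor at the maximum duration.
        years, observed = max_years, False
    return years, observed
```

The resulting pair corresponds to the duration and event columns that a Cox or Kaplan–Meier fit would take as input.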

Data and Software Availability

Blood somatic mutation data required to train boostDM-CH models are available through HMF and the Database of Genotypes and Phenotypes following the same procedure to access the original datasets used in the reverse calling approach (12). HMF blood somatic mutations are available as part of the data access request to HMF (https://www.hartwigmedicalfoundation.nl). TCGA blood somatic mutations are available through the Database of Genotypes and Phenotypes (phs002867) to researchers who have obtained permission to access protected TCGA data. Panel-sequenced data from the MSK-IMPACT targeted cohort are available through cBioPortal (https://www.cbioportal.org/study/summary?id=msk_ch_2020). The compendium of CH driver genes is available through IntOGen (intogen.org/ch). The in silico saturation mutagenesis of CH drivers described in this study is available through the boostDM-CH (intogen.org/ch/boostdm). Data required to reproduce UKB and Japanese Biobank analyses are available upon access request to both entities (details in Supplementary Methods). Software required to carry out analyses described in the article is publicly available at https://github.com/bbglab/boostdm-ch-analysis.

All authors report that although no patent application is envisioned given the type of technology (software), IRB Barcelona retains any and all commercial rights over it. In addition, our innovation department is evaluating the registration of the source code. No disclosures were reported by the other authors.

S. Demajo: Conceptualization, data curation, software, formal analysis, validation, investigation, methodology, writing–review and editing. J. Ramis-Zaldivar: Data curation, software, formal analysis, validation, visualization, methodology, writing–review and editing. F. Muinos: Conceptualization, data curation, software, formal analysis, investigation, visualization, methodology, writing–review and editing. M.L. Grau: Software, formal analysis. M. Andrianova: Software, formal analysis, investigation, writing–review and editing. N. Lopez-Bigas: Conceptualization, resources, supervision, funding acquisition, methodology, writing–original draft, project administration, writing–review and editing. A. Gonzalez-Perez: Conceptualization, supervision, methodology, writing–original draft, writing–review and editing.

The authors would like to thank Martina Gasull and Paula Gomis for support in data access and Federica Brando for key support in the development of the boostDM-CH website. N. López-Bigas acknowledges funding from the European Research Council (consolidator grant 682398). S. Demajo was supported by a Juan de la Cierva fellowship from Spanish Ministerio de Ciencia e Innovación (IJC2020-044728-I). J.E. Ramis-Zaldivar was supported by a Postdoctoral AECC 2023 fellowship from Fundación Científica Asociación Española Contra el Cáncer (POSTD234814RAMI). This project was supported by the PID2021-126568OB-I00(CHEMOHEALTH) project, funded by the Spanish Ministry of Science (MCIN), AEI/10.13039/501100011033/. The project was also funded by Grant PLEC2021-008194 (AtheroClonal) funded by MICIU/AEI/10.13039/501100011033 and by “European Union NextGenerationEU/ PRTR”; the project leading to these results has received funding from “la Caixa” Foundation under the project code LCF/PR/HR22/52420011 (MyoClonal). This work was delivered as part of the PROMINENT team supported by the Cancer Grand Challenges partnership funded by Cancer Research UK (CGCATF-2021/100008), the NCI (OT2CA278668), and the Scientific Foundation of the Spanish Association Against Cancer, AECC. It has also been supported by the “CGI-Clinics” project, funded by the European Union's Horizon Europe Programme under grant agreement 101057509. IRB Barcelona is a recipient of a Severo Ochoa Centre of Excellence Award from the Spanish Ministry of Economy and Competitiveness (government of Spain) and an Excellence Institutional grant by the Asociacion Española contra el Cancer and is supported by CERCA (Generalitat de Catalunya). This publication and the underlying research are partly facilitated by Hartwig Medical Foundation and the Center for Personalized Cancer Treatment which have generated, analyzed, and made available data for this research. 
This publication and the underlying research are also partly facilitated by data collected and made public by The Cancer Genome Atlas network. This research has been conducted using the UK Biobank Resource under application number 69794.

Note: Supplementary data for this article are available at Cancer Discovery Online (http://cancerdiscovery.aacrjournals.org/).

1.
Bolton
KL
,
Ptashkin
RN
,
Gao
T
,
Braunstein
L
,
Devlin
SM
,
Kelly
D
, et al
.
Cancer therapy shapes the fitness landscape of clonal hematopoiesis
.
Nat Genet
2020
;
52
:
1219
26
.
2.
Bowman
RL
,
Busque
L
,
Levine
RL
.
Clonal hematopoiesis and evolution to hematopoietic malignancies
.
Cell Stem Cell
2018
;
22
:
157
70
.
3.
Busque
L
,
Patel
JP
,
Figueroa
ME
,
Vasanthakumar
A
,
Provost
S
,
Hamilou
Z
, et al
.
Recurrent somatic TET2 mutations in normal elderly individuals with clonal hematopoiesis
.
Nat Genet
2012
;
44
:
1179
81
.
4.
Busque
L
,
Buscarlet
M
,
Mollica
L
,
Levine
RL
.
Concise review: age-related clonal hematopoiesis: stem cells tempting the devil
.
Stem Cells
2018
;
36
:
1287
94
.
5.
Challen
GA
,
Goodell
MA
.
Clonal hematopoiesis: mechanisms driving dominance of stem cell clones
.
Blood
2020
;
136
:
1590
8
.
6.
Jaiswal
S
,
Ebert
BL
.
Clonal hematopoiesis in human aging and disease
.
Science
2019
;
366
:
4673
.
7.
Chen
S
,
Wang
Q
,
Yu
H
,
Capitano
ML
,
Vemula
S
,
Nabinger
SC
, et al
.
Mutant p53 drives clonal hematopoiesis through modulating epigenetic pathway
.
Nat Commun
2019
;
10
:
5649
.
8.
Coombs
CC
,
Zehir
A
,
Devlin
SM
,
Kishtagari
A
,
Syed
A
,
Jonsson
P
, et al
.
Therapy-related clonal hematopoiesis in patients with non-hematologic cancers is common and associated with adverse clinical outcomes
.
Cell Stem Cell
2017
;
21
:
374
82.e4
.
9.
Fuster
JJ
,
Walsh
K
.
Somatic mutations and clonal hematopoiesis: unexpected potential new drivers of age-related cardiovascular disease
.
Circ Res
2018
;
122
:
523
32
.
10.
Gao
T
,
Ptashkin
R
,
Bolton
KL
,
Sirenko
M
,
Fong
C
,
Spitzer
B
, et al
.
Interplay between chromosomal alterations and gene mutations shapes the evolutionary trajectory of clonal hematopoiesis
.
Nat Commun
2021
;
12
:
338
.
11.
Hsu
JI
,
Dayaram
T
,
Tovy
A
,
De Braekeleer
E
,
Jeong
M
,
Wang
F
, et al
.
PPM1D mutations drive clonal hematopoiesis in response to cytotoxic chemotherapy
.
Cell Stem Cell
2018
;
23
:
700
13.e6
.
12.
Pich
O
,
Reyes-Salazar
I
,
Gonzalez-Perez
A
,
Lopez-Bigas
N
.
Discovering the drivers of clonal hematopoiesis
.
Nat Commun
2022
;
13
:
4267
.
13.
Hagiwara
K
,
Natarajan
S
,
Wang
Z
,
Zubair
H
,
Mulder
HL
,
Dong
L
, et al
.
Dynamics of age- versus therapy-related clonal hematopoiesis in long-term survivors of pediatric cancer
.
Cancer Discov
2023
;
13
:
844
57
.
14.
Marnell
CS
,
Bick
A
,
Natarajan
P
.
Clonal hematopoiesis of indeterminate potential (CHIP): linking somatic mutations, hematopoiesis, chronic inflammation and cardiovascular disease
.
J Mol Cell Cardiol
2021
;
161
:
98
105
.
15.
Calvillo-Argüelles
O
,
Jaiswal
S
,
Shlush
LI
,
Moslehi
JJ
,
Schimmer
A
,
Barac
A
, et al
.
Connections between clonal hematopoiesis, cardiovascular disease, and cancer: a review
.
JAMA Cardiol
2019
;
4
:
380
7
.
16.
Arends
CM
,
Liman
TG
,
Strzelecka
PM
,
Kufner
A
,
Löwe
P
,
Huo
S
, et al
.
Associations of clonal hematopoiesis with recurrent vascular events and death in patients with incident ischemic stroke
.
Blood
2023
;
141
:
787
99
.
17.
Miller
PG
,
Qiao
D
,
Rojas-Quintero
J
,
Honigberg
MC
,
Sperling
AS
,
Gibson
CJ
, et al
.
Association of clonal hematopoiesis with chronic obstructive pulmonary disease
.
Blood
2022
;
139
:
357
68
.
18. Yu B, Roberts MB, Raffield LM, Zekavat SM, Nguyen NQH, Biggs ML, et al. Association of clonal hematopoiesis with incident heart failure. J Am Coll Cardiol 2021;78:42–52.
19. Mitchell SR, Gopakumar J, Jaiswal S. Insights into clonal hematopoiesis and its relation to cancer risk. Curr Opin Genet Dev 2021;66:63–9.
20. Bolton KL, Koh Y, Foote MB, Im H, Jee J, Sun CH, et al. Clonal hematopoiesis is associated with risk of severe Covid-19. Nat Commun 2021;12:5975.
21. Wang Y, Sano S, Ogawa H, Horitani K, Evans MA, Yura Y, et al. Murine models of clonal haematopoiesis to assess mechanisms of cardiovascular disease. Cardiovasc Res 2022;118:1413–32.
22. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 2015;12:e1001779.
23. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 2021;590:290–9.
24. Vlasschaert C, Mack T, Heimlich JB, Niroula A, Uddin MM, Weinstock J, et al. A practical approach to curate clonal hematopoiesis of indeterminate potential in human genetic data sets. Blood 2023;141:2214–23.
25. Niroula A, Sekar A, Murakami MA, Trinder M, Agrawal M, Wong WJ, et al. Distinction of lymphoid and myeloid clonal hematopoiesis. Nat Med 2021;27:1921–7.
26. Bick AG, Weinstock JS, Nandakumar SK, Fulco CP, Bao EL, Zekavat SM, et al. Inherited causes of clonal haematopoiesis in 97,691 whole genomes. Nature 2020;586:763–8.
27. Khoury JD, Solary E, Abla O, Akkari Y, Alaggio R, Apperley JF, et al. The 5th edition of the World Health Organization Classification of Haematolymphoid Tumours: Myeloid and Histiocytic/Dendritic Neoplasms. Leukemia 2022;36:1703–19. https://www.nature.com/articles/s41375-022-01613-1.
28. Muiños F, Martínez-Jiménez F, Pich O, Gonzalez-Perez A, Lopez-Bigas N. In silico saturation mutagenesis of cancer genes. Nature 2021;596:428–32.
29. Martínez-Jiménez F, Muiños F, Sentís I, Deu-Pons J, Reyes-Salazar I, Arnedo-Pac C, et al. A compendium of mutational cancer driver genes. Nat Rev Cancer 2020;20:555–72.
30. Arnedo-Pac C, Mularoni L, Muiños F, Gonzalez-Perez A, Lopez-Bigas N. OncodriveCLUSTL: a sequence-based clustering method to identify cancer drivers. Bioinformatics 2019;35:4788–90.
31. Tokheim C, Bhattacharya R, Niknafs N, Gygax DM, Kim R, Ryan M, et al. Exome-scale discovery of hotspot mutation regions in human cancer using 3D protein structure. Cancer Res 2016;76:3719–31.
32. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2010;20:110–21.
33. Lue NZ, Garcia EM, Ngan KC, Lee C, Doench JG, Liau BB. Base editor scanning charts the DNMT3A activity landscape. Nat Chem Biol 2023;19:176–86.
34. Nagai A, Hirata M, Kamatani Y, Muto K, Matsuda K, Kiyohara Y, et al. Overview of the BioBank Japan project: study design and profile. J Epidemiol 2017;27:S2–8.
35. Kato S, Han S-Y, Liu W, Otsuka K, Shibata H, Kanamaru R, et al. Understanding the function-structure and function-mutation relationships of p53 tumor suppressor protein by high-resolution missense mutation analysis. Proc Natl Acad Sci U S A 2003;100:8424–9.
36. Giacomelli AO, Yang X, Lintner RE, McFarland JM, Duby M, Kim J, et al. Mutational processes shape the landscape of TP53 mutations in human cancer. Nat Genet 2018;50:1381–7.
37. Ursu O, Neal JT, Shea E, Thakore PI, Jerby-Arnon L, Nguyen L, et al. Massively parallel phenotyping of coding variants in cancer with Perturb-seq. Nat Biotechnol 2022;40:896–905.
38. Kotler E, Shani O, Goldfeld G, Lotan-Pompan M, Tarcic O, Gershoni A, et al. A systematic p53 mutation library links differential functional impact to cancer mutation pattern and evolutionary conservation. Mol Cell 2018;71:178–90.e8.
39. Sendžikaitė G, Hanna CW, Stewart-Morgan KR, Ivanova E, Kelsey G. A DNMT3A PWWP mutation leads to methylation of bivalent chromatin and growth retardation in mice. Nat Commun 2019;10:1884.
40. Kaasinen E, Kuismin O, Rajamäki K, Ristolainen H, Aavikko M, Kondelin J, et al. Impact of constitutional TET2 haploinsufficiency on molecular and clinical phenotype in humans. Nat Commun 2019;10:1252.
41. Waheed SO, Chaturvedi SS, Karabencheva-Christova TG, Christov CZ. Catalytic mechanism of human ten-eleven translocation-2 (TET2) enzyme: effects of conformational changes, electric field, and mutations. ACS Catal 2021;11:3877–90.
42. Tulstrup M, Soerensen M, Hansen JW, Gillberg L, Needhamsen M, Kaastrup K, et al. TET2 mutations are associated with hypermethylation at key regulatory enhancers in normal and malignant hematopoiesis. Nat Commun 2021;12:6061.
43. Kahn JD, Miller PG, Silver AJ, Sellar RS, Bhatt S, Gibson C, et al. PPM1D-truncating mutations confer resistance to chemotherapy and sensitivity to PPM1D inhibition in hematopoietic cells. Blood 2018;132:1095–105.
44. Watson CJ, Papula AL, Poon GYP, Wong WH, Young AL, Druley TE, et al. The evolutionary dynamics and fitness landscape of clonal hematopoiesis. Science 2020;367:1449–54.
45. Kohnke T, Nuno KA, Alder CC, Gars EJ, Phan P, Fan AC, et al. Human ASXL1-mutant hematopoiesis is driven by a truncated protein associated with aberrant de-ubiquitination of H2AK119. Blood Cancer Discov 2024;5:202–23.
46. Hutter C, Zenklusen JC. The cancer genome atlas: creating lasting value beyond its data. Cell 2018;173:283–5.
47. Martínez-Jiménez F, Movasati A, Brunner SR, Nguyen L, Priestley P, Cuppen E, et al. Pan-cancer whole-genome comparison of primary and metastatic solid tumours. Nature 2023;618:333–41.
48. Cheng DT, Mitchell T, Zehir A, Shah RH, Benayed R, Syed A, et al. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology. J Mol Diagn 2015;17:251–64.
49. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2012;2:401–4.
50. Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol 2017;35:316–9.
51. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, CA. New York (NY): Association for Computing Machinery; 2016. p. 785–94.
52. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, CA. Red Hook (NY): Curran Associates Inc.; 2017. p. 4768–77.
53. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581:434–43.
54. Trinder M, Walley KR, Boyd JH, Brunham LR. Causal inference for genetically determined levels of high-density lipoprotein cholesterol and risk of infectious disease. Arterioscler Thromb Vasc Biol 2020;40:267–78.
55. Kar SP, Quiros PM, Gu M, Jiang T, Mitchell J, Langdon R, et al. Genome-wide analyses of 200,453 individuals yield new insights into the causes and consequences of clonal hematopoiesis. Nat Genet 2022;54:1155–66.
This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

Supplementary data