Abstract
Clonal hematopoiesis of indeterminate potential (CHIP) is characterized by detectable hematopoietic-associated gene mutations in a person without evidence of hematologic malignancy. We sought to identify additional cancer-presenting mutations usable for CHIP detection by performing a data mining analysis of 48 somatic mutation landscape studies reporting mutations at diagnoses of 7,430 adult and pediatric patients with leukemia or other hematologic malignancy. Following extraction of 20,141 protein-altering mutations, we identified 434 significantly recurrent mutation hotspots, 364 of which occurred at loci confidently assessable for CHIP. We then performed an additional large-scale analysis of whole-exome sequencing data from 4,538 persons belonging to three noncancer cohorts for clonal mutations. We found the combined cohort prevalence of CHIP with mutations identical to those reported at blood cancer mutation hotspots to be 1.8%, and that some of these CHIP mutations occurred in children. Our findings may help to improve CHIP detection and precancer surveillance for both children and adults.
This study identifies frequently occurring mutations across several blood cancers that may drive hematologic malignancies and signal increased risk for cancer when detected in healthy persons. We find clonal mutations at these hotspots in a substantial number of individuals from noncancer cohorts, including children, showcasing potential for improved precancer surveillance.
See related commentary by Spitzer and Levine, p. 192.
Introduction
Somatic mutations at hotspots (genetic loci observed to be frequently mutated across patients with cancer) often drive or contribute to cancer pathogenesis (1, 2). Many somatic landscape studies have now been performed with next-generation sequencing (NGS), typically for a single cancer type in small- to medium-sized patient cohorts, identifying mutations that recur frequently. Large single-study and pan-cancer analyses of both pediatric and adult cohorts have identified cancer genes and mutation hotspots having even lower but significant recurrence rates. For example, over 140 driver genes were identified in an analysis of children with six cancer types (3), and more than 450 hotspots were found in an analysis of 41 predominantly nonhematologic cancers (2). However, the number of included leukemic or other hematologic cancer samples in both pan-cancer and single-study analyses reporting hotspots at a codon level has generally been insufficient to identify blood cancer–specific hotspots of low recurrence. Existing databases that accumulate mutations across many individual studies [most notable in size, the Catalogue Of Somatic Mutations In Cancer (COSMIC; https://cancer.sanger.ac.uk/cosmic); ref. 4] are extremely valuable for identifying reported mutations within their included studies; yet, current versions of these databases lack many exome/genome-wide blood cancer studies that could improve detection of mutations recurrent at low frequencies, and some are difficult to filter for specific cancers. Hence, many hematologic malignancy mutation hotspots of lesser recurrence and their constituent mutations that may drive blood cancers are likely still unidentified or unappreciated.
Clonal hematopoiesis of indeterminate potential (CHIP), which is detected as an expanded clonal somatic mutation in a person currently free from hematologic malignancy, carries a greatly increased risk (HR >10) for future blood cancer development (5, 6). CHIP increases sharply in prevalence with advanced age (5–9), and backward extrapolation of trends in adults could suggest that CHIP in childhood is extremely rare. However, no exome-wide CHIP study has included large numbers of healthy children to directly address this issue. The mutations in hematopoiesis-regulating genes used to identify CHIP include hotspot mutations of high recurrence, but also include mutations lacking evidence of a driver role or of recurrence at specific amino acid positions. Regardless of neoplastic evidence, CHIP mutations are frequently identified in persons singly (5–10); therefore, most are likely responsible for the clonal expansion that makes their detection possible. Yet CHIP mutations identical to those reported at blood cancer mutation hotspots may be found to carry increased risk for future neoplastic transformation.
With the dual aims of identifying mutational hotspots of hematologic malignancies and using these hotspots to detect CHIP with potentially greater risk of progression to cancer, we performed a recurrent mutation analysis independently of existing databases. We systematically extracted and analyzed reported data from 48 studies meeting inclusion criteria, which detected mutations at diagnoses of patients with acute leukemias, myeloproliferative neoplasms (MPN), and myelodysplastic syndrome (MDS). As the majority of these studies interrogated the entire exome and many are not currently included in other mutation databases, our analysis provides a valuable compiled list of recurrent hematologic cancer mutations at a protein-coding level with less genome-wide bias and a larger sample size than previously reported. We then performed a large analysis of CHIP, focused on finding clonal mutations identical to those reported at these hotspots in the whole-exome sequencing (WES) data of 4,538 persons from three noncancer cohorts, which included the widely used 1000 Genomes Project (1KG) and more than 400 children. Our findings may lead to the improved prognostic and clinical utility of precancer surveillance in children and adults.
Results
Somatic Mutation Studies
By systematic literature search, we identified 48 somatic mutation landscape studies of blood cancers (3, 11–57) that could be used to determine recurrence of mutations reported at amino acid positions in protein-coding genes (Table 1). Focusing on diseases that may have been preceded by CHIP (58), we limited our investigation to studies that assessed patients by NGS at diagnosis of one of seven hematologic malignancies: acute lymphoid leukemia (ALL), acute myeloid leukemia (AML), chronic myeloid leukemia (CML), chronic myelomonocytic leukemia (CMML), juvenile myelomonocytic leukemia (JMML), MDS, and MPN (Fig. 1). After evaluation, filtering, and harmonization of 58,177 reported mutations assessed in 7,430 diagnostic patients, we determined a total of 20,141 mutations that altered an amino acid or splice site.
Study | Main hematologic malignancy | Patients assessed (N)a | Mutations assessed (N)b | Protein-altering mutations (N)c |
---|---|---|---|---|
Andersson et al. (2015; ref. 11) | ALL | 65 | 250 | 146 |
Chen et al. (2018; ref. 12) | ALL | 36 | 302 | 294 |
De Keersmaecker et al. (2013; ref. 13) | ALL | 211 | 538 | 470 |
Holmfeldt et al. (2013; ref. 14) | ALL | 40 | 706 | 366 |
Liu et al. (2016; ref. 15) | ALL | 203 | 2,437 | 1,418 |
Liu et al. (2017; ref. 16) | ALL | 264 | 4,165 | 3,426 |
Oshima et al. (2016; ref. 17) | ALL | 55 | 1,845 | 397 |
Papaemmanuil et al. (2014; ref. 18) | ALL | 55 | 795 | 549 |
Paulsson et al. (2015; ref. 19) | ALL | 51 | 459 | 376 |
Russell et al. (2017; ref. 20) | ALL | 8 | 218 | 86 |
Ryan et al. (2016; ref. 21) | ALL | 42 | 45 | 30 |
Stengel et al. (2014; ref. 22) | ALL | 625 | 110 | 98 |
Ma et al. (2018; ref. 3) | ALL | 122 | 574 | 254 |
Huether et al. (2014; ref. 23) | ALL | 525 | 412 | 167 |
Bolouri et al. (2018; ref. 24) | AML | 684 | 1,983 | 584 |
de Rooij et al. (2017; ref. 25) | AML | 113 | 268 | 99 |
Dolnik et al. (2012; ref. 26) | AML | 50 | 171 | 153 |
Eisfeld et al. (2017; ref. 27) | AML | 177 | 258 | 199 |
Eisfeld et al. (2017; ref. 28) | AML | 10 | 36 | 36 |
Eisfeld et al. (2016; ref. 29) | AML | 23 | 68 | 56 |
Faber et al. (2016; ref. 30) | AML | 165 | 849 | 588 |
Farrar et al. (2016; ref. 31) | AML | 20 | 208 | 126 |
Garg et al. (2015; ref. 32) | AML | 67 | 443 | 250 |
Greif et al. (2018; ref. 33) | AML | 50 | 558 | 440 |
Hirsch et al. (2016; ref. 34) | AML | 53 | 257 | 246 |
Lavallée et al. (2015; ref. 35) | AML | 29 | 55 | 53 |
CGARN et al. (2013; ref. 36) | AML | 200 | 22,633 | 1,574 |
Madan et al. (2016; ref. 37) | AML | 153 | 210 | 126 |
Papaemmanuil et al. (2016; ref. 38) | AML | 1,540 | 3,902 | 3,341 |
Sehgal et al. (2015; ref. 39) | AML | 152 | 52 | 52 |
Sood et al. (2016; ref. 40) | AML | 13 | 180 | 94 |
Thol et al. (2017; ref. 41) | AML | 171 | 603 | 415 |
Kim et al. (2017; ref. 42) | CML | 100 | 68 | 40 |
Togasaki et al. (2017; ref. 43) | CML | 24 | 191 | 181 |
Mason et al. (2016; ref. 44) | CMML | 69 | 507 | 479 |
Merlevede et al. (2016; ref. 45) | CMML | 17 | 8,077 | 143 |
Palomo et al. (2016; ref. 46) | CMML | 56 | 262 | 255 |
Patnaik et al. (2017; ref. 47) | CMML | 261 | 16 | 15 |
Caye et al. (2015; ref. 48) | JMML | 118 | 187 | 107 |
Stieglitz et al. (2015; ref. 49) | JMML | 71 | 128 | 128 |
Haferlach et al. (2014; ref. 50) | MDS | 102d | 300 | 281 |
Pastor et al. (2017; ref. 51) | MDS | 50 | 25 | 18 |
Walter et al. (2013; ref. 52) | MDS | 150 | 322 | 254 |
Yoshida et al. (2011; ref. 53)e | MDS | 22 | 268 | 185 |
Churpek et al. (2015; ref. 54)f | MDS | 16 | 183 | 51 |
Schwartz et al. (2017; ref. 55)g | MDS | 54 | 288 | 195 |
Lundberg et al. (2014; ref. 56) | MPN | 197 | 267 | 236 |
Nangalia et al. (2013; ref. 57) | MPN | 151 | 1,498 | 1,064 |
Total | | 7,430 | 58,177 | 20,141 |
NOTE: Studies assessed somatic mutations in diagnostic patient samples with NGS and met additional inclusion criteria.
Abbreviation: CGARN, Cancer Genome Atlas Research Network.
aNumber of unique diagnostic patients assessed in the original study meeting inclusion criteria (hence, the total sample size of the original study may have been larger).
bNumber of mutations from the original study assessed in this hotspot mutation analysis.
cNumber of determined protein-altering or splice-site mutations used in recurrence tallies.
dSample size estimated for the random selection of mutations provided.
eSome samples from patients classified as CMML were included in this study.
fSome patients were also classified with AML in this study.
gSome patients were also classified with AML or MPN/JMML in this study.
Mutational Hotspots in Hematologic Cancers
We found 434 amino acid or splice-site positions occurring across 85 genes that met our criteria for hotspots (Methods; Supplementary Table S1). The most common hotspots were well-known and observed frequently even within individual studies (e.g., mutations at NRAS p.G12 were reported for 345 persons within 35 studies; Fig. 2A). Yet, 79 hotspots observed in three to eight persons had only a single mutation within any given study, highlighting the benefit of multistudy evaluations. Nearly all of the hotspots were recurrently listed across combined cancers in the most recent COSMIC database, which includes many blood cancer studies, the majority having been published between 1992 and 2010. Yet this version lacked 32 of the 48 NGS studies (comprising 51% of total patients) used in our assessment. When COSMIC mutation data were restricted to primary samples in the seven hematologic malignancies of our focus, the majority (n = 222) of our identified hotspots were observed in fewer than three individuals of this COSMIC subset (Supplementary Table S2).
Some of our identified mutation hotspots were unique to a specific hematologic malignancy (Fig. 2B; Supplementary Table S1). With the proportion of assessed patients being 49.4% AML, 31.0% ALL, and 19.6% other malignancies, we found that mutations at IDH2 p.R172 (n = 59), KIT p.N822 (n = 31), and KIT p.Y418 (n = 18) were reported only in patients with AML, whereas RPL10 p.R98 (n = 25), NOTCH1 p.L1678 (n = 24), FBXW7 p.R479 (n = 23), NOTCH1 p.L1600 (n = 22), and NOTCH1 p.L1585 (n = 19) mutations were observed only in patients with ALL. We also observed that some genes contain mostly point or nonframeshift hotspots (e.g., KRAS, NRAS, PTPN11, SF3B1), mostly nonsense or frameshift hotspots (e.g., ASXL1, NF1, STAG2), or had distinct groupings of mutation types by location within the coding region (e.g., CEBPA, NOTCH1).
Hotspot Mutations Evaluable for CHIP
We identified 70 of the 434 hotspots to have heightened potential for false positives if used to infer CHIP from mutations in blood samples alone (Supplementary Table S3). This resulted in 364 hematologic cancer mutation hotspots and 755 constituent mutations that could be confidently used in screening for CHIP with prevalent NGS methods (Supplementary Table S4). Only 45 (12.4%) of these 364 loci were identified as hotspots in a previous pan-cancer analysis (2), and only 35 (9.6%) were observed in three or more patients of a sizable MDS investigation not included in our assessment (59). Likely because many of the exome-wide studies that identified mutations at these hotspots were published around or after a previous landmark CHIP study (10), 350 of the 755 specific mutations identified in our hotspot mutation assessment were not present in the list of queried CHIP variants of that study. Of these, 134 were reported in ≥3 patients at diagnosis of a hematologic malignancy (Supplementary Table S5). Many of these unique mutations occur within NOTCH1, FLT3, CEBPA, KIT, and RUNX1.
CHIP in Noncancer Cohorts Identical to Hotspot Mutations
We next assessed three large noncancer cohorts (60–62) for CHIP at our identified hotspots within their combined 4,538 individuals. The 1KG cohort included 2,503 adults from 26 different populations with sample sizes ranging from N = 61 to N = 113 (median N = 99). The Qatari Genome (QTRG) and Simons Simplex Collection (SSC) cohorts included children and adults. We found the ages, sequencing depths, and methods utilized in each study to vary considerably (Supplementary Fig. S1A), with potential to influence the rate of clonal mutation identification. The means of the average depths of coverage across the 364 confident hotspots were 78.5×, 93.3×, and 53.6×, with averages of 27.5%, 32.7%, and 10.3% of hotspots covered at a depth ≥100× in the 1KG, SSC, and QTRG cohorts, respectively (Supplementary Fig. S1B).
All detected variants were evaluated for reliability (which led to the exclusion of eight outlier samples) and for meeting the criteria for CHIP at hotspots, resulting in the identification of 83 individuals (1.83% of the 4,530 included) each having a single CHIP mutation at one of 62 hotspots across 23 genes (Supplementary Fig. S2; Supplementary Table S6). The prevalence of CHIP at hotspots for each cohort was as follows: QTRG: 0.73% (9/1,231), 1KG: 2.48% (62/2,503), and SSC: 1.51% (12/796), with subgroup rates of 0.77% (3/388) for SSC children and 2.21% (9/408) for SSC adults (Fig. 3A). CHIP mutations at hotspots were most common in DNMT3A, TET2, and TP53 (Fig. 3B). Clonal size was not significantly associated with reported frequency in our recurrent mutation analysis (P = 0.56; Supplementary Fig. S3). However, CHIP detection was associated with frequency of hotspot recurrence [OR, 1.75 per log of reported mutations; 95% confidence interval (CI), 1.33–2.29; P < 0.0001]. Still, while half of the 12 hotspots (including those of lesser confidence for CHIP evaluation) reported in more than 100 patients of the recurrence analysis were observed with identical clonal mutations in the noncancer cohorts, six were not. These included NPM1 p.W288 and FLT3 p.D835, both of which may rapidly accelerate neoplastic onset, making them more difficult to observe in healthy persons (9). Together, these observations suggest that stratifying CHIP mutations by significant recurrence in hematologic malignancies may prove prognostic in studies of blood cancer risk.
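The cohort-level rates reported in this paragraph can be recomputed directly from the stated carrier and sample counts. The following is a minimal sketch of that arithmetic; the counts come from the text, while the dictionary layout and the `prevalence` helper are illustrative assumptions rather than part of the original analysis.

```python
# Carrier and sample counts as stated in the text: (persons with CHIP at hotspots, persons assessed).
# The data structure and helper name are illustrative, not from the original analysis.
cohorts = {
    "QTRG": (9, 1231),
    "1KG": (62, 2503),
    "SSC": (12, 796),
}


def prevalence(carriers: int, total: int) -> float:
    """Return prevalence as a percentage of persons assessed."""
    return 100.0 * carriers / total


for name, (carriers, total) in cohorts.items():
    print(f"{name}: {prevalence(carriers, total):.2f}% ({carriers}/{total})")

# Pool the three cohorts to obtain the combined prevalence.
combined_carriers = sum(carriers for carriers, _ in cohorts.values())
combined_total = sum(total for _, total in cohorts.values())
combined = prevalence(combined_carriers, combined_total)
print(f"Combined: {combined:.2f}% ({combined_carriers}/{combined_total})")
```

The pooled figure is a simple ratio of summed counts, so it is weighted toward the largest cohort (1KG) and does not adjust for the differences in age and sequencing depth noted above.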
CHIP at Mutation Hotspots by Age Groups
We detected CHIP mutations at the leukemic hotspots in 3 of 388 (0.77%) children of the SSC cohort (two autism spectrum disorder probands and one unaffected sibling). These mutations occurred at hematologic hotspots in DNMT3A [p.R882S, variant allele frequency (VAF) = 2.5%] and RUNX1 [p.G170 splice site (c.509–1G>T), VAF = 1.3%; p.S322X, VAF = 1.1%], with all variant reads passing manual review and being detected on both strands (Supplementary Table S6; Supplementary Fig. S4). We assessed the parent–child trios for each of these children for expected SNP inheritance (to preclude the possibility of sample mix-up of children and adult samples or data) and confirmed that all three samples were from the child. The child with a RUNX1 p.S322X clonal mutation also had an apparent germline SNP at RUNX1 p.L56S (variant present in 5/10 reads) that was inherited from the child's father. This SNP is present in the gnomAD version 2.1.1 database (63) at a frequency of 1.2% yet was also reported in 4/56 (7.1%) patients diagnosed with CMML (46), suggesting a possibility that this variant helped accelerate clonal growth under the two-hit hypothesis (64). Still, the clonal size and number of variant reads at each hotspot in these children were insufficient to completely rule out the possibility of artifacts, and hence these findings are preliminary.
The QTRG cohort, whose participants ranged in age from 0 to 85 years, had a much lower sequencing depth, so we explored the effect of relaxing the requirement for detecting CHIP at hotspots to include mutations with only two variant reads. While such a threshold would result in a high false-positive rate in deeply sequenced data, a higher proportion of true positives is likely to be present in lower-depth data. This assessment found an additional 41 clonal mutations for this cohort (Supplementary Table S7) and found age to be associated with relaxed CHIP prevalence in this call set (OR, 1.025 per year of age; 95% CI, 1.004–1.048; P = 0.022; Fig. 3C).
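A per-year odds ratio can be hard to interpret on its own. As a rough illustration, and assuming the age effect is log-linear as in a standard logistic regression (an assumption made here, not a statement about the authors' model), the odds ratio over k years is the per-year odds ratio raised to the power k. The function name below is hypothetical, and the results inherit the rounding of the published estimates.

```python
def scale_odds_ratio(or_per_unit: float, units: float) -> float:
    """Rescale an odds ratio to a larger interval, assuming a log-linear effect."""
    return or_per_unit ** units


# Published per-year estimate and 95% CI bounds from the text.
or_year, ci_low_year, ci_high_year = 1.025, 1.004, 1.048

# Equivalent values per decade of age under the log-linear assumption.
or_decade = scale_odds_ratio(or_year, 10)
ci_low_decade = scale_odds_ratio(ci_low_year, 10)
ci_high_decade = scale_odds_ratio(ci_high_year, 10)

print(f"OR per decade: {or_decade:.2f} (95% CI, {ci_low_decade:.2f}-{ci_high_decade:.2f})")
```

Under that assumption, the reported estimate corresponds to roughly a 28% increase in the odds of relaxed-threshold CHIP per decade of age.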
CHIP Not Exclusive to Hotspots
We also investigated the noncancer cohorts for the prevalence of CHIP that included seldom or never-before reported mutations at diagnosis of blood cancers, but which may still be involved with clonal expansion due to their occurrence in genes with known hematopoietic function. Across the specified mutations and domains of the 74 allowed hematologic genes of Jaiswal and colleagues (10), and using their criteria, we observed 189 mutations not exclusive to hotspots within the 4,530 persons of our analyzed cohorts (Supplementary Table S8), 40 of which (21.2%) were identical to mutations reported at confident hotspots in three or more diagnostic patients in the 48 somatic landscape publications we assessed (i.e., those listed in Supplementary Table S4). For comparison, we also determined that 76 of the 224 (33.9%) and 298 of the 805 (37.0%) reported CHIP mutations in Jaiswal and colleagues (5) and Jaiswal and colleagues (10), respectively, were identical to mutations in our hotspot list.
Our general CHIP analysis identified six persons having two mutations (five of whom had a mutation of DNMT3A), none with more than two, and the remaining 177 persons having only a single CHIP mutation—yielding a general CHIP prevalence rate of 4.04% (183/4,530) across the cohorts. Including identified clonal mutations having a VAF >4% at the additional hotspots not previously utilized by Jaiswal and colleagues (10), 193/4,530 (4.26%) persons had qualifying CHIP. As in previous studies, CHIP not exclusive to hotspots was most frequent for TET2 and DNMT3A (Supplementary Table S8). However, we found mutations in SETD2, EP300, and KMT2A/D to be more common in our cohort, whereas ASXL1 and JAK2 mutations were noticeably less frequent. Variability of gene prevalence for CHIP mutations in DNMT3A, TET2, ASXL1, and, most strikingly, JAK2 between control and cardiovascular disease cohorts was previously observed (10). As the cohorts we assessed likely had fewer underlying cases of cardiovascular disease and younger average age than other CHIP studies using WES data, some of our gene prevalence rates may reflect these differences. Further exploration identified seven ASXL1 variants having a VAF >2%, yet only one reached the threshold of 4% required for this general CHIP analysis. No additional JAK2 variants with three or more supporting reads were observed at any frequency.
Potential CHIP at non-hotspots with a VAF >4% was observed in only four children having variant reads detected on both strands (Supplementary Table S8). In one child, the variant (DNMT3A p.V665L) had been previously identified and reported as a de novo mutation by Lim and colleagues (65), although it could possibly have been a CHIP mutation that had reached near-complete saturation (the VAF was 47.3%; see Methods). This particular variant was not observed in 55 individuals having Tatton-Brown–Rahman syndrome, a congenital condition due to germline DNMT3A mutations (66), one of whom developed AML in childhood. Yet the clinical association of autistic spectrum disorder in 20 (36%) of their study participants could be related to this detection of a DNMT3A mutation in an SSC proband. Also intriguing, another of the four children with CHIP at non-hotspots was one of the three children we initially identified to have a hotspot CHIP mutation, making the child's nonhotspot TET2 p.A1863S variant (VAF = 6.5%) a potentially synergistic CHIP mutation with their previously mentioned RUNX1 p.G170 splice-site hotspot variant (Supplementary Fig. S4). This observation may once again reflect an increased likelihood of observing CHIP in children when two genetic insults are present, as without an additional factor, single clonal mutations in younger persons may have lacked sufficient time to expand to detectable levels.
Discussion
Owing to the work of many researchers who published the findings of their somatic mutation studies (3, 11–57), we were able to compile a novel list of recurrent hematologic cancer mutations. The majority of these NGS studies assessed the entire exome/genome, allowing for a less biased and larger-scale assessment of diagnostic mutation recurrence in blood cancers than has previously been available. While significant recurrence alone does not imply a driver or even contributory role in cancer development, overall, the hotspots we identified will likely be enriched for pathogenic blood cancer mutations. Importantly, these identified hotspots also increase the number of mutations having documented recurrence in primary hematologic malignancies that can now be used to identify CHIP.
Using these recurrent mutation loci, we analyzed three large noncancer cohorts, finding clones with mutations identical to those observed at blood cancer mutation hotspots in 1.83% of persons across the combined cohorts. Without restricting to hotspots, our estimate of CHIP prevalence coincides with that of previous WES studies of CHIP, 4% to 5%, with the proportion of CHIP identical to diagnostic mutations at hotspots ranging from 21% to 37% within these studies. As the vast majority of all persons with identified CHIP are found to have only a single clonal mutation in hematopoiesis-regulating genes, most detected hotspot or nonhotspot mutations have likely driven the clonal expansion. Future studies of large size and long follow-up will be required to determine the degree to which CHIP at hotspots may carry an increased risk for aiding neoplastic transformation, as well as for apparently unrelated nonhematologic cardiovascular complications of CHIP (10).
Our analysis of CHIP included for the first time a large assessment of children free from blood disorders [please note that CHIP has been observed in children having aplastic anemia (67) and that postzygotic mutations of generally higher mutation frequencies in children have also been previously assessed (65)]. Of great interest, we detected CHIP at mutation hotspots in three of the children in our cohorts. This novel, preliminary detection of CHIP in children unselected for blood disorders is important because although only a small number of children were observed to have these clonal mutations, this rate may be higher than anticipated from past analyses of adult cohorts (e.g., only 1/1,039 adults aged 20–39 years was detected with CHIP in a previous analysis of WES data; ref. 5). The well-designed Simons Simplex Collection study sequenced children and parental samples simultaneously at nearly identical depths (60). That CHIP at hotspots in these children occurred at 35% of the overall parental frequency (2.2%) encourages future studies to investigate this phenomenon more thoroughly in children.
Abelson and colleagues (7) found that AML mutations reported with higher frequency in the COSMIC database (4) were also observed as CHIP more frequently. We similarly found an association between frequency of reported diagnostic mutations across the seven hematologic malignancies and their detection as clonal mutations in the noncancer cohorts, strengthening the conjecture that mutations at hotspot loci may increase risk for affecting transition from CHIP to hematologic cancer (6). Some of the hotspot mutations we list may have been overlooked in previous targeted designs for CHIP assessment. These mutations may now receive heightened priority when clinically observed as well as further investigation for functional significance and therapeutic potential.
Much remains to be discovered regarding most individual CHIP mutations, including their prevalence by ethnicity, risk for onset of nonhematologic diseases, and concomitant factors associated with their size and clonal dynamics. Using high-depth WGS and RNA sequencing, future work may expand to include the identification of clonal fusions in CHIP prevalence and risk assessments. While CHIP was previously thought to be pertinent to adults only, our analysis of noncancer cohorts identified clonal mutations at both hotspots and non-hotspots in children, indicating a need to further explore CHIP in younger-age cohorts. Our hematologic cancer–focused hotspot list will allow for subgroup analyses of CHIP in future studies, with promise to improve its prognostic performance and future blood cancer prevention research efforts.
Methods
Somatic Mutation Studies
An initial feasibility exercise was performed on 14 somatic mutation landscape studies to discover possible extraction-related problems and necessary steps to confidently identify and harmonize reported mutations usable for hotspot determination. This preliminary work was supplemented with a formal PubMed search on studies published before July 2018 with the following search criteria: (“genomic landscape,” “somatic landscape,” “genomic profile,” “mutational landscape,” OR “mutation landscape”) AND (“leukemia” OR “leukaemia”). This search returned a total of 172 papers, of which 61 were found to be review articles reporting on no new patients. We assessed studies focused on any of seven disorders: ALL, AML, CML, CMML, JMML, MPN, and MDS.
Mutation Evaluation
Each study was assessed for whether hematologic neoplasms had been investigated with NGS and for providing information that could accurately identify the genetic position of each alteration and the effect on protein coding. Studies that did not utilize NGS or reported mutations that could not be determined without ambiguity (owing to multiple transcripts for many genes, mutations from studies providing only an amino acid substitution without either an accompanying transcript identifier or genomic positions were generally deemed ambiguous) were not included. For remaining studies, all substitutions and all deletions were assessed separately for provided reference alleles to determine the combination of factors that were used in reporting mutations, specifically, cDNA or gDNA, 0-based or 1-based coordinate system, and human genome reference version. If only a trivial number of reference allele discrepancies were observed, these were discarded and the remaining variants were incorporated; otherwise, the entire group of substitutions, the entire group of indels, or the entire study was discarded. While translocations and large structural rearrangements are common for many cancers (68), these were not included in this investigation as our primary purpose was the identification of recurrent events occurring entirely within exonic regions, so that these could be queried for CHIP in WES data (in addition, less than one third of the studies provided breakpoints of large structural rearrangements at a base pair level). Many somatic landscape studies reported having performed validation of their listed mutations; uniformly, we relied on the mutations as reported by the original authors rather than requesting the original NGS data for reanalysis. Overall, our approach was designed to enrich for reported mutations having high confidence for accurate extraction rather than to collect increased numbers of mutations with less overall confidence in their authenticity.
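The reference-allele check described above can be illustrated with a small sketch. It covers only one of the factors mentioned (the coordinate convention) and uses a toy sequence in place of a genome build; the function names, the toy reference, and the example variants are all hypothetical and are not taken from the original pipeline.

```python
# Toy sequence standing in for a reference chromosome; real use would read a genome FASTA.
REFERENCE = "ACGTTGCA"


def ref_matches(pos: int, ref_allele: str, one_based: bool) -> bool:
    """Check whether the reported reference allele matches REFERENCE at pos."""
    start = pos - 1 if one_based else pos
    if start < 0:
        return False
    return REFERENCE[start:start + len(ref_allele)] == ref_allele


def count_matches(variants: list[tuple[int, str]]) -> dict[str, int]:
    """Count reference-allele matches under each coordinate convention."""
    return {
        "zero_based": sum(ref_matches(p, r, one_based=False) for p, r in variants),
        "one_based": sum(ref_matches(p, r, one_based=True) for p, r in variants),
    }


# Hypothetical reported variants as (position, reference allele).
reported = [(1, "A"), (3, "G"), (5, "TG"), (8, "A")]
matches = count_matches(reported)
print(matches)
```

The convention under which nearly all reference alleles match is taken as the one the study used; a few residual mismatches would correspond to the "trivial number of discrepancies" case, while widespread mismatches under every convention would argue for discarding the group. The same comparison could in principle be repeated across genome builds and cDNA versus gDNA numbering.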
Mutation Filtering and Harmonization
In studies that included mutations from relapse or secondary leukemia samples in addition to primary samples, mutations in the primary diagnostic samples were extracted only if they could be clearly distinguished from those in relapse or secondary leukemia samples. If there was no distinction, the entire study was excluded. For a few studies assessing samples that had been investigated in previous studies, mutations from patients reported in the older analysis were not included. Mutations in genes known for having high false-positive rates were frequently filtered by authors, and we likewise removed mutations reported in such genes from studies that had not performed such filtering. Similarly, a few studies reported all observed variants with little or no read/frequency filtering for mutation calling, and variants having minor VAFs were removed. Finally, a single RefSeq transcript was determined for each gene based on its study-utilized frequency and length, and each mutation was assessed for its effect on this transcript (listed in Supplementary Table S1). Only splice-site or protein-altering mutations were retained. A total of 48 studies (3, 11–57) satisfying all of these stringent criteria were included in the hotspot mutation analysis. Comparison with COSMIC data was performed with the most recent COSMIC database (version 92, August 2020).
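The text states that one RefSeq transcript per gene was chosen from its frequency of use across studies and its length, without giving the exact rule. One plausible reading, shown here purely as an assumption, is to prefer the transcript used by the most studies and break ties in favor of the longer transcript. The transcript identifiers and lengths below are placeholders.

```python
from collections import Counter


def pick_transcript(usage: list[str], lengths: dict[str, int]) -> str:
    """Pick the most frequently used transcript; break ties by greater length.

    This tie-breaking rule is an assumption for illustration only.
    """
    counts = Counter(usage)
    return max(counts, key=lambda t: (counts[t], lengths[t]))


# Placeholder identifiers: one entry per study that reported the gene on that transcript.
usage = ["NM_A", "NM_B", "NM_B", "NM_A", "NM_C"]
lengths = {"NM_A": 1200, "NM_B": 1500, "NM_C": 3000}

chosen = pick_transcript(usage, lengths)
print(chosen)
```

Fixing a single transcript per gene matters because the same genomic change can map to different amino acid positions on different transcripts, which would otherwise split one hotspot into several.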
Hotspot Determination
Similar to Chang and colleagues (2), we assessed mutational recurrence at amino acid and splice-site positions in protein-coding genes, but we also included indels in addition to substitutions. The number of reported protein-altering and splice-site substitutions and indels across the final set of 48 somatic mutation landscape studies was tallied at each amino acid locus, with indels tallied at the locus at which they began and splice sites tallied separately. As most studies did not provide silent or nonexonic mutations in their lists, a formal driver analysis was not possible, which may be considered a weakness of this work. Hence, we sought to determine minimalist criteria for our designation of hotspot based on deviance from expected recurrence in binomial distribution modeling that utilized the set of reported protein-altering mutations (Supplementary Methods; Supplementary Fig. S5). Various models were assessed, across which deviation from expectation became markedly pronounced at a threshold of three or more recurrences. Thus, our minimalist criterion for hotspot designation was defined as three or more patients having such mutations at the same amino acid or splice-site position, with 434 loci meeting that criterion. To determine hotspots that could be confidently used in identifying CHIP, we additionally determined the mappability of the genetic sequence surrounding each locus with multiple approaches, including an extensively repeated sequence search. This allowed us to exclude from CHIP analyses hotspots that occurred at difficult-to-map regions. Reported mutations at regions of homopolymer repeats or identical to nonrare germline SNPs were also classified as less confident loci for CHIP calling and excluded. These additional filtering steps resulted in a final total of 364 hotspot loci having increased likelihood for a driver or contributory role in hematologic malignancies, which could be confidently used for detecting CHIP.
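The recurrence tally and the three-patient threshold can be sketched in Python as follows. This is an illustration under stated assumptions rather than the modeling described in the Supplementary Methods: the toy mutation records and gene names are hypothetical, and the binomial calculation assumes that every mutation falls with equal probability on any one of a hypothetical number of candidate loci.

```python
from collections import defaultdict
from math import comb

def tally_recurrence(mutations):
    """Count distinct patients per (gene, locus_type, position).

    Indels are assumed to be recorded at the position where they begin;
    splice sites use their own locus_type so they are tallied separately.
    """
    patients = defaultdict(set)
    for patient, gene, locus_type, position in mutations:
        patients[(gene, locus_type, position)].add(patient)
    return {locus: len(ids) for locus, ids in patients.items()}

def call_hotspots(recurrence, min_patients=3):
    """Keep loci mutated in at least min_patients patients."""
    return {locus: n for locus, n in recurrence.items() if n >= min_patients}

def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))

# Toy records: (patient, gene, locus_type, position); all hypothetical.
mutations = [
    ("P1", "GENE1", "aa", 12), ("P2", "GENE1", "aa", 12),
    ("P3", "GENE1", "aa", 12), ("P3", "GENE1", "aa", 12),  # same patient twice
    ("P4", "GENE1", "aa", 13),
    ("P5", "GENE2", "splice", 5), ("P6", "GENE2", "splice", 5),
]
recurrence = tally_recurrence(mutations)
hotspots = call_hotspots(recurrence)

N_MUTATIONS = 20141           # protein-altering mutations extracted in this study
N_CANDIDATE_LOCI = 1_000_000  # hypothetical count of candidate loci
tail = binomial_tail(N_MUTATIONS, 1.0 / N_CANDIDATE_LOCI, 3)
```

Under this simplified uniform model, `tail` is the chance that one given locus is hit three or more times by chance alone; counting distinct patients rather than raw records keeps a patient reported twice from inflating recurrence.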
Noncancer Cohorts
Three large cohorts unselected for cancer and having publicly available paired-end WES data were analyzed for CHIP-associated mutations. (i) 1KG: We assessed CHIP from the FASTQ data of 2,535 WES samples coming from the 26 populations of the phase 3 1000 Genomes Project submitted to the European Nucleotide Archive (ENA) as PRJNA262923 by the Wellcome Trust Sanger Institute. Persons in this cohort were generally assumed to be unselected for disease and ≥18 years old. One sample with unavailable paired-end FASTQ data files as well as samples that were poorly sequenced or had other issues previously identified for exclusion (61) were not utilized, resulting in a total of N = 2,503 persons whose samples were deemed satisfactory. DNA had been extracted from lymphoblastoid cell lines (LCL) for most participants (61). Also as per 1000 Genomes Project Consortium and colleagues (61, 69) and The International Genome Sample Resource (www.internationalgenome.org), the populations assessed consist of African Caribbean in Barbados (ACB); Americans of African Ancestry in SW, USA (ASW); Bengali in Bangladesh (BEB); Chinese Dai in Xishuangbanna, China (CDX); Utah Residents (from CEPH families) with Northern and Western European Ancestry (CEU); Han Chinese in Beijing, China (CHB); Han Chinese South (CHS); Colombians from Medellin, Colombia (CLM); Esan in Nigeria (ESN); Finnish in Finland (FIN); British in England and Scotland (GBR); Gujarati Indian from Houston, Texas (GIH); Gambian in Western Divisions in the Gambia (GWD); Iberian Population in Spain (IBS); Indian Telugu in the UK (ITU); Japanese in Tokyo, Japan (JPT); Kinh in Ho Chi Minh City, Vietnam (KHV); Luhya in Webuye, Kenya (LWK); Mende in Sierra Leone (MSL); Mexican Ancestry from Los Angeles, USA (MXL); Peruvians from Lima, Peru (PEL); Punjabi from Lahore, Pakistan (PJL); Puerto Ricans in Puerto Rico (PUR); Sri Lankan Tamil from the UK (STU); Toscani in Italia (TSI); and Yoruba in Ibadan, Nigeria (YRI). 
(ii) SSC: We assessed the FASTQ data of 804 paired-end WES samples from Simons Simplex families deposited in ENA as PRJNA167318. These included 413 parents and 391 children. Of the children, 205 were diagnosed with autism and 186 were unaffected siblings. DNA had been extracted from whole blood as reported in Sanders and colleagues (60). Of note, this cohort has been previously analyzed for postzygotic mutations of substantial frequency (65), and while not specifically looking for CHIP, Lim and colleagues (65) did identify two of the general (and none of the hotspot) CHIP mutations we list in Supplementary Table S8.
(iii) QTRG: We assessed paired-end WES FASTQ data from the samples of 1,231 persons of the QTRG for precision medicine deposited in ENA as PRJNA290484. These included 510 persons diagnosed with type II diabetes and 270 persons designated as controls. The age spectrum of this cohort ranged from 0 to 85 years, with 25 persons being <18 years of age. DNA had been extracted from blood as reported in Fakhro and colleagues (62). While both this QTRG cohort and the SSC cohort assessed DNA from blood, a possibility for infrequent LCL-specific mutations could be present in the data of the 1KG samples. However, we found the frequency of CHIP mutations in the 1KG samples to be similar to the SSC parent cohort, implying that any LCL-specific effect may be small in magnitude.
CHIP Assessment at Hotspots
We aligned the paired-end FASTQ data files for each of the 4,538 samples in the noncancer cohorts to the human reference genome hg19 and performed subsequent deduplication, realignment, recalibration, filtering, and variant calling (additional details provided in Supplementary Methods). Classification as CHIP at hotspots required somatic mutations to have ≥3 variant reads among ≥20 total reads and to occur at one of the 364 confident amino acid hotspots identified in the recurrent mutation analysis (hence at a locus reported in at least three patients and with confident mapping). As the total number of loci at which CHIP could be called was reduced >100-fold from that used in other studies assessing CHIP across large domains (1,092 bases compared with >160,000 bases in Jaiswal and colleagues; ref. 10), we did not incorporate an additional lower-bound cutoff at hotspots. In addition, such mutations were required to yield the identical amino acid substitution or insertion/deletion effect (e.g., frameshift) as had been previously reported in at least one diagnostic tumor sample, without additional filtering for predicted benign/deleterious effect on protein. Only mutations meeting all of these criteria were used to identify CHIP at hotspots. CHIP at the 70 hotspots with greater potential for artifacts was also assessed, and a recovery analysis was performed to arbitrate any excluded mutations for inclusion. While the coverage statistics at the 364 hotspots were plausibly similar to those of other WES studies analyzed for CHIP, coverage was insufficient to detect small- to medium-sized clones at many hotspots. In a separate analysis, a classification of relaxed CHIP at hotspots was used with the sole modification of allowing ≥2 variant reads for identification.
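The read-count and identity criteria above can be expressed as a single predicate, shown here as a minimal Python sketch. The record layout, the function name, and the toy hotspot table are hypothetical; only the thresholds (≥3 variant reads, or ≥2 for the relaxed classification, among ≥20 total reads) and the requirement for an identical previously reported protein effect come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    gene: str
    aa_position: int
    protein_effect: str   # e.g., a specific substitution or "frameshift"
    variant_reads: int
    total_reads: int

def is_chip_at_hotspot(call, reported_effects, min_variant_reads=3, min_total_reads=20):
    """Apply the CHIP-at-hotspots criteria to one somatic call.

    reported_effects maps each confident hotspot (gene, aa_position) to the
    set of protein effects reported in diagnostic tumor samples.
    """
    effects = reported_effects.get((call.gene, call.aa_position), set())
    return (
        call.variant_reads >= min_variant_reads
        and call.total_reads >= min_total_reads
        and call.protein_effect in effects
    )

# Hypothetical hotspot table and calls.
reported_effects = {("GENE1", 12): {"p.A12T", "frameshift"}}
strict_hit = Call("GENE1", 12, "p.A12T", 3, 40)
two_reads = Call("GENE1", 12, "p.A12T", 2, 40)
other_effect = Call("GENE1", 12, "p.A12V", 5, 40)

strict = [is_chip_at_hotspot(c, reported_effects) for c in (strict_hit, two_reads, other_effect)]
relaxed = is_chip_at_hotspot(two_reads, reported_effects, min_variant_reads=2)
```

Keeping the variant-read threshold as a parameter lets the relaxed classification reuse the same predicate with only that one value changed.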
General CHIP Assessment
We utilized the specified mutations and domains provided in Jaiswal and colleagues (10) to assess general CHIP (clonal mutations not exclusive to hotspot mutations previously reported in hematologic cancers, although in genes or domains that have been associated with hematopoiesis). We performed filtering for mappability, sequencing artifacts, and recurrence as we did for CHIP at hotspots, with an additional imposed VAF threshold requirement of 4%. We also assessed sensitivity of CHIP prevalence by gene for lower VAF threshold levels. We calculated the proportion of hotspots in our general CHIP calls at a gene level and overall level, and similarly computed these rates on the data of previously published cohorts (5, 10).
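The sensitivity of prevalence to the VAF threshold can be illustrated with a short Python sketch. The call records, gene names, and cohort size are hypothetical, and the helper is an assumed simplification: it applies only the VAF threshold and omits the mappability, artifact, and recurrence filtering described above.

```python
from collections import defaultdict

def prevalence_by_gene(calls, n_samples, vaf_threshold=0.04):
    """Fraction of samples carrying >= 1 call per gene at or above the threshold."""
    carriers = defaultdict(set)
    for sample, gene, variant_reads, total_reads in calls:
        if total_reads > 0 and variant_reads / total_reads >= vaf_threshold:
            carriers[gene].add(sample)
    return {gene: len(samples) / n_samples for gene, samples in carriers.items()}

# Toy calls: (sample, gene, variant reads, total reads); all hypothetical.
calls = [
    ("S1", "GENE1", 5, 100),
    ("S2", "GENE1", 3, 100),
    ("S3", "GENE2", 10, 100),
    ("S3", "GENE2", 4, 100),   # second call in the same sample and gene
]
at_4_percent = prevalence_by_gene(calls, n_samples=10)
at_2_percent = prevalence_by_gene(calls, n_samples=10, vaf_threshold=0.02)
```

Because carriers are stored as sets, a sample with several qualifying mutations in one gene is counted once toward that gene's prevalence.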
Statistical Analysis
We used logistic regression analysis to determine the association of year of age with relaxed CHIP mutation occurrence, with age being mean centered. We also used logistic regression to assess the association of frequency of reported mutations at hotspots and binary detection of CHIP at those loci. As the frequency of reported mutations was right skewed, the natural logarithm of this variable was used. Sensitivity analyses using unlogged data as well as log base 10 showed similar results. Deviance from expected recurrence frequencies for mutations based on binomial distributions was used to determine the effect of various potential hotspot threshold criteria. Testing for association of variant allele fraction and reported recurrence groups was computed with the Wilcoxon–Mann–Whitney test. A two-sided Fisher exact test was used to determine the significance of differences in prevalence for the different variant thresholds and cohort pairings assessed. Additional statistical details are provided in the Supplementary Methods. All statistical computations were performed with SAS software, version 9.4 (SAS Institute).
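For readers wishing to reproduce the prevalence comparisons outside SAS, a two-sided Fisher exact test can be written with the Python standard library alone, as in the sketch below. The implementation sums the probabilities of all tables no more likely than the observed one, which is a common two-sided definition but may not match every option offered by a given statistics package; the cohort counts in the example are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(1.0, total)

# Hypothetical counts: carriers and noncarriers in two cohorts.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```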
Authors' Disclosures
J.L. Rodriguez-Flores is a full-time employee of Regeneron Pharmaceuticals Inc. C.C. Mason reports other from Intermountain Healthcare Foundation (funding for the Pediatric Cancer Program) and Primary Children's Hospital Foundation (funding for the Pediatric Cancer Program), and grants from Primary Children's Center for Personalized Medicine during the conduct of the study. No disclosures were reported by the other authors.
Disclaimer
The content and expressed viewpoints are those of the authors only.
Authors' Contributions
J.E. Feusier: Data curation, software, validation, investigation, writing–original draft, writing–review and editing. S. Arunachalam: Data curation, software, validation, investigation, writing–review and editing. T. Tashi: Conceptualization, writing–review and editing. M.J. Baker: Data curation, writing–review and editing. C. VanSant-Webb: Data curation. A. Ferdig: Data curation, writing–review and editing. B.E. Welm: Funding acquisition, writing–review and editing. J.L. Rodriguez-Flores: Validation, writing–review and editing. C. Ours: Conceptualization, writing–review and editing. L.B. Jorde: Conceptualization, funding acquisition, writing–review and editing. J.T. Prchal: Conceptualization, writing–review and editing. C.C. Mason: Conceptualization, resources, data curation, software, formal analysis, supervision, funding acquisition, validation, investigation, visualization, methodology, writing–original draft, project administration, writing–review and editing.
Acknowledgments
The authors thank the many researchers who published the data and details of their studies as well as the participants in those studies, and acknowledge their original publications and the European Nucleotide Archive as sources of the primary data. They also thank Allyson Mower and Michele Ballantyne for invaluable advice, and the many publishers who granted text and data mining permission. C.C. Mason acknowledges Pediatric Cancer Program funding, which is supported by the Intermountain Healthcare and Primary Children's Hospital Foundations as well as the Department of Pediatrics and Division of Pediatric Hematology/Oncology at the University of Utah. Some work by J.E. Feusier was supported by National Center for Advancing Translational Sciences (NCATS)/NIH (UL1TR002538/TL1TR002540). Some work by S. Arunachalam was funded by a U.S. Department of Defense grant (W81XWH-14-1-0417; to B.E. Welm). L.B. Jorde acknowledges funding from NIH (GM118335/GM059290). The support and resources from the Center for High Performance Computing at the University of Utah are gratefully acknowledged.