Abstract
A breast cancer genome is a record of the historic mutagenic activity that has occurred throughout the development of the tumor. Indeed, every mutation may be informative. Although driver mutations were the main focus of cancer research for a long time, passenger mutational signatures, the imprints of DNA damage and DNA repair processes that have been operative during tumorigenesis, are also biologically illuminating. This review is a chronicle of how the concept of mutational signatures arose and brings the reader up-to-date on this field, particularly in breast cancer. Mutational signatures have now been advanced to include mutational processes that involve rearrangements, and novel cancer biological insights have been gained through studying these in great detail. Furthermore, there are efforts to take this field into the clinical sphere. If validated, mutational signatures could thus form an additional weapon in the arsenal of cancer precision diagnostics and therapeutic stratification in the modern war against cancer. Clin Cancer Res; 23(11); 2617–29. ©2017 AACR.
See all articles in this CCR Focus section, “Breast Cancer Research: From Base Pairs to Populations.”
Introduction: Breast Cancer Genomics—Access All Areas
The central tenet of cancer research has for decades been the identification of somatic driver mutations that are causally implicated in tumorigenesis (1). Thus, a host of breast cancer driver events are now known (2), including copy number aberrations (3–8), such as the ERBB2 and CCND1 amplification loci and homozygous deletions of CDKN2A/B and PTEN, and high-frequency substitution and insertion/deletion (indel) driver mutations in cancer genes like TP53 (frequency ∼53%), PIK3CA (8%–26%), CDH1 (21%), AKT1 (8%), and GATA3 (4%; refs. 9–12). Separately, extensive germline exploration has led to documentation of rare, high-penetrance (BRCA1, BRCA2, TP53; refs. 13, 14), moderate-penetrance (PTEN, STK11, CDH1, ATM, CHEK2, BRIP1, PALB2; refs. 15–19), and common, low-penetrance risk alleles (20–24) for developing breast cancer (25). Essentially, enormous effort has been invested in breast cancer classification based on somatic and germline mutation information, histopathologic markers, copy number, and expression profiles (9, 26, 27), all aimed at improving diagnostic, prognostic, and therapeutic stratification.
When massively parallel sequencing arrived in the late 2000s (28), sequencing speed increased by orders of magnitude, permitting access to large swathes of the human genome that were not previously accessible at a reasonable cost. In a striking testament to this technology, five back-to-back breast cancer articles were published in 2012 (9–12), providing a thorough view of the molecular foundations of breast cancer and saturating driver discovery in coding sequences (29). Quite apart from the mere handful of driver mutations present in each tumor, modern sequencing technologies enabled us to access the many thousands of passenger mutations present in each cancer as well. Herein lies a significant realization: passenger mutations are not simply random manifestations or mutational debris. They represent the scars of biological processes that have gone awry during cancer development and are, therefore, a rich historical record of tumorigenesis (30).
Mutational Signatures: Making Sense of the Mayhem
The following model was previously proposed: At the point of a patient's cancer diagnosis, the set of somatic mutations revealed through sequencing of the tumor is the aggregate outcome of one or more mutational processes (30–32). Each process, defined by the mechanisms of DNA damage and DNA repair that constitute it, leaves a characteristic imprint or mutational signature on the cancer genome (Fig. 1). The final catalog of mutations is also determined by the intensity and duration of exposure to each mutational process (Fig. 1). Some may be weak or moderate in their intensity, whereas others may be very strong in their assertion. In addition, some exposures may be ongoing through the entire lifetime of the patient, even preceding the formation of the cancer, and some may commence late or become dominant later in tumorigenesis (Fig. 1). Furthermore, cancers comprise subclonal populations, which may be variably exposed to each mutational process (33, 34), promoting complexity of the final landscape of somatic mutations in a cancer genome.
Figure 1. Somatic mutational processes in human cancer. Each mutational process leaves a characteristic imprint, or mutational signature, on the cancer genome, comprising DNA damage and DNA repair components. The arrows indicate the duration and intensity of exposure to a specific mutational process. The amount of exposure to each mutational process could vary from one person to another. Mutational processes A, B, C, and D represent hypothetical mutational processes that have occurred through the lifetime of the developing tumor. A could represent a normal mutational process that happens in all our cells (including normal cells); hence, it occurs in a small amount throughout life. B could represent a mutational process caused by an environmental insult, such as an occupational exposure to a carcinogen. C could represent a mutational process that occurs in bursts through tumorigenesis, such as intermittent exposure to a chemical or an intermittent disease process. D could represent the acquisition of a defect in a gene involved in normal DNA repair. The final mutational portrait is a composite of all the mutational processes that have been active over the lifetime of the cancer patient. A different patient could have all of these mutational processes occurring in their tumor or could have some of the same mutational processes as well as other mutational processes present.
Base Substitution Mutational Signatures in Breast Cancer
In 2012, the 183,016 substitutions present in 21 whole breast cancer genomes were used in a proof-of-principle exercise to demonstrate the existence of mutational signatures (30, 33). Critically, sequence context immediately 5′ and 3′ to each mutated base was taken into consideration in classifying each substitution. As there are six classes of base substitution and 16 possible sequence contexts for each mutated base (A, C, G, or T at the 5′ base and A, C, G, or T at the 3′ base), there are 96 possible mutated trinucleotides for each tumor. Various mathematical methods were explored and finally, nonnegative matrix factorization was used to extract five substitution signatures present in these tumors (signatures A–E, now known as signatures 1B, 2, 3, 8, and 13; refs. 30, 33; Fig. 2).
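The classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the bracket labels such as `A[C>T]G` and the function names `channels` and `classify` are assumptions introduced here, as is the convention of reporting purine-reference mutations on the opposite strand so that every substitution falls into one of the six pyrimidine classes.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# The six substitution classes, written with a pyrimidine (C or T) reference base.
SUBSTITUTION_CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def channels():
    """List the 96 channels: 6 classes x 4 possible 5' bases x 4 possible 3' bases.

    Ordered by class, then 5' base, then 3' base.
    """
    return [
        f"{five}[{sub}]{three}"
        for sub, five, three in product(SUBSTITUTION_CLASSES, BASES, BASES)
    ]


def classify(five, ref, alt, three):
    """Map one substitution and its flanking bases to a channel label.

    If the reference base is a purine (A or G), the mutation is reported on
    the opposite strand: bases are complemented and the flanks swap sides.
    """
    if ref in "AG":
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
        five, three = three.translate(COMPLEMENT), five.translate(COMPLEMENT)
    return f"{five}[{ref}>{alt}]{three}"


ALL_CHANNELS = channels()
print(len(ALL_CHANNELS))             # 96
print(classify("T", "C", "T", "A"))  # T[C>T]A
print(classify("T", "G", "A", "C"))  # G[C>T]A (reported on the opposite strand)
```

Counting the labels returned by `classify` for every substitution in a tumor gives that tumor's 96-element mutation catalog, which is the input to signature extraction.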
Figure 2. Currently known extracted substitution mutational signatures in human breast cancers. A, Table of 12 mutational signatures extracted using nonnegative matrix factorization. Each signature is ordered by mutation class (C>A/G>T, C>G/G>C, C>T/G>A, T>A/A>T, T>C/A>G, T>G/A>C), taking immediate flanking sequence into account, resulting in 96 triplets. For each class, mutations are ordered by 5′ base (A, C, G, T) first, before 3′ base (A, C, G, T). Y-axis reports the probability of a signature generating each of the 96 triplets. Signature extraction was performed separately in 17 cancer types. The bars report the results of the extraction on the 560 breast cancers (37) using a widely available algorithm run with default parameters (38), and the error bars demonstrate the variability (of the presumptive same signatures) between cancers of different tissue types. The table also contains the associated etiologies of each signature, the prevalence of these signatures in breast cancer, and whether the signature is also seen in other tumor types. HR, homologous recombination. B, Absolute numbers of mutations of each signature in each sample (top) and proportion of each signature in each sample (bottom). Panel B reprinted by permission from Macmillan Publishers Ltd.: Nature 534:47–54, copyright 2016.
Subsequently, a methods article (35) and a landmark article (32) were published where this mathematical approach was applied across 30 cancer types involving 7,042 samples [507 whole-genome sequencing (WGS) and 6,535 whole-exome sequencing (WES)] and revealed 21 substitution signatures altogether (http://cancer.sanger.ac.uk/cosmic/signatures). The number of breast cancers available for analysis had increased considerably to 100 WGS and 800 WES tumors. Reassuringly, the same five substitution signatures that were recognized previously were consistently identified in this larger dataset (30, 32), reinforcing conviction in the concept of mutational signatures and in the methods applied to extract them.
In a recent endeavor exploring 560 WGS breast tumors (36), the largest cohort of WGS cancers of a single tissue type to date, a total of 12 substitution signatures were identified from 3,479,652 mutations (Fig. 2A). This may superficially appear to be a substantial surge in signature discovery in breast tumors. On close inspection, many of the new signatures are relatively rare, present in few samples (36). Thus, in a similar paradigm to that of drivers, we have likely saturated the discovery of high-frequency, common mutational signatures in breast cancer. Sequencing further primary breast tumors is unlikely to yield new, major signatures. The increase in power possibly permits disambiguation of closely correlated signatures. Signatures 1 and 5, hitherto classified as signature 1B, were only just separated by this analysis. Many different algorithms are available today for mutation signature extraction (37–41)—some may reveal 11 (with signature 1B) or 12 signatures (signatures 1 and 5) from this dataset, or with more relaxation of parameters, even 13 (Supplementary Data). Regardless, that five signatures were consistently seen when as few as 21 samples were studied reveals that these early signatures are robust and common, and report ubiquitously present mutagenic processes in breast cells.
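To make the idea of extraction concrete, the following is a toy sketch of nonnegative matrix factorization in plain Python. It assumes a catalog matrix with one row per mutation channel and one column per sample, and factorizes it into signatures `p` (channels by signatures) and exposures `e` (signatures by samples). The multiplicative update rule, the function names, and the four-channel toy data are all assumptions chosen for illustration. Published extraction tools add further steps that are not shown here, and choosing the number of signatures is exactly where algorithms can disagree.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(m, p, e):
    """Squared Frobenius distance between m and its reconstruction p x e."""
    r = matmul(p, e)
    return sum((m[i][j] - r[i][j]) ** 2
               for i in range(len(m)) for j in range(len(m[0])))


def nmf(m, n_signatures, n_iter=300, seed=0, eps=1e-12):
    """Factorize m (channels x samples) into nonnegative p and e.

    Uses multiplicative updates, which keep every entry nonnegative.
    """
    rng = random.Random(seed)
    rows, cols = len(m), len(m[0])
    p = [[rng.random() + 0.1 for _ in range(n_signatures)] for _ in range(rows)]
    e = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        pt = transpose(p)
        num, den = matmul(pt, m), matmul(matmul(pt, p), e)
        e = [[e[k][j] * num[k][j] / (den[k][j] + eps) for j in range(cols)]
             for k in range(n_signatures)]
        et = transpose(e)
        num, den = matmul(m, et), matmul(p, matmul(e, et))
        p = [[p[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_signatures)]
             for i in range(rows)]
    return p, e


def normalise(p, e):
    """Rescale so each signature (column of p) sums to 1; p x e is unchanged."""
    sums = [sum(row[k] for row in p) or 1.0 for k in range(len(e))]
    p = [[row[k] / sums[k] for k in range(len(sums))] for row in p]
    e = [[v * sums[k] for v in e[k]] for k in range(len(sums))]
    return p, e


# Toy data: 4 channels, 5 samples, generated from two made-up signatures.
true_sigs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
true_exposures = [(100, 0), (0, 200), (50, 50), (300, 100), (20, 80)]
catalog_matrix = [[true_sigs[0][i] * a + true_sigs[1][i] * b
                   for a, b in true_exposures] for i in range(4)]

p, e = normalise(*nmf(catalog_matrix, n_signatures=2))
print(frob_error(catalog_matrix, p, e))
```

After normalization, each column of `p` can be read as the probability of a signature generating each channel, and each row of `e` as the number of mutations that signature contributes to each sample.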
Of the 12 signatures now documented in breast cancer (ref. 36; Fig. 2A), signature 1B or signatures 1 and 5 are associated with age of diagnosis; signatures 2 and 13 are associated with the activity of the APOBEC cytidine deaminases; signature 3 is associated with BRCA1/BRCA2 deficiency; signature 8 appears to be increased in tumors with BRCA1/BRCA2 deficiency, although also present at lower levels in other tumors; signatures 6, 20, and 26 are associated with mismatch repair deficiency; and signatures 17, 18, and 30 are of unknown etiology (36). Of note, these mutational signatures do not appear to demonstrate specificity to breast cancer subtype whether classified by estrogen receptor (ER) status or other systems such as PAM50 or AIMS.
Most breast tumors have fewer than 20,000 substitutions in total (fewer than 6.5 mutations per Mb; Fig. 2B). Only a handful of samples have a very large number of mutations (up to 94,000 substitutions; Fig. 2B). Irrespective of mutation burden, the vast majority of samples comprise multiple mutational signatures (36). A subset of samples may be composed predominantly of specific signatures and may even be overwhelmed by a very large number of mutations from these signatures; such samples are termed “hypermutators” (42). This trait is associated with certain mutational processes: signatures 2, 13, 6, 20, 26, and 17 in breast cancers (36). Indeed, some of these signatures (signatures 8, 13, and 17) appear to dominate later in breast tumorigenesis (43), being observed late in cancer evolution (33, 36) and in metastatic disease (44). Perhaps, in time, these associations will be definitively verified as harbingers of poorer outcomes.
It was previously observed that substitution signatures had particular relationships with classes of indels (30). Patients with germline BRCA1/BRCA2 mutations exhibited an excess of larger indels (>3 bp) with microhomology present at breakpoint junctions (30). Moreover, tumors with signature 6, 20, or 26, which are associated with mismatch repair deficiency, have a large number of indels at polynucleotide repeat tracts, consistent with a label of microsatellite instability in these cancers (32, 36). Thus, correlations are observable between substitution signatures and crude indel patterns.
Advancing the Frameworks of Mutational Signatures
Mutational processes in human somatic cells are not restricted to producing base substitutions. Indeed, DNA damage and DNA repair processes can generate patterns of indels and large-scale chromosomal aberrations or structural variation as well (31). Thus, the basic premise of mutational signatures was recently extended to structural variation in breast cancer (36).
Genomic instability is a broad concept that encompasses a wide range of chromosomal level abnormalities. Some tumors have a large number of rearrangements (several hundred) that are focused or “clustered” at specific loci reporting driver amplicons (e.g., CCND1, ERBB2) or are simply sites of chromothripsis (45), for example. In contrast, other tumors could have an equivalent number of rearrangements but have them widely distributed throughout the genome instead. Intuitively, different mutational processes are likely to underpin these disparate genomic outcomes (36).
Rearrangements were thus separated according to whether they were clustered or dispersed (Fig. 3A), and then by rearrangement class (tandem duplication, deletion, inversion, or translocation; Fig. 3B) and by size (36). Following this classification, we applied the same mathematical framework, as described previously, and extracted six rearrangement signatures (RS; ref. 36; Fig. 3C). This exercise of defining rearrangement signatures was not simply academic—unsupervised hierarchical clustering yielded seven major subgroups (groups A–G) that exhibited distinct associations with other genomic, histologic, gene expression, and clinical features (ref. 36; Fig. 4).
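The three-step classification (clustered or dispersed, then rearrangement class, then size) can be sketched as a small data structure. This is an illustration under stated assumptions: the size-bin boundaries below are hypothetical, chosen only to bracket the <10 kb and >100 kb ranges discussed in this review; the clustered flag is taken as an input rather than computed; and translocations are given no size bin because they join different chromosomes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

KINDS = ("tandem_duplication", "deletion", "inversion", "translocation")
# Hypothetical size bins as (exclusive upper bound in bp, label) pairs.
SIZE_BINS = [
    (10_000, "<10kb"),
    (100_000, "10-100kb"),
    (1_000_000, "100kb-1Mb"),
    (10_000_000, "1-10Mb"),
]


@dataclass(frozen=True)
class Rearrangement:
    kind: str                      # one of KINDS
    clustered: bool                # True if part of a focal cluster
    size_bp: Optional[int] = None  # None for translocations


def size_bin(size_bp):
    """Return the label of the first bin whose upper bound exceeds size_bp."""
    for upper, label in SIZE_BINS:
        if size_bp < upper:
            return label
    return ">10Mb"


def category(r):
    """Build the category label used as one channel of the rearrangement catalog."""
    if r.kind not in KINDS:
        raise ValueError(f"unknown rearrangement class: {r.kind}")
    prefix = "clustered" if r.clustered else "dispersed"
    if r.kind == "translocation":
        return f"{prefix}|translocation"
    if r.size_bp is None:
        raise ValueError("size_bp is required for intrachromosomal events")
    return f"{prefix}|{r.kind}|{size_bin(r.size_bp)}"


def catalog(rearrangements):
    """Count rearrangements per category for one tumor."""
    return Counter(category(r) for r in rearrangements)


tumor = [
    Rearrangement("tandem_duplication", False, 250_000),
    Rearrangement("tandem_duplication", False, 400_000),
    Rearrangement("deletion", False, 5_000),
    Rearrangement("translocation", True),
]
print(catalog(tumor))
```

With these hypothetical bins there are 2 × (3 × 5 + 1) = 32 categories per tumor, and the resulting counts play the same role for rearrangements that the 96-channel catalog plays for substitutions.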
Figure 3. Extracting rearrangement mutational signatures in human breast cancers. A, Whole genome Circos plots were adapted from the R Circos package. Features depicted in Circos plots from outermost rings heading inwards: Karyotypic ideogram outermost. Base substitutions next, plotted as rainfall plots (log10 intermutation distance on radial axis, dot colors: blue = C>A; black = C>G; red = C>T; gray = T>A; green = T>C; pink = T>G). Ring with short green lines = insertions; ring with short red lines = deletions. Major copy number allele ring (green = gain), minor copy number allele ring (pink = loss); central lines represent rearrangements (green = tandem duplications; pink = deletions; blue = inversions; and gray = interchromosomal events). Note the difference in the nature of the distribution of rearrangements between the two tumors depicted. The whole genome profile on the left has >300 rearrangements that are clustered at distinct loci in specific chromosomes. In contrast, the >300 rearrangements present in the profile on the right-hand side are uniformly dispersed through the genome. The mutational processes underpinning the differing distributions in these two tumors are most likely to be different. Thus, separating rearrangements into whether they are clustered or dispersed represents a first step in the rearrangement classification. B, Types of rearrangements that can be ascertained easily. The hypothetical pieces of reference DNA from two different chromosomes on the left can be rearranged to form four main classes of rearrangements, as shown on the right. This is a second step in the classification of rearrangements prior to rearrangement signature extraction. The rearrangements are also divided by size before extraction. C, Six rearrangement signatures extracted using nonnegative matrix factorization. Probability of rearrangement element on y-axis. Rearrangement size on x-axis.
Chr, chromosome; Del, deletion; Inv, inversion; Tds, tandem duplication; Trans, translocation.
Figure 4. The spectrum of signatures within 560 breast cancers and individual patient whole genome profiles. The panels in the middle represent, from top to bottom: BRCA1- or BRCA2-null samples (dark purple) versus what are believed to be non-BRCA1/BRCA2–mutated samples (light purple), ER status (black = positive; gray = negative), proportions of substitution signatures, rearrangement signatures, and indel patterns present in the 560 patients. Figure legends are provided at the top of the figure. Samples are ordered according to hierarchical clustering performed on rearrangement mutational signatures. Six whole genome profiles of individual patients are shown to demonstrate how individualized each cancer genome is per patient. Note the striking differences between the six patients, even within the same “group” (groups B and G). Group D is enriched with BRCA1-null tumors, group G is enriched with BRCA2-null tumors, and group F is enriched with tumors that are BRCA-like but different and are never genetically BRCA1 null. Ins, insertions; Mh, microhomology mediated; Rep, polynucleotide repeat-tract mediated.
Three of the signatures feature in homologous recombination (HR)–deficient tumors. RS1, dominated by long (>100 kb) tandem duplications, characterized many HR-deficient tumors but defined group F tumors, which were associated with older age of diagnosis and poorer outcome in this small cohort. RS3, characterized by short (<10 kb) tandem duplications, was specific to BRCA1-mutant tumors (group D). RS5, defined by deletions (<10 kb), is present in BRCA1- and BRCA2-deficient samples and typified group G BRCA2-mutated samples (36). Hence, we were able to differentiate BRCA1- from BRCA2-null tumors, as well as a BRCA-like (but different) cohort with distinct clinical features (36). These diverse groups would have simply been labeled as having “genomic instability” in the past and been indistinguishable (Fig. 4).
Of the remaining rearrangement signatures, RS2, characterized by large (>100 kb) nonclustered deletions, inversions, and interchromosomal translocations, defined group E ER-positive tumors (36). In contrast, RS4 and RS6 were both characterized by clustered rearrangements and were enriched in groups A, B, and C, which were of mixed ER status but frequently had large driver amplicons, for example, ERBB2 and CCND1 (36).
Remarkably, deep analysis of individual rearrangement signatures has unearthed a novel, if somewhat disturbing, biological insight. Very recently, 33 loci were identified as sites that are rearranged by long RS1 tandem duplications more frequently than expected in independent tumors from different patients, even if by only a single tandem duplication (46). Interestingly, these hotspots are enriched for breast cancer germline susceptibility loci, breast-specific super-enhancer regulatory elements, and oncogenes (46). These loci have high transcriptional activity in breast tissue and are susceptible to double-strand break (DSB) damage and, following DSB repair, to formation of rearrangements. Yet, not all classes of rearrangements are represented at these sites—only long RS1 tandem duplications. It was hypothesized that long tandem duplications are more likely to effectively increase whole copies of these regulatory elements/genes, and that this could confer some degree of secondary selective pressure, even if incrementally (46). Indeed, corroborative transcriptomic evidence was observed to support this postulate, providing a devastating insight into this mutational process of HR deficiency: It may commence as a passenger mutational signature but, unwittingly, creates secondary driver events. RS1 is, therefore, a particularly deleterious genetic mechanism—an injurious mutational signature that perpetuates carcinogenesis (46).
This field of rearrangement signatures may only be in its infancy, but a number of deep messages are appearing, although clinical significance requires further evaluation. An exciting future awaits as the field matures and other tissue types are incorporated into these analyses.
Localized Mutational Signatures
The substitution signatures described thus far report mutagenesis distributed throughout the human genome. Intriguingly, localized mutagenesis has also been reported (30). By calculating an intermutation distance, or the distance from a substitution to the one immediately preceding it in the reference genome, we were able to appreciate focal substitution hypermutation (30). Although most mutations in a cancer genome would exhibit an intermutation distance of approximately 10⁵ bp to approximately 10⁶ bp, localized regions of hypermutation or “kataegis” presented as clusters of substitutions with shorter intermutation distances (defined as six or more substitutions with an average intermutation distance of <1,000 bp; refs. 30, 32, 36). These focal mutation showers had striking characteristics: an excess of cytosine mutations at a TpC sequence context and colocalization with a different class of mutation altogether, rearrangements.
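The working definition given above (six or more substitutions with an average intermutation distance below 1,000 bp) can be sketched as a simple scan along one chromosome. The greedy window-extension strategy and the function names here are assumptions made for illustration and are a simplification: published analyses may segment the genome differently, and this sketch does not check sequence context or proximity to rearrangements.

```python
def intermutation_distances(positions):
    """Distance from each substitution to the one immediately preceding it."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def find_kataegis(positions, min_mutations=6, max_mean_distance=1000):
    """Return (start, end, n_mutations) for each candidate focus on one chromosome.

    A window of k consecutive mutations has mean intermutation distance
    (last - first) / (k - 1). Each qualifying window is extended greedily
    while that mean stays below the threshold.
    """
    ordered = sorted(positions)
    n = len(ordered)

    def mean_distance(first, last):
        return (ordered[last] - ordered[first]) / (last - first)

    foci = []
    i = 0
    while i + min_mutations <= n:
        j = i + min_mutations - 1  # last index of the smallest allowed window
        if mean_distance(i, j) < max_mean_distance:
            while j + 1 < n and mean_distance(i, j + 1) < max_mean_distance:
                j += 1
            foci.append((ordered[i], ordered[j], j - i + 1))
            i = j + 1
        else:
            i += 1
    return foci


# Made-up substitution coordinates: sparse background plus one dense shower.
positions = [100_000, 500_000, 1_000_000, 1_000_200, 1_000_500, 1_000_900,
             1_001_300, 1_001_800, 1_002_000, 2_500_000, 4_000_000]
print(intermutation_distances(positions)[:3])
print(find_kataegis(positions))
```

Plotting the output of `intermutation_distances` on a log10 axis against genomic position gives the rainfall plot on which such foci appear as vertical showers of points.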
Kataegis mutations bear a strong resemblance to those of genome-wide signatures 2 and 13, which are associated with APOBEC enzymatic activity (47–49). APOBECs are a family of cytidine deaminases that evolved to restrict retroviruses and retrotransposon elements. APOBECs require single-stranded DNA (ssDNA) as a substrate for deamination of cytosine to uracil. Notably, experimental studies in yeast suggest that DSBs and end resection are a source of ssDNA required for APOBECs to generate kataegis (47). In contrast, alternative cellular processes such as replication or transcription have been hypothesized as a potential fount of ssDNA for APOBEC activity–generating signatures 2 and 13 (31). Thus, although APOBEC enzymes are involved in kataegis, and genome-wide signatures 2 and 13, they are believed to be mechanistically distinct mutational processes likely arising at different instances of cellular stress (Fig. 5).
Figure 5. Mechanistic insights from mutagenesis: the APOBEC family of enzymes in genome-wide (signatures 2 and 13) and localized mutational signatures (kataegis). On the basis of the predominant cytosine mutagenesis at a TpC sequence context, the APOBEC family of enzymes has been implicated in causing both localized kataegis and genome-wide signatures 2 and 13. A, APOBECs cause DNA damage, particularly on ssDNA, by deaminating cytosine into uracil. Uracil-N-glycosylase (UNG) first removes uracil before other components of the Base Excision Repair pathway restore the damaged DNA to its original state. If DNA is uncorrected and enters replication as uracil or an abasic site, then the possibilities are of generating C>T transition or C>G and C>A transversion mutations. B, Although APOBECs are involved in both localized and genome-wide mutagenesis, there is mounting experimental and analytic evidence to support the hypotheses that these signatures arise by different mechanisms. Kataegis is believed to require a DSB to arise first, before end resection of the DSB leaves ssDNA exposed for APOBEC deamination (left). In contrast, APOBEC deamination that gives rise to signatures 2 and 13 requires long stretches of ssDNA that could occur during uncoupling of the leading and lagging replication strands (right).
Interestingly, an alternative form of kataegis was also observed, albeit rarely (0.9% of all kataegis foci identified in breast cancer; ref. 36). Also colocalizing with rearrangements, this version of kataegis exhibited a different base substitution pattern of T>G and T>C mutations predominantly at NTT and NTA sequences. The etiology of this form of kataegis is unknown.
Dynamic Cellular Processes and Mutational Signatures
The distribution of somatic mutations through cancer genomes is uneven, has been extensively studied, and has been found to be largely influenced by replication time domains and histone epigenetic marks (50, 51). Predicated on being able to probabilistically assign every mutation in a cancer to a mutational signature, similar analyses have now been performed per mutational signature (36). Because mutational signatures are proxies for specific biological processes, the advantage of performing these analyses per signature is that one can interpret the influence of dynamic cellular events, such as replication, transcription, and nucleosome occupancy, on the associated biological processes (ref. 36; Table 1).
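One plausible way to assign a mutation to signatures probabilistically is sketched below. It is an assumption for illustration, not a description of the published method: the probability that a mutation in a given channel came from a given signature is taken to be proportional to that signature's exposure in the sample multiplied by the probability of the signature generating that channel. The signature names, channel probabilities, and exposures are made up.

```python
def assign_probabilities(channel, exposures, signatures):
    """Return {signature: probability} for one mutation in one sample.

    exposures:  {signature: number of mutations attributed to it in the sample}
    signatures: {signature: {channel: probability of generating that channel}}
    """
    weights = {
        name: exposures[name] * signatures[name].get(channel, 0.0)
        for name in exposures
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no active signature can generate channel {channel}")
    return {name: weight / total for name, weight in weights.items()}


# Made-up example: two signatures, each defined over two channels only.
signatures = {
    "sig_A": {"T[C>T]A": 0.30, "A[C>T]G": 0.05},
    "sig_B": {"T[C>T]A": 0.02, "A[C>T]G": 0.40},
}
exposures = {"sig_A": 1000, "sig_B": 3000}

for channel in ("T[C>T]A", "A[C>T]G"):
    print(channel, assign_probabilities(channel, exposures, signatures))
```

Summing these per-mutation probabilities within a genomic feature, for example within early- or late-replicating regions, gives the expected number of mutations from each signature in that feature, which is the quantity such per-signature analyses compare.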
Table 1. Summary of relationships between each mutational signature and various genomic features. The 20 mutational signatures are noted in the left-most column. This is followed by information on mutation classes, features that predominantly characterize each signature, and associated etiologies, if known. Relationships relating to transcriptional strands, replication time and strands, and chromatin organization are also noted.
Mutational signature | Mutation type | Predominant features of signature | Associated mutational process | Transcriptional strand | Replicative strand | Replication time | Chromatin organization |
---|---|---|---|---|---|---|---|
1 | Sub | C>T at CpG | Deamination of methyl-cytosine (age associated) | Some bias | Enriched late | ||
5 | Sub | T>C | Uncertain (age associated) | Some bias | Some bias | Enriched late | Slight enrichment at linker |
2 | Sub | C>T at TpCpN | APOBEC related | Some bias | Strong lagging strand bias | Enriched late | |
13 | Sub | C>G at TpCpN | APOBEC related | Some bias | Strong lagging strand bias | Flat | |
6 | Sub | C>T (and C>A and T>C) | MMR deficient | Some bias | Flat | ||
20 | Sub | C>A (and C>T and T>C) | MMR deficient | Some bias | Enriched late | ||
26 | Sub | T>C | MMR deficient | Some bias | Strong bias | Enriched late | Enriched at linker |
3 | Sub | HR deficient | Some bias | Some bias | Enriched late | ||
8 | Sub | C>A | Amplified by HR deficiency? | Some bias | Enriched late | ||
18 | Sub | C>A | Uncertain | Some bias | Some bias | Enriched late | Enriched at nucleosomes and periodic |
17 | Sub | T>G | Uncertain | Some bias | Enriched late | Enriched at nucleosomes and periodic | |
30 | Sub | C>T | Uncertain | Flat | |||
RS1 | Rearr | Large tandem duplications (>100 kb) | Uncertain type of HR deficiency? | NA | NA | Enriched early | |
RS2 | Rearr | Dispersed translocations | NA | NA | Enriched early | ||
RS3 | Rearr | Small tandem duplications (<10 kb) | HR deficiency (BRCA1) | NA | NA | Enriched early | |
RS4 | Rearr | Clustered translocations | NA | NA | Enriched early | ||
RS5 | Rearr | Deletions | HR deficient | NA | NA | Enriched early | |
RS6 | Rearr | Other clustered rearrangements | NA | NA | Enriched early | ||
Repeat-med | Indel | <3 bp indel at polynuc tract | MMR deficient | NA | NA | Enriched late | Enriched at linker and periodic |
Microhom | Indel | ≥3 bp indel with MMEJ-junctions | HR deficient | NA | NA | Enriched late |
Abbreviations: MMEJ, microhomology-mediated end joining; NA, not available; polynuc, polynucleotide; Rearr, rearrangement; Sub, substitution.
Reprinted by permission from Macmillan Publishers Ltd.: Nature Communications 7:11383, copyright 2016.
For example, one of the most noteworthy insights obtained from this analysis was the degree of asymmetry observed between replication strands for particular signatures. For approximately 100,000 mutations on the leading replicative strand, approximately 140,000 mutations were observed on the lagging strand specifically for APOBEC-related signatures 2 and 13 (36). This level of asymmetry implies that replication has a mechanistic role in the generation of signatures 2 and 13 (Fig. 5). APOBECs demand ssDNA as a deamination substrate, and replication is a perfect physiologic source of ssDNA. Indeed, in 2016, four other publications supported this observation through in vivo (52, 53) and in vitro (54, 55) studies. Replication strand asymmetry was also observed for signature 26 (36), one of the four mutational signatures associated with deficiency of mismatch repair. Had these analyses been performed on all mutations combined, the specific behaviors (Table 1) would not have been appreciable—the signal diluted by aggregation. Thus, these vignettes demonstrate the value of performing analyses as mutational signatures.
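The size of this asymmetry can be checked with simple arithmetic. The sketch below takes the approximate counts quoted above and reports the lagging-to-leading ratio, the fraction of mutations on the lagging strand, and a z-score from a normal approximation to a binomial test. The function name is hypothetical, and the test rests on the assumption that, in the absence of any bias, mutations would fall on either strand with equal probability; it ignores differences in mutational opportunity between strands.

```python
import math


def strand_asymmetry(leading, lagging):
    """Summarise replication-strand asymmetry from two mutation counts."""
    if leading <= 0 or lagging < 0:
        raise ValueError("leading must be positive and lagging non-negative")
    total = leading + lagging
    ratio = lagging / leading
    lagging_fraction = lagging / total
    # Normal approximation to a binomial test of lagging_fraction == 0.5
    # (assumes equal opportunity for mutation on both strands).
    z_score = (lagging - 0.5 * total) / math.sqrt(0.25 * total)
    return ratio, lagging_fraction, z_score


# Approximate counts quoted in the text for signatures 2 and 13.
ratio, lagging_fraction, z_score = strand_asymmetry(100_000, 140_000)
```

With counts of this magnitude, even a modest departure from an even split lies far outside what sampling variation alone would produce, which is why the asymmetry is read as mechanistic rather than incidental.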
Ultimately, a profound theme has crystallized. Different signatures exhibit different relationships with replication, transcription, and chromatin organization, reinforcing the view that mutational signatures are true biological phenomena and not simply theoretical, mathematical constructs.
In the Midst of Chaos, Lies Opportunity
Some mutational signatures are a direct pathophysiologic read-out of the abrogation of a DNA repair gene/pathway and could be used as a biomarker to report DNA repair deficiency in a tumor (31, 56). Somatic nullness of a single gene, such as BRCA1, however, does not simply produce one mutational signature; it produces a multitude of mutational patterns (36). On one hand, this complicates an already burdened mutational landscape. Conversely, this could be used to our advantage for potential clinical applications.
Very recently, a supervised Lasso logistic regression model was used to learn the multiple substitution, indel, and rearrangement mutational signatures that distinguish germline BRCA1/BRCA2–mutated cancers from sporadic tumors (57). Six mutational patterns were found to be discriminatory and were weighted to create a mutational signature–based predictor of BRCA1/BRCA2 deficiency called HRDetect (57).
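The general shape of such a predictor can be sketched as a weighted logistic model. The example below is illustrative only and does not reproduce HRDetect: the feature names are chosen from HR-associated entries in Table 1 for readability, and the weights, intercept, and input values are invented. Inputs are assumed to be standardized signature contributions. A weight of zero stands in for a feature that lasso regularization has removed from the model.

```python
import math


def hr_deficiency_probability(features, weights, intercept):
    """Logistic score: probability-like output between 0 and 1."""
    linear = intercept + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical weights (NOT the published HRDetect coefficients).
# The zero weight mimics a feature dropped by the lasso penalty.
weights = {
    "sub_sig_3": 1.5,
    "rearr_RS3": 1.0,
    "indel_microhom": 2.0,
    "sub_sig_1": 0.0,
}
intercept = -3.0

# Hypothetical standardized signature contributions for two tumors.
tumor_high = {"sub_sig_3": 1.2, "rearr_RS3": 0.8, "indel_microhom": 1.5, "sub_sig_1": 0.3}
tumor_low = {"sub_sig_3": -0.5, "rearr_RS3": -0.5, "indel_microhom": -0.5, "sub_sig_1": 2.0}

p_high = hr_deficiency_probability(tumor_high, weights, intercept)
p_low = hr_deficiency_probability(tumor_low, weights, intercept)
```

Because the score pools several signals, a tumor with moderate evidence across many features can still score highly, whereas a single-signature test would depend on one measurement alone.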
HRDetect (AUC = 0.98) outperforms both customary copy number–based approaches (e.g., HRD index; refs. 58–60) and any individual signature on its own for detecting BRCA1/BRCA2 deficiency. This is unsurprising, as a predictor that draws on a combination of many signatures would be more sensitive and specific than a predictor that depends on only a single signal (57). Thus, HRDetect works extraordinarily well even in situations of reduced mutation information secondary to low tumor cellularity, low sequencing depth (e.g., low-coverage WGS of ∼10-fold rather than 30-fold), or increased noise (e.g., in cancer specimens that have artefactual genetic changes arising from formalin fixation; ref. 57). This observation could have immediate potential applications.
Of particular clinical importance, HRDetect revealed a larger proportion of patients with BRCA1/BRCA2 deficiency than expected: up to 22%, far more than the 3.9% of patients knowingly recruited to the study as germline mutation carriers (57). More than half of these tumors would not have been detected as BRCA1/BRCA2 null using targeted sequencing of these genes alone. BRCA1/BRCA2–null tumors are selectively sensitive to compounds such as PARP inhibitors (61–64), which are currently reserved, in principle, for the approximately 1% to 5% of patients who are germline mutation carriers. If in fact one in every five breast cancer patients has the equivalent of a BRCA1/BRCA2–null tumor, could they be similarly selectively sensitive to PARP inhibitors? This is unknown, and it is now necessary to embark on experiments and/or clinical trials to seek conclusive evidence. The message to the community is this: We need clinical trials of drugs like PARP inhibitors that are not restricted to germline mutation carriers and that extend to sporadic breast and possibly other tumors in the general populace.
Beyond driver mutations, mutational signatures could contribute a powerful, additional spoke in the wheel of cancer diagnostics and therapeutic stratification. There is likely to be scope for identifying other pathophysiologic processes with sensitivities to different therapies (e.g., replication stress with WEE1/ATR inhibitors or perhaps stratifying sensitivity to immunotherapies; refs. 65, 66). The academic abstraction of mutational signatures takes a step closer toward the clinic.
Critical Dissection of the Mutational Signatures Concept
Although it is a fast-paced and exciting field, the mutational signatures model does warrant critical scrutiny. No matter how sophisticated the analyses of in vivo mutagenesis, there are limitations to studying tumors—it is an uncontrolled and noisy system, and even the best clinical metadata collections will, at best, provide associations.
First, we acknowledge that the model requires validation
Experiments that show how different signatures can be generated by different exposures will contribute toward reinforcing the concept. The field of environmental mutagenesis (67–71) will argue that historic TP53 and HPRT reporter assays, and experiments exposing mouse embryonic fibroblasts to external exposures (34), such as ultraviolet light and tobacco carcinogens, already provide evidence that mutations generated through exogenous exposures generate mutation patterns that are similar to those observed in human cancers. However, there have been limited efforts to demonstrate similarly clear relationships for endogenous mutational processes. Perhaps, systematic surveys of mutational signatures of DNA-damaging agents and from abrogation of DNA repair genes will be required to truly convince the scientific community that mutational signatures observed in human cancers arise from both external and internal sources of DNA damage and DNA repair. Experimental evidence showing that the amount of exposure (whether to a chemical compound or endogenous exposure) is correlated with the degree of mutagenesis will also help to strengthen conviction in this model. The final demonstration of being able to turn on a signature (through gene knockout) and turn it off again (through reversing the mutation) could definitively authenticate this model.
Second, what is the mathematical rigor of this concept?
The principle of factorizing or reducing a complex, multidimensional dataset into simpler parts is not unusual. Multiple different mathematical methods have been developed for precisely this purpose (37–41). Although showing striking similarity, the results obtained through these different methods are not identical (Supplementary Data, Supplementary Figs. S1–S3). This has raised concerns regarding reproducibility.
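As a point of orientation, the factorization that most of these methods share can be written as follows. This is a generic sketch and not the formulation of any one cited method: the symbols M, S, and E, the number of signatures K, and the squared-error objective are assumptions introduced here for illustration, and individual algorithms differ in their objective, their constraints, and how K is chosen.

```latex
% M: mutation catalog (96 substitution classes x N tumors), observed counts
% S: signatures (96 classes x K signatures), each column a probability profile
% E: exposures (K signatures x N tumors), mutations attributed to each signature
M \approx S\,E, \qquad
M \in \mathbb{R}_{\ge 0}^{96 \times N},\quad
S \in \mathbb{R}_{\ge 0}^{96 \times K},\quad
E \in \mathbb{R}_{\ge 0}^{K \times N}

\min_{S \ge 0,\; E \ge 0} \; \lVert M - S\,E \rVert_F^{2}
\quad \text{subject to} \quad
\sum_{c=1}^{96} S_{ck} = 1 \;\; \text{for each } k = 1, \dots, K
```

Because a problem of this kind is non-convex, different algorithms and different starting points can converge to solutions that are close but not identical, which is consistent with the variation between methods described here.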
Some signatures are consistently similar across methods, for example, the signatures related to the activity of APOBEC enzymes (signatures 2 and 13), which are pervasive, robust across algorithms (Supplementary Data, Supplementary Fig. S3), and undisputed as mutational signatures in human tumors. Related signatures are admittedly sometimes less clearly distinguishable (Supplementary Note, Supplementary Fig. S3). Various post hoc processing methods are reported to be used to tease these apart, and these do result in differences in the final extracted signatures. For example, signatures 6, 20, and 26 (all related to mismatch repair deficiency) are historically more difficult to disentangle because of commonalities in their 96-element profile. That they are more challenging to disambiguate from one another does not mean that they do not exist, of course, and may even reflect biological interactions between them.
The assignment of the amount of each signature present in individual tumors is also a source of variation between algorithms. Invariably, these algorithms assign a small proportion of every signature to every sample examined. This is unlikely to be biologically true, so penalties may be, or have been, introduced to increase the “sparsity” of mutation assignments, resulting in variation in final signature contributions to individual samples. Thus, mutational signature extraction and the assignments of these signatures are currently not fully deterministic. Of course, balance needs to be struck between precise signature analysis and not overfitting data through post hoc processing.
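One simple way to impose sparsity, shown here purely for illustration, is to drop any signature that accounts for less than a chosen fraction of a tumor's mutations and to rescale the remainder so that the total is preserved. This thresholding rule, the function name, the 5% cutoff, and the example counts are all assumptions of the sketch; published pipelines use a variety of penalties and need not work this way.

```python
def sparsify_exposures(exposures, min_fraction=0.05):
    """Zero out minor signature contributions and rescale the rest.

    exposures: {signature name: mutations attributed to it in one tumor}
    min_fraction: hypothetical cutoff below which a signature is dropped
    """
    total = sum(exposures.values())
    if total <= 0:
        raise ValueError("tumor has no assigned mutations")
    kept = {k: v for k, v in exposures.items() if v / total >= min_fraction}
    if not kept:
        raise ValueError("cutoff removes every signature; lower min_fraction")
    kept_total = sum(kept.values())
    # Rescale retained signatures so the tumor's mutation total is unchanged.
    return {
        name: (value * total / kept_total if name in kept else 0.0)
        for name, value in exposures.items()
    }


# Hypothetical raw assignments for one tumor (illustrative numbers only).
raw = {"sig1": 500.0, "sig2": 450.0, "sig13": 30.0, "sig17": 20.0}
sparse = sparsify_exposures(raw, min_fraction=0.05)
```

Changing the cutoff changes which signatures survive and how many mutations the survivors absorb, which is one concrete reason why two pipelines can report different contributions for the same tumor.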
Another potential source of muddied results comes from pooling of data across tumor types. At a first approximation, increasing the size of a cohort would provide greater statistical power for analysis. However, mixing of tumor types, particularly if they are not of equivalent numbers (e.g., 500 breast cancers with 25 leukemias) and have differing mutational burdens, could result in signal dilution or interference. This can be difficult to disentangle; therefore, pooled analyses should be undertaken with a very clear declaration of methods, including what post hoc processing steps are used. In the community, it remains controversial whether pooled analyses should be used. Such analyses imply that we expect to extract mutational signatures that are identical across all tissues. This may be true for some signatures, but not for all. There is no reason to expect that genes involved in HR repair perform precisely the same functions at the same time in the cell cycle to the same degree, in breast tissue as well as in colonic tissue. Indeed, the likelihood is that they almost certainly do not.
Even when studying a specific tissue type such as breast cancer, we acknowledge that there are genuine biological differences between cohorts of samples (see Supplementary Figs. S1–S3 for comparison of two cohorts). Rare signatures present in only 1% to 2% of tumors may not have been detected previously, because they were simply not present in any prior dataset examined (Supplementary Figs. S4–S5). Thus, getting a different result such as a novel mutational signature in a new dataset may be a genuine new finding, provided, of course, that many of the canonical signatures are also detected.
It is very likely that mutational signatures in human tumors do indeed exist, but how analyses are performed could affect the results of a signature extraction. Therefore, for any given analysis, it is vital to report how it was performed with absolute clarity, and for reviewers to critically assess whether the method applied is appropriate to the biological question being asked. What is described in this review is what we have seen in breast cancers to date, although this picture may yet change. Mathematical extractions of mutational signatures have their limitations and should not be considered deterministic. Intertissue variation is expected (Supplementary Table S1). Perhaps one way of presenting data is as an average signal with error bars indicative of the intertissue variation (Fig. 2A).
Thus, there is variability in mathematical extraction of signatures depending on the algorithm used, on how data are used (whether analyzed as a pool of multiple tumor types or analyzed as separate tumor types), and even on whether the data are derived from whole genome or from exome sequencing experiments. How best to handle these issues remains uncertain and will likely be resolved in time.
Future Directions
Today, we can demonstrate and quantify mutational signatures in breast and other cancers; we can gain novel biological insights and potentially exploit signature properties for clinical applications. As noted above, some thoughtfulness is still required in the interpretation of any cancer-based analysis, and experimental work remains the bastion for substantiating proposed etiologies or mechanisms underpinning mutational signatures.
Notwithstanding, we are able to thoroughly profile cancer genomes per patient (Fig. 4 shows six strikingly different whole genome profiles). Soberingly, for the nearly 700 WGS and approximately 1,500 WES breast cancers that have already been scrutinized, no two patients shared the same set of drivers or the same quantities of signatures (Fig. 4). Personalized genomics is, therefore, not an option for us to debate; it is a fact of life and a challenge we must embrace.
Applying comprehensive genomic approaches judiciously (72), particularly within the context of clinical trials, could prove to be most rewarding. If we had access to informative cohorts with outcome data available, this would indeed help to accelerate translation into the clinic (72, 73).
Such efforts should also go beyond resequencing primary cancers. Precursors of breast cancer such as ductal carcinoma in situ (DCIS) and metastatic lesions (66) should be targeted for similarly detailed levels of driver and mutational signature investigation. Likewise, tumors separated temporally and spatially in individual patients could provide useful perspectives on tumorigenesis. There also remains more to explore through integration with other modalities, such as expression (74) and methylation, and assessments of the surrounding tissue microenvironment.
Last but not least, the insights on mutational signatures have only transpired because data generated through many sequencing studies, from many academic and clinical centers, have been shared with the wider community. Thus, any future sequencing endeavor, be it within a clinical trial or otherwise, should be committed to data sharing. This is because the opportunity to learn new things from data resources not just immediately, but subsequently, is huge, particularly if thorough genomic profiling is available.
Disclosure of Potential Conflicts of Interest
S. Nik-Zainal is listed as a co-inventor on multiple patent filings related to the application of mutational signatures that are owned by Genome Research Limited, and is a consultant/advisory board member for Artios Pharma Ltd. No potential conflicts of interest were disclosed by the other author.
Acknowledgments
The authors thank Esther Lips (NKI, Holland, the Netherlands), Shelley Hwang (Duke University, Durham, NC), and Alastair Thompson (MD Anderson Cancer Center, Houston, TX) for critical assessment of the manuscript. The authors also thank the ICGC Breast Cancer Working Group and the BASIS Consortium funded by the Seventh EU Programme for having the foresight to see the potential and conceive the idea of these extensive resequencing experiments in breast cancer.
Grant Support
S. Nik-Zainal was a Wellcome-Beit Fellow and personally funded by a Wellcome Trust Intermediate Clinical Research Grant (WT100183MA) at the start of writing this review, and subsequently funded by a CRUK Advanced Clinician Scientist Award (C60100/A23916). S. Morganella is funded by core funds from the Wellcome Trust Sanger Institute.