DNA sequencing offers a powerful tool in oncology based on the precise definition of structural rearrangements and copy number in tumor genomes. Here, we describe the development of methods to compute copy number and detect structural variants to locally reconstruct highly rearranged regions of the tumor genome with high precision from standard, short-read, paired-end sequencing datasets. We find that circular assemblies are the most parsimonious explanation for a set of highly amplified tumor regions in a subset of glioblastoma multiforme samples sequenced by The Cancer Genome Atlas (TCGA) consortium, revealing evidence for double minute chromosomes in these tumors. Further, we find that some samples harbor multiple circular amplicons and, in some cases, further rearrangements occurred after the initial amplicon-generating event. Fluorescence in situ hybridization analysis offered an initial confirmation of the presence of double minute chromosomes. Gene content in these assemblies helps identify likely driver oncogenes for these amplicons. RNA-seq data available for one double minute chromosome offered additional support for our local tumor genome assemblies, and identified the birth of a novel exon made possible through rearranged sequences present in the double minute chromosomes. Our method was also useful for analysis of a larger set of glioblastoma multiforme tumors for which exome sequencing data are available, finding evidence for oncogenic double minute chromosomes in more than 20% of clinical specimens examined, a frequency consistent with previous estimates. Cancer Res; 73(19); 6036–45. ©2013 AACR.
High-throughput methods for whole-genome sequencing have provided researchers with an unprecedented ability to measure the complex state of genomic rearrangements characteristic of most cancers. Numerous methods for inferring structural variation from paired-end sequencing data have been developed (1–3), but the structural variants called by such methods are often considered only in isolation, and used primarily to identify potential fusion genes. The difficulty in discovering all true structural variants and filtering out false positives makes it hard to use the output of such methods to reassemble large regions of the tumor genome. However, such tumor genome assemblies can reveal the complex structure of the tumor genome and can be used to infer the mechanism by which somatic alterations critical to cancer progression occur, such as amplifications of oncogenes and deletions of tumor suppressors. Ideally, these tumor assemblies will also reveal unique features of the cancer that can be used as diagnostic features to monitor disease progression in the patient.
It is well documented that one mechanism by which genes become highly amplified in tumors is via the circularization of double-stranded DNA into what are known as double minute chromosomes (4). Double minute chromosomes have been shown to confer resistance to certain drugs, as well as pass along this resistance nonuniformly to daughter cells. They have been observed up to a few megabases in size, and contain chromatin similar to actual chromosomes, but lack the centromere or telomeres found in normal chromosomes. Because double minute chromosomes lack centromeres, they are, like mitochondria, randomly distributed to daughter cells during cell division (5). They are generally lost in future generations unless there is some selective pressure to maintain them. For example, when they confer selective growth advantage to tumor cells, they are readily retained at high copy number. In particular, double minute chromosomes containing oncogenes may serve to amplify these genes to hundreds of copies per cell. Double minute chromosomes are common in several types of cancer, including brain cancers, with an estimated 10% of glioblastoma multiforme tumors bearing double minute chromosomes (6).
Homogeneously staining regions (HSR) represent another common mode of extreme gene amplification in cancer, observed in both solid and hematologic cancers (7–10). An HSR arises from high copy number tandem duplication of a genomic segment such that it expands the affected chromosome. Because they are embedded within larger chromosomes, fluorescent in situ hybridization (FISH) probes specific to genomic sequences within the HSR give a broad band of staining in a specific position within the larger chromosome, distinct from the focal staining usually observed with a locus-specific FISH probe. It is believed that double minute chromosomes and HSRs are related, in that double minute chromosomes can derive from HSRs as well as create HSRs via chromosomal reinsertion (7, 11, 12). FISH is used to distinguish whether HSRs, double minute chromosomes, or both are present in a given tumor sample. Whatever their form, these highly amplified oncogenes are often key drivers of their respective tumors, and may be vital in their detection, diagnosis, and treatment.
Here, we present methods that analyze high-throughput whole-genome sequencing data from tumor and matched-normal samples to detect these extremely amplified and rearranged regions of the tumor genome. We show that these low-level results can be synthesized to construct accurate local genome assemblies that are circular in nature, suggesting the existence of double minute chromosomes and/or HSRs in the tumor. We use FISH analysis to independently identify double minute chromosomes and HSRs in two tumor samples. In addition, we show that RNA-seq data further supports our circular assemblies, in one case identifying a novel isoform of the gene CPM that co-opts intergenic DNA to create a novel exon. Further analysis of the amplicons is done to distinguish likely driver genes, such as MDM2, from likely passenger genes, such as those disrupted by the amplicon structure. We show that this is made possible by the configuration of the rearranged regions represented by the predicted circular assembly. Finally, our analysis of a much larger set of samples with whole-exome sequencing data suggests that 20% of glioblastoma multiforme samples harbor highly amplified oncogenic amplicons.
Materials and Methods
Tumor and normal genome sequencing data
Tumor and matched-normal whole genome BAM files were downloaded from CGHub (cghub.ucsc.edu) under the following sample identifiers: The Cancer Genome Atlas (TCGA)-06-0648-01A-01D-0507-08, TCGA-06-0648-10A-01D-0507-08, TCGA-06-0145-01A-01D-0507-08, and TCGA-06-0145-10A-01D-0507-08. In addition, 264 exome BAM files were downloaded from CGHub, with sample identifiers given in Supplementary Table S1. Whole genome and exome BAM files were indexed by samtools (13) and processed by BamBam (See Supplementary Materials and Methods) to determine relative copy number, allele fraction, and structural variants. Supplementary Tables S1 and S2 list the copy number and rearrangement breakpoints detected by BamBam for exome and whole-genome sequencing data, respectively.
Determining breakpoints related to highly amplified regions
The read support (number of overlapping reads) for a given breakpoint is directly proportional to the copy number of the regions it connects. Thus, by requiring breakpoints to have a high level of read support, we can filter out breakpoints that are part of a copy-neutral rearrangement, and breakpoints that led to low-copy amplifications and deletions. We can then focus on the breakpoints that are part of highly amplified regions in the tumor. The particular read support threshold is selected such that breakpoints that have read support expected of copy-neutral regions of the tumor genome are removed.
Reconstructing amplicons by walking a breakpoint graph
Similar to a recently published method (14), we construct a breakpoint graph by defining a node for each side of an amplified segment of the tumor genome, connecting these two nodes by a segment edge that represents the amplified segment, and defining a set of bond edges, each of which represents a pair of segment sides that are adjacent in the tumor genome. Which amplified segments of the tumor genome are included as segment edges in the breakpoint graph is determined by relative copy number. The bond edges are the highly supported breakpoints found in the manner described earlier. If an amplified segment is interrupted by a breakpoint, then that segment will be split into two segment edges.
We visualize the graph with the segments laid out according to genomic position. Assuming all segments have the same relative copy number and each segment side has exactly one bond edge attached to it, we determine the arrangement of the amplified segments in the tumor genome by starting at the node for the left side of the first segment and traversing the segment edge to the node at the other side. The path then follows a bond edge attached to that node over to the segment side to which it is connected, then traverses the new segment edge, continuing in this manner of alternatively traversing segment and bond edges until we have returned to the starting node. If there are segment edges remaining that we have not traversed, then we select a new starting node among them and repeat this procedure, until all segment edges are accounted for. Each cyclic path we determine defines a separate circular chromosome via the concatenation of its segments in the direction of traversal. This interpretation of the graph is unambiguous, as the overall direction of traversal of any circular chromosome is immaterial, as is the order in which we discover them. When the number of bond edges per segment side (i.e., node) in the graph is not uniformly one, or not all segments have the same relative copy number, some segments may require more than one traversal to account for the extra copies, and some walks may terminate at segment sides that have no bond. Whenever multiple walks can be made through the set of bond edges and segments, we enumerate the different solutions, including potentially circular solutions that cannot close the loop due to one or more missing bond edges. A single solution that features multiple independent cyclic paths occurs when multiple circular walks are required in the procedure described earlier. Here, distinct sets of bond edges and segments are connected in independent closed loops that lack any bond edges that could fuse the independent loops together into a single, larger closed loop. A toy example breakpoint graph and its solution are described in Supplementary Fig. S1.
When there are multiple solutions, the optimal path(s) through the graph are taken to be those that most closely agree with the observed relative copy number. The number of times a solution traverses a given segment produces an estimate of that segment's copy number. The root mean square deviation (RMSD) of the segment traversal counts to the observed relative copy number for each solution is calculated, and then the solution(s) with the smallest RMSD value are labeled as optimal.
RNA-Seq reads were mapped using BWA (15) and coverage was calculated using BEDtools (16). We calculated the total coverage per transcript per million uniquely mapped read pairs for each gene within the TCGA-06-0648 double minute chromosomes. For CPM, the coverage over a truncated version of the gene was used as described in the text. These normalized expression metrics of TCGA-06-0648 were compared with a set of nine other TCGA glioblastoma multiforme samples that comprised the original RNA-Seq cohort for glioblastoma multiforme (Table 1).
|Sample .||MDM2 .||CPM (exons 1–8) .||CAND1 .||RAP1B .|
|0648 fold increase||23.98||15.70||29.00||0.51|
|Sample .||MDM2 .||CPM (exons 1–8) .||CAND1 .||RAP1B .|
|0648 fold increase||23.98||15.70||29.00||0.51|
aExpression reported as normalized transcript coverage (coverage per transcript per million uniquely mapped paired-end reads).
bExcludes 06-0214 and 06-0648, which have MDM2 amplification.
FISH was conducted for EGFR/Cep7 and MDM2/Cep12 using prelabeled probes (Vysis, Abbott Molecular). Charged slides with 4-μm paraffin sections were dewaxed, rinsed, microwaved in 10 mmol/L sodium citrate buffer, then digested in pepsin–HCl (40 μg/mL, 10 minutes at 37°C), rinsed, and dehydrated. Probe and slides were codenatured using a HYBrite automated hybridizer at 80°C for eight minutes then hybridized for two to three days at 37°C.
Exome analysis to estimate prevalence of double minute chromosomes
Additional samples that potentially bear double minute chromosomes were identified in TCGA glioblastoma multiforme exome datasets as follows. Tumor and matched normal exomes were processed by BamBam to compute relative coverage and identify somatic rearrangements. High copy number peaks were defined as regions with 5-fold increased relative coverage versus their matched normal, a threshold selected to conservatively filter out peaks caused by low-level amplifications and noise (see Supplementary Materials and Methods for details). Glioblastoma multiforme tumors with multiple such high copy number peaks were manually analyzed to discover any oncogenes within peaks, associate peaks with nearby somatic rearrangements, and determine if a sample exhibits multiple peaks with similar copy number levels. Samples were scored as having possible double minute chromosomes if they either contain multiple distinct high copy number peaks with similar copy number levels and at least one peak contained an oncogene, or a single distinct peak containing an oncogene, spanning approximately 1 Mb, and having an associated rearrangement.
Association of double minute chromosomes/HSR samples with other glioblastoma multiforme tumor features
Samples containing likely chromothripsis events were identified from the set of 26 samples with whole-genome sequencing data determined to have a possible double minute chromosomes by selecting the subset that had multiple (>3) high copy number peaks (18 samples), and were compared with samples containing no high copy number peaks as determined by whole-exome data (112 samples). Thereafter, t tests were conducted in R comparing several features between the two groups, including molecular subtype, survival, and mutation in PTEN, TP53, KEL, IDH1, PIK3R1, PIK3CA, POTEB, NF1, RB1, and EGFR, amplification of PRDM2, MET, MDM2, EGFR, CDK4, CCNE1, and CCND2, and deletion of RB1, PTEN, PRDM2, MET, and CCND2. Bonferroni-adjusted P values were reported. TP53 mutations within the two groups were further evaluated using a two-tailed Fisher's exact test.
BamBam: a robust method for identifying tumor-specific variation
Considering that a BAM file storing a single patient's whole-genome sequence at high coverage (>30×) can be hundreds of gigabytes in compressed form, a serial analysis of two sequencing datasets requires researchers to store intermediate results that must be merged to conduct a comparative analysis, such as identifying mutations found only in the tumor sample. To overcome this problem, we developed BamBam, a tool that conducts a comparative analysis of a patient's tumor genome versus his/her germline by simultaneously processing the tumor and matched-normal short-read alignments stored in SAM/BAM-formatted files (13). Simultaneous processing of both BAM files enables BamBam to efficiently calculate tumor relative coverage and allele fraction, discover somatic mutations and germline SNPs, and infer regions of structural variation. The relative coverage and allele fraction estimates made by BamBam can be used to estimate tumor copy number and normal contamination in the sequenced tumor sample (See Supplementary Methods for details).
Reconstruction of candidate double minute chromosomes using whole-genome sequencing
We focused on three samples from the initial set of 19 glioblastoma multiforme samples from TCGA subjected to whole-genome sequence analysis where we detected highly amplified segments overlapping oncogenes, suggestive of double minute chromosomes. We applied these methods to samples designated TCGA-06-0152, TCGA-06-0648, and TCGA-06-0145. In each case, whole-genome sequencing of a tumor biopsy was available separately from whole-genome sequencing of a blood sample (matched normal tissue sample). For two of these samples (TCGA-06-0152 and TCGA-06-0648), multiple segments had similar levels of amplification whereas the third sample (TCGA-06-0145) had one large amplified region with further rearrangements internal to the region.
Sample TCGA-06-0648 had a striking pattern of genome amplification in which 16 distinct segments (15 from chromosome 12 and a small fragment from chromosome 9) had similarly high levels of amplification (>10 copies) and also appeared to be linked to each other by high confidence rearrangement events identified by BamBam (Fig. 1A). One of the chromosome 12 segments contains the MDM2 oncogene. Out of the total of 701 putative somatic breakpoints called, 97 breakpoints met the filtering criteria specified in Supplemental Methods. Only 16 of these 97 breakpoints further met or surpassed a chosen minimum read support threshold of 100 to identify breakpoints likely associated with highly amplified regions, including two breakpoints that did not have split read evidence. All of these highly supported breakpoints are proximate to the boundaries of the highly amplified segments clustered on chromosome 12, suggesting that the highly supported breakpoints and the amplifications are directly associated and may represent the rearranged configuration of one (or multiple) amplicons in the tumor's genome.
Figure 1B shows a schematic of the amplified segments of TCGA-06-0648 and their associated breakpoints. This diagram predicts a circular path that completely accounts for all observed breakpoints and amplified segments, resulting in a single 891-kb circular amplicon containing a single copy of MDM2. Because all but two breakpoints were refined by split read solutions, the estimated size of this amplicon is accurate to within approximately 100 bp. This reconstructed circle is consistent with either an array within a larger chromosome of precise tandemly duplicated copies of an initial circular amplicon formed from these 16 genomic segments (assuming single-copy breakpoints joining this tandem array to nonrepeated DNA were not sufficiently covered to be observed) or an extra-chromosomal circular DNA (double minute chromosomes; ref. 17) with average copy number of approximately 84 in the tumor sample. As we can assume neither a perfectly clonal tumor nor a stable number of double minute chromosome copies in every generation of tumor cells, the average copy number of 84 should be considered the number of double minute chromosome copies in the average tumor cell sequenced. We specifically searched for breakpoints with lower read support within the amplicon region connecting it to other single-copy genomic locations, as these could provide evidence for an insertion site of a tandem array, but found none. Thus, the presence of a double minute chromosome is the more parsimonious explanation of these data, as it does not require us to postulate the existence of one or more pairs of unobserved breakpoints where one or more HSRs each containing multiple exact copies of this 891-kb region are inserted into larger chromosomes.
Double minute chromosomes have been identified in many tumor types where they often contain oncogenes important for that cancer, such as EGFR in the case of glioblastoma multiforme (18). The TCGA-06-0648 double minute chromosome contains several protein-coding genes from chromosome 12, including intact copies of the MDM2 oncogene and CAND1, which encodes an inhibitor of cullin ring–ubiquitin ligase complexes (19). In addition, it includes a truncated allele of carboxypeptidase M (CPM), a membrane-bound and secreted protease that cleaves the C-terminal residue of epidermal growth factor (EGF; ref. 20). The amplified allele of CPM lacks the last exon, which should not affect the catalytic or major structural domains of the protein, but removes the amino acids necessary for GPI anchoring of CPM to the plasma membrane and would be expected to result in an exclusively secreted form of CPM (21). A partial allele of the ras-family gene RAP1B is also present, but as it lacks the promoter and first exon, it is unlikely that this allele is expressed. It seems likely that MDM2 drove the high copy maintenance of this double minute chromosome in TCGA-06-0648, but CPM and CAND1 could contribute as well.
This region of the tumor genome has all of the hallmarks of a chromothripsis event, suggesting that the double minute chromosome was created by connecting shattered fragments of chromosome 12 and a small region of chromosome 9 into a single circular episome (22). The observation that multiple regions with uniformly high read depths are connected by a set of structural variants with similarly high-read support suggests that these alterations likely occurred together during a single event, such as chromothripsis, instead of a series of independent focal rearrangements. Furthermore, all breakpoints lack homology or exhibit 2 to 6 bp microhomologies at their junctions, indicating that nonhomologous end-joining (NHEJ) and microhomology-mediated end joining (MMEJ) are the primary DNA double-stranded break-repair mechanisms responsible for constructing the double minute chromosome (23).
We applied these same methods to sample TCGA-06-0152, which also had several highly amplified segments, including segments containing the MDM2, CDK4, and EGFR oncogenes (TCGA; ref. 24). This analysis predicted two amplicons, in which each amplicon harbored at least one oncogene (Supplementary Fig. S2).
The final case, TCGA-06-0145, exhibits an extreme level of amplification (>50-fold) of a single approximately 800 kb genomic segment including EGFR that could indicate the presence of an EGFR-double minute chromosome or HSR (Fig. 2). In contrast with the other samples, the amplified region of TCGA-06-0145 contains significant variations in the major and minor allele frequency as well as deletion and duplication events with lower read support, which are more compatible with an HSR interpretation. However, again, we were unable to find evidence of breakpoints that would link this amplicon to another genomic region, which argues against an HSR.
The solution to the breakpoint graph of TCGA-06-0145 shows the possibility of three distinct paths that incorporate all breakpoints and explain the observed copy number, and each path predicts a different form of EGFR. EGFR is intact in the dominant path (seven of every nine copies). The remaining two paths, each present in one of nine copies, feature breakpoints that are internal to the EGFR gene, with one path producing a nonfunctional form of EGFR and the other deleting exons 2 to 7 of EGFR. This form of EGFR is known as EGFRvIII, a highly oncogenic, constitutively active form of EGFR that is expressed in multiple tumor types (25). This is interesting because it suggests two scenarios: (i) EGFRvIII emerges after wild-type EGFR is significantly amplified or (ii) EGFRvIII is created early but cells with more copies of wild-type EGFR have a selective advantage in the tumor population. The former scenario seems most plausible, as the increased number of copies subsequently improves the chance that the EGFRvIII mutant will emerge. Regardless of the true scenario, the ratio of EGFRvIII to wild-type EGFR suggests that a high copy number of oncogenic EGFRvIII may not be necessary to provide significant advantage over the wild-type amplification to the growing tumor cell.
Transcriptome data reveals a novel double minute chromosome-associated fusion protein
For one of the three tumor samples examined in this study (TCGA-06-0648), RNA sequencing was also conducted by TCGA. We analyzed these data together with that from the nine other samples in the initial glioblastoma multiforme RNA-Seq batch to examine the expression of alleles associated with the TCGA-06-0648 double minute chromosome. As expected from the absence of the promoter and first exon, RAP1B expression was half that of the other glioblastoma multiforme samples that do not have amplifications in this region, suggesting that only the intact copy of RAP1B on chromosome 12 is expressed and the double minute chromosome allele is not expressed. In contrast, MDM2, CAND1, and the first eight exons of CPM were expressed at more than 15-fold higher levels than was observed in the glioblastoma multiforme samples lacking amplification of these genes (Table 1).
For CPM, we observed that many reads originating in exon 8 terminated in a region 1.47 Mb away from it in the normal version of chromosome 12, but only 13.5 kb away in the double minute chromosome. Closer analysis of this region revealed that the 5′ end of these reads are just downstream of a canonical splice-site acceptor sequence that generates a new exon encoding a novel 30 amino acid carboxy terminus for the double minute chromosome-derived CPM allele (Fig. 3). This region is not part of any known transcript and the resulting protein sequence has no strong homology to any other proteins. This sequence is unlikely to provide a GPI anchor site; therefore, we anticipate that the double minute chromosome-derived CPM protein would be secreted. It is not clear what the functional effect of expressing this altered CPM gene would be, although it may affect EGF metabolism as both membrane-bound and secreted forms of CPM are known to cleave the carboxy-terminal arginine of EGF (26).
In summary, from the point of view of gene expression, the double minute chromosome results in overexpression of MDM2, CAND1, and a novel form of CPM in this tumor sample.
TCGA-06-0648 and TCGA-06-0145 amplicons exist as double minute chromosomes
The ability to distinguish HSRs from double minute chromosomes from short-read sequencing data is limited. To independently assess the nature of the amplification events in these tumors, we conducted FISH analysis on paraffin sections derived from tumors TCGA-06-0648 and TCGA-06-0145 using probes to MDM2 (amplified in TCGA-06-0648) and EGFR (amplified in TCGA-06-0145; Fig. 4). Material was unavailable for TCGA-06-0152. In 0648, the MDM2 probe gave a punctate pattern throughout the nucleus, indicating many nonchromosomal sites of MDM2, typical of double minute chromosomes. In contrast, the EGFR probe gave a broad pattern of staining in addition to punctate spots for TCGA-06-0145, consistent with a combination HSR and double minute chromosomes.
Prevalence of putative double minute chromosomes/HSRs in exome sequencing data
Exome sequencing is cheaper than whole-genome sequencing, and samples of tumors that have been subjected to exome sequencing are more plentiful. Evidence for double minute chromosomes can be obtained from exome-sequencing by searching for patterns of multiple regions of high-level amplification overlapping at least one oncogene, a pattern common to the double minute chromosome-containing samples we analyzed. In contrast, broad or chromosome arm-level amplification events as well as focal events with a modest level of amplification (e.g., duplications) are not expected to be part of a double minute chromosome. To estimate the prevalence of double minute chromosomes/HSRs in glioblastoma multiforme, we searched a set of 264 TCGA samples with tumor and matched-normal exome sequencing data for signatures of double minute chromosomes. As detailed in the Materials and Methods, we first conducted a computational survey of samples to identify samples exhibiting focal extreme amplification(s), prioritizing those samples with multiple distinct peaks on one or more chromosomes. A careful manual review of these samples was conducted to assess the likelihood that the sample contains a double minute chromosome, looking for the quality of relative copy number calls, any evidence of structural variation associated with the amplified peaks, and the presence of potential oncogenes.
As described in Table 2 and Supplementary Table S1, 61 samples (23%) have features, suggesting the presence of a double minute chromosome. A total of 121 oncogenes were amplified across these 61 samples, with at least one oncogene identified in every putative double minute chromosome. EGFR was the oncogene most frequently associated with these high-level amplicons, followed by CDK4 and MDM2. MYCN, which has been identified in a number of glioblastoma multiforme- double minute chromosomes(6), was associated with amplicons in two samples. As a group, the putative double minute chromosomes/HSR-containing samples showed similar survival to the cohort of samples of analyzed by exome sequencing (data not shown). Taken together, these results suggest that almost a quarter of glioblastoma multiforme samples have oncogenic amplicons present at high copy number.
|Total potential double minutes||61|
|Potential DM with EGFR||46|
|Potential DM with CDK4||18|
|Potential DM with MDM2||14|
|Total potential double minutes||61|
|Potential DM with EGFR||46|
|Potential DM with CDK4||18|
|Potential DM with MDM2||14|
Validation of exome sequencing-based prediction of double minute chromosomes/HSR-containing samples
Recently, the TCGA project conducted whole-genome sequencing on an additional 25 tumor/normal pairs from the glioblastoma multiforme cohort. This new dataset contains 23 samples that we predicted to harbor a double minute chromosome/HSR on the basis of our curation of the exome data for these samples. Although complete analysis of this data is being pursued by the Analysis Working Group, we conducted a preliminary analysis using BamBam and the methods described earlier to identify circular amplicons in these samples (summarized in Supplementary Table S3). For 16 of the 23 samples, we were able to reconstruct at least one circular amplicon. For the remaining 7 of the 23 samples, multiple highly amplified peaks were identified with breakpoints connecting many, but not all, of the peaks. We could not reconstruct circular amplicons for these samples, although the possibility remains that rearrangement breakpoints allowing for a circular solution were not detected for technical reasons. For 2 of the 25 samples where we did not predict a double minute chromosome/HSR based on the exome sequencing data, we also did not detect highly amplified genomic regions with associated rearrangements in the whole-genome sequencing data.
This larger set of samples with circular amplicons allowed us to look for common features associated with these samples compared with TCGA glioblastoma multiforme samples where exome data suggests no amplicons. Previous studies identifying medulloblastomas with chromothripsis-associated double minute chromosomes and other complex genetic rearrangements noted that such samples harbor p53 mutations (27, 28). However, there was no enrichment of p53 mutations in the samples with circular amplicons in this sample cohort. The only significant association we observed was with PTEN deletion (Bonferroni-adjusted P value = 0.0052), a common event in glioblastoma multiforme tumors.
Focusing on the 16 samples where we successfully reconstructed circular amplicons with at least one oncogene, a mixture of simple and complex amplicons was observed. One sample harbored two simple circular amplicons, each containing one genomic segment with an oncogene and the remaining five had a single oncogenic amplicon containing one genomic segment (Supplementary Table S3). Ten samples are complex, harboring multiple highly amplified segments from one or more chromosomes. Of these 10, six samples have at least two circular amplicons. Together, these results strengthen our estimates of the prevalence double minute chromosomes/HSR amplicons in glioblastoma multiformes and suggest that samples containing such amplicons can often harbor multiple independent amplicons.
The ability to integrate relative copy number with breakpoints enables us to understand the topology of vital parts of the cancer genome. By examining the overall pattern of amplification events in the TCGA glioblastoma multiforme whole-genome sequencing data, we found that some samples had multiple highly amplified segments of similar copy number. Surprisingly, for many of these highly amplified tumor regions, we are able to completely explain both the observed copy number and highly supported breakpoints surrounding them by solving a simple breakpoint graph, which describes the order and orientation of the highly amplified segments in the tumor genome. For the three glioblastoma multiforme samples analyzed in detail here, the optimal solutions to the breakpoint graphs of amplified segments are circular amplicons. These circular solutions suggest that the observed amplified regions are double minute chromosomes or HSRs.
Sequence coverage of 30× could be limiting our ability to detect breakpoints associated with the chromosomal integration site of an array of these amplicons, as the copy number of the breakpoint at the integration site is much lower than that of the amplicon. Thus, at this coverage we cannot reliably distinguish between a double minute chromosome and an HSR with our bioinformatic analysis. The availability of tissue sections derived from these same tumor samples did allow us to directly address this issue for two samples. FISH analysis of one sample, TCGA-06-0648, is consistent with a double minute chromosome, whereas another, TCGA-06-0145, gives a pattern suggestive of a combination of double minute chromosome and HSR.
Much can be learned through precise knowledge of the amplicon structure. Genes whose coding or promoter regions are disrupted by the amplicon structure are obvious passenger gene candidates, provided that transcriptional machinery is unable to create a novel transcript like the one observed with the birth of a new exon for the CPM gene. Highly expressed genes such as MDM2 that remain intact may drive tumor development and/or play a role in tumorigenesis, using the double minute chromosome's ability to rapidly reproduce to significantly increase the oncogenic capacity of these cells. The highly amplified state of these oncogene-harboring amplicons indicates that they have strong oncogenic potential and, thus, confer a selective advantage to the tumor cell. Their formation was likely a key event in the tumorigenesis of these glioblastoma multiforme tumors and they are likely to persist over time.
Examining the larger set of over 250 TCGA glioblastoma multiforme samples for which there is exome data suggests that oncogenic amplicons are quite prevalent as they are found in almost a quarter of the samples. Using recently generated whole-genome sequencing data for an additional 23 samples from the set of samples where we had predicted double minute chromosomes/HSRs from exome sequencing data, we were able to confirm all of these samples had highly amplified genomic segments with rearrangements connecting at least some of the amplified segments. Further, we were able to reconstruct circular amplicons for 16 of these samples and discovered that six had multiple amplicons as we had observed for TCGA-06-0152, suggesting that the presence of multiple amplicons is common, and that they persist by conferring a combined selective advantage to the tumor cells. These results bolster the notion that chromothripsis-type events occur with reasonable frequency in glioblastoma multiforme and, through amplification of oncogene expression, contribute to tumorigenesis.
The prevalence of these amplicons suggests that tumor-specific breakpoints associated with double minute chromosomes amplicons may be a potential diagnostic for the presence of tumor cells in a significant fraction of patients with glioblastoma multiforme, especially if double minute chromosome-derived DNA is present in the blood. Several recent observations suggest the possibility of finding such tumor DNA in the blood of patients who have glioma. Skog and colleagues reported microvesicles that bud off from glioblastoma multiforme tumor cells with lipid bilayers intact, carrying cytoplasmic contents of the tumor cell such as mRNA, miRNA, and angiogenic proteins (29). These vesicles can deliver their contents to other cells or blood. More recently, Balaj and colleagues have isolated microvesicles containing single-stranded DNA (ssDNA) with amplified oncogenic sequences, in particular c-Myc (30). Tumor nucleic acids leaked into the blood in this manner or via macrophages that engulf necrotic or apoptotic cells have been proposed as possible disease biomarkers in several cancers including glioma (31, 32). Indeed, a recent study robustly detected tumor-associated rearrangements from patients who have breast and colorectal cancer via high-throughput sequencing of the cell-free, plasma fraction of blood (33). The methods described in this study could readily be applied to such sequencing data and, thus, potentially provide a noninvasive method to characterize and monitor patients with glioma.
Disclosure of Potential Conflicts of Interest
J. Zachary Sanborn is a co-founder, equity holder, and chief technology officer, and David Haussler is a co-founder, equity holder, and member of the scientific advisory board of Five3 Genomics, LLC. No potential conflicts of interest were disclosed by the other authors.
Conception and design: J.Z. Sanborn, S.R. Salama, D. Haussler
Development of methodology: J.Z. Sanborn
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C.W. Brennan, T. Mikkelsen, S. Jhanwar
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): J.Z. Sanborn, S.R. Salama, M. Grifford, C.W. Brennan, S. Jhanwar, S. Katzman, L. Chin, D. Haussler
Writing, review, and/or revision of the manuscript: J.Z. Sanborn, S.R. Salama, M. Grifford, C.W. Brennan, T. Mikkelsen, S. Katzman, D. Haussler
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): J.Z. Sanborn
Study supervision: J.Z. Sanborn, S.R. Salama
The authors thank the members of the TCGA glioblastoma multiforme Analysis Working Group and the Haussler and Josh Stuart lab cancer genomics group for helpful discussions of our results. The authors also thank Melissa A. Wynne for technical assistance with the FISH assay. The results published here are, in part, based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov/.
This work was financially supported by grants from the Howard Hughes Medical Institute (D. Haussler and S.R. Salama), NCI 5U24ACA143858 (D. Haussler, J.Z. Sanborn, and M. Grifford), NCI 5U24CA143845 (L. Chin), NCI P01CA095616 (L. Chin), NHGRI (D. Haussler and M. Grifford), NHGRI 5U01ES017154 (D. Haussler and M. Grifford), California Institute for Quantitative Biosciences (S. Katzman), and the Leon Levy Foundation (C.W. Brennan). D. Haussler and M. Grifford are supported by a Stand Up to Cancer Dream Team Translational Cancer Research Grant, a Program of the Entertainment Industry Foundation (SU2C-AACR-DT0409).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.