Low-frequency genetic variants in cancer genome datasets are often simply artifacts of DNA damage introduced by routine sample preparation, not tumor-driving mutations. A new algorithm found that around three quarters of all the samples in The Cancer Genome Atlas contain large numbers of these sequencing errors.

Many low-frequency somatic variants included in The Cancer Genome Atlas (TCGA) may be sequencing errors rather than the rare driver mutations they are often suspected to be. They could instead be artifacts of DNA damage introduced by routine sample preparation, according to a recent study (Science 2017;355:752–6).

“This is a very timely paper,” says Trevor Pugh, PhD, a cancer geneticist at Princess Margaret Cancer Centre in Toronto, Canada, who was not involved in the new study. “Today, we're sequencing much, much more deeply than we used to, so we're going to start confounding mutations like these oxidative-damage mutations with real tumor-driving variants.”

Archived samples are known to be riddled with mutagenic changes that could be mistaken for tumor-driving mutations, but fresh tumor samples were thought to be mostly fine. Then in 2013, a team from the Broad Institute, which included Pugh, was sequencing tumors from children with neuroblastoma and found hundreds to thousands of mutations when they expected just 10 to 20. The researchers discovered that the acoustic energy used to shear DNA extracted from tumor tissue was frequently turning guanine into 8-oxoguanine, a nucleotide that the sequencing machine read as a thymine (Nucleic Acids Res 2013;41:e67). These G-to-T transversions were not tumor-causing mutations but artifacts of the sonication process.

Laurence Ettwiller, PhD, and her colleagues from New England Biolabs (NEB), a molecular biology reagents company in Ipswich, MA, have now extended those findings and quantified the prevalence of such erroneous variants in two widely used sequencing datasets: the 1000 Genomes Project and TCGA. The researchers compared the reads of the two complementary strands from each sequencing run to detect aberrant transversions introduced by DNA damage and scored the degree of mismatching in a metric dubbed the Global Imbalance Value (GIV).
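The strand comparison described above can be illustrated with a short sketch. This is not the study's implementation. It assumes, for illustration only, that the imbalance score is a simple ratio of the G-to-T mismatch rate in one set of reads to the rate in the complementary set, and that mismatches have already been expressed in a common orientation. The function names `variant_rate` and `imbalance_ratio` are hypothetical.

```python
from collections import Counter


def variant_rate(pairs, ref_base, alt_base):
    """Fraction of positions with reference `ref_base` that were read as `alt_base`.

    `pairs` is a list of (reference_base, observed_base) tuples.
    """
    counts = Counter(pairs)
    total = sum(n for (ref, _), n in counts.items() if ref == ref_base)
    return counts[(ref_base, alt_base)] / total if total else 0.0


def imbalance_ratio(first_reads, second_reads, ref_base="G", alt_base="T"):
    """Hypothetical imbalance score: mismatch rate in one read set over the other.

    A value near 1 means both read sets show the change equally often, as
    expected for a true variant. A value well above 1 suggests the change
    appears mostly on one side, as expected for damage.
    """
    rate_first = variant_rate(first_reads, ref_base, alt_base)
    rate_second = variant_rate(second_reads, ref_base, alt_base)
    if rate_second == 0:
        # Assumption: treat "no mismatches at all" as balanced.
        return float("inf") if rate_first > 0 else 1.0
    return rate_first / rate_second
```

Under these assumptions, a sample in which G-to-T mismatches are three times as frequent in one read set as in the other would score about 3, whereas a true mutation present on both strands would score close to 1.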

Based upon the GIV, Ettwiller's team found that 41% of the datasets in the 1000 Genomes Project contained damaged samples. In TCGA, 73% of the 1,800 sequenced tumor and healthy matched samples revealed damage so extensive that at least half of all the G-to-T variants were not true mutations. Other nucleotide imbalances such as C-to-T occurred at lower but still appreciable frequencies.

According to Pugh, analytic tools like MuTect and VarScan can correct the problem, although not perfectly. To eliminate the false variants, the NEB researchers used a mix of enzymes that repaired the DNA damage before sequencing. “But,” says Ettwiller, “we don't know whether or not this cocktail of enzymes will actually work on the TCGA dataset,” because of differences in experimental setup.

NEB markets the DNA-repair mix used in the study, so the authors have an inherent financial conflict, yet that doesn't bother Alexander Dobrovic, PhD, a molecular geneticist from the Olivia Newton-John Cancer Research Institute in Melbourne, Australia. “They clearly have a product to sell, but it's a useful product,” he says. “We'll be using that ourselves.”

Another workaround: molecular barcoding, which involves adding unique tags to each stretch of DNA so that errors introduced during library construction can be detected among duplicate sequences and remedied computationally. “I like molecular barcoding because you're able to directly measure the type and degree of DNA damage,” Pugh says. “You're reading out exactly what you have in the tube and then correcting for it.”
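The computational step in molecular barcoding can be sketched roughly as follows. This is a simplified illustration, not any particular tool's method. It assumes reads arrive as (barcode, sequence) pairs, that reads sharing a barcode have equal length, and that a majority vote at each position is an acceptable way to build a consensus. The function name `consensus_by_barcode` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_agreement=0.6):
    """Collapse reads that share a barcode into one consensus sequence each.

    `reads` is an iterable of (barcode, sequence) pairs. At each position the
    most common base is kept if it reaches `min_agreement` of the family;
    otherwise the position is masked as "N".
    """
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)

    consensus = {}
    for barcode, seqs in families.items():
        bases = []
        # zip(*seqs) walks the family column by column.
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            bases.append(base if count / len(column) >= min_agreement else "N")
        consensus[barcode] = "".join(bases)
    return consensus
```

In this sketch, a base change seen in only one of three duplicates is outvoted and dropped, while a change present in every duplicate survives into the consensus.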

However, repairing DNA was not the study's primary aim. “The goal of the paper was to alert the community to a potential problem,” says Tom Evans, PhD, an enzymologist at NEB. “Solutions will come later.” –Elie Dolgin

For more news on cancer research, visit Cancer Discovery online at http://cancerdiscovery.aacrjournals.org/content/early/by/section.