Abstract
Background: Whole-exome sequencing (WES) has recently emerged as an appealing approach to systematically study coding variants. However, the requirement for a large amount of high-quality DNA poses a barrier that may limit its application in large cancer epidemiologic studies. We evaluated the performance of WES with low input amount and saliva DNA as an alternative source material.
Methods: Five breast cancer patients were randomly selected from the Pathways Study. From each patient, four samples, including 3 μg, 1 μg, and 0.2 μg blood DNA and 1 μg saliva DNA, were aliquoted for library preparation using the Agilent SureSelect Kit and sequencing using Illumina HiSeq2500. Quality metrics of sequencing and variant calling, as well as concordance of variant calls from the whole exome and 21 known breast cancer genes, were assessed by input amount and DNA source.
Results: Input amount and DNA source made little difference to the quality of sequencing and variant calling. The concordance rate was about 98% for single-nucleotide variant calls and 83% to 86% for short insertion/deletion calls. For the 21 known breast cancer genes, WES based on low input amount and saliva DNA identified the same set of variants in samples from the same patient.
Conclusions: Low DNA input amount, as well as saliva DNA, can be used to generate WES data of satisfactory quality.
Impact: Our findings support the expansion of WES applications in cancer epidemiologic studies where only low DNA amount or saliva samples are available. Cancer Epidemiol Biomarkers Prev; 24(8); 1207–13. ©2015 AACR.
Introduction
The advent of next-generation sequencing (NGS) techniques and the reduction in cost over time have transformed the landscape of human genetic research by offering a widely accessible tool to interrogate the genome at an unprecedented pace and scale (1). Compared with whole-genome sequencing (WGS), which remains costly for population-wide applications, whole-exome sequencing (WES), which targets the approximately 1% of the human genome that consists of coding sequences, provides an appealing solution with a balanced trade-off between cost, genome coverage, functional annotation, and analytical burden (2, 3). It has thus been widely adopted to study Mendelian diseases (4, 5) and characterize cancer genomes (6), and has begun to make its way into clinical practice for novel diagnosis and identification of therapy targets (7–9).
In epidemiologic research, WES is emerging as a new powerhouse in searching for coding risk variants (10–12), surpassing genome-wide genotyping microarrays that are limited to common and known variants. Several previous studies have evaluated the performance of different WES technologies and platforms (13–15). However, two practical issues remain that may impede the application of WES to large epidemiologic populations, namely the apparent need for relatively large amounts of high-quality DNA, and the current need to source this from peripheral blood. The large amount of needed genomic DNA (e.g., 3 μg) poses a practical challenge to studies where such amounts are unavailable or would deplete the resource. As saliva samples are now routinely collected in many epidemiologic studies as an inexpensive alternative source of genomic DNA using noninvasive methods, there could be broader use of WES if it were shown that saliva DNA performs comparably well to blood DNA on WES platforms.
To address these two issues, we evaluated the WES performance of the Agilent SureSelect Human All Exon Kit in conjunction with the Illumina HiSeq 2500 platform, which are currently among the few mainstream choices for WES library preparation and sequencing, respectively (15, 16). Our goal was to determine the performance of sequencing, variant calling for single-nucleotide variations (SNV) and short insertions/deletions (indels), and the accuracy in identifying coding variants in known breast cancer–related genes, using different DNA input amounts (0.2 μg, 1 μg, and 3 μg genomic DNA) from peripheral blood, and a different DNA source (1 μg DNA from saliva).
Materials and Methods
Genomic DNA samples
Genomic DNA samples were obtained from the Pathways Study, a prospective cohort study that recruited recently diagnosed breast cancer patients from the Kaiser Permanente Northern California (KPNC) health plan membership (17). At the baseline in-person interview after patient consent, blood samples were collected from 90% of participants via phlebotomy, and saliva samples were also collected from 96% of participants by the Oragene DNA Self-Collection Kit (DNA Genotek Inc.) as an alternative source of genomic DNA. The biospecimens were shipped to Roswell Park Cancer Institute (RPCI) for processing and storage under the auspices of the RPCI Data Bank and Biorepository (DBBR; ref. 18). Whole blood was aliquoted for DNA extraction using the Qiagen FlexiGene Kit. DNA from approximately 2 mL saliva samples was extracted using the Oragene Kit. Nucleotide concentration of DNA samples was determined by both NanoDrop and PicoGreen techniques. DNA samples were stored at −80°C until analysis. For this study, we included randomly selected samples from five women diagnosed with triple-negative breast cancer [estrogen receptor (ER)–negative, progesterone receptor (PR)–negative, and human epidermal growth factor receptor 2 (Her2)–negative] who had DNA available from both peripheral blood and saliva samples. The study was approved by the Institutional Review Boards (IRB) of RPCI and KPNC.
Library preparation and sequencing
Genomic DNA from whole blood (3 μg, 1 μg, and 0.2 μg DNA) and from saliva (1 μg DNA) was captured using the Agilent SureSelect Human All Exon v5 Kit. The 3 μg and 1 μg inputs were fragmented to a size range of 150–200 bp, followed by end repair, adaptor ligation, and a low number of PCR cycles (5 cycles). The 0.2 μg input followed the same procedure, except that a higher number of PCR cycles was used (11 cycles). Individual libraries were barcoded, pooled (5-plex), and loaded onto four lanes of a HiSeq flow cell, followed by 101-bp paired-end sequencing on the Illumina HiSeq 2500 according to the manufacturer's protocol. To minimize potential batch effects, the libraries were randomly assigned to the four sequencing lanes using the OSAT program to ensure that the distribution of DNA input amount and DNA source was even across lanes (19). Library preparation and sequencing were performed by the RPCI Genomics Shared Resource.
Variant calling for SNVs and Indels
The raw sequence reads were aligned to the Human Reference Genome (NCBI build 37) using the Burrows-Wheeler Aligner (20). After removing PCR duplicates using Picard (21), the GATK software version 3.0 (22) was used for local realignment, base quality recalibration, and variant calling of SNVs and small indels. In the variant calling step, variants were first called in each sample separately, and then joint genotyping analysis was performed on the samples from the same DNA source and same DNA input amount, followed by variant recalibration to generate analysis-ready variants. Only the variants that passed the GATK quality filter (tranche sensitivity threshold 99.9%) were used in our analysis.
Benchmark rate of variant calling concordance
As the bioinformatics pipeline may have a major impact on variant calling concordance, we estimated the concordance level of our pipeline based on a reference WES dataset with high-quality variant callsets and used the concordance rate as a benchmark in our evaluation. The publicly available WES data of a CEU (Utah residents of northern and western European ancestry) trio (NA12878, NA12891, and NA12892) were downloaded from the 1000 Genomes Project. The WES data were originally generated using the Agilent SureSelect All Exon v2 Kit, followed by 76-bp paired-end sequencing. Variant calls for NA12878 from our pipeline were compared with two comprehensive variant callsets compiled by the Genome in a Bottle Consortium (GIAB) for this particular individual (23). The two callsets contain high-quality variant calls in the whole exome and in the high-confidence portion of the exome, respectively. The high-confidence portion of the exome excludes simple repeats, known segmental duplications, known structural variants reported in dbVar (24) for NA12878, regions paralogous to the 1000 Genomes Project “decoy reference,” and regions in the RepeatSeq database (25). The calculated concordance rates of SNV and indel calls for the NA12878 subject were then used as guidelines for assessing the consistency of variant calling for samples of varying DNA input amount and DNA source from each patient in our study.
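Restricting a callset to the high-confidence portion of the exome amounts to an interval lookup. The following is a minimal sketch only: the function name, the 0-based half-open interval convention (as used in BED files), and the toy coordinates are assumptions made for illustration and are not taken from the pipeline described above.

```python
from bisect import bisect_right

def in_regions(chrom, pos, regions):
    """Return True if 0-based position `pos` lies in a half-open [start, end) interval.

    `regions` maps chromosome -> sorted list of non-overlapping (start, end) tuples.
    """
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= pos (or -1 if none).
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

# Hypothetical high-confidence intervals and variant positions (toy values).
high_conf = {"1": [(100, 200), (500, 650)]}
calls = [("1", 150), ("1", 200), ("1", 640), ("2", 150)]

# Keep only the calls that fall inside a high-confidence interval.
kept = [c for c in calls if in_regions(c[0], c[1], high_conf)]
```

Sorting the intervals once and using binary search keeps each lookup logarithmic in the number of intervals, which matters when tens of thousands of variants are checked against many regions.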
Results
Sequencing performance
From each exome library, we obtained 63 to 102 million reads, with an average sequencing depth of 67 to 111× and ≥94% bases covered by at least 20× (Table 1). The PCR duplicate rates in all samples ranged from 0.03 to 0.13, except for one outlier library generated from 1 μg blood DNA with a duplicate rate of 0.30. The mapping rate of the sequenced reads to the reference genome in each sample was 98% to 100%; the exome capture rate was 50% on average; and the average insert size was 200 bp. All were within the expected range, indicating overall good performance of exome sequencing.
Sample ID | DNA source | DNA amount | Average sequencing depth (×) | % bases covered by at least 20× | Exome capture rate^a | PCR duplicate rate | Mapping % | Mean insert size (bp) |
---|---|---|---|---|---|---|---|---|
PBCTNPLT001 | Whole blood | 3 μg | 111 | 97.5 | 0.55 | 0.08 | 99.7 | 215 |
PBCTNPLT002 | Whole blood | 3 μg | 93 | 96.2 | 0.57 | 0.08 | 99.7 | 206 |
PBCTNPLT003 | Whole blood | 3 μg | 92 | 95.9 | 0.57 | 0.06 | 99.7 | 207 |
PBCTNPLT004 | Whole blood | 3 μg | 70 | 94.0 | 0.55 | 0.06 | 99.7 | 216 |
PBCTNPLT005 | Whole blood | 3 μg | 77 | 95.2 | 0.59 | 0.03 | 99.7 | 208 |
PBCTNPLT001 | Whole blood | 1 μg | 74 | 94.3 | 0.42 | 0.30 | 99.6 | 207 |
PBCTNPLT002 | Whole blood | 1 μg | 92 | 96.5 | 0.54 | 0.08 | 99.7 | 214 |
PBCTNPLT003 | Whole blood | 1 μg | 88 | 95.6 | 0.55 | 0.09 | 99.7 | 212 |
PBCTNPLT004 | Whole blood | 1 μg | 79 | 95.8 | 0.56 | 0.06 | 99.7 | 217 |
PBCTNPLT005 | Whole blood | 1 μg | 111 | 98.0 | 0.56 | 0.06 | 99.7 | 209 |
PBCTNPLT001 | Whole blood | 0.2 μg | 67 | 93.8 | 0.48 | 0.12 | 99.2 | 186 |
PBCTNPLT002 | Whole blood | 0.2 μg | 75 | 94.8 | 0.51 | 0.08 | 99.0 | 185 |
PBCTNPLT003 | Whole blood | 0.2 μg | 77 | 94.9 | 0.53 | 0.07 | 99.0 | 184 |
PBCTNPLT004 | Whole blood | 0.2 μg | 74 | 94.9 | 0.51 | 0.08 | 99.0 | 182 |
PBCTNPLT005 | Whole blood | 0.2 μg | 79 | 95.5 | 0.53 | 0.09 | 98.9 | 186 |
PBCTNPLT001 | Saliva | 1 μg | 72 | 94.9 | 0.56 | 0.08 | 99.1 | 205 |
PBCTNPLT002 | Saliva | 1 μg | 77 | 95.7 | 0.54 | 0.05 | 99.4 | 214 |
PBCTNPLT003 | Saliva | 1 μg | 99 | 97.4 | 0.50 | 0.13 | 98.2 | 203 |
PBCTNPLT004 | Saliva | 1 μg | 73 | 95.0 | 0.52 | 0.10 | 97.7 | 210 |
PBCTNPLT005 | Saliva | 1 μg | 69 | 93.9 | 0.51 | 0.12 | 98.5 | 201 |
^a The exome capture rate is calculated as the number of sequenced bases in the capture regions divided by the total length of all mapped reads.
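The exome capture rate defined in the footnote can be illustrated with a small calculation. This is a sketch under stated assumptions: reads and capture targets are represented as half-open (start, end) intervals on a single chromosome, targets do not overlap one another, and all names and coordinates are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def exome_capture_rate(reads, targets):
    """Sequenced bases inside capture targets divided by total mapped read length.

    Assumes `targets` are non-overlapping, so no base is counted twice.
    """
    on_target = sum(
        overlap(r_start, r_end, t_start, t_end)
        for r_start, r_end in reads
        for t_start, t_end in targets
    )
    total = sum(r_end - r_start for r_start, r_end in reads)
    return on_target / total if total else 0.0

# Toy example: three 101-bp reads and one 200-bp capture target.
targets = [(1000, 1200)]
reads = [(950, 1051), (1100, 1201), (5000, 5101)]
rate = exome_capture_rate(reads, targets)  # 151 on-target bases / 303 mapped bases
```

The nested loop is quadratic and is meant only to make the definition concrete; a real implementation would use sorted intervals or an existing coverage tool.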
We then examined whether the DNA input amount and DNA source affected sequencing quality. In comparisons of the two lower DNA input amounts (1 μg and 0.2 μg) with the standard 3 μg blood DNA, no significant differences were found in total sequenced reads, sequencing depth, percent bases covered by at least 20×, PCR duplicate rate, or exome capture rate. The only significant differences were in the mapping rate and the mean insert size. With the 0.2 μg DNA input, the mapping rate was marginally lower and the mean insert size was shorter than with the two higher input amounts (Student t test P values ≤ 0.001). Similarly, in comparisons between saliva DNA and blood DNA, the only significant differences were observed in the mapping rate and the mean insert size (P values < 0.05). It should be noted, however, that all mapping rates were close to or above 98% (minimum 97.7%; Table 1). In a multivariable linear model relating each of the sequencing statistics in Table 1 to patient ID, DNA amount, and DNA source, only the mean insert size differed significantly by input amount, with a shorter insert size when using 0.2 μg DNA compared with 1 μg DNA (P < 0.001).
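As an illustration of the insert-size comparison, the sketch below computes a two-sample Student t statistic from the mean insert sizes listed in Table 1 for the 3 μg and 0.2 μg blood DNA libraries. The text does not state whether the test was paired or assumed equal variances, so the unpaired, pooled-variance form used here is an assumption, and the function name is hypothetical. Only the statistic and degrees of freedom are computed; converting them to a P value requires a t distribution from a statistics library (for example, `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance

def student_t(x, y):
    """Unpaired two-sample Student t statistic with pooled (equal) variance."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, df

# Mean insert sizes (bp) from Table 1, patients 001 to 005.
insert_3ug = [215, 206, 207, 216, 208]
insert_02ug = [186, 185, 184, 182, 186]

t_stat, df = student_t(insert_3ug, insert_02ug)
```

Because the same five patients contribute to both groups, a paired test would also be a defensible choice; the two approaches need not give the same statistic.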
Quality of variant calls
We next investigated the performance of variant calling by DNA input amount and DNA source. For each of the 5 breast cancer patients, we detected 42.6 to 52.3k variants, including SNVs and indels. The number of indels was approximately 10.8% to 12.0% of the number of SNVs, consistent with that from the 1000 Genomes data (26). We investigated the overall variant calling quality on the basis of three commonly used quality metrics: transition-transversion ratio (Ti/Tv) for SNV calls, heterozygous-homozygous ratio (Het/Homo), and percentage of overlap with known variants in dbSNP (Table 2). The Ti/Tv ratio for each sample ranged from 2.59 to 2.64, consistent with that commonly observed in WES studies (13, 27, 28).
Sample ID | DNA source | DNA input amount | Number of variants | Transition/transversion ratio | Heterozygous/homozygous call ratio | % overlapping with dbSNP |
---|---|---|---|---|---|---|
PBCTNPLT001 | Whole blood | 3 μg | 43014 | 2.59 | 1.59 | 98.46 |
PBCTNPLT002 | Whole blood | 3 μg | 44353 | 2.63 | 1.59 | 98.45 |
PBCTNPLT003 | Whole blood | 3 μg | 44868 | 2.61 | 1.66 | 98.13 |
PBCTNPLT004 | Whole blood | 3 μg | 52352 | 2.62 | 2.03 | 98.07 |
PBCTNPLT005 | Whole blood | 3 μg | 43781 | 2.62 | 1.64 | 98.46 |
PBCTNPLT001 | Whole blood | 1 μg | 42610 | 2.60 | 1.58 | 98.61 |
PBCTNPLT002 | Whole blood | 1 μg | 44202 | 2.64 | 1.60 | 98.51 |
PBCTNPLT003 | Whole blood | 1 μg | 44645 | 2.62 | 1.66 | 98.19 |
PBCTNPLT004 | Whole blood | 1 μg | 52206 | 2.61 | 2.04 | 98.10 |
PBCTNPLT005 | Whole blood | 1 μg | 43841 | 2.62 | 1.66 | 98.40 |
PBCTNPLT001 | Whole blood | 0.2 μg | 42592 | 2.59 | 1.57 | 98.51 |
PBCTNPLT002 | Whole blood | 0.2 μg | 44060 | 2.64 | 1.60 | 98.51 |
PBCTNPLT003 | Whole blood | 0.2 μg | 44707 | 2.61 | 1.65 | 98.17 |
PBCTNPLT004 | Whole blood | 0.2 μg | 52161 | 2.61 | 2.03 | 98.08 |
PBCTNPLT005 | Whole blood | 0.2 μg | 43626 | 2.62 | 1.64 | 98.43 |
PBCTNPLT001 | Saliva | 1 μg | 42711 | 2.59 | 1.58 | 98.56 |
PBCTNPLT002 | Saliva | 1 μg | 44317 | 2.64 | 1.60 | 98.50 |
PBCTNPLT003 | Saliva | 1 μg | 44998 | 2.61 | 1.71 | 98.16 |
PBCTNPLT004 | Saliva | 1 μg | 52287 | 2.62 | 2.04 | 98.14 |
PBCTNPLT005 | Saliva | 1 μg | 43700 | 2.62 | 1.64 | 98.50 |
The Het/Homo call ratios varied notably by patients' racial/ethnic background. Among the 3 patients of European descent (PBCTNPLT001, 002, 005) and 1 patient of Hispanic descent (PBCTNPLT003), the ratio ranged from 1.57 to 1.71, whereas the ratio was higher in all four samples from the 1 patient of African descent (PBCTNPLT004), with values of 2.03 to 2.04. This racial/ethnic variation is consistent with the literature (16, 29). Of note, DNA from this patient also showed the highest number of variants among the 5 patients evaluated, probably due to the high genetic diversity in populations of African ancestry (30, 31). The overlap between the called variants and known variants from dbSNP was high, and the percentage of novel variants was below 2%, with the highest percentages in samples from the 2 patients of non-European descent (PBCTNPLT003, 004). Despite these racial/ethnic variations, we noticed little difference in any of the above three quality metrics of variant calls by DNA input amount or DNA source.
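For readers less familiar with the quality metrics in Table 2, the sketch below shows how the Ti/Tv and Het/Homo ratios can be computed from a list of calls. It is illustrative only: the input representation, the function names, and the toy calls are assumptions, SNVs are assumed to be biallelic single-base substitutions, and the Het/Homo ratio is taken here to mean heterozygous calls divided by homozygous non-reference calls.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) single-base substitutions."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")

def het_hom_ratio(genotypes):
    """Heterozygous calls divided by homozygous non-reference calls.

    `genotypes` holds (allele1, allele2) index pairs, where 0 is the reference allele;
    homozygous-reference genotypes are ignored.
    """
    het = sum(1 for a, b in genotypes if a != b)
    hom_alt = sum(1 for a, b in genotypes if a == b and a != 0)
    return het / hom_alt if hom_alt else float("inf")

# Toy calls (hypothetical): four transitions, two transversions.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
# Three heterozygous, two homozygous-alternate, one homozygous-reference genotype.
genotypes = [(0, 1), (0, 1), (1, 1), (0, 1), (0, 0), (1, 1)]

ti_tv = ti_tv_ratio(snvs)
het_hom = het_hom_ratio(genotypes)
```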
Concordance of variant calls by DNA input amount and DNA source
We next assessed the concordance of variant calls within each patient between each of the two lower DNA input amounts and the 3 μg input amount, as well as between saliva DNA and blood DNA. In all comparisons, the concordance rate was close to or exceeded 98% for SNVs, and for indels the rate ranged from 83% to 86% (Fig. 1; Supplementary Table S1). When comparing the concordance rate of our samples with the benchmark rates estimated for our pipeline based on the NA12878 data (see Materials and Methods), we found the average SNV concordance rate for each of the 5 breast cancer patients was higher than the reference concordance rates calculated for the NA12878 subject in whole-exome and high-confidence exome regions, respectively (94.3% and 96.4%), whereas the concordance rate of indel calls was only slightly lower than the reference concordance rate of NA12878 in whole exome (87.1%; Fig. 1). When compared with the 3 μg DNA input, the 0.2 μg DNA input amount had a marginally lower concordance rate than the 1 μg DNA input, particularly for indel calls (83.9% vs. 85%). For saliva DNA, the SNV concordance remained at a high level (98.3%) but the indel concordance was the lowest among all comparisons (83.6%), which could be due to shorter DNA fragments from saliva than those from blood DNA. Nevertheless, the slightly inferior indel concordance is still in an acceptable range (32).
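The pairwise comparison described above can be sketched as a set operation on variant calls. The exact concordance formula is not given in the text, so the definition used here (calls shared by both samples divided by calls present in either sample, with a call identified by chromosome, position, alleles, and genotype) is an assumption, as are the function names and toy calls.

```python
def is_snv(call):
    """A call is treated as an SNV when both alleles are a single base; otherwise an indel."""
    _, _, ref, alt, _ = call
    return len(ref) == 1 and len(alt) == 1

def concordance(calls_a, calls_b):
    """Shared calls divided by the union of calls from both samples."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def concordance_by_type(calls_a, calls_b):
    """Return (SNV concordance, indel concordance) for two call lists."""
    snv = concordance(
        [c for c in calls_a if is_snv(c)], [c for c in calls_b if is_snv(c)]
    )
    indel = concordance(
        [c for c in calls_a if not is_snv(c)], [c for c in calls_b if not is_snv(c)]
    )
    return snv, indel

# Hypothetical calls: (chrom, pos, ref, alt, genotype).
blood_3ug = [
    ("1", 100, "A", "G", "0/1"),
    ("1", 200, "C", "T", "1/1"),
    ("2", 300, "AT", "A", "0/1"),
    ("2", 400, "G", "GCC", "0/1"),
]
saliva_1ug = [
    ("1", 100, "A", "G", "0/1"),
    ("1", 200, "C", "T", "1/1"),
    ("2", 300, "AT", "A", "0/1"),
    ("2", 450, "T", "TA", "0/1"),
]

snv_conc, indel_conc = concordance_by_type(blood_3ug, saliva_1ug)
```

Including the genotype in the call key means that a site called heterozygous in one sample and homozygous in the other counts as discordant; dropping it from the key would measure site-level agreement instead.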
We further investigated the quality metrics of the discordant variant calls by DNA input amount and DNA source (Supplementary Figs. S1 and S2). We found that these variants were enriched with potential false positives, as characterized by lower quality scores and by higher novel variant percentage, indel length, and Het/Homo call ratio. These findings suggest that our estimates may understate the concordance among bona fide variants, that is, the concordance that would be observed after excluding false variant calls.
Detection of coding variants in known breast cancer genes
Lastly, as all samples evaluated in our study were collected from women diagnosed with triple-negative breast cancer, we examined whether the use of a lower DNA input amount or saliva samples had any impact on the detection of coding variants that may underlie breast cancer etiology. We compiled a list of 21 breast cancer–related genes from the Cancer Gene Census (33; Supplementary Table S2) and assessed the concordance of variants within these genes among the four samples from each patient. As shown in Fig. 2 and Supplementary Table S3, compared with the coding variants detected from the 3 μg blood DNA input amount (39 to 59 per sample including both SNVs and indels), the number of variants detected from the two lower DNA input amounts differed slightly by 0 to 2, and for DNA sourced from saliva by –1 to 2. The concordance rate was 100% with the 1 μg blood DNA input amount, 97.4% to 100% with the 0.2 μg DNA input amount, and 94.9% to 100% with the saliva DNA. All discordant calls came from one SNV and four indels (Supplementary Table S4). After manual review of the sequence alignment files, we concluded that these discordant calls either were false indel calls introduced by homopolymer runs (34) or resided in regions where sequencing coverage was too low to make reliable calls. Therefore, with respect to true variants, the concordance rate can reach 100%.
Discussion
Our results demonstrate that lower DNA input amounts and DNA from saliva have relatively small effects on WES quality and variant calling consistency. To the best of our knowledge, this is the first comprehensive evaluation of the impact of lower DNA input amount and DNA source on the performance of WES with potential applications for cancer epidemiology. We further demonstrated that lower DNA input amount and saliva DNA can reliably detect variants in breast cancer–related genes, which supports their use in epidemiologic studies searching for coding risk variants, when sample requirements according to a manufacturer's standard protocol cannot be readily met.
Among the various commonly used sequencing and variant calling quality metrics evaluated, we found that the data generated from 1 μg blood DNA were essentially the same as those from 3 μg blood DNA, and that there was little impact on most quality metrics when using DNA input amounts as low as 0.2 μg. The only differences were a shorter insert size and lower mapping rates when using 0.2 μg DNA. The shorter insert size may result from extra fragmentation in the DNA shearing step due to the lower DNA amount and from the higher number of PCR cycles (n = 11) performed. The slightly lower mapping rate could also result from more random errors introduced by the increased number of PCR cycles. Nevertheless, the shorter insert size and slightly lower mapping rate had little effect on the rate of PCR duplication, sequencing depth, or downstream variant calling.
We demonstrated that the WES performance in terms of sequencing and variant calling quality for saliva DNA samples was similar to that of blood DNA samples. The mapping rate to the reference genome and the insert size of saliva samples were only slightly lower than those of blood samples (98%–99% vs. 100% for mapping rate; 201–214 vs. 207–217 for insert size), indicating very low bacterial DNA contamination and shorter DNA fragments in the saliva samples. As the saliva samples used in our study were collected and processed without any special optimization for NGS applications, we expect this finding to generalize widely to saliva samples collected routinely in many epidemiologic studies.
Regarding variant calling concordance according to input amount and source of DNA, we did observe lower concordance for indel calls than for SNV calls, especially with the lower 0.2 μg DNA input amount and with saliva DNA. This could be due to the more complex structure of indel variants themselves, which makes calling them from short-read data more challenging than calling SNVs. In addition, the shorter insert size associated with lower DNA input amount and saliva DNA may have had a greater impact on indel calling. Nonetheless, the magnitude of the difference was small and is likely negligible in most applications.
Although it is possible to infer copy-number variations (CNV) from WES data and several algorithms have been developed for this purpose, previous studies evaluating the performance of these algorithms concluded that the sensitivity, accuracy, and power were still limited (35, 36). We thus did not evaluate the impact of DNA input amount and saliva DNA on CNV detection in our study.
The motivation of our study is to test whether we can reliably detect rare variants related to breast cancer etiology using low DNA input and saliva DNA. Therefore, we designed the study with a sequencing depth typically used for detecting rare variants. We expect the concordance rate would be lower at a substantially lower sequencing depth, particularly when DNA input is low or saliva DNA is used. Future studies are warranted to assess the impact of varying sequencing depth on the concordance rate.
In summary, we provide compelling evidence that when the standard DNA requirement of a manufacturer's WES protocol cannot be satisfied, lower DNA input amounts (down to 0.2 μg) or saliva as an alternative DNA source can generate comparable results. These findings may allow the expansion of WES applications in epidemiologic studies in which DNA specimens may be a finite resource or only low DNA amounts or saliva samples are available. However, caution should be taken for indel calls, as we found a larger impact of low DNA input and saliva DNA on indels than on SNVs. Currently, there are two exome capture platforms that require less than 0.2 μg input DNA: the Ion AmpliSeq Exome Kit, which can only be run on the Ion Proton Sequencer, and the Illumina Nextera Exome Kit. Both kits use as little as 50 ng DNA as the starting material. It will be interesting to investigate comprehensively the performance of WES data generated using such low input amounts, and to compare the performance among different exome capture platforms with such low DNA input. Our study did show a larger impact on calling indels than SNVs when lowering DNA input amount or using saliva DNA. We may anticipate that the performance difference will be even larger when using 50 ng input DNA. In addition, this difference may become stronger when using other exome capture platforms, as the Agilent platform was reported to have higher sensitivity for indels than other platforms (16).
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Authors' Contributions
Conception and design: Q. Zhu, C.D. Morrison, S.T. Glenn, W. Davis, C.B. Ambrosone, S. Liu, S. Yao
Development of methodology: Q. Zhu, Q. Hu, C.D. Morrison, J.M. Conroy, S.T. Glenn, W. Davis, C.B. Ambrosone, S. Yao
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): C.D. Morrison, J.M. Conroy, S.T. Glenn, W. Davis, M.L. Kwan, I.J. Ergas, L.H. Kushi
Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): Q. Zhu, Q. Hu, L. Shepherd, J. Wang, L. Wei, L.H. Kushi, S. Liu, S. Yao
Writing, review, and/or revision of the manuscript: Q. Zhu, Q. Hu, L. Shepherd, J. Wang, J.M. Conroy, W. Davis, M.L. Kwan, J.M. Roh, L.H. Kushi, C.B. Ambrosone, S. Liu, S. Yao
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): Q. Zhu, L. Shepherd, W. Davis, I.J. Ergas, J.M. Roh, S. Yao
Study supervision: Q. Zhu, C.B. Ambrosone, S. Yao
Grant Support
The Pathways Study is supported by NIH R01 CA105274 (to L.H. Kushi). The RPCI Bioinformatics Shared Resource, Biostatistics Shared Resource, Data Bank and BioRepository, and Genomics Shared Resource are CCSG Shared Resources supported by NIH grant P30 CA016056.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.