Abstract
In large-scale genome-wide association studies based on high-density single nucleotide polymorphism (SNP) genotyping arrays, the quantity and quality of available genomic DNA (gDNA) are a practical problem. We examined the feasibility of using the Multiple Displacement Amplification (MDA) method of whole-genome amplification (WGA) for such a platform. The Affymetrix Early Access Mendel Nsp 250K GeneChip was used for genotyping 224,940 SNPs per sample for 28 DNA samples. We compared the call concordance using 14 gDNA samples and their corresponding 14 WGA samples. The overall mean genotype call rates in gDNA and the corresponding WGA samples were comparable at 97.07% [95% confidence interval (CI), 96.17-97.97] versus 97.77% (95% CI, 97.26-98.28; P = 0.154), respectively. Reproducibility of the platform, calculated as concordance in duplicate samples, was 99.45%. Overall genotypes for 97.74% (95% CI, 97.03-98.44) of SNPs were concordant between gDNA and WGA samples. When the analysis was restricted to well-performing SNPs (successful genotyping in gDNA and WGA in >90% of samples), 99.11% (95% CI, 98.80-99.42) of the SNPs, on average, were concordant, and overall a SNP showed a discordant call in 0.92% (95% CI, 0.90-0.94) of paired samples. Similar concordance was reproduced for one pair of gDNA and WGA DNA on Illumina's Infinium 610 Quad platform. Although copy number analysis revealed a total of seven small telomeric regions in six chromosomes with loss of copy number, the estimated genome representation was 99.29%. In conclusion, our study confirms that high-density oligonucleotide array-based genotyping can yield reproducible data and that MDA-WGA DNA products can be effectively used for genome-wide SNP genotyping analysis. (Cancer Epidemiol Biomarkers Prev 2008;17(12):3499–508)
Introduction
The quantity and quality of source DNA are a major concern for genome-wide studies using rapidly evolving array-based high-throughput genotyping technologies. Currently available gene-chip platforms can interrogate up to one million single-nucleotide polymorphisms (SNP) from one DNA sample, which naturally requires sufficient quantity and quality of input genomic DNA (gDNA). The source of such DNA is quickly exhausted as stored samples are used up. Obtaining gDNA from blood for very high throughput genotyping involves cost, storage space, time, and skill for DNA extraction (1). DNA can also be obtained from other noninvasive sources, e.g., buccal mucosal cells, but the amount and the quality are variable (2). One option to solve the problem is to immortalize peripheral blood lymphocytes by insertion of a viral genome (3). However, the process is labor-intensive, costly, and time-consuming, and, most importantly, viable cells must be available to obtain DNA. Whole-genome amplification (WGA) is rapidly becoming a popular option to sustain the source of DNA for large-scale genotyping studies (4-7). WGA by the Multiple Displacement Amplification (MDA) method generates a large amount of high-quality DNA from a very small amount of input DNA. The usefulness of a WGA method depends on its ability to reproduce the entire genome faithfully and with the least amplification bias, because any such bias, potentially resulting in errors in high-density SNP genotyping, will have a major effect on the power to detect linkage or associations (6). We carried out a study to assess the concordance rate and the reproducibility of the genotyping calls obtained from gDNA samples and the corresponding WGA DNA samples in a microarray-based high-throughput SNP typing assay.
Materials and Methods
We used a total of 28 samples, 14 gDNA samples from healthy individuals and their corresponding 14 WGA DNA samples, and genotyped 224,940 SNPs per sample using 28 early access Affymetrix Mendel Nsp Array GeneChips. The mean and median inter-SNP distance of the SNP array chip were 11.19 kb and 4.82 kb, respectively. Of the 14 gDNA samples, 12 were collected from healthy controls from three centers (four samples per center) and the remaining two were duplicate reference gDNA samples. The three centers were (a) the Northern California site of the Breast Cancer Family Registry-Northern California Cancer Center; (b) the Ontario site of the Breast Cancer Family Registry-Cancer Care, Ontario (8); and (c) the German Cancer Research Center participating in the German Breast Cancer Study (9). In addition to the 28 early access Affymetrix Mendel Nsp Array GeneChips, we also tested the reference gDNA sample and the corresponding WGA DNA sample on Illumina's Infinium 610 Quad SNP-chip interrogating 620,901 markers, including 592,532 SNPs and 28,369 nonpolymorphic copy number variation probes. All the DNA samples were extracted from blood, except for the four samples from the Northern California Cancer Center, which were obtained from lymphoblastoid cell lines. These 14 gDNA samples were amplified using the Qiagen Repli-g Midi kit to obtain their corresponding WGA DNA samples, as described below.
Whole-Genome Amplification
We used an MDA-based WGA kit, the Qiagen Repli-g Midi kit. The manufacturer's protocol (10) was followed, which includes alkali (KOH) denaturation of gDNA samples before amplification. This method uses isothermal genome amplification by Phi 29 DNA polymerase, which is capable of replicating up to 100 kb without dissociating from the gDNA template. This DNA polymerase has a 3′ to 5′ exonuclease proofreading activity to maintain high fidelity during replication and is used in the presence of exonuclease-resistant primers to achieve high yields of DNA product (10). The quality and integrity of the tested gDNA samples were assessed using an Agilent 2100 BioAnalyzer. The gDNA size ranged from 1,500 bp to >10,000 bp, and the 260/280 ratio measured on a NanoDrop ND-1000 UV spectrophotometer was between 1.8 and 1.9. We used 2.5 μL of gDNA in TE buffer at 10 ng/μL concentration (i.e., 25 ng of gDNA) for the WGA reaction. DNA was incubated isothermally at 30°C for 10 h, followed by heat inactivation of the DNA polymerase at 65°C for 3 min. After amplification, the quality of the WGA product was checked on DNA 7500 chips using the Agilent 2100 BioAnalyzer (Fig. 1). Each electropherogram clearly showed uniform amplification, producing a smear starting from 1.5 kb and extending to >10.0 kb with a clear peak at around 7.0 kb.
Agilent 2100 BioAnalyzer electropherogram of 10 WGA DNA samples (normalized to 50 ng/μL concentration) overlaid on ladder marker peaks (red). After the initial spike at 50 bp, the subsequent ladder peaks correspond to 100, 300, 500, 700, 1,000, 1,500, 2,000, 3,000, 5,000, 7,000, and 10,380 bp, respectively. All the samples (in different colors) show a uniform smearing effect starting around 1,500 bp extending to >10 kb size with clear peak around the 7 kb region (11th peak).
Genotyping
Microarray-based genome-wide SNP genotyping was done using the early-access Affymetrix Mendel Nsp 250K chip. DNA samples were normalized to 50 ng/μL concentration. The Affymetrix standard protocol (11) was followed with a slight modification in the PCR purification step: high-speed ultracentrifugation was used instead of vacuum extraction. To compare the effect of the quantity of PCR product used for hybridization on genotype call rate, 60 μg (as suggested by Affymetrix) and 90 μg of purified PCR products were used for fragmentation. For the WGA samples, 60 μg of purified PCR products from six samples (two from each center) and 90 μg from the other six samples (two from each center) were used. Scanning was done on a high-resolution Affymetrix GeneChip Scanner 3000 7G. The electronic data were saved as DAT and CEL files. The CAB files for the images were used to transfer the data into GCOS 1.4 for subsequent use of the data in G-TYPE v4.0 software. Using V3 annotation for the early-access chips, a total of 224,940 SNPs were genotyped per sample. The SNPs are approximately evenly distributed across the whole genome; mean and median inter-SNP distance were 11.19 kb and 4.815 kb, respectively. The Affymetrix BRLMM algorithm was used to generate the genotype calls.
Statistical Analysis
The completeness of genotyping was determined for each of the 224,940 SNPs for the 14 gDNA samples (Com_gDNA) and 14 WGA DNA samples (Com_WGA). For example, if 14 of 14 gDNA samples could be genotyped for a given SNP, then the “Com_gDNA” was 100% for that particular SNP. But the same SNP might have been genotyped in 12 of 14 or 86% of the corresponding WGA DNA samples, so “Com_WGA” would be 86%.
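The completeness measure described above can be sketched in a few lines of Python. This is only an illustration of the definition, not the software used in the study; the function name `completeness` and the convention of storing a failed genotype as `None` are assumptions made for the example.

```python
from typing import Optional, Sequence


def completeness(calls: Sequence[Optional[str]]) -> float:
    """Percentage of samples with a genotype call for one SNP.

    Assumption: a failed genotype (no call) is stored as None.
    """
    called = sum(1 for c in calls if c is not None)
    return 100.0 * called / len(calls)


# Worked example from the text: one SNP typed in 14 of 14 gDNA samples
# but in only 12 of 14 WGA DNA samples.
gdna_calls = ["AA"] * 14
wga_calls = ["AA"] * 12 + [None] * 2

com_gdna = completeness(gdna_calls)
com_wga = completeness(wga_calls)
print(f"Com_gDNA = {com_gdna:.0f}%, Com_WGA = {com_wga:.0f}%")
```

Applied to every SNP, this gives one Com_gDNA and one Com_WGA value per SNP, which is the form in which the results are summarized.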
We report the concordance between gDNA and WGA DNA samples in two ways: (a) concordance by sample pair (i.e., the mean proportion of SNPs concordant between paired gDNA and WGA DNA samples) and (b) concordance by individual SNPs (i.e., the mean proportion of paired samples for which a particular SNP was concordant). Reproducibility of the platform was calculated as the proportion of SNPs concordant between two duplicate reference gDNA samples. For a given SNP, the discordant rate was calculated as the proportion of informative pairs across the paired samples that were discordant for that particular SNP. By informative data we mean the number of paired observations for which a genotype call for a given SNP could be made in both the gDNA and the WGA DNA samples. For example, if a given SNP could be genotyped in all 14 gDNA samples but in only 12 WGA DNA samples, then we excluded the 2 pairs (where we have only genotype result for gDNA but not for WGA DNA) and included only the data for the 12 informative pairs, where that given SNP could be genotyped in both gDNA and WGA DNA samples to calculate the discordance. If among these 12 informative pairs, we found discrepancy of genotype call in one pair, then we calculated the discordance rate to be 1 in 12 or 8.3%.
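The per-SNP discordance calculation, restricted to informative pairs, might look like the following Python sketch. The helper names `informative_pairs` and `discordance_rate` are hypothetical, and the sketch assumes that no-calls are stored as `None` and that genotypes are encoded consistently (e.g., always "AB", never "BA").

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[str]  # assumption: None marks a no-call


def informative_pairs(
    gdna: Sequence[Genotype], wga: Sequence[Genotype]
) -> List[Tuple[str, str]]:
    """Keep only pairs in which both the gDNA and the WGA sample were called."""
    return [(g, w) for g, w in zip(gdna, wga) if g is not None and w is not None]


def discordance_rate(gdna: Sequence[Genotype], wga: Sequence[Genotype]) -> float:
    """Percentage of informative pairs whose genotype calls differ."""
    pairs = informative_pairs(gdna, wga)
    if not pairs:
        raise ValueError("no informative pairs for this SNP")
    discordant = sum(1 for g, w in pairs if g != w)
    return 100.0 * discordant / len(pairs)


# Worked example from the text: called in all 14 gDNA samples but in only
# 12 WGA samples, with one discrepant call among the 12 informative pairs.
gdna = ["AB"] * 14
wga = ["AA"] + ["AB"] * 11 + [None, None]

n_informative = len(informative_pairs(gdna, wga))
rate = discordance_rate(gdna, wga)
print(f"{n_informative} informative pairs, discordance = {rate:.1f}%")
```

Concordance by sample pair is the same comparison taken along the other axis: for each gDNA-WGA pair, the share of SNPs with matching calls.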
For both copy number (CN) and loss of heterozygosity (LOH) analyses, we took the gDNA samples as the reference against which the paired analysis was done for WGA samples. For CN analysis, background correction was done with adjustment for fragment length and probe sequence, but no normalization was done. The log2 ratio of the signal intensity was used for calculation of the CN. For detection of CN change regions, the Hidden Markov Model (12) was used with maximum probability of 0.995, genomic decay of 10,000,000, and σ = 2. Maximum probability specifies the probability of retaining the same state between neighboring observations. The genomic decay describes how quickly (expressed in base pairs) the Hidden Markov Model retention of state will decay toward the initial probability. The parameter σ specifies the Gaussian bandwidth of the distribution from which observations are drawn. Higher values of σ allow for more noise but may fail to detect smaller regions; smaller values result in more regions being detected. The reported regions contain at least 10 probe sets overlapping the regions in 7 of 14 samples. The CN change regions were mapped to cytoband regions and the length was calculated from the start to the end of each region. The total number of SNPs within a CN change region and the average CN were reported.
The paired LOH regions were calculated assuming the maximum probability of 0.99, genomic decay of 10,000,000, and genotype error = 0.01. The reported regions overlap at least in 7 of 14 samples. The length of the region was calculated from the start to the end regions.
Results
The overall mean call rate was 97.07% [95% confidence interval (95% CI), 96.17-97.97] in 14 gDNA samples and 97.77% (95% CI, 97.26-98.28; P = 0.154) in corresponding WGA samples. Center-specific call rates, proportion of different genotypes (AA, AB, or BB), and raw intensity data are presented in Table 1. There was no significant difference in call rates or raw intensity for either gDNA or WGA DNA samples across different centers. We also did not observe any significant difference in genotype call rates using 60 or 90 μg of amplified PCR product of WGA samples for hybridization (97.82%; 95% CI, 97.01-98.63 versus 97.71%; 95% CI, 96.85-98.56; P = 0.826). Reproducibility of the Affymetrix platform, as measured by concordance of SNP genotyping in duplicate reference samples, was found to be 99.45%.
Genotype calls in gDNA and WGA samples by center using Affymetrix early access Mendel Nsp Array GeneChips interrogating 224,940 SNPs using the BRLMM algorithm
Measure | Center 1, mean (95% CI) | Center 2, mean (95% CI) | Center 3, mean (95% CI) | ANOVA, P |
---|---|---|---|---|
gDNA samples* | n = 4 | n = 4 | n = 4 | |
Call rate (%) | 96.67 (95.01-98.32) | 96.16 (93.19-99.14) | 97.68 (95.14-100.21) | 0.405 |
AB call (%) | 28.76 (27.05-30.48) | 28.80 (27.03-30.58) | 28.42 (26.66-30.17) | 0.863 |
AA call (%) | 35.09 (33.68-36.51) | 34.79 (32.50-37.08) | 35.73 (33.76-37.71) | 0.552 |
BB call (%) | 32.81 (30.83-34.79) | 32.57 (30.13-35.00) | 33.52 (31.23-35.81) | 0.626 |
Raw intensity | 8.77 (8.27-9.27) | 8.66 (8.26-9.07) | 8.67 (8.26-9.08) | 0.831 |
WGA samples† | n = 4 | n = 4 | n = 4 | |
Call rate (%) | 98.17 (97.51-98.83) | 97.42 (96.28-98.56) | 98.36 (97.40-99.33) | 0.111 |
AB call (%) | 27.76 (26.92-28.60) | 27.48 (26.74-28.22) | 27.91 (27.34-28.48) | 0.436 |
AA call (%) | 36.22 (35.63-36.81) | 36.03 (35.21-36.84) | 36.23 (35.53-36.94) | 0.773 |
BB call (%) | 34.18 (33.44-34.92) | 33.90 (32.85-34.96) | 34.21 (33.42-35.01) | 0.682 |
Raw intensity | 8.67 (8.20-9.13) | 8.62 (8.10-9.14) | 8.76 (8.27-9.25) | 0.805 |
* Genomic DNA samples.
† Whole genome–amplified samples.
The overall mean completeness of genotyping in gDNA samples (Com_gDNA) was 97.11% (95% CI, 97.08-97.14) and that of WGA samples (Com_WGA) was 97.80% (95% CI, 97.77-97.92; P < 0.001). Genotype completeness by chromosome in gDNA and WGA samples is presented in Fig. 2A. For both the gDNA and WGA samples, the completeness of genotype calls was highest for SNPs on the X chromosome (marked as chromosome 23 in the figure). Among the autosomes, completeness was more consistent across the chromosomes for gDNA samples than for WGA samples. In particular, for the WGA samples, the genotyping completeness was lower (ANOVA, P < 0.001) for the SNPs in chromosomes 16, 17, 19, 20, and 22 than for those in chromosome 10, which was taken as the reference because it is of median size and is free from regions with copy number bias (as shown later in the Results).
A. X axis, mean (95% CI) completeness of genotype call in gDNA samples (blue solid square) and WGA samples (red open square); Y axis, chromosomes. Chromosome X is chromosome 23. Error bars, 95% CI. B. Genotype concordance rate in all samples combined and samples from different centers. Error bar, SD. C. Discordant rate (Y axis) as a function of completeness of genotyping in WGA samples (X axis) and the number of SNPs in each group. Error bar, 95% CI. D. Discordant rate in different chromosomes for the well performing SNPs (those which could be successfully genotyped in >90% of the cases in both gDNA and WGA samples) and the total number of SNPs in each chromosome. Error bar, 95% CI.
Concordance by Sample Pair
Considering all the 224,940 SNPs genotyped per sample, the overall concordance between genotype calls from gDNA and WGA DNA samples was 97.74% (95% CI, 97.03-98.44) without significant differences (mean ± SD) between control samples and samples from different centers (control, 98.34% ± 0.528; center 1, 97.51% ± 1.092; center 2, 97.00% ± 1.408; center 3, 98.41% ± 1.256; ANOVA, P = 0.387). In other words, in each gDNA-WGA pair sample, there were on average 2.26% (95% CI, 1.55-2.97) SNPs with discordant genotypes. In the next step, we restricted the analysis to well performing SNPs, i.e., SNPs that could be successfully genotyped in >90% of samples or in other words, SNPs with Com_gDNA and Com_WGA >90%. There was a total of 191,251 well performing SNPs, and the concordance rate improved to 99.11% (95% CI, 98.80-99.42), indicating that >99% of the well performing SNPs (i.e., genotyped in >90% of samples) show the same genotypes in gDNA and WGA samples. Figure 2B shows the concordance of these 191,251 SNPs for combined sample pairs and for samples from different centers. No significant difference in concordance was noted among the centers (control, 99.31% ± 0.341; center 1, 99.05% ± 0.48; center 2, 98.75% ± 0.65; center 3, 99.43% ± 0.48; ANOVA, P = 0.343).
To explore the characteristics of this small proportion of SNPs producing discordant calls for each of the comparisons, we further analyzed the concordance by individual SNPs among the 14 paired samples.
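The >90% completeness filter used above has a simple consequence with 14 samples per group, which the following Python sketch makes explicit. The function names are hypothetical, and the sketch assumes that all 14 samples of each type (including the duplicate reference samples) count toward the denominator, as in the Com_gDNA example in the Statistical Analysis section.

```python
N_SAMPLES = 14        # gDNA (or WGA) samples per SNP
THRESHOLD_PCT = 90.0  # completeness cut-off stated in the text

TOTAL_SNPS = 224_940
WELL_PERFORMING_SNPS = 191_251


def pct_called(n_called: int, n_samples: int = N_SAMPLES) -> float:
    """Completeness (%) for a SNP called in n_called of n_samples samples."""
    return 100.0 * n_called / n_samples


def is_well_performing(called_gdna: int, called_wga: int) -> bool:
    """True when both Com_gDNA and Com_WGA exceed the 90% cut-off."""
    return (
        pct_called(called_gdna) > THRESHOLD_PCT
        and pct_called(called_wga) > THRESHOLD_PCT
    )


# Smallest number of successful calls (out of 14) that passes the filter.
min_calls = min(k for k in range(N_SAMPLES + 1) if pct_called(k) > THRESHOLD_PCT)
retained_pct = 100.0 * WELL_PERFORMING_SNPS / TOTAL_SNPS

print(f"minimum calls needed: {min_calls} of {N_SAMPLES}")
print(f"SNPs retained by the filter: {retained_pct:.1f}%")
```

Under that assumption, a SNP passes only if it was called in at least 13 of the 14 samples of each type, and the filter retains about 85% of the genotyped SNPs.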
Concordance by SNPs
Among the total 224,940 genotyped SNPs, only 99 SNPs (0.04%) showed discordant calls in all 14 paired observations and a total of 2,721 SNPs (1.2%) showed discordant calls in ≥7 paired observations. Figure 2C illustrates the discordant rate as a function of completeness of genotyping in WGA samples. A similar result was obtained for completeness of genotyping in gDNA samples (data not shown). These results clearly show that the well performing SNPs (high Com_gDNA and/or high Com_WGA) had the least discordance. The finding suggests that WGA DNA can be used with minimal error for well performing SNPs.
In the next step, we examined whether there is chromosomal bias for discordant calls. Recognizing the fact that the discordant rate is significantly influenced by completeness of genotyping for a particular SNP, for interrogating chromosomal bias we included the 191,251 well performing SNPs that had both the Com_gDNA and Com_WGA >90%. For practical purposes, for a genome-wide gene mapping study, one should filter the data on the SNP call rate or the completeness of the SNP genotyping. Figure 2D shows the mean (95% CI) discordant rate for the SNPs by chromosome. The data show that the overall mean discordant rate was only 0.92% (95% CI, 0.90-0.94), i.e., overall a well performing SNP had a discordant genotype result in <1% of sample pairs. There was a significant difference, however, in the discordant rate by SNPs (proportion of discordant sample pairs) for some of the chromosomes (ANOVA, P < 0.001). Compared with chromosome 1, the lowest discordance (0.44%; 95% CI, 0.37-0.51) was observed for the SNPs in chromosome-X (shown as chromosome 23 in the graph) and significantly higher discordant rates were found among SNPs in chromosome 16 (1.08%; 95% CI, 0.95-1.22), chromosome 19 (1.47%; 95% CI, 1.22-1.73), chromosome 20 (1.11%; 95% CI, 0.96-1.26), and chromosome 22 (1.36%; 95% CI, 1.11-1.62).
This apparent chromosomal bias for discordance and the effect of completeness of genotyping on discordant rates led us to look for copy number changes in WGA samples.
Copy Number Analysis
For copy number analysis, we took gDNA samples as the reference for the corresponding WGA samples. Figure 3 shows the regions of copy number changes in the WGA samples compared with the corresponding gDNA samples. The blue regions indicate loss of copy number. It is noted that in most of the chromosomes, the loss of copy number was detected in the telomeric regions. There were a total of seven regions in six chromosomes (see Fig. 3). The smallest region was 1.9 Mb in the chromosome 16 p13.3 region and the largest was 4.7 Mb in the chromosome 9 q34.2-q34.3 region. The seven regions with loss of copy number represent 21,509,539 bp. For the currently assembled human genome size of 3,021,400,000 bp (13), these data give an estimated genome coverage of 100× (3,021,400,000 − 21,509,539)/3,021,400,000 = 99.29%. It may be noted that all these seven regions were previously reported to have CN variation by different investigators (14-20). These reported variation IDs in the Database for Genomic Variants (21) are also presented in Fig. 3. Figure 4A shows the copy number changes in chromosome 9 of all the 14 WGA samples in our study. The top panel indicates the copy number loss regions marked by blue, the middle panel shows the plot of estimated copy number in reference to the gDNA samples, and the bottom panel represents the heat map where blue indicates loss of copy number, gray the normal copy number, and red the gain in copy number. Figure 4B shows the data of the chromosome 9q34.2 and 9q34.3 regions (the same 135M to 140M bp region, which are shown in Fig. 4A to have loss of copy number in our study) from the Database of Genomic Variants. The browser view clearly indicates that other investigators detected CN variations in that region. It may be noted, however, that our study indicates a defect in WGA in these regions, and this paired analysis (where gDNA is used as reference for the corresponding WGA sample) does not confirm CN variation. 
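The genome representation estimate quoted above is straightforward arithmetic and can be rechecked with a short Python snippet. The constant names are chosen for this example; the two input values are those given in the text.

```python
GENOME_SIZE_BP = 3_021_400_000  # assembled human genome size cited in the text (13)
CN_LOSS_BP = 21_509_539         # total length of the seven CN-loss regions

# Share of the genome not covered by any copy-number-loss region.
coverage_pct = 100.0 * (GENOME_SIZE_BP - CN_LOSS_BP) / GENOME_SIZE_BP
loss_pct = 100.0 * CN_LOSS_BP / GENOME_SIZE_BP

print(f"estimated genome representation: {coverage_pct:.2f}%")
print(f"fraction affected by CN loss:    {loss_pct:.2f}%")
```

The estimate treats every base within the seven regions as unrepresented, so it is a conservative (lower-bound style) figure for the WGA product.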
We noted the cytoband(s) of those regions with the copy number changes (loss or gain) in all the chromosomes. SNPs in cytoband regions with loss of copy number were marked as group 1 (n = 1,401) and those with normal copy number as group 2 (n = 223,593). In the next step, we further analyzed the SNPs in chromosomes with respect to the copy number changes. The overall completeness of genotyping for group 1 SNPs at 90.56% (95% CI, 89.85-91.27) was significantly lower than group 2 SNPs at 97.84% (95% CI, 97.82-97.87; P < 0.001). Figure 5A shows the completeness of genotyping of groups 1 and 2 SNPs by chromosomes. Figure 5B shows that the discordance of group 1 SNPs at 8.66% (95% CI, 7.73-9.61) was significantly higher than that of group 2 SNPs at 2.71% (95% CI, 2.68-2.75; P < 0.001) in all the chromosomes. Therefore, it is clear that the few small regions with copy number loss in the WGA samples affect both the completeness of the genotyping call (SNP performance) and the discordant rate (inaccurate calls), and these areas are situated mainly in the telomeric regions.
Top, graphical representation of the genome-wide copy number (CN) change regions. Blue regions, loss of copy number in WGA samples compared with the corresponding gDNA samples. No region was identified as gain of copy number. Bottom, chromosomal location, length, average CN, the CN variation ID number from the database of genomic variants, and the number of SNPs in those CN loss regions shown on top.
A. Detailed view of CN changes detected in all the 14 WGA samples in chromosome 9. Top, copy number loss regions (blue); middle, plot of estimated copy number in reference to the gDNA samples; bottom, heat map of all the 14 WGA samples (each row a WGA sample), showing loss of copy number (blue), normal copy number (gray), and gain in copy number (red). The eight samples with loss of CN are highlighted in black box in the bottom panel. The cytoband regions of chromosome 9 are at the lower part of the bottom panel. B. Genome browser view from the Database of Genomic Variants for the copy number loss region of chromosome 9q34.2, q34.3 region (130M to 140M) shown in A. Cytoband regions (dark gray), CN variations reported in the publicly available Database of Genomic Variants (orange), and Insertion-Deletions (InDels) between the 100 bp and 1kb sizes (green).
A. Completeness of genotyping of group 1 SNPs (those in cytobands with loss of CN) and group 2 SNPs (in cytobands with normal CN) by chromosome in WGA samples. B. Discordant rate of group 1 SNPs (those in cytobands with loss of CN in WGA samples) and group 2 SNPs (in cytobands with normal CN in WGA samples) by chromosome.
LOH Analysis
We also examined paired LOH regions for the WGA samples compared with the corresponding gDNA samples. We detected five LOH regions: chromosome 2q21.1 (15,505 bp), chromosome 5q13.1 (8,540 bp), chromosome 8q11.22 (156,577 bp), chromosome 13q31.1 (17,973 bp), and chromosome 20q13.31 (39,596 bp). There was no overlap between these LOH regions and the copy number change regions, indicating copy-neutral LOH. Also, none of these regions was near the telomere. There was a total of only 30 SNPs (0.013% of total genotyped SNPs) covering these five very small genomic regions with LOH (total 238,191 bp, accounting for about 0.0079% of the whole genome). As opposed to the usual LOH seen in tumor DNA, these were copy-neutral LOH and therefore, as expected, SNPs in these regions were discordant in 22.53% (95% CI, 9.93-35.13) of sample pairs. In fact, this also represents error of amplification.
Cross-check on Illumina's Infinium Platform
The genotype call rate for the reference gDNA on Illumina's Infinium platform was 99.74% and that of the corresponding WGA DNA was 99.68% with concordance rate of 99.998%. A total of 38,655 SNPs were common between the Illumina's 610 Quad chip and Affymetrix early access Mendel Nsp GeneChip. Of these 38,655 common SNPs between the two platforms, for the reference gDNA sample, a total of 38,136 SNPs could be successfully genotyped on both platforms. Genotype calls for only 143 SNPs (0.375%) were discordant between the two platforms. In other words, the genotype calls by the two platforms were concordant for 99.625% of the SNPs.
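The cross-platform figures above follow directly from the SNP counts, as this short Python check illustrates. The variable names are chosen for the example; the counts are those reported in the text.

```python
COMMON_SNPS = 38_655     # SNPs present on both the Illumina and Affymetrix chips
GENOTYPED_BOTH = 38_136  # of those, called on both platforms for the reference gDNA
DISCORDANT = 143         # calls that differed between the two platforms

# SNPs that could not be compared because at least one platform gave no call.
not_comparable = COMMON_SNPS - GENOTYPED_BOTH

discordant_pct = 100.0 * DISCORDANT / GENOTYPED_BOTH
concordant_pct = 100.0 - discordant_pct

print(f"not comparable: {not_comparable} SNPs")
print(f"discordant: {discordant_pct:.3f}%  concordant: {concordant_pct:.3f}%")
```

Note that the denominator is the 38,136 SNPs called on both platforms, not the 38,655 SNPs shared by the two chips.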
Discussion
WGA is a promising solution to the practical problem of limited source DNA for genome-wide scans. To fulfill this purpose, WGA must satisfy some basic requirements. First, the amplification process should be highly accurate to avoid undue errors. Second, amplification should not produce a bias in the distribution of the DNA products. Questions of amplification-induced error and template bias generated by the WGA process have been addressed elsewhere through small and large scale SNP detection methodologies (1, 22-26). Third, a high amplification factor is required so that WGA generates a useful amount of DNA from small starting samples. Finally, the WGA method should be applicable to a wide array of genomic platforms (24).
Three main methods have been used for WGA in different studies thus far: (a) MDA (22, 27), (b) Primer Extension Preamplification (28), and (c) Degenerate Oligonucleotide-Primed PCR (5, 29). Besides the method of amplification, other critical issues include the amount of DNA input (30, 31), amplified DNA yield (24), and the level of bias (32). Pinard et al. compared the yield of WGA product using the different amplification methods from 25 ng of gDNA as starting material: the MDA-based REPLI-g method generated 2,100-fold amplification, GenomiPhi 640-fold, Primer Extension Preamplification 120-fold, and Degenerate Oligonucleotide-Primed PCR 92-fold (24). The sharp contrast between the yields of the two MDA-based methods (REPLI-g and GenomiPhi) may be attributed to the use of KOH alkali denaturation before the amplification process, which opens priming sites more efficiently than the thermal denaturation used in the GenomiPhi protocol (24).
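To put the fold-amplification figures in absolute terms, the short Python sketch below converts them into approximate DNA yields from the 25 ng input. The conversion (yield = input × fold) and the names used are assumptions for illustration; the yields themselves are derived here and are not values reported by Pinard et al.

```python
INPUT_NG = 25.0  # gDNA input used in the comparison cited in the text

# Fold amplification per method, as quoted in the text (24).
FOLD_AMPLIFICATION = {
    "REPLI-g (MDA)": 2100,
    "GenomiPhi (MDA)": 640,
    "Primer Extension Preamplification": 120,
    "Degenerate Oligonucleotide-Primed PCR": 92,
}


def yield_ug(input_ng: float, fold: float) -> float:
    """Approximate product mass in micrograms, assuming yield = input x fold."""
    return input_ng * fold / 1000.0


yields = {method: yield_ug(INPUT_NG, fold) for method, fold in FOLD_AMPLIFICATION.items()}

for method, micrograms in yields.items():
    print(f"{method}: ~{micrograms:.1f} ug")
```

On this simple reading, only the REPLI-g reaction yields tens of micrograms of product from 25 ng of input, whereas the two PCR-based methods yield roughly 2 to 3 μg.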
There is evidence that the level of error introduced during the WGA reaction is a function of the amount of starting material. In this connection, Dean (22) and Lovmar (33) have evaluated the genotyping performance of MDA WGA using a range of gDNA inputs, and both authors focused on the genotyping performance of WGA DNA derived from 3 ng of gDNA. Bergen et al. carried out an extensive investigation of the effect of gDNA mass (1, 10, 25, 50, 100, and 200 ng) on WGA and genotyping performance (30). They found that, for optimal performance in single-plex SNP genotyping using the TaqMan platform, at least 10 ng of lymphoblastoid gDNA input in the WGA reaction was required, but over 100 ng of lymphoblastoid gDNA input was required to obtain optimal short tandem repeat genotyping performance from WGA DNA. In their work, the WGA DNA obtained from 25 ng of gDNA input showed 99.9% completion of genotyping with 2.3% discordance. Lasken and Egholm recommended 10 to 100 ng of gDNA template in the MDA WGA reaction to avoid stochastic amplification (34). In our lab, for single-plex SNP genotyping using the fluorescent polarization method, we have seen up to 100% completion of genotyping with 25 ng of WGA DNA sample per well in the PCR reaction from the WGA stock obtained from 25 ng of gDNA input in a 50 μL WGA reaction volume. Figure 6 shows the clustering of 84 genotype calls for rs1476413 using 25 ng of gDNA on the left panel and 25 ng of corresponding WGA DNA (from stock of WGA obtained from 25 ng of gDNA input in WGA reaction) on the right panel. SNP concordance was 100%. Among the gDNA samples, five were not clustered tightly (undetermined or no call), but clearly three were heading toward the GA genotype cluster and the other two were heading toward the AA genotype cluster. However, in the case of the corresponding WGA samples (right panel), the samples were clearly separated into three distinct genotype clusters. Sawcer et al. used a total of 508 WGA samples for genotyping on the Illumina GoldenGate platform and found that the likelihood of successful genotyping from WGA DNA correlated with the starting concentration of genomic DNA used in the amplification reaction: a large proportion of samples (n = 404) failed to produce genotype calls and their mean starting concentration was 5.9 ng/μL, whereas for the rest of the samples (n = 104), which had successful genotype calls, the concentration of the starting gDNA was 17.4 ng/μL (25). The present study was not designed to find the optimal gDNA input into the WGA reaction. Rather, we focused on the performance of WGA DNA derived from 25 ng of gDNA as input in the WGA reaction. In the context of genome-wide genotyping, using only 25 ng of good quality genomic DNA as starting material for a subsequent WGA reaction may be considered a good alternative to the standard requirement of 250 to 500 ng of gDNA for microarray-based high throughput genotyping.
Figure 6. Genotyping from gDNA and corresponding WGA DNA for rs1476413 using fluorescent polarization method (a single-base extension method). Clustering of 84 genotype calls, using 25 ng of gDNA (left) and 25 ng of corresponding WGA DNA (from stock of WGA obtained from 25 ng of gDNA input in WGA reaction; right). SNP concordance was 100%.
Arriola et al. amplified genomic DNA at different starting amounts (0.5, 5, 10, and 50 ng) using the Phi29-based MDA method and found that the fold amplification was highest when the input DNA was low, and this higher fold amplification was correlated to amplification bias in Comparative Genomic Hybridization profiles (31).
Paez et al. used the Phi29 polymerase-based amplification method, with or without alkali denaturation before amplification, and tested the accuracy and genome-wide coverage of the derived WGA product through both direct sequencing of around 500,000 bp and high-density oligonucleotide arrays interrogating 10K SNPs with a mean intermarker distance of 210 kb on the Affymetrix platform (32). Their study showed better call rates with prior alkali denaturation: the call rate was 92.93% in genomic DNA and 92.06% in WGA samples with prior alkali denaturation. In the present study, we used 25 ng of gDNA as starting material, treated it with KOH before WGA by the MDA method, and used the Affymetrix Early Access Mendel Nsp 250K GeneChip containing 224,940 SNPs with mean and median inter-SNP distances of 11.19 kb and 4.815 kb, respectively. We found that the overall call rate was 97.07% (95% CI, 96.17-97.97) in genomic DNA samples and 97.77% (95% CI, 97.26-98.28) in WGA samples.
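A mean call rate with a 95% CI of the kind reported above can be sketched as follows. This is only an illustration under stated assumptions: it uses a t-based interval over per-sample call rates, which this section does not confirm as the method actually used, and the function name and example values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 13 degrees of freedom
# (appropriate for a group of 14 samples).
T_CRIT_DF13 = 2.160


def mean_with_ci(values, t_crit):
    """Return (mean, lower, upper) for a t-based confidence interval.

    Assumption: the interval is mean +/- t_crit * SD / sqrt(n), where SD is
    the sample standard deviation of the per-sample values.
    """
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


# Hypothetical per-sample call rates (%), NOT data from the study.
example_call_rates = [
    96.1, 97.4, 98.0, 95.2, 97.9, 98.3, 96.8,
    97.5, 98.1, 94.9, 97.2, 98.4, 96.6, 97.7,
]

mean, lower, upper = mean_with_ci(example_call_rates, T_CRIT_DF13)
print(f"mean call rate {mean:.2f}% (95% CI, {lower:.2f}-{upper:.2f})")
```

The critical value is hard-coded to keep the sketch dependency-free; for a different number of samples it would need to be replaced with the value for the matching degrees of freedom.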
In a small-scale genotyping study in which only 6 SNPs were genotyped in 172 samples, a concordance of 100% was found between gDNA and corresponding WGA DNA (35). On the other hand, when genotyping was done on a larger number of SNPs on the Illumina linkage panel (2,320 SNPs) platform (36) or using the Illumina GoldenGate method (345 SNPs) (7), the call concordance was found to vary between 98.8% and 99.7%. One study explored the utility of MDA on 10K SNP arrays, reporting good coverage and high concordance rates but reduced call rates (32). In our study, using a 250K SNP chip, the overall concordance was 97.74% (95% CI, 97.03-98.45), and when the analysis was restricted to well-performing SNPs (Com_gDNA and Com_WGA >90%), 99.11% (95% CI, 98.80-99.42) of the SNPs, on average, were concordant, and overall a SNP showed a discordant call in only 0.92% (95% CI, 0.90-0.94) of paired samples. Moreover, we used the early access chips, in which the SNP panel was not yet fully optimized for SNP performance. For practical purposes, in genome-wide analysis SNPs should be filtered by call rate (across the samples). Analyzing the small number of SNPs that caused discordant calls, we found very few regions with copy number loss, and those were predominantly in telomeric regions. We also looked at paired LOH regions for the WGA samples compared with the corresponding gDNA samples and found only five copy-neutral LOH regions (the smallest at 2q21.1, 8,540 bp, and the largest at 8q11.22, 156,577 bp), none of which was located near telomeric regions. In a previous study, Paez et al. also found few chromosomal regions with loss of copy number in MDA-based WGA samples, but none of those regions were telomeric (32). To our knowledge, this is one of the first studies to examine the SNP concordance of WGA product with healthy human germline gDNA samples on very high-density oligonucleotide-based SNP chips interrogating 224,940 SNPs.
Although only in one pair of samples, we also tested the performance of MDA-based WGA product on a different platform, Illumina's 610 Quad chip interrogating 592,532 SNPs, and observed 99.998% concordance with the gDNA. Previous studies have not used such a high-resolution microarray platform to address this issue. It may be noted that neither the Affymetrix nor the Illumina GoldenGate assay protocol uses a further WGA step in sample processing; rather, PCR amplification is used. On the other hand, Illumina's Infinium chemistry uses WGA as a part of DNA sample processing before hybridization.
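The call-rate filter and concordance measure discussed above can be illustrated with a minimal sketch. It rests on several assumptions: genotypes are stored as strings with `None` for a no-call, concordance is computed only over sample pairs called in both gDNA and WGA, and the 90% threshold is applied to each DNA type separately. All names and the toy data are hypothetical and do not reproduce the study's analysis pipeline.

```python
def call_rate(calls):
    """Fraction of samples with a genotype call (None marks a no-call)."""
    return sum(call is not None for call in calls) / len(calls)


def concordance(gdna_calls, wga_calls):
    """Fraction of paired samples with identical calls.

    Assumption: pairs with a no-call in either member are excluded.
    Returns None when no pair is called in both.
    """
    paired = [
        (g, w)
        for g, w in zip(gdna_calls, wga_calls)
        if g is not None and w is not None
    ]
    if not paired:
        return None
    return sum(g == w for g, w in paired) / len(paired)


def well_performing_snps(gdna_by_snp, wga_by_snp, threshold=0.90):
    """SNP IDs whose call rate exceeds the threshold in both gDNA and WGA."""
    return [
        snp
        for snp in gdna_by_snp
        if call_rate(gdna_by_snp[snp]) > threshold
        and call_rate(wga_by_snp[snp]) > threshold
    ]


# Toy data for four paired samples at two hypothetical SNPs.
gdna = {"snp1": ["AA", "AG", "GG", "AA"], "snp2": ["CC", None, "CT", "TT"]}
wga = {"snp1": ["AA", "AG", "GG", "AG"], "snp2": ["CC", None, "CT", "TT"]}

kept = well_performing_snps(gdna, wga)
per_snp_concordance = {snp: concordance(gdna[snp], wga[snp]) for snp in kept}
print(kept, per_snp_concordance)
```

In this toy example the second SNP is dropped by the filter because one of four samples has no call (a 75% call rate), and the remaining SNP is concordant in three of four pairs.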
The present study was limited to the use of high-quality intact gDNA as input into the WGA reaction. Considering the fragment size of the degraded DNA extracted from formalin-fixed paraffin-embedded samples, MDA-based WGA may not be a suitable option for the Affymetrix GeneChip. However, a fragmentation PCR-based WGA method is an appropriate choice for formalin-fixed paraffin-embedded samples. In a very recent publication (Epub 2008 June 12), Mead et al. documented that degraded DNA amplified with MDA-based WGA gave low call rates and concordance across all platforms at the standard loading concentration, whereas the fragmentation PCR-based method of WGA gave a high call rate and concordance for degraded DNA (37).
In summary, our results suggest that Phi29 MDA-based WGA product provides a highly accurate and reasonably comprehensive representation of the unamplified human genome, suitable for high-resolution genome-wide genotyping studies using oligonucleotide-based SNP genotyping arrays.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Grant support: National Cancer Institute, NIH under RFA-CA-06-503, and through cooperative agreements with members of the Breast Cancer Family Registry and PIs and partly by U01 CA122171 and P30 CA 014599.
Acknowledgments
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The content of this article does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the Cancer Family Registries (CFR), nor does the mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government or the CFR.