Abstract
With the large numbers of single nucleotide polymorphisms (SNPs) available and new technologies that permit high-throughput genotyping, we investigated the possibility of localizing disease genes with genome-wide panels of SNP markers by taking advantage of the linkage disequilibrium (LD) between the disease gene and closely linked markers. For this purpose, we selected cases from the Ashkenazi Jewish population, in which the mutant alleles are expected to be identical by descent from a common founder and the regions of LD encompassing these mutant alleles are large. As a validation of this approach for localization, we performed two trials: one in autosomal recessive Bloom syndrome, in which a unique mutation of the BLM gene is present at elevated frequencies in cases, and the other in autosomal dominant hereditary nonpolyposis colorectal cancer (HNPCC), in which a unique mutation of MSH2 is present at elevated frequencies. In the Bloom syndrome trial, we genotyped 3,258 SNPs in 10 Jewish Bloom syndrome cases and 31 non-Bloom syndrome Jewish persons as a comparison group. In the HNPCC trial, we genotyped 8,549 SNPs in 13 Jewish HNPCC cases whose colon cancers exhibited microsatellite instability and in 63 healthy Jews as a comparison group. To identify significant associations, we performed (a) Fisher’s exact test comparing genotypes at each locus in cases versus controls and (b) a haplotype analysis by estimating the frequency of haplotypes with the expectation-maximization algorithm and comparing haplotype frequencies in cases versus controls by logistic regression and a maximum likelihood ratio method. In the Bloom syndrome trial, by Fisher’s exact test, statistically significant association was detected at a single locus, TSC0754862, which is 1.7 million bp from BLM. Two-locus, three-locus, and four-locus haplotypes that included TSC0754862 and flanked BLM were also significantly more frequent in cases than in controls.
In the HNPCC trial, although a significant P value was not obtained by the single SNP genotype analysis, significant associations were detected for several multilocus haplotypes in an 11-million-bp region that contained the MSH2 gene. This work demonstrates the power of the LD mapping approach in an isolated population and its general applicability to the identification of novel cancer-causing genes.
INTRODUCTION
Since the identification by linkage analysis of many of the major cancer susceptibility genes (e.g., BRCA1, BRCA2, MLH1, and MSH2), finding new disease genes has been hampered by small family size, genetic heterogeneity, and smaller relative risks associated with the presumptive disease genes (1). A strategic approach that could potentially overcome these problems is genome-wide single-nucleotide polymorphism (SNP) linkage-disequilibrium (LD) mapping (2). This approach relies on the identification of genetic association between the disease allele and nearby marker alleles in a statistical analysis of cases in comparison with controls. However, in the current conception of the LD mapping approach, high-density SNP panels and large numbers of cases and controls would be needed to identify novel alleles. As an alternative design, we have considered the feasibility of SNP LD mapping with low-density SNP panels and comparing relatively small numbers of high- or intermediate-risk cases and controls selected from an isolated population.
Linkage disequilibrium is defined as the excess in co-occurrence of two alleles over that expected if the two alleles occurred independently (3). In an isolated population that was established recently from relatively small numbers of founders, random genetic drift will have operated to reduce the number of different disease-causing mutations segregating in the population (4). These disease-causing mutations are called “founder mutations” because they were passed down to present-day carriers identical by descent from a common founder individual. Importantly, for any particular disease-causing mutation, markers in the region flanking the founder mutation on the chromosome are nearly always in LD with the disease allele.
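The definition above corresponds to the standard LD coefficient D = p(AB) − p(A) p(B). The following is a minimal sketch of that calculation; the function name, the dictionary layout, and the example frequencies are illustrative assumptions and do not come from the study data.

```python
def ld_coefficient(hap_freqs):
    """Return D = p(AB) - p(A) * p(B) for two biallelic loci.

    hap_freqs maps the four two-locus haplotypes 'AB', 'Ab', 'aB', 'ab'
    to their population frequencies (hypothetical input format).
    """
    p_ab = hap_freqs["AB"]
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A at locus 1
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B at locus 2
    return p_ab - p_a * p_b


# Made-up frequencies: alleles A and B co-occur more often than expected.
example = {"AB": 0.4, "Ab": 0.1, "aB": 0.1, "ab": 0.4}
d_example = ld_coefficient(example)

# Independent alleles give D = 0.
independent = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
d_independent = ld_coefficient(independent)
```

A positive D indicates that the two alleles are found together on the same chromosome more often than independence would predict, which is the situation exploited when a marker allele travels with a founder mutation.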
The particular linear combination of alleles at loci adjacent to the disease allele constitutes a founder haplotype. The size of the region conserved in the founder haplotype is a key variable in using LD mapping successfully (5, 6). The chromosomal segment in which markers are in LD is defined as an LD block (7, 8). The size of an LD block surrounding a disease gene depends on historical recombination events between the disease-causing mutation and flanking markers and, hence, on the time the mutation has been segregating in the population. When a disease-causing mutation arises, it is borne on a chromosome containing a specific haplotype that consists of all of the alleles that happened to fall together on that chromosome. As the chromosome is transmitted in the population from generation to generation, the LD between the disease-causing mutation and any given marker will deteriorate exponentially over time as a function of the frequency of recombination between the two sites. Thus, if a disease-causing mutation were established recently in the population (for example, ∼50 generations ago), the size of the chromosomal region that contains loci in LD with the mutation could be extensive (on the order of millions of base pairs). With large block size, it should be possible to use low-density genome-wide SNP panels that are currently available commercially to localize disease-causing genes with a case-control design.
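The exponential deterioration described above is commonly written as D_t = D_0 (1 − θ)^t, where θ is the recombination fraction between the mutation and the marker and t is the number of generations. The sketch below illustrates this relationship. The conversion of 1 Mb to roughly 1 cM (θ ≈ 0.01) is a rough genome-wide rule of thumb adopted here as an assumption, not a figure from this study, and the function names are hypothetical.

```python
def theta_from_mb(distance_mb, cm_per_mb=1.0):
    """Approximate recombination fraction for a physical distance.

    Assumes about 1 cM per Mb and theta ~ cM / 100, which is only
    reasonable for short distances (an assumption of this sketch).
    """
    return distance_mb * cm_per_mb / 100.0


def ld_remaining(theta, generations):
    """Fraction of the initial LD expected to remain: (1 - theta) ** t."""
    return (1.0 - theta) ** generations


# Expected fraction of LD left after 50 generations at several distances.
for mb in (1, 2, 5, 10):
    frac = ld_remaining(theta_from_mb(mb), 50)
    print(f"{mb:>2} Mb: {frac:.3f} of initial LD remains")
```

Under these assumptions, roughly 60% of the initial LD survives 50 generations at 1 Mb, which is the sense in which a recently established mutation can sit in a block spanning millions of base pairs.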
Although LD has been exploited previously to refine the map position of disease genes once they had been localized by conventional linkage analysis, limitations in high-throughput genotyping have until now prevented utilization of LD for the identification of disease genes in genome scans. With recent advances in high-throughput genotyping and the large number of validated SNPs in the public databases, we decided to test the feasibility of using genome-wide SNP LD mapping in two Mendelian genetic disorders as models. We selected cases and controls from the Ashkenazi Jewish population, because the origins of the Jewish population are relatively recent and the population was established from small numbers of founders (9). Consequently, genetic heterogeneity is reduced, and the sizes of LD blocks flanking disease-causing mutations are generally large. These critical properties have been demonstrated for many disease-causing mutations in the Jewish population (Table 1), raising the possibility that these properties could be used to identify new genes.
We selected two Mendelian disorders, autosomal recessive Bloom syndrome and autosomal dominant hereditary nonpolyposis colorectal cancer (HNPCC), that featured well-studied founder mutations in the Ashkenazi Jewish population because these model genetic systems provide the optimal conditions in which to test and develop the genome-wide SNP LD mapping approach. Bloom syndrome (MIM 210900) is a clinically recognizable entity characterized by growth deficiency, genomic instability, and a marked increase in predisposition to the development of cancer (10). The primary defect in Bloom syndrome is caused by homozygous or compound heterozygous mutations of the Bloom syndrome gene, BLM, a member of the RecQ family of DNA helicases (11). Approximately one third of persons with Bloom syndrome are Ashkenazi Jewish, and almost all of these persons have inherited the same 6-bp deletion and 7-bp insertion at nucleotide positions 2207–2212 in the BLM cDNA, referred to as blmAsh. The blmAsh mutation is present in the Ashkenazi Jewish population at an increased frequency because of founder effect (12). Markers in an approximately 2-million-bp (Mb) region flanking BLM in chromosome band 15q26.1 are in LD with the blmAsh mutation (13). HNPCC (MIM 114500) is caused predominantly by mutations in the mismatch repair genes MLH1 and MSH2 (14). Persons with disease-causing mutations in these genes are at increased risk for a wide spectrum of epithelial cancers, and the tumors that develop in mutation carriers feature microsatellite instability (MSI; ref. 15). HNPCC is not at an increased frequency in the Ashkenazi Jewish population; however, a founder mutation MSH2*1906G>C, which results in a substitution of proline for alanine at codon 636 (A636P) in the MSH2 protein, is the predominant mismatch repair gene mutation in Ashkenazi Jews (16). The region of LD that flanks the chromosome bearing the A636P mutation can be as large as 11 Mb (16).
Here, we present two proof-of-principle experiments that demonstrate the feasibility of genome-wide SNP LD mapping for the localization of disease genes in a genetically isolated population. Our results suggest that this SNP LD mapping approach could be used for the identification of disease genes whose localization is currently unknown.
MATERIALS AND METHODS
Subjects.
Two subject groups were studied. The subject group composed of persons with Bloom syndrome has been described previously (13). A subject group of persons with HNPCC syndrome, as defined by the Amsterdam I criteria (17), or with HNPCC-like syndrome, as defined by families with three or more colon cancers diagnosed at any age among a group of first- and second-degree relatives, was ascertained by family history questionnaire by the colorectal cancer disease management team (including endoscopy, surgery, and oncology clinics) or by referral to the Clinical Genetics Service at Memorial Hospital. All of the cases visited the genetics clinic for counseling and for clinical investigation of a possible disease-causing mutation in one of the mismatch repair genes. Families with familial adenomatous polyposis were excluded. Tumor material was obtained for MSI testing, and blood was obtained for mutational analysis of the MLH1, MSH2, and MSH6 genes. From an ascertainment of 133 families, 31 Ashkenazi Jewish families meeting HNPCC or HNPCC-like criteria were identified in which a DNA sample could be obtained from a person affected with colorectal cancer. Of these 31 families, 13 had one or more tumors that exhibited MSI and 18 had tumors that did not exhibit MSI. In the HNPCC localization trial, by selecting the 13 persons with tumors that exhibited MSI, we enriched for persons who carried mismatch repair gene mutations. Earlier reports have described a subset of the cases studied here (16, 18).
For the BLM localization trial, to constitute a comparison group, 10 healthy Ashkenazi Jewish males were chosen at random from a group of 50 males ascertained at the New York Blood Center for an unrelated epidemiologic study as described previously (12). In addition, 21 persons from Ashkenazi Jewish HNPCC/HNPCC-like families were included as part of the comparison group: 8 cases with colon cancers that exhibited MSI and 13 cases with colon cancers that did not exhibit MSI. The inclusion of HNPCC/HNPCC-like cases in the control group was originally done as a cost-saving measure. For the MSH2 localization trial, to constitute a comparison group, 23 DNA samples from Ashkenazi persons were purchased from the National Laboratory for the Genetics of Israeli Populations at Tel-Aviv University, Israel, and another 40 DNA samples from healthy Ashkenazi Jews were chosen at random from a group of over 2,000 persons ascertained through the New York Cancer Project (19).
Genotyping.
DNA samples were prepared from blood as described previously (12, 18). For the BLM localization trial, genotyping was carried out on Orchid BioSciences’ SNPstream ultra-high throughput platform as described previously (8, 20). Briefly, the method combines solution-phase multiplex single nucleotide extension (SNE) with a solid-phase sorting of labeled SNE primers by hybridization to a chip that contains 384 4 × 4 arrays of 12 oligonucleotide tags and 4 oligonucleotides for positive and negative controls. Each SNE primer contained 1 of the 12 oligonucleotide tags at its 5′ end, and the SNE reactions were performed in 12-plex. Separation of the 4 × 4 arrays during hybridization was achieved with a patented gasket. In these experiments, the genotyping failure rate was 4.1%.
For the MSH2 localization trial, genotyping was carried out with Affymetrix GeneChip 10K Human Mapping arrays (21, 22). Briefly, the method consists of a one-primer amplification assay performed on genomic DNA in which sequence complexity had been reduced by restriction enzyme digestion with XbaI. Allele-specific hybridization of the amplified probe was then performed on oligonucleotides on the array. Because two different chip arrays were used in these experiments (the early-access 10K array and the 10K array), the loci included in the statistical analysis were constituted from the intersection of the two sets of loci successfully genotyped. The overall genotyping failure rate in the 94 DNA samples genotyped by this method was 9.8%. The genotyping error rate was estimated to be less than 0.02% (0 genotype discrepancies in 3,474 tests from 70 duplicate loci included on the genotyping chips).
Single-Nucleotide Polymorphism Selection.
For the genome-wide Orchid SNP panel (now marketed by Beckman), an initial set of ∼4,200 SNPs from The SNP Consortium was selected. The complete set of SNPs was arranged into ∼350 unique 12-plex reactions for the purpose of performing the assays on the ultra-high throughput platform. The complete set of markers was then validated on a set of five Centre d’Etude du Polymorphisme Humain pedigrees and in three independent populations. A final set of 3,258 markers was chosen for the analysis performed here after eliminating SNPs that performed poorly in the assay, failed Hardy-Weinberg equilibrium tests, exhibited Mendelian segregation errors, or were not polymorphic in the populations initially evaluated. An average genome-wide spacing of 1 SNP per 874 kb was achieved for this panel (Table 2). Of the 3,258 SNPs assayed in the Ashkenazi Jewish samples tested here, 123 were not polymorphic.
For the Affymetrix genotyping chip, the selection of SNPs from The SNP Consortium was based on computer predictions of XbaI restriction enzyme fragments likely to contain SNPs, followed by empirical testing across more than 300 samples from Caucasians, African Americans, and Asians. Genotyping accuracy for the 10K array, measured by concordance with SNE genotypes, dideoxy sequencing, and Mendelian inheritance, was estimated to be >99.5% (21, 22). Reproducibility was 99.99% when measured over 16 individuals with 9 replicates each. The SNPs on the array were highly informative, with an average heterozygosity of 0.37 across Caucasians, African Americans, and Asians, and were broadly distributed across the genome. The mean and median distances between the loci genotyped in these experiments were 389 and 194 kb, respectively (Table 2). The average heterozygosity over the 94 Jewish persons tested here was 32.4%, and 120 loci were not polymorphic in this sample set.
Statistical Analysis.
Two-sided Fisher’s exact tests were used to compare the single SNP genotype frequencies between cases and controls. For haplotype analyses, we ordered all of the loci by chromosome position, from the telomere of the p arm to the telomere of the q arm for each chromosome in turn, and we applied a sliding window consisting of n loci, where n is a number between 2 and 12, that we moved down the chromosome one locus at a time (23). In the estimation of haplotype frequencies, we included samples in which genotype data were missing according to the following scheme: for n = 3 or 4, we excluded samples with missing data at two or more loci; for n = 5 or 6, we excluded samples with missing data at three or more loci; for n = 7 or 8, we excluded samples with missing data at four or more loci; for n > 8, we excluded samples with missing data at five or more loci. Including samples with missing data, we assigned the most likely haplotypes to each individual with the expectation-maximization algorithm (24, 25). From there, we compared the estimated haplotype frequencies between cases and controls as described below.
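To illustrate the expectation-maximization step mentioned above, the following is a minimal sketch of EM haplotype frequency estimation for the simplest case of two biallelic loci. It is not the implementation used in the cited software (24, 25). It assumes complete genotypes (no missing data), and the genotype coding and function name are hypothetical.

```python
def em_two_locus(genotypes, n_iter=100):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), where g1 and g2 are the number of copies
    (0, 1, or 2) of allele '1' at locus 1 and locus 2 (assumed coding).
    Returns a dict mapping haplotypes '00', '01', '10', '11' to frequencies.
    """
    freqs = {h: 0.25 for h in ("00", "01", "10", "11")}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: a double heterozygote is either 00/11 or 01/10;
                # weight each phase by the current frequency estimates.
                cis = freqs["00"] * freqs["11"]
                trans = freqs["01"] * freqs["10"]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts["00"] += w
                counts["11"] += w
                counts["01"] += 1.0 - w
                counts["10"] += 1.0 - w
            else:
                # Phase is unambiguous when at most one locus is heterozygous.
                alleles1 = [1] * g1 + [0] * (2 - g1)
                alleles2 = [1] * g2 + [0] * (2 - g2)
                for a, b in zip(alleles1, alleles2):
                    counts[f"{a}{b}"] += 1.0
        # M-step: new frequencies are the expected haplotype counts
        # divided by the number of chromosomes.
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs


# Toy data: two homozygotes fix haplotypes 11 and 00, so the double
# heterozygote is resolved as 00/11 rather than 01/10.
toy = [(2, 2), (0, 0), (1, 1)]
estimated = em_two_locus(toy)
```

In the toy data the estimate converges toward equal frequencies of the 00 and 11 haplotypes, because the unambiguous individuals make the 00/11 phase far more likely for the ambiguous one. Windows of more than two loci follow the same logic with a larger set of compatible haplotype pairs per individual.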
For the analysis of the data generated by genotyping with the Orchid SNP panel, in which the density of SNPs was ∼1 SNP/0.8 Mb, we analyzed haplotypes constructed from two, three, and four adjacent loci. For the Affymetrix SNP panel, in which the density of SNPs was ∼1 SNP/0.4 Mb, we analyzed haplotypes constructed from 2 to 12 adjacent loci. Because we selected cases from a genetically isolated population, we assumed that, in some or all of the cases, a single founder haplotype would be present that is associated with disease. Consequently, for each group of adjacent loci, we identified the haplotype that registered the minimum P value. We calculated P values for association between the estimated haplotypes and disease status with two approaches. A logistic regression approach based on generalized linear models was performed with the haplo.score program (26). Score statistics and corresponding P values were generated for each of the observed multilocus haplotypes, and the minimum observed P value and the corresponding haplotype were recorded. Because the distribution of the score statistic might not be normal, based on the small numbers of cases and controls, P values were calculated from empirical null distributions based on at least 1000 simulations.
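The sliding-window scheme and the missing-data exclusion rule given in this section can be sketched as follows. The data layout (a dictionary of per-sample genotype calls with `None` for a missing call) and all function names are assumptions made for illustration. The text does not state an exclusion rule for two-locus windows, so the sketch leaves that case undefined rather than guessing.

```python
def max_missing_allowed(n):
    """Largest number of missing genotypes tolerated in an n-locus window.

    Follows the stated scheme: exclude at >= 2 missing for n = 3 or 4,
    >= 3 for n = 5 or 6, >= 4 for n = 7 or 8, and >= 5 for n > 8.
    """
    if n < 3:
        raise ValueError("no rule is stated in the text for n < 3")
    if n <= 4:
        return 1
    if n <= 6:
        return 2
    if n <= 8:
        return 3
    return 4


def sliding_windows(n_loci, n):
    """Yield (start, end) index pairs (end exclusive), moving one locus at a time."""
    for start in range(n_loci - n + 1):
        yield start, start + n


def samples_for_window(genotypes, start, end):
    """Return the sample IDs retained for the window [start, end).

    genotypes: dict mapping sample ID to a list of genotype calls ordered
    by chromosome position, with None marking a missing call.
    """
    limit = max_missing_allowed(end - start)
    return [
        sample
        for sample, calls in genotypes.items()
        if sum(call is None for call in calls[start:end]) <= limit
    ]


# Hypothetical calls at five ordered loci for two samples.
calls = {
    "s1": ["GG", None, None, "AA", "GA"],
    "s2": ["GG", "GA", None, "AA", "GA"],
}
windows = list(sliding_windows(5, 3))
kept_first = samples_for_window(calls, 0, 3)
```

Each retained window would then be passed to haplotype estimation and to the association test, and the minimum P value over the haplotypes observed in that window would be recorded.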
A maximum likelihood ratio approach was performed as described in SAS Genetics (25). For a given group of adjacent loci, the χ2 statistic of each haplotype was calculated, and the smallest P value along with the corresponding haplotype were recorded. Permutation tests (10,000 permutations) were performed as described (27) to obtain empirical null distributions of P values. In the maximum likelihood ratio method, a haplotype frequency cutoff of 0.005 was used for all analyses.
To adjust P values produced by Fisher’s exact test for multiple testing, we used two approaches: (a) the Benjamini and Hochberg (28) correction, which is a method for controlling the false discovery rate, and (b) a permutation procedure. The Benjamini and Hochberg correction consists of ranking all P values and adjusting each by multiplying by the total number of markers and dividing by the rank of that P value. In the second approach, we performed a permutation procedure to simulate the distribution of the minimum P value that we would expect if none of the markers were truly associated with disease. To do this, disease status (case versus control) was randomly permuted among the persons tested, keeping their SNP genotypes unaltered; a P value from Fisher’s exact test then was calculated at every SNP for each of the permuted datasets, and the smallest P value was recorded. This permutation procedure was repeated 1,000 times, from which an empirical distribution of the minimum P value was obtained.
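The Benjamini and Hochberg adjustment described above can be sketched in a few lines. The multiplication by the number of markers and division by the rank follows the description in the text; the running minimum that keeps adjusted values monotone is the usual refinement of the procedure and is an addition of this sketch, not something stated in the paper. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that each adjusted value
    # never exceeds the adjusted value of any larger raw P value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Small made-up example with four markers.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

For the smallest of m P values the adjustment is simply P × m. As a rough check, 1.6 × 10−5 multiplied by 3,258 markers is about 0.052, which is close to the adjusted value of 0.05 reported in the Results for TSC0754862; whether the non-polymorphic markers were counted in m is not stated.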
For the haplotype analyses, which included many dependent tests in which the correlation coefficients could not be easily determined, the Benjamini and Hochberg correction could not be used. Consequently, the permutation procedure was used to simulate P values; the lowest P value was calculated for each of 10,000 permutations for the entire data set, and from this distribution, we estimated the P value that corresponded to the conventional 5% threshold.
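The permutation procedure for the minimum P value can be sketched as below. To keep the example short and dependency-free it is simplified in ways that are assumptions of this sketch: each marker is reduced to carrier status (0 or 1) and tested with a two-sided 2 × 2 Fisher's exact test, whereas the study compared three genotypes per SNP or estimated haplotypes. All names are hypothetical.

```python
import random
from math import comb


def fisher_2x2(labels, carriers):
    """Two-sided Fisher's exact P for carrier status versus case status (both 0/1)."""
    n = len(labels)
    n_case = sum(labels)
    n_carrier = sum(carriers)
    observed = sum(1 for lab, car in zip(labels, carriers) if lab and car)

    def weight(k):
        # Number of ways to place k carriers among the cases.
        return comb(n_carrier, k) * comb(n - n_carrier, n_case - k)

    low = max(0, n_case - (n - n_carrier))
    high = min(n_case, n_carrier)
    w_obs = weight(observed)
    tail = sum(weight(k) for k in range(low, high + 1) if weight(k) <= w_obs)
    return tail / comb(n, n_case)


def min_p_null(labels, markers, n_perm=1000, seed=0):
    """Sorted minimum P values from permuting case/control labels.

    Genotypes stay fixed; only disease status is shuffled, so the
    correlation among markers is preserved in every permuted data set.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    minima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        minima.append(min(fisher_2x2(shuffled, m) for m in markers))
    return sorted(minima)


def percentile_threshold(sorted_minima, alpha=0.05):
    """P value below which roughly a fraction alpha of permutation minima fall."""
    k = max(int(alpha * len(sorted_minima)) - 1, 0)
    return sorted_minima[k]


# Made-up data: 5 cases, 15 controls, three markers.
labels = [1] * 5 + [0] * 15
markers = [
    [1, 1, 1, 1, 0] + [0] * 15,
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1] * 2,
    [1] * 3 + [0] * 12 + [1] * 5,
]
null = min_p_null(labels, markers, n_perm=200, seed=1)
threshold = percentile_threshold(null)
```

An observed P value would be declared significant at the 5% level only if it falls at or below `threshold`, which accounts for all markers tested at once without requiring the correlations between tests to be known.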
RESULTS
Genome-wide SNP Linkage Disequilibrium Mapping of the Bloom Syndrome Gene BLM.
For the Bloom syndrome gene localization, we genotyped 10 unrelated Ashkenazi Jews with Bloom syndrome and 31 Ashkenazi Jews without Bloom syndrome as controls at 3,258 SNPs (Orchid SNP panel). To assess association between a marker and disease status, we performed a two-sided Fisher’s exact test of the frequency of the three possible genotypes in cases versus controls. Each P value was displayed in order by chromosome position (Fig. 1 A). The smallest P value (P = 1.6 × 10−5) was obtained at TSC0754862, which is ∼1.7 Mb from the 5′ end of the BLM gene. At TSC0754862, 8 of 10 individuals with Bloom syndrome were homozygous G/G, 1 individual was heterozygous G/A, and 1 individual had no marker information. Of the 31 controls, 2 were homozygous G/G, 10 were homozygous A/A, 17 were heterozygous G/A, and 2 had no marker information.
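The genotype counts reported for TSC0754862 can be used to illustrate the single-locus test. The sketch below implements a two-sided exact test for a 2 × 3 table (the Freeman-Halton extension of Fisher's exact test) by enumerating every table with the same margins. Two points are assumptions of this sketch: the software used in the study is not specified in the text, and individuals without marker information are dropped here. The function name is hypothetical.

```python
from math import comb


def fisher_exact_2x3(cases, controls):
    """Two-sided exact P value for a 2 x 3 genotype table.

    cases, controls: genotype counts in the same order, for example
    (G/G, G/A, A/A). The P value is the total probability of all tables
    with the observed margins that are no more probable than the
    observed table.
    """
    n_cases = sum(cases)
    cols = [a + b for a, b in zip(cases, controls)]
    total = comb(sum(cols), n_cases)

    def weight(a, b, c):
        # Number of ways to draw the case row from the column totals.
        return comb(cols[0], a) * comb(cols[1], b) * comb(cols[2], c)

    w_obs = weight(*cases)
    tail = 0
    for a in range(min(cols[0], n_cases) + 1):
        for b in range(min(cols[1], n_cases - a) + 1):
            c = n_cases - a - b
            if c <= cols[2]:
                w = weight(a, b, c)
                if w <= w_obs:  # exact integer comparison, no rounding issues
                    tail += w
    return tail / total


# Counts from the text, with the uninformative individuals omitted.
bloom_cases = (8, 1, 0)       # G/G, G/A, A/A among 9 typed cases
bloom_controls = (2, 17, 10)  # G/G, G/A, A/A among 29 typed controls
p_value = fisher_exact_2x3(bloom_cases, bloom_controls)
print(f"P = {p_value:.2e}")

# Frequency of the G allele in controls: (2 * 2 + 17) / (2 * 29).
g_freq_controls = (2 * bloom_controls[0] + bloom_controls[1]) / (2 * sum(bloom_controls))
```

Under these assumptions the test gives a P value of approximately 1.6 × 10−5, in line with the value reported above, and the G allele frequency in controls comes out at about 0.36.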
Adjusting the P values for multiple testing by controlling the false discovery rate (28), we found that TSC0754862 had the smallest adjusted P value (P = 0.05); all of the other P values were substantially larger (Fig. 1 B). By the permutation approach, the fifth percentile of the empirical P value distribution was 3.1 × 10−5, which was larger than the observed P value of 1.6 × 10−5 registered at TSC0754862 (data not shown). No other observed P values fell below the fifth percentile.
To determine whether the analysis of multiple loci could also be used to localize BLM, we calculated P values corresponding to two-locus to four-locus haplotypes with the haplo.score program (26). In the three-locus haplotype analysis (Fig. 1 C and D), the haplotype G-C-G at the three adjacent loci TSC0754862, TSC0125422, and TSC0033517, spanning approximately 1 Mb, had the smallest unadjusted P value (P = 3.0 × 10−6). By the permutation procedure, the P value associated with the G-C-G haplotype was significant, and no other P values obtained reached this level of significance (data not shown).
In our analysis of two-locus and four-locus haplotypes, for haplotypes that contained the TSC0754862 locus, we obtained minimum unadjusted P values that were also very small (9.2 × 10−6 and 9.2 × 10−7, respectively). On examining the region around TSC0754862, we found that 16 of 20 Bloom syndrome case chromosomes carried the haplotype G-C-G-A at SNPs TSC0754862, TSC0125422, TSC0033517, and TSC0288059, spanning approximately 1.7 Mb and encompassing the BLM gene (Fig. 1 D). None of the controls carried this haplotype, and haplotype sharing in the cases deteriorated outside this region. We concluded that, in this analysis, the chromosomal segment that contained BLM was the only region identified by the genome-wide SNP LD mapping approach.
Genome-wide SNP Linkage Disequilibrium Mapping of the HNPCC Gene MSH2.
For the HNPCC localization trial, we genotyped 13 unrelated Ashkenazi Jews with MSI-positive colorectal cancers from HNPCC or HNPCC-like families and 63 healthy Ashkenazi Jews as controls at 8,549 SNPs (Affymetrix SNP panel). Ten of the 13 persons genotyped carried the A636P founder mutation; the remaining 3 persons carried two different mutations, a single mutation in MSH6 in one person (see ref. 18) and a single mutation in MLH1 in two persons. As before, to assess association between a marker and disease status, we performed a two-sided Fisher’s exact test of the frequency of the three possible genotypes in cases versus controls (Fig. 2 A). One marker 4.2 Mb distal to MSH2, TSC520086, registered a P value of 0.0008, which was the 11th smallest P value in the set. Two other markers in the MSH2 region registered small P values; TSC529535, which is 1.1 Mb distal to MSH2, registered a P value of 0.002, which ranked 18th, and TSC43644, which is 6.2 Mb proximal to MSH2, also registered a P value of 0.002, which ranked 29th. The smallest P value in the single-locus analysis (P = 7.0 × 10−5) was obtained at TSC1443434 on chromosome 15 (see below). After adjusting for multiple testing by controlling the false discovery rate, none of the P values obtained would be considered significant. This difference between results obtained in the Bloom syndrome and HNPCC localization trials was expected because in a dominant condition, at most, only one half of the chromosomes in the cases would be truly associated with the disease-causing mutation.
We next performed haplotype analyses to determine whether this approach could provide evidence for MSH2 localization. Because the density of the Affymetrix SNP panel was greater than that of the Orchid panel, we extended the analysis of haplotypes to include up to 12 adjacent loci with a haplotype frequency cutoff point of 0.05. The results obtained from the analysis of 4-, 8-, and 12-locus haplotypes were representative of the entire set of analyses (Fig. 2 B and Fig. 1 A and B in the supplementary figures). By logistic regression, the smallest P values obtained in the region that contained MSH2 were 0.005 (loci TSC44044–TSC43644, ranking 64th in ascending order), 2.0 × 10−5 (loci TSC54246–TSC43644, ranking 1st), and 4.0 × 10−6 (loci TSC59005–TSC535216, again ranking 1st), respectively (see Fig. 3 B for loci and their positions relative to MSH2). The genetic region that contained MSH2 ranked as the smallest P value in 5 of the 11 haplotype analyses performed. Overall, the P value obtained in the 12-locus haplotype analysis of loci TSC59005–TSC535216 was the smallest P value obtained for the entire analysis. By permutation analysis, a P value smaller than 6.0 × 10−6 would be considered significant. By logistic regression, the region containing TSC1443434 on chromosome 15 was the only other region that obtained a significant P value, registering its smallest P value (P = 4.0 × 10−6) in the three-locus haplotype analysis.
By the maximum likelihood ratio method, the results were only slightly different (Fig. 3 A). In the 4-, 8-, and 12-locus haplotype analyses, the smallest P values obtained in the region that contained MSH2 were 0.002 (loci TSC529535–TSC91794, ranking 8th), 2.0 × 10−6 (loci TSC59005–TSC51308, ranking 1st), and 1.1 × 10−6 (loci TSC529535–TSC588566, ranking 1st), respectively (again, see Fig. 3 B). The smallest P value registered in the entire analysis by the maximum likelihood ratio test was in the 12-locus analysis at loci TSC529535 to TSC588566. The second smallest P value (2.0 × 10−6) was recorded at the next step proximal on the chromosome, TSC59005 to TSC535216, which was the haplotype that recorded the smallest P value in the analysis by logistic regression. By the maximum likelihood ratio test, the region that contained MSH2 never ranked below the 8th smallest P value in the haplotype analyses, recording the smallest P value in 6 of 11 analyses performed. The P values for the MSH2 region were considered significant, falling below the empirically determined 5% threshold (in the 12-locus analysis, 7.0 × 10−6) in 5 of the 11 haplotype analyses. All of the corresponding haplotypes were derived from adjacent loci that contained MSH2, whereas in the logistic regression some of the haplotypes were derived from loci proximal to MSH2. In the maximum likelihood ratio test, two additional regions besides the one on chromosome 15 recorded significant P values, one on chromosome 3 and another on chromosome 18.
Importantly, when we compared the haplotypes from which the smallest P values were registered by each of the two analytical methods, we found that for haplotypes that did not include MSH2, many of them were different. The same haplotypes could be obtained by logistic regression if a haplotype frequency cutoff of 0.005 was used. However, in this case, the results from the logistic regression analysis were no longer significant (data not shown). The haplotypes identified by the minimum P values in the maximum likelihood ratio test with a few exceptions formed an overlapping series (Fig. 3 B), which would be expected if there was conservation of the founder’s haplotype in the region that contained MSH2.
It was previously reported that loci in a region greater than 11 Mb are in LD with the A636P mutation (16). The SNP panel used here contained 50 loci in an 11.6-Mb region flanking MSH2. Examination of the 12-locus haplotypes and the accompanying P values for this region showed that 9 of the 10 A636P-bearing chromosomes examined shared a common core 4.3-Mb region consisting of 14 loci that flanked MSH2 (Fig. 3 B). Six of the 10 A636P-bearing chromosomes examined shared a common 8-Mb region consisting of 41 loci, the majority of which were proximal to MSH2. Only four chromosomes carried the entire 11.6-Mb region.
DISCUSSION
The SNP LD mapping strategy used here relied on the comparatively large regions of LD that encompass founder mutations segregating in the Jewish population. With a simple study design and DNAs from a small number of persons with Bloom syndrome (n = 10) or HNPCC/HNPCC-like syndrome (n = 13) as cases and a small number of controls (n = 31 or n = 63, respectively), we identified by genetic association the chromosomal regions that contained the BLM and MSH2 genes. In the case of the BLM localization, a panel of only 3,258 SNPs, with an average spacing of 1 SNP per 874 kb (Table 2), was sufficient to detect a significantly different allele frequency in cases versus controls at the locus TSC0754862 (Fig. 1 A), which is localized 1.7 Mb from BLM, and significantly different haplotype frequencies at two-locus, three-locus, and four-locus haplotypes that contained TSC0754862 (Fig. 1 C). Similarly, in the case of the MSH2 localization, a panel of only 8,549 SNPs, with an average spacing of 1 SNP per 389 kb (Table 2), was sufficient to detect significantly different haplotype frequencies in cases versus controls at 14 loci that spanned a 4-Mb segment that included MSH2 (Fig. 2 B and Fig. 3 A and B). After adjustments for multiple testing, the P values of comparisons at markers flanking the BLM and MSH2 genes were the only significant P values obtained, with the exception of single regions on chromosomes 3, 15, and 18 identified in the MSH2 analyses, indicating that the rates of type I error were low with the design used. These valid localizations demonstrated that SNP LD mapping can work, even when the density of SNPs genotyped is low, because the LD around these genes is so extensive.
The frequency of the alleles in the controls that are associated with the mutation is an important factor in the success of this approach in gene discovery. Four loci (TSC0754862, TSC0125422, TSC0033517, and TSC0288059) in a 1.7-Mb region that flanked BLM were present on the relatively intact founder haplotype (Fig. 1 D). blmAsh was associated with the minor allele at TSC0754862, which had an allele frequency of 0.36 in controls, and with the major allele at TSC0288059, which had an allele frequency of 0.55 in controls. The resulting P value for association at TSC0754862 (P = 1.6 × 10−5) was considerably smaller than at TSC0288059 (P = 1.8 × 10−2). TSC0125422 had a minor allele frequency of 0.09 in controls but the minor allele was not associated with blmAsh, and TSC0033517 was not polymorphic in our sample of Ashkenazi Jews. Because the possibility of detecting association depends on observing a frequency difference between cases and controls, lower frequency alleles are especially advantageous if by chance the mutation arose on a chromosome carrying the less frequent allele. We conclude that increasing the density of SNPs in the panel should increase the power of the approach, because the likelihood becomes greater that a low-frequency allele that is strongly associated with the disease-causing mutation is present in the SNP panel.
In the case of MSH2, in which the region of LD extends over 11 Mb, more than 50 markers of the 8,549 genotyped could have contributed to the localization. However, a significant P value was not achieved in a single-locus test. Instead, 6-locus to 12-locus haplotype analyses overcame the limitation by delivering significant P values in the MSH2 region, indicating that haplotype analysis can be more powerful than the single-locus analysis. This additional power originates from the increased informativeness of haplotypes. We note the importance of not eliminating samples that have missing genotype data. When the samples with missing data were excluded, no significant P values were obtained (data not shown). This failure occurred because no data were recorded for several important case samples at TSC326842, which is central in the core MSH2 region (Fig. 3 B). Consequently, we advocate the use of the expectation-maximization algorithm to compute maximum likelihood estimates from the incomplete data. A recent publication by Morris et al. (29) suggested that the use of unphased genotype data would be more efficient than the use of imputed haplotype data. Genotype data from validation studies such as ours could be used to evaluate new analytical methods such as this one.
The problem of possible false-positive results was raised by the results of the haplotype analysis of the MSH2 data (Fig. 3 B and data not shown). Formally, such results could have arisen from real associations, such as with modifier genes. However, because several statistically significant associations were found outside of the region that contained MSH2, it seems likely that with more data, these associations would become less significant. False-positive results can occur by chance because of sampling error. Our testing strategy limits the overall false-positive rate. When multiple testing is taken into account, P values must be less than 7.0 × 10−6 to be considered statistically significant. We note that this adjustment is very similar to that given by the Bonferroni correction. We speculate that experiments performed with greater numbers of loci in the SNP panel would demand even lower P values to achieve significance.
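For comparison, the Bonferroni thresholds for the two panels can be computed directly. This is a minimal sketch that assumes a family-wise significance level of 0.05 and one test per SNP locus; both are assumptions made for illustration, and the function name `bonferroni_threshold` is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

# Panel sizes are those reported for the two trials.
orchid = bonferroni_threshold(0.05, 3258)      # Bloom syndrome panel
affymetrix = bonferroni_threshold(0.05, 8549)  # HNPCC panel
print(f"3,258 loci: P < {orchid:.2e}")
print(f"8,549 loci: P < {affymetrix:.2e}")
```

Under these assumptions the threshold for the 8,549-locus panel is approximately 5.8 × 10−6, close to the value of 7.0 × 10−6 given above, and it decreases in proportion to the number of loci tested.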
In this work, we performed two different haplotype analyses of our data, which proved similar in their ability to detect association between the markers tested in the disease gene regions and disease status. In the logistic regression analysis, significant P values were recorded when the haplotype frequency cutoff was set to 0.05. Using a cutoff point of 0.005, however, resulted in a failure to detect the significant association between MSH2 and linked markers and in many false positives. This problem most likely occurred because, when rarer haplotypes were included in the analysis, the variance factor in the denominator of the score statistic could become very small, which inflated the score statistic and resulted in increased noise.
The markers in the panels used here were selected primarily from SNPs previously validated by The SNP Consortium and for their even, genome-wide distribution. In addition, each SNP had a relatively high heterozygosity (>0.25) in the three major population groups used by The SNP Consortium (8, 20, 21, 22). The SNPs were selected without consideration for the proximity to any particular gene. We note that, if by chance the SNP TSC0754862 had not been genotyped in this experiment, then BLM would not have been localized. This observation suggests that false negatives in SNP LD mapping remain a problem until SNP panels are constructed that provide full coverage of the genome.
In comparison with conventional linkage approaches, the association design tested here had comparable power for gene localization. BLM was first localized by a linkage approach known as homozygosity mapping (30), and MSH2 was first localized by conventional linkage analysis in several large families (31). With respect to the identification of the minimum genomic regions that contain these genes, the association method used here defined a smaller critical region relative to the numbers of cases studied. This observation holds true because the core regions of LD around each of these genes are less than 5 Mb, whereas an analysis of many meioses would be required in linkage analysis to obtain a similarly small critical region. Moreover, because only a single case per family is required in this SNP LD mapping approach, it is easier to collect samples for the analysis, making it better suited than linkage for diseases in which family material is difficult to collect (e.g., in late-onset diseases such as cancer). Once disease genes have been localized in the genome-wide scan, testing additional markers in the region of LD should permit narrowing of the critical region through the identification of historical recombinants.
The 10 Ashkenazi Jews with Bloom syndrome were chosen at random from the 27 persons available to us who could have been selected. The 13 Ashkenazi Jews with MSI-positive tumors from HNPCC/HNPCC-like families comprised all such cases available for analysis at Memorial Hospital. The controls consisted of healthy Ashkenazi Jews and, in the case of the BLM localization, Ashkenazi Jews with HNPCC (both MSI positive and MSI negative). Although the risk of development of colon cancer due to blmAsh mutation is increased ∼2-fold (19), the frequency of blmAsh in the Ashkenazi Jewish population is low (0.01), and the occurrence of blmAsh chromosomes in the controls would have biased toward the null hypothesis. The frequency of the MSH2 A636P chromosomes in the Ashkenazi population is so low (<0.23%) as not to be a concern in this study (16). We note that the presence of ethnicity-only matched controls in the analysis did not diminish our ability to identify significant associations in this work.
Genetic heterogeneity, phenocopies, low penetrance, and the adjustments required for multiple testing present significant challenges to detecting associations in common diseases because they reduce the power of the study. It is possible that these problems could be overcome by using highly dense SNP panels and genotyping large numbers of cases. We have demonstrated here that the use of high-risk cases from a genetically isolated population could be a more productive and less costly approach. Drawing cases from a genetically isolated population very likely will be critical for the success of SNP LD mapping because it provides the best control for the variation in SNP allele frequencies among populations, and it ensures that the disease-causing mutation and its surrounding genomic environs are identical by descent from a common founder.
A, plot of the P values obtained by two-sided Fisher’s exact test comparing genotype frequencies (2 × 3) at each of the 3,258 SNP loci tested in 10 Ashkenazi Jewish cases with Bloom syndrome versus 31 non-Bloom syndrome controls. Loci were arranged by nucleotide number, starting at nucleotide 1 on chromosome 1 and ascending by chromosome number to the last nucleotide on chromosome 22. The X chromosome was not represented in the Orchid SNP panel. The P values were transformed by −log10(P value) to display the scores as peaks. The smallest P value (highest peak) was obtained at TSC0754862 (P = 1.56 × 10−5), a locus 1.7 Mb from BLM. B, plot representing the single marker P values in A after adjustment for multiple testing by the Benjamini–Hochberg correction. Horizontal dotted line, the adjusted significance threshold of P = 0.05. C, plot representing the minimum P values calculated by logistic regression after estimating frequencies of three-locus haplotypes by expectation-maximization algorithm in cases versus controls. The smallest P value (P = 3.0 × 10−6) was obtained at the three-locus haplotype TSC0754862, TSC0125422, and TSC0033517 immediately proximal to BLM. D, three-locus haplotype analysis of the loci flanking BLM on chromosome 15. In parentheses, the nucleotide position of the proximal SNP of each three-locus haplotype analyzed. A P value was calculated for each haplotype by comparison of the frequencies of the haplotypes in cases versus controls. Each peak, the smallest unadjusted P value of those calculated for each group of three consecutive loci. Above each peak, the haplotype that gives the smallest P value. BLM is at nucleotides 88,859,837 (5′ end of the gene) to 88,957,942 (3′ end of the gene). All of the nucleotide positions were as given in the human genome database build 32.
A, plot of the P values obtained by two-sided Fisher’s exact test comparing genotype frequencies (2 × 3) at each of the 8,549 SNP loci tested in 13 Ashkenazi Jewish HNPCC/HNPCC-like cases versus 63 healthy Ashkenazi Jewish controls. Loci were arranged by nucleotide number, starting at nucleotide 1 on chromosome 1 and ascending by chromosome number to the last nucleotide on chromosome X. The P values were transformed by −log10(P value) to display the scores as peaks. The smallest P value in the analysis (P = 2.4 × 10−5) was obtained at TSC1443434 at nucleotide position 18,511,390 on chromosome 15. Arrow, the region that contains MSH2. B, plot representing the minimum P values calculated by logistic regression after estimating the frequencies of 12-locus haplotypes by expectation-maximization algorithm in cases versus controls. The smallest P value (4.0 × 10−6) was obtained at the 12-locus haplotype TSC59005 to TSC535216, which contains MSH2.
A, plot representing the minimum P values calculated by the maximum likelihood ratio test after estimating frequencies of 12-locus haplotypes by expectation-maximization algorithm in 13 Ashkenazi Jewish HNPCC/HNPCC-like cases versus 63 healthy Ashkenazi Jewish controls. The smallest P value (1.1 × 10−6) was obtained at the 12-locus haplotype TSC529535 to TSC588566, which contains MSH2. The adjacent haplotype distal to this one recorded the second smallest P value (2.0 × 10−6) in the analysis. Arrow, the region that contains MSH2. B, the haplotypes that recorded the minimum P values for the 50 loci that are contained in the region of LD surrounding MSH2. The first P value (0.44), represented by the transformation −log10(P), was obtained for the leftmost haplotype shown containing the loci TSC516788 to TSC46032, the second P value (1.1) for the next haplotype to the right containing the loci TSC57106 to TSC529535, and so forth. Numbers 1 and 2 in columns across top, numbers arbitrarily assigned to alleles for representation of the haplotypes. CRC numbers on left side, case designations 1 through 10 of persons who carried the A636P mutation. Genotypes at each of the 50 loci are shown for each of the 10 cases. Basepair, nucleotide positions given from human genome database build 33. TSC ID, locus designations. Shaded, the portion of the founder haplotype that is carried by each case. Arrow, the position of the MSH2 gene.
Grant support: This work was supported by the Tavel-Resnick Foundation, the Frankel Fellowship, the Lymphoma Foundation, the Danziger Foundation, the Byrne Foundation, R01-HL56778 (N. Ellis), and CA103394 (N. Mitra).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Notes: Supplementary data for this article can be found at Cancer Research Online (http://cancerres.aacrjournals.org); N. Mitra and T-Z. Ye contributed equally to the work; M. Phillips is currently at the Genome Quebec Innovation Centre, Montreal, Quebec, Canada.
Requests for reprints: Nathan Ellis, Department of Medicine, Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021. Phone: (212) 639-7183; Fax: (212) 717-3571; E-mail: [email protected]
See website for Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim.
Supplementary figures for this article can be found at Cancer Research Online (http://cancerres.aacrjournals.org).
Linkage disequilibrium block sizes for a selection of representative Jewish genetic diseases
Disease | Gene | Allele | Location | Block size (Mb)* | Reference |
---|---|---|---|---|---|
Bloom syndrome | BLM | blmAsh | 15q26.1 | 1.2 | 13 |
Breast cancer | BRCA1 | 187delAG | 17q21 | 1.0 | 32 |
Breast cancer | BRCA2 | 6174delT | 13q12.3 | 2.5 | 33 |
Dysautonomia | IKBKAP | IVS20+6T>C | 9q31-33 | 3.0 | 34 |
Gaucher disease | GBA | N370S | 1q21 | 4.5 | 35 |
HNPCC | MSH2 | A636P | 2p22-21 | >10 | 16 |
Hypercholesterolemia | LDLR | 197delG | 9p13.3 | 3.5 | 36 |
Hyperinsulinemia | SUR1 | ΔF1388 | 11p15.1 | 6.2 | 37 |
Idiopathic torsion dystonia | DYT1 | delGAG | 9q34 | 1.8 | 38 |
Mucolipidosis | MCOLN1 | IVS3-2A>G | 19p13.3 | 1.9 | 39 |
*Block sizes were estimated by determination of the physical distance (by human genome database build 34.3) between the most distal and most proximal genetic markers associated with the disease as described in the referenced articles.
Mean and median distance between SNPs
Chromosome | No. of SNPs (Orchid) | No. of SNPs (Affymetrix) | Mean distance between SNPs, kb (Orchid) | Mean distance between SNPs, kb (Affymetrix) | Median distance between SNPs, kb (Orchid) | Median distance between SNPs, kb (Affymetrix) |
---|---|---|---|---|---|---|
1 | 276 | 649 | 890 | 374 | 626 | 188 |
2 | 284 | 722 | 847 | 328 | 703 | 161 |
3 | 220 | 604 | 882 | 321 | 694 | 157 |
4 | 228 | 592 | 835 | 321 | 614 | 183 |
5 | 226 | 603 | 800 | 296 | 646 | 151 |
6 | 195 | 610 | 872 | 275 | 700 | 146 |
7 | 203 | 447 | 775 | 349 | 606 | 173 |
8 | 173 | 442 | 828 | 321 | 668 | 148 |
9 | 127 | 418 | 1041 | 309 | 738 | 138 |
10 | 149 | 474 | 899 | 281 | 689 | 142 |
11 | 162 | 504 | 842 | 273 | 685 | 119 |
12 | 163 | 399 | 805 | 328 | 625 | 140 |
13 | 106 | 373 | 1062 | 255 | 744 | 136 |
14 | 118 | 320 | 882 | 271 | 572 | 139 |
15 | 96 | 245 | 1028 | 329 | 744 | 175 |
16 | 87 | 187 | 933 | 434 | 691 | 179 |
17 | 95 | 144 | 838 | 540 | 661 | 278 |
18 | 82 | 242 | 925 | 316 | 712 | 144 |
19 | 75 | 77 | 798 | 723 | 627 | 328 |
20 | 87 | 166 | 712 | 374 | 524 | 194 |
21 | 49 | 147 | 907 | 227 | 524 | 118 |
22 | 57 | 62 | 836 | 552 | 459 | 276 |
X | 0 | 122 | 0 | 1141 | 0 | 648 |
Sum | 3258 | 8549 | — | — | — | — |
Average | — | — | 874 | 389 | 648 | 194 |
Acknowledgments
The authors gratefully acknowledge James German of Cornell University Medical College who through his lifelong study of Bloom’s syndrome made this work possible.