Abstract
Background: Cases with a family history are enriched for genetic risk variants, and the power of association studies can be improved by selecting cases with a family history of disease. However, in recent genome-wide association scans utilizing familial sampling, the excess relative risk for familial cases is less than predicted when compared with unselected cases. This can be explained by incomplete linkage disequilibrium between the tested marker and the underlying causal variant.
Methods: We show that the allele frequency and effect size of the underlying causal variant can be estimated by combining marker data from studies that ascertain cases based on different family histories. This allows us to learn about the genetic architecture of a complex trait, without having identified any causal variants. We consider several validated common marker alleles for breast cancer, using our own study of high risk, predominantly bilateral cases, cases preferentially selected to have at least two affected first- or second-degree relatives, and published estimates of relative risk from standard case–control studies.
Results: To obtain realistic estimates and to accommodate some prior beliefs, we use Bayesian estimation to infer that the causal variants are probably common, with minor allele frequency >5%, and have small effects, with relative risk around 1.2.
Conclusion: These results strongly support the common disease common variant hypothesis for these specific loci associated with breast cancer.
Impact: Our results agree with recent assertions that synthetic associations of rare variants are unlikely to account for most associations seen in genome-wide studies. Cancer Epidemiol Biomarkers Prev; 21(2); 262–72. ©2011 AACR.
This article is featured in Highlights of This Issue, p. 251
Introduction
Familial cases of disease are likely to segregate genetic risk variants, and therefore offer increased efficiency over sporadic cases for detecting new variants (1, 2). Large collections of familial cases ascertained through genetics clinics (3), and of cases with two primary cancers ascertained through cancer registries (4), have been developed specifically for genetic studies of breast cancer. The increased efficiency of studies of these “genetically enriched” cases results from the fact that, for example, the allele frequency difference between cases with one affected first-degree relative and population controls is about 1.5 times greater than in unselected cases, and there is an approximately 2-fold higher difference for cases with 2 affected first-degree relatives (1). Breast cancer can also be bilateral, and bilateral cases likewise show an approximately 2-fold higher difference. Recently, we (5) and others (6) have exploited the genetic enrichment of bilateral and familial cases in genome-wide association scans (GWAS) and thereby identified novel single nucleotide polymorphisms (SNP) associated with breast cancer.
For a given ascertainment scheme the excess risk, that is the increase in effect size compared with a study of sporadic cases and population controls, can be predicted for variants having a direct causal effect on disease. However, the relationship for marker SNPs is less clear owing to incomplete linkage disequilibrium (LD) between the marker and the causal variant. Significant deviation of the excess risk from its expected value has been observed in familial cases of breast cancer (6), and here we show similar deviation in bilateral cases. The observed excess risk is generally lower than that predicted for the causal variant, implying that the gain in efficiency is less than had been anticipated.
Here, we give expressions for the relative risk for marker genotypes in bilateral and familial cases, in terms of the marker and causal genotype frequencies, LD between marker and causal variant, and the relative risk of the causal variant. We show that the attenuation of the excess risk observed in recent studies can be explained by realistic models of LD and causal relative risks.
We then note that, given the marker relative risks in at least 3 distinct sampling schemes, it is possible to infer the genotype frequency and relative risk of the causal variant, even if the identity of that variant is unknown. This bears on recent debates over whether the numerous SNP associations emerging from GWAS are primarily driven by rare variants with larger effects than those estimated by GWAS, or by common variants with similar properties to the tag SNPs used in GWAS. The former scenario would explain more heritability than the latter (7), giving a partial account of the missing heritability problem (8). Recently, motivated by earlier observations that rare variation could explain common disease (9), Dickson and colleagues argued that rare variations could stochastically occur in coupling with a common SNP allele, creating “synthetic” association of the common variant (10). But this hypothesis has been challenged on theoretical and empirical grounds, with the evidence to date pointing to a greater role for common causal variation (11, 12). These arguments are based on simulations and whole-genome averages, and there has been little work on determining whether specific causal variants are rare or common, as this would normally require complete resequencing of associated regions.
Here, we address this issue for 10 specific loci that have been consistently associated with breast cancer, by inferring the allele frequency and relative risk of the causal variants from the marker relative risks estimated from studies of sporadic, bilateral, and familial cases. To allow for sampling variation and to accommodate some prior beliefs, we conduct a Bayesian estimation of causal effects to show that the variants underlying these established associations are probably common and have similar relative risks to those observed for the marker SNPs. This is in agreement with recent simulation results (13) and, for the first time, provides explicit support for the common disease common variant (CDCV) hypothesis applied to breast cancer. The approach we develop can be applied to various familial sampling schemes, including those based on concordant or discordant sib pairs, including twins, and is not limited to the studies of breast cancer described here.
Materials and Methods
Subjects
Cases and controls were selected from the British Breast Cancer Study (BBCS), details of which have been published previously (4, 5). For this analysis, we included 1,695 cases, 1,564 of whom had 2 sequential or simultaneous primary breast cancers, and 131 of whom had at least 2 affected first-degree relatives. The excess relative risk among cases with a second primary and those with 2 affected first-degree relatives has been shown to be equivalent (1), so the subsequent analysis assumes that all cases are bilateral. A total of 2,001 controls were ascertained as friends and nonblood relatives of the cases. All cases and controls were of self-reported white Caucasian ancestry and all controls were free from breast cancer at enrolment in the study. Collection of blood samples and questionnaire information from case and control subjects was undertaken with informed consent and in accordance with the tenets of the Declaration of Helsinki.
Genotyping
DNA was extracted from blood samples using conventional methodologies and quantified using PicoGreen (Invitrogen). Genotype data from the BBCS were obtained for 10 SNPs that have been reproducibly associated with breast cancer by GWAS (Table 1). Genotypes for rs13387042, rs4973768, and rs6504950 have been included in previous publications (14, 15). Additional genotyping of rs13387042 and de novo genotyping of 3 SNPs (rs10941679, rs2046210, and rs11249433) was carried out using the TaqMan nuclease assay, with reagents designed by Applied Biosystems as Assays-by-Design and genotyping done using the ABI PRISM 7900HT Sequence Detection System according to the manufacturer's instructions. For 4 other SNPs (rs889312, rs2981582, rs3817198, and rs3803662) genotyping was done by KBioscience Ltd., using their proprietary in-house system (KASPar), a competitive allele-specific PCR SNP genotyping system that uses FRET quencher cassette oligos.
Table 1. Per-allele relative risks of associated SNPs considered in this study
SNP | Band | Gene | MAF | Unselected cases | Familial cases | Bilateral cases |
---|---|---|---|---|---|---|
rs11249433 | 1p11 | | 0.39 | 1.16 (1.09–1.24)^a | 1.08 (1.02–1.15) | 1.22 (1.09–1.35) |
rs13387042 | 2q35 | | 0.49 | 1.12 (1.09–1.15)^b | 1.21 (1.14–1.29) | 1.09 (0.97–1.21) |
rs4973768 | 3p22 | SLC4A7 | 0.46 | 1.11 (1.08–1.13)^c | 1.16 (1.10–1.24) | 1.11 (0.99–1.23) |
rs10941679 | 5p12 | | 0.25 | 1.19 (1.11–1.28)^d | 1.11 (1.04–1.19)^g | 1.29 (1.15–1.43) |
rs889312 | 5q11 | MAP3K1 | 0.38 | 1.12 (1.08–1.16)^e | 1.22 (1.14–1.30) | 1.11 (1.01–1.21) |
rs2046210 | 6q25 | | 0.34 | 1.15 (1.03–1.28)^f | 1.15 (1.08–1.22)^g | 1.12 (0.99–1.25) |
rs2981582 | 10q25 | FGFR2 | 0.38 | 1.26 (1.22–1.29)^e | 1.43 (1.35–1.53) | 1.39 (1.29–1.49) |
rs3817198 | 11p15 | LSP1 | 0.30 | 1.07 (1.04–1.11)^e | 1.12 (1.05–1.19) | 1.10 (1.00–1.20) |
rs3803662 | 16q12 | TOX3 | 0.25 | 1.19 (1.15–1.23)^e | 1.30 (1.22–1.39) | 1.41 (1.30–1.52) |
rs6504950 | 17q22 | COX11 | 0.27 | 0.95 (0.92–0.97)^c | 0.92 (0.86–0.99)^g | 0.93 (0.79–1.07) |
Estimates for unselected cases are taken from ^a ref. (19), ^b ref. (14), ^c ref. (15), ^d ref. (17), ^e ref. (16), and ^f ref. (18). Estimates for familial cases are taken from ref. (6). Estimates for bilateral cases are taken from this study (see Methods).
^g Relative risk for a proximal SNP in LD (r² > 0.75) with the lead SNP.
Abbreviation: MAF, minor allele frequency.
Call rates for SNPs genotyped as part of this analysis were 99.9% (rs13387042), 99.2% (rs10941679), 99.9% (rs2046210), 99.7% (rs11249433), 98.2% (rs889312), 97.8% (rs2981582), 98.9% (rs3817198), and 98.0% (rs3803662), and there was no evidence of deviation from Hardy–Weinberg equilibrium in controls for any of the SNPs (all P > 0.05). Duplicate concordance based on a 3.5% random sample was 100% for all SNPs.
Published data
We obtained summary OR estimates from published studies using cases unselected for a family history (14–19). We also obtained summary estimates from a recent GWAS by Turnbull and colleagues that preferentially selected cases to have at least 2 affected first or second-degree relatives (6). The exact ascertainment criterion was unspecified and we assumed that half the cases had at least 2 affected first-degree relatives and the other half at least 2 affected second-degree relatives. This approximation was made following informal consultation with those authors, but our conclusions turn out to be similar if we assume, say, that each case has at least 1 affected first-degree and 1 affected second-degree relative.
In what follows, we approximate the OR by the relative risk, as this is appropriate for a rare disease and leads to simplification in the analysis. Furthermore, several published studies used population-based controls, not selected to be disease free, for which the relative risk was estimated directly.
Effect size in familial cases
Let Y denote disease status, Y = 1 for disease present and Y = 0 for disease absent. Let M be a diallelic marker with genotypes {0, 1, 2} corresponding to the number of minor alleles present. Let D be a diallelic variant with direct causal effect, also with genotypes {0, 1, 2}.
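The relative risk of causal genotype d, compared with baseline 0, is
$$\gamma _{D_d } = \frac{{\Pr (Y = 1|D = d)}}{{\Pr (Y = 1|D = 0)}}\tag{1.1}$$
and the relative risk of marker genotype m, compared with baseline 0, is
$$\gamma _{M_m } = \frac{{\sum\nolimits_d {\gamma _{D_d } f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d } f_{D|M} (d|0)} }}\tag{1.2}$$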
where |$f_{D|M} (d|m) = \Pr (D = d|M = m)$|. The marker relative risk thus depends only on the causal relative risks and the conditional distribution of causal genotype given marker genotype, which reflects the LD, or correlation, between the 2 genotypes.
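Among bilateral cases of disease, the marker relative risk is given by
$$\gamma _{M_m ;0} = \frac{{\sum\nolimits_d {\gamma _{D_d }^2 f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d }^2 f_{D|M} (d|0)} }}\tag{1.3}$$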
where the subscript of γ indicates that each case has an affected 0th-degree relative. This expression also holds for cases with an affected monozygotic twin. We assume that controls are not selected for their disease status or family history, and that cancers arise independently in each breast, conditional on other individual-level risk factors.
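For familial cases, the marker relative risk is obtained by considering the probability that affected relatives share a causal variant identical by descent (IBD). For first-degree relatives the probability is ½ that a causal variant is shared IBD, and the probability that a subject with genotype d is affected and has at least 1 affected first-degree relative is given by
$$\Pr (Y = 1|D = d)\left\{ {1 - \left[ {1 - \tfrac{1}{2}\Pr (Y = 1|D = d) - \tfrac{1}{2}\Pr (Y = 1)} \right]^J } \right\}\tag{1.4}$$
where J denotes the number of first-degree relatives for the subject. Assuming that disease risks are small, this is approximated by
$$\tfrac{J}{2}\Pr (Y = 1|D = d)\left[ {\Pr (Y = 1|D = d) + \Pr (Y = 1)} \right]\tag{1.5}$$
so that the marker relative risk is
$$\gamma _{M_m ;1} = \frac{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + \bar \gamma _D )f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + \bar \gamma _D )f_{D|M} (d|0)} }}\tag{1.6}$$
where
$$\bar \gamma _D = \sum\nolimits_d {\gamma _{D_d } f_D (d)} .$$
In addition to the conditional distribution |$f_{D|M} (d|m)$|, the marker and causal relative risks are now also related through the causal genotype distribution fD(d). Similarly, for a case with at least 2 affected first-degree relatives, the marker relative risk is approximately
$$\gamma _{M_m ;1,1} = \frac{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + \bar \gamma _D )^2 f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + \bar \gamma _D )^2 f_{D|M} (d|0)} }}\tag{1.7}$$
Similar arguments, with IBD sharing probability ¼, give
$$\gamma _{M_m ;2} = \frac{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + 3\bar \gamma _D )f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + 3\bar \gamma _D )f_{D|M} (d|0)} }}\tag{1.8}$$
for cases with at least 1 affected second-degree relative, and
$$\gamma _{M_m ;2,2} = \frac{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + 3\bar \gamma _D )^2 f_{D|M} (d|m)} }}{{\sum\nolimits_d {\gamma _{D_d } (\gamma _{D_d } + 3\bar \gamma _D )^2 f_{D|M} (d|0)} }}\tag{1.9}$$
for cases with at least 2 affected second-degree relatives. We use the geometric mean of equations (1.7) and (1.9) to model the summary estimates of Turnbull and colleagues (6), denoted by γM;fam.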
For each case ascertainment scheme we define the excess risk as the ratio of the log relative risk in selected, familial cases to the log relative risk in unselected, sporadic cases. Thus for bilateral cases the excess risk for genotype m is |$\log \gamma _{M_m ;0} /\log \gamma _{M_m }$|, which is 2 when disease and marker genotypes are perfectly correlated.
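As a rough numerical illustration of the excess risk, the sketch below works at the allele level (D and M taking values in {0, 1}). It assumes that the marker relative risk is the ratio of the average causal relative risk among carriers and noncarriers of the marker allele, with the causal relative risk squared for bilateral cases. The function names and the example inputs are hypothetical, and the formulas are one reading of the description above rather than a verified transcription of the numbered equations.

```python
import math

def marker_rr(gamma, f11, f10, power=1):
    """Allelic marker relative risk (assumed form).

    gamma : relative risk of the causal allele
    f11   : Pr(D = 1 | M = 1)
    f10   : Pr(D = 1 | M = 0)
    power : 1 for unselected cases, 2 for bilateral cases
    """
    g = gamma ** power
    return (1.0 + f11 * (g - 1.0)) / (1.0 + f10 * (g - 1.0))

def excess_risk_bilateral(gamma, f11, f10):
    """Ratio of the log marker RR in bilateral cases to that in unselected cases."""
    return math.log(marker_rr(gamma, f11, f10, 2)) / math.log(marker_rr(gamma, f11, f10, 1))

# Perfect correlation: the marker allele always carries the causal allele.
perfect = excess_risk_bilateral(1.2, 1.0, 0.0)

# Hypothetical incomplete LD in which the causal allele is more common than the marker allele.
partial = excess_risk_bilateral(1.2, 0.95, 0.30)

print(perfect, partial)
```

With perfect correlation the ratio is 2, whereas the hypothetical incomplete-LD inputs give a value below 2.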
Identification of causal effects
From equations (1.2) to (1.9), the marker and causal relative risks are related through the conditional distribution of causal genotype given marker genotype |$f_{D|M} (d|m)$| and the unconditional distribution of causal genotypes fD(d). Given the marker relative risks from a number of study designs, it is possible in principle to solve for the causal effects. For diallelic marker and causal variant, |$f_{D|M} (d|m)$| is specified by 6 free parameters, fD(d) by 2, and there are 2 causal relative risks |$\gamma _{D_d }$|. All parameters could therefore be identified given the marker relative risks from 10 different familial sampling schemes. This is more than is practical to obtain, but with some mild assumptions we can reduce the parameters to a practical number.
First, we assume a multiplicative model of disease risk at the causal variant, in which |$\gamma _{D_2 } = \gamma _{D_1 }^2$|. This assumption fits most observed marker associations well, and can be expected to extend to causal variants as well (20). We further assume Hardy–Weinberg equilibrium for both marker and causal variant, so that the genotype frequency distributions can be parameterized by allele frequencies. Under these assumptions all the relative risks factorize into allelic terms and we may simply work with D and M taking values in {0,1}. This leaves 4 free parameters to be identified: |$f_{D|M} (1|1)$|, fD(1), fM(1), and |$\gamma _{D_1 }$|.
We assume that the marker allele frequencies fM(m) are known, so that we have just 3 free parameters to identify. Therefore, the marker relative risks from just 3 familial ascertainment schemes are needed to solve for the effects of the causal variant. Here, we use studies of unselected cases [equation (1.2), from published data], cases with bilateral disease [equation (1.3), from our data], and an equal mixture of cases with at least 2 affected first- or second-degree relatives [equations (1.7) and (1.9), from Turnbull and colleagues].
Bayesian inference of causal effects
In practice, we cannot solve for |$f_{D|M} (1|1)$|, fD(1), and |$\gamma _{D_1 }$| exactly because estimates of marker relative risks are subject to sampling variation and do not conform to equations (1.2) to (1.9). Instead, we estimate the causal parameters using a likelihood calculated from the estimated marker effects, assuming that they are obtained from independent samples. Letting β denote log relative risk, |$\beta _ \cdot = \log (\gamma _ \cdot )$|, the likelihood is
$$L = \phi (\hat \beta _M ;\beta _M ,\hat \sigma _M^2 )\,\phi (\hat \beta _{M;0} ;\beta _{M;0} ,\hat \sigma _{M;0}^2 )\,\phi (\hat \beta _{M;{\rm fam}} ;\beta _{M;{\rm fam}} ,\hat \sigma _{M;{\rm fam}}^2 )\tag{1.11}$$
where |$\phi ( \cdot ;\mu ,\sigma ^2 )$| is the normal density with mean μ and variance σ², hats denote sample estimates, the mean terms are obtained from equations (1.2), (1.3), (1.7), and (1.9), and the variances are assumed equal to their sample estimates.
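A minimal sketch of this likelihood follows. It assumes that each published estimate is summarized by a log relative risk whose SE is recovered from the 95% confidence interval, and that the model-predicted log relative risks are supplied by the caller. The helper names are hypothetical; the inputs are the FGFR2 (rs2981582) entries of Table 1, and the alternative prediction is an arbitrary illustrative value.

```python
import math

Z975 = 1.959964  # standard normal quantile for a two-sided 95% interval

def log_rr_and_se(rr, lower, upper):
    """Log relative risk and its SE, recovered from a point estimate and 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * Z975)
    return math.log(rr), se

def log_likelihood(estimates, predicted):
    """Sum of independent normal log densities.

    estimates : list of (beta_hat, se) pairs, one per study design
    predicted : list of model-predicted log relative risks, in the same order
    """
    total = 0.0
    for (beta_hat, se), mu in zip(estimates, predicted):
        total += -0.5 * math.log(2.0 * math.pi * se ** 2) - (beta_hat - mu) ** 2 / (2.0 * se ** 2)
    return total

# rs2981582 from Table 1: unselected, bilateral, and familial cases.
fgfr2 = [
    log_rr_and_se(1.26, 1.22, 1.29),
    log_rr_and_se(1.39, 1.29, 1.49),
    log_rr_and_se(1.43, 1.35, 1.53),
]

# The likelihood is highest when the predictions equal the estimates.
ll_at_estimates = log_likelihood(fgfr2, [b for b, _ in fgfr2])

# Illustrative alternative: the same log RR of log(1.26) predicted in every design.
ll_no_enrichment = log_likelihood(fgfr2, [math.log(1.26)] * 3)

print(ll_at_estimates, ll_no_enrichment)
```

In a full analysis the predicted values would be computed from the causal parameters through the marker relative risk equations, which this sketch deliberately leaves to the caller.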
Maximum likelihood estimation of the causal effects is unsatisfactory for two reasons. First, in the data we consider, the likelihood function can be multimodal and difficult to maximize numerically. Second, in a number of instances the likelihood is maximized at parameter values that are unrealistic, such as a very high causal relative risk but minimal correlation between causal and marker variant. This scenario is unreasonable because the SNPs studied were selected as the most strongly associated within their regions, and so are likely to have the highest correlation with the causal variant.
For these reasons, we conduct a Bayesian analysis using the likelihood in equation (1.11) and the following prior distributions. For the causal minor allele frequency (MAF), we follow the distribution of sequence variants observed in the ENCODE regions by the HapMap consortium (21). This strongly favors variants with frequency < 5%, with roughly equal probability given to frequencies > 20%. This distribution is closely approximated by a Beta(0.345, 1.058) distribution scaled by 0.5, which we use as our prior for fD(1) (Table 2).
Table 2. Prior distribution of MAF and SD of log relative risk
MAF (%) | ENCODE frequency (%) | SD of log RR |
---|---|---|
0–5 | 46 | 1.21 |
5–10 | 13 | 0.84 |
10–15 | 10 | 0.61 |
15–20 | 6 | 0.46 |
20–25 | 5 | 0.36 |
25–30 | 5 | 0.30 |
30–35 | 4 | 0.25 |
35–40 | 4 | 0.23 |
40–45 | 4 | 0.21 |
45–50 | 3 | 0.20 |
NOTE: “ENCODE frequency” gives the proportion of variants in the ENCODE regions having MAF in the stated range. “SD of log RR” gives the SD of the log relative risk for SNPs according to the model of Spencer and colleagues (13).
Abbreviations: RR, relative risk; SD, standard deviation.
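The scaled Beta prior for the causal MAF can be checked numerically against the ENCODE proportions. The sketch below is an illustration only: it evaluates the Beta(0.345, 1.058) distribution function with Simpson's rule after a change of variable, using only the standard library. The function names are hypothetical and the integration scheme is a convenience choice, not part of the original analysis.

```python
import math

A, B, SCALE = 0.345, 1.058, 0.5  # prior: MAF = SCALE * Beta(A, B)

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function for 0 <= x <= 1.

    Substituting u = t**a removes the singularity of t**(a - 1) at zero,
    so Simpson's rule (steps must be even) can be applied on [0, x**a].
    """
    upper = x ** a
    h = upper / steps

    def g(u):
        return (1.0 - u ** (1.0 / a)) ** (b - 1.0) / a

    total = g(0.0) + g(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * g(i * h)
    integral = total * h / 3.0
    beta_fn = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return integral / beta_fn

def prior_prob_maf_below(maf):
    """Prior probability that the causal MAF is below the given value."""
    return beta_cdf(maf / SCALE, A, B)

p_below_5pct = prior_prob_maf_below(0.05)
p_below_1pct = prior_prob_maf_below(0.01)
print(p_below_5pct, p_below_1pct)
```

Under these assumptions the two probabilities come out close to the 46% shown for the 0–5% bin of Table 2 and to the prior probabilities quoted with the posterior summaries.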
The causal relative risk is taken to be log-normal with most of the distribution less than 3, the value above which linkage analysis is expected to be more powerful than GWAS (11). We use a log-normal distribution with location 0 and scale dependent on the causal MAF according to the distribution proposed by Spencer and colleagues (ref. 13; Table 2). Together with our prior for the allele frequency this represents 80% belief that |$\gamma _{D_1 } < 2$| and 90% belief that |$\gamma _{D_1 } < 3$|, a strong but not overwhelming prior belief in moderate relative risks.
The prior for |$f_{D|M}$| is more difficult to specify because the range of values it can take depends upon the marker and causal allele frequencies. The prior information in this case consists only of imprecise beliefs that the causal and marker genotypes are in strong LD, but this notion is not easy to parameterize and quantify. GWAS are designed on the premise that all common variants are correlated with a tag SNP at say r2 ≥ 0.8, but we would like to allow the possibility that causal variants are rare, as reflected in our prior for fD(1), which implies lower correlation with associated tag SNPs.
Here, we quantify the LD between marker and causal variants by the log OR (logOR) for association between the two minor alleles
$$\theta = \log \frac{{f_{D|M} (1|1)\,f_{D|M} (0|0)}}{{f_{D|M} (0|1)\,f_{D|M} (1|0)}}\tag{1.10}$$
This is unbounded for all values of marker and causal allele frequencies, and the values of |$f_{D|M}$| can be expressed as functions of the allele frequencies and θ. As the prior for θ, we use a normal distribution with mean 6 and variance 1.48, which corresponds to r2 ≈ 0.8 when marker and causal alleles have equal frequency of about 0.25, with 90% probability that 0.5 ≤ r2 ≤ 0.9 in that case. This reflects generally held beliefs about tag SNPs for common causal variants, while allowing strong statistical associations between rare causal variants and common tag SNPs. This prior strongly favors a positive correlation between the minor alleles of the marker and the causal variant: whereas a negative correlation is possible, it would imply a much lower r2 between the two alleles, which conflicts with the assumption that the most informative marker SNP has been chosen from the region.
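To make the link between θ and more familiar LD measures concrete, the sketch below converts a pair of allele frequencies and a logOR into the conditional probabilities and r². It assumes that θ is the log odds ratio of the 2 × 2 table of marker and causal alleles, and solves the resulting quadratic for the frequency of haplotypes carrying both minor alleles. The function names are hypothetical.

```python
import math

def joint_carrier_freq(p_d, p_m, theta):
    """Frequency of haplotypes carrying both minor alleles, given logOR theta."""
    odds = math.exp(theta)
    if abs(odds - 1.0) < 1e-12:
        return p_d * p_m  # independence
    b = 1.0 + (p_d + p_m) * (odds - 1.0)
    disc = b * b - 4.0 * odds * (odds - 1.0) * p_d * p_m
    # Root of the quadratic that lies within the feasible range of the 2 x 2 table.
    return (b - math.sqrt(disc)) / (2.0 * (odds - 1.0))

def ld_summary(p_d, p_m, theta):
    """Return Pr(D=1|M=1), Pr(D=1|M=0), and r^2 between the two alleles."""
    x = joint_carrier_freq(p_d, p_m, theta)
    f11 = x / p_m
    f10 = (p_d - x) / (1.0 - p_m)
    r2 = (x - p_d * p_m) ** 2 / (p_d * (1.0 - p_d) * p_m * (1.0 - p_m))
    return f11, f10, r2

# Prior mean of theta with both allele frequencies equal to 0.25.
f11, f10, r2 = ld_summary(0.25, 0.25, 6.0)
print(f11, f10, r2)
```

With both allele frequencies at 0.25 and θ = 6, the computed r² is close to the value of about 0.8 quoted for the prior mean.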
It turns out that when this prior is used for the logOR, the posterior is virtually unchanged from the prior. We noticed a similar behavior for other priors that strongly favored a positive correlation between the two alleles, suggesting that the likelihood is almost independent of θ for high values of θ. Therefore, the prior for the logOR has little effect on the posterior distributions for the causal relative risk and allele frequency, which are our main parameters of interest. We return to this point in the discussion.
We obtained posterior distributions for |$\gamma _{D_1 }$|, fD(1), and θ using WinBUGS with 100,000 samples, discarding the first 10,000 and keeping all other settings at default values. In addition to the posterior median, mode, and 95% credible intervals, we calculated the posterior probability that the causal MAF is less than 5% or 1%.
Results
Marker relative risks in bilateral cases
In Table 1, we show the relative risks for 10 SNPs estimated from our sample of bilateral cases along with recent literature-based estimates for sporadic and familial cases. It is clear that there is an excess relative risk in bilaterals, confirming that bilateral sampling does offer a gain in efficiency. However, this excess is systematically less than 2, the value predicted for a causal variant, so that the gain in efficiency is less than had been anticipated.
These observations can be formally tested. Assuming normality of the log relative risk estimates,
$$\frac{{(c\hat \beta _M - \hat \beta _{M;0} )^2 }}{{c^2 \hat \sigma _M^2 + \hat \sigma _{M;0}^2 }} \sim \chi _1^2 \tag{2.1}$$
for each SNP, if the hypothesis holds that |$c\beta _M \, = \,\beta _{M;0}$|. Summing equation (2.1) over all 10 SNPs gives a |$\chi ^2$| variable on 10 df, which can be regarded as a deviance for the parameter c. The maximum likelihood estimate of c from Table 1 is 1.41, which is significantly different from both 1 (P = 0.027) and 2 (P = 0.003). It is of note that no individual SNP had an excess risk significantly less than 2, but the attenuation is significant when considering all SNPs jointly.
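The estimation of c can be sketched as follows. The code assumes that the per-SNP statistic is the squared difference between c times the unselected log relative risk and the bilateral log relative risk, divided by its variance, and that SEs are recovered from the 95% intervals in Table 1. It then minimizes the summed statistic over a grid. The helper names and the grid are illustrative choices, and the sketch is not guaranteed to reproduce the reported P values.

```python
import math

Z975 = 1.959964  # standard normal quantile for a two-sided 95% interval

# (RR, lower, upper) from Table 1: unselected cases, then bilateral cases.
TABLE1 = [
    ((1.16, 1.09, 1.24), (1.22, 1.09, 1.35)),  # rs11249433
    ((1.12, 1.09, 1.15), (1.09, 0.97, 1.21)),  # rs13387042
    ((1.11, 1.08, 1.13), (1.11, 0.99, 1.23)),  # rs4973768
    ((1.19, 1.11, 1.28), (1.29, 1.15, 1.43)),  # rs10941679
    ((1.12, 1.08, 1.16), (1.11, 1.01, 1.21)),  # rs889312
    ((1.15, 1.03, 1.28), (1.12, 0.99, 1.25)),  # rs2046210
    ((1.26, 1.22, 1.29), (1.39, 1.29, 1.49)),  # rs2981582
    ((1.07, 1.04, 1.11), (1.10, 1.00, 1.20)),  # rs3817198
    ((1.19, 1.15, 1.23), (1.41, 1.30, 1.52)),  # rs3803662
    ((0.95, 0.92, 0.97), (0.93, 0.79, 1.07)),  # rs6504950
]

def log_rr_and_se(rr, lower, upper):
    """Log relative risk and its SE from a point estimate and 95% CI."""
    return math.log(rr), (math.log(upper) - math.log(lower)) / (2.0 * Z975)

def deviance(c):
    """Summed statistic over the 10 SNPs for a common excess risk c."""
    total = 0.0
    for unselected, bilateral in TABLE1:
        b_m, s_m = log_rr_and_se(*unselected)
        b_0, s_0 = log_rr_and_se(*bilateral)
        total += (c * b_m - b_0) ** 2 / (c * c * s_m ** 2 + s_0 ** 2)
    return total

# Grid search for the minimizing value of c between 0.5 and 3.0.
grid = [0.5 + 0.001 * i for i in range(2501)]
c_hat = min(grid, key=deviance)
print(c_hat, deviance(1.0), deviance(c_hat), deviance(2.0))
```

The minimizing value falls between 1 and 2, in line with the attenuated excess risk described above.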
Figure 1 illustrates the excess relative risk in bilateral cases for a marker with allele frequency 0.25, for a range of causal relative risks and allele frequency. The logOR between marker and disease minor alleles is 6, although this value has very little effect on the figure. It is noticeable that the excess risk exceeds 2 when the causal allele is less common than the marker, but is less than 2 if the causal allele is more common. The attenuation is greater when the causal relative risk is higher. This pattern roughly holds for all values of marker and causal parameters, and also when the causal variant has dominant or recessive action (results not shown). Because the excess risk is systematically less than 2 for the markers shown in Table 1, this suggests that the causal variants might have risk allele frequency at least equal to that of the marker SNPs. However, this must be weighed against the greater prior probability that the causal variant is rare, given the distribution of allele frequency in the genome. The following section addresses this question.
Figure 1. Excess relative risk in bilateral cases for a marker with allele frequency 0.25. Excess risk is the ratio of the log relative risk in bilateral cases to that in unselected cases. Each line corresponds to a frequency of the minor allele at the causal variant, which is associated with the marker minor allele with log OR of 6.
Inference of causal effects
We applied Bayesian estimation of causal effects to the estimated marker effects shown in Table 1. The marker MAFs are the most accurate currently available and they were assumed to be known exactly. Tables 3 and 4 show summaries of the posterior distributions of the causal allele frequency and relative risk of the 10 associated loci. Kernel density plots of these parameters are shown in Figs. 2 and 3.
Figure 2. Kernel density plots of prior and posterior distributions of the causal allele frequency at 10 breast cancer loci.
Figure 3. Kernel density plots of prior and posterior distributions of the causal relative risk at 10 breast cancer loci.
Table 3. Posterior distribution summaries for causal minor allele frequencies, estimated from 100,000 MCMC samples
Band | Gene | Marker SNP | Marker MAF | Median | Mode | 95% CI | Pr(<5%) | Pr(<1%) |
---|---|---|---|---|---|---|---|---|
(Prior) | | | | 0.063 | 0.00 | 10^−5–0.46 | 0.46 | 0.27 |
1p11 | | rs11249433 | 0.39 | 0.23 | 0.06 | 0.021–0.48 | 0.093 | 0.0060 |
2q35 | | rs13387042 | 0.49 | 0.28 | 0.25 | 0.069–0.49 | 0.0071 | <10^−5 |
3p22 | SLC4A7 | rs4973768 | 0.46 | 0.30 | 0.31 | 0.097–0.49 | 0.0017 | <10^−5 |
5p12 | | rs10941679 | 0.25 | 0.26 | 0.07 | 0.023–0.49 | 0.067 | 0.0034 |
5q11 | MAP3K1 | rs889312 | 0.38 | 0.25 | 0.21 | 0.045–0.49 | 0.032 | <10^−5 |
6q25 | | rs2046210 | 0.34 | 0.16 | 0.02 | 0.0055–0.48 | 0.22 | 0.051 |
10q25 | FGFR2 | rs2981582 | 0.38 | 0.40 | 0.48 | 0.21–0.49 | <10^−5 | <10^−5 |
11p15 | LSP1 | rs3817198 | 0.30 | 0.16 | 0.03 | 0.0074–0.48 | 0.23 | 0.038 |
16q12 | TOX3 | rs3803662 | 0.25 | 0.29 | 0.38 | 0.084–0.49 | 0.0018 | <10^−5 |
17q22 | COX11 | rs6504950 | 0.27 | 0.051 | 0.02 | 0.013–0.46 | 0.49 | 0.0033 |
NOTE: Pr[<5% (1%)], probability that causal allele frequency is less than 5% (1%).
Abbreviation: CI, credible interval.
Table 4. Posterior distribution summaries for causal relative risks, estimated from 100,000 MCMC samples
Band | Gene | Marker SNP | Marker RR | Median | Mode | 95% CI |
---|---|---|---|---|---|---|
(Prior) | | | | 1.00 | 1.00 | 0.14–7.24 |
1p11 | | rs11249433 | 1.16 | 1.13 | 1.09 | 1.06–1.88 |
2q35 | | rs13387042 | 1.12 | 1.19 | 1.14 | 1.11–1.67 |
3p22 | SLC4A7 | rs4973768 | 1.11 | 1.16 | 1.12 | 1.10–1.46 |
5p12 | | rs10941679 | 1.19 | 1.14 | 1.12 | 1.08–1.75 |
5q11 | MAP3K1 | rs889312 | 1.12 | 1.16 | 1.13 | 1.10–1.71 |
6q25 | | rs2046210 | 1.15 | 1.17 | 1.11 | 1.07–3.12 |
10q25 | FGFR2 | rs2981582 | 1.26 | 1.30 | 1.29 | 1.25–1.43 |
11p15 | LSP1 | rs3817198 | 1.07 | 1.13 | 1.08 | 1.06–2.48 |
16q12 | TOX3 | rs3803662 | 1.19 | 1.24 | 1.22 | 1.18–1.50 |
17q22 | COX11 | rs6504950 | 0.95 | 0.73 | 0.93 | 0.095–0.95 |
The posterior estimates of the causal allele frequency have a wide range, roughly 0.1 to 0.4, whereas the estimates of the causal relative risks are all roughly between 1.2 and 1.3. The evidence points strongly to the causal variants being common. For only three loci, 11p15 (LSP1, rs3817198), 6q25 (rs2046210), and 17q22 (COX11, rs6504950), is there a reasonable probability that the causal allele has frequency less than 5%, and for each of those the probability that it is less than 1% is considerably lower. Neither rs2046210 nor rs6504950 was directly typed in the familial cases (6); instead a SNP in strong LD was used, and this slight conflict of information may have kept the posteriors closer to the prior than otherwise. The generally high probability of common causal variants is in spite of the wide credible intervals on the allele frequencies, which arise from the conflict between the prior, which strongly favors rare variants, and the data, which are more consistent with common variants. This result is consistent with recent work suggesting that causal variants have similar properties to tag SNPs (11–13, 22), and is in line with the CDCV hypothesis.
Sensitivity analysis
Although our prior distributions reflect generally held beliefs about causal variants, there are two important questions that can be asked of our approach. First, noting that our posterior distributions consistently indicated common causal variants, would our procedure identify a rare causal variant if one were present? Second, as we could not solve for the causal effects exactly, how much information was gained by combining the estimates from three study designs?
To address the first question, we considered a causal allele with frequency 0.1% and relative risk 3.35, and logOR of 9 with a marker allele of frequency 5%. This represents a rare variant with effect size at 1 SD of its corresponding distribution, and allele frequency fairly close to that of a common marker SNP. From equations (1.2) to (1.9) the predicted marker relative risk is 1.047 in unselected cases, 1.204 in bilateral cases and 1.222 in cases with a mixture of family histories. The r2 between marker and causal alleles is only 0.019, but the excess risk is greater than 2 in bilateral and familial cases, in line with Fig. 1.
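The figures quoted for unselected and bilateral cases in this scenario can be approximately reproduced with a few lines of code. The sketch assumes the allelic model, takes the logOR as the log odds ratio of the 2 × 2 table of marker and causal alleles, and squares the causal relative risk for bilateral cases. The familial mixture is omitted, and all function and variable names are hypothetical.

```python
import math

def joint_freq(p_d, p_m, theta):
    """Frequency of haplotypes carrying both minor alleles (requires theta != 0)."""
    k = math.exp(theta) - 1.0
    b = 1.0 + (p_d + p_m) * k
    return (b - math.sqrt(b * b - 4.0 * (k + 1.0) * k * p_d * p_m)) / (2.0 * k)

# Scenario from the text: rare causal allele tagged by a 5% marker allele.
p_d, p_m, gamma, theta = 0.001, 0.05, 3.35, 9.0

x = joint_freq(p_d, p_m, theta)
f11 = x / p_m                  # Pr(D = 1 | M = 1)
f10 = (p_d - x) / (1.0 - p_m)  # Pr(D = 1 | M = 0)

def marker_rr(g):
    """Marker relative risk when the causal allele multiplies risk by g."""
    return (1.0 + f11 * (g - 1.0)) / (1.0 + f10 * (g - 1.0))

rr_unselected = marker_rr(gamma)
rr_bilateral = marker_rr(gamma ** 2)
r2 = (x - p_d * p_m) ** 2 / (p_d * (1.0 - p_d) * p_m * (1.0 - p_m))
excess = math.log(rr_bilateral) / math.log(rr_unselected)

print(rr_unselected, rr_bilateral, r2, excess)
```

Under these assumptions the marker relative risks and r² come out close to the 1.047, 1.204, and 0.019 quoted above, and the excess risk in bilateral cases is well above 2.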
It is apparent that this configuration differs from the marker effects we observed for breast cancer, yet it represents a rare causal variant that is consistent with our prior beliefs and is strongly associated with the marker. To determine whether we could infer such a causal variant with our approach, we used SE of 0.05, 0.06, and 0.03 for the three marker effect estimates, respectively, similar to the higher values appearing in Table 1. We then sampled the three marker effect estimates from the corresponding normal distributions and inferred posterior distributions for the causal parameters. This procedure was repeated 1,000 times.
On average, the posterior median for the causal allele frequency was 0.10 and the mode was 0.0002. The mean probability was 47% that the causal allele frequency was less than 5%, and 36% that it was less than 1%, the corresponding prior probabilities being 46% and 27%. The average posterior median of the causal relative risk was 1.41 and the mode 1.19. Thus the presence of a rare variant could be inferred reliably, although point estimation of its allele frequency and relative risk seems to be heavily biased and the entire posterior distribution ought to be considered when drawing inferences.
To address the second question of whether the familial samples add information to the inference, we repeated the estimation of causal effects using only the marker relative risks for unselected cases. We adjusted their SE to reflect the total information provided from the 3 studies, using an inverse-variance formula.
In this way, we can assess the information contributed specifically by the study designs as distinct from their additional sample size.
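The formula itself is not reproduced in this text. A minimal sketch of the usual inverse-variance pooling, which we assume is what is meant, is given below: the pooled SE is the reciprocal square root of the summed inverse variances. The function name is illustrative, and the SE values are those of the simulation above, used only as example inputs.

```python
import math


def pooled_se(standard_errors):
    """SE carrying the total information of several independent estimates.

    Assumed form: 1 / sqrt(sum of 1 / SE_i^2).
    """
    total_information = sum(1.0 / (se * se) for se in standard_errors)
    return 1.0 / math.sqrt(total_information)


# Example inputs only: the SEs used in the rare-variant simulation.
combined = pooled_se([0.05, 0.06, 0.03])
print(round(combined, 4))
```

The pooled SE is always smaller than the smallest input SE, so assigning it to the unselected-case estimate gives a single study design the same nominal information as the 3 designs combined.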
The mean width of the 95% credible interval for the causal allele frequency was 44%, compared with 43% when using 3 study designs. For the causal relative risk, the mean width was 2.04 compared with 0.78. The posterior median for the causal allele frequency was on average 0.6 times that when using 3 study designs, whereas the posterior median for the relative risk was 1.2 times higher. Although these results seem to give more support for a rare causal variant than the 3 study model, this is due to the reduced information in the data (reflected in the wider credible intervals) to move the estimates away from their prior distributions. Moreover, the degree of support is still weak. We see that the use of 3 study designs allows stronger conclusions to be reached than a single study of equivalent sample size, as a result of the increased number of parameters identifiable by including familial cases.
Discussion
One should be cautious about taking our estimates of causal effects too literally. They are dependent on prior distributions, and have wide credible intervals. Although we have shown that our procedure could infer a rare variant if one were present, point estimates of its allele frequency and relative risk are heavily biased. Several groups are currently engaged in fine mapping and resequencing efforts in the regions studied, which will lead to more direct estimates of causal effect sizes. Thus the quantitative estimates presented here will eventually be redundant, although it will be interesting to compare our estimates with the actual causal effects when known.
Instead, we emphasize the qualitative nature of our results, which indicate that most, if not all, associations with breast cancer so far identified by GWAS are likely to be markers for common causal variants with modest effects. This is consistent with the CDCV hypothesis that originally motivated GWAS, but not with recent suggestions that many GWAS hits could be markers for rare causal variants (10). In this respect, our results agree with other recent work in support of the CDCV hypothesis. Anderson and colleagues (11) argued that GWAS has low power to find a rare variant that had not already been detected by linkage, and noted examples of resequencing projects that had not identified rare variants underlying a common GWAS hit. These include currently unpublished work by the Wellcome Trust Case–Control Consortium in which sequencing of 16 regions identified by GWAS did not identify any underlying rare causal variant. Wray and colleagues (12) showed that the distribution of risk allele frequencies from currently known GWAS hits is consistent with the majority of these hits arising from common variants. Iles (22) showed that early findings of GWAS have been at loci for which the power is highest, which are indeed the common variants. Because we confined attention to SNPs that were identified in the first wave of breast cancer GWAS and have subsequently been replicated, we should expect these loci to be enriched for common variants, and in this respect our results are unsurprising. But in contrast to these other studies, we are able to estimate causal effects for specific loci rather than average properties of all causal variants. Our results indicate that these loci are consistent with the general pattern of common causal variation suggested by other work, and our methods can be applied to further markers that emerge from GWAS.
Our approach cannot distinguish between the effect of a single common variant and the average effect of a number of variants with a common total frequency. Although such a scenario is theoretically possible for a complex disease (9), Wray and colleagues have argued against this scenario for the loci found to date (12). We cannot lend support to either position here other than to note the fact that all 10 SNPs indicated a common causal variant, suggesting that if rare variants do underlie these associations then they do so either in large numbers or not at all.
Several authors (13, 20, 22) have used simulations to estimate the empirical conditional distribution of causal allele frequencies and relative risks, given that a marker was identified by GWAS and subsequently replicated. Our approach to modeling the LD between markers and causal variants is much simpler, but we found this model had little effect on the parameters of interest. We do not explicitly model the process of marker discovery by GWAS, and in that respect our prior is more favorable to rare variants, thus strengthening our conclusion that the causal variants are common.
The use of familial cases in association studies is motivated by the excess relative risk in the ascertained sample compared with a sample of unselected cases. We have shown, however, that imperfect correlation between markers and causal variants leads to an excess risk in familial cases that differs from the predicted value. The difference could be in either direction, and indeed when the causal and marker variants have similar frequency, the excess risk is higher at the marker than at the causal variant, so that the study design is even more efficient than predicted (Fig. 1). In our data, however, there was a systematic attenuation of the excess risk in bilateral cases, similar to observations for familial cases in the study of Turnbull and colleagues (6), which is most consistent with causal variants of higher frequency than the markers. The efficiency of bilateral sampling, while still greater than that of unselected sampling, seems to be less than predicted, and this may have implications for the design of future studies of common genetic risk factors.
Some other mechanisms can also lead to attenuation of the excess risk. We assumed a multiplicative model in which each copy of the risk allele multiplies the disease risk to the same degree, but the true model could be recessive, dominant, or more general. We can rewrite equations (1.2) to (1.9) in terms of recessive or dominant effects: it turns out that under a recessive model the excess risk attenuates at higher causal frequencies than under the multiplicative model, whereas for a dominant model it attenuates at lower frequencies (results not shown). Dominant causal variants could therefore be more consistent with rare variation than the multiplicative model considered, but the relevant probabilities remained low when we assumed this model in our analyses, and for brevity we have omitted these results.
We have also assumed that effects act on the log-risk scale, which is convenient because additional polygenic and environmental effects cancel out of relative risk calculations, so we need not assume a model for them. If, however, the effects act on, say, the logistic or probit scale, then the excess relative risk would be attenuated even at the causal variant. We considered this possibility by allowing for a normally distributed polygenic random effect with mean zero and variance 2 log 2, consistent with a sibling relative recurrence of 2 (23). Acting on the logistic scale, this could reduce the excess relative risk for the causal variant from 2 to 1.8 in bilateral cases, but this is less than the degree of attenuation we observed in our data. Subgroup effects, such as age or tumor subtype-specific risks, could also attenuate the marginal excess risk, but we did not observe any such effects in our data.
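The link between the polygenic variance and the sibling recurrence can be illustrated with a short calculation. This sketch assumes the correspondence is evaluated on the log-risk scale, with the polygenic effects of two siblings jointly normal and correlated at 1/2; the recurrence ratio is then the exponential of their covariance. The function name and the correlation argument are illustrative, not taken from reference (23).

```python
import math


def sibling_recurrence_ratio(polygenic_variance, sib_correlation=0.5):
    """Sibling relative recurrence under a log-risk polygenic model.

    For jointly normal effects X1, X2 with common variance v and
    correlation r, E[exp(X1 + X2)] / (E[exp(X1)] * E[exp(X2)]) = exp(r * v).
    """
    return math.exp(sib_correlation * polygenic_variance)


# Variance 2 log 2, as stated in the text.
lambda_sib = sibling_recurrence_ratio(2.0 * math.log(2.0))
print(lambda_sib)
```

On the logistic scale the same variance no longer cancels out of relative risk calculations, which is why the text reports an attenuated excess risk there; that calculation is not sketched here.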
We have shown that genetic markers of breast cancer have lower excess risk in familial cases than had been predicted, leading to reduced improvements of efficiency in these study designs. However, this information can be usefully exploited to estimate the relative risk and allele frequency of the underlying causal variants. Despite using a prior distribution that favors rare variation, we showed that data from bilateral and familial cases strongly imply that the causal variants underlying recent GWAS findings are common with modest effects, in line with other recent work favoring the CDCV hypothesis. We look forward to the outcome of current fine mapping projects to confirm the accuracy of these predictions.
Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.
Grant Support
This work was supported by the Medical Research Council (G1000718 to F. Dudbridge), Cancer Research UK (C150/A5660 and C1178/A3947 to J. Peto and I. dos Santos Silva), and Breakthrough Breast Cancer (O. Fletcher, N. Johnson, and N. Orr). We acknowledge NHS funding to the NIHR Royal Marsden Biomedical Research Centre and the National Cancer Research Network (NCRN).
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.