## Abstract

Confounding by ethnicity (i.e. population stratification) can result in bias and incorrect inferences in genotype-disease association studies, but the effect of population stratification in gene-gene or gene-environment interaction studies has not been addressed. We used logistic regression models to fit multiplicative interactions between two dichotomous variables that represented genetic and/or environmental factors for a binary disease outcome in a hypothetical cohort of multiple ethnicities. Biases in main effects and interactions due to population stratification were evaluated by comparing regression coefficients in mis-specified models that ignored ethnicities with their counterparts in models that accounted for ethnicities. We showed that biases in main effects and interactions were constrained by the differences in disease risks across the ethnicities. Therefore, large biases due to population stratification are not possible when baseline disease risk differences among ethnicities are small or moderate. Numerical examples of biases in genotype-genotype and/or genotype-environment interactions suggested that biases due to population stratification for main effects were generally small but could become large for studies of interactions, particularly when strong linkage disequilibrium between genes or large correlations between genetic and environmental factors existed. However, when linkage disequilibrium among genes or correlations among genes and environments were small, biases to main effects or interaction odds ratios were small to nonexistent. (Cancer Epidemiol Biomarkers Prev 2006;15(1):124–32)

## Introduction

When the risk of disease varies between ethnic groups and the statistical distribution of an exposure variable (whether genetic or environmental) also varies between these groups, the association between the exposure and disease may be confounded by the ethnic background of the studied groups. This form of confounding by ethnicity is called population stratification in epidemiologic studies of genetic risk factors (1).

The effect of population stratification in epidemiologic studies of disease association with a single candidate gene has been extensively studied (2-9). Although various approaches using unlinked markers have been proposed to deal with the effects of population stratification on association tests of candidate genes (10-15), epidemiologic studies are increasingly concerned with interactions between genetic and/or environmental factors. None of the prior literature explicitly addresses the pattern or extent of bias due to population stratification in association studies involving interactions between genes and environmental factors.

To evaluate bias due to population stratification, we employed two dichotomous variables to represent the genetic and/or environmental factors and used logistic regression models to fit multiplicative interactions between the two variables for a binary disease outcome in a hypothetical cohort of multiple ethnicities. We derived algebraic solutions for asymptotic biases in the maximum likelihood estimates of the interaction variables under corresponding models that ignored ethnicities and identified conditions for the biases to reach their maximum or minimum expected values. We also provided numerical examples of biases due to population stratification under a wide range of conditions that may be observed in epidemiologic studies.

## Materials and Methods

### Data Structure and Model Specification

Assume that the interactions between two categorical variables *V*_{1} and *V*_{2} are studied for a binary disease outcome *Y* during a certain time in a closed cohort of subjects. Either *V*_{1} or *V*_{2} or both can be genetic or environmental variables (i.e., either gene-gene or gene-environment interactions can be modeled using the following approach). *V*_{1} = 0 denotes the reference category of *V*_{1}, and *V*_{1} = 1 denotes the comparison category. *V*_{2} is coded in the same fashion as *V*_{1}. *Y* = 1 indicates the presence of disease, and *Y* = 0 indicates the absence of disease. Assume the cohort comprises *k* ethnicities represented by indicator variables *E*_{1}, *E*_{2},…, *E*_{k}, so that a subject in the *i*th ethnicity is coded by *E*_{i} = 1 and *E*_{j} = 0, for *i*, *j* (*i* ≠ *j*) = 1, 2,…, *k*. Assume associations between disease and *V*_{1} and *V*_{2} are modeled with logistic regression as

$$\operatorname{logit}(\pi) = \alpha_1 + \alpha_2 E_2 + \cdots + \alpha_k E_k + \beta_1 V_1 + \beta_2 V_2 + \beta_3 V_1 V_2 \qquad \text{(A)}$$

where *π* is the conditional probability of the disease (*Y* = 1) given *V*_{1}, *V*_{2}, and ethnicity, and binomial random error is assumed. Without loss of generality, let *α*_{1} specify the log odds of disease (i.e., logit function of the baseline disease risk) in the lowest-risk ethnicity *E*_{1}, and let 0 < *α*_{2} < … < *α*_{k}, where *α*_{2} specifies the log odds ratio (OR) of disease risks comparing ethnicity *E*_{2} versus *E*_{1}. Similarly, *α*_{k} specifies the log OR of disease risks comparing ethnicity *E*_{k} versus *E*_{1}; *β*_{1} and *β*_{2} specify the main effects associated with the comparison categories of *V*_{1} and *V*_{2} relative to their reference categories, respectively; and *β*_{3} specifies the multiplicative interaction between *V*_{1} and *V*_{2}.

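Population stratification may be present when the joint distributions of *V*_{1} and *V*_{2} differ across the ethnicities. Biases due to population stratification can be evaluated by omitting all ethnicity indicator variables from the model in Eq. A to fit the mis-specified model

$$\operatorname{logit}\left[P(Y = 1 \mid V_1, V_2)\right] = \alpha^{*} + \beta_1^{*} V_1 + \beta_2^{*} V_2 + \beta_3^{*} V_1 V_2 \qquad \text{(B)}$$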

We define the asymptotic biases due to population stratification to be *β*_{1}^{*} − *β*_{1}, *β*_{2}^{*} − *β*_{2} for the main effects of *V*_{1} and *V*_{2}, respectively, and define *β*_{3}^{*} − *β*_{3} as the asymptotic bias for the interaction effects between *V*_{1} and *V*_{2}. Here, we do not deal with issues of variance or precision of estimates, assuming model coefficient estimates obtained from sufficiently large samples satisfy *E*(*β̂*_{i}^{*}) = *β*_{i}^{*}, where *i* = 1, 2, or 3.
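The bias definitions above can be illustrated with a minimal Python sketch. It is not the original Stata analysis. It assumes a cohort design and uses the fact that the model omitting ethnicity has one parameter per (*V*_{1}, *V*_{2}) cell, so its large-sample coefficients can be read off the pooled cell risks, as in Appendix 1. The function name `pooled_coefficients` and its arguments are hypothetical, and every pooled cell is assumed to be nonempty.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def pooled_coefficients(alpha, beta, cell_fracs, eth_props):
    """Large-sample coefficients of the model that omits ethnicity.

    alpha      : [alpha_1, alpha_2, ..., alpha_k]; alpha_1 is the baseline log
                 odds in the reference ethnicity, the rest are log ORs vs. it
    beta       : (beta_1, beta_2, beta_3) in the correctly specified model
    cell_fracs : one dict per ethnicity mapping (v1, v2) to the fraction of
                 that ethnicity falling in the cell
    eth_props  : proportion of the cohort belonging to each ethnicity
    """
    b1, b2, b3 = beta
    pooled = {}
    for v1 in (0, 1):
        for v2 in (0, 1):
            diseased = 0.0  # expected diseased fraction of the cohort in this cell
            total = 0.0     # expected fraction of the cohort in this cell
            for j, (w, f) in enumerate(zip(eth_props, cell_fracs)):
                eta = alpha[0] + (alpha[j] if j > 0 else 0.0)
                eta += b1 * v1 + b2 * v2 + b3 * v1 * v2
                diseased += w * f[(v1, v2)] * expit(eta)
                total += w * f[(v1, v2)]
            pooled[(v1, v2)] = logit(diseased / total)
    a_star = pooled[(0, 0)]
    b1_star = pooled[(1, 0)] - a_star
    b2_star = pooled[(0, 1)] - a_star
    b3_star = pooled[(1, 1)] - pooled[(1, 0)] - pooled[(0, 1)] + a_star
    return a_star, b1_star, b2_star, b3_star
```

Subtracting the true *β* values from the returned starred coefficients gives the asymptotic biases. When the cell fractions are identical across ethnicities and all *β* are 0, the sketch returns zero bias.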

### Two Ethnicities

We obtained numerical estimates of large sample biases due to population stratification by fitting logistic regression models to simulated data generated under a wide range of conditions using Stata 7.0 (College Station, TX). Baseline disease risk in the low-risk ethnicity was specified to be 1%, whereas baseline risk in the high-risk ethnicity was specified to be 2% or 10%. To specifically assess genotype-genotype interactions, subsequent text assumes *V*_{1} = *G*_{1} and *V*_{2} = *G*_{2} to denote the study of interactions involving two candidate genes *G*_{1} and *G*_{2}. For simplicity, we assumed that candidate gene *G*_{i} (*i* = 1, 2) had two alleles *A*_{i} and *a*_{i} with frequency *p*_{i} and 1 − *p*_{i}, respectively. Genotype *a*_{i}*a*_{i} was coded as the reference category (i.e., *G*_{i} = 0), whereas genotypes *A*_{i}*A*_{i} and *A*_{i}*a*_{i} were coded as the comparison (i.e., “at-risk” genotype) category (i.e., *G*_{i} = 1). The at-risk genotype frequencies of *G*_{i} were specified in the range of 5% to 95%, assuming single-locus Hardy-Weinberg equilibrium, corresponding to *p*_{i} ranging from 3% to 78%.
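The correspondence between at-risk genotype frequency and allele frequency follows from Hardy-Weinberg equilibrium: the reference genotype has frequency (1 − *p*)², so the at-risk frequency is 1 − (1 − *p*)². A minimal Python sketch of the conversion is given here; the function names are illustrative only.

```python
import math

def allele_freq_from_at_risk(at_risk_freq):
    """Allele frequency p such that P(AA or Aa) equals at_risk_freq under HWE."""
    return 1.0 - math.sqrt(1.0 - at_risk_freq)

def at_risk_from_allele_freq(p):
    """At-risk genotype frequency P(AA or Aa) for allele frequency p under HWE."""
    return 1.0 - (1.0 - p) ** 2

for freq in (0.05, 0.95):
    print(freq, round(allele_freq_from_at_risk(freq), 3))
```

For at-risk frequencies of 5% and 95%, the allele frequencies come out near 0.025 and 0.776, matching the rounded 3% to 78% range quoted in the text.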

In some cases, *G*_{1} and *G*_{2} may not be independently distributed due to linkage disequilibrium between the two loci, denoted here as *D*. Because *D* between *G*_{1} and *G*_{2} is constrained by *D*_{max} and *D*_{min}, where *D*_{max} = min[*p*_{1} × (1 − *p*_{2}), (1 − *p*_{1}) × *p*_{2}] and *D*_{min} = max[−*p*_{1} × *p*_{2}, −(1 − *p*_{1}) × (1 − *p*_{2})] (16), we specified the degree of linkage disequilibrium between the two genes by *D* = *D*′ × *D*_{max} or *D*′ × *D*_{min}, where *D*′ took on the values of 0 (no linkage disequilibrium), 0.2 (small linkage disequilibrium), 0.5 (moderate linkage disequilibrium), and 0.9 (strong linkage disequilibrium). Main effects and interactions of genes were specified by assigning *β*_{j} = 0 (no effect, OR = 1), *β*_{j} = ±0.4 (small effect, OR = 1.49 or 0.67), *β*_{j} = ±0.8 (moderate effect, OR = 2.23 or 0.45), or *β*_{j} = ±1.6 (large effect, OR = 4.95 or 0.20). Using the expected frequency of each genotype-disease category under the specified conditions, a hypothetical cohort of 100,000 observations was generated. Disease status for each observation was determined by comparing its disease risk with a standard uniform random variable. A total of 5,000 case-control samples were randomly drawn from the hypothetical cohort. Each sample consisted of 95% of diseased individuals as cases and an equal number of nondiseased individuals as controls. Both correctly specified models and their corresponding mis-specified models were fitted to each sample to obtain point estimates of all relevant regression coefficients. The corresponding point estimates from the 5,000 samples were averaged to obtain large sample point estimates for biases due to population stratification on both main effects and interactions. Results are presented in Figs. 1, 2, and 3.
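The linkage disequilibrium bounds and the resulting joint distribution of the dichotomized genotypes can be sketched in Python as follows. The bounds follow the formulas quoted above. The joint genotype fractions additionally assume that the two haplotypes of each individual are paired at random, which is an assumption of this sketch and is not stated in the text; the function names are hypothetical.

```python
def ld_bounds(p1, p2):
    """Return (D_min, D_max) for allele frequencies p1 and p2."""
    d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    d_min = max(-p1 * p2, -(1 - p1) * (1 - p2))
    return d_min, d_max

def joint_genotype_fracs(p1, p2, d):
    """Joint fractions of (G1, G2), with G = 1 for carriers of allele A.

    Assumes random pairing of two-locus haplotypes (sketch assumption).
    """
    h_aa = (1 - p1) * (1 - p2) + d   # frequency of the a1-a2 haplotype
    f00 = h_aa ** 2                  # both haplotypes are a1-a2
    f10 = (1 - p2) ** 2 - f00        # no A2 allele, at least one A1
    f01 = (1 - p1) ** 2 - f00        # no A1 allele, at least one A2
    f11 = 1.0 - f00 - f10 - f01      # at least one A1 and at least one A2
    return {(0, 0): f00, (1, 0): f10, (0, 1): f01, (1, 1): f11}

d_min, d_max = ld_bounds(0.3, 0.4)
strong_positive = joint_genotype_fracs(0.3, 0.4, 0.9 * d_max)
strong_negative = joint_genotype_fracs(0.3, 0.4, 0.9 * d_min)
```

With *D* = 0 the joint fractions reduce to the product of the marginal genotype frequencies, so the two genes are independent.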

### More than Two Ethnicities

For more than two ethnicities (*k* > 2), we assumed baseline risks of *k* ethnicities were specified to fall within certain ranges. Specifically, we assumed that baseline disease risk of the *i*th ethnicity *π*_{i} ∼ Uniform [0.01, 0.02] or *π*_{i} ∼ Uniform [0.01, 0.1] for *i* = 1,…, *k*. Similarly, we assumed that genotype frequencies were uniformly distributed and were consistent with single-locus Hardy-Weinberg equilibrium. We considered the at-risk genotype frequency within ranges of 5% to 10%, 5% to 20%,…, up to 5% to 95%. Linkage disequilibrium ranged from 90% of the minimum possible value (i.e., *D*_{min}) to 90% of the maximum possible value (i.e., *D*_{max}). Under these assumptions, we generated 5,000 sets of variables for a hypothetical cohort. Each set of disease risks and genotype frequencies of the *k* ethnicities was randomly assigned from the distributions specified above, assuming *k* = 2, 5, or 10, respectively, and *β*_{1} = *β*_{2} = *β*_{3} = 0 (i.e., OR = 1) or *β*_{1} = *β*_{2} = *β*_{3} = 0.693 (i.e., OR = 2). Next, case-control samples were randomly drawn from the hypothetical cohort under each set of variables to obtain bias estimates as described in the previous section. The range and average of the 5,000 sets of bias estimates are presented in Fig. 4.
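One random parameter set of the kind described above might be drawn as in the following Python sketch. It is illustrative only: the text does not state how linkage disequilibrium was sampled between 90% of *D*_{min} and 90% of *D*_{max}, so a uniform draw is assumed here, and the function name, dictionary keys, and `seed` argument are hypothetical.

```python
import math
import random

def draw_parameter_set(k, risk_range=(0.01, 0.10), freq_range=(0.05, 0.95), seed=None):
    """Draw baseline risk, allele frequencies, and D for each of k ethnicities."""
    rng = random.Random(seed)
    ethnicities = []
    for _ in range(k):
        risk = rng.uniform(*risk_range)
        # At-risk genotype frequencies are drawn, then converted to allele
        # frequencies under single-locus Hardy-Weinberg equilibrium.
        p1, p2 = (1.0 - math.sqrt(1.0 - rng.uniform(*freq_range)) for _ in range(2))
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
        d_min = max(-p1 * p2, -(1 - p1) * (1 - p2))
        # Assumption of this sketch: D is uniform on [0.9 * D_min, 0.9 * D_max].
        d = rng.uniform(0.9 * d_min, 0.9 * d_max)
        ethnicities.append({"baseline_risk": risk, "p1": p1, "p2": p2, "D": d})
    return ethnicities

params = draw_parameter_set(k=5, seed=1)
```

Repeating the draw 5,000 times with different seeds would give the collection of parameter sets over which the bias estimates are summarized.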

## Results

### Determination of Maximal Possible Bias

We have determined the maximal biases that could result from population stratification (see Appendix 1 for derivations). In the simplest case when the cohort is comprised of two ethnicities (i.e., *k* = 2: one “higher risk” ethnicity and one “lower risk” ethnicity), the maximal bias that can be attained under conditions of population stratification is determined by the difference in log odds of disease risks in the two populations (i.e., the log of OR of disease risks in the two ethnicities) being compared. Specifically, the asymptotic biases to the main effect estimates for *V*_{1} or *V*_{2} are bounded by −*α*_{2} and *α*_{2}, where *α*_{2} is the difference in log odds of the baseline risk of disease in the higher risk population compared with the lower risk population (see Eqs. A and B and Appendix 1). For the main effect of *V*_{1}, the upper bound is reached only under extreme conditions, namely when the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the low-risk ethnicity, and the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the high-risk ethnicity.

Asymptotic biases to estimates of interactions between *V*_{1} and *V*_{2} are bounded by −2*α*_{2} and 2*α*_{2}. That is, the maximal bias due to population stratification that can be attained for an interaction between two factors is bounded by twice the log OR of the disease risks in the two populations being compared. However, the upper bound can be reached only when all of the following four conditions hold: (*a*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the high-risk ethnicity; (*b*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 1 is never observed in the high-risk ethnicity; (*c*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 1 is never observed in the low-risk ethnicity; (*d*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the low-risk ethnicity. Similarly, the maximal bias in the other direction, −2*α*_{2}, is reached only when the reverse of the four conditions described above holds (see Appendix 1).
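The four boundary conditions can be checked numerically. The Python sketch below is illustrative and not part of the original analysis. It assumes two ethnicities, no true main effects or interaction, and baseline risks of 1% and 10%; the function `interaction_bias` and the cell dictionaries are hypothetical names.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def interaction_bias(alpha1, alpha2, low, high, eps):
    """Asymptotic bias in the interaction term when all true betas are 0.

    low, high : cell fractions {(v1, v2): fraction} within each ethnicity
    eps       : proportion of the cohort in the high-risk ethnicity
    """
    def pooled_logit(cell):
        w_low = (1 - eps) * low[cell]
        w_high = eps * high[cell]
        risk = (w_low * expit(alpha1) + w_high * expit(alpha1 + alpha2)) / (w_low + w_high)
        return logit(risk)
    return (pooled_logit((1, 1)) - pooled_logit((1, 0))
            - pooled_logit((0, 1)) + pooled_logit((0, 0)))

alpha1 = logit(0.01)
alpha2 = logit(0.10) - logit(0.01)

# Conditions (a)-(d): discordant cells occur only in the low-risk ethnicity,
# concordant cells occur only in the high-risk ethnicity.
low = {(0, 0): 0.0, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.0}
high = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.5}

upper = interaction_bias(alpha1, alpha2, low, high, eps=0.5)
lower = interaction_bias(alpha1, alpha2, high, low, eps=0.5)  # reversed conditions
```

Under these extreme cell fractions `upper` equals 2*α*_{2} and `lower` equals −2*α*_{2} up to floating-point error, which is the bound stated in the text.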

Similarly, when there are *k* > 2 ethnicities in the cohort, the most extreme biases that could result from population stratification are ±*α*_{k} and ±2*α*_{k} to main effects and interaction estimates, respectively, where *α*_{k} represents the maximum of log OR among baseline risks of the *k* ethnicities (see Eqs. A and B and Appendix 1). To our knowledge, this is the first demonstration of the bounds on the magnitude of biases to interaction estimates for genotype-genotype or genotype-environment interaction studies. However, the boundary conditions that result in the theoretical extremes are unlikely to represent the conditions observed in most studies. Therefore, we undertook numerical evaluations next to consider situations that are more likely to be encountered in actual studies.

### Two Ethnicities

We have summarized results of large sample biases to main effects and genotype-genotype or genotype-environment interactions using six sets of conditions that may be encountered in association studies (Table 1). Note that bias can only arise when differences in baseline disease risk among ethnicities exist (1), which is a necessary condition for confounding to occur (17). Condition a represented the situation where no population stratification existed for *G*_{1} or *G*_{2}: *G*_{1} and *G*_{2} each had the same marginal distribution in every ethnicity, and linkage disequilibrium between *G*_{1} and *G*_{2} was the same in each ethnicity (i.e., joint distributions of *G*_{1} and *G*_{2} were the same across the ethnicities). Under these conditions, ignoring ethnicity did not result in large sample bias due to population stratification in any of these estimates when *β*_{1} = *β*_{2} = *β*_{3} = 0. When *β*_{1} = *β*_{2} = 0 but *β*_{3} ≠ 0, ignoring ethnicity did not result in large sample bias to *β*_{1} or *β*_{2} but resulted in negligible biases toward the null hypothesis to the interaction term *β*_{3}. When *β*_{1} ≠ 0 or *β*_{2} ≠ 0, a slight bias toward the null hypothesis was observed for the main effects, whereas biases to *β*_{3} were no longer always toward the null. Instead, the bias to *β*_{3} depended on both the main and interaction effects between the two genes. In all cases, the magnitude of these large sample biases was negligible and reflected the nonlinearity of the logistic regression model (18-20) rather than biases due to population stratification.

Condition b (Fig. 1) represented the situation when the marginal genotype distributions of both genes were the same across both ethnicities, such that the frequency of the at-risk genotype of *G*_{1} in the high-risk ethnicity was equal to the frequency of the at-risk genotype of *G*_{1} in the low-risk ethnicity, and the frequency of the at-risk genotype of *G*_{2} in the high-risk ethnicity was equal to the frequency of the at-risk genotype of *G*_{2} in the low-risk ethnicity. However, condition b also specified that the joint genotype distributions of *G*_{1} and *G*_{2} were different due to unequal linkage disequilibrium between the two genes across ethnicities. In these circumstances, ignoring ethnicity could result in large sample biases in both main effects and interaction estimates. Linkage disequilibrium between *G*_{1} and *G*_{2} was *D*′ × *D*_{max} in the high-risk ethnicity and *D*′ × *D*_{min} in the low-risk ethnicity (Fig. 1). Thus, at-risk genotypes of both genes appeared together more frequently in the high-risk ethnicity than in the low-risk ethnicity. For simplicity, it was assumed that the frequencies of the at-risk genotypes of *G*_{1} and *G*_{2} were equal in both ethnicities and that *β*_{1} = *β*_{2} = *β*_{3} = 0. When ethnicity was ignored, large sample biases to interaction and main-effect estimates depended on baseline disease risks, the proportions of the two ethnicities in the cohort, and the frequencies of the at-risk genotypes (Fig. 1). The magnitude of bias increased with greater differences in baseline disease risks and/or linkage disequilibrium between the two genes. Figure 1 suggests that there was no simple pattern of bias corresponding to the variables considered here. However, large biases in interaction estimates tended to occur when large biases in main effects also occurred.
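A situation like condition b can be explored with the Python sketch below. It is a rough illustration, not a reproduction of Fig. 1. It computes asymptotic cohort biases directly from pooled cell risks instead of averaging case-control estimates, assumes random pairing of haplotypes when building joint genotype fractions, and uses hypothetical function names and default values (risks of 1% and 10%, equal ethnicity proportions).

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def joint_fracs(p1, p2, d):
    """Joint fractions of dichotomized genotypes; random haplotype pairing assumed."""
    h_aa = (1 - p1) * (1 - p2) + d
    f00 = h_aa ** 2
    f10 = (1 - p2) ** 2 - f00
    f01 = (1 - p1) ** 2 - f00
    return {(0, 0): f00, (1, 0): f10, (0, 1): f01, (1, 1): 1.0 - f00 - f10 - f01}

def condition_b_bias(at_risk_freq, d_prime, risk_low=0.01, risk_high=0.10, eps=0.5):
    """Biases (b1, b2, b3) when true effects are 0 and only LD differs by ethnicity."""
    p = 1.0 - math.sqrt(1.0 - at_risk_freq)   # same allele frequency for both genes
    d_max = min(p * (1 - p), (1 - p) * p)
    d_min = max(-p * p, -(1 - p) * (1 - p))
    high = joint_fracs(p, p, d_prime * d_max)  # high-risk ethnicity: positive LD
    low = joint_fracs(p, p, d_prime * d_min)   # low-risk ethnicity: negative LD

    def pooled(cell):
        w_low, w_high = (1 - eps) * low[cell], eps * high[cell]
        return logit((w_low * risk_low + w_high * risk_high) / (w_low + w_high))

    b1 = pooled((1, 0)) - pooled((0, 0))
    b2 = pooled((0, 1)) - pooled((0, 0))
    b3 = pooled((1, 1)) - pooled((1, 0)) - pooled((0, 1)) + pooled((0, 0))
    return b1, b2, b3

for d_prime in (0.0, 0.2, 0.5, 0.9):
    print(d_prime, condition_b_bias(at_risk_freq=0.5, d_prime=d_prime))
```

With *D*′ = 0 the joint distributions coincide across ethnicities and all three biases vanish; nonzero biases appear once *D*′ differs from 0 even though the marginal genotype frequencies are identical.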

Under condition c, the marginal genotype distribution of *G*_{1} was constant across ethnicities, and *G*_{1} and *G*_{2} were in linkage equilibrium in both ethnicities (*D* = 0). When *β*_{1} = 0 and *β*_{3} = 0, there were no large sample biases to their estimates. As shown in Fig. 2A and B for “No Linkage Disequilibrium in Either Ethnicity,” biases in *G*_{2} main effect estimates followed the same patterns as biases to a single candidate gene under conditions of population stratification (1, 6). Bias was positive if the at-risk genotype frequency of *G*_{2} was greater in the high-risk ethnicity (Fig. 2A). Bias was negative if the at-risk genotype frequency of *G*_{2} was greater in the low-risk ethnicity (Fig. 2B). Condition d differed from condition c only in that linkage disequilibrium was specified to be different across ethnicities. Under condition d, biases to main effects and interactions depended on the ethnicity-specific genotype frequencies of both genes and baseline disease risks as well as the main effects of the genes and their interactions. For example, we considered the situation when baseline disease risks by ethnicity were 10% versus 1%, main effects and interaction were 0, and linkage disequilibrium was *D*′ × *D*_{max} in the high-risk ethnicity and *D*′ × *D*_{min} in the low-risk ethnicity. Figure 2A presents the results where the frequency of the at-risk genotype of *G*_{2} varied within the high-risk ethnicity, whereas Fig. 2B presents the results where the frequency of the at-risk genotype of *G*_{2} varied within the low-risk ethnicity. As shown in these figures for *D*′ = 0.2 and *D*′ = 0.5, large sample biases occurred to *G*_{1} main effects, *G*_{2} main effects, and interaction effects. These biases became more pronounced with increasing degrees of linkage disequilibrium. However, the biases did not follow a simple monotonic pattern with changing genotype frequencies.

Conditions e and f (Fig. 3) occurred when the marginal genotype distributions of both genes differed across ethnicities. If *G*_{1} and *G*_{2} were in linkage equilibrium (condition e), the patterns for the direction and magnitude of biases to main effects were similar to those reported for condition c, but biases to interaction effect estimates did not follow simple patterns corresponding to the marginal genotype frequencies of either gene. If *G*_{1} and *G*_{2} were in linkage disequilibrium (condition f), biases to main effects no longer followed simple patterns. For example, in Fig. 3A, baseline disease risks by ethnicity were 10% versus 1%, and both main effects and interactions were assumed to be 0. Bias depended on at-risk genotype frequencies of both genes as well as on the degree of linkage disequilibrium between the two genes. Large biases were observed even when genotype frequencies were not very different across ethnicities. Again, no simple relationship of bias was observed with respect to genotype frequency. Therefore, biases due to population stratification can be large in relatively unpredictable ways when marginal genotype frequencies of both genes differ by ethnicity and the two genes are in linkage disequilibrium.

### More than Two Ethnicities

When we expanded our analyses to consider cohorts consisting of *k* = 5 or 10 ethnicities, we observed that (on average) large sample biases to main and interaction effects were either nonexistent or negligible (Fig. 4A-D), even if the conditions for population stratification (Table 1) were met. This result follows because we assumed that baseline disease risks and the joint genotype distributions of both genes were uncorrelated. Biases to both main effects and interaction were greatest when *k* = 2 (i.e., when there were only two ethnicities in the cohort) but decreased with an increasing number of component ethnicities. Biases to both main effects and interaction were smaller when baseline disease risks of the *k* ethnicities were all within the range of 1% to 2% and larger when baseline risks of the *k* ethnicities were all within the range of 1% to 10%. Biases to main effects tended to increase as the differences in genotype frequencies across ethnicities increased. For example, in *G*_{1} main effects and *G*_{2} main effects in Fig. 4A-D, the range of biases for main effects increased as the range of at-risk genotype frequencies of both genes across the *k* ethnicities increased from 5% to 10% up to 5% to 95%. Under the latter more extreme conditions, biases to main effects approached their theoretical bounds.

For example, based on our algebraic derivation of bounds to the biases of the estimates, when baseline risks of *k* ethnicities were within 1% to 2% (Fig. 4A and B), biases to main effects approached their bounds of −0.7 and 0.7 (on the natural log scale). Although interaction estimates were bounded by −1.4 and 1.4, the actual biases observed were far from reaching these bounds. On the other hand, biases to interaction did not show monotonic relationships corresponding to increasing or decreasing ranges of marginal genotype frequency ranges for either gene. Even when the range of marginal genotype frequencies of both genes was small across ethnicities, biases to interaction estimates could still be very large. The patterns were similar when baseline risks of *k* ethnicities were within 1% to 10% (Fig. 4C and D), where biases to main effects were bounded by −2.4 to 2.4 and biases to interaction by −4.8 to 4.8. Similar patterns were observed when OR = 1 (Fig. 4A and C) and when OR = 2 (Fig. 4B and D) for both main effects and interactions.
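The numerical bounds quoted above follow directly from the range of baseline risks. A short Python sketch of the arithmetic is given here; `bias_bounds` is an illustrative name, not a function from the original analysis.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def bias_bounds(min_risk, max_risk):
    """Return (main-effect bound, interaction bound) on the natural log scale."""
    alpha_k = logit(max_risk) - logit(min_risk)  # largest log OR between ethnicities
    return alpha_k, 2.0 * alpha_k

for lo, hi in ((0.01, 0.02), (0.01, 0.10)):
    main, inter = bias_bounds(lo, hi)
    print(f"risks {lo:.2f} to {hi:.2f}: main ±{main:.1f}, interaction ±{inter:.1f}")
```

For risks of 1% to 2% this gives about ±0.7 and ±1.4, and for risks of 1% to 10% about ±2.4 and ±4.8, in agreement with the values in the text.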

## Discussion

We have evaluated the potential for bias due to population stratification in studies of genotype-genotype or genotype-environment interactions by deriving bounds for the maximal biases that may be conferred by population stratification. We also provided numerical examples for the potential biases in main effects and interactions under a variety of conditions commonly met in association studies. Bias can occur in gene-gene interaction studies if linkage disequilibrium differs across ethnicities and the sample consists of the mixed ethnicities, even when the conditions of population stratification for either gene alone are absent. When the two genes were in linkage equilibrium, both main effects and interactions were unbiased if marginal distributions of genes did not vary by ethnicity. When the two genes were in linkage disequilibrium, biases could be substantial for all effect estimates. Genes may seem to be in linkage equilibrium in the entire cohort, whereas within each ethnicity, they are not. Thus, when multiple genes are jointly studied across ethnicities, pooling ethnicities may induce biases that depend on the joint distributions or correlations between the genes across ethnicities, as well as their marginal distributions.

These arguments can be extended to studies of genotype-environment interactions, where correlation between the gene and environment factors exists. However, it is more likely that differences across ethnicities in the joint distributions of genetic and environmental factors result from different marginal distributions rather than from correlations between the genes and environments of interest. Although severe biases were only observed under extreme stratification conditions (e.g., large differences in linkage disequilibrium, in the marginal frequencies of two genes, and in disease risks across ethnicities), population stratification may result in larger biases in genotype-genotype interaction studies than in studies involving only one gene. Our analytic derivation showed that in most settings of interaction where the joint distributions of two factors differ across the levels of a third factor, and where disease risk also varies across those levels, the omission of that third factor as a covariate will produce biases in interaction estimates that can be 2-fold as large as biases to the estimates of main effects. To our knowledge, this is the first time such quantitative evaluation has been addressed. We anticipate that our findings on the effect of population stratification in interaction studies will be useful for studies of both gene-gene interactions and gene-environment interactions (21, 22) in different populations.

Additional studies are required to address other aspects of bias due to population stratification in genotype-genotype interaction studies. For example, we have not considered the effect of deviations from Hardy-Weinberg equilibrium. Such deviations could also confer biases to interaction terms involving two or more genes. Nonetheless, the magnitude of biases would be bounded as shown in Appendix 1. Similarly, we have focused on large sample biases to point estimates from logistic regression in genotype-genotype interaction studies. Another future challenge is to assess the effect of population stratification on variance estimation and hypothesis testing. In the case of studies involving single candidate genes, Heiman et al. (8) addressed both issues by evaluating false-positive rates and comparing with confounding risk ratios due to population stratification. However, these evaluations have not been undertaken for genotype-genotype interactions. Marchini et al. (22) considered power issues due to population stratification, which have not been considered here. Finally, we have not evaluated additional situations of potential interest, such as the case of a main effect of a gene in one population but not in another.

When genotype distributions are the same across ethnicities or independent of ethnicity, ignoring ethnicity may still result in attenuation of the OR due to the nonlinearity of the logit link function in logistic regression (18-20). In the context of a single gene having the same distribution in two or more ethnicities, bias is absent if the gene has no effect even when ethnicities are ignored. Otherwise, attenuation of the estimate toward the null will increase with the magnitude of the gene's effect. The magnitude of the biases would be negligible unless disease risks were also extremely different across ethnicities. We found that when interactions between two genes are considered, the same rules applied to biases in the main effects of the two genes. However, bias may occur for interaction estimates even absent interaction (i.e., *β*_{3} = 0). In addition, the direction of the bias is not always predictable when interaction is present (i.e., *β*_{3} ≠ 0). Nonetheless, the magnitude of these biases to main effects or interactions was generally negligible, unless *β*_{1}, *β*_{2}, and *β*_{3} were all large (>1.6), and the relative disease risk between the two ethnicities was >10-fold.
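The attenuation described here can be seen in a small Python sketch for a single gene whose genotype frequency is identical in two ethnicities. This is a hedged illustration under assumed values (baseline risks of 1% and 10%, equal ethnicity proportions), and `pooled_log_or` is a hypothetical helper.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def pooled_log_or(beta, risk_low=0.01, risk_high=0.10, eps=0.5):
    """Large-sample log OR for one gene when ethnicity is ignored.

    The genotype frequency is assumed equal in both ethnicities, so the ethnic
    mix within each genotype category equals the cohort mix (eps, 1 - eps).
    """
    a_low, a_high = logit(risk_low), logit(risk_high)
    risk_ref = (1 - eps) * expit(a_low) + eps * expit(a_high)
    risk_cmp = (1 - eps) * expit(a_low + beta) + eps * expit(a_high + beta)
    return logit(risk_cmp) - logit(risk_ref)

for beta in (0.0, 0.4, 0.8, 1.6):
    print(beta, round(pooled_log_or(beta), 3))
```

With no true effect the pooled log OR is exactly 0, whereas for a nonzero effect it lies between 0 and the true *β*, that is, it is attenuated toward the null without any confounding by ethnicity.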

We have evaluated biases under relatively extreme conditions of population stratification with respect to disease risk and allele frequencies. The ranges of variables employed here were similar to or more extreme than those in other studies (10, 23). In real situations, biases on genotype-genotype interactions should be smaller when ethnicity strata are more numerous, and the range of disease risk is narrower than considered here. Furthermore, our results are consistent with those of Wacholder et al. (1) and Wang et al. (6) as to the smaller potential for bias in the presence of larger numbers of ethnicities. Similar arguments hold for gene-environment interaction studies. For studies of gene-gene interactions, linkage disequilibrium patterns differ by ethnicity (13). Therefore, studies of genotype-genotype interactions should specifically consider potential linkage disequilibrium, baseline disease risks, and genotype frequency differences by ethnicity. This issue takes on special significance in light of suggestions that population-specific linkage disequilibrium might contribute to nonreplication of association study results, including studies of genotype-genotype interactions (24).

The data presented here support the hypothesis that bias due to population stratification can occur in association studies involving genotype-genotype interactions, particularly if the two genes are in strong linkage disequilibrium. However, our results show that the magnitude of potential bias is constrained by the differences in disease risk among populations. Thus, when these disease risk differences among populations are small, population stratification cannot lead to large biases. Furthermore, our empirical results show that population stratification causes relatively small biases even under extreme conditions and is unlikely to cause large biases to estimates of main effects and interactions under usual study conditions, particularly when the correlation (i.e., linkage disequilibrium) among the interacting factors is small. Therefore, if population stratification is not a major concern, studies of interaction involving unlinked genes (e.g., genes in common metabolic pathways that are located on different chromosomes) might be appropriate for case-control association studies, whereas haplotype-based approaches might be more appropriate for genes in linkage disequilibrium.

## Appendix 1: Algebraic Analyses of Asymptotic Bounds on Biases Due to Population Stratification

*A. Algebraic Analyses of Biases when k = 2 Ethnicities.* The log-likelihood for a cohort sample of *N* independent observations can be written as

$$\ell = \sum_{i=1}^{N} \left[ Y_i \log \pi_i + (1 - Y_i) \log (1 - \pi_i) \right]$$

where *i* = 1, 2, …, *N*. When there are two ethnicities *E*_{1} and *E*_{2} in the cohort, they can be referred to as the low-risk ethnicity and high-risk ethnicity, respectively. Assume associations between disease and *V*_{1} and *V*_{2} are modeled with logistic regression as

$$\operatorname{logit}(\pi) = \alpha_1 + \alpha_2 E_2 + \beta_1 V_1 + \beta_2 V_2 + \beta_3 V_1 V_2 \qquad \text{(A)}$$

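In the presence of population stratification, we assumed the following mis-specified model was fitted:

$$\operatorname{logit}\left[P(Y = 1 \mid V_1, V_2)\right] = \alpha^{*} + \beta_1^{*} V_1 + \beta_2^{*} V_2 + \beta_3^{*} V_1 V_2 \qquad \text{(B)}$$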

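Let *f*_{V1V2E2} represent the expected fraction of joint occurrence of *V*_{1} and *V*_{2} in ethnicity *E*_{1} (*E*_{2} = 0) or *E*_{2} (*E*_{2} = 1), where the subscripts *V*_{1}, *V*_{2}, and *E*_{2} each take values 0 or 1. For example, *f*_{000} represents the fraction of observations having joint reference categories of *V*_{1} and *V*_{2} in the low-risk ethnicity *E*_{1} (i.e., *V*_{1} = 0, *V*_{2} = 0, *E*_{2} = 0); likewise, *f*_{001} represents the fraction of observations having the joint reference categories of *V*_{1} and *V*_{2} in the high-risk ethnicity *E*_{2} (i.e., *V*_{1} = 0, *V*_{2} = 0, *E*_{2} = 1). Correspondingly, the expected values of the disease outcome *Y* are *π*_{000} = *P*(*α*_{1}) and *π*_{001} = *P*(*α*_{1} + *α*_{2}) by Eq. A, where *P*(*x*) is the logistic function defined by *P*(*x*) = exp(*x*) / [1 + exp(*x*)]. Let *ε* and 1 − *ε* represent the proportions of the high-risk and low-risk ethnicity in the cohort, respectively; then the expected fraction of observations having the joint reference categories in the entire cohort is *f*_{00·} = (1 − *ε*) × *f*_{000} + *ε* × *f*_{001}, where the “·” subscript indicates that observations are pooled over the associated index (ethnicity in this case). Let *π̂*_{00·} be the estimated expected value of *Y* for these observations under the mis-specified model (Eq. B); then *π̂*_{00·} = *P*(*α̂*^{*}), etc. Then the expected values of the maximum likelihood estimates of the parameters in Eq. B are found by

$$P(\alpha^{*}) = (1 - \varepsilon)\,\frac{f_{000}}{f_{00\cdot}}\,P(\alpha_1) + \varepsilon\,\frac{f_{001}}{f_{00\cdot}}\,P(\alpha_1 + \alpha_2)$$

$$P(\alpha^{*} + \beta_1^{*}) = (1 - \varepsilon)\,\frac{f_{100}}{f_{10\cdot}}\,P(\alpha_1 + \beta_1) + \varepsilon\,\frac{f_{101}}{f_{10\cdot}}\,P(\alpha_1 + \alpha_2 + \beta_1)$$

$$P(\alpha^{*} + \beta_2^{*}) = (1 - \varepsilon)\,\frac{f_{010}}{f_{01\cdot}}\,P(\alpha_1 + \beta_2) + \varepsilon\,\frac{f_{011}}{f_{01\cdot}}\,P(\alpha_1 + \alpha_2 + \beta_2)$$

$$P(\alpha^{*} + \beta_1^{*} + \beta_2^{*} + \beta_3^{*}) = (1 - \varepsilon)\,\frac{f_{110}}{f_{11\cdot}}\,P(\alpha_1 + \beta_1 + \beta_2 + \beta_3) + \varepsilon\,\frac{f_{111}}{f_{11\cdot}}\,P(\alpha_1 + \alpha_2 + \beta_1 + \beta_2 + \beta_3)$$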

Because (1 − *ε*) × (*f*_{000}/*f*_{00·}) + *ε* × *f*_{001}/*f*_{00·} = 1 and *P*(*x*) is monotonic function, *α*_{1} ≤ *α*^{*} ≤ *α*_{1} + *α*_{2}. *α*^{*} = *α*_{1} only when *f*_{001} = 0 (i.e., when no observations in high-risk ethnicity fall into the joint reference categories of *V*_{1} and *V*_{2}); and *α*^{*} = *α*_{1} + *α*_{2} only when *f*_{000} = 0 (i.e., when no observations in low-risk ethnicity fall into the joint reference categories of *V*_{1} and *V*_{2}).
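The bound on the intercept can be checked numerically. The following is a minimal sketch, assuming the pooled risk in the joint reference cell is the weighted average of the two ethnicity-specific risks as described above; the function name `pooled_intercept` and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math

def expit(x):
    """Logistic function P(x) = exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))

def pooled_intercept(alpha1, alpha2, eps, f000, f001):
    """Expected intercept alpha* of the model that ignores ethnicity.

    eps is the cohort proportion of the high-risk ethnicity; f000 and f001
    are the within-ethnicity fractions falling in the joint reference cell
    (hypothetical helper written for this illustration).
    """
    f00 = (1.0 - eps) * f000 + eps * f001      # pooled cell fraction f_00.
    w_low = (1.0 - eps) * f000 / f00           # the two weights sum to 1
    w_high = eps * f001 / f00
    return logit(w_low * expit(alpha1) + w_high * expit(alpha1 + alpha2))

# Illustrative (made-up) values: 5% baseline risk, ethnicity odds ratio of 2.
alpha1, alpha2 = math.log(0.05 / 0.95), math.log(2.0)
a_star = pooled_intercept(alpha1, alpha2, eps=0.3, f000=0.6, f001=0.2)
print(alpha1 <= a_star <= alpha1 + alpha2)
```

Setting `f001=0.0` in the call reproduces the boundary case *α*^{*} = *α*_{1} up to floating-point rounding.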

Similarly, (1 − *ε*) × (*f*_{100}/*f*_{10·}) + *ε* × (*f*_{101}/*f*_{10·}) = 1; therefore, *α*_{1} + *β*_{1} ≤ *α*^{*} + *β*_{1}^{*} ≤ *α*_{1} + *β*_{1} + *α*_{2}. Because *α*_{1} ≤ *α*^{*} ≤ *α*_{1} + *α*_{2}, it follows that *β*_{1} − *α*_{2} ≤ *β*_{1}^{*} ≤ *β*_{1} + *α*_{2}. Therefore, the biases to main effect estimates of *V*_{1} are bounded by −*α*_{2} ≤ *β*_{1}^{*} − *β*_{1} ≤ *α*_{2}. The upper bound, *β*_{1}^{*} − *β*_{1} = *α*_{2}, is attained only when *f*_{100} = 0 and *f*_{001} = 0 [i.e., when no observations in the low-risk ethnicity fall into the comparison category of *V*_{1} (*V*_{1} = 1) and reference category of *V*_{2} (*V*_{2} = 0) and, in addition, no observations in the high-risk ethnicity fall into the joint reference categories of *V*_{1} and *V*_{2}]. The lower bound, *β*_{1}^{*} − *β*_{1} = −*α*_{2}, is attained only when *f*_{101} = 0 and *f*_{000} = 0 (i.e., when no observations in the high-risk ethnicity fall into the comparison category of *V*_{1} and reference category of *V*_{2} and, in addition, no observations in the low-risk ethnicity fall into the joint reference categories of *V*_{1} and *V*_{2}).

In the same fashion, the biases to main effect estimates of *V*_{2} are bounded by −*α*_{2} ≤ *β*_{2}^{*} − *β*_{2} ≤ *α*_{2}. The upper bound, *β*_{2}^{*} − *β*_{2} = *α*_{2}, is attained only when *f*_{010} = 0 and *f*_{001} = 0; the lower bound, *β*_{2}^{*} − *β*_{2} = −*α*_{2}, only when *f*_{011} = 0 and *f*_{000} = 0.

Next, using the above derivations, *β*_{3} − 2*α*_{2} ≤ *β*_{3}^{*} ≤ *β*_{3} + 2*α*_{2}; thus, the biases to estimates of interaction between *V*_{1} and *V*_{2} are bounded by −2*α*_{2} ≤ *β*_{3}^{*} − *β*_{3} ≤ 2*α*_{2}. However, *β*_{3}^{*} − *β*_{3} = −2*α*_{2} only when *f*_{100} = *f*_{010} = *f*_{001} = *f*_{111} = 0; that is, all observations in the low-risk ethnicity fall into either both reference categories or both comparison categories of *V*_{1} and *V*_{2}, and no observations in the high-risk ethnicity fall into either both reference categories or both comparison categories of *V*_{1} and *V*_{2}. Likewise, *β*_{3}^{*} − *β*_{3} = 2*α*_{2} only when *f*_{110} = *f*_{000} = *f*_{011} = *f*_{101} = 0.
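The factor of two in the interaction bound can be made explicit with one intermediate step. The shift symbols δ below are notation introduced here for illustration only and do not appear in the original derivation; each δ is the amount by which the pooled log-odds in a cell of (*V*_{1}, *V*_{2}) exceeds the log-odds in the low-risk ethnicity.

```latex
% Assumed notation: \delta_{v_1 v_2} is the shift of the pooled log-odds in
% cell (V_1 = v_1, V_2 = v_2) relative to the low-risk ethnicity.
% From the bounds above, 0 \le \delta_{v_1 v_2} \le \alpha_2 for every cell.
\begin{align*}
\alpha^{*} &= \alpha_1 + \delta_{00}, \\
\beta_1^{*} - \beta_1 &= \delta_{10} - \delta_{00}, \qquad
\beta_2^{*} - \beta_2 = \delta_{01} - \delta_{00}, \\
\beta_3^{*} - \beta_3 &= \delta_{11} - \delta_{10} - \delta_{01} + \delta_{00}, \\
-2\alpha_2 \le \beta_3^{*} - \beta_3 &\le 2\alpha_2 .
\end{align*}
```

The upper limit requires δ_{11} = δ_{00} = *α*_{2} together with δ_{10} = δ_{01} = 0, and the lower limit requires the reverse, which is why four cell fractions must vanish simultaneously in each case.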

For the main effect of *V*_{1}, the upper bound on the bias is reached only when the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the low-risk ethnicity, and the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the high-risk ethnicity. Similarly, the lower bound on the bias is reached only when the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the high-risk ethnicity, and the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the low-risk ethnicity.

The maximal bias for the interaction, 2*α*_{2}, can be reached only when all of the following four conditions hold: (*a*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the high-risk ethnicity; (*b*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 1 is never observed in the high-risk ethnicity; (*c*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 1 is never observed in the low-risk ethnicity; (*d*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the low-risk ethnicity. Similarly, the maximal bias in the other direction is −2*α*_{2} only when all of the following four conditions hold: (*a*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 0 is never observed in the low-risk ethnicity; (*b*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 1 is never observed in the low-risk ethnicity; (*c*) the joint occurrence of *V*_{1} = 1 and *V*_{2} = 1 is never observed in the high-risk ethnicity; (*d*) the joint occurrence of *V*_{1} = 0 and *V*_{2} = 0 is never observed in the high-risk ethnicity.
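The four conditions for the extreme interaction bias can be illustrated numerically. This is a sketch under the assumption that the mis-specified model is saturated in *V*_{1} and *V*_{2}, so its expected fitted risk in each cell equals the pooled risk over the two ethnicities; the helper `misspecified_coefs` and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def misspecified_coefs(alpha1, beta, alpha2, eps, f_low, f_high):
    """Expected (alpha*, b1*, b2*, b3*) when ethnicity is ignored (k = 2).

    beta = (b1, b2, b3); f_low and f_high map each (V1, V2) cell to its
    fraction within the low-risk and high-risk ethnicity, respectively;
    eps is the cohort proportion of the high-risk ethnicity.
    """
    b1, b2, b3 = beta
    pooled = {}
    for v1, v2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        eta = alpha1 + b1 * v1 + b2 * v2 + b3 * v1 * v2   # low-risk log-odds
        n_low = (1.0 - eps) * f_low[(v1, v2)]
        n_high = eps * f_high[(v1, v2)]
        risk = (n_low * expit(eta) + n_high * expit(eta + alpha2)) / (n_low + n_high)
        pooled[(v1, v2)] = logit(risk)
    a = pooled[(0, 0)]
    return (a,
            pooled[(1, 0)] - a,
            pooled[(0, 1)] - a,
            pooled[(1, 1)] - pooled[(1, 0)] - pooled[(0, 1)] + a)

# Made-up parameters, not values from the paper.
alpha1, beta, alpha2 = -2.0, (0.5, 0.8, 0.2), math.log(3.0)

# Extreme case for the upper bound: cells (0,0) and (1,1) never occur in the
# low-risk ethnicity; cells (1,0) and (0,1) never occur in the high-risk one.
f_low = {(0, 0): 0.0, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.0}
f_high = {(0, 0): 0.5, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.5}
b3_extreme = misspecified_coefs(alpha1, beta, alpha2, 0.4, f_low, f_high)[3]

# A less extreme case: every cell occurs in both ethnicities.
g_low = {(0, 0): 0.4, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.1}
g_high = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.4}
b3_mild = misspecified_coefs(alpha1, beta, alpha2, 0.4, g_low, g_high)[3]

print(b3_extreme - beta[2], 2 * alpha2)          # the two values agree
print(abs(b3_mild - beta[2]) < 2 * alpha2)       # bias strictly inside the bound
```

Swapping `f_low` and `f_high` in the first call gives the opposite extreme, a bias of −2*α*_{2}.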

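*B. Algebraic Analyses of Biases when k > 2 Ethnicities.* Assume *k* ethnicities *E*_{1}, *E*_{2}, …, *E*_{k} comprise a cohort, with expected fractions *ε*_{1}, *ε*_{2}, …, *ε*_{k}, respectively, where

logit(*π*) = *α*_{1} + *β*_{1} × *V*_{1} + *β*_{2} × *V*_{2} + *β*_{3} × *V*_{1} × *V*_{2} + *α*_{2} × *E*_{2} + *α*_{3} × *E*_{3} + … + *α*_{k} × *E*_{k} (A)

Maximum likelihood estimates for the mis-specified model that ignored all *k* ethnicities *E*_{1}, *E*_{2}, …, *E*_{k} are

*P*(*α*^{*}) = *ε*_{1} × (*f*_{001}/*f*_{00·}) × *P*(*α*_{1}) + *ε*_{2} × (*f*_{002}/*f*_{00·}) × *P*(*α*_{1} + *α*_{2}) + … + *ε*_{k} × (*f*_{00k}/*f*_{00·}) × *P*(*α*_{1} + *α*_{k})

*P*(*α*^{*} + *β*_{1}^{*}) = *ε*_{1} × (*f*_{101}/*f*_{10·}) × *P*(*α*_{1} + *β*_{1}) + *ε*_{2} × (*f*_{102}/*f*_{10·}) × *P*(*α*_{1} + *β*_{1} + *α*_{2}) + … + *ε*_{k} × (*f*_{10k}/*f*_{10·}) × *P*(*α*_{1} + *β*_{1} + *α*_{k})

*P*(*α*^{*} + *β*_{2}^{*}) = *ε*_{1} × (*f*_{011}/*f*_{01·}) × *P*(*α*_{1} + *β*_{2}) + *ε*_{2} × (*f*_{012}/*f*_{01·}) × *P*(*α*_{1} + *β*_{2} + *α*_{2}) + … + *ε*_{k} × (*f*_{01k}/*f*_{01·}) × *P*(*α*_{1} + *β*_{2} + *α*_{k})

*P*(*α*^{*} + *β*_{1}^{*} + *β*_{2}^{*} + *β*_{3}^{*}) = *ε*_{1} × (*f*_{111}/*f*_{11·}) × *P*(*α*_{1} + *β*_{1} + *β*_{2} + *β*_{3}) + *ε*_{2} × (*f*_{112}/*f*_{11·}) × *P*(*α*_{1} + *β*_{1} + *β*_{2} + *β*_{3} + *α*_{2}) + … + *ε*_{k} × (*f*_{11k}/*f*_{11·}) × *P*(*α*_{1} + *β*_{1} + *β*_{2} + *β*_{3} + *α*_{k})

As an example for the notation, here the third subscript of *f* indexes the ethnicity (1 to *k*), and the expected fraction of observations in the joint reference category in the entire cohort is *f*_{00·} = *ε*_{1} × *f*_{001} + *ε*_{2} × *f*_{002} + … + *ε*_{k} × *f*_{00k}, where the “·” subscript indicates that observations are pooled over the associated index (ethnicity in this case).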

Without loss of generality, assume baseline risks *π*_{001} < *π*_{002} < … < *π*_{00k}, so that 0 < *α*_{2} < *α*_{3} < … < *α*_{k}. Then biases on the intercept term satisfy *α*_{1} ≤ *α*^{*} ≤ *α*_{1} + *α*_{k}; *α*^{*} = *α*_{1} only when *f*_{002} = *f*_{003} = … = *f*_{00k} = 0 (i.e., ethnicities 2 to *k* do not have the joint reference category of *V*_{1} = 0 and *V*_{2} = 0). In addition, *α*^{*} = *α*_{1} + *α*_{k} only when *f*_{001} = *f*_{002} = … = *f*_{00(k−1)} = 0 [i.e., ethnicities 1 to (*k* − 1) do not have the joint reference category of *V*_{1} = 0 and *V*_{2} = 0]. Biases on *V*_{1} effect estimates satisfy *β*_{1} − *α*_{k} ≤ *β*_{1}^{*} ≤ *β*_{1} + *α*_{k}; *β*_{1}^{*} = *β*_{1} + *α*_{k} only when *f*_{002} = *f*_{003} = … = *f*_{00k} = 0 and *f*_{101} = *f*_{102} = … = *f*_{10(k−1)} = 0 [i.e., ethnicities 2 to *k* do not have joint categories of *V*_{1} = 0 and *V*_{2} = 0, and ethnicities 1 to (*k* − 1) do not have joint categories of *V*_{1} = 1 and *V*_{2} = 0]. In addition, *β*_{1}^{*} = *β*_{1} − *α*_{k} only when *f*_{001} = *f*_{002} = … = *f*_{00(k−1)} = 0 and *f*_{102} = *f*_{103} = … = *f*_{10k} = 0 [i.e., ethnicities 2 to *k* do not have joint categories of *V*_{1} = 1 and *V*_{2} = 0, and ethnicities 1 to (*k* − 1) do not have joint categories of *V*_{1} = 0 and *V*_{2} = 0].

There will be no bias on *V*_{1} effect estimates (i.e., *β*_{1}^{*} = *β*_{1}) if *f*_{001} = *f*_{002} = … = *f*_{00(k−1)} = 0 and *f*_{101} = *f*_{102} = … = *f*_{10(k−1)} = 0, or if *f*_{002} = *f*_{003} = … = *f*_{00k} = 0 and *f*_{102} = *f*_{103} = … = *f*_{10k} = 0. Similarly, *β*_{2} − *α*_{k} ≤ *β*_{2}^{*} ≤ *β*_{2} + *α*_{k}; *β*_{2}^{*} = *β*_{2} + *α*_{k} only when *f*_{002} = *f*_{003} = … = *f*_{00k} = 0 and *f*_{011} = *f*_{012} = … = *f*_{01(k−1)} = 0; and *β*_{2}^{*} = *β*_{2} − *α*_{k} only when *f*_{001} = *f*_{002} = … = *f*_{00(k−1)} = 0 and *f*_{012} = *f*_{013} = … = *f*_{01k} = 0. In addition, *β*_{3} − 2*α*_{k} ≤ *β*_{3}^{*} ≤ *β*_{3} + 2*α*_{k}, and *β*_{3}^{*} = *β*_{3} + 2*α*_{k} only when *f*_{00i} = *f*_{01i} = *f*_{10i} = *f*_{11i} = 0 for *i* = 2, …, (*k* − 1).
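The bounds for more than two ethnicities can be spot-checked by simulation over random cohort compositions. The sketch below assumes, as in the two-ethnicity case, that the expected fitted risk of the mis-specified model in each cell of (*V*_{1}, *V*_{2}) equals the risk pooled over ethnicities; the function names, the choice of *k* = 4, and every numeric value are assumptions made for illustration.

```python
import math
import random

CELLS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def pooled_coefs(alpha1, beta, offsets, eps, f):
    """Expected (alpha*, b1*, b2*, b3*) when all k ethnicities are ignored.

    offsets[j] is the log-odds offset of ethnicity j+1 (offsets[0] = 0 for
    the reference ethnicity E_1, offsets[-1] = alpha_k); eps[j] is its cohort
    share; f[j][cell] is the fraction of that ethnicity in a (V1, V2) cell.
    """
    b1, b2, b3 = beta
    eta_star = {}
    for v1, v2 in CELLS:
        eta = alpha1 + b1 * v1 + b2 * v2 + b3 * v1 * v2
        weights = [e * fj[(v1, v2)] for e, fj in zip(eps, f)]
        risk = sum(w * expit(eta + a) for w, a in zip(weights, offsets)) / sum(weights)
        eta_star[(v1, v2)] = logit(risk)
    a = eta_star[(0, 0)]
    return (a, eta_star[(1, 0)] - a, eta_star[(0, 1)] - a,
            eta_star[(1, 1)] - eta_star[(1, 0)] - eta_star[(0, 1)] + a)

def random_simplex(n, rng):
    """Draw n strictly positive numbers that sum to 1."""
    draws = [rng.random() + 1e-6 for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
alpha1, beta = -2.0, (0.4, 0.7, 0.3)
offsets = [0.0, 0.3, 0.6, 1.0]            # k = 4, so alpha_k = 1.0
alpha_k = offsets[-1]

worst_main, worst_inter = 0.0, 0.0
for _ in range(2000):
    eps = random_simplex(len(offsets), rng)
    f = [dict(zip(CELLS, random_simplex(4, rng))) for _ in offsets]
    _, b1s, b2s, b3s = pooled_coefs(alpha1, beta, offsets, eps, f)
    worst_main = max(worst_main, abs(b1s - beta[0]), abs(b2s - beta[1]))
    worst_inter = max(worst_inter, abs(b3s - beta[2]))

print(worst_main <= alpha_k, worst_inter <= 2 * alpha_k)
```

Because every cell is populated in every ethnicity in these random draws, the observed biases stay strictly inside the bounds; the limits themselves are reached only in the degenerate configurations listed above.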

**Grant support:** USPHS grants R21-ES11658 and P50-CA105641 and R01-CA85074 (T.R. Rebbeck).

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

## Acknowledgments

We thank Warren Ewens, Peter Kanetsky, Caryn Lerman, Richard Spielman, and Thomas Ten Have for their helpful comments during the development and execution of this research.