## Abstract

The association between the genotypic frequencies of the cytochrome *P*450 1A1 polymorphism and the risk of childhood leukemia is explored with the data from a matched case-control study. The data are displayed in a 3 × 3 case-control array, and the discordant pair counts are assessed for quasi-independence, homogeneity, and symmetry. This statistical approach is contrasted to the more typical analysis of matched data based on a conditional logistic model and estimated odds ratios. The statistical analysis of 175 matched pairs (part of a large study of potential environmental/genetic influences on the risk of childhood leukemia) showed no evidence of an association between cytochrome *P*450 1A1 genotype frequencies and case-control status.

## Introduction

The possible association between cytochrome *P*450 1A1 (CYP1A1) genetic polymorphism and disease has been explored by several investigators. Some examples are the investigations of Ladonna et al. (1) on Spanish toxic oil syndrome, Ishibe et al. (2) on breast cancer, and Kim et al. (3) on cervical cancer and several reports on the association with childhood leukemia (e.g., refs. 4, 5). These analyses, as well as most analyses of matched case-control genetic data, are summarized in terms of ratios of discordant pairs of observations (odds ratios) estimated from conditional logistic regression models. The following statistical approach to the analysis of the CYP1A1/leukemia matched data is based on a 3 × 3 case-control array that allows an assessment of three fundamental statistical issues (i.e., independence, homogeneity, and symmetry of the discordant pairs). In addition, this approach is contrasted to the more typical use of odds ratios estimated from a conditional logistic model.

## Data

To describe the CYP1A1 polymorphism and the risk of acute lymphoblastic leukemia, data collected as part of the Northern California Childhood Leukemia Study are used. These matched case-control observations were abstracted from a large number of genetic/environmental variables that potentially influence the risk of childhood leukemia. The cases are children ages 0 to 14 years old with newly diagnosed leukemia (1995 to 1999) obtained from major hospitals in the San Francisco Bay Area. Comparison with California State Cancer Registry data shows that >90% of the eligible children were ascertained. The control children were randomly selected from birth certificate records and matched to cases with respect to sex, age, race, and county of birth. A more extensive description of this far-ranging study is found elsewhere (6). The CYP1A1/leukemia data consisting of 175 matched pairs of acute lymphoblastic leukemia cases and their controls (117 concordant and 58 discordant pairs) are given in Table 1.

**Table 1.**

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA^{*} | 103 | 26 | 2 | 131 |

Case: AG | 23 | 14 | 2 | 39 |

Case: GG | 1 | 4 | 0 | 5 |

Total | 127 | 44 | 4 | 175 |

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA^{*} | 103 | 26 | 2 | 131 |

Case: AG | 23 | 14 | 2 | 39 |

Case: GG | 1 | 4 | 0 | 5 |

Total | 127 | 44 | 4 | 175 |

AA, CYP1A1 homozygotic wild-type.

## Quasi-Independence

Genotype data classified into a square array are quasi-independent when the categorical variables (row and columns) are independent with respect to only the discordant pairs. No restrictions are placed on the concordant pairs (the main diagonal of the case-control array). In symbols, the six expected cell counts of the discordant pairs (denoted *F** _{ij}*) are

. The values *p** _{i}* represents the case genotypic frequencies, and

*q*

*represents the control genotypic frequencies. The quantity*

_{j}*N*represents the “total number of pairs” that would have occurred if the genotypic frequencies were independent and the data were randomly sampled. Specifically, the value

*N*=

*n*/ ∑

*p*

_{i}*q*

*for*

_{j}*i*≠

*j*= 1, 2, and 3 where

*n*represents the total observed number of discordant pairs.

The maximum likelihood estimates of the *p** _{i}* and

*q*

*frequencies are found by iterative techniques (7) or an application of a specialized log-linear model (8). Both procedures are designed to estimate the genotypic frequencies excluding the concordant pairs from consideration (truncated). For the Northern California Childhood Leukemia Study data (Table 1), these estimated genotypic frequencies are*

_{j}A natural estimate of the expected number of case-control discordant pairs based on the quasi-independence model is *F̂** _{ij}* =

*N̂p̂*

_{i}*q̂*

*where*

_{j}*N̂*=

*n*/ ∑

*p̂*

_{i}*q̂*

*for*

_{j}*i*≠

*j*= 1, 2, and 3. These expected counts estimated from the data in Table 1 are displayed in Table 2.

**Table 2.**

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA | 103 | 26.582 | 1.418 | 131 |

Case: AG | 22.418 | 14 | 2.582 | 39 |

Case: GG | 1.582 | 3.418 | 0 | 5 |

Total | 127 | 44 | 4 | 175 |

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA | 103 | 26.582 | 1.418 | 131 |

Case: AG | 22.418 | 14 | 2.582 | 39 |

Case: GG | 1.582 | 3.418 | 0 | 5 |

Total | 127 | 44 | 4 | 175 |

AA, CYP1A1 homozygotic wild-type.

The Pearson *χ*^{2} goodness-of-fit test statistic (1 *df*) summarizing the deviations of the observed values from the expected values (Table 1 versus Table 2) is *X*_{Q}^{2} = 0.712 (*P* = 0.399). The *df* values are the number of observations (off-diagonal cell frequencies = 6) minus the number of independent parameters necessary to estimate the expected values or to specify the appropriate log-linear model. In the case of quasi-independence, five independent estimated parameters establish the expected values in Table 2 making the *df* equal to 1 (6 − 5 = 1).

In the context of the analysis of matched case-control data, another form of this same χ^{2} test is called a “test for the consistency of the odds ratios” (9). The noninformative concordant pairs are excluded because the increased correlation within pairs due to the matching process tends to increase the number of concordant pairs relative to the number predicted by a model postulating independence.

## Marginal Homogeneity

Quasi-independence does not imply marginal homogeneity (identical case and control genotypic frequency distributions). The issue of marginal homogeneity in general is the subject of several statistical articles (e.g., refs. 10-12). The expected row/column totals, under the conjecture of marginal homogeneity, are estimated by

where *n** _{i.}* = ∑

_{j}*f*

*,*

_{ij}*n*.

*= ∑*

_{j}

_{i}*f*

*and*

_{ij}*f*

*represents the number of pairs in the (*

_{ij}*i*,

*j*)

^{th}cell.For the CYP1A1/leukemia data, the estimated homogeneous marginal totals are

## Symmetry

When the case-control discordant pairs are independent and the marginal frequencies are homogeneous, the expected counts of case-control pairs create a symmetrical array; that is, when case-control status is unrelated to the genotypic frequencies, the counts within the three kinds of discordant pairs (*f** _{ij}* versus

*f*

*) differ by chance alone. Under these conditions (independence + homogeneity = symmetry), the maximum likelihood estimates of the genotypic frequencies (denoted*

_{ji}*P̂*

*) are*

_{i}where *f*_{{ij}} = *f** _{ij}* +

*f*

*. Because the number of independent parameters equals the number of independent observations (2), the maximum likelihood estimates*

_{ji}*p̂*

*can be derived by equating expected and observed frequencies (13). The specific estimated genotype frequencies from the acute lymphoblastic leukemia data are*

_{i}These estimated genotypic frequencies produce an estimate of the expected counts of matched case-control discordant pairs (denoted *F_{ij}′*) where

The estimated “sample size” is *N̂′* = *n* / ∑*P̂ _{i}*

*P̂*

*for*

_{j}*i*≠

*j*= 1, 2, and 3. The Northern California Childhood Leukemia Study matched pairs data (Table 1) produce the estimates displayed in Table 3. These expected discordant pairs are quasi-independent, and the marginal frequencies are homogeneous.

**Table 3.**

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA | 103 | 24.5 | 1.5 | 129.0 |

Case: AG | 24.5 | 14 | 3.0 | 41.5 |

Case: GG | 1.5 | 3.0 | 0 | 4.5 |

Total | 129.0 | 41.5 | 4.5 | 175 |

. | Control: AA^{*}
. | Control: AG . | Control: GG . | Total . |
---|---|---|---|---|

Case: AA | 103 | 24.5 | 1.5 | 129.0 |

Case: AG | 24.5 | 14 | 3.0 | 41.5 |

Case: GG | 1.5 | 3.0 | 0 | 4.5 |

Total | 129.0 | 41.5 | 4.5 | 175 |

AA, CYP1A1 homozygotic wild-type.

The Pearson goodness-of-fit test statistic (Table 1 versus Table 3) is *X*_{S}^{2} = 1.184 (*P* = 0.757) and has an approximate *χ*^{2} distribution with 3 *df* when the observed counts randomly differ from the expected counts. The comparison of these observed and expected numbers of discordant pairs is identical to the sum of three McNemar-like test statistic (14). In symbols,

The χ^{2} test for symmetry requires three independent estimated parameters giving a test statistic with 3 *df* (6 − 3 = 3). In addition, this test statistic partitions into three independent components each with 1 *df*.

In addition, the three estimated genotypic frequencies *P̂** _{i}* lead to a compact form of a χ

^{2}test statistic to evaluate marginal homogeneity in a 3 × 3 case-control array. The test statistic is

The test statistic *X*_{H}^{2} summarizes deviations from marginal homogeneity and has an approximate χ^{2} distribution with 2 *df* when the expected marginal frequencies are homogeneous. The χ^{2} expression for testing homogeneity requires four independent estimated parameters yielding 2 *df* (6 − 4 = 2). For the Northern California Childhood Leukemia Study data, the χ^{2} value is *X*_{H}^{2} = 0.479 (*P* = 0.787).

## Conditional Logistic Model

The additive conditional logistic model applied to genotype frequency data collected in a matched design yields estimates of the logarithms of the three ratios of discordant pairs (denoted *b** _{i}*). When the genotypic frequencies are the same for both cases and controls, the model estimated ratio within all three kinds of discordant pairs is 1 (

*b*

_{1}=

*b*

_{2}=

*b*

_{3}= 0). Furthermore, the additive model requires that

*b*

_{1}+

*b*

_{3}=

*b*

_{2}. For the CYP1A1/leukemia data, these estimated log-ratios are

*b̂*_{1}= −0.170 for AA/AG discordant pairs,*b̂*_{2}= 0.110 for AA/GG discordant pairs, and*b̂*_{3}= 0.280 for AG/GG discordant pairs.

The corresponding estimated odds ratios (e^{b̂1}) are 0.843, 1.116, and 1.323, respectively.

The estimated log-ratios are directly related to the expected cell counts generated by the quasi-independence model. In symbols, the estimates from the quasi-independence model give

In other words, both models generate identical expected counts contained in a 3 × 3 case-control array (Table 2).

From another prospective, both models necessarily conform to the relationship that

or

The requirement that *D* = 1 is essentially the definition of quasi-independence or model additivity (no interaction) within a 3 × 3 array.

The distribution of the estimated value

has estimated variance given by

These two estimates provide a computationally easy and additional assessment of the correspondence between the observed and the expected counts generated by the quasi-independence or the additive conditional logistic models.

The estimate of log(*D̂*) and its estimated variance from the CYP1A1/leukemia data are log(*D̂*) = −1.264 and *v̂* = 2.332 yielding the test statistic

The value *X _{Q}′*

^{2}has an approximate χ

^{2}distribution with 1

*df*when the data (Table 1) randomly differ from the expected counts generated by the quasi-independence or the conditional logistic models (Table 2). In general, this χ

^{2}statistic (

*X*

_{Q}′^{2}) and the χ

^{2}statistic (

*X*

_{Q}^{2}) will be similar, particularly for case-control arrays with many discordant pairs.

It should also be noted that the likelihood ratio test based on the additive conditional logistic model addresses only the hypothesis that *b*_{1} = *b*_{2} = *b*_{3} = 0 or the ratios of the corresponding discordant pairs are 1.0. The score likelihood ratio test statistic is identical to the previously described test for marginal homogeneity χ_{H}^{2}.

## Discussion

A symmetrical case-control array of matched pairs data indicates that no association likely exists between case-control status and genotypic frequencies. When statistical evidence emerges of nonsymmetry, two issues become important (i.e., independence and marginal homogeneity). That is, two reasons exist for a significant lack of symmetry: the failure of the genotypic frequencies to be independent (failure of the additive conditional logistic model to reflect the data) and the failure of the matched data to have the same case and control genotypic frequencies or both. These two sources of deviation for a symmetrical array are easily identified and indicate different dimensions of the association between genotypic frequencies and disease risk.

In general, the likelihood ratio statistic estimated from an additive conditional logistic model addresses only the issue of marginal homogeneity of the case-control array because an additive model explicitly requires the discordant matched pairs to be independent. That is, substantial differences in the ratios of discordant pairs can exist in a 3 × 3 case-control array with perfectly homogeneous marginal frequencies *X*_{H}^{2} = 0 when the genotypic frequencies that determine the numbers of discordant pairs are not independent. Without an assessment of the quasi-independence model *X*_{Q}^{2} or equivalently the consistency of the odds ratios (*b*_{1} + *b*_{3} = *b*_{2}), inferences from matched case-control data are potentially biased and even potentially misleading. As with statistical models in general, goodness-of-fit is a critical issue.

The interpretation of quasi-independence of two variables is not different in principle from the interpretation in most contingency tables. Quasi-independence becomes an issue when specific cell frequencies are truncated from consideration. In a matched pairs design, the frequencies on the diagonal cells of the case-control array (the concordant pairs) are not included in the analysis. Nevertheless, two variables are not independent (or not quasi-independent) when the occurrence of one changes the probability of the occurrence of the other. For example, case-control status and genotypic frequencies are not quasi-independent when *P*(AA | AG is a control) is not equal to *P*(AA case). A phenotype frequency among the matched pair cases will not be quasi-independent when, for example, cases and controls have differing racial compositions and the phenotypic frequencies under investigation differ among races. In fact, nonindependence potentially arise whenever the controls fail to be a random sample of the population from which the cases were selected.

The χ^{2} test of symmetry is a simultaneous evaluation of both independence and homogeneity. Applied to the CYP1A1/leukemia data, this test produces no evidence of an association between genotypic frequencies and case-control status (*X*_{S}^{2} = 1.184 with *P* = 0.757). It then becomes a foregone conclusion that *χ*^{2} tests of quasi-independence and marginal homogeneity (*X*_{Q}^{2} = 0.712 and *X*_{H}^{2} = 0.479) consists of two nonsignificant pieces.

**Grant support:** U.S. Environmental Health Sciences research grants R01 ES09137 and PS42 ES04705.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

## References

*P*450 1A1 polymorphism and breast cancer in the Nurses' Health Study.