## Abstract

Although gathered as continuous data, expression measurements from gene microarrays may be quantized before downstream analysis and modeling. This is especially true for modeling gene prediction and genetic regulatory networks. Coarse quantization results in lower computational requirements, lower data requirements for model inference, and easier conceptualization. This paper proposes a mixture model for binarization. For each gene, the model, composed of a sum of two distributions, is fit to expression data for that gene, and data points are binarized according to the model. The mixture model is based on the assumption of multiplicative up-regulation. The proposed method is compared with mean and median binarization by comparing classification performance based on the binary data from the different methods. Classification is performed for simulated data generated from a microarray model studied previously and for cancer data arising from two studies involving hereditary breast cancer and small, round blue-cell tumors of childhood.

## Introduction

It is a general principle that a system should be modeled at the lowest level of complexity that permits the accomplishment of the purposes for which the model is being developed. Lower levels of model complexity mean less computation, lower data requirements for model identification, and greater ease of conceptualization. In a sense, this is an engineering form of Occam’s razor in which the pragmatics of the problem imply some minimal level of necessary complexity. The issue of complexity reduction is particularly salient for the development of prediction models and genetic regulatory models using microarray data, because the number of genes is very large and the number of samples very small. Modeling involves numerous goals, including prediction of targets based on pathways, characterization of disease states, and the design of optimal time-dependent dosing regimens (1). For gene-expression-based models, various modeling decisions must be made based on ones goals, computational power, and data abundance: the set of genes to be included in the model, the degree of complexity allowed for functional relations between genes within the model, and the quantization of expression levels.

Regarding quantization, the subject of this paper, perhaps the most well-studied genetic regulatory network model is the Boolean network (2–4). As the name implies, this model uses binary quantization in which a gene is either ON (expression level equals 1) or OFF (expression level equals 0). The intent of such a coarse-grained model is not to serve as an architecture for biochemical pathways, but rather to give qualitative insight into multivariate, dynamical gene interaction, to provide a mathematical structure to study this dynamical behavior at the level of logical switching, and to design therapeutic strategies based on the dynamical behavior of the network when the ON-OFF paradigm is sufficient. Application of Occam’s razor is implicit in the study of dynamical behavior and the design of therapies in the context of binary quantization; indeed, both are valid to the extent that they can be pragmatically described in a binary framework. In fact, the Boolean model has provided conceptual insights into the behavior of genetic regulatory networks (5–7). The Boolean-network model, which is deterministic, has been extended recently to a stochastic framework that allows for different functional relationships and perturbations between states. These stochastic networks are called “probabilistic Boolean networks” (8). A probabilistic Boolean network maintains the rule-based structure of a Boolean network, whereas allowing for uncertainty. Two studies have demonstrated that intervention can be addressed within the context of probabilistic Boolean networks (9, 10), and a third has considered optimal time-dependent external control based on dynamic programming (11). Even should the necessary technologies for diagnosis, monitoring, and therapy for these kinds of model-based strategies become practically feasible, their successful use would still depend on appropriate binarization of continuous data. The issue of binarization has been addressed in some detail in a recent paper, and we refer to it for a more in-depth discussion of binarization issues (12).

The approach we take in this paper is to apply a binarization procedure based on a multiplicative model for expression up-regulation. The expression level of a gene varies across a set of microarrays, and in the context of a binary quantization is either ON or OFF for the various samples yielding the microarrays. Taken as a collection, the measurements for a particular gene compose a distribution of values, and we shall model that distribution as a mixture of two distributions, one corresponding to lack of up-regulation and the other to up-regulation. The parameters of these two distributions will be estimated from the expression data for the gene, and then data will be quantized according to the modeled distributions. The mixture model depends on whether one is using ratios or the intensities directly. We focus on ratios, and comment on how the procedure applies to straight intensity data with a simpler model.

## Materials and Methods

### Mixture Model.

The mixture model used for binarization is based on the assumption of multiplicative up-regulation. For a particular gene, *g*, we assume that its values across the set of microarrays for which it is not up-regulated are described by a normal random variable *U* of which the mean is the nominal value of *g* in these samples. The model has a multiplicative factor *K* > 1 such that the values of *g* in the up-regulated samples are governed by the random variable *KU*, which is also normal because *U* is normal. Assuming a two-channel cDNA microarray, the values of the reference channel are modeled by a random variable *B*. Taking logs, there are two possibilities. For samples in which *g* is not up-regulated, we get:

For an up-regulated sample, we get:

If we make the simplifying assumption that *B* is not random, then the distribution of log *KU*/*B* is simply a shift of the distribution of log *U*/*B* and both possess a distribution that is the log of a normal distribution. Hence, across the samples, the logs of the ratios will appear as a mixture of two log-of-normal (not lognormal) distributions.

On the basis of the preceding considerations, we postulate a log-of-normal mixture model for the observations of a particular gene across a set of microarrays. We assume that the logs follow the mixture model

where *X* is the log of the ratio, ℒN denotes the log-of-a-normal distribution having density

and *c*_{k},μ_{k} and ς_{k}^{2} are the weights, means, and variances, respectively, of the mixture model. *Q* is the number of quantization levels, which in our case is *Q*=2, but could be different were we to model more levels of multiplicative up-regulation. Estimation of the model parameters is explained in “Appendix.” Fig. 1 shows a mixture of two log-of-normal distributions.

Binarization based on the log-of-normal mixture model is achieved by thresholding. Without loss of generality, assume μ_{1} < μ_{2}. The log of a ratio *X* is quantized by:

where for *i*=1, 2,

Because closed forms do not exist for the preceding integrals, they are evaluated by Monte Carlo methods.

Were we to use direct intensities instead of ratios, then we would not have quotients and the not-up-regulated and regulated cases would simply involve a normal random variable *U* and a second normal random variable *KU*, respectively. Hence, we would apply a Gaussian mixture model.

### Testing Binarization Performance Using a Simulation Model.

We test the performance of MMB^{3} using a model-based simulation and compare it to using the mean and median of the samples for each gene, denoted by “Mean” (12) and “Median” (12, 13), respectively. The Mean method is based on a threshold *T* (*i.e.,* replacing the *T* in E) that is the mean of the intensities of each gene, and the Median method is based on a threshold *T* that is the median of the intensities of each gene. Then the intensity of this gene is quantized to 1 if the intensity exceeds *T*, and it is 0 otherwise. The latter two methods are simple but effective if there is sufficient separation in the mixture.

The simulated data are generated by a parameterized random signal model (14). It is assumed that the *n* genes on the *m* microarrays have the mean expression levels *I*_{1},*I*_{2},…, *I*_{n}, with *I*_{i}=100 + ξ_{i}, where ξ_{i} follows an exponential distribution with mean 3000. The model assumes that the intensities *U*_{i1}, *U*_{i2}, …, *U*_{im} for a specific gene *g*_{i} follow a normal distribution N(*I*_{i}, (α*I*_{i})^{2}), where α is a fixed coefficient of variation. Here we set α=0.2, which is realistic. With the inclusion of normally distributed additive noise *N*_{ij}, observed intensities are given by:

for samples in which gene *g*_{i} is not up-regulated, and

when gene *g*_{i} is up-regulated. The model assumes that the additive noise varies from microarray to microarray, with *N*_{ij}∼N(0,η_{i}^{2}), where η_{i}∼N(μ_{a}, ς_{a}^{2}). Here we set μ_{a} = 30 and ς_{a}=10. To complete the ratio model, we assume the reference channel has normally distributed values *B*_{ij} with mean equal to the mean, μ_{I}, of the intensity means and SD αμ_{I}. In essence, this means that we are assuming a uniform reference probe across all of the genes, and the mean probe intensity is normalized to the mean intensity of the expression means. Other reference models could be used, for instance, assuming each gene to have its own probe.

We test binarization performance by comparing the effects of the different methods on classification. On the basis of the simulation model, we take the log

quantize the log to obtain *q*(X_{ij}), and then attempt to perform classification of the up-regulated and not-up-regulated samples based on gene expression. The testing principle is that binarization yielding good classification is a satisfactory binarization (at least for classification purposes). For classification we use a recently proposed method involving Bayesian variable selection in conjunction with a probit regression model to relate the gene expression with the class (15). This method has performed well on labeled data from hereditary breast cancer, and we will be able to compare those earlier results with those obtained from binary expression data.

Considering the high computational cost of the Bayesian gene selection, we use the principle that the smaller the sum of squares within groups and the larger the sum of squares between groups, the better the classification performance. Hence, we take the ratio of the between-group to within-group sum of squares and determine a threshold so that we only keep those genes for which the ratio exceeds the threshold. The leave-one-out method is used to estimate classification errors.

Subsequent to the simulation, we consider tumor classification using two different published data sets. Once again we are interested in which binarization method yields the best classification rate. It will be seen that MMB has no errors for both data sets. Not only does this show the worth of using the mixture model (which is our main intent), but it also lends support to a proposition put forth by Shmulevich and Zhang (12) that binary data can often perform important data analysis tasks. There, the authors demonstrate the ability of binary data to form meaningful clusters; here, the experiments demonstrate the ability of binary data to perform successful classification.

The cancer data will be normalized to account for variation between microarrays. For the *j*th array we take the log transform of the ratios *t*_{1j},*t*_{2j}, . . , *s*,*t*_{nj} to obtain log *t*_{ij}, for *i*=1, 2, …, *n*. The mean, *m*_{j}, of the log-transformed ratios for the *j*th array is computed, and we define the normalized ratio according to the equation

for *i*=1, 2, …, *n*. The normalized ratios are used for classification.

The sensitivity of the Bayesian-variable-selection method was examined in the original study by adding synthetic Gaussian noise to the data (15). Here we will check sensitivity using only actual data by considering how the binarization methods perform when the data are not normalized for microarray-induced variation. This has the effect of making the classification more difficult because it tends to increase the dispersion of expression measurements for each gene.

## Results and Discussions

### Simulation Study.

The Bayesian selection algorithm ranks a gene according to the percentage of times it appears among the posterior samples generated by the algorithm (15). The five strongest (highest ranked) genes are used for classification for each binarization method being tested. These will differ according to the method. In each case, the five strongest genes are used in the probit classification to classify the up-regulated samples. The average recognition accuracies, based on 50 simulations, are shown in Table 1 for different up-regulation factors. There is little difference for large *K*, but when *K* is small, the mixture-model method significantly outperforms the Mean and Median methods. Because classification accuracy depends on *K* in the up-regulation model, this means that the mixture-model approach should help for more difficult classification problems.

### Hereditary Breast Cancer Data.

First we consider hereditary breast cancer data, which can be downloaded from the web page of the original paper (16). In that paper, cDNA microarrays are used in conjunction with classification algorithms to show the feasibility of using differences in global gene expression profiles to separate BRCA1 and BRCA2 mutation-positive breast cancers. Twenty-two tumor samples from 21 patients are examined: 7 BRCA1, 8 BRCA2, and 7 sporadic. There are 3226 genes for each tumor sample. After binarization, we use the probit regression classifier with Bayesian gene selection to classify BRCA1 *versus* BRCA2 and sporadic.

Table 2 gives the leave-one-out error counts for the three binarization methods being tested and the four experiments under consideration, the two cancers with normalized and non-normalized data. The 5 top-ranked genes under Bayesian variable selection are used for classification for each binarization method.

Tables 3 through 5 give the top 10 scoring genes found by Bayesian variable selection for MMB, Median, and Mean, respectively, when the data has been normalized. Using the top 5 genes for classification for each binarization method, MMB yields no errors, whereas both Mean and Median result in one error each. Given that there are 22 microarrays, this represents a 5% improvement using MMB. This is significant, especially because MMB yields no errors at all.

By comparing the lists in Tables 3 through 5 with the list of the top 10 scoring genes in the original paper (15), shown in Table 6, we see that even with the compression of binarization, Bayesian selection may still find a number of similar significant genes. The top 10 scoring genes using MMB with normalized data includes 3 of the genes on the original list, all 3 being in the original top 5. Included among the 3 are TOB1 and phosphofructonkinase. Neither Mean nor Median has any of the original top 10 among their respective top 10.

For the non-normalized data, we see from Table 2 that the performance of both MMB and Mean remain the same, with MMB still perfect, but Median results in three errors, which is very poor given that MMB has no errors.

### Small Round Blue-Cell Tumor Data.

Now we apply binarization and probit regression classification with Bayesian feature selection to a published data set for SRBCTs of childhood, which includes NB, RMS, non-Hodgkin’s lymphoma, and the Ewing family of tumors (17). We classify the rhabdomysosarcoma and NB tumors. The data set for the two cancers is composed of 2308 genes and 35 samples, 23 samples for RMS, and 12 samples for NB.

Tables 7 through 9 give the 10 strongest genes using Bayesian selection with MMB, Median, and Mean, respectively. For the top 5 genes on each list, the classification, and the recognition accuracies are shown in Table 2. Again, MMB yields no errors. Cadherin 2, which tops the significant gene list for MMB, is identified as an important gene in the original SRBCT study (17). In fact, it can perfectly classify RMS and non-Hodgkin’s lymphoma by itself using the binary data. Even with perfect single-gene classification possible, Median still produces one error, although Mean does yield perfect classification, with cadherin 2 at the top of its list.

The situation is very different for the non-normalized data. Again, MMB identifies cadherin 2 as the strongest gene and produces no errors. However, both Mean and Median perform poorly, with four and three errors, respectively.

## Appendix

This appendix describes parameter estimation for the log-of-normal mixture model. First, the samples are partitioned into *Q* clusters using fuzzy C-means clustering, and then initial parameters are estimated using the standard vector quantization method. We then estimate the parameters in C using the expectation-maximization algorithm by iterating the following steps:

Γ_{i} denotes the observations of gene *i*.

We note that, if *Q* were not fixed at *Q*=2 for binary quantization, we could find an optimal number of distributions in the model via the minimum description length principle by building a binary tree with *Q* leaves, each representing a distribution in the mixture model (18).

Supported by the National Human Genome Research Institute, the University of Texas M. D. Anderson Cancer Center, and the National Science Foundation grant DMS-0225692.

The abbreviations used are: MMB, mixture model binarization; SRBCT, small, round blue-cell tumor; NB, neuroblastoma; RMS, rhabdomyosarcoma.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.