Shrinkage-based similarity metric for cluster analysis of microarray data

Communicated by Michael H. Wigler, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, June 18, 2003 (received for review December 19, 2002)
Abstract
The current standard correlation coefficient used in the analysis of microarray data was introduced by M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein [(1998) Proc. Natl. Acad. Sci. USA 95, 14863–14868]. Its formulation is rather arbitrary. We give a mathematically rigorous correlation coefficient of two data vectors based on James–Stein shrinkage estimators. We use the assumptions described by Eisen et al., also using the fact that the data can be treated as transformed into normal distributions. While Eisen et al. use zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable [with a priori distribution N(0, τ²)], and use Bayesian analysis to obtain a posteriori distribution of μ in terms of the data. The shrunk estimator for μ differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example by using the data set from Eisen et al. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients and contrasting them against clusters based on the activators known in the literature. The estimated false positives and false negatives from this study indicate that using the shrinkage metric improves the accuracy of the analysis.
Recent advances in technology, such as microarray-based gene expression analysis, have allowed us to “look inside the cells” by quantifying their transcriptional states. While the most interesting insight can be obtained from transcriptome abundance data within a single cell under different experimental conditions, in the absence of technology to provide one with such a detailed picture, we have to make do with mRNA collected from a small, frequently unsynchronized, population of cells. Furthermore, these mRNAs will give only a partial picture, supported only by those genes that we are already familiar with and possibly missing many crucial undiscovered genes.
Of course, without the proteomic data, transcriptomes tell less than half the story. Nonetheless, it goes without saying that microarrays have already revolutionized our understanding of biology even though they provide only occasional, noisy, unreliable, partial, and occluded snapshots of the transcriptional states of cells.
In an attempt to attain functional understanding of the cell, we try to understand the underlying structure of its transcriptional statespace. Partitioning genes into closely related groups has thus become the key mathematical first step in practically all statistical analyses of microarray data.
Traditionally, algorithms for cluster analysis of genome-wide expression data from DNA microarray hybridization are based on statistical properties of gene expression and result in organizing genes according to similarity in their patterns of expression. If two genes belong to a cluster, then one may infer a common regulatory mechanism for the two genes or interpret this information as an indication of the status of cellular processes. Furthermore, coexpression of genes of known function with novel genes may lead to a discovery process for characterizing unknown or poorly characterized genes. In general, since false negatives (FNs) (where two coexpressed genes are assigned to distinct clusters) may cause the discovery process to ignore useful information for certain novel genes, and false positives (FPs) (where two independent genes are assigned to the same cluster) may result in noise in the information provided to the subsequent algorithms used in analyzing regulatory patterns, it is important that the statistical algorithms for clustering be reasonably robust. Unfortunately, as the microarray experiments that can be carried out in an academic laboratory for a reasonable cost are small in number and suffer from experimental noise, often a statistician must resort to unconventional algorithms to deal with small-sample data.
A popular and one of the earliest clustering algorithms reported in the literature was introduced in ref. 1. In that paper, the gene-expression data are collected on spotted DNA microarrays (2) and based on gene expression in the budding yeast Saccharomyces cerevisiae during the diauxic shift (3), the mitotic cell division cycle (4), sporulation (5), and temperature and reducing shocks. Each entry in a gene-expression vector represents a ratio of the amount of transcribed mRNA under a particular condition with respect to its value under normal conditions. All ratio values are log-transformed to treat inductions and repressions of identical magnitude as numerically equal but opposite in sign. It is assumed that the raw ratio values follow log-normal distributions, and hence, the log-transformed data follow normal distributions. Although our mathematical derivations will rely on this assumption for the sake of simplicity, we note that our approach can be generalized in a straightforward manner to deal with other situations where this assumption is violated.
The gene similarity metric used in ref. 1 was a form of correlation coefficient. Let G_{i} be the (log-transformed) primary data for gene G in condition i. For any two genes X and Y observed over a series of N conditions, the classical similarity score based on the Pearson correlation coefficient is

S(X, Y) = (1/N) Σ_{i=1,…,N} ((X_{i} − X_{offset})/Φ_{X}) · ((Y_{i} − Y_{offset})/Φ_{Y}), [1]

where

Φ_{G} = sqrt((1/N) Σ_{i=1,…,N} (G_{i} − G_{offset})²), [2]

and G_{offset} is the estimated mean of the observations, i.e., G_{offset} = Ḡ = (1/N) Σ_{i=1,…,N} G_{i}. Note that Φ_{G} is simply the (rescaled) estimated standard deviation of the observations. In the analysis presented in ref. 1, “values of G_{offset} which are not the average over observations on G were used when there was an assumed unchanged or reference state represented by the value of G_{offset}, against which changes were to be analyzed; in all of the examples presented there, G_{offset} was set to 0, corresponding to a fluorescence ratio of 1.0.” To distinguish this modified correlation coefficient from the classical Pearson correlation coefficient, we shall refer to it as the Eisen correlation coefficient. Our main innovation is in suggesting a different value for G_{offset}, namely G_{offset} = γḠ, where γ is allowed to take a value between 0.0 and 1.0. Note that when γ = 1.0, we recover the classical Pearson correlation coefficient, and when γ = 0.0, the Eisen correlation coefficient. For a nonunit value of γ, the estimator G_{offset} = γḠ can be thought of as the unbiased estimator Ḡ being shrunk toward the believed value G_{offset} = 0.0. We address the following questions: What is the optimal value for the shrinkage parameter γ from a Bayesian point of view? (See ref. 6 for some alternate approaches.) How do the gene expression data cluster as the correlation coefficient is modified with this optimal shrinkage parameter?
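As a concrete illustration, the γ-parametrized coefficient of Eqs. 1 and 2 can be sketched in a few lines of NumPy. The function name `similarity` and its interface are ours, not from the original implementation:

```python
import numpy as np

def similarity(x, y, gamma=1.0):
    """Generalized correlation of Eqs. 1-2: gamma=1.0 recovers the
    Pearson coefficient, gamma=0.0 the Eisen coefficient."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # G_offset = gamma * mean(G): the unbiased mean shrunk toward 0.
    x_off, y_off = gamma * x.mean(), gamma * y.mean()
    phi_x = np.sqrt(np.mean((x - x_off) ** 2))  # Eq. 2
    phi_y = np.sqrt(np.mean((y - y_off) ** 2))
    return np.mean(((x - x_off) / phi_x) * ((y - y_off) / phi_y))
```

(The sketch assumes Φ_{X} and Φ_{Y} are nonzero, i.e., neither vector is constant at its offset.)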
To achieve a consistent comparison, we leave the rest of the algorithms undisturbed. Namely, once the similarity measure has been assumed, we cluster the genes by using the same hierarchical clustering algorithm as the one used by Eisen et al. (1). Their hierarchical clustering algorithm is based on the centroid-linkage method [referred to as the “average-linkage method” of Sokal and Michener (7) in ref. 1] and is discussed further below. The modified algorithm has been implemented by us within the New York University MicroArray Database system and can be freely downloaded from: bioinformatics.cat.nyu.edu/nyumad/clustering/. The clusters created in this manner were used to compare the effects of choosing differing similarity measures.
Model
We derive the proposed similarity metric. In our setup, the microarray data are given in the form of the expression levels of M genes under N experimental conditions. The data can be viewed as an N × M matrix X = (X_{ij}), 1 ≤ i ≤ N, 1 ≤ j ≤ M, where M ≫ N and X_{.j} = (X_{1j}, …, X_{Nj}) is the data vector for gene j. We begin by rewriting S (see Eqs. 1 and 2) in our notation:

S(X_{.j}, X_{.k}) = (1/N) Σ_{i=1,…,N} ((X_{ij} − (X_{j})_{offset})/Φ_{X_{j}}) · ((X_{ik} − (X_{k})_{offset})/Φ_{X_{k}}). [3]
In the most general setting, we can make the following assumptions on the data distribution: let all values X_{ij} for gene j have a normal distribution with mean θ_{j} and standard deviation β_{j} (variance β_{j}²); i.e., X_{ij} ~ N(θ_{j}, β_{j}²) with j fixed (1 ≤ j ≤ M), where θ_{j} is an unknown parameter (taking different values for different j). To estimate θ_{j}, it is convenient to assume that θ_{j} is itself a random variable taking values close to zero: θ_{j} ~ N(0, τ²). The assumed distribution aids us in obtaining the estimate of θ_{j} given in Eq. 6.
For convenience, let us also assume that the data are range-normalized, so that β_{j}² = β² for every j. If this assumption does not hold on the given data set, it is easily corrected by scaling each gene vector appropriately. Following common practice, we adjusted the range of each gene vector to scale to an interval of unit length, i.e., so that its maximum and minimum values differ by 1.
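The range normalization assumed here can be sketched as follows (a minimal NumPy sketch; the helper name `range_normalize` is ours):

```python
import numpy as np

def range_normalize(X):
    """Scale each column (gene vector) of the N x M data matrix so that
    its maximum and minimum values differ by exactly 1."""
    X = np.asarray(X, dtype=float)
    spans = X.max(axis=0) - X.min(axis=0)  # per-gene range
    return X / spans                       # assumes no constant gene vector
```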
Replacing (X_{j})_{offset} in Eq. 3 by the exact value of the mean θ_{j} yields a clairvoyant correlation coefficient of X_{j} and X_{k}. In reality, since θ_{j} is itself a random variable, it must be estimated from the data. Therefore, to get an explicit formula for S(X_{j}, X_{k}) we must derive estimators for all j.
In the Pearson correlation coefficient, θ_{j} is estimated by the vector mean X̄_{j} = (1/N) Σ_{i=1,…,N} X_{ij}; the Eisen correlation coefficient corresponds to replacing θ_{j} by 0 for every j, which is equivalent to assuming θ_{j} ~ N(0, 0) (i.e., τ² = 0). We propose to find an estimate of θ_{j} (call it θ̂_{j}) that takes into account both the prior assumption and the data.
First, let us obtain the posterior distribution of θ_{j} from the prior and the data. This derivation can be done either from Bayesian considerations or via the James–Stein shrinkage estimators (see refs. 8 and 9). Here, we discuss the former method.
Assume initially that N = 1, i.e., we have one data point X_{j} for each gene, and denote the variance by σ² for the moment: X_{j} ~ N(θ_{j}, σ²), with prior θ_{j} ~ N(0, τ²). From these assumptions, we get (see ref. 10 for full details)

θ_{j} | X_{j} ~ N(τ²X_{j}/(σ² + τ²), σ²τ²/(σ² + τ²)). [4]
Now, if N > 1 is arbitrary, X_{j} becomes a vector X_{.j}. In ref. 10 we show (by using likelihood functions) that the vector of values X_{.j}, with mean X̄_{j} = (1/N) Σ_{i=1,…,N} X_{ij}, can be treated as a single data point X̄_{j} drawn from the distribution N(θ_{j}, β²/N).
Thus, following the same derivation with σ² = β²/N, we have a Bayesian estimator for θ_{j} given by E(θ_{j} | X_{.j}):

θ̂_{j} = E(θ_{j} | X_{.j}) = τ²X̄_{j}/(β²/N + τ²). [5]
Unfortunately, Eq. 5 cannot be used in Eq. 3 directly, because τ² and β² are unknown and must be estimated from the data. The details of the estimation are presented in ref. 10.
The resulting explicit estimate for θ_{j} is

θ̂_{j} = (1 − (β̂²/N)·ĉ)·X̄_{j}, [6]

where ĉ is an estimator for 1/(β²/N + τ²).
Finally, we substitute θ̂_{j} from Eq. 6 into the correlation coefficient in Eq. 3 wherever (X_{j})_{offset} appears to obtain an explicit formula for S(X_{.j}, X_{.k}).
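Under the stated assumptions, the shrunk estimators can be sketched as follows. This is our illustrative reading of Eqs. 5 and 6, with β² estimated by the pooled within-gene sample variance and ĉ by the classical James–Stein form (M − 2)/Σ_{k} X̄_{k}²; the paper's exact estimators are given in ref. 10:

```python
import numpy as np

def shrunk_means(X):
    """Shrunk per-gene mean estimates (a sketch of Eqs. 5-6).
    X is an N x M matrix: N conditions, M genes (M >> N, M > 2)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    xbar = X.mean(axis=0)                        # unbiased per-gene means
    beta2 = X.var(axis=0, ddof=1).mean()         # pooled variance estimate
    c_hat = (m - 2) / np.sum(xbar ** 2)          # James-Stein-style estimator
    gamma = max(0.0, 1.0 - (beta2 / n) * c_hat)  # shrinkage factor, clipped at 0
    return gamma * xbar, gamma
```

Note that the same scalar γ multiplies every X̄_{j}, which is exactly the offset G_{offset} = γḠ used in the modified correlation coefficient.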
Algorithm and Implementation
The implementation of hierarchical clustering proceeds in a greedy manner, always choosing the most similar pair of elements (starting with genes at the bottommost level) and combining them to create a new element. The “expression vector” for the new element is simply the weighted average of the expression vectors of the two elements that were combined. This structure of repeated pairwise combinations is conveniently represented in a binary tree, whose leaves are the set of genes and internal nodes are the elements constructed from the two children nodes. The algorithm is described below in pseudocode.
Hierarchical clustering pseudocode.
Given gene-expression vectors G_{1}, …, G_{M}, each initially its own cluster.
Switch:
Pearson: γ = 1;
Eisen: γ = 0;
Shrinkage: γ as computed from the data (Eq. 6).
While (no. of clusters > 1) do
Compute similarity table S(G_{j}, G_{k}) for all pairs j < k;
Find (j*, k*) = arg max_{j<k} S(G_{j}, G_{k});
Create new cluster N_{j*k*} = weighted average of G_{j*} and G_{k*};
Take out clusters j* and k*.
As each internal node can be labeled by a value representing the similarity between its two children nodes, one can create a set of clusters by simply breaking the tree into subtrees by eliminating all the internal nodes with labels below a certain predetermined threshold value.
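The greedy procedure above can be sketched directly (an illustrative re-implementation, not the NYUMAD code; `sim` is any pairwise similarity, such as the shrinkage coefficient, and the threshold cut is folded into the loop):

```python
import numpy as np

def hierarchical_cluster(G, sim, threshold):
    """Greedy centroid-linkage clustering: repeatedly merge the most
    similar pair of clusters, replacing them with the weighted average
    of their members, until the best similarity falls below threshold."""
    clusters = [(np.asarray(g, dtype=float), 1) for g in G]  # (centroid, size)
    merges = []
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for j in range(len(clusters)):
            for k in range(j + 1, len(clusters)):
                s = sim(clusters[j][0], clusters[k][0])
                if s > best:
                    best, pair = s, (j, k)
        if best < threshold:
            break  # corresponds to cutting the tree at the threshold
        (cj, nj), (ck, nk) = clusters[pair[0]], clusters[pair[1]]
        merged = ((nj * cj + nk * ck) / (nj + nk), nj + nk)
        merges.append((pair, best))
        clusters = [c for i, c in enumerate(clusters) if i not in pair]
        clusters.append(merged)
    return clusters, merges
```

Clusters that never merge above the threshold remain as singletons, matching the treatment of unclustered genes described below.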
The implementation of generalized hierarchical clustering with options to choose different similarity measures has been incorporated into the New York University MicroArray Database (NYUMAD), an integrated system to maintain and analyze biological abundance data along with associated experimental conditions and protocols. To enable widespread utility, NYUMAD supports the MAGE-ML standard (www.mged.org/Workgroups/MAGE/mageml.html) for the exchange of gene expression data, defined by the Microarray Gene Expression Data Group. More detailed information about NYUMAD can be found at http://bioinformatics.cat.nyu.edu/nyumad/.
Results
Mathematical Simulation. To compare the performance of these algorithms, we started with a relatively simple in silico experiment. In such an experiment, one can create two genes X and Y and simulate N (≈100) experiments as follows:

X_{i} = X̄ + α_{i} + N(0, σ_{X}²), Y_{i} = Ȳ + α_{i} + N(0, σ_{Y}²),

where α_{i}, chosen from a uniform distribution over a range [L, H], is a “bias term” introducing a correlation (or none, if all αs are zero) between X and Y. X̄ and Ȳ are the means of X and Y, respectively. Similarly, σ_{X} and σ_{Y} are the standard deviations for X and Y, respectively.
Note that, with this model, S takes its clairvoyant value if the exact values of the mean and variance are used.
The model was implemented in Mathematica (11); the following parameters were used in the simulation: N = 100, τ ∈ {0.1, 10.0} (representing very low or high variability among the genes), σ_{X} = σ_{Y} = 10.0, and either α_{i} = 0, representing no correlation between the genes, or α_{i} drawn from a nonzero range [L, H], representing some correlation between the genes. Once the parameters were fixed for a particular in silico experiment, the gene-expression vectors for X and Y were generated many thousands of times, and for each pair of vectors S_{c}(X, Y), S_{p}(X, Y), S_{e}(X, Y), and S_{s}(X, Y) were estimated by four different algorithms and further examined to see how the estimators of S varied over these trials. These four different algorithms estimated S according to Eqs. 1 and 2 as follows: Clairvoyant estimated S_{c} by using the true values of θ_{X}, θ_{Y}, σ_{X}, and σ_{Y}; Pearson estimated S_{p} by using the unbiased estimators X̄ and Ȳ of θ_{X} and θ_{Y} (for X_{offset} and Y_{offset}), respectively; Eisen estimated S_{e} by using the value 0.0 as the estimator of both θ_{X} and θ_{Y}; and Shrinkage estimated S_{s} by using the shrunk biased estimators θ̂_{X} and θ̂_{Y} of θ_{X} and θ_{Y}, respectively. In the latter three, the standard deviation was estimated as in Eq. 2. The histograms corresponding to these in silico experiments can be found in Fig. 1. Our observations are summarized in Table 1.
In summary, one can conclude that for the same clustering algorithm, Pearson tends to introduce more FNs and Eisen tends to introduce more FPs than Shrinkage. Shrinkage, on the other hand, reduces these errors by combining the good properties of both algorithms.
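The generative model of this simulation can be sketched as follows. This is our hedged reading of the setup described above (per-gene means drawn from the N(0, τ²) prior, a shared uniform bias term); parameter names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100, tau=10.0, sigma=10.0, alpha_range=(0.0, 0.0)):
    """One in silico trial: X and Y share the uniform bias term alpha_i,
    which induces a correlation (none when alpha_range is degenerate at 0)."""
    theta_x, theta_y = rng.normal(0.0, tau, size=2)  # per-gene means ~ N(0, tau^2)
    alpha = rng.uniform(alpha_range[0], alpha_range[1], size=n)
    x = theta_x + alpha + rng.normal(0.0, sigma, size=n)
    y = theta_y + alpha + rng.normal(0.0, sigma, size=n)
    return x, y
```

Repeating `simulate` many times and scoring each (x, y) pair with the different offset choices reproduces the kind of histogram comparison reported in Fig. 1.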
Biological Example. We then proceeded to test the algorithms on a biological example. We chose a biologically well-characterized system and analyzed the clusters of genes involved in the yeast cell cycle. These clusters were computed by using the hierarchical clustering algorithm with the underlying similarity measure chosen from the following three: Pearson, Eisen, or Shrinkage. As a reference, the computed clusters were compared to the ones implied by the common cell-cycle functions and regulatory systems inferred from the roles of various transcriptional activators (see Fig. 2).
Note that our experimental analysis is based on the assumption that the groupings suggested by the chromatin immunoprecipitation analysis are, in fact, correct and thus, provide a direct approach to compare various correlation coefficients. It is quite likely that the chromatin immunoprecipitation-based groupings themselves contain many false relations (both positive and negative) and corrupt our inference in some unknown manner. Nonetheless, we observe that the trends of reduced false positives and negatives in shrinkage analysis with these biological data are consistent with the analysis based on mathematical simulation and hence, reassuring.
In the work of Simon et al. (12), genome-wide location analysis was used to determine how the yeast cell-cycle gene expression program is regulated by each of the nine known cell-cycle transcriptional activators: Ace2, Fkh1, Fkh2, Mbp1, Mcm1, Ndd1, Swi4, Swi5, and Swi6. It was also found that cell-cycle transcriptional activators that function during one stage of the cell cycle regulate transcriptional activators that function during the next stage. This serial regulation of transcriptional activators together with various functional properties suggests a simple way of partitioning some selected cell-cycle genes into nine clusters, each one characterized by a group of transcriptional activators working together and their functions (see Table 2): for instance, group 1 is characterized by the activators Swi4 and Swi6 and the function of budding; group 2 is characterized by the activators Swi6 and Mbp1 and the function involving DNA replication and repair at the juncture of G_{1} and S phases, etc.
Upon closer examination of the data, we observed that the data in its raw “pre-normalized” form is inconsistent with the assumptions used in deriving γ: (i) the gene vectors are not range-normalized, so β_{j}² ≠ β² for every j, and (ii) the N experiments are not necessarily independent.
Range normalization and subsampling of experiments were used before clustering in an attempt to alleviate these shortcomings. The clusters on the processed data set, thresholded at the cutoff value of 0.60, are listed in Tables 3, 4, 5. The choice of the threshold parameter is discussed further in Discussion.
Our initial hypothesis can be summarized as follows: Genes expressed during the same cellcycle stage and regulated by the same transcriptional activators should be in the same cluster. We compared the performance of the similarity metrics based on the degree to which each of them deviated from this hypothesis. Below we list some of the observed deviations from the hypothesis.
Possible FPs.

Bud9 (group 1: budding), Egt2 (group 7: cytokinesis), and Cdc6 (group 8: prereplication complex formation) are placed in the same cluster by all three metrics: (E68, S49, and P51).

Mcm2 (group 2: DNA replication and repair) and Mcm3 (group 8) are placed in the same cluster by all three metrics: (E68, S15, and P15).

For more examples, see ref. 10.
Possible FNs. Group 1: budding (Table 2) is split into five clusters by the Eisen metric: {Cln1, Och1} ∈ E58, {Cln2, Msb2, Rsr1, Bud9, Mnn1, Exg1} ∈ E68, Gic1 ∈ E29, Gic2 ∈ E64, and {Kre6, Cwp1} ∈ E33; into four clusters by the Shrinkage metric: {Cln1, Bud9, Och1} ∈ S49, {Cln2, Gic2, Msb2, Rsr1, Mnn1, Exg1} ∈ S6, Gic1 ∈ S32, and {Kre6, Cwp1} ∈ S65; and into eight clusters by the Pearson metric: {Cln1, Och1} ∈ P1, {Cln2, Rsr1, Mnn1} ∈ P15, Gic1 ∈ P29, Gic2 ∈ P2, {Msb2, Exg1} ∈ P3, Bud9 ∈ P51, Kre6 ∈ P11, and Cwp1 ∈ P62.
We introduced a new notation to represent the resulting cluster sets, and a scoring function to aid in their comparison.
Each cluster set can be written in a notation where x denotes the group number (as described in Table 2), n_{x} is the number of clusters that group x appears in, and for each cluster j ∈ {1,..., n_{x}} there are y_{j} genes from group x and z_{j} genes from other groups in Table 2. A value of * for z_{j} denotes that cluster j contains additional genes, although none of them are cell-cycle genes. The cluster set can then be scored according to the following measure:

FP(γ) = no. of gene pairs {j, k} that are in different groups of Table 2 but are placed in the same cluster, [7]

FN(γ) = no. of gene pairs {j, k} that are in the same group of Table 2 but are placed in different clusters, [8]

Error(γ) = FP(γ) + FN(γ). [9]
Table 2 contains those genes from Fig. 2 that were present in our data set. Tables 3, 4, 5 contain these genes grouped into clusters by a hierarchical clustering algorithm using the three metrics (Eisen in Table 3, Shrinkage in Table 4, and Pearson in Table 5) thresholded at a correlation coefficient value of 0.60. Genes that have not been grouped with any others at a similarity of 0.60 or higher are absent from the tables; in the subsequent analysis they are treated as singleton clusters.
The subsampled data yielded the estimate γ ≃ 0.66. In our set notation, the resulting Shrinkage clusters with the corresponding error score computed as in Eq. 9 can be written as follows: The error scores for the Eisen (γ = 0.0) and Pearson (γ = 1.0) cluster sets, computed according to Eq. 9, are
From the data shown in Tables 3, 4, 5, as well as by comparing the error scores, one can conclude that for the same clustering algorithm and threshold value, Pearson tends to introduce more FNs and Eisen tends to introduce more FPs than Shrinkage, as Shrinkage reduces these errors by combining the good properties of both algorithms. This observation is consistent with our mathematical analysis and the simulation presented above.
We have also conducted a more extensive computational analysis of Eisen's data, but omitted it from this article because of space limitations. This analysis appears in a full technical report available for download from www.cs.nyu.edu/cs/faculty/mishra/ (10).
Discussion
Microarray-based genomic analysis and other similar high-throughput methods have begun to occupy an increasingly important role in biology, as they have helped to create a visual image of the state-space trajectories at the core of cellular processes. Such analysis speaks directly to the observational nature of the new biology. As a result, we need to develop our ability to “see,” accurately and reproducibly, the information in the massive amount of quantitative measurements produced by these approaches or be able to ascertain when what we see is unreliable and forms a poor basis for proposing novel hypotheses. Our investigation demonstrates the fragility of many of these analysis algorithms when used in the context of a small number of experiments. In particular, we see that a small perturbation of, or a small error in the estimation of, a parameter (the shrinkage parameter) has a significant effect on the overall conclusion. The errors in the estimators manifest themselves by missing certain biological relations between two genes (FNs) or by proposing phantom relations between two otherwise unrelated genes (FPs).
A global picture of these interactions can be seen in Fig. 3, the receiver operating characteristic (ROC) figure, with each curve parametrized by the cutoff threshold in the range of [–1, 1]. An ROC curve (13) for a given metric plots sensitivity against (1 – specificity), where

Sensitivity(γ) = TP(γ)/(TP(γ) + FN(γ)) = fraction of positives detected by a metric, [10]

Specificity(γ) = TN(γ)/(TN(γ) + FP(γ)) = fraction of negatives detected by a metric, [11]

and TP(γ), FN(γ), FP(γ), and TN(γ) denote the number of true positives, false negatives, false positives, and true negatives, respectively, arising from a metric associated with a given γ. (Recall that γ is 0.0 for Eisen, 1.0 for Pearson, and is computed according to Eq. 6 for Shrinkage, which yields 0.66 on this data set.) For each pair of genes, {j, k}, we define these events using our hypothesis (see above) as a measure of truth:
TP: {j, k} are in the same group (see Table 2) and {j, k} are placed in the same cluster;
FP: {j, k} are in different groups, but {j, k} are placed in the same cluster;
TN: {j, k} are in different groups and {j, k} are placed in different clusters; and
FN: {j, k} are in the same group, but {j, k} are placed in different clusters.
FP(γ) and FN(γ) were already defined in Eqs. 7 and 8, respectively, and we define

TP(γ) = no. of gene pairs {j, k} that are in the same group and are placed in the same cluster, [12]

TN(γ) = Total − TP(γ) − FP(γ) − FN(γ), [13]

where Total = (44 · 43)/2 = 946 is the total no. of gene pairs {j, k} in Table 2.
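The pairwise TP/FP/TN/FN counts and the resulting ROC point can be computed mechanically from the reference groups and the computed clusters; a minimal sketch (the function name `roc_point` is ours):

```python
from itertools import combinations

def roc_point(truth_groups, clusters):
    """Sensitivity and specificity (Eqs. 10-13) over all gene pairs,
    taking the reference grouping as the measure of truth."""
    group_of = {g: i for i, grp in enumerate(truth_groups) for g in grp}
    cluster_of = {g: i for i, cl in enumerate(clusters) for g in cl}
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(group_of), 2):
        same_group = group_of[a] == group_of[b]
        same_cluster = cluster_of[a] == cluster_of[b]
        if same_group and same_cluster:
            tp += 1
        elif same_group:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)
```

Genes that remain unclustered at the threshold should be passed in as singleton clusters, consistent with the treatment described above; sweeping the threshold and plotting the resulting points traces one ROC curve per metric.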
Fig. 3 suggests the best threshold to use for each metric and can also be used to select the best metric to use for a particular sensitivity.
The dependence of the error scores on the threshold can be more clearly seen from Fig. 4. It shows that the conclusions we draw above hold for a wide range of threshold values, and hence a threshold value of 0.60 is a reasonable representative value.
As a result, to study the clustering algorithms and their effectiveness, one may ask the following questions. If one must err, is it better to err on the side of more FPs or more FNs? What are the relative costs of these two kinds of errors? Intelligent answers to our questions depend crucially on how the cluster information is used in the subsequent discovery processes.
Acknowledgments
This research was conducted under the support of the National Science Foundation's Qubic program, the Defense Advanced Research Projects Agency, a Howard Hughes Medical Institute biomedical support research grant, the U.S. Department of Energy, the U.S. Air Force, the National Institutes of Health, the National Cancer Institute, the New York State Office of Science, Technology, and Academic Research, a National Science Foundation Graduate Research Fellowship, and a New York University McCracken Fellowship.
Footnotes

§ To whom correspondence should be addressed. E-mail: mishra@nyu.edu.

Abbreviations: FN, false negative; FP, false positive; TP, true positive; TN, true negative.
 Received December 19, 2002.
 Copyright © 2003, The National Academy of Sciences
References
1. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 14863–14868. PMID: 9843981.
2. Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P. O. & Davis, R. W. (1996) Proc. Natl. Acad. Sci. USA 93, 10614–10619. PMID: 8855227.
3. DeRisi, J. L., Iyer, V. R. & Brown, P. O. (1997) Science 278, 680–686. PMID: 9381177.
4. Spellman, P. T., Sherlock, G., Zhang, M., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273–3297. PMID: 9843569.
5. Chu, S., DeRisi, J. L., Eisen, M. B., Mulholland, J., Botstein, D., Brown, P. O. & Herskowitz, I. (1998) Science 282, 699–705. PMID: 9784122.
6. (Entry missing in source.)
7. Sokal, R. R. & Michener, C. D. (1958) Univ. Kans. Sci. Bull. 38, 1409–1438.
8. James, W. & Stein, C. (1961) in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics, ed. Neyman, J. (Univ. of California Press, Berkeley), Vol. 1, pp. 361–379.
9. (Entry missing in source.)
10. Cherepinsky, V., Feng, J., Rejali, M. & Mishra, B. (2003) NYU CS Technical Report 2003-845 (New York University, New York).
11. Wolfram, S. (1999) The Mathematica Book (Cambridge Univ. Press, Cambridge, U.K.), 4th Ed.
12. Simon, I., Barnett, J., Hannett, N., Harbison, C. T., Rinaldi, N. J., Volkert, T. L., Wyrick, J. J., Zeitlinger, J., Jennings, E. G. & Young, R. A. (2001) Cell 106, 697–708.
13. Egan, J. P. (1975) Signal Detection Theory and ROC Analysis (Academic, New York).