New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Incorporating genespecific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data

Edited by Tomoko Ohta, National Institute of Genetics, Mishima, Japan, and approved February 3, 2005 (received for review November 8, 2004)
Abstract
Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where genespecific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of loglikelihoods over all genes. The other criterion is obtained by dividing the loglikelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a twostage hierarchical framework. The first level of the hierarchy represents genespecific evolutionary features, and the second represents sitespecific features for given genes. For testing significance of the optimal topology, we suggest a twostage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for genespecific evolutionary features. We discuss the applicability of the twostage bootstrap idea to the Kishino–Hasegawa test and the Shimodaira–Hasegawa test.
Phylogenetic relationships can be estimated by a frequentist, Bayesian, or parsimony framework (for a general overview, see ref. 1). Within the frequentist framework, available procedures include maximum likelihood and distance methods. The bootstrap procedure (e.g., refs. 2–4) is widely employed to calculate the significance of a topology that is obtained with the frequentist or parsimony approaches. This procedure assumes that all columns of sequence data are samples from an independent and identical distribution, which we will refer to as the “iid assumption.” Although site dependency seems to be a more realistic assumption for biological sequence evolution (e.g., refs. 5–8), the iid assumption is widely employed because it saves computation time and convenient asymptotic theories of statistics can then be applied.
Because of the increase of genomic data, phylogeny estimation and testing can be based on multiple genes (e.g., refs. 9–12). A simple approach is to concatenate multiple genes, estimate the true tree, and then apply a bootstrap procedure to test whether the estimated topology is significantly better than alternatives. Concatenation may violate the iid assumption. Because different genes may have been subject to different evolutionary pressures, it may be appropriate to describe each gene by its own parameter set. The set may include parameters for nucleotide frequencies, branch lengths, etc. This sort of separate analysis of genes may be preferable to concatenation (refs. 9–11 and 13–15, but see ref. 16).
At the extreme, different genes may even support different tree topologies. Permutation tests (17–19) are available to detect departures among genes in “homogeneity” of tree support. However, these tests simply examine whether genes support congruent trees and whether we can ignore genespecific effects via concatenation. These tests fail to give further direction about combining the separate information when multiple genes support different trees.
To determine the optimal topology even when different genes support different topologies, we should consider what topology selection criterion is most appropriate. The usual sum of parsimony or loglikelihood scores over all columns of separately analyzed multiple genes can be the basis for selecting an optimal tree (e.g., refs. 9–11), and we will refer to this as the “sum criterion.” The sum criterion is superior to sequence concatenation because it is affected by genespecific evolutionary features. When the sum criterion is used to select optimal tree topologies, it is critical to determine how to test the significance of the selected trees. A simple application of the bootstrap procedure merges across genes the pool of sitewise loglikelihood or parsimony scores and then samples sitewise scores from this merged pool with replacement. The simple onestage procedure may be illadvised because the variation among genes is not properly considered and because the iid assumption among scores within the merged pool is no longer valid. Here, we apply a twostage bootstrap procedure (20) to avoid problems associated with the onestage version.
Another criterion to select tree topologies is the “average criterion.” If different sequences have different lengths, the sum of scores might fail to be a good criterion because long sequences may have a big impact on the sum of scores and the resulting optimal tree will reflect mainly the tree supported by long sequences and their genespecific features. This sum of scores criterion therefore might be sensitive to long genes that have experienced unusual evolutionary processes. To remove the effect of sequence length, one can assign each gene an average score per site and then average these averages. In a similar way to the sum criterion, the twostage bootstrap procedure can then be employed.
For the justification of our twostage bootstrap procedure, we assume a hierarchical structure of multigene data set generation. First, evolutionary characteristics of each gene are determined. These evolutionary features are summarized by parameters representing nucleotide frequencies, branch lengths, etc. The parameter sets for different genes are independent samples from some common distribution of parameter sets. According to this hierarchical framework, each gene can be visualized as being determined by a common mechanism with perturbation. Because of the perturbation, genes do not all have identical properties. Instead, their properties follow an unknown true distribution. Second, once the properties of each gene are determined, they are then expressed through the sequence columns. These columns are distributed around the evolutionary features specific to a particular gene. The degree of dispersion of sequence columns around the features of a gene can vary among different genes.
Although it is possible that the evolutionary features of a gene and its length are correlated, we assume that this is not the case. Therefore, our twostage iid assumption leads to a twostage bootstrap procedure (20) for testing the significance of a tree. In this context, the first stage of the twostage bootstrap procedure is to sample genes. The second stage samples columns of sequence data with replacement within resampled genes. Although sequence lengths are generally long, the number of available genes may be relatively small. Thus, the expected value of the estimated variance from the twostage bootstrap procedure may be seriously different from the true variance. This bias should be considered when testing hypotheses. Under the twostage iid assumption, it is possible to analytically calculate the variance of bootstrap resamples. We show how to do this for testing the significance of a topology.
The twostage hierarchical structure is similar to a random effects model (21). One main purpose of the conventional random effects model is to test for a difference between groups. In our case, groups correspond to genes. A normality assumption is needed for ease of testing. The purpose of our procedure is to test whether the mean of the random effects model is equal to zero. We note that this test can be done without an assumption about the type of distribution by employing a nonparametric bootstrap procedure.
Methods
Here, we show how the hierarchial structure of genes and sequence columns can be used in determining an optimal tree topology and then testing its significance in a likelihood context. Kullback–Leibler distance (KLD) (22) is a distance measure between two models. In our case, these models correspond to tree topologies. We show that our hierarchical assumption of genes and columns can be directly extended to the KLD measure.
KLD. For the ith gene, KLD_{i} between the true but unknown datagenerating mechanism f(·) and the model g(·θ) is: where the x_{ij} values are sampled sequence columns from f(·) and n_{i} is the length of gene i. Each term of {log f(x_{ij} ) – log g(x_{ij} θ)} is a sitewise KLD_{i} . The estimate of the minimum KLD_{i} can be obtained with the maximum of , which is denoted mKLD_{i} here. That is, we obtain mKLD_{i} at maximum likelihood estimates () under the model g(·θ),
The mKLD_{i} is used to choose the model (topology) which is closest to the unknown true model. Because we do not know the true datagenerating mechanism f(·), we cannot directly calculate mKLD_{i} . However, the log f(x_{ij} ) terms are common to all competing models and mKLD_{i} differences can be used in model comparison. Suppose two models (topologies) g _{1} and g _{2} are compared, then where and are maximum likelihood estimates under the model g _{1}(·θ_{1}) and g _{2}(·θ_{2}). The term is a sitewise mKLD_{i} difference and is denoted . If the quantity of Eq. 2 is significantly larger (smaller) than zero, model g _{2} (g _{1}) is termed closer to the true model. The variance of is considered in testing significance. The log and log terms are not exactly independent among different sites j (j = 1,..., n_{i} ) because and are functions of all x_{ij} values. However, for large n_{i} , and are close to the true values θ_{1} and θ_{2}, and these true values minimize KLD between f(·) and g _{1}(·θ_{1}), and between f(·) and g _{2}(·θ_{2}). Thus, it is almost correct to regard log and log as functions of solely x_{ij} . Because x_{ij} has a hierarchical iid structure, functions of x_{ij} also have a hierarchical iid structure.
Testing Phylogeny in a Likelihood Context. Suppose we want to know whether tree topology a or b is closer to the truth. If the estimated difference of mKLD_{i} between the two tree topologies deviates significantly from zero, it means that one of the two trees is significantly closer to the unknown true distribution than the other [Kishino–Hasegawa (KH) test, ref. 3]. If one of a and b is the optimal tree as estimated by maximum likelihood or parsimony rather than a tree that was of interest apriori, the uncertainty of the estimate of the optimal tree should be considered in obtaining a confidence set of trees [Shimodaira–Hasegawa (SH) test, ref. 4]. Here, we show how to apply the hierarchical structure of to the KH and SH tests.
For the maximum likelihood estimates of parameters, let the loglikelihood values calculated at the jth column of the ith gene under trees a and b be l_{a} _{,} _{ij} and l_{b} _{,} _{ij} respectively. Let the sitewise loglikelihood difference be , which is the sitewise difference of mKLD.
We consider the distribution of sitewise differences of mKLD. That is, for K genes, we assume that y_{ab} _{,} _{ij} (i = 1,..., K; j = 1,..., n_{i} ) is an observation of random variable Y_{ab} _{,} _{i} with the properties and . Except for a finite mean and variance, little is assumed about the type of the distribution. The values need not be the same among genes i, but we do require independence between w_{ab,i} and . The expected value of is denoted . We consider the hierarchical structure where w_{ab,i} is an observation of random variable W_{ab} with the properties . We assume that gene lengths n_{i} are random variables that are independent of W_{ab,i} and with E(n_{i} ) = n (0 < n <∞) and Var(n_{i} ) = σ _{n} < ∞. The hierarchical structure in our approach is similar to a random effects model. Although a conventional goal with the random effects model would be to test whether exceeds zero, we are more interested here in testing whether μ _{ab} is zero. For our application, normality assumptions for Y_{ab} _{,} _{i} and W_{ab} are not required.
KH test for multiple genes. Results pertaining to the KH test for multiple genes are below and are justified in the supporting information, which is published on the PNAS web site. We consider the testing problem, H _{0}: μ _{ab} = 0 versus H _{1}: μ _{ab} ≠ 0. If μ _{ab} is greater (less) than zero, then tree a (tree b) can be regarded as more reliable. Define S_{ab} as the sum of sitewise loglikelihood differences between tree a and b over all columns of all genes,
This means
If S_{ab} is far from zero, then H _{0} is rejected. In general, n ^{2} and σ^{2} _{n} are big. Because σ^{2} _{ab} _{,} _{W} is multiplied by K(n ^{2} + σ^{2} _{n} ), the amonggene variation (σ^{2} _{ab} _{,} _{W} ) can have a big impact on the variance of S_{ab} .
To approximate the sampling distribution of S_{ab} under H _{0},weuse twostage bootstrap resampling. First, we resample genes (denoted by *) and second, we resample columns of the resampled genes (denoted by ★). Because it is computationally intensive to maximize likelihoods for resampled data, we employ the idea of the RELL method (3). That is, we consider twostage resampling from the set of likelihood values that are already calculated instead of resampling gene and sequence columns. Let the resampled loglikelihood value at the jth column of the ith gene under trees a and b be and . Define where and is the length of resampled ith gene. We have and where y _{ab} _{,} _{i} is (y_{ab} _{,} _{i} _{1},..., y_{ab} _{,} _{ini} ) ^{T} . The expected values of Eq. 5 and 6 are and Eq. 7 means that we should consider the bias in the estimation of Var(S_{ab} ) by . This bias is represented by the second term of Eq. 7. Also in testing H _{0}, this bias should be considered.
Following two guidelines of the bootstrap method by Hall and Wilson (23), we approximate the distribution of with the distribution of , where and are the square roots of the unbiased estimators of Var(S_{ab} ) and . We have
If is outside the 95% interval of , then we reject the null hypothesis H _{0}: μ _{ab} = 0. When gene number is large and sequences are long, asymptotically follows a standard normal distribution when H _{0} is true and the test can be performed without a bootstrap procedure. SH test for multiple genes. In many cases, we want to test significance between an optimal and another tree, not between two trees that are both selected prior to the analysis. The SH test is designed for this situation because it considers the uncertainty of choosing the optimal tree (ref. 4; see also ref. 24). Often, the optimal tree is selected from a set of candidates, and the goal is to test whether the optimal tree is significantly better than the others. The candidates that are not significantly worse than the optimal tree are used to construct the confidence set of trees.
The definition of S_{ab} in Eq. 3 can be rewritten
Eq. 10 suggests the sum criterion to determine the optimal tree from multiple genes. Suppose there are p candidate trees, one of which is the optimal tree . In our method,
To apply the twostage bootstrap to the SH test, let H_{ab} and be , where a, b = 1,..., p. If a = b, H_{ab} and are zero. Under the least favorable configuration (4) in which the expected loglikelihood sum is the same over all trees, the hypothesized value of μ _{ab} is equal to zero, and this makes . Following the steps below, we can construct a confidence set of trees.

For each tree τ(τ ∈ {1,..., p}), calculate the test statistic .

Generate q sets of twostage resampled loglikelihood values .

Calculate from and , for all a, b = 1,..., p.

For each tree τ with each resampled data set r (r = 1,..., q), calculate , where of is estimated with resampled data as follows

For each τ (τ ∈ {1,..., p}), if T _{τ} is not in the critical region of the distribution of with level α, τ is included in the confidence set of trees.
The Average Criterion to Select Tree Topology. Above, we considered the sum criterion to select the tree topology. As noted, the sum of loglikelihood values over all genes can be sensitive to relatively long genes that happened to experience atypical evolutionary processes. To remove the effect of sequence length, we consider an alternative to the sum criterion that we refer to as the average criterion.
If we define and as then and . Letting n ^{⊖} and n ^{⊗} be the expected values of 1/n_{i} and 1/n ^{2} _{i} , the variance of and are and the expectation of is
In a similar way to Eq. 7, we should consider the bias in the estimation of when using . The unbiased estimators of and are, respectively,
We can approximate the distribution of with the distribution of . The extension of the KH and SH tests for the average criterion is straightforward. With this criterion, optimal trees from original and resampled data, and , are
The tree inferred with the average criterion is not necessarily the same as the tree obtained by the sum criterion (see below for examples).
Results
Data. As an example, we analyzed data from Cao et al. (10). Sitewise loglikelihood values (l_{a} _{,} _{ij} 's) of mitochondrial genes were obtained from the authors. The purpose of Cao et al.'s work was to find the position of turtle within the amniotes group (for the 15 tree topologies considered by Cao et al., see Table 1). Rather than placing our focus on the turtle position, here we concentrate on our twostage resampling procedure and the difference between our method and the weighted SH test after sequence concatenation or after merging separately calculated loglikelihoods. Among 14 mitochondrial genes, we excluded 12S and 16S rRNA data. These two rRNA genes were analyzed by Cao et al. (10) with a nucleotide substitution model, whereas the other 12 genes are proteincoding genes, and Cao et al. (10) used an amino acid replacement model to analyze them. With our hierarchical approach, it may be unreasonable to assume that the former two genes are observations from the same distribution as the 12 proteincoding genes. Tree support varies among the 12 proteincoding genes. When the 12 proteincoding genes were separately analyzed by Cao et al. (10), they found that topologies 2, 3, 4, and 9 were the maximum likelihood topologies for 5, 5, 1, and 1 genes, respectively.
Variability Among Genes. We compared the distribution of sitewise loglikelihood (l_{a} _{,} _{ij} ) values between gene pairs and tested whether those distributions are significantly different. To do this, we performed the nonparametric Kolmogorov–Smirnov test (25). Many pairwise comparisons show a significant difference between genes (see supporting information). We obtained qualitatively similar results for all 15 topologies. This finding implies that ignoring genespecific effects may be unwarranted and that neither sequence concatenation nor a conventional onestage bootstrap procedure are recommended. In a similar way, we considered topology pairs a and b and tested the amonggene variability of sitewise differences of loglikelihood (y_{ab} _{,} _{ij} ) values. We observed high variability among genes for most tree pairs that were examined (data not shown).
Comparison with SH Test. Because the second term of Eq. 7 is not negligible in general, we need proper “weighting” by and as described in Methods to approximate the distribution of (S _{ab} – Knμ _{ab} ) with the distribution of . The weighting scheme also can be adapted to the simple SH test. We refer to this adaptation as the weighted SH (WSH) test (4). Because it considers weighting in its test statistic, our method corresponds more to the WSH test than the unweighted SHtest. We obtained optimal tree topologies with the sum and average criteria and we applied the twostage resampling idea to the SH test. We compared our two stage procedures to the WSH test after sequence concatenation and the WSH test after merging separately calculated loglikelihoods (Table 1).
We applied the WSH test by using the consel software package (26) to sample with replacement from the sitewise loglikelihoods obtained in the analysis of Cao et al. (10). The bootstrap probabilities (BPs) of concatenation and merging procedures were calculated with the consel software. The BPs of our twostage procedure were calculated in similar ways to Eqs. 12 and 13 without centering.
In our twostage procedures, topologies 2 and 3 are obtained by the sum and average criteria. Topology 2 is also obtained from the procedure of sequence concatenation and merging loglikelihoods. The confidence sets of topologies in Table 1 contain the same trees (trees 2–4) with a 5% significance level, but the P values of the topologies vary among sets.
Discussion
In the Methods, we explained our method in a likelihood context. If we redefine l_{a} _{,} _{ij} as the parsimony score of the jth site of the ith gene under tree a, then getting and testing the optimal tree by parsimony could be done as described in Methods. The idea can also be straightforwardly applied to distance matrix methods. Because the hierarchical structure of gene and sequence columns can be extended to the hierarchical structure of mKLD, it could be also extended to a hierarchical structure of evolutionary distance. From the multiple distance matrices obtained in analyses of separate genes, an average distance matrix could be calculated. The optimal tree would then be reconstructed with this average distance matrix. To assign bootstrap support to subclades, a twostage bootstrap procedure could be adopted. The idea of an average distance matrix corresponds to the average criterion. The sum criterion also can be extended to the distance method. We can multiply each distance matrix with the length of the gene. The tree can be reconstructed with the sum of these multiple matrices. A twostage bootstrap procedure can be employed to get the bootstrap support. For long branches on an evolutionary tree of a gene, the estimated distance can be extremely large and have poor precision. Therefore, instead of using the average or sum of distances, it might be better to use the median of multiple estimated distances.
It is possible to estimate . For gene i, can be obtained with samples of y _{ab,ij} (j = 1,..., ni). Using the fact that of Eq. 8 is an unbiased estimator of Var(S_{ab} ) of Eq. 4 together with estimated , , and , we can obtain . To calculate the precision of , we should specify the distributions of y_{ab} _{,} _{ij} and n_{i} , or at least know their third and fourth moments. Because we make minimal assumptions about the first two moments of y_{ab} _{,} _{ij} and n_{i} , we do not try to quantify .If y_{ab} _{,} _{ij} and w_{ab} _{,} _{i} followed normal distributions with all values equal, the conventional random effects approach (21) could test whether . Instead, we try to make our method less dependent on the model type and try to make our method less parametric. For this reason, we investigate the heterogeneity of the distribution of y_{ab} _{,} _{ij} with the Kolmogorov–Smirnov test instead of directly quantifying . Even when , the distribution of y_{ab} _{,} _{ij} can be heterogeneous because the terms can vary among genes. When all genes have the same , there still may be variation among genes in impact on the sum criterion because of different sequence lengths n_{i} and because may not be zero. This means that concatenation of sequences or merging sitewise loglikelihoods followed by the iid assumption are not reasonable approaches.
From Eq. 4, we see that the sum of sitewise loglikelihood values over all columns of all genes is expected to have smaller variance when is small and that variances of the loglikelihoods of genes are significantly affected by the variance of sequence length. On the other hand, the average over all genes of the average loglikelihood per site removes the effect of sequence length and reflects the average structure among genes robustly. The sum and average criteria are two extremes among many possible weighting schemes. Intermediate schemes exist, but it is unclear which intermediate would be best. More experience with genomic data and familiarity with genespecific variation would help in determining how to select an optimal scheme.
From Cao et al.'s data (10), we excluded the two of 14 mitochondrial genes that are not proteincoding. We did this in order not to violate the hierarchical structure of the mKLD differences, y_{ab} _{,} _{ij} . We showed that hierarchical structure of genes and sites within genes are straightforwardly extended to the hierarchical structure of mKLD difference. With different amino acid (or nucleotide) substitution models in different genes, or if genes are seriously heterogeneous for reasons such as horizontal gene transfer, the hierarchical iid assumption of mKLD is violated. If topological difference has a big impact on mKLD but choice of evolutionary model does not, then different models for different genes or heterogeneous genes would not be a problem.
The 12 proteincoding genes from Cao et al. (10) are variable in terms of tree support. Topologies 2, 3, 4, and 9 are the maximum likelihood trees for 5, 5, 1, and 1 individually analyzed genes, respectively. The total concatenated sequence length is 3,235 aa. The sum of lengths of genes that support topology 2 is 1,651 aa, and the sum supporting topology 3 is 1,118 aa. This finding is consistent with the fact that the optimal tree is topology 2 under the concatenation and merging procedures with the sum criterion. The length of the COB gene, which supports topology 9, is 365 aa. This relatively short gene has little effect in the concatenation and merging approaches with the sum criterion. With our twostage resampling procedure, COB can be resampled more than once and can have a bigger effect. Because the average criterion removes the effect of sequence length, it produces higher P values and bootstrap probabilities than the sum criterion for topology 9. Other analyses that we have performed also indicate that the effect of gene sampling on phylogenetic inference can be substantial (e.g., see supporting information).
In our approach, sequence columns are assumed to be iid samples from a common distribution. However there may exist heterogeneous partitions within a gene (e.g., ref. 14). For example, the three codon positions of proteincoding genes show significant heterogeneity in evolutionary dynamics (e.g., ref. 28). For proteincoding sequences, partitioning could also be done according to structural environment of the position (e.g., buried or exposed to solvent, αhelix or βstrand or coil). Here, we do not intensively investigate the sometimes difficult issue of choosing data partitions. More understanding of the evolutionary process would surely assist in defining data partitions.
Although categorizing by codon position or structural environment may lead to substantial variation in evolutionary process among partitions, these categorization schemes do not fit well into our hierarchical framework. We have treated genespecific perturbations from a common evolutionary mechanism as random effects. Variation among partitions defined by codon position or protein structure would probably be better described with fixed effects. It is not easy to envision randomly sampling partitions defined by codon or protein structure.
Moreover, if each codon position or structural partition has an appropriate evolutionary model, the ratio of lengths between different branches on a tree might not be expected to vary among partitions. There is no obvious reason, for example, that we would expect all αhelix positions in a genome to evolve quickly relative to βstrand positions on one branch of a tree but slowly relative to βstrand positions on another branch of the tree. In contrast, genes are units on which phenotypic selection acts. Natural selection might induce genespecific effects on branch length that could vary among branches and would yield ratios of branch lengths that vary among genes.
There have been proposals to take into account the variability of evolutionary process among genes (e.g., refs. 28 and 29). One treatment assumes all genes share a topology and the branch lengths for different genes vary only by genespecific proportionality constants (28). Another assumes all genes evolved via the same topology, but there is no correlation when branch lengths of different genes are examined. Comparisons between the proportional branch lengths, the uncorrelated branch lengths, and the concatenated treatment have been made (30). Recently, the proportionality treatment was adapted to a Bayesian framework (31 and 32), and the uncorrelated treatment could be similarly adapted. In our procedure, all genes (or data partitions) are assumed to be independent samples from the same distribution. However, partitions that seriously violate this assumption (e.g., morphological versus sequence data) can be simultaneously analyzed in a Bayesian framework (refs. 31 and 32, see also ref. 33). Suchard et al. (34) recently introduced a promising hierarchical Bayesian approach that can pool information among genes without requiring either uncorrelated or strictly proportional branch lengths.
We have referred to the genespecific properties that are our focus here as being due to “perturbations” from a “common mechanism.” We have intentionally been somewhat vague as to the possible biological sources of these perturbations. The most obvious explanation for genespecific perturbations is mutation or natural selection. Natural selection operating on a trait coded by the gene could simultaneously affect the branch lengths or other parameters at all gene positions.
Although natural selection and mutation are likely to be important sources of genespecific perturbations in evolutionary parameters, they should not generate different topologies for different genes. Even if all genes evolved according to the same topology, genespecific perturbations may be partially responsible for variation among genes in the level of support for the shared topology. Our methods are designed to consider this variation.
A possibility that could result in different topological histories among genes is differential lineage sorting. Rannala and Yang (27) explicitly considered lineage sorting by ascribing variability among gene trees to ancestral polymorphism and stochasticity. There are many advantages to the direct and explicit treatment of Rannala and Yang (27). A possible disadvantage is that highly detailed models may lack robustness when assumptions are violated. Our less detailed treatment of genespecific properties assumes there is a central tree of interest and that random genespecific perturbations are unbiased relative to this tree. The fact that we do not specify the biological source of these perturbations can be viewed as either an advantage or disadvantage of our procedure.
Regarding the source of genespecific perturbations, another possibility is they arise from model misspecification. It may be that all genes share a common history but that the model employed to analyze them is not equally appropriate for all genes. If the effects of this model misspecification are unbiased among genes in that the average inference among genes is expected to be the truth, our hierarchical treatment would be appropriate. Undoubtedly, a more desirable solution would be to remedy the misspecification by improving the model.
Here, we use the sum and average criteria to infer the optimal tree from multigene data. We also introduce a twostage bootstrap procedure to test the significance of the optimal tree. Our method has an advantage over sequence concatenation in that it can effectively consider genespecific features. It is justified by the assumption of a hierarchical structure of genes and sites within genes. The method has the advantage of being able to combine the information from multiple genes that might support different trees. We expect that the rapid growth of genomic data will lead to a better understanding of the effects of gene sampling on evolutionary inference. For the time being, only a small number of genes are typically available. With these small data sets, it is especially important to consider effects of gene sampling on the bias and power of phylogenetic hypothesis testing.
Acknowledgments
We thank Y. Cao for providing data. This work was supported by the Institute for Bioinformatics Research and Development of the Japanese Science and Technology Corporation, Japanese Society for the Promotion of Science Grant 16300086, and National Science Foundation Grants INT990934 and DEB0120635.
Footnotes

↵ § To whom correspondence should be sent at the ‡ address. Email: seo{at}iu.a.utokyo.ac.jp

Author contributions: T.K.S., H.K., and J.L.T. designed research; T.K.S., H.K., and J.L.T. performed research; T.K.S. analyzed data; and T.K.S., H.K., and J.L.T. wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: KLD, Kullback–Leibler distance; KH, Kishino–Hasegawa; SH, Shimodaira–Hasegawa; WSH, weighted SH.
 Copyright © 2005, The National Academy of Sciences
References

↵
Felsenstein, J. (2004) Inferring Phylogenies (Sinauer, Sunderland, MA).
 ↵
 ↵
 ↵
 ↵

Pedersen, A. K. & Jensen, J. L. (2001) Mol. Biol. Evol. 18 : 763–776. pmid:11319261

↵
Robinson, D. M., Jones, D. T., Kishino, H., Goldman, N. & Thorne, J. L. (2003) Mol. Biol. Evol. 20 , 1692–1704. pmid:12885968
 ↵
 ↵
 ↵
 ↵
 ↵

↵
Debry, R. W. (1999) Syst. Biol. 48 , 286–299. pmid:12066708

↵
Debry, R. W. (2003) Syst. Biol. 52 , 604–617. pmid:14530129

↵
Gontcharov, A. A., Marin, B. & Melkonian, M. (2004) Mol. Biol. Evol. 21 , 612–624. pmid:14739253
 ↵

Swofford, D. L. (1995) paup*: Phylogenetic Analysis Using Parsimony (* and Other Methods) (Sinauer, Sunderland, MA).

↵
Waddell, P. J., Kishino, H. & Ota, R. (2000) Mol. Biol. Evol. 17 , 1988–1992. pmid:11110915
 ↵

↵
Scheffé, H. (1959) in The Analysis of Variance (Wiley, New York), pp. 221–260.
 ↵
 ↵

↵
Goldman, N., Anderson, J. P. & Rodrigo, A. G. (2000) Syst. Biol. 49 , 652–670. pmid:12116432

↵
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (1992) in Numerical Recipes in C (Cambridge Univ. Press, Cambridge, U.K.), 2nd Ed., pp. 623–628.

↵
Shimodaira, H. & Hasegawa, M. (2001) Bioinformatics 17 , 1246–1247. pmid:11751242

↵
Rannala, B. & Yang, Z. (2003) Genetics 164 , 1645–1656. pmid:12930768
 ↵

↵
Yoder, A. D. & Yang, Z. (2000) Mol. Biol. Evol. 17 , 1081–1090. pmid:10889221

↵
Pupko, T., Huchon, D., Cao, Y., Okada, N. & Hasegawa, M. (2002) Mol. Biol. Evol. 19 , 2294–2307. pmid:12446820

↵
Ronquist, F. & Huelsenbeck, J. P. (2003) Bioinformatics 19 , 1572–1574. pmid:12912839

↵
Nylander, J. A. A., Ronquist, R., Huelsenbeck, J. P. & NievesAldrey, J. L. (2004) Syst. Biol. 53 , 47–67. pmid:14965900

↵
Lewis, P. O. (2001) Syst. Biol. 50 , 913–925. pmid:12116640

↵
Suchard, M. A., Kitchen, C. M. R., Sinsheimer, J. S. & Weiss, R. E. (2003) Syst. Biol. 52 , 649–664. pmid:14530132
Citation Manager Formats
More Articles of This Classification
Biological Sciences
Related Content
 No related articles found.