A Dirichlet process model for detecting positive selection in proteincoding DNA sequences
 *Section of Ecology, Behavior, and Evolution, Division of Biological Sciences, University of California at San Diego, La Jolla, CA 920930116;
 ^{‡}Division of Biostatistics and Bioinformatics, Department of Family and Preventive Medicine, University of California at San Diego, La Jolla, CA 920930717; and
 ^{§}Antiviral Research Center, University of California at San Diego, 150 West Washington Street, La Jolla, CA 92103
See allHide authors and affiliations

Edited by Joseph Felsenstein, University of Washington, Seattle, WA, and approved March 1, 2006
Abstract
Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of proteincoding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for amongsite variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution by using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.
Natural selection leaves a detectable signature in comparisons of proteincoding DNA sequences; a bias in the ratio of the rates of nonsynonymous and synonymous substitutions is unambiguous evidence for natural selection. Purifying selection causes the rate of nonsynonymous substitution to be smaller than the rate of synonymous substitution. In fact, the predominant pattern found in analysis of alignments of proteincoding DNA is that nonsynonymous substitutions have a lower rate than synonymous substitutions. This finding is consistent with natural selection acting to eliminate deleterious mutations that change the protein to a less functional form. However, natural selection can also act to increase the probability that a nonsynonymous mutation is fixed in the population. Positive, or direction, selection causes the relative rates of nonsynonymous to synonymous substitutions to be >1. Examples of positive selection have been found in many genes but perhaps most famously in human major histocompatibility complex (MHC) (1), HIV1 envelope (env) gene (2), sperm lysins (3), and primate stomach lysozymes (4).
Although seemingly simple, detecting positive natural selection from alignments of proteincoding DNA is recognized as a formidable statistical problem. Many methods have been proposed to detect the footprint of natural selection, all of which are based on measuring the relative rates or numbers of nonsynonymous and synonymous substitutions (2, 5–20). Most of the methods assume a constant rate of nonsynonymous and synonymous substitutions across the sequence. These methods are illsuited for detecting positive natural selection when only a few of the amino acid positions in a gene are under the influence of positive natural selection, with the others under purifying selection. Applying a method that assumes a constant rate of nonsynonymous change across the sequence potentially masks the signature of positive natural selection that is present at only a few positions in the alignment (21). In cases where such methods have been successful, many sites are typically under strong positive selection (e.g., MHC). More recently, Nielsen and Yang (2) developed a method that allows the rate of nonsynonymous substitution to vary across the sequence. The method of Nielsen and Yang has proven useful for detecting positive natural selection in sequences where only a few sites are under directional selection.
The Nielsen and Yang (2) approach assumes that the nonsynonymous/synonymous rate ratio (d _{N} /d _{S} ratio) at a site is a random variable drawn from some probability distribution. The goal is to estimate parameters of the model from alignments of proteincoding DNA and, by using these parameter estimates, identify amino acids under positive selection. Their model is quite complicated and contains many parameters to estimate. Some of the parameters account for the fact that the sequences are related to one another through some unknown phylogenetic tree. Other parameters account for biases in the substitution process, such as an increased rate of mutation for transitions. The remaining parameters, however, describe how d _{N} /d _{S} varies across the sequence. Nielsen and Yang showed not only that it is practical to efficiently estimate these parameters from alignments of proteincoding DNA sequences but also that one can use an empirical Bayesian approach to detect specific amino acid residues that are under the influence of positive natural selection.
How should variation in the rate of nonsynonymous substitution across a sequence be modeled? Population genetics theory is largely silent on this issue. For the most part, we lack information on the distribution of selection coefficients for new mutations and the demography of the populations under consideration, making it difficult to predict a distribution for the rate of nonsynonymous substitution (22, 23). The original Nielsen and Yang (2) approach assumed that a site could be in one of three categories, each of which differed in its d _{N} /d _{S} rate ratio. With probability p _{1} , a site is in category 1, which has d _{N} /d _{S} = 0; with probability p _{2} , the site is in category 2, which has d _{N} /d _{S} = 1; and with probability p _{3} , the site is in a category that has d _{N} /d _{S} > 1 (p _{1} + p _{2} + p _{3} = 1). Nielsen and Yang (2) also considered a model that allowed d _{N} /d _{S} to vary continuously on the interval (0, 1); a gamma distribution, truncated on the interval (0. 1) was used to model amino acid sites that are acting neutrally or under the influence of purifying natural selection. Under this model, a site has d _{N} /d _{S} drawn from a truncated gamma distribution with probability p _{1} or has d _{N} /d _{S} > 1 with probability p _{2} . Later, Yang et al. (12) systematically explored more ways to model variation in d _{N} /d _{S} across a sequence. They considered a total of 13 models. The “M10” model from Yang et al. (12), for example, assumes that, with probability p _{1} , the d _{N} /d _{S} rate ratio is drawn from a beta distribution on the interval (0, 1) and, with probability p _{2} , the d _{N} /d _{S} rate ratio is drawn from an offset gamma distribution. Even though many of the models considered in Nielsen and Yang (2) and in Yang et al. (12) describe d _{N} /d _{S} as varying continuously, in practice, these continuous distributions are discretized to allow the likelihood to be calculated (the likelihood is averaged over the different discrete categories). At least as currently implemented, then, all of these models describing how the rate of nonsynonymous substitution varies across the sequence are discrete.
We take a different approach to modeling variation in d _{N} /d _{S} across sites, allowing sites to be in one of a number of classes, with each class having a different d _{N} /d _{S} ratio. The prior probability distribution for the number of classes and the d _{N} /d _{S} for each class is described by a Dirichlet process prior. The Dirichlet process prior provides a flexible way to model situations in which the data elements are drawn from a mixture of simpler parametric distributions. For typical mixture models, the number of mixture components is assumed to be known, and determining the correct number of components a priori for a particular model is difficult. For the Dirichlet process prior, however, the number of mixture components is countably infinite, obviating the need to determine the correct number of mixture components. Here we apply the Dirichlet process to the problem of detecting positive selection in proteincoding DNA sequences. Instead of assuming that the d _{N} /d _{S} rate ratio for a site is drawn from a particular parametric distribution, as is the case for Nielsen and Yang (2) and Yang et al. (12), we assume that a site is assigned to a category with a particular d _{N} /d _{S} value. The number of selection categories and the d _{N} /d _{S} value for each category are both considered random variables in our model. The Dirichlet process has been used in one other case for an evolutionary problem: Lartillot and Philippe (24) used the Dirichlet process to model variation in the substitution process across alignments of amino acid sequences. Similarly, Pond and Frost (25) described a flexible discretization scheme that is able to fit a wide range of rate distributions. However, this scheme still requires that the maximum number of rate classes be specified a priori.
We estimate the parameters of the model in a Bayesian framework. Calculating the joint posterior probability distribution of the parameters involves summation over all possible phylogenetic trees relating the sequences and, for each tree, integration over all possible combinations of parameter values. We use Markov chain Monte Carlo (MCMC) to approximate the posterior probability distribution of the parameters and apply the method to six alignments of proteincoding DNA sequences (12, 17, 26, 27).
Results and Discussion
The Dirichlet process prior has two components: One is a parameter, usually called α, that influences the probability that data elements find themselves in the same cluster. In other words, the parameter controls the “clumpiness” of the process. The other component of a Dirichlet process prior is a probability distribution that describes the probability of the parameter assigned to each cluster. (The Dirichlet process prior is described in more detail in Materials and Methods.) We examined the robustness of the inferences of positive selection to three different choices for α, choosing α such that the prior mean of the number of components (k) was E(k) ≈ 1, E(k) = 5, and E(k) = 10; we did this to keep the number of selection classes to a manageable number because the likelihood calculations become too computationally cumbersome when k is large (e.g., k > 25). An alternative that we did not explore is to place a prior probability distribution on α. Often in problems using a Dirichlet process prior, a gamma hyperprior is placed on α.
Tables 1–6 compare the posterior and prior probability distributions for the number of mixture components (selection categories) for each of the six alignments. For all of the alignments we examined, there is very little posterior probability for k = 1 and k = 2, even when there may be a substantial amount of prior probability for those cases. The data are very difficult to explain with only a few selection classes.
We also examined the marginal posterior probability of ω = d _{N}/d _{S} for each site. Fig. 1 shows the probability distribution for four arbitrarily chosen sites from the HIV1 env alignment. Three things are apparent: First, and as expected, different sites for the same data set have different probability distributions for ω. Second, the distribution for ω varies depending on the value of the parameter used in the analysis. The distributions were similar when the prior expectation of the number of selection classes was 5 or 10 but were different when the prior expectation was 1. In short, when the model attempts to explain the data with too few selection categories, the marginal posterior probability distribution for ω at a site becomes a compromise between the necessities for that site and others that are grouped into the same selection category. Third, and perhaps more importantly, the marginal posterior probability distribution of ω for a site can be used to directly calculate the probability that the site experiences positive selection. The probability that ω > 1 is the probability that the site is under positive selection. This quantity can be directly calculated from the output of the MCMC procedure as the fraction of the time the site had ω > 1. Fig. 2 shows the probability that each site is under positive selection for all of the analyses of this paper. There is strong concordance between the sites that we find to be under positive selection with the sites that Yang et al. (12) found to be under positive selection. Yang et al. (12) found that analysis of different probability distributions describing how ω varies across a sequence would typically result in a set of sites that always were found to be under positive selection and another set of sites that were inferred to be under positive selection for some models but not others. In general, the Dirichlet process prior picks out the sites that were consistently found to be under positive selection by Yang et al. (12) but assigns lower probabilities to these same sites of being under positive selection. For example, the Yang et al. (12) analysis of the HIV1 env alignment consistently found sites 28, 66, and 87 to be under positive selection (with probabilities of 0.99 or greater), and sites 26 and 51 to be under positive selection with probability 0.99 under the M12 model. (The M12 model is a mixture of two normal distributions with a discrete category with ω = 0.) Our analysis of the HIV1 env alignment finds sites 26, 28, 51, 66, 83, and 87 to be under positive selection, all having a probability of >0.95 and having ω > 1. Our analysis does not condition on the maximum likelihood values of the parameters (the tree, branch lengths, and substitution model parameters) as is the case of the Nielsen and Yang (2) approach. It is likely that the accommodation of uncertainty in the model parameters causes the probabilities of sites being in particular categories to be dampened relative to approaches that do not account for parameter uncertainty.
Table 7 lists sites that had a high probability of being under positive selection for all six genes. For the most part, the same sites are found to be under positive selection regardless of the value of the concentration parameter used in the analysis. For example, sites 26, 28, 51, 66, 83, and 87 of the HIV1 env alignment were inferred to be under positive selection regardless of the value of α assumed in the analysis. Sites 24, 68, 69, and 76 had a probability >0.95 of being under positive selection when E(k) ≈ 1 but not when E(k) = 5 or E(k) = 10. However, the probability of those sites being under positive selection was just below the 0.95 threshold. (Sites 24, 68, 69, and 76 had probabilities ranging between 0.88 and 0.93 of having ω > 1 when the expected number of selection categories was set to 5 or 10.)
Methods for detecting the presence of positive natural selection in proteincoding DNA have become an important tool in studies of molecular evolution. The recent advances that allow the nonsynonymous/synonymous rate ratio to vary across the sequence have opened up the possibility of detecting specific amino acid residues that are functionally important, displaying an elevated d _{N} /d _{S} rate ratio. The method we describe here represents an important extension of existing methods by allowing a more flexible description of how d _{N} /d _{S} varies across a sequence and by accounting for uncertainty in parameters of the model when making inferences of positive selection.
Materials and Methods
Data.
We assume an alignment of proteincoding DNA sequences is available. The alignment is contained in the matrix X = {x_{ij} }, where i = 1, 2, …, s and j = 1, 2, …, c; s is the number of sequences, and c is the number of codons in each sequence. The information at the jth site is contained in the vector x _{j} = (x _{1j}, x _{2j} , …, x _{sj} )′.
Phylogenetic Model.
The s species are related to one another through an unknown phylogeny, represented as a bifurcating tree. Possible phylogenies are denoted τ _{1} , τ _{2} , …, τ _{B} _{(s)} , where B(s) is the number of possible phylogenetic trees for s species. We assume a timereversible model of nucleotide substitution (see below), which means that the phylogenetic trees can be arbitrarily rooted without changing the likelihood. Therefore, we assume unrooted phylogenetic trees throughout this study. The number of unrooted phylogenetic trees for s species is B(s) = (2s − 5)!/[2 ^{s} ^{−3} (s − 3)!] (28). Each unrooted tree has 2s − 3 branches. Each branch on the tree has a length, ν. The set of branch lengths for the jth tree is denoted ν _{j} = (ν_{1}, ν _{2} , …, ν _{2s} _{−3} ).
We assume that substitutions occur on the phylogenetic tree according to a continuoustime Markov process. The instantaneous rate of change from codon i to codon j is contained in a matrix, Q, the entries of which are given by where ω is the nonsynonymous/synonymous rate ratio, κ is the transition/transversion rate ratio, and π _{j} is the stationary frequency of codon j (2, 14, 15). The rate matrix, Q, is 61 × 61 in dimension; although there are 64 possible codon triplets, the three termination codons of the universal code are excluded from the state space, leaving the 61 sense codons as the states of the substitution process. The diagonal elements of the rate matrix Q are specified such that each row sums to 0. We rescale the matrix such that the average rate of synonymous substitution is 1; this means that the branch lengths are in terms of expected number of synonymous substitutions per codon site. This scaling is different from the usual for phylogenies, where the rate matrix is rescaled such that the branch lengths are in terms of expected number of substitutions per site. Finally, note that the substitution process is timereversible because π _{i} q _{ij} = π _{j} q _{ji} for all i ≠ j. Practically speaking, this means that we can use unrooted phylogenies instead of rooted trees when calculating the likelihood on the tree (29). The transition probabilities of the process are the probability of finding the process in codon j conditioned on the process starting in codon i and run over a branch of length ν. The transition probabilities can be calculated as P(ν) = e ^{Qν} .
Each of the c sites can be in one of k selection categories, where each category differs in its d _{N} /d _{S} value. The information on the category membership is contained in an association vector, z. For example, for the alignment of four sites given above, one possible assignment of sites to selection categories is z = (1, 1, 2, 3). Here, k = 3, and the first two sites are in the same category. The allocation vector describes a partition of the sites into different classes, each class with its own d _{N} /d _{S} value. The number of ways to partition a set of c sites into k classes is described by the Stirling numbers of the second kind (30): The sites for the example alignment of c = 4 sites can be assigned to k = 3 categories in a total of S(4, 3) = 6 ways: z = (1, 1, 2, 3); z = (1, 2, 1, 3); z = (1, 2, 3, 1); z = (1, 2, 2, 3); z = (1, 2, 3, 2); and z = (1, 2, 3, 3). We label possible partitions of the sites into classes by using the restricted growth function notation of Stanton and White (31), where elements are sequentially numbered with the constraint that the index numbers for two sites are the same if they are found in the same category. The number of sites in the ith category is denoted η _{i} . For the example given above, η _{1} = 2, η _{2} = 1, and η _{3} = 1.
Likelihood.
Assuming that substitutions are independent across sites, the likelihood for an alignment can be calculated as the product of the site likelihoods: We used the Felsenstein (29) pruning algorithm to calculate likelihoods for specific combinations of parameter values.
Statistical Model.
We estimate parameters in a Bayesian framework. All parameters of the model are considered random values with their own prior and posterior probability distributions. The joint posterior probability of all of the parameters of the model is The posterior probability [f(· X)] is equal to the product of the likelihood [ℒ(·) or, equivalently, f(X ·)] and the prior probability distribution [f(·)] of the parameters divided by the marginal likelihood [f(X)].
Bayesian analysis requires that a prior probability distribution be specified for the parameters. Here we assume that the parameters take the prior probability distributions shown in Table 8.
The model, as described, is fairly typical of codon models implemented to date (32). However, the Dirichlet process prior is new to the problem of detecting positive natural selection and requires more explanation. The Dirichlet process prior is widely used in clustering problems as a nonparametric alternative, where the data elements are drawn from a mixture distribution with a countably infinite number of components (33, 34). [Ewens and Tavaré (35) describe the relationship between the Dirichlet process prior and the Ewen’s sampling formula from population genetics.] The Dirichlet process prior has two parameters: a concentration parameter, which, loosely speaking, determines the degree to which data elements are grouped together, and a base measure, G _{0} (·), which describes the probability distribution of the parameter value associated with each cluster. The model can be described as follows: First, the number of classes, k, and an allocation variable, z, are randomly drawn from the probability distribution Once the sites have been partitioned into k categories, a ω = d _{N}/d _{S} value is assigned to each category by drawing randomly from the following probability distribution k times, G _{0}(ω_{i}) = f(ω_{i}) = [1/(1 + ω_{i})^{2}] which is the probability density of the ratio of two identically distributed exponential random variables. Here, ω _{i} is the d _{N} /d _{S} rate ratio for the ith selection category. The probability of having k categories is obtained by summing over all possible partitions for k categories, and is where _{n} a _{k} is the absolute values of the Stirling numbers of the first kind. The expected number of categories is Finally, the probability of finding two sites grouped together into the same cluster is f(z_{i} = z_{j}α, c) = 1/(1 + α)].
MCMC.
We approximated posterior probabilities by using MCMC (36, 37). The general idea is to construct a Markov chain that has as its state space the parameters of the statistical model and a stationary distribution that is the posterior probability of the parameters. Samples drawn from the Markov chain when at stationarity are valid, albeit dependent, draws from the posterior probability distribution of the parameters (38). The fraction of the time the Markov chain visits any particular combination of parameter values is an approximation of the joint posterior probability of those parameters.
We wrote a computer program that implements the MCMC method for the phylogenetic model described above. For each iteration of the Markov chain, the program randomly picks a parameter to change, proposes a new value for the parameter, and then accepts or rejects the proposed state as the next state of the chain by using the Metropolis–Hastings formula. Most of the proposal mechanisms have been described elsewhere. We use the LOCAL mechanism of Larget and Simon (39) to simultaneously change the tree and branch lengths (τ, ν). We update the codon frequencies (π) by using a Dirichlet proposal mechanism and change the transition/transversion rate ratio (κ) and the category d _{N} /d _{S} values (ω) by using a rate multiplier [i.e., we multiply the old parameter value by e ^{λ} ^{(u} ^{−½)} to obtain a new value for the parameter, where u is a uniform(0, 1) random number and λ is a tuning parameter]. Finally, we update the allocation vector by using a Gibbs sampling algorithm for nonconjugate models proposed by Neal (40). We implement the method of Neal (40) as follows: First, we randomly choose a site, i, and remove it from the set of k selection classes currently in computer memory. If the site is in a selection class by itself (i.e., η _{zi} for the class was 1), then the entire class is deleted, and k is decreased by 1. Otherwise, η _{zi} for the selection class that the site is associated with is decreased by 1. We then construct five new (temporary auxiliary) classes by drawing the d _{N} /d _{S} value for each from the prior probability distribution. These auxiliary classes represent components that have no other sites associated with them. Gibbs sampling is performed to update site i but with respect to the distribution that includes the temporary auxiliary parameters. The site is assigned to the jth previously existing class with probability Zη _{−i} f(x _{i} ω _{j} ) and to the jth of five auxiliary classes with probability Z(α/5)f(x _{i} ω _{j} ), where Z is the appropriate normalizing constant. After this update, all d _{N}/d _{S} values not associated with a selection class are discarded.
All analyses were run for a total of 2 million update cycles. Chains were thinned, with samples taken every 100 cycles. Samples taken during the first 100,000 cycles were discarded as the burnin phase. The MCMC procedure was repeated, resulting in two sets of samples for each gene. Convergence was assessed by using the program tracer (http://evolve.zoo.ox.ac.uk/software.html?id=tracer).
Data Analysis.
We analyzed six alignments of proteincoding DNA sequences. These alignments included: (i) βglobin gene sequences from s = 17 vertebrates (12); (ii) s = 23 env gene sequences from Japanese encephalitis virus (12, 26); (iii) s = 28 sequences of the HA1 domain of the hemagglutinin gene of human influenza virus A (17); (iv) s = 13 sequences of the HIV1 env gene V3 region (27); (v) HIV1 pol gene sequences from s = 23 isolates (12); and (vi) s = 29 HIV1 vif gene sequences (12).
Acknowledgments
J.P.H. was supported by National Institutes of Health Grant GM069801 and National Science Foundation Grants DEB0445453 and CACTTOL22035A.
Footnotes
 ^{†}To whom correspondence should be addressed. Email: johnh{at}biomail.ucsd.edu

Author contributions: J.P.H., S.J., S.W.D.F., and S.L.K.P. designed research; J.P.H. performed research; S.J., S.W.D.F., and S.L.K.P. contributed new reagents/analytic tools; J.P.H. analyzed data; and J.P.H. wrote the paper.

Conflict of interest statement: No conflicts declared.

This paper was submitted directly (Track II) to the PNAS office.
 Abbreviations:
 MCMC,
 Markov chain Monte Carlo.
Abbreviation:
 © 2006 by The National Academy of Sciences of the USA
References
 ↵

↵
 Nielsen R. ,
 Yang Z.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Yang Z. ,
 Nielsen R. ,
 Goldman N. ,
 Pedersen A.

↵
 Yang Z. ,
 Nielsen R.
 ↵
 ↵
 ↵

↵
 Fitch W. M. ,
 Bush R. M. ,
 Bender C. A. ,
 Cox N. J.
 ↵

↵
 Pond S. K. ,
 Muse S. V.

↵
 Pond S. L. K. ,
 Frost S. D. W.
 ↵

↵
 Sawyer S. ,
 Hartl D.

↵
 Nielsen R. ,
 Yang Z.

↵
 Lartillot N. ,
 Philippe H.

↵
 Pond S. L. K. ,
 Frost S. D. W.

↵
 Zanotto P. M. ,
 Kallas E. G. ,
 Souza R. F. ,
 Holmes E. C.

↵
 Leitner T. ,
 Kumar S. ,
 Albert J.

↵
 Schröder E.
 ↵

↵
 Abramowitz M. ,
 Stegun I. A.

↵
 Stanton D. ,
 White D.
 ↵
 ↵
 ↵

↵
 Ewens W. J. ,
 Tavaré S.
 Kotz S. ,
 Read C. B. ,
 Banks D. L.
 ↵

↵
 Hastings W. K.

↵
 Tierney L.
 Gilks W. R. ,
 Richardson S. ,
 Spiegelhalter D. J.
 ↵

↵
 Neal R. M.