New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Evolutionary population genetics of promoters: Predicting binding sites and functional phylogenies

Edited by Tomoko Ohta, National Institute of Genetics, Mishima, Japan (received for review June 30, 2005)
Abstract
We study the evolution of transcription factorbinding sites in prokaryotes, using an empirically grounded model with point mutations and genetic drift. Selection acts on the site sequence via its binding affinity to the corresponding transcription factor. Calibrating the model with populations of functional binding sites, we verify this form of selection and show that typical sites are under substantial selection pressure for functionality: for cAMP response protein sites in Escherichia coli, the product of fitness difference and effective population size takes values 2NΔF of order 10. We apply this model to crossspecies comparisons of binding sites in bacteria and obtain a prediction method for binding sites that uses evolutionary information in a quantitative way. At the same time, this method predicts the functional histories of orthologous sites in a phylogeny, evaluating the likelihood for conservation or loss or gain of function during evolution. We have performed, as an example, a crossspecies analysis of E. coli, Salmonella typhimurium, and Yersinia pseudotuberculosis. Detailed lists of predicted sites and their functional phylogenies are available.
Regulatory interactions between genes are believed to provide an important mode of evolution, which accounts for a substantial part of the differentiation between species (1). This is reflected by the sequence variability of regulatory DNA: there is ample case evidence of compensatory evolution at conserved function but also of rapid functional changes even between closely related species (2). Lacking a quantitative model of regulatory evolution, however, alignments of regulatory sequences and predictions of their functionality have proven notoriously difficult.
A large body of existing work has focused on the identification of transcription factorbinding sites as the main functional elements of regulatory DNA. For factors with known binding specificity (given in the form of a position weight matrix), putative binding sites are identified from their conservation in crossspecies comparisons. Different measures of conservation have been introduced, which involve, e.g., the sequence similarity of aligned loci or their independent high scoring in all species compared (3–7). These methods are powerful prediction tools for binding sites. From an evolutionary point of view, however, the conservation criteria are heuristic. Hence, it is difficult to quantify the statistical significance of the results, which depends on the number and evolutionary distance of the species compared. Sequence conservation tends to be too restrictive in cases where substantial sequence variation is compatible with the position weight matrix, in particular for distant species. Simple sequence similarity measures implicitly assume neutral evolution, whereas independent scoring of orthologous sites ignores the evolutionary link between the species altogether. Most importantly, none of these conservation measures allows a consistent statistical treatment of functional innovations in the evolution of binding sites.
A more explicit model for regulatory DNA should address two issues: How does the sequence divergence between species depend on their evolutionary distance, and how does the specific biological function of binding sites enter? Answering these questions is a considerable task for experiment and theory, which must link the biophysics of binding sites with their population dynamics. In particular, it involves quantifying the selection by which the sequence evolution at functional loci is distinguished from that of neutral background DNA. As an important step in this direction, the notion of a fitness landscape for bindingsite sequences has been introduced, where the fitness of a site depends on the binding energy of the corresponding factor (8). The evolutionary importance of the binding energy has also been highlighted in ref. 9, where it was shown that nucleotide substitution rates within functional sites in Escherichia coli depend on the energy difference induced by the substitution as predicted from the position weight matrix. The biophysics of factorDNA binding imposes stringent constraints on the form of the fitness landscape (10) and has important consequences for bioinformatic binding site searches (11). Using such fitness landscapes, we have introduced a stochastic evolution model for functional loci, which is based on Kimura–Ohta point substitutions with rates governed by the fitness difference between the corresponding sequence states (12, 13). This model demonstrates the possibility of rapid adaptive formation of binding sites under positive selection and provides evolutionary constraints on eukaryotic promoter architecture. A similar evolutionary model (14) underlies a recently introduced method to identify conserved binding sites in multiple alignments (15).
In this paper, we develop a quantitative evolutionary rationale for the crossspecies analysis of regulatory sequences, which goes beyond the mere prediction of binding sites. For aligned regulatory DNA of orthologous genes, our method predicts sites together with their functional evolution. The method is based on the evolution model of refs. 12 and 13 and uses a bioinformatic measurement of selection pressures for functionality, which is obtained from sequence data of verified functional sites. Typical functional loci for pleiotropic factors, as exemplified by the cAMP response protein (CRP) family in E. coli, are found to be under substantial selection, in contrast to nonfunctional loci, which evolve neutrally. For families of aligned loci, our method assigns likelihood values to different modes of evolution and associates them with functional histories: (i) neutral evolution of nonfunctional loci, (ii) evolution of functional loci under timeindependent selection, and (iii) evolution under timedependent selection, corresponding to loss or gain of function along a given branch of the phylogeny.
Theory
Evolution Models for Nonfunctional Sequence and Functional Loci. We consider genomic loci a = (a _{1},..., a_{l} ) consisting of l contiguous nucleotides and elementary substitution processes a → b, where a, b are any two sequence states differing by exactly one nucleotide. For nonfunctional (background) sequences, we use uniform nucleotide substitution rates μ_{a}→ _{b} depending on the nucleotide to be mutated and on its nearest sequence neighbors (16). Models of this type are neutral with respect to factor binding and have been shown to provide a good description of intergenic background DNA in E. coli (11). A locus is defined as functional if binding of the corresponding factor at that locus affects the regulation of a gene. Functional loci are assumed to be under selection. This is described by a (Malthusian) fitness function F(a), which measures the contribution of a genotype a to the growth rate of the number of individuals carrying that genotype (and is therefore defined only up to an additive constant, the genotypeindependent fitness). Notice that this definition of a functional locus is weaker than that of a functional binding site, which is a functional locus with a sequence state a that is likely to actually bind the factor. A functional locus can lose its binding sequence due to deleterious mutations, and conversely, a nonfunctional locus can become a spurious binding site. According to the Kimura–Ohta theory (17–19), selection leads to modified substitution rates at functional loci, where N denotes the effective population size. In writing Eq. 1 , we have assumed μN << 1, so that subsequent substitution processes are well separated in time and can be assumed to be independent.
Stationary Population Distributions and Evolutionary Scoring. For background sequences, we use a stationary distribution of the form (11) where p _{0}(a) is the singleletter equilibrium distribution, and π_{0}(aa′) is the conditional distribution for letter a given its left neighbor a′ (see Supporting Text, which is published as supporting information on the PNAS web site). Assuming the underlying neutral dynamics with rates μ _{a}→_{b} satisfies detailed balance, the dynamics under selection satisfies detailed balance as well, and the stationary distribution for functional loci takes the form (12, 13) with the constant given by the normalization ∑ _{a} P _{0}(a) = ∑ _{a} Q(a) = 1. These distributions give the probability density to find a locus with sequence a, which can be inferred from longterm frequency counts at a given locus, or equivalently, from the fraction count of sequence a in a large ensemble of independently evolving loci. We identify the distributions P _{0} and Q with the ensembles of background respective functional loci for a given factor in a genome. (This involves the approximation that all functional loci are under similar selection pressure.) Hence, the usual position weight matrix (20) for the factor is given by the singlenucleotide (marginal) distributions
If the background distribution is approximated by a factorized form, , and if the fitness is an additive function of the nucleotide positions, , the Q distribution factorizes as well, with q_{i} (a) = p _{0}(a) exp[2N f_{i} (a) + const.]. This form of the stationary distribution has previously been introduced in ref. 14 for proteincoding sequences and has been used in ref. 15. For binding sites, however, the factorized form is a heuristic approximation, because generic fitness landscapes are not additive. Regulatory fitness effects follow from the expression level of the regulated gene, which in turn depends on the relevant binding sites through the binding probability of the corresponding transcription factor (8, 13). The factorbinding energy is often nearly additive in the nucleotide positions (21). The individual contributions ε _{i} (a) can be inferred in an approximate way from the position weight matrix (20, 22) (up to an overall constant ε_{0}),
The binding probability, however, is a strongly nonlinear function of the energy. Hence, the fitness effect of a substitution at one position depends on the nucleotides present at all other positions. This induces correlations between nucleotide frequencies at any two positions within functional loci (13), in addition to the shortranged correlations already present in background sequences. These correlations prevent the factorization of the distributions P _{0}(a) and Q(a). However, because the fitness F(a) depends on the sequence state a only via the binding energy E(a), we can project these distributions on the energy E as independent continuous variables, summing over all sequence states a with an (approximately) equal value of E. Denoting the projected ensembles for simplicity with the same symbols P _{0} and Q, Eq. 3 takes the form
It is this simplification that enables us to infer the functional form of these distributions from bioinformatic frequency counts. The total distribution of energies in the noncoding part of the genome is
From a bioinformatic point of view, this is a hidden Markov model for the sequence composition of noncoding DNA. The two alternative distributions P _{0}(E) and Q(E) have prior probabilities 1  λ and λ, i.e., the parameter λ measures the overall fraction of the genome covered by functional loci. The relative likelihood between the distributions Q and P _{0} is described by the score function S(E) ≡ log[Q(E)/P _{0}(E)]. Comparing with Eq. 6 , we obtain the identification which establishes an important and rather general evolutionary grounding of the bioinformatic loglikelihood score.
TimeDependent Distributions and CrossSpecies Scores. It is straightforward to generalize the probabilistic analysis to pairs of species separated by an evolutionary time t. Defining the conditional transition probabilities and , we obtain the joint distribution of energy pairs for orthologous loci at neutrality and under constant selection. The total distribution of energy pairs is
The ensembles and Q^{t} define the loglikelihood score which is easily shown to be symmetric in its two arguments, because detailed balance ensures reversibility of the evolutionary dynamics at equilibrium. Eq. 12 is a key result of this paper: it shows how the specific evolution of binding loci can be integrated into a bioinformatic scoring procedure. The evolutionary information is contained in the second score term, i.e., the log ratio of the transition probabilities. This term measures a difference in evolutionary fates: A locus with energy E _{1} in the “twilight region” between the Q and P _{0} ensembles on average maintains its binding energy if functional but evolves toward lesser binding if nonfunctional. The energy transition probabilities and can be obtained in a straightforward way from the underlying sequence evolution process (see Supporting Text). They are shown in Fig. 1 for fixed E _{1} and for various evolutionary distances t. The difference between the distributions and , i.e., the additional discriminatory power of the evolutionary information, is seen to increase with the distance of the species compared. In the longdistance limit, we have and , leading to independent scoring of aligned loci in Eq. 12 .
TimeDependent Selection and Functional Switching. Here we generalize the evolutionary model to include loss or gain of function at the level of individual loci. Consider a rooted phylogeny consisting of two species at evolutionary distances t _{1} and t _{2} from their last common ancestor, i.e., at distance t = t _{1} + t _{2} from each other. We assume that an initially functional locus can lose function at a small rate ν_{} (with ν_{} t << 1), and conversely, an initially nonfunctional locus can gain function at a comparable rate ν_{+} (such that the average fraction λ of functional loci is maintained). There are now four alternative evolutionary histories involving at most one functional switch: evolution under timeindependent neutrality or selection, timedependent selection leading to functionality at t _{1}, and nonfunctionality at t _{2}, and vice versa (see Fig. 2). These occur with probabilities , , , and , respectively, with the abbreviation t = (t _{1}, t _{2}). The four corresponding energy pair distributions are , Q^{t} (E _{1}, E _{2}), and (E _{1}, E _{2}), which is defined in an analogous way. In Eq. 13 t′ denotes the switching point and E′ the energy at that point. We have approximated the ancestral energy distribution by the stationary ensembles P _{0} or Q and have used detailed balance. Within this switch mode, we have summed over the probabilities of gain and loss of function. Disentangling these alternatives is possible but statistically insignificant in phylogenies with few species. Neglecting histories with more than one functional switch, the total distribution of energy pairs is now and there are three independent loglikelihood scores weighing each of the ensembles Q^{t} , , against the background ensemble . The hidden Markov model can readily be extended to rooted phylogenies with more than two species. The expressions for the P _{0}, Q, and R distributions generalizing Eqs. 9 , 10 , and 13 involve a factor G _{0} for each branch under neutral evolution and a factor G_{s} for each branch under selection, as well as integrations over the (unobserved) energies at the internal nodes of the phylogeny and the variables E′ and t′ of the switching point. In the case of three species, there are eight different functional histories: timeindependent neutrality and selection, as well as timedependent selection leading to a functional locus in any one or any two species (see Supporting Text for details).
Site Prediction and Quality Measures. For a given pair of aligned loci with energies (E _{1}, E _{2}), the hidden Markov model (Eq. 14 ) determines the probability of belonging to each of its four ensembles. The probability of conserved functionality is
The corresponding probabilities for conserved neutrality, and for functional switching, and ρ _{Q} (E) for functionality in the singlespecies case are defined in an analogous way. These probabilities form the basis of our predictions for individual sites and site pairs, and they serve to quantify the predictive quality. For functional loci, we further distinguish functional and nonfunctional binding sites, using as approximate threshold the energy E _{max} of the weakest verified site. Hence, given a total of n aligned pairs of loci, an expected number with are conserved functional sites. (Because is small for entries of order E _{max}, this number depends only weakly on the cutoff ρ_{min}.) A predicted set of functional site pairs with contains entries, of which n_{f} (ρ_{0}) are expected to be true functional pairs. The number n_{f} (ρ_{0}) is given by the integral of Eq. 16 over the region . Hence, the expected fractions of false positives and of false negatives are
Analogous definitions apply to the singlespecies case.
Results
Selection Pressure for CRP Sites in E. coli. Scanning the genome of E. coli (obtained from the NCBI database, accession no. NC_000913) produces sequence counts of n = 520,729 loci in 4,244 intergenic regions. We use a relatively large window size of l = 22, taking into account core binding motifs as well as informative flanking positions. Our CRP position weight matrix q_{i} (a)(i = 1,..., l; a = A, C, G, T) is obtained from 48 experimentally verified binding sites in the DPInteract database (23). For each sequence count a = (a _{1},..., a_{l} ), we hence infer the CRPbinding energy E(a) from Eq. 5 (in units of ε_{0} and with E = 0 set to the point of maximal binding). The resulting energy histogram is shown in Fig. 3a . In the region E > E _{max} ≈ 13, where no factor binding is expected, the data are well approximated by the background distribution P _{0}(E), whereas the excess counts for E < E _{max} are attributed to functional loci. Their distribution Q(E) in this region can be estimated independently from the ensemble of verified binding sites. Optimal consistency with the hidden Markov model (Eq. 7 ) is reached for λ = 0.00065, where the distribution W(E) produces a very similar form of Q(E) and fits the entire histogram well; see Fig. 3b . The effective fitness landscape for functional loci is then inferred from the data using Eq. 6 . In the nonbinding region E > E _{max}, the fitness takes a constant value F _{0}, i.e., the evolution is always neutral in that region, as expected. The constant F _{0} is unimportant, because only fitness differences enter the evolutionary dynamics (Eq. 1 ). Larger values occur in the binding region E < E _{max}. The excess fitness landscape 2NΔF(E) ≡ 2N(F(E)  F _{0}) is shown as the orange line in Fig. 3b . Loci with strong binding (i.e., with energies E < 8) have substantial effective fitness values in the range 2NΔF ∼ 6–11, which are interpreted as typical selection pressures for functionality. Genetic drift counteracts selection, producing also loci with weaker binding (8 < E < 12) and reduced effective fitness 2NΔF < 6. These fitness estimates are rather robust results of our procedure. It should be kept in mind, of course, that the shape of the fitness landscape ΔF(E) reflects an average over the family of CRPbinding sites, which have a spectrum of individual selection coefficients and selected binding strengths.
With this fitness landscape, the hidden Markov model (Eq. 7 ) thus gives an excellent description of the CRPbinding energy statistics in intergenic DNA of E. coli. For an individual locus, the model predicts the probability ρ _{Q} (E) of being functional, given its binding energy E. This probability is indicated by the color shading in Fig. 3a . The parameter λ determines the total number of predicted functional binding sites, . The power to predict individual binding sites remains limited, however. The reason is apparent from Fig. 3a : Many functional sites have energy values in the “twilight region” (appearing in violet), where there is already a sizeable amount of background counts. Hence, any prediction will be torn between many false negatives or many false positives, depending on the energy cutoff chosen. This situation will be drastically improved by the crossspecies analysis to which we now turn.
Evolution Between E. coli and Salmonella typhimurium. The Salmonella genome is also obtained from NCBI (accession no. NC_003197). Our alignment of the two genomes contains 135,534 pairs of loci in well aligned intergenic regions flanked by orthologous genes (for details, see Supporting Text and refs. 24 and 25). The average identity between aligned sequences is 93%, which measures the evolutionary distance t between the two species. The CRPbinding energies E _{1} in E. coli and E _{2} in S. typhimurium are inferred using the same position weight matrix, which is justified, because the factor itself is highly conserved (3, 9). The resulting dot plot of energy pairs (E _{1}, E _{2}) is shown in Fig. 4a . The distribution is significantly pinched to the diagonal in the binding region E _{1}, E _{2} < 12, indicating the expected higher conservation of the energy for functional binding sites (9). We quantify this effect by the conditional probability of evolving from energy E _{1} to E _{2} under selection as compared with its neutral counterpart (see Theory). Both distributions are readily obtained from numerical simulations of the evolution processes; examples are shown in Fig. 1.
Fig. 4b contains again the energy pair counts together with the distribution W ^{t} (E _{1}, E _{2}) of the twospecies hidden Markov model (Eq. 14 ). We use the same fitness landscape F(E) as for E. coli and a midpoint approximation for the root point of the tree, i.e., t _{1} = t _{2} = t/2. The fit parameter λ = 0.0018 is now higher than in the singlespecies case (reflecting a higher fraction of functional loci in aligned regions), and there are small probabilities of selection gain and loss, ν_{+} t ∼ ν_{} t ∼ 0.025. Quite remarkably, this distribution reproduces the entire energy pair data well, which indicates the consistency of our approach as well as an overall similarity of the evolutionary characteristics between the two species compared.
Functional Histories. For an individual pair of aligned loci with energies (E _{1}, E _{2}), the hidden Markov model (Eq. 14 ) predicts the probabilities of conserved neutrality and conserved function, and , and of functional switching, and ; see Eq. 15 . These probabilities are indicated by the color shading in Fig. 4a . The twilight region between the Q^{t} and ensembles (appearing in violet) is seen to be much smaller and pushed toward larger energies (E _{1}, E _{2} ∼ 10) than in Fig. 3a . This indicates that the additional evolutionary information substantially improves the predictions already for two species. For example, a prediction list for conserved functionality is obtained by ranking the pairs of loci in order of decreasing with a lower cutoff ρ_{0}. It has estimated fractions γ_{+}(ρ_{0}) of false positives and γ_{}(ρ_{0}) of false negatives, which are given by Eq. 18 . Plotting the two parameters against each other produces a socalled detection error tradeoff curve (26), which can be compared with its singlespecies counterpart (see also ref. 11). Both curves are shown in Fig. 5, which is published as supporting information on the PNAS web site, which quantifies the predictive power gained by the crossspecies comparison (see Supporting Text). The Q^{t} list with a cutoff ρ_{0} = 0.30, which contains 211 pairs of loci, is available upon request (see Supporting Text). Among them are 32 of the 40 verified binding sites in the aligned region. Compared with recently published lists of sites in E. coli (9, 11) and to our own singlespecies analysis, this list is considerably shorter at a comparable detection level of true sites. Yet it contains a substantial number of new entries, which are statistically significant at the level of two species but not of a single one.
A fraction of the functional binding sites in one species is predicted to lose their binding ability during evolution due to deleterious mutations, although their loci remain functional, i.e., under conserved selection. From E. coli to Salmonella or vice versa, we estimate this fraction to be ≈5%, by using the energy transition probabilities . However, our model also contains another mode of functional switching, which is due to loss or gain of selection for a locus. The corresponding site pairs have widely differing energies (E _{1}, E _{2}), which lead to high values of or . Examples include CRP loci in the intergenic regions prmAyhdG and ytfJytfK, which are predicted to be functional E. coli but not in S. typhimurium. In the first case, there is no other likely CRPbinding site in the same intergenic region, indicating a possible functional change in the promoter as a whole. The second case, where one finds also a conserved site in the same region, may point to a compensatory change between loci, which leaves the function of the promoter as a whole intact. The evolutionary analysis gets simpler if experimental information is available even in one of the species compared. For example, the second CRPbinding locus from DPInteract (23) for nupG has energies E _{1} = 2.8 in E. coli and E _{2} = 8.8 in S. typhimurium. Because we know that the site is functional in E. coli, we need to compare only the probabilities of evolution under constant and timedependent selection, leading to a substantial likelihood for a functional switch. There is a conserved site in the same intergenic region, which could take over the binding function in S. typhimurium. Note that the statistical weight in favor of a switch stems solely from the large energy change; both sites would individually qualify as functional under standard independent scoring. Of course, these predictions bear a higher level of uncertainty, because alignment ambiguities can lead to an artificially high value of the energy difference, and the prior switching probabilities ν_{+}, ν_{} are only orderofmagnitude estimates.
We have extended our analysis to include a third species, Yersinia pseudotuberculosis (NCBI accession no. NC_006155). Dot plots of energies (E _{1}, E _{2}, E _{3}) for triplets of aligned loci and their probabilistic scoring are reported in Supporting Text and Figs. 6 and 7, which are published as supporting information on the PNAS web site. As expected, we find a further improvement of the detection error tradeoff for prediction of loci with conserved function; see Fig. 5. This is due in part to the alignment, which introduces a bias toward conserved loci. We also find candidate loci with loss or gain of function, such as the fourth malEmalK locus in E. coli, which we predict to be nonfunctional in both S. typhimurium and in Y. pseudotuberculosis. Three other verified sites in that region are conserved in all three species. Similar candidates for functional switches are the second verified binding sites for dadAX, tsx, and araB.
Discussion
Binding Sites in Bacteria Evolve Under Substantial Selection. Our evolutionary model is based on stochastic substitutions in the space of sequences (a _{1},..., a_{l} ) of binding loci. These loci are treated as coherent population genetic units, taking into account that the evolution of any two nucleotides within a functional locus is correlated (13). This differs from standard bioinformatic approaches such as position weight matrices, which assume the nucleotides a_{i} to be independent. Our in silico measurement of the selection pressure is based on the sequence ensemble Q of functional loci and its background counterpart P _{0}. Assuming that functional loci evolve under selection, and background loci evolve neutrally, the loglikelihood score of these ensembles determines the effective fitness difference of sequence states at these two kinds of loci: S ≡ log(Q/P _{0}) = 2NΔF, where N is the effective population size. For CRP loci in E. coli, we obtain effective fitness differences 2NΔF of order 10 between strong factor binding and no binding. Because our method involves ensembles, this is an orderofmagnitude estimate for typical loci, which does not exclude that some sites may be under substantially stronger or weaker selection. We note, however, that our estimate also predicts the correct amount of energy conservation for functional loci found in our crossspecies analysis.
A substantial level of selection explains well known evolutionary characteristics of regulatory sequences (2): they may be well conserved between distant species (if under constant selection for functionality) but can also show considerable variation even between closely related species (if under positive selection for change). At the level of selection found, bindingsite gain by rapid adaptive evolution as discussed in ref. 13 is indeed possible. On the other hand, conservation will not be complete even under selection. A certain fraction (increasing with evolutionary distance) of initially functional sites will be lost because of deleterious mutations. This opens the possibility of compensatory changes involving different loci, as they are observed in ref. 27. It also indicates that the theory of promoter evolution should not stop at the level of individual binding sites. Selection couples not only the nucleotides within one locus but also the evolutionary fate of different loci. Understanding the longterm dynamics of regulation ultimately requires a consistent populationgenetic theory of entire promoters.
Improving Binding Site Searches Requires a Quantitative Evolutionary Rationale. The difficulty of predicting functional sites from their binding score in a single species is well known and has been called the “futility theorem” in ref. 28. It is caused by the coexistence of functional and nonfunctional loci in the twilight region of marginal binding. In the framework of our probabilistic model, this is quantified by tradeoff curves between false positives and false negatives. What is a computational dilemma, may, however, reflect evolutionary design. If a sufficient reservoir of marginal binding seeds is present even in background sequences, a fully functional site can form by rapid adaptation as a response to new demand, ensuring the evolvability of regulatory interactions (13).
To overcome the futility theorem, we have introduced here a quantitative method that includes evolutionary information into bindingsite searches. At the core of our model are the crossspecies energy transition probabilities and , which quantify the “phenotypic” evolution of loci and discriminate efficiently between functionality and nonfunctionality (see Fig. 1). These probabilities can be used to build a systematic likelihood score for families of aligned orthologous loci in a phylogeny, which is of the general form . This scoring allows clear significance estimates and rankings of the results.
We have applied the method to comparative analysis of three bacterial species. We find a substantial improvement of the predictive quality already at the level of two species and a further improvement for three species. This confirms the results of a recent study of conserved sites in several Saccharomyces species, where the significance of the evolutionary information as a function of evolutionary distances is discussed in detail (15). Of course, elementary evolutionary steps other than point mutations are expected to become important in eukaryotes. Nevertheless, our general rationale of inferring selection pressures from site frequencies should remain applicable.
Putative Regulatory Innovations in Bacterial Phylogenies Can Be Traced by Comparative Sequence Analysis. Previous approaches have focused on the conservation of regulatory sequences as a sign of their functionality. Here we aim at a more comprehensive view, which includes functional changes into a quantitative statistical model. We emphasize again that in the presence of selection, there are two conceptually different modes of change: binding sites can lose or gain functionality due to deleterious or beneficial mutations at constant selection for binding, or they can respond to changes in the selection itself. Our model distinguishes these modes statistically by their energy transition probabilities and thus builds functional phylogenies for specific loci (as exemplified in Fig. 2).
In our comparative analysis of bacterial species, we find a large number of loci predicted to have conserved function but also some cases with evidence for gain or loss of function. It has been shown that changes in the gene regulation of orthologous genes can lead to phenotypic differences between E. coli and S. typhimurium (29). With caveats due to uncertainties in the rates of loss or gain of function, our findings provide at least a starting point for further targeted crossspecies experiments. We can thus begin to quantify the role of regulatory innovations in molecular evolution.
Acknowledgments
We thank Johannes Berg and Ulrich Gerland for a critical reading of the manuscript. V.M. acknowledges support through the STIPCO European network, Contract HPRNCT200200319 (grant to M.L.).
Footnotes

↵ * To whom correspondence should be addressed. Email: lassig{at}thp.unikoeln.de.

Author contributions: V.M. and M.L. designed research, performed research, analyzed data, and wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: CRP, cAMP response protein.
 Copyright © 2005, The National Academy of Sciences
References

↵
Ptashne, M. & Gann, A. (2002) Genes and Signals (Cold Spring Harbor Lab. Press, Woodbury, NY).

↵
Wray, G. A., Hahn, M. W., Abouheif, H., Balhoff, J. P., Pizer, M., Rockman, M. V. & Romano, R. A. (2003) Mol. Biol. Evol. 20 , 13771419. pmid:12777501

↵
Rajewsky, N., Socci, N. D., Zapotocky, M. & Siggia, E. D. (2002) Genet. Res. 12 , 298308.

McCue, L. A., Thompson, W., Carmack, C. S. & Lawrence, C. E. (2002) Genet. Res. 12 , 15231532.

Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B. A. & Johnston, M. (2003) Science 301 , 7176. pmid:12775844
 ↵
 ↵

↵
Brown, C. T. & Callan, C. G., Jr. (2004) Proc. Natl. Acad. Sci. USA 101 , 24042409. pmid:14983022

↵
Gerland, U., Moroz, D. & Hwa, T. (2002) Proc. Natl. Acad. Sci. USA 99 , 1201512020. pmid:12218191

↵
Djordjevic, M., Sengupta, A. M. & Shraiman, B. I. (2003) Genome Res. 13 , 23812390. pmid:14597652

↵
Berg, J. & Lässig, M. (2003) Biophysics (Moscow) 48 , Suppl. 1, 3644.
 ↵
 ↵
 ↵

↵
Arndt, P. & Hwa, T. (2005) Bioinformatics 21 , 23222328. pmid:15769841

↵
Kimura, M. (1962) Genetics 47 , 713719. pmid:14456043

Kimura, M. & Ohta, T. (1969) Genetics 61 , 763771.

↵
Ohta, T. & Tachida, H. (1990) Genetics 126 , 219229. pmid:2227381
 ↵
 ↵
 ↵

↵
Robison, K., McGuire, A. M. & Church, G. M. (1988) J. Mol. Biol. 284 , 241254.

↵
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. & Thompson, J. D. (2003) Nucleic Acids Res. 31 , 34973500. pmid:12824352

↵
Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. (1998) Biological Sequence Analysis (Cambridge Univ. Press, Cambridge, U.K.).

↵
Egan, J. P. (1975), Signal Detection Theory and ROC Analysis (Academic, New York).
 ↵
 ↵

↵
Winfield, M. D. & Groisman, E. A. (2004) Proc. Natl. Acad. Sci. USA 101 , 1716217167. pmid:15569938