New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Protein stability imposes limits on organism complexity and speed of molecular evolution

Edited by William A. Eaton, National Institutes of Health, Bethesda, MD, and approved August 24, 2007 (received for review June 11, 2007)
Related Article
 In This Issue Oct 09, 2007
Abstract
Classical population genetics a priori assigns fitness to alleles without considering molecular or functional properties of proteins that these alleles encode. Here we study population dynamics in a model where fitness can be inferred from physical properties of proteins under a physiological assumption that loss of stability of any protein encoded by an essential gene confers a lethal phenotype. Accumulation of mutations in organisms containing Γ genes can then be represented as diffusion within the Γdimensional hypercube with adsorbing boundaries determined, in each dimension, by loss of a protein's stability and, at higher stability, by lack of protein sequences. Solving the diffusion equation whose parameters are derived from the data on point mutations in proteins, we determine a universal distribution of protein stabilities, in agreement with existing data. The theory provides a fundamental relation between mutation rate, maximal genome size, and thermodynamic response of proteins to point mutations. It establishes a universal speed limit on rate of molecular evolution by predicting that populations go extinct (via lethal mutagenesis) when mutation rate exceeds approximately six mutations per essential part of genome per replication for mesophilic organisms and one to two mutations per genome per replication for thermophilic ones. Several RNA viruses function close to the evolutionary speed limit, whereas error correction mechanisms used by DNA viruses and nonmutant strains of bacteria featuring various genome lengths and mutation rates have brought these organisms universally ≈1,000fold below the natural speed limit.
Phenomenological approaches to study molecular evolution, developed since the preDNA era, assume selective advantage of certain alleles or the existence of a singlefitness peak in the genome space (1). Although mathematical genetics approaches have brought remarkable insights over the years, they lack a fundamental connection between the fitness (reproductive success) of organisms and molecular properties of proteins encoded by their genomes. On the other hand, our understanding of the molecular basis of folding, stability, and function of proteins has advanced significantly. In particular, statistical–mechanical studies provided key insights into sequence requirements for proteins to fold and to be stable in their native functional states (2). Here, we develop an evolutionary model where the fitness of an organism, in principle, can be directly inferred from its DNA sequences using a proteinfolding model and a well defined physiological assumption of genotype–phenotype relationship. This physiological assumption is motivated by recent experiments that showed that knockout of any essential gene confers a lethal phenotype to an organism (3, 4). The number of such essential genes varies from organism to organism: all genes are essential in viruses, whereas in bacteria, essential genes can reach up to onethird of all genes. As a minimal functional requirement, most proteins [except intrinsically disordered ones (5, 6)] have to be stable in their native conformations. Therefore, here we assume the following fundamental genotype–phenotype relationship: for an organism to be viable, all of its essential genes must encode (at least minimally) stable proteins. Although this requirement is certainly minimal, it is necessary for essential genes and universal. There have been conflicting opinions as to whether greater stability of proteins confers selective advantage or disadvantage or is neutral (7–9). Here we assume that protein stability is essentially a physiologically neutral trait insofar as a protein possesses sufficient stability to stay in the folded state (9, 10). However, proteins accumulate mutations over the course of evolution. Although many mutations may be neutral with respect to protein stability (9), and some of them can be stabilizing, eventually accumulation of too many mutations will render the protein unstable and nonfunctional, and it will confer a lethal phenotype to its carrier organism.
Therefore, molecular evolution can be rendered as diffusion in a space of protein stabilities. To describe the space in which such random walks occur, we first turn to an elementary consideration of the thermodynamics of protein folding.
Results
In their native conformations, proteins possess low effective energy (or enthalpy, because the difference between energy and enthalpy of a protein at normal pressure is negligible) and low conformational entropy, whereas in the denatured (unfolded) state, their effective energy and conformational entropy are both high. Effective energy of a protein is essentially a potential of mean force that includes direct intraprotein interactions as well as protein and solventsolvent contributions that change as proteins fold and unfold. Proteins unfold in a twostate manner at temperature T_{f} when free energy of the native state G_{F} equals to free energy of the unfolded state G_{U} (11): where k_{B}, E_{F}, E_{U}, and S_{U} are the Boltzmann constant, free energy, energy, and entropy of folded and unfolded states, respectively; we assumed here that conformational entropy of the folded state is low compared with conformational entropy of the unfolded state (11, 12). Mutations affect mostly the compact native state by changing its energy E_{F} (13, 14). A protein gets unfolded when its energy reaches value E_{max}, such that it loses stability at the temperature T at which the organism lives, or assuming the conformational entropy of unfolding approximately constant for a standard size 100aa domain (11) [This is a simplified description; see supporting information (SI) Text for a more complete analysis.] Therefore, if the energy of the native state of a mutant protein exceeds E_{max}, the protein becomes unstable and dysfunctional, causing the death of the organism that carries its gene (see Fig. 1).
Because of the effect of sequence depletion, the range of the possible native energies of proteins is limited from below as well; there are simply no sequences that can deliver energies below a certain threshold energy E_{min} (2, 15, 16) (see SI Text for an estimate of E_{min}).
At any given time t, the genotype of an organism that carries Γ genes can be fully characterized in this model by the native effective energies of the corresponding Γ proteins, i.e., as a point (E_{1}, E_{2}… E_{Γ}) in the Γdimensional space. We assume that changes of stability of different proteins upon mutations are independent and can be statistically described by the probability distribution W (E_{i}^{WT} → E_{i}^{mut}) = W (E_{i}^{WT}, E_{i}^{mut}). In vitro experiments on mutant proteins provide a reasonable approximation for the shape of W (see ref. 13 and Methods). For simplicity, we assume that the probability distribution W is the same for all proteins, because it is a statistical property directly following from the physics of interactions among amino acid residues in the protein structure. As each protein accumulates mutations in its sequence, its stability changes, and so is the position of the organism in the stability space E⃗(t) (we use vector notation here to highlight the Γdimensional character of space of native energies of all proteins in a genome). By construction, our analysis excludes natively unstructured proteins, so they are not counted in Γ even if essential for the organism, because the effect of mutations in the unstructured proteins on the fitness of the organism cannot be easily incorporated in the model.
Our model is equivalent to the following Γdimensional diffusion problem. Population consisting of N noninteracting organisms represents N independent random walkers, each diffusing in Γdimensional space. Organisms replicate with rate b, and there is natural mutationunrelated decay rate of genomes d (death rate due, e.g., to degradation of viral RNA). Introducing the joint (not normalized) distribution of genotypes in a population, i.e., number of organisms P(E⃗, t) having genotype E⃗ = (E_{1}, E_{2} … E_{Γ}) at time t, we can write the replicationmutation balance diffusion equation (See Methods for derivation and detailed discussion of its applicability.) In this equation, h is the average, and D is the mean square change of protein stability upon point mutation, The first two terms in Eq. 4 are due to the mutational flux of organisms to and from the vicinity of point E⃗, and the third term corresponds to replication of organisms. m is mutation rate per gene. Very importantly, Eq. 4 should be augmented by 2Γ boundary conditions P(E_{1}… E_{i}…, E_{Γ}) = 0 for any E_{I} > E_{max} and for any E_{i} < E_{min} corresponding to organismal death and sequence depletion conditions for each gene, as explained above. We assume that organism replication rate b does not depend on E⃗, so the fitness landscape is flat, and there are only two phenotypes, the viable phenotype for E_{min} < E_{I} < E_{max}, i = 1… Γ, and a lethal phenotype otherwise.
Eq. 4 provides a deterministic description of the evolution of distribution of genotypes. It is applicable in the regime when Nm > 1 (9, 17). In the opposite case, Nm < 1, the population becomes essentially monoclonal after each mutation, so the boundary conditions at the organism level would not be straightforward.
Because mutations in all proteins are independent, and boundary conditions are the same in each dimension (i.e., for each gene), one can write P(E) = ∏_{i=1}^{Γ} p(E_{i}), separate the variables, and reduce the problem to a product of Γ 1D diffusion problems for each gene: Again, h and D are mean and mean square (de)stabilization effects upon point mutations. The analysis of the data on point mutations available in the ProTherm database (18) and in ref. 13 gives h = 1(kcal/mol); D ≈ 3(kcal/mol)^{2} (see Methods and SI Text). In the diffusion approximation, Eqs. 4 and 5, the exact distribution of protein stability changes upon mutations W is reduced to its first two moments. This approximation does not change any qualitative features of the model.
The longtime solution of Eq. 4 has the form P(E, t) = e^{λt}P(E). The steadystate solution of Eq. 5 is where A is the normalization constant. Converting E, E_{min}, and E_{max} into protein stabilities ΔG, ΔG_{min}, and ΔG_{max} using Eqs. 1 and 2 and assuming a standard unfolded state, we get the distribution of stabilities of all proteins: The population growth rate λ is then The population survives only when λ ≥ 0, which imposes an upper bound Γ* on the number of genes of an organism at a given mutational load: Note that mΓ/b is simply the number of mutations per portion of the genome encoding essential genes per replication event. Even in the absence of the natural death process (i.e., when d = 0), the righthand side (r.h.s.) of Eq. 8 establishes an absolute upper physical limit on this number for any unicellular asexually reproducing organism.
Eqs. 6–8 represent the main results of this work, and now we turn to their biological implications. First, Eq. 6a predicts a universal distribution of the stabilities of all existing protein domains. Stabilities of proteins vary in the range from 0 to ≈20 kcal/mol (18, 19), providing an estimate for ΔG_{max} − ΔG_{min} ≈ 20 kcal/mol. As can be seen from Fig. 2, Eq. 6a describes the distribution of protein stabilities quite well with parameters h and D derived from independent point mutation experiments. In particular, it reproduces a characteristic asymmetric distribution of protein stabilities. Previously, the distribution of stabilities of evolving proteins has been studied in lattice protein simulations by Taverna and Goldstein (10) and by Bloom et al. (9). The latter paper (9) also developed a master equation description of the protein stabilities. The results of these studies are qualitatively consistent with Eq. 6a in the limit of high stabilities (low values of E). However, unlike Eq. 6a, they predict p(E_{max}) ≠ 0 and apparently monotonic dependence of p on ΔG. This difference is partly due to the conversion of the integrodifferential equation Eq. 11 to the differential equation Eq. 4, to make it amenable to the exact analytical treatment. Also, it may be partly due to the use of the 2D lattice model in simulations in refs. 9 and 10. A numerical solution of full integrodifferential Eq. 11 yields small but nonzero values of p at E_{max} (or ΔG = 0). However, importantly, the full numeric solution of Eq. 11 still predicts a pronounced maximum of p(ΔG) at ΔG ≈ −5 kcal/mol (data not shown).
Interestingly, the experimental distribution of protein stabilities (Fig. 2) vanishes at ΔG = 0, in agreement with our differential equation analysis. This feature may be due to a number of reasons. First, there is a possible bias in the ProTherm database (e.g., aggregation impedes precise measurements of marginal values of ΔG; proteins from many different species of all kingdoms of life are mixed in the database, etc.). Second, there is a possibility (and likelihood) of an actual additional selection against barely stable proteins. For example, one can imagine that as protein folding rates decrease with protein stability (20, 21), marginally stable proteins would fold too slowly to perform their biochemical function. Therefore, although our theory provides a reliable quantitative description of the distribution in a broad range of protein stabilities, a complete quantitative understanding of its behavior in the region close to the ΔG = 0 threshold can be achieved only through an unbiased highthroughput experimental measurement of protein stabilities in a complete genome, possibly in parallel with a structural proteomics project.
Eqs. 7 and 8 establish an upper limit on the genome sizes of organisms at a given mutational load and replication and death rates. They predict that, if the genomes are too large or the mutation rate too high, populations will go extinct because of lethal mutagenesis. Our theory predicts a universal upperlimit threshold of a mutational load that a population can sustain, even without the death of organisms or the loss of parent genomes due to natural causes (e.g., degradation of parent RNA in viruses). With the estimated values of the parameters h, D, and E_{max} − E_{min}, the mutational threshold can be estimated (for d = 0) as approximately six mutations per essential portion of the genome per replication. This prediction is in good quantitative agreement with many lethal mutagenesis experiments on RNA viruses (22). The exponential decay of the distribution of protein stabilities at low E makes Eqs. 7 and 8 not very sensitive to the particular choice of the value of E_{min}. Moreover, the upper limit of six mutations per genome per generation is an estimate of the lethal mutagenesis threshold from above, because it assumes that movements of a genome in the protein stability space along each axis are independent, reflected in the separation of variables in Eq. 4. In fact, when mΓ/b>1 (more than one mutation per generation), the possibility exists that because mutations are (mostly) introduced upon genome replication, more than one mutation will be introduced in a genome at the same time. In the language of our diffusion equation, that means that the “random walker” advances along several dimensions at a time (see Methods for more detail). Because h > 0, these correlated moves in the stability space result in a faster drift toward E_{max}, slightly decreasing the threshold mutation rate (unpublished results).
Our theory predicts that the maximum essential genome size (number of essential genes) Γ* is smaller for organisms with higher mutation rate m. This prediction is in excellent agreement with observations for a broad range of organisms (see figure 1 in ref. 22, reproduced as SI Fig. 7). A further example is provided by data on the distribution of viral genome lengths. It is well known that the mutation rate in RNA viruses is much greater than that in dsDNA viruses (22, 23). Correspondingly, the theory predicts much longer genomes for dsDNA viruses than for RNA viruses, in harmony with observations (Fig. 3).
Another important prediction from the theory concerns thermophilic organisms. Stability of proteins at elevated environmental temperature requires that their nativestate energy be lower than of mesophilic proteins, according to Eq. 1 (see also SI Text). Correspondingly, the range of possible nativestate energies E_{max} − E_{min} in Eqs. 6–8 shrinks for thermophilic organisms, and the mutational meltdown threshold decreases. The magnitude of the effect can be estimated using a typical value of the entropic difference between unfolded and folded states for a typical 100aa protein domain, 0.25 kcal/mol/K (11). An increase of environmental temperature by 60 K, typical of hyperthermophiles, results in a decrease of E_{max} − E_{min} from 20 to 8 kcal/mol, leading to a decrease in the mutational meltdown threshold (r.h.s. of Eq. 8) from 6 to 2. The implication of that is 2fold. First, it suggests that hyperthermophilic (crenarchaeal) viruses (phages) can only be dsDNA, because RNA viruses are already in the mutational meltdown regime at this temperature. Indeed, all known crenarchaeal viruses are dsDNA (24), and so far no extremophilic virus with an RNA genome has been found.
Second, assuming the same mutation rate per nucleotide in mesophilic and thermophilic organisms (25), our theory predicts that organisms living at elevated temperatures should have shorter genomes. In Fig. 4, we plotted the number of genes in the genomes of 202 bacteria and archaea (see supporting information table 1 of ref. 14) as a function of their optimal growth temperature. It is clearly seen that prokaryotes living at 60°C or above systematically possess shorter genomes (≈2,000 genes) than mesophilic prokaryotes with optimal growth temperature of 20–40°C. The existence of mesophilic prokaryotes with short genomes does not contradict our theory, because Eq. 8 sets only an upper limit on the size of the essential proteome; if an organism functions with a smaller number of genes, it simply does not “feel” the pressure on the genome size because of protein stability requirements. In these cases, regulation of the genome size is governed by other biological mechanisms, such as the energetic cost of genome duplication and repair.
Discussion and Conclusions
In contrast to other approaches (8), we did not a priori assume here an existence of optimal stability of proteins that renders highest fitness. Such assumptions are often motivated by a circular argument that observed stabilities of proteins are not too high. Here we assume that greater stability of proteins is not functionally advantageous to organisms, so long as proteins possess at least minimal stability (ΔG = 0) to function, On the other hand, the sequence entropy factor favors less stable proteins; sequences that can deliver higher stability become more scarce (10, 26). The entropic factor in sequence space is the main reason why destabilizing mutations are statistically more frequent (i.e., that mutation drift parameter h in Eq. 4 is positive in our theory); mutated sequences of lower stability are more likely to be found than those stabilizing a protein (10, 26) [a more detailed analysis that explicitly takes into account sequence statistics, i.e., it considers the realistic dependence of h and D in Eq. 4 on E; ref. 26 supports this conclusion (data not shown)].
One would then expect that most proteins should be “marginally stable,” with the distribution of stabilities peaked close to ΔG = 0 value (9, 10). However, the distribution of stabilities of real proteins has a characteristic peak at a moderate (not too low) value ≈5 kcal/mol and a sharp asymmetric decrease at both lower and higher stabilities (see Fig. 2). This theory explains why naïve expectations of marginal stability of proteins are not borne out. Proteins evolve as part of organisms, and very low values of their stability would result in toofrequent mutational “falls from the ΔG = 0 cliff,” resulting in the death of the organism and subsequent elimination of the gene encoding the marginally stable protein. This is a clear manifestation of the effect of the constraints imposed by organismal evolution on the distribution of the molecular properties of proteins. An important lesson from this study is that a peaked distribution of a specific property of evolved proteins, such as stability ΔG, does not imply or require that proteins with the peak value of the property confer the highest fitness to their carrier organisms. Instead, the peaked distribution can arise from a balance between opposing factors (stability above lethal threshold and sequence entropy) in a locally flat fitness landscape with just two (lethal and viable) phenotypes.
Another key universal prediction of this model is lethal mutagenesis at high mutation rate, approximately six mutations per genome per replication for mesophiles and one to two for hyperthermophiles living close to 100°C. Because this number is per replication, a higher replication rate would make it possible to avoid lethal mutagenesis at a given absolute mutation rate. Lethal mutagenesis is often attributed to Eigen's error catastrophe (27, 28). However, this interpretation may be misleading. In fact, at high mutation rates, the quasispecies theory predicts a very different phenomenon, delocalization in sequence space. Because the quasispecies theory applies only to soft selection, extinction of populations via lethal mutagenesis cannot occur there (29, 30). The lethal mutagenesis predicted in the present work is also different from the Muller ratchet (31), because it can occur even for infinite populations upon exceeding a well defined mutation rate threshold, in contrast to the Muller ratchet mechanism (32). As Wilke and coworkers pointed our recently (33), there has been no clear understanding of lethal mutagenesis. Our findings represent a simple firstprinciple theory of this important phenomenon and may have direct implications for further development of therapies based on mutationinducing drugs or radiation.
The results presented here are most directly applicable to RNA viruses whose mutation rates are close to the mutational meltdown threshold of Eq. 8. In DNA organisms, mutation rates are typically 2–4 orders of magnitude lower because of the action of errorcorrection mechanisms (23). However, highly pathogenic strains of bacteria often exhibit mutator phenotypes (34) whose mutation rates can approach the mutational load limit discovered in this work. As in the case of viruses, it was argued that higher mutation rates in some strains of bacteria may emerge to facilitate adaptation to their environment. However, the mutational meltdown threshold puts a physical limit on the mutational response of pathogens to rapidly changing host environments.
Our model is basic, and it does not consider many biologically relevant factors, such as functional selection, epistasis, or death of organisms because of environmental fluctuations. Organisms do not interact in the model and do not compete for resources. We consider hard selection whereby population size is not fixed.
We further simplified our consideration of protein thermodynamics by assuming a roughly equal length of all proteins (reflected in universal E_{max} − E_{min} values). It is known that protein lengths vary in a broad range. However, a twostate folding unit to which this theory is applicable is a protein domain. The range of variation of domain sizes is somewhat smaller than that for complete proteins, so we made our estimates for a typical 100aa domain. The variation of domain lengths can be taken into account in further development of the theory.
Our analysis establishes the limits on the genome size that an organism can maintain at a given mutational load. In particular, the limitations set by this theory may be important for understanding the origins of life or de novo design of artificial life, because early or newly designed organisms probably could have an imperfect replication machinery and, thus elevated mutation rates.
Methods
Derivation of MutationDiffusion Eq. 4.
Assuming that replication and mutations are decoupled.
We consider the population of noninteracting N organisms. Each organism mutates and replicates independently, and therefore the mutationdiffusion equations will be derived for organisms. As a first and conceptually simplest step, we consider that mutations and replications are decoupled.
Consider a small time interval Δ t, such that the number of mutations occurring during this interval mΔ t is small. Now the probability density that an organism mutates from state (E′_{1}, E′_{2}… E′_{N}) to (E_{1}, E_{2}… E_{S}) is given by the expression: Here we assumed that mutations in a gene occur independently. The first term in the r.h.s. of Eq. 9 accounts for situations when mutations do not occur; the probability of that for each gene is (1 − mΔ t), and δ(E′_{i} − E_{i}) has the usual meaning of Dirac's deltafunction. The second term in the r.h.s. describes the probability that a mutation actually occurred in a gene, and W(E′_{i} → E_{i}) describes the probability that native state energy of a protein encoded by this gene changes from E′_{i} to E_{i}. This quantity depends only on the distribution of energetic effects of point mutations in proteins (see below). Now consider the ensemble consisting of N organisms. The distribution of genotypes in the population is given by the function P(E_{1}, E_{2}… E_{Γ}), such that: and the population size N is not fixed in this model; it describes hard selection, so that N itself depends on time where the growth/decay rate λ is defined in the text.
The distribution of genotypes of all organisms at time t + Δ t is then given by the relation: where the last term describes replication events: the probability that an organism replicates in a time interval Δ t is bΔ t, where b the is replication rate. Now, considering m Δt to be small enough to keep only firstorder in mΔ t terms in Eq. 9, we obtain from Eqs. 9 and 10, Now we can convert the integrodifferential Eq. 11 into a differential diffusion equation following a standard procedure: expanding P in the Taylor series up to a second order: Substituting Eq. 12 into Eq. 11, we arrive at Eq. 4 with properly defined parameters h and D and boundary conditions as outlined above.
The case when replication and mutations are coupled.
Consider now a scenario when replication and mutations are coupled, i.e., that mutations occur exclusively upon replication, with formation of the daughter genome on the parents' genome as a template. Now we have to distinguish between conservative (singlestranded viruses) and semiconservative replication (doublestranded organisms) (35).
Conservative replication.
Again, consider a small time interval Δt → 0. Introducing dimensionless quantity m̃=m/b and the probability of mutations per gene per replication and using otherwise the same notation as in Eq. 9: Here, the first term corresponds to the situation that replication did not occur in the time interval Δt, the second term says that when replication occurs, one master genome is preserved (i.e., conservative replication), and the third term says that when replication occurred, it could be either without mutation (probability (1 − m̃) for each gene) or with mutations, and we assumed that more than one mutation in one gene is unlikely. Assuming that m̃ ≪ 1, i.e., that the probability of a missense mutation per gene per replication is small [which holds under most biologically relevant conditions (23, 36)], we arrive at the same integrodifferential equation, Eq. 11.
Semiconservative replication.
In this case, the master equation for T will be slightly different, because mutations can occur now in both organisms so that no “master genome” is preserved (35), After straightforward algebra, we again arrive at the same final integrodifferential equation, Eq. 11. However, now the effective mutation rate is doubled, −2m. This correction is intuitive because, in the case of semiconservative replication, both progenies can acquire mutations (35).
Although the final equations appear to be the same in both cases (coupled and decoupled mutations and replications), this is so only under the additional condition that the probability of a mutation upon replication per gene is low, m̃ ≪ 1. The reason such requirement appears in the coupled case (and does not seem to be relevant in the decoupled case) is that, in the decoupled case, the mutations are smeared in time, so the probability that two genes mutate at the same time can always be neglected. In the case of coupled mutations and replications, mutations occur in “bursts,” i.e., upon replication only. For that reason, potentially mutations may become correlated between genes (which will lead to a more complex mutationreplication master equation). This factor will decrease the lethal mutagenesis threshold from six mutations per genome per generation to a slightly lower value (unpublished work). However, these correlations still remain insignificant when m̃ ≪ 1, which is a realistic condition, as we noted above.
The key biological difference between coupled and uncoupled scenarios in conservative cases is that when mutation and replication are coupled, the master genome is always preserved in a population, because mutations are introduced in the daughter strand only. This condition may significantly alter the conditions for the lethal mutagenesis threshold, because a preserved wildtype genome can always produce a nonmutated wildtype progeny (33). However, the fully coupled conservative replication scenario does not appear realistic for most organisms, such as viruses and bacteria. In most viruses such as the HIV retrovirus, mutations and replication are decoupled, and parent genomes are not preserved as mutations are introduced, e.g., mostly at the reversetranscription stage, after which original parent RNA is degraded. In bacteria, replication is semiconservative, i.e., mutations can be introduced in either strand by the action of an errorcorrection mechanism (35, 37), so that both progenies may carry mutations after replication, again eliminating the preservation of the parent genome.
Derivation of Energetic Mutation Parameters h and D from the Database.
The statistical data on how protein stability changes in response to mutations have been collected from the ProTherm database (18). In total, 2,804 and 2,418 mutations were considered for 105 and 71 proteins separately in two kinds of experimental measurements. In cases when multiple data were reported for the same mutant (e.g., measured by different techniques or evaluated at different conditions with subsequent extrapolation to normal), the average was taken over all data points corresponding to the same mutant. As noted in the text, mutations affect mostly the native states of proteins, so that to a good approximation,
The histogram in Fig. 5 represents the mutation freeenergy probability distribution, Using this distribution and the definitions provided in the text, we obtain the values of h and D quoted there. As a note of caution, we observe that possible biases in the ProTherm database can affect W, because some of the mutations have been performed with a specific biochemical purpose in mind.
Acknowledgments
We are grateful to Boris Shakhnovich, Jesse Bloom, and Claus Wilke for many fruitful discussions. This work was supported by the National Institutes of Health.
Footnotes
 ^{§}To whom correspondence should be addressed. Email: eugene{at}belok.harvard.edu

Author contributions: E.I.S. designed research; K.B.Z. and E.I.S. performed research; K.B.Z. and P.C. contributed new reagents/analytic tools; K.B.Z., P.C., and E.I.S. analyzed data; and K.B.Z. and E.I.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0705366104/DC1.
 Abbreviation:
 r.h.s.,
 righthand side.
 © 2007 by The National Academy of Sciences of the USA
References
 ↵
 Gillespie JH
 ↵
 ↵
 ↵
 Herring CD,
 Blattner FR
 ↵
 ↵
 ↵
 Bloom JD,
 Labthavikul ST,
 Otey CR,
 Arnold FH
 ↵
 ↵
 Bloom JD,
 Raval A,
 Wilke CO
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Meyerguz L,
 Grasso C,
 Kleinberg J,
 Elber R
 ↵
 van Nimwegen E,
 Crutchfield JP,
 Huynen M
 ↵
 Kumar MD,
 Bava KA,
 Gromiha MM,
 Prabakaran P,
 Kitajima K,
 Uedaira H,
 Sarai A
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Jacobs KL,
 Grogan DW
 ↵
 ↵
 Crotty S,
 Cameron CE,
 Andino R
 ↵
 ↵
 ↵
 Takeuchi N,
 Hogeweg P
 ↵
 ↵
 Lynch M,
 Burger R,
 Butcher D,
 Gabriel W
 ↵
 Bull JJ,
 Sanjuan R,
 Wilke CO
 ↵
 ↵
 Tannenbaum E,
 Deeds EJ,
 Shakhnovich EI
 ↵
 Drake JW,
 Holland JJ
 ↵
 Brumer Y,
 Shakhnovich EI