New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology
Theory of prokaryotic genome evolution
Contributed by Eugene V. Koonin, August 24, 2016 (sent for review June 7, 2016; reviewed by Edo Kussell and Claus O. Wilke)
See related content:
- Profile of Eugene V. Koonin- Jan 23, 2017

Significance
Bacteria and archaea have small genomes with tightly packed protein-coding genes. Typically, this genome architecture is explained by “genome streamlining” (minimization) under selection for high replication rate. We developed a mathematical model of microbial evolution and tested it against extensive data from multiple genome comparisons to identify the key evolutionary forces. The results indicate that genome evolution is not governed by streamlining but rather, reflects the balance between the benefit of additional genes that diminishes with the genome size and the intrinsic preference for DNA deletion over acquisition. These results explain the observation that, in an apparent contradiction with the population genetic theory, microbes with large genomes reach higher abundance and are subject to stronger selection than small “streamlined” genomes.
Abstract
Bacteria and archaea typically possess small genomes that are tightly packed with protein-coding genes. The compactness of prokaryotic genomes is commonly perceived as evidence of adaptive genome streamlining caused by strong purifying selection in large microbial populations. In such populations, even the small cost incurred by nonfunctional DNA because of extra energy and time expenditure is thought to be sufficient for this extra genetic material to be eliminated by selection. However, contrary to the predictions of this model, there exists a consistent, positive correlation between the strength of selection at the protein sequence level, measured as the ratio of nonsynonymous to synonymous substitution rates, and microbial genome size. Here, by fitting the genome size distributions in multiple groups of prokaryotes to predictions of mathematical models of population evolution, we show that only models in which acquisition of additional genes is, on average, slightly beneficial yield a good fit to genomic data. These results suggest that the number of genes in prokaryotic genomes reflects the equilibrium between the benefit of additional genes that diminishes as the genome grows and deletion bias (i.e., the rate of deletion of genetic material being slightly greater than the rate of acquisition). Thus, new genes acquired by microbial genomes, on average, appear to be adaptive. The tight spacing of protein-coding genes likely results from a combination of the deletion bias and purifying selection that efficiently eliminates nonfunctional, noncoding sequences.
The majority of bacterial and archaeal genomes are small, at least compared with the genomes of multicellular and many unicellular eukaryotes (1, 2). Also, with the exception of deteriorating genomes of some parasitic bacteria, the prokaryotic genomes are highly compact, with densely packed protein-coding genes and a low fraction of noncoding sequences (3). The small genome size is thought to be selected for fast replication, whereas the high gene density additionally facilitates coregulation of gene expression via the operon organization (4, 5). Across the full range of cellular life forms, a significant positive correlation has been shown to exist between genome size and
The population genetic theory clearly predicts an inverse correlation between the strength of selection at different levels and genome size: small genomes are predicted to be subject to stronger selection than large genomes (9). However, when protein sequence-level selection was measured for multiple groups of closely related bacteria using the ratio of the nonsynonymous to synonymous substitution rates (dN/dS) as a proxy, the opposite effect, namely a significant negative correlation between dN/dS and genome size, was observed, indicating that larger prokaryotic genomes typically evolve under a stronger selection than small ones (13).
Here, we sought to further investigate the evolutionary factors that control genome evolution in prokaryotes. We reproduced the negative correlation between dN/dS and genome size on an expanded genome collection and then, developed a mathematical model of genome evolution by gene gain and loss in prokaryotic populations. By fitting the distribution of genome sizes predicted by the model to the empirical distribution for many groups of prokaryotes, we found that a good fit between theory and the data could be obtained only for models that included a positive mean fitness contribution of the gained genes countered by a deletion bias. These results imply that, at the level of gene gain and loss, selection for genome compactness does not play a major role in determining the genome size in prokaryotes. Rather, the relatively small number of genes in prokaryotes is explained by the diminishing return associated with the acquisition of new genes as the genome grows combined with the intrinsic deletion bias.
Results
Protein-Level Selection and Microbial Genome Size: Model Prediction and Genomic Data.
It has been shown previously that, in prokaryotes, genome size is negatively and significantly correlated with dN/dS, suggestive of a stronger protein-level selection in larger genomes (13). For the purpose of this work, we reproduced these findings on a substantially expanded set of groups of closely related bacterial and archaeal genomes compiled in the latest version of the Alignable Tight Genomic Cluster (ATGC) database (14) (Fig. 1). Specifically, genome size (measured by the number of genes) and dN/dS values are correlated with Spearman’s rank correlation coefficient
Genome size and selection in prokaryotes. Mean observed genome sizes (number of genes) are plotted against (A) the selection strength and (B) the estimated effective population size for each ATGC. Error bars correspond to 1 SD. ATGCs with more than 20 species are indicated in orange. The mean genome size, calculated using the mathematical model, is indicated in B by a solid red line. The model was applied with a constant selection coefficient, the gain and loss rates of Eq. 11, and optimized parameters shown in Table 1.
To quantitatively analyze the relations between selection strength and genome size, a mathematical model was implemented using the Moran process framework (15). In the model, a genome is represented as a collection of
The model of genome evolution. (Upper) Illustration of the mathematical model of genome size evolution. The number of genes changes stochastically via gene gains and losses, which occur with probabilities
Gain and loss probabilities govern the stochastic genome size dynamics, with the intuitive relation (derivation is given in Methods)
Different regimes for the selection and the gain/loss rates ratio (Eq. 7). In the shaded area, the genome size steady state is achieved for
For the calculation of the steady-state distribution of genome size, it is useful to present the population model of the genome size evolution as a random walk, where the probability of a step up is
Even for
In all three log-likelihood optimized parameter sets, the selection coefficient is strictly positive for all ATGCs and weakly depends on
For complementarity, fitting with constant
As an approximation for nonuniversal, ATGC-specific factors that affect the genome size, optimization of the gene acquisition and deletion rates was performed separately for each ATGC. The selection coefficient was taken to be the same for all ATGCs and set to the value obtained in the global fitting at the previous stage. For each ATGC,
Comparison of the model predictions with the empirical genome size distributions. The observed genome size distributions are shown by bars for six ATGCs that consist of 20 species or more each. Genome size distributions predicted by the population evolution model are shown by red lines using the selection landscape and deletion–acquisition rate of Eqs. 10 and 11 and optimized
Evolution of Distinct Functional Classes of Genes.
In our model, the gene acquisition and deletion rates,
Finally, analogous to the whole-genome fitting procedure, to account for ATGC-specific effects, model distributions were further optimized by fitting ATGC-specific
Discussion
The notion of strong purifying selection that favors small genomes (or more precisely, a small number of genes) in prokaryotes (10) seemed incompatible with the observed significant positive correlation between the genome sizes of bacteria and archaea and the inferred selection strength on the protein level reflected in the dN/dS ratio (13) (this work). These observations indicate that, on average, the larger the genome of a bacterium or an archaeon, the stronger selection under which the protein-coding genes evolve. This apparent discrepancy between the comparative genomic observations and the predictions of the population genetic theory motivated us to further investigate the selection regimes of microbial genomes. To this end, we compared the predictions of a mathematical model of genome evolution with the genome size distributions in 60 clusters of closely related bacteria and archaea.
To infer the selection landscape with regard to the genome size, the effective population size was estimated for each ATGC using the dN/dS ratio and assuming the same selection coefficient
Inference of gain and loss probabilities requires estimation of three terms, namely gene gain and loss rates and the selection landscape. In principle, all three values depend on the genome size, such that it is impossible to infer all terms without assumptions on the functional forms of the respective dependencies. However, it has to be emphasized that our conclusions do not depend on specific modeling assumptions and hold as long as
Extension of the model to account for subsets of genes allowed additional validation of the model consistency and assessment of the selection affecting any class of genes. In particular, we analyzed genes that are associated with specific cellular functions as classified in the COG database (17, 18) under the assumption that functionally similar genes evolve under similar selection landscapes. We obtained good fits to the model for all functional classes, indicating that the conclusion on the typical beneficial effect of gene acquisition applies to functionally diverse classes of genes. However, there were two notable exceptions to this consistency, namely the mobilome and the singletons. The distributions of the sizes of these classes in all ATGCs are much wider than predicted by the model, and the parameters could not be optimized to obtain a good fit. These observations imply that the evolutionary regimes of the mobilome and the singletons qualitatively differ from the genes in the other functional classes that possess “normal” cellular functions. There are indications of the nature of these differences. Many components of the mobilome, such as transposons, propagate within a genome, so that the dynamics of this class is dominated by duplication rather than gain from an external gene pool. The singletons are fast evolving genes that, on average, do not confer any benefit on the organism. Indeed, in a separate recent analysis of microbial genome evolution models, we have shown that the replacement rate of the singletons is effectively infinite compared with the replacement rates of the rest of the genes (28).
The model analysis described here was performed assuming a steady state with respect to the genome size, and two points have to be addressed in this regard. Each ATGC consists of phylogenetically close genomes (29) that cannot be considered independent samples from the genome size distribution without additional substantiation. We, therefore, verified that, for each ATGC, the divergence between the genomes, defined in this case as the number of gains and losses, was greater than the genome size distribution width, ensuring that a sufficient number of evolutionary events occurred so that the genomes represent an independent set of samples from the size distribution. A low bound for the number of gene gain and loss events that occurred since the divergence of individual genomes can be estimated from the number of singletons divided by the probability of the acquisition of such genes. The required number of gains and loss events for sufficient sampling of the distribution can be estimated from the SD of all genome sizes in an ATGC. The estimated numbers of gain events are greater than the SDs for all ATGCs (SI Appendix, Fig. S11), indicating that the genome size distribution was sampled sufficiently.
In our previous work, a comprehensive analysis of gain and loss events was performed for the same groups of microbial genomes (ATGCs) that were analyzed here (14). The gene loss to gain ratios obtained through a maximum likelihood reconstruction of genome evolution formed a broad distribution, with the mean value of about two. The evolution models analyzed here (Eqs. 10 and 11) yield a skewed distribution of genome sizes, which implies a distribution of loss/gain ratios with the mean slightly greater than unity. The distributions and parameters fitted here cannot explain such a large difference between the model predictions and inferences from genome comparison. Thus, it seems likely that, most of the time, the majority of the genomes are somewhat smaller than the long-term equilibrium size. A biologically plausible scenario is that prokaryotes are exposed to beneficial genetic material only for short periods of time, resulting in brief intervals of fast growth followed by slow genome shrinking (30). Steady state is possible under this scenario as well but only as the average over multiple cycles of gain and loss, which probably occur on a timescale much longer than the scale of the ATGC evolution. The strict steady-state analysis presented here can be regarded as a coarse-grained description of more complex evolutionary scenarios; however, our key finding, that acquired genes are, on average, beneficial, is expected to hold also for higher-order analyses.
The results of this analysis indicate that elimination of genes under the pressure of purifying selection is not the dominant factor of microbial evolution. On the contrary, acquisition of genes by microbes seems to be largely an adaptive process, although the positive selection that governs the genome dynamics, on average, is likely to be weak. This conclusion by no means contradicts the population genetic theory as such (9) but is incompatible with the assumption that newly acquired (and fixed in the population) genes are, on average, neutral. In other words, all of the estimates of the cost of the genetic material (11) can be valid in themselves, but the positive selection coefficient that is, on average, associated with new genes offsets these costs. Given that new genes are, on average, beneficial, microbes with larger
To summarize, the analysis described here presents the formal theory of the evolution of prokaryotic gene content. Perhaps unexpectedly, comparison of the theory predictions with the genomic data shows that gene gain by prokaryotes, leading to genome growth, is largely an adaptive process, with the exception of “nonfunctional” gene classes, the mobilome and the singletons. From the biological standpoint, it seems plausible that the apparent beneficial effect of gene gain is a combined result of the capture of metabolic enzymes that can expand the biochemical capacity of microbes (31), regulators and signaling proteins that enhance regulatory circuits (32), and defense genes (33). However, much more research is required to reconstruct the full functional landscape of microbial evolution.
Methods
Prokaryotic Genome Size Data and Nonsynonymous to Synonymous Nucleotide Substitution Ratio.
Genomes of 707 prokaryotic species grouped into 60 ATGCs (14, 29) were analyzed. In addition to the number of genes (
Inference of Effective Population Size.
Synonymous mutations are assumed to be neutral and, therefore, fixed at a rate
where
Selection Function Symmetry with Respect to Gene Acquisition and Deletion.
The relation given by Eq. 1 is derived as follows. Noting by
Selection and Fitness Relations.
The selection coefficient
when considering the selective advantage of individual
Mean Genome Size Dynamics.
For uniform population invaded by mutation that is associated with fitness
where
where the + and − superscripts indicate acquisition and deletion events, respectively. The fitness derivative with respect to the number of genes can be approximated as
such that
Substituting Eqs. 18, 19, 20, 21, and 22 to Eq. 17, we get Eq. 5 for genome size dynamics.
Steady-State Genome Size Distribution.
The genome size distribution satisfies the difference equation
This equation can be approximated by a second-order differential equation (37). The left-hand side is expanded to first order in
where in the weak mutation limit,
Optimization of the Goodness of Fit R 2 for Mean Genome Size Vs. Effective Population Size Dependency.
To search the parameter space more efficiently during the log-likelihood optimization, a preliminary optimization stage was implemented. Eq. 7 determines the relation between
Maximum Likelihood Optimization.
The log likelihood (LL) of the model given the data is estimated as
where
where the sum is over all 707 species,
where the inner sum is over all species that belong to ATGC
Gain and Loss Probabilities Genome Size Dependent.
The gene gain (loss) probability depends on genome size through the selection coefficient and the acquisition (deletion) rate. The variation in gain and loss probability
where all quantities are calculated using the mean ATGCs genome size and effective population size. If, say, the second term in the above equations is significantly smaller than the first term, the variation in
For parameters fitted using linear selection landscape and summarized in SI Appendix, Table S2, the terms involving the derivatives of
The Model with Two Types of Genes.
The model can be extended to account for distinct classes of genes evolving under different selection landscapes. For two classes, the numbers of genes in each class,
where the interpretation is as follows. The probability to acquire a gene of class
where
To calculate the steady-state distribution for a subset of genes,
Acknowledgments
We thank members of the group of E.V.K. for helpful discussions. The authors’ research is supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).
Footnotes
- ↵1To whom correspondence should be addressed. Email: koonin{at}ncbi.nlm.nih.gov.
This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2016.
Author contributions: Y.I.W. and E.V.K. designed research; I.S. performed research; I.S., Y.I.W., and E.V.K. analyzed data; and I.S. and E.V.K. wrote the paper.
Reviewers: E.K., New York University; and C.O.W., The University of Texas at Austin.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1614083113/-/DCSupplemental.
Freely available online through the PNAS open access option.
References
- ↵.
- Koonin EV,
- Wolf YI
- ↵.
- Reddy TB, et al.
- ↵
- ↵.
- Price MN,
- Huang KH,
- Arkin AP,
- Alm EJ
- ↵.
- Nuñez PA,
- Romero H,
- Farber MD,
- Rocha EP
- ↵.
- Lynch M,
- Conery JS
- ↵.
- Lynch M
- ↵.
- Lynch M
- ↵.
- Lynch M
- ↵
- ↵.
- Lynch M,
- Marinov GK
- ↵
- ↵.
- Novichkov PS,
- Wolf YI,
- Dubchak I,
- Koonin EV
- ↵
- ↵
- ↵
- ↵.
- Tatusov RL,
- Koonin EV,
- Lipman DJ
- ↵.
- Galperin MY,
- Makarova KS,
- Wolf YI,
- Koonin EV
- ↵
- ↵.
- Konstantinidis KT,
- Tiedje JM
- ↵
- ↵
- ↵.
- Daubin V,
- Ochman H
- ↵.
- Yu G,
- Stoltzfus A
- ↵.
- Petrov DA,
- Sangster TA,
- Johnston JS,
- Hartl DL,
- Shaw KL
- ↵
- ↵.
- Kuo CH,
- Ochman H
- ↵.
- Wolf YI,
- Makarova KS,
- Lobkovsky AE,
- Koonin EV
- ↵.
- Novichkov PS,
- Ratnere I,
- Wolf YI,
- Koonin EV,
- Dubchak I
- ↵
- ↵.
- Maslov S,
- Krishna S,
- Pang TY,
- Sneppen K
- ↵
- ↵.
- Makarova KS,
- Wolf YI,
- Koonin EV
- ↵
- ↵.
- Sella G,
- Hirsh AE
- ↵.
- Kryazhimskiy S,
- Tkacik G,
- Plotkin JB
- ↵.
- Codling EA,
- Plank MJ,
- Benhamou S
Citation Manager Formats
Sign up for Article Alerts
Article Classifications
- Biological Sciences
- Evolution