## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Theory of prokaryotic genome evolution

Contributed by Eugene V. Koonin, August 24, 2016 (sent for review June 7, 2016; reviewed by Edo Kussell and Claus O. Wilke)

## Significance

Bacteria and archaea have small genomes with tightly packed protein-coding genes. Typically, this genome architecture is explained by “genome streamlining” (minimization) under selection for high replication rate. We developed a mathematical model of microbial evolution and tested it against extensive data from multiple genome comparisons to identify the key evolutionary forces. The results indicate that genome evolution is not governed by streamlining but rather, reflects the balance between the benefit of additional genes that diminishes with the genome size and the intrinsic preference for DNA deletion over acquisition. These results explain the observation that, in an apparent contradiction with the population genetic theory, microbes with large genomes reach higher abundance and are subject to stronger selection than small “streamlined” genomes.

## Abstract

Bacteria and archaea typically possess small genomes that are tightly packed with protein-coding genes. The compactness of prokaryotic genomes is commonly perceived as evidence of adaptive genome streamlining caused by strong purifying selection in large microbial populations. In such populations, even the small cost incurred by nonfunctional DNA because of extra energy and time expenditure is thought to be sufficient for this extra genetic material to be eliminated by selection. However, contrary to the predictions of this model, there exists a consistent, positive correlation between the strength of selection at the protein sequence level, measured as the ratio of nonsynonymous to synonymous substitution rates, and microbial genome size. Here, by fitting the genome size distributions in multiple groups of prokaryotes to predictions of mathematical models of population evolution, we show that only models in which acquisition of additional genes is, on average, slightly beneficial yield a good fit to genomic data. These results suggest that the number of genes in prokaryotic genomes reflects the equilibrium between the benefit of additional genes that diminishes as the genome grows and deletion bias (i.e., the rate of deletion of genetic material being slightly greater than the rate of acquisition). Thus, new genes acquired by microbial genomes, on average, appear to be adaptive. The tight spacing of protein-coding genes likely results from a combination of the deletion bias and purifying selection that efficiently eliminates nonfunctional, noncoding sequences.

The majority of bacterial and archaeal genomes are small, at least compared with the genomes of multicellular and many unicellular eukaryotes (1, 2). Also, with the exception of deteriorating genomes of some parasitic bacteria, the prokaryotic genomes are highly compact, with densely packed protein-coding genes and a low fraction of noncoding sequences (3). The small genome size is thought to be selected for fast replication, whereas the high gene density additionally facilitates coregulation of gene expression via the operon organization (4, 5). Across the full range of cellular life forms, a significant positive correlation has been shown to exist between genome size and

The population genetic theory clearly predicts an inverse correlation between the strength of selection at different levels and genome size: small genomes are predicted to be subject to stronger selection than large genomes (9). However, when protein sequence-level selection was measured for multiple groups of closely related bacteria using the ratio of the nonsynonymous to synonymous substitution rates (*dN*/*dS*) as a proxy, the opposite effect, namely a significant negative correlation between *dN*/*dS* and genome size, was observed, indicating that larger prokaryotic genomes typically evolve under a stronger selection than small ones (13).

Here, we sought to further investigate the evolutionary factors that control genome evolution in prokaryotes. We reproduced the negative correlation between *dN*/*dS* and genome size on an expanded genome collection and then, developed a mathematical model of genome evolution by gene gain and loss in prokaryotic populations. By fitting the distribution of genome sizes predicted by the model to the empirical distribution for many groups of prokaryotes, we found that a good fit between theory and the data could be obtained only for models that included a positive mean fitness contribution of the gained genes countered by a deletion bias. These results imply that, at the level of gene gain and loss, selection for genome compactness does not play a major role in determining the genome size in prokaryotes. Rather, the relatively small number of genes in prokaryotes is explained by the diminishing return associated with the acquisition of new genes as the genome grows combined with the intrinsic deletion bias.

## Results

### Protein-Level Selection and Microbial Genome Size: Model Prediction and Genomic Data.

It has been shown previously that, in prokaryotes, genome size is negatively and significantly correlated with *dN*/*dS*, suggestive of a stronger protein-level selection in larger genomes (13). For the purpose of this work, we reproduced these findings on a substantially expanded set of groups of closely related bacterial and archaeal genomes compiled in the latest version of the Alignable Tight Genomic Cluster (ATGC) database (14) (Fig. 1). Specifically, genome size (measured by the number of genes) and *dN/dS* values are correlated with Spearman’s rank correlation coefficient

To quantitatively analyze the relations between selection strength and genome size, a mathematical model was implemented using the Moran process framework (15). In the model, a genome is represented as a collection of *Methods*)*dN/dS* ratio (*Methods*). Fixed acquisition and deletion events, hereinafter “gene gain” and “gene loss,” respectively, occur with probabilities given by the multiplication of the event rate and the fixation probability:**2** is used together with the relation of Eq. **1**.

Gain and loss probabilities govern the stochastic genome size dynamics, with the intuitive relation (derivation is given in *Methods*)*SI Appendix*, Fig. S1. The extremum point depends on the deletion–acquisition rates ratio **7** is determined by the sign of **16**). The genome size distribution extremum point at *SI Appendix*, Fig. S1). This condition is met when*SI Appendix*, Fig. S1).

For the calculation of the steady-state distribution of genome size, it is useful to present the population model of the genome size evolution as a random walk, where the probability of a step up is *Methods*) has a steady-state solution:**7**), cannot be written using *Methods*). To account for the most general case, where both the selection coefficient and the acquisition–deletion rates ratio vary with genome size,

Even for *Methods*), and therefore, in principle, it is not evident that the resulting fit corresponds to a global maximum of the log likelihood rather than a local peak. We, therefore, performed the fitting with different starting points in parameter space chosen as explained in *Methods*. In brief, Eq. **7** is used to estimate the mean genome size for different effective population sizes, and parameters are optimized, such that goodness of fit *SI Appendix*, Table S1) that were then used as starting points for additional optimization by log-likelihood maximization (*Methods*). Three of five starting points converged to similar values (*SI Appendix*, Table S2), whereas the remaining two converged to local maxima associated with significantly lower likelihood. In all optimized sets of parameters (from both stages), the selection coefficient is positive for all ATGCs, indicating that additional genes are, on average, beneficial as expected given the positive correlation between genome size and effective population size (*SI Appendix*, Fig. S2).

In all three log-likelihood optimized parameter sets, the selection coefficient is strictly positive for all ATGCs and weakly depends on *Methods*). Accordingly, additional fittings were performed using constant selection coefficient (namely, independent of the genome size). In this case, the log-likelihood optimization always converged to the same set of parameters, with *SI Appendix*, Table S4). This value of

For complementarity, fitting with constant *SI Appendix*), allowing inference of the selection landscape beyond the first-order expansion for a constant acquisition–deletion rates ratio. In this case, the best fit is achieved for positive and saturating selection landscape, indicating that, on average, additional genes are beneficial but that the benefit decreases with the growth of the genome size.

As an approximation for nonuniversal, ATGC-specific factors that affect the genome size, optimization of the gene acquisition and deletion rates was performed separately for each ATGC. The selection coefficient was taken to be the same for all ATGCs and set to the value obtained in the global fitting at the previous stage. For each ATGC, *SI Appendix*, Fig. S3, and the resulting genome size distributions for ATGCs with 20 or more species are shown in Fig. 4.

### Evolution of Distinct Functional Classes of Genes.

In our model, the gene acquisition and deletion rates, *Methods*)*SI Appendix*, Table S5). The distributions were calculated using the values of *SI Appendix*, Figs. S4 *A* and *B* and S5 *A* and *B*). For the translation system components and the genes involved in energy transformation, the log-likelihood values were

Finally, analogous to the whole-genome fitting procedure, to account for ATGC-specific effects, model distributions were further optimized by fitting ATGC-specific *SI Appendix*, Fig. S6 *A* and *B* for the translation system components and genes involved in energy transformation of the largest ATGC001 (complete results are in *SI Appendix*, Figs. S7–S10). However, two classes of genes, namely the components of the “mobilome,” such as prophage genes and transposons, as well as the singletons (genes with no detectable orthologs within the given ATGC), dramatically deviate from model predictions (*SI Appendix*, Figs. S6 *C* and *D*, S9, and S10). For these gene classes, the observed distributions are significantly wider than those predicted by the model, regardless of the model parameters or the selection landscape. Accordingly, the log-likelihood values reflect the disagreement and are

## Discussion

The notion of strong purifying selection that favors small genomes (or more precisely, a small number of genes) in prokaryotes (10) seemed incompatible with the observed significant positive correlation between the genome sizes of bacteria and archaea and the inferred selection strength on the protein level reflected in the *dN/dS* ratio (13) (this work). These observations indicate that, on average, the larger the genome of a bacterium or an archaeon, the stronger selection under which the protein-coding genes evolve. This apparent discrepancy between the comparative genomic observations and the predictions of the population genetic theory motivated us to further investigate the selection regimes of microbial genomes. To this end, we compared the predictions of a mathematical model of genome evolution with the genome size distributions in 60 clusters of closely related bacteria and archaea.

To infer the selection landscape with regard to the genome size, the effective population size was estimated for each ATGC using the *dN/dS* ratio and assuming the same selection coefficient *dN/dS* calculation are nearly universal and encode central biological functions, such as translation, that are functionally highly similar across the entire bacterial domain of cellular life (22). Small variations in *dN/dS* ratio (Fig. 1*A*) would be abolished or reversed.

Inference of gain and loss probabilities requires estimation of three terms, namely gene gain and loss rates and the selection landscape. In principle, all three values depend on the genome size, such that it is impossible to infer all terms without assumptions on the functional forms of the respective dependencies. However, it has to be emphasized that our conclusions do not depend on specific modeling assumptions and hold as long as *N _{e}* and genome size. The fitted values of the deletion–acquisition rates ratio are very close to, albeit greater than unity. This slight but consistent excess of gene loss over gain is likely to reflect the deletion bias that has been identified as an intrinsic feature of genome evolution in both bacteria and eukaryotes (25⇓–27).

Extension of the model to account for subsets of genes allowed additional validation of the model consistency and assessment of the selection affecting any class of genes. In particular, we analyzed genes that are associated with specific cellular functions as classified in the COG database (17, 18) under the assumption that functionally similar genes evolve under similar selection landscapes. We obtained good fits to the model for all functional classes, indicating that the conclusion on the typical beneficial effect of gene acquisition applies to functionally diverse classes of genes. However, there were two notable exceptions to this consistency, namely the mobilome and the singletons. The distributions of the sizes of these classes in all ATGCs are much wider than predicted by the model, and the parameters could not be optimized to obtain a good fit. These observations imply that the evolutionary regimes of the mobilome and the singletons qualitatively differ from the genes in the other functional classes that possess “normal” cellular functions. There are indications of the nature of these differences. Many components of the mobilome, such as transposons, propagate within a genome, so that the dynamics of this class is dominated by duplication rather than gain from an external gene pool. The singletons are fast evolving genes that, on average, do not confer any benefit on the organism. Indeed, in a separate recent analysis of microbial genome evolution models, we have shown that the replacement rate of the singletons is effectively infinite compared with the replacement rates of the rest of the genes (28).

The model analysis described here was performed assuming a steady state with respect to the genome size, and two points have to be addressed in this regard. Each ATGC consists of phylogenetically close genomes (29) that cannot be considered independent samples from the genome size distribution without additional substantiation. We, therefore, verified that, for each ATGC, the divergence between the genomes, defined in this case as the number of gains and losses, was greater than the genome size distribution width, ensuring that a sufficient number of evolutionary events occurred so that the genomes represent an independent set of samples from the size distribution. A low bound for the number of gene gain and loss events that occurred since the divergence of individual genomes can be estimated from the number of singletons divided by the probability of the acquisition of such genes. The required number of gains and loss events for sufficient sampling of the distribution can be estimated from the SD of all genome sizes in an ATGC. The estimated numbers of gain events are greater than the SDs for all ATGCs (*SI Appendix*, Fig. S11), indicating that the genome size distribution was sampled sufficiently.

In our previous work, a comprehensive analysis of gain and loss events was performed for the same groups of microbial genomes (ATGCs) that were analyzed here (14). The gene loss to gain ratios obtained through a maximum likelihood reconstruction of genome evolution formed a broad distribution, with the mean value of about two. The evolution models analyzed here (Eqs. **10** and **11**) yield a skewed distribution of genome sizes, which implies a distribution of loss/gain ratios with the mean slightly greater than unity. The distributions and parameters fitted here cannot explain such a large difference between the model predictions and inferences from genome comparison. Thus, it seems likely that, most of the time, the majority of the genomes are somewhat smaller than the long-term equilibrium size. A biologically plausible scenario is that prokaryotes are exposed to beneficial genetic material only for short periods of time, resulting in brief intervals of fast growth followed by slow genome shrinking (30). Steady state is possible under this scenario as well but only as the average over multiple cycles of gain and loss, which probably occur on a timescale much longer than the scale of the ATGC evolution. The strict steady-state analysis presented here can be regarded as a coarse-grained description of more complex evolutionary scenarios; however, our key finding, that acquired genes are, on average, beneficial, is expected to hold also for higher-order analyses.

The results of this analysis indicate that elimination of genes under the pressure of purifying selection is not the dominant factor of microbial evolution. On the contrary, acquisition of genes by microbes seems to be largely an adaptive process, although the positive selection that governs the genome dynamics, on average, is likely to be weak. This conclusion by no means contradicts the population genetic theory as such (9) but is incompatible with the assumption that newly acquired (and fixed in the population) genes are, on average, neutral. In other words, all of the estimates of the cost of the genetic material (11) can be valid in themselves, but the positive selection coefficient that is, on average, associated with new genes offsets these costs. Given that new genes are, on average, beneficial, microbes with larger *dN/dS* and the number of genes (Fig. 1) that, at first glance, seemed paradoxical and to contradict the theory. Conceivably, the evolution of genuinely neutral, noncoding sequences is governed by the cost combined with the deletion bias, resulting in the purge of such sequences predicted by the theory and hence, the “wall to wall” architecture of the prokaryotic genomes (3).

To summarize, the analysis described here presents the formal theory of the evolution of prokaryotic gene content. Perhaps unexpectedly, comparison of the theory predictions with the genomic data shows that gene gain by prokaryotes, leading to genome growth, is largely an adaptive process, with the exception of “nonfunctional” gene classes, the mobilome and the singletons. From the biological standpoint, it seems plausible that the apparent beneficial effect of gene gain is a combined result of the capture of metabolic enzymes that can expand the biochemical capacity of microbes (31), regulators and signaling proteins that enhance regulatory circuits (32), and defense genes (33). However, much more research is required to reconstruct the full functional landscape of microbial evolution.

## Methods

### Prokaryotic Genome Size Data and Nonsynonymous to Synonymous Nucleotide Substitution Ratio.

Genomes of 707 prokaryotic species grouped into 60 ATGCs (14, 29) were analyzed. In addition to the number of genes (

### Inference of Effective Population Size.

Synonymous mutations are assumed to be neutral and, therefore, fixed at a rate **2**, for nonsynonymous mutations, we have (34)

where *Escherichia coli* is **14** allows estimation of effective population size for each ATGC.

### Selection Function Symmetry with Respect to Gene Acquisition and Deletion.

The relation given by Eq. **1** is derived as follows. Noting by

### Selection and Fitness Relations.

The selection coefficient

when considering the selective advantage of individual

### Mean Genome Size Dynamics.

For uniform population invaded by mutation that is associated with fitness

where **17** is general, the only assumption being that the mutation rate is low enough, such that the weak mutation limit condition is satisfied. However, for the model analysis, it is more practical to derive the equation for genome size dynamics rather than the fitness. The integral over

where the + and − superscripts indicate acquisition and deletion events, respectively. The fitness derivative with respect to the number of genes can be approximated as

such that

Substituting Eqs. **18**, **19**, **20**, **21**, and **22** to Eq. **17**, we get Eq. **5** for genome size dynamics.

### Steady-State Genome Size Distribution.

The genome size distribution satisfies the difference equation

This equation can be approximated by a second-order differential equation (37). The left-hand side is expanded to first order in

where in the weak mutation limit, **9**.

### Optimization of the Goodness of Fit R 2 for Mean Genome Size Vs. Effective Population Size Dependency.

To search the parameter space more efficiently during the log-likelihood optimization, a preliminary optimization stage was implemented. Eq. **7** determines the relation between *SI Appendix*, Fig. S12), and ATGCs genome size distributions are only slightly skewed (Fig. 4). Specifically, it is possible to optimize the selection and deletion–acquisition rates ratio parameters (Eqs. **10** and **11**) to maximize the goodness of fit **7**, only the ratio **11** by **10** and **11** require optimization of five parameters:

### Maximum Likelihood Optimization.

The log likelihood (*LL*) of the model given the data is estimated as

where **9**. Specifically, for the log-likelihood estimation of a model, the parameters were optimized to maximize the log-likelihood

where the sum is over all 707 species,

where the inner sum is over all species that belong to ATGC

### Gain and Loss Probabilities Genome Size Dependent.

The gene gain (loss) probability depends on genome size through the selection coefficient and the acquisition (deletion) rate. The variation in gain and loss probability

where all quantities are calculated using the mean ATGCs genome size and effective population size. If, say, the second term in the above equations is significantly smaller than the first term, the variation in

For parameters fitted using linear selection landscape and summarized in *SI Appendix*, Table S2, the terms involving the derivatives of

### The Model with Two Types of Genes.

The model can be extended to account for distinct classes of genes evolving under different selection landscapes. For two classes, the numbers of genes in each class,

where the interpretation is as follows. The probability to acquire a gene of class **30** and **31** follows the same steps as for the complete genome. The integral of Eq. **17** in this case is a sum with four terms, namely, acquisition or deletion of either class

where

To calculate the steady-state distribution for a subset of genes, **30** is decoupled from Eq. **31**.

## Acknowledgments

We thank members of the group of E.V.K. for helpful discussions. The authors’ research is supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: koonin{at}ncbi.nlm.nih.gov.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2016.

Author contributions: Y.I.W. and E.V.K. designed research; I.S. performed research; I.S., Y.I.W., and E.V.K. analyzed data; and I.S. and E.V.K. wrote the paper.

Reviewers: E.K., New York University; and C.O.W., The University of Texas at Austin.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1614083113/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵.
- Koonin EV,
- Wolf YI

- ↵.
- Reddy TB, et al.

- ↵
- ↵.
- Price MN,
- Huang KH,
- Arkin AP,
- Alm EJ

- ↵.
- Nuñez PA,
- Romero H,
- Farber MD,
- Rocha EP

- ↵.
- Lynch M,
- Conery JS

- ↵.
- Lynch M

- ↵.
- Lynch M

- ↵.
- Lynch M

- ↵
- ↵.
- Lynch M,
- Marinov GK

- ↵
- ↵.
- Novichkov PS,
- Wolf YI,
- Dubchak I,
- Koonin EV

- ↵
- ↵
- ↵
- ↵.
- Tatusov RL,
- Koonin EV,
- Lipman DJ

- ↵.
- Galperin MY,
- Makarova KS,
- Wolf YI,
- Koonin EV

- ↵
- ↵.
- Konstantinidis KT,
- Tiedje JM

- ↵
- ↵
- ↵.
- Daubin V,
- Ochman H

- ↵.
- Yu G,
- Stoltzfus A

- ↵.
- Petrov DA,
- Sangster TA,
- Johnston JS,
- Hartl DL,
- Shaw KL

- ↵
- ↵.
- Kuo CH,
- Ochman H

- ↵.
- Wolf YI,
- Makarova KS,
- Lobkovsky AE,
- Koonin EV

- ↵.
- Novichkov PS,
- Ratnere I,
- Wolf YI,
- Koonin EV,
- Dubchak I

- ↵
- ↵.
- Maslov S,
- Krishna S,
- Pang TY,
- Sneppen K

- ↵
- ↵.
- Makarova KS,
- Wolf YI,
- Koonin EV

- ↵
- ↵.
- Sella G,
- Hirsh AE

- ↵.
- Kryazhimskiy S,
- Tkacik G,
- Plotkin JB

- ↵.
- Codling EA,
- Plank MJ,
- Benhamou S