BIOLOGICAL SCIENCES / EVOLUTION
Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution
Institut für Botanik III, Heinrich-Heine Universität, Universitätsstrasse 1, 40225 Düsseldorf, Germany
Edited by W. Ford Doolittle, Dalhousie University, Halifax, Nova Scotia, Canada, and approved November 21, 2006 (received for review July 25, 2006)
| Abstract |
|---|
microbial evolution | phylogenomics | gene clusters
There are currently three main approaches to quantifying LGT. The first involves identification of codon usage, GC content, or nucleotide-pattern properties within genomes that differ from the genomic norm and hence are likely to represent acquired sequences (6–8). This approach is powerful but can uncover only recent LGT events. The second approach involves gene-tree comparisons in search of incongruent branching patterns. This approach has delivered widely conflicting results, ranging from estimates that up to 60% of all genes are affected by LGT (9) to estimates that as few as 14% (10) or even only 2% are affected (11). The reason for such divergent quantitative estimates is primarily founded in the uncertainties inherent to phylogenetic reconstruction by using real data (12–14) and in differences among investigated gene and genome samples. A third approach entails inference of gene-gain and -loss events (15–18). Estimates using this approach, in which gene losses are weighted against gene acquisitions (LGTs) according to a predetermined loss-to-LGT ratio, suggest that between 40% (16) and 90% (17) of all gene families might be affected by LGT; these discrepancies are caused by different a priori specified gain/loss ratios and the genome samples studied.
An additional approach to inferring LGT, hitherto used only in a nonquantitative manner, involves the identification of genes showing patchy distribution patterns across genomes (19, 20). Although differential gene loss can account for patchy distributions in individual instances, it cannot be invoked to account for all such patterns, because the inferred size of ancestral genomes would become unrealistically large. We reasoned that this phenomenon, which Doolittle et al. (21) have termed the "genome of Eden," could be used to estimate the rate of LGT. Given the current distribution of genes across genomes and a reference tree, one can calculate ancestral genome sizes under the assumption that all gene distributions are due to gene loss only. If ancient genome sizes become unrealistically large, incremental allowance of LGT should solve the genome-of-Eden problem, and the amount of LGT that causes inferred ancestral genome sizes to assume a size distribution similar to modern ones would be an estimator of the LGT rate.
However, what if a given gene tree is different from the reference tree? Here, we grant each gene the full benefit of all phylogenetic doubt; we assume that (i) all gene trees are perfectly compatible with the same reference tree, (ii) gene loss is unpenalized, and (iii) there is no paralogy; that is, all within-genome duplications for each gene family are assumed to have occurred subsequent to the last divergence for each lineage. Taken together, these three assumptions mean we infer no LGTs from phylogenetic conflicts; hence, our approach delivers conservative lower-bound constraints for the minimum LGT rate during prokaryote genome evolution.
| Results |
|---|
Ancestral Genome Sizes Constrain the Average LGT Rate. To estimate the minimum amount of LGT in the present gene-distribution data, we first plotted the PAPs onto a reference tree for the rRNA operon (SI Fig. 5). We designate an evolutionary scenario that utilizes vertical inheritance and gene loss only as the loss-only model; gene distribution is governed solely by loss, each ancestral genome contains all families present in its descendants, and genomes hence become progressively larger back through time (Fig. 2a). With the present data and our reference tree, the prokaryotic ancestor would have had a genome encoding 57,670 families, exceeding the average genome size in our sample (2,198 families) by 26-fold and the largest genome in our sample (8,317 families; Bradyrhizobium japonicum) by 7-fold (Table 1).
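The loss-only reasoning can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding, the function name `ancestral_sizes_loss_only`, and the four toy genomes are all assumptions made for the example. Each ancestral node simply receives the union of the families found in its descendants, which is why inferred genomes grow toward the root.

```python
def ancestral_sizes_loss_only(tree, genome_families):
    """Return (families at this node, sizes of all internal nodes below and including it).

    tree: a leaf name (str) or a pair (left, right) of subtrees (hypothetical encoding).
    genome_families: dict mapping leaf name -> set of family identifiers.
    """
    if isinstance(tree, str):                      # leaf: a sampled genome
        return set(genome_families[tree]), []
    left_set, left_sizes = ancestral_sizes_loss_only(tree[0], genome_families)
    right_set, right_sizes = ancestral_sizes_loss_only(tree[1], genome_families)
    node_set = left_set | right_set                # ancestor holds every descendant family
    return node_set, left_sizes + right_sizes + [len(node_set)]


# Toy data (illustrative only, not taken from the paper)
toy_tree = (("g1", "g2"), ("g3", "g4"))
toy_families = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"d"}, "g4": {"a", "e"}}
root_set, sizes = ancestral_sizes_loss_only(toy_tree, toy_families)
print(sizes)  # internal-node sizes in post-order; the root is last
```

In the toy case no modern genome holds more than two families, yet the root is assigned five, which mirrors the genome-of-Eden inflation described above on a small scale.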
We started by allowing only one LGT event per family (Fig. 2c), the LGT₁ model. This model allows each gene to have two origins, one of which is an LGT. For 35% of our families, neither LGT nor loss is required; the remaining 65% accept one LGT. This average LGT rate of 0.65 LGT per family (Table 2) brings inferred ancestral genome sizes down to <8,000 genes (Fig. 3b), with a maximum of 7,607 and a mean of 2,858, closer to contemporary genomes (Table 1) but still a bit too large.
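The average rate quoted here is a weighted mean over families: each family contributes the number of LGTs it accepts. A minimal sketch of that arithmetic follows, using only the two fractions stated in the text; the function name `average_lgt_rate` is a hypothetical helper, not something from the paper.

```python
def average_lgt_rate(fraction_by_lgt_count):
    """Mean number of LGTs per family, given {number of LGTs: fraction of families}."""
    total = sum(fraction_by_lgt_count.values())
    return sum(k * f for k, f in fraction_by_lgt_count.items()) / total


# Fractions reported in the text for the one-LGT model:
# 35% of families need no LGT, 65% accept one.
one_lgt_fractions = {0: 0.35, 1: 0.65}
print(round(average_lgt_rate(one_lgt_fractions), 2))
```

The same function applies to models with larger allowances, provided the per-count fractions (as in Table 2 of the paper) are supplied.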
In the LGT₁₅ and LGT₃₁ models, no families accommodate the maximum number of LGTs allowed (Table 2). Because only a very small proportion of gene families require many LGTs to account for their phyletic distributions, allowing more LGTs hardly changes the average LGT frequency per gene family (Table 2).
Comparison of the distributions of 190 modern genome sizes with 187 inferred ancestral genome sizes for differing LGT allowances using the Wilcoxon test (24) revealed that all models except LGT₃ are rejected at a significance level of 0.02 (Table 1). With no LGT, ancestral genome sizes are too large; however, with too much LGT, they become too small. Even for the LGT₃ model, only 8% of all families accept all three LGTs allowed, such that the average rate across all genes in the LGT₃ model is 1.1 LGT per gene family. This amount of LGT is sufficient to bring the distribution of ancestral genome sizes into congruence with that of modern genomes.
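For readers who want to see what such a comparison involves, the following is a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney) test in plain Python. It assumes the large-sample normal approximation with midranks for ties and applies neither a tie correction nor a continuity correction; the paper does not state which variant was used, so the details here are assumptions. The function name `rank_sum_test` is hypothetical.

```python
import math


def rank_sum_test(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided p value by normal approximation)."""
    # Pool both samples, remembering which sample each value came from
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                                   # pooled[i:j] share one value (a tie group)
        midrank = (i + 1 + j) / 2                    # mean of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))     # two-sided tail probability
```

Applied to the list of modern genome sizes and the list of ancestral sizes inferred under a given model, a small p value would indicate that the two size distributions differ, which is the criterion used above to reject a model.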
Too Much LGT Makes Ancestral Genomes Too Small. Allowance of many LGTs causes inferred ancestral genome sizes to become far too small in comparison to modern genomes (Fig. 3 d and e). The mean ancestral genome size in the models allowing seven or more LGTs is less than half the mean of modern genomes, and the size distributions of modern and ancestral genomes differ at a significance level of 0.05 using the Wilcoxon test (Tables 1 and 2). Furthermore, for genomes with
1,000 families, ancestral sizes are biased toward minuscule sizes with too much LGT allowance (Table 3). Thus, although the genome of Eden demands LGT to keep ancient genome sizes realistically small, too much LGT makes them unrealistically small.
In the LGT₃₁ model, 92% of all gene families are inferred to evolve without a single loss (Table 2), which is unrealistic, because loss events are abundant in bacterial evolution (25, 26). Hence, introducing too many LGTs turns gene loss, an important and common mechanism affecting genome size, into a rare event. The mean origin-to-loss ratio observed in the LGT₃ model is 1:1 (Table 2), twice the threshold value used for LGT inference in previous studies (16, 17) that constrained the LGT rate as opposed to estimating it.
The Tree Is Not Too Important, but Family Size Is. Neither different reference trees (using maximum likelihood with or without a gamma distribution of rate variation across sites or using ribosomal protein sequences) nor alternative rootings (within proteobacteria, actinobacteria, or mollicutes) affected the average LGT rate across all genes by >10% (SI Table 5). Changing the reference tree or rooting has little influence on the average LGT rate, because the majority of gene families are of small size and have a discrete taxonomic distribution; 50% of all genes fall into families that occur in ≤14 genomes, and 39% of all families occur in only two, often congeneric, genomes (Fig. 1a) and hence can exhibit one LGT at most.
The cutoff used to assort genes into gene families has a similarly small effect. Repeating the above analyses using cutoffs of 35% amino acid identity or the rather strict value of 40% in the clustering procedure, we found 61,981 and 66,118 families, respectively, and very similar distributions of family sizes (SI Fig. 6). Although the 35% and 40% cutoffs disrupt a few protein families that are united at the 30% cutoff, for example enolase or the ribosomal proteins S11 and L11 (data not shown), all models except LGT₃ are again excluded, and the average LGT rate drops slightly to 0.99 and 0.89, respectively (Table 2).
By contrast, small gene families exert a strong influence on the average LGT rate, because they are abundant and require little LGT to account for their distribution, regardless of the assumed tree. Accordingly, excluding small families from our analysis would deliver higher average rates. For example, if we use a very permissive cutoff of 25%, we obtain 53,349 families. Relative to the 30% cutoff, there are 4,724 fewer families of size <10, because small families are preferentially subsumed into larger families at this lower cutoff (SI Fig. 6). In this case, the average rate increases to 1.55 events per family (SI Table 5). This effect of small families can be further illustrated if we consider only families present in ≥10 genomes, where the LGT₁₅ model is preferred, and the average LGT rate jumps to 5.3 events per family (SI Table 5). Clearly, families present in ≥10 genomes can accept more LGT, but they account for only 14% of all families in the data.
Indeed, the effect of excluding small families is even more dramatic than the use of a random tree. If we use a random tree (SI Fig. 4) to infer LGT rates as above, we are assuming there is virtually no vertical inheritance in the gene-distribution data. Accordingly, the random tree corresponds to an evolutionary scenario of LGT only. For the random tree, the Wilcoxon test excludes all models except LGT₃₁ and yields an average rate of 3.3 LGT events per gene family (Table 2). However, the random tree LGT rate is still lower than for families present in ≥10 genomes using the rRNA reference tree; gene family size bears upon the genome of Eden more heavily than does the assumed tree itself. Thus, although higher estimates for the lower-bound LGT rate can be obtained by disregarding small families (SI Fig. 9), small families contain the majority of genes. The genome-of-Eden constraint applies to all gene families, not just a select population thereof.
| Discussion |
|---|
An average LGT rate of 1.1 events per family per family life span (Fig. 3 and Table 1) provides the best fit of inferred ancestral genome sizes to those currently observed in real microbes. This average LGT rate is a very conservative lower bound, because it is based on the assumptions that all families investigated contain orthologs only, and that all gene trees are compatible.
One could argue that ancient genomes were bigger than those of today, and that the amount of LGT inferred here is still not necessary. Indeed, it has been suggested that the vast majority of all LGT occurred before the origin of cells, and that little or none has occurred since (1). However, this suggestion cannot be true, because nucleotide-pattern comparisons indicate that LGT is still an ongoing process today (6–8). One could also argue that ancient genomes were much skimpier than those of today, and that they inflated only recently in a case of evolutionary last-minute shopping, such that higher average LGT rates than those inferred here would be tenable. However, in the absence of evidence to the contrary, Occam's razor would prefer the simpler premise that genome sizes, rates of loss, and rates of LGT in the past, on average, were not fundamentally different from those of today. In the LGT₃ model, genome size is not only similar to the values currently observed among prokaryotes (Tables 1 and 3); it is also far more constant across time than in the other models (Fig. 3). The same is true for gene-origin and -loss frequencies (SI Fig. 10). By allowing the frequencies of LGT, gene origin, and gene loss to vary freely, we obtain a picture of genome evolution that is marked by uniformity of all three parameters over time and lineages, but only if genome-size distributions remain uniform as well.
This observation and the comparison of modern and inferred genome-size distributions (Fig. 3 and Table 1) indicate that average LGT rates of 1.1 LGT per family are necessary and sufficient to account for the present distribution of genes across 190 prokaryotic genomes. This conservative lower-bound estimate stands against and is irreconcilable with recent inferences from gene-tree comparisons that as much as 86% (10) or even 98% (11) of all genes are related by vertical inheritance only. The burgeoning effects that so much vertical inheritance would have upon ancestral genome sizes were not considered in those studies. Rampant vertical inheritance leads to the genome of Eden, and a modest amount of LGT offers remedy.
Above and beyond our full-benefit assumptions, the lower-bound nature of our 1.1 LGT per family estimate has two further caveats. First, it is possible that the first origin we infer for each gene is not a birth event, but itself is an LGT from an unsampled genome. Although no genome sample size would exclude that possibility, if we assume that all families were born outside rather than within the lineages sampled, our estimate for the average rate would increase only to 2.1. Second, our methods count only observed events; unobserved gene families or events (27) are disregarded.
Our findings indicate that LGT occurs very frequently among prokaryotes in terms of its impact upon individual gene family distributions, in that at least 65% of all families (and, given the ultraconservative nature of our full-benefit assumptions, probably all) have been affected during the course of evolution. These results can be taken as support for the view that a core set of genes that has remained immune to LGT throughout all of evolution is unlikely to exist (28, 29). The estimates of the average LGT rate reported here represent solely the amounts required to keep ancestral genome-size distributions within realistic bounds; additional contributions from gene-tree comparisons or nucleotide-pattern analyses were not considered. However, despite much LGT, gene-distribution patterns are still nonrandom, as Fig. 1b attests. Further specification of the extent of this important process of natural variation among prokaryotes is germane to understanding the evolutionary mechanisms that govern the distribution of genes across genomes.
| Materials and Methods |
|---|
Gene Families (PAPs). All proteins in the 190 genomes were clustered by similarity into gene families by using the reciprocal best BLAST hit (BBH) approach. Each protein was BLASTed against each of the genomes. Pairs of proteins that resulted as reciprocal BBHs of E value < 110 were aligned by using ClustalW (30) to obtain amino acid identities. Using a cutoff of 30% amino acid identity in the clustering procedure (22), the proteins fell into 57,670 families with two or more members, in addition to 149,894 singletons that were excluded from pattern analysis. The resulting gene families represent a PAP of gene distribution across the prokaryotic genomes. A PAP includes 190 digits; if a gene family includes one or more genes from genome i, then digit xi in its corresponding pattern is "1"; otherwise, it is "0."
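The presence/absence encoding described above can be illustrated with a short Python sketch. The function name `presence_absence_pattern`, the representation of a family as a list of `(genome, protein)` pairs, and the toy identifiers are assumptions made for the example; the clustering step itself is not reproduced here.

```python
def presence_absence_pattern(family_members, genome_order):
    """Encode a gene family as a string of '1'/'0' digits, one digit per genome.

    family_members: iterable of (genome_id, protein_id) pairs (hypothetical layout).
    genome_order: list fixing which genome corresponds to which digit.
    """
    present = {genome for genome, _protein in family_members}
    # Several proteins from the same genome still yield a single '1'
    return "".join("1" if genome in present else "0" for genome in genome_order)


# Toy example with four genomes instead of 190
genomes = ["g1", "g2", "g3", "g4"]
family = [("g1", "p1"), ("g3", "p7"), ("g3", "p9")]
print(presence_absence_pattern(family, genomes))
```

With the full data set, `genome_order` would hold all 190 genomes, so that each pattern has 190 digits as described in the text.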
Reconstruction of Phylogenetic Trees. For the reference tree, the sequences of the rRNA operon (16S, 23S, and 5S) from all 190 genomes were aligned by using ClustalW (30) and concatenated, and gapped sites were removed. This alignment was used for phylogenetic reconstruction by maximum likelihood (ML) with and without rate variation using dnaml (31), PhyML (32), and Neighbor Joining (33). The tree of concatenated L11 and S11 ribosomal protein sequences was inferred with PhyML (32). Trees were rooted between archaebacteria and eubacteria; additional roots of the dnaml tree (between proteobacteria, actinobacteria, or mollicutes and other genomes) were tested. The random tree was obtained by shuffling species names in the ML tree and rooting on the longest internal edge. All Newick format trees are provided in SI Table 6.
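The name-shuffling step for the random tree can be sketched as follows. This is an illustration under stated assumptions: the tree is encoded as nested tuples without branch lengths, the function name `shuffle_leaf_names` and the `seed` parameter are hypothetical, and the subsequent rooting on the longest internal edge is not shown because it requires branch lengths.

```python
import random


def shuffle_leaf_names(tree, seed=0):
    """Return a tree with the same topology but randomly permuted leaf names."""
    names = []

    def collect(node):
        if isinstance(node, str):
            names.append(node)
        else:
            for child in node:
                collect(child)

    collect(tree)
    shuffled = names[:]
    random.Random(seed).shuffle(shuffled)        # seeded for reproducibility
    replacement = iter(shuffled)

    def relabel(node):
        if isinstance(node, str):
            return next(replacement)             # leaves are visited in the original order
        return tuple(relabel(child) for child in node)

    return relabel(tree)


toy_tree = (("A", "B"), ("C", ("D", "E")))
print(shuffle_leaf_names(toy_tree, seed=1))
```

Because only the labels move, the branching structure is preserved while the association between genomes and clades is destroyed, which is what makes the shuffled tree a stand-in for a scenario without vertical signal.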
Evolutionary Model Reconstruction and Calculation of Ancestral Genome Size. In the loss-only model, all gene families (57,670) are assumed to have originated at the root. The loss events for each gene are estimated by using a binary recursive algorithm that scans the tree and infers the minimum number of losses. When a gene is absent in a whole clade, a single loss event is inferred in the common ancestor of that clade (e.g., Fig. 2a, clade III). In the SO model, each gene family is assumed to have originated at its first occurrence on the reference tree. A binary recursive algorithm scans the tree root to tips to identify the first hypothetical taxonomic unit (ancestral node) that is the common ancestor of all gene "present" cases (e.g., the common ancestor of clades I + II in Fig. 2b).
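The two recursive scans described above can be sketched in Python. This is a minimal reading of the text, not the authors' code: the nested-tuple tree encoding and the function names `count_losses` and `single_origin` are assumptions. A clade that lacks the gene entirely contributes exactly one loss, placed on the branch leading to it, and the single-origin node is found by walking from the root toward the tips until both child clades contain the gene.

```python
def count_losses(tree, present):
    """Return (gene found anywhere in this clade?, minimum number of losses inside it)."""
    if isinstance(tree, str):                       # leaf genome
        return tree in present, 0
    has_left, losses_left = count_losses(tree[0], present)
    has_right, losses_right = count_losses(tree[1], present)
    if has_left and has_right:
        return True, losses_left + losses_right
    if has_left or has_right:
        # one child clade lacks the gene entirely: a single loss on its stem branch
        return True, losses_left + losses_right + 1
    return False, 0                                 # wholly absent: counted by the parent


def single_origin(tree, present):
    """Walk root to tips to the common ancestor of all 'present' genomes.

    Assumes `present` contains at least one leaf of `tree`.
    """
    node = tree
    while not isinstance(node, str):
        has_left, _ = count_losses(node[0], present)
        has_right, _ = count_losses(node[1], present)
        if has_left and has_right:
            break
        node = node[0] if has_left else node[1]
    return node


toy_tree = ((("A", "B"), "C"), (("D", "E"), "F"))
toy_present = {"A", "C"}
print(count_losses(toy_tree, toy_present))                               # origin at the root
print(single_origin(toy_tree, toy_present))                              # single-origin node
print(count_losses(single_origin(toy_tree, toy_present), toy_present))   # losses below it
```

In the toy case, placing the origin at the root requires one more loss than placing it at the common ancestor of the genomes that carry the gene, because the clade that lacks the gene no longer needs to be explained.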
In the LGT₁ model, each gene family is allowed to have two gene origins, where one is an LGT. The first origin is inferred as in the SO model, followed by a renewed search for a gene origin in each of the two clades branching from the first-origin node (e.g., Fig. 2c). If the hypothetical taxonomic unit that was inferred as the first origin has no gene "absent" descendants, the gene family is inferred to have a single origin. Once the nodes of the two origins are set, losses are inferred as in the loss-only model.
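A sketch of the two-origin search, as it reads from the description above, might look like this in Python. The tree encoding and the helper names `leaves`, `lca_of_present`, and `two_origin_search` are assumptions for illustration, and `present` is assumed to contain at least one genome of the tree.

```python
def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def lca_of_present(tree, present):
    """Common ancestor of all genomes in `tree` that carry the gene."""
    node = tree
    while not isinstance(node, str):
        in_left = any(g in present for g in leaves(node[0]))
        in_right = any(g in present for g in leaves(node[1]))
        if in_left and in_right:
            break
        node = node[0] if in_left else node[1]
    return node


def two_origin_search(tree, present):
    """Return one or two origin nodes; a second origin is counted as an LGT."""
    first = lca_of_present(tree, present)
    if all(g in present for g in leaves(first)):
        return [first]                       # no 'absent' descendants: single origin
    # otherwise search each clade branching from the first-origin node
    return [lca_of_present(first[0], present), lca_of_present(first[1], present)]


toy_tree = ((("A", "B"), "C"), (("D", "E"), "F"))
print(two_origin_search(toy_tree, {"A", "C"}))   # two origins, hence one LGT
print(two_origin_search(toy_tree, {"A", "B"}))   # one origin, no LGT
```

The number of LGTs inferred for a family is the number of returned origins minus one; losses below each origin would then be counted as in the loss-only model.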
We tested additional models allowing 4, 8, 16, and 32 origins, where one is the original gene origin and the rest are LGTs. These are implemented in the same way as in the LGT₁ model, except that the origin search is iterated. For example, a search for origins under the LGT₃ model entails (i) a search for the first origin (as in the SO model); (ii) a search for the next origin in descendants (as in the LGT₁ model); and (iii) for each next origin, another search. If an origin has no gene "absent" descendants, the number of origins inferred is smaller than the maximum allowed (e.g., Fig. 2d, clade II, where three origins are inferred under the LGT₃ model). The distributions of ancestral and modern genome sizes were compared by using the Wilcoxon Mann–Whitney nonparametric test (24). For cutoffs other than 30%, the inferred distributions were compared with the modern distribution for clusters at the respective cutoff.
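The iterated origin search can be sketched as repeated rounds of splitting, where each round replaces every origin that still has gene "absent" descendants by the origins found in its two child clades. Under this reading, one round allows up to two origins, two rounds up to four, and five rounds up to 32. The code below is an illustrative sketch; the tree encoding, the helper names, and the `rounds` parameter are assumptions, not the authors' implementation.

```python
def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


def lca_of_present(tree, present):
    """Common ancestor of all genomes in `tree` that carry the gene."""
    node = tree
    while not isinstance(node, str):
        in_left = any(g in present for g in leaves(node[0]))
        in_right = any(g in present for g in leaves(node[1]))
        if in_left and in_right:
            break
        node = node[0] if in_left else node[1]
    return node


def infer_origins(tree, present, rounds):
    """Iterate the origin search; at most 2**rounds origins are returned."""
    origins = [lca_of_present(tree, present)]
    for _ in range(rounds):
        next_origins = []
        for node in origins:
            if all(g in present for g in leaves(node)):
                next_origins.append(node)        # complete clade: nothing left to split
            else:
                next_origins.append(lca_of_present(node[0], present))
                next_origins.append(lca_of_present(node[1], present))
        origins = next_origins
    return origins


toy_tree = ((("A", "B"), ("C", "D")), (("E", "F"), "G"))
toy_present = {"A", "C", "E", "F"}
print(infer_origins(toy_tree, toy_present, rounds=1))
print(infer_origins(toy_tree, toy_present, rounds=2))
```

In the toy example, the second round yields three origins rather than the four allowed, because one clade already contains the gene in all of its genomes; this corresponds to the case described above in which fewer origins are inferred than the maximum.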
| Footnotes |
|---|
Abbreviations: LGT, lateral gene transfer; PAP, presence/absence pattern; SO, single origin.
*To whom correspondence should be addressed. E-mail: tal.dagan{at}uni-duesseldorf.de
Author contributions: T.D. and W.M. designed research; T.D. performed research; T.D. and W.M. analyzed data; and T.D. and W.M. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS direct submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0606318104/DC1.
© 2007 by The National Academy of Sciences of the USA