EVOLUTION
Highways of gene sharing in prokaryotes
Institute for Molecular Bioscience and Australian Research Council Centre in Bioinformatics, The University of Queensland, Brisbane 4072, Australia
Edited by Carl R. Woese, University of Illinois at Urbana-Champaign, Urbana, IL, and approved August 17, 2005 (received for review May 16, 2005)
| Abstract |
|---|
lateral genetic transfer | microbial genomes | molecular phylogeny
There is considerable evidence for long-distance LGT events. To date, this evidence has typically been based on either the broad application of nonphylogenetic "surrogate" methods across a wide range of taxa (13) or the use of phylogenetic methods on a subset of available taxa with restricted phylogenetic (14) or environmental (15) distributions. However, the set of sequenced genomes represents taxa that are phylogenetically and ecologically diverse, and can be used to test broad hypotheses about the sharing of genes. Here we apply rigorous phylogenetic methods to annotated proteins from 144 completely sequenced genomes sampled across 15 phyla of prokaryotes. We derive a reference supertree from 22,432 orthologous protein families, and by comparing individual protein trees with this reference tree, we infer for this data set the frequency and phyletic extent of LGT, the taxa implicated as partners in LGT events, and the tendency of different cellular functions to be subjected to transfer.
| Methods |
|---|
4 containing a total of 382,991 proteins. Sequences within each cluster were hierarchically clustered, and maximal subsets in which no genome is represented more than once (maximally representative clusters, MRCs; ref. 16) were identified, yielding 22,437 MRCs containing 220,240 sequences. Sequences in each MRC were aligned by using several different algorithms, and the alignment yielding the highest score according to the word-oriented objective function (WOOF) (17) was chosen for subsequent analysis. Ambiguously aligned regions were removed (18) to yield 22,432 alignment sets. Bayesian phylogenetic analysis (19, 20) was used to associate posterior probability (PP) values with all possible groupings of taxa (bipartitions) for each alignment set. Models and parameters were selected after extensive calibration (supporting information). Bipartitions (internal edges) having PP ≥ 0.95 were assessed for topological consistency with a reference supertree generated by the matrix representation with parsimony (MRP) method (21) from all strongly supported (PP ≥ 0.95) bipartitions among the 22,432 protein trees. Because our analysis included no eukaryotic nuclear genomes, the supertree was arbitrarily rooted on the edge connecting the bacterial and archaeal subtrees. This rooting does not imply that prokaryotes constitute a monophyletic group.
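The MRC step described above can be illustrated with a minimal sketch. It assumes that the hierarchical clustering of one protein cluster is available as a nested list whose leaves are (genome, protein) pairs; that representation, the labels, and the function name `find_mrcs` are hypothetical and are not taken from the original pipeline (ref. 16). The sketch returns the largest subtrees in which no genome occurs twice, and it applies no minimum-size filter.

```python
from typing import List, Tuple

Leaf = Tuple[str, str]  # (genome_id, protein_id); hypothetical labels


def is_leaf(node) -> bool:
    """A leaf is a (genome, protein) tuple; internal nodes are lists."""
    return isinstance(node, tuple)


def leaves(node) -> List[Leaf]:
    """Collect all leaves below a node, left to right."""
    if is_leaf(node):
        return [node]
    out: List[Leaf] = []
    for child in node:
        out.extend(leaves(child))
    return out


def find_mrcs(node) -> List[List[Leaf]]:
    """Return maximal subtrees in which no genome appears more than once."""
    members = leaves(node)
    genomes = [genome for genome, _ in members]
    if len(genomes) == len(set(genomes)):
        # The whole subtree qualifies, so nothing smaller needs reporting.
        return [members]
    # Some genome is duplicated below this node: examine each child instead.
    result: List[List[Leaf]] = []
    for child in node:
        result.extend(find_mrcs(child))
    return result


if __name__ == "__main__":
    # Genome "A" contributes two paralogs, so the root is split into two MRCs.
    cluster = [[("A", "a1"), ("B", "b1")], [("A", "a2"), ("C", "c1")]]
    for mrc in find_mrcs(cluster):
        print(mrc)
```

Working top-down means a subtree is reported as soon as it is free of duplicated genomes, which is what makes each reported subset maximal within the given hierarchy.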
We developed an algorithm to identify the minimal set of subtree prune-and-regraft operations (22), here simply termed edits, required to make our supertree topologically consistent with a given protein tree (Fig. 1). Of the 19,672 protein trees fully or partially resolved at PP ≥ 0.95, we computed the minimal edit path exactly for 19,351 (13,849 completely congruent with the supertree, and 5,502 with a nonzero edit distance) and used ratchet-based heuristics (see supporting information) to recover a result for 237 of the remaining 321. The minimum number of proteins in any of these 321 data sets was 14; in simulations with smaller data sets, where our heuristics returned a result, it was minimal in >95% of cases. Edit distances ranged from 1 (3,694 trees) to 22 (1 tree).
| Results |
|---|
0.95 and were used to compute a supertree by the method of matrix representation with parsimony (21). This supertree (Fig. 4), our reference hypothesis about relationships among these 144 prokaryotes, is remarkably congruent with taxonomy based on 16S rDNA. Of the nine phyla represented by more than one genome, our supertree reconstructs eight as monophyletic, with only Euryarchaeota paraphyletic.
Individual protein trees that strongly (PP ≥ 0.95) support the bipartition of taxa implied by a given internal edge or node in the supertree are concordant with that node, and support a regime of vertical inheritance at that node. Protein trees that are strongly incongruent with a supertree node are discordant, and provide prima facie evidence of LGT. Of the 95,194 strongly supported bipartitions among our 22,432 protein trees, 82,473 (86.6%) are concordant with the supertree. Discordance is highly variable across the supertree: for many nodes that subtend a single genus or species, <5% of the corresponding protein tree bipartitions with PP ≥ 0.95 are discordant, whereas for "backbone" nodes that define the branching order of phyla, frequently >40% are discordant (see ref. 23). The implied relationships among members of a single genus are sometimes strongly supported (e.g., Bordetella and Staphylococcus) but are more often contradicted by many protein trees (e.g., Clostridium, Prochlorococcus, and some relationships within Streptococcus and Escherichia). Only 22 of 110 protein trees strongly support the basal Aquifex + Thermotoga clade seen in our supertree and in many previous studies (24-26). The phylogenetic approach strongly supports alternative partners for these two genomes.
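The distinction between concordant and discordant bipartitions can be sketched with the standard split-compatibility test: two bipartitions of the same taxon set can coexist in one tree only if at least one of the four pairwise intersections of their sides is empty. The code below is a simplified illustration rather than the procedure used in this study. It assumes that the reference splits have already been restricted to the taxa present in the protein tree, and the function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

Bipartition = Tuple[FrozenSet[str], FrozenSet[str]]


def make_bipartition(side: Iterable[str], taxa: Iterable[str]) -> Bipartition:
    """Build a split from one side and the full taxon set."""
    a = frozenset(side)
    return (a, frozenset(taxa) - a)


def compatible(p: Bipartition, q: Bipartition) -> bool:
    """True if at least one of the four side-by-side intersections is empty."""
    return any(not (x & y) for x in p for y in q)


def classify(split: Bipartition, reference: Iterable[Bipartition]) -> str:
    """Label a strongly supported split against the reference splits.

    Assumption: 'concordant' here means compatible with every reference
    split on the same taxa; anything else is labelled 'discordant'.
    """
    ok = all(compatible(split, r) for r in reference)
    return "concordant" if ok else "discordant"


if __name__ == "__main__":
    taxa = "ABCDE"  # five hypothetical genomes
    reference = [make_bipartition("AB", taxa), make_bipartition("ABC", taxa)]
    print(classify(make_bipartition("AB", taxa), reference))
    print(classify(make_bipartition("AC", taxa), reference))
```

In the second call the split AC|BDE conflicts with the reference split AB|CDE, because all four intersections are non-empty, so it would count toward the discordant total.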
To examine whether methodological artefacts could be responsible for this level of discordance, we carried out extensive statistical analyses (see supporting information) to test whether inferred discordance was more prevalent among data sets that are most prone to artefacts of clustering, alignment or phylogenetic inference. For some tests of protein clustering and G+C content biases, increasing threshold stringency eliminates more discordant than concordant conclusions. We also performed a bootstrapped parsimony analysis of insertion and deletion states in the aligned protein sequences. Over all cases where strongly supported bipartitions (PP ≥ 0.95) are paired with strong parsimony conclusions (bootstrap ≥ 70%), the level of agreement for discordant bipartitions (92%) is only slightly lower than for concordant ones (94%). These tests imply that, at the stringent PP thresholds we employ here, erroneous conclusions are only slightly biased toward discordance, i.e., toward LGT.
Genome Partners and the Phylogenetic Network. Proteins with discordant histories can be identified by simple comparison against a reference tree. It is much more difficult to identify the partners implicated in a transfer, or the shortest transfer path. The edit path between the supertree and a discordant protein tree represents a hypothesis about the set of historical LGT events responsible for the observed discordance. We developed an algorithm (supporting information) to search recursively for the shortest edit path(s) between the supertree and each discordant protein tree. We define a transfer as obligate if it is implied by every path in the set of most-parsimonious edit paths resolving the discordance of a given MRC, and as possible if it appears in some, but not necessarily all, of the most-parsimonious edit paths. Implied LGT events found in the edit paths of many discordant protein trees define "highways" of LGT between taxa.
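The definitions of obligate and possible transfers reduce to set operations once the most-parsimonious edit paths for an MRC are known. The following is a minimal sketch under that assumption: each path is given as a collection of transfers, each transfer is a hypothetical (partner, partner) label pair, and the function name `classify_transfers` is illustrative only. Finding the edit paths themselves is the hard part and is not attempted here.

```python
from typing import Iterable, Set, Tuple

Transfer = Tuple[str, str]  # hypothetical labels for the two partner lineages


def classify_transfers(
    paths: Iterable[Iterable[Transfer]],
) -> Tuple[Set[Transfer], Set[Transfer]]:
    """Return (obligate, possible) transfers for one discordant protein tree.

    obligate: transfers present in every most-parsimonious edit path.
    possible: transfers present in at least one path (includes obligate ones).
    """
    path_sets = [frozenset(p) for p in paths]
    if not path_sets:
        return set(), set()
    obligate = set(frozenset.intersection(*path_sets))
    possible = set(frozenset.union(*path_sets))
    return obligate, possible


if __name__ == "__main__":
    # Two equally short edit paths that share one transfer.
    paths = [
        [("X", "Y"), ("P", "Q")],
        [("X", "Y"), ("R", "S")],
    ]
    obligate, possible = classify_transfers(paths)
    print("obligate:", sorted(obligate))
    print("possible:", sorted(possible))
```

Counting how often the same obligate transfer recurs across many protein trees is then a matter of tallying these sets, which is how recurrent partner pairs would surface as candidate "highways".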
We observe that many common obligate edit operations (putative LGT events) affect taxa that are topologically close to each other and relatively terminal in the supertree. One such event, inferred for no fewer than 175 protein trees, implies transfer between an ancestor of Yersinia pestis and a common ancestor of Escherichia coli plus Salmonella. More than 250 LGT events are implicated within the Synechococcus-Prochlorococcus clade, consistent with the low support values seen in the reference supertree. Because LGT between immediate sister taxa cannot be inferred by using topological comparisons, our inferred counts of "short-distance" edits likely underestimate the true extent of sharing between closely related genomes.
"Long-distance" edits imply LGT between taxa from different phyla or divisions, typically crossing basal or "backbone" nodes in the supertree.
-Proteobacteria are implicated in a particularly large number of obligate long-distance LGT events, >150 with the γ-proteobacterial genus Pseudomonas alone. The two best-represented phyla, Proteobacteria and Low-G+C Gram-positives, are implicated in many long-distance edits; this is not wholly due to sampling frequency, as the four proteobacterial divisions preferentially exchange genes with different partners (Fig. 2) to an extent not simply proportional to the number of genomes represented in each division. Aquifex shares a substantial number of transfer events only with -proteobacteria, whereas clostridia (here including Fusobacterium nucleatum and Thermoanaerobacter tengcongensis) show diverse transfer relationships including with euryarchaeotes, Thermotoga, and -proteobacteria. The -proteobacteria exhibit more obligate transfers with other proteobacterial groups than among themselves, whereas pseudomonads and xanthomonads are frequently intermingled with -proteobacteria, and sometimes with -proteobacteria, in the protein trees.
In analyzing LGT involving individual taxa, it is often useful to consider (as above for higher-order taxa) not only obligate transfers, but also the more-numerous possible transfers. Among the five clostridia, T. tengcongensis shows the strongest affinity for T. maritima (34 possible transfers), with lesser affinities for F. nucleatum (23 possible transfers) and the three species of Clostridium (fewer than 10 in each case). T. tengcongensis also has the largest number of possible transfers (40 and 33) with the Archaea in general and the Euryarchaeotes in particular, whereas no other member of the clostridia has >19 possible transfers with the Archaea. Within the Proteobacteria, there is extensive evidence for transfers within genera such as Escherichia, Vibrio, and Xanthomonas. The most ecologically versatile organisms tend to be implicated in the largest number of transfers between major proteobacterial divisions: Pseudomonas aeruginosa, a soil- and water-borne bacterium, and a prominent pathogen in plants and animals, is implicated in possible transfers with organisms such as Ralstonia solanacearum and Caulobacter crescentus, which live in soil and water, as well as animal pathogens including Pasteurella multocida and Photorhabdus luminescens. The generalist plant pathogen R. solanacearum in turn has shared many genes with plant pathogens and symbionts including Pseudomonas syringae, Bradyrhizobium japonicum, and Mesorhizobium loti.
-proteobacteria, and other Gram-positive divisions. Proteins of Aquifex and Thermotoga frequently co-occur, often with euryarchaeal proteins as well. Proteins of F. nucleatum and T. tengcongensis also co-occur, supporting their arrangement in the supertree, but share many orthologs and paralogs with representatives of other Gram-positive divisions and with T. maritima. Profiles do not support monophyly of all Gram-positive divisions, as the high-G+C Gram-positive divisions show a much stronger affinity for the - and -proteobacteria than for the low-G+C Gram-positive divisions, even when size corrections are applied. Pseudomonads and xanthomonads often show stronger affinities for - and -proteobacteria than for each other, or for other subdivisions of the γ-proteobacteria such as Enterobacteriaceae.
We used χ² tests to examine functional correlates of the concordant versus discordant bipartitions among our 22,432 protein family trees. The National Center for Biotechnology Information clusters of orthologous groups (COG) database defines four major groupings: metabolism, cellular processes, information storage and processing, and poorly characterized or hypothetical genes (29). These groupings are further subdivided into 25 categories. The overall χ² for the four major groupings (Table 4) was 128.45 (3 df, P = 1.17 × 10⁻²⁷), with "metabolism" and "cellular processes" overrepresented among the set of discordant bipartitions relative to their frequency among the concordant ones. A test of distribution across the 25 functional categories (Table 5) yields a χ² value of 414.29 (24 df, P = 8.70 × 10⁻⁷³). Among proteins with annotated function, the only category with a distributional bias substantially different from its parent grouping is "inorganic ion transport and metabolism," which is underrepresented among discordant bipartitions, although its parent ("metabolism") is overrepresented.
We also assigned functions to our orthologous families by using The Institute for Genomic Research role categories database (30) in which 19 major groupings relevant to our analysis are further subdivided into 120 categories. χ² tests as above yielded P values of <1 × 10⁻⁵⁶ for both sets of functional categories. Among the 19 major groupings (Table 6), the functions most strongly overrepresented in the set of discordant bipartitions are "energy metabolism" and "mobile and extrachromosomal element functions," whereas "DNA metabolism," "protein synthesis," "protein fate," and "regulatory functions" were strongly underrepresented (i.e., highly concordant with the reference topology). Fig. 3 shows the ratio of observed to expected discordance for major role category groupings: proteins with metabolic functions tend to be more discordant than informational proteins. Among the 120 subsidiary categories (Table 7), several relating to sugar metabolism are among those most strongly overrepresented among our discordant bipartitions, as are proteins involved in amino acid metabolism and detoxification. Ribosomal proteins are less prone to discordance, as are proteins involved in DNA replication and repair, cell wall synthesis, and cell division.
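The form of these tests can be sketched as a χ² test of independence on a table of bipartition counts, with one row for concordant and one for discordant bipartitions and one column per functional grouping. The counts below are invented placeholders, not values from Tables 4 to 7, and the function name `chi_square` is hypothetical. The sketch computes the statistic, the degrees of freedom, and the observed-to-expected ratio of the kind plotted in Fig. 3; converting the statistic to a P value would additionally require a χ² survival function from a statistics library.

```python
from typing import List, Tuple


def chi_square(table: List[List[float]]) -> Tuple[float, int, List[List[float]]]:
    """Pearson chi-square statistic for an r x c contingency table.

    Returns (statistic, degrees_of_freedom, expected_counts).
    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[rt * ct / grand for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df, expected


if __name__ == "__main__":
    groupings = ["metabolism", "cellular processes", "information", "poorly characterized"]
    # Placeholder counts: row 0 = concordant, row 1 = discordant.
    counts = [[400.0, 300.0, 500.0, 200.0], [60.0, 70.0, 40.0, 30.0]]
    stat, df, expected = chi_square(counts)
    print(f"chi-square = {stat:.2f} with {df} df")
    for name, obs, exp in zip(groupings, counts[1], expected[1]):
        print(f"{name}: observed/expected discordance = {obs / exp:.2f}")
```

With two rows and four columns the test has (2 - 1)(4 - 1) = 3 degrees of freedom, matching the 3 df reported for the four major COG groupings; 25 categories likewise give 24 df.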
| Discussion |
|---|
-proteobacteria, and Thermotoga with the low-G+C Gram-positive divisions, probably in the branch that includes T. tengcongensis and F. nucleatum. These alternative affiliations were identified by Cavalier-Smith (31) on the basis of cell wall and other characters. The particularly strong nature of these alternative affiliations suggests that the arrangement of the Aquifex and Thermotoga genomes in the supertree may reflect strong lateral signal, rather than the shared history that is strongly supported in most of the tree. Given the extremely weak support (22 of 110 constituent trees at PP ≥ 0.95) for the grouping of Aquifex and Thermotoga in the supertree, it may be surprising that this arrangement is proposed at all. However, this arrangement appears to be a compromise among conflicting alternative affiliations for these taxa with various bacterial and archaeal partners, none of which is individually supported more strongly than is their association with each other. The true phylogenetic position of these taxa, and which proteins have vertical or lateral histories, may become clearer as more genomes from their respective phyla are sequenced. Extensive LGT has occurred among certain mesophiles as well. Some proteins from the Xanthomonas group of plant pathogens, and from soil bacteria of genus Pseudomonas, have closest relatives among the - and -proteobacteria. There is also evidence of extensive transfer within Cyanobacteria, particularly among strains of Prochlorococcus and Synechococcus.
Our results clearly show that genetic modification of organisms by lateral transfer is a widespread natural phenomenon. It will likely be impossible to assess exactly the footprint of LGT on prokaryotic genomes, because the percentage of genes or proteins that yield discordant tree topologies depends on the taxa sampled. This is particularly true for "orphan" proteins (6.5% in our data set), which lack recognizable homologs. However, for the diverse prokaryotes in our sample, we find a pervasive coherent vertical genetic signal with significant modulation by LGT, particularly among thermophiles, pathogens, and cyanobacteria. Coupled with rigorous phylogenetic methodology such as we employ here, the growth of community genomics (34-36) will lead to an increasingly precise delineation of the genomic, functional, and environmental determinants of vertical and lateral genetic transfer in nature.
| Acknowledgements |
|---|
| Footnotes |
|---|
This paper was submitted directly (Track II) to the PNAS office.
Abbreviations: rDNA, rRNA gene; LGT, lateral genetic transfer; MRC, maximally representative cluster; PP, posterior probability.
* To whom correspondence should be addressed. E-mail: m.ragan{at}imb.uq.edu.au.
© 2005 by The National Academy of Sciences of the USA
| References |
|---|