BIOLOGICAL SCIENCES / MICROBIOLOGY
Global patterns in bacterial diversity
*Departments of Molecular, Cellular, and Developmental Biology and
Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309
Edited by Norman R. Pace, University of Colorado, Boulder, CO, and approved May 29, 2007 (received for review December 22, 2006)
| Abstract |
|---|
environmental distribution | microbial ecology | phylogenetic diversity | UniFrac
| Results |
|---|
As we showed previously in marine environments (11), cultured samples from different environments (Fig. 2, pink circles and hexagons) generally cluster together rather than with their environment types. Cultured samples separate by salinity, however, both in the hierarchical cluster (SI Fig. 4) and along PC1 (Fig. 1). Although cultured samples do not separate from other water samples when PC1 and PC2 alone are used, PC3 clearly separates these groups (Fig. 2B). A few samples still do not separate from the cultured isolates when the first three principal components are used. These samples include both uncultured marine ice samples (Fig. 2B, green circles), about half of the endolithic communities (Fig. 2B, green triangles), and a small proportion of the other environment types. We have previously noted the similarity between uncultured marine ice communities and cultured isolates (11) and related it to the observation that most bacteria in marine ice can be cultured (15). The results suggest that the same may be true for many endolithic communities.
The saline environments separated along PC2 according to the same properties as the nonsaline environments, although clustering within each saline environment was looser. Hierarchical clustering (SI Fig. 4) and PCoA (Fig. 2) divided saline water samples into three subgroups: surface water, mostly in coastal regions (Fig. 2, blue inverted triangles); subsurface water, mostly in the open ocean (Fig. 2, gray sidewise triangles); and anoxic water from many locations (Fig. 2, cyan triangles; Table 1). The saline sediments (Fig. 2, purple circles) clustered together but overlapped other saline environments, including hypersaline mats, stromatolites, hydrothermal vent colonizers (Table 1, Saline–misc; Fig. 2, yellow squares), and anoxic saline water samples. Like nonsaline water and cultured isolates, surface/coastal water and cultures from saline environments separated from saline sediments along PC2. These results reinforce the suggestion that substrate type (water vs. sediment) is the second most important property for structuring diversity, perhaps because of differences in lineages adapted to planktonic vs. sessile lifestyles. However, because anoxic water samples cluster with sediments, oxygenation may also be important. For instance, clades of obligate anaerobes, such as the Clostridia, and clades with many planktonic representatives, such as filamentous -proteobacteria, probably account for some of these community differences.
Environment Types Differ Substantially in Phylogenetic Diversity (PD). We also determined the PD of each sample, which is the branch length that remains when all other sequences are removed from the tree (16), and the PD gain (G), which is the branch length a sample adds to a tree containing sequences from all other samples (16). For example, if a new sample contained only sequences already found in other studies, adding that sample's sequences to the tree would add no new branch length, and the G value would be 0. Environments with high G values are promising sites for discovering new, diverse microbial lineages. Samples with high PD and low G values have many phylogenetic lineages that are also found in other environments.
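The two quantities can be illustrated on a toy tree. The following is a minimal sketch, not the code used in this study: the tree, the sample assignments, and the function names `phylo_diversity` and `pd_gain` are hypothetical, and the sketch assumes that PD includes the branches connecting a sample's sequences to the root.

```python
# Hypothetical rooted tree: node -> (parent, length of the branch to the parent).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("n2", 1.5),
    "D": ("n2", 1.0),
    "n2": ("root", 3.0),
}

# Hypothetical samples: sample name -> set of leaf sequences observed in it.
SAMPLES = {
    "s1": {"A", "B"},
    "s2": {"B", "C"},
    "s3": {"D"},
    "s4": {"B"},  # contains only a sequence that other samples also contain
}


def branch_length(leaves):
    """Sum the branches lying on the paths from the given leaves to the root."""
    used = set()
    for leaf in leaves:
        node = leaf
        # Walk towards the root; stop once a branch has already been counted.
        while node in TREE and node not in used:
            used.add(node)
            node = TREE[node][0]
    return sum(TREE[node][1] for node in used)


def phylo_diversity(sample):
    """PD: branch length that remains when only this sample's sequences are kept."""
    return branch_length(SAMPLES[sample])


def pd_gain(sample):
    """G: branch length this sample adds to a tree built from all other samples."""
    all_leaves = set().union(*SAMPLES.values())
    other_leaves = set().union(
        *(leaves for name, leaves in SAMPLES.items() if name != sample)
    )
    return branch_length(all_leaves) - branch_length(other_leaves)


print(phylo_diversity("s1"), pd_gain("s1"))  # 3.5 1.0
print(phylo_diversity("s4"), pd_gain("s4"))  # 2.5 0.0
```

Sample `s4` mirrors the example in the text: every sequence it contains is already present elsewhere, so it has a nonzero PD but a G of 0.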
Because sequencing effort influences diversity estimates, we regressed both G (Fig. 3) and PD (SI Fig. 5) values on the number of OTUs in each sample. The relationships between sequencing effort and both PD and G are approximately linear (R² of 0.76 and 0.91, respectively), suggesting that deep sequencing of one environment uncovers as much new diversity as shallow sequencing of many related environments. Regressions for individual environment types indicated substantial differences in their contributions to known diversity (Fig. 3). We quantified these differences by calculating the residual of each sample from the regression of all samples (Fig. 3, blue line). Highly positive or negative residuals indicate high or low diversity, respectively (Table 1; see Data Set 1 for individual sample results).
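The correction for sequencing effort can be sketched as an ordinary least-squares fit followed by residual extraction. This is an illustration under stated assumptions, not the original analysis code: the function names and the example numbers are hypothetical, and scaling by the sample standard deviation of the residuals is an assumed convention for expressing residuals in standard deviations.

```python
from statistics import mean, stdev


def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standardized_residuals(x, y):
    """Residual of each point from the fitted line, in units of residual SD."""
    slope, intercept = linear_fit(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    sd = stdev(residuals)  # assumed convention: sample standard deviation
    return [r / sd for r in residuals]


# Hypothetical data: OTU count and G for four samples.
otu_counts = [20, 40, 60, 80]
gain_values = [2.0, 5.0, 5.0, 8.0]
for n_otus, z in zip(otu_counts, standardized_residuals(otu_counts, gain_values)):
    print(n_otus, round(z, 2))
```

A sample with a positive standardized residual contributes more branch length than expected for its OTU count; a negative value indicates less.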
The nonsaline cultured group also had significantly lower average PD and G residual values than the other environments (Table 1). This result is consistent with the observation that few lineages in these environments can be cultured (2). The saline-cultured environments also had negative average residuals for both total PD and G.
Saline sediment and saline-misc (Table 1) had significantly higher G values than other environment types. Nonsaline sediments and springs resembled saline sediments, but the sample sizes were too small for statistical significance (Table 1). Saline and nonsaline sediments also had high average PD residuals (Table 1). High diversity in sediments is consistent with previous observations and may stem from their highly stratified nature and chemical gradients (17). Nonsaline sediments are less thoroughly sampled than saline sediments and are thus especially good targets for future sequencing efforts. Interestingly, the miscellaneous saline and nonsaline spring groups had high G and low PD values, indicating that they, on average, contain relatively few, but highly divergent, lineages.
Some environment types clustered poorly, suggesting that they may not form natural groups. Residuals for individual samples are thus of interest (see SI Data Set 1 for values). The sample with the lowest G residual (Sws_M_163; –3.64 standard deviations from the mean) was from the Sargasso Sea (20), an environment known to have low diversity because of nutrient limitation and little spatial heterogeneity. The samples with the highest G residuals (So_Mm+_166 and So_Mm+_168; 5.08 and 3.71 standard deviations from the mean, respectively) were from different layers of the Guerrero Negro hypersaline mat, the molecular analysis of which introduced 15 previously unidentified candidate phyla, an unprecedented number for a single environment (21).
| Discussion |
|---|
The results also add an interesting perspective to the study of extreme environments. Although organisms in environments at the extremes of temperature and pH are presumably under strong selective pressures, they still cluster by salinity and substrate type, indicating that the general properties of these environments still primarily determine which lineages can survive there.
The ability of comparisons of 16S rRNA data to reveal the effects of specific chemical and physical factors on microbial communities depends on the quality of information that has been measured for the source environments and the accessibility of this information in the public databases. Although we found clear patterns of variation between environment types, such as the split between saline and nonsaline environments, whether this split stems from ionic strength, osmolarity, availability of sulfate for reduction, or other factors remains unresolved, in part because detailed measurements were not available for many of the environmental samples. Another limitation is that, because the records do not include information on how many times each sequence was observed in each sample, it is not possible to compare samples by using quantitative measures of β diversity such as weighted UniFrac (22). Information about relative abundances is also required for almost all measurements of α diversity (total diversity of a sample), including Chao1, ACE, rarefaction analysis, and the Shannon and Simpson indices (reviewed in ref. 23). Thus, improved availability of environment information within structured, machine-readable fields in the database is a key requirement for future large-scale analyses of the factors influencing microbial diversity.
The overview that this analysis provides is useful for evaluating where to direct new sequencing efforts. The environmental clustering patterns allow us, at least in some cases, to define environment types based on the occurrence of similar bacterial lineages rather than arbitrary criteria. For instance, nonsaline lakes and rivers behave as a cohesive group but saline water does not. Evaluation of these environment types, as well as of individual environments, allows us to identify optimal targets for finding new diversity.
| Materials and Methods |
|---|
Making the Phylogenetic Tree. We used NAST (24) to add sequences from the 111 selected studies to the standard Arb alignment (25). We then added the 21,752 sequences from the studies to a guide tree with >110,000 sequences using the Arb parsimony insertion tool. The guide tree was initially described in ref. 26 but was subsequently enhanced by the Pace lab (J. K. Harris and N. R. Pace, personal communication). We used a lanemask ("lanemaskPH") that is provided with the Hugenholtz Arb database (27), available at the Ribosomal Database Project II (28), to exclude hypervariable regions from consideration while generating the tree. We chose a parsimony insertion algorithm rather than a de novo method such as neighbor joining (NJ) because it can relate sequences from different parts of the 16S rRNA molecule. This is essential because there is very little overlap in sequenced 16S rRNA regions when comparing all of the studies. For instance, only 6,552 of the 21,752 sequences (30%) were complete between positions homologous to 300 and 700 in Escherichia coli 16S rRNA and only 7,102 (33%) were complete for the region between E. coli positions 700 and 1,100. To test whether the Arb parsimony insertion tree gave similar results to a tree built de novo, we performed PCoA clustering on NJ trees of sequences from the 82 and 90 environments that had >15 sequences in the 300–700 region and the 700–1,100 region, respectively. The NJ trees were also made in Arb, by using the Jukes–Cantor model of nucleotide substitution. We compared the results to those from Arb parsimony insertion trees with the same set of sequences. For both regions, the results of PCoA clustering with the parsimony insertion and NJ trees were almost identical (data not shown).
Clustering by using only the portion of the data that could be incorporated into the NJ trees recovered the saline/nonsaline split as the most important division in the data for both regions, although the coordinate axes were rotated slightly.
Selecting OTUs and Annotating the Tree with Environment Information. We divided the sequences into 225 environmental samples using annotations from the associated publications. By excluding 23 samples with <15 OTUs each, we produced a tree with 12,984 OTUs representing 202 samples. For each environmental sample, we chose OTUs with a 97% identity threshold using our Divergent Set software (5). We decided to dereplicate the sequence data for several reasons. First, dereplication of the data has little effect on clustering with UniFrac, because inclusion of near-identical sequences will not change the amount of unique branch length in the tree. Removing near-identical sequences thus produces a smaller tree that is more easily manipulated, without affecting the results. Second, because the inclusion of very small samples in a UniFrac analysis can produce spurious results, we wanted to exclude small environmental samples. Because some studies deposit near-identical sequences in GenBank, and others deposit sequences only after choosing OTUs, we needed to remove near-identical sequences from all studies to evaluate our sampling effort fairly. Finally, when we corrected the raw PD and G values for sampling effort, it was again essential to ensure that the results would be robust to the methodology used to choose OTUs in the original studies. We chose the 97% threshold because this is the most common threshold used for dereplication at the species level. Repetition of the analysis with all available sequences, i.e., without choosing OTUs at all, provided almost identical UniFrac clustering results (data not shown).
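The idea of dereplication at a 97% identity threshold can be sketched with a simple greedy procedure. This is a generic illustration and is not the algorithm implemented in the Divergent Set software, which is not described here. The function names are hypothetical, and the sketch assumes that sequences are already aligned to equal length with `-` as the gap character.

```python
def identity(seq_a, seq_b):
    """Fraction of matching columns between two aligned, equal-length sequences.

    Columns in which both sequences have a gap are ignored.
    """
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    return sum(a == b for a, b in columns) / len(columns)


def pick_otus(named_seqs, threshold=0.97):
    """Greedily keep a sequence as a new OTU representative only if it is
    less than `threshold` identical to every representative kept so far."""
    representatives = []
    for name, seq in named_seqs:
        if all(identity(seq, rep_seq) < threshold for _, rep_seq in representatives):
            representatives.append((name, seq))
    return [name for name, _ in representatives]


# Hypothetical aligned sequences of length 100.
sequences = [
    ("s1", "A" * 100),
    ("s2", "A" * 98 + "CC"),       # 98% identical to s1, so merged into its OTU
    ("s3", "A" * 90 + "C" * 10),   # 90% identical to s1, so a separate OTU
]
print(pick_otus(sequences))  # ['s1', 's3']
```

Greedy procedures of this kind depend on input order, which is one reason the choice of OTU-picking method can matter when comparing studies.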
Statistical Analyses. We performed PCoA and hierarchical clustering in the UniFrac web interface (10), using the Arb tree and a file mapping sequence labels to environmental samples as input. PCoA is similar to principal components analysis (PCA), except that the starting point is a matrix of distances between samples rather than a matrix of observations about each sample. We used the unweighted pair group method with arithmetic mean (UPGMA) hierarchical clustering algorithm, which produces clusters by finding the nearest pair of neighbors at each step, finding the midpoint between these neighbors, and adding a cluster consisting of the neighbors to a growing tree.
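To make the distinction from PCA concrete, the sketch below recovers the first principal coordinate directly from a distance matrix. It is not the UniFrac implementation: it uses Gower double-centering followed by power iteration, the function name is hypothetical, and it assumes that the dominant eigenvalue of the centered matrix is positive, which holds for Euclidean distances but is not guaranteed for every distance measure.

```python
import math


def pcoa_first_axis(dist, iterations=200):
    """Coordinates of each sample on the first principal coordinate axis."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Gower double-centering of the squared distances (matrix is symmetric,
    # so row means and column means coincide).
    centered = [
        [-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the dominant eigenvector, from a fixed unit start vector.
    start = [float(i + 1) for i in range(n)]
    start_norm = math.sqrt(sum(x * x for x in start))
    vector = [x / start_norm for x in start]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(centered[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # all samples coincide; no axis to extract
        vector = [x / norm for x in product]
        eigenvalue = norm  # magnitude of the dominant eigenvalue
    # Scale the unit eigenvector by the square root of its eigenvalue.
    return [x * math.sqrt(eigenvalue) for x in vector]


# Hypothetical distances between three samples that lie on a line at 0, 1 and 3.
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
axis = pcoa_first_axis(distances)
print([round(x, 3) for x in axis])
```

Only the pairwise distances enter the calculation, yet the spacing of the three samples along the recovered axis reproduces those distances (up to a reflection of the axis).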
We also used the Arb tree for diversity analyses. We calculated PD for each sample by removing all sequences not from the sample from the tree and summing the remaining branch length. We determined G by removing only the sequences from that sample from the tree and calculating the branch length lost, i.e., the total branch length of the full tree minus the branch length that remains. We corrected each PD and G value for sampling effort by calculating the residual from the regression of PD and G vs. OTU count for all of the samples. We determined whether the average G and PD residuals for each environment type were significantly different from samples not in that environment type with a two-tailed Student's t test. These statistical analyses were performed by using custom code written in the Python language.
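The comparison of one environment type against all other samples can be sketched as a pooled-variance Student's t statistic. This is an illustrative reconstruction rather than the custom code referred to above: the function name and the residual values are hypothetical, and the sketch returns only the statistic and its degrees of freedom. Converting these to a two-tailed P value requires the t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind` is one existing option.

```python
import math
from statistics import mean, variance


def student_t(group, rest):
    """Pooled-variance (equal-variance) two-sample t statistic and degrees of freedom."""
    n_group, n_rest = len(group), len(rest)
    df = n_group + n_rest - 2
    pooled_var = ((n_group - 1) * variance(group) + (n_rest - 1) * variance(rest)) / df
    std_error = math.sqrt(pooled_var * (1 / n_group + 1 / n_rest))
    return (mean(group) - mean(rest)) / std_error, df


# Hypothetical G residuals for one environment type and for all other samples.
environment_residuals = [1.0, 2.0, 3.0]
other_residuals = [4.0, 5.0, 6.0]
t_stat, df = student_t(environment_residuals, other_residuals)
print(round(t_stat, 3), df)
```

A negative statistic indicates that the environment type has lower residuals, on average, than the remaining samples.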
| Acknowledgements |
|---|
| Footnotes |
|---|
Abbreviations: OTU, operational taxonomic unit; PCoA, principal coordinates analysis; PD, phylogenetic diversity; G, gain in phylogenetic diversity; SSU, small subunit; NJ, neighbor-joining.
To whom correspondence should be addressed. E-mail: rob{at}spot.colorado.edu
Author contributions: R.K. designed research; C.A.L. performed research; C.A.L. analyzed data; and C.A.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0611525104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|