Evidence for a persistent microbial seed bank throughout the global ocean

Edited by David M. Karl, University of Hawaii, Honolulu, HI, and approved January 29, 2013 (received for review October 11, 2012)
March 4, 2013
110 (12) 4651-4655


Do bacterial taxa demonstrate clear endemism, like macroorganisms, or can one site’s bacterial community recapture the total phylogenetic diversity of the world’s oceans? Here we compare a deep bacterial community characterization from one site in the English Channel (L4-DeepSeq) with 356 datasets from the International Census of Marine Microbes (ICoMM) taken from around the globe (ranging from marine pelagic and sediment samples to sponge-associated environments). At the L4-DeepSeq site, increasing sequencing depth uncovers greater phylogenetic overlap with the global ICoMM data. This site contained 31.7–66.2% of operational taxonomic units identified in a given ICoMM biome. Extrapolation of this overlap suggests that 1.93 × 10¹¹ sequences from the L4 site would capture all ICoMM bacterial phylogenetic diversity. Current technology trends suggest this limit may be attainable within 3 y. These results strongly suggest the marine biosphere maintains a previously undetected, persistent microbial seed bank.
Baas Becking (1) proposed that, in microbial ecology, “everything is everywhere, but the environment selects,” suggesting that variation in environmental factors drives biogeographic patterns of microbial community membership. Microbial communities are altered by environmental factors such as day length, pH, and biological interactions (2–5). In some ecosystems, community composition changes quickly across space and time as niches open and close; these changes may reflect rapid dispersal or rapid growth of rare or dormant taxa from a “microbial seed bank” (6–9). Many studies indicate that dispersal between distant environments is limited (10–13), implying that microbial communities also can be shaped by their demographic history, especially at finer phylogenetic levels.
Sequencing costs now are dropping low enough to allow deep sequencing of individual microbial communities. Our L4-DeepSeq dataset (∼10 million 16S rRNA V6 reads) showed that nearly all operational taxonomic units (OTUs) identified at any time during a 72-mo time series from one location in the Western English Channel were present in this single deeply sequenced time point (6, 14). Therefore, in this ecosystem, virtually all taxa were present at all times, but their abundance varied over many orders of magnitude as environmental conditions changed. These results suggest that, in contrast to the widely accepted model that the presence or absence of particular microbial taxa drives community structure, sufficient sequencing would show instead that global patterns of bacterial community composition within the marine biosphere consist primarily of changes in relative abundance of community members shared across all environments. In other words, the null hypothesis—that all bacteria are found in any particular environment because of an immense and persistent microbial seed bank—might be tested against the alternative hypothesis—that some environments lack some bacteria, i.e., that endemism exists—by sufficient sequencing.
Here we begin to test this hypothesis by comparing the L4-DeepSeq dataset with the global International Census of Marine Microbes (ICoMM), which comprises 356 datasets of bacterial 16S rRNA V6 amplicon sequences from 40 different studies ranging from marine pelagic and sediment samples to mangrove and sponge-associated environments (15). This comparison with a range of more shallowly sequenced sites allowed us to investigate the overlap in community membership between biomes, the phylogenetic similarity of communities in different biomes, and the potential that deep-sequencing a given ecosystem might identify a core microbiota for the global ocean.

Results and Discussion

As hypothesized, we found that the L4-DeepSeq site showed significant overlap with the ICoMM datasets. This overlap increased with increasing sequencing depth. Specifically, we show that with increasing sequencing depth at the L4 site, Faith’s phylogenetic gain (the fraction of shared branch length unique to a query sample relative to a reference sample; hereafter referred to as “phylogenetic gain”) for the pooled ICoMM data decreased with respect to the L4-DeepSeq sample (16). In other words, as sequencing depth increases, the degree of phylogenetic overlap between the L4-DeepSeq sample and the pooled ICoMM data increases (Fig. 1). Given this trend, one question that arises is how deeply we would need to sequence the L4 site to see full phylogenetic overlap with the current ICoMM database. Extrapolation of the apparent log-linear relationship between sequencing depth and phylogenetic overlap suggests that, to achieve a 100% overlap, a 16S rRNA V6 sequencing depth of 1.93 × 10¹¹ reads would be required. Such an experiment is feasible in principle, although to collect enough cells for sequencing at this depth, >200 L of seawater would have to be filtered (compared with the 2 L collected for our L4-DeepSeq sample), and $1.34M would be spent on sequencing (based on 2.5 billion reads per Illumina HiSeq2000 run at $20,000 per run). This experiment would expand the amount of material sampled by two orders of magnitude, although both these sample volumes are minuscule compared with the size of the marine environment, and the difference is not predicted to affect the result substantially.
Fig. 1.
L4-DeepSeq rarefaction depth vs. percent phylogenetic gain of the pooled ICoMM data relative to the L4 site. Rarefaction depth is plotted on a log scale (base 10). Rarefaction was replicated 10 times at each depth. Error bars represent the SD. The arrow indicates the sequencing depth when phylogenetic overlap is complete (when overlap = 100%, depth = 1.93E11).
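The extrapolation described above can be illustrated with a minimal sketch, assuming that phylogenetic gain falls linearly with log10 of sequencing depth. The function names and the data points below are hypothetical and chosen only for illustration; they are not the values behind Fig. 1, and the actual fitting procedure used for the published estimate may differ.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def depth_at_zero_gain(depths, gains):
    """Extrapolate to the depth at which phylogenetic gain reaches 0
    (i.e., overlap reaches 100%), assuming gain is linear in log10(depth)."""
    log_depths = [math.log10(d) for d in depths]
    slope, intercept = fit_line(log_depths, gains)
    if slope >= 0:
        raise ValueError("gain must decrease with depth to extrapolate")
    return 10 ** (-intercept / slope)


# Hypothetical, illustrative values only (NOT the data behind Fig. 1).
toy_depths = [1e4, 1e5, 1e6, 1e7]
toy_gains = [70.0, 60.0, 50.0, 40.0]  # percent phylogenetic gain

print(f"{depth_at_zero_gain(toy_depths, toy_gains):.3g}")
```

Because the fit is made on a log scale, small changes in the fitted slope shift the extrapolated depth by large factors, so an estimate of this kind is best read as an order of magnitude.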
We also can quantify the overlap between the ICoMM datasets and the L4-DeepSeq sample nonphylogenetically by calculating the number of OTUs shared between L4-DeepSeq and each ICoMM biome. A total of 108,866 nonsingleton OTUs at 97% identity were identified in the pooled, nonrarefied ICoMM and L4-DeepSeq datasets. Of these nonsingleton OTUs, 74,404 were found in the ICoMM samples, and 44,165 were found in the L4-DeepSeq sample. The ICoMM data were summarized by biome and resampled to an even depth of 15,790 sequences per biome (the depth of the smallest sequencing effort). Of the OTUs identified in each ICoMM biome, 31.7–66.2% were found in the L4-DeepSeq sample (Fig. 2A). As expected, the neritic epipelagic biome, which includes the L4 site, has the highest overlap (lowest OTU gain, i.e., the fraction of OTUs unique to a query sample relative to a reference sample). Surprisingly, the phylogenetic gain for the neritic epipelagic biome is close to the average across all biomes (Fig. 2B). This result may reflect the fact that more individual samples came from the neritic epipelagic biome (Table S1), which has a wide geographic distribution (thus increasing the chances of detecting more site-specific, phylogenetically divergent taxa). Strikingly, two of the lowest overlaps with L4-DeepSeq were observed for the estuarine and mangrove biomes, which represent a shift from the standard fully saline marine ecosystems to a reduced-salinity environment, supporting the observation that salinity is a major driver of microbial community structure (17). Our main finding is that, on average, 44% of the microbial community in any given biome was present at a single geographic site (L4). 
Therefore, at a detection resolution of 10 million sequences, almost half of all marine microbial taxa (from any given biome) also are found in an arbitrarily chosen sample if sequenced at sufficient depth, although the extent of this core depends on some of the choices made in the analysis (Materials and Methods). This result could help explain recent work showing the unexpectedly wide distribution of certain organisms (such as thermophilic endospores) in arctic sediments (18).
Fig. 2.
(A) Percent OTU overlap across marine biomes (relative to the L4-DeepSeq sample). (B) Phylogenetic gain across marine biomes. For both A and B, individual biomes were rarefied to 15,790 sequences and compared with the full L4-DeepSeq sample (∼10 million reads). The dashed lines represent the average across all biomes.
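The nonphylogenetic comparison can be sketched with plain set operations. This is a minimal illustration, assuming each sample is reduced to the set of OTU identifiers observed in it; the function name and the toy identifiers are hypothetical and not taken from the study's data.

```python
def otu_overlap(query_otus, reference_otus):
    """Return (overlap, gain) for a query sample relative to a reference.

    overlap: fraction of query OTUs that are also found in the reference.
    gain:    fraction of query OTUs that are absent from the reference.
    """
    query = set(query_otus)
    if not query:
        raise ValueError("query sample contains no OTUs")
    shared = query & set(reference_otus)
    overlap = len(shared) / len(query)
    return overlap, 1.0 - overlap


# Toy identifiers for illustration only.
toy_biome = {"otu_a", "otu_b", "otu_c", "otu_d"}
toy_deep_sample = {"otu_a", "otu_b", "otu_x"}

overlap, gain = otu_overlap(toy_biome, toy_deep_sample)
print(overlap, gain)
```

In this framing the biome is the query and the deeply sequenced sample is the reference, so a biome with high overlap has low OTU gain.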
If this dispersed, shared seed bank exists, community differences between sites are a function both of depth of sequencing (a technical issue) and the fact that taxa are found in significantly different proportions as driven by a range of environmental factors (e.g., light, temperature, nutrients). In our analysis of the ICoMM datasets, as expected, community structure (both composition and abundance of taxa) gave a unique signature for each biome, leading to biome-specific sample clustering (Fig. 3); however, many OTUs from L4-DeepSeq were shared with all other biomes (shown by the radiating white edges in Fig. 3), highlighting the considerable OTU overlap revealed by deep sequencing one sample. Despite this high overlap, the differences in taxonomic composition between L4-DeepSeq and the ICoMM biomes still were substantial, as defined by the presence of abundant OTUs in each biome that were not found in the L4-DeepSeq sample. In Fig. 3, the large phylogenetic tree shows the evolutionary relationships between the abundant (>500 reads) OTUs in the combined ICoMM and L4-DeepSeq dataset, and the smaller trees display the lineages that contained abundant ICoMM OTUs not found in the L4-DeepSeq dataset (colored wedges). For example, the marine cold seep biome contributed OTUs from the Halanaerobiaceae family. This family includes anaerobic, halophilic species, which have been found to be highly abundant in hypersaline brine pools such as those associated with cold seeps (19); this comparison suggests that a number of Halanaerobiaceae OTUs in the cold seep biome were not detected in the L4 site (hence the Halanaerobiaceae lineage is colored), whereas the abundant OTUs (>500 sequences) from every other lineage were all shared between the cold seep biome samples and L4 (hence they are white). The neritic sublittoral zone, which included sponge-associated samples, contributed taxa from the Desulfovibrio that were not present in the L4-DeepSeq sample (Fig. 3). 
Representatives from Desulfovibrio have been identified in sponges previously (20) and are known coral pathogens (21). Finally, the marine hydrothermal vent samples contributed members of the Campylobacterales not detected in the L4-DeepSeq sample. Campylobacterales is an order within the ɛ-proteobacteria that includes both free-living and host-associated chemolithotrophs, such as those associated with tube-worms surrounding hydrothermal vents (22). These biome-specific OTUs (potentially endemic OTUs) highlight differences in the taxonomic lineages that have a degree of endemism between biomes. Understanding the phylogenetic diversity that is unique to each environment can shed light on the ecological processes and interactions that are important in those biomes.
Fig. 3.
Biome-specific community clustering and phylogenetic differences between biomes and the L4-DeepSeq sample. The network in the center of the figure was constructed in Cytoscape, using the BioLayout format (edge-weighted, force-directed). To reduce the complexity of the network, only OTUs that appear more than 500 times in the OTU table were included. Nodes and edge colors represent biome type (see Figs. S4 and S5). OTUs are represented as invisible points at the fringes of the plot (i.e., at the termini of the edges). Each OTU is connected to the samples in which it appears via an edge (colored according to the biome to which the sample belongs). The white-colored node at the center of the network represents the L4-DeepSeq sample. All edges connected to the L4-DeepSeq sample also are colored white to allow visualization of overlap with other biomes. The phylogenetic tree in the upper left corner of the plot shows taxa that are unique to particular environments (i.e., that are not present in the L4-DeepSeq sample; the wedge area is proportional to abundance). Smaller trees highlight individual biomes (only taxa contributed from that biome are shown in color, and the remaining branches of the tree are shown in white). The larger tree serves as the key for the smaller identical trees, displaying the names of each lineage. A high-resolution version of this figure is available in Fig. S3 (see Figs. S4 and S5 for keys).
Despite the clear overlap between the L4-DeepSeq sample and individual ICoMM biomes, no significant core microbial community was found among the shallowly sequenced ICoMM samples at the OTU level (97%; Fig. S1). However, Fig. 1 suggests that if we were to sequence any particular ICoMM biome to 10 million reads, we would see patterns of overlap similar to those that were detected in the L4 site. When comparing across the ICoMM data (at 5,000 sequences per sample), the single most ubiquitous OTU (from Rickettsiales) was found in 63% of the ICoMM samples and 90% of the biomes. Combined with the L4-DeepSeq results, this result suggests that (in certain cases) observed microbial endemism might be an artifact of low sequencing depth (Fig. 1 and Fig. S1). Interestingly, at current levels of sequencing depth, OTUs that were shared among multiple samples tended to be abundant (Fig. S2), suggesting that a core taxon is far more likely to be represented by an abundant OTU than by a rare one, although several rare taxa are shared among sites (Fig. S2). Prior work suggests that higher population density increases the probability of dispersal (10), which is consistent with our finding that abundant OTUs tend to be shared and rare OTUs seem to be disproportionately endemic. However, this finding simply may be the result of the limits of detection, because higher density also would improve the chances of being picked up by shallow sequencing efforts. We note that a similar observation also would be likely if low-abundance reads represented sequencing error. Differentiating low-abundance 16S rRNA reads from erroneous reads is still an open research challenge, although we attempted to be conservative with our quality controls.
Taken together, these results support our hypothesis that the global ocean contains a persistent microbial seed bank and that changes in community structure primarily reflect shifts in the relative abundance of taxa rather than their presence or absence. The presence of a shared seed bank has broad implications for ecological theory and for microbial community modeling. If every organism may occur anywhere, dispersal limitation is less important, because low-abundance populations can expand rapidly when conditions are right. A seed bank also would help explain the observed seasonal relationship between taxonomic evenness and taxonomic richness in the Western English Channel, where increased evenness also increases observed richness at a given sequencing depth [because fewer rare taxa are missed because of sampling considerations (6)], and may explain these patterns in other habitat types. To address this hypothesis further, future work in microbial biogeography should focus on increasing both sequencing breadth and depth to understand the difference between rare and absent taxa. We note that the present results deal only with the phylogenetic distribution of 16S rRNA gene-defined phylotypes, and endemism that would not be detected using the present techniques may occur at the whole-genome level. Geographically localized evolution and natural selection undoubtedly will lead to differences in genome sequence between spatially distant but phylogenetically similar organisms, and characterizing these fine-scale differences remains an important challenge for the field.

Materials and Methods


All bacterial sequence data analyzed in this study were obtained from the V6 hypervariable region of the 16S rRNA gene. All amplicons were generated using procedures outlined in Huber et al. (23) using a set of five forward primers (967F-PP: 5′-gcctccctcgcgccatcagCNACGCGAAGAACCTTANC-3′; 967F-UC1: 5′-gcctccctcgcgccatcagCAACGCGAAAAACCTTACC-3′; 967F-UC2: 5′-gcctccctcgcgccatcagCAACGCGCAGAACCTTACC-3′; 967F-UC3: 5′-gcctccctcgcgccatcagATACGCGARGAACCTTACC-3′; 967F-AQ: 5′-gcctccctcgcgccatcagCTAACCGANGAACCTYACC-3′) and four reverse primers (1046R: 5′-gccttgccagcccgctcagCGACAGCCATGCANCACCT-3′; 1046R-PP: 5′-gccttgccagcccgctcagCGACAACCATGCANCACCT-3′; 1046R-AQ1: 5′-gccttgccagcccgctcagCGACGGCCATGCANCACCT-3′; 1046R-AQ2: 5′-gccttgccagcccgctcagCGACGACCATGCANCACCT-3′). The raw ICoMM sequence data were downloaded from the ICoMM database (http://icomm.mbl.edu). The L4-DeepSeq sample was taken on December 12, 2007, as described previously (6). The L4-DeepSeq data have been deposited in the European Bioinformatics Institute-Sequence Read Archive database under the accession number ERP001778.

Sequence Processing and Analysis.

All sequence analysis was done using QIIME 1.5.0-dev (24). ICoMM 454 data were denoised using the QIIME denoiser (25). Concern over whether the ultra-deep sequencing of the 16S rRNA V6 region would lead to a saturation of sequence variability space was alleviated because the potential variability within the 62-bp-long fragment provides a total of 5.32 × 10³⁶ 97% clustered OTUs. Therefore, the observed seed bank cannot statistically be the result of a lack of variation in 16S rRNA, even if variation is constrained by known secondary structure features, which are highly variable for V6. A two-step open-reference OTU picking workflow was developed within the QIIME environment. In the first step, OTUs were picked (i.e., reads were binned into species groups based on 97% sequence similarity) for the L4-DeepSeq sample against the Greengenes database (26) preclustered at 97% identity, and reads that did not group with any sequences in the reference collection were clustered de novo. A representative sequence was chosen as the cluster centroid for each de novo OTU, and these representative sequences were combined with the full Greengenes set and used as the reference for open-reference OTU picking on the ICoMM samples. Cluster centroids for new OTUs were chosen again as the OTU representative sequences, and all representative sequences from the L4-DeepSeq sample and the ICoMM samples were combined and aligned against the Greengenes core set with PyNAST (27). All OTUs whose representative sequences failed to align were discarded. Two different phylogenetic trees were used in these analyses: one for open-reference OTU picking, and one for closed-reference OTU picking. 
The open-reference OTU-picking tree is generated by aligning OTU representative sequences (i.e., cluster centroids) with PyNAST against the Greengenes core set (version: February 2011) (27, 28), filtering highly variable positions as defined by positions with a 1 in the corresponding “lanemask” (29), and building a tree using FastTree 2.1.3 with default parameters (30). This is the standard workflow in place in pick_subsampled_reference_otus_through_otu_table.py in QIIME 1.5.0-dev. The closed-reference OTU picking tree is the Greengenes tree, pruned to contain only tips corresponding to OTU representative sequences (i.e., cluster centroids) in the 4feb2011 97% OTU reference collection. The construction of this tree is described in McDonald et al. (26) (see below). Briefly, the Greengenes tree (February 2011) was built from quality-filtered (26), full-length 16S rRNA sequences (97%-similar OTUs), using FastTree v2.1.1 with a maximum likelihood method [CAT (short for CATegorization) approximation, with branch lengths rescaled using a gamma model] (30). Taxonomy was assigned to each sequence using the Ribosomal Database Project classifier (31) retrained on Greengenes. All eukaryotic and archaeal OTUs (i.e., those not classified as k_Bacteria) were filtered out of the OTU table, because the goal of this study was to focus on the composition of the bacterial community, leaving 356 ICoMM samples (∼8 million reads total, after quality control) and the L4-DeepSeq sample (∼10 million reads, after quality control). Singleton OTUs were filtered out of the L4-DeepSeq sample, as were ICoMM OTUs that appeared in only a single sample, to reduce further the noise caused by PCR or sequencing error. 
Note that a two-sample heuristic was applied to the ICoMM samples, meaning that each OTU must appear in at least two samples to be included in downstream analysis, but for the L4-DeepSeq sample only a two-observation heuristic was applied, meaning that each OTU must be observed at least two times, but both of those observations can be in the same sample (see Table S2 for overlap values resulting from a two-observation heuristic applied to all the data). Our reasoning is that we expect to see many more reads of truly rare organisms in the L4-DeepSeq sample because of the greater sequencing depth, so imposing a two-sample heuristic against samples with much lower sequencing depth would result in the filtering of real OTUs. This approach results in conservative (i.e., low) estimates of the fractions of ICoMM OTUs observed in L4-DeepSeq. See Datasets S1 and S2 for a comprehensive list of the QIIME commands used to generate these analyses and for the full OTU table (generated with the two-observation/two-sample heuristics for L4/ICoMM), respectively.
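The two filtering heuristics can be contrasted in a short sketch. This is an illustration only, assuming an OTU table held as a dictionary that maps each OTU identifier to per-sample read counts; the function names and toy table are hypothetical, and the commands actually used are listed in Dataset S1.

```python
def filter_two_sample(table):
    """Keep OTUs that were observed in at least two different samples."""
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(1 for c in counts.values() if c > 0) >= 2
    }


def filter_two_observation(table):
    """Keep OTUs observed at least twice in total; both observations
    may come from the same sample."""
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(counts.values()) >= 2
    }


# Toy OTU table for illustration only: OTU id -> {sample id: read count}.
toy_table = {
    "otu_1": {"s1": 5, "s2": 1},  # passes both filters
    "otu_2": {"s1": 7, "s2": 0},  # two or more reads, but only one sample
    "otu_3": {"s1": 1, "s2": 0},  # singleton, fails both filters
}

print(sorted(filter_two_sample(toy_table)))
print(sorted(filter_two_observation(toy_table)))
```

The toy table shows why the two-sample rule is the stricter one: an OTU with many reads in a single sample survives the two-observation filter but not the two-sample filter.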
ICoMM samples were rarefied (randomized resampling without replacement) to 5,000 sequences per sample (resulting in the retention of 355 ICoMM samples), and the rarefied ICoMM OTU table was combined with the full L4-DeepSeq set. OTU gain (the number of OTUs unique to a query sample relative to a reference sample) and phylogenetic gain [the fraction of shared branch length unique to a query sample relative to a reference sample; Faith’s phylogenetic gain, in units of UniFrac branch length (32)] were calculated for each ICoMM sample and for the pooled ICoMM data relative to the L4-DeepSeq sample (16). Alpha diversity, beta diversity, and the core microbiome were calculated and plotted using QIIME (24).
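A toy version of the phylogenetic gain calculation may help make the definition concrete. The sketch below assumes the tree is supplied as a list of branches, each with a length and the set of tip OTUs beneath it, and it normalizes the unique branch length by the branch length covered by either sample. That normalization, the data structure, and the function names are assumptions made for illustration, not the QIIME implementation.

```python
def covered_length(branches, otus):
    """Total length of branches that lead to at least one OTU in `otus`."""
    return sum(length for length, tips in branches if tips & otus)


def phylogenetic_gain(branches, query, reference):
    """Return (unique_length, fraction) for a query relative to a reference.

    unique_length: branch length leading to query OTUs but to no reference OTU.
    fraction:      unique_length divided by the branch length covered by the
                   query or the reference (an assumed normalization).
    """
    query, reference = set(query), set(reference)
    unique = sum(
        length
        for length, tips in branches
        if tips & query and not tips & reference
    )
    total = covered_length(branches, query | reference)
    if total == 0:
        raise ValueError("neither sample is represented in the tree")
    return unique, unique / total


# Toy tree ((A,B),(C,D)) for illustration: (branch length, tips below branch).
toy_branches = [
    (1.0, frozenset({"A"})),
    (1.0, frozenset({"B"})),
    (0.5, frozenset({"A", "B"})),
    (2.0, frozenset({"C"})),
    (1.5, frozenset({"D"})),
    (0.5, frozenset({"C", "D"})),
]

print(phylogenetic_gain(toy_branches, query={"A", "C"}, reference={"A", "B"}))
```

In the toy tree, the query adds the branch to C and the internal branch above C and D, so its gain depends on how divergent its unshared OTUs are and not only on how many there are.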
Analyses of the ICoMM data were performed with and without the L4 ICoMM samples [86 samples (4)]; however, the results reported here were analyzed excluding the L4-ICoMM samples (to be conservative in our estimation of community overlap). The slight offset between the L4 ICoMM samples and the L4-DeepSeq sample (resulting, in part, from differences in the source and frequency of sequencing errors between Illumina and 454 platforms) was greater than observed previously (6). This discrepancy is the result of an updated strategy [relative to ref. 6] for OTU picking and calculating fractional OTU gain (as described above) and our use of a newer version of the Greengenes reference collection. Therefore, we applied the OTU richness gain, fractional OTU gain, and phylogenetic gain (415, 0.188, and 0.010, respectively) from the L4 ICoMM December 2007 sample relative to the L4-DeepSeq sample, which were sequenced from the same extract and PCR product, as a correction factor in this study. This step was not performed in a previous analysis (6).
To assess how sequencing depth at the L4 site impacted the phylogenetic overlap between the L4 site and the entire ICoMM dataset, we rarefied the L4-DeepSeq sample to depths between 5,000 and 10 million reads. Bootstrapped phylogenetic gains were calculated for each rarefaction depth (n = 10), relative to the full ICoMM dataset.
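Replicated rarefaction can be sketched as follows. This is a minimal illustration, assuming a sample is stored as OTU read counts and that any summary statistic can be passed in as `metric`; the names are hypothetical, and expanding every read into a list is only practical for toy data, not for a sample of ∼10 million reads.

```python
import random
import statistics


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from an OTU count dict."""
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    subsample = {}
    for otu in rng.sample(reads, depth):
        subsample[otu] = subsample.get(otu, 0) + 1
    return subsample


def replicate_metric(counts, depth, metric, n_reps=10, seed=0):
    """Mean and SD of `metric` over `n_reps` independent rarefactions."""
    rng = random.Random(seed)
    values = [metric(rarefy(counts, depth, rng)) for _ in range(n_reps)]
    return statistics.mean(values), statistics.stdev(values)


# Toy counts for illustration only; the metric here is observed OTU richness.
toy_counts = {"otu_1": 50, "otu_2": 30, "otu_3": 15, "otu_4": 5}
for toy_depth in (5, 20, 80):
    mean, sd = replicate_metric(toy_counts, toy_depth, metric=len)
    print(toy_depth, round(mean, 2), round(sd, 2))
```

Passing a function that computes phylogenetic gain against a reference in place of `len` would give the kind of mean and SD per depth that is plotted in Fig. 1.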
Biome clustering was visualized using a network diagram, constructed with QIIME 1.5.0-dev and Cytoscape (33). To cluster the OTUs and biomes in the network diagram, we used the stochastic spring-embedded algorithm (BioLayout), as described previously (34), in which nodes act like physical objects that repel each other, and connections act like a spring with a spring constant and a resting length. The nodes are organized in a way that minimizes forces in the network and therefore brings together the samples that share the most OTUs at highest abundance. The phylogenetic trees annotating Fig. 3 were produced as follows. The tree is the Greengenes tree (version: February 2011) (26), collapsed to show major groups and pruned to eliminate OTUs that were not found in any of the environments that were considered in the study. For coloring the tree, OTUs found in L4-DeepSeq were eliminated to produce a reduced OTU table showing only what is unique relative to L4-DeepSeq. The guide tree is colored by sample metadata per environment, using the categories as shown in the diagram. Colors are propagated from the tips of the tree (OTUs where sequences were found in specific environments) back through the rest of the tree. All environments other than the one selected were set to “no count” in TopiaryExplorer (35), so that the OTUs found in each environment but not in L4-DeepSeq are colored independently (for example, if a given OTU is found both in deep sediment and in surface water but not in L4-DeepSeq, it will be colored in both those trees). Consequently, colored wedges highlight the taxa of the OTUs found in each environment that are unique with respect to L4; however, if a wedge is colored, this coloring need not imply that the majority of possible taxa in that wedge were found in that environment.
Core communities were defined in two ways: (i) the community shared between individual or pooled ICoMM biomes and the L4-DeepSeq sample; and (ii) the community overlap between a given percentage of the ICoMM samples. The first definition highlights the consensus community that results from comparing a shallow-sequence sample (∼16,000 reads per biome) with a deep-sequenced sample (∼10 million reads). The second definition highlights the shared community between two or more shallow-sequenced samples. Gain calculations (see above) were used to compute the first core type, and the script compute_core_microbiome.py (QIIME 1.5.0-dev) was used to compute the second core type.
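The second core definition can be illustrated with a short sketch, assuming each sample is reduced to the set of OTUs observed in it. The function name and toy data are hypothetical; this is not the implementation of compute_core_microbiome.py, only a simple instance of the same idea.

```python
def core_otus(sample_to_otus, min_fraction):
    """Return OTUs present in at least `min_fraction` of the samples."""
    if not 0 < min_fraction <= 1:
        raise ValueError("min_fraction must be in the interval (0, 1]")
    n_samples = len(sample_to_otus)
    occurrences = {}
    for otus in sample_to_otus.values():
        for otu in set(otus):
            occurrences[otu] = occurrences.get(otu, 0) + 1
    return {
        otu
        for otu, count in occurrences.items()
        if count / n_samples >= min_fraction
    }


# Toy samples for illustration only.
toy_samples = {
    "s1": {"otu_a", "otu_b", "otu_c"},
    "s2": {"otu_a", "otu_b"},
    "s3": {"otu_a", "otu_d"},
    "s4": {"otu_a", "otu_b", "otu_d"},
}

print(sorted(core_otus(toy_samples, 1.0)))
print(sorted(core_otus(toy_samples, 0.75)))
```

Lowering the required fraction enlarges the core, which is why a core is usually reported across a range of thresholds.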

Data Availability

Data deposition: The L4-DeepSeq data have been deposited in the European Bioinformatics Institute-Sequence Read Archive database (accession number ERP001778).


We thank Maureen L. Coleman and the two anonymous reviewers for their constructive comments on this manuscript and Amazon Web Services (AWS) for the AWS in Education Researcher's Grant to the QIIME development group. This work was supported in part by the US Department of Energy under Contract DE-AC02-06CH11357. Funding for S.M.G. was provided by National Institutes of Health Training Grant 5T-32EB-009412.

Supporting Information

Supporting Information (PDF)


References

1. LGM Baas Becking, Geobiologie of Inleiding Tot de Milieukunde [Geobiology or Introduction to the Science of the Environment] (W. P. Van Stockum and Zoon, The Hague, 1934). Dutch.
2. JA Gilbert, et al., The seasonal structure of microbial communities in the Western English Channel. Environ Microbiol 11, 3132–3139 (2009).
3. N Fierer, RB Jackson, The diversity and biogeography of soil bacterial communities. Proc Natl Acad Sci USA 103, 626–631 (2006).
4. JA Gilbert, et al., Defining seasonal marine microbial community dynamics. ISME J 6, 298–308 (2012).
5. Q Ruan, et al., Local similarity analysis reveals unique associations among marine bacterioplankton species and environmental factors. Bioinformatics 22, 2532–2538 (2006).
6. JG Caporaso, K Paszkiewicz, D Field, R Knight, JA Gilbert, The Western English Channel contains a persistent microbial seed bank. ISME J 6, 1089–1093 (2012).
7. BJ Finlay, Global dispersal of free-living microbial eukaryote species. Science 296, 1061–1063 (2002).
8. JT Lennon, SE Jones, Microbial seed banks: The ecological and evolutionary implications of dormancy. Nat Rev Microbiol 9, 119–130 (2011).
9. ML Sogin, et al., Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA 103, 12115–12120 (2006).
10. JBH Martiny, et al., Microbial biogeography: Putting microorganisms on the map. Nat Rev Microbiol 4, 102–112 (2006).
11. RJ Whitaker, DW Grogan, JW Taylor, Geographic barriers isolate endemic populations of hyperthermophilic archaea. Science 301, 976–978 (2003).
12. J Bahl, et al., Ancient origins determine global biogeography of hot and cold desert cyanobacteria. Nat Commun 2, 163 (2011).
13. MJ Follows, S Dutkiewicz, S Grant, SW Chisholm, Emergent biogeography of microbial communities in a model ocean. Science 315, 1843–1846 (2007).
14. N Davies, D Field; Genomic Observatories Network, Sequencing data: A genomic network to monitor Earth. Nature 481, 145 (2012).
15. L Zinger, et al., Global patterns of bacterial beta-diversity in seafloor and seawater ecosystems. PLoS ONE 6, e24570 (2011).
16. DP Faith, Conservation evaluation and phylogenetic diversity. Biol Conserv 61, 1–10 (1992).
17. CA Lozupone, R Knight, Global patterns in bacterial diversity. Proc Natl Acad Sci USA 104, 11436–11440 (2007).
18. C Hubert, et al., A constant flux of diverse thermophilic bacteria into the cold Arctic seabed. Science 325, 1541–1544 (2009).
19. S Mouné, P Caumette, R Matheron, JC Willison, Molecular sequence analysis of prokaryotic diversity in the anoxic sediments underlying cyanobacterial mats of two hypersaline ponds in Mediterranean salterns. FEMS Microbiol Ecol 44, 117–130 (2003).
20. K Negandhi, et al., Florida reef sponges harbor coral disease-associated microbes. Symbiosis 51, 117–129 (2010).
21. O Barneah, E Ben-Dov, E Kramarsky-Winter, A Kushmaro, Characterization of black band disease in Red Sea stony corals. Environ Microbiol 9, 1995–2006 (2007).
22. M Eppinger, C Baar, G Raddatz, DH Huson, SC Schuster, Comparative analysis of four Campylobacterales. Nat Rev Microbiol 2, 872–885 (2004).
23. JA Huber, et al., Microbial population structures in the deep marine biosphere. Science 318, 97–100 (2007).
24. JG Caporaso, et al., QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7, 335–336 (2010).
25. J Reeder, R Knight, Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat Methods 7, 668–669 (2010).
26. D McDonald, et al., An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6, 610–618 (2012).
27. JG Caporaso, et al., PyNAST: A flexible tool for aligning sequences to a template alignment. Bioinformatics 26, 266–267 (2010).
28. TZ DeSantis, et al., Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72, 5069–5072 (2006).
29. DJ Lane, 16S/23S rRNA sequencing. Nucleic Acid Techniques in Bacterial Systematics, eds E Stackebrandt, M Goodfellow (John Wiley and Sons, West Sussex, England, 1991).
30. MN Price, PS Dehal, AP Arkin, FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
31. Q Wang, GM Garrity, JM Tiedje, JR Cole, Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73, 5261–5267 (2007).
32. C Lozupone, M Hamady, R Knight, UniFrac – An online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics 7, 371 (2006).
33. CT Lopes, et al., Cytoscape Web: An interactive web-based network browser. Bioinformatics 26, 2347–2348 (2010).
34. RE Ley, et al., Evolution of mammals and their gut microbes. Science 320, 1647–1651 (2008).
35. M Pirrung, et al., TopiaryExplorer: Visualizing large phylogenetic trees with environmental metadata. Bioinformatics 27, 3067–3069 (2011).

Information & Authors


Published in

Proceedings of the National Academy of Sciences
Vol. 110 | No. 12
March 19, 2013
PubMed: 23487761


Submission history

Published online: March 4, 2013
Published in issue: March 19, 2013


Keywords: deep sequencing; microbial ecology; rare biosphere



This article is a PNAS Direct Submission.



Sean M. Gibbons
Graduate Program in Biophysical Sciences and
Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL 60439;
J. Gregory Caporaso
Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL 60439;
Department of Computer Science, Northern Arizona University, Flagstaff, AZ 86011;
Meg Pirrung
Department of Pharmacology, University of Colorado Denver, Aurora, CO 80303;
Dawn Field
National Environmental Research Council Centre for Ecology and Hydrology, Wallingford OX1 3SR, United Kingdom; and
Rob Knight
Department of Chemistry and Biochemistry and
Howard Hughes Medical Institute, University of Colorado, Boulder, CO 80303
Jack A. Gilbert1 [email protected]
Graduate Program in Biophysical Sciences and
Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637;
Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL 60439;


To whom correspondence should be addressed. E-mail: [email protected].
Author contributions: S.M.G., J.G.C., D.F., and J.A.G. designed research; S.M.G. performed research; J.G.C., M.P., R.K., and J.A.G. contributed new reagents/analytic tools; M.P. designed and constructed phylogenetic trees; S.M.G., J.G.C., R.K., and J.A.G. analyzed data; and S.M.G. wrote the paper.

Competing Interests

The authors declare no conflict of interest.
