Microbial community gene expression in ocean surface waters

Frias-Lopez et al. 10.1073/pnas.0708897105

Supporting Information

Files in this Data Supplement:

SI Figure 4
SI Figure 5
SI Figure 6
SI Figure 7
SI Figure 8
SI Table 2
SI Figure 9
SI Figure 10
SI Figure 11
SI Figure 13
SI Methods
SI Table 3
SI Table 4
SI Table 5
SI Table 6
SI Figure 12




SI Figure 4

Fig. 4. Comparison of linearly amplified and unamplified mRNA from cultures of Prochlorococcus (MED4) cells using custom Affymetrix arrays. Expression values for protein-coding genes of Prochlorococcus MED4 for unamplified RNA vs. the amplified RNA obtained from a 100-ng aliquot from the former. Results from two independent experiments are shown.





SI Figure 5

Fig. 5. Comparison of linearly amplified mRNA from duplicate cultures of Prochlorococcus (MED4) cells using custom Affymetrix arrays. Expression values for protein-coding genes of Prochlorococcus MED4 of replicate amplified samples plotted against each other showing the reproducibility of the amplification. Results are from two independent experiments.





SI Figure 6

Fig. 6. Comparison of the results of an experiment designed to reveal up-regulated genes in Prochlorococcus (MIT9313) under phosphate starvation, using unamplified (A) and amplified (B) RNA using custom Affymetrix arrays. Treatment 1: Control culture in phosphate-replete media. Treatment 2: phosphate-starved cultures. The same four genes appear as differentially expressed in both amplified and unamplified treatments: a phoB two component response regulator, a Som like protein (phosphate-limitation inducible outer membrane porins), and two ABC transporter substrate (phosphate) binding protein.





SI Figure 7

Fig. 7. Analysis of accuracy of RNA amplification as a function of position along the Prochlorococcus MED4 chromosome using custom Affymetrix arrays. The ratio of the expression values yielded from amplified and unamplified RNA for protein-coding genes (blue) and ribosomal RNAs and tRNAs (red dots). The circled red dots are rRNAs.





SI Figure 8

Fig. 8. Stringency of the BLASTX bit score cutoff, in terms of alignment length and amino acid identity. Each circle represents an alignment between a cDNA pyrosequencing read and an NCBI-nr database sequence. Alignments with a bit score >40 were considered significant in our analyses.





SI Figure 9

Fig. 9. Comparison of transcriptional levels of selected genes using pyrosequencing and RT-qPCR/qPCR. The unamplified environmental RNA and DNA samples were used for quantitative PCR. The cDNA to DNA ratio in qPCR analysis (x axis) was calculated based on the modified method (see SI Methods). The cDNA to DNA ratio in pyrosequence analysis (y axis) was normalized to the size of the respective libraries. More specifically, the ratio was calculated as the fraction of reads assigned to the targeted gene in the cDNA library divided by that in the DNA library. Three sets of genes were selected based on their enrichment in the cDNA pyrosequence library. Green solid circle: genes with normalized cDNA/DNA ratio >1. Blue solid circle: genes with normalized cDNA/DNA ratio <1. Red open circle: gene only detected in the cDNA library but not in the DNA library, and thus the cDNA/DNA ratio could not be calculated for pyrosequencing data (dotted part of y axis). The full names of the 10 selected genes are listed in SI Table 2.





SI Figure 10

Fig. 10. Rarefaction analyses for cDNA and DNA libraries. The rarefaction analysis was based on the frequency of significant BLASTX matches in the GOS peptide database, with increasing number of Roche GS20 DNA pyrosequencing reads. Red dots represent the average values, and the black dots represents the 95% confidence interval values.





SI Figure 11

Fig. 11. Empirical cumulative probability density function of the number of DNA reads assigned to a GOS protein cluster. The GOS protein clusters were arbitrarily binned to low, medium, high, and extremely high categories. Boundary values for each category, e.g., the number of DNA reads assigned to the cluster and its probability, also are shown.





SI Figure 12

Fig. 12. Effect of gene length on the number of hits in the DNA library, assessed by using Prochlorococcus MIT9301. (A) Linear relationship between the number of hits in the DNA database and gene length in the genome of MIT9301. (B) Relationship between the normalized cDNA against DNA hits and the normalized cDNA already normalized against gene length. In blue, core genes, i.e., genes present in all genomes of Prochlorococcus sequenced to date. In pink, flexible genes, i.e., genes not present in all genomes of Prochlorococcus sequenced.





SI Figure 13

Fig. 13. Distribution of the frequency of polymeric nucleotide sequence (A, T, G, C, and N) lengths found in the 75-m cDNA (A) and 75-m DNA (B) pyrosequencing libraries. The peak in polymeric sequence length at 15-16 bp in the cDNA reads is a result of the polyadenylation in library preparation. The dashed line at 10 bp indicates the cutoff used in the trimming of the cDNA data.





SI Methods

Sample Collection for DNA Extraction. Bacterioplankton samples for DNA extraction were collected as previously described with minor modifications (1). Briefly, the seawater was prefiltered in line through 125-mm Whatman GF/A filter (Whatman, Maidstone, U.K.) before the final collection of bacterioplankton cells onto 0.22-mm Steripak-GP20 filter (Millipore, Bedford, MA) using a Masterflex peristaltic pump (Cole Parmer Instrument Company, Vernon Hills, IL). After a total of 260 liters of seawater was filtered, the Steripak filter was covered with lysis buffer (50 mM Tris•HCl, 40 mM EDTA, and 0.75 M sucrose) and frozen in -80°C aboard before shipped frozen to the laboratory where they were stored at -80°C until DNA extraction.

DNA Extraction. DNA was extracted using slightly modified lysis and purification methods (2). Briefly, a solution of 5 mg/ml of lysozyme in 3 ml of lysis buffer was added to the Steripak-GP20 filter cartridge (Fisher, Fairlawn, NJ) after thawing, and incubated at 37°C for 30 min. Proteinase K (Sigma, St Louis, MO) in sterile water was added (at a final concentration of 0.5 mg×ml-1) into the Steripak-GP20 filter cartridge, followed by addition of SDS (Sigma, St Louis, MO) to a final concentration of 1%. The filter cartridges were sealed and incubated at 55°C for 20 min, followed by further incubation at 70°C for 5 min to further promote cell lysis. The lysate was remove from the filter cartridge, and nucleic acids were extracted twice with phenol:chloroform:IAA (25:24:1; Sigma, St Louis, MO) and once with chloroform:isoamyl alcohol (24:1; Sigma). The purified aqueous phase was concentrated by spin dialysis using a Centricon 100 filter. An aliquot (~2 mg) of the extracted DNA was used for GS20 pyrosequencing.

Sample Collection for RNA Extraction. Bacterioplankton cells for total RNA extraction were collected filtering seawater from the same water sample that was used in DNA sample collection. We modified the collection process to shorten sampling time and improve sample preservation, which is critical in transcriptomics studies. The Niskin bottle transportation time in the water column depends entirely on the depth the CTD reaches; however, immediately upon shipboard retrieval of the CTD, a smaller volume of seawater (~ 1 liter) was filtered as rapidly as possible. The time from the start of filtration to storage in RNA later was 12 min. Briefly, the seawater was prefiltered through 1.6-mm GF/A filters (Whatman, Maidstone, U.K.) and then filtered through 25-mm and 0.22-mm Durapore filters (Millipore, Bedford, MA) using a four-head peristaltic pump system. The prefiltering step was used to remove most eukaryotic cells, although picoeukaryote cells (eukaryotes <2.0 mm in diameter) were present in the sample. The four Durapore filters (identica replicates) were immediately transferred to a screw-cap tube containing 1 ml of RNAlater (Ambion Inc., Austin, TX) after filtration, and frozen and kept at -80°C aboard the R/V Kilo Moana. Samples were transported frozen to the laboratory in a dry shipper and stored at -80°C until RNA extraction procedures.

RNA Extraction. Total RNA was extracted using a mirVana RNA isolation kit (Ambion, Austin, TX), with several modifications to recover RNA possibly released to the 1 ml of RNAlater due to the sample freeze and thaw. Samples were thawed on ice, and the 1 ml of RNAlater was gently pipetted out and loaded onto two Microcon YM-50 columns (Millipore, Bedford, MA) for desalting and concentrating by centrifugal filtration. The resulting 50 ml of RNAlater was added back to the sample tubes, and total RNA extraction was proceeded following the mirVana manual. Genomic DNA was removed using a Turbo DNA-free kit (Ambion, Austin, TX). Finally, extracted RNA (DNase-treated) from four replicate filters were combined, purified, and concentrated by using the MinElute PCR Purification Kit (Qiagen, Valencia, CA).

Microarray Analysis of Prochlorococcus Gene Expression. For the experiments with Prochlorococcus MED4, cells were grown in the Pro-99 seawater-based medium (3) at 21°C under continuous white light at 16 mol photon×m-1×s-1. Cells were harvested by centrifugation (10,000 ´ g) in log phase growth. Growth conditions and cell collection under phosphorus starvation of Prochlorococcus MIT 9313 were as described by Martiny et al. (4). Samples of Prochlorococcus MIT 9313 were taken after 12 h under phosphorus starvation.

Before microarray analysis and RNA amplification, DNA was removed using the Turbo DNA-free kit (Ambion, Austin, TX). Synthesis, labeling, and hybridization of cDNA onto customized MD4-9313 Affymetrix (Santa Clara, CA) microarrays were performed following the standard Affymetrix protocol, and scanning was carried out according to Affymetrix protocols for Escherichia coli (www.affymetrix.com/support/technical/manual/expression_manual.affx). Data visualization was carried out by using GeneSpring software (version 7.3.1; Silicon Genetics, Palo Alto, CA). An initial normalization was applied using the Robust Multichip Average algorithm (5) implemented in GeneSpring. Those values were later normalized using the lowess correction performed by using the software R (www.R-project.org) (6).

RT-qPCR Analysis. Possible traces of DNA were removed using Ambion's Turbo DNA-free kit (Ambion, Austin, TX) following the manufacturer's instructions with minor modifications. The volume of Turbo DNase I was increased to 3 ml of Turbo DNase I (Ambion's Turbo DNA-free, Ambion) and the reaction mixture was incubated at 37°C for 60 min. RNA (1 ng) was reverse-transcribed with random hexamer primers and Superscript II reverse transcriptase (Invitrogen, Carlsbad, CA) following the manufacturer's instructions. Reverse transcription was performed at 42°C for 2 h, after an initial incubation step of 10 min at 25°C. The synthesized cDNA and purified environmental DNA (1 ng) were used in SYBR green quantitative PCR (qPCR) using the specific primers for the genes of interest (SI Table 2). To compare the relative expression of genes, we modified the method (7) and used the formula to take into consideration the different amplification efficiencies in separate qPCR runs.

Sequence Analyses of cDNA and DNA Reads. The defined bit score cutoff for assigning reads to GOS peptides and NCBI-nr protein was based on in silico tests using BLASTX comparisons against nonmarine microbial genomes (SI Fig. 8) where a bit score of >40 was shown to result in low false positive frequencies (<2%). Furthermore, a breakdown of amino acid identity and length values for bit scores >40 observed in DNA library (SI Fig. 8) highlights the stringency of this cutoff.

Assignment of reads to GOS protein clusters enabled the calculation of cluster-based expression ratio, a normalized comparison of the number of reads found for each protein cluster in the cDNA library relative to that found in the DNA library. To normalize this ratio for the difference in DNA and cDNA library size, the number of reads assigned to any given protein cluster was divided by the total number of reads in the respective library. The resulting cluster fraction for the cDNA library then was expressed as a function of the representation in DNA library. The cluster-based expression ratios were ranked from highest to lowest (Fig. 1) to look at clusters being expressed at elevated levels.

The relative abundance of detected clusters was taken into consideration by dividing cluster-based expression ratios into categories based on their abundance in the DNA library. Using an empirical cumulative density function (SI Fig. 11), clusters were categorized as low (<9 read members), medium (9-161 read members), high (161-461 read members), or extremely high abundance (>461 read members). This abundance measure also reflects the conservation of protein clusters, because more conserved proteins clusters are likely to have more members (e.g., RNA polymerase). Rarefaction analysis for each sample was based on best matches against the GOS database. The frequency of observed best matches to GOS protein clusters for each library was used to calculate rarefaction curves with the program Analytic Rarefaction 1.3.

Putative Prochlorococcus reads were identified as reads with top BLASTX hit (against NCBI-nr) to Prochlorococcus and with a bit score >40. Each of these putative Prochlorococcus reads then was searched against a database of 11 whole-genome sequences using BLASTN and assigned to the best hit gene. For comparison with a single-reference genome, MIT9301, the assigned genes from 11 strains all were translated to their MIT9301 ortholog (8), where one exists. The number of raw cDNA reads per gene was used to indicate the most transcribed genes in the entire Prochlorococcus population. To normalize cDNA reads per gene copy, the number of DNA reads per gene first was divided by the gene length (´1,000 to give reads per kb) to account for a clear direct relationship between gene length and its representation in the DNA reads (SI Fig. 12). A clear, direct relationship with gene length does not exist for cDNA reads. The number of cDNA reads per gene then was divided by this normalized DNA (DNA reads per kb) to give an indication of per-copy cDNA abundance. This additional normalization to gene length, which is not possible for the whole community without good reference genomes, is generally consistent with the expression ratio (cDNA/DNA)-analogous to the cluster-based expression ratio used for whole-community analyses-except, for example, in cases of very short genes (SI Fig. 12).

Removal of Low-Quality and Ribosomal RNA (rRNA) GS20 cDNA Sequences. Polymeric sequences inadvertently introduced into the cDNA library during cDNA synthesis (via polyadenylation of mRNA/aRNA and subsequent amplification step) were trimmed from reads based on the observed frequency of polymeric sequences in the DNA library (SI Fig. 13). A noticeable peak in poly(A/T) sequences in the cDNA library around 16 bp (SI Fig. 13) is attributable to polyadenylation of the mRNA and subsequent amplification with a T7-BmpI-(dT)16VN primer. To remove residual T7 promoter and priming sites not cleaved by BmpI, reads were initially screened by using cross-match (-minmatch 10, -minscore 10; found in 32,246 reads). Reads containing a poly(A/T) sequence >10 bp (cutoff based on SI Fig. 13) or multiple poly(A/T) runs in a single read (4 ´ 6 bp) were trimmed unless a significant BLASTN match across the polymeric sequence in the cDNA read was identified in a read from the DNA library (39,444 reads remained untrimmed). By using these criteria, bases flanking the ends of each cDNA read were trimmed, and reads with polymeric sequences located in the middle of reads were deemed putative chimeras and removed from the dataset (5,232 chimeric reads).

rRNAs were removed from the cDNA library by using a combined 5S, 16S, 18S, 23S, and 28S rRNA database derived from available microbial genomes and sequences from the ARB SILVA LSU and SSU databases (www.arb-silva.de). BLASTN matches with bit score >40 were considered significant and deemed rRNA sequences (65,859 reads; 51.3% of reads). This bit score cutoff resulted in <1.7% false positives against a database of all non-rRNA microbial genes from available microbial genomes. After trimming and removal of rRNAs, 54,568 reads (average length 95 bp) totaling 5,194,332 bp remained in the cDNA sample. Raw metagenomic GS20 DNA and cDNA reads have been deposited in GenBank.

MEGAN and Statistical Analysis. We performed sequence comparisons of DNA and cDNA pyrosequencing results against the NCBI-nr database. Only the best hit of the top BLASTX hits with a bit score >40 was used for MEGAN analysis (version 2beta3, August 2007). MEGAN is a new software program (9) used to explore the taxonomical content of the dataset, employing the NCBI taxonomy to summarize and order the results. Moreover, MEGAN gives the number of hits obtained for the different taxonomic groups, which allows for statistical comparison of the distribution of those groups on the phylogenetic trees. Statistical differences between taxonomic groups on the DNA and cDNA trees obtained in MEGAN was assessed using the software R (www.R-project.org) (6). c2 test was used to estimate differences at the level of kingdom. In this case, we used the Pearson's c2 test with simulated P value (based on 10,000 replicates) and the log likelihood ratio (G test) test with Williams' correction (g.test.r code in R, from Peter L. Hurd, www.psych.ualberta.ca/~phurd/cruft/).

1. Coleman ML, et al. (2006) Genomic islands and the ecology and evolution of Prochlorococcus. Science 311:1768-1770.

2. Suzuki MT, Preston CM, Beja O, de la Torre JR, Steward GF, DeLong EF (2004) Phylogenetic screening of ribosomal RNA gene-containing clones in Bacterial Artificial Chromosome (BAC) libraries from different depths in Monterey Bay. Microb Ecol 48:473-488.

3. Rippka R, et al. (2000) Prochlorococcus marinus Chisholm et al. 1992 subsp. pastoris subsp. nov. strain PCC 9511, the first axenic chlorophyll a2/b2-containing cyanobacterium (Oxyphotobacteria). Int J Syst Evol Microbiol 50(Pt 5):1833-1847.

4. Martiny AC, Coleman ML, Chisholm SW (2006) Phosphate acquisition genes in Prochlorococcus ecotypes: Evidence for genome-wide adaptation. Proc Natl Acad Sci USA 103:12552-12557.

5. Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185-193.

6. RDC Team (2007) R: A Language and Environment for Statistical Computing. Available at www.R-project.org.

7. Livak KJ, Schmittgen TD (2001) Analysis of relative gene expression data using real-time quantitative PCR and the 2(T)(-DDC) method. Methods 25:402-408.

8. Kettler GC, et al. (2007) Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet 3:e231.

9. Huson DH, Auch AF, Qi J, Schuster SC (2007) MEGAN analysis of metagenomic data. Genome Res 17:377-386.

This Article

  1. PNAS March 11, 2008 vol. 105 no. 10 3805-3810
  1. OA Abstract
  2. Figures Only
  3. OA Full Text
  4. Full Text (PDF)
  5. » Supporting Information