Conserved rates and patterns of transcription errors across bacterial growth states and lifestyles

Significance

Organisms rely on accurate transcription for proper cellular function. Whereas errors incurred during replication are transmitted to subsequent generations, those that occur during transcription are transient and affect only a subset of the encoded proteins. Although transcription errors may increase survival in stressful conditions, the majority of these errors are harmful, and their rates must be minimized. By assessing the transcription errors genome-wide in Escherichia coli and in two bacterial endosymbionts, we discovered that all species had remarkably similar transcription error rates. This conservation is unexpected given that both endosymbiotic species lack orthologs of several E. coli RNA fidelity factors and that lifestyle differences among these species have led to vast differences in their mutation and substitution rates.

Errors that occur during transcription have received much less attention than the mutations that occur in DNA because transcription errors are not heritable and usually result in a very limited number of altered proteins. However, transcription error rates are typically several orders of magnitude higher than the mutation rate. Also, individual transcripts can be translated multiple times, so a single error can have substantial effects on the pool of proteins. Transcription errors can also contribute to cellular noise, thereby influencing cell survival under stressful conditions, such as starvation or antibiotic stress. Implementing a method that captures transcription errors genome-wide, we measured the rates and spectra of transcription errors in Escherichia coli and in endosymbionts for which mutation and/or substitution rates are greatly elevated over those of E. coli. Under all tested conditions, across all species, and even for different categories of RNA sequences (mRNA and rRNAs), there were no significant differences in rates of transcription errors, which ranged from 2.3 × 10−5 per nucleotide in mRNA of the endosymbiont Buchnera aphidicola to 5.2 × 10−5 per nucleotide in rRNA of the endosymbiont Carsonella ruddii. The similarity of transcription error rates in these bacterial endosymbionts to that in E. coli (4.63 × 10−5 per nucleotide) is all the more surprising given that genomic erosion has resulted in the loss of transcription fidelity factors in both Buchnera and Carsonella.

transcription errors | RNA polymerase fidelity | base substitutions

Among the multiple types of information processing errors, the majority of research has focused on mutations that occur during DNA replication because such errors are heritable and form the basis of evolutionary change. However, errors that occur during transcription and translation can also have substantial effects on gene function by producing misfolded and malfunctioning proteins. The rate of translation errors is typically an order of magnitude higher than the rate of transcription errors (1-6). However, errors occurring during transcription often elicit more dire consequences than those occurring during translation because individual mRNAs can be translated up to 40 times (7,8), resulting in a burst of flawed proteins. Therefore, a single transcription error can result in many flawed proteins, whereas a translation error will disrupt only a single protein.
Because deleterious transcription errors are not transmitted to subsequent generations, they can occur more frequently than mutations to DNA but still infrequently enough to ensure the cell is not overburdened with faulty proteins. Estimates of the rate of transcription errors in Escherichia coli have been determined in vitro by measuring the misincorporation of radiolabeled nucleotides into repeating dinucleotide tracts (1,9) and in vivo by quantifying the reversion frequencies of nonsense mutations in lacZ (2,3). These assays yielded variable estimates of transcription error rates of 10−4 to 10−5 per nucleotide, several orders of magnitude higher than the mutation rate (10-12). Studies that assay individual loci are often not representative of the genome as a whole because sequence- or genome-specific features, such as base composition (12,13) or sequence motifs (14), affect the incidence of information processing errors. Moreover, transcription error rate reversion assays based on the recovery of functional proteins might also include translation errors, if these occur at a sufficiently high rate.
RNAseq offers an approach to both disentangle transcription errors from translation errors and provide an error rate for every transcribed gene in a genome. Unfortunately, the high error rates both of cDNA synthesis (3-6 × 10−5 per nucleotide) (15-17) and of high-throughput sequencing technologies (possibly as high as 10−2 to 10−3 per nucleotide) (18,19) render the transcription errors obtained by conventional RNAseq indistinguishable from sequencing artifacts. Two recently developed methods offer ways to circumvent these problems by allowing transcription errors to be distinguished from sequencing and cDNA synthesis errors. Through the use of altered library preparation protocols, these methods reduce the overall error rate of RNAseq to less than 10−8 (20) and 10−12 (21,22) per nucleotide, making it possible to measure error rates across the entire transcriptomes of viruses and other organisms.
In this study, we implement both of these RNAseq-based methods in E. coli to examine whether transcription error rates vary according to growth state and physiological condition, as has been reported for translation error rates (23-26) and for the combined transcription and translation error rate (27). Moreover, we ask whether transcription error rates are increased in the endosymbiotic bacteria Buchnera aphidicola and Carsonella ruddii, species that have lost known transcription fidelity factors and whose mutation rates, substitution rates, and rates of protein sequence evolution are all amplified as a result of genetic drift and the loss of repair enzymes (Fig. S1). We show that transcription error rates are remarkably similar across organisms, even for different broad categories of RNA, including rRNA, for which the cell is known to selectively degrade malfunctioning molecules (28).

Resource Limitation and Growth Phase Do Not Alter Transcription Error Rates. We tested the effects of different growth conditions, all of which have been associated with altered mutation rates and/or translation error rates, on rates of transcription errors. Using a deep-sequencing approach to identify errors, we measured the transcription error rate in E. coli when grown under four growth conditions [tryptic soy broth (TSB) complex media or M9 glucose minimal media, each sampled at midlog and at stationary phase]. Note that these errors include both base substitutions during the process of transcription and any damage to the mRNA after transcription. Each of the four conditions was assayed in duplicate, and in total, we detected 2,621 transcription errors, with the number of errors per sample ranging from 156 to 681. In neither of the nutrient sources were there significant differences in transcription error rates between cells harvested at midlog phase and those harvested 8 h after entering stationary phase (Fig. 1; paired Wilcoxon test, P > 0.30). Similar to what we observed for E. coli assayed at different growth phases, transcription error rates do not differ significantly between nutrient-rich (TSB) and nutrient-poor (M9) growth media (Fig. 1; paired Wilcoxon test, P = 0.3429). Furthermore, there are no significant differences in overall transcription error rates between any pair of individual conditions tested [Fig. 1; two-tailed t tests, t(2) < 2.3, P > 0.14], and the average transcription error rate over all conditions is 4.63 ± 0.34 (SEM) × 10−5 for E. coli mRNA.
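The per-sample rate and the mean ± SEM quoted above can be illustrated with a minimal sketch. It assumes that each sample's rate is simply the number of detected errors divided by the number of bases surveyed; the counts below are hypothetical placeholders, not the study's data, and the function names are illustrative only.

```python
import math
import statistics

def error_rate(n_errors, n_bases):
    """Errors per nucleotide: detected errors divided by bases surveyed."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return n_errors / n_bases

def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, sem

# Hypothetical (errors, bases surveyed) pairs; NOT the counts from this study.
samples = [(200, 5_000_000), (300, 6_000_000), (450, 9_000_000), (160, 4_000_000)]
rates = [error_rate(e, b) for e, b in samples]
mean, sem = mean_and_sem(rates)
print(f"{mean:.2e} +/- {sem:.2e} errors per nucleotide")
```

Normalizing by the bases surveyed is what makes samples with very different error counts comparable.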
Distribution of Transcription Errors. The use of a high-throughput sequencing method to detect transcription errors (as opposed to a reporter-gene method) enables analysis of transcription errors genome-wide as well as the localization of errors to individual sites in each transcript. Starting at the scale of whole genomes, we analyzed the fluctuation in transcription error rates and found that 95% of the measurements made for 50-kb nonoverlapping windows across the entire E. coli genome fall within a roughly threefold range, from 2.3 to 7.2 × 10−5 (Fig. S2). Regions containing highly expressed genes had an increased number of transcription errors (Fig. S3), because their higher coverage enables the discovery of more errors than in areas of the genome with low coverage.
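The windowed analysis can be sketched as follows. This is a simplified illustration under stated assumptions: error positions and per-position coverage are taken to be already available as plain Python objects, and the function name and data layout are hypothetical rather than those of the pipeline used in the study.

```python
from collections import defaultdict

def windowed_error_rates(error_positions, coverage, window=50_000):
    """Error rate per nonoverlapping window.

    error_positions: iterable of 0-based genome coordinates of detected errors.
    coverage: dict mapping coordinate -> number of bases surveyed at that site.
    Returns {window_index: errors / bases surveyed}; windows without coverage
    are omitted because no rate can be estimated for them.
    """
    errors = defaultdict(int)
    bases = defaultdict(int)
    for pos in error_positions:
        errors[pos // window] += 1
    for pos, depth in coverage.items():
        bases[pos // window] += depth
    return {w: errors[w] / bases[w] for w in sorted(bases) if bases[w] > 0}

# Toy input: the second window has twice the coverage and twice the errors,
# so the normalized rates are identical.
toy_coverage = {10: 1000, 20_000: 1000, 60_000: 4000}
toy_errors = [10, 60_000, 60_000]
print(windowed_error_rates(toy_errors, toy_coverage))
```

The toy input mirrors the coverage effect described above: a highly covered region yields more errors in absolute terms without implying a higher rate per nucleotide.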
Transcription proceeds in the direction of DNA replication on the leading strand and in the opposite direction on the lagging strand, in which case there can be collisions between the replication and transcription machineries. Despite an increased likelihood of collision-induced errors on the lagging strand, there is no significant difference in the transcription error rates between genes encoded on the two strands (Wilcoxon test, P > 0.90; Fig. S4). Next, we tested whether adjacent nucleotides affected the occurrence of transcription errors and found that neither a particular preceding nor succeeding nucleotide induced transcription errors. Only when both the preceding and succeeding nucleotides are guanine residues do we observe a significant increase in transcription error frequency (Fisher's exact test, P < 0.02). Taken together, transcription errors occur without regard for genome location, direction of transcription, or for the vast majority of neighboring nucleotides.
Biases in E. coli Transcription Errors. Measuring transcription errors using a sequencing-based approach provides information about the absolute frequencies of each of the possible base substitutions. C→U errors were most common, occurring at a significantly higher frequency than all other transcription errors (Fig. 2A), presumably attributable to high rates of cytosine deamination after the RNA is transcribed. It has previously been reported that transcription errors incur a higher rate of transitions than transversions (20,29), the same overall pattern that we observe in E. coli (Wilcoxon test, P < 0.05). This trend, however, is driven solely by the high incidence of C→U changes and no longer reaches significance after removing these transitions from the analysis (Wilcoxon test, P > 0.50). Next, we tested the effect of individual nucleotides on the frequency of transcription errors in E. coli and found that G/C→N errors occur at higher frequencies than do A/U→N errors (Wilcoxon test, P < 0.02; Fig. 2B). Additionally, N→A/U errors occur at a significantly higher rate than do N→G/C errors (Fig. 2C; Wilcoxon test, P < 0.02). These effects are not due solely to the high frequency of C→U errors: even after the removal of C→U errors (Methods), G/C→N errors remain significantly more frequent than A/U→N errors (Fig. 2B), and N→A/U errors remain significantly more frequent than N→G/C errors (Fig. 2C).
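The substitution classes compared above can be tallied with a short sketch. The error list is hypothetical, the function names are illustrative, and the tallies are raw counts; comparing frequencies as in the text would additionally require normalizing by how often each base was surveyed, which this sketch leaves out.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}

def substitution_class(ref, alt):
    """Transition = purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid RNA substitution: {ref}->{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def tally(errors, exclude=()):
    """Count errors by class, by template base (G/C vs A/U), and by product base."""
    counts = Counter()
    for ref, alt in errors:
        if (ref, alt) in exclude:
            continue
        counts[substitution_class(ref, alt)] += 1
        counts["G/C->N" if ref in "GC" else "A/U->N"] += 1
        counts["N->A/U" if alt in "AU" else "N->G/C"] += 1
    return counts

# Hypothetical observed errors as (reference base, observed base) pairs.
observed = [("C", "U"), ("C", "U"), ("G", "A"), ("A", "C"), ("U", "A")]
all_counts = tally(observed)
without_cu = tally(observed, exclude={("C", "U")})  # repeat without C->U
print(all_counts)
print(without_cu)
```

Running the tally twice, once with C→U excluded, corresponds to the check described above of whether a bias is driven solely by that one class.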

Transcription Error Rates in Host-Restricted Bacteria with Reduced Genomes. The bacterial endosymbionts, B. aphidicola and C. ruddii, harbor small genomes (641 and 160 kb, respectively) and have very high substitution rates, as a consequence of both their lack of several repair mechanisms (Fig. S1) and the reduced efficacy of selection due to their small effective population sizes. These features are also expected to augment rates of transcription errors, so we assayed the transcription error rates in these endosymbionts using methods identical to those used for E. coli. For the replicate samples of B. aphidicola, we detected a total of 169 transcription errors in total mRNA, yielding a transcription error rate of 2.69 ± 0.73 (SEM) × 10−5, which is not significantly different from the rate that we obtained for E. coli mRNA [two-tailed t test, t(8) = 2.527, P > 0.05; Fig. 3A].
Transcription errors in C. ruddii mRNA could not be assigned unequivocally because the C. ruddii RNA was extracted from a natural population of individuals, rendering it difficult to distinguish between transcription errors and the polymorphisms that might be present in the population. Instead, we quantified transcription error rates for 16S and 23S ribosomal RNA in both C. ruddii and B. aphidicola because these operons are present in single copy, have high read-coverage (despite the rRNA removal step), and are not polymorphic within a species. Unlike C. ruddii and B. aphidicola, the E. coli genome possesses multiple polymorphic rRNA operons, making it unfeasible to estimate rRNA transcription error rates in E. coli. We detected a total of 1,014 errors in C. ruddii rRNAs and 4,377 errors in B. aphidicola rRNAs, yielding rRNA transcription error rates of 5.13 × 10−5 for C. ruddii and 3.37 × 10−5 for B. aphidicola (Fig. 3A). Our estimates of bacterial transcription error rates are, in descending order, 5.13 × 10−5 for C. ruddii rRNA, 4.63 × 10−5 for E. coli mRNA, 3.37 × 10−5 for Buchnera rRNA, and 2.69 × 10−5 for Buchnera mRNA. The transcription error rates for B. aphidicola mRNA and rRNA do not differ significantly from one another.
Biases in Endosymbiont Transcription Error Rates. Assessing the transcription errors occurring in both Buchnera mRNA and rRNA allowed us to determine whether there are any observable differences in the error rates for the two RNA substrates, as might be caused by base compositional biases or selection. All possible nucleotide substitutions, as attributable to transcription errors, were detected in both the mRNA and rRNA samples (although one of the B. aphidicola mRNA replicates lacked any A→C changes). There were no significant differences for any of the individual substitution classes between mRNA and rRNA or among any of the individual substitution classes (Fig. 3B).
Effects of Transcription Errors on Protein Sequences. Given that each transcript can be translated, perhaps multiple times, into protein, we determined which transcription errors result in an amino acid substitution. On average, 68 ± 1.46% (SEM) of transcription errors cause an amino acid substitution in E. coli, whereas 80% of the transcription errors in Buchnera result in amino acid substitutions. If errors were to occur at random over the E. coli transcriptome, the probability of changing an amino acid is significantly higher than that actually incurred by transcription errors (76% vs. 68%; pairwise Wilcoxon test, P < 0.008).
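The comparison between observed and random expectations can be sketched as below. The sketch assumes the standard genetic code, uses the DNA alphabet (T in place of U), counts a change to or from a stop codon as an amino acid change, and treats all single-base substitutions as equally likely; the function names are illustrative and the exact null model used in the study may differ.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def changes_amino_acid(cds, pos, alt):
    """True if replacing cds[pos] with alt alters the encoded residue (or stop)."""
    start = pos - pos % 3
    codon = cds[start:start + 3]
    offset = pos % 3
    mutant = codon[:offset] + alt + codon[offset + 1:]
    return CODON_TABLE[codon] != CODON_TABLE[mutant]

def expected_nonsynonymous_fraction(cds):
    """Fraction of all single-base substitutions that change the residue,
    assuming every substitution at every site is equally likely."""
    changed = total = 0
    for pos in range(len(cds) - len(cds) % 3):  # complete codons only
        for alt in BASES:
            if alt != cds[pos]:
                total += 1
                changed += changes_amino_acid(cds, pos, alt)
    return changed / total

toy_cds = "ATGGCT"  # Met-Ala; a toy sequence, not a real gene
print(changes_amino_acid(toy_cds, 5, "C"))       # GCT -> GCC, both Ala
print(expected_nonsynonymous_fraction(toy_cds))
```

Applying `changes_amino_acid` to each detected error gives the observed fraction, and `expected_nonsynonymous_fraction` gives a uniform-error baseline to compare it against.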

Discussion
Considering the range of variation in replication error rates and in translation error rates both within and among bacterial species, our finding that transcription error rates are similar for different species and for different classes of RNA sequences and under different physiological conditions within a species is bewildering. The mutation (i.e., DNA replication error) rates for bacteria span several orders of magnitude (10); for the specific organisms that we consider, spontaneous mutation rates vary nearly 50-fold, from 8.9 × 10−11 per site per generation in E. coli (10) to 4.0 × 10−9 for Buchnera aphidicola (29). In contrast, based on our genome-wide deep-sequencing approach, the transcription error rates of these two species differ by less than twofold (2.7 × 10−5 vs. 4.6 × 10−5), with E. coli having the slightly higher rate. Our initial prediction was that endosymbionts would have higher transcription error rates because they are subject to high levels of genetic drift and would therefore sustain more deleterious mutations; however, neither of the studied endosymbionts had elevated transcription error rates. We reasoned that differential regulation of transcription fidelity factors, such as greA (30,31), greB (31), or dksA (31,32), operating during transcription, translation, or protein degradation, could provide a mechanism for E. coli to modulate its transcription error rate under various conditions and growth phases. The conservation of transcription error rates among species is all the more surprising given that these endosymbionts lack homologs for several of these transcription fidelity factors (Fig. S1). Endosymbionts possess the most highly reduced bacterial genomes (33), and the genome sizes of Buchnera and Carsonella are only 641 and 160 kb, respectively (34,35), in contrast to the 4,640-kb genome of E. coli MG1655.
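The contrast between the two kinds of rates can be checked with a few lines of arithmetic. The values are the mutation rates and mRNA transcription error rates reported in the text; the helper function is only an illustration.

```python
import math

# Rates quoted in the text (per site per generation; per nucleotide of mRNA).
mutation_rate = {"E. coli": 8.9e-11, "B. aphidicola": 4.0e-9}
transcription_error_rate = {"E. coli": 4.63e-5, "B. aphidicola": 2.69e-5}

def fold_difference(a, b):
    """Ratio of the larger value to the smaller one."""
    return max(a, b) / min(a, b)

mutation_fold = fold_difference(*mutation_rate.values())
transcription_fold = fold_difference(*transcription_error_rate.values())
# Orders of magnitude separating transcription errors from mutations in E. coli.
gap = math.log10(transcription_error_rate["E. coli"] / mutation_rate["E. coli"])

print(f"mutation rates differ {mutation_fold:.0f}-fold")
print(f"transcription error rates differ {transcription_fold:.2f}-fold")
print(f"E. coli transcription errors exceed mutations by ~10^{gap:.1f}")
```

The mutation rates differ by roughly 45-fold, whereas the transcription error rates differ by well under twofold.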
Genome reduction in endosymbionts results from elimination of genes that are no longer necessary in the host environment but also involves the loss of apparently beneficial genes, such as those that enhance the efficiency of universal cellular processes, such as DNA repair, translation, and transcription (Fig. S1). The lack of certain DNA repair enzymes in endosymbionts has been implicated in their extreme base compositions and increased mutation rates (36-38); however, loss of multiple RNA fidelity factors, such as greB in Buchnera (Fig. S1) and greA, greB (Fig. S1), and dksA in Carsonella, seems not to affect transcription error rates.
These bacterial endosymbionts are missing transcription fidelity factors, but their transcription error rates are unchanged, implying that there are mutations within RNAP that can increase the fidelity of transcription. If there is indeed an optimal transcription error rate across bacteria, selection may have improved the intrinsic error rate in the endosymbiont RNAPs after they lost the transcription fidelity factors. However, neither of the endosymbiont RNAPs possesses a mutation known to increase transcription fidelity in E. coli (39). It is possible that endosymbionts do not require rapid transcription and can tolerate slow but accurate transcription (39). The presence of these fidelity factors in E. coli could allow its RNAP to make more errors (which are then corrected), as a result of selection for increased transcription speeds and increased growth rates (40).
Not only were transcription error rates similar in proteobacterial taxa of vastly different lifestyles, population structures, genome sizes, and mutation rates, but the error rates were comparable across organisms for different broad categories of RNAs. Because structural RNAs (16S and 23S rRNAs) persist longer than mRNAs, they can incur more damage (due to oxidative stress or deamination), thereby leading to an increase in our estimates of the error rate for ribosomal RNAs. On the other hand, one might anticipate rRNAs to have lower error rates than mRNAs, because subfunctional molecules would be preferentially targeted for degradation (28), leaving only those rRNAs that do not contain errors. It should be noted that under both scenarios, the error rate during transcription does not change, but rather the variation in the estimated error rates is caused by differences in the fate of rRNAs after transcription. We were only able to measure transcription error rates for both mRNA and rRNA in Buchnera. The average error rate for Buchnera mRNA was slightly lower than for rRNA, but this estimate was based on the detection of many fewer errors, and there is no significant difference between the two categories of RNAs (Fig. 3). It is not possible to measure transcription errors in rRNAs of E. coli or in mRNAs of Carsonella; in both cases, DNA polymorphisms inherent to the sample prevent recognition of transcription errors. However, the error rates of E. coli mRNA and Carsonella rRNAs differ by less than 10%.
Unlike what we observe for transcription error rates, the mutation rate of an individual strain can vary depending on its growth conditions. E. coli mutation rates have been shown to increase by an order of magnitude during stationary phase and under nutrient-limited conditions (41). Much of the variation in the mutation rate within a species has been attributed to expression of error-prone polymerases during stationary phase (42-44) and to increased chemical damage occurring during the switch from exponential growth to stationary growth (45-47). Such chemical damage to DNA is usually corrected through DNA repair pathways, but because analogous pathways do not exist for RNA, it is potentially more susceptible to this source of damage. That there is no increase in either the rate or spectrum of errors to RNA during stationary phase suggests that other mechanisms compensate for stationary-phase stresses (e.g., dps protein and catalases) (48-50) or that RNA is too short-lived to be significantly affected.
The relative frequencies of each type of transcription error were similar across organisms and across growth conditions (Fig. 2) and correspond to what is observed for spontaneous mutations in these organisms (i.e., that C→U substitutions constitute the most common class of errors, and that A/T→T/A and G/C→C/G transversions occur at some of the lowest frequencies) (12, 33, 51-53). Cytosine is the most unstable nucleobase and has an even higher rate of deamination to uracil when nucleic acids are in a single-stranded state (54), so the pronounced bias toward this error is expected. Therefore, some of the observed transcription errors appear to be due to damage to RNA, although current methods simply enumerate errors and do not discriminate between those caused by base misincorporations occurring during transcription and those caused by damage to the RNA after transcription. Nonetheless, chemical damage occurring after transcription is biologically relevant because ribosomes can still translate the damaged base.
Many of the initial measurements of transcription errors in bacteria were restricted to single reporter genes and assayed the combined effects of transcription and translation errors by assessing how frequently functional proteins were produced from a mutant gene (2,3,14). These assays considered errors in translation to be relatively rare because, in this system, it was thought that only the first ribosome on a transcript would be capable of mistranslation and that most errors could be ascribed to the process of transcription (2,3). However, translation errors in E. coli can occur at rates between 10−3 and 10−4 per codon (4-6), suggesting that many of the original measurements of transcription errors are confounded by the inclusion of translation errors. Furthermore, the error rates varied by up to an order of magnitude for different stop codons (3), indicating that these fluctuations may be attributed to different translation error rates for different codons (55); therefore, the rates derived from these studies require validation by methods that exclusively consider transcription errors.
Previous studies reported that the combined transcription/translation error rate, as inferred from the frequency of errors in protein sequences, increases both in stationary phase and under starvation conditions (4,6,27). Because we detected no differences in transcription error rates between these different growth conditions, we reason that this variation manifests during translation and is most likely caused by tRNA scarcity during stationary phase (6,55,56). Although decreases in ribonucleotide concentration also occur during stationary phase (57), this has little effect on the overall fidelity of gene expression. Decreases in ribonucleotide concentration have been shown to increase the frequency of transcriptional pausing (58), which is closely associated with base misincorporations during transcription (39,59,60), so it seems that either (i) ribonucleotide concentration does not decrease enough under our experimental conditions to significantly alter the transcription error rate or (ii) that ribonucleotide concentration-induced pausing does not result from transcription errors. Nonetheless, it is curious that cellular growth conditions modify both the rate of DNA mutations and the rate of protein translation errors but not the transcription error rate.
Rates of translation errors have been estimated as being at least an order of magnitude higher than rates of transcription errors, but because most transcripts are translated multiple times, the realized number of modified proteins originating from transcription errors will equal or exceed the number caused by translation errors. This amplification of individual transcription errors into multiple proteins is likely to account for the reduction of transcription vs. translation error rates (10−5 vs. 10−4).
It has been suggested that errors in proteins, as caused by transcription and translation errors, contribute to survivability in the face of external stresses by the production of novel proteins or metabolites (27,61,62) or by inducing the general stress response (63). Such effects could not be accomplished through genomic mutations because such mutations can incur permanent decrements to fitness after the stress is removed. Although transcription errors can increase cellular noise and confer a benefit under certain temporary conditions, most variation introduced by errors will not be advantageous. Thus, the predominant direction of selection is to lower error rates because too many errors will overload the proteome with deleterious proteins. Whether or not the above argument is tenable, our findings, showing a remarkable consistency of transcription error rates across ecologically diverse bacterial species, different RNA categories, and under a variety of stress and nonstress growth conditions indicate that transcription errors would contribute very little to such transient protein errors. Transcription is a much less accurate process than DNA replication, and because transcription errors are not heritable (and the vast majority of RNAs are transcribed faithfully under any set of conditions), there appears to be little selection to modulate the overall transcription error rate.

Methods
Strains and Growth Conditions. Transcription errors were enumerated for E. coli MG1655 grown at 37°C in (i) 15 g/L TSB or (ii) M9 minimal media supplemented with 0.4% glucose. Bacterial cultures were preconditioned in either TSB or M9 minimal media for 24 h before inoculation for sampling. Overnight cultures were diluted to OD600 = 0.05 into fresh media and sampled at midlog phase (4 h for TSB; 6 h for M9) and stationary phase (18 h for TSB; 24 h for M9).
Transcription errors were enumerated for B. aphidicola, an insect endosymbiont recovered directly from its aphid host, Acyrthosiphon pisum. B. aphidicola were isolated from 5 g adult aphids by a membrane filtration method (64) as follows: aphids were crushed by mortar and pestle in 15 mL buffer A (25 mM KCl, 35 mM Tris·HCl, 100 mM EDTA, and 250 mM sucrose, pH 8.0) at 4°C, and the homogenate was centrifuged at 1,500 × g for 15 min. Pellets were resuspended in 15 mL buffer A and passed serially through 100-, 20-, 8-, and 5-μm filters. B. aphidicola cells were recovered from the filtrate by centrifugation. Transcription errors occurring in C. ruddii, another insect endosymbiont, were determined from a pooled sample of bacteriocytes from 200 dissected larvae of the psyllid Pachypsylla venusta collected locally from galls present on a hackberry tree. Bacteriocytes were stored in buffer A at -20°C before RNA extraction.
RNA Extractions. RNA was extracted from E. coli following the RNAsnap protocol for gram-negative bacteria (65). Roughly 10^8 bacterial cells were harvested by centrifugation at 16,000 × g for 30 s, the supernatant was removed by aspiration, and pelleted cells were immediately transferred to liquid nitrogen to halt transcription. Samples were transferred to ice, mixed with 100 μL RNAsnap solution [18 mM EDTA, 0.025% SDS, 1% 2-mercaptoethanol, and 95% (vol/vol) formamide], briefly vortexed, and incubated for 7 min at 95°C. Following incubation, samples were centrifuged at 16,000 × g for 5 min. Supernatants were mixed with an equal volume of PCI (phenol/chloroform/isoamyl alcohol, 25:24:1), the aqueous phase was removed and treated with an equal volume of chloroform, and RNA was precipitated by addition of 1/10 volume 3 M sodium acetate, 1/50 volume 50 mg/mL glycogen, and 3 volumes of 100% ethanol. DNA contamination was tested using a Qubit high sensitivity DNA assay (Life Technologies), and RNA quality was assessed on an Agilent Bioanalyzer. Ribosomal RNAs were removed from total RNA preparation using the MICROBExpress kit (Life Technologies).
RNA was extracted from B. aphidicola and C. ruddii by the addition of 0.75 mL TRIzol reagent (Life Technologies) to 0.25 mL harvested cells (or bacteriocytes in the case of C. ruddii). Samples were mixed with 0.5 mL sterile zirconium beads, vortexed for 2 min to disrupt cells, and incubated for 5 min at 20°C. Following a chloroform extraction, nucleic acids were precipitated from the aqueous phase by the addition of 1/10 volume 3 M sodium acetate, 1/50 volume 50 mg/mL glycogen, and an equal volume of 100% isopropyl alcohol. Precipitated nucleic acids were washed twice with 70% (vol/vol) ethanol, suspended in 50 μL RNase-free dH2O, and treated with DNase, according to the supplier's specifications (Promega). Reactions were terminated by the addition of an equal volume of PCI, and total RNA was precipitated, quantified, extracted, tested for purity, and cleared of ribosomal RNAs as described above.
Library Preparation and Sequencing. We applied two library preparation procedures that have been reported to differentiate errors that occur during transcription from those that arise during sequencing (20)(21)(22). Both methods aim to produce multiple cDNA copies of each mRNA and identify consensus errors, which represent those that are actually present in the corresponding mRNA template. The first method involves successive rounds of sequencing streptavidin-captured mRNAs (20) to generate the multiple cDNA copies of each mRNA, and the second method (termed CircSeq) (21,22) is based on the sequencing of short, circularized fragments of mRNA that are copied multiple times by rolling-circle amplification before sequencing. Attempts at the original streptavidin-capture method of Gout et al. (20) failed to generate multiple copies of cDNA from each mRNA, and even after consulting with the authors and applying several suggested additions and modifications to the published protocol, we concluded that this method, as currently described, cannot be used to estimate transcription error rates.
For the CircSeq procedure, we followed the protocol of Acevedo and Andino (22) with the following modifications that reduced the total number of steps. Starting with 1 μg purified mRNA, samples were fragmented with the NEB Magnesium Fragmentation module at 94°C for 5 min and then assayed by denaturing PAGE. Regions of the gel containing RNA fragments in the 80- to 100-nt size range were excised, and RNA was eluted from crushed gel slices by overnight incubation in a solution containing 600 mM sodium acetate, 0.017% (wt/vol) SDS, and 1.67 mM EDTA at 4°C. RNA was recovered from the eluent by ethanol precipitation, washed in 70% (vol/vol) EtOH, resuspended in 14 μL ddH2O, and analyzed for quality on an Agilent Bioanalyzer RNA chip. RNA fragments were circularized by incubating the entire sample volume with 1 μL T4 polynucleotide kinase (NEB), 1 μL T4 RNA ligase I (NEB), 2 μL T4 RNA ligase buffer (NEB), and 2 μL 10 mM ATP for 30 min at 37°C. Samples were purified by PCI extraction and ethanol precipitation, and libraries were prepared for Illumina sequencing by following the protocol accompanying the NEBNext Ultra RNA Library Prep Kit through completion of the second strand synthesis step. After this step, samples were repurified by PCI extraction and ethanol precipitation and analyzed with an Agilent Bioanalyzer RNA chip to determine the extent of rolling circle amplification, which occurred during the cDNA synthesis step of the NEB protocol. After confirming amplification status, ddH2O was added to a final volume of 200 μL, and samples were subjected to 12 min of pulsed sonication (15 s on, 15 s off, amplitude 20%) in a Qsonica sonicator to obtain fragments for sequencing. After harvesting nucleic acids by EtOH precipitation, we resumed the NEBNext Ultra RNA Library Prep Kit protocol for a target insert size of 300 bp.
Samples were barcoded using NEBNext Multiplex Oligos (Index Primers Set 1), and the resulting libraries were sequenced on an Illumina MiSeq using 300-nt reads. Sequencing files were discriminated based on their identifying barcodes and analyzed using the CirSeq_v2 pipeline (21).
Data Processing and Analysis. After the sequences were processed by the CirSeq_v2 pipeline with an average quality score cutoff of 20 (Fig. S5 and SI Methods), we removed those duplicate and multicopy genes that are polymorphic within the E. coli genome (e.g., structural RNA genes, ompF and ompC, and tufA and tufB) because the source of variation cannot be unequivocally assigned. Transcription error rates were adjusted for base composition of the sample using the weighted average of the occurrence of each nucleotide in the particular individual transcriptome being considered.
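One possible reading of the base-composition adjustment is sketched below in Python. This is an assumption, not the exact calculation used in the study (see SI Methods): it computes a separate error rate for each reference nucleotide and averages these rates, weighting each by the frequency of that nucleotide in the transcriptome. The function name and input layout are hypothetical.

```python
def composition_weighted_rate(errors, coverage, composition):
    """Average per-base error rates, weighted by transcriptome base composition.

    errors:      dict base -> number of errors observed at that reference base
    coverage:    dict base -> total sequenced coverage of that reference base
    composition: dict base -> count (or frequency) of that base in the transcriptome
    """
    total = sum(composition.values())
    rate = 0.0
    for base in "ACGU":
        if coverage[base] == 0:
            raise ValueError(f"no coverage for base {base}")
        weight = composition[base] / total       # fraction of the transcriptome
        rate += weight * errors[base] / coverage[base]
    return rate
```

Under this reading, a nucleotide that is rare in a given transcriptome contributes proportionally less to the overall rate, which keeps samples with very different base compositions (such as the AT-rich endosymbionts and E. coli) comparable.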
We developed custom Python scripts to determine the following: (i) transcription errors, calculated by tabulating the total number of errors identified by the CirSeq_v2 pipeline within the protein coding regions of the genome (SI Methods); (ii) nucleotide coverage, calculated by adding the overall coverage of each base within the protein coding regions of the genome; (iii) error rates, calculated by tabulating the total number of errors and base coverage of all coding regions within 50-kb nonoverlapping windows across an entire genome and dividing the number of errors by the coverage, yielding an error rate (SI Methods); (iv) leading/lagging strand error rates, calculated by tabulating the errors and coverage of all genes situated on either the leading or lagging strands and calculating the error rate as above; and (v) the number of errors that would result in an amino acid replacement by chance, calculated by randomly generating simulated transcription errors from each sequenced transcriptome and determining their effects on the amino acid sequence. All statistical analyses were performed in GraphPad Prism or R.
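Calculations (iii) and (v) can be sketched in Python as follows. This is an illustration only, not the scripts used in the study. It assumes 0-based genome coordinates, a coverage table restricted to coding positions, the standard genetic code, uppercase DNA-alphabet coding sequences, and simulated errors placed uniformly across coding positions rather than weighted by expression. All function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code in the DNA alphabet (an assumption of this sketch).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def windowed_error_rates(error_positions, coverage, window=50_000):
    """Errors divided by summed coverage in nonoverlapping genomic windows.

    error_positions: iterable of genome coordinates, one entry per error
    coverage:        dict coordinate -> read depth at that coding position
    """
    errors = defaultdict(int)
    depth = defaultdict(int)
    for pos in error_positions:
        errors[pos // window] += 1
    for pos, d in coverage.items():
        depth[pos // window] += d
    return {w: errors[w] / depth[w] for w in depth if depth[w] > 0}


def changes_amino_acid(codon, offset, new_base):
    """True if substituting new_base at codon[offset] alters the encoded residue.

    A change to or from a stop codon ('*') is counted as a replacement.
    """
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    return CODON_TABLE[codon] != CODON_TABLE[mutated]


def simulated_replacement_fraction(coding_seqs, n_errors, rng):
    """Fraction of randomly placed substitutions that alter the protein."""
    # Every in-frame position of every coding sequence is equally likely.
    sites = [(seq, i) for seq in coding_seqs for i in range(len(seq) - len(seq) % 3)]
    changed = 0
    for _ in range(n_errors):
        seq, i = rng.choice(sites)
        start = i - i % 3
        codon = seq[start:start + 3]
        new_base = rng.choice([b for b in BASES if b != seq[i]])
        changed += changes_amino_acid(codon, i % 3, new_base)
    return changed / n_errors
```

Passing a seeded generator, for example `random.Random(0)`, makes the simulation reproducible. The simulated fraction serves as the chance expectation against which the observed proportion of amino acid-altering errors can be compared.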
The list of nucleic acid information processing genes and the associated functions were curated using EcoCyc (66). Orthologs of these genes in the endosymbionts were determined using BLASTP from the National Center for Biotechnology Information (NCBI) (blast.ncbi.nlm.nih.gov/Blast.cgi) with an E-value cutoff of ≤1 and an amino acid-positive score cutoff of ≥40%. The accession numbers for the genomes used in this study, which were accessed through NCBI, are NZ_ACFK01000001 for B. aphidicola LSR1 and NC_008512 for C. ruddii PV.
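The two ortholog cutoffs can be applied programmatically, as in the following Python sketch. It assumes that BLASTP hits have been exported as tab-delimited text with four columns in this order: query ID, subject ID, E-value, and percentage of positive-scoring matches. This column layout and the function name are assumptions of the sketch. The searches in this study were run through the NCBI web interface.

```python
def filter_hits(lines, max_evalue=1.0, min_ppos=40.0):
    """Keep hits with E-value <= max_evalue and percent positives >= min_ppos.

    lines: iterable of tab-delimited strings with the (assumed) columns
           query ID, subject ID, E-value, percent positives.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        qseqid, sseqid, evalue, ppos = line.split("\t")[:4]
        if float(evalue) <= max_evalue and float(ppos) >= min_ppos:
            hits.append((qseqid, sseqid, float(evalue), float(ppos)))
    return hits
```

A query with no surviving hit under these cutoffs would be scored as lacking a detectable ortholog in the endosymbiont genome.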