## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Modeling and automation of sequencing-based characterization of RNA structure

Contributed by Jennifer A. Doudna, April 29, 2011 (sent for review February 13, 2011)

## Abstract

Sequence census methods reduce molecular measurements such as transcript abundance and protein-nucleic acid interactions to counting problems via DNA sequencing. We focus on a novel assay utilizing this approach, called selective 2′-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq), that can be used to characterize RNA secondary and tertiary structure. We describe a fully automated data analysis pipeline for SHAPE-Seq analysis that includes read processing, mapping, and structural inference based on a model of the experiment. Our methods rely on the solution of a series of convex optimization problems for which we develop efficient and effective numerical algorithms. Our results can be easily extended to other chemical probes of RNA structure, and also generalized to modeling polymerase drop-off in other sequence census-based experiments.

Over the past 30 years, techniques have been developed that probe RNA structures with small molecules. In this class of techniques, a chemical reagent modifies RNA molecules in a structure-dependent fashion. Depending on the reagent used, four distinct types of information can be gleaned, including spatial nucleotide contact information, solvent accessibility of the RNA backbone, the local electrostatic environment adjacent to each nucleotide, and the local nucleotide flexibility (1). In each of these techniques, the modification location is detected during conversion to cDNA by blockage of reverse transcriptase at the modification site (Fig. 1). The detection can be performed by direct sequencing of the cDNA fragments using high-throughput sequencing technology (2). However, because at most a single modified site is revealed by every sequenced fragment (the closest modification to the 3′ end), a mathematical model and inference framework are needed to accurately infer the underlying structural properties given the observed fragment distribution.

In this work, we introduce such a model and framework in the context of the SHAPE (selective 2′-hydroxyl acylation analyzed by primer extension) technique for characterizing local nucleotide flexibility (3–5). The identification of adduct formation can be performed by capillary electrophoresis (SHAPE-CE) or by high-throughput sequencing of cDNA fragments (SHAPE-Seq) (2) (Fig. 1). Every fragment begins at the 3′ end of the molecule and terminates at some adduct [(+) *channel*], or possibly at a location where there was natural polymerase drop-off (6, 7), which is controlled for in a separate control experiment [(−) *channel*]. Following sequencing, reads are mapped back to the RNA sequence and are classified by their end location. The resulting read counts are the sufficient statistics for a model that is used to infer estimates of the probabilities of adduct formation at each nucleotide, called relative reactivities.

The probabilistic model we develop for SHAPE and the sequencing that follows in SHAPE-Seq is highly structured and has recursive properties that allow for efficient maximum-likelihood inference and confidence interval estimates. Our approach is inspired by probabilistic models used in RNA sequencing (RNA-Seq) analysis to measure transcript identity and abundance (8) and should be easily generalizable to any chemical probing technique that characterizes different aspects of RNA structure. We present results that confirm the accuracy of our approach and that reveal the simplicity in analyzing the data despite the complexity of the models. Together, these provide a proof of principle for the utilization of SHAPE-Seq for high-throughput RNA structure characterization.

## Modeling Polymerase Drop-Off

We consider an RNA molecule that contains *n* sites. The sites are numbered 1 to *n* according to their sequence position with respect to the molecule’s 3′ end, where the latter is assigned position 0 and is excluded from analysis (Fig. 1). In a SHAPE-Seq experiment, we observe cDNA fragments of varying lengths, where a *k*-*fragment* corresponds to a mapped read of length *k* that spans sites 0 to *k* - 1 (1 ≤ *k* ≤ *n*), and a *complete fragment* corresponds to a full transcript of length *n* + 1. In the (−) channel control assay, the primary source of incomplete fragments is reverse transcriptase’s (RT) natural drop-off while transcribing the molecule. Natural drop-off arises largely due to structural properties of the molecule, and thus RT’s propensity to drop may vary along sites. To study this process at nucleotide resolution, we define the *drop-off propensity at site* *k*, *γ*_{k}, to be the conditional probability that RT terminates transcription at site *k*, given that it has reached this site. The *n* parameters Γ = (*γ*_{1},…,*γ*_{n}), 0 ≤ *γ*_{k} ≤ 1 ∀ *k*, completely characterize natural drop-off, and we wish to estimate them from the (−) channel fragment counts.

A maximum-likelihood (ML) estimate of Γ is derived from the (−) channel *k*-fragment and complete fragment counts (*Y*_{1},…,*Y*_{n+1}) as follows: We denote by *p*_{k} the probability that a molecule results in a *k*-fragment (1 ≤ *k* ≤ *n*) and by *p*_{n+1} the probability that it is transcribed in full. The likelihood of the counts is thus proportional to , where , and takes the form of a log linear model. It is well-known that such a model is uniquely maximized by [1]from which we can retrieve the estimate using the relations for *k* = 1,…,*n*. We obtain *γ*_{k} by observing that a fragment has length greater than *k* if and only if no drop-off occurred until site *k*; that is, . Hence, [2]Importantly, RT’s drop-off gradually degrades the pool of actively transcribed molecules throughout the experiment and is captured by the relations .

## Modeling Chemical Modification

In the (+) channel, the RNA is treated with an electrophile that reacts with conformationally flexible nucleotides to form 2′-*O*-adducts. We define the *relative reactivity* of a site to be the probability of adduct formation at that site. Therefore, associated with the RNA molecule are *n* nonnegative real numbers Θ = (*θ*_{1},…,*θ*_{n}), , which we wish to estimate from sequencing data.

During modification, an RNA may be exposed to variable numbers of electrophile molecules. We model the number of times an RNA is exposed to electrophile molecules as a Poisson process of an unknown rate *c* > 0, i.e., we assume that [3]where each exposure may result in the modification of a site. A point that is key to interpreting SHAPE data is that a *k*-fragment is assumed to be generated when site *k* is the site that is *first* encountered by RT, regardless of the number of adducts that formed upstream of *k*. This is important in light of the fact that SHAPE experiments are calibrated to yield single-hit kinetics (i.e., *c* ≈ 1). Under such conditions, a considerable portion of the molecules are hit twice or more (e.g., 26.42% when *c* = 1, as compared to 36.79% that are hit once). Given that an RNA is exposed *i* times, the probability that a molecule carries its first adduct at site *k* is [4]When *k* = *n*, the second sum is taken to be 0 so that Eq. **4** reduces to (for convenience, we define 0^{0} = 1). Note that Eq. **4** entails an approximation to our understanding of SHAPE chemistry as it allows for repeated modification of a site. This approximation is minor, however, largely due to negligible abundance of molecules with a multitude of adducts under single-hit kinetics as well as the lengths of the molecules. This premise is also supported by robustness analysis of our framework (see *SI Text* for details). Notably, the low probability of many hits also justifies the use of an unbounded Poisson model rather than its truncation.

We can now obtain the probability of a molecule being modified at site *k* (although possibly also at subsequent sites): [5][6][7]Incorporating the natural drop-off probabilities, we have [8]where reflects the natural degradation in the elongating modified-molecule pool. We attribute all other observed fragments to natural causes, which take effect in two distinct pools: unmodified molecules and modified ones for which RT may drop off before encountering the first adduct (see Fig. 1). These factors are combined to yield the probability of observing a *k*-fragment from natural drop-off: [9][10][11][12]As a special case, the probability of observing a complete fragment in the (+) channel is [13]It can be seen from Eqs. **10** and **13** that under single-hit kinetics, the unmodified pool is expected to occupy a significant portion of the target pool, as *c* = 1 implies equal probabilities of experiencing no modification or a single hit. We can think of it as having a (−) channel contained within the (+) channel. Finally, based on Eqs. **8**–**12**, the probability of observing a *k*-fragment in the (+) channel is [14]

Assuming we observe (*X*_{1},…,*X*_{n+1}) *k*-fragment and complete fragment counts in the (+) channel, the likelihood of observing the entire data from both channels is given by [15]and can be explicitly written as [16]

Here, we maximize to find the model parameters that best explain the observed data and thus best reflect underlying structural properties of the RNA molecule.

## Maximum-Likelihood Estimation

We begin by stating a theorem that leads to an algorithm for ML estimation which either provides the exact solution, or else fails and reports that status. We assume that *X*_{n+1},*Y*_{n+1} > 0.

Estimate

*c*by setting [17]Estimate Γ by setting [18]

Estimate the SHAPE reactivities Θ by [19]where describe the observed (+) channel fragment-length distribution; i.e., .

If Eqs. **17**–**19** yield for all 1 ≤ *k* ≤ *n* and , then they determine the parameters that uniquely maximize the likelihood Eq. **16** over all distributions Θ, over Γ such that 0 ≤ *γ*_{k} ≤ 1, and over .

We first cast Eq. **16** as the following simplified log linear model: [20]where the *p*_{k}’s were defined earlier, *w*_{k} = Prob(*k*-fragment) (1 ≤ *k* ≤ *n*), and *w*_{n+1} = Prob(complete fragment) as in Eqs. **13** and **14**. To simplify notation, we define the variables [21]Note that Eqs. **13** and **14** imply *w*_{k} = *v*_{k} - *v*_{k+1} (1 ≤ *k* ≤ *n*) and *w*_{n+1} = *v*_{n+1}, and that setting results in *v*_{1} = 1. Becuse the *w*_{k}’s represent probabilities, they are nonnegative and hence the *v*_{k}’s form a weakly decreasing sequence . It is now also easy to see that because the sum is telescoping.

We can now leverage well-known results for ML estimation of log linear models. Essentially, when the parameters are nonnegative and sum to a known constant there is a unique closed-form ML solution (9). The model in Eq. **20**, however, imposes two constraints: . In this case, similar techniques (i.e., Lagrange multipliers) can be used to show that the likelihood is uniquely maximized by [22]A direct consequence of Eq. **22** is that can be determined from , as was done earlier (see Eq. **2**). One can also resolve from because implies . To backtrack , we use the previous construction: from the definitions we have that , and hence [23]Now, from Eq. **21** we have and therefore [24]Eq. **19** then follows from .

It is left to verify whether the obtained solution lies within the problem’s domain. Clearly, the ’s are properly bounded, and our construction also guarantees whenever (via and Eq. **21**). However, Eq. **17** might result in , which is indicative of problematic data. More importantly, it is possible to encounter jointly with (or ) for some *k*’s. To observe this, consider the product in Eq. **24**: The right-hand factor equals at most 1, whereas the left-hand factor equals at least 1. Yet, both factors are based on data from the (+) and (−) channels and are not constrained to guarantee that , thereby leading to cases where and or . In such cases, Eq. **20**’s solution maps to an infeasible parameter set.

Theorem 1 states that Algorithm 1 fails if the estimates are negative, as the results are then not biologically meaningful. Intuitively, negative reactivities arise when the frequencies of certain *k*-fragments in the (+) channel are exceedingly low compared to their (−) channel counterparts, to the extent that they cannot be explained by setting the relevant ’s to zero. In general, negative reactivities indicate that the (+) channel data are inconsistent with the (−) channel data, or in other words, that the entire dataset is “too noisy.”

When the user wishes to proceed with analysis in the presence of data inconsistencies (see *Interpretation*), our goal is to resolve them by finding 2*n* + 1 *feasible* parameters that maximize Eq. **16**. Notably, optimizing the function efficiently is particularly important since SHAPE-Seq is geared toward highly multiplexed probing (2). This, in turn, necessitates rapid analysis of the data collected from a multitude of distinct RNA molecules. Nevertheless, efficient optimization is computationally challenging for the following reasons. First, it involves boundary and fixed-sum constraints, which preclude the use of standard local optimization methods (e.g., gradient-based algorithms). Second, its scale is large, as *n* is typically on the order of several hundreds (3, 7). Here, we present Algorithm 2, which provides an efficient numerical solution despite the problem’s complexity and large dimension. Note that hereafter we explicitly require .

Initialize. Exclude from analysis all entries where

*X*_{k}=*Y*_{k}= 0. For each of the remaining*n*^{′}entries, use Algorithm 1’s output to set and .If increased in less than

*ϵ*or if*M*iterations were completed then stop, else solve the following three optimization problems:Optimize for

*c*. Find*c*> 0 that maximizes . Set*c*^{⋆}to the solution.Optimize for Γ. Find Γ that maximizes under the constraints 0 ≤

*γ*_{k}≤ 1 (*k*≤*n*^{′}). Set Γ^{⋆}to the solution.Optimize for Θ. Find Θ that maximizes under the constraints

*θ*_{k}≥0 (*k*≤*n*^{′}), . Set Θ^{⋆}to the solution.

Go to step 2.

The advantage to this formulation is that each of the three subproblems is a convex optimization problem, as established by Lemmas 1–3 in *SI Text*. A convex optimization problem has two attractive properties: (*i*) if a local maximum exists, it is a global maximum, and (*ii*) it is, in general, numerically tractable via specialized interior-point methods (10). Yet, we stress that these methods’ tractability and accuracy are limited by the problem’s dimension and also depend to a large extent on its structure. We circumvent these deficiencies by obtaining exact solutions for the high-dimensional coordinate problems, namely maximizing Θ and Γ, as stated in the next two theorems.

Given *c*^{⋆} > 0 and a distribution Θ^{⋆}, the likelihood function attains a unique maximum over all Γ∈[0,1]^{n} at that is the maximum of zero and [25]for all 1 ≤ *k* ≤ *n*, where (0 ≤ *k* ≤ *n*) and (1 ≤ *k* ≤ *n*).

The proof of the Theorem is in *SI Text*.

Although the third problem is not amenable to an explicit solution like the second one, the next theorem shows that it can in fact be reduced to numerically solving an equation in one variable. Here, we utilized the *water filling* optimization technique, which has been previously used to optimize power-constrained transmission over parallel Gaussian channels (11). A sketch of the proof is included below, with full details in *SI Text*.

Given *c*^{⋆} > 0 and Γ^{⋆}∈[0,1]^{n}, the likelihood function has a unique maximum over all distributions Θ, given by that is the maximum of zero and [26]for all 1 ≤ *k* ≤ *n*, where solves the piecewise equation [27]

The Karush–Kuhn–Tucker (KKT) constraints are a set of conditions that are necessary and sufficient for the optimality of a solution of a convex optimization problem with inequality constraints (10). We derive them by introducing a Lagrange multiplier, denoted *ν*^{⋆}, which corresponds to the fixed-sum constraint, and *n* KKT multipliers, denoted , which pertain to the nonnegativity constraints. Using these variables, we obtain the following conditions: [28][29]where is the derivative of .

Next, we derive the relations between , , and *ν*^{⋆} that solve Eq. **29** for a site *k*. First, assume that it is solvable for , in which case we set . The left-hand side of Eq. **29** then becomes the sum of a nonnegative constant and a strictly monotonously decreasing function that is illustrated in Fig. S1*A*. The choice of *ν*^{⋆} thus completely (and explicitly) determines , but importantly, *ν*^{⋆} must be *smaller* than a certain threshold to yield . This threshold depends on the constants *c*^{⋆}, *X*_{1},…,*X*_{k}, and and thus varies among the *n* sites. When *ν*^{⋆} *exceeds* site *k*’s threshold, the resulting violates the nonnegativity constraint and instead we set it to zero and use as a slack variable to fill the gap between *ν*^{⋆} and .

The derived relations imply that given *ν*^{⋆}, a subset of Θ^{⋆}’s entries may be set to zero, with the remaining entries being determined from *ν*^{⋆} via Eq. **26**. Furthermore, the subset’s cardinality decreases as we decrease *ν*^{⋆} and gradually cross the thresholds of additional sites, where their ’s become positive. Having obtained an explicit expression for , it is left to find the *ν*^{⋆} at which . It is found by observing that forms a piecewise continuous and strictly monotonously *decreasing* function (with increasing *ν*^{⋆}), with breakpoints at the above-mentioned thresholds, where the zero-entry subset is updated. Note that the monotonicity between breakpoints is due to ’s strict monotonicity, whereby expands as *ν*^{⋆} decreases. This allows us to gradually decrease *ν*^{⋆} until the sum reaches 1 and can be visualized as flooding a region of varying surface levels up to a constant amount of water (see Fig. S1*B*). The original problem thereby reduces to finding the intersection of with 1, or alternatively, of with *e*^{c⋆}, as formulated in Eq. **27**.

Despite the apparent complexity of Eq. **27**, its special properties lead to a straightforward root-finding routine. This, coupled with efficient methods to solve the first two optimization problems (see *SI Text*), enabled us to implement Algorithm 2 so that it can be run for hundreds of RNA molecules in a matter of minutes (2).

## Results

We used our automated bioinformatics pipeline to analyze data from SHAPE-Seq probing of a mixture of well-studied RNAs of lengths *n* = 172–198. Alignments were generated by mapping sequenced fragments to the known RNA sequences using the Bowtie paired-end alignment program (12).

ML estimation took 1–2 s per molecule on a personal computer, and the reconstructed reactivities were in very good agreement with those obtained from SHAPE-CE probing and analysis (2). An example is shown in Fig. 2 for the *Staphylococcus aureus* plasmid pT181 sense RNA. The raw *k*-fragment frequencies in the (+) channel are illustrated next to the outputs of Algorithms 1 and 2. The estimated rate was . One can readily observe the input signal’s decay as well as a general trend in its reconstruction, that is, signal attenuation at the molecule’s 3′ end gradually transitions into amplification over its 5′ end. Additionally, Algorithm 1’s intermediate output is indicative of high-quality data, as it displays very few negative estimates. The depicted error bars correspond to one standard deviation of 500 bootstrap samples of the dataset and indicate negligible variation. The complete details of estimation results and subsequent structural analysis conducted for all probed molecules appear in ref. 2. The full data analysis pipeline is available at http://bio.math.berkeley.edu/SHAPE-Seq/ as a supplementary file for download (see *SI Text* for details).

To explore the robustness of our method, we investigated the accuracy of the estimated Θ as a function of the sequencing depth as well as under several model scenarios featuring hypothetical Θ and Γ distributions. We assessed robustness with respect to data size by analyzing, for different sizes, 200 randomly drawn subsets of our dataset and determining the fraction of sites that were assigned reactivities within 15% of the full-dataset estimate. Average fraction and sample variation per subset size are shown in Fig. S2. They demonstrate that high quality is retained in the presence of an order-of-magnitude decrease in data throughput and that a decrease of two orders of magnitude results in fair accuracy. Fig. S3 recapitulates the latter claim by demonstrating that 0.5% of the reads suffice to capture the general reactivity profile of our dataset. The dependence of estimation quality on structural features of the RNA was assessed by assigning various distributions to Θ and Γ and subsequently drawing 5 million fragments from the induced fragment distributions. We collected 15%-, 10%-, and 5%-interval statistics. For 500 simulations per case, the average fraction of hits decreased with increasing number of reactive sites and was lowest for the extreme case of an exponentially declining Θ (i.e., 76% of 50 reactive sites were within the 5% interval). Yet, variation around the average was consistently negligible. In addition, effects of instantaneous drop-off spikes in Γ amounted to several percents degradation at the most. The complete analysis details are found in the *SI Text* and in Tables S1 and S2.

## Interpretation

Although Algorithm 1 may fail to generate the ML solution, it is valuable in two ways. First, it informs the user of the experiment and data quality via and summary-level statistics such as , , and , which are not obtainable with the present SHAPE-CE pipeline (4). The extent of data inconsistencies is manifested in the negative ’s and can be assessed by inspecting their magnitudes and abundance. In the presence of large-magnitude negatives, the user may merely exclude certain sites from further analysis (e.g., structure prediction), as is commonly done in SHAPE-CE data analysis (4, 5).

Second, our model-based correction formula elucidates the effects of the two bias sources and guides how to offset them. This, in turn, highlights the virtues and deficiencies of existing bioinformatics approaches. Specifically, we can rewrite Eq. **19** such that it maps the input fragment-length distributions to the corrected output distribution, as follows: [30]One can see that the correction can be applied separately to the noisy signal component (i.e., left-hand term) and the noise component (i.e., right-hand term), as is currently done in SHAPE-CE (6). More importantly, each frequency is corrected according to its *position in the distribution* (i.e., its percentile). Particularly, we can approximate the correction factor by because, in general, when *n* is large. Amplification is thus adjusted according to the empirical cumulative distribution. This is in contrast to existing approaches where users apply a heuristic exponential-decay correction that corrects according to a nucleotide’s *sequence position* rather than its percentile (6, 7). Moreover, this correction’s range and parameters are chosen by the user, based on visual inspection.

In the ideal case where and *γ*_{k} = *γ*_{0} ∀ *k*, the two approaches coincide (see illustration in Fig. S4*A*), whereas they differ as the RNA diverges from a uniform pattern, which is the case for biologically relevant RNAs. For example, in the *Bacillus subtilis* RNase P RNA specificity domain, most reactive bases are confined to a short segment (13), thereby inducing considerable signal decay over that region. However, due to its short span, it is easily missed by the eye (see SHAPE-Seq data in Fig. S4 *B* and *C*). The need for correction is also commonly overlooked in the (−) channel, which often displays a few outstanding spikes over low-frequency background. These reflect natural transcription barriers that significantly degrade the pool of molecules that are transcribed past these points. Hence, the decay pattern clearly departs from a smooth gradual decline toward a stepwise behavior, as illustrated in Fig. S4 *D* and *E*. Sharp instantaneous degradations in the target pool should be accounted for such that noise estimates past these points are boosted accordingly. In Eq. **30**, this is encapsulated into the noise correction term, such that the spike’s effect is propagated to upstream sites, whereas existing methods merely subtract the spike locally from the matching signal intensity.

Finally, the above interpretation gives rise to a simple nonparametric correction, where logarithms are removed and reactivities are normalized such that . Interestingly, the reactivities reconstructed with this approach were, for the most part, very close to the ML estimates. However, they tend to diverge as reactivities increase, which, in turn, may confer more sensitivity to outliers.

## Discussion

In this work, we present the first rigorous model of the SHAPE experiment used to probe the structures of RNA molecules. Using this model, we developed a robust method to determine the set of reactivities that best explains the observed cDNA fragment-length distribution. Coupled with our alignment pipeline, this produces an automated workflow for SHAPE analysis via sequencing and improves upon existing approaches for analyzing SHAPE data. Current approaches use a heuristic exponential decay correction and (−) channel scaling to assign low reactivities to user-identified sites thought to have little reactivity (6). Although they do correct for natural polymerase drop-off, these procedures can require expert knowledge to choose when and where to apply. Our approach also leverages accurate measurement of the number of full-length transcripts to provide an estimate of the modification rate, *c*. This has not been possible with SHAPE-CE due to detection limitations of CE in quantifying the signal of full-length fragments. Although our ML framework was applied to data generated from sequencing experiments, it could be adapted to CE-based data to automate their analysis.

Our robustness analysis shows that SHAPE-Seq is remarkably accurate even at low throughput making it tractable with lower throughput desktop sequencers, suitable for low abundance RNAs, and effective for multiplexed bar coding to simultaneously probe large numbers of molecules in parallel (2).

In the present work, we have focused on the application of our method to SHAPE-Seq (2). However, our method for modeling the effects of RT drop-off is broadly applicable to any sequence census method that utilizes a pool of RNA molecules as the basis for the measurement. In particular, our method should be directly applicable to the general class of RNA structure-dependent chemical probing techniques that utilize single modifications to probe other features such as solvent accessibility of the backbone and local electrostatic environment of the nucleotide (1). Additional interesting challenges will arise when extending our method to techniques where adduct formation is correlated to structural features at more than one position in the RNA, for example with chemicals that probe through-space neighbors (1). Other methods that involve converting RNA into cDNA, including transcript abundance quantitation and alternative splicing isoform identification with RNA-Seq (8, 14), and qRT-PCR, could potentially benefit from our estimation of premature termination of the RT process represented by to correct for drop-off biases. Moreover, we believe these methods generalize to the setting of random priming where fragments may not always begin at the same place, allowing us to determine the structures of de novo pools of RNA such as natural transcriptomes.

In general, our automated data analysis pipeline, coupled with the SHAPE-Seq protocol, is a high-throughput method to infer secondary and tertiary structural information for every nucleotide in an RNA solution (2). Furthermore, the final output of this method, the optimal Θ that represent the reactivity of each nucleotide to adduct formation, are well understood and can be used to constrain existing RNA structure prediction programs to remarkably increase their accuracy (5, 15). SHAPE-Seq should also be able to provide much-needed information for recent algorithms that predict tertiary RNA structures from primary sequence (16).

## Acknowledgments

S.A., J.B.L., and A.P.A. acknowledge support from the Synthetic Biology Engineering Research Center under NSF Grant 04-570/0540879. J.A.D. is an Howard Hughes Medical Institute (HHMI) Investigator, and this work was supported in part by the HHMI. S.A.M. is a fellow of the Leukemia and Lymphoma Society. J.B.L. and L.P. thank the Miller Institute for financial support and a stimulating environment in which this work was conceived.

## Footnotes

^{1}To whom correspondence may be addressed. E-mail: doudna{at}berkeley.edu, aparkin{at}lbl.gov, or lpachter{at}math.berkeley.edu.

Author contributions: S.A., C.T., J.B.L., S.A.M., S.L., G.P.S., J.A.D., A.P.A., and L.P. designed research; S.A., C.T., J.B.L., S.A.M., and L.P. performed research; S.A., C.T., J.B.L., and L.P. contributed new reagents/analytic tools; S.A., C.T., J.B.L., S.A.M., S.L., and L.P. analyzed data; and S.A., C.T., J.B.L., S.A.M., S.L., G.P.S., J.A.D., A.P.A., and L.P. wrote the paper.

The authors declare no conflict of interest.

See Commentary on page 10933.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1106541108/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵
- ↵
- Lucks JB,
- et al.

- ↵
- ↵
- ↵
- ↵
- Vasa SM,
- Guex N,
- Wilkinson KA,
- Weeks KM,
- Giddings MC

- ↵
- ↵
- ↵
- Pachter L,
- Sturmfels B

- ↵
- Boyd S,
- Vandenberghe L

- ↵
- Cover TM,
- Thomas JA

- ↵
- ↵
- ↵
- ↵
- Mathews DH,
- et al.

- ↵

## Citation Manager Formats

### More Articles of This Classification

### Biological Sciences

### Biophysics and Computational Biology

### Physical Sciences

### Related Content

### Cited by...

- The long noncoding RNA SPRIGHTLY acts as an intranuclear organizing hub for pre-mRNA molecules
- Data-directed RNA secondary structure prediction using probabilistic modeling
- Using in-cell SHAPE-Seq and simulations to probe structure-function design principles of RNA transcriptional regulators
- SHAPE Selection (SHAPES) enrich for RNA structure signal in SHAPE sequencing-based probing data
- Rational experiment design for sequencing-based RNA structure mapping
- SeqFold: Genome-scale reconstruction of RNA secondary structure integrating high-throughput sequencing data
- QuShape: Rapid, accurate, and best-practices quantification of nucleic acid probing information, resolved by capillary electrophoresis
- Multiplexed RNA structure characterization with selective 2'-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq)
- RNA structure probing dash seq