Multiple events on single molecules: Unbiased estimation in singlemolecule biophysics
 ^{†}Kavli Institute of Nanoscience, Faculty of Applied Sciences, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands; and
 ^{‡}Department of Applied Physics and Applied Mathematics, Center for Computational Biology and Bioinformatics, Columbia University, 500 West 120th Street, New York, NY 10027
See allHide authors and affiliations

Communicated by Robert H. Austin, Princeton University, Princeton, NJ, December 6, 2005 (received for review June 24, 2005)
Abstract
Most analyses of singlemolecule experiments consist of binning experimental outcomes into a histogram and finding the parameters that optimize the fit of this histogram to a given data model. Here we show that such an approach can introduce biases in the estimation of the parameters, thus great care must be taken in the estimation of model parameters from the experimental data. The bias can be particularly large when the observations themselves are not statistically independent and are subjected to global constraints, as, for example, when the iterated steps of a motor protein acting on a single molecule must not exceed the total molecule length. We have developed a maximumlikelihood analysis, respecting the experimental constraints, which allows for a robust and unbiased estimation of the parameters, even when the bias well exceeds 100%. We demonstrate the potential of the method for a number of singlemolecule experiments, focusing on the removal of DNA supercoils by topoisomerase IB, and validate the method by numerical simulation of the experiment.
Over the past few years, singlemolecule techniques have started to deliver on their promise as highresolution tools for the study of biological systems. The activity of single proteins such as kinesin, myosin, and topoisomerases (1–3), among others, has been monitored in real time. A hallmark of such singlemolecule experiments, in contrast to bulk experiments, is their unparalleled ability to yield the functional form of the distribution of experimental outcomes and not merely their averages (4) or other statistics (5). Estimating the parameter values that characterize these distributions often yields the information required to construct detailed mechanical models of the system under investigation.
To obtain these parameter values from an experiment, observables are typically binned into a histogram, and the histogram is fitted to the predictions of a model. An alternative method to obtain a distribution parameter is to use the maximumlikelihood method, in which one calculates the value of an unknown parameter in a distribution that maximizes the likelihood of the experimentally observed data (6, 7). The maximumlikelihood method has the advantage that one does not discard information, or introduce one's own biases, in the data through binning. Moreover, the histogramfitting approach, at least when squared loss is used, ignores the fact that the errors induced in the construction of the histogram are themselves a function of the model and the number of counts represented in each bin of the histogram. Another important advantage of using the maximumlikelihood method, which we demonstrate below, is the possibility to build a model that is more faithful to experimental reality. Particularly in biophysical experiments where a multitude of factors, such as finite size or other experimental or biological constraints, unavoidably thwart the assumption that each individual observation is independent and identically distributed (referred to as the “i.i.d.” assumption below), the maximumlikelihood approach facilitates building a model that is both more experimentally sound and more statistically robust.
Frequently, constraints emerge because of experimental limitations in detecting all values of experimental outcomes in a distribution: one receives from the measurement a limited range of values instead of the entire domain. In some experiments, for example, the DNA translocation by the enzyme FtsK described in ref. 8, the experimental outcomes are uncoupled from one another and are i.i.d. However, in the scenario that experimental outcomes are coupled to each other by a global constraint, the range of values that can be detected varies with every new measurement taken. As we demonstrate below, the presence of global constraints is a factor that absolutely requires maximumlikelihood analysis if the biological parameters of a system are to be measured accurately.
In principle, the analysis of bulk experiments can be hampered as much as the analysis of singlemolecule experiments. However, in singlemolecule measurements, one can evaluate each experimental outcome with respect to the constraints. Armed with this knowledge, one can apply the mathematical treatment outlined here and counter the bias in the data accurately.
In this article, we illustrate the problem of global constraints by showing how the measurement process in a singlemolecule study of DNA supercoil relaxation by the enzyme topoisomerase IB imposes global constraints on the probability distribution from which the experimental outcomes are drawn. Subsequently, we generalize the maximumlikelihood method for parameter estimation, enabling one to faithfully recover the unbiased estimate of the distribution parameter from data subject to global constraints. We also derive an expression for the standard deviation of the recovered parameter as a function of the available statistics. Numerical methods confirm our ability to recover the unbiased distribution parameter within the error estimation derived. Finally, we show that the method introduced here can play an important role in the extraction of biological parameters from several other singlemolecule experiments.
Topoisomerase IB Steps Are Subjected to a Global Constraint
We first illustrate the concept of global constraints by using data obtained from the singlemolecule analysis of topoisomerase IB (9). Topoisomerase IB is an enzyme that removes supercoils from a dsDNA molecule by transiently introducing a nick (10, 11). As long as the dsDNA molecule is nicked, torque present in the molecule will swivel the DNA about its intact strand. After a random number of supercoils are released, the enzyme religates the DNA, which terminates the removal of supercoils (9)
We can follow the action of the topoisomerase in real time by using magnetic tweezers (1). The experimental strategy is described elsewhere (12) and summarized in Fig. 1 a. Each time the topoisomerase removes supercoils from the DNA molecule, one observes a discrete step in the height of a μmsized bead attached to the molecule. The height of the bead is equal to the extension of the DNA molecule and is directly related to its linking number (Lk) and the number of supercoils present in the DNA. A small extension of the DNA corresponds to a large number of supercoils present, whereas a large extension corresponds to a few supercoils present in the DNA. Thus, each time the topoisomerase removes supercoils from the DNA, we observe a discrete step in the DNA extension, which is proportional to ΔLk. If the probability of religation (per turn) is constant, the distribution of ΔLk should be an exponential (Fig. 1 b); the average of ΔLk, which is denoted 〈ΔLk〉, is the parametric description of topoisomerase activity we want to deduce from the experiment.
The setup of the experiment, in which the DNA molecule only contains a limited number of supercoils, necessarily introduces global constraints on the distribution. Consequently, at some point the topoisomerase will inevitably remove the last few supercoils that remain in the DNA (red arrow, Fig. 1 a). This final step toward the level of zero supercoils contains only limited information in comparison to previous steps, because the final step is artificially constrained by the fact that no more supercoils remain in the DNA for the topoisomerase to remove. Therefore, when drawing conclusions about the working of the enzyme, one should discard this final step. For convenience, we will define substeps as those steps that do not extend to the level of zero supercoils. Effectively, steps so large that they become the final step are discarded, whereas steps so small that they become substeps are not discarded, which leads to an overrepresentation of small steps. When one simply analyzes the surviving substeps, one obtains a skewed distribution with an incorrect parameter, which can hamper a proper interpretation of the system under investigation. In an actual experiment, one cannot distinguish between the “true” distribution and the distribution that is skewed as a result of the measurement. After all, all one has is the measured distribution (Fig. 1 b), which is skewed. Fortunately, in a singlemolecule measurement this skewing can be corrected for by the method we describe below.
MaximumLikelihood and Domain Constraints
We briefly review the concept of parameter estimation by using the maximumlikelihood method (7). Let P(sk) be a properly normalized probability density function (pdf) for step size s, with parameter k. The goal of the maximumlikelihood method is to obtain an estimate for the parameter of the pdf, which in this general case is k. Because the experimental outcomes are assumed to be statistically independent, the combined probability to find, in n measurements, the data s_{1}, s_{2}, …, s _{n} is given by: where L is the likelihood function. To avoid working with very large or small numbers that can cause computational inaccuracies, and to facilitate the analysis, one often works with the logarithmic likelihood
We now introduce k _{*}, the value of k that maximizes L, otherwise known as the maximumlikelihood value. It is the best estimate for k and we want to calculate its value. We obtain k _{*} by solving (see Supporting Text, which is published as supporting information on the PNAS web site). Assuming that the shape of ln P(ks) near k _{*} is a Gaussian distribution (see Supporting Text), we can calculate the variance σ^{2} of k _{*}, using
In many experimental scenarios, constraints apply to the measurable domain of P. In other words, it may not be experimentally possible to sample all possible values of s. A proper analysis of the data taken in this experimental scenario then requires P to be renormalized by a weighting function g(k): where s _{min} is the minimum value for s that can be detected, and s _{max} is the maximum value for s that can be detected. In this simplest case, s _{min} and s _{max} are constant for each measurement i of s. However, an alternative possibility is for a global constraint to couple all observations (indexed by i) of the variable s to each other. In this second case, one requires a weighting function that varies with each measurement i of s: where s _{min,i} and s _{max,i} are again the minimum and maximum detectable values for s, respectively, but their values are not fixed for all measurements of s. Instead, s _{min,i} represents the minimum detectable value for s that is valid only for the ith measurement of s. Similarly, s _{max,i} represents the maximum detectable value for s that is valid only for the ith measurement of s. Analogously to the i.i.d. case, one can calculate the likelihood function for all of the constrained data, maximize this function, and obtain k _{*}. The value for k _{*} we obtain in this manner is then the unbiased estimate of the parameter of the distribution.
To illustrate the method explicitly, we use an exponential function as a pdf, the appropriate model for a topoisomerase that removes supercoils from DNA with constant probability per turn of religation. The normalized pdf for 0 < s < ∞ is then given by where k = 1∕〈s〉. Here and in the following, brackets indicate averages over the experimental observations. The corresponding likelihood function is given by In the case of i.i.d. observations, we obtain
However, in the case of global constraints, we obtain: and the values for s are drawn from
Having obtained a relation for P, we can calculate the corresponding likelihood. The probability of the data in the presence of global constraints is where N is the number of experimental outcomes of s. The logarithm of L is given by
The parameter measured in the topoisomerase IB experiment is the average change in 〈ΔLk〉, which is equal to 〈s〉 in the terminology used above. Because 〈ΔLk〉 = 1∕k, we take the derivative of Eq. 12 with respect to 1∕k:
We find the maximum in the likelihood by setting Eq. 13 equal to zero, where k _{*} is again the maximumlikelihood value for k, the value that solves Eq. 14 (the summation signs in Eq. 13 have been replaced by brackets in Eq. 14 to denote averages). Eq. 14 can be evaluated numerically to yield 〈ΔLk〉_{*}, the maximumlikelihood value of 〈ΔLk〉. We deduce that the variance of 〈ΔLk〉_{∗} is given by (see also Supporting Text). Comparing Eqs.14 and 15 to the case in which no constraints apply, or s _{max} = ∞ and s _{min} = 0, we recover and as expected.
Eqs. 14 and 15 can be directly applied to the singlemolecule data of DNA supercoil relaxation by topoisomerase IB (Fig. 1 b) to determine the true parameter of the underlying true distribution and its associated standard deviation as a function of the number of experimental outcomes N
Numerical Simulation and the Consequences of Ignoring Global Constraints
We simulate the measurement process in a singlemolecule experiment to quantify the biasing effect on a true distribution as a result of the global constraints and the “sampling error” inherent in the finite number of observations performed. The applicability of Eqs.14 and 15 to the determination of the value and the standard deviation of the distribution parameter can therefore be assessed.
We start by generating an exponential distribution characterized by a parameter that we define as the “true parameter” and is denoted 〈ΔLk〉_{true}. We arbitrarily set it to 〈ΔLk〉_{true} = 60. 〈ΔLk〉_{true} represents the parameter of the distribution that would be measured in the absence of constraints. We call this unbiased distribution the true distribution. Because it is unbiased, we can think of this distribution as representing the physics governing the workings of the enzyme. We then simulate the process of removing supercoils from a DNA molecule that has a maximum of 130 supercoils present (the global constraint ΔL k _{max} ^{0}, Fig. 2 a Inset). We use these values for all simulations. The number of supercoils that the topoisomerase removes each time from the DNA is randomly drawn from our true distribution. As described above, all final steps are subsequently discarded, and the substeps that remain are displayed in a histogram. This histogram reflects what we would measure experimentally and we call it the “measured distribution,” characterized by a “measured parameter,” which is biased and therefore denoted 〈ΔLk〉_{biased}. The true distribution is shown in blue in Fig. 2 a, and the measured distribution is shown in red. As can be clearly seen from Fig. 2 a, the two distributions are not identical. The true distribution obviously yields an average of 60 (in units of ΔLk). We obtain 〈ΔLk〉_{biased} by fitting the measured distribution to an exponential in the range between zero and ΔL k _{max} ^{0}. In fact, the functional form is altered slightly because of the global constraints, as discussed formally in Supporting Text and Fig. 5, which is published as supporting information on the PNAS web site. Note that the value for 〈ΔLk〉_{biased} is thus biased because of a combination of factors: (i) the presence of global constraints, (ii) the number of experimental outcomes N, and (iii) the analysis by histogram fitting rather than maximum likelihood. For the particular values for 〈ΔLk〉_{true} and ΔL k _{max} ^{0} we used, 〈ΔLk〉_{biased} was 46, which is an underestimate of ≈23% in comparison to 〈ΔLk〉_{true}. Indeed, the measurement process has biased small steps over large steps, skewing the measured distribution toward lower values. We now focus more closely on the relationship between the magnitude of the constraint and the resulting degree of bias. Fig. 2 b plots 〈ΔLk〉_{biased} as a function of the severity of the global constraint ΔL k _{max} ^{0}. We plot 〈ΔLk〉_{true} as a blue horizontal line in Fig. 2 b. The discrepancy between 〈ΔLk〉_{biased} and 〈ΔLk〉_{true} caused by the global constraints is thus reflected graphically as the distance between the red curve and the blue line in Fig. 2 b; in the absence of any biasing effect, all values for 〈ΔLk〉_{biased} would fall on top of the blue line. We describe three salient features of Fig. 2 b. First, as the constraint becomes less severe (ΔL k _{max} ^{0} increases), the magnitude of the bias decreases. Conversely, as the constraint becomes more severe (ΔL k _{max} ^{0} decreases), the magnitude of the bias increases. Second, the discrepancy between 〈ΔLk〉_{biased} and 〈ΔLk〉_{true} is very large (>100%) for small values of ΔL k _{max} ^{0}. For example, for ΔL k _{max} ^{0} = 20, 〈ΔLk〉_{biased} = 27, which constitutes an underestimation of 〈ΔLk〉_{true} by ≈120%. Although this is an example that might not generally be observed experimentally, we include it to emphasize that our method can recover 〈ΔLk〉_{true} robustly even in the case of extreme bias, as we show below. The third feature of Fig. 2 b highlights that in a regime where one naively would expect virtually no biasing effect because of the constraints, the bias is significant nevertheless. Indeed, for ΔL k _{max} ^{0} = 800, which is well over an order of magnitude larger than 〈ΔLk〉_{true}, one still observes that 〈ΔLk〉_{true} is underestimated by ≈7%. This surprising behavior stems from the fact that the constraints on the distribution vary from step to step and are on average smaller than ΔL k _{max} ^{0}. We now describe how we can nonetheless obtain an accurate value for 〈ΔLk〉_{true}, even in cases where 〈ΔLk〉_{true} is severely underestimated.
By monitoring the DNA extension, either in a real experiment or the simulation discussed here, we know the number of supercoils that remain in the DNA molecule before the topoisomerase removes a number of supercoils; that is to say, we know the constraints that apply to the measurement of each substep. The important point is that although the constraints vary for each step, they are known, and we can therefore substitute their values for s _{max,i} in Eq. 10 . We also know the value of s _{min,i}, which is the minimum detectable number of supercoils removed and is determined by the noise in the height of the bead. This is beyond the scope of this work, and for all practical purposes, we give s _{min,i} the fixed value of zero. We now solve Eq. 14 and call the solution 〈ΔLk〉_{calculated}. In this calculation, we have used N = 10^{5} substeps. To get an idea of the reproducibility in 〈ΔLk〉_{calculated}, we repeat the calculation Q = 10^{5} times and build a histogram of the solutions (Fig. 3 a). Importantly, we calculate that the mean of the distribution of 〈ΔLk〉_{calculated} is 60, which is identical to the value we have chosen as 〈ΔLk〉_{true}, the true parameter of the true distribution. Therefore, we conclude that the analysis method accurately recovers the true parameter, despite the biasing effect of the measurement.
In an experiment, it is not only important to recover the true parameter of the distribution but also to know its associated standard deviation as a function of the number of experimental outcomes N. We have therefore calculated the standard deviation of the 〈ΔLk〉_{calculated} distribution (Fig. 3 a) according to This procedure is repeated for nine different values of N and their values are plotted as solid black circles in Fig. 3 b. They can be compared with the theoretically predicted values for σ, denoted as σ_{theory}, calculated by using Eq. 15 . Fig. 3 b also plots σ_{theory} as a function of N as red and blue dots. Red dots are calculations of σ_{theory} with global constraints, and blue dots are calculations of σ_{theory} in the absence of constraints (ΔL k _{max} ^{0} = ∞). As is evident from Fig. 3 b, the solid circles fall on top of the theoretical prediction σ_{theory} given by Eq. 15 . Thus, we have shown that Eq. 15 predicts the standard deviation associated with 〈ΔLk〉_{true} accurately. From this result, we can draw an important conclusion, namely that in any given situation with global constraints, an experimenter can assess whether enough statistics have been obtained to determine the unbiased value of the true distribution to the desired accuracy.
Application of the Method
The method outlined above deals with global constraints in the domain of the distribution of experimental outcomes. Therefore, the method should in principle be used in all experiments that involve global constraints and whose experimental outcomes are not distributed like (a series of) delta functions. An example of outcomes distributed like a delta function is the fixed step size of 37 nm with which a myosin protein walks over an actin filament (3). Although experimentally it seems that one measures a Gaussian distribution of observables, the Gaussian shape in fact arises from stochastic fluctuations around a fixed true value. The function describing these processes is a delta function, peaked at the fixed true value of the observable. Mathematically, this process implies that the weight function [g _{i}(k), Eq. 9 ] is always equal to one, and consequently the pdf is unaltered by constraints. Therefore, the likelihood function and the observable that maximizes it remain unaffected, and one is not required to use this method.
We expect that the analysis method outlined here could guide the proper design and analysis of experiments including assays of the processivity of helicases, polymerases, and other translocation enzymes, singlemolecule Föorster resonance energy transfer (FRET) measurements, and realtime singlemolecule tracking of DNA condensation. For clarity, we describe a few of these experiments in more detail below.
Processivity Measurements on Limited Substrate
Some substrates, such as ssRNA or dsRNA molecules, are practically hard to prepare in lengths longer than a few kb if they are to be used in singlemolecule techniques (13). If one wants to measure the distribution of the processivity of a biomolecule that tracks along the RNA, one may find that the processivity exceeds the length of the RNA. In such a case, one is required to discard the final processive action, because it is artificially constrained by the fact that there is no more dsRNA substrate for the biomolecule to move on. The constraint is global, because the length of the RNA molecule that is available for the biomolecule shrinks as it proceeds. This dilemma is summarized in Fig. 4 a. An example of an enzyme translocating on RNA is the RNAdependent RNA polymerase P2 from φ_{6} bacteriophage (14). This polymerase can perform an RNA synthesis reaction by using either dsRNA or ssRNA as a template. The processivity, which can be only roughly estimated from experiments, is on the order of 10 kb or more (D. Bamford, personal communication) and is comparable to the length of the RNA substrate. In singlemolecule processivity measurements for P2 polymerase and other enzymes, we expect that our treatment would be instrumental in determining the mean processivity correctly.
Transitions in FRET Efficiency
FRET efficiency depends on the distance between a donor dye and acceptor dye and ranges between zero (no FRET) and one (maximum FRET) (4,15) (in practice, the range in which meaningful FRET measurements can be performed is even smaller because of the lack of sensitivity close to both the noFRET and the maximumFRET regimes). Changes in FRET efficiency can in theory be used to quantify conformational changes in biomolecules (e.g., in the folding of RNA molecules or in proteins). Future experiments measuring distributions of changes in FRET efficiency could be biased because of the global constraint imposed by the limited meaningful range in FRET efficiency. For example, one could measure a series of conformational changes in an RNA molecule in which each conformational change is associated with a transition in FRET efficiency between a donor and acceptor dye attached to two parts of the RNA molecule (e.g., refs. 16–18), as schematically depicted in Fig. 4 b. In such an experiment, one would be required to discard those transitions that extend to or exceed the limits of the FRET efficiency range. To correct for the ensuing bias toward small FRET transitions and thus for a correct analysis of the distribution, one needs to apply the method described here.
Concluding Remarks
Experimental outcomes that are nonglobally constrained in that they can be assumed to be i.i.d. can be relatively easily analyzed in their measured range. However, when such analysis is performed on outcomes that are coupled by global constraints, severe bias in the parameter estimation can occur. We have therefore generalized the maximumlikelihood method for parameter estimation to include distributions that have global constraints. Using this method, we robustly recover the unbiased distribution parameter from biased data, independent of the severity of the bias. In addition, we have adapted the relation describing errors in the estimation for distribution parameters for the case of global constraints, which allows an experimenter to assess whether enough data points have been accumulated to predict the true parameter to the desired accuracy. Finally, we show that global constraints can occur in a variety of experiments, all of which would benefit from using this method.
Acknowledgments
We thank Ulrich Keyser for stimulating discussions and Cees Dekker and Thijn van der Heijden for a critical reading of the manuscript. This work was supported by the Technische Universiteit Delft, the Foundation for Fundamental Research on Matter, and the Netherlands Organization for Scientific Research.
Footnotes
 ^{§}To whom correspondence should be addressed. Email: nynke.dekker{at}mb.tn.tudelft.nl

Author contributions: D.A.K. and N.H.D. designed research; D.A.K. and C.H.W. performed research; C.H.W. contributed new reagents/analytic tools; D.A.K. analyzed data; and D.A.K., C.H.W., and N.H.D. wrote the paper.

Conflict of interest statement: No conflicts declared.
 Abbreviations:
 i.i.d.,
 independent and identically distributed;
 pdf,
 probability density function;
 FRET,
 Föorster resonance energy transfer.
 © 2006 by The National Academy of Sciences of the USA
References
 ↵
 ↵

↵
 Yildiz A. ,
 Forkey J. N. ,
 McKinney S. A. ,
 Ha T. ,
 Goldman Y. E. ,
 Selvin P. R.

↵
 Weiss S.

↵
 Svoboda K. ,
 Mitra P. P. ,
 Block S. M.
 ↵

↵
 Rice J. A.
 ↵
 ↵
 ↵
 ↵

↵
 Strick T. R. ,
 Allemand J. F. ,
 Bensimon D. ,
 Bensimon A. ,
 Croquette V.

↵
 Dekker N. H. ,
 Abels J. A. ,
 Veenhuizen P. T. ,
 Bruinink M. M. ,
 Dekker C.
 ↵

↵
 Ha T. ,
 Ting A. Y. ,
 Liang J. ,
 Caldwell W. B. ,
 Deniz A. A. ,
 Chemla D. S. ,
 Schultz P. G. ,
 Weiss S.

↵
 Blanchard S. C. ,
 Kim H. D. ,
 Gonzalez R. L., Jr. ,
 Puglisi J. D. ,
 Chu S.
 ↵

↵
 Zhuang X. ,
 Kim H. ,
 Pereira M. J. ,
 Babcock H. P. ,
 Walter N. G. ,
 Chu S.
Citation Manager Formats
Sign up for Article Alerts
Jump to section
 Article
 Abstract
 Topoisomerase IB Steps Are Subjected to a Global Constraint
 MaximumLikelihood and Domain Constraints
 Numerical Simulation and the Consequences of Ignoring Global Constraints
 Application of the Method
 Processivity Measurements on Limited Substrate
 Transitions in FRET Efficiency
 Concluding Remarks
 Acknowledgments
 Footnotes
 References
 Figures & SI
 Info & Metrics