New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Analytical distributions for stochastic gene expression

Edited by Charles S. Peskin, New York University, New York, NY, and approved September 5, 2008 (received for review April 22, 2008)
This article has a correction. Please see:
Abstract
Gene expression is significantly stochastic making modeling of genetic networks challenging. We present an approximation that allows the calculation of not only the mean and variance, but also the distribution of protein numbers. We assume that proteins decay substantially more slowly than their mRNA and confirm that many genes satisfy this relation by using highthroughput data from budding yeast. For a twostage model of gene expression, with transcription and translation as firstorder reactions, we calculate the protein distribution for all times greater than several mRNA lifetimes and thus qualitatively predict the distribution of times for protein levels to first cross an arbitrary threshold. If in addition the fluctuates between inactive and active states, we can find the steadystate protein distribution, which can be bimodal if fluctuations of the promoter are slow. We show that our assumptions imply that protein synthesis occurs in geometrically distributed bursts and allows mRNA to be eliminated from a master equation description. In general, we find that protein distributions are asymmetric and may be poorly characterized by their mean and variance. Through maximum likelihood methods, our expressions should therefore allow more quantitative comparisons with experimental data. More generally, we introduce a technique to derive a simpler, effective dynamics for a stochastic system by eliminating a fast variable.
Gene expression in both prokaryotes and eukaryotes is inherently stochastic (1–4). This stochasticity is both controlled and exploited by cells and, as such, must be included in models of genetic networks (5,6). Here we will focus on describing intrinsic fluctuations, those generated by the random timing of individual chemical reactions, but extrinsic fluctuations are equally important and arise from the interactions of the system of interest with other stochastic systems in the cell or its environment (7,8). Typically, experimental data are compared with predictions of mean behaviors and sometimes with the predicted standard deviation around this mean, because protein distributions are often difficult to derive analytically, even for models with only intrinsic fluctuations.
We will propose a general, although approximate, method for solving the master equation for models of gene expression. Our approach exploits the difference in lifetimes of mRNA and protein and is valid when the protein lifetime is greater than the mRNA lifetime. Typically, proteins exist for at least several mRNA lifetimes, and protein fluctuations are determined by only timeaveraged properties of mRNA fluctuations. Following others (7,9–11), we will use this timeaveraging to simplify the mathematical description of stochastic gene expression.
For many organisms, singlecell experiments have shown that gene expression can be described by a threestage model (3,4,12–14). The promoter of the gene of interest can transition between two states (10,15–17), one active and one inactive. Such transitions could be from changes in chromatin structure, from binding and unbinding of proteins involved in transcription (3,4,12), or from pausing by RNA polymerase (18). Transcription can only occur if the promoter region is active. Both transcription and translation, as well as the degradation of mRNAs and proteins, are usually modeled as firstorder chemical reactions (5).
By taking the limit of a large ratio of protein to mRNA lifetimes, we will study the threestage model and a simpler twostage version where the promoter is always active. For this twostage model, we will derive the protein distribution as a function of time. We will derive the steadystate protein distribution for the full, threestage model. We also include expressions for the corresponding mRNA distributions (14,16) in the supporting information (SI) Appendix.
A TwoStage Model of Gene Expression
We will first consider the model of gene expression in Fig. 1A (9). This model assumes that the promoter is always active and so has two stochastic variables: the number of mRNAs and the number of proteins. The probability of having m mRNAs and n proteins at time t satisfies a master equation: with v_{0} being the probability per unit time of transcription, v_{1} being the probability per unit time of translation, d_{0} being the probability per unit time of degradation of an mRNA, and d_{1} being the probability per unit time of degradation of a protein. By defining the generating function, F(z′,z), by F(z′,z) = ∑_{m,n}z′^{m}z^{n}P_{m,n}, we can convert Eq. 1 into a firstorder partial differential equation: where we have rescaled (19), with a = v_{0}/d_{1}, b = v_{1}/d_{0}, γ = d_{0}/d_{1}, and τ = d_{1}t, and where u = z′ − 1 and v = z − 1.
If the protein lifetime is much greater than the mRNA lifetime and γ ≫ 1, Eq. 2 can be solved by using the method of characteristics. Let r measure the distance along a characteristic which starts at τ = 0 with u = u_{0} and v = v_{0} for some constant u_{0} and v_{0}, then Eq. 2 is equivalent to (20) Consequently direct integration implies r = v = v_{0}e^{τ}. For γ ≫ 1, u(v) obeys (SI Appendix) or as v = v_{0}e^{τ} > v_{0} for τ > 0. When γ ≫ 1, u rapidly tends to a fixed function of v: for most of a protein's lifetime, the dynamics of mRNA is at steady state. The generating function then obeys Intuitively, Eq. 6 arises from Eq. 2 because large γ causes the term in square brackets in Eq. 2 to tend to zero to keep F(u,v) finite and well defined. Eq. 6 describes only the distribution for protein numbers: F(u,v) is just a function of v. Terms of higher order in γ^{−1} will depend on u. Large γ implies that most of the mass of the joint probability distribution of mRNA and protein is peaked at m = 0: P_{m,n} ≃ P_{0,n}.
We can find the probability distribution for protein numbers as a function of time by integrating Eq. 6. Integration gives assuming that no proteins exist at τ = 0. From the definition of a generating function, expanding F(z) in z gives (SI Appendix) where P_{n}(τ) = P_{0,n}(τ). Here, _{2}F_{1}(a,b,c; z) is a hypergeometric function and Γ denotes the gamma function (21). Eq. 8 is valid when γ ≫ 1, τ ≫ γ^{−1} to allow the mRNA distribution to reach steady state, and a and b are finite. The mean, 〈n〉 = ab(1 − e^{−τ}), and the variance, 〈n^{2}〉 − 〈n〉^{2} = 〈n〉 (1 + b + be^{−τ}), of Eq. 8 agree with earlier results (9). At steady state τ ≫ 1 and which is a negative binomial distribution. We verified Eq. 8 and Eq. 9 with stochastic simulations by using the Gibson–Bruck version (22) of the Gillespie algorithm (23) and the Facile network complier and stochastic simulator (24). If γ ≫ 1, Eq. 9 accurately predicts the distribution described by Eq. 1 (Fig. 1B and C), but it fails as expected for smaller γ. This effect can be quantified by calculating the Kullback–Leibler divergence between the predicted and simulated distributions for different γ (Fig. 1D). Eq. 8 is illustrated in Fig. 2A and B. As well as γ ≫ 1, times with τ > γ^{−1} are necessary for negligible Kullback–Leibler divergences (Fig. 2C).
Eq. 8 allows complete characterization of the Markov process underlying the twostage model. The “propagator” probability, P_{nk}(τ), which is the probability of having n proteins at time τ given k proteins initially, satisfies (SI Appendix) where P_{n}(τ) = 0 if n < 0. With Eqs. 8 and 10, twostage gene expression is in principle completely characterized for γ ≫ 1 and τ ≫ γ^{−1}. For example, we can calculate how the noise in protein numbers, η (their standard deviation divided by their mean), changes with time. If protein numbers initially have a distribution P_{k}^{(0)}, then at a time τ their distribution will be ∑_{k}P_{nk}(τ)P_{k}^{(0)}. The noise of this distribution can either increase, decrease, or behave nonmonotonically as time increases (Fig. 2D). We can also calculate nonsteadystate autocorrelation functions and firstpassage time distributions for protein levels to first cross a threshold, N (with some standard numerics). In general, such distributions are only qualitative because contributions from times with τ < γ^{−1} are always relevant. Accuracy can be improved by having γ ≫ 10 and a sufficiently high threshold (Fig. 2E and SI Appendix).
We can derive Eq. 9 more intuitively. An mRNA undergoes a competition between translation and degradation because ribosomes and degradosomes bind to it mutually exclusively (25). For each competition, the probability of a ribosome binding to the mRNA is
Given that the synthesis and degradation of one mRNA generates a burst of r proteins, then the number of proteins at steady state is given by the typical number of mRNAs synthesized during a protein lifetime,
By assuming explicitly that protein synthesis occurs in bursts, we can derive an effective master equation for gene expression that considers only proteins, but implicitly includes mRNA fluctuations (19,30). We will show that this master equation has Eq. 8 as its solution and so is equivalent to the large γ approximation to Eq. 1, the master equation for both mRNA and protein. If we assume that each mRNA synthesized leaves behind a burst of r proteins then where the size of each burst has been determined by Eq. 11 (30). Eq. 14 has Eq. 7 as its generating function (SI Appendix). By introducing bursts of protein synthesis, mRNA fluctuations can be absorbed into a onevariable master equation provided γ ≫ 1. Friedman et al. used a continuous version of this approach with an exponential burst distribution inspired by their experimental results (28,29). They derived a gamma distribution for steadystate protein numbers (19). Eq. 9 tends to this distribution for large n (SI Appendix). Friedman et al. (19) also demonstrated that the burst approximation remains valid when negative or positive feedback is included.
In summary, we have shown that exploiting the difference between protein and mRNA lifetimes through a large value of γ, but finite a and b, allows powerful mathematical simplifications. Large γ implies that mRNA is at steady state for most of the lifetime of a protein and that the probability mass of the joint distribution of protein and mRNA is peaked at zero mRNAs, although the mean number of mRNAs need not be zero (Fig. 1). The number of proteins translated from an mRNA obeys a geometric distribution in both the twostage and threestage models (25), but large γ implies that the proteins translated from an mRNA all appear, on protein timescales, simultaneously so that the synthesis and degradation of an mRNA leaves behind a geometric burst of proteins. If γ < 1, then proteins synthesized from a particular mRNA will be degraded as further proteins are synthesized, and the distribution describing the number of proteins remaining once the mRNA is degraded will no longer be geometric. Explicitly including geometric bursts accurately describes the effects of mRNA fluctuations on the distribution of protein numbers when γ ≫ 1. It allows the model of Fig. 1A to be described by a onevariable master equation: Eq. 14.
A ThreeStage Model of Gene Expression
We next consider the full threestage model of gene expression (Fig. 3A). We find the protein distribution for this system by taking the large γ limit of the master equation. Let P_{m,n}^{(0)} be the probability of having m mRNAs and n proteins when the DNA is inactive and P_{m,n}^{(1)} be the probability of having m mRNAs and n proteins when the DNA is active. We then have two coupled equations: where κ_{0} = k_{0}/d_{1} and κ_{1} = k_{1}/d_{1}.
We solve Eqs. 16 and 17 at steady state by taking the large γ limit of the equivalent equations for their generating functions (a generating function is defined for each state of the promoter). Our approach is a natural extension of the method used to solve the twostage model (SI Appendix). We find that
where
and ϕ^{2} = (a + κ_{0} + κ_{1})^{2} − 4aκ_{0}. Eq. 18 is valid when γ ≫ 1 and a and b are finite. The mean of this distribution is 〈n〉 =
The protein distribution for the threestage model can have similar behavior to the twostage model of Fig. 1A, but it can also generate a bimodal distribution with a peak both at zero and nonzero numbers of molecules (Fig. 3D). This bimodality is not a reflection of an underlying bistability, but arises from slow transitions driving the DNA between active and inactive states (5,17,31,32).
As expected, Eq. 18 recovers the negative binomial distribution under certain conditions. It tends to Eq. 9 when κ_{1} → 0: the DNA is then always active at steady state. When κ_{1} = 0, Eqs. 19 and 20 imply that α = a and β = κ_{0}, and recall that _{2}F_{1}(a,0,c;z) = 1 for all a, c, and z. Similarly, when κ_{0} and κ_{1} are both large, but κ_{0}/κ_{1} is fixed, then α → κ_{0} + κ_{1} and
Discussion
We have shown we can calculate distributions for protein numbers by assuming that protein lifetimes are longer than mRNA lifetimes with a, the number of mRNAs transcribed during a protein remaining, and b, the number of proteins translated during a mRNA lifetime, remaining finite. Fig. 4A shows the ratio γ measured for almost 2,000 genes in budding yeast. Approximately 80% of the genes have γ > 1 and the median value is ≈3 (we include the dataset in SI Appendix). We therefore expect our predicted distributions to be widely applicable in budding yeast. In bacteria, too, γ is expected to be above 1 because mRNA lifetimes are usually minutes (they are typically tens of minutes in yeast) and protein lifetimes are often determined by the length of the cell cycle (typically 30 or more minutes) (33).
Values of γ > 1 reduce protein fluctuations by allowing more averaging of the underlying mRNA fluctuations (Eq. 21). We indeed observe a small, but statistically significant, negative correlation between total noise and γ by using the data of Newman et al. (34) (a rank correlation of ≃ −0.2 with a P value of 10^{−6}). In Fig. 4B, we have calculated the median γ for yeast genes in different gene ontology classes. All classes have a median γ > 1. Proteins involved in transferring nucleotidyl groups, which include RNA and DNA polymerases, have high median γ > 5, presumably because high stochasticity in these proteins can undermine many cellular processes. Similarly, proteins that contribute to the structural integrity of protein complexes have a median γ > 5. Large fluctuations can vastly reduce the efficiency of complex assembly by preventing complete complexes forming because of a shortage of one or more components (11,35). Perhaps surprisingly transcription factors have a low median γ > 1. Although low γ does increase stochasticity, it can allow quick response times if the protein degradation rate is high. A high protein degradation rate may also keep numbers of transcription factors low to reduce deleterious nonspecific chromosomal binding.
We show that protein synthesis occurs in bursts in both the two and the threestage model when γ ≫ 1. Such bursts of gene expression have been measured in bacteria and eukaryotes (12–14,28,29). They allow mRNA to be replaced in the master equation by a geometric distribution for protein synthesis for all times greater than several mRNA lifetimes if their source is translation and the protein lifetime is substantially longer than the mRNA lifetime. Such an approach has already been proposed (19), but without determining its validity. Similarly, if mRNA fluctuations are negligible, the master equation reduces to one variable (protein), and describing the protein distribution becomes substantially easier (31,36).
An important problem in systems biology is to determine which properties of biochemical networks and the intracellular environment must be modeled to make accurate, quantitative predictions. Besides obscuring the process driving the observed phenotype, models more complex than needed are harder to correctly parameterize and to simulate to generate predictions. Our results show that complexity, here two states of the promoter, can be modeled by effective parameters that under certain conditions will give accurate predictions of the entire distribution of protein numbers: Eq. 22. Alternatively, they show that not only the mean and variance (37), but also the protein distribution may not have enough information to determine the biochemical mechanism generating gene expression from measurements of protein levels: Eqs. 9 and 22. Such effects are likely to be compounded by nonsteadystate dynamics (Fig. 2) and extrinsic fluctuations. Collecting data on the corresponding mRNA distribution may disfavor the twostage over the threestage model because mRNA distributions in the threestage model can have two peaks even though the protein distribution has only one (6). In general, though, time series measurements, preferably with and without perturbations, may provide the most discriminative power (37).
Experimental measurements are best compared with the predicted distribution rather than its mean, standard deviation, or mode. Both the protein and mRNA distributions are typically not symmetric and may not be unimodal. Consequently, the mean and the mode can be significantly different, and the standard deviation can be a poor measure of the width of the distribution at halfmaximum (38). Such distributions are poorly characterized by the commonly used coefficient of variation because they are not locally Gaussian around their mean (Figs. 1–3). In addition, fitting moments to find model parameters can be challenging. Moments, more so than distributions, are functions of combinations of parameters and can also be badly estimated without large amounts of data, particularly for asymmetric distributions. We therefore believe a Bayesian or maximum likelihood approach is most suitable where the experimental protocol is replicated by the fitting procedure and explicitly accounts for the shape of the distribution and the number of measurements. For example, irrespective of how many measurements are available, the likelihood of the data for a particular set of parameters can always be determined from the assumed distribution of protein numbers. Our analytical expressions will greatly speed up such approaches by avoiding large numbers of simulations and by aiding in deconvolving extrinsic fluctuations that can substantially change the shape of protein distributions (8).
Our results should also allow more general fluctuation analyzes of gene expression data. Such analyzes convert fluorescence measurements into absolute units (numbers of molecules) by exploiting that the magnitude of fluctuations is determined by the number of molecules independently of how those numbers are measured (39). Converting into absolute units is essential if information from different experiments is to be combined into a larger, predictive framework, a goal of systems biology.
More generally, our approach is an example of a technique to simplify the dynamics of a stochastic system by exploiting differences in timescales. We remove a fast stochastic variable through replacing a constant parameter (the parameter a) by a timedependent parameter (the burst distribution) whose variation captures the effects of fluctuations in the fast variable on the dynamics of the slow one (40).
Acknowledgments
We thank an anonymous referee for showing us Eq. 12. P.S.S. is a recipient of a Tier II Canada Research Chair. V.S. and P.S.S. are supported by National Sciences and Engineering Research Council (Canada).
Footnotes
 ^{3}To whom correspondence should be addressed. Email: swain{at}cnd.mcgill.ca

Author contributions: V.S. and P.S. designed research, performed research, analyzed data, and wrote the paper.

↵^{1}Department of Mathematics, Imperial College, London SW7 2BZ, UK.

↵^{2}Center for Systems Biology, University of Edinburgh, Edinburgh EH9 3JU, UK.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0803850105/DCSupplemental.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 ↵
 Elowitz MB,
 Levine AJ,
 Siggia ED,
 Swain PS
 ↵
 ↵
 Raser JM,
 O'shea EK
 ↵
 ↵
 ↵
 Swain PS,
 Elowitz MB,
 Siggia ED
 ↵
 Shahrezaei V,
 Ollivier JF,
 Swain PS
 ↵
 Thattai M,
 van Oudenaarden A
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Friedman N,
 Cai L,
 Xie XS
 ↵
 Zwillinger D
 ↵
 Abramowitz M,
 Stegun IA
 ↵
 ↵
 ↵
 ↵
 McAdams HH,
 Arkin A
 ↵
 van Kampen NG
 ↵
 Prochaska BJ
 ↵
 ↵
 Yu J,
 Xiao J,
 Ren X,
 Lao K,
 Xie XS
 ↵
 ↵
 ↵
 ↵
 Neidhardt FC
 Bremer H,
 Dennis PP
 ↵
 ↵
 Fraser HB,
 Hirsh AE,
 Giaever G,
 Kumm J,
 Eisen MB
 ↵
 ↵
 Pedraza JM,
 Paulsson J
 ↵
 ↵
 ↵
 ↵
 Belle A,
 Tanay A,
 Bitincka L,
 Shamir R,
 O'shea EK
 ↵
 Grigull J,
 Mnaimneh S,
 Pootoolal J,
 Robinson MD,
 Hughes TR
 ↵
 Wang Y,
 et al.