## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Inverse Gillespie for inferring stochastic reaction mechanisms from intermittent samples

Edited by Eric D. Siggia, The Rockefeller University, New York, NY, and approved June 19, 2013 (received for review August 21, 2012)

## Abstract

Gillespie stochastic simulation is used extensively to investigate stochastic phenomena in many fields, ranging from chemistry to biology to ecology. The inverse problem, however, has remained largely unsolved: How to reconstruct the underlying reactions de novo from sparse observations. A key challenge is that often only aggregate concentrations, proportional to the population numbers, are observable intermittently. We discovered that under specific assumptions, the set of relative population updates in phase space forms a convex polytope whose vertices are indicative of the dominant underlying reactions. We demonstrate the validity of this simple principle by reconstructing stochastic models (reaction structure plus propensities) from a variety of simulated and experimental systems, where hundreds and even thousands of reactions may be occurring in between observations. In some cases, the inferred models provide mechanistic insight. This principle can lead to the understanding of a broad range of phenomena, from molecular biology to population ecology.

Data-driven reverse engineering of dynamical systems has been a problem of longstanding interest (1, 2). Classical modeling methods have often used deterministic approximations based on coupled ordinary differential equations, yet many natural processes involving populations of interacting agents have an important stochastic component. To be valid, deterministic approximations require large agent populations that effectively average out the macroscopic effect of stochastic fluctuations. Recent investigations into cellular dynamics have revealed situations where this approximation fails. Stochastic effects have been implicated in diverse phenomena ranging from cell differentiation to pathogen virulence (3), maintenance of circadian rhythms (4, 5), bistability (6), excitability (7, 8), regulation of intracellular protein levels (9), and also in predator–prey in ecosystems (10). Experimental evidence confirms that gene expression occurs in abrupt stochastic bursts (11), and intercellular stochastic variation impacts high-level outcomes such as development and aging (12). Researchers are now able to estimate the number of specific macromolecules within a cell (13, 14), prompting a need for tools that explicitly model stochastic phenomena.

## Standard Forward Algorithm and the Inverse Problem

Gillespie’s stochastic simulation algorithm (SSA) (10) is an accurate procedure for simulating time evolution of finite interacting populations in continuous time; defining a computational approach to simulate the intractable chemical master equation. SSA along with its generalizations (15, 16) underlies most state-of-the-art stochastic kinetic modeling techniques (17).

Although providing an elegant technique to simulate a completely specified Markov jump process (MJP) (18), as is required to simulate chemical interaction of discrete agents, it is not particularly clear how observed time series data may be incorporated in a principled manner in Gillespie’s SSA. Thus, the inverse problem of inferring the structure and parameters of biochemical networks from observed expression data has remained largely unsolved.

In the trivial case where individual reactions are directly observable, the most likely set of reactions can be determined easily. However, population changes are usually only observed intermittently. Gillespie’s formulation establishes that the probabilities of individual reactions are possibly nonlinear functions of instantaneous population numbers; because transpired reactions change the population counts of the participating species, each successive occurrence of the same reaction transpires with a possibly different occurrence probability, making the identification process notoriously difficult. While the calibration problem, i.e., estimation of parameters, namely the reaction propensities, given a model structure and observed expression data has seen significant progress (19⇓⇓–22), there is little reported work on de novo structure identification.

The present work delineates a principle to reverse-engineer observed population time series for de novo structural identification along with estimation of reaction propensities (Fig. 1). We only need intermittent measurements of the system state represented as population counts of each participating species, and may skip many (on the order of thousands in some of our examples) reactions between successive measurements. We do have limitations, e.g., following Gillespie we assume our system is well-mixed, that reactions have constant propensities, and additionally, our approach requires the system to admit a probabilistic equilibrium, and even then not all reactions are guaranteed to be identifiable (see *Assumptions, Limitations, and Error Sources* for details). Even with these limitations, we successfully demonstrate de novo inference on both simulated data from synthetic systems, and on experimental data from ecological and biochemical scenarios. Also, from a computational viewpoint, our technique is significantly simpler than Markov chain Monte Carlo (MCMC) simulation-based parameter-estimation schemes, and is validated to produce parameter estimates close to reported MCMC calibration techniques (20, 23) (*SI Appendix*, section SI-9).

## Key Insight

For a multispecies system evolving stochastically via a small number of underlying reactions occurring with approximately constant propensities, with the population counts of the interacting species observed intermittently, the set of relative update vectors has an identifiable geometric structure. The specific geometry of the set of relative updates is indicative of the dominant reactions driving the system, and can be robustly identified under missed observations, measurement noise, and even limited stochastic fluctuations of the reaction propensities. Using Gillespie’s original formulation of stochastic reaction systems, and some additional assumptions, we claim that the set of relative population update vectors defines a convex bounded polytope. More importantly, each vertex of this polytope has a direction cosine which is the same as that of the population update realized by a driving reaction. Furthermore, this alignment is independent, up to a certain point, of the degree of intermittency of the observations. We show that these polytope vertices, and hence the direction cosines of the hidden dominant reactions, may be identified from the observed population time series alone, with no a priori system knowledge; thus defining a unique approach for de novo structure identification.

A reaction system *G*, as originally formalized by Gillespie, is a triple . and are integer-valued matrices (*R*: number of reactions, *P*: number of species) defining the reactions:where are elements of respectively, and are the reaction propensities. *G* may be simulated in the forward direction using Gillespie’s SSA once the initial population counts of each species is known. Given a value of the observation gap , we define the density function , such that from the current population vector or state *q*, the probability that the next observed state is , given , is . In the limit , every reaction is observable; is the probability of realizing from *q* by a single reaction. For , is the probability of observing next as a result of an unspecified number of unobserved reactions occurring within . Finally, given , we define the reaction polytope (RP):where is the state observed immediately after *q*, denotes the convex hull + interior, and is a finite set of population vectors visited by the system.

For a reaction system , the species-specific population updates brought about by the driving reactions define the reaction directions, which are in fact the direction cosines of the rows of the matrix . Under our assumptions, the RP is necessarily a convex bounded polytope (*SI Appendix*, sections SI-1.4 and SI-2). Additionally, for any choice of *η*, *X* and up to a limiting magnitude of the observation gap , each vertex of the RP necessarily coincides with a reaction direction (*SI Appendix*, Proposition SI-2). As a consequence, we may estimate the direction cosines of the RP vertices and hence the directions of the dominant reactions by identifying the bounding hyperplanes for the set of observed relative updates. Of course, the presence of noise and the finiteness of the observed time series data implies that simple calculation of the bounding hyperplanes, e.g., by taking the convex hull, would lead to spurious RP vertices. We solve this problem via the application of a Hough transform (HT) (24) on the set of relative updates (*SI Appendix*, section SI-2), and by carrying out robust identification of the bounding hyperplanes in the Hough parameter space. HT maps each observed point in the space of relative update vectors to a sinusoidal hypersurface (a sinusoidal curve in 2D) in the Hough space, and the intersection of these hypersurfaces (lines in 2D) indicate hyperplanes in the original space. The robustness of the approach (*SI Appendix*, section SI-2.3) arises from the established statistical properties of the HT estimator that is shown to be strongly consistent, with a finite-sample-replacement breakdown point (25, 26) close to 50%. Whereas traditionally the HT has been used to detect straight edges in 2D images in computer vision applications, we generalize it to *P* dimensions, and the intersection of the Hough hyperplanes (Hough lines for ) indicates the identifiable reaction directions (Fig. 2) with a high probability (approaching 1 exponentially fast with the length of the observed time series; *SI Appendix*, Proposition SI-2.2).

To see an example of the identified RP, bounded by Hough lines, examine the Lotka–Volterra (LV) predator–prey dynamical system, shown in Fig. 1. The LV system is defined by three simple reactions, namely: prey multiplication, predation by which predator multiplies at the expense of the prey, and predator death. We constructed the RP for simulated LV data with a range of different observation gaps. The result reveals a triangular polytope (Fig. 2 *A*–*H*). The polytope edges in Fig. 2*E* intersect at the coordinates , implying three reactions with directions 0°, , and which correspond to the update vectors , and , respectively.

Not all reaction directions are guaranteed to show up as RP vertices (*SI Appendix*, section SI-2.5). The ones that do are the dominant reactions. We must still infer the exact stoichiometries of the reactants and products for each inferred direction. We determine species-specific upper bounds on the entries of (*SI Appendix*, section SI-3), which then yield a finite set of possibilities for the stoichiometries. This results in a set of pairs as plausible reaction structures for the system. We consider all such structures, find reaction-specific propensities in each case, and evaluate the complete models for error vs. complexity tradeoff on a Pareto front (*SI Appendix*, section SI-4.2).

To estimate propensities, we assume that there exists a population vector *E* with zero expected update, i.e., a probabilistic equilibrium. This implies the existence of a nonnegative nontrivial vector satisfying (*SI Appendix*, section SI-5.1)

Stochastic evolution, in general, traces out different trajectories for the same initial conditions. However, in an ε-neighborhood of a probabilistic equilibrium , a nearly zero expected update implies that the occurrence of reactions can be modeled as independent trials in a multinomial distribution, with each outcome independently drawn from the set of the *R* inferred reactions. This independence condition, combined with the observed variance of the updates in , allows us to solve for the expected propensities. We show (*SI Appendix*, section SI-5) that there exists a consistent estimator for the propensity of the *r*th reaction, whose expected value (i.e., the propensity estimate) may be computed aswhere is the population update for the *i*th species for the *j*th update observed to have occurred from within , *M* is the total number of observed updates, and is an equilibrium probability vector (*SI Appendix*, section SI-5). Additionally, the coefficient of variation for the estimator (i.e., the ratio of SD to mean), which quantifies the uncertainty in the estimate, is given bywhich is identical for each reaction, and is independent of the equilibrium location and the equilibrium reaction probabilities . We also work out explicit corrections arising from measurement noise (*SI Appendix*, section SI-6). Note that if the left null space of has dimension greater than 1, then could be nonunique. In such cases, we sample the set of feasible vectors, and as before, tradeoff error-vs.-complexity of the resulting models on a Pareto front.

The observation gap is the inverse frequency at which experimental observations are made (*SI Appendix*, Remark SI-5.1), and is expected to be known. Without knowing we can only infer , and can only meaningfully compare the observed and simulated phase spaces near probabilistic equilibria. Because a probabilistic equilibrium has zero expected update for any , the distortion of the phase space in a small neighborhood of any such equilibrium is small. Thus, the deviation of the phase space in an equilibrium neighborhood reflects the modeling error independent of the observation gap. In contrast, comparing the phase spaces away from equilibria would require us to simulate the inferred model with set to a value close to that used in the original experiment.

## Computational Steps

Our inference algorithm consists of six sequential steps (see *SI Appendix*, section SI-8 for complete pseudocode). In step 1, we compute the optimal neighborhood radius (*SI Appendix*, section SI-5.5) as the one leading to the smallest coefficient of variation . (We need this radius to quantify what constitutes a neighborhood of any given state vector.) The location of a probabilistic equilibrium is identified as the population vector from which we have a near-zero expected update. Also, the variance of the observed population updates in the neighborhood of the identified equilibrium is computed. The expected update from every point *v* in the discretized space of population vectors is computed as well by taking the average update from within a neighborhood of *v*. Step 2 estimates a species-specific upper bound for entries. Step 3 applies HT to the set of relative updates, and outputs a set of pairs representing possible model structures. For each so obtained, we next estimate (step 5) the reaction probabilities producing zero expected update (*SI Appendix*, section SI-5.1). We implement this step by computing a basis for the left null space for , and then searching for vectors in the span of with nonnegative entries. No such may exist, which would render the problem unsolvable within our framework. In step 6 we evaluate the modeling error, and the descriptional complexity of the inferred models. We use the sum of the entries in the and as a measure of complexity (alternate measures of complexity may be used as well). Next we simulate each inferred model *G* (assuming if it is unknown) using Gillespie’s SSA, and compute the optimal neighborhood radius , the expected updates , and equilibrium as described in step 1. (Note: in step 1, we found these quantities, e.g., the optimal neighborhood , for the observed data; here we are finding them for simulated data from one of the inferred models.) Modeling error in the vicinity of equilibria is then estimated usingThe model-specific error and complexity estimates can then be used for tradeoffs on a Pareto front.

### Use of the Pareto Front.

As in any machine learning task, there is a natural tradeoff between error and complexity, and we eliminate models that are not maximally accurate for a particular complexity value. This leads to only a few models on the error-complexity Pareto front, as shown in Figs. 3 and 4. An expert can then consider these models and draw insight based on additional domain knowledge.

The complexity measure dictates which models get reported on the Pareto front; thus, specific applications may necessitate more discriminative estimates.

## Assumptions, Limitations, and Error Sources

Our approach makes the following assumptions:

*i*) Well-mixed reactants, i.e., absence of spatial inhomogeneity.*ii*) Reaction propensities are constant, i.e., independent of reaction time, and instantaneous population vectors.Even if these conditions are satisfied approximately, our inferred models are quite accurate (

*SI Appendix*, sections SI-5.4 and SI-6.3).*iii*) The system admits a probabilistic equilibrium (*SI Appendix*, Lemma SI-1.1). The notion of a probabilistic equilibrium is weaker than that of an equilibrium in a dynamical system. A system need not be static at a probabilistic equilibrium; only the expected change needs to be zero. This is realized with reaction probabilities that cause updates from individual reactions to cancel out. Thus, we cannot identify models for systems that do not exhibit any recurrent behavior, e.g., a system with a single reaction (we can identify the reaction direction, but not the propensity).*iv*) There is an upper bound on the observation gap beyond which the polytopal geometry of the RP is incorrect or lost (*SI Appendix*, Remark SI-1.7). We assume that is smaller than this bound. A minor assumption is that is approximately constant and known. Without we can only estimate (Eq.**3**).

It is not guaranteed that every reaction will be identified, e.g., reactions with the same direction cosine cannot be distinguished. However, the algorithm approximates the observed dynamics with the reactions it can infer (*SI Appendix*, section SI-12).

In molecular-biology observables are concentration of marker proteins, and it is difficult to map these values back to exact population numbers (*SI Appendix*, section SI-7).

As is increased, we expect a progressive degradation of the RP geometry (Fig. 2 *A*–*H*); e.g., in Fig. 2*H* with , the RP vertices appear incorrectly on the coordinate axes, illustrating the existence of an upper bound on the acceptable magnitude of .

For a small number of species, HT-based RP identification scales easily with increasing complexity of the systems, both in terms of increasing number of reactions and increasing number of entities involved in the reactions (i.e., sum of the integer coefficients in the model), as shown in *SI Appendix*, Fig. S-7. Whereas the theory itself has been developed for an arbitrary number of species, classical HT implementations have a high memory overhead in higher dimensions (although the number of parameters increases linearly, the number of cells in the discretized parameter space increases exponentially). Notably, some recent breakthroughs have reported HT schemes that have polynomial complexity in the number of dimensions (*SI Appendix*, section SI-2.4).

Because we are looking for integer coefficients, small errors in the reaction directions have no effect on the solution. Measurement errors do however affect the propensity estimation (*SI Appendix*, section SI-6), and we calculate explicit corrections that may be applied if the variance of the noise is known.

In the vast majority of problems in molecular biology, we observe the expression of only a few key regulators, whereas such systems typically consist of hundreds of distinct species with complex interaction networks. For example, in our experimental investigation into the competence dynamics of *Bacillus subtilis* (see below), we simultaneously measured the expression of *sinI*, *spo0A*, and *comG* promoters. However, it is well known that PsinI, Pspo0A, and PcomG are only key regulators in a complex network of hundreds of genes, and do not interact in the same sense as reacting molecules combine to produce products. Nevertheless, the “reactions” we infer from the expression data of PsinI, Pspo0A, and PcomG do make physical sense (see the discussion in the next section) and represent high-level abstractions of key processes. We give examples of several simulated systems with unobserved interacting species (*SI Appendix*, sections SI-10 and SI-11), and show that the inferred models can correctly identify the key roles played by the observable species.

## Application Examples

### Synthetic Datasets.

We validated our approach on a range of simulated and experimental systems. Simulated data included several variations of the LV system (systems 1 and 2 in Table 1), synthetic data from simulated transcription–translation dynamics (system 3 in Table 1; also Fig. 2*I*), and on the Oregonator (27) as a simple model of the Belousov–Zhabotinsky oscillator (system 4 in Table 1). The reaction structure was recovered correctly in each case (there is an obvious “correct” choice from the Pareto front for these specific systems, *SI Appendix*, section SI-13), and the inferred propensities were within 4% of the true values. The observation gap used varied between the systems, and the maximum value at which inference was possible is tabulated in Table 1.

### Challenge Problems.

Two datasets, comprising 425 and 315 observations, respectively, from the LV system are included, as benchmark challenges for competing approaches (*SI Appendix*, section SI-15).

### Experimental Datasets.

We first analyzed paramecium–didinum prey–predator cohabitation (28, 29) (paramecium denoted as species *X*, and didinum denoted as species *Y*). This is a classic protozoan experiment with *Paramecium aurelia* as prey and *Didinum nasutum* as predator (30), yielding time series of twice-daily counts of predator and prey abundance over several sustained population cycles (experimental runs spanned days each). We analyzed the two longest time series observed, measured under with cerophyl concentrations of and , respectively.

Inferred Pareto fronts are shown in Fig. 3 *C* and *D*, respectively. The inferred models are similar to the standard LV systems as expected. More importantly, model 2 from Fig. 3*C* and model 2 from Fig. 3*D* have identical reactions, with different propensities. The fourth reaction proceeds significantly faster for higher cerophyl concentration ( in Fig. 3*C*) against in Fig. 3*D*; higher cerophyl concentration gives more nutrition to the protozoa allowing them to multiply faster.

Abundance of food also correlates with the lower rates of population decay for both species (reactions 2 and 3 in the respective models). The large coefficient on paramecium (*X*) in the decay reaction implies that at low protozoa concentrations the probability of this reaction dominates, bringing about rapid decay of protozoa population, resulting in fast extinction of both species (30).

The second experimental system we analyzed is cellular differentiation dynamics of the microbial model system *B. subtilis*, known to exhibit stochastic behavior at the single-cell level (31⇓⇓–34). Upon exposure to environmental stressors, individual *B. subtilis* cells can probabilistically differentiate into highly resilient dormant spores, or transiently enter a state of competence for uptake of extracellular DNA. Both processes require expression of Spo0A, which is the global transcriptional stress response regulator of *B. subtilis* (35). Spo0A governs transcription of several hundred genes, including *sinI*, whose expression releases inhibition of sporulation by SinR. Therefore, expressions of Spo0A and SinI are critical steps in the sporulation program (36). Furthermore, Spo0A expression removes inhibition on the transcriptional master regulator ComK, which activates expression of over 140 genes involved in competence (37). Among these is *comG*, which is crucial for extracellular DNA uptake. Measurement of *spo0A*, *sinI,* and *comG* promoters thus provides critical insight into stochastic differentiation dynamics of two competing cellular differentiation programs in single cells (38).

We applied our technique to simultaneously recorded expression data (Fig. 4*A*) of *spo0A*, *sinI,* and *comG* promoters (denoted Pspo0A, PsinI, and PcomG, respectively) in cells undergoing competence episodes followed by sporulation. The RP showed 12 distinct vertices, indicating 12 reactions (Fig. 4*B*). Some of these vertices occurred in clusters, and could be collapsed. We computed the reaction directions at different resolutions, and obtained models with number of inferred reactions ranging from 3 to 12, leading to the Pareto front shown in Fig. 4*C*.

Model 5, consisting of six reactions, revealed surprising insights into the complex cellular differentiation dynamics and relationship between sporulation and competence.

Reactions capture the known requirement for entry into competence. Specifically, these reactions confirm that Spo0A expression, and thus at least a basal activity of the sporulation program, precedes competence initiation (35). Reactions explain the controlled concentration of PcomG during competence. Also PcomG → PsinI + Pspo0A corroborates recent experimental evidence of competence episodes being often followed by sporulation (38). Finally, PsinI + 2Pspo0A → ∅ predicts that both Spo0A and SinI are repressed during competence. Thus, this inferred model recovers known dynamics, and poses nontrivial hypotheses for unknown interactions.

## Conclusion

Under specific assumptions, certain observable geometric properties of the set of relative population updates in the phase space remain approximately conserved when an unknown number of reactions is skipped between aggregate population counts; this fact can be exploited for de novo inference of reaction structure and propensities with noisy and intermittent observations. Such de novo inference is crucial to analyzing large volumes of experimental data acquired via a gamut of emerging single-cell monitoring technologies, and may help elucidate fundamental principles by which biological elements integrate to exhibit complex behavior.

## Acknowledgments

This research was supported in part by the US Army Research Office Biomathematics program (W911NF-12-1-0499), the National Science Foundation (ECCS 0941561), and the Defense Threat Reduction Agency (HDTRA 1-09-1-0013).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. E-mail: ishanu.chattopadhyay{at}cornell.edu.

Author contributions: I.C., A.K., G.M.S., and H.L. designed research; I.C., A.K., G.M.S., and H.L. performed research; I.C., A.K., G.M.S., and H.L. contributed new reagents/analytic tools; I.C., A.K., G.M.S., and H.L. analyzed data; and I.C. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1214559110/-/DCSupplemental.

## References

- ↵
- King RD,
- et al.

- ↵
- ↵
- McAdams HH,
- Arkin A

- ↵
- Forger DB,
- Peskin CS

- ↵
- ↵
- ↵
- Bratsun D,
- Volfson D,
- Tsimring LS,
- Hasty J

- ↵
- Edery I,
- Zwiebel LJ,
- Dembinska ME,
- Rosbash M

- ↵
- ↵
- ↵
- Zlokarnik G,
- et al.

- ↵
- ↵
- ↵Suel G (2011) Synthetic Biology, Part A: Methods for Part/Device Characterization and Chassis Engineering.
*Methods in Enzymology*, ed Voigt, C (Academic, Waltham, MA) 1st Ed, Vol 497. - ↵
- ↵
- ↵
- Wilkinson D

*Chapman and Hall/CRC Mathematical and Computational Biology Series*(Taylor & Francis). - ↵
- Platt JC,
- Koller D,
- Singer Y,
- Roweis ST

- Opper M,
- Sanguinetti G

- ↵
- ↵
- Golightly A,
- Wilkinson DJ

- ↵
- ↵
- ↵
- ↵Hough PVC (1962) US Patent 3,069,654.
- ↵
- Huber P

*Wiley Series in Probability and Mathematical Statistics. Probability and Mathematical Statistics*(Wiley, New York). - ↵
- ↵
- Field R

- ↵
- Gause G

- ↵Luckinbill, L (1972)
*Coexistence in Laboratory Populations of Paramecium Aurelia and the Predator Didinium Nasutum*(Univ California, Los Angeles-Zoology). - ↵
- ↵
- ↵
- Suel GM,
- Kulkarni RP,
- Dworkin J,
- Garcia-Ojalvo J,
- Elowitz MB

- ↵
- Maamar H,
- Raj A,
- Dubnau D

- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

### More Articles of This Classification

### Biological Sciences

### Biophysics and Computational Biology

### Physical Sciences

### Related Content

- No related articles found.

### Cited by...

- No citing articles found.