New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Automated reverse engineering of nonlinear dynamical systems

Edited by Richard E. Lenski, Michigan State University, East Lansing, MI, and approved April 7, 2007 (received for review October 25, 2006)
Data supplements
Bongard and Lipson. 10.1073/pnas.0609476104.
Supporting Information
Files in this Data Supplement:
SI Table 4
SI Figure 4
SI Figure 5
SI Figure 6
SI Text
SI Figure 4Fig. 4. Mean model sizes obtained by the four algorithm variants applied to the four synthetic systems (with ± 0.08 noise). As reported in Fig. 3, the best model was extracted from each trial of the algorithm. Model size is the mean number of operands and operators in each equation comprising the model. Error bars indicate one unit of standard deviation (n = 30).</.>
SI Figure 5Fig. 5. General behavior of the four synthetic systems (red) and tests chosen by the algorithm (black points). The red lines indicate changes in the system's variables over time. Black points indicate the initial variable values that the algorithm with automated probing chose, during the modeling of these four systems. Initial conditions were confined to prespecified ranges (indicated by the blue boxes). However, once the initial conditions are supplied to the system, or used to sample data from a data set, variable values may exceed these ranges. For the lac operon system, only two of the three variables are plotted here: the seeming divergence in state evolution from the same point is caused by differences in the third variable (lactose concentration).
SI Figure 6a
SI Figure 6b
SI Figure 6c
SI Figure 6d
SI Figure 6e
SI Figure 6fFig. 6. The three mechanical systems. (a) The physical pendulum, with the base lying flat. (b) The pendulum with the base rotated 1.57 radians counterclockwise. (c) The pendulum with the base rotated 2.6 radians counterclockwise. (df) The time series data, plotted as a phase diagram, generated by the three systems, respectively.
Eighteen numerically stable synthetic systems were generated randomly and inferred without and with partitioning: three with two, three, four, five, and six variables, respectively. Thirty independent trials of both algorithm variants were run against the system. Success rates report the fraction of trials that produced a model with exactly the same terms and parameters as the system, with no additional incorrect terms.
SI Text
SI Materials and Methods
Partitioning. The error of a single equation is then the difference between the time series of the targeted variable i and its corresponding time series observed in the actual system:
m^{(i)}_{error} = Σ_{t=t0..tf} m^{(i)}_{t}  s^{(i)}_{t} / t_{span}.
For a system without partitioning, the size of the space of possible models grows as k^{2kc}, where k is the number of variables, and c is a measure of maximum allowed model complexity. In this work, c is the maximum number of operators and operands allowed in each equation f_{m}^{(i)}. For a fixed expression depth (as used here) that complexity c is also fixed. One variable admits a space of size 1^{c}, two variables one of size 2^{c}2^{c}=4^{2c}=2^{4c}, and n variables one of size Π_{k}=1..n i^{c} = k^{2kc}. With partitioning, each equation is optimized separately, therefore the time to search for k equations serially is the linear sum Σ_{n=1...k} k^{c} = k^{c+1}. Because of the separation, each equation can be modeled in parallel, reducing search time to k^{c}. Therefore, model search proceeds over a space with a size that scales polynomially with the number of variables, rather than exponentially. All the models reported in this paper were bounded by maximum expression depths of c=6, and were optimized serially, giving a search time of size k^{7} for each system.
Test quality. The quality of a test is computed as follows. First, a candidate test is supplied to each model, such that the model produces a time series of predicted values for each state variable. Then the quality of test s is computed as
s_{quality} = Σ_{t=t0..tf} Σ_{v=1..k} φ(m^{(}1)_{v,t},...,m^{(}5)_{v,t}) / 5kt_{span},
where t_{0} indicates the time at which the model began integrating the current test s; t_{f} indicates the time it finishes; t_{span} denotes this span of time; v=1..k represent the k state variables that make up the system under investigation; and m^{(i)}_{v,t} indicates the value of variable v at time step t, as predicted by model i. In short, the quality of a test is computed as the variance that it can induce in the model predictions, averaged over all variables, and over the length of the proposed experiment to be carried out on the system. For example, consider a system composed of two variables that can be observed one hour, two hours and three hours after a test has been supplied to it. Assume that some test s_{1} is supplied to three models. The first model predicts that the first variable will increase over time; the second model predicts it will decrease over time; and the third predicts it will stay the same. The same three models predict that the second variable will fall, stay the same, and rise, respectively. More information about the system will be gained if this test is supplied to the system, compared to another test s_{2} that causes all three models to predict that it will induce the first variable of the system to rise over time, and the second variable to decrease over time.
SI Example Experiment
As an example, we describe the application of the framework to the physical pendulum, in which the base is not rotated (an image of this setup is shown in Fig. 6a). To begin, we execute run_Pend_Flat.bat in Code\Physical_Systems\Release. This particular experiment uses partitioning, snipping and automated probing. This batch file begins the algorithm, which begins to generate models to explain the time series data produced by this device (this data is visualized in Fig. 6d). This data is stored in Code\Physical_Systems\Release\Targets\Pend_0.dat.
The algorithm begins by extracting 20 contiguous time steps of data from Pend_0.dat. There are a total of 200 time steps available. Where in the sample this subsample is taken from is random.
The algorithm then generates five random equations describing the rate of change of the first state variable, which is the rotational angle of the pendulum's arm. The first two equations are
θ' = 1.757 * ω
θ' = cos(θ7.002)  0.686
These five equations are supplied with the first value of θ in the extracted data subsample, and integrated for 20 time steps. The mean squared error between this data and the extracted data is computed for each equation, indicating its performance. Each equation is copied, and a random mutation is introduced. The first two equations mutate into
θ' = sin(θ) + ω
θ' = cos(0.1667.002)  θ
The errors of these new equations are computed. If the new equation's error is less than the original equation, it replaces it; otherwise, the new equation is deleted. This leads to a new set of five equations, which are mutated, and so on. This process continues for 1000 iterations. At this point the equations are stored, and five new random equations describing the second state variable's rate of change (ω'=...) are optimized against the extracted subsample. The same process of equation optimization continues for 1000 iterations. At this point the equation with least error for the first state variable is combined with the equation with least error for the second state variable; the equations with secondleast error are combined; and so on. This produces five models that can explain the extracted subsample of data.
Then, the search for a new test of the target system commences. Five pairs of values are generated at random (which represent possible initial conditions) between the state variables' valid ranges. The first initial condition is supplied to the five synthesized models, which generates five sets of time series data. The variance across this time series data is computed, and assigned to the current initial condition. This process is repeated for the four other initial conditions. The initial conditions are then copied, and the values in the copies are randomly modified. The new initial conditions are supplied to the models again, and the variance they induce is measured. If a new initial condition induces more model variance than the original initial condition, it replaces it; otherwise, it is deleted. This process continues for 100 iterations.
At that point, the initial condition that induces the most model variance is selected and compared against the system's time series data. The time step in this data is found with state variable values that are closest to the values in the proposed initial condition. For example if the proposed initial conditions are [0.118 2.223], then the closest match in values occurs during the 15^{th} time step from the system data, which is [0.11788 2.2043] (see Pend_0.dat). A new subsample of system data is then extracted: the time series from t_{15} to t_{35}.
Modeling then recommences. The five models are broken up into their individual equations again. The five equations for the first state variable are now integrated using both initial conditions: the first random subsample, and the second extracted subsample. The error of the equations is recomputed, mutated copies are produced, and replacement occurs. This process is again continued for 1000 iterations, and then repeated for the five equations for the second state variable. At this point, the equations are reintegrated into five models that explain both the first and second data subsamples well.
This entire process continues for a set period of clock time, which in run_Pend_Flat.bat is set to two minutes. At the end of execution, the algorithm returns the model with lowest error, from among the five that it has synthesized, as its best description of the behavior of the pendulum.
SI Source Code
The code for two algorithm variants are supplied: one for infering models from synthetic systems (Code/Synthetic_Systems) and one for infering models from physical systems (Code/Physical_Systems). The code for each is supplied as a Microsoft Visual C++ Workspace (DiffEq_Inference.dsw).
SI Executables
The code can be executed by running the batch files (run_x.bat) in the Release subdirectory of each algorithm. For the synthetic system: run_LV.bat will infer models of the LotkaVolterra system describing interspecies competition; run_HigherOrder.bat will infer models of the highorder system; and run_Synthetic.bat will run the syntheticallygenerated systems described in Table IV of the paper. The run_LV.bat contains the following line:
DiffEq_Inference.exe r 0 c 1000 n 0.08 td 6 el 10 null g 1000 t LV.dat
tmin 0 tmax 1 ee ei fullDis snip timer 30
 The r switch supplies a random seed; changing this number creates independent trials.
 The c switch sets the maximum number of allowed cycles of modeling and testing. If this maximum is reached, the algorithm terminates.
 The n switch tunes the amount of noise added to the system's behavior. At its current setting, this adds random values sampled uniformly from [0.08,0.08] to the value of each variable, at each time step, after the behavior has been generated.
 The td switch sets the maximum allowed depth of nested subexpressions. Larger values allow larger expressions to be generated to explain system behavior.
 The el switch sets experiment length: the number of time steps that the system is allowed to behave for, given a set of initial conditions.
 null turns off printing to the screen.
 The g switch indicates the number of optimization steps that are performed during each pass through the modeling phase.
 The t switch indicates which target system should be inferred. Systems are encoded in script files, which are loaded in at run time. The Targets directory contains many examples of different systems.
 tmin and tmax set the bounds on values that the initial conditions can take.
 The ee switch informs the algorithm to optimize the equations of the model, not just the parameters. This setting is always on.
 The ei fullDis combined switch informs the algorithm to optimize tests, using a set of optimized models. Removing these switches turns off automated probing.
 The snip switch turns on snipping.
 The timer switch indicates how many minutes the algorithm should run for.
For the physical system, the three supplied batch files infer models of the pendulum lying flat on its base (run_Pend_Flat.bat); the pendulum base rotated 90 degrees (run_Pend_90.bat); and the base rotated 138 degrees (run_Pend_138.bat). This executable reads in time series data generated by these systems (found in the Targets directory). The time series files are organized as a matrix: each column represents values of a given variable, and each row indicates a time step. For the pendulum data files, the first row indicates angle of the arm, and the second row indicates angular velocity. Pictorial representations of these data sets are shown in Fig. 6 df of the paper.
When the algorithm is run, it outputs data files into the data subdirectory. The most important of these are the ones named runx_y_models.dat. The x indicates the random seed that was used for this trial; the y indicates the algorithm variant that was used (turning on and off automated probing and snipping changes this value). This file contains a list of models, encoded in Matlab format: these models can be dropped into a Matlab integrator, and the resulting behavior of the model can be plotted. (Matlab code for doing this is not currently provided.) Each model in the list is the model with lowest error achieved by the algorithm during that pass through the modeling phase. The most important model is the last one, which is the final, best model output when the algorithm terminates. y(i) indicates the ith state variable, and yprime(i) reports the equation describing the behavior of variable i.
SI Data
The Data subdirectory contains the results of all the trials reported in the paper. These subdirectories are largelyselfexplanatory. Each one contains a set of data files output by a single trial of the algorithm, against that system.
Data for the two physical systems  the pendulum and the coevolutionary data  that was used to generate the models is stored in Code\Physical_Systems\Release\Targets. The data is formatted as follows. The first number indicates the number of time steps of available data for each state variable, and second number indicates the total number of state variables. Each column starting at the second row then corresponds to time series data for that state variable. For example Hudsons.dat begins as
91 2
20000 31000
20000 40000
indicating that there are 91 years worth of data; two species; that in the first year (1845) there were approximately 20,000 hares and 31,000 lynxes; in 1846 there were 20,000 hares and 40,000 lynxes; and so on.
Citation Manager Formats
Sign up for Article Alerts
Jump to section
You May Also be Interested in
More Articles of This Classification
Physical Sciences
Computer Sciences
Biological Sciences
Related Content
 No related articles found.
Cited by...
 REVEALING COMPLEX ECOLOGICAL DYNAMICS VIA SYMBOLIC REGRESSION
 Reconstruction of normal forms by learning informed observation geometries from data
 Datadriven discovery of partial differential equations
 Discovering governing equations from data by sparse identification of nonlinear dynamical systems
 Serotonergic regulation of melanocyte conversion: A bioelectrically regulated network for stochastic allornone hyperpigmentation
 Systems cell biology
 Response
 Machine Science
 Distilling FreeForm Natural Laws from Experimental Data