Data-driven discovery of coordinates and governing equations

Significance Governing equations are essential to the study of physical systems, providing models that can generalize to predict previously unseen behaviors. There are many systems of interest across disciplines where large quantities of data have been collected, but the underlying governing equations remain unknown. This work introduces an approach to discover governing models from data. The proposed method addresses a key limitation of prior approaches by simultaneously discovering coordinates that admit a parsimonious dynamical model. Developing parsimonious and interpretable governing models has the potential to transform our understanding of complex systems, including in neuroscience, biology, and climate science.

L_recon ensures that the autoencoder can accurately reconstruct the data from the intrinsic coordinates. L_dx/dt and L_dz/dt ensure that the discovered SINDy model captures the dynamics of the system by requiring it to predict the derivatives ẋ and ż from the data. L_reg promotes sparsity of the coefficients in the SINDy model.
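For concreteness, a minimal NumPy sketch of how these four terms might be combined is shown below. The function and argument names are ours, and the mean-squared-error form of each term and the unit weight on the reconstruction term are assumptions rather than the authors' exact implementation; the weights λ1, λ2, λ3 correspond to the loss weights discussed later in this section.

import numpy as np

def sindy_autoencoder_loss(x, x_hat, dx, dx_hat, dz, dz_hat, Xi,
                           lambda_1, lambda_2, lambda_3):
    """Combine the four loss terms described above (illustrative sketch).

    x, x_hat   : input snapshots and their decoder reconstructions
    dx, dx_hat : time derivatives of x and the SINDy-based predictions of dx/dt
                 mapped back through the decoder
    dz, dz_hat : encoder-derived latent derivatives dz/dt and the SINDy
                 model predictions
    Xi         : SINDy coefficient matrix
    """
    L_recon = np.mean((x - x_hat) ** 2)       # reconstruction of the data
    L_dxdt = np.mean((dx - dx_hat) ** 2)      # SINDy prediction in input space
    L_dzdt = np.mean((dz - dz_hat) ** 2)      # SINDy prediction in latent space
    L_reg = np.mean(np.abs(Xi))               # L1 penalty promoting sparsity
    return L_recon + lambda_1 * L_dxdt + lambda_2 * L_dzdt + lambda_3 * L_reg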

Computing derivatives. Computing the derivatives of the encoder variables requires propagating derivatives through the network. Our network uses an activation function f(·) that operates elementwise. Given an input x, we define the pre-activation values at the jth encoder layer as l_j = f(l_{j−1})W_j + b_j. The first layer applies the weights and biases directly to the input, so that l_0 = xW_0 + b_0. The activation function is not applied to the last layer, so for an encoder with L hidden layers the autoencoder variables are defined as z = f(l_{L−1})W_L + b_L.
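For concreteness, here is a minimal NumPy sketch of this forward pass; the function names and the tanh activation are our own assumptions, not necessarily the authors' choices. The stored pre-activations are reused in the derivative sketch that follows the next paragraphs.

import numpy as np

def encoder_forward(x, weights, biases, f=np.tanh):
    """Forward pass through the encoder, keeping each pre-activation l_j.

    weights, biases : lists [W_0, ..., W_L], [b_0, ..., b_L]
    f               : elementwise activation (tanh here is an assumption)
    Returns the latent variables z and the list of pre-activations.
    """
    l = x @ weights[0] + biases[0]          # l_0 = x W_0 + b_0
    pre_activations = [l]
    for W, b in zip(weights[1:-1], biases[1:-1]):
        l = f(l) @ W + b                    # l_j = f(l_{j-1}) W_j + b_j
        pre_activations.append(l)
    z = f(l) @ weights[-1] + biases[-1]     # no activation on the last layer
    return z, pre_activations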
Assuming that derivatives dx/dt are available or can be computed, derivatives dz/dt can also be computed by applying the chain rule layer by layer:

dl_0/dt = (dx/dt)W_0
dl_j/dt = (f′(l_{j−1}) ∘ dl_{j−1}/dt)W_j
dz/dt = (f′(l_{L−1}) ∘ dl_{L−1}/dt)W_L

where ∘ denotes the elementwise product and f′ is the elementwise derivative of the activation function. For the nonlinear pendulum example, we use a second-order SINDy model that requires the calculation of second derivatives. Second derivatives can be computed by applying the chain rule a second time:

d²l_0/dt² = (d²x/dt²)W_0
d²l_j/dt² = (f″(l_{j−1}) ∘ dl_{j−1}/dt ∘ dl_{j−1}/dt + f′(l_{j−1}) ∘ d²l_{j−1}/dt²)W_j
d²z/dt² = (f″(l_{L−1}) ∘ dl_{L−1}/dt ∘ dl_{L−1}/dt + f′(l_{L−1}) ∘ d²l_{L−1}/dt²)W_L

The number of training epochs used is specific to each example. To obtain parsimonious dynamical models, we use a sequential thresholding procedure that promotes sparsity in the coefficients Ξ, which represent the dynamics of the latent variables z. Every 500 epochs, we set all coefficients in Ξ with a magnitude of less than 0.1 to 0, effectively removing these terms from the SINDy model. This is achieved using a mask Υ, consisting of 1s and 0s, that determines which terms remain in the SINDy model: the SINDy terms in the loss function use the masked coefficients Υ ∘ Ξ, where Υ is passed in separately and is not updated by the optimization algorithm. Once a term has been thresholded out during training, it is permanently removed from the SINDy model; therefore the number of active terms in the SINDy model can only decrease as training continues. The L1 regularization on Ξ encourages the model coefficients to decrease in magnitude, which, combined with the sequential thresholding, produces a parsimonious dynamical model.
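Following the forward-pass sketch above, a minimal NumPy illustration of this chain-rule propagation of dz/dt (first derivatives only) might look as follows; again the function names and the tanh activation are assumptions.

import numpy as np

def encoder_derivatives(dx, pre_activations, weights,
                        f_prime=lambda l: 1.0 - np.tanh(l) ** 2):
    """Propagate time derivatives of the input through the encoder.

    dx              : time derivatives of the input snapshots, dx/dt
    pre_activations : [l_0, ..., l_{L-1}] from the forward pass above
    weights         : [W_0, ..., W_L]
    f_prime         : elementwise derivative of the activation (tanh assumed)
    Returns dz/dt via the chain rule:
        dl_0/dt = (dx/dt) W_0
        dl_j/dt = (f'(l_{j-1}) * dl_{j-1}/dt) W_j
        dz/dt   = (f'(l_{L-1}) * dl_{L-1}/dt) W_L
    """
    dl = dx @ weights[0]                                  # dl_0/dt
    for l_prev, W in zip(pre_activations[:-1], weights[1:-1]):
        dl = (f_prime(l_prev) * dl) @ W                   # dl_j/dt
    dz = (f_prime(pre_activations[-1]) * dl) @ weights[-1]
    return dz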

While the L1 regularization penalty on Ξ promotes sparsity in the resulting SINDy model, it also encourages nonzero terms to have smaller magnitudes. This results in a trade-off between accurately reconstructing the dynamics of the system and reducing the magnitude of the SINDy coefficients, where the trade-off is determined by the relative magnitudes of the loss weight penalties λ1, λ2 and the regularization penalty λ3. The specified training procedure therefore typically results in models with coefficients that are slightly smaller in magnitude than those which would best reproduce the dynamics. To account for this, we add an additional coefficient refinement period to the training procedure. To perform this refinement, we lock in the sparsity pattern of the dynamics by fixing the coefficient mask Υ and continue training for 1000 epochs without the L1 regularization on Ξ. This ensures that the best coefficients are found for the resulting SINDy model and also allows the training procedure to refine the encoder and decoder parameters. This procedure is analogous to running a debiased regression.

The choice of the latent dimension d also affects the results. One practical strategy is to first determine the smallest d for which an autoencoder can accurately reproduce the input data. As simpler models (lower d) are typically easier to interpret, one should start by using the smallest d possible. Once this minimum has been found, the full SINDy autoencoder can be trained. It is possible that more coordinates may be needed to capture the dynamics, in which case the method may accurately reproduce the input but fail to find a good model for the dynamics. In this case d could be increased from the minimum until a suitable dynamical model is found.

Choosing a d that is greater than necessary may still result in a valid model for the dynamics; however, obtaining a sparse model may be more difficult.
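To summarize the sequential thresholding and refinement procedure described in the preceding paragraphs in code form, here is a minimal NumPy sketch; the data structures, the library-matrix name Theta_z, and the helper for switching off the L1 penalty during refinement are our own illustrative choices, not the released implementation.

import numpy as np

def update_mask(Xi, mask, threshold=0.1):
    """Sequential thresholding step, applied every 500 epochs.

    Terms already removed stay removed (the mask can only lose 1s), so the
    number of active terms in the SINDy model never increases.
    """
    return mask * (np.abs(Xi) >= threshold)

def sindy_prediction(Theta_z, Xi, mask):
    """SINDy prediction of the latent derivatives using only active terms."""
    return Theta_z @ (mask * Xi)

def regularization_weight(epoch, n_training_epochs, lambda_3):
    """During refinement the mask is frozen and the L1 penalty is dropped,
    so the remaining coefficients are refit without shrinkage."""
    return 0.0 if epoch >= n_training_epochs else lambda_3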

The choice of loss weight penalties λ1, λ2, λ3 also has a significant impact on the success of the training procedure. The parameter λ1 determines the importance of the SINDy prediction in the original input space. We generally choose λ1 so that the reconstruction loss on x is weighted slightly more heavily than the SINDy prediction loss on ẋ; this slightly prioritizes the reconstruction of x over the prediction of ẋ during training. This is important to ensure that the autoencoder weights are being trained to reproduce x and that it is the SINDy model that gives an accurate prediction of ẋ. We choose λ2 to be one or two orders of magnitude less than λ1; if λ2 is too large, this term can dominate the other terms in the loss function.

To assess model performance, we calculate the relative L2 error of both the input data x and its derivative ẋ:

e_x = ‖x − x̂‖2 / ‖x‖2,    e_ẋ = ‖ẋ − ẋ_SINDy‖2 / ‖ẋ‖2,    [1]

where x̂ is the decoder reconstruction of x and ẋ_SINDy is the SINDy model's prediction of ẋ in the input space.

These error values capture the decoder reconstruction and the fit of the dynamics, respectively. To assess parsimony, we consider the number of active terms in the resulting SINDy model. While parsimonious models are desirable for ease of analysis and interpretability, a model that is too parsimonious may be unable to fully capture the dynamics. In general, for the examples explored, we find that models with fewer active terms perform better on validation data (lower relative L2 error e_ẋ), whereas models with more active terms tend to overfit the training data. In reporting our results, we also calculate the relative L2 error e_ż between ż and the SINDy model prediction, defined analogously to Eq. (1). For each example system, we apply the training procedure to ten different initializations of the network and compare the resulting models. For the purpose of demonstration, for each example we show results for a chosen "best" model, taken to be the model with the lowest relative L2 error on validation data among the models with the fewest active coefficients.

Not every instance of training results in exactly the same SINDy sparsity pattern, but the network tends to discover a few closely related forms of the dynamics.
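A small sketch of this evaluation and model-selection criterion (relative L2 error plus choosing among the ten trained models) might look as follows; the dictionary keys are our own illustrative structure, not the released code.

import numpy as np

def relative_l2_error(true, pred):
    """Relative L2 error as in Eq. (1), e.g. e_x, e_xdot, or e_zdot."""
    return np.linalg.norm(true - pred) / np.linalg.norm(true)

def select_best_model(models):
    """Pick the 'best' of the trained models as described above: among the
    models with the fewest active SINDy coefficients, take the one with the
    lowest validation error. `models` is a list of dicts with keys
    'n_active_terms' and 'validation_error'."""
    fewest = min(m["n_active_terms"] for m in models)
    candidates = [m for m in models if m["n_active_terms"] == fewest]
    return min(candidates, key=lambda m: m["validation_error"])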

Code. We use the Python API for TensorFlow to implement and train our network (5). Our code is publicly available at github.com/kpchamp/SindyAutoencoders.

Example Systems

Chaotic Lorenz system. To create a high-dimensional data set with dynamics defined by the Lorenz system, we choose six spatial modes u1, . . . , u6 ∈ R^128 and take

x(t) = u1z1(t) + u2z2(t) + u3z3(t) + u4z1(t)³ + u5z2(t)³ + u6z3(t)³,

where the dynamics of z are specified by the Lorenz equations

ż1 = σ(z2 − z1)
ż2 = z1(ρ − z3) − z2
ż3 = z1z2 − βz3

with standard parameter values of σ = 10, ρ = 28, β = 8/3. We choose our spatial modes u1, . . . , u6 to be the first six Legendre polynomials, discretized at 128 points in space.

Following the training procedure described above, we learn ten models using the single set of training data (variability among the models comes from the initialization of the network weights). The hyperparameters used for training are shown in Table S1. For each model we run the training procedure for 10^4 epochs, followed by a refinement period of 10^3 epochs. Of the ten models, two have 7 active terms, two have 10 active terms, one has 11 active terms, and five have 15 or more active terms.

While all models have e_x, e_ẋ < 0.01, the three models with 20 or more active terms have the worst performance in predicting ẋ. The two models with 10 active terms have the lowest overall error, followed by the models with 7, 15, and 18 active terms.

While the models with 10 active terms have a lower overall error than the models with 7 terms, both have very low error, and thus we choose to highlight the model with the fewest active terms. A model with 10 active terms is shown in Figure S1 for comparison.
While the structure of this model appears to be different from that of the original Lorenz system, we can define an affine transformation that gives it the same structure. The variable transformation z̃1 = α1z1, z̃2 = α2z2, z̃3 = α3z3 + β3 gives the transformed system of equations in Eq. (2). By choosing α1 = 1, α2 = −0.917, α3 = 0.524, β3 = −2.665, the system takes the form shown in Eq. (3). This has the same form as the original Lorenz equations with parameters that are close in value, apart from an arbitrary scaling that affects the magnitude of the coefficients of z̃1z̃3 in Eq. (3b) and z̃1z̃2 in Eq. (3c). The attractor dynamics for this system are very similar to the original Lorenz attractor and are shown in Figure 3c in the main text.
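To make explicit why such a transformation only rescales coefficients, the short derivation below works through a single representative quadratic term with a generic coefficient a; the term and coefficient are illustrative, not taken from the discovered model.

% Substituting the transformation \tilde{z}_1 = \alpha_1 z_1,
% \tilde{z}_2 = \alpha_2 z_2, \tilde{z}_3 = \alpha_3 z_3 + \beta_3
% into a representative term \dot{z}_1 = a\, z_2 z_3 + \dots
\begin{align*}
\dot{\tilde{z}}_1 = \alpha_1 \dot{z}_1
  &= \alpha_1 a \, z_2 z_3 + \dots
   = \alpha_1 a \, \frac{\tilde{z}_2}{\alpha_2} \cdot
     \frac{\tilde{z}_3 - \beta_3}{\alpha_3} + \dots \\
  &= \frac{\alpha_1 a}{\alpha_2 \alpha_3} \, \tilde{z}_2 \tilde{z}_3
   - \frac{\alpha_1 a \beta_3}{\alpha_2 \alpha_3} \, \tilde{z}_2 + \dots
\end{align*}
% Each monomial coefficient is rescaled by a ratio of the scalings, and the
% shift \beta_3 introduces lower-order terms, which is why the transformed
% system can match the Lorenz structure up to an overall scaling of the
% quadratic coefficients.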

To evaluate performance, we report the relative L2 error of predicting x, ẋ, and ż, as defined in Eq. (1).

The learning procedure discovers a dynamical model by fitting coefficients that predict the continuous-time derivatives of the variables in a dynamical system. Thus it is possible for the training procedure to discover a model with unstable dynamics, or one that is unable to predict the true dynamics through simulation. We assess the validity of the discovered model by simulating the dynamics of the discovered low-dimensional dynamical system. Simulation shows that the system is stable, with trajectories lying on an attractor very similar to the original Lorenz attractor. Additionally, the discovered system is able to predict the dynamics in the reduced space, as shown in the fourth panel of Figure S1a.
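Before moving to the next example, here is a minimal NumPy/SciPy sketch of how a Lorenz training set of the kind described at the start of this section can be constructed: simulate the Lorenz system and lift it into a 128-dimensional space using Legendre-polynomial modes. The initial condition, time grid, and exact lifting are illustrative assumptions, not the authors' exact settings.

import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, z, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    z1, z2, z3 = z
    return [sigma * (z2 - z1), z1 * (rho - z3) - z2, z1 * z2 - beta * z3]

# Simulate the Lorenz system (initial condition and time grid are illustrative).
t = np.arange(0.0, 5.0, 0.02)
sol = solve_ivp(lorenz, (t[0], t[-1]), [-8.0, 7.0, 27.0], t_eval=t)
z = sol.y.T                                        # shape (n_samples, 3)

# First six Legendre polynomials (degrees 0-5) evaluated at 128 spatial points.
y = np.linspace(-1.0, 1.0, 128)
U = np.polynomial.legendre.legvander(y, 5)         # shape (128, 6)

# Lift to high-dimensional snapshots using linear and cubic terms of z.
features = np.hstack([z, z ** 3])                  # shape (n_samples, 6)
x = features @ U.T                                 # shape (n_samples, 128)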

Reaction-diffusion.
We generate data from a high-dimensional lambda-omega reaction-diffusion system governed by

u_t = (1 − (u² + v²))u + β(u² + v²)v + d1∇²u
v_t = −β(u² + v²)u + (1 − (u² + v²))v + d2∇²v

with d1, d2 = 0.1 and β = 1. The system is simulated from a single initial condition from t = 0 to t = 500 with a spacing of ∆t = 0.05, for a total of 10,000 samples. The initial condition is defined as

u(y1, y2, 0) = tanh(√(y1² + y2²)) cos(∠(y1 + iy2) − √(y1² + y2²))
v(y1, y2, 0) = tanh(√(y1² + y2²)) sin(∠(y1 + iy2) − √(y1² + y2²))

over a spatial domain of y1 ∈ [−10, 10], y2 ∈ [−10, 10], discretized on a grid with 100 points along each spatial axis. The solution of these equations results in a spiral wave formation. We apply our method to snapshots of u(y1, y2, t) generated by the above equations; example snapshots are shown in Figure S4a.

We divide the total number of samples into training, validation, and test sets: the last 1000 samples are taken as the test set, 1000 samples are chosen randomly from the first 9000 samples as a validation set, and the remaining 8000 samples are taken as the training set. We train ten models using the procedure outlined above for 3 × 10^3 epochs, followed by a refinement period of 10^3 epochs. Hyperparameters used for training are shown in Table S2. Nine of the ten resulting dynamical system models have two active terms and one has three active terms. The dynamical equations, SINDy coefficient matrix, attractors, and simulated dynamics for two example models are shown in Figure S4b.

Nonlinear pendulum. We consider a simulated video of a nonlinear pendulum whose motion is governed by the second-order equation

z̈ = −sin z. [4]

We generate synthetic video of the pendulum in two spatial dimensions by creating high-dimensional snapshots given by

x(y1, y2, t) = exp(−20((y1 − cos(z(t) − π/2))² + (y2 − sin(z(t) − π/2))²))

at a discretization of y1, y2 ∈ [−1.5, 1.5]. We use 51 grid points in each dimension, resulting in snapshots x(t) ∈ R^2601. The SINDy-related hyperparameters used for training are as follows:

SINDy model order: 2
SINDy library polynomial order: 3
SINDy library includes sine: yes
SINDy loss weight ẋ, λ1: 5 × 10^−4
SINDy loss weight ż, λ2: 5 × 10^−5
SINDy regularization loss weight, λ3: 10^−5
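Finally, as an illustration of the pendulum snapshot generation described above, the following NumPy/SciPy sketch simulates the pendulum and renders each state z(t) as a 51 × 51 image flattened to R^2601; the initial condition and time grid are illustrative assumptions.

import numpy as np
from scipy.integrate import solve_ivp

def pendulum(t, state):
    z, zdot = state
    return [zdot, -np.sin(z)]              # second-order dynamics: z'' = -sin(z)

# Simulate the pendulum (initial condition and time grid are illustrative).
t = np.arange(0.0, 10.0, 0.02)
sol = solve_ivp(pendulum, (t[0], t[-1]), [np.pi / 2, 0.0], t_eval=t)
z = sol.y[0]

# Render each state as a 51 x 51 snapshot, flattened to R^2601.
y1, y2 = np.meshgrid(np.linspace(-1.5, 1.5, 51), np.linspace(-1.5, 1.5, 51))
snapshots = np.stack([
    np.exp(-20.0 * ((y1 - np.cos(zi - np.pi / 2)) ** 2
                    + (y2 - np.sin(zi - np.pi / 2)) ** 2)).ravel()
    for zi in z
])                                          # shape (len(t), 2601)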