
# Learning data-driven discretizations for partial differential equations

Edited by John B. Bell, Lawrence Berkeley National Laboratory, Berkeley, CA, and approved June 21, 2019 (received for review August 14, 2018)

## Significance

In many physical systems, the governing equations are known with high confidence, but direct numerical solution is prohibitively expensive. Often this situation is alleviated by writing effective equations to approximate dynamics below the grid scale. This process is often impossible to perform analytically and is often ad hoc. Here we propose data-driven discretization, a method that uses machine learning to systematically derive discretizations for continuous physical systems. On a series of model problems, data-driven discretization gives accurate solutions with a dramatic drop in required resolution.

## Abstract

The numerical solution of partial differential equations (PDEs) is challenging because of the need to resolve spatiotemporal features over wide length- and timescales. Often, it is computationally intractable to resolve the finest features in the solution. The only recourse is to use approximate coarse-grained representations, which aim to accurately represent long-wavelength dynamics while properly accounting for unresolved small-scale physics. Deriving such coarse-grained equations is notoriously difficult and often ad hoc. Here we introduce data-driven discretization, a method for learning optimized approximations to PDEs based on actual solutions to the known underlying equations. Our approach uses neural networks to estimate spatial derivatives, which are optimized end to end to best satisfy the equations on a low-resolution grid. The resulting numerical methods are remarkably accurate, allowing us to integrate in time a collection of nonlinear equations in 1 spatial dimension at resolutions 4× to 8× coarser than is possible with standard finite-difference methods.

Solutions of nonlinear partial differential equations can have enormous complexity, with nontrivial structure over a large range of length- and timescales. Developing effective theories that integrate out short lengthscales and fast timescales is a long-standing goal. As examples, geometric optics is an effective theory of Maxwell equations at scales much longer than the wavelength of light (1); density functional theory models the full many-body quantum wavefunction with a lower-dimensional object—the electron density field (2); and the effective viscosity of a turbulent fluid parameterizes how small-scale features affect large-scale behavior (3). These models derive their coarse-grained dynamics by more or less systematic integration of the underlying governing equations (by using, respectively, WKB theory, local density approximation, and a closure relation for the Reynolds stress). The gains from coarse graining are, of course, enormous. Conceptually, it allows a deep understanding of emergent phenomena that would otherwise be masked by irrelevant details. Practically, it allows computation of vastly larger systems.

Averaging out unresolved degrees of freedom invariably replaces them by effective parameters that mimic typical behavior. In other words, we identify the salient features of the dynamics at short and fast scales and replace these with terms that have a similar average effect on the long and slow scales. Deriving reliable effective equations is often challenging (4). Here we approach this challenge from the perspective of statistical inference. The coarse-grained representation of the function contains only partial information about it, since short scales are not modeled. Deriving coarse-grained dynamics requires first inferring the small-scale structure using the partial information (reconstruction) and then incorporating its effect on the coarse-grained field. We propose to perform reconstruction using machine-learning algorithms, which have become extraordinarily efficient at identifying and reconstructing recurrent patterns in data. Having reconstructed the fine features, modeling their effect can be done using our physical knowledge about the system. We call our method data-driven discretization. It is qualitatively different from coarse-graining techniques that are currently in use: Instead of analyzing equations of motion to derive effective behavior, we directly learn from high-resolution solutions to these equations.

## Related Work

Several related approaches for computationally extracting effective dynamics have been introduced previously. Classic works used neural networks for discretizing dynamical systems (5, 6). Similarly, equation-free modeling approximates coarse-scale derivatives by remapping coarse initial conditions to fine scales, which are integrated exactly (7). The method is similar in spirit to our approach, but it does not learn from the fine-scale dynamics, so memorized statistics cannot be reused at subsequent times to reduce the computational load. Recent works have applied machine learning to partial differential equations (PDEs), focusing either on speed (8–10) or on recovering unknown dynamics (11, 12). Models focused on speed often replace the slowest component of a physical model with machine learning, e.g., the solution of Poisson's equation in incompressible fluid simulations (9) or subgrid cloud models in climate simulations (10), or build reduced-order models that approximate dynamics in a lower-dimensional space (8, 13, 14). These approaches are promising, but they learn higher-level components than our proposed method does. An important development is the ability to satisfy some physical constraints exactly by plugging learned models into a fixed equation of motion. For example, valid fluid dynamics can be guaranteed by learning either velocity fields directly (12) or a vector potential for velocity in the case of incompressible dynamics (8). Closely related to this work, neural networks can be used to calculate closure conditions for coarse-grained turbulent flow models (15, 16). However, these models rely on existing coarse-graining schemes specific to turbulent flows and do not discretize the equations directly. Finally, ref. 17 suggested discretizations whose solutions can be analytically guaranteed to converge to the center manifold of the governing equation, but not in a data-driven manner.

## Data-Driven Subgrid-Scale Modeling

Consider a generic PDE describing the evolution of a continuous field $v(x,t)$,

$$\frac{\partial v}{\partial t}=F\left(t,x,v,\frac{\partial v}{\partial x},\frac{\partial^{2}v}{\partial x^{2}},\dots\right).\tag{1}$$

Numerical solutions discretize space into a grid of points $\{x_i\}$ and evolve the field values $v_i = v(x_i)$ by approximating the spatial derivatives at these points. There are various methods for this approximation—polynomial expansion, spectral differentiation, etc.—all yielding formulas resembling

$$\frac{\partial^{n}v}{\partial x^{n}}\approx\sum_{i}\alpha_{i}^{(n)}v_{i},\tag{2}$$

with coefficients $\alpha_i^{(n)}$. Standard schemes use one set of precomputed coefficients for all points in space, while more sophisticated methods alternate between different sets of coefficients according to local rules (20, 21). This discretization transforms Eq. **1** into a set of coupled ordinary differential equations of the form

$$\frac{\partial v_{i}}{\partial t}=F\left(t,x_{i},v_{i},\sum_{j}\alpha_{j}^{(1)}v_{j},\sum_{j}\alpha_{j}^{(2)}v_{j},\dots\right),\tag{3}$$

whose right-hand side depends on the discrete field values $v_j$ through the derivative approximations of Eq. **2**.
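To make Eqs. **2** and **3** concrete, here is a minimal sketch (our own illustration, not the paper's code) of a standard finite-difference discretization on a periodic grid; the diffusion right-hand side and grid size are illustrative choices:

```python
import numpy as np

def centered_first_derivative(v, dx):
    """Eq. 2 with the classic centered coefficients
    alpha = (-1/2, 0, 1/2) / dx on a periodic grid."""
    return (np.roll(v, -1) - np.roll(v, 1)) / (2 * dx)

def centered_second_derivative(v, dx):
    """Eq. 2 with coefficients (1, -2, 1) / dx^2."""
    return (np.roll(v, -1) - 2 * v + np.roll(v, 1)) / dx**2

def time_derivative(v, dx, eta=0.01):
    """Method of lines (Eq. 3): the PDE becomes coupled ODEs for the
    grid values v_i. Illustrated for diffusion, dv/dt = eta * v_xx."""
    return eta * centered_second_derivative(v, dx)

x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
dx = x[1] - x[0]
v = np.sin(x)
# For v = sin(x), the discrete derivative should be close to cos(x).
assert np.allclose(centered_first_derivative(v, dx), np.cos(x), atol=1e-2)
```

Standard schemes fix these coefficients once and for all; the data-driven approach replaces them with learned, equation-specific values.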

Eq. **3** approximates Eq. **1** well only when the grid is fine enough to resolve the smallest features of the solution. However, the scale of the smallest features is often orders of magnitude smaller than the system size. High-performance computing has been driven by the ever-increasing need to accurately resolve smaller-scale features in PDEs. Even with petascale computational resources, the largest direct numerical simulations of turbulent fluid flow reach Reynolds numbers of order only 1,000, using grids of hundreds of billions of points (22–24). The standard recourse is to derive effective equations for a coarse-grained field, replacing Eq. **1** and hence Eq. **3** with a different set of discrete equations.

The main idea of this work is that unresolved physics can instead be learned directly from data. Instead of deriving an approximate coarse-grained continuum model and discretizing it, we suggest directly learning low-resolution discrete models that encapsulate unresolved physics. Rigorous mathematical work shows that the dimension of a solution manifold for a nonlinear PDE is finite (25, 26) and that approximate parameterizations can be constructed (27–29). If we knew the solution manifold, we could generate equation-specific approximations for the spatial derivatives in Eq. **2**, approximations that have the potential to hold even when the system is underresolved. In contrast to standard numerical methods, the coefficients $\alpha_i^{(n)}$ are equation specific: They are learned from a dataset of high-resolution simulations. This produces a tradeoff in computational cost, which can be alleviated by carrying out high-resolution simulations on small systems to develop local approximations to the solution manifold and using them to solve equations in much larger systems at significantly reduced spatial resolution.

### Burgers’ Equation.

For concreteness, we demonstrate this approach with a specific example in 1 spatial dimension. Burgers' equation is a simple nonlinear equation which models fluid dynamics in 1D and features shock formation. In its conservative form, it is written as

$$\frac{\partial v}{\partial t}+\frac{\partial}{\partial x}\left(\frac{v^{2}}{2}-\eta\frac{\partial v}{\partial x}\right)=f(x,t),\tag{4}$$

where $\eta$ is the viscosity and $f(x,t)$ is an external forcing term. Solutions of Eq. **4** spontaneously develop sharp shocks, with specific relationships between the shock height, width, and velocity (19) that define the local structure of the solution manifold.

With this in mind, consider a typical segment of a solution to Burgers' equation (Fig. 1*A*). We want to compute the time derivative of the field given a low-resolution set of points (blue diamonds in Fig. 1). Standard finite-difference formulas predict this time derivative by approximating v as a piecewise-polynomial function passing through the given points (orange curves in Fig. 1). But solutions to Burgers' equation are not polynomials: They are shocks with characteristic properties. By using this information, we can derive a more accurate, albeit equation-specific, formula for the spatial derivatives. For the method to work, it must be possible to reconstruct the fine-scale solution from low-resolution data. To this end, we ran many simulations of Eq. **4** and used the resulting data to train a neural network. Fig. 1 compares the predictions of our neural net (details below and in *SI Appendix*) to fourth-order polynomial interpolation. This learned model is clearly far superior to the polynomial approximation, demonstrating that the spatial resolution required for parameterizing the solution manifold can be greatly reduced with equation-specific approximations rather than finite differences.
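The failure of polynomial interpolation on underresolved shocks can be reproduced in a few lines (a hypothetical illustration with a tanh shock profile; not the paper's data or model):

```python
import numpy as np

def poly_interpolate(x_coarse, v_coarse, x_fine, degree=4):
    """Fit one degree-4 polynomial through 5 coarse samples and
    evaluate it on a fine grid, as the Fig. 1 baseline does."""
    coeffs = np.polyfit(x_coarse, v_coarse, degree)
    return np.polyval(coeffs, x_fine)

w = 0.05                           # shock width, far below grid spacing
x_coarse = np.linspace(-1, 1, 5)   # 5 samples -> degree-4 polynomial
x_fine = np.linspace(-1, 1, 401)
v_coarse = np.tanh(x_coarse / w)   # shock-like profile
v_fine = np.tanh(x_fine / w)

v_poly = poly_interpolate(x_coarse, v_coarse, x_fine)
error = np.max(np.abs(v_poly - v_fine))
# The polynomial passes through the samples but badly misses the
# sub-grid shock; a learned, equation-specific reconstruction can
# exploit the known shock shape instead.
assert error > 0.5
```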

## Models for Time Integration

The natural question to ask next is whether such parameterizations can be used for time integration. For this to work well, integration in time must be numerically stable, and our models need a strong generalization capacity: Even a single error could throw off the solution for later times.

To achieve this, we use multilayer neural networks to parameterize the solution manifold, because of their flexibility, including the ability to impose physical constraints and interpretability through choice of model architecture. The high-level aspects of the network’s design, which we believe are of general interest, are described below. Additional technical details are described in *SI Appendix* and source code is available online at https://github.com/google/data-driven-discretization-1d.

### Pseudolinear Representation.

Our network represents spatial derivatives with a generalized finite-difference formula similar to Eq. **2**: The output of the network is a list of coefficients $\alpha_1^{(n)},\dots,\alpha_N^{(n)}$ defining the approximation $\partial^{n}v/\partial x^{n}\approx\sum_{i}\alpha_{i}^{(n)}v_{i}$, where, unlike in Eq. **2**, the coefficients may depend on space and time through the local field values. We call this form pseudolinear.

The pseudolinear representation is a direct generalization of the finite-difference scheme of Eq. **2**. Moreover, exactly as in the case of Eq. **2**, a Taylor expansion allows us to guarantee formal polynomial accuracy: We can impose that approximation errors decay as $O(h^{m})$ in the grid spacing $h$ for a chosen order $m$ (*SI Appendix*). We found the best results when imposing linear accuracy, $m=1$.
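One way to impose such accuracy constraints (an assumption about the implementation; the paper's code may differ in detail) is to solve the Taylor-expansion conditions exactly and let the network control only the remaining null-space degrees of freedom:

```python
import numpy as np
from math import factorial

def constrained_coefficients(offsets, n, w):
    """Map free parameters w to stencil coefficients alpha that
    approximate the n-th derivative with formal accuracy guaranteed:
    alpha = alpha0 + N @ w, where A @ alpha0 = b solves the Taylor
    conditions sum_i alpha_i dx_i^k / k! = delta_{k,n} for k <= n,
    and the columns of N span the null space of A."""
    offsets = np.asarray(offsets, dtype=float)
    k = np.arange(n + 1)
    A = offsets[None, :] ** k[:, None] / np.array(
        [factorial(int(j)) for j in k], dtype=float)[:, None]
    b = (k == n).astype(float)
    alpha0 = np.linalg.lstsq(A, b, rcond=None)[0]
    _, _, vt = np.linalg.svd(A)
    nullspace = vt[n + 1:].T       # learnable degrees of freedom
    return alpha0 + nullspace @ np.asarray(w, dtype=float)

# 3-point stencil for the first derivative (n=1) on a unit grid:
offsets = [-1.0, 0.0, 1.0]
alpha = constrained_coefficients(offsets, n=1, w=[0.0])
# Whatever w the network outputs, the linear-accuracy constraints
# sum(alpha) = 0 and sum(alpha * dx) = 1 hold by construction.
assert abs(alpha.sum()) < 1e-10
assert abs((alpha * np.asarray(offsets)).sum() - 1) < 1e-10
```

With `w = 0` this recovers the centered difference $(-1/2, 0, 1/2)$; a network that outputs `w` can shift the stencil adaptively without ever violating the accuracy condition.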

### Physical Constraints.

Since Burgers' equation is an instance of a continuity equation, a major increase in stability is obtained, as with traditional methods, by using a finite-volume scheme that ensures the coarse-grained solution satisfies the conservation law implied by the continuity equation. That is, coarse-grained equations are derived for the cell averages of the field v, rather than its nodal values (19). During training we provide the cell average to the network as the "true" value of the discretized field.

Integrating Eq. **4** over a grid cell shows that the rate of change of the cell average is completely determined by the fluxes at the cell boundaries. This is an exact relation, in which the only challenge is estimating the flux given the cell averages. Thus, prediction is carried out in 3 steps: First, the network reconstructs the spatial derivatives on the boundaries between grid cells (a staggered grid). Then, the approximated derivatives are used to calculate the flux J using the exact formula in Eq. **4**. Finally, the time derivative of each cell average is obtained by subtracting J at the cell's left and right boundaries. The calculation of the time derivative from the flux can also be done using traditional techniques that promote stability, such as monotone numerical fluxes (19). For some experiments we use Godunov flux, inspired by finite-volume ENO schemes (20, 21), but it did not improve predictions for our neural network models.
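The conservative update in step 3 can be sketched as follows (variable names and the numerical values are our own illustration); by telescoping, the scheme conserves the total of v exactly:

```python
import numpy as np

def time_derivative_from_flux(J_boundary, dx):
    """dv_i/dt = -(J_{i+1/2} - J_{i-1/2}) / dx on a periodic grid.

    J_boundary[i] holds the flux at the left boundary of cell i."""
    return -(np.roll(J_boundary, -1) - J_boundary) / dx

def burgers_flux(v, v_x, eta=0.01):
    """Exact Burgers flux from Eq. 4, J = v^2/2 - eta * dv/dx,
    evaluated from reconstructed boundary values and derivatives."""
    return 0.5 * v**2 - eta * v_x

# Conservation is automatic: the boundary fluxes telescope, so the
# sum of dv/dt over all cells vanishes regardless of J's values.
dx = 0.1
J = np.array([0.3, -0.1, 0.7, 0.2])
dvdt = time_derivative_from_flux(J, dx)
assert abs(dvdt.sum()) < 1e-12
```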

Dividing the inference procedure into these steps is favorable in a few respects: First, it allows us to constrain the model at the various stages using traditional techniques; the conservative constraint, numerical flux, and formal polynomial accuracy constraints are what we use here, but other constraints are also conceivable. Second, this scheme limits the machine-learning part to reconstructing the unknown solution at cell boundaries, which is the main conceptual challenge, while the rest of the scheme follows either the exact dynamics or traditional approximations to them. Third, it makes the trained model more interpretable, since the intermediate outputs (e.g., J or the reconstructed spatial derivatives) have clear physical meaning. Full technical details are given in *SI Appendix*.

### Choice of Loss.

The loss of a neural net is the objective function minimized during training. Rather than optimizing the prediction accuracy of the spatial derivatives, we optimize the accuracy of the resulting time derivative.* This allows us to incorporate physical constraints in the training procedure and directly optimize the final predictions rather than intermediate stages. Our loss is the mean-squared error between the predicted time derivative and labeled data produced by coarse graining the fully resolved simulations.
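A minimal sketch of the label generation and loss described above (the resample factor and arrays are illustrative):

```python
import numpy as np

def coarsen(v_fine, factor):
    """Cell-average a fine-grid field onto a grid `factor` times
    coarser, producing the "true" coarse-grained labels."""
    return v_fine.reshape(-1, factor).mean(axis=1)

def loss(dvdt_pred, dvdt_fine, factor):
    """Mean-squared error between the model's predicted time
    derivative and the coarse-grained fully resolved one."""
    return np.mean((dvdt_pred - coarsen(dvdt_fine, factor)) ** 2)

v_fine = np.arange(8.0)                       # 8 fine cells
assert np.allclose(coarsen(v_fine, 4), [1.5, 5.5])
assert loss(np.array([1.5, 5.5]), v_fine, 4) == 0.0
```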

Note that a low value of our training loss is a necessary but not sufficient condition for accurate and stable numerical integration over time. Many models with low training loss exhibited poor stability when numerically integrated (e.g., without the conservative constraint), particularly for equations with low dissipation. From a machine-learning perspective, this is unsurprising: Imitation learning approaches, such as our models, often exhibit such issues because the distribution of inputs produced by the model’s own predictions can differ from the training data (30). Incorporating the time-integrated solution into the loss improved predictions in some cases (as in ref. 9), but did not guarantee stability, and could cause the training procedure itself to diverge due to decreased stability in calculating the loss. Stability for learned numerical methods remains an important area of exploration for future work.

### Learned Coefficients.

We consider 2 different parameterizations for learned coefficients. In our first parameterization, we learn optimized time- and space-independent coefficients. These fixed coefficients minimize the loss when averaged over the whole training set for a particular equation, without allowing the scheme to adapt the coefficients according to local features of the solution. Below, we refer to these as "optimized constant coefficients." In our second parameterization, we allow the coefficients to be an arbitrary function of the neighboring field values, computed by a deep convolutional neural network; we refer to these as solution-dependent or neural network coefficients.

Example coefficients predicted by our trained models are shown in Fig. 2 and *SI Appendix*, Fig. S3. Both the optimized constant and data-dependent coefficients differ from baseline polynomial schemes, particularly in the vicinity of the shock. The neural network solutions are particularly interesting: They do not appear to be using 1-sided stencils near the shock, in contrast to traditional numerical methods such as WENO (21) which avoid placing large weights across discontinuities.

The output coefficients can also be interpreted physically. For example, coefficients for both $\partial v/\partial x$ (Fig. 2*B*, *Inset*) and $v$ (*SI Appendix*, Fig. S3*C*) are either right or left biased, opposite the sign of v. This is in line with our physical intuition: Burgers' equation describes fluid flow, and the sign of v corresponds to the direction of flow. Coefficients that are biased in the direction opposite to v essentially look "upwind," a standard strategy in traditional numerical methods for solving hyperbolic PDEs (19), which helps keep the scheme from violating temporal causality. Alternatively, upwinding can be built into the model structure by construction, as we do in models which use Godunov flux.

## Results

### Burgers’ Equation.

To assess the accuracy of the time integration from our coarse-grained model, we computed "exact" solutions to Eq. **4** for different realizations of the random forcing $f(x,t)$ by running fully resolved simulations and then compared them with low-resolution integrations of our learned models.

Results are shown in Fig. 3. Fig. 3*A* compares the integration results for a particular realization of the forcing for different values of the resample factor, that is, the factor by which the number of grid points is reduced relative to the fully converged solution.^{†} Our learned models, with both constant and solution-dependent coefficients, can propagate the solution in time and dramatically outperform the baseline method at low resolution. Importantly, the ringing effect around the shocks, which leads to numerical instabilities, is practically eliminated.

Since our model is trained on fully resolved simulations, a crucial requirement for our method to be of practical use is that training can be done on small systems, but still produce models that perform well on larger ones. We expect this to be the case, since our models, being based on convolutional neural networks, use only local features and by construction are translation invariant. Fig. 3*B* illustrates the performance of our model, trained on a small domain, when integrated on a 10-times larger domain.

To make this assessment quantitative, we averaged over many realizations of the forcing and calculated the mean absolute error integrated over time and space. Results on the 10-times larger inference domain are shown in Fig. 3*C*: The solution from the full neural network matches the accuracy of the baseline at about 8× higher resolution. Interestingly, even the simpler constant-coefficient method significantly outperforms the baseline scheme. The constant-coefficient model with Godunov flux is particularly compelling: It is faster than WENO, because there is no need to calculate coefficients on the fly, yet it has comparable accuracy and better numerical stability at coarse resolution, as shown in Figs. 3*A* and 4.

These calculations demonstrate that neural networks can carry out coarse graining. Even if the mesh spacing is much larger than the shock width, the model is still able to accurately propagate dynamics over time, showing that it has learned an internal representation of the shock structure.

### Other Examples.

To demonstrate the robustness of this method, we repeated the procedure for 2 other canonical PDEs: the Korteweg–de Vries (KdV) equation (32), which was first derived to model solitary waves on a river bore and is known for being completely integrable and featuring soliton solutions, and the Kuramoto–Sivashinsky (KS) equation, which models flame fronts and is a textbook example of a classically chaotic PDE (33). All details about these equations are given in *SI Appendix*. We repeated the training procedure outlined above for these equations, running high-resolution simulations and collecting data to train equation-specific estimators of the spatial derivative based on a coarse grid. These equations are essentially nondissipative, so we do not include a forcing term. The solution manifold is explored by changing the initial conditions, which are taken to be a superposition of long-wavelength sinusoidal functions with random amplitudes and phases (see *SI Appendix* for details).
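Such initial conditions can be sampled as below (a sketch; the number of modes and amplitude distribution are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np

def random_initial_condition(x, n_modes=3, rng=None):
    """Superpose the longest-wavelength sinusoids on a periodic
    domain with random amplitudes and phases."""
    rng = np.random.default_rng(rng)
    v = np.zeros_like(x)
    L = x[-1] - x[0] + (x[1] - x[0])     # periodic domain length
    for k in range(1, n_modes + 1):      # long wavelengths only
        amplitude = rng.uniform(-1.0, 1.0)
        phase = rng.uniform(0.0, 2.0 * np.pi)
        v += amplitude * np.sin(2.0 * np.pi * k * x / L + phase)
    return v

x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
v0 = random_initial_condition(x, rng=0)
assert v0.shape == x.shape
assert np.all(np.abs(v0) <= 3.0)   # bounded by the sum of amplitudes
```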

To assess the accuracy of the integrated solution, for each initial condition we define “valid simulation time” as the first time that the low-resolution integrated solution deviates from the cell-averaged high-resolution solution by more than a given threshold. We found this metric more informative to compare across very different equations than absolute error.
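The metric can be sketched as follows (the norm and threshold are illustrative choices):

```python
import numpy as np

def valid_simulation_time(t, v_coarse, v_reference, threshold):
    """First time the low-resolution trajectory deviates from the
    cell-averaged high-resolution one by more than `threshold`.

    t: (T,) times; v_coarse, v_reference: (T, N) trajectories on the
    same coarse grid. Returns the final time if never exceeded."""
    deviation = np.mean(np.abs(v_coarse - v_reference), axis=1)
    exceeded = np.nonzero(deviation > threshold)[0]
    return t[exceeded[0]] if exceeded.size else t[-1]

t = np.array([0.0, 1.0, 2.0, 3.0])
ref = np.zeros((4, 2))
sol = np.array([[0.0, 0.0], [0.1, 0.1], [0.6, 0.6], [1.0, 1.0]])
# Deviation first exceeds 0.5 at t = 2.0.
assert valid_simulation_time(t, sol, ref, threshold=0.5) == 2.0
```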

Fig. 4 shows the median valid simulation time as a function of the resample factor. For all equations and resolutions, our neural network models have comparable or better performance than all other methods. The neural network is particularly advantageous at low resolutions, demonstrating its improved ability to solve coarse-grained dynamics. The optimized constant coefficients perform better at coarse resolution than baseline methods, but not always at high resolutions. Finally, at large enough resample factors the neural network approximations also fail to reproduce the dynamics, as expected. These results also hold on a 10-times larger spatial domain, as shown in *SI Appendix*, along with figures illustrating specific realizations and mean absolute error (*SI Appendix*, Figs. S8 and S9).

## Discussion and Conclusion

It has long been remarked that even simple nonlinear PDEs can generate solutions of great complexity. But even very complex, possibly chaotic, solutions are not just arbitrary functions: They are highly constrained by the equations they solve. In mathematical terms, despite the fact that the solution set of a PDE is nominally infinite dimensional, the inertial manifold of solutions is much smaller and can be understood in terms of interactions between local features of the solutions to nonlinear PDEs. The dynamical rules for interactions between these features have been well studied over the past 50 years. Examples include, among many others, interactions of shocks in complex media, interactions of solitons (32), and the turbulent energy cascade (34).

Machine learning offers a different approach for modeling these phenomena, by using training data to parameterize the inertial manifold itself; said differently, it learns both the features and their interactions from experience of the solutions. Here we propose a simple algorithm for achieving this, motivated by coarse graining in physical systems. It is often the case that coarse graining a PDE amounts to modifying the weights in a discretized numerical scheme. Instead, we use known solutions to learn these weights directly, generating data-driven discretizations. This effectively parameterizes the solution manifold of the PDE, allowing the equation to be solved at high accuracy with an unprecedented low resolution.

Faced with this success, it is tempting to try and leverage the understanding the neural network has developed to gain new insights about the equation or its coarse-grained representation. Indeed, in Fig. 2 we could clearly interpret the directionality of the weights as an upwind bias, the pseudolinear representation providing a clear interpretation of the prediction in a physically sensible way. However, extracting more abstract insight from the network, such as the scaling relation between the shock height and width, is a difficult challenge. This is a general problem in the field of machine learning, which is under intensive current research (35, 36).

Our results are promising, but 2 challenges remain before our approach can be deployed at large scales. The first challenge is speed. We showed that optimized constant coefficients can already improve accuracy, but our best models rely on the flexibility of neural networks. Unfortunately, our neural nets use many more convolution operations than the single convolution required to implement finite differences, making each evaluation of the time derivative substantially more expensive.

A second challenge is scaling to higher-dimensional problems and more complex grids. Here we showcased the approach for regular grids in 1D, but most problems in the real world are higher dimensional, and irregular and adaptive grids are common. We do expect larger potential gains in 2D and 3D, as the computational gain in terms of the number of grid points would scale like the square or the cube of the resample factor. Irregular grids may be more challenging, but deep learning methods that respect appropriate invariants have been developed both for arbitrary graphs (39) and for collections of points in 3D space (40). Similar to what we found here, we expect that hand-tuned heuristics for both gridding and grid coefficients could be improved upon by systematic machine learning. More broadly, data-driven discretization suggests the potential of data-driven numerical methods, combining the optimized approximations of machine learning with the generalization of physical laws.

## Acknowledgments

We thank Peyman Milanfar, Pascal Getreuer, Ignacio Garcia Dorado, and Dmitrii Kochkov for collaboration and important conversations; Peter Norgaard and Geoff Davis for feedback on drafts of the manuscript; and Chi-Wang Shu for guidance on the implementation of WENO. Y.B.-S. acknowledges support from the James S. McDonnell postdoctoral fellowship for the study of complex systems. M.P.B. acknowledges support from NSF Grant DMS-1715477, ONR Grant N00014-17-1-3029, and the Simons Foundation.

## Footnotes

^{1}Y.B.-S. and S.H. contributed equally to this work.

^{2}To whom correspondence may be addressed. Email: ybarsinai{at}gmail.com or shoyer{at}google.com.

Author contributions: Y.B.-S., S.H., J.H., and M.P.B. designed research; Y.B.-S. and S.H. performed research; Y.B.-S. and S.H. analyzed data; and Y.B.-S., S.H., J.H., and M.P.B. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Source code is available on GitHub (https://github.com/google/data-driven-discretization-1d).

*For one specific case, namely the constant-coefficient model of Burgers' equation with Godunov flux limiting, trained models showed poor performance (e.g., not monotonically increasing with resample factor) unless the loss explicitly included the time-integrated solution, as done in ref. 9. Results shown in Figs. 3 and 4 use this loss for the constant-coefficient models with Burgers' equation. See details in *SI Appendix*.

^{†}Physically, the natural measure of the spatial resolution is with respect to the internal lengthscale of the equation, which in the case of Burgers' equation is the typical shock width. However, since this analysis is meant to be applicable also to situations where the internal lengthscale is a priori unknown, we compare here to the lengthscale at which mesh convergence is obtained.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1814058116/-/DCSupplemental.

Copyright © 2019 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

1. J. D. Jackson
2. D. Sholl, J. A. Steckel
3. C. J. Chen
4. M. Van Dyke
5. R. Gonzalez-Garcia, R. Rico-Martinez, I. Kevrekidis
6. R. Rico-Martinez, I. Kevrekidis, K. Krischer (in A. B. Bulsari, Ed.)
7. (entry not preserved in source)
8. B. Kim et al.
9. J. Tompson, K. Schlachter, P. Sprechmann, K. Perlin (in D. Precup, Y. W. Teh, Eds.)
10. S. Rasp, M. S. Pritchard, P. Gentine
11. S. L. Brunton, J. L. Proctor, J. N. Kutz
12. E. de Bezenac, A. Pajot, P. Gallinari
13. B. Lusch, J. N. Kutz, S. L. Brunton
14. J. Morton, F. D. Witherden, A. Jameson, M. J. Kochenderfer (in S. Bengio et al., Eds.)
15. J. Ling, A. Kurzawski, J. Templeton
16. A. D. Beck, D. G. Flad, C. D. Munz
17. A. Roberts
18. W. E. Schiesser
19. R. J. LeVeque
20. A. Harten, B. Engquist, S. Osher, S. R. Chakravarthy
21. C. W. Shu (in A. Quarteroni, Ed.)
22. M. Lee, R. D. Moser
23. M. Clay, D. Buaria, T. Gotoh, P. Yeung
24. K. P. Iyer, K. R. Sreenivasan, P. K. Yeung
25. P. Constantin, C. Foias, B. Nicolaenko, R. Temam
26. (entry not preserved in source)
27. M. Jolly, I. Kevrekidis, E. Titi
28. (entry not preserved in source)
29. M. Marion
30. S. Ross, D. Bagnell (in Y. W. Teh, M. Titterington, Eds.)
31. I. Goodfellow, Y. Bengio, A. Courville
32. (entry not preserved in source)
33. D. Zwillinger
34. U. Frisch
35. M. Sundararajan, A. Taly, Q. Yan (in D. Precup, Y. W. Teh, Eds.)
36. A. Shrikumar, P. Greenside, A. Kundaje (in D. Precup, Y. W. Teh, Eds.)
37. Y. Romano, J. Isidoro, P. Milanfar
38. P. Getreuer et al.
39. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl (in D. Precup, Y. W. Teh, Eds.)
40. C. R. Qi, L. Yi, H. Su, L. J. Guibas (in I. Guyon et al., Eds.)