# Uniformly accurate machine learning-based hydrodynamic models for kinetic equations

Edited by Stanley Osher, University of California, Los Angeles, CA, and approved September 25, 2019 (received for review June 7, 2019)

## Significance

This paper addresses 2 very important issues of current interest: multiscale modeling in the absence of scale separation and building interpretable and truly reliable physical models using machine learning. We demonstrate that machine learning can indeed help us to build reliable multiscale models for problems with which classical multiscale methods have had trouble. To this end, one has to develop the appropriate models or algorithms for each of the 3 major components in the machine-learning procedure: labeling the data, learning from the data, and exploring the state space. We use the kinetic equation as an example and demonstrate that uniformly accurate moment systems can be constructed this way.

## Abstract

A framework is introduced for constructing interpretable and truly reliable reduced models for multiscale problems in situations without scale separation. Hydrodynamic approximation to the kinetic equation is used as an example to illustrate the main steps and issues involved. To this end, a set of generalized moments are constructed first to optimally represent the underlying velocity distribution. The well-known closure problem is then solved with the aim of best capturing the associated dynamics of the kinetic equation. The issue of physical constraints such as Galilean invariance is addressed and an active-learning procedure is introduced to help ensure that the dataset used is representative enough. The reduced system takes the form of a conventional moment system and works regardless of the numerical discretization used. Numerical results are presented for the BGK (Bhatnagar–Gross–Krook) model and binary collision of Maxwell molecules. We demonstrate that the reduced model achieves a uniform accuracy in a wide range of Knudsen numbers spanning from the hydrodynamic limit to free molecular flow.


This paper is written with 2 objectives in mind. The first is about multiscale modeling. Here, our interest is to develop macroscale models in situations without scale separation. The second is about machine learning. Here, our interest is to build interpretable and truly reliable physical models with the help of machine learning. Both objectives are of paramount importance for scientific modeling, and certainly we will not be able to address them completely within one paper. Instead, we will restrict ourselves to an illustration of what we consider to be the main issues that are involved, using a relatively simple example: machine learning-based moment closure for kinetic equations.

In scientific modeling, we often encounter the following dilemma: we are interested in modeling some macroscale phenomenon but we only have a reliable model at some micro scale and the microscale model is too detailed and too expensive for practical use. The idea of multiscale modeling is to develop models or algorithms that can efficiently capture the macroscale behavior of the system but use only the microscale model (see refs. 1 and 2 for a review). This program has been quite successful for situations in which there is a clear gap between the active scales in the macro- and microscale processes. However, for problems that do not exhibit scale separation, we still lack effective tools that can be used to discover the relevant variables and find accurate approximations for the dynamics of these relevant variables.

Machine learning, the second topic that we are concerned with here, is equally broad and important. We are particularly interested in developing physical models using machine learning. There are significant differences between this and traditional machine-learning tasks, such as the ones in computer vision and data analytics. One is that, in the current setting, the data are not given beforehand but are generated using the microscale model. The good news is that, in principle, we can generate an unlimited amount of data. The bad news is that, oftentimes, the process of generating data is expensive. Therefore, it is an important issue to find a dataset that is as small as possible and yet representative enough. The second aspect is that we have to be concerned with physical constraints such as symmetry, invariance, and the laws of nature. We have to make a choice between enforcing these constraints explicitly or enforcing them approximately as a byproduct of accurately modeling the underlying physical process. The third is the interpretability issue. Machine-learning models are often black box in nature. However, a physical model that we can rely on, just as the Schrödinger equation in quantum mechanics or the Navier–Stokes equation in fluid mechanics, cannot be a complete black box. Some of these issues have been studied in refs. 3–5 in the context of modeling interatomic force fields.

The work presented in refs. 3–5 is limited to the context of learning some functions. This paper examines the aforementioned issues systematically in the context of developing reduced dynamical models for multiscale problems. As a concrete example, we will study the problem where the microscale model is the kinetic equation, and the macroscale model is the hydrodynamic model. Besides being a very representative problem in multiscale modeling, this example is also of great practical interest in its own right. Kinetic equations have found applications in modeling rarefied gases, plasmas, microflow, semiconductor devices, radiative transfer, complex fluids, and so on. From a computational point of view, kinetic equations are costly to solve due to the dimensionality of the problem and the multidimensional integral in the collision term. As a result, there has been a long history of interest in reducing kinetic models to hydrodynamic models, going back at least to the work of Grad (6). Our work is in the same spirit.

A crucial dimensionless number that influences the behavior of the solutions of the kinetic equations is the Knudsen number, defined as the ratio between the mean free path of a particle and the typical macroscopic length scale of interest. When the Knudsen number is small, the velocity distribution stays close to local Maxwellians. This is the so-called hydrodynamic regime. In this regime, it is possible to derive hydrodynamic models for some selected macroscopic fluid variables (typically the mass, momentum, and energy densities), such as the Euler equations or the Navier–Stokes–Fourier equations (7, 8). These hydrodynamic models are not only much less costly to solve but also much easier to use to explain experimental observations. They are often in the form of conservation laws with some conserved quantities and fluxes. However, when the Knudsen number is no longer small, the hydrodynamic models break down, and the direct simulation Monte Carlo (DSMC) method (9–11) becomes the preferred choice. DSMC works well in the collisionless limit when the Knudsen number is large, but the associated computational cost becomes prohibitive when approaching the hydrodynamic regime due to the high collision rate. The significant difference between these 2 types of approaches creates a problem when modeling transitional flows in which the Knudsen number varies drastically.

Significant effort has been devoted to developing generalized hydrodynamic models as a replacement of the kinetic equation in different regimes, with limited success. These generalized models can be put into 2 categories. The first consists of direct extensions of the Navier–Stokes–Fourier equations. In these models, no new variables are introduced, but new derivative terms are added. One example is the Burnett equation (12). The second consists of moment equations. In this case, additional moments are introduced and their dynamic equations are derived through the process of “moment closure” using the kinetic equation. The best-known example is Grad’s 13-moment system (6). Much effort has gone into making sure that such moment equations are mathematically well-posed and respect the most important physical constraints (8, 13, 14). In both approaches, one has to introduce some uncontrolled truncation in order to obtain a closed system of partial differential equations (PDEs).

To develop machine learning-based models, one can also proceed along these 2 separate directions. One can stay with the original set of hydrodynamic variables (mass, momentum, and energy densities) but introduce corrections to the fluxes as functionals of the local (in space) instantaneous hydrodynamic variables. This approach is relatively straightforward to implement through convolutional neural networks, and it does give reasonably satisfactory results. However, the accuracy of such a model is limited beforehand by the choice of the hydrodynamic variables. Moreover, the resulting model depends heavily on the specific spatial discretization used to generate the data and train the model, and it is difficult to obtain a continuous PDE-like model using this approach. In other words, the result is more like a specific numerical algorithm than a new reliable physical model. In addition, some of the issues that we mentioned above, which are important for more general multiscale modeling problems, do not arise. This means that, as an illustrative example for a general class of problems, it is of limited value. For these reasons, most of our effort is devoted to the second category of methods: the moment equations. Specifically, we revisit the conventional Grad-type moments and explore the possibility of learning new generalized moments, not necessarily polynomials, to represent the velocity distribution function. We also explore the possibility of learning the dynamical model accurately, instead of resorting to ad hoc closure assumptions. In addition, we develop an active-learning procedure in order to ensure that the dataset used to generate these models is representative enough.

Below, we use the Boltzmann equation to explain the technical details of our approach. The algorithms are implemented for the BGK (Bhatnagar–Gross–Krook) model (15) and the Boltzmann equation modeling binary collision of Maxwell molecules. We work with a wide range of Knudsen numbers ranging from ${10}^{-3}$ to $10$ and different initial conditions. It is easy to see that the methodology developed here is applicable to a much wider range of models, some of which are under current investigation.

There has been considerable recent activity on applying machine learning, especially deep learning, to the study of dynamical systems, including representing the physical quantities described by PDEs with physics-informed neural networks (16, 17), uncovering the underlying hidden PDE models with convolutional kernels (18, 19) or the sparse identification of nonlinear dynamics method (20, 21), predicting reduced dynamics with memory effects using long short-term memory recurrent neural networks (22, 23), calibrating the Reynolds stress in the Reynolds-averaged Navier–Stokes equation (24, 25), and so on. Machine-learning techniques have also been leveraged to find proper coordinates for dynamical systems in the spirit of dimensionality reduction (26–29). Despite the widespread interest in machine learning of dynamical systems, the examples presented in the literature are mainly ordinary differential equations or simple partial differential equations for which the ground truth is known. In contrast, this paper aims at accurately learning, from the original microscale model, a reduced macroscale PDE that is unknown beforehand.

## Preliminaries

Let $D$ be the spatial dimension and $f=f\left(\mathit{x},\mathit{v},t\right):{\mathbb{R}}^{D}\times {\mathbb{R}}^{D}\times \mathbb{R}\to \mathbb{R}$ be the 1-particle phase-space probability density. Consider the following initial value problem of the Boltzmann equation in the absence of external forces:

$${\partial}_{t}f+\mathit{v}\cdot {\nabla}_{\mathit{x}}f=\frac{1}{\epsilon}Q\left(f\right),\hspace{1em}f\left(\mathit{x},\mathit{v},0\right)={f}_{0}\left(\mathit{x},\mathit{v}\right),$$

[1]

where $\epsilon $ is the dimensionless Knudsen number and $Q$ is the collision operator. Two examples of the collision operator are considered in this paper. The first example is the BGK model

$$Q\left(f\right)\left(\mathit{v}\right)={f}_{M}\left(\mathit{v}\right)-f\left(\mathit{v}\right).$$

[2]

Here, ${f}_{M}$ is the local Maxwellian distribution function, sometimes also called the local equilibrium,

$${f}_{M}\left(\mathit{v}\right)=\frac{\rho}{{\left(2\pi T\right)}^{\frac{D}{2}}}\mathrm{exp}\left(-\frac{{|\mathit{v}-\mathit{u}|}^{2}}{2T}\right),$$

[3]

where $\rho $, $\mathit{u}$, and $T$ are the density, bulk velocity, and temperature fields. They are related to the moments of $f$ through

$$\rho ={\int}_{{\mathbb{R}}^{D}}f\hspace{0.17em}\mathrm{d}\mathit{v},\hspace{1em}\mathit{u}=\frac{1}{\rho}{\int}_{{\mathbb{R}}^{D}}f\mathit{v}\hspace{0.17em}\mathrm{d}\mathit{v},\hspace{1em}T=\frac{1}{D\rho}{\int}_{{\mathbb{R}}^{D}}f{|\mathit{v}-\mathit{u}|}^{2}\hspace{0.17em}\mathrm{d}\mathit{v}.$$

[4]

In the above, the dependence on location and time has been dropped for clarity of notation. The second example considered is the model for binary collision of Maxwell molecules in 2 dimensions:

$$Q\left(f\right)\left(\mathit{v}\right)={\int}_{{\mathbb{R}}^{2}}{\int}_{{\mathbb{S}}^{1}}\frac{1}{2\pi}\left(f\left(\mathit{v}\prime \right)f\left({\mathit{v}}_{*}^{\prime}\right)-f\left(\mathit{v}\right)f\left({\mathit{v}}_{*}\right)\right)\hspace{0.17em}\mathrm{d}\sigma \hspace{0.17em}\mathrm{d}{\mathit{v}}_{*}.$$

[5]

Here, $\left(\mathit{v},{\mathit{v}}_{*}\right)$ and $\left(\mathit{v}\prime ,{\mathit{v}}_{*}^{\prime}\right)$ are the pairs of precollision and postcollision velocities, related by

$$\left\{\begin{array}{cc}\hfill & \mathit{v}\prime =\frac{\mathit{v}+{\mathit{v}}_{*}}{2}+\frac{|\mathit{v}-{\mathit{v}}_{*}|}{2}\sigma ,\hfill \\ \hfill & {\mathit{v}}_{*}^{\prime}=\frac{\mathit{v}+{\mathit{v}}_{*}}{2}-\frac{|\mathit{v}-{\mathit{v}}_{*}|}{2}\sigma ,\hfill \end{array}\right.$$

with $\sigma \in {\mathbb{S}}^{1}$. In general, $Q$ satisfies the conditions

$${\int}_{{\mathbb{R}}^{D}}Q\left(f\right)\left(\mathit{v}\right)\mathit{\varphi}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}=\mathbf{0},\hspace{1em}\mathit{\varphi}\left(\mathit{v}\right)={\left(1,\mathit{v},{\left|\mathit{v}\right|}^{2}/2\right)}^{\mathrm{T}}.$$

[6]

These ensure that mass, momentum, and energy are conserved during the evolution.
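To make Eqs. **2**–**4** concrete, the following sketch computes the hydrodynamic moments, the local Maxwellian, and the BGK collision operator, and checks the conservation conditions of Eq. **6**. The 1D velocity grid and the simple rectangle-rule quadrature are our own illustrative choices, not the discretization used in the paper.

```python
import numpy as np

# Illustrative uniform velocity grid for D = 1; rectangle-rule quadrature.
v = np.linspace(-10.0, 10.0, 401)
dv = v[1] - v[0]

def moments(f):
    """Hydrodynamic moments rho, u, T of f as in Eq. 4 (D = 1)."""
    rho = f.sum() * dv
    u = (f * v).sum() * dv / rho
    T = (f * (v - u) ** 2).sum() * dv / rho
    return rho, u, T

def maxwellian(rho, u, T):
    """Local Maxwellian f_M of Eq. 3 for D = 1."""
    return rho / np.sqrt(2 * np.pi * T) * np.exp(-((v - u) ** 2) / (2 * T))

def bgk(f):
    """BGK collision operator Q(f) = f_M - f of Eq. 2."""
    return maxwellian(*moments(f)) - f

# A non-equilibrium bimodal distribution; the collision invariants of Q
# (Eq. 6) vanish, so mass, momentum, and energy are conserved.
f = 0.7 * maxwellian(1.0, 0.5, 1.2) + 0.3 * maxwellian(1.0, -1.0, 0.8)
Q = bgk(f)
invariants = [(Q * phi).sum() * dv for phi in (np.ones_like(v), v, v ** 2 / 2)]
```

Because $f_M$ is built from the moments of $f$ itself, the three discrete integrals of $Q$ against $1$, $v$, and $v^2/2$ vanish to quadrature accuracy, mirroring Eq. **6**.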

It can be shown that when $\epsilon \ll 1$, $f$ stays close to the local Maxwellian distributions (30–34). Taking the moments $\u27e8\cdot ,\mathit{\varphi}\u27e9\hspace{0.17em}\u2254\hspace{0.17em}\int \cdot \mathit{\varphi}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}$ on both sides of the Boltzmann equation Eq. **1**, with $\mathit{\varphi}$ in Eq. **6**, and replacing $f$ by ${f}_{M}$, one obtains the compressible Euler equations:

$$\left\{\begin{array}{cc}\hfill & {\partial}_{t}\rho +{\nabla}_{\mathit{x}}\cdot \left(\rho \mathit{u}\right)=0,\hfill \\ \hfill & {\partial}_{t}\left(\rho \mathit{u}\right)+{\nabla}_{\mathit{x}}\cdot \left(\rho \mathit{u}\otimes \mathit{u}+p\mathbf{I}\right)=0,\hfill \\ \hfill & {\partial}_{t}E+{\nabla}_{\mathit{x}}\cdot \left(\left(E+p\right)\mathit{u}\right)=0,\hfill \end{array}\right.$$

[7]

where $p=\rho T$ is the pressure and $E=\frac{1}{2}\rho {\left|\mathit{u}\right|}^{2}+\frac{D}{2}\rho T$ is the total energy. Letting

$$\mathit{U}=\left(\begin{array}{c}\hfill \rho \hfill \\ \hfill \rho \mathit{u}\hfill \\ \hfill E\hfill \end{array}\right),\hspace{1em}{\mathit{F}}_{\text{Euler}}\left(\mathit{U}\right)=\left(\begin{array}{c}\hfill \rho {\mathit{u}}^{\mathrm{T}}\hfill \\ \hfill \rho \mathit{u}\otimes \mathit{u}+p\mathbf{I}\hfill \\ \hfill \left(E+p\right){\mathit{u}}^{\mathrm{T}}\hfill \end{array}\right),$$

[8]

we can rewrite the Euler equations in the succinct conservation form

$${\partial}_{t}\mathit{U}+{\nabla}_{\mathit{x}}\cdot {\mathit{F}}_{\text{Euler}}\left(\mathit{U}\right)=0.$$

[9]

For larger values of $\epsilon $, one would like to use the moment method to obtain similar systems of PDEs that serve as approximations to the Boltzmann equation. To this end, one starts with a linear space $\mathbb{M}$ of functions of $\mathit{v}$ (usually chosen to be polynomials) that includes the components of $\mathit{\varphi}$ in Eq. **6**. Denote by $\mathit{m}$ a vector whose components form a basis of $\mathbb{M}$. Then, the moments $\mathit{M}$, as functions of $\mathit{x}$ and $t$, are defined as ${\int}_{{\mathbb{R}}^{D}}f\mathit{m}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}$. From the Boltzmann equation, one obtains

$${\partial}_{t}\mathit{M}+{\nabla}_{\mathit{x}}\cdot {\int}_{{\mathbb{R}}^{D}}f\mathit{m}\left(\mathit{v}\right){\mathit{v}}^{\mathrm{T}}\hspace{0.17em}\mathrm{d}\mathit{v}={\int}_{{\mathbb{R}}^{D}}\frac{1}{\epsilon}Q\left(f\right)\mathit{m}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}.$$

[10]

The challenge is to approximate the 2 integral terms in Eq. **10** as functions of $\mathit{M}$ in order to obtain a closed system. This is the well-known “moment closure problem,” and it is at this stage that various uncontrolled approximations are introduced. In any case, once this is done, one obtains a closed system of the form

$$\left\{\begin{array}{cc}\hfill & {\partial}_{t}\mathit{U}+{\nabla}_{\mathit{x}}\cdot \mathit{F}\left(\mathit{U},\mathit{W}\right)=0,\hfill \\ \hfill & {\partial}_{t}\mathit{W}+{\nabla}_{\mathit{x}}\cdot \mathit{G}\left(\mathit{U},\mathit{W}\right)=\frac{1}{\epsilon}\mathit{R}\left(\mathit{U},\mathit{W}\right).\hfill \end{array}\right.$$

[11]

Here, the moments $\mathit{M}$ are decomposed as $\mathit{M}={\left(\mathit{U},\mathit{W}\right)}^{\mathrm{T}}$, where $\mathit{U}\in {\mathbb{R}}^{D+2}$ denotes the conserved quantities, including the mass, momentum, and energy densities, and $\mathit{W}\in {\mathbb{R}}^{M}$ are the additional moments. The basis functions $\mathit{m}\left(\mathit{v}\right)$ are decomposed accordingly as $\mathit{m}\left(\mathit{v}\right)={\left(\mathit{\varphi}\left(\mathit{v}\right),\mathit{w}\left(\mathit{v}\right)\right)}^{\mathrm{T}}$. The initial condition of Eq. **11** is determined by that of the Boltzmann equation Eq. **1**:

$$\begin{array}{c}\hfill \mathit{U}\left(\mathit{x},0\right)={\int}_{{\mathbb{R}}^{D}}{f}_{0}\left(\mathit{x},\mathit{v}\right)\mathit{\varphi}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v},\\ \hfill \mathit{W}\left(\mathit{x},0\right)={\int}_{{\mathbb{R}}^{D}}{f}_{0}\left(\mathit{x},\mathit{v}\right)\mathit{w}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}.\end{array}$$

Similarly, the term $1/\epsilon $ in Eq. **11** is inherited directly from Eq. **1**.

## Machine Learning-Based Moment System

We are interested in approximating a family of kinetic problems, in which the Knudsen number may span from the hydrodynamic regime ($\epsilon \ll 1$) to the free molecular regime ($\epsilon \sim 10$), and the initial conditions are sampled from a wide distribution of profiles. Fig. 1 presents a schematic diagram of the framework for the machine learning-based moment method. Below, we first describe the method for finding generalized moments and addressing the moment closure problem. Then, we discuss how to build the dataset $\mathcal{D}$ incrementally to achieve efficient data exploration.

Fig. 1. Schematic diagram of the framework for the machine learning-based moment method.

### Finding Generalized Moments.

For this discussion, we can neglect the dependence of $f$ on $\left(\mathit{x},t\right)$ and view $f$ as a function of the velocity $\mathit{v}$ only. We consider 2 different ways of finding additional moments $\mathit{W}$. One is using the conventional Grad-type moments, and the other is based on the autoencoder. The conventional moment system, such as Grad’s moment system, starts with the Hermite expansion of $f$where $\mathit{\alpha}=\left({\alpha}_{1},\dots ,{\alpha}_{D}\right)$ is a $D$-dimensional multiindex. Here, the basis functions are defined asand $H{e}_{k}\left(\cdot \right)$ is the $k$th-order Hermite polynomial. Using the orthogonality of Hermite polynomials, one can easily express ${f}_{\mathit{\alpha}}$ as certain moment of $f$. If $f$ is at the local equilibrium, the only nonzero coefficient is ${f}_{\mathbf{0}}=\rho $. In general, one can truncate the Hermite expansion at an order $L$ and select those coefficients ${f}_{\mathit{\alpha}}$ with $0<\left|\mathit{\alpha}\right|\le L$ to construct the additional moments ${\mathit{W}}_{\text{Herm}}$.

$$f\left(\mathit{v}\right)=\sum _{\mathit{\alpha}\in {\mathbb{N}}^{D}}{f}_{\mathit{\alpha}}{\mathcal{H}}_{T,\mathit{\alpha}}\left(\frac{\mathit{v}-\mathit{u}}{\sqrt{T}}\right),$$

[12]

$${\mathcal{H}}_{T,\mathit{\alpha}}\left(\mathit{v}\right)=\frac{1}{{\left(2\pi \right)}^{\frac{D}{2}}}\prod _{j=1}^{D}{T}^{-\frac{{\alpha}_{j}+1}{2}}H{e}_{{\alpha}_{j}}\left({v}_{j}\right)\mathrm{exp}\left(-{v}_{j}^{2}/2\right),$$
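As a small numerical illustration of Eq. **12** (a sketch under our own discretization choices, with $D=1$ and probabilists' Hermite polynomials from NumPy), the projections of $f$ onto $H{e}_{k}\left((v-u)/\sqrt{T}\right)$ are, up to normalization, the coefficients ${f}_{\mathit{\alpha}}$; at local equilibrium only the zeroth one survives.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

# Illustrative 1D velocity grid; rectangle-rule quadrature.
v = np.linspace(-10.0, 10.0, 2001)
dv = v[1] - v[0]

def grad_moments(f, u, T, L=4):
    """Projections of f onto He_k((v - u)/sqrt(T)), k = 0, ..., L (D = 1).
    Up to normalization, these are the coefficients f_alpha of Eq. 12."""
    xi = (v - u) / np.sqrt(T)
    coeffs = np.eye(L + 1)  # row k selects the probabilists' Hermite He_k
    return np.array([(f * hermeval(xi, coeffs[k])).sum() * dv
                     for k in range(L + 1)])

# At local equilibrium, only the zeroth coefficient survives (f_0 = rho).
rho, u, T = 1.3, 0.4, 0.9
f_M = rho / np.sqrt(2 * np.pi * T) * np.exp(-((v - u) ** 2) / (2 * T))
W = grad_moments(f_M, u, T)
```

By the orthogonality of the $H{e}_{k}$ under the Gaussian weight, `W` is $(\rho, 0, 0, 0, 0)$ to quadrature accuracy for the Maxwellian above.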

The aforementioned Grad-type moments are based on the rationale that the truncated Hermite series is an effective approximation to $f$ near the local equilibrium. In a broader sense, we can use an autoencoder (35, 36) to seek the generalized moments ${\mathit{W}}_{\text{Enc}}$ without relying on such domain knowledge. More concretely, we wish to find an encoder $\mathrm{\Psi}$ that maps $f$ to generalized moments ${\mathit{W}}_{\text{Enc}}\in {\mathbb{R}}^{M}$ and a decoder $\mathrm{\Phi}$ that recovers the original $f$ from $\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right)$:

$$\begin{array}{c}\hfill {\mathit{W}}_{\text{Enc}}=\mathrm{\Psi}\left(f\right)={\int}_{{\mathbb{R}}^{D}}f\left(\mathit{v}\right)\mathit{w}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v},\\ \hfill \mathrm{\Phi}\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right)\left(\mathit{v}\right)=\mathrm{exp}\left(h\left(\mathit{v};\mathit{U},{\mathit{W}}_{\text{Enc}}\right)\right).\end{array}$$

[13]

The exponential form is chosen in order to ensure positivity of the reconstructed density. As an auxiliary goal, we would also like to predict a macroscopic analog of the entropy

$$\eta \left(f\right)={\int}_{{\mathbb{R}}^{D}}\hspace{0.17em}-\hspace{0.17em}f\mathrm{ln}f\hspace{0.17em}\mathrm{d}\mathit{v},$$

with all of the available moments. Accordingly, the objective function to be minimized reads

$${\mathbb{E}}_{f\sim \mathcal{D}}\left[\parallel f-\mathrm{\Phi}\left(\mathit{U},\mathrm{\Psi}\left(f\right)\right){\parallel}^{2}+{\lambda}_{\eta}{\left(\eta \left(f\right)-{h}_{\eta}\left(\mathit{U},\mathrm{\Psi}\left(f\right)\right)\right)}^{2}\right].$$

[14]
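A minimal numpy sketch of the encoder–decoder structure in Eqs. **13** and **14** follows. The weights are untrained and randomly initialized; the network sizes, the number $M$ of moments, and the omission of the entropy term are all our own illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.linspace(-10.0, 10.0, 201)[:, None]   # velocity grid (D = 1)
dv = float(v[1] - v[0])
M = 6                                        # number of generalized moments (hypothetical)

def mlp(x, W1, b1, W2, b2):
    """Two-layer tanh network: the trial space used here for w and h."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Untrained parameters, for illustration only.
w_params = (0.5 * rng.normal(size=(1, 32)), rng.normal(size=32),
            0.1 * rng.normal(size=(32, M)), np.zeros(M))
h_params = (0.5 * rng.normal(size=(1 + 3 + M, 32)), rng.normal(size=32),
            0.1 * rng.normal(size=(32, 1)), np.zeros(1))

def encode(f):
    """W_Enc = Psi(f) = int f(v) w(v) dv (first line of Eq. 13)."""
    return (f[:, None] * mlp(v, *w_params)).sum(axis=0) * dv

def decode(U, W_enc):
    """Phi(U, W_Enc)(v) = exp(h(v; U, W_Enc)): positive by construction."""
    feat = np.hstack([v, np.tile(np.concatenate([U, W_enc]), (len(v), 1))])
    return np.exp(mlp(feat, *h_params)).ravel()

# Reconstruction part of the objective in Eq. 14 for one sample f.
f = np.exp(-v.ravel() ** 2 / 2) / np.sqrt(2 * np.pi)
U = np.array([f.sum() * dv,                          # rho
              (f * v.ravel()).sum() * dv,            # rho * u
              (f * v.ravel() ** 2 / 2).sum() * dv])  # E
f_rec = decode(U, encode(f))
loss = ((f - f_rec) ** 2).sum() * dv                 # squared L2 distance
```

Note that positivity of the reconstruction holds regardless of training, because of the exponential in the decoder.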

In Eq. **14**, the unknown functions to optimize are $\mathit{w},h,{h}_{\eta}$, and the trial space (also known as the hypothesis space in machine learning) for all 3 functions can be any machine-learning model. In this work, we choose them to be multilayer feedforward neural networks. In practice, the $f$’s are always discretized into finite-dimensional vectors, and the quantities in Eq. **14** are actually squared ${l}^{2}$ norms of the associated vectors. Once the optimal functions $\mathit{w},h,{h}_{\eta}$ are trained, ${\mathit{W}}_{\text{Enc}}=\mathrm{\Psi}\left(f\right)$, as a general alternative to ${\mathit{W}}_{\text{Herm}}$, provides a set of generalized moments of the system.

### Learning Moment Closure.

Recall the dynamic equation Eq. **11** for the moment system; the goal of moment closure is to find suitable approximations of $\mathit{F},\mathit{G},\mathit{R}$ as functions of $\left(\mathit{U},\mathit{W}\right)$. We first rewrite Eq. **11** into

$$\left\{\begin{array}{cc}\hfill & {\partial}_{t}\mathit{U}+{\nabla}_{\mathit{x}}\cdot \left[{\mathit{F}}_{0}\left(\mathit{U}\right)+\stackrel{\u0303}{\mathit{F}}\left(\mathit{U},\mathit{W}\right)\right]=0,\hfill \\ \hfill & {\partial}_{t}\mathit{W}+{\nabla}_{\mathit{x}}\cdot \left[{\mathit{G}}_{0}\left(\mathit{U}\right)+\stackrel{\u0303}{\mathit{G}}\left(\mathit{U},\mathit{W}\right)\right]=\frac{1}{\epsilon}\hspace{0.17em}\mathit{R}\left(\mathit{U},\mathit{W}\right),\hfill \end{array}\right.$$

[15]

where ${\mathit{F}}_{0}\left(\mathit{U}\right),{\mathit{G}}_{0}\left(\mathit{U}\right)$ are the fluxes of the corresponding moments $\mathit{U},\mathit{W}$ under the local Maxwellian distribution, that is, ${\mathit{F}}_{0}\left(\mathit{U}\right)\equiv {\mathit{F}}_{\text{Euler}}\left(\mathit{U}\right)$ and

$${\mathit{G}}_{0}\left(\mathit{U}\right)={\int}_{{\mathbb{R}}^{D}}{f}_{M}\left(\mathit{v};\mathit{U}\right)\mathit{w}\left(\mathit{v}\right){\mathit{v}}^{\mathrm{T}}\hspace{0.17em}\mathrm{d}\mathit{v}.$$

Here, we denote by ${f}_{M}\left(\cdot ;\mathit{U}\right)$ the Maxwellian distribution defined by the macroscopic variables $\mathit{U}$. Our experience has been that separating the terms ${\mathit{F}}_{0}\left(\mathit{U}\right)$ and ${\mathit{G}}_{0}\left(\mathit{U}\right)$ out from $\mathit{F}$ and $\mathit{G}$ serves to reduce the variance during training. In practice, we can evaluate $\mathit{w}\left(\cdot \right)$ at a few points in advance and use the Gauss–Legendre quadrature rule to approximate ${\mathit{G}}_{0}\left(\mathit{U}\right)$ efficiently. Now, our goal becomes finding $\stackrel{\u0303}{\mathit{F}},\stackrel{\u0303}{\mathit{G}},\mathit{R}$. Note that the first component of $\stackrel{\u0303}{\mathit{F}}$, corresponding to the correction of the density flux, is always zero since the mass-continuity equation is exact.

Take learning $\stackrel{\u0303}{\mathit{G}},\mathit{R}$, for example; the task of learning $\stackrel{\u0303}{\mathit{F}}$ can be formulated similarly based on the dynamics of $\mathit{U}$. At this point, we need a more concrete definition of the dynamics encoded in Eq. **15**. The easiest way of doing this is to introduce a numerical scheme for Eq. **15**. We should emphasize that this scheme only serves to define the dynamics of the PDE: the machine-learning model obtained is independent of the details of the scheme so long as it defines the same dynamics (in the limit as the grid size goes to 0). Care should be exercised when choosing the scheme, however, since the solutions of these systems may contain shocks. We can think of a numerical scheme $\mathcal{S}$ as an operator that takes the flux function, the stencil, and the numerical discretization parameters as input and outputs the increment of the function values at the targeted point in time. For instance, in the case $D=1$, a simple example of $\mathcal{S}$ may take the form

$$\begin{array}{ll}\hfill & {\widehat{\mathit{W}}}_{j,n+1}-{\mathit{W}}_{j,n}\approx \hfill \\ \hfill & \mathcal{S}\left[\stackrel{\u0303}{\mathit{G}},\mathit{R}\right]\left({\mathit{U}}_{j-1,n},{\mathit{U}}_{j,n},{\mathit{U}}_{j+1,n},{\mathit{W}}_{j-1,n},{\mathit{W}}_{j,n},{\mathit{W}}_{j+1,n};\mathrm{\Delta}t,\mathrm{\Delta}x,\epsilon \right),\hfill \end{array}$$

where $j$ and $n$ denote the spatial and temporal indices, respectively. In this case, the tuple $\mathcal{X}=\left({\mathit{U}}_{j-1,n},{\mathit{U}}_{j,n},{\mathit{U}}_{j+1,n},{\mathit{W}}_{j-1,n},{\mathit{W}}_{j,n},{\mathit{W}}_{j+1,n},{\mathit{W}}_{j,n+1}\right)$ is an example of the data needed. The loss function can be chosen as

$$\begin{array}{cc}\hfill {\mathbb{E}}_{f\sim \mathcal{D}}\parallel & {\mathit{W}}_{j,n+1}-{\mathit{W}}_{j,n}\hfill \\ \hfill & -\mathcal{S}\left[\stackrel{\u0303}{\mathit{G}},\mathit{R}\right]\left({\mathit{U}}_{j-1,n},{\mathit{U}}_{j,n},{\mathit{U}}_{j+1,n},{\mathit{W}}_{j-1,n},{\mathit{W}}_{j,n},{\mathit{W}}_{j+1,n};\mathrm{\Delta}t,\mathrm{\Delta}x,\epsilon \right){\parallel}^{2}.\hfill \end{array}$$
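As one concrete (hypothetical) instance of the operator $\mathcal{S}$, the sketch below uses a first-order Lax–Friedrichs step with periodic boundaries; this is our own stand-in, not the scheme actually used in this work.

```python
import numpy as np

def scheme_step(W, G_tilde, R, dt, dx, eps):
    """One candidate instance of S: an explicit Lax-Friedrichs increment for
    dW/dt + d/dx G(W) = R(W)/eps with periodic boundaries (illustrative only)."""
    Wp, Wm = np.roll(W, -1), np.roll(W, 1)          # W_{j+1,n}, W_{j-1,n}
    return (0.5 * (Wp + Wm)                          # Lax-Friedrichs averaging
            - dt / (2.0 * dx) * (G_tilde(Wp) - G_tilde(Wm))
            + dt / eps * R(W)) - W                   # increment W_{j,n+1} - W_{j,n}

# Illustrative setup: linear advection flux and a vanishing relaxation term.
x = np.linspace(0.0, 1.0, 100, endpoint=False)
W = np.exp(-100.0 * (x - 0.5) ** 2)
dW = scheme_step(W, G_tilde=lambda W: 1.0 * W, R=np.zeros_like,
                 dt=0.004, dx=x[1] - x[0], eps=1.0)
W_new = W + dW
```

Training would compare this increment with ${\mathit{W}}_{j,n+1}-{\mathit{W}}_{j,n}$ extracted from the kinetic solution, in the squared ${l}^{2}$ sense of the loss above; with periodic boundaries, the step also conserves the discrete total of $W$ when $R$ vanishes.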

The details of the numerical scheme we use can be found in *SI Appendix*, section B.

### Data Exploration.

The quality of the proposed moment method depends on the quality of the dataset $\mathcal{D}$ that we use to train the model. It consists of several solutions of the original Boltzmann equation Eq. **1** under different initial conditions. Unlike most conventional machine-learning problems that rely on fixed given datasets, here the construction of the dataset is completely our own choice and is an important part of the algorithm. In general, our objective is to achieve greater accuracy with fewer training data by choosing the training data wisely. In this sense, the objective is close to that of active learning (37). To achieve this, an interactive algorithm is required between the augmentation of the dataset and the learning process.

In this work, we adopt the following strategy. One starts with a relatively small dataset and uses it to learn the models. Then, a new batch of solutions is generated for both the original kinetic model Eq. **1** and the moment system Eq. **15**. The error in the macroscopic variables $\mathit{U}$ is calculated as an indicator, and the solutions with large errors are added to the dataset for the next round of learning. These 2 steps are repeated until convergence is achieved, which indicates that the phase space has been sufficiently explored. The whole scheme works as a closed loop and forms a self-learning process.

One key question is how to initialize the new batch of solutions. In principle, we would like to initialize them so that, at the end of the active-learning process, the configurations that occur in practice have been explored sufficiently. Unfortunately, at the moment, there are no precise mathematical principles that we can use to quantify this. This is certainly one issue that we will continue to investigate in the future. More details of the exploration procedure used can be found in *SI Appendix*, section E.
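The exploration loop just described can be sketched as follows; all callables here are hypothetical stand-ins for sampling initial conditions and for solving Eq. **1** and Eq. **15**.

```python
import numpy as np

def explore(dataset, sample_ic, solve_kinetic, solve_moment,
            n_candidates=16, n_add=4, tol=1e-3, max_rounds=10):
    """Active-learning loop sketch: grow the dataset with the initial
    conditions on which the current reduced model is least accurate.
    sample_ic, solve_kinetic, and solve_moment are hypothetical callables."""
    for _ in range(max_rounds):
        candidates = [sample_ic() for _ in range(n_candidates)]
        # Error indicator: discrepancy in the macroscopic variables U.
        errors = [np.abs(solve_kinetic(ic) - solve_moment(ic)).max()
                  for ic in candidates]
        if max(errors) < tol:           # phase space sufficiently explored
            break
        worst = np.argsort(errors)[-n_add:]
        dataset.extend(candidates[i] for i in worst)
        # ... retrain the moment model on the augmented dataset here ...
    return dataset

# Toy stand-ins: the "kinetic" solution is the IC itself; the reduced model
# returns zero, so the error indicator is simply the IC magnitude.
rng = np.random.default_rng(1)
data = explore([], sample_ic=lambda: np.array([rng.uniform()]),
               solve_kinetic=lambda ic: ic,
               solve_moment=lambda ic: np.zeros_like(ic),
               max_rounds=1)
```

In the toy run above, a single round adds the `n_add` candidates with the largest error indicator to the dataset.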

### Symmetries and Galilean Invariant Moments.

An important issue in building machine-learning models is how to handle the symmetries in the system. The handling of translational, rotational, and even permutational symmetries has been discussed in depth in the literature already (24, 38–40). Besides these static symmetries, the Boltzmann equation also possesses an important dynamic symmetry, Galilean invariance. Specifically, for every $\mathit{u}\prime \in {\mathbb{R}}^{D}$, define $f\prime $ by $f\prime \left(\mathit{x},\mathit{v},t\right)=f\left(\mathit{x}-t\mathit{u}\prime ,\mathit{v}-\mathit{u}\prime ,t\right)$. If $f$ is a solution of the Boltzmann equation, then so is $f\prime $. It is desirable that the reduced moment system also inherit this invariance. Here, we present a viable approach that achieves this goal.

The idea is to define the generalized moments ${\mathit{W}}_{\text{Gal}}$ (the subscript $\text{Gal}$ signifies Galilean invariance) so that they are Galilean invariant. Given the velocity $\mathit{u}$ and temperature $T$ of $f$, we modify the encoder to be

$${\mathit{W}}_{\text{Gal}}={\mathrm{\Psi}}_{\text{Gal}}\left(f\right)={\int}_{{\mathbb{R}}^{D}}f\left(\mathit{v}\right)\mathit{w}\left(\frac{\mathit{v}-\mathit{u}}{\sqrt{T}}\right)\hspace{0.17em}\mathrm{d}\mathit{v}.$$

[16]

Note that the encoder ${\mathrm{\Psi}}_{\text{Gal}}$ now depends nonlinearly on the first and second moments of $f$ (through $\mathit{u}$ and $T$) and is invariant with respect to the choice of the Galilean reference frame. It is straightforward to see that the Grad-type moments ${\mathit{W}}_{\text{Herm}}$ are a special case of Eq. **16** and thus are Galilean invariant.

Modeling the dynamics of ${\mathit{W}}_{\text{Gal}}$ becomes more subtle due to the spatial dependence of $\mathit{u},T$. Below, for simplicity, we work with a discretized form of the dynamic model. Suppose we want to model the dynamics of ${\mathit{W}}_{\text{Gal},j}$ at the spatial grid point indexed by $j$. Integrating the Boltzmann equation against the generalized basis at this grid point gives

$$\begin{array}{cc}\hfill & {\partial}_{t}{\int}_{{\mathbb{R}}^{D}}f\mathit{w}\left(\frac{\mathit{v}-{\mathit{u}}_{j}}{\sqrt{{T}_{j}}}\right)\hspace{0.17em}\mathrm{d}\mathit{v}+{\nabla}_{\mathit{x}}\cdot {\int}_{{\mathbb{R}}^{D}}f\mathit{w}\left(\frac{\mathit{v}-{\mathit{u}}_{j}}{\sqrt{{T}_{j}}}\right){\mathit{v}}^{\mathrm{T}}\hspace{0.17em}\mathrm{d}\mathit{v}\hfill \\ \hfill =\hspace{0.17em}& {\int}_{{\mathbb{R}}^{D}}\frac{1}{\epsilon}Q\left(f\right)\mathit{w}\left(\frac{\mathit{v}-{\mathit{u}}_{j}}{\sqrt{{T}_{j}}}\right)\hspace{0.17em}\mathrm{d}\mathit{v}.\hfill \end{array}$$

[17]

The collision term on the right-hand side evaluated at the grid point $j$ can still be approximated reasonably well by a function of $\left({\mathit{U}}_{j},{\mathit{W}}_{\text{Gal},j}\right)$ only, since no spatial interaction is involved. However, after the discretization, the flux term above depends not only on $\left(\mathit{U},{\mathit{W}}_{\text{Gal}}\right)$ but also on the basis quantities $\left({\mathit{u}}_{j},{T}_{j}\right)$ chosen in Eq. **17**. This motivates us to consider the following approximate moment equation:

$${\partial}_{t}{\mathit{W}}_{\text{Gal}}+{\nabla}_{\mathit{x}}\cdot {\mathit{G}}_{\text{Gal}}\left(\mathit{U},{\mathit{W}}_{\text{Gal}};{\mathit{U}}_{j}\right)=\frac{1}{\epsilon}\hspace{0.17em}{\mathit{R}}_{\text{Gal}}\left(\mathit{U},{\mathit{W}}_{\text{Gal}}\right).$$

[18]

Note that Eq. **18** is only meant to be used to model the dynamics of ${\mathit{W}}_{\text{Gal},j}$. To model the dynamics of ${\mathit{W}}_{\text{Gal},j\prime}$ at another grid point $j\prime $, different basis information ${\mathit{U}}_{j\prime}$ is provided. Given the moment equation Eq. **18**, the loss functions for ${\mathit{G}}_{\text{Gal}},{\mathit{R}}_{\text{Gal}}$ can be defined in the same way as in *Learning Moment Closure*. More discussion on the dynamics of Galilean invariant moment systems can be found in *SI Appendix*, section D.

One should also note that while preserving invariances is an important issue, it is not absolutely necessary to preserve them exactly. If we make the trial space too restrictive in order to preserve invariances, it may become difficult to find a model with satisfactory accuracy. On the other hand, if the reduced dynamics is sufficiently accurate, all of the invariances of the original system should be satisfied approximately with similar accuracy.
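As a sanity check of the construction in Eq. **16**, the following sketch (our illustration, not the authors' code) computes 6 generalized moments of a discretized 1D distribution with a hypothetical basis and verifies numerically that they are unchanged under a velocity boost:

```python
import numpy as np

# A numerical sketch of the Galilean-invariant moments of Eq. 16 in 1D:
# W_Gal = ∫ f(v) w((v - u)/sqrt(T)) dv, where u and T are the bulk velocity
# and temperature of f.  The functions in `basis` are hypothetical stand-ins
# for the learned encoder basis; 6 moments are used, as in the 1D BGK model.

v = np.linspace(-15.0, 15.0, 3001)   # velocity grid, dv = 0.01
dv = v[1] - v[0]

def basis(xi, n_moments=6):
    # hypothetical basis: xi^k exp(-xi^2/2) for k = 3, ..., n_moments + 2
    return np.stack([xi**k * np.exp(-xi**2 / 2) for k in range(3, 3 + n_moments)])

def macro(f):
    # density, bulk velocity, and temperature of a 1D distribution
    rho = f.sum() * dv
    u = (f * v).sum() * dv / rho
    T = (f * (v - u) ** 2).sum() * dv / rho
    return rho, u, T

def W_gal(f):
    _, u, T = macro(f)
    return basis((v - u) / np.sqrt(T)) @ f * dv   # Riemann sum for Eq. 16

def two_maxwellians(shift=0.0):
    # a non-equilibrium distribution: two drifting Maxwellians
    return (0.7 * np.exp(-(v - shift - 1.0) ** 2 / 2)
            + 0.3 * np.exp(-(v - shift + 2.0) ** 2 / 0.8))

# boosting the reference frame by u' = 3 (f'(v) = f(v - u')) leaves the
# generalized moments unchanged, as required by Galilean invariance
f, f_boosted = two_maxwellians(), two_maxwellians(shift=3.0)
print(np.allclose(W_gal(f), W_gal(f_boosted)))   # True
```

The invariance holds because the basis arguments are centered and rescaled by the bulk velocity and temperature of $f$ itself, which transform together with the boost.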

### An End-To-End Learning Procedure.

In the previous sections, we have introduced a 2-step procedure to learn separately the moments ${\mathit{W}}_{\text{Enc}}$ (or ${\mathit{W}}_{\text{Gal}}$) and the dynamics of the moments. Here, we also present an alternative end-to-end learning procedure.
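To make the combined objective concrete, it can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the network architectures, the decoder convention (reconstructing $f$ from $\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right)$), and the unit weights are assumptions; the 4 terms correspond to Eqs. **14** and **19**–**21** below.

```python
import torch

# Sketch of the end-to-end loss: a linear combination of the autoencoder
# reconstruction loss (Eq. 14) and the three closure losses (Eqs. 19-21),
# all of which can be evaluated on single snapshots f ~ D.

def end_to_end_loss(f, U, targets, nets, weights=(1.0, 1.0, 1.0, 1.0)):
    # f: (batch, n_v) discretized distributions; U: (batch, 3) hydro variables
    # targets: moment integrals of f precomputed from the data:
    #   'energy_flux' ~ Eq. 19, 'moment_flux' ~ Eq. 20, 'collision' ~ Eq. 21
    W = nets['encoder'](f)                                   # generalized moments
    recon = nets['decoder'](torch.cat([U, W], dim=-1))
    l_ae = ((recon - f) ** 2).mean()                                  # Eq. 14
    l_F = ((nets['F3'](U, W) - targets['energy_flux']) ** 2).mean()   # Eq. 19
    l_G = ((nets['G'](U, W) - targets['moment_flux']) ** 2).mean()    # Eq. 20
    l_R = ((nets['R'](U, W) - targets['collision']) ** 2).mean()      # Eq. 21
    a, b, c, d = weights
    return a * l_ae + b * l_F + c * l_G + d * l_R

# smoke test with random stand-ins (shapes only: n_v = 60 velocity points,
# 6-dimensional W_Enc, as in the 1D BGK experiments)
n_v, n_w, batch = 60, 6, 8
nets = {'encoder': torch.nn.Linear(n_v, n_w),
        'decoder': torch.nn.Linear(3 + n_w, n_v),
        'F3': lambda U, W: torch.zeros(len(U), 1),
        'G': lambda U, W: torch.zeros(len(U), n_w),
        'R': lambda U, W: torch.zeros(len(U), n_w)}
f, U = torch.rand(batch, n_v), torch.rand(batch, 3)
targets = {'energy_flux': torch.zeros(batch, 1),
           'moment_flux': torch.zeros(batch, n_w),
           'collision': torch.zeros(batch, n_w)}
loss = end_to_end_loss(f, U, targets, nets)
loss.backward()   # single snapshots suffice; all parameters optimized jointly
```

Note that every term depends only on one snapshot $f$ and its precomputed moment integrals, which is the basis for the single-snapshot advantage discussed below.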

For simplicity, we go back to the framework of *Finding Generalized Moments* and *Learning Moment Closure* (i.e., using ${\mathit{W}}_{\text{Enc}}$ without specifying explicitly the Galilean invariance) in the 1-dimensional (1D) case. The loss function for the autoencoder component is still defined as in Eq. **14**. The functions in the moment equations are learned using the following loss functions, guided by Eq. **10** and its approximation Eq. **15**. For $\stackrel{\u0303}{\mathit{F}}$, only the third component ${\stackrel{\u0303}{\mathit{F}}}_{3}$, the energy flux, needs to be considered, because the mass and momentum continuity equations in Eq. **7** are exact in the 1D case. The corresponding loss function is given by

$${\mathbb{E}}_{f\sim \mathcal{D}}\parallel {\int}_{\mathbb{R}}\frac{1}{2}f{\left|\mathit{v}\right|}^{2}{\mathit{v}}^{\mathrm{T}}\hspace{0.17em}\mathrm{d}\mathit{v}-\left(E+p\right){\mathit{u}}^{\mathrm{T}}-{\stackrel{\u0303}{\mathit{F}}}_{3}\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right){\parallel}^{2}.$$

[19]

Similarly, the loss functions for learning $\stackrel{\u0303}{\mathit{G}}$ and $\mathit{R}$ can be written as

$${\mathbb{E}}_{f\sim \mathcal{D}}{\parallel {\int}_{\mathbb{R}}\left(f-{f}_{M}\left(\mathit{v};\mathit{U}\right)\right)\mathit{w}\left(\mathit{v}\right){\mathit{v}}^{\mathrm{T}}\hspace{0.17em}\mathrm{d}\mathit{v}-\stackrel{\u0303}{\mathit{G}}\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right)\parallel}^{2},$$

[20]

$${\mathbb{E}}_{f\sim \mathcal{D}}{\parallel {\int}_{\mathbb{R}}Q\left(f\right)\mathit{w}\left(\mathit{v}\right)\hspace{0.17em}\mathrm{d}\mathit{v}-\mathit{R}\left(\mathit{U},{\mathit{W}}_{\text{Enc}}\right)\parallel}^{2}.$$

[21]

Linearly combining Eqs. **14** and **19**–**21** defines a loss function that allows us to learn the moment system in a single optimization step. The benefits of such a learning strategy are 2-fold. On the one hand, any single snapshot ${f}_{n}$ ($n$ denotes the temporal index when solving the kinetic equation) is sufficient for evaluating the loss function in the end-to-end approach, whereas 2 consecutive discretized snapshots ${f}_{n},{f}_{n+1}$ are required in the loss function described in *Learning Moment Closure* and *Symmetries and Galilean Invariant Moments*. The resulting paradigm is thus closer to unsupervised learning than to supervised learning, since solving the Boltzmann equation becomes unnecessary. On the other hand, accuracy can potentially be improved since all of the parameters are optimized jointly.

### An Alternative Direct Machine-Learning Strategy.

A more straightforward machine-learning approach is to stay at the level of the original hydrodynamic variables $\mathit{U}$ and directly learn a correction term for the Euler equations to approximate the dynamics of the Boltzmann equation:

$${\partial}_{t}\mathit{U}+{\nabla}_{\mathit{x}}\cdot \left({\mathit{F}}_{\text{Euler}}\left(\mathit{U}\right)+\stackrel{\u0303}{\mathit{F}}\left[\mathit{U}\left(\cdot \right);\epsilon \right]\right)=0.$$

[22]

Here, $\stackrel{\u0303}{\mathit{F}}\left[\mathit{U}\left(\cdot \right);\epsilon \right]$ is a functional of $\mathit{U}\left(\cdot \right)$, where $\mathit{U}$ is viewed as a function on the whole space. It is quite straightforward to design machine-learning models under this framework. For instance, assume that $D=1$. One can discretize $\mathit{U}$ using ${N}_{x}$ grid points and train a network to approximate $\stackrel{\u0303}{\mathit{F}}$, in which the input is an ${N}_{x}\times 4$ matrix representing $\mathit{U}$ and $\epsilon $, and the output is the values of $\stackrel{\u0303}{\mathit{F}}$ at the ${N}_{x}$ grid points. In practice, to guarantee translational invariance and approximate locality, we choose a 1D convolutional neural network with small convolutional kernels. This approach is conceptually simple and easy to implement. Ideas like this have been used in various problems, including turbulence models (24, 41, 42).
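A minimal sketch of such a network follows (a hypothetical architecture; the paper specifies only a 1D convolutional network with small kernels): circular padding enforces the periodic boundary condition and translational invariance, the input packs $\mathit{U}$ and $\epsilon $ into ${N}_{x}\times 4$ channels, and the output is the correction at the ${N}_{x}$ grid points.

```python
import torch

# Hypothetical DirectConv sketch: maps the N_x x 4 input (3 hydrodynamic
# variables plus the Knudsen number as a constant channel) to a correction
# flux at the N_x grid points, with periodic (circular) padding.
class DirectConv(torch.nn.Module):
    def __init__(self, hidden=32, kernel=5):
        super().__init__()
        pad = kernel // 2
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(4, hidden, kernel, padding=pad, padding_mode='circular'),
            torch.nn.GELU(),
            torch.nn.Conv1d(hidden, hidden, kernel, padding=pad, padding_mode='circular'),
            torch.nn.GELU(),
            torch.nn.Conv1d(hidden, 3, kernel, padding=pad, padding_mode='circular'),
        )

    def forward(self, U, eps):
        # U: (batch, N_x, 3); eps: (batch,) constant Knudsen number per sample
        x = torch.cat([U, eps[:, None, None].expand(-1, U.shape[1], 1)], dim=-1)
        return self.net(x.transpose(1, 2)).transpose(1, 2)  # (batch, N_x, 3)

model = DirectConv()
U = torch.rand(2, 100, 3)            # N_x = 100 grid points, as in the text
eps = torch.tensor([1e-3, 10.0])
F_corr = model(U, eps)
print(F_corr.shape)                  # torch.Size([2, 100, 3])
```

The small kernels keep the receptive field local, which is the approximate-locality assumption mentioned above.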

There are several problems with this approach. The first is that it does not offer any room for improving accuracy, since no information other than the instantaneous hydrodynamic variables can be used. The second is that this procedure requires a specific discretization to begin with; the model obtained is tied to this discretization. We would like to have machine-learning models that are more like PDEs. The third is that the models obtained are harder to interpret. For instance, the role that the Knudsen number plays in the convolutional network is not as explicit as in the Boltzmann equation or the moment equation. In any case, it is our view that the moment system is a more appealing approach.

## Numerical Results

In this section, we report results for the methods introduced above for the BGK model Eq. **2** with $D=1$ and the Boltzmann equation for Maxwell molecules Eq. **5** with $D=2$. At the moment, there are no reliable conventional moment systems for the full Boltzmann equation for Maxwell molecules. In contrast, the methodology introduced here does not make use of the specific form of $Q\left(f\right)$ and can be readily applied to this or even more complicated kinetic models.

We consider the 1D interval $\left[-0.5,0.5\right]$ in the physical domain with periodic boundary condition and the time interval $\left[0,0.1\right]$. Note that for the 2-dimensional (2D) Maxwell model, we assume the distribution function is constant in the $y$ direction of the physical domain, so that it is still sufficient to compute the solution on the 1D spatial interval. Three different tasks are considered, termed *Task Wave*, *Task Mix*, and *Task MixInTransition*, respectively. In the first 2 tasks, the Knudsen number $\epsilon $ is constant across the whole domain. This constant value is sampled from a log-uniform distribution on $\left[-3,1\right]$ with respect to base 10; i.e., $\epsilon $ takes values from $1{0}^{-3}$ to 10. For data exploration, the initial conditions used in *Task Wave* consist of a few combinations of waves randomly sampled from a probability distribution. For *Task Mix*, the initial conditions contain a mixture of waves and shocks, both randomly sampled from their respective probability distributions. More details can be found in *SI Appendix*, section A. In *Task MixInTransition*, the initial conditions are the same as the ones in *Task Mix*, but the Knudsen number is no longer constant in the spatial domain. Instead, it varies from $1{0}^{-3}$ to 10. This is a toy model for transitional flows. We do not train any new models in *Task MixInTransition* but instead adopt the model learned from *Task Mix* to check its transferability.

We generate the training data with the Implicit–Explicit (IMEX) scheme (43) for the 1D BGK model and the fast spectral method (44–46) for the 2D Maxwell model. The spatial and temporal domains are both discretized into 100 grid points. The velocity domain is discretized into 60 grid points in the 1D BGK model and 48 grid points in each dimension in the 2D Maxwell model. In all of the numerical experiments, *Task Wave* uses 100 paths as the training dataset, whereas *Task Mix* uses 200 paths. All of the results reported are based on 100 testing initial profiles sampled from the same distribution as the corresponding tasks. To evaluate the accuracy on the testing data, we consider 2 types of error measures for the macro quantities $\mathit{U}$: the relative absolute error (RAE) and the relative squared error (RSE). More details of the tasks and data are provided in *SI Appendix*, section A.

We name different moment systems using a combination of the definition of $\mathit{W}$ and the closure model. The moments considered in this paper, ${\mathit{W}}_{\text{Herm}}$, ${\mathit{W}}_{\text{Enc}}$, and ${\mathit{W}}_{\text{Gal}}$, are abbreviated as Herm, Enc, and GalEnc, respectively. The machine learning-based closure and its end-to-end version are called MLC and E2EMLC for short. When ${\mathit{W}}_{\text{Enc}}$ is used, the learning of the closure takes the form introduced in *Learning Moment Closure*, and when ${\mathit{W}}_{\text{Herm}}$ or ${\mathit{W}}_{\text{Gal}}$ is used, it takes the form introduced in *Symmetries and Galilean Invariant Moments*. The model described in *An Alternative Direct Machine-Learning Strategy* is called DirectConv. In all of the numerical experiments presented in the main text, we always set ${\mathit{W}}_{\text{Enc}}$ and ${\mathit{W}}_{\text{Gal}}$ to be 6-dimensional for the 1D BGK model and 9-dimensional for the 2D Maxwell model. When Hermite polynomials are used for the moments, due to the numerical instability of evaluating moments of high-order polynomials, we limit the order to be no larger than 5 for each variable. The resulting ${\mathit{W}}_{\text{Herm}}$ is 3-dimensional for the 1D BGK model and 6-dimensional for the 2D Maxwell model, dropping the components that are identically 0 due to symmetry. Results of EncMLC are always augmented by data exploration, unless specified otherwise. The other models do not use data exploration, for ease of comparison. Tables 1 and 2 report the relative error of the different models on the testing data for the 3 tasks. Fig. 2 shows an example of the profiles of the mass, momentum, and energy densities at $t=0,0.05,0.1$ for the same initial condition in *Task Mix* for the 2D Maxwell model, obtained by solving the kinetic equation, the Euler equations, and HermMLC.

Table 1.

Model | Wave | Mix | MixInTransition |
---|---|---|---|
EncMLC (no exploration) | 1.27(13), 1.60(35) | 1.68(10), 2.35(12) | 1.82(11), 2.49(11) |
EncMLC (exploration) | 0.85(10), 1.01(14) | 1.25(5), 1.75(8) | 1.55(4), 2.06(7) |
GalEncMLC | 0.97(22), 1.11(23) | 1.28(6), 1.85(10) | 1.53(4), 2.11(8) |
EncE2EMLC | 0.76(6), 0.92(7) | — | — |
HermMLC | 0.34(5), 0.43(6) | 0.92(4), 1.45(10) | 1.77(25), 3.01(27) |
DirectConv | 0.92(3), 1.16(6) | 0.83(5), 1.13(9) | 2.57(20), 3.61(26) |

The first number in each cell denotes the RAE, and the second denotes the RSE. The numbers in parentheses denote the SD of the last 1 or 2 digits, computed from 3 independent runs. The results in the fourth column are obtained using the models learned in *Task Mix* directly. —, the error is unavailable due to the unphysical solutions produced by EncE2EMLC.

Table 2.

Model | Wave | Mix | MixInTransition |
---|---|---|---|
EncMLC (no exploration) | 0.83(7), 1.19(13) | 1.30(11), 1.96(16) | 1.50(7), 2.28(10) |
GalEncMLC | 0.77(6), 1.09(10) | 1.18(10), 1.77(15) | 1.42(7), 2.12(13) |
HermMLC | 0.30(4), 0.50(10) | 1.12(7), 1.74(11) | 1.22(4), 1.89(5) |
DirectConv | 1.49(8), 2.22(14) | 1.00(6), 1.47(9) | 2.62(52), 3.64(74) |

The first number in each cell denotes the RAE, and the second denotes the RSE. The numbers in parentheses denote the SD of the last 1 or 2 digits, computed from 3 independent runs. The results in the fourth column are obtained using the models learned in *Task Mix* directly.

Fig. 2.
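For concreteness, the 2 error measures reported in Tables 1 and 2 can be sketched as follows. The exact normalizations are specified in *SI Appendix*, section A; the definitions below are plausible guesses, not the paper's formulas.

```python
import numpy as np

# Hedged sketch of the 2 error measures on the macro quantities U over the
# space-time grid: RAE as a relative L1 ratio, RSE as a relative L2 ratio.
# Both normalizations are guesses; the precise definitions are in the SI.

def rae(pred, ref):
    return np.abs(pred - ref).sum() / np.abs(ref).sum()

def rse(pred, ref):
    return np.sqrt(((pred - ref) ** 2).sum() / (ref ** 2).sum())

# quick check with a synthetic "solution" of shape (time, space, 3 variables)
ref = np.random.default_rng(0).normal(size=(100, 100, 3))
pred = ref + 0.01 * np.random.default_rng(1).normal(size=ref.shape)
print(rae(pred, ref) < 0.05, rse(pred, ref) < 0.05)   # True True
```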

The benefits brought by data exploration are clearly shown by the comparison between the first 2 rows in Table 1. EncMLC with data exploration has an accuracy similar to that of GalEncMLC. However, when the 2 models are trained on datasets of the same size and distribution without data exploration, GalEncMLC performs better than EncMLC, for both the 1D BGK model and the 2D Maxwell model. The superior data efficiency of GalEncMLC is not surprising, since it better captures the intrinsic features of the original dynamical system.
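The exploration step can be sketched conceptually as follows. All samplers and solvers here are trivial stand-ins, not the authors' code (the actual initial-condition distributions are in *SI Appendix*, section A), and the selection criterion shown, keeping the worst-predicted candidates, is one plausible active-learning choice rather than necessarily the one used in the paper. Only the log-uniform sampling of $\epsilon $ reflects the text directly.

```python
import numpy as np

# Conceptual sketch of data exploration: draw candidate initial conditions
# and Knudsen numbers by Monte Carlo, label them with a (short, expensive)
# kinetic solve, and add the candidates on which the current reduced model
# is least accurate to the training set.
rng = np.random.default_rng(0)

def sample_ic():                       # stand-in: random wave-like profile
    a, k = rng.uniform(0.1, 0.5), rng.integers(1, 4)
    x = np.linspace(-0.5, 0.5, 100, endpoint=False)
    return 1.0 + a * np.sin(2 * np.pi * k * x)

solve_kinetic = lambda ic, eps: ic * np.exp(-eps)     # stand-in "labels"
solve_reduced = lambda ic, eps: ic * (1 - eps)        # stand-in reduced model
rel_err = lambda p, r: np.abs(p - r).sum() / np.abs(r).sum()

def explore(train_set, n_candidates=50, n_keep=10):
    cands = []
    for _ in range(n_candidates):
        eps = 10.0 ** rng.uniform(-3.0, 1.0)   # log-uniform, 1e-3 to 10
        ic = sample_ic()
        cands.append((ic, eps, rel_err(solve_reduced(ic, eps),
                                       solve_kinetic(ic, eps))))
    cands.sort(key=lambda c: c[2])             # keep the worst-predicted
    train_set.extend((ic, eps) for ic, eps, _ in cands[-n_keep:])
    return train_set

train = explore([])
print(len(train))   # 10
```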

While the generalized moments and moment equations are learned differently in *Task Wave* and *Task Mix* under the associated data distributions, in *Task MixInTransition* (the fourth column in Tables 1 and 2), we use the same models learned from *Task Mix* directly, without any further training. The fact that the relative error is similar to that in *Task Mix* indicates that the machine learning-based moment system has satisfactory transferability. On the contrary, while the model learned in *Task Wave* is accurate enough for that particular task, it is not sufficiently transferable in general.

EncE2EMLC has the best accuracy in *Task Wave* for the BGK model among all of the models based on ${\mathit{W}}_{\text{Enc}}$. Since all of the networks in EncE2EMLC have the same structure as in EncMLC, this improvement seems to come mainly from the end-to-end training process. However, when shocks are present, this model performs badly; for example, it may produce unphysical solutions. This is likely due to the lack of any enforcement of the entropy condition, either explicitly through the entropy function or implicitly through supervision from the dynamics of the kinetic equation. On the other hand, the existence of shocks is a special feature of the physical problem considered here. This issue disappears in most other physical systems, for which EncE2EMLC should become a very attractive approach.

The good performance of HermMLC suggests that the proposed machine learning-based closure is applicable to different types of moments. The remarkable accuracy of HermMLC in *Task Wave* might be a result of the close proximity between $f$ and the local equilibrium in this task. The fact that GalEncMLC achieved a similar accuracy in *Task Mix* and *Task MixInTransition* suggests that the autoencoder is an effective tool for finding generalized moments in general situations.

The accuracy of DirectConv is quite good for both *Task Wave* and *Task Mix*. However, the model obtained is tied to the specific discretization algorithm used, and it is unclear how it can be used with other discretization schemes. The machine learning-based moment systems do not have this problem, since they behave more like conventional PDEs. Fig. 3 illustrates the solutions of GalEncMLC under the same initial condition as in *Task Wave* for the 2D Maxwell model but with different spatial discretizations.

Fig. 3.

In Fig. 4, we display the log–log scatter plots of the relative error versus the Knudsen number $\epsilon $ for both *Task Wave* and *Task Mix* for the 2D Maxwell model. One can see that the accuracy of the machine learning-based moment system is almost uniform across the whole regime, with the same computational cost. This stands in striking contrast to the conventional hydrodynamic models or the DSMC method. As for the computational cost, to solve for one path for the 2D Maxwell model on a MacBook Pro with a 2.9 GHz Intel Core i5 processor, it takes about 5 min to run the fast spectral method for the original Boltzmann equation. In contrast, the runtime of the machine learning-based moment method (GalEncMLC) is only about half a second.

Fig. 4.

We refer the interested readers to *SI Appendix*, section F for more numerical results, including the shape of the generalized basis functions in the 1D BGK model, the accuracy under different numbers of moments, the solutions on the test data under different spatial and temporal discretizations, the growth of the relative error as a function of time in the 3 tasks, and more sample profiles in the 3 tasks. A brief introduction to the hyperbolic regularized Grad's moment system (13, 14), a conventional moment method for the BGK model, and related results are provided in *SI Appendix*, section C.

## Discussion and Conclusion

This paper presents a framework for multiscale modeling using machine learning in the absence of scale separation. We have put our emphasis on learning physical models, not just a particular algorithm. We have studied some of the main issues involved, including the importance of obeying physical constraints, active learning, and end-to-end models. Our experience suggests that it is often advantageous to respect physical constraints, but there is no need to sacrifice much accuracy just to enforce them: if we can model the dynamics accurately, the physical constraints are also satisfied with similar accuracy. Even though we still lack a proper mathematical framework to serve as a guideline, active learning is very important in order to ensure the validity of the model under different physical conditions. Regarding the end-to-end model, even though it did not perform satisfactorily for the problem with shocks, we feel that it may very well be the most attractive approach in more general cases, since it seems most promising for deriving uniform error bounds.

There are a few important issues that need to be addressed when applying the methodology presented here to other problems. The first is the construction of additional relevant variables. Here, our requirement is encoded in the autoencoder: the reduced variables need to be able to capture enough information so that the 1-particle phase-space distribution function can be accurately reproduced. This needs to be generalized when considering other models, for example, when the microscopic model is molecular dynamics. The second is the starting point for performing “closure.” This component is also problem-specific. Despite the fact that these important components need to be worked out for each specific problem, we do feel that the general procedure here is applicable for a wide variety of multiscale problems.

Going back to the specific example we studied, the BGK model or the Boltzmann equation for Maxwell molecules, we presented an interpretable generalized moment system that works well over a wide range of Knudsen numbers. One can think of these models just like conventional PDEs, except that some of the terms in the fluxes and forcing are stored as subroutines. This is not very different from the conventional Euler equations for complex gases where the equations of state are stored as look-up tables or sometimes subroutines. Regarding the 3 ingredients involved in learning the reduced models, namely, labeling the data, learning from the data, and exploring the data (as shown in Fig. 1), labeling the data is straightforward in this case: we just need to solve the kinetic equation for some short period under different initial conditions. Data exploration is carried out using Monte Carlo sampling from some prescribed initial velocity distributions, and the choice of these initial velocity distributions is still somewhat ad hoc. The learning problem is the part that we have studied most carefully. We have explored and compared several different versions of machine-learning models. We are now ready to attack other problems such as kinetic models in plasma physics and kinetic models for complex fluids, building on this experience.

## Acknowledgments

The work presented here is supported in part by a gift to Princeton University from iFlytek and Office of Naval Research Grant N00014-13-1-0338. We also thank the reviewers for the valuable comments, which helped to improve the quality of the work and the paper.

## Supporting Information

Appendix (PDF)


## References

1. W. E, B. Engquist, X. Li, W. Ren, E. Vanden-Eijnden, Heterogeneous multiscale methods: A review. *Commun. Comput. Phys.* **2**, 367–450 (2007).
2. W. E, *Principles of Multiscale Modeling* (Cambridge University Press, 2011).
3. J. Han, L. Zhang, R. Car, W. E, Deep potential: A general representation of a many-body potential energy surface. *Commun. Comput. Phys.* **23**, 629–639 (2018).
4. L. Zhang, J. Han, H. Wang, R. Car, W. E, Deep potential molecular dynamics: A scalable model with the accuracy of quantum mechanics. *Phys. Rev. Lett.* **120**, 143001 (2018).
5. L. Zhang, D. Y. Lin, H. Wang, R. Car, W. E, Active learning of uniformly accurate interatomic potentials for materials simulation. *Phys. Rev. Mater.* **3**, 023804 (2019).
6. H. Grad, On the kinetic theory of rarefied gases. *Commun. Pure Appl. Math.* **2**, 331–407 (1949).
7. C. Bardos, F. Golse, C. D. Levermore, Fluid dynamic limits of kinetic equations. II. Convergence proofs for the Boltzmann equation. *Commun. Pure Appl. Math.* **46**, 667–753 (1993).
8. C. D. Levermore, Moment closure hierarchies for kinetic theories. *J. Stat. Phys.* **83**, 1021–1065 (1996).
9. G. A. Bird, *Molecular Gas Dynamics and the Direct Simulation of Gas Flows* (Clarendon Press, Oxford, UK, 1994).
10. F. J. Alexander, A. L. Garcia, The direct simulation Monte Carlo method. *Comput. Phys.* **11**, 588–593 (1997).
11. L. Pareschi, G. Russo, An introduction to Monte Carlo method for the Boltzmann equation. *ESAIM: Proc.* **10**, 35–75 (2001).
12. D. Burnett, The distribution of velocities in a slightly non-uniform gas. *Proc. Lond. Math. Soc.* **2**, 385–430 (1935).
13. Z. Cai, Y. Fan, R. Li, Globally hyperbolic regularization of Grad's moment system. *Commun. Pure Appl. Math.* **67**, 464–518 (2014).
14. Z. Cai, Y. Fan, R. Li, A framework on moment model reduction for kinetic equation. *SIAM J. Appl. Math.* **75**, 2001–2023 (2015).
15. P. L. Bhatnagar, E. P. Gross, M. Krook, A model for collision processes in gases. I. Small amplitude processes in charged and neutral one-component systems. *Phys. Rev.* **94**, 511–525 (1954).
16. M. Raissi, G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations. *J. Comput. Phys.* **357**, 125–141 (2018).
17. M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. *J. Comput. Phys.* **378**, 686–707 (2019).
18. Z. Long, Y. Lu, X. Ma, B. Dong, "PDE-Net: Learning PDEs from data" in *Proceedings of the 35th International Conference on Machine Learning* (Proceedings of Machine Learning Research, 2018), vol. 80, pp. 3208–3216.
19. Z. Long, Y. Lu, B. Dong, PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. *J. Comput. Phys.* **399**, 108925 (2019).
20. S. Rudy, A. Alla, S. L. Brunton, J. N. Kutz, Data-driven identification of parametric partial differential equations. *SIAM J. Appl. Dyn. Syst.* **18**, 643–660 (2019).
21. K. P. Champion, S. L. Brunton, J. N. Kutz, Discovery of nonlinear multiscale systems: Sampling strategies and embeddings. *SIAM J. Appl. Dyn. Syst.* **18**, 312–333 (2019).
22. C. Ma, J. Wang, W. E, Model reduction with memory and the machine learning of dynamical systems. *Commun. Comput. Phys.* **25**, 947–962 (2019).
23. P. R. Vlachas, W. Byeon, Z. Y. Wan, T. P. Sapsis, P. Koumoutsakos, Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. *Proc. R. Soc. A* **474**, 20170844 (2018).
24. J. Ling, A. Kurzawski, J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. *J. Fluid Mech.* **807**, 155–166 (2016).
25. J. X. Wang, J. L. Wu, H. Xiao, Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. *Phys. Rev. Fluids* **2**, 034603 (2017).
26. Q. Li, F. Dietrich, E. M. Bollt, I. G. Kevrekidis, Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman operator. *Chaos Interdiscip. J. Nonlinear Sci.* **27**, 103111 (2017).
27. N. Takeishi, Y. Kawahara, T. Yairi, "Learning Koopman invariant subspaces for dynamic mode decomposition" in *Advances in Neural Information Processing Systems 30 (NIPS 2017)*, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, Eds. (Neural Information Processing Systems Foundation, 2017), pp. 1130–1140.
28. B. Lusch, J. N. Kutz, S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics. *Nat. Commun.* **9**, 4950 (2018).
29. K. Champion, B. Lusch, J. N. Kutz, S. L. Brunton, Data-driven discovery of coordinates and governing equations. arXiv:1904.02107 (29 March 2019).
30. C. Cercignani, Small and large mean free paths. *Theory and Application of the Boltzmann Equation* (Scottish Academic Press, 1975), pp. 232–285.
31. S. Chapman, T. G. Cowling, The non-uniform state for a simple gas. *The Mathematical Theory of Non-uniform Gases: An Account of the Kinetic Theory of Viscosity, Thermal Conduction and Diffusion in Gases* (Cambridge University Press, 1990), pp. 110–131.
32. T. Nishida, Fluid dynamical limit of the nonlinear Boltzmann equation to the level of the compressible Euler equation. *Commun. Math. Phys.* **61**, 119–148 (1978).
33. R. E. Caflisch, The fluid dynamic limit of the nonlinear Boltzmann equation. *Commun. Pure Appl. Math.* **33**, 651–666 (1980).
34. F. Bouchut, F. Golse, M. Pulvirenti, *Kinetic Equations and Asymptotic Theory* (Elsevier, 2000).
35. G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks. *Science* **313**, 504–507 (2006).
36. I. Goodfellow, Y. Bengio, A. Courville, Autoencoders. *Deep Learning* (MIT Press, 2016), pp. 499–523.
37. B. Settles, "Active learning literature survey" (Technical Report, University of Wisconsin-Madison Department of Computer Sciences, 2009).
38. M. Zaheer et al., "Deep sets" in *Advances in Neural Information Processing Systems 30 (NIPS 2017)*, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, Eds. (Neural Information Processing Systems Foundation, 2017), pp. 3391–3401.
39. L. Zhang et al., "End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems" in *Advances in Neural Information Processing Systems 31 (NIPS 2018)*, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, Eds. (Neural Information Processing Systems Foundation, 2018), pp. 4436–4446.
40. M. Eickenberg, G. Exarchakis, M. Hirn, S. Mallat, L. Thiry, Solid harmonic wavelet scattering for predictions of molecule properties. *J. Chem. Phys.* **148**, 241732 (2018).
41. K. Duraisamy, G. Iaccarino, H. Xiao, Turbulence modeling in the age of data. *Annu. Rev. Fluid Mech.* **51**, 357–377 (2019).
42. E. Fonda, A. Pandey, J. Schumacher, K. R. Sreenivasan, Deep learning in turbulent convection networks. *Proc. Natl. Acad. Sci. U.S.A.* **116**, 8667–8672 (2019).
43. F. Filbet, S. Jin, A class of asymptotic-preserving schemes for kinetic equations and related problems with stiff sources. *J. Comput. Phys.* **229**, 7625–7648 (2010).
44. J. Hu, L. Ying, A fast spectral algorithm for the quantum Boltzmann collision operator. *Commun. Math. Sci.* **10**, 989–999 (2012).
45. I. M. Gamba, J. R. Haack, C. D. Hauck, J. Hu, A fast spectral method for the Boltzmann collision operator with general collision kernels. *SIAM J. Sci. Comput.* **39**, B658–B674 (2017).
46. J. Hu, Z. Ma, A fast spectral method for the inelastic Boltzmann collision operator and application to heated granular gases. *J. Comput. Phys.* **385**, 119–134 (2019).

## Information & Authors

### Information


#### Copyright

© 2019. Published under the PNAS license.

#### Submission history

**Published online**: October 16, 2019

**Published in issue**: October 29, 2019


#### Notes

This article is a PNAS Direct Submission.

### Authors

#### Competing Interests

The authors declare no competing interest.
