# A machine learning framework for solving high-dimensional mean field game and mean field control problems

Contributed by Stanley J. Osher, March 2, 2020 (sent for review December 18, 2019; reviewed by Weinan E and George Em Karniadakis)

## Significance

Mean field games (MFG) and mean field control (MFC) play central roles in a variety of scientific disciplines such as physics, economics, and data science. While the mathematical theory of MFGs has matured considerably, the development of numerical methods has not kept pace with growing problem sizes and massive datasets. Since MFGs, in general, do not admit closed-form solutions, effective numerical algorithms are critical. Most existing numerical methods use meshes and thus are prone to the curse of dimensionality. Our framework is mesh-free, since it combines Lagrangian PDE solvers and neural networks. By penalizing violations of the underlying Hamilton–Jacobi–Bellman equation, we increase accuracy and computational efficiency. Transforming MFGs into machine learning problems promises exciting opportunities to advance application and theory.

## Abstract

Mean field games (MFG) and mean field control (MFC) are critical classes of multiagent models for the efficient analysis of massive populations of interacting agents. Their areas of application span topics in economics, finance, game theory, industrial engineering, crowd motion, and more. In this paper, we provide a flexible machine learning framework for the numerical solution of potential MFG and MFC models. State-of-the-art numerical methods for solving such problems utilize spatial discretization that leads to a curse of dimensionality. We approximately solve high-dimensional problems by combining Lagrangian and Eulerian viewpoints and leveraging recent advances from machine learning. More precisely, we work with a Lagrangian formulation of the problem and enforce the underlying Hamilton–Jacobi–Bellman (HJB) equation that is derived from the Eulerian formulation. Finally, a tailored neural network parameterization of the MFG/MFC solution helps us avoid any spatial discretization. Our numerical results include the approximate solution of 100-dimensional instances of optimal transport and crowd motion problems on a standard workstation and a validation using an Eulerian solver in two dimensions. These results open the door to much-anticipated applications of MFG and MFC models that are beyond reach with existing numerical methods.

- mean field games
- mean field control
- machine learning
- optimal transport
- Hamilton–Jacobi–Bellman equations

Mean field games (MFG) (1–5) and mean field control (MFC) (6) allow one to simulate and analyze interactions within large populations of agents. Hence, these models have found widespread use in economics (7–10), finance (11–14), crowd motion (15–18), industrial engineering (19–21), and, more recently, data science (22) and material dynamics (23).

The theoretical properties of MFG and MFC problems have been continuously developed over the last few decades (see, e.g., refs. 24–29). A key observation is that both problems involve a Hamilton–Jacobi–Bellman (HJB) equation that is coupled with a continuity equation. From the solution of this system of partial differential equations (PDEs), each agent can infer the cost of their optimal action, which is why the solution is commonly called the value function. In addition, the agent can obtain their optimal action from the gradient of the value function, which alleviates the need for individual optimization (see refs. 30 and 31 for details). We note that, due to similar conventions in physics, the value function is sometimes also called a potential.

Our framework applies to a common subclass of MFGs, namely, potential MFGs, and MFCs. These problems can be formulated as infinite-dimensional optimal control problems in density space. Interestingly, their optimality conditions coincide with the HJB and continuity equation.

Despite many theoretical advances, the development of numerical methods for solving MFGs, particularly in high-dimensional sample spaces, has not kept pace with growing data and problem sizes. A crucial disadvantage of most existing approaches for solving MFGs or their underlying HJB equations is their reliance on meshes (see refs. 32–42 and references therein). Mesh-based methods are prone to the curse of dimensionality; that is, their computational complexity grows exponentially with the spatial dimension (43).

In this paper, we tackle the curse of dimensionality in two steps. First, extending the approach in ref. 44, we solve the continuity equation and compute all other terms involved in the MFG using Lagrangian coordinates. In practice, this requires the computation of characteristic curves and the Jacobian determinant of the induced transformation; both terms can be inferred from the value function. In our examples, the former follows the gradient, and the logarithm of the latter can be approximated by integrating the Laplacian of the value function. Our scheme also allows us to control and enforce the HJB equation along the characteristics. These computations can be performed independently and in parallel for all agents.
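For the quadratic Hamiltonians in our examples, the optimal velocity is the negative gradient of the value function, and, by Jacobi's identity, the logarithm of the Jacobian determinant can be accumulated by integrating the negative Laplacian along the characteristic. The following sketch illustrates this Lagrangian computation with forward Euler steps; the quadratic value function Φ(x) = ‖x‖²/2 is a hypothetical stand-in for the trained network, chosen so that both quantities have closed-form answers (z(T) = x₀e^(−T) and log det = −dT).

```python
import numpy as np

def gradient(z):           # ∇Φ for the toy choice Φ(x) = ||x||^2 / 2
    return z

def laplacian(z):          # ΔΦ = d for the same Φ
    return z.size

def characteristics(x0, T=1.0, n_steps=100):
    """Integrate z' = -∇Φ(z) and l' = -ΔΦ(z) with forward Euler.

    z approximates the agent trajectory; l approximates the
    log-determinant of the induced transformation (Jacobi's identity),
    so no mesh is needed and each x0 can be processed in parallel.
    """
    h = T / n_steps
    z, logdet = x0.copy(), 0.0
    for _ in range(n_steps):
        logdet -= h * laplacian(z)
        z = z - h * gradient(z)
    return z, logdet

x0 = np.array([1.0, -2.0])
z, logdet = characteristics(x0)
print(z, logdet)   # z near x0 * exp(-1), logdet = -2 for d = 2
```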

Second, we parameterize the value function in space and time using a neural network that is specifically designed for an accurate and efficient Lagrangian scheme. With this design, we can also directly penalize the violation of the HJB equation. Thereby, we develop a generic framework that transforms a wide range of MFGs into machine learning (ML) problems.

In our numerical experiments, we solve high-dimensional instances of the dynamical optimal transport (OT) problem (45, 46), a prototypical instance of a potential MFG, and an MFG inspired by crowd motion. In OT, one seeks the path connecting two given densities that minimizes a kinetic energy that models transport costs. As for other MFGs, there are many relevant high-dimensional instances related to dynamical OT, for example, in ML (47). We validate our scheme using comparisons to an Eulerian method on two-dimensional (2D) instances. Most crucially, we show the accuracy and scalability of our method using OT and MFG instances in up to 100 space dimensions. We also show that penalizing the HJB violations allows us to solve the problem more accurately with less computational effort. Our results for the crowd motion problem also show that our framework is capable of solving potential MFGs and MFCs with nonlinear characteristics.

To enable further theoretical and computational advances as well as applications to real-world problems, we provide our prototype implementation written in Julia (48) as open-source software under a permissive license.

## Related Work

Our work lies at the interface of ML, PDEs, and optimal control. Recently, this area has received a lot of attention, so that a comprehensive review is beyond the scope of this paper. The idea of solving high-dimensional PDEs and control problems with neural networks has been pioneered by other works (49–51) and has been further investigated by ref. 52. In this section, we put our contributions into context by reviewing recent works that combine concepts from ML, OT, MFGs, and Lagrangian methods for MFG.

### OT and ML.

Despite tremendous theoretical insight gained into the problem of optimally transporting one density to match another one, its numerical solution remains challenging, particularly when the densities live in spaces of four dimensions or more. In low-dimensional cases, there are many state-of-the-art approaches that effectively compute the global solution (see, e.g., refs. 40, 42, 53, and 54 and the survey in ref. 55). Due to their reliance on Euclidean coordinates, those techniques require spatial discretization, which makes them prone to the curse of dimensionality. An exception is the approach in ref. 56 that uses a generative model for computing the transport plan. This work uses a neural network model for the density and a Lagrangian PDE solver.

An ML task that bears many similarities to OT is variational inference with normalizing flows (47). Roughly speaking, the goal is to transform given samples from a typically unknown distribution such that they are approximately normally distributed. To this end, one trains a neural network-based flow model; hence, the name normalizing flow. The trained flow can be used as a generator to produce new samples from the unknown distribution by reversing the flow, starting with samples drawn from a normal distribution. While the original formulation of the learning problem in normalizing flows does not incorporate transport costs, refs. 57–59 successfully apply concepts from OT to analyze and improve the learning of flows. The approach in ref. 59 is formulated as a point cloud matching problem and estimates the underlying densities using Gaussian mixture models. The works (57, 58) propose neural network models for the value function instead of the velocity of the flow, which leads to more-physically plausible models. This parameterization has also been used to enforce parts of the HJB equation via quadratic penalties (58). We generalize these ideas from OT to a broader class of MFGs that naturally incorporate transport costs. We also add a penalty for the final time condition of the HJB to the training problem.

### MFGs and ML.

ML and MFGs have become increasingly intertwined in recent years. On the one hand, MFG theory has been used to provide valuable insight into the training problems in deep learning (22). On the other hand, refs. 60–62 use ML to solve MFGs in up to four spatial dimensions. The methods in refs. 61 and 62 are limited to MFGs whose formulations do not involve the population density, as this computation is challenging. For the time-independent second-order problems in ref. 62, one can express the density explicitly in terms of the value function. Furthermore, in numerical examples for the time-dependent case in ref. 61, the congestion terms depend on the average positions of the population. In this situation, the congestion term can be computed using sample averages. Our framework does not have the above limitations and, in particular, is applicable to MFGs where there is no analytical formula for the density or special structure that can be used to compute the congestion term, for example, MFGs with nonlinear congestion terms. We achieve this using the Lagrangian formulation that includes an estimate of the density along the agents’ trajectories. This generality is a critical contribution of our work.

Additionally, our neural network architecture for the control respects the structure induced by optimality conditions. We believe that this property is critical for obtaining accurate algorithms that scale and yield correct geometry for the agents’ trajectories. As a result, we use our method to approximately solve MFGs in 100 dimensions on a standard workstation.

### Lagrangian Methods in MFG.

To the best of our knowledge, the first Lagrangian method for solving MFG problems appeared in ref. 44. Lagrangian techniques are natural from an optimal control perspective and unavoidable for high-dimensional problems. However, a crucial computational challenge in applying these techniques in MFG stems from the density estimation, which is critical, for example, to compute the congestion cost incurred by an individual agent. In ref. 44, the authors overcome this difficulty for nonlocal interactions by passing to Fourier coordinates in the congestion term and thus avoiding the density estimation. Our neural network parameterization aims to reduce the computational effort and memory footprint of the methods in ref. 44 and provides a tractable way to estimate the population density.

Lagrangian methods for mass-transport problems in image processing were proposed in ref. 63. While the computation of the characteristics is mesh-free, the final density is computed using a particle-in-cell method that does not scale to high-dimensional problems.

## Mathematical Modeling

This section provides a concise mathematical formulation of MFG and MFC models and key references; for more details, see the monographs (3, 6). MFGs model large populations of rational agents that play a noncooperative differential game. At optimality, this leads to a Nash equilibrium where no single agent can do better by unilaterally changing their strategy. By contrast, in MFC, there is a central planner that aims at a distributed strategy for agents to minimize the average cost or maximize the average payoff across the population. Starting with a microscopic description of MFGs, we derive their macroscopic equations, introduce the important class of potential MFGs, and briefly outline MFCs.

### MFGs.

Assume that a continuum population of small rational agents plays a noncooperative differential game on a time horizon **1**,

The agents forecast a distribution of the population,

From Eq. **3**, we have that individual agents solve an optimal control problem that has a value function**7**, the population density,

The above discussion suggests an alternative to optimizing the strategy v individually for each agent. One can obtain the optimal strategy for all agents simultaneously by solving the coupled PDE system given by the HJB Eq. **5** and continuity Eq. **8** and then using Eq. **7**. Our work follows this macroscopic approach and aims at reducing the immense computational challenges caused by the high dimension of these PDEs.

### Potential MFGs.

Assume that there exist functionals

In the seminal paper (3), Lasry and Lions observed that, in this case, the MFG system Eqs. **5** and **8** coincides with first-order optimality conditions of the infinite-dimensional constrained optimization problem**9**. This is reminiscent of potential games, whose Nash equilibria are critical points of a single function that is called a potential. Remarkably, the Lagrange multiplier associated with the constraint in Eq. **9** satisfies the HJB Eq. **5**.

We use Eq. **7** to reparameterize the control in Eq. **9** and solve the infinite-dimensional optimal control problem**9**, this formulation enforces Eq. **7** by construction, which can reduce the computational cost of a numerical solution (see ref. 40).

### MFC.

Mathematically, MFC models are similar to potential MFGs. The key difference in MFC is that a central planner devises an optimal strategy, **2**, resulting in individual cost given by Eq. **1**. The goal of the central player is to minimize the overall costs obtained by replacing the running cost, F, and final cost, G, in Eq. **11** by**5** to**15**, respectively. Similar to potential MFGs, finding the optimal strategy v is equivalent to determining the value function Φ by solving Eq. **11**.

## Lagrangian Method

We follow a discretize-then-optimize approach to obtain a finite-dimensional instance of the optimal control problem Eq. **11**. The key idea of our framework is to overcome the curse of dimensionality by using a Lagrangian method to discretize and eliminate the continuity Eq. **8** and all other terms of the objective. We note that the trajectories of the individual agents in Eq. **2** are the characteristic curves of the continuity equation. Hence, Lagrangian coordinates are connected to the microscopic model and are a natural way to describe MFGs. Using Lagrangian coordinates, we obtain a stable and mesh-free discretization that parallelizes trivially.

To keep the discussion concrete and our notation brief, from now on, we consider the **6**, it is easy to verify that the Hamiltonian is

We solve the continuity Eq. **8** using the method of characteristics and Jacobi’s identity (ref. 65, chapter 10). Thereby, we eliminate the density ρ by using the constraint in Eq. **11** and obtain an unconstrained problem whose optimization variable is Φ. Given Φ, we obtain the characteristics by solving the ordinary differential equation (ODE)**7** into Eq. **2**, and the second step is specific to the H in Eq. **17**. For clarity, we explicitly denote the dependence of the characteristic on the starting point x in the following. Along the curve

Solving the continuity equation using Eq. **19** requires an efficient way of computing the Jacobian determinant along the characteristics. Direct methods require, in general, **17**, Δ is the Laplace operator, and we denote

An obvious, yet important, observation is that Lagrangian versions of some terms of the optimal control problems do not involve the Jacobian determinant. For example, using Eq. **19** and applying the change of variable formula to the first term in Eq. **10** gives**12** do not involve the Jacobian determinant.

Using this expression, we can obtain the accumulated transport costs associated with x as **17**.

We also accumulate the running costs associated with a fixed x along the characteristics by integrating*Numerical Experiments*.

We note that the trajectories associated with an optimal value function Φ must satisfy the HJB Eq. **5**. One way to ensure this by construction is to integrate the Hamiltonian system as also proposed in ref. 57. Since this complicates the use of the Jacobi identity, we instead penalize violations of the HJB along the characteristics using**13**. In summary, we compute the trajectories, log-determinant, transportation costs, running costs, and HJB violations by solving the initial value problem
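As a minimal illustration of this augmented system, the sketch below integrates a trajectory together with accumulators for the transport cost and the HJB penalty, assuming the standard OT Hamiltonian H(p) = ½|p|², so that the residual is ∂ₜΦ + ½|∇Φ|². The affine test function Φ(x, t) = c·x − (t/2)|c|² is hypothetical (not from the paper) and is chosen because it solves this HJB equation exactly; its time-independent counterpart does not, so the accumulated penalty detects the violation.

```python
import numpy as np

c = np.array([1.0, 2.0])

def make_phi(time_dependent):
    """Affine test value function Φ(x,t) = c·x − (t/2)|c|² (or just c·x)."""
    def grad(x, t):                 # ∇Φ
        return c
    def dt(x, t):                   # ∂t Φ
        return -0.5 * c @ c if time_dependent else 0.0
    return grad, dt

def solve(grad, dt, x0, T=1.0, n=200):
    """Augmented characteristics: trajectory, transport cost, HJB penalty."""
    h = T / n
    z, cost, penalty = x0.copy(), 0.0, 0.0
    for k in range(n):
        t = k * h
        v = -grad(z, t)                        # optimal control v = -∇Φ
        cost += h * 0.5 * v @ v                # running transport cost ½|v|²
        # HJB residual for H(p) = ½|p|²:  ∂tΦ + ½|∇Φ|²
        penalty += h * abs(dt(z, t) + 0.5 * grad(z, t) @ grad(z, t))
        z = z + h * v
    return z, cost, penalty

x0 = np.zeros(2)
_, cost_good, pen_good = solve(*make_phi(True), x0)    # exact HJB solution
_, cost_bad, pen_bad = solve(*make_phi(False), x0)     # violates the HJB
print(pen_good, pen_bad)   # penalty vanishes only for the exact solution
```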

### Optimization Problem.

We now approximate the integrals in Eq. **11** using a quadrature rule. Together with the Lagrangian method Eq. **25**, this leads to an unconstrained optimization problem in the value function Φ, which we will model with a neural network in *ML Framework for MFGs*.

Our approach is modular with respect to the choice of the quadrature; however, as we are mostly interested in high-dimensional cases, we use Monte Carlo integration, that is,**25**, and the change of variable used to compute Ĝ is discussed in *Numerical Experiments*. The above problem consists of minimizing the expected loss function, which is given by the sum of the running costs, terminal costs, and HJB penalty.
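A Monte Carlo estimate of an expectation with respect to the initial density takes the generic form below. The sampler, integrand, and tolerance are illustrative stand-ins (not the paper's densities or cost terms); the toy integrand is chosen because its exact expectation is known, namely E[‖x‖²] = d for a standard normal in d dimensions.

```python
import numpy as np

def monte_carlo_objective(f, sampler, n_samples, seed=0):
    """Approximate E_{x ~ rho0}[f(x)] by an average over i.i.d. samples.

    The quadrature is mesh-free: accuracy depends on the sample size,
    not on the spatial dimension.
    """
    rng = np.random.default_rng(seed)
    x = sampler(rng, n_samples)              # n_samples x d array
    return np.mean([f(xi) for xi in x])

d = 3
sampler = lambda rng, n: rng.standard_normal((n, d))
f = lambda x: x @ x                          # E[||x||^2] = d for N(0, I)
est = monte_carlo_objective(f, sampler, 20000)
print(est)   # close to 3
```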

Let **11** in Φ and ρ to an unconstrained optimization problem in Φ. For a given Φ, we eliminate the constraints by solving Eq. **25** independently for each point

Optimization algorithms for solving Eq. **27** can roughly be divided into two classes: stochastic approximation (SA) (67) and sample average approximation (SAA) (68); see also the recent survey (69). The former class contains, for example, variants of the stochastic gradient scheme. These methods aim at iteratively solving Eq. **26** using a sequence of steps, each of which uses gradient information computed from a relatively small number of sample points. A crucial parameter in these methods is the sequence of step sizes (also known as the learning rate), which is typically decaying. When chosen suitably, the steps reduce the expected value in Eq. **26**; however, it is not guaranteed that there exists a step size that reduces the approximation obtained for the current sample points. This prohibits the use of line search or trust region strategies and complicates the application of second-order methods. By contrast, the idea in SAA methods is to use a larger number of points such that Eq. **27** is a suitable approximation of Eq. **26**. Then, the problem can be treated as a deterministic optimization problem and solved, for example, using line search or trust region methods. We have experimented with both types of schemes and found an SAA approach based on quasi-Newton schemes with occasional resampling most effective for solving MFG and MFC problems to high accuracy; see *SI Appendix* for an experimental comparison.
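A minimal sketch of the SAA strategy follows, with plain gradient descent plus backtracking line search standing in for the quasi-Newton method we actually use, and a hypothetical scalar toy problem (estimating a distribution's mean) in place of the MFG objective. The point is structural: once a batch is frozen, the sample-average loss is deterministic, so line search is well defined; occasional resampling guards against overfitting the batch.

```python
import numpy as np

def saa_minimize(loss_grad, theta0, sample, n_outer=5, n_inner=50):
    """Sample average approximation with occasional resampling.

    Each outer iteration freezes a batch and runs a deterministic
    descent method with backtracking (Armijo) line search on the
    resulting sample-average loss.
    """
    rng = np.random.default_rng(0)
    theta = float(theta0)
    for _ in range(n_outer):
        batch = sample(rng)                     # freeze the samples
        for _ in range(n_inner):
            f, g = loss_grad(theta, batch)
            step = 1.0
            while True:                         # backtracking line search
                f_new, _ = loss_grad(theta - step * g, batch)
                if f_new <= f - 1e-4 * step * g * g or step < 1e-12:
                    break
                step *= 0.5
            theta -= step * g
    return theta

# Toy problem: minimize E[(theta - x)^2] for x ~ N(2, 1); minimizer is 2.
def loss_grad(theta, batch):
    r = theta - batch
    return np.mean(r ** 2), 2.0 * np.mean(r)

theta = saa_minimize(loss_grad, theta0=0.0,
                     sample=lambda rng: rng.normal(2.0, 1.0, 1000))
print(theta)   # near 2
```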

## ML Framework for MFGs

To obtain a finite-dimensional version of Eq. **27**, we parameterize the value function using a neural network. This enables us to penalize violations of the HJB Eq. **5** during training and thereby ensure that the trained neural network accurately captures the optimality conditions. Using a neural network in the Lagrangian method gives us a mesh-free scheme. In the following, we give a detailed description of our neural network models and the computation of the characteristics.

### Neural Network Models for the Value Function.

We now introduce our neural network parameterization of the value function. While this is not the only possible parameterization, using neural networks to solve high-dimensional PDE problems has become increasingly common (see, e.g., refs. 49–52). The novelty of our approach is the design of the neural network architecture such that the characteristics in Eq. **25** can be approximated accurately and efficiently.

Our network models map the input vector

The neural network processes the input features using a number of layers, each of which combines basic operations such as affine transformations and element-wise nonlinearities. While the size of the input and output features is determined by our application, there is flexibility in choosing the sizes of the hidden layers, which are also called their widths. Altogether, the number, widths, and specific design of the layers are referred to as the network’s architecture.

Our base architecture is defined as follows:**27**, in general, nonconvex.

In this work, our network architecture is a residual network (ResNet) (71). For a network with M layers and a given input feature

ResNets have been tremendously successful in a wide range of ML tasks and have been found to be trainable even for large depths. Since Eq. **30** can be seen as a forward Euler discretization of an initial value problem, the continuous limit of a ResNet is amenable to mathematical analysis (72, 73). This observation has received a lot of attention recently and has been used, for example, to derive a maximum principle (74), propose stable ResNet variants (73), and accelerate training using multilevel (75) and parallel-in-time schemes (76).
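For concreteness, a standard ResNet update of the form u_{m+1} = u_m + h σ(K_m u_m + b_m) with σ = tanh can be sketched as below; this is a generic residual layer under common assumptions (square weight matrices, tanh activation), not necessarily the exact architecture in Eq. **30**. Each layer is one forward Euler step of the ODE u′(t) = σ(K(t)u + b(t)), which is what makes the continuous-limit analysis possible.

```python
import numpy as np

def resnet_forward(u0, weights, biases, h=1.0):
    """ResNet forward pass: u_{m+1} = u_m + h * tanh(K_m u_m + b_m).

    With h interpreted as a time step, the network is a forward Euler
    discretization of the ODE u'(t) = tanh(K(t) u + b(t)).
    """
    u = u0.copy()
    for K, b in zip(weights, biases):
        u = u + h * np.tanh(K @ u + b)
    return u

d, M = 4, 3
rng = np.random.default_rng(0)
Ks = [rng.standard_normal((d, d)) * 0.1 for _ in range(M)]
bs = [rng.standard_normal(d) * 0.1 for _ in range(M)]
u0 = rng.standard_normal(d)
out = resnet_forward(u0, Ks, bs, h=0.5)

# Sanity check: with zero weights, tanh(0) = 0 and every layer is the identity.
zeros = resnet_forward(u0, [np.zeros((d, d))] * M, [np.zeros(d)] * M)
```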

### Characteristics Computations.

To compute the characteristic and other quantities in Eq. **25**, we need to compute the gradient and Laplacian of Φ with respect to the input features. Perhaps the simplest option is to use automatic differentiation (AD), which has become a ubiquitous and mature technology in most ML packages. We note that this convenience comes at the cost of d separate evaluations of directional derivatives when computing the Laplacian. While trace estimation techniques can lower the number of evaluations, this introduces inaccuracies in the PDE solver. Also, we show below that computing the Laplacian exactly is feasible for our network.

We now provide a detailed derivation of our gradient and Laplacian computation. The gradient of our model in Eq. **29** with respect to the input feature s is given by**30** in the direction w using back-propagation (also called reverse mode differentiation),

Next, we compute the Laplacian of the value function model with respect to x. We first note that**30** is

To compute the Laplacian of the entire ResNet, we continue with the remaining rows in Eq. **33** in reverse order to obtain

To summarize the above derivations, we note that each time step of the numerical approximation of the characteristics involves one forward propagation (Eq. **30**), one back-propagation (Eq. **33**), and the trace computations (Eq. **35**). The overall computational costs scale as
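To illustrate that the gradient and Laplacian can be obtained in closed form rather than by d separate AD passes, consider a hypothetical one-layer model Φ(s) = wᵀ(s + h·tanh(Ks + b)), a drastic simplification of our architecture introduced only for this sketch. One back-propagation gives the gradient, and a trace computation over the squared row norms of K gives the Laplacian; both are verified against finite differences.

```python
import numpy as np

def phi(s, w, K, b, h=0.5):
    """Toy one-layer value function Phi(s) = w^T (s + h*tanh(K s + b))."""
    return w @ (s + h * np.tanh(K @ s + b))

def grad_and_laplacian(s, w, K, b, h=0.5):
    """Closed-form gradient (one back-propagation) and Laplacian (a trace
    computation), avoiding d separate directional-derivative evaluations."""
    z = K @ s + b
    t = np.tanh(z)
    sig1 = 1.0 - t ** 2                  # tanh'
    sig2 = -2.0 * t * sig1               # tanh''
    grad = w + h * K.T @ (sig1 * w)
    lap = h * np.sum(w * sig2 * np.sum(K ** 2, axis=1))
    return grad, lap

rng = np.random.default_rng(0)
d = 5
w, b, s = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
K = rng.standard_normal((d, d))

g, lap = grad_and_laplacian(s, w, K, b)

# Verify against central finite differences.
eps = 1e-4
g_fd = np.array([(phi(s + eps * e, w, K, b) - phi(s - eps * e, w, K, b))
                 / (2 * eps) for e in np.eye(d)])
lap_fd = sum((phi(s + eps * e, w, K, b) - 2 * phi(s, w, K, b)
              + phi(s - eps * e, w, K, b)) / eps ** 2 for e in np.eye(d))
```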

## Numerical Experiments

We apply our method to two prototype MFG instances. Here, we give a concise overview of the problems and results and refer to *SI Appendix* for a more detailed description of the problem instances and results. We perform our experiments using a prototype of our algorithm that we implemented in the Julia programming language (48) as an extension of the ML framework Flux (77). Readers can access all codes, scripts, and produced results at http://github.com/EmoryMLIP/MFGnet.jl.

### Example 1: Dynamical OT.

Given two densities

Among the many versions of OT, we consider the fluid dynamics formulation (45), which can be seen as a potential MFG. This formulation is equivalent to a convex optimization problem, which can be solved accurately and efficiently if

We model the fluid dynamic version of OT in ref. 45 as a potential MFG by using**20**. The variational derivative of this loss function, which is used in the computation of the penalty term in Eq. **14**, is**11** is symmetric with respect to the roles of

We generate a synthetic test problem that enables us to increase the dimension of the problem with only minimal effect on its solution. For a given dimension d, we choose the density of the Gaussian white noise distribution as the target

We perform four experiments: we show the scalability to high-dimensional instances, explore the benefits of the penalty function, compare two optimization strategies, and validate our scheme against an Eulerian method; see *SI Appendix* for details.

We show, in Fig. 2, that our network provides a meaningful solution for the 2D instance. The performance can be seen by the close match of the given *SI Appendix*, Fig. S4).

Our method obtains quantitatively similar results in higher-dimensional instances (*SI Appendix*, Fig. S5). The values are computed using the validation sets. The table also shows that, despite a moderate growth of the number of training samples, the actual runtime per iteration of our prototype implementation grows slower than expected from our complexity analysis. We use more samples in higher dimensions to avoid overfitting. Due to the design of the problem, similar transport costs and terminal costs are to be expected. There is a slight increase of the terminal costs for the 50-dimensional instances, but we note that, at least for projections onto the first two coordinate dimensions, the image quality is similar for all cases (*SI Appendix*, Fig. S4).

In our second experiment, we show that, without the use of the HJB penalty, the results are considerably less accurate (*SI Appendix*, Fig. S6). Increasing the number of time steps to discretize the characteristics, Eq. **25**, improves the results; however, the results are still inferior to the ones obtained with the penalty, and the computational cost of training is 4 times higher. Hence, in this example, using the penalty function, we obtain a more accurate solution at reduced computational costs.

Our third OT experiment compares two optimization strategies: the SAA approach with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method used throughout our experiments and an SA approach with the Adam method that is common in ML. We note that, for the 2D instance of our problem, Adam converges considerably slower and is less effective in reducing the HJB penalty at a comparable computational cost; however, both methods lead to similar final results.

In our fourth experiment, we compare our proposed method to a provably convergent Eulerian solver for the 2D instance. We evaluate the optimal controls obtained by both methods using an explicit finite volume solver for the continuity equation, which was not used during the optimization; this step is essential for a fair comparison. *SI Appendix*, Figs. S10 and S12 and Table S2 show that, for this example, our method is competitive and, as demonstrated above, scales to dimensions that are beyond reach for Eulerian schemes.

### Example 2: Crowd Motion.

We consider the motion of a crowd of agents distributed according to an initial density

To force the agents to move to the target distribution, we use the same terminal cost as in the previous example. To express the agents’ preference to avoid congestion and model costs associated with traveling through the center of the domain, we use the mean field potential energy**37**. To penalize the **16** and **17**.

Finally, we take the variational derivative to obtain the HJB Eq. **5**. We note that the mean field coupling is

A detailed description of the experimental setup is provided in *SI Appendix*. The initial and target densities are shifted Gaussians**36**. For the higher-dimensional instances, we evaluate the preference function using the first two components of x. Thereby, the preference function becomes invariant to the remaining entries, and the solutions become comparable across dimensions.

As in the OT example, we obtain similar but not identical objective function values for the problem instances obtained by choosing *SI Appendix*, Fig. S9 for similar visualizations for the remaining instances. As expected, the agents avoid the crowded regions and prefer to spread out horizontally to avoid congestion. We chose the starting points of the characteristics to be symmetric about the *SI Appendix*, Figs. S11 and S13 and Table S2).

## Discussion and Outlook

By combining Lagrangian PDE solvers with neural networks, we develop a framework for the numerical solution of potential MFGs and MFC problems. Our method is geared toward high-dimensional problem instances that are beyond reach with existing solution methods. Since our method is mesh-free and well suited for parallel computation, we believe it provides a promising direction toward much-anticipated large-scale applications of MFGs. Even though our scheme is competitive with Eulerian schemes based on convex optimization for the 2D examples, its main advantage is the scalability to higher dimensions. We exemplify the effectiveness of our method using an OT and a crowd motion problem in 100 dimensions. Using the latter example, we also demonstrate that our scheme can learn complex dynamics even with a relatively simple neural network model.

The fact that our framework transforms MFGs into types of ML problems brings exciting opportunities to advance MFG application and theory. While we can already leverage the many advances made in ML over the last decades, the learning problem in MFGs has some unique traits that require further attention.

Open mathematical issues include the development of practical convergence results for neural networks in optimal control. In fact, it is not always clear that a larger network will perform better in practice. Our numerical experiment for the OT problem makes us optimistic that our framework may be able to solve HJB equations in high dimensions to practically relevant accuracy. Similarly, the impact of regularization parameters on the optimization deserves further attention. However, more analysis is needed to obtain firm theoretical results for this property.

A critical computational issue is a more thorough parallel implementation, which may provide the speed-up needed to solve more-realistic MFG problems. Also, advances in numerical optimization for deep learning problems may help solve the learning problems more efficiently.

Open questions on the ML side include the setup of the learning problem. There are few theoretical guidelines for designing network architectures that generalize well. Although there is a rich body of examples for data science applications, our learning problem has different requirements. For example, not only does the neural network need to be evaluated, but the Lagrangian scheme also requires its gradient and Laplacian. A careful choice of the architecture may be necessary to ensure the required differentiability of the network. Also, more experiments are needed to establish deeper intuition into the interplay between the network architecture and other hyperparameters, for example, the choice of the penalty parameter or the time discretization of the characteristics.

An obvious open question from an application perspective is the use of our framework to solve more-realistic problems. To promote progress in this area, we provide our code under a permissive open-source license.

## Acknowledgments

We thank Derek Onken for fruitful discussions and his help with the Julia implementation; and Wonjun Lee and Siting Liu for their assistance with the implementation of the Eulerian solver for the 2D problem instances and their suggestions that helped improve the manuscript. L.R. is supported by NSF Grant DMS-1751636. This research was performed while L.R. was visiting the Institute for Pure and Applied Mathematics, Los Angeles, CA, which is supported by NSF Grant DMS-1440415. S.J.O., W.L., L.N., and S.W.F. receive support from Air Force Office of Scientific Research Grants MURI-FA9550-18-1-0502 and FA9550-18-1-0167, and Office of Naval Research Grant N00014-18-1-2527.

## Footnotes

^{1}To whom correspondence may be addressed. Email: sjo{at}math.ucla.edu or lruthotto{at}emory.edu.

Author contributions: L.R., S.J.O., W.L., L.N., and S.W.F. designed research; L.R., L.N., and S.W.F. performed research; and L.R., S.J.O., W.L., L.N., and S.W.F. wrote the paper.

Reviewers: W.E., Princeton University; and G.E.K., Brown University.

The authors declare no competing interest.

Data deposition: A prototype implementation written in Julia is available as open-source software under a permissive license. Readers can access all codes, scripts, and produced results at http://github.com/EmoryMLIP/MFGnet.jl.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1922204117/-/DCSupplemental.

Published under the PNAS license.

## Article Classifications

- Physical Sciences
- Applied Mathematics