# Mining gold from implicit models to improve likelihood-free inference

^{a}Center for Cosmology and Particle Physics, New York University, New York, NY 10003; ^{b}Center for Data Science, New York University, New York, NY 10003; ^{c}Department of Electrical Engineering and Computer Science, University of Liège, B-4000 Liège, Belgium; ^{d}Federico Santa María Technical University, Valparaíso, Chile


Edited by Larry Wasserman, Carnegie Mellon University, Pittsburgh, PA, and approved January 22, 2020 (received for review September 20, 2019)

## Significance

In scientific fields as diverse as particle physics, genetics, and epidemiology, complex computer simulations provide the most accurate description of phenomena, but the corresponding likelihood function cannot be computed. This is a major challenge for statistical inference. We present inference techniques for this case that combine the insight that additional latent information can be extracted from the simulator with the power of neural networks in regression and density estimation tasks. In toy problems and a real-life particle physics problem, these techniques lead to better sample efficiency than established methods, demonstrating the potential to enable more precise scientific results from large-scale experiments across many disciplines.

## Abstract

Simulators often provide the best description of real-world phenomena. However, the probability density that they implicitly define is often intractable, leading to challenging inverse problems for inference. Recently, a number of techniques have been introduced in which a surrogate for the intractable density is learned, including normalizing flows and density ratio estimators. We show that additional information that characterizes the latent process can often be extracted from simulators and used to augment the training data for these surrogate models. We introduce several loss functions that leverage these augmented data and demonstrate that these techniques can improve sample efficiency and quality of inference.

In many areas of science, complicated real-world phenomena are best described through computer simulations. Typically, the simulators implement a stochastic generative process in the “forward” mode based on a well-motivated mechanistic model with parameters θ. While the simulators can generate samples of observations x for given parameters θ, the probability density p(x|θ) that they implicitly define is typically intractable, as evaluating it would require marginalizing over the enormous space of latent variables. Statistical inference on the parameters θ is therefore a challenging inverse problem.

We present a suite of techniques that can dramatically improve the sample efficiency for training neural network surrogates that estimate the likelihood function or the likelihood ratio. The additional latent information that can be extracted from the simulator is described in *Extracting More Information from the Simulator*. In *Learning from Augmented Data*, we introduce the loss functions that utilize these augmented data. In *Experiments*, we demonstrate through a range of experiments that these techniques provide a significant increase in sample efficiency compared with techniques that do not leverage the augmented data, which ultimately increases the quality of inference.

## Related Work

Techniques for likelihood-free inference can be divided into two broad categories. The first category uses the simulator directly during inference, while the second uses the simulator to construct or train a tractable surrogate model that is used during inference. There are rich connections between simulator-based inference and learning in implicit generative models, with a considerable amount of cross-pollination between these areas (13).

### ABC.

A particularly ubiquitous method is approximate Bayesian computation (ABC) (1–5), a Bayesian sampling technique in which the likelihood is approximated by comparing data generated from the simulator with the observed data. This approach requires introducing a kernel and a tolerance that quantify the similarity between simulated and observed data, as well as, in practice, summary statistics that reduce the dimensionality of the data; these choices can strongly affect the quality of inference.

### Probabilistic Programming.

Probabilistic programming systems represent another class of methods that use the simulator directly during inference (43, 44). These techniques are deeply integrated into the control flow of the program but still require a tractable likelihood term or an ABC-like kernel to compare the simulated data x with the observed data.

### The Likelihood Ratio Trick.

A surrogate model for the likelihood ratio can be constructed by training a classifier to discriminate samples generated from the simulator at two different parameter points; the classifier output can then be converted into an estimate of the likelihood ratio between those points. This observation is often referred to as the likelihood ratio trick.
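As a minimal illustration of the basic likelihood ratio trick (not the calibrated Carl procedure of the paper), the sketch below trains a logistic classifier to separate samples from two unit-variance Gaussians standing in for p(x|θ₀) and p(x|θ₁); the parameter values, the single linear layer, and the learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x0 = rng.normal(0.0, 1.0, n)   # samples from p(x|theta0) = N(0, 1), class label 0
x1 = rng.normal(0.5, 1.0, n)   # samples from p(x|theta1) = N(0.5, 1), class label 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic classifier s(x) = sigmoid(w*x + b), fit by plain gradient descent
w, b = 0.0, 0.0
for _ in range(2000):
    s = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((s - y) * x)   # gradient of the cross-entropy loss
    b -= 0.1 * np.mean(s - y)

# Likelihood ratio trick: (1 - s) / s estimates p(x|theta0) / p(x|theta1),
# so log r_hat(x) = -(w*x + b) for a sigmoid classifier
def log_r_hat(xv):
    return -(w * xv + b)

def log_r_true(xv):  # exact log ratio of the two Gaussians: -0.5*x + 0.125
    return -0.5 * xv**2 + 0.5 * (xv - 0.5) ** 2

for xv in (-1.0, 0.0, 1.0):
    print(xv, log_r_hat(xv), log_r_true(xv))
```

Because the exact log ratio of two unit-variance Gaussians is linear in x, the linear classifier is well specified in this toy case; in general, a more expressive model is needed.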

### NDE.

More recently, several methods for conditional neural density estimation (NDE) have been proposed, often based on neural networks (17–38). They can be used to train a surrogate for the intractable likelihood, which then takes the likelihood's place during inference.

### Contributions.

The most important contributions that differentiate our work from the existing methods are the observation that additional information can be extracted from the simulator and the development of loss functions that allow us to use these “augmented” data to more efficiently learn surrogates for the likelihood function. In addition, we show how the augmented data can be used to define locally optimal summary statistics, which can then be used for inference with density estimation techniques or ABC. We playfully introduce the analogy of mining gold as these augmented data require work to extract and are very valuable. In experiments, we demonstrate that these approaches can dramatically improve sample efficiency and quality of likelihood-free inference.

Concurrently, the application of these methods to a specific class of problems in particle physics has been discussed in refs. 47 and 48. The present manuscript is meant to serve as the primary reference for these techniques and is addressed to the broader physical science and machine learning communities. Most importantly, it generalizes the specific particle physics case to almost any scientific simulator, requiring significantly weaker assumptions than those made in the physics context. We also introduce an algorithm called Scandal (score-augmented neural density approximates likelihood), for which we provide the experimental results.

## Extracting More Information from the Simulator

We consider a scientific simulator that implements a stochastic generative process: starting from the parameters θ, the program proceeds through a series of latent states z, each sampled from a tractable conditional distribution, before finally producing an observation x.

In this paper, we consider the problem of estimating the likelihood function, or the ratio of likelihoods evaluated at two parameter points, when this density is intractable and only the simulator itself is available.

Typically, the setting of likelihood-free inference assumes that the only available outputs from the simulator are samples of observations x generated for given values of the parameters θ.

The key observation that is the starting point of our inference methods is the following. While the likelihood p(x|θ), which involves a sum or integral over all latent trajectories, is intractable, the joint probability p(x,z|θ) of an observation x together with a particular latent trajectory z is usually tractable: it is the product of the conditional densities of the individual steps of the generative process, each of which is accessible inside the simulator. This allows us to extract the joint score t(x,z|θ) = ∇_θ log p(x,z|θ) from the simulator.

Similarly, we can extract the joint likelihood ratio r(x,z|θ₀,θ₁) = p(x,z|θ₀)/p(x,z|θ₁), the ratio of the joint probabilities of the same observation x and latent trajectory z evaluated at two different parameter points.

As a motivating example, consider the simulation for a generalization of the Galton board in which a set of balls is dropped through a lattice of nails ending in one of several bins denoted by x. The Galton board is commonly used to demonstrate the central limit theorem. If the nails are uniformly placed such that the probability of bouncing to the left is p, the sum over the latent space is tractable analytically, and the resulting distribution of x is a binomial distribution. However, if the probability of bouncing to the left is an arbitrary function of the nail position and some parameter θ (for instance, because the nails are not uniformly placed), the resulting distribution requires an explicit sum over the latent paths z that might lead to a particular x. Such a distribution would become intractable as the size of the lattice increases. Fig. 1 shows an example of two latent trajectories that lead to the same x. In this toy example, the joint probability of a single trajectory z ending in an observation x is nevertheless tractable: it is simply the product of the probabilities of bouncing left or right at each nail along the path.

Fig. 1 shows that a large number of samples from the simulator are needed to reveal the differences in the distribution of x for small changes in θ; the number of samples required grows rapidly as the parameter points move closer together. On the other hand, the joint likelihood ratio of Eq. **3** can easily be computed by accumulating the factors associated with each left or right bounce along the simulated trajectory, and the joint score can be accumulated in the same way.
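A minimal sketch of this "gold mining" for a generalized Galton board is shown below. The sigmoid dependence of the left-bounce probability on the nail position and θ is an invented toy rule, not the exact model of Fig. 1; the point is that the joint log likelihood ratio and the joint score are accumulated factor by factor while the ball falls:

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def run_galton(theta0, theta1, rows=20, seed=None):
    """Drop one ball through a toy generalized Galton board, mining the
    joint log likelihood ratio log p(x,z|theta0) - log p(x,z|theta1) and
    the joint score d/dtheta log p(x,z|theta) at theta0 along the way."""
    rng = random.Random(seed)
    pos = 0                          # number of right bounces so far
    log_r, score = 0.0, 0.0
    for row in range(rows):
        # invented toy rule: the left-bounce probability depends on the
        # nail position (via a) and on theta through a sigmoid
        a = (pos - row / 2) / rows
        p_left0 = sigmoid(theta0 * a)
        p_left1 = sigmoid(theta1 * a)
        if rng.random() < p_left0:   # sample the bounce at theta0
            log_r += math.log(p_left0) - math.log(p_left1)
            score += a * (1.0 - p_left0)   # d/dtheta log sigmoid(theta * a)
        else:
            log_r += math.log(1.0 - p_left0) - math.log(1.0 - p_left1)
            score -= a * p_left0           # d/dtheta log(1 - sigmoid(theta * a))
            pos += 1
    return pos, log_r, score             # x = final bin, plus the mined gold

x, log_r, score = run_galton(1.0, 1.2, seed=0)
print(x, log_r, score)
```

A single run yields the observation x together with the joint quantities, at essentially no extra cost beyond bookkeeping inside the simulation loop.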

It may sound as if the joint likelihood ratio and joint score might only be accessible in specially tailored toy examples and are not available in complicated real-life simulators, but that is not the case. There are three different strategies for their calculation in practice.

1) In many complex simulators, only a few steps of the program flow explicitly depend on the parameters of interest, and the corresponding probability distributions are known analytically. In this case, we can add a manual calculation of these quantities to the simulator code or even calculate them based on existing simulator output. In addition to this (small) code overhead, this strategy requires domain knowledge. We have successfully applied this approach to state-of-the-art simulators of particle physics experiments (47, 48, 51) (*Experiments*) and in a simulator of strong gravitational lensing from dark matter halos (52). In both cases, the calculation of these quantities required some domain knowledge but only little engineering. We will use this strategy for the experiments in *Experiments*.

2) The calculation of the joint ratio and score can be added to existing real-world simulators via a protocol like PPX (53), an open interoperability protocol between codes in different programming languages.

3) Finally, the development of simulators within probabilistic programming frameworks is an alternative, more general solution; in this case, the joint likelihood ratio and joint score can be calculated without requiring additional domain knowledge or changes to the simulator code. As a proof of principle, in ref. 54 we provide a framework that automates these calculations for any simulator in which all stochastic steps are implemented with the Pyro library (55).

Our work should be considered as motivation to implement access to the joint likelihood ratio and joint score through any of these strategies.

## Learning from Augmented Data

### Key Idea.

How can the “augmented data,” consisting of simulated observations together with the corresponding joint likelihood ratios and joint scores, be used, given that these quantities are conditioned on the latent variables z while inference requires quantities that depend on the observable x alone?

Consider the squared error of a function ĝ(x) that depends only on the observable x, evaluated against a target g(x,z) that also depends on the latent variables z, with the expectation taken over the joint distribution p(x,z|θ). This loss functional is minimized by the conditional expectation ĝ*(x) = E[g(x,z)|x], in which the latent variables are effectively marginalized out.

Identifying the target g(x,z) with the joint likelihood ratio r(x,z|θ₀,θ₁) and taking the expectation with respect to p(x,z|θ₁), we find that the function minimizing the squared error is exactly the intractable likelihood ratio r(x|θ₀,θ₁) = p(x|θ₀)/p(x|θ₁).

Similarly, by identifying the target with the joint score t(x,z|θ₀) and taking the expectation with respect to p(x,z|θ₀), the minimizing function is the intractable score t(x|θ₀) = ∇_θ log p(x|θ) evaluated at θ₀.

These loss functionals are immensely useful because they allow us to transform samples augmented with joint likelihood ratios and joint scores, which are conditioned on the latent variables, into estimators of the likelihood ratio and score that depend on the observables alone.
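This marginalization property can be checked numerically on a toy model where everything is tractable. In the hypothetical two-step simulator below, z ∼ N(θ, 1) and x = z + ε with ε ∼ N(0, 1), so that p(x|θ) = N(θ, 2); only the z step depends on θ, so the joint ratio involves only the densities of z:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, theta1 = 0.0, 0.5
n = 1_000_000

# Hypothetical two-step simulator run at theta1:
# latent z ~ N(theta, 1), then observable x = z + N(0, 1)
z = rng.normal(theta1, 1.0, n)
x = z + rng.normal(0.0, 1.0, n)

# Joint likelihood ratio r(x,z|theta0,theta1): the x-step does not depend
# on theta, so only the densities of the z-step enter
joint_r = np.exp(-0.5 * (z - theta0) ** 2 + 0.5 * (z - theta1) ** 2)

# Ground truth, tractable only in this toy model: p(x|theta) = N(theta, 2)
def true_r(xv):
    return np.exp(-0.25 * (xv - theta0) ** 2 + 0.25 * (xv - theta1) ** 2)

# Conditional expectation of the joint ratio given x, estimated by binning
for lo in (-1.0, 0.0, 1.0):
    mask = (x > lo) & (x < lo + 0.1)
    print(lo, joint_r[mask].mean(), float(true_r(lo + 0.05)))
```

The binned conditional averages of the joint ratio converge to the exact marginal ratio, even though no integral over z is ever performed explicitly.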

### Learning the Likelihood Ratio.

Based on this observation, we introduce a family of likelihood-free inference techniques. They fall into two categories. We first discuss a class of algorithms that uses the augmented data to learn a surrogate model for the likelihood ratio between any two parameter points. In *Locally Sufficient Statistics for Implicit Models*, we will define a second class of methods that is based on a local expansion of the model around some reference parameter point.

The simulators we consider in this work implicitly define not only a single density for fixed parameters but a whole family of densities indexed by θ. We therefore train surrogates that take the parameters θ as an additional input, so that after a single training phase they can be evaluated at any parameter point, amortizing the cost of inference.

#### Rolr (regression on likelihood ratio).

First, a number of parameter points θ₀ and θ₁ are drawn from a proposal distribution. The simulator is run at these points, and for each simulated observation x the joint likelihood ratio r(x,z|θ₀,θ₁) (and optionally the joint score) is extracted.

An expressive regressor, such as a neural network, is then trained to minimize the squared error with respect to the joint likelihood ratio, using samples drawn at both parameter points.

Both terms in this loss function are estimators of Eq. **7**; in the second term, we switch the roles of θ₀ and θ₁ and regress the inverse of the estimated ratio on the inverse of the joint ratio.
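The same tractable toy model as before (z ∼ N(θ, 1), x = z + N(0, 1), so p(x|θ) = N(θ, 2)) can illustrate Rolr-style regression. A least-squares fit in a fixed polynomial basis stands in for the neural network, and only the loss term with samples drawn at θ₁ is used; both are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, theta1 = 0.0, 0.5
n = 200_000

# Hypothetical simulator run at theta1: z ~ N(theta, 1), x = z + N(0, 1),
# so the marginal (normally intractable) is p(x|theta) = N(theta, 2)
z = rng.normal(theta1, 1.0, n)
x = z + rng.normal(0.0, 1.0, n)

# Joint likelihood ratio: only the z-step depends on theta
joint_r = np.exp(-0.5 * (z - theta0) ** 2 + 0.5 * (z - theta1) ** 2)

# Rolr-style regression: minimize the squared error between a surrogate
# r_hat(x) = w . phi(x) and the joint ratio; a polynomial basis stands in
# for the neural network used in the paper
phi = np.stack([np.ones(n), x, x**2, x**3], axis=1)
w, *_ = np.linalg.lstsq(phi, joint_r, rcond=None)

def r_hat(xv):
    return w @ np.array([1.0, xv, xv**2, xv**3])

def r_true(xv):  # exact ratio of N(theta0, 2) to N(theta1, 2)
    return np.exp(-0.25 * (xv - theta0) ** 2 + 0.25 * (xv - theta1) ** 2)

for xv in (-1.0, 0.0, 1.0):
    print(xv, r_hat(xv), r_true(xv))
```

Although the surrogate is trained on individual joint ratios, which fluctuate strongly from sample to sample, the squared-error minimizer converges to the smooth marginal ratio.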

#### Rascal (ratio and score approximate likelihood ratio).

If such a likelihood ratio regressor is differentiable with respect to θ (as is the case for neural networks), we can calculate the predicted score and compare it with the joint score through the squared-error loss of Eq. **9**. Turning this argument around, we can improve the training of a likelihood ratio estimator by minimizing a combination of the ratio and score losses, weighted relative to each other by a hyperparameter α.

#### Cascal (calibrated approximate ratios of likelihoods and score approximate likelihood ratio).

The same trick can be used to improve the likelihood ratio trick and the Carl inference method (10, 12). Following the discussion around Eq. **1**, a calibrated classifier trained to discriminate samples drawn at θ₀ from samples drawn at θ₁ defines an estimator of the likelihood ratio; augmenting its training loss with the squared error with respect to the joint score, again weighted by a hyperparameter α, yields the Cascal method.

#### Scandal.

Finally, we can use the same strategy to improve conditional neural density estimators, such as density networks or normalizing flows. If a parameterized neural density estimator is differentiable with respect to θ, its predicted score can be compared with the joint score extracted from the simulator; adding this squared error, weighted by a hyperparameter α, to the usual negative log-likelihood loss defines the Scandal method.

### Locally Sufficient Statistics for Implicit Models.

A second class of likelihood-free inference methods is based on an expansion of the implicit model around a reference parameter point θ_ref. In a sufficiently small neighborhood of θ_ref, the model is well approximated by an exponential family in which the score, defined in Eq. **5**, is its sufficient statistics.

For inference in a sufficiently small neighborhood around a reference point θ_ref, the score therefore captures all of the information in x. Minimizing the squared error with respect to the joint score in Eq. **9** allows us to extract sufficient statistics from an intractable, nondifferentiable simulator, at least in the neighborhood of θ_ref.

#### Sally (score approximates likelihood locally).

By minimizing the squared error with respect to the joint score (Eq. **9**), we train a score estimator at the reference point. Its output is a vector with the dimensionality of the parameter space, which serves as a summary statistic: the Sally method estimates the density of the estimated score and performs inference in this summary space.
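A sketch of the Sally idea on the same kind of tractable toy model (z ∼ N(θ, 1), x = z + N(0, 1), so p(x|θ) = N(θ, 2) and the exact score at θ_ref is (x − θ_ref)/2). A linear regressor stands in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_ref = 0.0
n = 200_000

# Hypothetical simulator run at the reference point:
# z ~ N(theta_ref, 1), x = z + N(0, 1), so p(x|theta) = N(theta, 2)
z = rng.normal(theta_ref, 1.0, n)
x = z + rng.normal(0.0, 1.0, n)

# Joint score t(x,z|theta_ref) = d/dtheta log p(z|theta) at theta_ref:
# only the z-step depends on theta
joint_t = z - theta_ref

# Sally: regress the joint score on x alone; the squared-error minimizer
# is the true score t(x|theta_ref). A linear model stands in for a network.
phi = np.stack([np.ones(n), x], axis=1)
w, *_ = np.linalg.lstsq(phi, joint_t, rcond=None)

# For p(x|theta) = N(theta, 2) the exact score is (x - theta_ref) / 2,
# so w should come out approximately [0.0, 0.5]
print(w)
```

The fitted function is then used as a one-dimensional summary statistic: density estimation or ABC can proceed in the score space instead of the original observable space.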

#### Sallino (score approximates likelihood locally in one dimension).

The Sally inference method requires density estimation in the estimated score space, which typically has the dimensionality of the parameter space. According to the local model of Eq. **15**, the likelihood ratio between θ and θ_ref depends on x only through the scalar product of the score and θ − θ_ref; the Sallino method therefore projects the estimated score onto this direction and performs density estimation in the resulting one-dimensional space.

The Sally and Sallino techniques are designed to work very well close to the reference point. The local model approximation may deteriorate further away, leading to a reduced sensitivity and weaker bounds. These approaches are simple and robust, and in particular, the Sallino algorithm scales exceptionally well to high-dimensional parameter spaces.

For all of these inference strategies, the augmented data are particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ. When restricted to samples alone, an inference method must resolve the subtle differences between the corresponding distributions statistically, which requires very large sample sizes; the joint likelihood ratio and joint score make these differences explicit for every individual sample.

## Experiments

### Generalized Galton Board.

We return to the motivating example in *Extracting More Information from the Simulator* and Fig. 1 and try to estimate likelihood ratios for the generalized Galton board. We use the likelihood ratio trick and a neural density estimator as baselines and compare them with the Rolr, Rascal, Cascal, and Scandal methods. As the simulator defines a distribution over a discrete x, for the NDE and Scandal methods we use a neural network with a softmax output layer over the bins to model the probability of each discrete outcome.

### Lotka–Volterra Model.

As a second example, we consider the Lotka–Volterra system (59, 60), a common example in the likelihood-free inference literature. This stochastic Markov jump process models the dynamics of a species of predators interacting with a species of prey. Four parameters θ set the rates at which predators and prey are born, predators die, and predators eat prey.

We simulate the Lotka–Volterra model with Gillespie’s algorithm (61). From the time evolution of the predator and prey populations, we calculate summary statistics that serve as the observables x. We compare the inference methods of *Learning the Likelihood Ratio*, including a Scandal likelihood estimator based on a masked autoregressive flow (MAF). For MAF and Scandal, we stack four masked autoregressive distribution estimators (MADEs) (27) on a mixture of MADEs with 10 components (21). For all other methods, we use three hidden layers. In all cases, the hidden layers have 100 units and tanh activations. Code for simulation and inference is available in ref. 63.
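A minimal Gillespie implementation of the predator-prey process that also mines the joint score is sketched below. The hazard functions, rate values, and initial populations are illustrative assumptions rather than the exact configuration of refs. 61 and 63, and the summary statistics computed from the time series are omitted:

```python
import numpy as np

def lotka_volterra(theta, x0=50, y0=100, t_max=5.0, seed=0):
    """Gillespie simulation of a predator-prey model that also mines the
    joint score t(x,z|theta) = grad_theta log p(x,z|theta) along the run.
    theta: rate constants for (predator born, predator dies,
    prey born, prey eaten); the hazards below are illustrative choices."""
    rng = np.random.default_rng(seed)
    x, y, t = x0, y0, 0.0            # predators, prey, time
    score = np.zeros(4)
    for _ in range(100_000):         # event cap keeps the sketch bounded
        h = np.array([x * y, x, y, x * y], dtype=float)  # hazards h_j
        a = theta * h                # reaction rates a_j = theta_j * h_j
        a0 = a.sum()
        if a0 == 0.0:
            break                    # total extinction: no further events
        tau = rng.exponential(1.0 / a0)
        if t + tau > t_max:
            # no event before t_max: survival factor exp(-a0 * (t_max - t))
            score -= h * (t_max - t)
            break
        t += tau
        j = rng.choice(4, p=a / a0)
        # each event adds d/dtheta_k [log a_j - a0*tau] = delta_jk/theta_k - h_k*tau
        score[j] += 1.0 / theta[j]
        score -= h * tau
        if j == 0:
            x += 1
        elif j == 1:
            x -= 1
        elif j == 2:
            y += 1
        else:
            y -= 1
    return (x, y), score

theta = np.array([0.01, 0.5, 1.0, 0.01])
populations, score = lotka_volterra(theta, seed=42)
print(populations, score)
```

The trajectory density factorizes into one exponential waiting-time factor and one reaction-choice factor per event, so the score accumulates the per-event derivatives delta_jk/theta_k − h_k·tau plus the derivative of the final event-free survival factor.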

For inference on a wide range in the parameter space, the different probability densities often do not overlap. As discussed above, the augmented data are then of limited use. Instead, we focus on the regime where we try to discriminate between close parameter points with similar predictions for the observables. We generate training data and evaluate the models in a narrow region of the parameter space.

Our results indicate a tradeoff between the performance in likelihood (density) estimation and likelihood ratio estimation. For density estimation, the MAF performs well. The variance of the score term in the Scandal loss degrades the performance, especially for larger values of the hyperparameter α. However, for statistical inference, the more relevant quantity is the likelihood ratio. Here, the techniques that use the joint score, in particular Scandal, are significantly more sample efficient.

### Particle Physics.

Finally, we consider a real-world problem from particle physics. A simulator describes the production of a Higgs boson at the Large Hadron Collider experiments followed by the decay into four electrons or muons, subsequent radiation patterns, the interaction with the detector elements, and the reconstruction procedure. Each recorded collision produces a single high-dimensional observable x, and the goal is to infer parameters θ that characterize potential deviations from the standard model.

The inference techniques can accommodate state-of-the-art simulators, but in that setting, we cannot compare them with the true likelihood function. We, therefore, present a simplified setup and approximate the detector response such that the true likelihood function is tractable, providing us with a ground truth with which to compare the inference techniques. As simulator, we use a combination of MadGraph 5 (64) and MadMax (65–67). The setup and the results of this experiment are described at length in ref. 48, and the code is available in ref. 68.

In Fig. 4, *Left*, we show the expected mean squared error of the approximate likelihood ratio with respect to the true likelihood ratio as a function of the size of the training sample.

All presented inference techniques outperform the traditional histogram method, provided that the training samples are sufficiently large. Using augmented data substantially decreases the amount of training data required for a good performance: the Rascal method, which uses both the joint ratio and joint score information from the simulator, reduces the amount of training data by two orders of magnitude compared with the Carl technique, which uses only the samples. The expected exclusion limits (Fig. 4, *Right*) show that the Cascal and Rascal techniques have the highest precision, leading to exclusion limits virtually indistinguishable from those based on the true likelihood ratio.

## Conclusions

In this work, we have presented a family of inference techniques for the setting in which the likelihood is only implicitly defined through a stochastic generative simulator. The methods estimate likelihoods or ratios of likelihoods with data available from the simulator. Most established inference methods, such as ABC and techniques based on NDE, only use samples of observables from the simulator. We pointed out that in many cases the joint likelihood ratio and the joint score—quantities conditioned on the latent variables that characterize the trajectory through the data generation process—can be extracted from the simulator.

While these additional quantities often require work to be extracted, they also prove to be very valuable as they can dramatically improve sample efficiency and quality of inference. Indeed, we have shown that this additional information lets us define loss functionals that are minimized by the likelihood or likelihood ratio, which can in turn be used to efficiently guide the training of neural networks to precisely estimate the likelihood function. A second class of techniques is motivated by a local approximation of the likelihood function around a reference parameter value, where the score vector is the sufficient statistic. In the case where the simulator provides the joint score, we can use it to train a precise estimator of the score and use it as locally optimal summary statistics.

We have demonstrated in three experiments that the inference techniques let us precisely estimate likelihood ratios. In turn, this enables parameter measurements with a higher precision and less training data than with established methods.

This approach is complementary to many recent advances in likelihood-free inference: the augmented data can be used to improve training for any neural density estimator or likelihood ratio estimator as we have demonstrated for MAFs and Carl. It can also be used to define locally optimal summary statistics that can be used, for instance, in ABC techniques.

Finally, these results motivate the development of tools that provide a nonstandard interpretation of the simulator code and automatically generate the joint score and joint ratio, building on recent developments in probabilistic programming and automatic differentiation (43, 44, 69–72). We have provided a proof-of-principle implementation of such a tool based on Pyro (54).

## Acknowledgments

We thank Cyril Becot and Lukas Heinrich, who contributed to this project at an early stage. We also thank Jan-Matthis Lückmann for helping us automate the calculation of the joint likelihood ratio and joint score in Pyro and all participants of the likelihood-free inference workshop at the Flatiron Institute. We thank Felix Kling, Tilman Plehn, and Peter Schichtel for providing the MadMax code and George Papamakarios for discussing the masked autoregressive flow code with us. K.C. thanks the Centre for Cosmology, Particle Physics and Phenomenology at Université Catholique de Louvain for hospitality. Finally, we thank Atılım Güneş Baydin, Lydia Brenner, Joan Bruna, Kyunghyun Cho, Michael Gill, Siavash Golkar, Ian Goodfellow, Daniela Huppenkothen, Michael Kagan, Hugo Larochelle, Yann LeCun, Fabio Maltoni, Jean-Michel Marin, Iain Murray, George Papamakarios, Duccio Pappadopulo, Dennis Prangle, Rajesh Ranganath, Dustin Tran, Rost Verkerke, Wouter Verkerke, Max Welling, and Richard Wilkinson for interesting discussions. J.B., G.L., and K.C. are grateful for the support of the Moore–Sloan data science environment at New York University (NYU). G.L. and K.C. were supported through NSF Grants ACI-1450310 and PHY-1505463. G.L. is the recipient of the ULiège-NRB Chair on Big Data and is thankful for the support of NRB. J.P. was partially supported by Scientific and Technological Center of Valparaíso Fondecyt Grant BASAL FB0821. This work was supported in part through the NYU Information Technology High Performance Computing resources, services, and staff expertise.

## Footnotes

- ^{1}To whom correspondence may be addressed. Email: johann.brehmer@nyu.edu.

Author contributions: J.B., G.L., J.P., and K.C. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Data deposition: Code for simulation and inference is available on GitHub (https://github.com/johannbrehmer/simulator-mining-example, https://github.com/johannbrehmer/goldmine, https://github.com/johannbrehmer/higgs_inference).

*We pick the algorithms that we use for these “ground truth” predictions based on the variance between independent runs and the consistency of improvements with increasing training sample size. For likelihood estimation, we use the MAF as baseline, with qualitatively similar results when using Scandal. For likelihood ratio estimation, we use the Scandal estimator as baseline and find qualitatively similar results for Cascal.

Published under the PNAS license.

## References

- M. A. Beaumont, W. Zhang, D. J. Balding
- P. Marjoram, J. Molitor, V. Plagnol, S. Tavaré
- S. A. Sisson, Y. Fan, M. M. Tanaka
- S. A. Sisson, Y. Fan, M. Beaumont
- J. Alsing, B. Wandelt, S. Feeney
- T. Charnock, G. Lavaux, B. D. Wandelt
- P. J. Diggle, R. J. Gratton
- I. J. Goodfellow et al.
- K. Cranmer, J. Pavez, G. Louppe
- K. Cranmer, G. Louppe
- G. Louppe, K. Cranmer, J. Pavez
- S. Mohamed, B. Lakshminarayanan
- M. U. Gutmann, R. Dutta, S. Kaski, J. Corander
- T. Dinev, M. U. Gutmann
- J. Hermans, V. Begy, G. Louppe
- D. Tran, R. Ranganath, D. Blei (in I. Guyon et al., Eds.)
- L. Dinh, D. Krueger, Y. Bengio
- D. Jimenez Rezende, S. Mohamed
- L. Dinh, J. Sohl-Dickstein, S. Bengio
- G. Papamakarios, T. Pavlakou, I. Murray
- C.-W. Huang, D. Krueger, A. Lacoste, A. Courville
- G. Papamakarios, D. C. Sterratt, I. Murray
- T. Q. Chen, Y. Rubanova, J. Bettencourt, D. K. Duvenaud
- D. P. Kingma, P. Dhariwal
- W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, D. Duvenaud
- M. Germain, K. Gregor, I. Murray, H. Larochelle
- B. Uria, M.-A. Côté, K. Gregor, I. Murray, H. Larochelle
- A. van den Oord et al.
- A. van den Oord et al.
- A. van den Oord, N. Kalchbrenner, K. Kavukcuoglu
- Y. Fan, D. J. Nott, S. A. Sisson
- G. Papamakarios, I. Murray (in D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, R. Garnett, Eds.)
- B. Paige, F. Wood
- R. Dutta, J. Corander, S. Kaski, M. U. Gutmann
- G. Louppe, K. Cranmer
- J.-M. Lueckmann et al.
- J.-M. Lueckmann, G. Bassetto, T. Karaletsos, J. H. Macke
- S. S. Wilks
- E. Meeds, R. Leenders, M. Welling
- M. M. Graham, A. J. Storkey
- F. Wood, J. W. van de Meent, V. Mansinghka (in S. Kaski, J. Corander, Eds.)
- T. Anh Le, A. Gunes Baydin, F. Wood (in A. Singh, J. Zhu, Eds.)
- K. Cranmer, J. Brehmer, G. Louppe
- D. S. Greenberg, M. Nonnenmacher, J. H. Macke
- J. Brehmer, K. Cranmer, G. Louppe, J. Pavez
- J. Brehmer, K. Cranmer, G. Louppe, J. Pavez
- R. J. Williams
- J. Brehmer, K. Cranmer, G. Louppe, J. Pavez
- J. Brehmer, F. Kling, I. Espejo, K. Cranmer
- J. Brehmer, S. Mishra-Sharma, J. Hermans, G. Louppe, K. Cranmer
- PPX Developers
- Participants of the Likelihood-Free Inference Meeting at the Flatiron Institute 2019
- E. Bingham et al.
- P. Baldi, K. Cranmer, T. Faucett, P. Sadowski, D. Whiteson
- J. Alsing, B. Wandelt
- J. Alsing, B. Wandelt
- A. J. Lotka
- G. Papamakarios, T. Pavlakou, I. Murray
- J. Brehmer, K. Cranmer, G. Louppe, J. Pavez
- J. Alwall et al.
- K. Cranmer, T. Plehn
- T. Plehn, P. Schichtel, D. Wiegand
- F. Kling, T. Plehn, P. Schichtel
- J. Brehmer, K. Cranmer, G. Louppe, J. Pavez
- B. Eli et al.
- D. Tran et al.
- N. Siddharth et al. (in I. Guyon et al., Eds.)

## Article Classifications

- Physical Sciences
- Statistics