# Bayesian computation via empirical likelihood

^{a}School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD 4001, Australia;^{b}Centre de Biologie pour la Gestion des Populations, Institut National de la Recherche Agronomique, 34988 Montferrier-sur-Lez Cedex, France;^{c}Université Montpellier 2, Institut de Mathématiques et de Modélisation de Montpellier, 34095 Montpellier Cedex 5, France;^{d}Institut de Biologie Computationnelle, Montpellier, France;^{e}Université Paris Dauphine, Centre de Recherche en Mathematiques de la Decision, 75775 Paris Cedex 16, France;^{f}Institut Universitaire de France, Paris, France; and^{g}Centre de Recherche en Statistique et Economie, 92245 Malakoff Cedex, France

See allHide authors and affiliations

Edited by Stephen E. Fienberg, Carnegie Mellon University, Pittsburgh, PA, and approved December 7, 2012 (received for review May 25, 2012)

## Abstract

Approximate Bayesian computation has become an essential tool for the analysis of complex stochastic models when the likelihood function is numerically unavailable. However, the well-established statistical method of empirical likelihood provides another route to such settings that bypasses simulations from the model and the choices of the approximate Bayesian computation parameters (summary statistics, distance, tolerance), while being convergent in the number of observations. Furthermore, bypassing model simulations may lead to significant time savings in complex models, for instance those found in population genetics. The Bayesian computation with empirical likelihood algorithm we develop in this paper also provides an evaluation of its own performance through an associated effective sample size. The method is illustrated using several examples, including estimation of standard distributions, time series, and population genetics models.

Bayesian statistical inference cannot easily operate when the likelihood function associated with the data is not entirely known, or cannot be computed in a manageable time, as is the case in most population genetic models (1⇓–3). The fundamental reason for this difficulty with population genetics is that the statistical model associated with coalescent data needs to integrate over trees of high complexity. Similar computational issues with the likelihood function often occur in hidden Markov and other dynamic models (4). In those settings, traditional approximation tools based on stochastic simulation (5) are unavailable or unreliable. Indeed, the complexity of the latent structure defining the likelihood makes simulation of such structures too unstable to be trusted. Such settings call for alternative and often cruder approximations. The approximate Bayesian computation (ABC) methodology (1, 6) is a popular solution that bypasses the computation of the likelihood function (surveys in refs. 7 and 8); ref. 9 validates a conditional version of ABC that applies to hierarchical Bayes models in a wide generality.

The fast and polytomous development of the ABC algorithm is indicated by the rising literature in the domain, at both the methodological and the application levels. For instance, a whole new area of population genetic modeling (8, 10) has been explored thanks to the availability of such methods. However, practitioners and theoreticians both show a reluctance in adopting ABC, with some doubt about the validity of the method (11⇓–13). We propose in this paper to supplement the ABC approach with a generic and convergent likelihood approximation called the empirical likelihood that validates this Bayesian computational technique as a convergent inferential method when the number of observations grows to infinity. The empirical likelihood perspective, introduced by ref. 14, is a robust statistical approach that does not require the specification of the likelihood function. However, although it does not appear to have been used before in the settings that now rely on ABC, this data analysis method also is a broadly (albeit not universally) applicable and often fast approach, the approximation of which differs from the one found in ABC algorithms, even though both are rooted in nonparametric statistics. Therefore, this methodology can be used both as a solution per se and as a benchmark against which to test the ABC output in many cases. This paper presents the Bayesian computation via empirical likelihood (BC_{el}) algorithm and illustrates its performances on selected representative examples, comparing the outcome with the true posterior density whenever available, and with an ABC approximation (15) otherwise.

## Statistical Methods

### ABC Algorithm.

The primary purpose of the ABC algorithm is to approximate simulation from the centerpiece of Bayesian inference, the posterior distribution , when it cannot be numerically computed but when the distributions corresponding to both the prior π and the likelihood *f* can be simulated by manageable computer devices. The original (6) ABC algorithm is as follows: given a sample **y** of observations from the sample space, a sample of parameters is produced by

**Algorithm 1:** ABC sampler.

**for** *i* = 1 **to** *M* **do**

**repeat**

Generate from the prior distribution π(⋅)

Generate **z** from the likelihood

until

set ,

**end for**

The parameters of the ABC algorithm are the summary statistic *η*, the distance , and the tolerance level . The basic justification of the ABC approximation is that, when using a sufficient statistic *η*, the distribution of the ’s in the output of the algorithm converges to the genuine posterior distribution when *ε* goes to zero (16).

In practice, however, the statistic *η* is nonsufficient and at best the approximation then converges to the genuine posterior when *ɛ* goes to zero. This loss of information seems to be a necessary price to pay for the access to computable quantities. Furthermore, as argued below, it can be evaluated against the empirical likelihood approximation when the latter is available. Indeed, this approach does not require an information reduction through the choice of a tolerance zone or of a nonsufficient summary statistic.

### Empirical Likelihood.

Owen (14) developed empirical likelihood techniques as a robust alternative to classical likelihood approaches. He demonstrated that, for some categories of statistical models, this approach inherited the convergence properties of standard likelihood at a much lower cost in assumptions about the model (as detailed in *SI Text*). Whereas ABC algorithms do require a fully defined and often complex (hence debatable) statistical model, we argue that one should take advantage of the approximation device provided by the empirical likelihood to overcome most of the calibration difficulties encountered by ABC, at least as a convenient benchmark against which to test ABC solutions.

Assume that the dataset **y** is composed of *n* independent replicates of some random vector *Y* with density *f*. Rather than defining the likelihood from the density *f* as usual, the empirical likelihood method starts by defining parameters of interest ** θ** as functionals of

*f*, for instance as moments of

*f*, and it then profiles a nonparametric likelihood. More precisely, given a set of constraints of the form

where the dimension of *h* sets the number of constraints unequivocally defining ** θ**, the empirical likelihood is defined as

for **p** in the set . For instance, in the one-dimensional case when , the empirical likelihood in *θ* is the maximum of the product under the constraint . (Solving **2** is done using the R package “emplik” developed by ref. 17 and based on the Newton–Lagrange algorithm.) When the data are not independent and identically distributed (iid), an underlying iid structure may sometimes be exploited, as illustrated in *Dynamic Model*. However, this is not always the case, meaning that the empirical likelihood method remains out of reach in some complex cases when ABC can still be implemented. Furthermore, as pointed out in *SI Text* by a quote from Owen (18), the validation of the approach depends on a choice of the set of constraints that ensures convergence.

Whereas the convergence of the empirical likelihood is well-established (*SI Text* and ref. 18), the Bayesian use of empirical likelihoods has been little examined in the past, apart from a Monte Carlo study in ref. 19, and a probabilistic one in ref. 20.

### BC_{el}.

The most natural use of the empirical likelihood approximation is to act as if this representation was an exact likelihood, as in ref. 19. Incorporating this perspective into a basic sampler leads to the following algorithm:

**Algorithm 2:** Basic BC_{el} sampler.

**for** *i* = 1 **to** *M* **do**

Generate *θ*_{i} from the prior distribution *π*(⋅)

Set the weight

**end for**

The output of BC_{el} is a sample of size *M* of parameters with associated weights, which operate as an importance sampling output (5). Thus, the performance of the algorithm can be evaluated through the effective sample size (ESS):

which approximates the size of an iid sample with the same variance as the original sample. As shown in ref. 21, this quantity is always between 1 (corresponding to a very poor outcome) and *M* (corresponding to an iid perfect outcome).

Any algorithm that samples from a posterior distribution (e.g., Markov chain Monte Carlo, population Monte Carlo, sequential Monte Carlo algorithms, ref. 5) may instead use the empirical likelihood as a proxy to the exact likelihood. For instance, to speed up the computation in the population genetics model introduced below, we resorted to the adaptive multiple importance sampling (AMIS, ref. 22), which is easy to parallelize on a multicore computer. Although the original target distribution is and the AMIS algorithm uses several (multivariate) Student’s *t* distributions denoted (i.e., with three degrees of freedom, centered at mean **m,** and with covariance matrix **Σ**) as an importance sampling distribution, the algorithm can be adapted to the empirical likelihood in a straightforward manner:

**Algorithm 3:** BC_{el}–AMIS sampler.

**for** *i* = 1 to *M* **do**

Generate from the prior distribution

Set

**end for**

**for** *t* = 2 **to** *T*_{M} **do**

Compute weighted mean **m**_{t} and weighted variance matrix of the (, ).

Denote the density of .

**for** *i* = 1 to *M* **do**

Generate from .

Set

**end for**

**for** *r* = 1 to *t* − 1 **do**

**for** *i* = 1 to *M* **do**

Update the weight of as

**end for**

**end for**

**end for**

The output is thus a weighted sample of size .

In contrast with ABC, BC_{el} algorithms do not usually require simulations from the sampling model, given that **2** provides a converging and nonparametric approximation of the likelihood function. This feature thus induces very significative improvements in computing time when the production of pseudodatasets is time-consuming, because solving **1** is usually immediate. This is for instance the case in population genetics and *Time Gains in Population Genetic Models* provides an illustration of a huge improvement in comparison with ABC in two experiments described below. However, the improvement in speed may vanish in cases when producing an iid structure connected with the constraint **1** requires simulations from the sampling model, as illustrated by a counterexample for point processes in *SI Text*, BC_{el}, and ABC, then breaking even in terms of computing time. Even though the computing time required by BC_{el} is customarily negligible compared with ABC (or does not induce any extra time as in the point process counterexample), we further caution against opposing both approaches solely based on computing times, because they differ in the approximations they provide to a genuine Bayesian analysis and thus should be used in conjunction.

Using empirical likelihoods means there is no calibration of the many tuning parameters of ABC algorithms; most significantly, the likelihood ratio acts as a natural distance and importance weights produce an implicit and self-defined quantile on the original sample simulated from the prior. Notwithstanding these appealing qualities, BC_{el} still requires calibration, in particular in the choices of the parametrization of the sampling distribution and of the corresponding constraints **1** defining the empirical likelihood. Some examples are discussed below. The BC_{el} – AMIS sampler also implies computing values of the prior density, up to a constant, which may be a hindrance in peculiar cases.

### Composite Likelihood in Population Genetics.

ABC was introduced by population geneticists (2, 6, 10) interested in statistical inference about the evolutionary history of species, as no likelihood-based approach existed apart from very rudimentary and hence unrealistic situations. This approach has been used in a number of biological studies (23⇓–25), most of them including model choice. It is therefore crucial to obtain insights into the validity of such studies, particularly when they have economic, biological, or ecological consequences (e.g., ref. 26). This can be achieved in part by running a comparison using BC_{el}. Furthermore, given the major gain in computing time due to the absence of replications of the data, BC_{el} can be applied to more complex biological models.

The main caveat when using the empirical likelihood in such settings is selecting a constraint **1** on the parameter of interest: In phylogeography, parameters like divergence dates, effective population sizes, mutation rates, etc., cannot be expressed as moments of the sampling distribution at a given locus. In particular, the data are not iid. However, when considering microsatellite loci with the stepwise mutation model (27) and evolutionary scenarios composed of divergence, we can derive the pairwise composite scores whose zero is the pairwise maximum likelihood estimator (MLE). Composite likelihoods have been proved consistent for estimating recombination rates, introducing an approximation of the dependency structure between nearby loci (28⇓⇓–31). (Also, ref. 32 gives composite likelihoods used in a likelihood-free setting.)

More specifically, we are approximating the intralocus likelihood by a product over all pairs of genes in the sample at a given locus. Assuming that denotes the allele of the *i*th gene in the sample at the *k*th locus, and that *ϕ* is the vector of parameters, then the so-called pairwise likelihood of the data at the *k*th locus, namely , is defined by

and the corresponding pairwise score function is . Pairwise score equations

provide a constraint **1** in every way comparable to the score equations that give the maximum likelihood estimator and which is quite powerful for empirical likelihood derivations (ref. 18), pp 48–50). Hence the empirical likelihood of the full dataset given *ϕ* is computed with **2** under the (multidimensional) constraint that

When the effective population size is identical over all populations of the demographic scenario, the time axis may be scaled so that coalescence of two genes in Kingman’s genealogy occurs with rate if there are *k* lineages. In this modified scale, mutations at a given locus arise with rate along the gene genealogy. Our mutation model is the simple stepwise mutation model of ref. 27; i.e., the number of repeats of the mutated gene increases or decreases by 1 unit with equal probability. Given two microsatellite allelic states *x*_{1} and *x*_{2}, their pairwise likelihood depends only on the difference of the states . If both genes belong to individuals that lie in the same deme, then (*SI Text* and ref. 33)

where . If the two genes belong to individuals from demes having diverged at time *τ*, then (33)

where denotes the *δ*th-order modified Bessel function of the first kind evaluated at *z*. Computing the pairwise scores, i.e., partial derivatives of from those equations, is straightforward by virtue of recurrence properties of the Bessel functions. Algorithm BC_{el} is therefore directly available in this setting, and furthermore at a cost much lower than the one associated with ABC algorithms (Table S1).

## Results

### Normal Distribution.

Starting with the benchmark of a normal distribution with known variance (equal to 1), we can check that the empirical likelihood allows for a proper recovery of the true posterior distribution on the mean. Fig. S1 shows that a constraint **1** based on the mean works well, as do the two constraints on mean and second central moment, (Fig. S2). On the other hand, using the first three central moments in the empirical likelihood may degrade the fit (three cases in Fig. S3). Whereas this poor fit is not signaled by the ESS (which is often larger than in Figs. S1 and S2, because of the growing disconnection between the approximation and the true likelihood and hence a more uniform range of the weights), a parallel run of the method with different collections of constraints does detect the discrepancy. This illustrates the variability of the empirical likelihood approximation, as well as its sensitivity to the choice of defining constraints. Although a drawback of the method, this variability can be tested and evaluated by comparing outcomes, due to often limited computing costs. This toy experiment also supports the generic recommendation (18) to keep the number of constraints and parameters equal.

### Quantile Distributions.

Quantile distributions are defined by a closed-form quantile function , and generally have no closed form for the density function. They are of great interest because of their flexibility and the ease with which they can be simulated by a simple inversion of the uniform distribution. A range of methods, including ABC approaches (10), has been proposed for estimation (*SI Text*). We focus here on the four-parameter *g*-and-*k* distribution, defined by its quantile function, denoted and equal to

where is the *r*th standard normal quantile; the parameters , and *k* represent location, scale, skewness, and kurtosis, respectively, and *c* measures the overall asymmetry (34, 35). We evaluated the BC_{el} algorithm for estimating this distribution using two values of , two sets of priors, and various combinations of *n*, *M*, and *p*, where *p* is the number of percentiles used as constraints (details in *SI Text*).

Fig. 1 illustrates the true and fitted curves and a 95% credible region for the case with , and . The corresponding posterior means (SDs) for the parameters were 3.08(0.14), 1.12(0.23), 1.79(0.25), and 0.41(0.12), respectively. The choice of sample size and number of constraints did not substantively affect the accuracy of parameter estimates, but the precision was noticeably improved for the larger sample size (Figs. S4–S6).

The accuracy and precision of the estimates were broadly comparable with the results obtained by ref. 36 for the same distribution. Based on the whole experiment, the parameters *A* and *B* were well estimated in all cases, whereas the estimates of *g* and *k* were poorer for smaller values of *n* and *M*. For small *n* the estimates were more subject to the vagaries of sampling variation, whereas for small *M* they were subject to the influence of a smaller number of very large importance weights. However, given the speed of BC_{el} compared with competing ABC algorithms, it is feasible to use even larger values of *M* than considered in this experiment, because there is no requirement to simulate new datasets at each iteration. Moreover, this experiment is based on the very basic case of sampling from the prior; the results would be further improved by using an analog of BC_{el}–AMIS or alternative approaches similar to those proposed by ref. 37 for ABC.

### Dynamic Models.

In dynamic models, the difficulty with empirical likelihood stems from the dependence in the data . However, these models can be represented as transforms of unobserved iid sequences . The recovery of a converging empirical likelihood representation thus requires the reconstitution of the ’s as transforms of the data **y** and of the parameter *θ*. Independence between the ’s is then at least as important as moment conditions. (This implies equivalent computing times for ABC and BC_{el}.)

For instance, consider a simple dynamic model, namely the autoregressive conditionally heteroskedastic 1 [Arch(1)] model:

with a uniform prior over the simplex, i.e., , . Whereas this model can be handled by other means, because the likelihood function is available we will compare here the behavior of ABC and BC_{el} algorithms.

First, a natural empirical likelihood representation is based on the reconstituted ’s, defined as when the ’s are derived recursively. Fig. 2 shows the result of estimating both parameters and when *Algorithm ABC* uses as summary statistics either the least-square estimates of the parameters [derived from the series ], which we label “optimal ABC” in connection with ref. 38, or the mean of the series supplemented by the first two autocorrelations of the series . The constraints in the empirical likelihood are either based on the first three moments of the reconstituted ’s or on the variance of those ’s complemented by both the correlations between the ’s and the ’s and between the ’s and the ’s. As seen from this experiment, BC_{el} does as well as the optimal ABC for the estimation of the parameters, but further brings a reduction in the variability of those estimates, thanks to the importance weights.

A much more complex dynamic model is the generalized ARCH (GARCH) model of ref. 39 that can be formalized as the observation of when

under the constraints and , that is, . Given the constraints on the parameters, a natural priority is to choose an exponential distribution on , for instance an exponential distribution, and a Dirichlet on . An ABC approach requires the choice of summary statistics, which are necessarily nonsufficient because the model is a state-space model. Following ref. 38, we use the MLE as summary statistics, relying on the R function garch for its derivation despite its lack of stability. Because ref. 40 derived natural score constraints for the empirical likelihood associated with this model, we used their constraints to build our BC_{el} algorithm. Fig. 3 provides a comparison of both approaches with the MLE. It shows in particular that the ABC algorithm is unable to produce acceptable inference in this case, even in the most favorable case when it is initialized at a satisfactory maximum likelihood estimate (as shown by the bottom row). The BC_{el} algorithm is performing better, even though it fails to catch the correct range of .

Another type of non-iid model relying on the superposition of an unknown number of gamma point processes and processed in ref. 41 through a (non-Bayesian) alternative to ABC is discussed in *SI Text* as an additional illustration of the possibilities of the empirical likelihood perspective for complex models, offering a free benchmark for evaluating the ABC outcome. Fig. S7 shows a clear improvement brought by BC_{el} over the corresponding ABC outcome.

### Population Genetics.

We compare our proposal with the reliable ABC-based estimates given by ref. 3. We set up two toy experiments that are designed to defeat ABC, using pseudoobserved data. The two evolutionary scenarios are given in Fig. 4. In all experiments, we only consider microsatellite loci and assume that the effective population size is identical over all populations of the scenario.

In the first experiment, we consider two populations which diverged at time *τ* in the past (Fig. 4, *Left*). Our pseudoobserved datasets are made of 30 diploid individuals per population genotyped at 100 independent loci. We compare the marginal posterior distributions of the unknown parameters *θ* and *τ* computed with the ABC method [using the Do It Yourself ABC (DIYABC) software of ref. 42] and with the BC_{el}–AMIS sampler. In this case, results are improved when the *θ* component of the composite scores, namely , is restricted to the sum over all pairs of genes lying in the same population. Otherwise, as can be checked via a quick simulation experiment, BC_{el} systematically underestimates *θ*. Fig. 5 shows the typical discrepancy between both results: ABC and BC_{el} agree on the mutation rate *θ*, but the BC_{el} estimation of *τ* is more accurate (Table 1).

In the second experiment, we consider three populations (Fig. 4, *Right*): the last two populations diverged at time and their common ancestral population diverged from the first population at time . The sample comprises 30 diploid individuals per population genotyped at 100 independent loci. In contrast with the first experiment, all components of the composite scores are computed here by summing over all pairs of genes whatever the population to which they belong. The results given in Table 1 show that ABC and BC_{el} mainly agree on both parameters *θ* and , but BC_{el} is slightly more accurate than ABC on .

Table S1 gives a comparison of the computing times for both algorithms, showing the difference of magnitudes between them. This is due to the simulation of the simulated datasets for ABC: Although this difference should not be overinterpreted, it signals a potential for self-assessment and testing that is missing for ABC methods.

## Discussion

Compared with ABC methods, the (often) significant time savings provided by BC_{el} due to the lack of pseudosample simulation may open wider ranges for processing models involving complex likelihoods. For instance, in population genetics, ABC is severely hindered by the time spent simulating a dataset when modeling isolation by distance in a continuously distributed population, or when studying a large set of SNP markers even on quite simple evolution scenarios. Moreover, when the dataset is composed of large sets of markers, the summary statistics proposed in ABC (in DIYABC, these are averages of some quantitative statistics over all loci) ignores some (statistical) information, whereas BC_{el} manages to recover most of it, more specifically to estimate divergence on large datasets. Improvements in accuracy of estimation and computational efficiency are also possible in other contexts as illustrated in the range of examples given above.

Even when BC_{el} requires the same computing time as ABC, it uses the outcome in a very different perspective and provides a benchmark likelihood that helps in evaluating the pertinence of the ABC approximation, as illustrated in the gamma point process of SI.

We acknowledge a caveat of the empirical likelihood: it requires a careful choice of the constraint **1**. Those pivotal quantities have to be connected to the parameter in an identifying way, which may require complex manipulations as in the gamma process case or may even be impossible. However, repeated experimentation is often available, as illustrated by the normal example and the population genetic experiments (where we computed the composite score on both a restricted set of pairs and all pairs of genes). Checking for the accuracy of the approximation means that a constraint in BC_{el} should be tested on simulated datasets in controlled experiments where the true parameters are known, although much less than in ABC runs. Then we can test coverage of credibility intervals, and measure the error of various point estimates based on the output of the scheme.

## Acknowledgments

P.P. and C.P.R. thank Jean-Marie Cornuet for his help and availability. C.R.P. is grateful to Patrice Bertail, Chris Drovandi, Brunero Liseo, and Art Owen for useful discussions. Comments and suggestions from the Editor and Reviewers greatly contributed to improve both the presentation and scope of the paper. P.P. and C.P.R. were partly supported by the Agence Nationale de la Recherche through the 2009–2012 Project Emile.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. E-mail: Christian.Robert{at}ceremade.dauphine.fr.

Author contributions: C.P.R. designed research; K.L.M., P.P., and C.P.R. performed research; K.L.M., P.P., and C.P.R. analyzed data; and K.L.M., P.P., and C.P.R. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1208827110/-/DCSupplemental.

## References

- ↵
- ↵
- ↵
- Cornuet JM,
- et al.

- ↵
- Cappé O,
- Moulines E,
- Rydén T

- ↵
- Robert C,
- Casella G

- ↵
- ↵
- ↵
- ↵
- ↵
- Marjoram P,
- Molitor J,
- Plagnol V,
- Tavaré S

- ↵
- Templeton AR

- ↵
- Berger JO,
- Fienberg SE,
- Raftery AE,
- Robert CP

- ↵
- Robert CP,
- Cornuet JM,
- Marin JM,
- Pillai NS

- ↵
- Owen AB

- ↵
- ↵
- Biau G,
- Cérou F,
- Guyader A

*New Insights into Approximate Bayesian Computation*. Tech Rep HAL 00721164. Available at http://hal.inria.fr/hal-00721164/en. - ↵
- Zhou M

*R package version 0.9-8-2*. - ↵
- Owen AB

- ↵
- Lazar NA

- ↵
- Schennach SM

- ↵
- Liu J

- ↵
- ↵
- ↵
- ↵
- Fagundes NJ,
- et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Barthelmé S,
- Chopin N

- ↵
- ↵
- ↵
- Gilchrist W

- ↵
- ↵
- ↵
- Fearnhead P,
- Prangle D

- ↵
- ↵
- Chan N,
- Ling S

- ↵
- Cox D,
- Kartsonaki C

- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Statistics