## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Searching for efficient Markov chain Monte Carlo proposal kernels

Edited by John P. Huelsenbeck, University of California, Berkeley, CA, and accepted by the Editorial Board October 16, 2013 (received for review June 21, 2013)

## Significance

Bayesian statistics is widely used in various branches of sciences; its main computational method is the Markov chain Monte Carlo (MCMC) algorithm, which is used to simulate a sample on the computer, on which all Bayesian inference is based. The efficiency of the algorithm determines how long one should run the computer program to obtain estimates with acceptable precision. We compare different simulation algorithms and propose a unique class of algorithms (called Bactrian algorithms), which are found to be >50% more efficient than current algorithms. Our tests suggest that the improvement is generic and may apply to all MCMC algorithms developed in various applications of Bayesian inference.

## Abstract

Markov chain Monte Carlo (MCMC) or the Metropolis–Hastings algorithm is a simulation algorithm that has made modern Bayesian statistical inference possible. Nevertheless, the efficiency of different Metropolis–Hastings proposal kernels has rarely been studied except for the Gaussian proposal. Here we propose a unique class of Bactrian kernels, which avoid proposing values that are very close to the current value, and compare their efficiency with a number of proposals for simulating different target distributions, with efficiency measured by the asymptotic variance of a parameter estimate. The uniform kernel is found to be more efficient than the Gaussian kernel, whereas the Bactrian kernel is even better. When optimal scales are used for both, the Bactrian kernel is at least 50% more efficient than the Gaussian. Implementation in a Bayesian program for molecular clock dating confirms the general applicability of our results to generic MCMC algorithms. Our results refute a previous claim that all proposals had nearly identical performance and will prompt further research into efficient MCMC proposals.

Markov chain Monte Carlo (MCMC) algorithms can be used to simulate a probability distribution π(*x*) that is known only up to a factor, that is, with only known; they are especially important in Bayesian inference where π is the posterior distribution. In a Metropolis–Hastings (MH) algorithm (1, 2), a proposal density *q*(*y*|*x*), with *x*, *y* ∈ χ, is used to generate a new state *y* given the current state *x*. The proposal is accepted with probability α(*x*, *y*). If the proposal is accepted, the new state becomes *y*; otherwise it stays at *x*. The algorithm generates a discrete-time Markov chain with state space χ and transition law *P* having transition probability densityThe acceptance probability α is chosen so that the detailed balance condition is satisfied: π(*x*)*p*(*x*, *y*) = π(*y*)*p*(*y*, *x*), for all *x*, *y* ∈ χ. The MH choice of α is

The proposal kernel *q*(*y*|*x*) can be very general; as long as it specifies an irreducible aperiodic Markov chain, the algorithm will generate a reversible Markov chain with stationary distribution π. Here we assume that χ is a subset of ℝ^{k} and both π(*y*) and *q*(*y*|*x*) are densities on χ.

Given a sample *x*_{1}, *x*_{2}, …, *x*_{n} simulated from *P*, the expectation of any function *f*(*x*) over πcan be approximated by the time average over the sampleThis is an unbiased estimate of *I*, and converges to *I* according to the central limit theorem, independent of the initial state *x*_{0} (ref. 3, p. 99). The asymptotic variance of iswhere *V*_{f} = var_{π}{*f*(*X*)} is the variance of *f*(*X*) over π, *ρ*_{k} = corr{*f*(*x*_{i}), *f*(*x*_{i} _{+} _{k})} is the lag-*k* autocorrelation (ref. 4, pp. 87–92), and the variance ratio *E* = *V*_{f}/ν is the efficiency: an MCMC sample of size *N* is as informative about *I* as an independent sample of size *NE*. Thus, *NE* is known as the effective sample size.

Given π and *q*, the MH choice of [**2**] maximizes α(*x*, *y*), minimizes ν of Eq. **5** and is optimal (5). The choice of *q* given π is less clear. The independent sampler *q*(*y*|*x*) = π(*y*) has *E* = 1, but is often impossible or impractical to sample from. The choice of scale of the Gaussian kernel applied to the Gaussian target, in both univariate and multivariate cases, has been studied (6, 7). For example, the optimal scale σ for the 1D kernel *y*|*x* ∼ N(*x*, σ^{2}) applied to the *N*(0, 1) target is 2.5, in which case 44% of proposals are accepted (6), with *P*_{jump} = 0.44. Note that

However, there has been little research into alternative kernels. In this paper, we examine the mixing efficiency of a number of proposal kernels applied to several representative targets. We focus on 1D proposals. We develop a Bactrian kernel, which uses a mixture of two Gaussian distributions and which has the overall shape of a Bactrian camel, to reduce the positive autocorrelation (ρ_{1} in Eq. **5**). The proposal density iswhere 0 ≤ *m* < 1 controls how spiky the two humps are (Fig. 1*A*); this has mean *x* and variance σ^{2}. We also consider mixtures of two triangle or two Laplace distributions (Fig. 1 *B* and *C* and *SI Appendix*, *Text S1*).

We compare the efficiency of the Bactrian as well as many other proposals for simulating several target distributions. We demonstrate the general nature of our results with a major MCMC application in Bayesian molecular-clock dating of species divergences.

## Results

### Mixing Efficiency for Simple Targets.

Fig. 2*A* shows the performance of three proposal kernels applied to estimate the mean of the target *N*(0, 1). First, the scale (σ) of each kernel has a great impact on the performance of the kernel. The Gaussian kernel *x*′|*x* ∼ N(*x*, σ^{2}) achieves highest efficiency at σ = 2.5, in which case *E* = 0.23 (6). In other words, if the target is N(μ, τ^{2}), the optimal proposal should be overdispersed with SD 2.5τ. At this optimal scale, one has to take an MCMC sample of size *N*/0.23 to achieve the accuracy of an independent sample of size *N*. Second, the different kernels have different performance. At the optimal scales, the uniform kernel is 21% more efficient than the Gaussian (0.28/0.23 = 1.21), and the Bactrian kernel is 65% more efficient than the Gaussian (0.38/0.23 = 1.65; Table 1). The Bactrian kernel uses a mixture of two single-moded distributions and avoids values very close to the current value (Fig. 1 and *SI Appendix*, *Text S1*). To propose something new, propose something different.

Because the efficiency *E* (Eq. **5**) may depend on the target π or the function *f*, we have included in our evaluation nine different proposals combined with five targets (*SI Appendix*, Figs. S1 and S3). The efficiency *E* is calculated directly using the transition matrix *P* on a discretized parameter space (*Materials and Methods*). The five targets are *N*(0, 1), a mixture of two normals, a mixture of two *t*_{4} distributions, the gamma, and the uniform, all having unit variance (Fig. 3). For the gamma and uniform targets, reflection is used to propose new states in the MCMC and to calculate the proposal density (*SI Appendix*, *Text S2* and Fig. S2). For all targets except the uniform, every proposal is optimal at an intermediate scale (σ), achieved when the acceptance proportion is *P*_{jump} ∼ 30% for the Bactrian kernels (*m* = 0.95) and ∼40% for the other six kernels (*SI Appendix*, Fig. S3 and Table 1).

When each proposal is used at the optimal scale, the relative performance of the kernels does not vary among targets: the uniform kernel is more efficient than the Gaussian, whereas the Bactrian kernels are the best for all targets examined (Table 1 and *SI Appendix*, Table S1). For the two-normals target, the Bactrian kernel (with *m* ≥ 0.95) is nearly twice as efficient as the Gaussian (Table 1). The Bactrian kernel is better if the two modes are far apart (e.g., when *m* is close to 1). However, if *m* is too close to 1, efficiency drops off quickly with the increase of σ. We use *m* = 0.95 because it appears to strike the right balance, whereby the optimal *P*_{jump} ∼ 0.3. With this *m*, the Bactrian kernels have better performance than the uniform and Gaussian for almost the whole range of *P*_{jump} for the *N*(0, 1) target (Fig. 2*B*).

The uniform target is revealing for understanding the differences of the kernels (*SI Appendix*, Table S1). Due to reflection, all proposals considered here will become the independent sampler with efficiency *E* ∼ 1, if σ → ∞. The uniform kernel achieves best performance at σ = 3, in which case *E* = 1.52. In other words, if the target is *U*(0, 1), a uniform sliding window of width 3 is optimal, and is 52% more efficient than sampling directly from *U*(0, 1). This “superefficiency” is because the samples from the Markov chain tend to be negatively correlated, like antithetic variables. For this target, the Bactrian kernels (with *m* ≥ 0.95) are 4–20× more efficient than the independent sampler or the Gaussian kernel.

The uniform may not be representative of posteriors in real applications. For the other four targets, the Bactrian kernels are ∼50–100% more efficient than the Gaussian kernel (Table 1 and *SI Appendix*, Table S1). Fig. 4 shows the transition matrix *P* for the optimal Gaussian (σ = 2.5) and Bactrian (*m* = 0.95, σ = 2.3) kernels applied to the *N*(0, 1) target. The two kernels are very different. With the Gaussian the next step *x*′ tends to be around the current value *x*, leaning toward the mode at 0. With the Bactrian the next step *x*′ has a good chance of being in the interval (–1, 1) if *x* is outside of it, but if *x* is around 1 (or –1), *x*′ tends to be around –1 (or 1).

### Alternative Measure of Efficiency.

A different measure of mixing efficiency (8) is = *E*(*X*_{i} – *X*_{i} _{– 1})^{2}, where *X*_{i} are the sampled values from the Markov chain; this always ranks the proposals (both different scales for the same kernel and different kernels) the same as does *E* (Table 1 and *SI Appendix*, Table S1 and Fig. S3). This identical ranking may be expected because *E*(*X*_{i} – *X*_{i} _{– 1})^{2} = 2var(*X*)(1 – ρ_{1}), where ρ_{1} is the lag-one autocorrelation of Eq. **5**. may be useful when searching for efficient kernels in an MCMC algorithm because it is easier to calculate than *E*, although it does not have the nice interpretation of *E*.

### Rate of Convergence.

We also examined the rate of convergence, measured by δ_{8}, the difference between the distribution reached after eight steps and the steady-state distribution (*Materials and Methods*). For the uniform and Gaussian kernels, the optimal σ for fast convergence is slightly greater than for efficient mixing, and one should take larger steps during the burn-in than after (*SI Appendix*, Fig. S3). For the Bactrian kernels, optimal σ for fast convergence is about the same as that for mixing. Because the burn-in constitutes a small portion of the computational effort, it appears to be adequate to optimize σ for fast mixing only. It is generally understood that a small λ_{2} (the second largest eigenvalue of *P*) is associated with efficient mixing, and a smaller |λ_{2}| (the second largest eigenvalue of *P* in absolute value) is associated with fast convergence (9). Nevertheless, we found that λ_{2} and |λ_{2}| are too crude to be used to choose the optimal scale for a given kernel (*SI Appendix*, Fig. S3) or to rank different kernels (Table 1 and *SI Appendix*, Table S1 and Fig. S3).

### Application to Bayesian Phylogenetics.

To confirm the general applicability of our results, we implement the major proposal kernels in the MCMCtree program for molecular clock dating (10, 11), and use it to date divergences among six primate species. The tree with fossil calibrations is shown in Fig. 5. The large genomic dataset is analyzed under the HKY + Γ_{5} model (12, 13). There are eight parameters: the five times *t*_{1} − *t*_{5} (Fig. 5), the evolutionary rate μ, the ratio of transition to transversion rates (κ), and the shape parameter (α) of the gamma distribution of rates among sites. Because the likelihood depends on the products of times and rate but not on times and rate separately, the model is not fully identifiable. As a result, the posterior for *t*_{1} − *t*_{5} and μ involves considerable uncertainties, whereas κ and α are estimated with extremely high precision (10, 14).

All proposals generate the same posterior but their efficiencies differ (Fig. 5 and Table 2). For *t*_{j} and μ, the relative efficiencies of the proposals are very similar to those seen earlier for simple targets (Table 1 and *SI Appendix*, Table S1), with the uniform being better than the Gaussian and the Bactrian better than both. The results confirm the general nature of the results found for simple targets. Efficiency is, however, very low for κ and α, especially for the Bactrian kernels, which is due to the use of the same scale to update both parameters in the MCMCtree program. The data are far more informative about κ than about α, so that the same scale is too large for κ and too small for α. Because we apply the proposals on the logarithms of κ and α, a relevant scale is the ratio of posterior credibility interval width to the posterior mean (Note that if *y* = log(*x*), then *σ*_{y} ∼ = *σ*_{x}/*x*); this is 1.3% for κ and 8.9% for α (Table 2). Ideally, the scale for the κ move should be 8.9/1.3 = 6.8× as large as that for α; this is not pursued here, because both parameters are estimated with high precision, and the small variations have virtually no effect on the posterior of times and rate, but we note the danger of using one scale for parameters of different (relative) posterior precisions.

## Discussion

### Automatic Scale Adjustment.

We have evaluated different kernels at nearly optimal scales. In real applications, calculating *E* at different σs (Eq. **5**) may be too expensive. Instead we can adjust the scale σ by using the realized *P*_{jump}, which typically has a monotonic relationship with σ. The *N*(0, 1) target is of particular interest because in large datasets the posterior will be approximately Gaussian. With this target,for the Gaussian kernel (6). Thus, given the current σ and *P*_{jump}, the optimal scale iswhere the optimal ∼ 0.44.

Similarly, we havefor the uniform kernel andwhere *a* = and *b* = for the Bactrian kernel (*SI Appendix*, *Text S3* and Fig. 6). Eqs. **10** and **11** can be used for automatic scale adjustment for those two kernels through a linear search or look-up table. The *P*_{jump} vs. σ curves, however, have similar shapes for all of the kernels over a wide range of *P*_{jump} (Fig. 6), so that Eq. **9** can be used effectively for all kernels. We use ∼ 0.4 for the uniform, triangle, Laplace, and Gaussian kernels and ∼0.3 for the three Bactrian kernels in the MCMCtree program, to apply four rounds of automatic scale adjustment during the burn-in. The strategy appears effective and is simpler than adaptive Monte Carlo algorithms (8).

### In Search of Efficient MCMC Proposals.

We suspect that the Bactrian kernel can be improved further, because it is based on the simple intuition to reduce the autocorrelation. We also note that alternatives to MH algorithms have been suggested, such as the slice sampler (15) and Langevin diffusion, which requires derivatives of the posterior density (16). The efficiency of good MH algorithms relative to the MH alternatives will be interesting topics for future research.

We have here focused on 1D proposals, and the case of multidimensional proposals is yet to be studied. Nevertheless, 1D proposals may be the most important. Note that one may use a 1D kernel to change many parameters along a curve in the multidimensional space to accommodate strong correlations in the posterior, as in our clock-dating algorithm. There has been much interest in the optimal choice of scale for Gaussian kernels, especially in infinite dimensions (7, 8). In contrast, virtually no study has examined alternative kernels; this appears to be due to the influence of ref. 6, which claimed that different kernels had nearly identical performance. This conclusion is incorrect. Note that even the uniform is more efficient than the Gaussian and also much cheaper to simulate. We hope that our results will motivate research into efficient MCMC proposals.

## Materials and Methods

### Calculation of MCMC Mixing Efficiency.

We used the initial positive sequence method (17) to calculate the asymptotic variance ν of Eq. **5**. More efficiently, for the simple target, we calculated ν from the transition matrix *P* on a discretized state space directly (6). We use *K* = 500 bins over the range (*x*_{L}, *x*_{U}) = (–5, 5), with each bin having the width Δ = (*x*_{U} – *x*_{L})/*K*, and with bin *i* represented by *x*_{i} = *x*_{L} + (*i* – 1/2)Δ. The steady-state distribution on the discrete space is then *π*_{i} = π(*x*_{i})Δ, *i* = 1, 2, …, *K*, renormalized to sum to 1. The *K* × *K* transition matrix *P* = {*p*_{ij}} is constructed to mimic the MCMC algorithm

with *p*_{ii} = . We then have

Let *f* = (*f*_{1}, *f*_{2}, …, *f*_{K})^{T}, with *f*_{i} = *f*(*x*_{i}). Define *A* = {*a*_{ij}}, with *a*_{ij} = *π*_{j}, to be the “limiting matrix,” *Z* = [*I* – (*P* – *A*)]^{–1} to be the “fundamental matrix” (ref. 18, p. 74), and *B* = diag{π_{1}, π_{2}, …, *π*_{K}}. Then, Eq. **5** becomes

(5, ref. 18, p. 84); this can equivalently be calculated using the spectral decomposition of *P*, as

where 1 = λ_{1} > λ_{2} ≥ λ_{3} ≥ … *λ*_{K} > –1 are the eigenvalues of *P* and columns of *E* are the corresponding right eigenvectors, normalized so that *E*^{T}*BE* = *I* (9, 19, 20). We calculate the eigenvalues and eigenvectors of *P* as follows. Define *T* = *B*^{1/2}*PB*^{–1/2} to be a symmetrical matrix (so that its eigenvalues are all real and can be calculated using standard algorithms). Let *T* = *RΛR*^{T}, where Λ = diag{λ_{1}, λ_{2}, …, *λ*_{K}}, and *R*^{T} = *R*^{–1}. Then

with *E* = *B*^{–1/2}*R*.

Thus, we calculated the efficiency using MCMC simulation on the continuous space (Eq. **5**) and Eqs. **14** and **15** in the discretized space.

The rate of convergence of the chain is often measured by |λ|_{2} = , the second-largest eigenvalue of *P* in absolute value (9). For a more precise measure, we used

where is the *ij*th element of *P*^{n}. We use *n* = 8, with *P*^{8} calculated using the spectral decomposition of *P* or using three rounds of matrix squaring. The maximum is typically achieved with the initial state in the tail (*i* = 0 or *K* – 1).

All computations are achieved using an ANSI C program written by Z.Y. and available at http://abacus.gene.ucl.ac.uk/software/MCMCjump.html.

### Proposals and Targets Evaluated.

We evaluate nine different MCMC proposal kernels (*SI Appendix*, Fig. S1): they are (*i*) uniform sliding window, (*ii*) triangle proposal, (*iii*) Laplace proposal, (*iv*) Gaussian sliding window, (*v*) *t*_{4} (*t* distribution with df = 4) sliding window, (*vi*) Cauchy sliding window, and (*vii–ix*) three Bactrian proposals based on the Gaussian, triangle, and Laplace kernels. Note that all proposal densities except the Cauchy have mean *x* and variance σ^{2}. The density functions for all kernels are given in *SI Appendix*, *Text S1* and Fig. S1.

We use five target densities (Fig. 3): (*i*) Gaussian *N*(0, 1); (*ii*) mixture of two Gaussian distributions, , with mean and variance 1; (*iii*) mixture of two *t*_{4} distributions: , with *s* = ; (*iv*) gamma distribution *G*(4, 2) with mean 2 and variance 1; and (*v*) uniform . All target densities have variance 1. With the gamma and uniform targets, the variable *x* is bounded, but most proposal kernels considered in this study have support over (−∞, ∞). Thus, reflection is necessary. For the gamma, if the proposed value *x*′ < 0, it is simply reset to –*x*′. For the uniform, we implemented reflection algorithms to calculate the proposal density and to sample from it (*SI Appendix*, *Text S2*).

Each kernel is implemented using ∼30 different scales (σ) to identity the optimum. Thus, with five targets and nine kernels, we constructed ∼1,350 Markov transition matrices and MCMC algorithms.

### Application to Molecular Clock Dating.

We implemented the major proposal kernels in the MCMCtree program for dating species divergences (10, 11), and applied it to date divergences among six primate species under the molecular clock (Fig. 5). The sequence data consist of the first and second codon positions of 14,631 protein-coding genes, with 8,708,584 base pairs in the sequence alignment. The data were analyzed previously by dos Reis and Yang (14) and dos Reis et al. (21). The likelihood is calculated under the HKY + Γ_{5} model (12, 13). There are eight parameters: the five times or node ages *t*_{1} − *t*_{5}, the rate μ, the ratio (κ) of the transition to transversion rates, and the shape parameter (α) of the gamma distribution of rates among sites. The prior on times *t*_{1} − *t*_{5} incorporates calibration information based on the fossil record (21). The rate is assigned the prior μ ∼ *G*(2, 80), with mean 0.025 or 2.5 × 10^{−8} changes per site per year. Two gamma priors are assigned on the substitution parameters: κ ∼ *G*(2, 0.5) with mean 4, and α ∼ *G*(2, 20) with mean 0.1.

Each iteration of the MCMC algorithm consists of four steps. The first three update (*i*) the times *t*_{j}, (*ii*) the rate μ, and (*iii*) parameters κ and α. Step *iv* multiplies the times *t*_{j} and divides the rate μ by the same random variable *c* that is close to 1; this is a 1D move and accommodates the positive correlation among *t*_{1} − *t*_{5} and the negative correlation between *t*_{1} − *t*_{5} and μ. All four proposals are proportional scalers, equivalent to apply a sliding window (which can be any of the proposal kernels evaluated in this paper) on the logarithm of the parameter. The scale for each proposal is adjusted automatically to achieve optimal *P*_{jump}. We use a burn-in of 4,000 iterations to estimate *P*_{jump} and apply four rounds of automatic scale adjustments (Eq. **9**), setting the target *P*_{jump} to 0.4 for the uniform, triangle, Laplace, and Gaussian kernels and to 0.3 for the three Bactrian kernels. After the burn-in, the chain is run for 40,000 iterations, and the collected samples are used to calculate the asymptotic variance of parameter estimates (Eq. **5**) (17).

## Acknowledgments

Z.Y. was supported by a Biotechnology and Biological Sciences Research Council Grant (BB/K000896/1) and a Royal Society Wolfson Merit Award.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. E-mail: z.yang{at}ucl.ac.uk.

Author contributions: Z.Y. designed research; Z.Y. and C.E.R. performed research; and Z.Y. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. J.P.H. is a guest editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1311790110/-/DCSupplemental.

## References

- ↵
- ↵
- Hastings WK

- ↵
- Chung KL

- ↵
- Diggle PJ

- ↵
- Peskun PH

- ↵
- Bernardo JM,
- et al.

- Gelman A,
- Roberts GO,
- Gilks WR

- ↵
- ↵
- Brooks S,
- et al.

- Rosenthal JS

- ↵
- Barone P,
- Frigessi A,
- Piccioni M

- Green PJ,
- Han XL

- ↵
- Yang Z,
- Rannala B

- ↵
- Yang Z

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Kemeny JG,
- Snell JL

- ↵Sokal AD (1989)
*Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms*(Université de Lausanne, Bâtiment des Sciences de Physique, Lausanne, Switzerland). - ↵
- ↵
- dos Reis M,
- et al.

- ↵
- Inoue J,
- Donoghue PCH,
- Yang Z

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Statistics

- Biological Sciences
- Population Biology

## Jump to section

## You May Also be Interested in

*Top Left:*Image credit: Dikka Research Project.

*Top Right:*Image credit: Alem Abreha (photographer).

*Bottom:*Image credit: Dikka Research Project.