New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Efficient Monte Carlo sampling by parallel marginalization

Communicated by Alexandre J. Chorin, University of California, Berkeley, CA, June 8, 2007 (received for review January 12, 2007)
Abstract
Markov chain Monte Carlo sampling methods often suffer from long correlation times. Consequently, these methods must be run for many steps to generate an independent sample. In this paper, a method is proposed to overcome this difficulty. The method utilizes information from rapidly equilibrating coarse Markov chains that sample marginal distributions of the full system. This is accomplished through exchanges between the full chain and the auxiliary coarse chains. Results of numerical tests on the bridge sampling and filtering/smoothing problems for a stochastic differential equation are presented.
To understand the behavior of a physical system, it is often necessary to generate samples from complicated high dimensional distributions. The usual tools for sampling from these distributions are Markov chain Monte Carlo (MCMC) methods by which one constructs a Markov chain whose trajectory averages converge to averages with respect to the distribution of interest. For some simple systems, it is possible to construct Markov chains with independent values at each step. In general, however, spatial correlations in the system of interest result in long correlation times in the Markov chain and hence slow convergence of the chain's trajectory averages. Here, a method is proposed to alleviate the difficulties caused by spatial correlations in high dimensional systems. The method, parallel marginalization, is tested on two stochastic differential equation conditional path sampling problems.
Parallel marginalization takes advantage of the shorter correlation lengths present in marginal distributions of the target density. Auxiliary Markov chains that sample approximate marginal distributions are evolved simultaneously with the Markov chain that samples the distribution of interest. By swapping their configurations, these auxiliary chains pass information between themselves and with the chain sampling the original distribution. As shown below, these swaps are made in a manner consistent with both the original distributions and the approximate marginal distributions. The numerical examples indicate that improvement in efficiency of parallel marginalization over standard MCMC techniques can be significant.
The design of efficient methods to approximate marginal distributions was addressed by Chorin (1) and Stinis (2). The use of Monte Carlo updates on coarse subsets of variables is not a new concept (see ref. 3 and the references therein). The method presented in ref. 3 does not use marginal distributions. However, attempts have been made previously to use marginal distributions to accelerate the convergence of MCMC (see refs. 4 and 5). In contrast to parallel marginalization, the methods proposed in refs. 4 and 5 do not preserve the distribution of the full system and therefore are not guaranteed to converge. The parallel construction used here is motivated by the parallel tempering method (see ref. 6), and allows efficient comparison of the auxiliary chains and the original chain. See refs. 6 and 7 for expositions of standard MCMC methods.
Parallel marginalization for problems in Euclidean state spaces is described in detail below. In the final sections, the conditional path sampling problem is described and numerical results are presented for the bridge sampling and smoothing/filtering problems.
Parallel Marginalization
For the purposes of the discussion in this section, we assume that appropriate approximate marginal distributions are available. As discussed in a later section, they may be provided by coarse models of the physical problem as in the examples below, or they may be calculated via the methods in refs. 1 and 2.
Assume that the d _{0} dimensional system of interest has a probability density, π_{0}(x _{0}), where x _{0} ∈ ℝ^{d0}. Suppose further that, by the Metropolis–Hastings or any other method (see ref. 6), we can construct a Markov chain, Y _{0} ^{n} ∈ ℝ^{d0}, which has π_{0} as its stationary measure. That is, for two points x _{0}, y _{0} ∈ ℝ^{d0} where T _{0}(x _{0} → y _{0}) is the probability density of a move to {Y _{0} ^{n+1} = y _{0}} given that {Y _{0} ^{n} = x _{0}}. Here, n is the algorithmic step. Under appropriate conditions (see ref. 6), averages over a trajectory of {Y _{0} ^{n}} will converge to averages over π_{0}, i.e. for an objective function g(x _{0}) The size of the error in the above limit decreases as the rate of decay of the time autocorrelation increases. In this formula, Y _{0} ^{0} is assumed to be drawn from π_{0}.
It is well known that judicious elimination of variables by renormalization can reduce long range spatial correlations (for example, see refs. 8 and 9). The variables are removed by averaging out their effects on the full distribution. If the original density is π(x̂, x̂) and we wish to remove the x̃ variables, the distribution of the remaining x̂ variables is given by the marginal density (see refs. 1 and 6) The full distribution can be factored as where π(x̃x̂) is the conditional density of x̃ given x̂. Because they exhibit shorter correlation lengths, the marginal distributions are useful in the acceleration of MCMC methods.
With this in mind, we consider a collection of lower dimensional Markov chains Y_{i} ^{n} ∈ ℝ ^{di} , which have stationary distributions π _{i} (x_{i} ) where d _{0} > ⋯ > d_{i} . For each i ≤ L, let T_{i} be the transition probability density of Y_{i} ^{n} , i.e. T_{i} (x_{i} → y_{i} ) is the probability density of {Y _{i} ^{n+1} = y_{i} } given that {Y_{i} ^{n} = x_{i} }. The {π _{i} } are approximate marginal distributions. For example, divide the x_{i} variables into two subsets, x̂_{i} ∈ ℝ^{di+1} and x̃_{i} ∈ ℝ^{di−di+1}, so that x_{i} = (x̂_{i} , x̃_{i} ). The x̃_{i} variables represent the variables of x_{i} that are removed by marginalization, i.e.,
After arranging these chains in parallel, we have the larger process The probability density of a move to {Y ^{n+1} = y} given that {Y ^{n} = x} for x, y ∈ ℝ^{d0} × ⋯ × ℝ ^{dL} is given by Because the stationary distribution of Y ^{n} is
The next step in the construction is to allow interactions between the chains {Y_{i} ^{n} } and to thereby pass information from the rapidly equilibrating chains on the lower dimensional spaces (large i) down to the chain on the original space (i = 0). This is accomplished by swap moves. In a swap move between levels i and i + 1, we take a d _{i+1} dimensional subset, x̂_{i} , of the x_{i} variables and exchange them with the x _{i+1} variables. The remaining d _{i} − d _{i+1} x̃_{i} variables are resampled from the conditional distribution π _{i} (x̃_{i} x _{i+1}). For the full chain, this swap takes the form of a move from {Y ^{n} = x} to {Y ^{n+1} = y} where and The ellipses represent components of Y ^{n} that remain unchanged in the transition and ỹ _{i} is drawn from π _{i} (x̃_{i} x _{i+1}).
If these swaps are undertaken unconditionally, the resulting chain will equilibrate rapidly, but will not, in general, preserve the product distribution Π. To remedy this, we introduce the swap acceptance probability In this formula, π̄ _{i} is the function on ℝ^{di+1} resulting from marginalization of π _{i} as in Eq. 1 . Given that {Y ^{n} = x}, the probability density of {Y ^{n+1} = y}, after the proposal and either acceptance with probability A _{i} or rejection with probability 1 − A _{i}, of a swap move, is given by for x, y ∈ d _{0} × ⋯ × ℝ ^{dL} . δ is the Dirac delta function.
We have the following lemma.
Lemma 1.
The transition probabilities S_{i} satisfy the detailed balance condition for the measure Π, i.e., where x, y ∈ ℝ^{d0} × ⋯ × ℝ^{dL}.
The detailed balance condition stipulates that the probability of observing a transition x → y is equal to that of observing a transition y → x and guarantees that the resulting Markov chain preserves the distribution ∏. Therefore, under general conditions, averages over a trajectory of {Y ^{n}} will converge to averages over ∏. Because we can calculate averages over π_{0} by taking averages over the trajectories of the first d _{0} components of Y ^{n}.
“Exact” Approximation of Acceptance Probability
Note that formula 3 for A _{i} requires the evaluation of π̃ _{i} at the points x̂_{i} , x _{i+1} ∈ ℝ^{di+1}. Although the approximation of π̃ _{i} by functions on ℝ^{di+1} is in general a very difficult problem, its evaluation at a single point is often not terribly demanding. In fact, in many cases, including the examples in this paper, the x̂_{i} variables can be chosen so that the remaining x̃_{i} variables are conditionally independent given x̂_{i} .
Despite these mitigating factors, the requirement that we evaluate π̄ _{i} before we accept any swap is a little onerous. Fortunately, and somewhat surprisingly, this requirement is not necessary. In fact, standard strategies for approximating the point values of the marginals yield Markov chains that themselves preserve the target measure. Thus, even a poor estimate of the ratio appearing in Eq. 3 can give rise to a method that is exact in the sense that the resulting Markov chain will asymptotically sample the target measure.
To illustrate this point, we consider the following example of a swap move. Assume that the current position of the chain is {Y ^{n} = x} where The following steps will result in either {Y ^{n+1} = x} or {Y ^{n+1} = y} where and ỹ _{i} ∈ ℝ^{di−di+1}.

Let v ^{0} = x̃_{i} and let v ^{j} ∈ ℝ^{di−di+1} for j = 1, …, M − 1 be independent samples from p _{i}(·x̂_{i} ), where p _{i}(·x̂_{i} ) is a reference density conditioned by x̂_{i} . For example, p _{i}(·x̂_{i} ) could be a Gaussian approximation of π _{i} (x̃_{i} x̂_{i} ). How p _{i} is chosen depends on the problem at hand (see numerical examples below). In general, p _{i}(·x̂_{i} ) should be easily evaluated and independently sampled, and it should “cover” π _{i} (·x̂_{i} ) in the sense that areas of ℝ ^{di} where π _{i} (·x̂_{i} ) is not negligible should be contained in areas where p _{i}(·x̂_{i} ) is not negligible.

Let u ^{j} ∈ ℝ^{di−di+1} for j = 0, …, M − 1 be independent random variables sampled from p _{i}(·x _{i+1}) (recall that we are considering a swap of x̂_{i} and x _{i+1}, which live in the same space). Note, that the {u ^{j}} variables depend on x _{i+1}, whereas the {v ^{j}} variables depend on x̂_{i} .

Define the weights The choice of p _{i} made above affects the variance of these weights, and therefore the variance of the acceptance probability below.

Choose ỹ _{i} from among the {u ^{j}} according to the multinomial distribution with probabilities Note that ỹ _{i} is an approximate sample from π _{i} (·x _{i+1}).

Set with probability and with probability 1 − A_{i} ^{M} .
The transition probability density for the above swap move from x → y for x, y ∈ ℝ^{d0} × ⋯ × ℝ ^{dL} is given by where and δ is again the Dirac delta function. In other words, S_{i} ^{M} dictates that the Markov chain accepts the swap with probability R and rejects it with probability 1 − R.
Although the preceding swap move corresponds to a method for approximating the ratio appearing in the formula for A _{i} above, it also has some similarities with the multipletry Metropolis method presented in ref. 10, which uses multiple suggestion samples to improve acceptance rates of standard MCMC methods. The following lemma is suggested by results in ref. 10.
Lemma 2.
The transition probabilities S_{i} ^{M} satisfy the detailed balance condition for the measure ∏.
As before, the detailed balance condition guarantees that averages over trajectories of the first d _{0} dimensions of Y ^{n} will converge to averages over π_{0}.
The A_{i} ^{M} contain an approximation to the ratio of marginals in 3 where E _{pi} denotes expectation with respect to the density p _{i}. When 0 < E _{pi} [w_{v} ^{j} {x̂_{i} = x̂_{i} }] < ∞, the convergence above follows from the strong law of large numbers and the fact that For small values of M in 4 , calculation of the swap acceptance probabilities is very cheap. However, higher values of M may improve the acceptance rates. For example, if the {π _{i} }_{i>0} are exact marginals of π_{0}, then A _{i} ≡ 1, whereas A_{i} ^{M} ≤ 1. Results similar to Lemma 2 hold when more general approximations replace the one given above; for example, when the {u ^{j}} and {v ^{j}} are generated by a Metropolis–Hastings rule. In practice, one has to balance the speed of evaluating A_{i} ^{M} for small M with the possible higher acceptance rates for M large.
It is easy to see that a Markov chain, which evolves only by swap moves, cannot sample all configurations. These swap moves must therefore be used in conjunction with a transition rule that can reach any region of space, such as T from expression 2. More precisely, T should be ∏irreducible and aperiodic (see ref. 11). The the transition rule for parallel marginalization is where and α ∈ [0, 1) is the probability that a swap move occurs. P dictates that, with probability α, the chain attempts a swap move between levels I and I + 1, where I is a random variable chosen uniformly from {0, …, L − 1}. Next, each level of the chain evolves independently according to the {T_{i} }. With probability 1 − α, the chain does not attempt a swap move, but does evolve each level. The next result follows trivially from Lemma 2 and guarantees the invariance of ∏ under evolution by P.
Theorem 1.
The transition probability P satisfies the detailed balance condition for the measure Π, i.e., where x, y ∈ ℝ^{d0} × ⋯ × ℝ ^{dN} .
Thus, by combining standard MCMC steps on each component governed by the transition probability T, with swap steps between the components governed by S, an MCMC method results that not only uses information from rapidly equilibrating lower dimensional chains, but is also convergent.
Numerical Example 1: Bridge Path Sampling
In the bridge path sampling problem, we wish to approximate conditional expectations of the form where s ∈ (0, T) and {Z ^{t}} is the real valued processes given by the solution of the stochastic differential equation g, f, and σ are real valued functions of ℝ. Of course, we can also consider functions g of more than one time. This problem arises, for example, in financial volatility estimation. Because, in general, we cannot sample paths of 5 , we must first approximate {Z ^{t}} by a discrete process for which the path density is readily available. Let t _{0} = 0, t _{1} = T/K, …, t _{K} = T be a mesh on which we wish to calculate path averages. One such approximate process is given by the linearly implicit Euler scheme (a balanced implicit method, see ref. 12), The {ξ^{k}} are independent Gaussian random variables with mean 0 and variance 1, and Δ = T/K. K is assumed to be a power of 2. The choice of this scheme over the Euler scheme (see ref. 13) is due to its favorable stability properties, as explained later. Without the condition X^{tK} = Z ^{T} above, generating samples of (X ^{0}, …, X^{tK} ) is a relatively straightforward endeavor. One simply generates a sample of Z ^{0}, then evolves the system with this initial condition. However, the presence of information about {Z ^{t}}_{t>0} complicates the task. In general, a sampling method that requires only knowlege of a function proportional to conditional density of (X ^{t1}, …, X ^{t k −1}) must be applied. The approximate path density associated with discretization 6 is where
At this point, we wish to apply the parallel marginalization sampling procedure to the density π_{0}. However, as indicated above, a prerequisite for the use of parallel marginalization is the ability to estimate marginal densities. In some important problems, homogeneities in the underlying system yield simplifications in the calculation of these densities by the methods in refs. 1 and 2. These calculations can be carried out before implementation of parallel marginalization, or they can be integrated into the sampling procedure.
In some cases, the numerical estimation of the {π _{i} }_{i>0} can be completely avoided. The examples presented here are two such cases. Let S _{i} = {0, 2^{i}, 3(2^{i}), 4(2^{i}), …, K}. Decompose S _{i} as Ŝ _{i} ⊔ S̃ _{i} where and In the notation of the previous sections, x_{i} = (x̂_{i} , x̃_{i} ), where x̂_{i} = {x_{i} ^{tk} }_{k∈Ŝi\{0,K}} and x̃_{i} = {x_{i} ^{tk} }_{k∈S̃i}. In words, the hat and tilde variables represent alternating time slices of the path. For all i, fix x_{i} ^{0} = z ^{−} and x_{i} ^{tK} = z ^{+}. We choose the approximate marginal densities where for each i, q _{i} is defined by successive coarsenings of 6 . That is, Because π _{i} will be sampled using a Metropolis–Hastings method with x ^{0} and x^{tK} fixed, knowlege of the normalization constants is unnecessary.
Note from
7
that, conditioned on the values of x^{tk−1}
and x
^{tk+1}, the variance of x^{tK}
is of order Δ. Thus, any perturbation of x^{tK}
that leaves x^{tj}
fixed for j ≠ k and is compatible with joint distribution
7
must be of the order
In this numerical example, bridge paths are sampled between time 0 and time 10 for a diffusion in a doublewell potential The left and right end points are chosen as z ^{−} = z ^{+} = 0. Δ = 2^{−10}. Y_{i} ^{n} ∈ ℝ^{10/(2iΔ)+1} is the ith level of the parallel marginalization Markov chain at algorithmic time n. There are 10 chains (L = 9 in expression 2). The observed swap acceptance rates are reported in Table 1. Let Y_{mid} ^{n} ∈ ℝ denote the midpoint of the path defined by Y _{0} ^{n} (i.e., an approximate sample of the path at time 5). In Fig. 1, the autocorrelation of Y_{mid} ^{n} is compared to that of a standard Metropolis–Hastings rule. In Fig. 1, the time scale of the autocorrelation for the Metropolis–Hastings method has been scaled by a factor of 1/10 to more than account for the extra computational time required per iteration of parallel marginalization. The relaxation time of the parallel chain is clearly reduced. In these numerical examples, the algorithm in the previous section is applied with a slight simplification. First generate M independent Gaussian random paths {ζ^{j}(t _{k})}_{k∈S̃i} with independent components ζ^{j}(t _{k}) of mean 0 and variance 2^{i−1}Δ. For each j and k ∈ S̃ _{i}, let If in step 4, ỹ _{i} = u ^{j}*, then in step 1 we set v ^{0} = x̃_{i} and for each k ∈ S̃_{i} All other steps remain the same. This change yields a slightly faster though less generally applicable swap step that also preserves the density ∏. Note that this modification implies that the reference density p _{i} is given by For this problem, the choice of M in 4 , the number of samples of {u ^{j}} and {v ^{j}}, seems to have little effect on the swap acceptance rates. In the numerical experiment M = i + 1 for swaps between levels i and i + 1.
Numerical Example 2: Nonlinear Smoothing/Filtering
In the nonlinear smoothing and filtering problem, we wish to approximate conditional expectations of the form where s ∈ (0, T) and the real valued processes {Z ^{t}} and {H ^{j}} are given by the system g, f, σ, and r are real valued functions of ℝ. The {X ^{j}} are real valued independent random variable drawn from the density μ and are independent of the Brownian motion {W ^{t}}. {s _{j}} ⊂ {t _{j}}, and 0 = s _{0} < s _{1} < ⋯ < s _{J} = T. The process Z ^{t} is a hidden signal, and the {H ^{j}} are noisy observations.
Again, the system must first be discretized. The linearly implicit Euler scheme gives The {ξ^{k}} are independent Gaussian random variables with mean 0 and variance 1, and Δ = T/K. The {ξ^{k}} are independent of the {X ^{j}}. K is again assumed to be a power of 2.
The approximate path measure for this problem is The approximate marginals are chosen as where V, q _{i}, and S _{i} are as defined in the previous section.
In this example, samples of the smoothed path are generated between time 0 and time 10 for the same diffusion in a doublewell potential. The densities μ and ρ are chosen as The observation times are s _{0} = 0, s _{1} = 1, …, s _{10} = 10 with H ^{j} = −1 for j = 0, …, 5 and H ^{j} = 1 for j = 6, …, 10. Δ = 2^{−10}. There are eight chains (L = 7 in expression 2). The observed swap acceptance rates are reported in Table 1. Again, Y_{mid} ^{n} ∈ ℝ denotes the midpoint of the path defined by Y _{0} ^{n} (i.e., an approximate sample of the path at time 5). In Fig. 2, the autocorrelation of Y_{mid} ^{n} is compared to that of a standard Metropolis–Hastings rule. Fig. 2 has been adjusted as in the previous example. The relaxation time of the parallel chain is again clearly reduced. The algorithm is modified as in the previous example. For this problem, acceptable swap rates require a higher choice of M in 4 than needed in the bridge sampling problem. In this numerical experiment M = 2^{i} for swaps between levels i and i + 1.
Conclusion
A MCMC method has been proposed and applied to two conditional path sampling problems for stochastic differential equations. Numerical results indicate that this method, parallel marginalization, can have a dramatically reduced equilibration time when compared to standard MCMC methods.
Note that parallel marginalization should not be viewed as a standalone method. Other acceleration techniques, such as hybrid Monte Carlo, can and should be implemented at each level within the parallel marginalization framework. As indicated by the smoothing problem, the acceptance probabilities at coarser levels can become small. The remedy for this is the development of more accurate approximate marginal distributions by, for example, the methods in refs. 1 and 2.
Acknowledgments
I thank Prof. A. Chorin for his guidance during this research, which was carried out during my Ph.D. studies at the University of California, Berkeley, and Dr. P. Okunev, Dr. P. Stinis, and the referees for their very helpful comments. This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract DEAC0376SF00098 and National Science Foundation Grant DMS0410110.
Footnotes
 ^{†}Email: weare{at}math.berkeley.edu

Author contributions: J.W. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The author declares no conflict of interest.
 Abbreviation:
 MCMC,
 Markov chain Monte Carlo.
 © 2007 by The National Academy of Sciences of the USA
References
 ↵
 ↵
 ↵

↵
 Brandt A ,
 Ron D

↵
 Okunev P

↵
 Liu J

↵
 Binder K ,
 Heermann D

↵
 Binney J ,
 Dowrick N ,
 Fisher A ,
 Newman M

↵
 Kadanoff L
 ↵
 ↵
 ↵

↵
 Kloeden P ,
 Platen E