Mixed Markov models

Communicated by David Mumford, Brown University, Providence, RI, March 30, 2003 (received for review July 21, 2002)
Abstract
Markov random fields can encode complex probabilistic relationships involving multiple variables and admit efficient procedures for probabilistic inference. However, from a knowledge engineering point of view, these models suffer from a serious limitation. The graph of a Markov field must connect all pairs of variables that are conditionally dependent even for a single choice of values of the other variables. This makes it hard to encode interactions that occur only in a certain context and are absent in all others. Furthermore, the requirement that two variables be connected unless always conditionally independent may lead to excessively dense graphs, obscuring the independencies present among the variables and leading to computationally prohibitive inference algorithms. Mumford [Mumford, D. (1996) in ICIAM 95, eds. Kirchgassner, K., Mahrenholtz, O. & Mennicken, R. (Akademie Verlag, Berlin), pp. 233–256] proposed an alternative modeling framework where the graph need not be rigid and completely determined a priori. Mixed Markov models contain node-valued random variables that, when instantiated, augment the graph by a set of transient edges. A single joint probability distribution relates the values of regular and node-valued variables. In this article, we study the analytical and computational properties of mixed Markov models. In particular, we show that positive mixed models have a local Markov property that is equivalent to their global factorization. We also describe a computationally efficient procedure for answering probabilistic queries in mixed Markov models.
Graphical models such as Markov random fields (1) and Bayesian networks (2) are powerful tools for representing complex multivariate distributions using the adjacency structure of a graph. A Markov field is a probability distribution on an undirected graph whose edges connect those variables that are directly dependent, i.e., remain dependent even after all other variables have been instantiated.
Specifying a Markov field requires identifying in advance and connecting every pair of variables that could ever be conditionally dependent, even for a single choice of values of the other variables. This may lead to graphs that are excessively dense, hiding potentially relevant independencies from the human interpreter and rendering intractable those inference algorithms that are specified automatically once the structure of the graph is determined (3). As a simple example of this limitation, consider the following gene regulatory network (4).
The protein product pA of gene A induces expression of gene B. Compound C, when present, modifies pA. The modified protein is unable to regulate B directly but can induce expression of gene D. Genes B and D are corepressive (pB inhibits expression of D; pD inhibits expression of B).
Denote by X_{A}, X_{B}, and X_{D} the expression levels of the corresponding three genes, quantized into two levels, 1 (on) or 0 (off). Another binary variable X_{C} will indicate whether compound C is present.
Due to the stochastic nature of gene expression, the relationships “upregulates” or “inhibits” are not absolute (4). For example, it is possible for gene D not to be expressed even when A is upregulated and C is present. It is natural, then, to represent the interactions in this network as a stochastic process. Let us determine its graph. Suppose we learn that X_{A} = 1, i.e., gene A is expressed. Would this help predict, say, X_{D} if we already knew the states of the other two variables? The answer, of course, depends on the actual value of X_{C}. If X_{C} = 0 (compound C is absent), X_{A} affects X_{D} only indirectly, through X_{B}. In other words, given X_{B}, X_{D} is independent of X_{A} (see Fig. 1 Left). If, however, X_{C} = 1 (C is present), the states of X_{A} and X_{B} are both predictive of that of X_{D} (see Fig. 1 Right).
This behavior, in which the value of a third variable determines whether or not two variables are directly dependent, is difficult to capture within the limits imposed by the rigid topology of a Markov field. In fact, it is easy to verify that every pair of variables in this example is directly dependent, i.e., P(x_1 | x_2, x_3, x_4) ≠ P(x_1 | x_3, x_4)† for any permutation (X_1, X_2, X_3, X_4) of (X_A, X_B, X_C, X_D) and at least one choice of values x_1 through x_4. Thus the graph of this network, when modeled as a Markov field, is fully connected!
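The dependence switch described above can be checked numerically. The sketch below uses a toy parameterization (the soft-constraint form and the weight 3.0 are assumptions, not values from this article) and verifies by brute-force normalization that X_D is independent of X_A given X_B when X_C = 0, but not when X_C = 1.

```python
# Hypothetical soft-constraint weights for the gene network; the numeric
# strength 3.0 is an illustrative assumption, not a value from the paper.
def weight(xA, xB, xC, xD):
    w = 1.0
    if xC == 0:                    # compound C absent: pA induces B
        w *= 3.0 if xB == xA else 1.0
    else:                          # C present: modified pA induces D instead
        w *= 3.0 if xD == xA else 1.0
    w *= 3.0 if xB != xD else 1.0  # B and D corepress each other
    return w

def cond_pD(xA, xB, xC):
    """P(X_D = 1 | X_A, X_B, X_C) by brute-force normalization over X_D."""
    num = weight(xA, xB, xC, 1)
    den = sum(weight(xA, xB, xC, d) for d in (0, 1))
    return num / den

# With C absent, knowing X_A adds nothing once X_B is known ...
assert abs(cond_pD(0, 1, 0) - cond_pD(1, 1, 0)) < 1e-12
# ... but with C present, X_A remains directly predictive of X_D.
assert abs(cond_pD(0, 1, 1) - cond_pD(1, 1, 1)) > 1e-3
```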
The simple example above motivates us to consider graphical models in which the Markov boundary of a variable is at least in part random, determined by instantiating certain variables. These “context” variables have the ability to create direct dependencies between other variables when assigned certain values and destroy them when assigned others. A single joint probability distribution will relate the values of the “regular” and “context” variables. This is the basic setup of mixed Markov models (5, 6). We describe them more formally in the following section.
Definition
Let G = (V, E) be an undirected graph and X a random vector indexed by the vertices in V.‡ There are two types of variables in X, corresponding to two types of nodes in V: the random variables in X_I = (X_i)_{i∈I} are the standard variables found in Markov fields; they may encode, for example, pixel intensities, degrees of belief, or gene expression levels. In this article, we assume they are finite (hence discrete). The X_i's are called regular variables, and the vertices in I regular nodes.
The variables in X_{A}, on the other hand, are not real valued; rather, they are pointers to regular nodes and take values in the set I ∪ {nil}. The X_{a}'s are called address variables or pointers and the vertices in A address nodes. Note that by definition, address variables cannot have address nodes as values; no pointers to pointers are allowed. On the other hand, there is no restriction on the graph topology: the edges in G can link any two nodes.
We will define the new model the same way Markov fields are defined, i.e., as a probability distribution that is factorizable into a product of local terms. The difference will come when defining what “local” means, in the presence of address variables.
Let 𝒞 denote the set of cliques§ in G and C a clique in 𝒞. Note that cliques may contain both regular and address nodes.
Definition 2.1: A nonnegative function V_C on X is a mixed interaction function if, for any configurations x and y, V_C(x) = V_C(y) whenever x_C = y_C and x_{x_C} = y_{y_C}, where x_{x_C} denotes the vector of states of the regular variables pointed to from within C. Thus V_C depends only on the values of (i) variables within C, and (ii) variables pointed to by an address variable within C. Mixed interaction functions can be thought of as having extra slots in their argument list, up to the number of address nodes in the corresponding clique, and filling them in by "pulling back" the values of the regular variables pointed to from within that clique (see Fig. 2). We can now define the new model as follows.
Definition 2.2: A probability distribution P on a graph G = (I ∪ A, E) is called a mixed Markov model if P is factorizable into a product of mixed interaction functions over the cliques, P(x) = ∏_{C ∈ 𝒞} V_C(x_C, x_{x_C}), [1] where x_{x_C} is the vector of states of those regular variables pointed to from within C. Note that in the absence of address variables, Eq. 1 defines a classical Markov field with neighbor potential {log V_C: C ∈ 𝒞} (7).
The gene regulatory network above has a natural representation as a mixed Markov model. Redefine X_C to be an address variable whose value indicates which gene is being directly regulated by gene A. The range of X_C is {B, D} (see Fig. 3). The joint distribution is given by a product of mixed interaction functions over the maximal cliques, which subsume all others. The first factor, V_{AC}, encodes the constraint that x_{x_C} = 1 when x_A = 1; the second, V_{BD}, enforces x_B = 0 if and only if x_D = 1. Either constraint may be soft, of course.
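Eq. 1 and the pull-back mechanism can be sketched concretely for this network. The potentials below are hypothetical soft versions of the two constraints (the weight 3.0 is an assumption); the point is that V_AC receives the value of whichever regular variable X_C points to:

```python
from itertools import product

POINTER_RANGE = ("B", "D")  # X_C points at the gene that A directly regulates

def V_AC(xA, pulled):
    """Soft constraint: when A is on, the pointed-to gene tends to be on.
    `pulled` is x_{x_C}, the value pulled back along the pointer."""
    if xA == 1:
        return 3.0 if pulled == 1 else 1.0
    return 1.0

def V_BD(xB, xD):
    """Soft corepression: B and D prefer opposite states."""
    return 3.0 if xB != xD else 1.0

def joint():
    """Normalized joint over (x_A, x_B, x_D, x_C), per the factorization of Eq. 1."""
    table = {}
    for xA, xB, xD in product((0, 1), repeat=3):
        vals = {"A": xA, "B": xB, "D": xD}
        for xC in POINTER_RANGE:
            table[(xA, xB, xD, xC)] = V_AC(xA, vals[xC]) * V_BD(xB, xD)
    Z = sum(table.values())
    return {k: v / Z for k, v in table.items()}

P = joint()
assert abs(sum(P.values()) - 1.0) < 1e-9
# With A on and X_C pointing at B, configurations with B on are favored.
assert P[(1, 1, 0, "B")] > P[(1, 0, 1, "B")]
```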
Markov Property
While mixed models are more flexible than Markov fields from a knowledge engineering point of view, it is unclear whether they admit efficient procedures for probabilistic inference. To that end, we will establish in this section the Markov property of mixed models and its equivalence to their Gibbs characterization (Eq. 1). We will then use this equivalence to prove the computational feasibility of stochastic relaxation algorithms such as Gibbs sampling for inference in mixed models.
In Markov fields, the term Markov property refers to the fact that the conditional distribution of any variable given all others (its local characteristic) is a local quantity, i.e., a function of only the neighbors of that variable (7). For example, in a Markov chain, the local characteristic of the present is a function only of the immediate future and immediate past. The equivalence of the Markov property and the Gibbs characterization for positive Markov fields over arbitrary undirected graphs was first established by Hammersley and Clifford (ref. 9; see also ref. 10). The following result generalizes the Hammersley–Clifford theorem to mixed Markov models.
Theorem 1. A positive distribution P is a mixed Markov model with graph G if and only if

(i) the local characteristic of a regular variable X_i, P(x_i | x_{V\i}), [2] is a function only of

1. X_i's neighbors,
2. variables pointed to by address variables among X_i's neighbors,
3. variables pointing to i, and their neighbors, and
4. variables pointed to by address variables among those in 3 (see Fig. 4a); and

(ii) the normalized local characteristic of an address variable X_a, P(x_a | x_{V\a}) / P(X_a = nil | x_{V\a}), [3] is a function only of

1. X_a's neighbors, and
2. variables pointed to by X_a and the address variables among X_a's neighbors (see Fig. 4b).

The normalization in Eq. 3 is necessary because the conditional distribution P(x_a | x_{V\a}) itself may not be a local quantity. Consider, for example, a simple mixed model with two regular variables and one address variable and a graph as shown in Fig. 5. The joint distribution is a product of mixed interaction functions over the cliques of that graph. The local characteristic P(X_a = i | x_i, x_j) depends on x_j as well as x_i, whereas the ratio P(X_a = i | x_i, x_j)/P(X_a = nil | x_i, x_j) does not. Fortunately, the locality of the normalized local characteristic is sufficient to make probabilistic inference based on Gibbs sampling tractable, as we show in the next section.
The only-if part of Theorem 1 reveals the mechanism by which address variables enable context-specific relationships in a mixed model. Consider what happens to the graph when an address variable is instantiated.
Theorem 2. Let P be a mixed Markov model. The conditional distribution P(X_{V\a} | X_a = x_a) is again a mixed model, with a graph derived from that of P by replacing each edge ⟨u, a⟩ incident to node a with the edge ⟨u, x_a⟩.
Assigning a value to an address variable X_a causes the edges incident to it to be pushed along the transient directed edge (a, x_a) (see Fig. 6). Consequently, the overall connectivity of the graph does not increase: if the graph of a mixed model is sparse to begin with, it will remain sparse, and thus interpretable and computable, after some, or all, of the address variables are instantiated. Recall, for example, the regulatory network described in the Introduction. When the joint distribution is modeled as a Markov field, its graph is fully connected (see Fig. 7 Upper Left). When address variable X_C is assigned a value, say, B, the graph of the conditional distribution P(X_A, X_B, X_D | X_C = B) is, according to a generic conditioning algorithm for Markov fields,¶ a fully connected graph on the remaining variables (Fig. 7 Upper Right).
On the other hand, suppose the joint distribution is represented with a mixed model (Fig. 7 Lower Left). Theorem 2 tells us how to obtain the graph of the conditional P(X_A, X_B, X_D | X_C = B) (Fig. 7 Lower Right). We know this distribution is a Markov field because the only address variable has been instantiated. Thus we have found a sparser graph that respects the distribution P(X_A, X_B, X_D | X_C = B).
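Theorem 2's edge-pushing rule is mechanical enough to sketch in a few lines. The edge set below is a hypothetical stand-in for the mixed-model graph of the gene network, not the exact edge set of Fig. 7:

```python
def condition_on_address(edges, a, xa):
    """Theorem 2's graph transformation (sketch): instantiate address node `a`
    to regular node `xa` by pushing every edge incident to `a` onto `xa`,
    then dropping `a` from the graph."""
    new_edges = set()
    for u, v in edges:
        if u == a:
            u = xa
        if v == a:
            v = xa
        if u != v:                       # drop self-loops created by the push
            new_edges.add(tuple(sorted((u, v))))
    return new_edges

# Hypothetical mixed-model graph: address node C linked to A, B, and D.
edges = {("A", "C"), ("B", "C"), ("C", "D"), ("B", "D"), ("A", "B")}
cond = condition_on_address(edges, "C", "B")
# Edges once incident to C now attach to B; the graph stays sparse.
assert cond == {("A", "B"), ("B", "D")}
```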
It is instructive to compare mixed models with mixture models such as Bayesian multinets (10). A mixed model can be regarded as a mixture of Markov fields, with the address variables acting as latent variables: P(x_I) = Σ_{x_A} P(x_A) P(x_I | x_A). Note that the mixture components of a mixed model need not have the same graph. In this respect, mixed models are similar to Bayesian multinets. However, the mixture components of a multinet may have arbitrary, unrelated graphs. In contrast, the graphs of the mixture components of a mixed model are all related and are obtained from the graph of the joint distribution by the algorithm in Theorem 2. This compromise permits the existence of context-dependent relationships while simplifying the modeling and machine-learning tasks.
Probabilistic Inference
Once a graphical model is defined or estimated, a procedure must be identified for obtaining quantities of interest, e.g., posterior marginals on the unobserved variables, highly likely states, or the probability of the observed data. In Markov fields, and hence in mixed models as well, exact computation of these quantities is in general intractable (11). An alternative is to seek approximate solutions by a stochastic sampling procedure such as the Gibbs sampler (12) below.
Let P be a mixed model (or a Markov field, or a Bayes network) with a graph G = (V, E). Let s: {1, ..., |V|} → V be an enumeration of the nodes (often referred to as a visiting scheme∥). Let v = s(1) be the first node in the visiting scheme. Select an initial configuration x.

1. Sample from the local characteristic P(X_v | x_{V\v}).
2. If y_v is the state drawn, set x_v ← y_v.
3. Set v to the next vertex in the visiting scheme, or to s(1) if the current node is last in the visiting scheme.
4. Return to step 1.
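The loop above can be sketched generically. The two-spin model used to exercise it is an illustrative assumption (a minimal Markov field with one coupling strength J), not an example from this article:

```python
import math
import random

def gibbs_sweep(x, scheme, local_sample, rng):
    """One full pass of the visiting scheme: resample each variable in turn
    from its local characteristic given the current state of all others."""
    for v in scheme:
        x[v] = local_sample(v, x, rng)
    return x

# Toy model (an assumption, for illustration): two ±1 spins coupled by
# P(x0, x1) ∝ exp(J * x0 * x1), so each local characteristic is a sigmoid.
J = 1.0

def local_sample(v, x, rng):
    other = x[1 - v]
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * J * other))  # P(X_v = +1 | other)
    return 1 if rng.random() < p_plus else -1

rng = random.Random(0)
x = {0: 1, 1: -1}
agree = 0
for _ in range(4000):
    gibbs_sweep(x, (0, 1), local_sample, rng)
    agree += int(x[0] == x[1])
# The coupled spins agree most of the time under the stationary distribution.
assert agree > 2400
```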
By simulating a Markov chain that converges to the joint distribution P, the Gibbs sampler gives approximate marginals of P or, by clamping observed variables to their observed values, approximate posterior marginals on the unobserved variables. Simple modifications of the Gibbs sampler allow computing local and, at least in theory, global maxima of P and its conditionals [Besag's iterated conditional mode (ICM; ref. 13) and simulated annealing (12), respectively]. It is not clear, however, whether in the mixed model the sampling from a local characteristic required in step 1 is computationally feasible, the way it is in a Markov field or a Bayes network. After all, as we noted before, if X_a is an address variable, P(x_a | x_{V\a}) may not be a local quantity.
As it turns out, there is a way around this complication. Since X_a takes values in a subset of I ∪ {nil}, we can think of the local characteristic P(x_a | x_{V\a}) as a vector with at most |I| + 1 components, each corresponding to a different state of X_a. To sample from the local characteristic, it suffices to know this vector up to a multiplicative constant. When this constant is 1/P(X_a = nil | x_{V\a}), the resulting ratios P(X_a = i | x_{V\a}) / P(X_a = nil | x_{V\a}) are in fact computable by Theorem 1. Assuming that each ratio can be evaluated in constant time, the computational complexity of sampling from the local characteristic of an address variable is O(|I|).
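A sketch of that sampling step, assuming the ratios r_i = P(X_a = i | x_{V\a}) / P(X_a = nil | x_{V\a}) have already been evaluated via Theorem 1 (the numeric ratio values below are hypothetical):

```python
import random

def sample_address(ratios, rng=random.Random(0)):
    """Draw a state for an address variable from its local characteristic,
    given only the ratios r_i = P(X_a = i | rest) / P(X_a = nil | rest).
    The nil state has ratio 1 by definition; normalization is implicit."""
    states = ["nil"] + list(ratios)
    weights = [1.0] + [ratios[i] for i in ratios]
    u = rng.random() * sum(weights)          # inverse-CDF draw, cost O(|I|)
    acc = 0.0
    for s, w in zip(states, weights):
        acc += w
        if u <= acc:
            return s
    return states[-1]

# Hypothetical ratios for pointers to regular nodes 1..3.
draws = [sample_address({1: 2.0, 2: 0.5, 3: 1.5}) for _ in range(5000)]
# State 1 (largest ratio) is drawn most often; nil is drawn with ratio 1.
assert draws.count(1) > draws.count(3) > draws.count(2)
```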
Examples
The flexibility of mixed Markov models is particularly useful for modeling global structures such as templates (5) within Markov fields. A template replaces the local statistics at its location with statistics specific to that template. As a simple example, consider n independent, identically distributed Bernoulli random variables X_1, ..., X_n labeled by nodes 1 through n. Let P(X_i = 1) = p, 1 ≤ i ≤ n. Now imagine a template A that, when placed at location i, changes the probability of 1 at that location from p to q. Let a random variable X_A, taking values in I = {1, ..., n}, specify the location of the template.
When modeled as a Markov field, the joint distribution of (X_1, ..., X_n, X_A) has a star-shaped graph (see Fig. 8 Left). The fact that the X_i's are independent given the template's location is easily inferred from the graph.
Now suppose there are two templates, their locations specified by a pair of random variables (X_A, X_B). Allow the templates to interact spatially, attracting or repelling each other or perhaps exhibiting a more complex pattern of interaction, encoded in a joint distribution P(X_A, X_B). Let us further assume that the templates force the same values on the X_i's at their locations; i.e., if X_A = i and X_B = j, then, with high probability, X_i = X_j. For example, if A and B are left- and right-eye templates, we would probably want to force the pixel colors at their locations in the image to be the same.
Now if we represent the joint distribution of (X_1, ..., X_n, X_A, X_B) as a Markov field, we cannot do any better than a fully connected graph. In fact, X_i clearly depends directly on both X_A and X_B; furthermore, X_i and X_j are conditionally dependent on each other because P(x_i | x_j, x_A, x_B) ≠ P(x_i | x_A, x_B) [4] for at least one choice of values. But in fact both sides of Eq. 4 are almost always equal: X_i ⊥ X_j | (X_A, X_B) unless X_A = i and X_B = j, or X_A = j and X_B = i. This independence is completely lost in the graph in Fig. 8 Center. On the other hand, when the same process is represented as a mixed model (Fig. 8 Right), one can see that X_i and X_j are conditionally independent except when both are pointed to from within the clique {A, B}. In general, the neighbors of any variable can be easily identified from this graph by using Theorem 1.
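This almost-everywhere independence can be verified by brute force. In the sketch below, the Bernoulli parameter, the strength of the (soft) equality constraint, and the problem size are all illustrative assumptions:

```python
from itertools import product

n, p, lam = 4, 0.3, 9.0   # lam: strength of the soft equality constraint

def weight(xs, a, b):
    """Unnormalized mixed-model weight: i.i.d. Bernoulli factors times a
    clique potential on {A, B} acting on the pulled-back values x_a, x_b."""
    w = 1.0
    for xi in xs:
        w *= p if xi == 1 else 1 - p
    w *= lam if xs[a] == xs[b] else 1.0
    return w

def cond_p(i, xi, j, xj, a, b):
    """P(X_i = xi | X_j = xj, X_A = a, X_B = b), marginalizing the rest."""
    num = den = 0.0
    for xs in product((0, 1), repeat=n):
        if xs[j] != xj:
            continue
        w = weight(xs, a, b)
        den += w
        if xs[i] == xi:
            num += w
    return num / den

# X_0 and X_1 are independent given template locations (X_A, X_B) = (2, 3) ...
assert abs(cond_p(0, 1, 1, 0, 2, 3) - cond_p(0, 1, 1, 1, 2, 3)) < 1e-12
# ... but directly dependent when the templates sit at 0 and 1.
assert abs(cond_p(0, 1, 1, 0, 0, 1) - cond_p(0, 1, 1, 1, 0, 1)) > 1e-3
```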
The same framework can be used to design models containing multiple templates, each possibly consisting of several interacting elements. Flexible templates (14, 15) arise as a special case. One can also imagine having templates of templates, leading to mixed models with a flexible hierarchical structure such as multilayer texton maps (16). In every case, spatial interactions between templates, between the elements of a template, and between a template and its location are encoded with a sparse set of local constraints reflected in the graph of the mixed model.
Conclusion
In this article, we have studied the analytical and computational properties of mixed Markov models (5, 6). In these graphical models, the graph need not be rigid and completely determined a priori. Instead, node-valued address variables may modify the graph by augmenting it with a set of transient edges. A single joint probability distribution relates the values of regular and address variables.
The introduction of address variables allows sparser graphs that are more easily interpretable by human experts. Positive mixed models have a local Markov property that is equivalent to their global factorization (Theorem 1). Sparse mixed models will remain sparse after some, or all, of the address variables have been instantiated (Theorem 2). A Gibbs sampler (12) is a computationally efficient means of performing probabilistic inference in mixed Markov models.
Acknowledgments
I am grateful to Song-Chun Zhu and Michael Isard for critical review of this manuscript. Thanks also go to Jeff Sachs for helpful discussions and for suggesting ref. 4.
Appendix: Proofs of Theorems
Proof of Theorem 1: We begin with some notation. Let N(v) denote the set of neighbors of a node v (i.e., the nodes joined to v by an edge), and let N̄(v) = N(v) ∪ {v}. For a configuration x and a set of vertices D, D_* is the set of regular nodes pointed to by an address variable in D, and D^* is the set of address nodes whose corresponding variables point to a node in D.
Suppose that P has the Gibbs characterization (Eq. 1). The normalized local characteristic of an address variable X_a is then a ratio of two products of mixed interaction functions, identical except that X_a is set to nil in the denominator; every factor V_C with a ∉ C cancels, leaving only the cliques containing a. In fact, if a node is not a neighbor of a, it cannot be in the same clique as a, and the only way it can then influence one of the remaining terms is by being pointed to from within a clique containing a.
Next, the local characteristic of a regular variable X_i is likewise a ratio of products of mixed interaction functions; inspecting which factors can depend on x_i, whether directly, through a pull-back, or through a pointer to i, yields the locality asserted in the theorem. Turning to the if part, fix a reference configuration o = (o_I, nil_A). By the Möbius inversion formula (7), P(x) = P(o) ∏_{D ⊂ V} U_D(x_D). [5] We will construct mixed interaction functions V_C on the cliques of G in such a way that ∏_C V_C = ∏_D U_D. Given a set D ⊂ I ∪ A and a configuration x_D on D, set C = D \ D_*. [6] C contains all of the address nodes of D, plus those regular nodes in D not pointed to from within D. Suppose U_D(x_D) ≠ 1. We assert that

1. C is a clique.
2. C ⊂ D ⊂ C ∪ C_* and C ∩ C_* = ∅.
3. C as given by Eq. 6 is the only clique satisfying condition 2.
Assume for the moment that claims 1–3 hold. Set the mixed interaction function V_C(x) to be the product of U_D(x_D) over those sets D for which C ∈ A(D), where A(D) = {C′ ∈ 𝒞: C′ ⊂ D ⊂ C′ ∪ C′_*, C′ ∩ C′_* = ∅}. By exchanging the order of multiplication, ∏_C V_C(x) = ∏_D ∏_{C ∈ A(D)} U_D(x_D) = ∏_D U_D(x_D), since A(D) contains the single clique D \ D_* whenever U_D(x_D) ≠ 1. From Eq. 5 and this last chain of equalities, P has the Gibbs characterization of a mixed model. Q.E.D.
Of the three assertions, only claim 1 is nontrivial; it is also the only one requiring the Markov property. We must show that if U_D(x_D) ≠ 1, then any two nodes u and v in D \ D_* are neighbors. Consider two cases. If at least one of the two vertices (say, u) is an address node, then U_D(x_D) = ∏_{E ⊂ D\{u}} [P(x_u | x_E, o_{V\(E ∪ {u})}) / P(nil_u | x_E, o_{V\(E ∪ {u})})]^{(−1)^{|D\(E ∪ {u})|}}. [7] By the Markov property of address variables, U_D(x_D) is 1 unless either v ∈ N(u) or v ∈ N(u)_*, meaning v = x_a for some address node a ∈ N(u). But all address variables in Eq. 7 are set to nil, except those in D. Hence v ∈ D_*, contradicting v ∈ D \ D_*.
Suppose now that u and v are both regular nodes. Express U_D(x_D) as in Eq. 7 with o_u in place of nil_u. The Markov property of regular variables gives us U_D(x_D) = 1 unless v ∈ N(u), or v ∈ N(u)_*, or u ∈ N(v)_*, or u ∈ N(v^*)_* (see Fig. 9).
In all but the first case, however, either u or v is the value of an address variable residing in D; otherwise it would be set to nil in Eq. 7. It follows that either u or v is in D_{*}, a contradiction. Thus v ∈ N(u).
Claims 2 and 3 are simple settheoretical exercises; both follow from the requirement that address variables cannot point to address nodes. □
Proof of Theorem 2: Let G′ denote the modified graph in which each edge ⟨u, a⟩ incident to a has been replaced with the edge ⟨u, x_a⟩. Let N and M be the neighborhood relations in G and G′, respectively. We must show that the local characteristics of the remaining variables are local with respect to M, in the sense of Theorem 1.
Let b ≠ a be an address node. Since P is a mixed model, the normalized local characteristic of X_b is a function of x_{N(b) ∪ N(b)_*}. Now M(b) ∪ M(b)_* = (N(b) ∪ N(b)_*) \ {a}. In fact, if a ∉ N(b), then N(b) = M(b) and N(b)_* = M(b)_*, and neither set includes a. If a ∈ N(b), then N(b)_* includes one extra regular node, x_a, compared with M(b)_*; however, that node is now in M(b), so the union of the two sets is the same except for the node a.
That the local characteristic of a regular variable X_i is local with respect to M is shown in a similar fashion, by establishing M(i) ∪ M(i)_* ∪ M̄(i^*) ∪ M̄(i^*)_* = (N(i) ∪ N(i)_* ∪ N̄(i^*) ∪ N̄(i^*)_*) \ {a}, where N̄ and M̄ denote closed neighborhoods (the set together with its neighbors) in G and G′, respectively.
Footnotes

↵* Email: arthur_fridman{at}merck.com.

↵† We will write P(x_i | x_j) in place of P(X_i = x_i | X_j = x_j).

↵‡ Random variables are indicated by a capital letter: X_{i}, X, and their values by the same letter in lower case: x_{i}, x. Sets of random variables and their values are shown in boldface.

↵§ A clique is a fully connected set of nodes.

↵¶ A generic conditioning algorithm for Markov fields is to remove from the graph the instantiated node and all edges incident to it.

↵∥ In the case of the mixed model, a visiting scheme will include both regular and address nodes. The Gibbs sampler does not differentiate between the two types of nodes.
 Copyright © 2003, The National Academy of Sciences
References

Kindermann, R. & Snell, J. (1980) Markov Random Fields and Their Applications (Amer. Math. Soc., Providence, RI).

Jensen, F. (1996) An Introduction to Bayesian Networks (University College Press, London).

Mumford, D. (1996) in ICIAM 95, eds. Kirchgassner, K., Mahrenholtz, O. & Mennicken, R. (Akademie Verlag, Berlin), pp. 233–256.

Mumford, D. (1997) in The Legacy of Norbert Wiener: A Centennial Symposium, eds. Jerison, D., Singer, I. & Stroock, D. (Amer. Math. Soc., Providence, RI), pp. 235–260.

Winkler, G. (1995) Image Analysis, Random Fields and Dynamic Monte Carlo Methods (Springer, Berlin).

Hammersley, J. & Clifford, P. (1968) Markov Fields on Finite Graphs and Lattices, preprint.

Lauritzen, S. (1996) Graphical Models (Clarendon, Oxford).

Besag, J. (1986) J. R. Stat. Soc. B 48, 259–302.