(See Part 1 for background.)

Once the structure of a graphical model (or whatever else you want to call it) is determined, and we know how to represent it using sheaves and cosheaves, what do we do with it? Typically, we know the local Hamiltonian HH and we want to know the probabilities of certain events. Of course, we don’t usually care about the probabilities of individual global states, but about the marginal probabilities of local states, or some other statistics that we can derive from those marginals. For instance, in the Ising model, we might want to know the distribution of the proportion of local sites that have +1+1 spin instead of 1-1. We don’t need the whole distribution to calculate this, just the marginal distributions for each site.
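For example, if $\mu_i(+1)$ denotes the marginal probability that site $i$ has spin $+1$, then by linearity of expectation the expected proportion of $+1$ spins over $N$ sites is

$$\mathbb{E}\left[\tfrac{1}{N}\#\{i : x_i = +1\}\right] = \frac{1}{N}\sum_{i=1}^N \mu_i(+1),$$

which requires only the one-site marginals.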

A local Hamiltonian $H = \sum_\alpha h_\alpha$ produces local probability potential functions $\phi_\alpha = \exp(-h_\alpha)$. The resulting probability distribution on states is defined by $p_H(x) = \frac{1}{Z} \prod_{\alpha} \exp(-h_\alpha(x_\alpha))$, where $Z$ is a normalizing factor called the partition function: $Z = \sum_{x} \prod_{\alpha} \exp(-h_\alpha(x_\alpha))$. The only hard part of calculating the probability distribution is computing the partition function, because it requires summing over the exponentially large global state space. The goal is to find a way to calculate the marginals for each set $\alpha$ without computing the whole partition function.
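To make the cost concrete, here is a minimal brute-force sketch in Python. The chain model, the coupling constant, and the helper names are made-up toy choices, not anything from the general framework; the point is just that the naive computation of $Z$ and the marginals loops over all $2^N$ states.

```python
import itertools
import math

# Brute-force marginal computation for a tiny Ising-style chain.
# Illustration only: N, J, and the chain structure are toy choices,
# and the cost of the enumeration is 2**N.
N = 10
J = 1.0

def hamiltonian(x):
    # H(x) = sum_alpha h_alpha(x_alpha), with one term h_alpha = -J * x_i * x_{i+1} per edge
    return sum(-J * x[i] * x[i + 1] for i in range(N - 1))

states = list(itertools.product([-1, +1], repeat=N))

# Partition function: Z = sum_x prod_alpha exp(-h_alpha(x_alpha)) = sum_x exp(-H(x))
Z = sum(math.exp(-hamiltonian(x)) for x in states)

# One-site marginals p_i(+1), each a sum of Boltzmann weights over half the states
marginal_plus = [
    sum(math.exp(-hamiltonian(x)) for x in states if x[i] == +1) / Z
    for i in range(N)
]
print(Z, marginal_plus)
```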

One way to reframe things is to note that the probability distribution is the solution to an optimization problem. The distribution $p_H$ is the minimizer (for fixed $H$) of the free energy functional $\mathbb{F}(p,H) = \mathbb{E}_{p}[H] - S(p)$, where $S(p)$ is the entropy of the distribution $p$. To see this, we think of $H$ and $p$ as big vectors, and form the Lagrangian

$$\mathcal L(p,\lambda,\mu) = \mathbb{E}_p[H] - S(p) + \lambda (\langle \mathbf{1},p\rangle - 1) - \langle \mu, p\rangle.$$

The expectation is linear in $p$, and $S$ is concave, so this is a convex optimization problem. The KKT optimality conditions require that the derivative of $\mathcal L$ with respect to $p$ be zero, that $\langle \mathbf{1}, p\rangle = 1$, $p \geq 0$, $\mu \geq 0$, and $\mu(x)p(x) = 0$ for all $x$. Differentiating $\mathcal L$ with respect to $p$, we get $H + (\log(p) + 1) + \lambda \mathbf{1} - \mu = 0$, which gives $p(x) = \exp(-H(x) - (\lambda + 1) + \mu(x)) = \exp(-H(x))\exp(-\lambda - 1)\exp(\mu(x))$.

Since $\mu(x)p(x) = 0$, we can ignore the term $\exp(\mu(x))$: it equals 1 whenever $p(x)$ is nonzero. Finally, $\exp(-\lambda-1)$ must be the normalizing factor $1/Z$, since $p$ must be normalized and adjusting $\lambda$ is the only way to make that happen.
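As a quick numerical sanity check, here is a small sketch (assuming an arbitrary random Hamiltonian on a handful of explicit states) comparing the free energy of the Boltzmann distribution $p_H = \exp(-H)/Z$ against randomly sampled distributions on the same state space; the Boltzmann distribution should never lose.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                      # number of global states (tiny, so everything is explicit)
H = rng.normal(size=n)     # an arbitrary Hamiltonian vector, for illustration only

def free_energy(p, H):
    # F(p, H) = E_p[H] - S(p), with the convention 0 * log 0 = 0
    mask = p > 0
    return p @ H + np.sum(p[mask] * np.log(p[mask]))

# Boltzmann distribution p_H = exp(-H) / Z
pH = np.exp(-H)
pH /= pH.sum()

# Compare against random distributions on the same state space
best = min(free_energy(q, H)
           for q in (rng.dirichlet(np.ones(n)) for _ in range(1000)))

print(free_energy(pH, H) <= best + 1e-12)  # expected: True
```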

This reformulation alone doesn't solve the problem, since we still have to optimize over the space of all possible distributions on $S$. But remember: when $H = \sum_\alpha h_\alpha$ we can calculate $\mathbb{E}_p[H]$ locally via the action of $H^0(X; \mathcal A^*)$ on $H_0(X; \mathcal A)$. Unfortunately, the same is not (in general) true of the entropy $S(p)$. Naively, we need to actually extend our local marginals to a global distribution in order to calculate its entropy, which is exactly what we were trying to avoid. Even worse, a pseudomarginal $\mu \in H^0(X; \mathcal A^*)$ might not even correspond to any probability distribution on the total space of states $S$. Even if it does, that probability distribution is far from unique. Assuming $\mu$ actually corresponds to the local marginals of some distribution, we can solve this uniqueness problem by assuming it corresponds to the distribution with maximal entropy.
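Concretely, the locality of the expected energy is just linearity: since each $h_\alpha$ depends only on the variables $x_\alpha$,

$$\mathbb{E}_p[H] = \sum_\alpha \mathbb{E}_p[h_\alpha] = \sum_\alpha \sum_{x_\alpha} p_\alpha(x_\alpha)\, h_\alpha(x_\alpha) = \sum_\alpha \langle p_\alpha, h_\alpha\rangle,$$

so it is computable from the local marginals $p_\alpha$ alone. No such decomposition exists for $S(p)$ in general.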

Even checking whether $\mu$ is a true marginal distribution is NP-hard in general. (The hardness depends on how the covering sets in $\mathcal V$ are arranged. If $X$ is a tree, for example, every pseudomarginal is a true marginal.)
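A standard small example (not from this post, but a classic) shows how a pseudomarginal can fail to be a true marginal: take three binary variables on a triangle and declare that each pair of variables disagrees with certainty. These pairwise "marginals" are locally consistent (every shared one-variable marginal is uniform), but no joint distribution has them as marginals, since no assignment of two values to three variables makes all three pairs disagree. The sketch below checks this by enumeration.

```python
import itertools

# Pairwise pseudomarginals on the edges of a triangle over binary variables x1, x2, x3:
# each edge puts probability 1/2 on (0, 1) and 1/2 on (1, 0), i.e. "the endpoints disagree".
# These agree on every shared one-variable marginal (uniform), so they form a
# locally consistent pseudomarginal.
edges = [(0, 1), (1, 2), (0, 2)]

# Any joint distribution with these marginals could only put mass on global states
# in which all three pairs disagree. Enumerate the candidates:
support = [x for x in itertools.product([0, 1], repeat=3)
           if all(x[i] != x[j] for i, j in edges)]

print(support)  # [] -- empty, so no joint distribution realizes these pseudomarginals
```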

The Bethe-Kikuchi approximation solves these two problems by ignoring them and introducing one of its own.

  • First, forget the problem of finding a real marginal distribution. All we’re looking for are pseudomarginals. We’ll just hope they correspond to marginals of a real distribution. In practice this doesn’t seem to be too big a deal.
  • Second, we will have to replace the entropy with a locally calculated approximation.
  • The new problem is that the Bethe approximation to the entropy is no longer a concave function, and hence we can't guarantee uniqueness of the resulting optimizer.

The definition of the Bethe-Kikuchi entropy of a pseudomarginal is motivated by an inclusion-exclusion process. A first approximation might be to calculate the entropy for each local pseudomarginal $p_\alpha$, for each maximal covering set $\alpha$. This is a good start. But, because the sets may overlap, we've double-counted the entropy associated with the common variables in, say, $\alpha \cap \beta$. So we subtract this entropy. But the pairwise overlaps may themselves overlap, so we need to add back in the extra entropy we removed. The entropy associated with a triple overlap $\alpha \cap \beta \cap \gamma$ was counted three times in the first sum and removed three times, once for each face of $[\alpha\beta\gamma]$, so we need to add one copy back in. Eventually we get to

$$\check{S}(p) = \sum_{\alpha} S(p_\alpha) - \sum_{\alpha,\beta} S(p_{\alpha\beta}) + \sum_{\alpha,\beta,\gamma} S(p_{\alpha\beta\gamma}) - \cdots = \sum_{k=1}^n (-1)^{k-1} \sum_{\alpha_1,\ldots,\alpha_k} S(p_{\alpha_1\cdots\alpha_k}),$$

where the inner sums run over the simplices of $X$, i.e., over collections of covering sets with nonempty common intersection.

Note that this definition implicitly assumes that $p$ is a section of $\mathcal A^*$. Coming up with a formula that is defined on $C^0(X;\mathcal A^*)$ is a bit messy. One way to do it is to let $\check{S}_{\alpha}(p_\alpha) = S(p_\alpha) + \sum_{\alpha \trianglelefteq \sigma,\ \dim\sigma \geq 1} \frac{(-1)^{\dim \sigma}}{\dim \sigma + 1}S(A^*_{\alpha \trianglelefteq \sigma} p_\alpha)$ for each vertex $\alpha$ of $X$. Then we let $\check{S}(p) = \sum_\alpha \check{S}_\alpha(p_\alpha)$. This splits the computation of the term corresponding to a $k$-simplex of $X$ equally between its $k+1$ vertices.

The Bethe-Kikuchi entropy is often described in a slightly different but equivalent way, better adapted to situations where we don't use the simplicial structure. For this, we convert the cover $\mathcal V$, which we previously assumed to have no sets contained in each other, to its closure $\overline{\mathcal V}$ under intersection, and take its reversed poset of inclusion. The Möbius numbers of this poset are an assignment of integers $c_\alpha$ to each element $\alpha \in \overline{\mathcal V}$ such that $\sum_{\beta \supseteq \alpha} c_\beta = 1$ for every $\alpha$. This condition captures the inclusion-exclusion principle behind the Bethe-Kikuchi approximation. If we let $\check{S}(p) = \sum_\alpha c_\alpha S(p_\alpha)$, we get an equivalent definition with less redundancy and a natural localized version on 0-cochains, although we are forced to work with cochains of $\mathcal A^*$ defined on $\overline{\mathcal V}$.
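A minimal sketch of computing these Möbius numbers: close a cover under intersection and solve $\sum_{\beta \supseteq \alpha} c_\beta = 1$ from the largest sets downward. The particular cover below is made up for illustration.

```python
from itertools import combinations

# A small cover by maximal sets (made up for illustration)
maximal = [frozenset(s) for s in [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]]

# Close the cover under (nonempty) intersection
regions = set(maximal)
changed = True
while changed:
    changed = False
    for a, b in combinations(list(regions), 2):
        c = a & b
        if c and c not in regions:
            regions.add(c)
            changed = True

# Moebius counting numbers: c_alpha = 1 - sum of c_beta over strict supersets beta,
# computed from the largest regions down to the smallest.
counts = {}
for alpha in sorted(regions, key=len, reverse=True):
    counts[alpha] = 1 - sum(counts[beta] for beta in regions if alpha < beta)

print({tuple(sorted(a)): c for a, c in counts.items()})
# The Bethe-Kikuchi entropy of a pseudomarginal p is then
# sum over regions alpha of counts[alpha] * S(p_alpha).
```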

However we decide to calculate $\check{S}(p)$, the approximate marginal inference problem is what one might call a homological program:

$$\min_{p} \; \langle p, h \rangle - \check{S}(p) \quad \text{s.t. } p \in H^0(X;\mathcal A^*),\; p \geq 0,\; p \text{ normalized}.$$

The Bethe-Kikuchi entropy is not concave, so this is not a convex optimization problem. But it is localizable as discussed earlier, so we can try some of our distributed homological programming techniques without the optimality guarantees. The general idea is to replace the constraint $p \in H^0(X;\mathcal A^*)$ with the constraint $L_{\mathcal A^*} p = 0$. The objective function is $F(p) = \sum_{\alpha} \left(\langle p_\alpha, h_\alpha\rangle - \check{S}_\alpha(p_\alpha)\right)$, and we can construct a local Lagrangian

$$\mathcal L(p,\lambda,\mu,\tau) = \sum_\alpha \mathcal L_\alpha(p,\lambda_\alpha,\mu_\alpha, \tau_\alpha),$$

with $\mathcal{L}_\alpha(p,\lambda_\alpha,\mu_\alpha,\tau_\alpha) = \langle p_\alpha, h_\alpha \rangle - \check{S}_\alpha (p_\alpha) + \langle \lambda_\alpha, (L_{\mathcal{A}^*} p)_\alpha\rangle + \mu_\alpha (\langle \mathbf{1}, p_\alpha \rangle - 1) + \langle \tau_\alpha,p_\alpha \rangle$.
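To make the constraint $L_{\mathcal A^*} p = 0$ concrete, here is a small sketch with two overlapping regions and binary variables, where the restriction maps of the sheaf are taken to be marginalization onto the overlap; all names and choices are illustrative, not part of the general framework. It builds the coboundary $\delta$, the Laplacian $L = \delta^\top \delta$, and checks that a consistent pair of local distributions lies in its kernel.

```python
import numpy as np

# Two regions alpha = {x1, x2} and beta = {x2, x3} over binary variables, overlapping in {x2}.
# States of each region are ordered (0,0), (0,1), (1,0), (1,1).

# Restriction maps: marginalize a region distribution onto the overlap {x2}.
# For alpha = (x1, x2), x2 is the second coordinate; for beta = (x2, x3), the first.
M_alpha = np.array([[1, 0, 1, 0],    # P(x2 = 0) from p_alpha
                    [0, 1, 0, 1]])   # P(x2 = 1) from p_alpha
M_beta  = np.array([[1, 1, 0, 0],    # P(x2 = 0) from p_beta
                    [0, 0, 1, 1]])   # P(x2 = 1) from p_beta

# Coboundary on 0-cochains (p_alpha, p_beta): delta(p) = M_alpha p_alpha - M_beta p_beta.
delta = np.hstack([M_alpha, -M_beta])
L = delta.T @ delta   # the degree-0 sheaf Laplacian

# A consistent pair: marginals of a single joint distribution q(x1, x2, x3).
q = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)
p_alpha = q.sum(axis=2).reshape(4)   # marginal on (x1, x2)
p_beta  = q.sum(axis=0).reshape(4)   # marginal on (x2, x3)
p = np.concatenate([p_alpha, p_beta])

print(np.allclose(L @ p, 0))  # True: consistent local marginals lie in ker L = H^0
```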

The local Lagrangian $\mathcal{L}_\alpha$ only depends on the values of $p_\beta$ where $\beta$ is a neighboring vertex to $\alpha$. The primal-dual dynamics on $\mathcal L$ (gradient descent on $p$ and ascent on the dual variables) is also locally determined, and (hopefully) converges to a critical point of the optimization problem. You can then discretize the continuous-time dynamics and work out what the messages passed between nodes of $X$ look like, but that's a lot of work and not particularly enlightening. While it's interesting that you can come up with a distributed algorithm using the general nonsense of homological programming, I'm not sure this is actually a fruitful approach. The resulting algorithm doesn't look anything like the well-studied message-passing algorithms for marginal inference. And it has some obvious deficits. For instance, if $X$ is a tree, the standard message-passing algorithms converge to an exact solution in finitely many steps, while convergence of this optimization algorithm is asymptotic at best (and possibly not guaranteed to happen).
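For reference, the discretization alluded to above is just Euler steps of the primal-dual gradient dynamics: descend on the primal variables, ascend on the multipliers. The sketch below shows only this structural template on a generic toy problem (a quadratic objective with a linear constraint, with made-up sizes and step sizes), not the actual Bethe-Kikuchi objective with its entropy terms and positivity constraints.

```python
import numpy as np

# Euler discretization of primal-dual dynamics for: minimize f(p) subject to A p = b.
# Toy instance only: quadratic f, random A and b, arbitrary step size and iteration count.
rng = np.random.default_rng(1)
n, m = 6, 2
Q = np.eye(n)                      # f(p) = 0.5 * p^T Q p + c^T p
c = rng.normal(size=n)
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

p = np.zeros(n)
lam = np.zeros(m)
eta = 0.01                         # step size (toy choice)

for _ in range(20000):
    grad_p = Q @ p + c + A.T @ lam     # d/dp of the Lagrangian f(p) + <lam, A p - b>
    grad_lam = A @ p - b               # d/dlam of the Lagrangian
    p = p - eta * grad_p               # gradient descent on the primal variable
    lam = lam + eta * grad_lam         # gradient ascent on the dual variable

# Compare with the exact KKT solution of this toy problem
kkt = np.block([[Q, A.T], [A, np.zeros((m, m))]])
exact = np.linalg.solve(kkt, np.concatenate([-c, b]))
print(np.linalg.norm(A @ p - b), np.linalg.norm(p - exact[:n]))  # both should be small
```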

There is a better approach, developed by Olivier Peltre, that makes connections between this sheaf-theoretic perspective and message-passing algorithms. His perspective interprets message passing as a discrete-time approximation to the flow of a nonlinear Laplacian-like operator on 0-chains of $\mathcal A$. Part 3 will outline this framework and its implications.