Sheaves and Probability, Part 2
(See Part 1 for background.)
Once the structure of a graphical model (or whatever else you want to call it) is determined, and we know how to represent it using sheaves and cosheaves, what do we do with it? Typically, we know the local Hamiltonian \(H\) and we want to know the probabilities of certain events. Of course, we don’t usually care about the probabilities of individual global states, but about the marginal probabilities of local states, or some other statistics that we can derive from those marginals. For instance, in the Ising model, we might want to know the expected proportion of sites with spin \(+1\) rather than \(-1\). We don’t need the whole distribution to calculate this, just the marginal distribution at each site.
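To be concrete: if \(p_i(+1)\) is the single-site marginal probability that site \(i\) has spin \(+1\) in an \(N\)-site model, then by linearity of expectation the expected proportion of \(+1\) spins is
\[\mathbb{E}\left[\frac{1}{N}\left|\{i : x_i = +1\}\right|\right] = \frac{1}{N}\sum_{i=1}^N p_i(+1),\]
which involves only the \(N\) single-site marginals, never the joint distribution.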
A local Hamiltonian \(H = \sum_\alpha h_\alpha\) produces local probability potential functions \(\phi_\alpha = \exp(-h_\alpha)\). The resulting probability distribution on states is \(p_H(x) = \frac{1}{Z} \prod_{\alpha} \exp(-h_\alpha(x_\alpha)) = \frac{1}{Z}\exp(-H(x))\), where \(Z\) is a normalizing constant called the partition function: \(Z = \sum_{x} \prod_{\alpha} \exp(-h_\alpha(x_\alpha))\). The only hard part of calculating the probability distribution is calculating the partition function, because it requires summing over the exponentially large global state space. The goal is to find a way to calculate the marginals for each set \(\alpha\) without computing the whole partition function.
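To make the problem concrete, here's a brute-force computation of \(Z\) and the single-site marginals for a tiny Ising chain (the names and parameters are mine, just for illustration). The point is that everything below sums over all \(2^n\) global states, which is exactly what becomes impossible at scale.

```python
import itertools
import math

# Toy Ising chain: n spins in {-1, +1}, one local Hamiltonian term per edge,
# h_edge(s_i, s_j) = -J * s_i * s_j, so H(x) is the sum of the edge terms.
n, J = 6, 0.5
edges = [(i, i + 1) for i in range(n - 1)]

def hamiltonian(state):
    return sum(-J * state[i] * state[j] for i, j in edges)

# Partition function: a sum over all 2^n global states.
states = list(itertools.product([-1, +1], repeat=n))
Z = sum(math.exp(-hamiltonian(s)) for s in states)

# Single-site marginals p_i(+1), again by brute force over all global states.
marginals = [
    sum(math.exp(-hamiltonian(s)) for s in states if s[i] == +1) / Z
    for i in range(n)
]
print("Z =", Z)
print("p_i(+1) =", [round(m, 4) for m in marginals])
```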
One way to reframe things is to note that the probability distribution is the solution to an optimization problem. The distribution \(p_H\) is the minimizer (for fixed \(H\)) of the free energy functional \(\mathbb{F}(p,H) = \mathbb{E}_{p}[H] - S(p)\), where \(S(p)\) is the entropy of the distribution \(p\). To see this, we think of \(H\) and \(p\) as big vectors, and form the Lagrangian
\[\mathcal L(p,\lambda,\mu) = \mathbb{E}_p[H] - S(p) + \lambda (\langle \mathbf{1},p\rangle -1) - \langle \mu, p\rangle.\]
The expectation is linear in \(p\), and \(S\) is concave, so this is a convex optimization problem. The KKT optimality conditions require that the derivative of \(\mathcal L\) with respect to \(p\) be zero, that \(\langle \mathbf{1}, p\rangle = 1\), \(p \geq 0\), \(\mu \geq 0\), and \(\mu(x)p(x) = 0\) for all \(x\). Differentiating \(\mathcal L\) with respect to \(p\), we get \[H + (\log(p) + 1) + \lambda \mathbf{1} - \mu = 0,\] which gives \[p(x) = \exp(- H(x) - (\lambda + 1) + \mu(x)) = \exp(- H(x))\exp(-\lambda -1)\exp(\mu(x)).\]
Since \(\mu(x)p(x) = 0\) and \(p(x)\) is an exponential, hence strictly positive, complementary slackness forces \(\mu(x) = 0\) for every \(x\), so the factor \(\exp(\mu(x))\) is just 1. Finally, \(\exp(-\lambda-1)\) must be the normalizing factor \(1/Z\), since \(p\) must sum to 1 and \(\lambda\) is the only parameter left to make that happen.
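As a quick numerical sanity check on this derivation (nothing sheaf-theoretic here, and the random Hamiltonian is just a stand-in): the Gibbs distribution \(p_H \propto \exp(-H)\) should achieve free energy exactly \(-\log Z\), and every other distribution should do worse.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 3.0, size=16)  # a random Hamiltonian on 16 global states

def free_energy(p, H):
    """F(p, H) = E_p[H] - S(p), with the convention 0 log 0 = 0."""
    return float(p @ H + np.where(p > 0, p * np.log(p), 0.0).sum())

# The Gibbs distribution and its free energy, which should equal -log Z.
Z = np.exp(-H).sum()
p_gibbs = np.exp(-H) / Z
print("F(p_H) =", free_energy(p_gibbs, H), " -log Z =", -np.log(Z))

# Any other distribution has strictly larger free energy.
for _ in range(5):
    q = rng.dirichlet(np.ones_like(H))
    assert free_energy(q, H) > free_energy(p_gibbs, H)
print("random distributions all have higher free energy, as expected")
```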
This reformulation alone doesn’t solve the problem, since we still have to optimize over the space of all possible distributions on \(S\). But remember: when \(H = \sum_\alpha h_\alpha\) we can calculate \(\mathbb{E}_p[H]\) locally via the action of \(H^0(X; \mathcal A^*)\) on \(H_0(X; \mathcal A)\). Unfortunately, the same is not (in general) true of the entropy \(S(p)\). Naively, we need to actually extend our local marginals to a global distribution in order to calculate its entropy, which is exactly what we were trying to avoid. Even worse, a pseudomarginal \(\mu \in H^0(X; \mathcal A^*)\) might not even correspond to any probability distribution on the total space of states \(S\). Even if it does, that probability distribution is far from unique. When \(\mu\) does correspond to the local marginals of some distribution, we can resolve this non-uniqueness by singling out the compatible distribution with maximal entropy.
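Here's a standard example of this failure, translated into this setting (the setup is mine, but the example is folklore). Take three binary variables covered by the three pairs, and let each pairwise pseudomarginal say its two variables disagree with probability one. Each pair induces uniform marginals on its endpoints, so these local distributions agree on overlaps and define a perfectly good section. But no joint distribution has them as marginals, because any global state must have at least two of the three variables agree:

```python
import itertools

# Pairwise pseudomarginals on variables {0, 1, 2}: p_ij(a, b) = 1/2 if a != b, else 0.
pairs = [(0, 1), (1, 2), (0, 2)]

def pair_marginal(a, b):
    return 0.5 if a != b else 0.0

# Any joint distribution with these pairwise marginals must put zero mass on
# every global state where some pair agrees (that pair's marginal vanishes there).
support = [
    s for s in itertools.product([0, 1], repeat=3)
    if all(pair_marginal(s[i], s[j]) > 0 for i, j in pairs)
]
print("global states compatible with the pseudomarginal:", support)  # prints []
# No state has all three pairs disagreeing, so the joint would have to be
# identically zero, which can't be normalized: this pseudomarginal is not a marginal.
```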
Even checking whether \(\mu\) is a true marginal distribution is NP-hard in general. (The hardness depends on how the covering sets in \(\mathcal V\) are arranged. If \(X\) is a tree, every pseudomarginal is a true marginal, for example.)
The Bethe-Kikuchi approximation solves these two problems by ignoring them and introducing one of its own.
- First, forget the problem of finding a real marginal distribution. All we’re looking for are pseudomarginals. We’ll just hope they correspond to marginals of a real distribution. In practice this doesn’t seem to be too big a deal.
- Second, we will have to replace the entropy with a locally calculated approximation.
- The new problem is that the Bethe approximation to the entropy is no longer a concave function, and hence we can’t guarantee uniqueness of the obtained distribution.
The definition of the Bethe-Kikuchi entropy of a pseudomarginal is motivated by an inclusion-exclusion argument. A first approximation might be to calculate the entropy of each local pseudomarginal \(p_\alpha\), for each maximal covering set \(\alpha\). This is a good start. But, because the sets may overlap, we’ve double-counted the entropy associated with the common variables in, say, \(\alpha \cap \beta\). So we subtract the entropy of each pairwise overlap. But the pairwise overlaps may themselves overlap: the variables in a triple intersection \(\alpha \cap \beta \cap \gamma\) were counted three times in the first step and then removed three times in the second, once for each edge of \([\alpha\beta\gamma]\), so we need to add their entropy back in once. Continuing in this way, we eventually arrive at
\[\check{S}(p) = \sum_{\alpha} S(p_\alpha) - \sum_{\alpha,\beta} S(p_{\alpha\beta}) + \sum_{\alpha,\beta,\gamma} S(p_{\alpha\beta\gamma}) - \cdots = \sum_{k=1}^n (-1)^{k-1}\sum_{\alpha_1,\ldots,\alpha_k} S(p_{\alpha_1\cdots\alpha_k}),\]
where the inner sum runs over the \((k-1)\)-simplices of \(X\), i.e., the \(k\)-fold intersections of covering sets.
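One reason to believe this is a sensible approximation: in the simplest overlapping case it is exact. If the cover consists of just two sets \(\alpha\) and \(\beta\), and \(p\) is a distribution in which the variables outside the overlap are conditionally independent given \(x_{\alpha\cap\beta}\), so that \(p(x) = p_\alpha(x_\alpha)p_\beta(x_\beta)/p_{\alpha\beta}(x_{\alpha\beta})\), then a direct computation gives
\[S(p) = S(p_\alpha) + S(p_\beta) - S(p_{\alpha\beta}),\]
which is exactly \(\check{S}(p)\) for this cover. The Bethe-Kikuchi approximation amounts to pretending that this kind of factorization holds across the whole cover.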
Note that this definition of \(\check{S}\) implicitly assumes that \(p\) is a global section of \(\mathcal A^*\). Coming up with a formula that is defined on all of \(C^0(X;\mathcal A^*)\) is a bit messier. One way to do it is to let \(\check{S}_{\alpha}(p_\alpha) = \sum_{\alpha \trianglelefteq \sigma} \frac{(-1)^{\dim \sigma}}{\dim \sigma + 1}S(A^*_{\alpha \trianglelefteq \sigma} p_\alpha)\) for each vertex \(\alpha\) of \(X\), and then let \(\check{S}(p) = \sum_\alpha \check{S}_\alpha(p_\alpha)\). This splits the computation of the term corresponding to a \(k\)-simplex of \(X\) equally among its \(k+1\) vertices (the \(\sigma = \alpha\) term just reproduces \(S(p_\alpha)\)).
The Bethe-Kikuchi entropy is often described in a slightly different but equivalent way, better adapted to situations where we don’t use the simplicial structure. For this, we convert the cover \(\mathcal V\), which we previously assumed to have no sets contained in each other, to its closure \(\overline{\mathcal V}\) under intersection, and take its reversed poset of inclusion. The Möbius numbers of this poset are an assignment of integers \(c_\alpha\) to each element \(\alpha \in \overline{\mathcal V}\) such that \(\sum_{\beta\supseteq \alpha} c_\beta = 1\) for every \(\alpha\), the sum running over all sets in \(\overline{\mathcal V}\) that contain \(\alpha\). This condition captures the inclusion-exclusion principle behind the Bethe-Kikuchi approximation. If we let \(\check{S}(p) = \sum_\alpha c_\alpha S(p_\alpha)\), we get an equivalent definition with less redundancy and a natural localized definition on 0-cochains, although we are forced to work with cochains of \(\mathcal A^*\) defined on \(\overline{\mathcal V}\).
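Here's a small sketch of how one might compute these numbers (the code and names are mine, not from any library): close the cover under intersections, then recurse from the largest regions down using \(c_\alpha = 1 - \sum_{\beta \supsetneq \alpha} c_\beta\). On the cover of a graph by its edges, this recovers the familiar Bethe coefficients: \(+1\) for each edge and \(1 - \deg(v)\) for each vertex.

```python
from itertools import combinations

def intersection_closure(cover):
    """Close a collection of frozensets under (nonempty) pairwise intersection."""
    regions = set(cover)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(regions), 2):
            c = a & b
            if c and c not in regions:
                regions.add(c)
                changed = True
    return regions

def counting_numbers(cover):
    """Mobius numbers: c_alpha = 1 - sum of c_beta over regions strictly containing alpha."""
    regions = intersection_closure(cover)
    c = {}
    for r in sorted(regions, key=len, reverse=True):  # largest regions first
        c[r] = 1 - sum(c[s] for s in regions if r < s)
    return c

# Bethe case: a star graph with center 0 and leaves 1, 2, 3, covered by its edges.
edges = [frozenset({0, 1}), frozenset({0, 2}), frozenset({0, 3})]
c = counting_numbers(edges)
for region in sorted(c, key=len, reverse=True):
    print(sorted(region), c[region])  # each edge gets +1, the vertex {0} gets 1 - 3 = -2

# Sanity check: the numbers of all regions containing a given region sum to 1.
assert all(sum(c[s] for s in c if r <= s) == 1 for r in c)
```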
However we decide to calculate \(\check{S}(p)\), the approximate marginal inference problem is what one might call a homological program:
\[\min_{p}\ \langle p, h \rangle - \check{S}(p) \quad \text{s.t. } p \in H^0(X;\mathcal A^*),\ p \geq 0,\ \langle \mathbf{1}, p_\alpha \rangle = 1 \text{ for all } \alpha\]
The Bethe-Kikuchi entropy is not concave, so this is not a convex optimization problem. But it is localizable as discussed earlier, so we can try some of our distributed homological programming techniques without the optimality guarantees. The general idea is to replace the constraint \(p \in H^0(X;\mathcal A^*)\) with the constraint \(L_{\mathcal A^*} p = 0\). The objective function is \(F(p) = \sum_{\alpha} \left(\langle p_\alpha, h_\alpha\rangle - \check{S}_\alpha(p_\alpha)\right)\), and we can construct a local Lagrangian
\[\mathcal L(p,\lambda,\mu,\tau) = \sum_\alpha \mathcal L_\alpha(p,\lambda_\alpha,\mu_\alpha, \tau_\alpha),\]
with \(\mathcal{L}_\alpha(p,\lambda_\alpha,\mu_\alpha,\tau_\alpha) = \langle p_\alpha, h_\alpha \rangle - \check{S}_\alpha (p_\alpha) + \langle \lambda_\alpha, (L_{\mathcal{A}^*} p)_\alpha\rangle + \mu_\alpha (\langle \mathbf{1}, p_\alpha \rangle - 1) - \langle \tau_\alpha,p_\alpha \rangle\), where \(\tau_\alpha \geq 0\) is the multiplier for the constraint \(p_\alpha \geq 0\).
The local Lagrangian \(\mathcal{L}_\alpha\) only depends on the values of \(p_\beta\) where \(\beta\) is a neighboring vertex to \(\alpha\). The primal-dual dynamics on \(\mathcal L\)—gradient descent on \(p\) and ascent on the dual variables—is also locally determined, and (hopefully) converges to a critical point of the optimization problem. You can then discretize the continuous-time dynamics and work out what the messages passed between nodes of \(X\) look like, but that’s a lot of work and not particularly enlightening. While it’s interesting that you can come up with a distributed algorithm using the general nonsense of homological programming, I’m not sure this is actually a fruitful approach. The resulting algorithm doesn’t look anything like the well-studied message-passing algorithms for marginal inference. And it has some obvious deficits. For instance, if \(X\) is a tree, the standard message-passing algorithms converge to an exact solution in finitely many steps, while convergence of this optimization algorithm is asymptotic at best (and possibly not guaranteed to happen).
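For flavor, here is what those dynamics look like on a stripped-down problem where I've dropped the sheaf Laplacian term entirely and kept only the normalization constraint, so this is just the exact free-energy problem from earlier rather than the Bethe-Kikuchi one (the step size and the clipping that stands in for the \(p \geq 0\) multiplier are ad hoc choices of mine). Gradient descent on \(p\) and ascent on the multiplier drift toward the Gibbs distribution, but only asymptotically:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.uniform(0.0, 2.0, size=8)

# Lagrangian: L(p, lam) = <p, H> + sum(p log p) + lam * (sum(p) - 1).
# The nonnegativity multiplier is replaced by simply clipping p at each step.
p = np.full_like(H, 1.0 / H.size)
lam = 0.0
eta = 1e-2  # hand-tuned step size

for _ in range(50_000):
    grad_p = H + np.log(p) + 1.0 + lam      # dL/dp: descend on the primal variable
    p = np.clip(p - eta * grad_p, 1e-12, None)
    lam += eta * (p.sum() - 1.0)            # dL/dlam: ascend on the multiplier

gibbs = np.exp(-H) / np.exp(-H).sum()
print("max deviation from the Gibbs distribution:", np.abs(p - gibbs).max())
```

In the actual homological program, each \(p_\alpha\) update would also involve the \(\langle \lambda_\alpha, (L_{\mathcal A^*}p)_\alpha\rangle\) terms, so every step only touches neighboring vertices of \(X\); but the resulting scheme still looks nothing like belief propagation.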
There is a better approach, developed by Olivier Peltre, that connects this sheaf-theoretic perspective with message-passing algorithms. His perspective interprets message passing as a discrete-time approximation to the flow of a nonlinear Laplacian-like operator on 0-chains of \(\mathcal A\). Part 3 will outline this framework and its implications.