Sheaves and Probability, Part 3
The distributed algorithm for inference I described last time is kind of awkward, and not very well connected with the rest of the field. Olivier Peltre has a better one, closely related to the well-known message-passing algorithms for this sort of calculation. Unfortunately, while I think I have a decent grasp of the overall picture, I don’t have a very detailed understanding of all the moving pieces, so some of this exposition might be just a little bit wrong.
We will need to give up a bit of our nice simplicial structure to make everything work properly. We will instead work directly with a cover of our set of random variables \(\mathcal I\) by a collection of sets \(\mathcal V\), which we require to be closed under intersection. We partially order \(\mathcal V\) by reverse inclusion: \(\alpha \leq \beta\) if \(\beta \subseteq \alpha\). (The change in direction is to ensure that our sheaves stay sheaves and our cosheaves stay cosheaves.) The state spaces for the \(\alpha \in \mathcal V\) still form a sheaf (i.e., a covariant functor) over \(\mathcal V\), and so we also get the cosheaf \(\mathcal A\) of observables and the sheaf \(\mathcal A^*\) of functionals. (This is probably terrible terminology for anyone not comfortable with cellular sheaves, since typically we want to think of a presheaf as a contravariant functor. Sorry.)
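To make this concrete, here is a small toy version in Python (all names here are mine, purely for illustration): take a cover of the variable set, close it under pairwise intersection, and record the reverse-inclusion order.

```python
from itertools import combinations

def close_under_intersection(cover):
    """Close a cover (an iterable of sets of variable indices) under pairwise intersection,
    keeping only the nonempty intersections."""
    V = {frozenset(a) for a in cover}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(V), 2):
            c = a & b
            if c and c not in V:
                V.add(c)
                changed = True
    return V

def leq(alpha, beta):
    """The partial order on V: alpha <= beta iff beta is a subset of alpha."""
    return beta <= alpha   # frozenset containment

V = close_under_intersection([{0, 1}, {1, 2}, {2, 0}])
# V now also contains the singletons {0}, {1}, {2}; e.g. leq({0, 1}, {1}) holds.
```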
We can move back to the simplicial world by taking the nerve \(\mathcal N(\mathcal V)\) of \(\mathcal V\): this is the simplicial complex whose \(k\)-simplices are nondegenerate chains of length \(k+1\) in \(\mathcal V\). In particular, vertices of \(\mathcal{N}(\mathcal V)\) correspond to elements of \(\mathcal V\), while edges correspond to pairs \(\alpha < \beta\). A sheaf or cosheaf on \(\mathcal V\) extends naturally to \(\mathcal{N}(\mathcal V)\). The stalk over a simplex \(\alpha \leq \cdots \leq \beta\) is \(\mathcal F(\beta)\), and the restriction maps between 0- and 1-simplices are \(\mathcal F(\alpha) \to \mathcal F(\alpha \leq \beta)\), \(x_\alpha \mapsto \mathcal F_{\alpha \leq \beta} x_\alpha\), and \(\mathcal F(\beta) \to \mathcal F(\alpha \leq \beta)\), \(x_\beta \mapsto x_\beta\) (for a cosheaf, the corresponding extension maps run the other way, from the edge stalk to the vertex stalks).
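Continuing the toy code, the part of the nerve we will actually need below is its 1-skeleton:

```python
# Vertices of N(V) are the regions themselves; 1-simplices are the nondegenerate chains
# alpha < beta, i.e. pairs of regions with beta a proper subset of alpha.  The stalk of an
# extended sheaf or cosheaf over the edge (alpha, beta) is F(beta).
vertices = list(V)
edges = [(a, b) for a in V for b in V if a != b and leq(a, b)]
```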
Extending a sheaf or cosheaf to \(\mathcal N(\mathcal V)\) preserves homology and cohomology (defined in terms of derived functors and all that), so this is not an unreasonable thing to do. The interpretations of \(H_0(\mathcal N(\mathcal V);\mathcal A)\) and \(H^0(\mathcal N(\mathcal V);\mathcal A^*)\) also carry over: homology classes of \(\mathcal A\) correspond to Hamiltonians expressed as a sum of local terms, and (normalized) cohomology classes of \(\mathcal A^*\) correspond to consistent local marginals (or pseudomarginals).
We now come to the part of the framework I have always found most confusing, because it’s seldom explained very well. Peltre (and others, probably) makes a distinction between the interaction potentials \(h_\alpha\), which are summed to get a Hamiltonian on \(S\), and the collection of local Hamiltonians \(H_\alpha\), obtained by treating each subset \(\alpha\) as if it were an isolated system and summing the potentials \(h_\beta\) for \(\beta \subseteq \alpha\). Both of these can be seen as 0-chains of \(\mathcal A\), but in general they are not homologous to one another. There is an invertible linear map \(\zeta: C_0(\mathcal{N}(\mathcal V);\mathcal A) \to C_0(\mathcal{N}(\mathcal V);\mathcal A)\), the zeta transform, which sends a collection of interaction potentials to its family of local Hamiltonians. The formula is straightforward: \((\zeta h)_\alpha = \sum_{\beta \subseteq \alpha} \mathcal A_{\alpha \leq \beta} h_\beta\). The inverse of \(\zeta\) is the Möbius transform \(\mu\) given by \((\mu H)_\alpha = \sum_{\beta \subseteq \alpha} \mu(\alpha, \beta)\, \mathcal{A}_{\alpha \leq \beta} H_\beta\), where \(\mu(\alpha, \beta)\) is the Möbius function of the poset \(\mathcal V\). (Summing the Möbius function over all regions containing a fixed \(\beta\) gives the counting numbers \(c_\beta\) that appear in the Bethe entropy below.) Notice that these formulas refer to the poset relations, not to the incidence of simplices in \(\mathcal N(\mathcal V)\), which is a bit frustrating; the simplicial structure doesn’t seem to be doing much here. However, \(\zeta\) and \(\mu\) can be extended to chain maps, and hence descend to homology. There are analogous operations on cochains of \(\mathcal A^*\), adjoint to the operations on the chains of \(\mathcal A\).
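Here is a sketch of \(\zeta\) and \(\mu\) for the toy example, with observables stored as numpy arrays indexed by the (sorted) variables of each region and the cosheaf extension map implemented by broadcasting. I compute the inverse recursively rather than tabulating Möbius function values; again, all the names are illustrative.

```python
import numpy as np

card = {0: 2, 1: 2, 2: 2}   # state-space sizes for the variables in the toy cover (all binary)

def extend(x_beta, beta, alpha, card=card):
    """Cosheaf map A_{alpha <= beta}: A(beta) -> A(alpha).  An observable on the variables
    of beta extends to alpha by ignoring the coordinates of alpha not in beta (broadcasting)."""
    alpha_vars = sorted(alpha)
    shape = [card[i] if i in beta else 1 for i in alpha_vars]
    target = [card[i] for i in alpha_vars]
    return np.broadcast_to(x_beta.reshape(shape), target).copy()

def zeta(h):
    """(zeta h)_alpha = sum_{beta subseteq alpha} A_{alpha <= beta} h_beta:
    interaction potentials -> local Hamiltonians."""
    return {a: sum(extend(h[b], b, a) for b in V if b <= a) for a in V}

def mobius(H):
    """Inverse of zeta, computed recursively (smallest regions first):
    h_alpha = H_alpha - sum over proper subregions beta of the extension of h_beta."""
    h = {}
    for a in sorted(V, key=len):
        h[a] = H[a] - sum((extend(h[b], b, a) for b in V if b < a),
                          np.zeros([card[i] for i in sorted(a)]))
    return h

# Sanity check: mobius really does invert zeta on a random 0-chain of interaction potentials.
rng = np.random.default_rng(0)
h = {a: rng.normal(size=[card[i] for i in sorted(a)]) for a in V}
hh = mobius(zeta(h))
assert all(np.allclose(hh[a], h[a]) for a in V)
```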
The pairing of \(H^0(\mathcal N(\mathcal V);\mathcal A^*)\) with \(H_0(\mathcal N(\mathcal V);\mathcal A)\) calculates the expected value of the Hamiltonian \(H\) determined by a class of interaction potentials \(h_\alpha\) with respect to a consistent set of marginals. If a chain instead represents a collection of local Hamiltonians, this pairing doesn’t compute an expectation. In that case, for \(p \in H^0\) and \(H \in C_0\), we have to take \(\langle p, \mu H\rangle\) to get the expectation. By adjointness, this is \(\langle \mu p, H\rangle\), where this second \(\mu\) denotes the adjoint Möbius transform on cochains.
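In the toy code, the pairing is just a sum of elementwise products over regions (assuming we store the functionals \(p_\alpha\) as arrays of the same shapes as the observables):

```python
def pairing(p, x):
    """<p, x> = sum over regions alpha of sum_s p_alpha(s) * x_alpha(s)."""
    return sum(float((p[a] * x[a]).sum()) for a in V)

# With a consistent, normalized family p and interaction potentials h, pairing(p, h) is the
# expected value of the total Hamiltonian.  Given local Hamiltonians H = zeta(h) instead,
# pair against mobius(H) (equivalently, apply the adjoint transform to p first).
```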
For reasons I don’t understand very deeply, these operations play a key role in formulating message-passing dynamics. My current understanding is that it has to do with the gradients of the Bethe entropy. The Bethe entropy is defined on a section of \(\mathcal A^*\) by the same sort of inclusion-exclusion we used before, but with the counting numbers \(c_\alpha\) of the cover (the sums of Möbius function values mentioned above, characterized by \(\sum_{\beta \supseteq \alpha} c_\beta = 1\)) as coefficients: \(\check{S}(p) = \sum_{\alpha} c_\alpha S(p_\alpha)\). Think of this as computing the local (i.e., marginalized) entropies, applying the Möbius transform, and summing the results.
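In code, the counting numbers can be computed by the recursion \(c_\alpha = 1 - \sum_{\beta \supsetneq \alpha} c_\beta\), and the Bethe entropy is then a weighted sum of local Shannon entropies (a sketch; it assumes the \(p_\alpha\) are normalized and strictly positive):

```python
def counting_numbers():
    """Counting numbers of the cover: c_alpha = 1 - sum of c_beta over strict superregions
    beta in V, so that sum_{beta containing alpha} c_beta = 1."""
    c = {}
    for a in sorted(V, key=len, reverse=True):   # largest regions first
        c[a] = 1 - sum(c[b] for b in V if a < b)
    return c

def bethe_entropy(p):
    """S_check(p) = sum_alpha c_alpha * S(p_alpha), where S is the Shannon entropy of the
    local marginal p_alpha (assumed normalized and strictly positive)."""
    c = counting_numbers()
    return sum(c[a] * float(-(p[a] * np.log(p[a])).sum()) for a in V)
```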
The Bethe free energy is the functional we minimize to find the approximate pseudomarginals: \(\langle p, h \rangle - \check{S}(p)\). Imposing the constraints that \(p\) be consistent (\(\delta p = 0\)) and normalized (\(\langle p_\alpha, \mathbf{1}_\alpha \rangle = 1\)) with Lagrange multipliers, the condition for a critical point is
\[h + \mu (\log p + \mathbf{1}) + \partial \lambda + \tau = 0\]which we can rearrange to
\[-\log p - \mathbf{1} = \zeta(h + \partial \lambda + \tau).\]The Lagrange multiplier \(\tau\) is an arbitrary 0-chain whose component at each vertex is a constant function, so its zeta transform has the same form, and we can absorb both the \(\zeta\) and the \(\mathbf 1\) into it; \(\lambda\) is an arbitrary 1-chain. Ultimately, we have the requirement that
\[-\log p = \zeta(h + \partial \lambda) + \tau.\]This holds on the level of cochains, and of course the cochain \(p\) must also represent a class in \(H^0(\mathcal{N}(\mathcal V);\mathcal A^*)\), i.e., satisfy \(\delta p = 0\). So in order to be a critical point of the Bethe free energy functional, \(-\log p\) must be the zeta transform of something homologous to the given local potential \(h\), up to a normalizing additive constant. Solving the marginalization problem amounts to a search for a local potential \(h'\) homologous to the given \(h\) that makes \(p = \exp(-\zeta h')\) a section of \(\mathcal A^*\). Note that every section of \(\mathcal A^*\) automatically has the same total mass in each stalk (marginalization preserves mass), so the normalization constraint isn’t all that interesting. Further, if \(p = \exp(-\zeta h')\), it is automatically positive.
One of the key insights now is that if we define a differential equation for \(h\) whose right-hand side always lies in the image of \(\partial\), the evolution will be confined to the homology class of the initial condition. So if we can construct an operator on \(C_0(\mathcal{N}(\mathcal V);\mathcal A)\) that takes values in the image of \(\partial\) and vanishes exactly when \(\exp(-\zeta h)\) is a section of \(\mathcal A^*\), we can use it to implement just such a differential equation.
Let \(\Delta = \partial \circ \mathcal D \circ \zeta\), where \(\mathcal D: C_0(\mathcal N(\mathcal V);\mathcal A) \to C_1(\mathcal N(\mathcal V); \mathcal A)\) is defined by \((\mathcal D H)_{\alpha \leq \beta} = \log(\mathcal A^*_{\alpha \leq \beta}\exp(-H_\alpha)) + H_\beta \in \mathcal A(\alpha \leq \beta) = \mathcal A(\beta)\). On each edge, \(\mathcal D H\) measures the difference between \(H_\beta\) and the Hamiltonian \(-\log(\mathcal A^*_{\alpha \leq \beta}\exp(-H_\alpha))\) obtained by treating \(\exp(-H_\alpha)\) as a density and marginalizing it. This operator clearly vanishes when \(\exp(-\zeta h)\) is a section of \(\mathcal A^*\), since \(\mathcal D \circ \zeta\) does. In fact, \(\Delta\) vanishes at precisely these points. The operator \(\Delta\) is somewhat mysterious, but if you squint hard enough it looks kind of like a Laplacian. In fact, its linearization around a given 0-chain appears to be a sheaf Laplacian.
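Here is what \(\Delta\) looks like in the toy code, with the sheaf map \(\mathcal A^*_{\alpha \leq \beta}\) implemented as summing out the variables of \(\alpha\) not in \(\beta\), and one assumed sign convention for the cosheaf boundary \(\partial\) (this, and the details of \(\mathcal D\), are the parts I am least sure faithfully match Peltre’s conventions):

```python
def marginalize(p_alpha, alpha, beta, card=card):
    """Sheaf map A*_{alpha <= beta}: A*(alpha) -> A*(beta), summing out the variables
    of alpha that are not in beta."""
    axes = tuple(i for i, v in enumerate(sorted(alpha)) if v not in beta)
    return p_alpha.sum(axis=axes)

def D(H):
    """(D H)_{alpha <= beta} = log(A*_{alpha <= beta} exp(-H_alpha)) + H_beta, a 1-chain
    of A that vanishes exactly when exp(-H) is a section of A*."""
    return {(a, b): np.log(marginalize(np.exp(-H[a]), a, b)) + H[b] for (a, b) in edges}

def boundary(c):
    """Cosheaf boundary C_1 -> C_0 with one chosen sign convention: an edge alpha <= beta
    contributes +c_e to the beta end and minus its extension to the alpha end."""
    out = {a: np.zeros([card[i] for i in sorted(a)]) for a in V}
    for (a, b), x in c.items():
        out[b] += x
        out[a] -= extend(x, b, a)
    return out

def Delta(h):
    """Delta = boundary(D(zeta(h))); its values lie in the image of the boundary map."""
    return boundary(D(zeta(h)))
```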
Thus the system of differential equations \(\dot{h} = -\Delta h\) has stationary points corresponding precisely to the critical points of the Bethe free energy. If you discretize this ODE with a naive Euler scheme with step size 1, you get a discrete-time evolution equation, which in turn induces dynamics on the local marginal estimates \(p_\alpha \propto \exp(-(\zeta h)_\alpha)\). It turns out that these dynamics on \(p\) are precisely the standard message-passing dynamics, typically described in terms of marginalization, sums, and products. This is a really neat result. And despite the complications involved in defining everything, I find this more comprehensible than the other expositions of generalized message-passing algorithms I’ve read. (I’m looking at you, Wainwright and Jordan.) I’d still like to understand this better, though. What exactly is the relationship between the operator \(\Delta\) and a sheaf Laplacian? Is there a sheaf-theoretic way to understand the max-product algorithm for finding maximum-likelihood elements rather than marginal distributions? Can we use (possibly conically constrained) cohomology of \(\mathcal A^*\) to say anything about the pseudomarginal problem?
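As a coda, here is the toy flow run end to end with the naive Euler scheme. I haven’t checked that my sign and normalization conventions reproduce the textbook message-passing schedule exactly, so treat this as an illustration of the homology-preserving dynamics rather than a reference implementation: if the iteration converges, its limit is a zero of \(\Delta\), i.e., a consistent family of pseudomarginals coming from a potential homologous to the initial one.

```python
def beliefs(h):
    """Normalized local pseudomarginal estimates p_alpha proportional to exp(-(zeta h)_alpha)."""
    H = zeta(h)
    return {a: np.exp(-H[a]) / np.exp(-H[a]).sum() for a in V}

def residual(p):
    """Maximum inconsistency over edges: zero exactly when p is a section of A*."""
    return max(float(np.max(np.abs(marginalize(p[a], a, b) - p[b]))) for (a, b) in edges)

h_t = {a: 0.3 * rng.normal(size=[card[i] for i in sorted(a)]) for a in V}
for step in range(20):
    d = Delta(h_t)                        # always in the image of the boundary map, so h_t
    h_t = {a: h_t[a] - d[a] for a in V}   # never leaves the homology class of its initial value
    print(step, residual(beliefs(h_t)))   # should shrink toward zero if the iteration converges
# A smaller step (h_t[a] - 0.5 * d[a]) damps the update if the full Euler step misbehaves.
```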