Let me pretend to be a physicist for a moment. Consider a system of $n$ particles, where each particle $i$ can have a state in some set $S_i$. The total state space of the system is then $S = \prod_i S_i$, which grows exponentially as $n$ increases. If we want to study probability distributions on the state space, we very quickly run out of space to represent them. Physicists have a number of clever tricks to represent and understand these systems more efficiently.

For instance, the time evolution of the system is often determined by assigning each state an energy and following some sort of Hamiltonian dynamics. That is, we have a function $H: S \to \Reals$ giving the energy of each state. But specifying the energy of exponentially many states is exponentially hard. One way to solve this problem is to define $H$ locally. The simplest way to do this is to say $H = \sum_i h_i$, where $h_i$ depends only on the state of particle $i$. But of course this means that there are no interactions between the particles. We can increase the scope of locality, letting $H = \sum_{\alpha} h_\alpha$, where each $\alpha$ is a subset of particles, and $h_\alpha$ depends only on the states of the particles in $\alpha$.

In a thermodynamic setting, the probability of a given state $\mathbf{s}$ is typically proportional to $\exp(-\beta H(\mathbf{s}))$, where $\beta$ is an inverse temperature parameter, making states with lower energies more probable. When $H = \sum_\alpha h_\alpha$, this becomes $\prod_\alpha \exp(-\beta h_\alpha(\mathbf{s}))$. This is relatively easy to compute for any single state $\mathbf{s}$. The hard part is finding the constant of proportionality. The actual probability is $p(\mathbf{s}) = \frac{1}{Z(\beta)} \prod_\alpha \exp(-\beta h_\alpha(\mathbf{s}))$, where $Z$ is the partition function, computed by summing over all possible states.
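To make this concrete, here is a minimal brute-force sketch in Python, with made-up local terms on four binary spins; the names `local_terms` and `boltzmann` are just illustrative, not standard. It evaluates $H = \sum_\alpha h_\alpha$ and computes $Z(\beta)$ by summing over all $2^4$ states, which is exactly the step that becomes infeasible as $n$ grows.

```python
import itertools
import math

# Toy system: 4 binary spins, S_i = {-1, +1}.
states = [-1, +1]
n = 4

# Local Hamiltonian terms h_alpha, each depending only on the variables in alpha.
# Here: made-up nearest-neighbor couplings on the chain 0-1-2-3.
local_terms = {
    (0, 1): lambda x: -1.0 * x[0] * x[1],
    (1, 2): lambda x: -1.0 * x[0] * x[1],
    (2, 3): lambda x: -0.5 * x[0] * x[1],
}

def H(s):
    """Global Hamiltonian H = sum_alpha h_alpha, evaluated on a full state s."""
    return sum(h(tuple(s[i] for i in alpha)) for alpha, h in local_terms.items())

def boltzmann(beta):
    """Brute-force Boltzmann distribution p(s) = exp(-beta * H(s)) / Z(beta)."""
    weights = {s: math.exp(-beta * H(s)) for s in itertools.product(states, repeat=n)}
    Z = sum(weights.values())  # the partition function: a sum over all 2^n states
    return {s: w / Z for s, w in weights.items()}, Z

p, Z = boltzmann(beta=1.0)
print(Z, sum(p.values()))  # the probabilities sum to 1.0
```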

We can think about this problem from a purely probabilistic perspective as well; it doesn’t hinge on the thermodynamics. Consider a collection of (not necessarily independent) discrete random variables $X_i$ for $i \in \mathcal I$, whose joint distribution is $p(\mathbf{x}) \propto \prod_{\alpha}\psi_{\alpha}(\mathbf{x}_\alpha)$, where each $\psi_\alpha$ is a nonnegative function defined for the random variables in some subset $\alpha$ of $\mathcal I$. Denote the collection of subsets $\alpha$ used in this product by $\mathcal V$.

Why is this interesting? Here are a few examples:

  • The Ising Model. This is one of the earliest statistical models for magnetism. We have a lattice of atoms, each with spin $X_i \in \{\pm 1\}$. The joint probability of finding the atoms in a given spin state is $p(\mathbf{x}) \propto \exp\left(\sum_i \phi_i x_i + \sum_{i\sim j} \theta_{ij} x_ix_j\right) = \prod_i \psi_i(x_i) \prod_{i \sim j}\psi_{ij}(x_i,x_j)$. We can write this in the Hamiltonian form as well: $H = -\sum_{i} \phi_i x_i - \sum_{i \sim j} \theta_{ij} x_i x_j$. (A concrete sketch of these potentials, and of the parity-check potentials below, follows this list.)

  • LDPC Codes. These are a ubiquitous tool in coding theory, which designs ways to transmit information that are robust to error. We encode some information in redundant binary code words, which are then transmitted. The code is designed so that if a few bits are corrupted during transmission, it is possible to recover the correct code word. LDPC codes are a particularly efficient type of code. Here we treat the code words as realizations of a tuple of random variables $(X_1,\ldots,X_n)$, valued in $\mathbb{F}_2$. The probability distribution used will be simple, and will be supported solely on the set of acceptable code words. For an $(n,k)$ LDPC code, a code word has length $n$ and satisfies a collection of $n-k$ constraints of the form $\sum_{i \in \alpha} x_i = 0$. In other words, the set of code words is the kernel of an $(n-k)\times n$ matrix with $\mathbb{F}_2$ entries.

    The local potential functions $\psi_\alpha$ are simple: they are equal to $1$ when $\sum_{i \in \alpha} x_i = 0$ and $0$ otherwise. Note that the corresponding local Hamiltonian would be infinite for non-codewords. The problem of finding the most likely code word given a received tuple is an inference problem with respect to this (or a closely related) distribution.

  • Graphical Models. In various statistical or machine learning tasks, it can be useful to specify a joint distribution that factors according to some local decomposition of the variables. In general, there should be a factor for each closely related set of variables. In a spatial task, these might be random variables associated with nearby points in space. The clusters might also correspond to already-known facts about a domain, like closely related proteins in an analysis of a biological system.
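Here is a small Python sketch of the first two kinds of local potentials. The coupling values and parity checks are made up for illustration, and helper names like `psi_node` and `psi_parity` are mine, not standard. It checks that $\exp(-H(\mathbf{x}))$ agrees with the product of Ising factors, and enumerates the support of a toy parity-check distribution.

```python
import itertools
import math

# --- Ising-type factors on a 4-node chain (nodes 0..3, spins in {-1,+1}) ---
phi = {0: 0.3, 1: -0.2, 2: 0.0, 3: 0.1}           # node fields (made-up numbers)
theta = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): -0.7}  # edge couplings (made-up numbers)

def psi_node(i, x):
    return math.exp(phi[i] * x)

def psi_edge(i, j, x, y):
    return math.exp(theta[(i, j)] * x * y)

def hamiltonian(s):
    return -sum(phi[i] * s[i] for i in phi) - sum(t * s[i] * s[j] for (i, j), t in theta.items())

# exp(-H(x)) equals the product of the local potentials, state by state.
for s in itertools.product([-1, 1], repeat=4):
    prod = math.prod([psi_node(i, s[i]) for i in phi] +
                     [psi_edge(i, j, s[i], s[j]) for (i, j) in theta])
    assert abs(prod - math.exp(-hamiltonian(s))) < 1e-9

# --- LDPC-style parity-check potential: 1 on code words, 0 otherwise ---
def psi_parity(alpha, x):
    """Indicator that the bits indexed by alpha sum to 0 over F_2."""
    return 1.0 if sum(x[i] for i in alpha) % 2 == 0 else 0.0

checks = [{0, 1, 2}, {1, 2, 3}]  # toy parity checks, not a real LDPC matrix
codewords = [x for x in itertools.product([0, 1], repeat=4)
             if all(psi_parity(a, x) == 1.0 for a in checks)]
print(len(codewords))  # the support of the (unnormalized) distribution
```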

There are a number of interesting facts about probability distributions that factor in this way. One is a form of conditional independence. Given a factorization $p(\mathbf{x}) \propto \prod_{\alpha} \psi_{\alpha}(\mathbf{x}_\alpha)$, we can produce a graph $G_{\mathcal V}$ whose vertices are associated with the random variables $X_i$ and whose edges are determined by adding a clique to the graph for each subset $\alpha \in \mathcal V$.
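A few lines of Python suffice to build $G_{\mathcal V}$ from a cover (the function name `clique_graph` and the example cover are just for illustration):

```python
from itertools import combinations

def clique_graph(cover):
    """Build G_V: one vertex per variable index, plus a clique on each alpha in the cover."""
    vertices = set().union(*cover)
    edges = {frozenset(e) for alpha in cover for e in combinations(sorted(alpha), 2)}
    return vertices, edges

# A hypothetical cover of four variables by three pairwise-overlapping sets.
V = [{1, 2}, {2, 3}, {3, 4}]
vertices, edges = clique_graph(V)
print(sorted(vertices), sorted(tuple(sorted(e)) for e in edges))
# -> [1, 2, 3, 4] and [(1, 2), (2, 3), (3, 4)]
```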

Another, slightly more topological way to construct this graph is as the 1-skeleton of a Dowker complex, where the relation is the inclusion relation between the set of indices $\mathcal I$ and the set $\mathcal V$ of subsets $\alpha$ used in the functions $\psi_\alpha$.

If we observe a set $S$ of random variables, and the removal of the corresponding set of nodes separates two vertices $u,v$ in the graph, then the corresponding random variables $X_u, X_v$ are conditionally independent given the observations. This property makes the collection of random variables a Markov random field for the graph. It is a fascinating but nontrivial fact that these two properties are equivalent: a collection of random variables with a strictly positive joint distribution factors over the cliques of $G_{\mathcal V}$ if and only if it is a Markov random field with respect to $G_{\mathcal V}$. This result is the Hammersley-Clifford theorem.
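The easy direction (factorization implies the Markov property) can be checked by brute force on a small example. Here is a sketch for a chain with pairwise factors and arbitrary positive potentials; the potentials are random made-up numbers and the helper names are mine. Removing the middle variable separates its neighbors in the graph, and the code verifies the corresponding conditional independence numerically.

```python
import itertools
import random

# A distribution on four binary variables that factors over the cliques
# {0,1}, {1,2}, {2,3} of a path graph 0-1-2-3, with arbitrary positive potentials.
random.seed(0)
psi = {alpha: {vals: random.uniform(0.1, 1.0)
               for vals in itertools.product([0, 1], repeat=2)}
       for alpha in [(0, 1), (1, 2), (2, 3)]}

def joint():
    w = {x: 1.0 for x in itertools.product([0, 1], repeat=4)}
    for x in w:
        for (i, j), table in psi.items():
            w[x] *= table[(x[i], x[j])]
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def marg(dist, keep):
    """Marginalize the joint onto the variables listed in `keep`."""
    out = {}
    for x, v in dist.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p = joint()

# Removing node 1 separates 0 from 2 in the graph, so X0 and X2 should be
# conditionally independent given X1: p(x0, x2 | x1) = p(x0 | x1) * p(x2 | x1).
for x1 in [0, 1]:
    p1 = marg(p, [1])[(x1,)]
    for x0, x2 in itertools.product([0, 1], repeat=2):
        lhs = marg(p, [0, 1, 2])[(x0, x1, x2)] / p1
        rhs = (marg(p, [0, 1])[(x0, x1)] / p1) * (marg(p, [1, 2])[(x1, x2)] / p1)
        assert abs(lhs - rhs) < 1e-9
print("conditional independence verified")
```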

The interpretation of the graph $G_{\mathcal V}$ as the 1-skeleton of a Dowker complex hints at some deeper relationships between Markov random fields, factored probability distributions, and topology. My goal here is to bring some of these relationships into clearer focus, with a good deal of help from Olivier Peltre’s PhD thesis Message Passing Algorithms and Homology.

Let’s construct a simplicial complex and some sheaves from the scenario we’ve been considering. Start with a set $\mathcal I$ indexing random variables $\{X_i\}_{i \in \mathcal I}$, with $S_i$ the codomain of the random variable (here just a finite set) for each $i \in \mathcal I$. Take a cover $\mathcal V$ of $\mathcal I$, i.e., a collection of subsets of $\mathcal I$ whose union is $\mathcal I$. Let $X$ be the Čech nerve of this cover. That is, $X$ is a simplicial complex with a 0-simplex $[\alpha]$ for each $\alpha \in \mathcal V$, a 1-simplex $[\alpha, \beta]$ for each pair $\alpha,\beta \in \mathcal V$ with nonempty intersection (and we may as well assume $\alpha,\beta$ are distinct), a 2-simplex $[\alpha,\beta,\gamma]$ for each triple with nonempty intersection, etcetera.

We’ll use a running example of an Ising-type model on a very small graph. Let $\mathcal I = \{1,2,3,4\}$, $\mathcal{V} = \{\{1,2\},\{2,3\},\{3,4\}\}$, and $S_i = \{\pm 1\}$. $X$ is then a path graph with 3 nodes and 2 edges. A warning: $X$ is not the same as $G_{\mathcal V}$, which has 4 nodes and 3 edges. Each vertex of $X$ corresponds to an edge of $G_{\mathcal V}$. $X$ and $G_{\mathcal V}$ (or, really, the clique complex of $G_{\mathcal V}$) are dual in a sense: maximal simplices of the clique complex of $G_{\mathcal V}$ correspond to vertices of $X$.
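A quick sketch of the nerve construction, run on this cover (the function `nerve` is a hypothetical helper, and I cap the dimension since only low-dimensional simplices show up here):

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Simplices of the Cech nerve: subsets of the cover with nonempty common intersection."""
    cover = [frozenset(a) for a in cover]
    simplices = []
    for k in range(1, max_dim + 2):  # k cover sets span a (k-1)-simplex
        for subset in combinations(cover, k):
            if frozenset.intersection(*subset):
                simplices.append(subset)
    return simplices

V = [{1, 2}, {2, 3}, {3, 4}]
for s in nerve(V):
    print([sorted(a) for a in s])
# Vertices: [{1,2}], [{2,3}], [{3,4}]; edges: [{1,2},{2,3}] and [{2,3},{3,4}]; no 2-simplices.
```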

Variables, covering, and states

There is a natural cellular sheaf of sets on $X$, given by letting $\mathcal{F}([\alpha_1,\ldots,\alpha_k]) = \prod_{i \in \bigcap_j \alpha_j} S_i$. That is, over the simplex corresponding to the intersection $\alpha_1\cap\cdots\cap \alpha_k$, the stalk is the set of all possible outcomes for the random variables contained in that intersection. The restriction maps of $\mathcal F$ are given by the projection maps of the product. For instance, $\mathcal{F}_{[\alpha] \trianglelefteq [\alpha,\beta]}$ is the projection $\prod_{i \in \alpha} S_i \to \prod_{i \in \alpha \cap \beta} S_i$. Let’s denote $v_1 = [\{1,2\}]$, $v_2 = [\{2,3\}]$, $v_3 = [\{3,4\}]$, and $e_{12} = [\{1,2\},\{2,3\}]$, $e_{23} = [\{2,3\},\{3,4\}]$. Then $\mathcal F(v_i) = \{(\pm 1,\pm 1)\}$ for all $i$ and $\mathcal F(e_{ij}) = \{\pm 1\}$. The restriction map $\mathcal F_{v_1 \trianglelefteq e_{12}}$ sends $(x,y) \mapsto y$. We call this sheaf $\mathcal F$ the sheaf of (deterministic) states of the system. Its sections correspond exactly to global states in $S = \prod_i S_i$.
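Here is a small sketch of $\mathcal F$ for the running example, with dictionaries standing in for stalks; `stalk` and `restrict` are names I made up. It enumerates assignments to the vertex stalks that agree under the restriction maps to the edge stalks, and finds exactly $2^4 = 16$ of them, matching the global states in $S$.

```python
import itertools

V = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]  # vertices of X
E = [(0, 1), (1, 2)]                                           # edges of X, as pairs of vertex indices
S = {i: [-1, +1] for i in [1, 2, 3, 4]}

def stalk(indices):
    """F(sigma): all joint outcomes for the variables indexed by `indices`."""
    idx = sorted(indices)
    return [dict(zip(idx, vals)) for vals in itertools.product(*(S[i] for i in idx))]

def restrict(outcome, indices):
    """Restriction map of F: project an outcome onto a smaller index set."""
    return {i: outcome[i] for i in sorted(indices)}

# A section assigns an outcome to each vertex stalk so that restrictions to edge stalks agree.
# (Values on edge stalks are determined by restriction, so we only track vertex stalks.)
sections = []
for choice in itertools.product(*(stalk(a) for a in V)):
    ok = all(restrict(choice[u], V[u] & V[v]) == restrict(choice[v], V[u] & V[v])
             for u, v in E)
    if ok:
        sections.append(choice)

print(len(sections))  # 16 = |S| = 2^4: sections correspond to global states
```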

The Ising state sheaf

We can construct a cellular cosheaf of vector spaces on $X$ by letting the stalk over a simplex $\sigma$ be the vector space $\Reals^{\mathcal F(\sigma)}$. The extension maps of this cosheaf are those induced by the functor $\Reals^{(-)}$. You might call this “cylindrical extension”: when $f: \mathcal F([\alpha]) \to \mathcal F([\alpha, \beta])$ is a restriction map of $\mathcal F$, a function $h: \mathcal F([\alpha, \beta]) \to \Reals$ pulls back to a function $h^*: \mathcal F([\alpha]) \to \Reals$ by $h^*(x) = h(f(x))$. We call the resulting cosheaf $\mathcal A$ the cosheaf of observables.

In the Ising example, $\mathcal A(v_i) = \Reals^{\mathcal F(v_i)} = \Reals^{\{\pm 1\}\times \{\pm 1\}}\simeq \Reals^{2}\otimes \Reals^2$ (with basis $\{e_{\pm 1} \otimes e_{\pm 1}\}$) and $\mathcal A(e_{ij}) = \Reals^{\mathcal F(e_{ij})} = \Reals^{\{\pm 1\}} \simeq \Reals^2$. The extension map $\mathcal A_{v_1 \trianglelefteq e_{12}}$ sends $x$ to $(e_1 + e_2) \otimes x$.
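In code, cylindrical extension is just precomposition with the projection. A tiny sketch for the $v_1 \trianglelefteq e_{12}$ case, with dictionaries for observables; `extend` is an illustrative name:

```python
# Observables on a stalk F(sigma) are real-valued functions, stored as dicts keyed by outcomes.
S2 = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]  # F(v1) = S_1 x S_2; F(e12) = S_2 = {-1, +1}

def restriction(xy):
    """F_{v1 <= e12}: (x1, x2) -> x2."""
    return xy[1]

def extend(h_edge):
    """Cylindrical extension A_{v1 <= e12}: pull a function on F(e12) back to F(v1)."""
    return {xy: h_edge[restriction(xy)] for xy in S2}

h = {-1: 2.0, +1: -1.0}  # an observable depending only on X_2
print(extend(h))          # constant in the first coordinate, matching the (e_1 + e_2) ⊗ x picture
```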

You can probably already see that notation is a major struggle when working with these objects. It’s a problem akin to keeping track of tensor indices. One reason the sheaf-theoretic viewpoint is helpful is that, once defined, it gives us a less notationally fussy structure to hang all this data from.

The linear dual of the cosheaf $\mathcal A$ is a sheaf $\mathcal A^*$ with isomorphic stalks and restriction maps given by “integrating” or summing along fibers. If we take the obvious inner product on the stalks of $\mathcal A$ (the one given by choosing the indicator functions for elements of $\mathcal F(\sigma)$ as an orthonormal basis), these restriction maps are the adjoints of the extension maps of $\mathcal A$. Probability distributions on $\mathcal{F}([\alpha]) = \prod_{i \in \alpha} S_i$ naturally lie inside $\mathcal A^*([\alpha])$; the restriction map $\mathcal A^*([\alpha]) \to \mathcal A^*([\alpha, \beta])$ computes marginalization onto a subset of random variables.
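A quick numerical check of this adjointness for the $v_1 \trianglelefteq e_{12}$ pair; the probability vector and observable are arbitrary made-up values, and `marginalize` and `extend` are my names for the two maps:

```python
import itertools

S2 = list(itertools.product([-1, +1], repeat=2))  # F(v1) = S_1 x S_2
S_edge = [-1, +1]                                  # F(e12) = S_2

def extend(h_edge):
    """Extension map of A: cylindrical extension along the projection (x1, x2) -> x2."""
    return {xy: h_edge[xy[1]] for xy in S2}

def marginalize(p_v):
    """Restriction map of A*: sum along the fibers of the projection, i.e. marginalize onto X_2."""
    out = {y: 0.0 for y in S_edge}
    for (x1, x2), mass in p_v.items():
        out[x2] += mass
    return out

# Adjointness: <marginalize(p), h> = <p, extend(h)>.
p = {(-1, -1): 0.1, (-1, +1): 0.2, (+1, -1): 0.3, (+1, +1): 0.4}
h = {-1: 2.0, +1: -1.0}
lhs = sum(marginalize(p)[y] * h[y] for y in S_edge)
rhs = sum(p[xy] * extend(h)[xy] for xy in S2)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```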

The probability distributions on $\mathcal{F}([\alpha])$ are those elements $p$ of $\mathcal A^*([\alpha])$ with $p(\mathbf{1}) = 1$ and $p(x^2) \geq 0$ for every observable $x$. Global sections of $\mathcal A^*$ which locally satisfy these constraints are sometimes known as pseudomarginals. They do not always correspond to probability distributions on $S = \prod_{i\in \mathcal{I}} S_i$. But they’re as close as we can get while only working with local information. You can always get some putative global distribution, but it might assign negative probabilities to some events, even if all the marginals are nonnegative.[1] We can therefore think of $\mathcal A^*$ (or at least the subsheaf given by the probability simplex in each stalk) as a sheaf of probabilistic states.

Because $\mathcal A^*$ is the linear dual of $\mathcal A$, stalks of $\mathcal A^*$ act on stalks of $\mathcal A$. If $p \in \mathcal A^*(\sigma)$ is a probability distribution on $\mathcal{F}(\sigma)$, its pairing with $h \in \mathcal A(\sigma)$, written $\langle p, h\rangle$, is the expectation of $h$ with respect to the probability distribution $p$. Further, this pairing commutes with restriction and extension maps: $\langle p_\sigma,\mathcal A_{\sigma \trianglelefteq \tau} h_\tau \rangle = \langle \mathcal A^*_{\sigma \trianglelefteq \tau} p_\sigma, h_\tau\rangle$.

Suppose $h$ is a 0-chain of $\mathcal A$; we can think of $h$ as defining a global Hamiltonian $H$ on $S$. That is, $H = \sum_{\alpha \in \mathcal V} h_\alpha \circ \pi_{\alpha}$, where $\pi_\alpha: S \to \mathcal{F}([\alpha])$ is the natural projection. The expectation of such an observable $H$ depends only on the marginals $p_\alpha$ of the global probability distribution $P$ for each set $\alpha$. So if $p$ is the 0-cochain of marginals of $P$, $\mathbb{E}_P[H] = \langle p, h\rangle$.

Further, if hh and hh' are homologous 0-chains, they define the same global function HH; adding a boundary x\partial x to hh corresponds to shifting the accounting of the energy associated with some random variables from one covering set to another. We therefore naturally have an action of H0(A;X)H^0(\mathcal{A}^*;X) on H0(A;X)H_0(\mathcal A; X). It computes the expectation of a local Hamiltonian with respect to a set of pseudomarginals. So in order to compute the expected energy if the system state follows a certain probability distribution, we only need the marginals and the local Hamiltonians. The locality in the systems we consider is captured in this sheaf-cosheaf pair.

So far, all we’ve done is represent the structure of a graphical model in terms of a sheaf-cosheaf pair. The duality between marginal distributions and local Hamiltonians is implemented by linear duality of sheaves and cosheaves. Next time we’ll see if there’s anything we can actually do with this language. Can we use it to understand or design inference algorithms? (Answer: yes, sort of.)

  1. It’s a fun exercise to use the discrete Fourier transform and the Fourier slice theorem to show this.