Sheaves and Probability, Part 1
Let me pretend to be a physicist for a moment. Consider a system of \(n\) particles, where each particle \(i\) can have a state in some set \(S_i\). The total state space of the system is then \(S = \prod_i S_i\), which grows exponentially as \(n\) increases. If we want to study probability distributions on the state space, we very quickly run out of space to represent them. Physicists have a number of clever tricks to represent and understand these systems more efficiently.
For instance, the time evolution of the system is often determined by assigning each state an energy and following some sort of Hamiltonian dynamics. That is, we have a function \(H: S \to \Reals\) giving the energy of each state. But specifying the energy of exponentially many states is exponentially hard. One way to solve this problem is to define \(H\) locally. The simplest way to do this is to say \(H = \sum_i h_i\), where \(h_i\) depends only on the state of particle \(i\). But of course this means that there are no interactions between the particles. We can increase the scope of locality, letting \(H = \sum_{\alpha} h_\alpha\), where each \(\alpha\) is a subset of particles, and \(h_\alpha\) depends only on the states of the particles in \(\alpha\).
In a thermodynamic setting, the probability of a given state \(\mathbf{s}\) is typically proportional to \(\exp(-\beta H(\mathbf{s}))\), where \(\beta\) is an inverse temperature parameter, making states with lower energies more probable. When \(H = \sum_\alpha h_\alpha\), this becomes \(\prod_\alpha \exp(-\beta h_\alpha(\mathbf{s}))\). This is relatively easy to compute for any single state \(\mathbf{s}\). The hard part is finding the constant of proportionality. The actual probability is \(p(\mathbf{s}) = \frac{1}{Z(\beta)} \prod_\alpha \exp(-\beta h_\alpha(\mathbf{s}))\), where \(Z\) is the partition function, computed by summing over all possible states.
We can think about this problem from a purely probabilistic perspective as well; it doesn’t hinge on the thermodynamics. Consider a set of non-independent discrete random variables \(X_i\) for \(i \in \mathcal I\), whose joint distribution is \(p(\mathbf{x}) \propto \prod_{\alpha}\psi_{\alpha}(\mathbf{x}_\alpha)\), where each \(\psi_\alpha\) is a nonnegative function defined for the random variables in some subset \(\alpha\) of \(\mathcal I\). Denote the collection of subsets \(\alpha\) used in this product by \(\mathcal V\).
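To make this concrete, here is a minimal Python sketch of a distribution that factors over three pairwise potentials, with the normalizing constant computed by brute-force enumeration—exactly the step that becomes infeasible as the number of variables grows. The variables, factor names, and coupling values are all made up for illustration.

```python
import itertools
import math

# Toy factored distribution: four binary variables, one factor for each of
# the index subsets {1,2}, {2,3}, {3,4}.  The couplings are arbitrary values
# chosen only for illustration.
states = [-1, +1]
indices = [1, 2, 3, 4]

def psi_pair(theta):
    # Ising-style pairwise factor psi(x_i, x_j) = exp(theta * x_i * x_j).
    return lambda xi, xj: math.exp(theta * xi * xj)

factors = {
    (1, 2): psi_pair(0.5),
    (2, 3): psi_pair(-0.3),
    (3, 4): psi_pair(1.0),
}

def unnormalized(x):
    # x is a dict {index: value}; the joint density is the product of the
    # local factors, each reading only its own variables.
    prod = 1.0
    for (i, j), psi in factors.items():
        prod *= psi(x[i], x[j])
    return prod

# The expensive step: the normalizing constant Z sums over all 2^4 states.
Z = sum(unnormalized(dict(zip(indices, xs)))
        for xs in itertools.product(states, repeat=len(indices)))

def p(x):
    return unnormalized(x) / Z

print(p({1: +1, 2: +1, 3: -1, 4: -1}))
```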
Why is this interesting? Here are a few examples:
- The Ising Model. This is one of the earliest statistical models for magnetism. We have a lattice of atoms, each with spin \(X_i \in \{\pm 1\}\). The joint probability of finding the atoms in a given spin state is \(p(\mathbf{x}) \propto \exp(\sum_i \phi_i x_i + \sum_{i\sim j} \theta_{ij} x_ix_j) = \prod_i \psi_i(x_i) \prod_{i \sim j}\psi_{ij}(x_i,x_j)\). We can write this in the Hamiltonian form as well: \(H = -\sum_{i} \phi_i x_i - \sum_{i \sim j} \theta_{ij} x_i x_j\).
- LDPC Codes. These are a ubiquitous tool in coding theory, the field that designs ways to transmit information robustly in the presence of errors. We encode some information in redundant binary codewords, which are then transmitted. The code is designed so that if a few bits are corrupted during transmission, it is possible to recover the correct codeword. LDPC codes are a particularly efficient type of code. Here we treat the codewords as realizations of a tuple of random variables \((X_1,\ldots,X_n)\), valued in \(\mathbb{F}_2\). The probability distribution used will be simple, and will be supported solely on the set of acceptable codewords. For an \((n,k)\) LDPC code, a codeword has length \(n\) and satisfies a collection of \(n-k\) constraints of the form \(\sum_{i \in \alpha} x_i = 0\). In other words, the set of codewords is the kernel of an \((n-k)\times n\) matrix with \(\mathbb{F}_2\) entries.
The local potential functions \(\psi_\alpha\) are simple: they are equal to 1 when \(\sum_{i \in \alpha} x_i = 0\) and 0 otherwise (a small sketch of these indicator factors appears after this list). Note that the corresponding local Hamiltonian would be infinite for non-codewords. The problem of finding the most likely codeword given a received tuple is an inference problem with respect to this (or a closely related) distribution.
- Graphical Models. In various statistical or machine learning tasks, it can be useful to specify a joint distribution that factors according to some local decomposition of the variables. In general, there should be a factor for each set of closely related variables. In a spatial task, these might be random variables associated with nearby points in space. The clusters might also correspond to already-known facts about a domain, like closely related proteins in an analysis of a biological system.
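Here is the small sketch promised above for the LDPC example: a hypothetical 4-bit code with two parity checks, where each \(\psi_\alpha\) is the indicator of its constraint and the support of the distribution is exactly the set of codewords. The code and check sets are invented for illustration.

```python
import itertools

# A hypothetical toy parity-check code: n = 4 bits and n - k = 2 checks.
# Each check alpha is a subset of bit indices whose sum must be 0 mod 2.
checks = [(0, 1, 2), (1, 2, 3)]

def psi(alpha, x):
    # Indicator factor: 1 if the parity constraint on alpha holds, else 0.
    return 1 if sum(x[i] for i in alpha) % 2 == 0 else 0

def unnormalized(x):
    # Product of the indicator factors: nonzero exactly on codewords.
    prod = 1
    for alpha in checks:
        prod *= psi(alpha, x)
    return prod

# The support of the distribution is the kernel of the parity-check matrix
# over F_2, i.e. the set of codewords.
codewords = [x for x in itertools.product([0, 1], repeat=4)
             if unnormalized(x) == 1]
print(codewords)   # 2^k = 4 codewords for this toy (4, 2) code
```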
There are a number of interesting facts about probability distributions that factor in this way. One is a form of conditional independence. Given a factorization \(p(\mathbf{x}) \propto \prod_{\alpha} \psi_{\alpha}(\mathbf{x}_\alpha)\), we can produce a graph \(G_{\mathcal V}\) whose vertices are associated with the random variables \(X_i\) and whose edges are determined by adding a clique to the graph for each subset \(\alpha \in \mathcal V\).
Another, slightly more topological way to construct this graph is as the 1-skeleton of a Dowker complex, where the relation is the inclusion relation between the set of indices \(\mathcal I\) and the set \(\mathcal V\) of subsets \(\alpha\) used in the functions \(\psi_\alpha\).
If we observe a set \(S\) of random variables, and the removal of the corresponding set of nodes separates two vertices \(u,v\) in the graph, the corresponding random variables \(X_u, X_v\) are conditionally independent given the observations. This property makes the collection of random variables a Markov random field for the graph. It is a fascinating but nontrivial fact that these two properties are equivalent: a set of random variables with a strictly positive joint distribution has a distribution that splits into factors corresponding to the cliques of \(G_{\mathcal V}\) if and only if it is a Markov random field with respect to \(G_{\mathcal V}\). This result is the Hammersley-Clifford theorem.
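As a sanity check on the separation criterion, here is a short standard-library sketch: it builds \(G_{\mathcal V}\) by adding a clique per covering set and tests whether two vertices become disconnected once the observed vertices are removed. The cover and the queries below are arbitrary examples.

```python
import itertools
from collections import deque

# Example cover V; each subset alpha contributes a clique to G_V.
cover = [{1, 2}, {2, 3}, {3, 4}]

# Build G_V as an adjacency dict by adding a clique for each alpha.
vertices = set().union(*cover)
adj = {v: set() for v in vertices}
for alpha in cover:
    for u, v in itertools.combinations(alpha, 2):
        adj[u].add(v)
        adj[v].add(u)

def separated(u, v, observed):
    # BFS in G_V with the observed vertices removed.  If u cannot reach v,
    # the Markov property says X_u and X_v are conditionally independent
    # given the observed variables.
    if u in observed or v in observed:
        return False
    seen, queue = {u}, deque([u])
    while queue:
        w = queue.popleft()
        for nbr in adj[w]:
            if nbr not in observed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return v not in seen

print(separated(1, 4, {3}))    # True: X_1 and X_4 are independent given X_3
print(separated(1, 3, set()))  # False: there is a path 1 - 2 - 3
```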
The interpretation of the graph \(G_{\mathcal V}\) as the 1-skeleton of a Dowker complex hints at some deeper relationships between Markov random fields, factored probability distributions, and topology. My goal here is to bring some of these relationships into clearer focus, with a good deal of help from Olivier Peltre’s PhD thesis Message Passing Algorithms and Homology.
Let’s construct a simplicial complex and some sheaves from the scenario we’ve been considering. Start with a set \(\mathcal I\) indexing random variables \(\{X_i\}_{i \in \mathcal I}\), with \(S_i\) the codomain of the random variable (here just a finite set) for each \(i \in \mathcal I\). Take a cover \(\mathcal V\) of \(\mathcal I\), i.e., a collection of subsets of \(\mathcal I\) whose union is \(\mathcal I\). Let \(X\) be the Čech nerve of this cover. That is, \(X\) is a simplicial complex with a 0-simplex \([\alpha]\) for each \(\alpha \in \mathcal V\), a 1-simplex \([\alpha, \beta]\) for each pair \(\alpha,\beta \in \mathcal V\) with nonempty intersection (and we may as well assume \(\alpha,\beta\) are distinct), a 2-simplex \([\alpha,\beta,\gamma]\) for each triple with nonempty intersection, etcetera.
We’ll use a running example of an Ising-type model on a very small graph. Let \(\mathcal I = \{1,2,3,4\}\), \(\mathcal{V} = \{\{1,2\},\{2,3\},\{3,4\}\}\), and \(S_i = \{\pm 1\}\). \(X\) is then a path graph with 3 nodes and 2 edges. A warning: \(X\) is not the same as \(G_{\mathcal V}\), which has 4 nodes and 3 edges. Each vertex of \(X\) corresponds to an edge of \(G_{\mathcal V}\). \(X\) and \(G_{\mathcal V}\) (or, really, the clique complex of \(G_{\mathcal V}\)) are dual in a sense: maximal simplices of the clique complex of \(G_{\mathcal V}\) correspond to vertices of \(X\).
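Here is a quick sketch (the `nerve` helper is just for illustration) that enumerates the simplices of the Čech nerve for this running example and confirms that \(X\) is a path graph with three vertices and two edges.

```python
import itertools

# Running example: cover of I = {1,2,3,4} by three two-element sets.
cover = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]

def nerve(cover):
    # A k-simplex of the Cech nerve is a (k+1)-element subset of the cover
    # whose members have nonempty common intersection.
    simplices = []
    for k in range(1, len(cover) + 1):
        for subset in itertools.combinations(cover, k):
            if frozenset.intersection(*subset):
                simplices.append(subset)
    return simplices

for s in nerve(cover):
    print([sorted(a) for a in s])
# Three 0-simplices (one per covering set) and two 1-simplices
# ([{1,2},{2,3}] and [{2,3},{3,4}]): a path graph, as claimed.
```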
There is a natural cellular sheaf of sets on \(X\), given by letting \(\mathcal{F}([\alpha_1,\ldots,\alpha_k]) = \prod_{i \in \bigcap_j \alpha_j} S_i\). That is, over the simplex corresponding to the sets \(\alpha_1,\ldots,\alpha_k\), the stalk is the set of all possible outcomes for the random variables contained in the intersection \(\alpha_1\cap\cdots\cap \alpha_k\). The restriction maps of \(\mathcal F\) are given by the projection maps of the product. For instance, \(\mathcal{F}_{[\alpha] \trianglelefteq [\alpha,\beta]}\) is the projection \(\prod_{i \in \alpha} S_i \to \prod_{i \in \alpha \cap \beta} S_i\). Let’s denote \(v_1 = [\{1,2\}]\), \(v_2 = [\{2,3\}]\), \(v_3 = [\{3,4\}]\), and \(e_{12} = [\{1,2\},\{2,3\}]\), \(e_{23} = [\{2,3\},\{3,4\}]\). Then \(\mathcal F(v_i) = \{(\pm 1,\pm 1)\}\) for all \(i\) and \(\mathcal F(e_{ij}) = \{\pm 1\}\). The restriction map \(\mathcal F_{v_1 \trianglelefteq e_{12}}\) sends \((x,y) \mapsto y\). We call this sheaf \(\mathcal F\) the sheaf of (deterministic) states of the system. Its sections correspond exactly to global states in \(S = \prod_i S_i\).
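A sketch of this state sheaf on the running example, with stalks stored as explicit sets of outcomes and restrictions as projections; the final count confirms that global sections biject with \(S = \prod_i S_i\). The dictionary names are invented for the example.

```python
import itertools

# Outcome sets S_i and the stalk index sets of the running example.
S = {1: [-1, +1], 2: [-1, +1], 3: [-1, +1], 4: [-1, +1]}
v = {"v1": [1, 2], "v2": [2, 3], "v3": [3, 4]}            # vertex index sets
e = {"e12": ("v1", "v2", [2]), "e23": ("v2", "v3", [3])}  # edges: faces + indices

def stalk(indices):
    # Stalk of the state sheaf: all joint outcomes for the given indices,
    # stored as dicts {index: value}.
    return [dict(zip(indices, xs))
            for xs in itertools.product(*(S[i] for i in indices))]

def restrict(x, indices):
    # Restriction map: project a vertex outcome onto an edge's index set.
    return {i: x[i] for i in indices}

# A global section assigns an outcome to every vertex stalk so that the two
# restrictions agree over every edge.
sections = []
for assignment in itertools.product(*(stalk(v[name]) for name in v)):
    xs = dict(zip(v, assignment))
    if all(restrict(xs[a], idx) == restrict(xs[b], idx)
           for a, b, idx in e.values()):
        sections.append(xs)

print(len(sections))   # 16 == |S_1| * |S_2| * |S_3| * |S_4|
```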
We can construct a cellular cosheaf of vector spaces on \(X\) by letting each stalk be the vector space \(\Reals^{\mathcal F(\alpha)}\). The extension maps of this cosheaf are those induced by the functor \(\Reals^{(-)}\). You might call this “cylindrical extension”: when \(f: \mathcal F([\alpha]) \to \mathcal F([\alpha, \beta])\), a function \(h: \mathcal F([\alpha, \beta]) \to \Reals\) pulls back to a function \(h^*: \mathcal F([\alpha]) \to \Reals\) by \(h^*(x) = h(f(x))\). We call the resulting cosheaf \(\mathcal A\), the cosheaf of observables.
In the Ising example, \(\mathcal A(v_i) = \Reals^{\mathcal F(v_i)} = \Reals^{\{\pm 1\}\times \{\pm 1\}}\simeq \Reals^{2}\otimes \Reals^2\) (with basis \(\{e_{\pm 1} \otimes e_{\pm 1}\}\)) and \(\mathcal A(e_{ij}) = \Reals^{\mathcal F(e_{ij})} = \Reals^{\{\pm 1\}} \simeq \Reals^2\). The extension map \(\mathcal A_{v_1 \trianglelefteq e_{12}}\) sends \(x\) to \((e_{+1} + e_{-1}) \otimes x\).
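In code, the extension map is just precomposition with the projection. A tiny sketch for the running example (the function name is made up):

```python
# Observables as dicts from outcomes to reals.  The extension map of the
# cosheaf A pulls a function on the edge stalk F(e12) = {+1, -1} (values of
# x_2) back along the projection (x_1, x_2) -> x_2: "cylindrical extension".

def extend_to_v1(h_edge):
    # Returns an observable on F(v1) = {+1,-1} x {+1,-1} that ignores x_1.
    return {(x1, x2): h_edge[x2] for x1 in (-1, +1) for x2 in (-1, +1)}

h = {-1: 0.0, +1: 1.0}     # an observable depending on x_2 alone
print(extend_to_v1(h))
# {(-1, -1): 0.0, (-1, 1): 1.0, (1, -1): 0.0, (1, 1): 1.0}
```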
You can probably already see that notation is a major struggle when working with these objects. It’s a problem akin to keeping track of tensor indices. One reason the sheaf-theoretic viewpoint is helpful is that, once defined, it gives us a less notationally fussy structure to hang all this data from.
The linear dual of the cosheaf \(\mathcal A\) is a sheaf \(\mathcal A^*\) with isomorphic stalks and restriction maps given by “integrating” or summing along fibers. If we take the obvious inner product on the stalks of \(\mathcal A\) (the one given by choosing the indicator functions for elements of \(\mathcal F(\sigma)\) as an orthonormal basis), these restriction maps are the adjoints of the extension maps of \(\mathcal A\). Probability distributions on \(\mathcal{F}([\alpha]) = \prod_{i \in \alpha} S_i\) naturally lie inside \(\mathcal A^*([\alpha])\); the restriction map \(\mathcal A^*([\alpha]) \to \mathcal A^*([\alpha, \beta])\) computes marginalization onto a subset of random variables.
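Concretely, marginalization as a restriction map looks like this. A sketch on the running example; the distribution below is an arbitrary (exactly representable) choice.

```python
from collections import defaultdict

# An element of A*(v1): a probability distribution on F(v1) = {+1,-1} x {+1,-1}.
p_v1 = {(-1, -1): 0.125, (-1, +1): 0.25, (+1, -1): 0.375, (+1, +1): 0.25}

def marginalize_to_e12(p):
    # Restriction map A*(v1) -> A*(e12): sum along the fibers of the
    # projection (x_1, x_2) -> x_2, i.e. marginalize out x_1.
    out = defaultdict(float)
    for (x1, x2), mass in p.items():
        out[x2] += mass
    return dict(out)

print(marginalize_to_e12(p_v1))   # {-1: 0.5, 1: 0.5}
```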
The probability distributions on \(\mathcal{F}([\alpha])\) are those elements \(p\) of \(\mathcal A^*([\alpha])\) with \(p(\mathbf{1}) = 1\) and \(p(x^2) \geq 0\) for every observable \(x\). Global sections of \(\mathcal A^*\) which locally satisfy these constraints are sometimes known as pseudomarginals. They do not always correspond to probability distributions on \(S = \prod_{i\in \mathcal{I}} S_i\). But they’re as close as we can get while only working with local information. You can always get some putative global distribution, but it might assign negative probabilities to some events, even if all the marginals are nonnegative.[1] We can therefore think of \(\mathcal A^*\) (or at least the subsheaf given by the probability simplex in each stalk) as a sheaf of probabilistic states.
Because \(\mathcal A^*\) is the linear dual of \(\mathcal A\), stalks of \(\mathcal A^*\) act on stalks of \(\mathcal A\). If \(p \in \mathcal A^*(\sigma)\) is a probability distribution on \(\mathcal{F}(\sigma)\), its action on \(h \in \mathcal A(\sigma)\), written \(\langle p, h\rangle\), computes the expectation of \(h\) with respect to the probability distribution \(p\). Further, this action commutes with the restriction and extension maps: \(\langle p_\sigma,\mathcal A_{\sigma \trianglelefteq \tau} h_\tau \rangle = \langle \mathcal A^*_{\sigma \trianglelefteq \tau} p_\sigma, h_\tau\rangle\).
Suppose \(h\) is a 0-chain of \(\mathcal A\); we can think of \(h\) as defining a global Hamiltonian \(H\) on \(S\). That is, \(H = \sum_{\alpha \in \mathcal V} h_\alpha \circ \pi_{\alpha}\), where \(\pi_\alpha: S \to \mathcal{F}([\alpha])\) is the natural projection. The expectation of such an observable \(H\) depends only on the marginals \(p_\alpha\) of the global probability distribution \(P\) for each set \(\alpha\). So if \(p\) is the 0-cochain of marginals of \(P\), \(\mathbb{E}_P[H] = \langle p, h\rangle\).
Further, if \(h\) and \(h'\) are homologous 0-chains, they define the same global function \(H\); adding a boundary \(\partial x\) to \(h\) corresponds to shifting the accounting of the energy associated with some random variables from one covering set to another. We therefore naturally have a pairing between \(H^0(X; \mathcal{A}^*)\) and \(H_0(X; \mathcal A)\). It computes the expectation of a local Hamiltonian with respect to a set of pseudomarginals. So in order to compute the expected energy when the system state follows a certain probability distribution, we only need the marginals and the local Hamiltonians. The locality in the systems we consider is captured in this sheaf-cosheaf pair.
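Here is a sketch that checks this on the running example: the expectation of the global Hamiltonian under a brute-forced Gibbs distribution agrees with the pairing \(\langle p, h\rangle\) computed only from the edge marginals. The couplings are arbitrary values chosen for illustration.

```python
import itertools
import math

# Running Ising-type example: local Hamiltonians h_alpha on the cover sets
# alpha = (1,2), (2,3), (3,4).  The couplings are arbitrary choices.
thetas = {(1, 2): 0.5, (2, 3): -0.3, (3, 4): 1.0}
indices = [1, 2, 3, 4]

def h_alpha(alpha, x):
    i, j = alpha
    return -thetas[alpha] * x[i] * x[j]

def H(x):
    # Global Hamiltonian: the sum of the local terms, each of which depends
    # only on the variables in its own covering set.
    return sum(h_alpha(alpha, x) for alpha in thetas)

# Brute-force global Gibbs distribution P (beta = 1) on S = prod_i S_i.
states = [dict(zip(indices, xs))
          for xs in itertools.product([-1, +1], repeat=len(indices))]
weights = [math.exp(-H(x)) for x in states]
Z = sum(weights)
P = [w / Z for w in weights]

# Global expectation of H...
E_global = sum(p * H(x) for p, x in zip(P, states))

# ...versus the pairing <p, h>: marginalize P onto each covering set, pair
# the marginal with h_alpha, and sum over the cover.
E_local = 0.0
for alpha in thetas:
    i, j = alpha
    marginal = {}
    for p, x in zip(P, states):
        key = (x[i], x[j])
        marginal[key] = marginal.get(key, 0.0) + p
    E_local += sum(mass * h_alpha(alpha, {i: xi, j: xj})
                   for (xi, xj), mass in marginal.items())

print(abs(E_global - E_local) < 1e-12)   # True: the marginals suffice
```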
So far, all we’ve done is represent the structure of a graphical model in terms of a sheaf-cosheaf pair. The duality between marginal distributions and local Hamiltonians is implemented by linear duality of sheaves and cosheaves. Next time we’ll see if there’s anything we can actually do with this language. Can we use it to understand or design inference algorithms? (Answer: yes, sort of.)
[1] It’s a fun exercise to use the discrete Fourier transform and the Fourier slice theorem to show this.