<p><strong>Sheaves and Probability, Part 3</strong> (Jakob Hansen, 2021-04-02, https://www.jakobhansen.org/2021/04/02/sheaves-probability-3)</p>
<p>(<a href="/2021/03/17/sheaves-probability">Part 1</a>, <a href="/2021/03/26/sheaves-probability-2">Part 2</a>)</p>
<p>The distributed algorithm for inference I described last time is kind of
awkward, and not very well linked with the rest of the field. Olivier Peltre
has a better one. It’s connected with the well-known message-passing
algorithm for this sort of calculation. Unfortunately, while I think I have a
decent grasp of the general idea, I don’t have a very detailed understanding
of all the moving pieces, so some of this exposition might be a little bit
wrong.</p>
<p>We will need to give up a bit of our nice simplicial structure to
make everything work properly. We will instead work directly with a cover of
our set of random variables \(\mathcal I\) by a collection of sets \(\mathcal
V\), where we require that \(\mathcal V\) be closed under intersection. This
requirement makes \(\mathcal V\) a partially ordered set under the relation
\(\alpha \leq \beta\) if \(\beta \subseteq \alpha\). (The change in direction
is to ensure that our sheaves stay sheaves and our cosheaves stay cosheaves.)
The state spaces for each \(\alpha\) still form a sheaf (i.e., a covariant
functor) over \(\mathcal V\); and so we also get the cosheaf \(\mathcal A\)
of observables and the sheaf \(\mathcal A^*\) of functionals. (This is
probably terrible terminology for anyone not comfortable with cellular
sheaves, since typically we want to think of a presheaf as a contravariant
functor. Sorry.)</p>
<p>We can move back to the simplicial world by taking the nerve \(\mathcal
N(\mathcal V)\) of \(\mathcal V\): this is the simplicial complex whose
\(k\)-simplices are nondegenerate chains of length \(k+1\) in \(\mathcal V\).
In particular, vertices of \(\mathcal{N}(\mathcal V)\) correspond to elements
of \(\mathcal V\), while edges correspond to pairs \(\alpha \leq \beta\). A
sheaf or cosheaf on \(\mathcal V\) extends naturally to
\(\mathcal{N}(\mathcal V)\). The stalk over a simplex \(\alpha \leq \cdots
\leq \beta\) is \(\mathcal F(\beta)\), and the restriction maps between 0-
and 1-simplices are \(\mathcal F(\alpha) \to \mathcal F(\alpha \leq \beta)\),
\(x_\alpha \mapsto \mathcal F_{\alpha \leq \beta} x_\alpha\), and \(\mathcal
F(\beta) \to \mathcal F(\alpha \leq \beta)\), \(x_\beta \mapsto x_\beta\).</p>
<p>Extending a sheaf or cosheaf to \(\mathcal N(\mathcal V)\) preserves homology
and cohomology (defined in terms of derived functors and all that), so this
is not an unreasonable thing to do. The interpretations of \(H_0(\mathcal
N(\mathcal V);\mathcal A)\) and \(H^0(\mathcal N(\mathcal V);\mathcal A^*)\)
are also the same. Homology classes of \(\mathcal A\) correspond to
Hamiltonians defined as a sum of local terms, and cohomology classes of
\(\mathcal A^*\) (when normalized) correspond to consistent local marginals
(or pseudomarginals).</p>
<p>We now come to the part of the framework I have always found most confusing,
because it’s seldom very well explained. Peltre (and others, probably) makes
a distinction between the <em>interaction potentials</em> \(h_\alpha\) which are
summed to get a Hamiltonian on \(S\), and the collection of <em>local
Hamiltonians</em> \(H_\alpha\), obtained by treating each subset \(\alpha\) as
if it were an isolated system and summing the potentials \(h_\beta\) for
\(\beta \subseteq \alpha\). Both of these can be seen as 0-chains of
\(\mathcal A\), and they are not really homologically related. There is an
invertible linear map \(\zeta: C_0(\mathcal{N}(\mathcal V);\mathcal A) \to
C_0(\mathcal{N}(\mathcal V);\mathcal A)\) which sends a collection of
interaction potentials to its family of local Hamiltonians. The formula is
straightforward: \((\zeta h)_\alpha = \sum_{\beta \subseteq \alpha} \mathcal
A_{\alpha \leq \beta} h_\beta\). The inverse of \(\zeta\) is the Möbius transform \(\mu\)
given by \((\mu H)_\alpha = \sum_{\beta \subseteq \alpha} c_{\beta}
\mathcal{A}_{\alpha \leq \beta} H_\beta\). The coefficients \(c_{\beta}\) are
the Möbius numbers of the poset \(\mathcal V\). Notice that these formulas
refer to the poset relations, not to the incidence of simplices in \(\mathcal
N(\mathcal V)\), which is a bit frustrating—the simplicial structure doesn’t
seem to really be doing much here. However, \(\zeta\) and \(\mu\) can be
extended to chain maps, and hence descend to homology. There are analogous
operations on cochains of \(\mathcal A^*\), adjoint to the operations on the
chains of \(\mathcal A\).</p>
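<p>To make the zeta transform concrete, here is a minimal sketch on the small Ising-type cover from Part 1, closed under nonempty intersections. The potentials, the helper names (<code>states</code>, <code>key</code>, <code>zeta</code>, <code>zeta_inverse</code>) and the choice to invert \(\zeta\) by a bottom-up recursion rather than an explicit Möbius formula are all assumptions made for illustration, not Peltre's implementation.</p>

```python
from itertools import product

# Regions of the running example, closed under nonempty intersection.
# (The empty intersection of {1,2} and {3,4} is left out for brevity.)
REGIONS = [frozenset(r) for r in ({1, 2}, {2, 3}, {3, 4}, {2}, {3})]


def states(region):
    """All spin assignments on a region, as dicts {variable: +1 or -1}."""
    variables = sorted(region)
    return [dict(zip(variables, spins))
            for spins in product((+1, -1), repeat=len(variables))]


def key(x, region):
    """Restrict an assignment to a region and make it hashable."""
    return tuple(x[i] for i in sorted(region))


def zeta(h):
    """(zeta h)_a(x) = sum of h_b(x restricted to b) over regions b inside a."""
    return {a: {key(x, a): sum(h[b][key(x, b)] for b in REGIONS if b <= a)
                for x in states(a)}
            for a in REGIONS}


def zeta_inverse(H):
    """Undo zeta from the smallest regions up: h_a = H_a - sum of h_b, b strictly inside a."""
    h = {}
    for a in sorted(REGIONS, key=len):
        h[a] = {key(x, a): H[a][key(x, a)]
                - sum(h[b][key(x, b)] for b in REGIONS if b < a)
                for x in states(a)}
    return h


# Hypothetical interaction potentials: fields on sites 2 and 3, couplings on edges.
h = {a: {key(x, a): (-0.5 * x[min(a)] if len(a) == 1
                     else -1.0 * x[min(a)] * x[max(a)])
         for x in states(a)}
     for a in REGIONS}

H = zeta(h)               # local Hamiltonians
h_back = zeta_inverse(H)  # recovers the interaction potentials
```

<p>The local Hamiltonian on \(\{2,3\}\) picks up the coupling on \(\{2,3\}\) plus both site fields, while the one on \(\{1,2\}\) only sees the field on site 2.</p>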
<p>The action of \(H^0(\mathcal N(\mathcal V);\mathcal A^*)\) on \(H_0(\mathcal
N(\mathcal V);\mathcal A)\) calculates the expected value of the Hamiltonian
\(H\) given by a class of interaction potentials \(h_\alpha\) with respect to
a consistent set of marginals. If a chain represents a collection of local
Hamiltonians, then this pairing doesn’t compute an expectation. Instead, if
\(p \in H^0\) and \(H \in C_0\), we have to take \(\langle p, \mu \cdot
H\rangle\) to get the expectation. By adjointness, this is \(\langle \mu\cdot
p, H\rangle\).</p>
<p>For reasons I don’t understand very deeply, these operations play a key role
in formulating message passing dynamics. My current understanding is that it
has to do with the gradients of the Bethe entropy. The Bethe entropy is
defined on a section of \(\mathcal A^*\) by the same inclusion-exclusion
procedure as we did before, but using the Möbius coefficients to perform
inclusion-exclusion: \(\check{S}(p) = \sum_{\alpha} c_\alpha S(p_\alpha)\).
Think of this as computing local (i.e. marginalized) entropy and then
applying the Möbius transform.</p>
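<p>As a sanity check on the formula, the sketch below evaluates the Bethe entropy on the four-site chain from Part 1 and compares it with the entropy of the full joint distribution. On a chain the two should agree, since the joint distribution is a Markov chain. The coupling values and function names are hypothetical, and the coefficients (\(+1\) on the three edges, \(-1\) on the shared sites 2 and 3) are written out by hand for this particular cover.</p>

```python
from itertools import product
from math import exp, log

EDGES = ((1, 2), (2, 3), (3, 4))
THETA = {(1, 2): 0.3, (2, 3): -0.7, (3, 4): 1.1}  # hypothetical couplings

# Joint distribution p(x) proportional to exp(sum_ij theta_ij x_i x_j).
weights = {x: exp(sum(THETA[i, j] * x[i - 1] * x[j - 1] for i, j in EDGES))
           for x in product((+1, -1), repeat=4)}
Z = sum(weights.values())
joint = {x: w / Z for x, w in weights.items()}


def marginal(region):
    """Marginal of the joint distribution on a tuple of site indices."""
    out = {}
    for x, px in joint.items():
        k = tuple(x[i - 1] for i in region)
        out[k] = out.get(k, 0.0) + px
    return out


def entropy(p):
    """Shannon entropy (natural log) of a distribution stored as a dict."""
    return -sum(v * log(v) for v in p.values() if v > 0)


# Coefficients for this cover: +1 on edges, -1 on the overlaps {2} and {3}.
COEFF = {(1, 2): 1, (2, 3): 1, (3, 4): 1, (2,): -1, (3,): -1}

bethe = sum(c * entropy(marginal(r)) for r, c in COEFF.items())
exact = entropy(joint)
```

<p>On covers whose nerve has cycles the two quantities generally differ, which is where the approximation comes in.</p>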
<p>The Bethe free energy is the functional we minimize to find the approximate pseudomarginals:
\(\langle p, h \rangle - \check{S}(p)\). Adding the constraint that \(p\) be consistent (\(\delta p = 0\)) and normalized (\(\langle p_\alpha, \mathbf{1}_\alpha \rangle = 1\)), we get the Lagrangian condition</p>
\[h + \mu (\log p + \mathbf{1}) + \partial \lambda + \tau = 0\]
<p>which we can then convert to</p>
\[-\log p - \mathbf{1} = \zeta(h + \partial \lambda + \tau).\]
<p>The Lagrange multiplier \(\tau\) is an arbitrary 0-chain which is constant on
each vertex, so its zeta transform also satisfies the same condition, and we
can roll the \(\mathbf 1\) into it. \(\lambda\) is an arbitrary 1-chain.
Ultimately, we have the requirement that</p>
\[-\log p = \zeta(h + \partial \lambda) + \tau.\]
<p>This holds on the level of cochains, and of course the cochain \(p\) must also be
in \(H^0(\mathcal{N}(\mathcal V),\mathcal A^*)\). So in order to be a
critical point for the Bethe free energy functional, \(-\log p\) must be
obtained as the zeta transform of something homologous to the given local
potential \(h\), up to some normalizing additive constant. Solving the
marginalization problem amounts to a search for a local potential \(h'\)
homologous to the given \(h\) that makes \(p = \exp(-\zeta h')\) a section of
\(\mathcal A^*\). Note that every section of \(\mathcal A^*\) is normalized
to some constant sum, so the normalization constraint isn’t all that
interesting. Further, if \(p = \exp(-\zeta h')\), it is automatically
nonnegative.</p>
<p>One of the key insights now is that if we define a differential equation on
\(h\) where the derivative of \(h\) is always in the image of \(\partial\), the
evolution of the state will be restricted to the homology class of the
initial condition. If we define an operator on \(C_0(\mathcal{N}(\mathcal
V);\mathcal A)\) which vanishes when \(\exp(-\zeta h')\) is a section of
\(\mathcal A^*\), we’ll be able to use it to implement just such a differential equation.</p>
<p>Let \(\Delta = \partial \circ \mathcal D \circ \zeta\), where \(\mathcal D:
C_0(\mathcal N(\mathcal V);\mathcal A) \to C_1(\mathcal N(\mathcal V); \mathcal A)\)
is defined by
\((\mathcal D H)_{\alpha \leq \beta} = H_\beta + \log(\mathcal A^*_{\alpha \leq \beta}\exp(-H_\alpha)) \in \mathcal A(\alpha \leq \beta) = \mathcal A(\beta)\). This operator clearly vanishes when \(\exp(-\zeta
h)\) is a section of \(\mathcal A^*\), since \(\mathcal D \circ \zeta\) does.
In fact, \(\Delta\) vanishes at precisely these
points. This operator \(\Delta\) is somewhat mysterious, but if you squint
hard enough it looks kind of like a Laplacian. In fact, its linearization
around a given 0-chain appears to be a sheaf Laplacian.</p>
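<p>A small numerical sketch of the vanishing claim, on a single pair \(\alpha = \{1,2\}\), \(\beta = \{2\}\). It uses the sign convention \((\mathcal D H)_{\alpha \leq \beta} = H_\beta + \log(\mathcal A^*_{\alpha \leq \beta}\exp(-H_\alpha))\), chosen so that the expression is zero when \(\exp(-H)\) is consistent under marginalization. The numbers and the names <code>push_forward</code> and <code>D</code> are made up for illustration.</p>

```python
from math import exp, log

# Hypothetical marginal on alpha = {1, 2}; keys are (x1, x2).
p_alpha = {(+1, +1): 0.4, (+1, -1): 0.1, (-1, +1): 0.2, (-1, -1): 0.3}


def push_forward(f_alpha):
    """A^*_{alpha <= beta}: sum out x1, leaving a function of x2 on beta = {2}."""
    out = {}
    for (_, x2), value in f_alpha.items():
        out[x2] = out.get(x2, 0.0) + value
    return out


def D(H_alpha, H_beta):
    """One component of D: H_beta + log(marginal of exp(-H_alpha))."""
    pushed = push_forward({k: exp(-v) for k, v in H_alpha.items()})
    return {x2: H_beta[x2] + log(pushed[x2]) for x2 in H_beta}


H_alpha = {k: -log(v) for k, v in p_alpha.items()}

# Consistent case: H_beta = -log of the true marginal of p_alpha.
consistent = {x2: -log(v) for x2, v in push_forward(p_alpha).items()}
# Inconsistent case: a uniform guess for the marginal on site 2.
inconsistent = {+1: -log(0.5), -1: -log(0.5)}

residual_ok = D(H_alpha, consistent)
residual_bad = D(H_alpha, inconsistent)
```

<p>The residual is zero in the consistent case and equals the log-ratio between the actual and the guessed marginal otherwise, which is the quantity the flow then tries to kill.</p>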
<p>Thus the system of differential equations \(\dot{h} = -\Delta h\) has
stationary points corresponding precisely to the critical points of the Bethe
free energy. If you discretize this ODE with a naive Euler scheme with step size
1, you get a discrete-time evolution equation, and this implies dynamics on
local marginal estimates \(p_\alpha \propto \exp(-(\zeta h)_\alpha)\). It
turns out that these dynamics on \(p\) are precisely the standard
message-passing dynamics, typically described in terms of marginalization,
sums, and products. This is a really neat result. And despite the
complications involved in defining everything, I find this more
comprehensible than the other expositions of generalized message-passing
algorithms I’ve read. (I’m looking at you, <a href="https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf">Wainwright and
Jordan</a>.)
I’d still like to understand this better, though. What exactly is the
relationship between the operator \(\Delta\) and a sheaf Laplacian? Is there
a sheaf-theoretic way to understand the max-product algorithm for finding
maximum-likelihood elements rather than marginal distributions? Can we use
(possibly conically constrained) cohomology of \(\mathcal A^*\) to say
anything about the pseudomarginal problem?</p>
<p><strong>Sheaves and Probability, Part 2</strong> (2021-03-26, https://www.jakobhansen.org/2021/03/26/sheaves-probability-2)</p>
<p>(See <a href="/2021/03/17/sheaves-probability">Part 1</a> for background.)</p>
<p>Once the structure of a graphical model (or whatever else you want to call
it) is determined, and we know how to represent it using sheaves and
cosheaves, what do we do with it? Typically, we know the local Hamiltonian
\(H\) and we want to know the probabilities of certain events. Of course, we
don’t usually care about the probabilities of individual global states, but
about the marginal probabilities of local states, or some other statistics
that we can derive from those marginals. For instance, in the Ising model, we
might want to know the distribution of the proportion of local sites that
have \(+1\) spin instead of \(-1\). We don’t need the whole distribution to
calculate this, just the marginal distributions for each site.</p>
<p>A local Hamiltonian \(H = \sum_\alpha h_\alpha\) produces local probability
potential functions \(\phi_\alpha = \exp(- h_\alpha)\). The resulting
probability distribution on states is defined by \(p_H(x) = \frac{1}{Z}
\prod_{\alpha} \exp(- h_\alpha(x_\alpha))\), where \(Z\) is a normalizing factor
called the <strong>partition function</strong>: \(Z = \sum_{x} \prod_{\alpha} \exp(-
h_\alpha(x_\alpha))\). The only hard part of calculating the probability
distribution is calculating the partition function, because it requires
summing over the exponentially large global state space. The goal is to find
a way to calculate the marginals for each set \(\alpha\) without computing
the whole partition function.</p>
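<p>For a system as small as the four-site chain from Part 1, the brute-force computation is still feasible and makes a useful baseline. The sketch below enumerates all 16 global states to get \(Z\) and a pairwise marginal. The uniform coupling <code>THETA</code> and the function names are assumptions for illustration. The closed form \(Z = 2(2\cosh\theta)^3\) used for comparison is the standard result for an open Ising chain with no external field.</p>

```python
from itertools import product
from math import cosh, exp

EDGES = ((1, 2), (2, 3), (3, 4))
THETA = 0.5  # hypothetical coupling strength, the same on every edge


def h(edge, x):
    """Local term h_alpha(x_alpha) = -theta * x_i * x_j."""
    i, j = edge
    return -THETA * x[i - 1] * x[j - 1]


def weight(x):
    """Unnormalized probability: product over alpha of exp(-h_alpha(x_alpha))."""
    return exp(-sum(h(e, x) for e in EDGES))


STATES = list(product((+1, -1), repeat=4))
Z = sum(weight(x) for x in STATES)
Z_closed_form = 2 * (2 * cosh(THETA)) ** 3


def marginal(region):
    """Marginal on a tuple of sites, computed by summing over all global states."""
    out = {}
    for x in STATES:
        k = tuple(x[i - 1] for i in region)
        out[k] = out.get(k, 0.0) + weight(x) / Z
    return out


p12 = marginal((1, 2))
```

<p>The loop over <code>STATES</code> is the part that blows up: with \(n\) binary variables it has \(2^n\) terms.</p>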
<p>One way to reframe things is to note that the probability distribution is the solution to an optimization problem. The distribution \(p_H\) is the minimizer (for fixed \(H\)) of the <em>free energy</em> functional \(\mathbb{F}(p,H) = \mathbb{E}_{p}[H] - S(p)\), where \(S(p)\) is the entropy of the distribution \(p\).
To see this, we think of \(H\) and \(p\) as big vectors, and form the Lagrangian</p>
<p>\[\mathcal L(p,\lambda,\mu) = \mathbb{E}_p[H] - S(p) + \lambda (\langle \mathbf{1},p\rangle -1) - \langle \mu, p\rangle.\]</p>
<p>The expectation is linear in \(p\), and \(S\) is concave, so this is a convex optimization problem. The <a href="https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions">KKT optimality conditions</a> require that the derivative of \(\mathcal L\) with respect to \(p\) be zero, that \(\langle \mathbf{1}, p\rangle = 1\), \(p \geq 0\), \(\mu \geq 0\), and \(\mu(x)p(x) = 0\) for all \(x\). Differentiating \(\mathcal L\) with respect to \(p\), we get
\[H + (\log(p) + 1) + \lambda \mathbf{1} - \mu = 0,\]
which gives
\[p(x) = \exp(- H(x) - (\lambda + 1) + \mu(x)) = \exp(- H(x))\exp(-\lambda -1)\exp(\mu(x)).\]</p>
<p>Since \(\mu(x)p(x) = 0\), we can ignore the term \(\exp(\mu(x))\), since it
is 1 unless \(p(x)\) is zero. Finally, we see that \(\exp(-\lambda-1)\) must
be the normalizing factor \(1/Z\), since \(p\) must be normalized and
changing \(\lambda\) is the only way to make that happen.</p>
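<p>One more worked step, using only the formulas above: plugging the minimizer \(p_H = \exp(-H)/Z\) back into the free energy shows that the minimum value is \(-\log Z\).</p>
<p>\[S(p_H) = -\sum_x p_H(x)\log p_H(x) = \sum_x p_H(x)\,\bigl(H(x) + \log Z\bigr) = \mathbb{E}_{p_H}[H] + \log Z, \qquad \mathbb{F}(p_H,H) = \mathbb{E}_{p_H}[H] - S(p_H) = -\log Z.\]</p>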
<p>This reformulation alone doesn’t solve the problem, since we still have to
optimize over the space of all possible distributions on \(S\). But remember:
when \(H = \sum_\alpha h_\alpha\) we can calculate \(\mathbb{E}_p[H]\)
locally via the action of \(H^0(X; \mathcal A^*)\) on \(H_0(X; \mathcal A)\).
Unfortunately, the same is not (in general) true of the entropy \(S(p)\).
Naively, we need to actually extend our local marginals to a global
distribution in order to calculate its entropy, which is exactly what we were
trying to avoid. Even worse, a pseudomarginal \(\mu \in H^0(X; \mathcal
A^*)\) might not even correspond to any probability distribution on the total
space of states \(S\). Even if it does, that probability distribution is far
from unique. Assuming \(\mu\) actually corresponds to the local marginals of
some distribution, we can solve this uniqueness problem by assuming that it
corresponds to the distribution with maximal entropy.</p>
<p>Even checking whether \(\mu\) is a true marginal distribution is NP-hard in general. (The hardness depends on how the covering sets in \(\mathcal V\) are arranged. If \(X\) is a tree, every pseudomarginal is a true marginal, for example.)</p>
<p>The Bethe-Kikuchi approximation solves these two problems by ignoring them and introducing one of its own.</p>
<ul>
<li>First, forget the problem of finding a real marginal distribution. All we’re looking for are pseudomarginals. We’ll just hope they correspond to marginals of a real distribution. In practice this doesn’t seem to be too big a deal.</li>
<li>Second, we will have to replace the entropy with a locally calculated approximation.</li>
<li>The new problem is that the Bethe approximation to the entropy is no longer a concave function, and hence we can’t guarantee uniqueness of the obtained distribution.</li>
</ul>
<p>The definition of the Bethe-Kikuchi entropy of a pseudomarginal is motivated
by an inclusion-exclusion process. A first approximation might be to
calculate the entropy for each local pseudomarginal \(p_\alpha\), for each
maximal covering set \(\alpha\). This is a good start. But, because the sets
may overlap, we’ve double-counted the entropy associated with the common
variables in, say, \(\alpha \cap \beta\). So we subtract this entropy. But these
intersections might themselves overlap, so we need to add back in the extra entropy we removed.
Since we removed it 3 times, once for each face of \([\alpha\beta\gamma]\), we
need to add two back in. Eventually we get to</p>
<p>\[\check{S}(p) = \sum_{\alpha} S(p_\alpha) - \sum_{\alpha,\beta} S(p_{\alpha\beta}) + 2\sum_{\alpha,\beta,\gamma} S(p_{\alpha\beta\gamma}) - \cdots = \sum_{\alpha} S(p_\alpha) + \sum_{k=2}^n \sum_{\alpha_1,\ldots,\alpha_k} (-1)^{k-1}(k-1)S(p_{\alpha_1\cdots\alpha_k}).\]</p>
<p>Note that this definition implicitly assumes that \(p\) is a section of
\(\mathcal A^*\). Coming up with a formula that is defined on
\(C^0(X;\mathcal A^*)\) is a bit messy. One way to do it is to let
\(\check{S}_{\alpha}(p_\alpha) = S(p_\alpha) + \sum_{\alpha \trianglelefteq
\sigma} (-1)^{\dim \sigma}\frac{\dim \sigma}{\dim \sigma + 1}S(\mathcal A^*_{\alpha
\trianglelefteq \sigma} p_\alpha)\) for each vertex \(\alpha\) of \(X\). Then
we let \(\check{S}(p) = \sum_\alpha \check{S}_\alpha(p_\alpha)\). This splits
the computation of the term corresponding to a \(k\)-simplex of \(X\) equally
between its \(k+1\) vertices.</p>
<p>The Bethe-Kikuchi entropy is often described in a slightly different but
equivalent way, better adapted to situations where we don’t use the
simplicial structure. For this, we convert the cover \(\mathcal V\), which we
previously assumed to have no sets contained in each other, to its closure
\(\overline{\mathcal V}\) under intersection, and take its reversed poset of
inclusion. The <em>Möbius numbers</em> of this poset are an assignment of integers
\(c_\alpha\) to each element \(\alpha \in \overline{\mathcal V}\) such that
\(\sum_{\beta\supseteq \alpha} c_\beta = 1\) for every \(\alpha\). This
definition captures the inclusion-exclusion principle behind the
Bethe-Kikuchi approximation. If we let \(\check{S}(p) = \sum_\alpha c_\alpha
S(p_\alpha)\), we get an equivalent definition with less redundancy and a
natural localized definition on 0-cochains, although we are forced to work
with cochains of \(\mathcal A^*\) defined on \(\overline{\mathcal V}\).</p>
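<p>The Möbius numbers are easy to compute by working down from the largest sets. The sketch below uses the usual counting-number recursion \(c_\alpha = 1 - \sum_{\beta \supsetneq \alpha} c_\beta\), so that the coefficients of all sets containing a given \(\alpha\) sum to 1. The function names are hypothetical, and empty intersections are dropped as a simplifying assumption.</p>

```python
from itertools import combinations


def close_under_intersection(cover):
    """Add nonempty pairwise intersections until nothing new appears."""
    regions = {frozenset(a) for a in cover}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(regions), 2):
            c = a & b
            if c and c not in regions:
                regions.add(c)
                changed = True
    return regions


def mobius_numbers(regions):
    """c_a = 1 - (sum of c_b over regions b strictly containing a)."""
    c = {}
    # Larger sets first, so every strict superset is already assigned.
    for a in sorted(regions, key=len, reverse=True):
        c[a] = 1 - sum(c[b] for b in c if a < b)
    return c


chain = mobius_numbers(close_under_intersection([{1, 2}, {2, 3}, {3, 4}]))
triples = mobius_numbers(
    close_under_intersection([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]))
```

<p>For the chain cover this gives \(+1\) on each pair and \(-1\) on the shared sites 2 and 3. For the cover by triples, the set \(\{3\}\) ends up with coefficient 0 because the larger sets already account for it exactly once.</p>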
<p>However we decide to calculate \(\check{S}(p)\), the approximate marginal
inference problem is what one might call a <em>homological program</em>:</p>
<p>\[\min_{p} \langle p, h \rangle - \check{S}(p) \text{ s.t. } p \in H^0(X;\mathcal A^*), p \geq 0, p \text{ normalized}\]</p>
<p>The Bethe-Kikuchi entropy is not concave, so this is not a convex
optimization problem. But it is localizable as discussed earlier, so we can
try some of our <a href="/publications/distopt.pdf">distributed homological
programming</a> techniques without the optimality
guarantees. The general idea is to replace the constraint \(p \in
H^0(X;\mathcal A^*)\) with the constraint \(L_{\mathcal A^*} p = 0\). The
objective function is \(F(p) = \sum_{\alpha} \langle p_\alpha,
h_\alpha\rangle - \check{S}_\alpha(p_\alpha)\), and we can construct a local
Lagrangian</p>
<p>\[\mathcal L(p,\lambda,\mu,\tau) = \sum_\alpha \mathcal L_\alpha(p,\lambda_\alpha,\mu_\alpha, \tau_\alpha),\]</p>
<p>with
\(\mathcal{L}_\alpha(p,\lambda_\alpha,\mu_\alpha,\tau_\alpha) = \langle p_\alpha, h_\alpha \rangle - \check{S}_\alpha (p_\alpha) + \langle \lambda_\alpha, (L_{\mathcal{A}^*} p)_\alpha\rangle + \mu_\alpha (\langle \mathbf{1}, p_\alpha \rangle - 1) + \langle \tau_\alpha,p_\alpha \rangle\).</p>
<p>The local Lagrangian \(\mathcal{L}_\alpha\) only depends on the values of
\(p_\beta\) where \(\beta\) is a neighboring vertex to \(\alpha\). The
primal-dual dynamics on \(\mathcal L\)—gradient descent on \(p\) and ascent
on the dual variables—is also locally determined, and (hopefully) converges
to a critical point of the optimization problem. You can then discretize the
continuous-time dynamics and work out what the messages passed between nodes
of \(X\) look like, but that’s a lot of work and not particularly
enlightening. While it’s interesting that you can come up with a distributed
algorithm using the general nonsense of homological programming, I’m not sure
this is actually a fruitful approach. The resulting algorithm doesn’t look
anything like the well-studied message-passing algorithms for marginal
inference. And it has some obvious deficits. For instance, if \(X\) is a
tree, the standard message-passing algorithms converge to an exact solution
in finitely many steps, while convergence of this optimization algorithm is
asymptotic at best (and possibly not guaranteed to happen).</p>
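<p>For comparison, here is the textbook sum-product (forward-backward) recursion on a four-site chain, which is a tree and so gives exact marginals after one sweep in each direction. This is the standard algorithm, not the homological-programming dynamics described above. The couplings, fields, and names are hypothetical.</p>

```python
from itertools import product
from math import exp

N = 4
SPINS = (+1, -1)
THETA = [0.3, -0.7, 1.1]       # hypothetical couplings on edges (1,2), (2,3), (3,4)
FIELD = [0.2, 0.0, -0.4, 0.1]  # hypothetical site fields


def psi_edge(k, a, b):
    return exp(THETA[k] * a * b)


def psi_site(i, a):
    return exp(FIELD[i] * a)


# fwd[i][a]: message arriving at site i from the left, for spin value a.
fwd = [{a: 1.0 for a in SPINS}]
for i in range(1, N):
    fwd.append({b: sum(fwd[i - 1][a] * psi_site(i - 1, a) * psi_edge(i - 1, a, b)
                       for a in SPINS)
                for b in SPINS})

# bwd[i][a]: message arriving at site i from the right.
bwd = [None] * N
bwd[N - 1] = {a: 1.0 for a in SPINS}
for i in range(N - 2, -1, -1):
    bwd[i] = {a: sum(bwd[i + 1][b] * psi_site(i + 1, b) * psi_edge(i, a, b)
                     for b in SPINS)
              for a in SPINS}


def bp_marginal(i):
    """Site marginal: product of incoming messages and the local factor."""
    w = {a: fwd[i][a] * psi_site(i, a) * bwd[i][a] for a in SPINS}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}


def brute_marginal(i):
    """Same marginal by summing over all 2^N global states."""
    w = {a: 0.0 for a in SPINS}
    for x in product(SPINS, repeat=N):
        wt = 1.0
        for j in range(N):
            wt *= psi_site(j, x[j])
        for k in range(N - 1):
            wt *= psi_edge(k, x[k], x[k + 1])
        w[x[i]] += wt
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}
```

<p>The message-passing version does work linear in the chain length, while the brute-force check is exponential.</p>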
<p>There is a better approach, developed by Olivier Peltre, that makes
connections between this sheaf-theoretic perspective and message passing
algorithms. His perspective interprets message-passing as a discrete-time
approximation to the flow of a nonlinear Laplacian-like operator on 0-chains
of \(\mathcal A\). Part 3 will outline this framework and its implications.</p>
<p><strong>Sheaves and Probability, Part 1</strong> (2021-03-17, https://www.jakobhansen.org/2021/03/17/sheaves-probability)</p>
<p>Let me pretend to be a physicist for a moment. Consider a system of \(n\)
particles, where each particle \(i\) can have a state in some set \(S_i\).
The total state space of the system is then \(S = \prod_i S_i\), which grows
exponentially as \(n\) increases. If we want to study probability
distributions on the state space, we very quickly run out of space to
represent them. Physicists have a number of clever tricks to represent and
understand these systems more efficiently.</p>
<p>For instance, the time evolution of the system is often determined by
assigning each state an energy and following some sort of Hamiltonian dynamics.
That is, we have a function \(H: S \to \Reals\) giving the energy of each
state. But specifying the energy of exponentially many states is
exponentially hard. One way to solve this problem is to define \(H\)
<em>locally</em>. The simplest way to do this is to say \(H = \sum_i h_i\), where
\(h_i\) depends only on the state of particle \(i\). But of course this means
that there are no interactions between the particles. We can increase the
scope of locality, letting \(H = \sum_{\alpha} h_\alpha\), where each
\(\alpha\) is a subset of particles, and \(h_\alpha\) depends only on the
states of the particles in \(\alpha\).</p>
<p>In a thermodynamic setting, the probability of a given state \(\mathbf{s}\)
is typically proportional to \(\exp(-\beta H(\mathbf{s}))\), where \(\beta\)
is an inverse temperature parameter, making states with lower energies more
probable. When \(H = \sum_\alpha h_\alpha\), this becomes \(\prod_\alpha
\exp(-\beta h_\alpha(\mathbf{s}))\). This is relatively easy to compute for
any single state \(\mathbf{s}\). The hard part is finding the constant of
proportionality. The actual probability is \(p(\mathbf{s}) =
\frac{1}{Z(\beta)} \prod_\alpha \exp(-\beta h_\alpha(\mathbf{s}))\), where
\(Z\) is the <em>partition function</em>, computed by summing over all possible
states.</p>
<p>We can think about this problem from a purely probabilistic perspective as
well; it doesn’t hinge on the thermodynamics. Consider a set of
non-independent discrete random variables \(X_i\) for \(i \in \mathcal I\),
whose joint distribution is \(p(\mathbf{x}) \propto
\prod_{\alpha}\psi_{\alpha}(\mathbf{x}_\alpha)\), where each \(\psi_\alpha\)
is a nonnegative function defined for the random variables in some subset
\(\alpha\) of \(\mathcal I\). Denote the collection of subsets \(\alpha\)
used in this product by \(\mathcal V\).</p>
<p>Why is this interesting? Here are a few examples:</p>
<ul>
<li>
<p><strong>The Ising Model</strong>. This is one of the earliest statistical models for
magnetism. We have a lattice of atoms, each with spin \(X_i \in \{\pm 1\}\).
The joint probability of finding the atoms in a given spin state is
\(p(\mathbf{x}) \propto \exp(\sum_i \phi_i x_i + \sum_{i\sim j} \theta_{ij}
x_ix_j) = \prod_i \psi_i(x_i) \prod_{i \sim j}\psi_{ij}(x_i,x_j)\). We can
write this in the Hamiltonian form as well: \(H = -\sum_{i} \phi_i x_i -
\sum_{i \sim j} \theta_{ij} x_i x_j\).</p>
</li>
<li>
<p><strong>LDPC Codes.</strong> These are a ubiquitous tool in coding theory, which designs
ways to transmit information that are robust to error. We encode some
information in redundant binary <em>code words</em>, which are then transmitted. The
code is designed so that if a few bits are corrupted during transmission, it
is possible to recover the correct code word.
LDPC codes are a particularly efficient type of code. Here we treat the code
words as realizations of a tuple of random variables \((X_1,\ldots,X_n)\),
valued in \(\mathbb{F}_2\). The probability distribution used will be simple,
and will be supported solely on the set of acceptable codewords. For an
\((n,k)\) LDPC code, a code word has length \(n\), and satisfies a collection
of \(n-k\) constraints of the form \(\sum_{i \in \alpha} x_i = 0\). In other
words, the set of codewords is the kernel of an \((n-k)\times n\) matrix with
\(\mathbb{F}_2\) entries.</p>
<p>The local potential functions \(\psi_\alpha\) are simple: they are equal to 1
when \(\sum_{i \in \alpha} x_i = 0\) and 0 otherwise. Note that the
corresponding local Hamiltonian would be infinite for non-codewords. The
problem of finding the most likely code word given a received tuple is an
inference problem with respect to this (or a closely related) distribution.</p>
</li>
<li>
<p><strong>Graphical Models.</strong> In various statistical or machine learning tasks, it can
be useful to work by specifying a joint distribution that factors according
to some local decomposition of the variables. In general, there should be a
term for each set of variables that is closely related. In a
spatially-related task, these might be random variables associated with
nearby points in space. The clusters might also correspond to already-known
facts about a domain, like closely-related proteins in an analysis of a
biological system.</p>
</li>
</ul>
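<p>To make the LDPC example concrete, here is a toy parity-check code written as a product of local potentials. The matrix <code>CHECKS</code> is a made-up \(3 \times 6\) example (far too small to be genuinely low-density), and each row plays the role of one subset \(\alpha\).</p>

```python
from itertools import product

# Hypothetical parity-check matrix over F_2 with n = 6 and n - k = 3 checks.
CHECKS = [
    (1, 1, 0, 1, 0, 0),
    (0, 1, 1, 0, 1, 0),
    (1, 0, 1, 0, 0, 1),
]


def psi(row, x):
    """Local potential: 1 if the parity check holds, 0 otherwise."""
    return 1 if sum(r * xi for r, xi in zip(row, x)) % 2 == 0 else 0


def weight(x):
    """Unnormalized probability: product of the local potentials."""
    w = 1
    for row in CHECKS:
        w *= psi(row, x)
    return w


# The support of the distribution is exactly the kernel of CHECKS over F_2.
codewords = [x for x in product((0, 1), repeat=6) if weight(x) == 1]
```

<p>The three rows are linearly independent, so the kernel has dimension \(k = 3\) and there are \(2^3 = 8\) codewords, each with probability \(1/8\) under the uniform distribution on the support.</p>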
<p>There are a number of interesting facts about probability distributions that
factor in this way. One is a form of conditional independence. Given a
factorization \(p(\mathbf{x}) \propto \prod_{\alpha}
\psi_{\alpha}(\mathbf{x}_\alpha)\), we can produce a graph \(G_{\mathcal V}\)
whose vertices are associated with the random variables \(X_i\) and whose
edges are determined by adding a clique to the graph for each subset \(\alpha
\in \mathcal V\).</p>
<p>Another, slightly more topological way to construct this graph is as the
1-skeleton of a Dowker complex, where the relation is the inclusion relation
between the set of indices \(\mathcal I\) and the set \(\mathcal V\) of subsets
\(\alpha\) used in the functions \(\psi_\alpha\).</p>
<p>If we observe a set \(S\) of random variables, and the removal of the
corresponding set of nodes separates two vertices \(u,v\) in the graph, the
corresponding random variables \(X_u, X_v\) are conditionally independent
given the observations. This property makes the collection of random
variables a <strong><a href="https://en.wikipedia.org/wiki/Markov_random_field">Markov random
field</a></strong> for the graph. It
is a fascinating but nontrivial fact that these two properties are
equivalent: a set of random variables (with strictly positive distributions)
has a distribution that splits into factors corresponding to the cliques of
\(G_{\mathcal V}\) if and only if it is a Markov random field with respect to
\(G_{\mathcal V}\). This result is the <a href="https://en.wikipedia.org/wiki/Hammersley%E2%80%93Clifford_theorem">Hammersley-Clifford
theorem</a>.</p>
<p>The interpretation of the graph \(G_{\mathcal V}\) as the 1-skeleton of a
Dowker complex hints at some deeper relationships between Markov random
fields, factored probability distributions, and topology. My goal here is to
bring some of these relationships into clearer focus, with a good deal of
help from <a href="https://opeltre.github.io">Olivier Peltre</a>’s PhD thesis <a href="https://arxiv.org/abs/2009.11631">Message
Passing Algorithms and Homology</a>.</p>
<p>Let’s construct a simplicial complex and some sheaves from the scenario we’ve
been considering. Start with a set \(\mathcal I\) indexing random variables
\(\{X_i\}_{i \in \mathcal I}\), with \(S_i\) the codomain of the random
variable (here just a finite set) for each \(i \in \mathcal I\). Take a cover
\(\mathcal V\) of \(\mathcal I\)—i.e., a collection of subsets of
\(\mathcal I\) whose union is \(\mathcal I\). Let \(X\) be the Čech nerve of
this cover. That is, \(X\) is a simplicial complex with a 0-simplex \([\alpha]\)
for each \(\alpha \in \mathcal V\), a 1-simplex \([\alpha, \beta]\) for each
pair \(\alpha,\beta \in \mathcal V\) with nonempty intersection (and we may
as well assume \(\alpha,\beta\) are distinct), a 2-simplex
\([\alpha,\beta,\gamma]\) for each triple with nonempty intersection,
etcetera.</p>
<p>We’ll use a running example of an Ising-type model on a very small graph. Let
\(\mathcal I = \{1,2,3,4\}\), \(\mathcal{V} = \{\{1,2\},\{2,3\},\{3,4\}\}\),
and \(S_i = \{\pm 1\}\). \(X\) is then a path graph with 3 nodes and 2 edges.
A warning: \(X\) is not the same as \(G_{\mathcal V}\), which has 4 nodes and
3 edges. Each vertex of \(X\) corresponds to an edge of \(G_{\mathcal V}\).
\(X\) and \(G_{\mathcal V}\) (or, really, the clique complex of \(G_{\mathcal
V}\)) are dual in a sense: maximal simplices of the clique complex of
\(G_{\mathcal V}\) correspond to vertices of \(X\).</p>
<p><img src="/assets/probabilitysheaves/isingcover.svg" alt="Variables, covering, and states" class="center-image" /></p>
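<p>Both constructions are short enough to check by machine. The sketch below builds the simplices of the nerve \(X\) and the graph \(G_{\mathcal V}\) for the running example. The function names <code>nerve</code> and <code>clique_graph</code> are hypothetical.</p>

```python
from itertools import combinations

COVER = [frozenset(s) for s in ({1, 2}, {2, 3}, {3, 4})]


def nerve(cover, max_dim=2):
    """Simplices of the nerve: tuples of cover sets with nonempty intersection."""
    simplices = []
    for k in range(1, max_dim + 2):
        for subset in combinations(cover, k):
            if frozenset.intersection(*subset):
                simplices.append(subset)
    return simplices


def clique_graph(cover):
    """G_V: one vertex per variable, one clique per set in the cover."""
    vertices = set().union(*cover)
    edges = {frozenset(e) for a in cover for e in combinations(sorted(a), 2)}
    return vertices, edges


X = nerve(COVER)
X_vertices = [s for s in X if len(s) == 1]
X_edges = [s for s in X if len(s) == 2]
X_triangles = [s for s in X if len(s) == 3]

G_vertices, G_edges = clique_graph(COVER)
```

<p>The counts come out as described: \(X\) has 3 vertices and 2 edges, while \(G_{\mathcal V}\) has 4 vertices and 3 edges.</p>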
<p>There is a natural cellular sheaf of sets on \(X\), given by letting
\(\mathcal{F}([\alpha_1,\ldots,\alpha_k]) = \prod_{i \in \bigcap \alpha_j}
S_i\). That is, over the simplex corresponding to the intersection of the
sets \(\alpha_1\cap\cdots\cap \alpha_k\), the stalk is the set of all
possible outcomes for the random variables contained in that intersection.
The restriction maps of \(\mathcal F\) are given by the projection maps of
the product. For instance \(\mathcal{F}_{[\alpha] \trianglelefteq
[\alpha,\beta]}\) is the projection \(\prod_{i \in \alpha} S_i \to \prod_{i
\in \alpha \cap \beta} S_i\). Let’s denote \(v_1 = [\{1,2\}]\), \(v_2 =
[\{2,3\}]\), \(v_3 = [\{3,4\}]\), and \(e_{12} = [\{1,2\},\{2,3\}]\), \(e_{23} = [\{2,3\},\{3,4\}]\).
Then \(\mathcal F(v_i) = \{(\pm 1,\pm 1)\}\) for all \(i\) and \(\mathcal
F(e_{ij}) = \{\pm 1\}\). The restriction map \(\mathcal F_{v_1
\trianglelefteq e_{12}}\) sends \((x,y) \mapsto y\). We call this sheaf
\(\mathcal F\) the sheaf of (deterministic) states of the system. Its
sections correspond exactly to global states in \(S = \prod_i S_i\).</p>
<p><img src="/assets/probabilitysheaves/isingstatesheaf.svg" alt="The Ising state sheaf" class="center-image" /></p>
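<p>The claim that sections of \(\mathcal F\) are the same thing as global states can be checked by enumeration in the running example. This is a minimal sketch with hypothetical names. A section is a choice of local state at each vertex such that the two restrictions to every edge agree.</p>

```python
from itertools import product

SPINS = (+1, -1)
VERTICES = {"v1": (1, 2), "v2": (2, 3), "v3": (3, 4)}
# Each edge of X, labelled by the single variable shared by its endpoints.
EDGES = {("v1", "v2"): 2, ("v2", "v3"): 3}


def restrict(vertex, shared, state):
    """Restriction map: project a local state onto the shared variable."""
    return state[VERTICES[vertex].index(shared)]


stalk = list(product(SPINS, repeat=2))  # F(v_i) = {(+-1, +-1)}

sections = []
for choice in product(stalk, repeat=len(VERTICES)):
    assignment = dict(zip(VERTICES, choice))
    if all(restrict(u, s, assignment[u]) == restrict(v, s, assignment[v])
           for (u, v), s in EDGES.items()):
        sections.append(assignment)


def global_state(section):
    """Glue a section back into a state (x1, x2, x3, x4) in S."""
    x1, x2 = section["v1"]
    x3, x4 = section["v3"]
    return (x1, x2, x3, x4)


glued = {global_state(s) for s in sections}
```

<p>Of the \(4^3 = 64\) possible 0-cochains, exactly 16 are sections, one for each element of \(S = \{\pm 1\}^4\).</p>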
<p>We can construct a cellular cosheaf of vector spaces on \(X\) by letting each
stalk be the vector space \(\Reals^{\mathcal F(\alpha)}\). The extension maps
of this cosheaf are those induced by the functor \(\Reals^{(-)}\). You might
call this “cylindrical extension”: when \(f: \mathcal F([\alpha]) \to
\mathcal F([\alpha, \beta])\), a function \(h: \mathcal F([\alpha, \beta])
\to \Reals\) pulls back to a function \(h^*: \mathcal F([\alpha]) \to
\Reals\) by \(h^*(x) = h(f(x))\). We call the resulting cosheaf \(\mathcal
A\), the cosheaf of observables.</p>
<p>In the Ising example, \(\mathcal A(v_i) = \Reals^{\mathcal F(v_i)} =
\Reals^{\{\pm 1\}\times \{\pm 1\}}\simeq \Reals^{2}\otimes \Reals^2\) (with
basis \(\{e_{\pm 1} \otimes e_{\pm 1}\}\)) and \(\mathcal A(e_{ij}) =
\Reals^{\mathcal F(e_{ij})} = \Reals^{\{\pm 1\}} \simeq \Reals^2\). The
extension map \(\mathcal A_{v_1 \trianglelefteq e_{12}}\) sends \(x\) to \((e_{+1} + e_{-1})
\otimes x\).</p>
<p>You can probably already see that notation is a major struggle when working
with these objects. It’s a problem akin to keeping track of tensor indices.
One reason the sheaf-theoretic viewpoint is helpful is that, once defined, it
gives us a less notationally fussy structure to hang all this data from.</p>
<p>The linear dual of the cosheaf \(\mathcal A\) is a sheaf \(\mathcal A^*\)
with isomorphic stalks and restriction maps given by “integrating” or summing
along fibers. If we take the obvious inner product on the stalks of
\(\mathcal A\) (the one given by choosing the indicator functions for
elements of \(\mathcal F(\sigma)\) as an orthonormal basis), these
restriction maps are the adjoints of the extension maps of \(\mathcal A\).
Probability distributions on \(\mathcal{F}([\alpha]) = \prod_{i \in \alpha} S_i\) naturally lie
inside \(\mathcal A^*([\alpha])\); the restriction map \(\mathcal A^*([\alpha])
\to \mathcal A^*([\alpha, \beta])\) computes marginalization onto a subset
of random variables.</p>
<p>The probability distributions on \(\mathcal{F}([\alpha])\) are those elements
\(p\) of \(\mathcal A^*([\alpha])\) with \(p(\mathbf{1}) = 1\) and \(p(x^2)
\geq 0\) for every observable \(x\). Global sections of \(\mathcal A^*\)
which locally satisfy these constraints are sometimes known as
<em>pseudomarginals</em>. They do not always correspond to probability distributions
on \(S = \prod_{i\in \mathcal{I}} S_i\). But they’re as close as we can get while only working
with local information. You can always get some putative global distribution,
but it might assign negative probabilities to some events, even if all the
marginals are nonnegative.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> We can therefore think of \(\mathcal A^*\) (or
at least the subsheaf given by the probability simplex in each stalk) as a
sheaf of probabilistic states.</p>
<p>Because \(\mathcal A^*\) is the linear dual of \(\mathcal A\), stalks of
\(\mathcal A^*\) act on stalks of \(\mathcal A\). If \(p \in \mathcal
A^*(\sigma)\) is a probability distribution on \(\mathcal{F}(\sigma)\), the
action on \(h \in \mathcal A(\sigma)\), \(\langle p, h\rangle\) takes the
expectation of \(h\) with respect to the probability distribution \(p\).
Further, this action commutes with restriction and extension maps: \(\langle
p_\sigma,\mathcal A_{\sigma \trianglelefteq \tau} h_\tau \rangle = \langle
\mathcal A^*_{\sigma \trianglelefteq \tau} p_\sigma, h_\tau\rangle\).</p>
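<p>The adjointness is easy to check by hand on two spins. The sketch below is only an illustration under assumed conventions: the smaller state space is the second spin, and the distribution <code>p</code> and observable <code>h</code> are arbitrary numbers I picked so that the floating-point arithmetic is exact.</p>

```python
from itertools import product

SPINS = (+1, -1)
PAIRS = list(product(SPINS, repeat=2))


def extend(h):
    """Extension map of A: pull an observable on one spin back to two spins."""
    return {(s, t): h[t] for (s, t) in PAIRS}


def restrict(p):
    """Restriction map of A^*: sum along fibers, i.e. marginalize out the first spin."""
    return {t: sum(p[(s, t)] for s in SPINS) for t in SPINS}


def pair(p, h):
    """The pairing <p, h>; an expectation when p is a probability distribution."""
    return sum(p[state] * h[state] for state in p)


p = {(1, 1): 0.5, (1, -1): 0.25, (-1, 1): 0.125, (-1, -1): 0.125}
h = {1: 2.0, -1: -4.0}

lhs = pair(p, extend(h))       # expectation of the extended observable
rhs = pair(restrict(p), h)     # expectation against the marginal
```

<p>Both sides come out to the same number: taking the expectation of a cylindrically extended observable is the same as taking the expectation against the marginal.</p>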
<p>Suppose \(h\) is a 0-chain of \(\mathcal A\); we can think of \(h\) as
defining a global Hamiltonian \(H\) on \(S\). That is, \(H = \sum_{\alpha \in
\mathcal V} h_\alpha \circ \pi_{\alpha}\), where \(\pi_\alpha: S \to
\mathcal{F}([\alpha])\) is the natural projection. The expectation of such an
observable \(H\) depends only on the marginals \(p_\alpha\) of the global
probability distribution \(P\) for each set \(\alpha\). So if \(p\) is the
0-cochain of marginals of \(P\), \(\mathbb{E}_P[H] = \langle p, h\rangle\).</p>
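<p>Here is a small sketch of that identity for three spins. The cover, the weights of the global distribution, and the Ising-style local Hamiltonians are all hypothetical choices for the example; exact rational arithmetic keeps the comparison honest.</p>

```python
from fractions import Fraction
from itertools import product

SPINS = (+1, -1)
COVER = {"a": (0, 1), "b": (1, 2)}       # hypothetical cover of three spins
STATES = list(product(SPINS, repeat=3))


def project(state, alpha):
    """The projection pi_alpha from global states to local states."""
    return tuple(state[i] for i in alpha)


def marginal(P, alpha):
    """Marginal of the global distribution P on the variables in alpha."""
    m = {}
    for state, weight in P.items():
        key = project(state, alpha)
        m[key] = m.get(key, 0) + weight
    return m


# An arbitrary global distribution: weights 1..8, normalized to sum to 1.
P = {state: Fraction(w, 36) for w, state in enumerate(STATES, start=1)}

# Local Hamiltonians h_alpha(s, t) = -s * t on each covering set.
h = {name: {loc: Fraction(-loc[0] * loc[1]) for loc in product(SPINS, repeat=2)}
     for name in COVER}

# Expectation of the global Hamiltonian H, computed directly from P ...
direct = sum(P[s] * sum(h[n][project(s, a)] for n, a in COVER.items())
             for s in STATES)

# ... and computed only from the marginals, as the pairing <p, h>.
via_marginals = sum(marginal(P, a)[loc] * h[n][loc]
                    for n, a in COVER.items() for loc in h[n])
```

<p>The two numbers agree because the expectation is linear and each term \(h_\alpha \circ \pi_\alpha\) only sees the variables in \(\alpha\).</p>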
<p>Further, if \(h\) and \(h'\) are homologous 0-chains, they define the same
global function \(H\); adding a boundary \(\partial x\) to \(h\) corresponds
to shifting the accounting of the energy associated with some random
variables from one covering set to another. We therefore naturally have an
action of \(H^0(\mathcal{A}^*;X)\) on \(H_0(\mathcal A; X)\). It computes the
expectation of a local Hamiltonian with respect to a set of pseudomarginals.
So in order to compute the expected energy if the system state follows a
certain probability distribution, we only need the marginals and the local
Hamiltonians. The locality in the systems we consider is captured in this
sheaf-cosheaf pair.</p>
<p>So far, all we’ve done is represent the structure of a graphical model in
terms of a sheaf-cosheaf pair. The duality between marginal distributions and
local Hamiltonians is implemented by linear duality of sheaves and cosheaves.
<a href="/2021/03/26/sheaves-probability-2">Next time</a> we’ll see if there’s anything we can actually do with this
language. Can we use it to understand or design inference algorithms?
(Answer: yes, sort of.)</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>It’s a fun exercise to use the discrete Fourier transform and the Fourier slice theorem to show this. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Hybrid Systems and Homotopy Theory2021-03-04T00:00:00-05:002021-03-04T00:00:00-05:00https://www.jakobhansen.org/2021/03/04/hybrid-systems<p>Hybrid systems are a modeling approach for interacting discrete-time and continuous-time phenomena. Probably the best introduction is via a pair of canonical examples.</p>
<p>First, consider the dynamics of a ball bouncing up and down on the ground. Most of the time its motion is governed by Newton’s laws. When it hits the ground, we have a modeling choice. We can try to model the deformation of the ball, the transfer of energy and resulting acceleration and deceleration (complicated) or use what we know about how balls bounce: the velocity vector gets turned around and reduced by some scaling factor (much easier). This second approach is the hybrid system model, because there are discontinuities in the state of the ball that happen at discrete times.</p>
<p>As another example, let’s look at a thermostat. Most thermostats function by turning a heat source on and off—there is no intermediate output. The system state is just the temperature of the room. When the heater is on, this state evolves according to one differential equation (hopefully one that causes temperature to increase); when it is off it follows a different equation. The thermostat is able to switch between these vector fields in response to the state of the system.</p>
<p>These two examples capture the two main features hybrid systems are designed for:</p>
<ol>
<li>the possibility for discontinuous changes in the state of the system (bouncing ball)</li>
<li>the possibility for discontinuous changes in the equations of motion for the system (thermostat)</li>
</ol>
<p>A common formalism for hybrid systems comes from, essentially, gluing a finite state machine to some continuous state spaces and continuous-time dynamics.</p>
<p><strong>Definition.</strong> A <em>hybrid system</em> is given by a directed graph \(G\), together with a smooth manifold \(S_v\) (possibly with boundary) for each vertex \(v\) of \(G\), endowed with a vector field \(X_v\), and for each edge \(e: u \to v\) of \(G\), an embedded submanifold \(G_e\) of \(S_u\) (called a guard) and a smooth map \(r_e: G_e \to S_v\) (called a reset map).</p>
<p>The semantics aren’t hard to understand once you get the idea. The state of the system may be in any one of the spaces \(S_u\), and evolves according to the vector field \(X_u\). If the state hits one of the guards \(G_e\), it is instantaneously sent to the next space \(S_v\) via the reset map \(r_e\).</p>
<p>For the bouncing ball, \(G\) is the graph with one vertex \(u\) and one edge \(e: u \to u\). The state space \(S_u\) is \(\Reals_{\geq 0} \times \Reals\), consisting of the (nonnegative) position and (signed) velocity of the ball. The guard manifold \(G_e\) is \(0 \times \Reals_{\leq 0}\), capturing the events where the ball is on the ground with nonpositive velocity. The reset map \(r_e\) sends \((0,v)\) to \((0,-cv)\), where \(0 \lt c \lt 1\) is the coefficient of restitution of the ball. The vector field \(X_u\) is determined by gravity: \((\dot{x},\dot{v}) = (v,-g)\).</p>
<p>For the thermostat, there are two vertices in the graph, corresponding to the two possible states of the controller. \(S_{\text{on}} = S_{\text{off}} = \Reals\), and there is an edge from \(\text{off}\) to \(\text{on}\) and one from \(\text{on}\) to \(\text{off}\). The thermostat has a temperature \(T_{\text{on}}\) at which it turns on the heater, and a temperature \(T_{\text{off}}\) at which it turns it off. These two points are the guard manifolds, and somewhat confusingly, \(G_{\text{on}\to \text{off}} = T_{\text{off}}\), \(G_{\text{off}\to\text{on}} = T_{\text{on}}\). The reset maps are the natural inclusion, so the only difference is in the vector fields. I won’t try to describe these, because modeling heat flow is not my deal.</p>
<p>Hybrid systems can display behavior (some might call it pathological) that is not exhibited by either discrete- or continuous-time systems. Let’s take another look at the ball. Because the coefficient of restitution is less than 1, the ball’s speed decreases every time it hits the guard manifold, and so each time its bounce height decreases. Because the acceleration due to gravity is constant, the time between bounces decreases geometrically. So the transition times when it hits the guard set accumulate at a certain point, and it’s hard to define what happens to the system after that time.
This is called Zeno behavior, since the ball seems to actually carry out the sequence of actions Zeno argued you must carry out in order to go from point A to point B: go half the distance, then go half the distance remaining, and so forth. It’s usually seen as a problem in a hybrid system, since it typically isn’t physically realizable and may lead to undefined behavior.</p>
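<p>To see the accumulation concretely, here is a rough sketch that runs the discrete part of the bouncing ball. It assumes the ball is launched from the ground with upward speed \(v_0\), so each flight lasts \(2v/g\) and each reset multiplies the speed by \(c\). The function name and the numbers are made up for the example.</p>

```python
from fractions import Fraction


def bounce_times(v0, g, c, n):
    """Times of the first n impacts for a ball launched upward from the ground."""
    times = []
    t, v = Fraction(0), Fraction(v0)
    for _ in range(n):
        t += 2 * v / g       # flight time under (xdot, vdot) = (v, -g)
        times.append(t)
        v *= c               # reset map: the speed shrinks by the factor c
    return times


v0, g, c = 10, 10, Fraction(1, 2)     # hypothetical values
times = bounce_times(v0, g, c, 20)

# The flight times form a geometric series, so the impacts accumulate at
# the finite time 2 * v0 / (g * (1 - c)).
zeno_time = Fraction(2 * v0) / (g * (1 - c))
```

<p>With these numbers the impacts happen at times \(2, 3, 7/2, \ldots\), all strictly before the accumulation time \(4\). The hybrid execution makes infinitely many transitions without ever getting past that time.</p>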
<p>A while back someone told me about a PhD thesis that somehow managed to apply the theory of model categories to hybrid systems. I couldn’t find anything about this at the time, but I recently uncovered it. The dissertation of legend is actually two: Aaron Ames wrote an <a href="http://ames.caltech.edu/A%20categorical%20theory.pdf">engineering PhD thesis</a> and a <a href="http://ames.caltech.edu/Hybrid%20model%20structures.pdf">mathematics masters thesis</a> covering two sides of the story. As is the norm in the world of engineering, this turned into a <a href="http://ames.caltech.edu/AmesHSCC2005.pdf">bunch</a> <a href="http://ames.caltech.edu/AmesACC2005.pdf">of</a> <a href="http://ames.caltech.edu/HCatGraph.pdf">other</a> <a href="http://ames.caltech.edu/AMSHybridModel.pdf">papers</a> (and those are just the ones about a very small segment of the thesis).</p>
<p>There are a lot of interesting ideas in these theses, both about hybrid systems and other stuff. I was pleased to find that these constructions are basically cryptomorphic versions of cellular sheaves and cosheaves.</p>
<p>A categorical hybrid system interprets the underlying graph \(G\) of the hybrid system as a category \(\mathcal{G}\) with one object for each edge and one object for each vertex. Each attachment between an edge and a vertex is realized by a morphism from the edge to the vertex. (This is not <em>quite</em> the incidence category of a regular cell complex, since there may be self-loops.) We also want to keep track of the orientation of each edge, so there’s a bit of extra structure tagging each incidence morphism with the information of whether it goes toward the source of the edge or the target of the edge. A categorical hybrid system is just a functor \(\mathcal{X}: \mathcal{G} \to \mathbf{Man}\) to the category of smooth manifolds.</p>
<p>To upgrade a traditional hybrid system into a categorical one, we let \(\mathcal{X}(v) = S_v\) and \(\mathcal{X}(e) = G_e\). The action of \(\mathcal{X}\) on morphisms is as follows. Each edge has two outgoing morphisms \(e_s\) and \(e_t\), going to its source and target vertices. \(\mathcal{X}(e_s)\) should be the inclusion map \(G_e \hookrightarrow S_{s(e)}\), while \(\mathcal{X}(e_t)\) should be the reset map \(r_e: G_e \to S_{t(e)}\).
The actual dynamics of the system are added to the vertex stalks afterwards; we think of \(\mathcal{X}\) as being a “hybrid state space” on which we can define any dynamics we please.</p>
<p>I like thinking about these as cellular cosheaves, even though that’s not quite technically correct, both because I’m comfortable with that language and because you can kind of think of the execution of a hybrid system as a sort of holonomic walk over the underlying graph.
We can take the colimit of the functor \(\mathcal{X}\) (in a suitable category of topological spaces) to get a genuine space where the dynamics happen.
The colimit forgets some topological information about the discrete part of the system. If we want to preserve this, we can take the homotopy colimit.
It turns out that if the homotopy colimit of one of these categorical hybrid systems is contractible, Zeno behavior cannot occur (regardless of choice of vector fields).</p>
<p>This result sounds very alluring, but I couldn’t actually find an explicit proof of it. The closest I found was an explanation that if the underlying graph of the hybrid system has no directed cycles, then Zeno behavior is impossible, because there can only be finitely many state transitions. If the homotopy colimit is simply connected, the underlying graph must have no cycles. After this explanation, the result no longer seems quite so deep—it’s something we could have deduced without any of the algebraic topology. In fact, by looking at the underlying graph itself, we actually get a finer invariant, because we can look for directed cycles rather than any sort of cycle.</p>
<p>As an example, the thermostat cannot exhibit Zeno behavior as long as the on and off temperatures are separated by a positive distance and the vector fields are fixed. This is because it always takes the same amount of time to go between the temperatures. But both the colimit of the corresponding hybrid functor and the homotopy colimit are homotopy equivalent to \(S^1\). We can’t rule out Zeno behavior by the homotopy condition.
On the other hand, this criterion does correctly indicate that Zeno behavior is possible for the bouncing ball, but only if you take the homotopy colimit. The standard colimit is just a cone, while the homotopy colimit is a cylinder.</p>
<p>The masters thesis is dedicated to exploring the homotopy theory of these categorical hybrid systems. Ames constructs a model category structure for the category of hybrid functors, which recovers the homotopy theory of the homotopy colimit. You can compute homotopy or homology using something like the Bousfield-Kan spectral sequence. The second page of this spectral sequence is essentially given by taking stalkwise homology of the cosheaf and computing cosheaf homology.</p>
<p>There may be more to this than I’ve been able to uncover, though. A couple papers hint at using the actual vector fields of the hybrid system to define a sort of Morse homology, which might be a better way to identify non-Zeno behavior. But that project doesn’t seem to have gone anywhere. Instead, Ames is <a href="http://ames.caltech.edu/index.html">a successful researcher at Caltech</a> building robots, with the only category theory or homological algebra relegated to diagrams on his website.</p>
<p>I guess this is a lesson about using fancy math to solve real problems. It takes a lot of work—much more than just a dissertation—to take an idea (“hybrid systems are functors”) from abstract mathematics to anywhere near an application. And the incentives often aren’t there to carry a program like that through to completion, even if it might end up being useful.
I am, of course, unable to resist speculating about other high-powered tools that might be brought to bear on the problem. Trajectories of a hybrid system are directed, so maybe directed algebraic topology has something meaningful to say. Or maybe the theory of stratified spaces might be useful in understanding the colimit of one of these hybrid functors. But there are no guarantees, and without a clear motivating problem, it’s not likely that this line of research will ever, say, produce novel guarantees about the behavior of hybrid systems.</p>
<p>I do think the analogy with cellular cosheaves helped me understand the structure of hybrid systems better. Maybe the real benefit of work like this is giving mathematicians a way to readily understand and compartmentalize what more applied researchers are doing.</p>Homology via Matrix Reduction 2: The Long Exact Sequence of a Pair2021-01-12T00:00:00-05:002021-01-12T00:00:00-05:00https://www.jakobhansen.org/2021/01/12/long-exact-sequence<p>The long exact sequence of a pair of topological spaces \((X,A)\) is one of the
consummately abstract algebraic results that end up being very useful in
calculating the homology of various spaces. Here I want to explain how we can
make this result much more concrete: in fact, it’s a statement about how
Gaussian elimination behaves in certain types of matrices.</p>
<p>In the last post, I talked about how to compute the homology of a simplicial
complex by performing elementary row and column operations. The long exact
sequence arises from doing the same sorts of operations with a little bit more
information.</p>
<p>When \(A\) is a subcomplex of \(X\), the boundary matrix for \(X\) can be split
into blocks. The boundary of a simplex in \(A\) is still in \(A\), but the
boundary of a simplex not in \(A\) may lie partially in \(A\) and partially
outside \(A\). So we can split the standard basis for \(C_k(X)\) up into two
sets: simplices in \(A\) and simplices not in \(A\), and with respect to this
division, the boundary matrix has a block we know is zero. Since the
complementary basis to the basis for \(C_k(A)\) serves as a basis for
\(C_k(X,A)\), this zero block corresponds to the triviality of the boundary
\(\partial: C_{k+1}(A) \to C_{k}(X,A)\).</p>
<p><img src="/assets/les/lesstep1.svg" alt="Block boundary operator matrix" class="center-image two" /></p>
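<p>For a concrete instance of the zero block, take the triangle on vertices \(a, b, c\) and let \(A\) be the edge \(ab\) together with its endpoints. The sketch below builds \(\partial_1\) with the simplices of \(A\) listed first; the helper names are my own, and the sign convention is the usual alternating one on sorted vertex tuples.</p>

```python
def boundary_matrix(rows, cols):
    """Matrix of the simplicial boundary map; simplices are sorted vertex tuples."""
    index = {simplex: i for i, simplex in enumerate(rows)}
    M = [[0] * len(cols) for _ in rows]
    for j, simplex in enumerate(cols):
        for k in range(len(simplex)):
            face = simplex[:k] + simplex[k + 1:]   # drop the k-th vertex
            M[index[face]][j] = (-1) ** k
    return M


# Hypothetical example: X is the triangle on a, b, c; A is the edge ab.
A = {("a",), ("b",), ("a", "b")}
vertices = [("a",), ("b",), ("c",)]
edges = [("a", "b"), ("a", "c"), ("b", "c")]


def a_first(simplices):
    """Stable reordering that lists the simplices of A before the rest."""
    return sorted(simplices, key=lambda s: s not in A)


rows, cols = a_first(vertices), a_first(edges)
d1 = boundary_matrix(rows, cols)

n_rows_A = sum(s in A for s in rows)
n_cols_A = sum(s in A for s in cols)

# Rows outside A, columns inside A: the block C_1(A) -> C_0(X, A).
zero_block = [row[:n_cols_A] for row in d1[n_rows_A:]]
```

<p>The block with columns in \(A\) and rows outside \(A\) is zero, as promised, while the columns for \(ac\) and \(bc\) have entries both inside and outside the rows of \(A\).</p>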
<p>The two “diagonal” blocks each in fact qualify as boundary maps of their own.
That is, \(\partial_{k}^A\partial_{k+1}^A = 0\) and
\(\partial_k^{(X,A)}\partial_{k+1}^{(X,A)} = 0\). This can be seen by evaluating
\(\partial^2\) and noting that the blocks corresponding to maps \(C_{k+2}(A) \to
C_{k}(A)\) and \(C_{k+2}(X,A) \to C_{k}(X,A)\) are the squares of the diagonal
blocks.</p>
<p>As a result, the full boundary map \(\partial\) also encodes the boundary map
\(\partial^A\) for the subcomplex \(A\) and the relative boundary map
\(\partial^{(X,A)}\). We can compute these homologies in context by performing
the same operations as before, just restricted to the diagonal blocks of the
full boundary \(\partial\). This leaves us with bases for \(\ker \partial^A\)
containing bases for \(\text{im } \partial^A\). But there’s a bit more left in the
matrix: a block corresponding to a map \(C_{k+1}(X,A) \to C_k(A)\).</p>
<p><img src="/assets/les/lesstep2.svg" alt="Step 2: Homology of A and (X,A)" class="center-image two" /></p>
<p>What do we know about this block? Quite a bit can be deduced from the fact that
this reduced matrix squares to zero. Let’s look at what happens when we multiply
the two relevant blocks.</p>
<p>Most of the blocks of the product are automatically zero by the structure of the
factors, but the upper right block, corresponding to a map \(C_{k+1}(X,A) \to
C_{k-1}(A)\), has two potentially nonzero regions. One comes from the
composition of the block \(C_{k+1}(X,A) \to C_{k}(A)\) with the identity
submatrix in the block \(C_{k}(A) \to C_{k-1}(A)\), and the other comes from the
composition of the identity submatrix in \(C_{k+1}(X,A) \to C_{k}(X,A)\) with
the block \(C_{k}(X,A) \to C_{k-1}(A)\).</p>
<p>Here’s what that looks like. There’s a vertical strip in \(\partial_{k}\) and a
horizontal strip in \(\partial_{k+1}\) that contribute to the product. Where
they don’t overlap in the product, these strips must be zero. Where they do
overlap, they must sum to zero.</p>
<p><img src="/assets/les/nilsquare2.svg" alt="Tracking where the zeros must be" class="center-image two" /></p>
<p>This translates to some known zeros in the full matrix, which we will be able to
leverage in our reduction operations. In particular, we know that the red block
and the green block sum to zero here.</p>
<p><img src="/assets/les/lesstep3.svg" alt="Step 3: Some known zeros in the reduced matrix" class="center-image two" /></p>
<p>We can use the identity block for \(C_{k+1}(A) \to C_{k}(A)\) to clear out the
red block using column operations. Since this is the negative of the green
block, the corresponding row operations clear out the green block.</p>
<p>What are we doing here? The column operations amount to subtracting chains in
\(C_{k+1}(A)\) from the basis vectors for \(C_{k+1}(X,A)\), which makes sense
when we think of the latter space as a quotient.</p>
<p>We continue clearing out columns to get a basis for which \(H_{k+1}(X,A)\) maps
directly onto the basis for \(H_k(A)\). This is our connecting map. The
corresponding row operations involve adding multiples of zero rows to other zero
rows, so this manipulation has no side effects. We are in effect choosing
different representatives in \(C_*(X)\) for chains in \(C_*(X,A)\).</p>
<p><img src="/assets/les/lesstep4.svg" alt="Step 4: A matrix representation of the connecting map is revealed" class="center-image two" /></p>
<p>Can we tell that the sequence this describes is exact? We need to finish
reducing the matrix so we know what the rest of the maps in the sequence look
like. Once we get the connecting map in Jordan form, we have in fact
calculated \(H_k(X)\) as well.</p>
<p><img src="/assets/les/lesstep5.svg" alt="Step 5: The fully reduced boundary matrix reveals exactness" class="center-image two" /></p>
<p>We can now extract all the maps in the long exact sequence from this matrix. We
can extract bases for \(H_k(A)\), \(H_k(X,A)\) and \(H_k(X)\) from this matrix
and the column operations we performed. The map \(H_k(A) \to H_k(X)\) is given by
the map sending basis vectors for \(H_k(A)\) to either themselves (if they are
not in \(\text{im } \partial\)) or to zero (if they are in \(\text{im } \partial\)). Similarly,
the map \(H_k(X) \to H_k(X,A)\) sends a basis vector to itself (if it is in the
basis for \(H_k(X,A)\)) and zero otherwise (which implies it is in the basis for
\(H_k(A)\)).</p>
<p>Exactness of the sequence is almost immediate. The bases for \(H_k(A)\),
\(H_k(X)\), and \(H_k(X,A)\) factor each space into two subspaces: a piece which
is mapped isomorphically onto its image in the next term of the sequence, and a
complement consisting of the image of the previous term. So we have actually gotten a bit
more than just the long exact sequence—we’ve produced a decomposition of each
term of the sequence. In fact, this expresses \(H_k(X)\) as a quotient of
\(H_k(A)\) plus a subspace of \(H_k(X,A)\). We’ve performed the calculation
involved in the spectral sequence of \(X\) associated to the two-step filtration
\(A \subset X\). Without too much more effort, we should be able to extend this
to a computation of the spectral sequence for a longer filtration, which I hope
to write about soon.</p>Homology via Matrix Reduction2020-12-10T00:00:00-05:002020-12-10T00:00:00-05:00https://www.jakobhansen.org/2020/12/10/homology-matrix-reduction<p>Vin de Silva recently gave a talk here at OSU<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> about his approach to teaching algebraic topology. He wants to make sure students understand the pre-algebraic roots of the concepts: (co)cycles should come before spaces of (co)chains, etc. How did everyone think about things before Emmy Noether came in and ruined everything? So he starts by talking about winding numbers, using this to define the 1-cocycle \(d\theta\) which acts on 1-cycles, and then you’re well on your way to homology and cohomology.</p>
<p>This is one way of getting your hands dirty in algebraic topology: understand the geometric motivations for the abstract algebraic structures first. Work as concretely as possible with the topology.</p>
<p>A complementary approach is to work as concretely as possible with the algebra. This means using combinatorial models like simplicial complexes for topological spaces and looking at the vector spaces of field-valued chains and cochains. The combinatorial nature of a finite simplicial complex means that these are finite-dimensional vector spaces with a natural basis indexed by simplices. This lets us describe all the algebraic structure using little more than a matrix.</p>
<p>Computing (co)homology is then a matter of finding a new basis for the vector space that gives this matrix a particularly nice structure. I first learned this way of thinking about homology from <a href="https://gregoryhenselman.org">Greg Henselman</a>, but this perspective or something like it is ubiquitous in the world of computational topology. I think understanding it is too often left as something to be done purely in the privacy of one’s own home (or <a href="https://github.com/Eetion/Eirene.jl">one’s own Julia code</a>), but the basic idea is quite simple and, I think, rather nice.</p>
<p>Let \(X\) be an oriented simplicial complex with a space of \(\mathbb{k}\)-valued chains \(C_*(X;\mathbb{k})\). If \(X^k\) is the collection of \(k\)-simplices of \(X\), \(C_k(X;\mathbb{k})\) is isomorphic to \(\mathbb{k}^{X^k}\), and we will use the indicator functions \(\{\mathbf{1}_{\sigma}\}_{\sigma \in X^k}\) as a basis. Under this basis, the boundary operator \(\partial: C_*(X;\mathbb{k}) \to C_*(X;\mathbb{k})\) has a representation that looks like this:</p>
<p><img src="/assets/homology/homologystep1.svg" alt="Block boundary operator matrix" class="center-image two" /></p>
<p>This matrix is nilpotent because \(\partial^2 = 0\). Our goal is to find a basis for \(C_*\) that makes this fact evident and reveals a basis for \(H_k\). This is a matter of performing row and column operations on the matrix.</p>
<p>Let’s assume that \(X\) has dimension \(k+1\). We start by performing column operations on \(\partial_{k+1}\) to find a basis for \(\ker \partial_{k+1}\). We perform Gaussian elimination on the columns to clear out all the columns possible and move them to the left. The remaining nonzero columns are a submatrix in column echelon form: the first entry in each column is a 1, and they stairstep down. This gives us a matrix that looks like this:</p>
<p><img src="/assets/homology/homologystep2.svg" alt="Step 2: After column operations on C_{k+1}" class="center-image two" /></p>
<p>Now we perform row operations on \(\partial_{k+1}\) to clean up the block that is still nonzero. Because we’re treating this as a linear map \(C_*(X) \to C_*(X)\), we also have to perform the corresponding (inverse) column operations on \(\partial_k\). So, for instance, if we add row \(i\) to row \(j\), we also need to subtract column \(j\) from column \(i\). (Think about this in terms of multiplication by <a href="https://en.wikipedia.org/wiki/Elementary_matrix">elementary matrices</a>.)</p>
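<p>Here is a quick sketch of that bookkeeping on the full boundary matrix of a filled triangle, with basis \(a, b, c, ab, ac, bc, abc\). The helper names are hypothetical. The point is only that the paired row and column operation is conjugation by an elementary matrix, so it cannot break \(\partial^2 = 0\).</p>

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def elementary(n, j, i, value):
    """Identity matrix with one extra entry `value` in row j, column i."""
    E = [[int(r == c) for c in range(n)] for r in range(n)]
    E[j][i] = value
    return E


def paired_op(D, i, j):
    """Add row i to row j, then subtract column j from column i."""
    D = [row[:] for row in D]
    D[j] = [x + y for x, y in zip(D[j], D[i])]
    for row in D:
        row[i] -= row[j]
    return D


# Full boundary operator of a filled triangle; basis a, b, c, ab, ac, bc, abc.
D = [
    [0, 0, 0, -1, -1,  0,  0],
    [0, 0, 0,  1,  0, -1,  0],
    [0, 0, 0,  0,  1,  1,  0],
    [0, 0, 0,  0,  0,  0,  1],
    [0, 0, 0,  0,  0,  0, -1],
    [0, 0, 0,  0,  0,  0,  1],
    [0, 0, 0,  0,  0,  0,  0],
]

i, j = 3, 4                      # add row ab to row ac
conjugated = paired_op(D, i, j)
E, E_inv = elementary(7, j, i, 1), elementary(7, j, i, -1)
```

<p>The result agrees with \(E \partial E^{-1}\) and still squares to zero. The column for \(ab\) now records the boundary of the new basis vector \(ab - ac\), namely \(b - c\).</p>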
<p>This is actually a good thing, because it does some extra work for us for free. Our change of basis doesn’t change the fact that \(\partial^2 = 0\), so we can conclude that certain columns of \(\partial_k\) are now zero (since they get multiplied by our identity block in \(\partial_{k+1}\)). So now our matrix looks like this. The required zeros in \(\partial_k\) correspond to the fact that \(\text{im } \partial_{k+1} \subseteq \ker \partial_k\).</p>
<p><img src="/assets/homology/homologystep3.svg" alt="Step 3: After row operations on C_k" class="center-image two" /></p>
<p>Now we perform the same sort of column operations on \(\partial_{k}\), getting that into a column echelon form. We of course need to do the corresponding inverse row operations on \(\partial_{k+1}\), but all the corresponding rows are zero, so this does nothing. Then we perform row operations in \(C_{k-1}\), leaving us with another identity block representing \(\partial_k\).</p>
<p><img src="/assets/homology/homologystep4.svg" alt="Step 4: A basis for H_k is revealed" class="center-image two" /></p>
<p>If we look at the columns corresponding to \(C_k\), we have a basis for \(\ker \partial_k\) which contains a basis for \(\text{im } \partial_{k+1}\). The complement is a basis for \(H_k\).
One way to look at this is as a pretty way of writing the Jordan canonical form of \(\partial\). Since \(\partial^2 = 0\), all its eigenvalues are 0, and it has a number of \(2\times 2\) Jordan blocks. This particular algorithm permutes the basis vectors so those blocks have a nice interpretation in terms of determining kernels and images of the graded pieces of \(\partial\).</p>
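<p>If all you want is the dimensions, the Jordan picture reduces to rank counting: each \(2\times 2\) block uses up one unit of rank, so \(\dim H_k = \dim C_k - \operatorname{rank} \partial_k - \operatorname{rank} \partial_{k+1}\). Below is a small sketch for a hollow triangle (my choice of example), using a hand-rolled rank function over the rationals rather than any particular library.</p>

```python
from fractions import Fraction


def rank(M):
    """Rank of a matrix over the rationals, by Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


# Hollow triangle: rows a, b, c; columns ab, ac, bc.
d1 = [[-1, -1,  0],
      [ 1,  0, -1],
      [ 0,  1,  1]]

dims = [3, 3]                    # dim C_0, dim C_1
ranks = [0, rank(d1), 0]         # rank of d_0, d_1, d_2 (d_0 and d_2 are zero)
betti = [dims[k] - ranks[k] - ranks[k + 1] for k in range(2)]
jordan_blocks = sum(ranks)       # number of 2x2 Jordan blocks of the full boundary
```

<p>For the hollow triangle this gives two Jordan blocks and one leftover basis vector in each of degrees 0 and 1, matching the homology of a circle.</p>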
<p>If we keep track of the column operations we perform on \(\partial\), we can see what the representatives for \(H_k\) are in terms of the standard basis. What we have done is write \(J = S^{-1}\partial S\), where \(S\) accumulates those column operations, and the columns of \(S\) therefore correspond to the basis vectors giving us our Jordan form of \(\partial\).</p>
<p>What I find neat about this is the way that abstract algebraic conditions like the fact that \(C_*(X)\) is a complex show up as concrete facts about the matrix of the operator \(\partial\) when we write it in the right basis. The gap between the abstract structure and the matrix representation isn’t too wide in this case, but it rests on some fairly deep facts about the combinatorics of bases in linear algebra. Greg’s thesis describes this in terms of matroid theory, and uses this language to understand persistent homology. I’m planning to explore this a bit more in the future, and hopefully understand how to interpret the long exact sequence of a pair and spectral sequences in terms of simple matrix operations.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Well, as far as anything happens “here” anymore. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Buridan’s Principle2020-11-13T00:00:00-05:002020-11-13T00:00:00-05:00https://www.jakobhansen.org/2020/11/13/buridan<p>You’re walking down the street and realize, to your chagrin, that you’re on a collision course with another person walking toward you. You swerve to the right, only to realize that your counterpart is doing the same. Hesitating for a second, you try to figure out which way to go, but your legs somehow keep moving you forward, bringing you inexorably closer to a collision. Finally, when you’re only an arm’s length apart, you do the obligatory awkward dance and find a way to go around each other.
What happened here? If you ask <a href="http://www.lamport.org">Leslie Lamport</a>, you just fell afoul of <a href="https://lamport.azurewebsites.net/pubs/buridan.pdf">Buridan’s Principle</a>:</p>
<blockquote>
<p>A discrete decision based upon an input having a continuous range of values cannot be made within a bounded length of time.</p>
</blockquote>
<p>The name comes from <a href="https://en.wikipedia.org/wiki/Buridan%27s_ass">Buridan’s ass</a>, the thought experiment that wouldn’t die. A donkey precisely equidistant between two equally large and attractive piles of hay would have no reason to choose one over the other, and would therefore end up starving to death through indecision. Early versions of the tale existed among the ancient Greeks, and the scenario got its name as a satire of the moral determinism advocated by the 14th century philosopher Jean Buridan. Buridan believed that human beings would invariably choose the option they judge to be best when faced with a choice. However, there’s the possibility that two options are equally good:</p>
<blockquote>
<p>Should two courses be judged equal, then the will cannot break the deadlock, all it can do is to suspend judgement until the circumstances change, and the right course of action is clear.</p>
</blockquote>
<p>Donkeys don’t actually starve to death from an inability to choose between food sources. But this doesn’t mean there isn’t an important lesson to be had from the parable. Here’s the first example Lamport gives, which is surprising and even a bit disturbing:</p>
<p>You drive up to an ungated railroad crossing, stop at the sign, and now have to decide whether to go onward or wait. Suppose the train comes at time \(t=0\) and you arrive at time \(s\). Your position at time \(t\) is a function \(x(t,s)\) of both \(t\) and the time \(s\) at which you arrive at the crossing. If \(x\) is continuous, so is the function taking \(s\) to \(x(0,s)\). That is, your position when the train arrives depends continuously on the time you arrive at the crossing. If you arrive long before the train, you should clearly cross before the train arrives, so \(x(0,s)\) for \(s \ll 0\) must be on the far side of the track. On the other hand, if you arrive well after the train passes, you will obviously be on the near side at \(t=0\), so \(x(0,s)\) is on the near side for \(s \gg 0\). Now we apply the intermediate value theorem: there must be some \(s\) such that \(x(0,s)\) is in the middle of the track. No matter what you do, there’s a possibility that you get hit by the train!</p>
<p>Well, you say, this relies on the assumption that \(x\) was continuous. The slices \(x(t,s)\) for fixed \(s\) must be continuous in \(t\) because this is simply your position over time; you can’t jump from one side of the track to the other. But why should your position at \(t=0\) be continuous with respect to your arrival time \(s\)? Why can’t you pick some threshold \(S \lt 0\), and if \(s \leq S\) follow a continuous path crossing before the train arrives, and if \(s \gt S\) wait for the train to pass first? Most of the time this works. But say you arrive at the crossing very close to \(S\). How long does it take to determine which side of the line you are on? If you’re close enough, it might take longer than you have to figure out if you have enough time. Any completely safe decision process will again rely on some discontinuity in the function \(x\), which is simply impossible to physically implement in a bounded amount of time.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p>
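<p>The intermediate value argument can be made concrete with a toy model. The sketch below is only an illustration: the function <code>position_at_train_time</code> is a made-up continuous stand-in for \(x(0,s)\), not anything from Lamport’s paper. Bisection then finds the unlucky arrival time that leaves you on the track.</p>

```python
import math

def position_at_train_time(s):
    """Hypothetical continuous model of x(0, s): where the driver is when the
    train arrives, given arrival time s. The track centre is 0, positive
    values are the far side, negative values the near side."""
    # Early arrivals (s << 0) end up across the track, late ones stay put.
    return -math.tanh(s + 1.0)

def find_collision_time(lo, hi, tol=1e-9):
    """Bisect for an arrival time s with x(0, s) = 0 (driver on the track)."""
    assert position_at_train_time(lo) > 0 > position_at_train_time(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if position_at_train_time(mid) > 0:
            lo = mid   # still safely across: the bad time is later
        else:
            hi = mid   # still waiting: the bad time is earlier
    return 0.5 * (lo + hi)

s_bad = find_collision_time(-10.0, 10.0)
```

<p>Any other continuous choice of model would do; the only inputs the argument uses are the signs at the two extremes.</p>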
<p>Does this actually happen? Well, maybe. There was a <a href="https://en.wikipedia.org/wiki/Valhalla_train_crash">train crash in 2015</a> that looks like it might have Buridanian influences. From the <a href="https://www.ntsb.gov/investigations/AccidentReports/Reports/RAR1701.pdf">NTSB accident report</a>:</p>
<blockquote>
<p>On February 3, 2015, at 6:26 p.m. eastern standard time, a 2011 Mercedes Benz ML350 sport-utility vehicle (SUV) driven by a 49-year-old woman (driver), traveled northwest on Commerce Street in Valhalla, New York, toward a public highway-railroad grade crossing….The driver moved beyond the grade crossing boundary (stop line) and stopped adjacent to the railroad tracks. The grade crossing warning system activated and the gate came down, striking the rear of her vehicle. She then exited her vehicle and examined the gate. The driver then returned to her vehicle and moved forward on to the tracks.</p>
</blockquote>
<p>There are many ways to explain this driver’s behavior. Perhaps she felt she was already committed to crossing the tracks; perhaps she was just running on autopilot and did not comprehend the danger facing her. Buridan’s principle was not the only, or perhaps even the determining factor in this accident. But the driver’s behavior does look a bit like what would happen if she were having trouble judging whether she had enough time to cross.</p>
<p>A more familiar and perhaps more plausible example of the Buridan phenomenon is the decision of whether to stop or speed through a traffic light that has just turned yellow. It’s easy to waffle between the two options, only coming to a conclusion by slamming on the brakes or screeching through the intersection, risking running a red light. Again, there are other interpretations of this behavior, but the difficulty of figuring out <em>in time</em> whether you have <em>enough time</em> is a real one.</p>
<p>And this is a general problem, not limited to determining safe paths for vehicles. Any rule for deciding between two discrete options induces a decision boundary on the parameter space of inputs to the decision. An input very close to the decision boundary may be easily confused with one on the other side. It can take longer than expected to figure out what side of the boundary you are on. And in the time it takes to carefully measure where you are, the time for making the decision may pass.</p>
<p>What can you do about this problem? Well, the mathematics should be sufficient to convince you that you can’t avoid it entirely, but you can make it less likely. For example, you could take more parameters into account, reducing the likelihood that you end up near the decision boundary. An extreme version of this is using a coin flip to make the decision if you’re near the boundary. This is a sort of <em>deadlock detector</em>: a second-level system that is meant to step in when it senses that the first level is having trouble making a decision. Of course, this just replaces the problem with the one of having to decide if you’re close enough to the boundary to use the coin flip. By Buridan’s principle, any deadlock detector could itself deadlock,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup> but as you add more layers, this can become less and less likely.</p>
<p>You can also give yourself more time to make the decision. Suppose you’re driving and see an obstacle in the road ahead. You can either go around it to the right or left, depending on where it is. In the extreme cases where the obstacle is not actually in the road, but to the side, you would obviously go to the side you’re already on. But there’s a discrete decision to be made based on the continuous parameter of the position of the obstacle, so Buridan’s principle applies. But you have another trick up your sleeve. If it’s taking a long time to decide, you can put on the brakes, up to and including stopping the car completely. You can give yourself an arbitrarily long amount of time to make the decision. Even if you’re going at a fixed speed, if that speed is low enough, the probability of not making a decision in time can be very small.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup></p>
<p>This is the approach often taken in computer design. In order to interact with the outside world, a microprocessor needs to observe some external inputs. This happens at fixed points in its clock cycle. Most of the time an observation will be unambiguously a zero or one. But if the input is not synchronized with the clock, it’s possible to catch it right in the middle of a transition. What happens then? The circuit has left the regime of digital logic with clear 1s and 0s, and entered a realm of metastability and undefined behavior. In <a href="http://lamport.azurewebsites.net/pubs/glitch.pdf">a previous paper</a>, Lamport showed that it is theoretically possible for this period of ambiguity to last arbitrarily long before the system settles into a well-defined digital state. In practice, what system designers do is add intentional delays to push the probability of staying in an undefined state for that long very close to zero. In essence, rather than trying to look at what the state is right away, the processor looks at what the state was several clock cycles ago. It’s possible to make the probability of decision failure small enough that it’s orders of magnitude less likely than other sorts of spontaneous failures.</p>
<p>Coming back to the original image of the donkey hesitating between two piles of hay, we have a perhaps better look at how our brains deal with this kind of problem, and what it might actually mean for questions about free will and moral determinism.</p>
<p>There’s a famous set of brain studies that has been purported to prove that conscious thought does not dominate in decision making. These studies, introduced by Benjamin Libet, ask subjects to sit in silence and, of their own volition and at any time, take some small action like tapping their finger. Meanwhile, their brains are monitored, via EEG or fMRI or some other measurement, and they are asked to report the time that they chose to take the action. Researchers have found that there is typically an increase in brain activity, called the <em>Bereitschaftspotential</em>, beginning significantly before subjects report deciding to act. This increase in activity is taken to be the “real” cause of the action, not the conscious decision to act.</p>
<p>Recent work has cast this interpretation into doubt. Rather than indicating some subconscious process dictating your every move, this precursor activity <a href="https://www.theatlantic.com/health/archive/2019/09/free-will-bereitschaftspotential/597736/">might actually be a mechanism to avoid deadlocks</a>:</p>
<blockquote>
<p>Neuroscientists know that for people to make any type of decision, our neurons need to gather evidence for each option. The decision is reached when one group of neurons accumulates evidence past a certain threshold. Sometimes, this evidence comes from sensory information from the outside world: If you’re watching snow fall, your brain will weigh the number of falling snowflakes against the few caught in the wind, and quickly settle on the fact that the snow is moving downward.</p>
<p>But Libet’s experiment, Schurger pointed out, provided its subjects with no such external cues. To decide when to tap their fingers, the participants simply acted whenever the moment struck them. Those spontaneous moments, Schurger reasoned, must have coincided with the haphazard ebb and flow of the participants’ brain activity. They would have been more likely to tap their fingers when their motor system happened to be closer to a threshold for movement initiation.</p>
<p>This would not imply, as Libet had thought, that people’s brains “decide” to move their fingers before they know it. Hardly. Rather, it would mean that the noisy activity in people’s brains sometimes happens to tip the scale if there’s nothing else to base a choice on, saving us from endless indecision when faced with an arbitrary task. The <em>Bereitschaftspotential</em> would be the rising part of the brain fluctuations that tend to coincide with the decisions. This is a highly specific situation, not a general case for all, or even many, choices.</p>
</blockquote>
<p>The <em>Bereitschaftspotential</em> might be evidence of a mechanism keeping Buridan’s ass from starving, and research subjects from being paralyzed due to insufficient reason to act. Perhaps these fluctuations in some sense provide the ability to do otherwise when all the relevant facts are held constant, at least when nothing is at stake. Maybe this is nature’s answer to the problem Buridan poses.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>This has something to do with the fact that <a href="http://math.andrej.com/2006/03/27/sometimes-all-functions-are-continuous/">all computable functions are continuous</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Lamport makes this point about a proposed election deadlock detector in a comment on his webpage: “Another amusing example occurred in an article by Charles Seife titled <em>Not Every Vote Counts</em> that appeared on the op-ed page of the <em>New York Times</em> on 4 December 2008. Seife proposed that elections be decided by a coin toss if the voting is very close, thereby avoiding litigious disputes over the exact vote count. While the problem of counting votes is not exactly an instance of Buridan’s Principle, the flaw in his scheme will be obvious to anyone familiar with the principle and the futile attempts to circumvent it. I submitted a letter to the <em>Times</em> explaining the problem and sent email to Seife, but I received no response.” <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>In a sense, the decision you make here is not discrete, since you can vary your speed continuously. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Hodge theory and persistent homology2019-09-25T00:00:00-04:002019-09-25T00:00:00-04:00https://www.jakobhansen.org/2019/09/25/hodge-persistence<p><a href="https://www.math.upenn.edu/~ghrist/preprints/barcodes.pdf">Persistent homology</a> is a great way to get information about “approximate topology” of some space. The standard way to approach it is through filtrations of topological spaces. Say we have a simplicial complex \(X\) with a filtration function \(f: X \to \mathbb{R}\). This filtration gives each simplex of \(X\) a filtration level or weight. We then have a sequence of inclusions \(i_a^b: X^a = f^{-1}((-\infty,a]) \to f^{-1}((-\infty, b]) = X^b\) for each pair of weight values \(a \leq b\) in the filtration. This leads to a sequence of maps \(H_k (i_a^b): H_k(X^a) \to H_k(X^b)\), which turn out to decompose nicely in the sense that there exists a basis for each \(H_k(X^a)\) so that for every \(a \leq b\) the matrix representing \(H_k (i_a^b)\) is diagonal: each basis element of \(H_k(X^a)\) is either mapped to zero or to a basis element of \(H_k(X^b)\).</p>
<p>In essence, we can represent the way the homology of \(X^a\) evolves by a collection of bars, where a bar represents a homology class that persists across a certain distance in the filtration of \(X\). Bars which persist for a long time then represent, perhaps, features of the filtered space which are close to being homology classes.</p>
<p>Another way to approach approximate topology of a weighted simplicial complex is with <a href="https://arxiv.org/abs/1507.05379">discrete Hodge theory</a>. Here we start with a weighted simplicial complex \(X\); the weights on \(k\)-simplices give a natural inner product structure on \(C_k(X;\mathbb{R})\). This means that we can produce adjoints of the boundary maps \(\partial_k\). The Hodge Laplacian of \(X\) is the graded operator \(\Delta = (\partial + \partial^\ast)^2 = \partial \partial^\ast + \partial^\ast \partial\). On \(C_k\), this restricts to \(\Delta_k = \partial_{k+1}\partial_{k+1}^\ast + \partial_k^\ast\partial_k\). The main theorem of discrete Hodge theory is that \(\ker \Delta_k \simeq H_k(X;\mathbb{R})\). The \(k\)-chains in \(\ker \Delta_k\) are canonical representatives for homology classes of \(X\), and the eigenvectors of \(\Delta_k\) with eigenvalues close to \(0\) can be seen as representing \(k\)-chains which are almost homology classes.</p>
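<p>As a sanity check on the statement \(\ker \Delta_k \simeq H_k(X;\mathbb{R})\), here is a small sketch in plain Python. It assumes unit weights on every simplex, so that the adjoint \(\partial^\ast\) is just the transpose; the example complex (two triangles sharing an edge, only one of them filled) and all the helper names are my own choices for illustration.</p>

```python
from fractions import Fraction

def boundary_matrix(faces, simplices):
    """Matrix of the boundary map from chains on `simplices` to chains on `faces`."""
    index = {s: i for i, s in enumerate(faces)}
    M = [[0] * len(simplices) for _ in faces]
    for j, s in enumerate(simplices):
        for pos in range(len(s)):
            # Dropping the vertex in position `pos` contributes sign (-1)^pos.
            M[index[s[:pos] + s[pos + 1:]]][j] = (-1) ** pos
    return M

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rank(M):
    """Rank over the rationals by Gaussian elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0]) if M else 0):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

vertices = [(0,), (1,), (2,), (3,)]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
triangles = [(0, 1, 2)]          # the triangle (1, 2, 3) is left hollow

d1 = boundary_matrix(vertices, edges)
d2 = boundary_matrix(edges, triangles)
up = matmul(d2, transpose(d2))    # partial_2 partial_2^*
down = matmul(transpose(d1), d1)  # partial_1^* partial_1
laplacian = [[u + d for u, d in zip(ru, rd)] for ru, rd in zip(up, down)]
betti_1 = len(edges) - rank(laplacian)  # dimension of ker Delta_1
```

<p>The kernel of \(\Delta_1\) here is one-dimensional, matching the single unfilled cycle around the hollow triangle.</p>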
<p>I’ve wanted a way to connect these two perspectives for a while now. I recently discovered a paper that provides a straightforward way of relating them. The paper is <a href="https://arxiv.org/abs/1502.07928">Persistent Homology and Floer-Novikov Theory</a> by Michael Usher and Jun Zhang. On their way to building a connection between persistence and Floer theory, the authors develop a new way of looking at persistent homology that makes the analogy with Hodge theory much more evident.</p>
<p>The key is to express persistent homology in terms of a <em>non-Archimedean</em> singular value decomposition of the boundary operator. To see what this means, we’ll have to go through a little bit of background. Fortunately, we don’t need to go into all the generality that Usher and Zhang do.</p>
<p><strong>Definition:</strong> A <em>valuation</em> on a field \(K\) is a map \(\nu: K \to \mathbb{R} \cup \{\infty\}\) such that</p>
<ol>
<li>\(\nu(x) = \infty\) iff \(x = 0\),</li>
<li>\(\nu(xy) = \nu(x) + \nu(y)\), and</li>
<li>\(\nu(x+y) \geq \min(\nu(x),\nu(y))\).</li>
</ol>
<p>These conditions make \(N(x) = \exp(- \nu(x))\) into a (multiplicative) <em>norm</em> or absolute value on \(K\): \(\exp(-\nu(x)) = 0\) iff \(\nu(x) = \infty\), \(N(xy) = N(x)N(y)\), and \(N(x+y) \leq \max(N(x),N(y)) \leq N(x) + N(y)\). Indeed, this is what you might call an ultrametric absolute value. Every field has the trivial valuation with \(\nu(x) = \infty\) if \(x = 0\) and \(\nu(x) = 0\) otherwise, and this is the only valuation we will actually need.</p>
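<p>To get a feel for the axioms, here is a quick sketch of two valuations on \(\mathbb{Q}\): the trivial one used in the rest of this post, and the \(p\)-adic valuation, which I’m adding purely as a familiar nontrivial example (it is not needed for anything that follows). The function names are mine.</p>

```python
import math
from fractions import Fraction

def trivial_valuation(x):
    """nu(0) = infinity, nu(x) = 0 otherwise."""
    return math.inf if x == 0 else 0

def p_adic_valuation(x, p):
    """Exponent of the prime p in the factorization of the rational x."""
    x = Fraction(x)
    if x == 0:
        return math.inf
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v

def satisfies_axioms(nu, samples):
    """Check multiplicativity and the ultrametric inequality on sample pairs."""
    return all(
        nu(x * y) == nu(x) + nu(y) and nu(x + y) >= min(nu(x), nu(y))
        for x in samples for y in samples
    )

samples = [Fraction(n, d) for n in range(-6, 7) for d in (1, 2, 3, 4)]
ok_trivial = satisfies_axioms(trivial_valuation, samples)
ok_2_adic = satisfies_axioms(lambda x: p_adic_valuation(x, 2), samples)
```

<p>Checking finitely many samples obviously proves nothing, but it does catch sign errors in the definitions.</p>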
<p><strong>Definition:</strong> A <em>non-Archimedean normed vector space</em> over \(K\) is a vector space \(C\) together
with a filtration function \(\ell: C \to \mathbb{R} \cup \{-\infty\}\) such that</p>
<ol>
<li>\(\ell(x) = -\infty\) iff \(x = 0\),</li>
<li>\(\ell(\lambda x) = \ell(x) - \nu(\lambda)\) for \(\lambda \in K\), and</li>
<li>\(\ell(x + y) \leq \max(\ell(x),\ell(y))\).</li>
</ol>
<p>Taking \(\|{v}\| = \exp( \ell(v))\) we get \(\|x\| = 0\) iff \(x = 0\), \(\|\lambda x\| = N(\lambda)\|x\|\), and \(\|x+y\| \leq \max(\|x\|,\|y\|)\). Again, this is a non-Archimedean or ultrametric-type norm on \(C\).</p>
<p>A simplicial complex \(X\) with a filtration function \(f: X \to \mathbb{R}\) has a natural non-Archimedean norm on its space of \(k\)-chains \(C_k(X;K)\), where \(K\) has the trivial valuation. This norm is given by the filtration function \(\ell(\sum_{i} a_i \sigma_i) = \max_{i : a_i \neq 0} f(\sigma_i)\).</p>
<p><strong>Definition:</strong> Let \(C\) be a non-Archimedean normed vector space. Subspaces \(V\) and \(W\) of \(C\) are <em>orthogonal</em> if \(\ell(v+w) = \max(\ell(v),\ell(w))\) for all \(v \in V\), \(w \in W\). That is, \(\|v + w\| = \max(\|v\|,\|w\|)\). Compare this to orthogonality for an inner product space, where we have \(\|v + w\|^2 = \|v\|^2 + \|w\|^2\) when \(v\) and \(w\) are orthogonal. Similarly, a tuple of vectors \((v_1,\ldots,v_r)\) is orthogonal if \(\ell(\sum_i \lambda_i v_i) = \max_i \ell(\lambda_i v_i)\) for all choices of \(\lambda_1,\ldots,\lambda_r \in K\).</p>
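<p>Here is a tiny worked example of the filtration function on chains and of what orthogonality does and doesn’t allow. It is only a sketch: chains are dictionaries from (hypothetical) edge names to integer coefficients, and the field carries the trivial valuation, so nonzero scalars don’t change \(\ell\).</p>

```python
import math

def ell(chain, f):
    """Filtration level of a chain: the largest f-value on its support."""
    support = [s for s, a in chain.items() if a != 0]
    return max((f[s] for s in support), default=-math.inf)

def combine(lam, v, mu, w):
    """The linear combination lam * v + mu * w of two chains."""
    out = {}
    for coeff, chain in ((lam, v), (mu, w)):
        for s, a in chain.items():
            out[s] = out.get(s, 0) + coeff * a
    return out

f = {"ab": 1, "bc": 2, "ca": 3}   # filtration values on three edges
v = {"ab": 1, "ca": 1}            # ell(v) = 3
w = {"bc": 1, "ca": 1}            # ell(w) = 3
u = combine(1, v, -1, w)          # ab - bc, so ell(u) = 2

# v and w are not orthogonal: their difference drops below max(ell(v), ell(w)).
not_orthogonal = ell(u, f) < max(ell(v, f), ell(w, f))

# v and u are orthogonal, at least on these sample coefficients.
scalars = [-2, -1, 1, 2]
orthogonal = all(
    ell(combine(lam, v, mu, u), f) == max(ell(v, f), ell(u, f))
    for lam in scalars for mu in scalars
)
```

<p>Replacing \(w\) by \(u = v - w\) is one step of the Gram-Schmidt-like procedure mentioned in the next paragraph: subtract off whatever is causing the cancellation at the top filtration level.</p>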
<p>Not all non-Archimedean normed vector spaces admit an orthogonal basis. However, if one exists, any basis can be orthogonalized via a sort of Gram-Schmidt algorithm. This orthogonalized basis is not unique.</p>
<p>We’re finally at the point where we can define a singular value decomposition. Let \(C\) and \(D\) be non-Archimedean normed vector spaces with orthogonal bases, and let \(A: C \to D\) be a linear map of rank \(r\). A singular value decomposition of \(A\) is a choice of orthogonal ordered bases \((y_1,\ldots,y_n)\) for \(C\) and \((x_1,\ldots,x_m)\) for \(D\) such that</p>
<ol>
<li>\((y_{r+1},\ldots,y_n)\) is an orthogonal ordered basis for \(\ker A\)</li>
<li>\((x_1,\ldots,x_r)\) is an orthogonal ordered basis for \(\text{im } A\)</li>
<li>\(A y_i = x_i\) for \(1 \leq i \leq r\)</li>
<li>\(\ell_C(y_1) - \ell_D(x_1) \geq \cdots \geq \ell_C(y_r)- \ell_D(x_r)\).</li>
</ol>
<p>The last condition is what determines the ordering for the singular values. If we write this in terms of norms, this says that \(\|y_1\|/\|x_1\| \geq \cdots \geq \|y_r\|/\|x_r\|\). The singular values should be the inverses of these, i.e., \(\sigma_i = \|x_i\|/\|y_i\|\).</p>
<p>Usher and Zhang show that these non-Archimedean SVDs exist, and can be computed via a Gaussian elimination-like algorithm. Now let’s interpret this when \(A = \partial_{k+1}: C_{k+1}(X;K) \to Z_k(X;K) = \ker \partial_k\), with the norm given by the simplicial filtration. We get bases \((y_i)\) of \(C_{k+1}(X;K)\) and \((x_i)\) of \(Z_k(X;K)\), and the
basis \((x_i)\) has elements corresponding to \(k\)-cycles which are either the boundary of some \((k+1)\)-chain, or are not. The \(k\)-cycles which are never killed by a boundary are those \(x_i\) for \(i > r\), and the others have \(i \leq r\), and hence are paired with basis elements \(y_i\) of \(C_{k+1}\). A singular value \(\sigma_i\) less than 1 corresponds to a pair of a \(k\)-cycle \(x_i\) and a \((k+1)\)-chain \(y_i\) whose boundary is \(x_i\), where \(x_i\) is born at an earlier filtration level than \(y_i\). In persistent homology terms, we have a bar of length \(\ell(y_i) - \ell(x_i)\). The \(x_i\) for \(i > r\) correspond to bars starting at \(\ell(x_i)\) and extending to infinity, because they are not the boundary of any \((k+1)\)-chain, no matter how large its norm.</p>
<p>So we’ve arrived at the remarkable fact that the non-Archimedean SVD of \(\partial_{k+1}\) contains all the information in the barcode for the \(k\)-dimensional persistent homology of the filtered complex \(X\). We’re nearly to the connection with the Hodge-theoretic version of approximate homology—all we need to do is recall the connection between the eigendecomposition of \(\Delta_k\) and the singular value decomposition of \(\partial_{k+1}\).</p>
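<p>To see the birth/death pairing in action, here is a sketch of the standard persistence column reduction over \(\mathbb{Z}/2\) with the trivial valuation. I’m assuming it is close in spirit to the Gaussian elimination-like algorithm of Usher and Zhang, but it is the textbook reduction rather than a transcription of theirs. Each nonzero reduced column pairs a cycle (born with its lowest simplex) with the chain that kills it; the function name and the example filtration are made up.</p>

```python
from itertools import combinations

def barcode(filtration):
    """filtration: dict mapping simplices (sorted vertex tuples) to filtration
    values, closed under faces with f(face) <= f(simplex).
    Returns sorted (dimension, birth, death) triples over Z/2."""
    order = sorted(filtration, key=lambda s: (filtration[s], len(s), s))
    index = {s: i for i, s in enumerate(order)}
    reduced, pivot_to_col, paired, bars = {}, {}, set(), []
    for j, s in enumerate(order):
        col = set()
        if len(s) > 1:
            col = {index[face] for face in combinations(s, len(s) - 1)}
        # Add earlier reduced columns (mod 2) until the lowest entry is new.
        while col and max(col) in pivot_to_col:
            col ^= reduced[pivot_to_col[max(col)]]
        reduced[j] = col
        if col:
            i = max(col)                 # simplex i creates the cycle killed by j
            pivot_to_col[i] = j
            paired.update((i, j))
            bars.append((len(order[i]) - 1, filtration[order[i]], filtration[s]))
    for j, s in enumerate(order):
        if j not in paired:              # a cycle that never becomes a boundary
            bars.append((len(s) - 1, filtration[s], float("inf")))
    return sorted(bars)

# A triangle whose edges appear at times 1, 2, 3 and which is filled in at time 5.
example = {
    ("a",): 0, ("b",): 0, ("c",): 0,
    ("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 3,
    ("a", "b", "c"): 5,
}
bars = barcode(example)
```

<p>In this example the 1-cycle born at level 3 is paired with the 2-simplex at level 5, giving a bar of length \(\ell(y_i) - \ell(x_i) = 2\), while one connected component survives forever.</p>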
<p>Since \(\ker \partial_k\) contains \(\text{im } \partial_{k+1}\), the eigenvalues of \(\Delta_k\) split into three parts: the set of zero eigenvalues, the nonzero eigenvalues of \(\partial_{k+1}\partial_{k+1}^*\), and the nonzero eigenvalues of \(\partial_{k}^*\partial_k\). We define the up-Laplacian \(\Delta_k^+ = \partial_{k+1}\partial_{k+1}^*\) and the down-Laplacian \(\Delta_k^- = \partial_{k}^*\partial_k\). The eigenvalues of \(\Delta_k^+\) are the squares of the singular values of \(\partial_{k+1}\); restricting to \(\ker \partial_k = Z_k\) just means we ignore the subspace spanned by the eigenvectors of \(\Delta_k^-\) with nonzero eigenvalues. Returning to the interpretation of the SVD, we see that the eigenvectors of \(\Delta_k^+\) with nonzero eigenvalues correspond to \(k\)-cycles which are the boundaries of certain \((k+1)\)-chains; the eigenvalues tell us about the relative sizes of the cycles and the chains that they bound. In particular, an eigenvector with small eigenvalue corresponds to a \(k\)-cycle which is only filled in by a large \((k+1)\)-chain.</p>
<p>What about the other nonzero eigenvalues, the ones that come from \(\Delta_k^-\)? I think the best way to look at these is to switch to cohomology. Eigenvectors of \(\Delta_k^-\) form an orthogonal basis of \(k\)-cocycles which are paired with an orthogonal basis of \((k-1)\)-cochains that they cobound. The eigenvalues are the relative norms of the cocycles and the cochains that kill them. A small eigenvalue represents a cocycle which almost fails to be a coboundary—it requires a lot of “energy” to produce it as a coboundary, perhaps.</p>
<p>The eigendecomposition of the full Hodge Laplacian \(\Delta_k\) then contains representatives of the genuine homology classes of \(X\), as \(\ker \Delta_k\), as well as approximate homology (eigenvectors of \(\Delta_k^+\)) and approximate cohomology (eigenvectors of \(\Delta_k^-\)) of \(X\).</p>
<p>In any event, we have a formal analogy between discrete Hodge theory and persistent homology by way of two types of singular value decomposition. One might interpret the distinction between the two in terms of the way the size of chains is measured: Hodge theory explicitly works with \(\ell^2\) norms to measure size, while persistent homology feels more like using an \(\ell^\infty\) norm (although this is not exactly the case). But both approaches try to find “small” \(k\)-cycles that are killed by “large” \((k+1)\)-chains. This relationship should be helpful in developing a deeper understanding and interpretation of both persistence and Hodge theory.</p>Flow-coloring duality2019-04-09T00:00:00-04:002019-04-09T00:00:00-04:00https://www.jakobhansen.org/2019/04/09/flow-coloring-duality<p>Graphs are one of the simplest sorts of topological spaces. In fact, they’re so
simple that topologists don’t usually spend much time studying them. Their
homology and homotopy groups are determined purely combinatorially; every
connected graph is homotopy equivalent to a wedge sum of circles. Graph theory
focuses on stricter combinatorial invariants, usually even stricter than
homeomorphism. However, a lot of the algebraic tools that topologists use to
study topological spaces have combinatorial properties that encode more than
just homotopical information when applied to graphs. Let’s look at the chain
complex of a graph, viewed as a simplicial complex.</p>
<p>This is a very simple algebraic object, a complex</p>
\[\cdots \to 0 \to \mathbb{Z}^{|E|} \to \mathbb{Z}^{|V|} \to 0 \to \cdots\]
<p>A flow on \(G\) is
an element of the kernel of the only nontrivial boundary map of \(G\). That
is, given an orientation of the edges, it is a choice of an integer \(x_e\)
for each edge \(e\) of \(G\) satisfying Kirchhoff’s current law: the net
flow out of any vertex is zero. This is a very familiar topological object; the
kernel of the boundary map is just \(H_1(G)\). The fact that we care about
the particular combinatorial structure of \(G\), however, means that the
actual construction of the boundary map is relevant, not just its homology up
to isomorphism. In particular, we can look at the way that \(H_1(G)\) sits
inside \(C_1(G) = \mathbb{Z}^{|E|}\). One way to do this is to consider the
set of \(k\)-flows: flows that are less than \(k\) in absolute value. By
restricting the size of flows, we have gone from a homotopy-invariant object to
one depending on the specific combinatorial structure of \(G\).</p>
<p>While combinatorial, this quantity still has a sort of topological feel, and
topological thinking is still relevant. For example, suppose \(G\) is a
planar graph with dual \(G'\). Every \(k\)-coloring of \(G'\) gives a
nowhere-zero \(k\)-flow on \(G\).</p>
<p>To see this, let’s first review the relationship between \(G\) and \(G'\)
from a topological perspective. Since \(G\) is a planar graph, it corresponds
to a cell decomposition of a simply-connected closed subset of \(S^2\).</p>
<p><img src="/assets/flowcoloring/Gcelldecomp.png" alt="G and its cell decomposition" class="center-image" /></p>
<p>The complement of this subset in \(S^2\) is an open disc,
so we in fact get a cell decomposition of the sphere.
If \(G\) is bridge-free, this decomposition is a regular
cell decomposition, which means that it has a Poincaré dual.</p>
<p><img src="/assets/flowcoloring/Gprime.png" alt="The dual graph G'" class="center-image" /></p>
<p>Consider the dual
cell complexes \(X\) and \(X'\), coming from \(G\) and \(G'\). Both
are decompositions of \(S^2\), and by construction, we have \(C_i(X) \cong
C^{2-i}(X')\). Indeed, the following diagram commutes:</p>
<p><img src="/assets/flowcoloring/dualcomplexes.svg" alt="Dual complexes" class="center-image" /></p>
<p>Now, suppose we have a coloring of \(G'\), which corresponds to a coloring of
the 2-cells of \(X\). Integer-valued 0-cochains on \(G'\) correspond to
colorings, in that if we have a 0-cochain with values in
\(\{0,\ldots,k-1\}\) whose coboundary vanishes nowhere, it defines a
\(k\)-coloring of \(G'\), and vice versa.</p>
<p><img src="/assets/flowcoloring/coloring.png" alt="Colored graph and dual" class="center-image" /></p>
<p>Given a \(k\)-coloring of \(G'\), represented as a 0-cochain \(x\), we
take \(\delta x\), giving a nowhere-vanishing 1-cochain of \(G'\). By
duality, this corresponds to a 1-chain \(y\) of \(G\), and we see that
\(\partial y = 0\) because \(\delta^2 x = 0\). Note also that \(y\) can
never exceed \(k-1\) in absolute value, since its value on each edge is the
difference of two terms in \(\{0,\ldots,k-1\}\). This shows that every
\(k\)-coloring of \(G'\) gives a \(k\)-flow of \(G\).</p>
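<p>Here is the coloring-to-flow direction worked out on a small example. This is just a sketch: I take \(G\) to be the edge graph of a tetrahedron (so \(G'\) is again \(K_4\)), list its four faces with a consistent orientation, and pick a 4-coloring of the faces by hand. The flow is the boundary of the 2-chain whose coefficients are the colors, which is the dual description of \(\delta x\).</p>

```python
# Faces of a tetrahedron, oriented so each edge is traversed once in each direction.
faces = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
# A 4-coloring of the faces, i.e. of the vertices of the dual graph G'.
coloring = {faces[0]: 0, faces[1]: 1, faces[2]: 2, faces[3]: 3}

def flow_from_coloring(faces, coloring):
    """Boundary of the 2-chain sum_f color(f) * f, as a dict on edges (u, v), u < v."""
    flow = {}
    for face in faces:
        n = len(face)
        for i in range(n):
            u, v = face[i], face[(i + 1) % n]
            # Edges are oriented from the smaller to the larger vertex.
            edge, sign = ((u, v), 1) if u < v else ((v, u), -1)
            flow[edge] = flow.get(edge, 0) + sign * coloring[face]
    return flow

def net_outflow(flow, vertex):
    """Kirchhoff's current law says this should vanish at every vertex."""
    return sum(val if u == vertex else -val
               for (u, v), val in flow.items() if vertex in (u, v))

flow = flow_from_coloring(faces, coloring)
```

<p>Each edge value is the difference of the colors of the two faces it separates, so it is nonzero exactly when the coloring is proper and never exceeds \(k-1 = 3\) in absolute value.</p>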
<p>Going from flows to colorings is a bit trickier. A flow \(y\) on \(G\) is
an element of \(\ker \partial_1\). Since \(H_1(S^2) = 0\), we have \(\ker
\partial_1 = \text{im } \partial_2\). Thus, every flow on \(G\) is the
image of a 2-chain on \(X\), which corresponds to a 0-cochain \(x\) on
\(X'\). If \(y\) is nowhere-zero, \(x\) defines a coloring, since it
differs over every edge. To get a \(k\)-coloring, we will use a trick: mod
everything out by \(k\). A nowhere-zero \(k\)-flow clearly becomes a
nowhere-zero \(\mathbb Z/k\)-flow. And a 0-cochain valued in \(\mathbb Z /
k\) whose coboundary vanishes nowhere is clearly equivalent to a
\(k\)-coloring. Since none of the algebra changes when we take the quotient,
we see that \(x\) defines a \(k\)-coloring of \(G'\).</p>
<p>The flow-coloring duality gives an alternate statement of the four-color
theorem: every (bridgeless) planar graph has a nowhere-zero 4-flow. A
conjecture that then has the same spirit as the four-color theorem is that
every bridgeless graph has a nowhere-zero 5-flow.</p>
<p>This flow-coloring duality extends to higher-dimensional structures. If \(X\)
and \(X'\) are dual cell structures on a compact manifold \(M\) of
dimension \(n\), where \(H_{n-1}(M) = 0\), then nowhere-zero
\((n-1)\)-cycles of \(X\) which are everywhere less than \(k\) in
absolute value correspond to \(k\)-colorings of the 0-cells of \(X'\).
I’m not aware of anyone who has studied this, but it seems like an interesting
connection between graph theory, topology, and the combinatorics of cell complexes.</p>UMAP2018-05-04T00:00:00-04:002018-05-04T00:00:00-04:00https://www.jakobhansen.org/2018/05/04/UMAP<p>There’s a new dimensionality reduction algorithm called <a href="https://github.com/lmcinnes/umap">UMAP</a>. This would not necessarily be of any great note to me except that one of the authors has been talking up its roots in category theory and topology. A <a href="https://arxiv.org/abs/1802.03426">conference paper</a> recently came out explaining the algorithm, which appears to work quite well (and quickly!). Unfortunately, it’s a little bit hard to parse. This is my attempt to dig out the core of the algorithm.</p>
<p>The main idea behind a number of dimensionality reduction methods is this: In order to find a low-dimensional representation of a data set, compute some local representation of the data and then attempt to find a low-dimensional point cloud that has a similar representation. We want the representation to be local for two reasons: (a) it makes it faster to compute and (b) it allows us to use stochastic gradient descent and similar algorithms to optimize. UMAP does exactly this, but the paper buries all this in discussions of fuzzy simplicial sets and pseudo-metric spaces. Here’s the simplest description I have of what the algorithm actually does.</p>
<p>Take your data points \(\{X_1,\ldots,X_N\}\), and for each point find its \(k\) nearest neighbors. Let \(\sigma_i\) be the diameter of the neighborhood of \(X_i\) and let \(\rho_i\) be the distance from \(X_i\) to its nearest neighbor. Form a weighted graph on these points as follows. For each point \(X_j\) in the \(k\)-nearest neighborhood of \(X_i\), define \(w_i(X_i,X_j) = \exp(-(d(X_i,X_j)-\rho_i)/\sigma_i)\), and take \(w_i(X_i,X_j) = 0\) for every \(X_j\) outside that neighborhood. This is an asymmetric function on the data points; we symmetrize it by letting \(w(X_i,X_j) = w_i(X_i,X_j) + w_j(X_j,X_i) - w_i(X_i,X_j)w_j(X_j,X_i)\). Since \(d(X_i,X_j) \geq \rho_i\) for every neighbor, each directed weight is at most 1, so the symmetrized weights lie between 0 and 1 and we can treat them as edge weights on a graph.</p>
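<p>As a rough illustration of the construction above, here is a minimal pure-Python sketch. The function names <code>local_weights</code> and <code>symmetrize</code> are hypothetical, the metric is assumed to be Euclidean, and \(\sigma_i\) is taken to be the distance to the \(k\)-th neighbor as a stand-in for the size of the neighborhood. This is not the reference UMAP implementation, which chooses \(\sigma_i\) by a different normalization.</p>

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def local_weights(points, k):
    """Return directed weights as a dict {(i, j): w_i(X_i, X_j)}."""
    weights = {}
    for i, p in enumerate(points):
        # The k nearest neighbors of X_i, as (distance, index) pairs.
        nbrs = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        rho = nbrs[0][0]     # distance to the nearest neighbor
        sigma = nbrs[-1][0]  # assumption: scale = distance to the k-th neighbor
        for d, j in nbrs:
            weights[(i, j)] = math.exp(-(d - rho) / sigma)
    return weights

def symmetrize(weights):
    """Combine the two directed weights on each edge as a + b - a*b."""
    sym = {}
    for (i, j), a in weights.items():
        b = weights.get((j, i), 0.0)  # 0 if X_i is not a neighbor of X_j
        sym[(i, j)] = sym[(j, i)] = a + b - a * b
    return sym

# Hypothetical toy data: five distinct points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (4.0, 0.5)]
w = symmetrize(local_weights(points, k=2))
```

<p>The brute-force neighbor search is quadratic in the number of points; it is only meant to make the definitions concrete, and it assumes the points are distinct so that \(\sigma_i > 0\).</p>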
<p>Given two sets of edge weights \(w,w'\) on the data set, the cross entropy between them, summed over the edges \(i \sim j\) of the graph, is</p>
\[C(w,w') = \sum_{i \sim j} w(i,j)\log\left(\frac{w(i,j)}{w'(i,j)}\right) + (1-w(i,j))\log \left(\frac{1-w(i,j)}{1-w'(i,j)}\right).\]
<p>We let \(w\) be the weights computed from our data set and \(w'\) the weights computed from our low-dimensional embedding. The cross entropy is a sum of uncoupled terms, one for each edge, which means we can apply stochastic gradient descent by choosing a single term to approximate the gradient at each step. Further, the weights \(w'(i,j)\) depend only on the points in the neighborhoods of \(X_i\) and \(X_j\), i.e., at most \(2k-1\) points, so their derivatives are not difficult to compute. In particular, the difficulty does not grow as the number of points grows.</p>
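<p>To make the "sum of uncoupled terms" concrete, here is a small sketch that evaluates \(C(w,w')\) edge by edge and draws a one-edge stochastic estimate of its gradient with respect to the embedding weights. All function names are hypothetical, weights are stored in dicts keyed by edge, and the convention \(0 \log 0 = 0\) is assumed. It differentiates with respect to \(w'(i,j)\) only; how \(w'\) depends on the embedded points is left out.</p>

```python
import math
import random

def rel_term(x, y):
    """x * log(x / y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0.0 else x * math.log(x / y)

def edge_term(w, wp):
    """One summand of C(w, w') for a single edge; assumes 0 < wp < 1 unless w == wp."""
    return rel_term(w, wp) + rel_term(1.0 - w, 1.0 - wp)

def cross_entropy(w, wp):
    """Sum the per-edge terms over all edges of w."""
    return sum(edge_term(w[e], wp[e]) for e in w)

def edge_grad(w, wp):
    """Derivative of edge_term with respect to the embedding weight wp."""
    return -w / wp + (1.0 - w) / (1.0 - wp)

def stochastic_grad(w, wp, rng=random):
    """Pick one edge uniformly; return it with its gradient scaled by the edge count.

    Placing the scaled value in that edge's coordinate gives an unbiased
    estimate of the full gradient with respect to the weights wp.
    """
    e = rng.choice(list(w))
    return e, len(w) * edge_grad(w[e], wp[e])

# Hypothetical weights on three edges.
w = {(0, 1): 1.0, (0, 2): 0.6, (1, 2): 0.3}
wp = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.4}
cost = cross_entropy(w, wp)
edge, grad = stochastic_grad(w, wp)
```

<p>Each step touches a single edge, so its cost does not depend on the total number of edges.</p>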
<p>And that’s the entirety of the algorithm. The inspiration here comes from a
pair of adjoint functors between the category of finite metric spaces and the
category of fuzzy simplicial sets (really, just simplicial sets filtered over
the interval \((0,1]\)). This comes from a <a href="http://math.mit.edu/~dspivak/files/metric_realization.pdf">construction by David
Spivak</a> giving a
metrized version of the singular set functor on topological spaces. The fuzzy
singular set functor (for finite metric spaces) is essentially a scaled (and
reversed) version of the Vietoris-Rips filtration: it can be specified
completely by a filtration on the 1-skeleton. What UMAP does is construct a
local metric space approximation in a neighborhood of each point, scaling
things so that the \(k\)-nearest neighborhood of each point has radius 1. It
then sends this metric space through the fuzzy singular set functor in such a
way that the edge between \(X_i\) and its nearest neighbor is in the fuzzy
set with weight 1. This gives us our local edge weights. We can take a union of
all of these fuzzy singular sets by just taking a union of the fuzzy edge sets;
this is the symmetrization step. The choice of union is complicated by the fact
that there are many possible notions of unions and intersections of fuzzy sets.
The one McInnes and Healy chose is the union dual to the product t-norm (the
probabilistic sum).</p>
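<p>For reference, here is how the symmetrization formula arises as a fuzzy union. This is the standard t-norm/t-conorm duality, written out under the assumption that the complement of a membership value \(a\) is \(1-a\). The product t-norm \(T(a,b) = ab\) plays the role of intersection, and the corresponding union is obtained by De Morgan duality:</p>
\[S(a,b) = 1 - T(1-a,\,1-b) = 1 - (1-a)(1-b) = a + b - ab.\]
<p>Taking \(a = w_i(X_i,X_j)\) and \(b = w_j(X_j,X_i)\) recovers the symmetrized weight \(w(X_i,X_j)\).</p>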
<p>UMAP is fast and seems to give good results (at least as good as <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a>, which is state of the art). It’s not clear to me why the algorithm works so well; pretty much every dimensionality reduction algorithm has a plausible-sounding story, which makes it hard to know which ones are going to actually be useful.</p>
<p>That said, the underlying idea behind these dimensionality reduction methods feels very sheaf-like, and amenable to all sorts of more topological variations. For instance, rather than using a fuzzy set union, one could build a cellular sheaf out of these local Vietoris-Rips filtrations and try to match that with a low-dimensional point cloud. I’m reminded of Vidit Nanda’s work on <a href="https://arxiv.org/abs/1707.00354">local cohomology and stratification of simplicial complexes</a>. Can we preserve information about singularities in our low-dimensional embeddings? It seems to me that this approach might be a way to make topological data analysis scale: it’s purely local and hence the problem size does not grow exponentially in the size of the data set like persistent homology can.</p>