UMAP | Jakob Hansen

There’s a new dimensionality reduction algorithm called UMAP. This would not necessarily be of any great note to me except that one of the authors has been talking up its roots in category theory and topology. A conference paper recently came out explaining the algorithm, which appears to work quite well (and quickly!). Unfortunately, it’s a little bit hard to parse. This is my attempt to dig out the core of the algorithm.

The main idea behind a number of dimensionality reduction methods is this: In order to find a low-dimensional representation of a data set, compute some local representation of the data and then attempt to find a low-dimensional point cloud that has a similar representation. We want the representation to be local for two reasons: (a) it makes it faster to compute and (b) it allows us to use stochastic gradient descent and similar algorithms to optimize. UMAP does exactly this, but the paper buries all this in discussions of fuzzy simplicial sets and pseudo-metric spaces. Here’s the simplest description I have of what the algorithm actually does.

Take your data points $\{X_1,\ldots,X_N\}$ , and for each point find its $k$ nearest neighbors. Let $\sigma_i$ be the diameter of the neighborhood of $X_i$ and let $\rho_i$ be the distance from $X_i$ to its nearest neighbor. Form a weighted graph from these points by doing the following. For points $X_j$ in the $k$ -nearest neighborhood of $X_i$ , define $w_i(X_i,X_j) = \exp(-(d(X_i,X_j)-\rho_i)/\sigma_i)$ . This is an asymmetric function on the data points; we symmetrize it by letting $w(X_i,X_j) = w_i(X_i,X_j) + w_j(X_j,X_i) - w_i(X_i,X_j)w_j(X_j,X_i)$ . We can treat these as weights on a graph that vary between 0 and 1.

Given two weights $w,w'$ on the data set, the cross entropy between them is

C(w,w') = \sum_{i \sim j} w(i,j)\log\left(\frac{w(i,j)}{w'(i,j)}\right) + (1-w(i,j))\log \left(\frac{1-w(i,j)}{1-w'(i,j)}\right).

We let $w$ be the weights computed from our data set and $w'$ the weights computed from our low-dimensional embedding. The cross entropy is a sum of uncoupled terms, one for each edge, which means we can apply stochastic gradient descent by choosing a single term to approximate the gradient at each step. Further, the weights $w'(i,j)$ depend only on the points in the neighborhoods of $X_i$ and $X_j$ , i.e., at most $2k-1$ points, so their derivatives are not difficult to compute. In particular, the difficulty does not grow as the number of points grows.

And that’s the entirety of the algorithm. The inspiration here comes from a pair of adjoint functors between the category of finite metric spaces and the category of fuzzy simplicial sets (really, just simplicial sets filtered over the interval $(0,1]$ ). This comes from a construction by David Spivak giving a metrized version of the singular set functor on topological spaces. The fuzzy singular set functor (for finite metric spaces) is essentially a scaled (and reversed) version of the Vietoris-Rips filtration: it can be specified completely by a filtration on the 1-skeleton. What UMAP does is construct a local metric space approximation in a neighborhood of each point, scaling things so that the $k$ -nearest neighborhood of each point has radius 1. It then sends this metric space through the fuzzy singular set functor in such a way that the edge between $X_i$ and its nearest neighbor is in the fuzzy set with weight 1. This gives us our local edge weights. We can take a union of all of these fuzzy singular sets by just taking a union of the fuzzy edge sets. This is the symmetrization step. This is complicated by the fact that there are many possible notions of unions and intersections of fuzzy sets. The one McInnes and Healy chose is the one given by the product t-norm.

UMAP is fast and seems to give good results (at least as good as t-SNE, which is state of the art). It’s not clear to me why the algorithm works so well; pretty much every dimensionality reduction algorithm has a plausible-sounding story, which makes it hard to know which ones are going to actually be useful.

That said, the underlying idea behind these dimensionality reduction methods feels very sheaf-like, and amenable to all sorts of more topological variations. For instance, rather than using a fuzzy set union, one could build a cellular sheaf out of these local Vietoris-Rips filtrations and try to match that with a low-dimensional point cloud. I’m reminded of Vidit Nanda’s work on local cohomology and stratification of simplicial complexes. Can we preserve information about singularities in our low-dimensional embeddings? It seems to me that this approach might be a way to make topological data analysis scale: it’s purely local and hence the problem size does not grow exponentially in the size of the data set like persistent homology can.