Covariance pooling works because it’s nonlinear

Researchers at Goodfire recently proposed a new method for probing transformer internal activations that they call covariance pooling. Probing is useful when you want to take advantage of the structured internal representations a model has learned in order to make predictions or inferences about data. When working with sequences, the granularity you want to analyze often differs from the granularity of the sequence itself. When working with language, you typically want to ask questions about sentences, paragraphs, or documents rather than words or tokens. When working with DNA sequences, you want to ask questions about genes rather than individual base pairs. But the model’s internal representation of a sequence of tokens is a sequence of vectors. Different sequences have different lengths, and whatever method you use to probe these representations needs to handle sequences of arbitrary length.

The default approach is either to use the representation of the last token in the sequence, or to average the representations of all tokens in the sequence (mean pooling). Using the last token often works surprisingly well, likely because the autoregressive training objective encourages the model to pack significant amounts of information about a sequence into the representation of its final token, but it’s often finicky and clearly suboptimal. Mean pooling can at least in principle incorporate information from vectors across the entire sequence, but it’s also a fairly dumb method.

Dumb methods are sometimes good, because they’re both interpretable and unlikely to overfit. But some things just aren’t represented cleanly in a way that these simple pooling methods can capture. So people have been looking for more sophisticated ways to probe across sequences. One reasonably popular option is attention probing, where you use a linear probe to identify which tokens in a sequence to aggregate. (Surprisingly, this doesn’t always do better than just using the last token.)
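For concreteness, here is a minimal sketch of one common form of attention probe, assuming a single scoring direction and a single readout direction. The names `attention_probe`, `q`, `w`, and `b` are illustrative, not taken from any particular implementation, and real attention probes vary in the details (multiple heads, position terms, and so on).

```python
import numpy as np

def attention_probe(X, q, w, b=0.0):
    """X: (M, D) token activations; q: (D,) scoring direction (assumed);
    w: (D,) readout direction (assumed). Returns (logit, attention weights)."""
    scores = X @ q                     # one scalar score per token
    scores = scores - scores.max()     # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()           # softmax over the sequence
    pooled = weights @ X               # attention-weighted average, shape (D,)
    return float(pooled @ w + b), weights

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 8))           # a toy sequence: 12 tokens, dimension 8
q = rng.normal(size=8)
w = rng.normal(size=8)
logit, weights = attention_probe(X, q, w)
```

With `q` set to zero the weights are uniform and this reduces to a linear probe on the mean-pooled activations, so mean pooling is a special case.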

Covariance pooling is a little different: the idea is to use the second moment of the activations in the sequence rather than the first moment. This is, in effect, applying the map $x \mapsto xx^T$ to the activations before mean pooling. (They use a learned linear compression of the output, so it’s actually $x \mapsto Lx(Rx)^T$ for two learned projections $L$ and $R$.)
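A minimal NumPy sketch of that feature map, assuming tokens are stored as rows of an `(M, D)` array and that `L` and `R` are `(K, D)` matrices. Here they are random stand-ins; in the method described above they are learned. The function name `covariance_pool` is made up for illustration.

```python
import numpy as np

def covariance_pool(X, L, R):
    """Mean over tokens of (L x)(R x)^T, flattened to a (K*K,) feature vector.
    X: (M, D) activations; L, R: (K, D) projections."""
    A = X @ L.T                        # (M, K): L x_t for every token t
    B = X @ R.T                        # (M, K): R x_t for every token t
    return (A.T @ B / X.shape[0]).ravel()

rng = np.random.default_rng(0)
M, D, K = 200, 640, 64                 # sequence length, model dim, compressed dim
X = rng.normal(size=(M, D))
L = rng.normal(size=(K, D)) / np.sqrt(D)
R = rng.normal(size=(K, D)) / np.sqrt(D)
features = covariance_pool(X, L, R)    # input to a linear probe
```

Projecting first keeps the pooled feature at `K*K` entries rather than `D*D`, and it never materializes the full `D x D` second-moment matrix.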

There’s a fairly obvious question: what happens if you apply the feature map after mean pooling? Covariance pooling is doing two things here: aggregating information in a way that is sensitive to localized co-occurrence of features, and applying a nonlinear feature map before training a linear probe. How much of the benefit is due to the nonlinearity?
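One way to see exactly what separates the two options: the pooled second moment equals the outer product of the mean plus the within-sequence covariance, $\frac{1}{M}\sum_t x_t x_t^T = mm^T + C$. The sketch below checks this numerically without the projections (which are linear, so the identity carries through them); the helper names are illustrative.

```python
import numpy as np

def second_moment(X):
    """Map then pool: mean over tokens of x x^T. X: (M, D)."""
    return X.T @ X / X.shape[0]

def quadratic_of_mean(X):
    """Pool then map: outer product of the mean-pooled vector with itself."""
    m = X.mean(axis=0)
    return np.outer(m, m)

def within_sequence_cov(X):
    """Covariance of the tokens around their own mean (1/M normalization)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) + 0.5     # toy sequence with a nonzero mean
gap = second_moment(X) - quadratic_of_mean(X)
```

Here `gap` matches `within_sequence_cov(X)` up to floating-point error. So the quadratic map after mean pooling sees only the $mm^T$ term, and anything covariance pooling gains beyond that has to come from the within-sequence term. The identity says what separates the two, not how much each piece matters for a probe in practice.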

To find out, I tried a few binary classification tasks on Gemma 3 270M¹, focused on settings with at least moderately long sequences (a couple hundred words). These are:

- IMDB reviews: sentiment analysis on movie reviews
- Medical Abstracts: identify the type of disease discussed in an abstract from a medical paper
- Hyperpartisan News: identify whether a news article takes an extremely partisan viewpoint or is neutral
- Civil Comments: toxicity detection in internet comments

Here’s the accuracy of each probe. I didn’t put a lot of effort into hyperparameter optimization², so take these as rough lower bounds, particularly for the fancier probe types.

Dataset	Mean pool	Covariance pool	Mean pool with quadratic feature map
IMDB	0.907±0.005	0.911±0.007	0.904±0.005
Medical Abstracts	0.871±0.003	0.881±0.009	0.877±0.006
Hyperpartisan News	0.840±0.014	0.938±0.000	0.935±0.017
Civil Comments	0.769±0.005	0.828±0.008	0.768±0.008

IMDB and Medical Abstracts are essentially a wash between the methods. For Hyperpartisan News both covariance pooling and quadratic feature maps did better than plain mean pooling. This makes sense if models represent partisanship linearly along a left-right axis: a linear classifier can’t bend the extremes around into a horseshoe, but second-order feature maps can. Conversely, for Civil Comments, quadratic feature maps are about as good as mean pooling, while covariance pooling does better.
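The horseshoe intuition can be illustrated with a toy example that has nothing to do with the actual activations: points on a one-dimensional axis, labeled "extreme" when they are far from zero in either direction. A linear classifier on one feature is just a threshold, so the sketch brute-forces the best threshold on $x$ and on $x^2$. All names and numbers here are made up for illustration.

```python
import numpy as np

k = np.arange(-20, 21)
x = k / 10.0                           # positions on a toy left-right axis
y = np.abs(k) > 10                     # "extreme" = far from center on either side

def best_threshold_accuracy(feature, labels):
    """Best accuracy of any single-threshold rule on one feature, in either direction."""
    best = 0.0
    for t in np.concatenate(([-np.inf], feature)):
        pred = feature > t
        acc = max(np.mean(pred == labels), np.mean((~pred) == labels))
        best = max(best, acc)
    return best

linear_acc = best_threshold_accuracy(x, y)          # linear probe on x
quadratic_acc = best_threshold_accuracy(x ** 2, y)  # linear probe on x^2
```

The threshold on `x` can only peel off one of the two extremes, while the threshold on `x ** 2` separates both at once.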

So yes, at least sometimes, the improvements from covariance pooling come purely from the nonlinear feature map and not from capturing local co-occurrence information. This shouldn’t be too surprising: nonlinear classifiers are more powerful than linear ones. A major reason we use linear classifiers is that they’re simple and interpretable. If you want the best performance possible, it’s often better to just fine-tune a base model for that purpose rather than try to intercept the concept somewhere in an LLM’s residual stream. But if you want more insight into what the underlying model is doing, linear probes have the advantage of corresponding exactly to directions in the residual stream. This means you can easily compare the directions of two probes or convert them to steering vectors. That’s harder with nonlinear probes³: their main contribution to interpretability is just giving evidence that a concept is represented in some way at some location in the model. That’s interesting, but less readily extensible to other tasks.

A related aside: any time you have a feature map, you’ve got to be thinking about kernels, right? This is a simple feature map, but we could look at the kernel it defines; it’s just the inner product of the output features. So if you have sequences $X$ and $Y$, represented as $D \times M$ and $D \times N$ matrices, it’s

$$k(X, Y) = \left\langle \frac{1}{M} X X^T, \frac{1}{N} Y Y^T \right\rangle = \frac{1}{MN} \text{tr}(XX^TYY^T) = \frac{1}{MN} \text{tr}(X^TYY^TX) = \frac{1}{MN}\|Y^TX\|_F^2.$$

This takes $O(MND)$ multiplications, compared with the $O((M+N)D^2)$ multiplications for the covariance pooled feature map (ignoring the projections). When $D$ is bigger than the sequence length, the kernel evaluation is cheaper than the feature map. But, of course, with the kernel approach you need to evaluate the kernel for every pair of sequences, rather than a feature map once for each sequence. So situations where the kernel approach is more efficient than the direct feature map are likely rare.
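The kernel identity and the cost comparison are both easy to check numerically. This is a sketch that follows the convention above of $D \times M$ and $D \times N$ matrices; the function names are illustrative, and the cost functions count only the leading-order multiplications.

```python
import numpy as np

def kernel_via_features(X, Y):
    """Frobenius inner product of the two pooled second-moment matrices."""
    M, N = X.shape[1], Y.shape[1]
    return float(np.sum((X @ X.T / M) * (Y @ Y.T / N)))

def kernel_direct(X, Y):
    """Same quantity via the N x M cross-product matrix Y^T X."""
    M, N = X.shape[1], Y.shape[1]
    return float(np.linalg.norm(Y.T @ X, "fro") ** 2 / (M * N))

def kernel_cost(M, N, D):
    return M * N * D                   # multiplications to form Y^T X

def feature_cost(M, N, D):
    return (M + N) * D * D             # multiplications to form X X^T and Y Y^T

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))           # D = 16, M = 5
Y = rng.normal(size=(16, 9))           # D = 16, N = 9
same = np.isclose(kernel_via_features(X, Y), kernel_direct(X, Y))
```

Per pair, the kernel is cheaper exactly when $MN/(M+N) < D$. With $D = 640$ and, say, $M = N = 200$, that is about $2.6 \times 10^7$ multiplications for the kernel against about $1.6 \times 10^8$ for the two feature maps. The per-pair saving is quickly swamped, though: a Gram matrix over $P$ sequences needs on the order of $P^2/2$ kernel evaluations but only $P$ feature maps.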

There are much more sophisticated kernels for sets of vectors, like the Bhattacharyya kernel. This also computes the covariance matrix of each set, along with the mean, then computes the Bhattacharyya coefficient $\int_{\mathbb{R}^D} \sqrt{p_X(z)p_Y(z)}\,dz$ between multivariate Gaussian distributions with these parameters. This boils down to a gnarly formula that involves determinants and inverses of covariance matrices. This is probably a nicer similarity measure in some sense, and it gives you an infinite dimensional feature map, but it also seems like a giant pain. No wonder they invented transformers.
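For reference, the gnarly formula is short enough to write down. With $\Sigma = (\Sigma_X + \Sigma_Y)/2$, the Bhattacharyya coefficient between two Gaussians is $\exp(-D_B)$, where $D_B = \frac{1}{8}(\mu_X - \mu_Y)^T \Sigma^{-1} (\mu_X - \mu_Y) + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_X \det \Sigma_Y}}$. The sketch below assumes tokens as rows of an `(M, D)` array and adds a small ridge `eps` to each covariance so it stays invertible when a sequence has fewer tokens than dimensions; that regularization and its value are assumptions for illustration, not a recommendation.

```python
import numpy as np

def gaussian_fit(X, eps=1e-3):
    """Mean and ridge-regularized covariance of the rows of X (shape (M, D))."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0] + eps * np.eye(X.shape[1])
    return mu, cov

def bhattacharyya_kernel(X, Y, eps=1e-3):
    """Bhattacharyya coefficient between Gaussians fitted to X and Y."""
    mu_x, S_x = gaussian_fit(X, eps)
    mu_y, S_y = gaussian_fit(Y, eps)
    S = 0.5 * (S_x + S_y)
    diff = mu_x - mu_y
    # slogdet avoids overflow/underflow in the determinants
    _, logdet = np.linalg.slogdet(S)
    _, logdet_x = np.linalg.slogdet(S_x)
    _, logdet_y = np.linalg.slogdet(S_y)
    dist = 0.125 * diff @ np.linalg.solve(S, diff) \
        + 0.5 * (logdet - 0.5 * (logdet_x + logdet_y))
    return float(np.exp(-dist))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=(40, 5)) + 0.3
k_xy = bhattacharyya_kernel(X, Y)
```

At a residual stream dimension in the hundreds, the determinants and the linear solve are where the pain comes from, which is part of why the plain second-moment kernel is so much easier to live with.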

It’s worth noting that even the Bhattacharyya kernel still doesn’t treat the sequence as a sequence, since the order of the vectors doesn’t matter. There are also kernels that take order into account, but they’re even more obscure. If we’re interested in more high-powered LLM probing techniques, kernel methods might be a place to look for inspiration.

¹ I tried doing some DNA sequence probing tasks like in the original post, but these were a giant pain to get working, and I’m not familiar enough with the domain to be confident I’m not doing something extremely dumb.
² Some specifics:
- Activations are from layer 9 of Gemma 3 270M (residual stream dimension 640), scaled to average norm 1.
- All datasets are subsampled to select long sequences and to balance labels. Training sets range in size from around 500 to 5000, and test sets range from 65 to around 3000 samples.
- Probes were trained with independent initializations and data ordering from five different seeds.
- Probes are trained for 100 epochs on all datasets.
- Linear probes have a learning rate of 1e-1 (surprisingly high, this is one of the few hyperparameters I tweaked), the others 1e-3.
- Nonlinear probes have weight decay of 1e-2 applied.
- No reconstruction loss is applied for covariance pooling or quadratic feature maps.
- The quadratic feature map is computed exactly the same way as the covariance pooled feature maps, just after mean pooling instead of beforehand.
- For the covariance pooling and quadratic feature map, the intermediate dimension (i.e. the number of rows in $L$ and $R$) is 64.
³ Though not impossible; see, e.g., “Toward universal steering and monitoring of AI models” for an approach to extracting steering vectors from (a specific class of) nonlinear probes.