Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah (2022)

Highlighted by Jan Jensen

Most NNs are notoriously hard to interpret. While there are a few cases, mostly in image classification, where some features (like lines or corners) can be assigned to particular neurons, in general it seems like every part of the NN contributes to every prediction. This paper provides some powerful insight into why this is, by analysing simple toy models.

The study builds on the idea that the output of a hidden layer is an N-dimensional embedding vector (V) that encodes a feature of the data (N is the number of neurons in the layer). You might have seen this famous example from language models: V("king") - V("man") + V("woman") = V("queen").
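To make the analogy concrete, here is a minimal sketch with made-up two-dimensional embeddings, where the axes are chosen by hand to represent "royalty" and "gender"; real language-model embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 2-d embeddings for illustration only
# (axis 0 = "royalty", axis 1 = "gender"; values are invented).
V = {
    "king":  np.array([0.9,  0.7]),
    "queen": np.array([0.9, -0.7]),
    "man":   np.array([0.1,  0.7]),
    "woman": np.array([0.1, -0.7]),
}

# king - man + woman: remove "male", add "female", keep "royalty"
result = V["king"] - V["man"] + V["woman"]
print(result)  # lands on V["queen"] for these toy vectors
```

The point is only that directions in the embedding space, not individual coordinates, carry the meaning.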

Naively, one would expect that an N-neuron layer can encode N different features, since there are N different (i.e. orthogonal) vectors. However, the paper points out that the number of almost orthogonal vectors (say, with pairwise angles between 89° and 91°) increases exponentially with N, so that NNs can represent many more features than they have dimensions, which they term "superposition".
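You can check the almost-orthogonality claim numerically: the sketch below (numpy, with dimensions chosen by me for speed) packs twice as many random unit vectors as there are dimensions and measures the pairwise angles, nearly all of which cluster near 90°.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_vecs = 1000, 2000  # twice as many vectors as dimensions

# Random unit vectors: in high dimensions, two random directions
# are almost orthogonal with overwhelming probability.
V = rng.standard_normal((n_vecs, n_dims))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Angles between all distinct pairs of vectors
cos = V @ V.T
pairs = cos[np.triu_indices(n_vecs, k=1)]
angles = np.degrees(np.arccos(np.clip(pairs, -1.0, 1.0)))

print(f"mean pairwise angle: {angles.mean():.1f} deg")
print(f"fraction within 85-95 deg: {np.mean(np.abs(angles - 90) < 5):.3f}")
```

Pushing the angle tolerance tighter or the vector count higher eventually breaks this, but the exponential growth with N means the packing headroom is enormous.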

Since most features are stored in almost, but not exactly, orthogonal vectors, they will necessarily have many non-zero components and so cannot be assigned to a specific neuron. The authors further show that superposition is driven by data sparsity, i.e. most features being inactive in any given input: more sparsity, more superposition, less interpretability.
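Why sparsity matters can be seen without any training. The sketch below is a minimal numpy illustration in the spirit of the paper's toy model (a ReLU readout through WᵀW), not their trained setup: 50 features are squeezed into 20 dimensions via random almost-orthogonal directions, and reconstruction is good when only one feature is active but degrades badly when all features are active at once, because the small overlaps between directions add up to large interference.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_dims = 50, 20  # more features than "neurons"

# Random unit-norm embedding directions (columns of W),
# almost orthogonal but not exactly.
W = rng.standard_normal((n_dims, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

def reconstruct(x):
    # Toy readout in the spirit of the paper's model: ReLU(W^T W x)
    return np.maximum(W.T @ (W @ x), 0.0)

# Sparse input: a single active feature
x_sparse = np.zeros(n_features)
x_sparse[3] = 1.0

# Dense input: every feature active at once
x_dense = np.ones(n_features)

err_sparse = np.abs(reconstruct(x_sparse) - x_sparse).mean()
err_dense = np.abs(reconstruct(x_dense) - x_dense).mean()
print(f"mean error, sparse input: {err_sparse:.3f}")
print(f"mean error, dense input:  {err_dense:.3f}")
```

With sparse inputs the interference terms are small and the ReLU clips many of them away, so superposition is nearly free; with dense inputs the interference swamps the signal.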

The paper is very thorough and there are many more insights that I have skipped. But I hope this highlight has made you curious enough to have a look at the paper. I can also recommend this brilliant introduction to superposition by 3Blue1Brown to get you started.

Now, it's important to note that these insights are obtained by analysing simple toy problems. It will be interesting to see if and how they apply to real-world applications, including chemistry.

This work is licensed under a Creative Commons Attribution 4.0 International License.