Polysemanticity
The phenomenon where individual neurons in neural networks activate for multiple unrelated concepts, making single neurons uninterpretable.
Polysemanticity refers to the observation that individual neurons in trained neural networks don't encode single, clean concepts. A single neuron might fire for the Golden Gate Bridge, the color red, and certain Korean characters. The leading explanation is the superposition hypothesis: a model benefits from representing more features than it has neurons, so it encodes features as overlapping, non-axis-aligned directions in activation space, and each neuron participates in many of them. This is a fundamental obstacle to interpretability: you can't understand a model by inspecting individual neurons, because no single neuron corresponds to a single concept. Sparse autoencoders address this by decomposing activations into an overcomplete set of directions, many of which do correspond to single concepts.
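A minimal sketch of the sparse autoencoder idea, using NumPy with hypothetical random weights (a real SAE learns `W_enc`, `W_dec`, and the biases by gradient descent on the combined loss shown here): activations are projected into a feature space wider than the model dimension, a ReLU keeps feature activations non-negative and sparse, and the training objective trades off reconstruction error against an L1 sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the SAE is overcomplete, with more latent
# features (d_sae) than the model has dimensions (d_model).
d_model, d_sae = 8, 32

# Random stand-ins for learned parameters.
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1

def sae_forward(x, l1_coeff=1e-3):
    # Encode: project into the overcomplete feature basis;
    # ReLU yields non-negative, (after training) sparse activations.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decode: reconstruct the original activation from the features.
    x_hat = f @ W_dec
    # Training loss = reconstruction error + L1 sparsity penalty.
    recon_loss = np.mean((x - x_hat) ** 2)
    sparsity_loss = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return f, x_hat, recon_loss + sparsity_loss

# A batch of 4 stand-in activation vectors from some model layer.
x = rng.normal(size=(4, d_model))
f, x_hat, loss = sae_forward(x)
```

After training, each of the `d_sae` decoder rows is a candidate interpretable direction, even though the underlying neurons remain polysemantic.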
Also known as
superposition, feature superposition (strictly, these name the hypothesized mechanism that produces polysemanticity, but the terms are often used interchangeably)