Polysemanticity: Why Individual Neurons Mean Nothing

A single neuron might fire for the Golden Gate Bridge, Korean text, and butter. This compression trick gives AI its power while making it nearly impossible to understand.

Tags: AI research, interpretability, AI safety, neural networks, Anthropic


Look at any single neuron in a large language model and you'll find it responds to an incoherent grab-bag of concepts. One neuron might activate for the Golden Gate Bridge, the color red, and fragments of Korean text. Another fires for both legal contracts and recipes involving butter.

This is polysemanticity, and it's the fundamental reason why looking inside neural networks has been so difficult. Understanding it matters for anyone following AI safety research, because polysemanticity is both the core problem mechanistic interpretability is trying to solve and a direct consequence of how neural networks achieve their power in the first place.

The Compression Trade-off

The basic insight comes from Anthropic's foundational 2022 research: neural networks represent more features than they have neurons by encoding features in overlapping combinations.

Think of it this way. If you have a 500-dimensional space (a modest layer in a neural network), you can perfectly represent 500 orthogonal features, one per dimension. But it turns out you can represent far more features if you're willing to tolerate some interference.

High-dimensional spaces have a useful geometric property: you can pack many "almost orthogonal" directions into them. Two vectors whose cosine similarity is near zero barely interfere with each other, and you can fit thousands of such nearly-orthogonal vectors into a 500-dimensional space. Neural networks exploit this relentlessly. Rather than dedicating one neuron to one concept, they encode features as directions in activation space that span many neurons. The Golden Gate Bridge isn't represented by "neuron 47"; it's represented by a specific pattern of activation across hundreds of neurons, and those same neurons help represent thousands of other concepts.
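The packing claim is easy to check numerically. This sketch (assuming NumPy; random directions rather than learned ones) draws far more unit vectors than there are dimensions and measures how close to orthogonal they are:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 500, 5000  # ten times more directions than dimensions

# Random unit vectors in a 500-dimensional activation space
vecs = rng.normal(size=(n_vectors, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Cosine similarities between two disjoint samples of those vectors
sims = vecs[:200] @ vecs[200:400].T
print(f"mean |cos|: {np.abs(sims).mean():.3f}")  # near 0: almost orthogonal
print(f"max  |cos|: {np.abs(sims).max():.3f}")
```

Random directions in high dimensions land nearly orthogonal to each other by default; a trained network can do better still by actively spreading its feature directions apart.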

Each concept is represented across many neurons, and each neuron represents many concepts.

That's the core of superposition: a neural network's way of compressing more information than its architecture would naively seem to allow.

So why do networks do this? Because they have to. Language models need to represent an enormous number of concepts: every person, place, historical event, abstract idea, writing style, and logical relationship needs some internal representation. The number of "things a model needs to know about" vastly exceeds the number of neurons available.

Networks develop a priority system. According to research on feature capacity, the most important features get dedicated neurons (monosemantic representation). Moderately important features share space through superposition. And the least important features get ignored entirely. This creates a hierarchy where interpretability varies by importance; the features that fire most often and matter most for task performance tend to get cleaner representations, while the long tail of rare concepts gets compressed into shared, polysemantic neurons.

The trade-off is interference. When two features share overlapping neural representations, activating one partially activates the other. Networks learn to tolerate this noise because the compression benefits outweigh the costs, at least for features that don't need to be perfectly precise.

The Geometry Gets Strange

Our intuitions about geometry come from 2D and 3D space, where orthogonal directions are scarce. In 3D, you can only have three mutually perpendicular vectors. But in 500 dimensions? You can have thousands of directions that are almost orthogonal to each other.

The math works out surprisingly well. If you relax exact orthogonality and allow a small amount of pairwise overlap, the number of nearly-independent directions you can pack into a space grows exponentially with dimension (a consequence of concentration of measure in high dimensions, the same phenomenon behind the Johnson-Lindenstrauss lemma). Neural networks implicitly discover and exploit this geometric fact during training.
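A quick numerical check (a sketch, assuming NumPy; thresholds are illustrative) shows how the fraction of nearly-orthogonal pairs among random directions grows with dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vectors, threshold = 1000, 0.1  # "almost orthogonal" = |cosine| below 0.1

fractions = {}
for dim in (3, 50, 500):
    vecs = rng.normal(size=(n_vectors, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = np.abs(vecs @ vecs.T)
    np.fill_diagonal(sims, 0.0)
    # Fraction of vector pairs that interfere less than the threshold
    fractions[dim] = (sims < threshold).mean()
    print(f"dim={dim:4d}: {fractions[dim]:.1%} of pairs are almost orthogonal")
```

In three dimensions, most pairs of random directions interfere noticeably; in 500 dimensions, almost none do.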

This explains why larger models tend to be more interpretable in some ways. More dimensions means more room for features to spread out, reducing the interference between them.

The story above suggests polysemanticity is always a deliberate optimization; networks compress because they have to. But research into the origins of polysemanticity found that feature overlap can arise even when there's no capacity pressure. Some polysemanticity is incidental rather than strategic.

How? Random initialization can place multiple features into the same neuron by chance. During training, these accidental associations get strengthened rather than corrected. Regularization and neural noise can facilitate this incidental overlap. A comprehensive review of mechanistic interpretability identifies three distinct causes: capacity constraints (the superposition story), noise-induced redundancy, and feature misalignment from initialization. Real networks exhibit all three.

This matters for interpretability research. If all polysemanticity were deliberate compression, you might design architectures that don't need it. Give the network enough neurons and features would separate cleanly. But if polysemanticity arises from training dynamics themselves, the problem is harder to engineer away.

Reversing the Compression

The practical consequence: you cannot interpret neural networks by looking at individual neurons. A neuron that activates strongly means nothing by itself. You need to understand the direction in activation space it's participating in, along with all the other neurons pointing in related directions.

This is why Anthropic's feature extraction work matters.

Using sparse autoencoders, researchers have extracted millions of interpretable features from production models, reversing the superposition to recover what the network actually represents. The Golden Gate Claude experiment demonstrated this dramatically: amplifying a single extracted feature to 10x its normal strength created a model obsessively fixated on the Golden Gate Bridge. The feature-level view works where the neuron-level view fails. But it requires sophisticated techniques to decompose polysemantic neurons into their constituent features. And even then, current methods succeed on roughly 25% of cases tested.
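The decoding idea can be stripped down to a sketch (assuming NumPy, and assuming the dictionary of feature directions is already known — which is precisely the hard part a sparse autoencoder has to learn from raw activations):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_features = 500, 2000

# Pretend dictionary of feature directions. In practice a sparse
# autoencoder must learn these from model activations; here they're given.
W = rng.normal(size=(n_features, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A superposed activation: features 3 and 7 active, all others off
activation = 0.9 * W[3] + 0.6 * W[7]

# No individual neuron of `activation` is informative on its own, but
# projecting onto the dictionary and thresholding recovers the features
readout = W @ activation
recovered = np.flatnonzero(readout > 0.3)
print(f"recovered features: {recovered.tolist()}")
```

Real sparse autoencoders learn the dictionary and the sparse codes jointly, typically with an L1 penalty pushing each activation to be explained by only a handful of features.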

Our read: polysemanticity is not a flaw to be eliminated but a compression strategy with real costs for interpretability. Neural networks achieve their representational power by exploiting high-dimensional geometry in ways that make individual components meaningless.

This creates a direct tension between capability and transparency. The same superposition that lets a model know about millions of concepts is what makes those concepts illegible when you look inside. Making models more interpretable might require sacrificing some of this representational efficiency.

The field is pursuing two approaches. Intrinsic methods try to train networks that use less superposition from the start. Post-hoc methods (like sparse autoencoders) accept superposition and try to reverse-engineer it afterward. Neither has scaled to frontier models yet. But understanding why polysemanticity exists is the first step toward either solution, and toward understanding what we're actually asking for when we demand interpretable AI.

