Sparse Autoencoder

A neural network trained to decompose model activations into millions of interpretable features, each intended to correspond to a single human-readable concept.

Sparse autoencoders are a key tool in mechanistic interpretability that learn to represent neural network activations as sparse combinations of interpretable features. By enforcing sparsity, the technique encourages each feature direction to correspond to a single concept rather than a blend of meanings. Anthropic's 2024 work scaled this approach to production models, extracting millions of features from Claude that cluster meaningfully: features for San Francisco landmarks sit near other San Francisco features, and deception-related features sit near manipulation-related ones.
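The core objective can be sketched in a few lines: encode activations through a ReLU to get nonnegative feature activations, reconstruct linearly, and penalize the L1 norm of the features to enforce sparsity. This is a minimal NumPy illustration, not Anthropic's implementation; the dimensions, weight initialization, and `l1_coeff` value are hypothetical (production SAEs use millions of features and are trained with gradient descent on large activation datasets).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; real SAEs expand activations
# into a feature space with millions of dimensions.
d_model, d_features = 16, 64

# Untied encoder/decoder weights and biases.
W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0.0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU: nonnegative feature activations
    x_hat = f @ W_dec + b_dec               # linear reconstruction
    return f, x_hat

def sae_loss(x, l1_coeff=0.01):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)   # how well we reproduce the activations
    sparsity = np.mean(np.abs(f))       # pushes most features to exactly zero
    return recon + l1_coeff * sparsity

x = rng.normal(size=(8, d_model))  # a batch of (stand-in) model activations
features, reconstruction = sae_forward(x)
loss = sae_loss(x)
```

The L1 term is what makes the decomposition interpretable in practice: only a handful of features fire on any given input, so each feature direction is pressured to capture one recurring concept rather than a superposition of many.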

Also known as

SAE, dictionary learning