When you ask a language model to write a poem that rhymes, it doesn't work the way you'd expect. Rather than generating text sequentially and hoping the ending works, the model pre-selects rhyme words before writing the lines that lead to them. It plans backwards.
This finding comes from Anthropic's most recent interpretability research, which uses a technique called attribution graphs to trace the computational steps inside Claude. And it reveals something counterintuitive: we can now see genuine reasoning happening inside these models, follow the chain of concepts that leads to an answer, and catch the model using completely different logic than what it claims in its chain-of-thought explanations.
That gap between stated reasoning and actual mechanism? That's the whole game for AI safety.
The Wall Researchers Kept Hitting
For years, interpretability work ran into the same problem: individual neurons rarely correspond to single interpretable concepts. A single neuron might activate for the Golden Gate Bridge, the concept of redness, and certain Korean text. This phenomenon, called polysemanticity, made it nearly impossible to understand what models were actually computing.
The breakthrough came from dictionary learning, specifically a technique called sparse autoencoders. Rather than trying to interpret neurons directly, researchers learned to decompose model activations into millions of what they call "features": directions in the model's internal space that correspond to human-readable concepts. Anthropic's 2024 work demonstrated this could scale to production models. They extracted millions of interpretable features from Claude's middle layer, and crucially, these features clustered in meaningful ways. The Golden Gate Bridge feature sits near other San Francisco landmarks. Features related to scams neighbor features about deception and manipulation.
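The core idea of a sparse autoencoder can be sketched in a few lines: train an overcomplete autoencoder with a ReLU encoder and an L1 sparsity penalty, so each activation vector is reconstructed from a small number of "feature" directions. Below is a minimal toy version on synthetic data with manual gradients. This is not Anthropic's implementation, which trains on real model activations at vastly larger scale; every shape and hyperparameter here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "model activations": sparse mixtures of ground-truth directions.
d_model, n_features, n_samples = 32, 128, 2048
true_dirs = rng.normal(size=(n_features, d_model))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)
coeffs = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.03)
acts = coeffs @ true_dirs  # (n_samples, d_model)

# Sparse autoencoder: overcomplete dictionary, ReLU encoder, L1 penalty on features.
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
b_enc = np.zeros(n_features)
W_dec = W_enc.T.copy()
lr, l1 = 0.05, 1e-3

def reconstruction_mse():
    f = np.maximum(acts @ W_enc + b_enc, 0.0)
    return ((f @ W_dec - acts) ** 2).mean()

mse_before = reconstruction_mse()
for _ in range(300):
    f = np.maximum(acts @ W_enc + b_enc, 0.0)          # sparse feature activations
    err = f @ W_dec - acts                             # reconstruction error
    g_f = (err @ W_dec.T + l1 * np.sign(f)) * (f > 0)  # backprop through ReLU + L1
    W_dec -= lr * f.T @ err / n_samples
    W_enc -= lr * acts.T @ g_f / n_samples
    b_enc -= lr * g_f.mean(axis=0)
mse_after = reconstruction_mse()
print(f"reconstruction MSE: {mse_before:.4f} -> {mse_after:.4f}")
```

The dictionary is deliberately larger than the activation space (128 features for 32 dimensions); the sparsity penalty is what forces each activation to be explained by only a few directions, which is what makes the learned features interpretable.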
But knowing that models encode interpretable concepts only gets you so far. The real prize is understanding how these concepts interact to produce outputs.
This is where attribution graphs come in. The technique traces the computational path from input tokens through intermediate features to the final output, using roughly 30 million interpretable features to map the journey. What researchers found challenges some basic assumptions about how these models work.
Consider the Dallas example: when asked "What is the capital of the state that Dallas is in?", the model's internal features show Texas activating before Austin. The model genuinely reasons through the intermediate step rather than pattern-matching directly to the answer.
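The Dallas chain can be pictured as a tiny attribution graph: nodes are interpretable features, edges carry attribution weights, and tracing the strongest path from input token to output surfaces the intermediate step. Every node name and weight below is invented for illustration; real attribution graphs connect millions of learned features.

```python
# Toy attribution graph for "What is the capital of the state that Dallas is in?"
# Weights are hypothetical attribution strengths, not measured values.
edges = {
    "token:Dallas": {"feat:Texas": 0.9, "feat:city": 0.4},
    "feat:Texas": {"feat:capital-of-Texas": 0.8},
    "feat:city": {"feat:capital-of-Texas": 0.2},
    "feat:capital-of-Texas": {"output:Austin": 0.95},
}

def strongest_path(graph, start, end):
    """Depth-first search keeping the path with the largest product of weights."""
    best = (0.0, None)
    def dfs(node, weight, path):
        nonlocal best
        if node == end:
            if weight > best[0]:
                best = (weight, path)
            return
        for nxt, w in graph.get(node, {}).items():
            dfs(nxt, weight * w, path + [nxt])
    dfs(start, 1.0, [start])
    return best

weight, path = strongest_path(edges, "token:Dallas", "output:Austin")
print(path)  # the intermediate Texas feature appears before Austin
```

The point of the exercise: the strongest path necessarily passes through the "Texas" feature, mirroring the observation that the model activates the intermediate concept before the answer.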
Other findings are more unsettling.
When Claude performs mental arithmetic and explains its step-by-step "carrying" process in chain-of-thought, the internal mechanism is actually a lookup table. The stated reasoning doesn't match the actual computation. The model confabulates its own explanations. If models can perform tasks using mechanisms that differ from their stated reasoning, then chain-of-thought explanations aren't reliable windows into model cognition. You can't simply ask a model how it reached a conclusion and trust the answer.
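The epistemological point is easy to demonstrate: two entirely different mechanisms can produce identical outputs, so matching answers tell you nothing about which mechanism actually ran. A toy contrast between a precomputed lookup table and explicit carry arithmetic:

```python
# Mechanism 1: a lookup table — no "carrying" happens at all.
table = {(a, b): a + b for a in range(100) for b in range(100)}

# Mechanism 2: digit-by-digit addition with an explicit carry,
# the kind of procedure a chain-of-thought explanation describes.
def carry_add(a, b):
    result, carry, place = 0, 0, 1
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        result += (s % 10) * place
        carry = s // 10
        a, b, place = a // 10, b // 10, place * 10
    return result

# Every output matches, so outputs alone cannot distinguish the mechanisms.
assert all(table[(a, b)] == carry_add(a, b) for a in range(100) for b in range(100))
```

This is why attribution graphs matter: only by looking inside can you tell whether the "carrying" story corresponds to the computation that actually produced the answer.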
Scams, Sycophancy, and Surgical Edits
Anthropic's feature mapping identified safety-relevant features for scams, sycophancy, deception, and dangerous content. These features are causally meaningful: amplifying them changes model behavior in predictable ways.
The most vivid demonstration was Golden Gate Claude, a modified version of the model where researchers amplified the Golden Gate Bridge feature to 10x its normal activation. The result was a model obsessively focused on the bridge, inserting it into every conversation.
The modification was surgical, performed at the neural level rather than through prompts or fine-tuning. The same technique could theoretically apply to safety-critical features. If you can identify the internal feature corresponding to "I should hide my true reasoning," you can potentially detect or suppress deceptive behavior before it manifests in outputs.
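Mechanically, this kind of steering is linear algebra on internal activations: rescale the component of each activation vector along a chosen feature direction while leaving everything orthogonal to it untouched. A sketch in NumPy with hypothetical shapes; real interventions target specific layers of a transformer and use feature directions learned by the sparse autoencoder.

```python
import numpy as np

def amplify_feature(resid, feature_dir, scale):
    """Rescale each activation's component along feature_dir by `scale`,
    leaving the orthogonal components unchanged."""
    d = feature_dir / np.linalg.norm(feature_dir)
    proj = resid @ d                                  # per-row scalar projection
    return resid + (scale - 1.0) * np.outer(proj, d)

rng = np.random.default_rng(1)
resid = rng.normal(size=(4, 8))    # toy "activations": 4 token positions, d=8
feature = rng.normal(size=8)       # stand-in for a learned feature direction
steered = amplify_feature(resid, feature, 10.0)  # the 10x of Golden Gate Claude
```

Suppression is the same operation with a scale below one: the intervention dials a single direction up or down without retraining or re-prompting the model.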
The interpretability story gets complicated from here. These techniques work, but not reliably. According to Anthropic's own assessment, current methods succeed on roughly 25% of prompts tested. For three-quarters of cases, the attribution graphs don't produce clear interpretations. Attention patterns, a core component of transformer architecture, remain largely opaque. Researchers can trace feature-to-feature connections but often can't explain why the model attends to certain tokens over others.
There's also a scaling question. Roughly 60% of features in larger sparse autoencoders show minimal activation, suggesting that the techniques may be capturing noise or highly specialized circuits that rarely fire. Whether these methods will continue to work as models scale remains genuinely uncertain. The academic literature is even more cautious; a recent survey of open problems in mechanistic interpretability emphasizes that translating interpretability tools into concrete safety outcomes remains unsolved. The field is honest about the gap between "we can see interesting structures" and "we can guarantee a model won't behave badly."
The Parsing Delay That Adversaries Exploit
One finding has direct practical weight: models assemble responses token by token, and they often don't recognize harmful intent until they've already started generating problematic content. By the time safety-relevant features activate, the model is mid-sentence and more likely to continue than to stop.
This isn't a training failure so much as an architectural limitation. The model genuinely doesn't understand the full request until it's partially committed to answering it. Adversarial prompts exploit this parsing delay.
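To make the timing concrete, here is a purely illustrative simulation. The tokens, scores, and threshold are all invented; the point is only that a feature which accumulates evidence across tokens fires after the commitment to answering has already begun.

```python
# Hypothetical per-token activations of a "harmful intent" feature.
tokens = ["Sure", ",", " here", " is", " how", " to", " pick", " a", " lock"]
scores = [0.0, 0.0, 0.05, 0.05, 0.1, 0.15, 0.5, 0.6, 0.9]  # invented values

THRESHOLD = 0.4  # assumed trip point for the safety feature
emitted = []
for tok, score in zip(tokens, scores):
    if score >= THRESHOLD:
        break  # the feature finally fires — but tokens are already out
    emitted.append(tok)
print("".join(emitted))
```

By the time the score crosses the threshold, six tokens of an affirmative answer have already been emitted; stopping now means abandoning a response the model has partially committed to.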
Our read: Interpretability research has produced a genuine scientific instrument, not just an interesting research direction. We can now see real structures inside these models, follow chains of reasoning, and identify features that correspond to safety-relevant concepts. But the instrument is still being calibrated. It works on a quarter of cases. It can't see everything. And the gap between "we understand this specific circuit" and "we can certify this model is safe" remains vast.
What's changing is the pace. The jump from "neurons are uninterpretable" to "we can trace multi-step reasoning through millions of features" happened in roughly two years. The next question is whether interpretability can scale alongside the models themselves, or whether frontier-scale systems will remain fundamentally opaque.
We have a microscope that reveals real structure. We're mapping a continent with a flashlight. Both statements are true.