Attribution Graph

A technique that traces the computational path from input tokens through intermediate features to model outputs, surfacing the intermediate steps a neural network actually performs.

Attribution graphs are a mechanistic interpretability method that maps how information flows through a language model during inference. The technique tracks which features activate at each layer and how they influence subsequent features, building a graph of the computational path from input to output. This allows researchers to see whether models genuinely reason through intermediate steps or pattern-match directly to answers, and to detect when stated chain-of-thought reasoning differs from actual internal mechanisms.
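The idea of building a graph over feature-to-feature influences can be sketched in miniature. The following toy example is a simplification under assumed inputs: feature names, activations, and connection weights are all hypothetical, and real attribution graphs are computed from a trained model rather than hand-specified dictionaries.

```python
# Toy sketch of an attribution graph. All feature names, activations,
# and weights below are hypothetical, for illustration only.

def build_attribution_graph(activations, weights):
    """Edges (src, dst) -> attribution = src activation * connection weight."""
    graph = {}
    for layer, layer_weights in enumerate(weights):
        for (src, dst), w in layer_weights.items():
            contribution = activations[layer].get(src, 0.0) * w
            if contribution != 0.0:
                graph[(src, dst)] = contribution
    return graph

def trace_paths(graph, source, target):
    """All source->target paths, scored by the product of edge attributions."""
    adjacency = {}
    for (src, dst), attr in graph.items():
        adjacency.setdefault(src, []).append((dst, attr))
    paths = []

    def dfs(node, path, score):
        if node == target:
            paths.append((path, score))
            return
        for nxt, attr in adjacency.get(node, []):
            dfs(nxt, path + [nxt], score * attr)

    dfs(source, [source], 1.0)
    return sorted(paths, key=lambda p: -abs(p[1]))

# Hypothetical two-hop question: does the model route through an
# intermediate feature, or shortcut directly to the answer?
activations = [
    {"tok:Dallas": 1.0},   # input-token features
    {"feat:Texas": 0.9},   # intermediate-layer feature
]
weights = [
    {("tok:Dallas", "feat:Texas"): 0.8,
     ("tok:Dallas", "out:Austin"): 0.1},   # weak direct shortcut
    {("feat:Texas", "out:Austin"): 0.7},   # multi-step path
]

graph = build_attribution_graph(activations, weights)
paths = trace_paths(graph, "tok:Dallas", "out:Austin")
```

In this toy setup, the highest-scoring path runs through the intermediate `feat:Texas` node rather than the direct shortcut edge, which is the kind of evidence researchers use to argue a model genuinely performed an intermediate reasoning step.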

Also known as

circuit tracing, feature attribution