Mechanistic Interpretability

A research approach that reverse-engineers neural networks by identifying interpretable features and tracing how they combine to produce outputs.

Mechanistic interpretability is a subfield of interpretability research, closely associated with AI safety, that aims to understand the internal computations of neural networks at a granular level. Rather than treating models as black boxes, researchers decompose activations into interpretable features and trace the circuits that connect inputs to outputs. The goal is to move from observing what models do to understanding how they do it, enabling detection of deceptive reasoning or unsafe behavior before it manifests.
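The feature-decomposition step above can be sketched in code. The following is a minimal, illustrative toy (not any specific library's API): a sparse-autoencoder-style dictionary that maps a model activation vector to sparse feature coefficients, where the few nonzero coefficients are the candidates a researcher would try to interpret. The weights here are random stand-ins for parameters that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 8, 32  # activation size; overcomplete feature dictionary
W_enc = rng.normal(size=(d_model, n_features))  # stand-in for learned encoder
W_dec = rng.normal(size=(n_features, d_model))  # stand-in for learned decoder
b_enc = np.zeros(n_features)

def encode(activation):
    """Map an activation vector to nonnegative feature coefficients (ReLU)."""
    return np.maximum(activation @ W_enc + b_enc, 0.0)

def decode(features):
    """Reconstruct the original activation from the feature coefficients."""
    return features @ W_dec

activation = rng.normal(size=d_model)  # stand-in for a residual-stream vector
features = encode(activation)

# The ReLU zeroes out many coefficients; the remaining active features are
# what a researcher would inspect for an interpretable meaning.
active = np.nonzero(features)[0]
reconstruction = decode(features)
print(f"{len(active)} of {n_features} features active")
```

In practice the encoder and decoder are trained to reconstruct activations under a sparsity penalty, so that each learned feature direction tends to correspond to a single human-recognizable concept; this sketch only shows the shape of the decomposition, not the training.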

Also known as

mech interp, circuit analysis, neural circuit tracing