Google's 2022 paper Emergent Abilities of Large Language Models made a striking claim: certain capabilities don't improve gradually as you scale up a model. They jump. The researchers documented specific compute thresholds where this happened, including around 10²² FLOPs for certain arithmetic tasks. Below that threshold, models performed at chance. Above it, they could actually do the work.
The paper identified two flavors of this: emergent prompted tasks (where the capability itself appears at scale) and emergent prompting strategies like chain-of-thought reasoning. The core assertion was stark: these abilities "cannot be predicted by extrapolating smaller model performance." Google's research blog extended this analysis across LaMDA, GPT-3, Gopher, Chinchilla, and PaLM, measuring emergence on dozens of tasks. The patterns held.
For safety researchers, this was genuinely alarming. If larger models do things you couldn't have predicted from smaller versions, you're flying blind.
The Stanford Critique
Then came the deflation. Stanford's 2023 paper Are Emergent Abilities of Large Language Models a Mirage? won a NeurIPS Outstanding Paper Award for offering a different explanation: emergence might be an artifact of how we measure, not what models actually do.
The math is elegant. Most benchmarks use discontinuous metrics (right/wrong, pass/fail). But model improvement on individual tokens is continuous. Per-token accuracy relates exponentially to loss: it equals the exponential of negative cross-entropy. Exact-match accuracy on a multi-token answer compounds that per-token accuracy across every token, so when you measure pass/fail on a multi-step task, smooth underlying improvement produces sharp apparent jumps.
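A few lines of arithmetic make the cliff visible. The loss values below are illustrative, not from any real model; per-token accuracy is exp(−loss), and exact match on a hypothetical 10-token answer raises it to the 10th power:

```python
import math

# Hypothetical per-token cross-entropy losses for models of increasing
# scale: smooth, steady improvement with no jumps.
losses = [1.2, 0.9, 0.6, 0.4, 0.25, 0.15, 0.08]

for loss in losses:
    p = math.exp(-loss)      # per-token accuracy
    exact_match = p ** 10    # pass/fail on a 10-token answer compounds per token
    print(f"loss={loss:.2f}  per-token acc={p:.2f}  10-token exact match={exact_match:.3f}")
```

Per-token accuracy climbs smoothly from roughly 0.30 to 0.92, while exact match sits near zero across most of the range and then rises steeply at the end: the same smooth improvement, read through two different metrics.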
Over 92% of emergent abilities documented on BIG-Bench appeared specifically under discontinuous metric types.
The researchers demonstrated they could induce or eliminate apparent emergence in vision tasks just by changing how they measured. Same model, same capabilities, different metrics, entirely different conclusion about emergence. The underlying improvement is continuous; the "jump" is an artifact of where you draw the threshold.
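The same trick can be sketched in a few lines. The continuous per-example scores below are made up; one set of numbers, two scoring rules, two opposite conclusions about emergence:

```python
# Hypothetical continuous per-example scores (say, an edit-distance
# similarity) for five model scales on the same fixed task.
scores_by_scale = [
    [0.55, 0.60, 0.58],   # smallest model
    [0.68, 0.70, 0.66],
    [0.78, 0.80, 0.76],
    [0.88, 0.90, 0.86],
    [0.96, 0.97, 0.95],   # largest model
]

for scores in scores_by_scale:
    mean = sum(scores) / len(scores)  # continuous metric: steady climb
    # Pass/fail at a strict 0.95 cutoff: zero everywhere, then a "jump".
    strict = sum(s >= 0.95 for s in scores) / len(scores)
    print(f"mean={mean:.2f}  strict pass rate={strict:.2f}")
```

The mean rises steadily with scale; the strict pass rate is flat at zero until the final model, where it leaps to 1.00. Nothing about the model changed between the two columns, only where the threshold was drawn.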
This is where I part ways with the "emergence is debunked" crowd. The Stanford paper demonstrated that some apparent emergence is measurement artifact. It didn't prove all of it is.
A March 2025 survey on emergence notes the debate remains genuinely unresolved. The authors found that emergence depends on multiple interacting factors: scaling laws, task complexity, pre-training loss, quantization effects, and prompting strategies. Untangling which factor matters when is hard. The survey also extended the analysis to Large Reasoning Models trained with reinforcement learning and inference-time search, finding emergence patterns there too.
Notably, it warns that emergence "carries dual implications": enhanced reasoning capabilities, but also potential for deception, manipulation, and reward hacking.
Capabilities appearing that we didn't explicitly train for? That's not purely a measurement question.
The Quiet Redefinition
Something interesting has happened to the term itself. "Emergence" used to mean sudden, unpredictable jumps in capability. Now it increasingly means "more than linear improvement with scale." That's a significant goalpost move, and worth noticing.
The new definition is less dramatic but possibly more useful. Linear improvement is what you'd naively expect: 2x the compute, roughly 2x the capability. More-than-linear improvement means scaling has compounding returns on certain tasks. Interesting and worth understanding, even if it's not the sci-fi-flavored "suddenly the AI can do X" story. Georgetown's CSET explainer captures the policy angle well: "Previous attempts to forecast capabilities in advance have shown decidedly mixed results." Whether that's because emergence is real or because we're bad at building continuous metrics for complex tasks remains genuinely unclear.
Our read: both sides are probably right about different things. The Stanford critique correctly identifies that discontinuous metrics create artificial cliffs. But that doesn't mean underlying capability development is perfectly smooth. Neural networks do exhibit phase transitions in learning; you can watch them develop circuits for specific capabilities. The question is whether those transitions happen predictably at measurable compute thresholds or whether they're genuinely surprising.
For Builders
If you're working with LLMs, some practical notes:
Don't count on capability jumps. The evidence that scaling reliably unlocks new capabilities at predictable thresholds is weaker than it seemed in 2022. Plan for gradual improvement, not sudden breakthroughs.
Your metrics shape your conclusions. Evaluate models with pass/fail benchmarks and you'll see emergence; use continuous metrics (partial credit, probability distributions) and you'll see smooth curves. The model hasn't changed; your measurement has.
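A toy evaluation makes the metric point concrete. The step-by-step answers below are invented; exact match rewards only the perfect answer, while partial credit surfaces per-step progress the binary metric hides:

```python
# Hypothetical 4-step answers: (model output, reference), one token per step.
predictions = [
    (["a", "b", "c", "x"], ["a", "b", "c", "d"]),  # 3 of 4 steps right
    (["a", "b", "c", "d"], ["a", "b", "c", "d"]),  # all steps right
    (["a", "x", "x", "x"], ["a", "b", "c", "d"]),  # 1 of 4 steps right
]

# Exact match: an answer scores 1 only if every step is correct.
exact = sum(pred == ref for pred, ref in predictions) / len(predictions)

# Partial credit: fraction of correct steps, averaged over answers.
partial = sum(
    sum(p == r for p, r in zip(pred, ref)) / len(ref)
    for pred, ref in predictions
) / len(predictions)

print(f"exact match: {exact:.2f}")      # 0.33 -- only the perfect answer counts
print(f"partial credit: {partial:.2f}") # 0.67 -- per-step progress is visible
```

A slightly better model that gets the last step of the first answer right would move exact match from 0.33 to 0.67 in one jump, while partial credit would tick up smoothly from 0.67 to 0.75.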
And the safety implications cut both ways. If emergence is mostly measurement artifact, we might be able to predict dangerous capabilities before they appear. But if it's even partially real, we need safety margins for capabilities we can't forecast. The honest answer is we don't know which world we're in.
The emergence debate is a case study in how scientific concepts get refined under scrutiny. The original claim (sudden unpredictable jumps) was provocative and important. The critique (it's the metrics) was rigorous and necessary. The synthesis (it's complicated, and we need better definitions) is less satisfying but probably closer to truth.
What we're left with: capabilities improve with scale, sometimes faster than linearly, and our ability to predict exactly what a larger model will do remains limited. Less dramatic than "emergence." More actionable than "we have no idea."