Inference-Time Compute: AI's Third Scaling Axis

The old playbook was simple: more data, bigger models, better results. Now labs are discovering you can make models smarter by letting them think longer.

AI Research · scaling · reasoning · infrastructure · DeepSeek

For a decade, the AI scaling playbook was simple: more data, bigger models, better results. That playbook is now incomplete.

The major labs have discovered a third axis. Instead of making models smarter at training time, you can make them smarter at inference time by letting them think longer before answering. This is why OpenAI's o1, DeepSeek's R1, and Claude's extended thinking mode exist. It's why industry forecasts have inference's share of compute rising from one-third to two-thirds by 2026. And it's why your API costs now include "thinking tokens" that get billed like output tokens.

Training vs. Thinking

Training-time compute makes the model itself smarter. You invest billions upfront, and every user benefits from that investment. But once deployed, the model gets no smarter while answering your question; it's just applying what it learned.

Inference-time compute works differently. The model does additional work while processing your specific query: reasoning through the problem, considering alternatives, backtracking when it hits dead ends, building an answer. More compute per query means better answers on hard problems.

The Google DeepMind/UC Berkeley paper that formalized this finding showed something striking:

A smaller model with optimal test-time compute can outperform a 14x larger model on problems where it has non-trivial success rates.

For certain tasks, it's more efficient to let a cheap model think hard than to build an expensive one that answers instantly.

Five Flavors of Thinking

There's no single technique here. Researchers have identified five main categories of inference-time scaling:

Chain-of-thought prompting is the simplest. Instead of answering directly, the model generates intermediate reasoning steps. This can be as basic as adding "let's think step by step" to a prompt, or as sophisticated as the extended thinking in Claude or o1 where the model produces long internal reasoning traces before responding.
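The basic version can be sketched in a few lines. The prompt wording and the "Answer:" delimiter below are illustrative conventions, not any provider's API:

```python
def cot_prompt(question: str) -> str:
    """The minimal chain-of-thought nudge: ask for steps before an answer."""
    return (
        f"{question}\n"
        "Let's think step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )

def extract_answer(response: str) -> str:
    """Pull the final answer out of a reasoning trace."""
    for line in response.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return response.strip()  # no delimiter found; fall back to the whole text
```

Extended-thinking APIs do the delimiting for you by separating reasoning traces from the visible response; this sketch is the do-it-yourself equivalent.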

Voting and search methods generate multiple candidate answers and pick the best one. Best-of-N sampling runs the model several times and selects based on some criterion (often a verifier model that scores each response). Beam search explores multiple reasoning paths simultaneously.
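A minimal best-of-N sketch, assuming a sampled `generate` call and a `verifier_score` function; both are stand-ins here, stubbed with randomness where a real system would call the model and a verifier:

```python
import random

def generate(prompt: str) -> str:
    """Stand-in for a sampled model call (hypothetical).

    In practice this would hit an LLM API with temperature > 0,
    so each call returns a different candidate.
    """
    return f"candidate-{random.randint(0, 9)} for: {prompt}"

def verifier_score(prompt: str, answer: str) -> float:
    """Stand-in for a verifier model that scores each response."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates, keep the one the verifier likes best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: verifier_score(prompt, ans))
```

The verifier is the load-bearing piece: best-of-N is only as good as the model scoring the candidates.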

Iterative refinement has the model critique and revise its own outputs. Think of it as the model editing its first draft.
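The critique-and-revise loop is generic enough to sketch; the toy critique and revise functions below stand in for model calls:

```python
def iterative_refine(draft, critique_fn, revise_fn, max_rounds=5):
    """Generic critique-and-revise loop: the model edits its own draft."""
    for _ in range(max_rounds):
        critique = critique_fn(draft)
        if critique is None:       # critic has no complaints; stop early
            break
        draft = revise_fn(draft, critique)
    return draft

# Toy demo: "refine" a numeric guess upward until the critic is satisfied.
critique = lambda x: "too low" if x < 100 else None
revise = lambda x, c: x + 25
```

In a real system both functions would be LLM calls; the early-exit on a satisfied critic is what keeps easy inputs cheap.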

Dynamic compute allocation is where things get genuinely interesting. Monte Carlo tree search (MCTS), borrowed from game-playing AI, explores a tree of possible reasoning steps and focuses compute on the most promising branches. The DeepMind paper found this particularly effective for harder problems, while iterative refinement works better for easier ones.
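The core idea of spending the budget on promising branches can be sketched with best-first search; full MCTS adds random rollouts and visit-count statistics that this deliberately omits:

```python
import heapq

def tree_search(root, expand, score, budget=16):
    """Best-first search over partial reasoning paths.

    `expand` proposes next reasoning steps for a path; `score` rates a
    partial path. A fixed node budget is spent on the highest-scoring
    branches first (a simplification of MCTS, not the full algorithm).
    """
    frontier = [(-score(root), root)]   # max-heap via negated scores
    best = root
    for _ in range(budget):
        if not frontier:
            break
        neg_score, path = heapq.heappop(frontier)
        if -neg_score > score(best):
            best = path
        for child in expand(path):
            heapq.heappush(frontier, (-score(child), child))
    return best
```

The budget parameter is the knob the article is about: raising it buys deeper exploration of the tree at the cost of more compute per query.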

Latent space reasoning skips natural language entirely and operates on the model's internal representations. Still experimental, but potentially more efficient since you're not generating and parsing tokens.

DeepSeek's R1 model, released in January 2025, validated all of this at production scale. Using pure reinforcement learning, they built a reasoning model that matches OpenAI's o1 on key benchmarks: 79.8% on AIME 2024 versus 79.2% for o1, and 97.3% on MATH-500 versus 96.4%.

What makes the R1 work compelling is what emerged without explicit programming. The model spontaneously developed behaviors like self-reflection (recognizing when its reasoning might be flawed), verification (checking its own answers), and what the researchers call the "aha moment," where the model learns to revisit and rethink problematic steps mid-reasoning. These behaviors weren't trained in. They emerged from giving the model room to think and optimizing for correct final answers through reinforcement learning.

The lesson? If you create the right training signal and give the model enough inference-time compute budget, sophisticated reasoning strategies arise on their own.

The Efficiency Math

How much does thinking time buy you? The DeepMind paper found that compute-optimal scaling improves efficiency by more than 4x over naive baselines like fixed-budget best-of-N.

The key insight is adaptive allocation. Not every problem deserves the same thinking budget. A simple factual question can be answered in one pass; a complex math proof might benefit from extensive search. Optimal systems learn to allocate compute dynamically based on problem difficulty.
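One way to sketch adaptive allocation. The difficulty estimate is assumed to come from a cheap router or classifier pass, and the budget numbers are purely illustrative:

```python
def thinking_budget(difficulty: float, base: int = 256, cap: int = 8192) -> int:
    """Map an estimated difficulty in [0, 1] to a thinking-token budget.

    Easy queries get roughly `base` tokens; the hardest scale up to `cap`.
    (Illustrative policy, not any production system's allocator.)
    """
    return min(cap, int(base * (1 + difficulty * (cap / base - 1))))
```

A simple factual lookup (difficulty near 0) gets 256 tokens; a hard proof (difficulty near 1) gets the full 8,192. The interesting engineering is in estimating difficulty cheaply enough that the router doesn't eat the savings.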

Our read: this is the real unlock. The gap between "give the model 10 seconds" and "give the model 10 seconds on problems that need it" is where the efficiency gains live.

Thinking tokens aren't free, of course. Anthropic bills extended thinking as output tokens. OpenAI's reasoning tokens work similarly. When a model thinks for 30 seconds before answering, you're paying for all those intermediate tokens even though you only see the final response. This creates a new cost-optimization problem: how much should you let the model think? The answer depends on the task, the cost of errors, and your latency requirements. For a customer service chatbot, extensive reasoning is probably overkill. For code generation where bugs are expensive, more thinking might pay for itself.
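The billing math itself is simple to sketch; the price below is an illustrative parameter, not any provider's actual rate:

```python
def thinking_cost(thinking_tokens: int, visible_tokens: int,
                  price_per_m_output: float) -> float:
    """Per-request cost when thinking tokens bill as output tokens.

    `price_per_m_output` is dollars per million output tokens
    (illustrative; check your provider's pricing).
    """
    billable = thinking_tokens + visible_tokens
    return billable / 1_000_000 * price_per_m_output

# e.g. 8,000 thinking tokens + 500 visible tokens at $15 / M output tokens
cost = thinking_cost(8_000, 500, 15.0)   # → $0.1275
```

Note the ratio: in this example 94% of the bill is reasoning you never see, which is why a per-task thinking budget matters.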

There's also the latency variance problem. Unlike traditional models with predictable response times, reasoning models can take highly variable amounts of time depending on problem complexity. Systems that depend on consistent latency need new architectural patterns.

Rewiring the Data Centers

The industry is rewiring its infrastructure for this transition. Training used to dominate compute spending; now inference is catching up fast. One estimate puts the new equilibrium at 80/20 inference-to-training spending.

This matters because training and inference workloads have different requirements. Training benefits from massive parallelism and can tolerate high latency. Inference needs to be fast, consistent, and globally distributed. The hardware optimized for one isn't necessarily optimal for the other.

Inference-time compute comes with real costs, though. Unpredictable spending: when the model decides how long to think, your bill becomes harder to forecast. Latency variance: response times can swing wildly based on problem difficulty. Research has also identified cases of underthinking, where models abandon promising reasoning paths too early, particularly when they've been trained with strict length limits. And then there's the fundamental constraint: performance improves logarithmically with thinking budget. Doubling compute doesn't double capability. At extreme budgets, you're paying a lot for marginal gains.
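The logarithmic shape is easy to make concrete with a toy quality curve; the curve itself is illustrative, and real scaling behavior varies by task and model:

```python
import math

def quality(budget_tokens: int, scale: float = 10.0) -> float:
    """Toy logarithmic quality curve: quality = scale * log2(budget).

    Shows the shape of the tradeoff only, not a fitted scaling law.
    """
    return scale * math.log2(budget_tokens)

# Doubling the budget adds a constant amount of quality, not a constant
# factor: going 1k -> 2k tokens buys as much as going 32k -> 64k.
gain_small = quality(2_000) - quality(1_000)    # 10.0
gain_large = quality(64_000) - quality(32_000)  # 10.0
```

At 64k tokens that last doubling costs 32x more than the first one did for the same quality bump, which is the "paying a lot for marginal gains" problem in one line.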

Who Wins Here?

The shift from training-time to inference-time scaling changes the competitive dynamics of AI. The old question was "how much did you spend training this model?" The new question is "how much thinking time can you afford per query?"

The billion-dollar cost of training frontier models kept startups out. Inference-time scaling lets smaller labs compete by spending more at serving time. DeepSeek demonstrated this: a $6 million training run plus generous inference budgets can match the output of far more expensive training regimes.

The biggest open question is where the returns curve bends. We know inference-time compute works. We know it has diminishing returns. What we don't know yet is where the practical ceiling lies for production workloads.

That's what the next year of deployments will answer.

Frequently Asked Questions