Inference-Time Compute
A scaling approach that allocates additional compute during model inference rather than during training, allowing models to 'think longer' on difficult problems.
Inference-time compute is an emerging paradigm that scales AI capabilities by spending more computation when generating responses rather than during pre-training. Instead of relying on a larger model, systems like OpenAI's o1 and o3 use extended reasoning chains, multiple sampled attempts, or search procedures at inference time. This approach has shown dramatic results on reasoning benchmarks (o3 achieved 87.5% on ARC-AGI versus roughly 5% for GPT-4o) and represents a fundamental shift from the 'train bigger models' playbook that dominated 2020-2024.
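One simple form of inference-time compute is self-consistency: sample several candidate answers from the model and return the majority vote, trading extra generation compute for accuracy. A minimal sketch, where `generate_answer` is a hypothetical stand-in for a stochastic LLM call (a real system would sample with temperature > 0):

```python
from collections import Counter

def generate_answer(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a stochastic model call; a real
    # implementation would sample a reasoning chain from an LLM.
    fake_samples = ["42", "42", "41", "42", "40"]
    return fake_samples[seed % len(fake_samples)]

def self_consistency(prompt: str, n_samples: int = 5) -> str:
    """Sample n_samples candidates and return the most common answer.

    Increasing n_samples spends more inference-time compute; accuracy
    typically improves as disagreements are voted down.
    """
    answers = [generate_answer(prompt, seed) for seed in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # majority answer: "42"
```

The same scaling knob appears in o1-style systems as longer reasoning chains or search over chains rather than plain resampling, but the principle is identical: more compute per query, better answers.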
Also known as
test-time compute, inference scaling, thinking time