Headline: GPT-2 Training Now Costs Less Than a Pizza
Andrej Karpathy's latest nanochat milestone compresses what OpenAI needed 168 hours and roughly $43,000 to accomplish in 2019 into 2.91 hours on a single 8×H100 node. On spot instance pricing, the whole run costs about $20. GPT-2, the model OpenAI once deemed "too dangerous to release," is now cheaper to train than a large pepperoni.
The record, announced on February 3, represents a 4.3% improvement over the previous best of 3.04 hours set just days earlier. FP8 quantized training was the latest optimization to land, though Karpathy notes the practical speedup is modest: about 5% net, well below the theoretical 2× FLOP advantage FP8 holds over BF16. Scale-conversion overhead eats into the gain, and the relatively small matrix multiplications in a GPT-2-scale model leave FP8 little to accelerate.
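For readers who want to poke at FP8 themselves, here is a minimal sketch using torchao's float8 API. This is our own illustration under assumptions, not nanochat's actual integration, which may be wired differently; the module filter simply mirrors the overhead point above by skipping layers whose matmuls are too small or whose dimensions don't meet FP8 kernel requirements.

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # requires torchao and a Hopper+ GPU

# Toy stand-in for a transformer block; the real model would be GPT-2-scale.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # Skip linears whose dims aren't divisible by 16 (an FP8 kernel requirement)
    # or that are too small for FP8 to outrun its scale-conversion overhead
    # (the 1024 threshold is an arbitrary illustration, not a tuned value).
    if isinstance(mod, nn.Linear):
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
        if min(mod.in_features, mod.out_features) < 1024:
            return False
    return True

convert_to_float8_training(model, module_filter_fn=module_filter_fn)
# The training loop is unchanged: converted Linears run their matmuls in FP8
# with dynamic scaling, while everything else stays in BF16.
```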
But the FP8 result is the icing. The real story is the full stack of optimizations Karpathy and contributors have assembled across nanochat, and what it tells us about where training efficiency is heading.
The optimization stack
The nanochat discussion thread lays out the engineering in detail. The biggest single contributor was Flash Attention 3, which delivered a roughly 9% tokens-per-second improvement while enabling sliding window attention patterns that alternate between short and long context. The Muon optimizer, used for weight matrices while AdamW handles embeddings, brought its own gains through techniques including Polar Express orthogonalization, factored variance reduction, and "cautious" weight decay that only applies where gradient and parameter signs align.
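To make the split concrete, here is a hypothetical sketch of the parameter grouping and the "cautious" decay rule described above. The names are assumed and this is not nanochat's implementation, which pairs a full Muon optimizer (with Polar Express orthogonalization) against AdamW.

```python
import torch
import torch.nn as nn

def split_params(model: nn.Module):
    # Heuristic: 2-D hidden weight matrices go to a Muon-style optimizer,
    # embeddings and everything else (norms, scalars, biases) to AdamW.
    matrices, others = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name:
            matrices.append(p)
        else:
            others.append(p)
    return matrices, others

@torch.no_grad()
def cautious_weight_decay(params, lr: float, wd: float):
    # Decay an entry only where its gradient and its value share a sign,
    # i.e. where shrinking toward zero agrees with the descent direction.
    for p in params:
        if p.grad is None:
            continue
        mask = (p.grad.sign() == p.sign()).to(p.dtype)
        p.mul_(1.0 - lr * wd * mask)
```

The `matrices` group would then feed the Muon-style optimizer and `others` a standard `torch.optim.AdamW`, with `cautious_weight_decay` applied alongside each step.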
Architecturally, the model diverges significantly from the original GPT-2 design: RoPE replaces learned positional embeddings, RMSNorm replaces LayerNorm, ReLU² replaces GELU, and per-layer learnable residual scalars maintain what Karpathy calls a "superhighway" for gradient flow. Value embeddings at alternating layers add 604M parameters at near-zero FLOP cost. The resulting 1.38B parameter model was trained on FineWeb-edu data and evaluated across 22 benchmarks using a composite CORE metric.
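Two of those swaps are simple enough to show in a few lines. This is a sketch under assumed module names, not nanochat's code; in particular, exactly where the learnable scalar sits on the residual path is a design choice this sketch only approximates.

```python
import torch
import torch.nn as nn

class ReLUSquared(nn.Module):
    # ReLU squared activation, used in place of GELU.
    def forward(self, x):
        return torch.relu(x).square()

class ScaledResidual(nn.Module):
    # Wraps a sub-block (attention or MLP) and learns one scalar per layer
    # controlling how strongly its output mixes back into the residual
    # "superhighway". One plausible placement among several.
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner
        self.res_scale = nn.Parameter(torch.ones(()))

    def forward(self, x):
        return x + self.res_scale * self.inner(x)
```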
This wasn't one breakthrough. It was 320 hyperparameter experiments and dozens of architectural tweaks, each contributing fractions of a percent, compounding into a 600× cost reduction over seven years.
What didn't work
The failure list is just as instructive. Multi-token prediction added 13GB of memory for no improvement. Varlen attention did nothing meaningful. FP8 applied to the language model head bought a 1% speedup at 2GB of additional memory. Bigram embeddings were "complexity bloat." Half the ideas that sound good on paper didn't survive contact with actual training runs.
The "new MNIST" thesis
Karpathy frames GPT-2 as becoming "the new MNIST": a reference model so cheap to reproduce that it becomes a universal benchmark for testing ideas. The HN discussion largely revolved around the practicalities of running these jobs on spot instances, with commenters noting that a continuous three-hour window is readily available on most cloud providers. One commenter pointed out that Azure's 30-second eviction warning is tighter than AWS's two-minute notice, but the cost savings make interruptions worth tolerating.
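For anyone trying this on spot capacity, the standard defense is frequent, atomic checkpointing so an eviction costs only the last few minutes of work. The sketch below is our own generic illustration, not something from the thread or the repo; `CKPT_PATH` is a made-up name for storage that survives the instance.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path on storage that outlives the instance

def save_checkpoint(step, model, optimizer):
    # Write to a temp file, then atomically swap it in, so an eviction
    # mid-save never leaves a corrupt checkpoint behind.
    torch.save(
        {"step": step, "model": model.state_dict(), "opt": optimizer.state_dict()},
        CKPT_PATH + ".tmp",
    )
    os.replace(CKPT_PATH + ".tmp", CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["opt"])
    return ckpt["step"] + 1
```

Checkpointing every few hundred steps bounds the damage from an eviction to a few minutes of compute, which is what makes the interruptions tolerable in the first place.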
The analogy holds. When a benchmark becomes trivially cheap, it stops being impressive to beat and starts being useful as a test harness. Nanochat is explicitly designed for this: about 1,000 lines of readable, forkable code with an MIT license. It's not a framework. It's a starting point.
The scalability question
Will these optimizations work at frontier scale? The optimism needs tempering. A 600× cost reduction on GPT-2 does not mean we'll see a 600× reduction on frontier models. Many of nanochat's wins are specific to the model's scale. FP8's gains grow with matrix size, which is why torchao reports a 25% speedup on Llama3-8B compared to Karpathy's 5% on GPT-2. Flash Attention's benefits are well established at larger scales, but some of the architectural choices (value embeddings, alternating window attention) remain unproven at 70B+ parameter counts.
Our read: the individual techniques here are directionally right, but the magnitude of improvement will look different at every scale. The training cost curve for GPT-2 falls roughly 2.5× per year, according to Karpathy's analysis; compounded over seven years that works out to about 2.5⁷ ≈ 600×, which squares with the cumulative reduction above. Frontier models are on a different curve entirely, with training costs still climbing as capabilities expand.
What nanochat gives the field is a clean experimental sandbox. If you want to test whether a new optimizer or attention variant has legs, you can validate it against a meaningful baseline in under three hours for pocket change. That accelerates research iteration in a way that matters more than the headline dollar figure.
The pizza comparison is fun. The research velocity it enables is the real meal.