Headline: GPT-2 Training Now Costs Less Than a Pizza
Andrej Karpathy's latest nanochat milestone compresses what OpenAI needed 168 hours and roughly $43,000 to accomplish in 2019 into 2.91 hours on a single 8×H100 node. On spot instance pricing, the whole run costs about $20. GPT-2, the model OpenAI once deemed "too dangerous to release," is now cheaper to train than a large pepperoni.
The record, announced on February 3, represents a 4.3% improvement over the previous best of 3.04 hours set just days earlier. FP8 quantized training was the latest optimization to land, though Karpathy notes the practical speedup is modest: about 5% net, well below the theoretical 2× FLOP advantage FP8 holds over BF16. Scale-conversion overhead eats into the gain, and the relatively small matrix multiplications in a GPT-2-scale model leave FP8 little to accelerate.
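For readers who want to poke at FP8 themselves, here is a minimal sketch using torchao's float8 API. This is our own illustration under assumptions, not nanochat's actual integration, which may be wired differently; the module filter simply mirrors the overhead point above by skipping layers whose matmuls are too small or whose dimensions don't meet FP8 kernel requirements.

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # requires torchao and a Hopper+ GPU

# Toy stand-in for a transformer block; the real model would be GPT-2-scale.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
    # Skip linears whose dims aren't divisible by 16 (an FP8 kernel requirement)
    # or that are too small for FP8 to outrun its scale-conversion overhead
    # (the 1024 threshold is an arbitrary illustration, not a tuned value).
    if isinstance(mod, nn.Linear):
        if mod.in_features % 16 != 0 or mod.out_features % 16 != 0:
            return False
        if min(mod.in_features, mod.out_features) < 1024:
            return False
    return True

convert_to_float8_training(model, module_filter_fn=module_filter_fn)
# The training loop is unchanged: converted Linears run their matmuls in FP8
# with dynamic scaling, while everything else stays in BF16.
```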
But the FP8 result is the icing. The real story is the full stack of optimizations Karpathy and contributors have assembled across nanochat, and what it tells us about where training efficiency is heading.
The optimization stack
The nanochat discussion thread lays out the engineering in detail. The biggest single contributor was Flash Attention 3, which delivered a roughly 9% tokens-per-second improvement while enabling sliding window attention patterns that alternate between short and long context. The Muon optimizer, used for weight matrices while AdamW handles embeddings, brought its own gains through techniques including Polar Express orthogonalization, factored variance reduction, and "cautious" weight decay that only applies where gradient and parameter signs align.
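To make the split concrete, here is a hypothetical sketch of the parameter grouping and the "cautious" decay rule described above. The names are assumed and this is not nanochat's implementation, which pairs a full Muon optimizer (with Polar Express orthogonalization) against AdamW.

```python
import torch
import torch.nn as nn

def split_params(model: nn.Module):
    # Heuristic: 2-D hidden weight matrices go to a Muon-style optimizer,
    # embeddings and everything else (norms, scalars, biases) to AdamW.
    matrices, others = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name:
            matrices.append(p)
        else:
            others.append(p)
    return matrices, others

@torch.no_grad()
def cautious_weight_decay(params, lr: float, wd: float):
    # Decay an entry only where its gradient and its value share a sign,
    # i.e. where shrinking toward zero agrees with the descent direction.
    for p in params:
        if p.grad is None:
            continue
        mask = (p.grad.sign() == p.sign()).to(p.dtype)
        p.mul_(1.0 - lr * wd * mask)
```

The `matrices` group would then feed the Muon-style optimizer and `others` a standard `torch.optim.AdamW`, with `cautious_weight_decay` applied alongside each step.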
Architecturally, the model diverges significantly from the original GPT-2 design: RoPE replaces learned positional embeddings, RMSNorm replaces LayerNorm, ReLU² replaces GELU, and per-layer learnable residual scalars maintain what Karpathy calls a "superhighway" for gradient flow. Value embeddings at alternating layers add 604M parameters at near-zero FLOP cost. The resulting 1.38B parameter model was trained on FineWeb-edu data and evaluated across 22 benchmarks using a composite CORE metric.
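Two of those swaps are simple enough to show in a few lines. This is a sketch under assumed module names, not nanochat's code; in particular, exactly where the learnable scalar sits on the residual path is a design choice this sketch only approximates.

```python
import torch
import torch.nn as nn

class ReLUSquared(nn.Module):
    # ReLU squared activation, used in place of GELU.
    def forward(self, x):
        return torch.relu(x).square()

class ScaledResidual(nn.Module):
    # Wraps a sub-block (attention or MLP) and learns one scalar per layer
    # controlling how strongly its output mixes back into the residual
    # "superhighway". One plausible placement among several.
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner
        self.res_scale = nn.Parameter(torch.ones(()))

    def forward(self, x):
        return x + self.res_scale * self.inner(x)
```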
This wasn't one breakthrough. It was 320 hyperparameter experiments and dozens of architectural tweaks, each contributing fractions of a percent, compounding into a 600× cost reduction over seven years.
What didn't work
The failure list is just as instructive. Multi-token prediction added 13GB of memory for no improvement. Varlen attention did nothing meaningful. FP8 applied to the language model head bought a 1% speedup at 2GB of additional memory. Bigram embeddings were "complexity bloat." Half the ideas that sound good on paper didn't survive contact with actual training runs.
The "new MNIST" thesis
Karpathy frames GPT-2 as becoming "the new MNIST": a reference model so cheap to reproduce that it becomes a universal benchmark for testing ideas. The HN discussion largely revolved around the practicalities of running these jobs on spot instances, with commenters noting that a continuous three-hour window is readily available on most cloud providers. One commenter pointed out that Azure's 30-second eviction warning is tighter than AWS's two-minute notice, but the cost savings make interruptions worth tolerating.
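For anyone trying this on spot capacity, the standard defense is frequent, atomic checkpointing so an eviction costs only the last few minutes of work. The sketch below is our own generic illustration, not something from the thread or the repo; `CKPT_PATH` is a made-up name for storage that survives the instance.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"  # hypothetical path on storage that outlives the instance

def save_checkpoint(step, model, optimizer):
    # Write to a temp file, then atomically swap it in, so an eviction
    # mid-save never leaves a corrupt checkpoint behind.
    torch.save(
        {"step": step, "model": model.state_dict(), "opt": optimizer.state_dict()},
        CKPT_PATH + ".tmp",
    )
    os.replace(CKPT_PATH + ".tmp", CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["opt"])
    return ckpt["step"] + 1
```

Checkpointing every few hundred steps bounds the damage from an eviction to a few minutes of compute, which is what makes the interruptions tolerable in the first place.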
The analogy holds. When a benchmark becomes trivially cheap, it stops being impressive to beat and starts being useful as a test harness. Nanochat is explicitly designed for this: about 1,000 lines of readable, forkable code with an MIT license. It's not a framework. It's a starting point.
The scalability question
Will these optimizations work at frontier scale? The optimism needs tempering. A 600× cost reduction on GPT-2 does not mean we'll see a 600× reduction on frontier models. Many of nanochat's wins are specific to the model's scale. FP8's gains grow with matrix size, which is why torchao reports a 25% speedup on Llama3-8B compared to Karpathy's 5% on GPT-2. Flash Attention's benefits are well established at larger scales, but some of the architectural choices (value embeddings, alternating window attention) remain unproven at 70B+ parameter counts.
Our read: the individual techniques here are directionally right, but the magnitude of improvement will look different at every scale. The training cost curve for GPT-2 falls roughly 2.5× per year, according to Karpathy's analysis; compounded over seven years that works out to about 2.5⁷ ≈ 600×, which squares with the cumulative reduction above. Frontier models are on a different curve entirely, with training costs still climbing as capabilities expand.
What nanochat gives the field is a clean experimental sandbox. If you want to test whether a new optimizer or attention variant has legs, you can validate it against a meaningful baseline in under three hours for pocket change. That accelerates research iteration in a way that matters more than the headline dollar figure.
The pizza comparison is fun. The research velocity it enables is the real meal.