When DeepSeek released R1 in January 2025, the AI industry's reaction split into two camps: those who thought the benchmarks were faked, and those who started reverse-engineering the training recipe.
The benchmarks weren't faked. R1 hits 79.8% on AIME 2024, matching OpenAI's o1 at 79.2%. And the training recipe is now public. What makes this interesting isn't that a Chinese lab matched OpenAI; it's that they did it with open weights, for roughly $6 million in compute, using methods OpenAI deliberately avoided. DeepSeek proved there's more than one path to reasoning AI.
Big Model, Small Inference
R1 is built on DeepSeek V3, a Mixture of Experts (MoE) model with 671 billion total parameters. Only 37 billion activate per inference. The model routes each input to relevant expert subnetworks, leaving the rest dormant.
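The routing step can be sketched as a toy top-k gate. This is a minimal numpy sketch, not V3's actual code: names and dimensions are illustrative (V3 routes each token to 8 of 256 experts; the toy below uses 64).

```python
import numpy as np

def moe_route(token, gate_w, k=8):
    """Toy top-k router: pick the k experts with the highest gate logits,
    softmax over just those k. Only the selected subnetworks would run."""
    logits = token @ gate_w                   # one logit per expert
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    return topk, w / w.sum()                  # normalized mixing weights

rng = np.random.default_rng(0)
token = rng.standard_normal(16)               # toy hidden state
gate_w = rng.standard_normal((16, 64))        # 64 toy experts
experts, weights = moe_route(token, gate_w, k=8)
```

The token's output is then the weighted sum of just those k experts' outputs; the other experts never execute, which is why only 37B of 671B parameters activate per token.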
MoE isn't new. It's been around since the '90s. What's new is making it work at this scale without bankrupting yourself in the process. Sebastian Raschka's analysis highlights two architectural choices that made V3 viable: Multi-Head Latent Attention (MLA), which compresses key-value tensors into a lower-dimensional latent space and shrinks the KV cache at inference time, and a fine-grained MoE design with a shared expert that stays active for every token, which stabilizes routing.
According to Epoch AI's cost analysis, training required roughly 3×10²⁴ FLOP on 14.8 trillion tokens using 2,048 H800 GPUs. Total cost: approximately $6 million, excluding experiments and personnel.
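Both figures can be sanity-checked with the standard ~6ND training-compute approximation, counting only the parameters active per token, and the GPU-hour count from DeepSeek's own V3 report (which assumes a $2/GPU-hour rental rate):

```python
# ~6ND approximation: 6 FLOP per active parameter per training token.
active_params = 37e9                      # 37B active per token
tokens = 14.8e12                          # 14.8T training tokens
total_flop = 6 * active_params * tokens   # ≈ 3.3e24, matching Epoch's estimate

gpu_hours = 2.788e6                       # H800 GPU-hours from the V3 report
cost = gpu_hours * 2.0                    # assumed $2/GPU-hour rental rate
# ≈ $5.6M, the "roughly $6 million" figure
```

Note the approximation uses active parameters, not total: MoE sparsity is exactly why the FLOP count looks like a 37B dense model's, despite the 671B total.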
Some found this implausibly cheap. Epoch AI's take is counterintuitive: the mystery isn't why it was so cheap, but why it was so expensive given the algorithmic shortcuts DeepSeek employed. The model achieved only 23% FLOP utilization due to MoE communication overhead across 64-way expert parallelism.
GRPO Cuts the Reward Model Entirely
Traditional RLHF requires multiple models in memory simultaneously: the policy model you're training, a reference model for KL divergence, a reward model to score outputs, and often a value head for PPO's advantage estimation. Expensive. Complex. Memory-hungry.
DeepSeek introduced Group Relative Policy Optimization (GRPO), detailed in the R1 paper. The core idea: eliminate the critic model entirely. Instead of training a separate value function to estimate advantages, compute them relative to group averages. Generate multiple responses for each prompt, score them, use the relative rankings as your training signal. Philipp Schmid's breakdown notes that GRPO estimates baselines from group scores, eliminating the separate value function model that PPO needs. This approximately halves the memory and compute requirements for the RL phase.
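The advantage computation itself fits in a few lines. Following the normalization described in the R1 paper, each response is scored against the mean and standard deviation of its own group (a minimal numpy sketch):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled response's reward
    against its group's mean and std. No critic network required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# one prompt, four sampled responses, 0/1 accuracy rewards
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get positive advantage, incorrect ones negative, purely from the group statistics. That is the entire replacement for PPO's learned value function.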
The trick that makes GRPO practical: rule-based rewards rather than learned reward models. For math and coding, they check whether the answer is correct. Binary signal. For format consistency and language, simple rule-based checks. No Monte Carlo Tree Search. No Process Reward Models.
Just accuracy signals plus format constraints.
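A rule-based reward in this spirit is almost trivially simple. The sketch below is hypothetical in its details (the exact tags, answer format, and weights are illustrative, though R1 does use think-style tags and accuracy-plus-format rewards):

```python
import re

def rule_based_reward(response, gold):
    """Toy reward: 1.0 for a correct final answer, plus a small bonus
    for wrapping reasoning in think tags. No learned model anywhere."""
    fmt = 1.0 if re.search(r"<think>.*</think>", response, re.S) else 0.0
    m = re.search(r"answer:\s*(\S+)", response, re.I)
    acc = 1.0 if m and m.group(1) == gold else 0.0
    return acc + 0.1 * fmt

good = rule_based_reward("<think>7*6=42</think> Answer: 42", "42")
bad = rule_based_reward("It is 41", "42")
```

A verifier like this is cheap, unhackable in ways learned reward models are not, and it scales to millions of rollouts with zero extra GPU memory.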
This is a deliberate simplification. We've written about the DPO vs RLHF debate; GRPO represents yet another point in the design space, trading reward model sophistication for training efficiency.
The Four Stages
R1's training follows a specific sequence, documented by Schmid:
Stage 1: Cold-start SFT. Before any RL, they fine-tune V3 on thousands of long chain-of-thought examples. This makes subsequent RL training faster and more stable.
Stage 2: RL for reasoning. GRPO does its work here, optimizing for accuracy on math, code, and logic tasks using rule-based rewards. The paper reports emergent behaviors appearing at this stage: self-verification, reflection, dynamic strategy adaptation. DeepSeek didn't explicitly train these capabilities. They emerged from the RL process.
Stage 3: Rejection sampling + SFT. They generate 600,000 reasoning samples and 200,000 general-purpose samples using R1 as the generator and V3 as the evaluator. The best outputs become new training data.
Stage 4: RL for helpfulness. A final RL phase optimizes for human preferences on general tasks, not just reasoning accuracy.
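Stage 3's rejection sampling reduces to a filter loop. This is a sketch under assumed interfaces (`generate` and `score` are stand-ins, not DeepSeek's code; the real pipeline also uses V3 as a judge for outputs a rule can't verify):

```python
def rejection_sample(prompts, generate, score, n_candidates=4, threshold=1.0):
    """Sample several candidates per prompt; keep the first one that
    clears the score threshold as new SFT training data."""
    kept = []
    for p in prompts:
        for cand in generate(p, n=n_candidates):
            if score(p, cand) >= threshold:
                kept.append({"prompt": p, "completion": cand})
                break
    return kept

# toy stand-ins to show the shape of the loop
data = rejection_sample(
    ["2+2?"],
    generate=lambda p, n: ["5", "4"],
    score=lambda p, c: 1.0 if c == "4" else 0.0,
)
```

The model effectively curates its own curriculum: only outputs that pass the filter feed the next SFT round.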
Stage 1 reveals something easy to miss: pure RL works, but fine-tuning first makes it work better. DeepSeek-R1-Zero, trained with pure RL and no initial SFT, showed reasoning capability but had readability issues and language-mixing problems. The cold-start SFT solved these.
Distillation: 32B Beats o1-mini
R1's flagship results are impressive. But if you're a practitioner, the distillation story might be what actually affects your work.
DeepSeek generated 800,000 samples from R1 and used them to fine-tune smaller models ranging from 1.5B to 70B parameters. The results: the distilled 32B model achieves 72.6% on AIME 2024, beating o1-mini's 63.6%. A model you can run locally outperforming OpenAI's smaller reasoning model.
This suggests reasoning patterns learned through RL transfer efficiently through distillation. You don't need to run the full GRPO pipeline on a 671B model to get reasoning capabilities. You can distill from a model that did. The weights are public. The training recipe is documented. The distilled models are already being fine-tuned by the community.
R1 is not a complete victory, though. The safety guardrails are notably weak; some evaluations report near-100% jailbreak success rates. This is likely a tradeoff of the pure RL approach, which optimizes for task accuracy without the safety-focused RLHF that labs like Anthropic and OpenAI apply.
The benchmark parity with o1 also doesn't mean identical reasoning. R1's chain-of-thought process differs in ways that aren't fully characterized. It tends toward longer reasoning chains. Whether that's more or less efficient depends entirely on the task.
And GRPO itself has stability issues at scale. The Qwen team reports that GRPO suffers from severe instability during long training runs, sometimes causing model collapse. Their solution, GSPO, moves to sequence-level optimization. The RL training landscape is still evolving rapidly.
Our read: DeepSeek R1 matters because it demonstrated an alternative. Frontier reasoning doesn't require OpenAI's budget, OpenAI's methods, or OpenAI's secrecy. A lab in Hangzhou published the architecture, the training pipeline, and the weights. The specific techniques will evolve. GRPO may get replaced by GSPO or something else. MoE routing will improve. But the precedent stands: reasoning capability can be developed openly, documented publicly, and matched by labs outside the traditional frontier.