GRPO
Group Relative Policy Optimization, DeepSeek's RL training method that eliminates the critic model by computing advantages relative to group averages of sampled responses.
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm introduced by DeepSeek for training reasoning models. Unlike PPO, which requires a separate value function (critic) to estimate advantages, GRPO generates multiple responses per prompt, scores them with rule-based rewards, and computes each response's advantage relative to its group, typically the response's reward minus the group mean, normalized by the group's standard deviation. Dropping the critic, which is usually as large as the policy itself, roughly halves memory requirements for the RL phase. GRPO enables training on binary accuracy signals (correct/incorrect) without a learned reward model, though subsequent research by the Qwen team found stability issues at scale.
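The group-relative advantage step can be sketched as follows. This is a minimal illustration, assuming the common formulation (reward minus group mean, divided by group standard deviation); the function name and group size are illustrative, not from DeepSeek's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # One prompt, G sampled responses, one scalar reward each.
    # Advantage = (reward - group mean) / (group std + eps);
    # no critic network is involved.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Binary accuracy rewards for 8 sampled responses to one prompt:
adv = group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0])
# Correct responses get positive advantages, incorrect ones negative;
# the advantages sum to zero within the group.
```

Because the baseline is the group's own mean, a prompt where every response is correct (or every response is wrong) yields zero advantage for all samples and contributes no gradient signal.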
Also known as
Group Relative Policy Optimization