DPO
Direct Preference Optimization. A simpler alternative to RLHF that trains directly on preference data using a single classification loss, eliminating the need for a separate reward model.
Introduced by Stanford researchers and published at NeurIPS 2023, DPO shows mathematically that the reward in the RLHF objective can be reparameterized in terms of the optimal policy itself, so the preference probabilities can be written directly as a function of the policy. This allows training with a single supervised classification loss on preference pairs instead of the multi-stage RLHF pipeline of reward model training followed by PPO. DPO matches or exceeds PPO-based RLHF on summarization and dialogue benchmarks with far less engineering complexity, though most frontier labs still use PPO-based approaches for production models.
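A minimal sketch of the DPO classification loss in PyTorch-style Python, assuming per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the policy and a frozen reference model (the function and argument names here are illustrative, not from the paper's codebase):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of per-sequence log-probabilities
    (sum of token log-probs of the response given the prompt) under
    the trainable policy or the frozen reference model. beta scales
    the implicit KL penalty toward the reference model.
    """
    # Implicit reward of each response: how much more the policy
    # favors it relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification loss: maximize the margin of the chosen
    # response over the rejected one, i.e. -log sigmoid(margin).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the log-probabilities are obtained by summing token log-probs of each response conditioned on the prompt, and beta (around 0.1 to 0.5 in the paper's experiments) controls how far the policy is allowed to drift from the reference model.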
Also known as
direct preference optimization