Direct Preference Optimization

An alternative to RLHF that trains language models on preference data using a single supervised loss, eliminating the need for a separate reward model.

Direct Preference Optimization (DPO) is a simpler alignment technique introduced by Stanford researchers at NeurIPS 2023. Its key insight is that the reward model implicit in RLHF can be written in closed form in terms of the policy itself, so the policy can be trained directly on preference pairs with a simple classification loss, bypassing the multi-stage pipeline of fitting a separate reward model and running PPO. DPO matches or exceeds PPO-based RLHF on academic benchmarks with substantially less engineering complexity, though frontier labs still predominantly use PPO-style RL for production models.
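The single supervised loss can be sketched for one preference pair as follows. This is a minimal illustration, not a production implementation: it assumes the summed token log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model have already been computed, and all argument names are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Inputs are total log-probabilities of each response; beta controls
    how far the policy may drift from the reference model.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it, relative to the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_reward - rejected_reward)
    # Binary-classification loss: -log sigmoid(logits) is small when the
    # policy separates chosen from rejected more strongly than the
    # reference does.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When policy and reference agree, the logits are zero and the loss is ln 2; raising the policy's log-probability of the chosen response (or lowering the rejected one) decreases the loss, which is the entire training signal. No reward model, value function, or sampling loop is required.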

Also known as

DPO