DPO

Direct Preference Optimization. A simpler alternative to RLHF that trains directly on preference data using a single classification loss, eliminating the need for a separate reward model.

Introduced by Stanford researchers (Rafailov et al.) at NeurIPS 2023, DPO shows that the KL-constrained RLHF objective has a closed-form optimal policy, which lets the reward be reparameterized in terms of the policy itself. Training then reduces to a single supervised classification loss on preference pairs, replacing the multi-stage RLHF pipeline of reward model training followed by PPO. DPO matches or exceeds PPO-based RLHF on summarization and dialogue benchmarks with far less engineering complexity, though most frontier labs still use PPO-based approaches for production models.
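A minimal sketch of that classification loss, assuming per-sequence log-probabilities for the chosen and rejected completions have already been computed under the trainable policy and a frozen reference model; the function and tensor names and the beta default are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of policy vs. reference for each completion.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (chosen - rejected)),
    # averaged over the batch of preference pairs.
    logits = chosen_logratios - rejected_logratios
    return -F.logsigmoid(beta * logits).mean()
```

The beta coefficient plays the same role as the KL penalty weight in RLHF: larger values keep the trained policy closer to the reference model.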

Also known as

direct preference optimization