RLHF

A training technique that fine-tunes language models using human preference data, optimizing outputs through reinforcement learning against a reward model trained on human comparisons.

Reinforcement Learning from Human Feedback (RLHF) is typically the final stage in training conversational AI models such as GPT-4 and Claude. Human annotators compare pairs of model outputs and select the better response; a reward model is then trained to predict these preferences. The language model is subsequently optimized with Proximal Policy Optimization (PPO) to produce higher-scoring outputs, with a KL-divergence penalty that prevents excessive drift from the base model. RLHF shapes model behavior to be more helpful and to refuse harmful requests, though research suggests these alignment changes can be surprisingly shallow and fragile.
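The two core quantities in this pipeline can be sketched in a few lines: the reward model is trained with a Bradley-Terry-style preference loss on chosen/rejected pairs, and the policy is optimized against a reward that subtracts a KL-style penalty. This is a minimal illustrative sketch, not any specific system's implementation; the function names and the `beta` coefficient are assumptions for illustration.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Lower when the reward model scores the human-preferred
    response above the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def penalized_reward(reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward used during PPO: reward-model score minus a per-token
    KL-style penalty, beta * (log pi(y|x) - log pi_ref(y|x)),
    which discourages drift from the base (reference) model.
    `beta` here is an illustrative coefficient, not a canonical value.
    """
    return reward - beta * (logp_policy - logp_ref)
```

For example, when the policy assigns the same log-probability as the reference model, the penalty term vanishes and the raw reward-model score passes through unchanged; as the policy drifts toward tokens the reference model considers unlikely, the penalty grows and offsets reward-model gains.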

Also known as

Reinforcement Learning from Human Feedback