RLHF
Reinforcement Learning from Human Feedback. A training method that optimizes language model outputs against a reward model trained on human preference rankings.
RLHF is the final stage of the modern LLM training pipeline, applied after pre-training and supervised fine-tuning. Human annotators compare pairs of model outputs and pick the better one. A reward model learns to predict these preferences, and PPO (Proximal Policy Optimization) then nudges the language model's weights toward higher-scoring outputs. A KL divergence penalty prevents the model from drifting too far from its SFT checkpoint. Research shows RLHF primarily modifies behavior in the first few output tokens rather than deeply altering the model's knowledge.
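The two objectives in that pipeline can be written down compactly: a pairwise (Bradley-Terry) loss for the reward model, and a KL-penalized reward for the PPO step. The sketch below is illustrative only, assuming per-completion scalar rewards and already-computed log-probabilities; the function names and the `beta` value are made up for the example, not taken from any specific library.

```python
import numpy as np

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: pushes the reward of the
    human-preferred output above the reward of the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

def kl_penalized_reward(reward: float,
                        logprob_policy: float,
                        logprob_sft: float,
                        beta: float = 0.1) -> float:
    """Signal optimized by PPO: the reward model's score minus a
    KL-style penalty that keeps the policy close to the SFT checkpoint."""
    return reward - beta * (logprob_policy - logprob_sft)

# Preferred completion scored 1.2 vs. 0.3 for the rejected one.
print(reward_model_loss(1.2, 0.3))           # loss shrinks as the margin grows
# Policy assigns a higher log-prob than the SFT model, so it pays a small penalty.
print(kl_penalized_reward(1.2, -2.1, -2.5))
```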
Also known as
reinforcement learning from human feedback