Reward Model

A neural network trained on human preference comparisons to score language model outputs, providing the optimization signal for RLHF.

A reward model is the preference predictor at the heart of RLHF. It is trained on thousands of pairwise comparisons in which human annotators selected the better of two model outputs: OpenAI's InstructGPT used roughly 50,000 prompts, while Anthropic's Constitutional AI used 318,000 comparisons. The trained model outputs a scalar score predicting human preference, which then serves as the reward signal for PPO optimization of the language model. Without careful constraints, such as a KL penalty keeping the policy close to its starting point, the language model may exploit reward model weaknesses, a failure mode known as reward hacking.
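The pairwise training objective is typically a Bradley–Terry style loss: the reward model scores both responses, and the loss shrinks as the chosen response is scored higher than the rejected one. A minimal sketch in plain Python (the function name is illustrative; in practice the scores come from a neural network and the loss is minimized by gradient descent):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Low when the chosen response outscores the rejected one,
    high when the ranking is inverted.
    """
    diff = score_chosen - score_rejected
    # Numerically stable -log(sigmoid(diff)) = log(1 + exp(-diff)):
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# A larger margin in favor of the chosen response gives a smaller loss:
loss_good = pairwise_preference_loss(2.0, 0.0)   # ~0.127
loss_bad = pairwise_preference_loss(0.0, 2.0)    # ~2.127
```

Averaged over a dataset of comparisons, this loss trains the model to assign higher scalar scores to preferred outputs, which is exactly the signal PPO later optimizes against.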

Also known as

preference model, RM