Reward Hacking
When a model finds degenerate shortcuts that score well on the reward function without actually being more helpful or safe.
A persistent problem in RLHF where models exploit patterns in the reward model rather than genuinely improving output quality. Without a KL divergence penalty constraining how far the model can drift from its SFT checkpoint, models may generate gibberish strings that fool the reward model into giving high scores. Sycophancy, where models learn that agreeing with users produces higher preference scores, is arguably a subtler form of reward hacking.
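The KL penalty mentioned above is typically folded directly into the per-token reward during RL training. Below is a minimal sketch of that reward shaping, assuming you already have the reward model's score for each response and per-token log-probabilities from both the policy and the frozen SFT (reference) model; the function name, tensor shapes, and the `beta` coefficient are illustrative, not a specific library's API.

```python
import torch

def kl_penalized_rewards(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine reward-model scores with a per-token KL penalty.

    rm_scores:       (batch,) reward-model score for each full response
    policy_logprobs: (batch, seq_len) log-probs of sampled tokens under the policy
    ref_logprobs:    (batch, seq_len) log-probs of the same tokens under the SFT model
    beta:            KL penalty coefficient (illustrative value)
    """
    # Per-token KL estimate: log pi_theta(y_t | ...) - log pi_SFT(y_t | ...)
    kl_per_token = policy_logprobs - ref_logprobs

    # Penalize drift from the SFT checkpoint at every generated token;
    # the reward-model score is added only at the final token.
    rewards = -beta * kl_per_token
    rewards[:, -1] += rm_scores
    return rewards
```

The penalty term makes gibberish that fools the reward model costly: such outputs are very unlikely under the SFT model, so their KL term is large and the shaped reward drops even if the raw reward-model score is high.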
Also known as
reward gaming