Alignment Tax
The measurable degradation in capabilities such as reasoning, translation, and factual accuracy that occurs when a model undergoes RLHF alignment training.
The alignment tax refers to the tradeoff between safety alignment and raw model capability. Research presented at EMNLP 2024 quantified this effect: as RLHF reward increases, models show declining performance on reading comprehension, translation, and common-sense reasoning benchmarks. OpenAI documented the same pattern with InstructGPT, where RLHF caused regressions on public NLP benchmarks even as human evaluators preferred the model's outputs, a cost they partially offset by mixing pretraining data into the RL updates. This occurs because RLHF optimizes for human preference rather than truth, and the surface-level behavioral changes can interfere with deeper capabilities.
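One way to picture the tax concretely is as the per-benchmark score drop between a base model and its RLHF-aligned counterpart. The sketch below is a minimal illustration with made-up numbers; the alignment_tax helper and the benchmark names are hypothetical, not taken from any of the papers above.

```python
from typing import Dict


def alignment_tax(
    base_scores: Dict[str, float],     # benchmark -> score of the pre-RLHF model
    aligned_scores: Dict[str, float],  # benchmark -> score of the RLHF-aligned model
) -> Dict[str, float]:
    """Return the capability drop (positive = tax) on each shared benchmark."""
    return {
        task: base_scores[task] - aligned_scores[task]
        for task in base_scores.keys() & aligned_scores.keys()
    }


if __name__ == "__main__":
    # Illustrative numbers only; real values depend on the models and benchmarks used.
    base = {"reading_comprehension": 0.72, "translation": 0.31, "commonsense_qa": 0.68}
    aligned = {"reading_comprehension": 0.69, "translation": 0.27, "commonsense_qa": 0.66}
    for task, tax in sorted(alignment_tax(base, aligned).items()):
        print(f"{task}: tax = {tax:+.3f}")
```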
Also known as
RLHF tax