Alignment Tax

The measurable degradation in a model's core capabilities (reading comprehension, translation, reasoning) caused by RLHF safety training.

Research presented at EMNLP 2024 quantified the effect: as the RLHF reward increases during training, models lose performance on reading comprehension, translation, and common-sense reasoning benchmarks. OpenAI's InstructGPT exhibited a specific manifestation, with RLHF making hallucination worse even as human evaluators preferred its outputs overall. Simple weight interpolation between the pre-RLHF and post-RLHF checkpoints often produces the best tradeoff between alignment and retained capability; a sketch of that interpolation follows.
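
The weight-interpolation mitigation is simple enough to sketch. The snippet below is a minimal illustration, not code from any cited paper: it assumes both checkpoints share the same architecture and parameter names, and the function and variable names (`interpolate_weights`, `pre_rlhf_state`, `post_rlhf_state`, `alpha`) are hypothetical.

```python
import torch


def interpolate_weights(pre_rlhf_state, post_rlhf_state, alpha=0.5):
    """Linearly interpolate two state dicts with the same keys and shapes.

    alpha=0.0 returns the pre-RLHF weights, alpha=1.0 the post-RLHF weights;
    intermediate values trade retained capability against alignment.
    """
    merged = {}
    for name, pre_param in pre_rlhf_state.items():
        post_param = post_rlhf_state[name]
        if pre_param.dtype.is_floating_point:
            # Element-wise linear interpolation of floating-point parameters.
            merged[name] = (1.0 - alpha) * pre_param + alpha * post_param
        else:
            # Integer buffers (e.g. position ids) are copied from the
            # post-RLHF checkpoint unchanged.
            merged[name] = post_param.clone()
    return merged


# Hypothetical usage: blend 70% post-RLHF with 30% pre-RLHF weights,
# then load the result back into a model of the same architecture.
# merged = interpolate_weights(pre_model.state_dict(),
#                              post_model.state_dict(), alpha=0.7)
# pre_model.load_state_dict(merged)
```

In practice the coefficient `alpha` is tuned by sweeping a few values and evaluating both alignment (reward or preference win rate) and the capability benchmarks that degraded.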

Also known as

alignment cost