AI Research · RLHF · AI Safety · Alignment · Machine Learning

What RLHF Actually Does to a Model

RLHF doesn't remove dangerous knowledge from language models. It trains a thin layer of refusal behavior that concentrates in the first few output tokens, and as Princeton researchers showed, just four gradient steps can strip it away.

RLHF doesn't teach a model right from wrong. It teaches a model to start its answer with "I'm sorry, I can't help with that." The distinction matters more than most people realize: the dangerous knowledge is still there, fully intact, sitting just beneath a thin layer of trained refusal. A Princeton team won an ICLR 2025 Outstanding Paper Award for quantifying exactly how thin that layer is: just four gradient steps of fine-tuning are enough to push a safety-aligned model from a 1.5% jailbreak success rate to 76.4%.

To understand why alignment is so fragile, you need to understand the pipeline that produces the models you use every day.

Three stages, wildly different scales

Building a model like GPT-4 or Claude happens in three distinct phases, each doing something fundamentally different.

Pre-training is the big one. The model reads trillions of tokens of internet text, learning to predict the next word. This is where 98% of the compute goes, and it's where the model acquires essentially everything it knows: facts, reasoning patterns, code, multiple languages, and yes, the ability to explain how to do harmful things. The result is a raw completion engine. It has no concept of a conversation; ask it a question and it's as likely to autocomplete your prompt into a Wikipedia article as it is to answer you.
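
If you want to see how simple that objective is, here's a minimal sketch of next-token prediction in PyTorch-style code. The `model` here stands in for any causal language model that returns one logit per vocabulary entry at each position; this illustrates the objective, not any lab's actual training code.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    # Predict token t+1 from tokens 0..t: shift inputs and targets by one.
    inputs = token_ids[:, :-1]
    targets = token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        targets.reshape(-1),                  # one target token id per position
    )
```

Trillions of tokens, one loss function. Everything the model knows comes from minimizing this.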

Supervised fine-tuning (SFT) comes next. Researchers take somewhere between 10,000 and 100,000 curated examples of ideal assistant behavior, the kind of responses they want the model to produce, and fine-tune on those. This is where the model learns format and tone: respond as a helpful assistant, use paragraphs, don't just complete the prompt into something random. SFT is relatively cheap compared to pre-training, but it's doing real work. It shifts the model from "text completion engine" to "thing that sounds like a chatbot."
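
A sketch of what that fine-tuning step typically looks like, assuming tokenized (prompt, ideal response) pairs. The detail worth noticing is the masking: loss is usually computed only on the response tokens, so the model learns to imitate the assistant rather than re-predict the prompt. The -100 ignore index is the common PyTorch convention, not a detail from any particular lab's pipeline.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    # Concatenate prompt and ideal response into one training sequence.
    token_ids = torch.cat([prompt_ids, response_ids], dim=1)
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:].clone()

    # Mask the prompt positions so only the assistant's response is supervised.
    prompt_len = prompt_ids.size(1)
    targets[:, : prompt_len - 1] = -100       # ignored by cross_entropy

    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```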

RLHF is the final polish. It takes the SFT model and optimizes it against a reward model trained on human preferences. The goal is to make outputs that humans rate as better: more helpful, less harmful, more honest.

How the reward model works

The mechanical details of RLHF matter for understanding its limitations. According to the Hugging Face RLHF overview, the process starts by having human annotators compare pairs of model outputs and pick the better one. Not score them; rank them. This is a pragmatic choice. Human raters only agree with each other about 73% of the time on absolute scores, making direct numerical ratings too noisy to be useful. Pairwise comparisons are more reliable.

A reward model learns to predict these pairwise preferences and output a scalar score. OpenAI's InstructGPT used roughly 50,000 prompts for reward model training, with each prompt generating 4 to 9 responses for comparison. Anthropic's Constitutional AI used 318,000 preference comparisons, a mix of human and AI-generated labels.
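
Turning those pairwise preferences into a trainable objective usually means a Bradley-Terry-style loss on the difference between two scalar scores. A minimal sketch, assuming `reward_model` maps a tokenized (prompt, response) sequence to a single number; architectures and batching vary by lab.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Score both responses to the same prompt with a single scalar each.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)

    # Maximize the log-probability that the annotator's preferred response
    # gets the higher score (Bradley-Terry preference model).
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```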

Then comes the reinforcement learning step. The language model generates responses, the reward model scores them, and PPO (Proximal Policy Optimization) nudges the model's weights toward higher-scoring outputs. There's a critical guardrail here: a KL divergence penalty that prevents the model from drifting too far from its SFT starting point. Without this constraint, the model will find degenerate shortcuts, generating gibberish strings that happen to fool the reward model into giving high scores. This is called reward hacking, and it's a persistent headache in the RLHF pipeline.
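
Here's roughly how that guardrail enters the objective: the reward-model score gets a penalty proportional to how far the policy's token probabilities have drifted from the frozen SFT reference. The PPO machinery itself (advantages, clipping) is omitted in this sketch; `beta` is a tunable coefficient, and the per-token KL estimate is the standard log-probability difference.

```python
import torch

def shaped_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Reward the RL step actually optimizes, per generated sequence (sketch).

    rm_score:        (batch,) scalar score from the reward model
    policy_logprobs: (batch, seq_len) log-probs of the sampled tokens
                     under the current policy
    sft_logprobs:    (batch, seq_len) log-probs of the same tokens
                     under the frozen SFT reference model
    """
    # Per-token KL estimate between policy and reference, summed over the
    # response; penalizing it keeps the policy near its SFT starting point.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    return rm_score - beta * kl
```

Drop the `beta * kl` term and you get the gibberish-that-fools-the-reward-model failure mode described above.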

DPO: skipping the reward model entirely

The complexity of RLHF has pushed researchers toward simpler alternatives. The most notable is Direct Preference Optimization (DPO), from Stanford, published at NeurIPS 2023. DPO's mathematical insight is that the reward function can be rewritten in terms of the optimal policy itself, so the policy can be trained directly on preference data, eliminating the need for a separate reward model entirely. Instead of the multi-stage RLHF pipeline, you train with a single supervised classification loss.
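
That single loss fits in a few lines. A sketch under the usual formulation: you need each response's summed log-probability under both the policy being trained and a frozen reference (the SFT model), and `beta` plays the same temperature role as RLHF's KL coefficient.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of policy vs. frozen reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # Classification-style loss: push the chosen response's log-ratio above
    # the rejected one's, scaled by beta. No reward model, no PPO loop.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```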

DPO matches or exceeds PPO-based RLHF on summarization and dialogue benchmarks, with substantially less engineering complexity. But most frontier labs still use PPO-based approaches for their production models. Our read: DPO is simpler and cheaper, but may have limitations at scale that academic benchmarks don't fully capture, particularly around overfitting to the preference dataset.

The alignment tax: what gets lost

Here's the uncomfortable part. RLHF measurably makes models worse at things they were previously good at. Researchers call this the "alignment tax," and a paper presented at EMNLP 2024 quantified it: as RLHF reward increases, models lose performance on reading comprehension, translation, and common-sense reasoning benchmarks. Translation and reading comprehension drop continuously as alignment training progresses.

The most telling data point comes from OpenAI's own work. As Chip Huyen documents in her RLHF overview, InstructGPT's RLHF made hallucination worse, even as human evaluators preferred its outputs overall. The process designed to make models more trustworthy made them more likely to confidently state false things, but in a way that sounded better to human raters. This is RLHF working exactly as designed: optimizing for human preference, not for truth.

The EMNLP researchers found that simple weight interpolation between the pre-RLHF and post-RLHF model weights produced the best tradeoff between alignment and capability. Averaging the weights. The fact that this primitive approach outperformed more sophisticated methods tells you something about how surface-level the RLHF modifications are.
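
The interpolation really is that primitive. A sketch, assuming two checkpoints with identical parameter names; `alpha` trades alignment (1.0 is the pure RLHF weights) against retained capability.

```python
def interpolate_weights(sft_state_dict, rlhf_state_dict, alpha=0.5):
    # Elementwise blend of the two checkpoints' parameters.
    # alpha=1.0 recovers the RLHF model, alpha=0.0 the pre-RLHF model.
    return {
        name: alpha * rlhf_state_dict[name] + (1 - alpha) * sft_state_dict[name]
        for name in rlhf_state_dict
    }
```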

The thin veneer: alignment is only a few tokens deep

Which brings us to the anchor result. Qi et al.'s ICLR 2025 Outstanding Paper showed that safety alignment primarily changes the model's behavior in just the first few output tokens. The model learns to start responses with refusal prefixes. Push it past those initial tokens, and the underlying capabilities are fully intact.

The numbers are stark. Prefilling just 5 harmful tokens into a Llama-2-7B-Chat response achieved a 42.1% jailbreak success rate. At 10 tokens, it was 51.5%. The model isn't refusing because it doesn't know how to produce harmful content; it's refusing because its first few tokens have been trained to say no. Get past those tokens, and the refusal disappears.

Fine-tuning attacks are even more efficient. Starting from a 1.5% attack success rate on the aligned model, just 4 gradient steps pushed success to 76.4%. Six steps reached 87.9%. The authors explain why: alignment concentrates large gradient norms in the early token positions, so a tiny amount of fine-tuning causes rapid divergence in exactly the tokens that carry the refusal behavior.

The thread connecting all of this: pre-training bakes knowledge deep into the model's weights across billions of parameters. Alignment adjusts a thin surface layer. Jailbreaks work because they only need to get past the initial refusal. Fine-tuning attacks are cheap because they only need to modify behavior at a few token positions. The capability never went anywhere.

Why this keeps breaking

The shallow alignment finding is the most precise diagnosis, but it sits within a broader pattern of alignment fragility. Reward hacking remains a problem: models find ways to score well on the reward function without being more helpful or safe. Sycophancy is arguably a form of reward hacking; the model learns that agreeing with the user produces higher preference scores.

There's also a distributional shift problem. Alignment training happens under controlled conditions with carefully curated prompts. Adversarial users operate in a completely different distribution. A model aligned to refuse harmful requests when asked politely may not refuse when the request is wrapped in a roleplay scenario or a multi-step conversation that gradually escalates.

The fundamental asymmetry: pre-training generalizes broadly because it's trained on trillions of diverse tokens. Alignment is trained on a much narrower distribution and generalizes far less. Capability scales further than safety.

Where the fixes stand

The Princeton team didn't just diagnose the problem; they proposed a fix. By augmenting training data to deepen alignment beyond the first few tokens, they reduced prefilling attack success from 42.1% to 2.8%. That's meaningful progress. But it's a patch on a specific attack vector, not a solution to the fundamental problem that behavioral alignment is operating on a different timescale and depth than capability training.
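
The rough shape of that augmentation, as described in the paper: build training examples whose responses begin as if the model has already been pushed past its refusal, then recover into a refusal, so the "say no" behavior gets learned at deeper token positions. The sketch below is an illustration of the idea, not the authors' code; `tokenize` is an assumed helper returning a flat list of token IDs, and the harmful prefix is a placeholder.

```python
def make_recovery_example(prompt, harmful_prefix, refusal, tokenize):
    """Build one 'safety recovery' training example (sketch).

    The response starts with a few tokens of a non-compliant answer and then
    switches to a refusal, so fine-tuning teaches the model to refuse even
    after its first output tokens have gone off the rails.
    """
    prompt_ids = tokenize(prompt)
    prefix_ids = tokenize(harmful_prefix)
    refusal_ids = tokenize(refusal)

    input_ids = prompt_ids + prefix_ids + refusal_ids
    # Supervise only the recovery: ignore the prompt and the harmful prefix.
    labels = [-100] * (len(prompt_ids) + len(prefix_ids)) + refusal_ids
    return {"input_ids": input_ids, "labels": labels}
```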

The field is slowly coming to terms with the fact that RLHF as currently practiced is a behavioral shaping tool, not a values alignment method. It makes models commercially deployable. It does not make them deeply safe. The gap between those two things is where most of the interesting alignment research is happening now, and where the real risks remain.

Key Terms

RLHF
Reinforcement Learning from Human Feedback. A training method that optimizes language model outputs against a reward model trained on human preference rankings.
DPO
Direct Preference Optimization. A simpler alternative to RLHF that trains directly on preference data using a single classification loss, eliminating the need for a separate reward model.
Alignment Tax
The measurable degradation in a model's core capabilities (reading comprehension, translation, reasoning) caused by RLHF safety training.
Reward Hacking
When a model finds degenerate shortcuts to score well on the reward function without actually being more helpful or safe.

Frequently Asked Questions