A 7B parameter model shouldn't be able to solve competition-level math problems. Yet DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, outperforming QwQ-32B-Preview, a model four times its size. The explanation is knowledge distillation: a training technique that transfers capabilities from massive "teacher" models to compact "student" models.
This is why the local LLM movement exists at all. Without distillation, running frontier-quality reasoning on consumer hardware would remain fantasy. With it, a laptop can approximate what a 671B parameter model produces.
Learning to Think, Not Just Answer
Traditional supervised learning trains models on ground truth labels. Distillation does something different. The student learns to match the teacher's probability distributions, not just the correct answer.
That distinction matters because of what those distributions actually encode. When GPT-4 answers a math problem, it doesn't just output "42." It assigns probability scores across its entire vocabulary: maybe 89% for "42," 6% for "forty-two," 2% for "41." These probability distributions (called "soft labels") encode nuance that hard one-hot labels can't capture.
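The soft-label objective is concrete: minimize the divergence between the teacher's and student's output distributions. A minimal sketch, using a hypothetical 3-token vocabulary and made-up logits (real distillation does this over the full vocabulary at every token position):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p_teacher, q_student, eps=1e-12):
    """KL(teacher || student): the soft-label training signal the
    student minimizes, instead of cross-entropy on a hard label."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(p_teacher, q_student) if p > 0)

# Hypothetical vocabulary slice: ["42", "forty-two", "41"]
teacher_logits = [5.0, 2.3, 1.2]
student_logits = [4.0, 1.0, 3.0]

p = softmax(teacher_logits)
q = softmax(student_logits)
loss = kl_divergence(p, q)  # gradient descent pushes q toward p
```

Note that the loss is zero only when the student reproduces the teacher's full distribution, including the 6% it assigns to "forty-two", not merely its top pick.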
IBM's explanation frames knowledge as "a learned mapping from input vectors to output vectors." The student isn't copying parameters; it's learning to replicate the teacher's decision-making process.
A key insight: the architectures don't even need to match. An ensemble of hundreds of models can distill down to a single compact network.
Temperature is the control knob. Higher values "soften" the teacher's output distributions, making subtle reasoning patterns more visible to the student. Lower temperatures sharpen outputs toward the most likely answer. Tuning this hyperparameter determines whether you're transferring memorized answers or genuine reasoning capability.
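The softening effect is easy to see numerically. A sketch with the same kind of made-up logits as above, dividing by the temperature before the softmax:

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T.
    T > 1 flattens the distribution; T < 1 sharpens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.3, 1.2]                      # hypothetical teacher logits
sharp = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=4.0)
# At T=4 the top token loses probability mass and the tail gains it,
# exposing the teacher's "second choices" to the student.
```

The tail probabilities are exactly the nuance soft labels carry; at low temperature they collapse toward zero and the student sees little more than a hard label.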
Early distillation work focused on classification tasks. Modern LLM distillation is fundamentally different: it transfers chain-of-thought reasoning. DeepSeek's approach is instructive. They generated about 800,000 reasoning-centric samples from their 671B R1 teacher, each containing step-by-step solutions. The distilled models don't just produce correct final answers; they've learned to "think out loud" through problems the same way their teacher does.
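In practice, each of those samples becomes an ordinary supervised fine-tuning record pairing a problem with the teacher's full trace. A minimal sketch of what one record might look like; the field names and `<think>` delimiter are illustrative, not DeepSeek's actual format:

```python
def make_sft_example(problem: str, teacher_trace: str, final_answer: str) -> dict:
    """Pack a teacher-generated chain of thought into one fine-tuning record.
    The student is trained to reproduce the full trace, not just the answer."""
    return {
        "prompt": problem,
        "completion": f"<think>\n{teacher_trace}\n</think>\n{final_answer}",
    }

example = make_sft_example(
    problem="What is 17 * 24?",
    teacher_trace="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408",
    final_answer="408",
)
```

Because the trace sits inside the training target, gradient updates reward every intermediate step, which is how the "thinking out loud" behavior transfers.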
Fireworks AI's research quantifies why this matters. Synthetic reasoning traces from R1 achieved 87% accuracy on GSM8K, compared to 68% for human expert chain-of-thought annotations. R1's reasoning chains average 2,000 characters versus 280 for human experts. The teacher's verbose, step-by-step thinking translates into measurably better student performance.
This explains a counterintuitive result: distillation from a capable teacher can outperform training on human-annotated data.
Compression Has Costs
Distillation isn't magic.
DeepSeek's distilled variants span 1.5B to 70B parameters. The 1.5B model runs 16× faster than the teacher, but sacrifices about 50 percentage points on formal logic tasks compared to the 671B R1. Certain reasoning pathways simply don't survive compression to very small parameter counts. Mathematical reasoning appears more compressible than formal logical deduction; a 2025 analysis in Artificial Intelligence Review reports distilled models achieving roughly 95% of teacher performance on standard benchmarks like GLUE, SuperGLUE, and MMLU. That's impressive retention, but "95% of frontier performance" means different things depending on the task.
Quantization complements distillation. Distillation reduces parameter count; quantization reduces numerical precision (from 32-bit to 8-bit or 4-bit representations). Together, they make laptop deployment feasible. The DeepSeek distilled models running locally are typically both distilled and quantized.
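The precision reduction itself is mechanically simple. A sketch of symmetric 8-bit quantization on a toy weight list (real deployments quantize per-tensor or per-channel with calibrated scales, but the idea is the same):

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: store one float scale plus
    int8 values in place of full-precision weights."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights at inference time."""
    return [v * scale for v in q]

w = [0.82, -1.27, 0.003, 0.5]   # toy float32 weights
q, s = quantize_int8(w)
w_approx = dequantize(q, s)      # close to w, at a quarter the storage
```

Each weight costs one byte instead of four, at the price of a rounding error bounded by half the scale; distillation shrinks the number of weights, quantization shrinks the bytes per weight.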
What Access Gets You
Distillation quality depends heavily on what you can see inside the teacher.
White-box distillation means access to the teacher's internals: attention weights, hidden states, intermediate layer activations. You can align the student's internal representations with the teacher's, not just match final outputs. This enables techniques like layer-by-layer distillation and attention-head matching.
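A layer-alignment objective can be as simple as a mean-squared error between matched intermediate activations. A minimal sketch, assuming the student and teacher layers have been paired up and share a width (in practice a learned projection handles mismatched widths):

```python
def hidden_state_loss(student_h, teacher_h):
    """White-box objective: pull the student's activations at one layer
    toward the teacher's (MSE, assuming equal hidden widths)."""
    n = len(student_h)
    return sum((s - t) ** 2 for s, t in zip(student_h, teacher_h)) / n

def layerwise_loss(student_layers, teacher_layers):
    """Sum the alignment loss over matched layer pairs."""
    return sum(hidden_state_loss(s, t)
               for s, t in zip(student_layers, teacher_layers))
```

This term is added to the output-matching loss, so the student mimics not only what the teacher says but the intermediate representations it builds along the way.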
Black-box distillation works from API outputs only. You can prompt the teacher, collect responses, and train on those. Modern techniques use prompt engineering to elicit detailed reasoning chains, then fine-tune students on those traces. The arXiv survey notes this is how many open-source LLMs bootstrap from proprietary models like GPT-4.
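The black-box pipeline reduces to a collection loop: prompt the teacher to reason step by step, record what comes back, fine-tune on the result. A minimal sketch, where `query_teacher` stands in for whatever API client you use and the elicitation prompt is illustrative:

```python
def collect_traces(problems, query_teacher):
    """Build a fine-tuning dataset from API outputs alone.
    No access to the teacher's logits or internals is assumed."""
    dataset = []
    for problem in problems:
        prompt = f"Solve step by step, then state the final answer.\n\n{problem}"
        response = query_teacher(prompt)  # the API's text is all we can see
        dataset.append({"prompt": problem, "completion": response})
    return dataset
```

The student then trains on `dataset` with ordinary supervised fine-tuning; since only text crosses the API boundary, the reasoning chain has to carry all the signal, which is why eliciting detailed traces matters so much.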
Our read: black-box distillation has become surprisingly effective for reasoning transfer. The reasoning chain itself, generated through careful prompting, carries most of what the student needs to learn.
The catch: distillation transfers reasoning capability, but safety alignment often doesn't follow. Standard distillation optimizes for task accuracy, not refusal of harmful prompts. The Emergent Mind analysis of DeepSeek's distilled models reports refusal rates dropping from 81% on the teacher to 26.5% on distilled variants when tested on unsafe prompts.
This isn't surprising. Safety behaviors are typically trained through RLHF or constitutional AI methods, applied after base capability training. Distillation (particularly black-box distillation) primarily captures the capability surface. The safety layer doesn't automatically transfer. For developers using distilled models: assume the safety guardrails from the teacher aren't present. Plan accordingly.
Running Frontier Reasoning Locally
The pipeline is now well-established: frontier labs train massive models with expensive RL pipelines; researchers distill those capabilities into smaller architectures; the community fine-tunes and quantizes for local deployment.
DeepSeek's distilled R1 variants demonstrate this at scale. A 32B distilled model achieves 72.6% on AIME 2024, outperforming OpenAI's o1-mini. Running locally on prosumer hardware, it delivers reasoning quality that would have required API access to frontier models just months ago.
The 2024 arXiv survey identifies emerging techniques: multi-teacher distillation (learning from several models simultaneously), dynamic co-evolution (teacher and student improving together), and task-specific vertical distillation for domains like medicine and law.
You don't need a 671B model running in production. You can distill capabilities from models you can't afford to run into models you can. The intelligence gap between frontier and local is narrowing, and distillation is the mechanism.