Knowledge Distillation

A training technique where a smaller student model learns to replicate the outputs of a larger teacher model, transferring capabilities at reduced computational cost.

Knowledge distillation trains a compact student model to match the behavior of a larger, more capable teacher model. Rather than being trained from scratch on raw data, the student learns from the teacher's output distributions (soft labels) or from samples the teacher generates. DeepSeek used distillation to create smaller R1 variants (1.5B to 70B parameters) by fine-tuning on 800,000 samples generated by the full R1 model. The distilled 32B model achieves 72.6% on AIME 2024, demonstrating that reasoning capabilities transfer efficiently through this process.
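
The sketch below shows the classic logit-matching form of distillation, where the student is trained to match the teacher's softened output distribution via a KL-divergence term blended with the usual cross-entropy loss. It is a minimal PyTorch illustration under assumed hyperparameters (temperature, loss weight alpha); it is not DeepSeek's actual recipe, which instead fine-tuned students directly on teacher-generated samples.

```python
# Minimal sketch of logit-based knowledge distillation (assumed setup,
# not DeepSeek's pipeline). Temperature and alpha are illustrative.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL loss (teacher) with hard-label cross-entropy."""
    # Soften both distributions with the temperature before comparing them.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence from teacher to student; scaling by T^2 keeps gradient
    # magnitudes comparable to the hard-label term.
    kd_loss = F.kl_div(soft_student, soft_targets,
                       reduction="batchmean") * (temperature ** 2)
    # Standard cross-entropy against the ground-truth labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss


# Toy usage: a batch of 4 examples with 10 output classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In the sample-based variant used for the R1 distillations, the teacher's generated outputs simply serve as supervised fine-tuning targets for the student, so the training loop reduces to ordinary cross-entropy on that synthetic dataset.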

Also known as

distillation, model distillation, teacher-student training