Distillation

Distillation transfers what a big model knows into a smaller one. The large model is the teacher, the small one is the student, and the student learns by imitating the teacher's outputs rather than learning from raw data alone.

What gets copied is richer than you'd think. The teacher doesn't just say "the next word is pass." It gives a full distribution: pass 38%, drive 27%, shoot 18%, and so on. Those runner-up probabilities carry real information about what's almost right and what's terrible. Matching the whole shape teaches the student good judgment, not just answers. Geoffrey Hinton called the signal hiding in those probabilities dark knowledge.
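Here is a minimal sketch of what a hard label throws away, using the numbers from the example above. The "other" bucket is an assumption that stands in for the unlisted 17% so the distribution sums to 1, and the helper names are made up for illustration.

```python
# Teacher's next-word distribution from the example above.
# "other" is a hypothetical bucket for the remaining 17% so the total is 1.
teacher = {"pass": 0.38, "drive": 0.27, "shoot": 0.18, "other": 0.17}


def hard_label(dist):
    """Collapse a distribution to a one-hot label on its top choice."""
    top = max(dist, key=dist.get)
    return {word: (1.0 if word == top else 0.0) for word in dist}


def ranking(dist):
    """Return the words ordered from most to least likely."""
    return sorted(dist, key=dist.get, reverse=True)


hard = hard_label(teacher)

# The hard label only says "pass"; every runner-up is flattened to zero.
print("hard label: ", hard)

# The soft target keeps the ordering of the runner-ups...
print("soft ranking:", ranking(teacher))

# ...and how much better one runner-up is than another.
drive_vs_shoot = teacher["drive"] / teacher["shoot"]
print("drive is about", round(drive_vs_shoot, 2), "times as likely as shoot")
```

A student trained only on the hard label would be told that drive and shoot are equally wrong. The soft target is what tells it otherwise.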

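To measure how close the student is to the teacher, you need a way to score the gap between two probability distributions. The usual choice is KL divergence. It is zero when the two match exactly and positive otherwise, and training shrinks it. Strictly speaking it isn't a true distance, because it isn't symmetric: the score from teacher to student generally differs from the score from student to teacher.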
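As a rough sketch, KL divergence can be computed by hand in a few lines. The teacher numbers come from the example above, with a hypothetical "other" bucket for the unlisted 17%. Both student distributions are invented for illustration, and measuring from teacher to student is one common direction, assumed here rather than taken from the text.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats: sum over words of p * log(p / q).

    Assumes p and q share the same keys and q[k] > 0 wherever p[k] > 0.
    """
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


# Teacher distribution; "other" is a hypothetical catch-all bucket.
teacher = {"pass": 0.38, "drive": 0.27, "shoot": 0.18, "other": 0.17}

# Two made-up students. Both pick "pass" as their top word.
student_close = {"pass": 0.35, "drive": 0.30, "shoot": 0.20, "other": 0.15}
student_off = {"pass": 0.60, "drive": 0.05, "shoot": 0.05, "other": 0.30}

kl_same = kl_divergence(teacher, teacher)         # identical distributions
kl_close = kl_divergence(teacher, student_close)  # similar shape
kl_off = kl_divergence(teacher, student_off)      # right answer, wrong shape

print("teacher vs itself:       ", kl_same)
print("teacher vs close student:", kl_close)
print("teacher vs off student:  ", kl_off)

# KL is not symmetric: swapping the arguments gives a different number.
print("off student vs teacher:  ", kl_divergence(student_off, teacher))
```

Both students get the top word right, so a hard-label check can't tell them apart. The divergence can: it is much larger for the student whose runner-up probabilities don't resemble the teacher's.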

The most common use is compression: keep most of a large model's ability at a fraction of the cost, which is how much of a 70B model's smarts can end up running on a laptop. There, you're trading a little quality for a lot of efficiency. But the teacher-student pattern is flexible. A model can learn from a peer, or act as its own teacher by learning from its own earlier outputs, which is self-distillation. In those cases the goal is a better model, not a smaller one.