Knowledge Distillation (Distillation)

Intermediate | Deep Learning

Last updated June 14, 2026

What is Knowledge Distillation in simple terms?

In simple terms, distillation is teaching a small model by having it learn from a big one — like a seasoned expert coaching a quick apprentice. The apprentice ends up nearly as good, but far cheaper.

What is Knowledge Distillation?

Knowledge distillation is a technique for training a small, fast AI model to copy the behavior of a large, capable one, so the smaller model captures much of the bigger model's skill while being cheaper and quicker to run.

Knowledge distillation, often just called distillation, is a way to transfer the ability of a large, powerful AI model into a much smaller one. The large model is called the teacher and the small one the student. The basic move is to run lots of examples through the teacher, record how it responds, and train the student to produce the same responses. A model as small as the student typically cannot reach that level when trained from scratch on raw data alone, but learning by imitating an expert turns out to be more effective. The result is a compact model that captures much of the teacher's skill at a fraction of the size and running cost.
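As a rough illustration of that basic move, here is a minimal sketch in plain Python. Everything in it is made up for the example: the "teacher" is a small fixed nonlinear function standing in for a large model, the "student" is a single linear layer, and the names teacher_probs, student_probs, imitation_loss, and train_student are hypothetical. Real systems use neural-network libraries and far larger models, but the two steps are the same: record the teacher's responses, then train the student to match them.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(x):
    """Hypothetical teacher: a fixed nonlinear scorer over 3 classes."""
    hidden = [math.tanh(3 * x[0]), math.tanh(3 * x[1]), math.tanh(x[0] - x[1])]
    logits = [2 * hidden[0] - hidden[2], 2 * hidden[1] + hidden[2], -hidden[0] - hidden[1]]
    return softmax(logits)

def student_probs(weights, x):
    """Student: one linear layer (two weights plus a bias per class)."""
    return softmax([w[0] * x[0] + w[1] * x[1] + w[2] for w in weights])

def imitation_loss(weights, inputs, targets):
    """Mean cross-entropy between teacher and student probabilities."""
    total = 0.0
    for x, t in zip(inputs, targets):
        p = student_probs(weights, x)
        total -= sum(ti * math.log(pi) for ti, pi in zip(t, p))
    return total / len(inputs)

def train_student(inputs, targets, steps=200, lr=0.5):
    """Full-batch gradient descent on the imitation loss."""
    weights = [[0.0, 0.0, 0.0] for _ in range(3)]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(3)]
        for x, t in zip(inputs, targets):
            p = student_probs(weights, x)
            for k in range(3):
                err = p[k] - t[k]  # gradient of cross-entropy w.r.t. logit k
                grads[k][0] += err * x[0]
                grads[k][1] += err * x[1]
                grads[k][2] += err
        for k in range(3):
            for j in range(3):
                weights[k][j] -= lr * grads[k][j] / len(inputs)
    return weights

rng = random.Random(0)
inputs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]

# Step 1: run the examples through the teacher and record its responses.
targets = [teacher_probs(x) for x in inputs]

# Step 2: train the student to reproduce those responses.
untrained = [[0.0, 0.0, 0.0] for _ in range(3)]
loss_before = imitation_loss(untrained, inputs, targets)
student = train_student(inputs, targets)
loss_after = imitation_loss(student, inputs, targets)
print(f"imitation loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The student here is deliberately simpler than the teacher, so it can get closer to the teacher's behavior without reproducing it exactly, which mirrors the usual outcome of distillation.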

The reason it works better than simply training a small model directly is subtle and worth unpacking. When a teacher model answers, it doesn't just give a final answer; it also reveals how confident it is across all the possibilities, and how close the runners-up were. Asked to identify an animal, a good teacher might say "90% dog, 8% wolf, 2% fox." Those proportions are quietly informative: they tell the student that dogs and wolves look alike while foxes are more distinct, a nuance the student would never get from a bare "dog" label. Distillation lets the student absorb that richer, between-the-lines knowledge, which is why a distilled model often beats a same-sized model trained the ordinary way. It's the difference between a student memorizing an answer key and one being shown how a master weighs every option.
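To make the dog/wolf/fox point concrete, the sketch below compares two hypothetical students that are equally confident in "dog" but disagree about the runner-up. The student probabilities and the function names are invented for illustration; the teacher's numbers are the ones from the example above. Cross-entropy against the bare label and KL divergence against the teacher's full distribution are used here as typical choices of loss, not as the only possible ones.

```python
import math

CLASSES = ["dog", "wolf", "fox"]
teacher = [0.90, 0.08, 0.02]     # the teacher's soft answer from the text
hard_label = [1.0, 0.0, 0.0]     # the bare "dog" label

# Two hypothetical students, equally confident in "dog",
# but disagreeing about which runner-up is more plausible.
student_a = [0.85, 0.12, 0.03]   # treats wolf as the closer look-alike
student_b = [0.85, 0.03, 0.12]   # treats fox as the closer look-alike

def cross_entropy(target, predicted):
    """Loss against a target distribution (here: the one-hot label)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def kl_divergence(target, predicted):
    """How far the predicted distribution is from the target distribution."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = kl_divergence(teacher, student_a)
soft_b = kl_divergence(teacher, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard label cannot tell the two students apart, because both give "dog" the same probability. The teacher's distribution can: student A, which agrees that wolf is the nearer alternative, receives the lower soft-label loss.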

Distillation has become a standard tool for making capable AI practical to deploy. It is how a giant, expensive model running in a data center can be shrunk into a lighter version that runs quickly on a phone, in a browser, or at lower cost on a server, while keeping most of its usefulness. It sits alongside two related shrinking techniques, quantization (which stores a model's numbers at lower precision) and pruning (which removes unneeded parts), and the three are often combined. The trade-offs are real: a distilled student rarely matches its teacher exactly, and it can only learn what the teacher already knows, so the teacher's blind spots and mistakes can be inherited along with its strengths.

Real-world example of Knowledge Distillation

A company offers a smart email-reply suggester, and the model that writes the best suggestions is large, slow, and costly to run for millions of users every minute. So the team uses a top-tier large model as a teacher: they feed it huge numbers of emails, record how it drafts each reply along with how it weighed the alternatives, and train a small student model to reproduce that behavior. The student that comes out is a fraction of the size, responds in a blink, and is cheap enough to run for everyone — yet writes replies nearly as good as its bulky teacher's, because it learned from the expert rather than from the raw inbox alone. That "coach a nimble student to match a heavyweight expert" move is precisely what distillation delivers.

Frequently asked questions about Knowledge Distillation

What is the difference between distillation and quantization?

Both make a model cheaper to run, but in different ways. Distillation trains a brand-new, smaller model (the student) to copy the behavior of a large one (the teacher), so you end up with a different, more compact model. Quantization keeps the same model but stores its internal numbers at lower precision, shrinking it without training a new network. One produces a new, smaller model by imitation; the other compresses an existing one by rounding its numbers. They are complementary, and along with pruning are frequently combined to make a model as small and fast as possible.
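For contrast with distillation, here is a minimal sketch of what "compressing by rounding" can look like. It assumes a simple scheme in which all weights share one scale factor and are rounded to integers between -127 and 127; the weights and the function names quantize_int8 and dequantize are made up for illustration, and production quantization methods are considerably more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.004, 0.5, -0.33]   # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("stored integers:", quantized)
print(f"largest rounding error: {max_error:.4f} (scale {scale:.4f})")
```

Note that no training happens here: the same model is kept, and each number is stored more coarsely, at the cost of a small rounding error per weight. Distillation, by comparison, trains a separate network.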

How does distillation work?

A large "teacher" model is run over many examples, and a small "student" model is trained to reproduce the teacher's outputs. Crucially, the student learns from more than the teacher's final answers: it learns how confident the teacher was across all the options, which encodes subtle relationships (that two categories look alike, say) the student would never get from plain labels. Absorbing that richer signal is why a distilled student often outperforms a same-sized model trained directly on raw data. The student ends up compact but skilled.
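One widely used way to turn this into a training objective is sketched below. It is an assumption layered on top of the description above rather than something this entry specifies: the teacher's and student's raw scores are softened with a "temperature" so that the runner-up options become more visible, and the resulting teacher-matching term is blended with an ordinary loss on the true label. The parameter names temperature and alpha, their default values, and the example scores are illustrative choices.

```python
import math

def softmax(logits, temperature=1.0):
    """Probabilities from raw scores; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a hard-label term and a soft, teacher-matching term."""
    # Hard term: ordinary cross-entropy against the true label (temperature 1).
    hard = -math.log(softmax(student_logits)[true_class])
    # Soft term: KL divergence between the softened teacher and student outputs.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    soft = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    # Multiplying by temperature**2 keeps the soft term's gradients on a
    # comparable scale when the temperature is changed.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

teacher_logits = [4.0, 1.6, 0.2]      # made-up scores for dog, wolf, fox
agreeing_student = [3.5, 1.5, 0.0]    # ranks the options like the teacher
confused_student = [3.5, 0.0, 1.5]    # swaps the two runners-up

print(distillation_loss(agreeing_student, teacher_logits, true_class=0))
print(distillation_loss(confused_student, teacher_logits, true_class=0))
```

Setting alpha to 1 reduces this to plain training on labels, while setting it to 0 trains purely on the teacher's softened outputs; values in between trade the two signals off against each other.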

What is distillation used for?

It is used to make powerful AI practical to deploy: shrinking a large, costly model into a lighter one that runs fast on phones, in browsers, or at lower cost on servers, while keeping most of its ability. It is widely used to produce the smaller, cheaper versions of large language models that companies serve at scale. Anywhere a big model is too slow or expensive for the job but you don't want to lose its skill, distillation is a leading option — often paired with quantization and pruning.