The Art of Compressing LLMs: Pruning, Distillation, and Quantization

NVIDIA Deep Learning Institute

Paid · Intermediate · 8 hours · Self-paced · Coding required

Last updated June 18, 2026

The Art of Compressing LLMs is an eight-hour, hands-on course from NVIDIA's Deep Learning Institute on making large language models smaller and cheaper to run without giving up much of their performance. It teaches three core techniques: pruning (removing parts of the network), knowledge distillation (training a smaller "student" model to imitate a larger "teacher"), and quantization (storing the model's numbers at lower precision). It also teaches how to weigh the trade-offs each one forces among accuracy, speed, cost, and hardware. Working in Python and PyTorch with tools from NVIDIA's stack, you take a model through the full pipeline from compression to deployment and benchmarking. The course ends with an assessment in which you apply the techniques to a new dataset and must reach a target accuracy. Overall, it is a focused, engineering-minded course on efficient model deployment.
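To make "storing the model's numbers at lower precision" concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration under stated assumptions, not the course's own material: the function names and the sample weights are hypothetical, and real toolchains typically work per channel or per block on tensors rather than on a short list.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole list (assumption: per-tensor scaling).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.88, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value now fits in one byte instead of four (for 32-bit floats), at the cost of a small rounding error. That is the accuracy-versus-memory trade-off in miniature: the smallest weight here is rounded to zero.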

What you'll learn

  • Choosing the right compression method for a problem and its constraints
  • Applying structured pruning to reduce model complexity
  • Using knowledge distillation to transfer knowledge from a larger model to a smaller one
  • Applying quantization, then benchmarking and deploying the compressed model
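As a rough illustration of what structured pruning means, the sketch below removes whole rows (output neurons) from a small weight matrix rather than zeroing individual values. It assumes an L2-norm importance score, which is one common choice and not necessarily the criterion the course uses. The function name, `keep_ratio` parameter, and sample matrix are hypothetical.

```python
import math


def prune_rows(weight, keep_ratio):
    """Structured pruning: drop whole rows (output neurons) with the smallest L2 norm."""
    n_keep = max(1, round(len(weight) * keep_ratio))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    # Pick the n_keep largest-norm rows, then restore their original order.
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [weight[i] for i in keep], keep


# Hypothetical 4x3 layer: rows 1 and 3 contribute very little.
weight = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [-0.5, 0.6, 0.4],
    [0.03, -0.02, 0.01],
]
pruned, kept = prune_rows(weight, keep_ratio=0.5)
print(kept, pruned)
```

Because entire rows disappear, the layer becomes a genuinely smaller dense matrix, which standard hardware can run faster without sparse kernels. In a real network the matching input columns of the next layer would have to be removed as well, and the model is usually fine-tuned or distilled afterwards to recover accuracy.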

Frequently asked questions about The Art of Compressing LLMs

Who is The Art of Compressing LLMs for?

Developers familiar with Python, PyTorch, and transformer-based LLMs who want practical model-compression and deployment skills.

Is The Art of Compressing LLMs free?

No. The Art of Compressing LLMs is a paid course.

What are the prerequisites for The Art of Compressing LLMs?

Familiarity with Python and PyTorch, conceptual knowledge of deep learning, and familiarity with LLM architectures including transformers, attention mechanisms, and feed-forward networks.

Do you need to code for The Art of Compressing LLMs?

Yes. The Art of Compressing LLMs involves hands-on coding.

Why we suggest this course

This course suits developers who can build models but now need to run them affordably. It teaches the skills behind fitting a capable LLM into real latency, memory, and cost budgets. The distinct takeaway is hands-on practice combining pruning, distillation, and quantization, and judging the trade-offs each one demands.
