Quantization

IntermediateInfrastructure

Last updated June 14, 2026

What is Quantization in simple terms?

In simple terms, quantization is rounding a model's numbers to make it smaller and faster — like writing prices as whole dollars instead of exact cents. You lose a little precision but gain speed.

What is Quantization?

Quantization is a technique for shrinking an AI model by storing its internal numbers at lower precision — for example, using whole numbers instead of long decimals — so the model takes up less memory and runs faster, with only a small loss of accuracy.

An AI model is, under the hood, a vast collection of numbers — the weights it learned during training. By default each of these numbers is stored in high precision, typically taking 32 bits of space, which lets it hold many decimal places. That precision is useful while the model is being trained, but it makes the finished model large and slow: more memory to store, more work to compute with. Quantization is the practice of storing those numbers more coarsely — for instance, as 8-bit whole numbers instead of 32-bit decimals. The everyday version of the idea is rounding: writing 3.14159 as just 3.1 loses a little detail but is far quicker to read and write.

The reason this works without wrecking the model is that AI models are surprisingly tolerant of imprecision. The exact final digits of each weight rarely matter to the overall answer, so trimming them away usually costs only a sliver of accuracy while delivering large savings — a quantized model can take up a quarter of the memory and run several times faster. There are two common ways to do it. The simpler is to quantize a model after it's already trained, which is quick and needs no retraining. The more careful approach builds the rounding into training itself, so the model learns to compensate for the coarser numbers and ends up more accurate at the smaller size. Which one you choose is a trade-off between effort and how much accuracy you're willing to give up.

Quantization matters most when you want to run a model somewhere with limited resources. It is the key technique behind getting capable AI to run on a phone, a laptop, a smartwatch, or a cheap server rather than only in a data center stacked with expensive accelerators. It also cuts the cost and energy of running large models at scale. It is closely related to other model-shrinking techniques — pruning, which removes unneeded parts of a model, and distillation, which trains a smaller model to copy a bigger one — and the three are often combined. The honest catch is that quantize too aggressively and accuracy starts to degrade noticeably, so there is always a balance to strike between size, speed, and quality.

Real-world example of Quantization

A developer has built a voice-note transcription app and wants it to work offline, on the phone itself, so recordings never leave the device. The speech model that does the transcription was trained in full 32-bit precision and is far too big and slow to run smoothly on a handset. So before shipping, the developer quantizes it down to 8-bit numbers. The model shrinks to a fraction of its size, runs fast enough to transcribe in near real time on the phone's modest chip, and the transcriptions come out almost as accurate as before — the occasional extra slip is a fair price for a model that fits in your pocket and needs no internet. That trade of a touch of precision for a model light enough to run locally is exactly what quantization is for.

Related terms

Frequently asked questions about Quantization

What is the difference between quantization and pruning?

Both shrink a model, but they remove different things. Quantization keeps every part of the model but stores each internal number at lower precision — coarser numbers, smaller file. Pruning instead deletes parts of the model entirely, such as connections or whole units that contribute little. One reduces the precision of what stays; the other reduces how much there is. They aren't rivals — they're often applied together, alongside distillation, to make a model as small and fast as possible while keeping accuracy acceptable.

How does quantization work?

It converts a model's numbers from a high-precision format, such as 32-bit decimals, into a lower-precision one, such as 8-bit whole numbers — essentially rounding them. Because a model's answers don't depend on the last few digits of each weight, this trims memory and speeds up computation while losing only a little accuracy. It can be done after training (quick, no retraining needed) or built into training itself, where the model learns to compensate for the coarser numbers and so holds onto more of its accuracy at the smaller size.

What is quantization used for?

It is used to make trained models cheaper, faster, and smaller to run — most visibly to fit capable AI onto phones, laptops, wearables, and other limited hardware, and to run it offline without a data center. At larger scale, it cuts the memory, cost, and energy of serving big models to many users. Anywhere a model needs to run within tight constraints on memory, speed, power, or cost, quantization is one of the first techniques reached for, often combined with pruning and distillation.