Inference / Quantization

Quantization

Quantization stores a model's weights using fewer bits. Weights are numbers, kept in 16 bits by default. Drop them to 8 bits and the model halves in size. Drop to 4 bits and it halves again.

Size maps straight to speed and reach. A 70-billion-parameter model goes from 140 GB, which needs two GPUs, to 70 GB, which fits on one, to 35 GB, which fits with room to spare. Because decode is memory-bound, halving the bytes roughly doubles decode speed. Whatever memory you free becomes KV cache, which becomes more concurrent users.

The catch is precision. Fewer bits means coarser numbers, and at some point quality drops. In practice, 8-bit serving is nearly lossless and has become the default, helped by modern GPUs running 8-bit math natively. 4-bit costs measurable quality, though bigger models absorb it better, and it's the trick that gets a large model running on a laptop.

You already do this everywhere else. You gzip responses, minify bundles, and compress images, because the bytes crossing a slow boundary are the cost. Quantization is the same move applied to weights, and it works on the KV cache too.