Quantization

Quantization is a widely used model compression technique that reduces a model's size and computational cost by lowering its numerical precision.

Quantization shrinks an AI model's storage footprint and the amount of computation it needs by lowering the precision of the numbers that make up the model. Much as slightly lowering a photo's quality can drastically reduce its file size, a value that was previously stored with many decimal places is approximated by a simpler, lower-precision number.
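The approximation described above can be sketched in a few lines. This is a minimal illustration of symmetric 8-bit quantization in plain Python, assuming one shared scale factor for the whole list of values. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical; real toolchains typically use per-channel or per-group scales and more elaborate schemes.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.8132, -0.4471, 0.0029, -1.2704, 0.3315]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding can shift each value by at most half a step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each value is stored as a small integer plus one shared scale, so the saving comes from replacing wide floating-point numbers with narrow integers. The cost is the rounding error captured in `max_error`.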

It is widely used to make high-performance models run on consumer GPUs, laptops, and smartphones. In particular, as running open-weight models on personal computers has become common practice, downloading and using pre-quantized model files has become the de facto standard.

The quality loss depends on how far precision is reduced, and overly aggressive compression can degrade quality in subtle reasoning or long-context processing. Various compensation techniques are being developed to reduce these losses.
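The link between bit width and loss can be illustrated with a small experiment. This is a rough sketch, assuming the same kind of symmetric rounding with a single shared scale. The helper `roundtrip_error` and the sample values are hypothetical, and the error measured here is numerical round-trip error on raw values, not a measure of model output quality.

```python
def roundtrip_error(values, bits):
    """Mean absolute error after quantizing to `bits` bits and restoring."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    restored = [
        max(-levels, min(levels, round(v / scale))) * scale for v in values
    ]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Hypothetical sample values, evenly spread over [-1, 1].
sample = [i / 50 - 1 for i in range(101)]

# Fewer bits means fewer representable levels and a larger average error.
errors = {bits: roundtrip_error(sample, bits) for bits in (8, 4, 2)}
```

On this sample the average error grows as the bit width shrinks, which is the trade-off the compensation techniques mentioned above try to soften.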

✅ Why it matters

⚠️ Limits and debates
