Quantization
Quantization is a representative model compression technique that reduces a model's size and computational cost by lowering its numerical precision.
Quantization lowers the precision of the numbers that make up an AI model, such as its weights, which reduces both its storage size and the amount of computation it requires. Much as slightly lowering a photo's quality can drastically shrink its file size, a value that was stored with many digits of precision is approximated by a simpler, lower-precision number.
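The approximation described above can be sketched in a few lines. This is a minimal illustration of one common scheme (symmetric 8-bit quantization with a single shared scale), not the method of any particular library; the function names and the sample weights are made up for the example.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.8125, -0.3071, 0.0042, -1.2700, 0.5555]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 4) for r in restored])
print("max error:", max_error)
```

Each weight is now stored as a small integer plus one shared scale factor. The restored values are close to the originals but not identical, and the rounding error of each value is at most half of one scale step.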
It is widely used to run high-performance models on consumer GPUs, laptops, and smartphones. In particular, as running open-weight models on personal computers has become common practice, downloading and using pre-quantized model files has become a de facto standard.
How much quality is lost depends on how far precision is reduced, and overly aggressive compression can degrade quality in subtle reasoning or long-context processing. Various compensation techniques are being developed to reduce these losses.
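The link between precision and loss can be made concrete with a small experiment. The sketch below, which assumes a simple symmetric rounding scheme and uses synthetic values rather than real model weights, measures how the round-trip error grows as the bit width shrinks. Numerical error on individual values is only a rough proxy; it does not directly predict answer quality.

```python
import math


def roundtrip_error(values, bits):
    """Quantize to signed `bits`-bit integers, restore, and return the mean absolute error."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    restored = [
        max(-levels, min(levels, round(v / scale))) * scale for v in values
    ]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Synthetic stand-in for model weights (an assumption for illustration).
values = [math.sin(i) for i in range(200)]
errors = {bits: roundtrip_error(values, bits) for bits in (8, 4, 2)}

for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.5f}")
```

On this data the error rises at each step from 8 to 4 to 2 bits, which mirrors the trade-off described above: every reduction in precision saves space but moves the stored values further from the originals.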
✅ Why it matters
- Significantly reduces model size and memory usage
- Allows large models to run on personal computers without expensive hardware
- Reduces the amount of computation, improving response speed and power efficiency
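The size reduction follows from simple arithmetic: weight storage is roughly the number of parameters times the bits per weight. The sketch below applies that to a hypothetical 7-billion-parameter model; the parameter count is an assumption for illustration, and the estimate ignores overhead such as scale factors and any memory used at run time beyond the weights.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7_000_000_000  # hypothetical model size, chosen for illustration

sizes = {bits: weight_size_gb(NUM_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    print(f"{bits}-bit weights: about {gb:.1f} GB")
```

Under these assumptions the weights take about 14 GB at 16 bits, 7 GB at 8 bits, and 3.5 GB at 4 bits, which is the difference between needing dedicated hardware and fitting in the memory of an ordinary laptop.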
⚠️ Limits and debates
- As the compression level increases, answer quality gradually deteriorates
- Performance degradation may appear only in certain tasks, making it difficult to detect in advance
- Results vary widely depending on the combination of model and technique, so verification is required