Quantization
A compression technique that reduces model precision from 32-bit floats to 8-bit or 4-bit integers, shrinking memory requirements and accelerating inference.
Quantization converts neural network weights from high-precision floating-point numbers to lower-precision integers, storing each group of weights as integers alongside a shared scale factor that maps them back to approximate real values. This dramatically reduces memory footprint and computational requirements. Common quantization levels include Q8 (8-bit) with minimal quality loss, Q4_K_M (4-bit) as the sweet spot for most deployments, and aggressive Q2_K for maximum compression. The technique enables running models on hardware that couldn't otherwise support them, though lower precision introduces measurable quality degradation that grows as bit width shrinks.
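The core idea can be illustrated with a minimal sketch of symmetric 8-bit quantization: divide each weight by a per-tensor scale, round to the nearest integer in [-127, 127], and multiply by the scale to recover an approximation. This is a simplified illustration, not any specific library's implementation; real schemes such as Q4_K_M use per-block scales and offsets.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: one shared scale maps the
    # largest-magnitude weight to +/-127, the INT8 range edge.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Reconstruct approximate floats; error is bounded by scale / 2.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 0.9]       # original 32-bit floats
q, scale = quantize_int8(weights)        # 8-bit integers + one scale
reconstructed = dequantize(q, scale)     # approximation of the originals
```

Storage drops from 4 bytes per weight to 1 byte plus a single shared scale; the rounding step is where the quality degradation described above comes from.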
Also known as
model quantization, weight quantization, INT8, INT4