Quantization
A compression technique that reduces model precision from 32-bit floats to 8-bit or 4-bit integers, shrinking memory requirements and accelerating inference.
Quantization converts neural network weights from high-precision floating-point numbers to lower-precision integers, storing each group of weights as integers alongside a shared scale factor that maps them back to approximate real values. This dramatically reduces memory footprint and computational requirements. Common quantization levels include Q8 (8-bit) with minimal quality loss, Q4_K_M (4-bit) as the sweet spot for most deployments, and aggressive Q2_K for maximum compression. The technique enables running models on hardware that couldn't otherwise support them, though lower precision introduces measurable quality degradation that grows as bit width shrinks.
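The core idea can be illustrated with a minimal sketch of symmetric 8-bit quantization: divide each weight by a per-tensor scale, round to the nearest integer in [-127, 127], and multiply by the scale to recover an approximation. This is a simplified illustration, not any specific library's implementation; real schemes such as Q4_K_M use per-block scales and offsets.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: one shared scale maps the
    # largest-magnitude weight to +/-127, the INT8 range edge.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Reconstruct approximate floats; error is bounded by scale / 2.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 0.9]       # original 32-bit floats
q, scale = quantize_int8(weights)        # 8-bit integers + one scale
reconstructed = dequantize(q, scale)     # approximation of the originals
```

Storage drops from 4 bytes per weight to 1 byte plus a single shared scale; the rounding step is where the quality degradation described above comes from.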
Also known as
model quantization, weight quantization, INT8, INT4