Quantization: How AI Models Shrink to Fit Your Phone

A 7 billion parameter model in 16-bit precision needs 14GB of memory just for the weights. The same model quantized to 4 bits fits in 3.5GB. That difference determines whether a model runs on your phone or requires a datacenter.


Quantization is the compression technique that makes this possible. It converts high-precision floating-point numbers into lower-precision integers, trading some accuracy for dramatically smaller memory footprints and faster inference. In 2026, the deployment pattern is consistent: train at 16-bit, deploy at 4-bit.

For a 1B parameter model, that means shrinking from 2GB to 500MB.

The trade-off affects everything from battery life to which devices can run AI locally.

From Floats to Integers

Quantization works by mapping continuous floating-point values to a discrete set of integers through scale factors. Take a weight value of 0.0372 stored as a 16-bit float. Quantization compresses it into an 8-bit or 4-bit integer by dividing by a scale factor, rounding, and storing the result. To use the weight, you reverse the process: multiply the stored integer by the scale to recover an approximation of the original value.
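A minimal NumPy sketch of that round trip (per-tensor symmetric quantization; function names are illustrative, not from any particular library):

```python
import numpy as np

def quantize(w, bits=8):
    """Per-tensor symmetric quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1           # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Multiply back by the scale to recover an approximation."""
    return q.astype(np.float32) * scale

w = np.array([0.0372, -0.91, 0.44, 1.2], dtype=np.float32)
q, scale = quantize(w, bits=8)
w_hat = dequantize(q, scale)             # each value off by at most scale / 2
```

The worst-case error per weight is half the scale, which is why a single large value in the tensor (a big `scale`) degrades every other weight.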

This creates precision loss. A 4-bit integer can only represent 16 distinct values. A 16-bit float can represent roughly 65,000. Every quantized weight is an approximation of the original.

Symmetric quantization maps values around zero and skips storing an offset, reducing computational overhead during inference. Asymmetric quantization adds a zero-point parameter, representing the range more precisely but requiring extra computation. Symmetric has won in production because the offset computation adds latency on every forward pass.
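Both schemes can be sketched side by side (illustrative names; 8-bit for readability). The extra subtraction in the asymmetric dequantization path is the latency cost described above:

```python
import numpy as np

def quant_symmetric(x, bits=8):
    # Centered on zero: store only a scale, no offset.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale), scale

def quant_asymmetric(x, bits=8):
    # Map [min, max] onto [0, 2^bits - 1]: a tighter grid, but a
    # zero-point must be stored and subtracted on every use.
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q, scale, zero_point

x = np.array([0.1, 0.4, 0.9, 2.0])       # skewed, all-positive range
q_s, s_s = quant_symmetric(x)
q_a, s_a, zp = quant_asymmetric(x)
x_sym = q_s * s_s                        # one multiply
x_asym = (q_a - zp) * s_a                # extra subtraction per value
```

On this skewed range the asymmetric grid is roughly twice as fine, because the symmetric scheme wastes half its levels on negative values that never occur.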

The granularity matters too. Per-tensor quantization uses one scale factor for an entire weight matrix; fast but imprecise. Per-channel quantization uses separate scales for each output channel; more accurate but more overhead. Per-block quantization splits tensors into small blocks with individual scales, striking a balance that works well for the irregular weight distributions in large language models.
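A sketch of the per-block variant (a block size of 32 is a common choice in practice; the function names are illustrative):

```python
import numpy as np

def quant_per_block(x, bits=4, block=32):
    """One scale per block of consecutive weights instead of per tensor."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0            # guard all-zero blocks
    q = np.round(blocks / scales)
    return q, scales

def dequant_per_block(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=256).astype(np.float32)
x[0] = 40.0                              # one outlier poisons only its block
q, scales = quant_per_block(x)
x_hat = dequant_per_block(q, scales)
```

The outlier inflates the scale of its own 32-weight block; the other blocks keep small scales and stay accurate, which is exactly the robustness to irregular weight distributions the paragraph above describes.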

This works fine until you hit models above 6.7B parameters. Then the math breaks in predictable ways.

Why Naive Quantization Breaks Large Models

Small models quantize easily. Apply a uniform scale factor, round to 4 bits, and accuracy loss stays under a few percent.

Large language models break this approach.

The problem emerges around the 6.7B parameter threshold. Above this scale, models develop outlier features, with a few channels carrying values orders of magnitude larger than the rest. A weight might be 0.01 while its neighbor is 47.3. When you compress this range into 4 bits, the 0.01 becomes indistinguishable from zero. Multiply that loss across millions of parameters and model quality collapses.
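The collapse is easy to reproduce with the numbers above:

```python
import numpy as np

w = np.array([0.01, 0.02, -0.015, 47.3], dtype=np.float32)
qmax = 7                          # signed 4-bit integers span [-8, 7]
scale = np.abs(w).max() / qmax    # ~6.76: the step size, dictated by the outlier
q = np.round(w / scale)
w_hat = q * scale
# The outlier survives; every small weight rounds to exactly zero.
```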

The outlier problem explains why a 3B model might quantize gracefully while a 13B model of the same architecture loses coherence.

SmoothQuant attacks the activation side. Rather than fighting extreme activation values during quantization, it mathematically migrates the difficulty from activations to weights through per-channel scaling. Weights are static and can tolerate adjustment; activations are dynamic and unpredictable. By smoothing the activation distribution before quantization, the technique avoids the outlier problem entirely.
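The core trick is a per-channel rescaling that leaves the layer's output mathematically unchanged. A minimal sketch following the published SmoothQuant scaling formula (`alpha` controls how much difficulty migrates; variable names are mine):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Divide activation channels by s, multiply matching weight rows by s.

    X @ W is mathematically unchanged, but X's outlier channels shrink
    while W absorbs the difficulty it can tolerate.
    """
    act_max = np.abs(X).max(axis=0)          # per-channel activation range
    w_max = np.abs(W).max(axis=1)            # per-input-channel weight range
    s = act_max ** alpha / w_max ** (1 - alpha)
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 0] *= 50.0                              # one outlier activation channel
W = rng.normal(size=(8, 4))
X_s, W_s, s = smooth(X, W)
```

Because the scaling can be folded into the previous layer's weights offline, it adds no inference cost.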

AWQ (Activation-aware Weight Quantization) flips the strategy. It identifies which weight channels produce extreme activations, then protects those channels during quantization. Only 1% of weights might be critical, but preserving them prevents the cascading quality loss. It's surgical rather than systematic.
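The saliency observation can be illustrated with the simplest possible variant: keep the top channels, ranked by activation magnitude, in full precision and quantize the rest. Note this mixed-precision sketch is only the motivating observation; AWQ itself protects salient channels via scaling rather than by skipping them:

```python
import numpy as np

def quant_protect_salient(W, act_max, bits=4, keep_frac=0.01):
    """Quantize per input channel, skipping the most salient ones."""
    qmax = 2 ** (bits - 1) - 1
    n_keep = max(1, int(round(W.shape[0] * keep_frac)))
    salient = set(np.argsort(act_max)[-n_keep:])  # biggest activations win
    W_hat = W.copy()
    for i in range(W.shape[0]):
        if i in salient:
            continue                              # protected: full precision
        scale = np.abs(W[i]).max() / qmax
        if scale > 0:
            W_hat[i] = np.round(W[i] / scale) * scale
    return W_hat, salient

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 64))
act_max = rng.uniform(1, 2, size=100)
act_max[7] = 500.0                                # one salient channel
W_hat, salient = quant_protect_salient(W, act_max)
```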

SpinQuant takes the most aggressive approach: learning rotation matrices that reshape weight distributions before quantization, then reversing the rotation during inference. Meta's research shows it achieving W4A4KV4 quantization with under 3% accuracy loss. (That's 4-bit weights, activations, and KV cache.) Previous methods saw 25%+ degradation at the same precision. The technique essentially teaches the model to be more quantization-friendly.
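A toy version of the rotation idea, using a random orthogonal matrix rather than SpinQuant's learned ones: rotation preserves the vector exactly (so it can be undone at inference) while spreading the outlier's mass across all dimensions, allowing a much finer quantization grid:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

w = rng.normal(size=d)
w[0] = 50.0                                   # one extreme weight

def quant_error(v, bits=4):
    """L2 error of symmetric per-tensor quantization at `bits`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.linalg.norm(np.round(v / scale) * scale - v)

err_plain = quant_error(w)        # outlier dictates a coarse grid
err_rotated = quant_error(Q @ w)  # same norm, flatter distribution
# Orthogonality means the error measured in rotated space equals the
# error after rotating back: the round trip itself is lossless.
```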

Weights, Activations, and KV Cache

Weight quantization dominates discussions because weights are static and predictable. But three components determine deployment performance.

Weights are the obvious target. They're fixed after training, so you quantize once and deploy. The 4-bit standard means a 4× memory reduction with manageable quality loss when using outlier-aware methods.

Activations are trickier. They change with every input, so you need either calibration data to estimate their ranges (post-training quantization) or to train the model with quantization in the loop (quantization-aware training). Most production deployments use INT4 weights with FP16 activations because activation quantization remains error-prone.

KV cache has emerged as the sleeper optimization. For long-context applications, the key-value cache storing attention states can exceed model weight memory. Research shows KV cache can be quantized to 3 bits with negligible quality loss, often delivering more impact than weight quantization for applications generating thousands of tokens.

The dominant 2026 pattern is W4A16: 4-bit weights, 16-bit activations. Datacenter deployments often use W8A8 for the better accuracy when memory constraints are looser. The cutting edge is W4A4KV4, but it requires techniques like SpinQuant and isn't universally applicable yet.
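These patterns translate directly into memory budgets. A back-of-envelope calculator (hypothetical helper names; the KV example assumes a Llama-style 7B configuration with grouped-query attention, not a measured figure):

```python
def weight_gb(params_billions, bits):
    """Weight memory in GB, ignoring scale and zero-point overhead."""
    return params_billions * bits / 8

def kv_cache_gb(layers, kv_heads, head_dim, context_len, bits, batch=1):
    """KV cache stores two tensors (K and V) per layer per token."""
    elems = 2 * layers * kv_heads * head_dim * context_len * batch
    return elems * bits / 8 / 1e9

print(weight_gb(7, 16))                      # 14.0: the FP16 baseline
print(weight_gb(7, 4))                       # 3.5: the same model at W4
# Assumed config: 32 layers, 8 KV heads, head_dim 128, 128k context.
print(kv_cache_gb(32, 8, 128, 128_000, 16))  # ~16.8: more than the W4 weights
```

At long context the FP16 cache dwarfs the quantized weights, which is why KV cache quantization can deliver more impact than weight quantization.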

The Hardware Reality

Quantization would be useless if hardware couldn't exploit it. Modern accelerators can.

INT4 operations run faster than FP16 because the arithmetic units are simpler and more can fit on chip. Memory bandwidth drops because you're moving fewer bits. Benchmarks on H100 GPUs show INT4 delivering 2.69× throughput versus BF16, with 63% cost reduction for inference workloads.

But the bigger win is capacity. Compressed weights free GPU memory for expanded KV cache, which matters enormously for reasoning models that generate thousands of internal tokens before responding. The same hardware that struggled with a 7B model can now run a 70B model with aggressive quantization.

Apple's A19 Pro includes native mxfp4 support in its Neural Engine, signaling where consumer hardware is heading. On-device AI isn't just possible; it's the target deployment environment.

When Quantization Fails

The compression has limits.

Sub-4-bit quantization works but requires training from scratch. BitNet demonstrates 1.58-bit models that function, fitting a 2B parameter model in 400MB. You can't convert an existing model down to this precision after training; it has to be built in from the start.

Fine-tuned models quantize worse than base models. The fine-tuning process shifts weight distributions in ways that make outliers more severe. Test quantized quality carefully if you're deploying a heavily customized model.

Long-tail capabilities degrade first. A model might maintain general conversation quality at 4-bit while losing precise code generation or mathematical reasoning. The benchmarks that matter depend on your use case.

Our read: Quantization has matured from a compression hack to an essential deployment technique. The 4-bit standard works because the research community solved the outlier problem, not because LLMs are inherently robust to precision loss. When the next generation of models hits 1 trillion parameters, the outlier problem will likely return in new forms. For now, the trade-off is remarkably favorable: 4× memory reduction for under 5% quality loss is a deal most applications should take.

Frequently Asked Questions