Local LLM Inference: The Real Hardware Math
Running AI locally sounds appealing until you check your GPU specs. Local inference in 2026 is genuinely viable for specific use cases, but the trade-offs are more nuanced than the "local good, API bad" framing suggests. Here's the math on what you can actually run, what you'll sacrifice, and when it makes sense.
Memory is the constraint that matters. According to LocalLLM.in's quantization guide, different model sizes require the following at Q4_K_M quantization (the standard for consumer inference):
- 7B model: 8GB VRAM
- 13B model: 12GB VRAM
- 70B model: 35-43GB VRAM
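The figures above can be approximated with a back-of-the-envelope rule: quantized weight size (parameters times bits per weight) plus a flat allowance for KV cache and runtime overhead. This is a rough sketch, not a benchmark; the overhead allowance is an assumption, and published figures like those above include extra headroom for longer contexts.

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float = 4.85,
                     overhead_gb: float = 2.0) -> float:
    """Quantized weight size plus a flat allowance for KV cache and
    runtime overhead. Q4_K_M averages roughly 4.85 bits per weight;
    the 2 GB overhead figure is an illustrative assumption."""
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits -> GB
    return weights_gb + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B at Q4_K_M: ~{vram_estimate_gb(size):.0f} GB")
```

The estimates land a bit under the table's numbers because real deployments reserve additional VRAM for context; treat the function as a lower bound.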
If you have an RTX 4090 (24GB), you can run a 13B model comfortably or squeeze a 70B at very aggressive (2-3 bit) quantization. An RTX 3060 (12GB) puts you in 7-13B territory. Most laptops with integrated graphics? You're looking at CPU inference for anything meaningful.
The throughput picture changes dramatically under load. Red Hat's benchmarks found that at 64 concurrent users, vLLM delivered 35x the throughput and 44x the tokens per second compared to llama.cpp.
For single-user scenarios, though, llama.cpp performs just fine. The right tool depends entirely on your use case.
The Quantization Trade-off
Quantization compresses model weights from 16-bit floats (the precision most modern checkpoints ship in) to smaller integer representations. The quality cost follows a predictable curve, and LocalLLM.in's analysis breaks it down clearly:
- Q8: Less than 2% perplexity increase. Effectively lossless for most applications.
- Q4_K_M: 2-8% quality degradation. This is the standard; it's where most people should land.
- Q3_K_S: 8-15% quality loss. Acceptable for broad queries but struggles with precision tasks.
The counterintuitive part: larger models handle quantization better. A quantized 13B model can outperform an unquantized 7B. If you're choosing between running a bigger model at Q4 versus a smaller model at full precision, the bigger quantized model often wins.
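The arithmetic behind that claim is simple: a quantized 13B model actually takes less memory than an unquantized 7B. A minimal sketch, assuming Q4_K_M averages about 4.85 bits per weight:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage only (no KV cache): params * bits / 8, in GB."""
    return params_billion * bits_per_weight / 8

print(f"7B  at FP16:   {weight_gb(7, 16):.1f} GB")    # -> 14.0 GB
print(f"13B at Q4_K_M: {weight_gb(13, 4.85):.1f} GB")  # -> 7.9 GB
```

Nearly twice the parameters in roughly half the memory, which is why the bigger quantized model so often wins the trade.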
Our read: Q4_K_M is the pragmatic choice for consumer hardware. You're trading single-digit percentage points of quality for roughly 4x memory savings. For most development and iteration workflows, that's a good trade.
Picking Your Inference Stack
The local inference ecosystem has stratified into clear tiers. Rost Glukhov's hosting comparison maps the landscape well.
Ollama won the "Docker for LLMs" positioning. One-command setup, excellent Apple Silicon support, and a clean API for integration. If you're a developer who wants to get running fast, start here. LM Studio owns the beginner market with its GUI-first approach, and uniquely supports Vulkan for integrated Intel/AMD GPUs that other tools ignore. If you're on integrated graphics or want a visual interface, this is your entry point.
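As a sense of what "clean API" means in practice, here is a minimal sketch of a non-streaming request against Ollama's REST endpoint, using only the standard library. It assumes a local Ollama server on its default port with a model already pulled; the model name shown is an example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def generate(model: str, prompt: str) -> str:
    """POST a single non-streaming completion request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running server and a pulled model, e.g. `ollama pull llama3.1:8b`:
# generate("llama3.1:8b", "One-line summary of KV caching.")
```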
llama.cpp is for users who want control. It's under 90MB versus 4.6GB for Ollama, with native Vulkan support that works out of the box for AMD GPUs on Windows. Ollama actually uses llama.cpp under the hood, adding abstraction and overhead. For power users who want minimal dependencies and maximum configurability, raw llama.cpp eliminates the black boxes.
vLLM is the production choice, but with a catch: it needs 24GB+ VRAM minimum. Its PagedAttention mechanism reduces memory fragmentation by 50%+, making it dramatically more efficient at scale. But that scale assumption is key. vLLM shines when you're serving multiple users; for personal use, it's overkill. The Red Hat analysis crystallizes the split: llama.cpp's throughput stays nearly flat under concurrent load due to its queue-based architecture, while vLLM was built for parallel inference. Pick your tool based on whether you're serving yourself or serving others.
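vLLM exposes an OpenAI-compatible server, which is a large part of its production appeal: clients written against the OpenAI API can point at local hardware unchanged. A minimal sketch, assuming a vLLM instance on its default port (the model name is an example):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's default server port

def chat(model: str, content: str) -> str:
    """One non-streaming chat request against a local vLLM instance,
    using the OpenAI-compatible request/response shape."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    req = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Assumes a server started with e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`:
# chat("meta-llama/Llama-3.1-8B-Instruct", "Hello")
```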
This is where the local vs. API conversation gets interesting. A local Llama 3.1 70B at Q4 quantization is genuinely competitive with GPT-4-class models for many tasks. That's real capability you can run on your own hardware (assuming you have the VRAM). For code completion, drafting, summarization, and iteration, local models are often good enough.
But frontier API models (Claude Opus, GPT-5.2) still outperform the best open models on complex reasoning tasks.
The gap isn't "local models are bad"; it's "local models are great for 80% of tasks, but the hardest 20% still favors frontier inference."
This maps to how production systems actually work. The DEV Community's LLM comparison notes that recommended architectures route 80-95% of requests to cheaper tiers (including local or budget API models), escalating only complex tasks to expensive reasoning models.

The cost angle needs nuance too. API pricing dropped significantly in the past year. GPT-4.1 runs $2.00/$8.00 per million tokens; Claude Haiku 4.5 is $1.00/$5.00. Running local only makes economic sense at high volume or when privacy requirements mandate on-premise inference.
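The "high volume" claim is worth making concrete. A rough break-even sketch, comparing API spend against an amortized consumer GPU; every figure on the local side (GPU price, lifetime, wattage, duty cycle, electricity rate) is an illustrative assumption, not a measured number:

```python
def monthly_api_cost(m_tokens_in: float, m_tokens_out: float,
                     price_in: float, price_out: float) -> float:
    """API spend for one month; token volumes in millions, prices per million."""
    return m_tokens_in * price_in + m_tokens_out * price_out

def monthly_local_cost(gpu_price: float = 1600, years: float = 3,
                       watts: float = 350, hours_per_month: float = 176,
                       kwh_rate: float = 0.15) -> float:
    """Amortized GPU purchase plus electricity. All defaults are
    illustrative assumptions (e.g. an RTX 4090-class card over 3 years)."""
    amortized = gpu_price / (years * 12)
    power = watts / 1000 * hours_per_month * kwh_rate
    return amortized + power

print(f"API (GPT-4.1, 10M in / 2M out): ${monthly_api_cost(10, 2, 2.00, 8.00):.2f}")
print(f"Local (amortized GPU + power):  ${monthly_local_cost():.2f}")
```

Under these assumptions, modest monthly volume is cheaper on the API; the local line only wins once token volume grows several-fold, or when privacy takes price off the table entirely.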
Local vs. API: Match the Tool to the Job
Local inference makes sense when:
- Privacy is non-negotiable: data never leaves your machine, no API logs, no third-party terms of service.
- Latency matters more than capability: no network round-trip for real-time applications like autocomplete.
- You're iterating rather than shipping: no API costs or rate limits during development.
- You already have the hardware: a workstation with serious GPU power makes the marginal cost of inference effectively zero.
APIs make sense when:
- You need frontier capability for the hardest reasoning tasks.
- You're serving multiple users without serious GPU infrastructure.
- You want to ship fast without managing model updates and VRAM budgets.
The Honest Take
Local LLM inference in 2026 is genuinely capable for the right use cases. An 8GB GPU running a 7B model at Q4 quantization loses only 2-8% quality compared to full precision. For developers, hobbyists, and privacy-conscious users, that's a real option now.
But framing this as "local vs. cloud" misses the point. The interesting architecture is hybrid: local models for iteration and privacy-sensitive tasks, API calls for the hard problems. Most production systems already work this way.
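The hybrid pattern reduces to a small routing decision at request time. A toy sketch, where the privacy flag and the keyword heuristic for "hard" tasks are both illustrative stand-ins for whatever classifier a real system would use:

```python
def route(prompt: str, needs_privacy: bool = False) -> str:
    """Toy router: keep private and routine work on the local model,
    escalate hard reasoning to a frontier API. The marker list is an
    illustrative heuristic, not a real complexity classifier."""
    hard_markers = ("prove", "multi-step", "architecture review")
    if needs_privacy:
        return "local"            # data must never leave the machine
    if any(m in prompt.lower() for m in hard_markers):
        return "frontier-api"     # the hardest ~20% of tasks
    return "local"                # the routine 80%+ stays local

print(route("Summarize this changelog"))    # -> local
print(route("Prove this invariant holds"))  # -> frontier-api
```

A production version would swap the keyword check for a cheap classifier model or confidence score, but the shape of the decision is the same.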
If you have consumer hardware (8-24GB VRAM), you can run 7-13B models effectively. Useful for drafts, code completion, and development iteration. Not useful for replacing a frontier model on complex reasoning tasks. Local inference is viable, cost-effective at scale, and increasingly practical. It's a tool for specific jobs, not a replacement for everything.