Small LLMs Can Call Tools. They Can't Stop Calling Them.

A CPU-only benchmark of 11 sub-4B models reveals the real agentic bottleneck: restraint, not capability.

Developer Tools · benchmarks · small-models · tool-calling · local-inference

The conventional wisdom about AI agents is that capability scales with parameters. Bigger model, better tool use. A developer named Mike Veerman decided to test that assumption by running 11 small models through a benchmark designed not to measure whether they can call tools, but whether they know when not to.

The results upend the scaling intuition entirely. Qwen 2.5 at 1.5B parameters outperformed Qwen 2.5 at 3B. Llama 3.2:3B achieved a 9/10 action score but 0/2 on restraint. The problem isn't tool execution. It's what Veerman calls the keyword trigger problem: say "weather" anywhere in a prompt and these models will call get_weather, even when explicitly told not to.

Trick Prompts Expose the Real Failure Mode

Veerman's benchmark includes prompts designed to trip up models that pattern-match on keywords rather than parse intent:

  • "Don't check the weather in Antwerp, just find me the quarterly report." Three of eight models called get_weather anyway.
  • "The weather in Antwerp is 8°C and rainy. Should I schedule an indoor meeting?" Five of eight models called get_weather to look up information already in the prompt.
  • "Write a Python script that checks the weather using an API." Multiple models called get_weather instead of writing code about weather APIs.

These aren't edge cases. They're the exact scenarios that break agents in production: users mentioning a capability they explicitly don't want invoked, providing context that makes a tool call redundant, or asking for code that references a tool's domain.
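A restraint check of this kind takes only a few lines of Python. The trap prompts below are the ones quoted above; the harness names (`TrapCase`, `restraint_score`, `keyword_bot`) are illustrative, not from Veerman's code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TrapCase:
    prompt: str
    forbidden_tool: str  # the tool a keyword-matching model wrongly calls

# Each prompt mentions "weather" but should NOT trigger get_weather.
TRAP_CASES = [
    TrapCase("Don't check the weather in Antwerp, just find me the quarterly report.",
             "get_weather"),
    TrapCase("The weather in Antwerp is 8°C and rainy. Should I schedule an indoor meeting?",
             "get_weather"),
    TrapCase("Write a Python script that checks the weather using an API.",
             "get_weather"),
]

def restraint_score(model: Callable[[str], Optional[str]],
                    cases: list[TrapCase]) -> float:
    """Fraction of trap prompts where no forbidden tool was called.

    `model` returns the name of the tool it chose, or None for no call.
    """
    passed = sum(1 for c in cases if model(c.prompt) != c.forbidden_tool)
    return passed / len(cases)

# A keyword-triggered "model": calls get_weather whenever "weather" appears.
keyword_bot = lambda prompt: "get_weather" if "weather" in prompt.lower() else None

print(restraint_score(keyword_bot, TRAP_CASES))  # 0.0 — fails every trap
```

The scorer deliberately treats "declined to call anything" as a pass, which is exactly the behavior the trick prompts reward.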

Llama 3.2:3B is the clearest example. It picks the right tool more often than most models on hard prompts. Its problem is restraint, not selection. Ask it "what tools do you have?" and it calls search_files.

Every prompt is a nail.

Conservatism beats aggression on judgment tasks. Qwen 2.5:1.5B won over its 3B sibling by declining prompts it wasn't sure about instead of guessing wrong. When asked to write a Python script about weather APIs, the 3B model called get_weather. The 1.5B didn't. This challenges the architecture of most agent systems, which assume you want the largest model you can afford. For routing decisions, maybe you want the most conservative model you can trust.

BitNet 2B-4T offered another surprise. The base BitNet 3B produces what Veerman calls "word salad." But the instruction-tuned 2B-4T generates perfect JSON tool calls in 2.3 seconds on CPU. No GPU, no cloud API, just Ollama and bitnet.cpp on a laptop.
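"Perfect JSON tool calls" means raw output that both parses and matches a declared tool. A minimal validator separates that from the "word salad" failure mode; the `{"name": ..., "arguments": {...}}` shape here is an assumption modeled on the common OpenAI-style format, not necessarily the one Veerman's harness expects:

```python
import json

def parse_tool_call(raw: str, known_tools: set[str]) -> tuple[str, dict]:
    """Validate a model's raw output as a tool call, or raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        # The 'word salad' failure mode: output isn't even valid JSON.
        raise ValueError(f"not valid JSON: {e}")
    name = call.get("name")
    if name not in known_tools:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return name, args

raw = '{"name": "get_weather", "arguments": {"city": "Antwerp"}}'
print(parse_tool_call(raw, {"get_weather", "search_files"}))
# → ('get_weather', {'city': 'Antwerp'})
```

A check like this only measures whether a model *can* call tools cleanly; it says nothing about whether the call should have happened, which is the benchmark's real point.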

What Agent Builders Should Take From This

Veerman's practical takeaway: "Simple tool routing is solved at 1.5B on CPU. But if your agent needs to decide whether to act, sub-4B models will confidently take the wrong action when keyword triggers are present."

This matters for a stack that's standardizing around tool-calling as the core interface. As we've seen with Pydantic's argument that code generation beats tool calling, and with Apple building Claude's agent SDK into Xcode, the industry is betting heavily on tool-using agents. The implicit assumption is that tool selection is the hard problem. Veerman's benchmark suggests the hard problem is actually tool avoidance. The difference between a useful agent and an annoying one is knowing when to do nothing.

The full benchmark code is a single Python file, designed for others to add models and prompts. It's an early attempt at measuring something the industry hasn't prioritized: judgment under ambiguity on consumer hardware.
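The two-axis reporting style above (e.g. Llama 3.2:3B's 9/10 action but 0/2 restraint) can be sketched as a single scorer; the case data and names (`Case`, `score`, `trigger_happy`) are hypothetical stand-ins for whatever Veerman's file actually defines:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Case:
    prompt: str
    expected_tool: Optional[str]  # None means the correct move is no tool call

def score(decide: Callable[[str], Optional[str]], cases: list[Case]):
    """Split results into action hits (a tool was expected) and
    restraint hits (no tool was expected)."""
    action = [c for c in cases if c.expected_tool is not None]
    restraint = [c for c in cases if c.expected_tool is None]
    action_hits = sum(decide(c.prompt) == c.expected_tool for c in action)
    restraint_hits = sum(decide(c.prompt) is None for c in restraint)
    return (action_hits, len(action)), (restraint_hits, len(restraint))

cases = [
    Case("What's the weather in Antwerp?", "get_weather"),
    Case("Find me the quarterly report.", "search_files"),
    Case("Write a Python script that checks the weather using an API.", None),
]

# A trigger-happy model: always acts, never declines.
trigger_happy = lambda p: "get_weather" if "weather" in p.lower() else "search_files"

print(score(trigger_happy, cases))  # ((2, 2), (0, 1)): strong action, zero restraint
```

Reporting the two numbers separately is what makes the Llama 3.2:3B pattern visible at all; a single blended accuracy score would hide it.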

The right metric for local agents isn't "can it call tools." It's "does it know when to shut up."
