Small LLMs Can Call Tools. They Can't Stop Calling Them.

A CPU-only benchmark of 11 sub-4B models reveals the real agentic bottleneck: restraint, not capability.

Developer Tools · benchmarks · small-models · tool-calling · local-inference

The conventional wisdom about AI agents is that capability scales with parameters. Bigger model, better tool use. A developer named Mike Veerman decided to test that assumption by running 11 small models through a benchmark designed not to measure whether they can call tools, but whether they know when not to.

The results upend the scaling intuition entirely. Qwen 2.5 at 1.5B parameters outperformed Qwen 2.5 at 3B. Llama 3.2:3B achieved a 9/10 action score but 0/2 on restraint. The problem isn't tool execution. It's what Veerman calls the keyword trigger problem: say "weather" anywhere in a prompt and these models will call get_weather, even when explicitly told not to.

Trick Prompts Expose the Real Failure Mode

Veerman's benchmark includes prompts designed to trip up models that pattern-match on keywords rather than parse intent:

  • "Don't check the weather in Antwerp, just find me the quarterly report." Three of eight models called get_weather anyway.
  • "The weather in Antwerp is 8°C and rainy. Should I schedule an indoor meeting?" Five of eight models called get_weather to look up information already in the prompt.
  • "Write a Python script that checks the weather using an API." Multiple models called get_weather instead of writing code about weather APIs.

These aren't edge cases. They're the exact scenarios that break agents in production: users mentioning a capability they explicitly don't want invoked, providing context that makes a tool call redundant, or asking for code that references a tool's domain.
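A restraint check of this kind takes only a few lines of Python. The trap prompts below are the ones quoted above; the harness names (`TrapCase`, `restraint_score`, `keyword_bot`) are illustrative, not from Veerman's code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TrapCase:
    prompt: str
    forbidden_tool: str  # the tool a keyword-matching model wrongly calls

# Each prompt mentions "weather" but should NOT trigger get_weather.
TRAP_CASES = [
    TrapCase("Don't check the weather in Antwerp, just find me the quarterly report.",
             "get_weather"),
    TrapCase("The weather in Antwerp is 8°C and rainy. Should I schedule an indoor meeting?",
             "get_weather"),
    TrapCase("Write a Python script that checks the weather using an API.",
             "get_weather"),
]

def restraint_score(model: Callable[[str], Optional[str]],
                    cases: list[TrapCase]) -> float:
    """Fraction of trap prompts where no forbidden tool was called.

    `model` returns the name of the tool it chose, or None for no call.
    """
    passed = sum(1 for c in cases if model(c.prompt) != c.forbidden_tool)
    return passed / len(cases)

# A keyword-triggered "model": calls get_weather whenever "weather" appears.
keyword_bot = lambda prompt: "get_weather" if "weather" in prompt.lower() else None

print(restraint_score(keyword_bot, TRAP_CASES))  # 0.0 — fails every trap
```

The scorer deliberately treats "declined to call anything" as a pass, which is exactly the behavior the trick prompts reward.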

Llama 3.2:3B is the clearest example. It picks the right tool more often than most models on hard prompts. Its problem is restraint, not selection. Ask it "what tools do you have?" and it calls search_files.

Every prompt is a nail.

Conservatism beats aggression on judgment tasks. Qwen 2.5:1.5B won over its 3B sibling by declining prompts it wasn't sure about instead of guessing wrong. When asked to write a Python script about weather APIs, the 3B model called get_weather. The 1.5B didn't. This challenges the architecture of most agent systems, which assume you want the largest model you can afford. For routing decisions, maybe you want the most conservative model you can trust.

BitNet 2B-4T offered another surprise. The base BitNet 3B produces what Veerman calls "word salad." But the instruction-tuned 2B-4T generates perfect JSON tool calls in 2.3 seconds on CPU. No GPU, no cloud API, just Ollama and bitnet.cpp on a laptop.
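"Perfect JSON tool calls" means raw output that both parses and matches a declared tool. A minimal validator separates that from the "word salad" failure mode; the `{"name": ..., "arguments": {...}}` shape here is an assumption modeled on the common OpenAI-style format, not necessarily the one Veerman's harness expects:

```python
import json

def parse_tool_call(raw: str, known_tools: set[str]) -> tuple[str, dict]:
    """Validate a model's raw output as a tool call, or raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        # The 'word salad' failure mode: output isn't even valid JSON.
        raise ValueError(f"not valid JSON: {e}")
    name = call.get("name")
    if name not in known_tools:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return name, args

raw = '{"name": "get_weather", "arguments": {"city": "Antwerp"}}'
print(parse_tool_call(raw, {"get_weather", "search_files"}))
# → ('get_weather', {'city': 'Antwerp'})
```

A check like this only measures whether a model *can* call tools cleanly; it says nothing about whether the call should have happened, which is the benchmark's real point.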

What Agent Builders Should Take From This

Veerman's practical takeaway: "Simple tool routing is solved at 1.5B on CPU. But if your agent needs to decide whether to act, sub-4B models will confidently take the wrong action when keyword triggers are present."

This matters for a stack that's standardizing around tool-calling as the core interface. As we've seen with Pydantic's argument that code generation beats tool calling, and with Apple building Claude's agent SDK into Xcode, the industry is betting heavily on tool-using agents. The implicit assumption is that tool selection is the hard problem. Veerman's benchmark suggests the hard problem is actually tool avoidance. The difference between a useful agent and an annoying one is knowing when to do nothing.

The full benchmark code is a single Python file, designed for others to add models and prompts. It's an early attempt at measuring something the industry hasn't prioritized: judgment under ambiguity on consumer hardware.
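The two-axis reporting style above (e.g. Llama 3.2:3B's 9/10 action but 0/2 restraint) can be sketched as a single scorer; the case data and names (`Case`, `score`, `trigger_happy`) are hypothetical stand-ins for whatever Veerman's file actually defines:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Case:
    prompt: str
    expected_tool: Optional[str]  # None means the correct move is no tool call

def score(decide: Callable[[str], Optional[str]], cases: list[Case]):
    """Split results into action hits (a tool was expected) and
    restraint hits (no tool was expected)."""
    action = [c for c in cases if c.expected_tool is not None]
    restraint = [c for c in cases if c.expected_tool is None]
    action_hits = sum(decide(c.prompt) == c.expected_tool for c in action)
    restraint_hits = sum(decide(c.prompt) is None for c in restraint)
    return (action_hits, len(action)), (restraint_hits, len(restraint))

cases = [
    Case("What's the weather in Antwerp?", "get_weather"),
    Case("Find me the quarterly report.", "search_files"),
    Case("Write a Python script that checks the weather using an API.", None),
]

# A trigger-happy model: always acts, never declines.
trigger_happy = lambda p: "get_weather" if "weather" in p.lower() else "search_files"

print(score(trigger_happy, cases))  # ((2, 2), (0, 1)): strong action, zero restraint
```

Reporting the two numbers separately is what makes the Llama 3.2:3B pattern visible at all; a single blended accuracy score would hide it.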

The right metric for local agents isn't "can it call tools." It's "does it know when to shut up."
