Pydantic Bets LLMs Should Write Code, Not Call Tools

Monty is a minimal Python interpreter for AI agents. Pydantic's real bet: code generation beats sequential tool calling for agentic workflows.

Developer Tools · AI agents · code generation · Pydantic · Python

Monty is a minimal Python interpreter built for AI agents. But the real story is the architectural gamble behind it.

The Pydantic team has released Monty, a minimal Python interpreter written in Rust and designed specifically for AI agents. It boots in 0.06 milliseconds, strips Python down to its core syntax, and removes standard library access entirely.

That's the what. The why is more interesting.

Right now, the dominant pattern for AI agents is sequential tool calling. The LLM decides to call a tool, gets results, sends those results back through its context window, reasons about them, calls another tool, and so on. Each step is a full roundtrip. Each roundtrip burns tokens and adds latency. This works, but it's expensive and clunky. If an agent needs to fetch data from three sources and combine them, that's three separate LLM turns minimum, with complete result payloads shuttled through the context window each time.
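That loop can be sketched in a few lines. Here call_llm and run_tool are hypothetical stubs standing in for a real LLM API and a real tool registry, not any particular framework's interface; the shape of the loop is the point.

```python
# Minimal sketch of the sequential tool-calling loop described above.
# call_llm and run_tool are made-up stand-ins, not a real API.

def run_tool(name, args):
    # Pretend tool: returns a "record" from the named source.
    return {"source": name, "value": len(name)}

def call_llm(messages):
    # Stub "model": requests one tool per turn, then answers.
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    sources = ["inventory", "pricing", "reviews"]
    if tool_turns < len(sources):
        return {"type": "tool_call", "name": sources[tool_turns], "args": {}}
    combined = [m["content"]["value"] for m in messages if m["role"] == "tool"]
    return {"type": "answer", "content": f"combined: {combined}"}

def agent_loop(task):
    messages = [{"role": "user", "content": task}]
    roundtrips = 0
    while True:
        reply = call_llm(messages)          # one full LLM roundtrip per step
        roundtrips += 1
        if reply["type"] == "tool_call":
            result = run_tool(reply["name"], reply["args"])
            # the complete result payload re-enters the context window
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"], roundtrips

answer, roundtrips = agent_loop("combine data from three sources")
print(roundtrips)  # 4: three tool calls plus the final answer
```

Three data sources means four LLM turns minimum, with every intermediate payload shuttled back through the context each time.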

Pydantic's bet: let the LLM write code instead.

Code Mode Changes the Model

In Pydantic AI's Code Mode, the LLM generates a script that chains multiple tool calls, extracts specific fields, and runs logic locally. Only the final result (or errors) goes back to the model. The team claims this reduces token usage, cuts context bloat, and speeds up multi-step operations.

It's a compelling pitch. Instead of the LLM orchestrating a conversation with tools, it writes a program that does the work in one shot.
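As a hypothetical illustration of what such a generated script might look like (fetch_inventory and fetch_pricing are invented tool functions, not Pydantic AI's actual API):

```python
# Sketch of a Code Mode-style script: chain the tool calls, keep only the
# fields you need, run the logic locally, return one compact result.
# The tool functions below are stubs invented for this example.

def fetch_inventory(sku):
    return {"sku": sku, "stock": 12, "warehouse": "A", "audit_log": ["..."] * 50}

def fetch_pricing(sku):
    return {"sku": sku, "unit_price": 4.99, "currency": "USD", "history": []}

skus = ["A-100", "B-200"]
report = []
for sku in skus:
    inv = fetch_inventory(sku)      # bulky intermediate payloads stay local
    price = fetch_pricing(sku)
    if inv["stock"] > 0:            # logic runs here, not in the model
        report.append({"sku": sku, "stock": inv["stock"],
                       "total": round(inv["stock"] * price["unit_price"], 2)})

# Only this compact summary returns to the model's context window.
print(report)
```

The bulky fields (audit_log, history) never touch the context window; the model sees only the two-row report.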

But this creates a new problem: you're executing LLM-generated code. And you can't just run arbitrary Python from an LLM. Full CPython has too much surface area; even with sandboxing, the attack vectors multiply. One import of os, one filesystem read, one subprocess call, and you're in trouble.

Monty solves this by removal.

No standard library means no dangerous imports. The interpreter supports core Python (variables, functions, control flow) but nothing that touches the outside world except explicitly exposed tool functions. The technical choices reveal the thinking:

Microsecond startup matters because Code Mode might generate many small scripts per task. If each execution has noticeable overhead, the latency savings from fewer roundtrips evaporate.

The Rust implementation enables embedding in Python and JavaScript runtimes without spawning subprocesses. And the intentional incompleteness is the security model itself. Rather than trying to restrict full Python (a notoriously difficult problem), Monty just doesn't implement the dangerous parts.
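A toy sketch of that idea, in plain Python rather than Rust and emphatically not Monty's implementation: an evaluator that handles only a whitelist of AST node types, so dangerous constructs fail not because they're blocked but because no handler for them exists.

```python
import ast
import operator

# Toy illustration of security-by-omission (not Monty's code): implement
# only a small whitelist of node types; everything else is simply absent.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def eval_node(node, env):
    if isinstance(node, ast.Expression):
        return eval_node(node.body, env)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]                 # only explicitly exposed names
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](eval_node(node.left, env),
                                  eval_node(node.right, env))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        fn = env[node.func.id]              # e.g. an exposed tool function
        return fn(*[eval_node(a, env) for a in node.args])
    raise NotImplementedError(type(node).__name__)  # the unimplemented rest

def run(src, env):
    return eval_node(ast.parse(src, mode="eval"), env)

tools = {"double": lambda x: x * 2}
print(run("double(3) + 4", tools))          # 10

try:
    run("().__class__", tools)              # a classic sandbox-escape opener
except NotImplementedError as err:
    print("no handler for:", err)           # no handler for: Attribute
```

There is nothing to escape here: attribute access, imports, and comprehensions all dead-end at the same NotImplementedError, which is the omission model in miniature.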

The team is clear-eyed about limitations: for multi-tenant production systems, they recommend layered defense with VM or container isolation. Monty handles the fast path; heavier sandboxing handles the adversarial cases.

Two Different Philosophies

Code Mode is philosophically different from tool calling. Tool calling treats the LLM as an orchestrator that directs execution step by step. Code Mode treats the LLM as a programmer that generates execution plans upfront.

The advantages are real: fewer roundtrips, better composability, reduced context bloat for intermediate results. The risks are also real. LLM-generated code can have bugs. It can hallucinate function names or produce valid syntax with invalid semantics. When an agent calls tools one at a time, you can intercept and validate each step. When it writes a script, you're trusting the whole thing.

There's a Hacker News discussion worth reading on this. One commenter argues we should be building stricter languages purpose-built for AI, not constraining existing ones. The logic: LLMs don't need the ergonomic flexibility humans do, and that flexibility is exactly what creates security and correctness problems.

That's probably right in the long run.

But Pydantic chose Python's syntax for a practical reason: LLMs are already trained on vast amounts of Python code. A novel DSL might be cleaner in theory but less reliable in practice.

Monty is infrastructure for a specific vision of AI agents. Not the "autonomous agent that browses the web" demo-ware, but the production reality of LLMs that need to manipulate structured data, call APIs, and compose results quickly. You can see similar instincts in projects like E2B and the way Claude's computer use generates action sequences rather than individual commands.

Our read: The real question is whether the industry converges on code generation as the agentic paradigm, or whether sequential tool calling remains dominant because the debugging story is simpler. Pydantic is betting on code. But this is still early. The tooling for understanding what an LLM-generated script will do before you run it doesn't really exist yet. Until it does, Code Mode remains a calculated risk rather than an obvious upgrade.
