Pydantic Bets LLMs Should Write Code, Not Call Tools

Monty is a minimal Python interpreter for AI agents. Pydantic's real bet: code generation beats sequential tool calling for agentic workflows.

Developer Tools · AI agents · code generation · Pydantic · Python

Monty is a minimal Python interpreter built for AI agents. But the real story is the architectural gamble behind it.

The Pydantic team has released Monty, a minimal Python interpreter written in Rust and designed specifically for AI agents. It boots in 0.06 milliseconds, strips Python down to its core syntax, and removes standard library access entirely.

That's the what. The why is more interesting.

Right now, the dominant pattern for AI agents is sequential tool calling. The LLM decides to call a tool, gets results, sends those results back through its context window, reasons about them, calls another tool, and so on. Each step is a full roundtrip. Each roundtrip burns tokens and adds latency. This works, but it's expensive and clunky. If an agent needs to fetch data from three sources and combine them, that's three separate LLM turns minimum, with complete result payloads shuttled through the context window each time.
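That loop can be sketched in a few lines. Here call_llm and run_tool are hypothetical stubs standing in for a real LLM API and a real tool registry, not any particular framework's interface; the shape of the loop is the point.

```python
# Minimal sketch of the sequential tool-calling loop described above.
# call_llm and run_tool are made-up stand-ins, not a real API.

def run_tool(name, args):
    # Pretend tool: returns a "record" from the named source.
    return {"source": name, "value": len(name)}

def call_llm(messages):
    # Stub "model": requests one tool per turn, then answers.
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    sources = ["inventory", "pricing", "reviews"]
    if tool_turns < len(sources):
        return {"type": "tool_call", "name": sources[tool_turns], "args": {}}
    combined = [m["content"]["value"] for m in messages if m["role"] == "tool"]
    return {"type": "answer", "content": f"combined: {combined}"}

def agent_loop(task):
    messages = [{"role": "user", "content": task}]
    roundtrips = 0
    while True:
        reply = call_llm(messages)          # one full LLM roundtrip per step
        roundtrips += 1
        if reply["type"] == "tool_call":
            result = run_tool(reply["name"], reply["args"])
            # the complete result payload re-enters the context window
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"], roundtrips

answer, roundtrips = agent_loop("combine data from three sources")
print(roundtrips)  # 4: three tool calls plus the final answer
```

Three data sources means four LLM turns minimum, with every intermediate payload shuttled back through the context each time.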

Pydantic's bet: let the LLM write code instead.

Code Mode Changes the Model

In Pydantic AI's Code Mode, the LLM generates a script that chains multiple tool calls, extracts specific fields, and runs logic locally. Only the final result (or errors) goes back to the model. The team claims this reduces token usage, cuts context bloat, and speeds up multi-step operations.

It's a compelling pitch. Instead of the LLM orchestrating a conversation with tools, it writes a program that does the work in one shot.
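As a hypothetical illustration of what such a generated script might look like (fetch_inventory and fetch_pricing are invented tool functions, not Pydantic AI's actual API):

```python
# Sketch of a Code Mode-style script: chain the tool calls, keep only the
# fields you need, run the logic locally, return one compact result.
# The tool functions below are stubs invented for this example.

def fetch_inventory(sku):
    return {"sku": sku, "stock": 12, "warehouse": "A", "audit_log": ["..."] * 50}

def fetch_pricing(sku):
    return {"sku": sku, "unit_price": 4.99, "currency": "USD", "history": []}

skus = ["A-100", "B-200"]
report = []
for sku in skus:
    inv = fetch_inventory(sku)      # bulky intermediate payloads stay local
    price = fetch_pricing(sku)
    if inv["stock"] > 0:            # logic runs here, not in the model
        report.append({"sku": sku, "stock": inv["stock"],
                       "total": round(inv["stock"] * price["unit_price"], 2)})

# Only this compact summary returns to the model's context window.
print(report)
```

The bulky fields (audit_log, history) never touch the context window; the model sees only the two-row report.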

But this creates a new problem: you're executing LLM-generated code. And you can't just run arbitrary Python from an LLM. Full CPython has too much surface area; even with sandboxing, the attack vectors multiply. One import of os, one filesystem read, one subprocess call, and you're in trouble.

Monty solves this by removal.

No standard library means no dangerous imports. The interpreter supports core Python (variables, functions, control flow) but nothing that touches the outside world except explicitly exposed tool functions. The technical choices reveal the thinking:

Microsecond startup matters because Code Mode might generate many small scripts per task. If each execution has noticeable overhead, the latency savings from fewer roundtrips evaporate.

The Rust implementation enables embedding in Python and JavaScript runtimes without spawning subprocesses. And the intentional incompleteness is the security model itself. Rather than trying to restrict full Python (a notoriously difficult problem), Monty just doesn't implement the dangerous parts.
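A toy sketch of that idea, in plain Python rather than Rust and emphatically not Monty's implementation: an evaluator that handles only a whitelist of AST node types, so dangerous constructs fail not because they're blocked but because no handler for them exists.

```python
import ast
import operator

# Toy illustration of security-by-omission (not Monty's code): implement
# only a small whitelist of node types; everything else is simply absent.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def eval_node(node, env):
    if isinstance(node, ast.Expression):
        return eval_node(node.body, env)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]                 # only explicitly exposed names
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](eval_node(node.left, env),
                                  eval_node(node.right, env))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        fn = env[node.func.id]              # e.g. an exposed tool function
        return fn(*[eval_node(a, env) for a in node.args])
    raise NotImplementedError(type(node).__name__)  # the unimplemented rest

def run(src, env):
    return eval_node(ast.parse(src, mode="eval"), env)

tools = {"double": lambda x: x * 2}
print(run("double(3) + 4", tools))          # 10

try:
    run("().__class__", tools)              # a classic sandbox-escape opener
except NotImplementedError as err:
    print("no handler for:", err)           # no handler for: Attribute
```

There is nothing to escape here: attribute access, imports, and comprehensions all dead-end at the same NotImplementedError, which is the omission model in miniature.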

The team is clear-eyed about limitations: for multi-tenant production systems, they recommend layered defense with VM or container isolation. Monty handles the fast path; heavier sandboxing handles the adversarial cases.

Two Different Philosophies

Code Mode is philosophically different from tool calling. Tool calling treats the LLM as an orchestrator that directs execution step by step. Code Mode treats the LLM as a programmer that generates execution plans upfront.

The advantages are real: fewer roundtrips, better composability, reduced context bloat for intermediate results. The risks are also real. LLM-generated code can have bugs. It can hallucinate function names or produce valid syntax with invalid semantics. When an agent calls tools one at a time, you can intercept and validate each step. When it writes a script, you're trusting the whole thing.

There's a Hacker News discussion worth reading on this. One commenter argues we should be building stricter languages purpose-built for AI, not constraining existing ones. The logic: LLMs don't need the ergonomic flexibility humans do, and that flexibility is exactly what creates security and correctness problems.

That's probably right in the long run.

But Pydantic chose Python's syntax for a practical reason: LLMs are already trained on vast amounts of Python code. A novel DSL might be cleaner in theory but less reliable in practice.

Monty is infrastructure for a specific vision of AI agents. Not the "autonomous agent that browses the web" demo-ware, but the production reality of LLMs that need to manipulate structured data, call APIs, and compose results quickly. You can see similar instincts in projects like E2B and the way Claude's computer use generates action sequences rather than individual commands.

Our read: The real question is whether the industry converges on code generation as the agentic paradigm, or whether sequential tool calling remains dominant because the debugging story is simpler. Pydantic is betting on code. But this is still early. The tooling for understanding what an LLM-generated script will do before you run it doesn't really exist yet. Until it does, Code Mode remains a calculated risk rather than an obvious upgrade.
