Agent Architectures: How AI Decides When to Stop
We've solved the reasoning problem. The ReAct paper, a 2022 collaboration between Princeton and Google Research, established that interleaving thinking with action beats doing either alone. On fact verification, ReAct outperformed pure chain-of-thought by about 5 percentage points. On decision-making benchmarks like ALFWorld, the gap was starker: 71% success for ReAct versus 45% for action-only approaches.
We've solved reliable tool invocation. Function calling gave us structured, predictable interfaces between models and external systems. We've even documented multi-agent coordination patterns.
But termination? That's where things get messy.
Every framework handles it differently, and the failure modes are real production problems: infinite loops, runaway token costs, agents that drift further from their goal with each iteration. You'd think this would be the first thing we'd nail down. It wasn't.
ReAct's Hidden Continuation Bias
The core insight of ReAct is simple: let the model think out loud, then act, then observe what happened. Repeat. This "Thought → Action → Observation" loop has become the dominant paradigm for agent design, and for good reason.
Why does this work better than pure reasoning? Grounding. When a model generates a chain-of-thought without external feedback, it can hallucinate confidently and never get corrected. When it acts and observes real results, those observations anchor subsequent reasoning in reality. The ReAct paper demonstrated another practical benefit: human-in-the-loop correction becomes trivial. Because reasoning traces are explicit, a human can edit a single thought step and let the agent continue from there.
Try doing that with an opaque decision process.
One thing often gets overlooked: ReAct has a continuation bias baked in. The default assumption is that the agent should keep going. The loop runs until something external stops it.
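The continuation bias is easiest to see in code. Here is a minimal sketch of a ReAct-style loop, assuming a hypothetical `llm` callable and `tools` registry rather than any specific framework's API: notice that nothing in the loop body decides to stop on its own.

```python
def react_loop(llm, tools, task, max_steps=10):
    """Thought -> Action -> Observation, repeated.

    Continuation bias: the loop keeps going by default. It only ends
    when the model itself emits a finish action, or when an external
    limit (max_steps) cuts it off.
    """
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        # The model reads the trace so far and produces a thought plus an action.
        thought, action, arg = llm("\n".join(trace))
        trace.append(f"Thought: {thought}")
        if action == "finish":                      # the only internal exit
            return arg, trace
        observation = tools[action](arg)            # act against the real world
        trace.append(f"Action: {action}[{arg}]")
        trace.append(f"Observation: {observation}") # grounding for the next thought
    return None, trace                              # external stop: budget exhausted
```

Because the trace is a plain list of strings, the human-in-the-loop correction described above is just editing one element and resuming the loop.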
Modern production agents often skip the explicit reasoning trace entirely. Function calling lets models invoke tools through structured JSON rather than natural language. The model says "I want to call search with query X," not "I think I should search for X because..."
The trade-off is clear. Function calling is faster, cheaper on tokens, and more reliable for well-defined tool interfaces. ReAct-style prompting is more flexible, better for novel situations, and produces interpretable reasoning traces. This is not an either/or choice. The Vercel AI SDK's loop control documentation shows modern frameworks mixing both approaches: function calling for routine operations, explicit reasoning for edge cases.
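The structured side of that trade-off can be sketched as a thin dispatcher: the model emits JSON naming a tool, and glue code validates and executes it. The tool names and JSON shape here are illustrative, not any provider's actual wire format.

```python
import json

# Illustrative tool registry; real systems would attach schemas for validation.
TOOLS = {
    "search": lambda args: f"results for {args['query']}",
}

def dispatch(model_output: str):
    """Parse a structured tool call like
    '{"name": "search", "arguments": {"query": "X"}}' and execute it."""
    call = json.loads(model_output)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](args)
```

No reasoning trace to parse, no free-text to misread: that is where the speed and reliability come from, and also why edge cases that need explanation fall back to ReAct-style prompting.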
The more interesting question is what happens when you need multiple agents working together.
Orchestration Gets Complicated Fast
Microsoft's Azure Architecture Center documents five core orchestration patterns for multi-agent systems:
Sequential: Agents run in order, each passing results to the next. Good for pipelines with clear stages.
Concurrent: Agents work in parallel on independent subtasks. Good for embarrassingly parallel problems.
Group Chat: Multiple agents discuss a problem, with an orchestrator managing turns. Microsoft recommends keeping this to three or fewer agents; conversation flow gets unwieldy beyond that.
Handoff: One agent routes to specialists based on the task. This is where things get dangerous. When routing logic is unclear, agents can bounce requests back and forth indefinitely.
Magentic: A task ledger tracks progress on complex, open-ended problems. This pattern prioritizes auditability over speed.
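The handoff danger in particular can be guarded mechanically with a hop limit. A sketch, assuming agents are hypothetical callables that return either a final answer or a routing decision:

```python
def run_with_handoffs(agents, start, request, max_hops=5):
    """Route a request through specialist agents, capping the number of
    handoffs so unclear routing logic cannot ping-pong forever."""
    current, hops = start, 0
    while hops <= max_hops:
        result = agents[current](request)
        if result["type"] == "answer":
            return result["content"]
        current = result["route_to"]  # handoff to a specialist
        hops += 1
    raise RuntimeError(f"handoff limit of {max_hops} exceeded; likely a routing loop")
```

A hop limit does not fix bad routing logic, but it converts an invisible infinite loop into a loud, debuggable error.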
The pitfalls Microsoft identifies are instructive: overlapping specialization (agents stepping on each other), context window bloat (accumulating irrelevant history), and mutable shared state (race conditions between agents). These are not theoretical concerns.
Gartner projects that more than 40% of agentic AI projects could be cancelled by the end of 2027 due to cost and complexity.
Only 28% of enterprises report mature AI agent capabilities, versus 80% for basic automation. The agent market is projected at $8.5 billion by 2026, potentially reaching $45 billion by 2030 with good orchestration. The qualifier matters.
Every architecture above glosses over the same question: who decides when to stop?
The Letta blog's analysis of their v1 agent loop makes this explicit. MemGPT's key innovation was treating every action, including messages, as tool calls. This enabled a clever termination mechanism: the request_heartbeat parameter lets models explicitly signal whether they want to continue or stop. MemGPT defaults to stopping (termination bias). ReAct defaults to continuing (continuation bias).
Neither is universally correct.
The Vercel AI SDK offers multiple approaches. You can stop after a fixed number of steps (default: 20). You can track cumulative token costs and stop when you hit a budget. You can use a "tool with no execute function" as a forced termination signal; the model calls a tool called finish and the loop ends. You can dynamically adjust available tools per iteration, removing continuation options when you want to force a stop.
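Two of those heuristics, a step cap and a cumulative token budget, compose naturally. A framework-agnostic sketch (the `step_fn` interface and counters are assumptions for illustration, not the Vercel AI SDK's API):

```python
def bounded_loop(step_fn, max_steps=20, token_budget=50_000):
    """Run agent steps until the agent reports it is done, the step cap
    fires, or cumulative token spend exhausts the budget.
    step_fn() performs one step and returns (finished, tokens_spent)."""
    steps, tokens_used = 0, 0
    while steps < max_steps and tokens_used < token_budget:
        done, tokens = step_fn()
        tokens_used += tokens
        steps += 1
        if done:
            return "finished", steps, tokens_used
    reason = "step cap" if steps >= max_steps else "token budget"
    return reason, steps, tokens_used
```

Returning the stop reason, not just the result, matters for the observability story discussed below: "finished" and "token budget" demand very different operator responses.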
These are all heuristics. They work until they don't.
When Agents Go Wrong
The failure modes are consistent across implementations:
Infinite loops: Agent A asks Agent B for clarification. Agent B asks Agent A for context. Neither has a stopping condition for this case.
Context drift: The agent keeps working but gradually loses track of the original goal. Each step is locally reasonable; the trajectory is globally nonsensical.
Runaway costs: No token budget was set, or the budget was set too high. The agent burns through API credits accomplishing nothing useful.
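The first failure mode above has a cheap partial detector: if the same (sender, message) pair recurs within a recent window, the exchange is probably cycling. A sketch, assuming messages are stored as simple tuples:

```python
def detect_cycle(messages, window=6):
    """messages: list of (sender, text) tuples. Returns True if the recent
    window contains an exact repeat, a crude proxy for a routing loop.
    Paraphrased repeats evade this check; it catches only literal cycles."""
    recent = messages[-window:]
    return len(set(recent)) < len(recent)
```

A production version would compare embeddings or normalized text rather than exact strings, but even this crude check turns the A-asks-B-asks-A loop into a detectable event.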
The industry trend is toward "human-on-the-loop" rather than "human-in-the-loop." Instead of approving each action, humans monitor dashboards and intervene when metrics go wrong. This requires better observability: knowing what an agent is doing, why it thinks it should continue, and how close it is to success or failure.
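Human-on-the-loop monitoring implies emitting a structured event per agent step. One possible record shape, with fields chosen to answer exactly the three questions above (all field names are illustrative):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class StepEvent:
    agent: str                # what is acting
    step: int
    action: str               # what it is doing
    continue_reason: str      # why it thinks it should continue
    tokens_used: int          # cumulative cost so far
    progress_estimate: float  # 0.0-1.0, self-reported; treat skeptically

event = StepEvent(
    agent="researcher",
    step=3,
    action="search",
    continue_reason="need one more corroborating source",
    tokens_used=12_450,
    progress_estimate=0.6,
)
print(json.dumps(asdict(event)))  # ship to your dashboard/log pipeline
```

The `progress_estimate` field is the weakest link, since it is the model grading itself, but paired with `tokens_used` it lets a dashboard flag the telltale divergence: cost climbing while progress stalls.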
Our read: termination is not a technical problem waiting for a clever solution. It is a product design problem. What counts as "done" depends on what the user actually wanted, and users often don't know until they see intermediate results. The architectures that scale are not the ones with the smartest stopping criteria. They are the ones that make it easy for humans to see what is happening and course-correct before things go wrong.
ReAct taught us to reason with action. Function calling taught us to invoke tools reliably. Multi-agent patterns taught us to coordinate specialists. The next lesson is termination, and that one requires understanding the human on the other end.