Structured Outputs: What Guarantees Schema Compliance

JSON mode, function calling, and constrained decoding explained for developers who need reliability from their LLM integrations, not hope.

Technical Explainer · structured outputs · JSON mode · function calling · constrained decoding

Structured Outputs: The Difference Between Asking Nicely and Guaranteeing Mathematically

Getting structured data from LLMs shouldn't require prayer. But most developers don't understand why their JSON sometimes comes back malformed, or why one API offers "100% compliance" while another silently fails. The difference comes down to a simple question: is the model being asked nicely, or is it being forced?

There are fundamentally three approaches to getting structured output from language models, and they offer very different guarantees.

Level 1: Prompting. You ask the model for JSON in the prompt. Sometimes you get it. Sometimes you get markdown-wrapped JSON. Sometimes you get apologetic prose explaining why the JSON couldn't be generated. OpenAI reports that prompt engineering alone achieves about 35.9% schema compliance. That's a failure most of the time, and not something to build production systems on.

Level 2: JSON Mode. The model guarantees valid JSON, but not that it matches your schema. You'll get parseable output, but the fields might be wrong, missing, or invented. This is where most "JSON mode" features live.

Level 3: Constrained Decoding. The model literally cannot produce invalid output because invalid tokens are made impossible during generation. OpenAI's strict mode claims 100% schema compliance through this approach.

The gap between levels 2 and 3? That's where most developer frustration lives.
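For concreteness, here is roughly what a level-3 request looks like with OpenAI's Chat Completions API. The invoice schema below is a hypothetical example; `"strict": True` is what opts into constrained decoding, and the API call itself is commented out so the snippet stands alone.

```python
# Hypothetical schema for a level-3 (constrained decoding) request.
# Strict mode requires every property to be listed in "required" and
# "additionalProperties" to be False.
invoice_schema = {
    "name": "invoice",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["vendor", "total"],
        "additionalProperties": False,
    },
}

response_format = {"type": "json_schema", "json_schema": invoice_schema}

# client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": "Extract the invoice fields."}],
#     response_format=response_format,
# )
```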

Constrained Decoding Under the Hood

Your JSON schema gets transformed into a regex, then into a finite state machine, and finally into pre-computed token masks for each state. At inference time, before the model samples its next token, invalid tokens have their probability set to negative infinity.

Not reduced. Not discouraged. Made mathematically impossible.

If your schema says a field must be an integer, the model cannot output a letter there. If a field is required, the model cannot skip it. The constraint operates at the generation level, not as post-processing or validation. This explains why constrained decoding only works with local models or server-side implementation. You need access to the raw probability distribution over tokens. API providers like OpenAI implement this on their servers; if you're hitting their API, they're doing the logit biasing for you.
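A minimal sketch of that masking step, using NumPy in place of a real inference stack. The mask comes from whatever grammar state the decoder is in; after masking, disallowed tokens carry exactly zero probability.

```python
import numpy as np

def apply_token_mask(logits: np.ndarray, allowed: np.ndarray) -> np.ndarray:
    """Set logits of disallowed tokens to -inf so they can never be sampled."""
    masked = np.where(allowed, logits, -np.inf)
    # Softmax over the masked logits: disallowed tokens get exactly 0 probability.
    exp = np.exp(masked - masked[allowed].max())
    return exp / exp.sum()

# Toy vocabulary of five token ids. Suppose the current schema state only
# allows the digit tokens at positions 2 and 3.
logits = np.array([3.0, 1.0, 2.0, 0.5, -1.0])
allowed = np.array([False, False, True, True, False])
probs = apply_token_mask(logits, allowed)
```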

For complex schemas with nesting and recursion, simple regex matching falls apart. That's where context-free grammars come in. The vLLM team moved from FSM-based approaches (Outlines) to pushdown automata (XGrammar), achieving up to 5x performance improvement under load. The previous approach blocked entire request batches; the new one handles concurrent requests properly.
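As a toy illustration of the FSM approach, consider a character-level "vocabulary" and the pattern `"[01]+"` (a quoted binary string). Each state carries a precomputed set of allowed tokens, and decoding becomes table lookups; this is a simplified sketch of the idea, not the Outlines implementation.

```python
# Toy FSM for the pattern "[01]+" over a tiny character-level vocabulary.
# Masks are precomputed once per state; at decode time, checking whether a
# token is allowed is a set-membership test.
MASKS = {
    "start":  {'"'},             # must open the quoted string
    "opened": {'0', '1'},        # need at least one digit before closing
    "digits": {'0', '1', '"'},   # more digits, or close the string
    "done":   set(),             # accepting state: nothing more may be emitted
}
TRANSITIONS = {
    ("start", '"'): "opened",
    ("opened", '0'): "digits", ("opened", '1'): "digits",
    ("digits", '0'): "digits", ("digits", '1'): "digits",
    ("digits", '"'): "done",
}

def walk(tokens: str) -> str:
    """Feed tokens through the FSM, enforcing the mask at every step."""
    state = "start"
    for tok in tokens:
        assert tok in MASKS[state], f"token {tok!r} not allowed in state {state}"
        state = TRANSITIONS[(state, tok)]
    return state
```

An FSM like this cannot count nesting depth, which is exactly why recursive schemas push implementations toward pushdown automata.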

Function Calling vs Structured Outputs

These terms get used interchangeably, but they solve different problems.

Structured outputs constrain the entire response to match a schema. Use this when you need the model's answer in a specific format: extracting entities, generating config files, formatting data for downstream systems.

Function calling is a specialized application of structured output with an orchestration layer. The model decides whether to call a function and with what parameters. The schema constrains only the function parameters, not the entire response. Use this when the model needs to take actions: querying databases, calling APIs, triggering workflows.

The mechanical difference matters: function calling adds decision logic about which tool to invoke. The schema enforcement problem is the same; the orchestration problem is new.

One critical detail that trips people up: function descriptions are part of the prompt. Vague or ambiguous descriptions lead to hallucinated tool calls or wrong parameter values. The model is predicting what function you probably meant based on your description. Write them precisely.
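A hypothetical tool definition in the OpenAI `tools` format illustrates the point; the function name, description, and parameters here are invented, but note how the description spells out both what the tool does and when it should be used.

```python
# Hypothetical tool definition. The description is part of the prompt the
# model sees, so it should state purpose, scope, and units precisely.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": (
            "Look up the current temperature for a single city. Use only when "
            "the user asks about present-day weather, not forecasts."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
```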

Constrained decoding does not solve hallucination.

The model will fill every required field in your schema. If it doesn't know the answer, it will make something up. Schema compliance guarantees structure, not semantic accuracy. You can get perfectly valid JSON containing completely fabricated data.
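To make the distinction concrete, the check below (a hand-rolled stand-in for real schema validation, with invented field names) passes on a response whose founding year is pure invention. Structure validates; truth doesn't.

```python
import json

# Required fields and their expected types -- structure only.
SCHEMA_FIELDS = {"company": str, "founded": int, "ceo": str}

def is_schema_compliant(raw: str) -> bool:
    """Check structure only: parseable JSON with required fields of the right types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in SCHEMA_FIELDS.items())

# Every field is filled, every type is right -- and the founding year is fabricated.
fabricated = '{"company": "Acme Corp", "founded": 1823, "ceo": "Jane Doe"}'
```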

There's also evidence that structured outputs may reduce LLM reasoning capabilities compared to free-form responses.

Constraining the output format forces the model down specific token paths that may not be optimal for working through a problem. This is the reasoning tax: you're trading some cognitive flexibility for structural guarantees. The practical implication? Don't use maximally constrained outputs for tasks where the model needs to think through problems. Let it reason in prose, then extract structured data from the conclusion.
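One way to implement that split, sketched with stand-ins: `call_llm` and `call_llm_strict` are hypothetical wrappers (stubbed here with canned returns so the snippet runs) that you would replace with your provider's free-form and schema-constrained calls.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for an unconstrained, free-form completion call."""
    return "Half of 10 is 5; doubling that gives 10. Final answer: 10."

def call_llm_strict(prompt: str, schema: dict) -> dict:
    """Stand-in for a schema-constrained call (strict structured outputs)."""
    return {"answer": 10}

def solve_then_extract(problem: str) -> dict:
    # Stage 1: let the model reason freely in prose -- no format constraints.
    reasoning = call_llm(f"Think step by step and solve: {problem}")
    # Stage 2: a second, constrained call that only extracts the conclusion.
    return call_llm_strict(
        prompt=f"Extract the final numeric answer from this solution:\n{reasoning}",
        schema={
            "type": "object",
            "properties": {"answer": {"type": "number"}},
            "required": ["answer"],
        },
    )
```

The extraction call pays the structural constraint only on a short, already-reasoned answer, not on the reasoning itself.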

When to Use What

Prompt engineering makes sense when you're prototyping, the output format is flexible, or you're willing to parse multiple formats. Accept that it will sometimes fail.

JSON mode works when you need guaranteed valid JSON but don't have a strict schema. Good for exploratory extraction where you don't know the exact fields ahead of time.

Function calling fits when the model needs to decide whether and how to use tools. The output is action-oriented, not data-oriented.

Strict structured outputs are for when schema compliance is non-negotiable: financial data, API integrations, config generation, anything where malformed output breaks downstream systems.

For production workloads on open-source models, library choice matters significantly. XGrammar outperforms Outlines under load because it moved grammar compilation from Python to C++ and doesn't block concurrent requests. If you're running local inference at scale, this is worth benchmarking.

Complex Schemas Get Tricky

OpenAI caps structured outputs at 16,384 output tokens; a response that hits the cap is cut off mid-generation, leaving truncated, unparseable JSON. Deeply nested schemas also become difficult to design correctly; Pydantic-to-regex conversion can fail on complex structures, sometimes requiring hand-written regex.

Our read: keep schemas as flat as possible. Deep nesting creates multiple points of failure and makes debugging harder. If your schema is complex enough that conversion is failing, that's a signal to simplify.
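As an illustration (field names invented), the same order record expressed both ways; the flat version trades a nested `customer` object for prefixed field names, keeping the grammar one level deep.

```python
# Nested schema: two levels of objects, two levels of grammar to compile.
nested = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "email": {"type": "string"}},
            "required": ["name", "email"],
        },
        "total": {"type": "number"},
    },
    "required": ["customer", "total"],
}

# Flat equivalent: prefixed field names, one level, fewer failure points.
flat = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "customer_email": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["customer_name", "customer_email", "total"],
}
```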

The engineering investment in constrained decoding is real. The shift from interpreted Python grammars to compiled C++ implementations, from blocking to non-blocking batch handling, from FSMs to pushdown automata: this is infrastructure work that suggests structured outputs are becoming a core capability rather than an afterthought.

For developers, the practical takeaway is straightforward: stop treating structured outputs as a prompting problem. Understand whether your provider is asking nicely or guaranteeing mathematically. The difference is the gap between "usually works" and "production ready."

The harder problem remains semantic accuracy. No amount of schema enforcement will make the model know facts it doesn't know. Structured outputs give you reliable containers; filling them with reliable content is still your problem.

Frequently Asked Questions