You're not paying for intelligence. You're paying for generation.
Across every major provider, output tokens cost 3-10x more than input tokens. Anthropic charges $15 per million output tokens for Sonnet 4.5 versus $3 for input. Google's Gemini 3 Pro follows the same pattern: $12 out, $2 in. The ratio varies. The direction never does.
Why? Reading your prompt can be parallelized across GPU cores. Generating a response is inherently sequential: each token depends on the previous one, requiring repeated memory access and computation. That sequential bottleneck is expensive, and providers price accordingly. Which leads to a counterintuitive first rule of cost optimization: worry less about your prompt length and more about your output length. A verbose system prompt costs far less than a verbose response.
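The asymmetry is easy to see with the Sonnet 4.5 rates above ($3/M input, $15/M output). A quick sketch, flipping the same token counts between prompt and response:

```python
# Illustrative rates from the article: Sonnet 4.5 at $3/M input, $15/M output.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A 2,000-token system prompt with a terse 200-token reply...
verbose_prompt = request_cost(2_000, 200)    # $0.006 + $0.003 = $0.009
# ...versus a 200-token prompt with a rambling 2,000-token reply.
verbose_reply = request_cost(200, 2_000)     # $0.0006 + $0.030 = $0.0306

print(f"verbose prompt:   ${verbose_prompt:.4f}")
print(f"verbose response: ${verbose_reply:.4f}")
```

Same total tokens, but the verbose-response request costs over 3x more, which is why trimming output length pays off faster than trimming prompts.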
250x and Climbing
Current pricing spans roughly two orders of magnitude. At the budget end, Google's Gemini 2.5 Flash-Lite runs $0.10 per million input tokens. At the premium end, Claude Opus 4.6 costs $5 input and $25 output.
That's a 250x spread between the cheapest input rate and the most expensive output rate.
One analysis calculated that running a chatbot on GPT-5 would cost $1,050 per month, while the same workload on Gemini Flash would run $12. For document processing, the gap widened to 93x between frontier and efficient models. Most production workloads don't need frontier capabilities for every request. The gap between "good enough" and "best available" is where most of your budget disappears.
There's a trap that catches teams scaling up: context windows have price tiers. Anthropic's pricing increases 50% for requests exceeding 200K tokens. You're not just paying for more tokens; you're paying a premium rate for each one. Long-context use cases (RAG with large document sets, multi-turn conversations with full history, code analysis across entire repositories) can quietly push you into the expensive tier. The fix is aggressive context management: summarize earlier turns, retrieve only relevant chunks, prune aggressively.
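The tier cliff is worth making concrete. A minimal sketch, assuming (per the article) a 50% premium that reprices the entire request once it crosses the 200K-token threshold; the base rate here is illustrative:

```python
# Sketch of tiered input pricing: a hypothetical $3/M base rate with a
# 50% premium applied to EVERY token once a request exceeds 200K tokens.
BASE_RATE = 3.00 / 1_000_000   # $/input token below the tier boundary
TIER_THRESHOLD = 200_000       # tokens
TIER_MULTIPLIER = 1.5          # 50% long-context premium

def input_cost(tokens: int) -> float:
    """Crossing the threshold reprices the whole request, not just the overflow."""
    rate = BASE_RATE * TIER_MULTIPLIER if tokens > TIER_THRESHOLD else BASE_RATE
    return tokens * rate

print(f"{input_cost(199_000):.4f}")  # just under the tier
print(f"{input_cost(201_000):.4f}")  # just over: 1% more tokens, ~51% more cost
```

Going from 199K to 201K tokens adds 1% more input but roughly 51% more cost, which is why pruning context just below the threshold can matter more than pruning anywhere else.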
Caching Changes Everything
Prompt caching is the highest-leverage optimization most teams aren't using.
The concept is simple: if part of your prompt (system instructions, few-shot examples, static context) doesn't change between requests, cache it. Providers won't reprocess those tokens. Anthropic offers 90% off cached token reads with a 5-minute TTL. Google charges 50% less for cached tokens, though they add a small storage fee ($1 per million tokens per hour). For applications with large static prompts, this alone can cut costs in half.
Per IBM's overview: the provider generates a hash of your static content, checks whether that hash exists in cache, and either returns the cached computation or processes and stores the new content.
The five-minute TTL at Anthropic means you need sustained request volume to benefit, but for production workloads, that's rarely a constraint. Implementation is straightforward: separate your prompt into static (cacheable) and dynamic (per-request) portions. System prompts, documentation snippets, and few-shot examples go in the static section. User queries and conversation history go in the dynamic section.
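The static/dynamic split above can be sketched as a request payload. This follows the shape of Anthropic's Messages API, where a `cache_control` marker on the last static block tells the provider to cache everything up to that point; the model name and prompt text are illustrative placeholders, and no network call is made:

```python
# Static (cacheable) portion: must stay byte-identical across requests
# so the provider's hash lookup hits the cache.
STATIC_SYSTEM = [
    {"type": "text", "text": "You are a support assistant for Acme Corp."},
    {"type": "text",
     "text": "<few-shot examples and product documentation go here>",
     "cache_control": {"type": "ephemeral"}},  # cache everything up to here
]

def build_request(user_query: str, history: list) -> dict:
    """Static blocks go in `system`; only `messages` varies per request."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 1024,
        "system": STATIC_SYSTEM,
        "messages": history + [{"role": "user", "content": user_query}],
    }

req = build_request("How do I reset my password?", [])
print(req["system"][-1]["cache_control"])  # {'type': 'ephemeral'}
```

The key discipline is that anything edited per-request, even a timestamp, must live in the dynamic section, or the hash changes and the cache misses.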
If your workload doesn't require real-time responses, batch APIs offer even simpler math. Both Anthropic and Google offer 50% discounts for async processing. You submit requests, they process them when capacity is available, you get results later. This works for anything that isn't user-facing: nightly data processing, bulk content generation, evaluation pipelines, embedding generation at scale. The trade-off is latency; the savings are guaranteed.
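The decision rule is simple enough to write down. A back-of-envelope sketch, using an illustrative $15/M output rate and the 50% async discount from the article:

```python
# Hypothetical rates: $15/M output in real time, half price via batch API.
RATE = 15.00 / 1_000_000  # $/output token, real-time
BATCH_DISCOUNT = 0.50     # async processing discount

def job_cost(output_tokens: int, user_facing: bool) -> float:
    """User-facing jobs pay full rate; everything else goes to the batch API."""
    rate = RATE if user_facing else RATE * (1 - BATCH_DISCOUNT)
    return output_tokens * rate

print(f"${job_cost(1_000_000, user_facing=True):.2f}")   # $15.00
print(f"${job_cost(1_000_000, user_facing=False):.2f}")  # $7.50
```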
Reasoning Models Will Fool You
Reasoning models like OpenAI's o3 deserve special attention because their pricing is deceptive. The per-token rates look competitive: o3 charges $2/$8 per million tokens for input/output, cheaper than Claude Opus.
But reasoning models generate far more tokens internally.
Artificial Analysis found that o3 generated 39 million tokens during their evaluation, compared to an 11 million token average across other models. That's 3.5x more output. The "thinking" process that makes these models powerful also makes them expensive in practice, potentially 3-4x the apparent cost. A model that's 30% cheaper per token but generates 200% more tokens isn't saving you money. Benchmark your actual token consumption, not just the rate card.
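Running the article's numbers makes the trap concrete. The comparison below uses o3's $8/M output rate against the $15/M Sonnet rate cited earlier, with the 39M-versus-11M token counts from the evaluation:

```python
def effective_cost(output_rate_per_m: float, tokens_generated: int) -> float:
    """What you actually pay: rate card times tokens actually generated."""
    return output_rate_per_m * tokens_generated / 1_000_000

# o3: cheaper rate card, but 39M tokens generated over the evaluation...
o3_cost = effective_cost(8.0, 39_000_000)        # $312
# ...versus a pricier $15/M model generating the 11M-token average.
average_cost = effective_cost(15.0, 11_000_000)  # $165

print(f"o3: ${o3_cost:.0f}  vs  average model: ${average_cost:.0f}")
```

The model with the rate card nearly half the price ends up almost twice as expensive in practice, which is exactly why consumption, not the rate card, is the number to benchmark.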
Routing Is the Real Strategy
The most effective cost optimization isn't about any single technique.
Analysis of production deployments suggests that 70-80% of typical requests can be handled by cheaper models (GPT-4-class or below) without meaningful quality degradation. Only the remaining 20-30% actually need frontier capabilities. So you build a complexity classifier: analyze incoming requests, estimate difficulty, route simple queries to Haiku or Flash, escalate complex ones to Sonnet or GPT-5. One case study documented 81% savings on a production chatbot by combining model routing with prompt caching and output compression.
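A router's skeleton is short. The heuristic and model names below are illustrative placeholders, not a production classifier; real deployments often use a small model or trained classifier for the complexity estimate:

```python
CHEAP_MODEL = "haiku"       # hypothetical cheap-tier name
FRONTIER_MODEL = "sonnet"   # hypothetical frontier-tier name

def route(request: str, has_code: bool = False) -> str:
    """Crude complexity estimate: escalate long, multi-part, or code-heavy
    requests; send everything else to the cheap tier."""
    word_count = len(request.split())
    multi_part = request.count("?") > 1
    if has_code or multi_part or word_count > 300:
        return FRONTIER_MODEL
    return CHEAP_MODEL

print(route("What are your business hours?"))                    # cheap tier
print(route("Why does this crash? And how do I fix it?",
            has_code=True))                                      # frontier
```

The heuristic can be wrong in either direction; the economics still work because a misrouted simple query costs little and a misrouted hard one can be escalated on retry.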
The compound effect is where this gets interesting. Route 80% of traffic to a model that's 10x cheaper, cache 50% of token costs on the remaining 20%, and you've cut your bill by more than 80%. That's not optimization theater. That's the difference between a viable product and one that bleeds money at scale.
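The compound estimate checks out arithmetically:

```python
# Worked version of the compound estimate, on normalized monthly spend.
baseline = 1.00
routed   = 0.80 * baseline * 0.10   # 80% of traffic at 1/10 the price
frontier = 0.20 * baseline * 0.50   # remaining 20%, halved by caching
combined = routed + frontier

print(f"remaining spend: {combined:.2f}")  # 0.18 -> an 82% cut
```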
At sufficient volume, the rate card becomes a starting point. Industry guidance suggests negotiating custom terms around $5,000 in monthly spend, with more aggressive discounts available at $100,000 and above. Providers want committed volume, and they'll discount to get it. If you're spending five figures monthly on API costs, you should be talking to your provider's sales team.
Our read: LLM costs are dropping fast, but not uniformly. Frontier model pricing remains premium; the real deflation is in the mid-tier, where capable models that handle most tasks are getting dramatically cheaper. The strategic implication: design for model flexibility from day one. Hard-coding a specific model into your application locks you into its pricing trajectory. Building with model routing and prompt caching as first-class concerns means you can ride the cost curve down as cheaper models become capable enough for your use case.
The 60-80% savings are real. Whether you capture them is an architecture decision.