Most AI planning is built on a flawed assumption: that today's pricing is real. It isn't. What we're seeing right now is not a stable market for intelligence. It's a subsidized land grab. The companies that win won't be the ones optimizing for today's pricing. They'll be the ones already building for tomorrow's cost structure.
1. The Assumption Everyone's Operating On
Right now, the market for AI is sending a signal.
You can build AI-powered features for $0.001 per thousand tokens. You can subscribe to Claude, ChatGPT, or Gemini for $20 a month. You can deploy inference at scale without breaking your unit economics.
That signal looks like permission.
Permission to design systems around unlimited API access. Permission to call language models for every decision, every summarization, every classification in your product. Permission to treat inference like a free resource you simply tap whenever needed.
Companies are building their entire roadmaps around that signal.
But here's the problem: that signal is lying.
2. What's Actually Happening: A Subsidized Land Grab
This is not a stable market for intelligence.
It's a subsidized land grab.
The numbers make this obvious:
GPU supply is constrained. The H100 shortage didn't end; it just became accepted as baseline. Every major lab needs exponentially more compute just to train the next generation. Nvidia controls the supply. Lead times exist. Margins are real.
Inference costs are high. Running a trillion-parameter model through billions of requests has real computational expense. The margin economics on inference-as-a-service are thin without scale. If you're not at Anthropic, OpenAI, or Google's scale, you're not getting there.
And major labs are burning cash to lock in developers. The "$20/month unlimited AI" model? That's not pricing based on cost. That's a subsidy. Independent analysis of OpenAI's reported financials found the company spends approximately $1.35 for every $1 earned on API revenue: the API is deliberately priced below cost to win developer mindshare. Anthropic's economics are similar: at heavy usage tiers, real compute costs have been estimated to reach multiples of the subscription price. Anthropic, OpenAI, Google, and Meta are all willing to lose money on individual inference calls to lock in developers before competitors can establish beachheads.
The goal is clear: make it so cheap and convenient that you design systems that only work with their API.
That's brilliant strategy.
But it's not economics. It's a land grab.
And land grabs end.
3. The Hidden Mistake: Designing for Today
Here's where companies are quietly going wrong.
They are designing systems, workflows, and ROI models around temporary pricing conditions.
I see this constantly:
A SaaS company launches an "AI-powered" feature. It calls an LLM for every user action: summarization, categorization, content generation, decision support. Cheap. Reliable. Fast to ship.
The feature ships. Adoption grows. Cost per request stays at $0.0001. Everything works.
But the company has now locked itself into a specific economic model: low-friction, high-frequency inference.
If inference costs 3–5× more tomorrow, the unit economics get uncomfortable. If it costs 10–15× more, the feature becomes a liability.
And the architecture can't adapt. You can't pull AI out of a system that was built assuming it was free. You've baked it into the user experience, the data model, the support structure, everything.
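To make those multiples concrete, here is a minimal sketch of a margin stress test. Every figure in it (revenue per user, request volume, non-inference costs) is a hypothetical placeholder, and the names `FeatureEconomics`, `gross_margin`, and `break_even_multiplier` are mine, not from any library. Swap in your own numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureEconomics:
    """Hypothetical per-user monthly figures for one AI-powered feature."""
    revenue_per_user: float      # USD per user per month (assumed)
    other_costs_per_user: float  # hosting, support, etc. in USD (assumed)
    requests_per_user: int       # LLM calls per user per month (assumed)
    cost_per_request: float      # USD per LLM call at today's prices


def gross_margin(econ: FeatureEconomics, price_multiplier: float = 1.0) -> float:
    """Gross margin as a fraction of revenue at a given inference price multiple."""
    inference = econ.requests_per_user * econ.cost_per_request * price_multiplier
    total_cost = econ.other_costs_per_user + inference
    return (econ.revenue_per_user - total_cost) / econ.revenue_per_user


def break_even_multiplier(econ: FeatureEconomics) -> float:
    """Inference price multiple at which the feature's margin reaches zero."""
    headroom = econ.revenue_per_user - econ.other_costs_per_user
    return headroom / (econ.requests_per_user * econ.cost_per_request)


if __name__ == "__main__":
    # Illustrative inputs only: $10 revenue, $4 other costs, 5,000 calls at $0.0001.
    econ = FeatureEconomics(10.0, 4.0, 5000, 0.0001)
    for multiple in (1, 3, 5, 10, 15):
        print(f"{multiple:>2}x inference price -> margin {gross_margin(econ, multiple):.0%}")
    print(f"break-even at {break_even_multiplier(econ):.1f}x")
```

With these placeholder inputs, margin falls from 55% today to 35% at 5×, 10% at 10×, and goes negative at 15×. The shape matters more than the numbers: the higher your call frequency per user, the lower the multiple at which the feature stops paying for itself.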
As I wrote in "AI Unlocks Economics", the companies winning right now are the ones architecting for AI as a first-class cost lever. But most aren't. Most are just adding it where it's cheap.
That works until it doesn't.
4. What Real Costs Actually Look Like
Once the market stabilizes, and it will, AI pricing will converge toward actual inputs, not distribution subsidies.
That means:
- Compute - GPU cycles (H100 prices, power, cooling, depreciation, replacement)
- Memory bandwidth - RAM and VRAM constraints (the bottleneck nobody talks about)
- Latency guarantees - SLA penalties, dedicated capacity, priority routing
- Uptime and reliability overhead - redundancy, failover, monitoring, on-call costs
- Model complexity per task - smaller models for simple tasks, larger for complex ones
Not subscriptions. Not "flat access." Not fantasy bundles where you're subsidizing billion-token monthly budgets.
Real infrastructure pricing. Real operational costs. Real trade-offs.
How large is the gap between list price and real cost? Epoch AI's inference price trend data shows list prices have fallen roughly 10× per year: GPT-4 equivalent performance dropped from ~$20/M tokens (late 2022) to ~$0.40/M tokens by 2025. That is a real and remarkable decline. But it is being funded by providers operating at a loss on inference, not by sustainable unit economics. OpenAI's financials show approximately $1.35 spent per $1 earned on API revenue, which is below-cost pricing used as a market share strategy. The listed price drop and the sustainable price are not the same number.
H100 GPU rental on dedicated infrastructure (Lambda Labs, CoreWeave) stabilized at $2.85–3.50/hour in 2025, down 64–75% from peak. Frontier model inference throughput at those rates implies a real compute cost in the range of $15–50 per million output tokens for large models, versus current API pricing of $3–15/M for comparable capability. The honest range of "how much higher could sustainable pricing be" is somewhere between 2× and 10×, depending on your latency tier, batch efficiency, and model class. The precise multiple matters less than the direction: list prices are subsidized today and will converge toward real costs as the competitive market share race ends.
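The arithmetic behind a per-token compute cost is simple enough to sketch. The function below is a back-of-envelope estimate, not a benchmark: the replica size and throughput figures are assumptions you would replace with your own measurements, and it ignores everything except raw GPU rental (no networking, storage, staff, or redundancy).

```python
def cost_per_million_tokens(
    gpu_hourly_rate: float,
    gpus_per_replica: int,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Raw GPU rental cost in USD per million output tokens for one model replica.

    tokens_per_second is the aggregate output throughput of the whole replica
    at full load; utilization is the fraction of rented time spent serving
    real traffic. Both are inputs you must measure, not constants.
    """
    if gpus_per_replica <= 0 or tokens_per_second <= 0:
        raise ValueError("gpus_per_replica and tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    replica_cost_per_hour = gpu_hourly_rate * gpus_per_replica
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return replica_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical replica: 8 GPUs at $3.00/hour, 400 output tokens/second in aggregate.
    print(round(cost_per_million_tokens(3.00, 8, 400), 2))
    print(round(cost_per_million_tokens(3.00, 8, 400, utilization=0.5), 2))
```

Under those assumed inputs the result is about $16.67 per million tokens at full utilization and about $33.33 at 50% utilization. Utilization is the lever to watch: idle capacity is paid for either way, which is part of why batch efficiency and latency tier move the multiple so much.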
When that happens, the current "efficiency" of shipping AI everywhere becomes what it actually is: massively wasteful.
5. The Real Optimization Problem
And this is where most companies get it backwards.
They ask: "How do we use more AI?"
The companies that survive cost normalization will ask: "How do we design systems that minimize unnecessary inference while maximizing output quality?"
Those are completely different problems.
The first leads to: bloated inference pipelines, redundant API calls, wasteful reranking, over-engineered summarization, and weak unit economics.
The second leads to: intelligent caching, hybrid architectures (AI for the 10% of decisions that matter, heuristics for the 90%), batch processing instead of real-time, smaller models for fast classification, larger models only for complex reasoning.
As I explored in "The Hidden Cost of AI-Generated Code", this applies even to code generation: the cost isn't the API call. The cost is the technical debt of maintaining globally fragile systems built by optimizing locally for speed. That's a cost that appears later, silently, in maintenance burden.
The same principle applies everywhere AI touches your system.
Low-frequency, high-value inference = defensible.
High-frequency, low-value inference = fragile.
Companies designing for the first architecture now will have pricing power tomorrow.
Companies designed for the second will have a restructuring problem.
6. Who Actually Wins When Pricing Normalizes
Here's the uncomfortable truth:
The companies that win won't be the ones optimizing for today's pricing.
They'll be the ones already building for tomorrow's cost structure.
That means:
- Treating inference as a constrained resource, not a commodity
- Building hybrid systems that use AI where it compounds (complex reasoning, pattern recognition, content generation) and leave heuristics everywhere else
- Investing in edge inference and smaller models to reduce API dependency
- Designing for cacheability - fewer novel problems, more cached solutions
- Engineering for interrogation - systems that can explain why they called an LLM, and prove it was worth it
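One way to make inference interrogable is to refuse to log a call without a stated reason and to price every call at the moment it happens. The sketch below is a hypothetical in-memory ledger: the class names, field names, and per-token prices are illustrative assumptions, not any provider's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InferenceRecord:
    feature: str        # product feature that triggered the call
    model: str          # model tier used
    reason: str         # why a model was needed instead of a rule
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class InferenceLedger:
    """In-memory record of every LLM call and what it cost."""
    records: list[InferenceRecord] = field(default_factory=list)

    def log(self, feature: str, model: str, reason: str,
            input_tokens: int, output_tokens: int,
            price_in_per_m: float, price_out_per_m: float) -> InferenceRecord:
        """Record one call; prices are USD per million tokens."""
        if not reason.strip():
            raise ValueError("every LLM call needs a stated reason")
        cost = (input_tokens * price_in_per_m
                + output_tokens * price_out_per_m) / 1_000_000
        record = InferenceRecord(feature, model, reason,
                                 input_tokens, output_tokens, cost)
        self.records.append(record)
        return record

    def spend_by(self, key: str) -> dict[str, float]:
        """Total spend grouped by 'feature' or 'model'."""
        if key not in ("feature", "model"):
            raise ValueError("key must be 'feature' or 'model'")
        totals: dict[str, float] = defaultdict(float)
        for record in self.records:
            totals[getattr(record, key)] += record.cost_usd
        return dict(totals)
```

In a real system the records would go to your metrics or billing pipeline rather than a list. The design choice worth keeping is the mandatory `reason` field: it turns "why did we pay for this?" from an archaeology project into a query.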
As I wrote in "The 5 Files You Must Still Review", the standard that separates builders from generators is: can you defend every decision your system made? For AI, that standard is even more brutal. If you can't defend why you paid for that inference, you've wasted it.
Companies building these standards now, while inference is cheap, will have architectural advantage when inference is expensive.
The ones who don't? They'll face a choice:
- Restructure the entire system to reduce inference (slow, painful, risky)
- Keep the bloated system and accept lower margins (slow death)
- Kill the feature entirely (fast death)
None of those are good options.
The time to fix it is now.
7. What This Means for Your Team
If you're building AI-powered features today, ask yourself:
Will this architecture still make sense if inference costs 3–10× what it costs today?
If the answer is "no," you're not building for the future. You're building for a subsidy.
And subsidies always end.
Start now:
- Measure inference like a cost line item. Every API call, every token, every model selection should be tracked like infrastructure spend. Make it visible. Make it defendable.
- Build hybrid systems. AI for the decisions that matter. Rules and heuristics for everything else. Your margins will be better. Your latency will be faster. Your system will be simpler.
- Invest in smaller models. Haiku is faster and cheaper than Opus, and working within its constraints forces clarity about what the task actually requires. The same principle applies to task-specific models. Spend engineering time now to reduce model size and complexity later.
- Design for cacheability. If you're solving the same problem twice, cache it. If you're calling an LLM for identical inputs, cache the output. If you're doing redundant inference, stop.
- Treat inference reduction like a feature. Make it a sprint goal. Make it a roadmap item. Make it a metric you track. Not because it's fashionable, but because it's going to be necessary.
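The cacheability point is the easiest of these to start on. Below is a minimal sketch of an exact-match cache wrapped around any function that takes a prompt and returns text. `CachedModel` and `call_model` are hypothetical names; the wrapper assumes that returning a stored answer for an identical prompt is acceptable for your use case, and it deliberately leaves out eviction, expiry, and persistence.

```python
import hashlib
from typing import Callable


class CachedModel:
    """Wraps a prompt -> text callable and reuses results for identical prompts."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, prompt: str) -> str:
        # Hash the prompt so cache keys stay small even for long inputs.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._call_model(prompt)  # the only place inference is paid for
        self._cache[key] = result
        return result

    @property
    def hit_rate(self) -> float:
        """Fraction of calls answered from cache (0.0 before any call)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Track `hit_rate` alongside spend. A low hit rate on a feature you expected to be repetitive usually means the prompts contain noise (timestamps, user IDs, whitespace) that should be normalized before the call.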
The companies that do this while inference is cheap will have pricing power, architectural clarity, and customer defensibility when the market normalizes.
The ones that don't will have a rewrite in their future.
One point worth clarifying: this article is not an argument against using AI freely. "The Haiku-First Engineer" makes the case that using smaller, cheaper models extensively is good engineering discipline, because the cost per call is low enough to experiment without hesitation. Both arguments converge on the same principle: use the cheapest model appropriate for the task, and design out unnecessary calls to expensive frontier models. The risk this article is flagging is not experimentation with Haiku at $0.001 per call. It is systems architected around $0.05-per-call frontier inference as if that price will hold indefinitely.
Frequently Asked Questions
If prices have been falling 10× per year, why should I worry about cost increases?
Because the decline reflects market competition and subsidized pricing, not sustainable economics. Major providers are currently selling inference at a loss to capture market share. When the competitive dynamic shifts (through consolidation, VC funding pressure, or the race to profitability), the floor under published prices disappears. Planning for $0.40/M tokens in perpetuity is planning for a competitor's generosity to last forever.
Which model tiers should I use for which tasks?
A practical rule: use the cheapest model that gets the answer right 95%+ of the time for that specific task. Classification, routing, summarization of short text, and yes/no decisions: Haiku-class ($0.25–1/M tokens). Multi-step reasoning, code generation, and complex instruction following: Sonnet-class ($3–8/M). Tasks that require frontier reasoning where errors are expensive: Opus-class ($15/M+). Most production systems that audit carefully find 60–70% of their inference budget can be moved to cheaper tiers without measurable quality loss.
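The 95% rule can be applied mechanically once you have a labeled evaluation set for the task. The sketch below assumes you have already run each tier against that set and counted correct answers. The tier names, prices, and scores in the example are placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TierResult:
    """Outcome of evaluating one model tier on a task-specific test set."""
    name: str
    price_per_m_tokens: float  # USD per million tokens (placeholder values)
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def cheapest_adequate_tier(
    results: Sequence[TierResult], min_accuracy: float = 0.95
) -> Optional[TierResult]:
    """Return the lowest-priced tier meeting the accuracy bar, or None if none does."""
    adequate = [r for r in results if r.total > 0 and r.accuracy >= min_accuracy]
    return min(adequate, key=lambda r: r.price_per_m_tokens, default=None)


if __name__ == "__main__":
    # Made-up evaluation counts for one task, 200 labeled examples per tier.
    results = [
        TierResult("small", 1.0, 188, 200),
        TierResult("mid", 3.0, 193, 200),
        TierResult("frontier", 15.0, 198, 200),
    ]
    choice = cheapest_adequate_tier(results)
    print(choice.name if choice else "no tier meets the bar")
```

With these made-up counts the small tier scores 94% and misses the bar, so the function picks the mid tier. Rerun the evaluation per task, not per product: the tier that fails at nuanced reasoning may be more than adequate for routing.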
Should I lock in pricing now before it rises?
Enterprise contracts with annual commit discounts make sense if you have predictable volume. Spot pricing or pay-as-you-go works if your usage is volatile. What does not make sense is assuming you need to lock in current list prices against future rises; most providers offer volume discount structures that track market rates anyway. The architectural question (can this system survive 3–10× pricing?) matters more than any spot price hedge.
What does a hybrid architecture actually look like?
The pattern: rules/heuristics as the first gate (zero cost, instant), small models for classification and routing (cheap, fast), frontier models only for the decisions that justify the cost. An example: a customer support triage system that uses keyword matching to resolve 40% of queries, Haiku to classify intent for another 40%, and Sonnet only for the 20% that require nuanced reasoning. The unit economics change dramatically, and the system is more auditable because each decision layer is explicit.
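A minimal sketch of that three-gate pattern follows. The keyword rules, the `classify_intent` and `reason` callables, and the per-query costs are all hypothetical stand-ins: in a real system the two callables would wrap your small-model and larger-model clients.

```python
from typing import Callable, Optional

# Hypothetical keyword rules: phrase -> canned resolution.
KEYWORD_RULES = {
    "reset password": "Send the password-reset link.",
    "invoice copy": "Send the latest invoice PDF.",
}


def match_rule(query: str) -> Optional[str]:
    """Gate 1: zero-cost keyword matching."""
    text = query.lower()
    for phrase, answer in KEYWORD_RULES.items():
        if phrase in text:
            return answer
    return None


def triage(
    query: str,
    classify_intent: Callable[[str], Optional[str]],
    reason: Callable[[str], str],
) -> tuple[str, str]:
    """Return (tier_used, response), escalating only when a cheaper gate fails."""
    answer = match_rule(query)
    if answer is not None:
        return "rules", answer
    intent = classify_intent(query)  # Gate 2: small model, None when unsure
    if intent is not None:
        return "small_model", f"Route to the {intent} queue."
    return "frontier_model", reason(query)  # Gate 3: expensive reasoning


def blended_cost(mix: dict[str, float], cost_per_query: dict[str, float]) -> float:
    """Expected cost per query given the share of traffic each tier handles."""
    return sum(share * cost_per_query[tier] for tier, share in mix.items())
```

Plugging in the 40/40/20 split from the example above with placeholder costs of $0 for rules, $0.001 for the small model, and $0.02 for the larger one, `blended_cost` comes to roughly $0.0044 per query, against $0.02 if every query went to the larger model. Returning the tier alongside the response is what makes each decision layer auditable.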
Conclusion
The window for designing for real costs is now. Inference subsidies will not last forever, and the companies that recognize this early will have a structural advantage. Nobody knows the exact multiple; it depends on model tier, latency requirements, and how the GPU supply constraint resolves. Build for a 3–10× scenario. That range is defensible. Designing for zero cost is not.
Sources:
- Investing.com (2026). "The AI Token Pricing Crisis Behind OpenAI and Anthropic's Revenue Race." Documents OpenAI's below-cost API pricing; estimated ~$1.35 spend per $1 earned on inference.
- Epoch AI (2025). "LLM inference prices have fallen rapidly but unequally across tasks." Price trajectory data showing ~10× annual decline in GPT-4 equivalent performance cost. Used in Section 4 to contextualize the subsidy argument.
- Daniel Miessler (2024). "Inference Costs Are Not Sustainable." Analysis of true inference cost structure vs. published API pricing.
- CoreWeave / Lambda Labs (2025 market rates): H100 dedicated GPU rental $2.85–3.50/hour (down 64–75% from peak); basis for frontier model compute cost estimates in Section 4. Rates from GMI Cloud GPU pricing comparison (May 2026).