Perspective from an engineering leader at a software consulting firm
A year ago, the AI conversation in most engineering organizations was about capability: can these models actually help us ship faster, write better code, automate the tedious parts of our work? That question has largely been answered. The conversation I’m having now with my own teams and with clients is different. It’s about cost, and more specifically, about the uncomfortable moment when finance asks why the API bill tripled last quarter and nobody can say exactly where the money went.
This is a solvable problem, but it requires treating LLM spend the way we learned to treat cloud spend a decade ago: with visibility, attribution, and a few well-placed guardrails. Here’s what we’ve found that works today.
Start with a single front door
The most common failure mode I see is fragmentation. One team has an OpenAI key, another is hitting Anthropic directly, a third spun up something on Bedrock, and a few developers are quietly expensing personal subscriptions. Nobody has the full picture, and when costs spike, the investigation is archaeology.
The fix is an LLM gateway: a proxy layer that all model traffic flows through. Whether you adopt an open-source option, a commercial platform, or build something lightweight yourself, the value is the same. You get one place to manage credentials, log usage, and enforce limits, plus one interface that lets you swap models and providers without rewriting application code. Everything else in this article gets dramatically easier once the gateway exists. Without it, you’re trying to govern traffic you can’t see.
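To make the idea concrete, here is a minimal sketch of the gateway's core job, assuming a setup where each provider is wrapped in a plain function. The names `Gateway`, `UsageRecord`, and `fake_provider` are hypothetical, and the stub provider stands in for a real SDK call; a production gateway would also handle credentials, streaming, and errors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical provider signature: (model, prompt) -> (response_text, tokens_used)
ProviderFn = Callable[[str, str], Tuple[str, int]]


@dataclass
class UsageRecord:
    key_id: str
    provider: str
    model: str
    tokens: int


@dataclass
class Gateway:
    """Single entry point: every call is routed and logged in one place."""
    providers: Dict[str, ProviderFn]
    log: List[UsageRecord] = field(default_factory=list)

    def complete(self, key_id: str, provider: str, model: str, prompt: str) -> str:
        if provider not in self.providers:
            raise KeyError(f"unknown provider: {provider}")
        text, tokens = self.providers[provider](model, prompt)
        # Record who called what, so spend can be attributed later.
        self.log.append(UsageRecord(key_id, provider, model, tokens))
        return text


def fake_provider(model: str, prompt: str) -> Tuple[str, int]:
    # Stand-in for a real SDK call; the token count is a crude word count.
    return f"[{model}] ok", len(prompt.split())
```

Because application code only talks to `Gateway.complete`, swapping a provider means changing one entry in the `providers` mapping rather than touching every caller.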
Make spend visible and attributable
Aggregate cost numbers are nearly useless. “We spent $40K on inference last month” tells you nothing actionable. What you need is attribution: which team, which application, which use case. Issue scoped keys or tags per project so every token traces back to an owner.
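As a rough illustration of attribution, the sketch below rolls logged usage up by team and application. The row layout, the `PRICES` table, and the per-million-token prices are all illustrative assumptions, not real provider pricing.

```python
from collections import defaultdict

# Illustrative prices in USD per million tokens: (input, output).
PRICES = {"small": (0.50, 1.50), "large": (5.00, 15.00)}


def cost_usd(model, tokens_in, tokens_out):
    """Cost of one logged call under the illustrative price table."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


def spend_by_owner(rows):
    """Sum cost per (team, app). Each row is assumed to be
    (team, app, model, tokens_in, tokens_out)."""
    totals = defaultdict(float)
    for team, app, model, tokens_in, tokens_out in rows:
        totals[(team, app)] += cost_usd(model, tokens_in, tokens_out)
    return dict(totals)
```

The same rollup keyed by use case or by model answers the other questions finance tends to ask.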
Once you have attribution, budgets become possible. We favor soft budgets with alerts over hard cutoffs for most workloads. You don’t want a production feature failing silently because it crossed a monthly threshold on the 28th. But hard caps absolutely belong on experimental keys and individual developer sandboxes. In our experience, the expensive incidents are sustained overuse, retry loops without backoff, an agent stuck calling the same tool in a cycle, and a batch job that someone pointed at the wrong dataset. Per-key rate limits and maximum-token ceilings turn what would have been a five-figure surprise into a log entry.
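The soft-versus-hard distinction can be sketched in a few lines. This is an assumption-laden example, not a reference implementation: the `Budget` class, the 80% alert threshold, and the status strings are all hypothetical choices.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised only for hard-capped keys (sandboxes, experiments)."""


@dataclass
class Budget:
    limit_usd: float
    hard: bool              # True: reject over-limit charges; False: alert only
    alert_at: float = 0.8   # fraction of the limit that triggers an alert
    spent_usd: float = 0.0

    def charge(self, amount_usd: float) -> str:
        projected = self.spent_usd + amount_usd
        if self.hard and projected > self.limit_usd:
            # Hard cap: refuse the call and leave the balance untouched.
            raise BudgetExceeded(f"{projected:.2f} > {self.limit_usd:.2f}")
        self.spent_usd = projected
        if projected > self.limit_usd:
            return "over"    # soft budget: keep serving, page the owner
        if projected >= self.alert_at * self.limit_usd:
            return "alert"
        return "ok"
```

A production key would use `hard=False`, so crossing the limit produces an alert rather than an outage, while a sandbox key would use `hard=True`.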
Match the model to the job
The instinct on most teams is to default to the most capable model available for everything, because why wouldn’t you want the best? But a large share of LLM calls in a typical application are doing simple work: classifying a ticket, extracting fields from a document, routing a request, summarizing a paragraph. Smaller models handle these tasks well at a fraction of the price, often one-tenth or less per token.
A deliberate tiering strategy pays for itself almost immediately. Reserve frontier models for the genuinely hard problems: complex reasoning, nuanced code generation, multi-step planning. Send everything else down-market. Implementing tiering means defining routing rules that map task types to model classes: a classification job goes to a small model, a reasoning task to a larger one. Then revisit those mappings as new models ship, prices drop, and capabilities move between tiers, which happens on a near-monthly cadence.
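A routing rule set can be as simple as a lookup table. The sketch below assumes task types are labeled by the calling application; the task names, tier names, and model identifiers are placeholders, not real model names.

```python
# Task type -> tier. Revisit this table as prices and capabilities shift.
ROUTES = {
    "classify": "small",
    "extract": "small",
    "route": "small",
    "summarize": "small",
    "reasoning": "frontier",
    "codegen": "frontier",
    "planning": "frontier",
}

# Tier -> concrete model identifier (placeholder names).
TIER_MODELS = {
    "small": "example-small-model",
    "frontier": "example-frontier-model",
}


def pick_model(task_type: str, default_tier: str = "frontier") -> str:
    """Return the model for a task type; unknown tasks use default_tier."""
    tier = ROUTES.get(task_type, default_tier)
    return TIER_MODELS[tier]
```

Defaulting unknown task types to the larger tier favors quality over cost. Flipping the default to `"small"` is an equally defensible choice if unlabeled traffic is mostly routine.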
Beyond tiering, a few mechanical optimizations compound: prompt caching for applications that reuse large system prompts or shared context, batch APIs for anything asynchronous (the discounts are typically around half off), and plain old prompt hygiene: trimming bloated instructions and unnecessary context. For agentic coding tools, watch session length. An agent carrying an entire repository’s context through a long working session burns tokens at a rate that surprises people the first time they see the invoice.
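A back-of-the-envelope estimator shows how these optimizations stack. Everything numeric here is an assumption for illustration: the prices, the cached-read factor, and the simplification that cache writes carry no surcharge. Check your provider's actual pricing before relying on the output.

```python
def monthly_cost(calls, prompt_tokens, shared_prefix_tokens, output_tokens,
                 price_in, price_out, cached_read_factor=1.0, batch_factor=1.0):
    """Estimated monthly USD cost; prices are USD per million tokens.

    cached_read_factor: price multiplier for the shared prefix when it is
        served from a prompt cache (1.0 means no caching).
    batch_factor: multiplier for batch pricing (0.5 means half off).
    """
    fresh_tokens = prompt_tokens - shared_prefix_tokens
    input_cost = (fresh_tokens + shared_prefix_tokens * cached_read_factor) * price_in
    output_cost = output_tokens * price_out
    return calls * (input_cost + output_cost) * batch_factor / 1_000_000


# Illustrative workload: 100,000 calls, 5,000-token prompts of which
# 4,000 tokens are a shared system prompt, 500 output tokens per call.
baseline = monthly_cost(100_000, 5_000, 4_000, 500, 3.0, 15.0)
with_cache = monthly_cost(100_000, 5_000, 4_000, 500, 3.0, 15.0,
                          cached_read_factor=0.1)
with_batch = monthly_cost(100_000, 5_000, 4_000, 500, 3.0, 15.0,
                          batch_factor=0.5)
```

Under these made-up numbers the baseline comes to about $2,250, caching the shared prefix brings it to about $1,170, and batch pricing alone to about $1,125.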
Distinguish seats from tokens
It helps to recognize that AI spend comes in two flavors with very different risk profiles. Seat-based tools (e.g., coding assistants and chat subscriptions) are predictable. You know the per-user price, you know your headcount, and the worst case is paying for licenses nobody uses. Most businesses opt for an enterprise license, either for compliance reasons or because they have outgrown a lower tier’s seat cap (for example, a 150-seat limit).
API consumption is where variance lives. It scales with usage, with traffic, with bugs, and with enthusiasm. This is the spend that needs the gateway, the attribution, and the limits. Treating both categories with the same controls either over-burdens the predictable spend or under-protects the variable spend.
Keep the process lightweight
The temptation, once costs become a leadership concern, is to install a heavyweight approval process for anything AI-related. Resist it. The teams experimenting most freely with these tools are often the ones finding the highest-value applications, and a committee between a developer and an idea is how you end up with shadow IT, which puts you right back where you started, except now the usage is invisible again.
What works better is a registration model: any new AI-powered use case gets a key, a budget, and an owner. With a mature observability and Site Reliability Engineering (SRE) strategy, the friction is minutes, not weeks, and in exchange you get visibility into everything that’s running. Review the spend dashboard monthly, celebrate the use cases delivering value, and have honest conversations about the ones burning tokens without much to show for it.
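The registration step itself can be very small. The sketch below is one possible shape, with a hypothetical `Registry` class and a made-up `gw_` key prefix; a real system would persist entries and hand the key to the gateway.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UseCase:
    name: str
    owner: str
    monthly_budget_usd: float
    api_key: str


class Registry:
    """In-memory registry: every use case gets a key, a budget, and an owner."""

    def __init__(self) -> None:
        self._by_key: Dict[str, UseCase] = {}

    def register(self, name: str, owner: str, monthly_budget_usd: float) -> UseCase:
        if not owner:
            raise ValueError("every use case needs a named owner")
        if monthly_budget_usd <= 0:
            raise ValueError("budget must be positive")
        # "gw_" is an arbitrary illustrative prefix for gateway-issued keys.
        key = "gw_" + secrets.token_hex(16)
        use_case = UseCase(name, owner, monthly_budget_usd, key)
        self._by_key[key] = use_case
        return use_case

    def lookup(self, api_key: str) -> Optional[UseCase]:
        return self._by_key.get(api_key)
```

The only checks are that an owner and a positive budget exist, which keeps the process a matter of minutes while still giving the dashboard something to attribute against.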
The payoff
None of this is exotic. It’s the same discipline engineering organizations eventually applied to cloud infrastructure: centralize access, attribute costs, set guardrails, and right-size resources to workloads. The organizations that struggled with cloud spend were the ones that waited until the bill forced the issue. The same pattern is playing out with LLMs, just faster.
The good news is that the toolchain has matured quickly. The three practices that do most of the work, routing traffic through a gateway, attributing spend against scoped budgets, and tiering models to match tasks, are well within reach for any engineering organization willing to prioritize them. Get the visibility first. Everything else follows from being able to see where the money goes.