Summary: Optimizing GenAI spend requires a transition from simple “Pay-As-You-Go” models to Provisioned Capacity, treated as a strategic commitment rather than a simple discount mechanism. Provisioned models can offer dedicated throughput and lower latency, but they also introduce risks such as Idle Allocated Capacity and vendor-specific rigidities (e.g., AWS’s model-specific locks vs. Azure’s flexible PTU pools). Be aware that provisioned units can cost more per token than shared tiers, meaning the investment must be justified by SLA requirements and performance needs (TTFT/OTPS) rather than pure unit-cost savings. Leverage “spillover” logic to handle peak bursts, and perform rigorous load testing, as advertised Tokens Per Minute (TPM) limits rarely reflect real-world workload complexity.
In the first two installments of our series we explored the new challenges of GenAI FinOps and demystified the “Fuzzy Math of Token Pricing.” We established that the advertised cost-per-token is misleading and that hidden costs, like Context Window Creep, can dominate your spend.
Now, we tackle one of the most significant and costly decisions a FinOps team will face: choosing the right capacity model.
When enterprises first adopt GenAI, they almost universally start with shared capacity (also called on-demand or pay-as-you-go). It’s the default: a straightforward, consumption-based API where you pay for what you use. This model is flexible and easy to start with, but it operates in a shared “public” pool, which means you have no guarantee of performance, and latency can spike during peak hours.
To solve this, vendors offer provisioned capacity (or reserved capacity). On the surface, this seems familiar. You pay up-front to reserve a dedicated block of resources, often at a discount. However, the unique nature of GenAI models makes this decision far more complex than reserving a cloud virtual machine.
Moving to provisioned capacity isn’t just a cost-saving tactic; it’s a strategic commitment that can have consequences for your bill, your application’s performance, and your ability to innovate.
Note that this blog covers inference models offered by hyperscaler-managed services, such as AWS Bedrock, Microsoft Foundry, and GCP Vertex. It does not cover provisioning custom models produced by model training services (like Amazon SageMaker or Azure Machine Learning) or the related data pipelines.
The first question most FinOps practitioners ask is, “Will reserved capacity save us money?” The answer is a frustrating “it depends,” and it depends on the shape of your traffic and the vendor.
Provisioned capacity is purchased for a fixed term (e.g., one month, one year), giving you a set amount of throughput per minute. You are paying for this capacity 24/7, whether you use it or not.
This leads to important trade-offs, starting with how you handle demand that exceeds your reservation.
Spillover is a feature that automatically routes traffic to the standard, pay-as-you-go shared tier after your provisioned capacity has been fully utilized. Imagine your reserved capacity can handle 1,000 requests per minute. If a sudden spike sends 1,200 requests, spillover automatically sends the extra 200 requests to the shared tier instead of returning a “429 throttled” error. This can reduce outage risk and save money.
If some spillover can be tolerated, it can be used to increase reserved capacity utilization and lower costs. For use cases that require low latency, the key is aligning on the tolerance and SLA thresholds, i.e., percentage of requests that can spillover while maintaining SLAs.
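The routing logic behind spillover can also be implemented at the application layer if your provider doesn’t offer it natively. Below is a minimal Python sketch with stubbed-out endpoints; `call_provisioned` and `call_shared` are hypothetical placeholders, not real SDK calls.

```python
class ThrottledError(Exception):
    """Raised when the provisioned endpoint returns HTTP 429 (throttled)."""

def call_provisioned(prompt: str) -> str:
    # Stub: pretend the reservation saturates on long prompts.
    if len(prompt) > 20:
        raise ThrottledError("429: provisioned throughput exhausted")
    return f"[provisioned] {prompt}"

def call_shared(prompt: str) -> str:
    # Stub for the pay-as-you-go tier: always accepts, variable latency.
    return f"[shared] {prompt}"

def route_with_spillover(prompt: str) -> str:
    """Try the reserved endpoint first; spill to shared on throttling."""
    try:
        return call_provisioned(prompt)
    except ThrottledError:
        return call_shared(prompt)
```

In practice you would also count how often spillover fires, since that ratio is exactly the SLA-tolerance threshold discussed above.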
For use cases that scale with business hours and don’t have strict latency requirements, reserved capacity can be used purely as a cost-reduction tool. You can think of reserved capacity like a Savings Plan or Committed Use Discount (CUD), where you manage purchases against a coverage target that gives you the lowest cost per token. The right coverage level will vary depending on the throughput per capacity unit, which varies widely across models. In some cases, reserved capacity won’t reduce costs at all.
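The coverage math can be sketched in a few lines. All prices and throughput figures here are hypothetical; plug in your vendor’s actual numbers.

```python
def effective_rate_per_mtok(price_per_day: float, tpm: float,
                            utilization: float) -> float:
    """Effective $ per 1M tokens for a reservation at a given utilization (0-1]."""
    tokens_per_day = tpm * 60 * 24 * utilization
    return price_per_day / tokens_per_day * 1_000_000

def breakeven_utilization(price_per_day: float, tpm: float,
                          on_demand_rate_per_mtok: float) -> float:
    """Minimum utilization at which the reservation beats pay-as-you-go.

    A result above 1.0 means the reservation never wins on cost alone.
    """
    full_rate = effective_rate_per_mtok(price_per_day, tpm, 1.0)
    return full_rate / on_demand_rate_per_mtok

# Example: a $75/day unit rated at 25,000 TPM vs. a $2.50/1M on-demand rate
# breaks even at roughly 83% utilization.
be = breakeven_utilization(75.0, 25_000, 2.50)
```

Running the same figures against a $1.25/1M on-demand rate returns about 1.67, i.e. the reservation never beats pay-as-you-go at any utilization, which is the “won’t reduce costs at all” case.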
When provisioning capacity, there are two new types of waste that you can end up paying for. Imagine you’ve reserved a block of capacity on a hyperscaler for a specific model (e.g., Claude 4.5 Sonnet) for a month, but your application traffic is low. You are paying for 100% of that reservation while using only 15% of it at peak. The remaining 85% is Idle Allocated Capacity. The true cost of this idle time is amplified if your primary, running workload has a high proportion of expensive operations, such as generating a large volume of output tokens (which are approximately 3x more computationally expensive than input tokens). In effect, the financial loss here isn’t just a slight dip in utilization; it’s a massive overpayment for unused, premium service. This is the most common form of waste.
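The arithmetic behind that waste is simple but worth making explicit; the monthly cost below is a made-up figure.

```python
# Idle Allocated Capacity: you pay for the whole reservation
# regardless of how much of it your traffic actually consumes.
monthly_reservation_cost = 50_000.00  # hypothetical monthly commitment
peak_utilization = 0.15               # 15% used at peak, per the example

idle_spend = monthly_reservation_cost * (1 - peak_utilization)
# 85% of the reservation buys nothing: $42,500 of idle spend per month.
```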
We will describe the second form of waste, which is unique to Azure, in the next section.
The biggest mistake a FinOps team can make is assuming all “reserved capacity” is the same. The economic and strategic differences between the hyperscalers are profound. Note that at the time of writing, no hyperscaler offers reservations with access to all of the top-tier foundation LLMs. For example, you can only get OpenAI models on Azure and Gemini models on GCP.
On AWS, you reserve capacity for a specific model. For example, you can buy a one-month reservation for “Anthropic Claude 4.5 Sonnet”.
On GCP, you reserve capacity for a specific model much like you do for AWS. However, GCP allows you to change the model to another model from the same publisher. For example, you can switch the model from Google Gemini 2.0 Pro to Google Gemini 2.0 Flash, but you can’t switch from Google Gemini 2.0 Flash to Anthropic Claude 4.5 Sonnet.
Note that when switching, you cannot reduce the amount of reserved capacity, even if the new model requires less capacity to operate.
Azure handles this differently. You don’t reserve a specific model; you reserve a pool of Provisioned Throughput Units (PTUs). This is like renting a block of generic GPU power from Azure.
For example, imagine you make a 500-PTU reservation. From there, you deploy models against that pool using a certain number of those PTUs. You might assign 100 PTUs to gpt-4o and 50 PTUs to Deepseek-V3, for example. The higher the number of PTUs, the higher the number of tokens per minute available for inference. Different models have different PTU minimums to run effectively, with the larger models requiring more PTUs.
If you reserve 500 PTUs but only deploy models that use 100 units, the remaining 400 are “unallocated.” You are paying for them, but they aren’t assigned to any model deployment and are generating zero value. Furthermore, because the reservation and deployment are separate, making a reservation on Azure does not guarantee that capacity is available for the models you want to deploy. If a new model comes out and there is no available capacity for it, you may end up paying for long periods of Unallocated Capacity while you wait to use your PTUs with the new model.
Guidance from Azure is to deploy the models first, then make the reservation afterwards. However, this does not work if you have a pre-existing reservation and you are looking to switch models.
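The allocation accounting looks like the sketch below; the pool size, deployment names, and PTU counts are illustrative only, not Azure’s actual minimums.

```python
# Tracking Unallocated Capacity in an Azure-style PTU pool.
reserved_ptus = 500
deployments = {"gpt-4o": 100}  # only one model deployed against the pool

allocated_ptus = sum(deployments.values())
unallocated_ptus = reserved_ptus - allocated_ptus       # paid for, but idle
unallocated_share = unallocated_ptus / reserved_ptus    # fraction of pure waste
```

A report like this, run daily per reservation, is the simplest way to surface Idle Unallocated Capacity before it compounds over a billing term.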
While the Azure PTU model offers flexibility as new models are released, it introduces a new form of waste to manage (Idle Unallocated Capacity) and separates the reservation from guaranteed capacity availability. The AWS and GCP reservation systems offer superior cost and capacity predictability at the cost of being more rigid over longer-term reservations. An organization might prefer a single, quantifiable risk (paying for a less-efficient model later) over managing a more complex system with an additional source of potential waste and an additional layer of management for the PTU-to-model allocation. The choice is ultimately a trade-off between model flexibility and operational and financial simplicity.
Further, as called out above, the different hyperscalers offer provisioned capacity for different models, and there is no single provider that currently offers all of the best-in-class foundational models. This means that, based on what your engineering teams desire to use for your products, you may be forced to leverage reservations from multiple sources and deal structures.
As discussed above, a core differentiator among vendors is how they package provisioned capacity. Each has its own type of “capacity unit”, and you don’t want to assume the per-token rate is always lower with provisioned capacity. Pricing for these units can be billed on an hourly, daily, weekly, monthly, or annual basis, with each vendor offering different options. To compare your cost per token purchased with the standard rates, you must normalize them. The results may surprise you.
| Input bundle | Output bundle |
| --- | --- |
| 25,000 TPM | 2,500 TPM |
| $75.00 per unit / day | $60.00 per unit / day |
Above is the pricing for one of OpenAI’s provisioned capacity options, called “Scale Tier”. Below is the comparison to the standard rates. It shows that even if you utilize GPT-5 Scale Tier units 100% of the time, you’d pay 67% more per token; for GPT-4.1, 25-28% more. Therefore, Scale Tier capacity only makes sense if you require its uptime and latency SLAs.
GPT-5:

| Capacity Type | Input Rate ($/1M tokens) | Cached Input Rate ($/1M tokens) | Output Rate ($/1M tokens) |
| --- | --- | --- | --- |
| Standard | $1.25 | $0.125 | $10.00 |
| Provisioned (Scale Tier) | $2.08 | $0.208 | $16.67 |
| % Delta | 67% | 67% | 67% |
GPT-4.1:

| Capacity Type | Input Rate ($/1M tokens) | Cached Input Rate ($/1M tokens) | Output Rate ($/1M tokens) |
| --- | --- | --- | --- |
| Standard | $2.00 | $0.50 | $8.00 |
| Provisioned (Scale Tier) | $2.55 | $0.64 | $10.00 |
| % Delta | 28% | 28% | 25% |
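The normalization behind these tables can be reproduced in a few lines. This sketch uses the Scale Tier bundle figures from above and assumes 100% utilization, the best case for the reservation.

```python
def bundle_rate_per_mtok(price_per_unit_day: float, tpm: float) -> float:
    """Convert a per-unit daily price plus a TPM rating into $ per 1M tokens."""
    tokens_per_day = tpm * 60 * 24
    return price_per_unit_day / tokens_per_day * 1_000_000

input_rate = bundle_rate_per_mtok(75.0, 25_000)   # ~$2.08 per 1M input tokens
output_rate = bundle_rate_per_mtok(60.0, 2_500)   # ~$16.67 per 1M output tokens

# Premium over the standard GPT-5 rates ($1.25 in / $10.00 out):
input_premium = input_rate / 1.25 - 1    # ~0.67, i.e. 67% more per token
output_premium = output_rate / 10.0 - 1  # ~0.67
```

Any utilization below 100% only widens the premium, which is why the tables above represent the floor, not the typical cost.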
With provisioned capacity, the primary driver may not be cost savings at all. In many cases, organizations pay more for provisioned capacity because they need its performance and SLA benefits.
Shared capacity can be slow and unpredictable. Reserved capacity is dedicated to you, which makes it faster and more consistent. But “fast” in GenAI is measured differently: end-to-end latency (from prompt to final token) is less relevant for streaming applications than it is for traditional cloud software.
The metrics that matter more for streaming solutions are:

- Time to First Token (TTFT): how long the user waits before the first token of the response appears.
- Output Tokens Per Second (OTPS): how quickly the remaining tokens stream once generation has started.
Reserved capacity dramatically improves both TTFT and OTPS. This “perceived latency” is critical for user experience. Even for reasoning models that “think” before responding, that thinking step is just token generation under the hood. Faster OTPS means the model “thinks” faster, too. For latency-sensitive applications, this performance gain alone may justify the higher cost.
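Both metrics fall out directly from token arrival timestamps. A minimal sketch (the timestamps in the example are invented for illustration):

```python
def streaming_metrics(request_time: float, token_times: list[float]):
    """Return (TTFT, OTPS) for one streamed response.

    request_time: when the prompt was sent, in seconds.
    token_times:  arrival time of each streamed token, in seconds.
    """
    ttft = token_times[0] - request_time
    duration = token_times[-1] - token_times[0]
    otps = (len(token_times) - 1) / duration if duration > 0 else 0.0
    return ttft, otps

# Request sent at t=0.0s; four tokens arrive at 0.5s, 0.6s, 0.7s, 0.8s:
ttft, otps = streaming_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# TTFT = 0.5s; OTPS = 3 tokens / 0.3s = ~10 tokens/sec
```

Collecting these two numbers per request, on both shared and provisioned endpoints, is the basis for the load testing recommended in the summary.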
Provisioned capacity typically comes with higher uptime SLAs than shared capacity. Further, it (usually) offers data privacy guarantees. Most providers state that data sent to their provisioned endpoints is not used for training future models.
This privacy difference also unlocks a key strategy: traffic “affinitization”. A team might route all requests containing PII or confidential corporate data to their reserved capacity endpoint to ensure data privacy. Less sensitive traffic can then be sent to the shared endpoint. This combined approach lowers costs because you can reserve far less capacity than what would be required to handle all traffic.
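A simple affinitization router might look like the sketch below. The PII patterns and endpoint labels are illustrative only; a production system would use a proper PII-detection service rather than two regexes.

```python
import re

# Route PII-bearing prompts to the reserved (provisioned) endpoint,
# everything else to the cheaper shared tier.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like pattern
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email address
]

def choose_endpoint(prompt: str) -> str:
    """Pick a capacity tier based on whether the prompt appears to contain PII."""
    if any(p.search(prompt) for p in PII_PATTERNS):
        return "provisioned"   # privacy guarantees, dedicated throughput
    return "shared"            # pay-as-you-go tier
```

Because only the sensitive slice of traffic lands on the reservation, the reserved block can be sized well below total peak demand.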
As you navigate this decision, keep the trade-offs above in mind.
Ultimately, the move to provisioned capacity is a graduation. It’s the moment your GenAI application becomes a mission-critical part of the business. The decision is not a simple cost calculation but a complex, strategic trade-off between cost, performance, security, and vendor flexibility.
In our next installment, we’ll discuss how to consistently adapt to the volatility of the GenAI landscape.
We’d like to thank the following people for their work on this post: