James Barney
MetLife
As organizations increasingly adopt generative AI technologies, understanding and managing the associated costs becomes crucial. This whitepaper provides a comprehensive guide to tracking and attributing costs in AI workloads, with a focus on token usage as the primary unit of measurement.
This whitepaper serves as a foundational guide for FinOps practitioners and IT leaders looking to implement robust cost tracking for their AI initiatives. By following these principles and best practices, organizations can gain better visibility into their AI spending, optimize costs, and make informed decisions about their AI investments.
A common question when getting started with Generative AI cost tracking involves token usage. For most AI workloads, tokens are the single unit of cost that can be easily tracked and attributed to individual AI use cases within a business. Tokens fall into two distinct groups: input tokens and generated tokens. Input tokens (the prompts or instructions you send to the model) are typically billed at a much lower rate than generated tokens (what the model produces in response).
A simple metaphor for token counts is a word count. If you wish to generate one thousand 100-word emails, you could expect to generate roughly 100,000 tokens. In practice, a word count dramatically underestimates the actual token count, which depends entirely on how the model was trained and other parameters the model creators set – a topic out of scope for this paper.
Tokens are commonly billed per million tokens inputted or generated (Mtokens or Megatokens, just like Megabits!). Prices tend to change rapidly as systems become optimized but, as of writing, the most cutting-edge models tend to live in the USD 10 to 20 per Mtoken generated range. Production usage can easily cross billions of tokens (Gigatokens) per month, so costs can quickly add up. Attributing those costs correctly is crucial for evaluating the value of AI workloads over time.
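As a quick illustration of the arithmetic, the sketch below converts raw token counts into dollars; the prices are placeholders rather than any provider's actual rates.

# Back-of-the-envelope cost estimate for a month of usage.
# Prices are illustrative placeholders in USD per million tokens (Mtokens).
INPUT_PRICE_PER_MTOKEN = 3.00      # assumed input price
OUTPUT_PRICE_PER_MTOKEN = 15.00    # assumed generated (output) price

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Convert raw token counts into dollars using per-Mtoken prices."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOKEN \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOKEN

# Example: 2 Gtokens of input and 500 Mtokens of output in a month
print(f"${monthly_cost(2_000_000_000, 500_000_000):,.2f}")  # -> $13,500.00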
Despite the rapid rise and adoption of AI technology, the fundamental billing unit, tokens, lends itself quite well to the established practices of the FinOps Framework. Incremental usage results in incremental costs, predictable patterns can be optimized with precommitments, and reporting usage is often real-time. However, many cloud or model providers are focusing on new capabilities, not cost tracking capabilities. As a result, FinOps practitioners can feel left in the dust with a big, unexplainable AI bill each month.
What best practices exist for tracking costs? How can you accurately attribute those costs to multiple use cases? What limitations currently exist within the AI ecosystem that introduce difficulties? We attempt to answer those questions in this paper. Importantly, we’ll be looking primarily at inference costs – not other traditional costs like data storage, backups, load balancing, and content distribution, as these topics are covered extensively in existing FinOps materials.
The most basic – and least accurate – AI cost estimation technique is to simply count the number of requests per API key and bill per call. This is inaccurate because the cost of someone sending “hello” to the AI endpoint would be counted the same as someone sending the complete works of Shakespeare – two tokens vs. 1.2 million tokens. However, depending on your workload types, volumes, and bill size, this may be a perfectly adequate solution, especially as a quick fix for a hard problem.
The next technique – better accuracy, but more work to implement – is token estimation. Token estimation can be useful if you’re using a model that doesn’t support standard token count reporting. Various tools and code libraries exist for estimating token usage. It is important to note, however, that unless the model you’re using has published its tokenization strategy, any token count your techniques arrive at will only be an estimate and cannot be used to predict a bill exactly.
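For example, OpenAI’s open-source tiktoken library is one such tool; for models that use a different or unpublished tokenizer, treat the result strictly as an approximation.

# Rough token estimation with tiktoken (pip install tiktoken).
# cl100k_base is one common encoding; other models may tokenize differently,
# so this count is an estimate, not a billing-grade figure.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def estimate_tokens(text: str) -> int:
    """Return an approximate token count for the given text."""
    return len(encoding.encode(text))

prompt = "Summarize the attached quarterly FinOps report in three bullet points."
print(estimate_tokens(prompt))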
Token estimation generally requires a dedicated database (both NoSQL and traditional SQL work fine here) that records the usage (input and generated tokens) from each API call. Then, at the end of the month or billing period, run a process to attribute usage to each individual key and use case. If you already have an existing FinOps tool or database collecting cost data, that would be a good place to store this information.
The most accurate technique is to record the actual input and generated token counts into a database with the same rollups mentioned before. Most third-party models provide these counts in their responses, but some still don’t (or you may be using an older model that doesn’t support it yet).
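Many provider APIs return these counts directly in the response metadata. As one illustration, OpenAI-style chat endpoints expose a usage object; field names differ between providers, so consult your model’s documentation.

# Reading provider-reported token counts (OpenAI-style response shown purely as
# an illustration; other providers expose similar fields under different names).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Draft a two-sentence status update."}],
)

usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)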
A full token tracking table schema could look something like this (full example on GitHub):
-- Request Keys table
CREATE TABLE request_keys (
    request_key_id SERIAL PRIMARY KEY,
    key_name VARCHAR(255) NOT NULL,
    key_value VARCHAR(255) NOT NULL UNIQUE,
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Model Information table (prices stored in USD per 1,000 tokens)
CREATE TABLE model_information (
    model_id SERIAL PRIMARY KEY,
    model_name VARCHAR(255) NOT NULL,
    model_input_price DECIMAL(10, 6) NOT NULL,
    model_output_price DECIMAL(10, 6) NOT NULL,
    price_effective_date DATE NOT NULL,
    is_current BOOLEAN DEFAULT true
);

-- API Versions table
CREATE TABLE api_versions (
    api_version_id SERIAL PRIMARY KEY,
    api_version VARCHAR(50) NOT NULL,
    release_date DATE NOT NULL
);

-- Token Tracking table: one row per API call
CREATE TABLE token_tracking (
    tracking_id SERIAL PRIMARY KEY,
    request_id UUID NOT NULL,
    request_key_id INTEGER REFERENCES request_keys(request_key_id),
    input_token_count INTEGER NOT NULL,
    output_token_count INTEGER NOT NULL,
    model_id INTEGER REFERENCES model_information(model_id),
    api_version_id INTEGER REFERENCES api_versions(api_version_id),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- PostgreSQL generated columns cannot reference other tables, so derive the
-- per-call cost through a view rather than a GENERATED ... STORED column
CREATE VIEW token_tracking_costs AS
SELECT t.*,
       (t.input_token_count * m.model_input_price / 1000)
     + (t.output_token_count * m.model_output_price / 1000) AS total_cost
FROM token_tracking t
JOIN model_information m ON m.model_id = t.model_id;

-- Indexes for faster querying
CREATE INDEX idx_token_tracking_timestamp ON token_tracking(timestamp);
CREATE INDEX idx_token_tracking_request_key_id ON token_tracking(request_key_id);
CREATE INDEX idx_token_tracking_model_id ON token_tracking(model_id);
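To attribute usage at the end of each billing period, a scheduled roll-up along these lines could be run against the tables above. This is a minimal sketch assuming PostgreSQL and the psycopg2 driver; the connection string is a placeholder and the table and column names mirror the example schema.

# End-of-period attribution roll-up against the example schema above.
# Assumes PostgreSQL and psycopg2 (pip install psycopg2-binary); the DSN is a placeholder.
import psycopg2

ROLLUP_SQL = """
    SELECT k.key_name                        AS use_case,
           date_trunc('month', t.timestamp)  AS billing_month,
           SUM(t.input_token_count)          AS input_tokens,
           SUM(t.output_token_count)         AS output_tokens,
           SUM(t.input_token_count  * m.model_input_price  / 1000
             + t.output_token_count * m.model_output_price / 1000) AS estimated_cost
    FROM token_tracking t
    JOIN request_keys k      ON k.request_key_id = t.request_key_id
    JOIN model_information m ON m.model_id = t.model_id
    GROUP BY k.key_name, date_trunc('month', t.timestamp)
    ORDER BY billing_month, use_case;
"""

with psycopg2.connect("dbname=finops user=finops") as conn:
    with conn.cursor() as cur:
        cur.execute(ROLLUP_SQL)
        for use_case, month, input_tokens, output_tokens, cost in cur.fetchall():
            print(use_case, month.date(), input_tokens, output_tokens, f"${cost:,.2f}")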
In any solution, for the purposes of cost estimation, we do not recommend storing the raw prompts and output long-term as you could expose yourself to storing sensitive data in a non-sensitive database. In addition, the storage of prompts and outputs will only contribute to the overall cost of the system. Furthermore, your security partners may already be storing these prompts somewhere – ask them for direct access or to provide additional columns in their data store that you can leverage for your analysis.
Many enterprises, hoping to manage and supervise the adoption of AI capabilities, require all AI workloads to interact with models through a centralized AI proxy or hub. This technique can make it very simple to monitor and manage the cost of specific use cases if the hub is set up correctly.
Importantly, access to the hub should be granted through API keys or authentication keys tied directly to a given use case. These keys should be treated like other secret access mechanisms and not shared between applications, users, or use cases – established secrets management tooling already exists and should be leveraged for AI workloads.
However, if logging isn’t configured properly on the centralized hub, distinct API key usage may not be enough – from the centralized hub’s point of view, it’ll just see high usage coming from a number of keys. It is important to attribute input and generated tokens to each individual API call. While some existing FinOps tools can handle this type of usage, it’s a growing field and not many work perfectly. Additionally, managing this for each different model’s tokenization strategy can be tricky. Some models provide input and generated token counts upon response and should be used directly. Other models do not provide this information – you’ll need to implement token counting on the hub-side of your system.
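To make that concrete, here is a minimal sketch of hub-side attribution logic. The function shape, the provider usage payload, and the tiktoken-based fallback are illustrative assumptions, not any particular vendor’s API.

# Hub-side attribution sketch: prefer provider-reported counts, fall back to a
# local estimate when the model does not return usage metadata.
from dataclasses import dataclass
import tiktoken

_encoding = tiktoken.get_encoding("cl100k_base")  # estimation fallback only

@dataclass
class UsageRecord:
    api_key_id: str      # ties usage back to a single use case
    model: str
    input_tokens: int
    output_tokens: int
    estimated: bool      # True when counts were derived on the hub side

def attribute_call(api_key_id: str, model: str, prompt: str,
                   completion: str, provider_usage: dict | None) -> UsageRecord:
    """Build one usage row for the tracking database from a single API call."""
    if provider_usage:  # the model returned authoritative counts; use them directly
        return UsageRecord(api_key_id, model,
                           provider_usage["input_tokens"],
                           provider_usage["output_tokens"], estimated=False)
    # Otherwise estimate locally and flag it so downstream reports can caveat it.
    return UsageRecord(api_key_id, model,
                       len(_encoding.encode(prompt)),
                       len(_encoding.encode(completion)), estimated=True)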
If the adoption of AI workloads is decentralized within your organization, it may be more difficult to attribute costs as compared to a centralized technique. This is because the implementation of a unified token estimation or counting practice would be difficult to enforce. For example, requiring tags on each API request to the models may not be enforceable. As a result, a use case could easily misattribute their usage (either on purpose to fly under the radar or by honest mistake) on their system’s API calls.
If AI usage is decentralized, we recommend partnering with engineering to provide a common interface, SDK, or other tool for interacting with the models that can automatically apply standards for each use case. Publishing an infrastructure-as-code module with built in load balancing, tagging, and other undifferentiated heavy lifting tasks can be a great way to speed adoption of an opinionated tagging method.
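As one hypothetical shape for such an interface (the class and field names here are illustrative, not an existing SDK), a thin wrapper that refuses to send a request without a registered use-case identifier keeps tagging from becoming optional.

# Hypothetical internal SDK sketch: every call must carry a use-case tag, so
# attribution metadata is applied consistently rather than left to each team.
import time
from typing import Callable

class GovernedAIClient:
    def __init__(self, use_case_id: str, send: Callable[[dict], dict]):
        if not use_case_id:
            raise ValueError("A registered use_case_id is required for all AI calls")
        self.use_case_id = use_case_id
        self._send = send  # underlying provider or hub call, injected by the platform

    def complete(self, prompt: str, model: str) -> dict:
        request = {
            "model": model,
            "prompt": prompt,
            "metadata": {               # standard tags applied automatically
                "use_case_id": self.use_case_id,
                "submitted_at": time.time(),
            },
        }
        return self._send(request)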
Pairing this interface technique with an AI governance framework that requires accurate reporting can offer a decent system of tradeoffs for decentralized access. This will require a more orchestrated approach to handling AI usage across a few teams which can introduce more work long-term.
If this is not possible, we recommend separating usage into distinct AI workload billing entities. Distinct subscriptions or accounts for each AI workload will allow you to maintain somewhat more granular insight into how costs are changing over time. However, this introduces complexity into the long-term maintenance and management of AI workloads.
Until fundamental tagging and other metadata management techniques critical to other FinOps capabilities are deeply embedded in AI services and systems, a decentralized approach to AI will require open discourse and honest reporting by the consumers of the services within your organization.
A complicating factor for AI workload costs is Provisioned Throughput Units (PTUs). Provisioned Throughput is a pricing model offered by some AI service providers that allows organizations to reserve a certain amount of processing capacity for a fixed cost. While this can lead to significant cost savings for consistent, high-volume workloads, it also presents unique challenges in terms of usage allocation and cost attribution. While the models are still generating tokens, you’re no longer paying per token.
Additionally, because PTUs can be expensive, it’s common for multiple use cases to share one PTU. This can make it difficult to allocate costs to a specific use case. The simplest way to split costs would be to divide the total PTU cost by the number of use cases using it. This is very easy; however, if one use case consumes more of the PTU than the others, an even split is unfair to the use cases consuming less.
Below, we outline one approach you can take to help understand your shared PTU costs based on consumption. Since PTUs lower the effective per-token rate, you can calculate your price-per-token amount in your internal tracking system by totalling the number of tokens generated over a given period of time. This approach can easily slide into an existing cost system but also introduces a few minor complications.
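In its simplest form, that calculation just divides the fixed PTU commitment by the tokens actually processed in the period. A minimal sketch follows; the commitment figure comes from the example later in this paper, and the consumption total is a placeholder.

# Effective per-Mtoken rate for a PTU, derived from observed consumption.
def effective_rate_per_mtoken(ptu_monthly_cost: float, total_tokens: int) -> float:
    """Amortize the fixed PTU cost across every token actually processed."""
    return ptu_monthly_cost / (total_tokens / 1_000_000)

# e.g. a $29,462.40 monthly commitment spread over 3.2 Gtokens of observed traffic
print(round(effective_rate_per_mtoken(29_462.40, 3_200_000_000), 3))  # -> 9.207 ($/Mtoken)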
Because PTU utilization varies over time, it can be fairly difficult to pin a specific use case’s token consumption to a specific cost. Until more throughput and pricing transparency is introduced into model providers’ PTU offerings, estimates are our best option. An additional complication is that each model provider has different pricing and PTU allocation options, so saying “a single PTU” doesn’t mean much from provider to provider.
To illustrate the concepts, we’ll start with the simplest situation: a single use case with a single PTU.
Let’s assume 50% utilization over the entire month. We can perform some basic averaging to arrive at a realized price. Let’s assume you can buy a certain amount of Provisioned Throughput with a roughly $30,000/month commitment; AWS’s Provisioned Throughput for Claude 2.0 costs $29,462.40 for one month. How many tokens can you generate with that throughput in that month? Your results will vary between models and platforms, but to illustrate, let’s assume the purchased throughput can handle 2,000 tokens of input and 250 tokens of output per second.
\[
\text{Input tokens} = 2{,}000\ \text{tokens/second} \times 60\ \text{seconds/minute} \times 60\ \text{minutes/hour} \times 24\ \text{hours/day} \times 30\ \text{days/month} = 5{,}184\ \text{Mtokens} \approx 5\ \text{Gtokens of input}
\]
\[
\text{Output tokens} = 250\ \text{tokens/second} \times 60\ \text{seconds/minute} \times 60\ \text{minutes/hour} \times 24\ \text{hours/day} \times 30\ \text{days/month} = 648\ \text{Mtokens} \approx 0.6\ \text{Gtokens of output}
\]
If you had paid for each of those tokens with standard, per-token billing:
\[
5{,}184\ \text{Mtokens of input} \times \$8\ \text{per Mtoken input} = \$41{,}472
\]
\[
648\ \text{Mtokens of output} \times \$24\ \text{per Mtoken output} = \$15{,}552
\]
\[
\text{Total} = \$57{,}024
\]
Now, you have two options. The first is to use that overall savings as your effective per-token rate. The second would be to use that actual PTU cost and divide it up across your use cases as they consume tokens.
The first option works well if you are comfortable with the fact that the effective rate assumes the PTU ran at 100% utilization for the entire month. It normally will not, so add that caveat to your dashboards. Let’s calculate that PTU token rate:
\[
\text{PTU rate} = \frac{\$29{,}462.40}{\$57{,}024.00} \approx 0.516
\]
Now, we can apply that effective rate to our example AI use case:
\[
\text{Price per Mtoken}_{\text{input}} = \$8.00 \times 0.516 \approx \$4.13\ \text{per Mtoken}
\]
\[
\text{Price per Mtoken}_{\text{output}} = \$24.00 \times 0.516 \approx \$12.40\ \text{per Mtoken}
\]
We’ll start by writing down the general formula for spend in terms of the effective PTU rate.
\[
\text{Spend} = \text{price per token} \times \text{tokens consumed}
\]
Or, once we calculate the effective PTU token rate (assuming we ran at 100% utilization):
\[
\text{Spend} = \text{PTU Rate}_{100\%\ \text{util}} \times \text{Mtokens consumed}
\]
If we didn’t run at 100% utilization for the entire time period, we need to account for that by adjusting the effective rate. Since total spend increases linearly with token usage, a simple ratio is all that is needed. Because running below 100% means we are not using our PTUs at maximum efficiency, we effectively pay more per token – in effect, our PTU rate increases, so we add a term:
\[
\text{Spend} = \text{PTU Rate}_{<100\%\ \text{util}} \times \text{Mtokens consumed}
\]
\[
\text{Spend} = \bigl(\text{PTU Rate} + \text{PTU Rate} \times (1 - \text{Utilization Rate})\bigr) \times \text{Mtokens consumed}
\]
And through simplification:
\[
\text{Spend} = \text{PTU Rate} \times (2 - \text{Utilization Rate}) \times \text{Mtokens consumed}
\]
Now we can use this to calculate our costs. Note that you’ll need two terms, one for input and one for output, since the PTU rates differ. Suppose our example use case consumed 700 Mtokens of input and 300 Mtokens of output at 50% utilization:
\[
\text{Spend}_{\text{input}} = \text{PTU Rate} \times (2 - \text{Utilization Rate}) \times \text{Mtokens consumed}
\]
\[
\text{Spend}_{\text{input}} = \$4.13 \times (2 - 0.50) \times 700\ \text{Mtokens} \approx \$4{,}339
\]
\[
\text{Spend}_{\text{output}} = \$12.40 \times (2 - 0.50) \times 300\ \text{Mtokens} = \$5{,}580
\]
\[
\text{Spend}_{\text{total}} = \$9{,}919
\]
And if we hadn’t used a PTU:
\[
\text{Spend}_{\text{input}} = \$8 \times 700\ \text{Mtokens} = \$5{,}600
\]
\[
\text{Spend}_{\text{output}} = \$24 \times 300\ \text{Mtokens} = \$7{,}200
\]
\[
\text{Spend}_{\text{total}} = \$12{,}800
\]
Great! Now we can calculate the effective cost of our PTU consumption. But what if we have more than one use case? Simple: create the same calculation for each use case and add them all up. We can take the utilization rate and token count for any arbitrary number of use cases over any period of time and multiply them by the effective rate to derive each one’s contributing spend.
For example, let’s say we have three use cases (A, B, C) driving a 75% utilization rate on a given day of the week, with the token consumption shown in the table below.
Use Case | Input Tokens | Output Tokens
---|---|---
A | 550 Mtokens | 250 Mtokens
B | 100 Mtokens | 50 Mtokens
C | 350 Mtokens | 100 Mtokens
The total spend will be the sum of each use case’s consumption multiplied by the PTU rate:
\[
\text{Spend}_{\text{total}} = \sum_{i=1}^{n} \text{PTU Rate}_{<100\%\ \text{util}} \times \text{Mtoken Count}_i
\]
Since the PTU Rate will be the same for all use cases, we can precompute it:
\[
\text{PTU Rate}_{<100\%\ \text{util}} = \$4.13 \times (2 - 0.75) \approx \$5.16\ \text{per Mtoken input}
\]
\[
\text{Spend}_{\text{input}} = \text{PTU Rate} \times \text{Mtokens}_A + \text{PTU Rate} \times \text{Mtokens}_B + \text{PTU Rate} \times \text{Mtokens}_C
\]
\[
\text{Spend}_{\text{input}} = \$5.16 \times 550 + \$5.16 \times 100 + \$5.16 \times 350 = \$2{,}838 + \$516 + \$1{,}806 = \$5{,}160
\]
And for output:
\[
\text{PTU Rate}_{<100\%\ \text{util}} = \$12.40 \times (2 - 0.75) = \$15.50\ \text{per Mtoken output}
\]
\[
\text{Spend}_{\text{output}} = \text{PTU Rate} \times \text{Mtokens}_A + \text{PTU Rate} \times \text{Mtokens}_B + \text{PTU Rate} \times \text{Mtokens}_C
\]
\[
\text{Spend}_{\text{output}} = \$15.50 \times 250 + \$15.50 \times 50 + \$15.50 \times 100 = \$3{,}875 + \$775 + \$1{,}550 = \$6{,}200
\]
This example shows that use case A had a total cost of $6,713, use case B $1,291, and use case C $3,356.
Critically, by combining PTU utilization with discrete time windows (hourly, daily, weekly), you can establish a fair showback or chargeback billing system for PTU consumption across an enterprise. If your PTU was at 30% utilization between 9:00 am and 10:00 am on Tuesday, collect the token usage between those timestamps, calculate the effective rate, and assign the cost to those use cases.
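A small sketch of that windowed showback calculation, reusing the effective-rate adjustment derived above; the utilization, rate, and per-use-case token counts are placeholders.

# Hourly PTU showback sketch using the paper's effective-rate adjustment:
# adjusted rate = base rate * (2 - utilization).
def showback(base_rate_per_mtoken: float, utilization: float,
             mtokens_by_use_case: dict[str, float]) -> dict[str, float]:
    """Allocate PTU cost for one time window across use cases by consumption."""
    adjusted_rate = base_rate_per_mtoken * (2 - utilization)
    return {use_case: round(adjusted_rate * mtokens, 2)
            for use_case, mtokens in mtokens_by_use_case.items()}

# Tuesday 9:00-10:00 am, 30% utilization, input-side rate of $4.13 per Mtoken
print(showback(4.13, 0.30, {"use_case_a": 12.0, "use_case_b": 3.5}))
# -> {'use_case_a': 84.25, 'use_case_b': 24.57}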
As with all things in FinOps, understanding the cost of any given process running in the cloud is essential to deriving its value to the business. AI workloads are the new kid on the block and, because of their rapid development, the surrounding services aren’t yet fully mature, feature-rich capabilities. As a result, some FinOps practitioners may need to build and maintain systems to provide insight into the cost and usage of these services. Partnering with other organizations in the AI space within your company can provide extensive capabilities without too much overhead – just remember that those systems probably haven’t been purpose-built to answer FinOps questions.
Ultimately, the consumption of AI inference services needs to be reconciled with the monthly bill. Tracking each and every token may get you very close to the real bill you’re being charged; however, it still may not be perfect. Other hidden costs tied to more traditional cloud infrastructure may contribute directly to the cost of AI inference services. An existing, robust FinOps strategy is the best way to help you understand these costs and extend your existing best practices to the new AI workloads.
We’d like to thank the following people for their hard work on this Paper:
We’d also like to thank all of our supporters for their help on this asset.