FinOps Foundation Insights

Token Economics: The Atomic Unit of AI Value

J.R. Storment
May 10, 2026 - 18-minute read

What is Token Economics?

Token economics, in the context of AI, is the study of how the production, distribution, and consumption of tokens (the atomic units of data that large language and multimodal models read, write, and reason over) generate cost and value within an organization. It is the discipline through which AI consumption is metered, attributed, and connected to business outcomes.

The term should be distinguished from its earlier and unrelated usage in distributed-ledger systems, where “tokenomics” describes the supply mechanics of cryptographic assets. In AI, a token is not a unit of ownership; it is a unit of computation. A token typically represents a sub-word fragment, a character cluster, or a discretized segment of audio, image, or video data. Industry convention places roughly 1,500 English words at approximately 2,048 tokens, although tokenization schemes vary by model family and modality.
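
As a rough illustration of that convention, the sketch below estimates a token count from a word count using the 1,500-words-to-2,048-tokens ratio quoted above. The function name is hypothetical and the ratio is a rule of thumb for English text, not a tokenizer; real counts depend on the model family.

```python
# Rule-of-thumb ratio from the text: ~1,500 English words is about 2,048 tokens.
# This is an approximation for budgeting, not a substitute for a real tokenizer.
TOKENS_PER_WORD = 2048 / 1500  # roughly 1.37 tokens per word


def estimate_tokens(text: str) -> int:
    """Estimate tokens from a whitespace-delimited word count (hypothetical helper)."""
    word_count = len(text.split())
    return round(word_count * TOKENS_PER_WORD)


if __name__ == "__main__":
    sample = "Token economics connects AI consumption to business outcomes."
    print(estimate_tokens(sample))
```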

Token economics is best understood as FinOps applied to AI. Where traditional FinOps uses Unit Economics to govern the relationship between variable-cost cloud infrastructure (compute, storage, network) and value, token economics governs the variable cost of intelligence computation itself. It does not replace prior cost disciplines; it extends them into a layer where the consumed resource is probabilistic, non-deterministic, and priced per inferential act.

Token as an Atomic Unit

Every interaction with a generative or agentic AI system can be decomposed into input tokens (the prompt, the retrieved context, the system instructions, the conversation history) and output tokens (the generated response, the tool call, the chain of reasoning). Providers price these two flows separately, often at different rates, and frequently introduce additional metering for cached context, compression options, routing intelligence, embedded media, and reasoning traces.

Five variables drive token consumption per request:

  1. System prompt overhead. Standing instructions included with every call.
  2. Context and memory. Retrieved documents, conversation history, and tool definitions.
  3. Model selection. Larger or reasoning-class models consume more tokens per equivalent task.
  4. Output length. Determined by user intent, prompt design, and model behavior.
  5. Retry and orchestration overhead. Failed calls, validation passes, and agent-to-agent communication.

These variables compound. A single user query routed through a retrieval-augmented generation pipeline with a reasoning model and three tool calls may consume one to two orders of magnitude more tokens than a direct prompt to a smaller model. Token consumption is therefore non-linear with respect to user-facing activity, which is the principal reason traditional cost forecasts have proven unreliable in AI workloads.
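
A minimal sketch of how these variables compound, assuming separate per-million prices for input and output tokens. The field names, the example token counts, and the treatment of retries as whole repeated attempts are illustrative assumptions, not any provider's billing model.

```python
from dataclasses import dataclass


@dataclass
class Request:
    system_prompt_tokens: int  # variable 1: standing instructions
    context_tokens: int        # variable 2: retrieved documents, history, tool definitions
    user_tokens: int           # the user's own prompt
    output_tokens: int         # variable 4: generated response, including reasoning
    attempts: int = 1          # variable 5: retries and orchestration passes (simplified)


def total_tokens(req: Request) -> int:
    """All tokens billed across every attempt."""
    per_attempt = (req.system_prompt_tokens + req.context_tokens
                   + req.user_tokens + req.output_tokens)
    return per_attempt * req.attempts


def request_cost(req: Request, in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost; prices are per million tokens and depend on model choice (variable 3)."""
    input_tokens = req.system_prompt_tokens + req.context_tokens + req.user_tokens
    per_attempt = (input_tokens * in_price_per_m
                   + req.output_tokens * out_price_per_m) / 1_000_000
    return per_attempt * req.attempts


# Hypothetical token counts: a direct prompt versus an agentic, retrieval-heavy request.
direct = Request(system_prompt_tokens=200, context_tokens=0,
                 user_tokens=50, output_tokens=300)
agentic = Request(system_prompt_tokens=2_000, context_tokens=20_000,
                  user_tokens=50, output_tokens=3_000, attempts=4)

if __name__ == "__main__":
    print(total_tokens(agentic) / total_tokens(direct))
```

With these assumed counts the agentic request consumes more than one hundred times the tokens of the direct prompt, even though both begin with the same 50-token user query.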

A further complication: unit prices per million tokens are falling across most provider families, while aggregate enterprise spend is rising. The mechanism is elastic demand. Even when cost per token declines, organizations expand modality (text to image to video), increase agent autonomy, and lengthen reasoning chains. AT&T has publicly reported scaling from roughly 8 billion to 27 billion tokens per day after deploying multi-agent systems. Google has reported processing approximately 1.3 quadrillion tokens per month, a roughly 130-fold increase in just over a year.

A token may get cheaper; tokens, in aggregate, are not.
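
The arithmetic behind that statement can be sketched as follows. Using the AT&T volumes cited above, and assuming a single blended unit price (a simplification), the unit price would have to fall by roughly 70 percent just to hold total spend flat. The function name is hypothetical.

```python
def breakeven_price_decline(volume_before: float, volume_after: float) -> float:
    """Fractional unit-price decline needed to keep total spend unchanged.

    Spend = volume * price, so flat spend requires price_after / price_before
    to equal volume_before / volume_after.
    """
    if volume_before <= 0 or volume_after <= 0:
        raise ValueError("volumes must be positive")
    return 1 - volume_before / volume_after


# Daily token volumes as cited in the text (8 billion to 27 billion).
decline = breakeven_price_decline(8e9, 27e9)

if __name__ == "__main__":
    print(f"{decline:.1%}")
```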

The Price Environment in 2026

Public commentary on token pricing has, until recently, been dominated by a single narrative: per-token list prices were falling rapidly across all capability tiers, with public benchmarking measuring two-orders-of-magnitude declines at fixed capability levels over the 2022 to 2024 period. That trajectory was real, but it is no longer the relevant story for an organization budgeting AI consumption in 2026. Two structural changes have reshaped the price environment.

End of the Subsidy Phase. Frontier model providers spent the early adoption cycle pricing below true cost to accelerate growth, supported by venture capital and ahead-of-revenue infrastructure buildout. As enterprise consumption growth outpaced the rate at which per-token cost was declining, the underlying math broke.

Anthropic’s April 2026 enterprise pricing transition is the most public example. The company moved enterprise customers from bundled token allowances to a seat-fee-plus-pre-committed-token-consumption structure with no included usage cushion. It walled flat-rate Pro and Max subscriptions off from third-party agent harnesses, which industry analysts had estimated were consuming roughly five times more compute per dollar than the subscription was priced to support. And it reframed the procurement question for buyers from “how many seats” to “how much compute will you forecast and pre-pay.”

OpenAI’s product lead for ChatGPT acknowledged the underlying dynamic publicly: an unlimited AI plan, he said, is structurally similar to an unlimited electricity plan and does not work. Anthropic’s posture is the visible front edge of an industry-wide adjustment.

Per-token list prices are still declining, but the rate has slowed and the declines are concentrated in commodity tiers rather than at the frontier. Reasoning and agentic workloads consume from five to thirty times more tokens per task than the equivalent chat interaction. The International Energy Agency reports that electricity demand from AI-focused data centers grew approximately 50 percent in 2025 alone, against three percent growth in global electricity demand overall, with the divergence attributed specifically to the rise of reasoning and agentic use cases. Even where unit prices decline at the bottom of the capability range, consumption growth at the top of the range swamps any savings.

Anthropic added approximately twenty-one billion dollars of annualized revenue between October 2025 and April 2026, almost entirely through enterprise token consumption. Over the same period, AT&T publicly reported scaling token throughput from roughly 8 billion to 27 billion tokens per day on multi-agent systems, and Google reported processing approximately 1.3 quadrillion tokens per month, a roughly 130-fold increase in just over a year.

The defensible summary of the current price environment is therefore narrower than the early-cycle headlines suggested. A token at a fixed capability tier may continue to drift cheaper. The tokens an enterprise actually consumes, weighted by the tiers those workloads require and the volumes those workloads now generate, are not.

Not All Tokens Are Equal: Goodput and the Pareto Frontier

A token-count view of AI consumption assumes tokens are homogeneous. They are not. A token delivered at five tokens per second per user is a different economic good from a token delivered at five hundred. A token generated by a small classifier is a different good from one produced by a reasoning model with a long context window. Token economics, to be defensible at the operating level, must account for this heterogeneity.

The relevant concept is goodput: token output that meets a defined service-level objective, typically expressed as a time-to-first-token threshold and a sustained tokens-per-second-per-user rate. Goodput, not raw throughput, is what enterprises actually purchase.
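
One way to make the distinction concrete is to count only the output tokens from completions that met both thresholds. The sketch below is an illustration under assumed field names and SLO values; it is not the methodology of any particular benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Completion:
    output_tokens: int
    ttft_s: float    # time to first token, in seconds
    decode_s: float  # time spent generating after the first token, in seconds


def goodput_tokens(completions: Iterable[Completion],
                   max_ttft_s: float, min_tokens_per_s: float) -> int:
    """Sum output tokens only for completions that satisfied both SLO thresholds."""
    total = 0
    for c in completions:
        rate = c.output_tokens / c.decode_s if c.decode_s > 0 else float("inf")
        if c.ttft_s <= max_ttft_s and rate >= min_tokens_per_s:
            total += c.output_tokens
    return total


# Hypothetical measurements: one good completion, one slow to start, one slow to stream.
runs = [
    Completion(output_tokens=500, ttft_s=0.4, decode_s=5.0),
    Completion(output_tokens=500, ttft_s=2.5, decode_s=5.0),
    Completion(output_tokens=500, ttft_s=0.4, decode_s=50.0),
]

if __name__ == "__main__":
    raw = sum(c.output_tokens for c in runs)
    good = goodput_tokens(runs, max_ttft_s=1.0, min_tokens_per_s=50.0)
    print(raw, good)
```

In this assumed sample, raw throughput counts all three completions, while goodput counts only the first.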

Inference workloads sit on a Pareto frontier between total throughput per unit of power and per-user interactivity. Public benchmarking work (notably SemiAnalysis’s InferenceX) identifies three regions along this curve:

NVIDIA chief executive Jensen Huang has framed the resulting market as a tiered spectrum: free tiers to attract users, mid-tier models that balance scale and speed, and premium tiers with large context windows and high throughput that command higher prices per million tokens. From an operator’s perspective, the weighted product of tier mix, throughput, and price is the revenue function for a fixed power envelope. From a buyer’s perspective, the same logic applies in reverse: paying premium-tier rates for workloads that would tolerate Goldilocks-tier latency is a measurable inefficiency.

Token economics, then, requires the metering of token quality alongside token quantity. An organization that tracks only volume will systematically misattribute cost.

The Cost Stack Around the Token

Tokens account for only a portion of AI spend. They are the most visible and most easily metered layer, which has led to a common conflation in industry discourse where “AI cost” and “token cost” are used interchangeably. They are not the same.

A complete accounting of AI cost spans several layers, only one of which is denominated in tokens:

A token-only view of AI cost captures only the variable, marginal cost of a unit of inference. That view is necessary for unit economics, but it omits the fixed and semi-fixed costs that determine whether a given AI initiative is economically viable at scale. Agentic workloads in particular, characterized by continuous inference and autonomous tool use, distribute cost across most of the layers above, often in proportions that change week to week as model behavior and orchestration logic evolve.

Token economics that accounts only for tokens is therefore a partial view.

SaaS Subscriptions as Token Aggregators that Remove Visibility

A specific subcategory inside SaaS embedding deserves attention because it has become a substantial and unpredictable cost driver. AI-native developer tools and productivity applications, many of which present as conventional monthly subscriptions, are increasingly token aggregators in practice.

The structural property uniting these cases is that headline subscription pricing on AI-native tools is no longer a reliable budgeting signal. The seat fee is the floor; the variable component is what drives total spend. Procurement and finance functions that treated AI tooling as a SaaS line item discovered, in 2025 and the first months of 2026, that they had inherited a metered consumption obligation without the visibility or controls a metered consumption obligation usually requires. Token economics, properly practiced, treats this category with the same rigor it applies to direct API spend.

Hardware, Power, and the Efficiency Curve

Beneath every token is a chain of physical and architectural decisions that determine what the token costs to produce. At gigawatt scale, the binding constraint on AI infrastructure is no longer capital or silicon supply; it is power. Jensen Huang has stated the relationship plainly on recent earnings calls: inference tokens per watt translate directly into revenue for cloud service providers operating an AI infrastructure footprint. The corollary holds for buyers. The cost of an inference, traced upstream, is the cost of the electricity that produced it.

Several factors determine how efficiently power becomes tokens.

The hardware generation curve. NVIDIA reports a roughly one-million-fold improvement in inference throughput per megawatt over six GPU generations (Kepler in 2012 through Rubin in 2026). Whether or not this specific figure is reproducible under independent methodology, the order of magnitude is consistent with publicly available benchmarks across vendors. The practical consequence is that the unit economics of inference are non-stationary. A workload that is uneconomic on one generation can become economic on the next without any change in software or model.

Facility efficiency. Power Usage Effectiveness (PUE), the ratio of total facility power to compute power, varies materially across cooling architectures. Modern liquid-cooled designs achieve PUE values approaching 1.1, compared with 1.5 or higher for legacy air-cooled facilities. At gigawatt scale, NVIDIA estimates that up to 40 percent of grid power can be lost before reaching compute under traditional designs. The share of facility power that becomes billable tokens is therefore a function of facility design, not only chip selection.
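
Because PUE is a ratio of total facility power to compute power, its reciprocal gives the share of facility power that actually reaches compute. The sketch below works through the two PUE values mentioned above; the function names are hypothetical.

```python
def compute_share(pue: float) -> float:
    """Fraction of total facility power that reaches compute, given PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 / pue


def overhead_share(pue: float) -> float:
    """Fraction of total facility power consumed by cooling and other overhead."""
    return 1.0 - compute_share(pue)


if __name__ == "__main__":
    for pue in (1.1, 1.5):
        print(pue, round(compute_share(pue), 3), round(overhead_share(pue), 3))
```

At a PUE of 1.5 roughly one third of facility power never reaches compute; at 1.1 the overhead is about 9 percent.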

System architecture. The shift toward mixture-of-experts (MoE) model architectures, in which a subset of a model’s parameters is activated for any given token, has driven demand for rack-scale systems with high-bandwidth interconnect (NVIDIA NVL72, AMD Helios, AWS Trainium3). MoE inference distributes computation across many GPUs that must communicate frequently; the cost of that communication is paid in latency and power, and is materially lower in rack-scale designs than in loosely coupled eight-GPU systems. Smaller systems remain economically competitive for latency-sensitive workloads at the right edge of the Pareto curve, where rack-scale advantages diminish.

Disaggregated serving. Inference can be separated into a compute-intensive prefill phase (prompt processing) and a bandwidth-limited decode phase (token generation). Frameworks that run these phases on different pools of GPUs (such as NVIDIA Dynamo or AMD MoRI) can materially improve tokens-per-watt for a given goodput target. The optimal ratio of prefill to decode capacity varies by workload.

Numerical precision. Lower-precision data types (FP8, and increasingly FP4) reduce the memory, bandwidth, and compute required per token. The accuracy tradeoff is non-trivial but has narrowed significantly with block-scaling techniques. Quantization is an explicit economic lever, with measurable gains in throughput per watt set against measurable losses in output quality.

Software stack. The same hardware running different inference engines (vLLM, SGLang, TensorRT-LLM, vendor-specific microservices) can produce materially different tokens per watt for the same model. Benchmarks shift on a timescale of weeks as engines are optimized for new models and new precision formats. For operators, this means hardware investment alone does not determine unit cost; the software stack and the cadence at which it is updated are first-order variables.

The collective implication is that “cost per token” is not a property of a model. It is a property of a configuration: model, precision, inference engine, hardware generation, system topology, cooling architecture, and operating point on the Pareto curve.

Token economics cannot ignore the elements of hardware, power and efficiency in calculating cost.

The AI Factory Framing

Jensen Huang has, across recent GTC keynotes, advanced the framing of AI infrastructure as the “AI factory.” The proposition is that modern data centers no longer simply host applications; they manufacture a product. That product is tokens, and tokens are the raw form from which language, images, code, designs, and decisions are reconstituted. Electricity and silicon enter the factory; tokens emerge, to be consumed by downstream applications.

The framing is useful for token economics because it brings the unit economics of AI into the same conceptual space as the unit economics of any manufactured good. A factory has throughput, yield, defect rate, cost per unit, and a useful life. So does an AI inference. The relevant top-line metric for a data center operator is revenue per megawatt: the weighted product of tokens produced, the tier those tokens occupy on the goodput curve, and the price per million tokens at that tier. A generational hardware upgrade can shift this product by a factor of several, potentially without any change to the underlying applications.
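
A minimal sketch of that weighted product, assuming each tier is assigned a share of the power envelope. The tier names, throughput figures, and prices are hypothetical placeholders and not measured values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Tier:
    name: str
    share_of_power: float       # fraction of the power envelope assigned to this tier
    tokens_per_s_per_mw: float  # throughput at this tier's operating point
    price_per_m_tokens: float   # dollars per million tokens at this tier


def revenue_per_mw_hour(tiers: Sequence[Tier]) -> float:
    """Dollars of token revenue per megawatt per hour, weighted by tier mix."""
    if abs(sum(t.share_of_power for t in tiers) - 1.0) > 1e-9:
        raise ValueError("tier power shares must sum to 1.0")
    seconds_per_hour = 3600
    return sum(
        t.share_of_power * t.tokens_per_s_per_mw * seconds_per_hour
        / 1_000_000 * t.price_per_m_tokens
        for t in tiers
    )


# Hypothetical tier mix: high-throughput bulk tokens versus high-interactivity premium tokens.
mix = [
    Tier("bulk", share_of_power=0.7, tokens_per_s_per_mw=1_000_000, price_per_m_tokens=0.50),
    Tier("premium", share_of_power=0.3, tokens_per_s_per_mw=100_000, price_per_m_tokens=10.00),
]

if __name__ == "__main__":
    print(revenue_per_mw_hour(mix))
```

Under this sketch a hardware upgrade appears as a higher `tokens_per_s_per_mw` at each operating point, which raises revenue per megawatt without any change to the applications consuming the tokens.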

For buyers, the same framing yields a complementary metric: cost per outcome, traced through to cost per inference, traced through to cost per token at the relevant goodput tier. Whether or not Huang’s prediction that every company will eventually operate two factories holds in full, the measurement obligation it implies is independent of the prediction. An organization producing or consuming tokens at scale needs to know its production cost, its throughput, its idle rate, and the value of what those tokens are used to generate.

FinOps for AI as the Enabling Discipline

The FinOps Framework, originally developed for variable cloud spend, has been extended to address AI as a category of technology value. The extension is structural rather than cosmetic. It introduces token-level metering into the Understand Usage and Cost domain, attaches AI-specific unit metrics to the Quantify Business Value domain, and adds practices such as semantic caching, model tiering, prompt optimization, and goodput-aware routing to the Optimize Usage and Cost domain.

The role of FinOps for AI is to provide a vendor-neutral, community-governed methodology for the questions token economics raises. Those questions include:

The open source FOCUS specification (FinOps Open Cost and Usage Specification) plays a supporting role by normalizing cost and usage data across providers so that token consumption from one model family can be compared with another, and so that token cost can be reconciled against the underlying cloud and data center costs that produced it.

Without a normalization layer, token economics remains provider-specific and the cross-organizational benchmarks required for industry-level discipline cannot form.

The CFO Lens: Governance, Inflection Points, and Hidden Costs

For finance organizations, the introduction of token economics is not a refinement of IT cost management. It is a structural change in how technology spend behaves on the income statement.

The scale of the underlying shift is now measurable. Enterprise generative AI spend grew from approximately $1.7 billion in 2023 to $37 billion in 2025, with the most recent year representing a 3.2x year-over-year increase and, by most independent counts, the fastest software category expansion on record. Yet enterprise capability to attribute that spend to business outcomes lags meaningfully.

McKinsey’s State of AI 2025 survey finds that eighty-eight percent of organizations now use AI in at least one business function, but only six percent qualify as high performers attributing more than five percent of EBIT to AI; two-thirds of respondents have not yet begun scaling AI across the enterprise. The gap between adoption and measured impact is the operating space that token economics is meant to close.

Three properties distinguish AI spend from prior technology cost categories.

Non-linearity. Token volumes can move by orders of magnitude in response to product changes, model updates, or shifts in user behavior. Multi-agent and agentic workloads, with continuous inference and autonomous tool use, are the principal source of unpredictable expansion. Cost forecasts built on prior-period run rates have, in practice, failed to capture this behavior.

Obfuscation. A growing share of AI cost is embedded in SaaS contracts where the token meter is not exposed to the buyer. The cost is real and rises through renewal cycles, but it cannot be governed with the tools used for explicitly metered cloud or API consumption. Some SaaS providers have begun exposing token-level usage; many have not.

The AI-native developer tools category, in particular, has demonstrated that headline subscription pricing on AI products has become a poor proxy for what they will actually cost an enterprise at scale.

Distribution across the P&L. Token cost shows up in cloud bills, SaaS renewals, professional services, capital expenditure for self-hosted infrastructure, and unmanaged shadow AI. No single budget line captures it. Industry surveys place AI consumption at one quarter to one half of total IT spend at the firms most aggressive in adoption.

These properties produce a governance requirement that finance functions have not previously had to meet for technology spend. The relevant practices, drawn from established FinOps discipline and adapted for AI, include real-time consumption monitoring at the workload level, business-unit chargeback or showback for token consumption, ROI thresholds gating new AI initiatives, and explicit policy for shadow-AI discovery and remediation.
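
As an illustration of business-unit showback, the sketch below rolls token usage records up to a dollar figure per business unit. The record layout, model names, and prices are hypothetical; real usage data would come from provider billing exports or a normalized cost dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (business_unit, model, input_tokens, output_tokens): an assumed record layout.
UsageRecord = Tuple[str, str, int, int]


def showback(records: Iterable[UsageRecord],
             prices_per_m: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Total dollar cost per business unit.

    prices_per_m maps model name to (input price, output price) per million tokens.
    """
    totals: Dict[str, float] = defaultdict(float)
    for unit, model, tokens_in, tokens_out in records:
        price_in, price_out = prices_per_m[model]
        totals[unit] += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return dict(totals)


# Hypothetical prices and usage.
prices = {"small": (1.0, 2.0), "large": (4.0, 8.0)}
usage = [
    ("support", "small", 1_000_000, 0),
    ("support", "large", 0, 1_000_000),
    ("legal", "large", 2_000_000, 0),
]

if __name__ == "__main__":
    print(showback(usage, prices))
```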

The procurement model is also shifting, with direct implications for how enterprises budget. Frontier model providers are moving away from flat-rate enterprise pricing toward seat-fee-plus-pre-committed-token-consumption structures, a posture that requires buyers to forecast compute demand as they once forecast cloud demand. The model is closer to a capacity commitment than to a traditional software subscription. Anthropic’s April 2026 enterprise pricing transition is the visible front edge of this change; the underlying logic (that flat-rate AI subscriptions cannot indefinitely absorb agentic consumption growth) applies across providers, and most large vendors are expected to follow.

A second, more strategic CFO-level question is the timing of architectural commitments. The unit economics of three deployment archetypes (SaaS-embedded, API-consumed, and self-hosted) cross over at predictable token volumes. SaaS-embedded AI carries the lowest activation cost and the highest per-token cost; API consumption carries the highest visibility and the highest sensitivity to provider pricing; self-hosted inference (the AI factory archetype) carries the highest capital commitment and the lowest marginal cost per token at sustained scale, but the highest cost risk when underused. The inflection points between these models are calculable if token demand is modeled early. They are not calculable if the question is deferred until the spend itself forces it. Infrastructure commitments at the self-hosted end of the spectrum, in particular, are difficult to unwind, which makes the timing of the decision economically material.
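
A simple way to locate such an inflection point is to model each archetype as a fixed monthly cost plus a per-token rate and solve for the volume at which two archetypes cost the same. The linear cost model and all dollar figures below are illustrative assumptions; the sketch ignores underutilization risk and step changes in capacity.

```python
from typing import Optional


def monthly_cost(fixed: float, price_per_m: float, tokens_m: float) -> float:
    """Monthly cost in dollars for a volume expressed in millions of tokens."""
    return fixed + price_per_m * tokens_m


def crossover_volume(fixed_a: float, price_a: float,
                     fixed_b: float, price_b: float) -> Optional[float]:
    """Monthly volume (millions of tokens) above which option B is cheaper than A.

    Assumes B carries the higher fixed cost and the lower per-token price;
    returns None when that is not the case and no such crossover exists.
    """
    if fixed_b <= fixed_a or price_b >= price_a:
        return None
    return (fixed_b - fixed_a) / (price_a - price_b)


# Hypothetical figures: API consumption versus self-hosted inference.
API_FIXED, API_PRICE = 0.0, 10.0          # no commitment, $10 per million tokens
HOSTED_FIXED, HOSTED_PRICE = 50_000, 5.0  # $50k per month fixed, $5 per million tokens

if __name__ == "__main__":
    v = crossover_volume(API_FIXED, API_PRICE, HOSTED_FIXED, HOSTED_PRICE)
    print(v, monthly_cost(API_FIXED, API_PRICE, v), monthly_cost(HOSTED_FIXED, HOSTED_PRICE, v))
```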

The CFO mandate that emerges from this picture is the connection of token consumption to the P&L with the same rigor applied to other capital and operating decisions.

Tokens, in this framing, are not a technology metric. They are a unit of cost and a unit of value that finance needs to see, measure, and govern.

Connecting Tokens to AI Value

The purpose of token economics is not to minimize token consumption. It is to connect token consumption to value. A model that consumes ten times the tokens of an alternative but produces an outcome worth one hundred times more is economically preferable; a model that consumes a tenth of the tokens but produces an unusable output is not a savings.
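
The comparison can be sketched as net value per outcome: the value of the outcome minus the token cost of producing it. The dollar values, token counts, and price below are hypothetical and chosen only to mirror the ratios in the paragraph above.

```python
def token_cost(tokens: int, price_per_m: float) -> float:
    """Dollar cost of a token count at a per-million price."""
    return tokens * price_per_m / 1_000_000


def net_value(outcome_value: float, tokens: int, price_per_m: float) -> float:
    """Value of the outcome minus the token cost of producing it."""
    return outcome_value - token_cost(tokens, price_per_m)


PRICE = 10.0  # hypothetical dollars per million tokens

# Ten times the tokens, one hundred times the outcome value.
frugal = net_value(outcome_value=1.0, tokens=10_000, price_per_m=PRICE)
thorough = net_value(outcome_value=100.0, tokens=100_000, price_per_m=PRICE)
# A tenth of the tokens, but an unusable output.
unusable = net_value(outcome_value=0.0, tokens=1_000, price_per_m=PRICE)

if __name__ == "__main__":
    print(frugal, thorough, unusable)
```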

Practitioners commonly track a small set of metrics to maintain this connection:

These metrics are useful only when paired with a business-side measure of value: revenue uplift, cost-to-serve reduction, cycle-time compression, defect-rate improvement, or customer lifetime value. Token economics treated as a cost discipline alone tends toward false economies.

Token economics treated as a value discipline, with cost as one of two inputs, tends to produce the decisions that scale.

The Engineering Lens: Efficiency Determines Token Economic Viability

The diagnostic claim that AI consumption is non-linear, unpredictable, and rising faster than per-token price reductions can offset is correct, but it is not the whole story. A substantial and rapidly maturing set of supply-side techniques is now available to organizations that wish to govern token consumption rather than merely observe it. Token economics, treated rigorously, includes the architectural and engineering levers available to reduce the tokens required to produce a given outcome.

The token economic levers deployed by engineering teams affect viability substantially:

Model routing and cascading. The cost-and-capability variance across the model landscape is wide enough that routing each query to the cheapest model capable of producing an acceptable answer is a first-order optimization. The principle was formalized in academic work as the FrugalGPT cascade, which achieved cost reductions of up to ninety-eight percent against GPT-4 baselines by sequentially querying smaller models and escalating only when a scoring function judged the output insufficient.

The open-source RouteLLM framework extended the approach using preference data, demonstrating cost reductions in excess of eighty-five percent on standard benchmarks. AWS Bedrock now ships intelligent prompt routing as a managed service, and similar capabilities are appearing across the major hyperscaler and inference-platform vendors. The practical implication for an enterprise is that single-model deployments are an increasingly visible inefficiency.
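
The cascading idea can be sketched in a few lines: try models from cheapest to most expensive and stop at the first answer a scoring function accepts. This is a simplified illustration of the general pattern, not the FrugalGPT or RouteLLM implementation; the model tuple layout, the scoring function, and the threshold are all assumptions.

```python
from typing import Callable, Sequence, Tuple

# (name, cost per call in dollars, generate function): an assumed layout.
Model = Tuple[str, float, Callable[[str], str]]


def cascade(query: str, models: Sequence[Model],
            score: Callable[[str, str], float],
            threshold: float) -> Tuple[str, str, float]:
    """Query models in the given (cheapest-first) order.

    Returns (model name, answer, total cost incurred). If no answer clears
    the threshold, the last model's answer is returned.
    """
    if not models:
        raise ValueError("at least one model is required")
    total_cost = 0.0
    name, answer = "", ""
    for name, cost, generate in models:
        answer = generate(query)
        total_cost += cost  # every attempted call is paid for
        if score(query, answer) >= threshold:
            break
    return name, answer, total_cost
```

Note that an escalated query pays for every model attempted, so the savings depend on how often the cheaper models produce acceptable answers.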

Code execution as a tool-use pattern. The Model Context Protocol pattern of exposing tools to an agent through enumerated schemas scales poorly because each tool definition is loaded into context on every turn. The alternative pattern, arrived at independently by Anthropic and Cloudflare (which calls it Code Mode), has the agent write code that calls tools rather than calling tools directly. Cloudflare reports an eighty-one percent reduction in token usage for a multi-step calendar task and approximately ninety-nine point nine percent for an MCP server exposing twenty-five hundred Cloudflare API endpoints, where roughly 1.17 million tokens of flat tool definitions compress to approximately one thousand tokens with two meta-tools. Anthropic reports a ninety-eight point seven percent reduction on a Google Drive to Salesforce workflow under the same pattern. The architectural insight is that models are stronger at writing code that calls tools than at selecting tools through enumerated function-calling syntax.

Context compression and pruning. The tokens that matter inside a retrieval-augmented generation pipeline are typically a small fraction of the tokens retrieved. Zilliz’s open-source semantic highlighting model performs sentence-level relevance filtering and reports seventy to eighty percent reductions in tokens sent to the underlying language model, with measurable improvements in answer quality on top of the cost gains. Domain-specific approaches achieve comparable or higher reductions: Flexpa reports ninety-two percent context reduction in a healthcare deployment by translating FHIR queries into SQL before invoking the model, rather than passing raw clinical records into context. Prompt-compression research (LLMLingua and its successors) operates on natural language directly. These techniques do not require model changes; they sit between retrieval and inference and recover spend that the model would otherwise pay for noise.

Structured output and data format. The choice of serialization format meaningfully affects token consumption. Microsoft research demonstrates that function-calling-based structured output is materially more token-efficient than free-form JSON generation. CSV, TSV, and newer formats designed specifically for LLM consumption (such as TOON) consume thirty to sixty percent fewer tokens than JSON for equivalent tabular data, with corresponding gains in parsing speed. The implication for application developers is that data format is a cost lever and should be treated as one.
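
The effect of serialization format can be seen by encoding the same rows as JSON and as CSV and comparing sizes. The sketch below uses character count as a rough proxy; it is not a token count, and actual savings depend on the tokenizer and the shape of the data. The sample rows are hypothetical.

```python
import csv
import io
import json
from typing import Dict, List


def as_json(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as a JSON array of objects (keys repeat on every row)."""
    return json.dumps(rows)


def as_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as CSV (keys appear once, in the header)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical tabular data.
rows = [{"service": "search", "region": "us-east", "tokens": 120_000 + i} for i in range(50)]
json_chars = len(as_json(rows))
csv_chars = len(as_csv(rows))

if __name__ == "__main__":
    print(json_chars, csv_chars)
```

The saving comes mainly from CSV stating the field names once rather than on every row, which is why it grows with row count.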

Retrieval-augmented generation versus long context. The widespread availability of context windows in the hundreds of thousands or millions of tokens has reopened the architectural question of whether to retrieve at all. The defensible answer in 2026 is that the two approaches are complementary rather than competitive. Long-context inference tends to produce marginally higher answer quality on retrieval-heavy tasks at substantially higher per-query cost and latency; RAG produces near-equivalent results at a small fraction of the cost. Hybrid approaches, such as the self-routing pattern that uses retrieval by default and escalates to long context when retrieval confidence is low, reduce cost by thirty to sixty-five percent against pure long-context implementations while preserving answer quality on the residual hard cases.

Caching, model tiering, and prompt optimization. The familiar FinOps optimization vocabulary applies inside the token layer. Semantic caching of equivalent or near-equivalent queries can return cached responses without invoking the model at all. Model tiering (routing simple tasks to smaller or self-hosted models, reserving frontier capability for the queries that require it) is a cousin of cascading and similarly addresses the cost-versus-capability mismatch that single-model deployments produce. Prompt optimization (shortening, structuring, and templating prompts to remove redundant tokens) is the cheapest optimization available and is consistently undervalued in production deployments.
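
The caching idea can be sketched with a small near-duplicate cache. A production semantic cache would typically compare embedding vectors; here word-set overlap (Jaccard similarity) stands in for that comparison so the example stays self-contained. The class name, the threshold, and the `call_model` parameter are hypothetical.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple


class NearDuplicateCache:
    """Toy cache keyed on word-set overlap, a stand-in for embedding similarity."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[FrozenSet[str], str]] = []

    @staticmethod
    def _key(query: str) -> FrozenSet[str]:
        return frozenset(query.lower().split())

    @staticmethod
    def _similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        for cached_key, response in self.entries:
            if self._similarity(key, cached_key) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self._key(query), response))


def answer(query: str, cache: NearDuplicateCache,
           call_model: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (response, served_from_cache); the model is called only on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_model(query)
    cache.put(query, response)
    return response, False
```

The threshold is the key design choice: set too low, the cache returns answers to questions that only look alike; set too high, it rarely hits.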

The collective implication of these techniques is that the engineering work of reducing tokens-per-outcome is no longer optional for organizations consuming AI at scale. It is the supply-side counterpart to the demand-side discipline that token economics asks finance and operations to apply, and it is increasingly where the difference between economically viable and economically unviable AI deployments is decided.

Conclusion

Token economics is the unit-economic vocabulary of the AI era. It introduces the token as the atomic accounting unit for AI consumption, but does not stop there. A complete practice recognizes that tokens are heterogeneous in quality, that their cost is determined by a stack of hardware, power, software, and architectural decisions, and that their accounting reaches across cloud, data center, SaaS, and operational layers. The practice of token economics is as much a finance exercise as a technology exercise, and the questions it raises (about inflection points, about hidden cost, about agentic spend volatility) are the questions a CFO and a CIO/CTO now share.

The methodology that operationalizes this practice, on a vendor-neutral basis, is FinOps for AI. Its contribution is not a new framework so much as a disciplined extension of an existing one: the same questions the FinOps community has asked of cloud spend for the past decade, now asked of the variable cost of intelligence itself, and now extended further into the physical infrastructure that produces it.

References

  1. FinOps Foundation. FinOps Framework: Domains, Capabilities, and Personas. https://www.finops.org/framework/
  2. FinOps Foundation. FOCUS: FinOps Open Cost and Usage Specification. https://focus.finops.org/
  3. Moseley, K., Perez, K., and Mahajan, P. “Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt.” NVIDIA Technical Blog, March 2026. https://developer.nvidia.com/blog/scaling-token-factory-revenue-and-ai-efficiency-by-maximizing-performance-per-watt/
  4. Mann, T. “Unpacking the deceptively simple science of tokenomics.” The Register, March 7, 2026. https://www.theregister.com/on-prem/2026/03/07/unpacking-the-deceptively-simple-science-of-tokenomics/
  5. Deloitte. “AI Tokenomics: A CFO’s Guide to Governing the AI P&L.” Deloitte on the Wall Street Journal, 2026. https://deloitte.wsj.com/riskandcompliance/tokenomics-a-cfos-guide-to-governing-the-ai-p-l-ea09aed4
  6. Deloitte. “Navigate the Economics of AI: How Tokenomics Is Reshaping AI Costs and ROI.” January 2026. https://www.deloitte.com/us/en/services/consulting/articles/how-to-navigate-economics-of-ai.html
  7. Deloitte Insights. “AI Tokens: How to Navigate AI’s New Spend Dynamics.” January 2026. https://www.deloitte.com/us/en/insights/topics/emerging-technologies/ai-tokens-how-to-navigate-spend-dynamics.html
  8. Deloitte. “The CFO Guide to Tech Trends 2026.” 2026. https://www.deloitte.com/us/en/what-we-do/capabilities/finance-transformation/articles/cfo-guide-to-tech-trends.html
  9. SemiAnalysis. InferenceX: Inference Performance, Efficiency, and Cost Benchmarks. https://inferencex.semianalysis.com/
  10. Huang, J. Remarks on AI factories, token economics, and inference revenue per megawatt. NVIDIA GTC 2026 keynote and quarterly earnings commentary, 2026.
  11. Goldman Sachs Global Institute. Analysis of the cumulative AI infrastructure build-out, useful life of silicon, and per-megawatt data center costs, as referenced in FinOps Foundation source briefing materials, 2025-2026.
  12. FinOps Foundation. State of FinOps Report 2026: AI Spend Management. https://www.finops.org/insights/state-of-finops/
  13. Ramaswami, V. and Albert, S. “The Price of Tokenmaxxing: Claude’s Explosive Growth and the Real Cost of Intelligence.” Madrona Venture Group, April 2026. https://www.madrona.com/price-of-tokenmaxxing-claude-explosive-growth-cost-of-intelligence/
  14. LaRocque, G. “The Anthropic Pricing Shift Is a Reckoning That Work Tech Can’t Ignore.” WorkTech, April 2026. https://1worktech.com/the-anthropic-pricing-shift-is-a-reckoning-that-work-tech-cant-ignore/
  15. International Energy Agency. “Energy and AI.” World Energy Outlook Special Report, 2025. https://www.iea.org/reports/energy-and-ai
  16. International Energy Agency. “Key Questions on Energy and AI.” 2026. https://www.iea.org/reports/key-questions-on-energy-and-ai
  17. Chen, L., Zaharia, M., and Zou, J. “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.” Transactions on Machine Learning Research, 2024. https://arxiv.org/abs/2305.05176
  18. Ong, I., Almahairi, A., Wu, V., Chiang, W-L., Wu, T., Gonzalez, J., Kadous, M.W., and Stoica, I. “RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing.” LMSYS Org, July 2024. https://lmsys.org/blog/2024-07-01-routellm/
  19. Amazon Web Services. “Amazon Bedrock Intelligent Prompt Routing.” AWS Documentation, 2025. https://aws.amazon.com/bedrock/intelligent-prompt-routing/
  20. Proser, Z. “Cloudflare: Code Mode Cuts Token Usage by 81%.” WorkOS Blog, December 2025. https://workos.com/blog/cloudflare-code-mode-cuts-token-usage-by-81
  21. Cloudflare. “Code Mode: Give Agents an Entire API in 1,000 Tokens.” Cloudflare Blog, February 2026. https://blog.cloudflare.com/code-mode-mcp/
  22. Zilliz. “The 70% Token Reduction Breakthrough: How Semantic Highlighting Is Rewriting RAG Economics.” RAG About It, 2026. https://ragaboutit.com/the-70-token-reduction-breakthrough-how-semantic-highlighting-is-rewriting-rag-economics/
  23. Flexpa. “How We Used SQL on FHIR to Shrink LLM Context by 92%.” Flexpa Engineering Blog. https://www.flexpa.com/blog/sql-on-fhir-for-llm-context-reduction
  24. Williams, B. “Token Efficiency with Structured Output from Language Models.” Data Science at Microsoft, July 2024. https://medium.com/data-science-at-microsoft/token-efficiency-with-structured-output-from-language-models-be2e51d3d9d5
  25. GetCrux. “Is CSV Format Better than JSON for Sending Data to LLMs?” 2024. https://www.getcrux.ai/blog/experiment-data-formats---json-vs-csv
  26. Nicoomanesh, A. “Token Efficiency and Compression Techniques in Large Language Models: Navigating Context-Length Limits.” Medium, October 2024. https://medium.com/@anicomanesh
  27. Legion Intel. “Comparison: RAG vs. Long Context Window Models.” Legion Secure AI. https://www.legionintel.com/blog/rag-systems-vs-lcw-performance-and-cost-trade-offs
  28. MIT CSAIL. “RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation.” Proceedings of the International Symposium on Computer Architecture (ISCA), 2025. https://people.csail.mit.edu/suvinay/pubs/2025.rago.isca.pdf
  29. Sharwood, S. “AWS Pricing for Kiro Dev Tool ‘A Wallet-Wrecking Tragedy.’” The Register, August 2025. https://www.theregister.com/2025/08/18/aws_updated_kiro_pricing/
  30. Vantage. “Cursor Pricing Explained 2026.” March 2026. https://www.vantage.sh/blog/cursor-pricing-explained
  31. Morph. “Cursor Model Pricing: Plans, Credits, and Hidden Costs (2026).” March 2026. https://www.morphllm.com/cursor-model-pricing
  32. Yegge, S. “Welcome to Gas Town.” Medium, January 2026. https://steve-yegge.medium.com/welcome-to-gas-town
  33. Menlo Ventures. “2025: The State of Generative AI in the Enterprise.” December 2025. https://menlovc.com/perspective/2025-the-state-of-generative-ai-in-the-enterprise/
  34. McKinsey & Company. “The State of AI in 2025: Agents, Innovation, and Transformation.” November 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
