Why Your OpenAI Bill Lies to You — CuriousDevs Blog

The situation every engineering team recognises

CFO pings on Slack: "Our OpenAI spend jumped 34% last month. What happened?" Engineering lead checks the dashboard. Sees one number: $18,400. Opens a spreadsheet. Types team names. Guesses percentages. Sends the reply. Nobody is confident it is accurate.

What OpenAI actually gives you

OpenAI's usage dashboard shows spend broken down by model and time period, plus project or API key if you have split your usage that way. That is it. There is no concept of feature, team, product line, deployment environment, or user cohort in the API response or the invoice.

Here is a real OpenAI API response object:

// What you get from /v1/usage
{
  "object": "list",
  "data": [
    {
      "aggregation_timestamp": 1717977600,
      "n_requests": 14820,
      "operation": "completion",
      "snapshot_id": "gpt-4o-2024-05-13",
      "n_context_tokens_total": 48291004,
      "n_generated_tokens_total": 9182330
    }
  ]
}

Notice what is missing: any context about why those 14,820 requests happened, which product feature triggered them, which team owns them, or which environment they ran in. You have aggregate counts and token totals. Nothing else.
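The record's only grouping keys are snapshot_id, operation and aggregation_timestamp, so the finest breakdown a script can pull out of it is per-model totals and per-request averages. A minimal sketch, assuming the input is the data array of a response shaped like the one above; summariseBySnapshot is an illustrative name, not part of any SDK:

```javascript
// Sketch: summarise legacy /v1/usage records by model snapshot.
// snapshot_id, operation and aggregation_timestamp are the only grouping keys;
// nothing in a record identifies a feature, team or environment.
function summariseBySnapshot(records) {
  const summary = new Map();
  for (const r of records) {
    const s = summary.get(r.snapshot_id) ?? { requests: 0, inputTokens: 0, outputTokens: 0 };
    s.requests += r.n_requests;
    s.inputTokens += r.n_context_tokens_total;
    s.outputTokens += r.n_generated_tokens_total;
    summary.set(r.snapshot_id, s);
  }
  // Per-request averages: the most granular figure this data supports.
  for (const s of summary.values()) {
    s.avgInputPerRequest = s.inputTokens / s.requests;
    s.avgOutputPerRequest = s.outputTokens / s.requests;
  }
  return summary;
}

const usage = {
  object: "list",
  data: [{
    aggregation_timestamp: 1717977600,
    n_requests: 14820,
    operation: "completion",
    snapshot_id: "gpt-4o-2024-05-13",
    n_context_tokens_total: 48291004,
    n_generated_tokens_total: 9182330,
  }],
};

const gpt4o = summariseBySnapshot(usage.data).get("gpt-4o-2024-05-13");
console.log(gpt4o.avgInputPerRequest.toFixed(1), gpt4o.avgOutputPerRequest.toFixed(1)); // 3258.5 619.6
```

About 3,260 input and 620 output tokens per request is as far as the data goes; nothing in it says which feature those tokens served.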

The real pricing breakdown (June 2025)

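Before we get to attribution, it helps to see why this matters at scale. Per-token prices across OpenAI's models alone span 100× (GPT-4o-mini versus o1-preview), and the gap between GPT-4o and GPT-4o-mini is nearly 17×: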

Model	Input / 1M tokens	Output / 1M tokens	Context window
GPT-4o	$2.50	$10.00	128K
GPT-4o-mini	$0.15	$0.60	128K
GPT-4-turbo	$10.00	$30.00	128K
GPT-3.5-turbo	$0.50	$1.50	16K
o1-preview	$15.00	$60.00	128K
o1-mini	$3.00	$12.00	128K

The difference between GPT-4o and GPT-4o-mini is 16.7× on input, 16.7× on output. If your staging environment runs GPT-4o when it only needs GPT-4o-mini, you are burning money in a silent background process that nobody looks at.
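The table translates directly into a cost function. A sketch, assuming the list prices from the table (keyed by illustrative model labels rather than exact API model IDs) and ignoring discounts such as cached-input or batch pricing, applied to the token totals from the usage record above:

```javascript
// List prices from the table above, in USD per 1M tokens.
const PRICES = {
  "gpt-4o":        { input: 2.50,  output: 10.00 },
  "gpt-4o-mini":   { input: 0.15,  output: 0.60 },
  "gpt-4-turbo":   { input: 10.00, output: 30.00 },
  "gpt-3.5-turbo": { input: 0.50,  output: 1.50 },
  "o1-preview":    { input: 15.00, output: 60.00 },
  "o1-mini":       { input: 3.00,  output: 12.00 },
};

// Cost in USD of a workload with the given input and output token counts.
function costUsd(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  if (!p) throw new Error(`No price for model ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// Token totals from the usage record, priced on both models.
const inputTokens = 48_291_004;
const outputTokens = 9_182_330;
const full = costUsd("gpt-4o", inputTokens, outputTokens);
const mini = costUsd("gpt-4o-mini", inputTokens, outputTokens);
console.log(full.toFixed(2), mini.toFixed(2), (full / mini).toFixed(1)); // 212.55 12.75 16.7
```

The ratio stays at 16.7× for any mix of input and output tokens because both price components differ by the same factor.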

Where the money actually goes (real distribution)

Across TokenFin beta users, after adding attribution instrumentation, the average spend distribution looked like this:

Category	Share of spend
User-facing product features	48%
Batch processing pipelines	22%
Dev + staging environments	19%
Retry loops & error recovery	7%
Internal tooling	4%

The dev/staging 19% is the one that shocks every team. These environments run the same GPT-4o calls as production. They generate no revenue. And because nobody watches them closely, inefficiencies compound undetected for months.
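A split like this falls out of a simple group-and-sum once every call carries a cost and a category tag. A minimal sketch, assuming hypothetical records of the form { category, costUsd }; the sample values are made up for illustration:

```javascript
// Sum cost per category, then express each category as a share of the total.
function spendDistribution(records) {
  const totals = new Map();
  let grand = 0;
  for (const { category, costUsd } of records) {
    totals.set(category, (totals.get(category) ?? 0) + costUsd);
    grand += costUsd;
  }
  // Largest categories first; an empty input yields zero shares instead of NaN.
  return [...totals.entries()]
    .map(([category, usd]) => ({ category, usd, share: grand ? usd / grand : 0 }))
    .sort((a, b) => b.usd - a.usd);
}

const sampleCalls = [
  { category: "product", costUsd: 12 },
  { category: "staging", costUsd: 5 },
  { category: "product", costUsd: 8 },
  { category: "batch", costUsd: 5 },
];
for (const row of spendDistribution(sampleCalls)) {
  console.log(`${row.category}: ${(row.share * 100).toFixed(0)}%`);
}
```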

The attribution debt spiral — how it compounds

This is not a one-time problem. It compounds.

Month 1: You ship a chat feature. One LLM call path. Easy to reason about mentally.

Month 3: Summarization pipeline added. Batch job added. New team starts using the API. Mental model breaks.

Month 6: 8–14 distinct call paths. Multiple models. Dev/staging/prod all hitting the same billing account. The invoice is now a black box.

Month 9: CFO asks for a cost forecast. Engineering produces a number with error bars of ±40%. Trust collapses.

// Real cost curve from a TokenFin beta user (anonymised)

Month	Spend	Calls	Notes
Jan 2025	$4,200	28K	Launch month, 1 feature
Feb 2025	$6,800	46K	+Summarisation pipeline
Mar 2025	$11,200	74K	+Batch job, +1 team
Apr 2025	$18,400	121K	Unknown drivers, panic
May 2025	$11,600	128K	TokenFin added, waste found

Note: call volume went from 121K to 128K (up) while spend dropped from $18.4K to $11.6K (down 37%). The waste was in model selection and retry logic, not volume.
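Dividing spend by call volume makes the note's point concrete. A sketch using the table's figures, with the rounded call counts (28K, 46K and so on) expanded to whole numbers, so the per-call costs are approximate:

```javascript
// Monthly figures from the cost-curve table; call counts are the table's rounded values.
const months = [
  { month: "Jan 2025", spendUsd: 4200,  calls: 28_000 },
  { month: "Feb 2025", spendUsd: 6800,  calls: 46_000 },
  { month: "Mar 2025", spendUsd: 11200, calls: 74_000 },
  { month: "Apr 2025", spendUsd: 18400, calls: 121_000 },
  { month: "May 2025", spendUsd: 11600, calls: 128_000 },
];

// Average cost per call: total spend divided by call volume.
for (const m of months) m.costPerCall = m.spendUsd / m.calls;

const [apr, may] = months.slice(-2);
const spendChange = (may.spendUsd - apr.spendUsd) / apr.spendUsd;            // about -37%
const perCallChange = (may.costPerCall - apr.costPerCall) / apr.costPerCall; // about -40%

for (const m of months) console.log(m.month, `$${m.costPerCall.toFixed(3)} per call`);
console.log(`Apr to May: spend ${(spendChange * 100).toFixed(1)}%, cost per call ${(perCallChange * 100).toFixed(1)}%`);
```

Cost per call sits near $0.15 from January to April, then falls to about $0.09 in May while volume rises, which is consistent with the waste being per call (model choice, retries) rather than in volume.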

What token-level attribution actually looks like

The fix is to attach metadata to every LLM call at the point of invocation. Not post-hoc analysis — at call time.

import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Without attribution (before)
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: messages,
});

// With TokenFin attribution (after): the same call, wrapped in track()
// (track comes from the TokenFin SDK; its import is not shown here)
const response = await track(
  openai.chat.completions.create({
    model: 'gpt-4o',
    messages: messages,
  }),
  { team: 'ml-infra', feature: 'semantic-search',
    env: 'production', userId: req.user.id }
);

Now every call in your dashboard shows exactly what it cost and why it ran. You can filter by team, feature, environment, model. You can set budget alerts per feature. You can compare cost-per-output-token across models for the same feature.
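To make the idea concrete, here is a hypothetical sketch of what a call-time wrapper in the spirit of track() could do. It is not TokenFin's implementation: trackCall, the in-memory ledger and the two-model price map are assumptions for illustration. It relies only on the model and usage.prompt_tokens / usage.completion_tokens fields that Chat Completions responses include:

```javascript
// Hypothetical attribution wrapper (not TokenFin's SDK). Prices in USD per 1M tokens.
const PRICES_PER_M = {
  "gpt-4o":      { input: 2.5,  output: 10 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

const ledger = []; // a real system would write to a database or metrics pipeline

async function trackCall(responsePromise, tags) {
  const response = await responsePromise;
  const { prompt_tokens = 0, completion_tokens = 0 } = response.usage ?? {};
  // Responses name a dated snapshot, so match by prefix, longest key first
  // ("gpt-4o-mini-..." must not be priced as "gpt-4o").
  const key = Object.keys(PRICES_PER_M)
    .sort((a, b) => b.length - a.length)
    .find((k) => response.model.startsWith(k));
  const price = key ? PRICES_PER_M[key] : null;
  const costUsd = price
    ? (prompt_tokens * price.input + completion_tokens * price.output) / 1e6
    : NaN; // unknown model: keep the token counts, flag the cost as unknown
  ledger.push({ ...tags, model: response.model, prompt_tokens, completion_tokens, costUsd, at: Date.now() });
  return response; // the caller gets the untouched response back
}

// Total recorded cost grouped by any tag, e.g. "feature", "team" or "env".
function costBy(tag) {
  const out = {};
  for (const row of ledger) out[row[tag]] = (out[row[tag]] ?? 0) + row.costUsd;
  return out;
}
```

In application code the first argument would be the promise returned by openai.chat.completions.create(...); costBy("feature") or costBy("env") then yields per-feature or per-environment totals of the kind described above.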

The five optimisations attribution unlocks

Model right-sizing

Identify every feature running GPT-4o that only needs GPT-4o-mini. Average team finds 2–3 such features. Savings: 60–80% on those call paths.
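With calls attributed, shortlisting right-sizing candidates is a filter and a sort. A sketch over hypothetical ledger rows of the form { feature, model, costUsd }; whether a shortlisted feature actually works on GPT-4o-mini still has to be checked against that feature's own quality bar:

```javascript
// Rank features by spend on full GPT-4o (gpt-4o* models, excluding the gpt-4o-mini family).
function gpt4oSpendByFeature(rows) {
  const byFeature = new Map();
  for (const r of rows) {
    const isFullGpt4o = r.model.startsWith("gpt-4o") && !r.model.startsWith("gpt-4o-mini");
    if (!isFullGpt4o) continue;
    byFeature.set(r.feature, (byFeature.get(r.feature) ?? 0) + r.costUsd);
  }
  // Largest GPT-4o spenders first: the features most worth testing on GPT-4o-mini.
  return [...byFeature.entries()].sort((a, b) => b[1] - a[1]);
}

const ledgerRows = [
  { feature: "search",    model: "gpt-4o-2024-08-06", costUsd: 30 },
  { feature: "chat",      model: "gpt-4o-mini",       costUsd: 50 },
  { feature: "summaries", model: "gpt-4o",            costUsd: 45 },
  { feature: "search",    model: "gpt-4o",            costUsd: 20 },
];
console.log(gpt4oSpendByFeature(ledgerRows));
```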

Environment throttling

Add a config flag so non-production environments fall back to the cheaper model: if (env !== "production") use("gpt-4o-mini"). Average saving: 85% of dev/staging spend.
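The flag can be a small lookup applied wherever a model name is chosen. A sketch, assuming the environment name comes from a hypothetical APP_ENV variable and that each expensive model has a cheaper stand-in you are happy to use outside production (the mapping below is illustrative):

```javascript
// Illustrative downgrade map: expensive model -> cheaper stand-in for dev/staging.
const NON_PROD_DOWNGRADE = {
  "gpt-4o": "gpt-4o-mini",
  "gpt-4-turbo": "gpt-4o-mini",
  "o1-preview": "o1-mini",
};

// APP_ENV is a hypothetical variable name; anything other than "production" is downgraded.
function modelFor(requested, env = process.env.APP_ENV ?? "development") {
  if (env === "production") return requested;         // production keeps the model it asked for
  return NON_PROD_DOWNGRADE[requested] ?? requested;  // elsewhere, swap in the cheaper model if one is mapped
}

// At a call site: openai.chat.completions.create({ model: modelFor("gpt-4o"), messages })
console.log(modelFor("gpt-4o", "staging"));
```

Keeping the decision in one function means production behaviour never changes, and the mapping can be audited in a single place.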

Retry loop surgery

Aggressive retry logic on expensive models compounds fast. Attribution reveals which pipeline has the highest retry rate. Capping retries and adding exponential backoff typically saves 10–18% of total spend.
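A minimal sketch of bounded retries with exponential backoff and full jitter. It assumes errors expose an HTTP status on a status property (the official openai Node SDK's API errors do) and retries only 429 and 5xx responses. The official SDK also retries some failures on its own, controlled by its maxRetries client option, so an extra loop like this multiplies attempts unless one of the two is turned down:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Only rate limits (429) and server errors (5xx) are worth retrying.
function isRetryable(err) {
  const status = err?.status;
  return status === 429 || (typeof status === "number" && status >= 500);
}

async function withRetry(fn, { maxAttempts = 4, baseMs = 500, capMs = 8000, onRetry = () => {} } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt, or on errors a retry cannot fix (e.g. 400 bad request).
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Full jitter: wait a random time in [0, min(cap, base * 2^(attempt - 1))).
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** (attempt - 1));
      onRetry({ attempt, delay, err }); // hook for counting retries per pipeline
      await sleep(delay);
    }
  }
}
```

Passing an onRetry callback that increments a per-feature counter is one way to surface the retry rates that attribution is meant to expose.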

Prompt compression

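The average enterprise prompt carries 800–1,200 tokens of system prompt. After attribution you can see the cost per call clearly. Compressing a system prompt by 40% cuts the system-prompt portion of that feature's input cost by 40%; the saving on total input cost is smaller whenever user messages or retrieved context also go into the prompt.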
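The saving from trimming a system prompt scales with the tokens removed, not with the whole prompt. A sketch with illustrative inputs (a 1,000-token system prompt, a 40% cut, 100,000 calls a month at the table's GPT-4o input price of $2.50 per 1M tokens):

```javascript
// Monthly USD saved by removing a fraction of the system prompt from every call.
function systemPromptSavingsUsd({ systemTokens, reductionPct, callsPerMonth, inputPricePerM }) {
  const tokensSavedPerCall = systemTokens * reductionPct;
  return (tokensSavedPerCall * callsPerMonth * inputPricePerM) / 1_000_000;
}

// Fraction of a call's input cost removed, given the other input tokens it also sends.
function inputCostReduction(systemTokens, otherInputTokens, reductionPct) {
  return (systemTokens * reductionPct) / (systemTokens + otherInputTokens);
}

const saved = systemPromptSavingsUsd({ systemTokens: 1000, reductionPct: 0.4, callsPerMonth: 100_000, inputPricePerM: 2.5 });
console.log(saved.toFixed(2)); // 100.00
console.log(inputCostReduction(1000, 2000, 0.4).toFixed(3)); // 0.133
```

If each call also sends, say, 2,000 tokens of user messages and context, the same 400-token trim is about 13% of the feature's input cost rather than 40%.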

Caching layer identification

Attribution shows which features have repetitive, near-identical prompts. These are candidates for semantic caching (e.g., GPTCache). Average cacheable call rate: 22–35% of total calls in customer support features.
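Spotting cache candidates can start with a repeat-rate check per feature. A sketch that uses exact matching after light normalisation (lowercase, collapsed whitespace) as a cheap lower bound; it is not semantic caching, which compares embeddings rather than strings, and the sample prompts are invented:

```javascript
// Light normalisation so trivially different prompts count as the same.
function normalise(prompt) {
  return prompt.toLowerCase().replace(/\s+/g, " ").trim();
}

// Share of calls whose normalised prompt was already seen earlier in the list.
function repeatRate(prompts) {
  const seen = new Set();
  let repeats = 0;
  for (const p of prompts) {
    const key = normalise(p);
    if (seen.has(key)) repeats++;
    else seen.add(key);
  }
  return prompts.length ? repeats / prompts.length : 0;
}

const promptLog = [
  "How do I reset my password?",
  "how do I reset   my password?",
  "Where is my invoice?",
  "How do I reset my password?",
];
console.log(repeatRate(promptLog)); // 0.5
```

Features that already show a high exact repeat rate are the obvious first targets; a semantic cache would typically catch more.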

Bottom line

The invoice is not lying. The API just never asked you the right questions at call time.

Token-level attribution is not a reporting feature. It is the foundation of every cost optimisation you will make on your AI stack. Teams that instrument it in week 1 spend 30–45% less at month 6 than teams that do it reactively. The debt compounds in both directions.

Try TokenFin free — add attribution in 5 minutes →