The AI Industry’s Dirty Secret: Nobody Tells You How Tokens Actually Work
If you’ve ever used Claude, ChatGPT, or any AI assistant and suddenly hit a wall — a vague message telling you you’ve “reached your limit” with zero explanation — you’re not alone. And you’re not crazy.
Thousands of users are frustrated, confused, and feeling ripped off. The AI industry has a transparency problem, and it starts with one word: tokens.
This guide will make you an expert. No jargon. No hand-waving. Just the truth about what you’re paying for, why it runs out so fast, and how to get dramatically more value from every dollar.
What Are Tokens? (The 60-Second Version)
AI models — Claude, ChatGPT, Gemini, all of them — don’t read words the way you do. They read tokens: small chunks of text, usually 3-4 characters each.
Think of tokens like syllables. The word “understanding” isn’t one unit to an AI — it gets broken into pieces like “under” + “stand” + “ing.” Common words like “the” or “and” are a single token. Longer or unusual words get split into multiple tokens.
The rule of thumb: 1,000 tokens ≈ 750 words ≈ 3 pages of text.
Here’s what common content looks like in tokens:
| Content Type | Approximate Tokens | Word Equivalent |
|---|---|---|
| A short question (“What’s the weather?”) | ~10 tokens | 4 words |
| A typical AI response | 200-400 tokens | 150-300 words |
| A full page of text | ~300 tokens | ~250 words |
| A 10-page document | ~3,000 tokens | ~2,500 words |
| An uploaded image | 1,000-1,600 tokens | N/A (yes, images cost tokens too) |
| A 10-page PDF | 5,000-15,000 tokens | same ~2,500 words; formatting and layout inflate the count |
That last one surprises people. Everything you send to an AI — text, images, PDFs, code files — gets converted to tokens. And every token costs money, whether you’re paying through a subscription or directly through the API. If you’re building systems that rely on LLMs, understanding these costs is critical — we cover the engineering side in our guide to integrating LLMs into production.
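The rules of thumb above are easy to turn into a quick estimator. This is only the 4-characters-per-token heuristic, not a real tokenizer; providers ship exact counters (e.g. OpenAI's tiktoken library), and actual counts vary by model:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count using the ~4 characters per token rule of thumb.

    A heuristic only: real counts vary by model and tokenizer, so use the
    provider's own counter (e.g. OpenAI's tiktoken) for exact numbers.
    """
    return max(1, round(len(text) / 4))


def tokens_to_words(tokens: int) -> int:
    """Invert the 1,000 tokens = ~750 words rule of thumb."""
    return round(tokens * 0.75)
```

Run it on a 10-page document (roughly 12,000 characters of text) and you land right at the ~3,000 tokens shown in the table.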
The Hidden Mechanic That Drains Your Usage (This Is the Big One)
Here’s what almost nobody understands, and it changes everything once you do:
Every time you send a message, the AI re-reads your entire conversation from the beginning.
AI models don’t have memory. They don’t “remember” what you said three messages ago. Instead, every time you hit send, the entire conversation history — every message you’ve sent, every response the AI gave — gets bundled up and sent to the model again as input.
This means your token usage compounds as a conversation gets longer — the cumulative cost grows with the square of the message count, not linearly:
| Message # | Tokens Sent to AI | Cumulative Tokens Used | How It Feels |
|---|---|---|---|
| 1 | 500 | 500 | Fast, responsive |
| 5 | 5,000 | 15,000 | Still fine |
| 10 | 12,000 | 75,000 | Starting to slow down |
| 20 | 30,000 | 300,000 | “Why did I hit my limit?” |
| 30 | 50,000 | 750,000+ | “I only sent 30 messages!” |
Read that again. By message 30, a single conversation can consume 750,000+ tokens — even though each individual message was short. You didn’t send 750,000 tokens. The system re-sent your entire history 30 times.
This is why users say things like “I only sent 10 messages and hit my limit.” You sent 10 messages, but the AI processed the equivalent of hundreds of messages worth of text.
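The growth is easy to model. Assuming, purely for illustration, that each exchange adds a fixed number of tokens to the history, the tokens sent per message grow linearly while the cumulative total grows with the square of the message count:

```python
def conversation_cost(num_messages: int, tokens_per_exchange: int = 500):
    """Model the re-read mechanic: every message resends the full history.

    Assumes each exchange (your message plus the reply) adds a fixed
    `tokens_per_exchange` to the history, which is a simplification.
    Returns (tokens sent with the last message, cumulative tokens processed).
    """
    history = 0
    cumulative = 0
    for _ in range(num_messages):
        history += tokens_per_exchange  # the new exchange joins the context
        cumulative += history           # the entire history is sent again
    return history, cumulative
```

Doubling the conversation length roughly quadruples the cumulative bill: under these assumptions, 10 messages process about 27,500 tokens while 20 messages process about 105,000.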
Why Your $20/Month Feels Like It Disappears
Let’s talk about the subscription plans, because this is where the frustration peaks.
| Plan | Monthly Cost | What You Get | Who It’s For |
|---|---|---|---|
| Claude Free | $0 | Limited messages, Sonnet model only | Casual/tryout users |
| Claude Pro | $20 | ~5x Free usage, all models, priority | Daily users |
| Claude Max 5x | $100 | ~5x Pro usage, priority access | Power users, developers |
| Claude Max 20x | $200 | ~20x Pro usage, top priority | Heavy daily use, Claude Code |
| Claude Team | $25-$150/seat | Higher limits + admin controls | Businesses |
Here’s the problem: Anthropic doesn’t publish exact token limits for any plan. You get vague language like “at least five times” the free tier. No dashboard. No counter. No transparency.
Users have tried to reverse-engineer the actual limits through testing. Community estimates suggest Claude Pro gets roughly 30-45 Opus messages or 100+ Sonnet messages per 5-hour rolling window before throttling kicks in. But these are unofficial numbers — Anthropic has never confirmed them.
When you exceed your “fast” usage, you’re not cut off. You’re downgraded to “standard” rate — which means slower responses, potential queuing, or being routed to a smaller model. Many users report standard rate being so slow it’s essentially unusable, making it feel like a hard cutoff.
The Rolling Window That Confuses Everyone
Usage doesn’t reset daily at midnight. It operates on a rolling 5-hour window. Usage from 5+ hours ago “falls off” your count. This means:
- There’s no fixed reset time to wait for
- Heavy use in a burst hurts more than spread-out use
- You can sometimes recover mid-day if you take a break
And here’s a kicker most people don’t know: during peak hours (weekdays 5am-11am Pacific / 1pm-7pm GMT), your limits are reduced. Anthropic confirmed this affects about 7% of users, and your 5-hour session limits drain faster during these windows. Your weekly limits stay the same, but the per-session budget shrinks.
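A rolling window is just a sliding sum over timestamped usage. Here is a sketch of the behavior (Anthropic's real accounting is not public, so treat this as a model, not the implementation):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=5)


def usage_in_window(events, now):
    """Sum the token usage that still counts against a rolling 5-hour limit.

    `events` is a list of (timestamp, tokens) pairs; anything older than
    the window has "fallen off" and no longer counts.
    """
    return sum(tokens for ts, tokens in events if now - ts < WINDOW)
```

A 10,000-token burst from six hours ago contributes nothing, while last hour's usage counts in full. That is why a mid-day break can quietly restore your headroom.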
The March 2026 Token Drain Crisis (Yes, This Really Happened)
In late March 2026, something broke. Claude Code users started reporting their usage limits draining at impossible rates — Max 5x subscribers ($100/month) burning through their entire quota in under an hour. Sessions that should last 5 hours were dying in 19 minutes.
The complaints flooded Reddit, GitHub, and Anthropic’s Discord:
“I used up Max 5 in 1 hour of working, before I could work 8 hours.”
“It’s maxed out every Monday and resets at Saturday… out of 30 days I get to use Claude 12.”
“Rate-limit errors need to be caught explicitly — they look like generic failures and will silently trigger retries. One session in a loop can drain your daily budget in minutes.”
A user who reverse-engineered Claude Code’s binary claimed to find “two independent bugs that cause prompt cache to break, silently inflating costs by 10-20x.” Multiple users confirmed that downgrading to an older version (2.1.34) made a noticeable difference.
The root cause turned out to be three overlapping issues:
- Intentional peak-hour quota reductions affecting ~7% of users
- The end of a 2x off-peak promotion on March 28 that had quietly doubled limits
- Confirmed caching bugs that broke prompt caching, causing the same content to be re-processed at full price instead of the cached 90% discount
Anthropic acknowledged the crisis: “People are hitting usage limits in Claude Code way faster than expected. We’re actively investigating… it’s the top priority for the team.”
Claude Code: The Token Furnace
If regular Claude chat burns through tokens, Claude Code is a blast furnace. Here’s why.
When you use Claude through the regular chat interface, each message is one round-trip: you send a message, Claude responds. Predictable.
Claude Code is different. A single request from you can trigger dozens of internal API calls. Every time Claude Code reads a file, runs a command, searches your codebase, or edits code, that’s a separate round-trip — and each one includes:
- The full system prompt (tool definitions, configuration — thousands of tokens)
- Your entire conversation history
- All previous tool results (file contents, command output, search results)
| Action in Claude Code | What Actually Happens | Token Impact |
|---|---|---|
| You ask “fix this bug” | Claude reads 3 files, runs a command, edits 2 files | 6+ API calls, each resending full context |
| Reading a 500-line file | File contents added to conversation permanently | ~3,000-5,000 tokens added to every future call |
| Running a build command | Full stdout/stderr captured | Hundreds to thousands of tokens in context |
| A 30-tool-call session | Context grows with each call | 60,000+ tokens of accumulated context, resent each time |
The math is brutal: if Claude Code makes 30 tool calls in a session, and the context has grown to 40,000 tokens by the end, that last tool call alone sends 40,000 input tokens — just for the context, before your actual question. Multiply that across every call, and a single coding session can consume millions of tokens.
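That arithmetic can be written out directly. The defaults below (a 5,000-token system prompt, 1,500 tokens of context added per tool result) are illustrative assumptions, not measured values:

```python
def session_input_tokens(tool_calls: int,
                         system_prompt: int = 5_000,
                         tokens_per_result: int = 1_500) -> int:
    """Total input tokens when every tool call resends the system prompt
    plus all the tool results accumulated so far in the session.

    The default sizes are illustrative assumptions, not measured values.
    """
    total = 0
    context = 0
    for _ in range(tool_calls):
        total += system_prompt + context  # full context resent on this call
        context += tokens_per_result      # the result stays in context for good
    return total
```

Thirty tool calls under these assumptions already total 802,500 input tokens — most of the way to a million for one session, before a single output token is counted.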
This is why many users say the Max plan ($100-$200/month) feels like it was specifically created for Claude Code users. The Pro plan at $20/month can be exhausted in a single coding session. We use Claude Code daily to build AI-powered workflows and automated content pipelines — the token economics are very real.
Track Your Token Usage (We Built a Tool for This)
Frustrated by the lack of visibility into where your tokens go? So were we. That’s why we built MyTokenTracker — a purpose-built dashboard for Claude subscribers who want to actually understand their usage.
With MyTokenTracker, you can:
- Track usage by project — see exactly which projects are consuming your quota
- Break down by session — identify which conversations burned through tokens and why
- Filter by model — compare your Opus vs. Sonnet vs. Haiku consumption side by side
- Search and filter — drill into specific date ranges, usage patterns, and anomalies
- Spot the drain — catch runaway sessions before they eat your entire budget
It’s the transparency layer that Anthropic hasn’t built yet. If you’re tired of guessing where your tokens went, try MyTokenTracker and take control of your AI spend.
How AI Pricing Actually Works (The Comparison You Need)
Every AI company charges differently, but they all charge for the same thing: tokens in, tokens out. And output tokens always cost more — typically 3-5x more — because generating text requires more computation than reading it.
Current API Pricing (Per Million Tokens, April 2026)
| Provider | Model | Input Price | Output Price | Context Window |
|---|---|---|---|---|
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | 200K |
| | Claude Sonnet 4.6 | $3.00 | $15.00 | 200K (1M beta) |
| | Claude Opus 4.6 | $5.00 | $25.00 | 200K (1M beta) |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| | GPT-5.2 | $1.75 | $14.00 | Extended |
| | GPT-4o mini | $0.15 | $0.60 | 128K |
| Google | Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| | Gemini 3 Flash | $0.50 | $3.00 | 1M |
| xAI | Grok-4.1 Fast | $0.20 | $0.50 | 2M |
What $20/Month Actually Buys You (Subscription Comparison)
| Feature | Claude Pro ($20) | ChatGPT Plus ($20) | Gemini Advanced ($20) |
|---|---|---|---|
| Top Model | Opus 4.6 + Sonnet 4.6 | GPT-4o + o1-mini | Gemini 3.1 Pro |
| Context Window | 200K tokens | 128K tokens | 1M tokens |
| Message Limits | ~45 Opus or 100+ Sonnet msgs/5hrs | ~80 GPT-4o msgs/3hrs | Generally generous |
| Web Browsing | No | Yes | Yes (Google Search) |
| Image Generation | No | Yes (DALL-E) | Yes (Imagen) |
| Code Execution | Yes (Artifacts) | Yes (Code Interpreter) | Yes |
| Coding Assistant | Claude Code (industry-leading) | Codex (new) | Limited |
| Power Tier | Max at $100-$200 | Pro at $200 | None public |
If You Spent $20 Directly on API Credits Instead
| Provider | Model | $20 Buys (Input) | $20 Buys (Output) |
|---|---|---|---|
| Anthropic | Sonnet 4.6 | 6.7M tokens | 1.3M tokens |
| OpenAI | GPT-4o | 8M tokens | 2M tokens |
| Google | Gemini 3.1 Pro | 10M tokens | 1.7M tokens |
| | Gemini 3 Flash | 40M tokens | 6.7M tokens |
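The figures in this table fall straight out of the per-million API prices listed earlier:

```python
def millions_of_tokens(budget_usd: float, price_per_million_usd: float) -> float:
    """How many million tokens a budget buys at a given per-million price."""
    return budget_usd / price_per_million_usd

# Claude Sonnet 4.6 input at $3.00/M: millions_of_tokens(20, 3.00) -> ~6.7
# Gemini 3 Flash input at $0.50/M:    millions_of_tokens(20, 0.50) -> 40.0
```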
The Industry Pricing Trend (Good News)
AI is getting dramatically cheaper. When GPT-4 launched in March 2023, it cost $30/$60 per million tokens. Today’s models that match or exceed that quality cost a fraction of that price.
| Year | Best Available Model | Input Cost (Per 1M Tokens) | Trend |
|---|---|---|---|
| 2023 | GPT-4 | $30.00 | Baseline |
| 2024 | Claude Sonnet 3.5 / GPT-4o | $3.00 / $2.50 | ~10x cheaper |
| 2025 | Claude Sonnet 4 / GPT-4o | $3.00 / $2.50 | Stable |
| 2026 | Claude Opus 4.6 / GPT-5.2 | $5.00 / $1.75 | Better models, similar or lower prices |
The trend is clear: you get more intelligence per dollar every year. Input tokens are racing toward near-free. Output tokens remain more expensive because generating text requires sequential computation, but even those are falling.
10 Ways to Get More From Your AI Subscription (Starting Today)
These aren’t theoretical tips. These are battle-tested strategies used by power users who need to maximize every token.
1. Start New Conversations Early and Often
The single biggest token saver. Because of the conversation history problem, message 20 in a long chat costs 10-50x more than message 1. Once you have your answer on a topic, start fresh. Don’t treat AI chat like a text thread with a friend.
2. Use the “Summarize and Reset” Technique
Before ending a long conversation, ask: “Summarize everything we’ve decided in 5 bullet points.” Copy that summary, start a new conversation, and paste it as context. You just compressed thousands of tokens into dozens.
3. Choose the Right Model for the Job
Not every question needs the most powerful model. Use Haiku/Flash for simple questions, Sonnet/GPT-4o for everyday work, and Opus only when you need maximum reasoning power. The cost difference is 5-25x between tiers.
4. Be Specific About Output Length
“Explain quantum computing” will get you a 2,000-word essay. “Explain quantum computing in 3 sentences” gets you the same core answer at 1/20th the token cost. AI models default to verbose — tell them to be brief.
5. Don’t Upload Files Until You Need Them
An image or PDF attached in message 1 gets re-sent with every subsequent message. If you’re going to have a 15-message conversation and only need the file for one question, attach it in that specific message — not at the start.
6. Avoid Pasting Entire Files
Instead of “Here’s my 500-line file, find the bug on line 47,” just paste lines 40-55. You save thousands of tokens and get a more focused answer.
7. Skip the Pleasantries
“Thank you so much! That was really helpful. I have another question…” adds tokens to every future message in the conversation. Just ask your next question. The AI doesn’t have feelings to hurt.
8. Batch Related Questions Together
Five separate messages cost far more than one message with five questions. Every separate message triggers a full re-read of the conversation. Combine related questions into a single prompt.
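A simple model makes the difference concrete. The question and reply sizes here are made-up illustrative values, and system-prompt overhead is ignored:

```python
def batched_vs_separate(questions: int,
                        tokens_per_question: int = 50,
                        reply_tokens: int = 300):
    """Compare input tokens for asking questions one at a time vs. all at once.

    Separate: each new question resends every prior question and reply.
    Batched: one message carries all the questions, read once.
    Sizes are illustrative assumptions; for comparison only.
    """
    separate = 0
    history = 0
    for _ in range(questions):
        history += tokens_per_question
        separate += history       # the whole history rides along each time
        history += reply_tokens   # the reply joins the history too
    batched = questions * tokens_per_question
    return separate, batched
```

Five questions asked one by one process 3,750 input tokens under these assumptions; batched into a single message they cost 250, a 15x difference.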
9. Don’t Ask the AI to Repeat Itself
“Can you show me that code again?” forces the AI to regenerate content that’s already in your conversation. Just scroll up. Regeneration costs output tokens — the expensive kind.
10. Track Everything with MyTokenTracker
You can’t optimize what you can’t measure. MyTokenTracker gives you the visibility Anthropic doesn’t — usage by project, session, and model with full search and filtering. Stop guessing and start knowing exactly where your tokens go.
Understanding Prompt Caching (The Technology Saving You Money Behind the Scenes)
There’s one piece of good news in all of this: prompt caching. This is the technology that makes AI subscriptions economically viable despite the re-reading problem.
When you continue a conversation, the system prompt and previous messages are often identical to the last request. Anthropic’s caching system recognizes this and charges only 10% of the normal input price for cached content.
| Cache Operation | Cost vs. Standard | What It Means |
|---|---|---|
| First request (cache write) | 1.25x standard price | Slightly more expensive initially |
| Subsequent requests (cache hit) | 0.10x standard price | 90% savings on repeated content |
| Cache miss (expired after 5 min) | 1.0x standard price | Back to normal pricing |
This is why taking a long break mid-conversation can actually cost more — the cache expires after 5 minutes of inactivity, and everything has to be re-processed at full price when you return. For a deeper dive into how caching strategies work at scale, see our post on caching strategies for high scalability.
When caching breaks — as happened during the March 2026 crisis — costs can inflate 10-20x because every request is processed at full price instead of the cached rate. That’s what turned a $100/month plan into something that emptied in an hour.
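The multipliers in the table make that inflation easy to check. Here is a sketch using an assumed $3.00/M input price, with the one-time 1.25x cache-write surcharge ignored for simplicity:

```python
def input_cost_usd(context_tokens: int, price_per_million: float,
                   cached_fraction: float) -> float:
    """Input cost for one request where `cached_fraction` of the context is
    a cache hit (billed at 0.10x) and the rest is fresh (billed at 1.0x).

    Ignores the one-time 1.25x cache-write surcharge for simplicity.
    """
    per_token = price_per_million / 1_000_000
    cached = context_tokens * cached_fraction
    fresh = context_tokens - cached
    return cached * per_token * 0.10 + fresh * per_token

# A 100K-token context at $3.00/M input:
#   fully cached: input_cost_usd(100_000, 3.00, 1.0) -> $0.03
#   cache broken: input_cost_usd(100_000, 3.00, 0.0) -> $0.30, 10x more
```

Stack that 10x across every message in a long Claude Code session and a plan that normally lasts all day can empty in an hour.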
The Real Problem: Transparency
Here’s our honest take: the technology behind tokens, context windows, and caching is genuinely complex. But the communication around it doesn’t have to be.
The number one complaint across every platform — Reddit, GitHub, Discord, X — isn’t about the price. It’s about the opacity.
Users would accept limits if they could:
- See a real-time token counter while using the product
- Know exact limits for their plan (not “approximately 5x”)
- Understand the reset schedule clearly
- Get notified before hitting a limit, not after
- See what’s consuming tokens (conversation history vs. new content vs. system overhead)
This is exactly the gap we set out to fill. MyTokenTracker was built because we got tired of waiting for Anthropic to solve this. It gives Claude subscribers the usage dashboard that should have existed from day one — tracking by project, session, and model with powerful filters and search. If you’re serious about managing your AI spend, it’s the tool you need.
The Bottom Line
Tokens aren’t a scam. They’re the real unit of work that AI models perform, and they cost real money to process on expensive GPU hardware. But the way they’re communicated to users — with vague limits, hidden mechanics, and opaque pricing — creates unnecessary frustration and erodes trust.
The three things to remember:
- Conversations compound in cost as they get longer; cumulative usage grows with the square of the message count. Start fresh early and often.
- Not all tokens are created equal. Output tokens cost 3-5x more than input. Images and files add thousands of tokens. Choose your model wisely.
- The industry is getting cheaper fast. What costs $5 today cost $30 two years ago. Hold your providers accountable, but know that the trend is strongly in your favor.
You don’t need a computer science degree to use AI effectively. You just need to understand the meter that’s running — and now you do.
Champlin Enterprises is an AI-first software engineering consultancy with 27+ years of experience building production systems. We don’t just write about AI — we engineer with it every day. Have questions about token optimization or AI integration for your business? Let’s talk.