When GPT-4 Turbo shipped with a 128K token context window, we watched teams immediately stuff entire codebases into single prompts. When Claude 3.5 Sonnet arrived with 200K tokens, the pattern repeated. The reasoning was simple: if the model can accept that much context, why not use it?

Here’s why: context windows are not memory systems. They’re stateless, expensive, and unreliable at recall as they grow. Treating them as infinite memory is a foundational architectural mistake that will resurface in your cost reports, latency budgets, and user-facing inconsistencies.

The Mental Model Problem

A context window is a single-request buffer. Every API call to an LLM is stateless—the model receives tokens, processes them in parallel, and returns a completion. There is no persistent memory between requests. When you send 150K tokens of context, you’re not building a knowledge base; you’re asking the model to hold everything in working memory for one operation.

This differs fundamentally from traditional application memory:

No indexing: The model can’t jump directly to relevant information. Attention is computed over every token in the window, so you pay in compute and latency for the whole context, not just the relevant part.
No caching semantics: Providers offer prompt caching, but it’s an optimization, not an architecture. Cached tokens are billed at a reduced rate, yet you still pay for them on every request.
Degraded recall: Model accuracy drops when the relevant information sits deep inside a long context, and facts in the middle are recalled less reliably than those near the start or end. The “lost in the middle” problem is real.

We’ve seen this failure pattern repeatedly: teams ship an AI feature that works beautifully in demos with 5K tokens of context, then watch it fall apart in production when users generate 80K tokens of conversation history.
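
To make the demo-versus-production gap concrete, here is a minimal sketch of what happens when the full transcript is resent on every turn. It assumes each turn adds a fixed number of tokens; the function name and the figure of 800 tokens per turn are illustrative, not taken from any SDK or measurement.

```javascript
// Sketch: input tokens billed when the full transcript is resent on every turn.
// Assumption: each turn (user message plus reply) adds a fixed `tokensPerTurn`.
function inputTokensWithFullHistory(turns, tokensPerTurn) {
  let historyTokens = 0; // size of the transcript so far
  let totalBilled = 0;   // input tokens summed over all requests
  const perRequest = [];

  for (let turn = 1; turn <= turns; turn++) {
    historyTokens += tokensPerTurn; // the transcript grows linearly
    totalBilled += historyTokens;   // but every request pays for all of it
    perRequest.push(historyTokens);
  }
  return { perRequest, totalBilled };
}

const growth = inputTokensWithFullHistory(100, 800);
console.log(growth.perRequest[99]); // 80000 tokens sent on the 100th request
console.log(growth.totalBilled);    // 4040000 tokens billed across the conversation
```

Per-request size grows linearly, but the total billed across the conversation grows quadratically, which is why a design that looks fine at 5K tokens behaves very differently at 80K.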

The Cost Reality

Let’s run the numbers. At current pricing (early 2026), Claude 3.5 Sonnet is priced at $3 per million input tokens. If you’re sending 100K tokens of context with every request:

100,000 tokens × $3 / 1,000,000 = $0.30 per request (input only)

Assume a moderately active AI-native product with 1,000 daily active users, each making 20 requests per day:

1,000 users × 20 requests × $0.30 = $6,000/day = $180,000/month (input only)

That’s input tokens alone. Output tokens are billed on top of that, at a higher per-token rate, so the total only climbs, and all of this is before you’ve built any other infrastructure.
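
The arithmetic above can be wrapped in a small helper so it is easy to rerun with your own traffic. This is a sketch: the function and parameter names are illustrative, and the 3K-token comparison is a hypothetical retrieval-based design, not a measured figure.

```javascript
// Sketch: monthly input-token spend for a fixed context size per request.
// All parameters are illustrative; plug in your own traffic and pricing.
function monthlyInputCostUsd({
  tokensPerRequest,
  usdPerMillionTokens,
  dailyUsers,
  requestsPerUser,
  days = 30
}) {
  const monthlyTokens = tokensPerRequest * dailyUsers * requestsPerUser * days;
  return (monthlyTokens * usdPerMillionTokens) / 1_000_000;
}

const traffic = { usdPerMillionTokens: 3, dailyUsers: 1000, requestsPerUser: 20 };

// The scenario from the text: 100K tokens of context on every request
console.log(monthlyInputCostUsd({ ...traffic, tokensPerRequest: 100_000 })); // 180000

// Hypothetical retrieval-based design sending about 3K tokens per request
console.log(monthlyInputCostUsd({ ...traffic, tokensPerRequest: 3_000 })); // 5400
```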

The instinct is to blame the LLM provider. But the real problem is architectural: you’re using an expensive stateless API as a stateful memory layer.

What to Use Instead

AI-native architecture requires purpose-built memory systems. Here’s what actually works:

Vector Databases for Semantic Retrieval

Store embeddings in Pinecone, Weaviate, or pgvector. Retrieve only the top-k most relevant chunks for each request. This is Retrieval-Augmented Generation (RAG), and it’s not optional for production systems.

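// Example: querying pgvector for relevant context
// Assumes the `openai` and `pg` packages, OPENAI_API_KEY and DATABASE_URL
// in the environment, and a knowledge_base table with a pgvector column.
import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI();
const db = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Embed the user's message with the same model used for the knowledge base
const embeddingResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: userMessage
});

// pgvector expects a '[x,y,...]' literal, so serialize the array and cast it
const queryVector = JSON.stringify(embeddingResponse.data[0].embedding);

// <=> is cosine distance; 1 - distance gives cosine similarity
const { rows: relevantChunks } = await db.query(
  `SELECT content, 1 - (embedding <=> $1::vector) AS similarity
   FROM knowledge_base
   ORDER BY embedding <=> $1::vector
   LIMIT 5`,
  [queryVector]
);

// Send only these 5 chunks to the LLM, not the entire KB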

This pattern reduces context from potentially hundreds of thousands of tokens to a few thousand—while improving relevance.
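
Retrieval only keeps context small if the retrieved chunks are themselves capped. One way to do that, sketched below, is to pack chunks into the prompt until a token budget is reached. The four-characters-per-token estimate is a rough heuristic for English text, and `estimateTokens` and `buildContext` are hypothetical helper names; use your provider's tokenizer when you need exact counts.

```javascript
// Sketch: pack retrieved chunks into a prompt under a token budget.
// Assumption: about 4 characters per token, a rough heuristic for English.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// `chunks` is assumed to be sorted most relevant first, with the
// { content, similarity } shape returned by the similarity query.
function buildContext(chunks, maxTokens) {
  const selected = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) break; // stop at the first chunk that does not fit
    selected.push(chunk.content);
    used += cost;
  }
  return { context: selected.join('\n---\n'), tokensUsed: used };
}

const sampleChunks = [
  { content: 'a'.repeat(400), similarity: 0.91 },  // about 100 tokens
  { content: 'b'.repeat(800), similarity: 0.84 },  // about 200 tokens
  { content: 'c'.repeat(1200), similarity: 0.52 }  // about 300 tokens
];
console.log(buildContext(sampleChunks, 350).tokensUsed); // 300: the third chunk is dropped
```

Stopping at the first chunk that does not fit keeps the most relevant material and makes the upper bound on prompt size explicit.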

Structured State Management

For conversational applications, maintain explicit state outside the LLM. Use Redis, PostgreSQL, or even local storage to track:

User preferences and profile data
Conversation metadata (topic, intent, stage)
Extracted entities and facts
Decision history

Pass only the current state to the LLM, not the entire conversation history. Most turns don’t require full context.

// Store conversation state explicitly
const conversationState = {
  userId: user.id,
  topic: 'deployment_pipeline',
  lastIntent: 'configure_github_actions',
  entities: {
    repository: 'champlin/api',
    environment: 'production'
  },
  summary: 'User is setting up CI/CD for their API repository'
};

// Prompt includes compact state, not full history
const prompt = `Given context: ${JSON.stringify(conversationState)}
User: ${currentMessage}`;
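
Keeping state outside the model also means updating it after each turn. The sketch below shows one way to fold newly extracted facts into the stored state without mutating it. The `extracted` shape and the function name are hypothetical; in practice the extraction might come from a structured-output call or a rules-based parser.

```javascript
// Sketch: fold facts extracted from the latest turn into the stored state.
// `extracted` is a hypothetical shape, not a provider API.
function updateConversationState(state, extracted) {
  return {
    ...state,
    topic: extracted.topic ?? state.topic,
    lastIntent: extracted.intent ?? state.lastIntent,
    // New entities overwrite old values for the same key; others are kept
    entities: { ...state.entities, ...extracted.entities },
    summary: extracted.summary ?? state.summary
  };
}

const before = {
  userId: 'user_123',
  topic: 'deployment_pipeline',
  lastIntent: 'configure_github_actions',
  entities: { repository: 'champlin/api', environment: 'production' },
  summary: 'User is setting up CI/CD for their API repository'
};

const after = updateConversationState(before, {
  intent: 'add_staging_environment',
  entities: { environment: 'staging' }
});

console.log(after.lastIntent);            // add_staging_environment
console.log(after.entities);              // { repository: 'champlin/api', environment: 'staging' }
console.log(before.entities.environment); // production (the original is not mutated)
```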

Hierarchical Summarization

For long-running conversations, summarize aggressively. After every 10 to 15 turns, use a cheap model (GPT-4o mini, Claude Haiku) to compress the conversation into a running summary. Pass the summary forward, not the raw transcript.

This is how humans actually remember conversations—we retain gist and key details, not verbatim transcripts.
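
A running summary can be sketched as a small class that buffers recent turns and compresses them once a threshold is reached. The `summarize` function is a hypothetical async callback you supply (it would call a cheap model with the previous summary and the buffered turns); the stub below stands in for it so the sketch runs without an API key.

```javascript
// Sketch: keep a running summary plus a short tail of recent turns.
// `summarize(previousSummary, turns)` is a hypothetical async function you supply.
class RollingMemory {
  constructor(summarize, { maxRecentTurns = 12 } = {}) {
    this.summarize = summarize;
    this.maxRecentTurns = maxRecentTurns;
    this.summary = '';
    this.recentTurns = [];
  }

  async addTurn(role, content) {
    this.recentTurns.push({ role, content });
    if (this.recentTurns.length >= this.maxRecentTurns) {
      // Compress everything buffered so far into the running summary
      this.summary = await this.summarize(this.summary, this.recentTurns);
      this.recentTurns = [];
    }
  }

  // What gets sent to the main model: the summary and the uncompressed tail
  promptContext() {
    const tail = this.recentTurns.map((t) => `${t.role}: ${t.content}`).join('\n');
    return `Summary so far: ${this.summary || '(none)'}\n${tail}`;
  }
}

// Stand-in summarizer so the sketch runs without calling a model
const stubSummarize = async (previous, turns) =>
  `${previous} [${turns.length} turns compressed]`.trim();

async function demo() {
  const memory = new RollingMemory(stubSummarize, { maxRecentTurns: 3 });
  for (let i = 1; i <= 4; i++) await memory.addTurn('user', `message ${i}`);
  console.log(memory.promptContext());
  // Summary so far: [3 turns compressed]
  // user: message 4
}
demo();
```

Because each summary is built from the previous summary plus the new turns, the prompt stays bounded no matter how long the conversation runs. The trade-off is that detail lost in one compression cannot be recovered later.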

When Large Context Windows Matter

There are legitimate use cases for large context windows:

Single-pass document analysis: Legal review, code audits, research summarization where you genuinely need to process one large artifact.
Few-shot learning with many examples: When you need 50+ examples to establish a pattern the model doesn’t know.
Complex multi-file code generation: Scaffolding an entire feature that requires understanding multiple interconnected files.

But these are bounded operations, not ongoing stateful interactions. You’re paying for context once to accomplish a specific task, then discarding it.

Architectural Guardrails

If you’re building AI-native products, enforce these rules:

Set hard context limits per request type. Different operations have different context needs. A quick clarification shouldn’t carry 80K tokens.
Instrument context size in production. Track P50, P95, and P99 token counts. Alert when requests exceed expected ranges.
Calculate cost per request in development. Make engineers see the dollar impact of context decisions before shipping.
Default to retrieval. If you’re tempted to include a large static resource (docs, codebase, knowledge base), use RAG instead.
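
The first three rules can be sketched in a few small functions. Everything here is illustrative: the request types, the limits, and the sample token counts are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever your metrics system provides.

```javascript
// Sketch: per-request-type context limits, percentile tracking, and cost per request.
// Request types and limits are hypothetical; tune them to your product.
const CONTEXT_LIMITS = {
  clarification: 4_000,
  conversation: 16_000,
  document_analysis: 150_000
};

function checkContextLimit(requestType, tokenCount) {
  const limit = CONTEXT_LIMITS[requestType];
  if (limit === undefined) throw new Error(`Unknown request type: ${requestType}`);
  return { ok: tokenCount <= limit, limit, tokenCount };
}

// Nearest-rank percentile over recorded token counts (p between 0 and 100)
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Dollar cost of one request's input, for surfacing in development logs
function inputCostUsd(tokenCount, usdPerMillionTokens) {
  return (tokenCount * usdPerMillionTokens) / 1_000_000;
}

const observed = [1200, 1500, 1800, 2100, 2500, 3000, 4200, 9000, 40000, 82000];
console.log(percentile(observed, 50));                     // 2500
console.log(percentile(observed, 95));                     // 82000
console.log(checkContextLimit('clarification', 82000).ok); // false
console.log(inputCostUsd(82000, 3));                       // 0.246
```

Whether an over-limit request is rejected, truncated, or rerouted through retrieval is a product decision; the point of the sketch is that the limit is explicit and checked before the call is made.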

The Bottom Line

Expanded context windows are a powerful capability, but they’re not a memory architecture. Every token you send costs money and competes for the model’s attention. The best AI-native products use context windows surgically, fetching only what’s needed, when it’s needed, from purpose-built storage layers.

If your context size is growing linearly with usage or time, you don’t have an LLM integration. You have an architectural problem. Fix it before it shows up in your provider bill or your users’ response times.

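Treat context windows like you treat RAM: finite, expensive, and something you optimize carefully. The difference is that RAM costs cents per gigabyte. At $3 per million tokens and roughly four characters per token, LLM context costs on the order of a dollar per megabyte of text, and you pay it again on every request. Engineer accordingly.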
