Context Window

The hard limit on what an AI can think about at once.

// The Concept

The context window is the maximum number of tokens a language model can process in a single pass. It's the model's working memory — the total capacity of everything it can hold in mind simultaneously while generating a response. Everything outside the context window simply doesn't exist to the model. There is no workaround, no clever trick. If a token falls outside the window, it cannot influence the output.
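The hard cutoff can be illustrated with a toy sketch. Whitespace splitting stands in for real tokenization here (actual tokenizers use subword schemes like BPE, so counts differ), but the behavior is the same: tokens past the limit are simply discarded.

```python
def truncate_to_window(text: str, max_tokens: int) -> str:
    # Toy tokenizer: whitespace split stands in for real BPE tokenization.
    tokens = text.split()
    # Tokens beyond the window are dropped -- the model never sees them.
    return " ".join(tokens[:max_tokens])

doc = "signal " * 10 + "buried " * 10
print(truncate_to_window(doc, 10))  # only the first 10 tokens survive
```

Anything in `doc` past the tenth token cannot influence the output, no matter how relevant it is.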

This is not a soft limit that degrades gracefully. It's architectural, determined by the positional encoding scheme hardcoded into the model during training. GPT-4o supports 128K tokens. Claude supports 200K tokens. Google's Gemini pushes past 1 million tokens. These numbers define fundamentally different capabilities — and fundamentally different cost profiles — because processing more context requires quadratically more computation.

To put these numbers in practical terms: 128K tokens is roughly a 300-page book. 200K tokens is closer to 500 pages. 1M tokens could encompass an entire codebase or several books simultaneously. These are extraordinary capacities compared to GPT-3's original 2K token window — but they're still finite, and understanding how that finite capacity is allocated is essential for anyone creating content that AI systems evaluate.
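The page estimates above come from rough arithmetic. The tokens-per-page figure below is an assumption (roughly 300 words per printed page at about 1.4 tokens per word), so treat the results as order-of-magnitude, not exact.

```python
TOKENS_PER_PAGE = 425  # assumption: ~300 words/page at ~1.4 tokens/word

models = [
    ("GPT-3 (2020)", 2_048),
    ("GPT-4o", 128_000),
    ("Claude", 200_000),
    ("Gemini", 1_000_000),
]
for name, window in models:
    # Integer division gives an approximate page count for each window.
    print(f"{name}: {window:>9,} tokens ≈ {window // TOKENS_PER_PAGE:,} pages")
```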

// How It Works

Each token position in the context window gets a positional encoding — a mathematical signal that tells the model where in the sequence a token sits. The original transformer paper used sinusoidal functions. Modern models use more sophisticated approaches: RoPE (Rotary Position Embeddings), which encodes relative positions through rotation in complex number space, or ALiBi (Attention with Linear Biases), which adds a linear penalty based on the distance between tokens.
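The original sinusoidal scheme is simple enough to sketch directly. Each position maps to a fixed vector: even dimensions take a sine, odd dimensions a cosine, with wavelengths forming a geometric progression.

```python
import math

def sinusoidal_encoding(pos: int, d_model: int) -> list[float]:
    # Original transformer PE: even dims get sin, odd dims get cos,
    # with wavelengths growing geometrically up to 10000 * 2*pi.
    enc = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

print(sinusoidal_encoding(0, 4))  # position 0 -> [0.0, 1.0, 0.0, 1.0]
```

Because the encoding is a fixed function of position, it cannot adapt to sequences much longer than those seen in training, which is part of why RoPE and ALiBi replaced it.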

// Context window sizes across major models (2026)
GPT-4o          128,000 tokens      // ~300 pages
Claude Opus 4   200,000 tokens      // ~500 pages
Gemini 2.0    1,000,000+ tokens     // ~2500 pages
GPT-3 (2020)      2,048 tokens      // ~5 pages (for reference)

// Attention computation complexity:
Self-Attention = O(n^2 * d)    // n = sequence length, d = hidden dimension
// 128K window: 128,000^2 = 16.4 billion attention pairs
// 1M window: 1,000,000^2 = 1 trillion attention pairs

// Positional encoding approaches:
Sinusoidal   // original transformer — fixed, limited
RoPE         // rotary — relative position, extrapolates
ALiBi        // linear bias — penalizes distant tokens
YaRN         // extends RoPE to longer sequences

// The "Lost in the Middle" attention pattern:
Attention
|**                                 **
|  ***                           ***
|    ****                     ****
|        *****************
+----------------------------------------→ Position
 ^beginning        ^middle          ^end
 HIGH attention    LOW attention    HIGH attention

The model computes attention scores between all token pairs — meaning every token can attend to every other token within the window. This is what gives transformers their power, but it also explains the quadratic cost: doubling the context window quadruples the computation. A 128K window requires computing 16.4 billion attention pairs. A 1M window requires over a trillion. That's why longer context costs more through the API — it's not artificial pricing, it's physics.
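The quadratic scaling described above is plain arithmetic, and worth seeing concretely: double the window and the pair count quadruples.

```python
def attention_pairs(n_tokens: int) -> int:
    # Full self-attention scores every token against every other: O(n^2).
    return n_tokens ** 2

print(f"{attention_pairs(128_000):,}")    # 16,384,000,000 -- ~16.4 billion
print(f"{attention_pairs(1_000_000):,}")  # 1,000,000,000,000 -- 1 trillion
# Doubling the window quadruples the work:
print(attention_pairs(256_000) / attention_pairs(128_000))  # 4.0
```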

Recent architectural advances have attacked this quadratic bottleneck from multiple angles. Ring attention distributes the computation across multiple devices. Grouped-query attention (GQA) shares key-value pairs across attention heads, reducing memory bandwidth requirements. Flash attention restructures the computation to be more hardware-friendly without changing the mathematical result. These optimizations have made million-token windows practically feasible, but the fundamental tradeoff between window size and computational cost persists.
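One way to see the GQA saving is through the key-value cache, whose size scales with the number of KV heads. The model dimensions below are hypothetical (chosen to resemble a mid-size transformer), not any specific released model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_val: int = 2) -> int:
    # KV cache size: 2 tensors (K and V) per layer, each
    # n_kv_heads * head_dim * seq_len values, at fp16 (2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

# Hypothetical model: 32 layers, 128-dim heads, 128K-token context, fp16.
mha = kv_cache_bytes(32, 32, 128, 128_000)  # multi-head: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 128_000)   # grouped-query: 8 KV heads
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 4x smaller
```

Sharing each KV head across four query heads cuts the cache, and the memory bandwidth needed to stream it, by the same 4x factor.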

// Why It Matters for Search

When an AI system evaluates your content for citation — whether that's Perplexity.ai selecting sources, Google's AI Overviews choosing what to reference, or ChatGPT's browsing feature deciding what to quote — it processes your page within its context window alongside competing sources. Your content competes for attention against everything else in that window.

This creates a critical optimization imperative: front-load your entity signals. The "lost in the middle" phenomenon, documented by Liu et al. (2023), shows that models pay significantly less attention to content positioned in the middle of long context windows. Information at the beginning and end receives disproportionate attention weight. For your content, this means your strongest entity credentials — who you are, why you're authoritative, what your page is definitively about — need to appear above the fold.
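The U-shaped pattern can be caricatured with a toy weighting function. This is purely illustrative (real attention distributions are model- and task-specific), but it captures the shape Liu et al. report: full weight at the edges, a discount in the middle.

```python
def position_weight(pos: int, length: int) -> float:
    # Toy U-shaped curve: highest at the edges, lowest at the midpoint.
    # Illustrative only -- not a measured attention distribution.
    x = pos / (length - 1)                 # normalize position to [0, 1]
    return 1.0 - 0.8 * (4 * x * (1 - x))  # dips to 0.2 at the middle

n = 5
weights = [round(position_weight(p, n), 2) for p in range(n)]
print(weights)  # edges get full weight, the middle is discounted
```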

Schema markup offers a structural advantage here. JSON-LD in the <head> element gets processed before body content. When an AI system parses your page, your structured data enters the context window first, establishing entity identity before the model encounters a single paragraph of body text. This early positioning means your entity signals receive maximum attention weight. By the time the model reaches your body content, its hidden state already encodes a representation of your entity — making it more likely to correctly attribute and cite your work.
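A minimal sketch of the kind of entity block this describes, generated in Python for convenience. The entity, URLs, and identifiers are hypothetical placeholders; swap in your own.

```python
import json

# Hypothetical example entity -- replace every URL and name with your own.
schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "@id": "https://example.com/context-window#article",
    "author": {
        "@type": "Person",
        "@id": "https://example.com/#jane-doe",
        "name": "Jane Doe",
        "sameAs": ["https://www.linkedin.com/in/janedoe"],
    },
    "about": "Context windows in large language models",
}

# Emit the <script> block that belongs in <head>, ahead of all body content.
print('<script type="application/ld+json">')
print(json.dumps(schema, indent=2))
print("</script>")
```

Because this block sits in the `<head>`, it enters the context window before any body text a parser extracts.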

Content structure matters too. Clear headers function as navigation markers within the context window — they help the model allocate attention efficiently, finding the relevant section rather than treating your entire page as undifferentiated text. Pages with semantic HTML structure consistently outperform flat text walls in AI evaluation, because the structure makes efficient use of the model's limited attention budget.

// In Practice

For content creators: put your strongest entity signals in the first 500 words. Not buried in a bio at the bottom. Not in a sidebar that gets stripped during parsing. In the main content flow, early. State who you are, what you know, and why this page exists before diving into the body content. The model's attention allocation is not uniform — exploit that.
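A quick self-audit for this rule can be automated. The sketch below uses naive substring matching on the first 500 whitespace-separated words, which is an approximation (real tokenization and entity extraction are more involved).

```python
def signals_in_first_n_words(text: str, signals: list[str],
                             n: int = 500) -> dict[str, bool]:
    # Check whether each entity signal appears in the opening n words.
    head = " ".join(text.split()[:n]).lower()
    return {s: s.lower() in head for s in signals}

page = "Jane Doe is a board-certified cardiologist. This guide explains..."
print(signals_in_first_n_words(page, ["Jane Doe", "cardiologist"]))
```

Any signal that comes back `False` is a candidate for moving up into the opening flow.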

For technical SEO: JSON-LD schema in the <head> loads before body content. This is your pre-positioning advantage. A well-crafted schema block with @id cross-references, sameAs links, and clear entity declarations gives AI systems your identity data before they encounter your content. Think of it as the AI equivalent of a first impression — by the time the model reads your opening paragraph, it already knows who's talking.

For prompt engineering: the same principle applies. Structure your prompts with the most important context first. If you're feeding documentation into a model, lead with the section that matters most for your query. If you're building RAG systems, order retrieved passages by relevance, not by document order — because the model's attention allocation will favor the beginning and end of the retrieved context.
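One way to act on this in a RAG pipeline is to interleave ranked passages so the strongest land at the edges of the context and the weakest fall into the low-attention middle. A minimal sketch, assuming passages arrive sorted most-relevant-first:

```python
def order_for_attention(passages_by_relevance: list[str]) -> list[str]:
    # Alternate passages between the front and the back of the context,
    # so top-ranked passages occupy the high-attention edge positions.
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

ranked = ["best", "second", "third", "fourth", "fifth"]
print(order_for_attention(ranked))
# "best" opens the context, "second" closes it, "fifth" sits in the middle
```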

And a counterintuitive insight: shorter content can outperform longer content in AI evaluation contexts. A focused 1,500-word page that fits entirely within any model's attention budget will receive uniform consideration. A sprawling 10,000-word page will have its middle sections discounted. Unless you need the depth for human readers, concise and front-loaded beats comprehensive and buried.

Does more context always mean better results?

No. Models exhibit the "lost in the middle" phenomenon where they pay significantly less attention to content positioned in the center of long context windows. Research by Liu et al. showed that models can fail to use relevant information placed in the middle even when they successfully use the same information placed at the beginning or end. More context helps when the additional information is genuinely relevant, but padding your context with marginally related content can actually dilute the signal from your most important information.

What happens when content exceeds the window?

It gets truncated or chunked, depending on the system. Simple API calls typically truncate — the model sees the first N tokens and everything else is discarded. More sophisticated systems (like RAG pipelines) chunk the document into segments, embed each chunk separately, and retrieve only the most relevant chunks for the context window. This is why the first 1,000 words of your page matter most for AI evaluation — they're what survives truncation, and they're what gets the highest retrieval priority when chunked.
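The chunking half of this can be sketched simply. Words stand in for tokens here; real pipelines chunk on token counts and often add overlap so no sentence is split from its context.

```python
def chunk(tokens: list[str], size: int, overlap: int = 0) -> list[list[str]]:
    # Split a long document into fixed-size segments for separate embedding.
    # Overlap repeats the tail of each chunk at the head of the next.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

doc = [f"t{i}" for i in range(10)]
print(chunk(doc, size=4))             # truncation would keep only chunk[0]
print(chunk(doc, size=4, overlap=1))  # overlapping chunks for a RAG index
```

Under truncation, only the first chunk's worth of tokens survives; under chunking, later segments survive but must win retrieval to re-enter the window.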

// Go deeper with practitioners

Join the Burstiness & Perplexity community for implementation support and weekly discussions.
