Chain of Thought

Why asking AI to show its work makes it dramatically smarter.

// The Concept

Chain-of-thought (CoT) prompting is a technique where you instruct a language model to reason step-by-step before producing a final answer. Instead of jumping directly to a conclusion, the model generates intermediate reasoning steps — and these steps genuinely improve the quality of the final answer. It's one of the most powerful techniques in prompt engineering, and it requires absolutely zero additional training. The same model, given the same question, performs measurably better when asked to think out loud.

The discovery was almost embarrassingly simple. Wei et al. at Google Brain showed in 2022 that including a few worked, step-by-step examples in a prompt dramatically improved reasoning accuracy, and Kojima et al. showed the same year that the bare phrase "Let's think step by step" was enough on its own. Chain-of-thought prompting lifted reported accuracy on the GSM8K benchmark from 17.7% to 78.7%. No fine-tuning. No new data. Just five words that changed how the model allocated its computational budget across the problem.
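The zero-shot variant is trivial to apply. A minimal sketch, with the model call left out (any chat-completion API would do); the helper name `with_cot` is my own, not from any library:

```python
def with_cot(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a question.

    The resulting prompt is sent to the model unchanged; the phrase
    alone shifts the model into generating intermediate steps.
    """
    return f"{question}\nLet's think step by step."

prompt = with_cot("What is 23 x 47?")
print(prompt)
```

The few-shot variant works the same way, except the prefix is one or more fully worked examples instead of the trigger phrase.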

This matters because it reveals something fundamental about how transformers work. A language model's output at any given position is a function of a single forward pass through the network. That forward pass has a fixed computational depth — a fixed number of layers, each performing a fixed amount of work. For simple tasks, one forward pass is enough. For complex reasoning, it isn't. Chain-of-thought gives the model more forward passes by spreading the computation across more tokens. Each intermediate step produces hidden states that encode partial computations, and those states become context for the next step.

In essence, CoT converts serial reasoning depth into sequential token generation. The model can't think harder within a single token prediction, but it can think longer across many token predictions. Each "thinking" token is a computational step that the model wouldn't otherwise have access to.

// How It Works

Consider a concrete example. Without chain-of-thought, the model attempts to solve the entire problem in the hidden states of a single output token. With CoT, it decomposes the problem into subproblems, solves each one, and uses the results as context for subsequent steps.

// Without Chain of Thought
Q: "What is 23 x 47?"
A: "1071"                          // wrong — model guesses from pattern matching

// With Chain of Thought
Q: "What is 23 x 47? Think step by step."
A: "23 x 47 = 23 x 40 + 23 x 7"    // decompose
   "23 x 40 = 920"                 // solve sub-problem 1
   "23 x 7 = 161"                  // solve sub-problem 2
   "920 + 161 = 1081"              // combine results
                                   // correct — each step builds on previous hidden states

// Performance gains on benchmarks (without CoT → with CoT):
GSM8K (math)         17.7% → 78.7%   // +345% relative improvement
StrategyQA (logic)   65.4% → 73.4%   // multi-hop reasoning
ARC-Challenge        78.2% → 85.1%   // science reasoning
HotpotQA             56.3% → 68.9%   // multi-document QA

// CoT variants:
Zero-shot CoT:    "Think step by step"       // simplest
Few-shot CoT:     provide worked examples    // most reliable
Self-consistency: sample N chains, vote      // highest accuracy
Tree of Thought:  branch + evaluate + prune  // complex problems

The mechanism is rooted in how attention works across the context window. When the model generates "23 x 40 = 920," those tokens become part of the context for all subsequent generation. The attention mechanism can now reference the result 920 directly, rather than trying to hold the entire computation in the hidden states of a single token. Each intermediate step externalizes part of the computation into the token sequence, where the model can attend to it explicitly.

Self-consistency takes this further. Instead of generating a single chain of thought, the model generates multiple independent reasoning chains (with temperature > 0 for diversity). The final answer is determined by majority vote across the chains. This is analogous to how multiple experts might reason differently about a problem but converge on the correct answer. Self-consistency with 40 chains pushes accuracy on GSM8K above 90% — approaching human performance on grade school math.
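The voting step is straightforward to sketch. Here the sampled chains are hard-coded rather than drawn from a model; the extraction rule (last number in the chain is the final answer) is a simplifying assumption:

```python
from collections import Counter
import re

def majority_answer(chains: list[str]) -> str:
    """Self-consistency: extract the final numeric answer from each
    reasoning chain, then take a majority vote across chains."""
    answers = []
    for chain in chains:
        numbers = re.findall(r"-?\d+", chain)
        if numbers:
            answers.append(numbers[-1])  # assume last number is the answer
    return Counter(answers).most_common(1)[0][0]

# Three chains sampled at temperature > 0 for "What is 23 x 47?".
# One chain makes an arithmetic slip, but the vote recovers.
chains = [
    "23 x 40 = 920, 23 x 7 = 161, 920 + 161 = 1081",
    "20 x 47 = 940, 3 x 47 = 141, 940 + 141 = 1081",
    "23 x 40 = 920, 23 x 7 = 171, 920 + 171 = 1091",  # slip
]
print(majority_answer(chains))  # → 1081
```

In practice the chains come from N independent samples of the same prompt, and a more robust answer extractor (or a model-based one) replaces the regex.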

More advanced variants like Tree of Thought and Graph of Thought allow the model to explore branching reasoning paths, evaluate intermediate states, and backtrack from dead ends. These approaches convert the linear chain into a search tree, giving the model something resembling deliberate planning. The computational cost scales with the number of paths explored, but the accuracy gains on complex problems are substantial.
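The branch-evaluate-prune loop amounts to a beam search over partial reasoning states. In this toy sketch both the expansion step and the evaluator are stubs; a real Tree of Thought implementation would use an LLM for each, but the control flow is the same:

```python
def expand(state: list[str]) -> list[list[str]]:
    """Propose candidate next steps (stub for an LLM sampler)."""
    return [state + [f"step-{len(state)}-{i}"] for i in range(3)]

def score(state: list[str]) -> float:
    """Judge how promising a partial chain is (stub for an LLM judge).
    Here we arbitrarily penalize branches ending in '-2'."""
    return -sum(1 for s in state if s.endswith("-2"))

def tree_of_thought(depth: int = 2, beam: int = 2) -> list[list[str]]:
    frontier = [[]]  # start from the empty reasoning state
    for _ in range(depth):
        # branch: expand every surviving state into candidates
        candidates = [child for s in frontier for child in expand(s)]
        # evaluate + prune: keep only the best `beam` candidates
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return frontier

best = tree_of_thought()
print(best)
```

The cost scaling the text describes is visible here: each level of depth multiplies the number of model calls by the branching factor, and the beam width caps how many survive.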

// Why It Matters for Search

Chain-of-thought reasoning explains a phenomenon that content practitioners observe but rarely understand: why long-form, well-structured content consistently outperforms thin pages in AI-driven search. The reason is structural, not just about "more words." When an AI system processes your content, each paragraph builds on the hidden state created by previous paragraphs — closely analogous to how CoT reasoning builds on previous reasoning steps.

A page that establishes a problem, explains the mechanism, provides evidence, and then draws conclusions creates a progressive argument that AI processes into a coherent, strong entity representation. Each section contributes context that enriches the model's understanding of the next section. By the time the model reaches your conclusion, it has built a rich hidden state that encodes not just what you said, but the logical structure that connects your claims.

This is why pillar content with genuine depth outperforms scattered thin pages on the same topic. Ten 200-word pages about different aspects of entity SEO create ten isolated hidden states. One 2000-word page that builds a progressive argument about entity SEO creates a single, deep hidden state where each concept is contextualized by every other concept. The AI system's representation of your content — and by extension, your authority — is fundamentally different in each case.

AI search engines like Perplexity.ai and Google's AI Overviews use CoT-like reasoning internally when they synthesize answers from multiple sources. They decompose complex queries into sub-questions, retrieve relevant sources for each, and compose the results. Content that mirrors this decomposition pattern — clear problem statements, explicit reasoning steps, evidence-backed conclusions — is structurally compatible with how these systems think. It becomes easier to cite because it already matches the reasoning format.

// In Practice

Structure your most important pages as progressive arguments. Don't front-load all your conclusions in the first paragraph and then fill the rest with supporting fluff. Instead, build your case the way a chain-of-thought model solves a problem: establish the premise, develop the reasoning, present the evidence, draw the conclusion. Each section should build meaningfully on the previous one.

This doesn't mean every page needs to be a linear argument. But it does mean that your content architecture should create clear logical dependencies between sections. If someone reads section three without reading section two, they should feel like they're missing context. That dependency structure is exactly what creates strong hidden state representations in AI systems — each section adds information that enriches the model's understanding of every subsequent section.

For content that targets complex queries — the kind that trigger AI Overviews and extended search features — explicitly decompose the problem. Use headings that map to sub-questions. Address each sub-question with evidence and reasoning before connecting them into a synthesis. This mirrors the internal reasoning process that AI search systems use, making your content structurally compatible with how they generate answers.

When using AI tools for content creation, leverage chain-of-thought in your prompts. Don't ask for a finished article. Ask the model to first outline its reasoning about the topic, identify the key claims it wants to make, and plan the evidence for each claim. Then ask it to write each section with explicit reference to that plan. The resulting content will have the progressive structure that both human readers and AI systems reward — because the generation process itself followed a chain of thought.
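That two-stage workflow can be sketched as two chained prompts. `call_llm` is a placeholder for whatever completion API you use, and the prompt wording is illustrative, not prescriptive:

```python
def call_llm(prompt: str) -> str:
    """Stub for a real completion API call."""
    return "<model output>"

def plan_then_write(topic: str) -> str:
    # Stage 1: ask the model to reason about the piece before writing it.
    plan_prompt = (
        f"Topic: {topic}\n"
        "Before writing anything, reason step by step:\n"
        "1. List the key claims this article should make.\n"
        "2. For each claim, name the evidence that supports it.\n"
        "3. Order the claims so each builds on the previous one."
    )
    plan = call_llm(plan_prompt)

    # Stage 2: write against the plan, so the chain of thought from
    # stage 1 shapes the structure of the finished piece.
    write_prompt = (
        f"Using this plan:\n{plan}\n"
        "Write the article section by section, referencing the plan "
        "explicitly so each section builds on the previous one."
    )
    return call_llm(write_prompt)
```

The point is the separation: the plan is generated as explicit tokens first, so the writing pass can attend to it, exactly as intermediate CoT steps become context for later steps.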

Does chain-of-thought work for all tasks?

No — and knowing when it helps versus when it wastes tokens is part of effective prompt engineering. CoT dramatically improves performance on reasoning tasks: math, logic puzzles, multi-step analysis, code debugging, and anything requiring intermediate computation. For simple factual recall ("What is the capital of France?"), CoT adds unnecessary tokens without improving accuracy. The reliable heuristic: if a task requires more than one logical step, chain-of-thought will likely improve the result. If a human could answer without thinking, the model probably can too.

Is chain-of-thought related to how AI search engines work?

Yes, directly. AI search engines use CoT-like internal reasoning when they decompose complex queries, retrieve sources for each sub-question, and synthesize multi-source answers. When Perplexity.ai answers a complex question, it's running a reasoning chain: identify sub-questions, retrieve relevant documents, evaluate source quality, synthesize a coherent answer. Content that provides clear, well-structured reasoning on a topic is more likely to be selected as a source in this process — because it already contains the intermediate reasoning steps that the search system needs to build its answer.

Go deeper with practitioners

Join the Burstiness & Perplexity community for implementation support and weekly discussions.
