Few-Shot Learning

Teaching AI by example — the middle ground between training and hoping.

// The Concept

Few-shot learning provides a model with a small number of examples — typically 1 to 10 — before asking it to perform a task. Unlike zero-shot (no examples) or fine-tuning (thousands of examples and actual weight updates), few-shot learning operates entirely within the context window. "Here are 3 examples of good product descriptions. Now write one for this product." The model infers the pattern from the examples and applies it to the new input. No training run. No gradient updates. Just pattern recognition in real time.

This works because transformers are fundamentally pattern-matching machines. The attention mechanism — the core computational primitive of every modern language model — excels at identifying regularities across sequences. When you place three examples of a task in the prompt, the model's attention heads identify what those examples have in common: format, tone, length, structure, vocabulary patterns, reasoning style. It then applies those extracted patterns to the new input, generating output that follows the demonstrated convention.

The technical term is in-context learning (ICL), and it was one of the most surprising discoveries in the GPT-3 era. Nobody designed language models to learn from examples in their context window. The training objective is simply "predict the next token." But at sufficient scale, next-token prediction gives rise to the ability to infer and apply rules from examples — without any change to the model's weights. The model doesn't "learn" in the traditional machine learning sense. It recognizes a pattern in the prompt and extends it. But the practical effect is indistinguishable from learning.

The power of few-shot learning lies in its flexibility. You can change the task in seconds by swapping out the examples. Want the model to switch from writing formal product descriptions to casual social media posts? Change three examples. Want it to switch from English to Spanish output? Provide Spanish examples. Want it to adopt a specific brand voice? Show it three paragraphs written in that voice. The model adapts instantly, with no retraining, no API calls to fine-tuning endpoints, no waiting for training jobs to complete.
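This task-swapping can be sketched in a few lines. A minimal illustration, assuming a generic prompt-assembly helper — the example data, labels, and function names here are hypothetical, not from any particular API:

```python
def build_few_shot_prompt(examples, new_input,
                          input_label="Input", output_label="Output"):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    parts = [f"{input_label}: {inp}\n{output_label}: {out}"
             for inp, out in examples]
    # The new input goes last, with the output label left open for the model.
    parts.append(f"{input_label}: {new_input}\n{output_label}:")
    return "\n\n".join(parts)

# Two example sets, two "tasks" — same function, no retraining.
formal_examples = [
    ("wireless mouse", "An ergonomic wireless mouse engineered for all-day comfort."),
    ("desk lamp", "A precision-crafted desk lamp with adjustable color temperature."),
]
casual_examples = [
    ("wireless mouse", "this mouse just gets your hand. all-day comfy."),
    ("desk lamp", "glow up your desk. literally."),
]

formal_prompt = build_few_shot_prompt(formal_examples, "mechanical keyboard")
casual_prompt = build_few_shot_prompt(casual_examples, "mechanical keyboard")
```

Swapping `formal_examples` for `casual_examples` changes the demonstrated task instantly; the model infers the register from whichever set it sees.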

// How It Works

Examples are included directly in the prompt as part of the context. The model's attention mechanism processes the examples alongside the new input, identifying cross-example patterns that define the task. This is computationally identical to processing any other text — the model doesn't know it's "learning from examples." It's simply processing a sequence of tokens and predicting what comes next.

// Few-shot prompting — learning by example

prompt = """
Classify each review's sentiment:

Review: "Amazing quality, fast shipping!"
Sentiment: positive

Review: "Broke after two days. Total waste."
Sentiment: negative

Review: "It's okay, nothing special."
Sentiment: neutral

Review: "Best purchase I've made all year!"
Sentiment:"""

output = model(prompt)  // → "positive"

// The model extracted the pattern from 3 examples:
//   format: Review → Sentiment label
//   labels: {positive, negative, neutral}
//   signal: enthusiasm = positive, complaint = negative

// Few-shot vs zero-shot accuracy (GPT-4 class):
//   Sentiment classification    92% → 96%  // +4% with examples
//   Named entity recognition    78% → 89%  // +11% — complex tasks benefit more
//   Custom format generation    61% → 94%  // +33% — format is hard to zero-shot

// Diminishing returns curve:
//   1-shot:  ~80% of max improvement
//   3-shot:  ~92% of max improvement
//   5-shot:  ~97% of max improvement
//   10-shot: ~99% of max improvement  // but costs context window

The quality of examples matters far more than the quantity. Research consistently shows that 3 well-chosen examples outperform 10 poorly chosen ones. An ideal few-shot example is representative of the task, clearly demonstrates the expected format, and covers the range of expected inputs. If your task has three possible outputs (positive, negative, neutral), your examples should cover all three — not just show three instances of "positive."
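The coverage principle above can be made mechanical. A sketch, assuming a labeled pool of candidate examples — the selection helper and the pool data are illustrative:

```python
def select_covering_examples(pool, labels, per_label=1):
    """Choose examples so every expected label appears at least once."""
    chosen = []
    for label in labels:
        matches = [ex for ex in pool if ex[1] == label][:per_label]
        chosen.extend(matches)
    return chosen

pool = [
    ("Amazing quality, fast shipping!", "positive"),
    ("Best purchase I've made all year!", "positive"),
    ("Broke after two days. Total waste.", "negative"),
    ("It's okay, nothing special.", "neutral"),
]

# One example per label: three shots that span the full output space,
# rather than three instances of the easy "positive" case.
examples = select_covering_examples(pool, ["positive", "negative", "neutral"])
```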

Order matters too. The position of examples within the prompt influences the model's behavior, especially for weaker models. The most recent example tends to have the strongest influence — a phenomenon called recency bias. Placing your most representative example last, closest to the actual input, typically yields the best results. For production systems, researchers sometimes shuffle example order across requests to average out positional effects.
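Both ordering tactics above — shuffling to average out positional effects, and pinning the most representative example last — can be combined in one small helper. A sketch with hypothetical names:

```python
import random

def order_examples(examples, anchor_index, rng=None):
    """Shuffle examples per request, but keep the anchor example last
    so recency bias works in its favor."""
    rng = rng or random.Random()
    anchor = examples[anchor_index]
    rest = [ex for i, ex in enumerate(examples) if i != anchor_index]
    rng.shuffle(rest)  # fresh order each request averages out position effects
    return rest + [anchor]
```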

There's a fundamental tension between few-shot quality and context window consumption. Each example uses tokens that could otherwise be allocated to longer inputs, more detailed instructions, or chain-of-thought reasoning. A 5-shot prompt with 200-token examples consumes 1,000 tokens before the model even sees the actual task. In systems with 4K or 8K context windows, that's a significant fraction. In 128K+ context windows, it's negligible. The optimal number of shots depends on the task complexity, example length, and available context budget.
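The budget arithmetic from that paragraph is worth making explicit. A back-of-envelope sketch, using the text's own numbers (5 shots at 200 tokens each):

```python
def shot_budget(n_shots, tokens_per_example, context_window):
    """Tokens consumed by examples, and the fraction of the window they use."""
    used = n_shots * tokens_per_example
    return used, used / context_window

used, frac = shot_budget(5, 200, 8_192)
# 1,000 tokens: roughly 12% of an 8K window, but the same 1,000 tokens
# are under 1% of a 128K window.
```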

// Why It Matters for Search

Few-shot learning explains why consistent content patterns across your site work better than varied approaches. When an AI system processes your domain, each page functions as an implicit "shot" — an example of what your brand produces, how your entity presents itself, what quality level your content achieves. If your pages are consistent in structure, schema, and quality, the AI builds a strong in-context model of your entity. Inconsistent pages create noise that weakens the in-context representation.

Think about it from the model's perspective. It encounters your About page: clean structure, clear credentials, Person schema with @id. It encounters your service page: same structure, same schema pattern, same entity references. It encounters your blog post: same structure, same schema, same voice. Three consistent "examples" of your entity create a robust pattern. The model doesn't need to be told what your entity is — it infers the pattern from the consistent signal across pages, just as it would infer a task from three few-shot examples in a prompt.

Now imagine the inconsistent alternative. Your About page uses one structure and schema format. Your service page uses a completely different layout with no schema. Your blog uses yet another format with a different authorship attribution. Each page is an "example" that contradicts the others. The model cannot extract a clean pattern because the signal is noisy. It's like providing three few-shot examples that each demonstrate a different task — the model has no idea which pattern to follow.

This insight extends to cross-domain entity architecture. In a Distributed Authority Network, each domain is a "shot" in the model's in-context understanding of your entity. When five domains carry the same Person schema with the same @id, the same sameAs references, and the same credential descriptions, the model encounters five consistent examples of your entity. The few-shot effect kicks in: the pattern is so clear, so consistent, that the model's representation of your entity becomes highly confident. Each additional domain is another "example" that reinforces the pattern.

// In Practice

Maintain consistent content templates across your site. If your "About" pages all follow the same structure — credentials, experience, current role, entity links — AI systems learn this pattern quickly and extract entity information reliably. The template doesn't need to be rigid, but the structural elements should be predictable. Same position for credentials. Same type of schema markup. Same relationship declarations. The content varies; the pattern stays constant.

Apply this principle to schema markup with particular rigor. Consistent schema across pages is few-shot training for AI entity recognition. If every page on your site carries the same Person schema with the same @id, the same jobTitle, the same worksFor reference, the AI processes each page as another confirming example of the entity's identity. After three or four pages, the pattern is locked in. This is why DAN architecture uses consistent schema templates — each domain reinforces the pattern, creating a multi-source few-shot signal that is exceptionally hard for AI systems to ignore.
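One way to enforce that consistency is to generate the schema from a single template, so the entity core is byte-identical on every page. A sketch — the `@id`, names, and URLs below are placeholders, not a prescribed format:

```python
import json

# The invariant entity core: identical on every page that mentions the person.
ENTITY_CORE = {
    "@type": "Person",
    "@id": "https://example.com/#jane-doe",   # same @id everywhere
    "name": "Jane Doe",
    "jobTitle": "Principal Consultant",
    "worksFor": {"@id": "https://example.com/#acme"},
}

def person_schema_for_page(page_url):
    """Emit the same Person entity, varying only page-level context."""
    doc = {"@context": "https://schema.org", **ENTITY_CORE,
           "mainEntityOfPage": page_url}
    return json.dumps(doc, indent=2)

about_jsonld = person_schema_for_page("https://example.com/about")
blog_jsonld = person_schema_for_page("https://example.com/blog/post-1")
```

Because only `mainEntityOfPage` varies, every page an AI system crawls is another confirming "shot" of the same entity.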

Design your site's content types to be internally consistent. All your case studies should follow the same structure. All your product pages should use the same schema pattern. All your blog posts should use the same authorship format. This doesn't mean monotonous content — the substance varies, the expertise varies, the specific arguments vary. But the structural pattern remains consistent, giving AI systems a clean few-shot signal about what kind of content your entity produces and how it should be categorized.

Think of your entire web presence as a few-shot prompt. Each page, each domain, each social profile is an example. The "task" you're demonstrating for the AI is: "This entity is an authority on [topic] with these credentials and this body of work." Make every example reinforce that pattern. Inconsistent examples dilute the signal. Consistent examples compound it. The difference between 3 clean examples and 3 noisy ones can be the difference between confident entity recognition and ambiguous classification.

How many examples does few-shot need?

3-5 examples typically capture the pattern for well-defined tasks. Research shows you get approximately 80% of the maximum improvement with just a single example, and 92% with three. After five examples, returns diminish sharply — each additional example contributes less while consuming more context window. The exception is complex tasks with many edge cases, where 8-10 examples can still meaningfully improve performance. The quality of examples always matters more than quantity. Three perfect examples that cover the full range of expected outputs outperform ten sloppy ones that all demonstrate the same easy case.

Is few-shot the same as fine-tuning?

No, and the distinction is fundamental. Few-shot learning happens at inference time in the context window — the model's weights never change. It's temporary, free, and instant. When the conversation ends, the "learning" disappears. Fine-tuning happens during a separate training phase where the model's weights are actually updated with new data. It's permanent, costly, and time-consuming, but the knowledge persists across all future interactions. Few-shot is like showing someone an example before asking them to do something. Fine-tuning is like sending them to school. Both produce specialization, but through completely different mechanisms.

Go deeper with practitioners

Join the Burstiness & Perplexity community.
