Prompt Engineering: What Works, What's Superstition

A survey of 1,500+ papers separates the prompting techniques with real empirical backing from the cargo cult rituals everyone keeps repeating.


Prompt engineering has accumulated a cottage industry of tips, tricks, and "magic words" that supposedly unlock better AI performance. Most of it is noise.

A systematic survey of over 1,500 papers, co-authored by researchers from OpenAI, Microsoft, Google, Princeton, and Stanford, reveals that only a handful of techniques have robust empirical backing. The rest? Cargo cult: rituals that merely correlated with success, with no causal mechanism behind them.

The techniques that move the needle: chain-of-thought prompting, few-shot examples, and structured outputs. But each comes with caveats that the "10 prompting hacks" posts conveniently omit. And as models improve, some strategies that worked become actively counterproductive.

When Reasoning Needs Scaffolding

Chain-of-thought prompting asks the model to show its reasoning before arriving at an answer. The classic trigger is "Let's think step by step." For arithmetic, commonsense reasoning, and symbolic logic, this technique consistently outperforms direct answers.

The key insight: CoT is an emergent ability that only appears in sufficiently large models.

Smaller models don't benefit. There's even evidence it can hurt performance on tasks that don't require multi-step reasoning. If you're extracting a name from text, forcing the model to "think step by step" adds latency and cost without improving accuracy. You're paying for theater.

The research distinguishes between zero-shot CoT (just adding "let's think step by step") and few-shot CoT (providing worked examples that demonstrate the reasoning process). Few-shot CoT consistently wins for complex reasoning tasks. But the examples matter less than you'd think for their content, and more than you'd think for their format.
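The distinction between the two variants is easy to see in code. A minimal sketch, with hypothetical prompt strings and the model call itself elided:

```python
# Illustrative only: zero-shot CoT appends a reasoning trigger; few-shot CoT
# prepends worked examples in a consistent Q/A format.

def zero_shot_cot(question: str) -> str:
    # Zero-shot CoT: just add the trigger phrase to the bare question.
    return f"{question}\nLet's think step by step."

def few_shot_cot(question: str, worked_examples: list[tuple[str, str]]) -> str:
    # Few-shot CoT: worked examples demonstrate the reasoning process
    # before the real question is asked.
    parts = [f"Q: {q}\nA: {a}" for q, a in worked_examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [
    ("A pen costs $2 and a notebook $3. What do 2 pens and 1 notebook cost?",
     "Two pens cost 2 * $2 = $4. One notebook costs $3. $4 + $3 = $7. The answer is 7."),
]
prompt = few_shot_cot("A bus has 4 rows of 12 seats. How many seats?", examples)
```

Note that the few-shot variant's examples carry the reasoning *format*; as the next section argues, that format matters more than you might expect.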

Few-Shot Examples: Format Over Content

The intuitive model of few-shot prompting is that you're teaching by example. Show three good responses, and the model learns what "good" means.

The evidence tells a different story.

Research from Min et al. (2022) found that label accuracy in few-shot examples barely matters. What matters is label space (showing the possible output categories), input text distribution (examples that look like your actual inputs), and format consistency. You can give a sentiment classifier examples with randomly assigned positive/negative labels, and it will still learn to output correctly as long as the examples are structurally consistent.

Good news and bad news here. Good: you don't need perfectly curated examples. Bad: few-shot prompting alone won't teach a model to reason better. For multi-step reasoning, you need chain-of-thought. Few-shot just handles format and style.
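The Min et al. setup can be sketched directly. In this hypothetical prompt builder, the demonstration labels are assigned at random, yet the prompt still conveys label space, input distribution, and output format (prompt construction only; no model call):

```python
import random

# Demonstrations with randomly assigned labels but strictly consistent format.
def build_classification_prompt(demos, query):
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [
    ("The plot dragged and the acting was wooden.",
     random.choice(["positive", "negative"])),
    ("A delightful surprise from start to finish.",
     random.choice(["positive", "negative"])),
    ("Forgettable, but the soundtrack was nice.",
     random.choice(["positive", "negative"])),
]
prompt = build_classification_prompt(demos, "I would watch this again tomorrow.")
```

Per the Min et al. finding, a prompt built this way performs close to one with correct labels, because the structural signal survives the label shuffle.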

The Prompt Report documents that modifications as minor as reordering examples or adjusting whitespace can swing accuracy by 30% or more.

If your prompt engineering process doesn't include systematic testing of example order, you're optimizing based on luck.
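Systematic order testing is cheap to set up. A minimal sketch, where `evaluate(prompt, gold)` is a hypothetical stand-in for your real metric (e.g. exact match on a held-out labeled set):

```python
from itertools import permutations

def build_prompt(demos, query):
    body = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{body}\n\nInput: {query}\nOutput:"

def best_ordering(demos, eval_set, evaluate):
    # Exhaustive: n! orderings, tractable for the 3-5 examples
    # typical of few-shot prompts.
    scores = {
        order: sum(evaluate(build_prompt(order, q), gold)
                   for q, gold in eval_set) / len(eval_set)
        for order in permutations(demos)
    }
    return max(scores, key=scores.get), scores
```

Swap in your actual model call and eval set; the point is that ordering becomes a measured variable rather than an accident of how you typed the prompt.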

Anthropic's documentation reveals a counterintuitive pattern for modern models: aggressive few-shot prompting can backfire. Claude 4 may overtrigger on techniques that were necessary for earlier models. The "magic words" that seemed essential often compensated for limitations that have since been fixed. Dialing back emphatic language like "CRITICAL" and "MUST" often produces better results on current models. The incantations expired.

Structured Outputs Won't Save You From Hallucination

Structured outputs let you constrain a model to produce valid JSON matching a schema. Genuinely useful for downstream processing. But some developers assume it solves hallucination.

It doesn't.

The distinction is critical: structured outputs guarantee format compliance, not accuracy. A model can and will produce perfectly formatted JSON containing hallucinated data. The structure validates; the content doesn't. Teams often celebrate getting valid JSON while missing that the content is fabricated.

Format and accuracy are orthogonal problems requiring different solutions. If you need accurate data, you need retrieval augmentation, citations, or other grounding mechanisms. Structured outputs help you parse the response reliably. They say nothing about whether the response is true.
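The orthogonality is easy to demonstrate. In this hypothetical example, the "model output" is schema-perfect JSON, yet the invoice ID is fabricated; format validation and fact-checking are separate steps:

```python
import json

model_output = '{"invoice_id": "INV-9913", "total": 249.99, "currency": "USD"}'
expected_types = {"invoice_id": str, "total": float, "currency": str}

# Step 1: format check. Parsing succeeds and every field has the right type.
parsed = json.loads(model_output)
format_ok = all(isinstance(parsed.get(k), t) for k, t in expected_types.items())

# Step 2: accuracy check. This needs grounding against the source document,
# which no schema can provide.
source_record = {"invoice_id": "INV-4410", "total": 249.99, "currency": "USD"}
accurate = parsed["invoice_id"] == source_record["invoice_id"]
# format_ok is True, accurate is False: valid JSON, hallucinated content.
```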

System Prompts: Clarity Beats Volume

System prompts work through instruction following, not through some special channel to the model's soul. The mechanism is mundane: the model was trained to give high weight to text in the system position because that's what instruction-following training rewards.

Anthropic's guidance emphasizes being explicit over being emphatic. Newer models follow instructions precisely rather than inferring intent, which means vague instructions produce unpredictable results while precise instructions produce reliable ones. The fix for inconsistent outputs isn't adding CAPS and exclamation points. It's being clearer about what you want.

Context and motivation often work better than bare commands. Explaining why you want something helps the model navigate ambiguous cases. "Extract the shipping date" is less effective than "Extract the shipping date from this order confirmation email because we need to calculate delivery windows."

The prompt length question cuts against intuition: well-structured short prompts often outperform verbose alternatives. Longer prompts introduce noise, create opportunities for conflicting instructions, and dilute attention across more tokens. If you're adding paragraphs of context "just in case," you may be degrading performance.

Automated Optimization Is Already Winning

The most sobering finding from the research concerns the future of prompt engineering as a human skill. The Prompt Report documents that automated prompt optimization consistently matches or exceeds manual engineering, achieving in 10 minutes what takes humans 20 hours.

This doesn't mean prompting is irrelevant. It means the valuable skill isn't knowing which words to use. It's knowing how to define success, build evaluation sets, and run systematic experiments.
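The workflow shift can be sketched in a few lines. This is a deliberately minimal version, grid search over hypothetical instruction variants scored on a labeled eval set, with `run_model` as a stub standing in for a real LLM call (real optimizers search far larger spaces, but the division of labor is the same):

```python
def score(instruction, eval_set, run_model):
    # Accuracy of one candidate instruction over the labeled eval set.
    hits = sum(run_model(f"{instruction}\n\n{text}") == gold
               for text, gold in eval_set)
    return hits / len(eval_set)

def optimize(candidates, eval_set, run_model):
    # Pick the candidate with the best measured accuracy. The human's job
    # is the eval set and the metric, not the wording.
    return max(candidates, key=lambda c: score(c, eval_set, run_model))
```

The human contribution here is entirely in `eval_set` and the success metric; the wording is just a search space.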

The cargo cult dies when you measure.

Our read: prompt engineering will bifurcate. Simple prompting for predictable tasks will be automated entirely. Complex prompting for novel applications will matter more, not less, but it will look like experiment design rather than wordsmithing. The practitioners who thrive will be those who treat prompts as hypotheses to test rather than incantations to perfect.

The Evidence-Based Checklist

If you're building something with LLMs:

  1. Use chain-of-thought for reasoning tasks. Skip it for simple extraction or classification.
  2. Use few-shot examples for format control, not for teaching reasoning.
  3. Test example order systematically. A 30% accuracy swing from reordering is common.
  4. Structured outputs solve parsing, not accuracy. You still need grounding for factual correctness.
  5. Be precise rather than emphatic. Newer models respond to clarity, not intensity.
  6. Shorter, clearer prompts often beat longer ones.
  7. Measure, don't theorize. Automated optimization beats human intuition.

The prompting techniques that work have something in common: they solve specific, well-defined problems rather than promising general "better" results. Chain-of-thought helps with reasoning. Few-shot controls format. Structured outputs guarantee parseability. Each tool has a job. Matching technique to task is the skill.

Everything else is superstition.
