Question 1

Which prompting technique gives the best accuracy improvement?

Accepted Answer

Self-consistency combined with chain-of-thought delivers the highest accuracy gains, typically 20-35% over zero-shot baselines on reasoning tasks. It works by generating multiple reasoning paths and selecting the most consistent answer through majority voting. For simpler tasks, few-shot prompting provides the best effort-to-improvement ratio.

Question 2

How do you benchmark prompt techniques fairly?

Accepted Answer

Fair benchmarking requires using standardized datasets (GSM8K for math, MMLU for knowledge, HumanEval for code), controlling for temperature and sampling parameters, running multiple trials per technique, and comparing against a consistent zero-shot baseline. We use published results from peer-reviewed papers and replicate on current model versions.

Question 3

Does chain-of-thought prompting work on all models?

Accepted Answer

Chain-of-thought prompting is most effective on models with 50B+ parameters. Smaller models may produce incoherent reasoning chains. It works well on Claude (all sizes), GPT-4, GPT-3.5-turbo, Gemini Pro, and Llama 70B. On smaller models like Llama 7B or Mistral 7B, the improvement is minimal or negative.

Question 4

What is tree-of-thought prompting and when should I use it?

Accepted Answer

Tree-of-thought (ToT) prompting explores multiple reasoning branches simultaneously, evaluating each path before committing to an answer. Use it for problems with multiple valid approaches where the optimal path is unclear — creative writing, game strategy, planning tasks. It costs 3-5x more tokens than standard CoT but delivers 15-30% better results on complex problems.

Question 5

How does ReAct prompting compare to chain-of-thought?

Accepted Answer

ReAct (Reasoning + Acting) extends chain-of-thought by adding action steps where the model can interact with external tools like search engines or databases. CoT is better for self-contained reasoning tasks. ReAct is better for tasks requiring external information retrieval or multi-step tool use. ReAct improves accuracy by 20-35% on knowledge-intensive tasks where CoT alone would hallucinate.

Technique	Category	Accuracy Gain (GSM8K)	Accuracy Gain (MMLU)	Token Cost Multiplier	Claude	GPT-4	Gemini	Llama 70B	Best For
Zero-Shot	Baseline	Baseline (57%)	Baseline (70%)	1.0x	Yes	Yes	Yes	Yes	Simple tasks, low cost
Zero-Shot CoT	Chain-of-Thought	+17% (74%)	+5% (75%)	1.3x	Yes	Yes	Yes	Yes	Quick reasoning boost
Manual CoT (8-shot)	Chain-of-Thought	+28% (85%)	+8% (78%)	2.5x	Yes	Yes	Yes	Yes	Math, multi-step logic
Auto-CoT	Chain-of-Thought	+24% (81%)	+7% (77%)	2.2x	Yes	Yes	Yes	Partial	Batch processing
Few-Shot (3 examples)	In-Context Learning	+12% (69%)	+10% (80%)	1.8x	Yes	Yes	Yes	Yes	Classification, formatting
Few-Shot (5 examples)	In-Context Learning	+15% (72%)	+12% (82%)	2.2x	Yes	Yes	Yes	Yes	Complex patterns
Few-Shot CoT	Combined	+30% (87%)	+11% (81%)	3.0x	Yes	Yes	Yes	Yes	Reasoning + examples
Self-Consistency (k=5)	Ensemble	+33% (90%)	+9% (79%)	5.0x	Yes	Yes	Yes	Yes	High-stakes accuracy
Self-Consistency (k=10)	Ensemble	+35% (92%)	+10% (80%)	10.0x	Yes	Yes	Yes	Partial	Maximum accuracy
Tree of Thoughts	Search	+25% (82%)	+6% (76%)	3.5x	Yes	Yes	Partial	Partial	Creative problem solving
Graph of Thoughts	Search	+22% (79%)	+7% (77%)	4.0x	Yes	Yes	No	No	Complex dependencies
ReAct	Tool Use	+10% (67%)	+15% (85%)	2.5x	Yes	Yes	Yes	Partial	Knowledge retrieval
Reflexion	Self-Improvement	+20% (77%)	+8% (78%)	3.0x	Yes	Yes	Partial	No	Code generation
Step-Back Prompting	Abstraction	+15% (72%)	+12% (82%)	1.5x	Yes	Yes	Yes	Yes	Science, physics
Persona Prompting	Role-Based	+8% (65%)	+10% (80%)	1.1x	Yes	Yes	Yes	Yes	Domain expertise
Structured Output (JSON)	Format Control	N/A	N/A	1.2x	Yes	Yes	Yes	Yes	API responses
XML Tag Prompting	Format Control	+10% (67%)	+5% (75%)	1.2x	Yes	Partial	Partial	No	Claude-optimized tasks
Least-to-Most	Decomposition	+22% (79%)	+8% (78%)	2.0x	Yes	Yes	Yes	Partial	Progressive complexity
Skeleton-of-Thought	Decomposition	+12% (69%)	+6% (76%)	1.5x	Yes	Yes	Yes	Yes	Long-form writing
Emotional Prompting	Motivation	+5% (62%)	+3% (73%)	1.0x	Yes	Yes	Yes	Yes	Engagement boost

AI Prompt Benchmark — Which Prompting Techniques Work Best

Methodology

Key Findings

Cost-Efficiency Rankings

Frequently Asked Questions