LLM Reasoning Playbook

The Reasoning Framework Playbook


A master reference for selecting, engineering, and deploying advanced text-based LLM prompting architectures. Twelve core frameworks organized across seven axes of reasoning manipulation, with a choice matrix, copy-pasteable templates, and a selection flowchart.

The 7 Axes. Every framework manipulates reasoning along one primary axis. Naming the axes makes coverage auditable and stacking explicit. A Topology · B Sampling/Aggregation · C Grounding/Action · D Decomposition/Planning · E Abstraction · F Self-Evaluation/Memory · G Steering


MODULE 1: THE TAXONOMY OF LLM REASONING

Each framework below follows the same shape: what it is (with a quick analogy), why it works, how long it should be, when to use it, and when not to use it. Plain language comes first, with the technical term in parentheses for readers who want it.


AXIS A — Reasoning Topology

A1. Chain of Thought (CoT) / Few-Shot CoT

What it is (definition). Chain of Thought makes the model think out loud, one step at a time, before it gives an answer. You turn it on by saying "think step by step" (zero-shot) or by showing a few worked examples (few-shot). The reasoning runs in one straight line: each step builds on the step before it. The model does not branch off, back up, or second-guess itself.

It's like doing a long math problem on scratch paper instead of in your head. Writing each step down keeps you from losing your place.

Why it works (cognitive basis). A model gets a fixed amount of "thinking room" (compute budget) for each word (token) it writes. Writing the steps out buys more room: the model turns one big leap into many small ones, and each word it writes becomes a note it can read back (its own context) while writing the next word. Worked examples help even more, because they show the model how big each step should be.

How long it should be (scale). Aim for about 100–600 words of reasoning (output tokens), roughly 3 to 8 steps. Past about 800 words in a single chain, small mistakes pile up (error accumulation) and start to outweigh the benefit.

When to use it (best fit). Reach for it on step-by-step problems that follow one line of logic — grade-school math, everyday "why" questions, or anything where you want the model to show its work. It's the sensible default for most tasks.

When not to use it (anti-patterns). Skip it for simple lookups: it can talk itself into a wrong answer and dress it up as logic (confabulation). Skip it when the task needs the model to weigh several different options, because it locks onto the first path it picks. And take care on fact-heavy tasks — a smooth, confident chain can make a made-up "fact" (a hallucination) look trustworthy.

A2. Thread of Thought (ThoT)

What it is (definition). Thread of Thought walks the model through a long, messy input one piece at a time. It's Chain of Thought built for sprawl — retrieved documents, chat logs, mixed evidence. The prompt says: "Walk me through this in manageable parts, summarizing as you go." The model breaks the input into chunks, keeps a running summary, then pulls it all together at the end.

It's like reading a long book and writing a one-line summary after each chapter, so you don't forget what happened in the middle.

Why it works (cognitive basis). Models pay the most attention to the start and end of a long input and tend to miss the middle (the "lost in the middle" problem). By summarizing each chunk as it goes, the model pulls buried facts back into recent view (where attention is strongest) before it has to use them. One giant reading problem becomes a series of small, well-read ones.
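The chunk-and-summarize loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `summarize` is a hypothetical stand-in for a model call, and the word-based chunking and the toy summarizer are assumptions made so the example runs without a model.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def thread_of_thought(
    text: str,
    summarize: Callable[[str, str], str],
    max_words: int = 200,
) -> str:
    """Fold the chunks into a running summary, one chunk at a time.

    `summarize(running_summary, chunk)` is a hypothetical stand-in for a model
    call that returns the updated running summary.
    """
    running = ""
    for chunk in chunk_text(text, max_words):
        running = summarize(running, chunk)
    return running


def toy_summarize(running: str, chunk: str) -> str:
    """Toy stand-in (assumption): keep only the first word of each chunk."""
    return (running + " " + chunk.split()[0]).strip()


summary = thread_of_thought("a b c d e f g", toy_summarize, max_words=3)
print(summary)  # a d g
```

In a real pipeline the final synthesis step would read only the running summary, which is what keeps mid-document evidence in recent context.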

How long it should be (scale). Built for big inputs (roughly 2,000 to 100,000+ input tokens). The reasoning itself stays modest — about 300 to 1,000 words of chunk summaries.

When to use it (best fit). Best when the hard part is reading a lot of text — answering questions across many documents, summarizing a long chat thread, or pulling facts out of retrieved passages.

When not to use it (anti-patterns). Skip it for short, dense inputs — the chunking is wasted effort. Skip it when the task needs branching or search. And skip it for pure math or logic, where there's no long text to walk through.

A3. Tree of Thoughts (ToT)

What it is (definition). Tree of Thoughts turns reasoning into a search. Instead of one straight line, the model proposes several possible next steps (branching), scores how promising each looks (a state-evaluator), keeps the best, and backs out of dead ends (backtracking). It explores options on purpose instead of committing to the first idea.

It's like a chess player thinking a few moves ahead — trying different lines and dropping the ones that lead nowhere.

Why it works (cognitive basis). This is slow, deliberate "System 2" thinking. A single chain is greedy: one bad early step ruins everything after it. Splitting "coming up with ideas" from "judging ideas" lets the model throw out weak paths before it wastes effort on them.

How long it should be (scale). Expensive: thousands to tens of thousands of words across many model calls (roughly breadth × depth × scoring calls). Each thought is small, but the whole search adds up.
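To make the cost arithmetic concrete, here is a minimal sketch of a breadth-first (beam) variant of the search that also counts calls. The `propose`, `score`, and `is_solved` functions are hypothetical stand-ins for model calls, and the toy number puzzle is an assumption chosen so the example is deterministic. Explicit backtracking is left out; pruning to the top `breadth` states plays that role here.

```python
from typing import Callable, List, Optional, Tuple


def tree_of_thoughts(
    root: int,
    propose: Callable[[int], List[int]],
    score: Callable[[int], float],
    is_solved: Callable[[int], bool],
    breadth: int,
    depth: int,
) -> Tuple[Optional[int], int]:
    """Beam search over states; returns (solution or None, number of calls)."""
    calls = 0
    frontier = [root]
    for _ in range(depth):
        children: List[int] = []
        for state in frontier:
            children.extend(propose(state))
            calls += 1  # one proposal call per expanded state
        for child in children:
            if is_solved(child):
                return child, calls
        calls += len(children)  # one scoring call per candidate
        # Keep only the most promising candidates; the rest are pruned.
        frontier = sorted(children, key=score, reverse=True)[:breadth]
    return None, calls


# Toy task (assumption): reach 10 from 1 using "+3" or "*2".
TARGET = 10
solution, calls = tree_of_thoughts(
    root=1,
    propose=lambda n: [n + 3, n * 2],
    score=lambda n: -abs(TARGET - n),
    is_solved=lambda n: n == TARGET,
    breadth=2,
    depth=4,
)
print(solution, calls)  # 10 11
```

Even this tiny search spends 11 calls where a single chain would spend one, which is the trade the scale paragraph describes.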

When to use it (best fit). Best for problems that need you to try, judge, and back up — planning, puzzles and games, or fitting pieces under strict rules (like scheduling or the "Game of 24").

When not to use it (anti-patterns). Skip it when speed or cost matter. Skip it when the model can't reliably score partial answers — bad scores send the search chasing dead ends (hallucinatory backtracking). And skip it for easy problems a single chain already solves; it's massive overkill.

A4. Graph of Thoughts (GoT)

What it is (definition). Graph of Thoughts lets ideas connect in any shape, not just a tree. Thoughts are points; links show how they depend on each other. On top of branching, it can merge separate thoughts into one (aggregation), loop a thought back to improve it (refinement), and reshape the whole structure. This lets it pull ideas together, not just spread them apart.

It's like a group essay: people write different sections, then you merge and edit them into one piece.

Why it works (cognitive basis). Many jobs are "split it up, then combine": break the problem apart, solve the pieces, and fuse the results. A tree can only spread out — it can't rejoin branches. A graph can, so it matches how these problems actually fit together, and its loops handle draft-critique-redraft as a built-in move.

How long it should be (scale). The heaviest option. It needs a controller program to manage many connected calls, and cost grows with the size of the graph and the number of improvement loops. Save it for high-value batch jobs.

When to use it (best fit). Best when you must break a job into parts, solve them separately, then combine — merging findings from several sources, or polishing an answer through repeated passes.

When not to use it (anti-patterns). Skip it if the pieces never need to be merged. Skip it if you don't have a controller to run it (it's not a single prompt). And skip it when speed or budget is tight.

A5. Skeleton-of-Thought (SoT) (Tier 2)

What it is (definition). Skeleton-of-Thought writes a quick outline first, then fills in each point — all at the same time (in parallel). It's a speed trick (a latency optimization): expanding the points at once cuts the clock time on answers that list out cleanly.

It's like writing an essay outline, then having several people each flesh out one bullet at once.

When to use it (best fit). Best for answers that are really a list — a set of tips, options, or sections — that you want written quickly.

When not to use it (anti-patterns). Skip it when the points depend on each other and can't be written on their own.
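The outline-then-expand-in-parallel idea can be sketched with a thread pool. This is an illustrative sketch only: `expand` is a hypothetical stand-in for one model call per heading, and the upper-casing lambda is a toy replacement so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def skeleton_of_thought(headings: List[str], expand: Callable[[str], str]) -> str:
    """Expand every heading concurrently, then join the results in outline order."""
    with ThreadPoolExecutor(max_workers=max(1, len(headings))) as pool:
        # Executor.map returns results in the order of the inputs,
        # regardless of which call finishes first.
        paragraphs = list(pool.map(expand, headings))
    return "\n\n".join(paragraphs)


# Toy stand-in (assumption): a real `expand` would call the model.
answer = skeleton_of_thought(["first tip", "second tip"], lambda h: h.upper())
print(answer)
```

Because each expansion sees only its own heading, this works only when the points are independent, as the anti-pattern above notes.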


AXIS B — Sampling & Aggregation

B1. Self-Consistency

What it is (definition). Self-Consistency asks the same question several times and keeps the most common answer. Instead of one Chain of Thought, it runs many (say 5 to 40) with some randomness turned on (nonzero temperature), reads the final answer from each, and takes the majority vote. The path doesn't matter — only which answer wins.

It's like asking several friends the same question and trusting the answer most of them give.

Why it works (cognitive basis). Right answers tend to agree — many different correct paths land on the same result — while mistakes scatter in all directions. Voting lets the correct answer pile up support while random errors cancel out. It's a cheap way to combine one model's own attempts (ensembling), and it directly fights Chain of Thought's habit of running with one bad step (error propagation).
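The voting step itself is small. The sketch below assumes the final answers have already been extracted from each sampled chain as strings; the normalization rule (trim whitespace, lower-case) and the tie-breaking behavior are illustrative assumptions, not part of the method's definition.

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, Dict[str, int]]:
    """Return the most frequent answer and the full vote distribution.

    Ties go to the answer that appeared first (an assumption of this sketch).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    winner, _ = counts.most_common(1)[0]
    return winner, dict(counts)


# Final answers from five sampled chains (toy data).
samples = ["42", "42 ", "41", "42", "17"]
winner, distribution = majority_vote(samples)
print(winner, distribution)  # 42 {'42': 3, '41': 1, '17': 1}
```

The distribution is worth logging alongside the winner: a narrow margin is a cheap signal that the question may need more samples or a different framework.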

How long it should be (scale). Multiplies the base cost by the number of samples (usually 5 to 40). No new prompt structure — you just run it many times. Most of the benefit shows up by about 10 to 20 samples.

When to use it (best fit). Best when there's one clear answer to land on and you want it to be reliable — math, logic, or multiple-choice-style questions where you can afford a few extra runs.

When not to use it (anti-patterns). Skip it for open-ended answers you can't line up and compare (use Universal Self-Consistency instead). Skip it when cost or speed is tight. And watch out when the model is reliably wrong in the same way — then voting just repeats the same mistake with more confidence.

B2. Universal Self-Consistency (USC) (Tier 2)

What it is (definition). Universal Self-Consistency handles the case where answers can't be counted. It samples several full responses, then has a judge (often the same model) pick the response most consistent with the rest, so you can still "vote" on free-form text.

It's like a panel choosing the best essay when you can't just tally identical answers.

When to use it (best fit). Best when you want that same reliability but the answers are free-form (essays, summaries) and can't simply be counted.

When not to use it (anti-patterns). Skip it when the judge is just as biased as the writer.


AXIS C — Grounding & Action

C1. ReAct (Reason + Act)

What it is (definition). ReAct lets the model think, act, and observe in a loop. It reasons about what it needs (Thought), takes an action like a search or tool call (Action), gets the real result back (Observation), and uses that to decide the next step — repeating until it can answer. Its reasoning is tied to live, outside information.

It's like a detective who forms a hunch, checks a clue, then updates the theory.

Why it works (cognitive basis). Plain Chain of Thought reasons in a sealed room and can make up facts it doesn't have. ReAct opens the room: thinking guides which action to take, and real results pull the reasoning back to the truth. The two help each other, and this loop is the backbone of many AI agents in production.
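A minimal sketch of the Thought, Action, Observation loop follows. Everything model-related is an assumption: `toy_policy` is a hypothetical stand-in for the model, `lookup` is a made-up tool, and the `"finish"` action name is a convention of this sketch only.

```python
from typing import Callable, Dict, List, Tuple

# (thought, action name, action argument); the action "finish" carries the answer.
Decision = Tuple[str, str, str]


def react_loop(
    question: str,
    policy: Callable[[List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate Thought -> Action -> Observation until the policy finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return argument, transcript
        # The observation comes from the tool, never from the model.
        observation = tools[action](argument)
        transcript.append(f"Action: {action}({argument})")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("no final answer within max_steps")


CAPITALS = {"France": "Paris"}  # toy data (assumption)
OBS = "Observation: "


def toy_policy(transcript: List[str]) -> Decision:
    """Hypothetical stand-in for the model: look up once, then answer."""
    observations = [line for line in transcript if line.startswith(OBS)]
    if not observations:
        return ("I need to look this up.", "lookup", "France")
    return ("The lookup answered it.", "finish", observations[-1][len(OBS):])


answer, transcript = react_loop(
    "What is the capital of France?",
    toy_policy,
    {"lookup": lambda country: CAPITALS.get(country, "unknown")},
)
print(answer)  # Paris
```

The `max_steps` cap is a design choice worth keeping in any real loop: it bounds cost when noisy observations keep the policy from ever finishing.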

How long it should be (scale). Varies and can grow large: every loop adds thinking, an action, and a result (often long). Multi-step tasks run to thousands of words and many calls. The size of the results coming back — not the thinking — is usually the biggest hidden cost.

When to use it (best fit). Best when the model needs the outside world — AI agents that use tools, search the web, query a database, or act on live, changing information.

When not to use it (anti-patterns). Skip it when the task needs no outside information — the tool machinery is dead weight. Skip it when the tools are unreliable or noisy, because bad results derail the loop. And skip it under tight speed limits, since each step has to wait on the one before it.

C2. Program-Aided LM (PAL) / Program-of-Thoughts (PoT)

What it is (definition). PAL and the closely related Program-of-Thoughts (PoT) have the model write code and let a real interpreter run it. The model handles the reasoning; the computer handles the math. The two jobs are kept separate.

It's like using a calculator instead of doing long division in your head.

Why it works (cognitive basis). Models are shaky calculators — they guess numbers token by token and drift off. Handing the arithmetic to an exact interpreter removes that whole class of error: the model only has to get the method right, not the calculation itself. Writing code also forces clear, precise steps.
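The hand-off between model and interpreter can be sketched as below. The `model_code` string is a hypothetical model output written by hand for illustration, and the convention that the program stores its result in a variable named `answer` is an assumption of this sketch. Plain `exec` is not a sandbox; untrusted model output should only run in an isolated environment.

```python
def run_program(code: str) -> object:
    """Execute model-written code and return the value it bound to `answer`.

    WARNING: exec() runs arbitrary code. This sketch is for trusted toy input
    only; real deployments need process-level isolation.
    """
    namespace: dict = {}
    exec(code, namespace)
    if "answer" not in namespace:
        raise ValueError("program did not assign a variable named 'answer'")
    return namespace["answer"]


# Hypothetical model output: reasoning lives in comments, arithmetic in code.
model_code = """
# 3 boxes hold 12 eggs each.
eggs_total = 3 * 12
# 5 eggs are broken, so subtract them.
answer = eggs_total - 5
"""

print(run_program(model_code))  # 31
```

Note that the interpreter only guarantees the arithmetic. If the model sets up the wrong procedure, the program still runs and returns a wrong `answer`.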

How long it should be (scale). About the same as Chain of Thought (tens to a few hundred words of code), plus the time to run the code. On number-heavy tasks it is often both cheaper and more accurate than Chain of Thought.

When to use it (best fit). Best when the answer depends on exact math or data work — arithmetic, unit conversions, crunching a table, or anything a short script does better than prose.

When not to use it (anti-patterns). Skip it for soft, judgment-based reasoning that doesn't turn into math. Skip it if you can't safely run code. And skip it when the hard part is framing the problem, not computing the answer.

C3. Chain-of-Verification (CoVe) (Tier 2)

What it is (definition). Chain-of-Verification fact-checks its own draft. It writes an answer, lists questions that would prove each claim wrong, answers those questions on their own (so the draft can't bias them), then fixes the answer. It's built to cut made-up facts (hallucinations).

It's like an editor who writes questions to double-check each fact before publishing.

When to use it (best fit). Best when facts must be right — biographies, product specs, or any answer where a made-up detail is costly.

When not to use it (anti-patterns). Skip it on non-factual tasks, where it adds delay for no gain.


AXIS D — Decomposition & Planning

D1. Least-to-Most Prompting

What it is (definition). Least-to-Most solves the easy parts first and uses those answers to solve the hard parts. It works in two stages: break the problem into an ordered list of smaller problems, then solve them in order, feeding each answer into the next. The solutions chain together.

It's like building with LEGO — you snap small pieces together before the big assembly.

Why it works (cognitive basis). It's built to handle problems harder than the worked examples shown in the prompt (easy-to-hard, or compositional, generalization). Chain of Thought breaks things down implicitly and can lose track; Least-to-Most makes the steps explicit, so each one works in a small, clear space with real answers already in hand. Easy-to-hard order keeps every step within reach.
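The second stage, solving in order while carrying answers forward, can be sketched as a simple fold. The `solve` callable is a hypothetical stand-in for a model call that receives the current subproblem plus every earlier question and answer pair; the arithmetic toy solver is an assumption so the example runs without a model.

```python
from typing import Callable, List, Tuple

Solved = List[Tuple[str, str]]  # (subproblem, answer) pairs, in order


def least_to_most(subproblems: List[str], solve: Callable[[str, Solved], str]) -> Solved:
    """Solve subproblems in order, feeding all earlier answers into each call."""
    solved: Solved = []
    for subproblem in subproblems:
        answer = solve(subproblem, solved)  # sees answers 1..k-1
        solved.append((subproblem, answer))
    return solved


def toy_solve(subproblem: str, solved: Solved) -> str:
    """Toy stand-in (assumption): apply 'add N' or 'times N' to the last answer."""
    previous = int(solved[-1][1]) if solved else 0
    operation, value = subproblem.split()
    result = previous + int(value) if operation == "add" else previous * int(value)
    return str(result)


steps = least_to_most(["add 4", "times 3", "add 1"], toy_solve)
print(steps[-1][1])  # 13
```

The sketch also shows the failure mode: every call trusts the previous answer, so one early mistake flows into everything after it.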

How long it should be (scale). Medium to high: one pass to break things down, plus one pass per sub-problem (the context grows as answers stack up). More calls than Chain of Thought, far fewer than Tree of Thoughts.

When to use it (best fit). Best for problems that stack, where each step needs the answer from the last, and that are harder than the examples you can show — like multi-step word problems or building up a result piece by piece.

When not to use it (anti-patterns). Skip it when the problem doesn't split into a clean, ordered chain. Skip it for simple tasks a single chain handles. And watch out — an early wrong answer quietly poisons every step after it, since there's no built-in check.

D2. Plan-and-Solve (PS) (Tier 2)

What it is (definition). Plan-and-Solve separates making a plan from carrying it out. The model first says "here's the plan," then follows it — a lightweight fix for Chain of Thought's habit of skipping steps.

It's like writing a to-do list before starting a project.

When to use it (best fit). Best for general multi-step tasks where the model tends to rush; laying out a plan first keeps it from skipping steps.

When not to use it (anti-patterns). Skip it for trivial tasks that need no plan.

D3. Self-Ask (Tier 2)

What it is (definition). Self-Ask has the model ask and answer its own follow-up questions before giving the final answer. It pairs naturally with a search tool — one lookup per follow-up question.

It's like breaking a research question into "who, what, when" and looking each one up.

When to use it (best fit). Best for questions that need several lookups to answer — "Who led the company that made X when it was founded?" — especially paired with a search tool.

When not to use it (anti-patterns). Skip it for simple, single-step questions.


AXIS E — Abstraction

E1. Step-Back Prompting

What it is (definition). Step-Back finds the general rule first, then applies it. Before solving, the model names the principle, law, or bigger question behind the specific case — then reasons forward from that rule to the answer.

It's like remembering the formula before you plug in the numbers.

Why it works (cognitive basis). Starting from a solid principle points the model at the right part of the solution space and cuts down on distraction from surface details. It copies how experts think: spot what kind of problem this is, recall the rule, apply it. Naming the principle also pulls the right knowledge into view before the model dives into specifics.

How long it should be (scale). Cheap — just one short "step back" added before normal reasoning. One of the best quality boosts for the price.

When to use it (best fit). Best when a known rule or principle drives the answer — physics and math problems, legal questions, or anything where naming the right law first makes the rest fall into place.

When not to use it (anti-patterns). Skip it when there's no useful bigger rule. Watch out for stepping back too far — a rule that's too general doesn't pin down the answer. And skip it when the details are the whole point and generalizing throws them away.

E2. Analogical Prompting (Tier 2)

What it is (definition). Analogical Prompting has the model come up with its own related examples before solving, so you don't have to hand-write example problems.

It's like solving a new puzzle by remembering how you solved a similar one.

When to use it (best fit). Best when you don't have good example problems to show and want the model to recall similar ones on its own.

When not to use it (anti-patterns). Skip it in brand-new areas where the model's self-made examples aren't trustworthy.


AXIS F — Self-Evaluation & Memory

F1. Metacognitive / Self-Refine / Rephrase-and-Respond (RaR)

What it is (definition). This family adds a "check your work" step. In Rephrase-and-Respond (RaR), the model first restates the question in its own words to clear up confusion, then answers. In Self-Refine, it answers, writes an honest critique of that answer, then fixes it. The core move is separating "make it" from "judge it."

It's like re-reading a test question, answering it, then proofreading before you hand it in.

Why it works (cognitive basis). Spotting a flaw is usually easier than getting it perfect the first time. Putting its own draft in front of it lets the model critique real, specific mistakes. Rephrasing helps because a lot of failure is just misreading the question — restating it lines the model up with what you actually meant.

How long it should be (scale). About 2 to 3 times a single answer (draft, plus critique, plus fix), usually 300 to 1,500 words. One to three rounds is enough; more rounds add little.

When to use it (best fit). Best for messy or unclear requests and for work you can check — clarifying a vague ask, polishing writing, or fixing code you can run tests against.

When not to use it (anti-patterns). The big one: on fact-or-math tasks with no outside checker, self-correction can make things worse — the model "fixes" right answers into wrong ones. Only trust it with a real check (tests, tools, or a lookup). Skip it under tight speed limits, and skip RaR on simple, clear questions.
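One way to honor the "only trust it with a real check" rule is to gate every revision on an external verifier. The sketch below assumes two hypothetical callables: `verify` (tests, a tool, or a lookup that returns pass/fail plus feedback) and `revise` (a model call that rewrites the draft using that feedback). The string-formatting toy task stands in for a real check.

```python
from typing import Callable, Tuple


def refine_with_verifier(
    draft: str,
    revise: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Revise only while the external verifier fails; report the final status."""
    current = draft
    for _ in range(max_rounds):
        passed, feedback = verify(current)
        if passed:
            return current, True  # never "fix" an answer that already passes
        current = revise(current, feedback)
    passed, _ = verify(current)
    return current, passed


def toy_verify(text: str) -> Tuple[bool, str]:
    """Toy external check (assumption): trimmed and ends with a period."""
    ok = text == text.strip() and text.endswith(".")
    return ok, "" if ok else "trim whitespace and end with a period"


def toy_revise(text: str, feedback: str) -> str:
    """Toy stand-in for the model's rewrite."""
    trimmed = text.strip()
    return trimmed if trimmed.endswith(".") else trimmed + "."


final, passed = refine_with_verifier(" hello ", toy_revise, toy_verify)
print(repr(final), passed)  # 'hello.' True
```

Returning the pass/fail flag with the text lets the caller fall back to the original draft, or escalate, when no revision ever satisfies the check.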

F2. Reflexion

What it is (definition). Reflexion learns from failure across tries. When an attempt fails, the model writes a note to itself about why it failed and saves it to memory; the next attempt reads that note first. It gets better over repeated tries — without any retraining.

It's like keeping a mistakes journal so you don't repeat the same error.

Why it works (cognitive basis). It's different from a single self-check: Reflexion works across whole attempts, turning a thin signal ("that failed") into clear, reusable advice. The saved note acts like a lesson learned, so the agent stops making the same kind of mistake. In effect, plain language becomes the way it improves.
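The attempt, evaluate, reflect cycle can be sketched as a loop over a growing list of lessons. All three callables are hypothetical stand-ins: `attempt` for the agent run, `evaluate` for the pass/fail check, and `reflect` for the model writing its own lesson. The number-guessing task is a toy assumption that keeps the example deterministic.

```python
from typing import Callable, List, Optional, Tuple


def reflexion(
    attempt: Callable[[List[str]], int],
    evaluate: Callable[[int], bool],
    reflect: Callable[[int], str],
    max_attempts: int = 3,
) -> Tuple[Optional[int], List[str]]:
    """Retry with an accumulating memory of lessons; return (result, memory)."""
    memory: List[str] = []
    for _ in range(max_attempts):
        result = attempt(memory)  # each attempt reads every earlier lesson
        if evaluate(result):
            return result, memory
        memory.append(reflect(result))
    return None, memory


# Toy task (assumption): produce a number that is even and greater than 10.
def toy_attempt(memory: List[str]) -> int:
    value = 7
    for lesson in memory:
        value += 1 if lesson == "make it even" else 10
    return value


def toy_reflect(result: int) -> str:
    return "make it even" if result % 2 else "make it larger"


result, lessons = reflexion(
    toy_attempt, lambda n: n % 2 == 0 and n > 10, toy_reflect
)
print(result, lessons)  # 18 ['make it even', 'make it larger']
```

The loop depends entirely on `evaluate` being a trustworthy signal, which is why the approach fits tasks with tests or ground truth and not one-shot tasks.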

How long it should be (scale). High overall: several full attempts, each carrying its growing memory of lessons. Worth it only when you can retry the task and actually check success.

When to use it (best fit). Best for agents that get more than one try and a clear pass/fail signal — fixing code until the tests pass, or retrying a task and learning from each miss.

When not to use it (anti-patterns). Skip it for one-shot tasks you can't retry. Skip it when there's no clear way to tell success from failure. And skip it when each task is so different that past lessons don't carry over.


AXIS G — Steering

G1. Directional Stimulus Prompting (DSP)

What it is (definition). Directional Stimulus Prompting drops in small hints to steer the model without dictating the answer. The hints — keywords, a constraint, a focus cue — nudge the model's thinking toward what you want. You can write them by hand or have a small helper model generate them.

It's like a coach shouting one keyword to remind a player what to focus on.

Why it works (cognitive basis). A few strong hint-words tilt the model toward the right part of the solution space. It uses the model's existing skill and just fixes its focus — cheaper than retraining and less brittle than a giant rulebook. The hint says what to pay attention to; the model does the rest.

How long it should be (scale). Tiny — hints are just a handful of words. It's the cheapest steering method there is, adding almost nothing to the base task.

When to use it (best fit). Best when you want to nudge style or focus without micromanaging — steering tone, keeping a summary on the right topic, or aligning output to a specific domain.

When not to use it (anti-patterns). Skip it for open-ended exploration, where an early hint can lock out good options (anchoring). Skip it when you can't write a good hint — a wrong one makes things worse. And skip it for complex problems that need real structure; a hint gives direction, not a method.


MODULE 2: CRITICAL CHOICE MATRIX

| # | Framework | Axis | Task Complexity | Latency & Token Overhead | Primary Failure Mode | Best-Suited Domain |
|---|---|---|---|---|---|---|
| A1 | CoT / Few-Shot CoT | Topology | Low–Med | Low | Error propagation; hallucination laundering | Arithmetic, commonsense, single-path derivation |
| A2 | Thread of Thought | Topology | Med | Low–Med | Mid-context evidence loss; summary drift | Long-context / multi-doc QA, retrieval synthesis |
| A3 | Tree of Thoughts | Topology | High | High | Hallucinatory backtracking (weak evaluator) | Planning, puzzles, constraint satisfaction |
| A4 | Graph of Thoughts | Topology | High–V.High | Highest | Aggregation error; orchestration fragility | Decompose-recombine synthesis, optimization |
| A5 | Skeleton-of-Thought | Topology | Low–Med | Low (parallel) | Incoherence across independent points | Enumerable outputs needing low latency |
| B1 | Self-Consistency | Sampling | Low–High | High (N×) | Amplifies systematic bias; cost | Any discrete-answer reasoning |
| B2 | Universal Self-Consistency | Sampling | Med–High | High (N× + judge) | Biased judge picks wrong sample | Free-form answers needing robustness |
| C1 | ReAct | Grounding | Med–High | High (obs-heavy) | Tool/observation noise derails loop | Agents, tool-use, live-info tasks |
| C2 | PAL / Program-of-Thoughts | Grounding | Med–High | Low + exec | Wrong procedure; non-computable framing | Math, data manipulation, precise computation |
| C3 | Chain-of-Verification | Grounding | Med | Med–High | Verifier inherits draft bias | Factual generation, anti-hallucination |
| D1 | Least-to-Most | Decomposition | High | Med–High | Early subproblem error corrupts chain | Compositional / symbolic generalization |
| D2 | Plan-and-Solve | Decomposition | Med | Low–Med | Weak plan misguides execution | General multi-step, zero-shot decomposition |
| D3 | Self-Ask | Decomposition | Med | Med | Poor sub-question selection | Multi-hop QA (+ retrieval) |
| E1 | Step-Back | Abstraction | Med–High | Low | Over/under-abstraction | Science, law, principle-driven reasoning |
| E2 | Analogical | Abstraction | Med | Low–Med | Unreliable self-generated exemplars | Reasoning where good demos are scarce |
| F1 | Metacognitive / Self-Refine / RaR | Self-Eval | Med–High | Med (2–3×) | Self-correction degradation (no verifier) | Ambiguous NL, code-with-tests, drafting |
| F2 | Reflexion | Self-Eval/Memory | High | High (multi-episode) | No retry/verifier signal; non-transfer | Agentic retry loops, code repair |
| G1 | Directional Stimulus | Steering | Low–Med | Lowest | Anchoring bias; bad stimulus misdirects | Controllable generation, focus/style alignment |

Stacking note. Axes are composable. High-value stacks combine one framework per axis — e.g., ReAct (C) × Self-Consistency (B) × Step-Back (E), or Least-to-Most (D) × PAL (C) × Self-Refine (F).


MODULE 3: PRODUCTION-READY BOILERPLATE TEMPLATES

All templates are hybrid: Markdown headers (## Task, ## Instructions) scaffold the prompt so it's easy to read and edit, while XML tags wrap only the injected data — like <input>{{INPUT_DATA}}</input> — that needs a hard boundary so it can't blur into your instructions. They use {{PLACEHOLDERS}} and are built to snap together across axes.
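Filling the `{{PLACEHOLDERS}}` can be done with a few lines of string handling. The sketch below is one possible approach, not a prescribed tool: the `render` helper and its fail-on-missing behavior are assumptions, and the template text is a shortened excerpt in the same hybrid style.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\{\{([A-Za-z0-9_]+)\}\}")


def render(template: str, values: Dict[str, object]) -> str:
    """Substitute every {{NAME}}; refuse to emit a prompt with unfilled slots."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


# Shortened excerpt in the hybrid style: Markdown headers plus an XML data tag.
TEMPLATE = """## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>"""

prompt = render(TEMPLATE, {"TASK_GOAL": "Add the numbers.", "INPUT_DATA": "2 + 2"})
print(prompt)
```

Failing loudly on a missing value is a deliberate choice: a prompt that still contains a literal `{{INPUT_DATA}}` tends to produce confident answers about nothing.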

A1 · Chain of Thought

## Role
Expert reasoner in {{DOMAIN}}.

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Instructions
Reason step by step. Number each step. Each step must follow only from the input
and prior steps — introduce no unstated facts. Do not skip steps.

## Output format
Reasoning: numbered steps
Answer: final answer to {{TASK_GOAL}}

A2 · Thread of Thought

## Task
{{TASK_GOAL}}

## Context
<context>
{{LARGE_INPUT_DATA}}
</context>

## Instructions
Walk through the context in manageable parts, step by step. For each part:
(1) summarize what it contributes, (2) note any evidence relevant to the task.
Carry a running summary forward. After the final part, synthesize an answer
grounded ONLY in what you surfaced above.

## Output format
Part-by-part analysis: segment summaries
Synthesis: answer with citations to the relevant parts

A3 · Tree of Thoughts

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Search config
breadth = {{B}} · depth = {{D}} · strategy = {{BFS_or_DFS}}

## Instructions
1. PROPOSE: generate {{B}} distinct candidate next-steps toward the goal.
2. EVALUATE: rate each candidate 0–10 for promise; justify in one line.
   Label as "sure / likely / dead-end".
3. EXPAND: keep the top candidates; discard dead-ends.
4. BACKTRACK: if all children of a node are dead-ends, return to its parent.
5. Repeat to depth {{D}} or until solved.

## Output format
Search trace: nodes, scores, prune/backtrack decisions
Chosen path: winning branch
Answer: final answer

A4 · Graph of Thoughts

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Instructions
DECOMPOSE the task into independent sub-thoughts (nodes).
For each node: solve it. Then apply graph operations as needed:
- AGGREGATE: merge related nodes into a combined thought.
- REFINE (loop): critique a node and regenerate it; repeat up to {{K}} times.
- TRANSFORM: restructure nodes if a better decomposition emerges.
Continue until a single root thought answers the task.

## Output format
Nodes: id → thought
Operations: aggregate/refine/transform log
Answer: final aggregated result

A5 · Skeleton-of-Thought (Tier 2)

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Phase 1 — skeleton
List 3–10 concise point headings that structure the full answer. Headings only.

## Phase 2 — expand
Expand each heading independently into a complete paragraph. Do not reference
other points’ contents. (Expansions may be dispatched in parallel.)

B1 · Self-Consistency (wrapper over any discrete-answer template)

## Sampling
Run the inner template {{N}} times, temperature = {{T}}.

## Inner template
<inner_template>
{{ANY_COT_STYLE_TEMPLATE}}
</inner_template>

## Aggregation
Extract the final answer from each of the {{N}} runs.
Return the answer that appears most frequently (majority vote).
Report the vote distribution and the winning answer.

C1 · ReAct

## Task
{{TASK_GOAL}}

## Tools
<tools>
{{TOOL_LIST_WITH_SIGNATURES}}
</tools>

## Instructions
Loop until solved. In each cycle, emit exactly one Thought and one Action, then
stop and wait for the Observation:
Thought: reasoning about what to do next
Action: tool_name(args)   // one of the listed tools
Observation: result, supplied by the environment (never written by you)
When you have enough information, skip the Action and emit the Final Answer.
Never fabricate an Observation; only reason over ones actually returned.

## Output format
Final Answer: answer to {{TASK_GOAL}}

C2 · PAL / Program-of-Thoughts

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Instructions
Solve by writing executable {{LANGUAGE}} code. Put all reasoning in code comments;
put all computation in code. Do NOT compute values yourself. The final line must
assign the result to a variable named `answer`.

## Output format
A single {{LANGUAGE}} code block. The runtime executes it and reports `answer`.

C3 · Chain-of-Verification (Tier 2)

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Steps
1. DRAFT: produce a baseline answer.
2. PLAN: list independent verification questions that would prove the draft wrong.
3. VERIFY: answer each verification question in ISOLATION (do not look at the draft).
4. REVISE: reconcile the answers and output a corrected final answer.

D1 · Least-to-Most

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Phase 1 — decompose
Break the problem into an ORDERED list of subproblems, easiest first, where each
later subproblem may depend on earlier answers.

## Phase 2 — solve
Solve subproblems in order. When solving subproblem k, include the answers to
subproblems 1..k-1 in your working. Carry answers forward explicitly.

## Output format
Subproblems: ordered list
Solutions: answer per subproblem, referencing priors
Answer: final composed answer

E1 · Step-Back

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Phase 1 — step back
State the general principle, law, or high-level question that governs this
problem. Do not solve yet.

## Phase 2 — apply
Using that principle, reason forward to the specific answer.

## Output format
Principle: the abstraction
Derivation: principle → specifics
Answer: final answer

F1 · Metacognitive / Self-Refine / RaR

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Steps
1. REPHRASE: restate the task precisely; resolve any ambiguity in scope or intent.
2. DRAFT: produce an initial answer.
3. CRITIQUE: list concrete, specific flaws (errors, gaps, unmet constraints).
   If a {{VERIFIER}} (tests/tool/retrieval) exists, run it and cite the results.
4. REVISE: rewrite the answer addressing every critique. Iterate up to {{K}} rounds.

## Output format
Final Answer: the revised answer

⚠️ On objective tasks, rely on the critique step only when {{VERIFIER}} is a real external signal — unaided self-critique can degrade accuracy.

F2 · Reflexion

## Task
{{TASK_GOAL}}

## Memory
<memory>
{{PRIOR_REFLECTIONS_OR_EMPTY}}
</memory>

## Instructions
1. Attempt the task, using any lessons in Memory above.
2. Evaluate the attempt against {{SUCCESS_CRITERION}} (test/tool/ground truth).
3. If failed: write a concise reflection on WHY it failed and WHAT to change.
   Append it to memory. Retry (up to {{MAX_ATTEMPTS}}).
4. If passed: stop and return the successful result.

## Output format
Attempt: trajectory
Evaluation: pass/fail + evidence
Reflection: lesson, if failed
Final Result: on success

G1 · Directional Stimulus Prompting

## Task
{{TASK_GOAL}}

## Input
<input>
{{INPUT_DATA}}
</input>

## Stimulus
Focus your reasoning on: {{HINT_KEYWORDS_OR_GUIDELINE}}

## Instructions
Use the stimulus as a directional cue, not a constraint on the final content.
Reason toward the goal, letting the cue guide what to attend to.

## Output format
Answer: final answer

Composition. Wrap B1 (Self-Consistency) around any discrete-answer template; prepend E1 (Step-Back) or inject G1 (stimulus) into another template's ## Instructions section; run C2 (PAL) inside a C1 (ReAct) action; nest A1/D1 as the inner template of F2 (Reflexion).


MODULE 4: ARCHITECTURAL SELECTION FLOWCHART

Branches on axis first, then framework. Four sequential gates.

flowchart LR
    START([Define task]) --> Q1{Q1: Needs EXTERNAL info, tools, or exact computation the model can't do internally?}

    Q1 -->|Yes| GROUND{Grounding: which kind?}
    GROUND -->|Pure math/data computation| PAL[C2 · PAL / PoT]
    GROUND -->|Live tools / search / actions| REACT[C1 · ReAct]
    GROUND -->|Verify factual claims| COVE[C3 · Chain-of-Verification]

    Q1 -->|No| Q2{Q2: Is the main challenge a LARGE / long / messy input context?}
    Q2 -->|Yes| THOT[A2 · Thread of Thought]

    Q2 -->|No| Q3{Q3: Requires exploring MULTIPLE solution paths or MERGING sub-solutions?}
    Q3 -->|Explore + backtrack| TOT[A3 · Tree of Thoughts]
    Q3 -->|Merge / recombine sub-parts| GOT[A4 · Graph of Thoughts]
    Q3 -->|Ordered decompose, each builds on last| LTM[D1 · Least-to-Most]

    Q3 -->|No / single path| Q4{Q4: Dominant need?}
    Q4 -->|Max reliability on a discrete answer| SC[B1 · Self-Consistency over CoT]
    Q4 -->|Ground in a general principle first| SB[E1 · Step-Back]
    Q4 -->|Iterative improve + retry + verifier| RFX[F2 · Reflexion / F1 Self-Refine]
    Q4 -->|Ambiguous query wording| RAR[F1 · Rephrase-and-Respond]
    Q4 -->|Cheaply steer focus / style| DSP[G1 · Directional Stimulus]
    Q4 -->|None — simple derivation| COT[A1 · Chain of Thought]
click PAL "#c2-program-aided-lm-pal-program-of-thoughts-pot"
click REACT "#c1-react-reason-act"
click COVE "#c3-chain-of-verification-cove-tier-2"
click THOT "#a2-thread-of-thought-thot"
click TOT "#a3-tree-of-thoughts-tot"
click GOT "#a4-graph-of-thoughts-got"
click LTM "#d1-least-to-most-prompting"
click SC "#b1-self-consistency"
click SB "#e1-step-back-prompting"
click RFX "#f2-reflexion"
click RAR "#f1-metacognitive-self-refine-rephrase-and-respond-rar"
click DSP "#g1-directional-stimulus-prompting-dsp"
click COT "#a1-chain-of-thought-cot-few-shot-cot"

Text logic-gate form:

Q1  External info / tools / exact computation needed?
    YES → computation only ........... PAL/PoT (C2)
          tools, search, actions ..... ReAct (C1)
          verify generated facts ..... CoVe (C3)
    NO  → Q2
Q2  Large / long / messy input context is the core difficulty?
    YES → Thread of Thought (A2)
    NO  → Q3
Q3  Need to explore alternatives or merge sub-solutions?
    explore + backtrack ............... Tree of Thoughts (A3)
    merge / recombine ................. Graph of Thoughts (A4)
    ordered dependent decomposition ... Least-to-Most (D1)
    NO (single path) → Q4
Q4  Dominant secondary need?
    reliability (discrete answer) ..... Self-Consistency (B1)
    principle-grounded ................ Step-Back (E1)
    retry + verifier loop ............. Reflexion (F2) / Self-Refine (F1)
    ambiguous wording ................. Rephrase-and-Respond (F1)
    cheap focus/style steering ........ Directional Stimulus (G1)
    none — just derive ................ Chain of Thought (A1)

The gates select a primary framework; because axes are orthogonal, layer a second framework from a different axis when needed (e.g., land on ReAct at Q1, still wrap Self-Consistency from Q4).
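The four gates translate directly into a small selection function. The sketch below mirrors the text logic-gate form; the argument names and the short option keys (`"computation"`, `"explore"`, `"reliability"`, and so on) are labels invented for this example.

```python
from typing import Optional

Q1_GROUNDING = {
    "computation": "C2 PAL/PoT",
    "tools": "C1 ReAct",
    "verify": "C3 CoVe",
}
Q3_SEARCH = {
    "explore": "A3 Tree of Thoughts",
    "merge": "A4 Graph of Thoughts",
    "ordered": "D1 Least-to-Most",
}
Q4_NEED = {
    "reliability": "B1 Self-Consistency",
    "principle": "E1 Step-Back",
    "retry": "F2 Reflexion / F1 Self-Refine",
    "ambiguous": "F1 Rephrase-and-Respond",
    "steer": "G1 Directional Stimulus",
}


def select_framework(
    grounding: Optional[str] = None,
    long_input: bool = False,
    search: Optional[str] = None,
    need: Optional[str] = None,
) -> str:
    """Walk gates Q1..Q4 in order and return the primary framework."""
    if grounding is not None:  # Q1: external info, tools, or exact computation
        return Q1_GROUNDING[grounding]
    if long_input:  # Q2: long or messy context is the core difficulty
        return "A2 Thread of Thought"
    if search is not None:  # Q3: explore alternatives or merge sub-solutions
        return Q3_SEARCH[search]
    # Q4: dominant secondary need; default to plain Chain of Thought
    return Q4_NEED.get(need, "A1 Chain of Thought")


print(select_framework(grounding="tools"))   # C1 ReAct
print(select_framework(need="reliability"))  # B1 Self-Consistency
print(select_framework())                    # A1 Chain of Thought
```

The function returns only the primary framework. Layering a second one from another axis, as the paragraph above describes, would be a separate call or a manual decision.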