Here are three complete examples of stacking frameworks together (see Advanced-AI-Reasoning-Framework-Playbook, the composition note). Each one shows the same five things: the situation, why this mix of frameworks, the actual prompt, a shortened trace of the model working, and the payoff. The traces are trimmed to show the shape, not every word.
The one rule. Use at most one framework per axis, and apply them from the outside in: the frameworks that break down or reframe the problem shape the prompt; grounding frameworks (tools, code) run inside the loop; and sampling or checking frameworks wrap the whole thing.
Example 1 — Market-Entry Analyst
Stack: Step-Back (E) × ReAct (C) × Self-Consistency (B)
The situation. "Should a mid-size US grocery chain jump into the online meal-kit market in 2026?" This needs three things at once: real, current facts (market size, competitors), a solid way to think it through, and an answer steady enough to defend in a boardroom.
Why this mix.
- Step-Back (E) — make the model name the decision framework first (what makes
a market worth entering) so the reasoning rests on a principle, not a random anecdote.
- ReAct (C) — the model doesn't already know 2026 market data, so it has to go
look it up. The fact-finding happens inside the loop.
- Self-Consistency (B) — a GO/NO-GO call is one clear answer, so running it
several times and taking the majority guards against a single unlucky search.
Assembled prompt (per sample):
```xml
<task>Recommend GO or NO-GO: mid-size US grocery chain entering the online
meal-kit market in 2026, with justification.</task>
<tools>web_search(query), get_market_report(sector)</tools>
<phase name="step_back">
First state the general framework that governs any market-entry decision
(e.g., market attractiveness × competitive intensity × fit with existing
capability × unit economics). Do not decide yet.
</phase>
<phase name="react">
Loop until you can decide. Each cycle:
Thought: what does the framework tell you to investigate next?
Action: web_search(...) or get_market_report(...)
Observation: <environment-supplied>
Ground every claim in an Observation; never fabricate one.
</phase>
<output>Final Answer: GO or NO-GO + 3-bullet justification tied to the framework.</output>
```
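The `react` phase implies a driver outside the model that runs each Action and feeds back the Observation. A minimal sketch of such a driver follows, assuming the model is any callable that maps the transcript so far to its next turn. `react_loop`, `scripted_model`, the regex, and the stub tool are hypothetical names introduced here for illustration, not part of the playbook.

```python
import re
from typing import Callable, Dict, List

# Matches lines such as: Action: web_search("meal kit CAC 2025")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\((["\'])(.*)\2\)\s*$', re.MULTILINE)


def react_loop(model: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               prompt: str,
               max_cycles: int = 6) -> str:
    """Alternate model turns and tool calls until a Final Answer appears."""
    transcript = prompt
    for _ in range(max_cycles):
        turn = model(transcript)  # Thought + Action, or the Final Answer
        transcript += "\n" + turn
        if "Final Answer:" in turn:
            return turn.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(turn)
        if match is None:
            raise ValueError("turn had neither an Action nor a Final Answer")
        name, _quote, arg = match.groups()
        # The Observation is supplied by the environment, never by the model.
        if name in tools:
            observation = tools[name](arg)
        else:
            observation = f"unknown tool {name!r}"
        transcript += f"\nObservation: {observation}"
    raise RuntimeError("no Final Answer within max_cycles")


def scripted_model(turns: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: replays canned turns in order."""
    remaining = iter(turns)
    return lambda transcript: next(remaining)


# Stub tool returning a canned string (illustrative only).
tools = {"get_market_report": lambda sector: "~$12B, growth slowing to ~5% YoY"}
model = scripted_model([
    'Thought: size the market first.\n'
    'Action: get_market_report("US meal-kit 2026")',
    "Final Answer: NO-GO",
])
print(react_loop(model, tools, "<task>Recommend GO or NO-GO</task>"))  # prints: NO-GO
```

The point of the design is that the model never writes its own Observation: the driver appends it, which is what makes "ground every claim in an Observation" enforceable.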
Wrapper (Self-Consistency, B1):
Run the prompt above N=7 times, temperature=0.7.
Extract GO / NO-GO from each. Return the majority verdict + vote split.
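The wrapper amounts to a majority vote over extracted verdicts. A minimal sketch, assuming `run_sample` is a hypothetical zero-argument callable that runs the assembled prompt once (at temperature 0.7 in a real setup) and returns the model's full text. The function names and the regex are illustrative, not taken from the playbook.

```python
import re
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

# "NO-GO" is listed first so it is not mistaken for a bare "GO".
VERDICT_RE = re.compile(r"Final Answer:\s*(NO-GO|GO)\b")


def extract_verdict(text: str) -> Optional[str]:
    """Pull GO / NO-GO out of one sample, or None if it cannot be parsed."""
    match = VERDICT_RE.search(text)
    return match.group(1) if match else None


def self_consistency(run_sample: Callable[[], str],
                     n: int = 7) -> Tuple[str, Dict[str, int]]:
    """Run n samples and return (majority verdict, vote split)."""
    votes: Counter = Counter()
    for _ in range(n):
        verdict = extract_verdict(run_sample())
        if verdict is not None:  # unparseable samples simply do not vote
            votes[verdict] += 1
    if not votes:
        raise ValueError("no sample produced a parseable verdict")
    winner, _count = votes.most_common(1)[0]
    return winner, dict(votes)


# Canned samples standing in for seven real runs (illustrative only).
canned = iter(["Final Answer: NO-GO (standalone)"] * 5 + ["Final Answer: GO"] * 2)
print(self_consistency(lambda: next(canned)))
```

Returning the split alongside the winner matters as much as the vote itself, since the minority samples are where an analyst would look for the weak point in the case.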
Abbreviated trace (one sample):
step_back → Framework: attractiveness × intensity × capability-fit × unit economics.
Thought: size the market first.
Action: get_market_report("US meal-kit 2026")
Observation: ~$12B, growth slowing to ~5% YoY, top-3 hold ~60% share.
Thought: intensity is high; check capability fit / differentiation.
Action: web_search("grocery chain private-label meal kit cold-chain advantage")
Observation: existing refrigerated logistics = real cost edge vs pure-play.
Thought: unit economics — CAC vs pure-plays?
Action: web_search("meal kit customer acquisition cost 2025")
Observation: CAC high (~$90–130), churn heavy.
Final Answer: NO-GO (standalone). Justification: (1) attractiveness fading —
slowing growth, entrenched top-3; (2) high CAC + churn breaks unit economics;
(3) capability-fit real but better captured as an add-on to existing e-commerce,
not a standalone entry.
The votes: 5 said NO-GO, 2 said GO → NO-GO (5 of 7). The two GO votes both leaned on the cold-chain edge; showing the split tells the analyst exactly where the decision is shaky.
The payoff. The answer rests on a principle (Step-Back), on real facts (ReAct), and is steadied against luck (Self-Consistency). The 5-to-2 vote split is a bonus deliverable — it points straight at the weak spot in the case.
Example 2 — Quantitative Word Problem
Stack: Least-to-Most (D) × PAL (C) × Self-Refine (F)
The situation. "A company's revenue grew 12% in Year 1, fell 5% in Year 2, and grew 20% in Year 3. If Year 3 revenue was $4.2M, what was the starting revenue, and what was the 3-year growth rate (CAGR)?" It's several steps plus exact math — exactly where plain Chain of Thought tends to slip.
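For reference, the algebra the stack has to get right is short. The worked form below uses $R_0$ for the starting revenue and $R_3$ for Year 3 revenue; that notation is introduced here and does not come from the playbook.

```latex
\begin{aligned}
R_3 &= R_0 \times 1.12 \times 0.95 \times 1.20 = 1.2768\,R_0 \\
R_0 &= \frac{\$4.2\text{M}}{1.2768} \approx \$3.289\text{M} \\
\text{CAGR} &= \left(\frac{R_3}{R_0}\right)^{1/3} - 1 = 1.2768^{1/3} - 1 \approx 0.0849
\end{aligned}
```

The two usual setup mistakes are adding the percentages (12 − 5 + 20) instead of multiplying the growth factors, and using the wrong exponent for the number of compounding periods.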
Why this mix.
- Least-to-Most (D) — break it into an ordered chain (find the starting
revenue, then the growth rate) so each step is small and reuses the last answer.
- PAL (C) — hand the actual arithmetic to a code interpreter, so the model only
has to get the method right and can't fumble the numbers.
- Self-Refine (F) — because the interpreter gives a real check, a critique
pass can safely catch setup mistakes (like flipping a growth rate the wrong way).
Assembled prompt:
```xml
<task>Find starting revenue and 3-year CAGR. Y1 +12%, Y2 -5%, Y3 +20%,
Y3 revenue = $4.2M.</task>
<phase name="decompose">
Ordered subproblems, easiest first:
1. Express Y3 revenue as start × product of growth factors.
2. Solve for start.
3. Compute CAGR from start and Y3 over 3 years.
</phase>
<phase name="solve_with_code">
Write Python that implements the subproblems in order. Reasoning in comments,
computation in code. Final line: answer = {"start": ..., "cagr": ...}.
</phase>
<phase name="self_refine">
Given the interpreter result, critique the SETUP (are growth factors and the
CAGR exponent correct?). If flawed, rewrite the code and re-run. Max 2 rounds.
</phase>
```
Model-written code (PAL):
```python
# Sub 1: Y3 = start * 1.12 * 0.95 * 1.20
factor = 1.12 * 0.95 * 1.20  # cumulative growth factor
# Sub 2: solve for start
start = 4_200_000 / factor
# Sub 3: CAGR over 3 years = (Y3/start)^(1/3) - 1
cagr = (4_200_000 / start) ** (1/3) - 1
answer = {"start": round(start, 2), "cagr": round(cagr, 4)}
```
Interpreter result: {"start": 3289473.68, "cagr": 0.0849}
Self-Refine pass:
Critique: factor = 1.2768; start = 4.2M / 1.2768 = $3.29M, consistent.
CAGR check: (4.2/3.289)^(1/3) - 1 = 8.49%. Exponent 1/3 correct for 3 years.
Setup valid → no revision needed.
Final Answer: Starting revenue ≈ $3.29M; 3-year CAGR ≈ 8.49%.
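One way to make the critique pass mechanical is to replay the reported answer forward through the stated growth rates instead of re-reading the model's own division. The sketch below is an illustrative check, not something the playbook prescribes; `check_answer`, `RATES`, and the tolerances are assumptions chosen for this example.

```python
import math
from typing import Dict, List

RATES = [0.12, -0.05, 0.20]  # Y1, Y2, Y3 growth from the problem statement
Y3_REVENUE = 4_200_000


def check_answer(answer: Dict[str, float]) -> List[str]:
    """Return a list of problems; an empty list means the answer replays cleanly."""
    problems = []

    # Forward replay: grow the reported start year by year.
    revenue = answer["start"]
    for rate in RATES:
        revenue *= 1 + rate
    if not math.isclose(revenue, Y3_REVENUE, abs_tol=1.0):  # within one dollar
        problems.append(f"start replays to {revenue:,.2f}, not {Y3_REVENUE:,}")

    # CAGR is the geometric mean of the growth factors, minus 1.
    expected_cagr = math.prod(1 + r for r in RATES) ** (1 / len(RATES)) - 1
    if abs(answer["cagr"] - expected_cagr) > 1e-4:  # allows 4-decimal rounding
        problems.append(f"cagr {answer['cagr']} differs from {expected_cagr:.4f}")

    return problems


print(check_answer({"start": 3_000_000.0, "cagr": 0.05}))  # two problems reported
```

Because the check multiplies forward while the solution divides backward, a flipped growth factor or a wrong exponent shows up as a mismatch rather than being reproduced by the same mistake.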
The payoff. Least-to-Most keeps the model from skipping steps; PAL makes the numbers exact; and Self-Refine is safe here because it checks against a real interpreter — not the shaky self-checking the playbook warns about.
Example 3 — Multi-Document Synthesis with Fact-Check
Stack: Thread of Thought (A) × Chain-of-Verification (C)
The situation. "Read these 6 uploaded audit reports and support tickets, summarize our vendor's security posture, and flag any contradictions." The input is long, messy, and spread across many sources — easy to lose a fact in the middle — and a made-up claim about a vendor is expensive.
Why this mix.
- Thread of Thought (A) — go source by source, summarizing each and keeping a
running thread, so nothing buried in the middle gets dropped.
- Chain-of-Verification (C) — the summary makes factual claims about a vendor,
so check each one on its own before finalizing.
Assembled prompt:
```xml
<task>Summarize the vendor's security posture and flag contradictions across the
6 sources.</task>
<context>{{6 REPORTS + TICKETS}}</context>
<phase name="thread">
Walk through each source in turn. For each: summarize its security claims and
note anything that conflicts with earlier sources. Carry a running summary.
</phase>
<phase name="draft_synthesis">Synthesize an overall posture + contradiction list,
grounded only in the per-source notes.</phase>
<phase name="verify">
List verification questions that would falsify each key claim in the synthesis
(e.g., "Does Source 3 actually state SOC2 Type II, or Type I?"). Answer each in
ISOLATION against the sources.
</phase>
<phase name="revise">Correct any claim the verification contradicts. Output final.</phase>
```
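If the phases are run as separate model calls rather than one prompt, "in ISOLATION" can be enforced structurally: the call that answers a verification question is given the sources but not the draft. A minimal sketch of that orchestration, assuming `ask` is a hypothetical callable that sends one prompt to a model and returns its text. The prompt wording and function names are illustrative.

```python
from typing import Callable, List


def chain_of_verification(ask: Callable[[str], str],
                          sources: str,
                          draft: str) -> str:
    """Plan questions from the draft, answer each without it, then revise."""
    # 1. Plan: the model sees the draft and proposes falsifying questions.
    plan = ask("List verification questions, one per line, that could "
               f"falsify the key claims in this synthesis:\n{draft}")
    questions = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Execute: each question is answered against the sources only, so the
    #    draft's wording cannot leak into the answer.
    findings: List[str] = []
    for question in questions:
        answer = ask(f"Sources:\n{sources}\n\n"
                     f"Answer from the sources only: {question}")
        findings.append(f"Q: {question}\nA: {answer}")

    # 3. Revise: only now are the draft and the findings shown together.
    return ask("Revise the draft so it agrees with every finding.\n"
               f"Draft:\n{draft}\n\nFindings:\n" + "\n".join(findings))


# Stub model that records prompts and returns canned text (illustrative only).
calls: List[str] = []


def fake_ask(prompt: str) -> str:
    calls.append(prompt)
    if prompt.startswith("List verification"):
        return "Does S3 say fully remediated?\nIs a critical open in S5?"
    if prompt.startswith("Sources:"):
        return "stub answer"
    return "revised synthesis"


print(chain_of_verification(fake_ask, "S1 ... S6", "DRAFT-TEXT"))  # prints: revised synthesis
```

Running the same steps inside a single prompt is cheaper, but then isolation depends on the model following the instruction rather than on what it can actually see.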
Abbreviated trace:
thread → S1: SOC2 Type II (2024). S2: pen-test, 2 criticals open. S3: claims
"fully remediated" — conflicts with S2. S4: MFA enforced. S5: ticket — one
critical still open as of Q1'26. S6: SLA doc, no security detail.
Running thread flags S3-vs-S5 conflict.
draft_synthesis → "Strong (SOC2 II, MFA), but open criticals; S3's 'fully
remediated' contradicts S5's open ticket."
verify → Q: "Does S3 say fully remediated?" → A: S3 says "remediation in
progress," NOT "fully remediated." (draft overstated it)
Q: "Is a critical open in S5?" → A: yes, confirmed Q1'26.
revise → Corrected: S3 claims remediation *in progress*; S5 confirms ≥1 critical
still open → posture: solid controls, but unresolved criticals and inconsistent
remediation reporting.
The payoff. Thread of Thought keeps Source 3 and Source 5 in view together, surfacing a tension between them that a one-pass summary would miss; Chain-of-Verification then catches the draft's own overstatement (Source 3 says "remediation in progress," not "fully remediated") before it goes out. You get long-input reliability and fact-checking in one pipeline.
Stack-selection cheatsheet
| If your task is… | Consider the stack |
|---|---|
| Strategic decision needing live facts + defensibility | Step-Back × ReAct × Self-Consistency |
| Multi-step quantitative / financial / algorithmic | Least-to-Most × PAL × Self-Refine |
| Synthesis over many messy documents, high factual stakes | Thread of Thought × Chain-of-Verification |
| Autonomous agent that must improve over retries | ReAct × Reflexion |
| Creative strategy exploring many options then converging | Tree/Graph of Thoughts × Self-Refine |
Rule of thumb: one framework per axis, applied outside-in — the framework that breaks down or reframes the problem shapes the prompt, grounding runs inside the loop, and sampling or checking wraps the result.
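The "one framework per axis" half of the rule is easy to check mechanically before assembling a prompt. A minimal sketch, assuming a stack is written as (framework, axis) pairs using the axis letters shown in the Stack lines of the examples; `axis_conflicts` is a hypothetical helper, not part of the playbook.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def axis_conflicts(stack: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return every axis that has more than one framework assigned to it."""
    by_axis: Dict[str, List[str]] = defaultdict(list)
    for framework, axis in stack:
        by_axis[axis].append(framework)
    return {axis: names for axis, names in by_axis.items() if len(names) > 1}


example_1 = [("Step-Back", "E"), ("ReAct", "C"), ("Self-Consistency", "B")]
print(axis_conflicts(example_1))                      # prints: {}
print(axis_conflicts([("ReAct", "C"), ("PAL", "C")]))  # prints: {'C': ['ReAct', 'PAL']}
```

An empty result only says the stack is well-formed; whether the chosen frameworks suit the task is still the judgment the table above is meant to guide.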