A living note of real before-and-after failures: the concrete cases that actually change how you build, where the Module 1 "when not to use it" sections stay general. Each case points to the rubric rule it breaks (see Reasoning-Framework-Eval-Rubrics). It's seeded with the most common failures; add your own from Reasoning-Framework-Decision-Log lessons.
Each case has: the framework · what you see (symptom) · ❌ a bad example · why it happens · ✅ the fix · which rubric rule it breaks.
Seed cases
1. CoT — Hallucination laundering
- Symptom. A smooth, step-by-step chain that lands, with total confidence, on a
wrong answer. The clean writing makes it look trustworthy.
- ❌ Bad. "The treaty was signed in 1847, so the 20-year clause expired in
1867, therefore X." (The 1847 date was made up.)
- Why. Chain of Thought lays out reasoning but never checks its starting
facts, so a false fact gets a polished argument built on top of it.
- ✅ Fix. Add a Chain-of-Verification pass on the facts. Never trust an unchecked
factual claim from Chain of Thought.
- Violates: Grounding, Faithfulness.
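The gating idea behind the fix can be sketched in a few lines. This is not a full Chain-of-Verification implementation (which has the model generate and answer its own verification questions); it only shows the rule "no unchecked fact reaches the conclusion". The `Claim` type, the `treaty_signed_year` key, and the `trusted` store are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str   # the factual claim as the chain stated it
    key: str    # hypothetical lookup key into a trusted source
    value: str  # the value the chain asserted


def unverified_claims(claims, trusted_facts):
    """Return claims that the trusted source does not confirm.

    A claim is confirmed only if the trusted source has the same key
    with the same value; missing keys count as unverified.
    """
    return [c for c in claims if trusted_facts.get(c.key) != c.value]


# Hypothetical trusted source that has no entry for the treaty date.
trusted = {}
claims = [Claim("The treaty was signed in 1847", "treaty_signed_year", "1847")]

flagged = unverified_claims(claims, trusted)
if flagged:
    # Block the conclusion instead of building an argument on an unchecked premise.
    print("Unverified premises:", [c.text for c in flagged])
```

In practice the lookup would be retrieval or a second model pass; the point of the sketch is that the chain's conclusion is withheld until every premise it depends on has been checked.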
2. Tree of Thoughts — Weak evaluator, confident search
- Symptom. Tree of Thoughts burns through many calls and still returns a bad
path — even though the search looked thorough.
- ❌ Bad. The scorer rates every branch "8/10 — promising," so nothing gets cut
and the search just spreads wide and expensive.
- Why. Tree of Thoughts is only as good as its scorer. If the scorer can't tell
good paths from bad ones, the branching costs a lot and buys nothing ("hallucinatory backtracking").
- ✅ Fix. Test the scorer first: does it reliably rate known-good vs. known-bad
partial answers? If not, don't use Tree of Thoughts — the branching adds nothing.
- Violates: Efficiency; ToT signature check.
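One way to test the scorer before committing to Tree of Thoughts is to measure how often it ranks a known-good partial answer above a known-bad one. The sketch below is a minimal version under stated assumptions: `pairwise_accuracy`, both scorers, and the toy arithmetic steps are illustrative, not part of any library.

```python
from itertools import product


def pairwise_accuracy(scorer, good, bad):
    """Fraction of (good, bad) pairs where the scorer rates the good one strictly higher."""
    pairs = list(product(good, bad))
    wins = sum(1 for g, b in pairs if scorer(g) > scorer(b))
    return wins / len(pairs)


def flat_scorer(step):
    """Mimics the failure: every branch gets the same 8/10."""
    return 8


def arithmetic_scorer(step):
    """Toy scorer that actually checks a step of the form 'a op b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    result = int(a) + int(b) if op == "+" else int(a) * int(b)
    return 10 if result == int(rhs) else 0


known_good = ["2 + 2 = 4", "3 * 5 = 15"]
known_bad = ["2 + 2 = 5", "3 * 5 = 16"]

print(pairwise_accuracy(flat_scorer, known_good, known_bad))        # 0.0
print(pairwise_accuracy(arithmetic_scorer, known_good, known_bad))  # 1.0
```

A score near 0.5 or below on a labeled set like this suggests the scorer cannot prune, so the branching would only add cost. The acceptable threshold is a judgment call for your task.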
3. Self-Refine — Correcting a correct answer
- Symptom. On a math problem, the first answer was right — but the critique
pass "fixed" it into a wrong one.
- ❌ Bad. "On reflection, step 3 should use 1/2, not 1/3…" — inventing a flaw
that was never there.
- Why. With no outside check, self-correction on fact-or-math tasks can make
things worse: the model has no ground truth, so its critique is just more guessing.
- ✅ Fix. Only self-correct against a real checker (unit tests, a code
interpreter, retrieval). No checker → skip the refine pass.
- Violates: Correctness; Self-Refine signature check.
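The "no checker, no refine" rule can be written as a small gate. This is a sketch, not a reference implementation: `refine`, `satisfies_equation`, and `bad_critique` are hypothetical names, and the equation 3x = 1 stands in for whatever external check your task has.

```python
from fractions import Fraction


def refine(answer, propose_revision, checker=None):
    """Revise an answer only when an external checker says it is wrong.

    - No checker: return the first answer untouched (skip the refine pass).
    - Checker passes: keep the answer; there is nothing to fix.
    - Checker fails: accept the revision only if it passes the checker.
    """
    if checker is None:
        return answer
    if checker(answer):
        return answer
    revised = propose_revision(answer)
    return revised if checker(revised) else answer


def satisfies_equation(x):
    """Hypothetical ground truth for illustration: x must satisfy 3x = 1."""
    return 3 * x == 1


def bad_critique(x):
    """Mimics the failure: 'on reflection, it should be 1/2'."""
    return Fraction(1, 2)


first_answer = Fraction(1, 3)  # already correct

print(refine(first_answer, bad_critique, satisfies_equation))  # 1/3
print(refine(first_answer, bad_critique))                      # 1/3
```

With the gate in place, the invented flaw never replaces the correct answer, because the critique only runs when the checker has already rejected the original.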
4. ReAct — Fabricated Observation
- Symptom. The trace shows an "Observation" the tool never returned — and the
agent reasons over that made-up result.
- ❌ Bad. Action: search(...) → Observation: "Revenue was $4.2M", when the
search actually returned nothing.
- Why. The model finishes the familiar pattern (Thought → Action → Observation)
even when the tool gave nothing back, so it invents an observation to keep going.
- ✅ Fix. Make sure observations come only from the tool (strict turn-taking,
or a stop token right after the action). When a result is empty, force an explicit "no result — retry or replan" branch.
- Violates: Grounding; ReAct signature check.
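Both halves of the fix can be sketched without any agent framework. The first helper emulates a stop sequence by discarding anything the model wrote from "Observation:" onward; the second builds the observation from the tool result alone and forces the explicit empty-result branch. The function names, the `empty_search` tool, and the "Thought" text in the sample trace are assumptions for illustration.

```python
def truncate_at_observation(model_output, marker="Observation:"):
    """Drop everything from the marker onward, emulating a stop sequence.

    Anything the model wrote after the action is discarded, so a
    model-authored observation can never enter the trace.
    """
    return model_output.split(marker, 1)[0].rstrip()


def observe(tool, action):
    """Build the observation from the tool result only."""
    result = tool(action)
    if not result:
        # Explicit branch for empty results instead of letting the model fill the gap.
        return "no result: retry or replan"
    return str(result)


def empty_search(action):
    """Hypothetical tool that finds nothing."""
    return ""


raw = (
    "Thought: I need the revenue figure.\n"
    "Action: search(...)\n"
    'Observation: "Revenue was $4.2M"'
)

kept = truncate_at_observation(raw)
observation = observe(empty_search, "search(...)")

print(kept)
print("Observation:", observation)
```

Most hosted model APIs expose a stop-sequence parameter that does the truncation server-side; post-hoc truncation as above is a fallback when that is not available.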
5. Self-Consistency — Voting for a systematic error
- Symptom. 9 of 10 samples agree — on the wrong answer.
- ❌ Bad. A trick question the model reliably gets wrong; voting just repeats
the shared mistake with extra confidence.
- Why. Voting assumes mistakes are random and scattered. When the model is wrong
the same way every time, the majority is the mistake.
- ✅ Fix. Make the reasoning paths genuinely different (change the wording, not
just the temperature); add a checker or a Step-Back reframe to break the shared bias.
- Violates: Correctness; Self-Consistency signature check.
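A short illustration of why agreement is not evidence of correctness here. The sample answers are made up for the example ("B" is the shared wrong answer, "A" the correct one), and `majority_vote` is a minimal sketch of the voting step only.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most common answer and the share of samples that agree with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


# Hypothetical run: the model makes the same mistake in 9 of 10 samples.
samples = ["B"] * 9 + ["A"]
correct = "A"

winner, agreement = majority_vote(samples)
print(winner, agreement)   # B 0.9
print(winner == correct)   # False
```

The agreement rate measures how consistent the samples are, not how accurate. When errors are correlated across samples, a high rate is exactly what the systematic mistake looks like.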
6. Thread of Thought — Lost mid-document evidence
- Symptom. A long-input summary quietly drops a key fact buried in source 4 of
6.
- ❌ Bad. The final summary reflects sources 1–2 and 5–6; the middle just
vanished.
- Why. Even Thread of Thought can under-read the middle if its chunk summaries
are too short to bring the fact back into view.
- ✅ Fix. Require one bullet per source in the summary, and "quote the exact
supporting line" for any claim the answer leans on.
- Violates: Completeness; ThoT signature check.
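Both requirements in the fix are mechanically checkable. The sketch below assumes a summary is a list of bullets, each tagged with a source id and a quoted supporting line; that structure, the function names, and the sample texts are hypothetical.

```python
def missing_sources(bullets, source_ids):
    """Source ids that have no bullet in the summary."""
    covered = {b["source"] for b in bullets}
    return [s for s in source_ids if s not in covered]


def unsupported_quotes(bullets, sources):
    """Bullets whose quoted line does not appear verbatim in the cited source."""
    return [b for b in bullets if b["quote"] not in sources.get(b["source"], "")]


# Hypothetical six-source input; the key fact sits in source 4.
sources = {i: f"Text of source {i}." for i in range(1, 7)}
sources[4] = "Text of source 4. The key fact is here."

# A summary that reflects only sources 1-2 and 5-6.
bullets = [
    {"source": 1, "quote": "Text of source 1."},
    {"source": 2, "quote": "A paraphrase that is not in the source."},
    {"source": 5, "quote": "Text of source 5."},
    {"source": 6, "quote": "Text of source 6."},
]

print(missing_sources(bullets, sources.keys()))                # [3, 4]
print([b["source"] for b in unsupported_quotes(bullets, sources)])  # [2]
```

Exact substring matching is deliberately strict; a real pipeline might normalize whitespace first, but loosening it further reintroduces the risk that a paraphrase passes as a quote.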
7. PAL / PoT — Right code, wrong setup
- Symptom. The code runs cleanly, but the answer is wrong.
- ❌ Bad. Growth factors typed as 1.12, 1.05, 1.20 when Year 2 was a 5%
drop (should be 0.95).
- Why. PAL fixes calculation mistakes, not setup mistakes — a wrong
translation of the problem gives you a confidently exact wrong number.
- ✅ Fix. Add a self-refine pass that checks the setup (does each variable map
to the right quantity?) before trusting the interpreter's output.
- Violates: Correctness; PAL signature check.
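The example above can be worked through directly. The sketch assumes the problem statement described the three years as +12%, a 5% drop, and +20% (the first and third are inferred from the typed factors 1.12 and 1.20); `factors_from_changes` and `setup_mismatches` are hypothetical helper names.

```python
import math


def factors_from_changes(percent_changes):
    """Turn signed percent changes (e.g. -5 for a 5% drop) into growth factors."""
    return [1 + p / 100 for p in percent_changes]


def setup_mismatches(typed_factors, percent_changes):
    """Indices where a typed factor disagrees with the stated percent change."""
    expected = factors_from_changes(percent_changes)
    return [
        i
        for i, (typed, exp) in enumerate(zip(typed_factors, expected))
        if not math.isclose(typed, exp)
    ]


typed = [1.12, 1.05, 1.20]   # what the generated program used
stated = [12, -5, 20]        # assumed reading of the problem text

print(setup_mismatches(typed, stated))                  # [1]  (Year 2)
print(round(math.prod(typed), 4))                       # 1.4112, exact but wrong
print(round(math.prod(factors_from_changes(stated)), 4))  # 1.2768
```

The interpreter computes 1.4112 flawlessly; only the comparison against the stated changes reveals that the second factor has the wrong sign, which is why the setup check has to run before the output is trusted.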
Add a case (template)
### N. {{Framework}} — {{short symptom name}}
- **Symptom.** ...
- **❌ Bad.** ...
- **Why.** ...
- **✅ Fix.** ...
- **Violates:** <rubric dimension(s)>