Anti-Pattern Gallery · Joshua Frattarola

A living note. Real before-and-after failures — the thing that actually changes how you build, where the Module 1 "when not to use it" sections stay general. Each case points to the rubric rule it breaks (see Reasoning-Framework-Eval-Rubrics). It's seeded with the most common failures; add your own from Reasoning-Framework-Decision-Log lessons.

Each case has: the framework · what you see (symptom) · ❌ a bad example · why it happens · ✅ the fix · which rubric rule it breaks.

Seed cases

1. CoT — Hallucination laundering

Symptom. A smooth, step-by-step chain that lands, with total confidence, on a

wrong answer. The clean writing makes it look trustworthy.

❌ Bad. "The treaty was signed in 1847, so the 20-year clause expired in

1867, therefore X." (The 1847 date was made up.)

Why. Chain of Thought lays out reasoning but never checks its starting

facts, so a false fact gets a polished argument built on top of it.

✅ Fix. Add grounding — PAL for math, ReAct or retrieval for facts, or a

Chain-of-Verification pass on the facts. Never trust an unchecked factual claim from Chain of Thought.

Violates: Grounding, Faithfulness.

2. Tree of Thoughts — Weak evaluator, confident search

Symptom. Tree of Thoughts burns through many calls and still returns a bad

path — even though the search looked thorough.

❌ Bad. The scorer rates every branch "8/10 — promising," so nothing gets cut

and the search just spreads wide and expensive.

Why. Tree of Thoughts is only as good as its scorer. If the scorer can't tell

good paths from bad ones, the branching costs a lot and buys nothing ("hallucinatory backtracking").

✅ Fix. Test the scorer first: does it reliably rate known-good vs. known-bad

partial answers? If not, don't use Tree of Thoughts — the branching adds nothing.

Violates: Efficiency; ToT signature check.

3. Self-Refine — Correcting a correct answer

Symptom. On a math problem, the first answer was right — but the critique

pass "fixed" it into a wrong one.

❌ Bad. "On reflection, step 3 should use 1/2, not 1/3…" — inventing a flaw

that was never there.

Why. With no outside check, self-correction on fact-or-math tasks can make

things worse: the model has no ground truth, so its critique is just more guessing.

✅ Fix. Only self-correct against a real checker (unit tests, a code

interpreter, retrieval). No checker → skip the refine pass.

Violates: Correctness; Self-Refine signature check.

4. ReAct — Fabricated Observation

Symptom. The trace shows an "Observation" the tool never returned — and the

agent reasons over that made-up result.

❌ Bad. Action: search(...) → Observation: "Revenue was $4.2M" when the

search actually returned nothing.

Why. The model finishes the familiar pattern (Thought → Action → Observation)

even when the tool gave nothing back, so it invents an observation to keep going.

✅ Fix. Make sure observations come only from the tool (strict turn-taking,

or a stop token right after the action). When a result is empty, force an explicit "no result — retry or replan" branch.

Violates: Grounding; ReAct signature check.

5. Self-Consistency — Voting for a systematic error

Symptom. 9 of 10 samples agree — on the wrong answer.
❌ Bad. A trick question the model reliably gets wrong; voting just repeats

the shared mistake with extra confidence.

Why. Voting assumes mistakes are random and scattered. When the model is wrong

the same way every time, the majority is the mistake.

✅ Fix. Make the reasoning paths genuinely different (change the wording, not

just the temperature); add a checker or a Step-Back reframe to break the shared bias.

Violates: Correctness; Self-Consistency signature check.

6. Thread of Thought — Lost mid-document evidence

Symptom. A long-input summary quietly drops a key fact buried in source 4 of

❌ Bad. The final summary reflects sources 1–2 and 5–6; the middle just

vanished.

Why. Even Thread of Thought can under-read the middle if its chunk summaries

are too short to bring the fact back into view.

✅ Fix. Require one bullet per source in the summary, and "quote the exact

supporting line" for any claim the answer leans on.

Violates: Completeness; ThoT signature check.

7. PAL / PoT — Right code, wrong setup

Symptom. The code runs cleanly, but the answer is wrong.
❌ Bad. Growth factors typed as 1.12, 1.05, 1.20 when Year 2 was a 5%

drop (should be 0.95).

Why. PAL fixes calculation mistakes, not setup mistakes — a wrong

translation of the problem gives you a confidently exact wrong number.

✅ Fix. Add a self-refine pass that checks the setup (does each variable map

to the right quantity?) before trusting the interpreter's output.

Violates: Correctness; PAL signature check.

Add a case (template)

markdown### N. {{Framework}} — {{short symptom name}}
- **Symptom.** ...
- **❌ Bad.** ...
- **Why.** ...
- **✅ Fix.** ...
- **Violates:** <rubric dimension(s)>

Seed cases#

1. CoT — Hallucination laundering#

2. Tree of Thoughts — Weak evaluator, confident search#

3. Self-Refine — Correcting a correct answer#

4. ReAct — Fabricated Observation#

5. Self-Consistency — Voting for a systematic error#

6. Thread of Thought — Lost mid-document evidence#

7. PAL / PoT — Right code, wrong setup#

Add a case (template)#