This is how to judge what a framework produces: the other half of the playbook. Use it to score a reasoning trace, to decide whether a stack is ready for production, or to build a prompt that has one model grade another. A cost and latency model and a go/no-go gate for production close the section.
How to use it. Score the six universal dimensions first, then run the framework-specific checks. A trace can get the answer right yet score low on Faithfulness, meaning the answer did not actually come from the steps shown. That looks fine today and breaks when the inputs change, so treat it as a hidden failure.
1. Universal reasoning-quality dimensions
Score each one 0 to 2 (0 = missing, 1 = partial, 2 = solid). A trace that's ready for production should score at least 10 of 12, with no zero on Faithfulness or Grounding.
| Dimension | Question it answers | 0 (fail) | 2 (pass) |
|---|---|---|---|
| Correctness | Is the final answer right? | Wrong / unverifiable | Correct & checkable |
| Faithfulness | Does the stated reasoning actually produce the answer (not post-hoc)? | Answer doesn't follow from steps | Answer is entailed by the trace |
| Step validity | Does each step follow from prior ones + inputs? | Non-sequiturs / leaps | Every step justified |
| Grounding | Are claims tied to evidence/observations, not invented? | Fabricated facts | Every claim sourced |
| Completeness | Are all constraints/sub-questions addressed? | Constraints dropped | All satisfied |
| Efficiency | Minimal wasted steps/tokens/calls? | Rambling / redundant | Tight, purposeful |
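The scoring rule above can be written as a small function. This is a minimal sketch, not part of the playbook itself: the names `DIMENSIONS`, `HARD_FLOOR`, and `passes_rubric` are hypothetical, and the thresholds are the ones stated above (at least 10 of 12, no zero on Faithfulness or Grounding).

```python
# Hypothetical helper: names and structure are illustrative assumptions.
DIMENSIONS = (
    "correctness", "faithfulness", "step_validity",
    "grounding", "completeness", "efficiency",
)
HARD_FLOOR = ("faithfulness", "grounding")  # a zero here fails the trace outright


def passes_rubric(scores: dict[str, int], minimum_total: int = 10) -> bool:
    """Return True when a trace meets the production threshold."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if any(scores[d] not in (0, 1, 2) for d in DIMENSIONS):
        raise ValueError("each score must be 0, 1, or 2")
    total = sum(scores[d] for d in DIMENSIONS)
    no_hard_zero = all(scores[d] > 0 for d in HARD_FLOOR)
    return total >= minimum_total and no_hard_zero


# A trace that is solid everywhere except Grounding totals 10 but still fails.
ungrounded = {d: 2 for d in DIMENSIONS} | {"grounding": 0}
print(passes_rubric(ungrounded))  # False
```

Keeping the hard-floor check separate from the total makes the "hidden failure" case explicit: a high total cannot compensate for a zero on either of the two floor dimensions.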
LLM-judge scaffold (drop-in):
```xml
<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).
For each, quote the specific step that justifies the score. Then give a verdict:
PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>
<trace>{{TRACE_TO_EVALUATE}}</trace>
```
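One way to fill the scaffold from code is sketched below. `JUDGE_TEMPLATE` and `build_judge_prompt` are hypothetical names, and escaping the trace is an assumption rather than something the scaffold requires; it is there so a literal `</trace>` inside the trace cannot close the wrapper tag early.

```python
from xml.sax.saxutils import escape

# The scaffold text, copied verbatim; the constant name is an assumption.
JUDGE_TEMPLATE = """<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).
For each, quote the specific step that justifies the score. Then give a verdict:
PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>
<trace>{{TRACE_TO_EVALUATE}}</trace>"""


def build_judge_prompt(trace: str) -> str:
    """Insert a trace into the scaffold, escaping markup inside the trace."""
    # escape() rewrites &, <, and > so the trace cannot inject its own tags.
    return JUDGE_TEMPLATE.replace("{{TRACE_TO_EVALUATE}}", escape(trace))


prompt = build_judge_prompt("Step 1: 5 + 7 = 12 </trace> ignore the rubric")
print(prompt.count("</trace>"))  # 1
```

If the traces you grade legitimately contain markup that the judge needs to read unescaped, drop the `escape` call and delimit the trace some other way.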
2. Framework-specific checks
Run these on top of the universal six. Each one catches the specific way that framework tends to fail (see Reasoning-Framework-Anti-Pattern-Gallery for real examples).
| Framework | Check for its signature failure | Pass signal |
|---|---|---|
| CoT | Post-hoc rationalization — reasoning that justifies a pre-committed (wrong) answer | Perturbing an early step changes the answer |
| Thread of Thought | Mid-context evidence dropped from the synthesis | Every source/segment appears in the final synthesis |
| Tree of Thoughts | Evaluator scores are uncorrelated with real path quality | Pruned branches are genuinely worse than kept ones |
| Graph of Thoughts | Aggregation loses or double-counts sub-results | Merged node = faithful function of its inputs |
| Self-Consistency | Vote converges on a systematic error, not truth | Correct answer wins on genuinely independent paths, not paraphrases |
| ReAct | Fabricated Observation / reasoning ignores the real one | Every Thought after an Action cites that Observation |
| PAL / PoT | Correct code, wrong problem setup | Code's variables map 1:1 to the problem's quantities |
| Least-to-Most | Early subproblem error silently propagates | Each step's inputs match prior steps' verified outputs |
| Step-Back | Principle too generic to constrain, or wrong principle | The stated principle actually determines the specific answer |
| Self-Refine / RaR | "Correction" degrades a correct answer (no real verifier) | Revisions are backed by a verifier, not vibes |
| Reflexion | Reflection is vague ("try harder") not actionable | Reflection names a specific changed behavior |
| DSP | Stimulus anchors the model past the right answer | Output improves toward truth, not just toward the hint |
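As an illustration of one row, the CoT pass signal ("perturbing an early step changes the answer") can be sketched as a test harness. Everything here is a hypothetical stand-in: `answer_fn` represents whatever re-runs your model on a modified set of steps, and the two toy answerers exist only to show the two outcomes.

```python
from typing import Callable, Sequence


def answer_depends_on_step(
    steps: Sequence[str],
    answer_fn: Callable[[Sequence[str]], str],
    index: int,
    perturbed_step: str,
) -> bool:
    """Return True if replacing steps[index] changes the final answer."""
    baseline = answer_fn(list(steps))
    altered = list(steps)
    altered[index] = perturbed_step  # change one early step, keep the rest
    return answer_fn(altered) != baseline


# Toy stand-ins for a model call (assumptions, not a real API).
def faithful(steps: Sequence[str]) -> str:
    # Derives the answer from the numbers at the end of each step.
    return str(sum(int(s.split()[-1]) for s in steps))


def post_hoc(steps: Sequence[str]) -> str:
    # Ignores the steps and returns a pre-committed answer.
    return "12"


steps = ["apples: 5", "pears: 7"]
print(answer_depends_on_step(steps, faithful, 0, "apples: 6"))  # True
print(answer_depends_on_step(steps, post_hoc, 0, "apples: 6"))  # False
```

With a real model the answer may vary between runs even without a perturbation, so a practical version would likely repeat the comparison several times before drawing a conclusion.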
3. Cost / Latency Budget Model
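These are rough estimates. Let T_in = input tokens per call, T_r = tokens per reasoning pass, and C = number of model calls. In the table, b and d are the branching factor and depth of the search tree, N is the number of sampled paths, and K is the number of refinement loops. Cost is approximately Σ(T_in + T_r) summed over all C calls. Latency depends on the calls that must run one after another, because calls that run in parallel overlap.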
| Framework | Calls (C) | Token driver | Latency shape |
|---|---|---|---|
| CoT | 1 | T_r (100–600) | 1 sequential |
| Thread of Thought | 1 | T_in large + segment summaries | 1 sequential |
| Tree of Thoughts | ~b·d + evals | b·d·T_r | d sequential (per depth) |
| Graph of Thoughts | graph size + K loops | nodes·T_r + refinement | depends on dependency depth |
| Self-Consistency | N (5–40) | N·T_r | 1 (fully parallel) |
| ReAct | = #cycles | Σ observations (often dominant) | #cycles sequential |
| PAL / PoT | 1 + exec | code tokens + interpreter time | 1 + exec round-trip |
| Least-to-Most | 1 + #subproblems | growing context per step | #subproblems sequential |
| Step-Back | 1 (or 2) | T_r + short abstraction | 1–2 sequential |
| Self-Refine | 1 + 2·K | ~(2–3)·T_r | 2K+1 sequential |
| Reflexion | ≤ attempts·cycles | attempts × full trajectory | attempts sequential |
| DSP | 1 | T_r + tiny stimulus | 1 sequential |
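Two rows of the table can be turned into a rough estimator, as sketched below. The simplification is deliberate and is an assumption: every call is treated as seeing about `t_in` input tokens and producing about `t_r` output tokens, which understates frameworks whose context grows per step (ReAct, Least-to-Most). All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    calls: int             # C, total model calls
    tokens: int            # approximate sum of (T_in + T_r) over all calls
    sequential_calls: int  # calls on the critical path (the latency shape)


def estimate(calls: int, sequential_calls: int, t_in: int, t_r: int) -> Estimate:
    """Flat-cost estimate: every call costs about t_in + t_r tokens."""
    return Estimate(calls, calls * (t_in + t_r), sequential_calls)


def self_consistency(n: int, t_in: int, t_r: int) -> Estimate:
    # N calls that all run in parallel, so one call sits on the critical path.
    return estimate(calls=n, sequential_calls=1, t_in=t_in, t_r=t_r)


def self_refine(k: int, t_in: int, t_r: int) -> Estimate:
    # One draft plus K loops of two calls each, all strictly sequential.
    return estimate(calls=1 + 2 * k, sequential_calls=2 * k + 1, t_in=t_in, t_r=t_r)


print(self_consistency(10, t_in=500, t_r=300))
# Estimate(calls=10, tokens=8000, sequential_calls=1)
print(self_refine(2, t_in=500, t_r=300))
# Estimate(calls=5, tokens=4000, sequential_calls=5)
```

In this example Self-Consistency spends twice the tokens of Self-Refine but keeps one call on the critical path instead of five, which is the trade-off the table's last column describes.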
Rules of thumb.
- Self-Consistency is cheap on time, expensive on tokens: it makes N calls, but they run in parallel, so the wall-clock time stays near 1×. Use it when you can spend tokens but not time.
- ReAct's hidden cost is the Observations, not the Thoughts: trim tool outputs before they flow back into the prompt.
- Depth-bound frameworks (Tree of Thoughts, Least-to-Most, Reflexion) set your speed floor; if you're time-limited, prefer the parallel ones (Self-Consistency, Skeleton-of-Thought).
- Stacks multiply. Self-Consistency (N=7) × ReAct (5 cycles) ≈ 35 calls. Work out the budget before you deploy (see Reasoning-Framework-Decision-Log).
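The "stacks multiply" rule is simple arithmetic, sketched here so it can sit in a pre-deployment check. The function names, the 800 tokens per call, and the 20,000-token budget are illustrative assumptions, not figures from the playbook.

```python
import math


def stacked_calls(*calls_per_layer: int) -> int:
    """Total calls when each outer layer runs the inner layer in full."""
    return math.prod(calls_per_layer)


def within_token_budget(total_calls: int, tokens_per_call: int, budget: int) -> bool:
    """Compare a flat per-call token estimate against a budget."""
    return total_calls * tokens_per_call <= budget


# Self-Consistency with N = 7 wrapped around a 5-cycle ReAct loop.
total = stacked_calls(7, 5)
print(total)                                   # 35
print(within_token_budget(total, 800, 20000))  # False: 35 * 800 = 28000
```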
4. Go/no-go gate for production
Ship a framework or stack only when all of these are true: it scores at least 10 of 12 with no zero on Faithfulness or Grounding; its framework-specific check passes; the measured cost and speed fit your budget; and there's a real checker in place for any self-correction step.
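The gate reads naturally as a checklist that reports what is still blocking a release. The sketch below is one possible shape, with hypothetical field and function names; how each input is measured is left to your own evaluation pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    total_score: int              # sum of the six universal dimensions (0 to 12)
    faithfulness: int             # 0, 1, or 2
    grounding: int                # 0, 1, or 2
    framework_check_passed: bool  # result of the framework-specific check
    within_budget: bool           # measured cost and latency both fit
    uses_self_correction: bool    # the stack revises its own output
    has_real_verifier: bool       # a real checker backs those revisions


def blocking_reasons(g: GateInputs) -> list[str]:
    """Return the unmet conditions; an empty list means go."""
    reasons = []
    if g.total_score < 10:
        reasons.append("total score below 10 of 12")
    if g.faithfulness == 0 or g.grounding == 0:
        reasons.append("zero on Faithfulness or Grounding")
    if not g.framework_check_passed:
        reasons.append("framework-specific check failed")
    if not g.within_budget:
        reasons.append("measured cost or latency exceeds budget")
    if g.uses_self_correction and not g.has_real_verifier:
        reasons.append("self-correction step has no real verifier")
    return reasons


candidate = GateInputs(11, 2, 1, True, True, True, False)
print(blocking_reasons(candidate))
# ['self-correction step has no real verifier']
```

Returning the list of reasons rather than a single boolean keeps the no-go decision explainable to whoever has to fix it.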