LLM Reasoning Playbook

Output Eval Rubrics

How to judge each framework's output: universal dimensions, per-framework signature checks, and a cost/latency budget model.

This is how to judge what a framework produces — the other half of the playbook. Use it to score a reasoning trace, to decide whether a stack is ready for production, or to build a prompt that has one model grade another. A cost-and-speed model is at the end.

How to use it. Score the six universal dimensions first, then run the framework-specific checks. A trace can be right on the answer but weak on faithfulness — meaning the answer didn't really come from the steps shown. That looks fine today and breaks tomorrow, so treat it as a hidden failure.


1. Universal reasoning-quality dimensions

Score each one 0 to 2 (0 = missing, 1 = partial, 2 = solid). A trace that's ready for production should score at least 10 of 12, with no zero on Faithfulness or Grounding.

| Dimension | Question it answers | 0 (fail) | 2 (pass) |
|---|---|---|---|
| Correctness | Is the final answer right? | Wrong / unverifiable | Correct & checkable |
| Faithfulness | Does the stated reasoning actually produce the answer (not post-hoc)? | Answer doesn't follow from steps | Answer is entailed by the trace |
| Step validity | Does each step follow from prior ones + inputs? | Non-sequiturs / leaps | Every step justified |
| Grounding | Are claims tied to evidence/observations, not invented? | Fabricated facts | Every claim sourced |
| Completeness | Are all constraints/sub-questions addressed? | Constraints dropped | All satisfied |
| Efficiency | Minimal wasted steps/tokens/calls? | Rambling / redundant | Tight, purposeful |
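
The pass rule for the six dimensions can be checked mechanically once scores exist. The following is a minimal sketch, not a reference implementation: the function name `rubric_verdict` and the lowercase dimension keys are assumptions made for illustration, while the thresholds (10 of 12, no zero on Faithfulness or Grounding) come from the rubric itself.

```python
from typing import Dict

DIMENSIONS = (
    "correctness", "faithfulness", "step_validity",
    "grounding", "completeness", "efficiency",
)
HARD_FLOOR = ("faithfulness", "grounding")  # a zero here fails the trace outright
PASS_TOTAL = 10  # out of a maximum of 12


def rubric_verdict(scores: Dict[str, int]) -> str:
    """Return "PASS" or "FAIL" for one trace scored 0-2 on each dimension."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")
    total = sum(scores[d] for d in DIMENSIONS)
    no_hard_zero = all(scores[d] > 0 for d in HARD_FLOOR)
    return "PASS" if total >= PASS_TOTAL and no_hard_zero else "FAIL"


# A trace with the right answer but an unfaithful trace: 10/12, still a FAIL.
hidden_failure = {
    "correctness": 2, "faithfulness": 0, "step_validity": 2,
    "grounding": 2, "completeness": 2, "efficiency": 2,
}
print(rubric_verdict(hidden_failure))
```

The example input is the hidden-failure case: it reaches the numeric threshold, but the zero on Faithfulness blocks it.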

LLM-judge scaffold (drop-in):

<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).
For each, quote the specific step that justifies the score. Then give a verdict:
PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>
<trace>{{TRACE_TO_EVALUATE}}</trace>
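
One way to wire the scaffold into a pipeline is sketched below. It fills the `{{TRACE_TO_EVALUATE}}` slot and pulls a verdict out of the judge's reply. The helper names `build_judge_prompt` and `parse_verdict` are hypothetical, and the parsing rule (take the last standalone PASS or FAIL) is an assumption about how the judge formats its answer, not something the scaffold guarantees.

```python
import re
from xml.sax.saxutils import escape

JUDGE_TEMPLATE = """<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).
For each, quote the specific step that justifies the score. Then give a verdict:
PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>
<trace>{{TRACE_TO_EVALUATE}}</trace>"""


def build_judge_prompt(trace: str) -> str:
    # Escape &, <, > so text inside the trace cannot close <trace> early.
    return JUDGE_TEMPLATE.replace("{{TRACE_TO_EVALUATE}}", escape(trace))


def parse_verdict(judge_reply: str) -> str:
    """Return the last standalone PASS/FAIL in the reply, else "UNPARSEABLE"."""
    hits = re.findall(r"\b(PASS|FAIL)\b", judge_reply)
    return hits[-1] if hits else "UNPARSEABLE"
```

Treating an unparseable reply as its own outcome, rather than as a FAIL, keeps judge formatting problems separate from real rubric failures.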

2. Framework-specific checks

Run these on top of the universal six. Each one catches the specific way that framework tends to fail (see Reasoning-Framework-Anti-Pattern-Gallery for real examples).

| Framework | Check for its signature failure | Pass signal |
|---|---|---|
| CoT | Post-hoc rationalization: reasoning that justifies a pre-committed (wrong) answer | Perturbing an early step changes the answer |
| Thread of Thought | Mid-context evidence dropped from the synthesis | Every source/segment appears in the final synthesis |
| Tree of Thoughts | Evaluator scores are uncorrelated with real path quality | Pruned branches are genuinely worse than kept ones |
| Graph of Thoughts | Aggregation loses or double-counts sub-results | Merged node = faithful function of its inputs |
| Self-Consistency | Vote converges on a systematic error, not truth | Correct answer wins on genuinely independent paths, not paraphrases |
| ReAct | Fabricated Observation / reasoning ignores the real one | Every Thought after an Action cites that Observation |
| PAL / PoT | Correct code, wrong problem setup | Code's variables map 1:1 to the problem's quantities |
| Least-to-Most | Early subproblem error silently propagates | Each step's inputs match prior steps' verified outputs |
| Step-Back | Principle too generic to constrain, or wrong principle | The stated principle actually determines the specific answer |
| Self-Refine / RaR | "Correction" degrades a correct answer (no real verifier) | Revisions are backed by a verifier, not vibes |
| Reflexion | Reflection is vague ("try harder") not actionable | Reflection names a specific changed behavior |
| DSP | Stimulus anchors the model past the right answer | Output improves toward truth, not just toward the hint |
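
As an illustration of how one of these checks might be automated, here is a rough sketch of the CoT pass signal (perturbing an early step changes the answer). Everything in it is hypothetical scaffolding: `answer_from_steps` stands in for whatever call re-derives an answer from a list of steps, and the two toy functions replace a real model so the sketch runs on its own.

```python
from typing import Callable, List, Sequence


def cot_perturbation_check(
    steps: Sequence[str],
    answer_from_steps: Callable[[Sequence[str]], str],
    perturb: Callable[[str], str],
    step_index: int = 0,
) -> bool:
    """True if changing one early step changes the final answer."""
    baseline = answer_from_steps(steps)
    altered: List[str] = list(steps)
    altered[step_index] = perturb(altered[step_index])
    return answer_from_steps(altered) != baseline


def faithful(steps: Sequence[str]) -> str:
    # Toy stand-in: the answer is computed from the numbers in the steps.
    return str(sum(int(tok) for s in steps for tok in s.split() if tok.isdigit()))


def post_hoc(steps: Sequence[str]) -> str:
    # Toy stand-in: the answer is fixed no matter what the steps say.
    return "42"


steps = ["apples: 3", "pears: 4"]
bump = lambda s: s.replace("3", "5")
print(cot_perturbation_check(steps, faithful, bump))  # steps drive the answer
print(cot_perturbation_check(steps, post_hoc, bump))  # steps are decoration
```

With a real model the comparison would likely need answer normalization and several perturbations, since a single unchanged answer can also mean the perturbed step was irrelevant.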

3. Cost / latency budget model

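These are rough estimates. Let T_in = input tokens, T_r = tokens per reasoning pass, and C = number of model calls. In the table, b and d are the branching factor and depth of the tree, N is the number of sampled paths, and K is the number of refinement rounds. Cost is about Σ(T_in + T_r) across all calls; latency depends on the calls that must run one after another, since calls that run in parallel overlap.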

| Framework | Calls (C) | Token driver | Latency shape |
|---|---|---|---|
| CoT | 1 | T_r (100–600) | 1 sequential |
| Thread of Thought | 1 | T_in large + segment summaries | 1 sequential |
| Tree of Thoughts | ~b·d + evals | b·d·T_r | d sequential (per depth) |
| Graph of Thoughts | graph size + K loops | nodes·T_r + refinement | depends on dependency depth |
| Self-Consistency | N (5–40) | N·T_r | 1 (fully parallel) |
| ReAct | = #cycles | Σ observations (often dominant) | #cycles sequential |
| PAL / PoT | 1 + exec | code tokens + interpreter time | 1 + exec round-trip |
| Least-to-Most | 1 + #subproblems | growing context per step | #subproblems sequential |
| Step-Back | 1 (or 2) | T_r + short abstraction | 1–2 sequential |
| Self-Refine | 1 + 2·K | ~(2–3)·T_r | 2K+1 sequential |
| Reflexion | ≤ attempts·cycles | attempts × full trajectory | attempts sequential |
| DSP | 1 | T_r + tiny stimulus | 1 sequential |
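
The table rows can be turned into a small estimator. The sketch below covers three rows only (CoT, Self-Consistency, Self-Refine) and makes simplifying assumptions that are not in the table: every call resends roughly T_in, every call produces roughly T_r, and each sequential call takes the same wall-clock time. The names `Budget` and `latency_seconds` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    calls: int             # C: total model calls
    tokens: int            # approx. sum of (T_in + T_r) over all calls
    sequential_calls: int  # calls that must run one after another


def cot(t_in: int, t_r: int) -> Budget:
    return Budget(calls=1, tokens=t_in + t_r, sequential_calls=1)


def self_consistency(t_in: int, t_r: int, n: int) -> Budget:
    # N independent samples: N times the tokens, but a single parallel wave.
    return Budget(calls=n, tokens=n * (t_in + t_r), sequential_calls=1)


def self_refine(t_in: int, t_r: int, k: int) -> Budget:
    # 1 draft + K rounds of (critique, revise), all strictly in sequence.
    # Assumption: each call costs about T_in + T_r.
    calls = 1 + 2 * k
    return Budget(calls=calls, tokens=calls * (t_in + t_r), sequential_calls=calls)


def latency_seconds(b: Budget, seconds_per_call: float) -> float:
    # Assumption: uniform per-call latency; parallel calls fully overlap.
    return b.sequential_calls * seconds_per_call


sc = self_consistency(t_in=500, t_r=300, n=10)
sr = self_refine(t_in=500, t_r=300, k=2)
print(sc.tokens, latency_seconds(sc, 4.0))
print(sr.tokens, latency_seconds(sr, 4.0))
```

With these illustrative numbers, Self-Consistency spends twice the tokens of Self-Refine but finishes in one call's worth of time, while Self-Refine waits on five calls in a row. Measured numbers should replace the estimates before any deployment decision.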

Rules of thumb.

Self-Consistency costs roughly N× the tokens of a single pass, but the N samples run in parallel, so the wall-clock time stays near 1×. Use it when you can spend tokens but not time.

before they flow back into the prompt.

The number of calls that must run in sequence sets the speed floor; if you're time-limited, prefer the parallel frameworks (Self-Consistency, Skeleton-of-Thought).

out the budget before you deploy (see Reasoning-Framework-Decision-Log).


4. Go/no-go gate for production

Ship a framework or stack only when all of these are true: it scores at least 10 of 12 with no zero on Faithfulness or Grounding; its framework-specific check passes; the measured cost and speed fit your budget; and there's a real checker in place for any self-correction step.
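
The gate can be expressed as a checklist that returns its own reasons. This is a sketch under assumptions: the field names are invented for illustration, the budget comparisons are expected to be done upstream on measured numbers, and the checker condition is applied only when the stack actually contains a self-correction step.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateInputs:
    rubric_total: int            # sum of the six universal scores (0-12)
    faithfulness: int            # 0-2
    grounding: int               # 0-2
    framework_check_passed: bool
    cost_within_budget: bool     # from measured, not estimated, numbers
    latency_within_budget: bool
    uses_self_correction: bool   # e.g. Self-Refine or Reflexion in the stack
    has_real_checker: bool       # an external verifier, not the model's opinion


def ship_blockers(g: GateInputs) -> List[str]:
    """Return the reasons not to ship; an empty list means go."""
    reasons = []
    if g.rubric_total < 10:
        reasons.append("rubric total below 10/12")
    if g.faithfulness == 0 or g.grounding == 0:
        reasons.append("zero on Faithfulness or Grounding")
    if not g.framework_check_passed:
        reasons.append("framework-specific check failed")
    if not (g.cost_within_budget and g.latency_within_budget):
        reasons.append("measured cost or latency over budget")
    if g.uses_self_correction and not g.has_real_checker:
        reasons.append("self-correction step has no real checker")
    return reasons


candidate = GateInputs(
    rubric_total=11, faithfulness=2, grounding=2,
    framework_check_passed=True,
    cost_within_budget=True, latency_within_budget=True,
    uses_self_correction=True, has_real_checker=False,
)
print(ship_blockers(candidate))
```

Returning the list of blockers rather than a bare boolean makes a no-go decision explain itself in a review.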