Output Eval Rubrics · Joshua Frattarola

This is the other half of the playbook: how to judge what a framework produces. Use it to score a reasoning trace, to decide whether a stack is ready for production, or to build a prompt that has one model grade another. A cost-and-latency model and a production go/no-go gate follow the rubrics.

How to use it. Score the six universal dimensions first, then run the framework-specific checks. A trace can be right on the answer but weak on faithfulness — meaning the answer didn't really come from the steps shown. That looks fine today and breaks tomorrow, so treat it as a hidden failure.

1. Universal reasoning-quality dimensions

Score each one 0 to 2 (0 = missing, 1 = partial, 2 = solid). A trace that's ready for production should score at least 10 of 12, with no zero on Faithfulness or Grounding.

Dimension	Question it answers	0 (fail)	2 (pass)
Correctness	Is the final answer right?	Wrong / unverifiable	Correct & checkable
Faithfulness	Does the stated reasoning actually produce the answer (not post-hoc)?	Answer doesn't follow from steps	Answer is entailed by the trace
Step validity	Does each step follow from prior ones + inputs?	Non-sequiturs / leaps	Every step justified
Grounding	Are claims tied to evidence/observations, not invented?	Fabricated facts	Every claim sourced
Completeness	Are all constraints/sub-questions addressed?	Constraints dropped	All satisfied
Efficiency	Minimal wasted steps/tokens/calls?	Rambling / redundant	Tight, purposeful
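The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rubric_verdict` and the lowercase dimension keys are assumptions made for the example; only the 0 to 2 scale, the 10-of-12 cutoff, and the no-zero rule on Faithfulness and Grounding come from the rubric.

```python
from typing import Dict, Tuple

# Dimension keys are illustrative names for the six rows of the table.
DIMENSIONS = (
    "correctness", "faithfulness", "step_validity",
    "grounding", "completeness", "efficiency",
)
# A zero on either of these fails the trace regardless of the total.
CRITICAL = ("faithfulness", "grounding")


def rubric_verdict(scores: Dict[str, int]) -> Tuple[int, bool]:
    """Return (total, passed) for one trace scored 0-2 on each dimension."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if scores[d] not in (0, 1, 2):
            raise ValueError(f"{d} must be 0, 1, or 2, got {scores[d]}")
    total = sum(scores[d] for d in DIMENSIONS)
    no_critical_zero = all(scores[d] > 0 for d in CRITICAL)
    return total, total >= 10 and no_critical_zero
```

Note that a trace can reach 10 of 12 and still fail: five dimensions at 2 with Faithfulness at 0 totals 10, but the zero blocks it.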

LLM-judge scaffold (drop-in):

<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).
For each, quote the specific step that justifies the score. Then give a verdict:
PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>
<trace>{{TRACE_TO_EVALUATE}}</trace>
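One way to wire the scaffold into a pipeline is sketched below. It assumes the placeholder is filled by plain string replacement and that the judge's reply ends with the word PASS or FAIL; the scaffold does not fix a reply format, so the parsing rule is a guess. The helper names `build_judge_prompt` and `parse_verdict` are hypothetical. The trace is escaped so that a stray `</trace>` inside it cannot close the wrapper early.

```python
import re
from typing import Optional
from xml.sax.saxutils import escape

JUDGE_TEMPLATE = (
    "<judge_task>Score the reasoning trace below on the 6 dimensions (0–2 each).\n"
    "For each, quote the specific step that justifies the score. Then give a verdict:\n"
    "PASS (>=10/12, no zero on Faithfulness/Grounding) or FAIL.</judge_task>\n"
    "<trace>{{TRACE_TO_EVALUATE}}</trace>"
)


def build_judge_prompt(trace: str) -> str:
    """Fill the scaffold, escaping &, <, > so the trace stays inside <trace>."""
    return JUDGE_TEMPLATE.replace("{{TRACE_TO_EVALUATE}}", escape(trace))


def parse_verdict(reply: str) -> Optional[str]:
    """Assumption: the last standalone PASS or FAIL in the reply is the verdict."""
    matches = re.findall(r"\b(PASS|FAIL)\b", reply)
    return matches[-1] if matches else None
```

Returning `None` when no verdict is found lets the caller retry or flag the reply instead of silently treating it as a failure.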

2. Framework-specific checks

Run these on top of the universal six. Each one catches the specific way that framework tends to fail (see Reasoning-Framework-Anti-Pattern-Gallery for real examples).

Framework	Check for its signature failure	Pass signal
CoT	Post-hoc rationalization — reasoning that justifies a pre-committed (wrong) answer	Perturbing an early step changes the answer
Thread of Thought	Mid-context evidence dropped from the synthesis	Every source/segment appears in the final synthesis
Tree of Thoughts	Evaluator scores are uncorrelated with real path quality	Pruned branches are genuinely worse than kept ones
Graph of Thoughts	Aggregation loses or double-counts sub-results	Merged node = faithful function of its inputs
Self-Consistency	Vote converges on a systematic error, not truth	Correct answer wins on genuinely independent paths, not paraphrases
ReAct	Fabricated Observation / reasoning ignores the real one	Every Thought after an Action cites that Observation
PAL / PoT	Correct code, wrong problem setup	Code's variables map 1:1 to the problem's quantities
Least-to-Most	Early subproblem error silently propagates	Each step's inputs match prior steps' verified outputs
Step-Back	Principle too generic to constrain, or wrong principle	The stated principle actually determines the specific answer
Self-Refine / RaR	"Correction" degrades a correct answer (no real verifier)	Revisions are backed by a verifier, not vibes
Reflexion	Reflection is vague ("try harder") not actionable	Reflection names a specific changed behavior
DSP	Stimulus anchors the model past the right answer	Output improves toward truth, not just toward the hint
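Some of these pass signals can be checked mechanically. As one example, the Thread of Thought check (every source or segment appears in the final synthesis) might be sketched as follows, assuming each segment carries an ID and the synthesis cites segments in square brackets such as `[S2]`. That citation convention and the function name `uncited_segments` are assumptions for illustration, not something the framework prescribes.

```python
from typing import List


def uncited_segments(segment_ids: List[str], synthesis: str) -> List[str]:
    """Return the segment IDs that never appear as a [ID] citation in the synthesis."""
    return [sid for sid in segment_ids if f"[{sid}]" not in synthesis]


# Example: the middle segment was dropped from the synthesis.
dropped = uncited_segments(
    ["S1", "S2", "S3"],
    "Revenue rose in Q1 [S1]; churn fell in Q3 [S3].",
)
```

An empty result is necessary but not sufficient: a citation shows the segment was mentioned, not that it was used correctly, so this pairs with the Faithfulness and Grounding scores rather than replacing them.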

3. Cost / Latency Budget Model

These are rough estimates. Let `T_in` = input tokens per call, `T_r` = tokens per reasoning pass, and `C` = number of model calls. Cost is roughly Σ(`T_in` + `T_r`) summed over all `C` calls. Latency depends only on the calls that must run sequentially, since calls that run in parallel overlap.
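The two quantities can be estimated separately, as in the sketch below. It models a run as a list of stages, where calls within a stage run in parallel and stages run one after another. The stage model and the names `Call`, `total_tokens`, and `sequential_steps` are assumptions made for the example, and the token figures are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    t_in: int  # input tokens for this call
    t_r: int   # reasoning tokens produced by this call


def total_tokens(stages: List[List[Call]]) -> int:
    """Cost proxy: sum of (T_in + T_r) over every call, parallel or not."""
    return sum(c.t_in + c.t_r for stage in stages for c in stage)


def sequential_steps(stages: List[List[Call]]) -> int:
    """Latency proxy: number of non-empty stages that must run back to back."""
    return sum(1 for stage in stages if stage)


# Self-Consistency with N=5: five calls in one parallel stage.
self_consistency = [[Call(1000, 300) for _ in range(5)]]
# ReAct with 3 cycles: one call per stage, context growing with each Observation.
react = [[Call(1000, 300)], [Call(1500, 300)], [Call(2000, 300)]]
```

With these placeholder numbers the Self-Consistency run uses more tokens (6500 against 5400) but only one sequential step against three.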

Framework	Calls (C)	Token driver	Latency shape
CoT	1	`T_r` (100–600)	1 sequential
Thread of Thought	1	`T_in` large + segment summaries	1 sequential
Tree of Thoughts	`~b·d + evals`	`b·d·T_r`	`d` sequential (per depth)
Graph of Thoughts	graph size + `K` loops	nodes·`T_r` + refinement	depends on dependency depth
Self-Consistency	`N` (5–40)	`N·T_r`	1 (fully parallel)
ReAct	`= #cycles`	Σ observations (often dominant)	`#cycles` sequential
PAL / PoT	1 + exec	code tokens + interpreter time	1 + exec round-trip
Least-to-Most	`1 + #subproblems`	growing context per step	`#subproblems` sequential
Step-Back	1 (or 2)	`T_r` + short abstraction	1–2 sequential
Self-Refine	`1 + 2·K`	`~(2–3)·T_r`	`2K+1` sequential
Reflexion	`≤ attempts·cycles`	attempts × full trajectory	attempts sequential
DSP	1	`T_r` + tiny stimulus	1 sequential
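A few rows of the table translate directly into formulas. The sketch below encodes the Calls and Latency shape columns for four frameworks. The `Shape` tuple and function names are hypothetical, and like the table itself the outputs are rough estimates.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    calls: int       # total model calls (C)
    sequential: int  # calls that must run one after another


def self_consistency(n: int) -> Shape:
    return Shape(calls=n, sequential=1)  # N paths, fully parallel


def react(cycles: int) -> Shape:
    return Shape(calls=cycles, sequential=cycles)  # each cycle waits on the last


def least_to_most(subproblems: int) -> Shape:
    # 1 decomposition call plus one call per subproblem; the latency column
    # counts only the subproblem chain.
    return Shape(calls=1 + subproblems, sequential=subproblems)


def self_refine(k: int) -> Shape:
    # 1 draft, then K rounds of (critique + revise), all sequential.
    return Shape(calls=1 + 2 * k, sequential=2 * k + 1)
```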

Rules of thumb.

Self-Consistency is cheap on time, expensive on tokens: it makes N calls, but they run in parallel, so the wall-clock time stays near 1×. Use it when you can spend tokens but not time.

ReAct's hidden cost is the Observations, not the Thoughts: trim tool outputs before they flow back into the prompt.

Depth-bound frameworks (Tree of Thoughts, Least-to-Most, Reflexion) set your speed floor; if you're time-limited, prefer the parallel ones (Self-Consistency, Skeleton-of-Thought).

Stacks multiply. Self-Consistency (N=7) × ReAct (5 cycles) ≈ 35 calls. Work out the budget before you deploy (see Reasoning-Framework-Decision-Log).
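The stacking arithmetic is easy to check before deployment. The sketch below multiplies the call counts of each layer and compares a flat per-call token estimate against a budget. The function names, the flat `tokens_per_call` figure, and the budget values are illustrative assumptions; real ReAct calls grow as Observations accumulate, so treat the result as a lower bound.

```python
import math
from typing import Sequence


def stacked_calls(layer_calls: Sequence[int]) -> int:
    """Total calls when each layer wraps the next: the counts multiply."""
    return math.prod(layer_calls)


def within_token_budget(
    layer_calls: Sequence[int], tokens_per_call: int, token_budget: int
) -> bool:
    """Rough check: total calls times a flat per-call token estimate."""
    return stacked_calls(layer_calls) * tokens_per_call <= token_budget


# Self-Consistency (N=7) wrapped around ReAct (5 cycles).
calls = stacked_calls([7, 5])
```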

4. Go/no-go gate for production

Ship a framework or stack only when all of these are true: it scores at least 10 of 12 with no zero on Faithfulness or Grounding; its framework-specific check passes; the measured cost and speed fit your budget; and there's a real checker in place for any self-correction step.
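The gate can be expressed as a checklist that returns the reasons a candidate is blocked. This is a sketch under assumptions: the `Candidate` fields, units, and budget parameters are hypothetical names chosen for the example, and the four conditions are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    rubric_total: int             # 0-12, sum of the six universal dimensions
    faithfulness: int             # 0-2
    grounding: int                # 0-2
    framework_check_passed: bool
    measured_cost: float          # same unit as cost_budget
    measured_latency_s: float
    uses_self_correction: bool
    has_verifier: bool            # a real checker backs the self-correction step


def go_no_go(c: Candidate, cost_budget: float, latency_budget_s: float) -> List[str]:
    """Return the list of blockers; an empty list means go."""
    blockers = []
    if c.rubric_total < 10 or c.faithfulness == 0 or c.grounding == 0:
        blockers.append("rubric: below 10/12 or zero on Faithfulness/Grounding")
    if not c.framework_check_passed:
        blockers.append("framework-specific check failed")
    if c.measured_cost > cost_budget or c.measured_latency_s > latency_budget_s:
        blockers.append("measured cost or latency exceeds budget")
    if c.uses_self_correction and not c.has_verifier:
        blockers.append("self-correction step has no real verifier")
    return blockers
```

Returning every blocker instead of stopping at the first makes it clear how far a stack is from shipping.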
