ADR-0016: Per-task-class eval harness as the evidence source for trust-tier promotion, threshold calibration, and LLM-Judge un-deferral

Status: Accepted (commitment); Deferred (implementation; lands before Phase 7 ships)
Date: 2026-05-12
Tags: trust · evidence · eval-harness · per-task-class · roadmap-gap · calibration
Related: Phase 5 ADR-0003, Phase 5 ADR-0008, production ADR-0008, production ADR-0011, production ADR-0015, production ADR-0028

Context

Three currently-deferred decisions all wait on the same missing artifact:

production ADR-0015 defers trust-score threshold calibration pending "N = 50 production migrations with full objective-signal traces and post-merge outcome data" and explicitly calls for per-task-class refinement: "Vulnerability patches likely tolerate higher gate threshold than convenience migrations; calibrate separately."
Phase 5 ADR-0008 defers the LLM-Judge persona "until evidence accrues," with no mechanism named for what kind of evidence justifies un-deferral.
production ADR-0028 sequences task classes (vuln remediation → migrations → agentic recipe authoring) but does not specify the gate that prevents a task class from graduating to higher autonomy tiers.

Phase 5 ships strict-AND objective signals as the only gate verdict source (Phase 5 ADR-0003, production ADR-0008). Strict-AND is the right contract for boundary correctness — "did the build pass, did tests pass, did SAST not regress" — but it answers a narrower question than the one the system actually needs to answer at task-class introduction time: "is the harness's judgment good enough on this task class to justify opening PRs at scale?"

The boundary-correctness layer (Phase 4 cassette discipline, Phase 5 strict-AND, Phase 9 Temporal retries) tests whether the wiring works. It does not test whether the LLM/recipe/planner picked the right answer. For vulnerability remediation, that gap is tolerable: the validator pass (does it still compile, do tests pass, does SAST regress, did the CVE drop out of the dependency tree) is a strong objective signal of correctness. For Chainguard migrations (Phase 7) and agentic recipe authoring (Phase 15), the objective signal weakens — a Chainguard migration that builds and passes tests may still be semantically wrong (wrong base image variant, missing runtime capability surfacing only under production load), and a generated recipe may "work" against one repo while being subtly incorrect for the broader class of repos it claims to cover.

CLAUDE.md §"Determinism over probabilism for structural changes" cites the empirical finding ("safer builders, risky maintainers") that drives recipe-first/LLM-fallback ordering. The corollary the codebase has not yet operationalized: judgment quality varies by task class, and trust granted in Phase 3 does not transfer to Phase 7 or Phase 15. Each new task class must earn its own evidence before promotion off the conservative starting tier.

The outcome ledger (Phase 13) closes the post-merge loop — did the PR merge clean, did it revert, did it regress — but only after PRs are opened. Outcome-ledger-only learning is too slow for task classes where one bad PR can poison the well across the portfolio (migrations especially). The system needs a pre-production evidence source that operates offline, against curated ground truth, and produces a per-task-class score the trust-tier promotion logic can read.

See phase-arch-design.md §Gap analysis — this is the gap the architect did not surface as a separate ADR until now. final-design.md §Open questions records the threshold-calibration question without naming the evidence-source contract.

Options considered

Rely on the outcome ledger alone (status quo). Promote task classes based on post-merge merge-rate / revert-rate. Pros: real-world signal, no synthetic-benchmark drift. Cons: too slow to detect a regression class (weeks of bad PRs before a pattern emerges); requires PRs already open at scale, which is exactly what task-class introduction is gating; "first 100 Chainguard PRs" is itself an unacceptable burn if 30% are wrong.
Eval-as-CI-gate (per-PR benchmark run). Run an eval suite on every PR before it opens. Pros: fast feedback. Cons: bench-case curation is too expensive to justify per-PR cadence; the LLM call cost alone makes this impractical; conflates "boundary correctness" (which CI already gates) with "judgment quality" (which is a different question with different cadence).
Closed eval set defined once, per phase. Pick one benchmark for "the LLM is good." Pros: one-shot work. Cons: ignores production ADR-0015's explicit "per-task-class refinement" requirement; ignores CLAUDE.md "Extension by addition" — a benchmark for vuln remediation does not predict migration performance.
Per-task-class eval harness with open registry, offline cadence, gating trust-tier promotion (and only that). Each task class ships bench/{task-class-slug}/ with curated ground-truth cases; a scoring rubric per task class produces bench_score; trust-tier promotion (bronze → silver → gold → platinum) gates on bench_score ≥ tier_threshold; the ledger continues to feed post-merge outcomes that periodically refresh the bench set. Mirrors Phase 5 ADR-0003's signal-kind registry pattern and reuses the @register_* decorator convention.

Decision

The codebase adopts a per-task-class eval harness as a first-class architectural artifact whose contract is owned by Phase 5 (the home of trust tiers) and whose implementation lands as a precondition for Phase 7 — i.e., Phase 7 cannot ship its first Chainguard migration PR at scale without bench/migration-chainguard-distroless/ populated and scored. Vulnerability remediation (Phase 3 / Phase 4) backfills bench/vuln-remediation/ as part of Phase 7's lead-in work, so the harness has a worked example before its first new-task-class consumer.

Contract shape:

Directory layout per task class. bench/{task-class-slug}/cases/{case-id}/ contains a curated input (a synthetic repo fixture or a frozen snapshot of a real repo at a specific commit), expected outcomes (ground-truth diff, expected recipe pick, expected CVE-delta, expected validator verdict), and metadata (case.yaml with provenance, difficulty tier, and a disposition: positive | negative | ambiguous label so the rubric can score appropriately).
Scoring rubric per task class. A bench/{task-class-slug}/rubric.py exports a single function score(harness_output, expected) -> BenchScore. BenchScore is a Pydantic model with at minimum passed: bool, score: float ∈ [0, 1], breakdown: dict[str, float], and failure_modes: list[str]. The rubric is task-class-specific — vuln-remediation scores on "fix correctness + no regressions + CVE drop"; migration scores on "image-variant match + runtime-capability match + size-delta within bound"; recipe-authoring scores on "recipe applies cleanly across the held-out repo set + no false-positive transforms."
Open registry mirror. Task classes register via @register_task_class("name") decorator (same pattern as @register_probe, @register_signal_kind); each registration declares bench_path, rubric, and the min_cases_for_promotion floor per tier. Phase 7 adds migration-chainguard-distroless; Phase 15 adds agentic-recipe-authoring — both as one-line decorator registrations.
Trust-tier promotion gate. Trust tiers (bronze / silver / gold / platinum) are read by Phase 5's gate framework. Promotion from tier N → N+1 for a task class requires bench_score.lower_bound_95 ≥ tier_threshold[N+1] over ≥ min_cases_for_promotion[N+1] cases AND zero block-severity entries in failure_modes. A production regression that the bench set failed to catch triggers automatic re-evaluation: the regression becomes a new bench case, the score recomputes, and the tier drops if the score falls below threshold.
Offline cadence, not CI-per-PR. The eval harness runs nightly (or per release candidate), never per-PR. Per-PR CI continues to gate on boundary correctness (Phase 5 strict-AND on objective signals); the eval harness gates trust-tier promotion, which is a slower, separately decided question. This is the load-bearing distinction: per-PR gates answer "is this PR safe to open?"; the eval harness answers "is this task class safe to operate at this autonomy tier?"
Bench cases are versioned and provenance-tracked. Each case carries provenance: {source: "curated" | "outcome-ledger-derived" | "regression-converted", commit_sha, added_at, last_validated_at}. Outcome-ledger reconciliation periodically converts post-merge incidents into new bench cases (the regression-converted path) — this is the feedback loop that prevents the bench set from going stale.
bench_score feeds, but does not replace, the strict-AND gate verdict. Per Phase 5 ADR-0003, bench_score MAY register as a signal kind in a future phase (it is "extension by addition," not a redesign), but Phase 5 does not require it. Strict-AND remains the per-PR verdict source; bench_score is the meta-verdict on whether the task class itself is trusted enough for its current tier.
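
A minimal sketch of the rubric contract described above, under stated assumptions. The ADR names BenchScore as a Pydantic model; this sketch uses a stdlib dataclass with the same four minimum fields so it runs without dependencies. The weights, the pass mark, and the dictionary keys on harness_output and expected are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchScore:
    """Dependency-free stand-in for the Pydantic model named in the contract."""

    passed: bool
    score: float  # must lie in [0, 1]
    breakdown: dict[str, float] = field(default_factory=dict)
    failure_modes: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")


# Hypothetical weights and pass mark; the ADR does not fix either.
WEIGHTS = {"fix_correct": 0.5, "no_regressions": 0.3, "cve_dropped": 0.2}
PASS_MARK = 0.99


def score(harness_output: dict, expected: dict) -> BenchScore:
    """Toy vuln-remediation rubric: three exact-match checks against ground truth."""
    checks = {
        "fix_correct": harness_output.get("diff") == expected["diff"],
        "no_regressions": harness_output.get("validator_verdict")
        == expected["validator_verdict"],
        "cve_dropped": harness_output.get("cve_delta") == expected["cve_delta"],
    }
    breakdown = {name: 1.0 if ok else 0.0 for name, ok in checks.items()}
    # Clamp so floating-point rounding can never push the total above 1.
    total = min(1.0, sum(WEIGHTS[name] * value for name, value in breakdown.items()))
    return BenchScore(
        passed=total >= PASS_MARK,
        score=total,
        breakdown=breakdown,
        failure_modes=[name for name, ok in checks.items() if not ok],
    )
```

A migration or recipe-authoring rubric would keep the same signature and return type while swapping the checks, which is the property the harness relies on.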

What this ADR explicitly resolves:

production ADR-0015's "Evidence needed to resolve." The eval harness is the structured evidence source. "N = 50 production migrations" becomes "N ≥ min_cases_for_promotion curated bench cases + post-merge outcome reconciliation." ADR-0015 is not un-deferred by this ADR — that requires the production data to actually accrue — but the shape of the evidence is now contractual.
Phase 5 ADR-0008's un-deferral criterion. If and when the LLM-Judge persona is introduced, the un-deferral ADR must reference bench_score on a judgment-quality benchmark (cases where objective signals conflict; the Judge's adjudication is scored against ground-truth resolutions). This ADR does not un-defer the Judge; it makes un-deferral evidence-shaped.
production ADR-0028's graduation gate. Each task class must ship bench/{task-class-slug}/ with at least 10 curated cases (min_cases_for_promotion[bronze] ≥ 10) before its introduction phase exits. This is the first concrete exit criterion for task-class introduction beyond "the code exists."
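
The promotion gate from the contract can be sketched as follows. The ADR does not say how lower_bound_95 is computed; this sketch assumes a Wilson score lower bound over per-case pass/fail results, which is one common choice and not necessarily the one the implementation will adopt. The function names and parameters are hypothetical.

```python
import math


def wilson_lower_bound(passes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval (z = 1.96 for roughly 95%)."""
    if n == 0:
        return 0.0
    p = passes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def may_promote(
    case_passed: list[bool],
    block_failure_count: int,
    tier_threshold: float,
    min_cases: int,
) -> bool:
    """All three gate conditions must hold: enough cases, no block-severity
    failure modes, and a lower bound at or above the target tier's threshold."""
    n = len(case_passed)
    if n < min_cases or block_failure_count > 0:
        return False
    return wilson_lower_bound(sum(case_passed), n) >= tier_threshold
```

Under this assumption, ten cases that all pass yield a lower bound of about 0.72, so a small bench set caps how high a tier it can support no matter how clean the results are.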

Tradeoffs

| Gain | Cost |
| --- | --- |
| Trust-tier promotion has a measurable, per-task-class evidence source instead of sentiment; ADR-0015's "calibrate separately per task class" becomes an executable contract | Bench-case curation is the dominant cost; expert-curated ground-truth cases for migrations alone are weeks of work; the cost is real and front-loaded |
| Phase 7 (Chainguard migrations) cannot accidentally ship with vuln-remediation's tier-thresholds; the harness forces per-class evidence | A new task class can no longer be introduced with a "let's see how it goes" PR run; the bench/ directory is a fence-CI-style precondition |
| LLM-Judge un-deferral (Phase 5 ADR-0008) has a clear evidence-shape: judgment-quality benchmarks where objective signals conflict | Until the harness exists, neither ADR-0015 nor Phase 5 ADR-0008 can be resolved; the deferrals stack on this ADR's implementation |
| Mirrors `@register_probe` / `@register_signal_kind` patterns: extension by addition, one-line registration per task class, no central edits | A fourth registry to maintain; registry collision detection must be added (same shape as `SignalKindAlreadyRegistered`) |
| Bench cases are versioned with provenance; outcome-ledger reconciliation feeds new cases back; the bench set self-refreshes | The reconciliation loop (Phase 13) becomes load-bearing: if it breaks, the bench set decays without warning; a staleness probe (analogous to IndexHealthProbe §B2) must monitor `last_validated_at` |
| Offline cadence (nightly, not per-PR) keeps per-PR CI fast and the LLM call cost bounded; per-PR strict-AND continues unchanged | Trust-tier promotion is correspondingly slower: a task class cannot graduate within a single sprint; promotion is a weeks-to-months decision tied to bench accumulation |
| `bench_score` is structurally separate from `TrustSignal.kind`: Phase 5's gate contract is not mutated; the harness is an external gate on tier promotion, not an internal gate input | Two trust-shaped contracts coexist (per-PR strict-AND, per-task-class bench_score); the conceptual surface area grows; docs must distinguish them sharply |
| Aligns with CLAUDE.md "Honest confidence" and "Facts, not judgments": the harness reports per-case correctness as evidence; the judgment on whether to promote remains explicit and human-confirmable | Bench cases themselves embed human judgment about what "correct" means; the curation step is judgment, not facts, and must be auditable |
| Failure modes in `BenchScore.failure_modes` become the input to recipe-set improvement and prompt-iteration, closing the learning loop the codebase otherwise lacks | The failure-mode taxonomy must be designed per task class; without it, `failure_modes: list[str]` degenerates into free-text noise |
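
One way to keep failure_modes from degenerating into free text is a closed, per-task-class enumeration with a severity attached to each entry. The sketch below assumes that shape; the member names, slugs, and the two-level severity scale are hypothetical and are not the taxonomy the migration task class will actually use.

```python
from enum import Enum


class Severity(Enum):
    BLOCK = "block"
    WARN = "warn"


class MigrationFailureMode(Enum):
    """Hypothetical closed taxonomy for the migration task class."""

    WRONG_IMAGE_VARIANT = ("wrong-image-variant", Severity.BLOCK)
    MISSING_RUNTIME_CAPABILITY = ("missing-runtime-capability", Severity.BLOCK)
    SIZE_DELTA_OUT_OF_BOUND = ("size-delta-out-of-bound", Severity.WARN)

    def __init__(self, slug: str, severity: Severity) -> None:
        self.slug = slug
        self.severity = severity


def parse_failure_modes(raw: list[str]) -> list[MigrationFailureMode]:
    """Map rubric-emitted strings onto the taxonomy; reject anything unknown."""
    by_slug = {mode.slug: mode for mode in MigrationFailureMode}
    unknown = [slug for slug in raw if slug not in by_slug]
    if unknown:
        raise ValueError(f"failure modes outside the taxonomy: {unknown}")
    return [by_slug[slug] for slug in raw]


def has_block_severity(raw: list[str]) -> bool:
    """True if any reported failure mode would veto a tier promotion."""
    return any(mode.severity is Severity.BLOCK for mode in parse_failure_modes(raw))
```

Rejecting unknown strings at parse time turns a drifting taxonomy into a loud error rather than silent noise in the score history.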

Consequences

New package: src/codegenie/eval/ lands as a precondition for Phase 7. Contains the @register_task_class decorator, BenchScore Pydantic model, the registry, the harness runner (loads cases, invokes the system under test, calls the rubric, aggregates), and the trust-tier promotion gate.
New directory contract: bench/{task-class-slug}/{cases,rubric.py,registration.py}. Like tests/snapshots/, this is contract territory: removing anything under cases/ requires an ADR amendment; additions are routine.
Phase 5's gates/ package is not modified. The harness is parallel infrastructure; Phase 5 ADR-0003's TrustScorer stays untouched. Phase 7+ may later register bench_score as a signal kind via Phase 5 ADR-0003's mechanism — that decision is deferred to whichever phase first finds it load-bearing.
Fence-CI extends. A new fence test asserts that any task class registered via @register_task_class("name") has a bench/{name}/ directory with at minimum cases/, rubric.py, and registration.py. A task class registered without a bench directory fails CI. This is the structural enforcement that prevents "extension by addition" from becoming "extension without evidence."
Phase 13 outcome ledger gains a reconciliation hook. Post-merge incidents that the bench set did not predict get converted into new bench cases (the regression-converted provenance class). This is a new responsibility for Phase 13's ledger consumer; Phase 13's ADRs must reference this one.
Phase 16 production hardening owns the staleness probe + alerting on last_validated_at decay. The probe lands there because production-cadence monitoring is its native concern.
New invariant. Introducing a new task class is a four-part PR: (a) @register_task_class registration; (b) bench/{name}/ with ≥ 10 curated cases; (c) per-task-class rubric.py; (d) ADR amendment to production ADR-0028 documenting the new class's place in the introduction order. The fence CI rejects any PR missing one or more of the four parts.
Phase 5 ADR-0008 (LLM-Judge deferral) gains an un-deferral criterion without un-deferring: any future ADR introducing the Judge must reference a judgment-quality bench (bench/judgment-arbitration/ or per-task-class judgment slices) and show bench_score ≥ judge_promotion_threshold over ≥ min_cases_for_promotion[silver] cases.
production ADR-0015 (threshold calibration) gains a structured-evidence path without being un-deferred: the per-task-class calibration ADR-0015 calls for is now sourceable from bench_score distributions per class, not just raw production-migration counts.
Roadmap implication. A "Phase 6.5" or "Phase 7-preamble" implementation slot is needed for the harness package + the first two bench directories (vuln + migration). Phase 5's architect surfaces this; the roadmap amendment is a separate task.

Reversibility

Medium-low. Once trust-tier promotion gates on bench_score, removing the harness means falling back to either (a) sentiment-based promotion (which this ADR exists to replace), or (b) freezing all task classes at the conservative bronze tier indefinitely (which kills autonomy). Neither is acceptable. Reverting the decorator and registry mechanics is mechanically easy (delete src/codegenie/eval/, drop the fence test); reverting the commitment — that task-class introduction requires curated evidence — costs the entire post-Phase-7 trust model. The bench cases themselves are durable assets; they survive any rewrite of the harness shape.

If the per-task-class scoring model proves wrong (e.g., bench performance does not predict production outcomes), the reversal is not "remove the harness" but "redesign the rubric" — same harness, new rubric.py per task class. The contract is the harness/rubric split; the rubric implementations are evolvable.
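
The harness/rubric split can be illustrated with a runner that takes the rubric as a parameter. This is a sketch under assumptions: the case dictionaries, the aggregate keys, and both example rubrics are hypothetical, and BenchScore is trimmed to the two fields the runner reads.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchScore:
    """Trimmed to the two fields this sketch aggregates."""

    passed: bool
    score: float


Rubric = Callable[[dict, dict], BenchScore]


def run_bench(
    cases: list[dict],
    system_under_test: Callable[[dict], dict],
    rubric: Rubric,
) -> dict:
    """Invoke the system on each case, score it with the rubric, aggregate."""
    scores = [rubric(system_under_test(case["input"]), case["expected"]) for case in cases]
    n = len(scores)
    return {
        "cases": n,
        "pass_rate": sum(s.passed for s in scores) / n if n else 0.0,
        "mean_score": sum(s.score for s in scores) / n if n else 0.0,
    }


def exact_rubric(output: dict, expected: dict) -> BenchScore:
    """All-or-nothing: the output must equal the expected outcome."""
    ok = output == expected
    return BenchScore(passed=ok, score=1.0 if ok else 0.0)


def partial_rubric(output: dict, expected: dict) -> BenchScore:
    """Redesigned rubric: credit per matching field, pass only on a full match."""
    matched = sum(output.get(key) == value for key, value in expected.items())
    fraction = matched / len(expected) if expected else 1.0
    return BenchScore(passed=fraction == 1.0, score=fraction)
```

Swapping exact_rubric for partial_rubric changes the scores without touching run_bench or the cases, which is the kind of reversal the paragraph above describes.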

Open questions deferred to implementation

Bench-case provenance taxonomy. "curated" / "outcome-ledger-derived" / "regression-converted" is the minimum set. Whether to add "adversarial-synthetic" (LLM-generated cases designed to find failure modes) is deferred, because this ADR is wary of adversarial cases that drift from real-world distributions.
Tier-threshold values. This ADR commits to tiers (bronze/silver/gold/platinum) and to per-task-class thresholds; it does not commit to specific numbers. Numbers are picked when the first bench set exists and the score distribution can be inspected.
LLM cost budget per eval run. Per-task-class evals invoke the full LLM/recipe/planner stack; cost per case ranges from cents (vuln remediation, deterministic recipe path) to dollars (recipe authoring, multi-turn). Budget caps per eval run are deferred to Phase 13's cost ledger; the harness must emit per-case cost as a BenchScore.cost_usd field.
Bench-set sharing across the org. Whether bench/ is checked into the same repo, lives in a separate codewizard-sherpa-benches repo, or is curated per-customer is deferred; the choice affects whether the harness has access to proprietary repo snapshots.
Mutation testing the rubric itself. The rubric is code; it has bugs. A meta-eval that mutates rubric.py and asserts the bench cases catch the mutations is plausibly load-bearing but deferred — Phase 16 hardening territory.
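
The provenance record and a staleness check over last_validated_at might be sketched as below. The field names come from the contract; the Provenance class, the stale_cases helper, and the maximum-age parameter are hypothetical, and the allowed sources are limited to the minimum set because "adversarial-synthetic" is still an open question.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Minimum provenance set named in this ADR; extending it is a deferred decision.
ALLOWED_SOURCES = frozenset({"curated", "outcome-ledger-derived", "regression-converted"})


@dataclass(frozen=True)
class Provenance:
    source: str
    commit_sha: str
    added_at: datetime
    last_validated_at: datetime

    def __post_init__(self) -> None:
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown provenance source: {self.source!r}")


def stale_cases(
    provenance_by_case: dict[str, Provenance],
    now: datetime,
    max_age: timedelta,
) -> list[str]:
    """Return the sorted ids of cases whose last validation is older than max_age."""
    return sorted(
        case_id
        for case_id, provenance in provenance_by_case.items()
        if now - provenance.last_validated_at > max_age
    )
```

Passing now explicitly keeps the check deterministic in tests; a production probe would supply the current time and alert on a non-empty result.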

Evidence / sources

phase-arch-design.md §Gap analysis — this ADR closes the unowned gap on "judgment-quality evidence for task-class promotion"
final-design.md §Open questions — threshold calibration's "per-task-class refinement" requirement
Phase 5 ADR-0003 — the @register_signal_kind pattern this ADR mirrors
Phase 5 ADR-0008 — the deferral this ADR makes evidence-shaped
production ADR-0008 — the per-PR gate contract this ADR composes with (does not replace)
production ADR-0011 — the recipe-first ordering whose effectiveness this harness measures
production ADR-0015 — the calibration ADR whose evidence shape this ADR contractualizes
production ADR-0028 — the introduction-order ADR this ADR adds a graduation gate to
CLAUDE.md §"Determinism over probabilism for structural changes" — the "safer builders, risky maintainers" empirical finding driving per-task-class evidence
CLAUDE.md §"Honest confidence" — the load-bearing commitment this harness operationalizes for trust-tier promotion
roadmap.md §Phase 7 — the first task class this ADR's implementation must precede
roadmap.md §Phase 13 — the outcome-ledger reconciliation source
roadmap.md §Phase 15 — the task class that cannot ship without this harness