ADR-0016: Per-task-class eval harness as the evidence source for trust-tier promotion, threshold calibration, and LLM-Judge un-deferral
Status: Accepted (commitment) / Deferred (implementation; lands before Phase 7 ships)
Date: 2026-05-12
Tags: trust · evidence · eval-harness · per-task-class · roadmap-gap · calibration
Related: Phase 5 ADR-0003, Phase 5 ADR-0008, production ADR-0008, production ADR-0011, production ADR-0015, production ADR-0028
Context
Three currently-deferred decisions all wait on the same missing artifact:
- production ADR-0015 defers trust-score threshold calibration pending "N = 50 production migrations with full objective-signal traces and post-merge outcome data" and explicitly calls for per-task-class refinement: "Vulnerability patches likely tolerate higher gate threshold than convenience migrations; calibrate separately."
- Phase 5 ADR-0008 defers the LLM-Judge persona "until evidence accrues," with no mechanism named for what kind of evidence justifies un-deferral.
- production ADR-0028 sequences task classes (vuln remediation → migrations → agentic recipe authoring) but does not specify the gate that prevents a task class from graduating to higher autonomy tiers.
Phase 5 ships strict-AND objective signals as the only gate verdict source (Phase 5 ADR-0003, production ADR-0008). Strict-AND is the right contract for boundary correctness — "did the build pass, did tests pass, did SAST not regress" — but it answers a narrower question than the one the system actually needs to answer at task-class introduction time: "is the harness's judgment good enough on this task class to justify opening PRs at scale?"
The boundary-correctness layer (Phase 4 cassette discipline, Phase 5 strict-AND, Phase 9 Temporal retries) tests whether the wiring works. It does not test whether the LLM/recipe/planner picked the right answer. For vulnerability remediation, that gap is tolerable: the validator pass (does it still compile, do tests pass, does SAST regress, did the CVE drop out of the dependency tree) is a strong objective signal of correctness. For Chainguard migrations (Phase 7) and agentic recipe authoring (Phase 15), the objective signal weakens — a Chainguard migration that builds and passes tests may still be semantically wrong (wrong base image variant, missing runtime capability surfacing only under production load), and a generated recipe may "work" against one repo while being subtly incorrect for the broader class of repos it claims to cover.
CLAUDE.md §"Determinism over probabilism for structural changes" cites the empirical finding ("safer builders, risky maintainers") that drives recipe-first/LLM-fallback ordering. The corollary the codebase has not yet operationalized: judgment quality varies by task class, and trust granted in Phase 3 does not transfer to Phase 7 or Phase 15. Each new task class must earn its own evidence before promotion off the conservative starting tier.
The outcome ledger (Phase 13) closes the post-merge loop — did the PR merge clean, did it revert, did it regress — but only after PRs are opened. Outcome-ledger-only learning is too slow for task classes where one bad PR can poison the well across the portfolio (migrations especially). The system needs a pre-production evidence source that operates offline, against curated ground truth, and produces a per-task-class score the trust-tier promotion logic can read.
See phase-arch-design.md §Gap analysis — this is the gap the architect did not surface as a separate ADR until now. final-design.md §Open questions records the threshold-calibration question without naming the evidence-source contract.
Options considered
- Rely on the outcome ledger alone (status quo). Promote task classes based on post-merge merge-rate / revert-rate. Pros: real-world signal, no synthetic-benchmark drift. Cons: too slow to detect a regression class (weeks of bad PRs before a pattern emerges); requires PRs already open at scale, which is exactly what task-class introduction is gating; "first 100 Chainguard PRs" is itself an unacceptable burn if 30% are wrong.
- Eval-as-CI-gate (per-PR benchmark run). Run an eval suite on every PR before it opens. Pros: fast feedback. Cons: bench-case curation is too expensive to justify per-PR cadence; the LLM call cost alone makes this impractical; conflates "boundary correctness" (which CI already gates) with "judgment quality" (which is a different question with different cadence).
- Closed eval set defined once, per phase. Pick one benchmark for "the LLM is good." Pros: one-shot work. Cons: ignores production ADR-0015's explicit "per-task-class refinement" requirement; ignores CLAUDE.md "Extension by addition" — a benchmark for vuln remediation does not predict migration performance.
- Per-task-class eval harness with open registry, offline cadence, gating trust-tier promotion (and only that). Each task class ships `bench/{task-class}/` with curated ground-truth cases; a scoring rubric per task class produces `bench_score`; trust-tier promotion (bronze → silver → gold → platinum) gates on `bench_score ≥ tier_threshold`; the ledger continues to feed post-merge outcomes that periodically refresh the bench set. Mirrors Phase 5 ADR-0003's signal-kind registry pattern and reuses the `@register_*` decorator convention.
Decision
The codebase adopts a per-task-class eval harness as a first-class architectural artifact whose contract is owned by Phase 5 (the home of trust tiers) and whose implementation lands as a precondition for Phase 7 — i.e., Phase 7 cannot ship its first Chainguard migration PR at scale without bench/migration-chainguard-distroless/ populated and scored. Vulnerability remediation (Phase 3 / Phase 4) backfills bench/vuln-remediation/ as part of Phase 7's lead-in work, so the harness has a worked example before its first new-task-class consumer.
Contract shape:
- Directory layout per task class. `bench/{task-class-slug}/cases/{case-id}/` contains a curated input (a synthetic repo fixture or a frozen snapshot of a real repo at a specific commit), expected outcomes (ground-truth diff, expected recipe pick, expected CVE-delta, expected validator verdict), and metadata (`case.yaml` with provenance, difficulty tier, and a `disposition: positive | negative | ambiguous` label so the rubric can score appropriately).
- Scoring rubric per task class. A `bench/{task-class-slug}/rubric.py` exports a single function `score(harness_output, expected) -> BenchScore`. `BenchScore` is a Pydantic model with at minimum `passed: bool`, `score: float ∈ [0, 1]`, `breakdown: dict[str, float]`, and `failure_modes: list[str]`. The rubric is task-class-specific: vuln-remediation scores on "fix correctness + no regressions + CVE drop"; migration scores on "image-variant match + runtime-capability match + size-delta within bound"; recipe-authoring scores on "recipe applies cleanly across the held-out repo set + no false-positive transforms."
- Open registry mirror. Task classes register via the `@register_task_class("name")` decorator (same pattern as `@register_probe`, `@register_signal_kind`); each registration declares `bench_path`, `rubric`, and the `min_cases_for_promotion` floor per tier. Phase 7 adds `migration-chainguard-distroless`; Phase 15 adds `agentic-recipe-authoring`, both as one-line decorator registrations.
- Trust-tier promotion gate. Trust tiers (bronze / silver / gold / platinum) are read by Phase 5's gate framework. Promotion from tier N → N+1 for a task class requires `bench_score.lower_bound_95 ≥ tier_threshold[N+1]` over ≥ `min_cases_for_promotion[N+1]` cases AND zero `block`-severity failure modes in the breakdown. Demotion is automatic: any production regression that the bench set fails to catch becomes a new bench case, the score recomputes, and the tier drops if the score falls below threshold.
- Offline cadence, not CI-per-PR. The eval harness runs nightly (or per-release-candidate), never per-PR. Per-PR CI continues to gate on boundary correctness (Phase 5 strict-AND on objective signals); the eval harness gates trust-tier promotion, which is a slower, separately-decided question. This is the load-bearing distinction: per-PR gates answer "is this PR safe to open?"; the eval harness answers "is this task class at this autonomy tier safe to operate at?"
- Bench cases are versioned and provenance-tracked. Each case carries `provenance: {source: "curated" | "outcome-ledger-derived" | "regression-converted", commit_sha, added_at, last_validated_at}`. Outcome-ledger reconciliation periodically converts post-merge incidents into new bench cases (the regression-converted path); this is the feedback loop that prevents the bench set from going stale.
- `bench_score` feeds, but does not replace, the strict-AND gate verdict. Per Phase 5 ADR-0003, `bench_score` MAY register as a signal kind in a future phase (it is "extension by addition," not a redesign), but Phase 5 does not require it. Strict-AND remains the per-PR verdict source; `bench_score` is the meta-verdict on whether the task class itself is trusted enough for its current tier.
What this ADR explicitly resolves:
- production ADR-0015's "Evidence needed to resolve." The eval harness is the structured evidence source. "N = 50 production migrations" becomes "N ≥ `min_cases_for_promotion` curated bench cases + post-merge outcome reconciliation." ADR-0015 is not un-deferred by this ADR (that requires the production data to actually accrue), but the shape of the evidence is now contractual.
- Phase 5 ADR-0008's un-deferral criterion. When/if the LLM Judge persona is introduced, the un-deferral ADR must reference `bench_score` on a judgment-quality benchmark (cases where objective signals conflict; the Judge's adjudication is scored against ground-truth resolutions). This ADR does not un-defer the Judge; it makes un-deferral evidence-shaped.
- production ADR-0028's graduation gate. Each task class must ship `bench/{task-class}/` with `min_cases_for_promotion[bronze] ≥ 10` curated cases before its introduction phase exits. This is the first concrete exit criterion for task-class introduction beyond "the code exists."
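The promotion rule in the contract shape (lower 95% bound at or above the tier threshold, a minimum case count, zero block-severity failures) can be sketched as below. The ADR does not say how `lower_bound_95` is computed; this sketch assumes per-case pass/fail results and takes the lower end of the two-sided 95% Wilson score interval, which is one reasonable choice among several. The function names and parameters are hypothetical.

```python
import math
from typing import Sequence


def wilson_lower_bound(passes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 ~ 95%)."""
    if total == 0:
        return 0.0
    p = passes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def may_promote(case_passed: Sequence[bool], block_failures: int,
                tier_threshold: float, min_cases: int) -> bool:
    """Apply the three promotion conditions; all must hold."""
    if len(case_passed) < min_cases:   # not enough evidence yet
        return False
    if block_failures > 0:             # any block-severity failure vetoes
        return False
    bound = wilson_lower_bound(sum(case_passed), len(case_passed))
    return bound >= tier_threshold
```

Under this assumption, ten passes out of ten cases yield a lower bound of roughly 0.72, so a small bench set cannot clear a high threshold even with a perfect record. That is consistent with promotion being tied to bench accumulation rather than to a single good run.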
Tradeoffs
| Gain | Cost |
|---|---|
| Trust-tier promotion has a measurable, per-task-class evidence source instead of sentiment — ADR-0015's "calibrate separately per task class" becomes an executable contract | Bench-case curation is the dominant cost; expert-curated ground-truth cases for migrations alone are weeks of work; the cost is real and front-loaded |
| Phase 7 (Chainguard migrations) cannot accidentally ship with vuln-remediation's tier-thresholds; the harness forces per-class evidence | A new task class can no longer be introduced with a "let's see how it goes" PR run — the bench/ directory is a fence-CI-style precondition |
| LLM-Judge un-deferral (Phase 5 ADR-0008) has a clear evidence-shape: judgment-quality benchmarks where objective signals conflict | Until the harness exists, neither ADR-0015 nor Phase 5 ADR-0008 can be resolved; the deferrals stack on this ADR's implementation |
| Mirrors `@register_probe` / `@register_signal_kind` patterns: extension by addition, one-line registration per task class, no central edits | A fourth registry to maintain; registry collision detection must be added (same shape as `SignalKindAlreadyRegistered`) |
| Bench cases are versioned with provenance; outcome-ledger reconciliation feeds new cases back; the bench set self-refreshes | The reconciliation loop (Phase 13) becomes load-bearing — if it breaks, the bench set decays without warning; a staleness probe (analogous to IndexHealthProbe §B2) must monitor last_validated_at |
| Offline cadence (nightly, not per-PR) keeps per-PR CI fast and the LLM call cost bounded; per-PR strict-AND continues unchanged | Trust-tier promotion is correspondingly slower — a task class cannot graduate within a single sprint; promotion is a weeks-to-months decision tied to bench accumulation |
| `bench_score` is structurally separate from `TrustSignal.kind`: Phase 5's gate contract is not mutated; the harness is an external gate on tier promotion, not an internal gate input | Two trust-shaped contracts coexist (per-PR strict-AND, per-task-class `bench_score`); the conceptual surface area grows; docs must distinguish them sharply |
| Aligns with CLAUDE.md "Honest confidence" and "Facts, not judgments" — the harness reports per-case correctness as evidence; the judgment on whether to promote remains explicit and human-confirmable | Bench cases themselves embed human judgment about what "correct" means — the curation step is judgment, not facts, and must be auditable |
| Failure modes in `BenchScore.failure_modes` become the input to recipe-set improvement and prompt iteration, closing the learning loop the codebase otherwise lacks | The failure-mode taxonomy must be designed per task class; without it, `failure_modes: list[str]` degenerates into free-text noise |
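The staleness check named in the table can be as small as a comparison of each case's `last_validated_at` against a maximum age. The sketch below is illustrative only: the function name, the mapping of case id to timestamp, and the idea of a single `max_age` cut-off are assumptions, and the ADR leaves the real probe to Phase 16.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_cases(last_validated_at: Dict[str, datetime], max_age: timedelta,
                now: Optional[datetime] = None) -> List[str]:
    """Return ids of bench cases not revalidated within max_age, sorted."""
    now = now or datetime.now(timezone.utc)
    return sorted(case_id for case_id, validated in last_validated_at.items()
                  if now - validated > max_age)
```

A probe built on this would alert when the returned list is non-empty or grows between runs; the alerting policy itself is outside what this ADR specifies.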
Consequences
- New package: `src/codegenie/eval/` lands as a precondition for Phase 7. Contains the `@register_task_class` decorator, `BenchScore` Pydantic model, the registry, the harness runner (loads cases, invokes the system under test, calls the rubric, aggregates), and the trust-tier promotion gate.
- New directory contract: `bench/{task-class-slug}/{cases,rubric.py,registration.py}`. Treated like `tests/snapshots/`: contract territory; mutation requires ADR amendment for `cases/` removals; additions are routine.
- Phase 5's `gates/` package is not modified. The harness is parallel infrastructure; Phase 5 ADR-0003's `TrustScorer` stays untouched. Phase 7+ may later register `bench_score` as a signal kind via Phase 5 ADR-0003's mechanism; that decision is deferred to whichever phase first finds it load-bearing.
- Fence-CI extends. A new fence test asserts that any task class registered via `@register_task_class("name")` has a `bench/{name}/` directory with at minimum `cases/`, `rubric.py`, and `registration.py`. A task class registered without a bench directory fails CI. This is the structural enforcement that prevents "extension by addition" from becoming "extension without evidence."
- Phase 13 outcome ledger gains a reconciliation hook. Post-merge incidents that the bench set did not predict get converted into new bench cases (the regression-converted provenance class). This is a new responsibility for Phase 13's ledger consumer; Phase 13's ADRs must reference this one.
- Phase 16 production hardening owns the staleness probe + alerting on `last_validated_at` decay. The probe lands there because production-cadence monitoring is its native concern.
- New invariant: introducing a new task class is a four-part PR: (a) `@register_task_class` registration; (b) `bench/{name}/` with ≥ 10 curated cases; (c) per-task-class `rubric.py`; (d) ADR amendment to production ADR-0028 documenting the new class's place in the introduction order. The fence CI rejects any three-of-four PR.
- Phase 5 ADR-0008 (LLM-Judge deferral) gains an un-deferral criterion without un-deferring: any future ADR introducing the Judge must reference a judgment-quality bench (`bench/judgment-arbitration/` or per-task-class judgment slices) and show `bench_score ≥ judge_promotion_threshold` over ≥ `min_cases_for_promotion[silver]` cases.
- production ADR-0015 (threshold calibration) gains a structured-evidence path without being un-deferred: the per-task-class calibration ADR-0015 calls for is now sourceable from `bench_score` distributions per class, not just raw production-migration counts.
- Roadmap implication. A "Phase 6.5" or "Phase 7-preamble" implementation slot is needed for the harness package + the first two bench directories (vuln + migration). Phase 5's architect surfaces this; the roadmap amendment is a separate task.
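The fence test described above reduces to a filesystem check over the registered task-class names. The following is a sketch under assumptions: the helper name and its return shape are hypothetical, and the list of names would in practice come from the task-class registry rather than being passed in by hand.

```python
from pathlib import Path
from typing import Dict, List

# Entries the ADR requires under bench/{name}/.
REQUIRED_ENTRIES = ("cases", "rubric.py", "registration.py")


def missing_bench_entries(bench_root: Path,
                          task_classes: List[str]) -> Dict[str, List[str]]:
    """Map each task class to the required entries it lacks (empty = pass)."""
    problems: Dict[str, List[str]] = {}
    for name in task_classes:
        base = bench_root / name
        missing = [entry for entry in REQUIRED_ENTRIES
                   if not (base / entry).exists()]
        if missing:
            problems[name] = missing
    return problems
```

A fence test would assert that the returned mapping is empty and print it on failure, so the CI log names exactly which task class is missing which part.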
Reversibility
Medium-low. Once trust-tier promotion gates on bench_score, removing the harness means falling back to either (a) sentiment-based promotion (which this ADR exists to replace), or (b) freezing all task classes at the conservative bronze tier indefinitely (which kills autonomy). Neither is acceptable. Reverting the decorator and registry mechanics is mechanically easy (delete src/codegenie/eval/, drop the fence test); reverting the commitment — that task-class introduction requires curated evidence — costs the entire post-Phase-7 trust model. The bench cases themselves are durable assets; they survive any rewrite of the harness shape.
If the per-task-class scoring model proves wrong (e.g., bench performance does not predict production outcomes), the reversal is not "remove the harness" but "redesign the rubric" — same harness, new rubric.py per task class. The contract is the harness/rubric split; the rubric implementations are evolvable.
Open questions deferred to implementation
- Bench-case provenance taxonomy. "curated" / "outcome-ledger-derived" / "regression-converted" is the minimum set. Whether to add "adversarial-synthetic" (LLM-generated cases designed to find failure modes) is deferred; this ADR is wary of adversarial cases that drift from real-world distributions.
- Tier-threshold values. This ADR commits to tiers (bronze/silver/gold/platinum) and to per-task-class thresholds; it does not commit to specific numbers. Numbers are picked when the first bench set exists and the score distribution can be inspected.
- LLM cost budget per eval run. Per-task-class evals invoke the full LLM/recipe/planner stack; cost per case ranges from cents (vuln remediation, deterministic recipe path) to dollars (recipe authoring, multi-turn). Budget caps per eval run are deferred to Phase 13's cost ledger; the harness must emit per-case cost as a `BenchScore.cost_usd` field.
- Bench-set sharing across the org. Whether `bench/` is checked into the same repo, lives in a separate `codewizard-sherpa-benches` repo, or is curated per-customer is deferred; the choice affects whether the harness has access to proprietary repo snapshots.
- Mutation testing the rubric itself. The rubric is code; it has bugs. A meta-eval that mutates `rubric.py` and asserts the bench cases catch the mutations is plausibly load-bearing but deferred (Phase 16 hardening territory).
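Although budget caps are deferred, the requirement that the harness emit per-case cost already implies a simple accounting loop. The sketch below is one possible shape, not a decided design: `run_with_cost_cap`, its `run_case` callback (assumed to return the cost in USD that would populate `BenchScore.cost_usd`), and the stop-when-cap-reached policy are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_with_cost_cap(case_ids: Iterable[str],
                      run_case: Callable[[str], float],
                      cap_usd: float) -> Tuple[Dict[str, float], List[str]]:
    """Run cases in order, recording cost per case; skip once the cap is hit.

    A case's cost is only known after it runs, so the cap can be overshot
    by at most one case.
    """
    spent = 0.0
    costs: Dict[str, float] = {}
    skipped: List[str] = []
    for case_id in case_ids:
        if spent >= cap_usd:
            skipped.append(case_id)
            continue
        cost = run_case(case_id)
        costs[case_id] = cost
        spent += cost
    return costs, skipped
```

Reporting the skipped cases matters for the promotion gate: a run that was cut short by cost should not count toward `min_cases_for_promotion` as if every case had been scored.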
Evidence / sources
- phase-arch-design.md §Gap analysis: this ADR closes the unowned gap on "judgment-quality evidence for task-class promotion"
- final-design.md §Open questions: threshold calibration's "per-task-class refinement" requirement
- Phase 5 ADR-0003: the `@register_signal_kind` pattern this ADR mirrors
- Phase 5 ADR-0008: the deferral this ADR makes evidence-shaped
- production ADR-0008: the per-PR gate contract this ADR composes with (does not replace)
- production ADR-0011: the recipe-first ordering whose effectiveness this harness measures
- production ADR-0015: the calibration ADR whose evidence shape this ADR contractualizes
- production ADR-0028: the introduction-order ADR this ADR adds a graduation gate to
- CLAUDE.md §"Determinism over probabilism for structural changes": the "safer builders, risky maintainers" empirical finding driving per-task-class evidence
- CLAUDE.md §"Honest confidence": the load-bearing commitment this harness operationalizes for trust-tier promotion
- roadmap.md §Phase 7: the first task class this ADR's implementation must precede
- roadmap.md §Phase 13: the outcome-ledger reconciliation source
- roadmap.md §Phase 15: the task class that cannot ship without this harness