Phase 6.5: Per-task-class eval harness + first benches (preamble to Phase 7)
Design pipeline: roadmap-phase-designer → phase-architect → phase-story-writer — all three complete
Status: Design pipeline complete. 36 stories ready for autonomous implementation. See stories/README.md for the manifest + dependency DAG.
Roadmap entry: ../../roadmap.md §Phase 6.5
Anchor ADR: Phase 5 ADR-0016 — per-task-class eval harness as evidence source for trust-tier promotion, threshold calibration, and LLM-Judge un-deferral.
What this phase delivers
A per-task-class eval harness — src/codegenie/eval/ package with @register_task_class registry, BenchScore Pydantic model, runner, audit chain, and read-only trust-tier promotion gate — plus the bench/{task-class}/ directory contract enforced by fence-CI, the backfilled bench/vuln-remediation/ set (≥10 curated cases), and the seed bench/migration-chainguard-distroless/ skeleton (≥3 cases) that Phase 7 expands. Offline cadence — runs nightly, never per-PR; per-PR strict-AND from production ADR-0008 is unchanged.
The phase exists because Phase 5 ADR-0016 committed to the contract but deferred the implementation, and Phase 7's introduction of a second task class (Chainguard distroless migrations) cannot land without a per-task-class evidence source that justifies tier promotion off the conservative starting tier. The non-integer phase number (6.5) keeps the change surgical — renumbering Phases 7–16 to slot a preamble in would ripple through cross-doc references for no architectural gain.
Reading order
Start with the design of record. Consult the per-lens designs and critique as audit trail / source material.
- final-design.md: the design of record. Synthesized from the three competing designs + critique using Graph-of-Thought decomposition (Besta et al., 2308.09687). Annotated with provenance per component (`[P]`, `[S]`, `[B]`, `[synth]`). Includes the full synthesis ledger: vertex/edge counts, conflict-resolution table, shared-blind-spot disposition, departures from all three inputs, exit-criteria checklist, load-bearing-commitments check, roadmap coherence check. Reading time: 30–45 min.
- critique.md: devil's-advocate attacks on each of the three competing designs, with concrete-by-name findings (components, decisions, numbers). Surfaced the load-bearing fork (rubric execution model) and three shared blind spots (canary-token cassette breakage, uncurated bench cases, undefined block-severity). Reading time: 15–20 min.
- design-performance.md: competing design under the performance-first lens. Optimizes for nightly eval wall-clock, $/run, cache hit rate on unchanged-cassette + unchanged-rubric reruns, fence-CI overhead. Key bets: content-addressed `BenchScore` cache keyed on `blake3(case || sut || rubric || cassette)`; `asyncio.Semaphore` bounded pool sized to sandbox concurrency; streaming JSONL sinks. Reading time: 15–20 min.
- design-security.md: competing design under the security-first lens. Optimizes for rubric-execution isolation, bench-case provenance integrity, audit-chain tamper detection, promotion-authorization defense in depth. Key bets: rubric in microVM; two-signature Sigstore/GPG bench-case provenance; BLAKE3-chained `.codegenie/bench/history/` with daily Sigstore-signed published anchors; six-layer defense-in-depth. Reading time: 15–20 min.
- design-best-practices.md: competing design under the best-practices-first lens. Optimizes for idiomatic Python, mirrored project patterns, mypy-strict cleanliness, minimal abstraction count. Key bets: `@register_task_class` byte-for-byte mirror of `@register_probe` and `@register_signal_kind`; `BenchScore` mirrors `ObjectiveSignals` (`frozen=True, extra="forbid"`); serial runner (simplicity over throughput); `PromotionGate` is read-only. Reading time: 15–20 min.
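The performance-first design's two key bets (a content-addressed cache and an `asyncio.Semaphore` bounded pool) can be sketched together. This is an illustration under stated assumptions, not that design's code: stdlib BLAKE2b stands in for BLAKE3, which is not in the standard library, and `cache_key`, `run_cases`, and the `score_case` callback are hypothetical names.

```python
import asyncio
import hashlib


def cache_key(case: bytes, sut: bytes, rubric: bytes, cassette: bytes) -> str:
    """Content-address one bench run from its four inputs."""
    # Stand-in hash: BLAKE2b from the stdlib, not BLAKE3.
    h = hashlib.blake2b(digest_size=32)
    for part in (case, sut, rubric, cassette):
        # Length-prefix each part so that b"ab" + b"c" and b"a" + b"bc"
        # cannot collide after concatenation.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


async def run_cases(cases, score_case, max_concurrency, cache):
    """Score (key, payload) pairs, skipping cached keys and bounding concurrency."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(key, payload):
        if key in cache:
            return cache[key]  # unchanged inputs: reuse the stored score
        async with sem:  # at most max_concurrency scorings in flight
            result = await score_case(payload)
        cache[key] = result
        return result

    # gather preserves input order in its result list.
    return await asyncio.gather(*(one(k, p) for k, p in cases))
```

A rerun with an unchanged case, SUT, rubric, and cassette produces the same key and never reaches `score_case`; changing any one of the four inputs changes the key and forces a fresh run.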
Key synthesized decisions
Highlights from final-design.md worth knowing before opening the file:
- Rubric execution model: subprocess + scrubbed env (departure from all three inputs). In-process (`[P]`/`[B]`) is RCE on operator host; full Firecracker microVM (`[S]`) forks Phase 5's sandbox stack on macOS and breaks the curator dev-loop. Subprocess via stdlib defeats credential read + arbitrary FS write + harness-state import at ~150 ms/case. Residual risk (host-level network reachable from rubric child) is documented and deferred to Phase 16.
- All three critic-flagged shared blind spots resolved in-design, not deferred:
  - Phase 4 canary-token cassette breakage → per-case `cassette_canary_pin: str` in `case.toml`; additive `seed` kwarg threaded into Phase 4's `Canary.mint(...)`. Phase 4 ADR amendment drafted as part of this phase's deliverables.
  - Bench-case curation source → `curation_class: rag-corpus-derived | held-out` with fence-CI-enforced ≥5 held-out split per task class. Memorization tests (RAG-corpus-derived) and judgment tests (held-out) are distinguished.
  - block-severity definition → per-task-class `failure_modes.yaml` taxonomy; typed `FailureMode` model with `severity: Literal["block","warn","info"]`; unknown codes resolve to `rubric.unknown_failure_mode` (block-severity).
- Promotion gate keys on `lower_bound_95` (BCa bootstrap), not `mean`; this addresses the critic's N=10 statistical-noise concern. Phase 7's hard precondition shifts to `bench_score.lower_bound_95 ≥ tier_threshold[bronze]`.
- Tier identifiers as `str` validated at startup against `docs/trust-tiers.yaml`, not `Literal[...]`; extension by addition wins over compile-time exhaustiveness.
- Bench-driven SUT invocations tagged via `CODEGENIE_BENCH_INVOCATION_TAG` env + additive `bench_invocation: bool` field on `SandboxCostEntry` (Phase 5 ADR-0010 amendment); this closes the Phase 13 cost-ledger pollution gap the critic flagged.
- Per-task-class `BreakdownKey` StrEnum + fence-CI ban on `confidence|llm|self_reported|model_says` substrings at dict-key value level, not just field names; this closes the LLM-judgment-smuggling hole the critic found in `breakdown: dict[str, float]`.
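To make the first decision in the list above concrete, here is a rough sketch of running a rubric as a child process with a scrubbed environment, assuming a POSIX host. The `run_rubric` function, the `SCRUBBED_ENV` allowlist, and the JSON-over-stdin protocol are all assumptions for illustration; final-design.md remains the authority on the real mechanism.

```python
import json
import subprocess
import sys
import tempfile

# Hypothetical allowlist: the child sees only these variables, so credentials
# exported in the operator's shell never reach rubric code.
SCRUBBED_ENV = {"PATH": "/usr/bin:/bin", "LANG": "C.UTF-8"}


def run_rubric(rubric_source: str, case: dict, timeout_s: float = 10.0) -> dict:
    """Run rubric code in a separate interpreter and parse its JSON verdict."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            # -I (isolated mode) ignores PYTHON* variables and the user site
            # directory, and keeps the working directory off sys.path.
            [sys.executable, "-I", "-c", rubric_source],
            input=json.dumps(case),  # the case arrives on the child's stdin
            capture_output=True,
            text=True,
            env=SCRUBBED_ENV,  # replaces the parent environment entirely
            cwd=workdir,  # throwaway working directory
            timeout=timeout_s,
            check=False,
        )
    if proc.returncode != 0:
        raise RuntimeError(f"rubric exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```

Because the child is a fresh interpreter, it cannot import the harness's in-memory state. This sketch does nothing about network access from the child, which matches the residual risk the design defers to Phase 16.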
ADRs implied (drafted by phase-architect in the next pass)
Seven ADRs are implied by the final design (six new Phase 6.5 ADRs plus one amendment to Phase 5 ADR-0016):
1. Subprocess-isolated rubric execution (the load-bearing fork)
2. Per-case cassette_canary_pin + Phase 4 seed kwarg amendment
3. curation_class split + held-out-set fence enforcement
4. failure_modes.yaml taxonomy + block-severity contract
5. Promotion gate keys on bootstrap lower_bound_95
6. Bench-invocation tagging on SandboxCostEntry (Phase 5 ADR-0010 amendment)
7. ADR-0016 amendment clarifying "automatic demotion" = recommendation-shift, not side-effect
phase-architect will produce phase-arch-design.md, the per-phase ADRs/ directory in Nygard format, and High-level-impl.md as the implementation-step roadmap.
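As an illustration of why ADR 5 gates on a bootstrap lower bound instead of the mean, the sketch below computes a lower bound from per-case scores. It deliberately uses the plain percentile bootstrap, which is simpler than the BCa variant the design names, and it assumes `lower_bound_95` means the lower end of a two-sided 95% interval. The function names and the threshold values are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_lower_bound(scores, alpha=0.025, n_resamples=2000, seed=0):
    """Percentile-bootstrap lower bound on the mean (a simplification of BCa)."""
    rng = random.Random(seed)  # fixed seed keeps nightly reruns reproducible
    n = len(scores)
    # Resample the cases with replacement and record each resample's mean.
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    # Take the alpha-quantile of the bootstrap means as the lower bound.
    return means[int(alpha * n_resamples)]


def recommend_promotion(scores, tier_threshold):
    """Read-only recommendation: True when the lower bound clears the threshold."""
    return bootstrap_lower_bound(scores) >= tier_threshold
```

With ten cases of which eight pass, the mean is 0.8, but the lower bound falls well below it, so `recommend_promotion([1.0] * 8 + [0.0] * 2, 0.8)` does not recommend promotion even though a mean-based gate would. That gap is the N=10 noise concern the critique raised.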
Why not SWE-bench (Verified or Pro)?
The 2026 literature has triangulated on SWE-bench Pro (Scale AI) as the cleaner public-leaderboard signal after SWE-bench Verified contamination became openly acknowledged across frontier models (../../reviews/2026-05-18-agent-orchestration-survey-and-recommendations.md row #5). Phase 6.5 deliberately does not benchmark against either:
- Task-class mismatch. SWE-bench scores arbitrary GitHub-issue resolution with free-form patches; vuln-remediation patches are a closed 4-variant `PlanProposal` sum type (Phase 4 ADR-04-0001) and migration patches will share the same constrained-output discipline. The decision surface SWE-bench measures is not the decision surface this project ships.
- Promotion-gate evidence must be portfolio-shaped, not paper-shaped. Trust-tier promotion (Phase 5 ADR-0016) keys on cases that mirror the operator's repos and the operator's vulnerability sources. A SWE-bench score would not justify promoting a vuln-rem tier for a Java-heavy operator portfolio; per-task-class `bench/{task-class}/` cases do.
- Cleanroom property. Phase 6.5's curation contract (held-out split fence-enforced, BLAKE3-chained provenance) gives a stronger contamination story than SWE-bench Pro's public-corpus posture for the metric we actually gate on.
The two are complementary, not interchangeable: external SWE-bench-style numbers may surface in phase-story-writer design-pipeline literature reviews; the promotion gate stays on per-task-class evidence.
Provenance
| Round | Agent | Output | Token usage |
|---|---|---|---|
| 1 (parallel) | Performance-first designer | design-performance.md | ~85k |
| 1 (parallel) | Security-first designer | design-security.md | ~92k |
| 1 (parallel) | Best-practices designer | design-best-practices.md | ~123k |
| 2 | Devil's-advocate critic | critique.md | ~141k |
| 3 | Graph-of-Thought synthesizer | final-design.md | ~187k |
Synthesizer's full vertex/edge/conflict counts are inside final-design.md §Synthesis ledger.