Story S3-01 — Runner plan phase: load + digest + cache-key compute
Step: Step 3 — Implement the runner: asyncio fan-out, subprocess rubric, aggregator with BCa bootstrap
Status: Ready
Effort: M
Depends on: S2-02 (loader + digests), S2-04 (audit chain extension)
ADRs honored: ADR-0001 (subprocess isolation — informs the digest set), ADR-0002 (lower_bound_95 — informs run_id derivation), ADR-0010 (isolation_class annotation — RunPlan carries it forward to the report)
Context
The runner's first phase is pure planning: load the task class, verify the audit chain at startup, compute the three deterministic digests (sut_digest, rubric_digest, cassette_corpus_digest), derive per-case cache_keys, and abort before any SUT invocation if anything is wrong. This is the load-bearing gate that prevents poisoned cases or a tampered chain from contaminating a run.
The plan output is a plain dataclass consumed by S3-02's fan-out. Plan is pure — no SUT calls, no rubric subprocess, no cache writes. Plan's invariant is "abort early, abort loud" (CLAUDE.md "Fail loud"): if the chain is tampered or any case digest mismatches, no new state is created and exit code 5 (tamper) or 6 (digest mismatch) propagates to the CLI.
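As a rough illustration of how the exit codes named in this story could reach the CLI, the sketch below maps the three plan-phase exceptions to codes 3, 5, and 6. The exception classes are local stand-ins for the ones the story places in `codegenie.eval.errors`; `exit_code_for`, `run_cli`, and the fallback code `1` are hypothetical names and values, not part of the source design.

```python
import sys


# Stand-ins for the exception types this story names; the real ones are
# expected to live in codegenie.eval.errors.
class TaskClassNotFound(Exception): ...
class ChainTamperDetected(Exception): ...
class BenchCaseDigestMismatch(Exception): ...


# Exit codes as stated in this story:
# 3 (unknown task class), 5 (chain tamper), 6 (bench-case digest mismatch).
EXIT_CODES: dict[type[Exception], int] = {
    TaskClassNotFound: 3,
    ChainTamperDetected: 5,
    BenchCaseDigestMismatch: 6,
}


def exit_code_for(exc: Exception) -> int:
    """Return the CLI exit code for a plan-phase failure (hypothetical helper)."""
    for exc_type, code in EXIT_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return 1  # assumption: generic failure code for anything unmapped


def run_cli(plan_fn) -> int:
    """Call plan_fn and translate plan-phase exceptions into an exit code."""
    try:
        plan_fn()
    except tuple(EXIT_CODES) as exc:
        # Fail loud: report the abort reason before returning the code.
        print(f"plan aborted: {type(exc).__name__}: {exc}", file=sys.stderr)
        return exit_code_for(exc)
    return 0
```

Because the exceptions propagate unchanged out of the plan phase, the mapping can live entirely at the CLI boundary and the runner itself never needs to know about exit codes.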
References — where to look
- Architecture:
    - ../phase-arch-design.md §Process view: the six-phase pipeline diagram; this story owns phase 1 (plan) and the integrity check at startup.
    - ../phase-arch-design.md §Components → runner.py: `run_eval` six-phase internal structure; this story is phase 1 only.
    - ../phase-arch-design.md §Control flow → Happy path (cold cache, vuln-remediation, 10 cases): sequencing of `audit.verify` → `loader.load_task_class` → `loader.load_cases` → digest computation → per-case `cache_key`.
    - ../phase-arch-design.md §Control flow → Decision points #5 (audit chain tamper) and #6 (bench-case digest mismatch): both abort before any SUT invocation; exit codes 5 and 6.
    - ../phase-arch-design.md §Edge cases #11: tamper detected at startup → exit code 5; no new record written.
    - ../phase-arch-design.md §Components → cache.py: the `cache_key` composition rule, `BLAKE3(case_digest || sut_digest || rubric_digest || cassette_corpus_digest || harness_version || cassette_canary_pin)`.
- Phase ADRs:
    - ../ADRs/0001-rubric-execution-isolation-via-subprocess.md: the rubric subprocess shape is what makes `rubric_digest` a single artifact (`rubric.py` + `breakdown_keys.py` + `failure_modes.yaml`).
    - ../ADRs/0002-promotion-gate-keys-on-lower-bound-95.md: `run_id` derives a deterministic bootstrap seed (`int(run_id[:8], 16)`); plan must produce a content-addressed `run_id` for that downstream contract to hold.
    - ../ADRs/0010-isolation-class-annotation-on-bench-run-report.md: `isolation_class` is a report field, not a `cache_key` component; plan documents this distinction explicitly.
- Production ADRs:
    - ../../../production/adrs/0009-humans-always-merge.md: the audit chain is a hard gate; tamper aborts before any new state.
- Source design:
    - ../final-design.md §Components → runner.py, §Synthesis ledger row "Concurrency knob source".
Goal
Land Runner.plan(task_class_name, *, sut_digest_fn, bench_root, out_dir, run_started_iso, cassette_root, harness_version) -> RunPlan that resolves the task class, verifies the audit chain, computes the three digests, derives per-case cache keys, and aborts on digest mismatch or chain tamper before any SUT call.
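The abort-before-anything-else behaviour described in the goal can be illustrated with a stripped-down sketch in which every collaborator is an injected callable. `plan_order`, `VerifyResult`, and the callable parameters are hypothetical names used only for this illustration; the real `Runner.plan` signature is the one given above.

```python
from typing import Callable, NamedTuple


class ChainTamperDetected(Exception):
    """Stand-in for the error type the story places in codegenie.eval.errors."""


class VerifyResult(NamedTuple):
    # Mirrors the two attributes the story reads from audit.verify(...).
    ok: bool
    head: str


def plan_order(
    verify: Callable[[], VerifyResult],
    load_task_class: Callable[[], str],
    load_cases: Callable[[str], list[str]],
    compute_digests: Callable[[list[str]], dict[str, str]],
) -> dict[str, object]:
    """Run the planning steps in the order the story requires."""
    chain = verify()
    if not chain.ok:
        # Abort before any load: nothing else runs on a tampered chain.
        raise ChainTamperDetected("audit chain failed verification")
    task_class = load_task_class()    # may raise TaskClassNotFound
    cases = load_cases(task_class)    # may raise BenchCaseDigestMismatch
    digests = compute_digests(cases)  # only reached when both loads succeed
    return {
        "prev_chain_head": chain.head,
        "task_class": task_class,
        "cases": cases,
        **digests,
    }
```

Since each step is a plain call in sequence, an exception raised by an earlier step naturally prevents every later step from running, which is the property the tamper test in the TDD plan checks.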
Acceptance criteria
- [ ] `RunPlan` is a `@dataclass(frozen=True, slots=True)` in `src/codegenie/eval/runner.py` carrying: `task_class: TaskClass`, `cases: tuple[BenchCase, ...]` (sorted by `case_id`), `sut_digest: str`, `rubric_digest: str`, `cassette_corpus_digest: str`, `harness_version: str`, `run_id: str`, `prev_chain_head: str`, `cache_keys: Mapping[str, str]`, `isolation_class: Literal["subprocess"] = "subprocess"`.
- [ ] `run_id` is content-addressed: `BLAKE3(task_class || sut_digest || rubric_digest || cassette_corpus_digest || run_started_iso)[:16]` (the 32-byte digest, 64 hex chars, truncated to 16 hex chars; enough entropy for the bootstrap seed and the short per-run filename).
- [ ] Two `plan(...)` calls with identical inputs produce a byte-identical `RunPlan` (asserted via `dataclasses.asdict` + JSON-serialize comparison); this is load-bearing for ADR-0002's deterministic bootstrap seed.
- [ ] Abort order is enforced: `audit.verify(out_dir).ok is False` raises `ChainTamperDetected` before `loader.load_task_class` runs; `BenchCaseDigestMismatch` from `loader.load_cases` propagates before any digest computation.
- [ ] `rubric_digest = BLAKE3(rubric.py || breakdown_keys.py || failure_modes.yaml)` (files concatenated in sorted-by-name order); a one-byte edit to any of the three flips the digest and invalidates every per-case `cache_key` (asserted by a property test).
- [ ] Per-case `cache_key = BLAKE3(case_digest || sut_digest || rubric_digest || cassette_corpus_digest || harness_version || cassette_canary_pin)`, matching S2-03's composition rule exactly (asserted via shared helper test).
- [ ] `TaskClassNotFound(name, available_names)` is raised when `loader.load_task_class` finds no registered task class with that name (exit code 3).
- [ ] `isolation_class` is not part of `cache_key`; this is documented in a code comment citing ADR-0010 §"`isolation_class` is structural, not a runtime measurement", and a regression test asserts the comment exists.
- [ ] `plan(...)` performs no SUT call, no rubric subprocess, no cache `put`, no audit `write_run_record`; asserted by a test using `unittest.mock.patch` to fail loudly if any is called.
- [ ] `mypy --strict`, `ruff format --check`, `ruff check` clean on touched files.
- [ ] All red tests in §TDD plan exist, were committed at the red marker, and are now green.
Implementation outline
- Add `RunPlan` `@dataclass(frozen=True, slots=True)` to `src/codegenie/eval/runner.py`.
- Implement `Runner.plan(...)` with this exact order (abort early at each step):
    1. `chain_result = audit.verify(out_dir)`; if `not chain_result.ok` → raise `ChainTamperDetected(...)` (no other work performed).
    2. `prev_chain_head = chain_result.head` (genesis = `"0" * 64` per S2-04).
    3. `task_class = loader.load_task_class(name, bench_root)` (raises `TaskClassNotFound` on miss).
    4. `cases = loader.load_cases(task_class)` (raises `BenchCaseDigestMismatch`, `BenchCaseIDCollision` on inconsistencies).
    5. `sut_digest = sut_digest_fn()` (injected; Phase 6 supplies a digest provider in production; tests stub).
    6. `rubric_digest = BLAKE3(concat(rubric.py, breakdown_keys.py, failure_modes.yaml))`, with an explicit sorted-by-filename concat (`breakdown_keys.py`, `failure_modes.yaml`, `rubric.py`) to keep the digest deterministic.
    7. `cassette_corpus_digest = codegenie.hashing.blake3_tree(cassette_root)`.
    8. `run_id = BLAKE3(task_class || sut_digest || rubric_digest || cassette_corpus_digest || run_started_iso)[:16]`.
    9. `cache_keys = {case.case_id: BLAKE3(case.case_digest || sut_digest || rubric_digest || cassette_corpus_digest || harness_version || case.cassette_canary_pin) for case in cases}`; extract the composition into the `compose_cache_key(...)` module-level helper shared with S2-03.
- Inject all I/O collaborators as parameters (`sut_digest_fn: Callable[[], str]`, `bench_root: Path`, `out_dir: Path`, `cassette_root: Path`, `harness_version: str`); no module-level globals.
- Log a single `INFO` record at `plan_complete` with `{run_id, n_cases, prev_chain_head[:16], sut_digest[:16], rubric_digest[:16]}` via `structlog`. No per-case logs in `plan(...)` (S3-02 owns those).
- Document the abort-order invariant in the function docstring; cite the three exit codes (3/5/6) it can produce via raised exceptions.
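The digest steps in the outline could be sketched roughly as follows. This is an illustration under stated assumptions: the story writes `||` without saying whether fields are delimited, so plain byte concatenation is assumed; `digest_concat`, `compute_rubric_digest`, and `derive_run_id` are hypothetical names; and because BLAKE3 is a third-party package, the sketch falls back to `hashlib.blake2b` as a stand-in so it runs without that dependency (the fallback does not produce real BLAKE3 digests).

```python
from __future__ import annotations

import hashlib
from pathlib import Path

try:
    from blake3 import blake3 as _hasher  # third-party package, if installed
except ImportError:
    # Stand-in so the sketch runs without the dependency; NOT real BLAKE3.
    def _hasher(data: bytes = b""):
        return hashlib.blake2b(data, digest_size=32)


RUBRIC_FILES = ("rubric.py", "breakdown_keys.py", "failure_modes.yaml")


def digest_concat(*parts: str | bytes) -> str:
    """Hash the parts in the given order (assumed: plain concatenation)."""
    h = _hasher()
    for part in parts:
        h.update(part.encode("utf-8") if isinstance(part, str) else part)
    return h.hexdigest()


def compute_rubric_digest(task_class_dir: Path) -> str:
    """Digest the three rubric files, concatenated in sorted-by-name order."""
    return digest_concat(
        *((task_class_dir / name).read_bytes() for name in sorted(RUBRIC_FILES))
    )


def derive_run_id(task_class: str, sut_digest: str, rubric_digest: str,
                  cassette_corpus_digest: str, run_started_iso: str) -> str:
    """Content-addressed run id: first 16 hex chars of the digest."""
    return digest_concat(task_class, sut_digest, rubric_digest,
                         cassette_corpus_digest, run_started_iso)[:16]


def compose_cache_key(*, case_digest: str, sut_digest: str, rubric_digest: str,
                      cassette_corpus_digest: str, harness_version: str,
                      cassette_canary_pin: str) -> str:
    """Per-case cache key in the field order the story specifies."""
    # isolation_class is intentionally absent: it is a report field (ADR-0010).
    return digest_concat(case_digest, sut_digest, rubric_digest,
                         cassette_corpus_digest, harness_version,
                         cassette_canary_pin)
```

Keyword-only arguments on `compose_cache_key` make it harder to swap two digests by accident, which matters because every field here is an opaque string of similar shape.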
TDD plan — red / green / refactor
Red — write failing tests first
Test file path: tests/unit/test_runner_plan.py
```python
import dataclasses
import json

import pytest

from codegenie.eval.errors import (
    ChainTamperDetected, BenchCaseDigestMismatch, TaskClassNotFound,
)
from codegenie.eval.runner import Runner, RunPlan
from tests.helpers.chain import seed_clean_chain, tamper_last_record
from tests.helpers.bench import stub_task_class_fixture


def test_plan_aborts_on_chain_tamper_before_any_load(tmp_path, monkeypatch):
    """Tamper is the most severe finding; surface it before any other error."""
    out_dir = tmp_path / ".codegenie" / "eval"
    seed_clean_chain(out_dir, n=2)
    tamper_last_record(out_dir)

    load_calls = []
    monkeypatch.setattr(
        "codegenie.eval.loader.load_task_class",
        lambda *a, **kw: load_calls.append(("load_task_class", a, kw)) or pytest.fail("must not load"),
    )

    runner = Runner()
    with pytest.raises(ChainTamperDetected) as exc:
        runner.plan(
            task_class_name="stub-task-class",
            sut_digest_fn=lambda: "stub_sut_digest_0000000000000000",
            bench_root=stub_task_class_fixture(),
            out_dir=out_dir,
            run_started_iso="2026-05-12T00:00:00Z",
            cassette_root=tmp_path / "cassettes",
            harness_version="0.6.5",
        )
    assert exc.value.file_path.name.startswith("2")  # ISO-prefixed file
    assert load_calls == []  # load_task_class never called


def test_plan_is_byte_identical_across_two_calls_with_identical_inputs():
    """Required for ADR-0002 deterministic bootstrap seed."""
    plan_a = Runner().plan(**_stable_plan_args())
    plan_b = Runner().plan(**_stable_plan_args())
    assert json.dumps(dataclasses.asdict(plan_a), default=str, sort_keys=True) == \
        json.dumps(dataclasses.asdict(plan_b), default=str, sort_keys=True)


def test_plan_run_id_is_16_hex_chars_of_blake3():
    plan = Runner().plan(**_stable_plan_args())
    assert len(plan.run_id) == 16
    assert all(c in "0123456789abcdef" for c in plan.run_id)


def test_plan_rubric_digest_flips_on_one_byte_edit_to_breakdown_keys(tmp_path):
    """Any edit to rubric.py / breakdown_keys.py / failure_modes.yaml invalidates the digest."""
    args = _stable_plan_args(bench_root=tmp_path)
    digest_before = Runner().plan(**args).rubric_digest
    breakdown = (tmp_path / "stub-task-class" / "breakdown_keys.py")
    breakdown.write_text(breakdown.read_text() + "\n# touch\n")
    digest_after = Runner().plan(**args).rubric_digest
    assert digest_before != digest_after


def test_plan_cache_keys_match_s2_03_composition():
    """The cache_key formula must match S2-03's BLAKE3 composition exactly."""
    from codegenie.eval.cache import compose_cache_key  # shared helper

    plan = Runner().plan(**_stable_plan_args())
    for case in plan.cases:
        expected = compose_cache_key(
            case_digest=case.case_digest,
            sut_digest=plan.sut_digest,
            rubric_digest=plan.rubric_digest,
            cassette_corpus_digest=plan.cassette_corpus_digest,
            harness_version=plan.harness_version,
            cassette_canary_pin=case.cassette_canary_pin,
        )
        assert plan.cache_keys[case.case_id] == expected


def test_plan_does_not_invoke_sut_or_cache_or_audit_write(monkeypatch):
    """Plan is pure. SUT, cache, and audit-write are S3-02/S3-03/S3-06 responsibilities."""
    monkeypatch.setattr("codegenie.eval.cache.put", lambda *a, **kw: pytest.fail("cache.put must not be called"))
    monkeypatch.setattr("codegenie.eval.audit.write_run_record", lambda *a, **kw: pytest.fail("audit.write must not be called"))
    plan = Runner().plan(**_stable_plan_args())
    assert plan.run_id  # smoke
```
Run all six tests; confirm five fail with ImportError or AttributeError and one (the helper test) fails with ModuleNotFoundError. Commit as the red marker.
Green — make them pass
Smallest implementation: RunPlan dataclass + synchronous Runner.plan(...) doing the nine-step order above. Inject all I/O. Extract compose_cache_key(...) to codegenie/eval/cache.py (S2-03 will already host it; this story confirms the import).
Refactor — clean up
- Docstring on `plan(...)` enumerates the abort order: verify → load_task_class → load_cases → digest_sut → digest_rubric → digest_cassettes → derive run_id → derive cache_keys.
- Module-level comment near the `cache_keys` computation cites ADR-0010 §"the value is structural, not a runtime measurement" and explains why `isolation_class` is on the report, not the cache key.
- Bind `run_id` to the logging context once in `plan(...)`, as soon as it is derived (for example with `structlog.contextvars.bind_contextvars(run_id=...)`); the bound context propagates through S3-02's workers when they call `log.info(...)`.
- `harness_version` defaults to `codegenie.__version__` if not injected, but the tests always pass it explicitly so they stay immune to package-version bumps.
Files to touch

| Path | Why |
|---|---|
| `src/codegenie/eval/runner.py` | New module: `RunPlan` dataclass + `Runner.plan` |
| `tests/unit/test_runner_plan.py` | New tests: tamper-abort, digest determinism, missing-task-class, cache-key composition, purity |
| `tests/helpers/chain.py` | Helpers `seed_clean_chain`, `tamper_last_record` shared with S2-04 tests |
| `tests/helpers/bench.py` | Helper `stub_task_class_fixture()` returning the path from S2-02's fixture tree |
| `src/codegenie/eval/cache.py` | (May already exist from S2-03.) Confirm `compose_cache_key(...)` is importable |
Out of scope
- Actual fan-out, cache probe, SUT invocation — S3-02.
- Subprocess rubric invocation — S3-03.
- Bootstrap / cost cap / partial reports — S3-05, S3-06.
- The genesis-record case (`prev_hash == "0" * 64`): owned by S2-04; this story just calls into the verified API.
Notes for the implementer
- Abort order is load-bearing. Severity ranks tamper > digest mismatch > unknown task class (the checks themselves run as verify → load_task_class → load_cases). A poisoned chain is the only failure where the rest of the run is meaningless; the others are partial-recovery surfaces (the curator re-curates one case and reruns, but cannot recover from a tampered chain by a partial re-run).
- The `sut_digest_fn` callable is the seam: Phase 6's `build_vuln_loop` will inject a digest provider; tests inject a constant. Do not couple the runner to Phase 6 directly: the runner has no `from codegenie.engines.vuln_loop import ...` import.
- `harness_version` is `codegenie.__version__`, not the git SHA. The git SHA can change within a single release; the package version is the stable contract.
- Do not compute `chain_head` here; that is the post-run append's output (S3-06 owns the write). `prev_chain_head` is the input.
- `isolation_class` lives on the report, not the cache_key. Including it in the cache key would invalidate the cache on every Phase 16 microVM rollout (wrong cardinality). ADR-0010 says the field is "structural foresight"; the cache key cares about the bytes of the rubric, not the process model used to run it.
- Resist the urge to call the cache here. Plan is pure; cache probe is S3-02's responsibility. The test that fails loudly on `cache.put` / `audit.write_run_record` is your guardrail.
- The `[:16]` truncation that produces `run_id` matters for ADR-0002 (`int(run_id[:8], 16)` → bootstrap seed). 16 hex chars = 64 bits, which is plenty of entropy for the seed and the short audit filename.
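To make the seed contract concrete, the sketch below derives the bootstrap seed as `int(run_id[:8], 16)` and uses it to draw reproducible resample indices. `bootstrap_seed` and `resample_indices` are hypothetical helper names, and the use of `random.Random` is an assumption for illustration; the actual resampling belongs to the aggregator stories, not this one.

```python
import random

HEX_DIGITS = set("0123456789abcdef")


def bootstrap_seed(run_id: str) -> int:
    """Derive the seed as int(run_id[:8], 16), the expression cited from ADR-0002."""
    if len(run_id) != 16 or not set(run_id) <= HEX_DIGITS:
        raise ValueError(f"run_id must be 16 lowercase hex chars, got {run_id!r}")
    # Only the first 8 hex chars (32 bits) feed the seed.
    return int(run_id[:8], 16)


def resample_indices(run_id: str, n_cases: int, n_resamples: int) -> list[list[int]]:
    """Draw with-replacement index samples that depend only on run_id."""
    rng = random.Random(bootstrap_seed(run_id))
    return [
        [rng.randrange(n_cases) for _ in range(n_cases)]
        for _ in range(n_resamples)
    ]
```

Two runs with the same content-addressed `run_id` therefore draw identical resamples, which is why the plan phase has to produce that identifier deterministically.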