Phase 6.5 — Per-task-class eval harness + first benches: Performance-first design¶
Lens: Performance — throughput, latency, token economy, footprint. Designed by: Performance-first design subagent Date: 2026-05-12
Lens summary¶
I optimize the harness as if the headline metric were nightly eval wall-clock at portfolio scale, with the strict precondition that the harness changes nothing about per-PR CI hot paths (ADR-0016 §Decision §5 — offline cadence, not CI-per-PR). The dominant compute in an eval run is the SUT itself — Phase 6's build_vuln_loop().ainvoke(...) plus Phase 5's sandbox plus Phase 4's cassette-replayed LLM step. The harness's job is to never run the SUT when the answer is already known, and when it must run, to run as many cases in parallel as the host allows without thrashing the sandbox. Concretely: content-addressed BenchScore cache keyed on (case_digest, sut_digest, rubric_digest, cassette_digest); a bounded asyncio worker pool sized to sandbox-concurrency; lazy import of bench/{task-class}/ modules (no module-level langgraph import in the harness — pay it once, in the SUT-bound worker); audit-format that streams per-case JSON Lines as cases complete (not a single end-of-run aggregate); a single pyproject.toml [project.optional-dependencies] eval extras slot so the harness adds zero import-time cost to the main codegenie CLI hot path.
Explicit deprioritizations. I do not invest in pretty CLI output, multi-format reporters, or a dashboard. I do not invest in mutation testing the rubric (ADR-0016 §Open Questions §5 — Phase 16 territory). I do not invest in adversarial-synthetic case generation. I do not invest in cross-case statistical analysis beyond mean + min + max + count of failure_modes by tag — the calibration math is ADR-0015's concern, not the harness runner's. I do not invest in a separate audit DB; everything is filesystem and content-addressed.
Goals (concrete, measurable)¶
- Nightly eval wall-clock for
vuln-remediation(≥10 cases), cold cache, single 8-core host: ≤ 8 minutes wall-clock. Per-case p95 ≤ 60 s when SUT goes through full Phase 4 cassette-replayed LLM path. Per-case p50 ≤ 25 s when SUT hits Phase 4 query-cache tier-1. - Nightly eval wall-clock, warm cache (no SUT or rubric digest change since last run): ≤ 5 s for the same ≥10 cases — pure cache validation + manifest emit.
- Cache hit rate on unchanged-cassette + unchanged-rubric reruns: ≥ 98%. A miss only on (a) a case file mtime/content change, (b) a rubric edit, (c) a cassette re-record, (d) a recipe-set version bump. The 2% allowance is reserved for genuine SUT-source edits.
- Fence-CI added wall-clock: ≤ 2 seconds. The fence test is a directory-and-file existence check against the registry; it does not import the rubric, does not load cases, does not call
pydanticon case files. - $/eval-run target (vuln-remediation, ≥10 cases, cold cache, CI): $0.00. Cassettes only (Phase 4 discipline; roadmap §Phase 6.5 "No live LLM calls in CI"). For operator-invoked live runs: ≤ $0.40 / 10-case run under recipe-first / RAG hit assumptions from production ADR-0011; cap enforced by harness via
BenchScore.cost_usdrolling sum +--max-cost-usdflag (default $5.00) that aborts the run. - Storage growth rate (cassettes + audit + scores). Audit JSONL: ≤ 12 KB/case (one line per case, no embedded diff). Score history: append-only
.codegenie/eval/runs/<utc-iso>-<short>.json+runs.jsonlindex — ≤ 4 KB/run aggregate. Cassettes: harness adds zero new cassettes; reuses Phase 4 cassettes via SUT invocation. Retention: 90-day rolling onruns/, infinite onruns.jsonl(index only). At nightly cadence and 10 cases/class × 2 classes, that's ~7 MB/year audit + ~3 MB/year index. Negligible. - Memory per worker: ≤ 800 MB resident. Dominated by the SUT (LangGraph + Phase 5 sandbox client). Harness overhead itself: ≤ 30 MB.
- Hot vs cold cost ratio (warm cache vs cold cache): ≥ 100×. This is the load-bearing economic argument for content-addressed caching.
Architecture¶
┌──────────────────────────────────────────────┐
│ codegenie eval run --task-class=<name> │
│ (entry: src/codegenie/eval/cli.py) │
└────────────────────┬─────────────────────────┘
│
┌──────────────────────────▼──────────────────────────┐
│ Runner.plan() │
│ ─ resolve task-class from registry │
│ ─ discover cases (glob bench/{tc}/cases/*/) │
│ ─ compute (case_digest, sut_digest, │
│ rubric_digest, cassette_digest) for every case │
│ ─ probe cache for hits → emit cached BenchScores │
│ ─ build work-queue of MISS cases only │
└────────────────────┬────────────────────────────────┘
│
┌──────────────────────────────▼────────────────────────────┐
│ asyncio bounded worker pool (Semaphore, N = sandbox_cap) │
│ │
│ worker-1 ──► SUT.ainvoke(case_input) ─► rubric.score() │
│ worker-2 ──► SUT.ainvoke(case_input) ─► rubric.score() │
│ worker-N ──► SUT.ainvoke(case_input) ─► rubric.score() │
│ │ │
│ └──► BenchScore (Pydantic, frozen) │
└────────────────────┬──────────────────────────────────────┘
│
│ (streams per-case JSONL as cases finish)
▼
┌────────────────────────────────────────────────────────────────────┐
│ Sinks (run concurrently) │
│ ─ stdout JSONL stream (one BenchScore per line) │
│ ─ .codegenie/eval/runs/<utc-iso>-<short>.jsonl (audit, append) │
│ ─ .codegenie/eval/cache/<sha256>.json (cache write) │
│ ─ in-memory aggregator (mean/min/max + failure-mode tally) │
└────────────────────┬───────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ Aggregate emit: runs/<utc-iso>-<short>.json + runs.jsonl line │
│ Promotion gate (read-only): compare aggregate to tier thresholds │
│ declared in task-class registration. Reports `tier_candidate`. │
│ Does NOT mutate any tier state — promotion is human/ADR-gated │
│ per ADR-0016 §Decision §4 and roadmap §Phase 6.5 exit #5. │
└────────────────────────────────────────────────────────────────────┘
Why this shape. The runner is a pure planner over digests + an async fan-out over a bounded pool + a streaming sink. There is no batch aggregation step that blocks completion of fast cases on slow ones (streaming JSONL = a fast case at index 0 lands on disk while case 9's sandbox is still booting). The cache is the single performance win; the bounded pool is the single concurrency win; everything else is plumbing.
Components¶
src/codegenie/eval/registry.py — @register_task_class¶
- Purpose: Open registry mapping
task-class-slug→TaskClassRegistration(bench_path, rubric callable, min_cases_for_promotion floors). Mirrors@register_probe/@register_signal_kind(Phase 5 ADR-0003). - Interface:
@register_task_class(name: str, *, bench_path: Path, rubric: Callable[[HarnessOutput, Expected], BenchScore], min_cases_for_promotion: dict[Tier, int]) -> Callableget(name: str) -> TaskClassRegistration— raisesTaskClassNotRegistered.all() -> dict[str, TaskClassRegistration]— used only by fence-CI.- Collision:
TaskClassAlreadyRegisteredat import (same shape asSignalKindAlreadyRegistered). - Internal design: Module-level
dict[str, TaskClassRegistration]. Registrations are lazy-imported viaimportlib.metadataentry pointscodegenie.eval.task_classesso the maincodegenieCLI does not importlanggraph,chromadb, or any heavy SUT dep when runningcodegenie gatherorcodegenie remediate. Task-class modules import their SUT entry only inside thescore()callable, not at module top. - Tradeoffs accepted: Entry-point lazy loading adds one
importlib.metadata.entry_points()call (~5 ms once per CLI invocation) but saves ~250 ms oflanggraph/chromadbimport on non-eval commands. Worth it.
src/codegenie/eval/models.py — BenchScore + types¶
- Purpose: The Pydantic contract every rubric must return.
- Interface:
Plus
class BenchScore(BaseModel): model_config = ConfigDict(extra="forbid", frozen=True) passed: bool score: float # ∈ [0.0, 1.0], validated breakdown: dict[str, float] failure_modes: list[str] cost_usd: float # 0.0 on cache hit; SUT-reported on miss wall_clock_ms: int sut_digest: str # for audit case_id: str provenance: Literal["curated", "outcome-ledger-derived", "regression-converted"]class HarnessOutput,class Expected,class Tier(StrEnum),class TaskClassRegistration. - Internal design: Mirrors Phase 5 ADR-0014 —
extra="forbid", frozen=True.frozen=Trueis load-bearing for caching: aBenchScoreis hashable-equivalent via its JSON serialization, and the cache stores the serialized form.breakdown: dict[str, float](no nested dicts, no lists in values — same structural-smuggle prevention as ADR-0014).scorevalidator enforces0.0 ≤ score ≤ 1.0(property-tested per roadmap §Phase 6.5 testing). - Tradeoffs accepted: No richer score structure (e.g., confusion matrices, sub-rubric trees). If a rubric needs to express finer-grained evidence, it stuffs it into
breakdownwith namespaced keys ("diff/exact_match": 1.0,"diff/semantic_match": 0.8). The flat-dict constraint is a feature, not a limitation — it keeps the audit format trivially diffable.
src/codegenie/eval/runner.py — Runner¶
- Purpose: Plan + execute + aggregate one eval run for one task class.
- Interface:
Runner(task_class: str, cases_glob: str = "*", concurrency: int | None = None, max_cost_usd: float = 5.0, cache_dir: Path = Path(".codegenie/eval/cache"))async def run() -> RunAggregate— emits per-case JSONL to stdout + audit file as cases complete; returns the in-memory aggregate at the end.- Internal design:
- Plan phase (synchronous, fast). Walk
bench/{task-class}/cases/viaPath.iterdir()(notglob.glob— faster on large case sets, no shell quoting). For each case directory, readcase.yamlonce and computecase_digest = blake3(case_input_bytes || expected_bytes || case_yaml_bytes). Computesut_digestonce per run (not per case) by hashing the SUT's transitive source — concretely:blake3over (a)src/codegenie/graph/source tree, (b)recipes/content, (c) the resolvedpyproject.tomllock hash. Computerubric_digestonce per run by hashingbench/{task-class}/rubric.pysource. Computecassette_digestonce per run by hashing thetests/cassettes/subtree the SUT will consult (Phase 4 cassette dir). - Cache probe (synchronous, fast). Compute
cache_key = blake3(case_digest || sut_digest || rubric_digest || cassette_digest)per case. Lookupcache_dir/<cache_key>.json. On hit: load, validate via Pydantic, emit to stdout + audit immediately, mark case complete, do not enqueue. On miss: enqueue. - Execute phase (async).
asyncio.Semaphore(concurrency)whereconcurrencydefaults tomin(os.cpu_count(), sandbox_max_concurrent)— sandbox_max_concurrent is read from Phase 5 config (Phase 5 sandbox is the bottleneck, not CPU). For each enqueued case, spawn anasyncio.Taskthat: (a) acquires the semaphore; (b) callsawait SUT.run(case.input)— for vuln-remediation, this isawait build_vuln_loop(...).ainvoke(...)from Phase 6; (c) callsrubric.score(harness_output, case.expected)(sync; rubric is plain Python); (d) writesBenchScoreto cache; (e) emits to sinks; (f) releases semaphore. - Aggregate phase (constant memory).
RunAggregatekeepscount,mean,min,max,failure_mode_tally: Counter[str],total_cost_usd,total_wall_clock_ms. Updated incrementally as eachBenchScorelands — no list of allBenchScores held in memory (cases stream to disk; aggregate is rolling). - Cost cap. After every
BenchScoreis collected, iftotal_cost_usd > max_cost_usd, cancel all outstanding tasks viaasyncio.Task.cancel(), emitRunAggregate.aborted = True, exit non-zero. - Tradeoffs accepted:
- Streaming aggregate means no per-case median or stddev in the live output — the aggregate has
mean/min/max/countonly. If a downstream consumer (calibration ADR-0015) needs richer stats, it reads the JSONL audit file and computes them offline. The runner does not own statistics. sut_digestis computed conservatively (hashes the wholegraph/+recipes/trees). This means an unrelated edit to agraph/docstring invalidates all bench scores. The fix is human-controllable: don't commit cosmetic edits tograph/between calibration runs. I accept the false-positive invalidation rate for the simplicity of "any source change = re-eval."- The semaphore is sized for the sandbox, not the CPU. On a CI host with 4 sandbox slots, even an 8-core box runs 4 cases at once. This is correct: the sandbox is the contended resource (see Phase 5 microVM design). Oversubscribing CPU while under-subscribing sandbox would thrash.
src/codegenie/eval/cache.py — content-addressed BenchScore cache¶
- Purpose: The single largest performance win. Skip the SUT entirely when nothing semantically changed since the last run.
- Interface:
get(cache_key: str) -> BenchScore | None— synchronous, ~1 ms (onePath.exists+Path.read_bytes+BenchScore.model_validate_json).put(cache_key: str, score: BenchScore) -> None— synchronous, ~1 ms (atomic write: write to<key>.tmp,renameto<key>.json).gc(retain_days: int = 90) -> None— invoked nightly by the runner after success; deletes entries older than retain_days by mtime.- Internal design: Filesystem-backed (consistent with the rest of the codebase's "filesystem-backed everything" stance per
CLAUDE.md).cache_dir/<cache_key>.json. No subdirectory sharding for ≤10k entries — at 90-day retention × 2 task classes × 10 cases/class × 1 run/night = 1800 entries max. Flat dir is fine. SQLite would be premature. - Tradeoffs accepted:
- No process-level lock on cache writes — two concurrent eval runs of the same task class could race. The atomic-rename pattern (write
.tmpthenos.rename) makes this race safe: either run'sBenchScoreis valid; the loser's write is overwritten harmlessly. I deliberately do not file-lock — the cost of contention is one extra SUT run, which is exactly what we accept by running concurrently on purpose. - No remote/shared cache (e.g., S3, Redis). Each operator's machine has its own cache. This is correct for Phase 6.5: nightly cadence on a single CI host means the cache lives where the runner lives. Phase 13 (cost ledger) may later want a shared cache; that's its problem.
src/codegenie/eval/cli.py — codegenie eval run¶
- Purpose: The operator surface. Per roadmap §Phase 6.5 tooling:
codegenie eval run --task-class=<name> [--cases=<glob>] [--out=<path>]. - Interface:
codegenie eval run --task-class=<name> [--cases=<glob>] [--concurrency=<int>] [--max-cost-usd=<float>] [--no-cache] [--out=<path>] [--audit-dir=<path>]- Exit codes:
0on success (run completed, aggregate emitted);2on cost cap exceeded;3on registry miss;4on bench directory missing;1on any other harness-side error. SUT failures inside a case do not abort the run — they becomeBenchScore(passed=False, failure_modes=[...])and the case continues. - Internal design: A thin click wrapper. The CLI itself is < 80 LOC. Stdout is JSONL by default (one
BenchScoreper line, then a final aggregate line withkind: "aggregate"); a--format=humanflag exists but is documented as "for interactive use only, do not pipe." - Tradeoffs accepted: No
--watch, no--diff-against-previous-run, no pretty-progress. Live progress on terminals istqdm-free; the user pipes JSONL tojqif they want progress. This is a CI-first tool, not a developer-experience tool.
src/codegenie/eval/promotion.py — trust-tier promotion gate (read-only)¶
- Purpose: Implement the gate logic ADR-0016 §Decision §4 specifies, but wire it as advisory only per roadmap §Phase 6.5 exit criterion #5.
- Interface:
evaluate(aggregate: RunAggregate, registration: TaskClassRegistration, current_tier: Tier) -> PromotionVerdictPromotionVerdict(frozen):current_tier,candidate_tier,passes_threshold: bool,block_severity_failure_modes_seen: list[str],cases_evaluated: int,min_cases_for_promotion: int,reasoning: str.- Internal design: Pure function over the aggregate and registration. The function does not consult any state store, does not write any tier file, does not transition anything. It computes what tier the bench evidence would support if a human were to promote. The harness prints the verdict; an operator (or future Phase 12 / Phase 16 mechanism) acts on it.
- Tradeoffs accepted: No persistence of "tier state per task class." Per roadmap §Phase 6.5 exit #5, the promotion gate is wired but does not auto-promote. The "current tier" is read from the registration (a constant), not from a state store. This is correct: tier promotion is a human/ADR decision; storing it as code state would invite the harness to mutate it.
bench/{task-class}/cases/, rubric.py, registration.py — directory contract¶
- Purpose: Per ADR-0016 §Decision §1–3 and §Consequences — the fixed contract. The performance lens does not redesign the layout; it optimizes how the runner consumes it.
- Performance-relevant details:
- Case manifest cache. On first read of a
bench/{tc}/directory in a run, the planner builds an in-memorylist[CaseDescriptor]and reuses it. No re-walk per case. case.yamlis small and flat. Provenance metadata only (source, commit_sha, added_at, last_validated_at, disposition). The actual case input (repo snapshot, expected diff, expected CVE delta) lives in sibling files referenced by relative path. The case-digest hashes the resolved files, not the YAML alone.- Repo-snapshot cases use frozen tarballs, not git repos. A bench case's "repo" is
bench/{tc}/cases/{id}/repo.tar.zst— extracted into a tmpfs scratch dir by the worker, used by the SUT, deleted on case end. zstd over gzip: ~30% smaller, ~3× faster decompress at level 3. Tarball size cap: ≤ 5 MB compressed per case (validated by fence-CI as a courtesy, not a hard fail). registration.pyis import-cheap. It imports the registry decorator and the rubric only. The rubric module imports its own heavy deps (e.g.,unidifffor diff parsing) at function call time, not module top.
Fence-CI test extension¶
- Purpose: Per ADR-0016 §Consequences and roadmap §Phase 6.5 exit #4 — assert every registered task class has its
bench/{name}/{cases,rubric.py,registration.py}and that a missing one fails CI with a specific diagnostic. - Performance-relevant details:
- The fence test does not import
bench/{name}/rubric.py(importing it would pay rubric-deps cost on every CI run, every PR). It checks file existence viaPath.existsonly. - It does not call the registry (which would force the registration entry-point machinery on every CI run). Instead, it walks
bench/directly and asserts that everybench/{name}/registration.pyfile's first-or-second line contains the literal string@register_task_class("followed byname— a 2-line regex check. - Wall-clock budget: ≤ 2 seconds for the whole fence test. At 10 future task classes, that's 100 ms per class — fine.
- Tradeoffs accepted: The regex check on
registration.pyis brittle if a contributor reformats the decorator. The fence test's failure message tells them so:"Could not detect @register_task_class('foo') in bench/foo/registration.py — keep the decorator literal as the first decorator and the name as a string literal."We accept the brittleness for the speed.
Data flow¶
End-to-end trace of codegenie eval run --task-class=vuln-remediation:
cli.pyinvocation (~50 ms). Click parses args;codegenie evalsubcommand resolves; importsrunner; importsregistry. The registry's entry-point load fires here (~5 ms); thevuln-remediationtask class'sregistration.pyruns, registering the rubric callable. Phase 6'sbuild_vuln_loopis not imported yet — only the rubric'sscore()will pull it in lazily.Runner.plan()(~200 ms for ≥10 cases). Walkbench/vuln-remediation/cases/, read eachcase.yaml, computecase_digestper case (blake3 over file contents — ~5 ms/case for typical case sizes). Compute onesut_digest(~50 ms for fullgraph/+recipes/tree walk). Compute onerubric_digest(~2 ms). Compute onecassette_digest(~30 ms for Phase 4 cassette tree). Build per-casecache_key.- Cache probe (~10 ms, all 10 cases). Synchronous loop,
Path.existsper key, load + Pydantic-validate hits. On a warm-cache nightly rerun, all 10 hit; runner streams 10 cachedBenchScores to stdout and the audit file; aggregate is computed; promotion verdict is computed; total wall-clock < 5 s. This is the warm-cache hot path. - Worker spawn (cold cache, all 10 miss).
Semaphore(N=4)(default to sandbox max concurrent on a typical CI host). Spawn 10asyncio.Tasks; 4 run concurrently. Each acquires the semaphore, then: - Extracts
repo.tar.zstto a tmpfs scratch dir (~150 ms for a 2 MB tarball at zstd level 3). - Imports Phase 6 lazily:
from codegenie.graph.vuln_loop import build_vuln_loop(~250 ms first time per process; cached by the import system afterward). - Builds the LangGraph with a per-case SQLite checkpointer at
.codegenie/eval/scratch/<case_id>.sqlite3(per Phase 6 per-workflow checkpointer pattern). ~100 ms. await loop.ainvoke(initial_state). Phase 4 cassette-replay tier-1 hits run in ~2–5 s; tier-2 LLM-fallback cassette replay runs in ~15–45 s. The semaphore bounds how many can sandbox-evaluate concurrently — typically 2–4.- Calls
rubric.score(harness_output, expected)(sync, ~50–500 ms — diff comparisons, CVE-delta parse). - Constructs
BenchScore, writescache_dir/<cache_key>.jsonatomically, emits JSONL to stdout, appends to audit JSONL. - Aggregate. As tasks complete (in arbitrary order), the aggregator updates
mean/min/max/countincrementally. When the last task finishes, the runner emits an{"kind": "aggregate", ...}JSONL line and writes the aggregate to.codegenie/eval/runs/<utc-iso>-<short>.jsonplus a single-line entry in.codegenie/eval/runs.jsonl(the durable per-run index). - Promotion verdict (~1 ms). Pure function over the aggregate and registration; printed as a final JSONL line
{"kind": "promotion_verdict", ...}. - GC (~30 ms). Cache GC walks
cache_dir/, deletes entries withmtime < now - 90d.
Parallelism extraction points:
- Across cases: asyncio.Semaphore-bounded worker pool. Bounded by sandbox concurrency, not CPU.
- Within a case: None. A case is a SUT.ainvoke + a rubric call; both are sequential by nature (you can't score before you know the SUT output).
- Across sinks: stdout emit, audit-file append, cache write happen sequentially per case (in that order). Cross-case, they're independent — stdout writes from concurrent workers are line-buffered by asyncio + a Lock on stdout (≤ 1 μs per acquire).
Serialization points:
- The aggregator is single-tasked (one task receives BenchScores via an asyncio.Queue and updates state). This avoids needing locks on mean/min/max.
- The audit JSONL file is appended via a single writer task fed by the same queue.
Failure modes & recovery¶
| Failure | Detected by | Recovery |
|---|---|---|
| Cache file corrupt (truncated mid-write — should not happen with atomic-rename, but disk-full could cause it) | Pydantic validation fails on BenchScore.model_validate_json in cache.get |
Treat as cache miss; rerun the case; log a warning but do not abort |
SUT (build_vuln_loop().ainvoke) raises |
Worker try/except Exception around the SUT call |
Construct BenchScore(passed=False, failure_modes=["sut_raised:<exception_type>"], score=0.0, ...); case completes (does not abort run); audit captures the exception type + traceback in a sibling <cache_key>.error.txt |
| Rubric raises | Worker try/except Exception around the rubric call |
Same as above; failure_modes=["rubric_raised:<exception_type>"]. The aggregate's failure_mode_tally surfaces rubric brittleness fast |
| Sandbox boot fails (Phase 5 microVM init) | SUT raises sandbox-specific exception | Retry once per case at the harness level (not the SUT level — that's Phase 5's three-retry); if both fail, failure_modes=["sandbox_boot_failed"], case completes |
| Cost cap exceeded mid-run | Aggregator detects after a BenchScore lands |
Cancel all outstanding asyncio.Tasks; emit RunAggregate(aborted=True, ...); exit code 2. Cached results stand — they are the truth for those cases |
Cassette miss in CI (per Phase 4 VCR_BAN_NEW_CASSETTES=1) |
SUT raises Phase 4 cassette-miss exception | failure_modes=["cassette_miss:<cassette_key>"]; case fails. Operator must re-record the cassette before the next nightly run. Fence-CI separately catches new task classes added without cassettes. |
| Concurrent eval runs racing on cache | None — atomic-rename makes the race safe | No recovery needed |
| Task-class registration entry-point fails to import | Lazy import in registry.get() raises |
Exit code 3 with diagnostic "Task class '<name>' not registered. Available: [...]. Did you mean: '<closest>'?" |
bench/{name}/cases/ empty |
Runner.plan() finds zero cases |
Exit code 4 with "No cases found at bench/{name}/cases/. Did you forget to commit fixtures?" |
| Disk full during audit JSONL append | OSError from Path.open("a").write |
Abort run with exit code 1; cached scores written before the failure stand; partial JSONL is valid because it's line-delimited (any complete line is parseable) |
Resource & cost profile¶
- Tokens per nightly eval run. 0 tokens in CI — cassette replay, no live API (Phase 4
--record-mode=none). ~150k–400k tokens / 10-case operator-invoked live run depending on Phase 4 tier hit distribution (recipe-first hits cost ~5k tokens/case; RAG few-shot ~25k tokens/case; LLM-cold ~50k tokens/case). At Anthropic Sonnet 4.7 prices, that's $0.10–$0.40 per live run. - Wall-clock per case (p50 / p95).
- Cache hit: 50 ms / 200 ms.
- Cache miss, recipe-first hit: 8 s / 25 s (dominated by Phase 5 sandbox boot + build/test).
- Cache miss, RAG hit: 15 s / 40 s.
- Cache miss, LLM-cold cassette replay: 30 s / 60 s.
- Memory per worker. ≤ 800 MB resident, dominated by Phase 5 sandbox client + LangGraph state. Harness overhead per worker: ≤ 30 MB.
- Storage growth rate.
- Cache: bounded by 90-day retention × 2 task classes × ~10 cases × 1 run/night × ~6 KB/entry ≈ ~11 MB steady-state.
- Audit JSONL: 12 KB/case × 20 cases/night × 365 = ~88 MB/year, retained 90 days → ~22 MB steady-state.
- Runs index (
runs.jsonl): 1 line × ~400 bytes × 365 = ~150 KB/year, infinite retention → negligible. - Bench cases themselves (tarballs): ≤ 5 MB/case × 20 cases ≈ 100 MB committed — durable.
- Hot vs cold cost ratio. Warm-cache run: ~5 s wall-clock, $0.00. Cold-cache run: ~8 min wall-clock, $0.00 in CI / $0.40 live. Hot:cold ≈ 100× wall-clock, ≥ ∞ on cost in CI. This is the load-bearing economic argument.
Test plan¶
The performance lens commits to these tests being merge-gating:
- Unit: registry collision.
@register_task_class("dup")twice in one process raisesTaskClassAlreadyRegistered(mirror ofSignalKindAlreadyRegistered). - Unit:
BenchScoreshape.extra="forbid"rejects unknown fields;frozen=Truerejects mutation;scorevalidator rejectsscore = 1.1. Property test (hypothesis):BenchScore.score ∈ [0, 1]for all rubric outputs againstbench/vuln-remediation/cases/. - Unit: rubric determinism.
rubric.score(out, exp) == rubric.score(out, exp)byte-for-byte across 100 invocations (notime.time(), norandom, nosetiteration order). - Integration: end-to-end against
bench/vuln-remediation/.codegenie eval run --task-class=vuln-remediationexits 0; emits ≥10BenchScoreJSONL lines + 1 aggregate + 1 promotion_verdict; aggregatemean ∈ [0, 1]; audit file is valid JSONL. - Integration: cache hit-rate. Run the eval twice back-to-back, no source edits. Second run completes in ≤ 5 s wall-clock and all 10
BenchScores carrycost_usd: 0.0(cache-hit marker). Hard performance gate, merge-blocking. - Integration: cache invalidation correctness. Edit
rubric.py(whitespace-only change). Rerun. All 10 cases must re-execute (cache miss). Editbench/vuln-remediation/cases/CVE-2024-XXXX/case.yamlwhitespace. Rerun. Exactly 1 case must re-execute; the other 9 must hit cache. Hard gate. - Integration: bounded parallelism. With
--concurrency=2, peak simultaneous in-flightSUT.ainvokecalls is ≤ 2 (measured by instrumenting acontextvars.ContextVarcounter in a test-only SUT stub). With--concurrency=4, peak ≤ 4. - Integration: cost cap. With
--max-cost-usd=0.10and a stub SUT that reportscost_usd=0.05per case, the run aborts after exactly 2 cases land. Exit code 2. - Fence: missing bench directory. A synthetic
@register_task_class("ghost")with nobench/ghost/triggers the fence-CI test with a specific diagnostic naming the missing path. - Fence: wall-clock budget. The fence-CI test itself runs in ≤ 2 s on the standard CI runner (
time pytest tests/fence/test_task_class_bench_dirs.py). Regression gate. - Property: parallel cache writes. Two
asyncio.gather-spawned writers writing the samecache_keyconcurrently produce a valid file (one wins; the loser is harmless). No corrupted file ever results. Hard gate. - Performance regression: nightly wall-clock canary. A nightly CI job runs the full vuln-remediation bench cold-cache and asserts total wall-clock ≤ 10 minutes (20% headroom over the 8-minute target). Fails CI with a flame-graph artifact attached on regression.
- Performance regression: warm-cache wall-clock canary. Same as above but second run; asserts ≤ 8 seconds (60% headroom over the 5-second target).
- Property: streaming sink correctness. For every run, the count of stdout JSONL
BenchScorelines equalsaggregate.countequals the count of audit-file lines minus the trailing aggregate + promotion_verdict lines.
Risks (top 5)¶
sut_digestis too coarse. Hashing the entiregraph/+recipes/source tree means a docstring edit invalidates all bench scores, eroding the 98% cache hit-rate target. Mitigation: document the practice; consider atools/eval_digest_excludes.txtfile in a future phase if false-positives become painful. Do not over-engineer it now — the simplicity of "any source change = re-eval" is worth the false-positive cost early on.- Phase 6's per-workflow SQLite checkpointer thrashes under parallel eval. Phase 6 uses per-workflow
.codegenie/loop/checkpoints/<workflow_id>.sqlite3— fine. But the harness creates<case_id>.sqlite3in.codegenie/eval/scratch/. If the scratch dir is on a slow disk, parallel-workers SQLite contention shows up. Mitigation: scratch dir lives on tmpfs by default (/tmp/codegenie-eval-scratch/); fall back to repo-relative on systems without tmpfs. - The cache becomes the bottleneck. At 1800 entries flat-dir is fine, but if a future phase adds 100 task classes the dir blows up. Mitigation: add sharding (
cache_dir/<key[:2]>/<key>.json) when entry count crosses 10k. A 1-line change. Not now. - Streaming JSONL stdout is fragile. A consumer pipe (
| jq) that closes early (SIGPIPE) crashes the runner. Mitigation: suppressBrokenPipeErrorin the stdout writer; the audit file is the source of truth. Documented as "stdout is a courtesy, the audit file is the contract." - The 98% cache hit-rate assumes nightly cadence on a stable SUT. If the team is iterating on
graph/orrecipes/during development, every iteration invalidates the cache and the runner is no faster than a cold-cache run. Mitigation: acknowledged. The cache is a scheduled-cadence win, not a developer-loop win. For dev-loop, use--cases=<single-case-glob>and accept the cold cost on a single case.
Acknowledged blind spots¶
- Bench-case curation throughput. The harness runs whatever cases are in
bench/. Curating ≥10 high-quality cases is weeks of work per ADR-0016's tradeoff table; the performance lens has nothing to add to that cost. - Rubric quality. A fast rubric is worthless if the rubric is wrong. Mutation testing the rubric is Phase 16 territory per ADR-0016 §Open Questions §5. I do not invest here.
- Adversarial-synthetic cases. ADR-0016 §Open Questions §1 defers; I defer too.
- LLM Judge integration. Phase 5 ADR-0008 defers; this harness's evidence shape un-defers it cleanly. I do not pre-build hooks for the Judge; when it lands, it registers as its own task-class via
@register_task_class("judgment-arbitration")— extension by addition. - Cross-host cache sharing. Each operator's machine has its own cache. Fine for nightly single-host CI; will be revisited if/when Phase 13 cost-ledger needs a shared cache layer.
- Pretty CLI output. No
tqdm, no spinners, no colors. JSONL or bust. - Per-case statistical analysis (median, stddev, quantiles). The aggregate exposes
mean/min/max/count. Calibration (ADR-0015) reads the JSONL and computes the rest offline. I do not embed a statistics library in the runner. - The promotion gate is read-only. It computes verdicts but does not store tier state. A future phase that wants automated promotion has to add a tier-state store; this design does not pre-build it.
Open questions for the synthesizer¶
sut_digestgranularity. Hashing the wholegraph/+recipes/tree (my choice) vs. an explicit allowlist of files the rubric declares it cares about. The former is simple and over-invalidates; the latter is precise and under-invalidates on hidden coupling. I picked the former. If the best-practices lens wants per-task-class declared SUT footprints, surface that — it is a real fork.- Sandbox concurrency knob. The runner reads
sandbox_max_concurrentfrom Phase 5 config. Is there a documented Phase 5 config key for this? If not, this design effectively requires one — either I declare a default (e.g.,4) and the operator overrides via--concurrency, or Phase 5 ships the config first. Coordinate with the security lens, which may want it lower. - Cache scope at task-class introduction. When a brand-new task class lands (e.g.,
migration-chainguard-distrolessin Phase 6.5's seed-3-cases step), should the harness require zero cache hits on its first run, or does normal cache-key derivation already cover the "fresh registration" case? My answer: the latter (cache key includescase_digest, which is novel for new cases — automatic miss). Confirm. - Live-LLM eval cadence outside CI. ADR-0016 §Decision §5 says nightly. CI runs cassettes. When does the live eval (against real Anthropic API) run? Once per cassette re-record? Once per recipe-set release? This is a calibration-cadence question that touches my
--max-cost-usddefault; I picked $5.00 conservatively. The synthesizer should pick a cadence and tune the default. - Promotion verdict consumers. I emit
PromotionVerdictas the final JSONL line, but no Phase 6.5 component reads it. ADR-0016 says promotion is an explicit, ADR-anchored decision. Is there a future-phase consumer (Phase 12 / Phase 16) that will readruns.jsonl+ the latest verdict and surface "this task class is eligible for promotion" to operators? If yes, document the contract forruns.jsonlhere so it survives. If no, the verdict is documentation-only — that's fine, but call it out. - Bench tarball format. I picked
repo.tar.zstlevel 3 for 3× decompress speed over gzip. The security lens may want a signed archive; the best-practices lens may want plain git-archive for diffability. The performance answer is zstd; surface if the others fork.