
Phase 6.5 — Per-task-class eval harness + first benches: Devil's-advocate critique

Reviewed by: Devil's-advocate critic subagent
Date: 2026-05-12

Method

I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.

Attacks on the performance-first design

Concrete problems

  1. Problem: sut_digest hashes the entire src/codegenie/graph/ and recipes/ source trees plus pyproject.toml lock hash. The design admits this over-invalidates on docstring edits, then waves it off as "a feature." But the design also commits to a ≥ 98% cache hit rate as a hard goal and a merge-blocking integration test (test #5 / #6) that the warm-cache rerun completes in ≤ 5 s. Any normal Phase 7+ development (a contributor edits a graph/ module to add a Phase 7 hook) invalidates 100% of cache entries while the test gate still demands 98% hits. The two commitments are incompatible. The "warm-cache canary" (test #13) is then either deleted in week 2 of Phase 7 or it becomes a perpetual flake. Why it matters: A merge-gating performance test that can only pass when the codebase is frozen is worse than no test — it teaches the team to disable it. Where: design-performance.md Components → Runner.plan() step 1 ("compute one sut_digest once per run … hashing the whole graph/ + recipes/ trees"); Goals ("Cache hit rate on unchanged-cassette + unchanged-rubric reruns: ≥ 98%"); Test plan #6 + #13 ("Hard gate, merge-blocking"); Risk #1 acknowledges the tension but does not resolve it.

  2. Problem: The runner explicitly imports Phase 6's build_vuln_loop inside the worker, which the design says costs "~250 ms first time per process; cached by the import system afterward." That is true within one process, but the design also defaults concurrency = min(os.cpu_count(), sandbox_max_concurrent) and uses asyncio.Tasks (one process). Fine — until the design's own Risk #2 says the SQLite checkpointer thrashes under parallel eval, then "scratch dir lives on tmpfs." Phase 6's final-design.md ships a per-workflow SQLite checkpointer with integrity verification on every step (AuditedSqliteSaver), and the design here glosses over how that checkpointer behaves when N concurrent ainvoke calls touch sibling sqlite files in the same process. The integrity-verification path takes locks; the perf design has not measured it. Why it matters: The 8-minute cold-cache wall-clock target is computed assuming the SUT cost is dominated by sandbox boot, not by checkpointer contention. If Phase 6's audited checkpointer serializes step writes across workers (it might — neither design checked), the perf design's parallelism extracts nothing. Where: design-performance.md Components → Runner.plan() step 4 ("Builds the LangGraph with a per-case SQLite checkpointer"); Risks #2; Phase 6 final-design.md §"AuditedSqliteSaver".

  3. Problem: The cassette_digest is computed once per run by hashing "the tests/cassettes/ subtree the SUT will consult." The design does not say how the runner knows which subtree the SUT will consult. For vuln-remediation today there is one cassette dir; for migration-chainguard-distroless (Phase 7) there will be another; for any future task class, more. Hashing tests/cassettes/ whole means a Phase 7 cassette re-record invalidates every vuln-remediation cache entry — for no semantic reason. Hashing only the sub-subtree requires the runner to know cassette routing, which is Phase 4 internals the harness explicitly does not import. Why it matters: Either the cache invalidation false-positive rate explodes once Phase 7 lands (every cassette change in any task class invalidates every score) or the harness has to import Phase 4's cassette-routing knowledge, which violates the design's own "harness has zero import-time cost on the main CLI" stance. Where: design-performance.md Components → Runner.plan() step 1 ("Compute one cassette_digest once per run by hashing the tests/cassettes/ subtree the SUT will consult"). The design never specifies the routing.

  4. Problem: "No process-level lock on cache writes — atomic-rename pattern makes the race safe." Atomic rename makes the file write race-safe; it does not make the cost accounting race-safe. Two concurrent eval runs of the same task class each pay full SUT cost, both write a BenchScore with cost_usd = 0.07, the loser's write is overwritten — but both LLM bills already cleared. The design's $5.00 default --max-cost-usd cap is enforced within a run, not across concurrent runs. Run two concurrent live-mode evals and you can blow the cap by 2× silently. Why it matters: The headline cost claim ("≤ $0.40 / 10-case live run … cap enforced by harness") is not robust to concurrent invocation, which the design also enables by not file-locking. Where: design-performance.md Components → cache.py Tradeoffs ("I deliberately do not file-lock"); Goals ("$/eval-run target … cap enforced by harness via BenchScore.cost_usd rolling sum + --max-cost-usd flag").

  5. Problem: bench_score cases ship as repo.tar.zst extracted into tmpfs scratch per case, and the design declares ≤ 5 MB compressed cap. But the SUT for vuln-remediation is build_vuln_loop().ainvoke(...), which invokes Phase 5's microVM sandbox. Phase 5's sandbox needs the repo inside the microVM, not in tmpfs on the host. The design does not describe the host-tmpfs → microVM boundary. Either the worker re-tars the extracted repo to push it into the microVM (paying the extraction cost twice) or the microVM mounts host tmpfs (which violates Phase 5's isolation invariants). Why it matters: The 8-minute target depends on tarball extract being ~150 ms; if that work has to be done twice (or replaced by a microVM-image build per case), the per-case wall-clock dominates the budget. Where: design-performance.md Components → bench/{task-class}/cases/ ("Repo-snapshot cases use frozen tarballs … extracted into a tmpfs scratch dir by the worker"); the design does not connect this to Phase 5's sandbox boundary.
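To make Problem #1 concrete, here is a minimal sketch of a whole-tree digest, assuming the digest is a hash over every file's relative path and bytes. The function name tree_digest, the file layout, and the use of stdlib blake2b as a stand-in for BLAKE3 are illustrative assumptions, not the design's actual code. It shows only that a docstring-only edit changes the digest and would therefore invalidate every cache entry keyed on it.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash every file under root (relative path + bytes) in sorted order."""
    h = hashlib.blake2b(digest_size=16)  # stdlib stand-in; the design names BLAKE3
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def demo() -> tuple[str, str]:
    """Return the digest before and after a docstring-only edit."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        module = root / "graph" / "nodes.py"  # hypothetical module path
        module.parent.mkdir()
        module.write_text("def plan():\n    return 1\n")
        before = tree_digest(root)
        # Behaviour is unchanged by this edit; the file bytes are not.
        module.write_text('def plan():\n    """Plan."""\n    return 1\n')
        after = tree_digest(root)
    return before, after


before, after = demo()
print("digest changed:", before != after)
```

Under these assumptions every BenchScore keyed on the digest misses after such an edit, which is the conflict with the ≥ 98% hit-rate gate described in Problem #1.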

Hidden assumptions

  1. Assumption: Phase 5 ships a documented sandbox_max_concurrent configuration knob the runner can read. What breaks if it's wrong: The runner's concurrency default is undefined; the operator either picks a number (likely too high → sandbox thrashing) or the design needs to land its own config key, which then becomes load-bearing forever. The design itself flags this as Open Question #2 but builds the entire performance argument on the knob existing.
  2. Assumption: The Phase 4 cassette layer is keyed in a way the harness can fingerprint per-case without reaching into Phase 4 internals. What breaks if it's wrong: See Problem #3 — cassette_digest either over-invalidates or requires the harness to import Phase 4's routing.
  3. Assumption: Operators run nightly evals on a single host with stable filesystem mtimes. What breaks if it's wrong: The 90-day mtime-based GC sweeps live cache entries on a host whose clock is wrong or whose tmpfs gets remounted. The design has no defensive mtime check against future-dated files.

Things this design missed that a different lens caught

  • Rubric isolation. The perf design imports bench/{task-class}/rubric.py directly and runs it in-process; its only acknowledgment of safety is "the rubric module imports its own heavy deps … at function call time." The security design correctly identifies the rubric as control-plane code (it gates promotion) and puts it in a microVM. The perf design did not consider that a rubric edit is RCE on the operator's laptop.
  • Provenance integrity. The perf design hashes case files for cache-key derivation but never asks "is this case content authentic?" The security design flags two-signature curation; the best-practices design at least flags last_validated_at staleness. The perf design treats case.yaml provenance fields as descriptive metadata only.
  • Promotion-verdict consumer contract. The perf design itself flags this as Open Q #5 — emits PromotionVerdict to JSONL with no documented consumer. Best-practices design defines a verdict subcommand. Perf design produces an artifact nobody reads.

Attacks on the security-first design

Concrete problems

  1. Problem: "Per-case microVM cold-start cost (100ms Firecracker, seconds for gVisor-on-Lima)" multiplied by 10–30 cases, serially (the design explicitly forbids parallelism: "no parallelism in Phase 6.5 — concurrency is an integrity-correlation risk"), gives "~1 minute wall-clock total" on Linux/CI. But the design also puts the SUT itself inside the same isolation envelope: "its execution within the microVM is the same shape as the Phase 5 trust-aware gates." Phase 5's microVM boot for the gate sandbox is seconds, not 100 ms (Firecracker's headline number is for a stripped guest with no rootfs init; the rubric needs Python + the rubric's deps; the SUT needs the entire LangGraph + Phase 4 cassette stack). The "1 minute total" claim is not credible. A realistic per-case lower bound is 5–15 s for two microVM boots, which puts a 30-case serial run at 2.5–7.5 minutes of boot overhead alone (30 × 5 s = 150 s; 30 × 15 s = 450 s), before any SUT or rubric execution time. Acceptable for nightly, but the design hand-waves the difference. Why it matters: The design's Risk #1 commits to "production-quality BenchRunRecords come from Linux/CI Firecracker only," meaning the actual nightly cadence runs in CI on a runner whose microVM stack is the bottleneck. If Phase 7 grows the bench to 50 cases (the roadmap's min_cases_for_promotion[silver] floor), the serial+microVM combo pushes the nightly run toward 30 minutes, eating CI budget the design never accounted for. Where: design-security.md Resource & cost profile ("rubric microVM cold-start: 100–500ms"); Components → runner.py ("never execs untrusted code in-process"); Risk #1 ("Linux/CI Firecracker only").

  2. Problem: Sigstore-backed daily anchor publication adds a new external dependency (Sigstore transparency log + OIDC) at the audit chain root. The design's Risk #4 says "if Sigstore is compromised, anchors are forgeable" and proposes "a parallel offline operator-signed anchor (GPG-detached)." That is a doubled signing ceremony for every nightly eval, performed by a human who must hold a GPG key. The design also says (Open Q #4) the anchor PR may be auto-merged "with a narrow exception to humans always merge." So the security design is asking the synthesizer to: (a) add Sigstore as a runtime dep, (b) require operator GPG keys, (c) carve an exception to ADR-0009. Three load-bearing changes to satisfy a defense layer (L6) whose stated purpose is detecting tampering on the operator's own laptop — a threat (A3) that already requires shell on the operator's host. Why it matters: The cost of L6 (Sigstore + GPG + ADR-0009 amendment) is paid every night to detect an attacker who, by assumption, already has shell on the operator. The synthesizer should weigh whether L6 is load-bearing or theatrical. Where: design-security.md Goals #5; Components → audit.py (publish_anchor); Risk #4; Open Q #4.

  3. Problem: The two-signature provenance rule (Sigstore or GPG fingerprint on every case.yaml) is enforced at startup integrity check time, not at PR-review time. If a contributor merges a PR that adds a case missing the second signature, the PR merges fine (CODEOWNERS approves the PR; CODEOWNERS does not validate the YAML payload's content). Then the next nightly bench run aborts with BenchCaseDigestMismatch or unverifiable signature. The design's containment is "entire run aborts; no BenchRunRecord written" — meaning a single malformed case nukes the entire night's eval, including all the other legitimate cases. That converts a curation mistake into a blast-radius event for promotion-evidence accrual. Why it matters: "One mismatch → abort" is the wrong containment for an integrity check that runs against a corpus of independent cases. The legitimate cases produce no evidence that night because of one bad sibling. Over weeks of curation that is dozens of lost data points feeding into ADR-0015's calibration. Where: design-security.md Data flow steps 1–2 ("One mismatch → abort"); Failure modes table ("entire run aborts; no BenchRunRecord written").

  4. Problem: The design forbids parallelism explicitly on the grounds that "concurrency is an integrity-correlation risk this phase does not need." That is asserted, not argued. The audit chain's prev_hash linkage requires serialized appends, true — but per-case execution can fan out, with chain appends serialized at the writer. The security design conflates "the chain is serial" (true) with "the harness must be serial" (false). The synthesizer should not inherit the conflation as if it were a security argument. Why it matters: The design pays the full serial cost for a defense it has not actually justified. The performance design will argue parallelism; the security design's response is rhetorical. Where: design-security.md Data flow step 4 ("Per-case execution. For each case, in serial (no parallelism in Phase 6.5 — concurrency is an integrity-correlation risk this phase does not need)"); not argued anywhere else.

  5. Problem: BenchRunRecord.operator_fingerprint is required and Sigstore-OIDC-bound. Who is the "operator" when the bench runs nightly in CI? A CI runner does not have a human Sigstore identity; it has a workload identity (e.g., GitHub OIDC). The design says "operator's Sigstore identity (short-lived OIDC)" and Open Q #6 asks whether to hash the identity, but it never resolves the prior question of whose identity is on the chain when the run is unattended. If it is the CI workload identity, the design should say so, and the threat model A8 ("Promotion forger has commit access") gains a new path: anyone who can push a workflow change can rotate the workload identity. Why it matters: The integrity model claims every record carries an operator fingerprint. The cadence the ADR commits to (nightly, automated) means the operator is a CI principal, undermining the human-attribution claim. The design owes an answer. Where: design-security.md Components → models.py BenchRunRecord (operator_fingerprint: str); ADR-0016 §Decision §5 (offline cadence, nightly).
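The distinction drawn in Problem #4 (the chain is serial; the harness need not be) can be shown with a small sketch. It is an illustration of the argument, not a proposed design. All names here (run_case, chain_writer, link, GENESIS) are hypothetical, SHA-256 stands in for BLAKE3, and the sleep stands in for SUT plus rubric work.

```python
import asyncio
import hashlib
import json

GENESIS = "0" * 64  # hypothetical root value for the first prev_hash


def link(prev_hash: str, payload: dict) -> dict:
    """Build one chained record; SHA-256 is a stdlib stand-in for BLAKE3."""
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    return {"prev_hash": prev_hash, "payload": payload, "hash": digest}


async def run_case(case_id: int, out: asyncio.Queue) -> None:
    """Stand-in for per-case execution; cases finish out of order."""
    await asyncio.sleep(0.01 * (5 - case_id % 5))
    await out.put({"case_id": case_id, "score": 1.0})


async def chain_writer(out: asyncio.Queue, n: int) -> list:
    """The only place records are appended, so prev_hash linkage stays serial."""
    chain, prev = [], GENESIS
    for _ in range(n):
        record = link(prev, await out.get())
        chain.append(record)
        prev = record["hash"]
    return chain


async def main(n: int = 8) -> list:
    out = asyncio.Queue()
    writer = asyncio.create_task(chain_writer(out, n))
    await asyncio.gather(*(run_case(i, out) for i in range(n)))  # fan-out
    return await writer


def verify(chain: list) -> bool:
    """Recompute every link from the genesis value."""
    prev = GENESIS
    for rec in chain:
        if rec["prev_hash"] != prev or link(prev, rec["payload"])["hash"] != rec["hash"]:
            return False
        prev = rec["hash"]
    return True


chain = asyncio.run(main())
print(len(chain), verify(chain))
```

Note that the chain order here reflects completion order rather than case order. Whether that ordering matters to the security design's unstated "integrity-correlation" concern is exactly what the design would need to argue.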

Hidden assumptions

  1. Assumption: The CI runners can run Firecracker (kernel virt; requires nested-virt or bare-metal). GitHub-hosted runners cannot; self-hosted runners on most cloud VMs cannot without explicit config. What breaks if it's wrong: The "production-quality records come from Linux/CI Firecracker only" commitment in Risk #1 is unrunnable on the actual CI substrate, leaving every record flagged shared_kernel and promotion.recommend() with zero usable evidence. The entire harness produces nothing the promotion gate can read.
  2. Assumption: Two reviewers including one with security role exist and are responsive enough to gate every bench-case PR. What breaks if it's wrong: Bench curation queues on one reviewer (Risk #2 acknowledges this), throttling the very evidence accrual the harness exists to produce.
  3. Assumption: Outcome-ledger reconciliation (Phase 13) will route through cases-pending/ rather than direct-write cases/. The design sets this contract (TB-7) before Phase 13 has been designed. What breaks if it's wrong: Phase 13 chooses a different shape; the security design's defense against A4 (poisoned outcome ledger) silently no-ops.

Things this design missed

  • Cache layer entirely. The security design has no BenchScore cache. Every nightly run pays full SUT + microVM cost on every case even when nothing changed. Over a year of nightly runs that is hundreds of hours of redundant compute. The design acknowledges "cache invalidation under my integrity model is non-trivial and worth deferring" but does not estimate the cost of deferring it.
  • Per-task-class registration footprint. The security design's tools/digests.yaml#bench.{task_class}.rubric pin requires editing a central file every time a rubric changes. That is a load-bearing edit on a CODEOWNERS-protected manifest for every routine rubric update. The other two designs do not require this.
  • Test-plan ratio. The security design's test plan is dominated by adversarial tests; it has no property test on BenchScore.score ∈ [0, 1] despite the roadmap explicitly requiring one ("Property test: BenchScore.score ∈ [0, 1] for all rubric outputs"). Best-practices design includes the Hypothesis test directly. Security design does not.

Attacks on the best-practices design

Concrete problems

  1. Problem: signal.SIGALRM-based per-case timeout. SIGALRM is process-wide, and Python runs its handler only on the main thread, between bytecode instructions; that makes it a poor fit for the asyncio-based SUT path the design wires (from codegenie.workflows.vuln import run_against_case; Phase 6 ships LangGraph, which is asyncio-shaped). The handler cannot preempt long-running C-extension code that never returns to the interpreter, and an exception raised from it lands at an arbitrary point inside the event loop rather than cancelling the coroutine cleanly; it is not a reliable wall-clock cap on a coroutine-driven SUT. Phase 6's final-design.md is explicitly asyncio (build_vuln_loop().ainvoke(...)), and the best-practices design proposes calling that via a synchronous wrapper plus SIGALRM. The two do not compose. Why it matters: "Other cases continue" relies on the timeout actually firing. SIGALRM either hangs or kills the process. The serial-runner-with-SIGALRM-timeout design is a soft footgun; under load it produces silent never-completes that look like the test "running long," not failures. Where: design-best-practices.md Components → runner.py ("Timeout via signal.SIGALRM on POSIX"); Open Q #3 acknowledges this is a question without answering it; Phase 6 final-design.md mandates async.

  2. Problem: loader.import_module(_codegenie_bench.{name}.registration) synthesizes a top-level package name _codegenie_bench that doesn't exist on disk. importlib.import_module requires a real importable path. The design hand-waves "synthesized module name like _codegenie_bench.{task_class_name}.registration" but does not say how Python resolves it. To make this work the loader needs sys.path manipulation or a custom MetaPathFinder — neither of which the design specifies. Phase 0's probe registry uses real codegenie.probes.{name} paths because probes live inside the src/codegenie/ package; bench cases live outside, in bench/. Why it matters: The decorator-side-effect-import pattern is not transferable from probes to benches without solving the import-path problem. The design promises Phase 0 parity and does not actually achieve it. Where: design-best-practices.md Components → loader.py Internal design ("synthesized module name like _codegenie_bench.{task_class_name}.registration"). Compare Phase 0's probes/__init__.py pattern.

  3. Problem: "≥ 90% line, ≥ 80% branch on src/codegenie/eval/" is a coverage output metric tied to a 600-LOC budget. But the design's runner.py includes per-case try/except Exception → BenchScore(passed=False, failure_modes=("harness_error: ...",)) as the dominant failure containment. Branch coverage on broad except Exception blocks via unit tests is trivial to satisfy and trivially worthless — it tests that the wrapper exists, not that the right failures land in failure_modes with the right tags. The 80% branch number creates the appearance of rigor without the discipline. Rule 9 ("Tests verify intent, not just behavior") is the relevant global rule the design itself cites elsewhere. Why it matters: A merge-gating coverage floor on a runner whose primary logic is exception flattening produces tests that lock in the flattening pattern and then nobody can refactor it. The security design's adversarial tests at least verify behavior; the best-practices design's coverage target measures the wrong thing. Where: design-best-practices.md Goals ("Test coverage target: ≥ 90% line, ≥ 80% branch"); Components → runner.py (per-case try/except).

  4. Problem: The design declares "0 net-new runtime dependencies" but specifies BenchCase.cassette_path: Path | None with cassette content-hashing for Phase 6.5. The hash computation uses what algorithm? Phase 0 uses BLAKE3 (codegenie/hashing.py per security design's reference). BLAKE3 is not stdlib — it is blake3 on PyPI. If the design uses SHA-256 instead it diverges from Phase 0's audit chain shape; if it uses BLAKE3 it adds a dep it claimed not to add. The design's blind-spots section says "cassettes are content-hashed; the case.toml carries cassette_sha256" — committing to SHA-256, diverging from Phase 0. The body of the design does not mention this divergence. Why it matters: The design's "consistency with Phase 0" claim is partial; the diverging hash choice undermines the "single audit-record format" benefit the design pitches. Either it adds the BLAKE3 dep (violating the dep claim) or it forks audit-record hashing semantics (violating the consistency claim). Where: design-best-practices.md Goals ("Net-new runtime dependencies … 0"); Acknowledged blind spots ("cassette_sha256"); Phase 0's codegenie/hashing.py uses BLAKE3.

  5. Problem: The fence test test_eval_fence.py AST-walks bench/*/registration.py looking for @register_task_class("name"). The design admits this misses non-literal names. But there is a second hole: the AST walker matches node.func.id == "register_task_class" — a contributor who imports the decorator under any alias (from codegenie.eval import register_task_class as rtc) or as eval_registry.register_task_class slips past the fence entirely while the decorator still works at runtime. Phase 0's test_pyproject_fence.py does not have this exposure (it tests pyproject.toml strings, not imported names). The design imports a Phase-0 fence pattern into a context where the assumption fails. Why it matters: The fence is the structural enforcement for ADR-0016 §Consequences ("the fence CI rejects any three-of-four PR"). A fence that any contributor with an import as can dodge isn't a fence. Where: design-best-practices.md Components → tests/unit/test_eval_fence.py (node.func.id == "register_task_class" match); Risk #4 acknowledges the literal-name exposure but not the alias exposure.
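The alias hole in Problem #5 is easy to reproduce. The sketch below mimics the fence as this critique describes it (a match on node.func.id == "register_task_class"); the function name fence_finds_registration and the three sample sources are hypothetical, and the real test_eval_fence.py may differ in detail. The sources are only parsed, never imported.

```python
import ast


def fence_finds_registration(source: str) -> bool:
    """Return True if any decorator is a call to the bare name register_task_class."""
    for node in ast.walk(ast.parse(source)):
        for dec in getattr(node, "decorator_list", []):
            if (
                isinstance(dec, ast.Call)
                and isinstance(dec.func, ast.Name)
                and dec.func.id == "register_task_class"
            ):
                return True
    return False


DIRECT = '''
from codegenie.eval import register_task_class

@register_task_class("vuln-remediation")
def build(): ...
'''

ALIASED = '''
from codegenie.eval import register_task_class as rtc

@rtc("vuln-remediation")
def build(): ...
'''

ATTRIBUTE = '''
from codegenie import eval as eval_registry

@eval_registry.register_task_class("vuln-remediation")
def build(): ...
'''

results = {
    "direct": fence_finds_registration(DIRECT),
    "aliased": fence_finds_registration(ALIASED),
    "attribute": fence_finds_registration(ATTRIBUTE),
}
print(results)
```

Only the direct form is detected. The aliased call parses as ast.Name with id "rtc", and the attribute form parses as ast.Attribute, so neither matches a check on node.func.id.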

Hidden assumptions

  1. Assumption: The system-under-test for vuln-remediation can be invoked synchronously from the runner via system_under_test: Callable[[BenchCase], Mapping[str, Any]]. What breaks if it's wrong: Phase 6 ships an async LangGraph; wrapping it in a sync callable means asyncio.run() per case, which is the opposite of what Phase 6's per-workflow checkpointer architecture expects. Sync wrapping of ainvoke is well-known fragile (event-loop cleanup, signal handler conflicts).
  2. Assumption: Bench cases live in the same repo and the bench/ directory is checked in unencrypted. What breaks if it's wrong: Phase 7 wants distroless migration cases that may snapshot real customer Dockerfiles. ADR-0016 §Open Q4 explicitly defers this. The best-practices design assumes same-repo-public.
  3. Assumption: The static-introspection test for banned field-name substrings (confidence|llm|self_reported|model_says) is a sufficient defense against LLM-judgment smuggling. What breaks if it's wrong: A rubric author who wants to smuggle a confidence value names it evidence_strength (the design's own example!) and the test passes. The Phase 5 ADR-0014 lineage is that the type system + extra="forbid" + the introspection together prevent smuggling. Substring matching alone is naming theater.
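Assumption #3 can be checked in a few lines. This sketch assumes the introspection test reduces to a substring search over field names using the pattern quoted above; the helper name passes_name_check is hypothetical and the real test may do more.

```python
import re

# Pattern as quoted from the best-practices design.
BANNED = re.compile(r"confidence|llm|self_reported|model_says")


def passes_name_check(field_names) -> bool:
    """Return True when no field name contains a banned substring."""
    return not any(BANNED.search(name) for name in field_names)


honest = passes_name_check(["score", "llm_confidence"])
renamed = passes_name_check(["score", "evidence_strength"])
print(honest, renamed)
```

The renamed field carries the same value past the check, which is why substring matching cannot stand in for the type-system and extra="forbid" layers.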

Things this design missed

  • Cost cap enforcement. The performance design adds --max-cost-usd with mid-run abort; the best-practices design only sums cost_usd into the report and trusts Phase 13 to consume it. A live operator run in this design has no in-run cost ceiling.
  • Concurrent-eval safety. Same blind spot as the security design — no consideration of two concurrent eval runs writing to .codegenie/eval/runs/. The atomic-rename-via-os.replace makes the file write safe; the run_id derivation (SHA-256 of inputs + scores) makes collisions impossible within deterministic outputs; nondeterministic SUTs make collisions live.
  • Integrity of the bench/ directory. The design has CODEOWNERS gating but nothing else — no digest pin, no signature, no verification at run time. The security design correctly catches that a contributor with bench-write access can soften the corpus; the best-practices design accepts CODEOWNERS as the only line.

Cross-design observations

Where do the three disagree?

| Dimension | Performance picks | Security picks | Best-practices picks | What's at stake |
|---|---|---|---|---|
| Concurrency model | asyncio bounded pool sized to sandbox cap | Strictly serial | Serial with signal.SIGALRM per-case timeout | Whether nightly wall-clock at portfolio scale is 8 min or 30+ min; whether sandbox/checkpointer contention is real |
| Rubric isolation | In-process import (bench.{tc}.rubric) | Per-case microVM, JSON I/O over pipe | In-process import, defended by Pydantic + static introspection | RCE on operator host vs. minutes of microVM cold-start cost per nightly run |
| Cache layer | Content-addressed BenchScore cache as the load-bearing perf win | None (deferred) | None | Whether the harness re-runs the SUT when nothing semantically changed |
| Audit shape | Per-case JSONL stream + runs.jsonl index | BLAKE3-chained BenchRunRecords + Sigstore daily anchors | Single Phase-0-shaped RunRecord JSON per run | Tamper-detection capability vs. operational complexity (Sigstore + GPG vs. plain JSON) |
| Bench-case provenance | case.yaml provenance fields are descriptive only | Two-signature CODEOWNERS + Sigstore/GPG signature on every case + cases/digests.yaml pin | CODEOWNERS gating + last_validated_at warning at 90 days | Whether a contributor with bench-write access can soften the corpus undetected |
| Bench-case format | repo.tar.zst level 3 (~30% smaller, 3× faster decompress) | Tar-serialization (format unspecified) digested into cases/digests.yaml | input/ and expected/ directories with case.toml (no archive) | Storage size + extraction perf vs. diffability under git |
| Cost cap | --max-cost-usd with mid-run abort | Deferred to Phase 13 | Sum into report; consumer is Phase 13 | Whether a runaway live run can cost more than the cap before Phase 13 ships |
| Promotion authority shape | Read-only PromotionVerdict emitted as last JSONL line | recommend() writes recommendation file; apply() raises unconditionally; tier change is hand-edited PR against bench/{tc}/registration.py#current_tier | Read-only PromotionGate.evaluate(...) → PromotionVerdict; tier change is hand-edited PR against docs/trust-tiers.yaml | Where tier state lives: in the registration module (couples tier to code) or in a separate YAML (couples to a config file) |
| Fence-CI implementation | Regex on first/second line of registration.py for the literal decorator | Three gates including a meta-gate that hashes the workflow YAML | AST walk for @register_task_class("...") literal | Brittleness vs. defeatability vs. import-path coupling |
| Hash algorithm | blake3 (Python blake3 package) on case + sut + rubric + cassette | BLAKE3 (per Phase 0 codegenie/hashing.py) | SHA-256 implied (cassette_sha256); diverges from Phase 0 | Consistency with the existing audit hashing pattern; whether Phase 13 can read both with one parser |

Which disagreement matters most for this phase?

Rubric isolation. Everything else (concurrency, cache, audit shape, fence implementation) can be evolved phase-by-phase without breaking earlier records. Rubric isolation is the one that cannot be retrofitted: if the harness ships with in-process rubric execution, every operator who has run the bench has executed every rubric ever merged on their host with their environment, and the threat model is closed retroactively only by re-running every eval inside a microVM — which produces a different score (different process model, different timing, different env) and breaks the audit chain. The synthesizer must decide the rubric's trust posture now. The performance argument against microVM is real (per-case cold-start dominates wall-clock at portfolio scale); the security argument for it is that the rubric is control-plane code (it sets the input to the promotion gate). Whichever the synthesizer picks, it cannot be quietly reversed in Phase 7 or Phase 16 without invalidating prior evidence.

Where do all three quietly agree on something questionable?

  1. All three keep bench/ in the same repo. ADR-0016 §Open Q4 explicitly defers this; all three designs assume same-repo. The synthesizer should notice the consensus is a deferral, not a decision. Once Phase 7 starts curating distroless migration cases that may include customer Dockerfiles, "same repo" stops being neutral.
  2. All three hand-wave the live-LLM cadence. Performance's $5 default cap is unjustified; security defers cost discussion to Phase 13; best-practices treats cost_usd as logging only. ADR-0016 §Open Q3 deferred LLM cost budget to Phase 13. Phase 6.5 ships a tool that can spend money with no consensus on when it is allowed to, who pays, or what the per-task-class budget is.
  3. All three treat bench/vuln-remediation/ as easy. All three claim ≥10 cases drawn from "Phases 3–4's solved-example corpus" (best-practices) or "real CVE-fix scenarios" (perf, security implicit). None of the three specifies the case-extraction tooling. The roadmap's exit criterion #2 ("bench/vuln-remediation/cases/ contains ≥10 curated cases") is the actual hard work, and none of the three designs allocates engineering for it. ADR-0016 §Tradeoffs warns "expert-curated ground-truth cases for migrations alone are weeks of work." All three lenses skipped that work.

Roadmap-level critiques

  1. Future-phase setup problems. Phase 7's exit criterion (§Phase 7) requires "the diff for this phase touches only new files — no Phase 0–6 source code is modified." If Phase 6.5's harness lands with a per-task-class registration that requires editing anything outside bench/migration-chainguard-distroless/ — e.g., security design's tools/digests.yaml central manifest of rubric digests, or best-practices' docs/trust-tiers.yaml listing the new tier slot — then Phase 7 cannot satisfy its "no edits to existing code" invariant because of how Phase 6.5 designed the registry. The performance design and the best-practices design partially avoid this; the security design's tools/digests.yaml pinning collides directly with the Phase 7 invariant.

  2. Earlier-phase reliance not actually established. All three designs assume Phase 4's cassette layer is digestible per task class (perf), per case (security), or per case via cassette_path (best-practices). Phase 4's final-design.md defines cassette-record/replay discipline but does not commit to a per-case-addressable cassette identity that another package can hash without importing Phase 4 internals. The harness is being designed against a Phase 4 contract that may not exist in the shape assumed.

  3. Load-bearing-commitment violations.

     • Production design.md §2.1 ("No LLM in the gather pipeline"). None of the three designs put an LLM in the harness, but all three describe the harness as "gather-shaped: deterministic, cacheable, auditable" (best-practices says this explicitly). Performance design's cache invalidation on docstring edits breaks the cacheability claim against any normal dev cadence, and the design admits it. The synthesizer should flag that "deterministic" in the gather sense (same inputs → same outputs) is not the same as "cache-friendly"; the harness's sut_digest over the whole graph tree conflates the two.
     • §2.5 ("Extension by addition"). Best-practices design's bench/{tc}/registration.py synthesized-import-path requires sys.path or finder hacks (Problem #2 above), which means adding a new task class requires teaching loader.py about its module path. That is "extension by editing the loader," not "extension by addition." The performance and security designs use entry points (perf) or tools/digests.yaml central manifest (security), both of which require an edit somewhere to add a class.
     • §2.8 ("Humans always merge"); ADR-0009. Security design's Open Q #4 explicitly proposes carving an exception for daily anchor PR auto-merge. The synthesizer must either reject this or amend ADR-0009, and an amendment to a load-bearing commitment ADR for the sake of a Phase 6.5 audit-anchor convenience is the tail wagging the dog.
     • CLAUDE.md "Fail loud." Best-practices design's BenchCaseLoadError containment is "case is excluded from the run with a logged warning; aggregate computed on remaining cases; exit code 1." Excluding a malformed case and continuing while marking the run failed is a contradiction: either it failed (no aggregate is reportable) or the case is excludable (no need to fail). The design has it both ways.