Phase 6.5 - Per-task-class eval harness + first benches: Devil's-advocate critique
Reviewed by: Devil's-advocate critic subagent
Date: 2026-05-12
Method
I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.
Attacks on the performance-first design
Concrete problems
- Problem: `sut_digest` hashes the entire `src/codegenie/graph/` and `recipes/` source trees plus `pyproject.toml` lock hash. The design admits this over-invalidates on docstring edits, then waves it off as "a feature." But the design also commits to a ≥ 98% cache hit rate as a hard goal and a merge-blocking integration test (test #5 / #6) that the warm-cache rerun completes in ≤ 5 s. Any normal Phase 7+ development (a contributor edits a `graph/` module to add a Phase 7 hook) invalidates 100% of cache entries while the test gate still demands 98% hits. The two commitments are incompatible. The "warm-cache canary" (test #13) is then either deleted in week 2 of Phase 7 or it becomes a perpetual flake.

  Why it matters: A merge-gating performance test that can only pass when the codebase is frozen is worse than no test: it teaches the team to disable it.

  Where: `design-performance.md` Components → `Runner.plan()` step 1 ("compute one `sut_digest` once per run … hashing the whole `graph/` + `recipes/` trees"); Goals ("Cache hit rate on unchanged-cassette + unchanged-rubric reruns: ≥ 98%"); Test plan #6 + #13 ("Hard gate, merge-blocking"); Risk #1 acknowledges the tension but does not resolve it.

- Problem: The runner explicitly imports Phase 6's `build_vuln_loop` inside the worker, which the design says costs "~250 ms first time per process; cached by the import system afterward." That is true within one process, but the design also defaults `concurrency = min(os.cpu_count(), sandbox_max_concurrent)` and uses `asyncio.Task`s (one process). Fine, until the design's own Risk #2 says the SQLite checkpointer thrashes under parallel eval, then "scratch dir lives on tmpfs." Phase 6's `final-design.md` ships a per-workflow SQLite checkpointer with integrity verification on every step (`AuditedSqliteSaver`), and the design here glosses over how that checkpointer behaves when N concurrent `ainvoke` calls touch sibling sqlite files in the same process. The integrity-verification path takes locks; the perf design has not measured it.

  Why it matters: The 8-minute cold-cache wall-clock target is computed assuming the SUT cost is dominated by sandbox boot, not by checkpointer contention. If Phase 6's audited checkpointer serializes step writes across workers (it might; neither design checked), the perf design's parallelism extracts nothing.

  Where: `design-performance.md` Components → `Runner.plan()` step 4 ("Builds the LangGraph with a per-case SQLite checkpointer"); Risks #2; Phase 6 `final-design.md` §"AuditedSqliteSaver".

- Problem: The `cassette_digest` is computed once per run by hashing "the `tests/cassettes/` subtree the SUT will consult." The design does not say how the runner knows which subtree the SUT will consult. For vuln-remediation today there is one cassette dir; for `migration-chainguard-distroless` (Phase 7) there will be another; for any future task class, more. Hashing `tests/cassettes/` whole means a Phase 7 cassette re-record invalidates every vuln-remediation cache entry, for no semantic reason. Hashing only the sub-subtree requires the runner to know cassette routing, which is Phase 4 internals the harness explicitly does not import.

  Why it matters: Either the cache invalidation false-positive rate explodes once Phase 7 lands (every cassette change in any task class invalidates every score) or the harness has to import Phase 4's cassette-routing knowledge, which violates the design's own "harness has zero import-time cost on the main CLI" stance.

  Where: `design-performance.md` Components → `Runner.plan()` step 1 ("Compute one `cassette_digest` once per run by hashing the `tests/cassettes/` subtree the SUT will consult"). The design never specifies the routing.

- Problem: "No process-level lock on cache writes — atomic-rename pattern makes the race safe." Atomic rename makes the file write race-safe; it does not make the cost accounting race-safe. Two concurrent eval runs of the same task class each pay full SUT cost, both write a `BenchScore` with `cost_usd = 0.07`, and the loser's write is overwritten, but both LLM bills already cleared. The design's $5.00 default `--max-cost-usd` cap is enforced within a run, not across concurrent runs. Run two concurrent live-mode evals and you can blow the cap by 2× silently.

  Why it matters: The headline cost claim ("≤ $0.40 / 10-case live run … cap enforced by harness") is not robust to concurrent invocation, which the design also enables by not file-locking.

  Where: `design-performance.md` Components → `cache.py` Tradeoffs ("I deliberately do not file-lock"); Goals ("$/eval-run target … cap enforced by harness via `BenchScore.cost_usd` rolling sum + `--max-cost-usd` flag").

- Problem: `bench_score` cases ship as `repo.tar.zst` extracted into tmpfs scratch per case, and the design declares a ≤ 5 MB compressed cap. But the SUT for vuln-remediation is `build_vuln_loop().ainvoke(...)`, which invokes Phase 5's microVM sandbox. Phase 5's sandbox needs the repo inside the microVM, not in tmpfs on the host. The design does not describe the host-tmpfs → microVM boundary. Either the worker re-tars the extracted repo to push it into the microVM (paying the extraction cost twice) or the microVM mounts host tmpfs (which violates Phase 5's isolation invariants).

  Why it matters: The 8-minute target depends on tarball extract being ~150 ms; if that work has to be done twice (or replaced by a microVM-image build per case), the per-case wall-clock dominates the budget.

  Where: `design-performance.md` Components → `bench/{task-class}/cases/` ("Repo-snapshot cases use frozen tarballs … extracted into a tmpfs scratch dir by the worker"); the design does not connect this to Phase 5's sandbox boundary.
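The first problem above can be reproduced in miniature. The sketch below is an illustration, not the design's code: `tree_digest` is a hypothetical helper, SHA-256 stands in for whatever hash the design uses so the example stays stdlib-only, and the `graph/nodes.py` file is invented for the demo. It shows that a digest over raw file bytes changes on a docstring-only edit, which is the over-invalidation the design acknowledges.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash the relative path and raw bytes of every file under root, in sorted order."""
    h = hashlib.sha256()  # stand-in hash (assumption); chosen only to stay in the stdlib
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def digest_before_and_after_docstring_edit() -> tuple[str, str]:
    """Build a tiny fake source tree, edit only a docstring, and return both digests."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        module = root / "graph" / "nodes.py"  # hypothetical module name
        module.parent.mkdir()
        module.write_text('def step():\n    """Run one step."""\n    return 1\n')
        before = tree_digest(root)
        # Behavior is unchanged; only the docstring text differs.
        module.write_text('def step():\n    """Run one step (Phase 7 hook)."""\n    return 1\n')
        after = tree_digest(root)
    return before, after


if __name__ == "__main__":
    before, after = digest_before_and_after_docstring_edit()
    print("digest changed:", before != after)
```

Because the digest is part of every cache key, one such edit turns every warm entry into a miss, which is why the hit-rate gate and the digest scope cannot both hold during active development.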
Hidden assumptions
- Assumption: Phase 5 ships a documented `sandbox_max_concurrent` configuration knob the runner can read.
  What breaks if it's wrong: The runner's concurrency default is undefined; the operator either picks a number (likely too high → sandbox thrashing) or the design needs to land its own config key, which then becomes load-bearing forever. The design itself flags this as Open Question #2 but builds the entire performance argument on the knob existing.
- Assumption: The Phase 4 cassette layer is keyed in a way the harness can fingerprint per-case without reaching into Phase 4 internals.
  What breaks if it's wrong: See Problem #3: `cassette_digest` either over-invalidates or requires the harness to import Phase 4's routing.
- Assumption: Operators run nightly evals on a single host with stable filesystem mtimes.
  What breaks if it's wrong: The 90-day mtime-based GC sweeps live cache entries on a host whose clock is wrong or whose tmpfs gets remounted. The design has no defensive mtime check against future-dated files.
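The mtime assumption in the last item can be made concrete with a few lines of arithmetic. This is a sketch under stated assumptions: `is_expired` is a hypothetical stand-in for an age-based sweep rule (the design's actual GC code is not shown in the source), and the timestamps are invented.

```python
DAY = 24 * 3600
MAX_AGE = 90 * DAY  # the 90-day window named in the design


def is_expired(mtime: float, now: float, max_age: float = MAX_AGE) -> bool:
    """Naive age rule (hypothetical): expire when the entry looks older than max_age."""
    return (now - mtime) > max_age


# Invented timestamps, in seconds since an arbitrary epoch.
correct_now = 1_000 * DAY
live_entry_mtime = correct_now - 1 * DAY  # written yesterday
skewed_now = correct_now + 100 * DAY      # host clock is 100 days ahead
future_mtime = correct_now + 365 * DAY    # file dated one year in the future

if __name__ == "__main__":
    print("correct clock sweeps live entry:", is_expired(live_entry_mtime, correct_now))
    print("skewed clock sweeps live entry:", is_expired(live_entry_mtime, skewed_now))
    print("future-dated file is swept:", is_expired(future_mtime, correct_now))
```

With a correct clock the day-old entry survives; with the skewed clock it looks 101 days old and is swept; the future-dated file has a negative age and is never swept until the clock catches up.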
Things this design missed that a different lens caught
- Rubric isolation. The perf design imports `bench/{task-class}/rubric.py` directly and runs it in-process; its only acknowledgment of safety is "the rubric module imports its own heavy deps … at function call time." The security design correctly identifies the rubric as control-plane code (it gates promotion) and puts it in a microVM. The perf design did not consider that a rubric edit is RCE on the operator's laptop.
- Provenance integrity. The perf design hashes case files for cache-key derivation but never asks "is this case content authentic?" The security design flags two-signature curation; the best-practices design at least flags `last_validated_at` staleness. The perf design treats `case.yaml` provenance fields as descriptive metadata only.
- Promotion-verdict consumer contract. The perf design itself flags this as Open Q #5: it emits `PromotionVerdict` to JSONL with no documented consumer. Best-practices design defines a verdict subcommand. Perf design produces an artifact nobody reads.
Attacks on the security-first design
Concrete problems
- Problem: "Per-case microVM cold-start cost (100ms Firecracker, seconds for gVisor-on-Lima)" multiplied by 10–30 cases, serially (the design explicitly forbids parallelism, calling concurrency "an integrity-correlation risk"), gives "~1 minute wall-clock total" on Linux/CI. But the design also puts the SUT itself inside the same isolation envelope: "its execution within the microVM is the same shape as the Phase 5 trust-aware gates." Phase 5's microVM boot for the gate sandbox is seconds, not 100 ms (Firecracker's headline number is for a stripped guest with no rootfs init; the rubric needs Python + the rubric's deps; the SUT needs the entire LangGraph + Phase 4 cassette stack). The "1 minute total" claim is not credible. A realistic per-case lower bound is 5–15 s for two microVM boots, putting a 30-case run at 5–8 minutes serial. Acceptable for nightly, but the design hand-waves the difference.

  Why it matters: The design's Risk #1 commits to "production-quality `BenchRunRecord`s come from Linux/CI Firecracker only," meaning the actual nightly cadence runs in CI on a runner whose microVM stack is the bottleneck. If Phase 7 grows the bench to 50 cases (the roadmap's `min_cases_for_promotion[silver]` floor), the serial+microVM combo pushes the nightly run toward 30 minutes, eating CI budget the design never accounted for.

  Where: `design-security.md` Resource & cost profile ("rubric microVM cold-start: 100–500ms"); Components → `runner.py` ("never `exec`s untrusted code in-process"); Risk #1 ("Linux/CI Firecracker only").

- Problem: Sigstore-backed daily anchor publication adds a new external dependency (Sigstore transparency log + OIDC) at the audit chain root. The design's Risk #4 says "if Sigstore is compromised, anchors are forgeable" and proposes "a parallel offline operator-signed anchor (GPG-detached)." That is a doubled signing ceremony for every nightly eval, performed by a human who must hold a GPG key. The design also says (Open Q #4) the anchor PR may be auto-merged "with a narrow exception to humans always merge." So the security design is asking the synthesizer to: (a) add Sigstore as a runtime dep, (b) require operator GPG keys, (c) carve an exception to ADR-0009. Three load-bearing changes to satisfy a defense layer (L6) whose stated purpose is detecting tampering on the operator's own laptop, a threat (A3) that already requires shell on the operator's host.

  Why it matters: The cost of L6 (Sigstore + GPG + ADR-0009 amendment) is paid every night to detect an attacker who, by assumption, already has shell on the operator. The synthesizer should weigh whether L6 is load-bearing or theatrical.

  Where: `design-security.md` Goals #5; Components → `audit.py` (`publish_anchor`); Risk #4; Open Q #4.

- Problem: The two-signature provenance rule (Sigstore or GPG fingerprint on every `case.yaml`) is enforced at startup integrity check time, not at PR-review time. If a contributor merges a PR that adds a case missing the second signature, the PR merges fine (CODEOWNERS approves the PR; CODEOWNERS does not validate the YAML payload's content). Then the next nightly bench run aborts with `BenchCaseDigestMismatch` or unverifiable signature. The design's containment is "entire run aborts; no `BenchRunRecord` written", meaning a single malformed case nukes the entire night's eval, including all the other legitimate cases. That converts a curation mistake into a blast-radius event for promotion-evidence accrual.

  Why it matters: "One mismatch → abort" is the wrong containment for an integrity check that runs against a corpus of independent cases. The legitimate cases produce no evidence that night because of one bad sibling. Over weeks of curation that is dozens of lost data points feeding into ADR-0015's calibration.

  Where: `design-security.md` Data flow steps 1–2 ("One mismatch → abort"); Failure modes table ("entire run aborts; no `BenchRunRecord` written").

- Problem: The design forbids parallelism explicitly on the grounds that "concurrency is an integrity-correlation risk this phase does not need." That is asserted, not argued. The audit chain's `prev_hash` linkage requires serialized appends, true, but per-case execution can fan out, with chain appends serialized at the writer. The security design conflates "the chain is serial" (true) with "the harness must be serial" (false). The synthesizer should not inherit the conflation as if it were a security argument.

  Why it matters: The design pays the full serial cost for a defense it has not actually justified. The performance design will argue parallelism; the security design's response is rhetorical.

  Where: `design-security.md` Data flow step 4 ("Per-case execution. For each case, in serial (no parallelism in Phase 6.5 — concurrency is an integrity-correlation risk this phase does not need)"); not argued anywhere else.

- Problem: `BenchRunRecord.operator_fingerprint` is required and Sigstore-OIDC-bound. Who is the "operator" when the bench runs nightly in CI? A CI runner does not have a human Sigstore identity; it has a workload identity (e.g., GitHub OIDC). The design says "operator's Sigstore identity (short-lived OIDC)" and Open Q #6 asks whether to hash the identity, but never resolves the prior question of whose identity is on the chain when the run is unattended. If it is the CI workload identity, the design should say so (and the threat model A8, "Promotion forger has commit access", gains a new path: anyone who can push a workflow change can rotate the workload identity).

  Why it matters: The integrity model claims every record carries an operator fingerprint. The cadence the ADR commits to (nightly, automated) means the operator is a CI principal, undermining the human-attribution claim. The design owes an answer.

  Where: `design-security.md` Components → `models.py` `BenchRunRecord` (`operator_fingerprint: str`); ADR-0016 §Decision §5 (offline cadence, nightly).
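The distinction drawn in the fourth problem, "the chain is serial" versus "the harness must be serial", can be shown in a few lines. The sketch below only illustrates that distinction and is not offered as a design: `ChainWriter`, `run_case`, and `run_all` are hypothetical names, SHA-256 stands in for the chain hash to stay stdlib-only, and the per-case work is a placeholder. It does not address whatever "integrity-correlation risk" the security design has in mind, since that risk is never spelled out.

```python
import asyncio
import hashlib
import json

GENESIS = "0" * 64


def _record_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev_hash": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


class ChainWriter:
    """Append-only hash chain (hypothetical). Appends are serialized by one lock."""

    def __init__(self) -> None:
        self.records: list[dict] = []
        self._lock = asyncio.Lock()
        self._prev_hash = GENESIS

    async def append(self, payload: dict) -> None:
        # The lock makes the serialization explicit; it matters once an append awaits I/O.
        async with self._lock:
            digest = _record_hash(self._prev_hash, payload)
            self.records.append(
                {"prev_hash": self._prev_hash, "payload": payload, "hash": digest}
            )
            self._prev_hash = digest


def verify(records: list[dict]) -> bool:
    """Check that every record links to its predecessor and hashes correctly."""
    prev = GENESIS
    for rec in records:
        if rec["prev_hash"] != prev or _record_hash(prev, rec["payload"]) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True


async def run_case(case_id: str, writer: ChainWriter, limit: asyncio.Semaphore) -> None:
    async with limit:
        await asyncio.sleep(0)  # placeholder for SUT + rubric execution
        await writer.append({"case_id": case_id, "score": 1.0})


async def run_all(case_ids: list[str], concurrency: int = 4) -> list[dict]:
    writer = ChainWriter()
    limit = asyncio.Semaphore(concurrency)
    await asyncio.gather(*(run_case(c, writer, limit) for c in case_ids))
    return writer.records


if __name__ == "__main__":
    chain = asyncio.run(run_all([f"case-{i}" for i in range(8)]))
    print(len(chain), "records, chain valid:", verify(chain))
```

Note that records land in completion order, not case order; whether that ordering matters to the audit model is exactly the argument the security design would need to make.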
Hidden assumptions
- Assumption: The CI runners can run Firecracker (kernel virt; requires nested-virt or bare-metal). GitHub-hosted runners cannot; self-hosted runners on most cloud VMs cannot without explicit config.
  What breaks if it's wrong: The "production-quality records come from Linux/CI Firecracker only" commitment in Risk #1 is unrunnable on the actual CI substrate, leaving every record flagged `shared_kernel` and `promotion.recommend()` with zero usable evidence. The entire harness produces nothing the promotion gate can read.
- Assumption: Two reviewers including one with `security` role exist and are responsive enough to gate every bench-case PR.
  What breaks if it's wrong: Bench curation queues on one reviewer (Risk #2 acknowledges this), throttling the very evidence accrual the harness exists to produce.
- Assumption: Outcome-ledger reconciliation (Phase 13) will route through `cases-pending/` rather than direct-write `cases/`. The design sets this contract (TB-7) before Phase 13 has been designed.
  What breaks if it's wrong: Phase 13 chooses a different shape; the security design's defense against A4 (poisoned outcome ledger) silently no-ops.
Things this design missed
- Cache layer entirely. The security design has no `BenchScore` cache. Every nightly run pays full SUT + microVM cost on every case even when nothing changed. Over a year of nightly runs that is hundreds of hours of redundant compute. The design acknowledges "cache invalidation under my integrity model is non-trivial and worth deferring" but does not estimate the cost of deferring it.
- Per-task-class registration footprint. The security design's `tools/digests.yaml#bench.{task_class}.rubric` pin requires editing a central file every time a rubric changes. That is a load-bearing edit on a CODEOWNERS-protected manifest for every routine rubric update. The other two designs do not require this.
- Test-plan ratio. The security design's test plan is dominated by adversarial tests; it has no property test on `BenchScore.score ∈ [0, 1]` despite the roadmap explicitly requiring one ("Property test: `BenchScore.score ∈ [0, 1]` for all rubric outputs"). Best-practices design includes the Hypothesis test directly. Security design does not.
Attacks on the best-practices design
Concrete problems
- Problem: `signal.SIGALRM`-based per-case timeout. SIGALRM is single-process, single-thread, and fundamentally incompatible with the asyncio-based SUT path the design wires (`from codegenie.workflows.vuln import run_against_case`; Phase 6 ships LangGraph, which is asyncio-shaped). CPython runs the Python-level handler only in the main thread, between bytecode instructions, so the alarm cannot interrupt long-running code inside C extensions, and an exception raised from the handler surfaces at an arbitrary point in the event loop; it is not a reliable wall-clock cap on a coroutine-driven SUT. Phase 6's `final-design.md` is explicitly asyncio (`build_vuln_loop().ainvoke(...)`), and the best-practices design proposes calling that via a synchronous wrapper plus SIGALRM. The two do not compose.

  Why it matters: "Other cases continue" relies on the timeout actually firing. SIGALRM either hangs or kills the process. The serial-runner-with-SIGALRM-timeout design is a soft footgun; under load it produces silent never-completes that look like the test "running long," not failures.

  Where: `design-best-practices.md` Components → `runner.py` ("Timeout via `signal.SIGALRM` on POSIX"); Open Q #3 acknowledges this is a question without answering it; Phase 6 `final-design.md` mandates async.

- Problem: `loader.import_module(_codegenie_bench.{name}.registration)` synthesizes a top-level package name `_codegenie_bench` that doesn't exist on disk. `importlib.import_module` requires a real importable path. The design hand-waves "synthesized module name like `_codegenie_bench.{task_class_name}.registration`" but does not say how Python resolves it. To make this work the loader needs `sys.path` manipulation or a custom `MetaPathFinder`, neither of which the design specifies. Phase 0's probe registry uses real `codegenie.probes.{name}` paths because probes live inside the `src/codegenie/` package; bench cases live outside, in `bench/`.

  Why it matters: The decorator-side-effect-import pattern is not transferable from probes to benches without solving the import-path problem. The design promises Phase 0 parity and does not actually achieve it.

  Where: `design-best-practices.md` Components → `loader.py` Internal design ("synthesized module name like `_codegenie_bench.{task_class_name}.registration`"). Compare Phase 0's `probes/__init__.py` pattern.

- Problem: "≥ 90% line, ≥ 80% branch on `src/codegenie/eval/`" is a coverage output metric tied to a 600-LOC budget. But the design's `runner.py` includes per-case `try/except Exception → BenchScore(passed=False, failure_modes=("harness_error: ...",))` as the dominant failure containment. Branch coverage on broad `except Exception` blocks via unit tests is trivial to satisfy and trivially worthless: it tests that the wrapper exists, not that the right failures land in `failure_modes` with the right tags. The 80% branch number creates the appearance of rigor without the discipline. Rule 9 ("Tests verify intent, not just behavior") is the relevant global rule the design itself cites elsewhere.

  Why it matters: A merge-gating coverage floor on a runner whose primary logic is exception flattening produces tests that lock in the flattening pattern and then nobody can refactor it. The security design's adversarial tests at least verify behavior; the best-practices design's coverage target measures the wrong thing.

  Where: `design-best-practices.md` Goals ("Test coverage target: ≥ 90% line, ≥ 80% branch"); Components → `runner.py` (per-case try/except).

- Problem: The design declares "0 net-new runtime dependencies" but specifies `BenchCase.cassette_path: Path | None` with cassette content-hashing for Phase 6.5. The hash computation uses what algorithm? Phase 0 uses BLAKE3 (`codegenie/hashing.py` per security design's reference). BLAKE3 is not stdlib; it is `blake3` on PyPI. If the design uses SHA-256 instead it diverges from Phase 0's audit chain shape; if it uses BLAKE3 it adds a dep it claimed not to add. The design's blind-spots section says "cassettes are content-hashed; the case.toml carries `cassette_sha256`", committing to SHA-256 and diverging from Phase 0. The body of the design does not mention this divergence.

  Why it matters: The design's "consistency with Phase 0" claim is partial; the diverging hash choice undermines the "single audit-record format" benefit the design pitches. Either it adds the BLAKE3 dep (violating the dep claim) or it forks audit-record hashing semantics (violating the consistency claim).

  Where: `design-best-practices.md` Goals ("Net-new runtime dependencies … 0"); Acknowledged blind spots ("`cassette_sha256`"); Phase 0's `codegenie/hashing.py` uses BLAKE3.

- Problem: The fence test `test_eval_fence.py` AST-walks `bench/*/registration.py` looking for `@register_task_class("name")`. The design admits this misses non-literal names. But there is a second hole: the AST walker matches `node.func.id == "register_task_class"`, so a contributor who imports the decorator under any alias (`from codegenie.eval import register_task_class as rtc`) or as `eval_registry.register_task_class` slips past the fence entirely while the decorator still works at runtime. Phase 0's `test_pyproject_fence.py` does not have this exposure (it tests `pyproject.toml` strings, not imported names). The design imports a Phase-0 fence pattern into a context where the assumption fails.

  Why it matters: The fence is the structural enforcement for ADR-0016 §Consequences ("the fence CI rejects any three-of-four PR"). A fence that any contributor with an `import as` can dodge isn't a fence.

  Where: `design-best-practices.md` Components → `tests/unit/test_eval_fence.py` (`node.func.id == "register_task_class"` match); Risk #4 acknowledges the literal-name exposure but not the alias exposure.
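The alias exposure in the last problem is easy to reproduce. The sketch below assumes the fence does nothing more than the `node.func.id == "register_task_class"` match quoted from the design; `naive_fence_finds_registration` is a hypothetical reconstruction, and the three source snippets are invented. The snippets are only parsed, never imported, so `codegenie` does not need to be installed.

```python
import ast


def naive_fence_finds_registration(source: str) -> bool:
    """Mimic a fence that matches only calls whose func is the bare name."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "register_task_class"
        ):
            return True
    return False


DIRECT = '''
from codegenie.eval import register_task_class

@register_task_class("vuln-remediation")
class Registration: ...
'''

ALIASED = '''
from codegenie.eval import register_task_class as rtc

@rtc("vuln-remediation")
class Registration: ...
'''

ATTRIBUTE = '''
import codegenie.eval as eval_registry

@eval_registry.register_task_class("vuln-remediation")
class Registration: ...
'''

if __name__ == "__main__":
    for label, src in [("direct", DIRECT), ("aliased", ALIASED), ("attribute", ATTRIBUTE)]:
        print(label, "seen by fence:", naive_fence_finds_registration(src))
```

Only the direct form is seen. The aliased call is an `ast.Name` with a different `id`, and the attribute form is an `ast.Attribute`, which has no `id` at all, so both registrations pass the fence unnoticed while still registering at runtime.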
Hidden assumptions
- Assumption: The system-under-test for vuln-remediation can be invoked synchronously from the runner via `system_under_test: Callable[[BenchCase], Mapping[str, Any]]`.
  What breaks if it's wrong: Phase 6 ships an async LangGraph; wrapping it in a sync callable means `asyncio.run()` per case, which is the opposite of what Phase 6's per-workflow checkpointer architecture expects. Sync wrapping of `ainvoke` is known to be fragile (event-loop cleanup, signal handler conflicts).
- Assumption: Bench cases live in the same repo and the `bench/` directory is checked in unencrypted.
  What breaks if it's wrong: Phase 7 wants distroless migration cases that may snapshot real customer Dockerfiles. ADR-0016 §Open Q4 explicitly defers this. The best-practices design assumes same-repo-public.
- Assumption: The static-introspection test for banned field-name substrings (`confidence|llm|self_reported|model_says`) is a sufficient defense against LLM-judgment smuggling.
  What breaks if it's wrong: A rubric author who wants to smuggle a confidence value names it `evidence_strength` (the design's own example!) and the test passes. The Phase 5 ADR-0014 lineage is that the type system + `extra="forbid"` + the introspection together prevent smuggling. Substring matching alone is naming theater.
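The "naming theater" point in the last assumption takes a few lines to demonstrate. This sketch assumes the introspection test is a plain substring match over field names using the pattern quoted from the design; `banned_fields` is a hypothetical helper and the field lists are invented.

```python
import re

# Pattern quoted from the design; how the real test applies it is an assumption.
BANNED = re.compile(r"confidence|llm|self_reported|model_says")


def banned_fields(field_names: list[str]) -> list[str]:
    """Return the field names that contain any banned substring."""
    return [name for name in field_names if BANNED.search(name.lower())]


if __name__ == "__main__":
    print(banned_fields(["score", "llm_confidence"]))     # flagged
    print(banned_fields(["score", "evidence_strength"]))  # same value, new name
```

The first list is flagged and the second is not, even though both fields could carry the same model-reported number. The check constrains spelling, not semantics.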
Things this design missed
- Cost cap enforcement. The performance design adds `--max-cost-usd` with mid-run abort; the best-practices design only sums `cost_usd` into the report and trusts Phase 13 to consume it. A live operator run in this design has no in-run cost ceiling.
- Concurrent-eval safety. Same blind spot as the security design: no consideration of two concurrent eval runs writing to `.codegenie/eval/runs/`. The atomic-rename-via-`os.replace` makes the file write safe; the run_id derivation (SHA-256 of inputs + scores) makes collisions impossible within deterministic outputs; nondeterministic SUTs make collisions live.
- Integrity of the `bench/` directory. The design has CODEOWNERS gating but nothing else: no digest pin, no signature, no verification at run time. The security design correctly catches that a contributor with bench-write access can soften the corpus; the best-practices design accepts CODEOWNERS as the only line.
Cross-design observations
Where do the three disagree?
| Dimension | Performance picks | Security picks | Best-practices picks | What's at stake |
|---|---|---|---|---|
| Concurrency model | asyncio bounded pool sized to sandbox cap | Strictly serial | Serial with `signal.SIGALRM` per-case timeout | Whether nightly wall-clock at portfolio scale is 8 min or 30+ min; whether sandbox/checkpointer contention is real |
| Rubric isolation | In-process import (`bench.{tc}.rubric`) | Per-case microVM, JSON I/O over pipe | In-process import, defended by Pydantic + static introspection | RCE on operator host vs. minutes of microVM cold-start cost per nightly run |
| Cache layer | Content-addressed `BenchScore` cache as the load-bearing perf win | None (deferred) | None | Whether the harness re-runs the SUT when nothing semantically changed |
| Audit shape | Per-case JSONL stream + `runs.jsonl` index | BLAKE3-chained `BenchRunRecord`s + Sigstore daily anchors | Single Phase-0-shaped `RunRecord` JSON per run | Tamper-detection capability vs. operational complexity (Sigstore + GPG vs. plain JSON) |
| Bench-case provenance | `case.yaml` provenance fields are descriptive only | Two-signature CODEOWNERS + Sigstore/GPG signature on every case + `cases/digests.yaml` pin | CODEOWNERS gating + `last_validated_at` warning at 90 days | Whether a contributor with bench-write access can soften the corpus undetected |
| Bench-case format | `repo.tar.zst` level 3 (~30% smaller, 3× faster decompress) | Tar-serialization (format unspecified) digested into `cases/digests.yaml` | `input/` and `expected/` directories with `case.toml` (no archive) | Storage size + extraction perf vs. diffability under git |
| Cost cap | `--max-cost-usd` with mid-run abort | Deferred to Phase 13 | Sum into report; consumer is Phase 13 | Whether a runaway live run can cost more than the cap before Phase 13 ships |
| Promotion authority shape | Read-only `PromotionVerdict` emitted as last JSONL line | `recommend()` writes recommendation file; `apply()` raises unconditionally; tier change is hand-edited PR against `bench/{tc}/registration.py#current_tier` | Read-only `PromotionGate.evaluate(...)` → `PromotionVerdict`; tier change is hand-edited PR against `docs/trust-tiers.yaml` | Where tier state lives: in the registration module (couples tier to code) or in a separate YAML (couples to a config file) |
| Fence-CI implementation | Regex on first/second line of `registration.py` for the literal decorator | Three gates including a meta-gate that hashes the workflow YAML | AST walk for `@register_task_class("...")` literal | Brittleness vs. defeatability vs. import-path coupling |
| Hash algorithm | `blake3` (Python `blake3` package) on case + sut + rubric + cassette | BLAKE3 (per Phase 0 `codegenie/hashing.py`) | SHA-256 implied (`cassette_sha256`); diverges from Phase 0 | Consistency with the existing audit hashing pattern; whether Phase 13 can read both with one parser |

Which disagreement matters most for this phase?
Rubric isolation. Everything else (concurrency, cache, audit shape, fence implementation) can be evolved phase-by-phase without breaking earlier records. Rubric isolation is the one that cannot be retrofitted: if the harness ships with in-process rubric execution, every operator who has run the bench has executed every rubric ever merged on their host with their environment, and the threat model is closed retroactively only by re-running every eval inside a microVM — which produces a different score (different process model, different timing, different env) and breaks the audit chain. The synthesizer must decide the rubric's trust posture now. The performance argument against microVM is real (per-case cold-start dominates wall-clock at portfolio scale); the security argument for it is that the rubric is control-plane code (it sets the input to the promotion gate). Whichever the synthesizer picks, it cannot be quietly reversed in Phase 7 or Phase 16 without invalidating prior evidence.
Where do all three quietly agree on something questionable?
- All three keep `bench/` in the same repo. ADR-0016 §Open Q4 explicitly defers this; all three designs assume same-repo. The synthesizer should notice the consensus is a deferral, not a decision. Once Phase 7 starts curating distroless migration cases that may include customer Dockerfiles, "same repo" stops being neutral.
- All three hand-wave the live-LLM cadence. Performance's $5 default cap is unjustified; security defers cost discussion to Phase 13; best-practices treats `cost_usd` as logging only. ADR-0016 §Open Q3 deferred LLM cost budget to Phase 13. Phase 6.5 ships a tool that can spend money with no consensus on when it is allowed to, who pays, or what the per-task-class budget is.
- All three treat `bench/vuln-remediation/` as easy. All three claim ≥10 cases drawn from "Phases 3–4's solved-example corpus" (best-practices) or "real CVE-fix scenarios" (perf, security implicit). None of the three designs specifies the case-extraction tooling. The roadmap's exit criterion #2 ("`bench/vuln-remediation/cases/` contains ≥10 curated cases") is the actual hard work, and none of the three designs allocates engineering for it. ADR-0016 §Tradeoffs warns "expert-curated ground-truth cases for migrations alone are weeks of work." All three lenses skipped that work.
Roadmap-level critiques
- Future-phase setup problems. Phase 7's exit criterion (§Phase 7) requires "the diff for this phase touches only new files — no Phase 0–6 source code is modified." If Phase 6.5's harness lands with a per-task-class registration that requires editing anything outside `bench/migration-chainguard-distroless/` (e.g., security design's `tools/digests.yaml` central manifest of rubric digests, or best-practices' `docs/trust-tiers.yaml` listing the new tier slot), then Phase 7 cannot satisfy its "no edits to existing code" invariant because of how Phase 6.5 designed the registry. The performance design and the best-practices design partially avoid this; the security design's `tools/digests.yaml` pinning collides directly with the Phase 7 invariant.

- Earlier-phase reliance not actually established. All three designs assume Phase 4's cassette layer is digestable per task class (perf), per case (security), or per case via `cassette_path` (best-practices). Phase 4's `final-design.md` defines cassette-record/replay discipline but does not commit to a per-case-addressable cassette identity that another package can hash without importing Phase 4 internals. The harness is being designed against a Phase 4 contract that may not exist in the shape assumed.

- Load-bearing-commitment violations.
  - Production design.md §2.1 ("No LLM in the gather pipeline"). None of the three designs put an LLM in the harness, but all three describe the harness as "gather-shaped: deterministic, cacheable, auditable" (best-practices says this explicitly). Performance design's cache invalidation on docstring edits breaks the cacheability claim against any normal dev cadence, and the design admits it. The synthesizer should flag that "deterministic" in the gather sense (same inputs → same outputs) is not the same as "cache-friendly"; the harness's `sut_digest` over the whole graph tree conflates the two.
  - §2.5 ("Extension by addition"). Best-practices design's `bench/{tc}/registration.py` synthesized-import-path requires `sys.path` or finder hacks (Problem #2 above), which means adding a new task class requires teaching `loader.py` about its module path. That is "extension by editing the loader," not "extension by addition." The performance and security designs use entry points (perf) or `tools/digests.yaml` central manifest (security), both of which require an edit somewhere to add a class.
  - §2.8 ("Humans always merge"); ADR-0009. Security design's Open Q #4 explicitly proposes carving an exception for daily anchor PR auto-merge. The synthesizer must either reject this or amend ADR-0009, and an amendment to a load-bearing commitment ADR for the sake of a Phase 6.5 audit-anchor convenience is the tail wagging the dog.
  - CLAUDE.md "Fail loud." Best-practices design's `BenchCaseLoadError` containment is "case is excluded from the run with a logged warning; aggregate computed on remaining cases; exit code 1." Excluding a malformed case and continuing while marking the run failed is a contradiction: either it failed (no aggregate is reportable) or the case is excludable (no need to fail). The design has it both ways.
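The import-path point behind the §2.5 item (Problem #2 against the best-practices design) can be checked directly. The sketch below assumes nothing has been added to `sys.path` or `sys.meta_path`, which matches what the design specifies; `try_import` is a hypothetical helper and `vuln_remediation` is an invented task-class name.

```python
import importlib


def try_import(name: str) -> str:
    """Attempt a plain import and report what the import system says."""
    try:
        importlib.import_module(name)
    except ModuleNotFoundError as exc:
        return f"ModuleNotFoundError: {exc.name}"
    return "imported"


if __name__ == "__main__":
    # The synthesized name has no package on disk behind it.
    print(try_import("_codegenie_bench.vuln_remediation.registration"))
    # A real stdlib submodule, for contrast.
    print(try_import("json.decoder"))
```

The synthesized name fails at its top-level package because no finder knows about `_codegenie_bench`. Making it resolve means registering a finder or editing `sys.path` somewhere, and that edit is what the critique counts against "extension by addition."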