Phase 05 — Sandbox + Trust-Aware gates: Devil's-advocate critique¶
Reviewed by: Devil's-advocate critic subagent Date: 2026-05-12
Method¶
I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.
Attacks on the performance-first design¶
Concrete problems¶
-
Problem: The
GateVerdictCachekey isblake3(patch || lockfile || base_image_digest || gate_id || gate_inputs_digest || gate_impl_hash)but it deliberately omits thegrypevuln-DB digest from the cache key for the Build/Install/Test gates and segregates CVE-delta as a "different cache-key shape." That segregation is the bug. A patch's Test result depends on transitively-resolved registry tarballs whose ground truth changes when registry mirror state changes — but the design pins onlyregistry_mirror_digestinside the lockfile resolver (Phase 3), not in the Phase 5 verdict cache. If the npm registry serves alatest-resolved transitive dep differently between retry-1 and retry-2 (typo-squat takedown, yanked version, mirror swap) the Test verdict cached at retry-1 is wrong on the next workflow. The design says "passing gates whose inputs didn't change hit cache" — but the inputs the cache sees are not the inputs that produced the verdict. Why it matters: A "warm" verdict cache serves stale "pass" verdicts. Phase 13's cost dashboard will show savings; production will ship CVEs. Where: §ArchitectureGateVerdictCache, §ComponentsGateVerdictCache.key_for, §Data flow step 2 ("five hits"). -
Problem: The Install→Test snapshot-sharing optimization (saves ~15–25 s on Test gate) is keyed on
(patch_blake3, lockfile_blake3)and claimed read-only. Butnpm testwrites (jest coverage outputs,~/.npm/_cacache,/tmpmutations, Node's V8 code-cache). The design says "read-only mount inside the Test VM" — a read-only mount ofnode_modulesdoes not buy isolation if the test runner needs to write anywhere under that tree (e.g.,node_modules/.cache/, which Babel, jest, webpack, and SWC all use). Either you make the layer copy-on-write (defeats the perf claim — you pay the upper-layer materialization), or tests genuinely-read-only fail mysteriously on real fixtures. Why it matters: The 15–25 s savings is the largest single perf claim in §Resource & cost profile. If it doesn't hold on real Node test suites, p50 cold goes from ~85 s to ~110 s and Goal 1 (24 cold workflows/hour) misses. Where: §ComponentsRootfsCache + WarmPoollast paragraph, §Risks risk-2. -
Problem: Goal 6 says "Re-run of an identical patch on an identical fixture: 100% cache hit; 0 microVM boots." But Goal 11 measures
runtime_traceas one of the six gates. The runtime trace's "stable vs. Phase 2 baseline" property depends on the kernel version inside the rootfs. The design rebakes rootfs per(base_image_digest, node_major, fixture_class)— there is nokernel_digestin the rootfs cache key, even though §Failure modes calls out "Firecracker vmlinux mismatch (rootfs built against newer kernel)." So after avmlinuxbump, every cachedruntime_traceverdict is stale, but the verdict cache happily serves the prior pass. Why it matters: RuntimeTrace is the signal that detects "the patch added a new shell invocation." Caching that against a no-longer-relevant kernel is a silent failure of the gate's reason to exist. Where: §ComponentsGateVerdictCache, §Failure modes row 1, gate typeruntime_trace.pydefinition. -
Problem: §Risks risk-1 acknowledges Firecracker hard-pin on Linux/CI is "a partial pre-emption of ADR-0019" and §Open questions #2 asks the synth to choose. The design then commits its entire performance budget to Firecracker (every number in §Goals 3, 4, 9, 10, 12 assumes snapshot-resume). On DinD-on-macOS the design admits the WarmPool "degrades to cold-boot mode" — that's not degradation, that's "the entire perf model evaporates." The design has no fallback target for the macOS path; it just shrugs. This means the perf design has no defensible perf claims for ~50% of the operator surface (developer laptops). Why it matters: The synthesizer cannot use these numbers as gate criteria. There is no committed cold-DinD perf budget anywhere in the doc. Where: §Goals row "DinD on macOS: ≤ 12 s," §Risks risk-1, §Open questions #2.
-
Problem: The
SignalAggregatoris "strict-AND over objective signals only" — but the design enumerates six gates (Build, Install, Test, Policy, RuntimeTrace, CVE-delta). ADR-0008 lists seven signal categories, and Phase 3's strict-ANDTrustScoreralready exists and consumes a different signal set (build, install, test, prove-it). The design says "Phase 3'sTrustScoreralready exists ... and is preserved" but also saysSignalAggregator.aggregate(ledger) -> TrustSignals. There are two strict-AND scorers in the same pipeline: Phase 5'sSignalAggregatorand Phase 3'sTrustScorer. Which one'spassedfield gates the workflow? The §Data flow says aggregator returnspassed=Trueand the workflow finishes — Phase 3'sTrustScoreris never re-invoked. So the design quietly replaces Phase 3's scorer. That is an edit, not an addition. Why it matters: Violates "extension by addition" (load-bearing commitment §2.5). Also: if Phase 5 ships its own scorer, then Phase 6's LangGraph conditional_edge has two scorer types to wrap and no clarity on which is canonical. Where: §ComponentsSignalAggregator, §Data flow step 5, §Acknowledged blind spots. -
Problem: §Goals row 12 ("retry-2 ≤ 110% of retry-1 wall-clock; retry-3 ≤ 115%") is a target the cache will pass trivially on green-path fixtures (where passing gates cache and only the failing gate re-runs). The number is meaningless because no fixture in the test suite contains the actual hard case — retry-1 fails on Test, retry-2 patches differently, retry-3 fails on a different gate. The §Test plan exit criterion ("retry-1 fails on Test, retry-2 passes") demonstrates only the favorable shape. There is no test where the failing gate moves between retries, which is the case where the cache shape matters. Why it matters: The perf targets are gameable by the chosen fixture. The synth should not treat them as honest. Where: §Goals row 12, §Test plan integration tests.
Hidden assumptions¶
-
Assumption: "Same patch_blake3 means same Install result." The design uses
patch_blake3+lockfile_blake3as the key for snapshot reuse and for verdict caching. What breaks if it's wrong: A patch that touches a.npmrc,.node-version,node-gypbuild descriptor, orpackage.json#engineschanges Install behavior but not the patch bytes' hash (if those files are unmodified by the patch but read bynpm ci). Two workflows on two repos with the same patch + lockfile but different.npmrcconfigurations get the same verdict. -
Assumption: "Cached verdicts are auditable on cache hit." The design says the cache hit "emits an audit.cache_replay event referencing the original BLAKE3-chained audit entry." What breaks if it's wrong: The Phase 11 PR evidence bundle needs to show the actual signals from this workflow, not a pointer to a historical entry. A merger looking at the PR cannot run a one-shot sandbox replay if the original evidence dir was GC'd at the 14-day boundary the design specifies. "Audit-chain extension" without preserved evidence is a citation to a deleted source.
-
Assumption: "Four concurrent gate slots per worker is safe with 1 GB Firecracker VMs." What breaks if it's wrong: §Resource & cost profile says 3.0 GB worker ceiling with "2 concurrent active VMs × 1 GB each." Then §Architecture says "4 concurrent gate slots." 4 × 1 GB + orchestrator 200 MB + chromadb 400 MB + RootfsCache 150 MB = 4.75 GB. The ceiling number contradicts the slot count. Whichever the synth picks, the other is wrong.
Things this design missed that a different lens caught¶
- Threat model. Performance never names the adversary. The security design's Phase 5 threat model says this is the first phase that executes attacker-influenced code — performance treats
npm cilike a benign tool invocation. The sharednode_modulessnapshot optimization across Install→Test is unimaginable under the security threat model and the perf design never acknowledges this trade. - No-credentials-in-sandbox. Best-practices and security both call out the "no creds in the sandbox env" invariant; performance's
SandboxClient.exec(env=Mapping[str,str])interface accepts arbitrary env — there is nothing in the contract forbidding the orchestrator from passingANTHROPIC_API_KEY. - Operator surface / config. Best-practices ships
codegenie sandbox {health,inspect,gc}and aSandboxHealthProbeas the "B2 analog." Performance ships no operator CLI and no health probe. On a cold machine, the first sandbox failure is a stack trace. - YAML gate catalog. Both other designs externalize gate definitions to data; performance bakes the six gate types into Python files. Adding a seventh gate (Phase 7 distroless work) is a code edit, not data — that violates extension-by-addition.
Attacks on the security-first design¶
Concrete problems¶
-
Problem: The design contradicts the roadmap on Docker-in-Docker for macOS dev and admits it: "I propose macOS gates run inside a Lima-managed Firecracker-or-gVisor VM" and "operators on macOS pay a Lima VM cost — non-negotiable." This makes Phase 5 unrunnable on developer laptops without operator-side platform work. The exit criterion in roadmap.md says "Local dev (including macOS): Docker-in-Docker, since it is the portable choice" — security explicitly throws this out. The
--unsafe-shared-kernel-gatesflag is the only escape and the design says verdicts produced under it are tainted forever. Net effect: a developer on a fresh MacBook cannot reproduce a CI gate failure locally without installing Lima + a microVM stack. Why it matters: Phase 6 (the next phase) tests the gate loop withinterrupt()+ checkpointer on developer laptops. If gates need Lima to run, Phase 6's tests need Lima to run. Phase 7 (distroless migration, the second task class and the test of extension-by-addition) needs gates to run during local development of the new probes. The security design's posture stalls every phase 6+. Where: §Lens summary "Contradiction-to-roadmap surfaced," §Goals item 1, §Risks risk-2, §Open questions #1. -
Problem: The
gate-control daemonis a separate process (codegenie-gatedas a systemd / launchd unit) communicating overAF_UNIXwith per-orchestrator-process HMAC keys. The design lists this under "components" without acknowledging that a daemon adds an entirely new failure mode the orchestrator never had: the daemon must be running before any gate runs. The §Failure modes table has a row for "Docker daemon unreachable" but not "codegenie-gated unreachable" or "codegenie-gated version-skew with orchestrator." Phase 5 was supposed to be a thin wrapper around an existing chokepoint; instead it ships its own OS-level service. Why it matters: Phase 5 is the foundation Phase 6's state machine consumes. Phase 6 assumes the gate is a function call; the security design makes it an out-of-process RPC to a systemd unit. Phase 6's tests now require a fixture that boots a daemon. Where: §Components "Gate-control daemon" entry, §Architecture diagram. -
Problem: §Goals item 6 specifies eight strict-AND signals (build, tests, delta_test_count, sast, cve_delta, new_shell, new_endpoint, trace_coverage). One is
tests.delta_test_count == 0— "the gate runs the pre-patch test inventory, not the post-patch one." This forbids the LLM from adding a test that proves the fix, which is a standard, legitimate engineering practice and is required by Phase 11's PR convention ("the agent ships tests proving the fix"). The §Open questions #4 admits the trade. Phase 4's exit criterion is "solves a breaking-change vuln end-to-end" — a major-version bump with API changes often legitimately removes/renames tests in the same patch. Phase 5 ships a gate that fails Phase 4's exit-criterion fixture. Why it matters: This forces every Phase 4 outcome to either fail Phase 5 or escalate to human review. The "three retries" loop becomes "always three retries then escalate." Where: §Goals item 6, §Threat model "Prompt-injection-pivoted adversary," §Open questions #4. -
Problem: SAST is added to the gate via
semgrep ci --config /sandbox/semgrep-ruleset@<pinned>baked into the rootfs. Phase 5 of the roadmap does not include SAST — the roadmap says "build/test/runtime gates." Phase 12 owns "Stage 4 validation depth" which is where contract testing and deeper validation live; ADR-0008 lists "SAST/DAST findings" as a signal but does not commit it to Phase 5. The security design quietly imports a Phase 12 deliverable into Phase 5 because "we're already running stuff in the microVM." This adds a maintenance burden (semgrep ruleset authorship, ruleset versioning, false-positive triage) the phase scope did not budget. Why it matters: Scope creep. Phase 5's exit criterion (per roadmap) is the three-retry loop demonstration. Adding SAST to the gate adds a whole new false-positive workflow the design has no plan for. Where: §Architecture ASCII workload step 4, §Goals item 6, §Components "microVM rootfs image." -
Problem: The "ephemeral microVM per gate, no warm pool" decision (§Goals item 7) combined with "every Phase 5 gate boots its own microVM" produces three cold microVM boots per workflow for the vuln remediation pipeline (Build/Install/Test as three gates), then up to three retries = nine cold boots worst case. At ~150 ms Firecracker × 9 = 1.35 s — fine; at ~1.5 s gVisor on Lima × 9 = 13.5 s; at "3–10 s nested QEMU" × 9 = up to 90 s of boot overhead alone. The §Resource & cost profile's "p95: 200s" buries this. The Phase 6 LangGraph state machine adds
interrupt()plumbing on top — every state transition is now a fresh microVM. Phase 5's "Cold-only in Phase 5 (no warm pool)" choice doesn't show up as a constraint in the perf budget. Why it matters: The hard "no warm pool ever" stance forces Phase 6 / Phase 9 (Temporal) to ship a warm pool that invalidates Phase 5's security model. The "no reuse" guarantee held by ADR-0012 §33 (every gate clean) is preserved only as long as no later phase needs throughput. Where: §Goals item 7, §Resource & cost profile, §Acknowledged blind spots first bullet. -
Problem: The "pre-patch test inventory is signed by the orchestrator before the gate; gate refuses to run if the in-VM test discovery returns a different set." The signing key lives on the orchestrator. The verifying logic lives inside the microVM init. If the microVM is the untrusted zone and the in-VM init is the only thing trusted with the HMAC key (§Lens summary, "in-VM HMAC key held by init, unreachable by the workload"), then a test-inventory tampering check inside the microVM is being performed by code that the design itself classifies as semi-trusted at best. Either the check happens outside the VM (you mounted the post-patch tree on the host — defeats isolation), or inside (verifier shares fate with attacker). The design is silent on which. Why it matters: The single most adversarial-resistant claim in the design ("LLM can't disable tests because the gate runs the pre-patch inventory") has no clean implementation path. §Failure modes row "LLM-produced patch deliberately disables a gate" claims this works; §Acknowledged blind spots last bullet admits "if the workload exploits init... the HMAC key leaks." Where: §Goals item 6
tests.delta_test_count, §Failure modes "LLM-produced patch deliberately disables a gate," §Acknowledged blind spots.
Hidden assumptions¶
-
Assumption: "Host-side eBPF will be available on every sandbox host." What breaks if it's wrong: §Goals item 6 requires
runtime_trace.coverage_ok == trueas one of the eight strict-AND signals; §Threat model "Runtime-trace tampering adversary" specifies host-side eBPF as the trusted trace source. macOS hosts have no eBPF. Lima-on-macOS has no privileged eBPF access into the inner microVM. The design admits this in §Risks risk-3 ("trace coverage check firesruntime_trace.coverage_ok == false; strict-AND treats it as a hard fail") — which means: every gate on every macOS dev machine fails by default. The design intends this; the operator experience is "no Phase 5 gate ever passes on a developer laptop." That is not a workable dev loop. -
Assumption: "The npm registry tarball SHA can be verified against the lockfile
integrityfield by the egress proxy before the bytes reach the microVM." What breaks if it's wrong: §Threat model "Cache-poisoning adversary" says "registry pull through a verifying proxy that asserts the tarball SHA against the lockfile'sintegrityfield."npm cialready does this check inside the workload. Doing it again at the proxy means the proxy must understand npm'sintegrityfield format and trust a lockfile inside the microVM (which is attacker-influenced bytes from the patch). The proxy is being asked to enforce the integrity field of a file whose authority it cannot independently verify. -
Assumption: "Per-gate fresh microVM with no persistent state means cache-poisoning is impossible across workflows." What breaks if it's wrong: The design says "build-cache scoped per-workflow per-run (no shared upper layer across workflows)" but also says base-image pulls go through
docker.io/cgr.dev. The base image is the cross-workflow shared state. A poisoned Chainguard base image (compromised maintainer, dependency confusion in the registry, takedown-and-replace by a typo-squatter on a digest that resolves at boot) compromises every gate of every workflow. The 7-day patch SLA is real but doesn't help in the 0-7 day window.
Things this design missed¶
- Cache for verdicts. Performance shows that retries with one gate failing produce 5/6 gates whose inputs haven't changed. Security insists on fresh microVMs and no verdict cache. So every retry pays every gate's full cost, not just the failing one. ADR-0014 specifies retries are bounded to 3, but at 3× the wall-clock and 3× the cost of every gate, the per-workflow budget (ADR-0025 — deferred to Phase 13 but the cap is real) is materially worse than necessary.
- The
ObjectiveSignalsPydantic model withextra="forbid"is one of the strongest ADR-0008 enforcements proposed. Best-practices also forbidsconfidencebut does it inGateSignal.details: dict[str, primitive]— adetails["confidence"]key bypasses the static check. Security catches this; best-practices doesn't. - YAML gate catalog. Security bakes gates into Python (the gate-control daemon dispatches to typed gate handlers). Adding a Phase 7 distroless gate (BaseImageProbe, ShellInvocationTraceProbe) means modifying the daemon. Best-practices' YAML catalog is the better extension-by-addition shape.
- Operator CLI. Best-practices ships
codegenie sandbox {health,inspect,gc}plus aSandboxHealthProbe(an ADR-0007 honest-confidence input). Security shipscodegenie sandbox provisionandcodegenie sandbox up/status— different surface, fewer subcommands, no health probe.
Attacks on the best-practices design¶
Concrete problems¶
-
Problem: §Goals row 1 says "0 deps for Firecracker stub (it's an out-of-process binary check)" and §Components 4 says "
FirecrackerClient... raisesBackendNotAvailable... unless--firecracker-experimentalis set on the CLI AND the host has KVM + the pinned firecracker binary digest." So Phase 5's Firecracker support is a stub that fails by default everywhere. The exit criterion the design claims (§Goals row 14) is met entirely on DiD. ADR-0019 explicitly names "cold-start latency tolerance" and "operational experience" as the evidence Phase 5 should generate. The stub generates none of it. Phase 16 will revisit ADR-0019 with zero new data from the phase that was supposed to produce data. Why it matters: The phase exists in part to start producing the evidence ADR-0019 needs. Best-practices ships a stub instead of an experiment. Where: §Goals row 8, §Components 4, §Open questions #1, §Risks risk-3 (admits the stub bit-rots without a weekly cron). -
Problem:
GateisABC,SandboxClientisProtocol. The justification (§Components 1) is "composition over inheritance — backends are duck-typed viaruntime_checkable." ThenGateis an ABC because... it has subclasses (StrictAndGate). Both shapes are used. Two different Python idioms for two contracts in the same package, with the design's own §Conventions honored claiming "best-practices rule 4: composition over inheritance." Pick one. The codebase already has aProbeABC (Phase 1) and aRecipeEngineABC (Phase 3); adding a Protocol next to them violates §"Match the codebase's conventions" (CLAUDE.md global rule 11). Why it matters: Convention drift inside the same PR. Phase 7 distroless will copy whichever idiom it sees first and now both idioms are sanctioned. Where: §Components 1 ("Why Protocol over ABC"), §Components 5 ("GateABC"), §Conventions honored item 4. -
Problem:
GateSignal.details: dict[str, str | int | bool]— explicitly a typed dict with primitive values, no nested structure, ≤ 8 keys. The four signal collectors emitdetails={"failing_tests":3, "first_failure":"auth/jwt.test.ts: ..."}(build/test) and the trace collector emits free strings naming "new entries." There is nothing in this schema that prevents a collector from emittingdetails["confidence"] = 0.87ordetails["llm_says_ok"] = True. The security design forbids these field names statically with a CI test on the Pydantic model; best-practices forbids them only in prose. ADR-0008 is enforced by convention, not by code. Why it matters: ADR-0008 is the load-bearing commitment for the trust score. A test introduced in Phase 6 (or Phase 15 recipe authoring) can slip aconfidencevalue intodetailsand the type system accepts it. Where: §Components 2GateSignaldefinition, §Conventions honored item 2. -
Problem: The "flaky-timer-node" fixture for the exit-criterion test (§Test plan,
test_stage6_retry_recovers.py) is described as: "apytest_dynamic_test_orderfixture rewrites a flag inside the sandbox between attempts via a sidecar marker file written by the gate YAML's attempt-2cmd." This is the test demonstrating the central exit criterion of the phase (ADR-0014's three-retry loop) — and it's a fixture whose attempt-1-fails-attempt-2-passes property is engineered by a sidecar marker file, not by anything resembling the real production case (where retry-2 fixes the patch via Phase 4'sRagLlmEngine.apply). The fixture doesn't exercise the retry loop's actual feedback path — theprior_failure_summaryfield going into Phase 4 and producing a different patch. It exercises the loop's control flow only. Why it matters: The exit-criterion test cannot detect a regression in the retry feedback semantics. CLAUDE.md global rule 9: "every test must encode WHY the behavior matters, not just WHAT it does." A test that hardcodes the pass-on-2 outcome via a marker file doesn't encode why retries matter. Where: §Test plan integration teststest_stage6_retry_recovers.py, §Data flow step 8. -
Problem: §Components 3
DockerInDockerClientadmits subprocess use ("subprocess is allowed only insidedid/build.pyfordocker buildx build...") because "the Python SDK's build progress streaming is awkward." Phase 1'srun_in_sandboxchokepoint and Phase 3's lockfile resolver both explicitly avoid subprocess shelling for the same operations. The design carves out a single-file exception with no test boundary preventing that exception from spreading. The "one chokepoint" promised by Phase 0/1/2/3 conventions is now "two chokepoints, one of them shells out." Why it matters: The very first subprocess exception always grows. By Phase 9 there will be three exceptions and the chokepoint property is gone. Where: §Components 3 internal design third bullet. -
Problem: §Components 9 "Signal collectors — four functions" includes
collect_policy_signal(run: SandboxRun, policy_yaml: Path) -> GateSignal. Thepolicy_yamlargument is read from somewhere. The design never says where —~/.config/codegenie/policy.yaml? In the repo (.codegenie/policy.yaml)? Baked into the gate-catalog YAML? If it's in-repo, it's attacker-influenced (the LLM-produced patch can modify it). If it's user-config, it's not portable across CI/dev. If it's baked, the operator can't tune it without an ADR amendment. The design silently leaves a real attack surface open by not specifying the trust source. Why it matters: Security caught this and pins SAST config to the digest-pinned rootfs (so the patch can't change it). Best-practices leaves the policy source unspecified — a Phase 11 PR can include.codegenie/policy.yamlmodifications and the gate would honor them. Where: §Components 9collect_policy_signalinterface, §Failure modes "Policy gate fails."
Hidden assumptions¶
-
Assumption: "strace is the boring choice that works on macOS DiD and Linux CI" (§Components 3 trace capture). What breaks if it's wrong: strace under Docker on macOS requires the
SYS_PTRACEcapability inside the container. Docker Desktop on macOS runs containers inside its own Linux VM, andSYS_PTRACEon processes in another container's namespace doesn't work out of the box without--cap-add=SYS_PTRACE,--security-opt seccomp=unconfined, and the docker version supporting it. The design says "the boring choice that works on macOS DiD" — it does not work without runtime config the design never lists. -
Assumption: "Phase 5 ships no LLM call. Fence CI extended" (§Goals row 15). What breaks if it's wrong: The §Data flow step 8 says "next
SandboxSpecBuildercall seesprior_attempts=[Attempt(1)]and (perstage6_validate.yaml) chooses a more verbosecmdfor attempt 2." That re-prompts Phase 4'sRagLlmEngine.applywithprior_attempts(§Components 6 internal design). Phase 4'sRagLlmEnginecalls Anthropic. Phase 5 does not directly invoke an LLM but each retry triggers an LLM call in Phase 4. The "tokens per run: 0" claim is honest only at the package boundary — every workflow that exercises retry-1-fail spends Phase 4 tokens × 2. Phase 13's per-workflow cap (ADR-0025) is the constraint, and Phase 5 will hit it routinely without admitting that Phase 5 is the driver. -
Assumption: "Two ABCs is the public surface; adding a fifth signal is a fifth file" (§Goals row 1, §Components 9 tradeoffs). What breaks if it's wrong: Phase 7's distroless task class needs new signal kinds (base-image-vulnerability count, shell-binary-presence). Adding them requires extending
GateSignal.kind: Literal["build","tests","trace","policy"]to include"baseimage"and"shell_presence". That's editing a Literal type in a shared module, which is an edit, not addition. Best-practices claims extension-by-addition; this Literal makes the kind enum closed.
Things this design missed¶
- Threat model. Best-practices spends one bullet in §Acknowledged blind spots on "security depth" — "we trust the DiD/Firecracker primitive. No seccomp profile authoring in Phase 5." There is no adversary named, no asset list, no isolation requirement. The security design's full threat model — orchestrator credentials, audit-chain integrity, runtime-trace tampering, prompt-injection-in-retry-feedback — is entirely absent from best-practices. Phase 5 is the first phase to execute LLM-produced code; that property does not show up here.
- Performance. §Goals rows 9–12 commit to latency budgets without a verdict cache, without snapshot reuse, without parallelism. §Acknowledged blind spots admits this. The result is honest but every retry pays full freight. Best-practices accepts the cost without surfacing what it would take to fix later.
- Egress restriction depth. Best-practices'
network_policy.yamlproduces iptables rules; the security design adds an in-VM egress proxy + byte cap + rate cap as defense-in-depth. A patch with apostinstallthat calls out toregistry.npmjs.org(an allowed domain) and exfiltrates 50 MB of data is invisible to best-practices' allowlist and visible to security's byte cap. - No-credentials-in-sandbox enforcement. Best-practices'
SandboxSpec.env: dict[str, str]accepts any env; the comment says "validated allowlist (PATH, NPM_CONFIG_*, NODE_ENV...)" but no code is shown enforcing the allowlist. The security design refuses env inheritance entirely.
Cross-design observations¶
Where do the three disagree?¶
| Dimension | Performance picks | Security picks | Best-practices picks | What's at stake |
|---|---|---|---|---|
| Sandbox stack default | Firecracker on Linux/CI, DinD on macOS | Firecracker on Linux/CI, gVisor-via-Lima on macOS (no shared-kernel anywhere) | DinD everywhere; Firecracker as a stub | macOS dev loop usability vs. isolation strength vs. ADR-0019 evidence gathering |
| Verdict / signal caching across retries | Content-addressed GateVerdictCache; passing gates cache, failing gate re-runs |
No verdict cache; every gate fresh microVM every time | No verdict cache (Phase 9 territory) | Wall-clock per retry × cost-per-workflow vs. cache-poisoning surface |
| Snapshot reuse across gates in a workflow | Yes — read-only node_modules snapshot Install→Test |
No — every gate is its own ephemeral microVM | No — per-gate fresh boot | 15–25 s saved per retry vs. ADR-0012 §33 "every gate starts clean" |
node_modules / build state |
Shared (read-only) within workflow | Strictly per-gate | Strictly per-gate | Same as above |
| LLM in retry feedback | Phase 4 replans via FallbackTier with failure signals |
Phase 4 replans, error log fenced + canary-checked + 8 KB truncated | Phase 4 replans via prior_attempts |
Token cost vs. prompt-injection-via-error-log surface |
| Gate definition shape | Six gate types as Python files | Eight strict-AND signals enforced in the Pydantic model | Two YAML catalogs (strict, loose) plus StrictAndGate Python class |
Extension by addition (Phase 7) vs. type-safety vs. operator tunability |
| Process topology | Single orchestrator process; gates are async tasks | Three processes: orchestrator → codegenie-gated daemon → microVM |
Single orchestrator process; gates are sync function calls | Operational complexity vs. credential confinement |
SignalAggregator / scorer relationship to Phase 3's TrustScorer |
Phase 5 ships its own SignalAggregator (replaces, doesn't extend) |
Phase 5 ships its own typed ObjectiveSignals model |
Phase 5 injects new signals into Phase 3's TrustScorer (extends) |
"Extension by addition" — which design honors it |
| Test inventory tampering | Not addressed | Pre-patch test inventory signed; in-VM discovery refused if it differs | Not addressed | LLM-disables-test attack vs. legitimate test-addition patches |
| SAST in Phase 5 | No | Yes (semgrep inside the rootfs) | No | Phase 5 scope creep |
| eBPF / strace for runtime trace | strace inside the VM (informational) + (Linux-only) eBPF | Host-side eBPF required; in-guest strace informational | strace inside the VM (DiD trace capture) | macOS dev viability |
| Three-retry override flag | Configurable via CLI knob | No override flag in Phase 5 (strict ADR-0014) | --max-attempts-override <int> requires --operator-ack |
Whether ADR-0014's "configurable" property survives Phase 5 |
| Health probe | None | Implicit via daemon | SandboxHealthProbe (ADR-0007 honest-confidence input) |
Operator surface; first-failure UX |
Which disagreement matters most for this phase?¶
The macOS dev-loop posture is the central conflict. The roadmap says DinD on macOS is the portable choice; security says shared-kernel Docker is unacceptable for executing LLM-produced code; performance picks DinD and gives it no perf budget; best-practices picks DinD and quietly admits its trace-capture path doesn't actually work without runtime config. Three phases downstream (Phase 6 state machine, Phase 7 distroless task class, Phase 9 Temporal) all assume gates run on developer laptops during local dev. Whatever the synth picks here propagates to those phases. Every other conflict in the table (verdict cache, signal aggregator vs. TrustScorer, gate-control daemon) is downstream of this one — once you know whether macOS gates run in a shared-kernel container or a microVM-in-Lima, the rest of the architecture follows.
The synth must answer: can Phase 5 ship a dev-loop that runs unmodified on a fresh MacBook? If yes, security's full posture is unachievable and the threat-model trade is real. If no, Phase 6+ all need a "Lima-up" prerequisite documented and tested.
Where do all three quietly agree on something questionable?¶
-
All three accept that the patch under test is copied into the sandbox in full. None of the three contemplate that the patch itself can contain a malicious
.gitignore, a.git/hooks/post-checkout, or apackage.json#scripts.preinstall(Phase 3's--ignore-scriptsonly blocksinstall/postinstall, notpreinstallor arbitrary scripts invoked bynpm run). Best-practicescmd: ["sh","-c","cd /work && npm ci --ignore-scripts && npm test"]—npm testruns arbitrary scripts by design. The shared assumption: "the microVM boundary makes this OK." It does for the orchestrator host. It does not for cache-poisoning attacks on the rootfs's persistent npm cache that all three either share (performance), rebake (security), or implicitly accept (best-practices). -
All three treat
runtime_traceas a strict gate signal but none calibrates the baseline. ADR-0008 says runtime-trace coverage is an objective signal; Phase 2 produces a trace baseline; all three Phase 5 designs diff against it. None addresses how the baseline is updated when a legitimate dependency upgrade introduces a new (benign) syscall pattern. Best-practices §Risks risk-4 names this as "trace baseline drift" and defers it to Phase 11. Security treatscoverage_ok==falseas a hard fail without a refresh path. Performance just consumes Phase 2's baseline as given. The shared blind spot: in production, this gate is going to fail benignly often, and the cost is reviewer fatigue (which Phase 11 will absorb without warning). -
All three assume Phase 4's
FallbackTierwill accept the failed-gate error log as a clean retry input. Performance: "re-invoke upstream planner (Phase 4FallbackTier) with the failure signals as context." Security: "fence-wrapped, truncated sandbox error log handed back to Phase 4." Best-practices: "prior_attemptsappended;RagLlmEnginere-invoked." None of the three checked Phase 4'sfinal-design.mdto confirmFallbackTier.run(...)accepts aprior_attempts: list[AttemptOutcome]kwarg shape with the exact fields each Phase 5 design proposes. Phase 4's exit criterion (a breaking-change vuln solved end-to-end) does not test the prior-attempts path. If Phase 4 has to amend its interface to satisfy Phase 5, Phase 4 ships incomplete.
Roadmap-level critiques¶
-
Phase 6 (LangGraph state machine) will not be a thin wrap of Phase 5's coordinator. Phase 5's roadmap entry says retries land here; Phase 6's roadmap entry says "the deterministic + LLM + sandbox loop is now stitched together as a proper state machine ... conditional edges are the Trust-Aware gates." Performance's
GateCoordinatoris sync-with-bounded-async; security's gate-control daemon is a separate process with HMAC handshakes; best-practices'GateRunneris sync. None of the three is a LangGraph node ready to lift unchanged. Phase 6 will either (a) re-implement the retry loop inside LangGraph, leaving Phase 5's coordinator as dead code, or (b) keep Phase 5's coordinator and inject LangGraphinterrupt()into the middle, which is an edit. Either way, "Phase 6 will wraprun()as a single graph node" (performance §Components, §Open questions #4) is wishful — the state machine has more transitions than Phase 5 acknowledges. -
Phase 7 distroless task class (the test of extension-by-addition) needs new gates added without editing Phase 5. Performance's six gate types are Python modules — adding a
BaseImageProbegate is a new file but adding ashell-presencesignal is an edit toSignalAggregator's known signal set. Security'sObjectiveSignalsPydantic model withextra="forbid"literally cannot accept a new signal without editing the model. Best-practices'GateSignal.kind: Literal["build","tests","trace","policy"]is a closed enum. All three of these violate the "extension by addition" invariant for Phase 7. The phase whose exit criterion is "the diff touches only new files" cannot ship against any of these three Phase 5 designs without modification. -
ADR-0019 is supposed to remain deferred until production evidence exists. Performance hard-pins Firecracker on Linux/CI as a "partial pre-emption." Security commits to gVisor on macOS-via-Lima + Firecracker on Linux/CI, which is two-stack production routing — explicitly listed as an "option considered" in ADR-0019 but never decided. Best-practices ships a Firecracker stub, generating no evidence. All three either resolve ADR-0019 prematurely or refuse to engage with it. The synth must either honor the deferral (in which case Phase 5 must ship multi-backend test coverage and produce comparable cold-start / cost numbers) or resolve it with new evidence.
-
Load-bearing commitment §2.5 "extension by addition" is at risk. Performance ships a new strict-AND scorer that replaces Phase 3's
TrustScorer(per its own §Data flow). Security ships a newObjectiveSignalstyped model that supplants Phase 3's signal set. Best-practices is the only one that injects new signals into Phase 3's existing scorer — and it does so by widening the type, which §Roadmap-level #2 above shows is itself a closed-enum edit. Phase 3 carefully cut aTrustScorerseam ("strict-AND objective signals only" with composable signal injection). Phase 5 must extend it, not replace it. The synth needs to enforce this and the best-practices design is the closest to honoring it — but not by a wide margin. -
Load-bearing commitment §2.1 "no LLM in the gather pipeline" extends to gates implicitly but is not literally what ADR-0008 says. ADR-0008 says "Trust score uses objective signals only — no LLM self-confidence." Production design.md §3.1 lists an LLM Judge persona (Functional-Equivalence Critic) for Stage 5 Validation "on disagreement when objective signals conflict." All three Phase 5 designs say "no LLM in Phase 5." Security explicitly defers the LLM Judge to "Phase 5+N." Performance says "no LLM at the gate (ADR-0008 invariant)." Best-practices says "no LLM call. Fence CI extended." But the production design's stage-5 persona table requires an LLM judge for adjudication. Phase 5 is the phase that introduces sandboxed gates; deferring the LLM-judge persona to a later phase that doesn't exist on the roadmap means the production target has a persona no phase ever ships. This is a roadmap gap, not a phase bug — but Phase 5 is where the synth notices it.
-
The audit chain extends across Phase 2 → Phase 3 → Phase 4 → Phase 5. Performance's
GateVerdictCacheemitsaudit.cache_replayevents referencing prior chain entries; security's gate-control daemon refuses to advance if the chain head mismatches at startup; best-practices'RetryLedgeris BLAKE3-chained into the prior chain. All three claim to extend the chain. None has verified that Phase 4'sFallbackTieraudit emissions (Phase 4'ssolved_example.duplicate_skippedevents, theengine_usedstamping, etc.) produce chain entries with the same shape Phase 5 will consume. If Phase 4's chain events don't include the fields Phase 5's chain verification expects, Phase 5 refuses to start.