Phase 04 — Vuln remediation: LLM fallback + solved-example RAG: Devil's-advocate critique¶

Reviewed by: Devil's-advocate critic subagent Date: 2026-05-18

Method¶

I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.

Attacks on the performance-first design¶

Concrete problems¶

DeterministicRetargeter is fan-fiction. Performance hangs its entire token-economy story on a "byte-applicable" RAG tier that retargets a prior solved example's typed PackageJsonEdit list onto the current VulnerabilityRecord without an LLM call, with a similarity floor of 0.97. The design then admits the substitution allowlist is {PackageVersionPin, SemverRangeBump, LockfileResolverHint} — i.e., lockfile-only edits. But the headline Phase 4 case the same design picks for its data-flow example is a major-bump call-site rewrite that by definition touches .ts source files outside package.json, where retargeter "refuses." So the retargeter is structurally inapplicable to the only Phase 4 case that justifies the phase. The "compounding savings → $0" asymptote is a number the design proves can't be reached on the exact case it claims to solve. Why it matters: The "≥90% tier-skip rate after 90 days" goal, the ≤ $0.004 / PR RAG-hit cost target, and the "asymptote to $0" story all collapse if the retargeter only fires on cases Phase 3 already handles for free. The performance lens has bought itself a goal it can't deliver. Where: §"Lens summary" bullet 1, §Component 3 (RagTier) "Branch" §C, §Component 8 (DeterministicRetargeter).
Embedding query keyed on a free-text string Phase 3 doesn't expose. The RAG query is f"vuln-remediation cve={ctx.cve.id} affected={ctx.affected_package.name} kind={ctx.failure_mode}". ctx.failure_mode is a typed tag emitted by recipe-tier; the design assumes it's stable enough that embedding it produces a useful cluster. But Phase 3 final-design §"NpmPeerDepConflictRecipe → NotApplicable" returns NotApplicable(reason=...) strings — no enum, no contract. So either Phase 4 freezes the reason strings (an undocumented coupling) or the embedding key drifts every time Phase 3 tweaks a string. Why it matters: Cache key drift = silent corpus invalidation. Two repos with the same CVE but a Phase 3 wording change ("major_bump_breaking" vs "major_version_break") cluster apart and the design's "second run hits RAG" exit criterion fails for reasons no one will debug. Where: §Component 3 (RagTier) Internal design step 1.
--plan-cache=on is opt-in because the design can't articulate when it's correct. The (plan_cache_key) → TierChainResult cache is documented as "Disabled by default because the LLM tier's output is non-deterministic across re-records; cassette replay already gives us reproducibility in CI." This is the design admitting that its own cache is unsafe to enable in the configuration (record_mode="none" cassette replay) the design also claims is the only way Phase 4 runs in CI. The cache only matters in operator-mode portfolio scans, which is exactly the workload the goals table optimizes for ("workflows/hour @ portfolio scale, 24 workers"). So the goal table is computed assuming the cache is on and the design ships it off. Why it matters: The 6,500/hr throughput target is a fiction unless --plan-cache=on is actually deployed; the design hides the assumption behind an opt-in flag with no calibration evidence. Where: §"Data flow" "Caches consulted" §3 + goal table row "Workflows/hour".
Prompt-cache hit rate target ≥ 65% requires a workload assumption the design doesn't own. The 5-minute TTL only stays warm if consecutive workflows on different repos share the same skill + RAG few-shot pair. But Phase 13's portfolio scan (which the design references with codegenie remediate-batch) by definition spans hundreds of different repos with different lockfiles, different CVEs, different RepoContext slices. Even when the skill block is shared, system[2] (RAG few-shot) is different per workflow, breaking cache contiguity at the second-most-cached block. The math in §"Token economy" assumes warm cache = 90% off on all three system blocks; it isn't. At most system[0] (skill) hits cache; system[1] (instruction template) maybe; system[2] (per-workflow few-shot) almost never. Why it matters: The "$0.010 warm" calculation is wrong by roughly the size of system[2] (1–3 KB). At portfolio scale this is the line item that decides whether the phase is cost-defensible. Where: §Component 7 internal design (4 cached blocks claim) + §"Token economy math" Warm calculation.
Mixing chromadb HNSW into the worker process when Phase 11 will need concurrent writes. The design ships chromadb local (single-writer) and explicitly defers the "Phase 11's merge-webhook-driven ingest_solved_example calls may concurrently write" problem to a Phase 11 adapter swap. But Phase 4 also ships ingest_solved_example as the write primitive, and the workflow runs every successful Phase 4 fix through store.add synchronously inside the workflow process. The single-writer limit will bite the moment Phase 4 itself runs more than one worker concurrently on the same store (which is exactly what the "24 workers" throughput target assumes). Why it matters: The design is internally inconsistent: 24 workers + single-writer-chromadb + per-workflow ingest = lock contention or silent dropped writes the moment two workflows validate within the same window. Where: §Component 5 "Tradeoffs accepted" + goal table row "Workflows/hour @ portfolio scale, 24 workers".

Hidden assumptions¶

Assumption: RAG queries are short enough that one embedding (~80 ms) is the only embedding-tier cost. What breaks if it's wrong: The design's "user block: RepoContext slice + the affected file's tree-sitter outline" implies the LLM gets file content; if the query embedding is also derived from file content (the obvious richer-query upgrade), every workflow embeds kilobytes, not the ~50-byte query string the latency math assumes.
Assumption: pytest-recording cassettes are stable enough that a cassette_blake3 lockfile is meaningful. What breaks if it's wrong: Anthropic SDK upgrades change response shapes; the request_id header, the cache-control echo in responses, and the usage.cache_creation_tokens field name are all SDK-version-coupled. The design assumes pinning anthropic>=0.x,<0.y is enough; in practice cassettes break on patch bumps.
Assumption: fastembed's ONNX session is deterministic across worker hosts. What breaks if it's wrong: ONNX runtime on different CPU architectures (x86_64 vs arm64) can produce slightly different float outputs at the 5th decimal place. Cosine similarity at the 0.97 byte-applicable floor is exactly where this matters. The "embedding cache is byte-identical across runs" property test (§Test plan unit tests) is a single-host check; it doesn't catch cross-architecture drift.

Things this design missed that a different lens caught¶

Cassette sanitization for secrets. Performance treats cassettes as a token-economy concern (replay = $0). Security caught that pytest-recording records Authorization headers verbatim. Performance's cassette plan has no sanitizer — a contributor recording cassettes locally leaks their API key into tests/cassettes/.
Output-shape constraint on the LLM. Performance says "the leaf is instructed to emit a structured Transform.from_json(...)" but the constraint is a prompt-instruction, not a structural one (no JSON schema enforced at the API boundary). Best-practices ships JSON schema'd output; security ships a Pydantic discriminated union PlanProposal. Performance has the cheapest enforcement (and the loosest).
Provenance refuse-mode short-circuit. Both security and best-practices call out that ADR-0038's vuln.provenance == Unknown must refuse before any LLM call. Performance never mentions provenance; the LLM tier will happily spend tokens on a base-image CVE that should never have reached it.

Attacks on the security-first design¶

Concrete problems¶

SPKI pinning for api.anthropic.com is operationally unworkable and the design acknowledges this then ships it anyway. §"Resource & cost profile" admits "When Anthropic rotates intermediates (~annually) we must ship a release." That's a release triggered by Anthropic's CA, not our own roadmap. The pin set is ANTHROPIC_SPKI_PINS: Final = frozenset({...}) with the "..." literally unspecified. Anthropic does not publish a stable SPKI for api.anthropic.com — they use commercial CAs and rotate. A pin miss EgressCertPinFailed halts every workflow. Why it matters: This is not defense-in-depth, this is a self-DOS waiting to happen. Anthropic's CDN may even rotate intermediates without notice; the design has no mechanism to refresh the pin set out-of-band. Where: §Component 4 (AnthropicLeafAdapter SPKI block) + §"Resource & cost profile" "non-trivial cost" note.
EgressGuard monkey-patches socket.create_connection at process start and then carves out 127.0.0.1 "unconditionally." §Component 12 admits this. But the leaf adapter uses the anthropic SDK which dials api.anthropic.com — that resolves via DNS to a non-loopback. Meanwhile pytest-httpserver (used by test_cert_pinning.py) and chromadb's "Phase 4 ADR ratifying the carve-out" (which doesn't exist yet) both want loopback. So the design has one monkey-patched socket layer doing two contradictory jobs (block-all-except-Anthropic and allow-all-loopback-for-tests) and the carve-out is exactly the bypass an attacker needs: any in-process binding to 127.0.0.1 (e.g., a poisoned local proxy injected by an adversarial dep on PyPI) is whitelisted. Why it matters: A loopback whitelist defeats the "EgressGuard catches dynamic uses" argument. The control's threat-model claim ("catches transitive deps opening sockets on import") fails the moment any dep opens a loopback socket — which chromadb itself might do for its embedded mode debug surface, and onnxruntime/torch definitely do for telemetry-shaped behaviors. Where: §Component 12 "Loopback (127.0.0.1, ::1) is permitted unconditionally".
PlanProposal closed sum type doesn't model the Phase 4 use case the roadmap asks for. The exit criterion is "a breaking-change vuln (e.g., major-version-bump CVE) is solved end-to-end with the LLM fallback." This requires call-site rewrites across N source files. The security design models PlanProposalCallsiteRewrite.diff: UnifiedDiff with a hard 32 KB cap and rejects anything else as PlanProposalRefuse. A real major-bump (e.g., Express 4 → 5, lodash 3 → 4) rewrites every callsite, which is routinely >32 KB of diff. The design admits this in Open Question 6 ("32 KB is generous; 16 KB or 8 KB caps would make the LLM physically incapable of emitting a sweeping rewrite") but doesn't acknowledge that the current 32 KB cap also kneecaps the goal. Why it matters: The phase exit criterion is not met by this design's worst-case cap. A real major-version-bump fix will refuse out via PlanProposalRefuse(reason="out_of_scope") and HITL-escalate — meaning Phase 4 ships a system that refuses to do the work Phase 4 exists to do. Where: §Component 2 (PlanProposal) "Tradeoffs accepted" + Open Question 6.
SolvedExampleWriteCapability claims unforgeability via "private-by-convention name" + import-linter. §Component 7 admits "the actual unforgeability is enforced by the fact that the minting function _mint_solved_example_capability(...) lives in src/codegenie/gates/_capability_mint.py (Phase 5) and is import-linter-blocked from being imported anywhere except the GateRunner and the writer." Two problems: (a) Phase 5 hasn't shipped yet; this is an unwritten dep. (b) Import-linter is a lint — it runs at CI time, not at runtime. A contributor running tests locally with import-linter off (a real configuration) can forge the capability. The "capability pattern" claim is a runtime contract that's actually enforced by a static check; it's a Python Pydantic model whose constructor is public by language semantics. Why it matters: The design's strongest claim about RAG poisoning resistance ("A capability token unforgeable except by Phase 5's GateRunner makes 'is this write authorized?' a type-level question, not a runtime if validated: check") is false on close reading. Where: §Component 7 + §"Design patterns applied" row 2 (Capability pattern).
The FenceWrapper + CanaryGuard defense is a regex-based denylist over INJECTION_PATTERNS: list[bytes]. §Component 3 lists patterns like Ignore (all )?(previous|prior|above) and BEGIN SYSTEM. The design says "the injection-pattern list is incomplete (it cannot be complete — that's adversarial-ML's open problem)." So the lens is committed to a security posture whose primary control the design admits is incomplete and updated only as new attack patterns surface. Worse, the canary scan happens before truncation per the code (truncated = self._truncate(payload, source_kind); canary_result = CanaryGuard.scan(truncated, nonce)), so an attacker who hides the injection past the truncation byte gets past the canary. Why it matters: The control is theater: a known incomplete denylist applied to truncated input, claimed as "every byte that enters an LLM prompt is fence-wrapped, canary-checked, and traceable." It's not "every byte" — it's "every byte that survives a 4–8 KB cap, checked against patterns we knew yesterday." Where: §Component 3 (FenceWrapper + CanaryGuard) + §"Acknowledged blind spots" bullet on adaptive injection learning.

Hidden assumptions¶

Assumption: BLAKE3-chained event log provenance per RAG record is enforceable across worker restarts and machine moves. What breaks if it's wrong: A record's provenance.event_chain_proof is "the chain segment from that head to the current head." If a worker restarts and resumes mid-workflow, or if a portfolio scan distributes across hosts (as the perf lens implies for Phase 13), the chain head is machine-local unless the BLAKE3 chain is a global write-ordered log — which contradicts Phase 4's "single-process CLI" framing.
Assumption: keyring works in CI and is not bypassed by the CODEGENIE_ANTHROPIC_KEY_CI escape. What breaks if it's wrong: The escape is "CI=1 and ALLOW_CI_ENV_KEY=1 (both required)." Setting both is one PR. The design hopes contributor culture enforces it; nothing in the code does.
Assumption: All untrusted bytes have a known source_kind. What breaks if it's wrong: prior_attempts from Phase 5 retry path is fenced as prior_attempt_summary. But Phase 5 hasn't shipped; the AttemptSummary shape is hypothetical. The design fences fields that don't exist yet, against a Phase 5 interface this design itself extends.

Things this design missed¶

Vendor-shim seam at the Protocol. Performance and best-practices both treat LeafLLM as the seam for ADR-0020's eventual multi-vendor un-deferral. Security pins to Anthropic at the network layer via SPKI plus an allowlist of one host. When ADR-0020 resolves, a second adapter cannot share the EgressGuard install (one host pin, one allowlist) without a runtime mutation. The design has effectively re-frozen ADR-0020 by making "add a second adapter" require an ADR amendment to the allowlist plus another SPKI pin set.
langgraph as a Phase-4 dep for Phase-6 forward-compat. Best-practices ships langgraph in Phase 4 to lock topology early. Performance treats Phase 4 as plain async Python pipeline. Security never mentions langgraph — meaning when Phase 6's state-machine refactor lands, the security guarantees (fence positions, capability flow, audit emission points) have to be re-proven against a graph executor the lens never reasoned about.
Cassette discipline as test correctness, not just secret hygiene. Security treats cassettes as a secret-exfil concern. Performance/best-practices treat cassette miss as a CI-fail mechanism that catches LLM-prompt drift. Security's test_cassettes_clean.py checks for secrets, but does not check that cassettes still match the SDK shape — the failure mode best-practices Risk 1 calls out ("cassette rot").

Attacks on the best-practices design¶

Concrete problems¶

langgraph admitted in Phase 4 is a roadmap violation dressed up as forward-compat. The roadmap §Phase 4 says "langgraph imported minimally — just enough to wrap the leaf agent invocation." The roadmap §Phase 6 says "the deterministic + LLM + sandbox loop is now stitched together as a proper state machine ... LangGraph is the runtime." The roadmap treats Phase 6 as the LangGraph introduction phase. Best-practices §Component 5 builds a three-node StateGraph with _LlmReplanState (a mutable Pydantic model — see Concrete Problem 3) and ships it in Phase 4 with the claim "Phase 6's state-machine refactor will lift this exact graph (via g.add_subgraph(...))." This is taking on the largest Phase 6 dep one phase early to "lock topology" — except the topology is three nodes with no conditional edges and no checkpointer. There's nothing to lock; this is langgraph added for the sake of using langgraph. Why it matters: The "Phase 6 migration cost" the design names in Open Question 3 is a problem the design has created, not solved. A three-node flat graph could be three function calls. The dep weight (~10 MB), the second framework whose semantics the next engineer must know, and the LangGraph state mutability hole (Concrete Problem 3) all arrive in Phase 4 for a future-coupling claim Phase 6 will mostly ignore. Where: §Component 5 (LlmReplanRecipe) + Open Question 3.
SentenceTxEmbedder defaults to all-MiniLM-L6-v2 and brings sentence-transformers → torch (~250 MB transitive) into the runtime. §"Resource & cost profile" calls this out: "torch is well-supported across our CI matrix ... install time grows by ~40 s on a cold CI run." Performance lens picks fastembed (130 MB ONNX, no torch) for the same job. The best-practices design picks the heavier dep because "the contributor can python -m codegenie.rag.inspect without docker-compose" — but fastembed also runs in-process with no docker. The design has picked the heavier dep without comparing against the lighter alternative on the very dimension (contributor friction) it claims to optimize for. Why it matters: The "contributor experience" justification for chromadb-vs-qdrant is the same justification that should kill sentence-transformers-vs-fastembed, and the design only applies it in one direction. Where: §Component 2 (SentenceTxEmbedder) + §"Resource & cost profile" torch line.
_LlmReplanState is frozen=False — the design's only mutable Pydantic model — and shipped as an idiomatic LangGraph requirement. §Component 5 has model_config = ConfigDict(frozen=False, extra="forbid"). §"Design patterns applied" row 6 names this "State pattern (LangGraph idiom)" and §"Conventions honored" boasts "every Pydantic model is frozen=True, extra='forbid'." The design states the convention then makes the one exception inside its largest new component. The design-patterns toolkit's Make illegal states unrepresentable and Side effects in constructors flags both light up — a mutable model whose mutation is "scoped to a single compiled graph invocation and never escapes" is an unverified claim. Why it matters: ADR-0033 §3 (domain-modeling discipline) and the codebase-wide frozen=True convention are why refactor-safety is plausible. One exception is one entry point for a class of bug the convention exists to prevent. Where: §Component 5 _LlmReplanState definition + §"Conventions honored".
Auto-harvest is deferred to Phase 11 but the roadmap exit criterion is "re-running the same case hits RAG, not LLM." §"Data flow" step 9: "Phase 4 ships the harvester as a separate CLI (codegenie rag harvest <workflow_id>) and does NOT auto-harvest in Phase 4 — auto-harvest waits for Phase 11's post-merge webhook." But the roadmap exit criterion requires the second run on the same case to be cheaper. If harvest is operator-invoked CLI-only, the "re-run hits RAG" claim is met only if an operator manually ran codegenie rag harvest between runs. Best-practices acknowledges this gap in Open Question 1 but ships the design with the gap. Why it matters: The roadmap exit criterion is unmet by this design unless an operator step is added to the test. The design test test_phase4_e2e_replay_lands_rag.py would have to manually invoke harvest between runs — which makes "exit criterion met" depend on test scaffolding, not production behavior. Where: §"Data flow" step 9 + Open Question 1 + §Test plan integration test.
RecipeOutcome is widened by "additive union widening" with MatchedFromRag and ReplannedByLlm — but Phase 3's RecipeOutcome is Applied | Skipped | Failed (per Phase 3 final-design §3). §"Conventions honored" bullet says "the RecipeOutcome Phase 3 already ships gains two non-LLM variants (MatchedFromRag, ReplannedByLlm) by additive union widening — not by Optional[X] fields on existing variants." MatchedFromRag contains an LLM input (the few-shot example) and is named "non-LLM" because retrieval is non-LLM. ReplannedByLlm is plainly an LLM-produced outcome. Both being added to RecipeOutcome widens a sum type used by Phase 3's deterministic strict-AND checks. Phase 5's already-merged design assumes RecipeOutcome variants are exhaustively matched (per Phase 3 ADR-0001 contract-snapshot); adding two variants is a Phase 5 contract change that the design claims is additive but actually requires every Phase 5 site to add two new case clauses or else assert_never fires. Why it matters: "Extension by addition" is being equivocated. Adding variants to a sum type ≠ adding a new sum type. The design's "zero edits to Phase 3" claim is structurally false at the Pydantic-discriminator level. Where: §"Conventions honored" "Domain modeling discipline" bullet + §Component 4 (RagMatchRecipe) + §Component 5 (LlmReplanRecipe).

Hidden assumptions¶

Assumption: pytest-recording cassette matching on ("method","path","body") is sufficient when the body is JSON with stable key ordering. What breaks if it's wrong: Pydantic's JSON serialization is key-stable but Anthropic's SDK may inject anthropic-version headers, metadata.user_id, or timestamp-shaped fields. Cassette miss in CI when nothing semantically changed.
Assumption: The store's "files-on-disk are canonical, sqlite is derived" property holds across chromadb upgrades. What breaks if it's wrong: chromadb schema migrations are not stable. codegenie rag rebuild (the recovery path) is described as one CLI command — but it has to re-embed every example and re-insert into a new sqlite. On a corpus of 1000+ examples, that's minutes. The design ships "Phase 11 expects <200 examples" without a migration plan.
Assumption: The single similarity threshold 0.78 will be calibrated in Phase 6.5 before Phase 4 actually ships. What breaks if it's wrong: Phase 6.5 hasn't shipped (the roadmap places it between Phase 6 and Phase 7). The 0.78 number is "hand-calibrated" and lives in YAML; until Phase 6.5 produces a labeled set, the threshold is a guess that determines the entire retrieval-quality story.

Things this design missed¶

Cost-cap circuit breaker. Performance has LlmInvocationGuard budget event emission. Security has LlmInvocationGuard as a capability-pattern with hard token caps. Best-practices ships no per-workflow budget cap — the design's risks list never mentions cost runaway. The design defers cost observability to Phase 13's ledger but ships nothing to enforce a cap in Phase 4. A bug in LeafLLM.invoke that retries on a never-resolving prompt will rack up tokens until a human notices.
Prompt-injection containment. Best-practices fences nothing. The only mention of adversarial input is Risk 4 about typecheck.node flakiness. The design treats untrusted bytes (CVE descriptions, repo content, RAG hits) as trusted enough to inline into the user prompt verbatim.
The Phase 3 Result[T, E] type. §Component 7 (SolvedExample.from_yaml) returns Result[SolvedExample, ParseError] "same shape as PluginScope.parse." Phase 3 final-design does not document a Result type — it documents Pydantic sum types and RecipeOutcome discriminated unions. Best-practices invents a Result shape and attributes it to Phase 3. Adopting Rust-style Result is fine but should not be invisible.

Cross-design observations¶

Where do the three disagree?¶

Dimension	Performance picks	Security picks	Best-practices picks	What's at stake
Embedding model	`fastembed` ONNX BGE-small (130 MB, no torch)	`sentence-transformers all-MiniLM-L6-v2` with `EmbeddingsBootstrap` offline-only, content-addressed	`sentence-transformers all-MiniLM-L6-v2` lazy-loaded	RSS, cold-start time, supply-chain surface
Vector store	chromadb persistent local	chromadb embedded mode only; qdrant explicitly rejected	chromadb persistent local	Concurrent writes, dep weight, ops simplicity
LangGraph in Phase 4?	No — plain async pipeline	No — sequential Python	Yes — three-node flat graph	Phase 6 migration cost vs current complexity
LLM output discipline	Prompt instruction asking for `Transform.from_json` + Pydantic validate at adapter boundary	Closed sum type `PlanProposal` + JSON schema at API boundary + smart-constructed paths	`_validate_lockfile_transform_shape` parser node in LangGraph	What an injected LLM can structurally emit
Per-workflow cost cap	Emits cost events; defers enforcement to Phase 13	Hard cap `LlmInvocationGuard` as capability token	Nothing — defers entirely to Phase 13	Blast radius on a runaway LLM call
Prompt injection containment	None — relies on Phase 5 gates	`FenceWrapper` + `CanaryGuard` + per-source caps + nonce	None	Whether adversarial bytes can hijack the LLM
Provenance refuse-mode	Not mentioned	Explicit short-circuit before LLM call	Inherited from Phase 3 `applies_to` short-circuit	Whether a base-image CVE silently spends LLM tokens
Auto-harvest of solved examples	Inline `ingest_solved_example` on `RemediationOutcome.validated`	Capability-gated by `SolvedExampleWriteCapability` (Phase 5 mints)	Deferred to Phase 11 operator CLI	Roadmap exit criterion ("second run hits RAG")
Network egress posture	No mention — Anthropic SDK called directly	`EgressGuard` monkey-patches socket + SPKI pin + DNS allowlist	No mention — Anthropic SDK called directly	Defense-in-depth vs operational fragility
Cassette secret hygiene	Not mentioned beyond `record_mode="none"`	`CassetteSanitizer` + CODEOWNERS gate + `test_cassettes_clean.py`	"Scrubbing: request headers `x-api-key`, `authorization`, `anthropic-version` are scrubbed"	API key exfil through committed cassettes
RAG retargeting	`DeterministicRetargeter` for byte-applicable RAG hits	Always feed RAG as few-shot to LLM (no retarget)	Always feed RAG as few-shot to LLM (no retarget)	The compounding-savings asymptote

Which disagreement matters most for this phase?¶

LLM output discipline (row 4) is the load-bearing disagreement. Phase 4 is the first phase where an LLM produces bytes the system applies — every other phase that uses an LLM does it for judgment (Assessment) or routing. The three designs disagree on whether the LLM's output is constrained at the API boundary (security: JSON schema'd discriminated union), parsed inside a state graph (best-practices), or trusted-then-Pydantic-validated (performance). This decision determines:

What an injected LLM can structurally do (security's closed sum type makes "shell command in output" representationally impossible; the other two designs let prose flow up to Pydantic for ad-hoc rejection).
Whether the roadmap's major-version-bump case is even expressible (security's 32 KB cap may refuse it; the other two have no cap and accept whatever fits in max_tokens).
Phase 5's interface. Phase 5 already ships RecipeOutcome consumption; widening it (best-practices) requires re-snapshotting the contract, whereas keeping RecipeOutcome shape and wrapping LLM output as Transform (performance, security) lets Phase 5 stay unchanged.

The synthesizer cannot defer this. Picking it wrong cascades through Phase 5's retry interface, Phase 6's state-machine node, and Phase 7's distroless-plugin claim that the plugin contract is frozen.

Where do all three quietly agree on something questionable?¶

All three use a single similarity threshold for RAG retrieval. Performance picks 0.92 / 0.97 floors, security uses 0.85, best-practices uses 0.78. The number varies; the shape of "one global float threshold over cosine similarity" is uniform. None of the three designs considers that retrieval quality at Phase-4-corpus-size (≤200 examples per best-practices) is a different problem than retrieval quality at portfolio scale. With <50 examples, every cosine-similarity threshold is gameable; the "top-K with threshold" pattern is a stand-in for "the corpus is too small for any retrieval to matter." Phase 6.5's "calibration in Phase 6.5" disclaimer is doing work all three lenses lean on.
All three treat chromadb as the in-tree vector store. Performance for cold-start, security for embedded-mode-no-network-listener, best-practices for "no docker-compose." None of them seriously evaluate that chromadb's schema migrations are unstable, its HNSW writer is single-threaded, and its DuckDB+parquet on-disk format is not git-attributable for fixture portability (despite performance claiming it is — chromadb's sqlite uses binary blobs that don't diff). The choice is by consensus, not by analysis.
All three treat the cassette layer as if it solves CI determinism. None of the three names the failure mode where cassettes work in CI but mask a real bug: if the cassette captures a working response from a buggy prompt, the cassette tests pass and the live behavior regresses. Performance's "test_no_live_anthropic_calls" test asserts the cassette layer works; it doesn't assert the content of cassettes still represents reality. Security mentions a nightly real-API drift job (one line); best-practices mentions a "quarterly cassette-refresh runbook" (one line). Cassettes are load-bearing in all three but operationally weak in all three.

Roadmap-level critiques¶

Does this phase set up problems for later phases?
Phase 5 retry interface coupling. All three designs reference prior_attempts: list[AttemptSummary] per Phase 5's already-merged design (Phase 3 ADR-0001's ApplyContext extension). Security treats this as a trust boundary (fence-wrapped); best-practices treats it as "Phase 5 owns retry"; performance treats it as a context input. The three lenses ship three different prior_attempts semantics. Phase 5 expects one.
Phase 6 LangGraph migration. Best-practices ships LangGraph in Phase 4 and claims Phase 6 inherits the topology. The other two designs ship plain async/sequential code; Phase 6 will have to either lift their code into LangGraph (cost) or accept that Phase 4's leaf-LLM-call site doesn't fit the state-machine framing (architectural debt).
Phase 6.5 cassette dependency. The roadmap names Phase 4 as the cassette discipline owner — Phase 6.5 reads cassette_blake3 per bench case. Performance commits to a cassettes.lock file Phase 6.5 reads; best-practices and security mention pytest-recording but don't pin per-case BLAKE3. If Phase 6.5 launches expecting that file, two of three designs don't ship it.
Phase 7 distroless plugin contract. Phase 7's exit criterion is "the diff for this phase touches only the new plugin directory." Best-practices widens RecipeOutcome with two variants — meaning Phase 7's plugin must add case MatchedFromRag / case ReplannedByLlm clauses just to compile, even though it has nothing to do with RAG or LLM. That's an edit Phase 7's exit criterion forbids.
Does it rely on something an earlier phase didn't actually establish?
Phase 3 final-design has RecipeOutcome as a sum type with Applied | Skipped | Failed. Best-practices widens it. Performance and security carry RecipeApplication (also Phase 3) through and avoid widening — but performance Component 1 uses outcome: RecipeOutcome in TierAttempt, which means the outcome shape is widened by whatever the tier emits.
All three reference vuln.provenance as a refuse-mode short-circuit. ADR-0038 says Phase 7 owns the full primitive; Phase 3 ships only the CVE_NOT_IN_APP_LAYER refuse-mode. Security explicitly notes this gap (Open Question 1) but ships the design depending on a Phase 7 primitive. Performance simply ignores provenance. Best-practices says "inherited from Phase 3 refuse-mode" which is the smallest legitimate use of the primitive available today.
Phase 3 ADR-0001 ships RemediationOrchestrator, TrustScorer, Transform, ApplyContext, RecipeEngine by name. Performance and best-practices reuse these names correctly. Security renames Stage 5 as "Plan" and routes FallbackTier outside RemediationOrchestrator — a structural disagreement with Phase 3 ADR-0001 that the design does not flag.
Does it violate any load-bearing commitment from production/design.md §2 or CLAUDE.md?
Commitment §2.1 (NO LLM in the gather pipeline). All three designs admit anthropic into the runtime closure. All three claim it's gated out of src/codegenie/probes/, coordinator/, cache/ via import-linter. None of them shows the new import-linter contract diff that allows anthropic and simultaneously fences it. The fence test (tests/unit/test_pyproject_fence.py) currently has FORBIDDEN_LLM_SDKS = {anthropic, langgraph, openai, langchain, transformers} — the three designs simultaneously want anthropic in the runtime closure and langgraph (best-practices) and sentence-transformers (security, best-practices) admitted. The fence amendment is the single most load-bearing change in Phase 4 and none of the three designs writes out the exact set membership change.
Commitment §2.2 (Facts, not judgments). Best-practices honors this explicitly (LLM self-confidence "is logged and discarded"). Performance honors it implicitly (TrustScorer extension is via objective signals). Security honors it via PlanProposalRefuse — but allows the LLM to emit a rationale: str ≤ 2 KB "for the audit log only — never re-prompted." If "rationale" survives anywhere that influences trust scoring, this commitment cracks.
Commitment §2.4 (Determinism over probabilism for structural changes). The major-bump call-site rewrite is exactly a structural change. The phase exists because Phase 3's deterministic recipes can't handle it. So Phase 4 is, in effect, the exception to commitment §2.4. The three designs all accept this but none acknowledges it as a commitment-stress: probabilistic source-code rewriting is what §2.4 says we don't do. The mitigation in all three is "the gate catches it after the fact" — which is not the same as the commitment, which says reserve LLM for judgment, not structural change.

Design-pattern critiques (cross-cutting)¶

Misapplied patterns (with concrete victim)¶

Best-practices, _LlmReplanState "State pattern (LangGraph idiom)". Calling a mutable Pydantic model that mutates through three nodes a "State pattern" is the passive state-bag, not the GoF State pattern (which is "the object's behavior changes based on its state, encapsulating each state as a class"). What's here is a record passed through functions. The pattern name is decoration.
Performance, TierChain "Chain of Responsibility + Pipeline". It's three async function calls in a for loop with short-circuit. Calling this "Chain of Responsibility" inflates a simple if elif ladder into a pattern name; performance §"Pattern not applied" even admits "Strategy" doesn't fit because "there aren't three interchangeable plan strategies" — which is also the argument against calling this Chain of Responsibility. It's a for loop.
Security, SolvedExampleWriteCapability "Capability pattern". True Capability patterns are unforgeable at the language/runtime level (object capability languages, or sealed-trait + private-constructor enforcement). What's shipped is a Pydantic model whose constructor Python makes public. The "unforgeability" comes from a CI-time import-linter rule. That's the Module Boundary pattern with a CI gate, not Capability.
Best-practices, SolvedExampleStore "Repository pattern + Adapter pattern (chromadb wrapped)". It's a class wrapping chromadb.PersistentClient with .add/.query/.digest/.close. Two pattern names for a wrapper class. The toolkit's Adapter that re-exports the same interface unchanged. If you didn't translate, you wrote a forwarder flag fires.
All three, LeafLLM Protocol "Hexagonal architecture / Ports and adapters". Hexagonal requires the domain core to be free of I/O concerns. All three designs put the LLM port in the orchestration layer that also writes to disk (events), reads from disk (RepoContext), and calls subprocess (npm install). The "port" is a Protocol; the architecture around it isn't hexagonal.

Missed patterns (with concrete location)¶

Specification pattern for the tier-skip / refuse-mode decisions. Performance's RagTier has nested if score >= floor and retargeter.can_retarget branches; security has if provenance is Unknown: refuse / if budget exceeded: refuse / if RAG chain orphan: exclude. Both are compositions of business rules ripe for Specification AND/OR with named conditions. None of the three uses it.
Command pattern for the LLM invocation. A LeafLLMRequest is exactly a Command (serializable, replayable, audit-anchorable). All three designs treat it as a transient call argument. Best-practices makes it Pydantic frozen — which is half the pattern. Performance and security pass it as ad-hoc kwargs.
Make illegal states unrepresentable for RecipeOutcome widening. Best-practices adds MatchedFromRag (carries an example + similarity) and ReplannedByLlm (carries a transform + response ID) by union widening. Either should be a separate sum type for the planning outcome, with the Phase 3 RecipeOutcome left untouched and consumed via composition. The toolkit's Tag-and-dispatch without a tagged union flag should fire on the next place in the codebase that consumes RecipeOutcome and now has two new case arms to add.

Pattern claims that don't survive scrutiny¶

Security: "Hexagonal architecture / Ports & adapters." §"Design patterns applied" row 1 names the leaf adapter as hexagonal. But the same row says "The model provider is the single dirtiest external dependency in the system. Containing it behind a port localizes every security control." Hexagonal isolates the domain from the infrastructure. The domain here (planning) isn't isolated from the infrastructure — the orchestrator calls egress_guard.pinned_to(...) directly, which is infrastructure leaking into the orchestration layer.
Performance: "Plugin architecture (not applied)." §"Design patterns applied" row 2 says Plugin architecture is overkill "(Phase 4 ships exactly one adapter each)." But the design ships three Protocols (SolvedExampleStore, Embedder, LeafLlm) with one adapter each — that's premature pluggability per the toolkit's flag-on-sight list. The design hasn't earned three abstraction layers; it has earned one (the LLM, where ADR-0020 is actually the un-deferral risk). The other two Protocols are speculative.
Best-practices: "Tagged union / sum type for state (ADR-0033)." §"Design patterns applied" row 4 names this for RetrievalOutcome = RagHit | RagMiss | RagDegraded. Fine. But the same design widens RecipeOutcome by addition (Concrete Problem 5), violating the "closed sum type" property that makes the tagged-union pattern valuable. The design uses the pattern in one place and breaks its closure-property in another.
All three: "Smart constructor." All three use Pydantic extra="forbid" and call it "smart constructor." Pydantic validates input; it does not refuse to expose the raw constructor. A SolvedExample(**bad_dict) raises; a SolvedExample.model_construct(**bad_dict) succeeds silently and bypasses validation. Until a test asserts model_construct is never called in production code (none of the three ships that test), "smart constructor" is wishful.

Anti-patterns from the toolkit's "flag on sight" list¶

Premature pluggability — performance, three Protocols with one adapter each (SolvedExampleStore, Embedder, LeafLlm). The toolkit says "If there's exactly one strategy, it's a function, not a Strategy." Two of three Protocols here have one impl and no announced second.
Premature pluggability — best-practices, Embedder Protocol over SentenceTxEmbedder with no second adapter. The design says "Behind an Embedder Protocol so a Voyage adapter can be added later." Adding the Protocol now without the second adapter is exactly the premature-pluggability anti-pattern.
Stringly-typed identifiers — performance, f"vuln-remediation cve={ctx.cve.id} affected={ctx.affected_package.name} kind={ctx.failure_mode}". The query key is a hand-formatted string; ctx.failure_mode is "the failure mode tag from recipe-tier" with no NewType, no enum. The toolkit's flag is explicit: every domain primitive shows up in 50 call sites; this string is one of them.
Boolean flag — security, EgressGuard.pinned_to(host) context manager that temporarily widens the allowlist. A boolean-shaped temporal mutation of process-global state (the socket allowlist) is the boolean flag at scale. The toolkit calls this out: "each flag doubles the behavioral surface." The design has two behaviors (pinned to host / not pinned) for one global resource.
Capability passed through ten frames — security, BudgetToken from LlmInvocationGuard.precharge threaded through FallbackTier → PromptBuilder → LeafLlmPort.invoke. Three frames is fine; the design's data flow §6–§8 shows the token traveling through every layer of the call stack. Toolkit flag: "That's a context object trying to escape — make it explicit."
Side effects in module import — security, EgressGuard "Installed at process start via src/codegenie/sitecustomize.py (auto-loaded by Python)." This is the textbook violation: import-time monkey-patching of socket.create_connection. Toolkit flag: "Side effects in constructors / module import time. Module load runs … Untestable." The test plan (test_egress_guard.py) needs to mutate the patched function — it's actively fighting the pattern.
Untyped dict[str, Any] interfaces — performance, TrustSignal(..., details={"new_errors": n}) and details={"timeout": True} and details={"degraded_reason": "no_tsconfig"}. The details field is dict[str, Any] shaped across multiple signal kinds. Best-practices does the same: details={"new_errors": int, "stderr_head": str}. The toolkit's flag is explicit: "A 'flexible' payload format is a contract you can't see and can't refactor."
Tag-and-dispatch without a tagged union — best-practices, _validate_lockfile_transform_shape as a node that "returns LeafLLMRefused (typed)" but is named per the shape it validates. The function's behavior depends on a shape-classification it computes; the dispatch is conditional inside one function. Refactoring _validate_lockfile_transform_shape to handle a second transform shape (Phase 7's DockerfileBaseImageTransform) means editing this function — the OCP-violating tag-and-dispatch pattern the toolkit warns against.