Phase 04 — Vuln remediation: LLM fallback + solved-example RAG: Devil's-advocate critique¶
Reviewed by: Devil's-advocate critic subagent Date: 2026-05-18
Method¶
I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.
Attacks on the performance-first design¶
Concrete problems¶
-
DeterministicRetargeteris fan-fiction. Performance hangs its entire token-economy story on a "byte-applicable" RAG tier that retargets a prior solved example's typedPackageJsonEditlist onto the currentVulnerabilityRecordwithout an LLM call, with a similarity floor of 0.97. The design then admits the substitution allowlist is{PackageVersionPin, SemverRangeBump, LockfileResolverHint}— i.e., lockfile-only edits. But the headline Phase 4 case the same design picks for its data-flow example is a major-bump call-site rewrite that by definition touches.tssource files outsidepackage.json, where retargeter "refuses." So the retargeter is structurally inapplicable to the only Phase 4 case that justifies the phase. The "compounding savings → $0" asymptote is a number the design proves can't be reached on the exact case it claims to solve. Why it matters: The "≥90% tier-skip rate after 90 days" goal, the≤ $0.004 / PR RAG-hitcost target, and the "asymptote to $0" story all collapse if the retargeter only fires on cases Phase 3 already handles for free. The performance lens has bought itself a goal it can't deliver. Where: §"Lens summary" bullet 1, §Component 3 (RagTier) "Branch" §C, §Component 8 (DeterministicRetargeter). -
Embedding query keyed on a free-text string Phase 3 doesn't expose. The RAG query is
f"vuln-remediation cve={ctx.cve.id} affected={ctx.affected_package.name} kind={ctx.failure_mode}".ctx.failure_modeis a typed tag emitted by recipe-tier; the design assumes it's stable enough that embedding it produces a useful cluster. But Phase 3 final-design §"NpmPeerDepConflictRecipe → NotApplicable" returnsNotApplicable(reason=...)strings — no enum, no contract. So either Phase 4 freezes the reason strings (an undocumented coupling) or the embedding key drifts every time Phase 3 tweaks a string. Why it matters: Cache key drift = silent corpus invalidation. Two repos with the same CVE but a Phase 3 wording change ("major_bump_breaking" vs "major_version_break") cluster apart and the design's "second run hits RAG" exit criterion fails for reasons no one will debug. Where: §Component 3 (RagTier) Internal design step 1. -
--plan-cache=onis opt-in because the design can't articulate when it's correct. The(plan_cache_key) → TierChainResultcache is documented as "Disabled by default because the LLM tier's output is non-deterministic across re-records; cassette replay already gives us reproducibility in CI." This is the design admitting that its own cache is unsafe to enable in the configuration (record_mode="none"cassette replay) the design also claims is the only way Phase 4 runs in CI. The cache only matters in operator-mode portfolio scans, which is exactly the workload the goals table optimizes for ("workflows/hour @ portfolio scale, 24 workers"). So the goal table is computed assuming the cache is on and the design ships it off. Why it matters: The 6,500/hr throughput target is a fiction unless--plan-cache=onis actually deployed; the design hides the assumption behind an opt-in flag with no calibration evidence. Where: §"Data flow" "Caches consulted" §3 + goal table row "Workflows/hour". -
Prompt-cache hit rate target ≥ 65% requires a workload assumption the design doesn't own. The 5-minute TTL only stays warm if consecutive workflows on different repos share the same skill + RAG few-shot pair. But Phase 13's portfolio scan (which the design references with
codegenie remediate-batch) by definition spans hundreds of different repos with different lockfiles, different CVEs, differentRepoContextslices. Even when the skill block is shared,system[2](RAG few-shot) is different per workflow, breaking cache contiguity at the second-most-cached block. The math in §"Token economy" assumes warm cache = 90% off on all three system blocks; it isn't. At mostsystem[0](skill) hits cache;system[1](instruction template) maybe;system[2](per-workflow few-shot) almost never. Why it matters: The "$0.010 warm" calculation is wrong by roughly the size ofsystem[2](1–3 KB). At portfolio scale this is the line item that decides whether the phase is cost-defensible. Where: §Component 7 internal design (4 cached blocks claim) + §"Token economy math" Warm calculation. -
Mixing chromadb HNSW into the worker process when Phase 11 will need concurrent writes. The design ships chromadb local (single-writer) and explicitly defers the "Phase 11's merge-webhook-driven
ingest_solved_examplecalls may concurrently write" problem to a Phase 11 adapter swap. But Phase 4 also shipsingest_solved_exampleas the write primitive, and the workflow runs every successful Phase 4 fix throughstore.addsynchronously inside the workflow process. The single-writer limit will bite the moment Phase 4 itself runs more than one worker concurrently on the same store (which is exactly what the "24 workers" throughput target assumes). Why it matters: The design is internally inconsistent: 24 workers + single-writer-chromadb + per-workflow ingest = lock contention or silent dropped writes the moment two workflows validate within the same window. Where: §Component 5 "Tradeoffs accepted" + goal table row "Workflows/hour @ portfolio scale, 24 workers".
Hidden assumptions¶
-
Assumption: RAG queries are short enough that one embedding (~80 ms) is the only embedding-tier cost. What breaks if it's wrong: The design's "user block: RepoContext slice + the affected file's tree-sitter outline" implies the LLM gets file content; if the query embedding is also derived from file content (the obvious richer-query upgrade), every workflow embeds kilobytes, not the ~50-byte query string the latency math assumes.
-
Assumption:
pytest-recordingcassettes are stable enough that acassette_blake3lockfile is meaningful. What breaks if it's wrong: Anthropic SDK upgrades change response shapes; therequest_idheader, the cache-control echo in responses, and theusage.cache_creation_tokensfield name are all SDK-version-coupled. The design assumes pinninganthropic>=0.x,<0.yis enough; in practice cassettes break on patch bumps. -
Assumption:
fastembed's ONNX session is deterministic across worker hosts. What breaks if it's wrong: ONNX runtime on different CPU architectures (x86_64 vs arm64) can produce slightly different float outputs at the 5th decimal place. Cosine similarity at the 0.97 byte-applicable floor is exactly where this matters. The "embedding cache is byte-identical across runs" property test (§Test plan unit tests) is a single-host check; it doesn't catch cross-architecture drift.
Things this design missed that a different lens caught¶
- Cassette sanitization for secrets. Performance treats cassettes as a token-economy concern (replay = $0). Security caught that
pytest-recordingrecordsAuthorizationheaders verbatim. Performance's cassette plan has no sanitizer — a contributor recording cassettes locally leaks their API key intotests/cassettes/. - Output-shape constraint on the LLM. Performance says "the leaf is instructed to emit a structured
Transform.from_json(...)" but the constraint is a prompt-instruction, not a structural one (no JSON schema enforced at the API boundary). Best-practices ships JSON schema'd output; security ships a Pydantic discriminated unionPlanProposal. Performance has the cheapest enforcement (and the loosest). - Provenance refuse-mode short-circuit. Both security and best-practices call out that ADR-0038's
vuln.provenance == Unknownmust refuse before any LLM call. Performance never mentions provenance; the LLM tier will happily spend tokens on a base-image CVE that should never have reached it.
Attacks on the security-first design¶
Concrete problems¶
-
SPKI pinning for
api.anthropic.comis operationally unworkable and the design acknowledges this then ships it anyway. §"Resource & cost profile" admits "When Anthropic rotates intermediates (~annually) we must ship a release." That's a release triggered by Anthropic's CA, not our own roadmap. The pin set isANTHROPIC_SPKI_PINS: Final = frozenset({...})with the "..." literally unspecified. Anthropic does not publish a stable SPKI forapi.anthropic.com— they use commercial CAs and rotate. A pin missEgressCertPinFailedhalts every workflow. Why it matters: This is not defense-in-depth, this is a self-DOS waiting to happen. Anthropic's CDN may even rotate intermediates without notice; the design has no mechanism to refresh the pin set out-of-band. Where: §Component 4 (AnthropicLeafAdapterSPKI block) + §"Resource & cost profile" "non-trivial cost" note. -
EgressGuardmonkey-patchessocket.create_connectionat process start and then carves out 127.0.0.1 "unconditionally." §Component 12 admits this. But the leaf adapter uses theanthropicSDK which dialsapi.anthropic.com— that resolves via DNS to a non-loopback. Meanwhilepytest-httpserver(used bytest_cert_pinning.py) and chromadb's "Phase 4 ADR ratifying the carve-out" (which doesn't exist yet) both want loopback. So the design has one monkey-patched socket layer doing two contradictory jobs (block-all-except-Anthropic and allow-all-loopback-for-tests) and the carve-out is exactly the bypass an attacker needs: any in-process binding to127.0.0.1(e.g., a poisoned local proxy injected by an adversarial dep on PyPI) is whitelisted. Why it matters: A loopback whitelist defeats the "EgressGuard catches dynamic uses" argument. The control's threat-model claim ("catches transitive deps opening sockets on import") fails the moment any dep opens a loopback socket — whichchromadbitself might do for its embedded mode debug surface, andonnxruntime/torchdefinitely do for telemetry-shaped behaviors. Where: §Component 12 "Loopback (127.0.0.1, ::1) is permitted unconditionally". -
PlanProposalclosed sum type doesn't model the Phase 4 use case the roadmap asks for. The exit criterion is "a breaking-change vuln (e.g., major-version-bump CVE) is solved end-to-end with the LLM fallback." This requires call-site rewrites across N source files. The security design modelsPlanProposalCallsiteRewrite.diff: UnifiedDiffwith a hard 32 KB cap and rejects anything else asPlanProposalRefuse. A real major-bump (e.g., Express 4 → 5, lodash 3 → 4) rewrites every callsite, which is routinely >32 KB of diff. The design admits this in Open Question 6 ("32 KB is generous; 16 KB or 8 KB caps would make the LLM physically incapable of emitting a sweeping rewrite") but doesn't acknowledge that the current 32 KB cap also kneecaps the goal. Why it matters: The phase exit criterion is not met by this design's worst-case cap. A real major-version-bump fix will refuse out viaPlanProposalRefuse(reason="out_of_scope")and HITL-escalate — meaning Phase 4 ships a system that refuses to do the work Phase 4 exists to do. Where: §Component 2 (PlanProposal) "Tradeoffs accepted" + Open Question 6. -
SolvedExampleWriteCapabilityclaims unforgeability via "private-by-convention name" + import-linter. §Component 7 admits "the actual unforgeability is enforced by the fact that the minting function_mint_solved_example_capability(...)lives insrc/codegenie/gates/_capability_mint.py(Phase 5) and is import-linter-blocked from being imported anywhere except theGateRunnerand the writer." Two problems: (a) Phase 5 hasn't shipped yet; this is an unwritten dep. (b) Import-linter is a lint — it runs at CI time, not at runtime. A contributor running tests locally with import-linter off (a real configuration) can forge the capability. The "capability pattern" claim is a runtime contract that's actually enforced by a static check; it's a PythonPydanticmodel whose constructor is public by language semantics. Why it matters: The design's strongest claim about RAG poisoning resistance ("A capability token unforgeable except by Phase 5's GateRunner makes 'is this write authorized?' a type-level question, not a runtime if validated: check") is false on close reading. Where: §Component 7 + §"Design patterns applied" row 2 (Capability pattern). -
The
FenceWrapper+CanaryGuarddefense is a regex-based denylist overINJECTION_PATTERNS: list[bytes]. §Component 3 lists patterns likeIgnore (all )?(previous|prior|above)andBEGIN SYSTEM. The design says "the injection-pattern list is incomplete (it cannot be complete — that's adversarial-ML's open problem)." So the lens is committed to a security posture whose primary control the design admits is incomplete and updated only as new attack patterns surface. Worse, the canary scan happens before truncation per the code (truncated = self._truncate(payload, source_kind); canary_result = CanaryGuard.scan(truncated, nonce)), so an attacker who hides the injection past the truncation byte gets past the canary. Why it matters: The control is theater: a known incomplete denylist applied to truncated input, claimed as "every byte that enters an LLM prompt is fence-wrapped, canary-checked, and traceable." It's not "every byte" — it's "every byte that survives a 4–8 KB cap, checked against patterns we knew yesterday." Where: §Component 3 (FenceWrapper+CanaryGuard) + §"Acknowledged blind spots" bullet on adaptive injection learning.
Hidden assumptions¶
-
Assumption:
BLAKE3-chained event logprovenance per RAG record is enforceable across worker restarts and machine moves. What breaks if it's wrong: A record'sprovenance.event_chain_proofis "the chain segment from that head to the current head." If a worker restarts and resumes mid-workflow, or if a portfolio scan distributes across hosts (as the perf lens implies for Phase 13), the chain head is machine-local unless the BLAKE3 chain is a global write-ordered log — which contradicts Phase 4's "single-process CLI" framing. -
Assumption:
keyringworks in CI and is not bypassed by theCODEGENIE_ANTHROPIC_KEY_CIescape. What breaks if it's wrong: The escape is "CI=1andALLOW_CI_ENV_KEY=1(both required)." Setting both is one PR. The design hopes contributor culture enforces it; nothing in the code does. -
Assumption: All untrusted bytes have a known
source_kind. What breaks if it's wrong:prior_attemptsfrom Phase 5 retry path is fenced asprior_attempt_summary. But Phase 5 hasn't shipped; theAttemptSummaryshape is hypothetical. The design fences fields that don't exist yet, against a Phase 5 interface this design itself extends.
Things this design missed¶
- Vendor-shim seam at the Protocol. Performance and best-practices both treat
LeafLLMas the seam for ADR-0020's eventual multi-vendor un-deferral. Security pins to Anthropic at the network layer via SPKI plus an allowlist of one host. When ADR-0020 resolves, a second adapter cannot share theEgressGuardinstall (one host pin, one allowlist) without a runtime mutation. The design has effectively re-frozen ADR-0020 by making "add a second adapter" require an ADR amendment to the allowlist plus another SPKI pin set. langgraphas a Phase-4 dep for Phase-6 forward-compat. Best-practices shipslanggraphin Phase 4 to lock topology early. Performance treats Phase 4 as plain async Python pipeline. Security never mentionslanggraph— meaning when Phase 6's state-machine refactor lands, the security guarantees (fence positions, capability flow, audit emission points) have to be re-proven against a graph executor the lens never reasoned about.- Cassette discipline as test correctness, not just secret hygiene. Security treats cassettes as a secret-exfil concern. Performance/best-practices treat cassette miss as a CI-fail mechanism that catches LLM-prompt drift. Security's
test_cassettes_clean.pychecks for secrets, but does not check that cassettes still match the SDK shape — the failure mode best-practices Risk 1 calls out ("cassette rot").
Attacks on the best-practices design¶
Concrete problems¶
-
langgraphadmitted in Phase 4 is a roadmap violation dressed up as forward-compat. The roadmap §Phase 4 says "langgraphimported minimally — just enough to wrap the leaf agent invocation." The roadmap §Phase 6 says "the deterministic + LLM + sandbox loop is now stitched together as a proper state machine ... LangGraph is the runtime." The roadmap treats Phase 6 as the LangGraph introduction phase. Best-practices §Component 5 builds a three-nodeStateGraphwith_LlmReplanState(a mutable Pydantic model — see Concrete Problem 3) and ships it in Phase 4 with the claim "Phase 6's state-machine refactor will lift this exact graph (viag.add_subgraph(...))." This is taking on the largest Phase 6 dep one phase early to "lock topology" — except the topology is three nodes with no conditional edges and no checkpointer. There's nothing to lock; this islanggraphadded for the sake of usinglanggraph. Why it matters: The "Phase 6 migration cost" the design names in Open Question 3 is a problem the design has created, not solved. A three-node flat graph could be three function calls. The dep weight (~10 MB), the second framework whose semantics the next engineer must know, and the LangGraph state mutability hole (Concrete Problem 3) all arrive in Phase 4 for a future-coupling claim Phase 6 will mostly ignore. Where: §Component 5 (LlmReplanRecipe) + Open Question 3. -
SentenceTxEmbedderdefaults toall-MiniLM-L6-v2and bringssentence-transformers→torch(~250 MB transitive) into the runtime. §"Resource & cost profile" calls this out: "torchis well-supported across our CI matrix ... install time grows by ~40 s on a cold CI run." Performance lens picksfastembed(130 MB ONNX, no torch) for the same job. The best-practices design picks the heavier dep because "the contributor canpython -m codegenie.rag.inspectwithout docker-compose" — butfastembedalso runs in-process with no docker. The design has picked the heavier dep without comparing against the lighter alternative on the very dimension (contributor friction) it claims to optimize for. Why it matters: The "contributor experience" justification for chromadb-vs-qdrant is the same justification that should killsentence-transformers-vs-fastembed, and the design only applies it in one direction. Where: §Component 2 (SentenceTxEmbedder) + §"Resource & cost profile"torchline. -
_LlmReplanStateisfrozen=False— the design's only mutable Pydantic model — and shipped as an idiomatic LangGraph requirement. §Component 5 hasmodel_config = ConfigDict(frozen=False, extra="forbid"). §"Design patterns applied" row 6 names this "State pattern (LangGraph idiom)" and §"Conventions honored" boasts "every Pydantic model isfrozen=True, extra='forbid'." The design states the convention then makes the one exception inside its largest new component. The design-patterns toolkit'sMake illegal states unrepresentableandSide effects in constructorsflags both light up — a mutable model whose mutation is "scoped to a single compiled graph invocation and never escapes" is an unverified claim. Why it matters: ADR-0033 §3 (domain-modeling discipline) and the codebase-widefrozen=Trueconvention are why refactor-safety is plausible. One exception is one entry point for a class of bug the convention exists to prevent. Where: §Component 5_LlmReplanStatedefinition + §"Conventions honored". -
Auto-harvest is deferred to Phase 11 but the roadmap exit criterion is "re-running the same case hits RAG, not LLM." §"Data flow" step 9: "Phase 4 ships the harvester as a separate CLI (
codegenie rag harvest <workflow_id>) and does NOT auto-harvest in Phase 4 — auto-harvest waits for Phase 11's post-merge webhook." But the roadmap exit criterion requires the second run on the same case to be cheaper. If harvest is operator-invoked CLI-only, the "re-run hits RAG" claim is met only if an operator manually rancodegenie rag harvestbetween runs. Best-practices acknowledges this gap in Open Question 1 but ships the design with the gap. Why it matters: The roadmap exit criterion is unmet by this design unless an operator step is added to the test. The design testtest_phase4_e2e_replay_lands_rag.pywould have to manually invokeharvestbetween runs — which makes "exit criterion met" depend on test scaffolding, not production behavior. Where: §"Data flow" step 9 + Open Question 1 + §Test plan integration test. -
RecipeOutcomeis widened by "additive union widening" withMatchedFromRagandReplannedByLlm— but Phase 3'sRecipeOutcomeisApplied | Skipped | Failed(per Phase 3 final-design §3). §"Conventions honored" bullet says "theRecipeOutcomePhase 3 already ships gains two non-LLM variants (MatchedFromRag,ReplannedByLlm) by additive union widening — not byOptional[X]fields on existing variants."MatchedFromRagcontains an LLM input (the few-shot example) and is named "non-LLM" because retrieval is non-LLM.ReplannedByLlmis plainly an LLM-produced outcome. Both being added toRecipeOutcomewidens a sum type used by Phase 3's deterministic strict-AND checks. Phase 5's already-merged design assumesRecipeOutcomevariants are exhaustivelymatched (per Phase 3 ADR-0001 contract-snapshot); adding two variants is a Phase 5 contract change that the design claims is additive but actually requires every Phase 5 site to add two newcaseclauses or elseassert_neverfires. Why it matters: "Extension by addition" is being equivocated. Adding variants to a sum type ≠ adding a new sum type. The design's "zero edits to Phase 3" claim is structurally false at the Pydantic-discriminator level. Where: §"Conventions honored" "Domain modeling discipline" bullet + §Component 4 (RagMatchRecipe) + §Component 5 (LlmReplanRecipe).
Hidden assumptions¶
-
Assumption:
pytest-recordingcassette matching on("method","path","body")is sufficient when the body is JSON with stable key ordering. What breaks if it's wrong: Pydantic's JSON serialization is key-stable but Anthropic's SDK may injectanthropic-versionheaders,metadata.user_id, or timestamp-shaped fields. Cassette miss in CI when nothing semantically changed. -
Assumption: The store's "files-on-disk are canonical, sqlite is derived" property holds across
chromadbupgrades. What breaks if it's wrong:chromadbschema migrations are not stable.codegenie rag rebuild(the recovery path) is described as one CLI command — but it has to re-embed every example and re-insert into a new sqlite. On a corpus of 1000+ examples, that's minutes. The design ships "Phase 11 expects <200 examples" without a migration plan. -
Assumption: The single similarity threshold
0.78will be calibrated in Phase 6.5 before Phase 4 actually ships. What breaks if it's wrong: Phase 6.5 hasn't shipped (the roadmap places it between Phase 6 and Phase 7). The 0.78 number is "hand-calibrated" and lives in YAML; until Phase 6.5 produces a labeled set, the threshold is a guess that determines the entire retrieval-quality story.
Things this design missed¶
- Cost-cap circuit breaker. Performance has
LlmInvocationGuardbudget event emission. Security hasLlmInvocationGuardas a capability-pattern with hard token caps. Best-practices ships no per-workflow budget cap — the design's risks list never mentions cost runaway. The design defers cost observability to Phase 13's ledger but ships nothing to enforce a cap in Phase 4. A bug inLeafLLM.invokethat retries on a never-resolving prompt will rack up tokens until a human notices. - Prompt-injection containment. Best-practices fences nothing. The only mention of adversarial input is Risk 4 about
typecheck.nodeflakiness. The design treats untrusted bytes (CVE descriptions, repo content, RAG hits) as trusted enough to inline into the user prompt verbatim. - The Phase 3
Result[T, E]type. §Component 7 (SolvedExample.from_yaml) returnsResult[SolvedExample, ParseError]"same shape asPluginScope.parse." Phase 3 final-design does not document aResulttype — it documents Pydantic sum types andRecipeOutcomediscriminated unions. Best-practices invents aResultshape and attributes it to Phase 3. Adopting Rust-styleResultis fine but should not be invisible.
Cross-design observations¶
Where do the three disagree?¶
| Dimension | Performance picks | Security picks | Best-practices picks | What's at stake |
|---|---|---|---|---|
| Embedding model | fastembed ONNX BGE-small (130 MB, no torch) |
sentence-transformers all-MiniLM-L6-v2 with EmbeddingsBootstrap offline-only, content-addressed |
sentence-transformers all-MiniLM-L6-v2 lazy-loaded |
RSS, cold-start time, supply-chain surface |
| Vector store | chromadb persistent local | chromadb embedded mode only; qdrant explicitly rejected | chromadb persistent local | Concurrent writes, dep weight, ops simplicity |
| LangGraph in Phase 4? | No — plain async pipeline | No — sequential Python | Yes — three-node flat graph | Phase 6 migration cost vs current complexity |
| LLM output discipline | Prompt instruction asking for Transform.from_json + Pydantic validate at adapter boundary |
Closed sum type PlanProposal + JSON schema at API boundary + smart-constructed paths |
_validate_lockfile_transform_shape parser node in LangGraph |
What an injected LLM can structurally emit |
| Per-workflow cost cap | Emits cost events; defers enforcement to Phase 13 | Hard cap LlmInvocationGuard as capability token |
Nothing — defers entirely to Phase 13 | Blast radius on a runaway LLM call |
| Prompt injection containment | None — relies on Phase 5 gates | FenceWrapper + CanaryGuard + per-source caps + nonce |
None | Whether adversarial bytes can hijack the LLM |
| Provenance refuse-mode | Not mentioned | Explicit short-circuit before LLM call | Inherited from Phase 3 applies_to short-circuit |
Whether a base-image CVE silently spends LLM tokens |
| Auto-harvest of solved examples | Inline ingest_solved_example on RemediationOutcome.validated |
Capability-gated by SolvedExampleWriteCapability (Phase 5 mints) |
Deferred to Phase 11 operator CLI | Roadmap exit criterion ("second run hits RAG") |
| Network egress posture | No mention — Anthropic SDK called directly | EgressGuard monkey-patches socket + SPKI pin + DNS allowlist |
No mention — Anthropic SDK called directly | Defense-in-depth vs operational fragility |
| Cassette secret hygiene | Not mentioned beyond record_mode="none" |
CassetteSanitizer + CODEOWNERS gate + test_cassettes_clean.py |
"Scrubbing: request headers x-api-key, authorization, anthropic-version are scrubbed" |
API key exfil through committed cassettes |
| RAG retargeting | DeterministicRetargeter for byte-applicable RAG hits |
Always feed RAG as few-shot to LLM (no retarget) | Always feed RAG as few-shot to LLM (no retarget) | The compounding-savings asymptote |
Which disagreement matters most for this phase?¶
LLM output discipline (row 4) is the load-bearing disagreement. Phase 4 is the first phase where an LLM produces bytes the system applies — every other phase that uses an LLM does it for judgment (Assessment) or routing. The three designs disagree on whether the LLM's output is constrained at the API boundary (security: JSON schema'd discriminated union), parsed inside a state graph (best-practices), or trusted-then-Pydantic-validated (performance). This decision determines:
- What an injected LLM can structurally do (security's closed sum type makes "shell command in output" representationally impossible; the other two designs let prose flow up to Pydantic for ad-hoc rejection).
- Whether the roadmap's major-version-bump case is even expressible (security's 32 KB cap may refuse it; the other two have no cap and accept whatever fits in
max_tokens). - Phase 5's interface. Phase 5 already ships
RecipeOutcomeconsumption; widening it (best-practices) requires re-snapshotting the contract, whereas keepingRecipeOutcomeshape and wrapping LLM output asTransform(performance, security) lets Phase 5 stay unchanged.
The synthesizer cannot defer this. Picking it wrong cascades through Phase 5's retry interface, Phase 6's state-machine node, and Phase 7's distroless-plugin claim that the plugin contract is frozen.
Where do all three quietly agree on something questionable?¶
-
All three use a single similarity threshold for RAG retrieval. Performance picks 0.92 / 0.97 floors, security uses 0.85, best-practices uses 0.78. The number varies; the shape of "one global float threshold over cosine similarity" is uniform. None of the three designs considers that retrieval quality at Phase-4-corpus-size (≤200 examples per best-practices) is a different problem than retrieval quality at portfolio scale. With <50 examples, every cosine-similarity threshold is gameable; the "top-K with threshold" pattern is a stand-in for "the corpus is too small for any retrieval to matter." Phase 6.5's "calibration in Phase 6.5" disclaimer is doing work all three lenses lean on.
-
All three treat
chromadbas the in-tree vector store. Performance for cold-start, security for embedded-mode-no-network-listener, best-practices for "no docker-compose." None of them seriously evaluate thatchromadb's schema migrations are unstable, its HNSW writer is single-threaded, and its DuckDB+parquet on-disk format is not git-attributable for fixture portability (despite performance claiming it is —chromadb's sqlite uses binary blobs that don't diff). The choice is by consensus, not by analysis. -
All three treat the cassette layer as if it solves CI determinism. None of the three names the failure mode where cassettes work in CI but mask a real bug: if the cassette captures a working response from a buggy prompt, the cassette tests pass and the live behavior regresses. Performance's "test_no_live_anthropic_calls" test asserts the cassette layer works; it doesn't assert the content of cassettes still represents reality. Security mentions a nightly real-API drift job (one line); best-practices mentions a "quarterly cassette-refresh runbook" (one line). Cassettes are load-bearing in all three but operationally weak in all three.
Roadmap-level critiques¶
-
Does this phase set up problems for later phases?
-
Phase 5 retry interface coupling. All three designs reference
prior_attempts: list[AttemptSummary]per Phase 5's already-merged design (Phase 3 ADR-0001's ApplyContext extension). Security treats this as a trust boundary (fence-wrapped); best-practices treats it as "Phase 5 owns retry"; performance treats it as a context input. The three lenses ship three differentprior_attemptssemantics. Phase 5 expects one. - Phase 6 LangGraph migration. Best-practices ships LangGraph in Phase 4 and claims Phase 6 inherits the topology. The other two designs ship plain async/sequential code; Phase 6 will have to either lift their code into LangGraph (cost) or accept that Phase 4's leaf-LLM-call site doesn't fit the state-machine framing (architectural debt).
- Phase 6.5 cassette dependency. The roadmap names Phase 4 as the cassette discipline owner — Phase 6.5 reads
cassette_blake3per bench case. Performance commits to acassettes.lockfile Phase 6.5 reads; best-practices and security mention pytest-recording but don't pin per-case BLAKE3. If Phase 6.5 launches expecting that file, two of three designs don't ship it. -
Phase 7 distroless plugin contract. Phase 7's exit criterion is "the diff for this phase touches only the new plugin directory." Best-practices widens
RecipeOutcomewith two variants — meaning Phase 7's plugin must addcase MatchedFromRag/case ReplannedByLlmclauses just to compile, even though it has nothing to do with RAG or LLM. That's an edit Phase 7's exit criterion forbids. -
Does it rely on something an earlier phase didn't actually establish?
-
Phase 3 final-design has
RecipeOutcomeas a sum type withApplied | Skipped | Failed. Best-practices widens it. Performance and security carryRecipeApplication(also Phase 3) through and avoid widening — but performance Component 1 usesoutcome: RecipeOutcomeinTierAttempt, which means the outcome shape is widened by whatever the tier emits. - All three reference
vuln.provenanceas a refuse-mode short-circuit. ADR-0038 says Phase 7 owns the full primitive; Phase 3 ships only theCVE_NOT_IN_APP_LAYERrefuse-mode. Security explicitly notes this gap (Open Question 1) but ships the design depending on a Phase 7 primitive. Performance simply ignores provenance. Best-practices says "inherited from Phase 3 refuse-mode" which is the smallest legitimate use of the primitive available today. -
Phase 3 ADR-0001 ships
RemediationOrchestrator,TrustScorer,Transform,ApplyContext,RecipeEngineby name. Performance and best-practices reuse these names correctly. Security renames Stage 5 as "Plan" and routesFallbackTieroutsideRemediationOrchestrator— a structural disagreement with Phase 3 ADR-0001 that the design does not flag. -
Does it violate any load-bearing commitment from
production/design.md§2 orCLAUDE.md? -
Commitment §2.1 (NO LLM in the gather pipeline). All three designs admit
anthropicinto the runtime closure. All three claim it's gated out ofsrc/codegenie/probes/,coordinator/,cache/via import-linter. None of them shows the new import-linter contract diff that allowsanthropicand simultaneously fences it. The fence test (tests/unit/test_pyproject_fence.py) currently hasFORBIDDEN_LLM_SDKS = {anthropic, langgraph, openai, langchain, transformers}— the three designs simultaneously wantanthropicin the runtime closure andlanggraph(best-practices) andsentence-transformers(security, best-practices) admitted. The fence amendment is the single most load-bearing change in Phase 4 and none of the three designs writes out the exact set membership change. - Commitment §2.2 (Facts, not judgments). Best-practices honors this explicitly (LLM self-confidence "is logged and discarded"). Performance honors it implicitly (TrustScorer extension is via objective signals). Security honors it via
PlanProposalRefuse— but allows the LLM to emit arationale: str ≤ 2 KB"for the audit log only — never re-prompted." If "rationale" survives anywhere that influences trust scoring, this commitment cracks. - Commitment §2.4 (Determinism over probabilism for structural changes). The major-bump call-site rewrite is exactly a structural change. The phase exists because Phase 3's deterministic recipes can't handle it. So Phase 4 is, in effect, the exception to commitment §2.4. The three designs all accept this but none acknowledges it as a commitment-stress: probabilistic source-code rewriting is what §2.4 says we don't do. The mitigation in all three is "the gate catches it after the fact" — which is not the same as the commitment, which says reserve LLM for judgment, not structural change.
Design-pattern critiques (cross-cutting)¶
Misapplied patterns (with concrete victim)¶
- Best-practices,
_LlmReplanState"State pattern (LangGraph idiom)". Calling a mutable Pydantic model that mutates through three nodes a "State pattern" is the passive state-bag, not the GoF State pattern (which is "the object's behavior changes based on its state, encapsulating each state as a class"). What's here is a record passed through functions. The pattern name is decoration. - Performance,
TierChain"Chain of Responsibility + Pipeline". It's three async function calls in aforloop with short-circuit. Calling this "Chain of Responsibility" inflates a simpleif elifladder into a pattern name; performance §"Pattern not applied" even admits "Strategy" doesn't fit because "there aren't three interchangeable plan strategies" — which is also the argument against calling this Chain of Responsibility. It's aforloop. - Security,
SolvedExampleWriteCapability"Capability pattern". True Capability patterns are unforgeable at the language/runtime level (object capability languages, or sealed-trait + private-constructor enforcement). What's shipped is a Pydantic model whose constructor Python makes public. The "unforgeability" comes from a CI-time import-linter rule. That's the Module Boundary pattern with a CI gate, not Capability. - Best-practices,
SolvedExampleStore"Repository pattern + Adapter pattern (chromadb wrapped)". It's a class wrappingchromadb.PersistentClientwith.add/.query/.digest/.close. Two pattern names for a wrapper class. The toolkit'sAdapter that re-exports the same interface unchanged. If you didn't translate, you wrote a forwarderflag fires. - All three,
LeafLLMProtocol "Hexagonal architecture / Ports and adapters". Hexagonal requires the domain core to be free of I/O concerns. All three designs put the LLM port in the orchestration layer that also writes to disk (events), reads from disk (RepoContext), and calls subprocess (npm install). The "port" is a Protocol; the architecture around it isn't hexagonal.
Missed patterns (with concrete location)¶
- Specification pattern for the tier-skip / refuse-mode decisions. Performance's
RagTierhas nestedif score >= floor and retargeter.can_retargetbranches; security hasif provenance is Unknown: refuse / if budget exceeded: refuse / if RAG chain orphan: exclude. Both are compositions of business rules ripe forSpecification AND/ORwith named conditions. None of the three uses it. - Command pattern for the LLM invocation. A
LeafLLMRequestis exactly a Command (serializable, replayable, audit-anchorable). All three designs treat it as a transient call argument. Best-practices makes it Pydantic frozen — which is half the pattern. Performance and security pass it as ad-hoc kwargs. - Make illegal states unrepresentable for
RecipeOutcomewidening. Best-practices addsMatchedFromRag(carries an example + similarity) andReplannedByLlm(carries a transform + response ID) by union widening. Either should be a separate sum type for the planning outcome, with the Phase 3RecipeOutcomeleft untouched and consumed via composition. The toolkit'sTag-and-dispatch without a tagged unionflag should fire on the next place in the codebase that consumesRecipeOutcomeand now has two newcasearms to add.
Pattern claims that don't survive scrutiny¶
- Security: "Hexagonal architecture / Ports & adapters." §"Design patterns applied" row 1 names the leaf adapter as hexagonal. But the same row says "The model provider is the single dirtiest external dependency in the system. Containing it behind a port localizes every security control." Hexagonal isolates the domain from the infrastructure. The domain here (planning) isn't isolated from the infrastructure — the orchestrator calls
egress_guard.pinned_to(...)directly, which is infrastructure leaking into the orchestration layer. - Performance: "Plugin architecture (not applied)." §"Design patterns applied" row 2 says Plugin architecture is overkill "(Phase 4 ships exactly one adapter each)." But the design ships three Protocols (
SolvedExampleStore,Embedder,LeafLlm) with one adapter each — that's premature pluggability per the toolkit's flag-on-sight list. The design hasn't earned three abstraction layers; it has earned one (the LLM, where ADR-0020 is actually the un-deferral risk). The other two Protocols are speculative. - Best-practices: "Tagged union / sum type for state (ADR-0033)." §"Design patterns applied" row 4 names this for
RetrievalOutcome = RagHit | RagMiss | RagDegraded. Fine. But the same design widensRecipeOutcomeby addition (Concrete Problem 5), violating the "closed sum type" property that makes the tagged-union pattern valuable. The design uses the pattern in one place and breaks its closure-property in another. - All three: "Smart constructor." All three use
Pydanticextra="forbid"and call it "smart constructor." Pydantic validates input; it does not refuse to expose the raw constructor. ASolvedExample(**bad_dict)raises; aSolvedExample.model_construct(**bad_dict)succeeds silently and bypasses validation. Until a test assertsmodel_constructis never called in production code (none of the three ships that test), "smart constructor" is wishful.
Anti-patterns from the toolkit's "flag on sight" list¶
- Premature pluggability — performance, three Protocols with one adapter each (
SolvedExampleStore,Embedder,LeafLlm). The toolkit says "If there's exactly one strategy, it's a function, not a Strategy." Two of three Protocols here have one impl and no announced second. - Premature pluggability — best-practices,
EmbedderProtocol overSentenceTxEmbedderwith no second adapter. The design says "Behind anEmbedderProtocol so a Voyage adapter can be added later." Adding the Protocol now without the second adapter is exactly the premature-pluggability anti-pattern. - Stringly-typed identifiers — performance,
f"vuln-remediation cve={ctx.cve.id} affected={ctx.affected_package.name} kind={ctx.failure_mode}". The query key is a hand-formatted string;ctx.failure_modeis "the failure mode tag from recipe-tier" with no NewType, no enum. The toolkit's flag is explicit: every domain primitive shows up in 50 call sites; this string is one of them. - Boolean flag — security,
EgressGuard.pinned_to(host)context manager that temporarily widens the allowlist. A boolean-shaped temporal mutation of process-global state (the socket allowlist) is the boolean flag at scale. The toolkit calls this out: "each flag doubles the behavioral surface." The design has two behaviors (pinned to host / not pinned) for one global resource. - Capability passed through ten frames — security,
BudgetTokenfromLlmInvocationGuard.prechargethreaded throughFallbackTier → PromptBuilder → LeafLlmPort.invoke. Three frames is fine; the design's data flow §6–§8 shows the token traveling through every layer of the call stack. Toolkit flag: "That's a context object trying to escape — make it explicit." - Side effects in module import — security,
EgressGuard"Installed at process start viasrc/codegenie/sitecustomize.py(auto-loaded by Python)." This is the textbook violation: import-time monkey-patching ofsocket.create_connection. Toolkit flag: "Side effects in constructors / module import time. Module load runs … Untestable." The test plan (test_egress_guard.py) needs to mutate the patched function — it's actively fighting the pattern. - Untyped
dict[str, Any]interfaces — performance,TrustSignal(..., details={"new_errors": n})anddetails={"timeout": True}anddetails={"degraded_reason": "no_tsconfig"}. Thedetailsfield isdict[str, Any]shaped across multiple signal kinds. Best-practices does the same:details={"new_errors": int, "stderr_head": str}. The toolkit's flag is explicit: "A 'flexible' payload format is a contract you can't see and can't refactor." - Tag-and-dispatch without a tagged union — best-practices,
_validate_lockfile_transform_shapeas a node that "returnsLeafLLMRefused(typed)" but is named per the shape it validates. The function's behavior depends on a shape-classification it computes; the dispatch is conditional inside one function. Refactoring_validate_lockfile_transform_shapeto handle a second transform shape (Phase 7'sDockerfileBaseImageTransform) means editing this function — the OCP-violating tag-and-dispatch pattern the toolkit warns against.