Phase 02 — Context gathering — Layers B–G: Devil's-advocate critique¶
Reviewed by: Devil's-advocate critic subagent Date: 2026-05-14
Method¶
I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.
I cross-checked each design against:
- localv2.md §4 — the frozen probe contract.
- production/design.md §2 — the nine load-bearing commitments.
- production/adrs/0031 — plugin architecture, with the explicit roadmap-Phase-3 anchor.
- production/adrs/0032 — language search adapters.
- production/adrs/0033 — domain modeling discipline (and its confidence() -> AdapterConfidence clause).
- production/adrs/0034 — event sourcing as canonical primitive (and its "lands in Phase 9 or 13" clause).
- phases/00-bullet-tracer-foundations/final-design.md — single Semaphore, pytest-xdist veto, run_allowlisted chokepoint, ALLOWED_BINARIES discipline.
- phases/01-context-gather-layer-a-node/final-design.md — the Phase-1 ABC additions (parsed_manifest, input_snapshot) were ADR-gated and added as Callable | None = None — the precedent for what "additive contract change" actually costs.
- roadmap.md Phase 2 (exit criterion: every layer runs against real repos + IndexHealthProbe catches a seeded staleness fixture) and Phase 3 (the first plugin ships, the plugin loader debuts, and the four ADR-0032 adapters ship — there).
Attacks on the performance-first design¶
Concrete problems¶
-
Problem: Ships the Plugin Loader plus a
plugins/universal--*--*/plugin.yamlin Phase 2 to "operationalize ADR-0031 minimally." Why it matters: The roadmap (Phase 3 §"Plugin framing") explicitly assigns both the loader and the universal fallback to Phase 3 — "A universal (, , ) fallback plugin also ships in this phase" is Phase 3 prose, not Phase 2 prose. ADR-0031 §Consequences §1 is even more direct: "Phase 3 of the roadmap (first vuln remediation, deterministic recipe path) becomes 'author vulnerability-remediation--node--npm.' The first plugin doubles as the proof that the plugin loader works." The design's own §Risks §4 admits this is "shipping infrastructure on speculation," then defends it with a strawman ("the alternative is registering Phase 2's B–G probes via the Phase 0/1 import-side-effect mechanism and then refactoring them into a plugin in Phase 3"). That refactor is not "extension by editing" — Phase 3 is when plugins are introduced, by design. The design has turned a Phase-3 deliverable into a Phase-2 deliverable with a manifest that currently does nothing because (its own words)contributes.adapters: {}and the "subgraph is a stub." Premature pluggability with a single implementation, dressed as a Port. Where:* Components §2 (PluginLoader), Architecture diagram lines 53–65, §Risks §4, Open question §1. -
Problem: The cost-tier coordinator adds a new
cost_tier: Literal[0,1,2,3]field to theProbeABC. Why it matters:localv2.md §4declares the probe contract frozen; Phase 0 §"Frozen surfaces" lists the ABC and the registry decorator among the things Phase 1+ must not edit. Phase 1's one ABC change (parsed_manifest,input_snapshot) cost an ADR (0002-parsed-manifest-memo-on-probe-context.md) and was added asCallable | None = None. The performance design hand-waves "same governance discipline as Phase 1'sparsed_manifestfield addition," butparsed_manifestwas added toProbeContext, not to theProbeclass itself — a meaningful distinction the design glosses. Worse,cost_tieris data the coordinator needs to dispatch — it is a coordinator concern smuggled onto the probe ABC instead of expressed as a registry-side annotation (e.g., the registry could carry per-probe metadata). The result is contract churn for a scheduling optimization. Where: Components §1, "Pattern decision" line ("Plugin architecture (the kernel knows about tiers; probes self-classify)" — which is exactly the centralizing pattern ADR-0031 forbids: tiers are kernel-side concerns, not plugin-contract concerns). -
Problem: Reverses Phase 0's
pytest-xdistveto without owning it. Why it matters: Phase 0's synthesizer scoredpytest-xdist"Off" at 10/4 against "On" with a recorded rationale ("shared-fixture races for zero parallelism win on 5 tests"). The performance design unilaterally turns it on for the "portfolio lane" while admitting via Open Question §4 "the synthesizer should confirm whether this is a reversal of Phase 1 §2.2." That is not a "Phase 2 picks for itself" call — it reverses a Phase 0 commitment by inertia. The portfolio fixtures live intests/integration/portfolio/; nothing about xdist makes those races safer than the Phase 0 tests it was banned for. Where: §9 "Multi-repo fixture portfolio," Open Q §4. -
Problem: The "MessagePack-on-disk SCIP symbol index" decision (
projections/scip.symbols.idx) ships a second on-disk schema alongsiderepo-context.yaml, with binary search over a length-prefixed record list, and the design defends it with "msgpack parses ~5× faster than JSON." Why it matters: (a) The 50 ms p95 latency target the design uses to justify msgpack is its own target, not a phase or ADR target — circular justification. (b) Adding a binary on-disk format that downstream adapters must agree on creates exactly the format-coupling the design claims projections eliminate; the bench intests/bench/bench_projection_adapter_latency.pyis "Phase 3 forward-looking" — the consumer doesn't exist yet. (c)scip-python(parser-only) is added as a dep "ratified by ADR0002-scip-python-reader-for-projections.md" — but the design also addstantivy,msgpack,httpx,gitleaks-python,tree-sitter-python, plus 4 external CLIs (semgrep, syft, grype, gitleaks, scip-typescript). Phase 1 ADR-0009 ("no new C-extension parser dependencies") sets a strong precedent; the design doesn't engage with it. Where: Components §4 (line ~196), Resource & cost profile. -
Problem:
IndexHealthProbeships anAdapterConfidence = Trusted | Degraded(reason) | Unavailable(reason)sum type for its own slice while every other Phase 2 probe returnsconfidence: Literal["high","medium","low"]per thelocalv2.md §4contract. Why it matters: Either the probe contract has changed (an unannounced ABC change, breaking the Phase 0 freeze) or theIndexHealthProbe's output is shaped differently from every other probe's output, which means consumers (scipadapter, Bundle Builder) have to special-case one probe's slice format. The design also talks about a per-probehealth_check(slice) -> AdapterConfidenceProtocol — makingAdapterConfidencethe type of every probe's freshness output, not justIndexHealthProbe's — which is a contract change disguised as an "interface." ADR-0033 even prescribesconfidence() -> AdapterConfidencefor ADR-0032 adapters, not for probes. The design conflates the two. Where: Components §3, §3 "Pattern decisions" line, §8 (events writer). -
Problem: The image-digest cache-key strategy "bypasses the Phase 0
declared_inputsmechanism." Why it matters: This is the design's own admission (§6 "Tradeoffs accepted").declared_inputsis the cache primitive inlocalv2.md §4andproduction/design.md §8.2(I1, the most load-bearing interface). The design's mitigation is "ADR-gated" — but an ADR doesn't make the deviation safe, it just records it. The actual mechanism — letting probes overridecache_key()— already exists; the design doesn't need a new strategy at all if it just respectsdeclared_inputsby including the image digest as a declared input (a "special token" perlocalv2.md §4'sdeclared_inputsspec which allows them). Instead, §6 introduces a second cache-key derivation pathway that future probes will copy. Where: Components §6, §Risks §2.
Hidden assumptions¶
- Assumption:
scip-typescriptwhole-repo re-indexes on every.tschange but the cache key's Merkle root makes "unchanged repo on re-run" a hit. What breaks if it's wrong: The 60-second cold-without-build target assumesscip-typescriptruns in 8–15 s on a 50k-LOC repo. Open Question §5 already half-acknowledges this. Ifscip-typescriptregresses, Phase 2's exit criterion ("IndexHealthProbe surfaces at least one real staleness case") still passes but the warm/incremental targets miss by 5–10×, which the performance design has built the rest of its architecture around (the 600 workflows/hr figure is a function of these times). The cost-tier coordinator's tier-1 cap of 4 was sized assuming SCIP fits in one slot. - Assumption:
cpu_count() // 2is the "sweet spot" for tier-1 on an 8-core CI box. What breaks if it's wrong:cpu_count()on a CI runner can be 2 (GitHub-hosted default), in which casemax(cpu_count()//2, 2) = 2— tier-1 and tier-2 both have 2 slots, meaning a docker build and a semgrep run can starve each other on a small runner. The sizing has been calibrated to a developer laptop, not to the CI substrate that actually has to validate the exit criterion. - Assumption:
TreeSitterImportGraphProbeuses an internalThreadPoolExecutorbecause "tree-sitter's C extension releases the GIL." What breaks if it's wrong: Open Q §6 already flags this. The bigger problem is that the coordinator's "tier-1 cap =cpu_count()//2" budget assumed each probe = 1 CPU; the probe now uses 4 threads inside its slot, so on an 8-core box the four tier-1 slots × 4 threads = 16 threads competing with whatever else tier-2 and tier-3 are doing. The cost-tier semaphore is a lie about actual CPU usage.
Things this design missed that a different lens caught¶
The security design catches that gitleaks findings are themselves secrets and need redaction at the writer chokepoint; the performance design lists gitleaks-python as a dep without ever discussing what happens when a finding lands in repo-context.yaml. The best-practices design catches that IndexFreshness should live in codegenie.indices.freshness and ship as a Pydantic sum type with mypy --warn-unreachable enforcing exhaustiveness; the performance design's AdapterConfidence sum type is asserted but never localized to a module path or grounded in a mypy flag. The best-practices design correctly notes that Phase 2 does not ship the plugin loader — directly contradicting the performance design's load-bearing §2.
Attacks on the security-first design¶
Concrete problems¶
-
Problem: Mandates
bubblewrapas the Layer B/G CLI sandbox while explicitly noting macOS has no equivalent. Why it matters: The design treats this as a "documented gap" (§Risks §2) but the consequence is silent: developers on macOS — byCLAUDE.md's posture, the canonical developer environment — get materially weaker isolation than CI. Every adversarial test the design lists (tests/adv/phase02/, ≥40 hostile inputs) only validates on Linux CI; the macOS code path is exercised by zero tests and the design admits "macOS has no equivalent" but never says how the macOS branch (the one developers actually run) is fuzzed. The "good enough until microVM" framing buys protection for the runner, not for the developer machine where adversarial repos are most likely to be analyzed first. Where: Trust boundary 3 diagram, §"Layer B/G external-CLI sandbox runner" tradeoffs, §Risks §2. -
Problem: Introduces
ProbeContext.capabilities: ProbeCapabilitiesas a discriminated union overInProcessCapabilities | SubprocessSandboxCapabilities | ContainerSandboxCapabilities. Why it matters: This is aProbeABC contract change (or at minimum aProbeContextchange with downstream effects on every existing probe). Phase 1'sProbeContextchange was one additive optional field with an ADR; this is a new mandatory discriminated-union field that every probe has to know how to read. The design hand-waves "additive, Phase 0/1 contract preserved" but then says "each probe family's allowed surface is type-checked, and mypyassert_nevercatches a new family without explicit handling" — meaning every existing Phase 0/1 probe must now exhaustivelymatchon the discriminator or fail typecheck. That's a coordinated, every-file edit dressed as "additive." Where: §"Coordinator (extended)" interface line, §Design patterns applied row 1. -
Problem:
_safe_yaml_load_skillopens files withO_NOFOLLOW, runsyaml.SafeLoader-with-Python-constructors-stripped, caps size at 256 KB, and records body BLAKE3. Why it matters: The Phase 1 chokepoint for YAML iscodegenie/parsers/safe_yaml.pywrappingyaml.CSafeLoader; Phase 0forbidden-patternsbansyaml.loadrepo-wide and the existingsafe_yaml.loadenforces size+depth caps. The security design proposes a parallel "hardened" loader path specifically for skills, with its own subclass discipline andO_NOFOLLOW. Either the Phase 1 chokepoint is sufficient (and the new helper is duplication) or it isn't (and Phase 1's chokepoint is silently broken — which would be a Phase 1 critique, not a Phase 2 design). Surfacing two YAML loaders ("normal" and "hardened-for-skills") creates Rule 7's anti-pattern: two existing conventions, blended. Where: §"SkillsIndex (D2) loader" internal-design, §Design patterns applied row 4. -
Problem:
IndexHealthProbeswitches from "mtime-based freshness" to "cryptographic anchoring againstgit_commit" and storesoutput_blob_blake3in the audit log, requiringB2to recompute BLAKE3 over every cached blob on every gather. Why it matters: (a) None of the three Phase 2 designs ever proposed mtime-based freshness — this attacks a strawman from earlier in the brief, then "rejects" it with cryptography. The actual freshness signal is(scip_indexed_commit == repo.HEAD), an O(1) string compare against an indexer-emitted header, which doesn't need BLAKE3. The "cryptographic anchoring" frame is solving a problem (cache-blob tampering by an attacker with write access to.codegenie/cache/) that is not in the Phase 2 threat model (.codegenie/is the user's own directory at 0700/0600 per Phase 0 ADR-0011 — the only way to mutate it is to already own the host). (b) The audit-log chain "fails closed if the chain breaks" — but the chain is also on disk in the same.codegenie/directory the attacker would already own; the chain protects against accidental corruption, not against the threat model the design names. The cryptographic ceremony is real spend on a non-threat. Where: §Goal §5, §"IndexHealthProbe (B2)" internal design, §"EventStream emitter" tradeoffs. -
Problem:
SecretRedactorwrites plaintext to.codegenie/findings/secrets/<fp>.enc"encrypted-at-rest with a per-repo key under~/.codegenie/keys/<repo>.key" with the design's own §Risks §5 saying "the keys live in$HOME, same trust tier as~/.ssh/id_rsa." Why it matters: That admission is fatal. The key lives in$HOME, the encrypted blob lives in$HOME, the host compromise that the encryption is theoretically defending against can read both. The encryption is performance — encryption-as-theatre — that costs (a) per-repo key generation; (b) a new module (SecretFindingCapability); (c) operational fragility ("lost key = lost ability to read past findings") for zero security gain against the actual attacker. The structural fix is "don't write the plaintext at all"; the design buys obfuscation. Where: §"Secret redactor" credentials-accessed line, §Risks §5. -
Problem:
ExternalDocsProbeis implemented as a separate optional subprocess (codegenie-fetch-external-docs) "invoked under the same bubblewrap jail" with a per-source allowlist — and §Goals §10 frames this as preserving the Phase 0/1 "nohttpx/requests/socketundersrc/codegenie/" ban. Why it matters: Shipping a second binary that does speak HTTP, inside the same source tree, lockstep-versioned with the maincodegenieCLI, is a structural workaround for a structural rule. The fence test the design promises ("Phase 0 structural network ban; this probe is opt-in and must use a constrained client") doesn't enforce anything because the network-using code lives outside the path the fence checks. The Phase 0 commitment is now nominal — the network capability exists, it's just spelled as a separate executable. The design should either honor the ban (defer ExternalDocsProbe) or revisit it (an ADR-amend), not paper over it. Where: §Goals §10, §"Layer C runtime sandbox runner" precedent for sibling-binary pattern.
Hidden assumptions¶
- Assumption:
bubblewrapis "available on every Linux CI runner." What breaks if it's wrong: GitHub Actions hosted runners shipbwrapinubuntu-24.04but notubuntu-22.04by default; the design's §Test plan asserts "Linux (canonical, full bubblewrap)" and "per-platform CI matrix" but doesn't pin which Ubuntu version. A CI image change silently disables Boundary 3 — the design's most important defense — without failing a test. - Assumption:
npm installlifecycle scripts inside the container traced byRuntimeTraceProbeonly need--network=noneto be neutralized. What breaks if it's wrong: The design's own "acknowledged blind spot" admits "docker build step at Layer C runsnpm installlifecycle scripts. Even with--network=none, those scripts can mutate the image filesystem in adversarial ways the SBOM probe might not notice." That admission contradicts the §Goals §2 promise of "zero observed escapes from the container into the gather host." The trust the design buys with Boundary 4 is real for host escape but the SBOM/CVE/runtime-trace probes consume the post-install filesystem as ground truth — adversarial filesystem mutation is in the threat model as a judgment-input attack, not a host-escape attack, and the design only defends the second. - Assumption: Audit-log "hash chain detectable on tamper" provides meaningful integrity. What breaks if it's wrong: The design's §Acknowledged blind spots §3 says: "An attacker who can write to the whole
.codegenie/audit/directory can rewrite the chain from the genesis event upward and produce a forged-but-internally-consistent chain." That admission means the chain protects against accidental corruption only — same property a single trailing checksum would buy. The "fail closed if the chain breaks" promise of §"EventStream emitter" is therefore detecting bit-rot, not adversaries, while the design framing is the latter.
Things this design missed¶
The performance design catches that SCIPIndexProbe requires a projection step (sorted, mmap'd) for the ADR-0032 adapter to hit its latency target — the security design ships scip-typescript under bubblewrap but never asks what shape the output takes downstream, so it has no opinion on whether the post-jail artifact is consumed efficiently. The best-practices design catches that the plugin loader and adapter implementations belong to Phase 3, not Phase 2 — the security design carries forward ProbeExecutionEvent variants (SkillRefused, GrammarLoadRefused, EgressBlocked) into Phase 2 as if the canonical event log existed, but ADR-0034 §Consequences §1 explicitly defers "Phase 9 (Temporal) formalizes the canonical event log." Phase 2 pre-shaping is fine; treating those events as load-bearing now (e.g., "B2 reads the audit log's FinishedOk events for each upstream probe" — §"IndexHealthProbe (B2)" interface line) inverts dependency direction: a probe is now a consumer of the event stream it emits to, which is event sourcing as a runtime data plane, not as an audit primitive. ADR-0034 §"Event types" never described the Phase 2 use case.
Attacks on the best-practices design¶
Concrete problems¶
-
Problem: Ships seven new top-level packages (
codegenie.indices,codegenie.runtime,codegenie.security,codegenie.conventions,codegenie.skills,codegenie.tccm,codegenie.adapters) pluscodegenie.depgraph. Why it matters: The design's own §Risks §5 admits this. The defense ("each package's__init__.py:__all__is the public contract") is a runtime guarantee, not a structural one — once eight packages exist, an internal helper "across" packages becomes someone's import that other code starts to rely on, and the public surface ratchets up. Phase 1 added one new package (codegenie.parsers); Phase 0 froze the layout intentionally. The design treats package count as a stylistic choice whenproduction/design.md §8.4is explicit that the module layout is a load-bearing interface. Where: §Goals "Net-new top-level packages: seven," §Risks §5. -
Problem: Ships
codegenie.adapters.protocolscontaining fourProtocolclasses (DepGraphAdapter,ImportGraphAdapter,ScipAdapter,TestInventoryAdapter) plusAdapterConfidence— with zero implementations — and defends this with "documentation as code." Why it matters: ADR-0032 §"Plugin manifest contract" places the adapters atplugins/{slug}/adapters/*.pyinside the plugin, and ADR-0031 §Consequences §1 names the first plugin (Phase 3) as the proof the loader works. Shipping the Protocol skeletons in Phase 2 without their first implementation is the textbook "premature pluggability" pattern the design's own §Patterns deliberately avoided section claims to refuse ("No plugin DSL", "No metaclasses for probe registration") — but it ships the typed-Protocol equivalent for adapters. The §Risks §4 mitigation ("Phase 2 ships aNullAdapterset... Phase 3's first real adapter has a concrete reference point") proves the point: when the only consumer of a Protocol is a NullAdapter fixture written to validate the Protocol exists, the Protocol is speculative. Where: §Components "Adapter Protocol definitions (kernel side of ADR-0032)," §Risks §4. -
Problem: Ships
TCCMLoaderand theDerivedQueryPydantic discriminated union for the five ADR-0030 primitives, with no concrete TCCM in the repo. Why it matters: §Components §"TCCMLoader" admits this: "No concrete TCCMs ship in Phase 2; all TCCMs ship inside their plugin (ADR-0031 §Consequences)." So the loader has no inputs to load. The integration test described in §Test plan ("walks the (Phase 2)TCCMLoader→ typed model... Does not wire the plugin into a Supervisor (that's Phase 8)") loads a synthetic fixture intests/fixtures/plugins/synthetic--syn--syn/tccm.yamlthat exists nowhere else in the system. The design is asking Phase 2 to author the schema of a Phase-3-and-later artifact, with the sole consumer being a Phase-2-authored test fixture validating its own author. This is the schema-without-a-consumer pattern Rule 2 (Simplicity First) attacks. Where: §Components §"TCCMLoader", §Open questions §3 ("a sixth primitive arrives, do we ADR-amend the sum type or accept Unknown") — a question that wouldn't be a question if the schema followed a real consumer. -
Problem: Adopts
mypy --warn-unreachableandmypy --enable-error-code=truthy-boolin Phase 2 and commits to retrofit-Option-(a) in Open Q §5 ("fix them as part of Phase 2 (small retrofit)"). Why it matters: This is a Phase-0 + Phase-1 retrofit dressed as a Phase 2 lens choice. The synthesizer has no way to scope it. "The retrofit will be small (Phase 0/1 has a few dozen functions touched)" is asserted, not measured. The design itself flagged "Stage 7 Learning telemetry hooks are not in Phase 2" — but adopts a strictness flag whose blast radius is everything Phase 0/1 wrote. Rule 3 (Surgical Changes) attacks this directly: "Don't 'improve' adjacent code." Where: §Goals, §Open Q §5. -
Problem: The
DepGraphProbe(kernel skeleton) "looks up anecosystem-detector" by a "string-keyed dict (npm/pnpm/yarn-classic/yarn-berry)" — the design's own §Acknowledged blind spots §5 admits this is "string-keyed" with a TODO comment. Why it matters: ADR-0033 §1 ("newtype for every domain primitive") makes this exactly the discipline Phase 2 ships to honor — but the design ships the violation, decorates it with a# TODO(phase-3): sum-type after first plugin shipscomment, and adds it to the backlog. The backlog item is the primary signal that a sum type is wanted now; deferring it to Phase 3 because "the first plugin actually owns the npm row" is exactly the kind of "we'll fix it when the first concrete user appears" deferral that ADR-0033 §"Reversibility" warns is high-cost-late. Where: §"DepGraphProbe" internal design, §Acknowledged blind spots §5. -
Problem:
IndexFreshness = Fresh | Staleis the sum type the design rests on — and it lives incodegenie.indices.freshness, away fromcodegenie.probes.index_health— defended in §Open Q §1 as "the sum type is consumed by Phase 8 Bundle Builder and ADR-0032 adapters — neither imports from probes." Why it matters: Phase 8 doesn't exist yet. ADR-0032 adapters don't exist yet (they ship in Phase 3). The sum type's only Phase-2 consumer isIndexHealthProbeitself, the one module the design extracts it away from. The justification is "neither imports from probes" — but neither imports anything from Phase 2 at all yet, so the import-direction argument is hypothetical and the layout decision is speculative. Co-location is the boring default until the second consumer exists. Where: §"IndexHealthProbe (B2)" "Where it lives" line, §Open Q §1.
Hidden assumptions¶
- Assumption: "Phase 0 + Phase 1 has a few dozen functions touched" by
--warn-unreachable. What breaks if it's wrong: If the retrofit turns up dozens of legitimate-looking unreachable branches in Phase 0 schema validation or Phase 1 lockfile parser fallbacks, the Phase 2 PR balloons; Rule 6 (token budgets) starts being violated mid-phase. - Assumption: The
gitpython-or-shell-out question in §Open Q §4 is genuinely tied. What breaks if it's wrong: The design's preferred choice (shell out viarun_allowlisted) requires addinggittoALLOWED_BINARIES— butgitis already added (Phase 0 §6.4). So the "shell out" answer is cheap and thegitpythonoption is gratuitously a new dep. The Open Q exists only because the design didn't check the existing allowlist. The synthesizer might pickgitpythonfrom politeness. - Assumption: "Four scanners with four genuinely different input/output shapes do not share a common abstraction worth ~60 LOC of saved boilerplate." What breaks if it's wrong: §Components §"Layer G security wrappers" claims this; the security design proposes exactly the unified abstraction (
_run_external_cli(probe_name, argv, cwd, allowlisted_egress, capability_token) -> CompletedProcess) and argues it's the security primitive ("Command pattern"). The best-practices lens has not engaged with this conflict; it has refused the abstraction on aesthetic grounds while the security lens demands it on threat-model grounds. The assumption that there is no shared abstraction worth writing is unexamined.
Things this design missed¶
The performance design catches that the SCIPIndexProbe cold path on a 50k-LOC repo is the wall-clock dominator and the projection step decides whether Phase 3's adapters are O(lookup) or O(walk) — the best-practices design ships scip-typescript as a "thin wrapper" with no projection at all, so a Phase-3 ScipAdapter implementing the Protocol the best-practices design ships will need to re-parse the .scip binary every query, which the Protocol definition doesn't make visible. The security design catches that yaml.SafeLoader defaults still permit some unsafe constructors (the relevant attack surface is yaml.SafeLoader-subclass discipline) and that gitleaks findings carry their secrets in the slice — the best-practices design's GitleaksProbe is a thin Pydantic-typed JSON wrapper with no redaction at the writer chokepoint, so a real AKIA... in .git/ history ends up verbatim in repo-context.yaml.
Cross-design observations¶
Where do the three disagree?¶
| Dimension | Performance picks | Security picks | Best-practices picks | What's at stake |
|---|---|---|---|---|
| Plugin loader in Phase 2 | YES — kernel-resident loader + universal-fallback plugin | Silent — assumes loader exists for Capability bundle dispatch | NO — Phase 3 owns the loader; only Protocols + TCCMLoader ship here | Whether Phase 2 violates the roadmap's Phase-3 assignment of ADR-0031's first concrete delivery |
| Probe ABC contract change | YES — adds cost_tier: Literal[0,1,2,3] to the ABC |
YES — adds capabilities: ProbeCapabilities to ProbeContext |
NO — ABC frozen; uses requires and decorator registration unchanged |
Whether localv2.md §4 survives Phase 2 as a frozen contract |
confidence modeling |
AdapterConfidence = Trusted \| Degraded \| Unavailable sum type, used for both probes and adapters |
IndexConfidence sum type for B2; confidence: str retained elsewhere |
IndexFreshness = Fresh \| Stale(reason); downstream consumers match exhaustively |
Whether ADR-0033's confidence() -> AdapterConfidence clause applies to probe outputs (a contract change) or only to ADR-0032 adapters (a Phase-3 concern) |
| Cache-key derivation for Layer C | Image-digest-keyed cache deviating from declared_inputs discipline |
Cryptographic anchor of upstream blob BLAKE3 via audit-log lookup | Stays on declared_inputs; runtime trace's declared_inputs includes Dockerfile + scenarios.yaml |
Whether the Phase 0 I1 contract (declared_inputs as the universal cache key) remains the only mechanism |
| Audit-log shape in Phase 2 | Three event variants (IndexHealthDegraded, ProbeCacheInvalidated, ExternalToolMissing) shipped to .codegenie/events/ |
Full hash-chained discriminated-union with 10+ variants, B2 reads from the chain | Phase-0 audit anchor (runs/<utc-iso>-<short>.json) untouched; no event stream |
Whether Phase 2 pre-ships Phase 9/13's event-sourcing infrastructure (ADR-0034) or honors the ADR's own "lands in Phase 9 or 13" clause |
pytest-xdist |
Enabled for portfolio lane | Silent | Silent | Whether Phase 0's veto (synthesizer-recorded 10/4) survives |
| External-CLI sandbox | Cost-tier semaphore only; no jail | bubblewrap (Linux) + documented macOS gap; Container sandbox for Layer C |
Phase 0 run_allowlisted chokepoint untouched |
The whole supply-chain threat model |
Secret findings in repo-context.yaml |
Not addressed | Redacted to <REDACTED:fingerprint=…>; plaintext encrypted under per-repo key |
Not addressed — GitleaksProbe ships findings inline as typed JSON |
Whether Phase 2 leaks secrets the moment a real AKIA... shows up in .git/ |
ExternalDocsProbe network capability |
Added; tier-3 cost tier; allowlisted via httpx | Sidecar binary codegenie-fetch-external-docs under bubblewrap |
Opt-in; "skip cleanly if no external_docs config" | Whether Phase 0's "no httpx/requests/socket under src/codegenie/" rule remains structural |
Which disagreement matters most for this phase?¶
Where the plugin loader lives. Performance ships it in Phase 2 with a universal-fallback plugin "to make the kernel-side seam exist." Best-practices says Phase 2 only ships kernel-side scaffolding (TCCMLoader, adapter Protocols). The roadmap and ADR-0031 are unambiguous: Phase 3 ships the plugin loader, the universal-fallback plugin, the first concrete plugin (vulnerability-remediation--node--npm), the TCCM, the four ADR-0032 adapters, the Skills, and the OpenRewrite recipes — all together, as the proof the loader works. Pulling any one of these forward into Phase 2 means shipping infrastructure on speculation (premature pluggability) and leaving Phase 3's exit criterion ("first plugin ships AS the proof") with no proof to deliver because the loader already exists. The synthesizer must resolve this with prejudice in favor of the roadmap — every other Phase 2 decision (cost-tier coordinator, ABC change, projection format, sum-type placement) downstream-changes if the loader is or is not present.
Where do all three quietly agree on something questionable?¶
- All three pre-ship some flavor of
IndexFreshness/AdapterConfidence/IndexConfidencesum type whose only Phase-2 consumer isIndexHealthProbeitself. The Phase 3+ consumers (ADR-0032 adapters, Phase 8 Bundle Builder) are postulated. Best-practices' §Risks §1 names the risk ("type sits unused for 1–6 phases, drifts"); performance and security don't. The whole industry has named this pattern: schema before consumer. - All three ship
tree-sitteras a Phase 2 dep without engaging with Phase 1 ADR-0009 ("no new C-extension parser dependencies"). That ADR specifically excluded tree-sitter from Phase 1; nothing in any of the three Phase 2 designs treats it as a precedent or argues for the precedent's amendment. The dep is just assumed. - All three treat
RuntimeTraceProbe(5 scenarios × 16–20 s, ~80–100 s wall-clock floor) as a Phase 2 deliverable despite none of the three citing where the five scenarios are configured by the user. Performance has a config flag, security has a typedScenarioenum, best-practices has ascenarios.yaml. The three designs are aligned on shipping the heaviest probe in the system with three different configuration shapes — a structural inconsistency the synthesizer must pick from.
Roadmap-level critiques¶
-
Does this phase as designed set up problems for later phases? Yes — through the plugin loader. If Phase 2 lands the loader (performance design's pick), Phase 3 is no longer the "first plugin doubles as the proof the loader works" milestone the roadmap names; it becomes "first plugin lands against a loader that already shipped without any plugin to test it." The proof is hollowed out. Conversely, if Phase 2 ships only adapter
Protocoldefinitions and aTCCMLoader(best-practices design's pick), the Phase-3 plugin can be written against aProtocolset that turned out wrong, forcing a Phase-2 schema amendment that ripples through every Phase-2 consumer. Best-practices' mitigation (aNullAdapterfixture set authored by Phase 2 itself) is exactly the schema-validates-itself anti-pattern — the Phase 3 implementer will discover the divergence on day 1 of writing the real adapter. The synthesizer cannot pick "neither" — something about ADR-0029/0031/0032 has to land in Phase 2 because the brief lists those ADRs as load-bearing inputs — but the three designs disagree on which part lands where, and any pick has Phase 3 friction. -
Does it rely on something an earlier phase didn't actually establish?
- The performance design's
cost_tiersemaphores rely on Phase 0's coordinator having a per-probe semaphore-selection hook. Phase 0 §"Coordinator" actually ships oneSemaphore(min(cpu_count(), 8)); per-probe selection requires editing the Phase-0 coordinator (the design admits this as "extending Phase 0/1" but doesn't acknowledge it as a Phase-0 chokepoint edit, which Phase 0 §"Frozen surfaces" forbids). - The security design's
_run_in_containerchokepoint relies on Phase 0 having a subprocess pathway that admits a container runtime. Phase 0 hasrun_allowlistedfor binaries; running containers with--cap-drop=ALL --network=noneis structurally a different subprocess pathway. Either_run_in_containerwrapsrun_allowlisted("docker", ...)(in which case Phase 0's allowlist must adddockerandpodman, neither currently allowlisted) or it's a second subprocess path — Phase 0 §"Chokepoints" forbids the latter. -
The best-practices design's "TCCMLoader walks a
tests/fixtures/plugins/synthetic--syn--syn/tccm.yamland validates it againstcodegenie.tccm.model.TCCM" relies on a plugin layout (plugins/{slug}/tccm.yaml) ADR-0031 specifies — but ADR-0031 §"Plugin layout" places plugins atplugins/in the repo root, not attests/fixtures/plugins/. The fixture path quietly forks the convention. -
Does it violate any load-bearing commitment from
production/design.md§2 orCLAUDE.md? - §2.5 "Extension by addition. Adding a new language, new task type, or new tool must be new probes + new Skills + new subgraphs, never edits to existing ones." The performance design's
cost_tierABC field is an edit to the existing Probe ABC; the security design'sProbeContext.capabilitiesis an edit to the existingProbeContext. Both designs assert "additive, backward-compatible" — but ABCs/protocols Phase 0 explicitly froze and listed as load-bearing inproduction/design.md §8.4 I1. ADR-required, not "additive." - §2.1 "No LLM in the gather pipeline." None of the three violate this — but the best-practices design adds
tantivy(a Rust BM25 indexer) forExternalDocsIndexProbe; the performance design addstantivy+tree-sitter-python+gitleaks-python; the security design adds none of these but addsbubblewrap+seccomp+ a custom seccomp profile. Thegatherextras list grows in three different shapes; the Phase 0 fence test only checks the LLM SDK ban, not the shape of dep growth. - §2.3 "Honest confidence. …
IndexHealthProbeis the canonical example in the POC." All three designs treat B2 as load-bearing — but disagree wildly on its output shape (AdapterConfidencevs.IndexConfidencevs.IndexFreshness), its inputs (sibling probe artifacts vs. audit-log events vs. cached blob BLAKE3 + git HEAD), and its cache strategy ("none"per performance vs. read-only over the audit log per security vs. content-keyed against.codegenie/context/raw/*.jsonper best-practices). Three "canonical examples" of honest confidence in one phase is not honest.
Design-pattern critiques (cross-cutting)¶
Misapplied patterns (with concrete victim)¶
- Performance design — "Plugin architecture" applied to cost tiers. §Components §1 names "Plugin architecture (the kernel knows about tiers; probes self-classify)" as the pattern for the cost-tier coordinator. That is not plugin architecture; that is a new ABC field consumed by the scheduler. Plugin architecture means the kernel imports nothing from plugins and discovers them via a registry — not "probes set a field the kernel reads." The pattern label is doing work the design hasn't earned.
- Performance design — "Hexagonal" applied to the plugin loader. §Components §2 names "Hexagonal: the loader is a Port; 'Pydantic-validated YAML on disk' is the only Adapter today, but Phase 14's webhook-driven loader and Phase 16's signed-plugin loader plug into the same Port." A Port + one Adapter is a function with a side door taped open; the second adapter doesn't exist for 12+ phases. The toolkit's "Failure mode" on hexagonal is exactly this: "a 'hexagonal' design that smuggles
requests.get(...)directly into a domain function. The whole point is that the core doesn't know about HTTP." Symmetrically, hexagonal with one Adapter is the door without the wall. - Security design — "Capability pattern" applied to
SecretFindingCapability. §Components §"Secret redactor" claims the LLM accesses plaintext "only viaSecretFindingCapabilitytoken" — but the LLM doesn't construct objects in Python; it constructs strings that describe tokens. The capability pattern works for in-process calls where holding the token is the capability. Across a Python/LLM boundary, the LLM never holds the token; it asks code to act on its behalf. The capability is therefore a server-side check, which is just authorization with a fancier name. §Design patterns row 6 says "The LLM cannot mint the token" — true, but irrelevant; the LLM doesn't need to mint it to exfiltrate the value if a helper reads-and-renders. - Best-practices design — "Strategy via Protocol" for adapters with zero implementations. §Components §"Adapter Protocol definitions" names "Structural subtyping / Strategy via Protocol." Strategy is a pattern for interchangeable algorithms behind a common interface — pick one at runtime. With zero implementations, there is nothing to pick. The Protocol is a type with no behavior; it's a contract document with
.pyextension. Misapplied pattern + premature pluggability double feature.
Missed patterns (with concrete location)¶
- Performance design — Decorator/Registry omitted in favor of "Plugin Loader." The Phase 0/1 mechanism (
@register_probewriting into a module-level dict) is already the kernel-side registry for everything Phase 2 ships. The performance design has the loader doing what the registry already does (resolving probe sets) while pretending it does something different (loading plugins that don't exist). - Security design — Smart-constructor pattern missed at the secret-redactor boundary. §"Secret redactor" returns
tuple[dict, list[SecretFinding]]. The natural shape is a smart constructor on aRedactedSlicevalue type that cannot be constructed without a populatedfindingslist — making "redactor was called" type-checkable. The current shape allows a caller to drop thelist[SecretFinding]and persist onlydict, losing the audit trail; the smart constructor would force the audit-trail to be co-located with the slice or fail typecheck. - Best-practices design — Open/Closed at the
DepGraphProbe.ecosystem-detectordict. §"DepGraphProbe" ships a string-keyed dict (admits this in §Acknowledged blind spots §5) when the natural shape is a per-ecosystem@register_dep_graph_strategydecorator following the existing@register_probepattern. The toolkit's Open/Closed Failure mode names this exactly: "the centraldispatch_task_class(name)function has amatch nameblock that grows every time. That's modification, not extension." A string-keyed dict is the data-shape equivalent.
Pattern claims that don't survive scrutiny¶
- Performance design "Functional core, imperative shell" on the SCIP projector. §Components §4 "Pattern decisions: Functional core (the projector is pure:
.scipbytes → projection bytes)." A function that takes.scipbytes and returns projection bytes is just a pure function — labeling it FCIS is ceremony. The "imperative shell" the design names issubprocess.runandPath.write_bytes(); the bar for "shell" is higher than "the function that calls the function" or every Python program qualifies. - Best-practices design "Composition + clear naming over generic frameworks" for three loaders. §Components §"
SkillsLoader,ConventionsCatalogLoader,TCCMLoaderas separate concrete classes." Composition implies the three share components by combining them; the design ships three separate classes that share nothing in code — they each callyaml.safe_loadand run a Pydantic validator. That's not composition; that's parallel implementation. The pattern label is window-dressing on duplication-by-copy. - Security design "Hexagonal" applied to
_run_external_cli+_run_in_container. §Design patterns row 2 claims "Hexagonal architecture / Ports & Adapters" for the chokepoint pair. Hexagonal requires the core not know about the adapter; here the core (every probe) calls_run_external_cli(... capability_token)with an argv list that is shaped per-CLI. The chokepoint forwards; it doesn't translate. Adapter pattern's failure mode in the toolkit ("an Adapter that re-exports the same interface unchanged. If you didn't translate, you wrote a forwarder.") applies verbatim.
Anti-patterns from the toolkit's "flag on sight" list¶
- Premature pluggability. Performance ships the Plugin Loader with one implementation (universal fallback) whose
contributes.adapters: {}is empty — a "pluggable" architecture with zero plugins. Best-practices ships four adapterProtocols with zero adapters. Security ships aProbeCapabilitiesdiscriminated union with three variants when every Phase 2 probe fits into one of three known buckets — the data-driven approach (probe declares its kind in the registry, coordinator passes a context shaped accordingly) would work without a sum-typed capability bundle. - Untyped
dict[str, Any]interfaces. None of the three carry these forward egregiously — but the best-practices design'sProbeOutput.schema_slice: dict[str, Any](inherited fromlocalv2.md §4) is unchallenged across all three. Every Phase 2 design adds typed slices but routes them through the untyped envelope at merge time. None of the three names this as a contract gap. - Boolean flags on public methods. Best-practices design's
SkillsLoader.load_all() -> Result[list[Skill], SkillsLoadError]is fine; but itsConventionsCatalogLoader.apply(conventions, repo)has no flag (good). Security's_run_in_container(image_or_dockerfile, scenario, capability_token)uses animage_or_dockerfileparameter that union-types "either path or string contents" — the toolkit's "boolean flag" anti-pattern generalized to "polymorphic parameter." - Tag-and-dispatch without a tagged union. Performance design's per-probe
cache_strategyisLiteral["content", "none"](inherited fromlocalv2.md §4) — and then the design adds a third behavior (image-digest-keyed) via thecache_key()override hook, without extending the discriminator — so cache strategy is now partially captured in thecache_strategyfield and partially in thecache_keymethod, with no sum type covering both. Future readers cannot grep for "all cache strategies." - Capability passed through ten frames as a parameter. Security's
capability_tokenis exactly this: it threads fromProbeContextthrough every probe'srun()into_run_external_cli/_run_in_container/_grammar_runner. The toolkit's prescription ("make it explicit (ContextVar, a small Context dataclass, or a registry)") would route through a coordinator-level registry indexed by probe family; the design routes through ten frames as a parameter. - Side effects in constructors. Best-practices design's
SkillsLoader.__init__(self, search_paths)is fine; itsConventionsCatalogLoader.__init__(self, search_paths)is fine. But the security design's per-repo encryption key generation ("generated on first run per repo; lost key = lost ability to read past findings") is an import-time / first-call side effect — the design doesn't name a constructor forSecretRedactor, but its lifecycle includes "writes to~/.codegenie/keys/<repo>.key" the first time it sees a new repo, which is a side effect at the wrong layer (file creation in$HOMEon a code path the user did not opt into).
This phase's specific traps the brief named:
- Probes that re-encode
IndexFreshnessasOptional. None of the three commits this exactly; best-practices makes a virtue of avoiding it. But performance'scache_strategy = "none"onIndexHealthProbeplus its sibling-probehealth_check(slice) -> AdapterConfidenceProtocol is the structural cousin: the freshness signal is now a method spread across N probes rather than a sum-typed return on B2 alone. - Plugin scaffolding the kernel still hardcodes. Performance ships the Plugin Loader but the kernel-resident Layer B–G probes are "registered through" the universal plugin's
contributes.probes:list. So the kernel does import them; the manifest is bureaucracy. Best-practices ships adapter Protocols incodegenie.adapters— kernel namespace — and notes "no implementations." Hardcoding by another name. - "Hexagonal" sandbox claims that smuggle subprocess into the core. Security's
_run_external_cliis the chokepoint, but every probe still constructs theargvlist itself, so the per-tool knowledge (semgrep's flags, syft's flags) lives in probe modules — not in adapters. The "Port" sees a CLI invocation as if all CLIs were interchangeable, which they aren't, so the abstraction leaks per-tool flags into the core. Hexagonal in name; subprocess-with-extra-steps in code. - Smart-constructor parsers that the rest of the code bypasses with
model_construct(). None of the three ships an explicitmodel_constructbypass — but best-practices' §"Result[T, E](Pydantic-modeledOk | Err) at every parse boundary" assertion is checkable only via convention. Phase 0/1 don't shipResult; adding it in Phase 2 and asserting "every parse boundary" without a lint rule is the bypass-by-omission failure mode.