Phase 02 — Context gathering — Layers B–G: Devil's-advocate critique¶

Reviewed by: Devil's-advocate critic subagent Date: 2026-05-14

Method¶

I read all three designs and attacked each on its own terms. I do not propose alternatives. My job is to surface what the synthesizer needs to see before it merges.

I cross-checked each design against: - localv2.md §4 — the frozen probe contract. - production/design.md §2 — the nine load-bearing commitments. - production/adrs/0031 — plugin architecture, with the explicit roadmap-Phase-3 anchor. - production/adrs/0032 — language search adapters. - production/adrs/0033 — domain modeling discipline (and its confidence() -> AdapterConfidence clause). - production/adrs/0034 — event sourcing as canonical primitive (and its "lands in Phase 9 or 13" clause). - phases/00-bullet-tracer-foundations/final-design.md — single Semaphore, pytest-xdist veto, run_allowlisted chokepoint, ALLOWED_BINARIES discipline. - phases/01-context-gather-layer-a-node/final-design.md — the Phase-1 ABC additions (parsed_manifest, input_snapshot) were ADR-gated and added as Callable | None = None — the precedent for what "additive contract change" actually costs. - roadmap.md Phase 2 (exit criterion: every layer runs against real repos + IndexHealthProbe catches a seeded staleness fixture) and Phase 3 (the first plugin ships, the plugin loader debuts, and the four ADR-0032 adapters ship — there).

Attacks on the performance-first design¶

Concrete problems¶

Problem: Ships the Plugin Loader plus a plugins/universal--*--*/plugin.yaml in Phase 2 to "operationalize ADR-0031 minimally." Why it matters: The roadmap (Phase 3 §"Plugin framing") explicitly assigns both the loader and the universal fallback to Phase 3 — "A universal (, , ) fallback plugin also ships in this phase" is Phase 3 prose, not Phase 2 prose. ADR-0031 §Consequences §1 is even more direct: "Phase 3 of the roadmap (first vuln remediation, deterministic recipe path) becomes 'author vulnerability-remediation--node--npm.' The first plugin doubles as the proof that the plugin loader works." The design's own §Risks §4 admits this is "shipping infrastructure on speculation," then defends it with a strawman ("the alternative is registering Phase 2's B–G probes via the Phase 0/1 import-side-effect mechanism and then refactoring them into a plugin in Phase 3"). That refactor is not "extension by editing" — Phase 3 is when plugins are introduced, by design. The design has turned a Phase-3 deliverable into a Phase-2 deliverable with a manifest that currently does nothing because (its own words) contributes.adapters: {} and the "subgraph is a stub." Premature pluggability with a single implementation, dressed as a Port. Where:* Components §2 (PluginLoader), Architecture diagram lines 53–65, §Risks §4, Open question §1.
Problem: The cost-tier coordinator adds a new cost_tier: Literal[0,1,2,3] field to the Probe ABC. Why it matters: localv2.md §4 declares the probe contract frozen; Phase 0 §"Frozen surfaces" lists the ABC and the registry decorator among the things Phase 1+ must not edit. Phase 1's one ABC change (parsed_manifest, input_snapshot) cost an ADR (0002-parsed-manifest-memo-on-probe-context.md) and was added as Callable | None = None. The performance design hand-waves "same governance discipline as Phase 1's parsed_manifest field addition," but parsed_manifest was added to ProbeContext, not to the Probe class itself — a meaningful distinction the design glosses. Worse, cost_tier is data the coordinator needs to dispatch — it is a coordinator concern smuggled onto the probe ABC instead of expressed as a registry-side annotation (e.g., the registry could carry per-probe metadata). The result is contract churn for a scheduling optimization. Where: Components §1, "Pattern decision" line ("Plugin architecture (the kernel knows about tiers; probes self-classify)" — which is exactly the centralizing pattern ADR-0031 forbids: tiers are kernel-side concerns, not plugin-contract concerns).
Problem: Reverses Phase 0's pytest-xdist veto without owning it. Why it matters: Phase 0's synthesizer scored pytest-xdist "Off" at 10/4 against "On" with a recorded rationale ("shared-fixture races for zero parallelism win on 5 tests"). The performance design unilaterally turns it on for the "portfolio lane" while admitting via Open Question §4 "the synthesizer should confirm whether this is a reversal of Phase 1 §2.2." That is not a "Phase 2 picks for itself" call — it reverses a Phase 0 commitment by inertia. The portfolio fixtures live in tests/integration/portfolio/; nothing about xdist makes those races safer than the Phase 0 tests it was banned for. Where: §9 "Multi-repo fixture portfolio," Open Q §4.
Problem: The "MessagePack-on-disk SCIP symbol index" decision (projections/scip.symbols.idx) ships a second on-disk schema alongside repo-context.yaml, with binary search over a length-prefixed record list, and the design defends it with "msgpack parses ~5× faster than JSON." Why it matters: (a) The 50 ms p95 latency target the design uses to justify msgpack is its own target, not a phase or ADR target — circular justification. (b) Adding a binary on-disk format that downstream adapters must agree on creates exactly the format-coupling the design claims projections eliminate; the bench in tests/bench/bench_projection_adapter_latency.py is "Phase 3 forward-looking" — the consumer doesn't exist yet. (c) scip-python (parser-only) is added as a dep "ratified by ADR 0002-scip-python-reader-for-projections.md" — but the design also adds tantivy, msgpack, httpx, gitleaks-python, tree-sitter-python, plus 4 external CLIs (semgrep, syft, grype, gitleaks, scip-typescript). Phase 1 ADR-0009 ("no new C-extension parser dependencies") sets a strong precedent; the design doesn't engage with it. Where: Components §4 (line ~196), Resource & cost profile.
Problem: IndexHealthProbe ships an AdapterConfidence = Trusted | Degraded(reason) | Unavailable(reason) sum type for its own slice while every other Phase 2 probe returns confidence: Literal["high","medium","low"] per the localv2.md §4 contract. Why it matters: Either the probe contract has changed (an unannounced ABC change, breaking the Phase 0 freeze) or the IndexHealthProbe's output is shaped differently from every other probe's output, which means consumers (scip adapter, Bundle Builder) have to special-case one probe's slice format. The design also talks about a per-probe health_check(slice) -> AdapterConfidence Protocol — making AdapterConfidence the type of every probe's freshness output, not just IndexHealthProbe's — which is a contract change disguised as an "interface." ADR-0033 even prescribes confidence() -> AdapterConfidence for ADR-0032 adapters, not for probes. The design conflates the two. Where: Components §3, §3 "Pattern decisions" line, §8 (events writer).
Problem: The image-digest cache-key strategy "bypasses the Phase 0 declared_inputs mechanism." Why it matters: This is the design's own admission (§6 "Tradeoffs accepted"). declared_inputs is the cache primitive in localv2.md §4 and production/design.md §8.2 (I1, the most load-bearing interface). The design's mitigation is "ADR-gated" — but an ADR doesn't make the deviation safe, it just records it. The actual mechanism — letting probes override cache_key() — already exists; the design doesn't need a new strategy at all if it just respects declared_inputs by including the image digest as a declared input (a "special token" per localv2.md §4's declared_inputs spec which allows them). Instead, §6 introduces a second cache-key derivation pathway that future probes will copy. Where: Components §6, §Risks §2.

Hidden assumptions¶

Assumption: scip-typescript whole-repo re-indexes on every .ts change but the cache key's Merkle root makes "unchanged repo on re-run" a hit. What breaks if it's wrong: The 60-second cold-without-build target assumes scip-typescript runs in 8–15 s on a 50k-LOC repo. Open Question §5 already half-acknowledges this. If scip-typescript regresses, Phase 2's exit criterion ("IndexHealthProbe surfaces at least one real staleness case") still passes but the warm/incremental targets miss by 5–10×, which the performance design has built the rest of its architecture around (the 600 workflows/hr figure is a function of these times). The cost-tier coordinator's tier-1 cap of 4 was sized assuming SCIP fits in one slot.
Assumption: cpu_count() // 2 is the "sweet spot" for tier-1 on an 8-core CI box. What breaks if it's wrong: cpu_count() on a CI runner can be 2 (GitHub-hosted default), in which case max(cpu_count()//2, 2) = 2 — tier-1 and tier-2 both have 2 slots, meaning a docker build and a semgrep run can starve each other on a small runner. The sizing has been calibrated to a developer laptop, not to the CI substrate that actually has to validate the exit criterion.
Assumption: TreeSitterImportGraphProbe uses an internal ThreadPoolExecutor because "tree-sitter's C extension releases the GIL." What breaks if it's wrong: Open Q §6 already flags this. The bigger problem is that the coordinator's "tier-1 cap = cpu_count()//2" budget assumed each probe = 1 CPU; the probe now uses 4 threads inside its slot, so on an 8-core box the four tier-1 slots × 4 threads = 16 threads competing with whatever else tier-2 and tier-3 are doing. The cost-tier semaphore is a lie about actual CPU usage.

Things this design missed that a different lens caught¶

The security design catches that gitleaks findings are themselves secrets and need redaction at the writer chokepoint; the performance design lists gitleaks-python as a dep without ever discussing what happens when a finding lands in repo-context.yaml. The best-practices design catches that IndexFreshness should live in codegenie.indices.freshness and ship as a Pydantic sum type with mypy --warn-unreachable enforcing exhaustiveness; the performance design's AdapterConfidence sum type is asserted but never localized to a module path or grounded in a mypy flag. The best-practices design correctly notes that Phase 2 does not ship the plugin loader — directly contradicting the performance design's load-bearing §2.

Attacks on the security-first design¶

Concrete problems¶

Problem: Mandates bubblewrap as the Layer B/G CLI sandbox while explicitly noting macOS has no equivalent. Why it matters: The design treats this as a "documented gap" (§Risks §2) but the consequence is silent: developers on macOS — by CLAUDE.md's posture, the canonical developer environment — get materially weaker isolation than CI. Every adversarial test the design lists (tests/adv/phase02/, ≥40 hostile inputs) only validates on Linux CI; the macOS code path is exercised by zero tests and the design admits "macOS has no equivalent" but never says how the macOS branch (the one developers actually run) is fuzzed. The "good enough until microVM" framing buys protection for the runner, not for the developer machine where adversarial repos are most likely to be analyzed first. Where: Trust boundary 3 diagram, §"Layer B/G external-CLI sandbox runner" tradeoffs, §Risks §2.
Problem: Introduces ProbeContext.capabilities: ProbeCapabilities as a discriminated union over InProcessCapabilities | SubprocessSandboxCapabilities | ContainerSandboxCapabilities. Why it matters: This is a Probe ABC contract change (or at minimum a ProbeContext change with downstream effects on every existing probe). Phase 1's ProbeContext change was one additive optional field with an ADR; this is a new mandatory discriminated-union field that every probe has to know how to read. The design hand-waves "additive, Phase 0/1 contract preserved" but then says "each probe family's allowed surface is type-checked, and mypy assert_never catches a new family without explicit handling" — meaning every existing Phase 0/1 probe must now exhaustively match on the discriminator or fail typecheck. That's a coordinated, every-file edit dressed as "additive." Where: §"Coordinator (extended)" interface line, §Design patterns applied row 1.
Problem: _safe_yaml_load_skill opens files with O_NOFOLLOW, runs yaml.SafeLoader-with-Python-constructors-stripped, caps size at 256 KB, and records body BLAKE3. Why it matters: The Phase 1 chokepoint for YAML is codegenie/parsers/safe_yaml.py wrapping yaml.CSafeLoader; Phase 0 forbidden-patterns bans yaml.load repo-wide and the existing safe_yaml.load enforces size+depth caps. The security design proposes a parallel "hardened" loader path specifically for skills, with its own subclass discipline and O_NOFOLLOW. Either the Phase 1 chokepoint is sufficient (and the new helper is duplication) or it isn't (and Phase 1's chokepoint is silently broken — which would be a Phase 1 critique, not a Phase 2 design). Surfacing two YAML loaders ("normal" and "hardened-for-skills") creates Rule 7's anti-pattern: two existing conventions, blended. Where: §"SkillsIndex (D2) loader" internal-design, §Design patterns applied row 4.
Problem: IndexHealthProbe switches from "mtime-based freshness" to "cryptographic anchoring against git_commit" and stores output_blob_blake3 in the audit log, requiring B2 to recompute BLAKE3 over every cached blob on every gather. Why it matters: (a) None of the three Phase 2 designs ever proposed mtime-based freshness — this attacks a strawman from earlier in the brief, then "rejects" it with cryptography. The actual freshness signal is (scip_indexed_commit == repo.HEAD), an O(1) string compare against an indexer-emitted header, which doesn't need BLAKE3. The "cryptographic anchoring" frame is solving a problem (cache-blob tampering by an attacker with write access to .codegenie/cache/) that is not in the Phase 2 threat model (.codegenie/ is the user's own directory at 0700/0600 per Phase 0 ADR-0011 — the only way to mutate it is to already own the host). (b) The audit-log chain "fails closed if the chain breaks" — but the chain is also on disk in the same .codegenie/ directory the attacker would already own; the chain protects against accidental corruption, not against the threat model the design names. The cryptographic ceremony is real spend on a non-threat. Where: §Goal §5, §"IndexHealthProbe (B2)" internal design, §"EventStream emitter" tradeoffs.
Problem: SecretRedactor writes plaintext to .codegenie/findings/secrets/<fp>.enc "encrypted-at-rest with a per-repo key under ~/.codegenie/keys/<repo>.key" with the design's own §Risks §5 saying "the keys live in $HOME, same trust tier as ~/.ssh/id_rsa." Why it matters: That admission is fatal. The key lives in $HOME, the encrypted blob lives in $HOME, the host compromise that the encryption is theoretically defending against can read both. The encryption is performance — encryption-as-theatre — that costs (a) per-repo key generation; (b) a new module (SecretFindingCapability); (c) operational fragility ("lost key = lost ability to read past findings") for zero security gain against the actual attacker. The structural fix is "don't write the plaintext at all"; the design buys obfuscation. Where: §"Secret redactor" credentials-accessed line, §Risks §5.
Problem: ExternalDocsProbe is implemented as a separate optional subprocess (codegenie-fetch-external-docs) "invoked under the same bubblewrap jail" with a per-source allowlist — and §Goals §10 frames this as preserving the Phase 0/1 "no httpx/requests/socket under src/codegenie/" ban. Why it matters: Shipping a second binary that does speak HTTP, inside the same source tree, lockstep-versioned with the main codegenie CLI, is a structural workaround for a structural rule. The fence test the design promises ("Phase 0 structural network ban; this probe is opt-in and must use a constrained client") doesn't enforce anything because the network-using code lives outside the path the fence checks. The Phase 0 commitment is now nominal — the network capability exists, it's just spelled as a separate executable. The design should either honor the ban (defer ExternalDocsProbe) or revisit it (an ADR-amend), not paper over it. Where: §Goals §10, §"Layer C runtime sandbox runner" precedent for sibling-binary pattern.

Hidden assumptions¶

Assumption: bubblewrap is "available on every Linux CI runner." What breaks if it's wrong: GitHub Actions hosted runners ship bwrap in ubuntu-24.04 but not ubuntu-22.04 by default; the design's §Test plan asserts "Linux (canonical, full bubblewrap)" and "per-platform CI matrix" but doesn't pin which Ubuntu version. A CI image change silently disables Boundary 3 — the design's most important defense — without failing a test.
Assumption: npm install lifecycle scripts inside the container traced by RuntimeTraceProbe only need --network=none to be neutralized. What breaks if it's wrong: The design's own "acknowledged blind spot" admits "docker build step at Layer C runs npm install lifecycle scripts. Even with --network=none, those scripts can mutate the image filesystem in adversarial ways the SBOM probe might not notice." That admission contradicts the §Goals §2 promise of "zero observed escapes from the container into the gather host." The trust the design buys with Boundary 4 is real for host escape but the SBOM/CVE/runtime-trace probes consume the post-install filesystem as ground truth — adversarial filesystem mutation is in the threat model as a judgment-input attack, not a host-escape attack, and the design only defends the second.
Assumption: Audit-log "hash chain detectable on tamper" provides meaningful integrity. What breaks if it's wrong: The design's §Acknowledged blind spots §3 says: "An attacker who can write to the whole .codegenie/audit/ directory can rewrite the chain from the genesis event upward and produce a forged-but-internally-consistent chain." That admission means the chain protects against accidental corruption only — same property a single trailing checksum would buy. The "fail closed if the chain breaks" promise of §"EventStream emitter" is therefore detecting bit-rot, not adversaries, while the design framing is the latter.

Things this design missed¶

The performance design catches that SCIPIndexProbe requires a projection step (sorted, mmap'd) for the ADR-0032 adapter to hit its latency target — the security design ships scip-typescript under bubblewrap but never asks what shape the output takes downstream, so it has no opinion on whether the post-jail artifact is consumed efficiently. The best-practices design catches that the plugin loader and adapter implementations belong to Phase 3, not Phase 2 — the security design carries forward ProbeExecutionEvent variants (SkillRefused, GrammarLoadRefused, EgressBlocked) into Phase 2 as if the canonical event log existed, but ADR-0034 §Consequences §1 explicitly defers "Phase 9 (Temporal) formalizes the canonical event log." Phase 2 pre-shaping is fine; treating those events as load-bearing now (e.g., "B2 reads the audit log's FinishedOk events for each upstream probe" — §"IndexHealthProbe (B2)" interface line) inverts dependency direction: a probe is now a consumer of the event stream it emits to, which is event sourcing as a runtime data plane, not as an audit primitive. ADR-0034 §"Event types" never described the Phase 2 use case.

Attacks on the best-practices design¶

Concrete problems¶

Problem: Ships seven new top-level packages (codegenie.indices, codegenie.runtime, codegenie.security, codegenie.conventions, codegenie.skills, codegenie.tccm, codegenie.adapters) plus codegenie.depgraph. Why it matters: The design's own §Risks §5 admits this. The defense ("each package's __init__.py:__all__ is the public contract") is a runtime guarantee, not a structural one — once eight packages exist, an internal helper "across" packages becomes someone's import that other code starts to rely on, and the public surface ratchets up. Phase 1 added one new package (codegenie.parsers); Phase 0 froze the layout intentionally. The design treats package count as a stylistic choice when production/design.md §8.4 is explicit that the module layout is a load-bearing interface. Where: §Goals "Net-new top-level packages: seven," §Risks §5.
Problem: Ships codegenie.adapters.protocols containing four Protocol classes (DepGraphAdapter, ImportGraphAdapter, ScipAdapter, TestInventoryAdapter) plus AdapterConfidence — with zero implementations — and defends this with "documentation as code." Why it matters: ADR-0032 §"Plugin manifest contract" places the adapters at plugins/{slug}/adapters/*.py inside the plugin, and ADR-0031 §Consequences §1 names the first plugin (Phase 3) as the proof the loader works. Shipping the Protocol skeletons in Phase 2 without their first implementation is the textbook "premature pluggability" pattern the design's own §Patterns deliberately avoided section claims to refuse ("No plugin DSL", "No metaclasses for probe registration") — but it ships the typed-Protocol equivalent for adapters. The §Risks §4 mitigation ("Phase 2 ships a NullAdapter set... Phase 3's first real adapter has a concrete reference point") proves the point: when the only consumer of a Protocol is a NullAdapter fixture written to validate the Protocol exists, the Protocol is speculative. Where: §Components "Adapter Protocol definitions (kernel side of ADR-0032)," §Risks §4.
Problem: Ships TCCMLoader and the DerivedQuery Pydantic discriminated union for the five ADR-0030 primitives, with no concrete TCCM in the repo. Why it matters: §Components §"TCCMLoader" admits this: "No concrete TCCMs ship in Phase 2; all TCCMs ship inside their plugin (ADR-0031 §Consequences)." So the loader has no inputs to load. The integration test described in §Test plan ("walks the (Phase 2) TCCMLoader → typed model... Does not wire the plugin into a Supervisor (that's Phase 8)") loads a synthetic fixture in tests/fixtures/plugins/synthetic--syn--syn/tccm.yaml that exists nowhere else in the system. The design is asking Phase 2 to author the schema of a Phase-3-and-later artifact, with the sole consumer being a Phase-2-authored test fixture validating its own author. This is the schema-without-a-consumer pattern Rule 2 (Simplicity First) attacks. Where: §Components §"TCCMLoader", §Open questions §3 ("a sixth primitive arrives, do we ADR-amend the sum type or accept Unknown") — a question that wouldn't be a question if the schema followed a real consumer.
Problem: Adopts mypy --warn-unreachable and mypy --enable-error-code=truthy-bool in Phase 2 and commits to retrofit-Option-(a) in Open Q §5 ("fix them as part of Phase 2 (small retrofit)"). Why it matters: This is a Phase-0 + Phase-1 retrofit dressed as a Phase 2 lens choice. The synthesizer has no way to scope it. "The retrofit will be small (Phase 0/1 has a few dozen functions touched)" is asserted, not measured. The design itself flagged "Stage 7 Learning telemetry hooks are not in Phase 2" — but adopts a strictness flag whose blast radius is everything Phase 0/1 wrote. Rule 3 (Surgical Changes) attacks this directly: "Don't 'improve' adjacent code." Where: §Goals, §Open Q §5.
Problem: The DepGraphProbe (kernel skeleton) "looks up an ecosystem-detector" by a "string-keyed dict (npm/pnpm/yarn-classic/yarn-berry)" — the design's own §Acknowledged blind spots §5 admits this is "string-keyed" with a TODO comment. Why it matters: ADR-0033 §1 ("newtype for every domain primitive") makes this exactly the discipline Phase 2 ships to honor — but the design ships the violation, decorates it with a # TODO(phase-3): sum-type after first plugin ships comment, and adds it to the backlog. The backlog item is the primary signal that a sum type is wanted now; deferring it to Phase 3 because "the first plugin actually owns the npm row" is exactly the kind of "we'll fix it when the first concrete user appears" deferral that ADR-0033 §"Reversibility" warns is high-cost-late. Where: §"DepGraphProbe" internal design, §Acknowledged blind spots §5.
Problem: IndexFreshness = Fresh | Stale is the sum type the design rests on — and it lives in codegenie.indices.freshness, away from codegenie.probes.index_health — defended in §Open Q §1 as "the sum type is consumed by Phase 8 Bundle Builder and ADR-0032 adapters — neither imports from probes." Why it matters: Phase 8 doesn't exist yet. ADR-0032 adapters don't exist yet (they ship in Phase 3). The sum type's only Phase-2 consumer is IndexHealthProbe itself, the one module the design extracts it away from. The justification is "neither imports from probes" — but neither imports anything from Phase 2 at all yet, so the import-direction argument is hypothetical and the layout decision is speculative. Co-location is the boring default until the second consumer exists. Where: §"IndexHealthProbe (B2)" "Where it lives" line, §Open Q §1.

Hidden assumptions¶

Assumption: "Phase 0 + Phase 1 has a few dozen functions touched" by --warn-unreachable. What breaks if it's wrong: If the retrofit turns up dozens of legitimate-looking unreachable branches in Phase 0 schema validation or Phase 1 lockfile parser fallbacks, the Phase 2 PR balloons; Rule 6 (token budgets) starts being violated mid-phase.
Assumption: The gitpython-or-shell-out question in §Open Q §4 is genuinely tied. What breaks if it's wrong: The design's preferred choice (shell out via run_allowlisted) requires adding git to ALLOWED_BINARIES — but git is already added (Phase 0 §6.4). So the "shell out" answer is cheap and the gitpython option is gratuitously a new dep. The Open Q exists only because the design didn't check the existing allowlist. The synthesizer might pick gitpython from politeness.
Assumption: "Four scanners with four genuinely different input/output shapes do not share a common abstraction worth ~60 LOC of saved boilerplate." What breaks if it's wrong: §Components §"Layer G security wrappers" claims this; the security design proposes exactly the unified abstraction (_run_external_cli(probe_name, argv, cwd, allowlisted_egress, capability_token) -> CompletedProcess) and argues it's the security primitive ("Command pattern"). The best-practices lens has not engaged with this conflict; it has refused the abstraction on aesthetic grounds while the security lens demands it on threat-model grounds. The assumption that there is no shared abstraction worth writing is unexamined.

Things this design missed¶

The performance design catches that the SCIPIndexProbe cold path on a 50k-LOC repo is the wall-clock dominator and the projection step decides whether Phase 3's adapters are O(lookup) or O(walk) — the best-practices design ships scip-typescript as a "thin wrapper" with no projection at all, so a Phase-3 ScipAdapter implementing the Protocol the best-practices design ships will need to re-parse the .scip binary every query, which the Protocol definition doesn't make visible. The security design catches that yaml.SafeLoader defaults still permit some unsafe constructors (the relevant attack surface is yaml.SafeLoader-subclass discipline) and that gitleaks findings carry their secrets in the slice — the best-practices design's GitleaksProbe is a thin Pydantic-typed JSON wrapper with no redaction at the writer chokepoint, so a real AKIA... in .git/ history ends up verbatim in repo-context.yaml.

Cross-design observations¶

Where do the three disagree?¶

Dimension	Performance picks	Security picks	Best-practices picks	What's at stake
Plugin loader in Phase 2	YES — kernel-resident loader + universal-fallback plugin	Silent — assumes loader exists for Capability bundle dispatch	NO — Phase 3 owns the loader; only Protocols + TCCMLoader ship here	Whether Phase 2 violates the roadmap's Phase-3 assignment of ADR-0031's first concrete delivery
Probe ABC contract change	YES — adds `cost_tier: Literal[0,1,2,3]` to the ABC	YES — adds `capabilities: ProbeCapabilities` to `ProbeContext`	NO — ABC frozen; uses `requires` and decorator registration unchanged	Whether `localv2.md §4` survives Phase 2 as a frozen contract
`confidence` modeling	`AdapterConfidence = Trusted \\| Degraded \\| Unavailable` sum type, used for both probes and adapters	`IndexConfidence` sum type for B2; `confidence: str` retained elsewhere	`IndexFreshness = Fresh \\| Stale(reason)`; downstream consumers `match` exhaustively	Whether ADR-0033's `confidence() -> AdapterConfidence` clause applies to probe outputs (a contract change) or only to ADR-0032 adapters (a Phase-3 concern)
Cache-key derivation for Layer C	Image-digest-keyed cache deviating from `declared_inputs` discipline	Cryptographic anchor of upstream blob BLAKE3 via audit-log lookup	Stays on `declared_inputs`; runtime trace's `declared_inputs` includes Dockerfile + scenarios.yaml	Whether the Phase 0 I1 contract (`declared_inputs` as the universal cache key) remains the only mechanism
Audit-log shape in Phase 2	Three event variants (`IndexHealthDegraded`, `ProbeCacheInvalidated`, `ExternalToolMissing`) shipped to `.codegenie/events/`	Full hash-chained discriminated-union with 10+ variants, B2 reads from the chain	Phase-0 audit anchor (`runs/<utc-iso>-<short>.json`) untouched; no event stream	Whether Phase 2 pre-ships Phase 9/13's event-sourcing infrastructure (ADR-0034) or honors the ADR's own "lands in Phase 9 or 13" clause
`pytest-xdist`	Enabled for portfolio lane	Silent	Silent	Whether Phase 0's veto (synthesizer-recorded 10/4) survives
External-CLI sandbox	Cost-tier semaphore only; no jail	`bubblewrap` (Linux) + documented macOS gap; Container sandbox for Layer C	Phase 0 `run_allowlisted` chokepoint untouched	The whole supply-chain threat model
Secret findings in `repo-context.yaml`	Not addressed	Redacted to `<REDACTED:fingerprint=…>`; plaintext encrypted under per-repo key	Not addressed — `GitleaksProbe` ships findings inline as typed JSON	Whether Phase 2 leaks secrets the moment a real `AKIA...` shows up in `.git/`
`ExternalDocsProbe` network capability	Added; tier-3 cost tier; allowlisted via httpx	Sidecar binary `codegenie-fetch-external-docs` under bubblewrap	Opt-in; "skip cleanly if no external_docs config"	Whether Phase 0's "no `httpx`/`requests`/`socket` under `src/codegenie/`" rule remains structural

Which disagreement matters most for this phase?¶

Where the plugin loader lives. Performance ships it in Phase 2 with a universal-fallback plugin "to make the kernel-side seam exist." Best-practices says Phase 2 only ships kernel-side scaffolding (TCCMLoader, adapter Protocols). The roadmap and ADR-0031 are unambiguous: Phase 3 ships the plugin loader, the universal-fallback plugin, the first concrete plugin (vulnerability-remediation--node--npm), the TCCM, the four ADR-0032 adapters, the Skills, and the OpenRewrite recipes — all together, as the proof the loader works. Pulling any one of these forward into Phase 2 means shipping infrastructure on speculation (premature pluggability) and leaving Phase 3's exit criterion ("first plugin ships AS the proof") with no proof to deliver because the loader already exists. The synthesizer must resolve this with prejudice in favor of the roadmap — every other Phase 2 decision (cost-tier coordinator, ABC change, projection format, sum-type placement) downstream-changes if the loader is or is not present.

Where do all three quietly agree on something questionable?¶

All three pre-ship some flavor of IndexFreshness / AdapterConfidence / IndexConfidence sum type whose only Phase-2 consumer is IndexHealthProbe itself. The Phase 3+ consumers (ADR-0032 adapters, Phase 8 Bundle Builder) are postulated. Best-practices' §Risks §1 names the risk ("type sits unused for 1–6 phases, drifts"); performance and security don't. The whole industry has named this pattern: schema before consumer.
All three ship tree-sitter as a Phase 2 dep without engaging with Phase 1 ADR-0009 ("no new C-extension parser dependencies"). That ADR specifically excluded tree-sitter from Phase 1; nothing in any of the three Phase 2 designs treats it as a precedent or argues for the precedent's amendment. The dep is just assumed.
All three treat RuntimeTraceProbe (5 scenarios × 16–20 s, ~80–100 s wall-clock floor) as a Phase 2 deliverable despite none of the three citing where the five scenarios are configured by the user. Performance has a config flag, security has a typed Scenario enum, best-practices has a scenarios.yaml. The three designs are aligned on shipping the heaviest probe in the system with three different configuration shapes — a structural inconsistency the synthesizer must pick from.

Roadmap-level critiques¶

Does this phase as designed set up problems for later phases? Yes — through the plugin loader. If Phase 2 lands the loader (performance design's pick), Phase 3 is no longer the "first plugin doubles as the proof the loader works" milestone the roadmap names; it becomes "first plugin lands against a loader that already shipped without any plugin to test it." The proof is hollowed out. Conversely, if Phase 2 ships only adapter Protocol definitions and a TCCMLoader (best-practices design's pick), the Phase-3 plugin can be written against a Protocol set that turned out wrong, forcing a Phase-2 schema amendment that ripples through every Phase-2 consumer. Best-practices' mitigation (a NullAdapter fixture set authored by Phase 2 itself) is exactly the schema-validates-itself anti-pattern — the Phase 3 implementer will discover the divergence on day 1 of writing the real adapter. The synthesizer cannot pick "neither" — something about ADR-0029/0031/0032 has to land in Phase 2 because the brief lists those ADRs as load-bearing inputs — but the three designs disagree on which part lands where, and any pick has Phase 3 friction.
Does it rely on something an earlier phase didn't actually establish?
The performance design's cost_tier semaphores rely on Phase 0's coordinator having a per-probe semaphore-selection hook. Phase 0 §"Coordinator" actually ships one Semaphore(min(cpu_count(), 8)); per-probe selection requires editing the Phase-0 coordinator (the design admits this as "extending Phase 0/1" but doesn't acknowledge it as a Phase-0 chokepoint edit, which Phase 0 §"Frozen surfaces" forbids).
The security design's _run_in_container chokepoint relies on Phase 0 having a subprocess pathway that admits a container runtime. Phase 0 has run_allowlisted for binaries; running containers with --cap-drop=ALL --network=none is structurally a different subprocess pathway. Either _run_in_container wraps run_allowlisted("docker", ...) (in which case Phase 0's allowlist must add docker and podman, neither currently allowlisted) or it's a second subprocess path — Phase 0 §"Chokepoints" forbids the latter.
The best-practices design's "TCCMLoader walks a tests/fixtures/plugins/synthetic--syn--syn/tccm.yaml and validates it against codegenie.tccm.model.TCCM" relies on a plugin layout (plugins/{slug}/tccm.yaml) ADR-0031 specifies — but ADR-0031 §"Plugin layout" places plugins at plugins/ in the repo root, not at tests/fixtures/plugins/. The fixture path quietly forks the convention.
Does it violate any load-bearing commitment from production/design.md §2 or CLAUDE.md?
§2.5 "Extension by addition. Adding a new language, new task type, or new tool must be new probes + new Skills + new subgraphs, never edits to existing ones." The performance design's cost_tier ABC field is an edit to the existing Probe ABC; the security design's ProbeContext.capabilities is an edit to the existing ProbeContext. Both designs assert "additive, backward-compatible" — but ABCs/protocols Phase 0 explicitly froze and listed as load-bearing in production/design.md §8.4 I1. ADR-required, not "additive."
§2.1 "No LLM in the gather pipeline." None of the three violate this — but the best-practices design adds tantivy (a Rust BM25 indexer) for ExternalDocsIndexProbe; the performance design adds tantivy + tree-sitter-python + gitleaks-python; the security design adds none of these but adds bubblewrap + seccomp + a custom seccomp profile. The gather extras list grows in three different shapes; the Phase 0 fence test only checks the LLM SDK ban, not the shape of dep growth.
§2.3 "Honest confidence. … IndexHealthProbe is the canonical example in the POC." All three designs treat B2 as load-bearing — but disagree wildly on its output shape (AdapterConfidence vs. IndexConfidence vs. IndexFreshness), its inputs (sibling probe artifacts vs. audit-log events vs. cached blob BLAKE3 + git HEAD), and its cache strategy ("none" per performance vs. read-only over the audit log per security vs. content-keyed against .codegenie/context/raw/*.json per best-practices). Three "canonical examples" of honest confidence in one phase is not honest.

Design-pattern critiques (cross-cutting)¶

Misapplied patterns (with concrete victim)¶

Performance design — "Plugin architecture" applied to cost tiers. §Components §1 names "Plugin architecture (the kernel knows about tiers; probes self-classify)" as the pattern for the cost-tier coordinator. That is not plugin architecture; that is a new ABC field consumed by the scheduler. Plugin architecture means the kernel imports nothing from plugins and discovers them via a registry — not "probes set a field the kernel reads." The pattern label is doing work the design hasn't earned.
Performance design — "Hexagonal" applied to the plugin loader. §Components §2 names "Hexagonal: the loader is a Port; 'Pydantic-validated YAML on disk' is the only Adapter today, but Phase 14's webhook-driven loader and Phase 16's signed-plugin loader plug into the same Port." A Port + one Adapter is a function with a side door taped open; the second adapter doesn't exist for 12+ phases. The toolkit's "Failure mode" on hexagonal is exactly this: "a 'hexagonal' design that smuggles requests.get(...) directly into a domain function. The whole point is that the core doesn't know about HTTP." Symmetrically, hexagonal with one Adapter is the door without the wall.
Security design — "Capability pattern" applied to SecretFindingCapability. §Components §"Secret redactor" claims the LLM accesses plaintext "only via SecretFindingCapability token" — but the LLM doesn't construct objects in Python; it constructs strings that describe tokens. The capability pattern works for in-process calls where holding the token is the capability. Across a Python/LLM boundary, the LLM never holds the token; it asks code to act on its behalf. The capability is therefore a server-side check, which is just authorization with a fancier name. §Design patterns row 6 says "The LLM cannot mint the token" — true, but irrelevant; the LLM doesn't need to mint it to exfiltrate the value if a helper reads-and-renders.
Best-practices design — "Strategy via Protocol" for adapters with zero implementations. §Components §"Adapter Protocol definitions" names "Structural subtyping / Strategy via Protocol." Strategy is a pattern for interchangeable algorithms behind a common interface — pick one at runtime. With zero implementations, there is nothing to pick. The Protocol is a type with no behavior; it's a contract document with .py extension. Misapplied pattern + premature pluggability double feature.

Missed patterns (with concrete location)¶

Performance design — Decorator/Registry omitted in favor of "Plugin Loader." The Phase 0/1 mechanism (@register_probe writing into a module-level dict) is already the kernel-side registry for everything Phase 2 ships. The performance design has the loader doing what the registry already does (resolving probe sets) while pretending it does something different (loading plugins that don't exist).
Security design — Smart-constructor pattern missed at the secret-redactor boundary. §"Secret redactor" returns tuple[dict, list[SecretFinding]]. The natural shape is a smart constructor on a RedactedSlice value type that cannot be constructed without a populated findings list — making "redactor was called" type-checkable. The current shape allows a caller to drop the list[SecretFinding] and persist only dict, losing the audit trail; the smart constructor would force the audit-trail to be co-located with the slice or fail typecheck.
Best-practices design — Open/Closed at the DepGraphProbe.ecosystem-detector dict. §"DepGraphProbe" ships a string-keyed dict (admits this in §Acknowledged blind spots §5) when the natural shape is a per-ecosystem @register_dep_graph_strategy decorator following the existing @register_probe pattern. The toolkit's Open/Closed Failure mode names this exactly: "the central dispatch_task_class(name) function has a match name block that grows every time. That's modification, not extension." A string-keyed dict is the data-shape equivalent.

Pattern claims that don't survive scrutiny¶

Performance design "Functional core, imperative shell" on the SCIP projector. §Components §4 "Pattern decisions: Functional core (the projector is pure: .scip bytes → projection bytes)." A function that takes .scip bytes and returns projection bytes is just a pure function — labeling it FCIS is ceremony. The "imperative shell" the design names is subprocess.run and Path.write_bytes(); the bar for "shell" is higher than "the function that calls the function" or every Python program qualifies.
Best-practices design "Composition + clear naming over generic frameworks" for three loaders. §Components §"SkillsLoader, ConventionsCatalogLoader, TCCMLoader as separate concrete classes." Composition implies the three share components by combining them; the design ships three separate classes that share nothing in code — they each call yaml.safe_load and run a Pydantic validator. That's not composition; that's parallel implementation. The pattern label is window-dressing on duplication-by-copy.
Security design "Hexagonal" applied to _run_external_cli + _run_in_container. §Design patterns row 2 claims "Hexagonal architecture / Ports & Adapters" for the chokepoint pair. Hexagonal requires the core not know about the adapter; here the core (every probe) calls _run_external_cli(... capability_token) with an argv list that is shaped per-CLI. The chokepoint forwards; it doesn't translate. Adapter pattern's failure mode in the toolkit ("an Adapter that re-exports the same interface unchanged. If you didn't translate, you wrote a forwarder.") applies verbatim.

Anti-patterns from the toolkit's "flag on sight" list¶

Premature pluggability. Performance ships the Plugin Loader with one implementation (universal fallback) whose contributes.adapters: {} is empty — a "pluggable" architecture with zero plugins. Best-practices ships four adapter Protocols with zero adapters. Security ships a ProbeCapabilities discriminated union with three variants when every Phase 2 probe fits into one of three known buckets — the data-driven approach (probe declares its kind in the registry, coordinator passes a context shaped accordingly) would work without a sum-typed capability bundle.
Untyped dict[str, Any] interfaces. None of the three carry these forward egregiously — but the best-practices design's ProbeOutput.schema_slice: dict[str, Any] (inherited from localv2.md §4) is unchallenged across all three. Every Phase 2 design adds typed slices but routes them through the untyped envelope at merge time. None of the three names this as a contract gap.
Boolean flags on public methods. Best-practices design's SkillsLoader.load_all() -> Result[list[Skill], SkillsLoadError] is fine; but its ConventionsCatalogLoader.apply(conventions, repo) has no flag (good). Security's _run_in_container(image_or_dockerfile, scenario, capability_token) uses an image_or_dockerfile parameter that union-types "either path or string contents" — the toolkit's "boolean flag" anti-pattern generalized to "polymorphic parameter."
Tag-and-dispatch without a tagged union. Performance design's per-probe cache_strategy is Literal["content", "none"] (inherited from localv2.md §4) — and then the design adds a third behavior (image-digest-keyed) via the cache_key() override hook, without extending the discriminator — so cache strategy is now partially captured in the cache_strategy field and partially in the cache_key method, with no sum type covering both. Future readers cannot grep for "all cache strategies."
Capability passed through ten frames as a parameter. Security's capability_token is exactly this: it threads from ProbeContext through every probe's run() into _run_external_cli / _run_in_container / _grammar_runner. The toolkit's prescription ("make it explicit (ContextVar, a small Context dataclass, or a registry)") would route through a coordinator-level registry indexed by probe family; the design routes through ten frames as a parameter.
Side effects in constructors. Best-practices design's SkillsLoader.__init__(self, search_paths) is fine; its ConventionsCatalogLoader.__init__(self, search_paths) is fine. But the security design's per-repo encryption key generation ("generated on first run per repo; lost key = lost ability to read past findings") is an import-time / first-call side effect — the design doesn't name a constructor for SecretRedactor, but its lifecycle includes "writes to ~/.codegenie/keys/<repo>.key" the first time it sees a new repo, which is a side effect at the wrong layer (file creation in $HOME on a code path the user did not opt into).

This phase's specific traps the brief named:

Probes that re-encode IndexFreshness as Optional. None of the three commits this exactly; best-practices makes a virtue of avoiding it. But performance's cache_strategy = "none" on IndexHealthProbe plus its sibling-probe health_check(slice) -> AdapterConfidence Protocol is the structural cousin: the freshness signal is now a method spread across N probes rather than a sum-typed return on B2 alone.
Plugin scaffolding the kernel still hardcodes. Performance ships the Plugin Loader but the kernel-resident Layer B–G probes are "registered through" the universal plugin's contributes.probes: list. So the kernel does import them; the manifest is bureaucracy. Best-practices ships adapter Protocols in codegenie.adapters — kernel namespace — and notes "no implementations." Hardcoding by another name.
"Hexagonal" sandbox claims that smuggle subprocess into the core. Security's _run_external_cli is the chokepoint, but every probe still constructs the argv list itself, so the per-tool knowledge (semgrep's flags, syft's flags) lives in probe modules — not in adapters. The "Port" sees a CLI invocation as if all CLIs were interchangeable, which they aren't, so the abstraction leaks per-tool flags into the core. Hexagonal in name; subprocess-with-extra-steps in code.
Smart-constructor parsers that the rest of the code bypasses with model_construct(). None of the three ships an explicit model_construct bypass — but best-practices' §"Result[T, E] (Pydantic-modeled Ok | Err) at every parse boundary" assertion is checkable only via convention. Phase 0/1 don't ship Result; adding it in Phase 2 and asserting "every parse boundary" without a lint rule is the bypass-by-omission failure mode.