Skip to content

Phase 00 — Bullet tracer + project foundations: Critique

Role: Devil's-advocate critic. Date: 2026-05-11 Targets: design-performance.md, design-security.md, design-best-practices.md Source of truth: docs/roadmap.md §"Phase 0", docs/production/design.md §2 (load-bearing commitments), CLAUDE.md.

Tone: hostile, concrete, specific. No fixes proposed. Each finding names a component, line item, or decision and explains the wound. The synthesizer is the next stop; this document only counts the holes.


1. Attacks on the performance-first design

1.1 Five concrete problems

  1. xxh3-128 for content addressing is the wrong hash and the wrong defense. The design swaps SHA-256 (called out in localv2.md §8) for xxh3 to save ~3ms per MB on lockfile hashing. xxh3 is not collision-resistant — it's specifically designed to be fast and not adversarial. Phase 14's continuous-gather model (production/design.md §3.2) re-hashes package.json and package-lock.json across the full watched portfolio on every push webhook. An attacker who lands a single PR in any watched repo can construct two payloads that collide under xxh3 and use that to pin the cache to the wrong probe output for a subsequent malicious commit — silent staleness, which load-bearing commitment §2.3 (production/design.md) calls out as "the worst failure mode." Worse: the design openly admits this in §"Acknowledged blind spots" and shrugs ("the threat model doesn't include local FS write attackers in Phase 0") — but the threat model the synthesizer is supposed to inherit does include that, because Phase 14 turns gather into a webhook-driven service.

  2. Lazy imports as a CI-gated invariant is fragile by construction. The design pins codegenie --help cold-start at ≤200ms with a test_cli_cold_start.py canary and an import-linter config. Both safeguards fail open in a way the design glosses over: (a) import-linter doesn't catch transitive imports introduced by an existing module's __init__.py — adding pydantic to errors.py brings in pydantic_core (the C extension) via a chain the linter doesn't model unless every transitive path is enumerated; (b) the canary asserts ≤200ms on Linux CI, but real measurement runs on a noisy GitHub Actions runner where p95 of "5 subprocess.run + median" routinely swings 60-80ms between runs. The canary will be either flaky (and disabled) or set so loose it doesn't catch the regression it exists to catch. Either way the 80ms M-series target is theatre — there's no enforcement loop touching the developer's actual machine.

  3. fastjsonschema is exec'd Python and a real attack surface. §"Schema validation" picks fastjsonschema for ~10× validation speed. fastjsonschema works by generating Python source for the schema and exec-ing it at import time. In a Phase 14 continuous-gather worker that ingests schemas from a Skills server (production/design.md §3.1 Continuous Gather Dispatcher) or any future where schemas come from disk rather than the wheel, that exec path is a code-injection vector. The design dismisses this as "two libraries, one schema source — acceptable complexity for the speed win." It is not — the validator is on the persistence path that the audit trail and reproducibility story depend on. Phase 0 hardcoding fastjsonschema makes the wrong choice path-dependent.

  4. pytest-xdist "from day one, even with five tests" is premature parallelism that breaks fixtures. The justification — "muscle memory and CI config in place before Phase 1 multiplies test count by 30" — is exactly the wrong reason to enable xdist with zero parallelism win. xdist forces tests to be isolated per worker; the Phase 0 cache-hit smoke test (one of only five tests!) needs to write to .codegenie/cache/ on a shared fixture dir, and xdist's default loadfile distribution will hand the same fixture dir to two workers if a future contributor adds even one parameterization. The bug — workers racing on cache writes — will surface in Phase 1 or 2 when the test count actually goes up, and the design will blame "flaky tests" rather than the structural decision to enable xdist before it provided value.

  5. Append-only JSONL cache index is content-addressed by design and then read linearly forever. §"Cache layer" promises mmap-scan linear over .codegenie/cache/index.jsonl, asserts ≤2ms p95 cache-hit dispatch, and waves at Phase 14 ("swap the backend to SQLite or LMDB"). The math in §"Storage growth rate" assumes one entry per gather. But the continuous-gather model in production/design.md §3.2 runs incremental gathers on every push webhook, every PR sync, every CVE event, every cron — for every probe whose declared_inputs changed. Phase 2's 30 probes × hundreds of repos × multiple triggers per day × 365 days hits the 1M-entry threshold inside the first production year, not in some abstract "Phase 14 scale." The mmap-scan p95 of "single-digit ms" cited in §"Resource & cost profile" only holds when the index is single-digit MB. The design's own logic ("11k entries × 150B = 1.6 MB after a year of nightly continuous gather") understates the trigger fan-out by 100×.

1.2 Three hidden assumptions

  1. The performance budget is denominated in single-repo, single-process gathers. Every figure in §"Resource & cost profile" assumes one CLI invocation against one repo. But Phase 14's Continuous Gather Dispatcher (production/design.md §3.2) runs the same coordinator code under a different shape — concurrent invocations against the same .codegenie/cache/ shared between gathers triggered by overlapping pushes. The async semaphore tuning (min(cpu, 8)) and the per-process file-descriptor budget are sized for a single CLI invocation; the design has no model for concurrent CLI processes contending on the same cache directory. Append-only with O_APPEND is atomic per-record, but two processes both appending into the same index.jsonl while a third is mmap-scanning is a race the design hand-waves away. "If you don't measure it, you don't have it" — and this isn't measured.

  2. uv is treated as the baseline installer for cost arithmetic, but then pinned to a specific version and called "young." §"Risks" rates uv instability as risk #2, mitigated by "pin to a specific version." A pinned uv means requirements-locked.txt is the input to a known binary — but uv ≤2025 has had three breaking changes in the resolver in the last 12 months. The lockfile generated by uv 0.4.x is not byte-identical to the lockfile generated by uv 0.5.x even for the same pyproject.toml. The performance design's reproducibility claim (the same "Phase 0's storage growth rate" math) silently assumes lockfile stability across pinned-uv upgrades. It is not stable.

  3. CI walltime ≤90s p95 is a target the design can't enforce against itself. Three parallel jobs × ~70s = ~70s wall-clock — if GitHub Actions assigns three free runners simultaneously. Free-tier GHA queues at peak hours stack jobs for 30-180s before a runner is allocated. The "p95 ≤90s" budget should be wall-clock from PR-push to green check, not wall-clock from runner-start. The design measures the wrong thing and reports the answer it wants.

1.3 Things this design missed that the others caught

  • No Output Sanitizer / path scrubber. The security design's §"Output writer + sanitizer" rewrites absolute paths (/Users/dannytrevino/...) to relative form before the YAML is persisted. The performance design writes the YAML directly via yaml.CSafeDumper and never considers that Phase 11's PR-opening stage will commit the YAML to a real repo. Phase 0's perf-optimized YAML will leak the developer's home directory tree across the portfolio.
  • No subprocess allowlist. The performance design tracks subprocess concurrency via asyncio.create_subprocess_exec but has no notion of a permitted-binaries set. By Phase 2 the codebase will have 30+ subprocess callsites (semgrep, syft, scip-typescript, etc.); retrofitting an allowlist over a codebase that already shells out everywhere is not bounded work.
  • No reproducible-build CI check. Best-practices and security both call out that two runs of the same commit must produce byte-identical outputs (the security design enforces with PYTHONHASHSEED=0/SOURCE_DATE_EPOCH; best-practices implies it via the cache-hit smoke test). The performance design omits this — and Phase 7's "extension by addition" invariant (production/design.md §2.5) relies on the contract being reproducible between versions of codewizard-sherpa.
  • No coverage floor. Best-practices sets ≥90% line / ≥80% branch and ratchets it up per phase. The performance design says "coverage is computed but not enforced as a gate." Skipping the gate in Phase 0 means the gate never lands — there's no later phase where adding a coverage floor doesn't require a backfill PR that adds tests rather than catches bugs.

2. Attacks on the security-first design

2.1 Five concrete problems

  1. The HMAC-signed cache index defends against a threat the design itself can't articulate. §"Cache layer" introduces a per-installation ~/.codegenie/.cache-key, signs every cache index record, and verifies on read. The design's own threat model (§"Acknowledged blind spots") says the threat is "an attacker with filesystem write access to .codegenie/cache/ but not the key file." Name this attacker: who has write access to a repo's .codegenie/cache/ but cannot read ~/.codegenie/.cache-key owned by the same user? The answer is "nobody on a developer workstation." On a CI runner with ephemeral filesystem the key file doesn't even exist between runs — so the HMAC is either regenerated every CI invocation (and thus rotates with every gather, making the audit trail's "tamper-evident" claim impossible to verify across runs) or stored as a CI secret (and the security design provides no story for that). The HMAC is complexity that buys nothing on the dev workstation and breaks the audit-trail claim in CI.

  2. gitleaks on every gather, in the write path, is a $50-$200ms unbounded subprocess on a hot loop. §"Output writer + sanitizer" puts gitleaks in the serialized-output sanitization pass. The cost ("acceptable; ~50-200ms") is wall-clock for a Phase 0 YAML of 2 KB. Phase 2's repo-context.yaml is "a few hundred KB" (the perf design's number). gitleaks over hundreds of KB plus the raw artifacts directory ("~5 MB Phase 2") will be seconds, not milliseconds. In the continuous-gather model where hundreds of probes re-serialize after a single trigger, gitleaks in the write path becomes the single largest cost in the gather pipeline — and load-bearing commitment §1 (production/design.md) demands the gather pipeline be cheap enough to run continuously. The security design has moved a tens-of-seconds subprocess into the cheap-and-continuous path.

  3. additionalProperties: false at every level breaks "extension by addition." The security design's repo-context.v1.json schema makes "unknown fields a validation failure — surfacing buggy probes." CLAUDE.md and production/design.md §2.5 both require "extension by addition — adding Java, Python, or a new task type must be new probes + new Skills, never edits to existing probes or the coordinator." If the schema rejects unknown fields, adding a new probe in Phase 1 requires editing the schema, not adding a new file. The security design has structurally violated the load-bearing extension model in service of structural validation strictness. The two policies are incompatible at the seam.

  4. The field-name regex filter is a heuristic security gate, which means it's neither. §"Probe contract" rejects keys matching (?i)^.*(secret|token|password|credential|api[_-]?key|auth[_-]?token|bearer).*$. False positives are the named ones (decryption_steps). The unnamed problem: false negatives. A probe emitting {"npm_registry_url": "https://registry-token-abc123.npm-corp.internal/"} carries an embedded credential and trips no filter; a probe emitting {"auth_provider_name": "okta"} trips the filter and fails closed. The "defense in depth" gitleaks scan two passes later catches the real secret only if gitleaks's own regex library hits it — and gitleaks misses many internal credential formats by design (it's tuned for AWS/GH/etc.). The whole filter is a confidence theatre that adds friction without bounding the leak set.

  5. Pinned-by-SHA actions are a maintenance dead-letter office. §"CI workflow" pins every GitHub Action by SHA, with Dependabot raising PRs. By design, this puts every action update behind a human review PR. Six months in, the team has 20 actions pinned by SHA; the team is shipping Phase 5-6; Dependabot PRs queue up; nobody reviews them because none of them changes user-visible behavior; a CVE in actions/checkout (the SHA the team pinned) goes unpatched for two months. The risk section's mitigation ("Dependabot handles routine bumps automatically") begs the question — Dependabot raises the PR; someone has to merge it. The design has prescribed friction without budgeting the review capacity to clear it.

2.2 Three hidden assumptions

  1. unshare -n (network namespaces) works on macOS. §"Goals" promises "verified in CI via unshare -n or netns equivalent." unshare is a Linux-only utility; macOS has no equivalent. The smoke-test guarantee ("zero outbound network calls during codegenie gather") is enforceable on Linux CI runners but not on the developer's M-series Mac. CLAUDE.md and localv2.md both target macOS+Linux. The security design's flagship invariant has no enforcement on half the supported platforms.

  2. The [project.optional-dependencies.gather] extra cleanly separates LLM deps from gather deps in Phase 0. §"Risks" #5 proposes a gather extra that locks the gather pipeline out of LLM SDKs. Today's pyproject.toml has one extra (dev). The security design proposes a second (gather) — but Phase 0 ships only the gather pipeline. The CLI entry point depends entirely on gather deps. So [project.optional-dependencies.gather] either is the same as [project] (in which case the separation is fictional and Phase 4's [project.optional-dependencies.service] will be the actual separation, lifting Phase 0's design into a refactor) or [project] is empty and the CLI doesn't install by default (broken UX). The design hasn't worked out which one it means.

  3. Hash-pinned lockfiles compose with uv resolver upgrades. §"Goals" requires uv pip compile --generate-hashes and --require-hashes install. The same uv pin to specific version problem the performance design has applies here doubly: uv pip compile output depends on uv's resolver version, and the security design pins uv "in CI for reproducibility" — which means the CI pin and the developer's uv version drift, which means make bootstrap produces a different lockfile than CI verifies, which means CI fails with a hash mismatch the developer can't reproduce locally. The design has a security story but no reproducibility story to make the security story enforceable.

2.3 Things this design missed that the others caught

  • No coverage floor. Security cares about audit and integrity but doesn't gate code coverage. Best-practices' 90% line / 80% branch floor is the actual mechanism for catching the "secret-leak defense layer was never tested" failure mode. Security tests for "an output containing a fake AWS key fails the sanitizer" — but doesn't enforce that every sanitizer code path is exercised.
  • No CI walltime budget. Best-practices doesn't quote a number, but performance does (≤90s). Security adds gitleaks + bandit + pip-audit + osv-scanner + the netns-isolated test + the reproducibility check; the design admits in §"Acknowledged blind spots" that walltime is "~120s." That's a 33% regression against the performance target with no recovery plan. Phase 0 ships with CI noticeably slower than the roadmap (docs/roadmap.md) implies for a "five tests" surface.
  • No conventions catalog. Best-practices §2.6 lays out a single errors.py hierarchy and a config-merge precedence with fail-loud-on-unknown-keys. Security has none of this. Phase 1's probe authors will diverge on exception types and config-typo handling, and the security design's "fail closed" reflex won't catch it because the failures aren't security-shaped.
  • No make-mirrors-CI invariant. Best-practices §6 makes make check mirror CI exactly so contributors can self-verify. Security's CI surface (six jobs, custom hooks, harden-runner, etc.) has no local mirror, which means the bar contributors hit is "push a PR and wait." This is exactly the "pre-commit hooks erode under contributor pressure" risk the security design itself names — caused by its own ergonomics gap.

3. Attacks on the best-practices-first design

3.1 Five concrete problems

  1. "Coordinator runs serially in Phase 0" is a contract-breaking decision dressed as scoping. §8 ("Speculative async optimization") sells a serial coordinator as Rule 2 (Simplicity First). But the coordinator's interfaceasyncio.Semaphore(N), per-probe asyncio.Task, asyncio.wait_for timeout — is itself a load-bearing convention the rest of the system inherits (production/design.md §3.1: "Probe Coordinator dispatches probes in parallel"). Shipping serial-only in Phase 0 means Phase 1 doesn't add concurrency; it replaces the coordinator. The exit-criteria item §11 ("the bounded async dispatch in the coordinator (replacing Phase 0's serial path)") admits this. So Phase 0 has shipped a stub the next phase rewrites — exactly the "speculative scaffolding" §9.1 warns against, with the speculation pointing the wrong way.

  2. dataclasses-only in Phase 0 and pydantic in Phase 6 is a four-phase migration disguised as restraint. §2.2 explicitly rejects pydantic ("Phase 0 doesn't need pydantic at all"). But §3.3's probe contract ships Probe with class attributes that the security and performance designs treat as Pydantic models (immutable RepoSnapshot, ProbeOutput with JSONValue recursive types). Phase 4 introduces anthropic and langgraph, both of which depend on pydantic v2 — so by Phase 4 pydantic is a transitive dep regardless. Postponing the migration to Phase 6 means every probe written between Phase 1 and Phase 5 gets rewritten in Phase 6 from dataclass-based to pydantic-based contracts. The "we'll add it when there's load" framing buries an unbudgeted migration in Phase 6's scope.

  3. The LanguageDetectionProbe recognizes Dockerfiles but no Phase 0 probe consumes them. §3.1 declares Dockerfile as a recognized language. Phase 7 (docs/roadmap.md) introduces the migration task class and the Dockerfile-specific probes. The best-practices design ships Dockerfile detection in Phase 0 ("seven extensions covered") — exactly the speculative-feature category §3 says it bans ("No IndexHealthProbe, no schema for B-G, no fixtures beyond the trivial"). The design enforces its own scope rule on B-G but breaks it on A1. The rule is either binding or it isn't.

  4. declared_inputs = ["**/*"] for LanguageDetectionProbe defeats the cache key. §3.3 sets the probe's declared inputs to the entire tree. The cache key derives from declared inputs (localv2.md §8). Every file change in the analyzed repo — every README edit, every test file touch, every CI config update — invalidates the language-detection cache entry. The design dismisses the performance lens ("their fix belongs in Phase 1 alongside the real probe") but the issue isn't performance; it's that the cache invariant ("cache hits on second run") in the Phase 0 exit criteria (docs/roadmap.md) is structurally impossible if any file changed between runs. The exit-criteria smoke test §4.2 ("re-running over the same <empty_dir> hits the cache") works only because the test uses an empty dir — adding a single touched file between runs breaks it.

  5. mkdocs build --strict as a CI gate over the docs/ tree is going to fire today. §2.2 mandates strict mkdocs in CI ("any broken link, any duplicated heading, any reference to a non-existent page fails the build"). The existing docs/ tree includes docs/local.md (marked superseded), cross-references between docs/auto-agent-design.md and docs/gemini-auto-agent-design.md, and the ADR index at docs/production/adrs/README.md that links to ADRs by number. A strict mkdocs build over this tree as-is will surface at least 3-5 broken refs (e.g., docs/auto-agent-design.md §"Empirical Realities" referenced from production/design.md is a section anchor that mkdocs may not generate identically to manual anchors). Phase 0 cannot ship green on this gate without a cleanup pass over the docs that the design doesn't budget.

3.2 Three hidden assumptions

  1. "Five-minute reviewer experience" assumes uv install of all dev deps in 30 seconds on a cold runner. §1 promises "clone the repo, run make bootstrap, run make check, and have green output in under five minutes." mkdocs-material alone is ~150 transitive packages; pinned mypy + ruff + pytest stack adds another ~80. Even with uv and a warm wheel cache, the first-run install on a clean macOS box pulls ~250 MB of wheels. The make bootstrap step is realistically 60-180 seconds for a contributor who has never installed any of these tools. Add make check's mypy cold-pass (15-30 seconds), pytest cold-collection, and mkdocs build, and the five-minute number is optimistic by a factor of 2-3 on a real contributor's machine.

  2. mypy --strict over tests/ will catch real bugs. §2.2 puts tests under strict mypy. Pytest fixtures, parameterize markers, and monkeypatch typing are all underspecified in the typeshed; strict mypy over tests/ produces a high false-positive rate that contributors silence with # type: ignore comments. The "type errors caught at commit are 10x cheaper" argument (§2.2 row "Pre-commit") is true for src/; for tests/ it's an obstacle that funnels real bugs through # type: ignore lines. The performance design got this right ("type-checking tests yields diminishing returns and inflates CI time"); the best-practices design didn't engage.

  3. Coverage floor of 90% line / 80% branch on Phase 0 with five tests is satisfiable. §4.4 sets the floor at 90% line / 80% branch on src/codegenie/ excluding cli.py. The Phase 0 surface is the coordinator, the cache store, the schema validator, the output writer, the language-detection probe, the registry, the error hierarchy, the config loader, the logging setup. Five tests can hit 90% line coverage of this surface only by being long integration tests, not the focused unit tests §4.1 enumerates. The floor is either gameable (one big test that walks every line and asserts almost nothing — Rule 9 violation) or unmet (and CI is red on day one). The design hasn't picked a side.

3.3 Things this design missed that the others caught

  • No CLI cold-start budget. Performance has it (≤200ms on Linux CI). Best-practices has none. By Phase 14, when the CLI is the orchestration entry for portfolio-scale workers, the cold-start cost compounds across thousands of invocations per hour. Best-practices' "no premature optimization" answer becomes a refactor in Phase 8 or Phase 14.
  • No subprocess allowlist or output sanitizer. Security has both. Best-practices' errors.py hierarchy and yaml.safe_load ban are the table stakes; the structural defenses (allowlist + sanitizer) are absent. By Phase 7 (migration task class) the codebase will shell out to docker buildx, dive, and dockerfile-parse — adding an allowlist over an established codebase is bounded work the security design budgets and best-practices does not.
  • No reproducibility check. Security has it (two builds, byte-identical). Best-practices' "make check" mirrors CI but doesn't verify reproducibility — a flake from a non-deterministic dependency surfaces as a test failure rather than a structural signal.
  • No cost model for tokens. Performance and security both say "0 tokens in Phase 0" but performance frames it as "structural — Phase 8 hot views depend on Phase 0 layout"; security frames it as "Phase 4 will add anthropic to a separate extra." Best-practices says nothing about the gather/service split that load-bearing commitment §1 (production/design.md) makes load-bearing. The synthesizer will inherit a gap: Phase 0's pyproject.toml has no shape for where the LLM SDKs land later, and the best-practices design declines to design it.

4. Cross-design dimension table

Dimension Performance Security Best-practices Friction point
Cache content hash xxh3-128 (~3ms/MB) SHA-256 (collision-resistant) SHA-256 (implicit; via localv2.md) xxh3 is non-cryptographic and breaks production/design.md §2.3 (Honest confidence) under Phase 14 webhook fan-out
Schema validator fastjsonschema (10× faster; exec'd code) jsonschema (slower; transparent) jsonschema (default; floor: ≥90% coverage) fastjsonschema runtime code-gen is an attack surface on a persistence path
Coordinator concurrency asyncio.Semaphore(min(cpu,8)) from day one asyncio.Semaphore(8) from Phase 1 onward Serial in Phase 0; async lands in Phase 1 Best-practices ships a coordinator interface that Phase 1 replaces, not extends
additionalProperties in envelope schema Not specified false at every level false at root; true under probes.* Security strictness collides with production/design.md §2.5 (extension by addition)
Subprocess allowlist None Hard frozenset({"git"}) chokepoint None Performance and best-practices both punt; Phase 7's docker tooling lands without a wall
Output sanitizer / path scrubbing None Mandatory three-pass (regex + path scrub + gitleaks) None Phase 11 commits the YAML to a real repo; performance and best-practices leak /Users/...
Audit trail (run-record.json) "structured-log JSON run record" (one line) Per-run <ts>-<hash>.json with HMAC anchor Not designed Best-practices omits the audit anchor; Phase 13 cost ledger reconciliation has nothing to anchor against
Lockfile discipline uv pip install -e ".[dev]" (no hashes) uv pip compile --generate-hashes; --require-hashes install uv.lock committed; loose pip install in a weekly drift job Performance ships fastest; security ships most reproducible; best-practices ships most contributor-friendly. All three diverge on the same pyproject.toml.
CI walltime target ≤90s p95 ~120s admitted Implicit ("cold ~2 min, warm ~30s") Security's six jobs blow performance's budget by 33% on day one
Network egress in tests Unspecified Zero, enforced via unshare -n / harden-runner Unspecified Security's enforcement only works on Linux CI; macOS dev has no equivalent
pytest-xdist On from day one Unspecified Off (asyncio_mode = "auto"; serial pytest) Performance enables xdist with no parallelism win, just risk
pydantic in Phase 0 Yes (pyproject.toml deps; lazy-imported) Yes (ProbeOutput, immutable RepoSnapshot) No (dataclasses only; pydantic lands Phase 6) Best-practices defers a dep that the other two ship and that Phase 4 forces
Coverage floor "Computed but not enforced" Not specified ≥90% line / ≥80% branch, ratcheting per phase Performance's "not a gate" is the default best-practices gate doesn't land later
additionalProperties for LanguageDetection probe slice Performance probe; not designed Strict everywhere true under probes.* (per envelope) Best-practices is the only one that left the seam open
Probe declared_inputs for LanguageDetection "Sorted tuple of compiled PurePath" hashed via mmap+xxh3 "Under the analyzed-repo root; refused at registration time if upward-traverse" ["**/*"] (whole tree) Best-practices breaks cache hits on every file change

5. The single most consequential disagreement

Whether the cache content hash is SHA-256, xxh3, or blake3 — and behind that, whether Phase 0's cache layer is a performance artifact or a security/audit artifact.

The performance design treats the cache as identity-only ("cache keys are not adversarial — they're identity, not security") and picks xxh3 for a ~100× speedup, openly contradicting localv2.md §8. The security design treats the cache as the audit anchor (cache key = artifact identity for the audit trail; the run-record's SHA-256-of-the-YAML is the integrity anchor) and picks SHA-256, signing every index record with per-installation HMAC and explicitly overruling the performance lens. The best-practices design defers entirely to localv2.md (SHA-256, no opinion on speed, no HMAC).

This is consequential because every later phase is downstream of the decision:

  • Phase 14 (Continuous Gather) runs the hashing function on every push webhook across the watched portfolio. SHA-256 vs xxh3 is a 100× factor on the throughput of the cheapest, most-frequent operation in the system.
  • Phase 13 (AgentOps cost ledger) uses the cache as the unit of work for the "amortized" cost tier (production/design.md §3.3, ADR-0027). The cost ledger's deduplication of "this probe ran twice on the same input" depends on cache-key stability. If the key is xxh3, an adversary in any watched repo can force apparent cache misses (collision-driven re-runs) and inflate the amortized cost attribution.
  • Phase 11 (Handoff at scale) commits artifacts and references the cache contents in PR evidence bundles. The cache key shows up in PR descriptions as a provenance anchor — collision resistance is the only thing that makes that anchor verifiable.
  • Load-bearing commitment §3 (production/design.md — Honest confidence; silent staleness is the worst failure mode) explicitly names cache integrity as the worst failure mode. A non-cryptographic hash is silent staleness under any adversarial portfolio assumption.

The three designs are not negotiating a parameter; they're disagreeing on what the cache is for. That disagreement propagates through Phase 11, Phase 13, and Phase 14 with compounding cost — and the synthesizer cannot defer it without baking the wrong assumption into the schema, the audit trail, the cost ledger, and the PR provenance layer simultaneously.


6. Shared blind spots (all three designs missed)

  1. No model for the gather-vs-service split in pyproject.toml shape. Load-bearing commitment §1 (production/design.md) demands the gather pipeline never depend on an LLM SDK, ever. Performance ships one [project.optional-dependencies.dev]; security mentions a gather extra in passing without designing it; best-practices doesn't address the split at all. None of the three designs land the [project.optional-dependencies] shape that Phase 4's anthropic / langgraph deps will need to slot into without polluting the gather entry point. Phase 4 will therefore have to refactor Phase 0's pyproject.toml — an "extension by addition" violation that all three lenses agreed to leave for the next phase.

  2. No design for the .gitignore mutation that CLAUDE.md mandates. CLAUDE.md ("The tool should offer to add it to that repo's .gitignore on first run") requires the CLI to mutate the analyzed repo's .gitignore (not its own). All three designs ship the YAML and raw artifacts to .codegenie/ and none describes the prompt/confirmation/file-write path for opting in. Phase 11's PR-opening stage will then either commit .codegenie/ artifacts into the PR or skip them based on whether someone manually added the entry — a coin flip that none of the three Phase 0 designs flipped.

  3. No story for the localv2.md §4 ABC being "frozen" across versions. Best-practices says "copy-paste verbatim and freeze"; performance says "verbatim from localv2.md §4 per CLAUDE.md"; security says "as specified in localv2.md §4." None of the three designs handles what happens when localv2.md §4 itself updates — and localv2.md is a living document, currently at v2. The "snapshot test" best-practices proposes (§10 exit criteria #10) compares against a frozen string, not against the document. When localv2.md reaches v3, all three designs leave the synthesizer with a divergence between the spec document and the implemented ABC and no policy for which wins.

  4. No story for .codegenie/ permissions on a shared CI runner. Security designs 0700/0600 perms for the dev workstation; performance shrugs at perms; best-practices doesn't address it. On GitHub Actions, jobs run as runner with umask 0022; the cache directory between jobs (when actions/cache is used, per performance's CI design) is restored with 0755. The security design's mode-bit-check test will fail in CI under the cache-restore path. None of the three designs reconciled the on-disk permission model with the CI cache model both performance and best-practices ship.

  5. No design for incremental gather under a coordinator interface change. All three designs ship a coordinator interface in Phase 0. The continuous-gather model in production/design.md §3.2 ("incremental gathers — only probes with changed declared_inputs re-run") depends on the cache key derivation and the coordinator's ability to skip-and-pass-through a cached ProbeOutput. None of the three Phase 0 designs implements the skip-and-pass-through path; all three describe cache.get → run probe → cache.put. Phase 14's incremental model needs cache.get → return cached ProbeOutput, and the coordinator's dict[probe_name, ProbeOutput] output shape must include skip-marker probes that didn't re-run. The Phase 0 coordinator interface doesn't model the skip case as a first-class output — Phase 14 will have to extend the contract.


7. Roadmap-level critiques

  1. Phase 0's exit criteria don't validate the contracts that load-bear into Phase 1-14. docs/roadmap.md lists "CLI runs; CI green; mkdocs builds." None of the criteria assert (a) that Probe ABC matches localv2.md §4, (b) that the cache hit path is deterministic across two CLI invocations against the same fixture, (c) that the schema rejects a known-invalid repo-context.yaml. Best-practices §10 adds these as design-level exit criteria, but the roadmap doesn't — meaning a Phase 0 PR can ship green without verifying the inheritance properties Phases 1+ depend on. The roadmap shape "CLI runs + CI green" is satisfiable by a script that does nothing.

  2. The "no LLM in gather pipeline" commitment is enforced nowhere in Phase 0. docs/production/design.md §2.1 is load-bearing. Phase 0 ships click, pyyaml, jsonschema, optionally pydantic, structlog — and Phase 4 will introduce anthropic/langgraph into the same pyproject.toml unless the structural separation is designed now. Only the security design even names the problem (§"Risks" #5), and even that design proposes the fix as a Phase 4 ADR ("Phase 4 will add anthropic and langgraph to the service dependencies, but the gather pipeline's allowlist stays clean"). The roadmap should require Phase 0 to land the dependency-graph fence, not Phase 4 to retroactively erect it.

  3. The roadmap's "Tooling & setup" section for Phase 0 contradicts the load-bearing commitments. The roadmap lists pydantic as a Phase 0 dependency ("Dependencies: click, pyyaml, jsonschema, aiofiles, pydantic"). Best-practices' design explicitly defers pydantic to Phase 6. The other two designs use pydantic for the trust boundary. So the roadmap and the best-practices design are in direct conflict on whether pydantic ships in Phase 0. Either the roadmap is wrong or the design is. The synthesizer will have to pick — and neither document acknowledges the conflict.

  4. aiofiles is in the roadmap's Phase 0 deps but no design uses it. Performance's coordinator uses asyncio.create_subprocess_exec, never aiofiles. Security's design uses synchronous file I/O for atomic writes (open(..., "w") → fsync → os.replace). Best-practices doesn't mention it. The roadmap lists a dep that no design needs — and shipping unused deps in Phase 0 is exactly the kind of unforced complexity all three lenses claim to avoid.

  5. No Phase 0 exit criterion validates the "facts, not judgments" commitment (production/design.md §2.2). The LanguageDetectionProbe returns a count map — this is fact-shaped. But none of the three designs lays down the structural rule that a probe cannot return a confidence field set to anything other than a measurement (vs an LLM judgment). Phase 1's IndexHealthProbe will need this distinction encoded somewhere; Phase 0 was the right time to encode it, and no design did.

  6. The mkdocs build --strict exit criterion in docs/roadmap.md is unverifiable without first cleaning docs/. As noted in §3.1 problem #5: the existing docs/ tree has cross-references that will fail strict mode. The roadmap's Phase 0 exit criterion ("The docs site builds locally without warnings") requires either (a) a docs cleanup pass that's not in any of the three designs or (b) loosening the criterion to "without errors." This is a roadmap-level ambiguity that all three designs absorbed silently — and the synthesizer will inherit it.


End of critique. No fixes proposed; this is the wound count. The next agent's job is to synthesize.