Phase 00 — Bullet tracer + project foundations: Critique¶
Role: Devil's-advocate critic.
Date: 2026-05-11
Targets: design-performance.md, design-security.md, design-best-practices.md
Source of truth: docs/roadmap.md §"Phase 0", docs/production/design.md §2 (load-bearing commitments), CLAUDE.md.
Tone: hostile, concrete, specific. No fixes proposed. Each finding names a component, line item, or decision and explains the wound. The synthesizer is the next stop; this document only counts the holes.
1. Attacks on the performance-first design¶
1.1 Five concrete problems¶
-
xxh3-128for content addressing is the wrong hash and the wrong defense. The design swaps SHA-256 (called out inlocalv2.md §8) for xxh3 to save ~3ms per MB on lockfile hashing. xxh3 is not collision-resistant — it's specifically designed to be fast and not adversarial. Phase 14's continuous-gather model (production/design.md §3.2) re-hashespackage.jsonandpackage-lock.jsonacross the full watched portfolio on every push webhook. An attacker who lands a single PR in any watched repo can construct two payloads that collide under xxh3 and use that to pin the cache to the wrong probe output for a subsequent malicious commit — silent staleness, which load-bearing commitment §2.3 (production/design.md) calls out as "the worst failure mode." Worse: the design openly admits this in §"Acknowledged blind spots" and shrugs ("the threat model doesn't include local FS write attackers in Phase 0") — but the threat model the synthesizer is supposed to inherit does include that, because Phase 14 turns gather into a webhook-driven service. -
Lazy imports as a CI-gated invariant is fragile by construction. The design pins
codegenie --helpcold-start at ≤200ms with atest_cli_cold_start.pycanary and animport-linterconfig. Both safeguards fail open in a way the design glosses over: (a)import-linterdoesn't catch transitive imports introduced by an existing module's__init__.py— addingpydantictoerrors.pybrings inpydantic_core(the C extension) via a chain the linter doesn't model unless every transitive path is enumerated; (b) the canary asserts ≤200ms on Linux CI, but real measurement runs on a noisy GitHub Actions runner where p95 of "5 subprocess.run + median" routinely swings 60-80ms between runs. The canary will be either flaky (and disabled) or set so loose it doesn't catch the regression it exists to catch. Either way the 80ms M-series target is theatre — there's no enforcement loop touching the developer's actual machine. -
fastjsonschemaisexec'd Python and a real attack surface. §"Schema validation" picksfastjsonschemafor ~10× validation speed.fastjsonschemaworks by generating Python source for the schema andexec-ing it at import time. In a Phase 14 continuous-gather worker that ingests schemas from a Skills server (production/design.md §3.1Continuous Gather Dispatcher) or any future where schemas come from disk rather than the wheel, that exec path is a code-injection vector. The design dismisses this as "two libraries, one schema source — acceptable complexity for the speed win." It is not — the validator is on the persistence path that the audit trail and reproducibility story depend on. Phase 0 hardcodingfastjsonschemamakes the wrong choice path-dependent. -
pytest-xdist"from day one, even with five tests" is premature parallelism that breaks fixtures. The justification — "muscle memory and CI config in place before Phase 1 multiplies test count by 30" — is exactly the wrong reason to enable xdist with zero parallelism win. xdist forces tests to be isolated per worker; the Phase 0 cache-hit smoke test (one of only five tests!) needs to write to.codegenie/cache/on a shared fixture dir, and xdist's defaultloadfiledistribution will hand the same fixture dir to two workers if a future contributor adds even one parameterization. The bug — workers racing on cache writes — will surface in Phase 1 or 2 when the test count actually goes up, and the design will blame "flaky tests" rather than the structural decision to enable xdist before it provided value. -
Append-only JSONL cache index is content-addressed by design and then read linearly forever. §"Cache layer" promises mmap-scan linear over
.codegenie/cache/index.jsonl, asserts ≤2ms p95 cache-hit dispatch, and waves at Phase 14 ("swap the backend to SQLite or LMDB"). The math in §"Storage growth rate" assumes one entry per gather. But the continuous-gather model inproduction/design.md §3.2runs incremental gathers on every push webhook, every PR sync, every CVE event, every cron — for every probe whosedeclared_inputschanged. Phase 2's 30 probes × hundreds of repos × multiple triggers per day × 365 days hits the 1M-entry threshold inside the first production year, not in some abstract "Phase 14 scale." The mmap-scan p95 of "single-digit ms" cited in §"Resource & cost profile" only holds when the index is single-digit MB. The design's own logic ("11k entries × 150B = 1.6 MB after a year of nightly continuous gather") understates the trigger fan-out by 100×.
1.2 Three hidden assumptions¶
-
The performance budget is denominated in single-repo, single-process gathers. Every figure in §"Resource & cost profile" assumes one CLI invocation against one repo. But Phase 14's Continuous Gather Dispatcher (
production/design.md §3.2) runs the same coordinator code under a different shape — concurrent invocations against the same.codegenie/cache/shared between gathers triggered by overlapping pushes. The async semaphore tuning (min(cpu, 8)) and the per-process file-descriptor budget are sized for a single CLI invocation; the design has no model for concurrent CLI processes contending on the same cache directory. Append-only withO_APPENDis atomic per-record, but two processes both appending into the sameindex.jsonlwhile a third is mmap-scanning is a race the design hand-waves away. "If you don't measure it, you don't have it" — and this isn't measured. -
uvis treated as the baseline installer for cost arithmetic, but then pinned to a specific version and called "young." §"Risks" ratesuvinstability as risk #2, mitigated by "pin to a specific version." A pinneduvmeansrequirements-locked.txtis the input to a known binary — butuv≤2025 has had three breaking changes in the resolver in the last 12 months. The lockfile generated byuv 0.4.xis not byte-identical to the lockfile generated byuv 0.5.xeven for the samepyproject.toml. The performance design's reproducibility claim (the same "Phase 0's storage growth rate" math) silently assumes lockfile stability across pinned-uv upgrades. It is not stable. -
CI walltime ≤90s p95 is a target the design can't enforce against itself. Three parallel jobs × ~70s = ~70s wall-clock — if GitHub Actions assigns three free runners simultaneously. Free-tier GHA queues at peak hours stack jobs for 30-180s before a runner is allocated. The "p95 ≤90s" budget should be wall-clock from PR-push to green check, not wall-clock from runner-start. The design measures the wrong thing and reports the answer it wants.
1.3 Things this design missed that the others caught¶
- No Output Sanitizer / path scrubber. The security design's §"Output writer + sanitizer" rewrites absolute paths (
/Users/dannytrevino/...) to relative form before the YAML is persisted. The performance design writes the YAML directly viayaml.CSafeDumperand never considers that Phase 11's PR-opening stage will commit the YAML to a real repo. Phase 0's perf-optimized YAML will leak the developer's home directory tree across the portfolio. - No subprocess allowlist. The performance design tracks subprocess concurrency via
asyncio.create_subprocess_execbut has no notion of a permitted-binaries set. By Phase 2 the codebase will have 30+ subprocess callsites (semgrep,syft,scip-typescript, etc.); retrofitting an allowlist over a codebase that already shells out everywhere is not bounded work. - No reproducible-build CI check. Best-practices and security both call out that two runs of the same commit must produce byte-identical outputs (the security design enforces with
PYTHONHASHSEED=0/SOURCE_DATE_EPOCH; best-practices implies it via the cache-hit smoke test). The performance design omits this — and Phase 7's "extension by addition" invariant (production/design.md §2.5) relies on the contract being reproducible between versions ofcodewizard-sherpa. - No coverage floor. Best-practices sets ≥90% line / ≥80% branch and ratchets it up per phase. The performance design says "coverage is computed but not enforced as a gate." Skipping the gate in Phase 0 means the gate never lands — there's no later phase where adding a coverage floor doesn't require a backfill PR that adds tests rather than catches bugs.
2. Attacks on the security-first design¶
2.1 Five concrete problems¶
-
The HMAC-signed cache index defends against a threat the design itself can't articulate. §"Cache layer" introduces a per-installation
~/.codegenie/.cache-key, signs every cache index record, and verifies on read. The design's own threat model (§"Acknowledged blind spots") says the threat is "an attacker with filesystem write access to.codegenie/cache/but not the key file." Name this attacker: who has write access to a repo's.codegenie/cache/but cannot read~/.codegenie/.cache-keyowned by the same user? The answer is "nobody on a developer workstation." On a CI runner with ephemeral filesystem the key file doesn't even exist between runs — so the HMAC is either regenerated every CI invocation (and thus rotates with every gather, making the audit trail's "tamper-evident" claim impossible to verify across runs) or stored as a CI secret (and the security design provides no story for that). The HMAC is complexity that buys nothing on the dev workstation and breaks the audit-trail claim in CI. -
gitleakson every gather, in the write path, is a $50-$200ms unbounded subprocess on a hot loop. §"Output writer + sanitizer" putsgitleaksin the serialized-output sanitization pass. The cost ("acceptable; ~50-200ms") is wall-clock for a Phase 0 YAML of 2 KB. Phase 2'srepo-context.yamlis "a few hundred KB" (the perf design's number).gitleaksover hundreds of KB plus the raw artifacts directory ("~5 MB Phase 2") will be seconds, not milliseconds. In the continuous-gather model where hundreds of probes re-serialize after a single trigger,gitleaksin the write path becomes the single largest cost in the gather pipeline — and load-bearing commitment §1 (production/design.md) demands the gather pipeline be cheap enough to run continuously. The security design has moved a tens-of-seconds subprocess into the cheap-and-continuous path. -
additionalProperties: falseat every level breaks "extension by addition." The security design'srepo-context.v1.jsonschema makes "unknown fields a validation failure — surfacing buggy probes." CLAUDE.md andproduction/design.md §2.5both require "extension by addition — adding Java, Python, or a new task type must be new probes + new Skills, never edits to existing probes or the coordinator." If the schema rejects unknown fields, adding a new probe in Phase 1 requires editing the schema, not adding a new file. The security design has structurally violated the load-bearing extension model in service of structural validation strictness. The two policies are incompatible at the seam. -
The field-name regex filter is a heuristic security gate, which means it's neither. §"Probe contract" rejects keys matching
(?i)^.*(secret|token|password|credential|api[_-]?key|auth[_-]?token|bearer).*$. False positives are the named ones (decryption_steps). The unnamed problem: false negatives. A probe emitting{"npm_registry_url": "https://registry-token-abc123.npm-corp.internal/"}carries an embedded credential and trips no filter; a probe emitting{"auth_provider_name": "okta"}trips the filter and fails closed. The "defense in depth" gitleaks scan two passes later catches the real secret only if gitleaks's own regex library hits it — and gitleaks misses many internal credential formats by design (it's tuned for AWS/GH/etc.). The whole filter is a confidence theatre that adds friction without bounding the leak set. -
Pinned-by-SHA actions are a maintenance dead-letter office. §"CI workflow" pins every GitHub Action by SHA, with Dependabot raising PRs. By design, this puts every action update behind a human review PR. Six months in, the team has 20 actions pinned by SHA; the team is shipping Phase 5-6; Dependabot PRs queue up; nobody reviews them because none of them changes user-visible behavior; a CVE in
actions/checkout(the SHA the team pinned) goes unpatched for two months. The risk section's mitigation ("Dependabot handles routine bumps automatically") begs the question — Dependabot raises the PR; someone has to merge it. The design has prescribed friction without budgeting the review capacity to clear it.
2.2 Three hidden assumptions¶
-
unshare -n(network namespaces) works on macOS. §"Goals" promises "verified in CI viaunshare -nor netns equivalent."unshareis a Linux-only utility; macOS has no equivalent. The smoke-test guarantee ("zero outbound network calls duringcodegenie gather") is enforceable on Linux CI runners but not on the developer's M-series Mac. CLAUDE.md andlocalv2.mdboth target macOS+Linux. The security design's flagship invariant has no enforcement on half the supported platforms. -
The
[project.optional-dependencies.gather]extra cleanly separates LLM deps from gather deps in Phase 0. §"Risks" #5 proposes agatherextra that locks the gather pipeline out of LLM SDKs. Today'spyproject.tomlhas one extra (dev). The security design proposes a second (gather) — but Phase 0 ships only the gather pipeline. The CLI entry point depends entirely ongatherdeps. So[project.optional-dependencies.gather]either is the same as[project](in which case the separation is fictional and Phase 4's[project.optional-dependencies.service]will be the actual separation, lifting Phase 0's design into a refactor) or[project]is empty and the CLI doesn't install by default (broken UX). The design hasn't worked out which one it means. -
Hash-pinned lockfiles compose with
uvresolver upgrades. §"Goals" requiresuv pip compile --generate-hashesand--require-hashesinstall. The sameuv pin to specific versionproblem the performance design has applies here doubly:uv pip compileoutput depends onuv's resolver version, and the security design pinsuv"in CI for reproducibility" — which means the CI pin and the developer'suvversion drift, which meansmake bootstrapproduces a different lockfile than CI verifies, which means CI fails with a hash mismatch the developer can't reproduce locally. The design has a security story but no reproducibility story to make the security story enforceable.
2.3 Things this design missed that the others caught¶
- No coverage floor. Security cares about audit and integrity but doesn't gate code coverage. Best-practices' 90% line / 80% branch floor is the actual mechanism for catching the "secret-leak defense layer was never tested" failure mode. Security tests for "an output containing a fake AWS key fails the sanitizer" — but doesn't enforce that every sanitizer code path is exercised.
- No CI walltime budget. Best-practices doesn't quote a number, but performance does (≤90s). Security adds gitleaks + bandit + pip-audit + osv-scanner + the netns-isolated test + the reproducibility check; the design admits in §"Acknowledged blind spots" that walltime is "~120s." That's a 33% regression against the performance target with no recovery plan. Phase 0 ships with CI noticeably slower than the roadmap (
docs/roadmap.md) implies for a "five tests" surface. - No conventions catalog. Best-practices §2.6 lays out a single
errors.pyhierarchy and a config-merge precedence with fail-loud-on-unknown-keys. Security has none of this. Phase 1's probe authors will diverge on exception types and config-typo handling, and the security design's "fail closed" reflex won't catch it because the failures aren't security-shaped. - No
make-mirrors-CI invariant. Best-practices §6 makesmake checkmirror CI exactly so contributors can self-verify. Security's CI surface (six jobs, custom hooks, harden-runner, etc.) has no local mirror, which means the bar contributors hit is "push a PR and wait." This is exactly the "pre-commit hooks erode under contributor pressure" risk the security design itself names — caused by its own ergonomics gap.
3. Attacks on the best-practices-first design¶
3.1 Five concrete problems¶
-
"Coordinator runs serially in Phase 0" is a contract-breaking decision dressed as scoping. §8 ("Speculative async optimization") sells a serial coordinator as Rule 2 (Simplicity First). But the coordinator's interface —
asyncio.Semaphore(N), per-probeasyncio.Task,asyncio.wait_fortimeout — is itself a load-bearing convention the rest of the system inherits (production/design.md §3.1: "Probe Coordinator dispatches probes in parallel"). Shipping serial-only in Phase 0 means Phase 1 doesn't add concurrency; it replaces the coordinator. The exit-criteria item §11 ("the bounded async dispatch in the coordinator (replacing Phase 0's serial path)") admits this. So Phase 0 has shipped a stub the next phase rewrites — exactly the "speculative scaffolding" §9.1 warns against, with the speculation pointing the wrong way. -
dataclasses-only in Phase 0 andpydanticin Phase 6 is a four-phase migration disguised as restraint. §2.2 explicitly rejects pydantic ("Phase 0 doesn't need pydantic at all"). But §3.3's probe contract shipsProbewith class attributes that the security and performance designs treat as Pydantic models (immutableRepoSnapshot,ProbeOutputwithJSONValuerecursive types). Phase 4 introducesanthropicandlanggraph, both of which depend on pydantic v2 — so by Phase 4 pydantic is a transitive dep regardless. Postponing the migration to Phase 6 means every probe written between Phase 1 and Phase 5 gets rewritten in Phase 6 from dataclass-based to pydantic-based contracts. The "we'll add it when there's load" framing buries an unbudgeted migration in Phase 6's scope. -
The
LanguageDetectionProberecognizes Dockerfiles but no Phase 0 probe consumes them. §3.1 declares Dockerfile as a recognized language. Phase 7 (docs/roadmap.md) introduces the migration task class and the Dockerfile-specific probes. The best-practices design ships Dockerfile detection in Phase 0 ("seven extensions covered") — exactly the speculative-feature category §3 says it bans ("NoIndexHealthProbe, no schema for B-G, no fixtures beyond the trivial"). The design enforces its own scope rule on B-G but breaks it on A1. The rule is either binding or it isn't. -
declared_inputs = ["**/*"]forLanguageDetectionProbedefeats the cache key. §3.3 sets the probe's declared inputs to the entire tree. The cache key derives from declared inputs (localv2.md §8). Every file change in the analyzed repo — every README edit, every test file touch, every CI config update — invalidates the language-detection cache entry. The design dismisses the performance lens ("their fix belongs in Phase 1 alongside the real probe") but the issue isn't performance; it's that the cache invariant ("cache hits on second run") in the Phase 0 exit criteria (docs/roadmap.md) is structurally impossible if any file changed between runs. The exit-criteria smoke test §4.2 ("re-running over the same<empty_dir>hits the cache") works only because the test uses an empty dir — adding a single touched file between runs breaks it. -
mkdocs build --strictas a CI gate over thedocs/tree is going to fire today. §2.2 mandates strict mkdocs in CI ("any broken link, any duplicated heading, any reference to a non-existent page fails the build"). The existingdocs/tree includesdocs/local.md(marked superseded), cross-references betweendocs/auto-agent-design.mdanddocs/gemini-auto-agent-design.md, and the ADR index atdocs/production/adrs/README.mdthat links to ADRs by number. A strict mkdocs build over this tree as-is will surface at least 3-5 broken refs (e.g.,docs/auto-agent-design.md §"Empirical Realities"referenced fromproduction/design.mdis a section anchor that mkdocs may not generate identically to manual anchors). Phase 0 cannot ship green on this gate without a cleanup pass over the docs that the design doesn't budget.
3.2 Three hidden assumptions¶
-
"Five-minute reviewer experience" assumes uv install of all dev deps in 30 seconds on a cold runner. §1 promises "clone the repo, run
make bootstrap, runmake check, and have green output in under five minutes."mkdocs-materialalone is ~150 transitive packages; pinned mypy + ruff + pytest stack adds another ~80. Even withuvand a warm wheel cache, the first-run install on a clean macOS box pulls ~250 MB of wheels. Themake bootstrapstep is realistically 60-180 seconds for a contributor who has never installed any of these tools. Addmake check's mypy cold-pass (15-30 seconds), pytest cold-collection, and mkdocs build, and the five-minute number is optimistic by a factor of 2-3 on a real contributor's machine. -
mypy --strictovertests/will catch real bugs. §2.2 puts tests under strict mypy. Pytest fixtures, parameterize markers, andmonkeypatchtyping are all underspecified in the typeshed; strict mypy overtests/produces a high false-positive rate that contributors silence with# type: ignorecomments. The "type errors caught at commit are 10x cheaper" argument (§2.2 row "Pre-commit") is true forsrc/; fortests/it's an obstacle that funnels real bugs through# type: ignorelines. The performance design got this right ("type-checking tests yields diminishing returns and inflates CI time"); the best-practices design didn't engage. -
Coverage floor of 90% line / 80% branch on Phase 0 with five tests is satisfiable. §4.4 sets the floor at 90% line / 80% branch on
src/codegenie/excludingcli.py. The Phase 0 surface is the coordinator, the cache store, the schema validator, the output writer, the language-detection probe, the registry, the error hierarchy, the config loader, the logging setup. Five tests can hit 90% line coverage of this surface only by being long integration tests, not the focused unit tests §4.1 enumerates. The floor is either gameable (one big test that walks every line and asserts almost nothing — Rule 9 violation) or unmet (and CI is red on day one). The design hasn't picked a side.
3.3 Things this design missed that the others caught¶
- No CLI cold-start budget. Performance has it (≤200ms on Linux CI). Best-practices has none. By Phase 14, when the CLI is the orchestration entry for portfolio-scale workers, the cold-start cost compounds across thousands of invocations per hour. Best-practices' "no premature optimization" answer becomes a refactor in Phase 8 or Phase 14.
- No subprocess allowlist or output sanitizer. Security has both. Best-practices'
errors.pyhierarchy andyaml.safe_loadban are the table stakes; the structural defenses (allowlist + sanitizer) are absent. By Phase 7 (migration task class) the codebase will shell out todocker buildx,dive, anddockerfile-parse— adding an allowlist over an established codebase is bounded work the security design budgets and best-practices does not. - No reproducibility check. Security has it (two builds, byte-identical). Best-practices' "make check" mirrors CI but doesn't verify reproducibility — a flake from a non-deterministic dependency surfaces as a test failure rather than a structural signal.
- No cost model for tokens. Performance and security both say "0 tokens in Phase 0" but performance frames it as "structural — Phase 8 hot views depend on Phase 0 layout"; security frames it as "Phase 4 will add
anthropicto a separate extra." Best-practices says nothing about the gather/service split that load-bearing commitment §1 (production/design.md) makes load-bearing. The synthesizer will inherit a gap: Phase 0'spyproject.tomlhas no shape for where the LLM SDKs land later, and the best-practices design declines to design it.
4. Cross-design dimension table¶
| Dimension | Performance | Security | Best-practices | Friction point |
|---|---|---|---|---|
| Cache content hash | xxh3-128 (~3ms/MB) | SHA-256 (collision-resistant) | SHA-256 (implicit; via localv2.md) |
xxh3 is non-cryptographic and breaks production/design.md §2.3 (Honest confidence) under Phase 14 webhook fan-out |
| Schema validator | fastjsonschema (10× faster; exec'd code) |
jsonschema (slower; transparent) |
jsonschema (default; floor: ≥90% coverage) |
fastjsonschema runtime code-gen is an attack surface on a persistence path |
| Coordinator concurrency | asyncio.Semaphore(min(cpu,8)) from day one |
asyncio.Semaphore(8) from Phase 1 onward |
Serial in Phase 0; async lands in Phase 1 | Best-practices ships a coordinator interface that Phase 1 replaces, not extends |
additionalProperties in envelope schema |
Not specified | false at every level |
false at root; true under probes.* |
Security strictness collides with production/design.md §2.5 (extension by addition) |
| Subprocess allowlist | None | Hard frozenset({"git"}) chokepoint |
None | Performance and best-practices both punt; Phase 7's docker tooling lands without a wall |
| Output sanitizer / path scrubbing | None | Mandatory three-pass (regex + path scrub + gitleaks) | None | Phase 11 commits the YAML to a real repo; performance and best-practices leak /Users/... |
Audit trail (run-record.json) |
"structured-log JSON run record" (one line) | Per-run <ts>-<hash>.json with HMAC anchor |
Not designed | Best-practices omits the audit anchor; Phase 13 cost ledger reconciliation has nothing to anchor against |
| Lockfile discipline | uv pip install -e ".[dev]" (no hashes) |
uv pip compile --generate-hashes; --require-hashes install |
uv.lock committed; loose pip install in a weekly drift job |
Performance ships fastest; security ships most reproducible; best-practices ships most contributor-friendly. All three diverge on the same pyproject.toml. |
| CI walltime target | ≤90s p95 | ~120s admitted | Implicit ("cold ~2 min, warm ~30s") | Security's six jobs blow performance's budget by 33% on day one |
| Network egress in tests | Unspecified | Zero, enforced via unshare -n / harden-runner |
Unspecified | Security's enforcement only works on Linux CI; macOS dev has no equivalent |
pytest-xdist |
On from day one | Unspecified | Off (asyncio_mode = "auto"; serial pytest) |
Performance enables xdist with no parallelism win, just risk |
pydantic in Phase 0 |
Yes (pyproject.toml deps; lazy-imported) |
Yes (ProbeOutput, immutable RepoSnapshot) |
No (dataclasses only; pydantic lands Phase 6) |
Best-practices defers a dep that the other two ship and that Phase 4 forces |
| Coverage floor | "Computed but not enforced" | Not specified | ≥90% line / ≥80% branch, ratcheting per phase | Performance's "not a gate" is the default best-practices gate doesn't land later |
additionalProperties for LanguageDetection probe slice |
Performance probe; not designed | Strict everywhere | true under probes.* (per envelope) |
Best-practices is the only one that left the seam open |
Probe declared_inputs for LanguageDetection |
"Sorted tuple of compiled PurePath" hashed via mmap+xxh3 |
"Under the analyzed-repo root; refused at registration time if upward-traverse" | ["**/*"] (whole tree) |
Best-practices breaks cache hits on every file change |
5. The single most consequential disagreement¶
Whether the cache content hash is SHA-256, xxh3, or blake3 — and behind that, whether Phase 0's cache layer is a performance artifact or a security/audit artifact.
The performance design treats the cache as identity-only ("cache keys are not adversarial — they're identity, not security") and picks xxh3 for a ~100× speedup, openly contradicting localv2.md §8. The security design treats the cache as the audit anchor (cache key = artifact identity for the audit trail; the run-record's SHA-256-of-the-YAML is the integrity anchor) and picks SHA-256, signing every index record with per-installation HMAC and explicitly overruling the performance lens. The best-practices design defers entirely to localv2.md (SHA-256, no opinion on speed, no HMAC).
This is consequential because every later phase is downstream of the decision:
- Phase 14 (Continuous Gather) runs the hashing function on every push webhook across the watched portfolio. SHA-256 vs xxh3 is a 100× factor on the throughput of the cheapest, most-frequent operation in the system.
- Phase 13 (AgentOps cost ledger) uses the cache as the unit of work for the "amortized" cost tier (
production/design.md §3.3, ADR-0027). The cost ledger's deduplication of "this probe ran twice on the same input" depends on cache-key stability. If the key is xxh3, an adversary in any watched repo can force apparent cache misses (collision-driven re-runs) and inflate the amortized cost attribution. - Phase 11 (Handoff at scale) commits artifacts and references the cache contents in PR evidence bundles. The cache key shows up in PR descriptions as a provenance anchor — collision resistance is the only thing that makes that anchor verifiable.
- Load-bearing commitment §3 (
production/design.md— Honest confidence; silent staleness is the worst failure mode) explicitly names cache integrity as the worst failure mode. A non-cryptographic hash is silent staleness under any adversarial portfolio assumption.
The three designs are not negotiating a parameter; they're disagreeing on what the cache is for. That disagreement propagates through Phase 11, Phase 13, and Phase 14 with compounding cost — and the synthesizer cannot defer it without baking the wrong assumption into the schema, the audit trail, the cost ledger, and the PR provenance layer simultaneously.
6. Shared blind spots (all three designs missed)¶
-
No model for the gather-vs-service split in
pyproject.tomlshape. Load-bearing commitment §1 (production/design.md) demands the gather pipeline never depend on an LLM SDK, ever. Performance ships one[project.optional-dependencies.dev]; security mentions agatherextra in passing without designing it; best-practices doesn't address the split at all. None of the three designs land the[project.optional-dependencies]shape that Phase 4'santhropic/langgraphdeps will need to slot into without polluting the gather entry point. Phase 4 will therefore have to refactor Phase 0'spyproject.toml— an "extension by addition" violation that all three lenses agreed to leave for the next phase. -
No design for the
.gitignoremutation that CLAUDE.md mandates. CLAUDE.md ("The tool should offer to add it to that repo's.gitignoreon first run") requires the CLI to mutate the analyzed repo's.gitignore(not its own). All three designs ship the YAML and raw artifacts to.codegenie/and none describes the prompt/confirmation/file-write path for opting in. Phase 11's PR-opening stage will then either commit.codegenie/artifacts into the PR or skip them based on whether someone manually added the entry — a coin flip that none of the three Phase 0 designs flipped. -
No story for the
localv2.md §4ABC being "frozen" across versions. Best-practices says "copy-paste verbatim and freeze"; performance says "verbatim fromlocalv2.md §4per CLAUDE.md"; security says "as specified inlocalv2.md §4." None of the three designs handles what happens whenlocalv2.md §4itself updates — andlocalv2.mdis a living document, currently at v2. The "snapshot test" best-practices proposes (§10 exit criteria #10) compares against a frozen string, not against the document. Whenlocalv2.mdreaches v3, all three designs leave the synthesizer with a divergence between the spec document and the implemented ABC and no policy for which wins. -
No story for
.codegenie/permissions on a shared CI runner. Security designs 0700/0600 perms for the dev workstation; performance shrugs at perms; best-practices doesn't address it. On GitHub Actions, jobs run asrunnerwithumask 0022; the cache directory between jobs (whenactions/cacheis used, per performance's CI design) is restored with0755. The security design's mode-bit-check test will fail in CI under the cache-restore path. None of the three designs reconciled the on-disk permission model with the CI cache model both performance and best-practices ship. -
No design for incremental gather under a coordinator interface change. All three designs ship a coordinator interface in Phase 0. The continuous-gather model in
production/design.md §3.2("incremental gathers — only probes with changeddeclared_inputsre-run") depends on the cache key derivation and the coordinator's ability to skip-and-pass-through a cachedProbeOutput. None of the three Phase 0 designs implements the skip-and-pass-through path; all three describecache.get → run probe → cache.put. Phase 14's incremental model needscache.get → return cached ProbeOutput, and the coordinator'sdict[probe_name, ProbeOutput]output shape must include skip-marker probes that didn't re-run. The Phase 0 coordinator interface doesn't model the skip case as a first-class output — Phase 14 will have to extend the contract.
7. Roadmap-level critiques¶
-
Phase 0's exit criteria don't validate the contracts that load-bear into Phase 1-14.
docs/roadmap.mdlists "CLI runs; CI green; mkdocs builds." None of the criteria assert (a) thatProbeABC matcheslocalv2.md §4, (b) that the cache hit path is deterministic across two CLI invocations against the same fixture, (c) that the schema rejects a known-invalidrepo-context.yaml. Best-practices §10 adds these as design-level exit criteria, but the roadmap doesn't — meaning a Phase 0 PR can ship green without verifying the inheritance properties Phases 1+ depend on. The roadmap shape "CLI runs + CI green" is satisfiable by a script that does nothing. -
The "no LLM in gather pipeline" commitment is enforced nowhere in Phase 0.
docs/production/design.md §2.1is load-bearing. Phase 0 shipsclick,pyyaml,jsonschema, optionallypydantic,structlog— and Phase 4 will introduceanthropic/langgraphinto the samepyproject.tomlunless the structural separation is designed now. Only the security design even names the problem (§"Risks" #5), and even that design proposes the fix as a Phase 4 ADR ("Phase 4 will addanthropicandlanggraphto the service dependencies, but the gather pipeline's allowlist stays clean"). The roadmap should require Phase 0 to land the dependency-graph fence, not Phase 4 to retroactively erect it. -
The roadmap's "Tooling & setup" section for Phase 0 contradicts the load-bearing commitments. The roadmap lists
pydanticas a Phase 0 dependency ("Dependencies:click,pyyaml,jsonschema,aiofiles,pydantic"). Best-practices' design explicitly defers pydantic to Phase 6. The other two designs use pydantic for the trust boundary. So the roadmap and the best-practices design are in direct conflict on whether pydantic ships in Phase 0. Either the roadmap is wrong or the design is. The synthesizer will have to pick — and neither document acknowledges the conflict. -
aiofilesis in the roadmap's Phase 0 deps but no design uses it. Performance's coordinator usesasyncio.create_subprocess_exec, neveraiofiles. Security's design uses synchronous file I/O for atomic writes (open(..., "w") → fsync → os.replace). Best-practices doesn't mention it. The roadmap lists a dep that no design needs — and shipping unused deps in Phase 0 is exactly the kind of unforced complexity all three lenses claim to avoid. -
No Phase 0 exit criterion validates the "facts, not judgments" commitment (
production/design.md §2.2). TheLanguageDetectionProbereturns a count map — this is fact-shaped. But none of the three designs lays down the structural rule that a probe cannot return aconfidencefield set to anything other than a measurement (vs an LLM judgment). Phase 1'sIndexHealthProbewill need this distinction encoded somewhere; Phase 0 was the right time to encode it, and no design did. -
The
mkdocs build --strictexit criterion indocs/roadmap.mdis unverifiable without first cleaningdocs/. As noted in §3.1 problem #5: the existingdocs/tree has cross-references that will fail strict mode. The roadmap's Phase 0 exit criterion ("The docs site builds locally without warnings") requires either (a) a docs cleanup pass that's not in any of the three designs or (b) loosening the criterion to "without errors." This is a roadmap-level ambiguity that all three designs absorbed silently — and the synthesizer will inherit it.
End of critique. No fixes proposed; this is the wound count. The next agent's job is to synthesize.