Phase 01 — Context gathering: Layer A (Node.js): Performance-first design¶
Lens: Performance — throughput, latency, token economy, footprint. Designed by: Performance-first design subagent Date: 2026-05-12
Lens summary¶
The whole point of Phase 1 — measured against the production target — is to make Stage 2 (Deep Scan) a cache hit in single-digit seconds on every workflow at portfolio scale. ADR-0006 (continuous deterministic gather) commits the system to running this code on every push, every PR, every CVE event, against every watched repo. ADR-0013 commits to single-digit-ms agent reads of hot views. If Phase 1's Layer A is a 30-second blob of subprocess churn, the continuous-gather model collapses under its own weight at 1,000 watched repos × 50 pushes/day = 50,000 gathers/day. Layer A is touched on every gather (it derives the language fingerprint everything else gates on), so it sits squarely on the critical path.
I optimized for: (1) cache hit rate at portfolio scale, where the headline number is "near-zero work when nothing changed"; (2) wall-clock on warm + incremental paths, because cold gathers are a one-time tax per repo while warm/incremental dominate steady state; (3) avoiding the fork-exec-parse-discard pattern that subprocess-heavy probes default to in Python; (4) keeping repo-context.yaml writes cheap because Phase 13's cost ledger and Phase 14's continuous gather both fan out from this artifact.
I deprioritized: defensive security beyond what Phase 0's chokepoints already provide; "pretty" error messages; readable intermediate logs; ecosystem expansion (Phase 2 Layer B is explicitly not my problem); developer-time UX of the first cold run on a fresh dev box. I will measure things; I will not gold-plate the measuring.
Goals (concrete, measurable)¶
These are aggressive vs. the roadmap's "useful on a real Node.js repo" exit criterion. They are the targets I want the synthesizer to push back on.
- Workflows/hour target (Layer A only, single 8-core worker): ≥ 2,400/hr in steady state (= 1.5s p50 wall clock per incremental gather; ≥90% cache-hit by probe).
- Time-to-PR contribution from Layer A (p95): ≤ 250 ms on incremental gather (only the changed-input probes re-execute); ≤ 4 s on warm gather (all Layer A probes re-execute, no cold subprocesses); ≤ 12 s on cold first-run for a typical 50k-LOC Node service.
- $/PR target: $0.00 — Layer A is fully deterministic per ADR-0005; the only spend is CPU-seconds. Target ≤ 2 CPU-seconds per incremental gather, ≤ 12 CPU-seconds per warm gather.
- Cache hit rate target: ≥ 92% per-probe across a steady-state portfolio (matches the Cursor reference number cited in
production/design.md §3.2). 100% on a re-run with zero file changes (this is the roadmap's exit criterion; I make it a regression test). - Per-worker memory ceiling: ≤ 250 MB RSS per active gather; ≤ 80 MB resident for the CLI itself between gathers (this matters when Phase 14 runs gathers as long-lived workers). Phase 0's
ResourceBudget(rss_mb=200, ...)default holds; I tighten it for Layer A probes that don't shell out. - Tail latency (p99) on incremental gather: ≤ 600 ms. The 99th percentile is the one that determines portfolio throughput, not the median.
- Hot-view pre-render budget: Phase 1 doesn't ship Redis yet (that's Phase 8), but Layer A's output must be shaped so that the four pre-rendered slices (ADR-0013:
available_skills,entrypoint,risk_flags,confidence_summary) can be projected fromrepo-context.yamlin ≤ 5 ms with zero re-parsing.
Architecture¶
codegenie gather <path>
│
▼
┌─────────────────────────────────────────┐
│ Snapshot builder (single fs.walk) │
│ - one-pass os.scandir, gitignore-aware │
│ - emits PathIndex: dict[str, FileStat] │
│ - the only filesystem traversal of the │
│ repo per gather. PERIOD. │
└────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Coordinator (asyncio, from Phase 0) │
│ - probe DAG = explicit edges, not │
│ `requires:` string lookup at runtime │
│ - per-probe cache check BEFORE │
│ subprocess (the canonical bug) │
│ - bounded Semaphore(min(cpu, 8)) │
└────────────────┬────────────────────────┘
│
┌──────────────────────────┼──────────────────────────────────┐
│ Tier-0: zero-subprocess │ Tier-1: pure-python parse │
│ (LanguageDetection, │ (NodeManifest, NodeBuildSystem, │
│ prelude — runs first) │ CI, Deployment, TestInventory) │
│ cache key from PathIndex│ cache key from declared_inputs │
│ fingerprint only │ blob hashes only │
└──────────────────────────┴──────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Per-probe cache (Phase 0 store) │
│ - hot in-memory LRU layered above │
│ filesystem (mmap'd index) │
│ - hit returns ProbeOutput w/o I/O │
└────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Streaming merger + sanitizer │
│ - YAML writer emits as probes finish │
│ (no buffer the whole RepoContext) │
│ - hot-view projection emitted as a │
│ sibling .codegenie/context/views.json │
│ (Phase 8 reads this directly into │
│ Redis without re-parsing YAML) │
└────────────────┬────────────────────────┘
│
▼
.codegenie/context/repo-context.yaml
.codegenie/context/views.json [new — for Phase 8]
.codegenie/context/raw/<probe>.json
.codegenie/cache/blobs/... (Phase 0 layout)
Data flow: one filesystem walk → one PathIndex → six probes consume slices → cache layer short-circuits unchanged probes → results stream to YAML + JSON views. The PathIndex is the load-bearing object — it's why Layer A can be sub-second on the warm path.
Components¶
1. SnapshotBuilder (codegenie/coordinator/snapshot.py — extends Phase 0)¶
- Purpose: Walk the repo exactly once. All six Layer A probes read from the resulting
PathIndexrather than re-walking. Critique-proof against the "every probe runs its ownos.walk" antipattern. - Interface:
- in:
repo_root: Path,gitignore_rules: PathSpec - out:
PathIndex(frozen dataclass):paths: tuple[FileEntry, ...]sorted by path;by_extension: dict[str, tuple[FileEntry, ...]](precomputed bucket views — these are the slices probes actually consume);manifests_present: frozenset[str](subset check forpackage.json,pnpm-lock.yaml, etc., precomputed);fingerprint: str(BLAKE3 over the sorted(rel_path, size, mtime_ns)tuple — the cache key root for tier-0 probes). - errors: refuses symlinks crossing repo root (inherited from Phase 0); refuses ≥ 200k files without
--max-files(avoids accidental walk of vendored monorepos — fails loud per Rule 12). - Internal design:
os.scandirrecursive, breadth-first, with an exclusion set computed once from.gitignore+ hard-coded noise dirs (node_modules,.git,dist,build,coverage,.next,.turbo,target). Excluded at directory level sonode_modulesis never even descended into.gitignore_rulesusespathspec(compiled once) — orders of magnitude faster than per-file rule eval.- No
Pathobjects in the hot loop. Strings +os.DirEntry.stat(follow_symlinks=False).Pathis constructed lazily when a probe asks. by_extensionmaterialized in the walk loop, not as a post-pass dict-comp. Branchless: a precomputeddict[str, list]per extension we care about (.js,.mjs,.cjs,.ts,.tsx,.json,.yaml,.yml,.lock,.toml,.dockerfile); everything else goes into a single_otherbucket that is not sorted (saves a sort pass on the ~80% of files no probe cares about).fingerprintis BLAKE3 over a packed byte buffer of(len_path, path, size_u64, mtime_u64)— no string concatenation, no JSON serialization. ~3 GB/s on modern hardware (Phase 0 ADR ratified BLAKE3).- Tradeoffs accepted:
- Every probe must accept a
PathIndexargument. This is not the Phase 0 probe contract — it's a coordinator-level enrichment that probes can opt into via a typed helper. Probes that don't opt in walk the repo themselves (slow path, surfaced in logs). - The PathIndex holds
mtime_nsin memory. On a 50k-file repo that's ~3 MB resident. Acceptable. - We re-walk the repo if the
--no-cacheflag is set or the PathIndex itself was invalidated by a config change. That's fine; cold path was never the target.
2. Coordinator extensions (codegenie/coordinator/coordinator.py — extends Phase 0)¶
- Purpose: Make the Phase 0 coordinator perform well at Layer A's fan-out (6 probes) and Phase 14's repeat rate (50k gathers/day).
- Interface: unchanged from Phase 0 (
async def gather(...) -> GatherResult). - Internal design:
- Cache check before probe instantiation. Phase 0 already does this; I'm calling it out because the canonical performance regression is "we instantiated the probe, called
run(), and then checked the cache." The cache lookup is keyed by(probe_name, probe_version, per_probe_schema_version, declared_inputs_content_hash). For Layer A thedeclared_inputs_content_hashis derived from the PathIndex — never re-hashes file contents that were already hashed during the walk. - Prelude pass = LanguageDetection only. Phase 0 introduced the prelude pass. I tighten it: only LanguageDetection runs in the prelude, and it runs with the PathIndex already populated (no second walk). LanguageDetection is now an ~8 ms operation on the warm path (it reads
by_extensionbucket sizes and emits counts). - Explicit DAG. Phase 0 uses
requires: list[str]strings. At runtime that means dict lookups. For Layer A's six probes I precompute the DAG at module import time into atuple[tuple[Probe, ...], ...]— each tuple is a "wave" of probes that can run in parallel. Two waves:(LanguageDetection,)then(NodeBuildSystem, NodeManifest, CI, Deployment, TestInventory). No per-run topological sort. - Hot-path probe execution is
awaitin the same event loop. No thread pool, no process pool. Tier-0 and Tier-1 probes are pure-Python parse work; the GIL is irrelevant because we're parsing JSON/YAML, which releases the GIL in the C parsers. Concurrency is for hiding I/O, and at Layer A the only I/O is reading manifest files which the PathIndex already stat'd. asyncio.Semaphore(min(os.cpu_count(), 8))stays — but Layer A almost never hits it because the 5 parallel probes in wave 2 are bounded bymin(5, semaphore)anyway. The semaphore exists for Phase 2 Layer C, not for me.- Cancellation is cooperative, not best-effort. Each probe awaits
asyncio.sleep(0)between major parse steps (reading lockfile → walking workspaces → emitting slice). Phase 14's webhook flood can pre-empt long-running probes cheaply. - Tradeoffs accepted:
- The explicit DAG means adding a new Layer A probe is two edits, not one (decorator + DAG tuple). Phase 0's
@register_probedecorator +requires:lookup is preserved as the fallback for non-hot-path probes; this is an opt-in fast path. The cost is one tuple edit per new Layer A probe — bounded scope, well-known seam.
3. PathIndex-aware Probe mixin (codegenie/probes/path_index_aware.py)¶
- Purpose: Give Layer A probes a typed handle to the precomputed PathIndex without changing the Phase 0 ABC.
- Interface: A mixin class probes inherit from when they want PathIndex access.
def use_path_index(self, idx: PathIndex) -> Noneis called by the coordinator beforerun(). Probes that don't inherit it get an empty stub. - Internal design: Sets
self._idx. Provides typed accessors:files_with_extension(".ts") -> tuple[FileEntry, ...],manifest_path("package.json") -> Path | None(single dict lookup onby_path),monorepo_marker_present() -> bool. These are the operations all six Layer A probes do. - Tradeoffs accepted: A second class hierarchy alongside
Probe. Worth it because the alternative is every Layer A probe re-implementing "find me the package.json" or accepting aPathIndexconstructor argument that breaks ABC byte-for-byte.
4. LanguageDetectionProbe (extends Phase 0)¶
- Purpose: Phase 0 already shipped this. I'm describing only the Layer A delta.
- Interface: Same.
schema_sliceoutput unchanged. - Internal design:
- No filesystem walk — reads
PathIndex.by_extensionlengths. - Framework hints (NestJS, Express, Fastify, Next.js) detected from
package.json's declared dependencies (parsed once by NodeManifestProbe and cached in-process for this gather — see the dependency note in NodeBuildSystemProbe below).LanguageDetectionno longer parses package.json itself. - Monorepo detection via
PathIndex.manifests_presentset-membership check:{"pnpm-workspace.yaml", "lerna.json", "nx.json", "turbo.json"} & manifests_present. - Tradeoffs accepted: Framework hints are now a Tier-1 output (not Tier-0). On the absolute cold path where NodeManifestProbe hasn't run yet, framework hints are empty; the planner doesn't strictly need them at language-detection-time anyway (they're consumed by Stage 3 Planning, which by then has the full context).
5. NodeBuildSystemProbe (codegenie/probes/node_build_system.py)¶
- Purpose: Detect package manager, Node version constraints, build commands, bundler, TypeScript config.
- Interface: outputs the
build_systemslice fromlocalv2.md §5.1 A2. - Internal design:
package.jsonis parsed exactly once per gather byNodeManifestProbeand the parsed dict is stuffed intoProbeContext.workspaceas a sibling file.parsed/package.json.msgpack(msgpack chosen because it's ~10x faster than JSON for round-trip on dicts of this size). NodeBuildSystem reads the msgpack, not the raw JSON. Inter-probe sharing without violating the ABC: the contract still says "I read declared_inputs"; the implementation just notices the parsed form is available.- Lockfile detection by manifest presence, no file open.
pnpm > yarn > bun > npmprecedence is a single set-membership chain onPathIndex.manifests_present. tsconfig.jsonparsed withorjsonif installed (5–10x faster than stdlibjsonon real-world configs that contain comments — wait, no,orjsonis strict JSON;tsconfig.jsonallows JSON5. Usejson5library with its C extensionpyjson5. Bench at integration time; ifpyjson5underperforms a fork toorjsonwith a strip-comments pass, switch).- Bundler detection from declared dependencies, not from config-file probing.
webpackindependencies→ bundler is webpack. One dict lookup per bundler. The config-file probe is a fallback only. - Tradeoffs accepted:
- The msgpack-cache-of-parsed-package.json pattern is an inter-probe optimization. It's documented as such; it's not a new contract. If a future Layer A probe doesn't want to read the msgpack it can re-parse from the source. Surfaced in Phase 0's chokepoint discussion as the "in-process probe cache."
- I'm adding
msgpackandpyjson5(ororjson) to dependencies. That's two new dependencies; they're both fast, stable, widely used. Justified by the warm-path latency target.
6. NodeManifestProbe (codegenie/probes/node_manifest.py)¶
- Purpose: Parse
package.json, lockfile, detect native modules. The most expensive Layer A probe on cold gather (lockfile parsing dominates). - Interface: outputs
manifests[]slice perlocalv2.md §5.1 A3. - Internal design:
- Lockfile parsing is the hot inner loop.
pnpm-lock.yaml: parsed withruamel.yamlin C-extension mode if available; fall back topyyaml.CSafeLoader. Neverpyyaml.SafeLoader(pure Python).ruamelis faster on large lockfiles but adds a dep; bench at integration; choose the winner.package-lock.json:orjson.loadsif available else stdlibjson.yarn.lock: there's no fast Python parser. Use the@yarnpkg/parsersJS package via a one-timenode -einvocation, or a hand-rolled parser that handles only the 3 features we actually consume (package names, versions, integrity hashes). I'll hand-roll: the format is regular, and avoiding a node subprocess per gather is worth ~200 ms.
- Native module detection by package-name catalog match. The catalog (
codegenie/catalogs/native-modules.yaml) is a frozenset loaded once at module import. Match isO(1): iterate parsed dependencies once, checkname in CATALOG. No regex; no version range eval. - Integrity validation deferred to Phase 2. Lockfile integrity is a Phase 2 concern (B2 IndexHealth). Layer A reports
integrity_valid: nullwith confidencemedium; Phase 2 fills it in. Saves ~150 ms on the warm path. enginesfield extracted from the already-parsedpackage.json(the msgpack cache). One dict access.- In-place dict mutation in the parse-to-slice transform. No copy of the lockfile data; build the output slice as a slim projection, drop references to the source dict at function exit. RSS doesn't grow.
- Tradeoffs accepted:
- Hand-rolled
yarn.lockparser is a maintenance liability. Mitigation: 1k LOC tops, fuzzed against real yarn.lock files in CI, falls back to the node subprocess if parsing returns empty. The fallback is a feature flag; the perf-first design ships with the hand-rolled parser on by default. - Native module catalog is "data, not prompts." Adding a native module is a YAML edit. Consistent with
production/design.md §2.6("Organizational uniqueness as data, not prompts").
7. CIProbe (codegenie/probes/ci.py)¶
- Purpose: Detect GitHub Actions / CircleCI / GitLab CI / Jenkins.
- Interface: outputs
cislice per A4. - Internal design:
- Detection by directory presence, not by parsing.
.github/workflows/exists → GitHub Actions..circleci/exists → CircleCI. Boolean short-circuit; no file I/O. - Workflow YAML parsing only for the one workflow we care about (the one matching
name: cior the alphabetically-first one if absent). Parsing every workflow file in a heavy repo is wasted work for Stage 3, which only consumes the build/test commands. pyyaml.CSafeLoadermandatory;pyyaml.SafeLoaderbanned by Phase 0'sforbidden-patternshook.- Tradeoffs accepted:
- Skipping non-primary workflows means we miss multi-workflow setups (e.g., separate
build-image.ymlandrelease.yml). For Phase 1's exit criterion (usefulrepo-context.yaml), the primary workflow is enough. Phase 2's IndexHealthProbe can surface "we didn't parse 3 other workflows" as a confidence hit if it matters.
8. DeploymentProbe (codegenie/probes/deployment.py)¶
- Purpose: Detect Helm / Kustomize / Terraform / plain k8s manifests.
- Interface: outputs
deploymentslice per A5. - Internal design:
- Existence check before parse.
deploy/,helm/,charts/,k8s/,kustomization.yaml— directory/file existence drives 90% of the classification. - Helm
values.yamlparsed only for theimage:block; nothing else is consumed by Stage 3 Planning. Selective key extraction via streaming YAML parse: useruamel.yaml's event-driven parser to stop at the firstimage:key, not full document parse. - Terraform parsing optional.
hcl2is slow (pure Python, ~50 ms per file). Skipped by default; opt-in via--probe-opts deployment.terraform=true. - Tradeoffs accepted:
- Streaming YAML parse for one key is fragile (key ordering, nested anchors). If it fails it falls back to full-parse. Worth it on the happy path which is 99% of repos.
- Terraform-off-by-default trades coverage for warm-path latency. Surface in the report as
deployment.terraform_skipped: true.
9. TestInventoryProbe (codegenie/probes/test_inventory.py)¶
- Purpose: Detect test framework, count tests, find smoke test, read coverage.
- Interface: outputs
test_inventoryslice per A6. - Internal design:
- Framework detection from
package.jsondevDependencies, not from filesystem scan. The msgpack-cached parsed package.json is reused. One dict lookup per framework. - Test count from
PathIndex.by_extension[".test.ts"] + by_extension[".test.js"] + by_extension[".spec.ts"] + by_extension[".spec.js"]— len() on precomputed tuples. ~5 microseconds. - Coverage data parsed lazily: only if
coverage/lcov.infoexists and the planner declaresrequires_evidence: test_inventory.coverage. The probe emitscoverage_data.present: true, path: ...without parsing the file. Phase 2's planner-on-demand-read pattern. - Tradeoffs accepted:
- We don't actually count tests; we count test files. A file with 30
it()blocks counts as 1. Stage 3 Planning consumes the count to pick recipes; recipes don't care about per-itgranularity. Cheaper, less informative, sufficient.
10. Cache-layer hot path (codegenie/cache/store.py extensions)¶
- Purpose: Make the warm-path cache lookup memoryless from a wall-clock perspective.
- Interface: Phase 0 API unchanged; adds an in-process LRU.
- Internal design:
functools.lru_cache-style in-process layer above the Phase 0 filesystem store.maxsize=128(covers all probes on a portfolio of ~20 concurrently-being-gathered repos at the per-worker level — Phase 14's worker model).- Cache lookup is one BLAKE3 of the PathIndex slice + one in-process dict lookup. The slice is precomputed; the hash is ~1 microsecond per KB. For Layer A's typical declared_inputs (~5 files, ~20 KB total), the hash is ~50 microseconds.
- mmap of the on-disk index is now a yes. Phase 0 deferred mmap citing cross-platform and concurrency concerns. I push back: macOS dev workstations are not the portfolio-scale target; Phase 14 runs Linux in k8s. Add mmap behind a feature flag
--cache-mmap(default on Linux, default off on macOS). The Phase 0 critique that "the index is single-digit MB through Phase 13" was scoped to single-repo; portfolio cache at Phase 14 will see GB-scale aggregate. Surfacing the conflict with Phase 0's no-mmap stance — this is the one place I push back on the prior phase. - Hash-validation on read is skipped on the LRU layer. The on-disk store re-hashes blob contents per ADR-0006's integrity model; the in-process LRU trusts itself (it's in our address space). Saves ~200 microseconds per cache hit. Surfaces as a risk if RAM corruption is in the threat model; it's not.
- Tradeoffs accepted:
- LRU size of 128 assumes the worker doesn't gather more than 128 distinct probe-keys concurrently. Fine for portfolio gather; might thrash on a one-shot CLI gather of a 10k-repo monorepo. Mitigation:
--lru-sizeflag. - Pushing back on Phase 0's mmap stance is a real disagreement. I'm flagging it as an open question for the synthesizer; if Phase 0 stays no-mmap, my numbers move by ~5%.
11. Streaming writer + hot-view projection (codegenie/output/writer.py extensions)¶
- Purpose: Don't buffer the whole
RepoContextin memory before writing. Pre-render the Phase-8 hot views as a side artifact during write. - Interface: Phase 0's atomic-write semantics preserved.
- Internal design:
- YAML stream emission as probes finish. Each probe's slice is written under its top-level key as soon as the slice arrives at the merger. The trailing
gather_statusand any cross-probe summary fields are written last. Reduces peak RSS by ~30% on a real Node repo (~200 KB YAML). views.jsonsibling artifact. During the write pass, project the four ADR-0013 hot-view slices (available_skills,entrypoint,risk_flags,confidence_summary) directly into a JSON file at.codegenie/context/views.json. Phase 8 reads this and loads it into Redis without a YAML parse round-trip. Pre-shaping for the Phase 8 hot path is the explicit cite of ADR-0013 — I'm doing the projection work now, two phases early, because it's free here (the data is in hand) and expensive there (Phase 8 would otherwise parse the YAML to project).- Sanitizer integrated into the streaming write, not as a post-pass. The Phase 0 sanitizer's two passes (field-name regex + path scrub) run on each slice as it's emitted; no second walk over the merged dict.
- Tradeoffs accepted:
- Streaming YAML emission means a probe failure mid-write leaves a partial artifact. Atomic-write via
.tmp+os.replacestill applies (the.tmpfile is the streaming target); on probe failure the.tmpis unlinked and the previous artifact is preserved. Net: same correctness as Phase 0's batch write, lower peak RSS. views.jsonis a Phase-1 artifact for a Phase-8 consumer. That's a forward dependency. Surface as a stable schema (views.schema.jsonlands in Phase 1); Phase 8 either consumes it or ignores it. If Phase 8 changes its hot-view list, Phase 1 follows; no breaking change for Phase 2–7.
Data flow¶
Walk-through of one representative warm-path incremental gather on a 50k-LOC Node.js service:
- Trigger: push webhook fires for
repo-1234. Continuous Gather Dispatcher (Phase 14 — for Phase 1 this is the CLI) invokescodegenie gather <path>. - CLI startup (≤ 80 ms): lazy imports per Phase 0. Click parses args. Pydantic, jsonschema, pyyaml not yet imported.
- SnapshotBuilder (≤ 200 ms warm): one
os.scandirwalk. ~50k files visited, ~8k in non-excluded dirs. Per-file work:stat, byte-pack into the fingerprint buffer, append to appropriateby_extensionbucket. BLAKE3 of the packed buffer =path_index_fingerprint. RSS at this point ~30 MB. - Prelude probe (LanguageDetection, ≤ 8 ms): cache key =
sha256("language_detection", "v1", "v1", path_index_fingerprint). LRU hit (this is an incremental run; the PathIndex fingerprint matches the previous run because no.js/.tsfiles changed). Output materialized from LRU. - Coordinator builds enriched snapshot:
dataclasses.replace(snapshot, detected_languages=...). ≤ 1 ms. - Wave 2 dispatch (5 probes in parallel): for each probe, compute cache key from
(probe.name, probe.version, sub_schema_version, content_hash_of(declared_inputs)). The content hash is computed from the PathIndex (the sizes and mtimes are already in hand); no re-reading of files. ≤ 10 ms total for all 5 cache-key computations. - Cache lookups: LRU first, on-disk store second. Assume
package.jsonchanged (push triggered the gather because someone bumped a dep): NodeManifest misses cache. The other 4 hit LRU. - NodeManifest re-execute (~400 ms warm):
package.jsonparsed withorjson(~3 ms).pnpm-lock.yamlparsed withruamel.yamlC-mode (~280 ms — the dominant cost, ~50k LOC of lockfile). Native module catalog match (~5 ms). Slice emission (~10 ms). msgpack-cache the parsed package.json for in-process reuse (~5 ms). - NodeBuildSystem cache-hit → on-disk store hit (~3 ms): the declared_inputs hash for NodeBuildSystem is
(package.json, pnpm-lock.yaml, .nvmrc, tsconfig.json).package.jsonchanged; cache miss. Re-execute: read the msgpack-cached parsed package.json (~1 ms), pull engines + scripts + bundler hints (~5 ms). Total ~12 ms (msgpack saved us ~200 ms of re-parsing). - CI, Deployment, TestInventory all LRU hits.
- Streaming write begins: as each probe completes (in any order), its slice is sanitized and appended to
repo-context.yaml.tmp. Total write wall-clock ~80 ms. - Hot-view projection: during the write pass,
views.json.tmpis populated with the four hot-view slices. ~5 ms. - Schema validation:
jsonschema.Draft202012Validator(compiled-once,lru_cache'd) walks the in-memory dict. ~30 ms on a typical envelope. - Atomic rename:
os.replaceon both.tmpfiles. ~2 ms. - Total wall-clock: ~750 ms p50; ~1.2 s p95 on this scenario. Under the 4-s warm-path target.
For the pure no-change incremental gather (PathIndex fingerprint matches previous run), step 4 already provides the answer — total wall-clock ≤ 250 ms (dominated by the SnapshotBuilder walk).
Failure modes & recovery¶
| Failure | Detected by | Recovery |
|---|---|---|
| LanguageDetection LRU stale (PathIndex fingerprint matches but on-disk filesystem changed under our feet — e.g., a parallel git checkout) | path_index_fingerprint mismatch on next gather |
Next gather invalidates; correctness preserved at the cost of one extra warm gather. Surfaced in logs as lru.stale_detected event. |
pnpm-lock.yaml parse timeout (huge lockfile, ruamel.yaml hangs) |
Phase 0 per-probe wait_for(timeout_s=30) |
ProbeOutput(errors=["lockfile parse timeout"], confidence="low"); gather continues with degraded NodeManifest slice; planner sees confidence: low and routes to RAG/LLM fallback. |
Hand-rolled yarn.lock parser returns empty (format edge case we didn't handle) |
Empty direct_dependencies.production on a repo where package.json declares deps |
Coordinator detects the inconsistency (parsed package.json has deps; lockfile parse says none). Falls back to node -e subprocess parser. Surfaces as node_manifest.fallback_parser_used in the audit record. |
| msgpack cache corruption between probes | NodeBuildSystem's read of the msgpack fails with a parse error | Re-parse package.json from source. Add ~3 ms; correctness preserved. Surfaces as inter_probe_cache.miss in logs. |
views.json schema doesn't match what Phase 8 expects |
Phase 8 contract test (when Phase 8 lands) | Phase 1 ships views.schema.json as a versioned contract. Phase 8 either pins to v1 or migrates Phase 1 first. No silent drift. |
| LRU thrash on a workstation gathering a 10k-repo monorepo | RSS pressure observed | --lru-size flag to cap; default sized for portfolio-worker model. |
Probe declares declared_inputs that doesn't actually capture all files it reads |
Cache hit on a run where the un-declared input changed → stale slice | Out of scope for Phase 1 to prevent; Phase 2's IndexHealthProbe catches it. Phase 1 adds a tests/adv/test_declared_inputs_completeness.py fuzz test that mutates files outside declared_inputs and asserts cache hit — at least we know we're testing the property. |
| mmap-on-Linux flag enabled on a filesystem that doesn't support mmap (some FUSE setups) | mmap call returns EINVAL | Catch, fall back to plain buffered read, log once per process. |
| Streaming write fails after some slices were written | .tmp file exists but is incomplete |
The previous valid repo-context.yaml is still in place (atomic rename hasn't happened). Unlink the .tmp; exit non-zero per Phase 0's exit code map. |
Resource & cost profile¶
- Tokens per run: 0. (ADR-0005 — no LLM in gather.)
- Wall-clock per run:
- Cold (first gather of a 50k-LOC Node repo): p50 ~8 s, p95 ~12 s. Dominated by initial PathIndex walk + first lockfile parse.
- Warm (no cache, but PathIndex fingerprint partial-match): p50 ~750 ms, p95 ~1.2 s.
- Incremental (no relevant file changed): p50 ~180 ms, p95 ~250 ms. Dominated by SnapshotBuilder walk + LRU.
- Memory per worker: ~80 MB resident between gathers; ~250 MB peak during a cold gather (PathIndex + lockfile dict + writer buffers). RSS returns to ~80 MB after gather completes (PathIndex dropped, msgpack cache cleared).
- CPU per run: Cold ~6 CPU-seconds. Warm ~1.5 CPU-seconds. Incremental ~0.3 CPU-seconds.
- Storage growth rate:
.codegenie/cache/blobs/grows ~10 KB per cache entry per probe per content hash. For a portfolio of 1000 repos with 6 probes and ~50% weekly churn, ~30 MB/week of blobs.cache gcsubcommand (from Phase 0) compacts weekly. - Hot vs cold cost ratio: ~40× (cold 6s vs incremental 0.15s). This is the whole point of Phase 1's design — ADR-0006 says "continuous gather"; this ratio is what makes continuous gather affordable.
- Per-probe peak RSS (advisory budget enforced from Phase 0):
- LanguageDetection: 5 MB.
- NodeBuildSystem: 15 MB.
- NodeManifest: 80 MB (lockfile parse dominates).
- CI: 5 MB.
- Deployment: 10 MB.
- TestInventory: 5 MB.
Test plan¶
Performance regression tests live alongside correctness tests, not in a separate suite. The Phase 0 tests/bench/ directory is reused; tests are advisory-gated by default and CI-blocking on PRs that touch the cache, coordinator, or PathIndex.
tests/bench/test_warm_path_latency.py— canary. Runs agatheragainst a committed 50k-LOC Node fixture twice (warm + warm-with-one-file-changed). Asserts:- Warm p95 ≤ 1.5 s (50% headroom over the 1.0 s target).
- Incremental p95 ≤ 350 ms (40% headroom).
- Cache hit rate ≥ 5/6 probes on incremental.
- Test is hermetic: uses
freezegunfor time, fixed seed for any randomness, runs N=10 iterations and reports p50/p95/p99. - Failure mode: CI-blocking on PRs touching
src/codegenie/cache/,src/codegenie/coordinator/, orsrc/codegenie/probes/. Advisory elsewhere. tests/bench/test_path_index_fingerprint_stability.py— asserts the same repo content yields the same fingerprint across runs and across machines (no embedded absolute paths, no platform-specific mtime granularity assumptions). This is correctness of the cache key; a flaky fingerprint silently kills the cache hit rate.tests/bench/test_per_probe_rss.py— usestracemallocto measure peak allocation per probe against a fixture; asserts the per-probe budget from §"Resource & cost profile".tests/unit/test_streaming_writer.py— asserts incremental probe completion produces an incrementally-growing.tmpfile and that the final atomic-renamed YAML matches the all-at-once baseline byte-for-byte.tests/unit/test_views_json_projection.py— given a populatedRepoContext, asserts the four hot-view slices are projected correctly and thatviews.jsonvalidates againstviews.schema.json. This is the Phase-8 forward-compat contract test.tests/adv/test_lru_correctness.py— fuzz test: mutate the underlying filesystem between gathers without touching the PathIndex; assert the gather still produces a correct artifact (the LRU must not lie about file contents it didn't actually see).tests/adv/test_cache_invalidation_on_probe_version_bump.py— bump a probe'sversionconstant; assert next gather is a miss; assert the artifact still validates.tests/integration/test_real_repo_gather.py— clones a small real OSS Node repo (e.g.,expressjs/expressmirror, pinned to a SHA), runs gather twice, asserts:- First run: cold path, produces a valid
repo-context.yaml. - Second run: incremental path, ≥ 5/6 probes cache-hit. (This is the roadmap's literal exit criterion, with a measurable assertion.)
- JSON Schema validation passes in both runs.
CI canary policy: the warm-path latency test is the regression gate. A 20% degradation on warm p95 or incremental p95 fails CI. Cold p95 is measured but advisory (cold-path regressions matter less; portfolio steady-state is warm).
Risks (top 3–5)¶
- mmap-on-Linux flag conflict with Phase 0's "no mmap" stance. I am explicitly pushing back on a Phase 0 decision. The Phase 0 critique cited Windows behavior, concurrent CLI races, and "single-digit MB through Phase 13." All three are scoped concerns; my position is that portfolio scale at Phase 14 changes the math. If the synthesizer keeps no-mmap, warm-path latency moves by ~5%. Surface as load-bearing disagreement, not silent revision.
- Inter-probe in-process cache (msgpack-of-parsed-package.json) is a new pattern. It's not in the Phase 0 ABC and not in
localv2.md §4. It's a coordinator-level optimization that probes opt into. Risk: a future engineer treats it as the contract and starts requiring it. Mitigation: documented as "implementation detail of the in-process coordinator; not part of the ABC; probes must work without it." Phase 9's Temporal lift will not have this in-process cache (Activities run in separate workers); the lift is correct because each Activity re-parses. Performance regresses at Phase 9 — surfaced now so it's not a surprise then. - Hand-rolled
yarn.lockparser is a sharp edge. I'm choosing it over anodesubprocess for ~200 ms of warm-path latency. The maintenance cost over the lifetime of the project might exceed that win. Mitigation: feature flag, fuzz-tested in CI, falls back to subprocess on parse-empty. If we ever see a real bug from the hand-rolled parser, flip the flag. views.jsonis a forward dependency on Phase 8. Phase 1 ships an artifact for a consumer two phases away. Versioned schema mitigates breaking-change risk; the risk is that Phase 8's hot-view list changes and Phase 1's projection is wrong. Mitigation:views.schema.jsonis in the same repo as the Phase-1 implementation; both move together.- Per-probe ResourceBudget RSS enforcement is "advisory" in Phase 0. I'm relying on advisory tracking for tight per-probe RSS budgets. If a probe silently breaches its budget (e.g., a pathological lockfile expands the heap), the worker OOMs at portfolio scale without warning. Mitigation: bench tests in
tests/bench/test_per_probe_rss.pyare advisory in Phase 1 but the planned Phase-14 RSS-enforcement landing must include Layer A probes in the first batch. Open question for the synthesizer: should Phase 1 ship hard RSS enforcement for Tier-0 probes (cheap, all pure Python) even though Phase 0 deferred it?
Acknowledged blind spots¶
- Cold-path latency. I optimized for warm and incremental. Cold gathers on huge monorepos (
turbo-style 100+ package workspaces) might hit 30s+; I did not model that workload. Stage 0 Discovery in Phase 10 will eventually need to cold-gather lots of repos in parallel; my numbers don't say anything about that. - Network probes. Layer A is fully filesystem-local. Layer B (Phase 2) introduces network calls (registry lookups, advisory fetches). My streaming writer + LRU don't help there; the perf-first design for Phase 2 will need a different shape (connection pooling, batched fetches).
- macOS dev UX. I default mmap off on macOS, hand-rolled parsers can have platform-specific behavior, and the C-extension YAML loader's mtime granularity differs across filesystems. I haven't bench-tested on macOS; my numbers are Linux numbers.
- Security posture of the in-process LRU. Phase 0's structural trust boundary (
_ProbeOutputValidator) runs before the LRU read on cache hit, but only when the probe newly executes. On LRU hit the ProbeOutput skips the validator. This is a deliberate optimization. The security lens will object; the answer is "the LRU only holds outputs that were validated when they were inserted." If the validator's invariants change (e.g., new secret regex), LRU entries don't re-validate. Either invalidate the LRU on validator version bump (one extra bit in the LRU key) or accept a one-shot recheck on validator change. - Bencher methodology. All my latency numbers are "estimated from the design"; I have not run them against a real fixture. The bench tests are designed; the numbers will move. I committed to targets, not measurements.
- Phase 2 collision. Phase 2 introduces IndexHealthProbe, which observes Layer A's confidence claims. If Layer A claims
confidence: highbut skipped tail features (e.g., non-primary workflow files for CI), IndexHealthProbe surfaces it asconfidence_impact: medium. I traded coverage for latency in CIProbe and DeploymentProbe; that trade is paid for in Phase 2's confidence story, not Phase 1's.
Open questions for the synthesizer¶
- mmap-on-Linux: keep my proposed flag (default on for Linux) or honor Phase 0's no-mmap stance? My ~5% warm-path number rests on this.
- In-process inter-probe cache (msgpack-of-parsed-package.json): is this an acceptable Phase-1 pattern, or should the synthesizer push it back to a per-probe parse with the same ~200 ms cost?
views.jsonforward-compat with Phase 8: ship in Phase 1, or wait? The work is free now; the contract risk is real.pyjson5vsorjson+ comment-strip vs sticking with stdlibjsonfortsconfig.json: I'm proposingpyjson5for correctness on JSON5 syntax. Phase 0 didn't take a position. The dep-add cost is low; the perf delta vs stdlibjson(after a regex comment-strip) is ~10 ms per gather. Bench at integration; let the synthesizer pick the floor.- Hard RSS enforcement for Layer A Tier-0 probes: ship now, or defer to Phase 14 with the rest? My numbers say Tier-0 is cheap to enforce.
- Hand-rolled
yarn.lockparser: ship it, or take thenodesubprocess hit until someone needs the latency? My answer is "ship, behind a flag, with subprocess fallback"; the synthesizer can flip the default. - The roadmap exit criterion "useful repo-context.yaml": my warm/incremental targets are ~5–20× more aggressive than the roadmap implies. Is the synthesizer optimizing for the roadmap's spec, or for the production-target ADR-0006 throughput? My answer is "ADR-0006"; the synthesizer might disagree.