Phase 02 — Context gathering — Layers B–G: Performance-first design¶
Lens: Performance — throughput, latency, token economy, footprint. Designed by: Performance-first design subagent Date: 2026-05-14
Amendment note (2026-05-17 — 02-ADR-0011): the tree-sitter import-graph section's "in-process via
py-tree-sitterbindings + ThreadPoolExecutor inside the probe" proposal records the performance-lens position. The synthesizer (final-design.md) already rejected the internalThreadPoolExecutor(hidden parallelism lies to the coordinator's single semaphore — see 02-ADR-0003); that decision is unchanged. 02-ADR-0011 amends only howpy-tree-sittergrammars are delivered: from vendored.sofiles to PyPI wheels (tree-sitter-typescript,tree-sitter-javascript), behind acodegenie.grammars.lock.language_forkernel. The per-file latency budget (~5 ms / file) and the named-trigger C-extension discipline (Phase 1 ADR-0009 admitspy-tree-sitteras the one exception) carry forward unchanged.
Lens summary¶
Phase 2 is where the gather pipeline stops being cheap. Phase 1 was a 1-second walk of package.json plus a few YAML reads; Phase 2 adds scip-typescript (8 s cold on a 50k-LOC repo), syft against a built image (3 s after a 47 s docker build), grype (5 s), semgrep (12 s), and strace-driven runtime traces across five scenarios (84 s). Cold gather is the localv2.md §3.2 figure of 3–6 minutes; Phase 2 is what makes that number real. The continuous-gather model from ADR-0006 (50,000+ gathers/day at portfolio scale) does not survive a naïve Phase 2 implementation.
My lens optimizes for three things in this order: (1) keep every probe off the critical path that the cache can answer for it — Phase 1's cache-hit math (cold 6 min → warm 30 s → incremental <10 s) is a function of declared_inputs discipline, and Layer B–G probes have wildly more inputs than Layer A (a SCIP index depends on every .ts file, a runtime trace depends on the built image digest, semgrep depends on every source file plus the rule pack); each probe's declared_inputs must be designed to maximize hit rate per cost tier. (2) Bound the cold path via cost-tier-aware concurrency: B/G CPU-bound parsers run at cpu_count() slots, C heavy I/O (docker build + strace) gets serialized to 1 slot, network probes (registry pulls) get a separate budget. The single fork-and-forget asyncio.Semaphore from Phase 0 is insufficient. (3) Shape outputs for TCCM consumption (ADR-0029, ADR-0030) — every structural artifact this phase produces is the input to a graph-aware derived query in Phase 3+. If the shape requires re-parsing or re-indexing at query time, we lose the cache leverage we already paid for in gather. import_graph.reverse_lookup and test_inventory.tests_exercising must be pre-projected as adjacency lists keyed by module path, not raw SCIP/tree-sitter dumps the Bundle Builder re-walks.
I deprioritize: pretty progress UX, semantically rich error messages, parallelism inside individual external tools (we don't control their threading), and the second-order observability surface (Phase 13 owns that). I will surface IndexHealthProbe (B2) as the most expensive thing the cache cannot save us from — and the cheapest thing the cache makes load-bearing — and design accordingly.
Goals (concrete, measurable)¶
These are aggressive against the localv2.md baseline; the synthesizer should push back where the security or best-practices lens reaches a hard veto.
- Workflows/hour target (Layer B–G only, single 8-core worker, steady state): ≥ 600/hr (= 6 s p50 wall clock per incremental gather, ≥ 92% cache hit per probe across the portfolio). Cold gathers are amortized — at portfolio scale they happen once per repo per quarter.
- Time-to-PR contribution from Layer B–G (p95):
- Incremental gather (one source file changed, SCIP delta only, no image rebuild): ≤ 2 s.
- Warm gather (source unchanged, image unchanged, all caches valid): ≤ 0.8 s.
- Cold gather, typical 50k-LOC Node service with built image: ≤ 180 s (vs. localv2.md §3.2's 3–6 minute baseline — the floor is image build + scip-typescript, not Python).
- Cold gather excluding
docker build(image already in local registry): ≤ 60 s. - $/PR target: $0.00 for the gather pipeline itself (ADR-0005 — no LLM anywhere). Out-of-pocket cost is CPU-seconds + registry-pull bytes; budget ≤ 30 CPU-seconds per incremental gather, ≤ 360 CPU-seconds per cold gather.
- Cache hit rate target per probe (rolling 7-day portfolio average):
SCIPIndexProbe(B1): ≥ 70% (every push touching.tsinvalidates).IndexHealthProbe(B2): N/A — must run every gather (its job is to validate other probes' freshness; if cached itself, the whole staleness story collapses).SBOMProbe(C2),CVEProbe(C3): ≥ 95% (keyed on image digest; only changes ondocker build).RuntimeTraceProbe(C4): ≥ 90% (keyed on image digest + scenario set).SemgrepProbe(G1): ≥ 80% (keyed on source-tree hash + rule-pack version).SkillsIndexProbe,ConventionProbe,ExternalDocsIndexProbe(D2, D5, D9): ≥ 99% (rule packs and skills change at human cadence, not commit cadence).- Per-worker memory ceiling: ≤ 800 MB RSS during cold gather (dominated by
scip-typescriptandsemgrepsubprocesses we don't control); ≤ 300 MB RSS during warm gather; ≤ 120 MB RSS idle between gathers in Phase 14's long-lived worker mode. - p99 incremental gather: ≤ 4 s. The 99th percentile is what determines worker-pool sizing in Phase 14, not the median.
- TCCM consumer latency (forward-looking, no Redis yet): the four ADR-0032 adapter operations (
dep_graph.consumers,import_graph.reverse_lookup,scip.refs,test_inventory.tests_exercising) must be answerable in ≤ 50 ms p95 from Phase 2's on-disk outputs, with zero re-indexing. The hot views (ADR-0013) Phase 8 will project from these must reduce to ≤ 5 ms; Phase 2's job is to make the cold (un-Redis-cached) answer fast enough that Phase 8 is a 10× polish, not a 100× rescue. - IndexHealthProbe staleness blast radius: zero workflows reach Stage 3 (planning) with
scip.confidence < 0.7and no logged adapter downgrade. Operationalized as a fence-CI assertion against a deliberately-seeded stale-index fixture per the roadmap exit criterion. - Tokens per run: 0. Phase 0
fenceCI job continues to assert (Layer B–G adds zero LLM SDKs togatherextras).
Architecture¶
codegenie gather <path>
│
▼
┌────────────────────────────┐
│ Phase 0/1 CLI + Coordinator│
│ (unchanged; PathIndex │
│ already built; Layer A │
│ slices already in cache) │
└──────────────┬─────────────┘
│
┌───────────────────┴────────────────────────────┐
│ PLUGIN LOADER (Phase 2 addition) │
│ - scans plugins/*/plugin.yaml at startup │
│ - resolves (task × lang × build-tool) tuple │
│ from RepoContext.layer_a slices │
│ - unions probe requirements across resolved │
│ plugin chain (ADR-0031) │
│ - registers adapter import paths (ADR-0032) │
│ - Pydantic-validated; fail-fast on bad manifest│
│ - PHASE 2 SHIPS: kernel-only probes registered │
│ by `plugins/universal--*--*/plugin.yaml` │
│ (the fallback); Phase 3 adds the first │
│ concrete plugin (vuln-remed--node--npm) │
└───────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ COST-TIER COORDINATOR (Phase 2 extension of Phase 0) │
│ │
│ Tier 0 zero-fork, pure-Python (kernel) │
│ ConventionProbe, ExceptionProbe, ADRProbe, │
│ RepoConfigProbe, RepoNotesProbe, SkillsIndexProbe, │
│ IndexHealthProbe │
│ concurrency: Semaphore(cpu_count()) │
│ │
│ Tier 1 CPU-bound subprocess (kernel) │
│ SemgrepProbe, AstGrepProbe, GrepProbe, │
│ SCIPIndexProbe (scip-typescript), BuildGraphProbe, │
│ GeneratedCodeProbe, NodeReflectionProbe │
│ concurrency: Semaphore(max(cpu_count() // 2, 2)) │
│ │
│ Tier 2 heavy I/O / container builds (kernel) │
│ DockerfileProbe, SBOMProbe, CVEProbe, CertificateProbe, │
│ EntrypointProbe, ShellUsageProbe │
│ concurrency: Semaphore(2) [docker daemon serializes │
│ builds anyway] │
│ │
│ Tier 3 network + sandbox (kernel) │
│ ExternalDocsProbe (Confluence/URL fetches), │
│ RuntimeTraceProbe (5-scenario strace) │
│ concurrency: Semaphore(1) [strace cannot share a │
│ running container instance] │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────────┴────────────────────────────┐
│ PER-PROBE CACHE (Phase 0 + per-tier policy) │
│ - Tier 0: blob hash over declared_inputs │
│ - Tier 1: blob hash + tool-version stamp │
│ - Tier 2: image-digest-keyed cache (orthogonal │
│ to source-tree hash; per-image, not per-run) │
│ - Tier 3: scenario-set + image-digest │
└───────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ TCCM-SHAPED OUTPUT PROJECTIONS │
│ (Phase 2 addition; pre-rendered at write time) │
│ │
│ .codegenie/context/raw/ (raw probe outs)│
│ .codegenie/context/projections/ │
│ ├── import_graph.adj.json (forward+reverse) │
│ ├── dep_graph.adj.json (forward+reverse) │
│ ├── scip.symbols.idx (mmap-able) │
│ ├── test_exercises.adj.json (test → src files)│
│ └── _provenance.json (which probes, │
│ which versions, │
│ fold-input hash) │
└─────────────────────────────────────────────────┘
│
▼
.codegenie/context/repo-context.yaml (envelope)
.codegenie/context/raw/ (probe blobs)
.codegenie/context/projections/ (TCCM hot paths)
.codegenie/events/ (append-only;
Phase 9 will
project from)
Three things to read from the diagram:
- Cost tiers replace the single Phase 0/1 Semaphore. A
RuntimeTraceProbetaking 84 seconds inside the same concurrency budget as aConventionProbetaking 8 ms is the architectural bug that kills cold-gather wall-clock. Per-tier semaphores let cheap probes finish during the long-tail of expensive ones without making the docker daemon queue 14 builds at once. - The Plugin Loader runs at startup, not per-probe. This phase ships the loader and the universal-fallback plugin only (per ADR-0031's "no-match fallback" — every workflow must have a known handler from day one). Phase 3 ships the first concrete plugin without re-architecting the loader. The kernel-resident Layer B–G probes register via the universal-fallback plugin's
contributes.probes:list, not by being imported at coordinator-init time. - TCCM projections are a write-time fan-out, not a query-time computation. ADR-0030's
import_graph.reverse_lookup(module)is an O(1) dict lookup against a projection if we pre-compute the reverse adjacency list at gather time; it's an O(N files) tree-sitter re-walk if we don't. Phase 2 pre-pays this cost so Phase 3's plugin and Phase 8's hot views inherit a cache that already knows the answer.
Components¶
1. Cost-tier coordinator (extends Phase 0/1)¶
- Purpose: Run the right number of expensive probes at the right time. Phase 0/1's single
Semaphore(min(cpu_count(), 8))is fine when every probe is a 50-ms YAML parse; it falls apart when one probe isdocker buildand another isstrace-on-running-container. The tier model is the cheapest, smallest change to Phase 0 that fixes this. - Interface: Each probe declares a new optional
cost_tier: Literal[0, 1, 2, 3]field on theProbeclass (default0for backward compatibility — Phase 0/1 probes are untouched). The coordinator holds fourasyncio.Semaphores sized per tier;_acquire_for(probe)picks the right one. This is one added field on the ABC and is gated by a Phase-2 ADR (docs/phases/02-context-gather-layers-b-g/ADRs/0001-cost-tier-coordinator.md) — same governance discipline as Phase 1'sparsed_manifestfield addition. - Internal design:
- Tier-0 cap:
cpu_count()slots. Probes are pure Python, sub-100ms each. Letting all of them race finishes the cheap work before tier-1 needs to wait. - Tier-1 cap:
max(cpu_count() // 2, 2). CPU-bound external tools (semgrep,scip-typescript) consume their own CPU. Two of them on an 8-core box is the sweet spot; four contend. - Tier-2 cap:
2. The docker daemon serializes builds at the daemon level anyway; running more in parallel just queues them and triples wall-clock variance. The cap of 2 lets a build + asyft-against-built-image run concurrently. - Tier-3 cap:
1. Strace against a running container cannot share the container with another strace, and the runtime-trace scenarios within a probe run serially by design (startup, then smoke, then healthcheck, etc.). - No global semaphore. Tiers do not gate each other. A tier-3 strace run can be in flight while tier-0 conventions parse. This is the whole point.
- Tradeoffs accepted:
- More state in the coordinator (four semaphores, one per probe acquire/release cycle).
- Probes pick their own tier; misclassification means worse throughput. Mitigated by a CI lint that asserts each probe declares
cost_tierexplicitly (no default) and a per-tier wall-clock canary intests/bench/that fails the build if a tier-0 probe's p95 exceeds 200 ms. - Pattern decision: Plugin architecture (the kernel knows about tiers; probes self-classify). Refuses Strategy with one implementation — tiers are data on the probe, not a class hierarchy.
2. PluginLoader (codegenie/plugins/loader.py — NEW)¶
- Purpose: Operationalize ADR-0031 minimally in Phase 2. Ships the universal
(*, *, *)fallback plugin so the kernel-resident Layer B–G probes are registered through the same mechanism Phase 3's concrete plugin will use — no special case in the coordinator, no kernel-resident "blessed" import paths. - Interface:
PluginLoader.discover(plugins_root: Path) -> PluginRegistry: filesystem walk; oneplugin.yamlper plugin directory; Pydantic-validated on load; fail-fast diagnostic naming file + field on any malformed manifest (per ADR-0031's "Schema enforcement and validation" section).PluginRegistry.resolve(task: TaskClass, lang: Language, build: BuildSystem) -> ResolvedPluginChain: returns the orderedextendschain; in Phase 2 always resolves to[universal--*--*]because no concrete plugin exists yet.ResolvedPluginChain.probe_requirements() -> set[ProbeId]: unions probe-requirement contributions across the chain. The coordinator's probe-set for the gather is this union plus Layer A probes (Layer A is universally required).- Internal design:
- The kernel ships one plugin in Phase 2:
plugins/universal--*--*/. It declares: scope(*, *, *), precedence0(lowest — every concrete plugin beats it), and contributes every Layer B–G kernel-resident probe (SCIPIndexProbe,IndexHealthProbe,SemgrepProbe,RuntimeTraceProbe, the full list) plus the universal HITL escalation subgraph (Phase 2 ships only a stub subgraph; Phase 6 makes it real). - Adapter registration in Phase 2 is empty. The universal fallback declares
contributes.adapters: {}because no concrete language stack lives in the kernel. Phase 3's first plugin ships the four ADR-0032 adapters. Phase 2's job is to make sure the manifest mechanism exists and the loader fails-fast on bad imports — not to ship adapters yet. - Plugin discovery is at CLI startup, not per-gather. The registry is built once, cached in-process, invalidated only on
plugin.yamlmtime change (a 4 msos.statcheck on each gather). - Pydantic models for
plugin.yamlschema live atcodegenie/plugins/manifest.py. Same domain-modeling discipline as ADR-0033 —PluginId,PluginScope(smart-constructor-parsed fromtask--lang--buildstrings),PluginPrecedence,ExtendsChain. Nodict[str, Any]plugin manifest soup. - Tradeoffs accepted:
- Shipping the loader in Phase 2 before its first concrete consumer (Phase 3) feels speculative. Justified: the alternative is registering Phase 2's B–G probes via the Phase 0/1 import-side-effect mechanism and then refactoring them into a plugin in Phase 3 — that refactor would be either an "extension by editing" violation or a multi-week churn. Better to ship the seam now with one trivial plugin.
- The universal fallback's subgraph is a stub. That's correct — Phase 2 is gather-only; subgraphs ship with consuming phases.
- Pattern decision: Plugin architecture (kernel never imports plugins by name; the loader is the registry). Registry pattern at the manifest level (the
PluginRegistryis a typed dict). Hexagonal: the loader is a Port; "Pydantic-validated YAML on disk" is the only Adapter today, but Phase 14's webhook-driven loader and Phase 16's signed-plugin loader plug into the same Port.
3. IndexHealthProbe (B2 — the most important probe)¶
- Purpose: Make stale-index failures loud, not silent. Per ADR-0030: when SCIP is stale,
scip.refsreturns wrong call sites and the entire TCCM-driven Bundle is wrong. ADR-0032 declares the SCIP adapter'sconfidence()as the gate input to graceful degradation.IndexHealthProbeis what computes that confidence. - Interface: Standard probe ABC.
name = "index_health",layer = "B",cost_tier = 0,applies_to_languages = ["*"],applies_to_tasks = ["*"],requires = [].cache_strategy = "none"— the probe must run every gather; caching the freshness report is the same bug as cachingDate.now(). - Internal design:
- Reads only Phase 2 probe outputs that already wrote to
.codegenie/context/raw/. It does no source-tree walk of its own — the inputs are sibling-probe artifacts plus a small set of cheap repo facts. - Per-source-of-truth freshness checks (each is one logical assertion, all sub-millisecond):
- SCIP:
(scip_indexed_commit == repo.HEAD),(files_indexed / files_in_repo >= 0.95),(indexer_errors == 0),(scip_index_mtime > all source-file mtimes). Confidence is the harmonic mean of the four signals, snapped to{trusted, degraded, unavailable}per ADR-0033 sum-type discipline. - Runtime trace:
(traced_image_digest == current_built_image_digest),(all configured scenarios present),(no scenario_failed flag). Image-digest mismatch is the killer — and it's the most common silent failure in distroless migration, hence the load-bearing test. - SBOM / CVE: image-digest match.
- Semgrep:
(scanned_files == files_in_repo for matching extensions),(rule_pack_version matches probe's declared version).
- SCIP:
- The output slice is the localv2.md §5.2 B2 shape verbatim, with
confidencemodeled as a tagged-unionAdapterConfidence = Trusted | Degraded(reason: str) | Unavailable(reason: str)per ADR-0033.DegradedandUnavailablecarry a reason string the adapter dispatch (ADR-0032) and Phase 11 audit can read without parsing free text. - Loud-not-quiet on degradation: every
Degraded/Unavailableconfidence emits a Pydantic event written to.codegenie/events/(the Phase 9-shaped event log; see Component 8) AND a CLI warning visible in the run summary. Phase 14's continuous gather will dashboard these. - Performance argument:
IndexHealthProbeitself is sub-100ms — it does no work except read sibling outputs and run cheap assertions. The cost it imposes is the cost of running every gather, which against the cache-hit rate target is 1 invocation per incremental gather (most other probes hit cache). It is the cheapest probe in the system and the highest-leverage one. - Tradeoffs accepted:
- It depends on every other Layer B/C probe having already written its output. In the cache-hit case those outputs are loaded from cache rather than freshly written; the probe runs after the merge step (Phase 0's existing merge sequence handles this —
IndexHealthProbedeclaresrequiresfor every probe it consults; the existing topological ordering enforces it). - The probe knows about every other probe's output shape — coupling. Mitigated by exposing the freshness assertions as small per-probe
health_check(slice) -> AdapterConfidenceProtocol implementations, registered alongside each probe.IndexHealthProbeis a fold over those checks. New Phase 3+ probes self-register a health-check;IndexHealthProbedoesn't change. - Pattern decisions: Tagged union for
AdapterConfidence(illegal-states-unrepresentable: aTrustedcannot carry areason, anUnavailablecannot lack one). Specification pattern for the per-source-of-truth assertion suite. Open/Closed: adding a new freshness signal is a newhealth_checkregistration, not an edit toIndexHealthProbe.
4. SCIPIndexProbe + projection (B1)¶
- Purpose: Run
scip-typescriptand write its output in two shapes — the raw.scipbinary blob (for SCIP tooling that consumes it natively) AND a pre-computed adjacency-list projection atprojections/scip.symbols.idxkeyed by symbol soscip.refs(symbol)in ADR-0032's adapter is an O(log N) mmap'd lookup. - Interface: Standard probe ABC.
cost_tier = 1.declared_inputsincludes every.ts,.tsx,.js,.mjs,.cjsin the repo plustsconfig*.jsonplus thescip-typescriptbinary version stamp (so a tool upgrade invalidates the index — the alternative is silent staleness when the tool itself changes). - Internal design:
- Runs
scip-typescript index --output .codegenie/context/raw/scip-index.scip --infer-tsconfigviaexec.run_allowlisted. Hard captimeout_seconds = 300(cold gather on a large monorepo); typical 8-15 s on a 50k-LOC repo per localv2.md §1. - After the binary write, a small in-process parser walks the SCIP protobuf using the
scip-pythonreader library (added as a Phase 2 dep, ratified by ADR0002-scip-python-reader-for-projections.md). It emitsprojections/scip.symbols.idxas a sorted-by-symbol-id MessagePack-on-disk index (chosen over JSON for deserialization speed — MessagePack parses ~5× faster than JSON for the 100k-symbol shapes typical of a medium-sized service, which is the load-bearing property for the ADR-0032 adapter's p95 latency target). The index format is documented indocs/phases/02-context-gather-layers-b-g/projection-formats.md. The Phase 2 ADR captures:scip-pythonis read-only here (it's a parser, not an indexer; we usescip-typescriptto write); the MessagePack-on-disk choice resolves the "should we re-parse.scipevery query" question against ADR-0030's 50ms p95 target. - The projection structure: a length-prefixed list of
(symbol_id_bytes, ref_count, ref_offset)records, sorted bysymbol_id_bytes. Lookups by symbol are binary search over the header; ref payloads are mmap'd. No re-parse, no full-load. - Incremental SCIP is NOT in Phase 2 scope.
scip-typescriptdoes not support per-file incremental index update at the time of writing; we re-index the whole repo on any.tschange. This is the dominant cold-gather cost. Two mitigations: (a) cache key includes a Merkle root over.tsfiles, so an unchanged repo on a re-run is a cache hit (the common case in a portfolio); (b) the projection step is itself cached against the.scipblob hash, so even a SCIP re-index whose output happens to match (rare but possible for whitespace-only edits) skips the projection write. - Tradeoffs accepted:
- One full re-index per
.tschange. The portfolio-scale economics still work because (i) the cache invalidation is per-repo, not per-org; (ii)scip-typescriptupstream is moving toward incremental and Phase 14 can adopt it without changing the projection shape; (iii) the projection write is cheap (~500ms for a 100k-symbol service). - Adds
msgpackandscip-python(parser-only) togatherextras. Two libraries, both pure-Python or pure-C with no LLM ancestry. Phase 0's fence test continues to pass. - Pattern decisions: Functional core (the projector is pure:
.scipbytes → projection bytes), imperative shell (the probe runs the subprocess and writes the artifacts). Adapter pattern (the projection IS the adapter interface — Phase 3'sNodeScipAdapterreads this format directly, no translation needed).
5. Dependency graph + tree-sitter import graph (B5 + B3 + new projection)¶
- Purpose: Produce the three remaining ADR-0032 adapter inputs —
dep_graph(forward+reverse package adjacency),import_graph(forward+reverse file-level edges from tree-sitter), andtest_inventory.tests_exercising(test-file → exercised-source-file map) — as pre-projected adjacency lists. - Interface:
BuildGraphProbe(B5 — Phase 2 promotes from "monorepo-only" to "always-runs"): emitsprojections/dep_graph.adj.jsonwith forward edges{package: [deps]}and reverse edges{package: [consumers]}. Single repos collapse to a one-node graph.NodeReflectionProbe(B3 — extended) + newTreeSitterImportGraphProbe: the latter is acost_tier=1probe that runstree-sitter-based import extraction across every source file and emitsprojections/import_graph.adj.jsonwith forward+reverse adjacency.TestCoverageMappingProbe(G3) emitsprojections/test_exercises.adj.jsonby joining its coverage data against the import graph.- Internal design:
- Tree-sitter import extraction runs in-process via
py-tree-sitterbindings (already a Phase 1 dep per localv2.md §6 for thetree-sitterfallback). Per-file extraction is ~5 ms; for a 50k-LOC repo with 2k source files, ~10 s cold. Concurrency: tier-1 cap (cpu_count() // 2), but the actual extraction parallelism comes fromconcurrent.futures.ThreadPoolExecutorinside the probe (the GIL releases for the tree-sitter C extension, so threading is real) — the probe sees one tier-1 slot but uses ~4 threads inside it. This is the second concurrency layer the Phase 0 architecture does not anticipate; it lives inside the probe, not the coordinator, so no Phase 0 contract change. - All three projections share a common file shape:
{"forward": {node: [neighbors]}, "reverse": {node: [neighbors]}, "_provenance": {...}}. Adapter implementations (Phase 3) load the JSON once on first call and keep it in adapter-instance memory (~10 MB peak for a 50k-LOC service — well under the 800 MB worker ceiling). - Projection invalidation is per-projection, not per-gather. If
BuildGraphProbehit cache butTreeSitterImportGraphProbere-ran, onlyimport_graph.adj.jsonrewrites. Phase 0's per-probe-output write model already gives us this; we just have to keep projections out ofrepo-context.yaml(they're sibling artifacts inprojections/). - Why this shape:
- TCCM derived queries become O(lookup), not O(walk). ADR-0030's
import_graph.reverse_lookup(module)isproj["reverse"][module]— a dict access against a pre-built mapping, sub-millisecond. Without this projection, the adapter has to walk the SCIP index or re-parse tree-sitter output every call, which is the Phase 3 trap (cold queries dominate the Bundle Builder's p95). - The "wider net" cases scale with
should_read/may_readbudget, not with repo size. Atransitive_callers(file_set, depth=3)query is a bounded BFS over the projection. On a 5k-file repo it's the same cost as on a 50k-file repo because the bound is on the result set, not the graph. - Tradeoffs accepted:
- Three sibling JSON files growing over repo size. For a 50k-LOC service the total projection footprint is ~5 MB, well below the per-gather storage budget.
- The forward+reverse duplication doubles storage per graph. Cheaper than recomputing reverse at query time across the portfolio: a 5 MB write at gather time vs. a 50 ms recomputation per Bundle build × 50,000 gathers/day × 7+ TCCM queries per workflow.
- Pattern decisions: Functional core (graph projection is a pure fold over probe outputs). Adapter pattern at the file-format level (the projection format IS the ADR-0032 adapter contract; the adapter is a thin loader). Open/Closed: new graph types (call graph from runtime traces, code-ownership graph) land as additional projection files, never as edits to existing ones.
6. Tier-2 container probes (C1–C7) — image-digest-keyed cache¶
- Purpose: Make the C-layer probes cache against image digest, not against source-tree hash. A repo whose
Dockerfileis unchanged but whosepackage.jsonchanged re-runs A and B probes; the C-layer probes that consume the built image (SBOM, CVE, runtime trace, certificate) should hit cache as long as the image digest hasn't moved. - Interface: The Phase 0 cache key derivation gets a per-probe override hook. Probes opt into image-digest keying by overriding
cache_key: The Dockerfile probe (C1) keys against the Dockerfile content. SBOM/CVE/runtime-trace key against the built image digest. - Internal design:
DockerfileProberuns first within tier 2 (declaresrequires = []); emits the parsed Dockerfile and triggersdocker buildif no built image with the expected digest is present in the local registry (docker images --digests --filter "label=codegenie.builds=<hash>"). The build step is what costs 47 s in the localv2.md timing — and it's the cache key the rest of tier 2 inherits. We do NOT re-build if the digest matches.SBOMProbe,CVEProbe,CertificateProbeall share the image digest as their cache key. Result: on apackage.json-only change with no Dockerfile or lockfile change, all three hit cache. On aDockerfilechange, all three miss together — they share the cost of rebuilding the image once.RuntimeTraceProbekeys on(image_digest, scenario_set_hash, scenario_timeout_seconds). Scenario-set changes (a user adds an error-path scenario to config) invalidate. The probe itself runs scenarios serially (tier-3 cap of 1, internal serial loop) — there is no parallel-strace story that survives container determinism.- All tier-2 probes share an
exec.run_allowlistedwrapper that scrubsdocker/podmanoutput of timestamps, image build-IDs, and ephemeral container IDs before hashing. Without this, two byte-identical runs ofdocker buildproduce non-identical stdout and break the cache. This wrapper lives incodegenie/probes/_container/normalize.py. - Tradeoffs accepted:
- The image-digest cache key bypasses the Phase 0
declared_inputsmechanism for these probes. This is the one Phase 2 deviation from Phase 0/1's source-tree-hash cache discipline, and it requires an ADR (0003-image-digest-cache-keys.md). The justification: the image digest IS the content the probe consumes; source-tree hash would force re-runs on every commit even when the built image hasn't moved, and at portfolio-scale 50k gathers/day × 5 container probes × 5 s saved per cache hit = a meaningful cost reduction. - The cache survives
docker system prune. If the user prunes the image, the next gather rebuilds (it has to). The cache entry's image-digest key just becomes unresolvable — the probe re-runs, fills the cache, and life continues. This is correct. - Pattern decisions: Strategy pattern at the cache-key level (probes choose source-tree-hash vs. image-digest keying via the existing
cache_keyoverride). Refuses making this a Phase 0 ABC change — the override hook already exists; we just exercise it.
7. Skills loader + conventions catalog + external docs (D2, D5, D8, D9)¶
- Purpose: The "human-cadence" probes — SkillsIndexProbe, ConventionProbe, ExternalDocsProbe, ExternalDocsIndexProbe — change at human writing speed (days/weeks), not commit speed (minutes). They must hit cache aggressively and pre-render their outputs for hot consumption.
- Interface: Standard probe ABC, all
cost_tier = 0exceptExternalDocsProbe(cost_tier = 3— does network fetches). - Internal design:
- SkillsIndexProbe: walks
~/.codegenie/skills/,.codegenie/skills/,~/.codegenie/skills-org/. Parses YAML frontmatter only (the body is never loaded intoRepoContext, per the localv2.md §5.4 D2 spec). Cache key includes file mtimes of everySKILL.mdplus the Phase 0 schema version. Output: a flat list of{name, description, applies_to, requires_evidence, required_tools, source_path}records. ~10 ms cold on a typical Skills directory of ~30 entries. - ConventionProbe: loads
~/.codegenie/conventions/*.yaml, validates each rule against a schema, indexes bydetect.type. The detection itself doesn't run during gather — gather only enumerates available conventions; per-rule evaluation happens at Planner time (Phase 3+) to keep gather purely structural. - ExternalDocsProbe (D8) + ExternalDocsIndexProbe (D9): These are the two probes where caching has to be very aggressive — external fetches against Confluence/Notion are slow and rate-limited. Cache key: hash of
external_docs:config + per-source fetch endpoint + per-sourcelast_modified(which the source itself reports). Index (BM25 via Tantivy) is rebuilt only if any underlying doc changed. - All four probes pre-render their TCCM consumption shapes.
SkillsIndexProbewrites aprojections/skills.by_applies_to.jsonkeyed by(task_class, language)so the Bundle Builder's "which skills apply" lookup is O(1).ConventionProbewritesprojections/conventions.by_detect_type.json. The Phase 8 hot views (ADR-0013) project from these projections; same shape, just in Redis. - Tradeoffs accepted:
- The skills directory mtime walk is a few hundred
os.statcalls. ~5 ms even on a large skills directory. Cheaper than caching the directory listing itself (which would need its own invalidation logic). - The Tantivy BM25 index is a directory of files, ~5 MB per ~50 docs. Acceptable; the alternative (re-indexing every gather) is 1-2 seconds we don't want to pay on the warm path.
- Pattern decisions: Registry pattern (skills register themselves by being on disk; the probe is the loader). Functional core (parsing + indexing). Refuses Strategy for the doc-source kinds — there are three (Confluence, Notion, filesystem URL list) and they share enough that a single source-walker with three loaders is cleaner than three Strategy implementations.
8. Append-only events stream (codegenie/events/)¶
- Purpose: Per ADR-0034, the canonical Postgres event log lands in Phase 9. Phase 2 cannot ship the database, but it can ship probe outputs in a shape that projects cleanly into the future event stream. Specifically: every probe that reports a degradation, a low-confidence answer, an adapter fallback, or a cache invalidation that surprised an operator emits a typed event to
.codegenie/events/<gather-utc>-<probe>.jsonl. - Interface: A small
codegenie/events/writer.pyexposesemit(event: PhaseEvent)wherePhaseEventis a Pydantic discriminated union per ADR-0033. Phase 2 ships three event variants:class IndexHealthDegraded(BaseModel): kind: Literal["index_health_degraded"] = "index_health_degraded" probe_id: ProbeId confidence: AdapterConfidence reason: str class ProbeCacheInvalidated(BaseModel): kind: Literal["probe_cache_invalidated"] = "probe_cache_invalidated" probe_id: ProbeId reason: Literal["input_changed", "tool_version_changed", "schema_version_changed", "image_digest_changed"] class ExternalToolMissing(BaseModel): kind: Literal["external_tool_missing"] = "external_tool_missing" tool: str probe_id: ProbeId downgrade_path: str - Internal design:
- Append-only JSONL per gather, named for the UTC start time. Atomic-replace not needed (single writer per gather; OS-level
O_APPENDsemantics suffice). - Phase 9 will project these into the canonical Postgres event log by walking
.codegenie/events/andINSERT INTO events ... ON CONFLICT DO NOTHINGkeyed byevent_id. We pre-generate UUIDv7event_ids in Phase 2 so the migration is collision-safe. - The event log is NOT the place we write every probe execution. It's the place we write decisions — Phase 13 will read it to compute fallback-trigger rates, Phase 14 will alert on
IndexHealthDegradedspikes, Phase 11 will project per-workflow audit trails. Verbose probe-execution logging stays in.codegenie/logs/. - Performance argument: ~10 events per gather × 200 bytes × 50k gathers/day = ~100 MB/day of event JSON. Negligible at portfolio scale, eligible for daily rotation + compression. The cost of not shipping this shape in Phase 2 is that Phase 9 has to retrofit it across every probe — and the post-hoc retrofit is what produces "event soup" per the ADR-0034 warning.
- Tradeoffs accepted:
- Adds a new on-disk artifact category (
events/). Documented in localv2.md as part of Phase 2. - The schema is forward-evolving — new event variants land additively. Phase 9 imposes the full type discipline; Phase 2 ships the three load-bearing events plus the extension hook.
- Pattern decisions: Event sourcing (the future canonical primitive, pre-shaped). Tagged union for event variants (illegal-states-unrepresentable). Registry pattern is not applied here — events are emitted in-line by probes that care; there's no central event handler in Phase 2.
9. Multi-repo fixture portfolio + golden-file test infrastructure¶
- Purpose: Per the roadmap exit criterion, Phase 2 needs golden-file tests per probe AND a multi-repo fixture portfolio of 3–5 small repos. The performance lens requires this to be fast enough that CI doesn't degrade — golden-file diffs on heavy probe outputs (semgrep findings on a 200-file repo) can dominate CI time if naively implemented.
- Interface:
tests/fixtures/portfolio/{minimal-ts, native-modules, monorepo-pnpm, distroless-target, vuln-seeded}/— five fixture repos. Each is ~50-500 source files; each exercises a different Phase 2 probe surface.tests/golden/<probe>/<fixture-name>.json— committed expected output per (probe, fixture) pair.pytest --update-goldenregenerates; without the flag, diffs are CI failures.- Internal design:
- Heavy probes (SCIP, semgrep, runtime trace) run against the portfolio in a
pytest-xdist-parallel CI lane separated from the Phase 0 unit-test lane. The xdist budget is the one Phase 1synthexplicitly refused — Phase 2 ships it specifically for the portfolio runs because per-probe wall-clocks are now in the tens of seconds and serial CI is hostile. - The runtime-trace fixture (
distroless-target) uses a frozen Docker image (localhost:5000/cw-fixture-runtime:sha-<digest>) committed to a local registry container, started in CI as a sidecar. This eliminatesdocker buildtime from the fixture path (we already verifieddocker buildin unit tests forDockerfileProbe). - A deliberately-seeded stale-index fixture lives at
tests/fixtures/portfolio/stale-scip/. Its.codegenie/cache/is pre-populated with a SCIP index from a known prior commit; the repo HEAD has moved.IndexHealthProbeMUST detect this and emitIndexHealthDegradedevent. This is the roadmap exit criterion ("IndexHealthProbe surfaces at least one real staleness case in CI"), encoded as a CI assertion. - Tradeoffs accepted:
- CI walltime grows. Estimate: +3 minutes p50 on the portfolio lane. Acceptable — it runs in parallel with the unit lane, total wall-clock CI is dominated by the slower of the two.
- The frozen runtime-trace fixture image must be rebuilt occasionally as upstream base images age. Tracked as a Phase 2 maintenance task.
Data flow¶
A representative warm-path run on a real Node.js repo (~5k files, TypeScript, pnpm, GitHub Actions, Helm, built node:20-alpine image present in local registry) where src/payments/processor.ts changed since last gather:
- CLI + Phase 0/1 prelude (unchanged). PathIndex built (one walk). Layer A probes run;
language_detection,node_build_system,node_manifest,ci,deployment,test_inventoryall cache-hit exceptlanguage_detection(per-file walk fingerprint changed). Phase 1 wall-clock so far: ~150 ms. - Plugin loader resolves to
[universal--*--*](no concrete Phase 3 plugin yet). Probe-requirement union = Layer A (already done) + Layer B–G kernel probes from the universal plugin manifest. - Cost-tier coordinator dispatches in waves:
- Tier-0 wave (parallel up to cpu_count() = 8):
ConventionProbe,ExceptionProbe,ADRProbe,RepoConfigProbe,RepoNotesProbe,SkillsIndexProbeall cache-hit (none of their inputs changed).IndexHealthProbeis queued behind the probes itrequires; placeholder. Wall-clock added: ~20 ms (cache-hit retrievals + sub-schema validation). - Tier-1 wave (parallel up to 4):
SemgrepProbecache-hit (source-tree hash matches if only.tsfiles changed by content not count — semgrep's cache key is the Merkle root of source + rule-pack version; both unchanged on a single-file edit if rule packs haven't been bumped — wait: source changed, so semgrep MISSES). Actually:processor.tschanged → SemgrepProbe re-runs on the affected file set (~3 s on a 5k-file repo with incremental scope per the--baseline-refflag we configure with the git HEAD~1).SCIPIndexProbeMISSES — re-indexes the whole repo (~8 s).BuildGraphProbecache-hit.NodeReflectionProbecache-hit ifprocessor.tsdid not introduce new reflection patterns (per-file hash).GeneratedCodeProbecache-hit.TreeSitterImportGraphProbere-runs just forprocessor.ts(per-file projection update) — ~50 ms.AstGrepProbere-runs on affected files (~500 ms). Wall-clock added: ~8 s (SCIP dominates). - Tier-2 wave (parallel up to 2):
DockerfileProbecache-hit (Dockerfile unchanged). Image digest unchanged →SBOMProbe,CVEProbe,CertificateProbe,EntrypointProbe,ShellUsageProbeall cache-hit. Wall-clock added: ~50 ms (cache retrievals). - Tier-3 wave (serial, slot of 1):
RuntimeTraceProbecache-hit (image digest matches, scenarios unchanged).ExternalDocsProbecache-hit (source mtimes unchanged). Wall-clock added: ~10 ms. - Projection write fan-out (parallel, all
cost_tier = 0projection writers): scip.symbols.idxre-writes (SCIP re-indexed) — ~300 ms.import_graph.adj.jsonupdates theprocessor.tsrow + reverse edges — ~5 ms (single-file delta).dep_graph.adj.jsonunchanged (cache hit).test_exercises.adj.jsonupdates rows for tests transitively touchingprocessor.ts— ~10 ms.IndexHealthProberuns (requires-graph satisfied). Reads all sibling slices. SCIP confidence =Trusted(just re-indexed, all signals green). Runtime trace confidence =Trusted(image digest match). No events emitted. Wall-clock added: ~30 ms.- Output merge + schema validation (Phase 0 + Phase 1 sub-schemas + Phase 2 per-probe sub-schemas for B–G probes). ~100 ms.
- YAML write + raw artifact write + projection write (atomic
.tmp→os.replace, 0600 mode per Phase 0). ~50 ms. - Audit record (Phase 0). Per-probe execution path (
Ran/CacheHit/Skipped). - Exit 0. Total wall-clock: ~9 s, dominated by SCIP re-index. Without SCIP re-index (whitespace-only edit, SCIP cache-hits): ~600 ms.
Cold gather (first time on a 50k-LOC service with no built image): sequential floor is docker build (~47 s) + scip-typescript (~8-15 s) + RuntimeTraceProbe 5 scenarios (~80 s). Tier-1 probes (semgrep, ast-grep, tree-sitter graph) run in parallel against tier-2 (docker build) and tier-3 (runtime trace). Total: ~170 s vs. the localv2.md §3.2 baseline of 3-6 minutes. Meets the cold-gather goal.
Failure modes & recovery¶
| Failure | Detected by | Recovery |
|---|---|---|
scip-typescript exits non-zero (TypeScript compile error in repo) |
Probe run() catches subprocess exit code |
ProbeOutput(confidence="low", errors=[stderr_tail]); IndexHealthProbe reads this → SCIP confidence = Unavailable("indexer_failed"); IndexHealthDegraded event emitted; Phase 3+ adapter dispatch falls back per ADR-0032 |
scip-typescript exceeds timeout_seconds=300 (huge monorepo) |
Phase 0 coordinator asyncio.wait_for |
Cancel + SIGKILL at 1.5× timeout; SCIP slice empty; Phase 3+ Bundle Builder degrades to tree-sitter-only per ADR-0032 declared fallback |
| Image digest cache-key miss with no local image present | DockerfileProbe resolves the expected digest, finds none |
Triggers docker build; takes 47 s; subsequent probes cache-key on the new digest |
docker build fails (Dockerfile syntax, base-image pull failure) |
Subprocess exit code | Tier-2 probes that depend on the built image emit confidence="unavailable", errors=["image_build_failed: <stderr_tail>"]; gather completes with degraded tier-2 |
strace exec fails (macOS, missing binary) |
exec.run_allowlisted raises |
RuntimeTraceProbe slice marked confidence="low", scenarios skipped; IndexHealthProbe flags runtime_trace as Unavailable("strace_missing"); gather still succeeds — localv2.md §6 already documents this macOS case |
gitleaks finds a secret in the analyzed repo's source |
Probe parses gitleaks JSON output | Phase 0 OutputSanitizer field-name regex catches it as a secondary defense; PathSpec-based scrubbing strips the secret value from the probe output; the finding (file:line, secret-type) survives; the secret value never reaches repo-context.yaml |
Plugin manifest malformed (plugin.yaml missing required field) |
Pydantic validation at loader.discover() time | Loader fails-fast at CLI startup with diagnostic naming file + field per ADR-0031; gather refuses to start (loud) |
Plugin import path unresolvable (contributes.adapters points at missing module) |
Phase 2 ships no adapters in the universal plugin so this is a forward concern; loader still validates import paths at startup | Fails-fast at startup per ADR-0031 |
IndexHealthProbe cache pollution (somebody runs with --no-cache then back to normal) |
Probe cache_strategy = "none" enforces re-run every gather |
No staleness possible; probe is by construction always-fresh |
| Stale-SCIP fixture in CI (deliberate seeded staleness) | IndexHealthProbe confidence computation flags commits_behind > 0, fails the image_digest_match style assertion |
IndexHealthDegraded event emitted; CI test asserts event presence + correct reason; build passes only if the probe caught the staleness |
| Tier-2 image-digest cache invalidated by orphaned cache entries (image pruned but cache entry persists) | DockerfileProbe resolves expected digest, attempts to read from cache — cache entry references an image no longer in registry |
Cache entry expired silently (Phase 0 store handles missing referenced artifacts as cache-miss); tier-2 probes re-run with rebuilt image |
| Projection write fails (disk full mid-write) | writer.atomic_replace exception |
Probe completes; projection rewrite fails; downstream adapter sees an old projection and emits low confidence; loud via IndexHealthProbe |
| Tier-3 strace produces 500 MB trace file (long-running scenario) | Probe hard caps per-scenario trace at 100 MB via --limit-bytes-equivalent |
Scenario output truncated; scenario_truncated: true field set; confidence drops to medium |
| External docs fetch hangs (Confluence timeout) | Per-source httpx.Client(timeout=30) |
Source marked fetch_failures: [...]; index probe runs over the docs that succeeded; gather continues |
Concurrent gather race (two codegenie gather invocations against same repo) |
Phase 0 advisory lock at .codegenie/cache/.lock |
Second invocation either waits or fails fast (configurable); image-digest cache races resolved by the docker pull/docker build daemon-level serialization |
The pattern: loud degradation at every probe boundary, never silent omission. The cache layer never lies about its contents; IndexHealthProbe never reports Trusted for a stale signal; the event stream captures every degradation for Phase 9/13 to project.
Resource & cost profile¶
- Tokens per run: 0. Phase 0
fencejob continues to assert. Thegatherextras list grows bymsgpack,scip-python(parser-only),tantivy,pyarn-or-built-in-fallback,tree-sitter-pythonbindings,gitleaks-python,httpx. Zero LLM SDKs. - Wall-clock targets met on a 50k-LOC TypeScript service:
- Cold (no built image, no caches): ~170 s. Floor =
docker build(47 s) +scip-typescript(8-15 s) +RuntimeTraceProbe(80 s) running in parallel across tiers. - Cold (image present in local registry): ~95 s. Tier-2 collapses to tier-1 in concurrency terms.
- Warm (all caches valid): ~0.6 s. Dominated by cache-hit retrievals + schema validation + YAML write + projection-file existence checks.
- Incremental (single
.tschange): ~9 s. SCIP re-index dominates; everything else cache-hits or runs in parallel. - Memory (RSS):
- Cold gather peak: ~800 MB.
scip-typescriptsubprocess ~400 MB;semgrepsubprocess ~200 MB; coordinator + Python ~150 MB. - Warm gather peak: ~250 MB. Most of the cold-gather subprocesses never start.
- Idle (Phase 14 long-lived worker): ~120 MB.
- Storage per gather:
repo-context.yaml~60 KB (Phase 1's 30 KB + Phase 2 slices).raw/~8 MB (SCIP binary 2-3 MB; SBOM 1-2 MB; runtime traces 4-5 MB summed across 5 scenarios).projections/~5 MB (graph adjacencies + symbol index).cache/blobs/grows by ~10 MB per cold gather; per-blob, ~50-500 KB.events/~5 KB per gather (10 events × 500 bytes).- Total per cold gather: ~25 MB on disk. Per warm gather: ~70 KB.
- CI walltime delta vs. Phase 1: +5 minutes on the portfolio lane (running in parallel via
pytest-xdist); +15 s on the unit lane. Total CI walltime grows ~3 minutes if both lanes are well-balanced. - External-dep additions:
msgpack,scip-python(parser-only),tantivy,tree-sitter-pythonbindings,gitleaks-python,httpx. Each one ratified by ADR; each one tracked by Dependabot per Phase 0 §2.5. No C-extension churn beyond what Phase 0 already accepted (blake3,pyyamlwithCSafeLoader). - External CLI runtime additions:
semgrep,syft,grype,gitleaks,scip-typescript,strace(Linux),dive(optional). All checked at CLI startup per Phase 0's tool-readiness gate.
Test plan¶
The test pyramid widens at the integration tier (portfolio fixtures) without losing the unit-test floor.
Unit tests (tests/unit/probes/)¶
| Test module | Asserts |
|---|---|
test_scip_index_probe.py |
scip-typescript invocation; output projection format; per-symbol lookup correctness; cache-key sensitivity to tool-version stamp; timeout → low confidence |
test_index_health_probe.py |
Per-source-of-truth assertions (SCIP commit match, runtime-trace image-digest match, semgrep coverage); confidence-tier transitions; event emission on Degraded and Unavailable; cache_strategy = "none" enforced |
test_dependency_graph_probes.py |
Forward + reverse adjacency correctness; monorepo vs. single-package handling; projection format roundtrip |
test_tree_sitter_import_graph_probe.py |
Per-file extraction correctness; internal thread-pool parallelism; projection adjacency invariants (forward[a] contains b ↔ reverse[b] contains a) |
test_sbom_probe.py, test_cve_probe.py |
syft and grype invocation; cross-validation between grype and trivy; image-digest cache-key correctness; absence of image → low confidence |
test_runtime_trace_probe.py |
Per-scenario serial execution; per-scenario timeout; trace-file size cap; cert_paths_read/shared_libs_loaded parsing; macOS-fallback path |
test_semgrep_probe.py |
Rule-pack version stamp; finding-shape correctness; incremental --baseline-ref cache-key derivation |
test_skills_index_probe.py, test_convention_probe.py, test_exception_probe.py |
Frontmatter parsing; multi-source merging; pre-projection write |
test_external_docs_probes.py |
Filesystem + URL list + Confluence (mocked); BM25 index roundtrip; per-source fetch failure isolation |
test_cost_tier_coordinator.py |
Per-tier semaphore correctness; cross-tier non-blocking; default tier=0 backward-compat; misclassification CI lint |
test_plugin_loader.py |
Manifest Pydantic validation; fail-fast on missing field; universal-fallback resolution; extends chain walk |
test_events_writer.py |
Event variant Pydantic discrimination; JSONL append correctness; UUIDv7 uniqueness; forward-projectable to Phase 9 schema |
test_projection_formats.py |
All four projection files have well-defined formats; adapter Protocol implementations (Phase 3 will read these) get the shape they expect |
Golden-file tests (tests/golden/<probe>/<fixture>.json)¶
One per (probe, fixture) pair in the portfolio. CI diffs live output vs. committed expected; pytest --update-golden regenerates. Adversarial fixtures (deliberately malformed source files, oversized inputs) live at tests/golden/_adversarial/.
Integration tests (tests/integration/portfolio/)¶
| Test module | Asserts |
|---|---|
test_portfolio_minimal_ts.py |
Full gather against minimal-ts fixture; all probes complete; IndexHealthProbe reports Trusted across the board; total wall-clock < 30 s in CI |
test_portfolio_native_modules.py |
Native module catalog hits (sharp + bcrypt); RuntimeTraceProbe enumerates expected shared libs; SBOM includes libvips |
test_portfolio_monorepo_pnpm.py |
BuildGraphProbe produces correct module graph; projections are per-monorepo |
test_portfolio_distroless_target.py |
Pre-built image fixture; all tier-2 probes hit the digest cache after first run; cold→warm ratio matches goal |
test_portfolio_vuln_seeded.py |
Seeded CVE in lockfile; CVEProbe finds it; grype/trivy cross-validation triggers |
Adversarial / fail-loud tests (tests/adv/)¶
| Test module | Asserts |
|---|---|
test_stale_scip_fixture.py |
The deliberately-seeded stale-scip fixture is detected by IndexHealthProbe; IndexHealthDegraded event emitted; build FAILS if probe doesn't catch it (the roadmap exit criterion) |
test_huge_repo_timeout.py |
200k-file fixture triggers Phase 1's --max-files refusal; gather exits with clear error |
test_malformed_plugin_manifest.py |
Bad plugin.yaml fails CLI startup loudly; gather refuses to run |
test_image_digest_drift.py |
Mutating the built image between gathers correctly invalidates all tier-2 caches |
test_secret_in_source.py |
gitleaks finds seeded secret; OutputSanitizer scrubs value; finding metadata survives |
test_projection_corruption.py |
Truncated scip.symbols.idx triggers loud failure on first read; never silent-degrades the adapter |
test_concurrent_gather_race.py |
Two concurrent gathers don't corrupt cache or projections; advisory lock works |
Performance canary tests (tests/bench/)¶
Advisory, not gating (per Phase 0 §3.2 — bench is surfaced, not enforced).
| Test | Tracks |
|---|---|
bench_warm_gather_walltime.py |
Warm gather p50/p95 against the portfolio; flags regressions > 50% |
bench_cold_gather_walltime.py |
Cold gather p50/p95 (image pre-pulled to isolate from network) |
bench_per_tier_concurrency.py |
Verifies tier-0 probes finish during long-tail of tier-1/2/3 |
bench_projection_adapter_latency.py |
Phase 3 forward-looking: simulates ADR-0032 adapter calls against gathered projections; verifies p95 < 50 ms target |
Design patterns applied¶
| Decision | Pattern applied | Why here | Pattern NOT applied (and why) |
|---|---|---|---|
| Cost-tier coordinator | Plugin architecture (probes self-classify tier as data); composition over inheritance (tiers are a field, not a class hierarchy) | Tiers are data; the coordinator knows about tiers but not about probes; new probes pick their tier in the registration | Strategy pattern — there is no algorithm to swap; the tier is just a semaphore selector |
| Plugin loader | Plugin architecture; Registry pattern; Dependency inversion (kernel depends on Probe Protocol, never on plugin-specific classes); Hexagonal (the loader is a Port; YAML-on-disk is one Adapter, Phase 14 webhook will be another) |
ADR-0031 explicitly requires the kernel never import plugins by name; the loader is the only allowed coupling point | Inheritance for probe contributions — probes are registered via the manifest's contributes.probes list, not extended from a plugin base class |
| IndexHealthProbe + AdapterConfidence sum type | Tagged union for state (Trusted | Degraded | Unavailable); make illegal states unrepresentable (Trusted cannot carry a reason); Specification pattern for the per-source-of-truth freshness checks; Open/Closed (new health checks register, never edit IndexHealthProbe) |
ADR-0033 discipline applied to a state machine that's load-bearing for ADR-0032 adapter dispatch; the worst failure mode (silent staleness) is exactly the thing booleans + None lets through | Pattern-matching exhaustiveness via match is enforced; assert_never on the AdapterConfidence handler |
| Tier-2 image-digest cache-key strategy | Strategy pattern at the cache-key derivation level (probes override cache_key); Adapter pattern wraps docker/syft/grype output to a stable hash |
The probes consume image-digest-keyed evidence, not source-tree-keyed evidence; using the source-tree key would force re-runs that produce identical output | Premature pluggability for cache backends — the Phase 0 filesystem store is sufficient; the strategy is on the key derivation, not the store |
TCCM projection format (projections/*.json, scip.symbols.idx) |
Functional core, imperative shell (projection is a pure fold over probe outputs; the writer is the shell); Adapter pattern (projection format IS the ADR-0032 adapter contract) | The projection's purpose is to make ADR-0030's graph-aware queries O(lookup); doing it as a pure projection lets Phase 3's adapters be thin loaders, not graph-walking machines | Event sourcing for projections — projections are derived state, not the source of truth; rebuilding is cheap |
| Events writer | Event sourcing (pre-shape for Phase 9 ADR-0034); Tagged union for PhaseEvent; Registry pattern deliberately not applied (events are emitted in-line by probes) |
Pre-shaping the canonical Postgres event log lets Phase 9 be a projection migration, not a forensic data archaeology project | Pattern-matching dispatch — Phase 2 doesn't consume events; it only writes them. Phase 9 owns the consumer |
Risks (top 3-5)¶
-
scip-typescriptcold-gather wall-clock is a hard floor. Our 60 s cold-without-build target sits right on the SCIP indexer's typical 8-15 s runtime for a 50k-LOC repo; for a 500k-LOC monorepo it's 60-90 s alone. Incremental SCIP is unreleased upstream at the time of writing. Mitigation: the per-repo Merkle-root cache key means we only pay the cold cost once per.tschange at the repo level; portfolio-scale economics still work. Ifscip-typescriptperformance regresses, the projection format insulates us — we can swap the indexer (scip-typescript-next,sourcegraph-scipdirect, etc.) without touching ADR-0032 adapters. -
Image-digest cache-key strategy is the one Phase 2 deviation from Phase 0/1 cache discipline. A bug in image-digest resolution silently fails-quiet (cache-hits when it should miss). Mitigation: the deviation is ADR-gated, the digest-resolution helper is in one place, and there's an adversarial test (
test_image_digest_drift.py) that mutates the image between gathers. Phase 14 will revisit the multi-actor cache story per Phase 0 §2.7's commitment. -
Projection format is now a second on-disk schema (the first being
repo-context.yaml). Schema evolution discipline applies to both. Mitigation: the projection format is versioned per-file (_provenance.jsoncarriesformat_version); ADR-0032 adapters check the version on load and refuse mismatched formats loudly. Adding a new field is additive; renaming requires an ADR amendment + migration test. -
Plugin loader ships in Phase 2 before its first concrete consumer (Phase 3 plugin). YAGNI charge: we're shipping infrastructure on speculation. Mitigation: the universal fallback plugin is not speculative — it's the actual mechanism by which kernel-resident B–G probes register, per ADR-0031's "no-match fallback" requirement. The shipped surface is small (~300 LOC for loader + manifest models). The risk is over-design on the manifest schema — refused by keeping the Phase 2 manifest to the four required fields (
name,version,scope,contributes) and one optional (extends). -
Tier-3 serialization (RuntimeTraceProbe) is the cold-gather wall-clock dominator. Five scenarios × 16-20 s each = 80-100 s, and we cannot parallelize without contaminating traces. Mitigation: the scenario set is configurable per-repo (
runtime_trace.scenarios:in.codegenie/config.yaml); a repo that doesn't neederror_pathskips it. Image-digest caching means in steady state most workflows hit cache and pay zero. Future optimization (Phase 14+): pre-warm the trace cache as part of the continuous-gather Dispatcher's nightly cron, not in the workflow critical path.
Acknowledged blind spots¶
- Cross-language gather is out of scope. Phase 1 designed for Node.js; Phase 2 extends with language-agnostic probes (semgrep, gitleaks, conventions, skills) but the SCIP indexer + tree-sitter grammars shipped in Phase 2 are TypeScript-only. Adding Java/Python is the Phase 3+ language-plugin path per ADR-0031.
- No Layer F. Per localv2.md §5.7, Layer F (vuln-specific evidence — CodeQL, taint flow) is Phase 3+ scope. Our schema accommodates it, our cost tiers anticipate the addition, our event log is shape-compatible — but we don't ship the probes.
- Concurrent gather discipline against the same repo is delegated to Phase 0's advisory lock + the docker daemon's natural serialization of builds. We do not design a distributed-coordinator story for Phase 14's webhook-driven concurrent gathers — that's a Phase 14 problem.
- Cache eviction policy. Phase 0/1's cache grows unboundedly. Phase 2 doesn't tackle this either. At ~25 MB per cold gather and a normal portfolio cadence of one cold gather per repo per quarter, the disk pressure is sub-1 GB per repo per year — out of scope for our timeline but tracked.
- MCP serving of
RepoContextand projections. Production design.md §8 describes MCP; Phase 2 writes files on disk. Phase 8's MCP server reads our files. We do not pre-design the MCP API — we trust the file format to be the contract. - Cross-validation between probes that should agree.
IndexHealthProbereads each probe's slice independently; if two probes contradict each other (e.g.,SemgrepProbesays a.tsfile exists thatSCIPIndexProbedoes not see), we currently surface this as two confidence values, not as a contradiction event. Phase 9's event projection can fix this; we do not.
Open questions for the synthesizer¶
- Should the universal-fallback plugin manifest be hand-edited or generated? I lean hand-edited (~50 lines, reviewed-as-data). The best-practices lens may want a generator from probe registrations; the security lens may want the manifest signed. Pick one.
- Is
msgpackfor the SCIP symbol-index projection the right choice vs. a small custom binary format? Performance argument: msgpack is ~5× faster than JSON and ships as a maintained C extension. Best-practices may want JSON for diffability; security may flag any new C extension. I picked msgpack; defend or replace. - The image-digest cache-key deviation from Phase 0's source-tree discipline is one ADR away from being a separate cache substore. I kept it in the unified Phase 0 cache for simplicity. The synthesizer may want a separate substore at
cache/by-image-digest/for clarity; I'd accept it if the API surface stays the same. pytest-xdistfor the portfolio test lane reverses Phase 1'ssynthdecision. Phase 1 vetoed xdist for the unit lane; Phase 2 needs it for the portfolio lane (heavy probes). The synthesizer should confirm whether this is a reversal of Phase 1 §2.2 (pytest-xdist: NOT enabled) or an additive scope-limited adoption. I read it as additive; the synthesizer rules.- Should
IndexHealthProbe's event emission be synchronous (blocks gather completion) or fire-and-forget? I designed it synchronous (loud-not-quiet). The performance argument for fire-and-forget is marginal (~5 ms saved); the risk is silent event loss. I picked synchronous; the synthesizer can override if Phase 14 telemetry pressure makes async necessary. TreeSitterImportGraphProberuns its own internal thread pool inside a tier-1 slot. This is a second concurrency layer the coordinator doesn't see. The synthesizer should rule on whether this is acceptable (it works because tree-sitter's C extension releases the GIL) or whether it's a hidden coupling that should be explicit in the coordinator's accounting.- Phase 2 ships the event-stream skeleton (
.codegenie/events/) without the Phase 9 Postgres backend. Three events are defined. The synthesizer should rule whether to ship more events now (more upfront design, lower Phase 9 risk) or fewer (less upfront commitment, higher Phase 9 migration cost). I picked three load-bearing ones; defend or expand.