Phase 02 — Context gathering — Layers B–G: Architecture¶
Status: Architecture spec
Date: 2026-05-14
Inputs: final-design.md (synthesized design) · critique.md · design-performance.md · design-security.md · design-best-practices.md · docs/production/design.md · docs/roadmap.md Phase 2 / Phase 3 · docs/localv2.md §4–5 · Phase 0/1 final designs · ADR-0005, 0006, 0007, 0008, 0012, 0029, 0030, 0031, 0032, 0033, 0034
Audience: the engineer implementing this phase
Amendment note (2026-05-17 — 02-ADR-0011): tree-sitter grammar delivery moved from vendored
.so/.dylibfiles +tools/grammars.lockBLAKE3 pins (02-ADR-0002 model, described throughout this doc) to PyPI wheels (tree-sitter-typescript,tree-sitter-javascript, futuretree-sitter-python,tree-sitter-java). The kernel surface iscodegenie.grammars.lock.language_for(name) -> tree_sitter.Language; supply-chain pinning ispip --require-hashesat the wheel boundary. Sections describingtools/grammars/,tools/grammars.lock,load_and_verify, and BLAKE3-of-binary verification record the design-time decision and remain in place for historical context; the current implementation is the PyPI-wheel kernel documented in 02-ADR-0011 and exercised bytests/unit/grammars/test_lock.py. The named-trigger C-extension policy (Phase 1 ADR-0009 admitspy-tree-sitteras the one exception) carries forward unchanged.
Executive summary¶
Phase 2 lands the rest of the kernel-side gather pipeline on top of Phase 0/1's frozen contract surface: every Layer B–G probe that localv2.md §5.2–5.6 names as language-agnostic, plus four kernel scaffolds Phase 3 will consume on day 1 (IndexFreshness sum type, TCCMLoader, SkillsLoader, ADR-0032 adapter Protocols). The load-bearing citizen is IndexHealthProbe (B2): the roadmap's exit criterion is a deliberately-seeded stale-scip fixture that B2 must catch in CI — encoded as tests/adv/phase02/test_stale_scip_fixture.py, build FAILS otherwise. The synthesis explicitly rejected the Plugin Loader (Phase 3 owns it per ADR-0031 §Consequences §1), every Probe ABC contract change (registry-side @register_probe(heaviness=…, runs_last=…) annotations carry the scheduling concern instead), the cryptographic-anchor freshness ceremony (defends a non-threat against an attacker who already owns $HOME), per-repo encryption-key theatre for secret findings (key and ciphertext in the same trust tier — Phase 2 persists zero plaintext), and the pytest-xdist reversal of Phase 0's veto. The result is a smaller Phase 2 than any of the three input lenses proposed: kernel probes + one tagged-union IndexFreshness + four adapter Protocols (documentation as code) + TCCMLoader skeleton + writer-chokepoint secret redaction + one run_external_cli wrapper over Phase 0's run_allowlisted.
Goals¶
These are verifiable on Phase-2 exit. Every goal is testable by code that already exists or is named in this document.
- G1. Every Layer B–G probe in
localv2.md§5.2–5.6 that is language-agnostic ships with golden-file coverage against the 5-repo portfolio attests/fixtures/portfolio/. Verified by: golden tests undertests/golden/probes/, one file per probe per fixture. - G2.
IndexHealthProbe(B2) surfaces a real staleness case in CI against the deliberately-seededtests/fixtures/portfolio/stale-scip/fixture. Verified by:tests/adv/phase02/test_stale_scip_fixture.pyassertingIndexFreshness.Stale(reason=CommitsBehind(n>=1, last_indexed=<known prior commit>)). Build FAILS if the probe does not catch it. This is the roadmap exit criterion. - G3. Phase 0/1 frozen surfaces are unchanged. Verified by: existing Phase 0 contract-freeze snapshot test (
tests/unit/test_probe_contract.py) still passes; Phase 0'sforbidden-patternspre-commit catches edits toProbeABC,OutputSanitizer.scrubsignature,run_allowlistedsignature. - G4. Single name, single module, single Phase-2 consumer for the freshness sum type:
IndexFreshness = Fresh | Stale(reason: StaleReason)atsrc/codegenie/indices/freshness.py. Consumed bysrc/codegenie/report/confidence_section.pywithmatch+assert_never;mypy --warn-unreachableis a build error on a missed case in that module. - G5. Secret findings (
gitleaks,semgrep p/secrets, entropy fallback) are redacted at the writer chokepoint. Verified by:tests/adv/phase02/test_secret_in_source.pyasserting plaintext is present in zero persisted files (repo-context.yaml, everyraw/*.json, the cache blob, the audit anchor). - G6. One subprocess port for Layer B/G external CLIs:
codegenie.exec.run_external_cli, a wrapper around Phase 0run_allowlisted. Layer C (docker,strace) keeps usingrun_allowlisteddirectly with explicit hardening flags. No parallel chokepoint. - G7. Cost target: $0/run; tokens per gather: 0. Phase 0
fenceCI test continues to assert no LLM SDK import undersrc/codegenie/. - G8. Wall-clock targets (1k-file fixture): cold p50 ≤ 90 s, warm p50 ≤ 1.5 s, incremental (single .ts change) p50 ≤ 10 s. Verified by:
tests/bench/bench_portfolio_walltime.py(advisory; ≥ 50% regression flag, not gating). - G9. Kernel scaffolding is shipped — adapter
Protocols +TCCMLoader+SkillsLoader+IndexFreshness— but no Plugin Loader, noplugin.yamlparser, noplugins/directory in source tree. Phase 3 ships those together with the first plugin (ADR-0031 §Consequences §1). - G10. Nine new ADRs land alongside the code, each Nygard-style, recording an irreversible (or hard-to-reverse) Phase-2 commitment (see §"Path to production end state").
Non-goals¶
Each entry says why-not-in-scope so a later reader can find the deferred work.
- Plugin Loader, universal fallback plugin,
plugin.yamlparser,plugins/universal--*--*/directory. Why-not: ADR-0031 §Consequences §1 explicitly states "the first plugin doubles as the proof that the plugin loader works" — and the first plugin ships in Phase 3. Pulling the loader forward hollows Phase 3's exit criterion. - Adapter implementations (
DepGraphAdapter,ImportGraphAdapter,ScipAdapter,TestInventoryAdapter). Why-not: ADR-0032 places adapters atplugins/{slug}/adapters/*.pyinside the plugin. Phase 2 ships only theProtocoltyping surface; the first real adapter isplugins/vulnerability-remediation--node--npm/adapters/in Phase 3. - Bundle Builder, Hierarchical Planner, Supervisor. Why-not: Phase 8 (and 9) owns these. The
TCCMschema lands now so Phase 8 inherits a typed target. - Canonical event log,
.codegenie/events/JSONL, hash-chained audit. Why-not: ADR-0034 §Consequences §1 anchors the event log in Phase 9 (or 13). Phase 0'sruns/<utc-iso>-<short>.jsonaudit anchor is unchanged. - Cleartext secret persistence + per-repo encryption keys. Why-not: key + ciphertext in the same trust tier ($HOME) is obfuscation, not security (critic [S] finding #5). The Phase 5 microVM is the named escalation door if any downstream stage genuinely needs cleartext.
msgpackSCIP projection,scip-pythonreader,tantivydefault,gitleaks-pythonlibrary,gitpython. Why-not: each is rejected.msgpack/scip-pythoncreate format-coupling Phase 3 adapters would have to agree on;tantivyis opt-in withripgrep-via-run_allowlistedas the tested default;gitleaksandgitship as binaries throughrun_allowlisted. Phase 1 ADR-0009 (no new C-extension parser deps) is amended only forpy-tree-sitter(the one named trigger).pytest-xdist. Why-not: Phase 0 vetoed it 10/4 with a recorded rationale. Portfolio CI lane is small enough to run serial in ≤ 6 minutes; bench canary asserts.- Probe ABC changes (
cost_tier,capabilities: ProbeCapabilities). Why-not: scheduling data belongs to the registry annotation, not the contract.localv2.md§4 is frozen. mypy --warn-unreachablerepo-wide retrofit. Why-not: Phase 0/1 retrofit blast radius. Phase 2 enables--warn-unreachableper-module oncodegenie.{indices, probes/index_health.py, report, adapters, tccm}/**only.- Stage 7 Learning telemetry hooks, cost ledger, ROI dashboard. Why-not: Phases 9, 11, 13.
Architectural context¶
Phase 2 sits between Phase 1 (Layer A — Node manifests, build system, CI, deployment, test inventory) and Phase 3 (first plugin: vulnerability-remediation--node--npm). It is gather-layer only — no transforms, no LLMs, no autonomy gates. Its outputs feed two distinct downstream consumers: (a) the human-facing CONTEXT_REPORT.md (rendered each gather), and (b) Phase 3's first plugin (which imports the adapter Protocols, the IndexFreshness sum type, the TCCMLoader, and the run_external_cli port).
flowchart LR
P0["Phase 0<br/>Bullet tracer<br/>(ABC + Coordinator<br/>+ Cache + Sanitizer<br/>+ run_allowlisted)"]
P1["Phase 1<br/>Layer A probes<br/>(Node manifests,<br/>parsed_manifest memo,<br/>safe_yaml.load)"]
P2["Phase 2 (this)<br/>Layers B-G<br/>+ kernel scaffolding<br/>+ IndexFreshness<br/>+ SecretRedactor"]
P3["Phase 3<br/>First plugin +<br/>Plugin Loader +<br/>4 adapters +<br/>universal fallback"]
P5["Phase 5<br/>microVM sandbox<br/>(replaces direct docker;<br/>cleartext escalation door)"]
P9["Phase 9<br/>Canonical event log<br/>(projects Phase 0 audit)"]
P0 --> P1 --> P2 --> P3
P2 -. defers cleartext .-> P5
P2 -. emits slice metadata .-> P9
Phase 2's role in the broader 7-stage pipeline (production/design.md §3): it completes the continuous-deterministic-gather layer (commitment §2.6, ADR-0006) for everything that does not require a specific task class. Phase 3 begins introducing task-class-specific evidence on top of this kernel.
4+1 architectural views¶
Logical view — what components exist and how they relate¶
classDiagram
class Probe {
<<abstract, Phase 0 — frozen>>
+ProbeId name
+list~str~ applies_to_tasks
+list~str~ applies_to_languages
+list~ProbeId~ requires
+list~str~ declared_inputs
+Literal cache_strategy
+int timeout_seconds
+run(ProbeContext) ProbeOutput
}
class ProbeRegistry {
<<Phase 0 — extended via decorator kwargs>>
+register(probe, heaviness, runs_last)
+sorted_for_dispatch() list~ProbeRegEntry~
}
class Coordinator {
<<Phase 0 — sort-order edit only>>
+Semaphore single_budget
+dispatch(probes) dict~ProbeId, ProbeExecution~
}
class OutputSanitizer {
<<Phase 0 — extended by composition>>
+scrub(slice) SanitizedSlice
}
class SecretRedactor {
<<Phase 2 — new pass>>
+redact_secrets(slice, probe_name) tuple
}
class run_external_cli {
<<Phase 2 — new module function>>
+__call__(probe_name, argv, cwd, timeout, egress) ProcessResult
}
class IndexHealthProbe {
<<Phase 2 — load-bearing>>
+cache_strategy = "none"
+runs_last = True
+run(ctx) ProbeOutput
}
class IndexFreshness {
<<Phase 2 — sum type>>
Fresh | Stale(reason)
}
class StaleReason {
<<Phase 2 — sum type>>
CommitsBehind | DigestMismatch | CoverageGap | IndexerError
}
class RuntimeTraceProbe {
<<Phase 2 — Layer C>>
+heaviness = "heavy"
+run(ctx) ProbeOutput
}
class LayerGScanner {
<<Phase 2 — Semgrep|Syft|Grype|Gitleaks>>
+run(ctx) ProbeOutput
}
class DepGraphProbe {
<<Phase 2 — Layer B>>
+run(ctx) ProbeOutput
}
class DepGraphStrategy {
<<Phase 2 — registry>>
+build_for_ecosystem(PackageManager, ctx)
}
class AdapterProtocols {
<<Phase 2 — kernel scaffolding>>
DepGraphAdapter
ImportGraphAdapter
ScipAdapter
TestInventoryAdapter
}
class TCCMLoader {
<<Phase 2 — kernel scaffolding>>
+load(path) Result~TCCM, TCCMLoadError~
}
class SkillsLoader {
<<Phase 2 — kernel scaffolding>>
+load_all(search_paths) Result~list~Skill~~
}
class ConfidenceSectionRenderer {
<<Phase 2 — only consumer of IndexFreshness in Phase 2>>
+render(slices) str
}
Probe <|.. IndexHealthProbe
Probe <|.. RuntimeTraceProbe
Probe <|.. LayerGScanner
Probe <|.. DepGraphProbe
ProbeRegistry o-- Probe
Coordinator --> ProbeRegistry
Coordinator --> OutputSanitizer
OutputSanitizer ..> SecretRedactor : composes
RuntimeTraceProbe --> run_external_cli
LayerGScanner --> run_external_cli
IndexHealthProbe --> IndexFreshness : returns typed
IndexFreshness ..> StaleReason : Stale carries
DepGraphProbe --> DepGraphStrategy : dispatches via
ConfidenceSectionRenderer --> IndexFreshness : pattern-matches
TCCMLoader ..> AdapterProtocols : Phase 3 will wire
Reading guide. The kernel (Probe, ProbeRegistry, Coordinator, OutputSanitizer, Cache — all Phase 0) is unchanged in interface. Phase 2 extends by composition: SecretRedactor is a new pass added to the sanitizer pipeline; run_external_cli is a new function over run_allowlisted; registry annotations (heaviness, runs_last) are decorator kwargs that the coordinator reads when sorting. New types (IndexFreshness, StaleReason, four adapter Protocols, TCCM, Skill) live in their own packages and are imported, not inherited from kernel ABCs.
Process view — runtime behavior¶
Representative incremental gather on a 5k-file TypeScript repo where src/payments/processor.ts changed since last gather, image rebuilt with same FROM-digest, .git/ unchanged.
sequenceDiagram
autonumber
participant CLI as codegenie gather
participant TR as Tool-readiness
participant CO as Coordinator
participant SCIP as SCIPIndexProbe (heavy)
participant RT as RuntimeTraceProbe (heavy)
participant SEM as SemgrepProbe (medium)
participant LIGHT as Light probes (B/D/E)
participant B2 as IndexHealthProbe (runs_last)
participant SAN as OutputSanitizer
participant SR as SecretRedactor
participant WR as Writer + Cache
participant CR as ConfidenceSection Renderer
CLI->>TR: check semgrep/syft/grype/gitleaks/scip-typescript/docker/strace on PATH
TR-->>CLI: ok (or typed MissingToolError + install hint)
CLI->>CO: dispatch Phase 1 + Phase 2 probes
Note over CO: sorted by registry heaviness annotation;<br/>single Semaphore(min(cpu_count(), 8))
par heavy probes start first
CO->>SCIP: run() — MISS (.ts changed); ~8s
CO->>RT: run() — HIT (image-digest unchanged); ~0.1s
and medium then light fill slots
CO->>SEM: run() — incremental MISS; ~3s
CO->>LIGHT: run() — mix of HIT/MISS
end
Note over B2: dispatched LAST because runs_last=True
CO->>B2: run() — reads sibling slices
B2->>B2: construct IndexFreshness for each index
B2-->>CO: ProbeOutput with typed freshness
CO->>SAN: scrub(merged_envelope)
SAN->>SR: redact_secrets(slice, probe_name)
SR-->>SAN: (redacted_slice, in-memory findings)
SAN-->>WR: SanitizedSlice
WR->>WR: atomic .tmp → os.replace<br/>(repo-context.yaml + raw/*)
CR->>CR: render Confidence section<br/>(match on every IndexFreshness)
CR-->>WR: CONTEXT_REPORT.md
WR-->>CLI: exit 0 (~10s total)
Three load-bearing properties this diagram encodes:
IndexHealthProbedispatches after every sibling probe — enforced by theruns_last=Trueregistry annotation, not by topologicalrequires=[every_probe]hack (which the performance lens proposed and the synthesis rejected).- Sanitizer + Redactor + Writer form a single chokepoint. There is no second pathway from
ProbeOutputto disk. TheSecretRedactoris composition inside the existing Phase 0 sanitizer, not a parallel pipeline. - No
pytest-xdist, no internalThreadPoolExecutorinside any probe. The Phase 0 singleSemaphore(min(cpu_count(), 8))is the only concurrency budget; heavy probes simply start first under it.
Development view — package tree¶
graph TD
subgraph CG["src/codegenie/ — Phase 0/1 frozen + Phase 2 additions"]
direction TB
CLI["cli.py — Phase 0; tool-readiness extended"]
EXEC["exec.py — Phase 0 run_allowlisted +<br/>NEW run_external_cli"]
COORD["coordinator.py — Phase 0 + sort-order edit<br/>(reads heaviness from registry)"]
CACHE["cache.py — Phase 0 (declared_inputs unchanged)"]
OUT["output/sanitizer.py — Phase 0 + SecretRedactor pass<br/>output/writer.py — Phase 0 unchanged"]
subgraph PROBES["probes/ — Phase 0 +1 +2"]
P1["layer_a/* — Phase 1"]
P2B["layer_b/<br/>scip_index.py, index_health.py,<br/>dep_graph.py, tree_sitter_import_graph.py,<br/>generated_code.py, node_reflection.py,<br/>semantic_index_meta.py"]
P2C["layer_c/<br/>dockerfile.py, sbom.py, cve.py,<br/>runtime_trace.py, certificate.py,<br/>entrypoint.py, shell_usage.py"]
P2D["layer_d/<br/>skills_index.py, conventions.py, adrs.py,<br/>policy.py, exceptions.py, repo_notes.py,<br/>repo_config.py, external_docs.py"]
P2E["layer_e/<br/>ownership.py, service_topology_stub.py,<br/>slo_stub.py"]
P2G["layer_g/<br/>semgrep.py, ast_grep.py,<br/>ripgrep_curated.py, gitleaks.py,<br/>test_coverage_mapping.py"]
end
subgraph KERNEL["NEW Phase 2 kernel packages"]
INDICES["indices/<br/>freshness.py — IndexFreshness sum type"]
ADAPT["adapters/<br/>protocols.py — 4 Protocols (zero impls)<br/>confidence.py — AdapterConfidence placeholder"]
TCCM["tccm/<br/>loader.py, model.py, queries.py"]
SKILLS["skills/<br/>loader.py, model.py"]
CONV["conventions/<br/>catalog.py, model.py"]
DEP["depgraph/<br/>builder.py, model.py, registry.py"]
REPORT["report/<br/>confidence_section.py"]
end
PARSERS["parsers/safe_yaml.py — Phase 1<br/>(reused by SkillsLoader)"]
end
subgraph TESTS["tests/"]
UNIT["unit/{probes,indices,adapters,tccm,skills,conventions,depgraph,report}/"]
INT["integration/{portfolio,tccm}/"]
GOLDEN["golden/probes/<probe>/<fixture>.json"]
PROP["property/"]
ADV["adv/phase02/<br/>test_stale_scip_fixture.py (LOAD-BEARING)"]
FIX["fixtures/portfolio/<br/>minimal-ts/, native-modules/,<br/>monorepo-pnpm/, distroless-target/,<br/>stale-scip/ (LOAD-BEARING)"]
BENCH["bench/bench_portfolio_walltime.py (advisory)"]
end
subgraph DOCS["docs/phases/02-context-gather-layers-b-g/"]
ADRs["ADRs/0001..0009 (9 new ADRs)"]
REF["_reference-tccm/tccm.yaml<br/>(documentation, NOT a plugin)"]
end
CLI --> COORD
COORD --> PROBES
PROBES --> EXEC
PROBES --> OUT
PROBES --> INDICES
PROBES --> DEP
P2D --> SKILLS
P2D --> CONV
P2D --> TCCM
REPORT --> INDICES
Reading guide. The Phase 0/1 layout is preserved. Six new top-level packages (indices, adapters, tccm, skills, conventions, depgraph) + one render package (report) land — fewer than the best-practices lens proposed (seven, plus generic ambiguity), because runtime/ and security/ from that lens are folded into probes/layer_c/ and probes/layer_g/ respectively. The probe-by-layer subdirectory pattern (probes/layer_a/, probes/layer_b/, …) follows production-design §8.3's "organize by architectural layer, not by feature."
Physical view — deployment topology (developer / CI)¶
graph LR
subgraph DEV["Developer host (macOS or Linux)"]
direction TB
DEVPATH["$PATH must contain:<br/>git, node, npm/pnpm/yarn/bun,<br/>semgrep, syft, grype, gitleaks,<br/>scip-typescript, tree-sitter,<br/>docker (optional), strace (Linux only)"]
REPO["~/work/some-node-repo/<br/>(analyzed repo)"]
OUT_DEV[".codegenie/ inside analyzed repo<br/>0700 dir, 0600 files<br/>context/raw/*.json (redacted)<br/>context/repo-context.yaml<br/>cache/blobs/<br/>runs/<utc-iso>-<short>.json"]
HOME["~/.codegenie/<br/>skills/, conventions/, config/,<br/>tool_cache/<br/>(0700/0600)"]
end
subgraph CI["GitHub Actions (ubuntu-24.04, canonical)"]
direction TB
CIRUN["Job: gather-portfolio<br/>(serial; no xdist; ≤6 min)"]
CIIMG["actions image:<br/>preinstalled binaries above<br/>+ bubblewrap (best-effort hardening)"]
CIFIX["tests/fixtures/portfolio/<br/>(regenerated each run;<br/>NO committed .codegenie/cache/)"]
CIADV["Job: adv-phase02<br/>(stale-scip fixture is BUILD-GATING)"]
end
DEVPATH -. invoked by .-> REPO
REPO --> OUT_DEV
HOME -. read by .-> OUT_DEV
CIIMG --> CIRUN
CIFIX --> CIRUN
CIFIX --> CIADV
Reading guide. Phase 2 is still a local CLI (no service yet, per production/design.md §6 POC-to-service). The analyzed repo's .codegenie/ is the only writable workspace; ~/.codegenie/ is read-mostly (skills, conventions, config) at 0700/0600 (Phase 0 ADR-0011). CI runs on ubuntu-24.04 (the only image guaranteed to ship bwrap); macOS dev hosts skip the bubblewrap wrap with a single-line startup warning. Layer C (docker, strace) runs the analyzed-repo container under --network=none --cap-drop=ALL --security-opt=no-new-privileges; ADR-0012 (microVM, Phase 5) replaces those direct docker invocations later.
Scenarios (+1)¶
Scenario 1: Cold gather on a 50k-LOC TypeScript service (no built image)¶
sequenceDiagram
participant U as Operator
participant CLI as codegenie gather
participant CO as Coordinator
participant SCIP as SCIPIndexProbe
participant RT as RuntimeTraceProbe
participant SY as SyftProbe
participant GR as GrypeProbe
participant SE as SemgrepProbe
participant GL as GitleaksProbe
participant LIGHT as light probes
participant B2 as IndexHealthProbe
participant SR as SecretRedactor
participant WR as Writer
U->>CLI: codegenie gather ~/work/svc-payments
CLI->>CO: dispatch (heavy first)
CO->>SCIP: ~10s re-index whole repo
CO->>RT: docker build (~47s) + 5 trace scenarios sequential (~75s)
CO->>SY: syft on built image (~5s)
Note right of SY: requires RT image; serialized by registry "requires"
CO->>GR: grype on syft SBOM (~3s)
CO->>SE: semgrep --metrics=off (~8s)
CO->>GL: gitleaks on repo + .git history (~4s)
CO->>LIGHT: ConventionProbe, SkillsIndex, etc. (~2s aggregate)
CO->>B2: read sibling freshness; construct IndexFreshness per index
CO->>SR: redact_secrets — replaces any AKIA... with <REDACTED:fp=…>
SR->>WR: persist (atomic .tmp → os.replace)
WR-->>U: exit 0 (~110-140s; meets <=180s p95 cold target)
Scenario 2: Stale-SCIP fixture catches in CI (the load-bearing exit criterion)¶
sequenceDiagram
participant CI as GitHub Actions
participant FX as tests/fixtures/portfolio/stale-scip/
participant CLI as codegenie gather
participant SCIP as SCIPIndexProbe
participant B2 as IndexHealthProbe
participant T as test_stale_scip_fixture.py
CI->>FX: checkout fixture (HEAD has moved; cached SCIP from prior commit)
CI->>CLI: codegenie gather $fixture
CLI->>SCIP: cache lookup — HIT against prior commit's index
Note right of SCIP: cache HIT writes slice with<br/>last_indexed_commit = <prior>
CLI->>B2: runs_last; reads SCIP slice
B2->>B2: last_indexed_commit != current HEAD ⇒<br/>IndexFreshness.Stale(reason=CommitsBehind(n>=1, last_indexed=<prior>))
CLI-->>T: ProbeOutput emitted; CONTEXT_REPORT.md printed
T->>T: assert isinstance(slice.freshness, Stale)
T->>T: assert isinstance(slice.freshness.reason, CommitsBehind)
T->>T: assert slice.freshness.reason.n >= 1
T-->>CI: PASS — Phase 2 exit gate cleared
Note over CI: If B2 fails to construct Stale, test FAILS and Phase 2 build fails.
Scenario 3: Hostile YAML in a SKILL.md (adversarial)¶
sequenceDiagram
participant SK as SkillsLoader
participant OS as os.open(O_NOFOLLOW)
participant YAML as parsers.safe_yaml.load
participant CR as Coordinator
participant CLI as CLI summary
SK->>OS: open(~/.codegenie/skills/x/SKILL.md, O_NOFOLLOW|O_NOCTTY)
alt symlink detected
OS-->>SK: ELOOP
SK->>SK: SkillsLoadError(reason="symlink_refused")
SK-->>CR: load_all() returns Result.Err on this skill;<br/>other skills load
CR-->>CLI: print loud warning "skill x refused (symlink)"
else !!python/object payload
OS-->>SK: fd ok
SK->>YAML: safe_yaml.load(fd, size_cap, depth_cap)
YAML-->>SK: yaml.YAMLError (unknown tag)
SK->>SK: SkillsLoadError(reason="unsafe_yaml")
SK-->>CR: load_all() returns Result.Err on this skill;<br/>other skills load
CR-->>CLI: print loud warning "skill x refused (unsafe yaml)"
end
Scenario 4: Secret in .git/ history found by gitleaks¶
sequenceDiagram
participant GL as GitleaksProbe
participant SR as SecretRedactor
participant WR as Writer + Cache
participant AUD as Audit anchor
GL->>GL: run_external_cli("gitleaks", ["detect", "--source", ".", "--report-format=json", "--no-banner"])
GL-->>SR: ProbeOutput.schema_slice contains {finding: {match: "AKIAIOSFODNN7EXAMPLE", file: ".git/refs/...", line: 42}}
SR->>SR: pattern AKIA[0-9A-Z]{16} matches; replace with <REDACTED:fingerprint=BLAKE3_8>
SR->>SR: in-memory SecretFinding(probe="gitleaks", fingerprint, pattern="aws_access_key", file=".git/...", line=42) — NOT persisted
SR-->>WR: redacted slice
WR->>WR: write raw/gitleaks-findings.json (redacted)
WR->>WR: write repo-context.yaml (redacted)
WR->>AUD: per-probe Ran/CacheHit/Skipped (no plaintext)
Note over WR,AUD: Plaintext present in ZERO persisted files.<br/>CLI summary prints fingerprint count + file:line list.
Component design¶
1. IndexHealthProbe (B2 — the load-bearing one)¶
- Purpose. Detect and surface index staleness for every index Phase 2 produces (SCIP, runtime trace, SBOM, semgrep, conventions, skills). Silent staleness is the worst failure mode of the entire system (
production/design.md§2.3,CLAUDE.md). The probe is what makes commitment §2.3 real in Phase 2. - Public interface.
# src/codegenie/probes/layer_b/index_health.py @register_probe(runs_last=True) # heaviness defaults to "light" class IndexHealthProbe(Probe): name: ProbeId = ProbeId("index_health") layer: Literal["B"] = "B" tier: Literal["base"] = "base" applies_to_tasks: list[str] = ["*"] applies_to_languages: list[str] = ["*"] requires: list[ProbeId] = [] # reads sibling OUTPUTS via the # coordinator-provided slice map, # not via topological ordering declared_inputs: list[str] = [".codegenie/context/raw/*.json", ".git/HEAD", "<scip-index-output>", "<image-digest-token>"] cache_strategy: Literal["none"] = "none" # forbidden via mypy flag in this module timeout_seconds: int = 10 async def run(self, ctx: ProbeContext) -> ProbeOutput: ... - Internal structure. Reads (a) sibling probes' just-written
last_indexed_commit,files_indexed,files_in_repo,indexer_errors,last_traced_image_digest,built_image_digest,rule_pack_version; (b) currentgit rev-parse HEADviarun_allowlisted("git", "rev-parse", "HEAD"). For each source, constructs anIndexFreshnessvalue via a smart constructor incodegenie.indices.freshness. Pattern-matches on the result to derive the human-readableconfidence: "high" | "medium" | "low"forrepo-context.yaml. Forbidden by per-module pre-commit hook insrc/codegenie/probes/layer_b/index_health.py:os.path.getmtime,Path.stat().st_mtime— mtime is not a freshness signal. - Dependencies. Only
codegenie.indices.freshness,codegenie.exec.run_allowlisted, and Phase 0ProbeABC. No imports from sibling probe modules. - State. Stateless. Reads sibling slices from the coordinator-provided context.
- Performance envelope.
timeout_seconds=10. Wall-clock typically < 200 ms (O(1) string compares + git rev-parse). Memory: < 20 MB. - Failure behavior. If git rev-parse fails (e.g., the analyzed repo is not a git work tree), the probe emits
IndexFreshness.Stale(reason=IndexerError(message="repo_not_a_git_workdir"))for every index source. If a sibling slice is missing (the upstream probe failed), the probe emitsStale(reason=IndexerError(message=f"upstream_{name}_unavailable")). The probe never raises an exception for an absent upstream; it produces a typed value.
2. IndexFreshness sum type (src/codegenie/indices/freshness.py)¶
- Purpose. One name, one module, one variant set for "is index X stale, and if so, why?" Phase 2's typed answer to commitment §2.3.
- Public interface.
# frozen, discriminated unions; smart constructor at module level Fresh(indexed_at: datetime) Stale(reason: StaleReason) StaleReason = CommitsBehind(n, last_indexed) | DigestMismatch(expected, actual) | CoverageGap(files_indexed, files_in_repo) | IndexerError(message) IndexFreshness = Fresh | Stale__init__.py:__all__ = ["IndexFreshness", "Fresh", "Stale", "StaleReason", "CommitsBehind", "DigestMismatch", "CoverageGap", "IndexerError"]. - Internal structure. Pure Pydantic models,
extra="forbid",frozen=True,Literal["..."]discriminator onkind. Round-trips throughmodel_dump_json↔model_validate_jsonidentity-equal. - Dependencies. stdlib + pydantic (Phase 0 baseline).
- State. None. Pure data.
- Performance envelope. ~1 µs per construct; trivially negligible.
- Failure behavior. Construction raises
ValidationErroron a bad payload (theextra="forbid"discipline). Caller wraps asResult.Errper Phase 2 new-code discipline (Phase 0/1 code is NOT retrofit). - Why not co-located in
probes/layer_b/index_health.py. The Phase-2 consumer issrc/codegenie/report/confidence_section.py, which is intentionally outsideprobes/so a CONTEXT_REPORT render does not require pulling in the probe registry. One additional package (indices/) is the smallest separation that makes the consumer real. Phase 3 adapters and Phase 8 Bundle Builder will also import this without a circular-dependency risk.
3. run_external_cli (src/codegenie/exec.py — extends Phase 0)¶
- Purpose. Single subprocess port for Layer B/G external CLIs (
scip-typescript,syft,grype,semgrep,ast-grep,ripgrep,gitleaks,tree-sitter). Wraps Phase 0run_allowlisted; never duplicates it. Layer C (docker,strace) does NOT route through this — those go throughrun_allowlisteddirectly with explicit hardening flags. - Public interface.
- Internal structure. Delegates to
run_allowlisted(argv, ...)(on macOS / no-bwrap) or to the private_spawn_with_invariantshelper (on Linux+bwrap, sincebwrapis intentionally NOT inALLOWED_BINARIES— see 02-ADR-0001 §Consequences last bullet) with three additions: (a) env strip to the Phase 0 4-key baseline (PATH,HOME,LANG,LC_ALL); (b) on Linux only, optionalbubblewrap --unshare-net --ro-bind <repo> /work --bind <tmpdir> /tmp/probewrap whenbwrapis on PATH (graceful no-op when missing); (c)stdout/stderrcapped on every call (success or failure), tail-included on truncation. - Dependencies. Phase 0
run_allowlisted. No new third-party deps. - State. None.
- Performance envelope. Overhead vs raw
run_allowlisted: < 5 ms (env strip + optional bwrap fork). Bandwidth-bound by tool's stdout (cap 64 MB). - Failure behavior. Non-zero exit →
ProcessResult(exit_code=N, stderr_tail=...). Caller (a scanner probe) wraps asScannerOutcome.ScannerFailed.bwrapmissing on Linux (or non-Linux platform) → single startup warning per process (subproc.bwrap.skipped), no failure. Timeout viaasyncio.wait_for. Stdout/stderr are tail-truncated on every call when the per-stream cap is exceeded (success or failure).
4. SecretRedactor (extension of src/codegenie/output/sanitizer.py)¶
- Purpose. Intercept every string in every
ProbeOutput.schema_slicebefore it lands inrepo-context.yaml, raw artifacts, cache, or audit; replace anything matching a secret pattern with<REDACTED:fingerprint=BLAKE3_8>. Phase 2 persists zero plaintext. - Public interface.
The returned
# added to codegenie/output/sanitizer.py def redact_secrets(slice_: dict[str, JSONValue], probe_name: ProbeId ) -> tuple[dict[str, JSONValue], list[SecretFinding]]: ...list[SecretFinding]is collected in-memory for the CLI summary line; it is NOT persisted to any file. - Internal structure. Runs after Phase 0's field-name regex and the
JSONValuetree walk. Patterns: AWSAKIA[0-9A-Z]{16}, GitHubghp_[A-Za-z0-9]{36}, JWT, RSA private-key block, NPMnpm_…, Anthropicsk-ant-…, plus Shannon-entropy ≥ 4.5 bits/char forlen ≥ 32unknowns. Fingerprint = first 8 hex of BLAKE3 of the cleartext. - Dependencies.
codegenie.hashing.content_hash(Phase 0 — BLAKE3 helper). - State. Stateless function. The in-memory findings list is the caller's state.
- Performance envelope. ~50 MB/s scan; for a typical merged envelope (~10 MB) ~0.2 s.
- Failure behavior. Mutation test: a deliberately weakened pattern (e.g.,
AKIA[0-9A-Z]{15}) must causetests/unit/test_secret_redactor.py::test_aws_key_mutationto FAIL. Pattern failure is a build failure.
5. Layer G scanners — SemgrepProbe, SyftProbe, GrypeProbe, GitleaksProbe¶
- Purpose. Run third-party security/SBOM scanners; parse JSON output into typed
ProbeOutput.schema_slicevia Pydantic smart constructors. - Public interface. Each is a
Probesubclass@register_probe(heaviness="medium")(withruns_last=False, default). Slices followlocalv2.md§5.6. - Internal structure. One file per scanner, ≤ 200 LOC each, no shared
ScannerRunnerabstraction. Common pattern: (a) check tool availability via Phase 0tool_cache; (b) invoke viarun_external_cliwith explicit argv (no shell,--metrics=offforsemgrep); (c) parse stdout JSON via Pydantic; (d) returnProbeOutput. Outcome is a discriminated unionScannerOutcome = ScannerRan | ScannerSkipped | ScannerFailed. - Dependencies.
run_external_cli, Pydantic. - State. None.
- Performance envelope. Per-probe
timeout_seconds: semgrep 60s, syft 30s, grype 30s, gitleaks 30s. Memory dominated by the scanner subprocess (semgrep ~200 MB, others < 100 MB). - Failure behavior. Tool missing →
ScannerOutcome.ScannerSkipped(reason="tool_missing")+confidence="low". Bad JSON →ScannerFailed(reason="invalid_json", stderr_tail=…). All findings flow throughSecretRedactorat the writer chokepoint.
6. RuntimeTraceProbe (C4 — multi-scenario harness)¶
- Purpose. Capture syscalls, loaded shared libraries, network endpoints, shell invocations of the analyzed-repo's container under 5 scenarios (
startup,smoke_test,healthcheck,shutdown,error_path). - Public interface.
@register_probe(heaviness="heavy"). Reads.codegenie/scenarios.yaml(Pydantic-validated; defaults to 5 if absent). - Internal structure. Sequential per-scenario execution (no concurrency; multiple
docker runof the same image race resources). Per scenario:docker build→docker run --network=none --cap-drop=ALL --security-opt=no-new-privileges+strace -f(Linux) / typedStraceUnavailable(macOS). Alldocker/stracecalls viarun_allowlisted(Phase 2 ADR-0001 extendsALLOWED_BINARIES). Per-scenario timeout 120 s; aggregate 600 s. Per-scenario outcome:ScenarioResult = TraceScenarioCompleted | TraceScenarioFailed | TraceScenarioSkipped. - Cache key.
declared_inputsincludesDockerfile,.codegenie/scenarios.yaml, AND a special declared-input tokenimage-digest:<resolved>(Phase-2 ADR-gated via optionalProbeContext.image_digest_resolver: Callable[[Path], str | None] | None = Nonemirroring Phase 1 ADR-0002'sparsed_manifestprecedent — the one ProbeContext field Phase 2 adds; theProbeABC is untouched). - Dependencies.
run_allowlisted,codegenie.indices.freshness(emitsIndexFreshness.Stale(IndexerError("strace_unavailable"))on macOS path). - State. None.
- Performance envelope. Cold p50 ~90 s for 5 scenarios; warm hits cache on image-digest equality.
- Failure behavior. macOS deterministically emits
TraceScenarioFailed(StraceUnavailable())per scenario (no sudo prompt).docker buildfailure → all scenarios skip withconfidence="unavailable".
7. Adapter Protocol definitions (src/codegenie/adapters/protocols.py)¶
- Purpose. Documentation as code. The four Protocols are the typed target Phase 3's first plugin must implement against.
- Public interface.
@runtime_checkable class DepGraphAdapter(Protocol): def consumers(self, pkg: str) -> list[str]: ... def producers(self, pkg: str) -> list[str]: ... def confidence(self) -> AdapterConfidence: ... @runtime_checkable class ImportGraphAdapter(Protocol): def reverse_lookup(self, module: str) -> list[str]: ... def confidence(self) -> AdapterConfidence: ... @runtime_checkable class ScipAdapter(Protocol): def refs(self, symbol: str) -> list[Occurrence]: ... def confidence(self) -> AdapterConfidence: ... @runtime_checkable class TestInventoryAdapter(Protocol): def tests_exercising(self, symbol: str) -> list[TestId]: ... def confidence(self) -> AdapterConfidence: ...AdapterConfidence = Trusted | Degraded(reason: str) | Unavailable(reason: str)lives incodegenie/adapters/confidence.pyas a placeholder; Phase 3 may extend on first concrete adapter. - Internal structure. Pure typing, ~80 LOC total, stdlib +
typingonly. - Dependencies. stdlib.
- State. None.
- Performance envelope. N/A — zero runtime cost; type-only.
- Failure behavior. None — type definitions cannot fail at runtime.
mypy --strictenforces conformance at type-check time when Phase 3 adapters land.
8. TCCMLoader (src/codegenie/tccm/)¶
- Purpose. Load and Pydantic-validate Task-Class Context Manifests (ADR-0029). Schema only; no Bundle building (Phase 8).
- Public interface.
TCCMLoader.load(path: Path) -> Result[TCCM, TCCMLoadError].TCCMis a Pydantic model.DerivedQueryis a Pydantic discriminated union over the five ADR-0030 primitives — noUnknownvariant (ADR-amend on a sixth). - Internal structure.
loader.py(I/O +safe_yaml.load),model.py(TCCM,DerivedQuery),queries.py(the five-variant discriminated union). - Dependencies.
codegenie.parsers.safe_yaml(Phase 1 chokepoint), Pydantic. - State. None.
- Performance envelope. < 50 ms per TCCM load (size capped by
safe_yaml). - Failure behavior. Unknown
compute:variant →Result.Err(TCCMLoadError(reason="unknown_query_primitive")). Schema violation →Result.Err(TCCMLoadError(reason="schema", details=ValidationError.errors())). - Phase-2 consumer.
docs/phases/02-context-gather-layers-b-g/_reference-tccm/tccm.yamlis an illustrative manifest for anindex-health-self-checktask class. It is documentation, not a plugin — it does not live underplugins/, has no probes or subgraph, and exists only to (a) exercise every field of theTCCMPydantic model and (b) givetests/integration/tccm/test_reference_tccm_roundtrips.pya real input.
9. SkillsLoader (src/codegenie/skills/)¶
- Purpose. Load and index YAML-frontmatter
SKILL.mdfiles from three trust tiers (~/.codegenie/skills/,.codegenie/skills/, optional~/.codegenie/skills-org/). Validate frontmatter; record body byte-offset only (progressive disclosure). - Public interface.
SkillsLoader(search_paths: list[Path]).load_all() -> Result[list[Skill], SkillsLoadError].Skillis a Pydantic model. Constructor is pure data; first I/O isload_all(). - Internal structure. Per
SKILL.mdfile:os.open(path, O_NOFOLLOW | O_NOCTTY)→os.fdopen→ pass tocodegenie.parsers.safe_yaml.load(Phase 1 chokepoint). NO parallel YAML loader. Three-tier merge: first-tier-wins; collisions emit askill_shadowedwarning in the CLI summary. - Dependencies.
codegenie.parsers.safe_yaml, stdlib. - State. None.
- Performance envelope. ~5 ms per skill; ~100 skills typical = < 1 s.
- Failure behavior.
O_NOFOLLOWELOOP →Result.Err(SkillsLoadError(reason="symlink_refused", path)).safe_yaml.loadraises → wrapped asSkillsLoadError(reason="unsafe_yaml"). Both cases: this skill skipped; loud CLI warning; other skills load.
10. ConventionsCatalogLoader (src/codegenie/conventions/)¶
- Purpose. Load org convention catalog (
~/.codegenie/conventions/*.yaml); emit typedConventionResult = Pass | Fail | NotApplicableper rule. - Public interface.
ConventionsCatalogLoader(search_paths).load_all() -> Result[Catalog, ConventionsError];Catalog.apply(repo: RepoSnapshot) -> list[ConventionResult]. - Internal structure. Pure functions over Pydantic-modeled rules. Pattern types (
dockerfile_pattern,dockerfile_pattern_inverted,file_pattern,missing_file) are themselves a Pydantic discriminated union; onematchper pattern type withassert_neveron the unreachable branch. - Dependencies.
safe_yaml, Pydantic. - State. None.
- Performance envelope. < 100 ms for a typical catalog.
- Failure behavior. Unknown pattern type →
Result.Err(ConventionsError(reason="unknown_pattern_type")). OPA/Rego ships in Phase 16 (ADR-0021).
11. DepGraphProbe (B5 — kernel skeleton with strategy registry)¶
- Purpose. Build a
networkx.DiGraphof the repo's internal package dependencies (monorepo modules + cross-references). Ecosystem-specific resolution lives in Phase 3 plugin adapters; Phase 2 stitches Phase 1's already-parsed manifests into the graph. - Public interface.
@register_probe. Slice followslocalv2.md§5.2 B5. - Internal structure. Reads Layer A's
manifestsandbuild_systemslices. Dispatches via a typed registry decorator:@register_dep_graph_strategy(ecosystem=PackageManager.PNPM) def build_pnpm(ctx, manifests) -> networkx.DiGraph: ...PackageManageris the Phase 1 ADR-0013 enum (bun | pnpm | yarn-classic | yarn-berry | npm) — imported, not redefined. Adding Maven (Phase 8+) is a new file + new decorator + ADR-amend onPackageManager. - Dependencies.
networkx, Phase 1PackageManagerenum. - State. None.
- Performance envelope. O(N_manifests + N_internal_edges); typical ~200 ms on monorepo-pnpm fixture.
- Failure behavior. Unknown ecosystem (no registered strategy) → typed
DepGraphProbeOutput(confidence="low", reason="no_strategy_for_ecosystem").
12. TreeSitterImportGraphProbe (B3 — kernel skeleton)¶
- Purpose. Extract file-level import edges using
tree-sittergrammars; emitnetworkx.DiGraph-serializable JSON toraw/import-graph.json. Forward-only adjacency in Phase 2; reverse projection is a Phase 3 adapter concern. - Public interface.
@register_probe(heaviness="medium"). Slice followslocalv2.md§5.2 B3. - Internal structure.
py-tree-sitterbindings (the one C-extension dep Phase 2 accepts via Phase 2 ADR-0002 amending Phase 1 ADR-0009). No internalThreadPoolExecutor— the probe is one slot under the Phase 0 single semaphore; hidden parallelism lies to the coordinator's budget. Grammar.so/.dylibBLAKE3-pinned intools/grammars.lock; load-time mismatch isGrammarLoadRefused. In-process load (not a sidecar subprocess — the grammar pin is the supply-chain defense). - Dependencies.
py-tree-sitter, grammar packs (vendored). - State. None.
- Performance envelope. ~5 ms/file × 2k files ≈ 10 s sequential cold.
- Failure behavior. Grammar pin mismatch →
GrammarLoadRefused; probe slice markedconfidence="low"; no grammar code executes.
Component summary table¶
| # | Component | Location | Owner ADR |
|---|---|---|---|
| 1 | IndexHealthProbe |
src/codegenie/probes/layer_b/index_health.py |
02-ADR-0006 |
| 2 | IndexFreshness |
src/codegenie/indices/freshness.py |
02-ADR-0006 |
| 3 | run_external_cli |
src/codegenie/exec.py (extends Phase 0) |
02-ADR-0001 |
| 4 | SecretRedactor |
src/codegenie/output/sanitizer.py (extends Phase 0) |
02-ADR-0005 |
| 5 | Layer G scanners | src/codegenie/probes/layer_g/{semgrep,syft,grype,gitleaks}.py |
02-ADR-0001 |
| 6 | RuntimeTraceProbe |
src/codegenie/probes/layer_c/runtime_trace.py |
02-ADR-0001, 02-ADR-0004 |
| 7 | Adapter Protocols |
src/codegenie/adapters/protocols.py |
02-ADR-0007 |
| 8 | TCCMLoader |
src/codegenie/tccm/ |
02-ADR-0007 |
| 9 | SkillsLoader |
src/codegenie/skills/ |
(Phase 1 ADR-0006 — safe_yaml chokepoint preserved) |
| 10 | ConventionsCatalogLoader |
src/codegenie/conventions/ |
— |
| 11 | DepGraphProbe + strategy registry |
src/codegenie/depgraph/, probes/layer_b/dep_graph.py |
— |
| 12 | TreeSitterImportGraphProbe |
src/codegenie/probes/layer_b/tree_sitter_import_graph.py |
02-ADR-0002 |
Data model¶
Annotation key: [contract] = visible to Phase 3+ consumers; [internal] = Phase 2 implementation detail.
# ---------- [contract] codegenie/indices/freshness.py ----------
class CommitsBehind(BaseModel): # frozen, extra="forbid"
kind: Literal["commits_behind"] = "commits_behind"
n: int
last_indexed: str # commit sha at the I/O boundary; raw str by design
class DigestMismatch(BaseModel): # frozen, extra="forbid"
kind: Literal["digest_mismatch"] = "digest_mismatch"
expected: str
actual: str
class CoverageGap(BaseModel): # frozen, extra="forbid"
kind: Literal["coverage_gap"] = "coverage_gap"
files_indexed: int
files_in_repo: int
class IndexerError(BaseModel): # frozen, extra="forbid"
kind: Literal["indexer_error"] = "indexer_error"
message: str
StaleReason = Annotated[Union[CommitsBehind, DigestMismatch, CoverageGap, IndexerError],
Field(discriminator="kind")]
class Fresh(BaseModel): # frozen, extra="forbid"
kind: Literal["fresh"] = "fresh"
indexed_at: datetime
class Stale(BaseModel): # frozen, extra="forbid"
kind: Literal["stale"] = "stale"
reason: StaleReason
IndexFreshness = Annotated[Union[Fresh, Stale], Field(discriminator="kind")]
# ---------- [contract] codegenie/adapters/protocols.py ----------
@runtime_checkable
class DepGraphAdapter(Protocol): ... # see §Component 7
@runtime_checkable
class ImportGraphAdapter(Protocol): ...
@runtime_checkable
class ScipAdapter(Protocol): ...
@runtime_checkable
class TestInventoryAdapter(Protocol): ...
# ---------- [contract] codegenie/adapters/confidence.py ----------
class Trusted(BaseModel): kind: Literal["trusted"]; ...
class Degraded(BaseModel): kind: Literal["degraded"]; reason: str
class Unavailable(BaseModel): kind: Literal["unavailable"]; reason: str
AdapterConfidence = Annotated[Union[Trusted, Degraded, Unavailable], Field(discriminator="kind")]
# Phase 3 plugin may extend; revise on first adapter.
# ---------- [contract] codegenie/tccm/model.py ----------
class TCCM(BaseModel): # frozen, extra="forbid"
schema_version: Literal["1"]
task_class: TaskClassId
required_probes: list[ProbeId]
required_skills: list[SkillId]
derived_queries: list[DerivedQuery]
confidence_floor: AdapterConfidence
# ---------- [contract] codegenie/tccm/queries.py ----------
# 5 primitives per ADR-0030; no Unknown variant; ADR-amend on a 6th.
DerivedQuery = ConsumersOf | ProducersOf | ReverseLookup | RefsTo | TestsExercising
# ---------- [contract] codegenie/skills/model.py ----------
class Skill(BaseModel): # frozen, extra="forbid"
id: SkillId
applies_to_tasks: list[str]
applies_to_languages: list[str]
body_offset: int # progressive disclosure
body_size: int
body_blake3: str
# ---------- [internal] codegenie/probes/layer_g/scanner_outcome.py ----------
class ScannerRan(BaseModel):
kind: Literal["ran"] = "ran"; findings: list[Finding]
class ScannerSkipped(BaseModel):
kind: Literal["skipped"] = "skipped"; reason: str
class ScannerFailed(BaseModel):
kind: Literal["failed"] = "failed"; exit_code: int; stderr_tail: str
ScannerOutcome = Annotated[Union[ScannerRan, ScannerSkipped, ScannerFailed],
Field(discriminator="kind")]
# ---------- [internal] codegenie/probes/layer_c/scenario_result.py ----------
TraceScenarioCompleted | TraceScenarioFailed | TraceScenarioSkipped
# ---------- [internal] codegenie/conventions/model.py ----------
ConventionResult = Pass | Fail | NotApplicable
# ---------- [contract — additive] codegenie/probes/base.py (Phase 0 + 1 frozen) ----------
# ONLY new field in ProbeContext (Phase-2-ADR-gated):
@dataclass(frozen=True)
class ProbeContext:
# ... Phase 0 fields unchanged ...
parsed_manifest: Callable[[], dict | None] | None = None # Phase 1
image_digest_resolver: Callable[[Path], str | None] | None = None # Phase 2 (NEW)
The Probe ABC itself is not edited. The Phase 0 contract-freeze snapshot test (tests/unit/test_probe_contract.py) catches any drift.
Control flow¶
Happy path (warm gather, incremental .ts change, image unchanged):
- CLI tool-readiness checks
$PATHfor all Layer B/G binaries; missing-mandatory → exit with install hint; missing-optional → flag forconfidence="low"slice. RepoSnapshotbuilt (Phase 0).parsed_manifestmemo seeded (Phase 1).image_digest_resolvercallable bound at coordinator setup.- Coordinator reads registry, sorts by
heaviness(heavy first), reserves the last slot forruns_last=Trueprobes. - Probes dispatch under the single
Semaphore(min(cpu_count(), 8)).SCIPIndexProbeMISS → re-index.RuntimeTraceProbeHIT (image-digest token indeclared_inputsmatches). Layer G scanners MISS on changed files. - All sibling probes finish.
IndexHealthProberuns last, reads sibling slices via the coordinator-provided slice map, constructsIndexFreshnessper index source. - Merged envelope flows through
OutputSanitizer.scrub→SecretRedactor.redact_secrets→ writer. Atomic.tmp→os.replace. CONTEXT_REPORT.mdrendered alongside viaconfidence_section.py; pattern-matches everyIndexFreshnessvalue withassert_never.- Audit anchor (
runs/<utc-iso>-<short>.json) records per-probeRan/CacheHit/Skipped(Phase 0 unchanged). - Exit 0.
Decision points:
- Step 1, tool missing. Mandatory →
MissingToolErrortyped, CLI exits 4 with install command fromlocalv2.md §6. Optional → continue; probe emitsconfidence="low". - Step 4, probe times out.
asyncio.wait_forcancels; probe slice hasconfidence="low"anderrors=[…]; coordinator continues (Phase 0 isolation). - Step 5, sibling slice missing.
IndexHealthProbeemitsStale(reason=IndexerError(message=f"upstream_{name}_unavailable")). - Step 6, secret pattern matches.
SecretRedactorreplaces inline; in-memorySecretFindingcollected; CLI summary prints count + file:line list at gather end. - Step 8, audit-anchor write fails. Exit 6 (Phase 0 audit-failure exit code); the gather is invalidated even if all probes succeeded.
Harness engineering¶
- Logging. Phase 0
codegenie/logging.pyunchanged. Phase 2 probes use the existingstructlog-style logger; per-probe lifecycle events (dispatch,finish,cache_hit) are already emitted. Phase 2 adds one log field at the writer:secrets_redacted_count(int), so a 0-count run is grep-able. - Tracing. No OpenTelemetry yet (Phase 9 ships it). Phase 2 emits per-probe
gathered_at,last_indexed_commit,built_image_digest,rule_pack_versioninto the slice itself — this slice metadata is what Phase 9 will project into the canonical event log. - Idempotence. Every probe is idempotent given the same
declared_inputs.IndexHealthProbeiscache_strategy="none"because it observes a moving fact (HEAD vs. last_indexed) — caching that is the same bug as cachingDate.now(). The deterministic-gather commitment (§2.6, ADR-0006) requires(same repo state) → (same RepoContext)byte-for-byte (modulogenerated_attimestamp). - Determinism vs. probabilism. Phase 2 is deterministic end-to-end (commitment §2.1, ADR-0005). The
gatherextras Phase 0fencejob asserts is unchanged: noanthropic,openai,langgraph,httpx,requests,socketimport undersrc/codegenie/. The one new C-extension dep —py-tree-sitter— has a Phase 2 ADR (02-ADR-0002) amending Phase 1 ADR-0009 with the named trigger. - Replay.
RepoContextre-derives from the analyzed-repo +.codegenie/cache/blobs. There is no event log to replay (deferred to Phase 9 per ADR-0034). The Phase 0 audit anchor lets an operator reproduce a given gather invocation's per-probe outcomes. - Configuration.
.codegenie/config.yaml(Phase 0) +.codegenie/scenarios.yaml(Phase 2; RuntimeTraceProbe — Pydantic-validated; falls back to 5 default scenarios).~/.codegenie/config.yamlfor user defaults (skills paths, conventions paths).ExternalDocsProbeallowlist (external_docs:key) lands when the first real user opts in.
Agentic best practices¶
Phase 2 has no LLM call (commitment §2.1) — but it lands the surfaces Phase 4+ LLM consumers will depend on.
- Typed state. Every state machine is a Pydantic discriminated union:
IndexFreshness,ScannerOutcome,ScenarioResult,ConventionResult,AdapterConfidence. ADR-0033 §3 (make-illegal-states-unrepresentable) is the binding discipline;mypy --warn-unreachableenforces exhaustiveness on the consumer (confidence_section.py) from day 1. - Tool-use safety. All external CLI invocations route through
run_external_cli(Layer B/G) orrun_allowlisteddirectly (Layer C).ALLOWED_BINARIESis the auditable list (Phase 2 ADR-0001 amends).--metrics=offonsemgrepto refuse phone-home.gitleaks--no-bannerfor deterministic stdout. - Prompt template structure. N/A — no prompts in Phase 2. Phase 4 will consume the
SecretRedactor's chokepoint guarantee (zero plaintext in any artifact an LLM might see). - Confidence handling.
IndexFreshnessIS the confidence handling. The probe never says "safe" or "unsafe"; it saysFresh(indexed_at=…)orStale(reason=CommitsBehind(n=17, last_indexed=<sha>)). The Planner (Phase 4+) decides. - Error escalation. Every failure is a typed value, not a thrown exception (Rule 12 — fail loud, structural). Exceptions are reserved for bugs, OOM, signals.
Design patterns applied¶
| Decision | Pattern | Why here | Pattern NOT applied (and why) |
|---|---|---|---|
IndexFreshness = Fresh \| Stale(reason) + every state machine as discriminated union |
Sum type / Make-illegal-states-unrepresentable (ADR-0033 §3–4) | Silent "stale-without-reason" is the load-bearing failure mode B2 exists to prevent; mypy --warn-unreachable enforces exhaustiveness |
Null Object — loses the reason; Optional[str] — stringly-typed |
run_external_cli for B/G; direct run_allowlisted("docker", …) for C |
Command pattern at value-typed argv | One auditable chokepoint per family; "every external CLI call is grep run_external_cli" works |
Hexagonal Port/Adapter — one Adapter is a function; "Port" labeling is ceremony |
Adapter Protocol classes in codegenie/adapters/protocols.py (zero implementations) |
Structural subtyping (PEP 544) | Plugins are external (ADR-0031); inheriting from our ABC would couple plugin authors to our class hierarchy | Abstract Factory — too heavyweight for "the class named in plugin.yaml" |
@register_probe(heaviness=…, runs_last=…) decorator kwargs |
Registry + decorator-data over ABC fields | Scheduling data is coordinator-side; matches Phase 0's existing decorator-registry primitive | cost_tier: Literal[0..3] on the Probe ABC — ABC churn for a scheduling optimization |
@register_dep_graph_strategy(ecosystem: PackageManager) |
Open/Closed at the file boundary | Adding Maven (Phase 8+) is a new file + decorator, never an edit to DepGraphProbe |
String-keyed dict — Phase 3 deferral was the exact ADR-0033 violation the critic flagged |
SecretRedactor as a pass in the existing sanitizer pipeline |
Chain of responsibility / pipeline composition | Single chokepoint discipline survives; one pass added by composition | Capability pattern across LLM boundary — LLM never holds the token; authorization with a fancier name |
One file per Layer G scanner; no shared ScannerRunner abstraction |
SRP + Rule of Three | Four scanners with four genuinely different I/O shapes; ~60 LOC saved by sharing isn't worth the speculative coupling | Template Method / Generic ScannerRunner |
Reference TCCM under docs/_reference-tccm/, not plugins/ |
Documentation as code, deliberately outside the plugin namespace | Phase 3 owns the plugin namespace; Phase 2 ships the schema with one consumer (the integration test) | Synthetic plugin fixture under tests/fixtures/plugins/ — implies pluggability Phase 3 owns |
Patterns considered and deliberately rejected¶
- Plugin Loader in Phase 2. Roadmap and ADR-0031 §Consequences §1 assign the loader to Phase 3 alongside the first plugin ("the first plugin doubles as the proof the loader works"). Pulling forward hollows Phase 3's exit.
cost_tier: Literal[0,1,2,3]on theProbeABC. Coordinator scheduling concern smuggled onto the contract. Registry-side annotation does the same job at the right layer.ProbeContext.capabilities: ProbeCapabilitiesdiscriminated union. Every Phase 0/1 probe would need tomatchexhaustively to stay type-clean — coordinated every-file edit dressed as "additive."- Cryptographic anchoring + audit-log hash chain on B2. Defends against an attacker who can write to
.codegenie/cache/— which per Phase 0 ADR-0011 requires having already compromised the host. Ceremony against a non-threat. - Per-repo encryption key for secret findings under
~/.codegenie/keys/. Key + ciphertext in the same trust tier ($HOME). Structural fix: don't persist plaintext. pytest-xdistfor the portfolio test lane. Phase 0 vetoed it 10/4. Portfolio fits serial CI ≤ 6 minutes.AdapterConfidenceas the type of every probe's freshness output. Conflates ADR-0033's prescription for adapter outputs (Phase 3) with Phase 2's probe outputs.- Event stream (
.codegenie/events/) with hash-chained JSONL. ADR-0034 §Consequences §1 defers canonical event log to Phase 9 (or 13). scip-pythonparser + msgpack-on-disk projection. Adapter consumption shape is a Phase 3 concern; binary on-disk format would force Phase 3 adapters to agree on it.- Out-of-process
_grammar_runnersubprocess for tree-sitter. Grammar pin already guards supply-chain; in-process load is the boring shape. SkillsLoaderauto-discovery via env vars. Explicit search paths at construction; loader doesn't peek at env or import paths.gitpythondep.gitis already inALLOWED_BINARIES; we shell out viarun_allowlisted.
Anti-patterns avoided (from the toolkit's flag-on-sight list)¶
- Premature pluggability. No Plugin Loader; no universal-fallback plugin; no
NullAdapterfixtures. Adapter Protocols ship as documentation; first real consumer (Phase 3) is the proof. - Pattern soup. Three explicit pattern decisions (sum types, decorator-registry, single-chokepoint composition). No
FooFactory, noBarStrategy, noBazBuilder. - Inheritance for code reuse. Every Phase 2 class inherits only
Probe(the contract) orBaseModel(Pydantic). No "shared scanner base class" because the scanners aren't actually share-able. - Stringly-typed identifiers.
ProbeId,TaskClassId,SkillId,PackageManagerare typed;ecosystem: PackageManageris the registry key, notecosystem: str. - Untyped
dict[str, Any]. The one inherited surface (ProbeOutput.schema_slice: dict[str, JSONValue]) is bounded by Phase 0's recursiveJSONValue. Phase 2 doesn't extend it. - Boolean flags on public methods.
register_probe(heaviness=…, runs_last=…)usesLiteral["light","medium","heavy"]and a typed bool with one purpose; noforce=,strict=,dry_run=proliferation. - Tag-and-dispatch without a tagged union. All
kind-style discriminators are Pydantic discriminated unions (Field(discriminator="kind")). - Capability passed through ten frames. No
capability_tokenthreaded throughrun;run_external_cliis the registry the toolkit prescribed. - Side effects in constructors. Every loader (
SkillsLoader,TCCMLoader,ConventionsCatalogLoader) is pure data at__init__; first I/O isload_all()/load(). - Hexagonal sandbox that smuggles subprocess into the core.
run_external_cliis honestly a Command-pattern wrapper; we don't claim what we didn't build. - Schema before consumer. Every typed sum has at least one Phase 2 consumer:
IndexFreshness←confidence_section.py;TCCM← reference-TCCM integration test;ScannerOutcome← every Layer G probe's caller; AdapterProtocols ← Phase 3 exit contract (documented exception). model_construct()bypass. Phase 0'sforbidden-patternspre-commit (extended in Phase 2) bansmodel_constructundersrc/codegenie/{indices,tccm,skills,conventions,adapters,depgraph}/**.
Edge cases¶
| # | Edge case | Probe / Component | Detection | Containment | Source |
|---|---|---|---|---|---|
| 1 | External CLI not on $PATH |
Tool-readiness check at startup | MissingToolError typed; CLI exits with install hint |
Mandatory → exit 4; optional → confidence="low" slice; build continues |
final-design [B] |
| 2 | External CLI exits non-zero | run_external_cli |
ProcessResult(exit_code != 0, stderr_tail) |
ScannerOutcome.ScannerFailed; coordinator continues per Phase 0 isolation |
final-design |
| 3 | External CLI emits invalid JSON | Layer G scanner Pydantic smart constructor | Result.Err(ParseError(stdout_tail)) |
Slice confidence="low"; stdout/stderr tail in audit |
[B] |
| 4 | scip-typescript timeout on huge monorepo |
asyncio.wait_for(timeout_seconds=300) |
IndexFreshness.Stale(reason=IndexerError(message="timeout")) |
Phase 3 adapter falls back to tree-sitter per ADR-0032 declared fallback | final-design |
| 5 | docker build fails (adversarial Dockerfile) |
Subprocess exit code | C-tier probes confidence="unavailable"; gather completes degraded |
Operator fixes Dockerfile; re-runs | [P] |
| 6 | strace exec fails (macOS) |
run_external_cli raises typed StraceUnavailable |
TraceScenarioFailed(StraceUnavailable()) per scenario; B2 emits Stale(IndexerError("strace_unavailable")) for runtime_trace |
macOS path is permanent; CI is Linux-canonical | [P+B+S] |
| 7 | gitleaks finds AWS key in .git/ history |
gitleaks parses; SecretRedactor matches AKIA… |
Plaintext replaced with <REDACTED:fp=…> in repo-context.yaml, raw artifact, cache. Plaintext in zero persisted files |
Reviewer runs gitleaks manually at PR time for cleartext |
[S, scaled] |
| 8 | Hostile YAML in Skill (!!python/object, billion-laughs) |
safe_yaml.load (Phase 1 chokepoint) |
SkillsLoadError(reason="unsafe_yaml"); this skill skipped, others load |
Loud CLI warning; operator investigates supply chain | [B+S] |
| 9 | Symlink ~/.codegenie/skills/x/SKILL.md → /etc/passwd |
os.open(O_NOFOLLOW) → ELOOP |
SkillsLoadError(reason="symlink_refused"); loud CLI warning |
Operator investigates planted symlinks | [S] |
| 10 | tree-sitter grammar load failure (per 02-ADR-0011: PyPI wheel missing, unknown language, capsule-factory drift, or ABI mismatch — was BLAKE3 mismatch under 02-ADR-0002) |
Kernel codegenie.grammars.lock.language_for(name) raises GrammarLoadRefused on any failure surface |
GrammarLoadRefused; probe slice confidence="low" (and inverted-semantics confidence_impact="high" for NodeReflectionProbe); no grammar code executes |
Operator investigates closure regression (missing wheel) or pins a working tree-sitter / grammar-wheel pair in pyproject.toml |
[S] |
| 11 | Stale-SCIP fixture in CI (deliberate seed) | IndexHealthProbe reads last_indexed_commit mismatch |
IndexFreshness.Stale(reason=CommitsBehind(n>=1, last_indexed=<prior>)); test_stale_scip_fixture.py asserts exact typed outcome |
Roadmap exit criterion | final-design |
| 12 | Concurrent gather race against same repo | Phase 0 advisory lock at .codegenie/cache/.lock |
Second invocation waits or fails fast (config) | — | [P] |
| 13 | Hostile semgrep/grype/gitleaks JSON (truncated, oversized, deeply nested) | Pydantic smart constructor + JSONValue depth cap |
ScannerFailed(reason="invalid_json", stderr_tail); sanitizer rejects oversized payloads upstream |
Operator inspects audit | [B+S] |
| 14 | Image-digest resolver returns None (no built image) |
ProbeContext.image_digest_resolver callable |
C-tier probes skip with confidence="unavailable"; cache key falls back to declared inputs without the token |
Operator builds image; re-runs | final-design |
| 15 | cpu_count() returns 2 (GitHub-hosted runner) |
Semaphore(min(2, 8)) |
All probes serialize through 2 slots; heavy probes still sort-first via heaviness annotation |
No starvation explosion (vs. per-tier semaphores from [P]) | critic [P] hidden-assumption #2 |
| 16 | Skill collision across three trust tiers | SkillsLoader three-tier merge |
First-tier-wins + loud skill_shadowed warning in CLI summary |
Per-tier signing (Sigstore) is a Phase 14 concern | [S] |
Testing strategy¶
Test pyramid¶
▲
adversarial (≥ 5 cases — including the LOAD-BEARING stale-scip)
─────────────
e2e (1 — full gather against pinned Node fixture)
─────────────────
integration (~12 — one per scanner + portfolio + TCCM roundtrip)
──────────────────────
golden (~70 — one per probe × 5 portfolio fixtures)
─────────────────────────
property (~6 — IndexFreshness round-trip, monotonicity, well-formedness)
──────────────────────────────
unit (~40 — every component in §"Component design")
▼
Unit tests¶
| Test module | Asserts |
|---|---|
tests/unit/probes/layer_b/test_index_health_probe.py |
Per-source freshness; every IndexFreshness variant constructible; cache_strategy="none" enforced; runs_last annotation respected |
tests/unit/indices/test_freshness.py |
IndexFreshness round-trip (model_dump_json → model_validate_json = identity); exhaustive match test with assert_never (missing case is a mypy --warn-unreachable build error) |
tests/unit/probes/layer_b/test_scip_index.py |
scip-typescript invocation; output binary present; cache-key sensitivity to tool-version + Merkle of .ts files; timeout → IndexerError |
tests/unit/probes/layer_c/test_runtime_trace.py |
Per-scenario sequential execution; per-scenario timeout; macOS StraceUnavailable deterministic path |
tests/unit/probes/layer_b/test_dep_graph.py |
@register_dep_graph_strategy registry works; one strategy per PackageManager variant; monorepo graph correct |
tests/unit/probes/layer_b/test_tree_sitter_import_graph.py |
Per-file extraction; no internal thread pool; grammar pin verified at load |
tests/unit/probes/layer_g/test_{semgrep,syft,grype,gitleaks}.py |
Pydantic smart constructor; subprocess mocked via pytest-subprocess; ScannerOutcome variants |
tests/unit/skills/test_loader.py |
Frontmatter parsing; O_NOFOLLOW symlink refusal; three-tier merge + shadowing warning; body byte-offset not loaded into memory |
tests/unit/conventions/test_catalog.py |
One test per pattern type; NotApplicable path; assert_never on unknown |
tests/unit/tccm/test_loader.py |
Loads reference TCCM from docs/_reference-tccm/; unknown compute: variant fails fast; five DerivedQuery variants round-trip |
tests/unit/adapters/test_protocols.py |
Protocol structural typing (no-op stub passes isinstance via runtime_checkable); AdapterConfidence variants construct |
tests/unit/output/test_secret_redactor.py |
Each pattern class matches; entropy threshold catches generic high-entropy strings; mutation test: weakened regex causes a test to fail |
tests/unit/exec/test_run_external_cli.py |
Env strip; allowlisted-egress respected; stdout cap; bubblewrap graceful no-op on macOS |
Property tests (tests/property/)¶
test_index_freshness_roundtrip.py— Hypothesis: anyIndexFreshnessround-trips throughmodel_dump_json/model_validate_jsonidentity-equal.test_skills_loader_monotone.py—SkillsLoader.find_applicable(evidence_keys)is monotone inevidence_keys(adding a key never removes a match).test_trace_coverage_well_formed.py— for any combination ofScenarioResultvariants,TraceCoverageis well-formed.test_scanner_outcome_roundtrip.py—ScannerOutcome↔ JSON identity.test_dep_graph_strategy_dispatch.py— registry dispatch is total overPackageManagerenum members.
Golden files (tests/golden/probes/<probe>/<fixture>.json)¶
One golden per probe per portfolio fixture. CI diffs live output vs. committed expected. pytest --update-golden regenerates. Updating a golden is a deliberate PR step (the diff must be reviewed). Five-fixture portfolio:
| Fixture | Exercises |
|---|---|
minimal-ts |
Smallest happy path; smoke for every probe |
native-modules |
C-extension manifest edge cases (e.g., node-gyp) |
monorepo-pnpm |
DepGraphProbe cross-package edges; pnpm workspace |
distroless-target |
Layer C runtime trace against an already-distroless base image (Phase 7 forward-looking) |
stale-scip |
LOAD-BEARING — pre-populated SCIP from a prior commit; HEAD has moved |
Fixture portfolio engineering¶
- Fixtures are tiny (≤ 200 files each). Generation scripts under
tests/fixtures/portfolio/<name>/regenerate.share reviewed-as-code. .codegenie/cache/is NOT committed to fixtures (regenerated each CI run; transparent diff). Open question §6 in the synthesis names the alternative if CI walltime regresses past 8 min.stale-sciphas aREADME.mddocumenting the regeneration policy: the structural assertion isCommitsBehind.n >= 1, tool-version-agnostic.
CI gates¶
fencejob (Phase 0). Asserts noanthropic/openai/langgraphimports undersrc/codegenie/. Unchanged.contract-freezejob (Phase 0).tests/unit/test_probe_contract.pysnapshots theProbeABC againsttests/snapshots/probe_contract.v1.json. Unchanged — Phase 2 makes no edit.unitjob. Pytest serial. ≤ 90 s.integrationjob. Real-tool invocations against tiny fixtures. CI-gated on tool presence; skip-with-loud-warning if missing.portfoliojob. Five-fixture sweep + golden diff. Serial (no xdist). ≤ 6 min budget.adv-phase02job. LOAD-BEARING. Includestest_stale_scip_fixture.py,test_hostile_skills_yaml.py,test_secret_in_source.py,test_image_digest_drift.py,test_concurrent_gather_race.py. Build FAILS on any.mypyjob.mypy --strictrepo-wide;--warn-unreachableper-module oncodegenie.{indices, probes/index_health.py, report, adapters, tccm}/**viapyproject.tomloverrides.benchjob (advisory, not gating).tests/bench/bench_portfolio_walltime.py. Flags ≥ 50% regression on cold/warm p50; comment-on-PR, no block.
Performance regression tests¶
bench_portfolio_walltime.py— five-fixture cold + warm p50 captured per run; baseline JSON committed intests/bench/baselines/. ≥ 50% delta = comment-on-PR (per Phase 0 §3.2 advisory discipline).bench_index_health_overhead.py—IndexHealthProbewalltime must be < 5% of total cold gather onminimal-ts. ≥ 10% = comment-on-PR.
Adversarial tests (tests/adv/phase02/) — the load-bearing exit¶
| Test | What it proves |
|---|---|
test_stale_scip_fixture.py |
Roadmap exit criterion. IndexHealthProbe returns IndexFreshness.Stale(reason=CommitsBehind(n>=1, last_indexed=<prior>)). Build FAILS otherwise. |
test_hostile_skills_yaml.py |
!!python/object, billion-laughs, deep nesting, symlink-escape filenames. ≥ 8 cases. None executes user code. |
test_secret_in_source.py |
gitleaks finds seeded secret; SecretRedactor replaces in repo-context.yaml + every raw artifact + cache + audit. Plaintext in zero persisted files. |
test_image_digest_drift.py |
Mutating built image between gathers invalidates tier-C caches via the image-digest declared-input token. |
test_concurrent_gather_race.py |
Two concurrent gathers don't corrupt cache; Phase 0 advisory lock holds. |
test_adversarial_dockerfile.py |
Forkbomb / infinite-loop Dockerfile: probe times out; container --cap-drop=ALL --network=none --no-new-privileges contains it; coordinator continues. |
Integration with Phase 3 (next phase)¶
Phase 3 is "Vuln remediation: deterministic recipe path" (roadmap line 82). It is also the phase that ships the Plugin Loader, the universal fallback plugin, the first concrete plugin (plugins/vulnerability-remediation--node--npm/), the four ADR-0032 adapter implementations, Skills, and OpenRewrite recipes — all together, as ADR-0031 §Consequences §1 prescribes.
What Phase 3 inherits from Phase 2 on day 1, unchanged:
| Surface | Where Phase 3 finds it | What Phase 3 does with it |
|---|---|---|
codegenie.adapters.protocols.DepGraphAdapter |
src/codegenie/adapters/protocols.py |
plugins/vulnerability-remediation--node--npm/adapters/dep_graph_npm.py implements it |
codegenie.adapters.protocols.ImportGraphAdapter |
same | adapters/import_graph_node.py implements it (consumes raw/import-graph.json) |
codegenie.adapters.protocols.ScipAdapter |
same | adapters/scip_node.py implements it (consumes the SCIP binary blob — Phase 3 decides project/mmap/walk) |
codegenie.adapters.protocols.TestInventoryAdapter |
same | adapters/test_inventory_node.py implements it (consumes Phase 1 TestInventory slice) |
codegenie.adapters.confidence.AdapterConfidence |
src/codegenie/adapters/confidence.py |
Phase 3 adapter confidence() returns these variants; may extend with ADR if first adapter needs more |
codegenie.indices.freshness.IndexFreshness |
src/codegenie/indices/freshness.py |
Phase 3 plugin renders into bundle metadata; potentially layered over AdapterConfidence |
codegenie.tccm.TCCM + DerivedQuery |
src/codegenie/tccm/ |
plugins/vulnerability-remediation--node--npm/tccm.yaml is parsed by this loader |
codegenie.skills.SkillsLoader |
src/codegenie/skills/ |
Phase 3 plugin's Skills are loaded through this; three-tier merge already works |
codegenie.exec.run_external_cli |
src/codegenie/exec.py |
Phase 3's npm, jq invocations route through it (ALLOWED_BINARIES amended) |
PackageManager enum + @register_dep_graph_strategy |
src/codegenie/depgraph/registry.py |
Phase 3 registers build_npm, build_pnpm etc. via new files; never edits DepGraphProbe |
| Layer A–G probe outputs | .codegenie/context/repo-context.yaml + raw/* |
Phase 3 plugin reads these as evidence input |
SecretRedactor chokepoint |
Sanitizer | Phase 3 LLM-adjacent flows (still no LLM in Phase 3 — but Phase 4's RAG store builds atop Phase 3 outputs) inherit the guarantee |
| Phase 2 fixture portfolio | tests/fixtures/portfolio/ |
Phase 3 reuses monorepo-pnpm + minimal-ts as the first-recipe targets |
Implicit guarantees Phase 3 will rely on (and we must preserve):
- The
ProbeABC andProbeContextare append-only. Phase 3 may add probes (npm-specific) inside the plugin; the kernel ABC is unchanged. IndexFreshnessvariant set is stable. If Phase 3 needs a fifthStaleReason, it requires an ADR amendment to 02-ADR-0006, not a silent extension.SecretRedactorruns before any artifact is persisted. Phase 3 cannot bypass it; the only path fromProbeOutputto disk isOutputSanitizer.scrub(Phase 0 chokepoint).run_external_cliis the only way to add a new external CLI for B/G families. Phase 3 amendsALLOWED_BINARIESfornpm,jq; everything else stays throughrun_allowlisteddirectly (git) or viarun_external_cli(security scanners).- The image-digest declared-input token mechanism (Phase 2 ADR-0004) is reusable. Phase 3 transforms that change the analyzed image inherit the cache-correctness story.
New artifacts Phase 2 produces that Phase 3 consumes:
.codegenie/context/raw/scip-index.scip— binary SCIP blob; Phase 3'sScipAdapterdecides consumption shape..codegenie/context/raw/import-graph.json— forward-only adjacency; Phase 3'sImportGraphAdapterdecides projection..codegenie/context/raw/dep-graph.json—networkx-serializable monorepo dep graph..codegenie/context/raw/syft-sbom.json+grype-cves.json— Phase 3's deterministic recipe path reads these as the vulnerability evidence..codegenie/context/raw/runtime-trace-{scenario}.{strace,json}— Phase 3 readsshared_libs_loadedandshell_invocations_observedfor distroless feasibility checks..codegenie/context/raw/semgrep-findings.json+gitleaks-findings.json— both redacted at the writer chokepoint.
Path to production end state¶
Capabilities now possible (not necessarily wired in until later phases):
- A Planner reading a slice and knowing whether the slice is trustworthy (
IndexFreshness). The Planner doesn't exist yet (Phase 8), but the typed answer it will consume does. - A plugin author writing against ADR-0032 adapter Protocols without guessing the shape (Phase 3).
- A microVM (Phase 5) re-deriving cleartext for a secret finding without Phase 2 having persisted plaintext anywhere it could reach.
- A canonical event log (Phase 9) projecting Phase 0's audit anchor + Phase 2's per-probe slice metadata (
gathered_at,last_indexed_commit, etc.) into a typed Postgres event log without schema drift. - A cost ledger (Phase 13) attributing per-probe wall-clock and external-CLI runtime cost via the existing
Ran/CacheHit/Skippedoutcomes.
What's still missing (deliberate, named non-goals):
- The Plugin Loader (Phase 3).
- LLM-fallback adjudication (Phase 4).
- Sandbox + Trust-Aware gates (Phase 5).
- SHERPA state machine (Phase 6).
- The second task class — Chainguard distroless migration (Phase 7).
- Hierarchical Planner + pre-rendered hot views (Phase 8).
- Temporal envelope (Phase 9).
Deferred ADRs this phase makes resolvable:
- ADR-0015 (trust-score threshold calibration). Phase 2's
IndexFreshness+ (forthcoming Phase 3)AdapterConfidenceare the typed inputs the calibration will consume. Phase 6.5's eval harness can now ground its calibration runs on concrete typed signals. - ADR-0017 (knowledge-graph backend). Phase 2's
TCCMloader gives the KG a typed schema target for the "what evidence does this task class need" question. - ADR-0021 (policy engine). Phase 2's
ConventionsCatalogLoaderis the boring-default that decides ADR-0021's "build vs. adopt" question by example — if conventions stay this simple through Phase 7, OPA/Rego is over-engineered. - ADR-0023 (MCP server topology). Phase 2's
.codegenie/context/artifact shape is the input the Context MCP will serve in the service.
New ADRs landing in Phase 2 (numbered under docs/phases/02-context-gather-layers-b-g/ADRs/):
| # | ADR | Records |
|---|---|---|
| 02-ADR-0001 | Add docker + security-CLI tools to ALLOWED_BINARIES |
Mirrors Phase 1 ADR-0001 node addition |
| 02-ADR-0002 | py-tree-sitter grammars — Phase 2 amendment to Phase 1 ADR-0009 (Superseded 2026-05-17 by 02-ADR-0011) |
Named-trigger fired: localv2.md §5.2 B3 requires tree-sitter. Named-trigger discipline preserved by 02-ADR-0011; grammar delivery moved from vendored .so to PyPI wheels |
| 02-ADR-0011 | Tree-sitter grammars via PyPI wheels (supersedes 02-ADR-0002) | Vendored .so/.dylib + tools/grammars.lock BLAKE3 model replaced with tree-sitter-typescript/tree-sitter-javascript PyPI wheels behind codegenie.grammars.lock.language_for |
| 02-ADR-0003 | @register_probe(heaviness=, runs_last=) registry annotations |
Coordinator sort-order edit; preserves single Semaphore + ABC contract |
| 02-ADR-0004 | Image digest as declared-input token | Extends localv2.md §4 special-token mechanism; introduces optional ProbeContext.image_digest_resolver |
| 02-ADR-0005 | Secret findings — no plaintext persistence | Phase 5 microVM is the named cleartext escalation door |
| 02-ADR-0006 | IndexFreshness sum-type location |
Lives at codegenie.indices.freshness; documents why AdapterConfidence/IndexConfidence are NOT Phase 2 |
| 02-ADR-0007 | No Plugin Loader in Phase 2 | Phase 3 ships loader + first plugin + adapters + universal fallback together |
| 02-ADR-0008 | No event stream in Phase 2 | Defers to ADR-0034 §Consequences §1 |
| 02-ADR-0009 | pytest-xdist veto preserved |
Re-affirms Phase 0's 10/4 vote; portfolio fits serial CI |
Tradeoffs (consolidated)¶
| Decision | Gain | Cost | Source |
|---|---|---|---|
IndexFreshness sum type with one Phase-2 consumer (confidence_section.py) |
Variant set is exercised from day 1; no schema-without-consumer | One extra package (indices/); slightly more LOC than co-locating in B2 |
final-design §2 |
Registry annotations (heaviness, runs_last) instead of ABC fields |
Scheduling concern at the right layer; zero ABC churn; preserves contract freeze | ~15 LOC sort-order edit to Phase 0 coordinator (ADR-gated) | final-design §13 |
| Reject Plugin Loader; ship Protocols + TCCMLoader skeleton only | Roadmap-honest; Phase 3 exit ("first plugin doubles as loader proof") stays intact | Adapter Protocols ship with zero implementations — risk of Phase 3 discovering wrong shape | final-design §7 |
| Reject per-repo encryption key for secrets; persist zero plaintext | Eliminates trust-tier theatre; no key-management fragility | Reviewer must run gitleaks manually at PR time for cleartext; no inline rotation suggestions until Phase 4+ |
final-design §4 |
Reject pytest-xdist reversal |
Phase 0 veto holds; no shared-fixture race surface | Portfolio CI lane is serial (≤ 6 min budget) | final-design Goals §10 |
Reject in-probe ThreadPoolExecutor for TreeSitterImportGraphProbe |
Honesty to coordinator's single semaphore | ~10 s sequential vs. ~3 s threaded (~7 s tax) | final-design §12 |
| Sequential RuntimeTraceProbe scenarios | Avoid resource-race + attribution confusion across docker run instances |
~75 s wall-clock floor vs. theoretical ~15 s parallel | final-design §6 |
bubblewrap opt-in-on-availability, not mandatory |
Dev/CI parity preserved; honest about macOS gap | Less host isolation on macOS dev hosts than on Linux CI | final-design §3 |
| One file per Layer G scanner; no shared abstraction | Each probe reviewable in one sitting; no speculative coupling | ~200 LOC duplicated across four scanners | final-design §5 |
Image digest as a declared-input token, not a cache_key() override |
Phase 0 declared_inputs discipline survives; no second cache-key pathway |
One optional callable on ProbeContext (mirrors Phase 1 parsed_manifest precedent) |
final-design §6 |
mypy --warn-unreachable per-module (not repo-wide) |
Surgical-changes discipline (Rule 3); zero blast radius into Phase 0/1 | Phase 0/1 don't get the protection; backlog item to roll out later | final-design §"Design patterns applied" row 9 |
Gap analysis & improvements¶
Gap 1: Adapter Protocol drift between Phase 2 and Phase 3¶
The synthesis ships four Protocol classes with zero implementations in Phase 2; the only Phase-2-internal proof they're shaped correctly is the integration test loading the reference TCCM and dispatching to a mock. Phase 3's first adapter (adapters/scip_node.py, etc.) may discover the Protocol signature is wrong (e.g., consumers(self, pkg: str) should be consumers(self, pkg: PackageId, *, transitively: bool = False)). Discovering it at Phase 3 land means amending a Phase 2 module, which ripples through report/confidence_section.py and any Phase 3 plugin code that was prototyped against the wrong shape.
Improvement. Add an explicit Phase 3 entry-gate test: tests/integration/adapters/test_phase3_handoff_smoke.py lands in Phase 2 as skipped (with @pytest.mark.skip(reason="enabled when Phase 3 plugin lands")). The test, when unskipped at Phase 3, MUST import the Phase 2 Protocols unchanged and pass against plugins/vulnerability-remediation--node--npm/adapters/. Any drift requires an explicit ADR amendment to 02-ADR-0006/02-ADR-0007; the test naming makes the contract violation discoverable at PR review. The test stub also forces the reference TCCM to exercise every Protocol method via the mock dispatcher — closing the "Protocols defined, never called in Phase 2" critique fully.
Gap 2: cpu_count()=2 (GitHub-hosted runner) and heavy probes still serialize hard¶
The synthesis explicitly rejected per-tier semaphores because cpu_count()=2 would degenerate them to 2-vs-2 starvation. But the single Semaphore(min(cpu_count(), 8)) with cpu_count()=2 also means cold gather on a hosted runner serializes through 2 slots — SCIPIndexProbe (~10 s) + RuntimeTraceProbe (~90 s) + SemgrepProbe (~3 s) + the rest = ~110+ s, which still hits the ≤180 s p95 cold target but only barely. If the bench canary regresses by 50 %+, Phase 2 fails its own budget on hosted runners specifically.
Improvement. Add tests/bench/bench_portfolio_walltime_hosted_runner.py, configured to emulate cpu_count()=2 via CODEGENIE_FORCE_CPU_COUNT=2 env var (read by Semaphore sizing). Run nightly (not per-PR) on the actual hosted-runner image. Comment-on-PR threshold ≥ 50 %; build-fail threshold ≥ 100 % (i.e., > 360 s p95). This makes the hidden assumption (critic [P] hidden-assumption #2) measurable instead of asserted. Also document the escape valve: if hosted-runner walltime regresses past the build-fail threshold, the operator option (not Phase 2's choice) is to commit per-fixture .codegenie/cache/ blobs (open question §6 in the synthesis), which would re-flip the regenerate-each-run policy.
Gap 3: IndexHealthProbe couples to every sibling probe's slice shape¶
B2 reads last_indexed_commit, files_indexed, files_in_repo, last_traced_image_digest, built_image_digest, rule_pack_version from sibling slices. The synthesis defends this as "inverted coupling — B2 reads slice metadata that is already part of the schema." But adding a new index source (Phase 3's SCIP-per-language, Phase 7's distroless-target manifest, Phase 14's cross-repo SCIP) silently requires editing B2 to know about the new freshness signal. Today B2 has a match index_name: block; tomorrow that block grows every phase. The toolkit's Open/Closed failure mode names this exactly.
Improvement. Apply the same @register_dep_graph_strategy discipline to B2: introduce @register_index_freshness_check(index_name: IndexName) in src/codegenie/indices/freshness.py as a decorator-registry. Each Phase 2 index source registers a small function (slice: dict[str, JSONValue], head: str) -> IndexFreshness. B2's run() loops over the registry. Phase 3 adds new index sources by new file + new decorator, never by editing B2. This closes the Open/Closed gap before it widens. Cost: ~30 LOC in freshness.py + 5–10 LOC per Phase-2 index source. The pattern symmetry with @register_probe and @register_dep_graph_strategy is itself a documentation win.
Gap 4: SecretRedactor runs after the writer chokepoint conceptually, but the in-memory SecretFinding list flows in band with the slice¶
The synthesis says redact_secrets returns tuple[redacted_slice, list[SecretFinding]]. The caller is responsible for not persisting the findings list. But that's a convention enforced by code review, not by types — a future contributor could thread the findings into a debug log or audit anchor and silently leak plaintext. The critic ("Smart-constructor pattern missed at the secret-redactor boundary") named this verbatim.
Improvement. Apply the smart-constructor pattern at the redaction boundary:
class RedactedSlice(BaseModel):
model_config = ConfigDict(frozen=True, extra="forbid")
slice: dict[str, JSONValue]
findings_count: int
fingerprints: list[str] # 8-hex fingerprints only
# Construction is private; ONLY `redact_secrets(...)` can produce a RedactedSlice.
# The writer accepts ONLY RedactedSlice.
RedactedSlice without going through redact_secrets, which by construction produces only fingerprints. The plaintext-cleartext audit-trail lives in the CLI summary path (returned separately by redact_secrets, not threaded into a RedactedSlice). This makes "redactor was called" type-checkable. Cost: ~20 LOC in output_sanitizer.py; the writer's signature tightens from dict to RedactedSlice.
Gap 5: No explicit Phase-4 RAG-store handoff contract¶
Phase 4 (LLM fallback + solved-example RAG, roadmap §96) builds a vector store of solved examples atop Phase 3 outputs. But the vector-store input shape — what gets embedded — is not pinned by Phase 2. If Phase 4 embeds raw repo-context.yaml, the SecretRedactor chokepoint guarantee (G5) carries; if Phase 4 embeds intermediate raw/*.json artifacts, the same guarantee carries; if Phase 4 invents a new "RAG document" shape that pulls fields directly from in-memory probe outputs (bypassing the sanitizer), the chokepoint is silently broken.
Improvement. Phase 2 lands a tests/adv/phase02/test_no_inmemory_secret_leak.py that asserts at the boundary: every artifact reachable from the Phase 0 writer chokepoint passes through redact_secrets; no probe output is exposed to any consumer outside the sanitizer pipeline. The test reads the OutputSanitizer.scrub source via inspect and asserts the call to redact_secrets is present and unbypassable. This is a textual / structural test (Phase 0 already uses this pattern for forbidden-patterns); it makes the Phase 4 contract explicit at Phase 2 land time, before Phase 4 designs its RAG ingestion path.
Open questions deferred to implementation¶
- Phase 5 microVM cleartext-access protocol. The exact handoff (does the microVM receive
(file:line, pattern_class, fingerprint)and re-scan? a one-time decryption capability tied to the workflow ID?) is a Phase 5 design concern. Phase 2's commitment is only "no plaintext anywhere Phase 4 can reach it." (Final-design §Open Q 1.) TreeSitterImportGraphProbeprojection shape. Phase 2 emits forward-only adjacency; Phase 3's firstImportGraphAdapterdecides whether to pre-compute reverse, mmap a binary, or walk at query time. (Final-design §Open Q 2.)SkillsLoaderorg-shared tier signing. Per-tier Sigstore signing for~/.codegenie/skills-org/is a Phase 14 multi-tenant concern. (Final-design §Open Q 3.)ExternalDocsProbeenablement & host allowlist schema. Phase 2 ships opt-in skip-cleanly; theexternal_docs:config key shape lands when the first real user opts in. (Final-design §Open Q 4.)mypy --warn-unreachablerollout beyond Phase 2 modules. Per-module config only in Phase 2; full-repo rollout is a tracked backlog item. (Final-design §Open Q 5.)- Per-fixture cache pre-warming for CI walltime. Phase 2 picks regenerate-on-every-run; if CI walltime regresses past 8 min, this flips to commit
.codegenie/cache/blobs. The hosted-runner bench (Gap 2 above) is the trigger. (Final-design §Open Q 6.) stale-scipfixture regeneration policy. Documented intests/fixtures/portfolio/stale-scip/README.md; the structural assertion (CommitsBehind.n >= 1) is tool-version-agnostic — but the fixture regeneration ritual is implementer-time work. (Final-design §Open Q 7.)- Whether to introduce
@register_index_freshness_check(Gap 3 improvement) in Phase 2 or defer to Phase 3. Phase 2 makes B2 the Open/Closed example; Phase 3 will discover the cost in practice. Defaulting to Phase 2 introduction; revisit at Phase 3 entry if the decorator-registry adds friction the synthesis didn't anticipate.