ADR-0008: Output sanitizer is the single path from ProbeOutput to disk — two-pass, no synchronous gitleaks¶
Status: Accepted Date: 2026-05-11 Tags: security · chokepoint · privacy · provenance Related: ADR-0010, production ADR-0006
Context¶
Phase 11 commits .codegenie/ artifacts (including repo-context.yaml) into a real repo as part of a PR-opening bundle. That artifact will be reviewed by humans across the org. If it leaks developer home paths (/Users/dannytrevino/...), embedded credentials, or secret-shaped fields, the leak compounds across the portfolio.
The security lens proposed a three-pass sanitizer: (1) field-name regex, (2) absolute → relative path scrubbing, (3) gitleaks synchronous scan. ../critique.md §2.1.2 rejected pass 3: at Phase 2's hundreds-of-KB artifacts and Phase 14's continuous-gather model, gitleaks in the synchronous write path becomes a multi-second subprocess on the hot loop — incompatible with production ADR-0006's "cheap to run every hour" commitment.
The performance and best-practices lenses had zero sanitization — yaml.CSafeDumper directly writing to disk. ../critique.md §1.3 and §3.3 both name this as the load-bearing omission: Phase 11's PR will commit whatever path the writer wrote.
A sanitizer is mandatory; gitleaks synchronously is too expensive; the load-bearing defenses must be structural.
Options considered¶
- No sanitizer (
[P],[B]).yaml.CSafeDumperdirectly. Fast; leaks paths and any field name the probe happens to emit. Phase 11 commits the leak. - Field-name regex only. Catches obvious secret-shaped fields (
github_token,api_key). Doesn't rewrite paths; doesn't catch values that aren't keys. - Three-pass: field-name + path-scrub + synchronous gitleaks (
[S]). Belt-and-suspenders-and-belt. Multi-second cost at Phase 2+ scale; breaks the continuous-gather model. - Two-pass: field-name + path-scrub; gitleaks at pre-commit and CI but not synchronous (synth). Structural defenses carry the load:
_ProbeOutputValidator'sJSONValuerecursive type + field-name regex (ADR-0010) is the first line; path scrubbing is the Phase-11-specific defense; gitleaks runs overcodewizard-sherpa's own source at pre-commit time and over the analyzed repo's PR at Phase 11 — belt-and-suspenders at the boundaries where the cost is amortized, not in the gather hot path.
Decision¶
OutputSanitizer.scrub(output, repo_root) -> SanitizedProbeOutput is the only path from a ProbeOutput to a persisted byte. Two passes in fixed order:
- Field-name regex filter — defense in depth.
_ProbeOutputValidatorran the same check in the coordinator before this point; the sanitizer's repeat-pass is the second wall in case a future bug routes around the validator. - Absolute → relative path scrubbing — any string matching
^(/Users/|/home/|/root/|<analyzed-repo-abs>/)is rewritten relative to the analyzed-repo root. Load-bearing for Phase 11's PR commits.
No synchronous gitleaks in the write path. gitleaks runs as a pre-commit hook over codewizard-sherpa's source and as a CI step; at Phase 11 it runs over the analyzed repo's PR. The structural defenses (_ProbeOutputValidator's JSONValue recursive type, field-name regex, path scrubber) carry the load-bearing weight.
Tradeoffs¶
| Gain | Cost |
|---|---|
Phase 11's PR commits never leak /Users/<contributor>/ — the path scrubber is the structural defense |
Two passes per ProbeOutput instead of three; if a probe's value (not field name) contains a credential of a format gitleaks would catch but the field-name regex doesn't, it slips through to commit time |
The continuous-gather cost model (per production ADR-0006) holds — Phase 14 webhook fan-out doesn't spawn gitleaks subprocesses per gather |
gitleaks-shaped attacks (an internal credential format gitleaks knows but the field-name regex doesn't) are caught at PR-time, not gather-time |
| The sanitizer is one chokepoint, one module, one test file — Phase 11 adding new output paths means extending the chokepoint, not adding a second writer | Probe authors must trust the sanitizer's path detection; the regex set is enumerated, not heuristic, so unusual mount points (/mnt/...) aren't covered until added |
Defense in depth: the field-name regex runs twice — once in _ProbeOutputValidator, once in OutputSanitizer — closing the bypass-the-validator hole |
Two passes of the same check; trivial cost (~ < 1 ms) |
Writer is yaml.CSafeDumper (banned: yaml.Dumper, yaml.load(...) without Loader=); refusal to overwrite a symlink target (exit 5) |
pyyaml's C extension must be available; fallback to pure-Python yaml.SafeDumper logs a warning |
Consequences¶
src/codegenie/output/sanitizer.pyis the single import-target for the path. TheWriter.writemethod takes aSanitizedProbeOutput, not aProbeOutput— typed enforcement that the sanitizer ran.- Path-scrub regex is
^(/Users/|/home/|/root/|<analyzed-repo-abs>/). The<analyzed-repo-abs>is the runtime-resolved absolute path; new prefixes (e.g.,/mnt/work/) require a one-line PR and a test. - Output files written
0600; directories0700(see ADR-0011 for the CI-cache-restore interaction). The permissions model is a separate ADR. gitleaksis wired at three places outside the gather hot path: (a) pre-commit hook on the contributor's machine, (b)securityCI job overuv.lockandsrc/, (c) Phase 11's PR-opening stage over the analyzed-repo PR. None of these are incodegenie gather.- The sanitizer's two-pass contract is a Phase 0 frozen interface — Phase 1+ adds new probe outputs; the sanitizer's passes don't change without an ADR amendment.
LeakedSecretErroris reserved for a deferred-Phase synchronous-gitleaks call (Phase 11 PR-opening, plausibly Phase 14 for inbound webhook payloads); not raised in Phase 0.
Reversibility¶
Medium. Adding synchronous gitleaks back to the sanitizer is mechanically cheap (one subprocess call, one pre-import of the binary) but reintroduces the cost the synthesis explicitly rejected. The "two-pass" contract is what Phase 11+ assumes; widening to three-pass requires a coordinated update to consumer phases and a CI walltime budget revision. Reversing the direction (removing path scrubbing) is breaks-Phase-11 territory and requires both an ADR amendment and a Phase 11 redesign.
Evidence / sources¶
../final-design.md §2.8(Output writer + sanitizer)../final-design.md §L3 row 7(gitleaks-in-write-path: pre-commit+CI wins 12 vs every-gather's 4)../critique.md §2.1.2(Critic rejects synchronous gitleaks on cost grounds)../critique.md §1.3/§3.3(Critic flags[P]/[B]lacking any sanitizer)../phase-arch-design.md §Component design / Output writer + sanitizer- production ADR-0006 — cost model the synchronous gitleaks rejection serves
- ADR-0010 — the structural defense the sanitizer is layered on top of