ADR-0008: In-process parse caps in parsers/, no per-probe fork+exec sandbox¶
Status: Accepted Date: 2026-05-12 Tags: security · adversarial-input · parser-hardening · chokepoint · phase-evolution Related: Phase 0 ADR-0005, Phase 0 ADR-0007, Phase 0 ADR-0008, ADR-0004
Context¶
Phase 1 is the first phase parsing adversarial bytes from untrusted repos at scale — package.json, pnpm-lock.yaml, tsconfig.json, Chart.yaml, GitHub Actions YAML, Kustomize manifests. The security lens's design (design-security.md) made this the architecture's load-bearing concern and proposed a per-probe fork+exec parser sandbox: every probe runs as python -m codegenie.probes._sandbox <probe-module> with bwrap on Linux, sandbox-exec on macOS, env strip, stdin DEVNULL, and rlimits set pre-exec. The threat closure: ~95% threat surface from YAML/JSON bombs + ~5% from parser-CVE-in-C-extension classes that an in-process check cannot catch.
The critic dismantled the proposal in three blows (critique.md "Attacks on the security-first design" #1, #2, #4):
- The sandbox is a brand-new architectural layer Phase 0 never sanctioned; it inverts the coordinator-runs-probe-in-event-loop model (Phase 0 ADR-0005) without ADR amendment.
- Per-probe ~150–300 ms fork+exec overhead × 6 probes = ~1.5 s of pure overhead per cold-cache gather; CI walltime budget (Phase 0 §3.2) blows.
bwrapis Linux-only; macOS gets deprecatedsandbox-exec; Windows is "not supported." The Goal #2 statement ("Every probe runs in a per-execution parser sandbox with rlimits enforced") is platform-conditional and cannot deliver on its own claim.
The synthesizer's resolution (final-design.md "Conflict-resolution table" row 1): refuse the sandbox. Move all caps in-process to a shared parsers/ module that every probe routes through. Accept the ~5% residual (parser-CVE class) as the explicit Phase 1 risk; close it at the deployment layer in Phase 14 (OS-level rlimits + bwrap on the production worker, where one process serves many repos).
Options considered¶
- Per-probe fork+exec sandbox with bwrap/sandbox-exec ([S]). Strongest in-Phase-1 threat closure. New architectural layer; ABC inversion; platform-conditional; ~1.5 s overhead per cold gather.
- Each probe re-implements size/depth caps inline. No new module. Caps drift across probes; "security goal degrades to mostly enforced" (final-design Components #8 framing).
- Shared
parsers/module —safe_json.py,safe_yaml.py,jsonc.py— each enforcing identical in-process caps +O_NOFOLLOW; every probe routes through it. ~95% threat closure at ~0 ms overhead. No ABC change. No new process model. Parser-CVE-class residual is the explicit acceptance.
Decision¶
Phase 1 ships src/codegenie/parsers/ with three modules — safe_json.py, safe_yaml.py, jsonc.py — and every probe parsing untrusted bytes routes through them. Each module:
- Opens the path with
os.open(path, os.O_RDONLY | os.O_NOFOLLOW)— refuses symlinks at open time, not after read. - Size-checks the file descriptor before parse.
- Calls the underlying parser:
safe_json.load: stdlibjson.loads(the Phase 0 ratified parser).safe_yaml.load:yaml.CSafeLoader(Phase 0 ratified; Phase 0forbidden-patternscontinues to banyaml.load(...)withoutLoader=).jsonc.load: stdlib-only line + block comment stripper, thensafe_json.load.- Post-parse depth-walks the result to enforce
max_depth(the C parsers have no native depth cap). - Raises typed exceptions on cap or shape violations:
SizeCapExceeded,DepthCapExceeded,MalformedJSONError,MalformedYAMLError,SymlinkRefusedError.
The hard caps live with each parser invocation:
- package.json: 5 MB / depth 64.
- Lockfile: 50 MB / depth 64.
- YAML workflow / Helm values / Kustomize: 10 MB / depth 64.
No per-probe fork+exec sandbox is shipped. No codegenie.probes._sandbox module. No bwrap/sandbox-exec dependency. The Phase 0 Coordinator runs probes as asyncio.Tasks in the gather process exactly as ADR-0005 specified.
The ~5% residual (parser-CVE-class exploits that bypass the in-process depth-walker by exploiting _json.c or CSafeLoader internals) is accepted. Mitigations: pip-audit + osv-scanner + Dependabot on pyyaml and cpython continue to surface CVEs at the supply-chain layer; the _ProbeOutputValidator's field-name regex + recursive JSONValue typing + OutputSanitizer's path-scrubber are belt-and-suspenders defenses if parsed bytes do reach output. Phase 14's production worker adds OS-level rlimits + bwrap at the deployment layer where the multi-actor threat model arrives.
Tradeoffs¶
| Gain | Cost |
|---|---|
| ~95% of the YAML-bomb / JSON-bomb / depth-DoS / oversized-input threat surface closes at ~0 ms overhead per parse | ~5% of the threat surface (parser-CVE in _json.c / pyyaml.CSafeLoader) remains; Phase 14 closes it at the deployment layer |
| Phase 0 ADR-0005's coordinator-runs-probe-in-asyncio model is preserved — no new process model, no IPC contract, no ABC inversion | The security lens's stronger Goal #2 ("every probe in a per-execution sandbox") is explicitly not satisfied in Phase 1 |
Cross-platform — O_NOFOLLOW is POSIX (macOS + Linux); no bwrap dependency; no sandbox-exec deprecation concern |
Windows isn't supported (consistent with Phase 0); the in-process model works there if the project ever ships Windows-side |
| Caps are uniform across every probe — one helper enforces the rules; new probes inherit the defense without re-implementing | All probes share the cap defaults; a future Layer C probe needing a 200 MB allowance for SBOM JSON requires a parser-module amendment |
O_NOFOLLOW mitigates symlink-escape at file open — the right layer for the defense (no race between stat + open) |
Adversarial fixtures must include symlink cases; the tests/adv/test_symlink_escape_in_declared_inputs.py test is load-bearing |
| Two passes (size → parse → depth-walk) is ~5% slower than a hypothetical depth-aware parser; immaterial at Phase 1 budgets | Not a depth-aware streaming parser — the parse-then-walk shape doesn't catch CPU-DoS during the parse itself (the cap is bytes, not CPU) |
~1.5 s cold-gather overhead avoided (final-design.md Resource & cost profile); CI walltime stays inside Phase 0 §3.2's budget |
The fork+exec architectural diagram in design-security.md is rejected outright; reviewers reading that doc need to find this ADR |
Phase 0 ADR-0008's OutputSanitizer chokepoint is preserved — no third pass added; the strictness lives at sub-schema validation per ADR-0004 |
A third pass at the chokepoint would have been simpler structurally but violates Phase 0 ADR-0008's freeze (critic §"Attacks on the security-first design" #5) |
Consequences¶
src/codegenie/parsers/ships in Phase 1 withsafe_json.py,safe_yaml.py,jsonc.py, plus typed exceptions inparsers/exceptions.py(orcodegenie.errorsextensions).- Every Phase 1 probe that parses untrusted bytes routes through these. The lockfile parsers (
probes/_lockfiles/) wrapsafe_json/safe_yamland add format-specific shaping. tests/adv/ships ten adversarial fixtures (yaml_billion_laughs,json_bomb_deep_nesting,json_bomb_huge_string,yaml_unsafe_tag,symlink_escape_in_declared_inputs,zip_slip_kustomize,planted_node_on_path_ignored,tsconfig_pathological,regex_dos_yarn_lock,oversized_lockfile); each pins one structural defense and is CI-gating.- Phase 14's
phase-arch-design.md(when written) inherits the parser-CVE residual as a known precondition. The Phase 14 production worker adds rlimits + bwrap at the OS level — the security lens's design isn't discarded, it's relocated to the right layer. - Phase 2's
IndexHealthProbeand other Layer B/C/D/G probes reusesafe_json/safe_yamlforsemgrepJSON output and SCIP manifest parsing — no per-probe duplication. - The
forbidden-patternsPhase 0 hook continues to banyaml.load(...)withoutLoader=and anyLoader != CSafeLoader.
Amended (Phase 1, S1-09 — Soft truncation companion). ResourceBudget gains a sibling field raw_artifact_truncate_mb: int = 5 (the soft on-disk truncation threshold; semantically distinct from the hard raw_artifact_mb ceiling that raises via BudgetingContext.report_bytes). The invariant raw_artifact_truncate_mb <= raw_artifact_mb is enforced at construction by ResourceBudget.__post_init__ (fail loud per Rule 12) — otherwise the soft policy would be unreachable because the hard ceiling fires first. Enforcement is a pure helper codegenie.output.raw_truncation.apply_raw_artifact_truncation(payload, truncate_mb) (functional core: bytes-in, bytes-out + a tagged-union TruncationOutcome of Untruncated | Truncated), invoked from codegenie.cli's raw-artifact collection loop (imperative shell); Writer.write is unchanged (ADR-0011 chokepoint preserved). On truncation, payload bytes are replaced with a JSON wrapper {"__truncated_at_budget__": true, "original_bytes": ..., "budget_bytes": ..., "data": ...} (the data field is the parsed JSON value if the prefix is valid JSON; else the prefix decoded as utf-8 with errors="replace") and the event probe.raw_artifact.truncated is emitted with structured fields probe, original_bytes, budget_bytes, path, run_id. Boundary semantics mirror report_bytes — inclusive at the limit, exclusive above (> not >=). The original phase-arch-design.md §Gap analysis Gap 2 prescription (add Probe.declared_raw_artifact_budget_mb to the ABC) was superseded by this route to preserve Phase 0 ADR-0007's contract freeze and avoid duplicating the existing ResourceBudget mechanism (Rule 7 — surface conflicts, don't average them; see _validation/S1-09-raw-artifact-budget.md). NodeManifestProbe (S3-05) will override via declared_resource_budget = ResourceBudget(raw_artifact_mb=50, raw_artifact_truncate_mb=25). Budgets > 50 MB hard ceiling continue to require a further ADR amendment.
Reversibility¶
Medium. Adding a per-probe sandbox later (e.g., Phase 14) is mechanically additive — the parser modules continue to work; the sandbox wraps the probe run() call from outside. Removing the in-process caps is irreversible-feeling: probes have come to rely on the typed exceptions for degrading to confidence: low; ripping out parsers/ requires every probe to re-implement caps inline, which is the failure mode the synthesis explicitly avoided. The parser-CVE residual closure is Phase 14's job; bringing it forward to Phase 1 would require resurrecting the sandbox design, which this ADR rejects.
Evidence / sources¶
../final-design.md "Components" #8 Safe-parse helpers— design statement../final-design.md "Conflict-resolution table" row 1— the resolution../final-design.md "Risks" #2— the parser-CVE residual register../phase-arch-design.md "Component design" #8 Safe-parse helpers— interface../phase-arch-design.md "Non-goals" #2— explicit sandbox rejection../phase-arch-design.md "Path to production end state"— Phase 14 closes the residual../phase-arch-design.md "Testing strategy" "Adversarial tests"— the ten fixtures../critique.md "Attacks on the security-first design"#1, #2, #4 — the dismantling- Phase 0 ADR-0005 — the process model this preserves
- Phase 0 ADR-0008 — the chokepoint this avoids editing