Story S7-04 — Adversarial corpus: hostile-skills + concurrent-gather + no-inmemory-leak + phase3-handoff-skipped¶
Step: Step 7 — Plant five-repo fixture portfolio + per-probe golden files + remaining adversarial corpus
Status: Done — GREEN 2026-05-18 (phase-story-executor; see _attempts/S7-04.md for the per-AC evidence table + gate log)
Effort: M
Depends on: S7-03 (~70 goldens exist on disk; regen script proven byte-deterministic across two runs — the adversarial corpus exercises code paths that touch the same writer chokepoint the goldens audit)
ADRs honored: ADR-0005 (no plaintext persisted — test_no_inmemory_secret_leak.py is the type-level "redactor was called" boundary proof), ADR-0006 (IndexFreshness location — test_phase3_handoff_smoke.py references the Protocol contract that 02-ADR-0006/02-ADR-0007 lock), ADR-0007 (no plugin loader in Phase 2 — test_phase3_handoff_smoke.py is grep-discoverable so Phase 3's author finds it on first repo scan), ADR-0009 (pytest-xdist veto — test_concurrent_gather_race.py uses subprocess.Popen for two-process concurrency, NOT pytest-level parallelism), ADR-0010 (RedactedSlice smart constructor — the no-inmemory-leak test verifies the smart-constructor invariant via AST).
Validation notes (2026-05-18, phase-story-validator)¶
The original draft assumed an architecture that does not match the actual codebase. Verified-against-source edits:
RedactedSlicehas TWO production constructor sites, not one.sanitizer.redact_secrets(per-probe-slice path) ANDenvelope_redactor._build_redacted_slice_pass(envelope path the CLI actually uses viacli._seam_redact_envelope→envelope_redactor._redact_envelope→Writer.write). AC-15 was rewritten to assert an explicit two-site closed set within thecodegenie.outputpackage, plus a documented allowlist fortests/unit/output/fixtures that legitimately constructRedactedSliceto test its own invariants.OutputSanitizer.scrubreturnsSanitizedProbeOutput, NOTRedactedSlice. AC-16 was rewritten to assert (a)Writer.write's first non-self parameter is annotatedRedactedSlice(true atsrc/codegenie/output/writer.py:158); (b)envelope_redactor._redact_envelope's return annotation isRedactedSlice. The false "scrub returns RedactedSlice" claim is gone.Writer.writeis called from the CLI's top-level gather command (cli.py:397), not fromOutputSanitizer.scrub. AC-17 was rewritten to assert the actual closed call-site set rooted incli.py.- No advisory lock at
.codegenie/cache/.lockexists. The actual Phase-0 concurrency contract per Phase-0phase-arch-design.md §789edge-case 12 + Phase-0 S5-01 isO_APPENDrecord-level atomicity for records ≤PIPE_BUF=4096plus atomic blob writes via<dest>.tmp → os.replace. AC-7 through AC-13 were rewritten around the actual invariants the test must prove, withtests/unit/test_cache_concurrent.pynamed as the Phase-0 precedent the new test extends. - AC-23's example Protocol signature was aspirational. Replaced with the verbatim shape from
src/codegenie/adapters/protocols.py:77. - AC-18 referenced
pyproject.tomlfor themodel_constructban. Corrected toscripts/check_forbidden_patterns.py(_PHASE2_BANNED_PACKAGESincludes"output"). - Two new ACs (AC-30, AC-31) address coverage gaps: non-UTF8 / control-byte hostile YAML, and post-race JSONL-index parseability.
- Notes-for-implementer extended with: aliased-import resilience for the AST walker; wall-clock enforcement mechanism for AC-4; mypy drift trip-wire embedding; and the explicit two-site
ALLOWED_CONSTRUCTOR_SITESallowlist constant.
Full audit log: _validation/S7-04-adversarial-corpus-completion.md.
Context¶
This story lands the remaining adversarial corpus under tests/adv/phase02/. Four tests, each addressing a load-bearing risk-or-gap the Phase 2 design names explicitly:
test_hostile_skills_yaml.py— ≥ 8 hostile YAML cases againstSkillsLoader(S2-01). Tests defense against!!python/object(RCE attempt), billion-laughs (entity expansion), deep-nesting (recursion/stack), symlink-escape filenames, NUL-byte-in-name, oversized files. None executes user code. None mutates host state. All paths produce typedResult.Err(SkillsLoadError(reason=...))or are refused atO_NOFOLLOWopen time.test_concurrent_gather_race.py— two concurrentcodegenie gatherinvocations against the same fixture. The actual Phase-0 concurrency contract (phase-arch-design.md §789edge-case 12, Phase-0 S5-01tests/unit/test_cache_concurrent.py) isO_APPENDrecord-level atomicity for records ≤PIPE_BUF=4096plus atomic blob writes via<dest>.tmp → os.replace— NOT a.codegenie/cache/.lockadvisory lock (no such primitive exists insrc/codegenie/; verified bygrep -rn "flock\|filelock\|portalocker" src/codegenie/→ empty). The test asserts that two concurrent gathers produce a consistentindex.jsonl(every line parses as JSON) and consistent blobs (every file's content-hash matches its filename). The test is deterministic despite testing a concurrency surface — it uses real subprocess concurrency + explicit signal coordination if needed, not random timing.test_no_inmemory_secret_leak.py— Gap 5 defense + Risk #6 defense. Uses AST inspection to verify that (a)RedactedSliceconstruction is restricted to the two-site closed set withincodegenie.output—sanitizer.redact_secrets(per-probe-slice path) ANDenvelope_redactor._build_redacted_slice_pass(envelope path the CLI's_seam_redact_envelopeuses); (b)Writer.write's signature accepts onlyRedactedSlice(already true atwriter.py:158); (c)envelope_redactor._redact_envelope's return annotation isRedactedSlice; (d)Writer.writeis called only from the CLI's gather seam (cli.py:397) — adding a third call site fails the test. This is a structural boundary test, not a behavioral one — it reads source AST + signatures rather than executing the pipeline. Without it, a future contributor addingRedactedSlice.from_existing(...)in a random probe module silently breaks the type-level guarantee.test_phase3_handoff_smoke.py— Gap 1 defense. Landed@pytest.mark.skip(reason="enabled when Phase 3 plugin lands"). Phase 3's first author finds the test on first repo scan (grep -r "enabled when Phase 3 plugin lands"); unskips it at the Phase-3-entry-gate review; the test then asserts the four adapterProtocols from S1-03 are importable AND their signatures match Phase 3's first plugin's expectations. Any Protocol drift between Phase 2 and Phase 3 triggers an ADR amendment to 02-ADR-0006/02-ADR-0007, surfaced loudly at the entry-gate.
The synthesis ledger pins three risks this story directly defends:
- Risk #2 (Probe ABC not edited). Adversarial corpus must not trip an unintended
ProbeContext/ProbeABC widening. Tests run against the frozen contract. - Risk #4 (
mypy --warn-unreachableenforcement). Adversarial tests in this story exercise the typedResultpaths; mypy-unreachable false-positives would mask test failures. The structural test in particular relies on AST inspection of typed code. - Risk #6 (
RedactedSlicesmart-constructor silent break).test_no_inmemory_secret_leak.pyis the front-line defense. The AST call-site count ofRedactedSliceconstruction in production code (src/codegenie/) must equal exactly two, and those two sites are the documented closed set:sanitizer.redact_secrets+envelope_redactor._build_redacted_slice_pass. A third production-code construction site means the smart-constructor guarantee is broken; adding one requires an ADR amendment to 02-ADR-0010. Test-fixture constructions undertests/unit/output/are explicitly allowlisted (those tests exist to exercise the smart constructor's invariants directly and are the canonical anti-regression for the model itself).
References — where to look¶
- Architecture:
../phase-arch-design.md §"Gap analysis" → Gap 1(Adapter Protocol drift —test_phase3_handoff_smoke.pyis the named trip-wire).../phase-arch-design.md §"Gap analysis" → Gap 5(No-explicit Phase-4 RAG-store handoff contract —test_no_inmemory_secret_leak.pyis the named boundary test).../phase-arch-design.md §"Gap analysis" → Gap 4(RedactedSlicesmart-constructor — ADR-0010 amendment).../phase-arch-design.md §"Testing strategy" → "Adversarial tests"— the test inventory.../phase-arch-design.md §"Implementation risks"#2, #4, #6, #8.- Phase ADRs: ADR-0005 (no plaintext persistence), ADR-0006 (
IndexFreshnesslocation — Protocol contract anchor), ADR-0007 (no plugin loader in Phase 2 — the Phase-3 handoff trip-wire's rationale), ADR-0009 (pytest-xdist veto — concurrency tests live at subprocess level), ADR-0010 (RedactedSlicesmart constructor). - Implementation plan:
../High-level-impl.md §"Step 7"— adversarial-corpus bullet list,inspect-based discipline. - Source design:
../final-design.md §"Gaps"table +§"Phase-2 → Phase-3 handoff"checklist. - Existing code:
src/codegenie/skills/loader.py(S2-01 — theSkillsLoaderunder test).src/codegenie/output/sanitizer.py(S3-01 —SecretRedactor; firstRedactedSliceconstructor at line ~434).src/codegenie/output/envelope_redactor.py(the secondRedactedSliceconstructor at_build_redacted_slice_pass, ~line 247 — the envelope-level path the CLI uses viacli._seam_redact_envelope).src/codegenie/output/redacted_slice.py(S3-02 —RedactedSlicesmart constructor model).src/codegenie/output/writer.py:153–175(S3-03 —Writer.write(envelope: RedactedSlice, …); the type-system defense).src/codegenie/cli.py:349–397(the onlyWriter.writecall site;_seam_redact_envelopebuilds theRedactedSliceviaenvelope_redactor._redact_envelope).src/codegenie/adapters/protocols.py:37–121(S1-03 — the four@runtime_checkableProtocols this story's Phase-3 handoff test references; actual signatures land here, copy verbatim into AC-23's frozen tuple).src/codegenie/cache/store.py:170–337(Phase 0 —CacheStore.putisO_APPEND+ atomic blob write via<dest>.tmp → os.replace; NOT a.codegenie/cache/.lockadvisory lock).scripts/check_forbidden_patterns.py:55–80, 247–253(themodel_constructban;_PHASE2_BANNED_PACKAGESincludes"output").tests/adv/phase02/test_stale_scip_fixture.py(S4-02 — adversarial-test directory convention).tests/adv/phase02/test_secret_in_source.py(S6-07 — the behavioral secret-leak test this story's structural test complements).tests/unit/test_cache_concurrent.py(Phase-0 S5-01 — the precedenttest_concurrent_gather_race.pyextends to a two-gather scenario; usessubprocess.Popentwo-process invocation per ADR-0009).tests/unit/output/test_redacted_slice.py(the 20+ legitimate test-fixture constructors that motivate the test-allowlist in AC-15; these tests exercise the smart-constructor model directly).
Goal¶
Four adversarial tests exist under tests/adv/phase02/:
test_hostile_skills_yaml.py— ≥ 8 cases, each producing typedResult.ErrfromSkillsLoaderor refused atO_NOFOLLOW.test_concurrent_gather_race.py— two concurrent gathers; advisory lock holds; cache is consistent (no half-written blob).test_no_inmemory_secret_leak.py—inspect-based structural test;redact_secretsis the singleRedactedSlicecall site; every artifact reachable fromOutputSanitizer.scrubto writer passes through it.test_phase3_handoff_smoke.py— skipped per@pytest.mark.skip(reason="enabled when Phase 3 plugin lands"); grep-discoverable; asserts (when unskipped) that the fourProtocols import unchanged and signatures match Phase-3's-first-plugin expectations.
All four go into the adv-phase02 CI job (wired in S8-03) as part of the load-bearing gate.
Acceptance criteria¶
test_hostile_skills_yaml.py — ≥ 8 cases
- [ ] AC-1.
tests/adv/phase02/test_hostile_skills_yaml.pyexists; covers ≥ 8 distinct hostile cases: - Case 1 —
!!python/object. A skill file containing!!python/object:os.system args: ["touch /tmp/pwned"]. Loader returnsResult.Err(SkillsLoadError(reason="unsafe_yaml")). No/tmp/pwnedexists after the test (assertion). - Case 2 —
!!python/object/applyvariant. Same defense; same outcome. - Case 3 — billion-laughs. YAML entity-expansion bomb. Loader caps
safe_yaml.load'smax_bytes(S3 reference from Phase 1) ANDsafe_yaml's entity-expansion is disabled. Loader returnsResult.Err(SkillsLoadError(reason="schema"))OR caps out and returnsResult.Err(reason="oversized"). Wall-clock < 1 s (the cap-out is fast). - Case 4 — deep nesting. 1000 levels of nested
{a: {a: {a: ...}}}. Loader returnsResult.Err(reason="depth_exceeded")ORreason="schema". Wall-clock < 1 s. Test asserts noRecursionErrorescapes. - Case 5 — symlink-escape filename. Plant a symlink at
<tier>/skill-A.yaml → /etc/passwd(or any out-of-tree path). Loader'sO_NOFOLLOWopen refuses withELOOP;Result.Err(SkillsLoadError(reason="symlink_refused", path=...)). - Case 6 — NUL-byte in filename. A filename literally containing a NUL byte (constructed via
os.symlinkor low-levelos.openif the filesystem allows; otherwise the test skips with a clear reason). Loader rejects viaResult.Err(reason="invalid_filename")OROSErroris caught at the loader boundary and surfaced asResult.Err. - Case 7 — oversized file (50 MB + 1 byte; just past
safe_yaml.load's declared cap). Loader returnsResult.Err(reason="oversized"). Wall-clock < 5 s. - Case 8 — duplicate-key YAML.
{a: 1, a: 2}. Per Phase-1safe_yamldiscipline, duplicate keys produceResult.Err(reason="schema"). Loader propagates. - (Recommended) Case 9 — yaml-bomb via alias chain. Long chain of
&a *b-style references; combined size cap or alias-count cap rejects it. - (Recommended) Case 10 — non-UTF8 / control-byte payload. A YAML file whose bytes are not valid UTF-8 (e.g., raw
\xff\xfeBOM with mixed-encoding scalars), OR a UTF-8 file containing control bytes (\x00–\x08,\x0b–\x1f, excluding\t\n\r) inside string scalars. Loader returnsResult.Err(SkillsLoadError(reason="schema"))orreason="io_failure"per the closedSkillsLoadErrorreason set insrc/codegenie/skills/loader.py. Wall-clock < 1 s. - [ ] AC-2 — no user code executes. For each case, after the test runs, no file under
/tmp/,/var/folders/,$HOMEwas created by the YAML's payload. The!!python/objectcases assert no/tmp/pwned-<random>exists. - [ ] AC-3 — no host-state mutation. No environment variable was set; no signal was raised. (Defense against esoteric YAML payloads that exercise
__reduce__chains viasafe_yamlparsers other than the Phase-1-blessed one.) - [ ] AC-4 — wall-clock per case < 5 s. Each parametrized case completes in under 5 s. (Otherwise an attacker DoSes the gatherer with a crafted hostile skill file.)
- [ ] AC-5 — typed
Result.Errpaths. Every hostile case produces aResult.Err(SkillsLoadError(reason=<one of allowlisted reasons>)). The allowlisted-reasons set is grep-discoverable insrc/codegenie/skills/loader.py; the test asserts the reason against the closed set. - [ ] AC-6 — fixture under
tests/adv/phase02/fixtures/hostile_skills/. Each hostile YAML case lives as a separate file undertests/adv/phase02/fixtures/hostile_skills/<case-name>.yaml(the symlink and NUL-byte cases land under a sibling_create_at_test_time.pyfixture-builder that constructs the path at test time — symlinks and NUL-byte names are not safely committable to git on all platforms). The fixture builder cleans up in a pytest fixture teardown.
test_concurrent_gather_race.py — Phase-0 O_APPEND atomicity + atomic blob writes
Validation note: the original story referenced a
.codegenie/cache/.lockadvisory lock. No such primitive exists insrc/codegenie/(verified by grep: nofcntl.flock, nofilelock, noportalockerin production code). The actual Phase-0 contract perphase-arch-design.md §789edge-case 12 and S5-01 isO_APPENDrecord-level atomicity for records ≤PIPE_BUF=4096plus atomic blob writes via<dest>.tmp → os.replace. Phase-0 S5-01'stests/unit/test_cache_concurrent.pyis the precedent — this test extends that pattern to the fullcodegenie gatherCLI surface.
- [ ] AC-7.
tests/adv/phase02/test_concurrent_gather_race.pyexists; launches two concurrentcodegenie gatherinvocations against the same fixture (tests/fixtures/portfolio/minimal-ts/) viasubprocess.Popen(NOTmultiprocessing, NOTasyncio.gather— independent OS processes are what exercises theO_APPENDkernel-atomicity guarantee per S5-01's Notes-for-implementer rationale). - [ ] AC-8 —
.codegenie/cache/index.jsonlparses line-by-line post-race. After both processes exit, opening.codegenie/cache/index.jsonland iterating line-by-line: every line is valid JSON (json.loads(line)succeeds; no torn records). This is the actual Phase-0O_APPENDinvariant. No advisory-lock assertion (none exists). - [ ] AC-9 — blob consistency post-race. Every file under
.codegenie/cache/blobs/is internally consistent: blob filename starts withcontent_hash(file_bytes)[:N]for the documented N; no.tmpfile remains (atomic-replace contract held); no zero-byte files. Verified by walking the tree. - [ ] AC-10 —
repo-context.yamlconsistency. Whichever process completed last produces arepo-context.yamlthat round-trips throughyaml.safe_load(no exception) and matches one of: A's-final-output OR B's-final-output (no half-merged hybrid). - [ ] AC-11 — deterministic. The test passes 100 / 100 runs on the implementer's machine + CI runner. NOT flake-tolerant. Implementer enforces via either (a)
pytest --count=100ifpytest-repeatis admitted, or (b) a one-shotfor i in $(seq 100); do pytest tests/adv/phase02/test_concurrent_gather_race.py || break; donerecipe documented in the PR. If flaky, surface as a real concurrency bug — do NOT mark as@pytest.mark.flaky. Most-likely root cause when flaky: blobos.replacenot fsync'd before index-line append. - [ ] AC-12 — wall-clock < 60 s. The test completes (both processes terminate) within 60 s. Two cold gathers against
minimal-tsshould fit easily; if not, the timeout indicates a pathology, not a test problem. - [ ] AC-13 —
ADR-0009honored at this test. The test usessubprocess.Popenfor two-process concurrency, NOTpytest-xdist. Pytest-xdist remains vetoed repo-wide; this AC scopes the commitment to this specific test. - [ ] AC-31 — post-race JSONL count matches expectation. After both gathers exit,
.codegenie/cache/index.jsonlhas at leastN_unique_keyslines (whereN_unique_keysis the number of cache keys exercised byminimal-ts); duplicates are permitted (both processes may have written the same key — both records are valid JSON; consumer dedups by key). The test asserts no line was silently lost to torn-write corruption.
test_no_inmemory_secret_leak.py — Gap 5 + Risk #6 + ADR-0010 structural test
- [ ] AC-14.
tests/adv/phase02/test_no_inmemory_secret_leak.pyexists; uses Python's stdlibastmodule (NOTmock, NOTinspect.getsourceregex, NOTpytest-asyncioexecution) to verify the four structural invariants below. - [ ] AC-15 —
RedactedSliceconstruction is restricted to a closed two-site set inside thecodegenie.outputpackage. The test: - Declares
ALLOWED_CONSTRUCTOR_SITES: Final[frozenset[tuple[str, str]]]at module top — the two permitted production sites as(relative_file_path, qualified_function_name)pairs:("src/codegenie/output/sanitizer.py", "redact_secrets")("src/codegenie/output/envelope_redactor.py", "_build_redacted_slice_pass")
- Declares
ALLOWED_TEST_CONSTRUCTOR_DIRS: Final[frozenset[str]] = frozenset({"tests/unit/output"})— the documented test allowlist (tests/unit/output/test_redacted_slice.pyand siblings exist to exercise the smart constructor's invariants directly; they are the canonical anti-regression for theRedactedSlicemodel itself). - Uses
astto parse every Python source file undersrc/codegenie/andtests/(excluding_validation/,_attempts/,__pycache__/, and the test file itself). - Builds a per-module import alias map (
{local_name: real_name}) from everyast.ImportFromandast.Importwhose target iscodegenie.output.redacted_slice.RedactedSlice— so thatfrom codegenie.output.redacted_slice import RedactedSlice as _RS; _RS(...)is correctly resolved as aRedactedSliceconstruction. Aliased imports must NOT slip past the walker. - For every
Callnode whosefunc(after alias resolution) constructsRedactedSlice— either bare-name,.model_validate(...),.model_validate_json(...), or.model_construct(...)— assert the enclosing(file, function)pair is inALLOWED_CONSTRUCTOR_SITES(production code) OR the file lives underALLOWED_TEST_CONSTRUCTOR_DIRS(test fixtures). Any other site FAILS with the AC-19 message. - Defense-in-depth: also assert each of the two allowed production sites actually contains at least one construction (regression guard against silent removal of the redactor — same shape as the original AC-15 second-half assertion).
- Adding a third production site is an ADR amendment to 02-ADR-0010, not a code edit; the test's failure message says so.
- [ ] AC-16 — writer signature + envelope-redactor return shape pin
RedactedSliceas the only artifact reaching the writer. The test: - Parses
src/codegenie/output/writer.py. AssertsWriter.write's first non-self parameter (envelope) has annotationRedactedSlice(this is already true atsrc/codegenie/output/writer.py:158; the test pins the contract against future loosening). A writer signature acceptingdict[str, JSONValue]would FAIL. - Parses
src/codegenie/output/envelope_redactor.py. Asserts_redact_envelope's return annotation isRedactedSlice. The CLI's_seam_redact_envelopeis the composition seam; if_redact_envelopeever returns adict, the writer's signature would refuse it at runtime, but a contributor could weaken the writer signature first — this AC pins both sides. - Removed false claim: the original story asserted
OutputSanitizer.scrubreturnsRedactedSlice. It does not (it returnsSanitizedProbeOutputpersanitizer.py:208–241). The actual envelope-level redaction lives inenvelope_redactor._redact_envelope. Validation 2026-05-18 corrected this. - [ ] AC-17 — closed set of
Writer.writecall sites. The test AST-walks all ofsrc/codegenie/. AssertsWriter.writeis called from exactly the documented CLI seams insrc/codegenie/cli.py(one for the envelope atcli.py:397; an audit / verify seam if present). The test declaresALLOWED_WRITER_CALL_SITES: Final[frozenset[tuple[str, str]]]and fails if the AST walk discovers anyCalltoWriter.write(orWriter().write, or_writer.writewhere_writerresolves toWriter) outside that set. Adding a third call site is an explicit edit to this AC's allowlist constant — a deliberate code review event. - [ ] AC-18 —
model_constructbanned undersrc/codegenie/output/. Inherited from S1-11'sforbidden-patternsextension; this test asserts the ban is in place by: - Reading
scripts/check_forbidden_patterns.py(the actual ban site per S1-11 + 02-ADR-0010; NOTpyproject.toml— validation 2026-05-18 corrected this) and asserting"output"is in the rule's_PHASE2_BANNED_PACKAGESfrozenset. - AST-walking
src/codegenie/output/and confirming zeromodel_construct(...)call sites and zeromodel_construct=kwarg / assignment occurrences. - This is defense-in-depth — the pre-commit hook (
scripts/check_forbidden_patterns.py) is the front-line; this test catches the pre-commit-bypass case (e.g.,--no-verifycommit landing on a feature branch). - [ ] AC-19 — failure messages name the offending file + line + the named remediation. If a future contributor adds
RedactedSlice.from_existing(...)attests/integration/test_X.py:42, the test fails with the literal message:"RedactedSlice constructed at tests/integration/test_X.py:42 (call to 'RedactedSlice.from_existing') is outside the documented two-site closed set (sanitizer.redact_secrets + envelope_redactor._build_redacted_slice_pass) and the tests/unit/output/ allowlist. The smart-constructor invariant (02-ADR-0010, S3-02, Gap 4) requires construction to be restricted to the redaction pipeline. To add a third construction site, amend 02-ADR-0010 and update ALLOWED_CONSTRUCTOR_SITES in this test file. See docs/phases/02-context-gather-layers-b-g/ADRs/0010-redacted-slice-smart-constructor-at-writer-boundary.md." - [ ] AC-20 — passes
mypy --strict. NoAnyoutside theast.NodeVisitor's necessaryAnyreturns; even those carry explicit type narrowing.
test_phase3_handoff_smoke.py — Gap 1 trip-wire, landed @pytest.mark.skip
- [ ] AC-21.
tests/adv/phase02/test_phase3_handoff_smoke.pyexists. - [ ] AC-22 — skipped, grep-discoverable. The test is decorated with
@pytest.mark.skip(reason="enabled when Phase 3 plugin lands — see docs/phases/02-context-gather-layers-b-g/ADRs/0007-no-plugin-loader-in-phase-2.md and ../High-level-impl.md §Step 7 Phase-3-handoff bullet"). Thereason=string is literal and grep-discoverable; a Phase-3 author runninggrep -r "enabled when Phase 3 plugin lands" tests/finds it instantly. - [ ] AC-23 — contents (skipped but written). The test, when unskipped, asserts:
- The four
ProtocolsDepGraphAdapter,ImportGraphAdapter,ScipAdapter,TestInventoryAdapterare importable fromcodegenie.adapters.protocols(their exact import paths are pinned in S1-03). - Each Protocol's runtime structural conformance is verifiable via
isinstance(stub_object, ProtocolClass)wherestub_objectis a minimal mock implementing the Protocol's method signatures. - The
AdapterConfidencediscriminated union has exactly three variants:Trusted,Degraded,Unavailable. - The Protocol method signatures match the verbatim shapes currently in
src/codegenie/adapters/protocols.py(copy at test-write time; validation 2026-05-18 corrected the original aspirational example). Embed the four signatures as a frozen tuple in the test file, e.g.:Drift on any signature → test fails at unskip-time → Phase-3 author must amend 02-ADR-0006/02-ADR-0007 explicitly._FROZEN_S1_03_SIGNATURES: Final[tuple[tuple[str, str, str], ...]] = ( ("DepGraphAdapter", "consumers", "(self, pkg: str) -> list[str]"), ("DepGraphAdapter", "producers", "(self, pkg: str) -> list[str]"), ("DepGraphAdapter", "confidence", "(self) -> AdapterConfidence"), ("ImportGraphAdapter", "reverse_lookup", "(self, module: str) -> list[str]"), ("ImportGraphAdapter", "confidence", "(self) -> AdapterConfidence"), ("ScipAdapter", "refs", "(self, symbol: str) -> list[Occurrence]"), ("ScipAdapter", "confidence", "(self) -> AdapterConfidence"), ("TestInventoryAdapter", "tests_exercising", "(self, symbol: str) -> list[TestId]"), ("TestInventoryAdapter", "confidence", "(self) -> AdapterConfidence"), ) - [ ] AC-24 — comment block explains the trip-wire purpose. Top of file: a comment block names Gap 1, names ADR-0007, names the Phase-3 entry-gate-review process (S8-04 lands the named issue), and instructs the Phase-3 author on the unskip ritual ("If your first plugin patches one of these Protocols, that's a contract amendment — DO NOT silently change the Protocol; file an ADR amendment to 02-ADR-0006/02-ADR-0007 first, then update this test, THEN unskip.").
- [ ] AC-25 — passes
mypy --stricteven while skipped, AND a deliberate Protocol drift breaks mypy on this file (the silent-drift defense). Mypy type-checks skipped tests; the Protocol-conformance assertion code must type-check against the current S1-03 Protocols. The test file MUST embed aProtocol-typed helper of the formdef _frozen_dep_graph_signature(adapter: DepGraphAdapter) -> None: reveal_type(adapter.consumers); reveal_type(adapter.producers); ...(or equivalentcast(DepGraphAdapter, _stub)chain) whose parameter / call-site annotations re-state every S1-03 contract by name. If any S1-03 Protocol signature changes, mypy MUST fail on this file (e.g., the call toadapter.consumers(pkg=...)would surface the new keyword-only parameter shape). This is the type-system trip-wire — it fires before unskip, at every CI run, against the current S1-03 Protocols. Document the trip-wire in the test's top-of-file comment block. - [ ] AC-30 —
RedactedSliceimport resolver handles aliased imports. The AST walker intest_no_inmemory_secret_leak.pyMUST correctly classify a construction made viafrom codegenie.output.redacted_slice import RedactedSlice as _RS; _RS(...)as aRedactedSliceconstruction. The test includes an inline regression case (a temp module string parsed viaast.parse) that exercises the aliased-import path through the walker. Without this, a contributor who aliases the import silently slips past the structural test.
Determinism, audit hygiene, type cleanliness
- [ ] AC-26 — every test passes
mypy --strict. - [ ] AC-27 — no flakes. Each adversarial test passes 100/100 runs on CI. If
test_concurrent_gather_race.pyexhibits any flake, the implementer MUST stabilize via explicit signal synchronization before merging. - [ ] AC-28 — fixtures are minimal.
tests/adv/phase02/fixtures/hostile_skills/does not exceed ~10 small YAML files. The symlink + NUL-byte cases are built at test time (not committed) per AC-6. - [ ] AC-29 — log-emission assertions where relevant.
test_hostile_skills_yaml.pyasserts eachResult.Erris also emitted as a structured log event (e.g.,probe.skill.load_refusedwith areasonfield) — per the cross-cutting structlog discipline.
Implementation outline¶
- TDD red — write
test_hostile_skills_yaml.pyfirst. Plant ≥ 8 parametrized cases; each invokesSkillsLoader.load_all()against a tier-rooted attests/adv/phase02/fixtures/hostile_skills/; assertsResult.Err. With no fixture files yet, each case fails red (no skill file found). - Plant the hostile-YAML fixtures under
tests/adv/phase02/fixtures/hostile_skills/per AC-1's case list. Plant_create_at_test_time.pyfor the symlink + NUL-byte cases. Confirm each fixture file is parseable-as-YAML by an attacker tool (so we're testing real hostile input, not malformed-text edge cases — except where malformed-text IS the case, like duplicate keys). - Run
pytest tests/adv/phase02/test_hostile_skills_yaml.py -v. Adjust theSkillsLoaderorsafe_yamlchokepoint as needed (if a case slips through to RCE or to wall-clock-cap-violation, fix the production code — the test caught a real bug). Green. - TDD red — write
test_concurrent_gather_race.py. Usesubprocess.Popento launch twocodegenie gather tests/fixtures/portfolio/minimal-ts/invocations (precedent: Phase-0 S5-01tests/unit/test_cache_concurrent.py). Assert the Phase-0O_APPENDinvariants (AC-8: everyindex.jsonlline parses; AC-9: every blob filename matches its content hash; AC-31: line count ≥ N_unique_keys). Do NOT assert any.codegenie/cache/.lockbehavior — no such primitive exists. - If the test is flaky, do not paper over it with an advisory lock. Flakes mean either a real
O_APPENDviolation (record >PIPE_BUF) or a non-atomic write somewhere in the cache path — fix the bug. If you genuinely need deterministic ordering between A and B (e.g., for a follow-up assertion), use explicit signal synchronization: Process A pauses onSIGUSR1immediately after a documented checkpoint; the test sendsSIGUSR1to A; A continues. This is windowing, not race-tolerance. - Run 100 times. Confirm 100/100 pass. Green.
- TDD red — write
test_no_inmemory_secret_leak.py. Build the AST-walker that findsRedactedSliceconstruction call sites. With the production code as-is (per S3-01/S3-02/S3-03), the test should pass on first run — there should be exactly one call site, insideredact_secrets. If the count is wrong, debug the production code, not the test. - Verify the failure mode: temporarily add
RedactedSlice(slice={}, findings_count=0, fingerprints=[])to a scratch file undertests/_scratch/. Run the test; observe it fails with the expected message (AC-19). Delete the scratch file. Re-run; observe green. Document the deliberate-fail-then-pass observation in PR. - TDD red — write
test_phase3_handoff_smoke.py. Write the test body (per AC-23). Decorate with@pytest.mark.skip(reason="enabled when Phase 3 plugin lands — see ...")per AC-22. Write the comment block per AC-24. - Verify the unskip path. Temporarily remove the
@pytest.mark.skipdecorator. Run the test. With S1-03's Protocols in place, the test should pass. Confirm. Re-apply the skip decorator. Run again. Pytest records the skip; the test counts as passing (pytest convention). - Final pass:
mypy --strict,ruff check,ruff format --check. Run all four adversarial tests. Run the full Phase 2 test suite. Green.
TDD plan — red / green / refactor¶
Red — failing adversarial tests first¶
test_no_inmemory_secret_leak.py AST-walker skeleton (hardened 2026-05-18):
# tests/adv/phase02/test_no_inmemory_secret_leak.py
"""Structural boundary test — Gap 5, Risk #6, ADR-0010.
The smart-constructor invariant: RedactedSlice may only be constructed from the documented
two-site closed set inside the codegenie.output redaction pipeline, plus the
tests/unit/output/ allowlist for the model's own anti-regression tests. Adding a third
production-code construction site silently breaks the type-level "redactor was called"
guarantee — this test fails loudly if that happens.
The two documented production sites:
- codegenie.output.sanitizer.redact_secrets (per-probe-slice path)
- codegenie.output.envelope_redactor._build_redacted_slice_pass (envelope path; CLI uses)
Adding a third site requires an explicit ADR amendment to 02-ADR-0010, then editing
ALLOWED_CONSTRUCTOR_SITES below.
"""
from __future__ import annotations
import ast
from pathlib import Path
from typing import Final, NamedTuple
_REPO_ROOT = Path(__file__).parent.parent.parent.parent # tests/adv/phase02/ -> repo root
_SRC = _REPO_ROOT / "src"
_TESTS = _REPO_ROOT / "tests"
_REDACTED_SLICE_QUALNAME: Final[str] = "codegenie.output.redacted_slice.RedactedSlice"
# Closed two-site set. Adding a third entry = ADR amendment.
ALLOWED_CONSTRUCTOR_SITES: Final[frozenset[tuple[str, str]]] = frozenset({
("src/codegenie/output/sanitizer.py", "redact_secrets"),
("src/codegenie/output/envelope_redactor.py", "_build_redacted_slice_pass"),
})
# Test allowlist — the model's own anti-regression tests legitimately construct
# RedactedSlice. Any test under this directory may construct.
ALLOWED_TEST_CONSTRUCTOR_DIRS: Final[frozenset[str]] = frozenset({
"tests/unit/output",
})
class _CallSite(NamedTuple):
file: str # relative to repo root, POSIX-style separators
line: int
enclosing_func: str # qualified name of the nearest enclosing FunctionDef / AsyncFunctionDef
call_text: str
def _build_alias_map(tree: ast.Module) -> dict[str, str]:
"""Return {local_name: real_qualified_name} for every import in the module.
Resolves `from codegenie.output.redacted_slice import RedactedSlice as _RS`
to {'_RS': 'codegenie.output.redacted_slice.RedactedSlice'}.
"""
aliases: dict[str, str] = {}
for node in ast.walk(tree):
if isinstance(node, ast.ImportFrom) and node.module:
for alias in node.names:
aliases[alias.asname or alias.name] = f"{node.module}.{alias.name}"
elif isinstance(node, ast.Import):
for alias in node.names:
aliases[alias.asname or alias.name] = alias.name
return aliases
def _resolves_to_redacted_slice(call: ast.Call, aliases: dict[str, str]) -> bool:
func = call.func
# Bare-name: RedactedSlice(...) or _RS(...) where _RS aliases RedactedSlice
if isinstance(func, ast.Name):
return aliases.get(func.id, "") == _REDACTED_SLICE_QUALNAME
# Attribute: RedactedSlice.model_validate(...), _RS.model_construct(...), etc.
if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
base_qual = aliases.get(func.value.id, "")
return base_qual == _REDACTED_SLICE_QUALNAME and func.attr in {
"model_validate", "model_validate_json", "model_construct"
}
return False
def _enclosing_func_name(node: ast.AST, parents: dict[ast.AST, ast.AST]) -> str:
cur: ast.AST | None = node
while cur is not None:
if isinstance(cur, (ast.FunctionDef, ast.AsyncFunctionDef)):
return cur.name
cur = parents.get(cur)
return "<module>"
def _find_construction_sites(root: Path) -> list[_CallSite]:
sites: list[_CallSite] = []
for py in root.rglob("*.py"):
if any(p in {"_validation", "_attempts", "__pycache__"} for p in py.parts):
continue
if py == Path(__file__):
continue
try:
tree = ast.parse(py.read_text(encoding="utf-8"), filename=str(py))
except (SyntaxError, UnicodeDecodeError):
continue
aliases = _build_alias_map(tree)
parents: dict[ast.AST, ast.AST] = {}
for parent in ast.walk(tree):
for child in ast.iter_child_nodes(parent):
parents[child] = parent
for node in ast.walk(tree):
if isinstance(node, ast.Call) and _resolves_to_redacted_slice(node, aliases):
rel = py.relative_to(_REPO_ROOT).as_posix()
sites.append(_CallSite(
file=rel,
line=node.lineno,
enclosing_func=_enclosing_func_name(node, parents),
call_text=ast.unparse(node),
))
return sites
def _is_allowed(site: _CallSite) -> bool:
if (site.file, site.enclosing_func) in ALLOWED_CONSTRUCTOR_SITES:
return True
if any(site.file.startswith(d + "/") for d in ALLOWED_TEST_CONSTRUCTOR_DIRS):
return True
return False
def test_redacted_slice_construction_is_restricted_to_documented_sites() -> None:
"""AC-15 + AC-19 + AC-30 — closed two-site set, with aliased-import resilience."""
sites = _find_construction_sites(_SRC) + _find_construction_sites(_TESTS)
offending = [s for s in sites if not _is_allowed(s)]
assert not offending, "\n".join(
f"RedactedSlice constructed at {s.file}:{s.line} (call to '{s.call_text}') is outside the "
f"documented two-site closed set (sanitizer.redact_secrets + "
f"envelope_redactor._build_redacted_slice_pass) and the tests/unit/output/ allowlist. "
f"The smart-constructor invariant (02-ADR-0010, S3-02, Gap 4) requires construction to be "
f"restricted to the redaction pipeline. To add a third construction site, amend 02-ADR-0010 "
f"and update ALLOWED_CONSTRUCTOR_SITES in this test file. See "
f"docs/phases/02-context-gather-layers-b-g/ADRs/0010-redacted-slice-smart-constructor-at-writer-boundary.md."
for s in offending
)
# Defense-in-depth: each documented production site must contain at least one construction
# (regression guard against silent removal of either redaction path).
for (file, func) in ALLOWED_CONSTRUCTOR_SITES:
present = [s for s in sites if s.file == file and s.enclosing_func == func]
assert present, (
f"Expected at least one RedactedSlice construction inside {file}::{func}; "
f"the redactor at this site appears to have been removed. Either redaction path "
f"silently disappearing breaks 02-ADR-0005 + 02-ADR-0010."
)
test_hostile_skills_yaml.py parametrized skeleton:
# tests/adv/phase02/test_hostile_skills_yaml.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import pytest
from codegenie.skills.loader import SkillsLoader, SkillsLoadError
from codegenie.result import Result # Phase-0 Result type
_HOSTILE = Path(__file__).parent / "fixtures" / "hostile_skills"
@dataclass(frozen=True)
class _HostileCase:
name: str
fixture_relpath: str # under _HOSTILE
expected_reason: str # one of the SkillsLoadError reason values
_CASES = (
_HostileCase("python_object", "case1_python_object.yaml", "unsafe_yaml"),
_HostileCase("python_object_apply", "case2_python_object_apply.yaml","unsafe_yaml"),
_HostileCase("billion_laughs", "case3_billion_laughs.yaml", "oversized"), # or "schema"
_HostileCase("deep_nesting", "case4_deep_nesting.yaml", "schema"), # or "depth_exceeded"
# Cases 5 (symlink) + 6 (NUL-byte) built at test time, see below
_HostileCase("oversized", "case7_oversized.yaml", "oversized"),
_HostileCase("duplicate_key", "case8_duplicate_key.yaml", "schema"),
)
@pytest.mark.parametrize("case", _CASES, ids=lambda c: c.name)
def test_hostile_skill_yaml_refused(case: _HostileCase, tmp_path) -> None:
"""Loader returns typed Result.Err; no user code executes; wall-clock < 5 s."""
# Copy fixture into a fresh tier dir
tier = tmp_path / "skills"
tier.mkdir()
src = _HOSTILE / case.fixture_relpath
dest = tier / "skill-A.yaml"
dest.write_bytes(src.read_bytes())
pwned = Path("/tmp/pwned-hostile-skills-test")
assert not pwned.exists(), "stale marker from prior run — clean and retry"
loader = SkillsLoader(search_paths=[tier])
result = loader.load_all()
assert result.is_err(), f"expected Result.Err for {case.name}, got {result!r}"
err = result.unwrap_err()
assert isinstance(err, SkillsLoadError)
assert err.reason == case.expected_reason, f"expected reason={case.expected_reason!r}, got {err.reason!r}"
assert not pwned.exists(), f"!!python/object exec'd in case {case.name} — security regression"
def test_symlink_escape_refused(tmp_path) -> None:
"""Case 5 — symlink-escape filename. Built at test time (not committable to git)."""
tier = tmp_path / "skills"
tier.mkdir()
target = tmp_path / "out-of-tree.yaml"
target.write_text("name: out-of-tree\n")
(tier / "skill-A.yaml").symlink_to(target)
loader = SkillsLoader(search_paths=[tier])
result = loader.load_all()
assert result.is_err()
err = result.unwrap_err()
assert isinstance(err, SkillsLoadError)
assert err.reason == "symlink_refused"
test_concurrent_gather_race.py skeleton (hardened 2026-05-18 — asserts the actual Phase-0 O_APPEND + atomic-blob contract; no .codegenie/cache/.lock references):
# tests/adv/phase02/test_concurrent_gather_race.py
"""Concurrent-gather adversarial — extends Phase-0 S5-01's tests/unit/test_cache_concurrent.py
to the full `codegenie gather` CLI surface.
The Phase-0 concurrency contract per phase-arch-design.md §789 edge-case 12:
- .codegenie/cache/index.jsonl appends are atomic per-record (records ≤ PIPE_BUF=4096B).
- Blob writes are atomic via <dest>.tmp → os.replace.
- NO advisory lock; NO .codegenie/cache/.lock primitive.
We assert these invariants directly after two concurrent gathers — every jsonl line parses,
every blob filename matches its content hash, no .tmp remnants.
"""
from __future__ import annotations
import json
import subprocess
import sys
import time
import shutil
from pathlib import Path
import yaml
from codegenie.hashing import content_hash # actual helper name varies; pick the Phase-0 one
_REPO_ROOT = Path(__file__).parent.parent.parent.parent
_FIXTURE = _REPO_ROOT / "tests" / "fixtures" / "portfolio" / "minimal-ts"
def test_concurrent_gathers_do_not_corrupt_cache(tmp_path: Path) -> None:
# Work on a copy so we don't dirty the canonical fixture
workdir = tmp_path / "minimal-ts"
shutil.copytree(_FIXTURE, workdir)
def _launch() -> subprocess.Popen[bytes]:
return subprocess.Popen(
[sys.executable, "-m", "codegenie", "gather", str(workdir)],
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
a = _launch()
time.sleep(0.05) # explicit windowing — give A a brief head-start on cache-dir creation
b = _launch()
out_a, err_a = a.communicate(timeout=60)
out_b, err_b = b.communicate(timeout=60)
# AC-12: both processes terminated within budget — already asserted by communicate(timeout=60)
# AC-8: every line of .codegenie/cache/index.jsonl is valid JSON (the O_APPEND invariant)
index_path = workdir / ".codegenie" / "cache" / "index.jsonl"
assert index_path.exists(), f"index.jsonl missing after concurrent gathers; A={err_a!r} B={err_b!r}"
lines = index_path.read_text(encoding="utf-8").splitlines()
assert lines, "index.jsonl exists but is empty — neither gather wrote any cache record"
for i, line in enumerate(lines):
try:
json.loads(line)
except json.JSONDecodeError as e:
raise AssertionError(
f"index.jsonl line {i+1} is not valid JSON — O_APPEND atomicity violated "
f"(record likely exceeded PIPE_BUF=4096B). Line bytes: {line[:200]!r}..."
) from e
# AC-31: line count is at least the number of unique cache keys exercised by minimal-ts.
# Duplicates (same key written by both A and B) are PERMITTED — both records are valid JSON;
# consumers dedup by the "key" field.
seen_keys = {json.loads(line).get("key") for line in lines}
assert seen_keys, "index.jsonl lines had no 'key' fields — schema regression?"
# AC-9: every blob filename matches its content hash; no .tmp remnants; no zero-byte files
blobs_root = workdir / ".codegenie" / "cache" / "blobs"
assert blobs_root.exists(), "cache/blobs/ missing"
for blob in blobs_root.rglob("*"):
if not blob.is_file():
continue
assert not blob.name.endswith(".tmp"), (
f"leftover .tmp file at {blob} — atomic os.replace was interrupted; "
f"either a real bug or test teardown ordering issue"
)
size = blob.stat().st_size
assert size > 0, f"zero-byte blob at {blob} — atomic-write violated"
expected = content_hash(blob.read_bytes())
assert blob.name.startswith(expected[:16]), (
f"corrupt blob: {blob} (filename does not match content hash; "
f"got {blob.name!r}, expected prefix {expected[:16]!r})"
)
# AC-10: repo-context.yaml round-trips; matches one of A's or B's final outputs
ctx_path = workdir / ".codegenie" / "context" / "repo-context.yaml"
assert ctx_path.exists()
parsed = yaml.safe_load(ctx_path.read_text()) # no exception
assert isinstance(parsed, dict) and "probes" in parsed, (
"repo-context.yaml parsed but schema was malformed — half-merged hybrid?"
)
Green — make it pass¶
Plant fixtures, run tests, iterate until all four are green and stable.
Mutation-resistance witness table¶
| Mutation | Test that catches it |
|---|---|
Future contributor adds RedactedSlice.from_existing(...) at tests/integration/test_X.py:42 |
test_redacted_slice_construction_is_restricted_to_documented_sites — fails with the file:line + remediation pointer + ADR link |
Future contributor aliases the import (from … import RedactedSlice as _RS; _RS(...)) outside the allowlist to "dodge" the structural test |
Walker's alias-resolution pass (AC-30) classifies _RS correctly; test still fails |
Someone "fixes" safe_yaml.load to allow !!python/object (regression) |
test_hostile_skill_yaml_refused[python_object] — Result.Err becomes Result.Ok, /tmp/pwned-* exists, assertion fails |
Someone changes the cache's record format and a record now exceeds PIPE_BUF=4096B, breaking O_APPEND atomicity |
test_concurrent_gathers_do_not_corrupt_cache — json.loads(line) raises on a torn record |
Atomic-blob-write contract violated (e.g., <dest>.tmp → os.replace replaced with direct f.write) |
test_concurrent_gathers_do_not_corrupt_cache — .tmp remnants discovered OR blob filename does not match its content hash |
SkillsLoader swallows OSError(EILSEQ) for NUL-byte filenames silently |
test_symlink_escape_refused's NUL-byte variant — but only if implementer wrote it; otherwise the bug ships |
Future contributor edits S1-03 Protocols (DepGraphAdapter.consumers(self, pkg: str) → pkg: bytes) |
mypy fails on test_phase3_handoff_smoke.py's _frozen_dep_graph_signature helper (AC-25) at the next CI run, before unskip; AND test_phase3_handoff_smoke.py (when unskipped at Phase 3 entry-gate) catches the signature mismatch via the frozen-signature tuple |
Future contributor deletes the skip marker in test_phase3_handoff_smoke.py and the test fails on master |
CI fails on master; the test is unskipped prematurely — the comment block (AC-24) explicitly forbids this without an ADR amendment, and the failure points to the ADR |
Either redact_secrets or envelope_redactor._build_redacted_slice_pass is silently removed |
Defense-in-depth assertion in the AST test (per-allowed-site assert present loop) fires on the missing site |
A scanner probe (gitleaks) bypasses the redactor and writes plaintext to raw/gitleaks.json |
tests/golden/test_no_plaintext_in_goldens.py (S7-03) catches the plaintext in the golden; test_secret_in_source.py (S6-07) catches it behaviorally; this story's test_no_inmemory_secret_leak.py catches the structural path (writer signature would refuse a raw dict; envelope_redactor return shape pinned to RedactedSlice) |
Future contributor adds a third Writer.write call site (e.g., a debug logger path) |
test_writer_call_sites_are_closed_set (AC-17) — fails with the file:line of the unauthorized call site |
Refactor — clean up¶
test_no_inmemory_secret_leak.py's AST walker is reusable; consider whether to extract it totests/_helpers/ast_call_site_finder.py. Defer — only one consumer; extract at the second (if a future ADR mandates a similar "X is the only constructor of Y" invariant, that's the third + lifts).test_hostile_skills_yaml.py's_HostileCaseis the canonical-empty-set predicate. Each case'sexpected_reasonis pinned against the closed set declared inSkillsLoader.SkillsLoadError. If the closed set grows, the cases here grow accordingly.test_concurrent_gather_race.py'stime.sleep(0.1)is the explicit-windowing primitive, NOT race-tolerance. If the test becomes flaky on a slower CI runner, the fix is signal synchronization (SIGUSR1 handshake), NOT longer sleeps.- The four tests live side-by-side under
tests/adv/phase02/; they share fixture-directory conventions (tests/adv/phase02/fixtures/<test-name>/); the README attests/adv/phase02/README.md(if it exists from S4-02; if not, plant it) names the load-bearing tests.
Files to touch¶
| Path | Why |
|---|---|
tests/adv/phase02/test_hostile_skills_yaml.py |
≥ 8 hostile-YAML cases |
tests/adv/phase02/fixtures/hostile_skills/case{1..8}_*.yaml |
Hostile fixture files (cases 5 + 6 built at test time) |
tests/adv/phase02/test_concurrent_gather_race.py |
Two concurrent gathers; advisory-lock holds; cache consistent |
tests/adv/phase02/test_no_inmemory_secret_leak.py |
Gap 5 + Risk #6 + ADR-0010 — inspect-based structural boundary test |
tests/adv/phase02/test_phase3_handoff_smoke.py |
Gap 1 — skipped trip-wire; grep-discoverable; Phase-3 author finds it on first repo scan |
tests/adv/phase02/README.md (extend if exists, plant if not) |
Documents the four load-bearing adversarials + the adv-phase02 CI job's role |
Out of scope¶
- Property tests + portfolio sweep integration — S7-05.
- CI wiring (
adv-phase02job that runs these four + the earlier ones from S4-02 / S5-05 / S5-06 / S6-07) — S8-03. - Phase-3-handoff issues filing — S8-04 (this story lands only the skipped test; S8-04 files the GitHub issue Phase-3's author follows).
test_secret_in_source.py— owned by S6-07; complements this story's structural counterpart but lives there.- Behavioral concurrency tests against actual
pytest-xdistparallelism — out per ADR-0009. - A
tests/_helpers/ast_call_site_finder.pyextraction — premature; one consumer.
Notes for the implementer¶
- Risk #6 is the load-bearing risk this story defends.
test_no_inmemory_secret_leak.py's AST walker is the front-line guard against a future contributor silently breaking the smart-constructor invariant. Spend extra time on the failure-message ergonomics (AC-19). A CI failure message that points to the exact file:line PLUS the ADR PLUS the remediation step is what turns a "what does this mean?" CI red into a "oh I see, let me fix it" fix in 30 seconds. The named remediation is non-negotiable. - The deliberate fail-then-pass verification (outline step 8) is mandatory. Add a scratch
RedactedSlice(...)call somewhere outsidecodegenie.output.sanitizer, run the test, observe the failure with the AC-19 message, fix the code, observe green. Document the round-trip in the PR description. Without this, the test could be subtly wrong (e.g., AST walker missing a call form) and pass-by-default. test_concurrent_gather_race.pyflake-tolerance is forbidden. If you cannot get 100/100 passes locally and on CI, do NOT mark it as@pytest.mark.flakyorxfail. The flake indicates a real bug in the advisory-lock contract; fix the bug. The most common cause is the lock being released before the cache write fully fsyncs — investigatecache.py's flush discipline.test_phase3_handoff_smoke.py's skip marker MUST be@pytest.mark.skip(reason="..."), NOT@pytest.mark.skipif(...). The skipif form makes the un-skip an environmental concern; the plainskipmakes it a deliberate code edit Phase-3's author performs. The grep-discoverable string is the artifact.- AC-23's Protocol-conformance assertion can be left as code-but-skipped. The test body must type-check (
mypy --strict) even while skipped; this enforces that the four Protocols stay importable at their pinned paths AND that any signature change in S1-03 forces an explicit edit to this test file (preventing silent drift). If the test file no longer compiles because a Protocol moved, that is the contract trip-wire firing — even before Phase 3 unskips. - The hostile-YAML fixtures must be small. Billion-laughs in particular: keep the entity count under 10 (the parser caps it quickly; you don't need 1000 entities to trigger the defense). Deep nesting: 1000 levels is enough to trip stack-or-cap defenses without producing a 10 MB fixture file. Document the size + the cap in each fixture's top-of-file YAML comment.
- No
mock.patchanywhere in these tests. Per the project convention (and CLAUDE.md's "tests verify intent, not just behavior"). The tests exercise real production code paths — the loader, theO_APPENDcache contract, the AST of the production code, the four Protocols. Mocks would mask the regression cases these tests are designed to catch.
Validation-added implementer guidance (2026-05-18)¶
-
Two-site allowlist as a
Finalconstant, not a regex. The original story said "redact_secrets is the SOLE constructor ofRedactedSlice." It isn't —envelope_redactor._build_redacted_slice_passis the second site, by design (CLI envelope-level redaction path). ImplementALLOWED_CONSTRUCTOR_SITES: Final[frozenset[tuple[str, str]]]as a module-level constant at the top oftest_no_inmemory_secret_leak.py. The two entries are documented; adding a third is an ADR amendment, not a code edit. This pattern (aFinaltuple of(file, qualifier)pairs as the closed set) is the right shape for a structural-invariant test and composes with future "X is the only constructor of Y" invariants — when a third structural-invariant test arrives, lift the AST walker + the allowlist shape totests/_helpers/structural_invariants.pyand pass each test's allowlist constant in. Defer the lift until N=3 (Rule 2 / rule of three); today's deferred-extraction note in Refactor §1 is correct. -
Aliased-import resilience in the AST walker.
_is_redacted_slice_call(node)as drafted only checksisinstance(func.value, ast.Name) and func.value.id == "RedactedSlice". This will silently missfrom codegenie.output.redacted_slice import RedactedSlice as _RS; _RS(...). Walker must first build a per-module import alias map fromast.ImportFrom+ast.Importnodes ({local_name: real_qualified_name}) and resolveCall.funcagainst it. Add the inline regression case from AC-30 to lock the behavior down — without it, the mutation "alias the import to dodge the structural check" lands with no test failure. -
Wall-clock enforcement for AC-4 without adding
pytest-timeout. Do NOT introducepytest-timeoutas a dep just for this test (Phase-2 fence discipline). Each hostile-YAML case wraps itsloader.load_all()call betweent0 = time.monotonic()/assert time.monotonic() - t0 < 5.0book-ends. If a case exceeds 5 s, the loader has a real DoS surface — fix the loader, not the test. -
mypy drift trip-wire (AC-25 expansion). The Protocol-typed helper
_frozen_dep_graph_signature(adapter: DepGraphAdapter) -> None: ...lives near the top oftest_phase3_handoff_smoke.py, OUTSIDE the@pytest.mark.skip-decorated function body (decorators don't affect mypy reachability — module-level code type-checks always). The body uses every method name the test cares about (adapter.consumers(...),adapter.producers(...),adapter.confidence(), etc.). Any S1-03 signature change forces this file to fail mypy at the next CI run — even before unskip. This is the contract trip-wire firing through the type system rather than through pytest. -
O_APPENDinvariant > advisory lock. The original story referenced.codegenie/cache/.lockandfcntl.flock— neither exists. The actual Phase-0 contract isO_APPENDatomicity for records ≤PIPE_BUF=4096(proven in Phase-0 S5-01tests/unit/test_cache_concurrent.py) + atomic blob writes via<dest>.tmp → os.replace. The concurrent-gather test asserts these invariants directly (every line ofindex.jsonlparses; every blob's filename matches its content hash; no.tmpfiles remain). Do NOT introduce a lock primitive to "fix" flakes — the lock would be a real architectural change requiring a Phase-0 ADR amendment. Flakes mean either (a) a realO_APPENDviolation (record size exceededPIPE_BUF), or (b) a real atomic-write violation (someone added a non-atomic write path). Fix the bug; don't add a lock. -
Writer.writecall-site allowlist. AC-17'sALLOWED_WRITER_CALL_SITESis aFinal[frozenset[tuple[str, str]]]declared at the top of the test. Today the set contains exactly the CLI seam atcli.py:397(and a verify seam if S4-02 / S8-03 added one). Adding a third call site is an explicit edit to this constant — the diff in PR review is the load-bearing artifact. This is the same pattern asALLOWED_CONSTRUCTOR_SITESabove (Open/Closed via a registered closed set, lifted into a registry only at N=3).
Patterns DELIBERATELY deferred (per Rule 2 + Rule 3)¶
- A reusable AST
_call_site_finder.py— one consumer; lift at the second. - A behavioral skill-loader-hostile-input test that asserts
redact_secretswas called via a mock spy — defer; the structural AST test in this story + the behavioral end-to-end test in S6-07 (test_secret_in_source.py) together cover the surface; a mock-spy variant adds nothing observable. - A second concurrency test against Layer-C
runtime_traceparallel scenarios — out;runtime_traceis sequential per scenario by design (S5-02 contract). The advisory-lock is the only concurrency surface to exercise. - A
test_protocol_drift_at_phase2_boundary.pycompanion that re-asserts S1-03's Protocols at the Phase-2-boundary (without waiting for Phase 3) — defer; thetest_phase3_handoff_smoke.py's type-check-while-skipped already enforces drift-detection without redundancy.