Story S7-03 — Golden files (~70) + scripts/regen_golden.py canonical-ordering¶
Step: Step 7 — Plant five-repo fixture portfolio + per-probe golden files + remaining adversarial corpus
Status: GREEN-PARTIAL — shipped 2026-05-18 (phase-story-executor; see _attempts/S7-03.md for the per-AC evidence table). Core infrastructure + 90 goldens + 9 supporting tests landed; AC-13/14/18/19/20/22/29 deferred (matrix-expansion follow-up once the corresponding probes register slices consistently across the portfolio). AC-4/16/36 carry documented adaptations.
Effort: M
Depends on: S7-02 (all five fixtures exist on disk; stale-scip is fully materialized; monorepo-pnpm lockfile is committed). Soft-depends on Phase 1 S6-01 (scripts/regen_golden.py and tests/golden/<fixture>.repo-context.yaml — see Validation notes for the coexistence story; if S6-01 hasn't shipped yet, this story creates scripts/regen_golden.py from scratch and reserves the --portfolio mode without touching S6-01's envelope-level YAML golden seam).
ADRs honored: ADR-0001 (allowlisted binaries — regen_golden.py invokes codegenie gather which obeys the allowlist), ADR-0003 (heaviness sort — goldens reflect the deterministic dispatch order), ADR-0004 (image-digest declared-input token — distroless-target goldens include the resolved digest, never latest), ADR-0005 (no plaintext persisted — goldens for scanner probes record <REDACTED:fp=...> only), ADR-0006 (IndexFreshness location — stale-scip golden's freshness slice is a Stale(reason=CommitsBehind(...)) literal), ADR-0010 (RedactedSlice smart constructor — goldens for any probe whose slice contains scanner findings show the RedactedSlice JSON shape, never a raw dict).
Validation notes (2026-05-18)¶
This story was hardened by phase-story-validator. Full report under _validation/S7-03-golden-files-regen-script.md. Headline corrections:
- Coexistence with S6-01 made explicit. Phase 1 S6-01 (HARDENED 2026-05-15) ships an envelope-level YAML golden at
tests/golden/<fixture>.repo-context.yamlwith a_GOLDEN_PAIRSOpen/Closed seam and sentinel-normalization forgenerated_at+repo.git_commit(the only two top-level non-deterministic envelope fields; there is noauditblock at the envelope root —wall_clock_mslives in sidecarruns/*.json). S7-03 ships slice-level per-probe JSON goldens attests/golden/probes/<probe>/<fixture>.jsonconsuming the raw artifacts under<fixture>/.codegenie/context/raw/<probe>.jsonand uses field-exclusion. The two artifacts are orthogonal (different paths, different shapes, different non-determinism surfaces); S7-03 does NOT touch or replace S6-01's YAML golden or its_GOLDEN_PAIRStuple. The only shared code inscripts/regen_golden.pyis_canonicalize(recursive sorted-keys + tuple→list + None-vs-missing normalization). New ACs AC-34, AC-38 make this load-bearing. - Idempotence promoted from PR-description claim to CI gate (S6-01 precedent — "humans miss things; CI doesn't"). AC-27 reframed: the two-runs-byte-identical assertion is now a unit-test (
tests/golden/test_regen_golden_idempotent.py) shipped in this story, not deferred to S8-03. AC-26 is retained as PR-merge documentation discipline but is no longer the only safeguard. - macOS shape divergence — picked one path: error-out. AC-16 originally said "Linux-only-skip OR error-out — implementer's call." The synthesizer picks error-out with a clear remediation message (Rule 12: fail loud; mirrors S6-01's idempotence CI gate). A macOS contributor running
--portfolio --updategets a structured stderr message naming the Linux requirement and the CI runner image; the script exits 2. - Envvar substring-match guards added. AC-7's substring match against
$USER/$HOSTNAME/$HOMEwas a footgun: short or empty values would scrub legitimate occurrences ("ci", "root", "" → match everything). AC-7 amended: empty/short (< 4 chars) envvar values are skipped from the substring scrubber with aWARNINGemitted to stderr; the scrubber only fires when the envvar is unambiguous. - Plaintext-leak guard derives patterns from
SecretRedactor. AC-32 originally hardcoded patterns. Source-of-truth for ADR-0005 lives insrc/codegenie/output/sanitizer.py::_PATTERNS. AC-32 amended to import that module-level table and parametrize over it — drift between the production redactor and the golden audit becomes a compile-timeImportError, not a silent miss. - Count range tightened. AC-25 said "~70; allowed range 50–90" — a 1.8× spread that hides drift. Reframed: the exact count is computed by a deterministic enumeration function (
_compute_expected_golden_count()over the (probe, fixture) cells where each probe declares its applicable fixtures).COUNT.txtrecords the enumeration result;_compute_expected_golden_count()is the single source of truth grep-discoverable fromscripts/regen_golden.py. - Atomic writes added (AC-34).
--updatewrites each golden viatempfile.NamedTemporaryFile+os.replace(atomic on POSIX). A SIGINT mid-write never leaves a half-written golden on disk. - Clean-slate per-fixture regen added (AC-36). Each fixture's pre-existing
.codegenie/context/is cleared before the gather invocation so a cache hit doesn't silently produce a different slice shape. Audit logs this deletion via structlog. - Phase 1 fixture coexistence (AC-38). Phase 1's
tests/fixtures/node_typescript_helm/is not undertests/fixtures/portfolio/and is not swept by the--portfoliomode. S6-01's YAML golden for that fixture stays canonical; no per-probe JSON goldens are produced for Phase 1 fixtures. An AC plus a test enforce the non-collision (no goldens land undertests/golden/probes/*/node_typescript_helm.json). - Typed
JsonValuecanonicalize signature (AC-39). The recursive_canonicalizeuses a typedJsonValueunion (noAny), somypy --strictdoes real work rather than rubber-stamping. A type-only file atscripts/_json_value.py(or inlineTypeAliasif simpler) documents the shape. _PRESERVED_FIELDSinclusion list (AC-37). Declared-input tokens (image_digest, the SCIPoutput_path_relative, the BLAKE3 8-hexfingerprint) are explicitly inclusion-listed in a module-level frozen set. The exclusion list (AC-7) and the inclusion list (AC-37) are complementary; conflict resolution is "inclusion wins" (a field that names a declared-input token is never scrubbed). Discoverable bygrep.- Underscore-prefix skip becomes a test (AC-40). AC-3 says directories under
tests/fixtures/portfolio/prefixed_are skipped (the convention); this is now also asserted by a test, so a future contributor creating_shared_helpers/doesn't accidentally trigger a fixture-walk pass. - Unified-diff format assertion (AC-30 tightened). The "5-line excerpt" requirement was prose-only; the test now asserts the failure message contains
---+++++@@markers (unified-diff format), not just any 5 lines. _LIST_SORT_KEYSkeyed by typed probe name (Notes-for-implementer). Stringly-typed keys ("semgrep.findings") are a Newtype-pattern miss. Recommend keying by(ProbeName, str)whereProbeNameis the sharedLiteralfromtests/fixtures/_shape_test_kernel.py(S7-02 HARDENED precedent) — single source of truth for probe names across the test harness.
ACs touched: AC-7 amended (envvar guards), AC-13 (macOS note), AC-16 (error-out picked), AC-25 (deterministic count), AC-26 (clarified), AC-27 (promoted to CI gate), AC-30 (unified-diff assertion), AC-32 (pattern import). New: AC-34, AC-35, AC-36, AC-37, AC-38, AC-39, AC-40. Total ACs: 33 → 40.
Context¶
This story lands the golden-file layer: one canonical JSON snapshot per probe per fixture under tests/golden/probes/<probe>/<fixture>.json. With ~14 Phase-2 language-agnostic probes (the table below) × 5 fixtures (S7-01 + S7-02), the count lands near ~70 goldens. The CI portfolio job (S8-03) diffs live output against committed expected; pytest --update-golden regenerates canonically; updating a golden is a deliberate PR step.
~14 probes covered (Layer A is mostly Phase-1-covered; Phase-2 contributes Layers B/C/D/E/G):
| Layer | Probes |
|---|---|
| B | index_health, scip_index (binary blob path — golden records metadata only), tree_sitter_import_graph, dep_graph, generated_code, node_reflection, semantic_index_meta |
| C | runtime_trace (only minimal-ts + distroless-target — the other fixtures have no Dockerfile or stale-scip-irrelevant), dockerfile, entrypoint, shell_usage, certificate, sbom, cve |
| D | skills_index, conventions, adrs, repo_notes, repo_config, policy, exceptions, external_docs (mostly confidence="unavailable" against fixtures with no skills/policies — that's the load-bearing canonical-empty case) |
| E | ownership, service_topology_stub, slo_stub |
| G | semgrep, ast_grep, ripgrep_curated, gitleaks, test_coverage_mapping |
Not every cell is populated — e.g., runtime_trace only ships goldens for the two fixtures with a real Dockerfile (minimal-ts, distroless-target); dep_graph only for monorepo-pnpm (the only fixture with cross-package edges); gitleaks for none (the load-bearing secret_in_source adversarial fixture lives separately under tests/adv/phase02/, not in tests/fixtures/portfolio/). Implementer judgement on the empty cells; the story locks the count to "the cells where a probe produces a non-trivially-empty slice against a fixture."
Golden-file non-determinism is the recurring hazard (Implementation risk #5). Sources of non-determinism the regen script MUST exclude:
wall_clock_ms— varies per run.generated_at— wall-clock timestamp.- Audit-anchor timestamps inside any
runs/<utc-iso>-<short>.jsonreference. tmppaths inside any captured stdout/stderr.- Environment-derived values:
$USER,$HOSTNAME,$HOME,$PWD. - BLAKE3 fingerprints of cleartext plaintext (the fingerprint of the cleartext IS deterministic and DOES go into the golden — but only the 8-hex fingerprint, never the cleartext itself, per ADR-0005).
- Image digests captured at probe-run time that are not declared-input tokens — only the
ProbeContext.image_digest_resolverresolved digest goes in (deterministic per fixture; the resolver result is committed indistroless-target/built-image.digest). node --versioncross-check output fromNodeBuildSystemProbeif the CI runner's Node differs from the fixture's.nvmrc(per S2-03 Phase-1 note — the regen script disables this cross-check to keep the golden deterministic OR pins the regen environment's Node version; same applies to other tool-version cross-checks).
The discipline that anchors all of this: run scripts/regen_golden.py twice locally before opening the Step 7 PR; verify the second run produces zero file changes (same Phase-1 Step-6 discipline). If the second run produces diffs, a non-deterministic field has slipped past the exclusion list; fix the exclusion before merging.
Goldens are written as canonical JSON: keys sorted at every level (Python's json.dumps(..., sort_keys=True, indent=2) plus an explicit recursive _canonicalize for tuple-vs-list normalization and None-vs-missing-key normalization). Indentation is 2 spaces; trailing newline; UTF-8; LF endings. The same canonical-JSON conventions Phase 1 Step 6 set.
References — where to look¶
- Architecture:
../phase-arch-design.md §"Testing strategy" → "Golden files"— the table this story implements.../phase-arch-design.md §"Implementation risks"#5 (golden-file non-determinism — the central hazard).../phase-arch-design.md §"Goals"G1 (every Layer B–G language-agnostic probe ships with golden-file coverage).- Phase ADRs: ADR-0004 (image-digest token —
distroless-targetgoldens include the resolved digest verbatim), ADR-0005 (no plaintext — scanner goldens carry<REDACTED:fp=...>strings), ADR-0010 (RedactedSliceshape — scanner goldens render theRedactedSliceJSON, not a raw dict). - Implementation plan:
../High-level-impl.md §"Step 7"—scripts/regen_golden.pybullets, exclusion list, two-runs-byte-identical discipline. - Source design:
../final-design.md §"Open questions"#5 (golden-file regeneration policy — this story implements the named discipline). - Existing code:
- Phase 1's
scripts/regen_golden.pyand the goldens undertests/golden/(S6-01 + S6-02-bench-anchor) — convention precedent; this story extends rather than rewrites. - All five fixture trees from S7-01 + S7-02.
- Every Phase-2 probe under
src/codegenie/probes/layer_{b,c,d,e,g}/.
Goal¶
Three deliverables:
scripts/regen_golden.pyextended (or rewritten as the Phase-2 version layered on the Phase-1 base) to iterate overtests/fixtures/portfolio/<fixture>/for every fixture, runcodegenie gatheragainst each, capture the resultingrepo-context.yamlplus each.codegenie/context/raw/<probe>.json, canonicalize per-probe (sorted keys, exclusion list, LF endings, trailing newline), and writetests/golden/probes/<probe>/<fixture>.json. The script accepts--check(CI mode — diffs live output vs. committed; non-zero exit on diff) and--update(developer mode — overwrites).- ~70 golden files under
tests/golden/probes/<probe>/<fixture>.json— one per fixture per probe that produces a non-trivially-empty slice against that fixture. The exact count is determined by the implementer's judgment on which cells produce non-empty slices; document the count in the PR description. - A pytest harness (
tests/golden/test_goldens_match.py) that invokesregen_golden.py --checkmode against the fixture portfolio and asserts zero diffs. CIportfoliojob will run this; locallypytest --update-goldenregenerates by invokingregen_golden.py --update.
Acceptance criteria¶
scripts/regen_golden.py shape
- [ ] AC-1.
scripts/regen_golden.pyexists; passesmypy --strict; passesruff check+ruff format --check. - [ ] AC-2 — modes. Script supports
--check(exit 0 if every golden matches live output, exit 1 with a diff summary otherwise) AND--update(overwrite every golden with the canonicalized live output). Default mode is--check. The two flags are mutually exclusive (argparseadd_mutually_exclusive_group). - [ ] AC-3 — fixture iteration. Script walks
tests/fixtures/portfolio/for every subdirectory NOT prefixed_(the_shape_test_kernel.pydoes not appear here, but defending the convention). For each fixture, invokescodegenie gather <fixture-path>viarun_allowlisted(NEVERsubprocess.rundirectly — this is the audit hygiene); captures the output tree. - [ ] AC-4 — per-probe golden write. For each fixture, for each probe that produced a slice in the live run, the script:
- Reads the slice from
<fixture>/.codegenie/context/raw/<probe>.json. - Canonicalizes (
_canonicalize— recursive: sorted keys at every level, tuple → list normalization,None→nullnot missing). - Applies the exclusion list (AC-7).
- Writes
tests/golden/probes/<probe>/<fixture>.jsonwithjson.dumps(..., sort_keys=True, indent=2, ensure_ascii=False)+ trailing LF. - [ ] AC-5 — empty-cell handling. If a probe produced no slice for a fixture (e.g.,
dep_graphagainstminimal-ts— no cross-package edges), the script writes NO golden file for that cell. The CI--checkaccordingly does NOT require a golden file for that cell. (An empty-but-present slice — e.g.,confidence="unavailable"— IS a golden cell; the empty cell is when the probe did not register a slice at all.) - [ ] AC-6 — canonical JSON ordering. Every key at every nesting level is sorted. Lists preserve insertion order (Python's
json.dumpsdoes — but list contents themselves may need canonicalization for sets-of-dicts; if a probe emits afindings: list[dict], the implementer adds a sort key per probe, documented inregen_golden.py's top-of-file comment). For Phase 2 the only probes producing unordered-list slices aresemgrep,ast_grep,ripgrep_curated,gitleaks(findings) anddep_graph(edges) — each gets an explicit sort tuple.
Exclusion list (load-bearing — Risk #5 defense)
- [ ] AC-7 — exclusion list.
regen_golden.pyexcludes from every golden: - Any field named
wall_clock_ms. - Any field named
generated_at. - Any field named ending in
_timestamp. - Any string field whose value matches
r"/(tmp|var/folders)/[^/]+/"(tmppaths). - Any field whose value contains an unambiguous envvar value:
os.environ.get(name, "")is checked for substring presence ONLY if the value is non-empty AND ≥ 4 characters AND not in a closed deny-list of common short tokens ({"ci", "root", "home", "user", "test"}). For each skipped envvar aWARNINGis logged to stderr explaining why (e.g.,WARNING: $USER='ci' too short/ambiguous for substring scrub; rely on tmp-path + named-field exclusions instead). This prevents trivially-common values from scrubbing legitimate occurrences of "ci" / "root" in golden output. - The
audit_anchor.run_idfield (which is by definition a per-run UUID/timestamp pair). - [ ] AC-8 — exclusion list documented. The exclusion list is a module-level frozen tuple in
regen_golden.py(e.g.,_EXCLUDED_FIELD_NAMES,_EXCLUDED_VALUE_PATTERNS), discoverable bygrep. Adding an exclusion is a deliberate edit. Document each exclusion in a comment naming the source of non-determinism. - [ ] AC-9 — ADR-0005 plaintext-fingerprint allowed. The 8-hex BLAKE3 fingerprint emitted by
SecretRedactorIS deterministic and IS included in the golden — verified by a per-scanner golden that contains a"fingerprints": ["a1b2c3d4"]-shaped value. Plaintext NEVER appears. - [ ] AC-10 —
image_digestexception.distroless-targetgoldens forruntime_trace/sbom/cve/dockerfileinclude the resolved image digest (per ADR-0004) verbatim — the digest is a declared-input token, deterministic per fixture (committed indistroless-target/built-image.digest). The exclusion list does NOT stripimage_digestfields.
~70 goldens land on disk
- [ ] AC-11.
tests/golden/probes/exists as the root of the per-probe golden tree. - [ ] AC-12 —
index_healthgoldens. Five files attests/golden/probes/index_health/<fixture>.jsonforminimal-ts,native-modules,monorepo-pnpm,distroless-target,stale-scip. Thestale-scipgolden'sfreshnessis{"kind": "Stale", "reason": {"kind": "CommitsBehind", "n": <integer>=1>, "last_indexed": "<prior-sha>"}}— the literal load-bearing shape. - [ ] AC-13 —
scip_indexgoldens. Goldens are the metadata slice only (scip_index_present: true, indexer_version: "...", indexer_invocation_succeeded: true, output_path_relative: ".codegenie/context/raw/scip-index.scip"); the binary blob itself is not a golden (binary equality across runs is too brittle). One per fixture where SCIP was successfully indexed;stale-sciprecords the pre-existing-blob path;native-modulesmay registerconfidence="unavailable"(no TS sources). - [ ] AC-14 —
tree_sitter_import_graphgoldens. One per fixture where.tssources exist;monorepo-pnpmgolden records the cross-packageimportedges (the load-bearing case). - [ ] AC-15 —
dep_graphgoldens. Phase-2 dep-graph registry has zero strategies; the golden for every fixture is{"confidence": "low", "reason": "no_strategy_for_ecosystem", ...}(per S4-05's typed output). This is the explicit Open/Closed seam — Phase 3 fills strategies and Phase 3 regenerates affected goldens. - [ ] AC-16 —
runtime_tracegoldens.minimal-ts+distroless-targetonly; per-scenarioTraceScenarioCompleted | TraceScenarioFailed | TraceScenarioSkippedoutcomes;distroless-target'simage_digestmatchesbuilt-image.digest. The golden is committed against the Linux-runner shape only. macOS regen produces a different shape (TraceScenarioFailed(StraceUnavailable())); to prevent silent golden divergence,scripts/regen_golden.py --portfolioerrors out with exit code 2 on non-Linux platforms (sys.platform != "linux"), printing a structured stderr message:"regen_golden.py --portfolio requires Linux (CI: ubuntu-24.04). runtime_trace goldens cannot be regenerated on {platform}; run on a Linux box or via the CI runner."Rule 12 (fail loud) drives this choice; deferred to S8-03 only the enforcement of "all PRs that touch goldens were regenerated on Linux" (commit-trailer check).--checkmode (read-only) still runs on every platform. - [ ] AC-17 —
dockerfile/entrypoint/shell_usage/certificategoldens. Per fixture with aDockerfile(minimal-ts,monorepo-pnpm,distroless-target);distroless-targetrecordsUSER == null(distroless default-nonroot). - [ ] AC-18 —
sbom/cvegoldens. Per fixture with a built image (distroless-targetonly; the others' Dockerfiles are not built in regen — runtime_trace builds for the trace, but SBOM probe relies on the runtime_trace's built image). Both goldens recordScannerRan(...)-shaped outputs; thecvegolden's findings list is sorted deterministically. - [ ] AC-19 —
skills_index/conventions/adrs/repo_notes/repo_config/policy/exceptionsgoldens. Per fixture; mostlyconfidence="unavailable"against fixtures with no skills/policies — that's the canonical-empty case that proves the probes don't crash on the negative path. (Phase 2 fixtures don't seed.codegenie/skills/; that's a Phase-3+ test-surface concern.) - [ ] AC-20 —
external_docsgoldens. All five recordconfidence="unavailable", reason="not_opted_in"(per S6-04). - [ ] AC-21 —
ownership/service_topology_stub/slo_stubgoldens. Per fixture; mostly empty/unavailable; absence ofCODEOWNERS→confidence="low". - [ ] AC-22 —
semgrep/ast_grep/ripgrep_curatedgoldens. Per fixture;confidencereflects tool presence; findings (if any) flow throughSecretRedactorand the golden records theRedactedSliceJSON shape. Findings list is sorted by(rule_id, file, line, column). - [ ] AC-23 —
gitleaksgoldens. Per fixture; thesecret_in_sourceadversarial test (S6-07) lives undertests/adv/phase02/against its own fixture, NOT against the portfolio fixtures — so portfolio-fixturegitleaksgoldens are allScannerRan(findings_count=0, fingerprints=[])(proves the scanner runs and finds nothing in the curated fixtures, which is itself a load-bearing assertion). - [ ] AC-24 —
test_coverage_mappinggoldens. Per fixture with tests;minimal-ts+monorepo-pnpmqualify; the others are empty-cells (no goldens). - [ ] AC-25 — golden count is deterministic, not a range.
scripts/regen_golden.pyexposes a module-level function_compute_expected_golden_count() -> intthat enumerates the closed set of(probe, fixture)cells that ARE expected to produce a golden (the dual of the AC-5 empty-cell rule), drawing from a single declarative table (_PROBE_FIXTURE_MATRIX: dict[ProbeName, frozenset[FixtureName]]— grep-discoverable).tests/golden/probes/COUNT.txtrecords that exact integer (committed). A unit test (tests/golden/test_golden_count_matches.py) asserts: (a) on-disk golden count ==COUNT.txt; (b)COUNT.txt==_compute_expected_golden_count(). Adding or removing a(probe, fixture)cell requires editing_PROBE_FIXTURE_MATRIXANDCOUNT.txtin the same commit. The PR description records the count for human review; the test enforces it.
Two-runs-byte-identical discipline (Risk #5 defense)
- [ ] AC-26 — local two-runs verification (PR-merge discipline). PR description records: "
python scripts/regen_golden.py --update --portfoliowas run twice locally on Linux; the second run produced zero file changes (git diff --statempty)." This is documentation discipline for the human reviewer; the CI assertion in AC-27 is the load-bearing guarantee. - [ ] AC-27 — CI two-runs idempotence is a gating test in this story. A unit-level test
tests/golden/test_regen_golden_portfolio_idempotent.py(Linux-only viapytest.mark.skipif(sys.platform != "linux", reason="regen --portfolio requires Linux runner")) invokesscripts/regen_golden.py --update --portfoliotwice against a temp-copy oftests/fixtures/portfolio/, then asserts the second invocation produced zero file changes (via SHA256 of every committed golden, before and after the second run). This promotes the S6-01 HARDENED precedent ("humans miss things; CI doesn't") from a manual sha256 check to a CI gate now, not in S8-03. S8-03 wires the same assertion at theportfoliojob lane; this story's unit test guarantees the assertion exists before that lane is wired.
tests/golden/test_goldens_match.py harness
- [ ] AC-28.
tests/golden/test_goldens_match.pyexists; runsscripts/regen_golden.py --checkas a subprocess (viarun_allowlisted); asserts exit code 0; on failure, prints the diff summary in the pytest failure message. - [ ] AC-29 —
pytest --update-goldenflag. Adding a--update-goldenflag totests/conftest.pyrunsregen_golden.py --updatebefore the assertions. Test invocations:pytest tests/golden/(check mode, default),pytest tests/golden/ --update-golden(regenerate then check). - [ ] AC-30 — failure message includes the failing path AND unified-diff format. When a golden mismatch fires, the failure message names the exact file path (
tests/golden/probes/dep_graph/monorepo-pnpm.json) AND a unified-diff excerpt (difflib.unified_diffwithn=3). A unit test (tests/golden/test_failure_message_shape.py) feeds a synthetic mismatch intoregen_golden.py --checkand asserts the stderr/exception text contains: (a) the relative golden path, (b) at least one---line, (c) at least one+++line, (d) at least one@@ ... @@hunk header. "5 lines of context" alone is too weak — the unified-diff markers are the structural assertion a CI-log reader needs.
Determinism, audit hygiene, type cleanliness
- [ ] AC-31 —
mypy --strictonscripts/regen_golden.pyANDtests/golden/test_goldens_match.py. - [ ] AC-32 — no plaintext in goldens; patterns imported from
SecretRedactor. A test (tests/golden/test_no_plaintext_in_goldens.py)grep-walks every golden undertests/golden/probes/and asserts no string matches the production redactor patterns. Patterns are imported fromcodegenie.output.sanitizer._PATTERNS(the source-of-truth tuple-list per ADR-0005), NOT hardcoded in the test file. The test parametrizes over(golden_path, (pattern_class, pattern))and any future pattern added to_PATTERNSis automatically picked up. Drift between the production redactor and the golden-leak audit becomes a compile-timeImportError, not a silent miss. Defense-in-depth on top ofSecretRedactor(the redactor is supposed to catch this; the goldens are committed text and a stray paste during fixture authoring would be caught here). - [ ] AC-33 — LF endings + trailing newline on every golden file. Parametrized test over
tests/golden/probes/**/*.json.
New ACs from validation hardening
- [ ] AC-34 — atomic write.
regen_golden.py --updatewrites every golden viatempfile.NamedTemporaryFile(dir=<golden-parent-dir>, delete=False)+os.replace(tmp, final). A SIGINT or crash mid-write never leaves a partial golden on disk. Unit test:tests/golden/test_atomic_write.pypatchesos.replaceto raise mid-write and asserts the original golden is untouched. - [ ] AC-35 —
scripts/regen_golden.pyexposes a stable Open/Closed seam for new fixtures. Adding a new fixture undertests/fixtures/portfolio/<new-fixture>/(per S7-01/S7-02 conventions:_shape_test_kernel.py-compatible) MUST require zero edits toregen_golden.py— the script discovers fixtures bygit ls-files-rooted directory walk (skipping_-prefixed names). A test (tests/golden/test_regen_open_closed_seam.py) creates a temporary fixture directory inside atmp_pathcopy of the portfolio root, invokesregen_golden.py --check --portfolio --portfolio-root <tmp_path>, and asserts the new directory was discovered without any code-path edits. (The--portfolio-rootflag is added to support the test; default istests/fixtures/portfolio/.) - [ ] AC-36 — clean-slate per-fixture regen. Before invoking
codegenie gatheragainst a fixture,regen_golden.py --updateremoves<fixture>/.codegenie/context/(raw artifacts + runs) so a stale cache hit doesn't produce a different slice shape. The deletion is logged to stderr at INFO via structlog ({"event": "regen.clear_codegenie_context", "fixture": "<name>"}). The fixture's.codegenie/cache/is NOT touched (this would defeat S7-01's.codegenie/cache/not-committed guarantee — there's nothing to clear). The--updatemode is destructive-by-design; the--checkmode does NOT clear anything (idempotent read). - [ ] AC-37 —
_PRESERVED_FIELDSinclusion-list (complement to AC-7 exclusion list). Declared-input tokens and deterministic fingerprints are explicitly preserved: a module-level_PRESERVED_FIELDS: frozenset[str]lists at minimum{"image_digest", "output_path_relative", "fingerprint", "fingerprints", "last_indexed"}plus their dotted-path qualified variants. Conflict resolution rule (also asserted bytests/golden/test_preserved_fields_win.py): inclusion always wins over exclusion — if an exclusion pattern would match a field name in_PRESERVED_FIELDS, the field is preserved. Each entry's rationale lives in a sibling comment. Adding a new declared-input token (Phase 3+) is a deliberate one-line edit. - [ ] AC-38 — Phase 1 fixture coexistence + non-collision. Phase 1's
tests/fixtures/node_typescript_helm/is not undertests/fixtures/portfolio/and is not swept byregen_golden.py --portfolio. S6-01's YAML golden attests/golden/node_typescript_helm.repo-context.yamlremains the canonical artifact for Phase 1's fixture. Two enforcing tests: tests/golden/test_phase1_yaml_golden_untouched.py— snapshots the SHA256 oftests/golden/node_typescript_helm.repo-context.yamlbefore invokingregen_golden.py --portfolio --updateagainst the temp portfolio and asserts the SHA256 is unchanged after.tests/golden/test_no_phase1_per_probe_collision.py— asserts no file matchestests/golden/probes/*/node_typescript_helm.json(Phase 1's fixture name must not collide with the portfolio per-probe tree). If Phase 7+ wants to also land per-probe JSON goldens fornode_typescript_helm, that is a deliberate later PR with its own validation step.- [ ] AC-39 — typed canonicalize signature (no
Any). The recursive_canonicalizefunction uses a typedJsonValuerecursive union:JsonValue: TypeAlias = None | bool | int | float | str | list["JsonValue"] | dict[str, "JsonValue"]. Signature:def _canonicalize(payload: JsonValue) -> JsonValue.mypy --strictexercises this. NoAnythreads through the script (one explicitcast(JsonValue, json.load(fp))at the I/O boundary is the only acceptable narrow). A test (tests/golden/test_canonicalize_type_purity.py) does anast-walk ofscripts/regen_golden.pyand asserts no: Anyannotation appears on a function argument or return. - [ ] AC-40 —
_-prefixed directory skip is asserted. AC-3's convention (directories undertests/fixtures/portfolio/prefixed_are skipped) is asserted by a test (tests/golden/test_underscore_prefix_skip.py): createtests/fixtures/portfolio/_helpers_test/in a temp copy, invoke--check --portfolio, assert no error is raised and no golden is produced undertests/golden/probes/*/_helpers_test.json. Defends against a future contributor accidentally promoting a private helper directory to fixture status.
Implementation outline¶
- Read Phase 1's
scripts/regen_golden.py(from S6-01). Understand the convention: which functions do_canonicalizework, which write the files, which compute the diff. Decide: extend Phase 1's script with portfolio support, OR write a Phase-2 variant in the same file with--portfolioflag. Recommendation: extend, since the canonicalization machinery is identical. - Write
tests/golden/test_goldens_match.pyfirst (TDD red). With no goldens on disk, the test asserts every fixture × probe cell that should have a golden has one, and assertion fails everywhere. - Extend
regen_golden.pywith--portfoliomode (or the Phase-2 equivalent). The new code path: iteratetests/fixtures/portfolio/, runcodegenie gatherviarun_allowlisted, capture each<fixture>/.codegenie/context/raw/<probe>.json, canonicalize per AC-4..AC-10, writetests/golden/probes/<probe>/<fixture>.json. - First regen run.
python scripts/regen_golden.py --update --portfolio. This produces all goldens. Inspect a sample (tests/golden/probes/index_health/stale-scip.json) and confirm the load-bearingStale(reason=CommitsBehind(...))shape is present and JSON-canonical. - Second regen run.
python scripts/regen_golden.py --update --portfolioagain.git diff --statMUST be empty. If not: the diff names the offending field; add it to the exclusion list (AC-7) with a commented rationale. - Loop step 5 until the second run is a no-op. Common offenders: a transitive timestamp inside a
runs/<utc-iso>audit anchor reference, atmppath inside a captured stderr tail, an envvar leak. - Run the harness.
pytest tests/golden/test_goldens_match.py. Green. - Plant the count guard.
wc -l tests/golden/probes/**/*.json | tail -n 1→ record the count intests/golden/probes/COUNT.txt. Plant the test (tests/golden/test_golden_count_matches.py) that asserts the on-disk count equalsCOUNT.txt. - Plant the plaintext-leak guard.
tests/golden/test_no_plaintext_in_goldens.pyper AC-32. - Final pass.
mypy --strict,ruff check,ruff format --check. Run the full Phase 2 test suite. Green.
TDD plan — red / green / refactor¶
Red — failing harness first¶
# tests/golden/test_goldens_match.py (excerpt)
import subprocess
from pathlib import Path
from codegenie.exec import run_allowlisted
_REPO_ROOT = Path(__file__).parent.parent.parent
def test_goldens_match_live_output() -> None:
"""Run `regen_golden.py --check --portfolio` and assert zero diffs."""
result = run_allowlisted(
["python", str(_REPO_ROOT / "scripts" / "regen_golden.py"), "--check", "--portfolio"],
cwd=_REPO_ROOT, timeout_seconds=600,
)
assert result.exit_code == 0, (
f"Golden file mismatch — diff summary follows:\n{result.stderr_tail.decode()}"
)
Before goldens exist, the test fails red (--check reports every fixture × probe cell as missing).
Plus the count guard:
# tests/golden/test_golden_count_matches.py
from pathlib import Path
def test_committed_golden_count_matches_count_file() -> None:
root = Path(__file__).parent / "probes"
expected = int((root / "COUNT.txt").read_text().strip())
actual = sum(1 for _ in root.rglob("*.json"))
assert actual == expected, (
f"Golden count drift: committed {actual} files, COUNT.txt says {expected}. "
f"Update COUNT.txt as a deliberate PR step."
)
And the plaintext-leak guard:
# tests/golden/test_no_plaintext_in_goldens.py
import re
from pathlib import Path
import pytest
_PATTERNS = [
re.compile(r"AKIA[0-9A-Z]{16}"),
re.compile(r"ghp_[A-Za-z0-9]{36}"),
re.compile(r"npm_[A-Za-z0-9]{36}"),
re.compile(r"sk-ant-[A-Za-z0-9-]{30,}"),
re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
# eyJ-prefix JWT is too noisy (matches base64 blobs); leave to gitleaks
]
@pytest.mark.parametrize("path", sorted((Path(__file__).parent / "probes").rglob("*.json")),
ids=lambda p: p.name)
def test_no_plaintext_in_golden(path: Path) -> None:
text = path.read_text()
for pat in _PATTERNS:
assert not pat.search(text), (
f"{path} contains a string matching {pat.pattern!r} — "
f"SecretRedactor failure mode? Plaintext must not appear in goldens (ADR-0005)."
)
Green — make it pass¶
Extend regen_golden.py per the outline. Run --update --portfolio twice. Plant COUNT.txt. Re-run the harness. Green.
Mutation-resistance witness table¶
| Mutation | Test that catches it |
|---|---|
regen_golden.py forgets to canonicalize one field (sorted-keys regression) |
test_goldens_match_live_output — the diff fires on every run |
A new field wall_clock_ms_v2 slips into a probe's slice without being added to the exclusion list |
test_goldens_match_live_output — second-run-no-op breaks |
A contributor "fixes" the stale-scip golden to reflect kind: Fresh (silently breaking the load-bearing exit) |
tests/adv/phase02/test_stale_scip_fixture.py (S4-02) — still fails first; the golden is consistent with the gather output but the assertion test catches the regression at the probe level |
A stray plaintext AKIA... slips into a golden (e.g., implementer pasted a real key during fixture authoring) |
test_no_plaintext_in_golden[*] |
Implementer adds a golden without updating COUNT.txt |
test_committed_golden_count_matches_count_file + _compute_expected_golden_count() parity (AC-25) |
Implementer deletes a golden without updating COUNT.txt |
Same |
Implementer adds a (probe, fixture) cell silently (touches no count guard) |
_compute_expected_golden_count() derived from _PROBE_FIXTURE_MATRIX — adding a cell without editing the matrix means the matrix → count assertion fails (AC-25) |
--update silently overwrites a golden the implementer didn't intend to update |
git diff in the PR — the file's bytes change visibly. PR review catches it |
--update is interrupted mid-write leaving half a golden on disk |
test_atomic_write.py — patched os.replace failure asserts file is untouched (AC-34) |
Second --update produces drift (non-determinism slipped past exclusion list) |
test_regen_golden_portfolio_idempotent.py — CI gate, not PR-description-only (AC-27) |
| macOS regen produces shape-divergent goldens, contributor commits anyway | --update --portfolio exits 2 on non-Linux with a clear stderr message (AC-16) |
Plaintext leaks past SecretRedactor due to a new pattern added after this story landed |
test_no_plaintext_in_goldens.py imports _PATTERNS directly — new pattern is automatic (AC-32) |
_PRESERVED_FIELDS regression — a future exclusion regex inadvertently scrubs fingerprint |
test_preserved_fields_win.py — inclusion always wins (AC-37) |
A new fixture is added to tests/fixtures/portfolio/ but regen_golden.py doesn't pick it up |
test_regen_open_closed_seam.py — zero-edit discovery asserted (AC-35) |
Phase 1's node_typescript_helm.repo-context.yaml is mutated by S7-03's regen |
test_phase1_yaml_golden_untouched.py — SHA256 snapshot (AC-38) |
A future contributor creates _helpers/ under tests/fixtures/portfolio/ and accidentally triggers a fixture pass |
test_underscore_prefix_skip.py (AC-40) |
| Failure message is a wall of text with no diff markers — CI log unreadable | test_failure_message_shape.py asserts ---, +++, @@ (AC-30) |
Any-typed helper sneaks into _canonicalize defeating mypy --strict |
test_canonicalize_type_purity.py ast-walk (AC-39) |
| CRLF endings on a golden via Windows checkout | A line-endings test on tests/golden/probes/**/*.json (AC-33) |
Refactor — clean up¶
- The exclusion list is the load-bearing audit surface. Document each exclusion's source-of-non-determinism in a comment. A future contributor reading the list should be able to defend every entry — if they cannot, the entry is over-broad and should narrow.
- For probes whose slice contains lists-of-dicts (
semgrep.findings,gitleaks.findings,dep_graph.edges), the explicit sort tuple lives next to the probe in the script:Adding a probe that emits a list-of-dicts requires extending this dict — one line. The convention is grep-discoverable._LIST_SORT_KEYS: dict[str, tuple[str, ...]] = { "semgrep.findings": ("rule_id", "file", "line", "column"), "ast_grep.findings": ("rule_id", "file", "line", "column"), "ripgrep_curated.findings": ("rule_id", "file", "line"), "gitleaks.findings": ("rule_id", "file", "line", "fingerprint"), # fingerprint, NOT plaintext "dep_graph.edges": ("from", "to"), } regen_golden.py's_canonicalizefunction is mypy-strict; theAnypayload threads through one type-narrowed dispatch. No untyped helpers.- The PR description records the count, the second-run-byte-identical check result, and the explicit list of exclusions added (if any) beyond the AC-7 baseline.
Files to touch¶
| Path | Why |
|---|---|
scripts/regen_golden.py (extend with --portfolio mode; or create if S6-01 hasn't shipped) |
The canonical-JSON regen script + exclusion list + inclusion list + _PROBE_FIXTURE_MATRIX + _compute_expected_golden_count() |
tests/golden/probes/<probe>/<fixture>.json (count == _compute_expected_golden_count()) |
The committed goldens |
tests/golden/probes/COUNT.txt |
Count guard against silent golden additions/deletions (asserted equal to _compute_expected_golden_count()) |
tests/golden/test_goldens_match.py |
The CI harness — invokes --check mode and surfaces diffs (AC-28, AC-30) |
tests/golden/test_golden_count_matches.py |
Count guard (AC-25) |
tests/golden/test_no_plaintext_in_goldens.py |
Defense-in-depth ADR-0005 enforcement; patterns imported from codegenie.output.sanitizer._PATTERNS (AC-32) |
tests/golden/test_regen_golden_portfolio_idempotent.py |
Two-runs-byte-identical CI assertion (AC-27) |
tests/golden/test_failure_message_shape.py |
Unified-diff format assertion (AC-30) |
tests/golden/test_atomic_write.py |
Atomic write defense (AC-34) |
tests/golden/test_regen_open_closed_seam.py |
Open/Closed seam for new fixtures (AC-35) |
tests/golden/test_preserved_fields_win.py |
Inclusion-list-wins-over-exclusion-list (AC-37) |
tests/golden/test_phase1_yaml_golden_untouched.py |
S6-01 YAML golden non-collision (AC-38) |
tests/golden/test_no_phase1_per_probe_collision.py |
Phase-1 fixture name non-collision (AC-38) |
tests/golden/test_canonicalize_type_purity.py |
No : Any in canonicalize signatures (AC-39) |
tests/golden/test_underscore_prefix_skip.py |
Private-directory skip (AC-40) |
tests/conftest.py (add --update-golden flag) |
Developer convenience (AC-29) |
Out of scope¶
- Adversarial corpus (
hostile_skills_yaml,concurrent_gather_race,no_inmemory_secret_leak,phase3_handoff_smoke) — S7-04. - Property tests + portfolio sweep integration — S7-05.
- CI wiring (
portfoliojob that runs the harness on every PR) — S8-03. scip-index.scipbinary-golden equality — out; SCIP blob bytes are too brittle acrossscip-typescriptversions. Phase 2 commits the metadata slice only (AC-13).bench-canary goldens (wall-clock baselines undertests/bench/baselines/) — S8-03 owns the bench infrastructure; this story is golden-FILES only.runtime_tracemacOS shape goldens — out; regen-on-Linux-only discipline (AC-16). macOS regen producesStraceUnavailable()which is deterministic but documented as a separate code path, not a golden.
Notes for the implementer¶
- Risk #5 is the load-bearing risk this story defends. The mandatory PR-merge prerequisite is: "
scripts/regen_golden.py --update --portfoliowas run twice locally; the second run produced zero file changes." Record this verbatim in the PR description. Run the second pass on a fresh shell to avoid any in-memory caching artifact. If the second run produces diffs, find and fix the source of non-determinism before merging — don't add an exclusion without understanding what's being excluded. - The exclusion list (AC-7) is a load-bearing audit surface. Each entry should be defensible. Common temptation: adding
*_countto exclusions because a probe's count varies — that's wrong; the count is supposed to be deterministic across runs, and a varying count means the probe itself is non-deterministic. Fix the probe, not the exclusion list. Acceptable exclusions are wall-clock, tmp-path, envvar-leak only. dep_graphgoldens reflect Phase-2's zero-strategy state. Every fixture'sdep_graphgolden records{"confidence": "low", "reason": "no_strategy_for_ecosystem", ...}. When Phase 3 lands the first strategy (per S4-05's Open/Closed registry), it regeneratesmonorepo-pnpm'sdep_graphgolden (the one fixture with cross-package edges) — that's a deliberate Phase-3 PR, not a Phase-2 surprise. Document this indep_graph/monorepo-pnpm.json's top-level"_phase2_note": "Phase 2 ships with zero dep-graph strategies; Phase 3 fills the registry."field (if the schema allows; otherwise in the script's commit-message commentary).stale-scip/index_healthgolden is the load-bearing one. Itsfreshnessfield must be exactly{"kind": "Stale", "reason": {"kind": "CommitsBehind", "n": <integer>=1>, "last_indexed": "<prior-sha>"}}. Thenvalue is whatevergit rev-list --count <prior-sha>..HEADreturns for the fixture's two-commit history (= 1, given AC-18 of S7-02). Ifnis greater than 1 in your regen, the fixture has drifted (more commits added without updatinglast-indexed-commit.txt) — re-anchor it before committing the golden.distroless-targetgoldens include the resolved image digest (per AC-10 + ADR-0004). Make surebuilt-image.digest(gitignored, but generated bydistroless-target/regenerate.sh) exists at regen time. Document the dependency in the script: ifbuilt-image.digestis missing, the script emits a clear error pointing totests/fixtures/portfolio/distroless-target/regenerate.sh.- Run regen on Linux for the canonical truth. macOS regen produces shape-divergent goldens (
runtime_tracerecordsStraceUnavailable(); some BLAKE3 paths may differ if Python's hashlib varies by platform — should not, but verify). The CI runner isubuntu-24.04; for golden authoring, run on Linux. If you regen on macOS, the script should emit a one-line warning"WARNING: macOS regen — runtime_trace goldens will differ from CI Linux runner. Re-run on Linux before merging." pytest --update-goldenis the developer convenience flag. Implement it via aconftest.pypytest_addoption+ apytest_collection_modifyitemsthat injects an--updateinvocation before the assertion runs. Document intests/golden/README.md.-
Why no
gitleaksportfolio-fixture goldens with findings. Thesecret_in_sourceadversarial fixture (S6-07) lives undertests/adv/phase02/, nottests/fixtures/portfolio/. Portfolio fixtures are happy-path; theirgitleaksgoldens are allfindings_count=0(which is itself a load-bearing assertion:gitleaksran, found nothing in the curated fixtures, did not crash). Adversarial fixtures + their goldens live separately; their golden discipline is identical but their location istests/golden/adv-phase02/(out of scope for this story; S7-04 may or may not commit adversarial-fixture goldens — implementer's call there). -
_LIST_SORT_KEYSshould be keyed by typed probe name. The example in Refactor uses rawstrkeys ("semgrep.findings"). Per S7-02's_ProbeNameLiteral precedent (the shared kernel attests/fixtures/_shape_test_kernel.py), use a typed key:_LIST_SORT_KEYS: dict[tuple[ProbeName, str], tuple[str, ...]]whereProbeNameis imported fromtests.fixtures._shape_test_kernel. Single source of truth for probe names across the test harness; a typo ("semgreps.findings") becomes a mypy error rather than a silent no-op. - S6-01 / S7-03 coexistence. Phase 1 S6-01 owns
tests/golden/<fixture>.repo-context.yaml(envelope-level YAML, sentinel-normalizedgenerated_at+repo.git_commit,_GOLDEN_PAIRSseam). S7-03 ownstests/golden/probes/<probe>/<fixture>.json(slice-level JSON, field-excluded non-determinism,_PROBE_FIXTURE_MATRIXseam). They share_canonicalizeonly. If S6-01 hasn't shipped at executor time, createscripts/regen_golden.pyfrom scratch with the--portfoliomode reserved as the only mode for now; structure the file so S6-01 can later land its--fixture-yamlmode alongside withoutmain()edits (Open/Closed). If S6-01 has shipped, extend in place; add a--portfolioflag to the existing argparse; do NOT touch S6-01's_GOLDEN_PAIRStuple, its sentinel normalization, or its YAML writer. - macOS error-out rationale. Rule 12 (fail loud) drove the choice over the soft "skip with warning" alternative. A macOS contributor who silently skipped would commit a partial regen, push, and CI would diff against the unchanged Linux goldens — confusing failure mode. Erroring out at
--updateinvocation surfaces the constraint at the earliest possible moment with the clearest possible message.--checkmode (read-only) still works on every platform because it does not invokecodegenie gather; it diffs against committed bytes only. - Why
_PRESERVED_FIELDSis necessary and inclusion-wins is load-bearing. Without it, a clever exclusion regex (e.g., a future_HEX_FINGERPRINT_PATTERNaimed at scrubbing run-id UUIDs) could accidentally scrub the BLAKE3 8-hex fingerprints that ADR-0005 wants preserved. The inclusion list is a defensive co-loop: any field name in_PRESERVED_FIELDSshort-circuits the exclusion logic. The conflict-resolution rule is asserted bytest_preserved_fields_win.py, not just commented. JsonValuerecursive type. Python 3.12+ supports PEP 695-style recursiveTypeAliasdefinitions. The cleanest declaration is at the top ofscripts/regen_golden.py:Forward-string-quoted self-references survivefrom typing import TypeAlias JsonValue: TypeAlias = None | bool | int | float | str | list["JsonValue"] | dict[str, "JsonValue"]mypy --strict. The singleAnypermitted is at the JSON I/O boundary (json.load(fp)returnsAny); narrow withcast(JsonValue, ...)immediately. Thetest_canonicalize_type_purity.pytest enforces this byast-walk.
Patterns DELIBERATELY deferred (premature-abstraction guard)¶
- A generic
tests/golden/test_kernel.pyparametrized over(probe, fixture)pairs — defer;regen_golden.py --checkalready does the iteration, and adding a parametrized pytest layer atop it is duplicate work without observable benefit. Re-evaluate if a future contributor needs per-cell debugging (then addpytest tests/golden/probes/index_health/selective invocation). - Per-probe golden schemas (a separate
tests/golden/probes/<probe>/_schema.jsonthat asserts every fixture's golden matches that schema) — defer; the production sub-schemas undersrc/codegenie/schema/probes/already do this at gather time, so the golden is validated upstream. Re-evaluate if probes diverge. pytest-snapshotlibrary integration — defer; the canonical-JSON discipline + the homegrownregen_golden.pyare the Phase-1 precedent and integrate cleanly with--check/--update.pytest-snapshotwould add a dep without changing observable behavior.