Story S3-01: SecretRedactor pattern classes + entropy threshold + BLAKE3 fingerprint
Step: Step 3 — Plant SecretRedactor + RedactedSlice smart constructor at the writer chokepoint
Status: Done (2026-05-16) — see _attempts/S3-01.md. Paired with S3-02. Evidence: sanitizer.py::redact_secrets + tests/unit/output/test_secret_redactor.py (45 tests, 28 ACs).
Effort: M
Depends on: S1-11 (forbidden-patterns extension that will later cover model_construct under src/codegenie/output/**, plus the nine new ADRs; Phase-2 ADRs 0005 + 0010 are landed)
ADRs honored: 02-ADR-0005 (no plaintext persistence anywhere in Phase 2), 02-ADR-0010 (RedactedSlice smart constructor at the writer boundary), production ADR-0005 (no LLM in gather — the chokepoint guarantee inherits to Phase 4 RAG ingestion), production ADR-0033 (newtype discipline; informs the deferred Fingerprint newtype — see Validation notes #11)
Validation notes (phase-story-validator, 2026-05-15)
Verdict: HARDENED. The draft was structurally sound — six pattern classes + entropy fallback + BLAKE3 fingerprint + mutation-test discipline + in-band findings tuple all trace cleanly to 02-ADR-0005, 02-ADR-0010, and phase-arch-design.md §"Component design" #4. But the draft carried two BLOCK-severity bugs in prescribed code (it would not run as written) and eight harden-tier gaps that would have let an obviously-wrong implementation slip past the executor's Validator pass. Edits applied:
- B1 (BLOCK): `content_hash` API mismatch. Draft prescribed `codegenie.hashing.content_hash(cleartext.encode("utf-8"))[:8]`. But Phase 0's `content_hash(path: Path) -> str` takes a `Path`, not bytes; the bytes overload is `content_hash_bytes(b: bytes) -> str`. Both return `"blake3:<64hex>"` (prefix-tagged per Phase-0 ADR-0001). Slicing `[:8]` would yield `"blake3:"` plus a single hex character (e.g., `"blake3:b"`), not 8 hex chars. Fix: Goal #2, AC-13, AC-14, and the Implementation outline now name `content_hash_bytes(cleartext.encode("utf-8")).removeprefix("blake3:")[:8]`. AC-32 added as the explicit regression test for prefix-stripping (a mutation that drops `.removeprefix("blake3:")` is caught).
- B2 (BLOCK): `JSONValue` import path wrong. Draft cited `src/codegenie/types.py` (Phase 0) as the home of the `JSONValue` recursive alias. Phase 1 actually places `JSONValue` at `src/codegenie/parsers/__init__.py` (re-exported from `parsers.safe_json`). The `codegenie.types` package holds identifier newtypes, not `JSONValue`. Fix: References section updated; the Implementation outline now imports `from codegenie.parsers import JSONValue`. The `ProbeId` import path is pinned at `from codegenie.types.identifiers import ProbeId` (the `codegenie.types` package `__all__` re-exports the other four newtypes but not `ProbeId` as of S1-05's landing; verified against `src/codegenie/types/__init__.py` on master).
- F1 (harden): Same-secret-twice dedupe invariant unasserted. AC-20 covers "two AWS keys + one entropy hit = 3 findings" but does not cover the same AWS key appearing twice (`findings_count == 2`, `len(fingerprints) == 1`). This is load-bearing for S3-02's `RedactedSlice.findings_count >= len(fingerprints)` model validator. A regression that uses a `set` for findings (dedupe at finding-level instead of fingerprint-level) silently breaks the contract. Fix: AC-26 added.
- F2 (harden): Two distinct named-pattern matches in one string unasserted. AC-20 covers two findings of the same pattern class plus an entropy hit; no AC covers two findings of different named-pattern classes co-located in one string value (e.g., `"aws=AKIA…EXAMPLE github=ghp_<36>"`). A regression that `return`s after the first pattern match per pass silently passes. Fix: AC-27 added.
- F3 (harden): `cleartext_len` byte-vs-char drift unasserted. Notes name "byte-length, not char-length" but no AC enforces it. A regression using `len(cleartext)` (char count) silently passes every ASCII test. Fix: AC-28 added with a multi-byte fixture (e.g., the canonical AWS key surrounded by a non-ASCII codepoint; the fixture only needs to differ in `byte_len != char_len`).
- F4 (harden): Input non-mutation unasserted. AC-22 asserts statelessness across calls but not that the input `slice_` dict is unmutated. A regression that mutates `slice_` in place ("save memory") would corrupt the caller's state. Fix: AC-29 added; it verifies `slice_` survives the call by deep-equality with a pre-call `copy.deepcopy`.
- F5 (harden): `_PATTERNS` module-level placement unasserted. Implementer notes name "Pattern table at module level, not function-local" but no AC enforces it. AC-18's mutation tests require `monkeypatch.setattr(sanitizer, "_PATTERNS", ...)` to work; a refactor that moves `_PATTERNS` function-local silently disables every mutation test (they no-op because the function-local table is never read by the patched module attribute). Fix: AC-30 added; it verifies `_PATTERNS` and `_ENTROPY_THRESHOLD_BITS_PER_CHAR` are module-level names, and that `monkeypatch.setattr` against them genuinely changes the redactor's behavior (positive-control test for the mutation harness).
- F6 (harden): Entropy edge cases unspecified. AC-10/11/12 cover the threshold rule but not boundary conditions. A naive Shannon-entropy implementation `-sum(p * log2(p) for c, p in freq)` crashes on `log2(0)` if any iterated entry has count zero (rare but defensible to guard); empty strings, single-repeated-char strings, and Unicode-only strings of len ≥ 32 are all undefined by the draft. Fix: AC-31 added; it pins behavior on `""`, `"a"`, `"a" * 100`, and a 64-char Unicode string. The entropy function itself never crashes; skip-the-rule semantics for `len < 32` (already in AC-12) handle the small cases.
- F7 (harden): Cross-story integration with S3-02 invariants unasserted. S3-02 lands `RedactedSlice` with validators `findings_count >= len(fingerprints)`, `findings_count >= 0`, and fingerprints must match `^[0-9a-f]{8}$`. S3-01's output must satisfy all three by construction. A regression that returns uppercase hex (e.g., a `.hexdigest().upper()` slip) or an off-by-one count would fail `RedactedSlice` construction at runtime, but only if the integration is tested. AC-22 tests S3-01 alone; no AC tests S3-02's validators against S3-01's output. Fix: AC-33 added; it asserts the returned `RedactedSlice` round-trips through `RedactedSlice.model_validate(model_dump())` and the three invariants hold.
- F8 (harden): Secret-as-substring (inline replacement) unasserted across patterns. Implementation Notes name the JWT inline-replacement case (`"Bearer eyJ..."` → `"Bearer <REDACTED:...>"`); no AC asserts the same property for the other five pattern classes. A regression that anchors a pattern (e.g., `^AKIA[0-9A-Z]{16}$`, or uses `re.fullmatch`) silently passes simple fixtures. Fix: AC-34 added; it covers inline-substring replacement for AWS, GitHub, NPM, Anthropic, and entropy. JWT is already covered by Notes; AC-34 promotes the property to a contract across the family.
- F9 (nit): `Fingerprint` NewType opportunity surfaced, deferred. The 8-hex fingerprint string crosses module boundaries (`sanitizer.py` → `redacted_slice.py` → `writer.py` → log-field constant in S3-03). Production ADR-0033 §3 names primitive obsession on domain identifiers as a review-blocker. Not promoted to an AC in this story because the rule-of-three threshold (third concrete consumer of the family) is not yet reached in S3-01 alone, and S3-02's `RedactedSlice.fingerprints: list[str]` field validator already closes the format invariant at construction. Fix: Notes-for-implementer #11 added; `Fingerprint = NewType("Fingerprint", str)` is recommended for a Phase-3 cross-cutting story when the third consumer (Phase 3 RAG ingest or the audit-anchor) lands. Closing the origin invariant (only `_fingerprint(...)` produces a `Fingerprint`) is a Phase-3 follow-up; this story closes the format invariant by validator.
- F10 (nit): Variant-set extension framing missing. `pattern_class: Literal[...]` is a closed set; extension is ADR-amendment-gated, mirroring S1-01 (`IndexFreshness`) and S1-03 (`AdapterConfidence`). The story did not state this explicitly; a future reader might "improve" it to `str` for "extensibility". Fix: Notes-for-implementer #12 added; adding a seventh pattern class is an ADR amendment, not Open/Closed. The walker / fingerprinter / entropy logic stays untouched (extension-by-addition at the `_PATTERNS` row + the `Literal[...]` set widening); the closed set is the deliberate design.
- F11 (nit): Module-docstring assertion technique unspecified. AC-2 requires the module docstring to reference specific docs but pins no test mechanism. Fix: AC-2 strengthened; a programmatic check via `inspect.getdoc(codegenie.output.sanitizer)` substring-matches the required references.
Coverage critic: HARDEN (11 findings). Test-quality critic: HARDEN (mutation table shows six plausibly-wrong implementations slip past the original TDD plan; all closed below). Consistency critic: two BLOCK findings (B1, B2), zero ADR conflicts. Design-pattern critic: two nits surfaced as Notes-for-implementer (Fingerprint newtype #11; variant-set framing #12). Stage 3 research skipped — every gap was answerable from arch + ADR-0005 + ADR-0010 + S1-01 / S1-03 sibling validation precedent + Phase 0 hashing.py source on master.
Ready for phase-story-executor.
Context
Phase 2 introduces three sources of cleartext secret findings: gitleaks (walks .git/ history for credential patterns — see S6-07), semgrep p/secrets (rule-pack matches over source — see S6-06), and an entropy fallback for novel credential shapes the pattern set does not name. Without a defense at the persistence boundary, those cleartext bytes flow through the Phase 0 OutputSanitizer.scrub (which only scrubs known field-name patterns, not unknown high-entropy values inside arbitrary string fields) and land verbatim in repo-context.yaml, every raw/*.json, the cache blob, and the audit anchor. 02-ADR-0005 picks the structural fix: don't persist plaintext at all — SecretRedactor intercepts every string in every ProbeOutput.schema_slice before it reaches disk, replaces matched cleartext with <REDACTED:fingerprint=BLAKE3_8> inline, and returns an in-memory list[SecretFinding] for the CLI summary that is never persisted. This story is the redactor itself — patterns, entropy fallback, fingerprint scheme, and the mutation-test discipline that makes pattern coverage a build invariant. The next two stories tighten the writer's signature (S3-02 → S3-03) so "redactor was called" is type-checkable, not convention-enforced.
The redactor sits between Phase 0's existing OutputSanitizer.scrub (field-name regex + JSONValue tree walk) and the writer. It does not replace the Phase 0 sanitizer; it composes after it (composition order is pinned in S3-03 and verified by mock-spy test). The pattern set is finite by intent — six named credential classes that have stable, regex-matchable shapes — with a Shannon-entropy floor (≥ 4.5 bits/char on len ≥ 32 unknowns) as the safety net for vendor-specific token shapes the pattern set does not name. Phase-0 codegenie.hashing.content_hash (BLAKE3) supplies the fingerprint helper; the first 8 hex chars are persisted, which is privacy-preserving by construction (BLAKE3 first-8-hex is not reversible to the cleartext).
The load-bearing test discipline is mutation testing: for each pattern class, a deliberately weakened regex (e.g., AKIA[0-9A-Z]{15} — one fewer character) is introduced via monkeypatch.setattr against the redactor's pattern table; the test asserts the redactor then fails to redact the canonical example secret. A regression that loosens a pattern silently passes today's coverage but fails the mutation. Pattern coverage stops being asserted and starts being verified.
References: where to look
- Architecture:
  - `../phase-arch-design.md §"Component design" #4 SecretRedactor`: public interface (`redact_secrets(slice_, probe_name) -> tuple[dict, list[SecretFinding]]` in the doc; superseded by 02-ADR-0010, so the return type is now `tuple[RedactedSlice, list[SecretFinding]]`; the `RedactedSlice` model itself lives in S3-02), pattern list, entropy threshold, fingerprint scheme, mutation-test discipline.
  - `../phase-arch-design.md §"Gap analysis & improvements" Gap 4`: the smart-constructor framing; this story implements the redactor body; S3-02 implements the `RedactedSlice` model + the privacy of construction.
  - `../phase-arch-design.md §"Sequence — secret-redaction flow"` (line ~420): the end-to-end flow: `gitleaks` → Pydantic → `OutputSanitizer.scrub` → `redact_secrets` → writer. The in-memory findings list is NOT persisted.
  - `../phase-arch-design.md §"Goals" G5`: testable invariant ("plaintext in zero persisted files"). Asserted by S6-07's `test_secret_in_source.py`; this story is the runtime that makes that assertion green.
  - `../phase-arch-design.md §"Anti-patterns avoided"`: `model_construct()` bypass; this story does NOT construct `RedactedSlice` via `model_construct` and lands the build invariant in S3-02.
- Phase 2 ADRs:
  - `../ADRs/0005-secret-findings-no-plaintext-persistence.md`: the structural-defense rationale, full options-considered table, and the reversibility analysis (Medium; one-way by design).
  - `../ADRs/0010-redacted-slice-smart-constructor-at-writer-boundary.md`: the typed return shape this story emits.
- Source design:
  - `../final-design.md §"Components" #4 SecretRedactor`: the original tuple return shape; this story implements the 02-ADR-0010-tightened shape.
  - `../final-design.md §"Conflict-resolution table" row 4`: synthesis pick: structural fix (Option C) over encryption-at-rest (Option B) or inline plaintext (Option A).
  - `../final-design.md §"Failure modes & recovery" row 7`: `gitleaks` AKIA-in-git-history scenario.
- Existing code (Phase 0 + Phase 1 on master):
  - `src/codegenie/output/sanitizer.py`: Phase 0 `OutputSanitizer.scrub`. This story extends it with `redact_secrets`; composition order is the next story's job (S3-03). The existing `scrub(...)` signature is unchanged in interface; the body grows by composition.
  - `src/codegenie/hashing.py` (Phase 0): `content_hash_bytes(b: bytes) -> str` is the BLAKE3-of-bytes chokepoint (`content_hash(path: Path)` is the file variant; do NOT confuse the two). Both return prefix-tagged `"blake3:<64hex>"` per Phase-0 ADR-0001; this story strips `"blake3:"` then slices `[:8]`. The fingerprint helper isolates this two-step.
  - `src/codegenie/errors.py` (Phase 0 + Phase 1): markers-only invariant (`tests/unit/test_errors.py::test_subclasses_are_markers_only`). `SecretFinding` is not an exception; it is a Pydantic frozen model.
  - `src/codegenie/parsers/__init__.py` (Phase 1): the `JSONValue` recursive alias is exported here (NOT under `codegenie.types`, which is the identifier-newtype package). Import as `from codegenie.parsers import JSONValue`.
  - `src/codegenie/types/identifiers.py` (Phase 2 / S1-05): `ProbeId = NewType("ProbeId", str)`. Import as `from codegenie.types.identifiers import ProbeId` (the `codegenie.types` top-level `__all__` re-exports four other newtypes but NOT `ProbeId` as of S1-05's landing; verify before adopting a re-export).
- Phase 1 shape calibration:
  - `docs/phases/01-context-gather-layer-a-node/stories/S1-02-safe-json-parser.md`: chokepoint pattern, parsers-layer mutation discipline, structured-event emission via `structlog.testing.capture_logs()` (this story emits the `secrets_redacted_count` field, implemented in S3-03 at the writer call site, not here).
Goal
Ship src/codegenie/output/sanitizer.py::redact_secrets(slice_, probe_name) -> tuple[RedactedSlice, list[SecretFinding]] (referring to the RedactedSlice model that S3-02 lands; this story imports it) such that:
- Every string value in the input `slice_: dict[str, JSONValue]` is walked recursively (descending into `dict` values and `list` items, mirroring the Phase-1 `safe_json` depth-walker shape).
- Each pattern-class regex (AWS, GitHub, JWT, RSA private-key block, NPM, Anthropic) replaces matches inline with `<REDACTED:fingerprint=<8hex>>` where `<8hex>` is the first 8 hex characters of `codegenie.hashing.content_hash_bytes(cleartext.encode("utf-8")).removeprefix("blake3:")`. Phase 0 `content_hash_bytes(b: bytes) -> str` is the BLAKE3-of-bytes chokepoint companion to `content_hash(path)`; both return prefix-tagged `"blake3:<64hex>"` per Phase-0 ADR-0001, so the `"blake3:"` prefix must be stripped before the 8-hex slice. The fingerprint helper isolates this one-liner.
- After all pattern-class regex passes, any remaining string of `len(s) >= 32` whose Shannon entropy is ≥ 4.5 bits/char is replaced with `<REDACTED:fingerprint=<8hex>>` (the entropy-fallback rule).
- Each replacement appends a `SecretFinding(probe_name, fingerprint, pattern_class, cleartext_len)` to an in-memory `list[SecretFinding]`. The cleartext is never stored on the `SecretFinding`. Plaintext exists only as a local variable inside the regex-substitution callback and is discarded immediately after the fingerprint is computed.
- The returned `RedactedSlice` carries `slice` (the redacted dict), `findings_count` (the list length), and `fingerprints` (the 8-hex strings, deduplicated and stably ordered).
- The returned `list[SecretFinding]` is the in-band audit trail consumed by the CLI summary; it is never threaded into the persisted slice.
All pattern classes are mutation-tested: a deliberately weakened regex (e.g., `AKIA[0-9A-Z]{15}`, one fewer character) must cause the canonical-example test to FAIL. Pattern coverage is verified, not asserted.
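The fingerprint and fingerprint-list rules from the goal list can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: `content_hash_bytes` here is a hypothetical stand-in that uses stdlib `hashlib.blake2b` in place of BLAKE3 and only mimics the prefix-tagged `"blake3:<64hex>"` return shape; `fingerprint` and `dedupe_fingerprints` are illustrative names.

```python
import hashlib


def content_hash_bytes(b: bytes) -> str:
    """Stand-in (assumption) for codegenie.hashing.content_hash_bytes.

    The real helper is BLAKE3; blake2b is used only because it ships with
    the stdlib. What matters here is the "blake3:<64hex>" return shape.
    """
    return "blake3:" + hashlib.blake2b(b, digest_size=32).hexdigest()


def fingerprint(cleartext: str) -> str:
    # Strip the tag first, then slice: slicing first would keep "blake3:".
    return content_hash_bytes(cleartext.encode("utf-8")).removeprefix("blake3:")[:8]


def dedupe_fingerprints(fingerprints: list[str]) -> list[str]:
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(fingerprints))


fp = fingerprint("AKIAIOSFODNN7EXAMPLE")
# The same secret seen twice gives two findings but one distinct fingerprint.
distinct = dedupe_fingerprints([fp, fp])
```

Keeping the strip-then-slice step in one helper means a single regression test can pin the exact form.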
Acceptance criteria
Module / surface:
- [ ] AC-1: `src/codegenie/output/sanitizer.py` exports `redact_secrets(slice_: dict[str, JSONValue], probe_name: ProbeId) -> tuple[RedactedSlice, list[SecretFinding]]`. The existing `OutputSanitizer.scrub` interface is unchanged (composition is S3-03's job).
- [ ] AC-2: The `src/codegenie/output/sanitizer.py` module docstring is extended (Phase 0 docstring preserved) to reference `phase-arch-design.md §"Component design" #4`, 02-ADR-0005, and 02-ADR-0010, and to document the entropy threshold (≥ 4.5 bits/char, `len ≥ 32`) with a one-line rationale ("Shannon-entropy floor sized against the gitleaks pattern pack; tunable per `phase-arch-design.md §"Component design" #4`"). A test programmatically asserts the docstring contains the substrings `02-ADR-0005`, `02-ADR-0010`, `4.5`, and `32`: `assert all(s in inspect.getdoc(codegenie.output.sanitizer) for s in [...])`. A regression that drops a reference is caught.
- [ ] AC-3: `SecretFinding` is a Pydantic `frozen=True, extra="forbid"` model in `src/codegenie/output/sanitizer.py` (or a sibling module imported from it) with exactly four fields: `probe_name: ProbeId`, `fingerprint: str` (8 hex chars), `pattern_class: Literal["aws_access_key", "github_token", "jwt", "rsa_private_key", "npm_token", "anthropic_key", "entropy"]`, `cleartext_len: int`. No `cleartext` field, no `file_line` field (the file-line audit trail is the caller's job; gitleaks attaches it separately and it never reaches the persisted artifact per 02-ADR-0005).
Pattern-class regex matching:
- [ ] AC-4: `AKIA[0-9A-Z]{16}` matches and replaces; canonical example `AKIAIOSFODNN7EXAMPLE` is redacted to `<REDACTED:fingerprint=<8hex>>` with `pattern_class="aws_access_key"`.
- [ ] AC-5: `ghp_[A-Za-z0-9]{36}` matches and replaces; canonical example `ghp_` + 36 alnum chars is redacted with `pattern_class="github_token"`.
- [ ] AC-6: JWT pattern `eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+` matches and replaces; a synthetic three-part JWT with realistic base64url segments is redacted with `pattern_class="jwt"`. The regex does not match a bare `eyJabc` (must include both dots and three segments).
- [ ] AC-7: RSA private-key block `-----BEGIN[ A-Z]*PRIVATE KEY-----[\s\S]+?-----END[ A-Z]*PRIVATE KEY-----` matches and replaces, including multi-line bodies; `pattern_class="rsa_private_key"`. The replacement collapses the entire block to a single `<REDACTED:fingerprint=<8hex>>` token (the fingerprint covers the full block bytes).
- [ ] AC-8: `npm_[A-Za-z0-9]{36}` matches and replaces; `pattern_class="npm_token"`.
- [ ] AC-9: `sk-ant-[A-Za-z0-9_-]{50,}` matches and replaces (length ≥ 50 to avoid short-prefix false positives); `pattern_class="anthropic_key"`. (The hyphen is written last in the character class, matching the pattern table in the Implementation outline.)
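The inline-replacement behaviour these criteria describe can be sketched with `re.sub` and a callable replacement. This is a hedged illustration for the AWS pattern only: `make_repl` and `stand_in_fingerprint` are hypothetical names, findings are plain dicts in place of the `SecretFinding` model, and the hash is stdlib `blake2b` standing in for the project's BLAKE3 helper.

```python
from __future__ import annotations

import hashlib
import re
from typing import Callable

AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")  # unanchored, so it matches inside longer strings


def stand_in_fingerprint(cleartext: str) -> str:
    # Assumption: blake2b stands in for the BLAKE3-based project helper.
    return hashlib.blake2b(cleartext.encode("utf-8"), digest_size=32).hexdigest()[:8]


def make_repl(
    pattern_class: str, findings: list[dict[str, object]]
) -> Callable[[re.Match[str]], str]:
    def repl(match: re.Match[str]) -> str:
        cleartext = match.group(0)  # plaintext lives only in this local
        fp = stand_in_fingerprint(cleartext)
        findings.append(
            {
                "pattern_class": pattern_class,
                "fingerprint": fp,
                # Byte length of the UTF-8 encoding, not the character count.
                "cleartext_len": len(cleartext.encode("utf-8")),
            }
        )
        return f"<REDACTED:fingerprint={fp}>"

    return repl


findings: list[dict[str, object]] = []
redacted = AWS_KEY.sub(
    make_repl("aws_access_key", findings),
    "prefix-AKIAIOSFODNN7EXAMPLE-suffix",
)
```

Because the pattern is unanchored and `re.sub` calls the closure once per match, the surrounding `prefix-` and `-suffix` text survives and each match yields exactly one finding.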
Entropy fallback:
- [ ] AC-10: A string of `len ≥ 32` with Shannon entropy `≥ 4.5` bits/char that does not match any named pattern is redacted with `pattern_class="entropy"`. Concrete example: 32+ chars of base64-url-safe random output.
- [ ] AC-11: A string of `len ≥ 32` with Shannon entropy `< 4.5` bits/char (e.g., `"a" * 64`, entropy = 0; or English prose like `"the quick brown fox" * 4`) is not redacted by the entropy rule.
- [ ] AC-12: A string of `len < 32` (e.g., a 16-char high-entropy base64 fragment) is not redacted by the entropy rule, even if its entropy crosses the 4.5 threshold. The length floor is the false-positive control.
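The entropy rule in these criteria can be sketched as below. This is a minimal, assumption-laden version: the function and constant names are illustrative, the rule is applied to a whole string, and the fixtures are deterministic all-distinct strings instead of random output so the entropy values are easy to check by hand.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD_BITS_PER_CHAR = 4.5
ENTROPY_MIN_LEN = 32


def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character; 0.0 for the empty string."""
    if not s:
        return 0.0
    n = len(s)
    # Every count in a Counter is >= 1, so log2 never sees zero.
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


def entropy_rule_fires(s: str) -> bool:
    # The length floor is checked first: short strings are never flagged.
    return len(s) >= ENTROPY_MIN_LEN and shannon_entropy(s) >= ENTROPY_THRESHOLD_BITS_PER_CHAR


all_distinct_32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef"  # 32 distinct chars: log2(32) = 5 bits/char
all_distinct_24 = "ABCDEFGHIJKLMNOPQRSTUVWX"          # log2(24) is about 4.58, but len < 32
prose = "the quick brown fox" * 4                     # 16 distinct chars: at most 4 bits/char
```

Here `entropy_rule_fires(all_distinct_32)` is true, while the 24-character string is skipped by the length floor even though its entropy is above the threshold, and the repeated prose stays below 4.5 bits/char.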
Fingerprint scheme:
- [ ] AC-13: Fingerprint is exactly the first 8 hex characters of `codegenie.hashing.content_hash_bytes(cleartext.encode("utf-8")).removeprefix("blake3:")`. A test asserts: `len(fingerprint) == 8`, every char is in `[0-9a-f]` (lowercase), identical cleartext yields identical fingerprint, distinct cleartext yields distinct fingerprint (sampled across the pattern set). Negative test: a regression that calls `content_hash(...)` (the `Path` overload) or skips `.removeprefix("blake3:")` is caught; neither produces an 8-hex fingerprint (the former is a `TypeError`; the latter yields `"blake3:"` plus one hex character, e.g. `"blake3:b"`).
- [ ] AC-14: The replacement token format is exactly `<REDACTED:fingerprint=<8hex>>` (literal `<REDACTED:fingerprint=`, the 8-hex fingerprint, literal `>`). The `RedactedSlice.fingerprints` field contains the deduplicated, stably-ordered (insertion order) list of distinct fingerprints from this slice.
Recursive walk:
- [ ] AC-15: The walker descends recursively into both `dict` values and `list` items (mirroring `safe_json._assert_depth`). A test fixture with a secret nested as `{"a": [{"b": ["AKIA…EXAMPLE"]}]}` is fully redacted.
- [ ] AC-16: A secret appearing as a `dict` key is not walked into for matching (keys are typically configuration names like `"aws_access_key"` themselves; Phase 0's field-name regex already covers that surface). Only `str` values are matched.
- [ ] AC-17: Non-string scalars (`int`, `float`, `bool`, `None`) are passed through unchanged. Nested types are preserved (`dict` stays `dict`, `list` stays `list`).
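The recursive walk these criteria describe could look like the sketch below. It is illustrative only: `walk` is a hypothetical name, `JSONValue` is aliased to `Any` in place of the recursive alias exported by `codegenie.parsers`, and the string redactor is passed in as a callable so the example stays self-contained.

```python
import copy
from typing import Any, Callable

JSONValue = Any  # assumption: stands in for the recursive alias in codegenie.parsers


def walk(node: JSONValue, redact_string: Callable[[str], str]) -> JSONValue:
    if isinstance(node, str):
        return redact_string(node)
    if isinstance(node, dict):
        # Keys are copied as-is; only values are walked.
        return {key: walk(value, redact_string) for key, value in node.items()}
    if isinstance(node, list):
        return [walk(item, redact_string) for item in node]
    return node  # int, float, bool, None pass through unchanged


def toy_redactor(s: str) -> str:
    # Toy stand-in for the real per-string redaction.
    return s.replace("AKIAIOSFODNN7EXAMPLE", "<REDACTED>")


slice_in = {"a": [{"b": ["AKIAIOSFODNN7EXAMPLE"]}], "AKIAIOSFODNN7EXAMPLE": 1, "n": None, "ok": True}
before = copy.deepcopy(slice_in)
slice_out = walk(slice_in, toy_redactor)
```

Building new containers at each level, instead of assigning into the input, is one simple way to keep the caller's `slice_` untouched.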
Mutation test (load-bearing):
- [ ] AC-18: `tests/unit/output/test_secret_redactor.py::test_aws_key_mutation` replaces the AWS regex in the redactor's pattern table with a deliberately weakened version (`AKIA[0-9A-Z]{15}`, one char fewer) via `monkeypatch.setattr` against the module-level pattern table, then re-runs the redactor against the canonical `AKIAIOSFODNN7EXAMPLE` and asserts the cleartext is not fully redacted. Because the patterns are unanchored, the `{15}` variant still matches the first 19 characters of the 20-char example, so the assertion has to target full redaction (the expected 20-char fingerprint token with no leaked trailing character), not the absence of any match. A mutation that loosens the pattern fails the build. Mirror tests for the other five pattern classes (`test_github_token_mutation`, `test_jwt_mutation`, `test_rsa_private_key_mutation`, `test_npm_token_mutation`, `test_anthropic_key_mutation`) each weaken their pattern (one char fewer, one delimiter dropped, one anchor removed) and assert the canonical example slips through. The test mechanism is identical across the six; only the pattern + canonical example differ.
- [ ] AC-19: `tests/unit/output/test_secret_redactor.py::test_entropy_threshold_mutation` replaces the entropy threshold with `5.0` (above the chosen 4.5) and asserts a previously-redacted high-entropy 32-char base64 string is now passed through. A drift in the entropy floor fails the build.
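The mutation tests depend on the pattern table being looked up at call time. The sketch below shows that mechanism with stdlib tools only, as an assumption-heavy illustration: a `SimpleNamespace` named `sanitizer` stands in for the real module, `unittest.mock.patch.object` stands in for pytest's `monkeypatch.setattr`, and the replacement token is shortened to `<REDACTED>`.

```python
import re
from types import SimpleNamespace
from unittest.mock import patch

# Stand-in (assumption) for the sanitizer module and its module-level table.
sanitizer = SimpleNamespace(
    _PATTERNS=[("aws_access_key", re.compile(r"AKIA[0-9A-Z]{16}"))]
)


def redact(s: str) -> str:
    # The table is read from the namespace on every call, so a patch is seen.
    for _pattern_class, pattern in sanitizer._PATTERNS:
        s = pattern.sub("<REDACTED>", s)
    return s


KEY = "AKIAIOSFODNN7EXAMPLE"

# Positive control: an empty table must disable redaction entirely.
with patch.object(sanitizer, "_PATTERNS", []):
    with_empty_table = redact(KEY)

# Weakened pattern: one character fewer still matches a 19-char prefix.
weakened = [("aws_access_key", re.compile(r"AKIA[0-9A-Z]{15}"))]
with patch.object(sanitizer, "_PATTERNS", weakened):
    with_weakened_table = redact(KEY)

restored = redact(KEY)  # the patch is undone when the with-block exits
```

With the weakened pattern the output is `<REDACTED>E`: the final character of the key leaks, which is the difference a mutation test would need to assert on. If `redact` captured the table in a local or default argument instead, neither patch would have any effect.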
In-band findings list (audit trail):
- [ ] AC-20: `redact_secrets` returns `tuple[RedactedSlice, list[SecretFinding]]`. The `list[SecretFinding]` length equals the total number of replacements (each match is one finding, including entropy hits). A test asserts the count under a fixture with two AWS keys + one entropy hit = 3 findings.
- [ ] AC-21: `SecretFinding.cleartext_len` matches the original byte length of the redacted cleartext; the cleartext itself is not present on the `SecretFinding` (assert via `model_dump()` keys).
- [ ] AC-22: The function is stateless across calls: calling `redact_secrets` twice on the same input returns equal `RedactedSlice` objects (Pydantic equality) and equal `list[SecretFinding]` (in insertion order). No global state; no `ContextVar`; no module-level findings accumulator.
Phase-0 / Phase-1 invariants preserved:
- [ ] AC-23: The `OutputSanitizer.scrub` Phase-0 contract-freeze snapshot test (`tests/unit/test_probe_contract.py`) continues to pass; the existing `scrub` signature is unchanged in this story (S3-03 tightens the writer's signature one layer up).
- [ ] AC-24: No `model_construct` calls anywhere in this story's code. The `forbidden-patterns` pre-commit (S1-11) covers `src/codegenie/output/**`; this story does not introduce a violation. (S3-02 lands the `RedactedSlice` model that this story imports; this story does not construct it via `model_construct`.)
Validator-added: edge cases, integration, and mutation-harness positive controls:
- [ ] AC-26: Same-secret-twice dedupe invariant. A fixture with the same canonical AWS key appearing twice (`{"a": "AKIAIOSFODNN7EXAMPLE", "b": "AKIAIOSFODNN7EXAMPLE"}`) yields `findings_count == 2`, `len(returned_slice.fingerprints) == 1`. This is the load-bearing invariant for S3-02's `findings_count >= len(fingerprints)` validator; a regression that uses a `set` for findings (dedupe at the wrong level) silently breaks the contract.
- [ ] AC-27: Two distinct named patterns in one string. A fixture with an AWS key and a GitHub token co-located in one string value (`"aws=AKIAIOSFODNN7EXAMPLE github=ghp_<36 alnum>"`) yields two findings (`pattern_class` values `"aws_access_key"` and `"github_token"`) and two distinct fingerprints. A regression that exits after the first pattern match per pass is caught.
- [ ] AC-28: `cleartext_len` is byte-length, not char-length. Fixture: a cleartext that differs in byte-length vs. char-length (e.g., a synthetic high-entropy 32+-char string with one 3-byte UTF-8 codepoint mixed in, or surround the canonical AWS key with a non-ASCII codepoint that the pattern match excludes). Assert `SecretFinding.cleartext_len == len(matched_cleartext.encode("utf-8"))` and (for the multi-byte fixture) `cleartext_len != len(matched_cleartext)`. A regression using `len(cleartext)` (char count) is caught.
- [ ] AC-29: Input is not mutated. `slice_before = copy.deepcopy(slice_in); redact_secrets(slice_in, probe); assert slice_in == slice_before`. A regression that mutates `slice_` in place is caught. Combined with AC-22 (determinism across calls), this pins both stateless-across-calls and immutable-input.
- [ ] AC-30: `_PATTERNS` and `_ENTROPY_THRESHOLD_BITS_PER_CHAR` are module-level names (positive control for the mutation harness). Asserts: (a) `getattr(sanitizer, "_PATTERNS")` returns a `list`; (b) `getattr(sanitizer, "_ENTROPY_THRESHOLD_BITS_PER_CHAR")` returns a `float`; (c) `monkeypatch.setattr(sanitizer, "_PATTERNS", [])` causes a subsequent `redact_secrets` call against `"AKIAIOSFODNN7EXAMPLE"` to NOT redact the AWS key (the patched empty table is genuinely consulted). A regression that moves the table function-local fails (c); the mutation harness is broken if (c) fails, and that is the load-bearing positive control AC-18/AC-19 depend on.
- [ ] AC-31: Entropy edge cases. `_shannon_entropy("")` returns `0.0` (or a sentinel; whichever the implementation picks, pin one). `_shannon_entropy("a")` returns `0.0`. `_shannon_entropy("a" * 100)` returns `0.0`. `_shannon_entropy(some_64_char_unicode_string)` returns a finite float without crashing. The entropy function itself never raises `ValueError` / `ZeroDivisionError` / `OverflowError`. The skip-the-rule semantics for `len < 32` (already in AC-12) handle the small-string cases; this AC pins that the entropy function is total over `str`.
- [ ] AC-32: `content_hash_bytes` prefix-strip regression test. Calls `_fingerprint("sentinel")` (the helper takes `str`, per the Implementation outline) and asserts the returned 8-char string is all `[0-9a-f]` (lowercase) and is exactly `content_hash_bytes(b"sentinel")[7:15]` (the slice past the `"blake3:"` prefix). A regression that drops `.removeprefix("blake3:")` produces `"blake3:"` plus one hex character and fails the all-hex assertion.
- [ ] AC-33: S3-02 round-trip integration. The `RedactedSlice` returned by `redact_secrets` survives `RedactedSlice.model_validate(returned.model_dump())` (Pydantic round-trip through validators). Assert all three S3-02 invariants hold on every test fixture: (i) every fingerprint matches `^[0-9a-f]{8}$`; (ii) `findings_count >= len(fingerprints)`; (iii) `findings_count >= 0`. A regression in S3-01 that violates any of these (uppercase hex slip; off-by-one count; negative count from a buggy loop) fails at runtime; this AC moves the failure into the test suite where the executor's Validator catches it before merge.
- [ ] AC-34: Inline-substring replacement across pattern classes. Fixtures embed each named pattern inside a longer string (e.g., `"prefix-AKIAIOSFODNN7EXAMPLE-suffix"`, `"Authorization: token ghp_<36>"`, `"NPM_TOKEN=npm_<36>"`, `"X-Anthropic-Key: sk-ant-<50+>"`, `"random-prefix-<32+chars-high-entropy>-suffix"`). Each fixture's redacted output preserves the prefix/suffix and replaces only the matched substring with `<REDACTED:fingerprint=…>`. JWT is already covered by Notes-for-implementer; AC-34 promotes the property to a contract across the family. A regression that anchors a pattern (e.g., `^AKIA…$` or `re.fullmatch`) is caught.
Toolchain:
- [ ] AC-25: `ruff check`, `ruff format --check`, `mypy --strict`, `pytest` pass on touched files. `mypy --strict` accepts the recursive `JSONValue` walker without `# type: ignore`.
Implementation outline
- Add to
src/codegenie/output/sanitizer.py(extending Phase 0's existing module — the existingOutputSanitizer.scrubis untouched): - Import
RedactedSlicefromsrc/codegenie/output/redacted_slice.py(lands in S3-02 — this story is paired with S3-02; the executor lands them as a single PR, or S3-02 lands first and this story imports). - Define
SecretFindingPydanticfrozen=True, extra="forbid"(four fields per AC-3). - Define a module-level pattern table:
The table is module-level (not function-local) so
_PATTERNS: list[tuple[Literal[...], re.Pattern[str]]] = [ ("aws_access_key", re.compile(r"AKIA[0-9A-Z]{16}")), ("github_token", re.compile(r"ghp_[A-Za-z0-9]{36}")), ("jwt", re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")), ("rsa_private_key", re.compile(r"-----BEGIN[ A-Z]*PRIVATE KEY-----[\s\S]+?-----END[ A-Z]*PRIVATE KEY-----")), ("npm_token", re.compile(r"npm_[A-Za-z0-9]{36}")), ("anthropic_key", re.compile(r"sk-ant-[A-Za-z0-9_-]{50,}")), ] _ENTROPY_THRESHOLD_BITS_PER_CHAR: float = 4.5 _ENTROPY_MIN_LEN: int = 32monkeypatch.setattrcan swap entries for AC-18 mutation tests. - Implement
_shannon_entropy(s: str) -> float— stdlibmath.log2, char-frequency dict, single-pass. Guard against empty input (return0.0); the formula-sum(p * log2(p) for p in freqs.values())is naturally total over non-emptystrbecausep > 0for every entry in the frequency dict. Pin this in AC-31. - Implement
_fingerprint(cleartext: str) -> str—content_hash_bytes(cleartext.encode("utf-8")).removeprefix("blake3:")[:8]. The"blake3:"prefix is mandatory per Phase-0 ADR-0001; stripping then slicing isolates the two-step in one helper so AC-32's regression test pins the exact form. - Imports:
from codegenie.parsers import JSONValue(NOTcodegenie.types);from codegenie.types.identifiers import ProbeId(thecodegenie.typestop-level does NOT re-exportProbeIdas of S1-05);from codegenie.hashing import content_hash_bytes;from codegenie.output.redacted_slice import RedactedSlice(lands in S3-02 — see Notes for implementer #10). - Implement
_redact_string(s: str, probe_name: ProbeId, findings_out: list[SecretFinding]) -> str— applies each_PATTERNSregex withre.subwhosereplis a closure that computes the fingerprint, appends aSecretFinding, and returns the<REDACTED:fingerprint=…>token; after all patterns, checks the entropy rule on the remaining string (i.e., the post-regex string) and replaces if the rule fires (the entropy check sees<REDACTED:fingerprint=…>tokens but those are short and below the 32-char floor; documented in implementer note). - Implement
_walk(node: JSONValue, probe_name: ProbeId, findings_out: list[SecretFinding]) -> JSONValue— recursive dispatch onisinstance(node, str|dict|list); calls_redact_stringon strings; recurses intodict.values()andlistitems; passes through scalars. - Implement the public
redact_secrets(slice_, probe_name) -> tuple[RedactedSlice, list[SecretFinding]]:findings: list[SecretFinding] = [] redacted: dict[str, JSONValue] = _walk(slice_, probe_name, findings) # type: ignore[assignment] fingerprints = list(dict.fromkeys(f.fingerprint for f in findings)) # dedupe, preserve order return ( RedactedSlice( slice=redacted, findings_count=len(findings), fingerprints=fingerprints, ), findings, ) - Write
`tests/unit/output/test_secret_redactor.py` covering all ACs. Mutation tests use `monkeypatch.setattr(sanitizer, "_PATTERNS", [...weakened...])` and `monkeypatch.setattr(sanitizer, "_ENTROPY_THRESHOLD_BITS_PER_CHAR", 5.0)`.
- No edits to
`OutputSanitizer.scrub` in this story; composition order is S3-03.
- No edits to `writer.py` in this story; signature tightening is S3-03.
Out of scope¶
- The `RedactedSlice` Pydantic model itself (S3-02); this story imports it.
- `OutputSanitizer.scrub` composition + ordering documentation (S3-03).
- Writer signature tightening (S3-03).
- `secrets_redacted_count` log field at the writer (S3-03).
- `tests/adv/phase02/test_secret_in_source.py` (S6-07; load-bearing adversarial test that depends on this story landing first).
- `tests/adv/phase02/test_no_inmemory_secret_leak.py` (S7-04; `inspect`-based boundary test; Gap 5 closure).
- CLI summary line (`secrets_redacted_count: <N>` + file:line list): this story returns the findings list and the CLI summary path consumes it; the CLI summary itself is touched in S3-03 / S8-02.
Notes for the implementer¶
- Pattern table at module level, not function-local. AC-18 mutation tests require `monkeypatch.setattr(sanitizer, "_PATTERNS", [...weakened...])`. A function-local pattern table is unreachable by `monkeypatch.setattr`, which makes these mutation tests impossible. Document the invariant in the module docstring.
- Entropy threshold rationale in the module docstring. AC-2 requires the docstring to name 4.5 bits/char + `len ≥ 32` and a one-line rationale. The threshold is empirically chosen against the gitleaks pattern pack; document the source (Phase 2 final-design Conflict-resolution table row 4, Option C synthesis). A future tune is a docstring + AC-19 threshold change; the mutation test makes drift visible.
- Cleartext lifetime. The cleartext appears as the match group inside the `re.sub` callback. Compute the fingerprint, append the `SecretFinding`, return the replacement token. Do not stash the cleartext anywhere: not in a debug log, not in a `print`, not on the `SecretFinding`. Rule 12 (Fail loud) applies inverted: do not silently retain.
- `re.sub` callback construction. Use `re.sub(pattern, _make_repl(probe_name, pattern_class, findings_out), s)` where `_make_repl` returns a closure. Each match invokes the closure exactly once; the closure side-effects `findings_out` and returns the replacement token. The closure captures `probe_name`, `pattern_class`, and the findings list, and no other state.
- JWT regex anchor discipline. The JWT pattern matches anywhere in the string (no
`^`/`$`). A typical JWT appears in a JSON value as `"Authorization: Bearer eyJ..."`. The regex finds the JWT substring and replaces it inline (the `"Authorization: Bearer "` prefix is preserved). A test fixture covers this.
- RSA block is multi-line. The pattern uses `[\s\S]+?` (non-greedy any-char-including-newline) between BEGIN and END. A test fixture must contain a multi-line RSA block (synthesized, never a real key).
- The entropy rule sees post-regex strings. After all six pattern regexes run, the remaining string may contain `<REDACTED:fingerprint=…>` tokens. Those tokens are under 32 chars, so they fall below the `len ≥ 32` floor and the entropy rule will not fire on them. Document this composition invariant in the `_redact_string` docstring.
- Stable fingerprint ordering for `RedactedSlice.fingerprints`. Use `list(dict.fromkeys(f.fingerprint for f in findings))`; `dict.fromkeys` deduplicates and preserves insertion order (guaranteed from Python 3.7 onward). A test asserts that two semantically identical inputs produce identical `RedactedSlice.fingerprints` (AC-22).
- `SecretFinding.cleartext_len` is the `int` byte-length, not char-length. Use `len(cleartext.encode("utf-8"))`. Multi-byte secrets are rare (most credential schemes are ASCII), but the byte-length is the auditor-friendly measure.
- `pattern_class` is a `Literal[...]`, not an enum. Cheaper at the Pydantic boundary; `mypy --strict` enforces exhaustiveness if a future story uses `match` over the literal.
- The
`RedactedSlice` import. S3-02 lands `src/codegenie/output/redacted_slice.py` with the model. This story imports it. If the executor lands S3-02 and S3-01 in one PR, the import path is correct; if S3-02 lands first, the import lands cleanly. The two stories are tightly coupled by design; the validator may merge them into a single attempt log.
- No LLM, no shell, no subprocess. This is a pure regex + entropy + BLAKE3 pass. Total LOC budget ~150 (pattern table, walker, fingerprint helper, entropy helper, `redact_secrets`, `SecretFinding` model). The mutation test file is ~200 LOC (one mutation per pattern class + entropy threshold + composition + walker + cleartext-lifetime assertions).
- #11:
`Fingerprint = NewType("Fingerprint", str)` is deferred. Production ADR-0033 §3 names primitive-obsession on domain identifiers as a review-blocker pattern; the 8-hex fingerprint qualifies on the surface (it crosses ≥ 2 module boundaries: sanitizer → redacted_slice → writer → log-field constant). The validator considered promoting `Fingerprint` to a `NewType` in this story but deferred because (a) the rule-of-three threshold (third concrete consumer of the family) is not reached until Phase 3 (RAG ingest or audit-anchor); (b) S3-02's `RedactedSlice.fingerprints: list[str]` field-validator already closes the format invariant (`^[0-9a-f]{8}$`) at construction; (c) introducing the newtype now without the third consumer would either (i) require S3-02 to type `fingerprints: list[Fingerprint]` (forcing the executor to land it in a coupled PR) or (ii) leave a one-callsite newtype that adds boilerplate without payoff. A Phase-3 cross-cutting story should land `Fingerprint` once and propagate the type to all consumers; the origin invariant ("only `_fingerprint(...)` produces a `Fingerprint`") closes then. This story uses `str` deliberately; the format invariant is the structural defense.
- #12:
`pattern_class: Literal[...]` is a deliberately closed variant set. Adding a seventh pattern class (e.g., `"slack_webhook"`, `"gcp_service_account_key"`) is an ADR amendment, not Open/Closed. The closed set mirrors the variant-set-extension discipline ratified in S1-01 (`IndexFreshness`) and S1-03 (`AdapterConfidence`): the walker / fingerprinter / entropy logic stays untouched (extension-by-addition is at the `_PATTERNS` row append), and the `Literal[...]` widens by ADR-amendment so `mypy --strict` with `--warn-unreachable` enforces exhaustive `match` discipline at every consumer. A future contributor who "loosens" `pattern_class` to `str` for "extensibility" defeats this exhaustiveness check; the design intent is the closed set + ADR amendment, not Open/Closed. Document in the module docstring.