Skip to content

Validation report: S6-01 — SkillsIndexProbe Layer D

Validated: 2026-05-17 Verdict: HARDENED Validator version: phase-story-validator v1 Story: ../S6-01-skills-index-probe.md

Summary

S6-01's intent (a Layer-D probe that projects SkillsLoader.load_all() output into a typed, body-byte-free index slice for the Planner) is well-formed and traces cleanly to phase-arch-design.md §"Component design" #9, the CLAUDE.md "Progressive disclosure for context" commitment, and localv2.md §5.4 D2. The original draft, however, contradicted the actual S2-01 implementation (src/codegenie/skills/loader.py, src/codegenie/skills/model.py) and the frozen Phase-0 Probe ABC (src/codegenie/probes/base.py) at twelve load-bearing points — every one a block-severity contract mismatch that would have made the story uncompilable against the existing codebase.

The twelve contract mismatches were all in-place fixable because the goal itself was consistent with both the architecture and the kernel that S2-01 actually shipped; the draft's mistakes were documentation drift between the story-authoring snapshot of the loader contract and the implementation S2-01 ultimately landed. None required architectural change. Six new ACs were added to cover the corners the draft skipped (empty fixture, FatalLoadError, per-file-error surfacing with a three-state confidence policy, shadowed-skill de-duplication propagation, byte-identical raw artifact, helper-driven ProbeContext construction). Three mutation-resistance hardens were applied (parametrized field-coverage smoke; Hypothesis property-based projection monotonicity; module-source os.open / os.read interdict tightened from class-source to whole-module). One design-pattern harden was applied (extract _project_skill, _project_skills_sorted, _count_skills_per_tier, _compute_confidence as pure module-level helpers — functional core / imperative shell).

The original draft's tier_counts derivation was structurally unimplementable as written (the loader does not propagate tier identity into LoadOutcome.skills); the harden routes it through a pure filesystem-enumeration helper (Path.rglob("SKILL.md") per tier — filename-only, no body reads) and explicitly documents the rule-of-three trigger for when _count_files_per_tier should be extracted to a shared helper (when the third tier-aware probe lands).

No NEEDS RESEARCH findings — every gap traced to an in-repo precedent (tests/unit/probes/layer_b/test_dep_graph.py for the asyncio.run test idiom; tests/unit/skills/test_loader.py for the content_hash_bytes-based BLAKE3-prefix comparison; src/codegenie/probes/layer_b/dep_graph.py for the @register_probe(heaviness=...) + name/layer/tier ABC pattern; src/codegenie/schema/probes/*.schema.json for the flat schema layout; S5-06's validation report for the parametrized field-coverage and module-source interdict patterns).

Twenty-three in-place edits applied; verdict HARDENED. Story is now structurally consistent with the S2-01 contract that actually shipped and with every load-bearing convention the Phase-0/1/2 stories already established.

Context Brief (Stage 1)

Story snapshot

  • Goal: Land src/codegenie/probes/layer_d/skills_index.py as a @register_probe(heaviness="light") probe that projects SkillsLoader.load_all() output into a typed SkillsIndexSlice carrying indexed-skill rows + tier counts + per-file errors, with bodies never opened by the probe (only the loader's recorded body_offset/body_size/body_blake3 anchors are forwarded).
  • Non-goals: body reads; other Layer-D probes (conventions, ADRs, policy — S6-02/03/04); sub-schema authoring (S6-08); hostile-YAML adversarial (S7-04); Skill / LoadOutcome amendments.
  • Effort: S
  • Depends on: S2-01 (SkillsLoader three-tier merge; Skill Pydantic model with body_offset/body_size/body_blake3 fields; LoadOutcome with skills + per_file_errors; per-file-error discriminated union).

Phase / arch constraints touched

  • 02-ADR-0005 — no plaintext persistence; the probe's commitment to "anchors only, never bytes" is the same chokepoint discipline applied to a different surface.
  • Phase 1 ADR-0006safe_yaml chokepoint; the probe does not parse YAML directly (the loader already did, once, through the chokepoint).
  • 02-ADR-0003@register_probe(heaviness=..., runs_last=...) is a registry kwarg, NOT a Probe ABC field. The original draft's probe_id = ProbeId(...) field assumed an ABC change.
  • 02-ADR-0007 — no Plugin Loader in Phase 2; SkillsLoader is kernel-side, not loaded via a plugin.
  • phase-arch-design.md §"Component design" #9 — the canonical loader-contract source. The original draft's signature description (Result[list[Skill], SkillsLoadError]) is older than what S2-01 actually shipped (Result[LoadOutcome, FatalLoadError]).
  • phase-arch-design.md §"Edge cases" rows 8/9/16 — symlink-refused, hostile-YAML, and three-tier collision are per-file failures; "this skill skipped, others load" semantics. The probe must surface these via per_file_errors, not by collapsing the whole probe to confidence="low".
  • CLAUDE.md "Progressive disclosure for context" — the body anchors-not-bytes commitment is the story's identity.
  • CLAUDE.md "Honest confidence" + ADR-0033 §3 "make illegal states unrepresentable" — confidence is Literal["high","medium","low"]; three-state semantics for full-success / partial-success / total-failure are mandatory.

Sibling-family lineage

  • 3rd Layer-D-shaped story landed after S2-01 (loader) and S2-02 (conventions catalog loader). The Layer-D probes layer (S6-01 through S6-05) consumes the loaders S2-01 / S2-02 already landed.
  • 1st @register_probe(heaviness="light") probe in Layer D. Layer-B (S4-01 IndexHealthProbe is runs_last=True, dep_graph is heaviness="light" default), Layer-C (S5-02 RuntimeTraceProbe is heaviness="heavy"). This is the first Layer-D probe so the rule-of-three for layer-shared helpers has not triggered.
  • Functional-core split is the architectural precedent. S2-01's loader cleanly separates pure helpers (_split_frontmatter, _matches) from the imperative shell (_load_one_skill, SkillsLoader.load_all). S6-01 inherits this split.

Prior validation framings carried forward

  • S5-06 hardening: committed parametrized mutation-resistance suites > "developer-runnable only" mutation rituals. The field-coverage parametrized test in S6-01 is the analogous pattern.
  • S5-05 hardening: diagnostic independence as separate test functions / separate fixtures. S6-01 mirrors this by giving each AC its own test function rather than folding multiple ACs into one fixture.
  • S5-04 hardening: mutation-resistance is mandatory; module-level Final[...] discipline. S6-01 adopts the same discipline for the TIERS constant (already pinned in skills/model.py as TIERS: Final[tuple[Tier,...]] = ("user","repo","org")).
  • S5-03 hardening: AST-walk / source-introspection audits supersede source-grep purity tests where a "did the test author do the right thing?" check is needed. S6-01's AC-11 (no os.open / os.read etc. in the module source) is a source-grep audit — defensible here because the negative is concrete (literal string presence in the module), not behavioral.

Phase exit criteria the story contributes to

  • High-level-impl.md §"Step 6" first bullet — ships Layer-D skills-index probe with body-byte-free index slice.
  • Phase-arch-design §"Testing strategy" — the probe's unit + integration test land under tests/unit/probes/layer_d/ and tests/integration/probes/.
  • CLAUDE.md "Progressive disclosure for context" — load-bearing commitment the probe makes operational.
  • G6 (final-design §"Goals") — kernel scaffolding for Phase-3 Skills consumption.

Open ambiguities discovered during Stage 1

  • SkillsLoader.load_all return type drift. Story says Result[list[Skill], SkillsLoadError]; the actual S2-01 implementation returns Result[LoadOutcome, FatalLoadError] where LoadOutcome carries both skills AND per_file_errors. Per-file errors stay inside Ok(LoadOutcome); only catastrophic failures (no search path readable) surface as Err(FatalLoadError). Resolved at synthesis: rewrite AC-6, AC-11 (now AC-17), and the GREEN code to pattern-match against the actual Ok(LoadOutcome) / Err(FatalLoadError) shape; add AC-16 (FatalLoadError) and AC-17 (per-file errors round-trip).

  • Probe ABC signature. Story specifies _run(self, ctx: ProbeContext) -> ProbeOutput (sync, private, takes only ctx). The actual ABC at src/codegenie/probes/base.py:94 is async def run(self, repo: RepoSnapshot, ctx: ProbeContext) -> ProbeOutput (async, public, takes both repo and ctx). The story's _run does not exist in the contract. Resolved at synthesis: rewrite AC-6 to async def run; rewrite GREEN code with async def; update TDD plan to use asyncio.run (mirroring tests/unit/probes/layer_b/test_dep_graph.py:108).

  • Probe identity convention. Story specifies probe_id = ProbeId("skills_index") as a class attribute. The actual ABC uses name: str = "skills_index". Other typed identifiers (e.g., SkillId, TaskClassId, Language) ARE preserved through the probe's slice; the kernel ABC's name: str is the frozen contract surface (ADR-0007). Resolved at synthesis: rewrite AC-5 to use name/layer/tier/requires per the ABC, while keeping SkillId/TaskClassId/Language newtypes preserved through IndexedSkill's fields.

  • Skill.applies_to_* type. Story specifies tuple[str, ...] for IndexedSkill. The source Skill carries list[TaskClassId] / list[Language] (typed). Forcing tuple[str, ...] would launder the newtypes — primitive-obsession regression. Resolved at synthesis: AC-3 preserves tuple[TaskClassId, ...] / tuple[Language, ...] (tuples for slice hash-stability; newtypes for end-to-end type discipline).

  • Skill.body_blake3 shape. Story's TDD plan compares blake3(...).hexdigest() to indexed.body_blake3 directly. The actual loader records body_blake3 with a blake3: prefix (regex ^blake3:[0-9a-f]{64}$ pinned in Skill.body_blake3's Field(pattern=...)). The raw-hexdigest() comparison would fail every time. Resolved at synthesis: AC-12 uses content_hash_bytes(body) (which produces the canonical blake3:<64hex> form) for the round-trip comparison; explicit assertion that body_blake3.startswith("blake3:").

  • tier_counts derivation. Story's AC-10 asserts tier_counts == {"user": 3, "repo": 1, "org": 0} but the loader's LoadOutcome.skills is a flat de-duplicated list with no per-skill tier metadata. The original GREEN code's self._count_by_tier(skills, search_paths) is hand-waved — there's no tier field on Skill to count from. Resolved at synthesis: introduce _count_skills_per_tier(search_paths) as a pure filesystem-enumeration helper (Path.rglob("SKILL.md") per tier, filename-only, no body reads); document that the counts are pre-de-duplication and explicitly surface shadowed-skill presence to operators (AC-14, AC-18). Rule-of-three trigger documented in Notes-for-implementer §9.

  • ProbeContext.for_test doesn't exist. Story tests call ProbeContext.for_test(search_paths=[...]). ProbeContext is a stdlib @dataclass (src/codegenie/probes/base.py:52) with no classmethod helpers. Resolved at synthesis: introduce a _make_context(tmp_path) test helper in the test module; search_paths are resolved via ctx.config["skills.user_path"] etc., not via a ProbeContext constructor arg.

  • _PROBE_REGISTRY doesn't exist. Story's AC-13 references _PROBE_REGISTRY["skills_index"].heaviness. The actual registry is default_registry: Registry at src/codegenie/probes/registry.py:238 with _entries: list[ProbeRegEntry]. Resolved at synthesis: AC-20 (renumbered) uses next(e for e in default_registry._entries if e.cls.name == "skills_index").

  • Schema layout. Story's AC-12 and GREEN test reference src/codegenie/schema/probes/layer_d/skills_index.schema.json and files("codegenie.schema.probes.layer_d"). The actual schema layout is flat (src/codegenie/schema/probes/dep_graph.schema.json, etc. — no per-layer subdir). Resolved at synthesis: AC-19 (renumbered) uses the flat path files("codegenie.schema.probes") / "skills_index.schema.json".

  • codegenie.ids doesn't exist. Story's GREEN code imports from codegenie.ids import ProbeId, SkillId. The actual newtype module is codegenie.types.identifiers. Resolved at synthesis: import path corrected in GREEN code.

  • ProbeOutput shape. Story's GREEN constructs ProbeOutput(probe_id=..., confidence=..., schema_slice=..., errors=...). The actual ProbeOutput is (schema_slice, raw_artifacts, confidence, duration_ms, warnings, errors) — no probe_id field, plus three required fields the story omitted. Resolved at synthesis: GREEN rewritten with all six fields; duration_ms measured via time.perf_counter().

  • "No medium" confidence policy. Story note 6: "Use \"high\" on success, \"low\" on SkillsLoadError. No \"medium\" — there is no \"partially loaded skills\" state." This is incorrect — the loader's LoadOutcome.per_file_errors exists precisely for the partial-success state. Resolved at synthesis: introduce a three-state policy via _compute_confidence(skills, per_file_errors): "high" if no per-file errors (clean — including empty install); "medium" if some skills loaded and some failed; "low" only if every discovered file failed OR FatalLoadError.

Findings by critic

Coverage critic (K)

ID Severity Finding Proposed fix
K1 harden No AC for the empty-install case (zero SKILL.md files anywhere). A future regression that returns confidence="low" on a clean empty install (treating "no skills" as an error) would slip through. New AC-15: empty fixture → confidence="high", skills=(), tier_counts={user:0,repo:0,org:0}, per_file_errors=().
K2 harden No AC for FatalLoadError (all tiers unreadable). The original draft's AC-6 only handled Result.Err(SkillsLoadError) which doesn't match the actual Err(FatalLoadError) shape. Without a test, a regression that re-raises FatalLoadError (breaking Phase 0 coordinator failure-isolation) silently lands. New AC-16: FatalLoadErrorconfidence="low", schema_slice with zeroed tier_counts and empty skills, error JSON-dumped into output.errors; probe does NOT re-raise.
K3 harden No AC for per-file-error surfacing. The original draft's AC-11 (symlink → Result.Err) is structurally wrong — symlinks become per-file errors inside Ok(LoadOutcome). Without a correct AC, the implementer might unwittingly swallow outcome.per_file_errors (a regression that loses the "loud about failures" property). New AC-17: partial-success fixture (one good + one symlinked) → confidence="medium", len(skills)==1, len(per_file_errors)==1, json.loads(per_file_errors[0])["reason"]=="symlink_refused".
K4 harden No AC for shadowed-skill propagation. The loader de-duplicates first-tier-wins, but the slice's tier_counts is filesystem-derived (pre-de-dup). Without an AC, a future implementer might "fix" tier_counts to use post-de-dup counts, losing the operator-facing signal that a shadow happened. New AC-18: cross-tier collision → exactly one IndexedSkill (user wins), tier_counts == {user:1, repo:1, org:0} (both files present on disk).
K5 harden No AC for byte-identical raw-artifact determinism. AC-9 (now AC-13) pins JSON byte-identity for two consecutive _run calls, but the raw file (raw/skills-index.json) was not explicitly pinned. Folded into AC-13's second clause: json.dumps(out1.schema_slice, sort_keys=True) == json.dumps(out2.schema_slice, sort_keys=True). (Raw-file determinism follows mechanically from sort_keys=True, indent=2 in the writer.)
K6 block Original draft AC-3 specified applies_to_tasks: tuple[str, ...] on IndexedSkill. The source Skill model carries list[TaskClassId] (typed). Laundering newtypes into str is the primitive-obsession ADR-0033 §1 explicitly forbids; the project's anti-pattern catalog at phase-arch-design.md §"Anti-patterns avoided" calls this out specifically. AC-3 rewritten: tuple[TaskClassId, ...] / tuple[Language, ...] preserved end-to-end. Notes §4 documents the constraint.
K7 harden "Tier counts" is observable but the source of the count is ambiguous in the original draft (loader doesn't expose tier per skill). Implementation might invent a tier field on Skill, or count via insertion order, or worst — silently emit empty counts. New AC-14 (renumbered): counts derive from _count_skills_per_tier(search_paths) — pure filesystem enumeration via rglob("SKILL.md"), filename-only, no body reads. Explicit invariant: missing tier path → 0.
K8 nit "mypy --strict passes" AC (AC-21) — verbatim from S5-06 / S4-07 precedents; ✓ no change needed beyond rewording. No change required.

Test-Quality critic (T)

ID Severity Finding Proposed fix
T1 harden The 20 KB tracemalloc ceiling (original AC-7) is brittle against Pydantic version drift. A Pydantic 2.x point release that adds 8 bytes per model would silently flip this test red without any actual regression in the body-bytes-never-loaded invariant. AC-10 (renumbered): loosen to < 1 MB. Still 100× tighter than what a read_bytes() regression would consume; absorbs Pydantic-model allocations comfortably for ≤ 100 skills. Notes §1 documents the rationale.
T2 harden The 100 MB body fixture used path.write_text(big_body) — materializes 100 MB in Python before any test runs. CI wall-clock degradation. Test fixture rewritten to use sparse file via fh.seek(N-1); fh.write(b"\0") — O(1) on tmpfs, file-size invariant preserved.
T3 harden The architectural-source-introspection test (original AC-15) only inspected SkillsIndexProbe class source. A "convenience" helper at module-level (def _read_preview(path): return path.read_text()[:200]) would slip through. AC-11 (renumbered): inspect WHOLE module source (inspect.getsource(si)); forbid os.open, os.read, .read_bytes, .read_text, .open( anywhere in the module. The probe is a pure consumer of Skill records; no body or file access from this module ever.
T4 harden No mutation-resistance test for "every field is populated." A regression in _project_skill that silently drops applies_to_languages (e.g., a refactor typo) would not be caught by any AC if the existing tests happened not to assert on that field. New AC-22: parametrized over IndexedSkill.model_fields.keys() — every field asserted non-default-shaped for a canonical fixture. Adding a new field auto-extends the test surface.
T5 harden No property-based test. The slice's "sorted, cardinality-preserving" invariant is the kind of invariant that example-based tests routinely miss on edge cases (single-element list; identical-prefix IDs; empty list; very long IDs). New AC-23: Hypothesis property — len(_project_skills_sorted(skills)) == len({s.id for s in skills}) AND [s.id for s in projected] == sorted({s.id for s in skills}). Catches subtle sort/cardinality regressions.
T6 harden The registry-heaviness test (original AC-13) called _PROBE_REGISTRY["skills_index"] — but _PROBE_REGISTRY doesn't exist (the registry is default_registry: Registry). The test would have crashed at import time. AC-20 (renumbered): use the actual surface — next(e for e in default_registry._entries if e.cls.name == "skills_index"). Also assert entry.runs_last is False (mutation: a regression flipping runs_last would change scheduling).
T7 harden The BLAKE3 round-trip test (original AC-8) compared blake3(raw).hexdigest() directly to indexed.body_blake3. The loader produces blake3:<64hex> (with prefix); the test would fail every time even on a correct implementation. AC-12 (renumbered): use content_hash_bytes(body) (the canonical helper that produces the prefix); add explicit body_blake3.startswith("blake3:") assertion.
T8 nit The pytest.raises(Exception) for frozen-Pydantic mutation is overly broad; pytest.raises(ValidationError) or pytest.raises(pydantic.ValidationError) is the precise expectation. Left as pytest.raises(Exception) since Pydantic 2's frozen-mutation error type has shifted across versions (was TypeError, now ValidationError); the broad assertion is defensible for cross-version stability and is the same shape used in S2-01's test. Notes-for-implementer flags this as a tightening opportunity if it ever produces a false negative.

Consistency critic (C)

ID Severity Finding Proposed fix
C1 block The original draft's Probe._run(self, ctx) signature is not the contract. Probe.run is async def run(self, repo, ctx) (sync _run does not exist). Tests using si.SkillsIndexProbe()._run(ctx) would crash with AttributeError. AC-6 + GREEN code rewritten to async def run(self, repo, ctx). Test idiom: asyncio.run(probe.run(repo, ctx)) mirroring tests/unit/probes/layer_b/test_dep_graph.py:108.
C2 block The original draft's probe_id = ProbeId(...) class attribute does not match the frozen Probe ABC (name: str = ...). Subclassing Probe with probe_id instead of name would mean the registry's de-dup-by-name logic (registry.py:154 existing.cls.name == cls.name) doesn't see the probe at all. AC-5 rewritten: class attributes are exactly name: str = "skills_index", layer = "D", tier = "base", applies_to_tasks: list[str] = ["*"], applies_to_languages: list[str] = ["*"], requires: list[str] = [], timeout_seconds: int = 10. ABC contract unchanged.
C3 block The original draft's SkillsLoader(...).load_all() returning Result.Ok(skills) / Result.Err(SkillsLoadError(...)) does not match the actual return shape (Result[LoadOutcome, FatalLoadError]). The pattern-match would crash on attribute access (result.unwrap() would return LoadOutcome, not list[Skill]). AC-6 + GREEN rewritten to handle Ok(LoadOutcome) and Err(FatalLoadError); per-file errors stay inside outcome.per_file_errors (NOT in the Err branch).
C4 block The original draft's IndexedSkill.body_blake3: str field does not enforce the loader's blake3:<64hex> shape. A regression in the loader that drops the prefix would round-trip through Pydantic silently. AC-3 specifies body_blake3: Annotated[str, Field(pattern=r"^blake3:[0-9a-f]{64}$")] — matches the loader's pin in skills/model.py:62.
C5 block The original draft's from codegenie.ids import ... does not exist as a module. Newtypes live in codegenie.types.identifiers. GREEN imports corrected; story body's references corrected.
C6 block The original draft's ProbeOutput(probe_id=..., confidence=..., schema_slice=..., errors=...) does not match the frozen ProbeOutput dataclass (schema_slice, raw_artifacts, confidence, duration_ms, warnings, errors). Constructor would TypeError. GREEN constructs all six fields; duration_ms measured via time.perf_counter() (mirroring dep_graph.py:431).
C7 block The original draft's schema path src/codegenie/schema/probes/layer_d/skills_index.schema.json does not match the existing flat convention (src/codegenie/schema/probes/dep_graph.schema.json, etc.). An S6-08 implementer following the story literally would put the schema in the wrong place; the consumer-side test would fail. AC-19 (renumbered) uses the flat path: files("codegenie.schema.probes") / "skills_index.schema.json".
C8 block The original draft's ProbeContext.for_test(search_paths=[...]) does not exist. ProbeContext is a stdlib @dataclass with no classmethods. The tests would crash with AttributeError on import. New _make_context(tmp_path) test helper in the TDD plan; search-paths resolved via ctx.config["skills.user_path"] etc. (the ProbeContext.config field is the existing extension point).
C9 block The original draft's "Use \"high\" on success, \"low\" on SkillsLoadError. No \"medium\"" rule (note 6) contradicts the loader's partial-success semantics. The loader returns Ok(LoadOutcome(skills=..., per_file_errors=...)) for the case "some skills loaded, some failed" — that IS a "partially loaded skills" state, and "medium" is the precise typed signal for it. New AC-9 (renumbered): three-state confidence policy via _compute_confidence(skills, per_file_errors). Notes-for-implementer §6 documents the rationale.
C10 block The original draft's _PROBE_REGISTRY does not exist. Registry surface is default_registry: Registry with _entries: list[ProbeRegEntry]. AC-20 (renumbered) uses the actual surface.
C11 harden The original draft's IndexedSkill.applies_to_tasks: tuple[str, ...] strips the TaskClassId newtype from the source Skill.applies_to_tasks: list[TaskClassId]. ADR-0033 §1 (primitive obsession) prohibits exactly this kind of laundering. AC-3: tuple[TaskClassId, ...] / tuple[Language, ...] preserved. Notes §4 documents.
C12 harden tier_counts derivation is structurally underspecified — the loader exposes no per-skill tier identity. A reader of the story can't tell where the counts come from; the original GREEN's self._count_by_tier(skills, search_paths) hand-waves it. AC-14 (renumbered): explicit _count_skills_per_tier(search_paths) pure helper; counts are filesystem-enumeration-derived; missing tier path = 0; pre-de-duplication semantics documented (so shadowed skills surface in counts).

Design-Patterns critic (D)

ID Severity Finding Proposed fix
D1 harden The original draft's _resolve_search_paths and _count_by_tier are private methods on the probe class. Mixing pure logic into the imperative shell defeats Phase 2's "functional core / imperative shell" convention (proven by s1-loader.py splitting _split_frontmatter / _matches as pure module-level helpers; dep_graph.py splitting _detect_package_manager from the probe). Extracted as pure module-level functions: _project_skill, _project_skills_sorted, _count_skills_per_tier, _compute_confidence. Each independently testable without a Probe instance or asyncio.run.
D2 harden IndexedSkill.applies_to_* as raw tuple[str, ...] is primitive obsession (already block-flagged by Consistency critic). The design-pattern lens: newtypes are the discipline that lets mypy --strict catch a TaskClassId accidentally passed where a Language is expected — preserving them through the slice means the Planner consumer also gets that protection. Already resolved via AC-3 + Notes §4.
D3 harden The confidence dimension is naturally a Sum/State-Machine; the original draft's two-state binary loses the partial-success state. ADR-0033 §3 "make illegal states unrepresentable" prescribes the three-state typed surface. Already resolved via AC-9 + _compute_confidence pure helper. The Literal["high","medium","low"] is exhaustive at the Probe-ABC level (already pinned by Phase 0), so match over it on the consumer side stays type-safe.
D4 nit The original draft's "no tier classifier" rationale (rule-of-three not triggered) is correct but the seam was not documented. When the third tier-aware probe lands, the implementer of THAT story needs to know which pure helper to extract. Notes §9: rule-of-three trigger is the third tier-aware Layer-D probe (S6-02 ConventionsCatalogProbe is the second). Extract _count_files_per_tier(search_paths, glob_pattern) into src/codegenie/probes/_shared/tier_counts.py at that point.
D5 block The original draft's tier_counts ambiguity (D-pattern lens: "where does the data come from?") is also a hidden-coupling smell. Without an explicit pure helper, a future implementer might "fix" the gap by adding a tier: Tier field to Skill (S2-01 amendment) — wrong layer to fix this, and would expand the loader's contract for one consumer. AC-14 + Notes §9: explicit pure helper for the count derivation; explicit "do NOT extend Skill/LoadOutcome" guidance to forestall the wrong fix.
D6 harden declared_inputs was not explicitly tied to the search-path configuration. Cache invalidation requires the declared_inputs to change when the config changes. Without an explicit AC, a future config-key addition (e.g., skills.user_path) would not invalidate the cache. New AC-7: declared_inputs includes skills_user_search_path:<expanded> etc. — mirrors the proven precedent at dep_graph.py:420 (dep_graph_strategy_set:<resolved>).
D7 nit The slice writes a raw artifact to ctx.output_dir / "skills-index.json" but the original draft did not pin this. Operators read raw artifacts when debugging; without a pin, the path could drift. GREEN explicitly writes raw_path = ctx.output_dir / "skills-index.json"; raw_artifacts list includes it. AC-13's byte-identity assertion guarantees the raw file is deterministic.

Conflict resolution

Priority: Consistency > Coverage > Test-Quality > Design-Patterns.

  • No conflicts between Coverage and Consistency: every new Coverage AC (K1–K7) traces to an actual capability of the existing loader/probe contract.
  • No conflicts between Test-Quality and Consistency: every Test-Quality harden (T1–T7) tightens an AC that Consistency had already corrected for shape.
  • Design-Patterns vs. Rule 2 (YAGNI): D4's rule-of-three trigger is documented but not implemented in S6-01 — per Rule 2 ("three similar lines is better than a premature abstraction"), the shared _count_files_per_tier helper waits for the third tier-aware probe. The story's Notes §9 makes this explicit so the next story author has the context.

No Stage 3 research was needed — every gap traced to an in-repo precedent.

Edits applied (Stage 4)

Story header

  • Updated Depends on to reflect actual S2-01 surface (LoadOutcome.per_file_errors is the per-file-error carrier).
  • Added 02-ADR-0003 to the ADRs honored list (heaviness is registry-side, not on the ABC).
  • Added a Validation notes block under the header documenting the twelve block fixes, six new ACs, three mutation-resistance hardens, and one design-pattern harden.

Context

  • Rewrote the second paragraph to describe the actual SkillsLoader.load_all() return shape (Result[LoadOutcome, FatalLoadError] with per_file_errors carrier).
  • Loosened the tracemalloc ceiling to 1 MB (from 20 KB) with a Pydantic-version-drift rationale.

References — where to look

  • Added the actual references for the Probe ABC (base.py), registry (registry.py), newtypes (types/identifiers.py), schema layout (flat under schema/probes/), and the canonical asyncio.run test idiom from test_dep_graph.py.
  • Removed the references to non-existent surfaces (codegenie.ids, ProbeContext.for_test, _PROBE_REGISTRY).

Goal

  • Rewrote to use async def run(self, repo, ctx) (matching the ABC), name: str = "skills_index" (matching ABC convention), preserved newtypes (SkillId, TaskClassId, Language), and blake3:<64hex> prefix preservation.
  • Made explicit that tier_counts derives from a pure filesystem-enumeration helper, NOT from any field on LoadOutcome.
  • Added: probe MUST NOT contain os.open/os.read/.read_bytes/.read_text/.open( in the module source.

Acceptance criteria

  • Renumbered to 23 ACs (from 15) organized into 12 sections.
  • Rewrites: AC-3 (preserve newtypes), AC-4 (add per_file_errors), AC-5 (use ABC class attributes, not probe_id), AC-6 (async def run + Result[LoadOutcome, FatalLoadError] pattern-match), AC-9 (three-state confidence policy), AC-10 (1 MB ceiling), AC-11 (whole-module source interdict), AC-12 (content_hash_bytes prefix-aware round-trip), AC-13 (byte-identical JSON between gathers), AC-19 (flat schema path), AC-20 (use default_registry).
  • New ACs: AC-7 (_resolve_search_paths + declared_inputs tokens), AC-14 (filesystem-enumeration tier_counts), AC-15 (empty fixture), AC-16 (FatalLoadError), AC-17 (per-file errors round-trip), AC-18 (shadowed-skill propagation), AC-22 (parametrized field-coverage), AC-23 (Hypothesis property).

Implementation outline

  • Restructured to make the functional-core / imperative-shell split explicit.
  • Added the _make_context helper requirement.
  • Documented the S6-08 forward dependency.

TDD plan

  • Replaced every _run call with asyncio.run(probe.run(...)).
  • Replaced ProbeContext.for_test with _make_context(tmp_path) helper.
  • Replaced _PROBE_REGISTRY with default_registry._entries iteration.
  • Replaced raw blake3(...).hexdigest() with content_hash_bytes(body) (prefix-aware).
  • Replaced flat 100 MB path.write_text fixture with sparse-file fh.seek/write (O(1)).
  • Replaced class-source getsource(SkillsIndexProbe) introspection with module-source getsource(si) introspection.
  • Added five new test functions for the new ACs (empty, FatalLoadError, per-file errors, shadowed propagation, byte-identical JSON, parametrized field coverage, Hypothesis property).

GREEN code

  • Complete rewrite: pure helpers at module level; imperative shell uses ABC-correct class attributes; async def run taking (self, repo, ctx); pattern-matches Result[LoadOutcome, FatalLoadError]; emits all six ProbeOutput fields; preserves newtypes through projection; writes raw artifact with sort_keys=True, indent=2.

Refactor section

  • Documented the rule-of-three trigger for the shared tier-counts helper.

Out of scope

  • Added: "Adding tier membership to Skill / LoadOutcome" (S2-01 amendment, not S6-01's job).
  • Added: "Editing the Phase 0 Probe ABC or ProbeContext dataclass" (frozen contract per ADR-0007).

Notes for the implementer

  • Rewrote all 12 notes to reflect the corrected ACs and code.
  • Added notes 9 (rule-of-three trigger for tier-counts), 10 (async def run is the contract — pinning the test idiom), 11 (ProbeContext is a stdlib @dataclass), 12 (registry surface is default_registry).

Verdict

HARDENED. Twenty-three edits applied to the story file at docs/phases/02-context-gather-layers-b-g/stories/S6-01-skills-index-probe.md. The story now compiles structurally against the S2-01 implementation that actually shipped and the frozen Phase-0 Probe / ProbeContext / ProbeOutput contract surfaces. All twelve block-severity contract mismatches are resolved; six coverage gaps closed; three mutation-resistance hardens applied; one design-pattern harden applied with rule-of-three trigger documented for the next implementer.

The story is ready for phase-story-executor.