Validation report: S6-01 — SkillsIndexProbe Layer D¶
Validated: 2026-05-17
Verdict: HARDENED
Validator version: phase-story-validator v1
Story: ../S6-01-skills-index-probe.md
Summary¶
S6-01's intent (a Layer-D probe that projects SkillsLoader.load_all() output into a typed, body-byte-free index slice for the Planner) is well-formed and traces cleanly to phase-arch-design.md §"Component design" #9, the CLAUDE.md "Progressive disclosure for context" commitment, and localv2.md §5.4 D2. The original draft, however, contradicted the actual S2-01 implementation (src/codegenie/skills/loader.py, src/codegenie/skills/model.py) and the frozen Phase-0 Probe ABC (src/codegenie/probes/base.py) at twelve load-bearing points — every one a block-severity contract mismatch that would have made the story uncompilable against the existing codebase.
The twelve contract mismatches were all in-place fixable because the goal itself was consistent with both the architecture and the kernel that S2-01 actually shipped; the draft's mistakes were documentation drift between the story-authoring snapshot of the loader contract and the implementation S2-01 ultimately landed. None required architectural change. Six new ACs were added to cover the corners the draft skipped (empty fixture, FatalLoadError, per-file-error surfacing with a three-state confidence policy, shadowed-skill de-duplication propagation, byte-identical raw artifact, helper-driven ProbeContext construction). Three mutation-resistance hardens were applied (parametrized field-coverage smoke; Hypothesis property-based projection monotonicity; module-source os.open / os.read interdict tightened from class-source to whole-module). One design-pattern harden was applied (extract _project_skill, _project_skills_sorted, _count_skills_per_tier, _compute_confidence as pure module-level helpers — functional core / imperative shell).
The original draft's tier_counts derivation was structurally unimplementable as written (the loader does not propagate tier identity into LoadOutcome.skills); the harden routes it through a pure filesystem-enumeration helper (Path.rglob("SKILL.md") per tier — filename-only, no body reads) and explicitly documents the rule-of-three trigger for when _count_files_per_tier should be extracted to a shared helper (when the third tier-aware probe lands).
No NEEDS RESEARCH findings — every gap traced to an in-repo precedent (tests/unit/probes/layer_b/test_dep_graph.py for the asyncio.run test idiom; tests/unit/skills/test_loader.py for the content_hash_bytes-based BLAKE3-prefix comparison; src/codegenie/probes/layer_b/dep_graph.py for the @register_probe(heaviness=...) + name/layer/tier ABC pattern; src/codegenie/schema/probes/*.schema.json for the flat schema layout; S5-06's validation report for the parametrized field-coverage and module-source interdict patterns).
Twenty-three in-place edits applied; verdict HARDENED. Story is now structurally consistent with the S2-01 contract that actually shipped and with every load-bearing convention the Phase-0/1/2 stories already established.
Context Brief (Stage 1)¶
Story snapshot¶
- Goal: Land
src/codegenie/probes/layer_d/skills_index.pyas a@register_probe(heaviness="light")probe that projectsSkillsLoader.load_all()output into a typedSkillsIndexSlicecarrying indexed-skill rows + tier counts + per-file errors, with bodies never opened by the probe (only the loader's recordedbody_offset/body_size/body_blake3anchors are forwarded). - Non-goals: body reads; other Layer-D probes (conventions, ADRs, policy — S6-02/03/04); sub-schema authoring (S6-08); hostile-YAML adversarial (S7-04); Skill / LoadOutcome amendments.
- Effort: S
- Depends on: S2-01 (
SkillsLoaderthree-tier merge;SkillPydantic model withbody_offset/body_size/body_blake3fields;LoadOutcomewithskills+per_file_errors; per-file-error discriminated union).
Phase / arch constraints touched¶
- 02-ADR-0005 — no plaintext persistence; the probe's commitment to "anchors only, never bytes" is the same chokepoint discipline applied to a different surface.
- Phase 1 ADR-0006 —
safe_yamlchokepoint; the probe does not parse YAML directly (the loader already did, once, through the chokepoint). - 02-ADR-0003 —
@register_probe(heaviness=..., runs_last=...)is a registry kwarg, NOT aProbeABC field. The original draft'sprobe_id = ProbeId(...)field assumed an ABC change. - 02-ADR-0007 — no Plugin Loader in Phase 2;
SkillsLoaderis kernel-side, not loaded via a plugin. phase-arch-design.md§"Component design" #9 — the canonical loader-contract source. The original draft's signature description (Result[list[Skill], SkillsLoadError]) is older than what S2-01 actually shipped (Result[LoadOutcome, FatalLoadError]).phase-arch-design.md§"Edge cases" rows 8/9/16 — symlink-refused, hostile-YAML, and three-tier collision are per-file failures; "this skill skipped, others load" semantics. The probe must surface these viaper_file_errors, not by collapsing the whole probe toconfidence="low".- CLAUDE.md "Progressive disclosure for context" — the body anchors-not-bytes commitment is the story's identity.
- CLAUDE.md "Honest confidence" + ADR-0033 §3 "make illegal states unrepresentable" —
confidenceisLiteral["high","medium","low"]; three-state semantics for full-success / partial-success / total-failure are mandatory.
Sibling-family lineage¶
- 3rd Layer-D-shaped story landed after S2-01 (loader) and S2-02 (conventions catalog loader). The Layer-D probes layer (S6-01 through S6-05) consumes the loaders S2-01 / S2-02 already landed.
- 1st
@register_probe(heaviness="light")probe in Layer D. Layer-B (S4-01 IndexHealthProbe isruns_last=True, dep_graph isheaviness="light"default), Layer-C (S5-02 RuntimeTraceProbe isheaviness="heavy"). This is the first Layer-D probe so the rule-of-three for layer-shared helpers has not triggered. - Functional-core split is the architectural precedent. S2-01's loader cleanly separates pure helpers (
_split_frontmatter,_matches) from the imperative shell (_load_one_skill,SkillsLoader.load_all). S6-01 inherits this split.
Prior validation framings carried forward¶
- S5-06 hardening: committed parametrized mutation-resistance suites > "developer-runnable only" mutation rituals. The field-coverage parametrized test in S6-01 is the analogous pattern.
- S5-05 hardening: diagnostic independence as separate test functions / separate fixtures. S6-01 mirrors this by giving each AC its own test function rather than folding multiple ACs into one fixture.
- S5-04 hardening: mutation-resistance is mandatory; module-level
Final[...]discipline. S6-01 adopts the same discipline for theTIERSconstant (already pinned inskills/model.pyasTIERS: Final[tuple[Tier,...]] = ("user","repo","org")). - S5-03 hardening: AST-walk / source-introspection audits supersede source-grep purity tests where a "did the test author do the right thing?" check is needed. S6-01's AC-11 (no
os.open/os.readetc. in the module source) is a source-grep audit — defensible here because the negative is concrete (literal string presence in the module), not behavioral.
Phase exit criteria the story contributes to¶
- High-level-impl.md §"Step 6" first bullet — ships Layer-D skills-index probe with body-byte-free index slice.
- Phase-arch-design §"Testing strategy" — the probe's unit + integration test land under
tests/unit/probes/layer_d/andtests/integration/probes/. - CLAUDE.md "Progressive disclosure for context" — load-bearing commitment the probe makes operational.
- G6 (final-design §"Goals") — kernel scaffolding for Phase-3 Skills consumption.
Open ambiguities discovered during Stage 1¶
-
SkillsLoader.load_allreturn type drift. Story saysResult[list[Skill], SkillsLoadError]; the actual S2-01 implementation returnsResult[LoadOutcome, FatalLoadError]whereLoadOutcomecarries bothskillsANDper_file_errors. Per-file errors stay insideOk(LoadOutcome); only catastrophic failures (no search path readable) surface asErr(FatalLoadError). Resolved at synthesis: rewrite AC-6, AC-11 (now AC-17), and the GREEN code to pattern-match against the actualOk(LoadOutcome)/Err(FatalLoadError)shape; add AC-16 (FatalLoadError) and AC-17 (per-file errors round-trip). -
ProbeABC signature. Story specifies_run(self, ctx: ProbeContext) -> ProbeOutput(sync, private, takes only ctx). The actual ABC atsrc/codegenie/probes/base.py:94isasync def run(self, repo: RepoSnapshot, ctx: ProbeContext) -> ProbeOutput(async, public, takes both repo and ctx). The story's_rundoes not exist in the contract. Resolved at synthesis: rewrite AC-6 toasync def run; rewrite GREEN code withasync def; update TDD plan to useasyncio.run(mirroringtests/unit/probes/layer_b/test_dep_graph.py:108). -
Probe identity convention. Story specifies
probe_id = ProbeId("skills_index")as a class attribute. The actual ABC usesname: str = "skills_index". Other typed identifiers (e.g.,SkillId,TaskClassId,Language) ARE preserved through the probe's slice; the kernel ABC'sname: stris the frozen contract surface (ADR-0007). Resolved at synthesis: rewrite AC-5 to usename/layer/tier/requiresper the ABC, while keepingSkillId/TaskClassId/Languagenewtypes preserved throughIndexedSkill's fields. -
Skill.applies_to_*type. Story specifiestuple[str, ...]forIndexedSkill. The sourceSkillcarrieslist[TaskClassId]/list[Language](typed). Forcingtuple[str, ...]would launder the newtypes — primitive-obsession regression. Resolved at synthesis: AC-3 preservestuple[TaskClassId, ...]/tuple[Language, ...](tuples for slice hash-stability; newtypes for end-to-end type discipline). -
Skill.body_blake3shape. Story's TDD plan comparesblake3(...).hexdigest()toindexed.body_blake3directly. The actual loader recordsbody_blake3with ablake3:prefix (regex^blake3:[0-9a-f]{64}$pinned inSkill.body_blake3'sField(pattern=...)). The raw-hexdigest()comparison would fail every time. Resolved at synthesis: AC-12 usescontent_hash_bytes(body)(which produces the canonicalblake3:<64hex>form) for the round-trip comparison; explicit assertion thatbody_blake3.startswith("blake3:"). -
tier_countsderivation. Story's AC-10 assertstier_counts == {"user": 3, "repo": 1, "org": 0}but the loader'sLoadOutcome.skillsis a flat de-duplicated list with no per-skill tier metadata. The original GREEN code'sself._count_by_tier(skills, search_paths)is hand-waved — there's notierfield onSkillto count from. Resolved at synthesis: introduce_count_skills_per_tier(search_paths)as a pure filesystem-enumeration helper (Path.rglob("SKILL.md")per tier, filename-only, no body reads); document that the counts are pre-de-duplication and explicitly surface shadowed-skill presence to operators (AC-14, AC-18). Rule-of-three trigger documented in Notes-for-implementer §9. -
ProbeContext.for_testdoesn't exist. Story tests callProbeContext.for_test(search_paths=[...]).ProbeContextis a stdlib@dataclass(src/codegenie/probes/base.py:52) with no classmethod helpers. Resolved at synthesis: introduce a_make_context(tmp_path)test helper in the test module;search_pathsare resolved viactx.config["skills.user_path"]etc., not via aProbeContextconstructor arg. -
_PROBE_REGISTRYdoesn't exist. Story's AC-13 references_PROBE_REGISTRY["skills_index"].heaviness. The actual registry isdefault_registry: Registryatsrc/codegenie/probes/registry.py:238with_entries: list[ProbeRegEntry]. Resolved at synthesis: AC-20 (renumbered) usesnext(e for e in default_registry._entries if e.cls.name == "skills_index"). -
Schema layout. Story's AC-12 and GREEN test reference
src/codegenie/schema/probes/layer_d/skills_index.schema.jsonandfiles("codegenie.schema.probes.layer_d"). The actual schema layout is flat (src/codegenie/schema/probes/dep_graph.schema.json, etc. — no per-layer subdir). Resolved at synthesis: AC-19 (renumbered) uses the flat pathfiles("codegenie.schema.probes") / "skills_index.schema.json". -
codegenie.idsdoesn't exist. Story's GREEN code importsfrom codegenie.ids import ProbeId, SkillId. The actual newtype module iscodegenie.types.identifiers. Resolved at synthesis: import path corrected in GREEN code. -
ProbeOutputshape. Story's GREEN constructsProbeOutput(probe_id=..., confidence=..., schema_slice=..., errors=...). The actualProbeOutputis(schema_slice, raw_artifacts, confidence, duration_ms, warnings, errors)— noprobe_idfield, plus three required fields the story omitted. Resolved at synthesis: GREEN rewritten with all six fields;duration_msmeasured viatime.perf_counter(). -
"No medium" confidence policy. Story note 6: "Use
\"high\"on success,\"low\"onSkillsLoadError. No\"medium\"— there is no \"partially loaded skills\" state." This is incorrect — the loader'sLoadOutcome.per_file_errorsexists precisely for the partial-success state. Resolved at synthesis: introduce a three-state policy via_compute_confidence(skills, per_file_errors):"high"if no per-file errors (clean — including empty install);"medium"if some skills loaded and some failed;"low"only if every discovered file failed ORFatalLoadError.
Findings by critic¶
Coverage critic (K)¶
| ID | Severity | Finding | Proposed fix |
|---|---|---|---|
| K1 | harden | No AC for the empty-install case (zero SKILL.md files anywhere). A future regression that returns confidence="low" on a clean empty install (treating "no skills" as an error) would slip through. |
New AC-15: empty fixture → confidence="high", skills=(), tier_counts={user:0,repo:0,org:0}, per_file_errors=(). |
| K2 | harden | No AC for FatalLoadError (all tiers unreadable). The original draft's AC-6 only handled Result.Err(SkillsLoadError) which doesn't match the actual Err(FatalLoadError) shape. Without a test, a regression that re-raises FatalLoadError (breaking Phase 0 coordinator failure-isolation) silently lands. |
New AC-16: FatalLoadError → confidence="low", schema_slice with zeroed tier_counts and empty skills, error JSON-dumped into output.errors; probe does NOT re-raise. |
| K3 | harden | No AC for per-file-error surfacing. The original draft's AC-11 (symlink → Result.Err) is structurally wrong — symlinks become per-file errors inside Ok(LoadOutcome). Without a correct AC, the implementer might unwittingly swallow outcome.per_file_errors (a regression that loses the "loud about failures" property). |
New AC-17: partial-success fixture (one good + one symlinked) → confidence="medium", len(skills)==1, len(per_file_errors)==1, json.loads(per_file_errors[0])["reason"]=="symlink_refused". |
| K4 | harden | No AC for shadowed-skill propagation. The loader de-duplicates first-tier-wins, but the slice's tier_counts is filesystem-derived (pre-de-dup). Without an AC, a future implementer might "fix" tier_counts to use post-de-dup counts, losing the operator-facing signal that a shadow happened. |
New AC-18: cross-tier collision → exactly one IndexedSkill (user wins), tier_counts == {user:1, repo:1, org:0} (both files present on disk). |
| K5 | harden | No AC for byte-identical raw-artifact determinism. AC-9 (now AC-13) pins JSON byte-identity for two consecutive _run calls, but the raw file (raw/skills-index.json) was not explicitly pinned. |
Folded into AC-13's second clause: json.dumps(out1.schema_slice, sort_keys=True) == json.dumps(out2.schema_slice, sort_keys=True). (Raw-file determinism follows mechanically from sort_keys=True, indent=2 in the writer.) |
| K6 | block | Original draft AC-3 specified applies_to_tasks: tuple[str, ...] on IndexedSkill. The source Skill model carries list[TaskClassId] (typed). Laundering newtypes into str is the primitive-obsession ADR-0033 §1 explicitly forbids; the project's anti-pattern catalog at phase-arch-design.md §"Anti-patterns avoided" calls this out specifically. |
AC-3 rewritten: tuple[TaskClassId, ...] / tuple[Language, ...] preserved end-to-end. Notes §4 documents the constraint. |
| K7 | harden | "Tier counts" is observable but the source of the count is ambiguous in the original draft (loader doesn't expose tier per skill). Implementation might invent a tier field on Skill, or count via insertion order, or worst — silently emit empty counts. | New AC-14 (renumbered): counts derive from _count_skills_per_tier(search_paths) — pure filesystem enumeration via rglob("SKILL.md"), filename-only, no body reads. Explicit invariant: missing tier path → 0. |
| K8 | nit | "mypy --strict passes" AC (AC-21) — verbatim from S5-06 / S4-07 precedents; ✓ no change needed beyond rewording. |
No change required. |
Test-Quality critic (T)¶
| ID | Severity | Finding | Proposed fix |
|---|---|---|---|
| T1 | harden | The 20 KB tracemalloc ceiling (original AC-7) is brittle against Pydantic version drift. A Pydantic 2.x point release that adds 8 bytes per model would silently flip this test red without any actual regression in the body-bytes-never-loaded invariant. | AC-10 (renumbered): loosen to < 1 MB. Still 100× tighter than what a read_bytes() regression would consume; absorbs Pydantic-model allocations comfortably for ≤ 100 skills. Notes §1 documents the rationale. |
| T2 | harden | The 100 MB body fixture used path.write_text(big_body) — materializes 100 MB in Python before any test runs. CI wall-clock degradation. |
Test fixture rewritten to use sparse file via fh.seek(N-1); fh.write(b"\0") — O(1) on tmpfs, file-size invariant preserved. |
| T3 | harden | The architectural-source-introspection test (original AC-15) only inspected SkillsIndexProbe class source. A "convenience" helper at module-level (def _read_preview(path): return path.read_text()[:200]) would slip through. |
AC-11 (renumbered): inspect WHOLE module source (inspect.getsource(si)); forbid os.open, os.read, .read_bytes, .read_text, .open( anywhere in the module. The probe is a pure consumer of Skill records; no body or file access from this module ever. |
| T4 | harden | No mutation-resistance test for "every field is populated." A regression in _project_skill that silently drops applies_to_languages (e.g., a refactor typo) would not be caught by any AC if the existing tests happened not to assert on that field. |
New AC-22: parametrized over IndexedSkill.model_fields.keys() — every field asserted non-default-shaped for a canonical fixture. Adding a new field auto-extends the test surface. |
| T5 | harden | No property-based test. The slice's "sorted, cardinality-preserving" invariant is the kind of invariant that example-based tests routinely miss on edge cases (single-element list; identical-prefix IDs; empty list; very long IDs). | New AC-23: Hypothesis property — len(_project_skills_sorted(skills)) == len({s.id for s in skills}) AND [s.id for s in projected] == sorted({s.id for s in skills}). Catches subtle sort/cardinality regressions. |
| T6 | harden | The registry-heaviness test (original AC-13) called _PROBE_REGISTRY["skills_index"] — but _PROBE_REGISTRY doesn't exist (the registry is default_registry: Registry). The test would have crashed at import time. |
AC-20 (renumbered): use the actual surface — next(e for e in default_registry._entries if e.cls.name == "skills_index"). Also assert entry.runs_last is False (mutation: a regression flipping runs_last would change scheduling). |
| T7 | harden | The BLAKE3 round-trip test (original AC-8) compared blake3(raw).hexdigest() directly to indexed.body_blake3. The loader produces blake3:<64hex> (with prefix); the test would fail every time even on a correct implementation. |
AC-12 (renumbered): use content_hash_bytes(body) (the canonical helper that produces the prefix); add explicit body_blake3.startswith("blake3:") assertion. |
| T8 | nit | The pytest.raises(Exception) for frozen-Pydantic mutation is overly broad; pytest.raises(ValidationError) or pytest.raises(pydantic.ValidationError) is the precise expectation. |
Left as pytest.raises(Exception) since Pydantic 2's frozen-mutation error type has shifted across versions (was TypeError, now ValidationError); the broad assertion is defensible for cross-version stability and is the same shape used in S2-01's test. Notes-for-implementer flags this as a tightening opportunity if it ever produces a false negative. |
Consistency critic (C)¶
| ID | Severity | Finding | Proposed fix |
|---|---|---|---|
| C1 | block | The original draft's Probe._run(self, ctx) signature is not the contract. Probe.run is async def run(self, repo, ctx) (sync _run does not exist). Tests using si.SkillsIndexProbe()._run(ctx) would crash with AttributeError. |
AC-6 + GREEN code rewritten to async def run(self, repo, ctx). Test idiom: asyncio.run(probe.run(repo, ctx)) mirroring tests/unit/probes/layer_b/test_dep_graph.py:108. |
| C2 | block | The original draft's probe_id = ProbeId(...) class attribute does not match the frozen Probe ABC (name: str = ...). Subclassing Probe with probe_id instead of name would mean the registry's de-dup-by-name logic (registry.py:154 existing.cls.name == cls.name) doesn't see the probe at all. |
AC-5 rewritten: class attributes are exactly name: str = "skills_index", layer = "D", tier = "base", applies_to_tasks: list[str] = ["*"], applies_to_languages: list[str] = ["*"], requires: list[str] = [], timeout_seconds: int = 10. ABC contract unchanged. |
| C3 | block | The original draft's SkillsLoader(...).load_all() returning Result.Ok(skills) / Result.Err(SkillsLoadError(...)) does not match the actual return shape (Result[LoadOutcome, FatalLoadError]). The pattern-match would crash on attribute access (result.unwrap() would return LoadOutcome, not list[Skill]). |
AC-6 + GREEN rewritten to handle Ok(LoadOutcome) and Err(FatalLoadError); per-file errors stay inside outcome.per_file_errors (NOT in the Err branch). |
| C4 | block | The original draft's IndexedSkill.body_blake3: str field does not enforce the loader's blake3:<64hex> shape. A regression in the loader that drops the prefix would round-trip through Pydantic silently. |
AC-3 specifies body_blake3: Annotated[str, Field(pattern=r"^blake3:[0-9a-f]{64}$")] — matches the loader's pin in skills/model.py:62. |
| C5 | block | The original draft's from codegenie.ids import ... does not exist as a module. Newtypes live in codegenie.types.identifiers. |
GREEN imports corrected; story body's references corrected. |
| C6 | block | The original draft's ProbeOutput(probe_id=..., confidence=..., schema_slice=..., errors=...) does not match the frozen ProbeOutput dataclass (schema_slice, raw_artifacts, confidence, duration_ms, warnings, errors). Constructor would TypeError. |
GREEN constructs all six fields; duration_ms measured via time.perf_counter() (mirroring dep_graph.py:431). |
| C7 | block | The original draft's schema path src/codegenie/schema/probes/layer_d/skills_index.schema.json does not match the existing flat convention (src/codegenie/schema/probes/dep_graph.schema.json, etc.). An S6-08 implementer following the story literally would put the schema in the wrong place; the consumer-side test would fail. |
AC-19 (renumbered) uses the flat path: files("codegenie.schema.probes") / "skills_index.schema.json". |
| C8 | block | The original draft's ProbeContext.for_test(search_paths=[...]) does not exist. ProbeContext is a stdlib @dataclass with no classmethods. The tests would crash with AttributeError on import. |
New _make_context(tmp_path) test helper in the TDD plan; search-paths resolved via ctx.config["skills.user_path"] etc. (the ProbeContext.config field is the existing extension point). |
| C9 | block | The original draft's "Use \"high\" on success, \"low\" on SkillsLoadError. No \"medium\"" rule (note 6) contradicts the loader's partial-success semantics. The loader returns Ok(LoadOutcome(skills=..., per_file_errors=...)) for the case "some skills loaded, some failed" — that IS a "partially loaded skills" state, and "medium" is the precise typed signal for it. |
New AC-9 (renumbered): three-state confidence policy via _compute_confidence(skills, per_file_errors). Notes-for-implementer §6 documents the rationale. |
| C10 | block | The original draft's _PROBE_REGISTRY does not exist. Registry surface is default_registry: Registry with _entries: list[ProbeRegEntry]. |
AC-20 (renumbered) uses the actual surface. |
| C11 | harden | The original draft's IndexedSkill.applies_to_tasks: tuple[str, ...] strips the TaskClassId newtype from the source Skill.applies_to_tasks: list[TaskClassId]. ADR-0033 §1 (primitive obsession) prohibits exactly this kind of laundering. |
AC-3: tuple[TaskClassId, ...] / tuple[Language, ...] preserved. Notes §4 documents. |
| C12 | harden | tier_counts derivation is structurally underspecified — the loader exposes no per-skill tier identity. A reader of the story can't tell where the counts come from; the original GREEN's self._count_by_tier(skills, search_paths) hand-waves it. |
AC-14 (renumbered): explicit _count_skills_per_tier(search_paths) pure helper; counts are filesystem-enumeration-derived; missing tier path = 0; pre-de-duplication semantics documented (so shadowed skills surface in counts). |
Design-Patterns critic (D)¶
| ID | Severity | Finding | Proposed fix |
|---|---|---|---|
| D1 | harden | The original draft's _resolve_search_paths and _count_by_tier are private methods on the probe class. Mixing pure logic into the imperative shell defeats Phase 2's "functional core / imperative shell" convention (proven by s1-loader.py splitting _split_frontmatter / _matches as pure module-level helpers; dep_graph.py splitting _detect_package_manager from the probe). |
Extracted as pure module-level functions: _project_skill, _project_skills_sorted, _count_skills_per_tier, _compute_confidence. Each independently testable without a Probe instance or asyncio.run. |
| D2 | harden | IndexedSkill.applies_to_* as raw tuple[str, ...] is primitive obsession (already block-flagged by Consistency critic). The design-pattern lens: newtypes are the discipline that lets mypy --strict catch a TaskClassId accidentally passed where a Language is expected — preserving them through the slice means the Planner consumer also gets that protection. |
Already resolved via AC-3 + Notes §4. |
| D3 | harden | The confidence dimension is naturally a Sum/State-Machine; the original draft's two-state binary loses the partial-success state. ADR-0033 §3 "make illegal states unrepresentable" prescribes the three-state typed surface. | Already resolved via AC-9 + _compute_confidence pure helper. The Literal["high","medium","low"] is exhaustive at the Probe-ABC level (already pinned by Phase 0), so match over it on the consumer side stays type-safe. |
| D4 | nit | The original draft's "no tier classifier" rationale (rule-of-three not triggered) is correct but the seam was not documented. When the third tier-aware probe lands, the implementer of THAT story needs to know which pure helper to extract. | Notes §9: rule-of-three trigger is the third tier-aware Layer-D probe (S6-02 ConventionsCatalogProbe is the second). Extract _count_files_per_tier(search_paths, glob_pattern) into src/codegenie/probes/_shared/tier_counts.py at that point. |
| D5 | block | The original draft's tier_counts ambiguity (D-pattern lens: "where does the data come from?") is also a hidden-coupling smell. Without an explicit pure helper, a future implementer might "fix" the gap by adding a tier: Tier field to Skill (S2-01 amendment) — wrong layer to fix this, and would expand the loader's contract for one consumer. |
AC-14 + Notes §9: explicit pure helper for the count derivation; explicit "do NOT extend Skill/LoadOutcome" guidance to forestall the wrong fix. |
| D6 | harden | declared_inputs was not explicitly tied to the search-path configuration. Cache invalidation requires the declared_inputs to change when the config changes. Without an explicit AC, a future config-key addition (e.g., skills.user_path) would not invalidate the cache. |
New AC-7: declared_inputs includes skills_user_search_path:<expanded> etc. — mirrors the proven precedent at dep_graph.py:420 (dep_graph_strategy_set:<resolved>). |
| D7 | nit | The slice writes a raw artifact to ctx.output_dir / "skills-index.json" but the original draft did not pin this. Operators read raw artifacts when debugging; without a pin, the path could drift. |
GREEN explicitly writes raw_path = ctx.output_dir / "skills-index.json"; raw_artifacts list includes it. AC-13's byte-identity assertion guarantees the raw file is deterministic. |
Conflict resolution¶
Priority: Consistency > Coverage > Test-Quality > Design-Patterns.
- No conflicts between Coverage and Consistency: every new Coverage AC (K1–K7) traces to an actual capability of the existing loader/probe contract.
- No conflicts between Test-Quality and Consistency: every Test-Quality harden (T1–T7) tightens an AC that Consistency had already corrected for shape.
- Design-Patterns vs. Rule 2 (YAGNI): D4's rule-of-three trigger is documented but not implemented in S6-01 — per Rule 2 ("three similar lines is better than a premature abstraction"), the shared
_count_files_per_tierhelper waits for the third tier-aware probe. The story's Notes §9 makes this explicit so the next story author has the context.
No Stage 3 research was needed — every gap traced to an in-repo precedent.
Edits applied (Stage 4)¶
Story header¶
- Updated Depends on to reflect actual S2-01 surface (
LoadOutcome.per_file_errorsis the per-file-error carrier). - Added 02-ADR-0003 to the ADRs honored list (
heavinessis registry-side, not on the ABC). - Added a Validation notes block under the header documenting the twelve block fixes, six new ACs, three mutation-resistance hardens, and one design-pattern harden.
Context¶
- Rewrote the second paragraph to describe the actual
SkillsLoader.load_all()return shape (Result[LoadOutcome, FatalLoadError]withper_file_errorscarrier). - Loosened the tracemalloc ceiling to 1 MB (from 20 KB) with a Pydantic-version-drift rationale.
References — where to look¶
- Added the actual references for the Probe ABC (
base.py), registry (registry.py), newtypes (types/identifiers.py), schema layout (flat underschema/probes/), and the canonicalasyncio.runtest idiom fromtest_dep_graph.py. - Removed the references to non-existent surfaces (
codegenie.ids,ProbeContext.for_test,_PROBE_REGISTRY).
Goal¶
- Rewrote to use
async def run(self, repo, ctx)(matching the ABC),name: str = "skills_index"(matching ABC convention), preserved newtypes (SkillId,TaskClassId,Language), andblake3:<64hex>prefix preservation. - Made explicit that
tier_countsderives from a pure filesystem-enumeration helper, NOT from any field onLoadOutcome. - Added: probe MUST NOT contain
os.open/os.read/.read_bytes/.read_text/.open(in the module source.
Acceptance criteria¶
- Renumbered to 23 ACs (from 15) organized into 12 sections.
- Rewrites: AC-3 (preserve newtypes), AC-4 (add
per_file_errors), AC-5 (use ABC class attributes, notprobe_id), AC-6 (async def run+Result[LoadOutcome, FatalLoadError]pattern-match), AC-9 (three-state confidence policy), AC-10 (1 MB ceiling), AC-11 (whole-module source interdict), AC-12 (content_hash_bytesprefix-aware round-trip), AC-13 (byte-identical JSON between gathers), AC-19 (flat schema path), AC-20 (usedefault_registry). - New ACs: AC-7 (
_resolve_search_paths+ declared_inputs tokens), AC-14 (filesystem-enumerationtier_counts), AC-15 (empty fixture), AC-16 (FatalLoadError), AC-17 (per-file errors round-trip), AC-18 (shadowed-skill propagation), AC-22 (parametrized field-coverage), AC-23 (Hypothesis property).
Implementation outline¶
- Restructured to make the functional-core / imperative-shell split explicit.
- Added the
_make_contexthelper requirement. - Documented the S6-08 forward dependency.
TDD plan¶
- Replaced every
_runcall withasyncio.run(probe.run(...)). - Replaced
ProbeContext.for_testwith_make_context(tmp_path)helper. - Replaced
_PROBE_REGISTRYwithdefault_registry._entriesiteration. - Replaced raw
blake3(...).hexdigest()withcontent_hash_bytes(body)(prefix-aware). - Replaced flat 100 MB
path.write_textfixture with sparse-filefh.seek/write(O(1)). - Replaced class-source
getsource(SkillsIndexProbe)introspection with module-sourcegetsource(si)introspection. - Added five new test functions for the new ACs (empty, FatalLoadError, per-file errors, shadowed propagation, byte-identical JSON, parametrized field coverage, Hypothesis property).
GREEN code¶
- Complete rewrite: pure helpers at module level; imperative shell uses ABC-correct class attributes;
async def runtaking(self, repo, ctx); pattern-matchesResult[LoadOutcome, FatalLoadError]; emits all sixProbeOutputfields; preserves newtypes through projection; writes raw artifact withsort_keys=True, indent=2.
Refactor section¶
- Documented the rule-of-three trigger for the shared tier-counts helper.
Out of scope¶
- Added: "Adding
tiermembership toSkill/LoadOutcome" (S2-01 amendment, not S6-01's job). - Added: "Editing the Phase 0
ProbeABC orProbeContextdataclass" (frozen contract per ADR-0007).
Notes for the implementer¶
- Rewrote all 12 notes to reflect the corrected ACs and code.
- Added notes 9 (rule-of-three trigger for tier-counts), 10 (
async def runis the contract — pinning the test idiom), 11 (ProbeContextis a stdlib@dataclass), 12 (registry surface isdefault_registry).
Verdict¶
HARDENED. Twenty-three edits applied to the story file at docs/phases/02-context-gather-layers-b-g/stories/S6-01-skills-index-probe.md. The story now compiles structurally against the S2-01 implementation that actually shipped and the frozen Phase-0 Probe / ProbeContext / ProbeOutput contract surfaces. All twelve block-severity contract mismatches are resolved; six coverage gaps closed; three mutation-resistance hardens applied; one design-pattern harden applied with rule-of-three trigger documented for the next implementer.
The story is ready for phase-story-executor.