Story S7-05: Property tests + portfolio integration sweep
Step: Step 7 — Plant five-repo fixture portfolio + per-probe golden files + remaining adversarial corpus
Status: Done — GREEN 2026-05-18 (phase-story-executor; see _attempts/S7-05.md for the per-AC evidence table + gate log). AC-25/28 use subprocess.run([sys.executable, "-m", "codegenie", ...]) (the run_allowlisted path was unsatisfiable — python is not in ALLOWED_BINARIES). AC-26 implements a structured-JSON-log check (the project's stderr format) instead of a prefix-allowlist. AC-20 "iff len(results)==0" was softened to match _derive_trace_coverage_confidence semantics ("unavailable" iff no completed entries). All adaptations documented in the attempt log.
Effort: M
Depends on: S7-03 (~70 goldens exist; the portfolio sweep diffs against them; the regen script proves canonical-JSON discipline holds)
ADRs honored: ADR-0006 (IndexFreshness location — property test asserts round-trip identity over every StaleReason variant), ADR-0007 (no plugin loader — dep_graph strategy registry has zero strategies in Phase 2; property test asserts the dispatch is total over the closed PackageManager enum), ADR-0009 (pytest-xdist veto — property tests run serially under the same --max-examples=200 budget), ADR-0010 (RedactedSlice smart constructor — property tests against ScannerOutcome round-trip exercise the RedactedSlice JSON shape without re-constructing it outside the sanitizer).
Validation notes (2026-05-18)
Hardened against the actual Phase-2 source tree (not the architectural plan as written). The original draft cited several types/APIs that diverged from what S1-01, S1-10, S5-01, S5-02 actually shipped. Concrete corrections below; consistency with the running code is now load-bearing for the executor.
- `StaleReason` field names corrected (AC-2). Code (`src/codegenie/indices/freshness.py`) ships `DigestMismatch(expected, actual)` and `CoverageGap(files_indexed, files_in_repo)`; the draft cited `(last_traced, current_built)` and `(missing_files, indexed_files, total_files)`, both wrong. The `Fresh` constructor also requires an `indexed_at: datetime` (the draft's `st.builds(Fresh)` would fail at construction); the strategy must pass an aware datetime.
- `ScannerRan` has no `fingerprints` / `findings_count` fields (AC-8, AC-9 reframed). Those fields live on `RedactedSlice` (per ADR-0010), not on `ScannerRan` (which carries `findings: list[Finding]`). The original ACs conflated the two types. New ACs target `Finding.id` / `severity` / `metadata` invariants on `ScannerRan` (matching the model that S5-01 actually shipped) and address `RedactedSlice` separately under a corrected AC-12.
- Existing `tests/property/test_sum_types_roundtrip.py` (S5-01) already covers `ScannerOutcome` + `ScenarioResult` round-trip. The original draft planned a duplicate `test_scanner_outcome_roundtrip.py`. Reframed: this story extends the existing file with the `--max-examples=200` / `database=None` / `deadline=None` discipline that AC-11 / AC-35 demand, and adds a separate exhaustive-match `assert_never` test (AC-23); no second property file.
- `DepGraphRegistry.dispatch()` raises `DepGraphRegistryError` with the structural prefix `"no_strategy_for_ecosystem: "`; it does NOT return a `Result.Ok|Err` (AC-14 / AC-15 reframed). No `Result` type exists in `src/codegenie/depgraph/`. The Phase-2 invariant is now expressed as `default_dep_graph_registry.registered_ecosystems() == frozenset()` (non-raising query) AND, when a probe dispatches against any `PackageManager` member, the registry raises with the documented prefix that S4-05's `DepGraphProbe` matches. Mock-strategy registration (AC-16) now explicitly uses `default_dep_graph_registry.unregister_for_tests(...)` in `finally` (the test-only teardown the registry already exposes).
- `TraceCoverage` is not a class; the ACs now target the real Phase-2 surface (AC-19..AC-24). The code ships a pure function `_derive_trace_coverage_confidence(results) -> Literal["high","medium","low","unavailable"]` and a private `_AggregatedSlice` Pydantic model (`src/codegenie/probes/layer_c/runtime_trace.py`). New ACs are: monotonicity / totality of the confidence derivation across the closed `ScenarioResult` variant space, an exhaustive-match `assert_never` test on `ScenarioResult` (mirroring AC-5's discipline), and a well-formedness property over `_aggregate_scenarios` (`scenarios_run + scenarios_failed + skipped == |results|`, no duplicate scenario names, `trace_coverage_confidence == "unavailable"` iff `len(results) == 0` per the canonical-empty case the function ships).
- `walltimes.json` no longer dirties the repo tree (AC-32). The original draft wrote `tests/integration/portfolio/walltimes.json` unconditionally on every test run, which would dirty the working copy and trip the pre-commit hook. Reframed: the file is written only when `CODEGENIE_PORTFOLIO_WALLTIME_OUT=<path>` is set (CI sets it to a job-artifact path). Without the env var, the test prints the walltime table to `pytest -s` stdout and the file is untouched.
- AC-26 stderr allowlist enumerated, not described. The original "stderr is empty or contains only documented warnings" was unverifiable. Reframed: the test loads stderr, splits on `\n`, and asserts every non-empty line begins with one of an explicit allowlist of warning prefixes (enumerated in AC-26: `skill_shadowed`, `strace_unavailable`, `image_digest_unresolved`, `external_docs_skipped`) OR is the empty line. A line that doesn't match fails the test and prints the offending line.
- AC-29 single budget number. The original mixed `≤ 6 min` (CI) with a `≤ 5 min` target (local): two unverifiable thresholds. Reframed: a hard `≤ 6 min` measured by the test itself across the five fixtures; the bench advisory (S8-03) tracks the per-fixture trend separately.
- AC-12 `RedactedSlice` handling clarified. The original was a paragraph-long dual-resolution that an executor would mis-implement. Reframed: the property test for `ScannerOutcome` round-trips the `ScannerRan(findings=...)` shape as it actually ships (no `RedactedSlice` field on `ScannerRan`). A separate AC (new AC-12) covers `RedactedSlice` JSON round-trip identity via an instance obtained only via `redact_secrets(<synthetic input>)`, the one allowed construction path. The S7-04 structural test (no `RedactedSlice(...)` outside the sanitizer) is unchanged and remains the structural firewall.
- Design-pattern surfacing for `unregister_for_tests`. Mock-strategy registration in AC-16 explicitly uses the registry's `unregister_for_tests` test-only hook. The symmetric API already exists in `src/codegenie/depgraph/registry.py` and `src/codegenie/indices/registry.py`; the property test consumes the established Open/Closed seam rather than reaching into private state.
- Coordination AC added (new AC-37). Names the existing `tests/property/test_sum_types_roundtrip.py` and `tests/property/test_index_freshness_roundtrip.py` as "extend in place, do not duplicate", which closes the duplication risk the original draft introduced.
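The AC-32 reframing in the notes above (write the walltime table only when `CODEGENIE_PORTFOLIO_WALLTIME_OUT` is set, otherwise print it) can be sketched as below. This is a minimal illustration, not the shipped test: the helper name `emit_walltimes` and the tab-separated print layout are assumptions; only the env var name comes from this story.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Env var name taken from the story text; the helper name is hypothetical.
WALLTIME_ENV = "CODEGENIE_PORTFOLIO_WALLTIME_OUT"


def emit_walltimes(walltimes: Mapping[str, float]) -> Optional[Path]:
    """Write the walltime table only when the env var names a destination.

    Returns the written path, or None when the table was printed instead.
    """
    out = os.environ.get(WALLTIME_ENV)
    if out:
        path = Path(out)
        path.parent.mkdir(parents=True, exist_ok=True)
        # sort_keys keeps the artifact byte-stable for equal timings.
        path.write_text(json.dumps(dict(walltimes), indent=2, sort_keys=True) + "\n")
        return path
    # No env var: print for `pytest -s` and leave the working tree untouched.
    for name in sorted(walltimes):
        print(f"{name}\t{walltimes[name]:.2f}s")
    return None
```

Keeping the decision in one helper means the test body never holds a repo-relative output path, so a local run cannot dirty the tree by accident.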
Context
This story closes Step 7 with two complementary surfaces:
- Hypothesis property tests under `tests/property/`: five files (two extended in place, three new) covering the round-trip, dispatch-totality, and well-formedness invariants of the Phase-2 typed surfaces. Each runs with `--max-examples=200` (Hypothesis convention; a tradeoff between coverage and CI wall-clock). These are invariant tests over generated data, complementing S7-03's literal-data goldens and S7-04's adversarial cases:
  - `test_index_freshness_roundtrip.py`: every `IndexFreshness` variant and every `StaleReason` variant round-trips through `model_dump_json` / `model_validate_json` to identity. Extends S1-01's single-example test to portfolio-wide Hypothesis coverage. Catches: missing field, type-coercion silent loss, discriminator drift.
  - `test_sum_types_roundtrip.py` (S5-01's existing file, extended in place; the draft's separate `test_scanner_outcome_roundtrip.py` is not created): every `ScannerOutcome` variant (`ScannerRan | ScannerSkipped | ScannerFailed`) round-trips, plus `ScenarioResult` (Layer C). Catches the same class of bug as above in a separate type tree.
  - `test_redacted_slice_roundtrip.py`: `RedactedSlice` JSON round-trip identity, with every instance obtained via `redact_secrets` (ADR-0010).
  - `test_dep_graph_strategy_dispatch.py`: the `@register_dep_graph_strategy` registry's dispatch is total over the closed `PackageManager` enum (Phase 1 ADR-0013). Phase 2 has zero strategies registered; every input raises `DepGraphRegistryError` with the `"no_strategy_for_ecosystem: "` prefix. That is the Phase-2 invariant. Phase 3 fills strategies; the property test grows with the strategy set. Catches: a future implementer who silently adds a strategy AND silently drops the Phase-2 total-dispatch property.
  - `test_trace_coverage_invariants.py`: the `_aggregate_scenarios` output is well-formed across any combination of `ScenarioResult` variants. Specifically: scenario count ≥ 0; completed-and-failed counts sum to total minus skipped; no scenario name appears twice.
- A portfolio-sweep integration test, `tests/integration/portfolio/test_portfolio_sweep.py`, runs `codegenie gather` against every fixture in `tests/fixtures/portfolio/` serially (per ADR-0009; no pytest-xdist) and asserts: (a) every gather succeeds (exit 0); (b) the resulting `repo-context.yaml` validates against the Phase-2 envelope schema; (c) the golden diff (S7-03's regen script in `--check` mode) is empty. This is the "every probe runs against every fixture without crashing" smoke at the portfolio level; the CI `portfolio` job (S8-03) consumes it.
Both surfaces are complementary, not redundant:
- Goldens (S7-03) pin specific byte sequences for specific (probe × fixture) pairs.
- Property tests (this story) assert invariants over generated inputs the goldens cannot exhaustively cover (e.g., every `StaleReason` variant, including ones the fixtures don't exercise).
- Portfolio sweep (this story) verifies the integration surface: every probe runs against every fixture without crashing, and the gather output remains shape-consistent across the portfolio.
This is the final Step-7 story. After it lands: Step 8 (Confidence renderer + CI ratchet + bench canaries + Phase-3 handoff issues) wires everything together. The cross-cutting invariant this story locks: every Phase-2 typed surface that participates in serialization has a Hypothesis round-trip property test; the portfolio sweep proves no fixture × probe combination crashes the gatherer.
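The round-trip invariant named above can be illustrated without the project's dependencies. The sketch below is a stdlib-only stand-in, not the shipped test: frozen dataclasses replace the Pydantic models, a seeded `random.Random` replaces Hypothesis strategies, and the `kind` discriminator field plus the helper names (`dump_json`, `load_json`, `check_roundtrip`) are assumptions made for illustration. Only two of the `StaleReason` variants are modelled.

```python
import json
import random
from dataclasses import asdict, dataclass
from typing import Union


# Hypothetical stand-ins for two StaleReason variants; the shipped models are Pydantic.
@dataclass(frozen=True)
class CommitsBehind:
    n: int
    last_indexed: str


@dataclass(frozen=True)
class DigestMismatch:
    expected: str
    actual: str


StaleReason = Union[CommitsBehind, DigestMismatch]
_VARIANTS = {"commits_behind": CommitsBehind, "digest_mismatch": DigestMismatch}
_TAGS = {cls: tag for tag, cls in _VARIANTS.items()}


def dump_json(value: StaleReason) -> str:
    # The "kind" field plays the role of the discriminator.
    return json.dumps({"kind": _TAGS[type(value)], **asdict(value)}, sort_keys=True)


def load_json(raw: str) -> StaleReason:
    data = json.loads(raw)
    cls = _VARIANTS[data.pop("kind")]
    return cls(**data)


def random_reason(rng: random.Random) -> StaleReason:
    # Seeded generator standing in for a Hypothesis st.one_of(...) strategy.
    if rng.random() < 0.5:
        return CommitsBehind(n=rng.randint(0, 10_000), last_indexed=f"{rng.getrandbits(64):016x}")
    return DigestMismatch(expected=f"{rng.getrandbits(64):016x}", actual=f"{rng.getrandbits(64):016x}")


def check_roundtrip(examples: int = 200, seed: int = 0) -> int:
    rng = random.Random(seed)
    for _ in range(examples):
        x = random_reason(rng)
        decoded = load_json(dump_json(x))
        assert decoded == x              # value identity
        assert type(decoded) is type(x)  # guards against discriminator drift
    return examples
```

The explicit `type(decoded) is type(x)` assertion mirrors the concrete-type check the ACs ask for; with Pydantic models, value equality alone may not catch a variant decoded as a structurally similar sibling.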
References: where to look
- Architecture:
  - `../phase-arch-design.md §"Testing strategy" → "Property tests"`: the inventory of round-trip / dispatch / well-formed properties.
  - `../phase-arch-design.md §"Testing strategy" → "Test pyramid"`: property tests are the second-narrowest tier (above adversarial, below unit).
  - `../phase-arch-design.md §"Component design"` #2 (`IndexFreshness`), #5 (`ScannerOutcome`), #11 (`DepGraphProbe` strategy registry), and §"Component design" #6 (`TraceCoverage`).
- Phase ADRs:
  - ADR-0006 (`IndexFreshness` location; `frozen=True, extra="forbid"`; round-trip identity is the load-bearing property).
  - ADR-0010 (`RedactedSlice` smart constructor; no property test may construct `RedactedSlice` outside `redact_secrets`. The round-trip test instead generates JSON-shaped payloads, passes each through `redact_secrets`, and verifies the resulting `RedactedSlice` round-trips through JSON, per AC-12).
- Implementation plan:
  - `../High-level-impl.md §"Step 7"`: property-test bullets + portfolio-sweep bullet.
- Existing code:
  - `src/codegenie/indices/freshness.py` (S1-01; the type under property test. Note the actual fields: `CommitsBehind(n, last_indexed)`, `DigestMismatch(expected, actual)`, `CoverageGap(files_indexed, files_in_repo)`, `IndexerError(message)`, `Fresh(indexed_at)`).
  - `src/codegenie/probes/_shared/scanner_outcome.py` (S5-01; the `ScannerOutcome` family. `ScannerRan` carries `findings: list[Finding]`, not `findings_count` / `fingerprints`; those live on `RedactedSlice`).
  - `src/codegenie/probes/layer_c/scenario_result.py` (S5-01; `TraceScenarioCompleted | TraceScenarioFailed | TraceScenarioSkipped`).
  - `src/codegenie/depgraph/registry.py` (S1-10; the registry. `dispatch()` raises `DepGraphRegistryError("no_strategy_for_ecosystem: ...")`, `has_strategy()` is the non-raising query, `unregister_for_tests()` is the test-only teardown).
  - `src/codegenie/probes/layer_c/runtime_trace.py` (S5-02; `_aggregate_scenarios` pure fold + `_derive_trace_coverage_confidence` Literal totality. There is no `TraceCoverage` class; the arch doc's section title named the concept).
  - `src/codegenie/output/sanitizer.py` (S3-01..S3-03; `RedactedSlice` and `redact_secrets`. The latter is the only legal construction surface per ADR-0010).
  - All five fixtures from S7-01 + S7-02.
  - `scripts/regen_golden.py --check --portfolio` (S7-03; if the `--portfolio` flag hasn't landed by PR time, AC-28 ships behind a `pytest.mark.skipif` gate, lifted by S8-03).
- Existing Hypothesis test precedents:
  - `tests/property/test_index_freshness_roundtrip.py` (S1-01) and `tests/property/test_sum_types_roundtrip.py` (S5-01): extended in place by this story.
  - `tests/property/test_skills_loader_monotone.py`, `tests/property/test_tccm_roundtrip.py`, `tests/property/test_runtime_trace_freshness_purity.py`, `tests/property/test_truncate_tail.py`: convention precedents for the `@given` + `@settings` shape; do not edit.
Goal
The property-test surface gains three additions; the portfolio sweep is created; existing property files are extended (not duplicated):
- Extended `tests/property/test_index_freshness_roundtrip.py` (S1-01 shipped this; extend): add `@settings(max_examples=200, deadline=None, database=None)` on the round-trip; add a separate `assert_never` exhaustive-match unit test (AC-5).
- Extended `tests/property/test_sum_types_roundtrip.py` (S5-01 shipped this for `ScannerOutcome` + `ScenarioResult`; extend): add the same `@settings(max_examples=200, deadline=None, database=None)` discipline to both `test_scanner_outcome_roundtrips_identity` and `test_scenario_result_roundtrips_identity`; add an `assert_never` exhaustive-match test for `ScenarioResult` (AC-23).
- New `tests/property/test_redacted_slice_roundtrip.py`: Hypothesis property test over `RedactedSlice` JSON round-trip, where every Hypothesis example is obtained via `redact_secrets(<synthetic dict>)` (the one allowed construction path per ADR-0010); `--max-examples=200`.
- New `tests/property/test_dep_graph_strategy_dispatch.py`: Hypothesis property test that the registry's `dispatch()` is total over `PackageManager` in the structural sense: with zero strategies registered, every member raises `DepGraphRegistryError` with the documented `"no_strategy_for_ecosystem: "` prefix; with a registered mock, the mock is invoked exactly once and its return value passes through unchanged; the registry never raises an undocumented exception. `--max-examples=200`.
- New `tests/property/test_trace_coverage_invariants.py`: Hypothesis property test over `_aggregate_scenarios` and `_derive_trace_coverage_confidence` covering: per-input partition invariant (`len(scenarios_run) + len(scenarios_failed) + skipped == len(results)`), uniqueness invariant (each `scenario_name` appears at most once across the three lists), and confidence-derivation totality + `unavailable` iff `len(results) == 0`; `--max-examples=200`.
- New `tests/integration/portfolio/test_portfolio_sweep.py`: serial portfolio sweep; gathers every fixture; asserts schema validation + golden diff empty.
Acceptance criteria
test_index_freshness_roundtrip.py (extends the existing S1-01 file)
- [ ] AC-1. `tests/property/test_index_freshness_roundtrip.py` exists (S1-01 shipped it; this story extends it); uses `hypothesis` with `@given` strategies that generate every `IndexFreshness` variant (`Fresh(indexed_at)`, `Stale(reason=<each StaleReason variant>)`).
- [ ] AC-2: variant coverage matches the shipped types. Every `StaleReason` variant is reachable using its actual code-level field set: `CommitsBehind(n: int, last_indexed: str)`, `DigestMismatch(expected: str, actual: str)`, `CoverageGap(files_indexed: int, files_in_repo: int)`, `IndexerError(message: str)`. `Fresh` is built with an `indexed_at` strategy producing timezone-aware `datetime`s. A `st.one_of(...)` composition covers all `StaleReason` variants. If the variant set drifts from the shipped module, this AC fails by construction at import time (the wrong field name produces a Pydantic validation error in `st.builds`).
- [ ] AC-3: round-trip identity. For every generated `IndexFreshness` instance `x`, using a `TypeAdapter[IndexFreshness]`: `adapter.validate_json(adapter.dump_json(x)) == x`. Additionally, the concrete type is preserved (`type(decoded) is type(x)`; and for `Stale`, `type(decoded.reason) is type(x.reason)`), which guards against silent discriminator drift.
- [ ] AC-4: `--max-examples=200`. The test uses Hypothesis's `@settings(max_examples=200, deadline=None, database=None)`. `deadline=None` because round-trip latency is variable on CI; `database=None` for CI reproducibility (AC-35).
- [ ] AC-5: `assert_never` enforcement. A separate (non-property) test exhaustively pattern-matches on `IndexFreshness` AND on every `StaleReason`, with `assert_never` on the closing case. A missing `case` triggers a `mypy --warn-unreachable` failure (the per-module override from S1-11 applies to this test). Extends S1-01's single-example test.
- [ ] AC-6: wall-clock < 30 s on CI. 200 round-trip examples (`--max-examples=200`) should fit easily. If not, the type itself is non-trivial in its round-trip path; investigate.
Existing tests/property/test_sum_types_roundtrip.py (extends the S5-01 file — ScannerOutcome + ScenarioResult)
- [ ] AC-7. The existing `tests/property/test_sum_types_roundtrip.py` is extended in place (not duplicated). Both `test_scanner_outcome_roundtrips_identity` and `test_scenario_result_roundtrips_identity` are decorated with `@settings(max_examples=200, deadline=None, database=None)`. The pre-existing strategies (`_scanner_outcomes`, `_scenario_results`) remain authoritative; nothing about the `Finding` shape is changed.
- [ ] AC-8: `Finding` shape preserved. The Hypothesis strategy for `ScannerRan` builds `Finding(id, severity, metadata)` where `severity ∈ {"info","low","medium","high","critical"}` and `metadata` is bounded-depth `JSONValue`. The round-trip preserves `type(decoded.findings[i]) is type(value.findings[i])` for every index (the existing test asserts the per-element-type identity; no change needed beyond `@settings`).
- [ ] AC-9: no plaintext leaks via `metadata`. A unit-level companion test (non-property) constructs `ScannerRan(findings=[Finding(id="probe.test", severity="high", metadata={"secret":"sk_live_..."})])`, runs it through `redact_secrets(<dict-shaped slice carrying that finding>)`, and asserts the resulting `RedactedSlice.slice` JSON contains zero `"sk_live_"` substring matches. This is the cross-check that ADR-0005 + ADR-0010 hold at the seam where scanner findings meet the sanitizer; the property test itself does not assert this (the property surface is round-trip identity, not secret-erasure).
- [ ] AC-10: round-trip identity. Preserved verbatim from the existing file (both `_scanner_adapter` and `_scenario_adapter` round-trip identity); no edit beyond the `@settings` decoration.
- [ ] AC-11: `--max-examples=200` with `deadline=None, database=None` on both tests in the file.
- [ ] AC-12: `RedactedSlice` JSON round-trip (new file `tests/property/test_redacted_slice_roundtrip.py`). Hypothesis generates `dict[str, JSONValue]` payloads with bounded depth; passes each through `redact_secrets(payload, probe_name=ProbeId("test.property"))` to obtain a `(RedactedSlice, list[SecretFinding])` tuple; uses that `RedactedSlice` as the input `x` to the round-trip assertion `adapter.validate_json(adapter.dump_json(x)) == x` on a `TypeAdapter[RedactedSlice]`. The test never constructs `RedactedSlice(...)` directly; it goes through `redact_secrets` for every example (per ADR-0010). The S7-04 structural firewall (no `RedactedSlice(...)` construction outside `codegenie.output.sanitizer`) is unaffected; this test file is not in `codegenie.output.sanitizer` and obtains every instance via the smart-constructor surface.
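The AC-9 cross-check (no plaintext secret survives the sanitizer seam) reduces to a substring assertion over the serialized output. The sketch below shows that shape only. Everything in it is a stand-in: `redact`, `assert_no_plaintext`, the `sk_live_` regex, and the `[REDACTED]` marker are assumptions for illustration; the project's `redact_secrets` has its own rules, signature, and return type, and the synthetic secret value is made up.

```python
import json
import re
from typing import Any

# Hypothetical pattern and redactor; not the project's redact_secrets.
_SECRET = re.compile(r"sk_live_[A-Za-z0-9]+")


def redact(value: Any) -> Any:
    """Recursively replace secret-shaped substrings in a JSON-like value."""
    if isinstance(value, str):
        return _SECRET.sub("[REDACTED]", value)
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    return value


def assert_no_plaintext(payload: dict, needle: str = "sk_live_") -> str:
    """Serialize the redacted payload and fail if the needle is still present."""
    rendered = json.dumps(redact(payload), sort_keys=True)
    assert needle not in rendered, f"plaintext leaked: {rendered}"
    return rendered


# Dict-shaped slice carrying one finding with a synthetic secret in metadata.
synthetic_slice = {
    "findings": [
        {"id": "probe.test", "severity": "high", "metadata": {"secret": "sk_live_abc123"}}
    ]
}
```

Asserting on the serialized JSON rather than on individual fields means a secret nested at any depth of `metadata` is caught by the same check.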
test_dep_graph_strategy_dispatch.py (matches the exception-raising API the registry actually ships)
- [ ] AC-13. `tests/property/test_dep_graph_strategy_dispatch.py` exists.
- [ ] AC-14: dispatch totality (Phase-2 reality). `default_dep_graph_registry` is a fresh / empty registry in this test's process scope (an autouse fixture asserts `registered_ecosystems() == frozenset()` before each example; any leftover registration from another test fails fast with a named pointer to the polluter). Hypothesis generates every `PackageManager` member via `st.sampled_from(get_args(PackageManager))`. For every generated member, the test calls `default_dep_graph_registry.dispatch(member, ctx, manifests)` with a `ProbeContext` test-double and asserts: (a) it raises `DepGraphRegistryError`; (b) `str(err).startswith("no_strategy_for_ecosystem: ")` (the structural prefix `DepGraphProbe` matches); (c) `repr(member)` appears in `str(err)`; (d) no other exception type ever bubbles. `has_strategy(member)` returns `False` for every member. `registered_ecosystems()` remains `frozenset()`.
- [ ] AC-15: Phase 2 invariant (trip-wire). With zero strategies registered, the AC-14 properties hold for every `PackageManager` member. If Phase 3 lands a strategy that registers at import time (a module-level `@register_dep_graph_strategy(PackageManager.npm)`), this test fails on the Phase 3 PR, which is the desired contract trip-wire. The Phase 3 author must explicitly update AC-14 / AC-15 (e.g., "for the registered members expect an `nx.DiGraph` return; for the rest expect the documented raise"), NOT silently break it. The test file's module docstring documents this handoff.
- [ ] AC-16: mock strategy registration uses the public test-only seam. A separate (non-property) test calls `register_dep_graph_strategy(PackageManager.npm)(mock_fn)` where `mock_fn(ctx, manifests)` returns a sentinel `nx.DiGraph` instance; asserts `default_dep_graph_registry.dispatch(PackageManager.npm, ctx, manifests) is sentinel_graph` (identity, not copy; pinned by S1-10 AC-11); cleans up in a `finally:` block via `default_dep_graph_registry.unregister_for_tests(PackageManager.npm)`. The test never mutates `default_dep_graph_registry._strategies` directly; the Open/Closed seam (`register_dep_graph_strategy` + `unregister_for_tests`) is the only API touched. Other `PackageManager` members in the same test still raise the documented `DepGraphRegistryError` (proves the registration is scoped, not global).
- [ ] AC-17: `--max-examples=200` with `deadline=None, database=None`. (Overkill for the closed enum, but consistent with the rest of the property surface; Hypothesis exhausts the closed set quickly and the remaining budget is harmless repeats.)
- [ ] AC-18: wall-clock < 10 s on CI.
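The two checks in AC-14 and AC-16 (empty-registry totality, then a scoped mock with `try/finally` teardown) can be sketched against a toy registry. This is not the shipped `DepGraphRegistry`: the `Registry` and `RegistryError` classes, the zero-argument strategy signature, and the `pnpm` / `yarn` members are hypothetical; only the error prefix, `registered_ecosystems`, `unregister_for_tests`, and the `npm` member are names taken from this story.

```python
from enum import Enum
from typing import Callable, Dict, FrozenSet


class PackageManager(Enum):  # hypothetical member set, for illustration only
    npm = "npm"
    pnpm = "pnpm"
    yarn = "yarn"


class RegistryError(Exception):
    pass


class Registry:
    """Toy stand-in: dispatch raises with a structural prefix when nothing is registered."""

    def __init__(self) -> None:
        self._strategies: Dict[PackageManager, Callable[[], object]] = {}

    def register(self, pm: PackageManager, fn: Callable[[], object]) -> None:
        self._strategies[pm] = fn

    def unregister_for_tests(self, pm: PackageManager) -> None:
        self._strategies.pop(pm, None)

    def registered_ecosystems(self) -> FrozenSet[PackageManager]:
        return frozenset(self._strategies)

    def dispatch(self, pm: PackageManager) -> object:
        try:
            fn = self._strategies[pm]
        except KeyError:
            raise RegistryError(f"no_strategy_for_ecosystem: {pm!r}") from None
        return fn()


def check_empty_registry_is_total(registry: Registry) -> None:
    assert registry.registered_ecosystems() == frozenset()
    for pm in PackageManager:  # closed set, so plain iteration is exhaustive
        try:
            registry.dispatch(pm)
        except RegistryError as err:
            assert str(err).startswith("no_strategy_for_ecosystem: ")
            assert repr(pm) in str(err)
        else:
            raise AssertionError(f"{pm!r} dispatched without a strategy")


def check_scoped_mock(registry: Registry) -> None:
    sentinel = object()
    registry.register(PackageManager.npm, lambda: sentinel)
    try:
        assert registry.dispatch(PackageManager.npm) is sentinel  # identity, not a copy
    finally:
        registry.unregister_for_tests(PackageManager.npm)  # teardown runs even on failure
    assert registry.registered_ecosystems() == frozenset()
```

Running the totality check again after the mock check is a cheap way to confirm the teardown left no registration behind, which is the pollution the autouse fixture in AC-14 is meant to catch.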
test_trace_coverage_invariants.py (replaces test_trace_coverage_well_formed.py — there is no TraceCoverage class)
- [ ] AC-19. `tests/property/test_trace_coverage_invariants.py` exists. The property tests target the shipped surface: `codegenie.probes.layer_c.runtime_trace._aggregate_scenarios` (pure fold from `Sequence[ScenarioResult]` → `_AggregatedSlice`) and `_derive_trace_coverage_confidence(results) -> Literal["high","medium","low","unavailable"]`. Both are private (`_` prefix), so the test imports them directly with a `# type: ignore[reportPrivateUsage]` line documented in the file's top comment as "intentional — property test of a pure fold; no public API is more honest than the function under test."
- [ ] AC-20: partition + uniqueness invariants over `_aggregate_scenarios`. Hypothesis generates a list of `ScenarioResult` (any combination of `TraceScenarioCompleted`, `TraceScenarioFailed`, `TraceScenarioSkipped`) with the constraint `unique_by=lambda r: r.scenario_name` (the function's pre-condition: the runtime trace probe never emits duplicate scenario names; the test mirrors the contract). For each generated input `results`, the test builds `parsed = {r.scenario_name: ParsedTrace(...) for r in results if isinstance(r, TraceScenarioCompleted)}` (any well-formed `ParsedTrace` stub), then calls `slice_ = _aggregate_scenarios(results, parsed)` and asserts:
  - `len(slice_.scenarios_run) + len(slice_.scenarios_failed) + skipped_count == len(results)`, where `skipped_count = sum(1 for r in results if isinstance(r, TraceScenarioSkipped))`.
  - `set(slice_.scenarios_run) & set(slice_.scenarios_failed) == set()`: no name appears in both lists.
  - `set(slice_.per_scenario_artifacts.keys()) == {r.scenario_name for r in results}`: every scenario name is keyed.
  - `slice_.trace_coverage_confidence == "unavailable"` iff `len(results) == 0` (canonical-empty case).
  - When all scenarios are `TraceScenarioCompleted` with `len(results) >= 5`, `slice_.trace_coverage_confidence == "high"`; with `len(results) == 1` and `scenario_name == "startup"`, `"low"`; with `len(results) == 1` and `scenario_name != "startup"`, `"medium"`; with `2 <= len(results) <= 4`, `"medium"`. (This is a parameterized table-test, not the property-level assertion: Hypothesis covers all combinations; the table-test pins the precedence reading of `_derive_trace_coverage_confidence`.)
- [ ] AC-21: confidence-derivation totality. Hypothesis generates the same `Sequence[ScenarioResult]` space and asserts `_derive_trace_coverage_confidence(results)` returns a value in the closed `Literal["high","medium","low","unavailable"]` set; i.e., it never raises, never returns `None`, never returns an out-of-set string. (Pydantic doesn't validate the function return; a mypy-strict + runtime `assert` in the test is the redundant defense.)
- [ ] AC-22: `--max-examples=200` with `deadline=None, database=None`.
- [ ] AC-23: `assert_never` on `ScenarioResult` variants in a separate exhaustive-match unit test under `tests/unit/probes/layer_c/test_scenario_result_assert_never.py`. Constructs one instance of every `ScenarioResult` variant; the test function's exhaustive `match` closes on `assert_never(_)`. A missing `case` is a `mypy --warn-unreachable` build error against the per-module override. Mirrors AC-5's discipline.
- [ ] AC-24: wall-clock < 30 s on CI.
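The partition and uniqueness invariants from AC-20 can be exercised against a toy fold. This sketch does not reproduce `_aggregate_scenarios` or `_AggregatedSlice`: the three dataclasses, the `aggregate` function and its tuple return, and the seeded `random.Random` generator (standing in for a Hypothesis `st.lists(..., unique_by=...)` strategy) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


# Hypothetical stand-ins for the three ScenarioResult variants.
@dataclass(frozen=True)
class Completed:
    scenario_name: str


@dataclass(frozen=True)
class Failed:
    scenario_name: str


@dataclass(frozen=True)
class Skipped:
    scenario_name: str


ScenarioResult = Union[Completed, Failed, Skipped]


def aggregate(results: Sequence[ScenarioResult]) -> Tuple[List[str], List[str], int]:
    """Toy fold: (names run, names failed, skipped count)."""
    run = [r.scenario_name for r in results if isinstance(r, Completed)]
    failed = [r.scenario_name for r in results if isinstance(r, Failed)]
    skipped = sum(1 for r in results if isinstance(r, Skipped))
    return run, failed, skipped


def random_results(rng: random.Random) -> List[ScenarioResult]:
    # Unique names mirror the pre-condition that unique_by would enforce.
    names = [f"scenario-{i}" for i in range(rng.randint(0, 8))]
    return [rng.choice((Completed, Failed, Skipped))(name) for name in names]


def check_partition(examples: int = 200, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(examples):
        results = random_results(rng)
        run, failed, skipped = aggregate(results)
        assert len(run) + len(failed) + skipped == len(results)  # partition
        assert set(run) & set(failed) == set()                   # no name in both lists
        assert len(set(run + failed)) == len(run + failed)       # no duplicates
```

The generator includes the empty list (zero names), so the canonical-empty case is part of the sampled space rather than a separate test.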
test_portfolio_sweep.py — serial portfolio integration
- [ ] AC-25. `tests/integration/portfolio/test_portfolio_sweep.py` exists; gathers every fixture under `tests/fixtures/portfolio/` serially (`for fixture in sorted(fixtures): ...`) via `codegenie.exec.run_allowlisted`.
- [ ] AC-26: every gather exits 0 with an explicit stderr allowlist. For each fixture, `codegenie gather <fixture>` returns exit code 0. Stderr is split on `\n`; every non-empty line must begin with one of the documented warning IDs (allowlist literal in the test source): `skill_shadowed`, `strace_unavailable`, `image_digest_unresolved`, `external_docs_skipped`. A line that doesn't match, and any line containing `Traceback`, `Error`, or `Exception` outside that allowlist, fails the test with the offending line in the failure message. The allowlist tuple is module-level and grep-discoverable, so adding a new documented warning is a one-line edit.
- [ ] AC-27: envelope schema validation. For each fixture's resulting `repo-context.yaml`, the test loads it via the project's `safe_yaml.load` (NOT `yaml.safe_load`; the project's wrapper is the chokepoint) AND validates against the Phase-2 envelope schema (`src/codegenie/schema/repo_context.schema.json`, extended in Steps 4–6). Validation failure fails the test with the full JSONSchema error path (`error.absolute_path`) for actionable diagnostics.
- [ ] AC-28: golden diff empty. After gathering, the test invokes `scripts/regen_golden.py --check --portfolio` via `run_allowlisted` and asserts exit 0. (Redundant with `tests/golden/test_goldens_match.py` from S7-03, but appropriate here because the portfolio sweep is the integration-level gate; the golden harness is the unit-test-level gate.) If S7-03's regen script ships only the `--check` mode and not the `--portfolio` flag at PR time, this AC is implemented behind a `pytest.mark.skipif(not _has_portfolio_check_mode(), reason="...")` gate naming the missing flag, and S8-03 lifts the skip when the flag lands.
- [ ] AC-29: wall-clock budget ≤ 6 minutes, hard. The test measures total wall-clock across the five fixtures and asserts `total_seconds <= 360` (the Phase-2 `portfolio` job budget per `phase-arch-design.md §"Testing strategy"`). If exceeded, the test fails with the per-fixture breakdown so the reviewer can see which fixture regressed. The local-vs-CI distinction is dropped (a developer machine running outside the budget is a probe-regression signal, not a per-machine tolerance).
- [ ] AC-30: serial dispatch. No `pytest-xdist`, no `multiprocessing`, no `asyncio.gather`; for-loop iteration with sequential `run_allowlisted` invocations. ADR-0009 honored. The test is decorated `@pytest.mark.serial` and the file's module docstring names the ADR.
- [ ] AC-31: clean tmpdir per fixture. Each fixture is copied to a fresh `tmp_path / fixture.name` via `shutil.copytree` (not `subprocess.run(["cp", "-R", ...])`; `shutil.copytree` is the cross-platform stdlib equivalent and avoids a `cp` allowlist line for a pure-Python copy). Cache + context outputs land in the tmpdir. The canonical fixture tree under `tests/fixtures/portfolio/` is never written to (the test asserts that `_PORTFOLIO_DIR_HASH` matches before and after: a `_dir_sha256(_PORTFOLIO)` snapshot taken at test start and re-checked at test end).
- [ ] AC-32: wall-clock per fixture recorded without dirtying the repo. The test collects `{fixture_name: walltime_seconds}` in memory. If the env var `CODEGENIE_PORTFOLIO_WALLTIME_OUT` is set to a path (CI sets it to a job-artifact path under `${{ runner.temp }}`), the test writes the JSON to that path. Without the env var, the test prints the table to stdout (visible under `pytest -s`) and does NOT write to the repo tree. The S8-03 bench script (`bench_portfolio_walltime.py`) reads the artifact via the same env var; the cross-story handoff contract is documented in this story's PR description.
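Two of the checks above are small enough to sketch directly: the AC-26 prefix allowlist over stderr lines and the AC-31 before/after directory digest. The sketch follows the ACs as written (the Status line notes the executed story adapted AC-26 to a structured-JSON-log check). The prefixes are copied from AC-26; the function names `offending_stderr_lines` and `dir_sha256`, and the exact hashing layout, are assumptions and need not match the test's `_dir_sha256`.

```python
import hashlib
from pathlib import Path
from typing import List, Tuple

# Prefixes copied from AC-26; kept module-level so they are grep-discoverable.
ALLOWED_STDERR_PREFIXES: Tuple[str, ...] = (
    "skill_shadowed",
    "strace_unavailable",
    "image_digest_unresolved",
    "external_docs_skipped",
)


def offending_stderr_lines(stderr: str) -> List[str]:
    """Return every non-empty stderr line that does not start with an allowed prefix."""
    return [
        line
        for line in stderr.split("\n")
        if line and not line.startswith(ALLOWED_STDERR_PREFIXES)
    ]


def dir_sha256(root: Path) -> str:
    """Hash relative paths and file bytes in sorted order so the digest is stable."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

In a test body, `assert offending_stderr_lines(proc.stderr) == []` puts the offending lines straight into the failure message, and comparing `dir_sha256(fixtures_root)` at start and end detects any write into the canonical fixture tree.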
Determinism, audit hygiene, type cleanliness
- [ ] AC-33: every property test passes `mypy --strict`. Hypothesis's `@given` decorators carry full type annotations; no `Any` outside what Hypothesis's API demands (`@given` itself is typed `Any` upstream; type-ignore comments at the decorator line are the only allowed Hypothesis-specific concession).
- [ ] AC-34: Hypothesis strategies are explicit, not `from_type` magic. Each property test declares its strategies explicitly (e.g., `commits_behind_strategy = hypothesis.strategies.builds(CommitsBehind, n=integers(min_value=0, max_value=10_000), last_indexed=text(...))`). `hypothesis.strategies.from_type(IndexFreshness)` would silently do the right thing (or silently fail to); explicit beats implicit, especially for discriminated unions. The existing precedents (`test_sum_types_roundtrip.py`, `test_index_freshness_roundtrip.py`) already follow this discipline; new files mirror it verbatim.
- [ ] AC-35: no flakes; `database=None` in CI. Each property test uses `@settings(database=None)` to disable Hypothesis's persistent example database for CI reproducibility (committing `tests/property/.hypothesis/` is explicitly out of scope for Phase 2 per "Patterns DELIBERATELY deferred"). The PR description documents that every property test was run 100 times locally with `--hypothesis-seed=0` and passed 100/100; the loop is `pytest tests/property/ --hypothesis-seed=0` executed in a shell `for i in $(seq 1 100); do ...; done` before opening the PR.
- [ ] AC-36: portfolio sweep passes against all five fixtures (smoke-verified locally before opening the PR; the output of `pytest tests/integration/portfolio/test_portfolio_sweep.py -v -s`, including the per-fixture walltime table, is pasted into the PR description).
- [ ] AC-37: coordination, not duplication. `tests/property/test_index_freshness_roundtrip.py` and `tests/property/test_sum_types_roundtrip.py` already ship (S1-01, S5-01). This story extends them in place; no new `test_scanner_outcome_roundtrip.py` file is created (its content already lives in `test_sum_types_roundtrip.py`). A grep-precheck in the test file's top comment notes the prior file and the AC mapping (AC-7..AC-11 → existing file; AC-12 → new `test_redacted_slice_roundtrip.py`; AC-19..AC-24 → new `test_trace_coverage_invariants.py`; AC-13..AC-18 → new `test_dep_graph_strategy_dispatch.py`).
Implementation outline
- Extend `tests/property/test_index_freshness_roundtrip.py` (S1-01 ships it). Add `@settings(max_examples=200, deadline=None, database=None)` to the existing `test_index_freshness_roundtrips_identity`. Confirm the existing `_freshness` / `_stale_reasons` strategies already match the shipped variant set with correct field names (AC-2 audit). Run; observe pass with the larger example budget.
- Add the `assert_never` exhaustive-match test (AC-5) under `tests/unit/indices/test_freshness_assert_never.py` (or appended to `tests/unit/indices/test_freshness.py` if S1-01 already has the file). Run `mypy --warn-unreachable` against the file; observe pass. Temporarily comment out one `case` line and re-run; observe mypy failure. Restore. Commit.
- Extend `tests/property/test_sum_types_roundtrip.py` (S5-01 ships it). Add `@settings(max_examples=200, deadline=None, database=None)` to both round-trip tests. No other change; the strategies are already correct.
- Add the no-plaintext-leak companion test (AC-9) under `tests/unit/output/test_finding_redaction.py`: it constructs a synthetic `Finding` with a plaintext-secret-shaped `metadata` value, threads it through `redact_secrets`, and asserts the resulting `RedactedSlice.slice` JSON contains zero plaintext-secret substring matches.
- Write `tests/property/test_redacted_slice_roundtrip.py` (AC-12). Hypothesis generates `dict[str, JSONValue]` payloads; every example is passed through `redact_secrets(...)` to obtain a `RedactedSlice`; the test asserts round-trip identity via `TypeAdapter[RedactedSlice]`. The file's top comment names ADR-0010 and the S7-04 firewall: every `RedactedSlice` instance reaches the test via `redact_secrets`, never via the model constructor.
- Write `tests/property/test_dep_graph_strategy_dispatch.py` (AC-13..AC-17). An autouse fixture asserts `default_dep_graph_registry.registered_ecosystems() == frozenset()` before each example (and unregisters anything left over with a failure message naming the polluter). The property body samples one `PackageManager` member per example, calls `dispatch(...)`, and asserts the documented `DepGraphRegistryError` raise with the structural prefix. The mock-strategy test (AC-16) is a separate non-property test using `register_dep_graph_strategy(PackageManager.npm)` + `try/finally` with `unregister_for_tests`.
- Write `tests/property/test_trace_coverage_invariants.py` (AC-19..AC-22). The Hypothesis strategy for `list[ScenarioResult]` uses `unique_by=lambda r: r.scenario_name` to mirror the runtime-trace pre-condition. The property body calls `_aggregate_scenarios(results, parsed)` and asserts the partition + uniqueness invariants; a separate strategy targets `_derive_trace_coverage_confidence` directly for totality (AC-21).
- Add `tests/unit/probes/layer_c/test_scenario_result_assert_never.py` (AC-23). One instance per `ScenarioResult` variant; the exhaustive `match` closes on `assert_never(_)`; `mypy --warn-unreachable` enforces.
- Write `tests/integration/portfolio/test_portfolio_sweep.py` (AC-25..AC-32). Serial for-loop; `shutil.copytree` each fixture to `tmp_path / fixture.name`; `run_allowlisted([sys.executable, "-m", "codegenie", "gather", str(workdir)], ...)`; assert exit + stderr allowlist + schema + golden-diff; collect walltimes in memory; write to `CODEGENIE_PORTFOLIO_WALLTIME_OUT` if set; assert `_PORTFOLIO_DIR_HASH` unchanged at test end. Run; observe pass (or debug the failing fixture + probe combination).
- Stabilize. Run each property test 100 times locally with
pytest tests/property/ --hypothesis-seed=0. Confirm 100/100 passes. If any flake, investigate — Hypothesis's persistent database is a common culprit (database=Noneper AC-35 is the prescribed mitigation). - Sweep budget check. Run the portfolio sweep locally; record per-fixture wall-clock; confirm
total_seconds <= 360(AC-29). If a fixture's gather exceeds expectation, debug — usually a probe regressing into a worst-case path. - Final pass:
mypy --strict,ruff check,ruff format --check,make check. Green.
TDD plan — red / green / refactor
Red — failing property tests first
# tests/property/test_index_freshness_roundtrip.py (extension of the S1-01 file)
# Existing strategies (already shipped under S1-01) are unchanged:
#   _commits_behind  uses CommitsBehind(n, last_indexed)
#   _digest_mismatch uses DigestMismatch(expected, actual)          <-- not (last_traced, current_built)
#   _coverage_gap    uses CoverageGap(files_indexed, files_in_repo) <-- not (missing_files, ...)
#   _indexer_error   uses IndexerError(message)
#   _freshness       one_of(builds(Fresh, indexed_at=_aware_datetimes), builds(Stale, reason=...))
#
# This story adds ONLY the @settings decoration to the existing test:
from hypothesis import given, settings

# ... (existing imports and strategies — see S1-01 shipped file)


@given(value=_freshness)
@settings(max_examples=200, deadline=None, database=None)  # AC-4, AC-35
def test_index_freshness_roundtrips_identity(value: IndexFreshness) -> None:
    decoded = _adapter.validate_json(_adapter.dump_json(value))
    assert decoded == value
    assert type(decoded) is type(value)
    if isinstance(value, Stale):
        assert isinstance(decoded, Stale)
        assert type(decoded.reason) is type(value.reason)
# tests/unit/indices/test_freshness_assert_never.py (AC-5)
from datetime import UTC, datetime
from typing import assert_never

from codegenie.indices import (
    Fresh, Stale, CommitsBehind, DigestMismatch, CoverageGap, IndexerError, IndexFreshness,
)


def _stringify(x: IndexFreshness) -> str:
    match x:
        case Fresh():
            return "fresh"
        case Stale(reason=CommitsBehind(n=n)):
            return f"stale_commits_behind_{n}"
        case Stale(reason=DigestMismatch()):
            return "stale_digest_mismatch"
        case Stale(reason=CoverageGap()):
            return "stale_coverage_gap"
        case Stale(reason=IndexerError()):
            return "stale_indexer_error"
        case _:
            assert_never(x)


def test_exhaustive_match_assert_never() -> None:
    """AC-5 — match is exhaustive over every StaleReason variant;
    mypy --warn-unreachable on this module enforces it at build time."""
    assert _stringify(Fresh(indexed_at=datetime(2026, 1, 1, tzinfo=UTC))) == "fresh"
    assert _stringify(Stale(reason=CommitsBehind(n=1, last_indexed="a"*40))).startswith("stale_commits_behind_")
    assert _stringify(Stale(reason=DigestMismatch(expected="x"*64, actual="y"*64))) == "stale_digest_mismatch"
    assert _stringify(Stale(reason=CoverageGap(files_indexed=0, files_in_repo=0))) == "stale_coverage_gap"
    assert _stringify(Stale(reason=IndexerError(message="boom"))) == "stale_indexer_error"
# tests/property/test_dep_graph_strategy_dispatch.py (AC-13..AC-17)
from __future__ import annotations

from typing import get_args

import pytest
from hypothesis import given, settings
from hypothesis import strategies as st

from codegenie.depgraph import (
    DepGraphRegistryError,
    default_dep_graph_registry,
    register_dep_graph_strategy,
)
from codegenie.types.identifiers import PackageManager

_package_managers = st.sampled_from(list(get_args(PackageManager)))


@pytest.fixture(autouse=True)
def _registry_is_empty() -> None:
    leftover = default_dep_graph_registry.registered_ecosystems()
    assert leftover == frozenset(), (
        f"singleton polluted by prior test; leftover ecosystems={leftover!r}"
    )


@given(ecosystem=_package_managers)
@settings(max_examples=200, deadline=None, database=None)
def test_dispatch_phase2_invariant_raises_documented_error(ecosystem: PackageManager) -> None:
    """AC-14, AC-15 — with zero strategies registered, every PackageManager member
    raises DepGraphRegistryError with the documented structural prefix."""
    assert default_dep_graph_registry.has_strategy(ecosystem) is False
    with pytest.raises(DepGraphRegistryError) as exc_info:
        default_dep_graph_registry.dispatch(ecosystem, ctx=None, manifests=[])  # type: ignore[arg-type]
    msg = str(exc_info.value)
    assert msg.startswith("no_strategy_for_ecosystem: "), msg
    assert repr(ecosystem) in msg, msg
# tests/integration/portfolio/test_portfolio_sweep.py (AC-25..AC-32)
from __future__ import annotations

import hashlib
import json
import os
import shutil
import sys
import time
from pathlib import Path

import pytest
from jsonschema import validate

from codegenie.exec import run_allowlisted
from codegenie.parsers import safe_yaml

_REPO_ROOT = Path(__file__).resolve().parents[3]
_PORTFOLIO = _REPO_ROOT / "tests" / "fixtures" / "portfolio"
_SCHEMA = _REPO_ROOT / "src" / "codegenie" / "schema" / "repo_context.schema.json"

# AC-26 — explicit, grep-discoverable allowlist of stderr line prefixes.
_STDERR_ALLOWLIST: tuple[str, ...] = (
    "skill_shadowed",
    "strace_unavailable",
    "image_digest_unresolved",
    "external_docs_skipped",
)

_TOTAL_WALLCLOCK_BUDGET_S = 360.0  # AC-29 hard ceiling


def _enumerate_fixtures() -> list[Path]:
    return sorted(p for p in _PORTFOLIO.iterdir() if p.is_dir() and not p.name.startswith("_"))


def _dir_sha256(root: Path) -> str:
    h = hashlib.sha256()
    for p in sorted(root.rglob("*")):
        if p.is_file():
            h.update(str(p.relative_to(root)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()


@pytest.mark.serial  # AC-30 — ADR-0009; never xdist
def test_portfolio_sweep(tmp_path: Path) -> None:
    schema = json.loads(_SCHEMA.read_text())
    walltimes: dict[str, float] = {}
    pre_hash = _dir_sha256(_PORTFOLIO)  # AC-31 fixture-purity check
    sweep_t0 = time.perf_counter()

    for fixture in _enumerate_fixtures():
        workdir = tmp_path / fixture.name
        shutil.copytree(fixture, workdir)  # AC-31 stdlib; no `cp` subprocess
        t0 = time.perf_counter()
        result = run_allowlisted(
            [sys.executable, "-m", "codegenie", "gather", str(workdir)],
            cwd=_REPO_ROOT,
            timeout_seconds=180,
        )
        walltimes[fixture.name] = time.perf_counter() - t0

        # AC-26 — stderr allowlist
        stderr_text = result.stderr_tail.decode("utf-8", errors="replace")
        for line in stderr_text.splitlines():
            if not line.strip():
                continue
            assert any(line.startswith(p) for p in _STDERR_ALLOWLIST), (
                f"{fixture.name}: undocumented stderr line: {line!r}"
            )
            assert "Traceback" not in line and "Exception" not in line, (
                f"{fixture.name}: error keyword in stderr: {line!r}"
            )
        assert result.exit_code == 0, f"{fixture.name}: exit={result.exit_code}; stderr={stderr_text!r}"

        # AC-27 — schema validation via the project's safe_yaml chokepoint
        ctx_path = workdir / ".codegenie" / "context" / "repo-context.yaml"
        validate(instance=safe_yaml.load(ctx_path.read_text()), schema=schema)

    # AC-28 — golden diff empty (skip if S7-03's --portfolio flag hasn't landed yet)
    check_result = run_allowlisted(
        [sys.executable, str(_REPO_ROOT / "scripts" / "regen_golden.py"), "--check", "--portfolio"],
        cwd=_REPO_ROOT,
        timeout_seconds=120,
    )
    assert check_result.exit_code == 0, (
        f"Golden diff non-empty:\n{check_result.stderr_tail.decode('utf-8', errors='replace')}"
    )

    total_wallclock = time.perf_counter() - sweep_t0
    assert total_wallclock <= _TOTAL_WALLCLOCK_BUDGET_S, (
        f"portfolio sweep exceeded {_TOTAL_WALLCLOCK_BUDGET_S}s budget: {total_wallclock:.1f}s\n"
        + json.dumps(walltimes, sort_keys=True, indent=2)
    )

    # AC-31 — fixture tree untouched
    assert _dir_sha256(_PORTFOLIO) == pre_hash, "canonical portfolio fixture tree was modified"

    # AC-32 — walltime artifact (env-gated; never dirties the repo)
    out_path = os.environ.get("CODEGENIE_PORTFOLIO_WALLTIME_OUT")
    if out_path:
        Path(out_path).write_text(json.dumps(walltimes, sort_keys=True, indent=2) + "\n")
    else:
        print("\nportfolio walltimes (seconds):", json.dumps(walltimes, sort_keys=True, indent=2))
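The status note for this story records that AC-26 was ultimately implemented as a structured-JSON-log check rather than the prefix allowlist drafted above. A minimal sketch of what such a check could look like, assuming each stderr line is a JSON object carrying "level" and "event" keys. Those key names, the allowed-level set, and the helper name are assumptions made for illustration, not the project's actual log schema:

```python
import json

# Assumed log levels that a clean run may emit (hypothetical).
_ALLOWED_LEVELS = frozenset({"debug", "info", "warning"})


def undocumented_stderr_lines(stderr_text: str) -> list[str]:
    """Return the lines that are not well-formed, allowed JSON log records."""
    offenders: list[str] = []
    for line in stderr_text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored, as in the prefix version
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            offenders.append(line)  # raw tracebacks and stray prints land here
            continue
        if not isinstance(record, dict) or "event" not in record:
            offenders.append(line)  # valid JSON, but not a log record
        elif record.get("level") not in _ALLOWED_LEVELS:
            offenders.append(line)  # error-level (or unlabeled) record
    return offenders


clean = '{"level": "warning", "event": "strace_unavailable"}\n'
print(undocumented_stderr_lines(clean))
```

A sweep test would then assert that the returned list is empty and include the offending lines in the failure message. Parsing each line has one advantage over prefix matching: an unstructured traceback fails the check even when it happens to start with an allowlisted word.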
Green — make it pass
With S1-01, S5-01, S1-10, and S5-02 types in place AND S7-03's goldens committed AND all five fixtures from S7-01/S7-02 on disk, every test in this story should pass on first run. If any fails, the failure points to a real bug — fix the production code, not the test.
Mutation-resistance witness table
| Mutation | Test that catches it |
|---|---|
| Add IndexFreshness variant Stale.NetworkPartition (missing discriminator) | test_index_freshness_roundtrips_identity — model_validate_json fails on round-trip; test_exhaustive_match_assert_never fires mypy --warn-unreachable |
| Drop the n field from CommitsBehind | test_index_freshness_roundtrips_identity round-trip fails (Pydantic extra="forbid" + the strategy's required-field build) |
| Rename DigestMismatch.expected → DigestMismatch.last_traced without updating strategies | test_index_freshness_roundtrips_identity — st.builds(DigestMismatch, expected=...) fails at collection time with TypeError: unexpected keyword 'expected', naming the drift |
| ScannerRan.findings element loses severity constraint (regex/Literal regression) | Existing test_sum_types_roundtrip.py — Pydantic refuses the out-of-set sample; the round-trip fails on the offending example |
| Plaintext slips into Finding.metadata and through to writer | test_finding_redaction.py (AC-9) — asserts the post-redact_secrets slice JSON contains zero plaintext substring matches |
| RedactedSlice.fingerprints shape regresses (non-8-hex string admitted) | test_redacted_slice_roundtrip — since every example transits redact_secrets, the regression surfaces as a round-trip identity mismatch or a model-validation failure |
| Future contributor adds a dep-graph strategy without updating test_dep_graph_strategy_dispatch.py | The Phase-2 invariant (registered_ecosystems() == frozenset()) fails in the autouse fixture — the test fails loudly with a named pointer to the polluter, forcing the Phase-3 PR to explicitly update |
| DepGraphRegistry.dispatch quietly drops the "no_strategy_for_ecosystem: " prefix | test_dispatch_phase2_invariant_raises_documented_error — prefix assertion fires |
| _aggregate_scenarios admits a duplicate scenario name | test_trace_coverage_invariants — uniqueness invariant + Hypothesis unique_by mismatch fires |
| _aggregate_scenarios returns len(scenarios_run) + len(scenarios_failed) + skipped != len(results) | test_trace_coverage_invariants — partition invariant fails |
| _derive_trace_coverage_confidence returns an out-of-Literal value or raises | test_trace_coverage_invariants totality assertion fires |
| A fixture × probe combination crashes (e.g., DepGraphProbe against monorepo-pnpm hits an unhandled KeyError) | test_portfolio_sweep — exit-code-non-zero assertion fires; Traceback/Exception keyword assertion fires |
| A probe's slice schema drifts (e.g., adds a new field without updating repo_context.schema.json) | test_portfolio_sweep — jsonschema.validate fails with the absolute schema path |
| Golden file silently goes stale | test_portfolio_sweep AC-28 — regen_golden.py --check --portfolio returns non-zero |
| Implementer enables pytest-xdist for the portfolio sweep | @pytest.mark.serial + the for-loop iteration make this impossible to silently enable; the registry-emptiness autouse fixture also fires under shared-process contention |
| Test silently writes walltimes.json into the working tree | test_portfolio_sweep writes only when CODEGENIE_PORTFOLIO_WALLTIME_OUT is set; the _dir_sha256 fixture-purity check fires if the canonical fixture tree changed |
Refactor — clean up
- The four property-test files share a structural pattern (Hypothesis strategy declarations → @given round-trip → assert_never exhaustive match). DO NOT extract a kernel — four consumers is at the Rule-of-Three boundary, but the variant-strategy declarations are specific to each type (StaleReason for one, ScenarioResult for another); extracting would require dependency-injecting the type, which obscures more than it clarifies. Re-evaluate at the fifth property test (Phase 3+).
- test_portfolio_sweep.py's walltime recording (AC-32) is the seed S8-03's bench_portfolio_walltime.py consumes. The file format ({fixture_name: walltime_seconds}) is documented in this story's PR description; S8-03 inherits the contract.
- --max-examples=200 is a Hypothesis convention; the budget assumes round-trip work is cheap. If a property test exceeds its AC-budget (AC-6, AC-11, AC-18, AC-22, AC-24, AC-29), the bottleneck is either Hypothesis shrinking (set phases=[...] to skip shrinking on CI) OR the type's round-trip latency itself (investigate Pydantic field count, custom validators).
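The walltime artifact contract described above ({fixture_name: walltime_seconds}, written as sorted-key JSON) is simple enough to sketch from the consumer side. The function names, the fixture names, and the sample numbers below are illustrative assumptions; S8-03 owns the real consumer. The 90-second threshold reuses the per-fixture cold p50 target stated elsewhere in this story.

```python
import json
import tempfile
from pathlib import Path


def load_walltimes(path: Path) -> dict[str, float]:
    """Read the {fixture_name: walltime_seconds} artifact and validate its shape."""
    data = json.loads(path.read_text())
    if not isinstance(data, dict):
        raise ValueError("walltime artifact must be a JSON object")
    return {str(name): float(seconds) for name, seconds in data.items()}


def over_budget(walltimes: dict[str, float], per_fixture_s: float) -> list[str]:
    """Names of fixtures whose gather exceeded the per-fixture threshold."""
    return sorted(name for name, s in walltimes.items() if s > per_fixture_s)


# Round-trip demo with made-up numbers, serialized the same way the sweep does.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "walltimes.json"
    sample = {"fixture_a": 41.5, "fixture_b": 97.2}
    artifact.write_text(json.dumps(sample, sort_keys=True, indent=2) + "\n")
    loaded = load_walltimes(artifact)

slow = over_budget(loaded, per_fixture_s=90.0)
total = sum(loaded.values())
print(slow, f"{total:.1f}s of the 360s sweep budget")
```

Keeping the artifact a flat name-to-seconds mapping means a consumer needs no schema beyond "JSON object of numbers", which is what lets the contract live in a PR description rather than a shared module.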
Files to touch
| Path | Why |
|---|---|
| tests/property/test_index_freshness_roundtrip.py (extend in place) | Add @settings(max_examples=200, deadline=None, database=None) to the existing test |
| tests/property/test_sum_types_roundtrip.py (extend in place) | Add the same @settings decoration to both round-trip tests; no other change |
| tests/property/test_redacted_slice_roundtrip.py (new) | RedactedSlice JSON round-trip; every example obtained via redact_secrets (ADR-0010) |
| tests/property/test_dep_graph_strategy_dispatch.py (new) | Phase-2 zero-strategy invariant + documented-raise structural-prefix assertion |
| tests/property/test_trace_coverage_invariants.py (new) | Partition / uniqueness / confidence-totality over _aggregate_scenarios + _derive_trace_coverage_confidence |
| tests/unit/indices/test_freshness_assert_never.py (new — or appended to existing file) | AC-5 exhaustive match + mypy --warn-unreachable enforcement |
| tests/unit/probes/layer_c/test_scenario_result_assert_never.py (new) | AC-23 exhaustive match + mypy --warn-unreachable enforcement |
| tests/unit/output/test_finding_redaction.py (new) | AC-9 — Finding.metadata plaintext never reaches the writer |
| tests/integration/portfolio/test_portfolio_sweep.py (new) | AC-25..AC-32 serial sweep + stderr allowlist + schema + golden-diff + walltime artifact |
| tests/property/conftest.py (optional) | Hypothesis settings profile (max_examples, deadline, database); only if duplication across the four property files is uncomfortable for the implementer |
Deliberately NOT created: tests/property/test_scanner_outcome_roundtrip.py — its content already lives in tests/property/test_sum_types_roundtrip.py (S5-01); duplicating would violate Rule 3 (surgical changes) and create two strategies-of-record for the same type. The extension AC (AC-7) makes the coordination explicit.
Out of scope
- CI wiring (portfolio + property job lanes) — S8-03.
- bench_portfolio_walltime.py + baselines — S8-03 (this story produces the seed walltime data via CODEGENIE_PORTFOLIO_WALLTIME_OUT).
- Hosted-runner bench (Gap 2) — S8-03.
- Confidence-renderer + assert_never mypy --warn-unreachable enforcement at the renderer site — S8-01.
- A generic property-test kernel / shared conftest.py settings profile — premature; four consumers (Rule of Three says wait for a fifth).
- Hypothesis stateful tests (state-machine-based) — out; the Phase-2 types under property test are immutable / Pydantic frozen; stateful testing offers no advantage.
- A --max-examples=2000 deep-property CI lane — out; 200 is the convention; deepening it is a bench-driven decision, not a Phase-2 story.
- Introducing a Result[T, E] type wrapper around DepGraphRegistry.dispatch — earlier drafts assumed one existed; the registry's exception-with-structural-prefix is the shipped contract S4-05 consumes. Refactor to Result is a cross-module change with no other consumer — premature per Rule 2.
Notes for the implementer
- The property tests should pass on first run. If the extended test_index_freshness_roundtrips_identity fails after adding @settings(max_examples=200, ...), the bug is in codegenie.indices.freshness (S1-01) — investigate the Pydantic model. Don't paper over with strategy restrictions. Likewise for test_sum_types_roundtrip.py — the existing strategies are authoritative.
- database=None is mandatory on every property test in this story. Phase 2's CI determinism contract (AC-35) forbids tests/property/.hypothesis/ from being committed (the option is on the deliberately-deferred list). @settings(database=None) is the one-line enforcement; the autouse fixture pattern (a session-scoped fixture that asserts no .hypothesis/ artifacts appear in the test root) is not added in this story — database=None per-test is the simpler enforcement that doesn't introduce a new fixture surface.
- --max-examples=200 is the Hypothesis convention. Not 100 (under-coverage), not 2000 (over-budget). The Phase-2 types are small enough that 200 examples cover the variant space and find any discriminator regression quickly.
- Use hypothesis.strategies.builds(...) not from_type(...). The discriminated unions are not Hypothesis-introspectable by default; explicit strategies are predictable. AC-34 names this; the existing precedents already obey it.
- The assert_never test is the load-bearing Phase-2 type-safety enforcement. mypy --warn-unreachable on the per-module override (S1-11) fires if any case is missing. Test this manually: temporarily comment out one case in _stringify, run mypy --warn-unreachable tests/unit/indices/test_freshness_assert_never.py, observe failure, restore. Document the deliberate-fail-then-pass in PR.
- test_portfolio_sweep.py's per-fixture wall-clock timeout is generous (180 s). That's far more than the cold p50 (≤ 90 s) target. The 6-minute sweep budget (AC-29) covers all five fixtures with headroom. If a single fixture's gather exceeds 90 s in development, that's a probe-regression signal — investigate before merging.
- The walltimes.json artifact is env-gated (CODEGENIE_PORTFOLIO_WALLTIME_OUT). Without the env var the test prints to stdout (visible under pytest -s); it never writes to the repo tree. S8-03's bench_portfolio_walltime.py consumes the artifact via the same env var (CI sets it to a job-artifact path under ${{ runner.temp }}). Document the cross-story handoff in this story's PR description.
- Why serial mark on test_portfolio_sweep. ADR-0009 (pytest-xdist veto preserved); the serial mark is a pytest convention for tests that explicitly opt out of parallelization. The portfolio sweep is the canonical serial-only test in Phase 2. The mark is a documentation aid + a future-proofing hook in case a future contributor enables xdist for unit tests but forgets to exclude this one.
- The Phase-2 zero-strategy invariant (AC-15) is the load-bearing Phase-3 trip-wire. When Phase 3 lands its first @register_dep_graph_strategy(PackageManager.npm) def npm_strategy(...), this property test fails on the Phase 3 PR — which is correct. The Phase 3 author updates the test to reflect "for the registered members expect nx.DiGraph return; for the rest expect the documented raise". This is the explicit Open/Closed seam Phase 2 documented. Document this handoff in the test file's top comment and in S8-04's Phase-3-handoff issue.
- TraceCoverage is a section title in the arch doc, not a class. Earlier drafts of this story referenced a TraceCoverage Pydantic model; the shipped surface (S5-02) is a pure function _derive_trace_coverage_confidence plus a private _AggregatedSlice. The property tests target those directly via # type: ignore[reportPrivateUsage] on the import line, with a top-comment justification: "property testing a pure fold; no public re-export would be more honest than the function under test." If S8-01's confidence_section.py introduces a public TraceCoverage re-export, the test imports get one-line updated; the invariants don't change.
- Design-pattern hooks already paid for by existing code (Open/Closed seams to consume, not extract). The story consumes — never reinvents — three existing Open/Closed seams: (a) default_dep_graph_registry.register / unregister_for_tests / registered_ecosystems for the dep-graph dispatch property; (b) redact_secrets as the smart-constructor surface for RedactedSlice (ADR-0010); (c) the per-module mypy --warn-unreachable overrides (S1-11) for assert_never enforcement. The implementer's job is to exercise these seams, not invent new abstractions. A fifth property file would be the trigger for considering a tests/property/conftest.py settings profile; four are not enough to extract one (Rule of Three).
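The Phase-3 handoff described in the notes above (registered members return a graph; the rest keep raising with the documented prefix) can be sketched against a toy registry. Everything here is a stand-in: ToyRegistry, RegistryError, and the three-member enum are hypothetical and only mirror the behaviour this story documents for default_dep_graph_registry; a plain dict stands in for nx.DiGraph.

```python
from enum import Enum
from typing import Callable


class PackageManager(Enum):  # hypothetical three-member stand-in
    npm = "npm"
    pnpm = "pnpm"
    yarn = "yarn"


class RegistryError(Exception):
    pass


class ToyRegistry:
    def __init__(self) -> None:
        self._strategies: dict[PackageManager, Callable[[], dict]] = {}

    def register(self, eco: PackageManager, fn: Callable[[], dict]) -> None:
        self._strategies[eco] = fn

    def registered_ecosystems(self) -> frozenset[PackageManager]:
        return frozenset(self._strategies)

    def dispatch(self, eco: PackageManager) -> dict:
        if eco not in self._strategies:
            raise RegistryError(f"no_strategy_for_ecosystem: {eco!r}")
        return self._strategies[eco]()


def check_dispatch_is_total(registry: ToyRegistry) -> None:
    """Every enum member either dispatches or raises with the documented prefix."""
    registered = registry.registered_ecosystems()
    for eco in PackageManager:
        if eco in registered:
            assert isinstance(registry.dispatch(eco), dict)
            continue
        try:
            registry.dispatch(eco)
        except RegistryError as exc:
            assert str(exc).startswith("no_strategy_for_ecosystem: ")
            assert repr(eco) in str(exc)
        else:
            raise AssertionError(f"{eco!r} dispatched without a strategy")


registry = ToyRegistry()
check_dispatch_is_total(registry)  # Phase-2 shape: zero strategies registered
registry.register(PackageManager.npm, lambda: {"nodes": [], "edges": []})
check_dispatch_is_total(registry)  # Phase-3 shape: one registered, rest raise
```

The same totality check passes in both phases; what changes in Phase 3 is only the emptiness assertion in the autouse fixture, which is why that fixture serves as the trip-wire.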
Patterns DELIBERATELY deferred (per Rule 2)
- A generic property-test kernel / shared conftest.py settings profile — four consumers; deferred until a fifth. The duplicated @settings(max_examples=200, deadline=None, database=None) decoration is the simpler choice; if Phase 3 grows the property surface, extract then.
- Stateful property tests — out; types are immutable.
- Hypothesis database committed under git — out; database=None is the simpler choice for Phase 2 (AC-35).
- A --max-examples=2000 "deep" CI lane — out until bench data shows the shallow lane misses real bugs.
- A Result[T, E] wrapper around DepGraphRegistry.dispatch — earlier drafts of this story assumed one existed; the registry uses exception-with-structural-prefix instead (it's already the API S4-05's DepGraphProbe matches). Introducing a Result type to "make exceptions explicit" would be a cross-module refactor with one user — premature per Rule 2.