Story S6-07: Determinism-under-cassette-replay property test
Step: Step 6 (Compose FallbackTier + register typecheck.typescript SignalKind + integration)
Status: Ready
Effort: M
Depends on: S6-01 (FallbackTier shell), S6-02 (retry-bypass branch)
ADRs honored: ADR-04-0002 (Pipeline; deterministic when cassette is fixed), ADR-04-0014 (cassette discipline supports deterministic replay), ADR-04-0008 (two-threshold band is pure)
Context
Phase-arch-design §Harness §Determinism vs probabilism (lines 832–835) commits Phase 4 to deterministic behavior across every component except the probabilistic leaf, which becomes deterministic under cassette replay. Phase-arch-design §Idempotence (line 827) names the load-bearing key tuple: (cassette_id, store_digest, repo_snapshot_sha, embedding_model_digest) constant ⇒ byte-identical outcomes.
The risk is real: a flaky dict iteration order, a sort instability in the retriever, or a clock-based audit field that leaks into a digest can turn deterministic-by-construction into "passes locally, flakes in CI". The property test is the contract: 50 runs with the four-tuple constant must produce byte-identical Transform.diff_bytes and byte-identical event order (modulo timestamps and other allowlisted non-deterministic fields).
This story lands tests/property/test_determinism_under_cassette_replay.py as the contract test. Phase 6.5 (bench replay) and Phase 7 (E2E) both read this contract; a regression here is a Phase-4-merge blocker.
References: where to look
- Architecture: phase-arch-design.md §Harness §Determinism vs probabilism (lines 832–835); §Idempotence (line 827); §Concurrency (line 269 — single-async-event-loop); §Goals — G6 (replay).
- Phase ADRs: ADR-04-0002 §Tradeoffs (every step one audit event — debuggability is a sequence); ADR-04-0014 (cassettes are the determinism mechanism for the leaf); ADR-04-0008 (threshold band classifier is pure).
- Source design: final-design.md §Phase 4 Goals — "Deterministic under cassette replay"; §"Three load-bearing structural lines" item 1.
- High-level impl: High-level-impl.md §Step 6 Done criteria: "Determinism property tests/property/test_determinism_under_cassette_replay.py: 50 runs with (cassette_id, store_digest, repo_snapshot_sha, embedding_model_digest) constant: byte-identical Transform.diff_bytes and event order (modulo timestamps)".
- Existing code (after S6-01/02): src/codegenie/fallback/tier.py; tests/cassettes/anthropic/; tests/fixtures/fallback_tier_callable.py.
Goal
Land tests/property/test_determinism_under_cassette_replay.py: run FallbackTier.run 50 times with the four-tuple (cassette_id, store_digest, repo_snapshot_sha, embedding_model_digest) held constant; assert byte-identical Transform.diff_bytes across all 50 runs AND byte-identical event-kind-sequence (modulo a documented allowlist of non-deterministic fields like timestamps and randomly-generated audit-event UUIDs).
Acceptance criteria
- [ ] 50-iteration replay: the test invokes FallbackTier.run exactly 50 times under VCR cassette replay (pytest --record-mode=none) with the same advisory, repo_ctx, recipe_selection, store, embedder, prior_attempts=[].
- [ ] Byte-identical Transform.diff_bytes across all 50 runs: the set of diff_bytes values collected from the 50 runs has exactly 1 element. An assertion failure surfaces the first diverging byte index per Global Rule 12 ("fail loud"): print len(diff_bytes_a) vs len(diff_bytes_b) and the first mismatched offset.
- [ ] Byte-identical event-kind sequence: [(e.kind, e.payload_digest_blake3) for e in run_events] is identical across all 50 runs.
- [ ] Non-deterministic-fields allowlist: timestamps (emitted_at), audit-event UUIDs, and BudgetTokenId (uuid4) are explicitly listed as allowed to differ. The test strips them before comparison; the allowlist lives as a Final tuple in the test module with a docstring citing this AC.
- [ ] Same property holds for the retry-bypass path: a second test, test_determinism_under_cassette_replay_retry, runs the same 50-iteration loop with prior_attempts=[summary] (S6-02 path) and makes the same byte-identical assertion. This confirms the retry branch is just as deterministic.
- [ ] Performance budget: 50 cassette-replay iterations complete within a wall-clock cap (target ≤ 60 s total for the property test); if the test overruns, drop the iteration count to a number that still gives statistical confidence (≥ 20) and surface the change per Global Rule 12.
- [ ] Single-event-loop discipline asserted: the test runs each iteration inside a fresh asyncio.run(...) to confirm no state leaks across loops; if a leak appears (e.g., a module-level dict mutated by a probe), the test fails loud.
- [ ] Failure diagnostic is actionable: when the test fails, the error message must name (a) which run number first diverged, (b) which event index diverged, and (c) the diff between the two events. A pprint.pformat-style diff is acceptable; raw assert a == b failure messages are not (they truncate dicts).
- [ ] make check and make test are green; the property test runs as part of the default suite (not gated behind -m bench). The floor is ≥ 20 iterations per Global Rule 6 (token budget): the goal is 50, but a truthful 20 beats a fake 50.
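The "first diverging byte index" diagnostic from the criteria above could be produced by a helper along these lines. This is a minimal sketch: the names first_mismatch_offset and describe_divergence are hypothetical and do not come from the codebase.

```python
from typing import Optional


def first_mismatch_offset(a: bytes, b: bytes) -> Optional[int]:
    """Return the first index where a and b differ, or None if they are equal.

    If one input is a strict prefix of the other, the divergence is reported
    at the length of the shorter input.
    """
    for index, (byte_a, byte_b) in enumerate(zip(a, b)):
        if byte_a != byte_b:
            return index
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def describe_divergence(a: bytes, b: bytes) -> str:
    """Build a short message with both lengths and the first mismatched offset."""
    offset = first_mismatch_offset(a, b)
    if offset is None:
        return "identical"
    return f"len(a)={len(a)} vs len(b)={len(b)}; first mismatched offset={offset}"
```

Passing describe_divergence(results[0], results[i]) as the assertion message keeps the failure readable even when the diffs are large.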
Implementation outline
- New test file tests/property/test_determinism_under_cassette_replay.py. Use pytest-recording's @pytest.mark.vcr to lock the cassette; mode none (replay-only).
- Use the tests/fixtures/fallback_tier_callable.py factory from S6-01 to build a tier wired to a deterministic store + embedder + cassette-backed adapter.
- Loop 50 times: result_i = asyncio.run(tier.run(advisory, repo_ctx, sel)); capture result_i.transform.diff_bytes and the captured event tape.
- Assertion helpers:
  - Use a small _diff_runs(run_a, run_b) -> str helper that returns a pprint.pformat-style diff highlighting the first divergence; call it from the assertion failure message via assert ..., f"runs diverge:\n{_diff_runs(...)}".
- The second test (retry-bypass) mirrors the same structure with prior_attempts=[AttemptSummary(...)].
TDD plan: red / green / refactor
Red: write the failing test first
# tests/property/test_determinism_under_cassette_replay.py
import asyncio
from typing import Final

import pytest

from tests.fixtures.fallback_tier_callable import make_fallback_tier_for_fixtures

# Fields allowed to differ between runs (see the "Non-deterministic-fields
# allowlist" acceptance criterion). Keep this tuple small.
ALLOWED_NONDET_FIELDS: Final[tuple[str, ...]] = (
    "emitted_at",       # ISO-8601 timestamp; clock-derived
    "audit_event_id",   # uuid4 per event
    "budget_token_id",  # uuid4 per precharge
)
ITERATIONS: Final[int] = 50

# NOTE: _strip_nondet, _first_divergence, and _diff_runs are module-level
# helpers that this file must also define.


@pytest.mark.vcr(cassette_library_dir="tests/cassettes/anthropic/test_determinism")
def test_fallback_tier_byte_identical_under_replay(
    advisory_fix, repo_ctx_fix, recipe_selection_fix,
    seeded_rag_store, fastembed_real,
):
    """ADR-04-0002 + arch §Determinism: same four-tuple ⇒ byte-identical outcome.

    Why this matters: a regression here means Phase 6.5 bench replay and Phase 7
    E2E both flake non-deterministically, and the production-behavior exit
    criterion #2 (replay-lands-RAG) is no longer trustworthy.
    """
    results = []
    event_tapes = []
    for _ in range(ITERATIONS):
        # Fresh tier and fresh event loop per iteration, so leaked state shows up.
        tier = make_fallback_tier_for_fixtures(
            store=seeded_rag_store, embedder=fastembed_real,
        )
        event_log = tier.event_log  # captures the events emitted during this run
        result = asyncio.run(tier.run(advisory_fix, repo_ctx_fix, recipe_selection_fix))
        results.append(result.transform.diff_bytes)
        event_tapes.append([_strip_nondet(e) for e in event_log.recorded])

    distinct_diffs = set(results)
    assert len(distinct_diffs) == 1, (
        f"diff_bytes diverged across {ITERATIONS} runs: "
        f"{len(distinct_diffs)} distinct outputs. First divergence at run "
        f"{_first_divergence(results)}."
    )

    kind_tapes = [
        tuple((e["kind"], e.get("payload_digest_blake3")) for e in tape)
        for tape in event_tapes
    ]
    # Compare every run against run 0 so the failure can name the run number.
    first_bad = next((i for i, t in enumerate(kind_tapes) if t != kind_tapes[0]), None)
    assert first_bad is None, (
        f"event tape diverged at run {first_bad}:\n"
        f"{_diff_runs(event_tapes[0], event_tapes[first_bad])}"
    )


@pytest.mark.vcr(cassette_library_dir="tests/cassettes/anthropic/test_determinism_retry")
def test_fallback_tier_byte_identical_under_replay_retry_path(
    advisory_fix, repo_ctx_fix, recipe_selection_fix, attempt_summary_fix,
    seeded_rag_store, fastembed_real,
):
    """S6-02 retry-bypass path must be just as deterministic as the happy path."""
    results = []
    for _ in range(ITERATIONS):
        tier = make_fallback_tier_for_fixtures(
            store=seeded_rag_store, embedder=fastembed_real,
        )
        r = asyncio.run(tier.run(
            advisory_fix, repo_ctx_fix, recipe_selection_fix,
            prior_attempts=[attempt_summary_fix],
        ))
        results.append(r.transform.diff_bytes)
    assert len(set(results)) == 1, (
        f"retry-path diff_bytes diverged across {ITERATIONS} runs; "
        f"first divergence at run {_first_divergence(results)}."
    )
Green: make it pass
- The first time this test runs, it will likely surface a real non-determinism, possibly:
  - A dict iteration order leaking into a prompt body; fix by sorting before serializing.
  - A set iteration leaking into a fence-segment ordering; fix by sorted(...).
  - A time.time() call leaking into the prompt; replace with a deterministic field or move to event metadata.
- Do not weaken the test to make it pass. Fail loud (Global Rule 12); fix the source of non-determinism in FallbackTier, PromptBuilder, or SolvedExampleRetriever.
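The "sort before serializing" fixes listed above can be illustrated with a small canonical-serialization sketch. The helper name canonical_json_bytes is hypothetical, and JSON is assumed purely for illustration; the actual prompt and digest formats in PromptBuilder may differ.

```python
import json
from typing import Any


def _encode_unordered(obj: Any) -> Any:
    """Turn sets into sorted lists so their iteration order cannot leak."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def canonical_json_bytes(payload: dict[str, Any]) -> bytes:
    """Serialize with sorted keys and fixed separators for stable bytes."""
    text = json.dumps(
        payload,
        sort_keys=True,          # dict insertion order no longer matters
        separators=(",", ":"),   # no whitespace variation
        default=_encode_unordered,
    )
    return text.encode("utf-8")
```

With this, two payloads that hold the same data but were built in a different insertion order serialize to the same bytes, so any digest computed over them is stable as well.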
Refactor: clean up
- The _strip_nondet helper and ALLOWED_NONDET_FIELDS allowlist must stay small and documented. Every entry needs a comment explaining why it's allowed to differ.
- If a real source of non-determinism is found, fix it; do not add it to the allowlist.
Files to touch
| Path | Why |
|---|---|
| tests/property/test_determinism_under_cassette_replay.py | New: the property test. |
| tests/cassettes/anthropic/test_determinism/ | New: cassette directory; record via make refresh-cassettes. |
| tests/cassettes/anthropic/test_determinism_retry/ | New: second cassette for retry path. |
| src/codegenie/fallback/tier.py (only if real non-determinism is uncovered) | Fix the source; do not weaken the test. |
| src/codegenie/rag/retriever.py (only if real non-determinism is uncovered) | Same: sort iteration orders, etc. |
Out of scope
- The full E2E roadmap-exit tests — S7-06, S7-07 read this property as a precondition.
- Phase 6.5 bench replay determinism — that's a higher-order assertion that depends on this property holding.
- Cross-architecture float-drift handling — already mitigated by the two-threshold band (ADR-04-0008); not re-tested here.
- The make refresh-cassettes workflow itself: S3-06.
Notes for the implementer
- Fail loud is the contract. If the property fails the first time you run it, that's the test working: there's real non-determinism somewhere. Find it. Do not add fields to ALLOWED_NONDET_FIELDS to make the test pass (Global Rule 12).
- 50 iterations is the goal, ≥ 20 the floor. If 50 iterations exceeds 60 s wall clock, drop to a count that fits the suite's perf envelope and surface per Global Rule 12. Don't fake the iteration count.
- The retry-path test (S6-02 branch) is just as important as the happy path. A regression that only affects the retry path would be invisible to E2E #1 (S7-06), which doesn't exercise retry. Don't skip the second test even if it doubles cassette work.
- asyncio.run(...) per iteration is intentional: it surfaces leaked module-level state. If a global cache or registry mutates across loops, this catches it.
- Cassette recording (make refresh-cassettes from S3-06) is operator-touch: coordinate with the cassette-steward (CODEOWNERS entry from S3-06) before re-recording. The cassettes for the determinism test must NOT be re-recorded casually; they're the immutable substrate for the property.
- Diagnostic quality matters. assert a == b on big dicts truncates output, and a flake becomes un-debuggable. Use pprint.pformat + a _diff_runs helper. Future-you will thank present-you.
- This property test makes the Phase-6.5 bench replay and Phase-7 E2E tests trustworthy. A weak version of this test invalidates the downstream claims.
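The note on asyncio.run(...) per iteration can be made concrete with a toy example. Everything here is hypothetical (leaky_probe, clean_probe, _SEEN, run_n); it only illustrates how module-level state that survives across event loops shows up as diverging outputs, which is what the property test is meant to catch.

```python
import asyncio
from typing import Awaitable, Callable

_SEEN: dict[str, int] = {}  # hypothetical module-level state that outlives each loop


async def leaky_probe(key: str) -> str:
    """Output depends on how many times the probe ran before: non-deterministic."""
    _SEEN[key] = _SEEN.get(key, 0) + 1
    return f"{key}:{_SEEN[key]}"


async def clean_probe(key: str) -> str:
    """Same logic with state scoped to the call: identical output every run."""
    seen: dict[str, int] = {}
    seen[key] = seen.get(key, 0) + 1
    return f"{key}:{seen[key]}"


def run_n(probe: Callable[[str], Awaitable[str]], n: int) -> list[str]:
    """Run the probe n times, each inside a fresh event loop via asyncio.run."""
    return [asyncio.run(probe("advisory")) for _ in range(n)]


if __name__ == "__main__":
    print(len(set(run_n(clean_probe, 3))))  # one distinct output
    print(len(set(run_n(leaky_probe, 3))))  # three distinct outputs
```

Because each asyncio.run call creates and closes its own loop, any state that still differs between iterations must live outside the loop, which narrows the search to module-level or process-level objects.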