Story S5-06 — Cache hit-rate + invalidation integration tests¶
Step: Step 5 — Backfill bench/vuln-remediation/ with ≥10 cases + rubric + taxonomies
Status: Ready
Effort: S
Depends on: S5-05 (10 signed cases + E2E green run is the substrate these tests rerun against)
ADRs honored: ADR-0001 (rubric subprocess invocation contributes to rubric_digest — the cache key feels every byte), ADR-0002 (cache hit means the report's lower_bound_95 is byte-identical across reruns — determinism is load-bearing for promotion evidence)
Context¶
The harness's cache key per case is BLAKE3(case_digest || sut_digest || rubric_digest || cassette_corpus_digest || harness_version || cassette_canary_pin) (S2-03). If any input changes, the key changes. If none changes, every case is a cache hit, every cost_usd is 0.0, and a warm rerun is bounded by the cache I/O rather than by SUT invocation. The harness's cost discipline (CI budget, contributor dev-loop latency) hinges on this being true.
Two failure modes the integration test must catch: - False misses — a warm rerun re-invokes the SUT. Cache key derivation is wrong; cost regression silent. - False hits — a case is edited but the cache serves the stale score. Cache key is missing an input. Promotion evidence becomes lies.
ADR-0002 + ADR-0001 don't say "cache must be correct" explicitly, but the promotion gate's lower_bound_95 is only meaningful if the score producing it is reproducible and invalidated when any input shifts. The 10-case vuln-remediation bench is the first concrete corpus to test this on.
References — where to look¶
- Architecture:
../phase-arch-design.md §Component design → src/codegenie/eval/cache.py—get/put/gc; corrupt-file-on-read treated as miss;fcntl.flockon sentinel.../phase-arch-design.md §Testing strategy → Integration → test_cache_hit_rate.py / test_cache_invalidation.py— names this story's tests; specifies "second run ≤ 8 s" and "all 10cost_usd == 0.0"; whitespace edits torubric.pyinvalidate all; whitespace edits to onecase.tomlinvalidate only that case.../phase-arch-design.md §Property tests → Cache-key determinism— the Hypothesis-property substrate this integration test concretizes against the 10-case corpus.- Phase ADRs:
../ADRs/0001-rubric-execution-isolation-via-subprocess.md §Consequences— the rubric file is bytewise part ofrubric_digest; any edit (whitespace included) bumps it.../ADRs/0002-promotion-gate-keys-on-lower-bound-95.md §Consequences— deterministiclower_bound_95requires deterministic cache; this test enforces both.- Source design:
../High-level-impl.md §Step 5"Done criteria" — "Re-running the same task class with no source changes is a 100% cache hit (10/10 casescost_usd == 0.0, wall-clock ≤ 8 s)"; "whitespace edit torubric.pyinvalidates all 10 cache entries; whitespace edit to onecase.tomlinvalidates exactly that case".
Goal¶
Land two integration tests that exercise the full cache contract on the 10-case vuln-remediation bench: test_cache_hit_rate.py asserts 10/10 cost_usd == 0.0 and ≤ 8 s wall-clock on a warm rerun; test_cache_invalidation.py asserts whitespace-edit to rubric.py invalidates all 10 entries while a whitespace-edit to one case.toml invalidates exactly that case.
Acceptance criteria¶
- [ ]
tests/integration/test_cache_hit_rate.pyexists; running it: (a) executes a cold-cachecodegenie eval run --task-class=vuln-remediationagainst the stub deterministic SUT, (b) immediately re-executes the same command, (c) asserts the second run'sBenchRunReport.per_casehas all 10cost_usd == 0.0, (d) asserts the second run's wall-clock is ≤ 8 s. - [ ]
tests/integration/test_cache_invalidation.pyexists with two scenarios: - All-invalidation scenario: warm the cache; append
"\n"tobench/vuln-remediation/rubric.py; rerun; assert all 10 cases re-invoke the SUT (cost_usd > 0.0for all 10, or — if the stub SUT itself reportscost_usd=0.0— a different cache-miss signal: the per-casewall_clock_msexceeds a "definitely re-invoked" floor; document the chosen invariant). - Single-invalidation scenario: warm the cache; append
"\n"to one case'scase.toml(e.g.,bench/vuln-remediation/cases/001-*/case.toml); rerun; assert exactly 1 case re-invoked the SUT (or its digest mismatch produces aBenchCaseDigestMismatcherror — see §Notes for the implementer); the other 9 cache-hit cleanly. - [ ] Both tests restore the edited file in a
try/finallyso a failure mid-test does not leave the working tree dirty. - [ ] The "cache hit" detection is not based on time alone — it asserts the per-case
cost_usdsignal and an explicitcache_hit: bool(or equivalent) field on the per-case JSONL output if S3-02 exposes one. If S3-02 doesn't expose acache_hitfield, add a feature:BenchRunReport.per_case[*]carries aBenchScore.cache_hit: bool = False(default; runner sets True on serve-from-cache). - [ ] The warm rerun wall-clock budget is asserted as ≤ 8 s; if it regresses past 12 s, the test fails with a diagnostic naming the cache-hit rate.
- [ ] Both tests run in CI within the standard test suite; they are not nightly-canary-only.
- [ ] Red versions of both tests exist, were committed at red, are now green;
ruff check,ruff format --check,pytest tests/integration/test_cache_hit_rate.py tests/integration/test_cache_invalidation.py -vpass.
Implementation outline¶
- Write
tests/integration/test_cache_hit_rate.pyfirst (red) — see §TDD plan. - Write
tests/integration/test_cache_invalidation.pynext (red). - Determine the cache-hit signal:
- Preferred:
BenchScore.cache_hit: bool(defaultFalse; runner setsTruewhen serving from cache). If S1-02 didn't include it, add it now underextra="forbid"semantics. Document onBenchScore. - Fallback: assert
cost_usd == 0.0ANDwall_clock_ms < <threshold>(the stub SUT's hand-coded cost is 0.0 always, socost_usdalone is insufficient; thewall_clock_msthreshold proxy works for the stub SUT but is fragile). - Pick the preferred path; extend
BenchScoreif needed. - Iterate to green for both tests.
- Documentation: a short section in
bench/vuln-remediation/README.md("Cache behavior") names the two integration tests and the cache-key composition.
TDD plan — red / green / refactor¶
Red — write the failing test first¶
Test file path: tests/integration/test_cache_hit_rate.py
# tests/integration/test_cache_hit_rate.py
"""Cache contract on the 10-case vuln-remediation bench.
S2-03 + High-level-impl.md §Step 5 "Done criteria"."""
import json
import os
import subprocess
import sys
import time
from pathlib import Path
import pytest
REPO_ROOT = Path(__file__).parents[2]
@pytest.fixture()
def fresh_eval_env(tmp_path, monkeypatch):
monkeypatch.setenv("CODEGENIE_EVAL_CACHE_DIR", str(tmp_path / "cache"))
monkeypatch.setenv("CODEGENIE_EVAL_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("CODEGENIE_EVAL_SUT", "tests.fixtures.sut.deterministic_vuln_sut")
return tmp_path
def _run():
start = time.monotonic()
result = subprocess.run(
[sys.executable, "-m", "codegenie", "eval", "run",
"--task-class=vuln-remediation", "--format=jsonl"],
capture_output=True, text=True, cwd=REPO_ROOT, check=True,
)
return result, time.monotonic() - start
def _per_case_lines(result):
return [json.loads(l) for l in result.stdout.splitlines()
if l.strip() and json.loads(l).get("kind") == "case"]
def test_warm_rerun_is_100_percent_cache_hit(fresh_eval_env):
_ = _run() # cold
result, elapsed = _run() # warm
cases = _per_case_lines(result)
assert len(cases) == 10
# ADR-0002 + S2-03: warm rerun = byte-identical scores at zero cost.
assert all(c["score"]["cost_usd"] == 0.0 for c in cases), (
f"expected all cost_usd=0.0; got {[c['score']['cost_usd'] for c in cases]}"
)
assert all(c["score"].get("cache_hit") is True for c in cases), (
"expected cache_hit=True on every per_case entry"
)
assert elapsed <= 8.0, f"warm rerun took {elapsed:.2f}s; budget 8s"
Test file path: tests/integration/test_cache_invalidation.py
# tests/integration/test_cache_invalidation.py
"""Cache invalidation contract: rubric edit invalidates everything;
single-case edit invalidates only that case. ADR-0002 §Consequences."""
import json
import subprocess
import sys
from pathlib import Path
import pytest
REPO_ROOT = Path(__file__).parents[2]
BENCH = REPO_ROOT / "bench" / "vuln-remediation"
@pytest.fixture()
def fresh_eval_env(tmp_path, monkeypatch):
monkeypatch.setenv("CODEGENIE_EVAL_CACHE_DIR", str(tmp_path / "cache"))
monkeypatch.setenv("CODEGENIE_EVAL_RUNS_DIR", str(tmp_path / "runs"))
monkeypatch.setenv("CODEGENIE_EVAL_SUT", "tests.fixtures.sut.deterministic_vuln_sut")
return tmp_path
def _run():
return subprocess.run(
[sys.executable, "-m", "codegenie", "eval", "run",
"--task-class=vuln-remediation", "--format=jsonl"],
capture_output=True, text=True, cwd=REPO_ROOT, check=False,
)
def _per_case(result):
return [json.loads(l) for l in result.stdout.splitlines()
if l.strip() and json.loads(l).get("kind") == "case"]
def test_whitespace_edit_to_rubric_invalidates_all_ten(fresh_eval_env):
_run() # cold; warm cache
rubric = BENCH / "rubric.py"
original = rubric.read_bytes()
try:
rubric.write_bytes(original + b"\n# cache-bust\n")
result = _run()
cases = _per_case(result)
assert len(cases) == 10
cache_hits = sum(1 for c in cases if c["score"].get("cache_hit") is True)
assert cache_hits == 0, (
f"expected 0 cache hits after rubric edit; got {cache_hits}/10"
)
finally:
rubric.write_bytes(original)
def test_whitespace_edit_to_one_case_invalidates_only_that_case(fresh_eval_env):
_run() # warm cache
target_dir = next(d for d in (BENCH / "cases").iterdir() if d.is_dir())
target_id = target_dir.name
case_toml = target_dir / "case.toml"
# Edit a file under input/ (changing case.toml itself would also bump digest,
# but it would be caught by the digest-mismatch raise; whitespace edit to
# input/<f> changes case_digest, which propagates into cache_key only for
# this case_id).
input_file = next(p for p in (target_dir / "input").rglob("*") if p.is_file())
original = input_file.read_bytes()
try:
input_file.write_bytes(original + b"\n")
# case_digest is now stale relative to digests.yaml; the loader will
# raise BenchCaseDigestMismatch (exit 6). This is the correct behavior:
# "single-case invalidation" is structurally surfaced as a digest error,
# not a silent cache miss.
result = _run()
assert result.returncode == 6, (
f"expected exit 6 on case digest mismatch; got {result.returncode}; "
f"stderr={result.stderr[-1000:]}"
)
assert target_id in result.stderr, (
f"diagnostic must name the offending case_id {target_id}"
)
finally:
input_file.write_bytes(original)
Run both; confirm failures. Commit as red marker.
Green — smallest impl shape¶
- If
BenchScore.cache_hit: bool = Falsedoesn't exist on the wire type (S1-02), add it (withextra="forbid"and defaultFalse); have the runner (S3-02) setTruewhen serving cached. - Ensure the CLI's JSONL output (S4-02) emits per-case lines with a
"score": {..., "cache_hit": bool}shape; if it's already structured butcache_hitis missing, surface it. - Confirm cache invalidation works as designed: an edit to
bench/vuln-remediation/rubric.pychangesrubric_digest, which is one input to every per-case cache key. The change propagates to all 10 cases. - Confirm single-case invalidation works structurally: a
case.toml(or input/) edit changescase_digest, which is recomputed by the loader againstdigests.yaml. The mismatch raisesBenchCaseDigestMismatch(exit 6), which is the correct behavior — silent cache miss would be the bug. - If S2-02 currently allows "single case digest changes silently" (no digests.yaml check), that is a separate bug; this story's single-case test asserts the structurally-correct behavior (exit 6 with the case_id in stderr) — see §Notes for the implementer.
Refactor — clean up¶
- Document the "single-case invalidation = digest mismatch error" decision in
bench/vuln-remediation/README.md's Cache behavior section. The semantics differ from naive caches: bench cases are signed; you cannot edit one and "let the cache invalidate" — you must re-sign indigests.yaml. The cache is signed-content-addressed. - The
cache_hitfield is part of the wire contract; document onBenchScoreand add a property test (Hypothesis):cache_hit=True implies cost_usd == 0.0. - The 8-s warm-budget is the High-level-impl.md §Step 5 floor. If CI machines are slow, the test will flake — surface as a benchmark, not a hard fail (mark
@pytest.mark.benchmarkif needed; budget stays in the assertion). - Cross-reference S7-02's "audit chain extension integration" test — both reuse the
fresh_eval_envfixture; consider hoisting totests/integration/conftest.py.
Files to touch¶
| Path | Why |
|---|---|
tests/integration/test_cache_hit_rate.py |
New — warm-rerun 100% cache hit + ≤ 8 s budget |
tests/integration/test_cache_invalidation.py |
New — rubric-edit invalidates all; case-edit invalidates one (digest mismatch) |
src/codegenie/eval/models.py (maybe) |
Extend — add BenchScore.cache_hit: bool = False if not extant |
src/codegenie/eval/runner.py (maybe) |
Extend — set cache_hit=True on serve-from-cache |
src/codegenie/eval/cli.py (maybe) |
Verify — JSONL emits the field |
bench/vuln-remediation/README.md |
Extend — "Cache behavior" section documenting the contract |
tests/integration/conftest.py (maybe) |
New — fresh_eval_env fixture if shared with S7-02 |
Out of scope¶
- Hypothesis property tests for cache-key determinism. Listed in
phase-arch-design.md §Property tests; lives intests/unit/test_cache.py(S2-03 / property-test story, separate); this story is the integration-level concretization. - Cache GC behavior. S2-03's
gc(retain_days=90)is unit-tested there; not in scope here. - Cross-host cache sharing. Gap #5 (parallel-eval across hosts) is deferred; the cache is per-host in Phase 6.5.
- Concurrent eval-run race. Edge case #17 is fence-CI / runner-level; the cache uses
fcntl.flockto serialize writers but the integration test does not exercise concurrency here.
Notes for the implementer¶
- The "single-case invalidation" semantics are subtle. Two valid interpretations:
- Edit
input//expected/of one case →case_digestrecomputation by loader → mismatch withdigests.yaml→BenchCaseDigestMismatch(exit 6). This is what the test asserts. The cache never even gets a chance to serve a stale entry because the loader refuses the case. - Re-sign
digests.yamlafter the edit →case_digestmatches → cache key includes the newcase_digest→ cache miss for this case only → 1 SUT re-invocation, 9 cache hits. This is the "single-case invalidation = single cache miss" semantics; assertable ascache_hits=9, misses=1.
The story's primary assertion is (1) because it tests the load-bearing structural defense (signed digests). A secondary test of (2) is welcome but optional — note that it requires running the digest-signing script mid-test.
- If BenchScore.cache_hit is added, surface it everywhere the score is JSON-serialized: stdout JSONL, audit record, snapshot tests. S7-02's golden-file snapshot will need a regen if the field is new.
- The warm rerun in test_warm_rerun_is_100_percent_cache_hit is a subprocess-launch of the CLI. Python interpreter startup itself takes ~150 ms; module imports add another ~300 ms; cache I/O is small but non-zero. The 8 s budget is generous; if you find yourself at 7.5 s, profile cold-imports — S4-01's "deferred heavy imports" discipline applies.
- If the EVAL_CACHE_DIR env contract doesn't exist, add it in S2-03 or here — the test needs a way to point the cache at tmp_path for isolation.
- The "cache-bust" edit must change byte content, not just mtime. The cache is content-addressed (BLAKE3 over inputs); touch rubric.py would not invalidate. The test appends a newline; document this.
- tests/fixtures/sut/deterministic_vuln_sut.py (from S5-05) must produce byte-stable harness_output for each case across runs, or the integration test will appear to pass spuriously (the cache might serve stale entries that happen to match). Verify determinism by running the stub twice in isolation and diff-ing JSON dumps.
- If the test failures reveal that the cache key doesn't include rubric_digest (i.e., a rubric edit doesn't invalidate), that is the bug S2-03 must fix. Surface to S2-03's owner; don't kludge the test.