Story S4-03: codegenie eval verify subcommand for chain integrity
Step: Step 4 — Wire the CLI and the read-only promotion gate
Status: Ready
Effort: S
Depends on: S4-01 (CLI scaffold + exit codes), S2-04 (audit chain extension + VerifyResult)
ADRs honored: ADR-0002 (lower_bound_95 is gate signal; verify surfaces partial-record breakdown so partials cannot be miscounted as evidence), ADR-0010 (isolation_class annotated on every record — verify surfaces it for operator inspection), Phase 0 ADR-0014 (BLAKE3 chain primitive reuse), Gap #4 (complete: bool on BenchRunReport)
Context
codegenie eval verify walks the audit chain at .codegenie/eval/runs/ (and any --out override), recomputes BLAKE3 link hashes, and reports a clean / tampered verdict. The audit chain is the load-bearing evidence trail for promotion (S4-04 reads it); a silently-tampered chain corrupts every downstream verdict. Operators run verify as a CI gate (nightly) and as a forensics tool after suspected drift. Per Gap #4 / ADR-0004 §Consequences, partial reports (complete=False) are real history — verify must walk them, but the result must distinguish "verified-complete N" from "verified-incomplete M" so operators see the breakdown and S4-04 knows how many records qualify as promotion evidence.
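To make "recomputes link hashes" concrete, here is a minimal, self-contained sketch of a hash-chain check. It is not S2-04's implementation: the record layout (`prev_hash`, `payload`, `hash`), the helper names, and the use of stdlib `hashlib.blake2b` as a stand-in for BLAKE3 (which is not in the standard library) are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # genesis prev_hash value, as referenced under Out of scope


def link_hash(prev_hash: str, payload: dict) -> str:
    """Hash a record's payload together with its predecessor's link hash."""
    body = json.dumps(payload, sort_keys=True).encode()
    # blake2b with a 32-byte digest stands in for BLAKE3 here (assumption).
    return hashlib.blake2b(prev_hash.encode() + body, digest_size=32).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Append a record linked to the current chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload, "hash": link_hash(prev, payload)})


def first_divergence(chain: list[dict]) -> int | None:
    """Return the index of the first record whose link no longer recomputes, else None."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or link_hash(prev, rec["payload"]) != rec["hash"]:
            return i
        prev = rec["hash"]
    return None


chain: list[dict] = []
append(chain, {"run_id": "r1", "complete": True})
append(chain, {"run_id": "r2", "complete": False})
clean = first_divergence(chain)          # chain is intact
chain[0]["payload"]["complete"] = False  # simulate tampering with an earlier record
tampered = first_divergence(chain)       # the first record no longer matches its hash
```

Note that the partial record (`complete=False`) is walked like any other; completeness affects the counts, not the integrity verdict.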
This story is a thin CLI veneer over S2-04's audit.verify(out_dir, since) -> VerifyResult. The exit-code mapping is the load-bearing contract: 0 on clean, 5 on tamper. The --strict flag does not change that mapping. When verified_incomplete is non-zero it adds a stderr diagnostic naming the incomplete records, and the exit code stays 0 for a clean chain. Operators whose CI matrix intentionally excludes partials get a one-flag signal that partials are present.
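The contract can be sketched in a few lines. This is an illustration under assumptions, not the shipped code: `VerifyResult` below is a stand-in dataclass (S2-04 owns the real shape), `incomplete_run_ids` is a hypothetical field, and `exit_code_for` is a hypothetical helper name. The constant names and values follow the exit semantics stated in this story.

```python
from __future__ import annotations

import sys
from dataclasses import dataclass

EXIT_SUCCESS = 0
EXIT_CHAIN_TAMPER = 5


@dataclass(frozen=True)
class VerifyResult:  # stand-in; S2-04 owns the real shape
    ok: bool
    verified_complete: int = 0
    verified_incomplete: int = 0
    incomplete_run_ids: tuple[str, ...] = ()  # hypothetical field


def exit_code_for(result: VerifyResult, strict: bool = False) -> int:
    """Map a verify result to the process exit code; --strict never changes it."""
    if strict and result.verified_incomplete > 0:
        ids = ", ".join(result.incomplete_run_ids) or "<unknown>"
        # --strict only raises the stderr volume.
        print(f"warning: {result.verified_incomplete} incomplete record(s): {ids}", file=sys.stderr)
    return EXIT_SUCCESS if result.ok else EXIT_CHAIN_TAMPER
```

Keeping the mapping in one pure function makes the "partials are not tamper" rule easy to unit-test without touching the filesystem.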
References: where to look
- Architecture:
  - `../phase-arch-design.md §Component design → src/codegenie/eval/cli.py`: names `verify [--since=<iso>] [--strict]`.
  - `../phase-arch-design.md §Component design → src/codegenie/eval/audit.py`: `verify(out_dir: Path, since: str | None = None) -> VerifyResult` is the callable.
  - `../phase-arch-design.md §Failure modes #5`: chain-tamper at startup exits code 5 before any SUT invocation; `verify` is the dedicated tool for the same check standalone.
  - `../phase-arch-design.md §Gap analysis Gap 4`: `audit.verify(...)` distinguishes "verified-complete N records" from "verified-incomplete M records" via `VerifyResult` fields; the CLI surfaces both.
- Phase ADRs:
  - `../ADRs/0002-promotion-gate-keys-on-lower-bound-95.md` §Consequences: partial reports cannot be evidence; `verify`'s incomplete-count surface is how operators see that gap.
  - `../ADRs/0010-isolation-class-annotation-on-bench-run-report.md` §Consequences: `verify --strict` may extend in a follow-up to refuse mixed isolation-class windows; this story does not implement that, but the JSONL output must surface the field per record so the future check is mechanical.
- Production ADRs:
  - `../../../production/adrs/0009-humans-always-merge.md`: `verify` is read-only by construction; no flag mutates the chain.
- Source design:
  - `../High-level-impl.md §Step 4`: names the flag list (`--since`, `--strict`) and the exit semantics (0 clean / 5 tamper).
- Phase 0 precedent:
  - `../../00-bullet-tracer-foundations/ADRs/0014-blake3-audit-chain.md`: the BLAKE3 chain primitives this story walks.
Goal
Implement codegenie eval verify [--since=<iso>] [--strict] [--out=<path>] [--format=human|jsonl] by delegating to audit.verify(...) and mapping its VerifyResult to exit 0 (clean) or 5 (tamper); surface the "verified-complete N / verified-incomplete M" breakdown on stdout.
Acceptance criteria
- [ ] `codegenie eval verify` over a clean chain (one `BenchRunReport` from S4-02's test fixture) exits 0; stdout (default `--format=jsonl`) emits a single aggregate line with fields `{"kind": "verify", "ok": true, "verified_complete": <int>, "verified_incomplete": <int>, "first_record_iso": "...", "last_record_iso": "..."}`.
- [ ] `codegenie eval verify` over a tampered chain (one byte flipped in any prior record) exits 5; stdout emits `{"kind": "verify", "ok": false, ...}` plus a `"tamper_at": "<path>"` field naming the first divergent record.
- [ ] `--since=<utc-iso>` filters the walk to records whose `run_started_iso >= <since>`; passing an ISO that excludes all records yields exit 0 with `verified_complete=0, verified_incomplete=0`.
- [ ] `--strict`: when set, a non-zero `verified_incomplete` count emits a stderr warning naming the incomplete `run_id`s, but the exit code is still 0 (clean chain) or 5 (tamper). The flag does NOT escalate "incomplete records exist" to a tamper; partials are valid history.
- [ ] `--out=<path>`: optional override for the chain directory; default `.codegenie/eval/runs/`.
- [ ] `--format=human` prints a small table with columns `run_id / run_started_iso / complete / isolation_class / chain_head[:8]`, footer `verified_complete=N verified_incomplete=M`.
- [ ] Empty chain (no records under `out_dir`): exits 0 with `verified_complete=0, verified_incomplete=0`; not an error, because first-time runs are clean by definition.
- [ ] Heavy imports remain deferred: the `verify` command body imports `audit` lazily; the cold-start test from S4-01 stays green.
- [ ] The red test from §TDD plan exists, was committed at the red marker, and is now green.
- [ ] `ruff check`, `ruff format --check`, `mypy --strict`, and `pytest tests/integration/test_cli_verify.py` all pass on touched files.
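The `--format=human` criterion can be met with a hand-rolled table. The sketch below is one possible shape, assuming a hypothetical `RecordRow` carrying the per-record metadata (S2-04 owns the real type) and a hypothetical `render_human` helper. For brevity it derives the footer counts from the rows; the real command would take them from `VerifyResult`. The `isolation_class` values are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RecordRow:  # hypothetical per-record metadata; S2-04 owns the real shape
    run_id: str
    run_started_iso: str
    complete: bool
    isolation_class: str
    chain_head: str


HEADERS = ("run_id", "run_started_iso", "complete", "isolation_class", "chain_head[:8]")


def render_human(rows: list[RecordRow]) -> str:
    """Render a left-aligned table plus the two-count footer."""
    cells = [
        (r.run_id, r.run_started_iso, str(r.complete), r.isolation_class, r.chain_head[:8])
        for r in rows
    ]
    # Column width is the widest of the header and every cell in that column.
    widths = [max([len(h)] + [len(c[i]) for c in cells]) for i, h in enumerate(HEADERS)]

    def fmt(row: tuple[str, ...]) -> str:
        return "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()

    lines = [fmt(HEADERS)] + [fmt(c) for c in cells]
    n_complete = sum(r.complete for r in rows)
    lines.append(f"verified_complete={n_complete} verified_incomplete={len(rows) - n_complete}")
    return "\n".join(lines)


rows = [
    RecordRow("r1", "2026-05-12T14:32:08+00:00", False, "class-a", "0123456789abcdef"),
    RecordRow("r2", "2026-05-12T15:00:00+00:00", True, "class-a", "fedcba9876543210"),
]
table = render_human(rows)
```

No line starts with `{`, which is what the human-format test in §TDD plan checks for.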
Implementation outline
- Write red tests in `tests/integration/test_cli_verify.py` (see §TDD plan). The tests need fixtures: (a) a clean chain of two records, (b) a tampered chain (flip a byte after the fact), (c) a chain with one `complete=False` partial record (from cost-cap, per S3-06).
- Fill in the `verify` subcommand stub from S4-01:
  - Click options: `--since` (str, default `None`), `--strict` (flag), `--out` (Path, default `Path(".codegenie/eval/runs")`).
  - Body (deferred imports):
    - `from codegenie.eval.audit import verify as audit_verify`.
    - `result = audit_verify(out_dir=out, since=since)`. It raises `ChainTamperDetected` only in the synchronous-walk variant; the documented API returns `VerifyResult` with `ok=False` and `tamper_at` populated. (Confirm with S2-04's actual signature; if the API raises, catch and convert.)
    - Emit JSONL or human format per `ctx.obj["format"]`.
    - If `--strict` and `result.verified_incomplete > 0`: write a stderr warning naming the incomplete `run_id`s; exit code unaffected.
    - `sys.exit(EXIT_SUCCESS if result.ok else EXIT_CHAIN_TAMPER)`.
- The JSONL aggregate fields come from `VerifyResult`. If S2-04's wire shape differs from §Acceptance criteria, coordinate (S2-04 is the source of truth for the shape; this story consumes it). The contract this story owns is the CLI surface: `{"kind": "verify"}` envelope + the two counts + the boolean.
- Run `ruff format`, `ruff check`, `mypy --strict`, `pytest`.
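The "if the API raises, catch and convert" step, together with the deferred import, might look like the sketch below. The `ChainTamperDetected` and `VerifyResult` classes here are local stand-ins (in the real command both would come from `codegenie.eval.audit` via the deferred import), the `tamper_at` attribute on the exception is an assumption, and `run_audit_verify` with its injectable `verify_fn` parameter is a hypothetical helper added so the sketch can run without the package installed.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional


class ChainTamperDetected(Exception):  # stand-in for S2-04's exception
    def __init__(self, tamper_at: Path) -> None:
        super().__init__(f"chain tamper detected at {tamper_at}")
        self.tamper_at = tamper_at  # hypothetical attribute


@dataclass(frozen=True)
class VerifyResult:  # stand-in; S2-04 owns the real shape
    ok: bool
    verified_complete: int = 0
    verified_incomplete: int = 0
    tamper_at: Optional[Path] = None


VerifyFn = Callable[..., VerifyResult]


def run_audit_verify(
    out: Path, since: Optional[str], verify_fn: Optional[VerifyFn] = None
) -> VerifyResult:
    """Call audit.verify, converting a raised tamper error into a VerifyResult."""
    if verify_fn is None:
        # Deferred import: audit is loaded only when verify actually runs,
        # which is the S4-01 cold-start discipline.
        from codegenie.eval.audit import verify as verify_fn
    try:
        return verify_fn(out_dir=out, since=since)
    except ChainTamperDetected as exc:
        # Normalize the raising variant to the documented return shape.
        return VerifyResult(ok=False, tamper_at=exc.tamper_at)
```

With this normalization in place, the rest of the command body only ever handles a `VerifyResult`, so the emit helpers and the exit-code mapping need no exception handling of their own.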
TDD plan: red / green / refactor
Red: write the failing test first
```python
# tests/integration/test_cli_verify.py
import json
from pathlib import Path

from click.testing import CliRunner

from codegenie.eval.cli import eval as eval_group


def test_verify_empty_chain_exits_zero(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)
    (tmp_path / ".codegenie" / "eval" / "runs").mkdir(parents=True)
    runner = CliRunner()
    result = runner.invoke(eval_group, ["verify"], catch_exceptions=False)
    assert result.exit_code == 0
    payload = next(json.loads(ln) for ln in result.output.splitlines() if ln.startswith("{"))
    assert payload["kind"] == "verify"
    assert payload["ok"] is True
    assert payload["verified_complete"] == 0
    assert payload["verified_incomplete"] == 0


def test_verify_clean_two_record_chain_exits_zero(clean_two_record_chain, monkeypatch):
    monkeypatch.chdir(clean_two_record_chain.parent)
    runner = CliRunner()
    result = runner.invoke(eval_group, ["verify"], catch_exceptions=False)
    assert result.exit_code == 0
    payload = next(json.loads(ln) for ln in result.output.splitlines() if ln.startswith("{"))
    assert payload["ok"] is True
    assert payload["verified_complete"] == 2
    assert payload["verified_incomplete"] == 0


def test_verify_tampered_chain_exits_five(tampered_chain, monkeypatch):
    """One byte flipped in the first record after the second was chained."""
    monkeypatch.chdir(tampered_chain.parent)
    runner = CliRunner()
    result = runner.invoke(eval_group, ["verify"], catch_exceptions=False)
    assert result.exit_code == 5  # EXIT_CHAIN_TAMPER
    payload = next(json.loads(ln) for ln in result.output.splitlines() if ln.startswith("{"))
    assert payload["ok"] is False
    assert "tamper_at" in payload


def test_verify_distinguishes_complete_from_incomplete(partial_then_complete_chain, monkeypatch):
    """Chain of two records: first complete=False (cost-capped), second complete=True."""
    monkeypatch.chdir(partial_then_complete_chain.parent)
    runner = CliRunner()
    result = runner.invoke(eval_group, ["verify"], catch_exceptions=False)
    assert result.exit_code == 0  # chain integrity intact
    payload = next(json.loads(ln) for ln in result.output.splitlines() if ln.startswith("{"))
    assert payload["ok"] is True
    assert payload["verified_complete"] == 1
    assert payload["verified_incomplete"] == 1


def test_verify_strict_warns_on_incomplete_but_still_exits_zero(partial_then_complete_chain, monkeypatch):
    monkeypatch.chdir(partial_then_complete_chain.parent)
    # mix_stderr is a CliRunner constructor argument (Click < 8.2), not an invoke() kwarg.
    # Click 8.2+ removed it and always captures stderr separately; drop the argument there.
    runner = CliRunner(mix_stderr=False)
    result = runner.invoke(eval_group, ["verify", "--strict"], catch_exceptions=False)
    assert result.exit_code == 0
    # Partials surfaced loudly on stderr per --strict
    assert "incomplete" in (result.stderr or "").lower()


def test_verify_since_filters_records(clean_two_record_chain, monkeypatch):
    monkeypatch.chdir(clean_two_record_chain.parent)
    runner = CliRunner()
    # Pick an ISO after both records; it should match nothing
    result = runner.invoke(
        eval_group, ["verify", "--since=2099-01-01T00:00:00Z"], catch_exceptions=False
    )
    assert result.exit_code == 0
    payload = next(json.loads(ln) for ln in result.output.splitlines() if ln.startswith("{"))
    assert payload["verified_complete"] == 0
    assert payload["verified_incomplete"] == 0


def test_verify_human_format_table(clean_two_record_chain, monkeypatch):
    monkeypatch.chdir(clean_two_record_chain.parent)
    runner = CliRunner()
    result = runner.invoke(
        eval_group, ["--format=human", "verify"], catch_exceptions=False
    )
    assert result.exit_code == 0
    # Human format: no JSONL on stdout
    assert not any(ln.startswith("{") for ln in result.output.splitlines())
    # Surfaces the two counts
    assert "verified_complete" in result.output
    assert "verified_incomplete" in result.output
```
Fixtures (clean_two_record_chain, tampered_chain, partial_then_complete_chain) belong in tests/integration/conftest.py — they construct on-disk audit chains using S2-04's write_run_record and then (for the tampered case) flip a byte directly on disk.
Run; confirm failures. Commit as the red marker.
Green: make it pass
Implement the verify command body per §Implementation outline. The JSONL envelope is {"kind": "verify", "ok": bool, "verified_complete": int, "verified_incomplete": int, "tamper_at": str | None, "first_record_iso": str | None, "last_record_iso": str | None}. Human format: small tabulate or hand-rolled table.
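As a rough illustration of the envelope, the helper below serializes a result into the single aggregate line. `VerifyResult` is again a stand-in dataclass whose field names mirror the envelope above (S2-04 owns the real shape), and `verify_envelope` is a hypothetical name for what `_emit_verify_jsonl` might build before echoing.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class VerifyResult:  # stand-in; S2-04 owns the real shape
    ok: bool
    verified_complete: int
    verified_incomplete: int
    tamper_at: Optional[Path] = None
    first_record_iso: Optional[str] = None
    last_record_iso: Optional[str] = None


def verify_envelope(result: VerifyResult) -> str:
    """Serialize a result as the single aggregate JSONL line."""
    payload = {
        "kind": "verify",
        "ok": result.ok,
        "verified_complete": result.verified_complete,
        "verified_incomplete": result.verified_incomplete,
        # Path is not JSON-serializable, so emit it as a string (or null).
        "tamper_at": str(result.tamper_at) if result.tamper_at is not None else None,
        "first_record_iso": result.first_record_iso,
        "last_record_iso": result.last_record_iso,
    }
    return json.dumps(payload)  # json.dumps without indent yields one line


line = verify_envelope(VerifyResult(ok=True, verified_complete=0, verified_incomplete=0))
```

Emitting `tamper_at` as `null` on a clean chain keeps the key set stable, so downstream consumers can parse every line with the same schema.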
Refactor: clean up
- Extract `_emit_verify_jsonl` and `_emit_verify_human` into private helpers in `cli.py`.
- Type hints on every helper; `mypy --strict` clean.
- The stderr warning in `--strict` mode lists incomplete `run_id`s with their `run_started_iso` for operator triage. Use `click.echo(..., err=True)`.
- Confirm the cold-start budget from S4-01 still passes after this story (re-run that test in CI).
- Log structured events at structlog `info` level: `verify_started`, `verify_completed` with `ok`, the two counts, `tamper_at` (if any). These feed the Phase 13 dashboard backfill mentioned in `phase-arch-design.md §Trace export deferred`.
- `--since` parsing: stdlib `datetime.fromisoformat` accepts `2026-05-12T14:32:08+00:00` and the `Z`-suffix form on Python 3.11+. If S2-04 uses a different ISO normalization, mirror it; do not invent a third.
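For the `--since` bullet, a minimal parsing sketch follows. The `parse_since` name is hypothetical, and rejecting timestamps without a UTC offset is an assumption of this sketch, not a stated requirement; defer to S2-04's normalization if it differs.

```python
from datetime import datetime, timezone


def parse_since(value: str) -> datetime:
    """Parse an ISO-8601 timestamp into an aware UTC datetime."""
    # Python 3.11+ accepts a trailing "Z" natively; normalizing it first
    # keeps the helper working on older interpreters as well.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are ambiguous, so refuse them.
        raise ValueError(f"--since must carry a UTC offset, got {value!r}")
    return parsed.astimezone(timezone.utc)
```

Comparing parsed datetimes, and not raw strings, matters for the `run_started_iso >= <since>` filter: two strings with different offsets can denote the same instant yet compare unequal lexicographically.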
Files to touch
| Path | Why |
|---|---|
| `src/codegenie/eval/cli.py` | Fill in the `verify` subcommand body; add verify-emit helpers. |
| `tests/integration/test_cli_verify.py` | New file: clean chain, tampered chain, partial chain, `--since` filter, `--strict` warning, human format. |
| `tests/integration/conftest.py` | Add `clean_two_record_chain`, `tampered_chain`, `partial_then_complete_chain` fixtures (construct with S2-04's writer; tamper by direct byte-flip on disk). |
Out of scope
- `audit.verify` internals: S2-04 owns the `VerifyResult` shape and the BLAKE3 walk. This story consumes the contract.
- `isolation_class` mixed-window refusal: ADR-0010 §Open Q reserves a `--allow-isolation-mix` flag for a future refusal-on-mix path; this story emits `isolation_class` per record in human format but does not refuse mixed windows. That refusal lives in `PromotionGate.evaluate` (S4-04) at the evidence-window scope, not in `verify` at the chain scope.
- `promote-verdict` subcommand: S4-04/S4-05.
- `run` subcommand: S4-02.
- Tamper diagnostics beyond `tamper_at`: full forensic traces (expected vs computed BLAKE3, byte offsets) are S7-02 (end-to-end audit integration test) territory; the CLI surface here is operator-facing, not forensics-facing.
- Genesis-record handling: S2-04 owns the `prev_hash == "0"*64` semantics; this story walks whatever the chain contains.
Notes for the implementer
- `VerifyResult` shape is owned by S2-04. Read S2-04's story or implementation before writing this one. The fields you depend on are `ok: bool`, `verified_complete: int`, `verified_incomplete: int`, `tamper_at: Path | None`, and per-record metadata for the human table. If S2-04 ships a different field name (e.g., `complete_count` instead of `verified_complete`), the integration test fixtures and the CLI emit code must match; do not silently rename. Flag and reconcile.
- `--strict` is gentler than it sounds. It does NOT change exit codes. It only escalates the stderr volume. The rationale: partial records are valid history that must remain in the chain; promoting "partials exist" to a chain-integrity failure would conflate two orthogonal concerns. Operators who want a strict no-partials gate compose it themselves, for example by piping the `--format=human` output through `grep -q "verified_incomplete=0"` (the default JSONL output spells the field as `"verified_incomplete": 0`, so that pattern only matches the human footer).
- Tamper fixture construction: the cleanest way to build a tampered chain is (a) write two records via S2-04, (b) open the first JSON file in binary mode and flip one byte in the middle of the JSON payload (avoid corrupting the JSON structure; flip inside a digest hex value), (c) re-close. The chain's link hash on the second record was computed over the original first; after the flip, the recomputed link diverges. `audit.verify` catches this at the first record's link-recomputation.
- `tamper_at` is a path, not a `run_id`. Operators investigating tamper need a filesystem coordinate to grep / diff; `run_id` alone (which is content-addressed) is harder to map back to disk. Emit both if cheap.
- `first_record_iso` / `last_record_iso` are convenience metadata for `--format=human`'s footer; they cost nothing to compute and save operators an `ls -la` call.
- Empty-chain semantics are deliberate. A fresh repo with no runs yet is clean by definition (`ok=True`, counts=(0,0)). Do not raise; do not warn. The first run of `verify` on a fresh repo must exit 0 silently; this is the nightly-CI contract.
- Cold-start budget audit: after wiring `verify`, re-run S4-01's cold-start test. Importing `audit` inside the function body is the discipline; if you pull `audit.verify` to module top, the test fails.
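The byte-flip step from the tamper-fixture note can be sketched as a small helper. Everything named here is hypothetical: `flip_hex_digit`, the `digest` field, and the assumption that records contain a 64-character lowercase hex digest. A real fixture would first write records with S2-04's `write_run_record` and then apply the flip to the first file.

```python
import json
import re
import tempfile
from pathlib import Path

HEX_RUN = re.compile(rb"[0-9a-f]{64}")  # assumes digests are 64 lowercase hex chars


def flip_hex_digit(path: Path) -> int:
    """Change one hex digit inside the first 64-char digest; return its byte offset."""
    data = bytearray(path.read_bytes())
    match = HEX_RUN.search(data)
    if match is None:
        raise ValueError(f"no 64-char hex digest found in {path}")
    offset = match.start() + 32  # middle of the digest
    # Swap in a different hex digit so the file remains valid JSON.
    data[offset] = ord("0") if data[offset] != ord("0") else ord("1")
    path.write_bytes(bytes(data))
    return offset


with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "record.json"
    original = {"run_id": "r1", "digest": "ab" * 32}
    record.write_text(json.dumps(original))
    flip_hex_digit(record)
    tampered = json.loads(record.read_text())  # still parses as JSON
```

Flipping inside a hex value keeps the failure mode specific: the test then exercises link-hash divergence and not a JSON parse error.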