Story S4-01: CLI scaffold + partitioned exit codes
- Step: Step 4 (Wire the CLI and the read-only promotion gate)
- Status: Ready
- Effort: S
- Depends on: S3-04 (six per-case failure paths), S3-05 (BCa bootstrap), S3-06 (cost-cap path)
- ADRs honored: ADR-0001 (subprocess-isolation failure typing → CLI exit codes), ADR-0004 (failure-mode taxonomy surfaces via exit codes), Phase 5 ADR-0016 (eval-harness-as-trust-evidence), Phase 0 import-linter contract (deferred heavy imports for cold-start)
Context
cli.py is the user-visible boundary of the eval harness. Before any subcommand exists, the surrounding plumbing must land: the Click subcommand group codegenie eval, deferred heavy imports so codegenie eval --help is fast, the --format=human|jsonl option (default jsonl per phase-arch-design.md §Component design → cli.py), and the partitioned exit-code table mapping CodegenieEvalError subclasses to codes 1–6. This story produces the scaffold and exit-code contract; S4-02/S4-03 fill in the run and verify subcommands against it.
The cold-start budget (≤ 600 ms) mirrors Phase 0's codegenie gather and is non-negotiable: Click resolution + --help rendering cannot pay for pydantic.BaseModel recursion, bench.{name}.rubric chain imports, or pyyaml. Heavy imports are deferred inside command bodies. The exit-code table is the load-bearing contract Phase 11 consumers (PR provenance) will branch on — partitioning is structural, not advisory.
References: where to look
- Architecture:
  - `../phase-arch-design.md §Component design → src/codegenie/eval/cli.py`: usage, exit codes 0/1/2/3/4/5/6, `--format=human|jsonl` default `jsonl`, deferred-import discipline.
  - `../phase-arch-design.md §Container view`: `cli.py` is the surface; `runner.py`, `promotion.py`, `audit.py` are deferred imports.
  - `../phase-arch-design.md §Performance budgets`: cold-start ≤ 600 ms (mirrors `codegenie gather`).
  - `../phase-arch-design.md §Failure modes table`: rows 1, 2, 5, 6 map to exit codes 6, 6, 5, 6 respectively; cost-cap (row from §Happy path step 5) maps to 2; task-class-not-registered (`TaskClassNotFound`) to 3; bench-dir-missing to 4.
- Phase ADRs:
  - `../ADRs/0001-rubric-execution-isolation-via-subprocess.md`: rubric per-case failures are `FailureMode` not exceptions; the CLI does not exit 1 on a per-case rubric failure. The run still exits 0 (or 2 on cost-cap); the `BenchRunReport` carries the block-severity codes.
  - `../ADRs/0004-per-task-class-failure-modes-taxonomy.md`: same: per-case failures are data on the report, not CLI exit categories.
- Production ADRs:
  - `../../../production/adrs/0009-humans-always-merge.md`: autonomy ends at the CLI boundary; exit-code semantics are operator-facing.
- Source design:
  - `../High-level-impl.md §Step 4`: names the seven exit codes verbatim and the deferred-import discipline.
- Phase 0 precedent:
  - `../../00-bullet-tracer-foundations/`: CLI scaffold for `codegenie gather`; same Click group pattern, same cold-start budget; mirror it.
Goal
Land src/codegenie/eval/cli.py with the codegenie eval Click subcommand group, deferred heavy imports, the seven-code exit-code partition (0/1/2/3/4/5/6), and a --format=human|jsonl global option (default jsonl) — all measured to start in ≤ 600 ms cold.
Acceptance criteria
- [ ] `src/codegenie/eval/cli.py` defines a Click group `eval` registered against the existing top-level `codegenie` group (Phase 0 entry-point); `codegenie eval --help` lists the three subcommands (`run`, `verify`, `promote-verdict`), which exist as stubs and may raise `NotImplementedError` until their own stories land.
- [ ] The seven exit codes are exported as named constants: `EXIT_SUCCESS=0`, `EXIT_GENERIC_ERROR=1`, `EXIT_COST_CAP=2`, `EXIT_TASK_CLASS_NOT_REGISTERED=3`, `EXIT_BENCH_DIR_MISSING=4`, `EXIT_CHAIN_TAMPER=5`, `EXIT_DIGEST_MISMATCH=6` (constants live in `cli.py` and are referenced by the top-level exception handler).
- [ ] A wrapped `main()` handler maps: `TaskClassNotFound` → 3; `BenchCaseLoadError` with reason `"bench dir missing"` → 4; `BenchCaseDigestMismatch` → 6; `ChainTamperDetected` → 5; cost-cap signal (a sentinel exception `CostCapExceeded` from S3-06 or a `BenchRunReport.complete=False` with `run_id.startswith("partial:")`) → 2; uncaught `CodegenieEvalError` and uncaught `Exception` → 1. An `@eval.result_callback` only receives subcommand return values, so it can serve the report-based cost-cap signal but cannot catch exceptions.
- [ ] `--format=human|jsonl` is a group-level option with default `jsonl`; the option value is propagated to subcommands via Click context.
- [ ] Cold-start performance: `python -c "import time; t=time.perf_counter(); import codegenie.eval.cli; print((time.perf_counter()-t)*1000)"` reports ≤ 600 ms on the CI runner (the test asserts < 660 ms, that is, the 600 ms budget plus 10% slack). `pydantic`, `pyyaml` (imported as `yaml`), and `bench.*` modules MUST NOT appear in `sys.modules` after a pure `cli` import; they are deferred inside subcommand bodies.
- [ ] Negative cold-start guard test: assert that after importing `codegenie.eval.cli`, the strings `pydantic`, `yaml`, and any key beginning with `bench.` are absent from `sys.modules`.
- [ ] The red test from §TDD plan exists, was committed at the red marker, and is now green.
- [ ] `ruff check`, `ruff format --check`, `mypy --strict`, and `pytest tests/unit/test_cli_scaffold.py tests/unit/test_cli_exit_codes.py` all pass on touched files.
Implementation outline
- Write the red tests first in `tests/unit/test_cli_scaffold.py` and `tests/unit/test_cli_exit_codes.py` (see §TDD plan). Confirm they fail with `ModuleNotFoundError` / attribute errors.
- Create `src/codegenie/eval/cli.py`:
  - Import only stdlib + `click` at module top.
  - Declare the seven `EXIT_*` integer constants.
  - Define `@click.group(name="eval")` with `--format=human|jsonl` (default `jsonl`) as a group-level option attached to `click.Context.obj`.
  - Define three subcommand stubs (`run`, `verify`, `promote-verdict`) registered on the group; each body raises `NotImplementedError` (S4-02/S4-03/S4-04 will fill these in). They exist so `codegenie eval --help` lists them and the structural tests pass.
  - Define a `main()` (or a wrapping `run_cli()`) that invokes the group inside a `try/except` mapping each `CodegenieEvalError` subclass and the cost-cap signal to the corresponding `EXIT_*` constant, then `sys.exit(code)`.
  - Defer imports of `codegenie.eval.runner`, `codegenie.eval.promotion`, `codegenie.eval.audit`, `pydantic`, `pyyaml` inside subcommand bodies.
- Wire the Click group into the top-level `codegenie` entry-point (Phase 0 adds it via `pyproject.toml` / `__main__.py`); this story may need a one-line registration edit in the existing top-level CLI module. Keep it surgical (Rule 3).
- Run `ruff format`, `ruff check`, `mypy --strict src/codegenie/eval/cli.py`, `pytest tests/unit/test_cli_scaffold.py tests/unit/test_cli_exit_codes.py`.
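The group and stub steps in the outline could look roughly like the sketch below. It assumes Click 8+. The function is named `eval_group` here to avoid shadowing the builtin, whereas the story's tests import it as `eval`; the `_stub` helper and the `format=` text in the error message are illustrative only.

```python
import click


@click.group(name="eval")
@click.option(
    "--format",
    "fmt",
    type=click.Choice(["human", "jsonl"]),
    default="jsonl",
    show_default=True,
    help="Output format.",
)
@click.pass_context
def eval_group(ctx: click.Context, fmt: str) -> None:
    """Eval harness commands (sketch)."""
    # Share the chosen format with every subcommand through the context object.
    ctx.ensure_object(dict)
    ctx.obj["format"] = fmt


def _stub(ctx: click.Context, story: str) -> None:
    # A real body would import runner / promotion / audit here, at function
    # scope, so that importing this module stays cheap.
    raise NotImplementedError(f"{story} (format={ctx.obj['format']})")


@eval_group.command(name="run")
@click.pass_context
def run_cmd(ctx: click.Context) -> None:
    """Run a bench (stub)."""
    _stub(ctx, "S4-02")


@eval_group.command(name="verify")
@click.pass_context
def verify_cmd(ctx: click.Context) -> None:
    """Verify the audit chain (stub)."""
    _stub(ctx, "S4-03")


@eval_group.command(name="promote-verdict")
@click.pass_context
def promote_verdict_cmd(ctx: click.Context) -> None:
    """Report the promotion verdict (stub)."""
    _stub(ctx, "S4-04")
```

Because the group callback runs before the subcommand context is created, each stub sees the parent's `ctx.obj` without redeclaring `--format`.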
TDD plan: red / green / refactor
Red: write the failing test first
Test file paths: tests/unit/test_cli_scaffold.py and tests/unit/test_cli_exit_codes.py.
```python
# tests/unit/test_cli_scaffold.py
import importlib
import sys
import time

from click.testing import CliRunner

# Heavy modules that must stay unloaded after a bare CLI import.
FORBIDDEN_PREFIXES = ("pydantic", "yaml", "bench.")


def _is_heavy(name: str) -> bool:
    return any(name == p.rstrip(".") or name.startswith(p) for p in FORBIDDEN_PREFIXES)


def test_cli_group_exists_and_lists_three_subcommands():
    from codegenie.eval.cli import eval as eval_group  # noqa: A001

    runner = CliRunner()
    result = runner.invoke(eval_group, ["--help"])
    assert result.exit_code == 0
    for sub in ("run", "verify", "promote-verdict"):
        assert sub in result.output, f"subcommand {sub!r} missing from --help"


def test_format_option_default_is_jsonl():
    from codegenie.eval.cli import eval as eval_group  # noqa: A001

    runner = CliRunner()
    result = runner.invoke(eval_group, ["--help"])
    assert "--format" in result.output
    # "jsonl" also appears in the choice list, so check the declared default itself.
    fmt_option = next(p for p in eval_group.params if "--format" in p.opts)
    assert fmt_option.default == "jsonl"


def test_cold_start_no_heavy_imports():
    # Set aside prior eval imports and every heavy module (submodules included)
    # so the budget is honest; restore them afterwards for the rest of the session.
    saved = {
        k: sys.modules.pop(k)
        for k in list(sys.modules)
        if k.startswith("codegenie.eval") or _is_heavy(k)
    }
    try:
        t0 = time.perf_counter()
        importlib.import_module("codegenie.eval.cli")
        elapsed_ms = (time.perf_counter() - t0) * 1000

        # 600 ms budget, 10% slack; fail loud past 660 ms.
        assert elapsed_ms < 660.0, f"cli import took {elapsed_ms:.1f}ms (> 660ms budget)"

        # Negative guard: heavy modules must NOT be loaded by importing the CLI.
        leaked = sorted(m for m in sys.modules if _is_heavy(m))
        assert leaked == [], f"cli import leaked heavy modules: {leaked}"
    finally:
        sys.modules.update(saved)
```
```python
# tests/unit/test_cli_exit_codes.py
import pytest

from codegenie.eval import cli as cli_module
from codegenie.eval import errors as e

EXPECTED = {
    "EXIT_SUCCESS": 0,
    "EXIT_GENERIC_ERROR": 1,
    "EXIT_COST_CAP": 2,
    "EXIT_TASK_CLASS_NOT_REGISTERED": 3,
    "EXIT_BENCH_DIR_MISSING": 4,
    "EXIT_CHAIN_TAMPER": 5,
    "EXIT_DIGEST_MISMATCH": 6,
}


@pytest.mark.parametrize("name,value", list(EXPECTED.items()))
def test_exit_code_constant_present_and_partitioned(name: str, value: int) -> None:
    assert getattr(cli_module, name) == value


def test_exit_codes_are_disjoint_and_total_seven():
    values = {getattr(cli_module, n) for n in EXPECTED}
    assert values == {0, 1, 2, 3, 4, 5, 6}


@pytest.mark.parametrize(
    "exc,expected_code",
    [
        (e.TaskClassNotFound("foo", ("bar",)), 3),
        (e.BenchCaseDigestMismatch("003-x", "abc", "def"), 6),
        (e.ChainTamperDetected("/tmp/x", "0" * 64, "1" * 64), 5),
    ],
)
def test_exception_maps_to_exit_code(exc: Exception, expected_code: int) -> None:
    code = cli_module._map_exception_to_exit_code(exc)  # internal helper, tested at boundary
    assert code == expected_code


def test_uncaught_exception_maps_to_generic_one() -> None:
    code = cli_module._map_exception_to_exit_code(RuntimeError("anything"))
    assert code == 1
```
Run; confirm ModuleNotFoundError: No module named 'codegenie.eval.cli' (and missing attributes once the module exists). Commit as the red marker.
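The cold-start test above measures inside the pytest process, where `sys.modules` may already hold modules imported by earlier tests. One hedged alternative is to probe the import in a fresh interpreter. The helper names `probe_import` and `leaked_heavy_modules` are hypothetical, and the timing covers the import only, not interpreter startup, which mirrors the `python -c` one-liner in the acceptance criteria.

```python
import json
import subprocess
import sys

# Script run in the child interpreter: time one import, then dump sys.modules.
_PROBE = """
import importlib, json, sys, time
t0 = time.perf_counter()
importlib.import_module(sys.argv[1])
elapsed_ms = (time.perf_counter() - t0) * 1000
print(json.dumps({"elapsed_ms": elapsed_ms, "modules": sorted(sys.modules)}))
"""


def probe_import(module: str) -> tuple[float, list[str]]:
    """Import `module` in a fresh interpreter; return (elapsed ms, loaded modules)."""
    proc = subprocess.run(
        [sys.executable, "-c", _PROBE, module],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last stdout line in case the imported module prints something itself.
    data = json.loads(proc.stdout.strip().splitlines()[-1])
    return data["elapsed_ms"], data["modules"]


def leaked_heavy_modules(
    modules: list[str],
    forbidden_roots: tuple[str, ...] = ("pydantic", "yaml", "bench"),
) -> list[str]:
    """Return loaded modules whose top-level package is on the forbidden list."""
    return [m for m in modules if m.split(".")[0] in forbidden_roots]
```

A test built on this would call `probe_import("codegenie.eval.cli")`, assert the elapsed time is under 660 ms, and assert `leaked_heavy_modules(...)` is empty, without touching the parent process's `sys.modules`.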
Green: make it pass
Create `src/codegenie/eval/cli.py` with:
- Seven `EXIT_*` integer constants.
- `@click.group(name="eval")` with `@click.option("--format", "fmt", type=click.Choice(["human", "jsonl"]), default="jsonl", show_default=True)`.
- Three `@eval.command()` stubs (`run`, `verify`, `promote-verdict`), each body `raise NotImplementedError("S4-02/S4-03/S4-04")`.
- A `_map_exception_to_exit_code(exc: BaseException) -> int` helper mapping the documented exception classes; default 1.
- A `main()` (or `run_cli()`) calling the group inside `try/except Exception as exc: sys.exit(_map_exception_to_exit_code(exc))`. Catch `Exception`, not `BaseException`: in Click's default standalone mode the group finishes by raising `SystemExit` (code 0 on success, including `--help`), and a `BaseException` handler would remap that to 1.

No imports of `pydantic`, `pyyaml`, `bench.*`, `runner`, `promotion`, or `audit` at module top.
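The `main()` wrapper described above might be shaped like this sketch. It is stdlib-only so it can be read on its own: `run_cli` takes any zero-argument callable, and the mapper here is a placeholder for the real partition helper. The assumption is that the real `main()` would pass something like `lambda: eval(standalone_mode=False)`, since with `standalone_mode=False` Click lets exceptions propagate instead of handling them itself.

```python
import sys
from typing import Callable

EXIT_SUCCESS = 0
EXIT_GENERIC_ERROR = 1


def _map_exception_to_exit_code(exc: BaseException) -> int:
    # Placeholder: the real helper consults the full partition table.
    return EXIT_GENERIC_ERROR


def run_cli(invoke: Callable[[], object]) -> int:
    """Invoke the CLI once and translate the outcome into an exit code."""
    try:
        invoke()
    except SystemExit as exc:
        # Raised by Click in standalone mode (and by --help); keep its code.
        if exc.code is None:
            return EXIT_SUCCESS
        return exc.code if isinstance(exc.code, int) else EXIT_GENERIC_ERROR
    except Exception as exc:  # KeyboardInterrupt is deliberately not caught
        return _map_exception_to_exit_code(exc)
    return EXIT_SUCCESS


def main() -> None:
    # Assumption: swap the no-op for the real group invocation.
    sys.exit(run_cli(lambda: None))
```

One point to check against the architecture doc: Click's own usage errors carry exit code 2, the same value as `EXIT_COST_CAP`, so a mistyped flag and a cost-cap stop would be indistinguishable by code alone unless `main()` handles `click.UsageError` separately.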
Refactor: clean up
- Annotate every function with full type hints; `mypy --strict` clean.
- Module docstring cites `../phase-arch-design.md §Component design → cli.py` as the source of truth for exit-code semantics.
- Use `structlog.get_logger(__name__)` lazily inside `main()`; do not configure logging at import time.
- The exit-code constants get a module-level docstring tying each code to its triggering exception class and a `phase-arch-design.md §Failure modes` row reference; this is the table operators read when they see a non-zero exit.
- Consider extracting `EXIT_*` to a `_exit_codes.py` sibling for downstream consumers; defer unless S4-02 needs it (Rule 2: simplicity first).
- Add a `# pragma: no cover` only on the `if __name__ == "__main__": main()` line; everything else is covered by the unit tests + S4-02/S4-03 integration tests.
Files to touch
| Path | Why |
|---|---|
| `src/codegenie/eval/cli.py` | New file: Click group, three stub subcommands, exit-code constants, exception→code mapper, `main()` entry. |
| `tests/unit/test_cli_scaffold.py` | New file: group exists, subcommands listed, `--format=jsonl` default, cold-start < 660 ms, no heavy imports leaked. |
| `tests/unit/test_cli_exit_codes.py` | New file: seven constants present, disjoint, exception→code mapping is the documented partition. |
| `src/codegenie/__main__.py` or `pyproject.toml` entry-point | If Phase 0's top-level CLI module exists, register the `eval` group surgically (one-line edit). Otherwise note it as a follow-up; this story does not block on Phase 0 entry-point shape. |
Out of scope
- Implementing `run`, `verify`, `promote-verdict` bodies: handled by S4-02 (`run`), S4-03 (`verify`), and S4-04+S4-05 (`promote-verdict` reads `PromotionGate` output). Stubs raise `NotImplementedError` here.
- `--cases`, `--concurrency`, `--max-cost-usd`, `--no-cache`, `--out`, `--with-verdict`, `--bench-root` flags: wired in S4-02 on the `run` subcommand.
- `--since`, `--strict` flags: wired in S4-03 on the `verify` subcommand.
- `PromotionGate` construction and `TierConfig` loading: wired in S4-04 (gate logic) and S4-05 (recommendation writer).
- JSONL line writer: the `--format` option exists structurally here; per-case JSONL emission and human-readable table rendering happen in S4-02.
- Cost-cap exception class: S3-06 owns the `CostCapExceeded` (or equivalent) sentinel; this story just maps it to exit 2 when the symbol exists. If S3-06 has not yet landed, gate the mapping on a `try: from codegenie.eval.errors import CostCapExceeded; except ImportError: CostCapExceeded = ...` shim and flag it.
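The import shim for the cost-cap sentinel could be sketched as follows. The placeholder class and the `COST_CAP_SHIMMED` flag are hypothetical; the story leaves the fallback value elided, and a never-raised placeholder class is just one option that keeps `isinstance` checks in the mapper valid. It also assumes `codegenie.eval.errors` is cheap to import.

```python
try:
    # Provided by S3-06 once it lands.
    from codegenie.eval.errors import CostCapExceeded

    COST_CAP_SHIMMED = False
except ImportError:
    # Covers both a missing module and a module that lacks the symbol.
    class CostCapExceeded(Exception):  # type: ignore[no-redef]
        """Temporary placeholder; nothing raises it until S3-06 ships."""

    # Hypothetical flag so a test or log line can surface that the shim is active.
    COST_CAP_SHIMMED = True
```

With the placeholder in place, the exit-2 branch of the mapper is simply unreachable until the real class exists, and the flag makes it easy to fail a test once S3-06 has landed but the shim was left behind.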
Notes for the implementer
- Defer EVERY heavy import. Even `from codegenie.eval.models import BenchRunReport` triggers `pydantic` and breaks the cold-start budget. Inside subcommand bodies, `import` at function scope; the lint contract from Phase 0 (`import-linter`) will codify this in S1-05 and S7-*, but write it correctly the first time so the test in this story stays green.
- Click 8+ is the assumption; if the project pins an earlier version, `result_callback` and context patterns differ. Check `pyproject.toml`.
- `--format` propagation: store the chosen format in `click.Context.obj` (initialize via `ctx.ensure_object(dict)` at the group level). Subcommands read `ctx.obj["format"]`. This avoids parameter duplication on every subcommand declaration.
- Cold-start budget is honestly load-bearing. Phase 0's `codegenie gather` set this number; operators see `codegenie eval --help` ≥ 50× more often than they see a full run. A 1.5 s `--help` is broken UX even though it does no work.
- `_map_exception_to_exit_code` is intentionally a private helper. Testing the boundary directly (rather than only via `CliRunner`) means the partition is independently verifiable; CLI integration tests in S4-02/S4-03 cover the full Click invocation path. The name is `_`-prefixed; do not export it.
- The cost-cap signal shape is unsettled at the time of writing. S3-06's `BenchRunReport.complete=False` + `run_id.startswith("partial:")` is the data; whether the CLI sees a `CostCapExceeded` exception or a returned `BenchRunReport` is a S4-02 decision. Either way, this story's `_map_exception_to_exit_code` handles the exception form; the report-based path is checked by S4-02 after the runner returns.
- No `BaseException` catch-alls in the group body; only `main()` wraps everything. Inside subcommands, let exceptions propagate; `main()` is the single mapping point. This makes the partition tractable to test.