Story S3-04: Six typed per-case failure paths
Step: Step 3 — Implement the runner: asyncio fan-out, subprocess rubric, aggregator with BCa bootstrap
Status: Ready
Effort: M
Depends on: S3-03 (subprocess rubric invocation)
ADRs honored: ADR-0004 (failure_modes.yaml taxonomy + block-severity), ADR-0008 (BreakdownKey runtime validation + substring ban), ADR-0001 (subprocess failure surface is typed FailureMode)
Context
S3-02 set up the worker; S3-03 wired the subprocess and handled three subprocess-level failure conditions under two codes (`rubric.timeout`, `rubric.malformed_output` for unparseable stdout, and non-zero exit, which also maps to `rubric.malformed_output`). This story closes the loop with the six typed per-case failure paths the architecture commits to. Each one maps to a `FailureMode(severity="block")` recorded on the per-case `BenchScore`; the run does NOT abort on any of them. The aggregator continues; the promotion gate later surfaces them in `block_severity_failure_modes`.
The six paths (arch §Components → runner.py §Failure behavior, arch §Edge cases #3–#5, #12, #14):
1. `sut.exception`: the SUT raises any exception other than `asyncio.CancelledError`, `KeyboardInterrupt`, or `SystemExit`.
2. `sut.timeout`: the SUT exceeds `timeout_per_case_seconds`.
3. `rubric.malformed_output`: non-zero exit OR `pydantic.ValidationError` on stdout (S3-03 already maps).
4. `rubric.timeout`: the subprocess exceeds `case.rubric_wall_clock_seconds` (S3-03 already maps).
5. `rubric.unknown_breakdown_key`: `BenchScore.breakdown` contains a key not in `task_class.breakdown_keys` (ADR-0008 runtime validation).
6. `rubric.unknown_failure_mode`: a rubric-emitted `FailureMode.code` is not in `task_class.failure_mode_taxonomy` (ADR-0004 runtime resolution).
Paths 1, 2, 5, 6 are this story's work. Paths 3, 4 are wired by S3-03 but this story extends the test surface to include them in the integrated run.
References: where to look
- Architecture:
  - `../phase-arch-design.md §Control flow → Decision points #3, #4`: SUT exception/timeout and rubric subprocess failures both yield typed `FailureMode`s; case completes; run continues.
  - `../phase-arch-design.md §Edge cases` #3 (rubric crash), #4 (rubric timeout), #5 (malformed JSON), #12 (banned breakdown key), #14 (SUT exception).
  - `../phase-arch-design.md §Agentic best practices → Error escalation`: the full bucket-to-code mapping.
  - `../phase-arch-design.md §Components → models.py`: `BenchScore.breakdown` runtime validation against `task_class.breakdown_keys` is the "typed-enum-at-the-edge" pattern.
- Phase ADRs:
  - `../ADRs/0004-per-task-class-failure-modes-taxonomy.md`: `failure_modes.yaml` taxonomy + `severity ∈ {block, warn, info}`; rubric-emitted codes resolved against this map; unknown codes → `rubric.unknown_failure_mode` block-severity.
  - `../ADRs/0008-breakdown-keys-strenum-with-substring-ban.md`: runtime validation of `BenchScore.breakdown` dict keys against `task_class.breakdown_keys: frozenset[str]`; unknown keys → `rubric.unknown_breakdown_key` block-severity.
  - `../ADRs/0001-rubric-execution-isolation-via-subprocess.md` §Consequences: the four rubric-side codes are typed.
- Source design:
  - `../final-design.md §Edge cases`: `BenchScore.breakdown` key smuggling defense.
Goal
Extend the runner's worker pipeline so all six per-case failure paths produce a `BenchScore(passed=False, failure_modes=(FailureMode(code="<typed-code>", severity="block", detail=...),))` and never abort the run. Define `block_severity_failure_modes` as the deduplicated, sorted tuple of block-severity codes across the run.
Acceptance criteria
- [ ] `sut.exception` path: when `system_under_test(case)` raises an `Exception` subclass (so not `KeyboardInterrupt`, `SystemExit`, or `asyncio.CancelledError`, which derive from `BaseException` rather than `Exception`), the worker produces `BenchScore(passed=False, score=0.0, breakdown={}, failure_modes=(FailureMode(code="sut.exception", severity="block", detail=f"{type(e).__name__}: {str(e)[:200]}"),), cost_usd=0.0, wall_clock_ms=<measured>)`.
- [ ] `sut.timeout` path: when `asyncio.wait_for(system_under_test(case), timeout=plan.timeout_per_case_seconds)` raises `asyncio.TimeoutError`, the worker produces `BenchScore(..., failure_modes=(FailureMode(code="sut.timeout", severity="block"),))`. Critical: `asyncio.wait_for`, not `signal.SIGALRM`; Phase 6's SUT is `async` (final-design §Components → runner.py "asyncio.wait_for, not SIGALRM").
- [ ] `rubric.unknown_breakdown_key`: after the rubric subprocess returns a parseable `BenchScore`, the runner validates `set(score.breakdown).issubset(task_class.breakdown_keys)`; on a violation, the offending case is recorded as `BenchScore(passed=False, ..., failure_modes=(FailureMode(code="rubric.unknown_breakdown_key", severity="block", detail=<offending key as string>),))` and the rubric's original score is discarded.
- [ ] `rubric.unknown_failure_mode`: after parsing, the runner iterates `score.failure_modes` and resolves each `fm.code` against `task_class.failure_mode_taxonomy`. Codes not in the taxonomy are replaced with `FailureMode(code="rubric.unknown_failure_mode", severity="block", detail=<original_code>)`. Known codes get their severity re-resolved from the taxonomy (the rubric's emitted severity is only a suggestion; the taxonomy is the source of truth per ADR-0004 §"Facts, not judgments").
- [ ] `KeyboardInterrupt`, `SystemExit`, and `asyncio.CancelledError` are never mapped to `FailureMode`; they propagate. A test asserts this for each of the three.
- [ ] Run continues on every failure path: a 3-case bench where case-A's SUT throws, case-B's SUT times out, and case-C's rubric emits a banned breakdown key produces a `BenchRunReport` with three `per_case` entries; `len(per_case) == 3`; `complete=True`.
- [ ] `block_severity_failure_modes` on the aggregate report is the deduplicated, sorted tuple of all `fm.code` where `fm.severity == "block"` across all `per_case` entries.
- [ ] `mypy --strict`, `ruff format --check`, `ruff check` clean.
- [ ] All red tests in §TDD plan exist, were committed at the red marker, and are now green.
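The `rubric.unknown_breakdown_key` criterion above amounts to a set-difference check. The following is a minimal sketch, not the project's implementation: the dataclasses are trimmed stand-ins for the real models in `codegenie.eval.models`, `check_breakdown_keys` is a hypothetical helper name, and `score=0.0` on the failure score is an assumption (the criterion elides that field). It also picks `min(unknown)` instead of `next(iter(unknown))` so the reported key is deterministic, which is a swapped-in choice.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class FailureMode:  # stand-in for codegenie.eval.models.FailureMode
    code: str
    severity: str
    detail: Optional[str] = None


@dataclass(frozen=True)
class BenchScore:  # stand-in; field set trimmed for the sketch
    passed: bool
    score: float
    breakdown: dict[str, float] = field(default_factory=dict)
    failure_modes: tuple[FailureMode, ...] = ()


def check_breakdown_keys(score: BenchScore, allowed: frozenset[str]) -> BenchScore:
    """Return `score` unchanged if every breakdown key is allowed; otherwise
    discard it and return a block-severity failure score."""
    unknown = set(score.breakdown) - allowed
    if not unknown:
        return score
    return BenchScore(
        passed=False,
        score=0.0,  # assumption: the criterion does not spell this field out
        breakdown={},
        failure_modes=(
            FailureMode(
                code="rubric.unknown_breakdown_key",
                severity="block",
                detail=min(unknown),  # deterministic pick (assumption)
            ),
        ),
    )
```

With a single unknown key, as in the `llm_confidence` test, `min(unknown)` and `next(iter(unknown))` report the same key; they only differ when several keys are unknown at once.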
Implementation outline
- Extract a `_run_case(plan, case, system_under_test, rubric_runner) -> BenchScore` function from S3-02's worker body (if not already done in S3-02's refactor).
- Around the SUT call:

  ```python
  try:
      harness_output = await asyncio.wait_for(
          system_under_test(case),
          timeout=plan.timeout_per_case_seconds,
      )
  except asyncio.TimeoutError:
      return _sut_timeout_score(case, start_ns)
  except (KeyboardInterrupt, SystemExit, asyncio.CancelledError):
      # Never mapped to a FailureMode: these must propagate.
      raise
  except Exception as e:
      return _sut_exception_score(case, e, start_ns)
  ```
- After the rubric subprocess returns a parseable `BenchScore` from S3-03:
  - Validate `set(score.breakdown).issubset(plan.task_class.breakdown_keys)`. On miss: pick `next(iter(unknown))` for the detail; return `_unknown_breakdown_score(...)`.
  - Resolve `failure_modes`:

    ```python
    resolved: list[FailureMode] = []
    taxonomy = plan.task_class.failure_mode_taxonomy
    for fm in score.failure_modes:
        if fm.code in taxonomy:
            # Known code: the taxonomy's severity replaces the rubric's.
            resolved.append(FailureMode(code=fm.code, severity=taxonomy[fm.code], detail=fm.detail))
        else:
            # Unknown code: keep the original code in `detail`.
            resolved.append(FailureMode(code="rubric.unknown_failure_mode", severity="block", detail=fm.code))
    ```
  - Return `score.model_copy(update={"failure_modes": tuple(resolved)})`.
- Update the aggregator (S3-02) to compute `block_severity_failure_modes` as the deduplicated, sorted tuple of block-severity codes across all `per_case` entries.
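The aggregator step can be sketched as a set comprehension followed by a sort. This is an illustrative sketch under stated assumptions: `FailureMode` is a stand-in dataclass, and the function takes an iterable of per-case `failure_modes` tuples rather than the real `per_case` entries (which, judging by the tests, are `(case_id, BenchScore)` pairs).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FailureMode:  # stand-in for codegenie.eval.models.FailureMode
    code: str
    severity: str
    detail: Optional[str] = None


def block_severity_failure_modes(
    per_case_failure_modes: Iterable[tuple[FailureMode, ...]],
) -> tuple[str, ...]:
    """Collect block-severity codes across all cases, deduplicated and sorted."""
    codes = {
        fm.code
        for failure_modes in per_case_failure_modes
        for fm in failure_modes
        if fm.severity == "block"
    }
    return tuple(sorted(codes))
```

Sorting makes the tuple stable across runs regardless of case completion order, which matters once cases finish concurrently.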
TDD plan: red / green / refactor
Red: write failing tests first
Test file: tests/unit/test_runner_failure_paths.py
```python
import asyncio

import pytest

from codegenie.eval.models import BenchScore, FailureMode
from codegenie.eval.runner import Runner
from tests.helpers.bench import make_stub_plan
from tests.helpers.suts import (
    RaisingSUT, SleepingSUT, DeterministicSUT, multi_sut,
)
from tests.helpers.rubrics import (
    in_process_stub_rubric,
    banned_breakdown_key_rubric,
    unknown_failure_mode_rubric,
    mixed_severity_rubric,
)


@pytest.mark.asyncio
async def test_sut_exception_maps_to_sut_exception_failure_mode():
    plan = make_stub_plan(case_ids=["a"])
    sut = RaisingSUT(error=RuntimeError("nope, broken"))
    report = await Runner().execute(plan, system_under_test=sut, rubric_runner=in_process_stub_rubric)
    s = report.per_case[0][1]
    assert s.passed is False
    assert s.failure_modes[0].code == "sut.exception"
    assert s.failure_modes[0].severity == "block"
    assert "RuntimeError" in (s.failure_modes[0].detail or "")
    assert "nope, broken" in (s.failure_modes[0].detail or "")


@pytest.mark.asyncio
async def test_sut_timeout_maps_to_sut_timeout_failure_mode():
    plan = make_stub_plan(case_ids=["a"], timeout_per_case_seconds=0.1)
    sut = SleepingSUT(seconds=5.0)
    report = await Runner().execute(plan, system_under_test=sut, rubric_runner=in_process_stub_rubric)
    s = report.per_case[0][1]
    assert s.failure_modes[0].code == "sut.timeout"
    assert s.failure_modes[0].severity == "block"


@pytest.mark.asyncio
async def test_keyboard_interrupt_propagates():
    plan = make_stub_plan(case_ids=["a"])

    async def boom(case):
        raise KeyboardInterrupt

    with pytest.raises(KeyboardInterrupt):
        await Runner().execute(plan, system_under_test=boom, rubric_runner=in_process_stub_rubric)


@pytest.mark.asyncio
async def test_system_exit_propagates():
    plan = make_stub_plan(case_ids=["a"])

    async def bye(case):
        raise SystemExit(2)

    with pytest.raises(SystemExit):
        await Runner().execute(plan, system_under_test=bye, rubric_runner=in_process_stub_rubric)


@pytest.mark.asyncio
async def test_cancellederror_propagates():
    plan = make_stub_plan(case_ids=["a"])

    async def cancel(case):
        raise asyncio.CancelledError()

    with pytest.raises(asyncio.CancelledError):
        await Runner().execute(plan, system_under_test=cancel, rubric_runner=in_process_stub_rubric)


@pytest.mark.asyncio
async def test_unknown_breakdown_key_maps_to_block_failure():
    """ADR-0008 runtime validation: rubric emits 'llm_confidence' → block."""
    plan = make_stub_plan(case_ids=["a"], breakdown_keys=frozenset({"correctness"}))
    sut = DeterministicSUT.passing()
    report = await Runner().execute(
        plan, system_under_test=sut, rubric_runner=banned_breakdown_key_rubric("llm_confidence"),
    )
    s = report.per_case[0][1]
    assert s.passed is False
    assert s.failure_modes[0].code == "rubric.unknown_breakdown_key"
    assert s.failure_modes[0].severity == "block"
    assert s.failure_modes[0].detail == "llm_confidence"


@pytest.mark.asyncio
async def test_unknown_failure_mode_code_replaced():
    """ADR-0004 runtime resolution: rubric-emitted code not in taxonomy → unknown_failure_mode."""
    plan = make_stub_plan(case_ids=["a"], failure_mode_taxonomy={"validator.build_failed": "block"})
    sut = DeterministicSUT.passing()
    report = await Runner().execute(
        plan, system_under_test=sut,
        rubric_runner=unknown_failure_mode_rubric(emitted_code="some.typoed.code"),
    )
    s = report.per_case[0][1]
    assert s.failure_modes[0].code == "rubric.unknown_failure_mode"
    assert s.failure_modes[0].severity == "block"
    assert s.failure_modes[0].detail == "some.typoed.code"


@pytest.mark.asyncio
async def test_known_failure_mode_severity_resolved_from_taxonomy():
    """Rubric's emitted severity is suggestion; taxonomy is source of truth."""
    plan = make_stub_plan(case_ids=["a"], failure_mode_taxonomy={"recipe.unused_field": "warn"})
    rubric = unknown_failure_mode_rubric(emitted_code="recipe.unused_field", emitted_severity="block")
    sut = DeterministicSUT.passing()
    report = await Runner().execute(plan, system_under_test=sut, rubric_runner=rubric)
    s = report.per_case[0][1]
    assert s.failure_modes[0].code == "recipe.unused_field"
    assert s.failure_modes[0].severity == "warn"  # taxonomy wins


@pytest.mark.asyncio
async def test_run_does_not_abort_on_any_failure_path():
    """3 cases, 3 different failure modes; all three land in the report."""
    plan = make_stub_plan(
        case_ids=["a", "b", "c"],
        breakdown_keys=frozenset({"correctness"}),
        failure_mode_taxonomy={"validator.build_failed": "block"},
        timeout_per_case_seconds=0.1,
    )

    def sut_for(case_id):
        if case_id == "a":
            return RaisingSUT(error=ValueError("boom"))
        if case_id == "b":
            return SleepingSUT(seconds=5.0)
        return DeterministicSUT.passing()

    report = await Runner().execute(
        plan,
        system_under_test=multi_sut(sut_for),
        rubric_runner=banned_breakdown_key_rubric("llm_confidence"),
    )
    codes = {fm.code for _, s in report.per_case for fm in s.failure_modes}
    assert "sut.exception" in codes
    assert "sut.timeout" in codes
    assert "rubric.unknown_breakdown_key" in codes
    assert len(report.per_case) == 3
    assert report.complete is True


@pytest.mark.asyncio
async def test_block_severity_failure_modes_is_dedup_sorted_block_only():
    plan = make_stub_plan(case_ids=["a", "b"], failure_mode_taxonomy={
        "validator.build_failed": "block",
        "recipe.unused_field": "warn",
    })
    report = await Runner().execute(
        plan,
        system_under_test=DeterministicSUT.passing(),
        rubric_runner=mixed_severity_rubric({
            "a": [("validator.build_failed", "block")],
            "b": [("recipe.unused_field", "warn")],
        }),
    )
    assert report.block_severity_failure_modes == ("validator.build_failed",)
```
Run all ten; confirm failures. Commit as the red marker.
Green: make them pass
Wrap the SUT call in `try`/`except` with the cancel-exception passthrough. Add the breakdown-key validation and failure-mode taxonomy resolution after `BenchScore.model_validate_json`. Update the aggregator to compute `block_severity_failure_modes` from the deduplicated, sorted block-severity set.
Refactor: clean up
- Pull each failure-mode mapping into a named helper (`_sut_exception_score`, `_sut_timeout_score`, `_unknown_breakdown_score`, `_resolve_failure_modes`) so each is readable and testable in isolation.
- Add a module-level table comment listing the six codes and their architectural origins (ADR refs).
- Structured logs: `log.warning("case_failed", code=fm.code, severity=fm.severity, detail=fm.detail)` on every block-severity emission.
- Verify the helpers are pure (no `await`, no I/O); they take inputs and return a `BenchScore`. Pure helpers + a thin async glue function is the clean refactor target.
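One way to keep a helper such as `_sut_exception_score` pure is to pass both timestamps in, so the helper never reads the clock itself. The sketch below is illustrative only: the dataclasses are stand-ins for the real models, the signature `(exc, start_ns, end_ns)` is a hypothetical variation on the outline's `(case, e, start_ns)`, and an integer `wall_clock_ms` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMode:  # stand-in for codegenie.eval.models.FailureMode
    code: str
    severity: str
    detail: Optional[str] = None


@dataclass(frozen=True)
class BenchScore:  # stand-in; mirrors the fields named in the acceptance criteria
    passed: bool
    score: float
    breakdown: dict[str, float]
    failure_modes: tuple[FailureMode, ...]
    cost_usd: float
    wall_clock_ms: int


def sut_exception_score(exc: Exception, start_ns: int, end_ns: int) -> BenchScore:
    """Pure mapping from a SUT exception to a block-severity failure score."""
    # Exception type plus the first 200 characters of its message.
    detail = f"{type(exc).__name__}: {str(exc)[:200]}"
    return BenchScore(
        passed=False,
        score=0.0,
        breakdown={},
        failure_modes=(FailureMode("sut.exception", "block", detail),),
        cost_usd=0.0,
        wall_clock_ms=(end_ns - start_ns) // 1_000_000,  # ns to whole ms
    )
```

The async glue would then call something like `time.monotonic_ns()` before and after the SUT call and hand both values to the helper, which keeps the helper trivially unit-testable.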
Files to touch
| Path | Why |
|---|---|
| `src/codegenie/eval/runner.py` | Add the six failure-path mappings and aggregator's `block_severity_failure_modes` computation |
| `tests/unit/test_runner_failure_paths.py` | New: ten tests covering all six paths + propagation guarantees + aggregator dedup |
| `tests/helpers/suts.py` | Add `RaisingSUT`, `SleepingSUT`, `DeterministicSUT`, `multi_sut` |
| `tests/helpers/rubrics.py` | Add `banned_breakdown_key_rubric`, `unknown_failure_mode_rubric`, `mixed_severity_rubric` (all in-process; subprocess versions live in S3-07) |
Out of scope
- `rubric.timeout` and `rubric.malformed_output` plumbing: S3-03 owns these; this story extends the test surface for the integrated run but does not change the subprocess module.
- Adversarial subprocess fixtures (`tests/fixtures/bench/adversarial-task-class/`): S3-07.
- Cost-cap path (`run_id = "partial:..."`): S3-06.
- BCa bootstrap on `lower_bound_95`: S3-05.
- Promotion gate's consumption of `block_severity_failure_modes`: S4-04.
Notes for the implementer
- `asyncio.wait_for`, not `signal.SIGALRM`. Called out in `final-design.md §Components → runner.py` as a critic-flagged correctness issue (SIGALRM doesn't compose with asyncio). The temptation to "just use a signal handler" is wrong: `SIGALRM` is a single process-wide timer whose handler runs in the main thread, so it cannot time out one coroutine among several concurrently running cases. The `sut.timeout` test is the guard against that regression.
- `asyncio.CancelledError` passthrough is critical. S3-06's cost-cap path uses `asyncio.Task.cancel()` to abort outstanding workers; mapping `CancelledError` to a `FailureMode` would silently turn cost-cap into "all cases failed" instead of "run aborted." The explicit `CancelledError` propagation test is the structural guard.
- Taxonomy severity overrides rubric severity. This is non-obvious. Per ADR-0004 §"Facts, not judgments", the rubric is allowed to report what happened; the taxonomy decides what severity that event is. A rubric that emits `severity="warn"` for a code the taxonomy classifies as `"block"` must produce a `"block"`-severity `FailureMode`. The test for this is load-bearing.
- `task_class.breakdown_keys` is `frozenset[str]` (per arch §Data model). The runtime check is `set(score.breakdown).issubset(task_class.breakdown_keys)`. Pick `next(iter(unknown))` for the detail. If multiple keys are unknown, this surfaces an arbitrary one of them; the rubric author fixes it and re-runs to discover the next. The `detail` field is single-string by ADR-0004.
- Defense in depth: ADR-0008 says the fence-CI substring ban (`tests/unit/test_eval_fence.py`) catches banned keys at PR time; this runtime check is the second layer. Both must be live; one is not enough.
- Don't conflate "warn-severity" with "passed=False". A case can have `passed=True` AND `failure_modes=(FailureMode(code="recipe.unused_field", severity="warn"),)`. `block_severity_failure_modes` filters on severity; the rubric's `passed` field is independent. ADR-0004 §Tradeoffs explicitly allows this asymmetry.
- The test fixture rubrics are in-process (S3-04 reuses S3-02's in-process injection seam). Subprocess versions of the adversarial rubrics live in S3-07's bench fixture portfolio.
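The `CancelledError` passthrough note can be checked with a toy worker. This is a self-contained sketch, not the runner: `run_case`, `slow`, `broken`, and `demo` are hypothetical names, and the worker returns a failure-code string in place of a `BenchScore`. It shows a timeout and an ordinary exception being mapped, while an external `Task.cancel()` (standing in for the cost-cap abort) leaves the task cancelled instead of "failed".

```python
import asyncio


async def run_case(sut, timeout: float) -> str:
    """Toy worker: returns a failure code string instead of a BenchScore."""
    try:
        await asyncio.wait_for(sut(), timeout=timeout)
    except asyncio.TimeoutError:
        return "sut.timeout"
    except (KeyboardInterrupt, SystemExit, asyncio.CancelledError):
        raise  # never mapped to a failure code
    except Exception:
        return "sut.exception"
    return "ok"


async def slow() -> None:
    await asyncio.sleep(10)


async def broken() -> None:
    raise ValueError("boom")


async def demo() -> tuple[str, str, bool]:
    timed_out = await run_case(slow, timeout=0.01)
    raised = await run_case(broken, timeout=1.0)
    # Simulate an external abort: cancel a running worker from outside.
    task = asyncio.create_task(run_case(slow, timeout=5.0))
    await asyncio.sleep(0)  # yield once so the worker starts running
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return timed_out, raised, task.cancelled()
```

Running `asyncio.run(demo())` yields the two mapped codes plus `True` for `task.cancelled()`. Had the worker swallowed `CancelledError`, the task would have completed with a failure code and `task.cancelled()` would be `False`.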