Story S5-01: vuln-remediation registration + breakdown_keys + failure_modes
Step: Step 5 — Backfill bench/vuln-remediation/ with ≥10 cases + rubric + taxonomies
Status: Ready
Effort: S
Depends on: S4-02 (codegenie eval run subcommand exists end-to-end on a stub bench), and transitively the Step 1 contracts (@register_task_class, BreakdownKey StrEnum convention, taxonomy loader)
ADRs honored: ADR-0001 (subprocess-isolation envelope the rubric will fit), ADR-0004 (per-task-class failure_modes.yaml taxonomy with severity), ADR-0006 (curation-class split; min_cases_for_promotion["silver"] triggers held-out floor), ADR-0008 (per-task-class BreakdownKey StrEnum + substring ban at value level)
Context
Step 5 produces the worked example every Phase 7 implementer will pattern-match against. Before any cases or rubric land, the task-class identity for vuln-remediation must exist: a single @register_task_class("vuln-remediation", ...) literal call, a BreakdownKey StrEnum whose values pass ADR-0008's substring ban, and a failure_modes.yaml taxonomy whose entries carry severity ∈ {block, warn, info} and a non-empty description per ADR-0004. These three artifacts are the structural contract every subsequent S5-* story extends — the rubric (S5-02) emits keys constrained by BreakdownKey and codes constrained by failure_modes.yaml; the cases (S5-03/04) carry no taxonomy, but the runner validates rubric output against this taxonomy at score time.
The story is intentionally scoped tight: no cases yet, no rubric yet, no E2E run. It is the identity declaration the harness needs to know vuln-remediation is a real task class with a real breakdown-key and failure-mode shape.
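The decorator-driven identity declaration described above can be pictured with a toy registry. This is only a sketch under stated assumptions, not the real `codegenie.eval.registry` code: the `TaskClass` fields shown, the `registry=` keyword, and the duplicate-name behaviour are all hypothetical, and the real decorator takes further arguments (`bench_path`, `rubric_class`, `breakdown_key_enum`).

```python
from dataclasses import dataclass
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


class TaskClassAlreadyRegistered(Exception):
    """Raised when the same task-class name is registered twice."""


@dataclass(frozen=True)
class TaskClass:
    # Hypothetical subset of the fields the real TaskClass carries.
    name: str
    min_cases_for_promotion: Mapping[str, int]


class TaskClassRegistry:
    def __init__(self) -> None:
        self._by_name: dict[str, TaskClass] = {}

    def register(self, task_class: TaskClass) -> None:
        if task_class.name in self._by_name:
            raise TaskClassAlreadyRegistered(task_class.name)
        self._by_name[task_class.name] = task_class

    def get(self, name: str) -> TaskClass:
        return self._by_name[name]


def register_task_class(
    name: str,
    *,
    min_cases_for_promotion: Mapping[str, int],
    registry: TaskClassRegistry,  # assumption: the real decorator uses a default registry
) -> Callable[[T], T]:
    def decorate(marker: T) -> T:
        # Registration is a side effect of decorating; the marker is returned unchanged.
        registry.register(TaskClass(name, dict(min_cases_for_promotion)))
        return marker

    return decorate


registry = TaskClassRegistry()


@register_task_class(
    "vuln-remediation",
    min_cases_for_promotion={"bronze": 10, "silver": 25},
    registry=registry,
)
class _VulnRemediationRegistration:
    pass
```

In a design like this, a repeated `import` of the registration module is harmless because Python caches modules in `sys.modules`, so the decorator body runs only once per process unless the module is explicitly purged.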
References: where to look
- Architecture:
  - `../phase-arch-design.md §bench/{task-class}/ directory contract`: the four files this story creates and their structural roles.
  - `../phase-arch-design.md §Component design → src/codegenie/eval/loader.py`: how `breakdown_keys.py` is imported and `frozenset({m.value for m in BreakdownKey})` is extracted into `task_class.breakdown_keys`.
  - `../phase-arch-design.md §Fence-CI test`: assertions #4 (literal name only), #5 (StrEnum substring ban), #6 (taxonomy validity) all gate this story.
- Phase ADRs:
  - `../ADRs/0004-per-task-class-failure-modes-taxonomy.md §Decision`: `severity: block|warn|info` per code; non-empty `description`; loader parses into `task_class.failure_mode_taxonomy: Mapping[str, Literal["block","warn","info"]]`.
  - `../ADRs/0006-curation-class-split-with-fence-ci-held-out-floor.md §Decision`: `min_cases_for_promotion["silver"]` triggers the fence-CI held-out floor (≥ 5 held-out cases). Declaring silver here commits the bench to the floor S5-04 must satisfy.
  - `../ADRs/0008-breakdown-keys-strenum-with-substring-ban.md §Decision`: `BreakdownKey` is a `StrEnum`; member values (not just names) are walked by fence-CI assertion #5 for `confidence|llm|self_reported|model_says`.
- Production ADRs:
  - `../../../production/adrs/0008-objective-signal-trust-score.md`: the upstream "no LLM self-confidence" commitment that ADR-0008 structurally enforces.
- Source design:
  - `../High-level-impl.md §Step 5`: initial taxonomy proposal (vuln-remediation block/warn/info entries).
Goal
Land bench/vuln-remediation/{registration.py, breakdown_keys.py, failure_modes.yaml} declaring exactly one @register_task_class("vuln-remediation", min_cases_for_promotion={"bronze": 10, "silver": 25}), a BreakdownKey StrEnum whose values pass ADR-0008's substring ban, and a taxonomy with severity per code — all three files importable by the loader and validated by fence-CI assertions #4–#6.
Acceptance criteria
- [ ] `bench/vuln-remediation/registration.py` contains exactly one `@register_task_class` call whose first positional arg is the literal string `"vuln-remediation"` and whose `min_cases_for_promotion` kwarg is exactly `{"bronze": 10, "silver": 25}`.
- [ ] Importing `bench.vuln_remediation.registration` once (in a fresh registry) registers the task class; a second import in the same test process does not raise `TaskClassAlreadyRegistered` (module-import dedup is the standard side-effect pattern).
- [ ] `bench/vuln-remediation/breakdown_keys.py` defines `class BreakdownKey(StrEnum)` with at least 4 members (e.g., `VALIDATOR_TESTS_PASSED`, `VALIDATOR_BUILD_PASSED`, `CVE_DROPPED`, `RECIPE_APPLIED`); every member value is a literal `ast.Constant` string (no `f"..."`, no `prefix + suffix`).
- [ ] No `BreakdownKey` member value contains the substrings `confidence`, `llm`, `self_reported`, or `model_says`; `tests/unit/test_breakdown_keys_static.py` (S1-05) walks the registered enum and stays green.
- [ ] `bench/vuln-remediation/failure_modes.yaml` declares every code listed in ADR-0004's "initial vuln-remediation taxonomy" (block: `validator.build_failed`, `validator.tests_failed`, `validator.cve_not_dropped`, `recipe.semantic_drift`, `rubric.timeout`, `rubric.unknown_failure_mode`, `sut.exception`, `sut.cancelled`; warn: `recipe.unused_field`, `cassette.tier_mismatch`, `cost.over_estimate`; info: `recipe.optimized_path`, `rag.first_hit`); each entry has `severity ∈ {block, warn, info}` and a non-empty `description` string.
- [ ] After loading, `task_class.breakdown_keys` is a `frozenset[str]` matching the StrEnum values; `task_class.failure_mode_taxonomy[code]` returns the declared severity for every declared code.
- [ ] Fence-CI assertions #4 (literal name), #5 (BreakdownKey substring ban), #6 (taxonomy validity) all pass on these three files; the S7-01 fence test runs them in its ≤ 2 s budget.
- [ ] The red test from §TDD plan exists, was committed at the red marker, and is now green; `ruff check`, `ruff format --check`, `mypy --strict bench/vuln-remediation/registration.py bench/vuln-remediation/breakdown_keys.py`, and `pytest tests/unit/test_bench_vuln_registration.py` all pass.
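The "literal `ast.Constant` string" criterion above is a static property of the source file, so it can be checked without importing the module. The following is a minimal sketch of such a check, not the actual fence-CI implementation; the helper name `non_literal_members` and the member values in `SOURCE` are hypothetical.

```python
import ast

# Hypothetical file contents; the second member deliberately violates the rule.
SOURCE = '''
from enum import StrEnum

class BreakdownKey(StrEnum):
    VALIDATOR_TESTS_PASSED = "validator.tests_passed"
    CVE_DROPPED = "cve." + "dropped"
'''


def non_literal_members(source: str, class_name: str = "BreakdownKey") -> list[str]:
    """Return names of class-level assignments whose value is not a plain string literal."""
    offenders: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not isinstance(stmt, ast.Assign):
                continue  # skip docstrings, methods, and other non-assignment statements
            value = stmt.value
            is_literal = isinstance(value, ast.Constant) and isinstance(value.value, str)
            if not is_literal:
                # f-strings parse as ast.JoinedStr and concatenation as ast.BinOp.
                offenders.extend(t.id for t in stmt.targets if isinstance(t, ast.Name))
    return offenders
```

Because the source is parsed rather than executed, a check of this shape sees `"cve." + "dropped"` as an expression even though the enum would hold the same string at runtime.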
Implementation outline
- Create the directory skeleton: `bench/vuln-remediation/{__init__.py, registration.py, breakdown_keys.py, failure_modes.yaml, README.md}` (README is a stub; S5-05 fills it).
- Write the red test `tests/unit/test_bench_vuln_registration.py` first (see §TDD plan).
- `breakdown_keys.py`: define `class BreakdownKey(StrEnum)` with one literal string value per member, as required by the acceptance criteria.
- `failure_modes.yaml`: enumerate every code with `severity` + `description`. Use the ADR-0004 §Consequences "initial vuln-remediation taxonomy" verbatim as the starting set. Severities are literal lowercase strings.
- `registration.py`:

  ```python
  from pathlib import Path

  from codegenie.eval.registry import register_task_class

  from bench.vuln_remediation.breakdown_keys import BreakdownKey  # imported so the loader resolves it
  from bench.vuln_remediation.rubric import VulnRemediationRubric  # forward import; S5-02 lands the class


  @register_task_class(
      "vuln-remediation",
      bench_path=Path(__file__).parent,
      min_cases_for_promotion={"bronze": 10, "silver": 25},
      rubric_class=VulnRemediationRubric,
      breakdown_key_enum=BreakdownKey,
  )
  class _VulnRemediationRegistration:  # marker class; the decorator owns the registration
      pass
  ```

  (Note: the rubric class is imported here even though S5-02 hasn't landed; this story may temporarily stub the import. See §Notes for the implementer.)
- Run `mypy --strict` and the fence-CI test against the three files; iterate to green.
TDD plan: red / green / refactor
Red: write the failing test first
Test file path: tests/unit/test_bench_vuln_registration.py
```python
# tests/unit/test_bench_vuln_registration.py
import importlib
import sys
from enum import StrEnum

import pytest

from codegenie.eval.registry import TaskClassRegistry

BANNED_SUBSTRINGS = ("confidence", "llm", "self_reported", "model_says")

REQUIRED_BLOCK_CODES = frozenset({
    "validator.build_failed",
    "validator.tests_failed",
    "validator.cve_not_dropped",
    "recipe.semantic_drift",
    "rubric.timeout",
    "rubric.unknown_failure_mode",
    "sut.exception",
    "sut.cancelled",
})
REQUIRED_WARN_CODES = frozenset({
    "recipe.unused_field",
    "cassette.tier_mismatch",
    "cost.over_estimate",
})
REQUIRED_INFO_CODES = frozenset({
    "recipe.optimized_path",
    "rag.first_hit",
})


@pytest.fixture()
def fresh_registry(monkeypatch):
    # Isolate side-effect import from the global default_registry.
    reg = TaskClassRegistry()
    monkeypatch.setattr("codegenie.eval.registry.default_registry", reg)
    # Force a fresh import of the registration module so the decorator fires
    # against the patched registry.
    for mod in (
        "bench.vuln_remediation.registration",
        "bench.vuln_remediation.breakdown_keys",
    ):
        sys.modules.pop(mod, None)
    return reg


def test_registration_imports_and_uses_literal_name_and_promotion_floors(fresh_registry):
    importlib.import_module("bench.vuln_remediation.registration")
    tc = fresh_registry.get("vuln-remediation")
    assert tc.name == "vuln-remediation"
    # ADR-0006: declaring silver triggers the held-out-≥5 fence assertion.
    assert tc.min_cases_for_promotion == {"bronze": 10, "silver": 25}


def test_breakdown_key_strenum_passes_substring_ban():
    from bench.vuln_remediation.breakdown_keys import BreakdownKey

    assert issubclass(BreakdownKey, StrEnum)
    members = list(BreakdownKey)
    assert len(members) >= 4
    for m in members:
        v = m.value
        assert isinstance(v, str) and v != ""
        for banned in BANNED_SUBSTRINGS:
            assert banned not in v, f"banned substring {banned!r} in BreakdownKey value {v!r}"


def test_failure_modes_yaml_declares_full_taxonomy(fresh_registry):
    importlib.import_module("bench.vuln_remediation.registration")
    tc = fresh_registry.get("vuln-remediation")
    tax = tc.failure_mode_taxonomy
    # Every required code present with the right severity.
    for code in REQUIRED_BLOCK_CODES:
        assert tax[code] == "block", f"{code} should be block-severity"
    for code in REQUIRED_WARN_CODES:
        assert tax[code] == "warn", f"{code} should be warn-severity"
    for code in REQUIRED_INFO_CODES:
        assert tax[code] == "info", f"{code} should be info-severity"
    # Every declared code has a non-empty description (loaded as part of taxonomy load,
    # surfaced via task_class.failure_mode_descriptions or analogous mapping).
    descs = getattr(tc, "failure_mode_descriptions", None)
    assert descs is not None
    for code in REQUIRED_BLOCK_CODES | REQUIRED_WARN_CODES | REQUIRED_INFO_CODES:
        assert descs[code].strip() != ""


def test_breakdown_keys_loaded_into_task_class(fresh_registry):
    importlib.import_module("bench.vuln_remediation.registration")
    tc = fresh_registry.get("vuln-remediation")
    from bench.vuln_remediation.breakdown_keys import BreakdownKey

    assert tc.breakdown_keys == frozenset(m.value for m in BreakdownKey)
```
Run it; confirm ModuleNotFoundError or KeyError. Commit as red marker.
Green: smallest impl shape
- Create the three files above.
- `failure_modes.yaml` as a flat mapping `{code: {severity, description}}`.
- If `task_class.failure_mode_descriptions` is not yet exposed on `TaskClass` (S1-03), extend the dataclass with that `Mapping[str, str]` field. The loader (S2-01) already parses the YAML; extend it to also populate this map.
- The decorator side-effect import must run once; the test uses a fresh registry per test to isolate it.
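One way to picture the loader extension is a function that validates the flat `{code: {severity, description}}` mapping and splits it into the two maps the test reads. This is a sketch under assumptions: `split_taxonomy` is a hypothetical name, the dict literal stands in for whatever the YAML parser returns, and the description strings are placeholders rather than the ADR's wording.

```python
from typing import Any, Mapping

ALLOWED_SEVERITIES = frozenset({"block", "warn", "info"})


def split_taxonomy(raw: Mapping[str, Any]) -> tuple[dict[str, str], dict[str, str]]:
    """Validate a flat taxonomy mapping; return (severities, descriptions) keyed by code."""
    severities: dict[str, str] = {}
    descriptions: dict[str, str] = {}
    for code, entry in raw.items():
        if not isinstance(entry, Mapping):
            raise ValueError(f"{code}: entry must be a mapping")
        severity = entry.get("severity")
        description = entry.get("description")
        if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"{code}: severity must be one of {sorted(ALLOWED_SEVERITIES)}")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"{code}: description must be a non-empty string")
        severities[code] = severity
        descriptions[code] = description
    return severities, descriptions


# Two entries in the shape a parsed failure_modes.yaml would plausibly have.
RAW = {
    "validator.build_failed": {"severity": "block", "description": "placeholder text"},
    "rag.first_hit": {"severity": "info", "description": "placeholder text"},
}
severities, descriptions = split_taxonomy(RAW)
```

Keeping severities and descriptions in separate maps leaves `failure_mode_taxonomy` with the `Mapping[str, Literal["block","warn","info"]]` shape ADR-0004 specifies while still surfacing descriptions for the test.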
Refactor: clean up
- Module docstrings on `registration.py` and `breakdown_keys.py` cite ADR-0004, ADR-0006, ADR-0008.
- `failure_modes.yaml` top-of-file comment names the ADR.
- Type-narrow the `min_cases_for_promotion` literal so `mypy --strict` accepts it without `# type: ignore`.
- The README stub names what S5-02/03/04/05 will add; do not include cases or rubric details, since those land in their own stories.
Files to touch
| Path | Why |
|---|---|
| `bench/vuln-remediation/__init__.py` | New file: package marker (empty or single docstring) |
| `bench/vuln-remediation/registration.py` | New file: the single `@register_task_class("vuln-remediation", ...)` literal |
| `bench/vuln-remediation/breakdown_keys.py` | New file: `BreakdownKey` StrEnum with literal values |
| `bench/vuln-remediation/failure_modes.yaml` | New file: full taxonomy with severity per code |
| `bench/vuln-remediation/README.md` | New file: stub; S5-05 expands |
| `tests/unit/test_bench_vuln_registration.py` | New file: pins identity, StrEnum, taxonomy |
| `src/codegenie/eval/loader.py` (possibly) | Extend taxonomy load to populate `failure_mode_descriptions` if S2-01 hasn't already |
Out of scope
- The rubric implementation. S5-02 lands `rubric.py`. This story imports the rubric class as a forward dependency; if S5-02 hasn't merged, ship `rubric.py` as a minimal stub (`class VulnRemediationRubric: def score(self, *_): raise NotImplementedError`). The stub is replaced byte-for-byte in S5-02 and must not be merged to main without S5-02 landing in the same train.
- Bench cases. S5-03 and S5-04 land cases.
- `digests.yaml`. S5-05 signs cases; no cases exist yet.
- Cassette pin selection. Story-level decision is "every case will carry a 32-hex `cassette_canary_pin`"; the values are the cases' problem (S5-03/04).
- Wiring into `codegenie eval run`. Already wired by S4-02; this story does not modify CLI or runner code.
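The one-line rubric stub quoted in the first bullet is shorthand; spelled out as a valid module body it could look like the sketch below. The `score` signature is an assumption, since the real rubric interface is defined by S5-02 and the Step 1 contracts.

```python
from typing import NoReturn


class VulnRemediationRubric:
    """Placeholder so registration.py can import the name before S5-02 lands."""

    def score(self, *_args: object, **_kwargs: object) -> NoReturn:
        # Fail loudly if anything tries to score against the stub.
        raise NotImplementedError("VulnRemediationRubric is implemented in S5-02")
```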
Notes for the implementer
- The substring ban in ADR-0008 applies to values, not names. `STYLE_QUALITY = "llm_confidence"` is the failure mode the fence catches; a member named `STYLE_QUALITY` is harmless if its value is, e.g., `"style.quality"`. Reviewers reading `breakdown_keys.py` should be able to see every value at a glance, so keep them on one line each.
- Declaring `"silver": 25` in `min_cases_for_promotion` is an explicit ADR-0006 commitment that S5-04's 5 held-out cases must land before fence-CI passes. If S5-04 slips and you cannot ship 5 held-out cases in the same train, drop `"silver"` from `min_cases_for_promotion` (ship `{"bronze": 10}` only). Adding silver later is one line; shipping silver without the held-out floor fails fence-CI #3 and blocks the phase merge.
- The `failure_modes.yaml` initial taxonomy in ADR-0004 §Consequences is illustrative. Use it verbatim as the seed; future task classes (Phase 7) get their own taxonomy. The runner's "always-block" set (`sut.exception`, `sut.timeout`, `rubric.timeout`, `rubric.unknown_failure_mode`, `rubric.unknown_breakdown_key`, `rubric.malformed_output`) must appear in this file; the ADR-0004 §Tradeoffs row "Codes shared across task classes ... must be replicated per task class" is the rationale.
- The rubric forward-import is a known load-order quirk. Two acceptable resolutions: (a) ship S5-01 and S5-02 in the same PR; (b) stub `VulnRemediationRubric` in `rubric.py` as a `NotImplementedError`-raising class within this story, then have S5-02 replace the body. Pick whichever the team's review velocity supports.
- `bench/__init__.py` may also need to exist so `bench.vuln_remediation` is importable; the S2-01 loader's `sys.path` prep contract should already handle this. Verify before adding extra `__init__.py` files.
- Do not edit `src/codegenie/eval/**` beyond what's strictly needed to surface `failure_mode_descriptions`. Per CLAUDE.md "Extension by addition", this story should be near-zero touch to the harness package.