Story S6-01 — bench/migration-chainguard-distroless/ registration, taxonomies, and stub rubric¶
Step: Step 6 — Seed bench/migration-chainguard-distroless/
Status: Ready
Effort: M
Depends on: S5-05 (the worked vuln-remediation pattern to mirror)
ADRs honored: ADR-0001 (subprocess rubric), ADR-0004 (failure-mode taxonomy), ADR-0006 (curation-class — Phase 7 will own the silver eligibility surface), ADR-0008 (BreakdownKey StrEnum + substring ban)
Context¶
Phase 7 introduces migration-chainguard-distroless as the second task class without editing any Phase 0–6 source — the extension-by-addition invariant in CLAUDE.md. For that to work, Phase 6.5 must ship a complete directory skeleton (registration.py, breakdown_keys.py, failure_modes.yaml, rubric.py) that fence-CI is already asserting against and that mirrors the bench/vuln-remediation/ pattern landed in S5. The rubric is a stub: at N=3 we only need to demonstrate the subprocess contract and the Dockerfile-derived signals; Phase 7 will harden scoring as the corpus grows.
The registration declares only bronze: 10 in min_cases_for_promotion — silver/gold are Phase 7's call. This deliberately keeps fence-CI assertion #3 (held-out floor ≥ 5 for tier ≥ silver) inactive for this task class until Phase 7 raises the bar.
References — where to look¶
- Architecture:
../phase-arch-design.md §"bench/{task-class}/ directory contract"— the directory shape fence-CI asserts.../phase-arch-design.md §"What new task classes will need" §Step 1(bench/migration-chainguard-distroless/ subsection).../phase-arch-design.md §"Component design → loader.py"— how_load_breakdown_keysand_load_failure_mode_taxonomyconsume these files.../phase-arch-design.md §"Scenarios → Scenario 3"— fence-CI walkingbench/*/registration.py.- Phase ADRs:
../ADRs/0001-rubric-execution-isolation-via-subprocess.md—if __name__ == "__main__"entrypoint contract, JSON over stdin/stdout, no module-level side effects.../ADRs/0004-per-task-class-failure-modes-taxonomy.md—failure_modes.yamlschema (severity ∈ {block, warn, info}, non-emptydescription).../ADRs/0006-curation-class-split-with-fence-ci-held-out-floor.md— why we declare onlybronzehere.../ADRs/0008-breakdown-keys-strenum-with-substring-ban.md—BreakdownKeyStrEnum value-level substring ban.- Production ADRs:
../../../production/adrs/0008-objective-signal-trust-score.md— facts not judgments; the Dockerfile-derived signals are the right shape.- Source design:
../High-level-impl.md §"Step 6". - Existing precedent:
bench/vuln-remediation/{registration.py, rubric.py, breakdown_keys.py, failure_modes.yaml}from S5-01/S5-02.
Goal¶
Land the four task-class artifact files (registration.py, breakdown_keys.py, failure_modes.yaml, rubric.py) plus bench/migration-chainguard-distroless/tests/test_rubric_unit.py so that loader.load_task_class("migration-chainguard-distroless") succeeds and fence-CI assertions #1, #4, #5, #6 pass even with zero cases.
Acceptance criteria¶
- [ ]
bench/migration-chainguard-distroless/registration.pycontains exactly one@register_task_class("migration-chainguard-distroless", bench_path=..., min_cases_for_promotion={"bronze": 10})call;silver/goldare absent (Phase 7's surface). - [ ]
bench/migration-chainguard-distroless/breakdown_keys.pydeclaresclass BreakdownKey(StrEnum)whose members include at leastBASE_IMAGE_SWAPPED = "base_image_swapped",SHELL_FREE = "shell_free",BUILD_PASSES = "build_passes"; every member value is anast.Constantliteral (ADR-0008 strictness). - [ ]
bench/migration-chainguard-distroless/failure_modes.yamldeclares at leastmigration.base_image_not_chainguard,migration.shell_invocation_present,migration.build_failed, each withseverity ∈ {block, warn, info}and a non-emptydescription(ADR-0004). - [ ]
bench/migration-chainguard-distroless/rubric.pyexposesif __name__ == "__main__":reading a JSON{"case": ..., "harness_output": ...}payload from stdin and writing aBenchScoreJSON to stdout in ≤ 60 s wall-clock; the module has no module-level I/O or import side effects (ADR-0001). - [ ]
loader.load_task_class("migration-chainguard-distroless", bench_root=...)returns aTaskClasswhosebreakdown_keysfrozenset equals{m.value for m in BreakdownKey}and whosefailure_modesmap covers every YAML entry. - [ ] No breakdown-key value contains
confidence|llm|self_reported|model_says;tests/unit/test_breakdown_keys_static.pyextended to walk this task class proves it (or the existing fence already does — verify which). - [ ] The red test from §TDD plan exists, was committed at red, and is now green.
- [ ]
ruff format --check,ruff check,mypy --strictclean on touched files;pytest bench/migration-chainguard-distroless/tests/and the new unit test pass.
Implementation outline¶
- Mirror
bench/vuln-remediation/skeleton; copy structure not content. - registration.py — one decorator call; literal task-class slug;
min_cases_for_promotion={"bronze": 10}only; no other side effects at import time. - breakdown_keys.py —
StrEnum BreakdownKeywith the three required members above (Phase 7 may add e.g.,RUNTIME_CAPABILITY_MATCHlater). Values must pass the substring ban. - failure_modes.yaml — YAML mapping
code → {severity, description}for the three required codes plus any rubric-internal ones the stub might emit (e.g.,migration.dockerfile_unparseable). - rubric.py — subprocess entrypoint:
- read stdin JSON,
- parse
harness_output.dockerfile(the SUT's produced Dockerfile string), - compute three boolean signals: (a)
FROMline points at acgr.dev/chainguard/*image (BASE_IMAGE_SWAPPED); (b) noRUN sh|bash|/bin/shinvocations in the produced Dockerfile (SHELL_FREE); (c)expected/build.loglast line saysSuccessfully builtor equivalent if present (BUILD_PASSES), - emit
BenchScore(score=mean_of_three_booleans, breakdown={...}, failure_mode_code=None or "migration.base_image_not_chainguard" etc., cost_usd=0.0)to stdout. - Determinism is on the bench-author's shoulders — pure string parsing, no clocks, no network.
- tests/test_rubric_unit.py — in-process import + direct
score(case, harness_output)calls covering: all-pass, swap-not-done, shell-present, build-failed; assert breakdown keys equalBreakdownKeyenum values; assertfailure_mode_coderesolves againstfailure_modes.yaml. - Update
tests/unit/test_breakdown_keys_static.pyto assert the new task class's enum values also pass the substring ban (defense-in-depth — fence will catch it but unit test should be loud too).
TDD plan — red / green / refactor¶
Red¶
Test file path: tests/unit/test_distroless_registration.py
# tests/unit/test_distroless_registration.py
from pathlib import Path
import pytest
from codegenie.eval.loader import load_task_class
BENCH_ROOT = Path(__file__).resolve().parents[2] / "bench"
def test_distroless_task_class_registers_with_only_bronze_floor():
tc = load_task_class("migration-chainguard-distroless", bench_root=BENCH_ROOT)
assert tc.name == "migration-chainguard-distroless"
assert tc.min_cases_for_promotion == {"bronze": 10}
# Silver/gold absent — Phase 7's surface (ADR-0006 held-out floor inactive here).
assert "silver" not in tc.min_cases_for_promotion
assert "gold" not in tc.min_cases_for_promotion
def test_distroless_breakdown_keys_cover_required_signals():
tc = load_task_class("migration-chainguard-distroless", bench_root=BENCH_ROOT)
required = {"base_image_swapped", "shell_free", "build_passes"}
assert required.issubset(tc.breakdown_keys)
# ADR-0008 substring ban applies uniformly.
for value in tc.breakdown_keys:
for banned in ("confidence", "llm", "self_reported", "model_says"):
assert banned not in value
def test_distroless_failure_modes_have_severities_and_descriptions():
tc = load_task_class("migration-chainguard-distroless", bench_root=BENCH_ROOT)
for code in ("migration.base_image_not_chainguard",
"migration.shell_invocation_present",
"migration.build_failed"):
fm = tc.failure_modes[code]
assert fm.severity in {"block", "warn", "info"}
assert fm.description.strip() != ""
def test_distroless_rubric_is_subprocess_entrypoint_only():
import importlib, sys
# No module-level side effects — importing should not run scoring.
mod = importlib.import_module("_codegenie_bench.migration_chainguard_distroless.rubric")
# The module exposes a `__main__` guard, not a top-level `score()` invocation.
assert hasattr(mod, "__name__")
Run; confirm ModuleNotFoundError or TaskClassNotFound. Commit as red marker.
Green¶
Create the four artifact files + the bench-author test directory. Smallest shape: registration is one decorator call; breakdown_keys lists exactly the three members; YAML lists exactly the three codes; rubric is a main() reading stdin, parsing the Dockerfile string with regex, writing JSON to stdout.
Refactor¶
- Add docstrings citing ADR-0001, ADR-0004, ADR-0008.
- Confirm
mypy --strict bench/migration-chainguard-distroless/clean (rubric.py is a script; annotatemain() -> None). - Confirm the rubric's regex set is conservative (false-positive on shell detection is preferable to false-negative in a stub).
bench/migration-chainguard-distroless/README.mdplaceholder — actual N=3 verdict-context text lands in S6-03.
Files to touch¶
| Path | Why |
|---|---|
bench/migration-chainguard-distroless/registration.py |
New — @register_task_class literal-name call, bronze-only floor |
bench/migration-chainguard-distroless/breakdown_keys.py |
New — BreakdownKey StrEnum with three required values |
bench/migration-chainguard-distroless/failure_modes.yaml |
New — three required codes with severity + description |
bench/migration-chainguard-distroless/rubric.py |
New — subprocess entrypoint scoring Dockerfile-derived signals |
bench/migration-chainguard-distroless/tests/test_rubric_unit.py |
New — bench-author unit tests, in-process |
bench/migration-chainguard-distroless/__init__.py |
New — package marker for loader import (Option A sys.path prep) |
bench/migration-chainguard-distroless/cases/__init__.py |
New — package marker (empty; cases land in S6-02) |
tests/unit/test_distroless_registration.py |
New — pins registration + taxonomies contract |
Out of scope¶
- Seed cases (
cases/001-*,cases/002-*,cases/003-*,cases/digests.yaml) — S6-02 owns case curation and signing. - E2E
codegenie eval run+ N=3 verdict documentation — S6-03. - Hardening the rubric (multi-stage detection, build sandboxing, semver checks on Chainguard image tags) — Phase 7.
- Adding
silver/goldtomin_cases_for_promotion— Phase 7 raises the bar once ≥10 cases with ≥5 held-out exist.
Notes for the implementer¶
- Stub-quality is correct. The signals are coarse (regex on
FROMline, regex onRUN sh|bash). That is the Phase 6.5 commitment; Phase 7 expands. Resist gold-plating —Rule 2 Simplicity FirstandRule 3 Surgical Changes. - The
if __name__ == "__main__"discipline is load-bearing. The runner spawnspython rubric.pyas a subprocess (ADR-0001); any module-level import of e.g.,dockerwould either fail at subprocess startup or slow every case by hundreds of ms. Keep imports minimal — stdlib only is ideal. - Breakdown-key values are the substring-ban surface, not member names (ADR-0008).
BASE_IMAGE_SWAPPED = "base_image_swapped"— thevalueis what fence-CI walks. bench_pathin@register_task_classmust be thePath(__file__).parentofregistration.pyso the loader can locate sibling files. Mirror exactly howbench/vuln-remediation/registration.pydoes it (S5-01).- The runtime test_breakdown_keys_static.py auto-discovers registered task classes — if you wired registration correctly, the existing static test picks this task class up without edit. Verify before adding a new test; do not duplicate.
migration-chainguard-distrolessslug uses hyphens in@register_task_class("...")but the Python package directory must use underscores (migration_chainguard_distroless/) for Option Asys.path-prep imports. Loader handles the slug→module-name translation; mirror vuln-remediation's pattern.