Story S2-01 — Bench import-path resolution (load_task_class)¶
Step: Step 2 — Build harness internals: loader, cache, audit chain extension, canary + cost-tag shims Status: Ready Effort: S Depends on: S1-02, S1-03 ADRs honored: ADR-0001 (no in-process rubric import surface), Phase 5 ADR-0006 (Protocol convention upstream of registry)
Context¶
The synthesis docs hand-wave _codegenie_bench.{name}.registration (final-design.md §Components → loader.py); bench/ lives at repo root and isn't inside src/codegenie/, so the import does not resolve as written. phase-arch-design.md §Gap analysis & improvements §Gap 2 picks Option A: prepend the parent of bench/ to sys.path and import bench.{name}.registration directly (no synthesized prefix), so bench/ becomes an implicit namespace package. This story implements that contract — the first concrete loader entry point, with the side-effect import that triggers @register_task_class("<name>") exactly once and returns the resolved TaskClass.
References — where to look¶
- Architecture:
../phase-arch-design.md §Component design — src/codegenie/eval/loader.py— public-interface signatures (load_task_class,load_cases); side-effect-import idempotence note../phase-arch-design.md §Gap analysis & improvements §Gap 2— full rationale for Option A vs MetaPathFinder; the OQ #3 fallback if packaging conflicts surface../phase-arch-design.md §Control flow(Happy path narrative) — theRunner.plan()call site that invokesload_task_class- Phase ADRs:
../ADRs/0001-rubric-execution-isolation-via-subprocess.md— loader must never importbench/{name}/rubric.py; onlyregistration.pyis in-process- Source design:
../final-design.md §Components → loader.py— original (hand-wavy) statement of the import target- Existing code:
src/codegenie/eval/registry.py(S1-03) —default_registry,@register_task_class; the side-effect targetsrc/codegenie/eval/models.py(S1-02) —TaskClassshape returned to the callersrc/codegenie/eval/errors.py(S1-01) —TaskClassNotFound,TaskClassAlreadyRegistered
Goal¶
codegenie.eval.loader.load_task_class(name, bench_root) resolves bench/{name}/registration.py via sys.path prep (Option A), triggers @register_task_class exactly once, and returns the registered TaskClass; second call with the same name is a no-op that returns the same instance.
Acceptance criteria¶
- [ ]
load_task_class(name: str, bench_root: Path = Path("bench")) -> TaskClassis importable fromcodegenie.eval.loaderand exported bycodegenie.eval.__init__'s loader-internal seam (loader is internal scaffolding; the function itself is not in the ≤9 public-name list). - [ ] On first call,
load_task_class("vuln-remediation", tmp_bench_root)runsbench/vuln-remediation/registration.py's module body exactly once (verified by an assertion-counter side-effect in the fixture's registration), inserting the parent ofbench_roottosys.path[0]if not already present and removing it fromsys.modulescleanup is not required (idempotence on the registry side handles re-runs). - [ ] On second call with the same
(name, bench_root), no module re-execution occurs (importlib.import_modulereturns the cached module fromsys.modules); the function still returns the sameTaskClassinstance fromdefault_registry. - [ ] If the registration import succeeds but does not register
name, raiseTaskClassNotFound(name, looked_up_in="bench.<name>.registration")— guards against aregistration.pytypo where the decorator argument doesn't match the directory name. - [ ] If
bench/{name}/registration.pydoes not exist, raiseBenchCaseLoadError(case_dir=bench_root / name, field="registration.py", reason="file not found")(orTaskClassNotFound— pick one and document; test asserts the chosen typed exit). - [ ] Hyphenated task-class names (
migration-chainguard-distroless) work: the loader translates the directory name to its Python-import-safe form (migration_chainguard_distroless) forbench.{module}.registrationresolution, while the registered task-class string stays hyphenated. - [ ]
sys.pathmutation is bounded: the parent-of-bench_rootentry is inserted only if missing; repeatedload_task_classcalls do not growsys.path(idempotent insert). - [ ] TDD red test exists, is committed, and passes green.
- [ ]
ruff format,ruff check,mypy --strictclean onsrc/codegenie/eval/loader.pyand the test file.
Implementation outline¶
- Create
src/codegenie/eval/loader.pywith module-level docstring naming Option A from Gap #2. - Implement a private
_prep_bench_sys_path(bench_root: Path) -> Nonethat resolvesbench_root.parent.resolve(), inserts it atsys.path[0]if not already present, and returns nothing. - Implement
load_task_class(name: str, bench_root: Path = Path("bench")) -> TaskClass: - Call
_prep_bench_sys_path(bench_root). - Translate
name→module_name = name.replace("-", "_"). importlib.import_module(f"bench.{module_name}.registration")(catchModuleNotFoundError→ raise the chosen typed error).- Look up
name(the original hyphenated form) indefault_registry; if missing, raiseTaskClassNotFound. - Return the
TaskClass. - Add a
__all__ = ("load_task_class", "load_cases")placeholder (S2-02 addsload_cases).
TDD plan — red / green / refactor¶
Red¶
Test file: tests/unit/eval/test_loader_import_path.py
def test_load_task_class_triggers_registration_exactly_once(tmp_path, monkeypatch):
# Arrange: build a fake bench/ at tmp_path/bench/<name>/registration.py
# whose module body increments a counter and decorates a TaskClass.
# Act: call load_task_class twice.
# Assert:
# - counter == 1 after first call (module body ran)
# - counter == 1 after second call (sys.modules cached)
# - both returned values are the same object (identity)
...
def test_load_task_class_hyphen_to_underscore_translation(tmp_path):
# name="migration-chainguard-distroless" must import
# bench.migration_chainguard_distroless.registration
...
def test_load_task_class_missing_registration_raises_typed(tmp_path):
# bench/<name>/ exists but no registration.py → typed error (TaskClassNotFound or BenchCaseLoadError)
...
def test_load_task_class_registration_doesnt_register_name(tmp_path):
# registration.py imports but never calls @register_task_class("<name>")
# → TaskClassNotFound(name, looked_up_in=...)
...
def test_sys_path_insert_is_idempotent(tmp_path):
# Two load_task_class calls do not grow sys.path.
...
Green¶
Smallest impl: the four steps in §Implementation outline; ~25 lines of code.
Refactor¶
- Add type hints and a docstring quoting Gap #2's Option A decision.
- Inline-comment the hyphen-to-underscore translation (load-bearing for
migration-chainguard-distroless). - Add a structlog
infoeventloader.task_class_loadedwithnameandbench_rootattributes for traceability (matches Phase 0's logging convention). - Defensive: catch a bare
ImportErrorfromimport_moduleand surface a typed error chained viaraise ... from.
Files to touch¶
| Path | Why |
|---|---|
src/codegenie/eval/loader.py |
New module — Option A sys.path prep + load_task_class |
src/codegenie/eval/__init__.py |
Add internal-seam re-export if needed for tests (load_task_class is internal — loader is scaffolding) |
tests/unit/eval/test_loader_import_path.py |
Red tests for import-path resolution |
tests/fixtures/bench/stub_task_class/registration.py |
Counter-fixture registration used by tests |
Out of scope¶
- Case loading and digest verification — handled by S2-02 (
load_cases). - MetaPathFinder fallback (Option B) — surfaces only if Option A causes packaging conflicts in CI; tracked as OQ #3.
- Bench-root discovery from CWD — caller passes
bench_root; auto-discovery is a CLI concern (S4-01/S4-02).
Notes for the implementer¶
bench/does not need__init__.py— implicit namespace packages (PEP 420) work for our case as long as the parent dir is onsys.path.- Don't import
bench/{name}/rubric.pyfrom anywhere reachable here — ADR-0001 says the rubric is subprocess-only, even thoughregistration.pymay reference the rubric path. - The fixture in
tests/fixtures/bench/stub_task_class/doubles as a sanity check for Option A; keep it minimal so Step 3's runner can reuse it. - If
bench_root.parent.resolve()differs frombench_root.parent(symlinks), prefer the resolved form forsys.path— avoids "imported twice under different names" subtlety. name.replace("-", "_")is the only place we cross between the user-facing slug and Python's module name; surface this clearly so future curators don't accidentally use underscores in@register_task_class("...").- Phase 0's
codegenie/probes/registry pattern (@register_probe) is the closest precedent; the difference isbench/lives outsidesrc/codegenie/, which is exactly what Gap #2 calls out.