S8-03 Attempt Log — Eight Phase-2 CI lanes + three advisory bench canaries

Attempt 1 — 2026-05-18 (phase-story-executor / scheduled task)

Code shipped

New modules / files (25 new files):

src/codegenie/coordinator/_cpu_budget.py — pure effective_cpu_count() -> int reading CODEGENIE_FORCE_CPU_COUNT with fallback to os.cpu_count() or 1. Raises ValueError on non-positive-int values. (~35 LOC)
tests/_ci_support/__init__.py + tests/_ci_support/requires_tool.py — @requires_tool(name) decorator wrapping pytest.mark.skipif with a SKIPPED LOUD reason format and one-shot warnings.warn per missing tool per session.
tests/bench/_bench_kernel.py — pure compare_to_baseline, sum type Verdict = Ok | CommentOnly | Fail, Threshold dataclass; impure post_comment_if + exit_with_verdict + load_baseline. Three bench scripts compose this kernel (rule-of-three extraction; CLAUDE.md / story Note 17).
tests/bench/bench_portfolio_walltime.py — five-fixture cold + warm p50 bench; comment-only on ≥ 50 % regression.
tests/bench/bench_index_health_overhead.py — B2 walltime as a fraction of total cold gather on minimal-ts; comments when the fraction reaches ≥ 10 %, expressed in the kernel as comment_pct=100 (a 100 % regression over the 5 % baseline target).
tests/bench/bench_portfolio_walltime_hosted_runner.py — GATING; sets CODEGENIE_FORCE_CPU_COUNT=2 BEFORE coordinator import; Threshold(comment_pct=50, fail_pct=100, fail_p95_s=360).
tests/bench/baselines/portfolio_walltime.json + portfolio_walltime_hosted_runner.json — committed JSON with metadata header (refreshed_at, refreshed_by, reason) + measurements map.
tests/bench/baselines/README.md — refresh ritual documentation.
tests/bench/test_bench_portfolio_walltime_smoke.py — module-import + threshold-shape + subprocess-stubbed run() shape smoke.
tests/bench/test_bench_index_health_smoke.py — module-import + threshold-shape + metamorphic injection (sleep → fraction strictly increases).
tests/bench/test_baseline_has_metadata.py — three-key metadata header assertion + measurement-map shape (parametrized over both baselines).
tests/unit/ci/__init__.py + tests/unit/ci/_workflow_model.py — typed Pydantic WorkflowFile / Job / Step parser used by every workflow-YAML test.
tests/unit/ci/test_workflow_yaml.py — AC-1/2/4/7 subset + matrix + xdist veto + advisory bench step + fork-PR write perm + needs:[fence] chain.
tests/unit/ci/test_requires_tool_decorator.py — AC-3 decorator contract.
tests/unit/ci/test_adv_phase02_load_bearing.py — AC-5 (continue-on-error veto + 8-file presence + ≥ 1-collected-test-per-file via pytest --collect-only).
tests/unit/ci/test_mypy_global_warn_unreachable.py — AC-6 global warn_unreachable = true + no-override + no-CLI-flag.
tests/unit/ci/test_bench_collection_guard_unchanged.py — AC-7b guard threshold + no pytest.mark.bench on new scripts.
tests/unit/ci/test_hosted_runner_bench_thresholds.py — AC-10b parametrized boundary tests for compare_to_baseline (≥ 100 %, > 360 s inclusivity).
tests/unit/ci/test_bench_nightly_workflow.py — AC-10c cron / runs-on pin / CODEGENIE_FORCE_CPU_COUNT / pull-requests:write / workflow_dispatch.
tests/unit/ci/test_contract_freeze_allowlist.py — AC-11 allowlist + --check flag drift detection + parametrized rejection of non-allowlisted fields.
tests/unit/ci/test_no_xdist_anywhere.py — AC-13 workflow + pyproject scan + metamorphic injection.
tests/unit/coordinator/test_cpu_budget.py — AC-10a env-var contract (18 cases including value-error message naming the env-var).
.github/workflows/bench-nightly.yml — UTC cron 0 4 * * *, pinned ubuntu-24.04, CODEGENIE_FORCE_CPU_COUNT: "2", workflow_dispatch.
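
The env-var contract described for `_cpu_budget.py` can be sketched as follows. This is a minimal illustration rather than the shipped module: treating an empty value as "no override" and rejecting whitespace-padded values are assumptions, since the log lists those cases without stating their outcome.

```python
import os

_ENV_VAR = "CODEGENIE_FORCE_CPU_COUNT"


def effective_cpu_count() -> int:
    """Return the CPU budget, honouring the CODEGENIE_FORCE_CPU_COUNT override."""
    raw = os.environ.get(_ENV_VAR)
    if raw is None or raw == "":
        # Assumption: an unset or empty variable means "no override".
        return os.cpu_count() or 1
    # Accept plain ASCII decimal digits only, so "1.5", "2x", "0x2", "-1"
    # and whitespace-padded values are all rejected.
    if not (raw.isascii() and raw.isdigit()) or int(raw) == 0:
        # Name the env var in the message so the failure is self-explanatory.
        raise ValueError(f"{_ENV_VAR} must be a positive integer, got {raw!r}")
    return int(raw)
```

Because the function reads the environment on every call, a bench script only needs to set the variable before the coordinator asks for its budget.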

Modified:

.github/workflows/ci.yml — extended with seven new top-level jobs (contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench); the legacy lint/typecheck/test/security/fence jobs are preserved unchanged. Every new lane has needs: [fence] so a closure-fence violation short-circuits the workflow.
src/codegenie/coordinator/coordinator.py — line 489 cpu = os.cpu_count() or 1 → cpu = effective_cpu_count(); removed unused import os.
scripts/regen_probe_contract_snapshot.py — added _PROBE_CONTEXT_FIELD_ALLOWLIST constant + _enforce_probe_context_allowlist() enforcement + _display_snapshot_path() helper + --check mode in main(argv=None) that diffs against the committed snapshot and returns exit 1 on drift.
tests/unit/test_ci_workflow.py — REQUIRED_JOBS expanded from the legacy 6-set to the 13-set (legacy ∪ Phase-2). Pre-existing parser tests now assert the wider set per S8-03 AC-1 / arch §"CI gates".

Per-AC evidence

AC	Evidence
AC-1	`tests/unit/ci/test_workflow_yaml.py::test_required_subset_present`, `test_legacy_jobs_preserved`, `test_phase2_lane_runs_on_python_311_and_312` (parametrized over 7 lanes). 9 passing. `tests/unit/test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs` updated to expect the 13-job set.
AC-2	`test_unit_lane_serial_and_no_cov`. Unit lane invokes `pytest tests/unit/ -q --no-cov` with `timeout-minutes: 5`.
AC-3	`tests/unit/ci/test_requires_tool_decorator.py` — 8 tests (mark type, SKIPPED LOUD literal, present/missing branches, warn-once-per-session, parametrize-composability, real lookup smoke). 8 passing.
AC-4	`test_portfolio_serial_budget` — asserts `timeout-minutes ≤ 7` and no-xdist on `portfolio` lane.
AC-5	`tests/unit/ci/test_adv_phase02_load_bearing.py` — 18 tests: continue-on-error veto + each of 8 files exists + each has ≥ 1 collected test via `pytest --collect-only` + no extra files. Also verified locally: `.venv/bin/pytest tests/adv/phase02/test_stale_scip_fixture.py` passes (1 passed).
AC-6	`tests/unit/ci/test_mypy_global_warn_unreachable.py` — 3 tests: global setting True, no override disables it, mypy lane does NOT pass `--warn-unreachable` on CLI. AC-6b ritual run locally: removed the `IndexerError` case from `src/codegenie/report/confidence_section.py` via `sed`, ran `mypy --strict src/codegenie/report/confidence_section.py`, captured `error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]`. Restored from backup; mypy clean. The global `warn_unreachable=true` is what fires this — confirmed end-to-end.
AC-7	`test_bench_advisory`, `test_bench_lane_runs_new_bench_scripts`, `test_bench_lane_grants_pr_write`. AC-7b: `tests/unit/ci/test_bench_collection_guard_unchanged.py` — 4 tests (threshold literal preserved, new scripts have no bench marker). AC-7c: bench lane has `if: github.event.pull_request.head.repo.fork == false` on the comment step; fork-PR alt step prints `::warning::` and still uploads artifact.
AC-8	`tests/bench/bench_portfolio_walltime.py` — five-fixture cold+warm p50 (5 runs each) via the kernel's `compare_to_baseline`. Smoke test asserts shape with stubbed subprocess. `tests/bench/test_baseline_has_metadata.py` — 5 tests passing (3 metadata-keys × 2 baselines + ISO-8601-UTC).
AC-9	`tests/bench/bench_index_health_overhead.py` — monkeypatches `IndexHealthProbe.run` to capture B2 walltime, computes fraction of total. Smoke test includes metamorphic check (fraction strictly increases when injected delay grows).
AC-10a	`tests/unit/coordinator/test_cpu_budget.py` — 18 cases including positive ints, empty string, non-int (`abc`/`1.5`/`two`/`2x`/spaces/`0x2`), non-positive (`0`/`-1`/`-100`), os.cpu_count=None fallback, error-message-names-env-var, and the structural coordinator-uses-effective_cpu_count check.
AC-10b	`tests/unit/ci/test_hosted_runner_bench_thresholds.py` — 19 parametrized cases over regression `{-10, 0, 49.9, 50, 99, 99.999, 100, 101, 500}` and p95 `{0, 359, 360, 360.001, 361, 1000}`. Confirms `≥ 100 %` and `> 360 s` boundaries (inclusive/strict per arch §Gap 2).
AC-10c	`tests/unit/ci/test_bench_nightly_workflow.py` — 6 tests: cron `0 4 * * *`, `ubuntu-24.04` pinned, `CODEGENIE_FORCE_CPU_COUNT: "2"` at job level, `pull-requests: write`, `workflow_dispatch`-able, runs only the hosted-runner bench.
AC-11	`tests/unit/ci/test_contract_freeze_allowlist.py` — 9 tests: allowlist contents, committed snapshot has `image_digest_resolver`, parametrized rejection of 4 non-allowlisted fields with `02-ADR-0004` substring, `--check` exit 0 on master, `--check` exit 1 on drift, CI lane invokes `--check`.
AC-12	`make lint-imports` green (2 kept, 0 broken). `mypy --strict src/` green (135 source files). `ruff check` + `ruff format --check` clean on all touched files. Fence test passes (9 tests). No new `anthropic`/`openai`/`langgraph`/`httpx`/`requests`/`socket` imports introduced — `_cpu_budget.py` imports `os` only.
AC-13	`tests/unit/ci/test_no_xdist_anywhere.py` — 4 tests: per-workflow scan (2 workflows), pyproject `addopts` scan, metamorphic injection (regex must match a planted `-n 4`). The initial regex did not cover every character that can follow `-n`; the metamorphic test failed (red), and the pattern was corrected to `-n[\s\d]` plus `\b`-anchored alternatives.
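
The AC-3 decorator contract in the table above could be sketched along these lines. The `shutil.which` lookup, the exact reason wording after the `SKIPPED LOUD` literal, and the `missing_tool_reason` helper are assumptions for illustration; the log only fixes the `pytest.mark.skipif` wrapping and the warn-once-per-tool behaviour.

```python
import shutil
import warnings
from typing import Optional

# Tools already warned about in this process, so each warning fires once per session.
_WARNED = set()


def missing_tool_reason(name: str) -> Optional[str]:
    """Return a skip reason if `name` is not on PATH, else None."""
    if shutil.which(name) is not None:
        return None
    reason = f"SKIPPED LOUD: required tool {name!r} not found on PATH"
    if name not in _WARNED:
        _WARNED.add(name)
        warnings.warn(reason, stacklevel=2)
    return reason


def requires_tool(name: str):
    """Decorator: skip the test, loudly, when `name` is missing."""
    import pytest  # lazy import so the helper above stays usable without pytest

    reason = missing_tool_reason(name)
    return pytest.mark.skipif(reason is not None, reason=reason or f"{name} available")
```

A test would then be written as `@requires_tool("some-tool")` above the test function, and the mark composes with `pytest.mark.parametrize` like any other skip mark.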

Conflict surfaced + resolution (CLAUDE.md Rule 7)

Existing test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs enforces set-equality, but AC-1 reshapes the job set additively to 13. Two patterns in the codebase: the existing test claimed "exactly six required jobs"; the new arch §"CI gates" prescribes eight named lanes. Resolution: the existing test was updated to expect the union (_LEGACY_JOBS | _PHASE2_JOBS), with a comment pointing at tests/unit/ci/test_workflow_yaml.py for the per-lane invariants. The newer arch doc (more recent) wins; the older intent (exhaustive equality, no surprises) is preserved by promoting the set to the union rather than relaxing to a subset. This honors CLAUDE.md Rule 7: pick the more recent, surface the older as part of the same change.

Out-of-scope finding (Rule 3 — surgical changes)

The branch the scheduled task landed on (codex/tier1-architecture-cleanup) carried substantial uncommitted WIP — an AGENTS.md, a docs/reviews/ tree, a 0039 ADR, a new docs_consistency.py module, and a parallel tests/unit/test_docs_consistency.py. None of this is referenced by S8-03 or its hardened ACs. The WIP was preserved via git stash push -u so the parallel agent's work isn't lost; S8-03 lands on a fresh branch off master (feat/phase2-s8-03-ci-jobs-and-benches) to keep the blast radius scoped to S8-03's prescribed surface.

Refactor decisions (design-patterns lens)

Rule-of-three extraction. Three bench scripts duplicated baseline-load + ratio-compute + comment + exit. tests/bench/_bench_kernel.py owns the pure decision (compare_to_baseline returning a Verdict sum type) AND the impure shell (post_comment_if, exit_with_verdict, load_baseline). Adding a fourth bench in Phase 3+ requires zero edits to the kernel — compose a new Threshold instance and dispatch. Functional-core / imperative-shell, project-wide convention.
Sum type for verdict. Verdict = Ok | CommentOnly | Fail (frozen, slots, kw_only) replaces a boolean-pair return. compare_to_baseline is pure; the impure shell pattern-matches the verdict. Mirrors Fresh|Stale discipline elsewhere in the codebase.
Newtype-discipline-by-extension. Threshold is a frozen dataclass; thresholds for the three benches are module-level Final instances. Mixing thresholds (e.g., passing the hosted-runner's gating thresholds to the advisory dev-laptop bench) is impossible because of how each script constructs its _THRESHOLDS.
Strategy seam via dispatch tables, not branching. The compare_to_baseline kernel composes regression + p95 thresholds; both can be None to opt out. No if "is_hosted_runner": ... switches. This mirrors the project-wide preference for data-driven registries over branching code.
Typed Pydantic workflow loader. Every CI test in this story uses WorkflowFile.from_path(...) instead of yaml.safe_load(...) + dict.get — type errors caught by mypy, malformed YAML caught immediately, and the parser handles the PyYAML True/"on" boolean-key surface in exactly one place.
Open/Closed. The contract-freeze allowlist is an explicit frozenset constant. Adding a new Phase-3+ field requires editing the allowlist AND landing an ADR. The test parametrizes over 4 fake-field names to prove the 02-ADR-0004 pointer is the consistent failure mode.

Gates

Gate	Status
`mypy --strict src/`	✓ no issues found in 135 source files
`ruff check` (touched files)	✓ all checks passed
`ruff format --check` (touched files)	✓ all formatted
`lint-imports`	✓ 2 kept, 0 broken
`fence` (`tests/unit/test_pyproject_fence.py`)	✓ 9 passed
Full unit suite (`tests/unit/`)	✓ 3432 passed, 16 skipped, 1 xfailed
AC-6b ritual (sabotage confidence_section.py)	✓ mypy fires `error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]`; restored from backup; clean

Adaptations + deviations from the hardened story

AC-9 threshold modelling. Story prescribed "≥ 10 % posts a PR comment". The kernel's Threshold.comment_pct is computed as percentage-regression-vs-baseline, not absolute. The bench uses _BASELINE = {"minimal-ts/b2_fraction": 0.05} and comment_pct=100.0, so when the measured fraction ≥ 2× baseline (i.e. ≥ 10 % of total walltime) the kernel returns CommentOnly. Mathematically equivalent to the story's prescription; surfaced here so a future reviewer can trace the indirection.
@requires_tool warning emission. The story said "emit a structlog warning OR warnings.warn(...)". Implementation uses warnings.warn because the _ci_support/ package must not depend on codegenie.* (the integration lane is the consumer, not a provider). The warning fires once-per-tool-per-session via a module-level set.
bench lane gh pr comment step. Per AC-7c, the if: guard on the comment step is github.event.pull_request.head.repo.fork == false || github.event_name == 'push' so the bench still runs on push to master (no PR context). Fork PRs hit the explicit alt step that prints ::warning:: and still runs the bench for artifact-only inspection.
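
The AC-9 equivalence is easy to verify by hand: with baseline b = 0.05 and comment_pct = 100, the condition (m - b) / b × 100 ≥ 100 reduces to m ≥ 2b = 0.10. A short check, using hypothetical helper names and the two constants quoted above:

```python
_BASELINE_FRACTION = 0.05  # target: B2 takes 5 % of total cold gather
_COMMENT_PCT = 100.0       # comment at >= 100 % regression vs baseline


def regression_pct(baseline: float, measured: float) -> float:
    """Percentage regression of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline * 100.0


def should_comment(measured_fraction: float) -> bool:
    """True once the measured B2 fraction is at least double the baseline."""
    return regression_pct(_BASELINE_FRACTION, measured_fraction) >= _COMMENT_PCT
```

So a measured fraction of 0.09 stays silent, while 0.10 and anything above it triggers the comment.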

Lessons for follow-on stories

The hardened story is exceptionally long and densely cross-referenced; the validator already collapsed 14 prescriptions into a coherent shape. A future S8-04 implementer should read this attempt log and the validation report _validation/S8-03-ci-jobs-and-benches.md together before opening the story file. Otherwise the "subset, not equality" / "additive, not replacement" reshape is easy to miss.
The xdist-veto regex needs both \s and \d after -n because pytest -n4 (no space) is the common typo. The metamorphic test caught the gap; if a future workflow grows another parallel-invocation flag, copy the metamorphic pattern from tests/unit/ci/test_no_xdist_anywhere.py.
The PyYAML True/"on" boolean-key surface is documented in the workflow loader; do not paper over it with yaml.SafeLoader.add_constructor mutations — that hides the surface from a future contributor opening their first workflow YAML.
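
The metamorphic pattern for the xdist veto can be sketched as follows. Only the `-n[\s\d]` fragment comes from this log; the lookbehind, the `--numprocesses` and `xdist` alternatives, and the sample workflow line are assumptions. The scan is deliberately coarse and would also flag an unrelated `-n` flag such as `echo -n`.

```python
import re

# -n followed by whitespace or a digit covers both "-n 4" and "-n4".
# The lookbehind keeps "--no-cov" and similar long options from matching.
_XDIST_PATTERN = re.compile(r"(?<![\w-])-n[\s\d]|--numprocesses\b|\bxdist\b")


def uses_xdist(text: str) -> bool:
    """Return True if the text appears to request parallel pytest execution."""
    return _XDIST_PATTERN.search(text) is not None


def inject(text: str, flag: str) -> str:
    """Metamorphic transform: plant a parallelism flag into a clean command."""
    return text.replace("-q", f"-q {flag}", 1)


CLEAN_STEP = "run: pytest tests/unit/ -q --no-cov\n"
```

The point of the metamorphic check is to prove that the scanner can fail: a clean input must pass, and the same input with a planted flag must be caught, for every spelling of the flag that matters.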