Story S8-03 — Eight Phase-2 CI lanes + three advisory bench canaries (hosted-runner closes Gap 2)¶

Step: Step 8 — Confidence section renderer + CI ratchet + advisory benches + Phase-3 handoff Status: Done — GREEN 2026-05-18 (phase-story-executor; see _attempts/S8-03.md for the per-AC evidence table + AC-6b ritual mypy stderr capture). The hardened story's "8-name required subset" interpretation was honored: lint, typecheck, test, security, docs, fence legacy jobs preserved unchanged; seven new Phase-2 lanes (contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench) added additively under needs: [fence]. Three new bench scripts (NOT -m bench-marked — collection guard stays at 3) compose a shared tests/bench/_bench_kernel.py (sum-type Verdict = Ok | CommentOnly | Fail). CODEGENIE_FORCE_CPU_COUNT plumbed via src/codegenie/coordinator/_cpu_budget.py::effective_cpu_count(); coordinator.py:489 routes through it. scripts/regen_probe_contract_snapshot.py gained _PROBE_CONTEXT_FIELD_ALLOWLIST + --check mode (a third additive field raises ValueError naming 02-ADR-0004). tests/unit/test_ci_workflow.py::REQUIRED_JOBS updated from the legacy 6-set to the 13-set per arch §"CI gates" (Rule 7 conflict surface). 3432 unit tests pass; mypy --strict, ruff, lint-imports, fence all clean. Effort: M Depends on: S8-02 (CLI summary line ships; tests/integration/cli/ integration tests exist). Precondition story (must land first or be folded in): plumbing CODEGENIE_FORCE_CPU_COUNT into src/codegenie/coordinator/coordinator.py Semaphore sizing (today the coordinator reads os.cpu_count() directly at coordinator.py:489 with no env-var override — S1-08 did not ship this). This story folds the plumbing in if no separate precondition story is filed; see AC-10a/10b. ADRs honored: 02-ADR-0009 (pytest-xdist veto preserved — portfolio and adv-phase02 lanes are serial, no xdist anywhere); 02-ADR-0001 (ALLOWED_BINARIES Phase 2 extension is the union the integration lane depends on for tool-presence preflight); production ADR-0005 (no LLM in gather — the fence job stays green); 02-ADR-0004 (ProbeContext.image_digest_resolver is the one allowed widening; contract-freeze lane asserts the snapshot diff is exactly that field, no others).

Validation notes (2026-05-18 — phase-story-validator)¶

This story was hardened by phase-story-validator. The draft was substantially restructured because eleven of twelve ACs contradicted either the current CI shape, ADR-0008's discipline, or some named precondition that has not landed. The Goal — 8 Phase-2-named lanes + 3 bench scripts + Gap-2 hosted-runner closer — is sound and traces to phase-arch-design.md §"CI gates" + §"Gap analysis Gap 2" + ADR-0009 + ADR-0001 + ADR-0004. The prescriptions needed grounding:

Phantom job set. Draft AC-1 asserted set(jobs) == {fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench} (exactly 8). Current .github/workflows/ci.yml has {lint, typecheck, test, security, fence} (5 jobs). The literal equality would silently force deletion of lint and security — both load-bearing per their own arch docs. Resolution: the eight names from phase-arch-design.md §"CI gates" are a required subset of the workflow's job names, not an equality. lint and security (and any other Phase-0/1 jobs) coexist. AC-1 reworded as a subset assertion; AC-1b added asserting the legacy jobs (lint, security) stay present (no silent deletion).
contract-freeze job does not exist in current ci.yml. The probe contract test runs inside the test job (tests/unit/test_probe_contract.py). This story PROMOTES it to its own top-level job with its own per-PR diff surface. AC-11 reworded to ship that promotion + the field-allowlist mechanism (which scripts/regen_probe_contract_snapshot.py does NOT have today — verified by reading the file: it just walks the structural signature with no allowlist).
test job vs unit/integration/portfolio/adv-phase02 split. Current test job runs pytest -q --cov-report=json (all tests). The four Phase-2-named lanes are this story's split. Two reconciliation paths: (a) delete test, fan out into four lanes — preserves coverage but breaks the existing Per-module coverage carve-outs (ADR-0005) step and the bench-collection-guard step; (b) keep test as the umbrella with --cov + carve-outs (status quo), add the four Phase-2-named lanes alongside as additional lanes — the new lanes run subsets and assert specific gating semantics that test does not (no -x, hard fail on adversarial, etc.). Resolution: path (b) — add lanes; do not delete test. The four new lanes set --no-cov (avoiding double-counting against the global 85 % floor) and run on a subset filter. Documented as ADR amendment text in Notes for the implementer.
bench-collection-guard mismatch. .github/workflows/ci.yml:122-130 hardcodes expected exactly 3 bench tests (the S5-01 marker set). Adding three new bench scripts under tests/bench/ collected by -m bench would push count to 6 and fail the guard. Resolution: the three new bench scripts are NOT marked -m bench (they live alongside the existing trio under tests/bench/ but are invoked by their own job step pytest tests/bench/bench_*.py, NOT by -m bench). The collection guard stays at 3 (the existing S5-01 set). AC-7 explicitly preserves the count.
CODEGENIE_FORCE_CPU_COUNT is not plumbed. Story Notes line 147 hedged ("if S1-08 did NOT thread this all the way through, surface the gap loudly and file a follow-up"); the hedge proved correct. Verified by grep: src/codegenie/coordinator/coordinator.py:489 reads os.cpu_count() with no env-var override; no other src/ file references the constant. AC-10 split into AC-10a (plumbing — this story ships it, with a focused unit test) and AC-10b (bench script consumes the env-var). The plumbing is a 3-line change in coordinator.py + a wrapper in src/codegenie/coordinator/_cpu_budget.py (pure function: effective_cpu_count() -> int).
pyproject.toml mypy overrides. Draft AC-6 expected a 5-module [[tool.mypy.overrides]] block with warn_unreachable = true for codegenie.{indices, probes.layer_b.index_health, report, adapters, tccm}. Reality: pyproject.toml:172 enables warn_unreachable = true at the GLOBAL [tool.mypy] level, repo-wide. The S8-01 _attempts log already noted this: "warn_unreachable is enabled globally in pyproject.toml; S1-11 had the intent of a per-module override, but the project-wide setting is already in force — the deletion ritual proves the gate fires. No edits to pyproject.toml were needed." Resolution: AC-6 rewritten to assert the GLOBAL setting + a deletion-ritual smoke test (the S8-01 ritual generalized) instead of the phantom 5-module block. Out-of-scope's "Editing pyproject.toml's mypy overrides (S1-11 owns this)" stays — this story does not touch the mypy config.
tests/adv/phase02/ file count. Story said 7 adversarial files; reality has 8 (test_phase3_handoff_smoke.py is the 8th). AC-5 rewritten to enumerate the exact filenames and assert each has at least one collected test_ function (catches empty stubs, not just file existence).
@requires_tool decorator does not exist. Story AC-3 said "via a custom @requires_tool(name) decorator" as if it existed. Verified by grep: no such decorator in src/codegenie/ or tests/. The existing skip pattern is bare pytest.mark.skipif per test. AC-3 rewritten to own the decorator's creation in tests/_ci_support/requires_tool.py (new module) with its own contract test asserting the skip reason format.
"Phase 1 bench pattern" doesn't exist. Story said "reuses Phase 1's tests/bench/ baseline-comparison + PR-comment pattern verbatim". tests/bench/_helpers.py writes a bench-results.json artifact for upload — no baseline JSON file, no PR-comment automation. AC-8 + AC-9 rewritten to OWN the helper module (tests/bench/_bench_kernel.py) with the pure compare_to_baseline(...) -> Verdict function + impure post_comment_if(...) shell — design-patterns critic's recommendation.
Bench scripts share logic (rule-of-three threshold). Three bench scripts duplicate timing-harness + baseline-load + ratio-compute + comment-on-PR + threshold-decision. This crosses the rule-of-three threshold for extraction. AC-8 + AC-9 + AC-10b now consume a shared tests/bench/_bench_kernel.py (pure: compare_to_baseline(measurements, baseline, thresholds) -> Verdict, sum type Verdict = Ok | CommentOnly | Fail; impure: _post_comment + _exit_with_verdict). Adding a fourth bench in Phase 3+ requires zero edits to the kernel.
Threshold-check pure-function discipline. AC-10 tested only one boundary (200 %); a wrong implementation off-by-one at 100 % passes. Test-quality critic flagged this. AC-10b now parametrizes over [99 %, 100 %, 101 %, p95=359 s, p95=360 s, p95=361 s] and asserts the inclusivity convention explicitly (≥ 100 % AND > 360 s — the AND is corrected; original story said OR but matches arch §Gap 2's "or" wording — the synthesis pick is the arch's OR).
Workflow-YAML test brittleness. Draft used string-grep for pytest-xdist. Test-quality critic flagged that -nauto, --dist=loadfile, tox -p would slip through. Tests now load the workflow into a typed Pydantic WorkflowFile / Job / Step model (tests/unit/ci/_workflow_model.py) and run regex r'(?<!\w)(-n\s|--numprocesses|--dist|pytest-xdist)(?!\w)' against each step's run string; also greps pyproject.toml [tool.pytest.ini_options] addopts for the same patterns. Metamorphic test: monkeypatch-inject -n 4 into a parsed copy of the workflow → assertion fires.
Cron timezone unspecified. Draft cron: "0 4 * * *". GH Actions cron runs in UTC; an operator expecting PT/CET could be confused by 8-hour skew. AC-10c documents the convention (04:00 UTC == 21:00 PT prior day).
Fork-PR GH_TOKEN. Bench gh pr comment needs pull-requests: write; fork PRs from external contributors get a read-only GITHUB_TOKEN. AC-7c added: bench comment step degrades silently with a loud log on github.event.pull_request.head.repo.fork == true; the bench measurement still runs and writes the artifact.

Full critic findings + decision rationale archived at _validation/S8-03-ci-jobs-and-benches.md. Verdict: HARDENED.

Context¶

Phase 2 commits to eight named CI lanes in phase-arch-design.md §"CI gates": fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench. Three of those exist today (fence is its own job; unit-style coverage runs inside test; the bench step is inside test too) — the rest are new. This story lands the gap.

Of the eight, adv-phase02 is load-bearing — it gates every adversarial test from S4-02 (stale_scip), S5-05 (image_digest_drift), S5-06 (adversarial_dockerfile), S6-07 (secret_in_source), S6-07 (hostile_skills_yaml), S7-04 (concurrent_gather_race, no_inmemory_secret_leak), plus the Phase 3 handoff smoke (test_phase3_handoff_smoke.py). Eight files in tests/adv/phase02/ as of master. A failing test_stale_scip_fixture.py turns the build red; that is the roadmap exit criterion for Phase 2 ("IndexHealthProbe surfaces a real staleness case in CI against a deliberately-seeded fixture").

Three advisory bench canaries also land here. They never block merge — they comment on PRs (Phase 0 §3.2 advisory discipline). The third bench (bench_portfolio_walltime_hosted_runner.py) is the closer for Gap 2 from phase-arch-design.md §"Gap analysis": the developer-laptop bench (bench_portfolio_walltime.py) measures wall-clock on a beefy machine; the hosted-runner bench emulates the actual GitHub Actions cpu_count()=2 runner via CODEGENIE_FORCE_CPU_COUNT=2 and does have a build-fail threshold (≥ 100 % regression OR p95 > 360 s). That single bench is the one place in Phase 2 where bench failure is build-failure — because by then we're not advising, we've crossed the operational red line.

The mypy job is the runtime enforcer of S8-01's exhaustiveness ritual: mypy --strict repo-wide with warn_unreachable = true set in [tool.mypy] (already shipped, repo-wide, by S1-11 — verified by S8-01's _attempts log). A removed case in confidence_section.py produces a CI build error via the global flag — the per-module override scheme the draft prescribed proved unnecessary.

This story is the YAML, the bench scripts, the bench kernel (tests/bench/_bench_kernel.py), the CPU-count plumbing (src/codegenie/coordinator/_cpu_budget.py + coordinator wiring), the tool-presence decorator (tests/_ci_support/requires_tool.py), the contract-freeze allowlist (scripts/regen_probe_contract_snapshot.py extension), and the workflow-YAML typed test model (tests/unit/ci/_workflow_model.py). Eight ACs, seven new modules, one workflow file. Effort: M (was S; rescoped to honor the precondition + helper-extraction work the draft hand-waved).

References — where to look¶

Architecture:
../phase-arch-design.md §"CI gates" — the eight numbered Phase-2 jobs with their gating/advisory status.
../phase-arch-design.md §"Performance regression tests" — bench_portfolio_walltime.py and bench_index_health_overhead.py thresholds (≥ 50 % and ≥ 10 % comment-on-PR).
../phase-arch-design.md §"Gap analysis" Gap 2 — "hosted-runner bench closes the hidden-assumption #2"; Improvement subsection names bench_portfolio_walltime_hosted_runner.py, CODEGENIE_FORCE_CPU_COUNT=2, nightly cron, comment-on-PR ≥ 50 %, build-fail ≥ 100 % (> 360 s p95), escape valve (commit per-fixture .codegenie/cache/ blobs).
../phase-arch-design.md §"Adversarial tests" — the table of seven adversarial tests the adv-phase02 job aggregates (the actual file count on master is eight — see AC-5).
Phase ADRs:
../ADRs/0009-pytest-xdist-veto-preserved.md — no xdist anywhere. Portfolio and adversarial lanes stay serial.
../ADRs/0001-add-docker-and-security-cli-tools-to-allowed-binaries.md — the eleven additions; the integration lane's tool-presence preflight matrix.
../ADRs/0004-image-digest-as-declared-input-token.md — the contract-freeze snapshot regen permits exactly the image_digest_resolver field on ProbeContext; nothing else widens.
Production ADRs:
../../../production/adrs/0005-no-llm-in-gather.md — fence job invariant.
../../../production/adrs/0033-domain-modeling-discipline.md §3 — mypy --strict + warn_unreachable is the runtime enforcement.
Source design:
../final-design.md §"CI lane" — "Serial (no pytest-xdist). Estimated CI walltime growth ≤ 6 minutes; the bench canary ... is advisory."
../final-design.md §"Open questions deferred to implementation" #5 — full-repo mypy --warn-unreachable is in S8-04's backlog.
Existing code (DO NOT WEAKEN):
.github/workflows/ci.yml (Phase 0 + Phase 1) — existing five jobs: lint (ruff + import-linter), typecheck (mypy --strict), test (pytest + coverage carve-outs + 3 bench canaries with -m bench), security (pip-audit + osv-scanner), fence. Phase 2 adds five new top-level jobs and a new workflow file (bench-nightly.yml); the existing jobs stay in place (no rename, no delete).
tests/bench/_helpers.py (S5-01) — writes bench-results.json artifact via merge_bench_result(). No baseline JSON. No PR-comment automation. This story OWNS the new _bench_kernel.py that adds those primitives.
tests/bench/test_cli_cold_start.py, test_coordinator_overhead.py, test_cache_hit_dispatch.py (S5-01) — the three -m bench canaries; bench-collection-guard (ci.yml:122-130) hardcodes count == 3. The new bench scripts in this story are NOT marked -m bench so the guard stays at 3.
pyproject.toml:172 — warn_unreachable = true at the GLOBAL [tool.mypy] level. Repo-wide. The S8-01 _attempts log already confirmed this fires the gate per-module via mypy's whole-program analysis.
src/codegenie/coordinator/coordinator.py:489 — currently cpu = os.cpu_count() or 1 with no env-var override. This story plumbs CODEGENIE_FORCE_CPU_COUNT here via a new wrapper effective_cpu_count() -> int.
scripts/regen_probe_contract_snapshot.py — currently has no field-allowlist mechanism; this story extends it to assert the ProbeContext allowlist is exactly the Phase-0 fields ∪ image_digest_resolver.
src/codegenie/probes/base.py:52-62 — ProbeContext fields: cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver. The latter is the only Phase-2 widening per ADR-0004.
tests/adv/phase02/ — 8 files: test_adversarial_dockerfile.py, test_concurrent_gather_race.py, test_hostile_skills_yaml.py, test_image_digest_drift.py, test_no_inmemory_secret_leak.py, test_phase3_handoff_smoke.py, test_secret_in_source.py, test_stale_scip_fixture.py. adv-phase02 lane collects them all.
tests/integration/portfolio/test_portfolio_sweep.py (S7-01/02) — the five-fixture sweep the portfolio lane runs. Fixtures: minimal-ts, monorepo-pnpm, distroless-target, native-modules, stale-scip.
tests/snapshots/probe_contract.v1.json — exists (Phase 0 created it); the contract-freeze lane's allowlist-extended regen helper diffs against this.

Goal¶

Land Phase-2 CI surface in two workflow files:

.github/workflows/ci.yml — extend with five new top-level jobs named contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench (seven new — but mypy is the new name promoted from typecheck, and bench is the new name promoted from the in-test bench step; keep the legacy lint, typecheck and security jobs intact via the additive path described in Validation Note #3). After this story the workflow has the existing five jobs (lint, typecheck, test, security, fence) PLUS the five new Phase-2-named jobs (contract-freeze, unit, integration, portfolio, adv-phase02, mypy-as-alias-of-typecheck, bench-as-promoted-from-test-step). The eight names from phase-arch-design.md §"CI gates" are a required subset, not the entire set. No xdist anywhere (ADR-0009).
.github/workflows/bench-nightly.yml — new file; cron 0 4 * * * (UTC); runs tests/bench/bench_portfolio_walltime_hosted_runner.py only.

Land three new bench scripts under tests/bench/ (NOT marked -m bench — the existing collection guard stays at count==3):

bench_portfolio_walltime.py — five-fixture cold + warm p50 captured per run; baseline JSON committed in tests/bench/baselines/portfolio_walltime.json; ≥ 50 % delta posts a PR comment (no block).
bench_index_health_overhead.py — measures IndexHealthProbe walltime as a fraction of total cold gather walltime on minimal-ts; ≥ 10 % posts a PR comment. Target: < 5 %; 5–10 % acceptable; ≥ 10 % comments.
bench_portfolio_walltime_hosted_runner.py — nightly cron; sets CODEGENIE_FORCE_CPU_COUNT=2 so effective_cpu_count() (this story's new wrapper) returns 2 regardless of os.cpu_count(); ≥ 50 % regression vs baseline posts a PR comment; ≥ 100 % regression OR p95 > 360 s = build failure (Gap 2 closer).

Land four supporting modules:

src/codegenie/coordinator/_cpu_budget.py — pure effective_cpu_count() -> int reading CODEGENIE_FORCE_CPU_COUNT env-var with fallback to os.cpu_count() or 1. coordinator.py:489 consumes it.
tests/bench/_bench_kernel.py — pure compare_to_baseline(measurements, baseline, thresholds) -> Verdict (sum type Ok | CommentOnly | Fail) + impure post_comment_if(verdict, gh_token) + impure exit_with_verdict(verdict). Three bench scripts compose this kernel.
tests/_ci_support/requires_tool.py — @requires_tool(name) decorator with skip-reason format f"{name} not on PATH — SKIPPED LOUD".
tests/unit/ci/_workflow_model.py — typed Pydantic WorkflowFile, Job, Step models for parsing .github/workflows/*.yml in tests.

Promote tests/unit/test_probe_contract.py to its own contract-freeze lane (separate from test); extend scripts/regen_probe_contract_snapshot.py with the ProbeContext field-allowlist {cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver}. A third additive field fails with an ADR-0004 pointer.

Acceptance criteria¶

[x] AC-1 (Phase-2 named lanes present as a required subset; legacy jobs preserved; matrix on Python 3.11 + 3.12). .github/workflows/ci.yml defines top-level jobs that include the subset {fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench} (8 names from phase-arch-design.md §"CI gates"). The legacy jobs {lint, typecheck, test, security} also remain present (a Phase-0/1 contract this story does NOT change). Every new lane runs on the matrix python-version: ["3.11", "3.12"]. tests/unit/ci/test_workflow_yaml.py::test_required_subset_present parses the workflow via _workflow_model.WorkflowFile (typed Pydantic loader) and asserts: (a) {fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench}.issubset(workflow.jobs.keys()); (b) {lint, typecheck, test, security}.issubset(workflow.jobs.keys()) (no silent legacy deletion); (c) each new lane's matrix contains both "3.11" and "3.12". mypy may be an alias for typecheck (a needs: typecheck job stub or a duplicate top-level — pick one and document) — the lane MUST exist by the name mypy; the existing typecheck may continue to be invoked from make typecheck.
[x] AC-2 (unit lane — subset filter; no xdist; serial; --no-cov). The unit job runs pytest tests/unit/ -q --no-cov with NO -n/--numprocesses/--dist/-x parallel flags. --no-cov avoids double-counting against the global 85 % floor (which the existing test job continues to enforce). The job's timeout-minutes: 5 step-level cap enforces the ceiling. test_workflow_yaml.py::test_unit_serial_and_no_cov asserts the absence of xdist flags and presence of --no-cov.
[x] AC-3 (integration lane — real tool invocations; tool-presence preflight via @requires_tool decorator; loud skip). The job runs pytest tests/integration/ -q --no-cov serially. A new module tests/_ci_support/requires_tool.py exposes @requires_tool(name: str) that wraps pytest.mark.skipif(shutil.which(name) is None, reason=f"{name} not on PATH — SKIPPED LOUD"). The decorator MUST also emit a structlog warning (or warnings.warn(..., stacklevel=2)) when applied so the skip is visible in CI stdout (not just in the pytest report). The job also pre-flights every ALLOWED_BINARIES Phase-2 addition (semgrep, syft, grype, gitleaks, tree-sitter, docker, strace, scip-typescript) and prints the missing-tool list as the first stdout line (visible in CI logs at a glance). tests/unit/ci/test_requires_tool_decorator.py asserts: (a) skip reason literal contains SKIPPED LOUD and the tool name; (b) the warning/log emission fires once per missing tool per session; (c) the decorator is composable with other pytest marks (@requires_tool("foo") + @pytest.mark.parametrize(...) works).
[x] AC-4 (portfolio lane — five-fixture sweep + golden diff; serial; ≤ 7 min step-level cap; no xdist). The job runs pytest tests/integration/portfolio/ -q --no-cov --tb=short against the five-fixture portfolio (minimal-ts, native-modules, monorepo-pnpm, distroless-target, stale-scip). timeout-minutes: 7 (one-minute headroom over the 6-min budget from phase-arch-design.md §"CI lane"). Golden-diff failure is a hard fail (no continue-on-error). test_workflow_yaml.py::test_portfolio_serial_budget asserts: (a) no xdist; (b) timeout-minutes <= 7; (c) every fixture name from the canonical list appears in the test discovery (verified by pytest --collect-only tests/integration/portfolio/).
[x] AC-5 (adv-phase02 lane — LOAD-BEARING — fails build on any adversarial failure; eight-file presence + collected-tests assertion). The job runs pytest tests/adv/phase02/ -q --no-cov --tb=long. The job's continue-on-error: false (default) means any failure fails the build. tests/unit/ci/test_adv_phase02_load_bearing.py asserts: (a) the workflow's adv-phase02 step does NOT set continue-on-error: true; (b) the eight named files exist under tests/adv/phase02/: test_adversarial_dockerfile.py, test_concurrent_gather_race.py, test_hostile_skills_yaml.py, test_image_digest_drift.py, test_no_inmemory_secret_leak.py, test_phase3_handoff_smoke.py, test_secret_in_source.py, test_stale_scip_fixture.py; (c) each file has at least one collected def test_… function (pytest --collect-only tests/adv/phase02/<file>::* returns ≥ 1 item) — catches empty stubs, not just file existence. The story documents in "Notes for the implementer" the verification ritual: deliberately introduce a bug in IndexHealthProbe (e.g., always emit Fresh) and confirm the CI build fails red on test_stale_scip_fixture.py; revert. Mypy stderr from the ritual is captured in _attempts/S8-03.md.
[x] AC-6 (mypy lane — mypy --strict repo-wide + global warn_unreachable already in pyproject.toml; exhaustiveness smoke ritual). The job runs mypy --strict src/codegenie/ tests/ (or make typecheck which already invokes this). warn_unreachable = true is set at the GLOBAL [tool.mypy] level in pyproject.toml:172 (already shipped; per S8-01's _attempts log, the project-wide setting fires the per-module exhaustiveness gate via mypy's whole-program analysis — the per-module override block S1-11 originally intended is NOT necessary). tests/unit/ci/test_mypy_global_warn_unreachable.py parses pyproject.toml and asserts: (a) [tool.mypy] contains warn_unreachable = true; (b) no [[tool.mypy.overrides]] block sets warn_unreachable = false for any production module; (c) the mypy CI lane's step does NOT pass --warn-unreachable on the command line (config-driven, single source of truth). AC-6b — exhaustiveness smoke ritual: the Step 8 PR-review checklist (and an executable assertion in _attempts/S8-03.md) includes "deliberately remove a case from confidence_section.py; confirm mypy --strict fails with [unreachable] or assert_never arg-type error; revert." The captured stderr stays in the attempt log as load-bearing evidence.
[x] AC-7 (bench lane — advisory; never blocks PR; existing collection guard preserved at count==3). The new bench lane runs pytest tests/bench/bench_portfolio_walltime.py tests/bench/bench_index_health_overhead.py -q --no-cov (the two new bench scripts, NOT the existing -m bench canaries). The lane's continue-on-error: true ensures merge is never blocked. ≥ 50 % regression vs tests/bench/baselines/portfolio_walltime.json posts a PR comment via gh pr comment (uses GH_TOKEN/secrets.GITHUB_TOKEN); ≥ 10 % regression for bench_index_health_overhead.py posts a comment. test_workflow_yaml.py::test_bench_advisory asserts continue-on-error: true. AC-7b (existing bench-collection-guard unchanged): the existing bench-collection-guard step inside the test job (ci.yml:122-130) MUST still expect == 3 (the S5-01 set: test_cli_cold_start.py, test_coordinator_overhead.py, test_cache_hit_dispatch.py). The two new bench scripts (bench_portfolio_walltime.py, bench_index_health_overhead.py) are NOT marked -m bench and so are not counted by the guard. test_bench_collection_guard_unchanged.py asserts the guard threshold is still 3 and the new bench scripts do not carry the bench marker. AC-7c (fork-PR degradation): the bench lane's gh pr comment step uses if: ${{ github.event.pull_request.head.repo.fork == false }} so external-fork PRs skip the comment with a loud log; the bench measurement still runs and writes the artifact (operator can inspect manually).
[x] AC-8 (bench_portfolio_walltime.py — five-fixture cold + warm p50; baseline JSON committed; uses pure compare_to_baseline kernel). The script runs each of the five fixtures cold (cache cleared) and warm (cache populated), capturing p50 across at least 5 runs per (fixture, mode) (variance-tolerance for CI runners; arch §"Performance regression tests" precedent). The script consumes tests/bench/_bench_kernel.compare_to_baseline(measurements, baseline, thresholds) -> Verdict (pure, sum type). On Verdict.CommentOnly, posts a PR comment listing the regressed fixture(s); on Verdict.Ok, exit 0 silently; on Verdict.Fail, exit 0 (this script's threshold is comment-only — Fail never occurs here). tests/bench/test_bench_portfolio_walltime_smoke.py runs the script against minimal-ts ONLY and asserts the result dict contains cold_p50_s and warm_p50_s keys with float > 0. The committed baseline tests/bench/baselines/portfolio_walltime.json carries a metadata header: {"refreshed_at": "<ISO8601>", "refreshed_by": "<github-username>", "reason": "<one-line justification>"}; test_baseline_has_metadata.py asserts the three keys.
[x] AC-9 (bench_index_health_overhead.py — B2 walltime as fraction of total cold gather on minimal-ts; ≥ 10 % comments; metamorphic test). The script captures IndexHealthProbe walltime as fraction-of-total during a cold gather of minimal-ts (median of 5 runs). Reports the fraction; on ≥ 10 % posts a PR comment naming the regression (does NOT fail). The 5 % target is documented in the script's module docstring; 5–10 % is an acceptable middle band. tests/bench/test_bench_index_health_smoke.py asserts: (a) the harness returns a fraction_of_total: float between 0.0 and 1.0 (vacuous, but catches a None/-1 return); (b) metamorphic — running the script twice with a monkeypatch-injected time.sleep(0.5) inside IndexHealthProbe.run produces a STRICTLY larger fraction_of_total than the unmodified run.
[x] AC-10a (Plumb CODEGENIE_FORCE_CPU_COUNT into the coordinator via effective_cpu_count()). New module src/codegenie/coordinator/_cpu_budget.py exposes the pure function effective_cpu_count() -> int: reads CODEGENIE_FORCE_CPU_COUNT env-var, falls back to os.cpu_count() or 1, raises ValueError if the env-var is non-empty and not a positive int. src/codegenie/coordinator/coordinator.py:489 is changed from cpu = os.cpu_count() or 1 to cpu = effective_cpu_count(). tests/unit/coordinator/test_cpu_budget.py asserts: (a) env-var absent → falls back to os.cpu_count() or 1; (b) env-var = "2" → returns 2; (c) env-var = "abc" → ValueError; (d) env-var = "-1" → ValueError; (e) env-var = "" (empty string) → falls back; (f) the coordinator's Semaphore is constructed with min(effective_cpu_count(), 8).
[x] AC-10b (bench_portfolio_walltime_hosted_runner.py — nightly, emulates cpu_count()=2, comment ≥ 50 %, build-fail ≥ 100 % OR p95 > 360 s — Gap 2 closer; parametrized threshold boundary test). The script sets os.environ["CODEGENIE_FORCE_CPU_COUNT"] = "2" BEFORE importing coordinator modules (so effective_cpu_count() is honored on first read). Runs the five-fixture portfolio (median of 5 runs per fixture). Consumes _bench_kernel.compare_to_baseline(measurements, baseline, thresholds=Threshold(comment_pct=50.0, fail_pct=100.0, fail_p95_s=360.0)). On Verdict.CommentOnly → PR comment; on Verdict.Fail → exit_with_verdict() calls sys.exit(2) (failing the build). tests/unit/ci/test_hosted_runner_bench_thresholds.py parametrizes the threshold function over [(99.0, Verdict.Ok), (100.0, Verdict.Fail), (101.0, Verdict.Fail), (50.0, Verdict.CommentOnly), (49.9, Verdict.Ok)] AND p95 boundaries [(359.0, Ok), (360.0, Ok), (360.001, Fail), (361.0, Fail)] — explicit inclusivity conventions: regression >= 100 % triggers Fail; p95 > 360 s (strict) triggers Fail. (Synthesis pick: arch §"Gap 2" wording is "OR"; both conditions are independent triggers.)
[x] AC-10c (.github/workflows/bench-nightly.yml — UTC cron, hosted-runner bench only). New workflow file. on: schedule: - cron: '0 4 * * *' (UTC; 04:00 UTC == 21:00 PT prior day == 06:00 CET). Single job runs pytest tests/bench/bench_portfolio_walltime_hosted_runner.py -q --no-cov on runs-on: ubuntu-24.04 (pinned; runner-image upgrade alone could cause ≥ 100 % drift — pinning is load-bearing). env: CODEGENIE_FORCE_CPU_COUNT: "2" set at the job level. permissions: { pull-requests: write, contents: read }. tests/unit/ci/test_bench_nightly_workflow.py parses the workflow and asserts: (a) cron schedule is exactly "0 4 * * *"; (b) runs-on is ubuntu-24.04 (pinned, not ubuntu-latest); (c) env contains CODEGENIE_FORCE_CPU_COUNT: "2"; (d) permissions.pull-requests == "write"; (e) the workflow is workflow_dispatch-able (operator can trigger manually).
[x] AC-11 (contract-freeze lane — Phase 0 contract test promoted to its own job; scripts/regen_probe_contract_snapshot.py extended with field-allowlist; ADR-0004 pointer on third field). The contract-freeze lane runs pytest tests/unit/test_probe_contract.py -q --no-cov (existing test) AND python scripts/regen_probe_contract_snapshot.py --check (new flag — asserts the live snapshot matches the committed tests/snapshots/probe_contract.v1.json). The regen helper is extended with an explicit field-allowlist for ProbeContext fields: {cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver} (the Phase-0 base ∪ Phase-1 amendments ∪ image_digest_resolver). A third additive field (e.g., parsed_manifest_v2, foo, bar) raises ValueError("ProbeContext widening prohibited — see 02-ADR-0004; got unknown field {name}"). tests/unit/ci/test_contract_freeze_allowlist.py asserts: (a) the committed snapshot contains image_digest_resolver; (b) the regen helper's field-allowlist names exactly the eight fields above; (c) parametrized over [("foo"), ("bar"), ("parsed_manifest_v2"), ("__init__")] — each triggers ValueError with 02-ADR-0004 literal substring in the message; (d) the contract-freeze workflow job exists and runs --check.
[x] AC-12 (Phase 0 fence continues green; no new LLM/network imports introduced; all new modules pass mypy --strict + ruff + lint-imports). The fence job asserts no anthropic/openai/langgraph/httpx/requests/socket imports under src/codegenie/. This story introduces no such import (only src/codegenie/coordinator/_cpu_budget.py is added; it imports os only). make lint-imports green (no new cross-package edges). mypy --strict green on src/codegenie/coordinator/_cpu_budget.py, tests/bench/_bench_kernel.py, tests/_ci_support/requires_tool.py, tests/unit/ci/_workflow_model.py. ruff check + ruff format --check green on all touched files. CI run against master post-merge passes fence on Python 3.11 + 3.12.
[x] AC-13 (no pytest-xdist anywhere — metamorphic test). tests/unit/ci/test_no_xdist_anywhere.py parses every workflow under .github/workflows/ via _workflow_model.WorkflowFile, and for each step's run string applies the regex r'(?<!\w)(-n\s|-n\d|--numprocesses|--dist|pytest-xdist|tox\s+-p)(?!\w)' — asserts zero matches. Also greps pyproject.toml [tool.pytest.ini_options] addopts for the same patterns; asserts zero. Metamorphic test: the test monkeypatches a parsed workflow copy to inject -n 4 into a step's run string and re-runs the assertion; expects AssertionError. Confirms the assertion has bite (Rule 9 — tests verify intent, mutation-resistant).

Out of scope¶

Deleting or renaming the existing lint and security jobs. Both are Phase-0/1 contracts; the eight Phase-2 names from phase-arch-design.md §"CI gates" are a required subset, not the entire job set. Reconciling Phase-2's eight with the legacy five is an additive layering, not a replacement.
Changing the existing bench-collection-guard count (== 3). The three new bench scripts in this story are NOT marked -m bench; the existing S5-01 collection stays untouched.
Editing pyproject.toml's [tool.mypy] config. The global warn_unreachable = true already shipped (S1-11; verified by S8-01's _attempts log). The per-module override block the draft prescribed is unnecessary.
Splitting the portfolio job into per-fixture parallel lanes via xdist. ADR-0009. If walltime regresses past 6 min, the operator's escape valve (final-design.md §"Open Q 6", phase-arch-design.md §"Gap 2 §Escape valve") is committing per-fixture .codegenie/cache/ blobs — not a CI shape change.
Making the per-PR bench lane gating. ADR-0009 + final-design.md §"CI lane". Only bench_portfolio_walltime_hosted_runner.py's ≥ 100 % / p95 > 360 s thresholds gate the build, and that script runs on the nightly cron via bench-nightly.yml, not per-PR.
Adding a coverage-ratchet job (Phase 1 owns this via the existing test job's Per-module coverage carve-outs step).
Adding a forbidden-patterns CI job. Phase 0 pre-commit owns this; the Phase 2 extension to ban model_construct under src/codegenie/output/** is enforced by the existing pre-commit hook (S1-11). If a reader expects a CI lane for it, the answer is "no — pre-commit is the source of truth, not CI; Rule 2 — don't duplicate."
Editing the contents of any adversarial test under tests/adv/phase02/. Those land in S4-02/S5-05/S5-06/S6-07/S7-04. AC-5 only asserts file presence + ≥ 1 collected test per file.
Adding per-PR bench_portfolio_walltime_hosted_runner.py. The hosted-runner bench is nightly-only by design (Gap 2 closer); running it per-PR would consume CI quota and inflate variance.

Files to touch¶

New:

.github/workflows/bench-nightly.yml — UTC cron for the hosted-runner bench (AC-10c).
src/codegenie/coordinator/_cpu_budget.py — pure effective_cpu_count() (AC-10a). ~15 LOC.
tests/_ci_support/__init__.py — empty.
tests/_ci_support/requires_tool.py — @requires_tool decorator (AC-3). ~30 LOC.
tests/bench/_bench_kernel.py — pure compare_to_baseline, Verdict sum type, Threshold dataclass; impure post_comment_if, exit_with_verdict (AC-8/9/10b). ~80 LOC.
tests/bench/bench_portfolio_walltime.py — AC-8.
tests/bench/bench_index_health_overhead.py — AC-9.
tests/bench/bench_portfolio_walltime_hosted_runner.py — AC-10b.
tests/bench/baselines/portfolio_walltime.json — committed baseline + metadata header (AC-8).
tests/bench/baselines/portfolio_walltime_hosted_runner.json — committed baseline + metadata header (AC-10b).
tests/bench/baselines/README.md — documents the baseline-refresh ritual (separate PR, reviewer approval, metadata header fields).
tests/bench/test_bench_portfolio_walltime_smoke.py — AC-8 smoke.
tests/bench/test_bench_index_health_smoke.py — AC-9 smoke + metamorphic.
tests/bench/test_baseline_has_metadata.py — AC-8 metadata-header assertion.
tests/unit/ci/__init__.py — empty.
tests/unit/ci/_workflow_model.py — typed Pydantic WorkflowFile, Job, Step (parser used by every workflow-YAML test in this story).
tests/unit/ci/test_workflow_yaml.py — AC-1, AC-2, AC-4, AC-7.
tests/unit/ci/test_requires_tool_decorator.py — AC-3.
tests/unit/ci/test_adv_phase02_load_bearing.py — AC-5.
tests/unit/ci/test_mypy_global_warn_unreachable.py — AC-6.
tests/unit/ci/test_bench_collection_guard_unchanged.py — AC-7b.
tests/unit/ci/test_hosted_runner_bench_thresholds.py — AC-10b (parametrized boundary test).
tests/unit/ci/test_bench_nightly_workflow.py — AC-10c.
tests/unit/ci/test_contract_freeze_allowlist.py — AC-11.
tests/unit/ci/test_no_xdist_anywhere.py — AC-13 (metamorphic).
tests/unit/coordinator/test_cpu_budget.py — AC-10a.

Modified:

.github/workflows/ci.yml — add five new top-level job blocks (contract-freeze, unit, integration, portfolio, adv-phase02, plus mypy as a new top-level OR a needs: typecheck alias job, plus bench as a new top-level promoted from the existing in-test bench step). The legacy lint, typecheck, test, security, fence jobs are PRESERVED unchanged. Matrix extension: the existing typecheck/test/security/fence jobs stay at python-version: "3.11" only (not in scope to extend); each NEW lane runs python-version: ["3.11", "3.12"].
src/codegenie/coordinator/coordinator.py — line 489 changes from cpu = os.cpu_count() or 1 to cpu = effective_cpu_count() (import from _cpu_budget). ~3 LOC delta.
scripts/regen_probe_contract_snapshot.py — add explicit _PROBE_CONTEXT_FIELD_ALLOWLIST constant; add --check flag that diffs the live snapshot against the committed JSON; raise ValueError with 02-ADR-0004 pointer on any non-allowlisted field. ~30 LOC delta.

Untouched (DO NOT EDIT):

pyproject.toml's [tool.mypy] block (S1-11 owns; global warn_unreachable=true already there).
The existing test job's bench-collection-guard count (== 3).
Any adversarial test under tests/adv/phase02/.
Existing tests/bench/_helpers.py (S5-01 owns). The new _bench_kernel.py is additive.
Phase 0 fence test.
The lint, typecheck, security jobs in ci.yml.

TDD plan — red / green / refactor¶

RED (failing tests committed first):

test_workflow_yaml.py::test_required_subset_present — parses .github/workflows/ci.yml via _workflow_model.WorkflowFile; asserts the 8-name subset is present AND the 4-name legacy subset is preserved. Fails red.
test_workflow_yaml.py::test_unit_serial_and_no_cov — asserts the unit lane's run contains pytest tests/unit/ AND --no-cov AND lacks -n/--numprocesses/--dist. Fails red.
test_workflow_yaml.py::test_portfolio_serial_budget — asserts timeout-minutes <= 7 on portfolio lane and no xdist. Fails red.
test_workflow_yaml.py::test_bench_advisory — continue-on-error: true on bench lane's pytest step. Fails red.
test_no_xdist_anywhere.py::test_zero_xdist_invocations — typed-loaded workflow scan + pyproject.toml addopts scan + the metamorphic monkeypatch test (inject -n 4 → expect AssertionError). Fails red.
test_adv_phase02_load_bearing.py::test_eight_files_with_collected_tests — asserts the 8 named files exist AND each has ≥ 1 collected test_… function. Fails red until the workflow runs pytest --collect-only and the harness reads its output.
test_requires_tool_decorator.py::test_skip_reason_format — applies @requires_tool("doesnotexist") to a test stub; runs pytest --collect-only; asserts skip reason contains SKIPPED LOUD and doesnotexist. Fails red — module doesn't exist.
test_mypy_global_warn_unreachable.py::test_global_warn_unreachable_true — parses pyproject.toml; asserts [tool.mypy].warn_unreachable == True. (Passes green on master — this story's purpose for AC-6 is to not break it + add the smoke ritual.) Add test_no_override_disables_warn_unreachable — asserts no override block sets warn_unreachable = false.
test_bench_collection_guard_unchanged.py::test_guard_count_three — greps .github/workflows/ci.yml for the literal expected exactly 3 bench tests; also runs pytest --collect-only -m bench tests/bench/ and asserts collection count == 3 after the new bench scripts land (they MUST NOT carry the bench marker). Fails red if a future contributor accidentally tags one.
test_hosted_runner_bench_thresholds.py::test_threshold_boundaries — parametrize [(99.0, Ok), (100.0, Fail), (101.0, Fail), (50.0, CommentOnly), (49.9, Ok)] AND p95 [(359.0, Ok), (360.0, Ok), (360.001, Fail)] against the pure compare_to_baseline. Fails red — kernel doesn't exist.
test_bench_nightly_workflow.py::test_cron_runs_on_env_permissions — asserts cron == "0 4 * * *", runs-on == "ubuntu-24.04", env.CODEGENIE_FORCE_CPU_COUNT == "2", permissions.pull-requests == "write". Fails red.
test_contract_freeze_allowlist.py::test_third_field_rejected — parametrize [("foo"), ("bar"), ("parsed_manifest_v2"), ("__init__")]; asserts each raises ValueError with 02-ADR-0004 substring. Fails red.
test_cpu_budget.py::test_env_var_respected — parametrize [("absent", os.cpu_count() or 1), ("2", 2), ("abc", ValueError), ("-1", ValueError), ("", os.cpu_count() or 1)]. Fails red — module doesn't exist.
test_bench_portfolio_walltime_smoke.py — smoke against minimal-ts. Fails red — script doesn't exist.
test_bench_index_health_smoke.py::test_metamorphic_with_injected_sleep — runs the harness twice (baseline + with monkeypatched time.sleep(0.5) in IndexHealthProbe.run); asserts second fraction_of_total > first. Fails red.
test_baseline_has_metadata.py::test_three_metadata_keys — loads both baseline JSONs; asserts refreshed_at, refreshed_by, reason keys exist and are non-empty. Fails red.

GREEN (minimum code to pass):

Create src/codegenie/coordinator/_cpu_budget.py with effective_cpu_count(). Edit coordinator.py:489.
Create tests/_ci_support/requires_tool.py with @requires_tool + warning emission.
Create tests/bench/_bench_kernel.py with Verdict = Ok | CommentOnly | Fail sum type, Threshold dataclass, pure compare_to_baseline, impure post_comment_if + exit_with_verdict.
Create tests/unit/ci/_workflow_model.py with typed Pydantic loader.
Write the three bench scripts using _bench_kernel. Write the two baseline JSON files with metadata headers (initial values from first-run measurements; document the refresh ritual in baselines/README.md).
Extend .github/workflows/ci.yml with the five new top-level jobs (and mypy + bench promotions). Leave legacy lint/typecheck/test/security/fence untouched.
Create .github/workflows/bench-nightly.yml.
Extend scripts/regen_probe_contract_snapshot.py with the _PROBE_CONTEXT_FIELD_ALLOWLIST constant + --check flag + ValueError on third field.

REFACTOR:

Confirm the three bench scripts share zero ad-hoc baseline-load / ratio-compute / comment-on-PR code — all flows through _bench_kernel. If any script grows a sixth helper, surface as a future extraction; for now three is the rule-of-three threshold and the kernel is justified.
Confirm mypy --strict tests/bench/ src/codegenie/coordinator/_cpu_budget.py tests/_ci_support/ tests/unit/ci/ is clean.
Run the AC-5 ritual locally (introduce a B2 bug → pytest tests/adv/phase02/test_stale_scip_fixture.py fails → revert); capture proof in _attempts/S8-03.md.
Run the AC-6b ritual (delete a case in confidence_section.py → mypy fails → revert); capture mypy stderr in _attempts/S8-03.md.
ruff format, ruff check, mypy --strict, make lint-imports, make fence all green on touched modules.

Notes for the implementer¶

The adv-phase02 lane is the load-bearing gate. Green on master is the public contract that the roadmap exit criterion is met. Treat any flake here as a P0 (phase-arch-design.md §"Adversarial corpus"). Fix the test or the fixture; never continue-on-error: true.
Bench advisory vs gating — DO NOT blur the line. Of the three new benches, only bench_portfolio_walltime_hosted_runner.py gates the build, and only on ≥ 100 % OR > 360 s p95, and only on the nightly cron — not per-PR. Mixing these up either (a) blocks PRs on infra noise or (b) lets a 100 % regression sail through.
CODEGENIE_FORCE_CPU_COUNT plumbing must land BEFORE the hosted-runner bench is meaningful. AC-10a is the prerequisite. The bench script's os.environ["CODEGENIE_FORCE_CPU_COUNT"] = "2" must be set BEFORE the first import codegenie.coordinator so the wrapper reads the override on first call. Document this ordering in the bench script's module docstring.
Baseline-refresh ritual. Baselines are committed JSON + metadata header (refreshed_at, refreshed_by, reason). A contributor who intentionally regresses MUST refresh in a separate PR with reviewer approval. The metadata header is the audit trail; reviewers can grep git log for baseline-refresh PRs. Document in tests/bench/baselines/README.md.
PR-comment helper auth + fork degradation. Use ${{ secrets.GITHUB_TOKEN }}. Set permissions: { pull-requests: write, contents: read } on bench jobs ONLY (not on unit/portfolio/etc.). On fork PRs (github.event.pull_request.head.repo.fork == true), the comment step is skipped with a loud log (echo "::warning::Fork PR detected; bench comment skipped; measurement still ran"); the bench artifact upload still runs so an operator can inspect manually.
Tool-presence pre-flight. @requires_tool is the per-test decorator (decorator + warning emission). The integration job also runs a top-of-job shell preflight that prints the missing-tool list as the first stdout line so a human scanning CI logs sees the list at a glance: for tool in semgrep syft grype gitleaks tree-sitter docker strace scip-typescript; do command -v "$tool" >/dev/null || echo "MISSING: $tool"; done.
portfolio job's 6-min budget vs Gap 2 hosted-runner reality. The 6-min budget assumes the dev-laptop bench's measurements. The nightly hosted-runner bench is what verifies the assumption against actual CI hardware. If the nightly fails the ≥ 100 % / p95 > 360 s threshold, the operator's choice is the escape valve (committed .codegenie/cache/ blobs per fixture); do not edit the 6-min budget unilaterally — that requires an ADR amendment.
Phase 0 fence runs first. Order the needs: graph: every new lane depends on fence passing. If a future contributor accidentally imports httpx, no other lane wastes minutes. Use needs: [fence] on every new top-level job.
mypy job is fast (< 30 s) — no caching beyond what mypy provides natively. Adding action-level mypy caching is a separate optimization; out of scope.
Cron timezone. cron: "0 4 * * *" is UTC (GH Actions convention). That's 21:00 PT prior day / 06:00 CET. Document in bench-nightly.yml's top comment so an operator in PT doesn't expect a 4:00 AM PT run.
Don't run benches in unit/portfolio/adv-phase02 lanes. The new bench scripts live in tests/bench/ but the unit lane's pytest tests/unit/ discovery does NOT cross into tests/bench/. The bench lane is the only consumer of bench_portfolio_walltime.py + bench_index_health_overhead.py; the bench-nightly.yml workflow is the only consumer of bench_portfolio_walltime_hosted_runner.py.
bench lane scripts are NOT marked -m bench. The existing S5-01 collection guard expects exactly 3 markered tests. Marking a new bench script with @pytest.mark.bench would push the count to 4 and break the guard. The new bench scripts are invoked by explicit path: pytest tests/bench/bench_portfolio_walltime.py …. AC-7b enforces this.
Threshold inclusivity. Regression >= 100 % triggers Fail (inclusive); p95 > 360 s triggers Fail (strict). The boundary parametrize in AC-10b is the source of truth; arch §"Gap 2" uses the wording "≥ 100 %" and "> 360 s", which this story honors verbatim.
Rule 2 vs three-bench kernel. Three bench scripts crossing the rule-of-three threshold is exactly when the design-patterns toolkit prescribes extraction. _bench_kernel.py is justified — adding a fourth bench in Phase 3+ requires zero edits to the kernel (compose a new Threshold instance, no if branches added). If only two benches landed, the kernel would be premature.
Pure/impure split in _bench_kernel.py. compare_to_baseline is pure: takes measurements + baseline + thresholds, returns Verdict. post_comment_if is impure (calls gh pr comment). exit_with_verdict is impure (calls sys.exit). Tests cover the pure function with parametrize; the impure shell has a thin smoke test only. Functional-core / imperative-shell (CLAUDE.md convention).
Workflow-YAML tests use the typed Pydantic loader, not grep. _workflow_model.WorkflowFile.from_path(path) parses + validates the workflow up front. A malformed YAML fails the test immediately, not silently. Every CI test in this story reuses the same parser; rule-of-three reached (three CI tests minimum: test_workflow_yaml.py, test_bench_nightly_workflow.py, test_no_xdist_anywhere.py).
AC-5's file enumeration must be kept in sync with tests/adv/phase02/. If a future story adds a 9th adversarial file, the failing AC-5 enumeration is the prompt to update both the story file and the assertion. Treat this as load-bearing scoping discipline, not test flake.