Story S8-03 — Eight Phase-2 CI lanes + three advisory bench canaries (hosted-runner closes Gap 2)
Step: Step 8 — Confidence section renderer + CI ratchet + advisory benches + Phase-3 handoff
Status: Done — GREEN 2026-05-18 (phase-story-executor; see _attempts/S8-03.md for the per-AC evidence table + AC-6b ritual mypy stderr capture). The hardened story's "8-name required subset" interpretation was honored: lint, typecheck, test, security, docs, fence legacy jobs preserved unchanged; seven new Phase-2 lanes (contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench) added additively under needs: [fence]. Three new bench scripts (NOT -m bench-marked — collection guard stays at 3) compose a shared tests/bench/_bench_kernel.py (sum-type Verdict = Ok | CommentOnly | Fail). CODEGENIE_FORCE_CPU_COUNT plumbed via src/codegenie/coordinator/_cpu_budget.py::effective_cpu_count(); coordinator.py:489 routes through it. scripts/regen_probe_contract_snapshot.py gained _PROBE_CONTEXT_FIELD_ALLOWLIST + --check mode (a third additive field raises ValueError naming 02-ADR-0004). tests/unit/test_ci_workflow.py::REQUIRED_JOBS updated from the legacy 6-set to the 13-set per arch §"CI gates" (Rule 7 conflict surface). 3432 unit tests pass; mypy --strict, ruff, lint-imports, fence all clean.
Effort: M
Depends on: S8-02 (CLI summary line ships; tests/integration/cli/ integration tests exist). Precondition story (must land first or be folded in): plumbing CODEGENIE_FORCE_CPU_COUNT into src/codegenie/coordinator/coordinator.py Semaphore sizing (today the coordinator reads os.cpu_count() directly at coordinator.py:489 with no env-var override — S1-08 did not ship this). This story folds the plumbing in if no separate precondition story is filed; see AC-10a/10b.
ADRs honored: 02-ADR-0009 (pytest-xdist veto preserved — portfolio and adv-phase02 lanes are serial, no xdist anywhere); 02-ADR-0001 (ALLOWED_BINARIES Phase 2 extension is the union the integration lane depends on for tool-presence preflight); production ADR-0005 (no LLM in gather — the fence job stays green); 02-ADR-0004 (ProbeContext.image_digest_resolver is the one allowed widening; contract-freeze lane asserts the snapshot diff is exactly that field, no others).
Validation notes (2026-05-18 — phase-story-validator)
This story was hardened by phase-story-validator. The draft was substantially restructured because eleven of twelve ACs contradicted either the current CI shape, ADR-0008's discipline, or some named precondition that has not landed. The Goal — 8 Phase-2-named lanes + 3 bench scripts + Gap-2 hosted-runner closer — is sound and traces to phase-arch-design.md §"CI gates" + §"Gap analysis Gap 2" + ADR-0009 + ADR-0001 + ADR-0004. The prescriptions needed grounding:
- Phantom job set. Draft AC-1 asserted set(jobs) == {fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench} (exactly 8). Current .github/workflows/ci.yml has {lint, typecheck, test, security, fence} (5 jobs). The literal equality would silently force deletion of lint and security, both load-bearing per their own arch docs. Resolution: the eight names from phase-arch-design.md §"CI gates" are a required subset of the workflow's job names, not an equality. lint and security (and any other Phase-0/1 jobs) coexist. AC-1 reworded as a subset assertion; AC-1b added asserting the legacy jobs (lint, security) stay present (no silent deletion).
- contract-freeze job does not exist in current ci.yml. The probe contract test runs inside the test job (tests/unit/test_probe_contract.py). This story PROMOTES it to its own top-level job with its own per-PR diff surface. AC-11 reworded to ship that promotion + the field-allowlist mechanism (which scripts/regen_probe_contract_snapshot.py does NOT have today; verified by reading the file: it just walks the structural signature with no allowlist).
- test job vs unit/integration/portfolio/adv-phase02 split. Current test job runs pytest -q --cov-report=json (all tests). The four Phase-2-named lanes are this story's split. Two reconciliation paths: (a) delete test and fan out into four lanes, which preserves coverage but breaks the existing "Per-module coverage carve-outs (ADR-0005)" step and the bench-collection-guard step; (b) keep test as the umbrella with --cov + carve-outs (status quo) and add the four Phase-2-named lanes alongside as additional lanes. The new lanes run subsets and assert specific gating semantics that test does not (no -x, hard fail on adversarial, etc.). Resolution: path (b). Add lanes; do not delete test. The four new lanes set --no-cov (avoiding double-counting against the global 85 % floor) and run on a subset filter. Documented as ADR amendment text in Notes for the implementer.
- bench-collection-guard mismatch. .github/workflows/ci.yml:122-130 hardcodes "expected exactly 3 bench tests" (the S5-01 marker set). Adding three new bench scripts under tests/bench/ collected by -m bench would push the count to 6 and fail the guard. Resolution: the three new bench scripts are NOT marked -m bench (they live alongside the existing trio under tests/bench/ but are invoked by their own job step pytest tests/bench/bench_*.py, NOT by -m bench). The collection guard stays at 3 (the existing S5-01 set). AC-7 explicitly preserves the count.
- CODEGENIE_FORCE_CPU_COUNT is not plumbed. Story Notes line 147 hedged ("if S1-08 did NOT thread this all the way through, surface the gap loudly and file a follow-up"); the hedge proved correct. Verified by grep: src/codegenie/coordinator/coordinator.py:489 reads os.cpu_count() with no env-var override; no other src/ file references the constant. AC-10 split into AC-10a (plumbing: this story ships it, with a focused unit test) and AC-10b (bench script consumes the env-var). The plumbing is a 3-line change in coordinator.py + a wrapper in src/codegenie/coordinator/_cpu_budget.py (pure function: effective_cpu_count() -> int).
- pyproject.toml mypy overrides. Draft AC-6 expected a 5-module [[tool.mypy.overrides]] block with warn_unreachable = true for codegenie.{indices, probes.layer_b.index_health, report, adapters, tccm}. Reality: pyproject.toml:172 enables warn_unreachable = true at the GLOBAL [tool.mypy] level, repo-wide. The S8-01 _attempts log already noted this: "warn_unreachable is enabled globally in pyproject.toml; S1-11 had the intent of a per-module override, but the project-wide setting is already in force — the deletion ritual proves the gate fires. No edits to pyproject.toml were needed." Resolution: AC-6 rewritten to assert the GLOBAL setting + a deletion-ritual smoke test (the S8-01 ritual generalized) instead of the phantom 5-module block. Out-of-scope's "Editing pyproject.toml's mypy overrides (S1-11 owns this)" stays: this story does not touch the mypy config.
- tests/adv/phase02/ file count. Story said 7 adversarial files; reality has 8 (test_phase3_handoff_smoke.py is the 8th). AC-5 rewritten to enumerate the exact filenames and assert each has at least one collected test_ function (catches empty stubs, not just file existence).
- @requires_tool decorator does not exist. Story AC-3 said "via a custom @requires_tool(name) decorator" as if it existed. Verified by grep: no such decorator in src/codegenie/ or tests/. The existing skip pattern is bare pytest.mark.skipif per test. AC-3 rewritten to own the decorator's creation in tests/_ci_support/requires_tool.py (new module) with its own contract test asserting the skip reason format.
- "Phase 1 bench pattern" doesn't exist. Story said "reuses Phase 1's tests/bench/ baseline-comparison + PR-comment pattern verbatim". tests/bench/_helpers.py writes a bench-results.json artifact for upload: no baseline JSON file, no PR-comment automation. AC-8 + AC-9 rewritten to OWN the helper module (tests/bench/_bench_kernel.py) with the pure compare_to_baseline(...) -> Verdict function + impure post_comment_if(...) shell (design-patterns critic's recommendation).
- Bench scripts share logic (rule-of-three threshold). Three bench scripts duplicate timing-harness + baseline-load + ratio-compute + comment-on-PR + threshold-decision. This crosses the rule-of-three threshold for extraction. AC-8 + AC-9 + AC-10b now consume a shared tests/bench/_bench_kernel.py (pure: compare_to_baseline(measurements, baseline, thresholds) -> Verdict, sum type Verdict = Ok | CommentOnly | Fail; impure: _post_comment + _exit_with_verdict). Adding a fourth bench in Phase 3+ requires zero edits to the kernel.
- Threshold-check pure-function discipline. AC-10 tested only one boundary (200 %); an implementation with an off-by-one error at 100 % would still pass. Test-quality critic flagged this. AC-10b now parametrizes over [99 %, 100 %, 101 %, p95=359 s, p95=360 s, p95=361 s] and asserts the inclusivity convention explicitly: regression ≥ 100 % (inclusive) OR p95 > 360 s (strict) triggers Fail. The AND in "≥ 100 % AND > 360 s" is corrected: the original story said OR, matching arch §Gap 2's "or" wording, and the synthesis pick is the arch's OR.
- Workflow-YAML test brittleness. Draft used string-grep for pytest-xdist. Test-quality critic flagged that -nauto, --dist=loadfile, tox -p would slip through. Tests now load the workflow into a typed Pydantic WorkflowFile/Job/Step model (tests/unit/ci/_workflow_model.py) and run the regex r'(?<!\w)(-n\s|(?:--numprocesses|--dist|pytest-xdist)(?!\w))' against each step's run string (the (?!\w) lookahead is scoped to the alternatives that end in a word character; applied after -n\s it would reject "-n 4", because the digit following the space is a word character); also greps pyproject.toml [tool.pytest.ini_options] addopts for the same patterns. Metamorphic test: monkeypatch-inject -n 4 into a parsed copy of the workflow → assertion fires.
- Cron timezone unspecified. Draft cron: "0 4 * * *". GH Actions cron runs in UTC; an operator expecting PT/CET could be confused by the 7- or 8-hour skew. AC-10c documents the convention (04:00 UTC == 21:00 PDT, or 20:00 PST, on the prior day).
- Fork-PR GH_TOKEN. Bench gh pr comment needs pull-requests: write; fork PRs from external contributors get a read-only GITHUB_TOKEN. AC-7c added: the bench comment step degrades gracefully with a loud log on github.event.pull_request.head.repo.fork == true; the bench measurement still runs and writes the artifact.
Full critic findings + decision rationale archived at _validation/S8-03-ci-jobs-and-benches.md. Verdict: HARDENED.
Context
Phase 2 commits to eight named CI lanes in phase-arch-design.md §"CI gates": fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench. Three of those exist today (fence is its own job; unit-style coverage runs inside test; the bench step is inside test too) — the rest are new. This story lands the gap.
Of the eight, adv-phase02 is load-bearing — it gates every adversarial test from S4-02 (stale_scip), S5-05 (image_digest_drift), S5-06 (adversarial_dockerfile), S6-07 (secret_in_source), S6-07 (hostile_skills_yaml), S7-04 (concurrent_gather_race, no_inmemory_secret_leak), plus the Phase 3 handoff smoke (test_phase3_handoff_smoke.py). Eight files in tests/adv/phase02/ as of master. A failing test_stale_scip_fixture.py turns the build red; that is the roadmap exit criterion for Phase 2 ("IndexHealthProbe surfaces a real staleness case in CI against a deliberately-seeded fixture").
Three advisory bench canaries also land here. They never block merge — they comment on PRs (Phase 0 §3.2 advisory discipline). The third bench (bench_portfolio_walltime_hosted_runner.py) is the closer for Gap 2 from phase-arch-design.md §"Gap analysis": the developer-laptop bench (bench_portfolio_walltime.py) measures wall-clock on a beefy machine; the hosted-runner bench emulates the actual GitHub Actions cpu_count()=2 runner via CODEGENIE_FORCE_CPU_COUNT=2 and does have a build-fail threshold (≥ 100 % regression OR p95 > 360 s). That single bench is the one place in Phase 2 where bench failure is build-failure — because by then we're not advising, we've crossed the operational red line.
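The two-tier threshold described above (advisory comment versus build failure) can be sketched as a small pure function over a sum type. This is a minimal illustration, not the kernel's actual code: the function name `classify` and the field names on the verdict classes are hypothetical, and only the `Ok | CommentOnly | Fail` shape and the 50 % / 100 % / 360 s numbers come from this story.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Ok:
    """No threshold crossed."""


@dataclass(frozen=True)
class CommentOnly:
    """Advisory tier: comment on the PR, never block."""
    regression_pct: float


@dataclass(frozen=True)
class Fail:
    """Operational red line crossed: the build should fail."""
    reason: str


Verdict = Union[Ok, CommentOnly, Fail]


@dataclass(frozen=True)
class Threshold:
    comment_pct: float = 50.0
    # None disables a fail tier (assumption: comment-only benches pass None).
    fail_pct: Optional[float] = 100.0
    fail_p95_s: Optional[float] = 360.0


def classify(regression_pct: float, p95_s: float, t: Threshold) -> Verdict:
    """Hypothetical helper: map one measurement onto a Verdict."""
    # Fail tier: regression is inclusive (>=), p95 is strict (>); either one triggers.
    if t.fail_pct is not None and regression_pct >= t.fail_pct:
        return Fail(f"regression {regression_pct:.1f} % >= {t.fail_pct:.1f} %")
    if t.fail_p95_s is not None and p95_s > t.fail_p95_s:
        return Fail(f"p95 {p95_s:.3f} s > {t.fail_p95_s:.3f} s")
    # Advisory tier.
    if regression_pct >= t.comment_pct:
        return CommentOnly(regression_pct)
    return Ok()
```

Keeping the decision pure means the boundary cases (99 %, 100 %, 360 s, 360.001 s) can be unit-tested without touching the network or `gh`.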
The mypy job is the runtime enforcer of S8-01's exhaustiveness ritual: mypy --strict repo-wide with warn_unreachable = true set in [tool.mypy] (already shipped, repo-wide, by S1-11 — verified by S8-01's _attempts log). A removed case in confidence_section.py produces a CI build error via the global flag — the per-module override scheme the draft prescribed proved unnecessary.
This story is the YAML, the bench scripts, the bench kernel (tests/bench/_bench_kernel.py), the CPU-count plumbing (src/codegenie/coordinator/_cpu_budget.py + coordinator wiring), the tool-presence decorator (tests/_ci_support/requires_tool.py), the contract-freeze allowlist (scripts/regen_probe_contract_snapshot.py extension), and the workflow-YAML typed test model (tests/unit/ci/_workflow_model.py). Thirteen numbered ACs, seven new modules, two workflow files (one extended, one new). Effort: M (was S; rescoped to honor the precondition + helper-extraction work the draft hand-waved).
References — where to look
- Architecture:
  - ../phase-arch-design.md §"CI gates": the eight numbered Phase-2 jobs with their gating/advisory status.
  - ../phase-arch-design.md §"Performance regression tests": bench_portfolio_walltime.py and bench_index_health_overhead.py thresholds (≥ 50 % and ≥ 10 % comment-on-PR).
  - ../phase-arch-design.md §"Gap analysis" Gap 2: "hosted-runner bench closes the hidden-assumption #2"; Improvement subsection names bench_portfolio_walltime_hosted_runner.py, CODEGENIE_FORCE_CPU_COUNT=2, nightly cron, comment-on-PR ≥ 50 %, build-fail ≥ 100 % (> 360 s p95), escape valve (commit per-fixture .codegenie/cache/ blobs).
  - ../phase-arch-design.md §"Adversarial tests": the table of seven adversarial tests the adv-phase02 job aggregates (the actual file count on master is eight; see AC-5).
- Phase ADRs:
  - ../ADRs/0009-pytest-xdist-veto-preserved.md: no xdist anywhere. Portfolio and adversarial lanes stay serial.
  - ../ADRs/0001-add-docker-and-security-cli-tools-to-allowed-binaries.md: the eleven additions; the integration lane's tool-presence preflight matrix.
  - ../ADRs/0004-image-digest-as-declared-input-token.md: the contract-freeze snapshot regen permits exactly the image_digest_resolver field on ProbeContext; nothing else widens.
- Production ADRs:
  - ../../../production/adrs/0005-no-llm-in-gather.md: fence job invariant.
  - ../../../production/adrs/0033-domain-modeling-discipline.md §3: mypy --strict + warn_unreachable is the runtime enforcement.
- Source design:
  - ../final-design.md §"CI lane": "Serial (no pytest-xdist). Estimated CI walltime growth ≤ 6 minutes; the bench canary ... is advisory."
  - ../final-design.md §"Open questions deferred to implementation" #5: full-repo mypy --warn-unreachable is in S8-04's backlog.
- Existing code (DO NOT WEAKEN):
  - .github/workflows/ci.yml (Phase 0 + Phase 1): existing five jobs: lint (ruff + import-linter), typecheck (mypy --strict), test (pytest + coverage carve-outs + 3 bench canaries with -m bench), security (pip-audit + osv-scanner), fence. Phase 2 adds seven new top-level jobs and a new workflow file (bench-nightly.yml); the existing jobs stay in place (no rename, no delete).
  - tests/bench/_helpers.py (S5-01): writes the bench-results.json artifact via merge_bench_result(). No baseline JSON. No PR-comment automation. This story OWNS the new _bench_kernel.py that adds those primitives.
  - tests/bench/test_cli_cold_start.py, test_coordinator_overhead.py, test_cache_hit_dispatch.py (S5-01): the three -m bench canaries; bench-collection-guard (ci.yml:122-130) hardcodes count == 3. The new bench scripts in this story are NOT marked -m bench, so the guard stays at 3.
  - pyproject.toml:172: warn_unreachable = true at the GLOBAL [tool.mypy] level. Repo-wide. The S8-01 _attempts log already confirmed this fires the gate per-module via mypy's whole-program analysis.
  - src/codegenie/coordinator/coordinator.py:489: currently cpu = os.cpu_count() or 1 with no env-var override. This story plumbs CODEGENIE_FORCE_CPU_COUNT here via a new wrapper effective_cpu_count() -> int.
  - scripts/regen_probe_contract_snapshot.py: currently has no field-allowlist mechanism; this story extends it to assert the ProbeContext allowlist is exactly the Phase-0 fields ∪ image_digest_resolver.
  - src/codegenie/probes/base.py:52-62: ProbeContext fields: cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver. The last is the only Phase-2 widening per ADR-0004.
  - tests/adv/phase02/: 8 files: test_adversarial_dockerfile.py, test_concurrent_gather_race.py, test_hostile_skills_yaml.py, test_image_digest_drift.py, test_no_inmemory_secret_leak.py, test_phase3_handoff_smoke.py, test_secret_in_source.py, test_stale_scip_fixture.py. The adv-phase02 lane collects them all.
  - tests/integration/portfolio/test_portfolio_sweep.py (S7-01/02): the five-fixture sweep the portfolio lane runs. Fixtures: minimal-ts, monorepo-pnpm, distroless-target, native-modules, stale-scip.
  - tests/snapshots/probe_contract.v1.json: exists (Phase 0 created it); the contract-freeze lane's allowlist-extended regen helper diffs against this.
Goal
Land Phase-2 CI surface in two workflow files:
- .github/workflows/ci.yml: extend with seven new top-level jobs named contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench. Of these, mypy is the new name promoted from typecheck, and bench is the new name promoted from the in-test bench step; the legacy lint, typecheck and security jobs stay intact via the additive path described in Validation Note #3. After this story the workflow has the existing five jobs (lint, typecheck, test, security, fence) PLUS the seven new Phase-2-named jobs (contract-freeze, unit, integration, portfolio, adv-phase02, mypy as an alias of typecheck, bench as promoted from the test step). The eight names from phase-arch-design.md §"CI gates" are a required subset, not the entire set. No xdist anywhere (ADR-0009).
- .github/workflows/bench-nightly.yml: new file; cron 0 4 * * * (UTC); runs tests/bench/bench_portfolio_walltime_hosted_runner.py only.
Land three new bench scripts under tests/bench/ (NOT marked -m bench — the existing collection guard stays at count==3):
- bench_portfolio_walltime.py: five-fixture cold + warm p50 captured per run; baseline JSON committed in tests/bench/baselines/portfolio_walltime.json; ≥ 50 % delta posts a PR comment (no block).
- bench_index_health_overhead.py: measures IndexHealthProbe walltime as a fraction of total cold gather walltime on minimal-ts; ≥ 10 % posts a PR comment. Target: < 5 %; 5–10 % acceptable; ≥ 10 % comments.
- bench_portfolio_walltime_hosted_runner.py: nightly cron; sets CODEGENIE_FORCE_CPU_COUNT=2 so effective_cpu_count() (this story's new wrapper) returns 2 regardless of os.cpu_count(); ≥ 50 % regression vs baseline posts a PR comment; ≥ 100 % regression OR p95 > 360 s = build failure (Gap 2 closer).
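To make the "p50 versus baseline" arithmetic concrete, here is a minimal sketch of how a per-fixture regression percentage could be derived from raw timings. The helper names (`regression_pct`, `regressed_fixtures`) and the flat `{fixture: seconds}` baseline shape are assumptions for illustration; the committed baseline JSON may be structured differently.

```python
import statistics
from typing import Dict, List


def regression_pct(current_s: float, baseline_s: float) -> float:
    """Percentage slowdown of current vs baseline (negative means faster)."""
    if baseline_s <= 0:
        raise ValueError(f"baseline must be positive, got {baseline_s!r}")
    return (current_s - baseline_s) / baseline_s * 100.0


def regressed_fixtures(
    runs: Dict[str, List[float]],
    baseline: Dict[str, float],
    comment_pct: float = 50.0,
) -> Dict[str, float]:
    """Return {fixture: regression %} for fixtures at or above the comment threshold."""
    flagged: Dict[str, float] = {}
    for fixture, samples in runs.items():
        p50 = statistics.median(samples)  # p50 across the runs for this fixture
        pct = regression_pct(p50, baseline[fixture])
        if pct >= comment_pct:
            flagged[fixture] = pct
    return flagged
```

For example, five runs with a median of 3.0 s against a 2.0 s baseline give a 50 % regression, which is exactly at the comment threshold and therefore flagged.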
Land four supporting modules:
- src/codegenie/coordinator/_cpu_budget.py: pure effective_cpu_count() -> int reading the CODEGENIE_FORCE_CPU_COUNT env-var with fallback to os.cpu_count() or 1. coordinator.py:489 consumes it.
- tests/bench/_bench_kernel.py: pure compare_to_baseline(measurements, baseline, thresholds) -> Verdict (sum type Ok | CommentOnly | Fail) + impure post_comment_if(verdict, gh_token) + impure exit_with_verdict(verdict). Three bench scripts compose this kernel.
- tests/_ci_support/requires_tool.py: @requires_tool(name) decorator with skip-reason format f"{name} not on PATH — SKIPPED LOUD".
- tests/unit/ci/_workflow_model.py: typed Pydantic WorkflowFile, Job, Step models for parsing .github/workflows/*.yml in tests.
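A minimal sketch of the CPU-count wrapper, following the behavior this story specifies (env-var override, fallback to os.cpu_count() or 1, ValueError on a non-empty value that is not a positive integer). The error-message wording and the `_ENV_VAR` constant name are assumptions; the shipped module may differ in detail.

```python
import os

_ENV_VAR = "CODEGENIE_FORCE_CPU_COUNT"  # constant name is illustrative


def effective_cpu_count() -> int:
    """CPU budget for the coordinator, honoring the env-var override."""
    raw = os.environ.get(_ENV_VAR, "")
    if raw == "":
        # Absent or empty: fall back to the real machine.
        return os.cpu_count() or 1
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"{_ENV_VAR} must be a positive integer, got {raw!r}"
        ) from None
    if value < 1:
        raise ValueError(f"{_ENV_VAR} must be a positive integer, got {raw!r}")
    return value
```

Because the value is read on each call rather than cached at import time, a bench script only needs to set the variable before the coordinator sizes its Semaphore.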
Promote tests/unit/test_probe_contract.py to its own contract-freeze lane (separate from test); extend scripts/regen_probe_contract_snapshot.py with the ProbeContext field-allowlist {cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver}. A third additive field fails with an ADR-0004 pointer.
Acceptance criteria
-
[x] AC-1 (Phase-2 named lanes present as a required subset; legacy jobs preserved; matrix on Python 3.11 + 3.12).
.github/workflows/ci.ymldefines top-level jobs that include the subset{fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench}(8 names fromphase-arch-design.md §"CI gates"). The legacy jobs{lint, typecheck, test, security}also remain present (a Phase-0/1 contract this story does NOT change). Every new lane runs on the matrixpython-version: ["3.11", "3.12"].tests/unit/ci/test_workflow_yaml.py::test_required_subset_presentparses the workflow via_workflow_model.WorkflowFile(typed Pydantic loader) and asserts: (a){fence, contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench}.issubset(workflow.jobs.keys()); (b){lint, typecheck, test, security}.issubset(workflow.jobs.keys())(no silent legacy deletion); (c) each new lane's matrix contains both"3.11"and"3.12".mypymay be an alias fortypecheck(aneeds: typecheckjob stub or a duplicate top-level — pick one and document) — the lane MUST exist by the namemypy; the existingtypecheckmay continue to be invoked frommake typecheck. -
[x] AC-2 (
unitlane — subset filter; no xdist; serial;--no-cov). Theunitjob runspytest tests/unit/ -q --no-covwith NO-n/--numprocesses/--dist/-xparallel flags.--no-covavoids double-counting against the global 85 % floor (which the existingtestjob continues to enforce). The job'stimeout-minutes: 5step-level cap enforces the ceiling.test_workflow_yaml.py::test_unit_serial_and_no_covasserts the absence of xdist flags and presence of--no-cov. -
[x] AC-3 (
integrationlane — real tool invocations; tool-presence preflight via@requires_tooldecorator; loud skip). The job runspytest tests/integration/ -q --no-covserially. A new moduletests/_ci_support/requires_tool.pyexposes@requires_tool(name: str)that wrapspytest.mark.skipif(shutil.which(name) is None, reason=f"{name} not on PATH — SKIPPED LOUD"). The decorator MUST also emit a structlog warning (orwarnings.warn(..., stacklevel=2)) when applied so the skip is visible in CI stdout (not just in the pytest report). The job also pre-flights everyALLOWED_BINARIESPhase-2 addition (semgrep,syft,grype,gitleaks,tree-sitter,docker,strace,scip-typescript) and prints the missing-tool list as the first stdout line (visible in CI logs at a glance).tests/unit/ci/test_requires_tool_decorator.pyasserts: (a) skip reason literal containsSKIPPED LOUDand the tool name; (b) the warning/log emission fires once per missing tool per session; (c) the decorator is composable with other pytest marks (@requires_tool("foo")+@pytest.mark.parametrize(...)works). -
[x] AC-4 (
portfoliolane — five-fixture sweep + golden diff; serial; ≤ 7 min step-level cap; no xdist). The job runspytest tests/integration/portfolio/ -q --no-cov --tb=shortagainst the five-fixture portfolio (minimal-ts, native-modules, monorepo-pnpm, distroless-target, stale-scip).timeout-minutes: 7(one-minute headroom over the 6-min budget fromphase-arch-design.md §"CI lane"). Golden-diff failure is a hard fail (nocontinue-on-error).test_workflow_yaml.py::test_portfolio_serial_budgetasserts: (a) no xdist; (b)timeout-minutes <= 7; (c) every fixture name from the canonical list appears in the test discovery (verified bypytest --collect-only tests/integration/portfolio/). -
[x] AC-5 (
adv-phase02lane — LOAD-BEARING — fails build on any adversarial failure; eight-file presence + collected-tests assertion). The job runspytest tests/adv/phase02/ -q --no-cov --tb=long. The job'scontinue-on-error: false(default) means any failure fails the build.tests/unit/ci/test_adv_phase02_load_bearing.pyasserts: (a) the workflow'sadv-phase02step does NOT setcontinue-on-error: true; (b) the eight named files exist undertests/adv/phase02/:test_adversarial_dockerfile.py, test_concurrent_gather_race.py, test_hostile_skills_yaml.py, test_image_digest_drift.py, test_no_inmemory_secret_leak.py, test_phase3_handoff_smoke.py, test_secret_in_source.py, test_stale_scip_fixture.py; (c) each file has at least one collecteddef test_…function (pytest --collect-only tests/adv/phase02/<file>::*returns ≥ 1 item) — catches empty stubs, not just file existence. The story documents in "Notes for the implementer" the verification ritual: deliberately introduce a bug inIndexHealthProbe(e.g., always emitFresh) and confirm the CI build fails red ontest_stale_scip_fixture.py; revert. Mypy stderr from the ritual is captured in_attempts/S8-03.md. -
[x] AC-6 (
mypylane —mypy --strictrepo-wide + globalwarn_unreachablealready inpyproject.toml; exhaustiveness smoke ritual). The job runsmypy --strict src/codegenie/ tests/(ormake typecheckwhich already invokes this).warn_unreachable = trueis set at the GLOBAL[tool.mypy]level inpyproject.toml:172(already shipped; per S8-01's_attemptslog, the project-wide setting fires the per-module exhaustiveness gate via mypy's whole-program analysis — the per-module override block S1-11 originally intended is NOT necessary).tests/unit/ci/test_mypy_global_warn_unreachable.pyparsespyproject.tomland asserts: (a)[tool.mypy]containswarn_unreachable = true; (b) no[[tool.mypy.overrides]]block setswarn_unreachable = falsefor any production module; (c) themypyCI lane's step does NOT pass--warn-unreachableon the command line (config-driven, single source of truth). AC-6b — exhaustiveness smoke ritual: the Step 8 PR-review checklist (and an executable assertion in_attempts/S8-03.md) includes "deliberately remove acasefromconfidence_section.py; confirmmypy --strictfails with[unreachable]orassert_neverarg-type error; revert." The captured stderr stays in the attempt log as load-bearing evidence. -
[x] AC-7 (
benchlane — advisory; never blocks PR; existing collection guard preserved at count==3). The newbenchlane runspytest tests/bench/bench_portfolio_walltime.py tests/bench/bench_index_health_overhead.py -q --no-cov(the two new bench scripts, NOT the existing-m benchcanaries). The lane'scontinue-on-error: trueensures merge is never blocked. ≥ 50 % regression vstests/bench/baselines/portfolio_walltime.jsonposts a PR comment viagh pr comment(usesGH_TOKEN/secrets.GITHUB_TOKEN); ≥ 10 % regression forbench_index_health_overhead.pyposts a comment.test_workflow_yaml.py::test_bench_advisoryassertscontinue-on-error: true. AC-7b (existingbench-collection-guardunchanged): the existingbench-collection-guardstep inside thetestjob (ci.yml:122-130) MUST still expect== 3(the S5-01 set:test_cli_cold_start.py,test_coordinator_overhead.py,test_cache_hit_dispatch.py). The two new bench scripts (bench_portfolio_walltime.py,bench_index_health_overhead.py) are NOT marked-m benchand so are not counted by the guard.test_bench_collection_guard_unchanged.pyasserts the guard threshold is still3and the new bench scripts do not carry thebenchmarker. AC-7c (fork-PR degradation): thebenchlane'sgh pr commentstep usesif: ${{ github.event.pull_request.head.repo.fork == false }}so external-fork PRs skip the comment with a loud log; the bench measurement still runs and writes the artifact (operator can inspect manually). -
[x] AC-8 (
bench_portfolio_walltime.py— five-fixture cold + warm p50; baseline JSON committed; uses purecompare_to_baselinekernel). The script runs each of the five fixtures cold (cache cleared) and warm (cache populated), capturing p50 across at least 5 runs per (fixture, mode) (variance-tolerance for CI runners; arch §"Performance regression tests" precedent). The script consumestests/bench/_bench_kernel.compare_to_baseline(measurements, baseline, thresholds) -> Verdict(pure, sum type). OnVerdict.CommentOnly, posts a PR comment listing the regressed fixture(s); onVerdict.Ok, exit 0 silently; onVerdict.Fail, exit 0 (this script's threshold is comment-only —Failnever occurs here).tests/bench/test_bench_portfolio_walltime_smoke.pyruns the script againstminimal-tsONLY and asserts the result dict containscold_p50_sandwarm_p50_skeys withfloat > 0. The committed baselinetests/bench/baselines/portfolio_walltime.jsoncarries a metadata header:{"refreshed_at": "<ISO8601>", "refreshed_by": "<github-username>", "reason": "<one-line justification>"};test_baseline_has_metadata.pyasserts the three keys. -
[x] AC-9 (
bench_index_health_overhead.py— B2 walltime as fraction of total cold gather onminimal-ts; ≥ 10 % comments; metamorphic test). The script capturesIndexHealthProbewalltime as fraction-of-total during a cold gather ofminimal-ts(median of 5 runs). Reports the fraction; on ≥ 10 % posts a PR comment naming the regression (does NOT fail). The 5 % target is documented in the script's module docstring; 5–10 % is an acceptable middle band.tests/bench/test_bench_index_health_smoke.pyasserts: (a) the harness returns afraction_of_total: floatbetween 0.0 and 1.0 (vacuous, but catches a None/-1 return); (b) metamorphic — running the script twice with amonkeypatch-injectedtime.sleep(0.5)insideIndexHealthProbe.runproduces a STRICTLY largerfraction_of_totalthan the unmodified run. -
[x] AC-10a (Plumb
CODEGENIE_FORCE_CPU_COUNTinto the coordinator viaeffective_cpu_count()). New modulesrc/codegenie/coordinator/_cpu_budget.pyexposes the pure functioneffective_cpu_count() -> int: readsCODEGENIE_FORCE_CPU_COUNTenv-var, falls back toos.cpu_count() or 1, raisesValueErrorif the env-var is non-empty and not a positive int.src/codegenie/coordinator/coordinator.py:489is changed fromcpu = os.cpu_count() or 1tocpu = effective_cpu_count().tests/unit/coordinator/test_cpu_budget.pyasserts: (a) env-var absent → falls back toos.cpu_count() or 1; (b) env-var ="2"→ returns2; (c) env-var ="abc"→ValueError; (d) env-var ="-1"→ValueError; (e) env-var =""(empty string) → falls back; (f) the coordinator'sSemaphoreis constructed withmin(effective_cpu_count(), 8). -
[x] AC-10b (
bench_portfolio_walltime_hosted_runner.py— nightly, emulatescpu_count()=2, comment ≥ 50 %, build-fail ≥ 100 % OR p95 > 360 s — Gap 2 closer; parametrized threshold boundary test). The script setsos.environ["CODEGENIE_FORCE_CPU_COUNT"] = "2"BEFORE importing coordinator modules (soeffective_cpu_count()is honored on first read). Runs the five-fixture portfolio (median of 5 runs per fixture). Consumes_bench_kernel.compare_to_baseline(measurements, baseline, thresholds=Threshold(comment_pct=50.0, fail_pct=100.0, fail_p95_s=360.0)). OnVerdict.CommentOnly→ PR comment; onVerdict.Fail→exit_with_verdict()callssys.exit(2)(failing the build).tests/unit/ci/test_hosted_runner_bench_thresholds.pyparametrizes the threshold function over[(99.0, Verdict.Ok), (100.0, Verdict.Fail), (101.0, Verdict.Fail), (50.0, Verdict.CommentOnly), (49.9, Verdict.Ok)]AND p95 boundaries[(359.0, Ok), (360.0, Ok), (360.001, Fail), (361.0, Fail)]— explicit inclusivity conventions: regression>= 100 %triggers Fail; p95> 360 s(strict) triggers Fail. (Synthesis pick: arch §"Gap 2" wording is "OR"; both conditions are independent triggers.) -
[x] AC-10c (
`.github/workflows/bench-nightly.yml`: UTC cron, hosted-runner bench only). New workflow file. `on: schedule: - cron: '0 4 * * *'` (UTC; 04:00 UTC == 21:00 PDT prior day == 06:00 CEST). A single job runs `pytest tests/bench/bench_portfolio_walltime_hosted_runner.py -q --no-cov` on `runs-on: ubuntu-24.04` (pinned; a runner-image upgrade alone could cause ≥ 100 % drift, so pinning is load-bearing). `env: CODEGENIE_FORCE_CPU_COUNT: "2"` is set at the job level, with `permissions: { pull-requests: write, contents: read }`. `tests/unit/ci/test_bench_nightly_workflow.py` parses the workflow and asserts: (a) the cron schedule is exactly `"0 4 * * *"`; (b) `runs-on` is `ubuntu-24.04` (pinned, not `ubuntu-latest`); (c) env contains `CODEGENIE_FORCE_CPU_COUNT: "2"`; (d) `permissions.pull-requests == "write"`; (e) the workflow is `workflow_dispatch`-able (an operator can trigger it manually).
- [x] AC-11 (`contract-freeze` lane: Phase 0 contract test promoted to its own job; `scripts/regen_probe_contract_snapshot.py` extended with a field allowlist; ADR-0004 pointer on third field). The `contract-freeze` lane runs `pytest tests/unit/test_probe_contract.py -q --no-cov` (existing test) AND `python scripts/regen_probe_contract_snapshot.py --check` (new flag; asserts the live snapshot matches the committed `tests/snapshots/probe_contract.v1.json`). The regen helper is extended with an explicit field allowlist for `ProbeContext` fields: `{cache_dir, output_dir, workspace, logger, config, parsed_manifest, input_snapshot, image_digest_resolver}` (the Phase-0 base ∪ Phase-1 amendments ∪ `image_digest_resolver`). A third additive field (e.g., `parsed_manifest_v2`, `foo`, `bar`) raises `ValueError("ProbeContext widening prohibited — see 02-ADR-0004; got unknown field {name}")`. `tests/unit/ci/test_contract_freeze_allowlist.py` asserts: (a) the committed snapshot contains `image_digest_resolver`; (b) the regen helper's field allowlist names exactly the eight fields above; (c) parametrized over `[("foo"), ("bar"), ("parsed_manifest_v2"), ("__init__")]`, each value triggers `ValueError` with the literal substring `02-ADR-0004` in the message; (d) the `contract-freeze` workflow job exists and runs `--check`.
- [x] AC-12 (Phase 0 `fence` continues green; no new LLM/network imports introduced; all new modules pass `mypy --strict` + `ruff` + `lint-imports`). The `fence` job asserts no `anthropic`/`openai`/`langgraph`/`httpx`/`requests`/`socket` imports under `src/codegenie/`. This story introduces no such import (only `src/codegenie/coordinator/_cpu_budget.py` is added; it imports `os` only). `make lint-imports` is green (no new cross-package edges). `mypy --strict` is green on `src/codegenie/coordinator/_cpu_budget.py`, `tests/bench/_bench_kernel.py`, `tests/_ci_support/requires_tool.py`, `tests/unit/ci/_workflow_model.py`. `ruff check` + `ruff format --check` are green on all touched files. The CI run against `master` post-merge passes `fence` on Python 3.11 + 3.12.
- [x] AC-13 (no `pytest-xdist` anywhere; metamorphic test). `tests/unit/ci/test_no_xdist_anywhere.py` parses every workflow under `.github/workflows/` via `_workflow_model.WorkflowFile`, applies the regex `r'(?<!\w)(-n\s*(?:\d+|auto|logical)|--numprocesses|--dist|pytest-xdist|tox\s+-p)(?!\w)'` to each step's `run` string, and asserts zero matches. It also greps the `addopts` value under `[tool.pytest.ini_options]` in `pyproject.toml` for the same patterns and asserts zero. Metamorphic test: the test monkeypatches a parsed workflow copy to inject `-n 4` into a step's `run` string and re-runs the assertion, expecting `AssertionError`. This confirms the assertion has bite (Rule 9: tests verify intent, mutation-resistant).
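
The AC-13 scan and its metamorphic check can be sketched with the standard library alone. This is a minimal illustration, not the shipped test: the helper names (`find_xdist_invocations`, `assert_no_xdist`, `assertion_has_bite`) are hypothetical, the run strings are passed in as plain lists rather than read through `_workflow_model.WorkflowFile`, and the pattern is written so that the `-n` alternative consumes its worker count (`-n 4`, `-n4`, `-n auto`).

```python
import re

# Assumption: the `-n` alternative consumes its worker count so that the
# trailing word-boundary lookahead is evaluated after the count, not before it.
XDIST_PATTERN = re.compile(
    r"(?<!\w)(-n\s*(?:\d+|auto|logical)|--numprocesses|--dist|pytest-xdist|tox\s+-p)(?!\w)"
)


def find_xdist_invocations(run_strings: list[str]) -> list[str]:
    """Return every run string that contains an xdist-style flag."""
    return [run for run in run_strings if XDIST_PATTERN.search(run)]


def assert_no_xdist(run_strings: list[str]) -> None:
    """Raise AssertionError if any run string invokes xdist."""
    offenders = find_xdist_invocations(run_strings)
    assert not offenders, f"xdist invocation found: {offenders}"


def assertion_has_bite(run_strings: list[str]) -> bool:
    """Metamorphic check: inject `-n 4` into a copy and expect a failure."""
    mutated = list(run_strings)
    mutated[0] = mutated[0] + " -n 4"
    try:
        assert_no_xdist(mutated)
    except AssertionError:
        return True
    return False


# Hypothetical run strings standing in for parsed workflow steps.
clean_steps = ["pytest tests/unit/ -q --no-cov", "ruff check ."]
assert_no_xdist(clean_steps)           # passes: `--no-cov` is not mistaken for `-n`
print(assertion_has_bite(clean_steps))
```

The metamorphic half matters because a pattern that silently matches nothing would also pass the zero-match assertion; injecting a known-bad flag is what distinguishes a working scan from a dead one.
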
Out of scope
- Deleting or renaming the existing `lint` and `security` jobs. Both are Phase-0/1 contracts; the eight Phase-2 names from `phase-arch-design.md §"CI gates"` are a required subset, not the entire job set. Reconciling Phase-2's eight with the legacy five is an additive layering, not a replacement.
- Changing the existing `bench-collection-guard` count (== 3). The three new bench scripts in this story are NOT marked `-m bench`; the existing S5-01 collection stays untouched.
- Editing `pyproject.toml`'s `[tool.mypy]` config. The global `warn_unreachable = true` already shipped (S1-11; verified by S8-01's `_attempts` log). The per-module override block the draft prescribed is unnecessary.
- Splitting the `portfolio` job into per-fixture parallel lanes via xdist. ADR-0009. If walltime regresses past 6 min, the operator's escape valve (`final-design.md §"Open Q 6"`, `phase-arch-design.md §"Gap 2 §Escape valve"`) is committing per-fixture `.codegenie/cache/` blobs, not a CI shape change.
- Making the per-PR `bench` lane gating. ADR-0009 + `final-design.md §"CI lane"`. Only `bench_portfolio_walltime_hosted_runner.py`'s ≥ 100 % / p95 > 360 s thresholds gate the build, and that script runs on the nightly cron via `bench-nightly.yml`, not per-PR.
- Adding a `coverage-ratchet` job (Phase 1 owns this via the existing `test` job's `Per-module coverage carve-outs` step).
- Adding a `forbidden-patterns` CI job. Phase 0 pre-commit owns this; the Phase 2 extension to ban `model_construct` under `src/codegenie/output/**` is enforced by the existing pre-commit hook (S1-11). If a reader expects a CI lane for it, the answer is "no: pre-commit is the source of truth, not CI (Rule 2: don't duplicate)."
- Editing the contents of any adversarial test under `tests/adv/phase02/`. Those land in S4-02/S5-05/S5-06/S6-07/S7-04. AC-5 only asserts file presence + ≥ 1 collected test per file.
- Adding per-PR `bench_portfolio_walltime_hosted_runner.py`. The hosted-runner bench is nightly-only by design (Gap 2 closer); running it per-PR would consume CI quota and inflate variance.
Files to touch
New:
- `.github/workflows/bench-nightly.yml`: UTC cron for the hosted-runner bench (AC-10c).
- `src/codegenie/coordinator/_cpu_budget.py`: pure `effective_cpu_count()` (AC-10a). ~15 LOC.
- `tests/_ci_support/__init__.py`: empty.
- `tests/_ci_support/requires_tool.py`: `@requires_tool` decorator (AC-3). ~30 LOC.
- `tests/bench/_bench_kernel.py`: pure `compare_to_baseline`, `Verdict` sum type, `Threshold` dataclass; impure `post_comment_if`, `exit_with_verdict` (AC-8/9/10b). ~80 LOC.
- `tests/bench/bench_portfolio_walltime.py`: AC-8.
- `tests/bench/bench_index_health_overhead.py`: AC-9.
- `tests/bench/bench_portfolio_walltime_hosted_runner.py`: AC-10b.
- `tests/bench/baselines/portfolio_walltime.json`: committed baseline + metadata header (AC-8).
- `tests/bench/baselines/portfolio_walltime_hosted_runner.json`: committed baseline + metadata header (AC-10b).
- `tests/bench/baselines/README.md`: documents the baseline-refresh ritual (separate PR, reviewer approval, metadata header fields).
- `tests/bench/test_bench_portfolio_walltime_smoke.py`: AC-8 smoke.
- `tests/bench/test_bench_index_health_smoke.py`: AC-9 smoke + metamorphic.
- `tests/bench/test_baseline_has_metadata.py`: AC-8 metadata-header assertion.
- `tests/unit/ci/__init__.py`: empty.
- `tests/unit/ci/_workflow_model.py`: typed Pydantic `WorkflowFile, Job, Step` (parser used by every workflow-YAML test in this story).
- `tests/unit/ci/test_workflow_yaml.py`: AC-1, AC-2, AC-4, AC-7.
- `tests/unit/ci/test_requires_tool_decorator.py`: AC-3.
- `tests/unit/ci/test_adv_phase02_load_bearing.py`: AC-5.
- `tests/unit/ci/test_mypy_global_warn_unreachable.py`: AC-6.
- `tests/unit/ci/test_bench_collection_guard_unchanged.py`: AC-7b.
- `tests/unit/ci/test_hosted_runner_bench_thresholds.py`: AC-10b (parametrized boundary test).
- `tests/unit/ci/test_bench_nightly_workflow.py`: AC-10c.
- `tests/unit/ci/test_contract_freeze_allowlist.py`: AC-11.
- `tests/unit/ci/test_no_xdist_anywhere.py`: AC-13 (metamorphic).
- `tests/unit/coordinator/test_cpu_budget.py`: AC-10a.
Modified:
- `.github/workflows/ci.yml`: add five new top-level job blocks (`contract-freeze`, `unit`, `integration`, `portfolio`, `adv-phase02`), plus `mypy` as a new top-level OR a `needs: typecheck` alias job, plus `bench` as a new top-level job promoted from the existing in-`test` bench step. The legacy `lint`, `typecheck`, `test`, `security`, `fence` jobs are PRESERVED unchanged. Matrix extension: the existing `typecheck`/`test`/`security`/`fence` jobs stay at `python-version: "3.11"` only (not in scope to extend); each NEW lane runs `python-version: ["3.11", "3.12"]`.
- `src/codegenie/coordinator/coordinator.py`: line 489 changes from `cpu = os.cpu_count() or 1` to `cpu = effective_cpu_count()` (import from `_cpu_budget`). ~3 LOC delta.
- `scripts/regen_probe_contract_snapshot.py`: add an explicit `_PROBE_CONTEXT_FIELD_ALLOWLIST` constant; add a `--check` flag that diffs the live snapshot against the committed JSON; raise `ValueError` with a `02-ADR-0004` pointer on any non-allowlisted field. ~30 LOC delta.
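
A minimal sketch of what `effective_cpu_count()` could look like, following the behaviour the `test_cpu_budget.py` parametrize describes (absent or empty falls back to `os.cpu_count() or 1`; `"2"` yields `2`; `"abc"` and `"-1"` raise `ValueError`). The error wording and the rejection of `0` are assumptions, not taken from the shipped module.

```python
import os

_ENV_VAR = "CODEGENIE_FORCE_CPU_COUNT"


def effective_cpu_count() -> int:
    """Return the CPU budget, honouring the override env var when it is set."""
    raw = os.environ.get(_ENV_VAR)
    if raw is None or raw == "":
        # No override: same fallback the coordinator used before the change.
        return os.cpu_count() or 1
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{_ENV_VAR} must be a positive integer; got {raw!r}") from exc
    if value < 1:
        # Assumption: zero is rejected along with negatives.
        raise ValueError(f"{_ENV_VAR} must be a positive integer; got {raw!r}")
    return value


os.environ[_ENV_VAR] = "2"
print(effective_cpu_count())
```

Because the function reads the environment on every call rather than at import time, a test can flip the variable with `monkeypatch.setenv` without reloading the module.
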
Untouched (DO NOT EDIT):
- `pyproject.toml`'s `[tool.mypy]` block (S1-11 owns; global `warn_unreachable=true` already there).
- The existing `test` job's `bench-collection-guard` count (== 3).
- Any adversarial test under `tests/adv/phase02/`.
- Existing `tests/bench/_helpers.py` (S5-01 owns). The new `_bench_kernel.py` is additive.
- Phase 0 `fence` test.
- The `lint`, `typecheck`, `security` jobs in ci.yml.
TDD plan: red / green / refactor
RED (failing tests committed first):
- `test_workflow_yaml.py::test_required_subset_present`: parses `.github/workflows/ci.yml` via `_workflow_model.WorkflowFile`; asserts the 8-name subset is present AND the 4-name legacy subset is preserved. Fails red.
- `test_workflow_yaml.py::test_unit_serial_and_no_cov`: asserts the `unit` lane's `run` contains `pytest tests/unit/` AND `--no-cov` AND lacks `-n`/`--numprocesses`/`--dist`. Fails red.
- `test_workflow_yaml.py::test_portfolio_serial_budget`: asserts `timeout-minutes <= 7` on the `portfolio` lane and no xdist. Fails red.
- `test_workflow_yaml.py::test_bench_advisory`: asserts `continue-on-error: true` on the `bench` lane's pytest step. Fails red.
- `test_no_xdist_anywhere.py::test_zero_xdist_invocations`: typed-loaded workflow scan + `pyproject.toml addopts` scan + the metamorphic monkeypatch test (inject `-n 4`, expect `AssertionError`). Fails red.
- `test_adv_phase02_load_bearing.py::test_eight_files_with_collected_tests`: asserts the 8 named files exist AND each has ≥ 1 collected `test_…` function. Fails red until the workflow runs `pytest --collect-only` and the harness reads its output.
- `test_requires_tool_decorator.py::test_skip_reason_format`: applies `@requires_tool("doesnotexist")` to a test stub; runs `pytest --collect-only`; asserts the skip reason contains `SKIPPED LOUD` and `doesnotexist`. Fails red because the module doesn't exist.
- `test_mypy_global_warn_unreachable.py::test_global_warn_unreachable_true`: parses `pyproject.toml`; asserts `[tool.mypy].warn_unreachable == True`. (Passes green on master; this story's purpose for AC-6 is to not break it and to add the smoke ritual.) Add `test_no_override_disables_warn_unreachable`, which asserts no override block sets `warn_unreachable = false`.
- `test_bench_collection_guard_unchanged.py::test_guard_count_three`: greps `.github/workflows/ci.yml` for the literal `expected exactly 3 bench tests`; also runs `pytest --collect-only -m bench tests/bench/` and asserts collection count == 3 after the new bench scripts land (they MUST NOT carry the `bench` marker). Fails red if a future contributor accidentally tags one.
- `test_hosted_runner_bench_thresholds.py::test_threshold_boundaries`: parametrize `[(99.0, Ok), (100.0, Fail), (101.0, Fail), (50.0, CommentOnly), (49.9, Ok)]` AND p95 `[(359.0, Ok), (360.0, Ok), (360.001, Fail)]` against the pure `compare_to_baseline`. Fails red because the kernel doesn't exist.
- `test_bench_nightly_workflow.py::test_cron_runs_on_env_permissions`: asserts `cron == "0 4 * * *"`, `runs-on == "ubuntu-24.04"`, `env.CODEGENIE_FORCE_CPU_COUNT == "2"`, `permissions.pull-requests == "write"`. Fails red.
- `test_contract_freeze_allowlist.py::test_third_field_rejected`: parametrize `[("foo"), ("bar"), ("parsed_manifest_v2"), ("__init__")]`; asserts each raises `ValueError` with the `02-ADR-0004` substring. Fails red.
- `test_cpu_budget.py::test_env_var_respected`: parametrize `[("absent", os.cpu_count() or 1), ("2", 2), ("abc", ValueError), ("-1", ValueError), ("", os.cpu_count() or 1)]`. Fails red because the module doesn't exist.
- `test_bench_portfolio_walltime_smoke.py`: smoke against `minimal-ts`. Fails red because the script doesn't exist.
- `test_bench_index_health_smoke.py::test_metamorphic_with_injected_sleep`: runs the harness twice (baseline + with monkeypatched `time.sleep(0.5)` in `IndexHealthProbe.run`); asserts the second `fraction_of_total` exceeds the first. Fails red.
- `test_baseline_has_metadata.py::test_three_metadata_keys`: loads both baseline JSONs; asserts the `refreshed_at, refreshed_by, reason` keys exist and are non-empty. Fails red.
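
The boundary cases above exercise a pure comparison function. A minimal sketch of such a kernel follows, assuming a signature and field names that are not given in the story (`fail_regression_pct`, `fail_p95_seconds`, `comment_regression_pct`, and the `measured_s`/`baseline_s`/`p95_s` parameters are all hypothetical). Each limit is optional, so which checks apply depends on the `Threshold` instance a bench script composes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Ok:
    pass


@dataclass(frozen=True)
class CommentOnly:
    message: str


@dataclass(frozen=True)
class Fail:
    message: str


Verdict = Union[Ok, CommentOnly, Fail]


@dataclass(frozen=True)
class Threshold:
    # Hypothetical field names; each limit is optional so a script opts in.
    fail_regression_pct: Optional[float] = None     # inclusive (>=)
    fail_p95_seconds: Optional[float] = None        # strict (>)
    comment_regression_pct: Optional[float] = None  # inclusive (>=)


def compare_to_baseline(
    measured_s: float, baseline_s: float, p95_s: float, threshold: Threshold
) -> Verdict:
    """Pure: no I/O, no exit; maps measurements to a Verdict."""
    regression_pct = (measured_s - baseline_s) * 100.0 / baseline_s
    t = threshold
    if t.fail_regression_pct is not None and regression_pct >= t.fail_regression_pct:
        return Fail(f"regression {regression_pct:.1f} % >= {t.fail_regression_pct} %")
    if t.fail_p95_seconds is not None and p95_s > t.fail_p95_seconds:
        return Fail(f"p95 {p95_s} s > {t.fail_p95_seconds} s")
    if t.comment_regression_pct is not None and regression_pct >= t.comment_regression_pct:
        return CommentOnly(f"regression {regression_pct:.1f} %")
    return Ok()


# Gating thresholds as worded in the story: >= 100 % (inclusive), p95 > 360 s (strict).
gating = Threshold(fail_regression_pct=100.0, fail_p95_seconds=360.0)
print(type(compare_to_baseline(200.0, 100.0, 300.0, gating)).__name__)
```

Keeping the verdict a plain value means the parametrized boundary test never touches `gh` or `sys.exit`; only the impure shell around it does.
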
GREEN (minimum code to pass):
- Create `src/codegenie/coordinator/_cpu_budget.py` with `effective_cpu_count()`. Edit `coordinator.py:489`.
- Create `tests/_ci_support/requires_tool.py` with `@requires_tool` + warning emission.
- Create `tests/bench/_bench_kernel.py` with the `Verdict = Ok | CommentOnly | Fail` sum type, the `Threshold` dataclass, pure `compare_to_baseline`, and impure `post_comment_if` + `exit_with_verdict`.
- Create `tests/unit/ci/_workflow_model.py` with the typed Pydantic loader.
- Write the three bench scripts using `_bench_kernel`. Write the two baseline JSON files with metadata headers (initial values from first-run measurements; document the refresh ritual in `baselines/README.md`).
- Extend `.github/workflows/ci.yml` with the five new top-level jobs (and the `mypy` + `bench` promotions). Leave legacy `lint`/`typecheck`/`test`/`security`/`fence` untouched.
- Create `.github/workflows/bench-nightly.yml`.
- Extend `scripts/regen_probe_contract_snapshot.py` with the `_PROBE_CONTEXT_FIELD_ALLOWLIST` constant + `--check` flag + `ValueError` on a third field.
REFACTOR:
- Confirm the three bench scripts share zero ad-hoc baseline-load / ratio-compute / comment-on-PR code; all of it flows through `_bench_kernel`. If any script grows a sixth helper, surface it as a future extraction; for now three is the rule-of-three threshold and the kernel is justified.
- Confirm `mypy --strict tests/bench/ src/codegenie/coordinator/_cpu_budget.py tests/_ci_support/ tests/unit/ci/` is clean.
- Run the AC-5 ritual locally (introduce a B2 bug, confirm `pytest tests/adv/phase02/test_stale_scip_fixture.py` fails, revert); capture proof in `_attempts/S8-03.md`.
- Run the AC-6b ritual (delete a `case` in `confidence_section.py`, confirm `mypy` fails, revert); capture mypy stderr in `_attempts/S8-03.md`.
- `ruff format`, `ruff check`, `mypy --strict`, `make lint-imports`, `make fence` all green on touched modules.
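
The baseline files that the bench scripts load carry a metadata header with `refreshed_at`, `refreshed_by`, and `reason`. A minimal sketch of the non-empty check follows. It assumes the three keys sit at the top level of the JSON object, which the story does not specify, and the sample values are placeholders rather than real baseline data.

```python
import json

REQUIRED_METADATA_KEYS = ("refreshed_at", "refreshed_by", "reason")


def missing_metadata_keys(baseline: dict) -> list[str]:
    """Return the required keys that are absent, non-string, or blank."""
    missing = []
    for key in REQUIRED_METADATA_KEYS:
        value = baseline.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


# Placeholder header for illustration only; not a real committed baseline.
sample = json.loads(
    '{"refreshed_at": "2026-05-18", "refreshed_by": "placeholder", "reason": "initial baseline"}'
)
assert missing_metadata_keys(sample) == []
print(missing_metadata_keys({"refreshed_at": "2026-05-18", "reason": "  "}))
```
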
Notes for the implementer
- The `adv-phase02` lane is the load-bearing gate. Green on `master` is the public contract that the roadmap exit criterion is met. Treat any flake here as a P0 (`phase-arch-design.md §"Adversarial corpus"`). Fix the test or the fixture; never set `continue-on-error: true`.
- Bench advisory vs gating: DO NOT blur the line. Of the three new benches, only `bench_portfolio_walltime_hosted_runner.py` gates the build, and only on `≥ 100 %` OR `> 360 s p95`, and only on the nightly cron, not per-PR. Mixing these up either (a) blocks PRs on infra noise or (b) lets a 100 % regression sail through.
- `CODEGENIE_FORCE_CPU_COUNT` plumbing must land BEFORE the hosted-runner bench is meaningful. AC-10a is the prerequisite. The bench script's `os.environ["CODEGENIE_FORCE_CPU_COUNT"] = "2"` must be set BEFORE the first `import codegenie.coordinator` so the wrapper reads the override on first call. Document this ordering in the bench script's module docstring.
- Baseline-refresh ritual. Baselines are committed JSON + metadata header (`refreshed_at`, `refreshed_by`, `reason`). A contributor who intentionally regresses MUST refresh in a separate PR with reviewer approval. The metadata header is the audit trail; reviewers can grep `git log` for baseline-refresh PRs. Document in `tests/bench/baselines/README.md`.
- PR-comment helper auth + fork degradation. Use `${{ secrets.GITHUB_TOKEN }}`. Set `permissions: { pull-requests: write, contents: read }` on bench jobs ONLY (not on `unit`/`portfolio`/etc.). On fork PRs (`github.event.pull_request.head.repo.fork == true`), the comment step is skipped with a loud log (`echo "::warning::Fork PR detected; bench comment skipped; measurement still ran"`); the bench artifact upload still runs so an operator can inspect manually.
- Tool-presence pre-flight. `@requires_tool` is the per-test decorator (decorator + warning emission). The `integration` job also runs a top-of-job shell preflight that prints the missing-tool list as the first stdout line so a human scanning CI logs sees the list at a glance: `for tool in semgrep syft grype gitleaks tree-sitter docker strace scip-typescript; do command -v "$tool" >/dev/null || echo "MISSING: $tool"; done`.
- `portfolio` job's 6-min budget vs Gap 2 hosted-runner reality. The 6-min budget assumes the dev-laptop bench's measurements. The nightly hosted-runner bench is what verifies the assumption against actual CI hardware. If the nightly fails the ≥ 100 % / p95 > 360 s threshold, the operator's choice is the escape valve (committed `.codegenie/cache/` blobs per fixture); do not edit the 6-min budget unilaterally, because that requires an ADR amendment.
- Phase 0 fence runs first. Order the `needs:` graph so that every new lane depends on `fence` passing. If a future contributor accidentally imports `httpx`, no other lane wastes minutes. Use `needs: [fence]` on every new top-level job.
- `mypy` job is fast (< 30 s), so it uses no caching beyond what mypy provides natively. Adding action-level mypy caching is a separate optimization; out of scope.
- Cron timezone. `cron: "0 4 * * *"` is UTC (GH Actions convention). That's `21:00 PDT prior day` / `06:00 CEST`. Document in `bench-nightly.yml`'s top comment so an operator in PT doesn't expect a 4:00 AM PT run.
- Don't run benches in `unit`/`portfolio`/`adv-phase02` lanes. The new bench scripts live in `tests/bench/`, but the `unit` lane's `pytest tests/unit/` discovery does NOT cross into `tests/bench/`. The `bench` lane is the only consumer of `bench_portfolio_walltime.py` + `bench_index_health_overhead.py`; the `bench-nightly.yml` workflow is the only consumer of `bench_portfolio_walltime_hosted_runner.py`.
- `bench` lane scripts are NOT marked `-m bench`. The existing S5-01 collection guard expects exactly 3 marked tests. Marking a new bench script with `@pytest.mark.bench` would push the count to 4 and break the guard. The new bench scripts are invoked by explicit path: `pytest tests/bench/bench_portfolio_walltime.py …`. AC-7b enforces this.
- Threshold inclusivity. Regression `>= 100 %` triggers Fail (inclusive); p95 `> 360 s` triggers Fail (strict). The boundary parametrize in AC-10b is the source of truth; arch §"Gap 2" uses the wording "≥ 100 %" and "> 360 s", which this story honors verbatim.
- Rule 2 vs three-bench kernel. Three bench scripts crossing the rule-of-three threshold is exactly when the design-patterns toolkit prescribes extraction. `_bench_kernel.py` is justified: adding a fourth bench in Phase 3+ requires zero edits to the kernel (compose a new `Threshold` instance, no `if` branches added). If only two benches landed, the kernel would be premature.
- Pure/impure split in `_bench_kernel.py`. `compare_to_baseline` is pure: it takes measurements + baseline + thresholds and returns a `Verdict`. `post_comment_if` is impure (calls `gh pr comment`). `exit_with_verdict` is impure (calls `sys.exit`). Tests cover the pure function with parametrize; the impure shell has a thin smoke test only. Functional-core / imperative-shell (CLAUDE.md convention).
- Workflow-YAML tests use the typed Pydantic loader, not grep. `_workflow_model.WorkflowFile.from_path(path)` parses + validates the workflow up front. A malformed YAML fails the test immediately, not silently. Every CI test in this story reuses the same parser; rule-of-three reached (three CI tests minimum: `test_workflow_yaml.py`, `test_bench_nightly_workflow.py`, `test_no_xdist_anywhere.py`).
- AC-5's file enumeration must be kept in sync with `tests/adv/phase02/`. If a future story adds a 9th adversarial file, the failing AC-5 enumeration is the prompt to update both the story file and the assertion. Treat this as load-bearing scoping discipline, not test flake.
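
The tool-presence pre-flight described in the notes can also be expressed in Python, which makes the logic unit-testable without touching `PATH`. This is a stdlib-only sketch, not the shipped `@requires_tool` decorator: `missing_tools` and `skip_reason` are hypothetical helpers, and the exact skip-reason wording beyond the `SKIPPED LOUD` marker and the tool name is an assumption.

```python
import shutil
from typing import Callable, Iterable, Optional

# Tool list taken from the integration job's shell preflight.
INTEGRATION_TOOLS = (
    "semgrep",
    "syft",
    "grype",
    "gitleaks",
    "tree-sitter",
    "docker",
    "strace",
    "scip-typescript",
)


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that `which` cannot resolve (injectable for tests)."""
    return [tool for tool in tools if which(tool) is None]


def skip_reason(tool: str) -> str:
    """Build a skip reason containing the loud marker and the tool name."""
    return f"SKIPPED LOUD: required tool {tool!r} is not on PATH"


# Mirrors the shell loop: one `MISSING: <tool>` line per absent binary.
for tool in missing_tools(INTEGRATION_TOOLS):
    print(f"MISSING: {tool}")
```

A decorator built on top of this would pass `skip_reason(tool)` to the test framework's skip mechanism when `missing_tools([tool])` is non-empty; injecting `which` keeps the unit test independent of what is installed on the runner.
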