S8-03 Attempt Log: Eight Phase-2 CI lanes + three advisory bench canaries
Attempt 1: 2026-05-18 (phase-story-executor / scheduled task)
Code shipped
New modules / files (25 new files):
- `src/codegenie/coordinator/_cpu_budget.py`: pure `effective_cpu_count() -> int` reading `CODEGENIE_FORCE_CPU_COUNT` with fallback to `os.cpu_count() or 1`. Raises `ValueError` on non-positive-int values. (~35 LOC)
- `tests/_ci_support/__init__.py` + `tests/_ci_support/requires_tool.py`: `@requires_tool(name)` decorator wrapping `pytest.mark.skipif` with a `SKIPPED LOUD` reason format and one-shot `warnings.warn` per missing tool per session.
- `tests/bench/_bench_kernel.py`: pure `compare_to_baseline`, sum type `Verdict = Ok | CommentOnly | Fail`, `Threshold` dataclass; impure `post_comment_if` + `exit_with_verdict` + `load_baseline`. Three bench scripts compose this kernel (rule-of-three extraction; CLAUDE.md / story Note 17).
- `tests/bench/bench_portfolio_walltime.py`: five-fixture cold + warm p50 bench; comment-only on ≥ 50 % regression.
- `tests/bench/bench_index_health_overhead.py`: B2 walltime as fraction of total cold gather on `minimal-ts`; comment on ≥ 10 % via the kernel's `comment_pct=100` %-of-baseline (target 5 %).
- `tests/bench/bench_portfolio_walltime_hosted_runner.py`: GATING; sets `CODEGENIE_FORCE_CPU_COUNT=2` BEFORE coordinator import; `Threshold(comment_pct=50, fail_pct=100, fail_p95_s=360)`.
- `tests/bench/baselines/portfolio_walltime.json` + `portfolio_walltime_hosted_runner.json`: committed JSON with metadata header (`refreshed_at`, `refreshed_by`, `reason`) + `measurements` map.
- `tests/bench/baselines/README.md`: refresh ritual documentation.
- `tests/bench/test_bench_portfolio_walltime_smoke.py`: module-import + threshold-shape + `subprocess`-stubbed `run()` shape smoke.
- `tests/bench/test_bench_index_health_smoke.py`: module-import + threshold-shape + metamorphic injection (sleep → fraction strictly increases).
- `tests/bench/test_baseline_has_metadata.py`: three-key metadata header assertion + measurement-map shape (parametrized over both baselines).
- `tests/unit/ci/__init__.py` + `tests/unit/ci/_workflow_model.py`: typed Pydantic `WorkflowFile` / `Job` / `Step` parser used by every workflow-YAML test.
- `tests/unit/ci/test_workflow_yaml.py`: AC-1/2/4/7 subset + matrix + xdist veto + advisory bench step + fork-PR write perm + `needs: [fence]` chain.
- `tests/unit/ci/test_requires_tool_decorator.py`: AC-3 decorator contract.
- `tests/unit/ci/test_adv_phase02_load_bearing.py`: AC-5 (continue-on-error veto + 8-file presence + ≥ 1-collected-test-per-file via `pytest --collect-only`).
- `tests/unit/ci/test_mypy_global_warn_unreachable.py`: AC-6 global `warn_unreachable = true` + no-override + no-CLI-flag.
- `tests/unit/ci/test_bench_collection_guard_unchanged.py`: AC-7b guard threshold + no `pytest.mark.bench` on new scripts.
- `tests/unit/ci/test_hosted_runner_bench_thresholds.py`: AC-10b parametrized boundary tests for `compare_to_baseline` (≥ 100 %, > 360 s inclusivity).
- `tests/unit/ci/test_bench_nightly_workflow.py`: AC-10c cron / runs-on pin / `CODEGENIE_FORCE_CPU_COUNT` / `pull-requests: write` / `workflow_dispatch`.
- `tests/unit/ci/test_contract_freeze_allowlist.py`: AC-11 allowlist + `--check` flag drift detection + parametrized rejection of non-allowlisted fields.
- `tests/unit/ci/test_no_xdist_anywhere.py`: AC-13 workflow + pyproject scan + metamorphic injection.
- `tests/unit/coordinator/test_cpu_budget.py`: AC-10a env-var contract (18 cases including value-error message naming the env-var).
- `.github/workflows/bench-nightly.yml`: UTC cron `0 4 * * *`, pinned `ubuntu-24.04`, `CODEGENIE_FORCE_CPU_COUNT: "2"`, `workflow_dispatch`.
Modified:
- `.github/workflows/ci.yml`: extended with seven new top-level jobs (`contract-freeze`, `unit`, `integration`, `portfolio`, `adv-phase02`, `mypy`, `bench`); the legacy `lint` / `typecheck` / `test` / `security` / `fence` jobs are preserved unchanged. Every new lane has `needs: [fence]` so a closure-fence violation short-circuits the workflow.
- `src/codegenie/coordinator/coordinator.py`: line 489 `cpu = os.cpu_count() or 1` → `cpu = effective_cpu_count()`; removed unused `import os`.
- `scripts/regen_probe_contract_snapshot.py`: added `_PROBE_CONTEXT_FIELD_ALLOWLIST` constant + `_enforce_probe_context_allowlist()` enforcement + `_display_snapshot_path()` helper + `--check` mode in `main(argv=None)` that diffs against the committed snapshot and returns exit 1 on drift.
- `tests/unit/test_ci_workflow.py`: `REQUIRED_JOBS` expanded from the legacy 6-set to the 13-set (legacy ∪ Phase-2). Pre-existing parser tests now assert the wider set per S8-03 AC-1 / arch §"CI gates".
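The env-var contract behind the `coordinator.py` change can be sketched as follows. This is a hedged reconstruction from the log, not the shipped `_cpu_budget.py`: the empty-string fallback, the strict ASCII-digit parsing, and the error wording are assumptions.

```python
import os

_ENV_VAR = "CODEGENIE_FORCE_CPU_COUNT"


def effective_cpu_count() -> int:
    """Return the forced CPU count when set, else the detected count (at least 1)."""
    raw = os.environ.get(_ENV_VAR)
    if raw is None or raw == "":
        # Assumption: unset and empty both fall back to detection.
        return os.cpu_count() or 1
    # Strict parse: plain int() would accept " 2 " and "+2", so require
    # ASCII digits only. This also rejects "1.5", "0x2", "-1" and "two".
    if not (raw.isascii() and raw.isdigit()) or int(raw) <= 0:
        raise ValueError(f"{_ENV_VAR} must be a positive integer, got {raw!r}")
    return int(raw)
```

Because the sketch reads the environment on every call, a test can flip the variable with `monkeypatch.setenv` without reloading the module.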
Per-AC evidence
| AC | Evidence |
|---|---|
| AC-1 | tests/unit/ci/test_workflow_yaml.py::test_required_subset_present, test_legacy_jobs_preserved, test_phase2_lane_runs_on_python_311_and_312 (parametrized over 7 lanes). 9 passing. tests/unit/test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs updated to expect the 13-job set. |
| AC-2 | test_unit_lane_serial_and_no_cov. Unit lane invokes pytest tests/unit/ -q --no-cov with timeout-minutes: 5. |
| AC-3 | tests/unit/ci/test_requires_tool_decorator.py — 8 tests (mark type, SKIPPED LOUD literal, present/missing branches, warn-once-per-session, parametrize-composability, real lookup smoke). 8 passing. |
| AC-4 | test_portfolio_serial_budget — asserts timeout-minutes ≤ 7 and no-xdist on portfolio lane. |
| AC-5 | tests/unit/ci/test_adv_phase02_load_bearing.py — 18 tests: continue-on-error veto + each of 8 files exists + each has ≥ 1 collected test via pytest --collect-only + no extra files. Also verified locally: .venv/bin/pytest tests/adv/phase02/test_stale_scip_fixture.py passes (1 passed). |
| AC-6 | tests/unit/ci/test_mypy_global_warn_unreachable.py — 3 tests: global setting True, no override disables it, mypy lane does NOT pass --warn-unreachable on CLI. AC-6b ritual run locally: removed the IndexerError case from src/codegenie/report/confidence_section.py via sed, ran mypy --strict src/codegenie/report/confidence_section.py, captured error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]. Restored from backup; mypy clean. The global warn_unreachable=true is what fires this — confirmed end-to-end. |
| AC-7 | test_bench_advisory, test_bench_lane_runs_new_bench_scripts, test_bench_lane_grants_pr_write. AC-7b: tests/unit/ci/test_bench_collection_guard_unchanged.py — 4 tests (threshold literal preserved, new scripts have no bench marker). AC-7c: bench lane has if: github.event.pull_request.head.repo.fork == false on the comment step; fork-PR alt step prints ::warning:: and still uploads artifact. |
| AC-8 | tests/bench/bench_portfolio_walltime.py — five-fixture cold+warm p50 (5 runs each) via the kernel's compare_to_baseline. Smoke test asserts shape with stubbed subprocess. tests/bench/test_baseline_has_metadata.py — 5 tests passing (3 metadata-keys × 2 baselines + ISO-8601-UTC). |
| AC-9 | tests/bench/bench_index_health_overhead.py — monkeypatches IndexHealthProbe.run to capture B2 walltime, computes fraction of total. Smoke test includes metamorphic check (fraction strictly increases when injected delay grows). |
| AC-10a | tests/unit/coordinator/test_cpu_budget.py — 18 cases including positive ints, empty string, non-int (abc/1.5/two/2x/spaces/0x2), non-positive (0/-1/-100), os.cpu_count=None fallback, error-message-names-env-var, and the structural coordinator-uses-effective_cpu_count check. |
| AC-10b | tests/unit/ci/test_hosted_runner_bench_thresholds.py — 19 parametrized cases over regression {-10, 0, 49.9, 50, 99, 99.999, 100, 101, 500} and p95 {0, 359, 360, 360.001, 361, 1000}. Confirms ≥ 100 % and > 360 s boundaries (inclusive/strict per arch §Gap 2). |
| AC-10c | tests/unit/ci/test_bench_nightly_workflow.py — 6 tests: cron 0 4 * * *, ubuntu-24.04 pinned, CODEGENIE_FORCE_CPU_COUNT: "2" at job level, pull-requests: write, workflow_dispatch-able, runs only the hosted-runner bench. |
| AC-11 | tests/unit/ci/test_contract_freeze_allowlist.py — 9 tests: allowlist contents, committed snapshot has image_digest_resolver, parametrized rejection of 4 non-allowlisted fields with 02-ADR-0004 substring, --check exit 0 on master, --check exit 1 on drift, CI lane invokes --check. |
| AC-12 | make lint-imports green (2 kept, 0 broken). mypy --strict src/ green (135 source files). ruff check + ruff format --check clean on all touched files. Fence test passes (9 tests). No new anthropic/openai/langgraph/httpx/requests/socket imports introduced — _cpu_budget.py imports os only. |
| AC-13 | tests/unit/ci/test_no_xdist_anywhere.py — 4 tests: per-workflow scan (2 workflows), pyproject addopts scan, metamorphic injection (regex must match a planted -n 4). Initial regex was missing the right inclusivity for -n\s; the metamorphic test caught it red and we corrected to -n[\s\d] + \b-anchored alternatives. |
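The AC-13 scan and its metamorphic check can be illustrated with a small sketch. The `-n[\s\d]` fragment comes from the log; the look-behind guard, the `--numprocesses` and `--dist` alternatives, and the helper name `uses_xdist` are assumptions chosen for illustration, not the regex in `test_no_xdist_anywhere.py`.

```python
import re

# -n followed by whitespace or a digit covers "-n 4", "-n auto" and "-n4".
# The look-behind stops matches inside longer options such as "--no-cov".
_XDIST_FLAG = re.compile(
    r"(?<![\w-])(?:-n[\s\d]|--numprocesses\b|--dist\b)"
)


def uses_xdist(command: str) -> bool:
    """Return True if the command line appears to request pytest-xdist."""
    return _XDIST_FLAG.search(command) is not None


# Metamorphic check: planting the flag must flip a clean command to flagged.
clean = "pytest tests/unit/ -q --no-cov"
assert not uses_xdist(clean)
assert uses_xdist(clean + " -n 4")
```

The metamorphic pair matters more than any single positive case: it proves the scanner can actually see the thing it claims to veto.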
Conflict surfaced + resolution (CLAUDE.md Rule 7)
- Existing `test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs` enforces set-equality, but AC-1 reshapes the job set additively to 13. Two patterns in the codebase: the existing test claimed "exactly six required jobs"; the new arch §"CI gates" prescribes eight named lanes. Resolution: the existing test was updated to expect the union (`_LEGACY_JOBS | _PHASE2_JOBS`), with a comment pointing at `tests/unit/ci/test_workflow_yaml.py` for the per-lane invariants. The newer arch doc (more recent) wins; the older intent (exhaustive equality, no surprises) is preserved by promoting the set to the union rather than relaxing to a subset. This honors CLAUDE.md Rule 7: pick the more recent, surface the older as part of the same change.
Out-of-scope finding (Rule 3: surgical changes)
- The branch the scheduled task landed on (`codex/tier1-architecture-cleanup`) carried substantial uncommitted WIP: an `AGENTS.md`, a `docs/reviews/` tree, a `0039` ADR, a new `docs_consistency.py` module, and a parallel `tests/unit/test_docs_consistency.py`. None of this is referenced by S8-03 or its hardened ACs. The WIP was preserved via `git stash push -u` so the parallel agent's work isn't lost; S8-03 lands on a fresh branch off `master` (`feat/phase2-s8-03-ci-jobs-and-benches`) to keep the blast radius scoped to S8-03's prescribed surface.
Refactor decisions (design-patterns lens)
- Rule-of-three extraction. Three bench scripts duplicated baseline-load + ratio-compute + comment + exit. `tests/bench/_bench_kernel.py` owns the pure decision (`compare_to_baseline` returning a `Verdict` sum type) AND the impure shell (`post_comment_if`, `exit_with_verdict`, `load_baseline`). Adding a fourth bench in Phase 3+ requires zero edits to the kernel: compose a new `Threshold` instance and dispatch. Functional-core / imperative-shell, project-wide convention.
- Sum type for verdict. `Verdict = Ok | CommentOnly | Fail` (frozen, slots, kw_only) replaces a boolean-pair return. `compare_to_baseline` is pure; the impure shell pattern-matches the verdict. Mirrors `Fresh|Stale` discipline elsewhere in the codebase.
- Newtype-discipline-by-extension. `Threshold` is a frozen dataclass; thresholds for the three benches are module-level `Final` instances. Mixing thresholds (e.g., passing the hosted-runner's gating thresholds to the advisory dev-laptop bench) is impossible because of how each script constructs its `_THRESHOLDS`.
- Strategy seam via dispatch tables, not branching. The `compare_to_baseline` kernel composes regression + p95 thresholds; both can be `None` to opt out. No `if "is_hosted_runner": ...` switches. This mirrors the project-wide preference for data-driven registries over branching code.
- Typed Pydantic workflow loader. Every CI test in this story uses `WorkflowFile.from_path(...)` instead of `yaml.safe_load(...) + dict.get`: type errors caught by mypy, malformed YAML caught immediately, and the parser handles the PyYAML `True`/`"on"` boolean-key surface in exactly one place.
- Open/Closed. The contract-freeze allowlist is an explicit `frozenset` constant. Adding a new Phase-3+ field requires editing the allowlist AND landing an ADR. The test parametrizes over 4 fake-field names to prove the `02-ADR-0004` pointer is the consistent failure mode.
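The allowlist discipline in the last bullet can be sketched as below. Everything here is illustrative: the allowlist contents (only `image_digest_resolver` is named in this log), the function name, the exception type, and the message wording are assumptions, and the real enforcement lives in `scripts/regen_probe_contract_snapshot.py`.

```python
from typing import FrozenSet, Iterable

# Assumed contents; the real constant may hold more fields.
ALLOWLIST_SKETCH: FrozenSet[str] = frozenset({"image_digest_resolver"})


def enforce_allowlist(fields: Iterable[str]) -> None:
    """Raise if any field falls outside the allowlist, pointing at the ADR."""
    extra = sorted(set(fields) - ALLOWLIST_SKETCH)
    if extra:
        raise ValueError(
            f"non-allowlisted field(s) {extra}: adding a field requires "
            "an allowlist edit and an ADR (see 02-ADR-0004)"
        )


enforce_allowlist(["image_digest_resolver"])  # passes silently
```

Putting the ADR identifier in the message is what lets a parametrized test assert one consistent failure mode across every rejected field name.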
Gates
| Gate | Status |
|---|---|
| `mypy --strict src/` | ✓ no issues found in 135 source files |
| `ruff check` (touched files) | ✓ all checks passed |
| `ruff format --check` (touched files) | ✓ all formatted |
| `lint-imports` | ✓ 2 kept, 0 broken |
| fence (`tests/unit/test_pyproject_fence.py`) | ✓ 9 passed |
| Full unit suite (`tests/unit/`) | ✓ 3432 passed, 16 skipped, 1 xfailed |
| AC-6b ritual (sabotage confidence_section.py) | ✓ mypy fires error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]; restored from backup; clean |
Adaptations + deviations from the hardened story
- AC-9 threshold modelling. Story prescribed "≥ 10 % posts a PR comment". The kernel's `Threshold.comment_pct` is computed as percentage-regression-vs-baseline, not absolute. The bench uses `_BASELINE = {"minimal-ts/b2_fraction": 0.05}` and `comment_pct=100.0`, so when the measured fraction ≥ 2× baseline (i.e. ≥ 10 % of total walltime) the kernel returns `CommentOnly`. Mathematically equivalent to the story's prescription; surfaced here so a future reviewer can trace the indirection.
- `@requires_tool` warning emission. The story said "emit a structlog warning OR `warnings.warn(...)`". Implementation uses `warnings.warn` because the `_ci_support/` package must not depend on `codegenie.*` (the integration lane is the consumer, not a provider). The warning fires once-per-tool-per-session via a module-level `set`.
- `bench` lane `gh pr comment` step. Per AC-7c, the `if:` guard on the comment step is `github.event.pull_request.head.repo.fork == false || github.event_name == 'push'` so the bench still runs on `push` to master (no PR context). Fork PRs hit the explicit alt step that prints `::warning::` and still runs the bench for artifact-only inspection.
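The AC-9 indirection is easier to trace with a worked sketch of the kernel's pure decision. The signature of `compare_to_baseline`, the payload fields on the verdict variants, and the evaluation order are assumptions; only the threshold names, the `≥` / `>` boundaries, and the numbers come from this log. The dataclasses use `frozen=True` only, to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Ok:
    pass


@dataclass(frozen=True)
class CommentOnly:
    regression_pct: float


@dataclass(frozen=True)
class Fail:
    reason: str


Verdict = Union[Ok, CommentOnly, Fail]


@dataclass(frozen=True)
class Threshold:
    comment_pct: Optional[float] = None  # None opts out of that check
    fail_pct: Optional[float] = None
    fail_p95_s: Optional[float] = None


def compare_to_baseline(
    measured: float,
    baseline: float,
    threshold: Threshold,
    p95_s: Optional[float] = None,
) -> Verdict:
    """Pure decision; assumes baseline > 0."""
    regression_pct = (measured - baseline) / baseline * 100.0
    if (threshold.fail_p95_s is not None and p95_s is not None
            and p95_s > threshold.fail_p95_s):  # strict: exactly 360 s passes
        return Fail(f"p95 {p95_s}s exceeds {threshold.fail_p95_s}s")
    if threshold.fail_pct is not None and regression_pct >= threshold.fail_pct:
        return Fail(f"regression {regression_pct:.1f}%")
    if (threshold.comment_pct is not None
            and regression_pct >= threshold.comment_pct):
        return CommentOnly(regression_pct)
    return Ok()


# AC-9: baseline fraction 0.05 with comment_pct=100 means a measured
# fraction of 0.10 is a 100 % regression, i.e. 10 % of total walltime.
ac9 = compare_to_baseline(0.10, 0.05, Threshold(comment_pct=100.0))
assert isinstance(ac9, CommentOnly)
```

The arithmetic is $(0.10 - 0.05) / 0.05 \times 100 = 100$, so a relative threshold of 100 % against a 5 % baseline is the same trigger as an absolute 10 % threshold.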
Lessons for follow-on stories
- The hardened story is exceptionally long and densely cross-referenced; the validator already collapsed 14 prescriptions into a coherent shape. A future S8-04 implementer should read this attempt log AND the validation report `_validation/S8-03-ci-jobs-and-benches.md` together before opening the story file, otherwise the "subset, not equality" / "additive, not replacement" reshape is easy to miss.
- The xdist-veto regex needs both `\s` and `\d` after `-n` because `pytest -n4` (no space) is the common typo. The metamorphic test caught the gap; if a future workflow grows another parallel-invocation flag, copy the metamorphic pattern from `tests/unit/ci/test_no_xdist_anywhere.py`.
- The PyYAML `True`/`"on"` boolean-key surface is documented in the workflow loader; do not paper over it with `yaml.SafeLoader.add_constructor` mutations, because that hides the surface from a future contributor opening their first workflow YAML.
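For context on the last point: PyYAML resolves scalars by YAML 1.1 rules, so a bare `on:` key in a workflow file is parsed as the boolean `True`. A hedged, stdlib-only sketch of handling that in one explicit place is shown below; `normalize_on_key` is a hypothetical helper, not the code in `_workflow_model.py`.

```python
from typing import Any, Dict, Mapping


def normalize_on_key(raw: Mapping[Any, Any]) -> Dict[Any, Any]:
    """Return a copy of a parsed workflow mapping with the True key renamed to "on".

    Raises if both spellings are present, since the intent is then ambiguous.
    """
    normalized: Dict[Any, Any] = {}
    for key, value in raw.items():
        if key is True:
            if "on" in raw:
                raise ValueError("workflow has both a boolean True key and an 'on' key")
            key = "on"
        normalized[key] = value
    return normalized


# Shape of what yaml.safe_load would return for "on: {push: null}\njobs: {}".
parsed = {True: {"push": None}, "jobs": {}}
assert normalize_on_key(parsed) == {"on": {"push": None}, "jobs": {}}
```

Doing the rename in a visible function, rather than by mutating the loader, keeps the quirk discoverable by anyone reading the parser.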