
S8-03 Attempt Log — Eight Phase-2 CI lanes + three advisory bench canaries

Attempt 1 — 2026-05-18 (phase-story-executor / scheduled task)

Code shipped

New modules / files (25 new files):

  • src/codegenie/coordinator/_cpu_budget.py — pure effective_cpu_count() -> int reading CODEGENIE_FORCE_CPU_COUNT with fallback to os.cpu_count() or 1. Raises ValueError on non-positive-int values. (~35 LOC)
  • tests/_ci_support/__init__.py + tests/_ci_support/requires_tool.py — @requires_tool(name) decorator wrapping pytest.mark.skipif with a SKIPPED LOUD reason format and one-shot warnings.warn per missing tool per session.
  • tests/bench/_bench_kernel.py — pure compare_to_baseline, sum type Verdict = Ok | CommentOnly | Fail, Threshold dataclass; impure post_comment_if + exit_with_verdict + load_baseline. Three bench scripts compose this kernel (rule-of-three extraction; CLAUDE.md / story Note 17).
  • tests/bench/bench_portfolio_walltime.py — five-fixture cold + warm p50 bench; comment-only on ≥ 50 % regression.
  • tests/bench/bench_index_health_overhead.py — B2 walltime as a fraction of total cold gather on minimal-ts; comments when the fraction reaches ≥ 10 % of total walltime, expressed through the kernel as comment_pct=100 (a 100 % regression against the 5 % target baseline).
  • tests/bench/bench_portfolio_walltime_hosted_runner.py — GATING; sets CODEGENIE_FORCE_CPU_COUNT=2 BEFORE coordinator import; Threshold(comment_pct=50, fail_pct=100, fail_p95_s=360).
  • tests/bench/baselines/portfolio_walltime.json + portfolio_walltime_hosted_runner.json — committed JSON with metadata header (refreshed_at, refreshed_by, reason) + measurements map.
  • tests/bench/baselines/README.md — refresh ritual documentation.
  • tests/bench/test_bench_portfolio_walltime_smoke.py — module-import + threshold-shape + subprocess-stubbed run() shape smoke.
  • tests/bench/test_bench_index_health_smoke.py — module-import + threshold-shape + metamorphic injection (sleep → fraction strictly increases).
  • tests/bench/test_baseline_has_metadata.py — three-key metadata header assertion + measurement-map shape (parametrized over both baselines).
  • tests/unit/ci/__init__.py + tests/unit/ci/_workflow_model.py — typed Pydantic WorkflowFile / Job / Step parser used by every workflow-YAML test.
  • tests/unit/ci/test_workflow_yaml.py — AC-1/2/4/7 subset + matrix + xdist veto + advisory bench step + fork-PR write perm + needs:[fence] chain.
  • tests/unit/ci/test_requires_tool_decorator.py — AC-3 decorator contract.
  • tests/unit/ci/test_adv_phase02_load_bearing.py — AC-5 (continue-on-error veto + 8-file presence + ≥ 1-collected-test-per-file via pytest --collect-only).
  • tests/unit/ci/test_mypy_global_warn_unreachable.py — AC-6 global warn_unreachable = true + no-override + no-CLI-flag.
  • tests/unit/ci/test_bench_collection_guard_unchanged.py — AC-7b guard threshold + no pytest.mark.bench on new scripts.
  • tests/unit/ci/test_hosted_runner_bench_thresholds.py — AC-10b parametrized boundary tests for compare_to_baseline (≥ 100 %, > 360 s inclusivity).
  • tests/unit/ci/test_bench_nightly_workflow.py — AC-10c cron / runs-on pin / CODEGENIE_FORCE_CPU_COUNT / pull-requests:write / workflow_dispatch.
  • tests/unit/ci/test_contract_freeze_allowlist.py — AC-11 allowlist + --check flag drift detection + parametrized rejection of non-allowlisted fields.
  • tests/unit/ci/test_no_xdist_anywhere.py — AC-13 workflow + pyproject scan + metamorphic injection.
  • tests/unit/coordinator/test_cpu_budget.py — AC-10a env-var contract (18 cases including value-error message naming the env-var).
  • .github/workflows/bench-nightly.yml — UTC cron 0 4 * * *, pinned ubuntu-24.04, CODEGENIE_FORCE_CPU_COUNT: "2", workflow_dispatch.
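
A minimal sketch of the effective_cpu_count() contract described for _cpu_budget.py above, not the shipped module. Treating an empty string as "no override" and rejecting surrounding whitespace are assumptions; the log only lists those inputs as test cases without stating the expected outcome.

```python
import os

_ENV_VAR = "CODEGENIE_FORCE_CPU_COUNT"


def effective_cpu_count() -> int:
    """Return the CPU budget, honouring the override env var when it is set."""
    raw = os.environ.get(_ENV_VAR)
    if raw is None or raw == "":
        # Assumption: unset and empty both mean "no override".
        return os.cpu_count() or 1
    # Accept plain ASCII decimal digits only, so "1.5", "2x", "0x2", "-1"
    # and " 2 " are all rejected before int() is ever called.
    if not (raw.isascii() and raw.isdigit()) or int(raw) <= 0:
        # The message names the env var so a failing CI log is self-explanatory.
        raise ValueError(f"{_ENV_VAR} must be a positive integer, got {raw!r}")
    return int(raw)
```

Keeping the function free of any coordinator import is what lets the hosted-runner bench set the env var before the coordinator is imported.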

Modified:

  • .github/workflows/ci.yml — extended with seven new top-level jobs (contract-freeze, unit, integration, portfolio, adv-phase02, mypy, bench); the legacy lint/typecheck/test/security/fence jobs are preserved unchanged. Every new lane has needs: [fence] so a closure-fence violation short-circuits the workflow.
  • src/codegenie/coordinator/coordinator.py — line 489 cpu = os.cpu_count() or 1 → cpu = effective_cpu_count(); removed unused import os.
  • scripts/regen_probe_contract_snapshot.py — added _PROBE_CONTEXT_FIELD_ALLOWLIST constant + _enforce_probe_context_allowlist() enforcement + _display_snapshot_path() helper + --check mode in main(argv=None) that diffs against the committed snapshot and returns exit 1 on drift.
  • tests/unit/test_ci_workflow.py — REQUIRED_JOBS expanded from the legacy 6-set to the 13-set (legacy ∪ Phase-2). Pre-existing parser tests now assert the wider set per S8-03 AC-1 / arch §"CI gates".

Per-AC evidence

AC Evidence
AC-1 tests/unit/ci/test_workflow_yaml.py::test_required_subset_present, test_legacy_jobs_preserved, test_phase2_lane_runs_on_python_311_and_312 (parametrized over 7 lanes). 9 passing. tests/unit/test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs updated to expect the 13-job set.
AC-2 test_unit_lane_serial_and_no_cov. Unit lane invokes pytest tests/unit/ -q --no-cov with timeout-minutes: 5.
AC-3 tests/unit/ci/test_requires_tool_decorator.py — 8 tests (mark type, SKIPPED LOUD literal, present/missing branches, warn-once-per-session, parametrize-composability, real lookup smoke). 8 passing.
AC-4 test_portfolio_serial_budget — asserts timeout-minutes ≤ 7 and no-xdist on portfolio lane.
AC-5 tests/unit/ci/test_adv_phase02_load_bearing.py — 18 tests: continue-on-error veto + each of 8 files exists + each has ≥ 1 collected test via pytest --collect-only + no extra files. Also verified locally: .venv/bin/pytest tests/adv/phase02/test_stale_scip_fixture.py passes (1 passed).
AC-6 tests/unit/ci/test_mypy_global_warn_unreachable.py — 3 tests: global setting True, no override disables it, mypy lane does NOT pass --warn-unreachable on CLI. AC-6b ritual run locally: removed the IndexerError case from src/codegenie/report/confidence_section.py via sed, ran mypy --strict src/codegenie/report/confidence_section.py, captured error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]. Restored from backup; mypy clean. The global warn_unreachable=true is what fires this — confirmed end-to-end.
AC-7 test_bench_advisory, test_bench_lane_runs_new_bench_scripts, test_bench_lane_grants_pr_write. AC-7b: tests/unit/ci/test_bench_collection_guard_unchanged.py — 4 tests (threshold literal preserved, new scripts have no bench marker). AC-7c: bench lane has if: github.event.pull_request.head.repo.fork == false on the comment step; fork-PR alt step prints ::warning:: and still uploads artifact.
AC-8 tests/bench/bench_portfolio_walltime.py — five-fixture cold+warm p50 (5 runs each) via the kernel's compare_to_baseline. Smoke test asserts shape with stubbed subprocess. tests/bench/test_baseline_has_metadata.py — 5 tests passing (3 metadata-keys × 2 baselines + ISO-8601-UTC).
AC-9 tests/bench/bench_index_health_overhead.py — monkeypatches IndexHealthProbe.run to capture B2 walltime, computes fraction of total. Smoke test includes metamorphic check (fraction strictly increases when injected delay grows).
AC-10a tests/unit/coordinator/test_cpu_budget.py — 18 cases including positive ints, empty string, non-int (abc/1.5/two/2x/spaces/0x2), non-positive (0/-1/-100), os.cpu_count=None fallback, error-message-names-env-var, and the structural coordinator-uses-effective_cpu_count check.
AC-10b tests/unit/ci/test_hosted_runner_bench_thresholds.py — 19 parametrized cases over regression {-10, 0, 49.9, 50, 99, 99.999, 100, 101, 500} and p95 {0, 359, 360, 360.001, 361, 1000}. Confirms ≥ 100 % and > 360 s boundaries (inclusive/strict per arch §Gap 2).
AC-10c tests/unit/ci/test_bench_nightly_workflow.py — 6 tests: cron 0 4 * * *, ubuntu-24.04 pinned, CODEGENIE_FORCE_CPU_COUNT: "2" at job level, pull-requests: write, workflow_dispatch-able, runs only the hosted-runner bench.
AC-11 tests/unit/ci/test_contract_freeze_allowlist.py — 9 tests: allowlist contents, committed snapshot has image_digest_resolver, parametrized rejection of 4 non-allowlisted fields with 02-ADR-0004 substring, --check exit 0 on master, --check exit 1 on drift, CI lane invokes --check.
AC-12 make lint-imports green (2 kept, 0 broken). mypy --strict src/ green (135 source files). ruff check + ruff format --check clean on all touched files. Fence test passes (9 tests). No new anthropic/openai/langgraph/httpx/requests/socket imports introduced — _cpu_budget.py imports os only.
AC-13 tests/unit/ci/test_no_xdist_anywhere.py — 4 tests: per-workflow scan (2 workflows), pyproject addopts scan, metamorphic injection (regex must match a planted -n 4). The initial regex did not accept the right set of characters after -n; the metamorphic test went red and exposed the gap, and the pattern was corrected to -n[\s\d] plus \b-anchored alternatives.
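
For AC-13, the metamorphic injection idea can be illustrated with a small sketch. The pattern below is a hypothetical reconstruction built around the -n[\s\d] fragment named in the row above; the exact \b-anchored alternatives in the real test are not recorded in this log, so xdist and numprocesses are assumptions.

```python
import re

# Hypothetical pattern: "-n" followed by whitespace or a digit, plus two
# word-anchored alternatives for the plugin name and the long flag.
_XDIST_PATTERN = re.compile(r"(?:^|\s)-n[\s\d]|\bxdist\b|\bnumprocesses\b")


def uses_xdist(text: str) -> bool:
    """Return True when the text looks like it requests parallel pytest workers."""
    return _XDIST_PATTERN.search(text) is not None


def test_regex_catches_planted_flag() -> None:
    clean = "run: pytest tests/unit/ -q --no-cov"
    # Baseline: the unmodified command must scan clean.
    assert not uses_xdist(clean)
    # Metamorphic step: plant the forbidden flag and require the scan to flip.
    assert uses_xdist(clean + " -n 4")
    assert uses_xdist(clean + " -n4")
    assert uses_xdist(clean + " --numprocesses=4")
```

The point of the planted input is that a scan which only ever sees clean files cannot distinguish a working regex from one that never matches.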

Conflict surfaced + resolution (CLAUDE.md Rule 7)

  • Existing test_ci_workflow.py::test_ci_workflow_declares_exactly_six_required_jobs enforces set-equality, but AC-1 reshapes the job set additively to 13. Two patterns in the codebase: the existing test claimed "exactly six required jobs"; the new arch §"CI gates" prescribes eight named lanes. Resolution: the existing test was updated to expect the union (_LEGACY_JOBS | _PHASE2_JOBS), with a comment pointing at tests/unit/ci/test_workflow_yaml.py for the per-lane invariants. The newer arch doc (more recent) wins; the older intent (exhaustive equality, no surprises) is preserved by promoting the set to the union rather than relaxing to a subset. This honors CLAUDE.md Rule 7: pick the more recent, surface the older as part of the same change.

Out-of-scope finding (Rule 3 — surgical changes)

  • The branch the scheduled task landed on (codex/tier1-architecture-cleanup) carried substantial uncommitted WIP — an AGENTS.md, a docs/reviews/ tree, a 0039 ADR, a new docs_consistency.py module, and a parallel tests/unit/test_docs_consistency.py. None of this is referenced by S8-03 or its hardened ACs. The WIP was preserved via git stash push -u so the parallel agent's work isn't lost; S8-03 lands on a fresh branch off master (feat/phase2-s8-03-ci-jobs-and-benches) to keep the blast radius scoped to S8-03's prescribed surface.

Refactor decisions (design-patterns lens)

  • Rule-of-three extraction. Three bench scripts duplicated baseline-load + ratio-compute + comment + exit. tests/bench/_bench_kernel.py owns the pure decision (compare_to_baseline returning a Verdict sum type) AND the impure shell (post_comment_if, exit_with_verdict, load_baseline). Adding a fourth bench in Phase 3+ requires zero edits to the kernel — compose a new Threshold instance and dispatch. Functional-core / imperative-shell, project-wide convention.
  • Sum type for verdict. Verdict = Ok | CommentOnly | Fail (frozen, slots, kw_only) replaces a boolean-pair return. compare_to_baseline is pure; the impure shell pattern-matches the verdict. Mirrors Fresh|Stale discipline elsewhere in the codebase.
  • Newtype-discipline-by-extension. Threshold is a frozen dataclass; thresholds for the three benches are module-level Final instances. Mixing thresholds (e.g., passing the hosted-runner's gating thresholds to the advisory dev-laptop bench) is impossible because of how each script constructs its _THRESHOLDS.
  • Strategy seam via dispatch tables, not branching. The compare_to_baseline kernel composes regression + p95 thresholds; both can be None to opt out. No if "is_hosted_runner": ... switches. This mirrors the project-wide preference for data-driven registries over branching code.
  • Typed Pydantic workflow loader. Every CI test in this story uses WorkflowFile.from_path(...) instead of yaml.safe_load(...) + dict.get — type errors caught by mypy, malformed YAML caught immediately, and the parser handles the PyYAML True/"on" boolean-key surface in exactly one place.
  • Open/Closed. The contract-freeze allowlist is an explicit frozenset constant. Adding a new Phase-3+ field requires editing the allowlist AND landing an ADR. The test parametrizes over 4 fake-field names to prove the 02-ADR-0004 pointer is the consistent failure mode.

Gates

Gate Status
mypy --strict src/ ✓ no issues found in 135 source files
ruff check (touched files) ✓ all checks passed
ruff format --check (touched files) ✓ all formatted
lint-imports ✓ 2 kept, 0 broken
fence (tests/unit/test_pyproject_fence.py) ✓ 9 passed
Full unit suite (tests/unit/) ✓ 3432 passed, 16 skipped, 1 xfailed
AC-6b ritual (sabotage confidence_section.py) ✓ mypy fires error: Argument 1 to "assert_never" has incompatible type "IndexerError"; expected "Never" [arg-type]; restored from backup; clean

Adaptations + deviations from the hardened story

  • AC-9 threshold modelling. Story prescribed "≥ 10 % posts a PR comment". The kernel's Threshold.comment_pct is computed as percentage-regression-vs-baseline, not absolute. The bench uses _BASELINE = {"minimal-ts/b2_fraction": 0.05} and comment_pct=100.0, so when the measured fraction ≥ 2× baseline (i.e. ≥ 10 % of total walltime) the kernel returns CommentOnly. Mathematically equivalent to the story's prescription; surfaced here so a future reviewer can trace the indirection.
  • @requires_tool warning emission. The story said "emit a structlog warning OR warnings.warn(...)". Implementation uses warnings.warn because the _ci_support/ package must not depend on codegenie.* (the integration lane is the consumer, not a provider). The warning fires once-per-tool-per-session via a module-level set.
  • bench lane gh pr comment step. Per AC-7c, the if: guard on the comment step is github.event.pull_request.head.repo.fork == false || github.event_name == 'push' so the bench still runs on push to master (no PR context). Fork PRs hit the explicit alt step that prints ::warning:: and still runs the bench for artifact-only inspection.
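
The AC-9 indirection noted above reduces to a few lines of arithmetic. This is an illustrative check, assuming regression is computed as (measured − baseline) / baseline × 100; the helper names are hypothetical.

```python
BASELINE_FRACTION = 0.05  # minimal-ts/b2_fraction from the bench baseline
COMMENT_PCT = 100.0       # comment threshold, as percentage regression vs baseline


def regression_pct(measured: float, baseline: float) -> float:
    """Percentage regression of a measured value relative to its baseline."""
    return (measured - baseline) / baseline * 100.0


def comment_cutoff(baseline: float, comment_pct: float) -> float:
    """Smallest measured value that reaches the comment threshold."""
    return baseline * (1.0 + comment_pct / 100.0)


# A 100 % regression on a 5 % baseline is a measured fraction of 10 %,
# which is the story's original "comment at >= 10 %" prescription.
cutoff = comment_cutoff(BASELINE_FRACTION, COMMENT_PCT)
```

A measured fraction of 0.08 gives a regression of roughly 60 %, so it stays below the comment threshold even though it exceeds the 5 % target.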

Lessons for follow-on stories

  • The hardened story is exceptionally long and densely cross-referenced — the validator already collapsed 14 prescriptions into a coherent shape. A future S8-04 implementer should read this attempt log AND the validation report _validation/S8-03-ci-jobs-and-benches.md together before opening the story file, otherwise the "subset, not equality" / "additive, not replacement" reshape is easy to miss.
  • The xdist-veto regex needs both \s and \d after -n because pytest -n4 (no space) is the common typo. The metamorphic test caught the gap; if a future workflow grows another parallel-invocation flag, copy the metamorphic pattern from tests/unit/ci/test_no_xdist_anywhere.py.
  • The PyYAML True/"on" boolean-key surface is documented in the workflow loader; do not paper over it with yaml.SafeLoader.add_constructor mutations — that hides the surface from a future contributor opening their first workflow YAML.
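
On the last point: PyYAML follows YAML 1.1 boolean resolution, so a bare on: key in a workflow file comes back from yaml.safe_load as the Python key True, not the string "on". A sketch of handling that in one place is below. It uses a plain dataclass and an already-loaded mapping to stay dependency-free; the real loader is Pydantic-based, and the names WorkflowTriggers and extract_triggers are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class WorkflowTriggers:
    events: Mapping[str, Any]


def extract_triggers(raw: Mapping[Any, Any]) -> WorkflowTriggers:
    """Read the trigger block from a mapping produced by yaml.safe_load."""
    if "on" in raw:
        value = raw["on"]  # quoted "on": in the YAML source
    elif True in raw:
        value = raw[True]  # bare on: resolved to a boolean key
    else:
        raise ValueError("workflow has no `on:` trigger block")
    # Normalise the three shapes GitHub accepts: string, list, or mapping.
    if isinstance(value, str):
        return WorkflowTriggers(events={value: None})
    if isinstance(value, list):
        return WorkflowTriggers(events={name: None for name in value})
    if isinstance(value, Mapping):
        return WorkflowTriggers(events=dict(value))
    raise ValueError(f"unsupported trigger block: {value!r}")
```

Every test then asks for triggers through this one function, so the boolean-key quirk stays visible and documented instead of being hidden behind a loader mutation.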