Story S3-05 — Deterministic BCa bootstrap for lower_bound_95¶
Step: Step 3 — Implement the runner: asyncio fan-out, subprocess rubric, aggregator with BCa bootstrap
Status: Ready
Effort: M
Depends on: S3-02 (aggregator scaffolding — fills the lower_bound_95=0.0 placeholder)
ADRs honored: ADR-0002 (load-bearing — promotion gate keys on lower_bound_95), ADR-0006 (curation-class held-out floor reinforces small-N caution)
Context¶
Per ADR-0002, the promotion gate keys on BenchRunReport.lower_bound_95, not mean_score. This story implements the BCa bootstrap that computes lower_bound_95 deterministically: 1000 resamples, seeded by int(run_id[:8], 16), returning the 2.5th percentile of bootstrapped means as a one-sided 95% lower confidence bound.
The bootstrap is the single probabilistic surface in an otherwise fully-deterministic harness (arch §Determinism vs probabilism). It is leafed (only the aggregator calls it) and seeded (the seed derives from run_id, which is itself content-addressed in S3-01). Identical inputs must produce a byte-identical lower_bound_95 — that property is what makes the audit chain reproducible across host reboots, CI runners, and Python patch versions.
The BCa (bias-corrected and accelerated) variant matters because BenchScore.score ∈ [0, 1] is bounded; symmetric percentile bootstrap produces bounds outside [0, 1] near the corners. BCa adds two corrections (z0 for bias from the boundary, a for acceleration via jackknife) that are well-studied at N ≈ 10 — the floor for min_cases_for_promotion[bronze].
References — where to look¶
- Architecture:
../phase-arch-design.md §Determinism vs probabilism— the bootstrap row; "leafed and seeded — no other component branches on RNG."../phase-arch-design.md §Components → runner.py— six-phase pipeline step 4 (aggregate) ends with the bootstrap.../phase-arch-design.md §Agentic best practices → Confidence handling—lower_bound_95is the sole gate signal;meanandstddevare reported but not consumed.../phase-arch-design.md §Testing strategy → Property tests— themean - 2*stddev ≤ lower_bound_95 ≤ meansanity property is anchored here.- Phase ADRs:
../ADRs/0002-promotion-gate-keys-on-lower-bound-95.md— read in full; §Decision pins the seed derivation; §"Revisit trigger" pins the future Wilson switch.../ADRs/0006-curation-class-split-with-fence-ci-held-out-floor.md— the held-out floor of 5 reinforces caution at small N.- Source design:
../final-design.md §Departures from all three inputs #2,§Open Q #5(deferred Wilson switch trigger). - Standard reference: Efron, An Introduction to the Bootstrap (1993), §14 — BCa (bias-corrected and accelerated). The acceleration constant via jackknife in §14.3 is the implementation pattern.
Goal¶
Implement compute_lower_bound_95(per_case_scores: Sequence[float], *, run_id: str, n_resamples: int = 1000) -> float that returns a deterministic, seeded BCa-bootstrap 95% one-sided lower bound on the mean of per_case_scores, and wire it into S3-02's aggregator.
Acceptance criteria¶
- [ ]
src/codegenie/eval/bootstrap.pyexportscompute_lower_bound_95(scores, *, run_id, n_resamples=1000) -> float. - [ ] Seed derivation:
seed = int(run_id[:8], 16)is the only source of randomness; the bootstrap usesnumpy.random.default_rng(seed)(orrandom.Random(seed)if numpy is rejected — pick one explicitly in the implementation and document it). - [ ] Byte-identical determinism: two calls with identical inputs (same
run_id, same per-case score sequence) produce byte-identicallower_bound_95when serialized viafloat.hex(). A test asserts this across 100 reruns. - [ ] Cross-Python-version determinism: byte-identical across Python 3.11 and 3.12 (asserted by a snapshot test that pins the bound to a literal hex for a known input).
- [ ] Hypothesis property test (one-tailed lower-bound sanity): for
N ≥ 5and any sequence in[0, 1]^Nwithstddev > 0, the bound satisfiesmean - 2*stddev ≤ lower_bound_95 ≤ mean. Usehypothesis.strategies.lists(floats(min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False), min_size=5, max_size=50). - [ ] Mutation property: the bound is non-decreasing under uniform score shifts:
compute_lower_bound_95([s+δ for s in scores], ...) ≥ compute_lower_bound_95(scores, ...) - 1e-9for any smallδ > 0(clamped to[0,1]). - [ ] Degenerate case
stddev == 0(all scores identical) →lower_bound_95 == mean(bypass the bootstrap; no resampling). - [ ] Edge case
N < 5→lower_bound_95 == 0.0with aWARNINGlog includingn_cases,run_id, and a pointer to ADR-0002. - [ ]
n_resamples = 1000per the literature default; expose as a kwarg for property-test speed but the runner always passes 1000. - [ ] Wire into S3-02's aggregator: replaces the
lower_bound_95=0.0placeholder; a regression test runsRunner().execute(...)and asserts the bound is> 0.0on a non-degenerate stub bench. - [ ]
mypy --strict,ruff format --check,ruff checkclean. - [ ] All red tests in §TDD plan exist, were committed at the red marker, and are now green.
Implementation outline¶
- New module
src/codegenie/eval/bootstrap.py:def compute_lower_bound_95( scores: Sequence[float], *, run_id: str, n_resamples: int = 1000 ) -> float: if len(scores) < 5: structlog.get_logger().warning( "bootstrap_n_too_small", n=len(scores), run_id=run_id, see_adr="ADR-0002", ) return 0.0 arr = np.asarray(scores, dtype=np.float64) if float(arr.std(ddof=1)) == 0.0: return float(arr.mean()) seed = int(run_id[:8], 16) rng = np.random.default_rng(seed) boot_means = np.empty(n_resamples, dtype=np.float64) for i in range(n_resamples): sample = rng.choice(arr, size=arr.size, replace=True) boot_means[i] = sample.mean() z0 = _bias_correction(arr, boot_means) a = _acceleration(arr) alpha_lower = _bca_alpha_lower(z0, a, 0.025) return float(np.percentile(boot_means, alpha_lower * 100)) _bias_correction(arr, boot_means):z0 = norm.ppf(mean(boot_means < arr.mean()))(usescipy.stats.norm.ppfif scipy is available; else Acklam's algorithm via a stdlib implementation)._acceleration(arr): jackknife-based acceleration constant per Efron §14.3:_bca_alpha_lower(z0, a, alpha):norm.cdf(z0 + (z0 + norm.ppf(alpha)) / (1 - a*(z0 + norm.ppf(alpha)))).- Call from S3-02's aggregator: after sorting
per_casebycase_id, computelower_bound_95 = compute_lower_bound_95([s.score for _, s in per_case], run_id=plan.run_id)and set on the report.
TDD plan — red / green / refactor¶
Red — write failing tests first¶
tests/unit/test_bootstrap.py:
import numpy as np
import pytest
from hypothesis import given, strategies as st, settings
from codegenie.eval.bootstrap import compute_lower_bound_95
def test_byte_identical_lower_bound_across_reruns():
"""Determinism is load-bearing: cassette reruns must produce the same chain."""
scores = [0.6, 0.7, 0.8, 0.9, 0.55, 0.65, 0.75, 0.85, 0.5, 0.95]
run_id = "abc1234500000000"
runs = {compute_lower_bound_95(scores, run_id=run_id).hex() for _ in range(100)}
assert len(runs) == 1, runs
def test_snapshot_bound_for_known_input():
"""Cross-Python-version stability: a pinned bound for a pinned input."""
scores = [0.5, 0.6, 0.7, 0.8, 0.9, 0.55, 0.65, 0.75, 0.85, 0.95]
bound = compute_lower_bound_95(scores, run_id="deadbeef00000000")
# Pinned value — regenerated by scripts/regen_bootstrap_snapshot.py if intentional.
assert bound.hex() == "0x1.5cc4..." # PLACEHOLDER: real value filled in at green step
def test_degenerate_stddev_zero_returns_mean():
scores = [0.7] * 10
assert compute_lower_bound_95(scores, run_id="0" * 16) == 0.7
def test_small_n_returns_zero_and_warns(caplog):
scores = [0.5, 0.6, 0.7, 0.8] # N=4
bound = compute_lower_bound_95(scores, run_id="0" * 16)
assert bound == 0.0
assert any("bootstrap_n_too_small" in r.message for r in caplog.records)
@given(scores=st.lists(
st.floats(min_value=0.0, max_value=1.0, allow_nan=False, allow_infinity=False),
min_size=5, max_size=50,
))
@settings(max_examples=200, deadline=None)
def test_bca_bound_within_mean_minus_two_stddev_window(scores):
"""One-tailed 95% lower-bound sanity property (ADR-0002)."""
arr = np.asarray(scores)
if arr.std(ddof=1) == 0.0:
return
lb = compute_lower_bound_95(list(scores), run_id="deadbeef00000000")
assert arr.mean() - 2 * arr.std(ddof=1) <= lb <= arr.mean() + 1e-9
@given(scores=st.lists(
st.floats(min_value=0.0, max_value=0.9, allow_nan=False),
min_size=5, max_size=20,
))
@settings(max_examples=50, deadline=None)
def test_uniform_score_shift_is_non_decreasing(scores):
"""If every score goes up by δ, the bound cannot go down."""
bound_a = compute_lower_bound_95(scores, run_id="cafebabe00000000")
shifted = [s + 0.05 for s in scores]
bound_b = compute_lower_bound_95(shifted, run_id="cafebabe00000000")
assert bound_b >= bound_a - 1e-9
@pytest.mark.asyncio
async def test_bound_lands_on_report(stub_bench):
"""Regression: aggregator wires the real bound (not the 0.0 placeholder)."""
from codegenie.eval.runner import Runner
plan = make_plan_with_varied_scores(stub_bench)
report = await Runner().execute(plan, system_under_test=stub_sut, rubric_runner=stub_rubric)
assert 0.0 < report.lower_bound_95 < report.mean_score
Run all seven; confirm failures. Commit as the red marker (note: the snapshot hex is filled in at the green step, not invented at red).
Green — make them pass¶
Implement compute_lower_bound_95 with the simplest correct BCa: bias correction + jackknife acceleration. Vectorize the resample loop with rng.choice(arr, size=arr.size, replace=True) if numpy is available. Wire into the aggregator's report build.
Refactor — clean up¶
- Pull
_bias_correction,_acceleration,_bca_alpha_lowerinto private helpers with their own unit tests verified against a published worked example from Efron §14.3. - Add a
pytest -m slowmarker on the 100-rerun byte-identical test so it can be skipped locally; CI runs it. - Module docstring states the seed-derivation rule and cites ADR-0002 verbatim: "changing it would invalidate the audit chain's reproducibility claim."
- Cross-reference comment in
runner.pyaggregator:# lower_bound_95 is the ONLY gate signal (ADR-0002); mean and stddev are reported for human review only. scripts/regen_bootstrap_snapshot.py— operator tool that recomputes the snapshot hex when the BCa implementation is intentionally changed (rare; gated by an ADR amendment).
Files to touch¶
| Path | Why |
|---|---|
src/codegenie/eval/bootstrap.py |
New module — BCa implementation + helpers |
src/codegenie/eval/runner.py |
Wire the call into the aggregator; remove the placeholder |
tests/unit/test_bootstrap.py |
Determinism + snapshot + two Hypothesis properties + degenerate + small-N + aggregator regression |
pyproject.toml |
If numpy/scipy are not yet deps, add them (or implement norm.ppf locally with Acklam's algorithm) |
scripts/regen_bootstrap_snapshot.py |
Operator tool for snapshot regeneration |
Out of scope¶
- Switching to Wilson interval (ADR-0002's revisit trigger — future ADR amendment if
score ∈ {0.0, 1.0}rate > 80%). - Two-sided confidence intervals (only the lower bound matters for the gate).
- Cross-task-class bootstrap stitching (Phase 16).
- Per-breakdown-key sub-bootstraps (would surface in Phase 13 dashboards; not needed for the gate).
Notes for the implementer¶
- The seed derivation rule is structural state. ADR-0002 calls it out by name: "changing it would invalidate the audit chain's reproducibility claim." If you find a reason to change it, escalate via an ADR amendment, not a code change.
- BCa over the percentile method is the right choice because
BenchScore.score ∈ [0, 1]is asymmetric near the boundaries; the bias correction matters atN ≈ 10. - The Hypothesis bound
mean - 2*stddev <= lower_bound_95 <= meanis a sanity property, not the canonical BCa property. It catches gross implementation bugs (e.g., returning the upper bound by accident, or returningmean + stddev). It does not certify BCa correctness — that's what the Efron §14.3 worked example is for in the unit test of_bias_correctionand_acceleration. - The monotone-shift property is the second guardrail: if every score increases, the lower bound cannot decrease. A bug that, e.g., swapped sign on
z0would fail this within ~10 Hypothesis examples. - The numpy dependency is small and the BCa math is materially harder to write correctly without it. If the project rejects numpy, implement
_accelerationwith stdlibstatistics+ a hand-rollednorm.ppf(Acklam's algorithm); leave a TODO and a citation. Either choice must be byte-identical across Python versions — pin the choice in a code comment. - The degenerate
stddev == 0path is required — without it, the jackknife denominator is zero and the bootstrap explodes (NaN). - The
N < 5floor is conservative: BCa is known to be unreliable below ~5 samples. The promotion gate'smin_cases_for_promotion[bronze] = 10makes the practical gate stricter than this safeguard; the warning exists for Phase 6 (the seed migration corpus at N=3) and for malformed runs that emit < 5 cases. - Snapshot test is intentional: it pins the BCa output for known inputs across Python minor versions and numpy versions. If the snapshot fails on a numpy upgrade, that's a signal — not silent drift. The regen script makes the snapshot legible.
- Vectorize the resample loop if numpy:
rng.choice(arr, size=(n_resamples, arr.size), replace=True).mean(axis=1)is one statement and ~50× faster than the per-iteration loop. Keep the slow loop in the docstring as the readable spec.