Phase 6.5 — Per-task-class eval harness + first benches: ADRs
Architecture Decision Records for Phase 6.5, in Nygard format. Each ADR captures one load-bearing decision: the context, the alternatives considered, what was chosen, the tradeoffs accepted, the consequences, and how reversible the choice is.
- Phase architecture: phase-arch-design.md (full architecture spec).
- Source design: final-design.md (synthesized from three competing lens designs).
- Devil's-advocate critique: critique.md (surfaces the load-bearing forks these ADRs resolve).
- Anchor ADR: Phase 5 ADR-0016 (Phase 6.5 implements this commitment).
- Production reference: docs/production/adrs/ (project-level ADR set this phase composes with).
Index
| # | Title | Tags |
|---|---|---|
| 0001 | Rubric runs as a scrubbed-env subprocess — not in-process, not microVM | isolation · security · trust-boundary · rubric |
| 0002 | Promotion gate keys on `lower_bound_95` (BCa bootstrap), not `mean_score` | statistics · promotion · honest-confidence · phase-7-precondition |
| 0003 | Tier identifiers are `str`, validated at startup against `docs/trust-tiers.yaml` | extension-by-addition · type-system · contract-data |
| 0004 | Per-task-class `failure_modes.yaml` taxonomy with typed `FailureMode` | taxonomy · trust · promotion · extension-by-addition · fail-loud |
| 0005 | Per-case `cassette_canary_pin` + Phase 4 `Canary.mint(seed=...)` additive amendment | cassettes · determinism · phase-4-amendment · cross-phase-boundary |
| 0006 | Bench cases split by `curation_class` — rag-corpus-derived vs held-out with fence-CI floor | memorization · judgment · curation · fence-ci · phase-7-precondition |
| 0007 | Bench-invocation tagging on `SandboxCostEntry` via env-var contract — amends Phase 5 ADR-0010 | cost-ledger · phase-5-amendment · phase-13-handoff · cross-phase-boundary |
| 0008 | Per-task-class `BreakdownKey` `StrEnum` + fence-CI substring ban at value level | llm-judgment-smuggling · type-safety · static-introspection · fence-ci |
| 0009 | "Automatic demotion" semantics — recommendation-shift, not side-effect (amends Phase 5 ADR-0016) | promotion · demotion · humans-always-merge · adr-amendment |
| 0010 | `isolation_class` annotation on `BenchRunReport` for Phase 16 microVM upgrade safety | audit-chain · phase-16-handoff · isolation-upgrade · population-mixing |
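
ADR-0002 keys the promotion gate on `lower_bound_95` from a BCa bootstrap rather than on `mean_score`. The sketch below shows roughly how such a bound could be computed for the mean using only the standard library. The function name, the resample count, the fixed seed, and the reading of "95" as the lower end of a two-sided 95% interval (`alpha=0.025`) are all assumptions for illustration, not details taken from the ADR.

```python
from __future__ import annotations

import random
from statistics import NormalDist, fmean


def bca_lower_bound(scores: list[float], alpha: float = 0.025,
                    n_resamples: int = 2000, seed: int = 0) -> float:
    """BCa bootstrap lower confidence bound for the mean of `scores`.

    Hypothetical helper: name, defaults, and alpha are illustrative assumptions.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    rng = random.Random(seed)  # fixed seed keeps the bound reproducible
    norm = NormalDist()
    theta_hat = fmean(scores)

    # Bootstrap distribution of the mean (resample with replacement).
    reps = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))

    # Bias correction z0: share of replicates below the observed mean,
    # clamped away from 0 and 1 so the inverse CDF stays defined.
    below = sum(r < theta_hat for r in reps)
    prop = min(max(below / n_resamples, 1 / (n_resamples + 1)),
               n_resamples / (n_resamples + 1))
    z0 = norm.inv_cdf(prop)

    # Acceleration from jackknife (leave-one-out) means.
    total = sum(scores)
    jack = [(total - x) / (n - 1) for x in scores]
    jack_mean = fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    # Adjusted percentile, then read it off the sorted replicates.
    z_alpha = norm.inv_cdf(alpha)
    adj = norm.cdf(z0 + (z0 + z_alpha) / (1 - accel * (z0 + z_alpha)))
    index = min(max(int(adj * n_resamples), 0), n_resamples - 1)
    return reps[index]
```

A gate keyed on this value is harder to satisfy with a small or noisy bench than one keyed on the mean, because the bound moves away from the mean as variance grows or the case count shrinks.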
Conventions
- Filenames: `NNNN-kebab-case-title.md`, zero-padded and numbered locally per phase from `0001`.
- Numbers are immutable: a superseded ADR keeps its number; the new ADR gets the next number, and the two cross-link.
- Cross-references to production ADRs: `../../../production/adrs/NNNN-*.md`. To sibling Phase 5 ADRs: `../../05-sandbox-trust-gates/ADRs/NNNN-*.md`. Within Phase 6.5: `NNNN-*.md`.
- Phase ADR numbering is local to this phase; it does not continue from production ADR numbers. Phase 6.5 ADR-0001 is unrelated to production ADR-0001.
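
The filename and numbering conventions above are mechanical enough to check in code. The following is a minimal sketch, assuming a strict lowercase kebab-case slug; the regular expression and both helper names are hypothetical and not part of the phase tooling.

```python
from __future__ import annotations

import re

# Assumed reading of the convention: four zero-padded digits, a kebab-case
# title made of lowercase letters and digits, then ".md".
ADR_NAME = re.compile(r"^(\d{4})-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")


def adr_number(filename: str) -> int | None:
    """Return the ADR number encoded in a filename, or None if it does not match."""
    match = ADR_NAME.match(filename)
    return int(match.group(1)) if match else None


def next_adr_filename(existing: list[str], title: str) -> str:
    """Build the filename for a new ADR without renumbering existing ones."""
    numbers = [n for n in map(adr_number, existing) if n is not None]
    next_number = max(numbers, default=0) + 1  # numbering starts at 0001
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:04d}-{slug}.md"
```

Because the next number is derived from the highest existing one, a superseded ADR that stays in the directory keeps its slot and the new ADR simply takes the following number.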
ADRs that amend prior decisions
This phase produces three amendments to already-accepted decisions. Each is captured as a Phase 6.5 ADR; the amended document gains an "Amended by" cross-link as part of Phase 6.5 work:
- Amends Phase 4 final design (cassette canary discipline: additive `seed: bytes | None` kwarg on `Canary.mint`), captured in ADR-0005. The Phase 4 ADR (`ADR-P4-006-canary-seed-kwarg.md`) is drafted as part of Phase 6.5 work.
- Amends Phase 5 ADR-0010 (cost-ledger schema: additive `bench_invocation: bool` field on `SandboxCostEntry`), captured in ADR-0007.
- Amends Phase 5 ADR-0016 §Decision §4 ("automatic demotion" semantics: recommendation-shift, not side-effect), captured in ADR-0009.
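
Both schema amendments are additive, meaning existing call sites keep working because the new parameter or field has a default. The sketch below illustrates that shape only. The class bodies, the `task_id` and `cost_usd` fields, the token format, and the hashing scheme are hypothetical stand-ins, not the real Phase 4 or Phase 5 types.

```python
from __future__ import annotations

import hashlib
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxCostEntry:
    """Hypothetical stand-in; the real schema is defined by Phase 5 ADR-0010."""
    task_id: str    # assumed field, for illustration only
    cost_usd: float  # assumed field, for illustration only
    bench_invocation: bool = False  # additive field: old constructors still work


class Canary:
    """Hypothetical stand-in for the Phase 4 canary type."""

    def __init__(self, token: str) -> None:
        self.token = token

    @classmethod
    def mint(cls, *, seed: bytes | None = None) -> "Canary":
        if seed is None:
            # Assumed pre-amendment behaviour: a fresh random token per call.
            return cls(secrets.token_hex(16))
        # Additive path: the same seed always yields the same token,
        # which is what deterministic cassette replay needs.
        return cls(hashlib.sha256(seed).hexdigest()[:32])
```

Callers that never pass `seed` or `bench_invocation` see no change, which is what keeps each amendment additive rather than breaking.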
ADR clusters
The ten ADRs group into four load-bearing themes. Reading each cluster together is faster than reading the index linearly.
- Trust posture & evidence semantics — ADR-0001 (rubric isolation), ADR-0002 (statistical gate signal), ADR-0009 (no autonomous tier mutation), ADR-0010 (audit-chain invariance under Phase 16 upgrade).
- Anti-smuggling structural defenses — ADR-0004 (failure-mode taxonomy is data), ADR-0008 (dict-key LLM-judgment ban).
- Curation discipline — ADR-0005 (deterministic cassette replay), ADR-0006 (memorization-vs-judgment split).
- Extension by addition + cross-phase contracts — ADR-0003 (tier slugs as data), ADR-0007 (Phase 5 cost-ledger amendment).
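
As an illustration of the anti-smuggling cluster, ADR-0008's value-level substring ban can be checked by static introspection of the enum, without running any bench. This is a sketch under stated assumptions: the banned substrings, the example enum members, and the `banned_values` helper are hypothetical, and the fallback class exists only so the example also runs on interpreters older than Python 3.11.

```python
from __future__ import annotations

from enum import Enum

try:
    from enum import StrEnum  # available from Python 3.11
except ImportError:
    class StrEnum(str, Enum):  # minimal fallback for older interpreters
        """Stand-in for enum.StrEnum."""


# Hypothetical ban list; the real list is defined by the fence-CI rules.
BANNED_SUBSTRINGS = ("llm", "judge", "vibe")


class ExampleBreakdownKey(StrEnum):
    """Hypothetical per-task-class keys with objectively measurable meanings."""
    TESTS_PASSED = "tests_passed"
    BUILD_OK = "build_ok"


class SmuggledBreakdownKey(StrEnum):
    """Hypothetical offender: a key that smuggles in model judgment."""
    LLM_QUALITY = "llm_quality_score"


def banned_values(enum_cls: type[Enum]) -> list[str]:
    """Return every member value that contains a banned substring."""
    return [
        member.value
        for member in enum_cls
        if any(banned in member.value.lower() for banned in BANNED_SUBSTRINGS)
    ]
```

A fence-CI assertion along these lines would fail the build when `banned_values` returns a non-empty list for any registered key enum.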
Decisions noted but not yet documented in arch / final-design
The phase-architect's gap analysis surfaced six candidate decisions; two were promoted to ADRs (ADR-0010 closes Gap 1; ADR-0009 closes the §"automatic" ambiguity). The other four remain as paragraphs in phase-arch-design.md rather than ADRs:
- `_codegenie_bench` vs `bench` import-path resolution (phase-arch-design.md §Gap analysis Gap 2, §Open questions #3). The architecture recommends Option A (prepend `bench/`'s parent to `sys.path`; import `bench.{name}.registration` directly) with Option B (`MetaPathFinder`) as the contingency. The decision is implementation-time-recoverable: if Option A surfaces a packaging conflict in CI, Option B is the documented fallback. Promote to an ADR only if Option B is chosen at implementation time, or if the resolution affects external consumers.
- Case-ID collision as a fence-CI assertion (phase-arch-design.md §Gap analysis Gap 3). The architecture promotes this to a seventh fence-CI assertion plus a `loader.py` `BenchCaseIDCollision` raise (defense-in-depth). The change is mechanical and inside the existing fence-CI envelope; no architectural fork. Promote to an ADR only if a contributor objects to the assertion or proposes a different containment.
- `complete: bool` field on `BenchRunReport` + partial-run handling (phase-arch-design.md §Gap analysis Gap 4). The architecture commits to the field and to `PromotionGate.evaluate(...)` rejecting `complete=False` reports. The change is one bool, one early return, one `VerifyResult` tuple field: small enough that the architecture spec is its own canonical record. Promote to an ADR if a future phase changes the rejection semantics (e.g., allows partial reports under a `--allow-partial` flag).
- Per-host vs canonical-host fingerprinting for the audit chain (phase-arch-design.md §Gap analysis Gap 5, §Open questions #2). Implementation must pick the fingerprint construction (hostname-derived vs UUID-on-first-run). Both work; the architecture surface area is small. Promote to an ADR when the CI substrate's matrix structure becomes load-bearing (Phase 13 or Phase 16 territory).
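
Option A from the first item can be sketched in a few lines. This is an illustration under assumptions, not the harness loader: the `load_registration` name and its signature are hypothetical, and only the `sys.path` prepend and the `bench.{name}.registration` import come from the text above.

```python
from __future__ import annotations

import importlib
import sys
from pathlib import Path
from types import ModuleType


def load_registration(bench_dir: Path, name: str) -> ModuleType:
    """Option A sketch: make `bench` importable, then import one registration module.

    Hypothetical helper; `bench_dir` is the path to the `bench/` directory.
    """
    parent = str(bench_dir.resolve().parent)
    if parent not in sys.path:
        # Prepend so the checkout's own `bench` package is found first.
        sys.path.insert(0, parent)
    # Ensure packages created after interpreter start are visible to the finders.
    importlib.invalidate_caches()
    return importlib.import_module(f"{bench_dir.name}.{name}.registration")
```

Mutating `sys.path` is process-global, which is the kind of packaging conflict that would push the decision toward the `MetaPathFinder` contingency.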
If any of these decisions hardens into a load-bearing fork during implementation, the ADR set should grow; the local-numbering convention guarantees the new ADR gets the next sequential number with no renumbering of the existing ten.
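
To make the size of the Gap 4 change concrete (one bool, one early return, one `VerifyResult` field), here is a minimal sketch. The field sets, the threshold parameter, and the reason strings are hypothetical; only the names `BenchRunReport`, `PromotionGate.evaluate`, `VerifyResult`, `complete`, and `lower_bound_95` come from this document.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class BenchRunReport:
    """Hypothetical subset of the report: only the fields this sketch needs."""
    lower_bound_95: float
    complete: bool


class VerifyResult(NamedTuple):
    """Hypothetical shape; the source only says it gains one tuple field."""
    promote: bool
    reason: str
    complete: bool


class PromotionGate:
    def __init__(self, threshold: float) -> None:
        # Assumed: a floor that lower_bound_95 must reach.
        self.threshold = threshold

    def evaluate(self, report: BenchRunReport) -> VerifyResult:
        if not report.complete:
            # The early return: a partial run is never evidence for promotion.
            return VerifyResult(False, "incomplete bench run", False)
        promote = report.lower_bound_95 >= self.threshold
        reason = ("lower_bound_95 meets threshold" if promote
                  else "lower_bound_95 below threshold")
        return VerifyResult(promote, reason, True)
```

Under this shape, a later `--allow-partial` flag would change only the early return, which is the kind of semantic shift the list above says should trigger an ADR.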