Phase 6.5 — Per-task-class eval harness + first benches: Security-first design¶
Lens: Security — isolation, least privilege, audit, supply chain. Designed by: Security-first design subagent Date: 2026-05-12
Lens summary¶
I treat the eval harness as a trust-laundering target. Its single output — bench_score — feeds the gate that decides whether a task class graduates from bronze to silver to gold trust tiers, and trust-tier promotion expands what the planner is allowed to do in real repositories (ADR-0016 §Decision §4, production ADR-0008). An adversary who can move that one number — by poisoning a bench case, tampering a cassette, swapping a rubric, or rewriting score history — can engineer a fake "promotion-ready" track record without ever touching production code. The harness is a control-plane component pretending to be a test runner; I design it as the former.
Threat model assumed. Adversaries are (a) contributors with bench-write access whose intent is to "fix a flaky case" but who in fact relax detection of a real failure mode; (b) an attacker with shell on an operator's laptop attempting to forge BenchScore history to trigger promotion; (c) a malicious or compromised dependency in the rubric's import closure; (d) a poisoned outcome-ledger entry (Phase 13) that auto-converts into a bench case under ADR-0016 §Decision §6's regression-converted provenance class; (e) cassette swap by a contributor with tests/cassettes/ write access. I assume each layer will eventually be breached and require the next.
Optimized for: integrity over throughput; audit completeness over operator UX; default-deny in every component; explicit, code-enforced trust boundaries; cryptographic chaining of every score that ever influences a promotion decision; and humans always promote as the corollary to production ADR-0009's "humans always merge."
Explicitly deprioritized: wall-clock harness runtime, per-case cost minimization, developer ergonomics of adding new cases (extra friction is the intended security control), the LLM-Judge un-deferral surface (Phase 5 ADR-0008) — un-deferral remains future work and the security shape I leave behind is sufficient.
Threat model¶
Assets to protect¶
- Bench-case bodies (the curated input + expected-output corpus under
bench/{task-class}/cases/{case-id}/). Integrity and provenance. A bench case is a signed claim that this input should produce that output; if the claim is wrong, the harness measures the wrong thing. - Rubric source (
bench/{task-class}/rubric.py). This is executable Python invoked by the harness on every run. A malicious rubric is RCE on the operator's host with whatever privileges the harness runs under, including the ability to forge its ownBenchScorereturns and corrupt the audit chain in the same process. - Cassette corpus (
tests/cassettes/extended for bench, per Phase 4 final-design §VCR cassette discipline). Bench runs replay frozen LLM responses. A swapped cassette skews scores without touching the bench case itself — invisible to bench-case integrity checks alone. - Score history (
.codegenie/bench/history/). The append-only ledger of everyBenchScoreever produced. Rewriting the past lets an attacker synthesize a promotion-eligible track record. - Trust-tier promotion authority. The act of changing a task class from bronze → silver → gold. This is not data; it is an authorization decision. Whoever holds this authority effectively widens or narrows the system's blast radius across the portfolio.
- The inference itself. "bench_score ≥ threshold → safe to promote." If the threshold is set from a forged distribution, every downstream promotion is unsafe even if the cryptography holds.
Adversaries assumed¶
- A1 — Compromised internal contributor with bench-case write access. Lands a "fix" that softens detection of a known failure mode, or adds a high-scoring synthetic case that anchors the rubric's mean upward. Most likely adversary; full social-engineering plausibility.
- A2 — Compromised contributor with rubric-edit access. Lands a
rubric.pychange that returnspassed=Trueon a poisoned input pattern, or that silently mutates global harness state during scoring. - A3 — Attacker with shell on operator's laptop. Has write access to
.codegenie/bench/history/,tests/cassettes/, and possibly the audit chain anchor. Goal: forge enough history to make the promotion gate trigger. - A4 — Poisoned outcome ledger entry (Phase 13). A regression that should convert into a
regression-convertedbench case is filtered out by the conversion logic, or — worse — a benign incident is converted withdisposition: positiveso it adds a "real production failure we caught" to the bench-score numerator. - A5 — Supply-chain attacker via rubric dependencies.
rubric.pyimportsnumpyor a YAML parser; a pinned dependency rolls forward to a malicious version. - A6 — Cassette-swap attacker. Replaces
tests/cassettes/bench/.../foo.yaml.zstwith one whose response is the "correct" answer to every case, neutering the entire benchmark. - A7 — CI-config tamperer. Silently disables the fence-CI test that enforces "task class registered →
bench/{name}/exists." Or disables the bench job entirely. The harness still runs locally and looks healthy. - A8 — Promotion forger. Has commit access. Pushes a
promotion.yaml(or equivalent) that flips a task class to gold by hand, claiming the bench history justifies it.
Attack surfaces specific to this phase¶
- The
bench/{task-class}/cases/{case-id}/case.yamlprovenance block — easy to lie in YAML. - The
bench/{task-class}/rubric.pyimport-time side effects. - The
score(harness_output, expected) -> BenchScorecall site — the rubric runs inside the harness's Python process by default. - The
.codegenie/bench/history/append-only ledger — filesystem; vulnerable to local-shell tampering. - The cassette directory the harness shares with Phase 4 — both phases must agree on integrity rules or one undermines the other.
- The fence-CI test asserting
bench/{name}/exists when a task class is registered — meta-control; its disablement is itself a finding. - The trust-tier promotion gate's read of
bench_score— if it reads "the latest history file" rather than "the latest verified history entry," it inherits any tamper the audit chain catches. - The outcome-ledger reconciliation hook (deferred to Phase 13 per ADR-0016) — this phase must define the boundary now, not when Phase 13 reaches it.
Trust boundaries (explicit)¶
| Boundary | From (less-trusted) | To (more-trusted) | Crossing control |
|---|---|---|---|
| TB-1 | bench-case curator (human or LLM-assisted) | bench/{task-class}/cases/ on disk |
CODEOWNERS-protected path; two-reviewer rule; case-content BLAKE3 digest pinned in bench/{task-class}/cases/digests.yaml; case.yaml provenance signed (see Components → case format) |
| TB-2 | bench/{task-class}/rubric.py source |
rubric executor | Rubric runs in a microVM (Firecracker on Linux/CI, gVisor-on-Lima on macOS dev) — never in the harness's Python process. Output is a JSON BenchScore over an RPC; no shared memory |
| TB-3 | rubric microVM | harness orchestrator | Schema validation on the JSON BenchScore; extra="forbid" mirroring Phase 5 ADR-0014; range checks on every numeric field; hard wall-clock + cost cap enforced by the harness, not the rubric |
| TB-4 | harness orchestrator | .codegenie/bench/history/ ledger |
Every BenchScore is hashed + chained (BLAKE3-over-SHA-256, matching Phase 0 audit.py); per-run run-record.json mode 0600; codegenie bench verify re-walks the chain |
| TB-5 | .codegenie/bench/history/ ledger |
trust-tier promotion gate (promotion.py) |
Gate reads only verified chain entries; refuses to read on chain-tamper detection; emits a block-severity audit event |
| TB-6 | promotion gate | actual tier change | The gate produces a recommendation, not a side effect. Tier change is a separate, human-authorized PR amending bench/{task-class}/registration.py's current_tier field with a CODEOWNERS-approved diff (see Component → promotion.py) |
| TB-7 | outcome ledger (Phase 13) | regression-converted bench cases |
Conversion is a proposal — produces a draft case in bench/{task-class}/cases-pending/; promotion to bench/{task-class}/cases/ requires the same CODEOWNERS rule as TB-1 |
| TB-8 | cassette corpus | bench runner | Cassette digest pinned per bench case in case.yaml#cassette_digest; startup integrity check refuses to run a case whose cassette hash drifts |
Goals (concrete, measurable)¶
- Bench-case poisoning detection. Every case-content byte BLAKE3-digested at the
cases/digests.yamllevel (one digest per case directory's tar-serialization); everycase.yamlfield carriesprovenance.signed_by(the GPG fingerprint or Sigstore identity of the human who approved it) andprovenance.signed_at. CI rejects a PR that changes a case's contents without updating its digest and re-signing — mitigates A1, A4. - Rubric malicious-code execution risk. Rubric runs in a microVM with no network, no host FS, and no env inheritance (
env_allowlist.filter({})returns{}— evenPATHis microVM-default). Inputs are JSON-over-pipe; output is JSON-over-pipe. CODEOWNERS rule onbench/{task-class}/rubric.pyrequires two reviewers, one of whom must hold thesecurityrole. Mitigates A2, A5. - Score-history tamper detection.
.codegenie/bench/history/is a BLAKE3-chained append-only log following the Phase 0audit.pypattern (Phase 0 final-design §2.12). Eachbench-run-<utc-iso>-<short>.jsoncarriesprev_hashlinking the chain;codegenie bench verifyre-walks the chain on every promotion-gate read. A new "audit anchor" is published nightly to a separate path (audit/anchors/bench/<date>.json) signed by the operator and stored alongside the regular audit anchors. Mitigates A3. - Trust-tier auto-promotion: forbidden. Matches ADR-0016 §Decision §4 ("wired but does not auto-promote — promotion remains a deliberate, ADR-anchored decision") and extends production ADR-0009's "humans always merge" to "humans always promote." The harness recommends; a human merges the promotion PR. Mitigates A8.
- Audit completeness target. Every harness run emits exactly one chained
run-record.jsoncontaining: harness version, rubric digest, cassette digest set, case-digest set,bench_score, runtime, cost, microVM image digest, operator fingerprint,prev_hash. Coverage measured bycodegenie bench verify --strictreturning zero gaps for 30 trailing days. - Allowed network egress. Rubric microVMs: none. Harness orchestrator (the trust-aware caller that drives LLM cassettes): network off by default;
cassette-record-onlymode enables network behind a--unsafe-recordflag plus operator confirmation, never in CI. Mitigates A6 (cassette regeneration in CI is forbidden). - Fence-CI extension is itself fenced. The fence test asserting
bench/{name}/exists when a task class is registered is checksummed and a second CI job verifies the checksum of the first. Disabling the fence is detectable via a third meta-job that asserts the second job ran. Mitigates A7. - Two-person curation rule. No bench case, no rubric edit, no
case.yamlprovenance change, and nobench/{task-class}/registration.pyedit lands without two reviewer approvals (CODEOWNERS-enforced).
Architecture¶
┌────────────────────────────────────────────────────────────┐
│ CODEOWNERS-protected territory (TB-1, TB-2 source side) │
│ │
│ bench/{task-class}/ │
│ ├─ cases/{case-id}/ │
│ │ ├─ input/... (fixture) │
│ │ ├─ expected/... (ground truth) │
│ │ └─ case.yaml (signed provenance) │
│ ├─ cases/digests.yaml (BLAKE3 per case dir) │
│ ├─ rubric.py (signed; two-reviewer) │
│ └─ registration.py (current_tier; ADR-anchored) │
└──────────────┬─────────────────────────────────────────────┘
│ TB-1 verified at startup
▼
┌──────────────────────────────────────────────────────────────────┐
│ Harness orchestrator (src/codegenie/eval/) │
│ - Runs as the codegenie process; no extra privileges │
│ - Reads cases, verifies digests, looks up cassette digests │
│ - Drives the system-under-test via Phase 4 cassettes (no net) │
│ - Spawns rubric microVM per case (TB-2) │
│ - Validates rubric output (TB-3) against BenchScore schema │
│ - Writes chained run-record to .codegenie/bench/history/ (TB-4) │
└──────────────────────────────┬───────────────────────────────────┘
│ TB-2: process boundary
▼
┌──────────────────────────┐
│ Rubric microVM │
│ (Firecracker / gVisor) │
│ - No network │
│ - No host FS │
│ - Empty env │
│ - JSON stdin/stdout │
│ - Hard wall-clock cap │
└──────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│ .codegenie/bench/history/ │
│ - Append-only, mode 0600 │
│ - BLAKE3-chained; SHA-256 identity tuple │
│ - `codegenie bench verify` walks chain │
│ - Daily signed anchor published to audit/anchors/bench/ │
└────────────────────────┬──────────────────────────────────────────┘
│ TB-5: chain-verified read only
▼
┌────────────────────────────────────┐
│ Trust-tier promotion gate │
│ src/codegenie/eval/promotion.py │
│ - Reads verified history │
│ - Emits *recommendation* artifact │
│ - DOES NOT mutate registration.py │
└────────────────┬───────────────────┘
│ TB-6: humans always promote
▼
┌────────────────────────────┐
│ Promotion PR (human) │
│ - Amends registration.py │
│ - CODEOWNERS + ADR amend │
│ - Branch protection │
└────────────────────────────┘
Defense-in-depth layers (for each, what an attacker has to defeat):
- L1 — CODEOWNERS + branch protection on
bench/**andsrc/codegenie/eval/**. Attacker must compromise two reviewers including one withsecurityrole. - L2 — Digest pinning + provenance signatures on every case + rubric. Even with a merged PR, mismatched digests fail startup integrity check.
- L3 — Rubric microVM isolation (TB-2). Even with a malicious rubric merged through L1+L2, it cannot reach the harness's process, network, or filesystem.
- L4 — Chain-walked audit history. Even if the rubric or harness lies about a score, the chain makes rewriting the past detectable.
- L5 — Humans always promote. Even if every layer above is breached, no tier change happens without a separate, reviewed PR.
- L6 — Daily signed anchor publication. Even if
.codegenie/bench/history/is rewritten on the operator's machine, the published anchor diverges; a watcher (Phase 16 hardening) alerts.
To get to tier promotion, an attacker must defeat all six.
Components¶
src/codegenie/eval/registry.py¶
- Purpose:
@register_task_classdecorator + in-process registry with collision detection. - Trust level: trusted (code under codegenie's control).
- Interface:
Inputs trusted at import time but cross-checked at registration time against
@register_task_class( "migration-chainguard-distroless", bench_path=Path("bench/migration-chainguard-distroless"), rubric_module="bench.migration_chainguard_distroless.rubric", min_cases_for_promotion={"bronze": 10, "silver": 30, "gold": 100}, current_tier="bronze", # ADR-anchored; security gate refuses unknown values ) class MigrationChainguardDistrolessTaskClass: ...bench/{name}/cases/digests.yamlexistence andrubric.pydigest. - Isolation: import-time only; no network, no subprocess, no filesystem writes.
- Credentials accessed: none.
- Audit emissions: one
task_class.registeredevent per@register_task_classcall, into the Phase 0 audit chain (re-usingcodegenie/audit.py's writer). - Tradeoffs accepted: import-time validation requires bench directories to be on disk during
pytestdiscovery; this is intentional — the fence-CI test depends on it.
Collision behavior matches Phase 5 ADR-0003's SignalKindAlreadyRegistered — raises TaskClassAlreadyRegistered loudly. No silent overrides; matches CLAUDE.md "Fail loud."
src/codegenie/eval/models.py¶
- Purpose:
BenchScorePydantic model + companionBenchRunRecord(the chained audit entry). - Trust level: trusted.
-
Interface:
class BenchScore(BaseModel): model_config = ConfigDict(extra="forbid", frozen=True) passed: bool score: float # [0.0, 1.0], validated breakdown: dict[str, float] # keys: str; values: float; no nesting failure_modes: list[str] # str-only; no nested dicts cost_usd: float duration_seconds: float # Forbidden by static test (mirrors ADR-0014): any field name matching # /confidence|llm|self_reported|model_says/ raises in CI.extra="forbid", frozen=Truemirrors Phase 5 ADR-0014 exactly — same threat (LLM judgment smuggling), same control. Atests/eval/test_bench_score_static.pywalksBenchScore's field graph and rejects banned substrings. This is non-negotiable: the moment a rubric is allowed to emitconfidence, the strict-AND discipline at the per-task-class layer collapses into the per-PR-LLM-self-assessment trap production ADR-0008 explicitly forbids. -
Isolation: pure Pydantic; no I/O.
- Credentials accessed: none.
- Audit emissions: none directly (
BenchScoreis data, not an event). - Tradeoffs accepted: range-validation forces rubrics to clamp; a rubric that wants to emit "-0.1" to signal "worse than baseline" must instead use
breakdownkeys. Intentional friction — keeps the score in[0, 1]where the promotion threshold lives.
BenchRunRecord (the chain entry) extends:
class BenchRunRecord(BaseModel):
model_config = ConfigDict(extra="forbid", frozen=True)
schema_version: Literal[1]
task_class: str
case_id: str
case_digest: str # BLAKE3 of the case tar-serialization
rubric_digest: str # BLAKE3 of rubric.py
cassette_digest: str | None
harness_version: str
microvm_image_digest: str
score: BenchScore
started_at: str # UTC ISO
finished_at: str
operator_fingerprint: str # GPG / Sigstore identity
prev_hash: str # SHA-256 of the previous chain entry; "0"*64 at genesis
src/codegenie/eval/runner.py¶
- Purpose: Orchestrates one bench run end-to-end: walk cases, verify digests, invoke system-under-test via Phase 4 cassettes, spawn rubric microVM, validate output, write chained record.
- Trust level: trusted, but minimal-privilege.
- Interface:
- Input:
TaskClassregistration object + optional filter (case-ids). - Output:
BenchRunSummaryaggregatingBenchScores. - Errors:
BenchCaseDigestMismatch,RubricDigestMismatch,CassetteDigestMismatch,RubricSandboxEscape(heuristic),ChainTamperDetected. All raised loudly;--continue-on-erroris forbidden by design (CLAUDE.md "Fail loud" + the entire point of integrity checks). - Isolation: the runner itself runs as the codegenie process; the rubric runs in a microVM per-case. The runner never
execs untrusted code in-process.import bench.{task-class}.rubricis forbidden by animport-linterrule mirroring Phase 0's lazy-import discipline — rubric source is read as bytes, copied into the microVM rootfs at run time, and executed by a stub interpreter inside the microVM. This is the same shape as production ADR-0012's sandbox-for-agent-output, applied to the rubric (also untrusted from the harness's perspective). - Credentials accessed: none. Following Phase 5 ADR-0012, the env passed to the rubric microVM is
env_allowlist.filter({}). The bench harness itself needs no API keys at run time — cassettes are deterministic replays. If the operator runs--unsafe-record(cassette regeneration), the orchestrator briefly receivesANTHROPIC_API_KEYbut never propagates it into the microVM and never logs it. CI never has access to--unsafe-record. - Audit emissions:
bench.run.started,bench.case.completed,bench.case.failed,bench.run.finished, and the chainedBenchRunRecordper case. - Tradeoffs accepted: per-case microVM cold-start cost (100ms Firecracker, seconds for gVisor-on-Lima) multiplied by N cases dominates wall-clock. I accept this; it is the cost of TB-2. No warm pool — same reasoning as Phase 5 final-design rejecting warm pools (snapshot reuse is itself an attack surface).
src/codegenie/eval/audit.py¶
- Purpose: Append-only chain over
BenchRunRecords; daily anchor publication. - Trust level: trusted.
- Interface:
append(record: BenchRunRecord) -> str— returns the new chain head SHA-256. Writes.codegenie/bench/history/<utc-iso>-<short>.jsonmode0600, computesBLAKE3(record_bytes), wraps inSHA-256(prev_hash || blake3)identity tuple matching Phase 0 final-design §2.12.verify(since: str | None = None) -> VerifyResult— re-walks the chain; returns mismatches.publish_anchor(date: str) -> Path— writesaudit/anchors/bench/<date>.jsoncontaining{chain_head, count, signature}; signature is the operator's Sigstore identity or GPG-fingerprint signing the chain head.- Errors:
ChainTamperDetected(raised byverifyand on everyappendthat finds a torn write),MissingAnchor,AnchorSignatureInvalid. - Isolation: filesystem-only; no network for
appendandverify;publish_anchoruses Sigstore's transparency log (network egress allowlisted in CI; off in local). - Credentials accessed: Sigstore short-lived identity (OIDC) for anchor publication; never persisted; ephemeral per
publish_anchorcall. - Audit emissions: itself the audit.
- Tradeoffs accepted: chain verification is
O(N)per promotion-gate read. For Phase 6.5 N is small (hundreds); fine. Phase 16 may add per-month anchor checkpointing if N grows.
Hash choice deliberately mirrors Phase 0 (BLAKE3 content, SHA-256 identity tuple) for two reasons: (a) operators already trust this pattern; (b) sharing the codegenie/hashing.py module from Phase 0 means there is exactly one place either algorithm is named — same load-bearing commitment as Phase 0 final-design §2.11.
src/codegenie/eval/promotion.py¶
- Purpose: Read verified history, emit a recommendation, never mutate.
- Trust level: trusted, write-restricted by code.
- Interface:
recommend(task_class: str) -> PromotionRecommendation— returns{current_tier, recommended_tier, evidence: list[BenchRunRecord], rationale: str, requires_human_approval: True}.apply(task_class: str, target_tier: str) -> NoReturn— raisesPromotionMustBeHumanAuthorizedunconditionally. The function exists to prevent anyone from writing it; the only path to a tier change is a human-authored PR amendingbench/{task-class}/registration.py#current_tier, which is CODEOWNERS-protected and (per the new invariant added by ADR-0016 §Consequences) must carry an ADR amendment.- Isolation: read-only against the audit chain; no filesystem writes outside
.codegenie/bench/recommendations/<utc-iso>.json(the recommendation artifact a human reviewer reads). - Credentials accessed: none.
- Audit emissions:
promotion.recommendation.emittedper recommendation;promotion.apply.blockedifapply()is ever called (alarms loudly — only happens if someone is trying). - Tradeoffs accepted: the "promotion" interface is intentionally asymmetric — recommendation flows automatically, application never does. This is the operationalization of ADR-0016 §Decision §4 and the corollary to production ADR-0009 at the meta level. Demotion behavior from ADR-0016 §Decision §4 ("Demotion is automatic on any production regression") deserves scrutiny — see Open questions §3. My read of the ADR: demotion is automatic in the sense that the recommendation drops; the applied tier still moves only on a human-authored PR. If the ADR meant automatic side-effect demotion, that contradicts the security argument here and should be amended.
bench/{task-class}/cases/{case-id}/case.yaml¶
The provenance schema:
schema_version: 1
case_id: "GHSA-2024-0001-axios-ssrf"
title: "Axios SSRF — minor bump available"
disposition: positive | negative | ambiguous # per ADR-0016 §Decision §1
difficulty: easy | medium | hard
provenance:
source: curated | outcome-ledger-derived | regression-converted
source_ref: # e.g., ledger entry id, GHSA id, commit SHA
type: ghsa | ledger | commit | synthetic
value: "GHSA-xxxx-xxxx-xxxx"
added_at: "2026-05-12T14:00:00Z"
added_by:
identity: "alice@codewizard.example" # Sigstore identity or verified GPG fingerprint
signature: "MEUCIQ..." # signature over (case_id || added_at || case_digest)
reviewed_by: # second reviewer mandatory (two-person rule)
identity: "bob@codewizard.example"
signature: "MEUCIQ..."
last_validated_at: "2026-05-12T14:00:00Z" # IndexHealthProbe-style freshness signal
expected:
recipe_id: "axios-bump-minor"
cve_delta: -1
validator_verdict: pass
cassette_digest: "blake3:abcdef..." # the Phase 4 cassette this case replays
case_digest: "blake3:123456..." # BLAKE3 of the tar-serialized cases/{case-id}/ dir minus case.yaml itself
- Trust level: untrusted as code; provenance fields verified by code.
- Isolation: the
case.yamlis read by the runner;dispositionandexpectedare passed to the rubric microVM as inputs. - Credentials accessed: none.
- Audit emissions:
case.verifiedon every load (logs the digest). - Tradeoffs accepted: two-signature requirement (CODEOWNERS-mirror at the data layer) makes case-add a slower workflow than just a PR. Intentional. Without the second signature the case is valid YAML but invalid provenance; the runner refuses to use it.
The bench/{task-class}/cases/digests.yaml top-level file:
schema_version: 1
cases:
"GHSA-2024-0001-axios-ssrf": "blake3:123456..."
"CVE-2024-9999-lodash-proto-pollution": "blake3:789abc..."
The startup integrity check (SandboxHealthProbe-style, borrowing the pattern from Phase 5 ADR-0013) refuses to run any case whose computed digest mismatches.
bench/{task-class}/rubric.py¶
- Purpose: Score one
(harness_output, expected)pair into aBenchScore. - Trust level: untrusted for execution purposes (even though under CODEOWNERS). The harness assumes the rubric is hostile.
- Interface: Inside the microVM, the rubric is invoked as
python /work/rubric.py < inputs.json > output.json.inputs.jsonis{harness_output, expected, case_metadata};output.jsonis aBenchScoreJSON dump. - Isolation: microVM, no network, no host FS, empty env, hard wall-clock + RSS cap (60s, 1GB defaults; rubric-author may request up to 5min / 4GB via
case.yaml#rubric_limits, enforced by the harness, never by the rubric). - Credentials accessed: none.
- Audit emissions: runner emits
rubric.invoked,rubric.completed,rubric.killed_oom,rubric.killed_timeout,rubric.malformed_output. - Tradeoffs accepted: microVM-per-case spawn cost is real. Some rubrics will be trivially short and yet pay a 100ms-to-seconds penalty. I accept this entirely — the per-call cost is the security premium, and Phase 6.5's offline-nightly cadence (ADR-0016 §Decision §5) absorbs it.
A tests/adversarial/test_rubric_isolation.py plants a deliberately malicious rubric.py in a fixture (attempts: env var read, urlopen, socket.gethostbyname, open("/etc/passwd"), fork+exec) and asserts the resulting BenchScore is RubricSandboxEscape or shows the malicious operations produced no observable host effect.
Fence-CI extensions¶
Three new gates:
test_task_class_has_bench_dir— for each registered task class, assertsbench/{name}/{cases,rubric.py,registration.py}exist and pass startup digest checks. Mirrors ADR-0016 §Consequences §Fence-CI extends.test_bench_score_static_introspection— the Phase 5 ADR-0014-style introspection overBenchScore's reachable fields; rejectsconfidence | llm | self_reported | model_sayssubstrings.test_fence_ci_is_alive— meta-gate: hashes the CI workflow file's bench-job stanza and asserts it matchestools/digests.yaml#ci.bench_job. Disabling the bench job changes the digest and fails this gate.
CODEOWNERS additions:
bench/ @codewizard-sherpa/security @codewizard-sherpa/eval-curators
bench/**/rubric.py @codewizard-sherpa/security
bench/**/registration.py @codewizard-sherpa/security
src/codegenie/eval/ @codewizard-sherpa/security
src/codegenie/eval/promotion.py @codewizard-sherpa/security (single-owner; second approval external)
tools/digests.yaml @codewizard-sherpa/security
.github/workflows/bench.yml @codewizard-sherpa/security
Data flow¶
One representative end-to-end run for migration-chainguard-distroless:
- Startup integrity (TB-1, TB-8).
codegenie bench run migration-chainguard-distrolessboots. The runner readsbench/migration-chainguard-distroless/cases/digests.yaml. For each case directory it computes the BLAKE3 of the tar-serialization (excludingcase.yamlitself) and compares to the pinned digest. One mismatch → abort. It then reads eachcase.yaml, verifies the two Sigstore/GPG signatures against the case-content digest. One unverifiable signature → abort. It readsbench/.../rubric.py, compares its BLAKE3 totools/digests.yaml#bench.migration_chainguard_distroless.rubric. Mismatch → abort. - Cassette verification (TB-8). For each case, the runner looks up
case.yaml#cassette_digestand verifies the cassette file attests/cassettes/bench/migration-chainguard-distroless/{case-id}.yaml.zstmatches. Mismatch → abort. - Chain integrity (TB-4). The runner calls
audit.verify()on the existing chain. Tamper → abort and emit ablock-severity audit event reachable by Phase 16 alerting. - Per-case execution. For each case, in serial (no parallelism in Phase 6.5 — concurrency is an integrity-correlation risk this phase does not need):
- Spawn rubric microVM. The microVM image digest comes from
tools/digests.yaml#bench.rubric_image. Hard wall-clock cap, hard RSS cap, network=none, env={}. - Invoke system-under-test against
case_inputusing the verified cassette. The SUT path is the Phase 6 SHERPA state machine — its execution within the microVM is the same shape as the Phase 5 trust-aware gates, not a special bench-only path. The cassette ensures determinism. - Pipe
(harness_output, expected, case_metadata)JSON into the rubric microVM stdin. - Read
BenchScoreJSON from rubric microVM stdout. Validate against the Pydantic schema (extra="forbid"). Schema violation or malformed JSON → recordBenchScore(passed=False, failure_modes=["rubric_malformed"])and continue; the case counts as a failure. - Construct
BenchRunRecordwith all digests, the operator's Sigstore identity (short-lived OIDC),prev_hashset to the current chain head. audit.append(record)— writes the file, computes new chain head.- Aggregation. Compute
bench_score.mean,failure_modesunion,cost_usdsum. Emitbench_run_summary.jsonto.codegenie/bench/runs/<utc-iso>-<short>.jsonmode0600. - Promotion recommendation.
promotion.recommend("migration-chainguard-distroless")reads only verified chain entries from the last N days (configurable; default 30). Returns aPromotionRecommendation. Writes it to.codegenie/bench/recommendations/<utc-iso>.json. Does not modify any registration. - Daily anchor. Nightly job calls
audit.publish_anchor(today). Sigstore OIDC token is fetched, the chain head signed, and the anchor written toaudit/anchors/bench/<date>.json. This file is committed via PR (one-line, auto-generated) so the public chain head lives in git history — diverging local tampering becomes visible atgit pull.
Trust-boundary crossings are tagged at each step (the runner emits a structured event whose field set marks the boundary). A tests/integration/test_eval_data_flow.py walks the full flow against a hello-world bench fixture and asserts the boundary-event set matches a golden list — any new boundary crossing without a matching audit event fails.
Failure modes & recovery¶
| Failure | Detected by | Containment | Recovery |
|---|---|---|---|
| Poisoned bench case (A1, A4) — content edited without digest update | Startup cases/digests.yaml integrity check |
BenchCaseDigestMismatch raised; entire run aborts; no BenchRunRecord written |
Re-curate the case; update digest in PR with two-reviewer approval; CI runs fresh |
Malicious rubric (A2) — RCE attempt via rubric.py |
Rubric microVM (no network, no FS, no env); tests/adversarial/test_rubric_isolation.py |
Rubric is in its own kernel; cannot affect harness; output rejected by schema | Revert rubric.py; rotate any operator credentials reachable via the orchestrator (paranoia) |
Forged operator identity in BenchRunRecord |
Sigstore identity verification at chain-append time | audit.append() refuses to write record without verifiable identity |
Re-run from a workstation with valid OIDC |
Local tampering of .codegenie/bench/history/ (A3) |
audit.verify() re-walks the chain; mismatch on prev_hash |
ChainTamperDetected raised; promotion gate refuses to read |
Restore from the last anchor; investigate the host (likely compromised) |
| Cassette swap (A6) | Per-case cassette_digest check |
CassetteDigestMismatch raised; run aborts |
Re-record cassette via --unsafe-record from a clean environment; PR with cassettes-reviewed label, mirroring Phase 4 final-design §VCR cassette discipline |
| Fence-CI disabled (A7) | test_fence_ci_is_alive checksumming the workflow file |
Meta-job fails CI | Investigate why the fence was changed; revert |
| Forged promotion PR (A8) | CODEOWNERS protection on bench/**/registration.py + branch protection + the new invariant (ADR amendment required for tier change) |
PR blocked from merge until two reviewers approve | Standard PR review |
regression-converted bench case from poisoned outcome ledger (A4) |
Conversion always produces a cases-pending/ draft, not a cases/ entry; two-signature rule applies |
Pending draft never enters scoring | Reviewers reject the draft |
| Anchor signature unverifiable | audit.publish_anchor signature check |
AnchorSignatureInvalid raised; nightly job fails red |
Rotate the operator's signing identity; re-publish |
| Rubric runs forever (livelock) | Wall-clock cap enforced by microVM | rubric.killed_timeout event; case scored failure |
Investigate rubric; lower-bound the rubric's cost |
| Rubric OOMs | RSS cap | rubric.killed_oom; case scored failure |
Same as above |
BenchScore smuggles a hidden field via dict |
extra="forbid" + static introspection test (test_bench_score_static.py) |
CI fails before merge | Rename the field; never bypass |
| Sigstore network outage at anchor-publish time | OIDC fetch fails | Anchor not published; nightly job red | Operator gets paged; chain-head still local-verifiable; anchor catches up next day |
Two non-failure-modes worth explicitly stating: a bench case representing a hostile real-world scenario (e.g., a repo with a known sandbox-escape attempt) is curated intentionally under disposition: negative — its case.yaml carries the same two signatures and the same digest as any other case. The distinction between curated adversarial case and poisoned case is bytes-on-disk identical until you read the provenance: a poisoned case is one whose added_by is unverifiable or whose reviewed_by is missing. Provenance integrity is therefore the load-bearing distinction, exactly as ADR-0016 §Open questions §1 anticipates.
Resource & cost profile¶
Per-case execution (concrete):
- Rubric microVM cold-start: 100–500ms (Firecracker) / 1–3s (gVisor-on-Lima for macOS dev).
- Cassette replay (system-under-test under bench, no live LLM): single-digit ms.
- Rubric scoring runtime: typically <1s; cap at 60s.
- Digest verification (per case): <50ms (BLAKE3 over a small directory).
- Chain append (per case): <10ms (one file write + one hash chain).
For a 30-case migration-chainguard-distroless bench run on Linux/CI: ~1 minute wall-clock total (dominated by microVM cold starts). On macOS dev under gVisor-on-Lima: 2–4 minutes.
Cost of security (the explicit delta vs a security-naïve design):
- Without microVM per rubric (in-process import): -100ms to -3s per case; but eliminates TB-2 entirely, opening RCE-on-operator-host.
- Without per-case cassette digest check: -50ms per case total; but eliminates TB-8, opening cassette-swap.
- Without Sigstore anchor: -2s per day; but eliminates L6 (the divergence-detection layer that survives local-host compromise).
- Without two-signature requirement on bench cases: developer ergonomics +50%; but eliminates the L1 control entirely since merge-by-default contributors can act unilaterally.
I judge each cost worth paying. The nightly cadence (ADR-0016 §Decision §5) absorbs the wall-clock; the bench corpus is small and grows slowly; promotion is weeks-to-months (ADR-0016 §Tradeoffs), so the dev-loop friction is mostly invisible.
LLM cost per eval run (deferred to Phase 13 per ADR-0016 §Open questions §3): zero at run time under cassette discipline; nonzero only during --unsafe-record. The BenchScore.cost_usd field is populated from the cassette metadata (recorded cost at record-time), not from live charges, so the field is meaningful without re-spending.
Test plan¶
Adversarial tests are first-class:
tests/adversarial/test_rubric_malicious_outbound.py—rubric.pyattemptsurllib.request.urlopen("http://attacker.example"); assertRubricSandboxEscaperecorded and harness emits ablock-severity audit event; network never reached (verified via a tcpdump-monitored CI runner).tests/adversarial/test_rubric_env_read.py—rubric.pyattemptsos.environ.get("ANTHROPIC_API_KEY"); assert returned value isNonebecause env is empty (mirrors Phase 5 ADR-0012's adversarial test).tests/adversarial/test_case_poisoning_detected.py— flip one byte in a case directory'sexpected/diff.patchwithout updatingcases/digests.yaml; assertBenchCaseDigestMismatchaborts the run.tests/adversarial/test_case_provenance_forged.py— author acase.yamlwhoseadded_by.signatureis a random string; assert startup verifies the signature and refuses the case.tests/adversarial/test_chain_tamper_detected.py— rewrite the second-newestBenchRunRecordin.codegenie/bench/history/; assertaudit.verify()returns mismatch at that point andpromotion.recommend()refuses to read.tests/adversarial/test_cassette_swap_detected.py— replace a cassette with a different one (different digest); assertCassetteDigestMismatch.tests/adversarial/test_promotion_apply_blocked.py— callpromotion.apply()programmatically; assertPromotionMustBeHumanAuthorizedraised,promotion.apply.blockedevent emitted.tests/adversarial/test_bench_score_extra_forbid.py— attempt to constructBenchScore(passed=True, score=1.0, breakdown={}, failure_modes=[], cost_usd=0.0, duration_seconds=0.0, confidence=0.95); assert Pydantic raises.tests/schema/test_bench_score_static.py— recursive field-name walk; rejects banned substrings (mirrorstest_objective_signals_static.py).tests/schema/test_digests_yaml_bench.py—tools/digests.yamlhas entries for every registered task class's rubric and microVM image.tests/fence/test_task_class_has_bench_dir.py— for each@register_task_class, assert directory contract.tests/fence/test_fence_ci_is_alive.py— workflow stanza checksum.tests/integration/test_eval_data_flow.py— golden hello-world bench fixture; end-to-end run; chain head matches expected.tests/integration/test_anchor_publication.py— Sigstore-mocked CI test; publishes an anchor; verifies signature and chain head.
A passing test plan means: (i) every adversarial test green; (ii) codegenie bench verify --strict returns zero gaps over 30 trailing simulated days; (iii) no bench_score reaches promotion.recommend() without a verified chain entry; (iv) no path in the codebase can construct a BenchScore carrying a forbidden field name; (v) the apply() interface is provably uncallable by code (only humans + git can move tiers).
Risks (top 5)¶
-
The microVM stack on macOS dev (gVisor-on-Lima) is operational debt. Phase 5 picked DinD on macOS for trust-aware gates, accepting shared-kernel verdicts on developer laptops. I propose gVisor-on-Lima for the rubric (a different sandbox stack from Phase 5's), which means two macOS-dev stacks coexist. Cost: real. Alternative: accept shared-kernel rubric execution on macOS dev with
gate_isolation_class: "shared_kernel"annotation propagated to theBenchRunRecord, and refuse to use shared-kernel-produced records for promotion decisions. This is the cleaner choice; I commit to it: production-qualityBenchRunRecords come from Linux/CI Firecracker only; macOS dev records are flagged and ignored bypromotion.recommend(). Mirrors Phase 5 final-design §3 ("gate_isolation_class: shared_kernel always") for DinD. -
Two-person curation rule creates a single-point-of-failure on the second reviewer. If the
securityrole has only one human, every bench-case PR queues on one inbox. Mitigation: at least three humans withsecurityrole; reviewer-fatigue monitoring (Phase 13 dashboard hook). This is an org-policy risk, not a code risk. -
The demotion-on-regression behavior from ADR-0016 §Decision §4 is ambiguous. "Demotion is automatic on any production regression that the bench set fails to catch" — if "automatic" means "the next promotion recommendation drops the recommended tier," that aligns with humans-always-promote. If it means "the runner side-effects a tier change," that violates TB-6. I commit to the first reading and flag this for the synthesizer. A poisoned outcome ledger (A4) that synthesizes a fake "regression" entry is a path to automatic demotion — humiliating but non-catastrophic — but a path to automatic promotion via the same mechanism would be catastrophic.
-
Sigstore dependency widens supply-chain attack surface. Anchor publication requires the Sigstore transparency log being honest. If Sigstore is compromised, anchors are forgeable. Compensating control: maintain a parallel offline operator-signed anchor (GPG-detached signature alongside the Sigstore one). Belt + suspenders; cost is one extra signature per day.
-
bench/checked into the same repo (per Phase 6.5 default) means every contributor with repo-read access can read all bench cases. For vuln-remediation bench cases this is fine. For migration-chainguard-distroless cases that snapshot real customer repos under permissive license, it may not be. The decision to split intocodewizard-sherpa-benchesis deferred per ADR-0016 §Open questions §4; I lean toward splitting for the migration bench specifically once it grows, with stricter access control.
Acknowledged blind spots¶
- Wall-clock optimization. I have not designed for parallelism (deliberately — sequential makes audit-chain integrity simpler). The performance lens will argue for it. Synthesizer should weigh.
- Cost minimization. Every per-case microVM spawn is paid. I have not designed cost-recovery via case-caching across runs; cache invalidation under my integrity model is non-trivial and worth deferring.
- Operator UX. Adding a bench case is a multi-step ceremony (two signatures, digest update, ADR amendment for new task class). This is the intended friction. Best-practices lens may reasonably push back.
- The LLM-Judge un-deferral path. Phase 5 ADR-0008 un-deferral is downstream of this harness (ADR-0016 §"What this ADR explicitly resolves"). I have not designed
bench/judgment-arbitration/— leaving it for the synthesizer to decide whether the security shape I leave behind (microVM-isolated rubric, signed cases, chained scores) is sufficient when the Judge enters the loop. My read: it is. The Judge becomes one more "rubric-like" thing, isolated identically. - Reconciliation timing. Outcome-ledger reconciliation (Phase 13) is the path from production regression → new
regression-convertedbench case. I have set the TB-7 contract but not the cadence. Phase 13 inherits the question; my design tolerates any cadence because each conversion is acases-pending/draft requiring human approval. - Cassette regeneration drift. Cassette-record-mode is
--unsafe-recordonly; I have not designed a "re-record without paying full LLM cost" path. This is a Phase 13 cost concern.
Open questions for the synthesizer¶
-
Is the rubric microVM cost-justified on Phase 6.5's small bench? With 10–30 cases per task class on the bronze tier, the overhead is ~30s per nightly run total. Performance lens will likely argue for
import bench.{name}.rubricin-process withbanditstatic analysis as the control. My read: the security premium is worth it because the rubric is control-plane code — it gates promotion, and promotion gates blast radius. The synthesizer must resolve whether the security argument or the cost argument wins. -
Does
promotion.recommend()need to be itself isolated? I run it in the harness orchestrator process. If the orchestrator is compromised, recommendations can be forged — but humans-always-promote (TB-6) is the catch. Shouldrecommend()also run in a microVM? My read: no — the recommendation is advisory and the audit chain it reads is independently verifiable from the published anchor; forging a recommendation in memory is useless without forging the chain on disk, which the chain verify catches. -
The ADR-0016 §Decision §4 demotion clause — does "automatic" mean side-effect or recommendation-shift? See risk #3. I have committed to the recommendation-shift reading. If the synthesizer disagrees, the design needs the
apply()interface to be writable for demotion only, which is a non-trivial asymmetry; I would argue against it on the grounds that asymmetric authorization is a frequent source of bugs. -
Should the daily anchor PR be auto-merged? It is generated by a job, contains exactly one new file under
audit/anchors/bench/<date>.json, and serves the L6 purpose only if it lands in git. Two options: (a) auto-merge with a narrow exception to "humans always merge" — the precedent production ADR-0009 reserves for "narrow exceptions to be ADR-amendable"; (b) require human merge, accepting that anchor publication may lag. My lean: option (a) with an ADR amendment, because the anchor is self-attesting data not new code. Synthesizer should weigh. -
Is
bench/in the same repo, or split? ADR-0016 §Open questions §4 defers this. Security argument: split, so bench-write-access is governed separately from code-write-access. Operational argument: same repo, so the fence-CI tests run atomically with code changes. My lean: same repo for Phase 6.5; split when migration-chainguard-distroless graduates to silver (~tier promotion would itself trigger the split). The synthesizer can confirm or push back. -
Should the operator's Sigstore OIDC identity be persisted in
BenchRunRecord.operator_fingerprintas the identity string, or as a hash thereof? Identity string is auditable but leaks who ran the bench. Hash is privacy-respecting but un-auditable post-hoc without a side channel. My lean: hash, with the side channel being aaudit/operators.yamlmapping that itself is two-signature-protected. Synthesizer can resolve.