Skip to content

ADR-0008: Two-threshold calibration band for RAG retrieval — high_floor, degraded_floor in plugin.yaml

Status: Accepted Date: 2026-05-18 Tags: tagged-union · honest-confidence · config-as-data · specification-pattern · adr-0008 Related: ADR-0009 (this phase) · production ADR-0008

Context

All three design lenses chose RAG retrieval thresholds as single global floats: performance picked 0.92/0.97, security picked 0.85, best-practices picked 0.78. The critic identified this as a shared blind spot (critique.md §"Where do all three quietly agree on something questionable" item 1): "retrieval quality at Phase-4-corpus-size (≤200 examples) is a different problem than retrieval quality at portfolio scale. With <50 examples, every cosine-similarity threshold is gameable; the 'top-K with threshold' pattern is a stand-in for 'the corpus is too small for any retrieval to matter.'"

A single float threshold also makes the failure mode silent: score=0.84 and score=0.86 get bucketed identically as miss/hit depending on which side of an unprincipled cutoff they land. Phase 6.5 (per-task-class eval harness) is where calibration evidence will live, but Phase 4 has to ship a shape that Phase 6.5 can calibrate against.

The honest model is three retrieval outcomes — confident hit, near-match (use with caution), miss — encoded as a discriminated union, with the two thresholds living in plugin configuration rather than in code.

Cross-architecture ONNX drift (ADR-0007 of this phase) at the 5th decimal exacerbates the single-threshold fragility: 0.85001 on x86_64 and 0.84998 on arm64 produce different retrieval outcomes. A band with explicit "degraded" middle absorbs this.

Options considered

  • Single global float threshold (all three design lenses). One number; values above hit, values below miss. Pattern: Magic-number cutoff. Silent failure mode at the boundary; gameable on small corpora; sensitive to cross-arch float drift.
  • Single threshold + Optional[SolvedExample] return (implicit alternative). Distinguishes hit-with-example vs no-example via Optional, but doesn't model degraded. Pattern: Optional-field overload. The toolkit's "make illegal states unrepresentable" flag fires — Optional[float] for similarity score lets Optional[SolvedExample] and Optional[float] disagree on hit/miss.
  • Two thresholds + three-variant discriminated union (RagHit | RagDegraded | RagMiss) with bands defined in plugin.yaml. Pattern: Tagged union + named bands + Specification pattern (band membership is a composable rule).
  • Learned classifier instead of thresholds (e.g., train a small classifier per task class on labeled hit/miss examples). Pattern: ML classifier replaces heuristic. Rejected for Phase 4 — labeled training data doesn't exist yet (Phase 6.5's job); over-engineered for the corpus size; an ML decision boundary is still calibrated thresholds wearing a different hat at this scale.

Decision

SolvedExampleRetriever.query(...) returns RetrievalOutcome = RagHit(few_shot, score) | RagDegraded(near_match, score) | RagMiss. Classification is by two floats living in plugins/.../plugin.yaml:

  • similarity ≥ high_floor (default 0.85) → RagHit(few_shot=record)
  • degraded_floor ≤ similarity < high_floor (default 0.65) → RagDegraded(near_match=record) — fed to LLM as few-shot with a "low-confidence" tag in the prompt template
  • similarity < degraded_floorRagMiss

Defaults are conservative initial values; Phase 6.5's calibration harness owns evidence-based tuning. Thresholds live in plugin.yaml, not in code — calibration is config, not a code edit. Pattern: Tagged union + named bands instead of magic numbers + Specification pattern (band classification is a named, composable rule).

Tradeoffs

Gain Cost
The single-threshold silent failure at the boundary is replaced by an explicit RagDegraded state — the LLM is told the few-shot is near-match, not certain-match Three outcomes means three consumer code paths in FallbackTier (vs two for hit/miss); the additional match arm is cheap
Calibration is config-as-data — Phase 6.5 ships new evidence and operators bump plugin.yaml without a code release Calibration quality still depends on Phase 6.5 evidence; until that lands, the defaults are conservative guesses
Cross-architecture ONNX drift at 5th decimal is absorbed by the band — 0.8501 and 0.8498 are both either RagHit or RagDegraded, never split The band width itself must be wider than the cross-arch drift envelope; if drift is 0.005, a band of 0.001 reintroduces the failure
The honest-confidence commitment (commitment §2.3) is honored — RagDegraded is the audit signal analogous to IndexHealthProbe's degraded state LLM few-shot output is shaped by the few-shot it sees; a RagDegraded near-match may bias the LLM toward a wrong shape — mitigated by the prompt template's explicit "low-confidence" tag and Phase 5 strict-AND validation
Per-(task_class, language, build_system) thresholds can differ as Phase 6.5 calibrates — heterogeneous corpora get heterogeneous bands More configuration surface to maintain; documented in plugins/.../plugin.yaml per plugin

Pattern fit

The toolkit's tagged union / sum type pattern applies cleanly: RagHit carries a few_shot example; RagMiss is bare; RagDegraded carries a near_match. The three shapes are genuinely different — modeling them as Optional[SolvedExample] + Optional[float] lets illegal states (few_shot=None, score=0.95) be representable, which is the anti-pattern the discipline forbids (ADR-0033).

The Specification pattern applies to band classification: "the band a similarity falls into" is a named, composable rule (is_high_confidence ⇔ score ≥ high_floor; is_degraded ⇔ degraded_floor ≤ score < high_floor). Both rules are exposed as Pydantic validators on RetrievalOutcome.classify(score, high_floor, degraded_floor); the rule composition is one place and is testable in isolation.

Consequences

  • RetrievalOutcome is a stable contract Phase 5 and Phase 6 consume; widening it (e.g., a future RagAmbiguous for multi-hit ties) is a Phase-amendment ADR.
  • The inline-harvest gate (ADR-0009) reads TrustOutcome.confidence == "high" rather than a numeric threshold — the band shape and the harvest shape both speak the same vocabulary.
  • Phase 6.5's calibration harness reads RetrievalOutcome events from the spanning event log and produces evidence-backed threshold proposals per (task_class, language, build_system); operators bump plugin.yaml accordingly.
  • A RagDegraded outcome shows up in audit as a typed event (RagDegraded(score, near_match_id)); the operator portal renders it distinctly from RagHit.
  • The prompt template (plugins/.../skills/leaf-llm-instruction.md) has a conditional block for "this few-shot is low-confidence; treat as guidance, not template" that fires on RagDegraded.
  • Tests: tests/unit/rag/test_retriever_thresholds.py covers band classification (0.95→hit, 0.75→degraded, 0.40→miss); tests/property/test_retriever_band_monotonic.py Hypothesis-asserts that higher similarity never yields lower confidence.
  • The contract Phase 6.5 sees retrieval band evidence shaped as typed events lets the harness work without parsing prose logs.

Reversibility

High. Threshold values are config; bumping them is one YAML edit, zero code change. Collapsing the band back to a single threshold would require deleting RagDegraded from the union and removing the match arm — Phase-4-local edits; but loses the honest-confidence signal Phase 6.5 builds on. Promoting to a learned classifier (Phase 11+) lands behind the same RetrievalOutcome Protocol — adapter swap, no consumer change.

Evidence / sources

  • ../final-design.md §Component 11 — SolvedExampleRetriever ("Calibration band")
  • ../final-design.md §Departures from all three inputs item 2
  • ../final-design.md §Goal "Honest confidence" (§Load-bearing commitments check §2.3)
  • ../phase-arch-design.md §Goals — G6
  • ../phase-arch-design.md §Component 9 — SolvedExampleRetriever
  • ../critique.md §"Where do all three quietly agree on something questionable" item 1
  • production ADR-0008 (honest confidence as objective signal)
  • production ADR-0033 (sum-type discipline)