ADR-0008: Trust score uses objective signals only, no LLM self-confidence
Status: Accepted
Date: 2026-05-11
Tags: trust · safety
Related: ADR-0009, ADR-0012, ADR-0015
Context
The Trust-Aware gate layer (../design.md §4.1 Layer 3) puts every state transition behind an "is this safe to advance?" check. The natural temptation is to ask the LLM itself, "how confident are you in this output?", and use the answer as a gate input.
Published research argues this is worse than no signal at all. The "Confidence Trap" finding in ../../gemini-auto-agent-design.md §"Mitigating the Confidence Trap" reports that agentic PRs at the highest self-reported confidence levels (8–10 out of 10) still introduce breaking changes at rates of 3.16–3.96%. At confidence 10, the rate is 3.16% (458 breaks out of 14,509 commits). The correlation between LLM-reported confidence and code correctness breaks down completely during maintenance tasks.
A gate keyed on self-reported confidence produces false reassurance proportional to risk.
Options considered
- LLM self-reported confidence. Cheap, easy, available. Empirically miscalibrated during the exact tasks (maintenance, refactor) where the system makes its money.
- Objective signals only. Sandbox build status, test pass/fail counts, SAST findings, CVE delta direction, runtime-trace coverage, policy-engine block events. Slower, more infrastructure, but the signal is from the world, not from the model.
- Hybrid: objective primary, self-confidence as tie-breaker. Compromise. Still vulnerable to the Confidence Trap when objective signals are borderline.
Decision
Trust score is computed from objective evidence only. Specifically:
- Sandbox build status (binary)
- Test pass/fail counts and delta vs baseline
- SAST/DAST findings, new vs baseline
- CVE delta direction (more, same, or fewer)
- Runtime-trace coverage (which scenarios completed cleanly)
- Policy-engine block events (did any deterministic rule fire?)
- Coverage of changed code by existing tests
LLM self-reported confidence may be logged for observability and drift analysis. It must not feed the gate.
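
A minimal sketch of how that separation might look in code, assuming Python. The names `ObjectiveSignals`, `GateRecord`, `gate_inputs`, and `log_fields` are hypothetical and do not come from the design docs; the fields simply mirror the list above. Self-reported confidence sits on the record for logging, and the function that builds gate inputs never reads it.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ObjectiveSignals:
    """Evidence from the world, one field per signal in the Decision list."""
    build_passed: bool            # sandbox build status (binary)
    tests_failed: int             # failing tests in this run
    tests_failed_baseline: int    # failing tests on the baseline
    new_scan_findings: int        # SAST/DAST findings absent from the baseline
    cve_delta: int                # >0 more CVEs, 0 same, <0 fewer
    clean_scenarios: int          # runtime-trace scenarios that completed cleanly
    total_scenarios: int          # runtime-trace scenarios attempted
    policy_blocks: int            # deterministic policy rules that fired
    changed_code_coverage: float  # share of changed code covered by existing tests, 0..1


@dataclass(frozen=True)
class GateRecord:
    signals: ObjectiveSignals
    # Kept for observability and drift analysis only; never a gate input.
    llm_self_confidence: Optional[float] = None


def gate_inputs(record: GateRecord) -> dict:
    """Return the only values a gate is allowed to see."""
    return asdict(record.signals)


def log_fields(record: GateRecord) -> dict:
    """Return everything worth logging, including the model's self-report."""
    return {**asdict(record.signals),
            "llm_self_confidence": record.llm_self_confidence}
```

Keeping the self-report out of `gate_inputs` by construction, rather than by convention, means a gate cannot start depending on it by accident.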
Tradeoffs
| Gain | Cost |
|---|---|
| Gate is grounded in reality, not model self-report | More infrastructure: sandbox + scanners + signal aggregator |
| Confidence-Trap immunity — agents cannot "talk their way past" the gate | Gates take seconds/minutes (sandbox checks), not milliseconds |
| Gate verdicts are explainable — show the failing signal | Some classes of intent-level wrongness (semantically correct but goal-wrong code) are invisible to objective signals |
| Calibration is empirical (sample real outcomes, tune thresholds) | Threshold calibration (ADR-0015) requires production data — can't be set a priori |
Consequences
- Stage 5 Validation (../design.md §3 Stage 5) is the canonical objective-signal collector. The gate logic at every other stage's transition leans on the same signal sources.
- The trust-score formula is a weighted sum of objective signals; weights and the gate threshold are calibrated against post-merge incident data (ADR-0015 deferred).
- Until calibration data exists, gates use binary pass/fail on the most direct objective signal (build passes / tests pass / SAST finds nothing new). Conservative default.
- The "40-Point Rule" from ../../gemini-auto-agent-design.md (halt when the confidence/information gap exceeds 40 points) is interesting but contingent on reliable confidence signals. Per this ADR, we do not yet have those. Phase 3 concern.
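
The two gate modes described in this section can be sketched as follows. This is an illustration under stated assumptions, not the calibrated formula: the function names, signal names, weights, and the 0.8 threshold are placeholders invented for the example, since real values depend on calibration data that does not exist yet. Each signal passed to the weighted score is assumed to be normalized to [0, 1] with 1 as the safest reading, and dividing by the total weight is a choice made here to keep the score in that range.

```python
from typing import Dict, List, Tuple


def binary_gate(build_passed: bool, tests_failed: int,
                new_sast_findings: int) -> Tuple[bool, List[str]]:
    """Conservative pre-calibration gate: every direct signal must be clean.

    Returns (advance, failing_signals) so the verdict can name what blocked it.
    """
    failing: List[str] = []
    if not build_passed:
        failing.append("build")
    if tests_failed > 0:
        failing.append("tests")
    if new_sast_findings > 0:
        failing.append("sast")
    return (not failing, failing)


def weighted_trust_score(signals: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of normalized objective signals, scaled to [0, 1]."""
    if set(signals) != set(weights):
        raise ValueError("signals and weights must cover the same names")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * signals[name] for name in weights) / total_weight


# Placeholder values for illustration only; not calibrated.
EXAMPLE_WEIGHTS = {"build": 3.0, "tests": 3.0, "sast": 2.0, "coverage": 2.0}
EXAMPLE_THRESHOLD = 0.8


def weighted_gate(signals: Dict[str, float],
                  weights: Dict[str, float] = EXAMPLE_WEIGHTS,
                  threshold: float = EXAMPLE_THRESHOLD) -> bool:
    """Advance only when the weighted score reaches the threshold."""
    return weighted_trust_score(signals, weights) >= threshold
```

Neither function accepts a self-confidence argument, so there is no parameter through which the model's self-report could influence a verdict.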
Reversibility
Mechanically cheap: reversing this decision only requires adding self-confidence to the score. Costly in every way that matters: once self-confidence enters the score, every gate's calibration must be redone, and the Confidence-Trap exposure returns.
Evidence / sources
- ../design.md §4.6 (push-back: trust score uses objective signals only)
- ../../gemini-auto-agent-design.md §"Mitigating the Confidence Trap": empirical Confidence-Trap data
- ../../gemini-auto-agent-design.md §"Empirical Realities": agentic-PR breaking-change rates by task type
- arXiv 2603.27524 "Safer Builders, Risky Maintainers": agents fail at higher rates during maintenance than feature work
- ../../reviews/2026-05-18-research-committee-search-paper.md: external evidence that executable / type-check / test signals are the "local soundness signal" the committee-search bounds require; the objective-signal trust score is exactly that shape