ADR-0013: FenceWrapper + CanaryGuard — canary scans the UNTRUNCATED payload, then truncate
Status: Accepted
Date: 2026-05-18
Tags: functional-core-imperative-shell · trust-boundary · newtype · smart-constructor · prompt-injection-containment
Related: ADR-0001 (this phase) · ADR-0014 (this phase)
Context
Phase 4 introduces prompt assembly. Every untrusted byte (CVE description, repo README, transitive-dep metadata, source snippets, sandbox stderr, RAG-retrieved content, prior-attempt summaries) flows into the LLM prompt. The security design proposed FenceWrapper to wrap each untrusted segment in a per-invocation nonce'd delimiter (<UNTRUSTED_INPUT id=NONCE>...</UNTRUSTED_INPUT id=NONCE>) and CanaryGuard to scan for known prompt-injection patterns and nonce-collision attempts.
The critic identified a load-bearing implementation bug (critique.md §"[S] §5"): the security design's code scanned after truncating, which let an attacker hide an injection past the truncation boundary. A 16 KB payload truncated to a 4 KB cap would be canary-scanned only on the first 4 KB; anything past byte 4096 was invisible to the guard. Fixing the ordering is the central change in this ADR.
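The blind spot described above can be reproduced with a minimal sketch. The single pattern, the 4 KB cap, and the `scan` helper are hypothetical stand-ins chosen for illustration; they are not the project's `INJECTION_PATTERNS` or `CanaryGuard`.

```python
# Hypothetical one-entry denylist and cap, for illustration only.
INJECTION_PATTERNS: tuple[bytes, ...] = (b"ignore previous instructions",)
CAP_BYTES = 4096


def scan(payload: bytes) -> bool:
    """Return True if any denylisted pattern occurs in the payload."""
    lowered = payload.lower()
    return any(pattern in lowered for pattern in INJECTION_PATTERNS)


# 16 KB payload: benign padding, with the injection placed past byte 4096.
payload = b"A" * 8000 + b"Ignore previous instructions" + b"A" * 8356
assert len(payload) == 16 * 1024

truncate_then_scan = scan(payload[:CAP_BYTES])  # guard sees only the first 4 KB
scan_then_truncate = scan(payload)              # guard sees every byte

print(truncate_then_scan)  # False: the pattern sits outside the scanned prefix
print(scan_then_truncate)  # True: the full-payload scan finds it
```

The only difference between the two calls is which bytes the guard is given, which is why the fix is an ordering change and not a new detection technique.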
The design must also acknowledge the residual: INJECTION_PATTERNS is a denylist; denylists are by definition incomplete. The claim cannot be "injection-proof"; the claim must be "every byte is fenced + every collision is loud."
Options considered
- Canary scan after truncation (security design's original). Truncate to source-kind cap, then scan. Pattern: Truncate-then-scan. Lets an injection placed past the truncation boundary go unscanned.
- Canary scan before truncation, then truncate (synthesis fix). Scan the full payload for patterns and nonce collisions; only then apply the per-source-kind cap. Pattern: Scan-then-truncate.
- No truncation; refuse to embed payloads exceeding caps (alternative). Hard-reject any segment > cap. Pattern: Strict size limit. Loses the ability to embed large legitimate context (a truncated source snippet is better than a refused one).
- Probabilistic injection detection via ML classifier (alternative). Train a classifier on injection corpora. Pattern: ML guard. Over-engineered for Phase 4 corpus; classifier is the same denylist with hidden tunable thresholds.
Decision
CanaryGuard.scan(payload, nonce) runs on the untruncated payload. Truncation runs afterwards via FenceWrapper._truncate(payload, source_kind). The order in FenceWrapper.fence(...) is: (1) scan the untruncated payload; (2) on collision, replace the payload with the redacted block and emit CanaryCollision; (3) truncate to the source-kind cap; (4) return the FencedSegment.

INJECTION_PATTERNS: Final[tuple[bytes, ...]] is a tuple, not a list, so it is frozen at module load. The canary corpus (tests/unit/fallback/test_canary_corpus.py) is acknowledged to be incomplete and grows over time.

The design ships the Functional core / Imperative shell separation: fence_pure and scan_pure operate on bytes only (no I/O); FenceWrapper and CanaryGuard are imperative-shell wrappers that emit audit events. TrustedPrompt and FencedPromptBody are newtypes minted only by PromptBuilder (asserted by an AST-walking test).

Pattern: Newtype + Smart constructor + Functional core / Imperative shell.
The per-source truncation caps are:
| Source kind | Cap |
|---|---|
| cve_description | 4 KB |
| repo_readme | 2 KB |
| transitive_dep_meta | 1 KB × max 16 |
| source_snippet | 16 KB |
| sandbox_stderr | 8 KB |
| rag_retrieved | 8 KB × max 3 |
| prior_attempt_summary | 4 KB |
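The four-step order and the caps can be sketched as a pure function. This is a minimal sketch under stated assumptions, not the shipped `fence_pure`: the one-entry denylist, the `-1` id for nonce reuse, the `FencedSegment` fields, and the `render` helper are all hypothetical, and the "× max N" segment-count limits are assumed to be enforced by the caller.

```python
from dataclasses import dataclass
from typing import Final, Optional

KB: Final[int] = 1024

# Per-segment byte caps from the table; segment-count limits are not modelled.
CAPS: Final[dict[str, int]] = {
    "cve_description": 4 * KB,
    "repo_readme": 2 * KB,
    "transitive_dep_meta": 1 * KB,
    "source_snippet": 16 * KB,
    "sandbox_stderr": 8 * KB,
    "rag_retrieved": 8 * KB,
    "prior_attempt_summary": 4 * KB,
}

# Hypothetical denylist; the real INJECTION_PATTERNS is not reproduced here.
INJECTION_PATTERNS: Final[tuple[bytes, ...]] = (b"ignore previous instructions",)
REDACTED: Final[bytes] = b"<<redacted: canary collision>>"


@dataclass(frozen=True)
class FencedSegment:
    nonce: str
    content: bytes                       # body only, without the delimiters
    collision_pattern_id: Optional[int]  # None when the scan was clean

    def render(self) -> bytes:
        """Wrap the body in the per-invocation nonce'd delimiters."""
        return (f"<UNTRUSTED_INPUT id={self.nonce}>".encode() + self.content
                + f"</UNTRUSTED_INPUT id={self.nonce}>".encode())


def scan_pure(payload: bytes, nonce: str, patterns: tuple[bytes, ...]) -> Optional[int]:
    """Return -1 for nonce reuse, the index of the first matching pattern, else None."""
    if nonce.encode() in payload:
        return -1
    lowered = payload.lower()
    for pattern_id, pattern in enumerate(patterns):
        if pattern in lowered:
            return pattern_id
    return None


def fence_pure(payload: bytes, nonce: str, source_kind: str) -> FencedSegment:
    hit = scan_pure(payload, nonce, INJECTION_PATTERNS)  # (1) scan untruncated
    body = REDACTED if hit is not None else payload      # (2) redact on collision
    body = body[: CAPS[source_kind]]                     # (3) truncate to the cap
    return FencedSegment(nonce, body, hit)               # (4) return the segment
```

Because truncation only ever takes a prefix of bytes that were already scanned, it cannot introduce a pattern or a nonce that the scan did not see.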
Tradeoffs
| Gain | Cost |
|---|---|
| The "hide injection past truncation byte" attack is structurally impossible — scan covers full payload | Scanning untruncated payloads costs more CPU on large inputs (worst case: 16 KB source snippet × ~10 patterns × byte-level scan); measured ≤ 1 ms / 16 KB; negligible |
| Functional-core fence_pure / scan_pure are testable in isolation with Hypothesis (tests/property/test_fence_no_escape.py: "nonce never appears in fenced content") | The imperative shell (FenceWrapper, CanaryGuard) duplicates a thin wrapper layer; both pure + shell must stay in sync (test: tests/unit/fallback/test_fence_pure_shell_parity.py) |
| TrustedPrompt and FencedPromptBody newtypes are minted only by PromptBuilder, and an AST-walking test asserts no other call site constructs them; the type system enforces "every byte reaching the LLM passed through fencing" | A future call site that legitimately needs to mint these (e.g., a Phase 6 test harness) must request a fixture or extend PromptBuilder; this friction for legitimate test code is intentional |
| Canary collision replaces the payload with <<redacted: canary collision>> and emits CanaryCollision(source_kind, pattern_id); the LLM sees a redacted block and typically returns Refuse(insufficient_context) → HITL | Legitimate content containing canary-pattern-shaped text (e.g., a security advisory describing prompt injection) is false-positive redacted; mitigated by per-source-kind pattern selection and operator-portal review of CanaryCollision events |
| The honest framing ("every byte fenced + every collision loud + denylist acknowledged incomplete") is auditable; the test corpus (tests/unit/fallback/test_canary_corpus.py) grows over time with new patterns | The design does not claim injection-proofness; the security posture is "defense in depth, never sole reliance"; operators must understand this |
| INJECTION_PATTERNS is Final[tuple], frozen at module load; the toolkit's "module-level Final tuple" convention is honored | Adding patterns is an additive code change + ADR amendment; patterns cannot be hot-reloaded at runtime, by design |
Pattern fit
The toolkit's Functional core, imperative shell pattern is a textbook fit: fence_pure(payload, nonce, source_kind) -> FencedSegment is pure (no I/O, no audit emission, fully testable with Hypothesis); the FenceWrapper.fence(...) method is the imperative shell that emits the audit event and calls the core. Same for canary: scan_pure(payload, nonce, patterns) -> CanaryResult is pure.
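The core/shell split for the canary can be sketched as follows. This is a simplified illustration, not the project's code: the fields of `CanaryResult`, the injected `emit` callable standing in for the audit sink, and the extra `source_kind` argument on the shell's `scan` are assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CanaryResult:
    pattern_id: Optional[int]  # None means the payload scanned clean


@dataclass(frozen=True)
class CanaryCollision:
    source_kind: str
    pattern_id: int


def scan_pure(payload: bytes, nonce: str, patterns: tuple[bytes, ...]) -> CanaryResult:
    """Functional core: bytes in, value out; no I/O and no audit emission."""
    candidates = (nonce.encode(),) + patterns  # index 0 is the nonce-collision check
    for pattern_id, candidate in enumerate(candidates):
        if candidate in payload:
            return CanaryResult(pattern_id)
    return CanaryResult(None)


class CanaryGuard:
    """Imperative shell: calls the core, then performs the side effect."""

    def __init__(self, patterns: tuple[bytes, ...],
                 emit: Callable[[CanaryCollision], None]) -> None:
        self._patterns = patterns
        self._emit = emit  # hypothetical audit sink

    def scan(self, payload: bytes, nonce: str, source_kind: str) -> CanaryResult:
        result = scan_pure(payload, nonce, self._patterns)
        if result.pattern_id is not None:
            self._emit(CanaryCollision(source_kind, result.pattern_id))
        return result


# Usage: a list's append method serves as the audit sink in a test.
events: list[CanaryCollision] = []
guard = CanaryGuard((b"ignore previous instructions",), events.append)
guard.scan(b"please ignore previous instructions", "3f9c2a", "repo_readme")
```

Keeping the side effect behind an injected callable is one way to let a parity test compare `scan_pure` and the shell on the same inputs without touching a real audit backend.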
Newtype + Smart constructor for TrustedPrompt and FencedPromptBody: only PromptBuilder can construct them; the type system enforces the invariant "every prompt-shaped byte passed through fencing." The toolkit's example "every domain primitive shows up in 50 call sites" applies — these newtypes flow from PromptBuilder through FallbackTier.run into LeafLlm.invoke, and no other constructor exists.
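A smart constructor in Python might look like the sketch below. The ADR relies on static typing plus an AST-walking test for enforcement; the runtime `_MINT_TOKEN` check shown here is an additional, hypothetical mechanism used only to make the "single minting site" idea visible in a few lines, and `build_body` is an assumed method name.

```python
_MINT_TOKEN = object()  # module-private capability (hypothetical mechanism)


class FencedPromptBody:
    """Newtype over bytes that records 'this content went through fencing'."""

    __slots__ = ("value",)

    def __init__(self, value: bytes, *, _token: object = None) -> None:
        # Refuse construction unless the caller holds the module-private token.
        if _token is not _MINT_TOKEN:
            raise TypeError("FencedPromptBody is minted only by PromptBuilder")
        self.value = value


class PromptBuilder:
    """The single minting site: the smart constructor lives here."""

    def build_body(self, fenced_segments: list[bytes]) -> FencedPromptBody:
        # Assumes every element was already produced by the fence.
        return FencedPromptBody(b"\n".join(fenced_segments), _token=_MINT_TOKEN)


body = PromptBuilder().build_body([b"<UNTRUSTED_INPUT id=n1>x</UNTRUSTED_INPUT id=n1>"])

try:
    FencedPromptBody(b"raw, unfenced bytes")
    minted_directly = True
except TypeError:
    minted_directly = False
```

A downstream function that accepts only `FencedPromptBody` can then never receive raw bytes by accident, which is the invariant the newtype is meant to carry.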
Consequences
- The Hypothesis property `f"</UNTRUSTED_INPUT id={nonce}>" not in fence(p, ...).content` holds for any payload and any nonce; it is asserted in `tests/property/test_fence_no_escape.py`.
- `tests/unit/fallback/test_fence_wrapper.py` includes a regression test for the scan-untruncated-first ordering; flipping the ordering would fail loudly.
- `tests/adversarial/test_injection_corpus.py` runs 200+ payloads from PromptInject + project-curated through `FenceWrapper` + `CanaryGuard`; the target is 0 escapes (closes critic [S] §5).
- `INJECTION_PATTERNS` growth: each new pattern is an ADR-light single-line addition; the corpus test catches regressions automatically.
- `CanaryCollision` events are operator-portal-visible; a spike indicates either active adversarial probing or false-positive-prone content (e.g., a research advisory).
- `PromptBuilder` is the only minting site for `TrustedPrompt` and `FencedPromptBody`; `tests/fence/test_prompt_newtype_minting_bounded.py` AST-walks the code base and asserts that no other module constructs them.
- Phase 5's `prior_failure_summary` (per ADR-0011 of this phase) is fenced as `source_kind="prior_attempt_summary"` (4 KB cap) when re-entered through `FallbackTier.run`.
- The honest framing in design + ADR: this is defense, not a structural proof of injection immunity. Operators must understand the residual.
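The AST-walking check listed above could be built on the standard-library `ast` module. The sketch below is an assumption about its shape, not the contents of `test_prompt_newtype_minting_bounded.py`; the helper name `find_minting_sites` is hypothetical, and the sketch would miss aliased imports such as `import TrustedPrompt as T`.

```python
import ast

GUARDED = {"TrustedPrompt", "FencedPromptBody"}


def find_minting_sites(source: str, module_name: str) -> list[tuple[str, int, str]]:
    """Return (module, line, type name) for each direct call to a guarded constructor."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):         # TrustedPrompt(...)
            name = func.id
        elif isinstance(func, ast.Attribute):  # prompts.TrustedPrompt(...)
            name = func.attr
        else:
            continue
        if name in GUARDED:
            sites.append((module_name, node.lineno, name))
    return sites


offending = "from prompts import TrustedPrompt\np = TrustedPrompt(b'raw')\n"
annotation_only = "def f(p: 'TrustedPrompt') -> None:\n    pass\n"

print(find_minting_sites(offending, "rogue.py"))     # [('rogue.py', 2, 'TrustedPrompt')]
print(find_minting_sites(annotation_only, "ok.py"))  # []
```

A test along these lines would run the helper over every module except the one defining `PromptBuilder` and assert that the combined result is empty.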
Reversibility
Low. Flipping scan and truncate orderings is a one-line edit but reintroduces the bypass; would require a Phase-4 ADR amendment with explicit reasoning. Removing the fence entirely (no untrusted-byte protection) would be a load-bearing security regression. Adding new source kinds is additive (one row in the truncation table + one canary corpus pass). Replacing the denylist with an ML classifier (Phase 13+) lands behind the same CanaryGuard.scan Protocol — adapter swap.
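The adapter swap mentioned above can be illustrated with `typing.Protocol`. Everything in this sketch is hypothetical: the `Scanner` protocol, the string finding ids, and `ThresholdScanner` (a stand-in for a future classifier whose scoring function is injected) are assumptions, since the ADR does not spell out the Protocol's signature beyond `scan(payload, nonce)`.

```python
from typing import Callable, Optional, Protocol

REDACTED = b"<<redacted: canary collision>>"


class Scanner(Protocol):
    def scan(self, payload: bytes, nonce: str) -> Optional[str]:
        """Return an identifier for the finding, or None when clean."""
        ...


class DenylistScanner:
    def __init__(self, patterns: tuple[bytes, ...]) -> None:
        self._patterns = patterns

    def scan(self, payload: bytes, nonce: str) -> Optional[str]:
        if nonce.encode() in payload:
            return "nonce-collision"
        for index, pattern in enumerate(self._patterns):
            if pattern in payload:
                return f"pattern-{index}"
        return None


class ThresholdScanner:
    """Stand-in for a classifier-backed scanner; the score function is injected."""

    def __init__(self, score: Callable[[bytes], float], threshold: float) -> None:
        self._score = score
        self._threshold = threshold

    def scan(self, payload: bytes, nonce: str) -> Optional[str]:
        if nonce.encode() in payload:
            return "nonce-collision"
        return "classifier" if self._score(payload) >= self._threshold else None


def guard(scanner: Scanner, payload: bytes, nonce: str) -> bytes:
    """Caller code depends only on the protocol, so either adapter fits."""
    return REDACTED if scanner.scan(payload, nonce) is not None else payload
```

Because `guard` is typed against the protocol, swapping `DenylistScanner` for another implementation leaves call sites, and the scan-then-truncate ordering around them, unchanged.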
Evidence / sources
- ../final-design.md §Component 3 — FenceWrapper + CanaryGuard (critic fix on scan-before-truncate)
- ../phase-arch-design.md §Component 3 — FenceWrapper + CanaryGuard
- ../phase-arch-design.md §Design patterns applied, row 5
- ../critique.md §"[S] §5" (scan-after-truncate hole)
- production ADR-0033 (newtype + smart constructor + sum type discipline)