Skip to content

ADR-0019: Sandbox stack (Firecracker / gVisor / nested QEMU)

Status: Deferred Date: 2026-05-11 Tags: platform · sandbox Related: ADR-0012

Context

ADR-0012 commits to microVM isolation for Trust-Aware gate evaluations. The specific microVM stack is deferred because the choice depends on workload characteristics we won't know definitively until production.

The three candidates have different cost profiles:

  • Firecracker (AWS's microVM, used in Lambda and Fargate). Hardware-virtualized, KVM-backed, sub-100ms cold start. Used by major production systems for exactly this workload pattern. Linux-only guest. Limited filesystem features compared to a full VM.
  • gVisor (Google's user-space kernel). Strong isolation via system-call interception. Lower overhead than a true microVM in some workloads, higher in others. Better Linux compatibility than Firecracker. Slightly weaker isolation boundary than Firecracker's hardware virtualization.
  • Nested QEMU. A full VM inside another VM. Maximum compatibility, slowest cold start, highest overhead. Useful when guest needs unusual kernel features.

Options considered

  • Firecracker. Best for high-volume, fast-cold-start, Linux-only sandbox workloads.
  • gVisor. Best when guest needs broader syscall surface than Firecracker provides (some strace/eBPF setups, exotic filesystem operations).
  • Nested QEMU. Best when guest needs to run a different OS or specific kernel features.
  • Multiple stacks in parallel — route gate workloads to the appropriate stack by tag. Operational complexity in exchange for flexibility.

Default until decided

No default committed. During POC and pre-production, gates can run in Docker-with-seccomp on a dedicated sandbox host (acceptable risk for non-production workloads). Production rollout requires this ADR to be resolved.

Evidence needed to resolve

  • Cold-start latency tolerance. How often do gates fire? If thousands per minute, sub-100ms cold start matters; if dozens per minute, seconds-of-cold-start is fine.
  • Kernel feature requirements inside the sandbox. The gate runs docker build, strace, possibly eBPF tools, possibly Vagrant. Does each stack support what we need?
  • Operational experience. Firecracker requires KVM-capable hosts; gVisor runs anywhere; nested QEMU works but is operationally complex.
  • Cost per evaluation. Sandbox lifecycle cost (compute + storage churn) dominates non-LLM cost at portfolio scale. Cheaper-per-evaluation wins.
  • Compliance posture. Some compliance frameworks treat hardware-isolated (Firecracker) and user-space-isolated (gVisor) sandboxes differently for "execute untrusted code" categories.

Reversibility (of the eventual choice)

Medium. The sandbox/ package wraps the stack behind a stable RPC contract (per ADR-0012). Replacing one stack with another is a localized change to the sandbox client; gate logic is unaffected. But: migration during production has downtime risk.

Evidence / sources

  • ../design.md §5 (Sandboxed reality checks subsection — explicit deferral)
  • ../design.md §7 (Open questions — Sandbox stack)
  • ../design.md §8.4 (physical view — sandbox cluster as separate trust boundary)
  • Firecracker public case studies — AWS Lambda, Fargate
  • gVisor production usage at Google Cloud
  • OpenHands V1 architecture — Docker isolation pattern, cited as reference for upgrading to microVM