nested QEMU)

Status: Deferred
Date: 2026-05-11
Tags: platform · sandbox
Related: ADR-0012

Context

ADR-0012 commits to microVM isolation for Trust-Aware gate evaluations. The specific microVM stack is deferred because the choice depends on workload characteristics we won't know definitively until production.

The three candidates have different cost profiles:

Firecracker (AWS's microVM monitor, used in Lambda and Fargate). Hardware-virtualized and KVM-backed, with cold starts on the order of 100 ms (the project advertises as little as 125 ms to guest user space). Used by major production systems for exactly this workload pattern. Linux-only guest. Limited filesystem features compared to a full VM.
gVisor (Google's user-space kernel). Strong isolation via system-call interception. Lower overhead than a true microVM in some workloads, higher in others. Runs on a wider range of hosts than Firecracker because it does not require KVM, but the guest sees gVisor's reimplementation of the Linux syscall interface rather than a real Linux kernel. Slightly weaker isolation boundary than Firecracker's hardware virtualization.
Nested QEMU. A full VM inside another VM. Maximum compatibility, slowest cold start, highest overhead. Useful when the guest needs unusual kernel features.

Options considered

Firecracker. Best for high-volume, fast-cold-start, Linux-only sandbox workloads.
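gVisor. Best when sandbox hosts cannot expose KVM and the gate's tooling stays within the syscall surface gVisor implements. Some strace/eBPF setups and exotic filesystem operations may fall outside that surface and need testing before committing.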
Nested QEMU. Best when the guest needs to run a different OS or depends on specific kernel features.
Multiple stacks in parallel. Route gate workloads to the appropriate stack by tag. Trades operational complexity for flexibility.
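
The "multiple stacks in parallel" option could be sketched roughly as follows. This is a minimal illustration, not a proposed design: the `sandbox:` tag prefix, the stack names, and the `GateWorkload` and `route` names are all hypothetical. Because this ADR commits no default, the fallback stack is a required argument rather than a built-in choice.

```python
from dataclasses import dataclass, field

# Hypothetical stack names and tag prefix; none of these come from the ADR.
KNOWN_STACKS = {"firecracker", "gvisor", "nested-qemu"}
TAG_PREFIX = "sandbox:"


@dataclass(frozen=True)
class GateWorkload:
    name: str
    tags: frozenset = field(default_factory=frozenset)


def route(workload: GateWorkload, fallback: str) -> str:
    """Return the stack named by the workload's sandbox tag, else the fallback."""
    requested = sorted(
        tag[len(TAG_PREFIX):] for tag in workload.tags if tag.startswith(TAG_PREFIX)
    )
    if len(requested) > 1:
        # Two sandbox tags on one workload is treated as a configuration error.
        raise ValueError(f"{workload.name}: conflicting sandbox tags {requested}")
    stack = requested[0] if requested else fallback
    if stack not in KNOWN_STACKS:
        raise ValueError(f"{workload.name}: unknown sandbox stack {stack!r}")
    return stack


if __name__ == "__main__":
    build_gate = GateWorkload("build-check", frozenset({"sandbox:gvisor", "team:platform"}))
    print(route(build_gate, fallback="firecracker"))
```

Rejecting conflicting or unknown tags, instead of silently falling back, keeps a mistagged gate from landing on a weaker isolation boundary than intended.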

Default until decided

No default committed. During POC and pre-production, gates can run in Docker-with-seccomp on a dedicated sandbox host (acceptable risk for non-production workloads). Production rollout requires this ADR to be resolved.
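
As an illustration of the interim Docker-with-seccomp setup, the sketch below assembles a `docker run` command line. Only the seccomp profile is implied by this ADR; the other hardening flags, the image name, and the profile path are assumptions. The strict settings may need loosening for gates that run docker build or strace.

```python
import shlex


def docker_sandbox_argv(image, command, seccomp_profile):
    """Build a `docker run` argv for a single gate evaluation."""
    return [
        "docker", "run", "--rm",
        # Apply a custom seccomp profile (path is a placeholder).
        "--security-opt", f"seccomp={seccomp_profile}",
        # Assumed extra hardening, not specified by the ADR:
        "--security-opt", "no-new-privileges",
        "--cap-drop", "ALL",
        "--network", "none",
        "--read-only",
        "--pids-limit", "256",
        image, *command,
    ]


if __name__ == "__main__":
    argv = docker_sandbox_argv(
        "gate-image:dev",                      # hypothetical image name
        ["strace", "-V"],
        "/etc/sandbox/gate-seccomp.json",      # hypothetical profile path
    )
    print(shlex.join(argv))
```

Returning an argv list rather than a shell string lets the caller pass it straight to `subprocess.run` without shell quoting concerns.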

Evidence needed to resolve

Cold-start latency tolerance. How often do gates fire? If thousands per minute, sub-100ms cold start matters; if dozens per minute, a cold start of a few seconds is fine.
Kernel feature requirements inside the sandbox. The gate runs docker build, strace, possibly eBPF tools, possibly Vagrant. Does each stack support what we need?
Operational experience. Firecracker requires KVM-capable hosts; gVisor runs on Linux hosts with or without KVM; nested QEMU works but is operationally complex.
Cost per evaluation. Sandbox lifecycle cost (compute + storage churn) dominates non-LLM cost at portfolio scale. Cheaper-per-evaluation wins.
Compliance posture. Some compliance frameworks treat hardware-isolated (Firecracker) and user-space-isolated (gVisor) sandboxes differently for "execute untrusted code" categories.
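
The cold-start and cost questions above can be made concrete with a back-of-the-envelope calculation. The sketch below applies Little's law (average concurrency = arrival rate × time in system) to estimate how many sandboxes are alive at once and what share of each sandbox's lifetime is cold start. The function name and all input numbers are illustrative placeholders, not measurements.

```python
def sandbox_load(gates_per_minute, cold_start_s, run_s):
    """Estimate steady-state sandbox load for a given gate firing rate."""
    arrival_per_s = gates_per_minute / 60.0
    lifetime_s = cold_start_s + run_s
    return {
        # Little's law: average number of sandboxes alive at any moment.
        "concurrent_sandboxes": arrival_per_s * lifetime_s,
        # Fraction of each sandbox's lifetime spent starting up.
        "cold_start_share": cold_start_s / lifetime_s,
        # Sandbox-seconds consumed per evaluation, a proxy for compute cost.
        "sandbox_seconds_per_eval": lifetime_s,
    }


if __name__ == "__main__":
    # Placeholder scenarios: (gates per minute, cold start in s, gate run time in s).
    for scenario in [(3000, 0.1, 2.0), (3000, 5.0, 2.0), (30, 5.0, 2.0)]:
        load = sandbox_load(*scenario)
        print(scenario, {k: round(v, 3) for k, v in load.items()})
```

With these placeholder inputs, 3000 gates per minute needs about 105 concurrent sandboxes at a 0.1 s cold start and about 350 at a 5 s cold start. At 30 gates per minute, the same 5 s cold start needs only about 3.5.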

Reversibility (of the eventual choice)

Medium. The sandbox/ package wraps the stack behind a stable RPC contract (per ADR-0012). Replacing one stack with another is a localized change to the sandbox client; gate logic is unaffected. However, migrating stacks in production carries downtime risk.
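
The reason a swap stays localized can be shown with a small sketch. The actual contract is defined in ADR-0012 and is not reproduced here. The in-process interface below is a hypothetical stand-in for that RPC boundary, and every name in it (`SandboxBackend`, `SandboxResult`, `evaluate_gate`, the fake backends) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SandboxResult:
    exit_code: int
    stdout: str


class SandboxBackend(Protocol):
    """Hypothetical stand-in for the stable contract the sandbox client exposes."""

    def run(self, command: list[str], timeout_s: float) -> SandboxResult: ...


class FakeFirecrackerBackend:
    # Placeholder: a real backend would boot a microVM and run the command in it.
    def run(self, command: list[str], timeout_s: float) -> SandboxResult:
        return SandboxResult(0, f"[firecracker] {' '.join(command)}")


class FakeGVisorBackend:
    # Placeholder: a real backend would start a gVisor-sandboxed container.
    def run(self, command: list[str], timeout_s: float) -> SandboxResult:
        return SandboxResult(0, f"[gvisor] {' '.join(command)}")


def evaluate_gate(backend: SandboxBackend, command: list[str]) -> bool:
    """Gate logic depends only on the contract, never on a concrete stack."""
    return backend.run(command, timeout_s=30.0).exit_code == 0


if __name__ == "__main__":
    for backend in (FakeFirecrackerBackend(), FakeGVisorBackend()):
        print(type(backend).__name__, evaluate_gate(backend, ["true"]))
```

Under this shape, changing stacks means constructing a different backend at one call site. The code itself cannot remove the downtime risk: in-flight evaluations and warm sandbox pools still have to be drained or migrated.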

Evidence / sources

../design.md §5 (Sandboxed reality checks subsection — explicit deferral)
../design.md §7 (Open questions — Sandbox stack)
../design.md §8.4 (physical view — sandbox cluster as separate trust boundary)
Firecracker public case studies — AWS Lambda, Fargate
gVisor production usage at Google Cloud
OpenHands V1 architecture — Docker isolation pattern, cited as reference for upgrading to microVM