Your agent just ran pip install on code it wrote for itself. Nobody read that code, and it has to execute somewhere: on a real machine, with real isolation. And agents brought a requirement that batch computing never had: they spawn subagents mid-task and merge the results back, so the machine itself has to fork where the work forks.
We build Mitos on Kubernetes, deliberately. We have spent years operating Kubernetes, and it is the de facto standard for running workloads at scale in the cloud: the scheduling, quotas, and network policy a sandbox fleet needs already exist there, hardened by a decade of production. Our bet is simple: scaling AI means scaling sandboxes, and sandboxes should run on infrastructure that already scales instead of a parallel stack you have to learn to trust. The catch is that Kubernetes’s unit of work, the pod, was designed for code you trust.
So this is a working answer to secure AI sandboxes on Kubernetes: what a pod actually gives you, which runtime holds against model-written code, what a fork of a running machine costs, and how we made microVMs behave like ordinary pods. By the end you should be able to pick a runtime for your own threat model and fan one warm machine out into thirty.
Model-written code is untrusted code
An LLM cannot tell your instructions from an attacker’s. Simon Willison’s lethal trifecta names the failure mode: private data, untrusted content, and a way to talk out. Combine the three and an injected instruction exfiltrates whatever the agent can read. Give the agent a shell and the injection becomes code execution.
You do not have to take this on faith. Trail of Bits showed in October 2025 that command allowlisting without a sandbox fails: attackers smuggle flags into pre-approved commands like go test -exec or fd -x and get execution from one poisoned prompt. In July 2025 a Replit agent deleted a production database during a code freeze. A month later the Nx supply-chain attack (s1ngularity) flipped the direction: malicious npm packages invoked whatever AI CLIs were installed and used them to hunt credentials. GitGuardian counted 2,349 secrets from 1,079 repositories.
The tool vendors already conceded the point. Anthropic ships OS-level sandboxing for Claude Code and reports it cut permission prompts by 84 percent. OpenAI’s Codex runs with network off by default.
Those protect one laptop. Host agents for other people, or fan one agent out into fifty, and you are running a hostile multi-tenant service. The question stops being whether to sandbox. It becomes: which boundary are you willing to rent out?
A pod is a view of your kernel
The default answer on Kubernetes is a pod per agent, so start there and look at what you actually get.
Kubernetes does not run containers. The kubelet calls a runtime over the Container Runtime Interface, usually containerd. containerd starts a shim per pod. The shim invokes runc, and runc builds the thing we call a container: namespaces for the restricted view, cgroups for limits, seccomp to trim syscalls, then execve into your process.
Notice what never happened in that chain. Nothing put a wall between your process and the kernel. The process sees less, but it still talks to the same kernel as every other pod on the node, through an interface of several hundred syscalls. A namespace is a view, not a wall.
That interface has a record. CVE-2019-5736 let a container overwrite the host runc binary through /proc/self/exe. CVE-2024-21626, the Leaky Vessels file-descriptor leak, scored 8.6 and escaped to the host filesystem. In November 2025, three more runc escapes landed on the same day, all breaking out through procfs writes.
None of this makes containers bad engineering. For code you wrote and reviewed, a hardened container is a reasonable boundary, and most of the internet runs on one. The Kubernetes multi-tenancy docs are honest here too: isolation is a spectrum, and once tenants stop trusting each other, the docs point you at sandboxed runtimes.
Agent code fails the trust test by construction. You did not write it. You did not review it. And an attacker may have steered the model that did.
gVisor, Kata, Firecracker: three ways to raise the wall
Kubernetes has a pluggable seam for exactly this problem. A RuntimeClass maps spec.runtimeClassName to a different shim, so the same pod spec can land on a different isolation mechanism. Three runtimes matter.
gVisor puts a user-space kernel between your workload and the real one. Syscalls get intercepted and re-implemented in Go; the design goal, in gVisor’s own words, is that no syscall passes through directly. It starts in tens of milliseconds, and it backs GKE Sandbox. The price is syscall-shaped: most workloads pay a few percent, syscall-heavy ones can pay multiples. And the wall is made of software.
Kata Containers puts every pod in a real VM with its own guest kernel under KVM. Stronger wall, heavier price: third-party benchmarks put it around 600 ms of added start latency and on the order of 180 MiB per pod with the default VMMs. Approximate numbers, but the shape is right.
Then there is Firecracker, the microVM AWS built to run Lambda. Its authors report about 50 thousand lines of Rust, 96 percent less code than QEMU. It emulates almost nothing: virtio-net, virtio-block, vsock, a serial console, and not much else. The published spec commits to under 125 ms from API call to guest user space and under 5 MiB of overhead per microVM. The NSDI paper measured roughly 3 MB of overhead where QEMU took 131 MB, and reports creation rates up to 150 microVMs per second per host.
A hardware wall at almost container prices. That is why so much of the sandbox category runs on it.
| Runtime | Boundary | Added start cost | Per-sandbox overhead | Kubernetes path | Forks a running sandbox |
|---|---|---|---|---|---|
| runc | shared host kernel | ~0 | ~0 | native | no |
| gVisor | user-space kernel + seccomp | tens of ms | Sentry per pod | RuntimeClass | no |
| Kata (QEMU/CLH) | guest kernel under KVM | ~600 ms* | ~180 MiB* | RuntimeClass shim | no |
| Firecracker | guest kernel under KVM, minimal devices | <125 ms boot, ms-range restore | <5 MiB VMM | Kata-fc, firecracker-containerd, or Mitos | with Mitos, via CoW |
*Third-party benchmark figures, approximate. Firecracker figures are from its published specification. Morph and CodeSandbox also fork running VMs, on their own clouds rather than on Kubernetes.
Here is the decision, compressed to three questions you can paste into a design doc:
- Who wrote the code? You, and you reviewed it: runc is fine, spend the savings elsewhere. A model wrote it: keep going.
- Does it mostly call libraries, or does it shell out and install packages? Mostly libraries, one sandbox at a time: gVisor will probably hold, and we would rather tell you that than pretend otherwise. Full OS behavior: keep going.
- Do many agents need to start from one warm state? No: Kata or any microVM runtime ends the analysis here. Yes: you need a fork, and the rest of this post is about what that takes.
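The three questions above can be read as a short decision function. This is a minimal sketch for a design doc, not an API: the function name, the parameter names, and the returned labels are all made up for illustration.

```python
def pick_runtime(model_written: bool, needs_full_os: bool, needs_fork: bool) -> str:
    """Walk the three questions in order and stop at the first one that settles it."""
    if not model_written:
        # You wrote and reviewed the code: a hardened container is a reasonable boundary.
        return "runc"
    if not needs_full_os:
        # Mostly library calls, one sandbox at a time.
        return "gvisor"
    if not needs_fork:
        # Full OS behavior, but every sandbox starts on its own.
        return "kata-or-microvm"
    # Many agents need to start from one warm state.
    return "forkable-microvm"


if __name__ == "__main__":
    print(pick_runtime(model_written=True, needs_full_os=True, needs_fork=True))
```

The order matters: each question is only worth asking if the previous answer did not already end the analysis.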
Opting in looks like one field:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc   # containerd-shim-kata-v2 + Firecracker
---
apiVersion: v1
kind: Pod
spec:
  runtimeClassName: kata-fc
```

One field, and every pod in the manifest gets its own kernel. If that were the whole story, the post would end here.
Firecracker deleted everything Kubernetes assumes
Everything Firecracker removed to get small is something Kubernetes quietly depends on.
No virtio-fs means no shared filesystems, so the container rootfs has to arrive as a block device, which forces containerd onto the devmapper snapshotter that Kata’s docs require for the Firecracker VMM. No device hotplug and no VFIO means Kata-with-Firecracker cannot resize a running container or pass through a GPU. Memory is sized up front, against a scheduler built for elastic requests. Networking is a tap device per VM, bridged to the pod’s veth by an extra CNI plugin. And firecracker-containerd, the direct integration, still lists CRI conformance under future goals.
Upstream Kubernetes answered in late 2025 with agent-sandbox, a SIG Apps project that wraps Sandbox CRDs, templates, and warm pools around gVisor or Kata. The warm pools exist because, as the Kubernetes blog puts it, starting a new pod adds about a second. If your agents run one sandbox at a time and gVisor clears your threat model, take that path; it is standard and well maintained, and this is not the post that talks you out of it.
But read what it standardizes: one sandbox per agent, created fresh or handed over pre-warmed. Nothing in that stack forks a running machine.
Once your agents multiply, that is the operation you will miss. Here is why.
Copy-on-write changes what a copy costs
The cheapest way to copy a machine is to not copy it. fork() has worked this way for forty years: the child shares the parent’s physical pages read-only, and the kernel copies a page only when someone writes it. A child costs what it changes.
Firecracker snapshots extend that to whole machines. A snapshot is guest memory plus device state in two files. On restore, Firecracker maps the memory file MAP_PRIVATE, so pages load on demand from the host page cache, and every VM restored from that snapshot reads the same physical pages until it writes them.
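The same mechanism is visible from user space with an ordinary file. The sketch below is a stand-in, not Firecracker code: it maps one small file twice with `MAP_PRIVATE` through Python's `mmap` module (Unix only), writes through one mapping, and shows that neither the other mapping nor the file changes. The helper names and the four-page file are assumptions for the demo.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def restore_private(path: str) -> mmap.mmap:
    """Map a file copy-on-write: reads come from the page cache, writes stay private."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)


def demo() -> tuple[bytes, bytes, bytes]:
    # Stand-in for a snapshot memory file: four pages filled with 'A'.
    with tempfile.NamedTemporaryFile(delete=False) as snap:
        snap.write(b"A" * PAGE * 4)
        path = snap.name
    try:
        vm1, vm2 = restore_private(path), restore_private(path)
        vm1[0:5] = b"fork1"  # dirties one page; only vm1 gets a private copy of it
        with open(path, "rb") as f:
            on_disk = f.read(5)
        result = (vm1[0:5], vm2[0:5], on_disk)
        vm1.close()
        vm2.close()
        return result
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The write lands in the first mapping only. The second mapping and the file on disk still hold the original bytes, which is the property that lets many restores share one page set.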
Run the arithmetic once and it stops feeling like an optimization. Thirty-two forks of a warm 512 MiB Python environment is 16 GiB in naive accounting. Physically, at fork time, it is one 512 MiB page set plus about 3 MiB of private pages per daughter: under 0.7 GiB. The daughters grow as they dirty pages, so that is a floor, not a steady state. But you just provisioned a fleet for the memory of a machine and a half.
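The arithmetic is short enough to check by hand or in a few lines. A sketch using the figures from the paragraph above (32 forks, a 512 MiB image, about 3 MiB private per daughter at fork time); the function names are illustrative.

```python
MIB_PER_GIB = 1024


def naive_gib(forks: int, image_mib: float) -> float:
    """Every fork billed for a full copy of the image."""
    return forks * image_mib / MIB_PER_GIB


def physical_gib(forks: int, image_mib: float, private_mib: float) -> float:
    """One shared page set plus each fork's private pages at fork time."""
    return (image_mib + forks * private_mib) / MIB_PER_GIB


if __name__ == "__main__":
    print(naive_gib(32, 512))        # 16.0
    print(physical_gib(32, 512, 3))  # 0.59375
```

The second figure is the floor at fork time. It rises as each daughter dirties pages.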
This is not a lab trick. Lambda SnapStart resumes functions from snapshots, and Aurora DSQL runs each SQL transaction in its own cloned Firecracker VM, sharing unchanged pages across clones. On our reference hardware, the Mitos engine restores a snapshot in 6 to 16 ms and activates a warm fork in about 27 ms, at about 3 MiB of marginal memory per fork. These are engine measurements, reproducible from the benchmark scripts in the repo; what they do and do not include is spelled out on the benchmarks page.
One distinction will save you real confusion when you read vendor pages. Disk copy-on-write (overlayfs layers, qcow2 backing files, thin volumes) clones a filesystem, and plenty of products call a golden image a snapshot. A machine snapshot includes live memory and running processes. The first buys you a cheap clean boot. The second buys you a cheap copy of a machine mid-task, which is the one your agent wants after it spent 40 seconds installing dependencies and loading a model.
Restore a snapshot twice and you have a security bug
This is the part the sandbox write-ups skip, and it is the reason forking is a security mechanism before it is a speed trick.
Firecracker’s own docs warn that when one guest state is resumed more than once, “guest information assumed to be unique may in fact not be.” Think about what lives in that memory image: the kernel’s entropy pool, userspace PRNG state, TCP sequence numbers, machine IDs, cached TLS session keys. Twenty naive restores means twenty guests that can mint colliding UUIDs, session tokens, and nonces.
For code you already assume is adversarial, that is a gift with a bow on it.
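The collision is easy to reproduce in miniature. The sketch below uses Python's `random.Random` as a stand-in for userspace PRNG state inside a guest and pickles it as a stand-in for a snapshot. It is an analogy, not a model of Firecracker or of the kernel CSPRNG, and every name in it is illustrative.

```python
import os
import pickle
import random
import uuid


def snapshot(rng: random.Random) -> bytes:
    """Freeze the generator state, as a memory snapshot would."""
    return pickle.dumps(rng.getstate())


def restore(blob: bytes) -> random.Random:
    """Bring up a new generator from the frozen state."""
    rng = random.Random()
    rng.setstate(pickle.loads(blob))
    return rng


def mint_uuid(rng: random.Random) -> uuid.UUID:
    return uuid.UUID(int=rng.getrandbits(128), version=4)


def clone_tokens(blob: bytes, clones: int, reseed: bool) -> set[uuid.UUID]:
    """Restore the same state `clones` times and collect each clone's first UUID."""
    tokens = set()
    for _ in range(clones):
        rng = restore(blob)
        if reseed:
            rng.seed(os.urandom(32))  # fresh entropy per clone
        tokens.add(mint_uuid(rng))
    return tokens


if __name__ == "__main__":
    blob = snapshot(random.Random())
    print(len(clone_tokens(blob, 20, reseed=False)))
    print(len(clone_tokens(blob, 20, reseed=True)))
```

Without the reseed, twenty restores produce a single distinct UUID. With it, the clones diverge.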
The kernel half has a fix: Firecracker writes a fresh VMGenID on resume and Linux 5.18+ reseeds its CSPRNG from it, though the docs admit a race window before the reseed lands. The userspace half has no general fix. The same doc says so outright. Network identity duplicates too: every clone wakes up with the parent’s MAC and IP.
So a correct fork is a protocol, not a file copy. When Mitos activates a daughter it pushes fresh entropy into the guest, waits for the guest agent to confirm the RNG reseed, steps the wall clock off the frozen snapshot time, re-addresses the NIC, and delivers per-daughter secrets. It fails closed: a clone that never confirmed its reseed is never served. Details in the fork-correctness doc.
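The ordering and the fail-closed rule can be sketched in a few lines. This is not the Mitos implementation: `FakeGuest`, its methods, and `activate` are hypothetical stand-ins that only record which steps ran, so the control flow is visible.

```python
import os
from dataclasses import dataclass, field


class ReseedNotConfirmed(Exception):
    """Raised when a clone never acknowledges its RNG reseed."""


@dataclass
class FakeGuest:
    """Hypothetical guest agent; `acks_reseed` simulates whether it confirms."""
    acks_reseed: bool = True
    steps: list[str] = field(default_factory=list)

    def push_entropy(self, seed: bytes) -> None:
        self.steps.append("entropy")

    def reseed_confirmed(self) -> bool:
        return self.acks_reseed

    def step_clock(self, now: float) -> None:
        self.steps.append("clock")

    def readdress(self, mac: str, ip: str) -> None:
        self.steps.append("network")

    def deliver_secrets(self, secrets: dict[str, str]) -> None:
        self.steps.append("secrets")


def activate(guest: FakeGuest, *, now: float, mac: str, ip: str,
             secrets: dict[str, str]) -> FakeGuest:
    """Run the uniqueness steps in order; refuse to serve an unconfirmed clone."""
    guest.push_entropy(os.urandom(32))
    if not guest.reseed_confirmed():
        # Fail closed: nothing after this point runs for this clone.
        raise ReseedNotConfirmed("clone did not confirm reseed; refusing to serve it")
    guest.step_clock(now)
    guest.readdress(mac, ip)
    guest.deliver_secrets(secrets)
    return guest
```

The design choice worth copying is that the reseed check sits before the clock, network, and secrets steps, so a clone that fails it never receives an address or a secret.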
You can now interrogate any sandbox vendor with three questions. Does “snapshot” include memory and running processes, or is it a golden image? What reseeds the guest RNG on restore, and is that verified or assumed? What happens to the MAC, the IP, and the machine-id on the twentieth clone? The last Show HN VM-forking demo got asked question two by the commenters. The answer was: it’s on the roadmap.
Agent swarms multiply everything
A single coding agent cares about interactive latency. A swarm changes the shape of the whole problem.
Look at where swarms actually come from. Every serious harness now spawns subagents mid-task: Claude Code fans work out to parallel subagents, OpenClaw sessions spawn worker sessions, and Steve Yegge’s Gas Town runs a whole town of coding agents with a mayor handing out work. Underneath, the pattern is fork and join: a subagent is born from the parent’s mid-task state, the repo cloned, the dependencies installed, the plan half-executed. It does its piece and the result merges back. The work forks, so the environment has to fork with it, and today most stacks fake that fork by rebuilding a fresh sandbox and replaying setup.
The other pressure is parallel attempts, and those numbers are not subtle. The Large Language Monkeys paper took SWE-bench Lite from 15.9 percent solved with one attempt to 56 percent with 250 parallel attempts. Same model. Anthropic reports Sonnet 4.5 going from 77.2 to 82 percent on SWE-bench Verified with parallel test-time compute, and its multi-agent research system outperformed a single-agent setup by 90.2 percent on an internal research eval while spending about 15 times the tokens. On the training side, Qwen3-Coder ran 20,000 parallel environments for agentic RL.
Every one of those parallel attempts wants the same starting state.
Now price the rebuild. Cloudflare’s own numbers: booting a sandbox, cloning a repo, and running npm install takes 30 seconds. Restoring the same state from a filesystem backup takes 2, and that is disk state alone. Multiply the gap by N attempts, then by every node of a tree search, where each branch point wants a copy of the environment mid-task rather than a replay of setup.
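To put a rough number on that multiplication, here is a sketch that counts the environment copies a tree search needs and prices them in aggregate sandbox-seconds. The 30 s and 2 s figures are the Cloudflare numbers quoted above; the branching factor and depth are arbitrary assumptions, and the totals are summed setup time, not wall-clock time, since attempts can run in parallel.

```python
def tree_nodes(branching: int, depth: int) -> int:
    """Nodes in a full tree: the root plus `branching` children per node, `depth` levels down."""
    return sum(branching ** level for level in range(depth + 1))


def setup_seconds(copies: int, per_copy_seconds: float) -> float:
    """Aggregate setup time if every copy pays the same per-copy cost."""
    return copies * per_copy_seconds


if __name__ == "__main__":
    copies = tree_nodes(branching=3, depth=4)
    print(copies)                       # 121
    print(setup_seconds(copies, 30.0))  # 3630.0
    print(setup_seconds(copies, 2.0))   # 242.0
```

Even a small tree turns a 28-second gap per copy into close to an hour of summed setup time.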
One sandbox per agent was the right pattern when agents came one at a time. At swarm scale, the primitive you want is division. Warm one machine up, then divide it.
Husk pods: the microVM becomes a pod
Our first engine did the obvious thing, and we paid full tuition for it.
In the original mode, a per-node daemon called forkd launched Firecracker processes directly, beside the cluster. It worked. It also meant every VM was invisible to Kubernetes: the scheduler learned about capacity through our own heartbeats, quotas never saw the memory, and the VM taps lived in the host network namespace where no NetworkPolicy can reach. So we built a bespoke per-tap nftables engine and a DNS proxy to police egress ourselves. Somewhere around the anti-spoof rules it sank in that we were rebuilding Kubernetes features next to Kubernetes.
Husk pods are the admission that this was backwards. Instead of running VMs beside the cluster, the sandbox VM runs inside a pod, and the pod stays unprivileged.
The trick is splitting build from run. A builder (forkd, non-privileged since ADR 0008, with an explicit capability set) boots each pool’s template once, runs its init, takes a Firecracker snapshot. Running is the cheap, unprivileged part: a husk pod is a minimal pre-scheduled pod holding a dormant VMM with no VM loaded. Scheduling, admission, netns, cgroup: all paid before any claim arrives. When an agent claims a sandbox, the controller picks a dormant husk under an optimistic lock (two racing claims cannot land on one pod), activates it over mTLS, and the stub restores the snapshot in place, inside the pod’s own cgroup and network namespace, running the reseed handshake from the previous section before it reports ready.
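The optimistic lock in the claim step can be illustrated with an in-memory model. This is a sketch of the general compare-and-swap pattern, the same idea Kubernetes exposes through `resourceVersion` conflicts on updates. It is not the Mitos controller; `HuskStore`, `Husk`, and `claim` are hypothetical names.

```python
import threading
from dataclasses import dataclass
from typing import Optional


class Conflict(Exception):
    """The husk changed between the read and the write."""


@dataclass
class Husk:
    name: str
    version: int = 1
    owner: Optional[str] = None


class HuskStore:
    def __init__(self, names: list[str]) -> None:
        self._mu = threading.Lock()  # models the store's atomic write
        self._husks = {n: Husk(n) for n in names}

    def dormant(self) -> list[tuple[str, int]]:
        """Return (name, version) for every unclaimed husk."""
        with self._mu:
            return [(h.name, h.version) for h in self._husks.values() if h.owner is None]

    def claim_if_unchanged(self, name: str, seen_version: int, owner: str) -> None:
        """Write the claim only if nobody has written since `seen_version` was read."""
        with self._mu:
            husk = self._husks[name]
            if husk.version != seen_version:
                raise Conflict(name)
            husk.owner = owner
            husk.version += 1


def claim(store: HuskStore, owner: str) -> Optional[str]:
    """Pick a dormant husk; on a lost race, re-read and try again."""
    while True:
        candidates = store.dormant()
        if not candidates:
            return None
        name, seen = candidates[0]
        try:
            store.claim_if_unchanged(name, seen, owner)
            return name
        except Conflict:
            continue
```

Two claimers that read the same version can both attempt the write, but only the first succeeds. The loser gets a conflict and picks again instead of sharing the pod.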
How does a pod get /dev/kvm without privileged: true? A device plugin. The pod requests a mitos.run/kvm resource, the kubelet injects the device, and the container drops every capability except one scoped NET_ADMIN (which powers the in-pod egress filter below) and runs the default seccomp profile. Against PodSecurity restricted it carries three documented exceptions: a read-only snapshot hostPath, root to open /dev/kvm, and that NET_ADMIN. Each one is written down in an ADR instead of waved through.
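As a rough picture of what that adds up to, the sketch below builds a pod manifest as a Python dict. The Kubernetes field names are standard and the `mitos.run/kvm` resource name comes from the paragraph above; the pod name, container name, image, and paths are placeholders, and the real husk pod spec may differ.

```python
def husk_pod(image: str, snapshot_dir: str) -> dict:
    """Build an illustrative pod manifest: KVM via device plugin, no privileged flag."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "husk-example"},  # placeholder name
        "spec": {
            "containers": [{
                "name": "vmm",  # placeholder name
                "image": image,
                # Extended resources from a device plugin are requested under limits.
                "resources": {"limits": {"mitos.run/kvm": "1"}},
                "securityContext": {
                    "privileged": False,
                    "runAsUser": 0,  # root, to open /dev/kvm
                    "capabilities": {"drop": ["ALL"], "add": ["NET_ADMIN"]},
                    "seccompProfile": {"type": "RuntimeDefault"},
                },
                "volumeMounts": [{
                    "name": "snapshot",
                    "mountPath": "/snapshot",  # placeholder path
                    "readOnly": True,
                }],
            }],
            "volumes": [{"name": "snapshot", "hostPath": {"path": snapshot_dir}}],
        },
    }
```

The three exceptions against PodSecurity restricted are each visible as one field: `runAsUser: 0`, the added `NET_ADMIN`, and the `hostPath` volume.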
Two consequences carry most of the value.
First, the density survives per-pod accounting. Copy-on-write sharing is physical; a cgroup memory controller decides who gets billed for a page, it does not copy the page. We did not take that on faith, because the entire design stands on it: fork four VMs into four separate cgroups and compare the naive sum (every fork’s full RSS added up) against the honest physical footprint (the shared snapshot set counted once, plus each fork’s private dirty pages). The sharing holds, and we keep a standing test on it. Billing follows the same physics through CoW-aware metering: shared pages are metered once and split across the daughters, so each tenant pays for what it dirtied, not for whichever pod happened to fault a page in first.
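The two accounting views and the billing split can be written down directly. A sketch with made-up numbers for four forks; the even split of the shared set is an assumption about what "split across the daughters" means, and the function names are illustrative.

```python
def naive_sum_mib(shared_mib: float, private_mib: list[float]) -> float:
    """Every fork's full RSS added up: the shared set is counted once per fork."""
    return sum(shared_mib + p for p in private_mib)


def physical_mib(shared_mib: float, private_mib: list[float]) -> float:
    """The shared set counted once, plus each fork's private dirty pages."""
    return shared_mib + sum(private_mib)


def bills_mib(shared_mib: float, private_mib: list[float]) -> list[float]:
    """Assumed policy: split the shared set evenly, then add what each fork dirtied."""
    share = shared_mib / len(private_mib)
    return [share + p for p in private_mib]


if __name__ == "__main__":
    private = [3, 10, 3, 40]            # MiB dirtied by each of four forks
    print(naive_sum_mib(512, private))  # 2104
    print(physical_mib(512, private))   # 568
    print(bills_mib(512, private))      # [131.0, 138.0, 131.0, 168.0]
```

The bills add up to the physical footprint, not the naive sum, so no page is charged twice.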
Second, the VM’s tap lives in the pod’s network namespace, so the sandbox’s traffic is the pod’s traffic and your existing NetworkPolicy machinery applies to it like any other pod. The guarantee does not lean on your CNI, though: the engine programs an in-pod default-deny egress filter with an unconditional block on 169.254.169.254 (the cloud metadata endpoint, the classic hop from sandbox escape to stolen IAM credentials), verified end to end on a real KVM cluster. NetworkPolicy is defense in depth on top.
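The rule order is the part that matters: the metadata block comes first and no allow rule can override it. The sketch below models that decision in Python with the standard `ipaddress` module. It is a model of the policy, not the engine's packet filter, and the function name and allowlist format are assumptions.

```python
import ipaddress

METADATA_ENDPOINT = ipaddress.ip_address("169.254.169.254")


def egress_allowed(dst: str, allow_cidrs: list[str]) -> bool:
    """Default deny, with the metadata endpoint blocked before any allow rule is read."""
    addr = ipaddress.ip_address(dst)
    if addr == METADATA_ENDPOINT:
        return False  # unconditional, even if an allow rule would cover it
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow_cidrs)


if __name__ == "__main__":
    print(egress_allowed("169.254.169.254", ["0.0.0.0/0"]))      # False
    print(egress_allowed("10.0.0.5", []))                        # False
    print(egress_allowed("93.184.216.34", ["93.184.216.0/24"]))  # True
```

An allow-everything rule still cannot reach the metadata endpoint, and an empty allowlist reaches nothing.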
After that, boring Kubernetes just works, which was the entire point. The scheduler sees real requests. ResourceQuota bounds sandboxes. A PodDisruptionBudget keeps the warm pool alive through drains, and a Pending husk pod is exactly the signal your cluster autoscaler already scales on. kubectl get pods lists your sandboxes. kubectl logs reads them.
Declaring a fleet looks like Kubernetes because it is Kubernetes:
```yaml
apiVersion: mitos.run/v1
kind: SandboxPool
metadata:
  name: python-agent-pool
spec:
  template:
    image: python:3.12-slim
    init: ["pip install numpy pandas requests"]
    resources: { cpu: "1", memory: "512Mi" }
  warm: { min: 10 }
```

And fan-out stays one call:
```python
import mitos

sb = mitos.create("python")   # lands on a warm sandbox
forks = sb.fork(32)           # 32 daughters, ~3 MiB each at fork
results = [f.exec("python attempt.py") for f in forks]
```

The engine is Apache-2.0, the whole mechanism in the open, so you can check our homework.
What to do with all this
Three things, in order of effort.
First, stop calling a plain pod a sandbox in your design docs. If the code came out of a model, the honest minimum is a sandboxed RuntimeClass, and the multi-tenancy question is settled by the CVE record, not by vibes.
Second, ask the three snapshot questions before you sign with any sandbox vendor, including us. Golden image or live memory? Who reseeds the RNG? What happens on clone twenty?
Third, if you have a KVM-capable cluster, run the thing. The quickstart goes from install to a forked sandbox in a few minutes, migrating from E2B is a one-import shim, and “fork, don’t rebuild” covers what division buys an agent swarm in practice.
And if you would rather build than buy, the recipe is genuinely public: Firecracker’s snapshot docs, a MAP_PRIVATE restore, a jailer and a network namespace per clone, VMGenID for the reseed. Budget most of your time for the uniqueness protocol and egress policing; the restore is the easy weekend. The engine is Apache-2.0 exactly so you can read how we handled the parts that were not.
A pod hands untrusted code a view of your kernel and calls it a wall. Give every agent its own kernel instead, and make the second copy cost 3 MiB.
Every number in this post regenerates from the repo. Run the bench on your own hardware, and if it disagrees with ours, open an issue. We would rather be corrected than quoted.