This is the reference for how long a sandbox lives, what keeps it alive, how to
adjust its TTL while it runs, and how to pause and resume it. It reconciles the
controller-side reaping (docs/failure-gc.md) with the
sandbox HTTP API surface that the SDKs expose.
The three controls:
set_timeout: adjust a RUNNING sandbox’s TTL (live, not just at creation).- pause / resume: snapshot full state (memory + filesystem) and stop the clock, then restore.
- work-aware idle timeout: idle is measured against ACTUAL activity, including a running background process, not just inbound API interaction.
Timeouts and TTL
Two creation-time bounds, set on the Sandbox spec (k8s mode) or implied by the standalone sandbox-server defaults:
| Bound | Field | Meaning | Default |
|---|---|---|---|
| maxLifetime | spec.lifetime.ttl | Hard wall-clock cap from start. The sandbox is reaped at startedAt + ttl regardless of activity. | unset (no cap) |
| idleTimeout | spec.lifetime.idleTimeout | Reap after this much time with no ACTUAL activity (see below). | unset (no idle limit) |
There is no implicit default for either: a zero or unset value means “no limit”.
Operators set these per sandbox or per pool. The standalone sandbox-server does
not reap on its own; reaping is a controller (k8s) behavior, and the live
set_timeout deadline below is the standalone path’s TTL control.
maxLifetime does not depend on a reachable forkd: it is pure wall-clock from
startedAt. idleTimeout and the live deadline are evaluated from the work-aware
activity signal forkd reports through ListSandboxes.
Live set_timeout
set_timeout(timeout_seconds) adjusts a RUNNING sandbox’s TTL to
now + timeout_seconds. It is exposed as:
POST /v1/set_timeouton forkd and the standalone sandbox-server, body{"sandbox": "<id>", "timeout_seconds": <n>}, returning the newdeadline_unix.sandbox.set_timeout(n)in the Python SDK (syncSandboxandDirectSandbox, asyncAsyncSandbox) andsandbox.setTimeout(n)in the TypeScript SDK.
The live deadline takes authority over the idle clock: while a live deadline is
set and in the future, the sandbox is not idle-reaped (the caller has taken
explicit control of the TTL). A live deadline in the past reaps the sandbox with
the TimeoutExpired reason. This is what the E2B compat shim maps its
setTimeout onto.
Ceiling and rejection: a requested timeout over the server ceiling
(--max-exec-timeout-seconds, default 86400 s = 24 h) is REJECTED with the typed
timeout_too_large error, never silently clamped. The deadline you set is the
deadline you get, or you get a clear rejection that names the ceiling.
Work-aware idle timeout
Idle is measured against ACTUAL activity, not just inbound API interaction. A sandbox is NOT idle when any of these hold:
- a streaming exec, run_code, or PTY session is OPEN (a live background job), or
- the sandbox is paused (its clock is stopped while held), or
- the most recent inbound exec or file interaction is within the idle window.
Only when none of these hold, and the time since the later of last-activity and
start exceeds idleTimeout, is the sandbox reaped with the IdleTimeout reason.
This is the difference that matters for unattended jobs: a long-running
background process with no inbound interaction is NOT killed mid-run. forkd
surfaces the open-stream count (active_streams) and the paused flag through
ListSandboxes; the controller’s idle decision (idleExpired) treats a non-zero
stream count or a paused sandbox as busy. The decision function is unit-tested on
the mock (internal/controller/idle_decision_test.go,
TestClaimIdleTimeoutNotReapedWithBackgroundJob).
Default: there is no implicit idle window; idle reaping is off unless
idleTimeout is set. When it is set, the work-aware rule above governs.
Pause and resume
Pause snapshots the sandbox’s FULL state (guest memory + filesystem) and pauses the VM; resume restores it exactly. A paused sandbox is held, not reaped: its idle clock is stopped and the billing meter stops. Repeated pause/resume cycles preserve both memory and filesystem state.
Exposed as:
POST /v1/pauseandPOST /v1/resumeon forkd and the standalone sandbox-server, body{"sandbox": "<id>"}.sandbox.pause()/sandbox.resume()in the Python SDK (sync and async) andsandbox.pause()/sandbox.resume()in the TypeScript SDK.
Substrate: the snapshot/fork engine (internal/fork, Engine.Pause /
Engine.Resume) drives a Firecracker Full snapshot of the running VM paired with
the copy-on-write rootfs that already holds the filesystem, so both survive every
cycle. forkd wires the engine pause/resume into the HTTP endpoints
(SandboxAPI.SetEnginePauser); the standalone server and unit tests record the
held state only (no VM behind them).
Validation status
- The pause/resume API surface, the held-state bookkeeping, the work-aware idle
decision, and the live
set_timeoutdeadline are unit-tested on the mock engine and in controller envtests (no KVM). - The REAL memory + filesystem preservation across N repeated pause/resume
cycles needs KVM (the mock cannot snapshot real memory). It is asserted by the
GATED test
TestEnginePauseResumePreservesStateKVM(internal/fork/engine_pause_kvm_test.go), which boots a real Firecracker VM, writes a marker file and starts a long-running process, runs N pause/resume cycles, and asserts the file content and the same live PID survive every cycle. It skips cleanly when/dev/kvmor the asset env vars are absent, so it is never a fake pass: it only asserts when it can really boot a VM. The KVM CI workflow (.github/workflows/kvm-test.yaml) provides the runner and assets.
This directly targets the documented competitor papercuts: the E2B repeated-cycle filesystem bug (state not persisting after multiple pause/resume) and the Daytona interaction-only idle timer (background jobs killed mid-run).
Behavior on expiry
When a sandbox crosses its bound it is TERMINATED, not paused: the backing VM is
reaped and the claim reaches the terminal Terminated phase with a condition
carrying the reason (MaxLifetimeExceeded, IdleTimeout, or TimeoutExpired).
A subsequent call against a reaped sandbox returns the typed idle_timeout error
(docs/api/errors.md), whose remediation points at creating a fresh sandbox or
calling set_timeout earlier to keep it alive. Pause is the explicit way to hold
a sandbox without terminating it; expiry never auto-pauses.