Sandbox lifecycle: timeouts, pause/resume, idle, expiry

This is the reference for how long a sandbox lives, what keeps it alive, how to adjust its TTL while it runs, and how to pause and resume it. It reconciles the controller-side reaping (docs/failure-gc.md) with the sandbox HTTP API surface that the SDKs expose.

The three controls:

  • set_timeout: adjust a RUNNING sandbox’s TTL (live, not just at creation).
  • pause / resume: snapshot full state (memory + filesystem) and stop the clock, then restore.
  • work-aware idle timeout: idle is measured against ACTUAL activity, including a running background process, not just inbound API interaction.

Timeouts and TTL

Two bounds are fixed at creation time, either set on the Sandbox spec (k8s mode) or implied by the standalone sandbox-server defaults:

| Bound | Field | Meaning | Default |
|---|---|---|---|
| maxLifetime | spec.lifetime.ttl | Hard wall-clock cap from start. The sandbox is reaped at startedAt + ttl regardless of activity. | unset (no cap) |
| idleTimeout | spec.lifetime.idleTimeout | Reap after this much time with no ACTUAL activity (see below). | unset (no idle limit) |

There is no implicit default for either: a zero or unset value means “no limit”. Operators set these per sandbox or per pool. The standalone sandbox-server does not reap on its own; reaping is a controller (k8s) behavior, and the live set_timeout deadline below is the standalone path’s TTL control.

maxLifetime does not depend on a reachable forkd: it is pure wall-clock from startedAt. idleTimeout and the live deadline are evaluated from the work-aware activity signal forkd reports through ListSandboxes.

Live set_timeout

set_timeout(timeout_seconds) adjusts a RUNNING sandbox’s TTL to now + timeout_seconds. It is exposed as:

  • POST /v1/set_timeout on forkd and the standalone sandbox-server, body {"sandbox": "<id>", "timeout_seconds": <n>}, returning the new deadline_unix.
  • sandbox.set_timeout(n) in the Python SDK (sync Sandbox and DirectSandbox, async AsyncSandbox) and sandbox.setTimeout(n) in the TypeScript SDK.
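
As an illustration of the HTTP form, here is a minimal Python sketch that builds the POST /v1/set_timeout request using only the standard library. The endpoint path and body fields come from the description above; the base URL, the JSON content type, the absence of authentication, and the assumption that deadline_unix is a top-level field of a JSON response are illustrative guesses, not confirmed API details.

```python
import json
import urllib.request


def build_set_timeout_request(
    base_url: str, sandbox_id: str, timeout_seconds: int
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/set_timeout request."""
    body = json.dumps(
        {"sandbox": sandbox_id, "timeout_seconds": timeout_seconds}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/set_timeout",
        data=body,
        # Assumption: the server accepts a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_timeout(base_url: str, sandbox_id: str, timeout_seconds: int) -> int:
    """Send the request and return the new deadline.

    Assumption: the response is a JSON object with a top-level
    "deadline_unix" field.
    """
    req = build_set_timeout_request(base_url, sandbox_id, timeout_seconds)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["deadline_unix"]
```

In practice the SDK methods listed above are the more convenient route; the raw request is shown only to make the wire format concrete.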

The live deadline takes authority over the idle clock: while a live deadline is set and in the future, the sandbox is not idle-reaped (the caller has taken explicit control of the TTL). A live deadline in the past reaps the sandbox with the TimeoutExpired reason. This is what the E2B compat shim maps its setTimeout onto.

Ceiling and rejection: a requested timeout over the server ceiling (--max-exec-timeout-seconds, default 86400 s = 24 h) is REJECTED with the typed timeout_too_large error, never silently clamped. The deadline you set is the deadline you get, or you get a clear rejection that names the ceiling.
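
The reject-instead-of-clamp rule can be sketched in a few lines of Python. This is a model of the documented behavior, not the server's code: the TimeoutTooLarge class and resolve_deadline function are hypothetical names, and treating a request exactly equal to the ceiling as accepted follows from the word "over" above. Handling of zero or negative values is not modeled.

```python
import time
from typing import Optional

DEFAULT_CEILING_SECONDS = 86400  # documented default: 24 h


class TimeoutTooLarge(Exception):
    """Hypothetical stand-in for the typed timeout_too_large error."""

    def __init__(self, requested: int, ceiling: int) -> None:
        super().__init__(
            f"timeout_too_large: requested {requested}s exceeds ceiling {ceiling}s"
        )
        self.requested = requested
        self.ceiling = ceiling


def resolve_deadline(
    timeout_seconds: int,
    now: Optional[float] = None,
    ceiling: int = DEFAULT_CEILING_SECONDS,
) -> int:
    """Return now + timeout_seconds, or raise if the request is over the ceiling."""
    if timeout_seconds > ceiling:
        # Never clamp: the caller gets the deadline it asked for, or an error
        # that names the ceiling.
        raise TimeoutTooLarge(timeout_seconds, ceiling)
    if now is None:
        now = time.time()
    return int(now) + timeout_seconds
```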

Work-aware idle timeout

Idle is measured against ACTUAL activity, not just inbound API interaction. A sandbox is NOT idle when any of these hold:

  • a streaming exec, run_code, or PTY session is OPEN (a live background job), or
  • the sandbox is paused (its clock is stopped while held), or
  • the most recent inbound exec or file interaction is within the idle window.

Only when none of these hold, and the time since the later of last-activity and start exceeds idleTimeout, is the sandbox reaped with the IdleTimeout reason.
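
The rule above can be written as a small decision function. The following Python sketch is an illustrative model only: the controller's real decision function is Go code, and the Activity type and its field names here are hypothetical (active_streams and paused mirror the signals forkd reports). It treats idle_timeout <= 0 as "no idle limit", matching the unset default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Activity:
    started_at: float               # sandbox start time (unix seconds)
    last_activity: Optional[float]  # last inbound exec/file interaction, if any
    active_streams: int             # open streaming exec, run_code, or PTY sessions
    paused: bool                    # True while the sandbox is held by pause


def idle_expired(a: Activity, idle_timeout: float, now: float) -> bool:
    """Return True only if the sandbox should be reaped as idle."""
    if idle_timeout <= 0:
        return False  # unset or zero means no idle limit
    if a.active_streams > 0:
        return False  # a live background job counts as activity
    if a.paused:
        return False  # the idle clock is stopped while paused
    last = a.started_at if a.last_activity is None else a.last_activity
    reference = max(a.started_at, last)  # later of last activity and start
    return now - reference > idle_timeout
```

For example, a sandbox with one open stream and no inbound interaction for hours yields False, which is the unattended-job case this rule exists for.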

This is the difference that matters for unattended jobs: a long-running background process with no inbound interaction is NOT killed mid-run. forkd surfaces the open-stream count (active_streams) and the paused flag through ListSandboxes; the controller’s idle decision (idleExpired) treats a non-zero stream count or a paused sandbox as busy. The decision function is unit-tested on the mock (internal/controller/idle_decision_test.go, TestClaimIdleTimeoutNotReapedWithBackgroundJob).

Default: there is no implicit idle window; idle reaping is off unless idleTimeout is set. When it is set, the work-aware rule above governs.

Pause and resume

Pause snapshots the sandbox’s FULL state (guest memory + filesystem) and pauses the VM; resume restores it exactly. A paused sandbox is held, not reaped: its idle clock is stopped and the billing meter stops. Repeated pause/resume cycles preserve both memory and filesystem state.

Exposed as:

  • POST /v1/pause and POST /v1/resume on forkd and the standalone sandbox-server, body {"sandbox": "<id>"}.
  • sandbox.pause() / sandbox.resume() in the Python SDK (sync and async) and sandbox.pause() / sandbox.resume() in the TypeScript SDK.
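
One way to use the two endpoints together is a hold-then-restore wrapper. The Python sketch below assumes a caller-supplied post(path, body) function that performs the HTTP call; that callable and the held helper are hypothetical and not part of any SDK. Only the paths and the {"sandbox": "<id>"} body come from the description above.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

# Hypothetical transport: sends a JSON body to the given path.
Post = Callable[[str, dict], None]


@contextmanager
def held(post: Post, sandbox_id: str) -> Iterator[None]:
    """Pause the sandbox for the duration of the with-block, then resume it."""
    post("/v1/pause", {"sandbox": sandbox_id})
    try:
        yield
    finally:
        # Resume even if the block raises, so the sandbox is not left held.
        post("/v1/resume", {"sandbox": sandbox_id})
```

Because resume runs in a finally clause, an exception inside the block still restores the sandbox; if the pause call itself fails, no resume is attempted.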

Substrate: the snapshot/fork engine (internal/fork, Engine.Pause / Engine.Resume) drives a Firecracker Full snapshot of the running VM paired with the copy-on-write rootfs that already holds the filesystem, so both survive every cycle. forkd wires the engine pause/resume into the HTTP endpoints (SandboxAPI.SetEnginePauser); the standalone server and unit tests record the held state only (no VM behind them).

Validation status

  • The pause/resume API surface, the held-state bookkeeping, the work-aware idle decision, and the live set_timeout deadline are unit-tested on the mock engine and in controller envtests (no KVM).
  • The REAL memory + filesystem preservation across N repeated pause/resume cycles needs KVM (the mock cannot snapshot real memory). It is asserted by the GATED test TestEnginePauseResumePreservesStateKVM (internal/fork/engine_pause_kvm_test.go), which boots a real Firecracker VM, writes a marker file and starts a long-running process, runs N pause/resume cycles, and asserts the file content and the same live PID survive every cycle. It skips cleanly when /dev/kvm or the asset env vars are absent, so it is never a fake pass: it only asserts when it can really boot a VM. The KVM CI workflow (.github/workflows/kvm-test.yaml) provides the runner and assets.

These behaviors directly target two documented competitor papercuts: the E2B repeated-cycle filesystem bug (state not persisting after multiple pause/resume cycles) and the Daytona interaction-only idle timer (background jobs killed mid-run).

Behavior on expiry

When a sandbox crosses its bound it is TERMINATED, not paused: the backing VM is reaped and the claim reaches the terminal Terminated phase with a condition carrying the reason (MaxLifetimeExceeded, IdleTimeout, or TimeoutExpired). A subsequent call against a reaped sandbox returns the typed idle_timeout error (docs/api/errors.md), whose remediation points at creating a fresh sandbox or calling set_timeout earlier to keep it alive. Pause is the explicit way to hold a sandbox without terminating it; expiry never auto-pauses.
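
To show how the three reasons relate, here is a Python sketch of a reason selector for a running (not paused) sandbox. It is a simplified model under stated assumptions, not the controller's logic: the function name and parameters are hypothetical, checking the hard cap before the live deadline is an assumed ordering, and treating a deadline exactly equal to now as expired is an assumed boundary choice. The idle verdict is passed in as a precomputed boolean.

```python
from typing import Optional


def expiry_reason(
    now: float,
    started_at: float,
    ttl: float,                      # maxLifetime in seconds; 0 means no cap
    live_deadline: Optional[float],  # deadline set by set_timeout, if any
    idle_expired: bool,              # result of the work-aware idle decision
) -> Optional[str]:
    """Return the termination reason, or None if the sandbox stays alive."""
    # Hard wall-clock cap applies regardless of activity.
    if ttl > 0 and now >= started_at + ttl:
        return "MaxLifetimeExceeded"
    # A live deadline takes authority over the idle clock.
    if live_deadline is not None:
        return "TimeoutExpired" if now >= live_deadline else None
    return "IdleTimeout" if idle_expired else None
```

Note that a future live deadline suppresses IdleTimeout even when the idle verdict is True, which reflects the rule that the caller has taken explicit control of the TTL.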
