This document states what isolation sandbox provides today, what it intends
to provide, and the current status of every boundary. It is written against the
code in this repository, not the README. Statuses: mitigated, partial,
open.
For the operator-facing summary of how tenant secrets are delivered, isolated
across tenants, and encrypted at rest (consolidated with docs/encryption.md and
docs/fork-correctness.md), see docs/secrets.md.
Honest summary for the current codebase: do not run untrusted code with this
project in production yet. The KVM/Firecracker boundary is real, but several
defense-in-depth layers around it are still open, and an extended security review
(recorded in this document) found controls that the code did NOT implement.
The most serious of those, the husk default path having NO egress enforcement and
NO cloud-metadata block, is now IMPLEMENTED and KVM-VERIFIED end to end on a real
VM: the husk default path enforces in-pod default-deny egress with an
unconditional cloud-metadata block and the threaded per-template allowlist, and
the husk-network KVM cluster e2e (test/cluster-e2e/husk-network-e2e.sh, the
cluster-husk-network-e2e suite) now PASSES on the Hetzner Talos KVM cluster:
all three in-VM enforcement assertions are green inside a real restored VM
(metadata-blocked: a connect to 169.254.169.254 fails with exit 1 and no IAM
theft; default-deny: a non-allowlisted host fails to connect, exit 1; and
allowlist-works: the allowlisted name example.com IS reachable, connect exit 0).
The claim reaches Ready and the pool warms a dormant husk pod. This was verified
TWICE: once with a node kubelet sysctl allowance in place (proving the datapath),
then again with that node change fully reverted (the probe confirmed
SysctlForbidden), proving the feature needs NO node prerequisite. The remaining
open items below (item 1 is now resolved) are why this is not yet a production
posture. No external security review has happened.
Running untrusted code: what is and is NOT warranted today
The microVM (KVM + Firecracker) is the isolation boundary, and the design DOES target running untrusted code. But advertising this project as “safe for untrusted code” is NOT yet warranted. The extended review found controls claimed in this document that the code does not implement; those are corrected per row below. Until the following are all true, do not run untrusted multi-tenant workloads in production on this project, and do not claim it is safe to:
- Husk egress default-deny plus a cloud-metadata (169.254.169.254) block.
This was the top blocker; it is now IMPLEMENTED and KVM-VERIFIED end to end, so
it is RESOLVED (mitigated). The husk default path enforces default-deny egress
via the in-pod nftables filter the husk-stub programs in the pod’s own netns at
activation (
internal/husk/netfilter.go, applied ininternal/husk/stub.go, reusinginternal/netconfandinternal/dnsproxy). The cloud metadata endpoints (169.254.169.254, the169.254.0.0/16link-local range, and IPv6fd00:ec2::254) are an UNCONDITIONAL hard drop rendered before any allow (netconf.RenderMetadataBlock), so the allowlist can never reach them and a guest cannot steal the node’s cloud IAM credentials. The per-template egress allowlist is threaded husk-side (huskNotifyNetwork+huskEgressConfigininternal/controller/sandboxclaim_controller.go, carried inhusk.ActivateRequest.Egress/Allow), with a per-pod DNS proxy for name entries (internal/dnsproxy), and the husk pod adds exactly the scopedNET_ADMINcapability needed to program the filter. A best-effort Kubernetes NetworkPolicy (internal/controller/husknetworkpolicy.go) is created as defense in depth; HONEST CNI caveat: it enforces only on a CNI that implements NetworkPolicy and it cannot express name-based allows, so the in-pod nft filter is the guarantee that holds with no CNI policy at all (now KVM-verified). VERIFIED: the husk-network KVM cluster e2e (test/cluster-e2e/husk-network-e2e.sh, thecluster-husk-network-e2esuite) PASSES on the Hetzner Talos KVM cluster. The claim reaches Ready, the pool warms a dormant husk pod, and all three in-VM enforcement assertions are green inside a real restored VM: metadata-blocked (connect to169.254.169.254fails, exit 1, no IAM theft), default-deny (a non-allowlisted host fails to connect, exit 1), and allowlist-works (the allowlisted nameexample.comIS reachable, connect exit 0). It was verified TWICE: once with a node kubelet sysctl allowance in place (proving the datapath), then again with that node change fully reverted (the probe confirmed SysctlForbidden), proving the feature needs NO node prerequisite. See section 4 and section 0 surface 5. - Fork-correctness fail-closed on ALL engines. The raw-forkd and
sandbox-server paths do NOT fail closed on an un-reseeded fork; only the husk
path does (section 6,
docs/fork-correctness.mdrow 1). - Node-CAS bounds and integrity. The node CAS is tenant-writable,
unbounded, and activated with
--allow-unverified-snapshotson the fork child path (section 3, W4 row). - The secondary hardening below: the raw-forkd privileged DaemonSet remains
(the privileged-no-jailer DaemonSet is why raw-forkd is still NOT for untrusted
multi-tenant). Four earlier secondary items are now SHIPPED: the husk pod no
longer automounts the default SA token, the forkd gRPC surface fails closed
without mTLS (opt-in
--allow-insecure-grpcfor dev), the host-side vsock client applies a per-request read deadline, and the raw-forkd shared-rootfs cross-fork write channel is FIXED via per-fork rootfs CoW (mirroring the husk fix). See the per-row table. - The fork-correctness and failure/GC CI suites green (a precondition already stated in CLAUDE.md and ROADMAP).
- A CVE / patch pipeline for the guest kernel and Firecracker (today CVE watch is a manual process, section “Supply chain”).
- An EXTERNAL security audit (none has happened; see the review gate).
Must-fix-first set (the items that make the current posture actively unsafe, not
merely incomplete): item 1 (husk egress default-open + metadata reachable) is now
RESOLVED (mitigated): the husk default path enforces in-pod default-deny +
unconditional metadata block + threaded allowlist + per-pod DNS proxy via the
scoped NET_ADMIN capability (best-effort NetworkPolicy is defense in depth), and
the husk-network KVM cluster e2e (cluster-husk-network-e2e) PASSES on the
Hetzner Talos KVM cluster with all three in-VM enforcement assertions green
(metadata-blocked, default-deny, allowlist-works), verified twice (once with and
once without a node sysctl allowance), proving NO node prerequisite. The remaining
must-fix items are item 2 (fork-correctness not fail-closed off the husk path),
the raw-forkd privileged-no-jailer DaemonSet (item 4, which is why raw-forkd is
NOT for untrusted multi-tenant, even now that its shared-rootfs cross-fork write
channel is fixed by per-fork rootfs CoW), and item 3 (node-CAS
write/integrity/DoS). The per-row honest status discipline (mitigated / partial /
open, with a severity on the review rows) is preserved throughout.
Components and trust
| Component | Runs as | Trusts | Trusted by |
|---|---|---|---|
| Guest workload | VM guest, untrusted | nothing | nobody |
Guest agent - Rust (guest/agent-rs) | PID 1 in guest; the SOLE production agent since Phase E (#310). Baked as /init by guest/rootfs/build.sh and kvm-test.yaml. Serves ONLY gRPC on vsock port 53 (AgentGRPCPort); all host callers speak gRPC (SP1.5 merged, Go agent and legacy JSON protocol removed). SECURITY-SENSITIVE: requires named human reviewer; see docs/security-review-policy.md. | nothing | forkd / husk stub (gRPC, port 53) |
husk pod stub (cmd/husk-stub), the DEFAULT runner | unprivileged pod, /dev/kvm via device plugin (not privileged), drop ALL caps and add only NET_ADMIN (in-pod egress firewall, scoped to the pod netns), seccomp: RuntimeDefault, read-only snapshot mount; the long-lived husk-stub container is NOT privileged | controller (mTLS control channel) | controller |
husk enable-ip-forward init container (name-egress pools only) | short-lived PRIVILEGED init container on the husk pod; writes net.ipv4.ip_forward=1 in the shared pod netns and exits BEFORE the workload runs; added only when name-based egress is configured (--husk-dns-upstream set) | controller (it is part of the pod spec) | controller |
forkd (cmd/forkd), the snapshot BUILDER and the raw-forkd fallback | NON-privileged DaemonSet pod (uid 0, privileged: false, allowPrivilegeEscalation: false, seccomp: RuntimeDefault): drops ALL capabilities and adds back ONLY the explicit builder set (SYS_ADMIN, CHOWN, SETUID, SETGID, MKNOD for the jailer, plus NET_ADMIN for the build-time placeholder tap and DAC_OVERRIDE so the builder can reopen the template rootfs.ext4 it wrote after the jailer chowned the shared inode to the per-VM uid, #426; cmd/forkd/jailer.go forkdRequiredCapabilities). DAC_OVERRIDE is negligible marginal authority next to the CAP_SYS_ADMIN the builder already holds. The per-VM jailer is ENABLED in the shipped DaemonSet (deploy/daemon/daemonset.yaml: --jailer/--chroot-base/--uid-range), so every build/raw-forkd VM runs under a dedicated uid/gid in a per-VM chroot. /dev/kvm and /dev/net/tun come from the device plugin (mitos.run/kvm), NOT a privileged hostPath. (#352) | controller | controller, nodes |
controller (cmd/controller) | cluster Deployment, CRD + Secrets RBAC | kube-apiserver | forkd, husk pods |
| Snapshot artifacts | files under /var/lib/mitos on each node | - | forkd builds them; husk pods mount and execute them as memory images |
0. Default execution surface: the unprivileged husk pod (issue #18)
Pod-native execution is now the DEFAULT (the controller runs
--enable-husk-pods by default; --enable-raw-forkd and --mock select the
fork-per-claim fallback). This is a deliberate change to the per-sandbox
execution surface. This section is the FULL re-derivation for the
unprivileged-stub escape surface (issue #18); it re-derives the surface boundary
by boundary, states which CI-proven mechanism backs each claim, and names every
residual. It does not contradict the per-row sections below; it reconciles with
them (the networking egress row in section 4, the encryption/key-custody row in
section 5, the sandbox-API row in section 3) and points at each.
The build-vs-run split is the core idea: a SNAPSHOT is BUILT once per node by a privileged process and RUN many times by unprivileged pods.
- Old default (raw-forkd, now the fallback behind
--enable-raw-forkd). A sandbox VM was forked by forkd: a root DaemonSet with/dev/kvm, an explicit capability set (CAP_SYS_ADMIN,SYS_CHROOT, and others; section 3), and a hostPath to the node data dir. The per-sandbox EXECUTION surface WAS that privileged process. - New default (husk pods). A sandbox VM runs inside an UNPRIVILEGED husk pod:
privileged: false,allowPrivilegeEscalation: false, drop ALL capabilities,seccompProfile: RuntimeDefault,/dev/kvminjected by the device plugin (not a hostPath or privilege), and the template snapshot mounted READ-ONLY. It adds exactly one capability,NET_ADMIN, scoped to the pod’s own netns, so the stub can program the in-pod egress firewall (surface 5). The three documented PSA exceptions arerunAsNonRoot: false(the device exception), the read-only snapshot hostPath (surfaces 1 and 3 below), andNET_ADMIN(the in-pod firewall, surface 5). On name-based-egress pools ONLY (the controller flag--husk-dns-upstreamset) the pod additionally carries a short-lived PRIVILEGED init container (enable-ip-forward) that setsnet.ipv4.ip_forward=1in the shared pod netns and exits before the workload runs (surface 5); it is privileged but one-shot, and the long-lived husk-stub container itself stays unprivileged (NET_ADMINonly). - forkd-the-builder is the residual host privilege, but it is no longer a
privileged container (#352). Building a template snapshot still needs
/dev/kvmand the jailer, so forkd remains the privileged BUILDER role on the KVM nodes (and the--enable-raw-forkdfork-per-claim engine). That role now runs as a NON-privileged pod confined to the explicit builder capability set, notprivileged: true, and the privileged surface is confined to the BUILD path, run once per node per template, rather than every sandbox execution. - forkd is NON-privileged (deploy/daemon/daemonset.yaml, #352). forkd is the
per-node BUILDER and the raw-forkd fallback: building a snapshot runs the
jailer, which needs
CAP_SYS_ADMIN(chroot-base mount setup, cgroup, and namespace setup),CAP_CHOWN/CAP_SETUID/CAP_SETGID(hand and drop to the per-VM uid),CAP_MKNOD(in-chroot/dev/kvmand/dev/net/tunnodes),CAP_NET_ADMIN(the build-time placeholder tap), andCAP_DAC_OVERRIDE(reopen the templaterootfs.ext4the builder wrote after the jailer chowned the shared inode to the per-VM uid, #426). forkd runs with EXACTLY that set (drop: [ALL],privileged: false,allowPrivilegeEscalation: false,seccompProfile: RuntimeDefault), and/dev/kvm//dev/net/tuncome from the device plugin (mitos.run/kvm), not a privileged hostPath. Status: mitigated. Mitigation: the forkd pod runs only on labelled KVM nodes (mitos.run/kvm), is one-per-node (not one-per-sandbox), is not exposed to tenant traffic, and its build/raw-forkd VMMs run jailed under throwaway uids; husk pods, not forkd, are the tenant execution surface. - The per-VM Firecracker jailer is deliberately NOT run inside the husk pod. jailer-in-pod was implemented and VERIFIED achievable on real KVM (branch feat/jailer-in-pod, closed PR #96), but it requires the full 9-cap jailer set, Unconfined seccomp, and a writable exec+dev hostPath chroot, which makes EVERY husk pod privileged-class and breaks the PSA-restricted model. It is declined: the jailer isolates many VMs sharing one process (raw-forkd), but in the husk model each pod runs exactly ONE VM, so the pod itself is the per-VM boundary (its own uid, netns, cgroup, PSA-restricted securityContext). Per-VM isolation comes from one-VM-per-unprivileged-pod plus the microVM, not an in-pod jailer.
Unprivileged-stub escape surface (issue #18 re-derivation)
The honest framing up front: the per-sandbox EXECUTION surface is strictly
improved (an unprivileged, capability-dropped, restricted-minus-two,
pod-netns-governed container instead of a root process), while the INHERENT
microVM-host-escape risk (a KVM or host-kernel bug reachable from any
/dev/kvm-holder) is UNCHANGED, and forkd-the-builder remains a smaller
privileged control-plane surface (run once per node per template, not per
sandbox). “Provably better” is argued PER SURFACE below and tallied honestly; it
is NOT claimed globally, because the /dev/kvm-and-kernel axis is EQUAL, not
better.
Surface 1: GUEST -> HUSK-STUB CONTAINER (the post-VMM-escape blast radius).
A guest that breaks out of the microVM (a Firecracker or KVM escape) lands in the
process that hosts the VMM. Under the old default that process was forkd: ROOT,
with /dev/kvm, an explicit capability set including CAP_SYS_ADMIN, and a
hostPath to the node data dir (section 3). Under the husk default that process is
the husk stub inside an UNPRIVILEGED pod whose securityContext
(internal/controller/huskpod.go, proven in internal/controller/huskpod_test.go
at the object level and against the v1.31 PodSecurity admission plugin on the
kind-e2e-husk job, slice 4, conformance criterion 4) sets, each control
load-bearing:
privileged: false,allowPrivilegeEscalation: false(no setuid path regains privilege),capabilities.drop: [ALL], with exactlyNET_ADMINadded back so the stub can program the in-pod egress firewall (surface 5) in the pod’s OWN netns; this is scoped to the pod netns (not hostNetwork, not privileged) and cannot reach the host netns or another pod,seccompProfile: RuntimeDefaultat BOTH the pod and the container level,- the ONLY host mount is the READ-ONLY snapshot dir plus the read-only kernel file (surface 3),
- the ONLY device is
/dev/kvm(and/dev/net/tun) via the device plugin, not a hostPath or privilege (surface 4).
This securityContext satisfies EVERY PSA restricted control except three
documented exceptions; the husk pod is kept out of a restricted namespace by
EXACTLY those three, each intrinsic to the model: the read-only snapshot hostPath
and runAsNonRoot: false (uid 0 so Firecracker can open the injected /dev/kvm
without privileged), both recorded as
docs/adr/0003-kvm-device-plugin-psa-exception.md, plus NET_ADMIN (forbidden
under PSA baseline/restricted “Capabilities”) for the in-pod egress firewall,
recorded as docs/adr/0006-husk-netadmin-egress-firewall.md. Justification: the
capability is scoped to the pod’s own netns (not hostNetwork, not privileged,
allowPrivilegeEscalation: false, seccomp: RuntimeDefault), so it cannot reach
the host netns or other pods; the net security change is strongly positive: it
trades one scoped capability in an already-unprivileged pod for closing
default-open egress and node IAM-credential theft. The husk pod is thus
“restricted EXCEPT the read-only snapshot hostPath, runAsNonRoot: false, and
NET_ADMIN (for in-pod firewalling)”. A genuinely privileged pod IS rejected in
the same namespace (PSA is enforcing), asserted on kind-e2e-husk (slice 4,
section 6e of docs/husk-pods.md).
This is the core “provably better” claim, and it is bounded to THIS surface: a
guest that escapes the microVM lands with NO root authority, NO Linux
capabilities, NO privilege-escalation path, NO broad host filesystem, only a
read-only base-image mount and the pod’s own netns and cgroup, instead of forkd’s
root with CAP_SYS_ADMIN and host data-dir access. The post-escape blast radius
is strictly smaller. What this does NOT change is whether the guest can reach the
host kernel through /dev/kvm in the first place (surface 4).
Surface 2: the CONTROL CHANNEL (activation + secret delivery). Activating a
husk pod delivers the tenant’s claim-time env, secrets, and the per-sandbox
bearer token into the pod. The channel is mTLS, RequireAndVerifyClientCert,
authorized to the controller identity ONLY: internal/husk.ServeTLS plus
AuthorizeControllerIdentity accept a connection only when the VERIFIED mTLS peer
(read from VerifiedChains, never from a merely-presented cert) carries the
pki.ControllerName SAN, and a nil TLS config or nil authorize hook is refused
(fail-closed: an unauthenticated activate channel that delivers secrets is
rejected before any request is read). CI-proven: the KVM husk network-activation
phase asserts a WRONG-CA controller cert is REJECTED by the mTLS gate before any
secret is read (slice 2, section 6b of docs/husk-pods.md). So an in-cluster
actor cannot activate or hijack a husk pod, or inject secrets into one. Residual:
a compromised CONTROLLER can activate any husk pod and deliver secrets to it. The
controller is the trust anchor here, the same anchor as in the raw-forkd model
and the encryption key custody (section 5); this is not a regression, it is the
same boundary.
Surface 3: the READ-ONLY SNAPSHOT HOSTPATH. The node template snapshot is
mounted READ-ONLY into the husk pod (huskpod.go: the snapshot hostPath and the
kernel file are both ReadOnly: true). The husk stub RE-VERIFIES the snapshot ON
ACTIVATE, before it loads it, applying the SAME fail-closed gate as raw-forkd:
the stub decodes the mounted CAS manifest, binds it to the controller-passed
recorded digest (husk.ActivateRequest.ExpectedDigest, fed from the
NodeRegistry’s forkd-reported TemplateDigests), re-hashes the loaded
mem+vmstate against it (a sha256 digest verify, #9), and runs
internal/snapcompat.Check against THIS node’s detected environment (#32), all
in internal/husk verifySnapshot (the production verifier, shared with the
fork path via the internal/cas chunk/hash primitives so the two cannot drift).
Both checks fail closed: a snapshot tampered on the node disk after forkd’s
build-time verification, or one incompatible with this node, is REFUSED on the
husk path too and never loaded into the VM (proven by internal/husk
TestActivateVerifyRefusesTamperedSnapshot and
TestActivateVerifyRefusesIncompatibleSnapshot). So the husk path is no longer a
verify gap relative to raw-forkd: it is the same digest + snapcompat gate.
Residual, stated honestly: all husk pods on a node share the SAME read-only
snapshot dir. This IS a shared read-only host mount and is one of the two
documented PSA-restricted exceptions (the hostPath, surface 1). It is acceptable
because (a) it is READ-ONLY: a husk pod cannot WRITE it, so it cannot tamper with
the base image another pod loads; (b) it is integrity-verified and
content-addressed, and the husk stub re-checks the digest + compatibility on EACH
activate before loading, so a tampered-on-disk or incompatible snapshot is
refused at activate time on the husk path; and (c) it is a
BASE IMAGE, not tenant data: tenant secrets are delivered post-restore over the
control channel (surface 2), never baked into the shared snapshot (section 6).
Cross-pod isolation of the snapshot mem/vmstate is the read-only property, not a
per-pod copy. The ROOTFS isolation is from a per-activation copy-on-write clone
rebound while the guest is FROZEN, not from a read-only template mount. The
template dir (which holds rootfs.ext4) is mounted read-write, because
Firecracker opens the snapshot’s baked rootfs path with O_RDWR during
/snapshot/load (a read-only mount fails the load EROFS, verified on real KVM);
isolation does NOT rely on the mount mode. Each activation gets its OWN clone:
internal/husk Stub.Prepare reflink-clones <dataDir>/templates/<id>/rootfs .ext4 to a PER-POD file <dataDir>/husk-rootfs/<pod-name>/rootfs.ext4 (the clone
path is scoped to the per-pod VM id the controller passes via the downward API
metadata.name, so two husk pods sharing the node CoW hostPath never collide on,
overwrite, or delete each other’s clone), and Stub.Activate loads the snapshot
PAUSED (resume=false), rebinds the baked rootfs drive to that clone with
PatchDrive while the guest is frozen, THEN resumes. The template is only OPENED
(never written) during the paused load and the drive fd is replaced by PatchDrive
before resume, so the guest writes only its own clone, never a single block of the
shared template, and concurrent activations of one template never leak one
tenant’s filesystem state into another. The clone is removed on pod teardown
(Stub.Close). Fully pod-native snapshot delivery (a CAS pull into the pod,
removing the shared read-only mem/vmstate hostPath) remains a documented follow-up.
Warm-pool autoscaling (no integrity-gate move)
Demand-driven warm-pool autoscaling changes only WHEN and HOW MANY dormant husk pods the controller creates or deletes. It does NOT change the snapshot integrity gate: every dormant pod still runs the same fail-closed Prepare-time verify (digest + snapcompat) against the read-only mounted CAS manifest before it can be offered for a claim (Surface 3). The autoscaler reads only pod labels (dormant vs claimed) and a process-local claim-arrival timestamp; it trusts no tenant-controlled input, holds no secret, and a compromised husk pod cannot influence the desired count beyond appearing claimed (which only makes the pool create MORE warm capacity, never fewer or unverified pods). Scale-down deletes only surplus DORMANT pods, never a claimed/in-use one. Security surface: unchanged.
A SEPARATE follow-up (a per-node verify cache so the second dormant pod on a node skips the ~680 MiB re-hash) WILL touch the integrity gate and must land with its own threat-model delta; it is intentionally out of scope here.
Husk fork snapshot (live fork on the husk path)
A fork (a Sandbox with source.fromSandbox) of a husk-backed source drives a ForkSnapshot control op against
the SOURCE husk pod’s stub over the SAME mTLS channel as activate (authorized to
the controller identity only: internal/husk.ServeTLS plus
AuthorizeControllerIdentity; the op rides the same op-dispatched channel that
delivers secrets on activate). The op carries NO secrets (a fork id and a
node-local snapshot path). The stub pauses the running VM, writes a Full
Firecracker snapshot to a node hostPath <dataDir>/forks/<fork-id> (read-write
only to the source pod that owns the VM; read-only to the child pods on the SAME
node), then resumes the source unless pauseSource. The fork snapshot is a LIVE,
EPHEMERAL artifact created by a trusted node-local stub and consumed by child
stubs on the same node within the same trust boundary; it is NOT content-addressed,
so the children activate it with verify disabled (--allow-unverified-snapshots),
the same posture a pre-digest pool uses. This is acceptable because the artifact
is root-owned, never tenant-writable, and re-hashing would gate on a digest that
does not exist for a live fork. The child still runs the full fail-closed RNG/clock
reseed handshake (see docs/fork-correctness.md, husk fork children). Per-child
independence: each child is its own husk pod + dormant VMM + per-activation rootfs
CoW clone, so guest writes never cross between children or back to the source; the
children share only the read-only fork snapshot mem+vmstate as a restore image,
exactly as warm pods share the template snapshot. Each child mints its OWN bearer
token (the source’s token never opens a child). Lifecycle: the fork snapshot is
owned by the forking Sandbox and removed by its finalizer (RemoveForkSnapshot op)
on delete; the child pods are owner-ref’d to the fork and reaped by Kubernetes GC.
Residual: a compromised controller can drive a fork-snapshot of any husk pod it
can reach, the same residual already stated for activate (Surface 2).
Husk workspace hydrate/dehydrate (W4 transport on the husk path)
A bound claim’s /workspace is persisted and restored over TWO new control ops on
the SAME op-dispatched mTLS channel as activate and fork-snapshot, authorized to
the controller identity ONLY (internal/husk.ServeTLS plus
AuthorizeControllerIdentity). The controller is not on the node and cannot reach
the guest vsock or the node CAS, so it DELEGATES the transfer to the husk-stub
that owns both:
dehydrate-workspace(excludePaths, capturePaths, parentManifestDigest): the stub runs the guest vsockTarDirover/workspace, stores the content-addressed chunks plus manifest into the node CAS (a<dataDir>/cashostPath mounted read-write into the pod), and returns the manifest digest. It reusesinternal/workspace.Dehydrate(the KVM-proven tar round trip), not a reimplementation. WhenparentManifestDigestis set (a{diff: true}terminate) it ALSO computes the content-hash diff of the new revision against that parent and returns it: the diff is computed on the node from the two MANIFESTS (path -> chunk-digest lists) in the node CAS viainternal/workspace.DiffManifests, never the chunk bytes, because the off-node controller cannot read either node-CAS manifest. The diff carries content path NAMES only; no chunk bytes ride the result.hydrate-workspace(manifestDigest): the stub reads the manifest plus chunks from the node CAS andUntarDirs them into the guest/workspace.
The ops carry NO secrets: the request is path lists / content-address manifest
digests, and the result is a manifest digest plus an optional content-path diff and
latency. Secret/credential paths
are stripped from the captured tree by the dehydrate exclude list
(WorkspaceSecretExcludePaths), so a committed revision is content only. Workspace
CONTENT bytes never appear in a log line or an error on either side. The ops FAIL
CLOSED: the stub requires an active VM and a configured node CAS (an unset
--cas-dir disables them), and the controller delegate refuses on an unreachable
pod or a not-OK result rather than committing a revision the node never produced;
the controller still owns the WorkspaceRevision commit + head advance. Surface
delta vs the prior model: the node CAS is now mounted READ-WRITE into the husk
pod (it was read-only manifests before) so the stub can persist a revision. The
content-addressed store stays plaintext-content-addressed (or per-workspace
encrypted at rest under spec.store.encryptionKeyRef, section 6). Residual: a
compromised controller can drive a dehydrate/hydrate of any husk pod it can reach
(the same activate/fork residual, Surface 2); a compromised husk pod already had
write access to its node <dataDir> subtree (forks dir, rootfs CoW), and the CAS
mount widens that to the content store on the same node, bounded to the node’s own
data dir.
Surface 4: the DEVICE /dev/kvm. KVM access is injected by the device plugin
(cmd/kvm-device-plugin, internal/deviceplugin): the pod requests
mitos.run/kvm like any extended resource and the kubelet bind-mounts
/dev/kvm (and /dev/net/tun) on Allocate, so the pod sets NO privileged: true and carries NO /dev/kvm hostPath. CI-proven: the kind-e2e job drives
the full advertise -> schedule -> inject path with a NON-privileged probe pod
(privileged: false, escalation false, drop ALL, read-only rootfs) and
kubectl exec confirms /dev/kvm is present inside the container, coming
entirely from Allocate, not from any privilege (section 5 of
docs/husk-pods.md). Residual, stated honestly: /dev/kvm IS exposed to the
pod, so a KVM-or-host-kernel escape from the VMM is STILL a host-escape vector.
The device plugin removes the PRIVILEGED requirement, NOT the /dev/kvm attack
surface itself; that surface is inherent to ANY Firecracker host and is UNCHANGED
between raw-forkd and husk. This is the axis on which the two models are EQUAL,
not better. The device-plugin DaemonSet has its own small surface
(deploy/device-plugin/daemonset.yaml): it runs as root because the kubelet
device-plugins dir is root-owned, but it is privileged: false,
allowPrivilegeEscalation: false, ALL capabilities dropped, and
readOnlyRootFilesystem: true; its only host access is the kubelet
device-plugins dir (to serve and register its socket) and a read-only /dev
(to stat /dev/kvm); it creates NO device nodes and starts NO VMs.
Surface 5: the POD NETNS (egress). IMPLEMENTED and KVM-VERIFIED end to end (in-pod default-deny + metadata block proven inside a real restored VM on the husk path; raw-forkd unchanged). In the husk default the VM’s tap lives inside the HUSK POD’s network namespace, so the sandbox’s traffic IS the pod’s traffic. This was the top must-fix-first blocker (default-open egress). The remediation is an in-pod nftables filter the husk-stub programs in the pod’s own netns at activation, and it is now VERIFIED working inside a real restored VM by the husk-network KVM cluster e2e (see the status note below). The datapath is:
- The in-pod nftables filter is the guarantee (CNI-independent). The
husk-stub brings up the VM’s tap and installs a default-deny egress chain in
the husk pod netns at activation (
internal/husk/netfilter.go, applied ininternal/husk/stub.go), reusing the raw-forkd dataplane rendering (internal/netconf: per-tap chain,ip saddranti-spoof) and the controlled DNS proxy (internal/dnsproxy: name-allowlist resolution + IP pinning). Because the tap is in the POD’s netns, this holds regardless of the cluster CNI. It needs exactlyNET_ADMINscoped to the pod netns (surface 1, ADR 0006). - The cloud-metadata block is unconditional.
netconf.RenderMetadataBlockemits a hard drop for169.254.169.254, the169.254.0.0/16link-local range (covers ECS task metadata), and IPv6fd00:ec2::254BEFORE any allow rule and regardless of the template policy (evenEgressAllow), so the allowlist can never reach the metadata endpoint and a guest cannot steal the node’s cloud IAM credentials. - The guest agent configures its NIC via rtnetlink, not an
ipbinary. The guest agent brings up its owneth0(address, route) through rtnetlink syscalls (internal/guestnet), not by shelling out to anipbinary, so the datapath works on an arbitrary OCI-image rootfs that may carry no networking tools. - Name-based egress runs an in-pod DNS proxy with a failover upstream list.
The in-pod DNS proxy resolves ONLY allowlisted names and pins each resolved IP
into the per-tap nftables allow set, forwarding to a comma-separated upstream
resolver list tried in failover order. The recommended value is the public pair
1.1.1.1:53,8.8.8.8:53, deliberately NOT cluster DNS, so an untrusted sandbox cannot resolve internal service names. The proxy also REFUSES to pin a resolved address in any non-publicly-routable range (RFC1918, IPv6 ULA, loopback, link-local, multicast, unspecified, RFC6598 CGNAT, deprecated site-local fec0::/10) and strips it from the answer, so an allowlisted name whose authoritative DNS an attacker influences cannot be rebound at internal cluster services or node-local targets (DNS-rebinding-to-internal defense in depth, complementing the unconditional IMDS block). The check also decodes IPv6 addresses that embed an IPv4 target inside the NAT64 well-known prefix (64:ff9b::/96) or the 6to4 prefix (2002::/16) and applies the same policy to the embedded IPv4, so on a DNS64/NAT64 cluster a wrapper such as64:ff9b::a9fe:a9fe(NAT64 of169.254.169.254) is refused while a wrapper of a public IPv4 stays allowed. RESIDUAL: a non-default, operator-configured NAT64 prefix (other than the well-known64:ff9b::/96) is not yet decoded; clusters using a custom NAT64 prefix should set it via the resolver config once that knob lands. - Guest-to-pod-local traffic is filtered on the INPUT hook (husk path). The
per-tap forward chain governs transit egress, but a packet the guest sends to
an address LOCAL to the pod netns (the tap gateway, the resolver, the
husk-stub sandbox API on :9091 and mTLS control on :9443) is delivered on the
kernel input hook, which a forward-only filter never evaluates. So without an
input rule a guest could reach those pod-local listeners regardless of egress
policy (their own auth gates limited impact, but the egress allowlist offered
no protection and any future in-pod listener was exposed).
applyEgressFilternow also installs an input-hooked base chain plus a per-tap input chain (netconf.RenderSharedInputTable/RenderSandboxInputChain) that accepts the guest only to the resolver on udp/tcp 53 and drops every other guest-sourced packet to a pod-local address. The base chain policy is accept so non-sandbox input (kubelet probes, the controller’s mTLS dial arriving on the pod uplink) is untouched; isolation is via the per-tap dispatch jump. This is on the HUSK path only: the filter lives in the isolated pod netns, whereas the raw-forkd path runs in the node netns where a node-wide input hook is not added (raw-forkd’s host-local exposure is tracked separately). Renderer and wiring are unit-tested (TestRenderSandboxInputChainBlocksGuestToPodLocal,TestApplyEgressFilterInstallsInputGuard); end-to-end enforcement inside a real restored VM is gated by the husk-network KVM e2e (test/cluster-e2e/husk-network-e2e.sh) which MUST be green before merge. - The per-template allowlist is threaded husk-side.
huskNotifyNetworkdelivers the fixed in-pod /30 plus the in-pod resolver, andhuskEgressConfigcarries the template egress policy + allowlist in the activate request (internal/controller/sandboxclaim_controller.go,husk.ActivateRequest.Egress/Allow). IP:port entries become static chain accepts; name entries are resolved + pinned by the in-pod DNS proxy. - First-class network-posture knobs (issue #219): block_network, CIDR
allowlists, deny-by-default inbound. The per-sandbox
NetworkPolicy(api/v1, threaded throughhusk.ActivateRequestand the protoNetworkConfig) now also expresses, in addition to the egress policy + name allowlist above:blockNetwork(total deny). When set, the per-tap chain drops ALL egress, v4 and v6, with NO accept of any kind (not even established/related): the Modalblock_network=Trueprimitive for a sandbox that must never reach the network. It overrides the egress policy and every allowlist. Rendered bynetconf.RenderSandboxChainSpec(the block branch emits only the metadata block and the v4/v6 drops).allowCidrs(CIDR egress allowlist). A destination IP inside an allowed block is accepted, v4 saddr-pinned and v6 family-scoped, exactly like the static IP:port accepts. The CIDR list is parsed fail-closed (netconf.ParseCIDRList): a malformed CIDR fails the whole activation rather than silently dropping the rule, so a sandbox never comes up with a partially applied allowlist.- Deny-by-default INBOUND (
inbound,inboundCidrs). The input-hook chain (above) already dropped all guest-sourced pod-local traffic except DNS; the secure default is now made explicit asinbound: deny.inbound: allow(optionally narrowed toinboundCidrssource blocks) accepts unsolicited inbound to the guest IP for a sandbox that intentionally hosts a listener. Return traffic for the guest’s own egress is always accepted via the forward chain’s established,related rule, so deny-by-default inbound never breaks the guest’s outbound flows. The SECURE DEFAULT for an untrusted sandbox, applied by the SDK and the sandbox-server when no policy is supplied, is deny-by-default in BOTH directions: egressdeny(no allows) and inbounddeny. The rule rendering of each new dimension is unit-tested ininternal/netconf(TestRenderSandboxChainBlockNetwork,TestRenderSandboxChainCIDRAllowlist,TestRenderSandboxInputChainAllowCIDR, and the deny-by-default cases); the real in-VM packet enforcement is KVM-gated by the existing husk-network e2e (the new dimensions reuse the SAME proven datapath: per-tap dispatch,ip saddranti-spoof, terminal drop).
- Per-sandbox egress byte counter (the #211 metering seam). Each sandbox’s
chain carries a named nftables counter (
sb_<tap>_egress) incremented on every guest-sourced egress packet at the top of the chain, so the metering pipeline (#211) reads per-sandbox egress bytes by name (netconf.NftReadEgressCounterArgsParseEgressCounterBytes, surfaced viametering.Sample.EgressBytes). It is a passive counting rule with no verdict, so it never changes enforcement; it is a usage-accounting and abuse-signal source, not a security control.
- SNAT masquerade plus IPv4 forwarding let allowed traffic egress and return.
An nftables SNAT masquerade scoped to the guest source address, plus IPv4
forwarding enabled in the pod netns, let allowed traffic reach the internet and
return.
net.ipv4.ip_forward=1is set in the pod netns by a SHORT-LIVED PRIVILEGED INIT CONTAINER on the husk pod (nameenable-ip-forward), which writes the sysctl in the shared pod netns and exits before the workload runs. This needs NO node/kubelet change. It is added ONLY when name-based egress is configured (controller flag--husk-dns-upstreamset). It runs in the privileged-PSA namespace husk already requires (forNET_ADMINand hostPath), so it adds no new operator prerequisite. Blast radius, stated honestly: it is privileged but SHORT-LIVED (runs once, sets one sysctl, exits), versus a privileged long-lived workload; the long-lived husk-stub container itself stays unprivileged (NET_ADMINonly, via the device plugin for KVM, notprivileged: true). It is gated to name-egress pools only. - A best-effort Kubernetes NetworkPolicy adds defense in depth. The
controller now creates one
networking.k8s.io/v1NetworkPolicy per pool selectingmitos.run/husk=truewith default-deny egress, a DNS allow, and one egress rule per enforceable IP:port allow, owner-referenced to the pool for GC (internal/controller/husknetworkpolicy.go). HONEST CNI caveat: a NetworkPolicy only enforces on a CNI that implements it, so it is defense in depth ONLY; the in-pod nft filter above is the guarantee that holds with no CNI policy at all.
Status: mitigated (IMPLEMENTED and KVM-VERIFIED end to end). The in-pod
default-deny filter, the unconditional metadata block, the threaded per-template
allowlist, the per-pod DNS proxy, and the scoped NET_ADMIN are all implemented,
the controller creates the best-effort NetworkPolicy, and the husk-network KVM
cluster e2e (test/cluster-e2e/husk-network-e2e.sh, the cluster-husk-network-e2e
suite) now PASSES on the Hetzner Talos KVM cluster. The claim reaches Ready, the
pool warms a dormant husk pod, and all three in-VM enforcement assertions are
green inside a real restored VM: metadata-blocked (connect to 169.254.169.254
fails, exit 1, no IAM theft), default-deny (a non-allowlisted host fails to
connect, exit 1), and allowlist-works (the allowlisted name example.com IS
reachable, connect exit 0). It was verified TWICE: once with a node kubelet
sysctl allowance in place (proving the datapath), then again with that node change
fully reverted (the probe confirmed SysctlForbidden), proving the feature needs
NO node prerequisite (the enable-ip-forward init container sets the only sysctl
required, in the pod netns). The raw-forkd nftables egress remains CI-proven
in-VM on KVM (section 4) and is unchanged; it gains the same unconditional
metadata block because the rendering is shared. The earlier hand-applied allow-all
husk-egress NetworkPolicy in .github/workflows/ci.yaml “Conformance 3” proved
nothing about restriction and is superseded by the controller-emitted object plus
the in-pod filter. Residuals: NET_ADMIN is a documented PSA exception (surface
1, ADR 0006), and on name-egress pools the enable-ip-forward init container is a
short-lived privileged surface (one-shot, sets one sysctl, exits; gated to
name-egress pools; the long-lived husk-stub container stays unprivileged). The
trade is strongly positive: it closes default-open egress and node IAM-credential
theft for one scoped capability in an already-unprivileged pod plus a one-shot
privileged init step, and it is now KVM-verified end to end.
Surface 6: the IN-POD SANDBOX API (exec and files). After activation the husk
stub serves the SAME internal/daemon.SandboxAPI forkd serves, IN the pod, on
the sandbox port; the claim’s Status.Endpoint is podIP:sandboxPort. Every
exec/files request is gated by the per-sandbox bearer token (32-byte
crypto/rand, constant-time compare, fail-closed: a sandbox with no registered
token rejects everything). The token is delivered to the stub over the mTLS
control channel (surface 2), never logged, never in argv. CI-proven: a tokened
HTTP exec reaches the guest over vsock and an UNTOKENED or WRONG-TOKEN request is
rejected with the token value absent from host-side logs (slice 2,
internal/husk TestActivateServesTokenGatedSandboxAPI, and the KVM
network-activation phase over the real endpoint; section 3 row below). Residual:
tokens are static per sandbox (no rotation or expiry); anyone with namespace-wide
Secret read can take them (section 3).
Serving-workload start is host-trusted only (issue #460). StartWorkload (the
serving-workload RPC the build uses to start a long-running app in its own session,
captured running in the snapshot) is on the host-trusted Control service (vsock,
reached only by forkd during the template build), NOT the public Sandbox service.
A tenant exec cannot call it, so it cannot start arbitrary detached processes that
escape the exec process-group kill or the per-fork SIGUSR2 reset outside its own
budget. The workload command and env are non-secret build-time config (env values
are non-secret; secrets are still injected per fork via Configure); the agent
logs only the workload session id and env-key count, never argv or values. The
per-fork SIGUSR2 reset no longer default-terminates a non-handler process: the
broadcast delivers only to confirmed SIGUSR2 handlers (the SigCgt bit in
/proc/<pid>/status), so a captured serving workload that installs no handler
survives the fork rather than being silently killed (issue #467). This closes a
self-DoS/availability hazard on the fork path; it adds no new escape surface (the
signal set only shrinks).
Surface 7: the ENCRYPTION KEY (#31 PR2). When --enable-encryption is on,
the per-template 256-bit key reaches the node ONLY over the mTLS control channel
(the CreateTemplate/Fork gRPC requests; the controller refuses to deliver the
key to a node whose connection is not mTLS, and forkd refuses to start encrypted
without its TLS flags), is held in node process memory while a container is open,
and is NEVER written to the node data disk (section 5: RequestKeyProvider,
key-not-on-disk proven by unit and envtest, key-never-logged enforced by grep in
CI). On the HUSK path the key reaches FORKD (the builder) ONLY, over the same
mTLS gRPC; forkd uses it to open the per-template LUKS container and the snapshot
is decrypted BELOW the page cache by forkd’s dm-crypt mount. The husk pod mounts
that mount’s PRE-DECRYPTED snapshot bytes read-only and NEVER receives the
encryption key: the key does not cross the controller-to-husk mTLS control channel
and is never present in the husk pod’s address space. So a compromised husk pod
cannot exfiltrate the template key. Residual, stated honestly: the IN-MEMORY KEY
WINDOW on the FORKD process. While a container is open the key is necessarily in
forkd’s process memory; a root attacker with a node-memory dump of FORKD while a
container is open recovers it. Zeroize-on-close is the current mitigation;
HSM/envelope custody is the follow-up.
Surface 8: EVICTION and DRAIN (slice 4b). A husk pod is an ordinary pod, so
it is subject to drain, eviction, preemption, and delete. A policy/v1
PodDisruptionBudget (<pool>-husk, minAvailable = max(1, Replicas-1)) BOUNDS
voluntary disruption to at most one warm slot at a time; a lost husk pod
re-pends the claim (Phase Pending, endpoint cleared) and the warm pool self-heals
a replacement; a drainPolicy governs an active sandbox (Kill re-pends,
Checkpoint snapshots the live VM first where the VMM still runs). CI-proven
object-level on kind-e2e-husk (slice 4b, section 6f of docs/husk-pods.md).
This is an AVAILABILITY surface, not a new ESCAPE surface: a drained or evicted
husk pod is gone, not escalated. The honest availability note vs the old model:
raw-forkd’s VMs were not pods and did not feel drains, but they also had no
bounded, self-healing disruption story; the husk model trades that for ordinary,
self-healing pod disruption with a documented budget. The live-VM
Checkpoint-on-drain actually SURVIVING end to end is bare-metal work (it needs
the VMM running in the husk pod on a KVM-capable kubelet).
Per-axis tally: old forkd vs husk pod
This compares the per-sandbox EXECUTION surface. forkd-the-builder (the privileged snapshot builder, run once per node per template) is NOT a per-sandbox surface and is discussed separately below the table.
| Axis | Old forkd (raw-forkd) | Husk pod | Verdict |
|---|---|---|---|
| Privilege | uid 0, privileged: false since #352 (jailer ENABLED; explicit builder cap set), allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault | privileged: false, runAsNonRoot: false (one of the two PSA-restricted exceptions, the /dev/kvm device one; the other is the read-only snapshot hostPath), no escalation | husk BETTER (both non-privileged now; husk adds far fewer caps and runs one VM per pod) |
| Capabilities | drop: [ALL], adds back the explicit builder set (SYS_ADMIN, CHOWN, SETUID, SETGID, MKNOD, NET_ADMIN; #352) | drop: [ALL], adds only NET_ADMIN | husk BETTER (raw-forkd retains CAP_SYS_ADMIN + uid 0 for the builder/jailer; husk adds one scoped cap) |
| Host FS access | hostPath to the node data dir (RW) | the snapshot mem/vmstate mount and kernel file are READ-ONLY, but the husk pod ALSO has WRITABLE node hostPaths: its per-pod rootfs CoW dir, the fork-snapshot dir, and (W4) the node CAS mounted READ-WRITE (huskCASMountPath, internal/controller/huskpod.go) so the stub can persist a revision | husk BETTER on the base image (read-only) but NOT read-only-only: a compromised husk pod can write the node CAS and its CoW dirs, bounded to the node data dir (see the W4 CAS row in section 3) |
Device access (/dev/kvm + kernel) | /dev/kvm + /dev/net/tun via the device plugin (no privilege; #352 removed the privileged hostPath) | /dev/kvm via device plugin (no privilege) | EQUAL on the inherent KVM/kernel escape surface; both now get the device via the plugin, not a privileged hostPath |
| Network governance | host-nftables in forkd’s netns (default-deny egress allowlist, CI-proven in-VM on KVM, but only on opt-in --enable-networking) | in-pod nftables default-deny in the pod’s OWN netns, applied by the husk-stub at activation (CNI-independent), with an unconditional cloud-metadata block and the threaded per-template allowlist, plus a best-effort controller-emitted NetworkPolicy; IMPLEMENTED and KVM-VERIFIED end to end by the husk-network KVM cluster e2e (surface 5) | EQUAL and KVM-VERIFIED: husk enforces default-deny egress in-pod regardless of CNI (the same posture as raw-forkd) and blocks the metadata endpoint unconditionally, proven inside a real restored VM (the three in-VM assertions are green and the feature needs no node prerequisite, verified twice) |
| Secret + key delivery | mTLS gRPC to forkd | tenant secrets + token over the mTLS control channel to the pod (controller-identity authz, never on disk); the per-template ENCRYPTION KEY never reaches the husk pod at all (it goes to forkd, which serves the pre-decrypted snapshot via dm-crypt) | EQUAL/BETTER (same mTLS anchor; enc key never enters the husk pod, in-memory-window residual is on forkd only) |
Honest conclusion: on the privilege and capabilities axes the husk model is
clearly BETTER. On the host-FS axis it is better for the BASE IMAGE (read-only)
but it still holds writable node hostPaths (the CoW dirs and the read-write node
CAS), so it is not the read-only-only surface earlier claimed. On the NETWORK
axis the husk default now MATCHES opt-in raw-forkd and is KVM-VERIFIED: it
enforces an in-pod default-deny egress filter in the pod’s own netns
(CNI-independent), plus an unconditional cloud-metadata block and the threaded
per-template allowlist (surface 5, section 4), the same posture raw-forkd has
with --enable-networking, and it adds the metadata block raw-forkd now also
gains because the rendering is shared. This is IMPLEMENTED and KVM-VERIFIED end
to end: the husk-network cluster e2e PASSES on the Hetzner Talos KVM cluster with
all three in-VM enforcement assertions green (metadata-blocked, default-deny,
allowlist-works), verified twice (once with and once without a node sysctl
allowance, proving NO node prerequisite). It costs one scoped capability
(NET_ADMIN in the pod netns, ADR 0006) plus, on name-egress pools only, a
short-lived privileged enable-ip-forward init container (one-shot, sets one
sysctl, exits; the long-lived husk-stub container stays unprivileged). The
residuals named (shared read-only snapshot mount, in-memory key window) stand.
On the inherent /dev/kvm-and-kernel axis the two are EQUAL: a KVM or host-kernel
bug reachable from a /dev/kvm-holder is the same risk in both models, and the
device plugin removes the privileged requirement, NOT that attack surface. The
per-sandbox EXECUTION surface is therefore IMPROVED on privilege and
capabilities, the inherent microVM-host-escape risk is UNCHANGED, and the husk
EGRESS surface now MATCHES raw-forkd (in-pod default-deny + metadata block) and is
KVM-verified end to end.
Separately,
forkd-the-builder remains a
PRIVILEGED control-plane surface (root, CAP_SYS_ADMIN, /dev/kvm, the jailer),
but it is SMALLER than the old per-sandbox surface: it runs the BUILD path once
per node per template, not on every sandbox execution, so the privileged surface
is confined to the build path and amortized across all sandboxes a template
serves. Removing forkd’s privilege entirely (a builder redesign) is out of scope.
RESOLVED gaps the review surfaced (were must-fix-first; now mitigated):
- HUSK EGRESS is now IMPLEMENTED and KVM-VERIFIED end to end (was the top
blocker): the husk-stub enforces an in-pod default-deny egress filter in the
pod’s own netns with an unconditional cloud-metadata block and the threaded
per-template allowlist, plus a best-effort controller-emitted NetworkPolicy
(surface 5, section 4). FIX/PROOF: the husk-network KVM cluster e2e
(
test/cluster-e2e/husk-network-e2e.sh,cluster-husk-network-e2e) PASSES on the Hetzner Talos KVM cluster with all three in-VM enforcement assertions green (metadata-blocked, default-deny, allowlist-works), the claim reaches Ready, and the feature was verified twice (once with and once without a node sysctl allowance), proving NO node prerequisite. The warm-pool over-create that blocked activation (below) is fixed by the husk pod readiness probe. - DEHYDRATE-ON-DELETE FINALIZER HOT-LOOP is FIXED. A
Sandboxbound to an already-deletedWorkspacenow treats the missing workspace as a TERMINAL no-op: it drops the finalizer and the sandbox is garbage-collected instead of retrying the dehydrate step forever. Proven by envtest. - HUSK WARM POOL OVER-CREATE is FIXED by the husk pod readiness probe gating dormant-Ready on the control listener: a dormant pod is only counted Ready once its control listener is up, so the pool no longer creates an unbounded number of never-Ready pods toward a deficit it cannot close.
The husk default-SA-token automount, the fail-open gRPC default, and the missing host-side vsock read deadline that this list previously carried are now SHIPPED (mitigated); see the per-row table entries above for the mechanism and the test.
Accepted residuals (tracked, see ROADMAP W1 #18):
- the SHARED READ-ONLY SNAPSHOT MOUNT across husk pods on a node (read-only, integrity-verified, non-tenant base image; fully pod-native CAS delivery is the follow-up);
- the WRITABLE node hostPaths a husk pod holds (per-pod CoW dirs and the read-write node CAS, bounded to the node data dir; see the W4 CAS row in section 3 for the unbounded-disk and cross-tenant-destruction concerns);
- the
/dev/kvmINHERENT host-escape surface (unchanged from raw-forkd; inherent to any Firecracker host); - the IN-MEMORY ENC-KEY WINDOW while a container is open (HSM custody is the follow-up, #31);
- the FORKD-BUILDER PRIVILEGE (it stays the privileged builder; a builder redesign is out of scope);
- the live-Checkpoint-on-drain survival, proven only on a KVM-capable kubelet (a bare-metal reference node, #16).
The device-plugin surface itself is in section 3; the per-mode networking reconciliation is in section 4; the encryption-key custody is in section 5.
1. Guest → host escape
The primary boundary is KVM hardware virtualization via Firecracker.
| Control | Status | Detail |
|---|---|---|
| Firecracker microVM (minimal device model) | mitigated | Each sandbox is a separate Firecracker process with its own KVM VM (internal/fork/engine.go). |
| virtio-rng device (new guest device) | mitigated | Every template snapshot bakes a Firecracker virtio-rng device backed by the HOST RNG (firecracker.Client.SetEntropy -> PUT /entropy, VMConfig.EntropyDevice default-on), so each forked guest has a continuous entropy source in addition to the per-fork NotifyForked CRNG reseed (fork-correctness row 1). The device is a guest -> host surface: it adds the Firecracker virtio-rng emulation path. The flow is one-directional and read-only from the guest’s view (the guest draws random bytes; it supplies nothing the host acts on), draws from the host CSPRNG (crypto/rand-grade kernel RNG), is attached unthrottled but carries no secret material, and crosses no new boundary beyond the existing Firecracker device model (same KVM/Firecracker and unprivileged-husk-pod containment as the rootfs and vsock devices). The host never reads guest-supplied bytes through it. The continuous source also removes the failure mode where a long-running fork could starve its CRNG between reseeds. The config/JSON-builder is darwin-unit-tested; the live device is exercised in the KVM fork-correctness phase (distinct UUID and TLS client random across forks). |
| Jailer (dedicated UID, chroot, cgroup, namespaces per VM) | mitigated as shipped (jailer ENABLED + forkd non-privileged, #352); the kernel-enforced capability/uid drop is proven in the KVM CI jailer run (issue #2 Task 5) | The jailer IS implemented (internal/firecracker/jailer.go, client.go:startJailedVM): a dedicated uid/gid per VM from --uid-range (default 64000-64999; uid 0 refused), a per-VM chroot under --chroot-base containing only the explicitly hard-linked kernel, rootfs, and snapshot files (a traversal guard refuses anything outside the data dir and the VM workspace), and cgroup v2 attachment; ids are validated at the gRPC boundary (internal/daemon/validate.go) and the launch path refuses ids whose jailer dirs would escape the chroot base. The SHIPPED DaemonSet (deploy/daemon/daemonset.yaml) now PASSES the jailer flags (--jailer/--chroot-base=/var/lib/mitos/jailer/--uid-range=64000-64999) and runs NON-privileged with the explicit builder capability set (section 3, forkd capability minimization row), so every build/raw-forkd VMM runs UNDER a throwaway jailed uid in a per-VM chroot: a VMM compromise lands as that disposable uid, not forkd’s root. forkd makes --chroot-base a private mount point at startup (cmd/forkd/prepareChrootMount) so the jailer’s pivot_root works inside the non-privileged pod. The direct-exec dev path still remains when --jailer is omitted (forkd logs a loud warning; standalone sandbox-server always runs unjailed). The standalone sandbox-server previously validated only that ids were non-empty, so a crafted template/sandbox id from its REST body (req.ID/req.Template) flowed unsanitized into host paths (templates/<id>/..., sandboxes/<id>/...); CodeQL flagged this as go/path-injection. It is now closed at two layers: the REST handlers (cmd/sandbox-server handleCreateTemplate/handleFork) reject any id that is not a single safe path segment (idPattern, the same ^[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}$ allowlist as the gRPC validateSandboxID) with a 400, and the engine re-validates as defense in depth at every entry point both callers traverse (Engine.Fork, Engine.CreateTemplate, Engine.Terminate, mirroring the existing Pause guard), so a .. or separator can never reach a filepath.Join. The conformance test cmd/forkd TestShippedDaemonSetEnablesJailer keeps the shipped flags from regressing; the kernel actually enforcing the uid/cap drop is asserted on the KVM runner (the CI host is root and cannot observe the bounding-set shrink). |
| Seccomp on the VMM process | mitigated (enforced + asserted, #353) | The VMM runs Firecracker’s default production seccomp BPF filter on every launch path (engine fork, raw-forkd, and husk activate): Firecracker installs it on all VMM threads UNLESS --no-seccomp is passed, and Mitos never passes it. This is now a CHECKED invariant, not an implicit one: internal/firecracker.assertSeccompEnforced runs on the final argv of BOTH the direct-exec and jailer launch paths (client.go) and FAILS CLOSED if any future flag or refactor ever threaded --no-seccomp in, so the VMM can never silently come up with its full syscall surface (unit-tested in seccomp_test.go). The enforcement is PROVEN on real KVM: the kvm-test.yaml “Seccomp filter enforced on the Firecracker VMM” step (load-bearing, not continue-on-error) BOOTS a microVM (Firecracker installs its per-thread filters at vCPU start, not at API-server start) and asserts a VMM thread reports Seccomp: 2 (SECCOMP_MODE_FILTER) in /proc/<pid>/task/*/status, so a disallowed syscall is blocked, while a --no-seccomp boot leaves every thread mode 0, proving the assertion distinguishes enforced from absent rather than being vacuous. Custom tightened profile: EVALUATED and declined (docs/security/vmm-seccomp.md); Firecracker’s upstream “advanced” filter is already scoped to the syscalls the VMM needs and is maintained and tested against each Firecracker release, so a hand-rolled tighter filter would add a brittle, version-coupled maintenance surface (a too-tight filter SIGSYS-kills a legitimate VMM on a kernel/libc/FC change) for no demonstrated reduction; the decision is to assert the upstream filter holds rather than fork it, and to pass an explicit --seccomp-filter only if a concrete required-syscall delta is ever found. |
| CVE posture / version pinning | partial | CI pins Firecracker v1.15.0; there is no documented update policy or advisory tracking. |
Guest agent as attack surface (Rust, guest/agent-rs) | partial (gRPC surface; sole production agent since Phase E #310) | The Rust agent is the SOLE production guest agent: baked as /init by guest/rootfs/build.sh and kvm-test.yaml. It serves ONLY gRPC on vsock port 53. SP1.5 is merged and the Go agent and legacy JSON protocol (port 52) are removed. All host callers (fork, daemon, husk, smoke binaries) speak gRPC. The full firecracker-test suite (exec-via-vsock, fork-correctness, ws-smoke, SDK example, etc.) boots the Rust agent. No fuzzing of the gRPC surface yet. SECURITY-SENSITIVE: requires named human reviewer before any PR touching guest/agent-rs/src/sys/, guest/agent-rs/src/fork/, guest/agent-rs/src/init/mod.rs, or guest/agent-rs/src/main.rs is merged. See docs/security-review-policy.md. |
In-guest self-service socket (MITOS_SOCKET, issue #22) | mitigated (in-guest only; no new host surface) | The Rust guest agent serves a small JSON protocol over a unix socket INSIDE the VM at MITOS_SOCKET (default /run/mitos.sock, internal/guestsock) so the in-VM workload can read its OWN identity and budget (and, once issue #25 wires it, request a budget-gated fork). The socket is reachable only from inside the guest: it is a unix socket on the guest’s own tmpfs, NOT exposed on the tenant-facing HTTP sandbox API and NOT bridged to the host vsock, so it adds NO new host boundary; a caller of it is already running inside the untrusted guest at the same privilege as any exec child (KVM/Firecracker and the unprivileged husk pod, section 0, bound it). The info response carries NAMES and budget NUMBERS only (sandbox id, claim, pool, workspace names; fork caps), assembled by a key-whitelisting handler from the delivered env; no secret VALUE ever crosses it (the handler reads only the MITOS_* identity/budget keys, never the secret keys also present in the guest env). The host advertises the path and the sandbox’s own id via the existing best-effort env-delivery channel (withSelfServiceEnv, internal/daemon/server.go); a listen failure is non-fatal so exec/files over vsock are unaffected. The same MaxMessageBytes line cap applies. The fork verb is not yet enabled: it returns an LLM-legible not-enabled error naming the orchestrator escalation path until the budget ledger (issue #25) is wired. |
| Host resource exhaustion (memory + sandbox count) | mitigated (production-blocker #2) | Three host-DoS dimensions are now capped, each as an O(1) admission/ceiling/sizing check OFF the warm-claim activate/fork hot path so they do not regress the warm-claim latency. (1) Husk pod memory. A husk pod previously carried a memory REQUEST only and no LIMIT, so a tenant VM could grow without bound and OOM the node. The controller now sets a memory LIMIT sized = request + headroom (internal/controller/huskpod.go, huskMemoryLimit), headroom = max(--husk-memory-headroom floor, default 256Mi; --husk-memory-headroom-percent of the request, default 25%). The headroom is load-bearing: the cgroup the limit caps holds the Firecracker VMM, the husk-stub, and CoW dirty-page slack ON TOP of the guest RAM, so a too-tight limit would OOM-kill a normally-running VM (which is why the limit must exceed the request); the headroom keeps the limit transparent to a legitimate VM while capping a runaway. The kubelet enforces the cgroup; the controller never throttles the running VM. cpu is deliberately left WITHOUT a limit (cpu throttling would hurt the activate latency); cpu stays requests-only for scheduler truth. (2) Per-node sandbox count. The engine reported MaxSandboxes in GetCapacity but never enforced it at Fork, so a runaway tenant could exhaust a node by opening forks. Engine.Fork now refuses with the typed ErrAtCapacity once the live count reaches --max-sandboxes (internal/fork/engine.go, admitFork), BEFORE any verify, allocation, or Firecracker boot, mapped to gRPC RESOURCE_EXHAUSTED for the controller; 0 disables it. (3) Concurrent streams per sandbox. Capped via --max-streams-per-sandbox (see the Connect runtime exec/run_code and interactive Exec rows). Residuals: the memory headroom defaults are sized conservatively but are operator-tunable, not derived from a measured VMM+CoW profile per template (raise the floor if pods are OOM-killed at their configured RAM); the sandbox-count ceiling is per-node, not a global tenant quota; there is no per-tenant fair-share across sandboxes yet (a tenant with many sandboxes still consumes proportionally). |
| Hosted gateway quota + abuse + kill-switch enforcement (issue #341) | mitigated (in-process; durable suspension store is a follow-up) | The hosted gateway no longer admits every request: cmd/gateway wires the real quota.Enforcer (via quota.GatewayAdapter) in place of the AllowAllQuota stub, default-on for the hosted profile (--enforce-quota, overridable). On the sandbox.create path the enforcer checks in FAIL-CLOSED order: the kill-switch suspension store FIRST (a suspended org is refused before any tier or usage logic, and a store-read error is a DENY, never a silent allow), then tier resolution (an unknown or absent plan resolves to the TIGHTEST free tier, never unlimited), then per-sandbox size, live concurrency and aggregate caps, and per-org plus per-IP rate buckets. The SAME suspension store is written by both the abuse-signal path and the billing suspender adapter (quota/billing_adapter.go), so an abuse flag or a billing past_due/spend-cap event suspends the org and the gateway then blocks it (proven end to end in cmd/gateway and quota tests). ErrSuspended maps to 403 naming the reason without leaking internal detail; an over-quota or rate denial maps to 429 with the limit and tier; no message leaks another org’s data. The client IP for the per-IP buckets is resolved by internal/saas/clientip with the gateway’s trusted-proxy-hops model (RemoteAddr unless --trusted-proxy-hops > 0; a caller-set leftmost X-Forwarded-For is never trusted and a too-short list fails closed to RemoteAddr), so a spoofed X-Forwarded-For cannot bypass per-IP limits (asserted in gateway_clientip_test.go). The org id is taken only from the gateway-verified request. The acceptable-use policy (docs/legal/acceptable-use-policy.md, a counsel-reviewable DRAFT scaffold) is the policy basis referenced from the kill-switch code path. Residual: the suspension store is in-process (MemSuspensionStore), so suspensions do not survive a gateway restart and are not shared across replicas; run a single gateway replica for the kill-switch to be authoritative until the durable replica-shared store lands (a clean follow-up behind the same interface). The front-door LiveUsage source is a conservative zero-footprint default today (the authoritative live concurrency and aggregate re-check is at the control plane, which sees the real running set); wiring the real cluster usage counter into the front door is a documented follow-up. |
| Dynamic CPU pinning + launch scheduling priority (issue #168) | partial (within-cpuset, no new cross-tenant surface; affinity wired, RT class gated) | Opt-in per pool via SandboxPool.spec.cpuPinning. After a fork’s guest is ready, the node pins that fork’s Firecracker vCPU threads to physical core(s) via sched_setaffinity and, during the activate window, bumps their scheduling priority then drops it after ready (internal/cpupin). The pin is applied WITHIN the husk pod’s cpuset: the topology the planner uses is the pod’s allowed CPU set, so a pinned thread can only land on a CPU the pod cgroup already grants. It adds NO new cross-tenant surface and does NOT regress the CoW memcg accounting (#33): affinity (which CPU a thread runs on) is orthogonal to memory accounting (which memcg pages are charged to), and the husk pod still owns both the cpuset and the memcg. The pin is best-effort and fail-open: a pin failure leaves the fork unpinned (floating in the pod cpuset, correct just less dense), never failing the fork. The applier is Linux-gated; the darwin stub is a no-op, so nothing is pinned off Linux. Residuals: the launch-window bump is a nice-level decrease today; the true SCHED_FIFO real-time class switch needs CAP_SYS_NICE in the pod and a host RT runtime budget and is the gated bare-metal follow-up (an unbounded RT priority would itself be a node-DoS surface, which is why it is gated, not shipped). The planner reads the node topology rather than the pod’s exact cpuset today; reading the pod cpuset so the within-cgroup property is enforced in code, not only by construction, is tracked. No density or activate-success number is measured yet; all figures in docs/perf/cpu-pinning.md are TARGETS until run on #16. |
| userfaultfd memory restore backend (issue #167) | mitigated (no new trust boundary; reads only the verified snapshot, writes only registered guest memory) | Hugepage-backed restore and snapshot-resume prefetch restore guest memory through a userfaultfd backend instead of a file mmap (internal/fork/uffd_linux.go). Per restored VM, forkd binds a PRIVATE per-fork unix socket under the sandbox dir and points Firecracker at it via mem_backend on /snapshot/load; Firecracker connects, creates the userfaultfd over the guest memory, and sends the handler the region mappings (JSON) plus the uffd descriptor (SCM_RIGHTS). The handler then services page faults with UFFDIO_COPY, sourcing bytes from the snapshot mem file it mmaps READ-ONLY. Surface analysis: (1) it introduces NO new external/tenant input: the only data it parses is the region-mapping JSON Firecracker itself sends over the private socket (not tenant-reachable; the guest cannot speak to it), and the page bytes come from the SAME mem file the file-backed restore already maps and that the content-addressed, verify-on-load manifest already covers; (2) it makes NO host-path write: it only fills guest pages Firecracker registered, with bytes from the verified snapshot, so it reads nothing the lazy path would not read anyway, just sooner; (3) the preloaded hot-page set is part of the manifest digest, so a tampered set fails verify before any restore (the prefetch design doc reconciles this with the #33 CoW story: prefetched shared pages still count once across forks); (4) it requires a CONFIG_USERFAULTFD=y kernel and a hugepage pool (surfaced by mitos doctor and docs/platforms/host-prerequisites.md). Residuals: the handler runs in forkd’s address space (a handler bug is a forkd bug, same blast radius as the rest of the restore path, not a new privilege); the per-VM socket lives under the node data dir bounded like the other per-fork artifacts; the syscall path (UFFDIO_COPY, SCM_RIGHTS recv) has unit-tested pure arithmetic and is validated end-to-end only on a userfaultfd-capable kernel (the Hetzner rescue kernel lacks it), so its KVM integration proof rides the same KVM-gated path as the rest of internal/fork. |
| Workspace tar transfer (W4 hydrate/dehydrate) | mitigated | The Rust guest agent serves tar_dir/untar_dir over the gRPC vsock channel; these are NOT exposed on the tenant-facing HTTP sandbox API and are called only by the controller workspace lifecycle. UntarDir (host tar bytes into the guest) rejects absolute, .., and out-of-target members with an anchored-separator prefix check, and refuses every non-regular member (symlinks, hardlinks, devices) before any write, so a crafted workspace revision cannot write outside /workspace or escape through a symlink. TarDir is allowlisted to /workspace only and does not follow symlinks out of it. The dehydrate excludes credential paths (.ssh, .aws, .netrc, .git-credentials, .config/gh, .npmrc) and secrets live only in the guest’s in-memory env, never on disk under /workspace. Both directions enforce a 64MiB MaxTarBytes cap with a per-file io.LimitReader. Residuals: a guest already running as root could hardlink an on-disk file into /workspace to capture its bytes into a revision (not a cross-VM escape; secrets are in-memory so unaffected); there is no per-transfer member-count cap yet (a low-severity local DoS against a compromised sandbox), to be addressed with the streaming-tar slice. |
2. Guest → guest
| Control | Status | Detail |
|---|---|---|
| Separate KVM VMs per sandbox | mitigated | No two sandboxes share a kernel. |
| raw-forkd: forks share ONE writable rootfs inode (cross-fork filesystem channel) | fixed (per-fork rootfs CoW) on raw-forkd | Previously, on the raw-forkd fork path the template rootfs was hard-linked into each fork’s chroot and the rootfs drive was attached with readOnly=false (internal/firecracker/template.go AddDrive("rootfs", templateRootfs, false, true)) and NEVER rebound, so all forks of one template on a node shared a SINGLE writable rootfs inode: a write by one fork was visible to its siblings (and across tenants, since snapshots are node-flat), a cross-fork filesystem read/write channel and a corruption vector. FIXED: internal/fork/engine.go now gives each fork its OWN copy-on-write clone of the template rootfs (prepareForkRootfs reflink-clones the template rootfs to <dataDir>/sandboxes/<id>/rootfs.ext4 through the SAME volume.Backend.ReflinkCopy owner the husk path uses), loads the snapshot PAUSED (resume=false), rebinds the baked rootfs drive to that per-fork clone with PatchDrive while the guest is frozen, then Resumes, exactly mirroring the husk Stub.Prepare/Stub.Activate fix (section 0 surface 3). The template rootfs is now the READ-ONLY CoW SOURCE; no two forks, and no fork and the source, ever write the same rootfs path. The per-fork clone is hard-linked into the jailer chroot (never the shared template) and reaped with the sandbox dir at Terminate. Proven by internal/fork/rootfs_cow_test.go (distinct backing paths, distinct inodes, and a write in one fork’s rootfs not visible in a sibling or the template); real-VM cross-fork isolation is KVM-gated (firecracker-test). Residual: raw-forkd is STILL not for untrusted multi-tenant for the OTHER reason below (node-flat snapshots); the privileged-DaemonSet and jailer-disabled reasons are closed since #352 (forkd is non-privileged with the jailer enabled). |
| CoW page sharing side channels | open | All forks of a snapshot share read-only pages via mmap(MAP_PRIVATE) of the same mem file. Flush+Reload-style attacks across forks of the same tenant’s snapshot are in scope to document; cross-tenant page sharing must be prevented by never sharing snapshot files across trust boundaries. Not yet enforced anywhere. |
| KSM | open | We must mandate KSM off on hosts (we control the reference platform). Not yet documented in any platform guide or checked by forkd at startup. |
| CPU vulnerability mitigations | open | Reference hosts (bare metal) must run current microcode with mitigations on; forkd should refuse or warn on /sys/devices/system/cpu/vulnerabilities red flags. Not implemented. |
3. Sandbox / forkd → cluster
forkd is the highest-value HOST component: uid 0 in a NON-privileged container
with an explicit, minimal capability set, /dev/kvm + /dev/net/tun via the
device plugin (not a privileged hostPath), and a hostPath to /var/lib/mitos,
on every KVM node. Since #352 it is the privileged BUILDER concentrated to that
audited cap set, not a privileged-and-long-lived daemon; its build/raw-forkd
VMMs run jailed under throwaway uids.
| Control | Status | Detail |
|---|---|---|
| controller ↔ forkd authn/authz (mTLS) | mitigated when deployed as shipped | The controller bootstraps an internal CA and per-identity leaf certificates as Secrets (internal/pki); forkd requires TLS 1.3 client certificates signed by that CA and authorizes only the controller.mitos SAN via unary AND stream interceptors; per-identity EKUs prevent the forkd server cert acting as a client. Residuals, explicitly: programmatic insecure construction remains for tests and for deployments that omit the TLS flags (forkd logs a loud warning); no certificate rotation yet; the CA private key lives in a namespace Secret readable by namespace secret-readers. |
| Sandbox HTTP API (exec/files, :9091) | mitigated | Per-sandbox bearer tokens are minted at claim time (32-byte crypto/rand), compared in constant time, and fail closed: a sandbox with no registered token rejects everything. Tokens are delivered to clients via claim-owned Secrets, never logged and never in status. On the husk-pod path (#18, slice 2, --enable-husk-pods) the SAME internal/daemon SandboxAPI and bearer-token gate runs IN the pod: after activation the husk stub registers the activated VM + the per-sandbox token and serves the gated exec/files API on the sandbox port, so Status.Endpoint (podIP:port) is reachable only with the token. The token is delivered to the stub over the mTLS control channel (the same channel as the activate secrets; never logged, never in argv), so it never crosses an unauthenticated wire. A husk pod serves exactly ONE VM, so the stub runs the SandboxAPI in SINGLE-SANDBOX mode (SetSingleSandbox, opt-in, set ONLY by cmd/husk-stub): the per-sandbox bearer token is the auth gate, validated against the pod’s one registered token regardless of the request’s sandbox id, then routed to that one VM. This is required because the SDK addresses the in-pod API with the claim’s status.sandboxID (the husk pod name), which never equals the stub’s fixed local id; a strict per-id lookup 401s every SDK request (the cluster-e2e bug). The gate is NOT weakened: a wrong/absent bearer is still rejected 401 (constant-time compare), and an activated-but-untokened sandbox still fails closed. forkd NEVER sets single-sandbox mode, so its multi-sandbox per-id lookup is byte-identical: a token for sandbox A cannot authorize sandbox B. The in-pod exec/files surface is the Connect sandbox.v1.Sandbox protocol (the legacy /v1 JSON runtime routes were removed in #358); the PTY upgrade (ptyAuth) AND the Connect bearer gate (connectLookupToken) BOTH resolve single-sandbox the same way (via resolveSandboxID), so cluster-mode exec/files over Connect against a husk pod is gated by the one registered token rather than 401’d by a strict per-id lookup. Proven in internal/daemon (TestSingleSandboxAcceptsArbitrarySandboxIDWithCorrectToken, TestSingleSandboxRejectsWrongOrAbsentToken, TestSingleSandboxNoTokenFailsClosed, TestMultiSandboxModeStillRequiresExactIDMatch, TestSingleSandboxPtyAuthIgnoresRequestID, TestConnectLookupTokenResolvesSingleSandbox, TestConnectLookupTokenMultiSandboxStaysStrict) and internal/husk (TestActivateSingleSandboxAcceptsSDKPodID). Residuals: tokens are static per sandbox (no rotation or expiry); anyone with namespace-wide Secret read can take them; standalone sandbox-server runs tokenless by explicit AllowTokenless design. Review finding (med/low): the :9091 sandbox API and its bearer tokens cross the POD NETWORK in cleartext HTTP (the in-pod and forkd sandbox API is plain HTTP, not TLS), so anyone who can observe the pod network sees the token and the exec/file traffic; an in-cluster TLS or service-mesh wrap is a follow-up. |
Runtime exec/files/run_code/vitals (Connect sandbox.v1.Sandbox) | mitigated (auth + concurrent-stream cap) | The runtime surface (exec, file IO, run_code, per-sandbox guest vitals) is served by the Connect sandbox.v1.Sandbox service on the same :9091 mux (internal/daemon/sandbox_api.go, internal/sandboxrpc); the legacy JSON /v1/exec, /v1/exec/stream, /v1/run_code/stream, /v1/files/*, and /v1/vitals routes were REMOVED once every SDK and kubectl-mitos moved to Connect (#358). AUTH is the per-sandbox bearer token enforced by the Connect BearerInterceptor (constant-time compare against the per-sandbox registered token, fail-closed; an untokened or wrong-token call is rejected CodeUnauthenticated before any guest RPC; AllowTokenlessInterceptor only on the standalone sandbox-server). The interceptor reads the token from the Authorization: Bearer header and the sandbox id from X-Sandbox-Id, never the body, and never logs or reflects a token value (tested in internal/daemon/connect_handler_test.go and internal/sandboxrpc/interceptor_test.go). Each streaming RPC opens a DEDICATED vsock gRPC connection to the guest, cancelled on client disconnect (the guest turns the cancel into a process-group SIGKILL). The per-sandbox concurrent-stream count is BOUNDED (production-blocker #2): a NEW stream over the --max-streams-per-sandbox ceiling (default 16) is rejected (too_many_streams); existing streams are never killed. The cap acquire/release logic is tested in TestStreamCapAcquireRelease/TestStreamCapConcurrent, and the HTTP-surface rejection on the interactive path in TestExecWSStreamCapRejected. run_code in-guest blast radius is unchanged from exec (see the code-interpreter subsection below). |
GET /sandbox.v1.Sandbox/Exec (interactive bidi Exec over a Connect WebSocket, issue #358) | mitigated (auth); accepted by design (in-guest blast radius, same as exec) | The interactive terminal is the sandbox.v1.Sandbox.Exec schema carried over a WebSocket so the thin half-duplex-over-HTTP/1.1 SDK clients can reach the full-duplex interactive (PTY) case that Connect’s HTTP/2-only bidi cannot serve them. The transport is the SAME endpoint path the Connect HTTP handler serves over HTTP/2; this handler (handleExecWS, internal/daemon/exec_ws.go) takes only the GET WebSocket upgrade (subprotocol connect.sandbox.v1) while the Connect handler keeps POST. Each WebSocket binary message carries ONE Connect enveloped frame (a 5-byte header then a protojson sandbox.v1.ExecRequest from the client, ExecResponse from the server); the client sends an open frame (with pty for a terminal), then stdin or resize, and receives stdout/stderr chunks then a terminal exit frame with the end-stream flag. SECURITY: the upgrade is a bodyless GET, so it is authenticated by ptyAuth (the per-sandbox bearer gate: ?sandbox= + Authorization: Bearer, constant-time compare, fail-closed with the SAME semantics as requireBearer: no registered token rejects 401 unless AllowTokenless; missing/malformed/mismatched token is 401, all BEFORE the upgrade, tested in TestExecWSRejectsBadToken/TestExecWSRejectsMissingToken/TestExecWSRejectsCrossSandboxToken/TestExecWSTokenlessAllowed, and the single-sandbox husk resolution in TestSingleSandboxPtyAuthIgnoresRequestID), mounted on the outer mux outside requireBearer; token values are never logged. The per-sandbox concurrent-stream cap is acquired via ExecPTY; because the open frame arrives only after the upgrade, a cap rejection surfaces as a terminal error exit frame plus a policy-violation WebSocket close rather than a pre-upgrade 429 (tested in TestExecWSStreamCapRejected). The handler takes NO command from the client for the interactive case (the guest defaults to /bin/sh); inbound frames are bounded to 1 MiB. The shell runs INSIDE the untrusted guest at the same privilege as any exec child, bound by the KVM/Firecracker boundary and the unprivileged husk pod exactly as exec is: NO new host boundary is crossed. Auditing records an exec_ws op with the pty flag only, never terminal contents (tested in TestAuditRecordsInteractiveExec). |
Connect runtime service audit + exec timeout ceiling (sandbox.v1.Sandbox, issue #358 / #216) | mitigated | The audit log and the exec/run_code timeout ceiling, which the legacy /v1 JSON handlers enforced, now apply to the Connect runtime path so they survive the /v1 retirement. AUDIT: a Connect interceptor (connectAuditInterceptor, internal/daemon/connect_audit.go) records ONE AuditEvent per runtime RPC (unary and streaming) AFTER it completes, carrying ONLY the op string (mapped from the procedure: exec, run_code, read_file, write_file, list_dir, stat, mkdir, remove, processes, vitals, signal, port_forward), the authenticated sandbox id, and OK. It NEVER reads or records the command, argv, env, file path, file content, stdin/stdout, or any token (a test marshals the whole event and asserts no command/argv/env appears); Detail and Bytes are deliberately unset on this path. The interceptor is wired INSIDE the auth interceptor (connect.WithInterceptors(authIC, auditIC), auth outermost), so the sandbox id it records is the one the bearer gate authenticated, not client input; the auditor is read lazily so SetAuditor governs it exactly as on the legacy path (NopAuditor = off, the default). Auditing never fails an RPC. CEILING: Service.MaxExecTimeoutSeconds (set from the daemon’s maxExecTimeout, default 86400s; 0 disables) is enforced in Service.Exec/ExecStream/RunCode/RunCodeStream (internal/sandboxrpc): a requested timeout_seconds over the ceiling is REJECTED before the guest stream opens with connect.CodeInvalidArgument naming the requested value and the ceiling, never silently reduced (issue #216), so a tenant cannot pin an exec for an unbounded time over Connect. Residual: the audit op is the RPC name plus the sandbox id and OK; per-op Detail (the command for exec, the path for files) is not enriched on this interceptor path yet (a follow-up), so the Connect audit is coarser than the legacy per-handler audit, but no runtime op is unaudited. |
POST /v1/sandboxes/{id}/forward (standalone guest-port forward, issue #228) | partial (standalone slice: loopback-only, per-sandbox cap, real-mode only; tokenless on the host listener, same as the rest of the standalone server) | The standalone sandbox-server can expose a guest TCP port to the host: the endpoint opens a host TCP listener and bridges every accepted connection over the guest’s gRPC PortForward stream (internal/sandboxrpc/portforward.go on the host, served by guest/agent-rs/src/service/portforward.rs in the guest) to the guest’s 127.0.0.1:<guest_port>, returning the host:port the caller dials (internal/daemon/forward.go, cmd/sandbox-server handleForward). It is a new host -> guest surface (a host listener that bridges into the guest), bounded as follows. (1) The host listener binds to LOOPBACK only (127.0.0.1:0), so it is reachable only from the host running the server, never from the network. (2) The guest dial is forced to LOOPBACK by the guest agent (guest/agent-rs/src/service/portforward.rs): the host carries only a bare port, the guest always dials 127.0.0.1:<port>, so the tunnel can never be steered to the guest’s other interfaces or back out to the host network; an out-of-range or non-loopback target is refused with an LLM-legible error, and a guest port with no listener returns a clean refused result (no hang). (3) The number of concurrent forwards per sandbox is CAPPED (SetMaxForwardsPerSandbox, default 16, mirroring the streaming-exec ceiling) so one sandbox cannot exhaust host sockets; each forward and all its tunnels are tracked and closed on UnregisterSandbox (terminate), so no host listener or tunnel goroutine outlives the sandbox. (4) It is REAL MODE only: mock mode returns a clean 501 (no guest TCP socket to bridge). The forward path carries NO auth on the host listener: this is the SAME tokenless trust model as the rest of the standalone sandbox-server (AllowTokenless, a single-tenant local server), NOT a new weakening; the tunnel bytes are application traffic and are never logged. The forkd/Kubernetes Service+Ingress routing and the CRD template/claim port-declaration fields are EXPLICIT follow-ups of #228 and are NOT built here, so this row covers the standalone surface only; auth on the forward path and UDP are also follow-ups. |
ANY /v1/sandboxes/{id}/expose/{port}/ (authenticated guest HTTP proxy, Mitos Expose slice 1, issues #230 and #312) | mitigated (forkd: per-sandbox bearer gate, constant-time, never logged; per-sandbox concurrency cap, 429 on excess; force-close-on-terminate; standalone: loopback tokenless by design; SSRF allowlist-of-one; streaming not buffered; internet-facing edge proxy, TLS, and subdomain routing are slice 2 and NOT part of this surface) | The expose route reverse-proxies an HTTP request (any method: GET for SSE, POST for agent RPC) to the guest’s 127.0.0.1:<port> over the existing vsock PortForward tunnel, streaming the response immediately (FlushInterval -1, no output buffering), so SSE sessions work without batching. Code: internal/daemon/expose_conn.go (net.Conn over the tunnel), internal/daemon/expose.go (ProxyHTTP), internal/daemon/expose_route.go (checkBearer, handleExpose, HandleExpose), internal/daemon/expose_cap.go (acquireExpose, SetMaxExposePerSandbox, CloseExpose, trackExposeConn). The surface is bounded as follows. (1) AUTH on forkd: the per-sandbox bearer token is validated by checkBearer with a constant-time compare (subtle.ConstantTimeCompare); a missing, malformed, or wrong token is rejected 401. The token VALUE is never logged and never appears in an error body; only the sandbox id and port are logged. checkBearer is mounted OUTSIDE the body-peeking requireBearer middleware so the auth gate never buffers the request body: a large or streaming POST body is not read before auth completes and is not held in memory. (2) STANDALONE: inherits the same loopback tokenless trust model as the rest of the standalone sandbox-server (AllowTokenless), the same as the /forward path; this is not a new weakening. (3) SSRF: the guest dial is forced to loopback by the guest agent (guest/agent-rs/src/service/portforward.rs): the host carries only a bare integer port parsed from the URL path, the guest always dials 127.0.0.1:<port> and refuses any non-loopback target or out-of-range port; the host never derives the dial target from any user-supplied input beyond the port integer, so there is no SSRF path to steer the tunnel to another guest interface, a host interface, or the host network. Port is range-checked to 1-65535 before the tunnel is opened. (4) Each request opens its own vsock tunnel and guest TCP connection and tears it down on close; there is no multiplexing and no shared state across requests. (5) CONCURRENCY CAP: acquireExpose(sandboxID) (modeled exactly on acquireStream) enforces a per-sandbox ceiling (defaultMaxExpose=16, configurable via SetMaxExposePerSandbox) on concurrent OPEN expose tunnels; a NEW request over the cap is rejected 429 before any tunnel is dialed; existing in-flight tunnels are never killed by the cap. The counter is decremented via an idempotent sync.Once release deferred in handleExpose. This mirrors the /forward cap (SetMaxForwardsPerSandbox) and the streaming-exec cap (SetMaxStreamsPerSandbox). (6) FORCE-CLOSE ON TERMINATE: every expose net.Conn created in ProxyHTTP’s dial closure is registered in exposeConns[sandboxID] via trackExposeConn; CloseExpose(sandboxID) closes every tracked conn and clears the entry; UnregisterSandbox calls CloseExpose (before taking its own lock, mirroring CloseForwards) so no tunnel goroutine outlives a terminated sandbox. A trackingConn wrapper calls untrackExposeConn on natural close so the map does not grow unbounded across request lifetimes. (7) DEFERRED to slice 2 and explicitly NOT part of this surface: the internet-facing edge proxy, TLS termination, the <label>.<expose-domain> subdomain scheme, the Host allowlist and reserved-name check, the wildcard certificate, the controller route-sync CRD fields, and the internal/preview to internal/expose package rename. This row covers only the forkd :9091 and standalone sandbox-server paths; the public ingress surface is the section 7c area (preview URL rows). |
Sandbox API error responses (internal/apierr envelope) | addressed | Runtime error responses from the forkd sandbox API and the standalone sandbox-server use the LLM-legible envelope {error:{code, message, cause, remediation}} (internal/apierr), so a caller gets a stable machine code and an actionable next step instead of an opaque string (issue #28). The cause is built from sandbox ids, paths, and operation names only (an exec/file failure surfaced from the guest agent or fork engine, a fixed string for the auth and bad-request paths); tokens and secret values never appear in any field. The requireBearer gate never echoes the presented token: its 401 cause is a fixed string, never the request header. CI asserts every error path carries a non-empty code and remediation (internal/apierr/apierr_test.go, internal/daemon/error_envelope_test.go). The Python and TypeScript SDKs additionally redact any bearer token a misconfigured server might reflect into a body before it becomes the client-side error cause. |
| forkd capability minimization | mitigated as shipped (#352); kernel-enforced cap/uid drop proven in KVM CI (issue #2 Task 5) | The SHIPPED DaemonSet (deploy/daemon/daemonset.yaml) and the Helm chart run forkd NON-privileged: privileged: false, allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault, capabilities.drop: [ALL] and add back ONLY the explicit builder set SYS_ADMIN, CHOWN, SETUID, SETGID, MKNOD (the jailer) plus NET_ADMIN (the build-time placeholder tap). That set is the single source of truth in cmd/forkd/jailer.go (forkdRequiredCapabilities, derived from jailerRequiredCapabilities), and the conformance suite cmd/forkd (TestShippedDaemonSetIsNotPrivileged, TestShippedDaemonSetHasExactJailerCapabilities, TestShippedDaemonSetHardensSecurityContext, TestShippedDaemonSetGetsKVMViaDevicePlugin, TestShippedDaemonSetEnablesJailer) fails in the darwin unit run if a regression re-adds privileged: true, widens the cap set, drops the jailer flags, or re-introduces the privileged /dev/kvm hostPath. /dev/kvm and /dev/net/tun come from the device plugin (mitos.run/kvm); the device cgroup is scoped by the kubelet, which is what removes the need for privileged. The jailer is ENABLED, so a guest escape from a raw-forkd or build VM lands as a THROWAWAY jailed uid in a per-VM chroot inside a non-privileged container, not root in a privileged one. Residual: the container is still uid 0 and holds CAP_SYS_ADMIN (needed for the chroot-base mount setup and the jailer), and the data-dir hostPath remains, so a successful escape that ALSO defeats the jailed uid and CAP_SYS_ADMIN boundary is still a serious node event, just no longer trivial full-node root; the kernel ENFORCING the bounding-set/uid drop is asserted on the KVM runner (the CI host is root and cannot observe it). raw-forkd stays not-for-untrusted-multi-tenant for the REMAINING reasons (node-flat snapshots; see docs/adr/0005-raw-forkd-not-multitenant.md), the privileged-container reason is now closed. |
| Blast radius documentation | mitigated (#352) | This document. A forkd compromise is now uid 0 in a NON-privileged container with the explicit builder capability set (no privileged: true), /dev/kvm//dev/net/tun via the device plugin, and a hostPath to the node data dir, plus the ability to read every snapshot and secret passed to it. The build/raw-forkd VMMs run jailed under throwaway uids, so a VMM escape no longer lands as forkd’s root. The residual blast radius (CAP_SYS_ADMIN, uid 0, data-dir hostPath) is documented in the forkd capability minimization row above. |
| forkd crash recovery / orphan-VM leak on restart (issue #12) | mitigated (artifact reap + re-adopt; unit-verified on darwin); real-VM reap on KVM is a TARGET pending a kvm-test.yaml crash-reap phase | forkd tracks live VMs in an in-memory map, so before this change a forkd crash + restart lost all knowledge of its own pre-crash Firecracker processes: they kept running (consuming CPU, memory, /dev/kvm slots, jailer uids, disk) while ListSandboxes reported zero, so the controller GC could not see or reap them and they leaked until node reboot (a node-level availability/DoS surface, NOT a cross-tenant escape). forkd now persists a minimal per-VM journal record at <dataDir>/sandboxes/<id>.json (atomic temp+rename) when a VM reaches running and removes it on clean terminate; NewEngine reconciles the journal before serving. A record whose pid the PID-recycle guard confirms is still OUR live Firecracker (/proc/<pid>/exe resolves to the recorded firecracker binary, or comm is firecracker when exe is unreadable under the jailed uid) is re-adopted into the live map so ListSandboxes reports it and the controller GC reconciles it against the CRDs (terminating it if no live claim, directly from the recorded pid + jailer chroot + uid + network identity). A dead pid, or a recycled pid running an UNRELATED program, is treated as gone: its leaked artifacts (jailer workspace incl. chroot + rootfs CoW clone, sandbox dir, fork network tap/ruleset/identity, jailer uid, volume backings) are reaped best-effort and idempotently, and its record dropped. The PID-recycle guard is the safety property: a wrong kill of an unrelated reused pid is prevented because reconcile never SIGKILLs on the startup reap path (a dead pid has nothing to kill; a recycled pid is not ours), and adoption (which does enable a later kill via the GC) happens ONLY for a verified-our-firecracker pid. The later GC-driven kill is the most dangerous operation here: an adopted firecracker is re-parented to init across the crash (not a child of the restarted forkd), and Terminate runs a full GC interval after adoption, so between the two the adopted VM can exit on its own and its pid be recycled to an unrelated process on a busy node. To close that adoption-then-kill TOCTOU, reapAdopted RE-RUNS the SAME PID-recycle guard against the recorded firecracker binary immediately before signalling and skips the kill when the pid no longer resolves to our firecracker (artifacts are still reaped). The adopted VM’s exact /30 network block is pinned from its recorded guest IP via netconf.Allocator.MarkInUse rather than re-Acquired, so the empty post-restart allocator cannot hand the same /30 to a fresh fork and Release frees the right block. Reconcile is fail-open (a single malformed record never blocks startup) and logs counts + ids/paths only, never secrets. The journal carries ids/pids/host paths/uids/IPs only; never env, secrets, or tokens. Verified on darwin via injected pid/verifier seams (internal/fork/reconcile_engine_test.go, including the reap-adopted recycled-pid skip and the network-block pinning) and internal/netconf/identity_test.go; the real Firecracker kill + chroot/CoW/network reap on KVM (start a sandbox, kill -9 forkd, restart forkd, assert the orphan FC is reaped or re-adopted + GC-terminated with no leaked process/chroot/uid) is a TARGET pending a kvm-test.yaml crash-reap phase (issue #12). Residual: a re-adopted orphan the GC terminates does not yet stamp a typed claim condition explaining the pre-crash origin. |
| Git rendezvous egress (W4 outputs) | mitigated (arg injection); mitigated (credentialed egress); open (external-endpoint egress) (documented) | A claim spec.outputs {git} entry is a NEW EGRESS: on terminate the control plane (the claim reconciler today; the node-side transfer path when wired) materializes the workspace spec.git.paths content (tenant data) into a commit and pushes it to an operator-declared external git remote on a per-attempt branch. The remote URL is operator-declared in the Workspace/output spec (not tenant-controlled), git is the merge layer (the engine only pushes a branch, it never merges working trees), and the secret exclude list still strips credential paths before any capture so a push carries repo content only. CONFIRMED arg-injection RCE (now closed): the push ran git push <remote> <branch> with no -- separator, so a remote of --receive-pack=<cmd> was parsed by git as a FLAG and ran an arbitrary command on the pushing (controller) node, exploitable regardless of git version. Mitigations now in place: (a) the push uses a -- separator (git push -- <remote> <branch>), so a flag-shaped remote lands as a positional and cannot inject an option; (b) RenderBranch rejects a rendered branch beginning with -, so a custom branch template cannot inject a flag even past the separator (defense in depth); (c) the push environment sets GIT_CONFIG_NOSYSTEM=1 and points HOME at an empty temp dir, so ambient host git config (a controller image ~/.gitconfig or /etc/gitconfig) cannot re-enable the ext::/fd:: transports or alter push behavior; (d) the API enforces a +kubebuilder:validation:Pattern on GitOutput.Remote restricting it to https://, http://, ssh://, git://, file://, and scp-like git@host:path forms, rejecting flag-shaped and ext::/fd:: values at admission. ext:: is also disabled by default in git >= 2.38.4. These defenses close the arg-injection even for a misconfigured or compromised operator input. The operator-declared remote remains a high-trust boundary by design. Rendezvous CREDENTIALS are now modeled (W4 Phase 3). A push to a real external remote authenticates with a token from Workspace.spec.git.credentialsSecretRef (a referenced Secret key in the workspace namespace, resolved by the controller). The token is a SECRET VALUE and is handled per the secrets policy: (i) it NEVER appears on the git argv (so it is absent from the process table), in a log line, in an error, in a claim/revision condition, or in a committed revision; (ii) it reaches git ONLY through a mode 0o600 .git-credentials file written into an ephemeral, isolated HOME created per push and removed when the push returns (credential.helper=store reads only that file); (iii) any git output surfaced in an error is scrubbed of the token defensively; (iv) a missing/empty key yields an LLM-legible error that names only the Secret and key, never the value; (v) credentials require an http(s) remote that can carry basic auth (a file:///scp-like remote with credentials is rejected). Redaction on the failure path is asserted in internal/workspace/git_test.go and the token-never-in-conditions invariant in internal/controller/workspace_binding_test.go BEFORE the credential code merges. The matching authenticated remote is internal/rendezvous + cmd/rendezvous-server: a git-http server wrapping git http-backend behind HTTP basic auth (constant-time token compare; the server’s token is mounted from a Secret via -token-file/RENDEZVOUS_TOKEN, never an argv flag, never logged), enabling receive-pack only after the auth check passes and rejecting traversal in the repo path. The operator may also point the output at any standard authenticated git remote. Residual EGRESS surface, explicitly: the push target is an EXTERNAL endpoint, so tenant repo bytes leave the cluster to wherever the operator pointed the remote (a high-trust, operator-declared boundary by design; the credential just authenticates that egress, it does not bound where it goes). The push to a REAL external server on a live cluster is the gated e2e tail (test/cluster-e2e/workspace-e2e.sh). A {git} output without spec.git.paths is a no-op; a push failure surfaces on the claim/revision condition and is retried, never silently swallowed. |
| Revision change feed egress (W4 slice 4) | mitigated | The controller emits CloudEvents (dev.mitos.workspace.revision.created, dev.mitos.sandbox.phase.changed) to an OPT-IN operator-configured webhook (--event-sink-url; empty disables it, leaving only on-cluster Kubernetes Events) and mirrors each as a Kubernetes Event. Payloads carry IDENTIFIERS only (workspace/revision names, the content-manifest DIGEST, lineage refs, phase transitions), never secret values, env, or file content, so the feed leaks metadata to wherever the operator points the sink, not tenant data. Delivery is at-least-once with the event id (object UID plus a sequence) as the idempotency key so an indexer dedupes; the webhook URL is operator-config (an SSRF-shaped surface like the git remote, the same high-trust boundary). Residual: no payload signing/auth on the webhook yet (the operator must trust the sink endpoint); NATS and the reference indexer are out of scope. |
| Console audit-sink egress (console B3b) | mitigated (metadata-only, https-only, best-effort); open (dispatch bounding + full SSRF egress allowlist) | The console BFF can forward each org audit event to org-admin-configured external sinks (webhook today; s3/splunk/datadog accepted as config types with drivers as follow-ups) via the SinkRegistry + DispatchingRecorder (internal/saas/console/audit_sinks.go). Payloads carry the AuditEvent IDENTIFIERS only (org id, actor id, action, target, a non-secret detail string, timestamp), never a secret value, env, or file content, so the feed leaks audit METADATA to wherever the org admin points the sink, not tenant data; SinkConfig stores no credential field and no endpoint or error is logged with a secret. Delivery is BEST-EFFORT: a sink failure is logged and never fails the audit Record or the user action. The sink endpoint is validated to require an https:// URL with a non-empty host at POST /console/audit/sinks (rejecting http and other schemes), which blocks the obvious SSRF-to-internal-http vector; the endpoint remains an admin-config SSRF-shaped surface like the git remote and the revision feed above. The registry is org-scoped (a cross-org sink is not_found) and every sink endpoint is added to the auth-gate table. Residual: no payload signing/auth on the sink yet (the admin must trust the endpoint); the per-event dispatch is unbounded goroutine-per-sink (a bounded worker pool is a tracked follow-up); full SSRF egress allowlisting (blocking private/loopback ranges) and the non-webhook sink drivers are follow-ups. |
| Memory-snapshot pairing (W4 resumable head) | mitigated | A workspace head can be paired with a VM MEMORY snapshot (WorkspaceRevision.memorySnapshotRef), which captures guest RAM and therefore CAN contain secrets that were delivered into the guest. The pairing is PRINCIPAL-BOUND: on a checkpoint-on-terminate the new revision records the CAPTURING claim’s principal (memorySnapshotPrincipal = its ServiceAccount); status.resumable is true only when the head’s snapshot still exists AND is verified principal-bound; and a new claim’s activation REFUSES to restore a memory snapshot whose principal does not match the activating claim’s principal (a cross-principal resume is rejected fail-closed, never silently downgraded to a cold start). So a memory image is never served to a principal other than the one that created it. The refusal is enforced at TWO layers: the controller’s maybeResumeMemory (internal/controller/workspace_binding.go), and the WorkspaceMemorySnapshotAdapter (internal/controller/workspace_memory_snapshot.go) which refuses any cross-principal resume/exists it is asked for directly even if a caller bypassed the upstream check. The whole flow (checkpoint pairs the snapshot; the head becomes resumable; a same-principal claim resumes; a DIFFERENT-principal claim is refused and never resumed/hydrated) is proven END TO END in envtest, internal/controller/resumable_envtest_test.go TestResumableHeadFromMemorySnapshot (the sa-b intruder case asserts the refusal fires and the resume seam is never reached for the intruder), plus adapter unit tests (workspace_memory_snapshot_test.go). Wiring the seams to a real memory snapshot is gated behind the controller --workspace-memory-snapshots flag; off by default a checkpoint-on-terminate FAILS LOUD rather than producing a revision falsely marked resumable (no fabricated snapshot, no-unverified-claims). PRINCIPAL AUTHENTICITY: the principal is the claim’s spec.serviceAccount, which a claim author sets freely, so the equality check above is only an authorization boundary if that field is bound to the creator’s identity. The validating admission webhook internal/admission.ClaimServiceAccountValidator (controller --enable-principal-webhook, Helm admissionWebhook.enabled) closes that gap: on claim CREATE/UPDATE it runs a SubjectAccessReview with the request creator’s identity and rejects the claim unless the creator may impersonate the named ServiceAccount, so a tenant cannot assert another principal’s value and resume its secrets-in-RAM. It is off by default (single-tenant and webhook-less installs unaffected) and STRONGLY RECOMMENDED whenever memory snapshots are enabled multi-tenant; the controller logs a warning if --workspace-memory-snapshots is on without it. The webhook handler + SAR decision are unit-tested (internal/admission); the cert wiring and a webhook admission e2e are the remaining gate before it is relied on, and it touches the authz boundary so needs a named human reviewer. Residual: the memory snapshot at rest inherits the snapshot store’s encryption (#31, the per-workspace key in section 5 now extends to the workspace store); the real VM-memory restore runs on a KVM-capable kubelet (the cluster-gated in-VM tail), while the pairing decision and the principal check are object-level proven in envtest. |
| Workspace S3 object-store egress (W4 Phase 4) | mitigated (credentials); open (external-endpoint egress) (documented) | A workspace may select an S3-compatible object store (Workspace.spec.store.s3 + objectStorageRef) as the content-addressed backend for its hydrate/dehydrate revision artifacts, an ALTERNATIVE to the node CAS. This is a NEW EGRESS: tenant workspace content (chunked, content-addressed) leaves the cluster to the operator-declared bucket. Bucket and endpoint are operator-declared (not tenant-controlled). CREDENTIALS are handled per the secrets policy: the access-key id and secret-access-key come from s3.credentialsSecretRef (a referenced Secret), the secret-access-key is a SECRET VALUE used ONLY to derive the SigV4 signing-key chain (internal/workspace/s3client.go), and it NEVER appears on the wire in cleartext, in a log line, in an error, in a condition, or in a committed object. The signed request carries only the SigV4 signature and the (non-secret) credential scope; the secret-never-on-the-wire invariant is asserted in internal/workspace/s3client_test.go. The store is plaintext CONTENT-ADDRESSED exactly like the node CAS (same manifest digest for a tree), so it is a drop-in that preserves byte-identical round trip and chunk-level dedup (TestS3DigestMatchesNodeCASDigest, TestS3DedupsByChunkDigest), and it composes with per-workspace encryption so the at-rest bucket objects are ciphertext (TestS3EncryptedRoundTrip). Residual EGRESS surface, explicitly: bytes leave the cluster to wherever the operator pointed the bucket (a high-trust, operator-declared boundary by design); the credential authenticates the egress, it does not bound where it goes. The node CAS stays the default when no objectStorageRef is set; the live round trip is the gated e2e tail. |
| Node CAS: tenant-writable, unbounded, integrity-DoS (W4) | open (high) | The node content-addressed store (<dataDir>/cas) is mounted READ-WRITE and shared per node across all husk pods (huskCASMountPath, internal/controller/huskpod.go), so a guest that escapes its VM into the husk pod can write, delete, or corrupt other tenants’ committed-revision chunks on the same node. Content-addressing protects INTEGRITY of what is read (the read side verifies the digest, so a forged chunk is rejected; and the manifest decoder now validates each chunk digest is a well-formed sha256 before it is ever used as a chunk path, internal/cas.decodeManifest calling Digest.Validate, so a malformed manifest from the untrusted peer-pull transport, internal/cas/http_transport.go, cannot traverse out of the chunk store (chunkPath joins the digest into a path) to stream an arbitrary host file to the output before the post-read verify, nor panic string(d)[:2] on a sub-2-char digest), but it does NOT prevent DESTRUCTION: a compromised pod can delete or truncate another tenant’s chunks, a cross-tenant AVAILABILITY attack. The store is also UNBOUNDED in production: internal/cas.EvictToFit exists but is called from NOWHERE outside its own test (internal/cas/evict.go, evict_test.go), so node disk is uncapped and a tenant can fill it (node-disk DoS). Separately, fork-child activation uses --allow-unverified-snapshots (the live fork snapshot is not content-addressed, internal/husk/control.go, stub.go), so a node-local attacker who escaped a VM could tamper a fork artifact a sibling pod loads UNVERIFIED. Required fix: wire EvictToFit (or another disk cap) into production, scope the CAS write surface per tenant (or per node with destruction protection), and remove the unverified fork-load path or bind it to a digest. |
| Husk pod default ServiceAccount token automount | mitigated (shipped) | The husk pod spec now sets AutomountServiceAccountToken: false (internal/controller/huskpod.go buildHuskPod), so the kubelet does NOT mount the namespace default ServiceAccount token into a husk pod. The stub speaks vsock + mTLS and never calls the Kubernetes API (verified: no client-go, no InClusterConfig, no SA token read in cmd/husk-stub or internal/husk), so the token was dead weight; a guest that escapes into the stub no longer inherits a free system:authenticated token. The opt-out applies to BOTH warm pods and fork-child pods, which share the builder, proven in TestBuildHuskPodDisablesSATokenAutomount. |
| forkd gRPC fails OPEN without TLS flags | mitigated (shipped) | grpcServerOptions (cmd/forkd/main.go) now FAILS CLOSED: with no TLS flags it returns a fatal error naming the missing --tls-cert/--tls-key/--tls-ca flags and the opt-in, so forkd refuses to start unauthenticated. The legacy insecure-with-loud-warning behavior is reachable only behind an explicit --allow-insecure-grpc opt-in (default false) for local development. A partial TLS triple is still a configuration error. The shipped DaemonSet always sets TLS, so production is unaffected; this only stops a silent-insecure misconfig. Proven in TestGRPCServerOptionsFailClosed (refuses with no TLS and no opt-in; allowed with the opt-in; allowed with a full TLS triple; partial is an error). |
| Host-side vsock read has no deadline | mitigated (shipped) | The host-side vsock.Client now applies a per-request read deadline in send (internal/vsock/client.go), defaulting to DefaultRequestTimeout (60s) and overridable via SetRequestTimeout. The CONNECT preamble reads in Connect/DialStream/DialStreamUnix are likewise bounded (then cleared so the long-lived streaming reads stay ctx/conn.Close-cancelled). A malicious or wedged guest that connects then stalls (or dribbles a partial line under the MaxMessageBytes cap) now causes the one-shot call to return a timeout error rather than hang the host caller goroutine, vsock fd, and (for the husk stub) stream slot indefinitely. Proven in TestSendReadDeadlineUnblocksOnStall (a fake agent that connects then never responds makes Ping return rather than hang). |
| Husk egress enforcement | mitigated (IMPLEMENTED and KVM-VERIFIED end to end) | See section 0 surface 5 and section 4. The husk default path enforces an in-pod nftables default-deny egress filter in the pod’s own netns (CNI-independent, the guarantee), an unconditional cloud-metadata block (169.254.169.254, 169.254.0.0/16, IPv6 fd00:ec2::254, before any allow), and the threaded per-template allowlist with a per-pod DNS proxy (failover upstream list, recommended 1.1.1.1:53,8.8.8.8:53, not cluster DNS); a best-effort controller-emitted NetworkPolicy adds defense in depth (CNI-dependent, cannot express name-based allows). Costs one scoped NET_ADMIN capability (ADR 0006) plus, on name-egress pools only, a short-lived privileged enable-ip-forward init container that sets net.ipv4.ip_forward=1 in the pod netns and exits (one-shot; no node prerequisite; the long-lived husk-stub container stays unprivileged). VERIFIED: the husk-network KVM cluster e2e (test/cluster-e2e/husk-network-e2e.sh, cluster-husk-network-e2e) PASSES on the Hetzner Talos KVM cluster; the claim reaches Ready and all three in-VM enforcement assertions are green inside a real restored VM (metadata-blocked exit 1, default-deny exit 1, allowlist-works example.com connect exit 0), verified twice (with and without a node sysctl allowance), proving NO node prerequisite. |
| agents.x-k8s.io facade admission surface (v1beta1 conformance, #357) | mitigated (bounded translation, no host surface) | The conformance facade (internal/facade, opt-in deployment) watches tenant-authored upstream agents.x-k8s.io/v1beta1 and extensions.agents.x-k8s.io/v1beta1 custom resources (Sandbox, SandboxTemplate, SandboxWarmPool, SandboxClaim) and TRANSLATES each into our mitos.run/v1 SandboxPool/Sandbox objects in the SAME namespace, owner-referenced to the upstream CR for GC. It creates only namespace-scoped objects and grants no cross-namespace or host access; the run-path security boundary is unchanged (the bridged Sandbox is governed by the husk-pod and sandbox-API controls above, not by the facade). Tenant-controlled upstream fields are BOUNDED: spec.operatingMode (Running/Suspended) drives only release/re-activate of the bridged claim; SandboxClaim.spec.warmPoolRef.name is a name lookup of OUR pool of that name (no path or host input); spec.podTemplate.spec.containers[0].env (including env.valueFrom references) is mirrored through by reference and never logged; all other podTemplate fields and volumeClaimTemplates are UNMAPPED (no tenant-driven volume provisioning or host-shaped field reaches our engine via the facade, see docs/facade-conformance.md exceptions 1, 3, 4, 5). The v0.5.0 migration to v1beta1 REDUCED the input surface: it removed the v1alpha1 warmpool policy resolution branch (none/default/named), so a claim now references a warm pool by name with no policy logic to abuse. Test-infra note: v1beta1 is the storage version with a conversion.strategy: Webhook we do not run; the kind conformance job pins the live CRD conversion to None at install time and exercises only the storage version (the vendored CRD on disk is unchanged). Residual: the facade trusts cluster RBAC that admits the upstream CRs, so whoever may create an agents.x-k8s.io Sandbox in a namespace can drive a pool binding there; the same-namespace tenancy boundary (section 7a) applies, and the facade adds no privilege beyond it. |
Code interpreter (run_code) surface
run_code (the Connect sandbox.v1.Sandbox.RunCode RPC; the legacy POST /v1/run_code/stream JSON route was removed in #358) runs tenant code in a
long-lived Python kernel (/opt/mitos/kernel_driver.py, ipykernel)
that the guest agent spawns lazily and keeps for the VM lifetime. Status:
mitigated for host isolation, partial for in-guest blast radius.
- The kernel runs INSIDE the untrusted guest VM, at the same privilege as any
execchild. It crosses no new host boundary: the KVM/Firecracker boundary and the unprivileged husk pod (section 0) bound it exactly as they boundexec. The host treats kernel output (frames) as data only. - It is a PERSISTENT interpreter holding tenant state across calls. Within one sandbox this is by design (statefulness is the feature); across tenants there is no sharing because each sandbox is its own VM.
- Fork inheritance: a forked VM inherits the live kernel and its namespace (it is part of the snapshot). This is the same fork-shared-state surface as any in-VM process; the RNG/clock caveat is in docs/fork-correctness.md.
- Optionality: a base image without the kernel returns a KernelUnavailable
error frame (exit 127); no new attack surface exists on minimal images, and
plain
execis unaffected. - The kernel driver reads only newline-delimited JSON {id, code} on its stdin from the agent, never from the network, so there is no kernel-protocol exposure beyond the existing vsock/HTTP authz (per-sandbox bearer token).
Residual (open): the kernel inherits the configured env+secrets exactly as
exec does; secret values are never logged in frames (only stdout the tenant
itself prints, which is the tenant’s own choice). No CPU/memory cgroup bounds
the kernel beyond the VM’s own limits.
Interactive terminal surface (GET /sandbox.v1.Sandbox/Exec over WebSocket)
The interactive terminal is the sandbox.v1.Sandbox.Exec bidi RPC carried over a
WebSocket upgrade (subprotocol connect.sandbox.v1, handleExecWS,
internal/daemon/exec_ws.go); it bridges to a dedicated vsock gRPC Exec stream on
which the guest agent allocates a pseudo-terminal and starts /bin/sh as a
session leader. This is a LIVE interactive shell into the VM: stdin frames flow
client to guest and stdout/stderr frames flow back, for as long as the connection
is held. The bespoke /v1/pty JSON WebSocket wire was removed in #358; the row
above (GET /sandbox.v1.Sandbox/Exec) is the authoritative surface entry. The
threat properties carry over unchanged:
- Authentication. The upgrade is a bodyless GET, so it does NOT pass through
the JSON-body-peeking
requireBearermiddleware.handleExecWSauthenticates the upgrade itself (ptyAuth): the sandbox id comes from the?sandbox=query parameter and the token from theAuthorization: Bearerheader, compared in constant time. Semantics matchrequireBearer: a sandbox with no registered token fails closed with 401 (unlessAllowTokenless, the standalone sandbox-server only); a missing, malformed, or mismatched token is 401. Token values are never logged. Status: mitigated (same per-sandbox token custody as exec/files). - Process containment. The shell runs in its own session and process group; a host hangup (WebSocket close, ctx cancel, or vsock drop) cancels the guest Exec stream, which the guest turns into a process-group SIGKILL, so a terminal session and its children do not outlive the connection. Status: mitigated.
- No command injection at the edge.
handleExecWSdoes NOT take the shell command from the client for the interactive case; the guest defaults to/bin/sh. Only the PTY window size crosses from the open frame. The shell, of course, can run any command the guest user can, exactly like exec; the terminal does not widen the in-guest capability set, only the interactivity of access. Status: accepted by design, identical to the exec surface. - Concurrent-session cap. The terminal holds a dedicated vsock connection for
the session lifetime, so it counts against the SAME per-sandbox
concurrent-stream ceiling as exec and run_code (production-blocker #2):
handleExecWSreserves a slot viaExecPTY; because the open frame arrives after the upgrade, a session over the--max-streams-per-sandboxceiling is refused with a terminal errorexitframe plus a policy-violation close (not a pre-upgrade 429); existing sessions are never killed. So a tenant cannot open unbounded terminals to exhaust host connections and goroutines. Status: mitigated (tested inTestExecWSStreamCapRejected). - Residual. The terminal inherits the exec surface’s residuals (the in-guest
user is unconfined within the VM; isolation is the microVM boundary, not
in-guest privilege separation). It adds no new host-side privilege. The auditor
records an
exec_wsop with theptyflag only, never terminal contents.
4. Sandbox → network
See docs/networking.md for the full design (tap-per-sandbox, nftables
dispatch model, per-fork identity). Networking is opt-in: with forkd’s
--enable-networking off, restored VMs have no NIC and egress is denied by
absence. With it on, each fork gets its own tap and a host-side default-deny
egress ruleset.
PER-MODE ENFORCEMENT (reconciled with the husk default, section 0 surface 5).
The nftables egress dataplane described in the rows below is the SAME rendering
(internal/netconf) used in both modes. In RAW-FORKD mode (--enable-raw-forkd)
with --enable-networking on, it is applied host-side in forkd’s netns; those
rows are CI-proven in-VM on KVM. On the HUSK DEFAULT path (the shipped default)
the SAME rendering is applied IN-POD by the husk-stub in the pod’s OWN netns at
activation (internal/husk/netfilter.go, internal/husk/stub.go), so a husk
sandbox gets a CNI-independent default-deny egress chain, the unconditional
cloud-metadata block, and the threaded per-template allowlist;
169.254.169.254 (and the v4 link-local range and IPv6 IMDS) is dropped before
any allow. Name-based egress runs the in-pod DNS proxy that resolves only
allowlisted names and pins each resolved IP into the per-tap allow set, forwarding
to a comma-separated failover upstream list (recommended 1.1.1.1:53,8.8.8.8:53,
NOT cluster DNS); allowed traffic egresses via an nftables SNAT masquerade scoped
to the guest source plus IPv4 forwarding, the latter enabled by a short-lived
privileged enable-ip-forward init container (name-egress pools only, no node
prerequisite). The controller additionally emits a best-effort
networking.k8s.io/v1 NetworkPolicy selecting mitos.run/husk=true
(internal/controller/husknetworkpolicy.go); HONEST CNI caveat: a NetworkPolicy
enforces only on a CNI that implements it and cannot express name-based allows, so
the in-pod nft filter is the guarantee and the NetworkPolicy is defense in depth.
The earlier CI step that hand-applied an allow-all (0.0.0.0/0) policy proved no
restriction and is superseded. This in-pod enforcement is IMPLEMENTED and
KVM-VERIFIED end to end: the husk-network KVM cluster e2e
(test/cluster-e2e/husk-network-e2e.sh, cluster-husk-network-e2e) PASSES on the
Hetzner Talos KVM cluster, the claim reaches Ready, and all three in-VM
enforcement assertions are green inside a real restored VM (metadata-blocked,
default-deny, allowlist-works), verified twice (with and without a node sysctl
allowance), proving NO node prerequisite. The raw-forkd in-VM KVM proof is
unchanged. See docs/husk-pods.md section 6d.
| Control | Status | Detail |
|---|---|---|
| Egress default-deny (IP:port) | partial / mitigated | Enforced host-side for literal IP:port allowlist entries. Each fork sits on its own tap with its own /30; a shared inet nftables table dispatches by inbound interface (the tap) into that sandbox’s regular chain, which accepts established/related, the allowlisted ip daddr/tcp dport pairs (each pinned to the sandbox’s ip saddr as anti-spoof), and ends in a terminal drop. The guest cannot influence the host ruleset and cannot spoof another sandbox’s source address onto its own tap. Proven in KVM CI: one VM reaches an allowed destination and is blocked from a denied one, plus a two-sandbox nft install proving cross-tap isolation (one sandbox’s drop never kills another’s allowed traffic). |
| Host-side enforcement | enforced | Egress policy is rendered and applied host-side only (nft per tap), never in-guest. The guest agent never edits nftables; the guest’s only network config is its own eth0 address. |
| DNS-based allowlists (name egress) | partial / enforced | Names like api.anthropic.com:443 are now enforced through a controlled per-node resolver (internal/dnsproxy, #47, behind --enable-dns-egress). The guest’s only resolver is the node resolver IP (169.254.1.1, written into the guest /etc/resolv.conf). The proxy resolves ONLY names on that sandbox’s allowlist, and for each resolved record pins (ip . port) into that sandbox’s nftables timeout set; the guest can then reach exactly the address it resolved, for exactly the allowed ports, for max(recordTTL, 30s). A name not on the allowlist gets REFUSED and nothing is pinned. Allowlist names: exact OR anchored suffix wildcard. An entry is matched exactly (case-insensitive, trailing-dot tolerant) OR, when it is written *.D, by the ANCHOR RULE: the query must end with .D and carry a NON-EMPTY label before that .D, so *.example.com matches a.example.com and a.b.example.com but NEVER the apex example.com, NEVER a look-alike (notexample.com, evilexample.com, xexample.com), and NEVER a name that carries D only as a non-suffix label (example.com.evil.com, a.example.com.evil.com). The match is a LITERAL anchored suffix check (strings-level, no regex); this anchor is the load-bearing guarantee and is exhaustively bypass-tested. A wildcard is validated at the boundary where the template networkPolicy names build the allowlist (ParseNameAllowList): it must be exactly a single leading *. plus a valid domain, so *, *., *foo.com, a.*.com, **.com, and multi-star names are REJECTED, never silently treated as a literal. AAAA/IPv6. The proxy now also forwards AAAA and pins each resolved v6 address into a SEPARATE per-sandbox v6 nftables timeout set (ipv6_addr . inet_service), and each per-sandbox chain carries a v6 default-deny (meta nfproto ipv6 drop under egress: deny), so an unpinned v6 destination is dropped rather than falling through to the base chain’s accept; v6 egress is therefore enforced by the same resolve-then-pin model as v4. Honest v6 scope: the guest is assigned only a v4 /30 source identity today (no v6 source address), so the v6 accept is NOT ip saddr anti-spoof-pinned the way the v4 accepts are; in single-stack guests this is moot (the guest cannot source v6), and the dataplane fails closed regardless because of the v6 default-deny. Exact and wildcard entries coexist; a double match unions the ports. The default stays DENY. Proven in KVM CI for v4: a resolved allowlisted name:port is reachable while an unlisted name (refused), the right name on a wrong port, and an un-resolved direct IP are all blocked. Residual risks are the next four rows. Literal IP:port rules remain the statically enforced path. |
| Name egress: upstream-resolver trust | open (documented) | The controlled proxy forwards allowed queries to a configured upstream (--dns-upstream, default the host resolver or 1.1.1.1:53) and pins whatever A records it returns. A malicious or compromised upstream can answer an allowlisted name with an attacker-controlled IP, which the proxy will then pin and the guest will reach. The trust boundary is the upstream resolver. Mitigations not yet in v1: DNSSEC validation, a pinned/known-good resolver set, response-IP sanity checks. |
| Name egress: bounded TTL window | partial / mitigated | A pinned (ip . port) stays reachable for max(recordTTL, 30s) after it is resolved, even if the name later stops resolving to that IP. The window is bounded by the record TTL (floored at 30s so a very short TTL cannot expire the pin before the guest connects) and the set’s timeout, after which the element is evicted and the IP is no longer reachable unless re-resolved. There is no manual revocation of a live pin before its timeout. |
| Name egress: shared-CDN-IP caveat | open (documented) | Pinning is by IP after resolution, so if an allowlisted name and a denied name resolve to the SAME IP (a shared CDN or load-balancer address), resolving the allowlisted name pins that IP and makes it reachable on the allowed port, including for traffic the operator intended to deny that happens to share the address. The denied NAME is still refused at the resolver (it is never answered or pinned), but the IP it shares becomes reachable once the allowlisted name is resolved. This is inherent to IP-level enforcement of name policy. |
| Name egress: DoH/DoT and DNS tunneling | mitigated | A guest cannot bypass the controlled resolver. Only udp/tcp 53 to the resolver IP is permitted by the egress chain, so a guest cannot reach an arbitrary external DoH/DoT server (its IP:port is not allowlisted and was never pinned). The resolver answers only A and AAAA queries for allowlisted names and REFUSES every other qtype, so it cannot be used as a covert DNS tunnel: only A/AAAA records are forwarded and the resolved IPs are constrained to the allowlist (and pinned into the v4 or v6 set by address family). |
| Name egress: source attribution | enforced | The proxy attributes each query to a sandbox by the query’s source guest IP (each sandbox has a unique /30 from the identity allocator) and pins into THAT sandbox’s set. A guest cannot grant itself another sandbox’s reach by spoofing a source IP: the per-tap dispatch sends a tap’s traffic only into its own chain, and every v4 accept (including the dynamic v4 pin-set accept) is ip saddr-pinned to the sandbox’s guest IP, so a spoofed-source query cannot land a pin that the spoofing guest can use. A query whose source has no live tap mapping is REFUSED and pins nothing. The v6 accept is not ip saddr-pinned because the guest has no v6 source address to spoof from today; the v6 default-deny in each chain remains the boundary there. |
| Layering: host netns vs per-VM netns | host-netns today | The tap and nftables ruleset live in forkd’s (the host’s) network namespace; isolation between sandboxes is by per-tap dispatch + per-/30 addressing + saddr anti-spoof, not by a kernel netns boundary per VM. Moving each VM into its own pod netns (husk pods, #18) adds a second, defense-in-depth layer and is where snapshot-fork-under-netns is resolved. Live-fork (ForkRunning) of a networked sandbox fails closed today (#18): a live fork would restore the source’s baked NIC and collide on tap/MAC/IP. |
| K8s NetworkPolicy | in-pod nft is the guarantee (IMPLEMENTED and KVM-VERIFIED end to end); NetworkPolicy is best-effort defense in depth | In RAW-FORKD mode sandboxes are not pods: NetworkPolicy does not govern them and our nftables egress layer is ours and documented as ours (CI-proven in-VM on KVM, opt-in). In the HUSK default the VM’s tap is in the husk pod’s netns, and the husk-stub programs an in-pod nftables default-deny egress filter there (the CNI-independent GUARANTEE, with the unconditional metadata block and the threaded allowlist). The controller additionally creates one networking.k8s.io/v1 NetworkPolicy per pool (internal/controller/husknetworkpolicy.go) selecting mitos.run/husk=true with default-deny egress, a DNS allow, and one egress rule per enforceable IP:port allow, owner-referenced to the pool for GC. HONEST CNI caveat: the NetworkPolicy enforces only on a CNI that implements it and cannot express name-based allows, so it is defense in depth ONLY; the in-pod nft filter holds with no CNI policy at all. The superseded CI “Conformance 3” allow-all step proved no restriction. The in-pod guarantee is now KVM-verified end to end: the husk-network KVM cluster e2e (test/cluster-e2e/husk-network-e2e.sh, cluster-husk-network-e2e) PASSES on the Hetzner Talos KVM cluster, the claim reaches Ready, and all three in-VM enforcement assertions are green inside a real restored VM (metadata-blocked, default-deny, allowlist-works), verified twice (with and without a node sysctl allowance), proving NO node prerequisite. |
5. Snapshot integrity and supply chain
Snapshots are executable memory images; loading one is equivalent to running arbitrary code at sandbox privilege.
| Control | Status | Detail |
|---|---|---|
| Content addressing (digest in CRD status) | mitigated | Every template snapshot is content-addressed in a CAS store the moment it is built: its sha256 manifest digest is recorded to <dataDir>/templates/<id>/manifest.digest, pinned in the store, reported through forkd GetCapacity/CreateTemplate, and written to SandboxPoolStatus.TemplateDigest so the snapshot identity is visible in kubectl get sandboxpool -o yaml. The digest is a content address and is safe to log. |
| Verify-on-load | mitigated | forkd verifies a snapshot’s on-disk bytes against the recorded digest before it is forked, and refuses on mismatch. To keep the fork hot path cheap, verification is verify-once-at-registration: at build time (trusted, marker written without re-hash) or at first use after a restart (lazy re-hash), recorded by a verified marker that Fork only stats. The dev-mode escape --allow-unverified-snapshots downgrades a failed verification to a loud one-time warning. Residual: verification is at registration, not per fork, so tampering AFTER a snapshot is verified is not re-detected until the marker is cleared; external snapshot import is not yet supported. |
| Publish authorization | mitigated | Snapshots are produced only by forkd’s own CreateTemplate, which is reachable solely over the mTLS-gated gRPC surface from the controller (PR #41). Externally supplied snapshots are not accepted, so the publish surface is exactly that authenticated CreateTemplate call. External snapshot import is future work. |
| Compatibility verification (no unsafe restore) | mitigated | The same load gate also runs the snapshot compatibility contract (internal/snapcompat.Check, issue #32) after the digest verify and before any Firecracker launch. The manifest records the producing environment (snapshot format version, Firecracker version, CPU model, kernel, config hash) as part of the content-addressed digest, so these fields cannot be tampered with or downgraded without changing the digest and failing the verify-on-load step above. A benign mismatch (a snapshot legitimately built under a different Firecracker version, a different CPU model, or an unsupported format version) fails closed: the restore is refused with an actionable error rather than crashing or silently corrupting a guest. The dev-mode escape --allow-incompatible-snapshots downgrades a refusal to a loud warning. Kernel mismatch is informational. Residual: cross-CPU-model restore via Firecracker CPU templates and live cross-Firecracker-version restore are out of scope (the contract refuses them today). |
| Encryption at rest + crypto-shredding (#31) | mitigated | Behind --enable-encryption (default off) each scope (a template now; a workspace when #21 lands) gets its own LUKS2 container (internal/storecrypt) backed by a sparse image; the snapshot and volumes are built inside the mounted, decrypted container, so the bytes at rest in <scope>.img are ciphertext, not the plaintext snapshot. dm-crypt sits below the page cache, so the mem mmap CoW restore reads decrypted pages and CoW page sharing across forks is preserved (no per-fork decryption copy). Erasure is crypto-shredding: luksErase wipes the LUKS keyslots and the image is removed at template delete, after which the ciphertext is unrecoverable even with the key. The key reaches cryptsetup only on stdin (--key-file -), never in argv or any log; storecrypt.Key redacts itself on any format. Proven in KVM CI on real cryptsetup: the marker is absent in the raw image but present in the decrypted mount (ciphertext at rest), reopen+read returns it intact (decrypt/restore works), and after shred a reopen with the original key fails and the image is gone (unrecoverable). Key custody (envelope, #31 follow-up): the controller generates a per-template 256-bit DEK with crypto/rand, WRAPS it with a KMS key-encryption key (KEK) via kms.Wrapper (internal/kms), zeroizes the plaintext DEK immediately, and stores ONLY the wrapped DEK plus the non-secret KEK id in a <template>-enc-key Kubernetes Secret owner-referenced to the SandboxPool (so GC of the pool GCs the Secret). The plaintext DEK never persists to etcd or disk. The controller delivers the WRAPPED DEK plus the KEK id to forkd in the mTLS-protected CreateTemplate and Fork gRPC requests; forkd unwraps via its KEK (--kek-file local AES-256-GCM provider) into process memory only, uses it for cryptsetup, and zeroizes the plaintext immediately after. forkd holds only the wrapped DEK via RequestKeyProvider and NEVER writes the plaintext or wrapped DEK to the node data disk; encryption enabled with no delivered wrapped DEK, or an unwrap failure (wrong KEK), fails closed. forkd refuses to start under --enable-encryption without --kek-file. The KEK never leaves the kms.Wrapper boundary: the local provider loads it by PATH from a Secret-mounted file (never argv, never logged; only the non-secret KEKID fingerprint is logged); a cloud KMS/HSM provider (AWS/GCP/Vault) is an interface-only documented follow-up where the KEK never leaves the HSM. The mTLS channel is ENFORCED, not merely used: forkd refuses to start with --enable-encryption unless its TLS cert/key/CA flags are set, and the controller refuses to deliver the wrapped DEK to a node whose connection is not mTLS (it fails the encrypted build/fork for that node rather than transmit it in cleartext). The DEK and the KEK are never logged anywhere in the key-custody code path (enforced by grep in CI). Proven by envtest and unit tests: the internal/kms round-trip/tamper/wrong-length/KEK-mismatch tests; the envtest proving the Secret stores the wrapped DEK + KEK id and never a raw key, and that the RPC carries the wrapped DEK + KEK id; daemon stash-and-forget of the wrapped form; forkd unwrap-and-zeroize; fail-closed; and DEK/KEK-never-logged. See docs/encryption.md. HUSK DEFAULT (section 0 surface 7): the same mTLS-only delivery and node-memory-only custody apply on the husk path: the wrapped DEK reaches the node over the mTLS control channel, the plaintext is unwrapped and zeroized while a container is opened, and neither is written to the node disk; the in-memory-DEK window is the named residual (HSM custody narrows but cannot eliminate it). Residuals, explicitly: (1) etcd now holds only the WRAPPED DEK and the non-secret KEK id, never the plaintext DEK; the etcd-at-rest-encryption trust is DOWNGRADED to defense-in-depth (an etcd exfiltration without the KEK cannot unwrap the DEK). The KEK custody is the internal/kms Wrapper: local AES-256-GCM from a Secret-mounted KEK file in dev/CI; a cloud KMS/HSM where the KEK never leaves the HSM is the documented follow-up. (2) Controller trust: a compromised controller can read the Secret and deliver the wrapped DEK to any forkd, and (with the KEK) wrap/unwrap; the cluster admin boundary is the trust anchor. The controller no longer holds the plaintext DEK after EnsureEncKey returns (it zeroizes immediately post-wrap). (3) Node-memory dump while open: while a container is open the plaintext DEK is necessarily in forkd process memory to serve I/O; a root attacker with a memory dump recovers it; zeroize-immediately-after-use is the current mitigation, full HSM custody narrows but cannot eliminate this window (dm-crypt requires the key in kernel memory). (4) TEARDOWN BOUNDARY: the controller does not yet send a DeleteTemplate RPC on SandboxTemplate deletion, so the node-side container is GC’d by node data dir lifecycle rather than a controller-driven crypto-shred; tracked as follow-up. PER-WORKSPACE STORE ENCRYPTION (W4 Phase 4, #21 + #31) is now in place. When Workspace.spec.store.encryptionKeyRef is set, every workspace revision CHUNK and MANIFEST is encrypted at rest with AES-256-GCM under a per-workspace DEK before it reaches the store (node CAS or S3), and decrypted on hydrate (internal/workspace/encryption.go, s3store.go). It reuses the SAME envelope custody as templates: the DEK is wrapped by the KEK via kms.Wrapper, the plaintext DEK is unwrapped only in node memory for the duration of a hydrate/dehydrate and zeroized after, and it is never logged, never in an error, never written to a host path. The key is PRINCIPAL-BOUND, pairing with the memory-snapshot policy in section 3. Two invariants are asserted in unit tests BEFORE merge: the manifest digest is computed over PLAINTEXT, so an encrypted dehydrate yields the SAME content identifier as a plaintext dehydrate (content-addressed dedup is preserved: TestEncryptedDigestMatchesPlaintextDigest, TestEncryptedDehydrateDedups), and the encrypted round trip is BYTE-IDENTICAL with chunks ciphertext at rest (TestEncryptedDehydrateHydrateRoundTrip, TestEncryptedChunksAreCiphertextAtRest); a wrong key fails closed at GCM Open (TestEncryptedHydrateWrongKeyFailsClosed). The per-chunk GCM nonce is a keyed HMAC over the plaintext digest (domain-separated from the GCM key), so identical plaintext re-encrypts byte-identically and the at-rest dedup skip still holds. Absent the key ref keeps today’s plaintext store path (backward compatible). Out of scope for now: cloud KMS/HSM providers (AWS/GCP/Vault, interface-only here), KEK rotation and DEK re-wrap, DEK rotation/re-encryption, and encrypting the template snapshot CAS chunk store (the workspace artifact store is now encrypted; the template snapshot CAS chunk store remains a follow-up). |
Device passthrough (GPU/VFIO) and the GPU node tier (issue #221)
GPU support attaches a physical device into the microVM via VFIO passthrough.
This MOVES the security surface: a passthrough device is a new, privileged path
between the guest and host hardware, and a GPU node is a DISTINCT, higher-trust
tier from a CPU-only node. NONE of this is implemented or hardware-validated
here; the device-attach and fork-with-device code is a documented,
HARDWARE-GATED follow-up (see docs/platforms/gpu.md). What ships today is the
scheduling/metering/spec plumbing only (GPU node selection in the registry,
resources.gpu on the template, GPU-seconds metering). The rows below state the
surface so it is not silently expanded when the device path lands.
| Control | Status | Detail |
|---|---|---|
| VFIO passthrough surface | open (not implemented; hardware-gated) | A passthrough GPU gives the guest DMA-capable access to a real PCI device. Without a correctly configured IOMMU and per-device IOMMU group isolation, a malicious guest can program the device’s DMA engine to read or write host memory outside the VM, defeating the KVM boundary. A GPU node is therefore only as isolated as its IOMMU configuration: IOMMU must be enabled and the GPU must sit in a clean IOMMU group (no bridge or sibling device shared with the host). This is a NODE prerequisite the operator owns; the controller cannot verify it. Until the device path is implemented AND validated on real GPU KVM hardware, do not attach a GPU to a sandbox that runs untrusted code. |
| GPU node tier (higher privilege) | open (documented) | GPU nodes are a SEPARATE, higher-trust node pool, labeled mitos.run/gpu. They carry expensive, DMA-capable hardware and (per the row above) a larger host-escape surface than CPU-only nodes. Treat them as a distinct blast-radius tier: dedicate them (issue #172 spec.placement) to trusted or single-tenant workloads, do not co-schedule untrusted multi-tenant sandboxes onto them until the VFIO surface is validated, and account for the fact that a successful device-DMA escape on a GPU node compromises that node’s host, not just the VM. The scheduler pins GPU pools to these nodes; it does NOT downgrade their trust requirement. |
| Fork-with-device (no live fork of a GPU sandbox) | refused by design (documented) | A GPU sandbox CANNOT be live-forked while the device is attached. A snapshot/fork captures and restores guest RAM with MAP_PRIVATE, but a passthrough PCI device has live hardware state (queues, BARs, on-device memory, DMA mappings) that is NOT part of the guest RAM image and cannot be coherently duplicated into a fork; restoring two VMs that both believe they own the same physical GPU is incoherent and unsafe. This matches Modal’s documented limitation (memory-snapshot fork is GPU-incompatible). The stance, stated like the issue requires: a GPU sandbox is NOT forkable with the GPU attached. The fork engine must fail closed on a fork of a GPU-attached sandbox when that path is built (the device must be detached first, or the sandbox forked only before attach). See docs/platforms/gpu.md. |
| GPU-seconds metering integrity | partial (accounting only; measurement hardware-gated) | GPU-seconds is added as a billable unit (internal/metering, summed straight per sandbox since a GPU is exclusively assigned and never CoW-shared across forks). The accounting math is unit-tested. The REAL per-device GPU-second measurement (reading actual device-busy time) is hardware-gated and not implemented; today the field is the seam the usage pipeline (#211) and Stripe metering (#212) bill on, fed by wall-clock-held-device time, not on-device utilization. |
Isolation tiers: not every node is the same assurance (issue #40)
Mitos’s default and strongest isolation is the hardware-virtualization microVM (KVM + Firecracker). Two WEAKER mechanisms exist as run-anywhere or fallback tiers, and the threat model’s job is to make sure they are NEVER silently treated as equivalent to hardware virt:
- PVM (Firecracker on the PVM kernel module, pagetable-based virtual machine,
Ant/Alibaba): runs guests in ring 3 via pagetable switching with NO VMX/SVM, so
Firecracker runs on a plain cloud VPS that exposes no
/dev/kvmnested virt. Ring-3 pagetable isolation is WEAKER than hardware virt: there is no VT-x/AMD-V root-mode boundary, the host kernel runs out-of-tree patches, and the guest shares more of the host’s privilege machinery. PVM is EVALUATED, NOT ADOPTED (docs/platforms/pvm-evaluation.md); this row exists so that if a PVM tier is ever enabled it is a documented lower-assurance tier from day one. - gVisor (userspace-kernel syscall interposition): not a VM at all. The software-isolation tier, relevant to the gVisor fallback and to ADR 0005 (raw forkd is not multi-tenant). Weaker still than a VM boundary.
The MITIGATION that makes a mixed fleet safe is a NODE isolation tier plus a per-pool/template assurance FLOOR, both shipped here:
| Control | Status | Detail |
|---|---|---|
| Node isolation tier label | mitigated (scheduling control) | Each node declares its isolation assurance via the mitos.run/isolation-tier label (hardware-kvm, pvm, or gvisor), mirrored into the scheduler’s NodeInfo.IsolationTier (IsolationTierFromNodeLabels). An UNDECLARED node is treated as the LOWEST assurance (fail-closed): it satisfies no floor. The tier is the node’s property, never inferred from the workload. The node-side mechanism that legitimately earns a tier (real hardware virt, a PVM host kernel, a gVisor runtime) is operational and out of scope for the controller; a typo or unrecognized label value never promotes a node above its declared assurance. |
Required assurance floor (minIsolationTier / requireHardwareKvm) | mitigated (scheduling control) | A template sets spec.minIsolationTier (hardware-kvm/pvm/gvisor) or the convenience spec.requireHardwareKvm: true; the controller’s node selection (internal/controller/scheduler.go admitsTier) admits ONLY nodes whose declared tier meets the floor. A floor is a minimum, so a stronger node still qualifies. A security-sensitive tenant requiring hardware-kvm therefore NEVER lands on a PVM or gVisor node, and when no node meets the floor the placement fails loudly with ErrNoCapacity (a relabel-or-relax remediation), never a silent downgrade. requireHardwareKvm can only tighten, never weaken, an explicit floor. Unit-tested: a hardware-kvm floor skips a PVM node, an undeclared node fails a real floor, and no floor uses any node. |
| PVM / gVisor co-tenancy posture | open (documented; opt-in only) | A PVM or gVisor node is a LOWER-assurance tier and must NOT co-host security-sensitive multi-tenant work unless the operator explicitly opts in. The default posture for an untrusted multi-tenant pool is to set a hardware-kvm floor so it cannot be scheduled onto a weaker tier. PVM nodes additionally carry an out-of-tree host kernel (a larger and less-reviewed host TCB than a stock hardware-virt node); treat a PVM-node host escape as compromising that node, and dedicate PVM nodes to trusted or explicitly-lower-assurance workloads. The control above is the enforcement seam; the operator owns the policy decision of which tenants may use which tier. |
| Confidential tier (SEV-SNP / TDX), memory encrypted from the host/operator | evaluated, NOT adopted (#354) | A hardware-memory-encrypted tier (AMD SEV-SNP / Intel TDX) was researched as the highest-ceiling assurance level: encrypt guest memory even from the node operator. Finding (docs/platforms/confidential-microvm-evaluation.md): confidential plus WARM FORK is infeasible on current Firecracker and hardware (Firecracker has no confidential support and snapshot/restore is named as the blocker; snapshot/restore and CoW fork both violate the per-guest-key, single-physical-page-owner, and fresh-attested-launch invariants of SEV-SNP/TDX). The only viable shape is a SEPARATE cold-boot tier (fresh per-VM attested launch, no fork, on QEMU or cloud-hypervisor, not Firecracker), which is a large separate program, out of scope until regulated demand justifies it. Per the no-unverified-claims rule, Mitos does not market confidential microVMs until such a tier is built and attested end to end. The spike also corrects the #40 0xc0010007 finding (it is a PMU counter MSR, MSR_K7_PERFCTR3, not a SEV MSR; a PMU snapshot-replay bug, not a confidential-computing one). |
6. Secrets
| Control | Status | Detail |
|---|---|---|
| Claim-time injection (not baked into snapshots) | partial | The design is right: pools snapshot before secrets exist; the controller resolves Secret refs at claim time (sandboxclaim_controller.go:resolveSecrets). Delivery into the guest is implemented over vsock post-restore (internal/daemon/server.go:deliverConfig); never via boot args, never via the FC API socket. Strict on real engines: if secrets cannot be delivered, the fork fails and the VM is reaped (a sandbox that reports Ready without its secrets is a lie). The mock engine skips delivery entirely; no guest exists. The same post-restore handshake also sends NotifyForked (32 bytes of host crypto/rand entropy plus a fork generation) before config. The reseed is now FAIL-CLOSED on every engine. internal/daemon/sandbox_api.go NotifyForked RETURNS the guest NotifyForkedResponse; internal/daemon/server.go notifyForked reaps the fork when the guest reports ReseededRNG:false (and ForkRunning reaps the live-fork child the same way), and cmd/sandbox-server/main.go handleFork runs reseedFork and unregisters the sandbox on a ReseededRNG:false result, so a guest that connected but did not reseed is never served (no duplicate keys/tokens/UUIDs across forks). The HUSK path was already fail-closed (internal/husk productionNotifier). The guest side is now honest too: the Rust agent’s reseed_crng reports success ONLY when the credited RNDADDENTROPY ioctl succeeds and returns false on the uncredited write fallback, so the boolean the gate keys on cannot be a false positive. Status: mitigated on all engines. See docs/fork-correctness.md row 1. Entropy bytes are never logged by host or guest. Resolved secret values (ForkRequest.Secrets) now transit the mTLS-protected controller→forkd channel when deployed as shipped (§3); they remain plaintext on the wire only in flag-less dev deployments, where forkd warns loudly. |
| Live-fork secret inheritance | mitigated (default-deny) | Forks of secret-holding sandboxes are rejected by the fork path without explicit spec.secretInheritance: inherit; opt-ins are recorded as an audit condition (sandboxfork_controller.go). Per-fork credential reissue remains the end state (open). See docs/fork-correctness.md §3. |
| Controller RBAC for Secrets | narrowing shipped, opt-in | A namespaced path now exists: the chart ships a mitos-pool-secrets ClusterRole (Secrets verbs, never bound cluster-wide) and grants the controller a bind verb SCOPED to that one ClusterRole plus rolebindings management; the controller binds itself to it per pool namespace as it adopts one (EnsurePoolSecretsRoleBinding, additive and idempotent), and in its own namespace via a chart RoleBinding. With controller.namespacedSecretsRBAC=true the cluster-wide Secrets grant is REMOVED, so a stolen controller token reaches Secrets only in the controller namespace and adopted pool namespaces, not cluster-wide. Default false for safe rollout (the per-pool bindings reconcile while the cluster-wide grant is still present, then operators flip the value). The bind grant is pinned by resourceNames to the one ClusterRole and is not a general escalation primitive. Residual: the controller can still write Secrets into any adopted pool namespace, and the cluster-wide pods create/delete grant (husk pods) is unchanged. Escalation-knot resolution and rollout are documented in docs/superpowers/plans/2026-06-18-rbac-narrowing.md. |
- Cross-namespace secret replication. The controller projects ONLY mitos-ca (ca.crt) from its namespace into every pool namespace where it creates husk pods (ReplicateHuskSecrets). The CA private key (ca.key) is never copied, and the forkd server leaf is no longer copied either: the husk control channel serves a per-namespace leaf (mitos-husk-tls, issued in-namespace by EnsureHuskTLS), so a pool namespace holds only the public CA cert and its own husk server key, never the CA signing key and never the shared forkd key. Scope: the cluster-wide secrets grant is the enabling privilege for writing ca.crt into pool namespaces; a namespaced grant scoped to pool namespaces is a follow-up once pool namespaces are enumerable at install time. This pool-namespace-holds-no-CA-signing-key property holds for MULTI-namespace deployments. In the single-namespace self-host topology (a pool created in the controller’s own namespace, now supported so husk pods start, #414) the pool IS the controller namespace and therefore naturally holds ca.key: the cross-namespace isolation boundary does not exist there because there is no second namespace. This is an accepted simplicity/isolation tradeoff for single-tenant self-host; multi-tenant deployments MUST give each pool (tenant) its own namespace distinct from the controller’s, where the property above applies.
- Husk activation target authenticity (warm-slot impersonation). Activation
delivers the claim’s resolved secrets plus the per-sandbox bearer token to the
husk pod’s self-reported PodIP over the mTLS control channel. So if pod
selection trusted only the husk LABELS (which any principal with pod-create in
the pool namespace can set), a tenant could stand up a decoy pod pointing at
their own listener and receive another claim’s secrets and token. Mitigated by
two independent, composing controls:
- PROVENANCE:
selectDormantHuskPodrequires the controller owner reference reconcileHuskPods stamps (a controller reference of Kind SandboxPool naming the pool with BlockOwnerDeletion=true), so only pods the controller actually created are activation targets. The forgery barrier is BlockOwnerDeletion: the OwnerReferencesPermissionEnforcement admission plugin refuses to let a tenant set it referencing a pool whose finalizers subresource they cannot update. - PER-NAMESPACE IDENTITY (now implemented): the husk control channel no longer
uses the shared forkd server key. Each pool namespace gets its OWN server
leaf,
husk.<namespace>.mitoswith a distinct key, issued by EnsureHuskTLS into a namespace-localmitos-husk-tlsSecret; the controller pinshusk.<namespace>.mitoswhen dialing a pod in that namespace (HuskDialTLSConfig), and ReplicateHuskSecrets projects onlyca.crtinto a pool namespace, never the forkd server key. So even a per-namespace key leaked in namespace A cannot impersonate a husk in namespace B (the controller dials B pinninghusk.B.mitos, which A’s leaf does not satisfy); within A, impersonation yields only A’s own tenant secrets. This does not depend on the OwnerReferencesPermissionEnforcement plugin, so it is the durable control on clusters where control (1) is weaker. End-to-end verified by the husk KVM e2e (the controller-pins-vs-husk-serves handshake). ROTATION: EnsureHuskTLS now reissues the per-namespace leaf when it is within HuskCertRenewBefore (30 days) of NotAfter, so a long-lived pool rotates its husk serving cert ahead of expiry across reconciles rather than serving an aging leaf until the control channel breaks.
- PROVENANCE:
7. Multi-tenancy statement
What a namespace boundary buys you today: RBAC on the CRDs, and nothing else. Pools, claims, and forks are namespace-scoped objects, but:
- Snapshots on a node are a flat directory shared by all tenants; no per-namespace separation, no enforcement that a claim only forks snapshots its namespace published. open
- VMs of different namespaces share nodes, host kernel, and forkd BY DEFAULT. open
- A pool MAY opt into node separation via
.spec.placement(PoolPlacement: anodeSelectorANDed onto the husk pods plustolerationsfor tainted dedicated nodes). The controller pins the pool’s husk pods to the matching nodes AND constrains the template snapshot build/distribution to that same node set (placementFilter/createSnapshotsOnNodes,internal/controller/sandboxpool_controller.go, issue #172), so a placed pool’s VMs and their snapshot never land on a node outside its dedicated set. This is SCHEDULING-enforced separation (KubernetesnodeSelector/ taints), NOT a hardware or hypervisor isolation guarantee: it depends on the operator labeling and tainting the dedicated nodes and not co-scheduling other tenants there. Node-side taint enforcement and per-tenant node-pool provisioning remain the operator’s responsibility. partial
Until the above are closed, treat the whole cluster as one trust domain unless an operator has provisioned dedicated, tainted node pools and placed each tenant’s pools onto them. This posture is recorded as a residual decision in docs/adr/0004-node-flat-snapshot-trust-domain.md.
7a. Per-org namespace tenancy boundary (issue #288)
The hosted offering hardens the namespace boundary above into a declarative
per-org isolation namespace. An Org CR (cluster-scoped; its name is the org id)
is reconciled into mitos-org-<id> by the OrgReconciler
(internal/controller/org_controller.go), gated behind --enable-org-tenancy
(default OFF, so a self-host single-tenant install is unaffected). Each org’s
pools, warm husk pods, claims, sandboxes, and secrets live in its own namespace.
The reconciler is self-healing: it watches the objects it owns, so a manually
deleted or drifted stack object enqueues the Org and is recreated; org deletion
cascades through owner references (the cluster-scoped Org owns the cluster-scoped
Namespace, a valid owner edge).
What isolates two orgs: the separate namespace + the per-org default-deny NetworkPolicy + the per-org ResourceQuota ceiling + the microVM. No single one of these is the whole boundary; the microVM remains the in-pod isolation guarantee, the namespace + NetworkPolicy is the cross-tenant control-plane boundary, and the ResourceQuota is the abuse-control ceiling.
| Boundary | Status | Mechanism |
|---|---|---|
| Per-org namespace separation | mitigated | Each org’s workloads land in mitos-org-<id> (tenant.NamespaceForOrg), stamped with the mitos.run/org label. Cross-org objects are in different namespaces, so RBAC on the CRDs and the default-deny NetworkPolicy below apply per namespace. The namespace is owner-referenced to the cluster-scoped Org so deletion garbage-collects the whole stack. |
| Cross-org network reachability | partial | The reconciler creates a per-org default-deny networking.k8s.io/v1 NetworkPolicy (mitos-org-default-deny) selecting ALL pods in the namespace (empty PodSelector), declaring BOTH Ingress and Egress policy types with NO ingress rules (deny all ingress) and a single egress rule allowing only cluster DNS (UDP/TCP 53). Pods can resolve names but cannot otherwise egress or be reached, and pods in different org namespaces cannot reach each other. HONEST CNI CAVEAT: a NetworkPolicy only enforces on a CNI that implements it (Calico, Cilium, etc.); on a CNI without NetworkPolicy support this object is inert, so a NetworkPolicy-enforcing CNI is REQUIRED in a hosted multi-tenant cluster. The in-pod nftables filter the husk-stub programs remains the per-pod egress guarantee regardless of CNI (mirrors husknetworkpolicy.go). |
| Per-org resource ceiling / abuse control | mitigated | A per-org ResourceQuota (mitos-org-quota) caps the pod count, the count/sandboxes.mitos.run object count, and the aggregate CPU/memory limits in the namespace. The ceiling comes from the Org’s spec.quota override or the controller defaults (--org-default-max-sandboxes, --org-default-cpu, --org-default-memory); a partial override keeps the defaults for the unset fields. A per-org LimitRange (mitos-org-limits) supplies default container limits so a pod that omits limits still counts against the quota. This bounds the blast radius of one org’s abuse (cluster-wide ceilings still apply on top). |
| Privileged PSA per org | accepted (documented) | The org namespace carries pod-security.kubernetes.io/enforce: privileged (plus audit/warn = privileged) because the husk pod needs /dev/kvm via the device plugin, NET_ADMIN for the in-pod egress firewall, and a short-lived privileged init container for name-egress pools. PSA level governs what a POD in the namespace may REQUEST, not cross-tenant access: the cross-org boundary is the separate namespace + the NetworkPolicy + the microVM, none of which privileged PSA weakens. This is the same privileged-PSA posture the husk pod already requires (section 0), now per org. |
| Per-org Secrets RBAC | partial | The reconciler binds the existing cluster-wide mitos-pool-secrets ClusterRole (a DEFINITION shipped by the chart, never bound cluster-wide) to the controller’s ServiceAccount INSIDE each org namespace via the mitos-pool-secrets RoleBinding, so the controller manages pool Secrets there without a cluster-wide Secrets grant. This extends the existing namespacedSecretsRBAC narrowing to org namespaces. |
PRODUCTION GATE: per the #208 sequencing, public self-serve untrusted multi-tenancy does NOT switch on until fork-correctness (#1), failure/GC (#163), and the external security review (#194) are green and this per-boundary checklist passes. Until then, run in waitlist / design-partner mode.
7b. Customer front door: accounts, orgs, and the public API gateway (issue #210)
The hosted offering adds a NEW public attack surface layered ABOVE the internal
mTLS and per-sandbox token plane: an internet-facing gateway (cmd/gateway,
internal/saas), customer-presented API keys, external accounts and
organizations, and cross-tenant (org) isolation. This surface does NOT replace
the internal plane; it sits in front of it and forwards org-scoped, authenticated
requests to the control plane. Design: docs/saas/accounts-gateway.md.
PRODUCTION GATE: this front door is NOT cleared for production tenants until the external security review (issue #194) covers it. Until then it is a development and review artifact only.
| Boundary | Status | Mechanism |
|---|---|---|
| Customer key authentication | partial | Prefix-tagged keys (mitos_live_...) hashed at rest with a process pepper, never stored in the clear, verified in CONSTANT TIME (crypto/subtle), with expiry, revocation, and scope checks. Only a masked prefix is shown after creation; the raw key is returned exactly once and is never logged or placed in an error. Unit-tested: verify rejects forged, malformed, expired, revoked, and wrong-scope keys (internal/saas/keys_test.go). The pepper is loaded from a secret in production (a follow-up); the default empty pepper still gives a sound sha256 hash. TLS termination and edge rate-limiting are follow-ups. |
| Cross-org (tenant) isolation | partial | The OrgID the gateway forwards to the control plane is taken SOLELY from the verified key, never from the request body or path, so a key for org A cannot address org B’s resources even by stuffing another org id into the body (TestGatewayCrossOrgIsolation). Key resolution maps a key to its own org only (TestOrgAKeyCannotResolveOrgB). The management verbs reject a non-member, so a user cannot mint, list, or revoke another org’s keys (internal/saas/account_test.go). The control-plane forward target that enforces this org boundary inside the cluster is a documented follow-up; the seam and its contract are tested here against a fake. |
| Public error surface | mitigated | The gateway maps every internal failure to the LLM-legible envelope (internal/apierr). A missing, malformed, unknown, expired, or revoked key all collapse to unauthorized so a probe cannot distinguish them; a valid-but-not-permitted key is forbidden; an org over a hosted quota is quota_exceeded. The two new codes are in sync with docs/api/errors.md, the JSON Schema, and llms.txt (the #28 sync tests are green). |
| Hosted quota and abuse controls | partial | The real QuotaEnforcer (issue #213, internal/saas/quota) now plugs into the gateway seam: it resolves the org’s plan tier, enforces per-sandbox size, live concurrency, and aggregate vCPU/memory/storage caps against the org’s CURRENT footprint (a live-count seam, with a #211-usage-store-backed fallback), and charges per-org AND per-IP token buckets split into lifecycle and in-sandbox rate windows plus a separate creation-rate bucket. A denial maps to the precise public code (quota_exceeded, rate_limited, or forbidden) via the gateway’s apiErrorProvider seam. Tiers are a prepaid ladder: a new signup lands on the tightest tier with a deny-by-default egress posture and the smallest caps; paying climbs the ladder. The enforcer, the rate limiter, the tier->egress mapping, and the kill-switch are unit-tested (over-concurrency, over-aggregate, over-size, within-quota, cross-org isolation, rate-limit allow/reject/refill, suspended-org fail-closed, emergency stop, abuse-signal-driven suspension). HONEST CAVEAT: the live multi-node count and the abuse-detection SIGNALS are documented seams (LiveCounter, AbuseSignal); this slice ships the enforcement decision and the suspend mechanism they drive, not the cluster-wide counter or the detectors. This abuse-control envelope is the HARD GATE on enabling public self-serve untrusted multi-tenancy: it must exist and be wired before that surface opens. |
| Per-tier egress policy (issue #219 datapath, #36 abuse ports) | partial | The tier->egress-policy mapping (internal/saas/quota EgressTier.Policy / Tier.ResolveEgress) selects which #219 netconf.SandboxPolicy a tier gets: free is deny-by-default (no egress without an explicit allowlist), a blocked tier maps to BlockNetwork, an open (paid) tier maps to EgressAllow. Every tier, including the open one, carries the fleet-wide abuse-port block (outbound SMTP 25/465/587, issue #36) and inherits the unconditional cloud-metadata drop, so even an open tier cannot send mail spam or steal node IAM credentials. This is policy SELECTION, unit-tested in isolation; the real packet enforcement is the KVM-gated #219 datapath. |
| Kill-switch / org suspension (issue #36) | partial | Org suspension is a first-class verb (internal/saas/quota KillSwitch): suspend one org (manual review hook), suspend a set at once (pool-wide / org-wide emergency stop), or suspend automatically from an abuse signal. A suspended org fails closed: the enforcer denies it before any quota math (forbidden), so its keys are rejected at the gateway and new claims are refused. An automated suspension is held for human review before it can be lifted. Unit-tested: a suspended org’s request is rejected, lifting restores access, emergency stop suspends all named orgs, an abuse signal drives suspension. The VM-freeze effect on already-running sandboxes and the durable suspension store are documented seams. |
| Metered billing and payment secrets (issue #212) | partial | The billing core (internal/saas/billing, docs/saas/pricing.md) wires Stripe metered usage, a credit ledger, spend caps, dunning, and a webhook handler. It is built ENTIRELY against a StripeClient interface with an in-memory FakeStripe: NOTHING in this slice makes a real charge, and the Stripe Go SDK is not a hard dependency. A Stripe API key, the webhook signing secret, and any payment-method detail are SECRETS that are never logged, never put in an error or a condition/webhook note, and never written to a host path; the kill-switch and webhook notes carry org ids, caps, and counts only. The metered push is idempotent on the (org, sandbox, window)+meter key (the same #211 record key), so a retried push never double-reports; the credit ledger is append-only and floors at zero so accounting cannot go negative or double-debit. A breached HARD spend cap suspends the org through the #213 kill-switch (the runaway-agent backstop) with a manual hold. Unit-tested: idempotent push (incl. retry after a transient failure), credit drawdown/no-negative/idempotent replay, soft-cap alert vs hard-cap suspend, dunning transitions, webhook status update. SEAMS (documented follow-ups, NOT enabled here): the real Stripe SDK adapter, the real webhook SIGNATURE verification (the signing-secret check over Stripe-Signature; the test uses a fake verifier that trusts the body), and durable stores. The webhook HTTP entry point never echoes the raw body or the signature in an error. PRODUCTION GATE: real-charge wiring is not enabled and must be reviewed (key handling, no-secret-logging, no-unverified published prices) before going live. |
| Session and login | partial | Token-based session resolution, hashed at rest, constant-time (internal/saas/session.go). Browser-based OAuth login is a documented follow-up; mitos auth login is --token only today. |
| Persistence | open | The account, org, membership, key, and session stores are IN-MEMORY for this slice (lost on restart, single-process). The Postgres store and its migrations are a documented follow-up behind the same Store interface. Do not run the front door as a stateful production service until that lands. |
| Console BFF per-verb permission enforcement (B3d-1) | mitigated | Every mutating console endpoint is gated by a permission verb from the enforced vocabulary (members.manage, projects.manage, secrets.manage, settings.manage, billing.manage, resources.use, read). The gate is DENY BY DEFAULT: Console.authorize (internal/saas/console/authz.go) resolves the caller’s role via permissionsFor, and if the required permission is absent the handler returns 403 with a cause naming the missing permission. An unknown role string (one that matches no built-in and no custom role in the org store) returns an empty permission map, so it is denied on every verb without a separate error path. The org is sourced from the verified session/key context only and is never taken from the request body or path, so the existing cross-org isolation invariant is unchanged. Built-in roles (owner, admin, member, viewer, billing) are resolved through saas.Role.Can over the fixed knownPermissions slice; a custom role’s permissions are looked up in the CustomRoleStore scoped to the caller’s org. Unit-tested: internal/saas/console/authz_enforce_test.go covers deny-by-default for unknown roles, per-verb allow/deny for each built-in and for custom roles, and the 403 error shape. |
| Console BFF custom-role management restricted to settings.manage (B3d-1) | mitigated | POST /console/roles (upsert) and DELETE /console/roles/{name} (delete) are gated by PermManageSettings. Only built-in roles that carry settings.manage (Owner and Admin) can create or modify custom roles; Member, Viewer, and Billing cannot. A caller without settings.manage receives 403 before the body is decoded. GET /console/roles is gated by PermReadOnly so any org member can read the permission matrix. This is enforced in handleUpsertRole and handleDeleteRole (internal/saas/console/customroles.go). |
| Console BFF custom-role vocabulary invariant (B3d-1) | mitigated | A custom role submitted to POST /console/roles is validated at write time: every permission in role.Permissions must appear in knownPermissions; any unknown string is rejected with invalid_json before the role is stored. A custom role cannot shadow a built-in role name (owner, admin, member, viewer, billing): the handler checks builtinRoles[role.Name] and rejects a collision. These two checks are enforced in handleUpsertRole and unit-tested in internal/saas/console/customroles_test.go. |
| Console BFF custom-role org isolation (B3d-1) | mitigated (unchanged from the baseline cross-org row) | Custom roles are stored and resolved in the CustomRoleStore keyed by orgID, which is sourced from the verified caller context only (never from the request body or path). MemCustomRoleStore uses a byOrg map and every read/write takes the orgID from the context; a write for org A can never affect org B, and a read for org A can never return org B’s roles. This is the same isolation invariant as the baseline cross-org row; B3d-1 does not widen or narrow it. |
| Console BFF per-project sandbox access: list filtered to accessible sandboxes (B3d-4) | mitigated | GET /console/sandboxes applies canAccessSandbox to every sandbox before including it in the response. The gate is DENY BY DEFAULT: if the caller holds no applicable permission (no org-wide role, no project membership), the sandbox is silently excluded from the list rather than returned. The decision never falls through to an allow: a caller with an empty permission map (unknown role, no membership) sees an empty list. The org for both the org-wide and the project membership lookups is taken from the verified session context, never from the request, so the filter cannot be bypassed by supplying a different org id. Unassigned sandboxes apply org-wide permissions (any org member with read can see them). Assigned sandboxes are restricted to org managers (projects.manage) plus members of that specific project. Intentional behavior change: assigning a sandbox to a project RESTRICTS its visibility; only project members and org managers can see it in the list. Implemented in internal/saas/console/console.go (handleListSandboxes); the access matrix is exhaustively tested in internal/saas/console/projectaccess_test.go. |
| Console BFF per-project sandbox access: inspect of an inaccessible sandbox returns 404 (B3d-4) | mitigated | GET /console/sandboxes/{id} resolves the sandbox, then calls canAccessSandbox. If the caller cannot access the sandbox, the handler returns 404 (not 403). This avoids leaking the existence of the sandbox to a caller who has no right to know it is there: a caller that has no access to a project-assigned sandbox receives the same response they would receive for a nonexistent id. Implemented in internal/saas/console/console.go (handleInspectSandbox); tested by the no-access-to-project rows of the access-matrix test. |
| Console BFF per-project sandbox access: terminate of an inaccessible sandbox returns 403 (B3d-4) | mitigated | DELETE /console/sandboxes/{id} resolves the sandbox, then calls canAccessSandbox for PermUseResources. If the caller cannot access the sandbox, the handler returns 403. Terminate is a destructive action, so the gate is different from inspect: the caller learns they are denied the action rather than that the resource does not exist (this is consistent: the caller either already knows the id from a prior list they could see, or is guessing, in which case 403 on an accessible resource and 403 on an inaccessible one are indistinguishable). Implemented in internal/saas/console/console.go (handleTerminateSandbox); tested by the no-access-to-project terminate rows of the access-matrix test. |
| Console BFF per-project sandbox access: org-scoped decision, no cross-org widening (B3d-4) | mitigated (unchanged from the baseline cross-org row) | canAccessSandbox (internal/saas/console/projectaccess.go) resolves both the org-wide permission set (permissionsFor) and the per-project membership set (projectPermissionsFor) using the orgID from the verified session context. Neither lookup accepts an org id from the request body or path. A project-level membership in org A cannot grant access to a sandbox in org B, because the project membership list is always fetched for the caller’s org. This preserves the baseline cross-org isolation invariant unchanged through the per-project access layer. |
7c. Expose URL ingress: per-sandbox port exposure (issues #126, #230)
Expose URLs add a NEW public ingress that maps a single-label hostname,
<label>.<expose-domain>, to a port INSIDE a running sandbox through the Mitos
expose edge proxy (cmd/preview-proxy, internal/preview). <label> is an opaque
routing key; the operator configures <expose-domain>. This is the E2B
get_host(port) / Daytona signed-preview-URL surface. It is a deliberate,
caller-initiated hole in the otherwise deny-by-default inbound posture (section
4): a tenant who mints an expose URL is asking for that port to be reachable
from the internet for the URL’s lifetime. Design: docs/preview-urls.md.
PRODUCTION GATE: this ingress is NOT cleared for production tenants until the
external security review (issue #194) covers the public attack surface it adds
and the #213 abuse-control envelope lands. Until then it is a development and
review artifact only. Slice 3 ships TLS termination directly at the Go proxy
with a wildcard cert and hybrid post-quantum key exchange; the proxy is now a
deployed public ingress surface gated by expose.enabled (default false) in the
Helm chart.
Architecture as of slice 2b: the proxy resolves the <label> in the incoming
hostname to a route (NodeEndpoint, SandboxID, Port, Token) via the route table,
verifies the signed expose token, and reverse-proxies to the forkd expose handler
at http://<NodeEndpoint>/v1/sandboxes/<id>/expose/<port>/ (the slice-1 forkd
surface, section 3 row ANY /v1/sandboxes/{id}/expose/{port}/). The upstream is
NEVER derived from request input; it comes only from the route table. The
per-sandbox bearer crosses the cluster network to forkd :9091 in CLEARTEXT:
this is the SAME trust model as the existing SDK path (the SDK reaches forkd
:9091 in cleartext with a per-sandbox bearer); the cluster network is the trust
boundary. In-cluster TLS for forkd :9091 is a recorded follow-up.
The route table is populated by the controller ExposeRouteReconciler (slice 2b).
On each reconcile it lists all Sandboxes, selects those that are Ready
(Status.Phase==Ready) with spec.expose set and a non-empty
Status.Endpoint, reads each one’s per-sandbox bearer from its
<name>-sandbox-token Secret (key token, namespace-scoped so two same-named
Sandboxes in different org namespaces never collide), builds the full route set,
and POSTs it to POST /internal/routes authenticated by the shared admin bearer.
The POST body carries per-sandbox bearer tokens over the in-cluster hop in
CLEARTEXT, the same cluster-network trust model as the forkd :9091 hop. The
admin token is sourced from EXPOSE_PROXY_ADMIN_TOKEN (environment variable, never
argv, never logged). The reconciler is disabled by default; it activates only when
--expose-proxy-admin-url is set. This completes the end-to-end path: a Sandbox
with spec.expose set and Status.Phase==Ready is reachable at
<label>.<expose-domain> with routes kept current by the reconciler.
| Boundary | Status | Mechanism |
|---|---|---|
| Signed, expiring URL minting and verification | mitigated (unit-tested) | An expose token is a detached HMAC-SHA256 over (sandboxID, port, expiry), base64url payload + tag, keyed by a server secret (internal/preview/sign.go). It reuses the SAME standard-library HMAC + crypto/subtle constant-time-compare core proven in internal/captoken and the W4 SigV4 signer; it is NOT a captoken because it needs no macaroon attenuation chain, only a single expiring binding. Verify rejects an expired token (never-accept-after-expiry, boundary-inclusive), a tampered payload or signature, and a token signed with a different key. The token VALUE and the server secret are bearer credentials: never logged, never in an error/condition, never on a host path. The signing path logs nothing. Unit-tested: round-trip, expiry-boundary, tamper (payload and signature), wrong-key, malformed. |
| Cross-sandbox token binding | mitigated (unit-tested) | The proxy resolves the label to a route (carrying SandboxID and Port) and requires the verified token to name THAT SAME sandbox and port: a token minted for sb-2 presented against a route for sb-1 is rejected with 403 even though it verifies cryptographically. So a leaked URL is bound to exactly one sandbox and one port; it cannot be replayed against another sandbox or a different port. |
| Reserved-name blocklist and Host allowlist | mitigated | A fixed set of labels (www, app, api, console, admin, auth, login, and others) is rejected with 404 by IsReservedLabel in the host parser BEFORE route table lookup, so a reserved name cannot be routed even if a route were registered under it. The route table itself acts as a Host allowlist: any label not present in the table returns 404, so arbitrary <label>.<expose-domain> hostnames have no target to proxy to. |
| Upstream derived from route table, never from request input | mitigated | The NodeEndpoint, SandboxID, and Port that identify the forkd upstream are read exclusively from the route table (populated only by the authenticated admin route-sync endpoint). No part of the upstream URL is derived from request headers, query parameters, or the sub-path beyond the per-route port. An SSRF path to steer the proxy to an unintended backend does not exist. |
| Per-sandbox bearer inject, token strip, and Authorization delete | mitigated | After token + binding checks pass, the proxy (a) injects the per-sandbox bearer token into the upstream Authorization header, (b) STRIPS the expose token from the forwarded query so forkd never sees the public expose bearer, and (c) unconditionally deletes any inbound Authorization header from the original request so a caller cannot relay an unrelated bearer to forkd. The per-sandbox token VALUE is never logged. |
| Dot-segment cleaning prevents prefix escape | mitigated | The URL sub-path after the label hostname is passed through path.Clean before being appended to http://<NodeEndpoint>/v1/sandboxes/<id>/expose/<port>/. Traversal sequences (../, ./) are collapsed so the proxy cannot be steered to a forkd route outside the expose prefix. |
| Route table GC on terminate | mitigated (unit-tested) | The route table maps <label> to a Route (NodeEndpoint, SandboxID, Port, Token, Sharing). RouteTable.Sync(routes) reconciles the table to exactly the provided set; a label not in the new set is REAPED immediately, so a terminated sandbox is unroutable (404) on the next sync. No route means 404; an expose URL for a dead sandbox cannot proxy anywhere. Unit-tested: add-on / remove-on-sync GC against an injectable route set. The slice-2b ExposeRouteReconciler drives Sync from the live Ready Sandbox watch; a Sandbox that leaves the Ready-and-exposed set drops from the next posted set and is reaped by the proxy. |
| Per-sandbox bearer hop to forkd :9091 in cleartext | partial | The per-sandbox bearer is injected into the upstream request and sent to forkd :9091 over plain HTTP (the same cleartext path the existing SDK uses). The cluster network is the trust boundary. A cluster-internal attacker with network access could observe or replay the bearer. In-cluster TLS for forkd :9091 is a recorded follow-up; until it lands, this hop is no weaker than the existing SDK path but is not hardened against a cluster-internal observer. |
Admin route-sync endpoint (POST /internal/routes) | mitigated | The admin endpoint accepts the new route set from the slice-2b ExposeRouteReconciler. The admin bearer (MITOS_EXPOSE_ADMIN_TOKEN) is read from the environment at startup and compared with crypto/subtle constant-time on every request; the token VALUE is never logged and never appears in an error body. An empty MITOS_EXPOSE_ADMIN_TOKEN disables the endpoint entirely (404 for all POST /internal/routes requests); it does NOT default to open. The endpoint is a new authenticated control surface: a compromised admin token lets an attacker inject arbitrary routes (redirect expose URLs to attacker-controlled NodeEndpoints). The token is protected the same way as all other bearer secrets in this system: environment variable, never logged, constant-time compare. |
| Route-sync control loop: controller-to-proxy authentication | mitigated | The ExposeRouteReconciler authenticates each POST /internal/routes with the shared admin bearer sourced from EXPOSE_PROXY_ADMIN_TOKEN (environment variable on the controller, never passed as argv, never logged). The proxy compares it with crypto/subtle constant-time on every request. A bearer-less or wrong-bearer POST is rejected 401. The token VALUE is never logged on either side. The controller POSTs to the plaintext ClusterIP admin Service (mitos-expose-admin:8080, the --http-addr listener); that port is separate from the public TLS Service and is never exposed via the LoadBalancer, Ingress, or wildcard cert. Per-sandbox bearers carried in the POST body travel over the in-cluster hop in cleartext, the documented in-cluster trust model (the same boundary as the forkd :9091 hop); in-cluster TLS for this hop is a recorded follow-up. |
| Route-sync POST body carries per-sandbox bearers over the in-cluster hop in cleartext | partial | The JSON body of POST /internal/routes contains per-sandbox bearer tokens. This POST crosses the in-cluster network in CLEARTEXT, the same cluster-network trust boundary as the forkd :9091 hop the proxy already uses. A cluster-internal attacker with network access could observe the bearer values in transit. In-cluster TLS for this control hop is a recorded follow-up; until it lands, this hop is no weaker than the existing SDK-to-forkd path but is not hardened against a cluster-internal observer. |
| Namespace-scoped token keying prevents cross-tenant token bleed | mitigated | Each Sandbox’s bearer is read from its <name>-sandbox-token Secret in the Sandbox’s own namespace. The controller reads only Secrets in the Sandbox namespace, so a Sandbox in org namespace mitos-org-A can never read the Secret of a same-named Sandbox in mitos-org-B. The route entry keyed by label carries only that Sandbox’s token; the proxy injects only the per-route token into the upstream request. No cross-namespace Secret read path exists. |
| Fail-safe: missing Secret or unreachable proxy requeues, never crashes | mitigated | If the <name>-sandbox-token Secret for a Sandbox is missing, the reconciler logs the event, skips that Sandbox, and requeues after 1 second; it does NOT panic and does NOT block reconciliation for other Sandboxes. If the admin POST fails (5xx or transport error), the reconciler requeues with backoff. Terminal 4xx responses (misconfigured URL or disabled endpoint) are not retried until the next Sandbox change triggers a reconcile. The controller process never crashes on a transient proxy failure. |
| Reconciler disabled by default | mitigated | The ExposeRouteReconciler starts only when --expose-proxy-admin-url is set on the controller. When the flag is absent, no admin POST is ever sent and the expose ingress feature is entirely inactive. This prevents accidental use against a cluster that has not deployed the proxy. |
| TLS at the public edge: wildcard cert | mitigated (slice 3) | The proxy (cmd/preview-proxy) terminates TLS directly. Pass --tls-cert/--tls-key to load an operator-provided wildcard *.<expose-domain> cert (WildcardProvider, internal/preview/cert.go); both flags are required together. A missing or unparseable file is a startup-time fatal error (fails closed). When --tls-cert is absent the proxy falls back to per-SNI self-signed certificates: NOT browser-trusted, suitable for local dev only. In the Helm chart the --tls-cert/--tls-key flags are always set when expose.enabled: true, so self-signed is not the deployed path. Single-key blast radius: the wildcard key covers all subdomains. Mitigation: (1) when expose.tls.certManager.enabled is true (not the default), cert-manager auto-rotation issues short-lived certificates; when certManager.enabled is false (the default), the operator owns rotation and a long-lived static wildcard key has a correspondingly larger exposure window until it is rotated; (2) the key is mounted read-only into a single unprivileged container (allowPrivilegeEscalation: false, capabilities.drop: [ALL]) so its exposure is confined to the proxy process; (3) when cert-manager is enabled, cert rotation replaces the secret and rolling update recycles the pod without downtime. |
| TLS at the public edge: post-quantum key exchange | mitigated (slice 3) | preview.ServerTLSConfig leaves CurvePreferences nil. On Go 1.24+, the default key-exchange preference list leads with X25519MLKEM768 (hybrid FIPS 203 ML-KEM-768 + X25519), so the session key is PQ-protected when the client supports it. This provides confidentiality against harvest-now-decrypt-later attacks. HONEST SCOPE: this is PQ KEY EXCHANGE for CONFIDENTIALITY ONLY. The certificate signature stays classical (ECDSA or RSA); no post-quantum CA exists in the public PKI, so there is NO post-quantum authentication and none is claimed. A guardrail test (internal/preview/tls_pq_test.go, TestServerTLSConfigNegotiatesPostQuantum) asserts that a PQ-only TLS 1.3 client completes the handshake with CurveID == X25519MLKEM768 and that cfg.CurvePreferences == nil; a future PR that pins curves will break this test and cannot silently regress PQ support. |
| Proxy deployed as a public ingress surface (slice 3) | partial | The Helm chart (deploy/charts/mitos/templates/expose-proxy.yaml) deploys the proxy Deployment, Service, and optional Ingress when expose.enabled: true (default false). The controller --expose-proxy-admin-url and EXPOSE_PROXY_ADMIN_TOKEN are wired to the proxy Service. The proxy container runs with runAsNonRoot: true, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, capabilities.drop: [ALL], and seccompProfile: RuntimeDefault. Residuals: per-IP rate limiting and a global connection cap are open (see the Ingress DoS row); the ingress is NOT cleared for production tenants until the #194 external security review covers it and the #213 abuse-control envelope lands. expose.enabled defaults to false so an unaware operator does not accidentally expose the surface. |
| Ingress DoS / unbounded label spray | open | This slice does not add per-IP rate limiting or a global connection cap at the expose ingress; an attacker could spray arbitrary <label>.<expose-domain> hostnames. Unknown labels return 404 quickly (no route lookup cost beyond a map read), but connection-level rate limiting is a documented follow-up, sequenced with the #213 abuse-control envelope and the #194 review before this ingress opens to untrusted tenants. |
Auth-origin session cookie isolation (__Host- prefix) | mitigated (unit-tested) | The preview proxy issues its session cookie with the __Host- prefix, enforcing Secure, HttpOnly, SameSite=Lax, Path=/, and no Domain attribute. A browser cannot send the cookie to a subdomain (a sandboxes URL lives under <label>.<expose-domain>; the session origin is the proxy root), preventing cross-subdomain session fixation. Codec: internal/preview/session.go. Unit-tested: every cookie attribute is asserted. |
| Single-use HMAC grant token | mitigated (unit-tested); HA caveat | The central-auth origin issues a short-lived HMAC-SHA256 grant token (payload + detached tag, domain-tagged as mitos-expose-grant-v1\x00) that is accepted exactly once: the nonce is recorded in a per-process sync.Map on first verify and rejected on replay. Expiry is checked before nonce recording. The grant token VALUE is a bearer credential: never logged, never in an error string. internal/preview/grant.go. Unit-tested: round-trip, expiry, replay (nonce reuse), label mismatch, tampered payload, wrong key, malformed token. HA CAVEAT: the nonce cache is per-process (in-memory sync.Map). With more than one proxy replica a grant presented to a different replica than the one that issued it is not yet in that replica’s nonce cache, so replay is NOT rejected across replicas. Mitigation until a shared store lands: keep expose.replicas: 1 or use sticky sessions at the load balancer so the callback hits the same replica that issued the grant. A shared nonce store (Redis/etcd) is a documented follow-up gated behind #194 before this surface serves untrusted tenants at scale. |
| CSRF state and open-redirect defense | mitigated (unit-tested) | CSRF state: before redirecting to the OIDC provider the auth origin mints a signed CSRF state token (a SessionCodec-signed {rd,path} JSON payload, short TTL) stored in a __Host- state cookie. On the /auth/callback return the state query parameter is verified against the cookie value with constant-time compare; a missing, tampered, or expired state is rejected 400 before the code exchange runs. Open-redirect defense: the rd (redirect) parameter in /start?rd=<label> is validated against the live route table; an unknown label is rejected 400. The path (callback path) parameter rejects values beginning with // or containing a backslash or CRLF via safeRelPath, preventing scheme-relative or header-injection redirects. internal/preview/oidc.go. Unit-tested: state cookie round-trip; tampered state rejected; expired state rejected; unknown label in rd rejected. |
| Verified-email audience rule: unverified email rejected | mitigated (unit-tested) | The allowedEmailDomains audience selector in the authorization pipeline checks BOTH that email_verified == true in the session identity AND that emailDomain(identity.Email) EXACTLY matches one of the listed domains (case-folded; a domain such as evilacme.com does NOT match an entry acme.com). An unverified email is rejected 403 regardless of its domain. In addition, resolveOrgs in internal/preview/oidc.go returns empty org IDs when EmailVerified is false, so the org/private tier also 403s for an unverified-email identity even if the Resolver or GroupsToOrgs fallback would otherwise return org memberships. internal/preview/authz.go, internal/preview/oidc.go. Unit-tested: unverified email with a matching domain is rejected by the audience check; verified email with matching domain is accepted; verified email with non-matching domain is rejected; callback with email_verified=false yields a grant with empty OrgIDs. |
| ForwardAuth BYO-IdP subrequest: client header spoof prevention | mitigated (unit-tested) | The ForwardAuth function builds the subrequest from scratch (http.NewRequestWithContext), forwarding only X-Forwarded-Method, X-Forwarded-Uri, X-Forwarded-Host, X-Forwarded-For, and Cookie. StripForwardAuthHeaders deletes all X-Auth-Request-* keys from the inbound request BEFORE ForwardAuth is called and BEFORE the request is proxied upstream, so a client cannot inject a forged X-Auth-Request-Email or X-Auth-Request-User to claim an identity the auth service did not assert. Identity is built exclusively from the auth service RESPONSE headers. The network-layer CIDR check runs before the forwardAuth subrequest, so a request from a disallowed IP is rejected before the external auth service is contacted. internal/preview/forwardauth.go. Unit-tested: 2xx allow with identity from response; 401 deny; StripForwardAuthHeaders removes client-supplied X-Auth-Request-*; non-request headers are not removed; X-Forwarded-For is forwarded to the auth service. |
| Verified-email to org resolution: bearer-gated in-cluster hop | mitigated (unit-tested) | The POST /internal/identity/resolve endpoint in the console binary resolves a verified email to an account ID and org IDs via AccountService.FindOrCreateByEmail and Organizations. The endpoint is protected by a shared bearer secret (MITOS_IDENTITY_RESOLVE_TOKEN, environment variable, never logged) compared with hmac.Equal (constant-time for equal-length inputs; a mismatched length returns early without iterating, which is acceptable here because the token is operator-configured and fixed-length in practice, and the empty-token guard fires before the compare so an unconfigured endpoint is always denied). The email-to-org mapping is NEVER logged; only the org count is logged. The resolver client (internal/preview/resolve.go) also never logs the token or email-to-org mapping; a non-empty configured URL is required or ErrResolveDisabled is returned immediately without making a request. internal/saas/identity_resolve.go, cmd/console/main.go. Unit-tested: bearer gate (missing, wrong, empty configured token); email resolves to account+orgs; bad body returns 400; empty URL returns ErrResolveDisabled with no HTTP request; non-2xx returns error without token in message. |
| Authorization pipeline: fail-closed tier, network, audience | mitigated (unit-tested) | The Authorize function in internal/preview/authz.go applies three layers in sequence: (1) NETWORK: route.Network is a CIDR allowlist (Route.Network []string); if non-empty the client IP must fall within at least one listed CIDR (NetworkAllows). A nil or out-of-range IP against a non-empty allowlist is denied. Malformed CIDRs are skipped (fail closed: no match returns DenyForbidden). (2) TIER: route.Sharing selects the tier. public always allows (but DenyForbidden if audience fields are set, because audience evaluation requires an identity). authenticated requires a non-nil identity. org/private require a non-nil identity AND route.OrgID present in identity.OrgIDs. link requires a non-nil identity (the proxy decodes the link cookie before calling Authorize). Unknown or empty Sharing is DenyForbidden (fail closed; no default-deny coercion to a different tier). (3) AUDIENCE: AllowedPrincipals uses case-folded exact email match; AllowedEmailDomains requires identity.EmailVerified == true and an EXACT case-folded domain match (containsDomainFolded); a suffix like evilacme.com does NOT match an entry acme.com. Empty allowlist entries are skipped. internal/preview/authz.go. Unit-tested across all tier/network/audience combinations; CIDR gate; suffix-trick rejected; case-insensitive; empty entries guarded. |
| OIDC live-IdP path: honest CI scope | partial | The OIDC authorization code flow (discovery, redirect to provider, callback with state validation, code exchange, ID token verify) is tested with a fake token verifier (oidcauth.FakeVerifier) in unit CI so the flow handlers can be exercised without a live issuer. The live-IdP path (a real /.well-known/openid-configuration discovery, a real code exchange) is maintainer-verified only; it is not covered by an automated CI job. PKCE is a documented follow-up. SAML and SCIM are deferred. This surface stays gated behind the external security review (issue #194) and the abuse-control envelope (issue #213). |
Headless-browser CDP exposure (issue #314)
The computer-use template (images/computer-use) exposes a Chrome DevTools
Protocol (CDP) endpoint over the section-7c expose ingress. CDP is FULL remote
control of the browser: a CDP client can navigate file://, read local files, and
execute arbitrary JavaScript in pages. The exposed CDP endpoint is therefore
treated as a privileged control surface and inherits the section-7c PRODUCTION
GATE: it is not cleared for untrusted tenants until the #194 external review and
the #213 abuse-control envelope land, and it MUST stay behind the expose access
ladder, private by default; the recipe never demonstrates the public tier for
CDP.
| Boundary | Status | Mechanism |
|---|---|---|
| In-guest CDP relay is unprivileged | mitigated | cmd/cdp-relay runs INSIDE the sandbox guest as part of the serving workload; it owns the exposed guest port and reverse-proxies to Chromium on an internal loopback port. It is an in-guest userspace proxy with no host, control-plane, or sibling-sandbox access; it is in the same untrusted-guest trust domain as the workload it fronts (section: Guest workload, untrusted). It exists only to satisfy Chromium’s DevTools host-header check (Chromium rejects any Host that is not an IP literal or localhost): the relay sends the upstream loopback host:port as Host and rewrites the discovery webSocketDebuggerUrl to the external origin. |
| Spoofable X-Forwarded-Host has no privileged effect | mitigated | The relay derives the rewritten discovery host from X-Forwarded-Host (set by the expose proxy from the client’s Host) or, absent it, the client’s own request Host. An authenticated client that spoofs X-Forwarded-Host only changes the host in the webSocketDebuggerUrl it gets back, an unreachable or self-chosen value: a self-inflicted bad URL. It does NOT steer the upstream (the expose proxy derives the forkd upstream solely from the route table, section 7c) and grants no cross-sandbox or cross-tenant access. |
Chromium --no-sandbox: the microVM is the boundary | mitigated | Chromium runs with --no-sandbox, disabling its in-process seccomp/namespace sandbox, the same posture as E2B Desktop and Browserbase. The containment boundary is the Firecracker microVM (sections 1 and 2): a hostile page or malicious script is confined to the one sandbox and cannot reach the host, the control plane, or sibling sandboxes. Rendering untrusted web content is the intended isolation use case. |
8. What we explicitly do NOT claim
- No pod-scoped Kubernetes mechanism (NetworkPolicy, PodSecurity, pod quotas) applies to sandbox VMs. Where we provide an equivalent, it is documented as ours.
- No external security review has been performed. The README must continue to say so until one has.
- Side-channel resistance between forks of the same snapshot is not claimed.
Supply chain and artifact provenance (issue #35)
| Boundary | Status | Mechanism |
|---|---|---|
| Image provenance (controller, forkd, husk-stub) | mitigated for published releases | cosign keyless signing + SPDX SBOM attestation, bound to the image digest, produced by .github/workflows/publish.yaml; consumer verification in docs/supply-chain.md. |
| Image CVEs | partial | Trivy scans the built images on every PR for HIGH/CRITICAL fixable findings (ADVISORY today: reported, not yet gating, pending base-image remediation); govulncheck is the BLOCKING gate for Go call-graph-reachable vulnerabilities; base-image and dependency bumps arrive via dependabot. Runtime re-scan of long-lived published tags is not yet automated. |
| Guest kernel currency | partial | The shipped vmlinux is pinned to an exact version (docs/kernel-cve.md) and validated end to end by the KVM workflow; CVE watch is a documented manual process, not an automated feed. |
| Guest kernel integrity at stage time | partial | The Helm kernel-stage init container downloads the guest kernel and, when kernelProvisioner.kernelSha256 is set, verifies the staged file against that digest and FAILS CLOSED on mismatch (covering both a fresh download and an already-cached file), so a swapped upstream object or a MITMed fetch cannot boot a backdoored kernel fleet-wide. The digest is empty by default (a warning is logged); operators should pin it. The privileged forkd, kvm-device-plugin, and kernel-stage pods set automountServiceAccountToken: false since none call the Kubernetes API. |
| Admission-time signature enforcement | open | The project ships signatures; requiring them at admission (policy-controller/Kyverno) is a documented operator choice, not a default. |
| Org-scoped usage attribution (billing) | mitigated | The org a sandbox bills to is a trust boundary: it is derived only from control-plane identity, never from client input. The controller stamps the mitos.run/org label on a sandbox’s husk pod from the sandbox’s hard-isolation namespace mitos-org-<id> via tenant.OrgFromNamespace (internal/controller/huskpod.go); a client-set mitos.run/org on the input object is overwritten, so a tenant cannot relabel its workload to another org (an adversarial test asserts the namespace value wins). The live usage scraper (internal/usage/livesource.go, run behind the off-by-default --usage-collector flag) reads org only through that controller-stamped label (LabelOrgResolver), and the per-sandbox metering report it pulls carries only ids, byte counts, and seconds, never secret values, argv, env, or file content. A self-host sandbox in a non-org namespace is left unattributed rather than misbilled to a default org. Usage aggregation is idempotent on (org, sandbox, window), so a re-scrape or a credit drawdown cannot double-count. The per-org Prometheus series is labeled by org only (no per-sandbox cardinality, no identifiers beyond the org id). The org/pool-labeled metering gauges (mitos_usage_{vcpu_seconds,mem_gib_seconds,egress_bytes,gpu_seconds}_total) are fed from the SAME store-fed cumulative, so the dashboard figure and the bill are one number; their only label is org (pool is a documented follow-up since the store keys on (org, sandbox, window)). Residual: the in-memory usage store is bounded by a retention window and survives no controller restart; the durable store and the real billing-provider push are the SaaS follow-up (issue #211), where the durable store is the billing system of record. |
| Guest vitals telemetry surface (issue #164 Phase 1.a) | mitigated | The guest health metrics (mitos_guest_{cpu_steal_percent,mem_balloon_bytes,mem_used_bytes,process_count}) are sampled by an off-by-default control-plane runnable (internal/controller/vitals_sampler.go, --vitals-sampler flag). A NEW node-scoped operational endpoint GET /v1/vitals/node (internal/daemon/vitals_api.go, mounted on the forkd operational mux next to /v1/metering, /metrics, /healthz) is UNAUTHENTICATED (the same access class as /metrics), so it returns one numeric-only entry per sandbox on that node. SECRET HYGIENE: each entry carries ONLY the control-plane labels (claim, pool, workspace, namespace; all k8s object names), the numeric guest vitals (steal, balloon, used/total memory), and a numeric process COUNT (process_count). It carries NO per-process table at all: no process command name, pid, state, rss, argv, or env crosses this unauthenticated wire. This is enforced at the source by a dedicated node-batch struct (NodeVitalsNumbers) that has no field for a process object, so a per-process string CANNOT be serialized here; handleNodeVitals writes only len(processes) as the count. The FULL per-process table (program names, pid, state, rss) remains available ONLY behind the per-sandbox bearer-authenticated Connect sandbox.v1.Sandbox.Vitals RPC (used by kubectl mitos ps --processes; the legacy /v1/vitals JSON route was removed in #358), never on this node batch. The sampler decodes ONLY the numeric fields including process_count, and NEVER receives or relies on a per-process command line, argv, pid, env value, or any free-form string. The published metric label set is EXACTLY {org, pool}, both bounded trusted control-plane values, so there is no per-sandbox/per-pid cardinality and no identifier beyond org+pool. Org is resolved via the SAME trusted mitos.run/org husk-pod label (LabelOrgResolver) the usage scraper uses; a sandbox with no resolvable org is left unattributed (counted), never attributed to a guessed or empty org. The node endpoint is NOT per-sandbox traffic, so it carries no per-sandbox bearer token (the controller holds none); it shares the access class of /v1/metering. A sandbox whose guest is unreachable is skipped and counted, never failing the report or the sample cycle. Residual: the endpoint is plaintext on the operational mux today (an https operational mux is the same documented follow-up as /v1/metering); the cpu_steal aggregation is the per-bucket MAX and memory/process_count the SUM, documented in each gauge’s Help text. |
Review gate
An external security review is required before any 1.0 (or “production-ready”) claim. Tracking: not scheduled. The hosted customer front door (section 7b, issue #210: the public gateway, customer keys, and cross-org isolation) is explicitly gated on this review (issue #194) before it serves production tenants.
Change discipline
Any PR that moves the security surface (new listener, new privilege, new artifact type, new cross-component call) must update this file in the same PR.