This document describes how a node accounts for the memory and disk a sandbox consumes, why that accounting is copy-on-write (CoW) aware, what is exact versus approximate, the operational endpoint and metrics that expose it, and how a hosted service would bill on top of it.
The implementation has three parts: internal/metering (the aggregation rule), the engine
Metering() method (internal/fork), which fills the samples, and the forkd
GET /v1/metering endpoint, which serves the report.
Why CoW-aware
Firecracker forks restore the SAME template snapshot with MAP_PRIVATE. Every
fork of a template maps the same shared page set; a fork only diverges as it
dirties pages (copy-on-write). So the naive “sum every fork’s resident set”
double-counts the shared template region once per fork. On a node running N
forks of one template, the shared region is physically resident ONCE, not N
times.
CoW-aware metering counts each template’s shared footprint a single time and attributes only the unique (private-dirty) pages to each fork. The marginal physical cost of an additional fork is its unique set, not a whole VM. This is both the honest capacity number the scheduler must use and the basis for honest billing.
What is metered
Memory (from /proc/<pid>/smaps_rollup)
For each sandbox the engine reads the Firecracker process’s smaps_rollup:
- MemoryUnique = Private_Clean + Private_Dirty. Pages this fork alone owns. Never shared, so never deduplicated.
- MemoryShared = Shared_Clean + Shared_Dirty. Pages mapped from the shared template snapshot (and any other shared mappings).
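The two sums above can be sketched as a small parser. This is an illustration only, not the engine's code: the MemSample type and the parseSmapsRollup function are hypothetical names, and the sample text is made up to match the kernel's "Name: value kB" layout.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// MemSample is a hypothetical holder for the two per-sandbox figures, in bytes.
type MemSample struct {
	MemoryUnique uint64 // Private_Clean + Private_Dirty
	MemoryShared uint64 // Shared_Clean + Shared_Dirty
}

// parseSmapsRollup sums the four fields from the text of a smaps_rollup file.
// Values are reported by the kernel in kB and converted to bytes here.
func parseSmapsRollup(text string) (MemSample, error) {
	var s MemSample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 3 || fields[2] != "kB" {
			continue // the [rollup] header line, or anything not "Name: N kB"
		}
		private := fields[0] == "Private_Clean:" || fields[0] == "Private_Dirty:"
		shared := fields[0] == "Shared_Clean:" || fields[0] == "Shared_Dirty:"
		if !private && !shared {
			continue // Rss, Pss, hugetlb counters, and so on are not used
		}
		kb, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return MemSample{}, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		if private {
			s.MemoryUnique += kb * 1024
		} else {
			s.MemoryShared += kb * 1024
		}
	}
	return s, sc.Err()
}

// sampleRollup is invented data in the smaps_rollup layout.
const sampleRollup = `00400000-7ffd1c5f2000 ---p 00000000 00:00 0                    [rollup]
Rss:              530432 kB
Pss:              140000 kB
Shared_Clean:     512000 kB
Shared_Dirty:          0 kB
Private_Clean:      2048 kB
Private_Dirty:     16384 kB
`

func main() {
	s, err := parseSmapsRollup(sampleRollup)
	if err != nil {
		panic(err)
	}
	fmt.Printf("unique=%d bytes shared=%d bytes\n", s.MemoryUnique, s.MemoryShared)
}
```

Against a live process, the input would be the contents of /proc/<pid>/smaps_rollup for the Firecracker process, for example read with os.ReadFile.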
Disk (apparent sizes via os.Stat)
Disk is accounted per volume by the volume’s fork policy (internal/volume):
- Fresh / Clone: the per-fork backing is counted wholly as the fork’s DiskUnique (Fresh is an empty per-fork backing, Clone is a full byte-for-byte copy; neither shares blocks with siblings).
- Share: the volume maps the template seed directly. The seed is counted as DiskShared and there is no per-fork backing.
- Snapshot (reflink CoW): the volume reflinks the template seed, so it shares the seed (DiskShared = seed apparent size) and the fork’s divergence is approximated as max(0, forkBackingApparentSize - seedApparentSize), counted as DiskUnique.
Volumes are documented in volumes.md.
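The per-policy rules above can be sketched as one function over apparent sizes. The Policy type, its constants, and the diskSample and apparentSize helpers are hypothetical names chosen for illustration; only the arithmetic follows the rules as stated.

```go
package main

import (
	"fmt"
	"os"
)

// Policy is a hypothetical enum for the four fork policies.
type Policy int

const (
	Fresh Policy = iota
	Clone
	Share
	Snapshot
)

// apparentSize returns the logical length of a file as reported by os.Stat.
func apparentSize(path string) (int64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return fi.Size(), nil
}

// diskSample returns the (unique, shared) bytes for one volume, given the
// apparent size of the template seed and of the per-fork backing.
func diskSample(p Policy, seedSize, forkBackingSize int64) (unique, shared int64) {
	switch p {
	case Fresh, Clone:
		// The whole per-fork backing belongs to the fork.
		return forkBackingSize, 0
	case Share:
		// The seed is mapped directly; there is no per-fork backing.
		return 0, seedSize
	case Snapshot:
		// Divergence is approximated as max(0, fork - seed).
		d := forkBackingSize - seedSize
		if d < 0 {
			d = 0
		}
		return d, seedSize
	}
	return 0, 0
}

func main() {
	const gib = int64(1) << 30
	u, s := diskSample(Snapshot, 4*gib, 5*gib)
	fmt.Printf("snapshot volume: unique=%d shared=%d\n", u, s)
	u, s = diskSample(Clone, 4*gib, 4*gib)
	fmt.Printf("clone volume:    unique=%d shared=%d\n", u, s)
}
```

In practice the two size arguments would come from apparentSize on the seed and on the fork backing file.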
The aggregation rule
metering.Aggregate folds the per-sandbox samples into a node Report:
- Sandboxes are grouped by Template (the snapshot id the fork was restored from). A sandbox with an empty Template is its own single-member group, so it never shares with anything.
- Each group’s shared footprint is counted once using the MAX of the group’s members as the representative (SharedOnce for memory, DiskSharedOnce for disk). The max is the conservative representative: all forks of a template should map approximately the same shared set, so the largest observed is a safe single charge.
- The node totals:
  - TotalUnique = sum of every sample’s MemoryUnique (never deduplicated).
  - UsedCoWAware = TotalUnique + sum over templates of each template’s SharedOnce. The honest resident footprint.
  - UsedNaive = TotalUnique + sum of every sample’s MemoryShared (the shared region double-counted), kept for comparison.
  - CoWSavings = UsedNaive - UsedCoWAware: exactly the shared bytes the CoW model reveals are NOT consumed per fork.
  - The Disk* totals mirror the memory totals for backing storage.
GetCapacity (the hot heartbeat path, memory only) reports MemoryUsed = UsedCoWAware and MemoryShared = SharedOnceTotal so the scheduler sees the
honest resident footprint and never double-counts the shared template region
across forks.
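As a rough illustration of the memory half of the rule, the sketch below groups samples by template, takes the max shared value per group, and derives the four totals. It is not the metering.Aggregate source: the Sample and Report types here are cut-down, hypothetical versions, and the group key used for empty-template sandboxes is an assumption.

```go
package main

import "fmt"

// Sample is a reduced, hypothetical per-sandbox sample (memory only).
type Sample struct {
	ID           string
	Template     string
	MemoryUnique uint64
	MemoryShared uint64
}

// Report is a reduced, hypothetical node report (memory only).
type Report struct {
	SharedOnce   map[string]uint64 // per group: max MemoryShared of its members
	TotalUnique  uint64
	UsedCoWAware uint64
	UsedNaive    uint64
	CoWSavings   uint64
}

func aggregate(samples []Sample) Report {
	r := Report{SharedOnce: map[string]uint64{}}
	var sharedSum uint64
	for _, s := range samples {
		key := s.Template
		if key == "" {
			// Assumed key scheme: an empty template forms a single-member group.
			key = "\x00" + s.ID
		}
		r.TotalUnique += s.MemoryUnique
		sharedSum += s.MemoryShared
		if s.MemoryShared > r.SharedOnce[key] {
			r.SharedOnce[key] = s.MemoryShared
		}
	}
	r.UsedNaive = r.TotalUnique + sharedSum
	r.UsedCoWAware = r.TotalUnique
	for _, once := range r.SharedOnce {
		r.UsedCoWAware += once
	}
	// The sum of per-group maxima never exceeds the sum of all members,
	// so this subtraction cannot underflow.
	r.CoWSavings = r.UsedNaive - r.UsedCoWAware
	return r
}

func main() {
	const mib = uint64(1) << 20
	r := aggregate([]Sample{
		{ID: "sb-1", Template: "py", MemoryUnique: 10 * mib, MemoryShared: 100 * mib},
		{ID: "sb-2", Template: "py", MemoryUnique: 20 * mib, MemoryShared: 100 * mib},
		{ID: "sb-3", Template: "py", MemoryUnique: 30 * mib, MemoryShared: 100 * mib},
		{ID: "sb-4", Template: "", MemoryUnique: 5 * mib, MemoryShared: 50 * mib},
	})
	fmt.Printf("cow-aware=%d MiB naive=%d MiB savings=%d MiB\n",
		r.UsedCoWAware/mib, r.UsedNaive/mib, r.CoWSavings/mib)
}
```

With three forks of one template at 100 MiB shared each, the savings equal two of the three shared copies; the empty-template sandbox contributes its shared bytes in full to both totals.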
Exact vs approximate
Exact:
- Per-process unique memory (Private_Clean + Private_Dirty from smaps_rollup). This is the kernel’s own per-process private-page accounting.
Approximate:
- The shared-once representative. We charge a template’s shared set once using the MAX of its forks’ MemoryShared. This is a conservative single charge, not a per-page intersection of what the forks actually still share. If forks diverge in which shared pages they retain, the true shared resident set could be smaller than the representative; the representative never undercounts.
- Disk divergence for reflink (Snapshot) volumes. We use apparent file sizes (os.Stat logical length), not allocated blocks. A reflinked fork that has rewritten few blocks has a large apparent size but a small physical divergence; the apparent-size subtraction overstates the unique disk. Precise block-level accounting is open (see below).
Endpoint and metrics
GET /v1/metering (forkd operational API)
forkd serves the full node Report as JSON on the operational mux. This is
operator/billing data in the same access class as /metrics and /healthz: it
is NOT behind the per-sandbox bearer token. The report holds only sandbox ids,
template names, and byte counts; it never contains secret values.
The JSON shape is the metering.Report: Sandboxes[] (per-sandbox
ID/Template/MemoryUnique/MemoryShared/DiskUnique/DiskShared), Templates[]
(Template/ForkCount/SharedOnce/DiskSharedOnce), and the node totals
(UsedCoWAware, UsedNaive, CoWSavings, TotalUnique and the Disk*
counterparts).
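A client could decode the report along the following lines. This is a sketch under assumptions: the struct types are hypothetical mirrors of the fields listed above, the JSON keys are assumed to match those field names (encoding/json matches keys case-insensitively), and the Disk* totals are omitted because their exact names are not given here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// SandboxUsage mirrors the per-sandbox fields named in the report shape.
type SandboxUsage struct {
	ID           string
	Template     string
	MemoryUnique uint64
	MemoryShared uint64
	DiskUnique   uint64
	DiskShared   uint64
}

// TemplateUsage mirrors the per-template fields named in the report shape.
type TemplateUsage struct {
	Template       string
	ForkCount      int
	SharedOnce     uint64
	DiskSharedOnce uint64
}

// Report holds the memory totals; unknown keys in the JSON are ignored.
type Report struct {
	Sandboxes    []SandboxUsage
	Templates    []TemplateUsage
	UsedCoWAware uint64
	UsedNaive    uint64
	CoWSavings   uint64
	TotalUnique  uint64
}

func decodeReport(data []byte) (Report, error) {
	var r Report
	err := json.Unmarshal(data, &r)
	return r, err
}

// fetchReport issues GET <baseURL>/v1/metering and decodes the body.
// baseURL is whatever address the forkd operational mux listens on.
func fetchReport(baseURL string) (Report, error) {
	resp, err := http.Get(baseURL + "/v1/metering")
	if err != nil {
		return Report{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return Report{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return Report{}, err
	}
	return decodeReport(body)
}

func main() {
	// Invented payload, used here instead of a live forkd.
	sample := []byte(`{
	  "Sandboxes": [{"ID": "sb-1", "Template": "py", "MemoryUnique": 10, "MemoryShared": 100}],
	  "Templates": [{"Template": "py", "ForkCount": 1, "SharedOnce": 100}],
	  "UsedCoWAware": 110, "UsedNaive": 110, "CoWSavings": 0, "TotalUnique": 10
	}`)
	r, err := decodeReport(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d sandboxes, %d templates, cow-aware=%d\n",
		len(r.Sandboxes), len(r.Templates), r.UsedCoWAware)
}
```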
Prometheus metrics
The forkd /metrics gauges are CoW-aware (internal/daemon):
- mitos_memory_shared_bytes: CoW-aware shared memory, each template’s shared page set counted once (the shared-once total).
- mitos_memory_unique_bytes: per-fork unique memory summed over all sandboxes.
- mitos_cow_memory_savings_bytes: memory the CoW model reveals is not consumed per fork (naive minus CoW-aware).
- mitos_metered_disk_bytes: CoW-aware metered backing storage (template volume seeds counted once).
Observability is documented in observability.md.
How a hosted service would bill
The honest unit is unique footprint + an amortized share of the template:
- Each sandbox is charged its unique memory and disk (the part it alone causes to be resident or allocated): this is exact.
- The template’s shared-once set is charged once per template and amortized across the tenants/forks that use it (it is paid once physically, so it should be billed once and split, not charged in full to each fork).
CoWSavings is the number that makes the CoW value legible: it is the footprint
a per-VM billing model would charge that the CoW model shows is not actually
consumed.
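One way to picture the amortization is an even split of each template’s shared-once set across its forks. The even split is an assumption made for this sketch, as are the Fork type and the billableMemory function; the billing model above does not fix a particular split rule. The sketch also assumes every fork has a non-empty Template.

```go
package main

import "fmt"

// Fork is a hypothetical per-sandbox billing input (memory only).
type Fork struct {
	ID           string
	Template     string
	MemoryUnique uint64
}

// billableMemory returns, per sandbox ID, its unique bytes plus an even
// share of its template's shared-once bytes. Integer division drops any
// remainder, so the shares can sum to slightly less than SharedOnce.
func billableMemory(forks []Fork, sharedOnce map[string]uint64) map[string]uint64 {
	forkCount := map[string]uint64{}
	for _, f := range forks {
		forkCount[f.Template]++
	}
	out := make(map[string]uint64, len(forks))
	for _, f := range forks {
		share := sharedOnce[f.Template] / forkCount[f.Template]
		out[f.ID] = f.MemoryUnique + share
	}
	return out
}

func main() {
	const mib = uint64(1) << 20
	forks := []Fork{
		{ID: "sb-1", Template: "py", MemoryUnique: 10 * mib},
		{ID: "sb-2", Template: "py", MemoryUnique: 20 * mib},
		{ID: "sb-3", Template: "node", MemoryUnique: 5 * mib},
	}
	sharedOnce := map[string]uint64{"py": 100 * mib, "node": 80 * mib}
	bill := billableMemory(forks, sharedOnce)
	for _, f := range forks {
		fmt.Printf("%s: %d MiB billable\n", f.ID, bill[f.ID]/mib)
	}
}
```

Summed over all sandboxes, the billable bytes stay at or just under the CoW-aware total, which is the property the shared-once rule is meant to preserve.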
The per-account, time-integrated, auditable usage pipeline that AGGREGATES these
node reports into billable usage records (vCPU-seconds, GiB-seconds, storage
GiB-hours, egress bytes, GPU-seconds) and serves the org-scoped public usage API
is documented in saas/usage-pipeline.md. It
reconciles the
billable memory level back to this report’s CoW-aware total so the shared
template region is billed once, not once per fork.
Open
- Precise reflink/btrfs block accounting: replace apparent-size disk divergence with allocated-block accounting (e.g. FIEMAP / btrfs extent sharing) so reflinked volumes report physical divergence, not logical length.
- PSS-based attribution: use proportional set size to split shared pages across forks instead of the conservative max-once representative.
- KSM / same-page merging across distinct templates: today sharing is only recognized within a template group; kernel same-page merging across templates is not accounted.
- Per-tenant / per-workspace rollups: aggregate the node report by tenant, tied to Workspace.
- Billing export / OpenCost integration and historical time series: the metrics feed Prometheus today; dashboards and export are follow-ups.
The CoW density datapoint and its CI proof are in
../BENCHMARKS.md (the metering CI phase forks 4 sandboxes
from one template and asserts the shared region is counted once).