Node Allocatable: the resources a pod actually gets

Kai · 5 min read

Article 22 put requests/limits on a pod and said the scheduler uses the request to place it. That left a question open: what on the node does the scheduler compare the request against? Intuition says "the machine's total resources", but that's wrong. A 2-vCPU / 4-GiB worker doesn't let pods use all 2 vCPU / 4 GiB, because the node itself needs resources to run: the operating system (sshd, udev...), the Kubernetes daemons (kubelet, containerd), and a buffer so the kubelet can react in time when RAM is nearly gone. What's left after subtracting all of that is what pods get, and it is called Allocatable. This article looks at resource management from the node side, complementing Article 22's pod view.

The Allocatable formula

The docs define Allocatable and the formula:

"'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods. The scheduler does not over-subscribe 'Allocatable'."

Allocatable = Capacity − kube-reserved − system-reserved − eviction-threshold

Four components:

  • Capacity — the node's total visible physical resources.
  • kube-reserved — the slice for Kubernetes' daemons: "resource reservation for kubernetes system daemons like the kubelet, container runtime, etc."
  • system-reserved — the slice for OS daemons: "resource reservation for OS system daemons like sshd, udev, etc."
  • eviction-threshold — the buffer: "By reserving some memory via evictionHard setting, the kubelet attempts to evict pods whenever memory availability on the node drops below the reserved value."

And the core point for scheduling: "The scheduler treats 'Allocatable' as the available capacity for pods." The scheduler divides Allocatable, not Capacity. On a 2-CPU node that reserves any CPU at all, a pod that requests 2 CPU cannot be placed, because Allocatable < 2.
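As a quick sanity check of the formula, here is a minimal shell sketch. The function name allocatable_ki and the node sizes are illustrative assumptions, not values taken from the cluster in this series.

```shell
#!/usr/bin/env bash
# Allocatable = Capacity - kube-reserved - system-reserved - eviction-threshold
# All four arguments are plain integers in Ki.
allocatable_ki() {
  local capacity=$1 kube_reserved=$2 system_reserved=$3 eviction=$4
  echo $(( capacity - kube_reserved - system_reserved - eviction ))
}

# Hypothetical 4 GiB node: 256 Mi kube-reserved, 128 Mi system-reserved,
# 100 Mi hard eviction threshold (1 Mi = 1024 Ki).
allocatable_ki $((4096 * 1024)) $((256 * 1024)) $((128 * 1024)) $((100 * 1024))
# prints 3698688
```

The same subtraction applies per resource; for CPU the eviction term is simply zero, since eviction thresholds are not defined on CPU.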

Read Capacity vs Allocatable on a real node

Both numbers live right in v1.Node. Read the cluster's worker-0:

kubectl get node worker-0 -o jsonpath='CAPACITY    cpu={.status.capacity.cpu} mem={.status.capacity.memory} pods={.status.capacity.pods}{"\n"}ALLOCATABLE cpu={.status.allocatable.cpu} mem={.status.allocatable.memory} pods={.status.allocatable.pods}{"\n"}'
CAPACITY    cpu=2 mem=3926020Ki pods=110
ALLOCATABLE cpu=2 mem=3823620Ki pods=110

Read carefully: for CPU, Capacity = Allocatable = 2 (nothing carved off), but memory Allocatable falls short of Capacity: 3926020 − 3823620 = 102400 Ki, which is exactly 100 Mi. Why is only memory carved, and by exactly 100 Mi? Because back in Article 11, when we stood up the kubelet, we did not declare kubeReserved/systemReserved, so both are zero and nothing is subtracted from CPU. But the kubelet has a default evictionHard threshold of memory.available<100Mi; that's what carves 100 Mi off Allocatable memory. Confirm the node declares nothing:

ssh worker-0 'sudo grep -E "Reserved|eviction|maxPods" /var/lib/kubelet/kubelet-config.yaml || echo "(not declared -> using defaults)"'
# (not declared -> using defaults)

So that 100 Mi is entirely the default eviction threshold. pods=110 is also the default per-node pod limit.
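The gap can also be computed rather than eyeballed. A minimal sketch, assuming both quantities carry a Ki suffix as in the output above (the helper name gap_mi is hypothetical):

```shell
#!/usr/bin/env bash
# Memory gap between Capacity and Allocatable, in Mi.
# Both arguments are quantities with a Ki suffix, e.g. 3926020Ki.
gap_mi() {
  local capacity_ki=${1%Ki} allocatable_ki=${2%Ki}   # strip the suffix
  echo $(( (capacity_ki - allocatable_ki) / 1024 ))
}

gap_mi 3926020Ki 3823620Ki   # prints 100
```

Against a live cluster, the two arguments could come from the same .status.capacity.memory and .status.allocatable.memory fields read earlier.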

Add a reservation and watch Allocatable drop

The formula theory deserves to be seen in motion. We'll add kubeReserved to worker-0's kubelet — simulating reserving resources for Kubernetes daemons — then watch Allocatable change (back up the config first so we can revert):

ssh worker-0 'sudo cp /var/lib/kubelet/kubelet-config.yaml{,.bak}
  printf "kubeReserved:\n  cpu: \"200m\"\n  memory: \"256Mi\"\n" | sudo tee -a /var/lib/kubelet/kubelet-config.yaml
  sudo systemctl restart kubelet'
# wait for the node to report status...
kubectl get node worker-0 -o jsonpath='ALLOCATABLE cpu={.status.allocatable.cpu} mem={.status.allocatable.memory}{"\n"}CAPACITY    cpu={.status.capacity.cpu} mem={.status.capacity.memory}{"\n"}'
ALLOCATABLE cpu=1800m mem=3561476Ki
CAPACITY    cpu=2 mem=3926020Ki

The formula checks out to the Ki. CPU: Allocatable = 2000m − 200m (kube-reserved) = 1800m. Memory: Allocatable = 3926020 − 262144 (256Mi kube-reserved) − 102400 (100Mi eviction) = 3561476 Ki. Capacity does not change (still 2 / 3926020Ki) because the hardware is unchanged; only the part divided among pods shrinks by exactly what we just reserved. This is how an administrator "sets aside" resources for the system, avoiding the situation where pods devour everything and starve the kubelet/containerd, leaving the node NotReady.
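To redo that arithmetic without converting units by hand, here is a small sketch. The helpers to_ki and to_millicpu are hypothetical and handle only the suffixes that appear in this article (Ki, Mi, Gi, m, and whole cores), not the full Kubernetes quantity syntax.

```shell
#!/usr/bin/env bash
# Convert a memory quantity (Ki, Mi, or Gi suffix only) to Ki.
to_ki() {
  case $1 in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Convert a CPU quantity ("200m" or a whole number of cores) to millicores.
to_millicpu() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# worker-0 with kubeReserved {cpu: 200m, memory: 256Mi}
# and the default 100Mi hard eviction threshold.
echo "cpu=$(( $(to_millicpu 2) - $(to_millicpu 200m) ))m"
echo "mem=$(( $(to_ki 3926020Ki) - $(to_ki 256Mi) - $(to_ki 100Mi) ))Ki"
# prints:
# cpu=1800m
# mem=3561476Ki
```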

Return the config to original for a clean cluster:

ssh worker-0 'sudo mv /var/lib/kubelet/kubelet-config.yaml.bak /var/lib/kubelet/kubelet-config.yaml && sudo systemctl restart kubelet'
# ALLOCATABLE returns to cpu=2 mem=3823620Ki

Tying back to requests, QoS, and eviction

Now to splice this with Article 22. When the scheduler places a pod, it sums the requests of the pods already on the node and admits the new pod only if that sum plus the new pod's request is ≤ Allocatable (this is the NodeResourcesFit plugin, which we'll dig into in the Scheduling part). The comparison is against Allocatable, not Capacity, so the reserved part is genuinely "locked away" from pods' reach. kubectl describe node worker-0 shows an "Allocated resources" table: the total requests/limits the current pods occupy versus Allocatable.
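The fit check itself is a single comparison. The sketch below is a deliberate simplification for one resource (CPU, in millicores); it is not the NodeResourcesFit implementation, and the function name fits and the pod requests are made up for illustration.

```shell
#!/usr/bin/env bash
# fits ALLOCATABLE NEW_REQUEST [EXISTING_REQUEST...]
# All values in millicores. Exit status 0 means the new pod fits.
fits() {
  local allocatable=$1 new=$2 total=0 r
  shift 2
  for r in "$@"; do total=$(( total + r )); done   # sum requests already on the node
  (( total + new <= allocatable ))
}

# Hypothetical node: 1800m allocatable, pods already requesting 100m + 100m + 500m.
fits 1800 1000 100 100 500 && echo "1000m fits"          # 700 + 1000 <= 1800
fits 1800 1200 100 100 500 || echo "1200m does not fit"  # 700 + 1200 >  1800
```

Note that only requests enter the sum; limits and actual usage play no part in placement.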

And when the node still runs low on RAM despite the math (requests are reservations, not caps, so pods can use more than they requested), available memory drops toward the eviction-threshold and the kubelet triggers node-pressure eviction. Pods are ejected in an order that roughly follows the QoS classes of Article 22: pods using more than their requests go first (BestEffort, then Burstable over its request), Guaranteed last. The eviction-threshold in the Allocatable formula is precisely that trigger level: reserving 100 Mi gives the kubelet "room to breathe" to clean up pods before the kernel's OOM killer (Article 22) crashes in brutally. The two mechanisms complement each other: eviction is the kubelet acting proactively and selectively (it picks whole pods by policy; a grace period applies only to soft thresholds, while a hard threshold like the default here terminates the pod without one), and the OOM kill is the kernel acting reactively and ruthlessly (it kills a process immediately once memory is actually exhausted or a cgroup limit is exceeded). (Details of eviction order and signals are saved for the eviction article in the Scheduling part.)
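The trigger condition memory.available<100Mi boils down to one comparison. A minimal sketch, with the available-memory figure passed in as a plain number (how the kubelet actually measures memory.available is left to the eviction article; the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Exit status 0 when available memory (Ki) is below the hard eviction threshold (Ki).
below_eviction_threshold() {
  local available_ki=$1 threshold_ki=${2:-102400}   # default: 100Mi
  (( available_ki < threshold_ki ))
}

# Illustrative readings, not measured on worker-0.
below_eviction_threshold 90000  && echo "below threshold: kubelet starts evicting"
below_eviction_threshold 204800 || echo "above threshold: no memory pressure"
```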

🧹 Cleanup

This article created no objects in the cluster; it only edited worker-0's kubelet config, which we already reverted above. Nothing to delete: the cluster still has two CoreDNS pods and two Ready nodes with Allocatable as before.

Wrap-up

A node doesn't give pods all of Capacity. Allocatable = Capacity − kube-reserved − system-reserved − eviction-threshold, and the scheduler divides Allocatable, not Capacity (no over-subscription). On worker-0 we saw CPU Allocatable = Capacity = 2 (the cluster declares no reservation), while memory falls short by exactly 100 Mi, the kubelet's default evictionHard threshold. After we added kubeReserved {cpu:200m, memory:256Mi} and restarted the kubelet, Allocatable dropped precisely to 1800m / 3561476Ki (Capacity unchanged), confirming the formula to the Ki. The reserved part protects system/Kubernetes daemons from being starved by pods; the eviction-threshold is the buffer so the kubelet can proactively evict by QoS (Article 22) before the kernel OOM kills. This is the node side of the resource story that Article 22 opened from the pod side.

Article 33 closes Part VI with policies at the namespace layer: LimitRange (sets default/min/max request-limit for pods in a namespace) and ResourceQuota (caps the total resources/object count a namespace can use), tools to split a cluster among many teams without anyone stepping on anyone else.