Node-pressure eviction

Kai

Part VII is nearly closed. Article 37 showed preemption — the scheduler kicks out a low-priority pod at scheduling time. But there's another kind of "evicting a pod" that happens when a steadily-running cluster suddenly has a node that truly runs out of resources: node-pressure eviction. The kubelet proactively kills pods to save the node, quite different from preemption (scheduler, for priority) and Article 22's OOM kill (kernel, because a pod exceeds its own limit). Telling these three "pod gets killed" mechanisms apart is the goal of this article.

The kubelet watches node resources

The docs define it: "Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on nodes." and "The kubelet monitors resources like memory, disk space, and filesystem inodes ... When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation." When it evicts, "the kubelet sets the phase for the selected pods to Failed, and terminates the Pod."

The kubelet tracks this via eviction signals, which form the left-hand side of a threshold:

  • memory.available — free RAM (this is exactly Capacity − workingSet from Article 32).
  • nodefs.available / nodefs.inodesFree — remaining space / inodes of the node filesystem.
  • imagefs.available — space for images/layers.
  • pid.available — remaining PIDs (connects to Article 33).

An eviction threshold has the form [signal][operator][quantity], e.g. memory.available<100Mi. Two types:

  • hard: "the kubelet uses a 0s grace period (immediate shutdown)", so once the threshold is crossed the selected pod is killed at once.
  • soft: the threshold must stay crossed for eviction-soft-grace-period before the kubelet acts, and the pod then gets up to eviction-max-pod-grace-period to terminate.
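The [signal][operator][quantity] form can be illustrated with a small shell sketch. It is not kubelet code: the helper names to_kib and threshold_crossed are hypothetical, and only the Ki/Mi/Gi suffixes are handled.

```shell
#!/usr/bin/env bash
# Sketch only: mimics the comparison behind a threshold such as
# memory.available<100Mi. Helper names are hypothetical, not kubelet code.

# Convert a quantity with a Ki/Mi/Gi suffix to KiB (integers only).
to_kib() {
  case "$1" in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# threshold_crossed <available> <threshold>
# Succeeds (exit 0) when available < threshold, i.e. the signal is under pressure.
threshold_crossed() {
  [ "$(to_kib "$1")" -lt "$(to_kib "$2")" ]
}

# Example: 90Mi available against the built-in 100Mi hard threshold.
if threshold_crossed 90Mi 100Mi; then
  echo "memory.available<100Mi is crossed"
fi
```

A hard threshold would act on this result immediately, while a soft threshold would additionally require the condition to hold for its grace period.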

By default the kubelet has a built-in hard threshold memory.available<100Mi (that's the very 100Mi carved off Allocatable that we saw in Article 32).

Create real memory pressure

To see real eviction, we have to make the node genuinely short on RAM. Do it safely: set a demo threshold on worker-0 such that evicting the one pod causing the pressure will relieve it (without spreading to other pods). worker-0 currently has memory.available ≈ 3390Mi. Set a hard threshold memory.available<2500Mi:

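# Back up the config first (the cleanup step restores this .bak); assumes the file has no evictionHard key yet.
ssh worker-0 'sudo cp /var/lib/kubelet/kubelet-config.yaml /var/lib/kubelet/kubelet-config.yaml.bak
  printf "evictionHard:\n  memory.available: \"2500Mi\"\n" | sudo tee -a /var/lib/kubelet/kubelet-config.yaml
  sudo systemctl restart kubelet'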
kubectl get node worker-0 -o jsonpath='MemoryPressure={..conditions[?(@.type=="MemoryPressure")].status}'
# MemoryPressure=False   (3390Mi > 2500Mi, no pressure yet)

The node stays calm (available 3390 > threshold 2500). Now drop in a BestEffort pod (no resources declared, Article 22) that writes 1500Mi into tmpfs, pinned hard to worker-0:

apiVersion: v1
kind: Pod
metadata: {name: memhog}
spec:
  nodeName: worker-0          # pin directly, bypass the scheduler
  restartPolicy: Never
  containers:
  - name: c
    image: busybox:1.36
    command: ["sh","-c","dd if=/dev/zero of=/m/f bs=1M count=1500; sleep 600"]
    volumeMounts: [{name: m, mountPath: /m}]
  volumes:
  - {name: m, emptyDir: {medium: Memory}}     # tmpfs -> counts against node RAM

Poll the node and the pod:

t=1: worker-0.MemoryPressure=False  memhog=Running
t=2: worker-0.MemoryPressure=False  memhog=Running
t=3: worker-0.MemoryPressure=True   memhog=Failed/Evicted
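One way to produce lines in that shape is a small polling loop around kubectl. This is a sketch, not necessarily the script used for the output above: poll_eviction is a hypothetical helper name, and the round count and interval are arbitrary defaults.

```shell
#!/usr/bin/env bash
# Sketch of a polling loop that prints lines in the "t=N: ..." shape.
# poll_eviction is a hypothetical helper; node and pod names match the demo.

poll_eviction() {
  local node="$1" pod="$2" rounds="${3:-10}" interval="${4:-5}"
  local t pressure phase reason
  for t in $(seq 1 "$rounds"); do
    # Node condition MemoryPressure: "True" or "False".
    pressure=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}')
    # Pod phase (Running, Failed, ...) and, once evicted, the reason.
    phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
    reason=$(kubectl get pod "$pod" -o jsonpath='{.status.reason}')
    echo "t=$t: $node.MemoryPressure=$pressure  $pod=$phase${reason:+/$reason}"
    sleep "$interval"
  done
}
```

Calling `poll_eviction worker-0 memhog 10 5` right after applying the manifest polls ten times, five seconds apart.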

memhog writes 1500Mi tmpfs → memory.available drops below 2500Mi → the kubelet sets the node to MemoryPressure=True → and evicts memhog. The pod moves to Failed with reason Evicted. The message spells everything out:

kubectl get pod memhog -o jsonpath='{.status.message}'
The node was low on resource: memory. Threshold quantity: 2500Mi, available: 2543640Ki.
Container c was using 1369520Ki, request is 0, has larger consumption of memory.

Read it carefully: the threshold is 2500Mi (2560000Ki), and at measurement time available was 2543640Ki, about 2484Mi, so just under the threshold. Then comes the reason for picking memhog: Container c was using 1369520Ki, request is 0, has larger consumption of memory. That's the ranking the kubelet uses to choose its victim: pods whose usage exceeds their request go first. memhog is BestEffort (request is 0) yet was using about 1.3Gi, so every byte it uses counts as over request, which puts it at the top of the sacrifice list.
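The message mixes Mi and Ki, so it helps to put the figures in the same unit. The snippet below only redoes that arithmetic with the numbers from the message; the variable names are illustrative, and integer division rounds down.

```shell
#!/usr/bin/env bash
# Worked arithmetic for the eviction message (integer division rounds down).
# Variable names are illustrative only.

threshold_mi=2500
available_ki=2543640
used_ki=1369520

threshold_ki=$(( threshold_mi * 1024 ))
echo "threshold : ${threshold_ki}Ki"                 # 2560000Ki
echo "available : $(( available_ki / 1024 ))Mi"      # 2484Mi
echo "memhog    : $(( used_ki / 1024 ))Mi"           # 1337Mi

if [ "$available_ki" -lt "$threshold_ki" ]; then
  echo "crossed by $(( threshold_ki - available_ki ))Ki"   # 16360Ki
fi
```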

Eviction order, and why CoreDNS survives

The kubelet ranks victims by: (1) whether the pod's usage exceeds its request, (2) Pod Priority (Article 37), (3) usage relative to request. Mapped onto the QoS of Article 22: BestEffort pods (no request, so any usage is over request) and Burstable pods over their request are evicted first, ordered by priority and then by how far usage exceeds the request; Guaranteed pods and Burstable pods within their request go last. There's also a CoreDNS pod on worker-0. Check it after the eviction:
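The three ranking criteria can be approximated with a multi-key sort. This is a rough sketch rather than the kubelet's implementation: rank_for_eviction is a hypothetical helper, and the pods burst-app and guaranteed-db, along with every usage figure except memhog's, are made up for illustration. The priority value 2000000000 is the one carried by system-cluster-critical.

```shell
#!/usr/bin/env bash
# Sketch: order pods the way the three criteria suggest.
# Input columns: name  over_request(1|0)  priority  usage_minus_request_Mi
# rank_for_eviction is a hypothetical helper; most sample rows are invented.

rank_for_eviction() {
  # 1) over request first (descending), 2) lowest priority first,
  # 3) largest usage above request first.
  sort -k2,2nr -k3,3n -k4,4nr | awk '{print $1}'
}

rank_for_eviction <<'EOF'
coredns 0 2000000000 -50
memhog 1 0 1337
guaranteed-db 0 1000 0
burst-app 1 0 200
EOF
# Prints, first victim at the top:
# memhog
# burst-app
# guaranteed-db
# coredns
```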

kubectl get pods -n kube-system -o wide | grep coredns
coredns-...-pqzsx   Running   worker-0      # UNTOUCHED

CoreDNS was not evicted, even though it shares a node under MemoryPressure. There are two reasons: it uses very little RAM (within request), and it carries the system-cluster-critical priority (Article 37), which puts it last on the list. And crucially: evicting memhog freed 1.3Gi → available jumped back above 2500Mi → the pressure cleared → the kubelet stopped, with no need to touch anyone else. This is why we designed the threshold to sit between "available without the hog" and "available with the hog": evicting the actual culprit is enough.
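That design rule is easy to check before touching the kubelet config. The sketch below uses the article's approximate figures (3390Mi available, a 1500Mi hog, a 2500Mi threshold); threshold_isolates_culprit is a hypothetical helper, and all values are whole Mi.

```shell
#!/usr/bin/env bash
# Sketch: check that a demo threshold sits between "available with the hog"
# and "available without the hog". Hypothetical helper; values are in Mi.

threshold_isolates_culprit() {
  local baseline="$1" hog="$2" threshold="$3"
  local with_hog=$(( baseline - hog ))
  # Pressure must appear with the hog and disappear once it is evicted.
  [ "$with_hog" -lt "$threshold" ] && [ "$threshold" -lt "$baseline" ]
}

# Numbers from this article: ~3390Mi available, 1500Mi hog, 2500Mi threshold.
if threshold_isolates_culprit 3390 1500 2500; then
  echo "ok: $(( 3390 - 1500 ))Mi < 2500Mi < 3390Mi"
fi
```

A threshold above the baseline would put the node under pressure even without the hog, and the kubelet could then move on to other pods.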

Three kinds of "pod gets killed" — don't mix them up

Now gather the three mechanisms we've met, since they're easily confused:

                              WHO kills?  WHY?                                         RESPECTS?
OOM kill (A22)                kernel      container exceeds its MEMORY LIMIT           nothing (instant, exitCode 137)
preemption (A37)              scheduler   a HIGH-priority pod needs room               graceful termination
node-pressure eviction (A38)  kubelet     a NODE signal crosses an eviction threshold  NOT PDB, NOT terminationGracePeriodSeconds

The docs stress the last point: node-pressure eviction "is not the same as API-initiated eviction" and "The kubelet does not respect your configured PodDisruptionBudget or the pod's terminationGracePeriodSeconds." — quite different from kubectl drain (Article 23), which goes through the Eviction API and respects PDB. When a node is on fire, the kubelet has no time to be polite. And it self-heals: if the evicted pod belongs to a Deployment/StatefulSet, the controller creates a replacement (usually on another node) — exactly the reconcile loop of Article 17.

🧹 Cleanup

ssh worker-0 'sudo mv /var/lib/kubelet/kubelet-config.yaml.bak /var/lib/kubelet/kubelet-config.yaml && sudo systemctl restart kubelet'
kubectl delete pod memhog --now

This restores the default eviction threshold (worker-0 goes back to MemoryPressure=False) and deletes the test pod. The cluster returns to two CoreDNS pods and two Ready nodes. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 38-node-pressure-eviction.

Wrap-up

Node-pressure eviction is the kubelet proactively killing pods when a node truly runs out of resources — by eviction signals (memory.available, nodefs.available, pid.available...) against thresholds (hard = kill at once, soft = with grace). We set a threshold memory.available<2500Mi on worker-0, dropped in a BestEffort pod writing 1500Mi tmpfs → node MemoryPressure=True → the kubelet evicted that exact pod (Failed/Evicted), with a message stating the threshold, available, and the reason "request is 0, has larger consumption" — i.e. ranking by over-request → priority → usage (BestEffort first, critical last, so CoreDNS survives). Unlike preemption (scheduler/priority) and the OOM kill (kernel/limit), eviction does not respect PDB or terminationGracePeriodSeconds. This is the node's last line of defense, and it self-heals thanks to the controller recreating the pod.

End of Part VII. Part VIII moves to autoscaling — instead of killing pods under load, add pods (or nodes): Article 39 installs metrics-server (the first add-on we add to the cluster), then stands up a HorizontalPodAutoscaler to automatically grow/shrink the replica count by real CPU load.