GC, cgroup v2, Swap, and Graceful Node Shutdown

The kubelet article (Article 11) focused on how kubelet runs pods. But kubelet also manages node resources continuously in the background: it garbage-collects images when the disk fills, places pods into cgroups and enforces their limits, handles swap, and shuts pods down in order when a node powers off. These four mechanisms stay silent until something goes wrong, and then the cluster hits trouble that is hard to diagnose. This article inspects each one on a real worker.

cgroup v2: where limits become reality

Article 22 declared resources.limits, but the limits are enforced by the kernel, through cgroups. The cluster runs on cgroup v2; check the filesystem type of /sys/fs/cgroup:

ssh worker-0 'stat -fc %T /sys/fs/cgroup'

cgroup2fs

cgroup2fs is cgroup v2 (v1 returns tmpfs); from Kubernetes 1.35, cgroup v1 is deprecated. Kubelet arranges pods into a cgroup tree under kubepods.slice, split by QoS (Article 22):

ssh worker-0 'ls -d /sys/fs/cgroup/kubepods.slice/*/ | xargs -n1 basename'

kubepods-besteffort.slice
kubepods-burstable.slice

Guaranteed pods sit directly under kubepods.slice, while Burstable and BestEffort each get their own branch, which is why the listing shows only two QoS sub-slices for three QoS classes. Create a Burstable pod (limit 256Mi, 500m CPU) and find its cgroup:

# pod uid=0905ccf2-..., qos=Burstable
ssh worker-0 'D=$(find /sys/fs/cgroup/kubepods.slice -type d -name "*pod0905ccf2*")
  cat $D/memory.max; cat $D/cpu.max'

dir: kubepods-burstable-pod0905ccf2_8a12_445b_a13b_43a8b4a15427.slice
memory.max = 268435456        # = 256Mi
cpu.max    = 50000 100000     # = 0.5 CPU (quota 50000 / period 100000)
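The directory name in that output follows a predictable pattern, so the path can be built instead of searched for. The sketch below assumes the systemd cgroup driver (which the .slice names suggest) and uses a hypothetical helper, pod_cgroup_dir; with the cgroupfs driver the layout would differ.

```shell
# Sketch: build the pod-level cgroup directory used with the systemd cgroup
# driver. pod_cgroup_dir is a hypothetical helper, not a kubelet command.
pod_cgroup_dir() {
  qos=$1                                  # Guaranteed | Burstable | BestEffort
  uid=$(printf '%s' "$2" | tr '-' '_')    # slice names carry "_" instead of "-"
  root=/sys/fs/cgroup/kubepods.slice
  case "$qos" in
    Guaranteed) printf '%s/kubepods-pod%s.slice\n' "$root" "$uid" ;;
    Burstable)  printf '%s/kubepods-burstable.slice/kubepods-burstable-pod%s.slice\n' "$root" "$uid" ;;
    BestEffort) printf '%s/kubepods-besteffort.slice/kubepods-besteffort-pod%s.slice\n' "$root" "$uid" ;;
    *) echo "unknown QoS class: $qos" >&2; return 1 ;;
  esac
}

# Prints the directory for the Burstable demo pod inspected above.
pod_cgroup_dir Burstable 0905ccf2-8a12-445b-a13b-43a8b4a15427
```

The printed path can be passed straight to cat on the worker to read memory.max or cpu.max without a find.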

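memory.max is 256Mi expressed in bytes, and cpu.max is the quota/period pair for 0.5 CPU: 50000 microseconds of CPU time per 100000-microsecond period. This is where limits from YAML become kernel law: when the cgroup hits memory.max and the kernel cannot reclaim enough memory, a process in it is OOM-killed (Article 22); when it uses up its cpu.max quota, it is throttled until the next period. Kubelet sets these values itself when it creates the pod; no extra configuration is needed.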
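The arithmetic behind those two values is simple enough to reproduce by hand. The following is a minimal sketch, assuming the 100000-microsecond CFS period shown in the output above; the helper names are hypothetical and only the Ki/Mi/Gi and millicore forms are handled.

```shell
# Sketch of the conversion from YAML limits to cgroup v2 values.
mem_limit_to_bytes() {      # "256Mi" -> bytes
  n=${1%?i}                 # strip a two-character suffix such as "Mi"
  case "$1" in
    *Ki) echo $((n * 1024)) ;;
    *Mi) echo $((n * 1024 * 1024)) ;;
    *Gi) echo $((n * 1024 * 1024 * 1024)) ;;
    *)   echo "$1" ;;       # assume a plain byte count
  esac
}

cpu_limit_to_cpu_max() {    # "500m" -> "quota period" in microseconds
  period=100000             # period taken from the cpu.max output above
  case "$1" in
    *m) millis=${1%m} ;;            # millicores, e.g. 500m
    *)  millis=$(($1 * 1000)) ;;    # whole cores, e.g. 2
  esac
  echo "$((millis * period / 1000)) $period"
}

mem_limit_to_bytes 256Mi    # prints 268435456
cpu_limit_to_cpu_max 500m   # prints 50000 100000
```

Comparing these results with what cat shows on the node is a quick way to confirm that a limit in the manifest actually reached the kernel.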

Image garbage collection

Pulled images pile up on the node's disk. Kubelet cleans them up by disk threshold. See the effective config via kubelet's /configz:

kubectl get --raw /api/v1/nodes/worker-0/proxy/configz | jq '.kubeletconfig |
  {imageGCHighThresholdPercent, imageGCLowThresholdPercent, imageMinimumGCAge}'

imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
imageMinimumGCAge: 2m0s

The mechanism: when disk usage exceeds 85%, kubelet deletes unused images, least recently used first, until usage drops below 80%, skipping images younger than 2 minutes. worker-0 currently holds 30 images; as long as the disk stays under 85%, kubelet leaves them alone. Dead containers are garbage-collected too, by a separate kubelet routine. This is why you shouldn't treat images on a node as durable: kubelet deletes them when it needs space.
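The threshold logic can be sketched as a small calculation. This is an illustration only: image_gc_bytes_to_free is a hypothetical helper, the byte figures in the example are made up, and treating usage exactly at the high threshold as a trigger is an assumption rather than something the config output confirms.

```shell
# Sketch of the threshold decision only; the real kubelet works from
# filesystem stats reported for the image filesystem.
image_gc_bytes_to_free() {
  capacity=$1 used=$2 high=${3:-85} low=${4:-80}
  used_pct=$((used * 100 / capacity))
  if [ "$used_pct" -lt "$high" ]; then
    echo 0                  # below the high threshold: nothing to do
    return
  fi
  # Free enough to bring usage back down to the low threshold.
  echo $((used - capacity * low / 100))
}

# A 100 GB disk that is 90% full: prints 10000000000 (10 GB to free).
image_gc_bytes_to_free 100000000000 90000000000
```

The gap between the two thresholds is deliberate: freeing down to 80% rather than just under 85% keeps GC from firing again on the next image pull.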

Swap: why it's blocked by default

Check swap on the node:

ssh worker-0 'free -h | grep -i swap'
kubectl get --raw .../configz | jq '.kubeletconfig.failSwapOn'

Swap:   0B   0B   0B
failSwapOn: true

Swap is off (0B), and failSwapOn: true means kubelet refuses to start if the node has swap enabled. The historical reason: swap breaks the resource assumptions of the scheduler and QoS. A pod that is supposedly RAM-limited could silently spill into swap, making performance unpredictable and breaking OOM semantics. Kubernetes has a NodeSwap feature (beta for several releases, stable since 1.34) that allows controlled swap use, but it must be enabled explicitly: set failSwapOn: false and configure memorySwap.swapBehavior (for example LimitedSwap). The default is still no swap. A from-scratch cluster therefore disables swap on the node before installing kubelet; otherwise kubelet won't come up.
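A pre-flight check along these lines can catch the problem before kubelet is installed. It is a sketch that assumes active swap devices are listed in /proc/swaps after a header line; swap_is_active is a hypothetical helper, and the file argument exists only so the function can be tried against a sample file.

```shell
# Sketch: report whether a /proc/swaps-style file lists any active swap device.
swap_is_active() {
  file=${1:-/proc/swaps}
  # Line 1 is the column header; any further non-empty line is a swap device.
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$file"
}

if [ -r /proc/swaps ] && swap_is_active; then
  echo "swap is on: kubelet with failSwapOn: true will refuse to start"
else
  echo "no active swap found"
fi
```

If the check reports swap, the usual fix is swapoff -a for the running system plus removing or commenting out the swap entry in /etc/fstab so it stays off after a reboot.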

Graceful node shutdown

When a node powers off (maintenance, scale down), the pods on it should shut down cleanly rather than being cut off. See the config:

kubectl get --raw .../configz | jq '.kubeletconfig | {shutdownGracePeriod, shutdownGracePeriodCriticalPods}'

shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s

0s means off. This is the default, and a gap in a self-built cluster. When the feature is on (shutdownGracePeriod > 0), kubelet holds a systemd inhibitor lock so that it is notified before the shutdown proceeds, stops accepting new pods, then sends SIGTERM to pods in order (ordinary pods first, critical pods after), waits through the grace period, and only then lets the node power off. shutdownGracePeriodCriticalPods is the part of that total reserved for critical pods. With the feature off, pods are killed abruptly when the node shuts down, and the Service keeps routing to their endpoints until a health check notices, causing transient errors. Distinguish this from drain (Article 63): drain proactively moves pods off the node before you intervene; graceful shutdown is kubelet reacting to a shutdown already under way, when there was no chance to drain.
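How the two settings divide the shutdown window can be sketched with a small calculation. The 30s and 10s values below are illustrative assumptions, not values from this cluster, and shutdown_budget is a hypothetical helper rather than a kubelet API.

```shell
# Sketch: split a total shutdownGracePeriod (seconds) between ordinary and
# critical pods.
shutdown_budget() {
  total=$1 critical=$2
  if [ "$critical" -gt "$total" ]; then
    echo "shutdownGracePeriodCriticalPods must not exceed shutdownGracePeriod" >&2
    return 1
  fi
  # Ordinary pods get whatever is left after the critical-pod reservation.
  echo "regular=$((total - critical))s critical=${critical}s"
}

# shutdownGracePeriod: 30s, shutdownGracePeriodCriticalPods: 10s
shutdown_budget 30 10       # prints regular=20s critical=10s
```

With those values the node delays power-off by 30 seconds in total: the first 20 for ordinary pods, the last 10 for critical ones. Once the feature is enabled, systemd-inhibit --list on the node should show the lock kubelet holds.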

🧹 Cleanup

kubectl delete namespace cg-demo

This article only creates a pod to inspect the cgroup; the rest is reading node config. Nothing on the node was changed. The commands used in this article are at github.com/nghiadaulau/kubernetes-from-scratch, directory 64-node-internals.

Wrap-up

Kubelet manages the node along four hidden fronts. cgroup v2 (cgroup2fs) is where limits become kernel law: kubelet arranges pods under kubepods.slice by QoS and writes memory.max/cpu.max to match the limits (256Mi → 268435456, 500m → 50000 100000). Image GC deletes unused images once disk usage exceeds 85% and continues until usage is back below 80%, skipping images younger than 2 minutes. Swap is blocked by default (failSwapOn: true) because it breaks the scheduler's and QoS's resource assumptions; using swap requires explicitly configured NodeSwap. Graceful node shutdown (shutdownGracePeriod) defaults to 0s, meaning off, so a self-built cluster should enable it so that pods are SIGTERM'd in order when a node shuts down rather than cut off abruptly; it complements drain for shutdowns that couldn't be drained in time.

By now the cluster is backed up (Article 62), knows how to upgrade (Article 63), and manages node resources (this article). The rest of Part XIII is observability: how to know what's happening inside the cluster. Article 65 starts with logging — where container logs sit, how kubelet rotates them, and the cluster-wide log collection model.