Priority and preemption

Article 34 created a pod that didn't fit any node and saw the event preemption: ... Preemption is not helpful for scheduling. Back then preemption was "not helpful" because every pod had the same priority, so there was nobody to kick out. This article makes preemption helpful: assign different priorities to pods, and when the nodes are full, a high-priority pod gets to evict (preempt) a low-priority pod to take its spot. This is the mechanism that lets important workloads (control plane, tier-1 services) find somewhere to run even when the cluster is full of junk pods, as long as there is something lower-priority to evict.

PriorityClass: assign a priority level

Priority isn't a number typed straight into the pod — it goes through a cluster object called PriorityClass. From the docs: "A PriorityClass is a non-namespaced object that defines mapping from a priority class name to an integer value ... The higher the value, the higher the priority." The value range: "from -2147483648 to 1000000000 inclusive. Larger numbers are reserved for built-in PriorityClasses" — and the cluster already ships two classes, system-cluster-critical and system-node-critical, for system pods (that's exactly where our CoreDNS sat in Article 15). Create two classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: low-prio}
value: 100
description: "ordinary workload, can be preempted"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: high-prio}
value: 1000000
description: "important workload, gets priority"

A pod references it via priorityClassName, and the admission controller fills in the numeric value at spec.priority:

kubectl get pod low-1 -o jsonpath='{.spec.priority}'
# 100
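As a rough illustration of what admission does, the relevant part of low-1's spec might look like the excerpt below once the pod exists. Only priorityClassName is written by hand; priority is copied from the class, and preemptionPolicy is filled in as well (assuming the class leaves it at the default, as low-prio does). This is a trimmed sketch, not full kubectl output.

```yaml
# Trimmed, illustrative excerpt of `kubectl get pod low-1 -o yaml`
spec:
  priorityClassName: low-prio              # written in the manifest
  priority: 100                            # filled in by admission from low-prio's value
  preemptionPolicy: PreemptLowerPriority   # default policy, also filled in by admission
```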

Priority has two effects. First, queue ordering: "the scheduler orders pending Pods by their priority and a pending Pod is placed ahead of other pending Pods with lower priority in the scheduling queue." Second — and more powerful — is preemption.

Fill the cluster with low-priority pods

Each worker has Allocatable cpu = 2 (Article 32); CoreDNS takes ~100m, leaving ~1900m. Create two low-priority pods, each requesting 1500m. Each node holds exactly one, because two would need 3000m:

# low-1, low-2: priorityClassName: low-prio, requests cpu 1500m
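The comment above summarizes the two manifests. A minimal sketch of what they could look like follows. The tier: low label matches the selector used in the next command; the container name and the pause image are assumptions, picked so the pod holds its CPU request without doing any real work.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-1
  labels: {tier: low}                   # matched by `-l tier=low`
spec:
  priorityClassName: low-prio           # resolves to spec.priority: 100
  containers:
  - name: filler                        # hypothetical container name
    image: registry.k8s.io/pause:3.9    # assumption: any idle image works
    resources:
      requests: {cpu: 1500m}            # only one such pod fits in ~1900m
---
apiVersion: v1
kind: Pod
metadata:
  name: low-2
  labels: {tier: low}
spec:
  priorityClassName: low-prio
  containers:
  - name: filler                        # hypothetical container name
    image: registry.k8s.io/pause:3.9    # assumption: any idle image works
    resources:
      requests: {cpu: 1500m}
```

The scheduler only looks at requests, so no limits are needed for this experiment.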

kubectl get pods -l tier=low -o wide

low-1   Running   worker-1
low-2   Running   worker-0

Now each node has ~400m free — not enough for another 1500m pod. The cluster is effectively "full" for that size of pod. This is the Article 34 situation: a new 1500m pod would go Pending... unless it has higher priority.

Preemption: a high-priority pod kicks out a low-priority one

Drop in a high-prio pod requesting 1500m. The docs describe the logic: "If no Node is found that satisfies all the specified requirements of the Pod, preemption logic is triggered ... Preemption logic tries to find a Node where removal of one or more Pods with lower priority than P would enable P to be scheduled. If such a Node is found, one or more lower priority Pods get evicted."

# important: priorityClassName: high-prio, requests cpu 1500m
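A sketch of the manifest that comment stands for, under the same assumptions as the low-priority pods (hypothetical container name, pause image). The tier: high label is inferred from the selector 'tier in (low,high)' used in the commands that follow.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important
  labels: {tier: high}                  # inferred from the label selector
spec:
  priorityClassName: high-prio          # resolves to spec.priority: 1000000
  containers:
  - name: app                           # hypothetical container name
    image: registry.k8s.io/pause:3.9    # assumption: any idle image works
    resources:
      requests: {cpu: 1500m}            # same size as a low pod, so one victim is enough
```

Because the request equals that of a single low-priority pod, evicting one victim frees enough CPU and the scheduler has no reason to touch the second node.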

Captured right after dropping it in:

kubectl get pods -l 'tier in (low,high)' -o wide
kubectl get pod important -o jsonpath='nominatedNodeName={.status.nominatedNodeName}'

important   Pending       <none>
low-1       Running       worker-1
low-2       Terminating   worker-0      # <- the victim being evicted

nominatedNodeName=worker-0

Preemption fired: the scheduler chose worker-0, set nominatedNodeName=worker-0 on important, and evicted low-2 running there. The docs explain nominatedNodeName: "When Pod P preempts one or more Pods on Node N, nominatedNodeName field of Pod P's status is set to the name of Node N." — a tentative "promise" of a node, though "Pod P is not necessarily scheduled to the nominated Node" if another node frees up first. The victim is shut down gracefully, not killed outright: "the victims get their graceful termination period." The evidence is in low-2's events:

kubectl get events --field-selector involvedObject.name=low-2 | grep Preempted

Normal  Preempted  Preempted by pod 7d4ecd4f-... on node worker-0

After low-2 finishes its grace period, important takes the freed spot:

kubectl get pods -l 'tier in (low,high)' -o wide
kubectl get pod important -o jsonpath='nodeName={.spec.nodeName} priority={.spec.priority}'

important   Running   worker-0
low-1       Running   worker-1       # <- untouched

nodeName=worker-0 priority=1000000

important (priority 1000000) is now Running on worker-0; low-1 (priority 100, on worker-1) is untouched, because evicting one pod was enough to make room. That's the PostFilter step of Article 34, but this time preemption is helpful: there's a lower-priority pod to kick out. Comparing the two events says it all: Article 34 (same priority) → Preemption is not helpful; here (with a priority difference) → Preempted by pod ... on node worker-0.

When you don't want eviction: preemptionPolicy Never

Sometimes you want a pod to be scheduled ahead in the queue but not kick anyone out — say, a scientific job that wants to jump to the front of the line but doesn't want to cancel running work. From the docs: "Pods with preemptionPolicy: Never will be placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods." (the default is PreemptLowerPriority). Declared on the PriorityClass:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: {name: high-priority-nonpreempting}
value: 1000000
preemptionPolicy: Never        # ordered ahead, but does NOT evict anyone

A pod of this kind, if there's no room, simply waits (Pending) and kicks out no pod — separating the two facets of priority: "considered first" and "gets the spot".
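A pod opts into this behavior the same way as before, through priorityClassName. The manifest below is a minimal sketch: the pod name, container name, and image are hypothetical, and it assumes the high-priority-nonpreempting class above has been created.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: science-job                          # hypothetical name
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
  - name: job                                # hypothetical container name
    image: registry.k8s.io/pause:3.9         # assumption: stand-in for the real job image
    resources:
      requests: {cpu: 1500m}                 # too big for the ~400m free on each node
```

Applied while both workers are full, this pod would be expected to stay Pending with no victim going Terminating, and to be tried ahead of lower-priority pending pods once capacity frees up on its own. It is not part of the walkthrough, so the cleanup commands do not cover it.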

🧹 Cleanup

kubectl delete pod important low-1 --now
kubectl delete priorityclass low-prio high-prio

low-2 was already deleted by preemption. Delete the two remaining pods and the two PriorityClasses (cluster objects, not namespaced). The cluster returns to two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 37-priority-preemption.

Wrap-up

PriorityClass assigns a priority level (an integer, higher = more important; built-in system-cluster-critical/system-node-critical for system pods), a pod references it via priorityClassName, and admission fills in spec.priority. Priority does two things: it orders high-priority pods ahead in the queue, and it enables preemption. When a high-priority pod has no room, the scheduler evicts a low-priority pod to free a spot: we saw the scheduler set nominatedNodeName=worker-0 on important, evict low-2 (Preempted event), and then bind important to worker-0, while low-priority low-1 on the other node stayed put. The victim gets graceful termination. preemptionPolicy: Never separates "ordered first" from "gets the spot": the pod has priority in the queue but kicks out no one. This is Article 34's PostFilter when there is a priority difference to exploit.

Article 38 closes Part VII with the other side: node-pressure eviction — when a node actually runs out of resources (not at scheduling time), the kubelet proactively evicts pods by threshold and QoS order (Article 22), quite different from preemption (by the scheduler, for priority) and the OOM kill (by the kernel, for exceeding a limit).