Topology spread, pod overhead, and scheduling readiness

Kai · 6 min read

Article 35 showed that hard podAntiAffinity has a weak spot: with topologyKey: hostname, each node holds at most one pod — a third replica on a two-node cluster goes Pending immediately. That is often too strict: what you actually want is for pods to spread evenly, not to forbid two pods sharing a node. This article digs into three finer scheduling mechanisms that close out scheduler control from the pod side: topology spread (flexible spreading via maxSkew), pod overhead (adding resources for the sandbox runtime), and scheduling readiness (holding a pod back from scheduling).

Topology spread: spread evenly without rigidity

From the docs: "You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes ... This can help to achieve high availability as well as efficient resource utilization." Unlike anti-affinity ("forbid the same domain"), topology spread says "the difference between domains must not exceed maxSkew". The key fields:

  • maxSkew: "describes the degree to which Pods may be unevenly distributed". It is the maximum allowed difference between the most-populated and least-populated domain.
  • topologyKey: the node label that defines a "domain" (each <key,value> pair is one domain).
  • whenUnsatisfiable: DoNotSchedule (hard, default) or ScheduleAnyway (soft, "prioritizing nodes that minimize the skew").
  • labelSelector: selects the group of pods to compute the skew over.

A Deployment with 4 replicas, maxSkew: 1 by hostname:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {matchLabels: {app: ts}}

kubectl get pods -l app=ts -o wide --no-headers | awk '{print $7}' | sort | uniq -c
   2 worker-0
   2 worker-1

The four replicas spread 2 + 2: skew 0, perfectly even. This is the difference from Article 35. Hard anti-affinity allows only one pod per node, so the third replica goes Pending, whereas topology spread allows several pods per node as long as the difference stays ≤ maxSkew. All four replicas run and are still evenly spread. If placing a pod would push the skew above maxSkew and whenUnsatisfiable is DoNotSchedule, the new pod goes Pending (as with anti-affinity); with ScheduleAnyway the scheduler still prefers nodes that reduce the skew but never leaves the pod Pending for that reason. That makes topology spread an HA tool you can dial: strict when needed (DoNotSchedule), lenient when you want it (ScheduleAnyway), instead of hard anti-affinity's all-or-nothing rule. (minDomains, nodeAffinityPolicy, and nodeTaintsPolicy further tune the scope over which skew is computed.)
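To make the maxSkew arithmetic concrete, here is a minimal Python sketch of the DoNotSchedule check. It is a simplified model, not the scheduler's code: the function names are hypothetical, every domain is assumed eligible, and the "pick the least-loaded allowed domain" tie-break is an assumption standing in for the scheduler's real scoring.

```python
def skew_if_placed(counts, domain):
    """Skew of `domain` if one more matching pod lands there:
    (pods in the domain + the new pod) minus the current global minimum."""
    return counts[domain] + 1 - min(counts.values())


def allowed_domains(counts, max_skew):
    """Domains that pass a DoNotSchedule constraint for the next pod."""
    return sorted(d for d in counts if skew_if_placed(counts, d) <= max_skew)


def place_replicas(domains, replicas, max_skew):
    """Place replicas one at a time. Assumption: among the allowed domains
    we pick the least-loaded one (name breaks ties)."""
    counts = {d: 0 for d in domains}
    for _ in range(replicas):
        candidates = allowed_domains(counts, max_skew)
        target = min(candidates, key=lambda d: (counts[d], d))
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Four replicas on two nodes with maxSkew 1, as in the Deployment above.
    print(place_replicas(["worker-0", "worker-1"], 4, 1))
    # With 2 + 1 already placed, only worker-1 may take the next pod.
    print(allowed_domains({"worker-0": 2, "worker-1": 1}, 1))
```

Under these assumptions the four replicas end up 2 + 2, and a pod that would make one node lead by two is filtered out, which is the case where a real pod would go Pending under DoNotSchedule.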

Pod overhead: resources for the sandbox

Every pod costs resources beyond its own containers: the sandbox (pause container, network namespace) and — with isolating runtimes like Kata Containers / gVisor — a thin virtualization layer too. If the scheduler only counts the containers' requests, it underestimates what the pod actually consumes. Pod overhead fixes that: it attaches a fixed amount to the pod via a RuntimeClass.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata: {name: overhead-demo}
handler: runc                       # matches containerd's runtime (Article 10)
overhead:
  podFixed: {cpu: "100m", memory: "64Mi"}
---
apiVersion: v1
kind: Pod
metadata: {name: oh-pod}
spec:
  runtimeClassName: overhead-demo
  containers:
  - name: c
    image: busybox:1.36
    command: ["sleep","3600"]
    resources: {requests: {cpu: "100m", memory: "64Mi"}}

A pod that declares that runtimeClassName gets spec.overhead injected by admission:

kubectl get pod oh-pod -o jsonpath='{.spec.overhead}'
{"cpu":"100m","memory":"64Mi"}

This overhead is added to the containers' requests when the pod's resources are computed for scheduling and accounting. Verify it on the node where the pod runs:

kubectl describe node worker-1 | sed -n '/Allocated resources/,/Events/p' | grep -iE 'cpu|memory'
  cpu     300m (15%)
  memory  198Mi (5%)

Break it down: oh-pod contributes 100m (container) + 100m (overhead) = 200m CPU and 64Mi + 64Mi = 128Mi memory; add the resident CoreDNS on the node (100m/70Mi) and you get exactly 300m/198Mi. The overhead is genuinely counted — the scheduler knows the pod takes 200m, not 100m. With a heavy sandbox runtime (Kata/gVisor) this amount is much larger, and ignoring it would lead to overpacking the node. (Our cluster uses runc, so the real overhead is near zero; here we simulate an amount to see the mechanism.)
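The same breakdown can be written as a small Python sketch. It is illustrative only: the helper names are hypothetical, the parsers handle just the quantity forms used in this article (millicores or whole cores, Mi or Gi), and init containers are ignored.

```python
def parse_cpu_millis(quantity):
    """'100m' -> 100, '1' -> 1000. Only these two forms are handled."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory_mi(quantity):
    """'64Mi' -> 64, '1Gi' -> 1024. Other suffixes are not handled."""
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    raise ValueError(f"unsupported memory quantity: {quantity}")


def effective_request(container_requests, overhead=None):
    """Sum of the containers' requests plus the pod's overhead,
    returned as (cpu in millicores, memory in Mi)."""
    cpu = sum(parse_cpu_millis(c["cpu"]) for c in container_requests)
    mem = sum(parse_memory_mi(c["memory"]) for c in container_requests)
    if overhead:
        cpu += parse_cpu_millis(overhead["cpu"])
        mem += parse_memory_mi(overhead["memory"])
    return cpu, mem


if __name__ == "__main__":
    oh_pod = effective_request(
        [{"cpu": "100m", "memory": "64Mi"}],
        overhead={"cpu": "100m", "memory": "64Mi"},
    )
    coredns = effective_request([{"cpu": "100m", "memory": "70Mi"}])
    print(oh_pod)                                          # oh-pod alone
    print(oh_pod[0] + coredns[0], oh_pod[1] + coredns[1])  # node total
```

With the values from this article, oh-pod comes out at 200m and 128Mi, and adding CoreDNS gives the 300m and 198Mi reported by kubectl describe node.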

Scheduling readiness: hold a pod back from scheduling

Sometimes a pod is already created but should not yet be scheduled — it's waiting on an external resource (an approved quota, a Secret to be created, a prep step to finish). Putting that pod into the scheduler queue right away makes the scheduler (and the Cluster Autoscaler) spin uselessly on a pod that's certain to be Pending. schedulingGates solves this. From the docs: "By specifying/removing a Pod's .spec.schedulingGates, you can control when a Pod is ready to be considered for scheduling."

apiVersion: v1
kind: Pod
metadata: {name: gated}
spec:
  schedulingGates:
  - {name: kkloud.io/wait-for-config}
  containers: [{name: c, image: busybox:1.36, command: ["sleep","3600"]}]

kubectl get pod gated -o wide
NAME    READY   STATUS            NODE
gated   0/1     SchedulingGated   <none>

kubectl get pod gated -o jsonpath='{.spec.schedulingGates}'
[{"name":"kkloud.io/wait-for-config"}]

kubectl shows the pod as SchedulingGated, and the scheduler ignores it entirely: nodeName is empty and there is not even a FailedScheduling event, because the scheduler has not looked at the pod yet. The pod's phase is still Pending, but the situation differs from an ordinary unschedulable pod (the scheduler looked and couldn't place it): SchedulingGated means "not its turn to be considered yet". Once the external condition is met, a controller (or you) removes the gate. Note the rule: "each schedulingGate can be removed ... but addition of a new scheduling gate is disallowed" (after creation you can only remove gates, not add them):

kubectl patch pod gated --type=json -p='[{"op":"remove","path":"/spec/schedulingGates"}]'
kubectl get pod gated -o wide
NAME    READY   STATUS    NODE
gated   1/1     Running   worker-1

With the gates empty, the scheduler immediately picks up the pod and binds it. This is the "valve" that lets an external system control when a pod enters the scheduler — the foundation for patterns like gang scheduling (wait until the whole group is ready), or waiting for a specialized resource to become available, without keeping the scheduler busy for nothing.
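The gate rules fit in a few lines of Python. This is a toy model under stated assumptions, not the API server's or scheduler's code: pods are plain dicts, and display_status, scheduling_queue, validate_gate_update, and remove_gate are hypothetical helper names.

```python
def display_status(pod):
    """Roughly what the STATUS column conveys in this model."""
    if pod["schedulingGates"]:
        return "SchedulingGated"
    return "Pending" if pod["nodeName"] is None else "Scheduled"


def scheduling_queue(pods):
    """Only unbound pods with no gates left are considered for scheduling."""
    return [p["name"] for p in pods
            if not p["schedulingGates"] and p["nodeName"] is None]


def validate_gate_update(old_gates, new_gates):
    """After creation, gates may be removed but never added."""
    added = set(new_gates) - set(old_gates)
    if added:
        raise ValueError(f"adding scheduling gates is disallowed: {sorted(added)}")


def remove_gate(pod, gate):
    """Drop one gate by name, validating the update first."""
    new_gates = [g for g in pod["schedulingGates"] if g != gate]
    validate_gate_update(pod["schedulingGates"], new_gates)
    pod["schedulingGates"] = new_gates


if __name__ == "__main__":
    gated = {"name": "gated", "nodeName": None,
             "schedulingGates": ["kkloud.io/wait-for-config"]}
    print(display_status(gated), scheduling_queue([gated]))
    remove_gate(gated, "kkloud.io/wait-for-config")
    print(display_status(gated), scheduling_queue([gated]))
```

A gang-scheduling controller built on this idea would keep a gate on every pod of a group and remove the gates together once the whole group exists.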

🧹 Cleanup

kubectl delete deployment ts
kubectl delete pod gated oh-pod --now
kubectl delete runtimeclass overhead-demo

These are objects in the cluster, so deleting them cleans up. The cluster returns to two CoreDNS pods, two nodes Ready. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 36-topology-overhead-gates.

Wrap-up

Three finer scheduling mechanisms. Topology spread spreads pods by maxSkew: the difference between the most- and least-populated domain stays within a threshold (we saw 4 replicas land 2 + 2, with several pods per node allowed, unlike Article 35's hard anti-affinity at one per node); whenUnsatisfiable picks hard (DoNotSchedule) or soft (ScheduleAnyway). Pod overhead via RuntimeClass adds a fixed amount (podFixed) for the sandbox/runtime to the pod's resources: oh-pod contributed 200m CPU (100m container + 100m overhead) to the node, so the accounting matches what the pod really costs. schedulingGates holds a pod in the SchedulingGated state (the scheduler ignores it, unlike an ordinary Pending pod) until the gate is removed, which gives an external system a valve to control when a pod enters the scheduler. Together with affinity/taint (Article 35), this is the full scheduler control set from the pod side.

Article 37 moves to priority and preemption: a pod with a high priorityClassName can evict lower-priority pods when a node runs out of room — exactly the PostFilter step ("Preemption is not helpful") we saw in Article 34, now made helpful.