Affinity, taints, and tolerations

Article 34 showed the scheduler picking a node on its own via filter + score. But often you are the one who knows where a pod should sit: this workload needs an SSD node, two database replicas shouldn't share one machine (lose the machine, lose both), that node is reserved for one team. Part VII continues with three tools for steering the scheduler from your side, mapping straight onto the extension points of Article 34: the hard forms of nodeAffinity and taint/toleration act in the Filter step, their soft forms (preferred, PreferNoSchedule) feed the Score step, and podAffinity/anti-affinity likewise uses both Filter and Score. The tools work in two opposite directions: affinity pulls, taint pushes.

nodeAffinity: pull a pod toward labeled nodes

nodeSelector (Articles 26, 28) is the simplest way: a pod only lands on nodes that carry all the declared labels. nodeAffinity is a more expressive version with two levels, hard and soft. From the docs:

requiredDuringSchedulingIgnoredDuringExecution — "The scheduler can't schedule the Pod unless the rule is met." (hard, like nodeSelector but with richer syntax)
preferredDuringSchedulingIgnoredDuringExecution — "The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod." (soft, with a weight 1–100 added to the Score)

IgnoredDuringExecution means that "if the node labels change after Kubernetes schedules the Pod, the Pod continues to run": the rule is checked only at scheduling time, and a running pod is not evicted if the node's labels change afterward. Label the two workers, then test the hard rule:

kubectl label node worker-0 disktype=ssd
kubectl label node worker-1 disktype=hdd

# pod aff-ssd: HARD requirement disktype In [ssd]
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - {key: disktype, operator: In, values: [ssd]}
# pod aff-nvme: requirement disktype In [nvme]  (no node has it)

kubectl get pods aff-ssd aff-nvme -o wide

aff-ssd    Running   worker-0
aff-nvme   Pending   <none>

# aff-nvme event:
FailedScheduling  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.

aff-ssd lands on worker-0, the only node with disktype=ssd. aff-nvme demands nvme, which no node has, so it stays Pending. This is still the Filter step of Article 34, only now the rule comes from the pod rather than from resources. Operators besides In include NotIn, Exists, DoesNotExist, Gt, Lt. (The soft form is different: preferred only adds to the Score of matching nodes, and scheduling still succeeds when nothing matches. Use it when a placement is preferred but not required.)
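
The soft form can be sketched as below. This is a minimal fragment under stated assumptions, not one of the article's manifests: the weight of 80 is an arbitrary illustrative value, and the disktype label is the one applied to the two workers earlier.

```yaml
# Hypothetical pod spec fragment: SOFT preference for disktype=ssd
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80                      # allowed range is 1-100; value chosen for illustration
        preference:
          matchExpressions:
          - {key: disktype, operator: In, values: [ssd]}
```

A pod with this fragment should favor worker-0 when other scores are comparable, and it can still be scheduled elsewhere if no node carries disktype=ssd.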

podAntiAffinity: push a pod away from other pods

nodeAffinity constrains by node labels. Inter-pod affinity constrains by other pods that are already running: "constrain a Pod using labels on other Pods running on the node (or other topological domain)." podAffinity places a pod near the selected pods, podAntiAffinity places it away from them. Both take a topologyKey (the scope: node, zone...) and a labelSelector (which pods to consider). The common case: spread replicas across different nodes so losing one machine doesn't lose everything. A Deployment of 3 replicas, at most one per node (anti-affinity against its own label, topologyKey: kubernetes.io/hostname):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector: {matchLabels: {app: spread-aa}}
        topologyKey: kubernetes.io/hostname

kubectl get pods -l app=spread-aa -o wide

spread-aa-...-f8mg9   Running   worker-1
spread-aa-...-kkdw8   Running   worker-0
spread-aa-...-85ggp   Pending   <none>

Two replicas spread one per node; the third is Pending because the hard rule only admits a node that has no spread-aa pod yet, and both nodes already have one. With only two workers there's no room. (To run three replicas while still spreading well, use preferred instead of required, or topology spread from Article 36 with maxSkew, which is more flexible.) This is how anti-affinity protects availability: replicas don't pile onto one machine.
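
The preferred variant mentioned above could look like the following. This is a sketch rather than the article's manifest: the weight of 100 is an illustrative choice, and the fragment is assumed to replace the required block in the spread-aa pod template.

```yaml
# Hypothetical variant of the spread-aa pod template: SOFT anti-affinity
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                     # allowed range is 1-100; value chosen for illustration
        podAffinityTerm:                # the soft form wraps the selector in podAffinityTerm
          labelSelector: {matchLabels: {app: spread-aa}}
          topologyKey: kubernetes.io/hostname
```

With the soft rule the scheduler still tries to separate the replicas, but on a two-worker cluster the third replica would be expected to schedule onto an occupied node instead of staying Pending.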

taint/toleration: a node pushes pods away

Affinity is a pod choosing a node. Taint is the reverse — a node rejecting pods. The docs: "Taints are ... applied to a node; this marks that the node should not accept any pods that do not tolerate the taints." and "Tolerations are applied to pods ... allow the scheduler to schedule pods with matching taints." Three effects:

NoSchedule — don't schedule new pods (except pods with a toleration); "Pods currently running on the node are not evicted."
PreferNoSchedule — the soft version, tries to avoid but doesn't guarantee.
NoExecute — the harshest, evicts even running pods that don't tolerate the taint.

Taint worker-1 then test a normal pod vs a pod with a toleration:

kubectl taint nodes worker-1 dedicated=team-a:NoSchedule

# no-tol: normal pod, NO toleration
# with-tol: has toleration + nodeSelector forcing it to target worker-1
spec:
  nodeSelector: {disktype: hdd}     # worker-1
  tolerations:
  - {key: dedicated, operator: Equal, value: team-a, effect: NoSchedule}

kubectl get pods no-tol with-tol -o wide

no-tol     Running   worker-0      # pushed off worker-1, only worker-0 left
with-tol   Running   worker-1      # toleration lets it past the taint

no-tol has no toleration for the taint, so worker-1 is filtered out and it can only land on worker-0. with-tol has a matching toleration (same key/value/effect), so it's allowed onto worker-1. Note that the toleration only permits that placement; it's the nodeSelector that forces the pod to target worker-1. This is how to "reserve" a node: taint it, and only the team's pods with a toleration get in. (A toleration matches a taint when key+effect match and operator: Exists, or operator: Equal with equal value.)
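
The two matching forms can be put side by side. This fragment is illustrative only: the Equal entry is the one with-tol uses, while the Exists entry is an assumed alternative that is not part of the article's manifests.

```yaml
tolerations:
# Equal (as used by with-tol): key, value and effect must all match the taint
- {key: dedicated, operator: Equal, value: team-a, effect: NoSchedule}
# Exists (hypothetical alternative): matches any value of the dedicated key, so no value is given
- {key: dedicated, operator: Exists, effect: NoSchedule}
```

Either entry on its own would tolerate dedicated=team-a:NoSchedule. The Exists form also keeps matching if the taint's value later changes to another team name, which may or may not be what you want for a reserved node.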

NoExecute evicts even running pods

NoSchedule only blocks new pods. NoExecute touches running pods too: "Pods that do not tolerate the taint are evicted immediately." Add a NoExecute taint to worker-1 (where with-tol is running, but it only tolerates NoSchedule):

kubectl taint nodes worker-1 evict=now:NoExecute
kubectl get pod with-tol -o wide

with-tol   Terminating   worker-1

with-tol is evicted immediately (Terminating): its toleration only matches NoSchedule, not NoExecute, so the new taint applies and evicts it. This is the mechanism behind the NoExecute tolerations that DaemonSet auto-injects in Article 26 (not-ready, unreachable), which keep the agent from being evicted when the node goes unhealthy. It's also how the node controller cleans up pods when a node is NotReady too long. On a NoExecute toleration you can set tolerationSeconds to "hold on N more seconds before leaving", which is useful for a graceful drain.
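
A tolerationSeconds setting for the taint used above might look like this. It is a sketch, not one of the article's manifests, and the 60-second value is an arbitrary assumption.

```yaml
# Hypothetical pod spec fragment: tolerate evict=now:NoExecute for a limited time
spec:
  tolerations:
  - key: evict
    operator: Equal
    value: now
    effect: NoExecute
    tolerationSeconds: 60    # illustrative; the pod stays bound for about 60s after the taint is added
```

A pod with this toleration would be evicted roughly 60 seconds after the taint appears rather than immediately. If the taint is removed before the time runs out, the pod is not evicted.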

🧹 Cleanup

kubectl taint nodes worker-1 evict:NoExecute-          # remove taint (trailing -)
kubectl taint nodes worker-1 dedicated:NoSchedule-
kubectl label node worker-0 disktype- ; kubectl label node worker-1 disktype-
kubectl delete pod aff-ssd aff-nvme no-tol with-tol --now --ignore-not-found   # with-tol may already be gone after the eviction
kubectl delete deployment spread-aa

Removing the taints/labels returns the nodes to their original state (the trailing - deletes). The cluster returns to two CoreDNS pods, two Ready nodes with no taint. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, directory 35-affinity-taints.

Wrap-up

Three tools for steering the scheduler from your side, in two opposite directions. nodeAffinity pulls a pod toward nodes by label: required (hard, Pending if no match, as when aff-nvme hung demanding nvme) or preferred (soft, adds to the Score); IgnoredDuringExecution means the rule applies only at scheduling time. podAntiAffinity pushes a pod away from other pods via topologyKey+labelSelector: with one replica per node, 3 replicas on 2 nodes left the third Pending. A taint makes a node reject pods, and a toleration on a pod lets it through: NoSchedule blocks new pods (no-tol avoided worker-1, with-tol got in), NoExecute evicts even running pods (with-tol went Terminating because it only tolerated NoSchedule). The hard forms act in Filter and the soft forms (preferred, PreferNoSchedule) in Score; both directions are ways you "talk to" the scheduler of Article 34.

Article 36 digs into three finer scheduling mechanisms: topology spread constraints (maxSkew — spread pods evenly across zones/nodes flexibly, unlike rigid anti-affinity), pod overhead (accounting for extra resources for the runtime sandbox), and scheduling readiness (schedulingGates — hold a pod not yet schedulable until it's ready).