DaemonSet: one pod per node

Kai

The previous two articles covered two models for managing pods by count: a Deployment keeps N replicas placed wherever the scheduler likes, and a StatefulSet keeps N pods, each with a stable identity. The DaemonSet is the third model, and it doesn't think in terms of "replica count" at all. The docs define it in one sentence:

"A DaemonSet ensures that all (or some) Nodes run a copy of a Pod."

You don't set the pod count; it follows the number of eligible nodes. Add a node to the cluster and the DaemonSet adds a pod to it; remove a node and its pod is cleaned up with it. This is the pattern for things that must be present on every machine, exactly as the docs list: "running a cluster storage daemon on every node; running a logs collection daemon on every node; running a node monitoring daemon on every node." We'll use this same model when deploying Cilium in the advanced networking section, since a CNI agent also runs as a DaemonSet.

One pod per node

Our cluster has two workers registered as nodes (in this setup the control plane machine doesn't run a kubelet, so it never registers as a Kubernetes node). Create a DaemonSet and watch it spread pods:

apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent}
spec:
  selector: {matchLabels: {app: node-agent}}
  template:
    metadata: {labels: {app: node-agent}}
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]

Note: there's no replicas field. The pod count is derived from the node count.

kubectl get nodes
kubectl get ds node-agent
kubectl get pods -l app=node-agent -o wide

NAME       STATUS   ROLES    AGE    VERSION
worker-0   Ready    <none>   113m   v1.36.1
worker-1   Ready    <none>   112m   v1.36.1

NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-agent   2         2         2       2            2           <none>          8s

NAME               READY   STATUS    NODE
node-agent-hzx2v   1/1     Running   worker-0
node-agent-rtznj   1/1     Running   worker-1

DESIRED is 2: exactly the two nodes, one pod each. A DaemonSet's DESIRED column isn't a number you declare but the number of eligible nodes. If we join a worker-2 tomorrow, the DaemonSet controller creates a third pod for it without us editing anything.
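To check that DESIRED tracks the nodes rather than a declared number, you can read the count from the DaemonSet's status and compare it with the node list. The following is a minimal sketch: the helper name ds_desired_vs_nodes is hypothetical, and the two numbers only match when every node is eligible (no nodeSelector, no untolerated taints).

```shell
# Hypothetical helper: print the node count next to the DaemonSet's desired count.
# Usage: ds_desired_vs_nodes <daemonset-name>
ds_desired_vs_nodes() {
  ds_name=$1
  # One line per registered node; tr strips the padding some wc builds add.
  nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  # status.desiredNumberScheduled is the value shown in the DESIRED column.
  desired=$(kubectl get ds "$ds_name" -o jsonpath='{.status.desiredNumberScheduled}')
  echo "nodes=$nodes desired=$desired"
}
```

Calling ds_desired_vs_nodes node-agent on the two-worker cluster above should report the same number on both sides.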

How a DaemonSet pins a pod to a node

A good question: how does a DaemonSet ensure exactly one pod per node, never two piled on one? DaemonSets used to set nodeName themselves, bypassing the scheduler, but now they cooperate with the scheduler via nodeAffinity. Docs:

"The DaemonSet controller creates a Pod for each eligible node and adds the spec.affinity.nodeAffinity field of the Pod to match the target host. After the Pod is created, the default scheduler typically takes over and then binds the Pod to the target host by setting the .spec.nodeName field."

Inspect one DaemonSet pod and you'll see that auto-injected nodeAffinity:

POD=$(kubectl get pods -l app=node-agent -o jsonpath='{.items[0].metadata.name}')
kubectl get pod $POD -o jsonpath='{.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms}{"\n"}'
[{"matchFields":[{"key":"metadata.name","operator":"In","values":["worker-0"]}]}]

Each pod is hard-pinned to exactly one node via matchFields: metadata.name In [worker-0]. This is how a DaemonSet achieves "one pod per node": it pre-creates one pod for each node with affinity pointing straight at that node's name, then lets the scheduler (covered in Article 34) do the binding. We'll meet nodeAffinity again in a hand-written form in the affinity/taint article — here it's something the controller generates for us.
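Written out as YAML, the injected fields look roughly like the excerpt below. It is reconstructed from the jsonpath output above rather than copied from a live pod, so treat it as a sketch of the shape. The nodeName line is what the scheduler fills in at bind time, as the docs quote describes.

```yaml
# Excerpt of a DaemonSet pod's spec, as "kubectl get pod <pod> -o yaml" would show it.
# Both fields are generated for you; they are not written in the DaemonSet manifest.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:              # matches a node field, not a node label
          - key: metadata.name
            operator: In
            values:
            - worker-0              # added by the DaemonSet controller
  nodeName: worker-0                # set by the scheduler when it binds the pod
```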

Why an agent can run even on an "unhealthy" node

A monitoring or log-collection agent must be present even when the node is in trouble — that's exactly when it's needed most. But a troubled node usually carries a taint that pushes normal pods elsewhere (the taint/toleration mechanism is left for the scheduling article). A DaemonSet solves this by auto-injecting a set of tolerations for its pods:

kubectl get pod $POD -o jsonpath='{range .spec.tolerations[*]}{.key} {.operator} {.effect}{"\n"}{end}'
node.kubernetes.io/not-ready Exists NoExecute
node.kubernetes.io/unreachable Exists NoExecute
node.kubernetes.io/disk-pressure Exists NoSchedule
node.kubernetes.io/memory-pressure Exists NoSchedule
node.kubernetes.io/pid-pressure Exists NoSchedule
node.kubernetes.io/unschedulable Exists NoSchedule

Reading this list is enough to understand the intent: a DaemonSet pod tolerates not-ready, unreachable, disk-pressure, memory-pressure, pid-pressure, and unschedulable nodes. The docs explain the leading NoExecute pair: a DaemonSet pod "can be scheduled onto nodes that are not healthy or ready ... will not be evicted". In other words, it isn't evicted even when the node loses health. This is the core difference from a Deployment: an ordinary pod tolerates not-ready and unreachable for only a limited time (300 seconds by default) and is then evicted, whereas the agent stays and keeps working. (Note: these tolerations cover node conditions only. To run on a control-plane node that carries the taint node-role.kubernetes.io/control-plane you still have to add that toleration yourself, like the fluentd example in the docs.)
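As a sketch of that last point, the node-agent manifest with a control-plane toleration added might look like this. It assumes a cluster whose control-plane machines are registered as nodes and carry that taint with the NoSchedule effect; in this series' cluster they are not nodes at all, so the toleration would have no visible effect here.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent}
spec:
  selector: {matchLabels: {app: node-agent}}
  template:
    metadata: {labels: {app: node-agent}}
    spec:
      tolerations:
      # Not injected automatically: allows scheduling onto tainted control-plane nodes.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

The automatically injected tolerations are still added on top of the ones written in the template.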

"all (or some)": limiting to a group of nodes

The "(or some)" in the definition lets a DaemonSet run on only some nodes. Docs: "If you specify a .spec.template.spec.nodeSelector, then the DaemonSet controller will create Pods on nodes which match that node selector ... If you do not specify either, then the DaemonSet controller will create Pods on all nodes." Label a node, then aim the DaemonSet at that exact label:

kubectl label node worker-0 disk=ssd

apiVersion: apps/v1
kind: DaemonSet
metadata: {name: ssd-agent}
spec:
  selector: {matchLabels: {app: ssd-agent}}
  template:
    metadata: {labels: {app: ssd-agent}}
    spec:
      nodeSelector: {disk: ssd}      # only nodes with this label
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]

kubectl get ds ssd-agent
kubectl get pods -l app=ssd-agent -o wide

NAME        DESIRED   CURRENT   READY   NODE SELECTOR   AGE
ssd-agent   1         1         1       disk=ssd        7s

NAME              READY   STATUS    NODE
ssd-agent-njzrx   1/1     Running   worker-0

DESIRED is 1: only worker-0 (the node labeled disk=ssd) runs the pod; worker-1 does not. The DaemonSet has narrowed "all nodes" down to "nodes matching the selector". This is how you deploy an agent to only one class of hardware (GPU nodes, SSD nodes) or one group of roles.
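The docs quote above also mentions .spec.template.spec.affinity. As a sketch of that variant, the manifest below selects nodes whose disk label has either of two values, which a plain nodeSelector cannot express. The name fast-disk-agent and the nvme label value are hypothetical; only disk=ssd exists in this article's cluster.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: fast-disk-agent}          # hypothetical name
spec:
  selector: {matchLabels: {app: fast-disk-agent}}
  template:
    metadata: {labels: {app: fast-disk-agent}}
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk
                operator: In
                values: ["ssd", "nvme"]    # "nvme" is an assumed label value
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

With only worker-0 labeled disk=ssd, this should land on the same single node as ssd-agent did.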

Updating a DaemonSet

A DaemonSet also has an updateStrategy, defaulting to RollingUpdate:

kubectl get ds node-agent -o jsonpath='{.spec.updateStrategy.type}{"\n"}'
# RollingUpdate

RollingUpdate replaces the pods node by node, by default one at a time because maxUnavailable defaults to 1 (same spirit as Article 24, but per node). The other option is OnDelete: the DaemonSet does not replace pods automatically when you change the template; a pod is only updated when you delete it yourself and the controller recreates it from the new template. OnDelete suits sensitive agents where you want precise control over exactly when each node gets replaced.
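For reference, here is a sketch of where the strategy sits in the manifest, using the node-agent DaemonSet from earlier. The values shown for RollingUpdate are the defaults spelled out explicitly, so applying this should not change behavior.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent}
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # default: at most one node's pod is down during a rollout
  # For manual control, replace the whole updateStrategy block with:
  #   updateStrategy:
  #     type: OnDelete
  selector: {matchLabels: {app: node-agent}}
  template:
    metadata: {labels: {app: node-agent}}
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

Raising maxUnavailable speeds up a rollout on a large cluster at the cost of more nodes being without the agent at once.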

🧹 Cleanup

kubectl delete ds node-agent ssd-agent
kubectl label node worker-0 disk-      # remove the label we added

Deleting the DaemonSets cleans up the pods they created on every node; the second command removes the disk label to return the node to its prior state. The cluster is left with two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 26-daemonset.

Wrap-up

A DaemonSet is the "one pod per node" model: the pod count equals the node count, not a number you declare; add a node and a pod appears, remove a node and the pod is GC'd. The controller achieves this by auto-injecting nodeAffinity to pin each pod to exactly one node (then letting the scheduler bind it), and auto-injecting a set of tolerations (not-ready, unreachable, disk/memory/pid-pressure, unschedulable) so the agent can run, and isn't evicted, even on a troubled node — exactly when it's needed most. nodeSelector/affinity narrows "all nodes" down to a group (we saw the DaemonSet run only on the node labeled disk=ssd). The default update strategy is RollingUpdate (replacing per node), or OnDelete for manual control.

Article 27 closes Part IV with the run-to-completion family of controllers: the Job (runs one task until done, rather than running forever like every controller so far), the CronJob (a scheduled Job), and the TTL mechanism that auto-cleans finished Jobs.