DaemonSet: one pod per node
The previous two articles covered two models for managing pods by count: a Deployment keeps N replicas placed wherever the scheduler likes, and a StatefulSet keeps N pods, each with a stable identity. The DaemonSet is the third model, and it doesn't think in terms of "replica count" at all. The docs define it in one sentence:
"A DaemonSet ensures that all (or some) Nodes run a copy of a Pod."
You don't set the pod count; it follows the number of eligible nodes. Add a node to the cluster and the DaemonSet adds a pod to it; remove a node and its pod is garbage-collected with it. This is the pattern for things that must be present on every machine, exactly as the docs list: "running a cluster storage daemon on every node; running a logs collection daemon on every node; running a node monitoring daemon on every node." We'll use this very model when deploying Cilium in the advanced networking section, since a CNI agent also runs as a DaemonSet.
One pod per node
Our cluster has two workers registered as nodes (the control plane doesn't run a kubelet, so it isn't a Kubernetes node). Create a DaemonSet and watch it spread pods:
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent}
spec:
  selector: {matchLabels: {app: node-agent}}
  template:
    metadata: {labels: {app: node-agent}}
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
Note: there's no replicas field. The pod count is derived from the node count.
kubectl get nodes
kubectl get ds node-agent
kubectl get pods -l app=node-agent -o wide
NAME       STATUS   ROLES    AGE    VERSION
worker-0   Ready    <none>   113m   v1.36.1
worker-1   Ready    <none>   112m   v1.36.1
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-agent   2         2         2       2            2           <none>          8s
NAME               READY   STATUS    NODE
node-agent-hzx2v   1/1     Running   worker-0
node-agent-rtznj   1/1     Running   worker-1
DESIRED 2 — exactly the two nodes, one pod each. A DaemonSet's DESIRED column isn't a number you declare but the number of eligible nodes. If we join a worker-2 tomorrow, the DaemonSet immediately creates a third pod without us editing anything.
How a DaemonSet pins a pod to a node
A good question: how does a DaemonSet ensure exactly one pod per node, never two piled on one? DaemonSets used to set nodeName themselves, bypassing the scheduler, but now they cooperate with the scheduler via nodeAffinity. Docs:
"The DaemonSet controller creates a Pod for each eligible node and adds the spec.affinity.nodeAffinity field of the Pod to match the target host. After the Pod is created, the default scheduler typically takes over and then binds the Pod to the target host by setting the .spec.nodeName field."
Inspect one DaemonSet pod and you'll see that auto-injected nodeAffinity:
POD=$(kubectl get pods -l app=node-agent -o jsonpath='{.items[0].metadata.name}')
kubectl get pod $POD -o jsonpath='{.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms}{"\n"}'
[{"matchFields":[{"key":"metadata.name","operator":"In","values":["worker-0"]}]}]
Each pod is hard-pinned to exactly one node via matchFields: metadata.name In [worker-0]. This is how a DaemonSet achieves "one pod per node": it pre-creates one pod for each node with affinity pointing straight at that node's name, then lets the scheduler (covered in Article 34) do the binding. We'll meet nodeAffinity again in a hand-written form in the affinity/taint article — here it's something the controller generates for us.
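For readers who prefer YAML to the compact JSON above, the injected affinity would look roughly like this inside the pod spec. This is only a sketch reconstructed from the jsonpath output shown earlier; all other pod fields are omitted, and the node name is whichever node the inspected pod was created for.

```yaml
# Fragment of a DaemonSet-created pod (not something you write yourself).
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:            # matches on an object field, not on a label
          - key: metadata.name
            operator: In
            values:
            - worker-0            # the single node this pod may bind to
```

Note that the term uses matchFields rather than matchExpressions: it matches the node object's own name, so it works even on nodes that carry no labels at all.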
Why an agent can run even on an "unhealthy" node
A monitoring or log-collection agent must be present even when the node is in trouble — that's exactly when it's needed most. But a troubled node usually carries a taint that pushes normal pods elsewhere (the taint/toleration mechanism is left for the scheduling article). A DaemonSet solves this by auto-injecting a set of tolerations for its pods:
kubectl get pod $POD -o jsonpath='{range .spec.tolerations[*]}{.key} {.operator} {.effect}{"\n"}{end}'
node.kubernetes.io/not-ready Exists NoExecute
node.kubernetes.io/unreachable Exists NoExecute
node.kubernetes.io/disk-pressure Exists NoSchedule
node.kubernetes.io/memory-pressure Exists NoSchedule
node.kubernetes.io/pid-pressure Exists NoSchedule
node.kubernetes.io/unschedulable Exists NoSchedule
Reading this list is enough to understand the intent: a DaemonSet pod tolerates not-ready, unreachable, disk-pressure, memory-pressure, pid-pressure, and unschedulable nodes. The docs explain the leading NoExecute pair: a DaemonSet pod "can be scheduled onto nodes that are not healthy or ready ... will not be evicted"; that is, it isn't evicted even when the node loses health. This is the core difference from a Deployment: a normal pod tolerates not-ready and unreachable only for a limited time (300 seconds by default) and is then evicted, whereas the agent stays and keeps working. (Note: these tolerations cover node conditions and the unschedulable flag; to run on a control-plane node that carries the taint node-role.kubernetes.io/control-plane you still have to add that toleration yourself, like the fluentd example in the docs.)
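As a minimal sketch of that last note, the toleration could be added to the node-agent pod template as below. This assumes a cluster whose control-plane nodes are registered and carry the node-role.kubernetes.io/control-plane taint with effect NoSchedule, which is not the case for the two-worker cluster in this series, so treat it as illustrative only.

```yaml
# Fragment of the node-agent DaemonSet; only the pod template is shown.
spec:
  template:
    spec:
      tolerations:
      # Assumption: control-plane nodes exist and are tainted NoSchedule.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

Tolerations you declare here are kept alongside the ones the controller injects, so the pod still tolerates the node-condition taints listed above.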
"all (or some)": limiting to a group of nodes
The "(or some)" in the definition lets a DaemonSet run on only some nodes. Docs: "If you specify a .spec.template.spec.nodeSelector, then the DaemonSet controller will create Pods on nodes which match that node selector ... If you do not specify either, then the DaemonSet controller will create Pods on all nodes." Label a node, then aim the DaemonSet at that exact label:
kubectl label node worker-0 disk=ssd
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: ssd-agent}
spec:
  selector: {matchLabels: {app: ssd-agent}}
  template:
    metadata: {labels: {app: ssd-agent}}
    spec:
      nodeSelector: {disk: ssd}   # only nodes with this label
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
kubectl get ds ssd-agent
kubectl get pods -l app=ssd-agent -o wide
NAME        DESIRED   CURRENT   READY   NODE SELECTOR   AGE
ssd-agent   1         1         1       disk=ssd        7s
NAME              READY   STATUS    NODE
ssd-agent-njzrx   1/1     Running   worker-0
DESIRED 1 — only worker-0 (the node with label disk=ssd) runs the pod, worker-1 does not. The DaemonSet has narrowed "all nodes" down to "nodes matching the selector". This is how you deploy an agent for only one class of hardware (GPU nodes, SSD nodes) or one group of roles.
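nodeSelector only does exact label matches. When the target group needs a richer rule, the pod template's affinity field can express it instead. The manifest below is a hedged sketch, not part of the original walkthrough: the name fast-disk-agent and the label value nvme are hypothetical, chosen to show one DaemonSet covering two classes of disk.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: fast-disk-agent}          # hypothetical name
spec:
  selector: {matchLabels: {app: fast-disk-agent}}
  template:
    metadata: {labels: {app: fast-disk-agent}}
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk
                operator: In
                values: ["ssd", "nvme"]    # "nvme" is an assumed label value
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

With only worker-0 labeled disk=ssd, this would behave the same as the nodeSelector version; the difference shows up once other nodes carry disk=nvme.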
Updating a DaemonSet
A DaemonSet also has an updateStrategy, defaulting to RollingUpdate:
kubectl get ds node-agent -o jsonpath='{.spec.updateStrategy.type}{"\n"}'
# RollingUpdate
With RollingUpdate, the controller deletes the old pod on a node and creates the new one in its place, by default one node at a time because maxUnavailable defaults to 1 (same spirit as Article 24, but per node). The other option is OnDelete: the DaemonSet does not replace pods automatically when you change the template; a pod only updates when you delete it yourself. OnDelete suits sensitive agents where you want precise control over exactly when each node gets replaced.
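For reference, the strategy lives under spec.updateStrategy. The fragment below is a small sketch that spells out the RollingUpdate defaults explicitly; the values shown are the API defaults, not settings taken from this series' manifests.

```yaml
# Fragment of a DaemonSet spec; selector and template omitted.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # default: at most one node's pod is down at a time
      maxSurge: 0         # default: the old pod is removed before the new one starts
  # For manual control, replace the block above with:
  # updateStrategy:
  #   type: OnDelete
```

Raising maxUnavailable (a number or a percentage) speeds up rollouts on large clusters at the cost of more nodes being without the agent at once.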
🧹 Cleanup
kubectl delete ds node-agent ssd-agent
kubectl label node worker-0 disk- # remove the label we added
Deleting the DaemonSets cleans up the pods they created on every node; the second command removes the disk label to return the node to its prior state. The cluster is left with two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 26-daemonset.
Wrap-up
A DaemonSet is the "one pod per node" model: the pod count equals the node count, not a number you declare; add a node and a pod appears, remove a node and the pod is GC'd. The controller achieves this by auto-injecting nodeAffinity to pin each pod to exactly one node (then letting the scheduler bind it), and auto-injecting a set of tolerations (not-ready, unreachable, disk/memory/pid-pressure, unschedulable) so the agent can run, and isn't evicted, even on a troubled node — exactly when it's needed most. nodeSelector/affinity narrows "all nodes" down to a group (we saw the DaemonSet run only on the node labeled disk=ssd). The default update strategy is RollingUpdate (replacing per node), or OnDelete for manual control.
Article 27 closes Part IV with the run-to-completion family of controllers: the Job (runs one task until done, rather than running forever like every controller so far), the CronJob (a scheduled Job), and the TTL mechanism that auto-cleans finished Jobs.