DaemonSet: one pod on every node

The previous two lessons covered two models that manage pods by count: a Deployment keeps N replicas and lets the scheduler decide where they go, and a StatefulSet keeps N pods with stable identities. DaemonSet is the third model, and it does not think in terms of a replica count at all. The documentation defines it in one sentence:

"A DaemonSet ensures that all (or some) Nodes run a copy of a Pod."

You do not set the number of pods; it equals the number of eligible nodes. Add a node to the cluster and the DaemonSet adds a pod to it automatically; remove a node and its pod is cleaned up with it. This is the pattern for anything that has to be present on every machine, exactly as the documentation lists: "running a cluster storage daemon on every node; running a logs collection daemon on every node; running a node monitoring daemon on every node." We will use this same model when we deploy Cilium in the advanced networking part, because the CNI agent also runs as a DaemonSet.

One pod on every node

Our cluster has two workers registered as nodes (the control plane machines in this setup do not run a kubelet, so they never register as Kubernetes nodes). Create a DaemonSet and watch it spread pods:

apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent}
spec:
  selector: {matchLabels: {app: node-agent}}
  template:
    metadata: {labels: {app: node-agent}}
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]

Note that there is no replicas field. The pod count is derived from the number of nodes.

kubectl get nodes
kubectl get ds node-agent
kubectl get pods -l app=node-agent -o wide

NAME       STATUS   ROLES    AGE    VERSION
worker-0   Ready    <none>   113m   v1.36.1
worker-1   Ready    <none>   112m   v1.36.1

NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-agent   2         2         2       2            2           <none>          8s

NAME               READY   STATUS    NODE
node-agent-hzx2v   1/1     Running   worker-0
node-agent-rtznj   1/1     Running   worker-1

DESIRED is 2, matching the two nodes, one pod each. For a DaemonSet, the DESIRED column is not a number you declare but the number of eligible nodes. If we join worker-2 tomorrow, the DaemonSet controller creates a third pod for it with no change to the manifest.

How a DaemonSet pins pods to nodes

A good question: how does a DaemonSet guarantee exactly one pod per node, never two on the same one? DaemonSet used to set nodeName itself and bypass the scheduler, but it now cooperates with the scheduler through nodeAffinity. The documentation:

"The DaemonSet controller creates a Pod for each eligible node and adds the spec.affinity.nodeAffinity field of the Pod to match the target host. After the Pod is created, the default scheduler typically takes over and then binds the Pod to the target host by setting the .spec.nodeName field."

Inspect a DaemonSet pod and you will see that injected nodeAffinity:

POD=$(kubectl get pods -l app=node-agent -o jsonpath='{.items[0].metadata.name}')
kubectl get pod $POD -o jsonpath='{.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms}{"\n"}'

[{"matchFields":[{"key":"metadata.name","operator":"In","values":["worker-0"]}]}]

Each pod is pinned to exactly one node by matchFields: metadata.name In [worker-0]. This is how DaemonSet achieves "one pod per node": the controller creates one pod per eligible node, each with an affinity that names that node directly, then leaves the bind to the scheduler (Lesson 34 digs into it). We will meet nodeAffinity again in hand-written form in the affinity/taint lesson; here the controller generates it for us.
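For readability, the JSON printed above can be written out as the YAML fragment you would find in the pod's own spec. This is only a re-rendering of the output already shown for the pod on worker-0; every other field of the pod is omitted.

```yaml
# Fragment of a DaemonSet pod's spec (other fields omitted).
# Injected by the DaemonSet controller, not written in the template.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - worker-0   # the one node this pod may be bound to
```

Note that the term uses matchFields on the node's name rather than matchExpressions on a label, so it does not depend on any label being present on the node.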

Why the agent can run even on an "unhealthy" node

A monitoring or log-collection agent has to be present even when the node is in trouble, which is exactly when it is needed most. But a troubled node usually carries taints that push ordinary pods elsewhere (the taint/toleration mechanism is left for the scheduling lesson). DaemonSet handles this by automatically injecting a set of tolerations into its pods:

kubectl get pod $POD -o jsonpath='{range .spec.tolerations[*]}{.key} {.operator} {.effect}{"\n"}{end}'

node.kubernetes.io/not-ready Exists NoExecute
node.kubernetes.io/unreachable Exists NoExecute
node.kubernetes.io/disk-pressure Exists NoSchedule
node.kubernetes.io/memory-pressure Exists NoSchedule
node.kubernetes.io/pid-pressure Exists NoSchedule
node.kubernetes.io/unschedulable Exists NoSchedule

The list makes the intent clear: DaemonSet pods tolerate nodes that are not-ready, unreachable, under disk, memory, or PID pressure, or unschedulable. The documentation explains the first NoExecute pair: DaemonSet pods "can be scheduled onto nodes that are not healthy or ready ... will not be evicted", meaning they are not evicted even when the node becomes unhealthy. This is the core difference from a Deployment: ordinary pods are evicted from a not-ready node (by default after a toleration window of about five minutes), while the agent stays and keeps working. (Note: these tolerations cover node conditions only; to run on a control plane node tainted with node-role.kubernetes.io/control-plane you still have to add that toleration yourself, as the fluentd example in the documentation does.)
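The control plane case mentioned in the note could look like the sketch below. It reuses the node-agent manifest from this lesson and adds one toleration; the name node-agent-cp is hypothetical. Because the control plane in our cluster does not register as a node, this changes nothing here, and it only matters on clusters where control plane nodes run a kubelet and carry that taint.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: node-agent-cp}   # hypothetical name for illustration
spec:
  selector: {matchLabels: {app: node-agent-cp}}
  template:
    metadata: {labels: {app: node-agent-cp}}
    spec:
      tolerations:
      # Not injected automatically: tolerate the control plane taint explicitly.
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

The tolerations injected by the controller are added on top of whatever the template declares, so listing this one does not remove the not-ready or unreachable entries shown earlier.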

"all (or some)": restricting to a group of nodes

The "(or some)" in the definition lets a DaemonSet run on only a subset of nodes. The documentation: "If you specify a .spec.template.spec.nodeSelector, then the DaemonSet controller will create Pods on nodes which match that node selector ... If you do not specify either, then the DaemonSet controller will create Pods on all nodes." Label a node, then target the DaemonSet at that label:

kubectl label node worker-0 disk=ssd

apiVersion: apps/v1
kind: DaemonSet
metadata: {name: ssd-agent}
spec:
  selector: {matchLabels: {app: ssd-agent}}
  template:
    metadata: {labels: {app: ssd-agent}}
    spec:
      nodeSelector: {disk: ssd}      # only nodes with this label
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]

kubectl get ds ssd-agent
kubectl get pods -l app=ssd-agent -o wide

NAME        DESIRED   CURRENT   READY   NODE SELECTOR   AGE
ssd-agent   1         1         1       disk=ssd        7s

NAME              READY   STATUS    NODE
ssd-agent-njzrx   1/1     Running   worker-0

DESIRED is 1: only worker-0 (the node labeled disk=ssd) runs a pod, and worker-1 does not. The DaemonSet has narrowed "every node" to "nodes matching the selector". This is how you deploy an agent for just one class of hardware (GPU nodes, SSD nodes) or one group of roles.
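The quoted passage says "either" because the template can narrow the node set in a second way, through node affinity. A minimal sketch of that form follows. The name ssd-agent-affinity and the extra value nvme are assumptions added for illustration, chosen to show that matchExpressions can accept several label values where nodeSelector only matches exact key/value pairs.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata: {name: ssd-agent-affinity}   # hypothetical name
spec:
  selector: {matchLabels: {app: ssd-agent-affinity}}
  template:
    metadata: {labels: {app: ssd-agent-affinity}}
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disk
                operator: In
                values: [ssd, nvme]   # nvme is an assumed extra label value
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sleep","3600"]
```

With only worker-0 labeled disk=ssd, this should select the same single node as the nodeSelector version.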

Updating a DaemonSet

DaemonSet also has an updateStrategy, which defaults to RollingUpdate:

kubectl get ds node-agent -o jsonpath='{.spec.updateStrategy.type}{"\n"}'
# RollingUpdate

RollingUpdate replaces the pods node by node (the same spirit as Lesson 24, but per node). The other option is OnDelete: the DaemonSet does not replace pods when you change the template; a pod is updated only after you delete it yourself. OnDelete suits sensitive agents where you want precise control over when each node gets the new version.
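As a sketch of where these settings live, the fragment below shows the updateStrategy block inside a DaemonSet spec. maxUnavailable: 1 is written out explicitly for clarity; it is also the documented default, so this fragment should behave the same as the node-agent DaemonSet already does. The OnDelete alternative is shown as a comment.

```yaml
# Fragment of a DaemonSet spec (selector and template omitted).
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one node's pod is down at a time during a rollout
  # Alternative: replace pods only when you delete them yourself.
  # updateStrategy:
  #   type: OnDelete
```

Raising maxUnavailable speeds up a rollout on a large cluster at the cost of more nodes being without the agent at the same moment.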

🧹 Cleanup

kubectl delete ds node-agent ssd-agent
kubectl label node worker-0 disk-      # remove the label we added

Deleting the DaemonSets removes the pods they created on every node; the second command removes the disk label so the node returns to its previous state. The cluster is left with the two CoreDNS pods. The manifests are at github.com/nghiadaulau/kubernetes-from-scratch, in the 26-daemonset directory.

Summary

DaemonSet is the "one pod on every node" model: the pod count equals the number of eligible nodes, not a number you declare; add a node and a pod is added, remove a node and its pod is garbage-collected. The controller achieves this by injecting a nodeAffinity that pins each pod to exactly one node (then letting the scheduler bind it), and by injecting a set of tolerations (not-ready, unreachable, disk/memory/pid-pressure, unschedulable) so the agent can run, and is not evicted, even on a troubled node, which is when it is needed most. nodeSelector or affinity narrows "every node" down to a group (we saw a DaemonSet run only on the node labeled disk=ssd). Updates default to RollingUpdate (node by node), or you can use OnDelete to control the timing yourself.

Lesson 27 closes Part IV with the run-to-completion controllers: Job (runs a task until it finishes, rather than running forever like every controller so far), CronJob (a Job on a schedule), and the TTL mechanism that cleans up finished Jobs automatically.