Affinity, taints, and tolerations

Lesson 34 showed the scheduler picking a node on its own through filter + score. But often you are the one who knows where a pod belongs: this workload needs a node with an SSD, two database replicas must not share a machine (lose the machine and you lose both), that node is reserved for one team. Part VII continues with three tools for steering the scheduler from your side, each mapping directly onto the extension points from Lesson 34: nodeAffinity and taint/toleration act mainly at the Filter step (their soft variants, preferred and PreferNoSchedule, only influence Score), while podAffinity/anti-affinity use both Filter and Score. The two directions are opposite: affinity attracts, taints repel.

nodeAffinity: attracting pods to labeled nodes

nodeSelector (Lessons 26 and 28) is the simplest form: a pod only lands on a node that carries every label it lists. nodeAffinity is the more expressive version, with a hard and a soft level. From the documentation:

requiredDuringSchedulingIgnoredDuringExecution: "The scheduler can't schedule the Pod unless the rule is met." (hard, like nodeSelector but with a richer syntax)
preferredDuringSchedulingIgnoredDuringExecution: "The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod." (soft, with a weight of 1–100 added to the node's Score)

IgnoredDuringExecution means: "if the node labels change after Kubernetes schedules the Pod, the Pod continues to run." The rule applies only at scheduling time; a running pod is not evicted if the node's labels change afterwards. Label the two workers, then try a hard rule:

kubectl label node worker-0 disktype=ssd
kubectl label node worker-1 disktype=hdd

# pod aff-ssd: HARD requirement disktype In [ssd]
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - {key: disktype, operator: In, values: [ssd]}
# pod aff-nvme: same rule with disktype In [nvme]  (no node has that label)

kubectl get pods aff-ssd aff-nvme -o wide

aff-ssd    Running   worker-0
aff-nvme   Pending   <none>

# event for aff-nvme:
FailedScheduling  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.

aff-ssd is pulled to worker-0, the only node with disktype=ssd. aff-nvme asks for nvme, which no node has, so it stays Pending. This is still the Filter step from Lesson 34, except that the rule now comes from the pod rather than from resources. Besides In, the operators are NotIn, Exists, DoesNotExist, Gt, and Lt. (The soft form behaves differently: preferred only adds Score points to matching nodes, and the pod is still scheduled when nothing matches. Use it for "nice to have, not required".)
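The soft form can be sketched as a pod that prefers an nvme node but does not require one. This is a minimal illustration, assuming the disktype labels applied above; the pod name aff-soft, the weight of 80, and the placeholder image are assumptions and are not part of the lesson's manifests.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aff-soft            # hypothetical name, for illustration only
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80          # 1-100; added to the Score of nodes that match
        preference:
          matchExpressions:
          - {key: disktype, operator: In, values: [nvme]}
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image (assumption)
```

Unlike aff-nvme, a pod like this should not stay Pending on this cluster: with no nvme node, the preference adds nothing to any node's score and the scheduler picks among the nodes that pass Filter.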

podAntiAffinity: pushing pods away from other pods

nodeAffinity constrains by node labels. Inter-pod affinity constrains by other pods that are already running: "constrain a Pod using labels on other Pods running on the node (or other topological domain)." podAffinity places pods close together and podAntiAffinity keeps them apart, using topologyKey (the scope: node, zone, ...) and labelSelector (which pods to consider). The common case is spreading replicas across nodes so that one machine failing does not take them all down. Here is a Deployment with 3 replicas and at most one per node (anti-affinity against its own label, topologyKey: kubernetes.io/hostname):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector: {matchLabels: {app: spread-aa}}
        topologyKey: kubernetes.io/hostname

kubectl get pods -l app=spread-aa -o wide

spread-aa-...-f8mg9   Running   worker-1
spread-aa-...-kkdw8   Running   worker-0
spread-aa-...-85ggp   Pending   <none>

Two replicas land one per node; the third is Pending, because the hard rule requires a node that does not already run a spread-aa pod, and both nodes do. With only two workers, the cluster has run out of room. (To get all three replicas running while still spreading well, use preferred instead of required, or the topology spread constraints of Lesson 36 with maxSkew, which are more flexible.) This is how you protect availability: replicas do not pile onto one machine.
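The preferred variant mentioned above could look roughly like this. It is a sketch of the pod template's spec only, reusing the app: spread-aa label from the Deployment; the weight of 100 is an illustrative choice, not a value taken from the lesson.

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                    # 1-100; illustrative value
        podAffinityTerm:               # the soft form wraps the term in podAffinityTerm
          labelSelector: {matchLabels: {app: spread-aa}}
          topologyKey: kubernetes.io/hostname
```

With this rule the scheduler still favors nodes without a spread-aa pod, but on a two-worker cluster the third replica should be placed on one of the workers instead of staying Pending.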

taint/toleration: nodes repelling pods

Affinity is the pod choosing a node. A taint works the other way: the node refuses pods. From the documentation: "Taints are ... applied to a node; this marks that the node should not accept any pods that do not tolerate the taints." and "Tolerations are applied to pods ... allow the scheduler to schedule pods with matching taints." There are three effects:

NoSchedule: no new pods are scheduled (except pods with a matching toleration); "Pods currently running on the node are not evicted."
PreferNoSchedule: the soft version; the scheduler tries to avoid the node but gives no guarantee.
NoExecute: the strictest; it also evicts running pods that do not tolerate the taint.

Taint worker-1, then compare a plain pod with a pod that has a toleration:

kubectl taint nodes worker-1 dedicated=team-a:NoSchedule

# no-tol: plain pod, NO toleration
# with-tol: toleration + nodeSelector that forces it onto worker-1
spec:
  nodeSelector: {disktype: hdd}     # worker-1
  tolerations:
  - {key: dedicated, operator: Equal, value: team-a, effect: NoSchedule}

kubectl get pods no-tol with-tol -o wide

no-tol     Running   worker-0      # repelled by worker-1, only worker-0 is left
with-tol   Running   worker-1      # toleration lets it past the taint

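no-tol does not tolerate the taint, so worker-1 repels it and it can only land on worker-0. with-tol has a matching toleration (same key/value/effect), so it is allowed onto worker-1, the node its nodeSelector forces it to target. The toleration only permits the placement; it is the nodeSelector that steers the pod there. This is how you "dedicate" a node: taint it, and only the team's pods that carry the toleration get in. (A toleration matches a taint when the key and effect are the same and either operator is Exists, or operator is Equal and the values are equal.)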
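For comparison, here is a sketch of the Exists form of that matching rule. It is not one of the lesson's manifests; it only reuses the dedicated key from the taint above.

```yaml
spec:
  tolerations:
  - key: dedicated
    operator: Exists        # matches any value of the "dedicated" key, so no value field
    effect: NoSchedule
```

A pod with this toleration would tolerate dedicated=team-a:NoSchedule and equally a taint with any other value under the same key and effect, which is broader than the Equal form used by with-tol.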

NoExecute evicts running pods too

NoSchedule only blocks new pods. NoExecute also affects pods that are already running: "Pods that do not tolerate the taint are evicted immediately." Add a NoExecute taint to worker-1 (where with-tol is running, though its toleration only covers dedicated=team-a:NoSchedule):

kubectl taint nodes worker-1 evict=now:NoExecute
kubectl get pod with-tol -o wide

with-tol   Terminating   worker-1

with-tol is evicted right away (Terminating): its toleration matches only the dedicated=team-a:NoSchedule taint, not the new evict=now:NoExecute taint, so the new taint applies and evicts it. This is the mechanism behind the NoExecute tolerations that a DaemonSet injects automatically in Lesson 26 (not-ready, unreachable), which keep agents from being evicted when a node becomes unhealthy; it is also how the node controller clears pods off a node that stays NotReady for too long. You can set tolerationSeconds to "stay N more seconds, then leave", which is useful for a graceful drain.
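A tolerationSeconds setting for the evict=now:NoExecute taint above might be sketched like this. The 60-second value is an illustrative assumption, and this toleration is not part of the lesson's with-tol manifest.

```yaml
spec:
  tolerations:
  - key: evict
    operator: Equal
    value: now
    effect: NoExecute
    tolerationSeconds: 60   # illustrative: stay bound for 60s after the taint is added, then be evicted
```

Omitting tolerationSeconds on a NoExecute toleration means the pod tolerates the taint indefinitely, so the field is what turns "never evict" into "evict after a grace window".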

🧹 Cleanup

kubectl taint nodes worker-1 evict:NoExecute-          # remove the taint (trailing -)
kubectl taint nodes worker-1 dedicated:NoSchedule-
kubectl label node worker-0 disktype- ; kubectl label node worker-1 disktype-
kubectl delete pod aff-ssd aff-nvme no-tol with-tol --now --ignore-not-found
kubectl delete deployment spread-aa

Removing the taints and labels returns the nodes to their original state (the trailing - deletes). The cluster is back to two CoreDNS pods and two Ready nodes with no taints. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, in the 35-affinity-taints directory.

Summary

Three tools for steering the scheduler from your side, in two opposite directions. nodeAffinity attracts pods to nodes by label: required (hard; no match means Pending, as we saw with aff-nvme stuck waiting for nvme) or preferred (soft; adds Score points); IgnoredDuringExecution means the rule applies only at scheduling time. podAntiAffinity pushes pods away from other pods through topologyKey + labelSelector: with 3 replicas and only 2 nodes, the third replica stayed Pending (one per node). A taint makes a node refuse pods, and a toleration on the pod lets it through: NoSchedule blocks new pods (no-tol avoided worker-1, with-tol got in), NoExecute also evicts running pods (with-tol went Terminating because it only tolerated the NoSchedule taint). Affinity acts on Filter/Score and taints act on Filter; both are ways of "talking" to the scheduler from Lesson 34.

Lesson 36 digs into three finer scheduling mechanisms: topology spread constraints (maxSkew: spreading pods evenly across zones/nodes in a flexible way, unlike rigid anti-affinity), pod overhead (accounting for the extra resources used by the runtime sandbox), and scheduling readiness (schedulingGates: holding a pod back from scheduling until it is ready).