The scheduler and the scheduling framework

Throughout this series, every pod we created has "naturally" ended up running on some node, worker-0 or worker-1, without us ever asking who made that decision. The answer is kube-scheduler, the component we set up in Lesson 8 and glimpsed in Lesson 17 (the request lifecycle). Part VII is about scheduling, that is, "which pod runs on which node", and this opening lesson digs into exactly how the scheduler decides: it filters out nodes that do not fit, scores the ones that remain, then binds. Understanding the filter and score steps is the foundation for everything that follows (affinity, taints, topology spread).

Role: finding a node for pods that have none

The documentation puts it concisely: "A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on." A newly created pod has an empty spec.nodeName (Lesson 17); the scheduler watches for such pods, picks a node, then "notifies the API server about this decision in a process called binding." A node that qualifies is called a feasible node: "Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it."

The selection is split into two steps, quoted verbatim:

Filtering: "finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests."
Scoring: "the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering." Then "kube-scheduler assigns the Pod to the Node with the highest ranking."

Filter: ruling out nodes that do not fit

The clearest way to see the filter at work is to create a pod that no node can hold. Each worker has Allocatable cpu = 2 (Lesson 32), so a request for 3 CPUs makes the filter reject every node:

apiVersion: v1
kind: Pod
metadata: {name: too-greedy}
spec:
  containers:
  - name: c
    image: busybox:1.36
    command: ["sleep","3600"]
    resources: {requests: {cpu: "3"}}     # > the Allocatable of 2 on every node

kubectl get pod too-greedy -o wide
kubectl get event --field-selector involvedObject.name=too-greedy | grep FailedScheduling

NAME         READY   STATUS    NODE
too-greedy   0/1     Pending   <none>

Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.
  ... preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

The pod is stuck in Pending with an empty nodeName: "the pod remains unscheduled until the scheduler is able to place it." The event message tells the whole filter story: 0/2 nodes are available: 2 Insufficient cpu. That is the NodeResourcesFit filter (formerly PodFitsResources) comparing the request against Allocatable (the value Lesson 32 dug into, not Capacity). The second part, preemption: ... Preemption is not helpful for scheduling, is the PostFilter step: when no feasible node is left, the scheduler tries evicting lower-priority pods to make room. Here eviction cannot help: the 3-CPU request exceeds the Allocatable of every node even when the node is empty, and all pods share the same priority anyway (preemption is covered in the priority lesson). The pod will wait indefinitely until a node with enough room appears.
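To make the filter step concrete, here is a minimal Python sketch of a CPU-only fit check that aggregates rejection reasons into a message shaped like the event above. It is an illustration under stated assumptions, not the kube-scheduler implementation: the names NODES, cpu_fit_filter and schedule_report are hypothetical, values are in millicores, and the 100m already requested on each node is an assumed starting load.

```python
from collections import Counter

# Hypothetical node data (millicores): Allocatable CPU and CPU already requested.
NODES = {
    "worker-0": {"allocatable": 2000, "requested": 100},
    "worker-1": {"allocatable": 2000, "requested": 100},
}

def cpu_fit_filter(pod_cpu, node):
    """Return None if the pod's CPU request fits, otherwise a failure reason."""
    free = node["allocatable"] - node["requested"]
    return None if pod_cpu <= free else "Insufficient cpu"

def schedule_report(pod_cpu, nodes):
    """Run the filter on every node; return (feasible node names, summary message)."""
    feasible, reasons = [], Counter()
    for name, node in nodes.items():
        reason = cpu_fit_filter(pod_cpu, node)
        if reason is None:
            feasible.append(name)
        else:
            reasons[reason] += 1
    if feasible:
        return feasible, ""
    # No feasible node: summarize how many nodes failed for each reason.
    detail = ", ".join(f"{count} {reason}" for reason, count in sorted(reasons.items()))
    return feasible, f"0/{len(nodes)} nodes are available: {detail}."

feasible, message = schedule_report(3000, NODES)  # the too-greedy pod asks for 3 CPUs
print(feasible, message)
```

With a 1000m request instead, both nodes pass the check and the message stays empty, which corresponds to the pod moving on to scoring.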

Score + Bind: picking the best node

What about a pod that does fit? Request 1 CPU:

# pod fits-fine, requests cpu: "1"

kubectl get pod fits-fine -o wide
kubectl get event --field-selector involvedObject.name=fits-fine,reason=Scheduled

NAME        READY   STATUS    NODE
fits-fine   1/1     Running   worker-1

Normal  Scheduled  Successfully assigned default/fits-fine to worker-1

Having passed the filter (both nodes have room for 1 CPU), the scheduler scores and then binds. The event Scheduled: Successfully assigned default/fits-fine to worker-1 marks the binding cycle, in which the scheduler tells the API server to set nodeName=worker-1. The decision was made by default-scheduler, the very HA kube-scheduler we built in Lesson 8:

kubectl get pod fits-fine -o jsonpath='{.spec.schedulerName}{"\n"}'
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}{"\n"}'

default-scheduler
controller-0_422ddcd2-...

schedulerName is default-scheduler, and the kube-scheduler Lease shows that the current leader is controller-0 (three HA replicas, one leader, exactly as in Lesson 8). Every pod that does not declare its own schedulerName is handled by this scheduler.

Scoring is not an even split: it favors the less-loaded node

What does Score actually do? Try creating four small pods (50m each) after fits-fine (1 CPU) has landed on worker-1, then see where they end up:

# create spread-1..4, each pod requests cpu: 50m
kubectl get pods -l app=spread -o wide

spread-1 worker-0
spread-2 worker-0
spread-3 worker-0
spread-4 worker-0

All four pile onto worker-0 rather than splitting two per node. The allocated resources make the reason clear:

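# the cpu row sits a few lines below the "Allocated resources" header, so -A5 is needed to reach it
kubectl describe node worker-0 | grep -A5 "Allocated resources" | grep cpu   # 300m
kubectl describe node worker-1 | grep -A5 "Allocated resources" | grep cpu   # 1100m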

worker-0:  cpu  300m (15%)
worker-1:  cpu  1100m (55%)

worker-1 already carries 1100m (1 CPU from fits-fine plus CoreDNS), while worker-0 held only 100m beforehand. The scheduler's scoring (by default it includes NodeResourcesBalancedAllocation and the least-allocated strategy) favors the less-loaded node, so every small pod scores higher on worker-0 and lands there until the load evens out. This is easy to misread: the scheduler does not do naive round-robin; it picks the node that looks emptier in order to level the load. Filter answers "which nodes are allowed", Score answers "which node is best".
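The arithmetic behind this can be sketched in a few lines of Python. This is a simplified, CPU-only approximation of a least-allocated score (free CPU after placement, scaled to 0..100), offered as an assumption-laden illustration: the real scheduler also weighs memory, combines several score plugins, and breaks ties differently. The function names and the node dictionary are hypothetical; the starting loads (100m and 1100m of 2000m) are the ones observed above.

```python
MAX_NODE_SCORE = 100

def least_allocated_score(allocatable, requested, pod_cpu):
    """Score 0..100: the more CPU left free after placing the pod, the higher."""
    after = requested + pod_cpu
    if after > allocatable:
        return 0
    return (allocatable - after) * MAX_NODE_SCORE // allocatable

def place(pods, nodes):
    """Place pods one at a time on the highest-scoring node, updating its load."""
    placements = []
    for pod_name, pod_cpu in pods:
        best = max(
            nodes,
            key=lambda n: least_allocated_score(
                nodes[n]["allocatable"], nodes[n]["requested"], pod_cpu
            ),
        )
        nodes[best]["requested"] += pod_cpu
        placements.append((pod_name, best))
    return placements

# Loads in millicores before the four 50m pods arrive.
nodes = {
    "worker-0": {"allocatable": 2000, "requested": 100},
    "worker-1": {"allocatable": 2000, "requested": 1100},
}
pods = [(f"spread-{i}", 50) for i in range(1, 5)]
for pod_name, node_name in place(pods, nodes):
    print(pod_name, node_name)
```

For the first pod the sketch scores worker-0 at 92 and worker-1 at 42, and the gap stays large for all four pods, so each one lands on worker-0 and its requested CPU ends at 300m, matching the observed output.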

Scheduling framework: why this chain of steps is pluggable

Filter and Score are not hard-coded: they are plugins in the scheduling framework. The documentation says: "The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It consists of a set of plugin APIs that are compiled directly into the scheduler." Each attempt to place a pod is split into two phases: "the scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster." Scheduling cycles run serially, while binding cycles may run concurrently.

Each phase has extension points in a fixed order, and plugins hook in at these points:

SCHEDULING CYCLE (serial)              BINDING CYCLE (concurrent)
─────────────────────────              ──────────────────────────
PreFilter  → prepare / pre-check
Filter     → rule out unfit nodes       (← "Insufficient cpu" above)
PostFilter → no node left? try preempt  (← "Preemption is not helpful")
PreScore   → prepare for scoring
Score      → score remaining nodes      (← piling onto less-loaded worker-0)
NormalizeScore → normalize scores
Reserve    → reserve resources
Permit     → allow / delay binding
                                       PreBind → work before binding
                                       Bind    → set nodeName  (← "Scheduled")
                                       PostBind → notification after binding

Everything we just observed maps onto this diagram: Insufficient cpu is Filter, Preemption is not helpful is PostFilter, the piling of pods onto the less-loaded node is Score, and the Scheduled event is Bind. This plugin architecture is what makes it possible to write a custom scheduler (changing score weights, adding your own filter) or to run several schedulers side by side, with each pod choosing its scheduler through schedulerName. The next lessons in Part VII configure these plugins from the pod side: affinity and taints mainly drive Filter (their "preferred" variants feed Score instead), topology spread drives Filter or Score depending on whenUnsatisfiable, and priority drives PostFilter (preemption).
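The pluggable idea can be sketched as a toy Python class in which filters and scorers are registered callables and a trace records which extension points were visited. This is only a conceptual model under assumptions: the real framework is a set of Go interfaces compiled into kube-scheduler, and the Framework class, its methods, the node fields and the "disk" label are all hypothetical names invented for this illustration.

```python
class Framework:
    """Toy model of the scheduling framework: plugins are plain callables."""

    def __init__(self):
        self.filters = []  # callables (pod, node) -> bool
        self.scorers = []  # pairs of (callable (pod, node) -> int, weight)
        self.trace = []    # extension points visited by the last schedule() call

    def add_filter(self, fn):
        self.filters.append(fn)

    def add_scorer(self, fn, weight=1):
        self.scorers.append((fn, weight))

    def schedule(self, pod, nodes):
        # Scheduling cycle: filter first.
        self.trace = ["PreFilter", "Filter"]
        feasible = [n for n in nodes if all(f(pod, n) for f in self.filters)]
        if not feasible:
            # PostFilter only runs when nothing is feasible (preemption would go here).
            self.trace.append("PostFilter")
            return None
        # Score the survivors with a weighted sum over all score plugins.
        self.trace += ["PreScore", "Score", "NormalizeScore"]
        totals = {
            n["name"]: sum(w * s(pod, n) for s, w in self.scorers) for n in feasible
        }
        chosen = max(totals, key=totals.get)
        self.trace += ["Reserve", "Permit"]
        # Binding cycle: apply the decision.
        self.trace += ["PreBind", "Bind", "PostBind"]
        pod["nodeName"] = chosen
        return chosen


fw = Framework()
fw.add_filter(lambda pod, node: pod["cpu"] <= node["free_cpu"])  # resource fit
fw.add_filter(lambda pod, node: node.get("disk") == "ssd")       # hypothetical custom filter
fw.add_scorer(lambda pod, node: node["free_cpu"] * 100 // node["cpu"])

nodes = [
    {"name": "worker-0", "cpu": 2000, "free_cpu": 1900, "disk": "hdd"},
    {"name": "worker-1", "cpu": 2000, "free_cpu": 900, "disk": "ssd"},
]
pod = {"name": "demo", "cpu": 500}
print(fw.schedule(pod, nodes), fw.trace)
```

In this example the extra filter removes worker-0 even though it is the emptier node, so the pod binds to worker-1: a node must pass every Filter plugin before Score has any say.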

🧹 Cleanup

kubectl delete pod too-greedy fits-fine --now
kubectl delete pod -l app=spread --now

These are objects in the cluster, so deleting them leaves nothing behind. The cluster is back to its two CoreDNS pods. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, in the 34-scheduler directory.

Summary

kube-scheduler (default-scheduler, the HA setup from Lesson 8 whose leader currently sits on controller-0) picks a node for every pod with an empty nodeName in two steps. Filter rules out nodes that cannot run the pod: NodeResourcesFit compares the request against Allocatable (Lesson 32), and when no node fits the pod stays Pending with 0/2 nodes are available: 2 Insufficient cpu (we saw the 3-CPU pod hang indefinitely). Score ranks the remaining nodes, favoring the less-loaded one (least-allocated/balanced) rather than round-robin (the four small pods piled onto worker-0 because worker-1 already held 1 CPU). The binding cycle then sets nodeName (the Scheduled event). All of these are plugins in the scheduling framework, with extension points in a fixed order (PreFilter→Filter→PostFilter→PreScore→Score→...→Bind), a pluggable architecture that the remaining Part VII lessons drive from the pod side.

Lesson 35 covers the strongest tools you have for steering Filter: nodeAffinity (pulling pods toward nodes with a given label; we saw DaemonSet inject it automatically in Lesson 26), podAffinity/anti-affinity (placing pods near or far from other pods), and taints/tolerations (a node repels pods unless they tolerate it, the pairing we already saw in the tolerations that DaemonSet injects).