Requests, limits, QoS, and the Downward API

Every pod we have created so far has been "bare": none declared how much CPU or RAM it needs. On a real cluster that is a bad idea: the scheduler cannot tell which node a pod will fit on, and a single memory-hungry pod can take down an entire node. requests and limits are the two numbers that fix this, but they do two different jobs, and from them Kubernetes derives a third thing, the QoS class, which decides which pods are sacrificed first when a node runs out of resources. The lesson ends with the Downward API: a way for the container itself to read those numbers (and plenty of other information about itself) without calling the API server.

requests steer placement, limits are the fence

Two numbers, two clearly separate roles. requests is the amount of a resource the pod asks to have guaranteed, and it is what scheduling uses. The documentation:

"When you specify the resource request for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on."

The scheduler adds up the requests of the pods already on a node and places a new pod only on a node that still has room for its request (this is the NodeResourcesFit part of the scheduler framework, covered in the Scheduling part). limits, by contrast, are a ceiling, and enforcement differs sharply between CPU and memory:

"cpu limits are enforced by CPU throttling ... a cpu limit is a hard limit the kernel enforces. Containers may not use more CPU than is specified in their cpu limit."

A container that tries to exceed its CPU limit is throttled: the process slows down but does not die. Memory is less forgiving:

"memory limits are enforced by the kernel with out of memory (OOM) kills. When a container uses more than its memory limit, the kernel may terminate it ... A container may use more memory than its memory limit, but if it does, it may get killed."

This difference matters when you pick the numbers: going over on CPU only makes you slow, while going over on RAM gets you killed. As for units, the documentation settles it: 1 CPU = one core (physical or virtual), and 0.1 = 100m ("one hundred millicpu"); memory is measured in bytes, abbreviated with the suffixes Mi/Gi (powers of 2) or M/G (powers of 10). And one default that is easy to forget:

"If you specify a limit for a resource, but do not specify any request ... Kubernetes copies the limit you specified and uses it as the requested value for the resource."

In other words, if you declare only a limit, the request automatically equals the limit.
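To make the units and the "is there still room on the node" check concrete, here is a minimal shell sketch. The helper names to_millicpu, to_bytes, and fits_on_node are made up for illustration and are not part of kubectl or Kubernetes. The sketch handles only the suffixes mentioned in this lesson, and the fit check is a simplification of what the scheduler does for a single resource.

```shell
#!/bin/sh
# Illustrative helpers only; these names are hypothetical, not a Kubernetes API.

# Convert a CPU quantity to millicpu.
# Handles whole cores ("2") and the m suffix ("100m"); fractional forms such as "0.1" are not handled.
to_millicpu() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity to bytes: Mi/Gi are powers of 2, M/G are powers of 10.
to_bytes() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *G)  echo $(( ${1%G} * 1000 * 1000 * 1000 )) ;;
    *M)  echo $(( ${1%M} * 1000 * 1000 )) ;;
    *)   echo "$1" ;;   # assumed to be plain bytes already
  esac
}

# fits_on_node ALLOCATABLE ALREADY_REQUESTED NEW_REQUEST (all in the same unit)
# Succeeds when the sum of existing requests plus the new request still fits.
fits_on_node() {
  [ $(( $2 + $3 )) -le "$1" ]
}

to_millicpu 100m   # prints 100
to_bytes 64Mi      # prints 67108864
to_bytes 64M       # prints 64000000
```

For example, on a hypothetical node with 2000m allocatable CPU and 1800m already requested, fits_on_node 2000 1800 100 succeeds, while a 300m request would not fit, no matter how idle the node actually is. The check looks at requests, not at live usage.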

The three QoS classes

From the way requests and limits are declared, Kubernetes assigns every pod one of three QoS classes, recorded in status.qosClass. You do not set this class yourself; it is derived, and it determines the order in which pods are evicted when a node is short on resources. Let's build exactly three pods, one per class.

Guaranteed has the strictest conditions. According to the documentation, every container must have both a request and a limit for both CPU and memory, and each request must equal its limit:

apiVersion: v1
kind: Pod
metadata: {name: qos-guaranteed}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]
    resources:
      requests: {cpu: "100m", memory: "64Mi"}
      limits:   {cpu: "100m", memory: "64Mi"}   # exactly equal to the requests

Burstable: the pod does not qualify as Guaranteed, but at least one container has a request or a limit. Here there is a request, the memory limit is higher than the memory request (so they are not "equal"), and there is no CPU limit:

apiVersion: v1
kind: Pod
metadata: {name: qos-burstable}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]
    resources:
      requests: {cpu: "50m", memory: "32Mi"}
      limits:   {memory: "128Mi"}

BestEffort: nothing is declared at all. No request, no limit, in any container.

apiVersion: v1
kind: Pod
metadata: {name: qos-besteffort}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]

Create all three, then read status.qosClass:

for p in qos-guaranteed qos-burstable qos-besteffort; do
  echo "$p => $(kubectl get pod "$p" -o jsonpath='{.status.qosClass}')"
done

qos-guaranteed => Guaranteed
qos-burstable => Burstable
qos-besteffort => BestEffort
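The derivation rules used for these three pods can be sketched as a small shell function. This is only an illustration under simplifying assumptions: qos_class is a made-up name, it covers single-container pods only, and it compares quantities as plain strings (so "0.1" and "100m" would wrongly count as different). The real logic lives inside Kubernetes and covers every container in the pod.

```shell
#!/bin/sh
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Hypothetical helper. Pass "" for a value that is not declared.
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4

  # Declaring only a limit makes the request default to that limit.
  : "${cpu_req:=$cpu_lim}"
  : "${mem_req:=$mem_lim}"

  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    # Nothing declared at all.
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    # Both resources have a limit, and each request equals its limit.
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 100m 100m 64Mi 64Mi   # prints Guaranteed
qos_class 50m  ""   32Mi 128Mi  # prints Burstable
qos_class ""   ""   ""   ""     # prints BestEffort
```

Note the consequence of the defaulting rule: a pod that declares only limits for both CPU and memory also comes out as Guaranteed, because its requests are copied from the limits.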

Kubernetes derives the expected three classes from the declarations. Why care? Because when a node runs low on resources (node pressure), the kubelet evicts pods in class order:

"When a Node runs out of resources, Kubernetes will first evict BestEffort Pods running on that Node, followed by Burstable and finally Guaranteed Pods."

BestEffort goes first, Guaranteed last. The documentation describes Guaranteed pods as "least likely to face eviction ... guaranteed not to be killed until they exceed their limits or there are no lower-priority Pods that can be preempted". And one subtle condition: "only Pods exceeding resource requests are candidates for eviction", meaning a pod that stays within its own requests is safe. The operational lesson: important workloads (databases, control plane) should be Guaranteed; for throwaway jobs and batch work that can tolerate being lost, BestEffort is enough.

(Note: the node-pressure eviction above is the kubelet proactively clearing pods when the node is short on resources. That is different from the OOM kill just below, where the kernel kills a process because the container exceeded its own memory limit. Eviction is covered in depth in the lesson on scheduling/eviction.)

When a container exceeds its memory limit: OOMKilled

The theory that "memory over the limit means death" is worth seeing first-hand. Build a container with a 32Mi memory limit, then have it try to consume memory without bound: tail /dev/zero reads an endless stream of zeros into its buffer, which keeps growing until it hits the ceiling:

apiVersion: v1
kind: Pod
metadata: {name: oom-demo}
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh","-c","echo hogging memory without bound, limit 32Mi; tail /dev/zero"]
    resources:
      limits: {memory: "32Mi"}

kubectl get pod oom-demo
kubectl get pod oom-demo -o jsonpath='phase={.status.phase}{"\n"}reason={.status.containerStatuses[0].state.terminated.reason} exitCode={.status.containerStatuses[0].state.terminated.exitCode}{"\n"}'

NAME       READY   STATUS      RESTARTS   AGE
oom-demo   0/1     OOMKilled   0          12s

phase=Failed
reason=OOMKilled exitCode=137

STATUS: OOMKilled, reason=OOMKilled, exitCode=137. It is 137 again (128 + 9, where 9 is SIGKILL), as with the liveness kill in Lesson 20, but this time the culprit is not the kubelet but the kernel: when processes in the container exceed the cgroup memory ceiling, the kernel's OOM killer steps in immediately. Because restartPolicy is Never, the pod ends up Failed; with Always, the kubelet would restart the container and (if the app keeps eating RAM) the pod would spiral into the CrashLoopBackOff from Lesson 18. This is why a memory limit has to sit close to real usage: set it too low and the app is killed needlessly, set it too high and it no longer protects the node.
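The 128 + 9 arithmetic follows the usual shell convention that an exit code above 128 means "terminated by signal (code minus 128)". A minimal sketch of that decoding, where signal_from_exit_code is a hypothetical helper name and not a kubectl feature:

```shell
#!/bin/sh
# signal_from_exit_code CODE
# Prints the signal number encoded in an exit code, or 0 if the code
# is 128 or below (the process exited on its own, not from a signal).
signal_from_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

sig=$(signal_from_exit_code 137)
echo "signal number: $sig"              # signal number: 9
echo "signal name: $(kill -l "$sig")"   # signal name: KILL
```

The exit code alone only tells you the process received SIGKILL. It is the reason field (OOMKilled) that tells you the kernel's OOM killer sent it, rather than the kubelet.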

Downward API: letting a container know about itself

One practical question remains: how does code inside a container find out its pod name, which node it runs on, its IP, or how much RAM it was given? Calling the API server directly requires a token and permissions, and ties the app tightly to Kubernetes. The Downward API addresses exactly that. The documentation:

"The downward API allows containers to consume information about themselves or the cluster without using the Kubernetes client or API server." The rationale: "It is sometimes useful for a container to have information about itself, without being overly coupled to Kubernetes."

There are two ways to expose the information to a container: environment variables and files in a downwardAPI volume. One pod using both:

apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
  labels: {app: downward, tier: demo}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh","-c","sleep 3600"]
    resources:
      requests: {cpu: "100m", memory: "64Mi"}
      limits:   {cpu: "250m", memory: "128Mi"}
    env:
    - {name: MY_POD_NAME,      valueFrom: {fieldRef: {fieldPath: metadata.name}}}
    - {name: MY_POD_NAMESPACE, valueFrom: {fieldRef: {fieldPath: metadata.namespace}}}
    - {name: MY_NODE_NAME,     valueFrom: {fieldRef: {fieldPath: spec.nodeName}}}
    - {name: MY_POD_IP,        valueFrom: {fieldRef: {fieldPath: status.podIP}}}
    - {name: MY_MEM_REQUEST,   valueFrom: {resourceFieldRef: {containerName: app, resource: requests.memory}}}
    - {name: MY_CPU_LIMIT,     valueFrom: {resourceFieldRef: {containerName: app, resource: limits.cpu}}}
    volumeMounts:
    - {name: podinfo, mountPath: /etc/podinfo}
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: labels
        fieldRef: {fieldPath: metadata.labels}

There are two kinds of reference: fieldRef reads pod fields (metadata.*, spec.nodeName, status.podIP...), while resourceFieldRef reads a container's resource requests and limits. Look inside the container:

kubectl exec downward-demo -- sh -c 'env | grep ^MY_ | sort'
kubectl exec downward-demo -- cat /etc/podinfo/labels

MY_CPU_LIMIT=1
MY_MEM_REQUEST=67108864
MY_NODE_NAME=worker-0
MY_POD_IP=10.200.0.15
MY_POD_NAME=downward-demo
MY_POD_NAMESPACE=default

app="downward"
tier="demo"

Cross-check against the control plane to be sure:

kubectl get pod downward-demo -o jsonpath='nodeName={.spec.nodeName} podIP={.status.podIP}{"\n"}'

nodeName=worker-0 podIP=10.200.0.15

MY_NODE_NAME and MY_POD_IP match reality: the container knows it is on worker-0 with IP 10.200.0.15 without ever calling the API server. Two details worth remembering about resourceFieldRef:

MY_MEM_REQUEST=67108864: memory is returned in bytes (64 × 1024 × 1024 = 67108864), not as the string 64Mi. An app that uses it has to treat it as a byte count.
MY_CPU_LIMIT=1: the limit is declared as 250m but exposed as 1. By default, resourceFieldRef rounds CPU UP to a whole number of cores. To get the exact millicpu value, add divisor: 1m to the reference. This is an easy trap to fall into if the app sizes its thread pool from the CPU limit.
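Both values are consistent with one rule: the exposed number is the resource value divided by the divisor, rounded up. The sketch below reproduces that arithmetic in shell. It assumes this ceil(value / divisor) behavior, with a default divisor of 1 (one core for CPU, one byte for memory), and exposed_value is a made-up name for illustration.

```shell
#!/bin/sh
# exposed_value VALUE DIVISOR
# Integer ceil(VALUE / DIVISOR). For CPU, both arguments are in millicpu;
# for memory, both are in bytes.
exposed_value() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

exposed_value 250 1000           # limits.cpu=250m, default divisor 1 core (1000m): prints 1
exposed_value 250 1              # same limit with divisor 1m: prints 250
exposed_value 67108864 1         # requests.memory=64Mi, default divisor 1 byte: prints 67108864
exposed_value 67108864 1048576   # same request with divisor 1Mi: prints 64
```

An app that sizes its thread pool from MY_CPU_LIMIT would see 1 for anything between 1m and 1000m with the default divisor, which is why asking for millicpu is the safer choice when fractions of a core matter.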

The downwardAPI volume differs from env in one useful way: metadata.labels through a volume yields all labels, one key="value" pair per line (we see app="downward" and tier="demo"), and the file is updated when the pod's labels change, whereas environment variables are fixed at container start. Use a volume when you need to follow label or annotation changes; for a static value, env is more compact.

🧹 Cleanup

kubectl delete pod qos-guaranteed qos-burstable qos-besteffort downward-demo oom-demo --now

These are all objects inside the cluster, so deleting them leaves it clean, back to just the two CoreDNS pods. The manifests are at github.com/nghiadaulau/kubernetes-from-scratch, in the 22-resources-qos directory.

Summary

requests and limits do two different jobs: the request steers the scheduler (it is what the pod asks to have guaranteed), while the limit is a ceiling enforced by the kernel. Exceeding a CPU limit gets the container throttled (it only slows down); exceeding a memory limit gets it OOM killed (dead, exitCode 137, as we saw with OOMKilled). Declare only a limit and the request automatically equals it. From these numbers Kubernetes derives the QoS class: Guaranteed (request equals limit for both CPU and memory in every container, evicted last), Burstable (at least one request or limit), BestEffort (nothing declared, evicted first), exactly as we read from status.qosClass, and the class decides who is evicted first when a node runs out of resources. The Downward API lets a container read information about itself through env (fieldRef/resourceFieldRef) or a volume, without touching the API server; remember that memory comes out in bytes and CPU is rounded up to whole cores unless you set a divisor.

Lesson 23 closes Part III with the other side of the pod lifecycle: disruption. Pods are disrupted in two ways, voluntarily (draining a node for maintenance or an upgrade) and involuntarily (a node dies or runs out of RAM), and a PodDisruptionBudget is how you tell the cluster "don't take down too many of my replicas at once".