Requests, limits, QoS and the Downward API

So far every pod we've created has been "bare" — declaring nothing about how much CPU or RAM it needs. On a real cluster, that's a bad idea: the scheduler doesn't know where to fit the pod, and a RAM-hungry pod can drag down the whole node. requests and limits are the two numbers that fix this, but they do different jobs, and from them Kubernetes derives a third thing, the QoS class, which decides which pod gets sacrificed first when the node runs out of resources. At the end is the Downward API: how the container itself can read those numbers (and much other information about itself) without calling the API server.

requests guide, limits fence

Two numbers, two distinct roles. requests is the amount of resource the pod asks to be guaranteed — and it's used for scheduling. The docs:

"When you specify the resource request for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on."

The scheduler sums the requests of the pods already on a node, and only places the new pod on a node that still has room for its request (this is the NodeResourcesFit part of the scheduler framework — to be dug into in the Scheduling part). limits, meanwhile, is a ceiling, and the enforcement is quite different between CPU and memory:

"cpu limits are enforced by CPU throttling ... a cpu limit is a hard limit the kernel enforces. Containers may not use more CPU than is specified in their cpu limit."

Exceeding the CPU ceiling only gets the container throttled: the process runs slower but doesn't die. Memory is more ruthless:

"memory limits are enforced by the kernel with out of memory (OOM) kills. When a container uses more than its memory limit, the kernel may terminate it ... A container may use more memory than its memory limit, but if it does, it may get killed."

This difference matters when setting the numbers: exceeding CPU is just slow, while exceeding RAM is death. On units, the docs settle it: 1 CPU = one core (physical or virtual), and 0.1 = 100m ("one hundred millicpu"); memory is counted in bytes, written compactly with the suffix Mi/Gi (powers of 2) or M/G (powers of 10). And one often-forgotten default:

"If you specify a limit for a resource, but do not specify any request ... Kubernetes copies the limit you specified and uses it as the requested value for the resource."

That is, declare only a limit and the request automatically equals the limit.
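To make the unit rules and the limit-only default concrete, here is a minimal Python sketch. The helper names (cpu_to_millicores, memory_to_bytes, effective_requests) are made up for this illustration. The parser handles only plain numbers and a few common suffixes, not the full Kubernetes quantity grammar, and the defaulting function only imitates what Kubernetes does for you when the pod is created.

```python
from decimal import Decimal

# Binary (power-of-2) and decimal (power-of-10) memory suffixes.
# Only a subset of the Kubernetes quantity grammar; exponent forms are not handled.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def cpu_to_millicores(quantity: str) -> int:
    """'100m' -> 100, '0.1' -> 100, '1' -> 1000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """'64Mi' -> 67108864, '64M' -> 64000000, '1024' -> 1024."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def effective_requests(requests: dict, limits: dict) -> dict:
    """Imitate the default: a resource with a limit but no request gets request = limit."""
    result = dict(limits)
    result.update(requests)  # explicit requests win over copied limits
    return result


print(cpu_to_millicores("0.1"), cpu_to_millicores("100m"))
print(memory_to_bytes("64Mi"), memory_to_bytes("64M"))
print(effective_requests({}, {"memory": "32Mi"}))
```

Note how 64Mi and 64M differ by about three million bytes: mixing up the two suffix families is an easy way to end up with a limit slightly lower than intended.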

The three QoS classes

From how requests/limits are declared, Kubernetes automatically assigns each pod one of three QoS classes — recorded at status.qosClass. This class isn't for you to set; it's derived, and it decides the order in which pods get evicted when the node is short on resources. Stand up exactly three pods for the three classes.

Guaranteed — the strictest condition, per the docs: every container must have both a request and a limit for both CPU and memory, and request must equal limit:

apiVersion: v1
kind: Pod
metadata: {name: qos-guaranteed}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]
    resources:
      requests: {cpu: "100m", memory: "64Mi"}
      limits:   {cpu: "100m", memory: "64Mi"}   # exactly equal to request

Burstable — doesn't reach Guaranteed, but has at least one request or limit. Here there's a request, the memory limit is larger than the request (so not "equal"), and the CPU limit is missing:

apiVersion: v1
kind: Pod
metadata: {name: qos-burstable}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]
    resources:
      requests: {cpu: "50m", memory: "32Mi"}
      limits:   {memory: "128Mi"}

BestEffort — declares nothing at all: no requests, no limits, for every container.

apiVersion: v1
kind: Pod
metadata: {name: qos-besteffort}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep","3600"]

Create all three then read status.qosClass:

for p in qos-guaranteed qos-burstable qos-besteffort; do
  echo "$p => $(kubectl get pod $p -o jsonpath='{.status.qosClass}')"
done

qos-guaranteed => Guaranteed
qos-burstable => Burstable
qos-besteffort => BestEffort
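The rule behind these three results can be sketched in a few lines of Python. This is an illustration of the conditions described above, not the kubelet's actual code: the function name qos_class is hypothetical, quantities are compared as literal strings, and corner cases such as zero-valued quantities or resources other than CPU and memory are ignored.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers: list) -> str:
    """Derive a QoS class from each container's `resources` dict.

    Simplification: quantities are compared as literal strings, so "100m"
    and "0.1" would count as different even though they are the same amount.
    """
    anything_declared = False
    all_guaranteed = True
    for resources in containers:
        limits = resources.get("limits", {})
        # A missing request defaults to the limit for that resource.
        requests = {**limits, **resources.get("requests", {})}
        for name in RESOURCES:
            if name in requests or name in limits:
                anything_declared = True
            # Guaranteed needs a limit for this resource and request == limit.
            if name not in limits or requests[name] != limits[name]:
                all_guaranteed = False
    if not anything_declared:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


# The `resources` sections of the three pods in this article.
pods = {
    "qos-guaranteed": [{"requests": {"cpu": "100m", "memory": "64Mi"},
                        "limits": {"cpu": "100m", "memory": "64Mi"}}],
    "qos-burstable": [{"requests": {"cpu": "50m", "memory": "32Mi"},
                       "limits": {"memory": "128Mi"}}],
    "qos-besteffort": [{}],
}
for pod_name, containers in pods.items():
    print(pod_name, "=>", qos_class(containers))
```

One consequence of the limit-to-request default shows up here: a container that declares only limits for both CPU and memory still counts as Guaranteed in this sketch, because its requests are filled in from the limits.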

Kubernetes derives the three classes correctly from how things are declared. Why care? Because when the node runs out of resources (node pressure), the kubelet evicts pods in class order:

"When a Node runs out of resources, Kubernetes will first evict BestEffort Pods running on that Node, followed by Burstable and finally Guaranteed Pods."

BestEffort dies first, Guaranteed dies last. The docs describe Guaranteed Pods as "least likely to face eviction ... guaranteed not to be killed until they exceed their limits or there are no lower-priority Pods that can be preempted". And one subtle condition: "only Pods exceeding resource requests are candidates for eviction", i.e. a pod that stays within its own requests is among the last to be considered. The operational lesson: important workloads (databases, control plane components) should be Guaranteed; throwaway jobs and batch work that can tolerate loss are fine as BestEffort.

(Note: the eviction due to node pressure above is the kubelet proactively clearing pods when the node is short, which differs from the OOM kill just below, where the kernel kills a process when a container exceeds its own memory limit. Eviction will be dug into in the scheduling/eviction article.)

When a container exceeds its memory limit: OOMKilled

The theory "exceeding the memory limit is death" deserves to be seen firsthand. Stand up a container with a 32Mi memory limit then try to consume memory without bound — tail /dev/zero reads an endless source of zeros into a buffer, growing forever until it hits the ceiling:

apiVersion: v1
kind: Pod
metadata: {name: oom-demo}
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh","-c","echo consuming memory without bound, limit 32Mi; tail /dev/zero"]
    resources:
      limits: {memory: "32Mi"}

kubectl get pod oom-demo
kubectl get pod oom-demo -o jsonpath='phase={.status.phase}{"\n"}reason={.status.containerStatuses[0].state.terminated.reason} exitCode={.status.containerStatuses[0].state.terminated.exitCode}{"\n"}'

NAME       READY   STATUS      RESTARTS   AGE
oom-demo   0/1     OOMKilled   0          12s

phase=Failed
reason=OOMKilled exitCode=137

The status is OOMKilled, with reason=OOMKilled and exitCode=137. This is the same 137 (128 + 9, where 9 is SIGKILL) as the liveness case in Article 20, but this time the culprit is the kernel rather than the kubelet: when the processes in the container exceed the cgroup memory ceiling, the kernel's OOM killer steps in immediately. Because restartPolicy is Never, the pod ends up Failed; with Always the kubelet would restart it and, if the app keeps eating RAM, spin it into the CrashLoopBackOff of Article 18. This is why a memory limit must be set close to real usage: too low and the app gets killed needlessly, too high and the limit no longer protects the node.
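The 128 + 9 arithmetic follows the usual shell convention that an exit code above 128 means "terminated by signal (code - 128)". A small Python sketch of that decoding, with a hypothetical helper name and a hand-written signal table that covers only a few common signals:

```python
# Hypothetical helper; the signal table is a small hand-written subset.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signal_number = code - 128
        name = SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The exit code alone does not say who sent the signal. It is the reason field (OOMKilled here) that distinguishes a kernel OOM kill from a kubelet-initiated kill.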

Downward API: letting a container know about itself

There's one more practical question: how does a container know, from inside, what it's named, which node it runs on, what its IP is, and how much RAM it was granted? Calling the API server directly requires a token and permissions, and it couples the app tightly to Kubernetes. The Downward API solves exactly that. The docs:

"The downward API allows containers to consume information about themselves or the cluster without using the Kubernetes client or API server." The reason: "It is sometimes useful for a container to have information about itself, without being overly coupled to Kubernetes."

There are two ways to expose information into a container: environment variables and files in a downwardAPI volume. One pod uses both:

apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
  labels: {app: downward, tier: demo}
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh","-c","sleep 3600"]
    resources:
      requests: {cpu: "100m", memory: "64Mi"}
      limits:   {cpu: "250m", memory: "128Mi"}
    env:
    - {name: MY_POD_NAME,      valueFrom: {fieldRef: {fieldPath: metadata.name}}}
    - {name: MY_POD_NAMESPACE, valueFrom: {fieldRef: {fieldPath: metadata.namespace}}}
    - {name: MY_NODE_NAME,     valueFrom: {fieldRef: {fieldPath: spec.nodeName}}}
    - {name: MY_POD_IP,        valueFrom: {fieldRef: {fieldPath: status.podIP}}}
    - {name: MY_MEM_REQUEST,   valueFrom: {resourceFieldRef: {containerName: app, resource: requests.memory}}}
    - {name: MY_CPU_LIMIT,     valueFrom: {resourceFieldRef: {containerName: app, resource: limits.cpu}}}
    volumeMounts:
    - {name: podinfo, mountPath: /etc/podinfo}
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: labels
        fieldRef: {fieldPath: metadata.labels}

Two kinds of reference: fieldRef pulls pod fields (metadata.*, spec.nodeName, status.podIP...), while resourceFieldRef pulls a container's resource request/limit. Look inside the container:

kubectl exec downward-demo -- sh -c 'env | grep ^MY_ | sort'
kubectl exec downward-demo -- cat /etc/podinfo/labels

MY_CPU_LIMIT=1
MY_MEM_REQUEST=67108864
MY_NODE_NAME=worker-0
MY_POD_IP=10.200.0.15
MY_POD_NAME=downward-demo
MY_POD_NAMESPACE=default

app="downward"
tier="demo"

Cross-check against the control plane to be sure:

kubectl get pod downward-demo -o jsonpath='nodeName={.spec.nodeName} podIP={.status.podIP}{"\n"}'

nodeName=worker-0 podIP=10.200.0.15

MY_NODE_NAME and MY_POD_IP match reality exactly: the container knows it sits on worker-0, IP 10.200.0.15, without calling the API server once. Two details worth remembering in the resourceFieldRef part:

MY_MEM_REQUEST=67108864: memory is returned in bytes (64 × 1024 × 1024 = 67108864), not as the string 64Mi. An app using it must treat it as bytes.
MY_CPU_LIMIT=1: the limit is declared as 250m but exposed as 1. For CPU, resourceFieldRef uses a default divisor of 1 (one whole core) and rounds the result up, so any fraction of a core becomes the next whole number. To get the exact millicpu value, add divisor: 1m to the reference (the variable then reads 250). An easy trap if an app tunes its thread count from the CPU limit.
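Both details come from the same rule: the exposed value is the quantity divided by the divisor, rounded up. A minimal Python sketch of that arithmetic follows. The function name exposed_value is hypothetical, and the caller is assumed to have already converted both the quantity and the divisor to the same base unit (millicores for CPU, bytes for memory).

```python
def exposed_value(quantity: int, divisor: int) -> int:
    """Value a resourceFieldRef exposes: quantity / divisor, rounded up.

    Both arguments must use the same base unit
    (millicores for CPU, bytes for memory).
    """
    return -(-quantity // divisor)  # integer ceiling division


CPU_LIMIT_MILLICORES = 250         # limits.cpu: "250m"
MEM_REQUEST_BYTES = 64 * 1024**2   # requests.memory: "64Mi"

print(exposed_value(CPU_LIMIT_MILLICORES, 1000))  # default divisor "1" = 1000 millicores
print(exposed_value(CPU_LIMIT_MILLICORES, 1))     # divisor "1m" = 1 millicore
print(exposed_value(MEM_REQUEST_BYTES, 1))        # default divisor "1" = 1 byte
print(exposed_value(MEM_REQUEST_BYTES, 1024**2))  # divisor "1Mi"
```

The rounding matters more as limits grow: a limit of 1500m with the default divisor would be reported as 2, so an app sizing a thread pool from it would assume half a core more than it can actually use.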

The downwardAPI volume differs from env in one useful way: metadata.labels through a volume yields all labels, one key="value" pair per line (we see app="downward" and tier="demo"), and this file updates when the pod's labels change, whereas environment variables are fixed at container start. If you need to track changing labels/annotations, use a volume; if you need a static value, env is more compact.
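On the application side, consuming that file takes only a few lines. The sketch below assumes the /etc/podinfo/labels mount path from the manifest above and the key="value" line format we just saw. The function names are hypothetical, and escape sequences inside values (which can occur in annotations) are not decoded.

```python
def parse_podinfo(text: str) -> dict:
    """Parse key="value" lines as written by a downwardAPI volume."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("=")
        # Values are double-quoted; strip one pair of surrounding quotes.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        result[key] = value
    return result


def read_labels(path: str = "/etc/podinfo/labels") -> dict:
    """Re-read the file on every call so label changes are picked up."""
    with open(path, encoding="utf-8") as handle:
        return parse_podinfo(handle.read())


print(parse_podinfo('app="downward"\ntier="demo"'))
```

Reading the file each time, rather than caching the result at startup, is what lets the app benefit from the volume being updated when labels change.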

🧹 Cleanup

kubectl delete pod qos-guaranteed qos-burstable qos-besteffort downward-demo oom-demo --now

Everything we created lives in the cluster, so deleting the pods cleans up completely and leaves only the two CoreDNS pods. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, directory 22-resources-qos.

Wrap-up

requests and limits do two different jobs. The request guides the scheduler (it is the amount the pod asks to be guaranteed), while the limit is a kernel-enforced ceiling: exceed the CPU limit and the container is throttled (just slow), exceed the memory limit and it is OOM killed (dead, exitCode 137, as we saw with OOMKilled). Declare only a limit and the request automatically equals the limit. From these numbers Kubernetes derives the QoS class, which we read at status.qosClass: Guaranteed (request == limit for both CPU and memory in every container, dies last), Burstable (has at least one request or limit), BestEffort (declares nothing, dies first). That class decides who gets evicted first when the node runs out of resources. The Downward API lets a container read information about itself via env (fieldRef/resourceFieldRef) or a volume, without touching the API server; remember that memory comes out in bytes and CPU is rounded up to whole cores unless you set divisor.

Article 23 closes Part III with the other side of the pod lifecycle: disruption. Pods get disrupted in two ways: voluntary (draining a node for maintenance or upgrade) and involuntary (the node dies, runs out of RAM); and PodDisruptionBudget is how you tell the cluster "don't take down too many replicas of mine at once."