Resource Requests/Limits and Autoscaling (HPA)

Two things are tied together here: before Kubernetes can autoscale with load, each pod must declare how much CPU and memory it needs. This article goes from requests/limits (the foundation that both the scheduler and the autoscaler work from) to the HorizontalPodAutoscaler, and we'll generate real load to watch the cluster add pods on its own.

requests and limits: two numbers, two roles

In a container spec, you declare resources through two often-confused concepts:

resources:
  requests:           # the GUARANTEED amount — scheduler uses this to pick a node
    cpu: 200m         # 200 milli-CPU = 0.2 core
    memory: 64Mi
  limits:             # the CEILING: CPU above it is throttled, memory above it gets the container killed
    cpu: 500m
    memory: 128Mi

requests is the amount of resource guaranteed to the pod. The scheduler (Article 1) uses this number, not actual usage, to decide which node has room: a pod is only placed on a node whose allocatable capacity, minus the requests of the pods already there, still covers the new pod's requests. This is what lets Kubernetes pack pods sensibly without overloading one node.
limits is the ceiling. Try to use more CPU than the limit → the container gets throttled (slowed down, not killed). Exceed the memory limit → the container gets OOMKilled (killed for running out of memory). Limits protect the cluster from one buggy pod devouring its neighbors' resources.

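Units: CPU is measured in cores: 1 = one core, 500m = half a core (m = milli). Memory is measured in bytes, usually written with the power-of-two suffixes Mi/Gi (mebibyte/gibibyte; 1Mi = 1,048,576 bytes), which are not the same as the decimal M/G.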
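
To make the units concrete, the fragment below writes the values from the spec above in equivalent notations. It is only an illustration of the quantity syntax, not a recommended style; the suffixed forms are usually easier to read.

```yaml
resources:
  requests:
    cpu: "0.2"          # same as 200m (quoted so it stays a string)
    memory: 67108864    # same as 64Mi: 64 * 1024 * 1024 bytes
  limits:
    cpu: 500m           # same as "0.5"
    memory: 128Mi       # 128 * 1024 * 1024 = 134217728 bytes
```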

Setting requests correctly matters more than beginners think: set them too low and the node gets overcommitted (pods fight over CPU and slow down); set them too high and you waste capacity (a nearly idle node that the scheduler treats as full). And as you'll see, HPA needs requests to compute the usage percentage.

QoS: who gets "sacrificed" first when a node runs out

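How you set requests/limits determines the pod's QoS class, and when a node runs short of memory, Kubernetes uses that class to decide which pods to sacrifice first:

Guaranteed (every container sets both CPU and memory, with requests = limits): most protected, killed last.
Burstable (at least one request or limit is set, but the pod doesn't qualify as Guaranteed, e.g. requests < limits): in the middle.
BestEffort (no requests or limits on any container): killed first when the node is short on resources.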

The lesson: important pods should set requests/limits explicitly so they aren't treated as "disposable" during a crisis.
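
As a rough illustration, the three classes could look like this in manifests. The pod names and the nginx image are placeholders chosen for this sketch and are not part of the demo that follows; only the resources stanza matters.

```yaml
# Guaranteed: every container sets CPU and memory, and requests equal limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27         # placeholder image
    resources:
      requests: {cpu: 200m, memory: 64Mi}
      limits:   {cpu: 200m, memory: 64Mi}
---
# Burstable: something is declared, but requests are below limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable         # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests: {cpu: 200m, memory: 64Mi}
      limits:   {cpu: 500m, memory: 128Mi}
---
# BestEffort: no resources stanza at all.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort        # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27
```

Once such a pod is running, kubectl get pod qos-burstable -o jsonpath='{.status.qosClass}' should print the class Kubernetes assigned to it.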

HorizontalPodAutoscaler: automatically raise/lower pod count

Scaling by hand with kubectl scale (Article 4) is a manual reaction. HPA automates it: it watches a metric (usually CPU) and changes the replicas count on its own to keep the metric around a target level. High load → add pods; load drops → remove pods.

HPA needs to know current load, and the source of that data is metrics-server. Many clusters don't ship with it, and a fresh minikube is one of them; there, enable it via an addon:

minikube addons enable metrics-server

Once metrics-server is up, kubectl top works:

kubectl top pods -l run=php-apache

NAME                          CPU(cores)   MEMORY(bytes)
php-apache-69b4854d9f-t44x4   11m          21Mi

Set up the HPA and generate real load

Deploy a demo app (image hpa-example — a PHP page that deliberately burns CPU on each request) with requests.cpu: 200m, then attach an HPA:

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=5
kubectl get hpa php-apache

NAME         REFERENCE               TARGETS       MINPODS   MAXPODS   REPLICAS
php-apache   Deployment/php-apache   cpu: 0%/50%   1         5         1
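
For reference, the whole setup can also be written declaratively and applied with kubectl apply -f. The manifest below is a sketch modeled on the upstream Kubernetes HPA walkthrough: the full image path registry.k8s.io/hpa-example, the 500m CPU limit, and port 80 are assumptions not stated in this article, while the run=php-apache label, the php-apache Service name, requests.cpu: 200m, and the 50% / 1 to 5 replica settings match the commands used here.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example   # assumed full path of the hpa-example image
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m                        # HPA percentages are relative to this
          limits:
            cpu: 500m                        # assumed limit
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache                           # the load generator calls http://php-apache
spec:
  selector:
    run: php-apache
  ports:
  - port: 80
---
# Declarative equivalent of: kubectl autoscale ... --cpu-percent=50 --min=1 --max=5
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```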

HPA will keep average CPU around 50% of requests.cpu, using between 1 and 5 replicas. The app is idle, so CPU is at 0% and HPA holds 1 pod. Now pour on load with one pod that hits the service nonstop:

kubectl run load-generator --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://php-apache; done"

Watch the HPA every 15 seconds:

[15s]  cpu: 0%/50%     REPLICAS 1
[45s]  cpu: 94%/50%    REPLICAS 1     ← load rises, exceeds target
[60s]  cpu: 94%/50%    REPLICAS 2     ← HPA adds a pod
[105s] cpu: 154%/50%   REPLICAS 2 → 4 ← still high, add more

kubectl get pods -l run=php-apache

NAME                          READY   STATUS    AGE
php-apache-69b4854d9f-96vzq   1/1     Running   9s     ← new pod
php-apache-69b4854d9f-j58rj   1/1     Running   69s    ← new pod
php-apache-69b4854d9f-t44x4   1/1     Running   4m32s  ← original pod
php-apache-69b4854d9f-tfxwh   1/1     Running   9s     ← new pod

You saw it firsthand: CPU spikes to 94% and then 154% (well past the 50% target), and HPA responds by raising replicas step by step, 1 → 2 → 4, to share the load. No one typed a command; HPA's control loop did it. This is requests paying off: "154%" means the pods are using on average about 1.54× their requests.cpu (roughly 308m against 200m), and from that ratio HPA works out how many pods it needs to pull the average back to 50%.
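
The arithmetic behind that decision can be sketched with the core formula documented for the HPA controller, plugged with the numbers from the run above. Treat it as a simplification: the real controller also applies a tolerance, accounts for pods that are not yet ready or have no metrics, and rate-limits scale-up, so the intermediate counts it picks (such as the 4 observed here) need not match the raw result.

```latex
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
  \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
\]
\[
\text{1 replica at } 94\%: \quad
  \left\lceil 1 \times \tfrac{94}{50} \right\rceil = \lceil 1.88 \rceil = 2
\]
\[
\text{2 replicas at } 154\%: \quad
  \left\lceil 2 \times \tfrac{154}{50} \right\rceil = \lceil 6.16 \rceil = 7
  \;\Rightarrow\; \min(7,\ \text{maxReplicas} = 5) = 5
\]
```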

Scaling down and a caveat

When you stop the load, HPA also removes pods, but more cautiously: by default it applies a scale-down stabilization window of 300 seconds (5 minutes), acting on the highest replica recommendation seen during that window, to avoid "thrashing" when load fluctuates. Scale up fast to keep serving, scale down slowly to stay stable: a sensible design.
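
That window can be tuned per HPA. The fragment below is a sketch of the relevant fields in an autoscaling/v2 HorizontalPodAutoscaler; the values shown are the documented defaults, and the shorter value mentioned in the comment is only an example, not a recommendation.

```yaml
# Fragment: sits under `spec:` of an autoscaling/v2 HorizontalPodAutoscaler.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # default; e.g. 60 would scale down sooner, at the risk of thrashing
  scaleUp:
    stabilizationWindowSeconds: 0     # default: react to rising load immediately
```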

Beyond HPA (scaling the number of pods), the Kubernetes ecosystem also has VPA, the VerticalPodAutoscaler (adjusting a pod's requests/limits), and the Cluster Autoscaler (adding/removing nodes). These three autoscaling layers are often combined in production. This fundamentals series stops at HPA, the most common and the easiest to grasp.

Wrap-up

Every container should declare requests (the guaranteed amount — the scheduler uses it to place pods, HPA uses it to compute %) and limits (the ceiling — exceed CPU and you're throttled, exceed memory and you're OOMKilled). How you set these two numbers determines the QoS class (Guaranteed > Burstable > BestEffort) — that is, the order of sacrifice when a node runs out of resources. The HorizontalPodAutoscaler automatically changes the replica count to keep a metric (usually CPU) around a target — it needs metrics-server. The demo showed HPA raising 1→4 pods when CPU spiked to 154%, and scaling down slowly by design.

So far we've used Deployment for stateless apps. Article 12 introduces the other workload types, StatefulSet (stateful apps), DaemonSet (one pod per node), and Job/CronJob (run-then-done), and when to use which.