Resource Requests/Limits and Autoscaling (HPA)
Two things are tied together: to have Kubernetes autoscale with load, each pod must first declare how much resource it needs. This article goes from requests/limits (the foundation the scheduler and autoscaler work on) to the HorizontalPodAutoscaler — and we'll generate real load to watch the cluster add pods on its own.
requests and limits: two numbers, two roles
In a container spec, you declare resources through two often-confused concepts:
resources:
  requests:          # the GUARANTEED amount; the scheduler uses this to pick a node
    cpu: 200m        # 200 milli-CPU = 0.2 core
    memory: 64Mi
  limits:            # the CEILING; exceeding it gets blocked/killed
    cpu: 500m
    memory: 128Mi
- requests is the amount of resource guaranteed to the pod. The scheduler (Article 1) uses this number to decide which node has room: a pod is only placed on a node whose unreserved capacity covers its requests. This is what lets Kubernetes pack pods sensibly without overloading one node.
- limits is the ceiling. A container that tries to use more CPU than its limit gets throttled (slowed down, not killed). A container that exceeds its memory limit gets OOMKilled (killed for running out of memory). Limits protect the cluster from one buggy pod devouring its neighbors' resources.
Units: CPU is measured in cores: 1 = one core, 500m = half a core (m = milli). Memory is in bytes, with the Mi/Gi suffixes (mebibyte/gibibyte).
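As a small illustration of these units, the fragment below writes the same quantities from the earlier example in equivalent notations. It is only a sketch for comparing spellings: the byte counts are plain arithmetic (1 Mi = 1024 × 1024 bytes), and mixing styles like this is not something the article's demo does.

```yaml
# Illustrative fragment only: equivalent spellings of the same quantities.
resources:
  requests:
    cpu: "0.2"         # same as 200m
    memory: 67108864   # plain bytes; same as 64Mi (64 * 1024 * 1024)
  limits:
    cpu: 500m          # same as "0.5"
    memory: 128Mi      # 134217728 bytes; note that 128M would be 128000000 bytes
```

The power-of-two suffixes (Mi, Gi) and the decimal ones (M, G) differ by a few percent, so it is worth sticking to one family throughout a manifest.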
Setting requests correctly matters more than beginners think: too low and the node gets overloaded (pods fight, slow down); too high and you waste capacity (an empty node the scheduler thinks is full). And as you'll see, HPA needs requests to compute the usage percentage.
QoS: who gets "sacrificed" first when a node runs out
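How you set requests/limits determines the pod's QoS class, and when a node runs out of memory, Kubernetes uses this class to decide which pods to kill first:
- Guaranteed (every container sets requests = limits, for both CPU and memory): most protected, killed last.
- Burstable (at least one request or limit is set, but the pod doesn't meet the Guaranteed rule; typically requests < limits): in the middle.
- BestEffort (nothing declared): killed first when the node is short on resources.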
The lesson: important pods should set requests/limits explicitly so they aren't treated as "disposable" during a crisis.
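To make the three classes concrete, here is a minimal sketch of one pod per class. The pod names are hypothetical, and the pause image is used only as a harmless placeholder container; neither comes from the article's demo.

```yaml
# Hypothetical pods, one per QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 200m, memory: 64Mi}
      limits:   {cpu: 200m, memory: 64Mi}    # equal to requests -> Guaranteed
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 200m, memory: 64Mi}
      limits:   {cpu: 500m, memory: 128Mi}   # above requests -> Burstable
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9         # no resources block -> BestEffort
```

After applying these, the class Kubernetes assigned can be read back from each pod's status.qosClass field, for example with kubectl get pod qos-burstable -o jsonpath='{.status.qosClass}'.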
HorizontalPodAutoscaler: automatically raise/lower pod count
Scaling by hand with kubectl scale (Article 4) is a manual reaction. HPA automates it: it watches a metric (usually CPU) and changes the replicas count on its own to keep the metric around a target level. High load → add pods; load drops → remove pods.
HPA needs to know the current load, and the source of that data is metrics-server. A fresh minikube cluster doesn't have it running; enable it via an addon:
minikube addons enable metrics-server
Once metrics-server is up, kubectl top works:
kubectl top pods -l run=php-apache
NAME                          CPU(cores)   MEMORY(bytes)
php-apache-69b4854d9f-t44x4   11m          21Mi
Set up the HPA and generate real load
Deploy a demo app (image hpa-example — a PHP page that deliberately burns CPU on each request) with requests.cpu: 200m, then attach an HPA:
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=5
kubectl get hpa php-apache
NAME         REFERENCE               TARGETS       MINPODS   MAXPODS   REPLICAS
php-apache   Deployment/php-apache   cpu: 0%/50%   1         5         1
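The article doesn't show the Deployment manifest, so here is a minimal sketch of what this setup could look like written declaratively. Assumptions: the image path registry.k8s.io/hpa-example and the 500m CPU limit are borrowed from the upstream Kubernetes HPA walkthrough rather than from this article, while the run=php-apache label, the php-apache Service name, and requests.cpu: 200m come from the text. The last document is the autoscaling/v2 counterpart of the kubectl autoscale command.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example   # assumed image path
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m                        # HPA percentages are relative to this
          limits:
            cpu: 500m                        # assumed; not stated in the article
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache                           # the load generator calls http://php-apache
spec:
  selector:
    run: php-apache
  ports:
  - port: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50               # same as --cpu-percent=50
```

Keeping the HPA in a manifest instead of creating it imperatively means the autoscaling settings can be versioned alongside the Deployment.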
The HPA will try to keep average CPU utilization around 50% of requests.cpu (about 100m per pod here), using between 1 and 5 replicas. The app is idle, so CPU is at 0% and the HPA holds 1 pod. Now pour on load with one pod that hits the service nonstop:
kubectl run load-generator --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://php-apache; done"
Watch the HPA every 15 seconds:
[15s] cpu: 0%/50% REPLICAS 1
[45s] cpu: 94%/50% REPLICAS 1 ← load rises, exceeds target
[60s] cpu: 94%/50% REPLICAS 2 ← HPA adds a pod
[105s] cpu: 154%/50% REPLICAS 2 → 4 ← still high, add more
kubectl get pods -l run=php-apache
NAME                          READY   STATUS    AGE
php-apache-69b4854d9f-96vzq   1/1     Running   9s      ← new pod
php-apache-69b4854d9f-j58rj   1/1     Running   69s     ← new pod
php-apache-69b4854d9f-t44x4   1/1     Running   4m32s   ← original pod
php-apache-69b4854d9f-tfxwh   1/1     Running   9s      ← new pod
Firsthand: CPU spikes to 94% then 154% (well past the 50% target), and HPA responds by gradually raising replicas 1 → 2 → 4 to share the load. No one typed a command; HPA's control loop did it. This is requests paying off: "154%" means the pods are using, on average, about 1.5× their requests.cpu (roughly 308m against 200m), and from that ratio HPA works out how many pods it needs to pull the average back to 50%.
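The arithmetic behind those percentages can be written out using the scaling rule documented for the HPA controller. This is a simplified sketch: it ignores the controller's tolerance band, its handling of pods that are not yet ready or have no metrics, and its scale-up rate limits, so the replica count observed at a given moment can lag or differ from the raw figure.

```latex
% Utilization is measured relative to the CPU request
\[
\text{utilization} = \frac{\text{average CPU usage per pod}}{\texttt{requests.cpu}} \times 100\%,
\qquad
\frac{308\text{m}}{200\text{m}} \times 100\% = 154\%
\]

% Raw replica recommendation
\[
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times
\frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil
\]

% First step of the demo: 1 replica at 94%
\[
\left\lceil 1 \times \frac{94}{50} \right\rceil = \lceil 1.88 \rceil = 2
\]

% Second step: 2 replicas at 154%; the raw value exceeds maxReplicas = 5
\[
\left\lceil 2 \times \frac{154}{50} \right\rceil = \lceil 6.16 \rceil = 7
\;\rightarrow\; \min(7,\ 5) = 5
\]
```

The first step matches the demo exactly (1 → 2). For the second, the raw formula asks for more than the 4 replicas seen in the snapshot, which is consistent with the controller acting cautiously while freshly created pods are still starting up.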
Scaling down and a caveat
When you stop the load, HPA also reduces pods, but cautiously and more slowly: by default it applies a stabilization window of 5 minutes (300 seconds) before scaling down, to avoid "thrashing" when load fluctuates. Scale up fast to keep serving, scale down slow to stay stable: a sensible design.
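If the default window doesn't suit a workload, it can be tuned through the behavior field of an autoscaling/v2 HPA. The fragment below is a sketch that only restates the defaults, so applying it would change nothing; it is meant to show where the knob lives.

```yaml
# Fragment of an autoscaling/v2 HorizontalPodAutoscaler spec (not a complete manifest).
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the default; lower it to scale down sooner
    scaleUp:
      stabilizationWindowSeconds: 0     # the default; scale up without waiting
```

A shorter scale-down window frees capacity sooner but makes the replica count more sensitive to brief dips in load.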
Beyond HPA (scaling the number of pods), Kubernetes also has VPA (changing a pod's requests/limits) and the Cluster Autoscaler (adding/removing nodes). These three autoscaling layers can be combined in production, with the caveat that VPA and HPA should not both act on the same CPU or memory metric for the same workload. This fundamentals series stops at HPA, the most common and easiest to grasp.
Wrap-up
Every container should declare requests (the guaranteed amount — the scheduler uses it to place pods, HPA uses it to compute %) and limits (the ceiling — exceed CPU and you're throttled, exceed memory and you're OOMKilled). How you set these two numbers determines the QoS class (Guaranteed > Burstable > BestEffort) — that is, the order of sacrifice when a node runs out of resources. The HorizontalPodAutoscaler automatically changes the replica count to keep a metric (usually CPU) around a target — it needs metrics-server. The demo showed HPA raising 1→4 pods when CPU spiked to 154%, and scaling down slowly by design.
So far we've used Deployment for stateless apps. Article 12 meets the other workload types — StatefulSet (stateful apps), DaemonSet (one pod per node), Job/CronJob (run-then-done) — and when to use which.