Metrics Server and HorizontalPodAutoscaler

Seven parts in, our cluster runs but is "blind": kubectl top reports "Metrics API not available" because nothing measures how much CPU or RAM a pod uses. By Part VIII (autoscaling) that becomes a blocker: the HorizontalPodAutoscaler grows and shrinks the replica count by load, so it needs load numbers. This article therefore has two halves: install Metrics Server, the first add-on for our hand-built cluster, and use its numbers to stand up an HPA. The first half hits a common trap of hand-built clusters, and fixing it is itself a good lesson on the aggregation layer.

Why the cluster has no metrics

kubectl top gets its numbers from the metrics.k8s.io API, and the docs state plainly: "The metrics.k8s.io API is usually provided by an add-on named Metrics Server, which needs to be launched separately." Managed clusters usually ship it or install it alongside; our KTHW cluster has none because we never installed it. Download the official manifest (pinned to v0.8.1) and add one flag:

curl -fsSL https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.8.1/components.yaml \
  -o metrics-server.yaml
# our kubelet's serving cert (Article 11) is signed by the internal CA, so metrics-server must skip verification:
#   add "- --kubelet-insecure-tls" to the Deployment's args
kubectl apply -f metrics-server.yaml

--kubelet-insecure-tls is needed because metrics-server connects to the kubelet over HTTPS and wants to verify the kubelet's serving cert; that cert (Article 11) is signed by the internal CA and its SAN doesn't match what metrics-server expects, so we tell it to skip verification (acceptable in a self-managed cluster). The manifest creates a Deployment, Service, RBAC, and an APIService v1beta1.metrics.k8s.io — registering metrics-server into the API server's aggregation layer.
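The manual edit of the Deployment's args could be scripted roughly as follows. This is a minimal sketch, not part of the official install steps: the function name add_kubelet_insecure_tls is hypothetical, and it assumes the manifest's args list contains a "- --cert-dir=" entry to anchor on. If your copy of the manifest is laid out differently, edit it by hand instead.

```shell
# add_kubelet_insecure_tls FILE
# Hypothetical helper: inserts "- --kubelet-insecure-tls" right after the
# first "- --cert-dir=..." argument, reusing that line's indentation.
# Assumes the manifest lists its args one per line in YAML block style.
add_kubelet_insecure_tls() {
  manifest="$1"
  # Idempotent: leave the file alone if the flag is already there.
  if grep -q -- '--kubelet-insecure-tls' "$manifest"; then
    return 0
  fi
  awk '
    !inserted && /^[ \t]*- --cert-dir=/ {
      print
      indent = $0
      sub(/-.*/, "", indent)        # keep only the leading whitespace
      print indent "- --kubelet-insecure-tls"
      inserted = 1
      next
    }
    { print }
  ' "$manifest" > "$manifest.tmp" && mv "$manifest.tmp" "$manifest"
}

# Usage, before running kubectl apply:
#   add_kubelet_insecure_tls metrics-server.yaml
```

Running it twice is harmless because of the grep guard, which makes it safe to keep in a setup script.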

The KTHW trap: the control plane can't reach the pod

The metrics-server pod comes up Running, but the APIService fails:

kubectl get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'

failing or missing response from https://10.32.0.90:443/...: context deadline exceeded

10.32.0.90 is the ClusterIP of the metrics-server Service. The API server (running on controller-0/1/2) tries to call that ClusterIP but can't reach it. The reason lies in the cluster's architecture: in Part I, we installed kube-proxy + CNI only on the workers (Articles 12–14), not on the control plane. And a ClusterIP is a virtual IP translated (DNAT) by kube-proxy — the control plane has no kube-proxy, so it doesn't know what 10.32.0.90 is. The API server calls into the void → timeout.

Luckily we do have a path to the pod IP. The VPC route from Article 14 (10.200.0.0/24→worker-0...) is at the subnet level, which the control plane also uses — verify it:

ssh controller-0 'ping -c2 10.200.0.55'      # the metrics-server pod IP
# 0% packet loss

The control plane can ping the pod IP (via the VPC route); it just can't reach the ClusterIP. The fix: tell the aggregator to dial the pod IP directly instead of the ClusterIP, with the flag --enable-aggregator-routing=true on the API server:

# add to the ExecStart of kube-apiserver on ALL 3 controllers, then restart:
#   --enable-aggregator-routing=true
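One way to script that edit on each controller is sketched below. It assumes the KTHW-style unit file, where ExecStart lists one flag per line with trailing backslashes; the function name enable_aggregator_routing and the unit path in the usage comment are assumptions, so adapt them if your unit file differs.

```shell
# enable_aggregator_routing UNIT_FILE
# Hypothetical helper: adds --enable-aggregator-routing=true as the first
# flag after the "ExecStart=...kube-apiserver \" line of a systemd unit.
enable_aggregator_routing() {
  unit="$1"
  # Idempotent: skip if the flag is already configured.
  if grep -q -- '--enable-aggregator-routing' "$unit"; then
    return 0
  fi
  awk '
    { print }
    !inserted && /^ExecStart=.*kube-apiserver[ \t]*\\$/ {
      print "  --enable-aggregator-routing=true \\"
      inserted = 1
    }
  ' "$unit" > "$unit.tmp" && mv "$unit.tmp" "$unit"
}

# Usage, as root on each of the 3 controllers (path is an assumption):
#   enable_aggregator_routing /etc/systemd/system/kube-apiserver.service
#   systemctl daemon-reload && systemctl restart kube-apiserver
```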

This flag makes the API server resolve the target Service's endpoints (pod IPs) and call them directly — bypassing the ClusterIP. Because the control plane can reach the pod IP via the VPC route, the aggregation layer works. This is the standard fix for any cluster where the API server isn't on the pod network (exactly the KTHW style). After the 3 API servers restart:

kubectl get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
# True
kubectl top nodes

NAME       CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
worker-0   21m          1%       469Mi           12%
worker-1   18m          0%       440Mi           11%

kubectl top returns real numbers — the cluster can now "see" load. (These CPU/memory figures are the same node.stats that Article 38's eviction signals use.)
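kubectl top is only a client of the aggregated API, so the same numbers can be read straight from metrics.k8s.io. The sketch below builds the v1beta1 request paths; metrics_path is a hypothetical helper name, and the kubectl calls are left as comments because they need a live cluster.

```shell
# metrics_path RESOURCE [NAMESPACE]
# Hypothetical helper: prints the metrics.k8s.io/v1beta1 path for
# cluster-scoped node metrics or namespaced pod metrics.
metrics_path() {
  resource=$1
  namespace=${2:-}
  base=/apis/metrics.k8s.io/v1beta1
  if [ -n "$namespace" ]; then
    echo "$base/namespaces/$namespace/$resource"
  else
    echo "$base/$resource"
  fi
}

# Against a live cluster:
#   kubectl get --raw "$(metrics_path nodes)"
#   kubectl get --raw "$(metrics_path pods kube-system)"
```

If the raw call fails while the metrics-server pod is Running, the problem is in the aggregation path rather than in metrics-server itself.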

HorizontalPodAutoscaler

With metrics in hand, stand up an HPA. From the docs: "a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment ...), with the aim of automatically scaling capacity to match demand." That is scaling horizontally (adding pods), unlike scaling vertically (adding resources to existing pods, which is the VPA of Article 40). First deploy an app that declares requests.cpu, because the HPA computes utilization relative to the request. The HPA itself is then created with kubectl autoscale.
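A minimal sketch of such a Deployment follows. It is not the manifest from the series repository: the image (busybox:1.36), the container name, the sleep duration, and the file name cpu-app.yaml are all assumptions chosen for illustration. Only the name cpu-app, the single replica, the app=cpu-app label, and the 100m CPU request come from this article.

```shell
# Write a minimal Deployment manifest for cpu-app.
# Assumed values: image, container name, sleep duration, file name.
cat > cpu-app.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cpu-app
  template:
    metadata:
      labels:
        app: cpu-app
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: 100m        # the HPA measures utilization against this
EOF

# Then, against the cluster:
#   kubectl apply -f cpu-app.yaml
```

Without the requests.cpu field the HPA has no baseline to compute a percentage from, so a utilization target cannot be evaluated.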

# Deployment cpu-app: 1 replica, container requests cpu 100m, runs sleep
kubectl autoscale deployment cpu-app --cpu-percent=50 --min=1 --max=4
kubectl get hpa cpu-app

NAME      REFERENCE            TARGETS       MINPODS   MAXPODS   REPLICAS
cpu-app   Deployment/cpu-app   cpu: 0%/50%   1         4         1

0%/50% — CPU is currently 0%, target is 50% (of the request 100m, i.e. 50m absolute). Idle, so it holds at 1 replica. The HPA formula, per the docs:

desiredReplicas = ceil[ currentReplicas × (currentMetricValue / desiredMetricValue) ]
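The formula can be tried out with plain integer arithmetic. The sketch below is an illustration rather than the controller's actual code: hpa_desired is a hypothetical function, metric values are whole percentages, and the clamp to the min and max replica counts is applied after the ceiling.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC MIN MAX
# Integer sketch of desiredReplicas = ceil(currentReplicas * current / target),
# clamped to the HPA's [MIN, MAX] replica range.
hpa_desired() {
  current_replicas=$1
  current=$2
  target=$3
  min=$4
  max=$5
  # Ceiling division without floating point: ceil(a/b) = (a + b - 1) / b
  desired=$(( (current_replicas * current + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

# 1 replica at 1000% of its request, target 50%, min=1, max=4:
hpa_desired 1 1000 50 1 4   # prints 4 (raw result 20, capped at max)
```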

Now burn CPU inside one pod (yes > /dev/null eats a whole core) and watch:

POD=$(kubectl get pods -l app=cpu-app -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -- sh -c 'yes > /dev/null &'
# watch:
kubectl get hpa cpu-app -w

t=15s:  REPLICAS=4
...
currentCPU=249% target=50% current=4 desired=4

The CPU-burning pod uses about 1000m, i.e. 1000% of its 100m request. The HPA computes desired = ceil(1 × 1000/50) = 20, but that is capped at max=4, so it scales to 4. With 4 replicas (1 burning, 3 idle), average usage is (1000m + 0 + 0 + 0)/4 = 250m, which is about 250% of the 100m request; the 249% reported reflects measured usage just under a full core. That is still above 50%, so it holds at max 4. The event confirms it:

kubectl get events --field-selector involvedObject.name=cpu-app,reason=SuccessfulRescale

Normal  SuccessfulRescale  New size: 4; reason: cpu resource utilization (percentage of request) above target
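The gap between the ideal 250% and the reported 249% can be reproduced with a small calculation. This sketch assumes utilization is total usage divided by total requests across the pods, truncated to a whole percent; avg_utilization is a hypothetical helper, and 997m is a made-up measurement standing in for "just under a full core".

```shell
# avg_utilization REQUEST_M USAGE_M...
# Hypothetical helper: utilization (%) of a set of pods that all request
# REQUEST_M millicores, given each pod's usage in millicores.
# Needs at least one usage value, otherwise it divides by zero.
avg_utilization() {
  request_m=$1
  shift
  total_usage=0
  total_request=0
  for usage_m in "$@"; do
    total_usage=$(( total_usage + usage_m ))
    total_request=$(( total_request + request_m ))
  done
  # Integer division truncates, as a whole-percent readout would.
  echo $(( total_usage * 100 / total_request ))
}

avg_utilization 100 1000 0 0 0   # exactly one full core: prints 250
avg_utilization 100 997 0 0 0    # assumed measurement of 997m: prints 249
```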

The HPA controller runs inside kube-controller-manager (Article 8), checking periodically — "the default interval is 15 seconds". It skips if the ratio is near 1.0 (default tolerance 0.1), avoiding jitter around the threshold.
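The tolerance check amounts to asking whether current/target lies within 0.1 of 1.0. A minimal integer sketch, with within_tolerance as a hypothetical function name and the 0.1 default hard-coded:

```shell
# within_tolerance CURRENT TARGET
# Succeeds (exit 0) when |current/target - 1| <= 0.1, i.e. when the HPA
# would skip scaling. Values are whole percentages.
within_tolerance() {
  current=$1
  target=$2
  diff=$(( current - target ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  # |current/target - 1| <= 0.1  is equivalent to  |current - target| * 10 <= target
  [ $(( diff * 10 )) -le "$target" ]
}

# With a 50% target, readings from 45% to 55% leave the replica count alone:
if within_tolerance 52 50; then echo "52%: no scaling"; fi
if ! within_tolerance 249 50; then echo "249%: scale"; fi
```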

Slow scale-down by design

When you stop the burner, CPU returns to 0 — but the HPA does not drop pods immediately. There's a default stabilization window of 5 minutes for scale-down: the HPA waits for low load to settle before shrinking, avoiding "flapping" the pod count up and down as load fluctuates. (Scale-up is fast, with no default wait window — because too few pods is more dangerous than too many.) This is why in practice you see pods grow fast when load rises but shrink gradually after it drops.
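Both windows can be written out explicitly through the behavior field of an autoscaling/v2 HPA. The sketch below restates the same HPA as a manifest, with the defaults described above spelled out (300 seconds for scale-down, 0 for scale-up); the file name cpu-app-hpa.yaml is an assumption, and this is not the manifest from the series repository.

```shell
# Write an HPA manifest equivalent to the kubectl autoscale call,
# with the default stabilization windows made explicit.
cat > cpu-app-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cpu-app
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before shrinking
    scaleUp:
      stabilizationWindowSeconds: 0     # grow immediately
EOF

# Then, against the cluster:
#   kubectl apply -f cpu-app-hpa.yaml
```

Lowering the scaleDown value shortens the wait after load drops, at the cost of more churn when load fluctuates.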

🧹 Cleanup

kubectl delete hpa cpu-app
kubectl delete deployment cpu-app --now

Delete the HPA + Deployment (which takes the pods and burner with it). Keep Metrics Server — it's shared infrastructure for HPA, VPA (Article 40), and kubectl top going forward; also keep the --enable-aggregator-routing flag on the API server (needed for every aggregated API from here on). The cluster still has CoreDNS + metrics-server. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 39-metrics-hpa.

Wrap-up

A hand-built cluster has no metrics until we install Metrics Server (the first add-on) — and we hit the KTHW trap: the API server on the control plane can't reach the ClusterIP of metrics-server because the control plane runs no kube-proxy/CNI. Fix it with --enable-aggregator-routing=true so the aggregator dials the pod IP directly (which the control plane can reach via Article 14's VPC route). After that, kubectl top works. The HorizontalPodAutoscaler uses those numbers to scale horizontally: it computes desiredReplicas = ceil(currentReplicas × current/target), as a % of request. We burned CPU in one pod (1000% of request) → HPA scaled 1→4 (capped at max), event SuccessfulRescale; the controller checks every 15s, has a 0.1 tolerance, and scale-down waits a 5-minute stabilization. This is the "add pods" response — the opposite of Part VII's "kill pods".

Article 40 moves to scaling vertically: the VerticalPodAutoscaler (recommends/re-sets request-limit for existing pods instead of adding pods) and the node-level resource managers (CPU Manager, Memory Manager, Topology Manager — assigning exact CPU/NUMA to latency-sensitive workloads).