Vertical Pod Autoscaler and resource managers

Article 39 scaled horizontally — add pods as load rises. The other axis is scaling vertically: instead of adding pods, dial in the exact amount of resources per pod. The docs distinguish them: "you can either increase or decrease the number of replicas ... or adjust the resources available to the replicas in-place. The first approach is referred to as horizontal scaling, while the second is referred to as vertical scaling." This article has two parts that fit vertical scaling: the VerticalPodAutoscaler (right-size a pod's request/limit from real usage) and — on the node side — the kubelet's resource managers, of which we'll concretely test the CPU Manager static policy (pinning exclusive CPU cores to a Guaranteed pod).

VPA is an add-on, not core

Like Metrics Server, VPA is not built in. The docs: "Unlike the HPA, the VPA doesn't come with Kubernetes by default, but is an add-on that you or a cluster administrator may need to deploy ... You will need to have the Metrics Server installed." (Luckily we already have Metrics Server from Article 39.) VPA has three components — the recommender (observes usage, produces recommendations), the updater (evicts pods to apply recommendations), and the admission controller (rewrites requests at pod creation). We only need the recommender for "recommend only" mode:

# CRD + RBAC + recommender (vpa-recommender:1.6.0), from the autoscaler repo
kubectl apply -f vpa-v1-crd-gen.yaml      # CRD VerticalPodAutoscaler
kubectl apply -f vpa-rbac.yaml
kubectl apply -f recommender-deployment.yaml

The CRD verticalpodautoscalers.autoscaling.k8s.io lets us create VPA objects. Build a Deployment that deliberately sets a low request (50m CPU, 16Mi) but whose container consumes a lot (a dd loop burning CPU), along with a VPA in updateMode: "Off":

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: {name: vpa-demo}
spec:
  targetRef: {apiVersion: apps/v1, kind: Deployment, name: vpa-demo}
  updatePolicy: {updateMode: "Off"}        # recommend only, does NOT auto-modify
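
The Deployment that this VPA targets is described above but not listed. A minimal sketch of what it could look like follows. Only the name vpa-demo, the 50m/16Mi requests, and the idea of a dd loop come from the text; the image (busybox), the labels, the container name, and the exact dd arguments are assumptions for illustration. The snippet only writes the manifest to a file.

```shell
# Write a hypothetical manifest for the vpa-demo Deployment.
# Assumed: image, labels, container name, dd arguments.
# From the article: the name vpa-demo and the deliberately low 50m/16Mi requests.
cat > vpa-demo-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-demo
spec:
  replicas: 1
  selector:
    matchLabels: {app: vpa-demo}
  template:
    metadata:
      labels: {app: vpa-demo}
    spec:
      containers:
      - name: burn
        image: busybox:1.36
        # Endless dd loop: burns CPU far beyond the 50m request.
        command: ["sh", "-c", "while true; do dd if=/dev/zero of=/dev/null bs=1M count=1000; done"]
        resources:
          requests: {cpu: 50m, memory: 16Mi}
EOF
echo "wrote vpa-demo-deployment.yaml"
```

Applying it with kubectl apply -f vpa-demo-deployment.yaml before creating the VPA object gives the recommender a running workload to observe.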

updateMode: "Off" is the safest mode — VPA only writes recommendations into status, never touching the pod (other modes: Initial sets requests at creation, Auto/Recreate evict pods to apply). Wait for the recommender to gather enough metrics:

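kubectl get vpa vpa-demo -o jsonpath='target={.status.recommendation.containerRecommendations[0].target}{"\n"}lowerBound={.status.recommendation.containerRecommendations[0].lowerBound}{"\n"}upperBound={.status.recommendation.containerRecommendations[0].upperBound}{"\n"}'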

target={"cpu":"182m","memory":"250Mi"}
lowerBound={"cpu":"25m","memory":"250Mi"}
upperBound={"cpu":"1209782m", ...}

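VPA observes the real workload and recommends target: cpu 182m, memory 250Mi, well above the current request of 50m/16Mi. The CPU figure reflects what the dd loop actually burns (about 182m). The memory figure deserves more caution: 250Mi matches the recommender's default minimum memory recommendation, so it most likely reflects that floor rather than measured usage (the 25m CPU lowerBound likewise matches the default CPU minimum). This is the value of VPA: instead of guessing a request (too much wastes the Allocatable of Article 32, too little risks OOM kills or throttling), it measures and proposes a number, along with lowerBound/upperBound, a range around the target that is typically very wide at first and narrows as the recommender gathers more history. Verify it does not modify the pod in Off mode: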

kubectl get deployment vpa-demo -o jsonpath='{.spec.template.spec.containers[0].resources.requests}'
# {"cpu":"50m","memory":"16Mi"}     <- unchanged

The request stays 50m/16Mi — the running pod isn't disturbed. In practice, the ops team reads this recommendation and edits the manifest themselves (safe, GitOps-friendly), or turns on Auto to let VPA apply it (but Auto evicts and recreates the pod to change the request — causing disruption; note from the docs: as of v1.36 VPA does not yet support in-place resize, even though manual in-place resize itself exists). HPA and VPA should not both manage CPU/memory on the same workload (they fight) — VPA suits workloads hard to scale horizontally (stateful, single-instance).
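
When reading a recommendation by hand, it helps to put the request and the VPA numbers on the same scale. The sketch below is one way to do that in plain shell, using the CPU values from this article. The function names are hypothetical helpers, not part of kubectl or VPA, and only the "m" suffix and plain core counts are handled.

```shell
# Convert a Kubernetes CPU quantity ("182m", "1", "0.5") to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Current request as an integer percentage of the VPA target.
percent_of_target() {
  req=$(to_millicores "$1"); tgt=$(to_millicores "$2")
  echo $(( req * 100 / tgt ))
}

# Classify a request against the VPA lowerBound/upperBound.
check_request() {
  req=$(to_millicores "$1"); lo=$(to_millicores "$2"); hi=$(to_millicores "$3")
  if [ "$req" -lt "$lo" ]; then echo "under-provisioned"
  elif [ "$req" -gt "$hi" ]; then echo "over-provisioned"
  else echo "within-range"
  fi
}

percent_of_target 50m 182m        # prints 27
check_request 50m 25m 1209782m    # prints within-range
```

With the values above, the 50m request is only 27% of the target, yet it still falls inside the bounds, because the bounds reported here are extremely wide. The target is the number to act on; the bounds become more useful once they have narrowed.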

Resource managers: CPU Manager static policy

VPA adjusts the request number. One layer deeper is the question of how the kubelet places a pod onto physical CPUs. By default (policy none), the docs say: "the kubelet uses CFS quota to enforce pod CPU limits ... the workload can move to different CPU cores." The pod hops between cores at will, incurring context-switch and cache-miss overhead. For latency-sensitive workloads (real-time processing, for example), you want to pin the pod to fixed cores. That is what the CPU Manager static policy does.
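
To make the default behaviour concrete: under the none policy a CPU limit becomes a CFS quota, that is, a budget of CPU time per scheduling period. The arithmetic can be sketched as below, assuming the default period of 100 ms (100000 µs). The function name is hypothetical, and the input is the limit in millicores.

```shell
# CFS quota (microseconds of CPU time per period) for a CPU limit given in
# millicores. The second argument is the period in microseconds and defaults
# to 100000 (100 ms), which is assumed to be the node's setting.
cfs_quota_us() {
  limit_millicores=$1
  period_us=${2:-100000}
  echo $(( limit_millicores * period_us / 1000 ))
}

cfs_quota_us 500     # prints 50000: half a core's worth of time per period
cfs_quota_us 2000    # prints 200000: two cores' worth, on whichever cores are free
```

On a cgroup v2 node these two numbers (quota, then period) are what appears in the container's cpu.max file. The quota caps how much CPU time the container gets but says nothing about which cores provide it, and that is the gap the static policy fills.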

The docs state the exact conditions to get exclusive cores: "Only containers that are both part of a Guaranteed pod and have integer CPU requests are assigned exclusive CPUs." — it must be a Guaranteed pod (Article 22) and an integer CPU request. Enable it on worker-0 (note: changing the policy requires deleting the old state file, or the kubelet won't start):

ssh worker-0 'sudo cp /var/lib/kubelet/kubelet-config.yaml /var/lib/kubelet/kubelet-config.yaml.bak   # backup, restored in Cleanup
  printf "cpuManagerPolicy: static\nkubeReserved:\n  cpu: \"200m\"\n  memory: \"256Mi\"\n" | sudo tee -a /var/lib/kubelet/kubelet-config.yaml
  sudo rm -f /var/lib/kubelet/cpu_manager_state      # required when changing the policy
  sudo systemctl restart kubelet'
ssh worker-0 'sudo cat /var/lib/kubelet/cpu_manager_state'
# {"policyName":"static","defaultCpuSet":"0-1",...}

The state file confirms policy static, defaultCpuSet: 0-1 — the initial shared pool covers both CPUs (the node has 2 vCPUs). Now drop two pods onto worker-0: one Guaranteed with an integer CPU request (=1), one Burstable with a fractional request (200m):

# pinned: requests==limits cpu=1, memory=128Mi  => Guaranteed, integer cpu
# shared: requests cpu=200m                      => Burstable, fractional cpu

echo "pinned: $(kubectl exec pinned -- cat /sys/fs/cgroup/cpuset.cpus.effective)"
echo "shared: $(kubectl exec shared -- cat /sys/fs/cgroup/cpuset.cpus.effective)"
ssh worker-0 'sudo cat /var/lib/kubelet/cpu_manager_state'

pinned: 1
shared: 0
{"policyName":"static","defaultCpuSet":"0","entries":{"...":{"c":"1"}},...}

Read the result: the pinned pod (Guaranteed, integer cpu) gets CPU 1 exclusively: cpuset.cpus.effective = 1. The shared pool shrinks to CPU 0 (defaultCpuSet goes from 0-1 to 0), and the state file records that pinned's container holds "c":"1". The shared pod (Burstable) runs in the shared pool, cpuset = 0, like every other non-exclusive container. The docs note that even Guaranteed pods land there when their CPU request is fractional: "Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared pool." From now on pinned runs only on CPU 1 and no other container is placed on it, which is exactly what a latency-sensitive workload needs.
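
The eligibility rule quoted earlier (Guaranteed pod plus integer CPU request) can be expressed as a small check. This is a simplified sketch for a single-container pod, not the kubelet's actual logic: the function names are hypothetical, CPU is compared in millicores, and memory quantities are compared as plain strings.

```shell
# Convert a CPU quantity ("1", "200m", "0.5") to millicores.
millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Args: cpu request, cpu limit, memory request, memory limit ("" = unset).
# Prints "yes" if the container would qualify for exclusive CPUs, else "no".
gets_exclusive_cpus() {
  # Guaranteed QoS needs both resources to have requests and limits set.
  [ -n "$1" ] && [ -n "$2" ] && [ -n "$3" ] && [ -n "$4" ] || { echo no; return; }
  req=$(millicores "$1"); lim=$(millicores "$2")
  # Guaranteed QoS: requests equal limits for cpu and memory.
  [ "$req" -eq "$lim" ] && [ "$3" = "$4" ] || { echo no; return; }
  # Integer CPU: a positive whole number of cores.
  if [ "$req" -gt 0 ] && [ $(( req % 1000 )) -eq 0 ]; then echo yes; else echo no; fi
}

gets_exclusive_cpus 1 1 128Mi 128Mi       # pinned: prints yes
gets_exclusive_cpus 200m "" "" ""         # shared (Burstable): prints no
gets_exclusive_cpus 500m 500m 64Mi 64Mi   # Guaranteed but fractional: prints no
```

The third case is the one the docs quote covers: the pod is Guaranteed, but with a fractional request it stays in the shared pool.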

The other two resource managers (just a mention; similar mechanism at the NUMA level): Memory Manager ensures a pod's memory sits on the same NUMA node as its pinned CPUs; Topology Manager coordinates CPU Manager + Memory Manager + device plugins (Article 61) so that all of a pod's resources land on one NUMA domain — critical for HPC/AI workloads that need maximum memory bandwidth.

🧹 Cleanup

kubectl delete pod pinned shared --now
# revert CPU Manager (again must delete the state file since the policy goes back to none):
ssh worker-0 'sudo mv /var/lib/kubelet/kubelet-config.yaml.bak /var/lib/kubelet/kubelet-config.yaml
  sudo rm -f /var/lib/kubelet/cpu_manager_state && sudo systemctl restart kubelet'
# remove VPA (not needed for later articles; KEEP Metrics Server):
kubectl delete -f recommender-deployment.yaml -f vpa-rbac.yaml -f vpa-v1-crd-gen.yaml

We reverted worker-0 to policy none (node Ready again) and removed VPA (keeping Metrics Server for later articles). The cluster returns to CoreDNS + metrics-server. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 40-vpa-cpumanager.

Wrap-up

Scaling vertically has two layers. The VerticalPodAutoscaler (an add-on, needs Metrics Server) right-sizes request/limit from real usage: we installed the recommender, created a VPA with updateMode: Off, and it recommended cpu 182m / memory 250Mi for a workload set to request 50m/16Mi — without modifying the pod in Off mode (Auto mode evicts-and-recreates to apply; no in-place resize as of v1.36); don't let VPA and HPA both manage CPU/memory on one workload. On the node side, the CPU Manager static policy pins exclusive CPU cores to Guaranteed pods with integer CPU: we saw the pinned pod get CPU 1 to itself (cpuset=1), the shared pool shrink to CPU 0, the state file clearly recording it — no more core-hopping for latency-sensitive workloads. Memory/Topology Manager extend that idea to NUMA alignment.

End of Part VIII. Part IX steps into storage — and this is where you need to trace carefully what creates what: Article 41 opens with volumes, ephemeral volumes, projected volumes — the volume types attached to a pod (emptyDir, hostPath, configMap/secret as in Article 31, projected combining multiple sources), a stepping stone before PV/PVC/StorageClass and CSI in later articles.
