Upgrades and Version Skew

Kai · 5 min read

The cluster is on v1.36, and one day it will have to go to v1.37. Upgrading Kubernetes does not mean swapping every binary at once; do that and you will almost certainly break the cluster. There is a rule about how far each component's version may skew from the apiserver's, and a mandatory upgrade order that follows from it. This article inspects the rule on the real cluster, then drills the hardest operational part of a node upgrade.

Version skew: who may skew, by how much

kubectl version (client and apiserver) and kubectl get nodes (kubelet on each node) show that every component is in sync:

kubectl version | head -3
kubectl get nodes -o wide | awk '{print $1, $5}'
Client Version: v1.36.1
Server Version: v1.36.1
worker-0 v1.36.1
worker-1 v1.36.1

In sync is the ideal state, but between upgrade steps skew is unavoidable. The version-skew rule limits that skew (relative to the newest apiserver):

   kube-apiserver (HA)              apiservers may differ from each other by at most 1 minor
        │
        ├── controller-manager/scheduler   at most 1 minor OLDER, NOT newer
        ├── kubelet                         at most 3 minors OLDER (N-3), NOT newer
        ├── kube-proxy                      at most 3 minors OLDER, NOT newer
        └── kubectl                         within ±1 minor

The point that repeats on every server-side line: no component may be newer than the apiserver. kubelet may lag the apiserver by up to three minors (apiserver 1.36 means kubelet 1.36/1.35/1.34/1.33 all work), but kubelet 1.37 talking to apiserver 1.36 is not supported. controller-manager and scheduler are stricter: they may lag by only one minor. kubectl is the one exception, because as a client it may be one minor newer or one minor older.
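
The limits above reduce to simple arithmetic on minor versions. As a minimal sketch (minor_of, max_lag, and skew_ok are hypothetical helper names, not part of kubectl), a shell check could look like this. kubectl is left out because, as a client, it may also be one minor newer.

```shell
# Sketch only: helper names are hypothetical, not part of kubectl.

# minor_of v1.36.1 -> 36
minor_of() {
  local v="${1#v}"      # strip the leading "v"  -> 1.36.1
  v="${v#*.}"           # drop the major version -> 36.1
  echo "${v%%.*}"       # keep only the minor    -> 36
}

# max_lag COMPONENT -> how many minors COMPONENT may be OLDER than the apiserver
max_lag() {
  case "$1" in
    kube-controller-manager|kube-scheduler) echo 1 ;;
    kubelet|kube-proxy)                     echo 3 ;;
    *) return 1 ;;                          # unknown component
  esac
}

# skew_ok COMPONENT COMPONENT_VERSION APISERVER_VERSION
# Exit status 0 when the pairing is within the rule, 1 when it is not,
# 2 when the component name is unknown.
skew_ok() {
  local lag limit
  limit="$(max_lag "$1")" || return 2
  lag=$(( $(minor_of "$3") - $(minor_of "$2") ))
  # A negative lag means the component is newer than the apiserver: never allowed
  [ "$lag" -ge 0 ] && [ "$lag" -le "$limit" ]
}
```

For example, `skew_ok kubelet v1.33.0 v1.36.1` succeeds, while `skew_ok kubelet v1.37.0 v1.36.1` and `skew_ok kube-scheduler v1.34.0 v1.36.1` both fail.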

The upgrade order follows from the rule

Since nothing may be newer than the apiserver, the upgrade order is a direct consequence:

1. kube-apiserver (all 3 control-plane nodes, one at a time)  ← upgrade FIRST
2. controller-manager + scheduler                             ← then these (no order between the two)
3. kubelet + kube-proxy on each worker                        ← last, one minor at a time

If you upgraded kubelet first, kubelet 1.37 would be newer than apiserver 1.36, violating the skew rule. Upgrade the apiserver first and kubelet, still at 1.36, lags apiserver 1.37 by only one minor, which is valid. For the same reason, upgrade one minor at a time (1.35→1.36→1.37) and never jump straight from 1.35 to 1.37. After such a jump, kubelet 1.35 against apiserver 1.37 is a two-minor lag, still within N-3, but controller-manager 1.35 against apiserver 1.37 is also two minors behind, which exceeds N-1. In an HA control plane, the apiservers themselves would also differ by two minors mid-upgrade, exceeding their limit of one.
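
To make the ordering concrete, here is a small sketch that expands a start and target minor into single-minor hops and lists the component order inside each hop. The function names minor_hops and upgrade_plan are hypothetical; the script only prints a plan and touches nothing on the cluster.

```shell
# Sketch only: prints a plan, performs no upgrade. Function names are hypothetical.

# minor_hops FROM_MINOR TO_MINOR -> one line per single-minor hop
minor_hops() {
  local cur="$1" target="$2"
  while [ "$cur" -lt "$target" ]; do
    echo "1.${cur} -> 1.$((cur + 1))"
    cur=$((cur + 1))
  done
}

# upgrade_plan FROM_MINOR TO_MINOR -> component order within every hop
upgrade_plan() {
  minor_hops "$1" "$2" | while read -r hop; do
    echo "${hop}: kube-apiserver (one control-plane node at a time)"
    echo "${hop}: kube-controller-manager, kube-scheduler"
    echo "${hop}: kubelet, kube-proxy (one worker at a time)"
  done
}
```

Calling `upgrade_plan 35 37` prints six lines: the three steps for 1.35 -> 1.36, then the same three steps for 1.36 -> 1.37.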

On a self-built cluster, "upgrading a component" means downloading the new-version binary, swapping it in, restarting the service — exactly what we did in Part I, repeated with a new version. (v1.36.1 is currently the latest patch of 1.36, so this article doesn't bump the number; the part worth learning isn't the binary-download command, it's handling the node during a kubelet upgrade.)

Drilling a node upgrade: drain

When upgrading kubelet on a worker, you can't upgrade while pods keep running on it — you have to move pods off first, upgrade, then let pods come back. This is the part that's easy to get wrong and the part we drill for real on worker-0 (reversible). kubectl drain does two things: cordon (mark the node to take no new pods) and evict the running pods:

kubectl drain worker-0 --ignore-daemonsets --delete-emptydir-data --force
evicting pod kube-system/coredns-8569db9899-wxzlr
evicting pod kube-system/ebs-csi-controller-74ddd54f5b-zdgg2
evicting pod kube-system/cilium-operator-778946fc48-nhmnp
...
node/worker-0 drained

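--ignore-daemonsets is required here because DaemonSet pods (cilium, ebs-csi-node; see Article 26) run on every node. The DaemonSet controller ignores the cordon and would recreate them immediately if they were deleted, so without the flag drain refuses to proceed, and with it drain skips those pods. --delete-emptydir-data lets drain evict pods that use emptyDir volumes (that data is lost with the pod), and --force lets it delete pods that no controller manages, which will not be recreated. Ordinary pods (coredns, cilium-operator, ebs-csi-controller, hubble-ui, snapshot-controller) get evicted, their controllers recreate them, and the scheduler places the replacements on worker-1. The node is now in a safe state to upgrade: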

kubectl get nodes
NAME       STATUS                     ROLES    AGE   VERSION
worker-0   Ready,SchedulingDisabled   <none>   9h    v1.36.1
worker-1   Ready                      <none>   9h    v1.36.1

Ready,SchedulingDisabled means the node still runs (the DaemonSet pods are still there) but takes no new pods. Only now is it safe to swap the kubelet binary and restart the service: no application pod gets disrupted by kubelet restarting, because they have already moved away. Drain evicts through the Eviction API, which respects PodDisruptionBudgets (Article 23), so if an application declares a PDB, drain evicts gradually, retrying refused evictions, to keep enough replicas running.
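
Two small helpers illustrate the points above: one lists what is still bound to a drained node (only DaemonSet pods should remain), the other declares a PDB before draining. This is a sketch under assumptions: the function names are hypothetical, and the KUBECTL variable is an illustrative override that defaults to the real kubectl.

```shell
# Sketch only: function names and the KUBECTL override are illustrative assumptions.

# pods_on_node NODE -> every pod still bound to NODE, across all namespaces
pods_on_node() {
  ${KUBECTL:-kubectl} get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# protect_app NAME SELECTOR MIN_AVAILABLE
# Creates a PodDisruptionBudget in the current namespace so that evictions
# (including those issued by drain) keep at least MIN_AVAILABLE matching pods running.
protect_app() {
  ${KUBECTL:-kubectl} create poddisruptionbudget "$1" \
    --selector="$2" --min-available="$3"
}
```

For instance, `pods_on_node worker-0` after the drain should show only the DaemonSet pods, and `protect_app web-pdb app=web 1` (hypothetical name and label) would make drain wait rather than evict the last running replica of that application.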

Bringing the node back: uncordon

After the upgrade, uncordon reopens the node to take pods:

kubectl uncordon worker-0
kubectl get nodes
node/worker-0 uncordoned
NAME       STATUS   ROLES    AGE   VERSION
worker-0   Ready    <none>   9h    v1.36.1
worker-1   Ready    <none>   9h    v1.36.1

worker-0 returns to Ready and takes pods again. Note: uncordon does not pull the moved pods back automatically — the scheduler only places new pods on worker-0; pods running on worker-1 stay there until they're recreated. A full cluster upgrade is repeating drain → upgrade → uncordon for each worker, one node at a time, so the cluster always has a node serving.
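
The per-node loop described above can be sketched as follows. The drain flags are the ones used earlier in this article; upgrade_kubelet_on is a hypothetical placeholder for the binary swap and restart from Part I, and the KUBECTL variable and the 120s timeout are illustrative assumptions.

```shell
# Sketch only: upgrade_kubelet_on is a placeholder, KUBECTL an illustrative override.

# upgrade_kubelet_on NODE -> replace the body with the real swap-and-restart steps
upgrade_kubelet_on() {
  echo "TODO: swap kubelet/kube-proxy binaries on $1 and restart the services"
}

# rolling_node_upgrade NODE... -> drain, upgrade, uncordon, strictly one node at a time
rolling_node_upgrade() {
  local node
  for node in "$@"; do
    # Stop at the first failure so two nodes are never out of service together.
    # A node that fails mid-way stays cordoned until someone looks at it.
    ${KUBECTL:-kubectl} drain "$node" \
      --ignore-daemonsets --delete-emptydir-data --force || return 1
    upgrade_kubelet_on "$node" || return 1
    ${KUBECTL:-kubectl} uncordon "$node" || return 1
    # Wait for the node to report Ready before touching the next one
    ${KUBECTL:-kubectl} wait --for=condition=Ready "node/$node" --timeout=120s || return 1
  done
}
```

On this cluster the call would be `rolling_node_upgrade worker-0 worker-1`. Setting `KUBECTL=echo` first prints the kubectl arguments instead of running them, which is a cheap way to review the sequence.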

🧹 Cleanup

This article doesn't install or change any version — it only cordons/drains/uncordons worker-0, and uncordon already returned the node to Ready. There's nothing further to clean up. The commands used in this article are at github.com/nghiadaulau/kubernetes-from-scratch, directory 63-upgrade.

Wrap-up

Kubernetes upgrades are bound by version skew: no server-side component may be newer than the apiserver; kubelet/kube-proxy may lag by up to three minors (N-3), controller-manager/scheduler by only one (N-1), kubectl stays within ±1, and HA apiservers may differ from each other by at most one minor. The upgrade order follows (apiserver first, then controller-manager/scheduler, finally kubelet/kube-proxy), and only one minor at a time. On a self-built cluster, upgrading a component is swap binary + restart as in Part I. The hardest operational part is upgrading a node, and we drilled it for real on worker-0: drain --ignore-daemonsets cordons the node and evicts ordinary pods to worker-1 (keeping DaemonSets), bringing the node to Ready,SchedulingDisabled so kubelet can be swapped safely, then uncordon lets it take pods again. Drain respects PDBs (Article 23); uncordon doesn't pull old pods back.

Article 64 gathers several node-level operational tasks that kubelet handles quietly: cleaning up old images and containers (garbage collection), how pods map into cgroup v2, swap support, and orderly shutdown (graceful node shutdown) so pods aren't disrupted when a node shuts down.