Leader Election, Addons, and Node Autoscaling

Kai · 5 min read

The cluster has three control-plane nodes (Article 8) for high availability. The apiserver is active-active: all three instances serve at once, and any request can hit any machine. The controller-manager and scheduler are different. If all three copies were active, three schedulers would race to place the same pod, and three controller-managers would each react to the same ReplicaSet and could create surplus pods. There has to be a mechanism guaranteeing that only one instance is active at a time, with the other two waiting. That's leader election. This article looks at it, proves real failover, then closes Part XIII with addons and node autoscaling.

Leader election via Lease

The mechanism relies on a Lease object in kube-system: whichever instance holds the lease is the leader, while the other two continuously try to win it but fail as long as the lease is still valid. See who holds it:

kubectl -n kube-system get lease kube-controller-manager kube-scheduler
NAME                      HOLDER                                              AGE
kube-controller-manager   controller-0_b27ab883-b893-434e-9feb-6023a9998b07   10h
kube-scheduler            controller-0_0b8e4b14-44cc-4c01-9a8f-8b4e8b07b647   10h

Both leaders are on controller-0. The leader has to renew the lease regularly; if it dies and stops renewing, the lease expires and another instance wins it. Each instance carries a random identifier (the part after the _) to tell them apart.
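The renew-or-take-over rule can be sketched in a few lines of Python. This is a toy model, not the client-go implementation: the Lease class and try_acquire_or_renew are illustrative names, the 15-second duration is assumed to match the ~15s wait used later in this article, and the real code additionally relies on locally observed time and on the apiserver rejecting conflicting updates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Simplified stand-in for a few fields of a Lease spec (illustrative only)."""
    holder: Optional[str] = None   # plays the role of spec.holderIdentity
    renew_time: float = 0.0        # plays the role of spec.renewTime, in seconds
    duration: float = 15.0         # plays the role of spec.leaseDurationSeconds (assumed value)


def try_acquire_or_renew(lease: Lease, identity: str, now: float) -> bool:
    """Return True if `identity` is the leader after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == identity or expired:
        # Either we already hold the lease (renew) or nobody validly does (take over).
        lease.holder = identity
        lease.renew_time = now
        return True
    # Someone else holds a lease that is still valid: stay in standby and retry later.
    return False
```

Every instance calls the same function on a timer. The current holder keeps succeeding as long as it keeps calling, and a standby only succeeds once the holder has been silent for longer than the lease duration.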

Proving failover

Take the leader down and watch what happens. Stop controller-manager on controller-0:

ssh controller-0 'sudo systemctl stop kube-controller-manager'
# wait for lease expiry + re-election (~15s)
kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
controller-1_4076e3a4-c151-444a-b724-fffffb24544e

The leader moves to controller-1 with no intervention. controller-0 stops renewing, the lease expires about 15 seconds later, and controller-1 (which had been retrying all along) takes it and starts doing controller-manager's work. This is real HA: lose the node holding the leader and another node takes over within roughly the lease duration, so reconciliation pauses only for that short window. Restart controller-0:

ssh controller-0 'sudo systemctl start kube-controller-manager'

It comes back up but does not reclaim leadership — controller-1 holds a valid lease, so controller-0 falls into the waiting role. The leader only changes when the current leader stops renewing; there's no back-and-forth contention when both are healthy. (The reason the scheduler in Article 34 logged holder=controller-0 is this same mechanism.)
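The stop, failover, and no-reclaim sequence can be replayed with a small simulation. It is a sketch under stated assumptions: a 15-second lease, a 2-second retry interval, integer clock ticks, and candidates that take turns in list order (in the real cluster the standbys race, so which one wins is not deterministic). The simulate function and its parameters are illustrative, not part of Kubernetes.

```python
def simulate(events, candidates, duration=15, retry=2, end=60):
    """Step a toy clock and return [(time, new_holder)] for each leadership change.

    `events` maps a tick to (candidate, running) to stop or start that process.
    """
    running = {c: True for c in candidates}
    holder, renew_time = None, 0
    changes = []
    for now in range(0, end + 1, retry):
        if now in events:
            name, state = events[now]
            running[name] = state
        for c in candidates:
            if not running[c]:
                continue  # a stopped process neither renews nor competes
            expired = holder is None or now - renew_time > duration
            if c == holder:
                renew_time = now              # healthy leader renews
            elif expired:
                holder, renew_time = c, now   # standby takes an expired lease
                changes.append((now, c))
    return changes


if __name__ == "__main__":
    nodes = ["controller-0", "controller-1", "controller-2"]
    # Stop controller-0 at t=10, start it again at t=40.
    timeline = {10: ("controller-0", False), 40: ("controller-0", True)}
    for when, who in simulate(timeline, nodes):
        print(f"t={when:>2}s leader={who}")
```

With that timeline the model reports exactly two changes: controller-0 at t=0 and controller-1 at t=24, which is 16 seconds after controller-0's last renewal at t=8. Restarting controller-0 at t=40 produces no third change, because controller-1 is still renewing.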

Addons: a from-scratch cluster manages them by hand

An "addon" is a component that runs inside the cluster but belongs to infrastructure rather than to user applications: CoreDNS (Article 15), Cilium (Article 46), metrics-server (Article 39). The operational question: who installs and updates them?

This from-scratch cluster manages addons by hand: we kubectl apply a manifest or helm install, and take responsibility for upgrades ourselves. No component reconciles them automatically. Other setups automate part of this: kubeadm installs CoreDNS and kube-proxy for you at init time, and some distributions and managed clusters run an "addon manager" that reapplies manifests from a directory. That is an extra convenience, not core Kubernetes. The trade-off is the same as everything in this series: doing it by hand gives you control over every detail and a clear picture of what's running, but you have to remember to update it yourself. Nobody patches CoreDNS for you when a CVE lands.
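Since nothing reconciles addons here, a periodic version check is one way to "remember". The sketch below compares the image tags actually running against a hand-maintained list of pinned tags. Everything in it is hypothetical: the function names, the registry hosts, and the version numbers are placeholders, and the running images would have to be collected separately (for example from each workload's pod template).

```python
def image_tag(image: str) -> str:
    """Return the tag of an image reference, or 'latest' if none is given."""
    last = image.rsplit("/", 1)[-1]   # drop registry/path so a registry port ':' is ignored
    last = last.split("@", 1)[0]      # drop a digest suffix if present
    _, sep, tag = last.partition(":")
    return tag if sep else "latest"


def outdated_addons(pinned: dict, running: dict) -> dict:
    """Map addon name -> (running tag, pinned tag) for every addon that differs."""
    drift = {}
    for name, want in pinned.items():
        have = image_tag(running[name]) if name in running else "missing"
        if have != want:
            drift[name] = (have, want)
    return drift


if __name__ == "__main__":
    # Placeholder versions and registries, not real release numbers.
    pinned = {"coredns": "v1.0.0", "metrics-server": "v2.0.0"}
    running = {
        "coredns": "registry.example:5000/coredns/coredns:v0.9.0",
        "metrics-server": "example.io/metrics-server:v2.0.0",
    }
    print(outdated_addons(pinned, running))
```

The check only reports drift. Deciding when to upgrade, and doing it, stays a manual step, which is the trade-off described above.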

Node autoscaling

Article 39 covered the HPA, which adds pods under high load. But if the HPA adds a pod and no node has room, the new pod sticks in Pending (as in Article 34). Node autoscaling fills that gap: it automatically adds nodes when there are unschedulable pods, and removes nodes that sit underutilized.

This cluster runs exactly six fixed EC2 instances, with no autoscaler, so this section is the framework rather than something we can demo:

   pod Pending (no node has enough room)
        │  watch
        ▼
   Cluster Autoscaler  ──►  call cloud API (AWS ASG...)  ──►  add EC2  ──►  new node Ready
        │                                                                      │
        └──────────  reverse: node underutilized long  ──►  drain + remove EC2  ◄────────┘

There are two main approaches. Cluster Autoscaler watches for Pending pods, and when it sees a pod stuck for lack of resources it scales up a cloud node group (AWS Auto Scaling Group, GCP MIG, etc.). It scales down when a node has been idle for a while: drain, then delete, exactly the procedure from Article 63.

Karpenter (originally AWS, now broader) works at a finer grain: instead of growing a predefined node group, it picks the instance type that best fits the pending pods and launches it directly.

Both are components outside core Kubernetes, run as a Deployment in the cluster, and need permission to call the cloud API. On a fixed from-scratch EC2 cluster we don't enable either, but the principle connects directly to scheduling (Article 34) and drain (Article 63): the autoscaler is just a loop that reads Pending pods and acts on the cloud infrastructure.
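The difference between the two approaches shows up in the scale-up decision itself. The sketch below is a deliberately reduced model: it looks only at CPU requests, uses first-fit packing, and invents an instance catalog with made-up names and prices. The real projects simulate full scheduling (memory, affinity, taints, and more), so treat the function names and numbers as assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str      # hypothetical instance type name
    cpu: float     # allocatable vCPU
    price: float   # hypothetical hourly price


def unschedulable(requests, free_cpu):
    """First-fit each pod CPU request onto existing nodes; return requests that do not fit."""
    free = list(free_cpu)
    leftover = []
    for req in requests:
        for i, cpu in enumerate(free):
            if cpu >= req:
                free[i] = cpu - req
                break
        else:
            leftover.append(req)   # this pod would stay Pending
    return leftover


def nodes_to_add(leftover, node_cpu):
    """Node-group style: count fixed-size nodes needed to hold the leftover pods."""
    new_nodes = []  # free CPU remaining on each node we would add
    for req in sorted(leftover, reverse=True):
        if req > node_cpu:
            continue  # larger than any node in this group; scaling up cannot help
        for i, cpu in enumerate(new_nodes):
            if cpu >= req:
                new_nodes[i] = cpu - req
                break
        else:
            new_nodes.append(node_cpu - req)
    return len(new_nodes)


def pick_instance(leftover, catalog):
    """Direct-provisioning style: cheapest single instance type that holds all leftover pods."""
    need = sum(leftover)
    fitting = [t for t in catalog if t.cpu >= need]
    return min(fitting, key=lambda t: t.price) if fitting else None
```

With pod requests [1, 2, 2, 1] and nodes that have [2, 1] vCPU free, two 2-vCPU pods stay Pending. A node group of 2-vCPU machines needs two new nodes, while a catalog of 2, 4, and 8 vCPU types priced proportionally would be satisfied by a single 4-vCPU instance.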

🧹 Cleanup

This article only reads the Lease and performs one controlled failover (controller-manager on controller-0 has been restarted; the leader is now controller-1, which is entirely normal). It creates nothing. The commands used here are at github.com/nghiadaulau/kubernetes-from-scratch, directory 67-leader-election.

Wrap-up

In an HA control plane, the apiserver runs all three instances in parallel, but controller-manager and scheduler use leader election to keep only one instance active, so three copies never step on each other. The mechanism is a Lease in kube-system: the holder is the leader and must renew it regularly. We proved failover by stopping controller-manager on controller-0; after roughly 15 seconds the leader moved to controller-1 with no intervention, and when controller-0 came back it took the waiting role (no reclaim). Addons (CoreDNS, Cilium, metrics-server) on a from-scratch cluster are managed by hand via apply/helm: full control, in exchange for having to remember to update them yourself. Node autoscaling (Cluster Autoscaler by node group, or Karpenter picking instances directly) adds nodes when a pod is Pending for lack of room and removes them when idle. It's a component outside the core that connects scheduling (Article 34) with drain (Article 63), and this fixed EC2 cluster doesn't enable it.

Part XIII closes — the cluster is now backed up (62), knows how to upgrade (63), manages node resources (64), and is observable (65–67). Article 68 opens Part XIV, using this full cluster to try features that just graduated in the very v1.36 release it runs: admission via CEL, in-place pod resize, new storage, and kubelet observability/security.