controller-manager and scheduler: Control Loops and Leader Election
The api-server in Article 7 does just one thing: receive and store the desired state. It doesn't create pods, pick nodes, or rebuild anything when something dies. Those jobs belong to the two components in this article: kube-controller-manager and kube-scheduler. Both are clients of the api-server (using the kubeconfigs created in Article 5), and since we run three instances of each, both must elect a leader — we'll watch that happen for real.
controller-manager: where the control loops live
Recall from Article 1: a controller is a continuous loop that compares desired state with actual state and acts to reconcile them. kube-controller-manager is a single process that bundles dozens of such loops — Deployment, ReplicaSet, Node, Job, ServiceAccount, EndpointSlice... Rather than run dozens of separate processes, Kubernetes packs them into one binary for convenience.
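To make the idea of a control loop concrete, here is a minimal Python sketch of one reconciliation step in the style of a ReplicaSet controller. The function name `reconcile` and the plain list of pod names are hypothetical stand-ins for illustration; the real controllers are written in Go and read and write state through the api-server.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move actual state toward desired state.

    Hypothetical sketch: a real controller would watch the api-server and
    issue create/delete requests instead of returning a list.
    """
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods: request the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: mark the surplus ones for deletion.
        return [("delete", name) for name in running_pods[:-diff]]
    # Actual state already matches desired state: nothing to do.
    return []


if __name__ == "__main__":
    print(reconcile(3, ["web-a"]))           # two creates
    print(reconcile(1, ["web-a", "web-b"]))  # one delete
    print(reconcile(2, ["web-a", "web-b"]))  # no action
```

A controller runs a step like this over and over, so it never needs to remember what it did last time: it only looks at the current gap between desired and actual.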
In our configuration, the controller-manager also holds a few special roles that need to be configured correctly:
- Signing CSRs (--cluster-signing-cert-file=ca.pem, --cluster-signing-key-file=ca-key.pem): it's the component that signs certificate requests in the cluster. This is why we had to put ca-key.pem on the controllers in Article 7.
- Signing ServiceAccount tokens (--service-account-private-key-file=service-account-key.pem): recall from Article 2, the controller-manager uses the private key to sign tokens, and the api-server uses the public key to verify them.
- Knowing the pod network range (--cluster-cidr=10.200.0.0/16): the range we'll carve up per node in the networking section (Articles 13–14).
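To get a feel for what "carving up" the pod range means, the sketch below splits 10.200.0.0/16 into one subnet per node using Python's standard `ipaddress` module. The /24 size per node and the worker names are assumptions made for illustration; the actual allocation is decided in the networking articles.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("10.200.0.0/16")  # value of --cluster-cidr


def node_pod_cidrs(cluster_cidr, node_names, new_prefix=24):
    """Assign consecutive subnets of the cluster range to nodes.

    new_prefix=24 is an assumed per-node size, not taken from the article.
    """
    subnets = cluster_cidr.subnets(new_prefix=new_prefix)
    # zip stops at the shorter input, so only as many subnets as nodes are used.
    return {name: str(net) for name, net in zip(node_names, subnets)}


if __name__ == "__main__":
    # Hypothetical worker names.
    for node, cidr in node_pod_cidrs(CLUSTER_CIDR, ["worker-0", "worker-1", "worker-2"]).items():
        print(node, cidr)
```

With a /16 cluster range and /24 per node, there is room for 256 node subnets of 256 addresses each.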
scheduler: filter, then score
kube-scheduler watches for pods not yet assigned to a node (spec.nodeName empty), and for each pod, decides which node it should run on. That decision goes through two phases:
All nodes
│
▼ PHASE 1 — FILTERING
drop nodes that DON'T qualify:
not enough CPU/RAM, doesn't match nodeSelector/affinity,
has a taint the pod doesn't tolerate, port already in use...
│
▼ (the feasible nodes remain)
▼ PHASE 2 — SCORING
give each feasible node a score by various criteria
(load balancing, preference for spreading, data locality...)
│
▼
highest-scoring node ──► write spec.nodeName onto the pod (via api-server)
The scheduler only chooses and writes the node name onto the pod; it doesn't start the container. Starting the container is the job of the kubelet on the chosen node (Article 11). This is the kind of clean division of roles that's very Kubernetes: each component does exactly one thing, then leaves the result for the next component via state in the cluster.
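The two phases can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Node` and `Pod` classes, the taint handling (a plain set of strings), and the single scoring rule are all hypothetical simplifications, not the scheduler's real plugins.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_millis: int
    mem_mib: int
    tolerations: set = field(default_factory=set)


def feasible(pod, node):
    """Phase 1: drop nodes that cannot run the pod at all."""
    fits = node.free_cpu_millis >= pod.cpu_millis and node.free_mem_mib >= pod.mem_mib
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return fits and tolerated


def score(pod, node):
    """Phase 2: toy criterion, prefer the node with the most CPU left over."""
    return node.free_cpu_millis - pod.cpu_millis


def schedule(pod, nodes):
    """Return the chosen node name, or None if no node is feasible."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None  # the pod would stay unscheduled
    return max(candidates, key=lambda n: score(pod, n)).name


if __name__ == "__main__":
    nodes = [
        Node("worker-0", 500, 1024),
        Node("worker-1", 2000, 4096),
        Node("worker-2", 4000, 8192, {"dedicated"}),
    ]
    print(schedule(Pod(1000, 512), nodes))
```

Here worker-0 is filtered out for lack of CPU and worker-2 for its untolerated taint, so the only feasible node wins regardless of score. The return value corresponds to what the scheduler writes into spec.nodeName.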
Leader election: three instances, one does the work
Both the controller-manager and the scheduler run three instances (one per controller). If all three were active, we'd have three schedulers assigning a node to a single pod, or three ReplicaSet controllers each creating pods — chaos. The solution is leader election: the three instances compete to hold a lock, and only the lock holder does the work.
That lock is a Lease object in the kube-system namespace. Whichever instance holds the Lease periodically "renews" it; if it dies and stops renewing, after a while another instance grabs it and takes over. We enable this mechanism with --leader-elect=true (controller-manager) and leaderElection.leaderElect: true (scheduler).
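The acquire-or-renew rule behind the Lease can be modeled with a small in-memory sketch. Everything here is an assumption for illustration: the `Lease` class, the `try_acquire_or_renew` function, and the 15-second duration are stand-ins, and the real implementation additionally relies on the api-server rejecting conflicting updates so that two instances cannot both win.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # seconds; assumed value for this sketch


def try_acquire_or_renew(lease, me, now):
    """Return True if `me` holds the lease after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == me or expired:
        # Either we already lead (renew) or the old leader went silent (take over).
        lease.holder = me
        lease.renew_time = now
        return True
    # Someone else holds a fresh lease: stay on standby.
    return False


if __name__ == "__main__":
    lease = Lease()
    print(try_acquire_or_renew(lease, "controller-0", now=0.0))   # first come, first served
    print(try_acquire_or_renew(lease, "controller-1", now=5.0))   # lease still fresh
    print(try_acquire_or_renew(lease, "controller-1", now=30.0))  # controller-0 stopped renewing
```

Each instance calls something like this on a timer: the leader keeps renewing, and the standbys keep failing until the renewals stop.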
Step 1 — Get the kubeconfigs onto the controllers
The kube-controller-manager and kube-scheduler binaries were already downloaded in Article 7. Now we just need to get the two corresponding kubeconfigs (created in Article 5, pointing at 127.0.0.1:6443) onto each controller:
# from the workstation, in ~/k8s-scratch/pki
for h in controller-0 controller-1 controller-2; do
scp kube-controller-manager.kubeconfig kube-scheduler.kubeconfig ${h}:/tmp/
ssh $h 'sudo mv /tmp/kube-controller-manager.kubeconfig /tmp/kube-scheduler.kubeconfig /var/lib/kubernetes/'
done
ca.pem, ca-key.pem, and service-account-key.pem are already in /var/lib/kubernetes from Article 7, so there's no need to copy them again.
Step 2 — systemd unit for the controller-manager
The unit is identical on all three controllers (leader election handles coordination, so there's no need to differentiate by machine):
[Unit]
Description=Kubernetes Controller Manager
After=network.target
[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
--bind-address=0.0.0.0 \
--cluster-cidr=10.200.0.0/16 \
--cluster-name=kubernetes \
--cluster-signing-cert-file=/var/lib/kubernetes/ca.pem \
--cluster-signing-key-file=/var/lib/kubernetes/ca-key.pem \
--kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
--authentication-kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
--authorization-kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
--leader-elect=true \
--root-ca-file=/var/lib/kubernetes/ca.pem \
--service-account-private-key-file=/var/lib/kubernetes/service-account-key.pem \
--service-cluster-ip-range=10.32.0.0/24 \
--use-service-account-credentials=true \
--v=2
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
--use-service-account-credentials=true is worth a note: it makes each internal control loop use its own ServiceAccount when calling the api-server, rather than sharing a single identity. That way RBAC grants each loop the minimum permissions it needs.
Step 3 — Config and unit for the scheduler
The modern scheduler takes its configuration via a KubeSchedulerConfiguration file instead of cramming everything into flags. Create that file, point it at the kubeconfig, and enable leader election:
sudo tee /var/lib/kubernetes/kube-scheduler.yaml >/dev/null <<'EOF'
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
kubeconfig: /var/lib/kubernetes/kube-scheduler.kubeconfig
leaderElection:
leaderElect: true
EOF
The scheduler's unit is therefore very short:
[Unit]
Description=Kubernetes Scheduler
After=network.target
[Service]
ExecStart=/usr/local/bin/kube-scheduler \
--config=/var/lib/kubernetes/kube-scheduler.yaml \
--v=2
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Write both units and the config file to each controller, then reload and enable:
sudo systemctl daemon-reload
sudo systemctl enable kube-controller-manager kube-scheduler
Step 4 — Start and verify
Start both services on the three controllers:
for h in controller-0 controller-1 controller-2; do
ssh $h 'sudo systemctl start kube-controller-manager kube-scheduler'
done
Check the status:
for h in controller-0 controller-1 controller-2; do
printf "%-14s cm=%s sched=%s\n" "$h" \
"$(ssh $h 'systemctl is-active kube-controller-manager')" \
"$(ssh $h 'systemctl is-active kube-scheduler')"
done
controller-0 cm=active sched=active
controller-1 cm=active sched=active
controller-2 cm=active sched=active
When a service won't come up: read the logs, don't guess. The first time I ran this, kube-controller-manager on controller-0 kept activating then restarting on a loop. journalctl -u kube-controller-manager showed status=203/EXEC, the systemd error code for "couldn't execute the binary file". Comparing file sizes revealed the cause: the binary on controller-0 was only 12MB, while on controller-1 it was 74MB. It had been truncated on download in Article 7 (exactly the mid-transfer curl trap we warned about). Re-download a complete copy, then systemctl restart, and you're done:
ls -l /usr/local/bin/kube-controller-manager   # 12582912: way too small!
sudo curl -fSL -o /usr/local/bin/kube-controller-manager \
https://dl.k8s.io/release/v1.36.1/bin/linux/amd64/kube-controller-manager
sudo chmod +x /usr/local/bin/kube-controller-manager
sudo systemctl restart kube-controller-manager
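Comparing file sizes between machines works, but a checksum catches truncation without needing a second copy. The sketch below is one possible way to do that in Python; the helper names `sha256_of` and `is_intact` are hypothetical, and the expected digest is an input you would have to obtain from a source you trust.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """Compare against an expected digest, ignoring case and surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A truncated download produces a completely different digest, so `is_intact` returning False is a strong hint to re-download before starting the service.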
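Now the fun part: watching leader election. Each component has its own Lease in kube-system, and the holderIdentity field tells you which instance is holding it. Query the api-server with the admin cert (run this on a controller, in a directory containing ca.pem, admin.pem, and admin-key.pem):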
C="--cacert ca.pem --cert admin.pem --key admin-key.pem"
curl -s $C "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases" \
| python3 -c 'import sys,json; d=json.load(sys.stdin); [print(i["metadata"]["name"],"->",i["spec"].get("holderIdentity")) for i in d["items"]]'
apiserver-tyzveoctxbnh6lvbkdt2xqncle -> apiserver-tyzveoctxbnh6lvbkdt2xqncle_14cfdd62-...
apiserver-wlwnjbc6t2b2g566g3usjzlbdm -> apiserver-wlwnjbc6t2b2g566g3usjzlbdm_979b5321-...
apiserver-zjy5m4qienmmbtxvgklj3tgz6i -> apiserver-zjy5m4qienmmbtxvgklj3tgz6i_18335984-...
kube-controller-manager -> controller-1_2bc8f705-33eb-46f5-bf37-a7c9946369a6
kube-scheduler -> controller-0_422ddcd2-4dd9-46a2-b7d4-819843c626c0
Reading the result: the controller-manager leader is controller-1, and the scheduler leader is controller-0. The other two instances of each type are running idle, waiting their turn. (The three apiserver-... leases are a different matter: they're the HA identities each api-server registers for itself, not leader election.)
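If you want the leaders as data rather than reading them off by eye, the same Lease list can be reduced to a small mapping. This is a sketch that assumes the holderIdentity format seen in the output above (hostname, underscore, unique suffix); the function name `leaders` is hypothetical.

```python
import json


def leaders(lease_list_json):
    """Map each leader-election Lease name to the hostname currently holding it."""
    result = {}
    for item in json.loads(lease_list_json)["items"]:
        name = item["metadata"]["name"]
        holder = item["spec"].get("holderIdentity")
        if name.startswith("apiserver-") or not holder:
            continue  # api-server identity leases are not leader election
        # Assumed format: "<hostname>_<unique suffix>"; keep the hostname part.
        result[name] = holder.rsplit("_", 1)[0]
    return result


if __name__ == "__main__":
    import sys
    # Usage: pipe the curl output from the previous command into this script.
    for component, host in leaders(sys.stdin.read()).items():
        print(component, "->", host)
```

Fed the output shown above, this would report controller-1 for kube-controller-manager and controller-0 for kube-scheduler.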
If you now stop the controller-manager on controller-1, one of the two remaining instances grabs the Lease once it expires (the lease duration is 15 seconds by default, set with --leader-elect-lease-duration), and holderIdentity switches to that machine's name, which is exactly the fault tolerance we designed. You can try it yourself to see it firsthand.
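While running that experiment, you can judge whether a Lease is stale from two fields in its spec: renewTime and leaseDurationSeconds. The sketch below shows the arithmetic; the helper names and the sample timestamp are made up for illustration, and the timestamp format is assumed to be the microsecond-precision UTC form the api-server returns for Lease times.

```python
from datetime import datetime, timedelta, timezone


def parse_micro_time(value):
    """Parse a timestamp such as 2024-05-01T10:00:00.000000Z (assumed format)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def lease_expired(spec, now):
    """True once `now` is past renewTime plus leaseDurationSeconds."""
    renewed = parse_micro_time(spec["renewTime"])
    return now > renewed + timedelta(seconds=spec["leaseDurationSeconds"])


if __name__ == "__main__":
    # Made-up sample values, not taken from a real cluster.
    spec = {"renewTime": "2024-05-01T10:00:00.000000Z", "leaseDurationSeconds": 15}
    print(lease_expired(spec, datetime.now(timezone.utc)))
```

A healthy leader keeps pushing renewTime forward, so for a live Lease this check stays False; after you stop the leader it flips to True just before a standby takes over.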
🧹 Cleanup
Both services are permanent components; don't shut them down. Remember to delete the temporary cert files if you copied them up to a controller to run the curl checks.
Wrap-up
The control plane is essentially complete: etcd stores state, the api-server is the entry point, the controller-manager runs the control loops, and the scheduler picks nodes — and both of the latter two elected a leader correctly. What's worth taking away from this article isn't just the flags, but seeing that leader election is a concrete mechanism (a Lease, a holderIdentity) rather than an abstract concept.
But so far we still call the api-server through 127.0.0.1 on each controller, and there are no workers yet. Article 9 stands up HAProxy on lb-0 to consolidate the three api-servers into one address, configures kubectl on your laptop to point at its Elastic IP, and sets up the RBAC so the api-server is allowed to call down to the kubelet — finishing the preparation before we add workers to the cluster in Article 10.