controller-manager and scheduler: Control Loops and Leader Election

Kai · 6 min read

The api-server in Article 7 does just one thing: receive and store the desired state. It doesn't create pods, pick nodes, or rebuild anything when something dies. Those jobs belong to the two components in this article: kube-controller-manager and kube-scheduler. Both are clients of the api-server (using the kubeconfigs created in Article 5), and since we run three instances of each, both must elect a leader — we'll watch that happen for real.

controller-manager: where the control loops live

Recall from Article 1: a controller is a continuous loop that compares desired state with actual state and acts to reconcile them. kube-controller-manager is a single process that bundles dozens of such loops — Deployment, ReplicaSet, Node, Job, ServiceAccount, EndpointSlice... Rather than run dozens of separate processes, Kubernetes packs them into one binary for convenience.
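To make the "compare desired with actual, then act" idea concrete, here is a minimal sketch of one such loop in Python. It is a toy, not how kube-controller-manager is implemented: the ReplicaSetState class, the reconcile and apply functions, and the pod names are all hypothetical, and a real controller is driven by watch events from the api-server rather than a fixed number of passes.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)  # hypothetical source of unique pod names


@dataclass
class ReplicaSetState:
    """Hypothetical, simplified view of one ReplicaSet."""
    desired_replicas: int  # what the user asked for
    running_pods: list     # names of the pods that actually exist


def reconcile(state):
    """Compare desired with actual and return the actions needed to converge."""
    diff = state.desired_replicas - len(state.running_pods)
    if diff > 0:
        # Too few pods: request the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: remove the surplus (the last ones, in this toy version).
        return [("delete", name) for name in state.running_pods[diff:]]
    return []


def apply(state, actions):
    """Stand in for the cluster: carry out the actions on the actual state."""
    for verb, name in actions:
        if verb == "create":
            state.running_pods.append(f"pod-{next(_ids)}")
        else:
            state.running_pods.remove(name)


if __name__ == "__main__":
    state = ReplicaSetState(desired_replicas=3, running_pods=["pod-0"])
    for _ in range(3):  # a real loop never stops; three passes are enough here
        actions = reconcile(state)
        print(actions or "in sync, nothing to do")
        apply(state, actions)
```

The point to notice is that reconcile never remembers what it did last time. It only looks at the current gap, which is why a controller can crash, restart, and simply carry on.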

In our configuration, the controller-manager also holds a few special roles that need to be configured correctly:

  • Signing CSRs (--cluster-signing-cert-file=ca.pem, --cluster-signing-key-file=ca-key.pem): it's the component that signs certificate requests in the cluster. This is why we had to put ca-key.pem on the controllers in Article 7.
  • Signing ServiceAccount tokens (--service-account-private-key-file=service-account-key.pem): recall from Article 2 that the controller-manager uses the private key to sign tokens, and the api-server uses the public key to verify them.
  • Knowing the pod network range (--cluster-cidr=10.200.0.0/16): the range we'll carve up per node in the networking section (Articles 13–14).
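As a rough picture of what "carving up" 10.200.0.0/16 per node could look like, the sketch below slices the range with Python's standard ipaddress module. The /24-per-node size and the worker names are assumptions made for illustration only; the split we actually use is decided in the networking articles.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("10.200.0.0/16")  # value of --cluster-cidr
NODE_PREFIX = 24  # assumption for this sketch: one /24 per node


def pod_cidr_for(node_index):
    """Return the node_index-th /24 inside the cluster range."""
    first = int(CLUSTER_CIDR.network_address)
    size = 2 ** (32 - NODE_PREFIX)  # addresses per node block
    subnet = ipaddress.ip_network((first + node_index * size, NODE_PREFIX))
    if not subnet.subnet_of(CLUSTER_CIDR):
        raise ValueError(f"node index {node_index} falls outside {CLUSTER_CIDR}")
    return subnet


if __name__ == "__main__":
    # Hypothetical worker names, used only to label the output.
    for i, node in enumerate(["worker-0", "worker-1", "worker-2"]):
        print(node, pod_cidr_for(i))
```

With these assumptions a /16 yields 256 node-sized blocks, so the range only becomes a constraint for clusters far larger than ours.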

scheduler: filter, then score

kube-scheduler watches for pods not yet assigned to a node (spec.nodeName empty), and for each pod, decides which node it should run on. That decision goes through two phases:

   All nodes
       │
       ▼  PHASE 1 — FILTERING
   drop nodes that DON'T qualify:
   not enough CPU/RAM, doesn't match nodeSelector/affinity,
   has a taint the pod doesn't tolerate, port already in use...
       │
       ▼  (the feasible nodes remain)
       ▼  PHASE 2 — SCORING
   give each feasible node a score by various criteria
   (load balancing, preference for spreading, data locality...)
       │
       ▼
   highest-scoring node ──► write spec.nodeName onto the pod (via api-server)

The scheduler only chooses and writes the node name onto the pod; it doesn't start the container. Starting the container is the job of the kubelet on the chosen node (Article 11). This is the kind of clean division of roles that's very Kubernetes: each component does exactly one thing, then leaves the result for the next component via state in the cluster.
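The two phases can be sketched in a few lines of Python. This is a deliberately simplified model, not the scheduler's real plugin framework: the Node and Pod classes, their fields, and the single "most CPU left" scoring rule are hypothetical stand-ins for the many filters and scorers the real scheduler runs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                              # free CPU, in millicores
    free_mem_mi: int                             # free memory, in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def feasible(pod, node):
    """Phase 1: drop nodes that cannot run the pod at all."""
    fits = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return fits and selected and tolerated


def score(pod, node):
    """Phase 2 (toy rule): prefer the node with the most CPU left afterwards."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod has to stay pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


if __name__ == "__main__":
    nodes = [
        Node("node-a", free_cpu_m=500, free_mem_mi=1024),
        Node("node-b", free_cpu_m=2000, free_mem_mi=4096, taints={"gpu"}),
        Node("node-c", free_cpu_m=1500, free_mem_mi=2048),
    ]
    print(schedule(Pod("web", cpu_m=1000, mem_mi=512), nodes))
```

Note that schedule only returns a name. In the real system that name is written to spec.nodeName through the api-server, and nothing starts until the kubelet on that node notices.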

Leader election: three instances, one does the work

Both the controller-manager and the scheduler run three instances (one per controller). If all three were active, we'd have three schedulers assigning a node to a single pod, or three ReplicaSet controllers each creating pods — chaos. The solution is leader election: the three instances compete to hold a lock, and only the lock holder does the work.

That lock is a Lease object in the kube-system namespace. Whichever instance holds the Lease periodically "renews" it; if it dies and stops renewing, the Lease expires (after 15 seconds with the default settings, which we don't override) and another instance grabs it and takes over. We enable this mechanism with --leader-elect=true (controller-manager) and leaderElection.leaderElect: true (scheduler).
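The renew-or-take-over rule is easier to see in a small simulation. The sketch below is an in-memory toy: the Lease class and try_acquire_or_renew function are hypothetical, the 15-second duration and 2-second retry interval are assumptions chosen to resemble typical defaults, and the real implementation additionally relies on the api-server rejecting conflicting writes, which is left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    """Toy stand-in for the Lease object stored in kube-system."""
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # seconds; assumption for this sketch


def try_acquire_or_renew(lease, candidate, now):
    """Return True if `candidate` is the leader after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == candidate or expired:
        # Either we already hold the lock (renew) or nobody does (take over).
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False  # someone else holds a live lease: stay idle and retry later


if __name__ == "__main__":
    lease = Lease()
    alive = {"controller-0", "controller-1", "controller-2"}
    for now in range(0, 40, 2):          # every instance retries every 2 s
        if now == 10:
            alive.discard(lease.holder)  # the current leader dies
        for name in sorted(alive):
            try_acquire_or_renew(lease, name, float(now))
        print(f"t={now:>2}s holder={lease.holder}")
```

Running it shows the holder staying on the dead instance until the lease runs out, and only then moving to a survivor. That gap is the price of not needing any direct communication between the three instances.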

Step 1 — Get the kubeconfigs onto the controllers

The kube-controller-manager and kube-scheduler binaries were already downloaded in Article 7. Now we just need to get the two corresponding kubeconfigs (created in Article 5, pointing at 127.0.0.1:6443) onto each controller:

# from the workstation, in ~/k8s-scratch/pki
for h in controller-0 controller-1 controller-2; do
  scp kube-controller-manager.kubeconfig kube-scheduler.kubeconfig ${h}:/tmp/
  ssh $h 'sudo mv /tmp/kube-controller-manager.kubeconfig /tmp/kube-scheduler.kubeconfig /var/lib/kubernetes/'
done

ca.pem, ca-key.pem, and service-account-key.pem are already in /var/lib/kubernetes from Article 7, so there's no need to copy them again.

Step 2 — systemd unit for the controller-manager

The unit is identical on all three controllers (leader election handles coordination, so there's no need to differentiate by machine):

[Unit]
Description=Kubernetes Controller Manager
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
  --bind-address=0.0.0.0 \
  --cluster-cidr=10.200.0.0/16 \
  --cluster-name=kubernetes \
  --cluster-signing-cert-file=/var/lib/kubernetes/ca.pem \
  --cluster-signing-key-file=/var/lib/kubernetes/ca-key.pem \
  --kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
  --authentication-kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
  --authorization-kubeconfig=/var/lib/kubernetes/kube-controller-manager.kubeconfig \
  --leader-elect=true \
  --root-ca-file=/var/lib/kubernetes/ca.pem \
  --service-account-private-key-file=/var/lib/kubernetes/service-account-key.pem \
  --service-cluster-ip-range=10.32.0.0/24 \
  --use-service-account-credentials=true \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

--use-service-account-credentials=true is worth a note: it makes each internal control loop use its own ServiceAccount when calling the api-server, rather than sharing a single identity. That way RBAC grants each loop the minimum permissions it needs.

Step 3 — Config and unit for the scheduler

The modern scheduler takes its configuration via a KubeSchedulerConfiguration file instead of cramming everything into flags. Create that file, point it at the kubeconfig, and enable leader election:

sudo tee /var/lib/kubernetes/kube-scheduler.yaml >/dev/null <<'EOF'
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /var/lib/kubernetes/kube-scheduler.kubeconfig
leaderElection:
  leaderElect: true
EOF

The scheduler's unit is therefore very short:

[Unit]
Description=Kubernetes Scheduler
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-scheduler \
  --config=/var/lib/kubernetes/kube-scheduler.yaml \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Write both units and the config file to each controller, then reload and enable:

sudo systemctl daemon-reload
sudo systemctl enable kube-controller-manager kube-scheduler

Step 4 — Start and verify

Start both services on the three controllers:

for h in controller-0 controller-1 controller-2; do
  ssh $h 'sudo systemctl start kube-controller-manager kube-scheduler'
done

Check the status:

for h in controller-0 controller-1 controller-2; do
  printf "%-14s cm=%s sched=%s\n" "$h" \
    "$(ssh $h 'systemctl is-active kube-controller-manager')" \
    "$(ssh $h 'systemctl is-active kube-scheduler')"
done
controller-0   cm=active sched=active
controller-1   cm=active sched=active
controller-2   cm=active sched=active

When a service won't come up, read the logs, don't guess. The first time I ran this, kube-controller-manager on controller-0 kept activating then restarting in a loop. journalctl -u kube-controller-manager showed status=203/EXEC, the systemd error code for "couldn't execute the binary file". Comparing file sizes revealed the cause: the binary on controller-0 was only 12MB, while on controller-1 it was 74MB. It had been truncated on download in Article 7 (exactly the mid-transfer curl trap we warned about). Re-download a complete copy, then systemctl restart, and you're done:

ls -l /usr/local/bin/kube-controller-manager   # 12582912: way too small!
sudo curl -fSL -o /usr/local/bin/kube-controller-manager \
  https://dl.k8s.io/release/v1.36.1/bin/linux/amd64/kube-controller-manager
sudo chmod +x /usr/local/bin/kube-controller-manager
sudo systemctl restart kube-controller-manager
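Comparing sizes by eye worked here, but a checksum is a more reliable way to catch a truncated or corrupted binary. The helper below is a small sketch using only Python's standard library; the function names are hypothetical, and the expected digest is something you supply yourself, for example the digest of a known-good copy on another controller.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a large binary is never loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_sha256):
    """True if the file's digest matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical script name): python3 check_binary.py <path> <expected-sha256>
    path, expected = sys.argv[1], sys.argv[2]
    print("OK" if is_intact(path, expected) else "MISMATCH: re-download the file")
```

Run it once on a controller where the service starts fine to obtain the digest, then run it with that digest on the machine that misbehaves.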

Now the fun part: watching leader election. Each of the two components has its own Lease in kube-system, and the holderIdentity field tells you which instance holds it. Call the api-server with the admin cert:

C="--cacert ca.pem --cert admin.pem --key admin-key.pem"
curl -s $C "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases" \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); [print(i["metadata"]["name"],"->",i["spec"].get("holderIdentity")) for i in d["items"]]'
apiserver-tyzveoctxbnh6lvbkdt2xqncle -> apiserver-tyzveoctxbnh6lvbkdt2xqncle_14cfdd62-...
apiserver-wlwnjbc6t2b2g566g3usjzlbdm -> apiserver-wlwnjbc6t2b2g566g3usjzlbdm_979b5321-...
apiserver-zjy5m4qienmmbtxvgklj3tgz6i -> apiserver-zjy5m4qienmmbtxvgklj3tgz6i_18335984-...
kube-controller-manager -> controller-1_2bc8f705-33eb-46f5-bf37-a7c9946369a6
kube-scheduler          -> controller-0_422ddcd2-4dd9-46a2-b7d4-819843c626c0

Reading the result: the controller-manager leader is controller-1, and the scheduler leader is controller-0. The other two instances of each type are running idle, waiting their turn. (The three apiserver-... leases are a different matter: they're the HA identities each api-server registers for itself, not leader election.)
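If the one-liner is hard to read, the same parsing can be written as a small function. This sketch works on a hand-written sample that mirrors the shape of the output above rather than on a live api-server response; the lease_holders name is hypothetical, and the "hostname, underscore, UUID" layout of holderIdentity is inferred from that output.

```python
# Trimmed sample mirroring the two leader-election leases shown above.
SAMPLE = {
    "items": [
        {"metadata": {"name": "kube-controller-manager"},
         "spec": {"holderIdentity": "controller-1_2bc8f705-33eb-46f5-bf37-a7c9946369a6"}},
        {"metadata": {"name": "kube-scheduler"},
         "spec": {"holderIdentity": "controller-0_422ddcd2-4dd9-46a2-b7d4-819843c626c0"}},
    ]
}


def lease_holders(lease_list):
    """Map each Lease name to the host part of its holderIdentity."""
    holders = {}
    for item in lease_list.get("items", []):
        name = item["metadata"]["name"]
        identity = item.get("spec", {}).get("holderIdentity")
        # Keep everything before the last underscore; None if nobody holds it.
        holders[name] = identity.rsplit("_", 1)[0] if identity else None
    return holders


if __name__ == "__main__":
    for name, host in sorted(lease_holders(SAMPLE).items()):
        print(f"{name:<24} -> {host}")
```

To use it against the cluster, replace SAMPLE with the parsed JSON from the curl call, for instance json.load(sys.stdin) when piping the response into the script.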

If you now stop the controller-manager on controller-1, one of the two remaining instances grabs the Lease once it expires (usually within about 15 seconds under the default lease settings), and holderIdentity switches to that machine's name. That is exactly the fault tolerance we designed. You can try it yourself to see it firsthand.

🧹 Cleanup

Both services are permanent components; don't shut them down. Remember to delete the temporary cert files if you copied them up to a controller to run the curl checks.

Wrap-up

The control plane is essentially complete: etcd stores state, the api-server is the entry point, the controller-manager runs the control loops, and the scheduler picks nodes — and both of the latter two elected a leader correctly. What's worth taking away from this article isn't just the flags, but seeing that leader election is a concrete mechanism (a Lease, a holderIdentity) rather than an abstract concept.

But so far we still call the api-server through 127.0.0.1 on each controller, and there are no workers yet. Article 9 stands up HAProxy on lb-0 to consolidate the three api-servers into one address, configures kubectl on your laptop to point at its Elastic IP, and sets up the RBAC so the api-server is allowed to call down to the kubelet — finishing the preparation before we add workers to the cluster in Article 10.