The scheduler and the scheduling framework

Throughout the series, every pod we created "just naturally" ran on some node — worker-0 or worker-1 — without us asking who decided. The answer is kube-scheduler, the component we stood up back in Article 8 and glimpsed in Article 17 (the request lifecycle). Part VII is dedicated to scheduling — "which pod runs on which node" — and this opening article digs into exactly how the scheduler decides: it filters out nodes that don't fit, scores the remaining nodes, then binds. Understanding these two steps is the foundation for everything that follows (affinity, taints, topology spread).

The role: find a node for a pod that has none

The docs put it tersely: "A scheduler watches for newly created Pods that have no Node assigned. For every Pod that the scheduler discovers, the scheduler becomes responsible for finding the best Node for that Pod to run on." A newly created pod has an empty spec.nodeName (Article 17); the scheduler watches, picks a node, then "notifies the API server about this decision in a process called binding." A node that qualifies is called a feasible node: "Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it."

The selection splits into two steps, verbatim:

Filtering — "finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests."
Scoring — "the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering." Then "kube-scheduler assigns the Pod to the Node with the highest ranking."
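The two steps can be pictured as a short pipeline. The Python sketch below is illustrative only: the node and pod dictionaries, the `fits_cpu` filter, and the `free_cpu` scorer are hypothetical stand-ins, not the scheduler's real plugins or data structures, and the CPU figures are made up for the example.

```python
from typing import Optional

# Hypothetical shapes, chosen only for this sketch:
#   node = {"name": str, "allocatable_m": int, "requested_m": int}
#   pod  = {"name": str, "cpu_m": int}


def fits_cpu(pod: dict, node: dict) -> bool:
    """Filter: keep the node only if its free CPU covers the pod's request."""
    return node["allocatable_m"] - node["requested_m"] >= pod["cpu_m"]


def free_cpu(pod: dict, node: dict) -> int:
    """Score: more CPU left after placing the pod means a higher score."""
    return node["allocatable_m"] - node["requested_m"] - pod["cpu_m"]


def schedule(pod: dict, nodes: list, filters: list, scorers: list) -> Optional[str]:
    # Step 1, filtering: a node must pass every filter to be feasible.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no feasible node: the pod stays unscheduled
    # Step 2, scoring: sum the scores and pick the highest-ranked node.
    best = max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))
    return best["name"]


nodes = [
    {"name": "worker-0", "allocatable_m": 2000, "requested_m": 100},
    {"name": "worker-1", "allocatable_m": 2000, "requested_m": 1100},
]
print(schedule({"name": "small", "cpu_m": 500}, nodes, [fits_cpu], [free_cpu]))   # worker-0
print(schedule({"name": "huge", "cpu_m": 3000}, nodes, [fits_cpu], [free_cpu]))   # None
```

Returning `None` corresponds to the "remains unscheduled" case in the quote above; the real scheduler retries such pods later rather than giving up.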

Filter: drop nodes that don't fit

The clearest way to see filtering is to create a pod no node can hold. Each worker has Allocatable cpu = 2 (Article 32); asking for 3 CPU means the filter drops them all:

apiVersion: v1
kind: Pod
metadata: {name: too-greedy}
spec:
  containers:
  - name: c
    image: busybox:1.36
    command: ["sleep","3600"]
    resources: {requests: {cpu: "3"}}     # > every node's Allocatable 2

kubectl get pod too-greedy -o wide
kubectl get event --field-selector involvedObject.name=too-greedy | grep FailedScheduling

NAME         READY   STATUS    NODE
too-greedy   0/1     Pending   <none>

Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.
  ... preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.

The pod is stuck Pending with an empty nodeName: "the pod remains unscheduled until the scheduler is able to place it." The event message tells the whole filter story. 0/2 nodes are available: 2 Insufficient cpu comes from the NodeResourcesFit filter (old name PodFitsResources), which compares the request against Allocatable (the very thing Article 32 dug into), not Capacity. The second line, preemption: ... Preemption is not helpful for scheduling, comes from the PostFilter step: when no feasible node remains, the scheduler considers evicting lower-priority pods to make room. Here that cannot work, because the 3 CPU request exceeds each node's entire Allocatable of 2, so no eviction would free enough CPU regardless of priority (the priority article covers preemption in detail). The pod stays Pending until some node can offer 3 free CPU, which neither current worker ever can.
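To make the arithmetic behind that event concrete, here is a small Python sketch. It is a simplification under stated assumptions: only CPU is checked, Allocatable is 2 CPU per worker as in the text, the 100m already requested on each node is an illustrative figure, and `cpu_verdict` and `summarize` are hypothetical helpers that only mimic the shape of the event message.

```python
# Assumed state: Allocatable cpu = 2 per worker (from the text);
# the "requested_m" values are illustrative.
nodes = {
    "worker-0": {"allocatable_m": 2000, "requested_m": 100},
    "worker-1": {"allocatable_m": 2000, "requested_m": 100},
}


def cpu_verdict(request_m, node):
    """Return (fits, eviction_could_help) for one node."""
    free_m = node["allocatable_m"] - node["requested_m"]
    fits = request_m <= free_m
    # Evicting pods can free at most the node's whole Allocatable.
    eviction_could_help = (not fits) and request_m <= node["allocatable_m"]
    return fits, eviction_could_help


def summarize(request_m, nodes):
    """Return (message, whether eviction could help on any node)."""
    verdicts = [cpu_verdict(request_m, n) for n in nodes.values()]
    feasible = sum(1 for fits, _ in verdicts if fits)
    rejected = len(verdicts) - feasible
    msg = f"{feasible}/{len(verdicts)} nodes are available"
    if rejected:
        msg += f": {rejected} Insufficient cpu."
    return msg, any(helps for _, helps in verdicts)


print(summarize(3000, nodes))
# ('0/2 nodes are available: 2 Insufficient cpu.', False)
```

A request of 1950m would also fail the filter on these nodes, but there eviction could in principle help, because 1950m is still below the 2000m Allocatable.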

Score + Bind: pick the best node

What about a pod that does fit? Ask for 1 CPU:

# pod fits-fine, requests cpu: "1"

kubectl get pod fits-fine -o wide
kubectl get event --field-selector involvedObject.name=fits-fine,reason=Scheduled

NAME        READY   STATUS    NODE
fits-fine   1/1     Running   worker-1

Normal  Scheduled  Successfully assigned default/fits-fine to worker-1

Past the filter (both nodes have enough room for 1 CPU), the scheduler scores the nodes and then binds. The Scheduled event, Successfully assigned default/fits-fine to worker-1, marks the binding cycle: the scheduler tells the API server to set nodeName=worker-1. The decision was made by default-scheduler, the very HA kube-scheduler we stood up in Article 8:

kubectl get pod fits-fine -o jsonpath='{.spec.schedulerName}{"\n"}'
kubectl get lease kube-scheduler -n kube-system -o jsonpath='{.spec.holderIdentity}{"\n"}'

default-scheduler
controller-0_422ddcd2-...

The pod's schedulerName is default-scheduler, and the kube-scheduler Lease shows that the current leader is controller-0 (three replicas for HA, one leader, exactly as in Article 8). Every pod that doesn't declare its own schedulerName is handled by this scheduler.
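The selection rule (no node assigned yet, and a matching schedulerName) can be sketched in a few lines of Python. This is only a model of the rule described above: the pod dictionaries, the `pods_for` helper, and the `my-scheduler` name are hypothetical.

```python
DEFAULT_SCHEDULER = "default-scheduler"


def pods_for(scheduler_name, pods):
    """Names of pods this scheduler should place: unbound and addressed to it."""
    return [
        p["name"]
        for p in pods
        if not p.get("nodeName")
        and p.get("schedulerName", DEFAULT_SCHEDULER) == scheduler_name
    ]


pods = [
    {"name": "fits-fine", "nodeName": "worker-1"},             # already bound
    {"name": "too-greedy"},                                    # unbound, default scheduler
    {"name": "custom-pod", "schedulerName": "my-scheduler"},   # hypothetical second scheduler
]
print(pods_for("default-scheduler", pods))  # ['too-greedy']
print(pods_for("my-scheduler", pods))       # ['custom-pod']
```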

Scoring isn't an even split — it favors the less-loaded node

What does Score do concretely? Try creating four small pods (50m each) after fits-fine (1 CPU) already sits on worker-1, then see where they land:

# create spread-1..4, each pod requests cpu: 50m
kubectl get pods -l app=spread -o wide

spread-1 worker-0
spread-2 worker-0
spread-3 worker-0
spread-4 worker-0

All four pile onto worker-0 — not two per node. Looking at allocated resources makes it clear:

kubectl describe node worker-0 | grep -A5 "Allocated resources" | grep cpu   # 300m
kubectl describe node worker-1 | grep -A5 "Allocated resources" | grep cpu   # 1100m

worker-0:  cpu  300m (15%)
worker-1:  cpu  1100m (55%)

worker-1 already holds 1100m (fits-fine's 1 CPU + CoreDNS), while worker-0 held only 100m before these pods arrived. The default Score plugins (including NodeResourcesFit with its least-allocated strategy and NodeResourcesBalancedAllocation) favor the less-loaded node, so every small pod scores higher on worker-0 and lands there; the pile-up would only stop once the two nodes' loads were close. This is a common misunderstanding: the scheduler does not naively round-robin; it picks the node that looks emptier in order to balance load. Filter answers "which node is allowed", score answers "which node is best".
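The pile-up can be reproduced with a few lines of arithmetic. The Python sketch below is a deliberately reduced model: it scores CPU only, using a least-allocated style formula (free share after placement, scaled to 0..100), and ignores memory, plugin weights, and every other Score plugin the real scheduler combines. The starting figures (100m and 1100m requested, 2000m Allocatable) are the ones from this section.

```python
MAX_NODE_SCORE = 100

allocatable_m = {"worker-0": 2000, "worker-1": 2000}
requested_m = {"worker-0": 100, "worker-1": 1100}  # state before spread-1..4


def least_allocated_score(node, pod_cpu_m):
    """Share of CPU still free after placing the pod, scaled to 0..100."""
    after = requested_m[node] + pod_cpu_m
    return (allocatable_m[node] - after) * MAX_NODE_SCORE // allocatable_m[node]


placements = []
for i in range(1, 5):
    scores = {n: least_allocated_score(n, 50) for n in allocatable_m}
    winner = max(scores, key=scores.get)
    requested_m[winner] += 50          # the placed pod now counts as requested
    placements.append((f"spread-{i}", winner, scores))

for name, node, scores in placements:
    print(name, node, scores)
# spread-1 worker-0 {'worker-0': 92, 'worker-1': 42}
# spread-2 worker-0 {'worker-0': 90, 'worker-1': 42}
# spread-3 worker-0 {'worker-0': 87, 'worker-1': 42}
# spread-4 worker-0 {'worker-0': 85, 'worker-1': 42}
```

Each placement lowers worker-0's score only slightly, so it keeps winning by a wide margin; in this model the two nodes would only start to alternate once worker-0's requested CPU approached worker-1's 1100m.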

The scheduling framework: why this chain of steps is "pluggable"

Filter and score aren't hardcoded; they're plugins in the scheduling framework. The docs: "The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It consists of a set of plugin APIs that are compiled directly into the scheduler." Each attempt to place a pod splits into two phases: "the scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster." Scheduling cycles run serially, one pod at a time, while binding cycles may run concurrently.

Each phase has ordered extension points where plugins hook in:

SCHEDULING CYCLE (sequential)         BINDING CYCLE (parallel)
─────────────────────────            ──────────────────────────
PreFilter  → prepare / pre-check
Filter     → drop nodes that can't run  (← "Insufficient cpu" above)
PostFilter → no nodes left? try preempt  (← "Preemption not helpful")
PreScore   → prepare for score
Score      → score the remaining nodes   (← pile onto less-loaded worker-0)
NormalizeScore → normalize the scores
Reserve    → reserve resources
Permit     → allow / delay bind
                                       PreBind → work before binding
                                       Bind    → set nodeName  (← "Scheduled")
                                       PostBind → notify after bind

Everything we just observed maps here: Insufficient cpu is Filter, Preemption is not helpful is PostFilter, piling pods onto the less-loaded node is Score, and the Scheduled event is Bind. This plugin architecture makes it possible to customize scheduling (changing score weights, adding custom filters) or to run multiple schedulers side by side, with each pod selecting its scheduler via schedulerName. The remaining articles of Part VII are about steering these plugins from the pod side: affinity and taints act mainly on Filter, topology spread can act on Filter or Score depending on whenUnsatisfiable, and priority drives PostFilter (preemption).
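The order of the extension points, and the fact that PostFilter only runs when filtering leaves nothing, can be traced with a toy model. The Python sketch below is not the framework's API: `run` is a hypothetical function, the score is simply "most free CPU wins", and the free-CPU figures assume the state after fits-fine was placed (1900m free on worker-0, 900m on worker-1).

```python
SCHEDULING_CYCLE = ["PreFilter", "Filter", "PostFilter", "PreScore",
                    "Score", "NormalizeScore", "Reserve", "Permit"]
BINDING_CYCLE = ["PreBind", "Bind", "PostBind"]


def run(pod_cpu_m, free_cpu_m):
    """Return (chosen node or None, ordered list of extension points visited)."""
    trace = SCHEDULING_CYCLE[:2]                      # PreFilter, Filter
    feasible = {n: free for n, free in free_cpu_m.items() if free >= pod_cpu_m}
    if not feasible:
        # PostFilter runs only when filtering left no feasible node.
        return None, trace + ["PostFilter"]
    chosen = max(feasible, key=feasible.get)          # toy score: most free CPU
    trace = trace + SCHEDULING_CYCLE[3:]              # PreScore ... Permit
    return chosen, trace + BINDING_CYCLE              # binding sets nodeName


free = {"worker-0": 1900, "worker-1": 900}
print(run(3000, free))   # (None, ['PreFilter', 'Filter', 'PostFilter'])
print(run(50, free)[0])  # worker-0
```

The first call follows the path taken by too-greedy, and the second follows the path taken by the small spread pods, which skip PostFilter and continue through to Bind.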

🧹 Cleanup

kubectl delete pod too-greedy fits-fine --now
kubectl delete pod -l app=spread --now

Everything we created lives in the cluster as objects, so deleting them cleans up completely. The cluster returns to two CoreDNS pods. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, directory 34-scheduler.

Wrap-up

kube-scheduler (default-scheduler, the HA deployment from Article 8, with controller-0 currently holding the leader Lease) picks a node for every pod with an empty nodeName in two steps. Filter drops nodes that can't run the pod: NodeResourcesFit compares the request against Allocatable (Article 32), and if no node fits, the pod stays Pending with 0/2 nodes are available: 2 Insufficient cpu (we saw a pod asking for 3 CPU wait indefinitely). Score ranks the remaining nodes, favoring the less-loaded one (least-allocated/balanced) rather than round-robin (four small pods piled onto worker-0 because worker-1 already held 1 CPU). Then the binding cycle sets nodeName (the Scheduled event). All of it is implemented as plugins in the scheduling framework, hooked into ordered extension points (PreFilter→Filter→PostFilter→PreScore→Score→...→Bind), a pluggable architecture that the rest of Part VII will steer from the pod side.

Article 35 covers the most powerful placement controls available from the pod side: nodeAffinity (pull a pod toward nodes with a certain label; we saw DaemonSet auto-inject it in Article 26), podAffinity/anti-affinity (place a pod near or far from other pods), and taint/toleration (a node pushes pods away unless the pod tolerates the taint; we saw this pair in the DaemonSet's auto-injected tolerations).