Kubernetes Architecture, Up Close: Loops, Watches and the API Server

In the minikube series we got the big picture: the control plane is the brain, nodes are where applications run, and everything goes through the api-server. That's the map level. This article zooms into the mechanics — not what the components are (you already know), but how they coordinate, even though no component calls another directly. Understand this mechanism and everything we do later — bootstrapping each binary then wiring them together — has a frame to slot into.

Desired state and the control loop

Kubernetes' founding mindset is declarative: you submit an object describing the desired state ("I want 3 nginx replicas"), then Kubernetes pulls reality toward it. The previous series stopped there. Now add a layer. The thing that pulls reality into line is a control loop, and that loop is level-triggered, not edge-triggered.

This distinction matters. Edge-triggered means reacting to events: "a pod just died, recreate it." Level-triggered means reacting to the current state: "there are 2 pods, I want 3, create 1 more," regardless of how it ended up that way.

   EDGE-triggered                    LEVEL-triggered (Kubernetes uses this one)
   ───────────────────────          ─────────────────────────────────────────
   hear "pod died" → +1             see "have 2, want 3" → +1
   what if an event is MISSED?      missing anything is fine:
   → forever short 1 pod            next pass still sees "2 ≠ 3" → fix

The level-triggered approach lets the system self-heal even after a controller itself has failed. A controller can die, miss a few events, then restart, and it's still fine, because the next reconciliation looks at the current state and fixes it. No missed event leaves a lasting consequence. This is one of the reasons Kubernetes is resilient: it doesn't rely on memory of past events, it continuously compares the current state against the desired one.
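To make the difference concrete, here is a minimal sketch in Python. It is a toy model, not Kubernetes code: the function names and the "pod-died" event string are made up for illustration. Two pods die, but only one of the two events reaches the controller.

```python
# Toy model, not Kubernetes code: compare an edge-triggered and a
# level-triggered controller when one "pod-died" event is lost.

DESIRED = 3


def edge_triggered(pods, delivered_events):
    """React only to the events that were actually delivered."""
    for event in delivered_events:
        if event == "pod-died":
            pods += 1  # recreate one pod per event heard
    return pods


def level_triggered(pods, desired=DESIRED):
    """Ignore history; act on the gap between current and desired."""
    diff = desired - pods  # positive: create, negative: delete
    return pods + diff


# Two pods died (3 -> 1), but the second event never arrived.
pods_after_failures = 1
delivered_events = ["pod-died"]

edge_result = edge_triggered(pods_after_failures, delivered_events)
level_result = level_triggered(pods_after_failures)

print("edge-triggered ends with", edge_result, "pods")    # 2: short by one
print("level-triggered ends with", level_result, "pods")  # 3: converged
```

The edge-triggered version stays one pod short until someone notices. The level-triggered version never looks at the event list at all, so a lost event has nothing to break.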

   ┌──────────────── CONTROL LOOP (runs continuously) ───────────┐
   │                                                             │
   │   OBSERVE  ──►  current state = ?   (read from api-server)  │
   │      │                                                      │
   │      ▼                                                      │
   │   COMPARE  ──►  current  vs  desired                        │
   │      │                                                      │
   │      ▼                                                      │
   │   ACT      ──►  if diverged: act to match (write via api)   │
   │      │                                                      │
   │      └──────────────────────► repeat                        │
   └─────────────────────────────────────────────────────────────┘
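The loop in the diagram can be sketched in a few lines of Python. This is a toy under stated assumptions: `cluster` is a plain dict standing in for state read from and written through the api-server, and `reconcile_once` and `run_loop` are hypothetical names, not Kubernetes APIs.

```python
# Toy model: "cluster" stands in for state behind the api-server.

def reconcile_once(cluster):
    """One pass of observe -> compare -> act. Returns the action taken."""
    current = len(cluster["pods"])              # OBSERVE
    desired = cluster["desired_replicas"]
    if current < desired:                       # COMPARE
        cluster["next_id"] += 1
        cluster["pods"].append(f"pod-{cluster['next_id']}")  # ACT: create one
        return "created"
    if current > desired:
        cluster["pods"].pop()                   # ACT: delete one
        return "deleted"
    return "noop"


def run_loop(cluster, max_passes=10):
    """Repeat until a pass changes nothing (a real loop never stops)."""
    actions = []
    for _ in range(max_passes):
        action = reconcile_once(cluster)
        actions.append(action)
        if action == "noop":
            break
    return actions


cluster = {"desired_replicas": 3, "pods": ["pod-1", "pod-2"], "next_id": 2}
first_run = run_loop(cluster)
second_run = run_loop(cluster)
print(first_run)    # ['created', 'noop']
print(second_run)   # ['noop']
```

Running the loop a second time does nothing, which is the property that makes it safe to run continuously: each pass is idempotent once current matches desired.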

controller-manager bundles many small controllers

What we casually call "controller-manager" is really a single process bundling dozens of small controllers, each handling one kind of object and running exactly the loop above:

Deployment controller sees a Deployment and creates or updates a ReplicaSet to match.
ReplicaSet controller sees a ReplicaSet wanting N replicas, counts the actual pods, and creates or deletes to reach N.
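Node controller watches nodes; when a node stops sending heartbeats it marks the node as not ready, and after a grace period the pods on it are evicted so their owning controllers can recreate them elsewhere.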
EndpointSlice controller connects a Service to the list of pods behind it.
ServiceAccount controller, Job controller, CronJob controller — each one small job.

These controllers don't call each other. The Deployment controller doesn't command the ReplicaSet controller; it just creates a ReplicaSet object through the api-server and stops there. The ReplicaSet controller, which is already watching ReplicaSets, sees the new one and gets to work. They communicate indirectly through state stored in the cluster, not through direct calls. Because of this, adding a new controller doesn't require touching the old ones; it just needs to watch the kind of object it cares about.

   kubectl creates a Deployment
        │  (write into the cluster via api-server)
        ▼
   [Deployment ctrl] sees new Deployment ──► creates a ReplicaSet
        │
        ▼
   [ReplicaSet ctrl] sees new ReplicaSet ──► creates 3 Pods (no node yet)
        │
        ▼
   [scheduler] sees Pods without a node ──► assigns a node to each Pod
        │
        ▼
   [kubelet on that node] sees a Pod assigned to it ──► tells containerd to run it

No step calls another; each component just watches and reacts to state.
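A small Python sketch of that chain, with everything simplified: `store` is a dict standing in for cluster state behind the api-server, and the three functions are hypothetical stand-ins for the real components. The controllers are deliberately run in the "wrong" order to show that none of them depends on being called by another.

```python
# Toy model: each controller only reads and writes the shared store.
store = {"deployments": {}, "replicasets": {}, "pods": {}}
NODES = ["node-a", "node-b"]  # assumed node names


def deployment_controller(store):
    for name, dep in store["deployments"].items():
        rs_name = name + "-rs"
        if rs_name not in store["replicasets"]:
            store["replicasets"][rs_name] = {"replicas": dep["replicas"]}


def replicaset_controller(store):
    for rs_name, rs in store["replicasets"].items():
        owned = [p for p in store["pods"].values() if p["owner"] == rs_name]
        for i in range(len(owned), rs["replicas"]):
            store["pods"][f"{rs_name}-{i}"] = {"owner": rs_name, "node": None}


def scheduler(store):
    unscheduled = [p for p in store["pods"].values() if p["node"] is None]
    for i, pod in enumerate(unscheduled):
        pod["node"] = NODES[i % len(NODES)]  # naive round-robin placement


# Order is irrelevant: every controller just reacts to what it sees.
CONTROLLERS = [scheduler, replicaset_controller, deployment_controller]

store["deployments"]["nginx"] = {"replicas": 3}  # "kubectl creates a Deployment"
for _ in range(3):
    for controller in CONTROLLERS:
        controller(store)

for name, pod in store["pods"].items():
    print(name, "->", pod["node"])
```

After three passes the Deployment has produced a ReplicaSet, the ReplicaSet has produced three Pods, and every Pod has a node, without any function calling another.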

The list-watch mechanism

If every controller polled the api-server every few seconds with "anything new?", then with thousands of objects the api-server would be overwhelmed. Kubernetes solves this with a leaner mechanism: list-watch.

List: at startup, the controller asks the api-server for all objects of type X currently present, and gets them back along with a resourceVersion number marking the moment of that snapshot.
Watch: the controller then opens a long-lived HTTP connection (streaming-style), saying that from this resourceVersion onward, push any change to it. The api-server doesn't make it poll again; it actively pushes each change (add, update, delete) over that connection.

   Controller                          api-server
      │                                    │
      │── LIST type X (get all) ──────────►│
      │◄── list + resourceVersion ─────────│
      │                                    │
      │── WATCH from resourceVersion ─────►│   (connection kept open)
      │                                    │
      │◄═══ change: Pod A deleted ═════════│   ┐
      │◄═══ change: Pod B added ═══════════│   ├ pushed in real time
      │◄═══ change: Pod C updated ═════════│   ┘
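Here is a minimal sketch of the list-then-watch handshake. `ToyApiServer` is a made-up class, and its resource version is a plain counter purely for illustration; in a real cluster, clients should treat resourceVersion as an opaque value, and a watch streams changes as they happen instead of returning a list.

```python
class ToyApiServer:
    """Keeps objects plus an ordered change log, one version per change."""

    def __init__(self):
        self.objects = {}
        self.log = []                 # (resource_version, event_type, name)
        self.resource_version = 0

    def _record(self, event_type, name):
        self.resource_version += 1
        self.log.append((self.resource_version, event_type, name))

    def create(self, name):
        self.objects[name] = {}
        self._record("ADDED", name)

    def delete(self, name):
        del self.objects[name]
        self._record("DELETED", name)

    def list_objects(self):
        """Snapshot of all names plus the version the snapshot was taken at."""
        return sorted(self.objects), self.resource_version

    def watch(self, since):
        """Changes newer than `since`, in the order they happened."""
        return [entry for entry in self.log if entry[0] > since]


api = ToyApiServer()
api.create("pod-a")
api.create("pod-c")

names, rv = api.list_objects()   # LIST: snapshot + resourceVersion
api.delete("pod-a")              # changes that happen after the snapshot
api.create("pod-b")
events = api.watch(since=rv)     # WATCH: only what changed after rv

print(names, rv)   # ['pod-a', 'pod-c'] 2
print(events)      # [(3, 'DELETED', 'pod-a'), (4, 'ADDED', 'pod-b')]
```

The resource version is what ties the two calls together: the snapshot plus every change after it adds up to the current state, with nothing missed and nothing repeated.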

To avoid calling the api-server for every reconciliation, each controller keeps a local cache (the informer cache) that the watch mechanism keeps continuously updated. The reconciliation loop reads from this cache, which is fast and cheap, and only writes back to the api-server when it needs to act. That's why a cluster with thousands of pods still runs smoothly: most reads happen against an in-memory cache and the api-server isn't hammered with requests.
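The cache idea can be sketched like this. `ToyInformerCache` is a hypothetical class, far simpler than the real informer machinery; the event type names ADDED, MODIFIED and DELETED are the ones the watch API uses.

```python
class ToyInformerCache:
    """Local copy of objects, kept current by applying watch events."""

    def __init__(self, initial_objects):
        self.objects = dict(initial_objects)   # filled from the initial LIST

    def apply(self, event_type, name, obj=None):
        """Apply one watch event to the local copy."""
        if event_type in ("ADDED", "MODIFIED"):
            self.objects[name] = obj
        elif event_type == "DELETED":
            self.objects.pop(name, None)

    def pods_on_node(self, node):
        """A read served from memory, with no api-server call."""
        return sorted(n for n, o in self.objects.items() if o["node"] == node)


cache = ToyInformerCache({"pod-a": {"node": "node-1"}})
cache.apply("ADDED", "pod-b", {"node": "node-1"})
cache.apply("MODIFIED", "pod-a", {"node": "node-2"})
cache.apply("DELETED", "pod-b")

print(cache.pods_on_node("node-1"))   # []
print(cache.pods_on_node("node-2"))   # ['pod-a']
```

Reconciliation can call `pods_on_node` as often as it likes; the only traffic to the api-server is the watch stream feeding `apply` and the occasional write when the controller acts.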

Watch is also how kubelet and kube-proxy on each node learn their work: kubelet watches the Pods assigned to its node, kube-proxy watches Services and EndpointSlices (the older Endpoints objects in earlier releases). Nobody sends commands down to a node; the node watches its relevant slice and reacts.

Why everything goes through the api-server

By now it's clear why the api-server is the single gate, and why it alone is allowed to talk to etcd:

        kubectl ─┐
        kubelet ─┤
      scheduler ─┼──►  kube-apiserver  ──►  etcd
     controllers ┤      (the gate)          (state store)
     kube-proxy ─┘
                         ▲
                  authn → authz → admission → validate → write

Every request entering the api-server is authenticated and authorized first. A write then continues through the rest of a fixed pipeline before it touches etcd:

Authentication (authn) — who are you? Verified by client certificate, token, or another mechanism. This is why, throughout the whole series, we create a certificate for each component: so they can prove their identity when calling the api-server.
Authorization (authz) — are you allowed to do this? Usually via RBAC.
Admission control — is this request valid, or does it need adjusting? Admission controllers can reject it, or mutate it (set default values, inject a sidecar).
Validation and write — check the object matches the schema, then write it into etcd.
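The four stages can be sketched as a chain of plain functions. Everything here is an assumption made for illustration: the token table, the rule set, the `replicas` default and the `etcd` dict are toy stand-ins, and the real api-server is more granular (for instance, it runs validating admission after schema validation).

```python
class Rejected(Exception):
    """Raised by any stage that refuses the request."""


TOKENS = {"token-alice": "alice"}                # hypothetical credentials
ALLOWED = {("alice", "create", "deployments")}   # hypothetical RBAC-like rules
etcd = {}                                        # stand-in for the state store


def authenticate(request):
    user = TOKENS.get(request.get("token"))
    if user is None:
        raise Rejected("401 Unauthorized")
    return user


def authorize(user, request):
    if (user, request["verb"], request["resource"]) not in ALLOWED:
        raise Rejected("403 Forbidden")


def admit(obj):
    """Mutating step: fill in a default, as an admission controller might."""
    obj = dict(obj)
    obj.setdefault("replicas", 1)
    return obj


def validate(obj):
    replicas = obj["replicas"]
    if not isinstance(obj.get("name"), str):
        raise Rejected("422 Unprocessable Entity")
    if not isinstance(replicas, int) or replicas < 0:
        raise Rejected("422 Unprocessable Entity")


def api_server(request):
    """Run the stages in order; stop at the first one that rejects."""
    try:
        user = authenticate(request)
        authorize(user, request)
        obj = admit(request["object"])
        validate(obj)
    except Rejected as err:
        return str(err)
    etcd[obj["name"]] = obj      # reached only if every stage passed
    return "201 Created"


good = {"token": "token-alice", "verb": "create",
        "resource": "deployments", "object": {"name": "nginx"}}
print(api_server(good))                            # 201 Created
print(api_server({**good, "token": "unknown"}))    # 401 Unauthorized
print(api_server({**good, "verb": "delete"}))      # 403 Forbidden
```

Because the write to `etcd` sits behind all four checks in one function, there is no path to the store that skips any of them, which is the whole point of the single gate.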

Putting all four stages in one place gives us three things we wouldn't have if each component freely read and wrote etcd: consistent authentication and authorization, a single point to write audit logs, and a single source of state. If kubelet were allowed to write straight to etcd, all three guarantees would be lost. So the rule is that only the api-server touches etcd, and every other component touches the api-server.

Where a `kubectl apply` command goes

Let's put it all together with one example. You run kubectl apply -f nginx-deployment.yaml, with a Deployment wanting 3 replicas. Here's what happens:

 1. kubectl  ──POST /apis/apps/v1/.../deployments──►  api-server
              (with a client cert for authentication)

 2. api-server: authn → authz (RBAC) → admission → validate
              → writes the Deployment object into etcd
              → returns 201 Created to kubectl   (your command ends here)

 3. Deployment controller (WATCHing Deployments) sees the new one
              → creates a ReplicaSet (write via api-server → etcd)

 4. ReplicaSet controller (WATCHing ReplicaSets) sees "want 3, have 0"
              → creates 3 Pod objects, spec.nodeName still EMPTY

 5. scheduler (WATCHing Pods without a node) sees 3 pods with no node
              → filters and scores the nodes, then binds each pod to one
                (which sets spec.nodeName)

 6. kubelet on each node (WATCHing Pods assigned to it) sees a new pod
              → calls containerd (via CRI) to pull the image, run the container
              → reports status "Running" back up to api-server → etcd

Notice step 2: your kubectl command returns success the moment the Deployment is written into etcd — that is, before any container runs. The rest happens asynchronously, handled gradually by the control loops. That's the essence of the declarative model: you don't create a container, you record a wish, then a sequence of loops turns that wish into reality. When you type kubectl get pods a few seconds later and see ContainerCreating then Running, that's you watching the loops do their work.

HA: three control plane copies that don't step on each other

This series stands up three control planes, which raises a question: if all three controller-managers run at once, will all three create pods, ending up with triple the count?

No, thanks to leader election. In HA mode, the controller-manager copies compete to hold a "lock" (a Lease object in the cluster), and the scheduler copies do the same with a Lease of their own. Only the copy holding the lock works; the other two sit idle and wait. If the leader dies and stops renewing the Lease, one of the other two grabs it once it expires and takes over, within seconds (the default lease duration is 15 seconds). So you get redundancy without conflict.

   controller-mgr-0   controller-mgr-1   controller-mgr-2
        │                  │                  │
        └──── compete for Lease "kube-controller-manager" ────┘
                           │
                  only ONE wins = leader (does the work)
                  the other two wait; leader dies → re-elect
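The lock-with-expiry idea can be sketched as follows. `ToyLease` is a hypothetical class and time is passed in by hand to keep the example deterministic. The real implementation also relies on the api-server rejecting conflicting updates to the Lease object, which this single-threaded toy doesn't need.

```python
class ToyLease:
    """A lock with an expiry time; the holder must keep renewing it."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.renewed_at = None

    def try_acquire_or_renew(self, candidate, now):
        """Return True if `candidate` holds the lease after this call."""
        expired = self.holder is None or now - self.renewed_at >= self.duration
        if self.holder == candidate or expired:
            self.holder = candidate
            self.renewed_at = now
            return True
        return False


lease = ToyLease(duration=15)   # seconds; an assumed value
replicas = ["controller-mgr-0", "controller-mgr-1", "controller-mgr-2"]

# t=0: everyone tries; the first attempt wins, the rest stay on standby.
results_t0 = {r: lease.try_acquire_or_renew(r, now=0) for r in replicas}

# t=10: the leader renews, so a challenger at t=12 is still refused.
lease.try_acquire_or_renew("controller-mgr-0", now=10)
challenger_t12 = lease.try_acquire_or_renew("controller-mgr-1", now=12)

# The leader dies and stops renewing; by t=25 the lease has expired.
takeover_t25 = lease.try_acquire_or_renew("controller-mgr-1", now=25)

print(results_t0)
print(challenger_t12, takeover_t25, lease.holder)
```

Notice that nobody announces the leader's death. The standby copies simply keep trying, and the expired timestamp is what lets one of them through, which is the level-triggered idea again.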

The api-server is different: all three copies serve in parallel, because they hold no state of their own — all state lives in etcd. That's why we put a load balancer in front of the three api-servers (Article 9): clients see only one address, load spreads evenly, and if one api-server dies the other two carry on. etcd is different yet again: it needs quorum (a majority) to reach consensus, and that's the topic of Article 6.
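As a small preview of the quorum arithmetic, here is a sketch that assumes quorum simply means a strict majority of members; the function names are made up for this example.

```python
def quorum(members):
    """Smallest strict majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, "members: quorum", quorum(n), "tolerates", tolerated_failures(n))
```

With three members the quorum is two, so one member can fail. A fourth member raises the quorum to three without tolerating any extra failure, which is why odd sizes are the usual choice.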

The picture to keep

From the next article we start building. Each time you install a binary, you can ask yourself where it sits in this picture:

   ┌──────────────────── CONTROL PLANE (×3, HA) ────────────────────┐
   │  etcd ◄──(it alone)── api-server ──► authn/authz/admission      │
   │                              ▲   ▲                              │
   │            (leader) scheduler┘   └controller-manager (leader)   │
   └──────────────────────────────┬─────────────────────────────────┘
                          (every component talks via the api-server)
              ┌─────────────────────┴─────────────────────┐
              ▼                                            ▼
        ┌── worker ──┐                              ┌── worker ──┐
        │ kubelet ───┼── watch its own Pods         │ kubelet    │
        │ kube-proxy ┼── watch Service/Endpoint     │ kube-proxy │
        │ containerd ┴── run containers (via CRI)   │ containerd │
        └────────────┘                              └────────────┘

In the coming articles, each arrow in this diagram becomes a real line of config: an --etcd-servers flag, a path to a certificate, a load balancer address. At that point the architecture is no longer theory; it is a set of processes that have been told each other's addresses and identities.

There's one thing that appears all over the diagram above that we haven't dissected yet: the certificate. The api-server needs to know the kubelet really is the kubelet; the kubelet needs to know it's talking to the real api-server and not an impostor; etcd needs to know that only the api-server may read it. All of this rests on a PKI/TLS system we have to build ourselves. Article 2 explains why a Kubernetes cluster needs so many certificates, and who signs for whom — the foundation for everything we type from Article 4 onward.
