kube-proxy: Turning a ClusterIP Into a Real Destination

Article 11 brought two workers into the cluster: kubelet running, nodes registered, but still NotReady because pod networking is missing. Before we touch pod networking (Articles 13–14), there's one more component that belongs to each worker and can be fully explained right now without CNI: kube-proxy.

kube-proxy is what makes Kubernetes Services work. And to see what it does, you first have to understand what a Service actually is.

A Service is an IP that belongs to no one

When you create a ClusterIP Service, Kubernetes hands it an IP from the Service range (in our cluster: 10.32.0.0/24, set in Article 7). That IP — the ClusterIP — is not bound to any network card. No machine owns it, no interface answers ARP for it. It's a virtual address, a pointer.

Behind the Service is a set of real pods, each with its own IP, and those IPs change constantly: pods die and their replacements come up with different addresses. The client wants a stable address to call, but the real destinations keep drifting. The Service solves this by giving the client a fixed ClusterIP; translating that ClusterIP into a real, live pod is kube-proxy's job.

kube-proxy runs on every node, watching two things from the api-server: the list of Services (which ClusterIPs exist) and the list of endpoints (which pods sit behind each Service; current releases read these from EndpointSlice objects). Whenever something changes, it rewrites the kernel's forwarding rules so that any new connection to ClusterIP:port gets DNAT'd to the podIP:port of a live endpoint, chosen at random to spread the load.

   client inside the cluster
        │  sends to 10.32.0.1:443   (ClusterIP — virtual)
        ▼
   ┌─────────────── kernel netfilter on the node ─────────────┐
   │  kube-proxy has installed iptables rules:                │
   │     10.32.0.1:443  ──DNAT──►  one of the endpoints       │
   │                               (random split)             │
   └──────────────────────────────────────────────────────────┘
        │
        ▼
   real endpoints:  10.0.1.11:6443 | 10.0.1.12:6443 | 10.0.1.13:6443

The key point: kube-proxy itself is not on the packet's path. It only installs the rules and steps aside; the DNAT is done by the kernel. So if kube-proxy dies for a while, traffic keeps flowing, for existing and new connections alike, because the rules stay in the kernel. They just stop being updated until it comes back, so any Service or endpoint change made in the meantime goes unnoticed on that node.
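One way to check this claim, once kube-proxy is installed and running (Step 3 onward), is to count the rules in its KUBE-* chains, stop the service, and count again. The sketch below is only an illustration: count_kube_rules is a hypothetical helper, not part of kube-proxy, and it reads iptables-save output on stdin so it also works on a saved dump.

```shell
# Hypothetical helper: count the rules appended to kube-proxy's KUBE-* chains.
# Reads iptables-save output on stdin; prints 0 when nothing matches.
count_kube_rules() {
  grep -c -- '^-A KUBE-' || true
}

# Usage on a worker (not run here):
#   sudo iptables-save -t nat | count_kube_rules   # with kube-proxy running
#   sudo systemctl stop kube-proxy
#   sudo iptables-save -t nat | count_kube_rules   # the count should be unchanged
#   sudo systemctl start kube-proxy
```

If both counts match, the rules survived the stop, which is exactly why Service traffic keeps working while kube-proxy is down.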

There's already a Service to try right now

We haven't created any Service yet, but the cluster has had one of its own ever since the api-server started: the kubernetes Service in the default namespace.

kubectl get svc -A
kubectl get endpoints -n default kubernetes

NAMESPACE   NAME         TYPE        CLUSTER-IP   PORT(S)   AGE
default     kubernetes   ClusterIP   10.32.0.1    443/TCP   31m

NAME         ENDPOINTS                                      AGE
kubernetes   10.0.1.11:6443,10.0.1.12:6443,10.0.1.13:6443   31m

This Service is convenient for verification: its ClusterIP is 10.32.0.1, its three endpoints are exactly our three api-servers, and crucially those endpoints are host IPs (10.0.1.11/12/13), not pod IPs. That means we can prove kube-proxy works end-to-end without pod networking yet: DNAT from the ClusterIP to a host IP that's already routable inside the VPC.
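To confirm that the endpoints really are reachable host addresses before kube-proxy is involved at all, you could probe each one directly. This is a minimal sketch: split_endpoints is a hypothetical helper, and the curl loop assumes the api-server certificate lists the host IPs in its SANs and that the CA file is at /var/lib/kubernetes/ca.pem on the worker.

```shell
# Hypothetical helper: split the comma-separated ENDPOINTS column
# into one host:port per line.
split_endpoints() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Usage on a worker (not run here): call each api-server directly,
# bypassing the ClusterIP.
#   for EP in $(split_endpoints "10.0.1.11:6443,10.0.1.12:6443,10.0.1.13:6443"); do
#     curl -s --cacert /var/lib/kubernetes/ca.pem "https://$EP/healthz"; echo " <- $EP"
#   done
```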

Step 1 — Distribute the kubeconfig for kube-proxy

kube-proxy is also an api-server client with its own identity: the kube-proxy cert with CN=system:kube-proxy, O=system:node-proxier (Article 4). The built-in system:node-proxier ClusterRoleBinding grants the system:node-proxier ClusterRole, which allows reading Services and endpoints, to the user system:kube-proxy, so the CN alone is enough and no extra RBAC is needed. Its kubeconfig (Article 5) already points at the internal load balancer 10.0.1.10:6443:

# from the pki directory, loop over both workers
for W in worker-0 worker-1; do
  scp kube-proxy.kubeconfig $W:/tmp/
  ssh $W 'sudo mkdir -p /var/lib/kube-proxy
    sudo mv /tmp/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig
    sudo chmod 600 /var/lib/kube-proxy/kubeconfig'
done

Unlike kubelet, kube-proxy does not have a per-node identity; both workers share the same system:kube-proxy cert. That makes sense: kube-proxy only needs to read cluster-wide Services and endpoints, which is identical work on every node. Unlike kubelet, nothing it does is tied to a single node's identity.

Step 2 — Install the binary

# on worker-0
cd /tmp
curl -fsSL -o kube-proxy https://dl.k8s.io/v1.36.1/bin/linux/amd64/kube-proxy
ls -la kube-proxy
sudo install -m 755 kube-proxy /usr/local/bin/kube-proxy
kube-proxy --version

-rw-rw-r-- 1 ubuntu ubuntu 44200098 kube-proxy
Kubernetes v1.36.1
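Before installing a downloaded binary it is reasonable to check its checksum. The sketch below is an optional extra, not part of the original steps: verify_sha256 is a hypothetical helper, and the usage assumes a kube-proxy.sha256 file is published next to the binary on dl.k8s.io, as is documented for kubectl.

```shell
# Hypothetical helper: succeed only if FILE's SHA-256 matches the expected hex digest.
verify_sha256() {
  file=$1
  expected=$2
  # sha256sum --check expects "<digest>  <path>"; --status suppresses output
  printf '%s  %s\n' "$expected" "$file" | sha256sum --check --status
}

# Usage on a worker (not run here; the .sha256 URL is an assumption):
#   EXPECTED=$(curl -fsSL https://dl.k8s.io/v1.36.1/bin/linux/amd64/kube-proxy.sha256)
#   verify_sha256 /tmp/kube-proxy "$EXPECTED" && echo "checksum OK"
```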

Step 3 — KubeProxyConfiguration and the systemd unit

Minimal config: point at the kubeconfig, pick iptables mode, and declare clusterCIDR, the overall pod network range (10.200.0.0/16, the supernet that contains each worker's /24 pod range).

# on worker-0
sudo tee /var/lib/kube-proxy/kube-proxy-config.yaml >/dev/null <<'EOF'
kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig"
mode: "iptables"
clusterCIDR: "10.200.0.0/16"
EOF

clusterCIDR lets kube-proxy tell traffic that originates inside the pod range apart from traffic that originates outside it: a packet to a ClusterIP whose source lies outside the pod range gets marked to be masqueraded (SNAT) on the way out, avoiding the situation where a pod receives a packet with a source address it can't route back to. We'll see exactly this mark in the NAT chain in a later step.
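The test behind that decision is a plain CIDR membership check on the packet's source address. The kernel does this itself when it evaluates the rule; the sketch below only reproduces the arithmetic in shell. ip_to_int and ip_in_cidr are hypothetical helpers, and the two sample source addresses are made up for illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( ($a << 24) + ($b << 16) + ($c << 8) + $d ))
}

# Succeed if IP ($1) lies inside CIDR ($2), e.g. 10.200.0.0/16.
ip_in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")   # network part, before the slash
  bits=${2#*/}                 # prefix length, after the slash
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Illustrative sources: one inside the pod range, one host address outside it.
for SRC in 10.200.1.5 10.0.1.20; do
  if ip_in_cidr "$SRC" 10.200.0.0/16; then
    echo "$SRC: inside clusterCIDR, not marked by the cluster IP rule"
  else
    echo "$SRC: outside clusterCIDR, marked for masquerade"
  fi
done
```

A curl issued from the worker itself falls in the second case, because the node's own address is a host IP rather than a pod IP.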

Why iptables instead of ipvs or nftables? kube-proxy has several modes. iptables is the oldest, easy to read, and good enough for a small cluster, which makes it a good fit for learning. For clusters with thousands of Services, ipvs or the nftables mode (stable in recent releases) scale better because their per-packet lookup cost does not grow linearly with the number of Services the way a sequentially evaluated iptables chain does. We stay with iptables here; by the end of the series we'll drop kube-proxy entirely anyway when we switch to Cilium eBPF (Articles 18–19).

The systemd unit:

sudo tee /etc/systemd/system/kube-proxy.service >/dev/null <<'EOF'
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/kube-proxy \
  --config=/var/lib/kube-proxy/kube-proxy-config.yaml \
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now kube-proxy
sleep 4
systemctl is-active kube-proxy

active

See what it does at startup:

sudo journalctl -u kube-proxy --no-pager | grep -E 'Reloading|SyncProxyRules complete' | tail -2

proxier.go:1389] "Reloading service iptables data" ipFamily="IPv4" numServices=1 numEndpoints=3 ...
proxier.go:662] "SyncProxyRules complete" ipFamily="IPv4" elapsed="107.898293ms"

numServices=1 numEndpoints=3: kube-proxy has seen exactly the kubernetes Service with its three api-server endpoints, and has finished writing the rules.
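If you want those counters in a script, for example to compare them with what kubectl reports, they can be extracted from the log line. This is a small sketch: log_field is a hypothetical helper and relies only on the key=value format visible in the log output above.

```shell
# Hypothetical helper: read kube-proxy log lines on stdin and print the numeric
# value of KEY (e.g. numServices) from the last line that contains it.
log_field() {
  key=$1
  grep -o "${key}=[0-9]*" | tail -n 1 | cut -d= -f2
}

# Usage on a worker (not run here):
#   sudo journalctl -u kube-proxy --no-pager | grep 'Reloading' | log_field numEndpoints
```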

Step 4 — Read the iptables chains kube-proxy generates

This is the part most worth looking at. kube-proxy organizes its rules into several nested chains. Start from KUBE-SERVICES, the entry gate that holds one line per ClusterIP:

sudo iptables -t nat -L KUBE-SERVICES -n | grep -E 'Chain|10\.32\.0\.1'

Chain KUBE-SERVICES (2 references)
KUBE-SVC-NPX46M4PTMTKRN6Y  6  --  0.0.0.0/0  10.32.0.1  /* default/kubernetes:https cluster IP */ tcp dpt:443

A packet to 10.32.0.1:443 is pushed into the KUBE-SVC-... chain (the suffix is a hash derived from the Service's namespace, name, and port, so it stays the same across restarts). Open that chain:

sudo iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n

target           prot  source           destination     /* ... */
KUBE-MARK-MASQ   6     !10.200.0.0/16    10.32.0.1       /* ... cluster IP */ tcp dpt:443
KUBE-SEP-L63N... 0     0.0.0.0/0         0.0.0.0/0       /* ... -> 10.0.1.11:6443 */ statistic mode random probability 0.33333333349
KUBE-SEP-DFEZ... 0     0.0.0.0/0         0.0.0.0/0       /* ... -> 10.0.1.12:6443 */ statistic mode random probability 0.50000000000
KUBE-SEP-UYNO... 0     0.0.0.0/0         0.0.0.0/0       /* ... -> 10.0.1.13:6443 */

These four lines capture exactly how a Service works:

The first line — KUBE-MARK-MASQ with the source condition !10.200.0.0/16 (OUTSIDE the pod range) — is precisely the masquerade mark that clusterCIDR produces: traffic to the Service from outside the pod range gets SNAT'd on the way out.
The three KUBE-SEP-... lines are the three endpoints. Note the probability column: the first line matches with probability 1/3 ≈ 0.333; if it misses, the second matches with 0.5 (i.e. half of the remaining 2/3); if it misses again, the third takes everything. Together those conditional probabilities give an even split across the three endpoints: 1/3 for the first, 2/3 × 1/2 = 1/3 for the second, and the remaining 1/3 for the third. That's the entire "load balancing algorithm" of iptables mode: static probabilities, with no notion of which endpoint is busy or idle.
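The same arithmetic generalizes to any number of endpoints: rule i out of n needs the conditional probability 1/(n-i+1). The sketch below works that out and also prints the overall share each rule ends up with. sep_probabilities is a hypothetical helper written for illustration; it is not how kube-proxy computes the values internally.

```shell
# Hypothetical helper: for n endpoints, print the conditional match probability
# each rule needs so that every endpoint receives 1/n of new connections.
sep_probabilities() {
  awk -v n="$1" 'BEGIN {
    remaining = 1.0             # share of traffic not yet claimed by earlier rules
    for (i = 1; i <= n; i++) {
      p = 1 / (n - i + 1)       # conditional probability for rule i
      share = remaining * p     # overall share of traffic rule i receives
      printf "rule %d: probability %.5f -> overall share %.5f\n", i, p, share
      remaining -= share
    }
  }'
}

sep_probabilities 3
```

For three endpoints this prints probabilities 0.33333, 0.50000 and 1.00000, each with an overall share of 0.33333. The final 1.00000 corresponds to the last rule in the chain, which carries no statistic match at all and simply catches whatever is left.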

Each KUBE-SEP-... chain (Service EndPoint) holds the actual DNAT rule, which rewrites the packet's destination to the podIP:port of the corresponding endpoint. Here the "pod" happens to be the api-server on a host IP, so we can verify it right away.
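To see those DNAT destinations without opening each KUBE-SEP chain by hand, you can filter a dump of the nat table. A minimal sketch: dnat_targets is a hypothetical helper that assumes the usual iptables-save rule syntax, where the DNAT target is written as "-j DNAT --to-destination ip:port".

```shell
# Hypothetical helper: read iptables-save output on stdin and print every
# DNAT destination (ip:port), one per line.
dnat_targets() {
  sed -n 's/.*-j DNAT .*--to-destination \([^ ]*\).*/\1/p'
}

# Usage on a worker (not run here):
#   sudo iptables-save -t nat | dnat_targets
```

On this cluster the list should contain the three api-server addresses from the endpoints output.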

Step 5 — curl the ClusterIP directly

From worker-0 itself, call https://10.32.0.1:443, a virtual IP that belongs to no machine:

# on worker-0
curl -s --cacert /var/lib/kubernetes/ca.pem https://10.32.0.1:443/healthz; echo

ok

The packet leaves the curl process with destination 10.32.0.1:443. Because it is locally generated, it enters the nat table through the OUTPUT chain, which jumps to KUBE-SERVICES; netfilter then walks KUBE-SERVICES → KUBE-SVC → one KUBE-SEP chain and DNATs the destination to (say) 10.0.1.11:6443, and the api-server answers ok. The virtual ClusterIP just became a real destination, exactly the job kube-proxy exists to do, and all of it happened without a single line of CNI config, because the endpoints here are host IPs.

Step 6 — Repeat on worker-1

worker-1 shares the kube-proxy kubeconfig (copied in Step 1). Download the binary, write the same kube-proxy-config.yaml and kube-proxy.service as above (both files are identical between the two nodes), then start and verify:

# on worker-1
sudo systemctl daemon-reload
sudo systemctl enable --now kube-proxy
systemctl is-active kube-proxy
curl -s --cacert /var/lib/kubernetes/ca.pem https://10.32.0.1:443/healthz; echo

active
ok

Note: installing kube-proxy does not flip the node to Ready; kubectl get nodes still reports NotReady. kube-proxy handles Services, while a node's Ready condition depends on pod networking (CNI). Two separate concerns, and the missing piece is still CNI.

🧹 Cleanup

kube-proxy is a long-lived service, so leave it running (it is already enabled). Clean up the downloaded binary:

# on each worker
rm -f /tmp/kube-proxy

The iptables rules are managed by kube-proxy; don't edit them by hand. If you need to wipe them clean for diagnosis, kube-proxy --cleanup tears out all of its rules, but on a running cluster you rarely need that. When you stop/start the EC2 cluster, kube-proxy comes back up on its own and rebuilds all the rules from scratch.

The full scripts are at github.com/nghiadaulau/kubernetes-from-scratch, directory 12-kube-proxy.

Wrap-up

The worker now has the full trio: containerd runs containers, kubelet orchestrates the node, kube-proxy implements Services. We've seen with our own eyes that a ClusterIP is just a few iptables chains written by kube-proxy, translating a virtual IP into a real endpoint via DNAT and spreading load with static probabilities. Once you have the KUBE-SERVICES → KUBE-SVC → KUBE-SEP chain in your head, you know exactly where to point iptables -t nat -L the next time a Service doesn't work.

There's still the gap we've been digging at for three articles: pods have no networking, nodes are still NotReady. Article 13 steps back to lay the theory: Kubernetes' flat network model, the four requirements it imposes, and where CNI fits in, before Article 14 actually wires up pod networking and watches the two nodes finally flip to Ready.