kube-proxy: Turning a ClusterIP Into a Real Destination
Article 11 brought two workers into the cluster: kubelet running, nodes registered, but still NotReady because pod networking is missing. Before we touch pod networking (Articles 13–14), there's one more component that belongs to each worker and can be fully explained right now without CNI: kube-proxy.
kube-proxy is what makes Kubernetes Services work. And to see what it does, you first have to understand what a Service actually is.
A Service is an IP that belongs to no one
When you create a ClusterIP Service, Kubernetes hands it an IP from the Service range (in our cluster: 10.32.0.0/24, set in Article 7). That IP — the ClusterIP — is not bound to any network card. No machine owns it, no interface answers ARP for it. It's a virtual address, a pointer.
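One way to convince yourself that no interface owns the ClusterIP is to search the node's address list for it. The sketch below is a hypothetical helper (`has_local_addr` is not part of any tool), and the sample interface names and the 10.0.1.21 address are made up for illustration; it only assumes the usual one-line-per-address layout of `ip -o -4 addr show`.

```shell
# has_local_addr ADDR: reads `ip -o -4 addr show` output on stdin and
# succeeds only if ADDR is assigned to some interface.
has_local_addr() {
  awk -v want="$1" '
    $3 == "inet" { split($4, a, "/"); if (a[1] == want) found = 1 }
    END { exit !found }
  '
}

# Illustrative sample of what a worker might print (interface names and
# the 10.0.1.21 address are assumptions, not taken from the real cluster).
sample='1: lo    inet 127.0.0.1/8 scope host lo
2: ens5    inet 10.0.1.21/24 brd 10.0.1.255 scope global ens5'

for addr in 10.0.1.21 10.32.0.1; do
  if printf '%s\n' "$sample" | has_local_addr "$addr"; then
    echo "$addr is bound to an interface"
  else
    echo "$addr is not bound to any interface"
  fi
done
```

On a real worker you would feed it live data instead of the sample: `ip -o -4 addr show | has_local_addr 10.32.0.1 || echo "not bound"`.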
Behind the Service is a set of real pods, each with its own IP that changes constantly (pods die, new ones are born, different IPs). The problem: the client wants to call a stable address, but the real destinations keep drifting. The Service solves this by giving the client a fixed ClusterIP; the job of translating that fixed ClusterIP into a real, live pod is kube-proxy's.
kube-proxy runs on every node and watches two things from the api-server: the list of Services (which ClusterIPs exist) and the list of endpoints (which pods sit behind each Service; current releases read these from EndpointSlice objects rather than the older Endpoints objects). Whenever something changes, it rewrites the kernel's forwarding rules so that any packet sent to ClusterIP:port gets DNAT'd to the podIP:port of a live endpoint, chosen at random to spread the load.
client inside the cluster
        │  sends to 10.32.0.1:443 (ClusterIP, virtual)
        ▼
┌────────────── kernel netfilter on the node ──────────────┐
│  kube-proxy has installed iptables rules:                │
│    10.32.0.1:443 ──DNAT──► one of the endpoints          │
│                            (random split)                │
└──────────────────────────────────────────────────────────┘
        │
        ▼
real endpoints: 10.0.1.11:6443 | 10.0.1.12:6443 | 10.0.1.13:6443
The key point: kube-proxy itself is not on the packet's path. It only installs the rules and steps aside; the DNAT is done by the kernel. So if kube-proxy dies for a while, it doesn't cut off traffic that's already flowing; the rules just don't get updated until it comes back.
There's already a Service to try right now
We haven't created any Service yet, but the cluster has had one of its own ever since the api-server started: the kubernetes Service in the default namespace.
kubectl get svc -A
kubectl get endpoints -n default kubernetes
NAMESPACE NAME TYPE CLUSTER-IP PORT(S) AGE
default kubernetes ClusterIP 10.32.0.1 443/TCP 31m
NAME ENDPOINTS AGE
kubernetes 10.0.1.11:6443,10.0.1.12:6443,10.0.1.13:6443 31m
This Service is a convenient one for verification: its ClusterIP is 10.32.0.1, its three endpoints are exactly our three api-servers, and crucially those endpoints are host IPs (10.0.1.11/12/13), not pod IPs. That means we can prove kube-proxy works end-to-end before pod networking exists: the DNAT goes from the ClusterIP to a host IP that's already routable inside the VPC.
Step 1 — Distribute the kubeconfig for kube-proxy
kube-proxy is also an api-server client with its own identity: the kube-proxy cert with CN=system:kube-proxy, O=system:node-proxier (Article 4). The built-in system:node-proxier ClusterRole, which grants read access to Services and endpoints, is already bound to the system:kube-proxy user by a default ClusterRoleBinding, so no extra RBAC is needed. Its kubeconfig (Article 5) already points at the internal load balancer 10.0.1.10:6443:
# from the pki directory, loop over both workers
for W in worker-0 worker-1; do
  scp kube-proxy.kubeconfig "$W":/tmp/
  ssh "$W" 'sudo mkdir -p /var/lib/kube-proxy
    sudo mv /tmp/kube-proxy.kubeconfig /var/lib/kube-proxy/kubeconfig
    sudo chmod 600 /var/lib/kube-proxy/kubeconfig'
done
Unlike kubelet, kube-proxy has no per-node identity; both workers share the same system:kube-proxy cert. That makes sense: kube-proxy only reads cluster-wide Service and endpoint data, which is the same work on every node, and nothing it does is tied to one node's identity the way kubelet's operations are.
Step 2 — Install the binary
# on worker-0
cd /tmp
curl -fsSL -o kube-proxy https://dl.k8s.io/v1.36.1/bin/linux/amd64/kube-proxy
ls -la kube-proxy
sudo install -m 755 kube-proxy /usr/local/bin/kube-proxy
kube-proxy --version
-rw-rw-r-- 1 ubuntu ubuntu 44200098 kube-proxy
Kubernetes v1.36.1
Step 3 — KubeProxyConfiguration and the systemd unit
Minimal config: point at the kubeconfig, pick iptables mode, and declare clusterCIDR, the overall pod network range (10.200.0.0/16, the supernet that contains each worker's /24 pod range).
# on worker-0
sudo tee /var/lib/kube-proxy/kube-proxy-config.yaml >/dev/null <<'EOF'
kind: KubeProxyConfiguration
apiVersion: kubeproxy.config.k8s.io/v1alpha1
clientConnection:
  kubeconfig: "/var/lib/kube-proxy/kubeconfig"
mode: "iptables"
clusterCIDR: "10.200.0.0/16"
EOF
clusterCIDR lets kube-proxy distinguish traffic from within the pod range from traffic from outside: a packet to a ClusterIP whose source lies outside the pod range gets marked to be masqueraded (SNAT) on the way out, avoiding the situation where a pod receives a packet with a source address it can't route back to. We'll see exactly this mark in the NAT chain in a later step.
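The "inside or outside the pod range" test is plain CIDR membership. As a rough sketch of that decision (not kube-proxy's code; `in_cidr` and `masq_decision` are hypothetical helpers, and 10.200.1.7 is an invented pod IP), the check can be reproduced with integer arithmetic in awk:

```shell
# in_cidr ADDR CIDR: succeed if the IPv4 address ADDR falls inside CIDR.
in_cidr() {
  awk -v addr="$1" -v cidr="$2" 'BEGIN {
    split(cidr, c, "/"); split(c[1], n, "."); split(addr, a, ".")
    ai = ((a[1] * 256 + a[2]) * 256 + a[3]) * 256 + a[4]
    ni = ((n[1] * 256 + n[2]) * 256 + n[3]) * 256 + n[4]
    size = 2 ^ (32 - c[2])          # number of addresses in the block
    # Same block number means same network prefix.
    exit !(int(ai / size) == int(ni / size))
  }'
}

# masq_decision SRC: mimic the "source not in clusterCIDR" condition for a
# packet that is headed to a ClusterIP.
masq_decision() {
  if in_cidr "$1" "10.200.0.0/16"; then
    echo "$1: inside clusterCIDR, left unmarked"
  else
    echo "$1: outside clusterCIDR, marked for masquerade"
  fi
}

masq_decision 10.200.1.7   # hypothetical pod IP
masq_decision 10.0.1.11    # host IP in the VPC range
```

The first call reports the source as inside the range and the second as outside, which is the case that hits the masquerade mark.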
Why iptables instead of ipvs or nftables? kube-proxy has several modes. iptables is the oldest, clear to read, and good enough for a small cluster, so it's a good fit for learning. For clusters with thousands of Services, ipvs or the nftables mode (stable in recent releases) scale better because their lookup cost doesn't grow linearly with the number of rules. We use iptables here, and besides, by the end of the series we'll drop kube-proxy entirely when we switch to Cilium eBPF (Articles 18–19).
The systemd unit:
sudo tee /etc/systemd/system/kube-proxy.service >/dev/null <<'EOF'
[Unit]
Description=Kubernetes Kube Proxy
Documentation=https://github.com/kubernetes/kubernetes
After=network-online.target
Wants=network-online.target
[Service]
ExecStart=/usr/local/bin/kube-proxy \
  --config=/var/lib/kube-proxy/kube-proxy-config.yaml \
  --v=2
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now kube-proxy
sleep 4
systemctl is-active kube-proxy
active
See what it does at startup:
sudo journalctl -u kube-proxy --no-pager | grep -E 'Reloading|SyncProxyRules complete' | tail -2
proxier.go:1389] "Reloading service iptables data" ipFamily="IPv4" numServices=1 numEndpoints=3 ...
proxier.go:662] "SyncProxyRules complete" ipFamily="IPv4" elapsed="107.898293ms"
numServices=1 numEndpoints=3: kube-proxy has seen exactly the kubernetes Service with its three api-server endpoints, and has finished writing the rules.
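If you would rather check those counters in a script than read them by eye, a small filter can pull them out of the log. This is a minimal sketch that assumes the log line keeps the exact shape shown above; `sync_counts` is a hypothetical helper name.

```shell
# sync_counts: read kube-proxy log lines on stdin and print
# "<numServices> <numEndpoints>" from the most recent reload message.
sync_counts() {
  sed -n 's/.*"Reloading service iptables data".*numServices=\([0-9][0-9]*\) numEndpoints=\([0-9][0-9]*\).*/\1 \2/p' |
    tail -n 1
}

# Demonstration on the line quoted above (trailing fields omitted).
printf '%s\n' 'proxier.go:1389] "Reloading service iptables data" ipFamily="IPv4" numServices=1 numEndpoints=3' |
  sync_counts
```

Against the live service it would be fed from the journal: `sudo journalctl -u kube-proxy --no-pager | sync_counts`.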
Step 4 — Read the iptables chains kube-proxy generates
This is the part most worth looking at. kube-proxy organizes its rules into several nested chains. Start from KUBE-SERVICES, the entry gate that holds one line per ClusterIP:
sudo iptables -t nat -L KUBE-SERVICES -n | grep -E 'Chain|10\.32\.0\.1'
Chain KUBE-SERVICES (2 references)
KUBE-SVC-NPX46M4PTMTKRN6Y 6 -- 0.0.0.0/0 10.32.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
A packet to 10.32.0.1:443 is pushed into the KUBE-SVC-... chain (the name is a hash of the Service). Open that chain:
sudo iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n
target prot source destination /* ... */
KUBE-MARK-MASQ 6 !10.200.0.0/16 10.32.0.1 /* ... cluster IP */ tcp dpt:443
KUBE-SEP-L63N... 0 0.0.0.0/0 0.0.0.0/0 /* ... -> 10.0.1.11:6443 */ statistic mode random probability 0.33333333349
KUBE-SEP-DFEZ... 0 0.0.0.0/0 0.0.0.0/0 /* ... -> 10.0.1.12:6443 */ statistic mode random probability 0.50000000000
KUBE-SEP-UYNO... 0 0.0.0.0/0 0.0.0.0/0 /* ... -> 10.0.1.13:6443 */
These four lines capture exactly how a Service works:
- The first line, KUBE-MARK-MASQ with the source condition !10.200.0.0/16 (outside the pod range), is precisely the masquerade mark that clusterCIDR produces: traffic to the Service from outside the pod range gets SNAT'd on the way out.
- The three KUBE-SEP-... lines are the three endpoints. Note the probability column: the first line matches with probability 1/3 ≈ 0.333; if it misses, the second matches with 0.5 (i.e. half of the remaining 2/3); if it misses again, the third takes everything. Chained together, those rules give an even ⅓–⅓–⅓ split across the three endpoints. That's the entire "load balancing algorithm" of iptables mode: static probabilities, with no notion of which endpoint is busy or idle.
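The arithmetic behind that cascade generalizes to any number of endpoints: rule i out of n matches with probability 1/(n-i+1), applied only to the traffic the earlier rules let through. A short sketch that prints both the per-rule probability and the resulting overall share (`sep_probabilities` is a hypothetical helper that models the math, not something kube-proxy ships):

```shell
# sep_probabilities N: for N endpoints, print the probability written on
# each rule and the share of all traffic that rule ends up receiving.
sep_probabilities() {
  awk -v n="$1" 'BEGIN {
    remaining = 1.0                 # fraction of traffic not yet matched
    for (i = 1; i <= n; i++) {
      p = 1.0 / (n - i + 1)         # probability on rule i
      share = remaining * p         # traffic this rule actually takes
      printf "rule %d: probability %.5f, overall share %.5f\n", i, p, share
      remaining -= share
    }
  }'
}

sep_probabilities 3
# rule 1: probability 0.33333, overall share 0.33333
# rule 2: probability 0.50000, overall share 0.33333
# rule 3: probability 1.00000, overall share 0.33333
```

The last rule always has probability 1, which is why iptables prints no statistic match on it.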
Each KUBE-SEP-... chain (Service EndPoint) holds the actual DNAT command, rewriting the packet's destination to the podIP:port of the corresponding endpoint. Here the "pod" happens to be the api-server on a host IP, so we can verify it right away.
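To see those DNAT targets in one place, you can filter the rule listing for them. The sketch below assumes the rule-spec format that `iptables -t nat -S` prints; the chain names in the sample are placeholders (the real ones are hashes), and `dnat_targets` is a hypothetical helper.

```shell
# dnat_targets: read `iptables -t nat -S` output on stdin and print each
# KUBE-SEP chain together with the address its DNAT rule rewrites to.
dnat_targets() {
  sed -n 's/^-A \(KUBE-SEP-[A-Z0-9]*\) .*-j DNAT --to-destination \([0-9.:]*\).*/\1 -> \2/p'
}

# Illustrative rules; chain names are placeholders, not real hashes.
sample_rules='-A KUBE-SEP-AAAAAAAAAAAAAAAA -s 10.0.1.11/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 10.0.1.11:6443
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -m comment --comment "default/kubernetes:https" -m tcp -j DNAT --to-destination 10.0.1.12:6443'

printf '%s\n' "$sample_rules" | dnat_targets
# KUBE-SEP-AAAAAAAAAAAAAAAA -> 10.0.1.11:6443
# KUBE-SEP-BBBBBBBBBBBBBBBB -> 10.0.1.12:6443
```

On a worker, the live equivalent would be `sudo iptables -t nat -S | dnat_targets`.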
Step 5 — curl the ClusterIP directly
From worker-0 itself, call https://10.32.0.1:443, a virtual IP that belongs to no machine:
# on worker-0
curl -s --cacert /var/lib/kubernetes/ca.pem https://10.32.0.1:443/healthz; echo
ok
The packet leaves the curl process with destination 10.32.0.1:443; netfilter matches the KUBE-SERVICES → KUBE-SVC → one KUBE-SEP chain, DNATs the destination to (say) 10.0.1.11:6443; the api-server answers ok. The virtual ClusterIP just became a real destination, exactly the job kube-proxy exists to do, and all of it happened without a single line of CNI config, because the endpoints here are host IPs.
Step 6 — Repeat on worker-1
worker-1 shares the kube-proxy kubeconfig (copied in Step 1). Download the binary, write the same kube-proxy-config.yaml and kube-proxy.service as above (both files are identical between the two nodes), then start and verify:
# on worker-1
sudo systemctl daemon-reload
sudo systemctl enable --now kube-proxy
systemctl is-active kube-proxy
curl -s --cacert /var/lib/kubernetes/ca.pem https://10.32.0.1:443/healthz; echo
active
ok
Note: installing kube-proxy does not flip the node to Ready; kubectl get nodes still reports NotReady. kube-proxy handles Services, while a node's Ready condition depends on pod networking (CNI). Two separate concerns, and the missing piece is still CNI.
🧹 Cleanup
kube-proxy is a long-lived service, so leave it running (it's already enabled). Clean up the downloaded binary:
# on each worker
rm -f /tmp/kube-proxy
The iptables rules are managed by kube-proxy; don't edit them by hand. If you need to wipe them clean for diagnosis, kube-proxy --cleanup tears out all of its rules, but on a running cluster you rarely need that. When you stop/start the EC2 cluster, kube-proxy comes back up on its own and rebuilds all the rules from scratch.
The full scripts are at github.com/nghiadaulau/kubernetes-from-scratch, directory 12-kube-proxy.
Wrap-up
The worker now has the full trio: containerd runs containers, kubelet orchestrates the node, kube-proxy implements Services. We've seen with our own eyes that a ClusterIP is just a few iptables chains written by kube-proxy, translating a virtual IP into a real endpoint via DNAT, spreading load with static probabilities. Once you have the KUBE-SERVICES → KUBE-SVC → KUBE-SEP chain in hand, then when a Service doesn't work later, you know exactly where to start: iptables -t nat -L.
There's still the gap we've been digging at for three articles: pods have no networking, nodes are still NotReady. Article 13 steps back to lay the theory: Kubernetes' flat network model, the four requirements it imposes, and where CNI fits in, before Article 14 actually wires up pod networking and watches the two nodes finally flip to Ready.