Cilium and eBPF: why replace kube-proxy

Kai · 5 min read

Part I built "good enough" pod networking: kube-proxy in iptables mode (Article 12) routing Services, plus a cni0 bridge with host-local IPAM (Article 14) connecting pods. It works, but it is the older approach and it has limits. Part X upgrades the whole networking layer to Cilium, which is built on eBPF and replaces both kube-proxy and the bridge. This article is the theory: you need to understand why before you migrate (Article 46). To keep it concrete, we look directly at the iptables datapath currently running on the cluster, the thing about to be replaced.

The current datapath: iptables grows linearly

kube-proxy in iptables mode (Article 12) translates each Service ClusterIP into a chain of nat rules. Count on worker-0:

ssh worker-0 'sudo iptables -t nat -L KUBE-SERVICES | head; sudo iptables -t nat -S | wc -l'
Chain KUBE-SERVICES
KUBE-SVC-...  tcp -- ... 10.32.0.10 ... kube-dns:dns-tcp cluster IP
KUBE-SVC-...  tcp -- ... 10.32.0.90 ... metrics-server:https cluster IP
KUBE-SVC-...  tcp -- ... 10.32.0.1  ... kubernetes:https cluster IP
...
74          # lines in the nat table dump: rules plus chain and policy declarations

Just a few Services (kube-dns, metrics-server, kubernetes) and the nat table already dumps to 74 lines (the -S output also counts chain declarations and default policies, so the number of actual rules is lower). The core problem: iptables evaluates a packet by walking the rule chain sequentially, so matching is O(n) in the number of rules. A real cluster has thousands of Services → tens of thousands of rules → the first packet of every new connection has to traverse a long list before conntrack takes over. Updates are costly too: kube-proxy takes the iptables lock and reloads rules in bulk, and older versions regenerated the whole table on every sync. At scale this is a bottleneck for both latency and convergence time. (The ipvs mode helps, but it's still the old architecture.) That's the motivation to change.
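To see what a number like 74 is made of, the -S dump can be split into default policies (-P), chain declarations (-N) and actual rules (-A). This is a minimal sketch: the function names are mine, and the sample input is invented and abridged (the chain suffix EXAMPLE1 is hypothetical), so the counts it prints say nothing about the real cluster.

```shell
# Classify lines of `iptables -S` output read from stdin.
# -P = default policy, -N = user-defined chain, -A = appended rule.
classify_nat_lines() {
  awk '
    $1 == "-P" { policies++ }
    $1 == "-N" { chains++ }
    $1 == "-A" { rules++ }
    END { printf "policies=%d chains=%d rules=%d\n", policies, chains, rules }
  '
}

# Invented, abridged sample in the format `iptables -t nat -S` prints.
sample_nat_dump() {
  cat <<'EOF'
-P PREROUTING ACCEPT
-P POSTROUTING ACCEPT
-N KUBE-SERVICES
-N KUBE-SVC-EXAMPLE1
-A PREROUTING -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.32.0.10/32 -p tcp -m tcp --dport 53 -j KUBE-SVC-EXAMPLE1
EOF
}

sample_nat_dump | classify_nat_lines   # prints: policies=2 chains=2 rules=2
```

Against the real node, the same function can be fed from the command used above: ssh worker-0 'sudo iptables -t nat -S' | classify_nat_lines.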

What eBPF is

eBPF (extended Berkeley Packet Filter) is a Linux kernel technology that lets you run sandboxed programs inside the kernel without changing kernel source or loading a module. Cilium's docs call it "a new Linux kernel technology" enabling "the dynamic insertion of powerful security visibility and control logic within Linux itself." Concretely: you load a piece of eBPF bytecode, the kernel's verifier checks that it is safe (no unbounded loops, no out-of-bounds memory access), and the program is then attached to a hook point, for example where the NIC driver first receives a packet (XDP), at tc (traffic control), or at the socket layer. When a packet reaches the hook, the eBPF program runs in the kernel and decides on the spot: forward, rewrite the destination (DNAT), drop, or just record metrics.

Two things make eBPF powerful for networking:

  1. A hash map instead of a rule chain. eBPF programs keep state in maps with O(1) lookup: instead of walking 74 rules, the program looks up "this ClusterIP → which backend" in a hash map in one step. Adding Services does not slow down each packet.
  2. Skipping most of the network stack. With XDP, an eBPF program handles a packet before the kernel builds a full sk_buff and before iptables sees it, cutting out whole layers. In the docs' words, the visibility/datapath is "programmable ... minimizes overhead."
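The difference between walking a chain and looking up a key can be felt even in a toy model. The sketch below is plain bash (4+ for associative arrays), not eBPF: it only contrasts a sequential scan that counts comparisons with a keyed lookup. The ClusterIPs are the ones from the listing above; the backend addresses and function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Toy model only: sequential rule walk vs. keyed lookup (bash 4+).
# Backend addresses are made up for illustration.

# "Rule chain": an ordered list of "clusterIP backend" pairs.
rules=(
  "10.32.0.1 10.240.0.10:6443"
  "10.32.0.10 10.200.1.5:53"
  "10.32.0.90 10.200.2.7:10250"
)

# Walk the list in order, counting comparisons until a match is found.
lookup_chain() {
  local dst=$1 steps=0 rule
  for rule in "${rules[@]}"; do
    steps=$((steps + 1))
    if [ "${rule%% *}" = "$dst" ]; then
      echo "${rule#* } steps=$steps"
      return 0
    fi
  done
  echo "no-match steps=$steps"
  return 1
}

# "Hash map": an associative array keyed by ClusterIP.
declare -A svc_map=(
  ["10.32.0.1"]="10.240.0.10:6443"
  ["10.32.0.10"]="10.200.1.5:53"
  ["10.32.0.90"]="10.200.2.7:10250"
)

# One keyed lookup, regardless of how many entries the map holds.
lookup_map() {
  local dst=$1
  if [ -n "${svc_map[$dst]+set}" ]; then
    echo "${svc_map[$dst]}"
  else
    echo "no-match"
    return 1
  fi
}

lookup_chain 10.32.0.90   # prints: 10.200.2.7:10250 steps=3
lookup_map 10.32.0.90     # prints: 10.200.2.7:10250
```

With three entries the scan needs three comparisons for the last Service; with thousands of entries the scan grows with the list while the keyed lookup stays a single step.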

The cluster's kernel is new enough to run full eBPF:

ssh worker-0 'uname -r'      # 6.17.0-1015-aws
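If you want that check to be scriptable rather than eyeballed, a small helper can compare the release string against a minimum major.minor. This is a sketch: the function name is mine, and the 5.10 floor in the usage line is an assumption, so confirm the actual minimum on the system requirements page for the Cilium release you pin.

```shell
# kernel_at_least RELEASE MIN_MAJOR MIN_MINOR
# Succeeds if RELEASE (as printed by `uname -r`) is >= MIN_MAJOR.MIN_MINOR.
kernel_at_least() {
  release=$1 min_major=$2 min_minor=$3
  major=${release%%.*}          # "6.17.0-1015-aws" -> "6"
  rest=${release#*.}            # -> "17.0-1015-aws"
  minor=${rest%%[!0-9]*}        # -> "17"
  [ "$major" -gt "$min_major" ] ||
    { [ "$major" -eq "$min_major" ] && [ "$minor" -ge "$min_minor" ]; }
}

# Usage with the value seen on worker-0; 5.10 is an assumed minimum.
if kernel_at_least "6.17.0-1015-aws" 5 10; then
  echo "kernel OK"
else
  echo "kernel too old"
fi
```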

What Cilium does differently

Cilium is a CNI (like the bridge in Article 14) plus a kube-proxy replacement (Article 12) — all on eBPF. The docs: "open source software for transparently securing the network connectivity between application services." Four things Cilium changes versus our hand-rolled setup:

  • Replaces kube-proxy (kube-proxy replacement). Cilium installs eBPF programs that do kube-proxy's job — DNAT ClusterIP → pod backend — via a hash map in the kernel. After migrating, we fully remove kube-proxy and those 74 iptables rules disappear. The docs: east-west load balancing "fully replacing kube-proxy."
  • Pod datapath replaces the bridge. Cilium handles pod connectivity itself (replacing Article 14's cni0 + host-local) and moves traffic between nodes either through a tunnel (encapsulation) or with native routing; we'll pick native, like Article 13.
  • Security by identity, not by IP. This is the subtle point: pod IPs churn constantly (Article 25). Cilium assigns each group of pods (by label) a numeric identity, and policy applies by identity — "Rather than relying on IP addresses that frequently churn ... Cilium operates on service identity." NetworkPolicy (Article 47) is more durable as a result, and scales up to L7 (HTTP path, gRPC method), not just L3/L4.
  • Observable (Hubble). Because the datapath is eBPF, Cilium sees every flow right in the kernel → Hubble lets us view network flows, drops, policy verdicts in real time — something iptables can't.
   BEFORE (Part I)                          AFTER (Part X, Cilium eBPF)
   ─────────────                            ──────────────────────────
   kube-proxy ── writes 74+ iptables rules  eBPF program at tc/XDP hook
       │  packet walks the chain O(n)            │  hash map lookup O(1), in kernel
   cni0 bridge + host-local IPAM            Cilium CNI (eBPF datapath)
   policy by IP (brittle)                   policy by IDENTITY (L3-L7)
   nearly blind (only tcpdump)              Hubble: see every flow/verdict
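The "identity, not IP" row is the least intuitive one, so here is a toy model of it in bash (4+). It is not how Cilium is implemented: the identity numbers, labels, pod IPs and function name are all invented. It only shows that when policy is keyed by a label-derived identity, a pod coming back with a new IP needs no policy change.

```shell
#!/usr/bin/env bash
# Toy model of identity-based policy (bash 4+). Identity numbers, labels
# and pod IPs are invented; real identities are allocated by Cilium.

declare -A identity_of_labels=( ["app=frontend"]=1001 ["app=backend"]=1002 )
declare -A labels_of_ip=( ["10.200.1.5"]="app=frontend" ["10.200.2.7"]="app=backend" )

# Policy table: allowed "source identity -> destination identity" pairs.
declare -A allow=( ["1001->1002"]=1 )

# verdict SRC_IP DST_IP: resolve both IPs to identities, then check policy.
verdict() {
  local src_labels=${labels_of_ip[$1]:-} dst_labels=${labels_of_ip[$2]:-}
  if [ -z "$src_labels" ] || [ -z "$dst_labels" ]; then
    echo DROP
    return
  fi
  local key="${identity_of_labels[$src_labels]:-}->${identity_of_labels[$dst_labels]:-}"
  if [ -n "${allow[$key]:-}" ]; then echo ALLOW; else echo DROP; fi
}

verdict 10.200.1.5 10.200.2.7      # frontend -> backend: ALLOW

# Pod restart: the frontend comes back with a new IP but the same labels.
unset 'labels_of_ip[10.200.1.5]'
labels_of_ip["10.200.1.99"]="app=frontend"

verdict 10.200.1.99 10.200.2.7     # still ALLOW, policy table untouched
```

An IP-based rule would have needed rewriting at the restart; here only the IP-to-labels mapping changed, and the allow table never mentions an address.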

Why "kube-proxy-less"

The migration goal (Article 46) is to run Cilium's kube-proxy replacement in full and remove the kube-proxy DaemonSet altogether. The win: a single datapath (eBPF) instead of two layers (kube-proxy's iptables + the CNI), less overhead, faster convergence, and access to the L7/observability features. The cost: a dependency on a new-enough kernel (we have 6.17, more than enough) and a more complex CNI to operate. For a learning/throwaway cluster this is exactly the time to try it. We'll pin Cilium 1.19 and enable kubeProxyReplacement=true + Hubble.
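As a preview of what those settings look like in Helm terms, the sketch below only prints a command for review; it does not run anything. Assumptions: the chart is installed from a Helm repo added under the name cilium, the function name and the shell variables in the usage line are hypothetical, and the exact 1.19 patch version is left to you. The k8sServiceHost/k8sServicePort values are there because, without kube-proxy, the Cilium agent needs a direct address for the API server.

```shell
# Print (do not run) a Helm command for a kube-proxy-less install.
# Arguments: chart version, API server host, API server port.
print_cilium_install() {
  cat <<EOF
helm upgrade --install cilium cilium/cilium \\
  --version $1 \\
  --namespace kube-system \\
  --set kubeProxyReplacement=true \\
  --set k8sServiceHost=$2 \\
  --set k8sServicePort=$3 \\
  --set hubble.enabled=true
EOF
}

# Usage (variable names are placeholders you would set yourself):
#   print_cilium_install "$CILIUM_VERSION" "$API_SERVER_HOST" 6443
```

Article 46 is where the real install happens; treat this only as a map from the prose to the corresponding chart values.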

🧹 Cleanup

Theory article — nothing created on the cluster (only reading existing iptables). The cluster is unchanged: CoreDNS + metrics-server + ebs-csi + snapshot-controller, kube-proxy still running (to be removed in Article 46). Nothing to clean up.

Wrap-up

Part I's networking (kube-proxy iptables + hand-rolled bridge) works but is dated: iptables processes packets O(n) in the number of rules (we counted 74 rules with just a few Services), and updates rewrite the table — a bottleneck at scale. eBPF runs sandboxed bytecode in the kernel at hooks (XDP/tc/socket), uses a hash map O(1), and skips most of the network stack — fast and programmable (the cluster's 6.17 kernel is more than capable). Cilium uses eBPF to: replace kube-proxy (DNAT via hash map), replace the bridge (pod datapath), secure by identity instead of IP (durable against pod IP churn, extends to L7), and provide Hubble to observe every flow. Part X's destination is a kube-proxy-less Cilium cluster — kube-proxy fully removed.

Article 46 does it for real: install Cilium 1.19 with kubeProxyReplacement, remove the kube-proxy DaemonSet + delete those 74 iptables rules, switch the pod datapath to Cilium, enable Hubble — then verify Services still work with not a single kube-proxy iptables rule left.