tc/sched_cls and Dissecting a Live Cilium Datapath
In Article 11, XDP dropped packets at the driver, before the kernel had even allocated an sk_buff for them. But most of Kubernetes networking (routing between pods, Service load balancing, applying NetworkPolicy) happens at the next layer: tc. This is where Cilium puts almost its entire datapath. Instead of writing a toy tc program, this article dissects the 74 sched_cls programs actually running on a node, nearly all of them Cilium's, to see how a real pod packet is processed.
tc/sched_cls: the hook after the sk_buff exists
tc (traffic control) is the kernel's long-standing built-in traffic shaping system. eBPF plugs into it via the program type BPF_PROG_TYPE_SCHED_CLS — bpftool shows it as sched_cls. It differs from XDP in a few core ways:
| | XDP | tc/sched_cls |
|---|---|---|
| Runs | before sk_buff | after the sk_buff exists |
| Sees | ingress only | both ingress and egress |
| Data | raw packet | __sk_buff (already has metadata: mark, ifindex, cgroup...) |
| Location | usually the physical NIC (native mode needs driver support) | any interface, including a pod's veth |
In exchange for being "later" than XDP, tc can see egress and works with a sk_buff full of metadata — exactly what you need to route between pods. Kernel 6.6 and up also has tcx, a newer way of attaching tc based on BPF links: multiple programs chained on the same hook, safe ownership, auto-detach on close, ordering decided by the BPF_F_BEFORE/BPF_F_AFTER flags — and that is exactly what this cluster (kernel 6.17) uses (bpftool net show clearly says tcx/ingress, tcx/egress).
74 programs, where they attach
On the node, count eBPF programs by type:
sudo bpftool prog show | grep -c sched_cls
74
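For a per-type breakdown rather than a single count, one option is to tally bpftool's JSON output. This is a minimal sketch, assuming `bpftool -j prog show` emits a JSON array of objects that each carry a `type` field. The sample string is a hand-written stand-in, not output captured from this node: the sched_cls names are Cilium entry points, and the kprobe entry (id and name) is hypothetical.

```python
import json
from collections import Counter

def count_by_type(bpftool_json: str) -> Counter:
    """Tally eBPF programs by their 'type' field."""
    progs = json.loads(bpftool_json)
    return Counter(p.get("type", "unknown") for p in progs)

# Hand-written stand-in for the output of `sudo bpftool -j prog show`.
# The kprobe entry is hypothetical and only there to show a second type.
SAMPLE = json.dumps([
    {"id": 2969, "type": "sched_cls", "name": "cil_from_netdev"},
    {"id": 2960, "type": "sched_cls", "name": "cil_to_netdev"},
    {"id": 2889, "type": "sched_cls", "name": "cil_from_container"},
    {"id": 9001, "type": "kprobe", "name": "hypothetical_probe"},
])

if __name__ == "__main__":
    for prog_type, n in count_by_type(SAMPLE).most_common():
        print(f"{prog_type:12} {n}")
```

On a real node the string would come from running bpftool itself, for example through `subprocess.run(["bpftool", "-j", "prog", "show"], capture_output=True, text=True).stdout`.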
74 out of the node's total of 140 eBPF programs are sched_cls — nearly all of them Cilium's. Where do they attach? bpftool net show:
tc:
ens5(2) tcx/ingress cil_from_netdev prog_id 2969
ens5(2) tcx/egress cil_to_netdev prog_id 2960
cilium_host(74) tcx/ingress cil_to_host prog_id 2944
cilium_host(74) tcx/egress cil_from_host prog_id 2943
lxc9020cf5e63ba(100) tcx/ingress cil_from_container prog_id 2889
lxc_health(124) tcx/ingress cil_from_container prog_id 2934
lxcd74767545ed6(160) tcx/ingress cil_from_container prog_id 3388
lxcbafc4faaa189(162) tcx/ingress cil_from_container prog_id 3437
You can read the whole datapath architecture out of this:
- cil_from_netdev/cil_to_netdev attach on ens5 (the physical NIC): packets into and out of the node pass through here. This is the layer right after the XDP firewall from Article 11.
- cil_from_container attaches on the ingress of each lxc... interface; each of those interfaces is one end of a pod's veth. Every pod added to the node adds another sched_cls program attached to its veth. That is the reason for the number 74: it scales with the number of pods and interfaces on the node.
- cil_to_host/cil_from_host attach on cilium_host and handle traffic going into and out of the host stack.
Pod A ──veth(lxc..)──► cil_from_container ──┐
                                            │  (BPF, in the kernel)
                                            ▼
                               map lookup: LB? policy? CT?
                                            │
                        ┌───────────────────┴────────────┐
                        ▼                                ▼
             Pod B on the same node                cil_to_netdev ──► ens5 ──► another node
         (forwarded directly veth→veth)
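The attachment listing can also be read mechanically. The sketch below parses the tc lines exactly as printed in this article and groups interfaces by program name. The regular expression is an assumption based on those eight lines only; it is not a general parser for every form of `bpftool net show` output.

```python
import re
from collections import defaultdict

# One attachment line: "<ifname>(<ifindex>) <hook> <program> prog_id <id>"
LINE_RE = re.compile(r"^(\S+)\((\d+)\)\s+(\S+)\s+(\S+)\s+prog_id\s+(\d+)$")

# The tc section of `bpftool net show`, copied from the listing in this article.
NET_SHOW_TC = """\
ens5(2) tcx/ingress cil_from_netdev prog_id 2969
ens5(2) tcx/egress cil_to_netdev prog_id 2960
cilium_host(74) tcx/ingress cil_to_host prog_id 2944
cilium_host(74) tcx/egress cil_from_host prog_id 2943
lxc9020cf5e63ba(100) tcx/ingress cil_from_container prog_id 2889
lxc_health(124) tcx/ingress cil_from_container prog_id 2934
lxcd74767545ed6(160) tcx/ingress cil_from_container prog_id 3388
lxcbafc4faaa189(162) tcx/ingress cil_from_container prog_id 3437
"""

def parse_attachments(text):
    """Return one dict per attachment line that matches LINE_RE."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ifname, ifindex, hook, prog, prog_id = m.groups()
            rows.append({"ifname": ifname, "ifindex": int(ifindex),
                         "hook": hook, "prog": prog, "prog_id": int(prog_id)})
    return rows

def interfaces_by_program(rows):
    """Group interface names by the program attached to them."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prog"]].append(row["ifname"])
    return dict(grouped)

if __name__ == "__main__":
    for prog, ifaces in interfaces_by_program(parse_attachments(NET_SHOW_TC)).items():
        print(f"{prog:20} {len(ifaces)} interface(s): {', '.join(ifaces)}")
```

Grouping by program name makes the scaling visible: cil_from_container is the only entry whose interface list grows as pods are scheduled onto the node.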
Tail call: the datapath split into many programs
Why 74 programs and not one? Two reasons. First, each pod has its own program. Second, Cilium splits the datapath into many programs that call each other via tail calls: the verifier caps how much a single eBPF program may contain (the 1-million-instruction limit from Article 2), and the full datapath (LB + CT + NAT + policy + encap) doesn't fit within that budget as one program. The names of the sched_cls programs trace the processing pipeline:
cil_from_container <- entry point from the pod
tail_handle_ipv4 <- process IPv4
tail_handle_ipv4_cont <- (continued)
tail_ipv4_ct_ingress <- look up/update conntrack
tail_nodeport_rev_dnat_ipv4 <- reverse NAT for NodePort
tail_no_service_ipv4 <- no Service hit
cil_lxc_policy <- apply NetworkPolicy
tail_ipv4_to_endpoint <- deliver to the destination pod
The mechanism is the tail call (Article 4): a program does bpf_tail_call into the next program via a prog_array map — and that exact map is in the node's list of maps:
174: prog_array name cilium_call_pol <- the datapath's tail-call table
175: prog_array name cilium_egressca
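BPF object names are capped at 15 characters, which is why cilium_call_policy shows up as cilium_call_pol in the listing. Each entry in cilium_call_policy points to one of the tail_* programs above. The datapath is therefore a chain of small programs linked by tail calls, each one handling a single task and fitting comfortably inside the verifier's limit.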
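The control flow of a tail-call chain can be modelled in userspace. This is a sketch of the mechanism only, not Cilium's code: the slot numbers, the shortened chain, and the `allowed` flag are hypothetical. It keeps three properties of real tail calls: the callee replaces the caller rather than returning to it, a jump into an empty slot does not happen (modelled here as a drop), and the kernel caps the chain length (33 tail calls in current kernels, treated as an assumption here).

```python
TC_ACT_OK, TC_ACT_SHOT = "TC_ACT_OK", "TC_ACT_SHOT"
MAX_TAIL_CALL_CNT = 33  # assumed kernel cap on chained tail calls

# Hypothetical slot numbers in the prog_array.
SLOT_HANDLE_IPV4, SLOT_HANDLE_IPV4_CONT, SLOT_POLICY, SLOT_TO_ENDPOINT = 1, 2, 3, 4

def tail_call(index):
    """Request a jump to the program stored at prog_array[index]."""
    return ("tail_call", index)

def cil_from_container(ctx):
    # Entry point: only IPv4 (ethertype 0x0800) is modelled.
    return tail_call(SLOT_HANDLE_IPV4) if ctx["ethertype"] == 0x0800 else TC_ACT_OK

def tail_handle_ipv4(ctx):
    return tail_call(SLOT_HANDLE_IPV4_CONT)

def tail_handle_ipv4_cont(ctx):
    return tail_call(SLOT_POLICY)

def cil_lxc_policy(ctx):
    # The policy decision is reduced to a precomputed flag in this model.
    return tail_call(SLOT_TO_ENDPOINT) if ctx["allowed"] else TC_ACT_SHOT

def tail_ipv4_to_endpoint(ctx):
    return TC_ACT_OK

PROG_ARRAY = {
    SLOT_HANDLE_IPV4: tail_handle_ipv4,
    SLOT_HANDLE_IPV4_CONT: tail_handle_ipv4_cont,
    SLOT_POLICY: cil_lxc_policy,
    SLOT_TO_ENDPOINT: tail_ipv4_to_endpoint,
}

def run(prog_array, entry, ctx):
    """Run entry and follow tail calls; return (verdict, names of programs run)."""
    trace, prog = [], entry
    for _ in range(MAX_TAIL_CALL_CNT + 1):  # the entry program plus 33 tail calls
        trace.append(prog.__name__)
        result = prog(ctx)
        if not isinstance(result, tuple):   # a plain verdict ends the chain
            return result, trace
        prog = prog_array.get(result[1])
        if prog is None:                    # empty slot: modelled as a drop
            return TC_ACT_SHOT, trace
    return TC_ACT_SHOT, trace               # tail-call budget exhausted

if __name__ == "__main__":
    print(run(PROG_ARRAY, cil_from_container, {"ethertype": 0x0800, "allowed": True}))
```

Because each stage is a separate program, the verifier checks each one against its own budget, which is the point of splitting the datapath this way.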
Load balancing is one map lookup
This is where "kube-proxy-less" becomes obvious. This cluster doesn't run kube-proxy; Cilium does Service load balancing entirely in eBPF, and the "Service table" is just a BPF map — cilium_lb4_services (map id 181 in the list). Read it out in a human-readable form:
cilium bpf lb list # reads from the cilium_lb4_services map
SERVICE ADDRESS BACKEND ADDRESS (REVNAT_ID) (SLOT)
10.32.0.1:443/TCP (2) 10.0.1.12:6443/TCP (1) (2) <- kube-apiserver, backend 1
10.32.0.1:443/TCP (3) 10.0.1.13:6443/TCP (1) (3) <- kube-apiserver, backend 2
10.32.0.10:53/UDP (2) 10.200.0.44:53/UDP (6) (2) <- CoreDNS
10.32.0.10:9153/TCP (2) 10.200.0.44:9153/TCP (7) (2) <- CoreDNS metrics
The whole mechanism is visible here: the ClusterIP Service 10.32.0.1:443 (kube-apiserver) has two slots, each pointing to a real backend (10.0.1.12:6443, 10.0.1.13:6443). When a pod sends a packet to 10.32.0.1:443, cil_from_container looks up this map, picks a slot, then DNATs the destination address to the backend's IP:port. All of this happens in the kernel, right at the pod's veth, with no iptables and no kube-proxy. Adding or removing a backend pod is just an update to this map. This is a large part of what makes Cilium fast: load balancing = one hash map lookup + DNAT.
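The lookup-then-DNAT step can be sketched with a plain dictionary standing in for the BPF map. This is a simplified model under stated assumptions: the real cilium_lb4_services layout (slots, backend IDs, RevNAT IDs) is flattened into one list of backends per Service, and the slot is chosen by hashing the flow with CRC32 purely to keep the example deterministic, which is not necessarily how Cilium selects a backend. The Service and backend addresses come from the listing above; the client source port is hypothetical.

```python
import zlib

# (virtual IP, port, protocol) -> backends; a flattened stand-in for the Service map.
SERVICES = {
    ("10.32.0.1", 443, "TCP"): [("10.0.1.12", 6443), ("10.0.1.13", 6443)],
    ("10.32.0.10", 53, "UDP"): [("10.200.0.44", 53)],
}

def select_backend(pkt):
    """Return the (ip, port) backend for pkt, or None if pkt hits no Service."""
    backends = SERVICES.get((pkt["dst_ip"], pkt["dst_port"], pkt["proto"]))
    if not backends:
        return None
    # Assumption: hash the flow so the same connection always picks the same slot.
    flow = (f'{pkt["src_ip"]}:{pkt["src_port"]}->'
            f'{pkt["dst_ip"]}:{pkt["dst_port"]}/{pkt["proto"]}')
    return backends[zlib.crc32(flow.encode()) % len(backends)]

def load_balance(pkt):
    """Return a copy of pkt whose destination is rewritten (DNAT) on a Service hit."""
    out = dict(pkt)
    backend = select_backend(pkt)
    if backend is not None:
        out["dst_ip"], out["dst_port"] = backend
    return out

if __name__ == "__main__":
    pkt = {"src_ip": "10.200.0.64", "src_port": 40000,  # hypothetical source port
           "dst_ip": "10.32.0.1", "dst_port": 443, "proto": "TCP"}
    print(load_balance(pkt))
```

Scaling a Service up or down only changes the list stored under its key; the lookup code itself never changes, which mirrors the "just an update to this map" point.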
Conntrack and policy are also maps
After DNAT, the packet needs to be remembered so the return direction can reverse-NAT correctly — that's conntrack, and again it's a map (cilium_ct4_global, id 192):
cilium bpf ct list global
TCP IN 10.200.0.64:56046 -> 10.200.0.44:8080 expires=49141 ... [ RxClosing TxClosing SeenNonSyn ] RevNAT=0
TCP IN 10.200.0.64:45086 -> 10.200.0.44:8181 ... SourceSecurityID=1
Each line is a tracked connection: direction, IP:port at both ends, TCP flags, and SourceSecurityID — the security identity of the source pod. That SecurityID is precisely how Cilium applies NetworkPolicy: instead of matching on IP (which changes constantly in k8s), Cilium assigns each pod an identity by labels, stores the pod→identity mapping in cilium_lxc (id 198), then decides allow/deny in cil_lxc_policy based on identity. cilium endpoint list shows that mapping:
ENDPOINT IDENTITY LABELS IPv4
403 18203 k8s:k8s-app=kube-dns ... 10.200.0.33
849 18203 k8s:io.kubernetes.pod.namespace=kube-system 10.200.0.44
The CoreDNS pod carries identity 18203 (derived from k8s labels). A NetworkPolicy compiles into entries in the policy map (cilium_policy_v*, lpm_trie) keyed by identity — and cil_lxc_policy simply looks up that map to allow or drop. Policy, like LB and CT, comes down to one BPF map lookup.
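Identity-based enforcement can be sketched the same way. This is a simplified model, not Cilium's implementation: a plain set with exact-match keys stands in for the LPM trie policy map, a dictionary stands in for the IP-to-identity lookup, and the client identity 40001, the fallback identity 0, and the single allow rule are all hypothetical. Only identity 18203 and the pod IPs come from the listings above.

```python
UNKNOWN_IDENTITY = 0  # hypothetical fallback for source IPs with no known identity

# Source IP -> security identity. 18203 is the CoreDNS identity shown above;
# 40001 is a hypothetical identity for the client pod.
IDENTITY_BY_IP = {
    "10.200.0.33": 18203,
    "10.200.0.44": 18203,
    "10.200.0.64": 40001,
}

# Ingress policy of the CoreDNS endpoint: (remote identity, dest port, protocol).
# Hypothetical rule: identity 40001 may reach DNS over UDP.
COREDNS_INGRESS_POLICY = {(40001, 53, "UDP")}

def ingress_verdict(policy_map, src_ip, dport, proto):
    """Resolve the source IP to an identity, then look the identity up in the policy."""
    identity = IDENTITY_BY_IP.get(src_ip, UNKNOWN_IDENTITY)
    return "ALLOW" if (identity, dport, proto) in policy_map else "DROP"

if __name__ == "__main__":
    print(ingress_verdict(COREDNS_INGRESS_POLICY, "10.200.0.64", 53, "UDP"))
    print(ingress_verdict(COREDNS_INGRESS_POLICY, "10.200.0.64", 9153, "TCP"))

    # The client pod is rescheduled and gets a new (hypothetical) IP. Only the
    # IP -> identity mapping changes; the policy entries stay untouched.
    IDENTITY_BY_IP["10.200.0.99"] = 40001
    print(ingress_verdict(COREDNS_INGRESS_POLICY, "10.200.0.99", 53, "UDP"))
```

The last call shows why keying on identity suits Kubernetes: pod IPs churn, but as long as the labels (and so the identity) stay the same, no policy entry has to be rewritten.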
🧹 Cleanup
This article only reads the running datapath state — it doesn't attach or change anything, so there's nothing to clean up. The node still has 140 programs as at the start. All the observation commands (bpftool net show, bpftool prog show, cilium bpf lb/ct list, cilium endpoint list) are at github.com/nghiadaulau/ebpf-from-scratch, directory 12-tc-cilium-datapath.
Wrap-up
tc/sched_cls is an eBPF hook that runs after the sk_buff exists, sees both ingress and egress, and works with a __sk_buff full of metadata — exactly what you need to route pod network traffic, which is why Cilium puts almost its entire datapath here (74/140 of the node's programs are sched_cls). They attach on ens5 (cil_from_netdev, right after XDP), on the host (cil_to_host), and on the ingress of each pod veth (cil_from_container) — the program count scales with the number of pods. The datapath is split into many small programs linked by tail calls through the prog_array cilium_call_policy (because of the 1-million-instruction limit), clearly visible in the names tail_handle_ipv4 → tail_ipv4_ct_ingress → cil_lxc_policy. And the three pillars of the datapath all come down to a BPF map lookup: Service load balancing = look up cilium_lb4_services then DNAT (kube-proxy-less, you can read out kube-apiserver 10.32.0.1:443 → two backends :6443); conntrack = cilium_ct4_global; NetworkPolicy = look up the policy map by the pod's identity (not by IP). All in the kernel, never going out to userspace.
Part IV has one more article: we've read Cilium's datapath; Article 13 writes a small tc program ourselves — counting and classifying the node's own egress traffic — to understand __sk_buff from the inside, then compare it with how Cilium uses that very hook.