tc/sched_cls and Dissecting a Live Cilium Datapath

In Article 11, XDP dropped packets at the driver, before the kernel had even allocated an sk_buff for them. But most of Kubernetes networking (routing between pods, Service load balancing, applying NetworkPolicy) happens at the next layer: tc. This is where Cilium puts almost its entire datapath. Instead of writing a toy tc program, this article dissects the 74 sched_cls programs that Cilium actually runs on a node, to see how a real pod packet is processed.

tc/sched_cls: the hook after the sk_buff exists

tc (traffic control) is the kernel's long-standing built-in traffic shaping system. eBPF plugs into it via the program type BPF_PROG_TYPE_SCHED_CLS — bpftool shows it as sched_cls. It differs from XDP in a few core ways:

	XDP	tc/sched_cls
Runs	before `sk_buff`	after the `sk_buff` exists
Sees	ingress only	both ingress and egress
Data	raw packet	`__sk_buff` (already has metadata: mark, ifindex, cgroup...)
Location	mostly physical NICs (native mode needs driver support)	any interface, including a pod's veth

In exchange for running later than XDP, tc also sees egress and works on an sk_buff that already carries metadata, which is exactly what routing between pods needs. Kernel 6.6 and later also provide tcx, a newer way of attaching tc programs through BPF links. Several programs can be chained on the same hook, the link owns its attachment so another tool cannot silently replace it, the program detaches automatically when the link is closed, and ordering is set with the BPF_F_BEFORE/BPF_F_AFTER flags. This cluster (kernel 6.17) uses tcx: bpftool net show reports tcx/ingress and tcx/egress.
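To make the chaining behaviour concrete, here is a minimal userspace model of one tcx hook. It is a sketch, not kernel code: the verdict numbers mirror the kernel's TCX_NEXT, TCX_PASS, TCX_DROP and TCX_REDIRECT constants, while the TcxHook class, the two sample programs, and the rule that an attach without an anchor appends to the end are assumptions made for illustration.

```python
# Userspace model of a tcx hook: an ordered chain of programs on one interface.
# Verdict values mirror the kernel's TCX_* constants; the class, the sample
# programs, and the packet dict are hypothetical.
TCX_NEXT, TCX_PASS, TCX_DROP, TCX_REDIRECT = -1, 0, 2, 7


class TcxHook:
    def __init__(self):
        self.chain = []  # list of (name, program) in execution order

    def attach(self, name, program, before=None, after=None):
        """Attach a program; before/after play the role of BPF_F_BEFORE/BPF_F_AFTER."""
        names = [n for n, _ in self.chain]
        if before is not None:
            index = names.index(before)
        elif after is not None:
            index = names.index(after) + 1
        else:
            index = len(self.chain)  # modeling choice: append when no anchor is given
        self.chain.insert(index, (name, program))

    def run(self, packet):
        """Run programs in order; TCX_NEXT hands the packet to the next program."""
        for name, program in self.chain:
            verdict = program(packet)
            if verdict != TCX_NEXT:
                return name, verdict
        return None, TCX_PASS  # nobody decided: the model lets the packet through


def counter(packet):
    packet["seen"] = packet.get("seen", 0) + 1
    return TCX_NEXT  # observe only, let the next program decide


def drop_port_23(packet):
    return TCX_DROP if packet["dport"] == 23 else TCX_NEXT


hook = TcxHook()
hook.attach("drop_port_23", drop_port_23)
hook.attach("counter", counter, before="drop_port_23")

print([name for name, _ in hook.chain])
print(hook.run({"dport": 23}))
print(hook.run({"dport": 443}))
```

The counter was attached second but runs first because it asked to go before drop_port_23, which is how independent tools can share one hook without overwriting each other.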

74 programs, where they attach

On the node, count eBPF programs by type:

sudo bpftool prog show | grep -c sched_cls
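To see the split for every program type rather than just sched_cls, the JSON output of bpftool can be tallied in a few lines. This is a minimal sketch that assumes the output of `sudo bpftool -j prog show` is a JSON array of objects with a "type" field; the three-program sample is made up so the snippet runs anywhere.

```python
import json
from collections import Counter


def count_by_type(bpftool_json):
    """Tally programs per type from the JSON text of `bpftool -j prog show`."""
    programs = json.loads(bpftool_json)
    return Counter(prog.get("type", "unknown") for prog in programs)


# Hypothetical sample. On a node, replace it with the text printed by
# `sudo bpftool -j prog show`.
sample = json.dumps([
    {"id": 2889, "type": "sched_cls", "name": "cil_from_container"},
    {"id": 2969, "type": "sched_cls", "name": "cil_from_netdev"},
    {"id": 17, "type": "cgroup_skb", "name": "example"},
])

for prog_type, count in count_by_type(sample).most_common():
    print(f"{prog_type:12} {count}")
```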

74 out of the node's total of 140 eBPF programs are sched_cls — nearly all of them Cilium's. Where do they attach? bpftool net show:

tc:
ens5(2)            tcx/ingress cil_from_netdev    prog_id 2969
ens5(2)            tcx/egress  cil_to_netdev      prog_id 2960
cilium_host(74)    tcx/ingress cil_to_host        prog_id 2944
cilium_host(74)    tcx/egress  cil_from_host      prog_id 2943
lxc9020cf5e63ba(100) tcx/ingress cil_from_container prog_id 2889
lxc_health(124)      tcx/ingress cil_from_container prog_id 2934
lxcd74767545ed6(160) tcx/ingress cil_from_container prog_id 3388
lxcbafc4faaa189(162) tcx/ingress cil_from_container prog_id 3437

You can read the whole datapath architecture out of this:

cil_from_netdev / cil_to_netdev attach on ens5 (the physical NIC) — packets into/out of the node pass through here. This is the layer right after the XDP firewall from Article 11.
cil_from_container attaches on the ingress of each lxc... — each of those interfaces is one end of a pod's veth. Every pod added to the node adds another sched_cls program attached to its veth. That's the reason for the number 74: it scales with the number of pods + interfaces on the node.
cil_to_host / cil_from_host attach on cilium_host and handle traffic going into and out of the host stack.

  Pod A ──veth(lxc..)──► cil_from_container ──┐
                                              │ (BPF, in the kernel)
                                              ▼
                            map lookup: LB? policy? CT?
                                              │
                          ┌───────────────────┴────────────┐
                          ▼                                 ▼
                  Pod B on the same node          cil_to_netdev ──► ens5 ──► another node
              (forwarded directly veth→veth)
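The attachment listing is regular enough to parse, which makes the "one cil_from_container per pod veth" pattern easy to check on a busier node. The sketch below works on the plain-text lines exactly as shown above; the function name and the tuple layout are illustrative choices, and other bpftool versions may format the section differently.

```python
from collections import defaultdict

# The tc section of `bpftool net show`, copied from the listing above.
NET_SHOW = """\
ens5(2)            tcx/ingress cil_from_netdev    prog_id 2969
ens5(2)            tcx/egress  cil_to_netdev      prog_id 2960
cilium_host(74)    tcx/ingress cil_to_host        prog_id 2944
cilium_host(74)    tcx/egress  cil_from_host      prog_id 2943
lxc9020cf5e63ba(100) tcx/ingress cil_from_container prog_id 2889
lxc_health(124)      tcx/ingress cil_from_container prog_id 2934
lxcd74767545ed6(160) tcx/ingress cil_from_container prog_id 3388
lxcbafc4faaa189(162) tcx/ingress cil_from_container prog_id 3437
"""


def parse_attachments(text):
    """Return (interface, ifindex, hook, program, prog_id) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[3] != "prog_id":
            continue  # skip headers and anything that is not an attachment line
        dev, hook, program, _, prog_id = fields
        name, _, rest = dev.partition("(")
        rows.append((name, int(rest.rstrip(")")), hook, program, int(prog_id)))
    return rows


by_program = defaultdict(list)
for name, ifindex, hook, program, prog_id in parse_attachments(NET_SHOW):
    by_program[program].append(f"{name}:{hook}")

for program, places in by_program.items():
    print(f"{program:20} x{len(places)}  {', '.join(places)}")
```

On this node the grouping shows four attachments of cil_from_container (three pod veths plus lxc_health) and one each of the netdev and host programs.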

Tail call: the datapath split into many programs

Why 74 programs and not one? Two reasons. One, each pod has its own program. Two, Cilium splits the datapath into many programs that call each other via tail calls: the verifier will analyze at most 1 million instructions per program (Article 2), and the full datapath (LB + CT + NAT + policy + encap) does not fit in one program. Look at the names of the sched_cls programs and you can see the processing pipeline:

cil_from_container          <- entry point from the pod
tail_handle_ipv4            <- process IPv4
tail_handle_ipv4_cont       <- (continued)
tail_ipv4_ct_ingress        <- look up/update conntrack
tail_nodeport_rev_dnat_ipv4 <- reverse NAT for NodePort
tail_no_service_ipv4        <- no Service hit
cil_lxc_policy              <- apply NetworkPolicy
tail_ipv4_to_endpoint       <- deliver to the destination pod

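The mechanism is the tail call (Article 4): a program calls bpf_tail_call to jump into the next program through a prog_array map, and maps of that type show up in the node's list of maps:

174: prog_array  name cilium_call_pol      <- cilium_call_policy, name cut to 15 characters
175: prog_array  name cilium_egressca

The kernel keeps only the first 15 characters of an object name, so cilium_call_pol is cilium_call_policy. A prog_array is a jump table: each slot holds one program, and bpf_tail_call jumps to whatever sits in the chosen slot. In Cilium's source, cilium_call_policy is indexed by endpoint ID and leads into that endpoint's policy program (cil_lxc_policy); the tail_* stages are reached the same way through per-endpoint prog_array maps. The datapath is therefore a chain of small programs linked by tail calls, each one handling a single task and fitting comfortably inside the verifier's limit.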
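The jump-table idea can be modeled in userspace in a few lines. This is a sketch under stated assumptions, not Cilium's code: the slot numbers, the shortened four-stage order, and the "return the next slot" convention are invented for illustration (a real bpf_tail_call never returns to the caller on success). The cap of 33 chained tail calls is the kernel's limit, and treating an empty slot as a drop is a modeling choice.

```python
# Userspace model of a prog_array and bpf_tail_call. Only the stage names come
# from the program list above; slots and stage bodies are hypothetical.
MAX_TAIL_CALLS = 33  # the kernel caps chained tail calls at 33


def make_stage(name, next_slot):
    """Build a stage that records its name, then names the slot to jump to."""
    def stage(pkt):
        pkt["trace"].append(name)
        return next_slot
    return stage


prog_array = {
    0: make_stage("tail_handle_ipv4", 1),
    1: make_stage("tail_ipv4_ct_ingress", 2),
    2: make_stage("cil_lxc_policy", 3),
    3: make_stage("tail_ipv4_to_endpoint", None),  # None: the chain ends here
}


def run_chain(prog_array, first_slot, pkt):
    """Follow tail calls from first_slot until a stage ends the chain."""
    slot, jumps = first_slot, 0
    while slot is not None:
        if jumps >= MAX_TAIL_CALLS:
            return "drop: tail call limit reached"
        program = prog_array.get(slot)
        if program is None:
            return "drop: empty slot"  # modeling choice for a missed tail call
        jumps += 1
        slot = program(pkt)
    return "delivered"


pkt = {"trace": ["cil_from_container"]}
print(run_chain(prog_array, 0, pkt))
print(" -> ".join(pkt["trace"]))
```

Swapping the program stored in one slot changes the pipeline without touching the other stages, which is the property that lets a datapath be updated piece by piece.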

Load balancing is one map lookup

This is where "kube-proxy-less" becomes obvious. This cluster doesn't run kube-proxy; Cilium does Service load balancing entirely in eBPF, and the "Service table" is just a BPF map — cilium_lb4_services (map id 181 in the list). Read it out in a human-readable form:

cilium bpf lb list      # reads from the cilium_lb4_services map

SERVICE ADDRESS         BACKEND ADDRESS (REVNAT_ID) (SLOT)
10.32.0.1:443/TCP (2)   10.0.1.12:6443/TCP (1) (2)     <- kube-apiserver, backend 1
10.32.0.1:443/TCP (3)   10.0.1.13:6443/TCP (1) (3)     <- kube-apiserver, backend 2
10.32.0.10:53/UDP (2)   10.200.0.44:53/UDP  (6) (2)    <- CoreDNS
10.32.0.10:9153/TCP (2) 10.200.0.44:9153/TCP (7) (2)   <- CoreDNS metrics

You can read the whole mechanism out of this: the ClusterIP Service 10.32.0.1:443 (kube-apiserver) has two slots, each pointing to a real backend (10.0.1.12:6443, 10.0.1.13:6443). When a pod sends a packet to 10.32.0.1:443, cil_from_container looks up this map, picks a slot, then DNATs the destination address to that backend's IP:port. All of this happens in the kernel, right at the pod's veth, with no iptables rules and no kube-proxy. Adding or removing a backend pod is just an update to this map. This is a large part of what makes Cilium's Service path fast: load balancing is one hash map lookup plus a DNAT.
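The lookup-then-DNAT step can be sketched as a userspace model. The Service and backend addresses are taken from the cilium bpf lb list output above; the dict layout, the function name, and the slot-selection rule (source port modulo the number of backends, chosen only to keep the example deterministic) are assumptions and do not reflect Cilium's actual structs or selection logic.

```python
# Userspace model of Service load balancing: one map lookup, then DNAT.
# Addresses come from the listing above; everything else is a simplification.
SERVICES = {
    ("10.32.0.1", 443, "TCP"): [("10.0.1.12", 6443), ("10.0.1.13", 6443)],
    ("10.32.0.10", 53, "UDP"): [("10.200.0.44", 53)],
    ("10.32.0.10", 9153, "TCP"): [("10.200.0.44", 9153)],
}


def load_balance(packet):
    """Rewrite the destination if it is a Service address; return the packet."""
    key = (packet["dst_ip"], packet["dst_port"], packet["proto"])
    backends = SERVICES.get(key)
    if not backends:
        return packet  # not a Service address: leave the packet untouched
    # Assumption: derive the slot from the source port so the example is
    # deterministic; the real datapath uses its own selection logic.
    slot = packet["src_port"] % len(backends)
    backend_ip, backend_port = backends[slot]
    return {
        **packet,
        "dst_ip": backend_ip,
        "dst_port": backend_port,
        "orig_dst": (packet["dst_ip"], packet["dst_port"]),  # kept for reverse NAT
    }


request = {"src_ip": "10.200.0.64", "src_port": 56046,
           "dst_ip": "10.32.0.1", "dst_port": 443, "proto": "TCP"}
translated = load_balance(request)
print(translated["dst_ip"], translated["dst_port"])

# Adding a backend is only a map update; no program is reloaded.
SERVICES[("10.32.0.10", 53, "UDP")].append(("10.200.0.45", 53))  # hypothetical pod
```

Keeping the original destination alongside the translated one is what lets the reply be rewritten back to the Service address later.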

Conntrack and policy are also maps

After DNAT, the packet needs to be remembered so the return direction can reverse-NAT correctly — that's conntrack, and again it's a map (cilium_ct4_global, id 192):

cilium bpf ct list global

TCP IN 10.200.0.64:56046 -> 10.200.0.44:8080 expires=49141 ... [ RxClosing TxClosing SeenNonSyn ] RevNAT=0
TCP IN 10.200.0.64:45086 -> 10.200.0.44:8181 ... SourceSecurityID=1

Each line is a tracked connection: direction, IP:port at both ends, TCP flags, and SourceSecurityID — the security identity of the source pod. That SecurityID is precisely how Cilium applies NetworkPolicy: instead of matching on IP (which changes constantly in k8s), Cilium assigns each pod an identity by labels, stores the pod→identity mapping in cilium_lxc (id 198), then decides allow/deny in cil_lxc_policy based on identity. cilium endpoint list shows that mapping:

ENDPOINT  IDENTITY  LABELS                                       IPv4
403       18203     k8s:k8s-app=kube-dns ...                     10.200.0.33
849       18203     k8s:io.kubernetes.pod.namespace=kube-system  10.200.0.44

The CoreDNS pod carries identity 18203 (derived from k8s labels). A NetworkPolicy compiles into entries in the policy map (cilium_policy_v*, lpm_trie) keyed by identity — and cil_lxc_policy simply looks up that map to allow or drop. Policy, like LB and CT, comes down to one BPF map lookup.
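How the conntrack and policy maps work together can be sketched in userspace. The model below assumes a common ordering: look up the flow in conntrack first, and consult the identity-keyed policy map only for new flows. Identity 18203, identity 1 for 10.200.0.64, and the addresses come from the outputs above; identity 4242, the address 10.200.0.99, the dict layouts, and the single allow rule are hypothetical.

```python
# Userspace model: conntrack lookup first, identity-based policy for new flows.
# Map layouts and the allow rule are illustrative, not Cilium's structs.
IDENTITY_OF = {
    "10.200.0.44": 18203,  # CoreDNS endpoint from `cilium endpoint list`
    "10.200.0.64": 1,      # SourceSecurityID seen in the conntrack output
    "10.200.0.99": 4242,   # hypothetical pod with an unrelated identity
}

# Policy of the destination endpoint: (source identity, proto, port) -> allow
POLICY = {(1, "TCP", 8080): True}

CONNTRACK = {}  # (src_ip, src_port, dst_ip, dst_port, proto) -> entry


def handle(packet):
    """Return a verdict string for one packet, updating conntrack as needed."""
    flow = (packet["src_ip"], packet["src_port"],
            packet["dst_ip"], packet["dst_port"], packet["proto"])
    entry = CONNTRACK.get(flow)
    if entry is not None:
        entry["packets"] += 1
        return "allow (established)"
    src_identity = IDENTITY_OF.get(packet["src_ip"])
    if not POLICY.get((src_identity, packet["proto"], packet["dst_port"]), False):
        return "drop (policy denied)"
    CONNTRACK[flow] = {"source_identity": src_identity, "packets": 1}
    return "allow (new)"


first = {"src_ip": "10.200.0.64", "src_port": 56046,
         "dst_ip": "10.200.0.44", "dst_port": 8080, "proto": "TCP"}
print(handle(first))   # new flow, allowed by identity
print(handle(first))   # same flow again, found in conntrack
print(handle({**first, "src_ip": "10.200.0.99"}))  # different identity
```

Because the rule is keyed by identity rather than address, a pod that is rescheduled with a new IP keeps the same verdict as long as its labels, and therefore its identity, stay the same.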

🧹 Cleanup

This article only reads the running datapath state — it doesn't attach or change anything, so there's nothing to clean up. The node still has 140 programs as at the start. All the observation commands (bpftool net show, bpftool prog show, cilium bpf lb/ct list, cilium endpoint list) are at github.com/nghiadaulau/ebpf-from-scratch, directory 12-tc-cilium-datapath.

Wrap-up

tc/sched_cls is an eBPF hook that runs after the sk_buff exists, sees both ingress and egress, and works with a __sk_buff full of metadata. That is exactly what routing pod traffic needs, which is why Cilium puts almost its entire datapath here (74 of the node's 140 programs are sched_cls). The programs attach on ens5 (cil_from_netdev and cil_to_netdev, right after XDP), on cilium_host (cil_to_host and cil_from_host), and on the ingress of each pod veth (cil_from_container), so the program count scales with the number of pods. The datapath is split into many small programs linked by tail calls through prog_array maps such as cilium_call_policy, because the verifier will analyze at most 1 million instructions per program; the split is visible in the names tail_handle_ipv4 → tail_ipv4_ct_ingress → cil_lxc_policy. And the three pillars of the datapath all come down to a BPF map lookup: Service load balancing is a lookup in cilium_lb4_services followed by DNAT (kube-proxy-less; you can read out kube-apiserver 10.32.0.1:443 → two backends on :6443), conntrack is cilium_ct4_global, and NetworkPolicy is a lookup in the policy map by the pod's identity, not by IP. All of it runs in the kernel and never goes out to userspace.

Part IV has one more article: we've read Cilium's datapath; Article 13 writes a small tc program ourselves — counting and classifying the node's own egress traffic — to understand __sk_buff from the inside, then compare it with how Cilium uses that very hook.