Writing a tc Program Yourself: __sk_buff and the tcx Chain
Article 12 read Cilium's tc datapath from the outside. Now we write a sched_cls program ourselves to understand it from the inside: an egress packet counter, classified by protocol. The goal isn't to do anything terribly useful, but to nail down two things — __sk_buff (what tc hands the program, completely unlike XDP's raw packet) and tcx (how to chain multiple tc programs on one hook). And we'll trip over a real lesson about that chain.
__sk_buff: a packet with metadata, not raw bytes
The core difference between XDP and tc lies in what the kernel hands the program. XDP (Article 11) hands xdp_md — basically just two pointers to the raw packet; you have to parse everything yourself. tc hands __sk_buff, a view onto the sk_buff that the kernel has already built and filled with metadata:
struct __sk_buff {      // abridged; fields are not listed in kernel order
    __u32 len;          // packet length, already computed
    __u32 protocol;     // L3 protocol (ETH_P_IP...), already filled in, network byte order
    __u32 ifindex;      // interface
    __u32 mark;         // fwmark (shared with netfilter/routing)
    __u32 priority;
    ...                 // and data/data_end if you need raw parsing
};
That means many questions don't need byte parsing: for the L3 protocol, read skb->protocol directly; for the packet length, read skb->len. That convenience comes from running after the kernel has built the sk_buff: tc pays for the allocation that XDP avoids, and in return the metadata is ready to read.
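To make the contrast concrete, here is a minimal user-space sketch (plain C, not BPF) of the work an XDP-style program has to do before it knows the L3 protocol: bounds-check the Ethernet header, then read the EtherType from bytes 12 and 13. The name ethertype_from_raw is made up for illustration, and the sketch assumes an untagged frame; under tc the same answer is already sitting in skb->protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN 14 /* dst MAC (6) + src MAC (6) + EtherType (2) */

/*
 * Illustrative helper (hypothetical name): recover the EtherType from a raw
 * frame delimited by data/data_end, the way an XDP program has to.
 * Assumes an untagged frame (no VLAN header).
 * Returns the EtherType in host byte order, or -1 if the frame is too short.
 */
static int ethertype_from_raw(const uint8_t *data, const uint8_t *data_end)
{
    assert(data != NULL && data_end != NULL); /* caller contract */

    /* Same kind of bounds check the verifier demands before any access. */
    if (data_end < data || (size_t)(data_end - data) < ETH_HLEN)
        return -1;

    /* The EtherType is big-endian on the wire: high byte first. */
    return (data[12] << 8) | data[13];
}
```

A frame whose bytes 12 and 13 are 0x08 0x00 yields 0x0800 (ETH_P_IP); anything shorter than 14 bytes yields -1. In the tc program none of this is needed.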
The program: count egress by protocol
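// Headers are an assumption; the excerpt omits them (a vmlinux.h build differs).
#include <linux/bpf.h>       // struct __sk_buff, BPF_MAP_TYPE_ARRAY
#include <linux/if_ether.h>  // ETH_P_IP, ETH_P_IPV6
#include <bpf/bpf_helpers.h> // SEC(), __uint(), bpf_map_lookup_elem()
#include <bpf/bpf_endian.h>  // bpf_ntohs()
#define TC_ACT_OK 0 // == TCX_PASS: let the packet through, STOP the tcx chain
struct { // count array: 0=total 1=IPv4 2=IPv6 3=other, 9=total bytes
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 10);
    __type(key, __u32);
    __type(value, __u64);
} egress SEC(".maps");
// add(): look up slot `key` and atomically add `delta` to it
static __always_inline void add(__u32 key, __u64 delta)
{
    __u64 *val = bpf_map_lookup_elem(&egress, &key);
    if (val) // the verifier requires the NULL check before the pointer is used
        __sync_fetch_and_add(val, delta);
}
SEC("tc")
int count_egress(struct __sk_buff *skb)
{
    add(0, 1);        // every packet: increment the total
    add(9, skb->len); // accumulate length; the metadata is ready
    __u32 proto = bpf_ntohs(skb->protocol); // L3 protocol, no parsing needed
    if (proto == ETH_P_IP) add(1, 1);
    else if (proto == ETH_P_IPV6) add(2, 1);
    else add(3, 1);
    return TC_ACT_OK; // don't drop, just observe
}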
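skb->protocol carries the EtherType in network byte order (a __be16 value, even though the __sk_buff field is declared __u32), so we convert it with bpf_ntohs before comparing. add() is a small helper: a bpf_map_lookup_elem followed by __sync_fetch_and_add on the returned value. There is not a single line of Ethernet/IP parsing; all the information comes straight from the __sk_buff metadata.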
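The byte-order point is easy to check outside the kernel. The following user-space sketch (plain C) mirrors the classification branch: struct pkt_meta and slot_for_packet are illustrative names, not kernel types, and htons/ntohs stand in for bpf_htons/bpf_ntohs.

```c
#include <arpa/inet.h> /* htons, ntohs */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP   0x0800 /* IPv4 EtherType */
#define ETH_P_IPV6 0x86DD /* IPv6 EtherType */

/* Hypothetical stand-in for the two __sk_buff fields the counter reads. */
struct pkt_meta {
    uint32_t len;      /* packet length in bytes */
    uint32_t protocol; /* EtherType, network byte order, in a 32-bit field */
};

/*
 * Map a packet to the counter slot used in the article:
 * 1 = IPv4, 2 = IPv6, 3 = other.
 */
static uint32_t slot_for_packet(const struct pkt_meta *m)
{
    assert(m != NULL); /* caller contract */

    uint16_t proto = ntohs((uint16_t)m->protocol); /* convert before comparing */

    if (proto == ETH_P_IP)
        return 1;
    if (proto == ETH_P_IPV6)
        return 2;
    return 3;
}
```

On a little-endian machine, storing the host-order constant 0x0800 in protocol without htons lands in slot 3, which is exactly the misclassification the bpf_ntohs call avoids.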
The return value TC_ACT_OK is where the real story is; it comes back at the end of the article, so keep it in mind.
Attaching with tcx
tcx (kernel 6.6+, Article 12) is the BPF-link-based way of attaching tc; bpftool attaches it directly:
clang -O2 -g -target bpf -I. -c count.bpf.c -o count.bpf.o
sudo bpftool prog loadall count.bpf.o /sys/fs/bpf/tccount
sudo bpftool net attach tcx_egress pinned /sys/fs/bpf/tccount/count_egress dev lo
Attach to lo (loopback), then generate a bit of internal traffic and read the map:
ping -c 3 127.0.0.1 # IPv4
ping6 -c 2 ::1 # IPv6
sudo bpftool map dump id <id>
key: 0 value: 30 <- 30 total egress packets on lo
key: 1 value: 26 <- IPv4
key: 2 value: 4 <- IPv6 (2 echo + 2 reply from ping6 -c 2)
key: 3 value: 0 <- other: none
key: 9 value: 2902 <- total bytes
The numbers match: ping6 -c 2 ::1 creates 2 echo + 2 reply, all four going through lo's egress → IPv6 = 4. IPv4 = 26 consists of the 6 packets from ping 127.0.0.1 plus background loopback traffic (systemd, local DNS). Total 30 = 26 + 4, no "other" packets. The program runs correctly, classifying by __sk_buff metadata without parsing a single byte.
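The bookkeeping behind those numbers can be written down as a small user-space check. It is only a sketch of the arithmetic: the struct and function names are invented here, and the factor of two assumes every echo request sent to a loopback address is answered, so request and reply both leave through lo's egress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Snapshot of the counter slots described in the article. */
struct egress_counts {
    uint64_t total; /* key 0 */
    uint64_t ipv4;  /* key 1 */
    uint64_t ipv6;  /* key 2 */
    uint64_t other; /* key 3 */
};

/* Packets that `ping -c n` to a loopback address puts on lo egress. */
static uint64_t ping_lo_packets(uint64_t n)
{
    return 2 * n; /* n echo requests + n echo replies */
}

/* Every packet is counted once in total and once in exactly one class. */
static bool counts_consistent(const struct egress_counts *c)
{
    assert(c != NULL); /* caller contract */
    return c->total == c->ipv4 + c->ipv6 + c->other;
}
```

With the dump above, counts_consistent holds for 30 = 26 + 4 + 0, ping_lo_packets(2) gives the 4 IPv6 packets, and subtracting ping_lo_packets(3) = 6 from the 26 IPv4 packets leaves 20 packets of background loopback traffic.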
Why attach to lo and not ens5?
The first attempt attached this exact program to the egress of ens5 (the physical NIC), and the total counter sat at 0 even though the node clearly had outbound traffic. That turned out not to be a bug in the program. bpftool net show dev ens5 shows the tcx egress chain:
ens5 tcx/egress cil_to_netdev prog_id 2960 link_id 17 <- Cilium, attached first
ens5 tcx/egress count_egress prog_id 4669 <- ours, attached after
Two programs on the same egress hook, running in order — Cilium first, ours after. The problem lies in the return code. Read the definition straight from the node's own /usr/include/linux/bpf.h:
/* (Simplified) user return codes for tcx prog type.
 * ... unknown return codes are mapped to TCX_NEXT. */
enum tcx_action_base {
    TCX_NEXT     = -1, // == TC_ACT_UNSPEC: run the NEXT program in the chain
    TCX_PASS     = 0,  // == TC_ACT_OK: let the packet through, STOP the chain
    TCX_DROP     = 2,  // == TC_ACT_SHOT
    TCX_REDIRECT = 7,  // == TC_ACT_REDIRECT
};
The tcx chain only continues to the next program when the current program returns TCX_NEXT (-1). Cilium's cil_to_netdev returns a terminating verdict (TCX_PASS or TCX_REDIRECT — it forwards/redirects the packet), so the chain stops right at Cilium, never reaching count_egress behind it → total = 0. On lo it's different: our program is the only one and is first, so it runs, then its TC_ACT_OK stops the chain (there's nobody behind it to run anyway).
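The stop rule fits in a few lines. What follows is a simplified user-space model of the chain walk, not kernel code: run_chain, the prog_fn type and the two stand-in programs are invented for illustration, and the only behavior modelled is "continue on TCX_NEXT, stop on anything else".

```c
#include <assert.h>
#include <stddef.h>

/* Same values as enum tcx_action_base in linux/bpf.h. */
enum { TCX_NEXT = -1, TCX_PASS = 0, TCX_DROP = 2, TCX_REDIRECT = 7 };

typedef int (*prog_fn)(void);

static int counter_hits; /* how many times the counter stand-in ran */

/* Stand-in for a datapath program that always issues a final verdict. */
static int owner_prog(void)
{
    return TCX_PASS;
}

/* Stand-in for count_egress: count, then return TC_ACT_OK (== TCX_PASS). */
static int counter_prog(void)
{
    counter_hits++;
    return TCX_PASS;
}

/* Run programs in attach order; stop at the first verdict != TCX_NEXT. */
static int run_chain(const prog_fn *chain, size_t n)
{
    int ret = TCX_NEXT;

    assert(chain != NULL || n == 0); /* caller contract */
    for (size_t i = 0; i < n; i++) {
        ret = chain[i]();
        if (ret != TCX_NEXT)
            break;
    }
    return ret;
}
```

With the chain { owner_prog, counter_prog } (the ens5 case) counter_hits stays at 0; with { counter_prog } alone (the lo case) it becomes 1.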
To make the counter run on ens5 you'd have to attach it before Cilium in the chain; that's the job of the BPF_F_BEFORE flag when creating the link (ordering control is precisely the feature tcx added in kernel 6.6). By the same rule, a counter placed first would also have to return TCX_NEXT instead of TC_ACT_OK, otherwise it would end the chain before cil_to_netdev runs. The lesson: on a hook that already has an owner like Cilium's datapath, the order in the tcx chain decides whether your program sees the packet at all, and a terminating verdict from a program ahead of you will hide every program behind it.
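Ordering and verdict interact, which a small user-space model makes visible. This is a sketch, not libbpf or kernel code: chain_prepend merely imitates what attaching at the front of the chain achieves, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TCX_NEXT = -1, TCX_PASS = 0 }; /* values from enum tcx_action_base */

#define MAX_PROGS 4

typedef int (*prog_fn)(void);

struct chain {
    prog_fn progs[MAX_PROGS];
    size_t n;
};

static int owner_hits, counter_hits;

/* Stand-in for the hook's owner: always issues a final verdict. */
static int owner_prog(void)   { owner_hits++;   return TCX_PASS; }
/* Counter that ends the chain, like returning TC_ACT_OK. */
static int counter_pass(void) { counter_hits++; return TCX_PASS; }
/* Counter that hands the packet on to the next program. */
static int counter_next(void) { counter_hits++; return TCX_NEXT; }

/* Model of "attach before everything else": shift right, insert at slot 0. */
static void chain_prepend(struct chain *c, prog_fn p)
{
    assert(c != NULL && c->n < MAX_PROGS);
    memmove(&c->progs[1], &c->progs[0], c->n * sizeof(c->progs[0]));
    c->progs[0] = p;
    c->n++;
}

/* Walk the chain in order; the first verdict other than TCX_NEXT ends it. */
static int chain_run(const struct chain *c)
{
    int ret = TCX_NEXT;

    assert(c != NULL);
    for (size_t i = 0; i < c->n; i++) {
        ret = c->progs[i]();
        if (ret != TCX_NEXT)
            break;
    }
    return ret;
}
```

Prepending counter_pass makes the counter run but starves owner_prog; prepending counter_next lets both run. In this model, a counter placed ahead of the owner only stays harmless if it returns TCX_NEXT.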
🧹 Cleanup
sudo bpftool net detach tcx_egress dev lo
sudo rm -rf /sys/fs/bpf/tccount
As in Article 11, the detach command is placed in a trap ... EXIT so it always runs. After detaching, the node is back to 140 programs. The source (count.bpf.c, the build/attach commands) is at github.com/nghiadaulau/ebpf-from-scratch, directory 13-tc-write.
Wrap-up
Writing a tc/sched_cls program ourselves makes two things clear.
One, __sk_buff: tc runs after the kernel builds the sk_buff, so it hands the program metadata already filled in (skb->protocol for the L3 protocol, skb->len, skb->mark, skb->ifindex), and we classified egress packets by protocol without parsing a single byte, unlike XDP, which has to peel everything off raw pointers. We attached with tcx (bpftool net attach tcx_egress, kernel 6.6+) and ran on lo to get correct counts (IPv4 26 / IPv6 4, matching the generated traffic).
Two, the tcx chain: multiple programs stacked on one hook run in order, and the chain only continues when the current one returns TCX_NEXT (-1); TC_ACT_OK (= TCX_PASS = 0) stops it. That's why the same program attached after Cilium on ens5 never ran once (Cilium returns a terminating verdict first), as verified straight from enum tcx_action_base in the node's kernel header. To make it run you must attach it ahead of Cilium with BPF_F_BEFORE. On a hook that already has an owner, chain order is everything.
Part IV closes here: we've gone from XDP (Article 11) through Cilium's datapath (Article 12) to writing tc ourselves (this article) — covering both of eBPF's main network hooks and how they chain. Part V moves on to security: LSM BPF attaching to the kernel's security control points, seccomp, and how Tetragon uses eBPF to observe and then enforce process behavior across a cluster.