Writing a tc Program Yourself: __sk_buff and the tcx Chain

Kai · 6 min read

Article 12 read Cilium's tc datapath from the outside. Now we write a sched_cls program ourselves to understand it from the inside: an egress packet counter, classified by protocol. The goal isn't to do anything terribly useful, but to nail down two things — __sk_buff (what tc hands the program, completely unlike XDP's raw packet) and tcx (how to chain multiple tc programs on one hook). And we'll trip over a real lesson about that chain.

__sk_buff: a packet with metadata, not raw bytes

The core difference between XDP and tc lies in what the kernel hands the program. XDP (Article 11) hands xdp_md — basically just two pointers to the raw packet; you have to parse everything yourself. tc hands __sk_buff, a view onto the sk_buff that the kernel has already built and filled with metadata:

struct __sk_buff {      // abridged: header order kept, other fields omitted
    __u32 len;          // packet length, already computed
    __u32 mark;         // fwmark (shared with netfilter/routing)
    __u32 protocol;     // L3 protocol (ETH_P_IP...), in network byte order
    __u32 priority;
    __u32 ifindex;      // interface
    ...                 // and data/data_end if you need raw parsing
};

That means many questions need no byte parsing: for the L3 protocol, read skb->protocol directly; for the packet length, read skb->len. That convenience comes from running after the kernel has built the sk_buff. The price versus XDP is that tc runs later in the stack, after the sk_buff allocation that XDP avoids, but in return the information is ready.
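To make the contrast concrete, here is a minimal userspace sketch (plain C, not BPF) of the work an XDP-style program has to do just to learn the L3 protocol: bounds-check the Ethernet header against data_end, then read the EtherType from the wire bytes. The helper name parse_ethertype, the constant ETH_HEADER_LEN, and the hand-built frame in the usage note are illustrative assumptions, not code from the series.

```c
#include <stdint.h>

#define ETH_HEADER_LEN 14  /* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Hypothetical helper: extract the EtherType from a raw frame, given only
 * data/data_end style pointers as an XDP program would have.
 * Returns 0 on success, -1 if the frame is too short to hold the header. */
static int parse_ethertype(const uint8_t *data, const uint8_t *data_end,
                           uint16_t *ethertype)
{
    /* Bounds check first: in real BPF the verifier rejects any read
     * that is not proven to stay below data_end. */
    if (data_end - data < ETH_HEADER_LEN)
        return -1;

    /* EtherType sits at bytes 12..13, big-endian on the wire. */
    *ethertype = (uint16_t)((data[12] << 8) | data[13]);
    return 0;
}
```

Calling it on a 14-byte buffer whose bytes 12 and 13 are 0x86 and 0xDD yields 0x86DD (IPv6), and a 13-byte buffer is rejected. On tc, the same answer is the single field skb->protocol.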

The program: count egress by protocol

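// Headers assumed for this listing; the repo may pull them in differently.
#include <linux/bpf.h>
#include <linux/if_ether.h>   // ETH_P_IP, ETH_P_IPV6
#include <bpf/bpf_helpers.h>  // SEC, __uint, __type, map helpers
#include <bpf/bpf_endian.h>   // bpf_ntohs

#define TC_ACT_OK  0      // == TCX_PASS: let the packet through, STOP the tcx chain

struct {                  // count array: 0=total 1=IPv4 2=IPv6 3=other, 9=total bytes
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 10);
    __type(key, __u32);
    __type(value, __u64);
} egress SEC(".maps");

// add(): bump one slot atomically; the lookup can return NULL, so check it
static __always_inline void add(__u32 key, __u64 val)
{
    __u64 *slot = bpf_map_lookup_elem(&egress, &key);
    if (slot)
        __sync_fetch_and_add(slot, val);
}

SEC("tc")
int count_egress(struct __sk_buff *skb)
{
    add(0, 1);                               // every packet: increment the total
    add(9, skb->len);                        // accumulate length, metadata is ready

    __u32 proto = bpf_ntohs(skb->protocol);  // L3 protocol, no parsing needed
    if (proto == ETH_P_IP)        add(1, 1);
    else if (proto == ETH_P_IPV6) add(2, 1);
    else                          add(3, 1);

    return TC_ACT_OK;                        // don't drop, just observe
}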

skb->protocol is declared as __u32 in __sk_buff, but it carries the kernel's __be16 EtherType in network byte order, so we convert with bpf_ntohs before comparing. add() is a small helper: bpf_map_lookup_elem + __sync_fetch_and_add. Not a single line of Ethernet/IP parsing; all the information comes straight from the __sk_buff metadata.
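The byte-order point is easy to get wrong, so here is a small userspace sketch of it, with htons/ntohs standing in for bpf_htons/bpf_ntohs. The helper names slot_for_protocol and is_ipv4 are hypothetical; the slot numbers follow the map layout in the listing (1 = IPv4, 2 = IPv6, 3 = other). It shows the two equivalent ways to compare: convert the field to host order, or convert the constant to network order.

```c
#include <arpa/inet.h>  /* htons, ntohs */
#include <stdint.h>

#define ETH_P_IP   0x0800  /* IPv4 EtherType, written in host order */
#define ETH_P_IPV6 0x86DD  /* IPv6 EtherType, written in host order */

/* Hypothetical helper: map the value read from skb->protocol (a 32-bit
 * field whose low 16 bits hold the EtherType in network byte order)
 * to a counter slot: 1 = IPv4, 2 = IPv6, 3 = other. */
static int slot_for_protocol(uint32_t skb_protocol)
{
    uint16_t proto = ntohs((uint16_t)skb_protocol);  /* to host order */

    if (proto == ETH_P_IP)
        return 1;
    if (proto == ETH_P_IPV6)
        return 2;
    return 3;
}

/* Same IPv4 test, converting the constant instead of the field. */
static int is_ipv4(uint32_t skb_protocol)
{
    return (uint16_t)skb_protocol == htons(ETH_P_IP);
}
```

Comparing the raw field against ETH_P_IP without either conversion would silently fall through to "other" on a little-endian machine, which is the failure mode the conversion guards against.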

The return value TC_ACT_OK is where the story is: it comes back in the ens5 section below.

Attaching with tcx

tcx (kernel 6.6+, Article 12) is the BPF-link-based way of attaching tc; bpftool attaches it directly:

clang -O2 -g -target bpf -I. -c count.bpf.c -o count.bpf.o
sudo bpftool prog loadall count.bpf.o /sys/fs/bpf/tccount
sudo bpftool net attach tcx_egress pinned /sys/fs/bpf/tccount/count_egress dev lo

Attach to lo (loopback), then generate a bit of internal traffic and read the map:

ping  -c 3 127.0.0.1      # IPv4
ping6 -c 2 ::1            # IPv6
sudo bpftool map dump id <id>
key: 0  value: 30        <- 30 total egress packets on lo
key: 1  value: 26        <- IPv4
key: 2  value: 4         <- IPv6 (2 echo + 2 reply from ping6 -c 2)
key: 3  value: 0         <- other: none
key: 9  value: 2902      <- total bytes

The numbers match. ping6 -c 2 ::1 creates 2 echo requests + 2 replies, and all four leave through lo's egress, so IPv6 = 4. IPv4 = 26 consists of the 6 packets from ping -c 3 127.0.0.1 (3 requests + 3 replies) plus 20 packets of background loopback traffic (systemd, local DNS). Total 30 = 26 + 4, with no "other" packets. The program runs correctly, classifying by __sk_buff metadata without parsing a single byte.
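The same sanity check can be written down as a few lines of plain C. This is only a sketch of the arithmetic, with hypothetical names (egress_counts, counts_consistent, loopback_ping_packets); it does not read the BPF map.

```c
#include <stdint.h>

/* Counter slots as dumped from the map: 0=total 1=IPv4 2=IPv6 3=other. */
struct egress_counts {
    uint64_t total, ipv4, ipv6, other;
};

/* Every packet bumps the total and exactly one protocol slot, so the
 * protocol slots must sum to the total once traffic has settled. */
static int counts_consistent(const struct egress_counts *c)
{
    return c->total == c->ipv4 + c->ipv6 + c->other;
}

/* On loopback both the echo request and the echo reply leave through
 * lo's egress, so N pings contribute 2 * N packets. */
static uint64_t loopback_ping_packets(uint64_t pings)
{
    return 2 * pings;
}
```

With the dump above (30, 26, 4, 0) the check holds, ping6 -c 2 accounts for 4 packets, and ping -c 3 accounts for 6 of the 26 IPv4 packets. Because the slots are updated one at a time, a dump taken while traffic is still flowing could be off by a packet.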

Why attach to lo and not ens5?

The first try attached this exact program to the egress of ens5 (the physical NIC), and the total counter sat at 0 even though the node clearly had outbound traffic. That is not a bug. bpftool net show dev ens5 shows the tcx egress chain:

ens5  tcx/egress cil_to_netdev  prog_id 2960 link_id 17    <- Cilium, attached first
ens5  tcx/egress count_egress   prog_id 4669               <- ours, attached after

Two programs on the same egress hook, running in order — Cilium first, ours after. The problem lies in the return code. Read the definition straight from the node's own /usr/include/linux/bpf.h:

/* (Simplified) user return codes for tcx prog type.
 * ... unknown return codes are mapped to TCX_NEXT. */
enum tcx_action_base {
    TCX_NEXT     = -1,    // run the NEXT program in the chain
    TCX_PASS     = 0,     // == TC_ACT_OK: let the packet through, STOP the chain
    TCX_DROP     = 2,     // == TC_ACT_SHOT
    TCX_REDIRECT = 7,     // == TC_ACT_REDIRECT
};

The tcx chain only continues to the next program when the current program returns TCX_NEXT (-1). Cilium's cil_to_netdev returns a terminating verdict (TCX_PASS or TCX_REDIRECT — it forwards/redirects the packet), so the chain stops right at Cilium, never reaching count_egress behind it → total = 0. On lo it's different: our program is the only one and is first, so it runs, then its TC_ACT_OK stops the chain (there's nobody behind it to run anyway).

To make the counter run on ens5 you'd have to attach it before Cilium in the chain. That's the job of the BPF_F_BEFORE flag when creating the link (precisely the feature tcx added in kernel 6.6). Once it sits in front, it must also return TCX_NEXT instead of TC_ACT_OK; otherwise the counter becomes the terminating program and cil_to_netdev is the one that never runs. The lesson: on a hook that already has an owner like Cilium's datapath, the order in the tcx chain decides whether your program sees the packet at all, and a terminating verdict from a program ahead of you will hide every program behind it.
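The chain rule is small enough to model in userspace. The sketch below is a simplified stand-in for the kernel's chain walk, not kernel code: programs are plain function pointers, and the names run_chain, owner_prog, counter_pass and counter_next are hypothetical. Only the verdict values are taken from enum tcx_action_base.

```c
#include <stddef.h>

/* Verdict values copied from enum tcx_action_base in linux/bpf.h. */
enum { TCX_NEXT = -1, TCX_PASS = 0, TCX_DROP = 2, TCX_REDIRECT = 7 };

typedef int (*tcx_prog_fn)(void);  /* stand-in for a loaded tc program */

static int counter_hits;  /* how many times the observer actually ran */

/* Stand-in for a datapath owner that ends with a terminating verdict. */
static int owner_prog(void) { return TCX_PASS; }

/* Observer as written in this article: count, then return TC_ACT_OK. */
static int counter_pass(void) { counter_hits++; return TCX_PASS; }

/* Observer variant: count, then yield to the next program. */
static int counter_next(void) { counter_hits++; return TCX_NEXT; }

/* Walk the chain in attach order; only TCX_NEXT moves on to the next
 * program. Returns the last verdict and stores how many programs ran. */
static int run_chain(const tcx_prog_fn *chain, size_t n, size_t *ran)
{
    int verdict = TCX_NEXT;

    *ran = 0;
    for (size_t i = 0; i < n; i++) {
        verdict = chain[i]();
        (*ran)++;
        if (verdict != TCX_NEXT)
            break;  /* terminating verdict: later programs are skipped */
    }
    return verdict;
}
```

Running the chain {owner_prog, counter_pass} reproduces the ens5 result: one program runs and counter_hits stays at 0. With {counter_pass, owner_prog} the counter runs but the owner is skipped, and only {counter_next, owner_prog} runs both, which is the ordering and verdict an observer would want in front of an existing datapath.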

🧹 Cleanup

sudo bpftool net detach tcx_egress dev lo
sudo rm -rf /sys/fs/bpf/tccount

As in Article 11, the detach command is placed in a trap ... EXIT so it always runs. After detaching, the node is back to 140 programs. The source (count.bpf.c, the build/attach commands) is at github.com/nghiadaulau/ebpf-from-scratch, directory 13-tc-write.

Wrap-up

Writing a tc/sched_cls program ourselves makes two things clear.

One, __sk_buff: tc runs after the kernel builds the sk_buff, so it hands the program metadata already filled in (skb->protocol for the L3 protocol, skb->len, skb->mark, skb->ifindex), and we classified egress packets by protocol without parsing a single byte, unlike XDP, which has to peel it off from raw pointers. We attached with tcx (bpftool net attach tcx_egress, kernel 6.6+) and ran on lo to get correct counts (IPv4 26 / IPv6 4, matching the generated traffic).

Two, the tcx chain: multiple programs stacked on one hook run in order, and the chain only continues when the current one returns TCX_NEXT (-1); TC_ACT_OK (= TCX_PASS = 0) stops it. That's why the same program attached after Cilium on ens5 never ran once (Cilium returns a terminating verdict first), as verified straight from enum tcx_action_base in the node's kernel header. To make it run you must attach with BPF_F_BEFORE, and then return TCX_NEXT so that Cilium's program still runs behind yours. On a hook that already has an owner, chain order is everything.

Part IV closes here: we've gone from XDP (Article 11) through Cilium's datapath (Article 12) to writing tc ourselves (this article) — covering both of eBPF's main network hooks and how they chain. Part V moves on to security: LSM BPF attaching to the kernel's security control points, seccomp, and how Tetragon uses eBPF to observe and then enforce process behavior across a cluster.