XDP: Processing Packets at the Earliest Point, Writing a Firewall

Kai · 7 min read

Part III built tools that observe: execsnoop only reads, it never intervenes. Part IV moves to the area where eBPF is best known and most actively developed: networking. Here an eBPF program doesn't just look at a packet, it decides the packet's fate: let it through, drop it, bounce it back, or redirect it elsewhere. This post starts at the earliest hook: XDP.

What XDP is, and why "earliest"

XDP (eXpress Data Path) attaches an eBPF program directly to the network card's driver. When a packet arrives, the XDP program runs before the kernel allocates an sk_buff — the structure describing the packet that the entire network stack relies on. This is the earliest point software can touch a packet: the packet is still in the driver's buffer, not yet a "kernel packet".

Being that early matters: if you intend to drop a packet (DDoS mitigation, firewall), dropping at XDP in native mode means spending no cycles on allocating an sk_buff or traversing netfilter. Cloudflare and Facebook use XDP to drop millions of attack packets per second for exactly this reason.

   Packet arrives at NIC
        │
        ▼
   ┌─────────┐   XDP runs HERE (in the driver, before sk_buff)
   │   XDP   │── returns verdict ──► DROP  (drop immediately, cheapest)
   └────┬────┘                       TX    (bounce back out the same NIC)
        │ PASS                       REDIRECT (kick to another NIC/CPU)
        ▼
   allocate sk_buff
        │
        ▼
   tc ingress ──► network stack ──► socket  (see Article 12)

Verdict: a packet's four fates

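An XDP program returns one of a small set of constants, and that return value is its entire interface with the kernel. Four of them decide what happens to a healthy packet (a fifth, XDP_ABORTED, signals a program error: the packet is dropped and a tracepoint fires):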

  • XDP_PASS — let the packet continue into the stack as normal (allocate sk_buff, up to tc, IP, socket).
  • XDP_DROP — drop the packet right at the driver. No log, no ICMP error, the packet vanishes.
  • XDP_TX — send the packet back out the same interface it arrived on (load balancer, reflection).
  • XDP_REDIRECT — push the packet to another interface, to another CPU, or up to an AF_XDP socket.
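To make the verdicts concrete, here is a small user-space sketch of the constants behind those names. The enum mirrors `enum xdp_action` from the kernel UAPI header `<linux/bpf.h>` and is redeclared only so the snippet builds without kernel headers; `verdict_name` is a hypothetical helper, not part of any library.

```c
/* Mirrors enum xdp_action from <linux/bpf.h>.
 * Redeclared here (an assumption of this sketch) so it builds
 * on a machine without kernel headers installed. */
enum xdp_action {
    XDP_ABORTED = 0,  /* program error: packet dropped, tracepoint fired */
    XDP_DROP,         /* 1: drop at the driver */
    XDP_PASS,         /* 2: continue into the network stack */
    XDP_TX,           /* 3: send back out the receiving interface */
    XDP_REDIRECT,     /* 4: hand to another interface, CPU, or AF_XDP socket */
};

/* Hypothetical helper: turn a verdict into a printable name. */
static const char *verdict_name(unsigned int v)
{
    switch (v) {
    case XDP_ABORTED:  return "XDP_ABORTED";
    case XDP_DROP:     return "XDP_DROP";
    case XDP_PASS:     return "XDP_PASS";
    case XDP_TX:       return "XDP_TX";
    case XDP_REDIRECT: return "XDP_REDIRECT";
    default:           return "unknown";
    }
}
```

The point of the sketch is only that a verdict is a plain integer: whatever the program returns from its entry function is what the driver acts on.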

We'll write a firewall: drop ICMP, pass everything else.

Reading a packet in XDP: pointers and bounds checks

XDP hands the program a struct xdp_md with two fields that matter here: data (start of the packet) and data_end (end). Both are 32-bit values that the program casts to pointers. To read the Ethernet header and then the IP header, we cast again to the header structs and must bounds-check before each read. This is exactly the verifier's rule from Article 2: any access it cannot prove lies inside [data, data_end) gets the program rejected at load time.

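// Headers and map definition below are assumed for this excerpt;
// the full file in the repo may pull its types from elsewhere (e.g. vmlinux.h).
#include <linux/bpf.h>
#include <linux/if_ether.h>   // struct ethhdr, ETH_P_IP
#include <linux/ip.h>         // struct iphdr
#include <linux/in.h>         // IPPROTO_ICMP
#include <bpf/bpf_helpers.h>  // SEC, bpf_map_lookup_elem
#include <bpf/bpf_endian.h>   // bpf_htons

// One-element ARRAY map holding the drop counter
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} icmp_drops SEC(".maps");

SEC("xdp")
int xdp_fw(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)        // enough room for the Ethernet header?
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) // not IPv4 -> let it through
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)         // enough room for the IP header?
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP) {
        __u32 key = 0;
        __u64 *n = bpf_map_lookup_elem(&icmp_drops, &key);
        if (n) __sync_fetch_and_add(n, 1);   // count dropped packets (atomic add)
        return XDP_DROP;                     // drop ICMP
    }
    return XDP_PASS;                         // everything else: let it through
}

char LICENSE[] SEC("license") = "GPL";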

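None of the if (... > data_end) return XDP_PASS; checks is redundant defense: remove one and the verifier rejects the program at load time ("invalid access to packet"). bpf_htons converts the constant to network byte order, because h_proto holds the on-the-wire big-endian value (on a little-endian host that means a byte swap). The map icmp_drops (an ARRAY with 1 element) counts dropped packets, and __sync_fetch_and_add increments it atomically because the program can run on several CPUs at once. The function itself is only about 20 lines.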
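The effect of those bounds checks is easy to try out in user space, without loading anything into a kernel. The sketch below is a hypothetical stand-in for xdp_fw, not the program itself: `classify`, `has_bytes`, and `build_frame` are names invented here, the header offsets are hard-coded, and the check is written as "enough bytes remaining" rather than the pointer comparison the verifier expects. It makes the same decisions in the same order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same numeric values as XDP_DROP and XDP_PASS in <linux/bpf.h>. */
enum { V_DROP = 1, V_PASS = 2 };

#define ETH_HDR_LEN   14      /* dst MAC 6 + src MAC 6 + EtherType 2 */
#define IP4_HDR_LEN   20      /* fixed part of the IPv4 header */
#define ETHERTYPE_IP4 0x0800
#define PROTO_ICMP    1

/* True when at least n bytes lie between p and end (requires p <= end). */
static int has_bytes(const uint8_t *p, const uint8_t *end, size_t n)
{
    return (size_t)(end - p) >= n;
}

/* Hypothetical user-space stand-in for xdp_fw(): same checks, same order. */
static int classify(const uint8_t *data, const uint8_t *data_end,
                    uint64_t *icmp_drops)
{
    if (!has_bytes(data, data_end, ETH_HDR_LEN))
        return V_PASS;                       /* truncated Ethernet header */

    /* EtherType is big-endian on the wire: bytes 12 and 13 of the frame. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != ETHERTYPE_IP4)
        return V_PASS;                       /* not IPv4 */

    const uint8_t *ip = data + ETH_HDR_LEN;
    if (!has_bytes(ip, data_end, IP4_HDR_LEN))
        return V_PASS;                       /* truncated IPv4 header */

    if (ip[9] == PROTO_ICMP) {               /* byte 9 = protocol field */
        (*icmp_drops)++;
        return V_DROP;
    }
    return V_PASS;
}

/* Hypothetical test helper: writes a zeroed 34-byte Ethernet + IPv4 frame. */
static size_t build_frame(uint8_t *buf, uint16_t ethertype, uint8_t ip_proto)
{
    memset(buf, 0, ETH_HDR_LEN + IP4_HDR_LEN);
    buf[12] = (uint8_t)(ethertype >> 8);
    buf[13] = (uint8_t)(ethertype & 0xff);
    buf[ETH_HDR_LEN] = 0x45;                 /* version 4, IHL 5 */
    buf[ETH_HDR_LEN + 9] = ip_proto;
    return ETH_HDR_LEN + IP4_HDR_LEN;
}
```

Feeding it a full ICMP frame yields V_DROP and bumps the counter. Passing the same buffer with data_end moved to byte 20 or byte 10 falls through to V_PASS without reading past the end, which is the behaviour the verifier forces on the real program.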

Why pick ICMP to drop? This is running on a real node, over SSH. SSH is TCP, so dropping ICMP doesn't cut the session. The rule for playing with XDP on a remote machine: never drop the thing holding your session. Default everything to XDP_PASS and drop only the one thing you chose. ICMP is harmless enough for a short test, but it also carries Path MTU Discovery errors, so this is not a filter to leave attached.

Attaching to a real interface

Compile it like any eBPF program (Article 9), then load and attach it to the NIC with bpftool:

clang -O2 -g -target bpf -I. -c xdp_fw.bpf.c -o xdp_fw.bpf.o   # compile
sudo bpftool prog loadall xdp_fw.bpf.o /sys/fs/bpf/xdpfw       # load + pin
sudo bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdpfw/xdp_fw dev ens5

Note xdpgeneric: that's generic mode (SKB mode), where the kernel runs the XDP program after it has already allocated an sk_buff. It is slower than native and gives up the no-sk_buff saving, but it works with any driver, which makes it convenient for experiments. There are three XDP attach modes:

  • native (xdpdrv) — runs in the driver, fastest, requires driver support.
  • offload (xdpoffload) — runs directly on a SmartNIC, the kernel never touches the packet.
  • generic (xdpgeneric) — kernel-emulated, runs on any driver, used for experimentation.

On a remote production node, generic is the safe choice to test with.
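Underneath, each bpftool keyword selects a mode bit in the XDP attach flags. The sketch below shows that mapping in plain C. The flag values mirror `XDP_FLAGS_*` from `<linux/if_link.h>` and are redeclared so the snippet builds without kernel headers; `attach_flags` is a hypothetical helper written for this illustration, not a bpftool or libbpf function.

```c
#include <string.h>

/* Values mirror XDP_FLAGS_* in <linux/if_link.h>; redeclared here
 * (an assumption of this sketch) to avoid depending on kernel headers. */
#define XDP_FLAGS_SKB_MODE (1U << 1)   /* generic: runs after sk_buff allocation */
#define XDP_FLAGS_DRV_MODE (1U << 2)   /* native: runs inside the driver */
#define XDP_FLAGS_HW_MODE  (1U << 3)   /* offload: runs on the NIC itself */

/* Hypothetical helper: map a bpftool attach keyword to its mode flag.
 * Returns 0 on success, -1 for an unknown keyword. */
static int attach_flags(const char *keyword, unsigned int *flags)
{
    if (strcmp(keyword, "xdp") == 0) {
        *flags = 0;                    /* no mode forced: kernel picks */
        return 0;
    }
    if (strcmp(keyword, "xdpgeneric") == 0) {
        *flags = XDP_FLAGS_SKB_MODE;
        return 0;
    }
    if (strcmp(keyword, "xdpdrv") == 0) {
        *flags = XDP_FLAGS_DRV_MODE;
        return 0;
    }
    if (strcmp(keyword, "xdpoffload") == 0) {
        *flags = XDP_FLAGS_HW_MODE;
        return 0;
    }
    return -1;
}
```

A loader written against libbpf would pass the same value as the flags argument of bpf_xdp_attach. The plain xdp keyword forces no mode, so the kernel tries native first and falls back to generic when the driver lacks support.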

Ping loss goes from 0% to 100%

Before attaching, ping from the node to the internal LB runs normally:

-- before attach: ping LB --
2 packets transmitted, 2 received, 0% packet loss
rtt min/avg/max/mdev = 0.191/0.193/0.196/0.002 ms

Attach the firewall to ens5, then ping again:

== attached (xdpgeneric on ens5) ==
-- with XDP firewall: ping (expect 100% loss) --
4 packets transmitted, 0 received, 100% packet loss

100% loss. The subtle reason: XDP runs at ingress (packets arriving at the NIC). When the node sends an ICMP echo-request out (egress, untouched by XDP), the packet reaches its destination normally; but the echo-reply coming back to ens5 lands in XDP and gets XDP_DROP'd. No reply gets through → ping reports 100% loss. Meanwhile the SSH session is never interrupted — the very command printing the line above is still running over TCP, and TCP packets get XDP_PASS.

Read the counter in the map (for example with sudo bpftool map dump name icmp_drops) to confirm the exact number of dropped packets:

-- icmp_drops map --
[{ "key": 0, "value": 4 }]

Exactly 4 — matching the 4 echo-replies of ping -c 4. The firewall works exactly as written: see ICMP, count it, drop it; let everything else through.
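The arithmetic behind "100% loss, counter at 4" can be modelled in a few lines. This is a toy model under stated assumptions, not a measurement: `deliver`, `ping`, and `loss_percent` are hypothetical names, the peer is assumed to answer every request, and no other ICMP traffic reaches the interface.

```c
#include <stdint.h>

enum direction { EGRESS, INGRESS };

#define PROTO_ICMP 1
#define PROTO_TCP  6

struct stats {
    unsigned sent;        /* echo-requests that left the node */
    unsigned received;    /* echo-replies that reached the stack */
    uint64_t icmp_drops;  /* what the icmp_drops map would hold */
};

/* Returns 1 if the packet gets through, 0 if the firewall drops it.
 * XDP only sees INGRESS, so EGRESS packets always get through. */
static int deliver(enum direction dir, uint8_t proto, uint64_t *icmp_drops)
{
    if (dir == EGRESS)
        return 1;
    if (proto == PROTO_ICMP) {
        (*icmp_drops)++;
        return 0;
    }
    return 1;
}

/* Model of ping -c count against a peer that always replies. */
static struct stats ping(unsigned count)
{
    struct stats s = {0, 0, 0};
    for (unsigned i = 0; i < count; i++) {
        if (!deliver(EGRESS, PROTO_ICMP, &s.icmp_drops))
            continue;                 /* never taken: egress is not filtered */
        s.sent++;
        if (deliver(INGRESS, PROTO_ICMP, &s.icmp_drops))
            s.received++;             /* the reply survived ingress */
    }
    return s;
}

static unsigned loss_percent(const struct stats *s)
{
    return s->sent ? 100 * (s->sent - s->received) / s->sent : 0;
}
```

With count set to 4, the model sends 4 requests, receives 0 replies, and leaves the counter at 4, while an ingress TCP packet passes without touching the counter. That matches the ping output and the map dump above.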

XDP sits ahead of Cilium's datapath

Once attached, bpftool net show dev ens5 reveals one more thing:

xdp:
ens5(2) generic id 4656              <- the firewall we just attached

tc:
ens5(2) tcx/ingress cil_from_netdev prog_id 2969
ens5(2) tcx/egress  cil_to_netdev   prog_id 2960

On the very same interface ens5 there are already two Cilium eBPF programs attached at the tc layer (cil_from_netdev ingress, cil_to_netdev egress) — that's the real datapath routing every pod packet on this node. The diagram at the top of the post shows up right here in reality: our XDP runs first, and only packets that XDP_PASS go down to tc where Cilium takes over. Article 12 dissects that tc layer directly.

🧹 Cleanup

This step is mandatory — an XDP program left attached will keep filtering packets:

sudo bpftool net detach xdpgeneric dev ens5     # detach from the interface
sudo rm -rf /sys/fs/bpf/xdpfw                    # remove the pin

Once detached, ping returns to 0% loss immediately; the node returns to 140 programs (Cilium's datapath is untouched). In the test script, the detach command is placed in a trap ... EXIT so it always runs, even if a step in between fails — a habit worth having when playing with XDP on a remote machine. The source (xdp_fw.bpf.c, the build/attach commands) is at github.com/nghiadaulau/ebpf-from-scratch, directory 11-xdp-firewall.

Wrap-up

XDP attaches eBPF to the network driver and runs on every incoming packet before an sk_buff is allocated. That makes it the earliest, cheapest point to process a packet, which is why it's the foundation of DDoS dropping and high-speed load balancing. The program returns a verdict: XDP_PASS (continue), XDP_DROP (drop on the spot), XDP_TX (bounce back), XDP_REDIRECT (kick elsewhere). Read the packet via data/data_end in xdp_md, bounds-checking before each access (the verifier's rule, Article 2). We wrote a firewall that drops ICMP, attached it to ens5 with bpftool net attach xdpgeneric, and watched ping loss go from 0% to 100% (echo-replies dropped at ingress) while SSH/TCP stayed open; the map counter confirmed exactly 4 packets. Three attach modes: native (fast), offload (on the NIC), generic (emulated, safe to test). And bpftool net show revealed our XDP sitting ahead of Cilium's tc datapath on the same interface.

Article 12 goes down to that layer: tc/sched_cls — the hook where the packet already has an sk_buff, where Cilium places most of its datapath. We'll dissect the 74 sched_cls programs running on the cluster to see how a pod packet is routed, load-balanced, and has policy applied.