BTF and CO-RE: Compile Once, Run on Every Kernel

Kai · 4 min read

In Article 0, each program carried a btf_id. In Article 3, we declared a map using BTF syntax. Now it's time to say plainly what BTF is, and why it lets a pre-compiled eBPF program read the right data inside the kernel — even though the kernel's layout changes from one version to the next. This is the last foundational piece of Part I, and the thing that makes writing your own tools in Part III feasible.

The problem: kernel layout isn't fixed

A tracing program often needs to read the kernel's internal structures: for example, task_struct (which describes a process), to get the PID of the parent process. But task_struct has hundreds of fields, and their order and offsets differ between kernel builds, depending on the version and the build configuration: the real_parent field that sits at offset 2680 on one kernel may sit at a different offset on another. If you hardcode the offset, the program reads the right field on only one kernel build, and recompiling for every build is an operational nightmare.
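
To make the offset problem concrete, here is a minimal userspace sketch in plain C. The two structs are hypothetical stand-ins for task_struct as built by two different kernels (the real one is far larger), and read_int_at and hardcoded_tgid_offset are names invented for this illustration. The point is only that an offset computed against one layout lands on the wrong bytes in the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for task_struct on "kernel build A". */
struct task_build_a {
    long state;
    int  pid;
    int  tgid;
};

/* Hypothetical stand-in for "kernel build B": one extra field
 * shifts every field that comes after it. */
struct task_build_b {
    long state;
    long extra_stats;   /* present only in build B */
    int  pid;
    int  tgid;
};

/* Read an int at a fixed byte offset from base, the way a program
 * with a hardcoded offset would. */
static int read_int_at(const void *base, size_t offset)
{
    int value;

    assert(base != NULL);
    memcpy(&value, (const char *)base + offset, sizeof(value));
    return value;
}

/* The tgid offset that gets "baked in" when compiling against build A. */
static size_t hardcoded_tgid_offset(void)
{
    return offsetof(struct task_build_a, tgid);
}
```

Calling read_int_at(&a, hardcoded_tgid_offset()) on a task_build_a returns its tgid, but the same call on a task_build_b returns whatever bytes happen to sit at that offset, not the tgid. Nothing fails loudly; the program just reports wrong data.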

BTF: the kernel describes its own types

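BTF (BPF Type Format) is metadata describing data types: structs, unions, enums, and the layout of their fields. It plays the same role as the type portion of debug info, but in a compact format that the kernel itself can parse. Kernels built with CONFIG_DEBUG_INFO_BTF=y generate their own BTF and expose it: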

ls -la /sys/kernel/btf/vmlinux
7005028 /sys/kernel/btf/vmlinux      # ~7MB: describes EVERY type in this kernel

This is the full "blueprint" of the kernel's types, exact for the running kernel. From it, you generate vmlinux.h, a header containing the definitions of every kernel type, so an eBPF program can include that one file instead of digging through kernel headers:

sudo bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
wc -l vmlinux.h ; grep -c 'struct task_struct {' vmlinux.h
165625      # 165k lines of type definitions
1           # task_struct is defined
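
The BTF blob itself is an ordinary binary file that starts with a small fixed header. As a rough sketch of what "the kernel describes its own types" looks like on disk, the code below mirrors the field layout of struct btf_header from the kernel's linux/btf.h and validates the magic number 0xEB9F. The names btf_header_sketch and parse_btf_header are invented for this example, and it assumes the buffer was produced on a machine with the same byte order as the one parsing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC_SKETCH 0xEB9F

/* Same field order and sizes as struct btf_header in linux/btf.h.
 * type_off and str_off are byte offsets relative to the end of the header. */
struct btf_header_sketch {
    uint16_t magic;
    uint8_t  version;
    uint8_t  flags;
    uint32_t hdr_len;
    uint32_t type_off;   /* start of the type section */
    uint32_t type_len;   /* length of the type section */
    uint32_t str_off;    /* start of the string section */
    uint32_t str_len;    /* length of the string section */
};

/* Copy the header out of buf and check the magic.
 * Returns 0 on success, -1 if buf is too short or the magic is wrong. */
static int parse_btf_header(const unsigned char *buf, size_t len,
                            struct btf_header_sketch *out)
{
    assert(buf != NULL && out != NULL);
    if (len < sizeof(*out))
        return -1;
    memcpy(out, buf, sizeof(*out));
    if (out->magic != BTF_MAGIC_SKETCH)
        return -1;
    return 0;
}
```

On a real system you would fread the first 24 bytes of /sys/kernel/btf/vmlinux into a buffer and pass it to parse_btf_header; the type section and the string section it points to are what bpftool turns into vmlinux.h.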

CO-RE: compile once, run on every kernel

CO-RE (Compile Once, Run Everywhere) uses BTF to solve the offset problem. Write a program that reads the parent PID (ppid) by walking task->real_parent->tgid:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

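// GPL-compatible license declaration: the kernel refuses GPL-only helpers
// (bpf_trace_printk, bpf_probe_read_kernel) without one
char LICENSE[] SEC("license") = "GPL";

SEC("tracepoint/sched/sched_process_exec")
int on_exec(void *ctx)
{
    // Kernel address of the current process's task_struct
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    // Follow task->real_parent->tgid; each field access is a CO-RE relocation
    __u32 ppid = BPF_CORE_READ(task, real_parent, tgid);
    char comm[16];   // 16 bytes = TASK_COMM_LEN
    bpf_get_current_comm(comm, sizeof(comm));
    bpf_printk("exec %s ppid=%d", comm, ppid);
    return 0;
}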

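BPF_CORE_READ(task, real_parent, tgid) doesn't hardcode an offset. At compile time, clang records a CO-RE relocation in the object for each field access: "I need the real_parent field of task_struct, then the tgid field of that". Compiling needs only vmlinux.h and the libbpf headers, not kernel headers. The -g flag matters here: it makes clang emit the BTF that carries those relocations: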

clang -O2 -g -target bpf -I/tmp -c ppid.bpf.c -o ppid.bpf.o

At load time, libbpf (which bpftool uses to load objects) reads the running kernel's BTF (/sys/kernel/btf/vmlinux), looks up the real offsets of real_parent and tgid on this kernel, then patches them into the program's instructions before the verifier runs. Load and try:

sudo bpftool prog loadall ppid.bpf.o /sys/fs/bpf/ppid autoattach
sudo timeout 2 cat /sys/kernel/debug/tracing/trace_pipe | grep 'exec .* ppid='
   sudo-357811    [001] ....1 47863.911994: bpf_trace_printk: exec sudo ppid=357804
   sleep-357812   [000] ....1 47863.911994: bpf_trace_printk: exec sleep ppid=357810
   grep-357813    [001] ....1 47863.912715: bpf_trace_printk: exec grep ppid=357804
   timeout-357814 [000] ....1 47863.920067: bpf_trace_printk: exec timeout ppid=357811

The program reads real_parent->tgid correctly: sleep has parent 357810, while grep and sudo share parent 357804 (the shell). It follows two pointer hops starting from task_struct even though the offsets were unknown when it was written: libbpf filled in the right offsets for this kernel (6.17) at load time. Take that same ppid.bpf.o file to a machine running a different kernel, where real_parent sits at a different offset, and it still runs correctly as long as that kernel exposes BTF, because libbpf redoes the relocation against that machine's BTF. That's the meaning of "compile once, run everywhere".
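
The load-time patching step can be pictured with a toy model in plain C. This is a simplified sketch, not how libbpf is implemented: real relocations identify fields through BTF type IDs and access paths, and the instruction already carries the compile-time offset rather than a blank placeholder. All struct and function names here (field_info, load_insn, core_relo, lookup_offset, relocate) are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry of a toy "BTF": a field name and its byte offset on one kernel. */
struct field_info {
    const char *name;
    size_t      offset;
};

/* A toy load instruction: its offset is only final after relocation. */
struct load_insn {
    size_t offset;
};

/* A toy relocation record: "instruction insn_idx needs the offset of field_name". */
struct core_relo {
    size_t      insn_idx;
    const char *field_name;
};

/* Find a field by name in the target's type info.
 * Returns 0 and stores the offset in *out, or -1 if the field is absent. */
static int lookup_offset(const struct field_info *btf, size_t n_fields,
                         const char *name, size_t *out)
{
    for (size_t i = 0; i < n_fields; i++) {
        if (strcmp(btf[i].name, name) == 0) {
            *out = btf[i].offset;
            return 0;
        }
    }
    return -1;
}

/* Patch every relocated instruction with the target kernel's offsets.
 * Returns 0 on success, -1 if the target lacks a required field. */
static int relocate(struct load_insn *insns, size_t n_insns,
                    const struct core_relo *relos, size_t n_relos,
                    const struct field_info *btf, size_t n_fields)
{
    for (size_t i = 0; i < n_relos; i++) {
        size_t off;

        assert(relos[i].insn_idx < n_insns);
        if (lookup_offset(btf, n_fields, relos[i].field_name, &off) != 0)
            return -1;
        insns[relos[i].insn_idx].offset = off;
    }
    return 0;
}
```

Running relocate on the same two-instruction program against two different field_info tables yields two differently patched programs from one unchanged "object file", which is the whole trick. A field that is missing on the target makes the load fail up front instead of letting the program read garbage.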

(A related detail: BPF_CORE_READ doesn't dereference the pointers directly. It expands to calls to the bpf_probe_read_kernel() helper, which reads kernel memory safely, because the verifier from Article 2 doesn't allow arbitrary dereference of a kernel pointer. So CO-RE handles both the offset and the safe read.)
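
The shape of that chained safe read can be sketched in userspace C. This is an analogy only: checked_read is a hypothetical stand-in for a checked kernel read helper, toy_task is a trimmed stand-in for task_struct, and the only "bad pointer" the sketch can detect is NULL. What it does show is the pattern: one checked read per pointer hop, with an error return instead of a crash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily trimmed stand-in for task_struct. */
struct toy_task {
    int              tgid;
    struct toy_task *real_parent;
};

/* Stand-in for a checked read helper: copy size bytes from src to dst.
 * On failure (here: src is NULL) zero dst and return -1 instead of crashing. */
static int checked_read(void *dst, size_t size, const void *src)
{
    assert(dst != NULL);
    if (src == NULL) {
        memset(dst, 0, size);
        return -1;
    }
    memcpy(dst, src, size);
    return 0;
}

/* Two chained reads: fetch the real_parent pointer, then the tgid behind it.
 * Returns 0 on success; on failure returns -1 and leaves *ppid at 0. */
static int read_ppid(const struct toy_task *task, int *ppid)
{
    struct toy_task *parent = NULL;

    *ppid = 0;
    if (task == NULL)
        return -1;
    /* Hop 1: read the pointer stored in task->real_parent. */
    if (checked_read(&parent, sizeof(parent), &task->real_parent) != 0)
        return -1;
    if (parent == NULL)
        return -1;
    /* Hop 2: read the tgid field of the parent. */
    return checked_read(ppid, sizeof(*ppid), &parent->tgid);
}
```

In the real macro, each hop is a helper call whose source address is computed from a relocated field offset, so the two concerns (where the field is, and how to read it safely) are handled together.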

🧹 Cleanup

sudo rm -rf /sys/fs/bpf/ppid
rm -f /tmp/ppid.bpf.* /tmp/vmlinux.h

Removing the pin is all it takes; the node goes back to 140 programs. Source is at github.com/nghiadaulau/ebpf-from-scratch, directory 05-btf-core.

Wrap-up

Kernel structs like task_struct have different field offsets across kernel builds, so a program with hardcoded offsets only runs correctly on one build. BTF is metadata describing every kernel type; the kernel exposes it ready-made at /sys/kernel/btf/vmlinux (~7MB), and bpftool btf dump ... format c generates vmlinux.h (165k lines, including task_struct) for a program to include. CO-RE builds on BTF: BPF_CORE_READ(task, real_parent, tgid) records a relocation at compile time ("I need this field") instead of a fixed offset; at load time, libbpf reads the running kernel's BTF, finds the real offset, and patches it in. We read ppid through two pointer hops from task_struct without knowing the offsets when we wrote the program, and the same object file runs on a different kernel. This is why modern eBPF tools (BCC's libbpf-based tools, Cilium, Tetragon) can ship a single binary that runs across kernel versions, as long as the target kernel exposes BTF.

Part I closes: we now have the virtual machine (Article 1), the verifier (Article 2), maps (Article 3), program types & hooks (Article 4), and BTF/CO-RE (this post) — enough of a framework to understand any eBPF program. Part II moves on to using it to observe the system pragmatically with bpftrace: writing tracing programs in a single line, with no C compilation, to answer real questions about the kernel.