seccomp-bpf: Classic BPF Filtering Syscalls in Every Container

Kai · 6 min read

Article 14 enforced policy with modern eBPF (LSM BPF). But there is a much older and far more widespread BPF-based filtering mechanism, one that runs in nearly every container you have ever used: seccomp-bpf. It uses classic BPF (cBPF), not eBPF. This article makes that difference clear, inspects real seccomp state on the cluster, then writes a filter by hand.

cBPF and eBPF: two generations

The original "BPF" (1992) was the Berkeley Packet Filter — a tiny virtual machine to filter packets for tcpdump. That is cBPF (classic BPF): 2 registers (A, X), a tiny instruction set, no maps, no helpers, no loops. eBPF (Article 1) is the extended version from 2014: 11 64-bit registers, maps, helpers, a powerful verifier — almost an entirely different thing.

What many people don't notice: seccomp still uses cBPF, not eBPF. When you install a seccomp filter, you load an array of cBPF instructions (struct sock_filter). The kernel translates cBPF to eBPF internally to run it, but the API userspace sees is pure cBPF. This is why seccomp belongs in an eBPF series: it's the old relative of eBPF, and the most widely deployed BPF application on the planet.

Real seccomp on the cluster: who's restricted, who isn't

/proc/<pid>/status tells you whether a process has seccomp: the Seccomp field (0=off, 1=strict, 2=filter) and Seccomp_filters (the number of filters attached). Scan the whole node:

pid 1161  pause            filters=1     <- a pod's sandbox container
pid 1290  aws-ebs-csi-dri  filters=1
pid 1429  cilium-operator  filters=1
pid 2645  csi-provisioner  filters=1
pid 450   systemd-resolve  filters=28    <- a systemd service tightening itself
pid 123   systemd-journal  filters=17

This shows the real picture. Ordinary containers such as pause (each pod's sandbox container), csi-*, and cilium-operator run with Seccomp: 2, filters=1: that's the default seccomp profile containerd applies when starting a container. But check the privileged agent and the host daemons:

cilium-agent (pid 1857): Seccomp: 0    <- NOT restricted
kubelet      (pid 817):  Seccomp: 0
containerd   (pid 592):  Seccomp: 0

Seccomp: 0 means no seccomp at all. This is the point worth correcting: not every container has seccomp. Privileged pods such as cilium-agent run unconfined, and kubelet and containerd are not containers at all but host daemons, so no container profile ever applied to them. Kubernetes by default does not request seccomp for workloads unless you set securityContext.seccompProfile: RuntimeDefault (or enable the kubelet's seccomp-default option, introduced as the SeccompDefault feature gate). Containerd does apply a default profile to non-privileged containers, which is why pause and the CSI sidecars have filters=1.
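
The /proc scan used above can be reproduced with a short C sketch. The helper names (parse_seccomp_fields, scan_proc) are hypothetical and not taken from the article's repository; the only assumption about the kernel is the Seccomp and Seccomp_filters fields described earlier. Unlike the listing above, this version does not print the process name.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the Seccomp and Seccomp_filters values from
 * the text of a /proc/<pid>/status file. A field that is absent stays at -1
 * (older kernels do not print Seccomp_filters). */
static void parse_seccomp_fields(const char *status_text, int *mode, int *filters)
{
    *mode = -1;
    *filters = -1;
    for (const char *line = status_text; line && *line; ) {
        /* sscanf only stores a value when the line starts with the key */
        sscanf(line, "Seccomp: %d", mode);
        sscanf(line, "Seccomp_filters: %d", filters);
        const char *nl = strchr(line, '\n');
        line = nl ? nl + 1 : NULL;
    }
}

/* Hypothetical helper: walk /proc and print every process whose seccomp
 * mode is not 0. Returns the number of processes printed, or -1 on error. */
static int scan_proc(void)
{
    DIR *proc = opendir("/proc");
    if (!proc)
        return -1;
    int shown = 0;
    struct dirent *ent;
    while ((ent = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;                       /* not a pid directory */
        char path[300], buf[8192];
        snprintf(path, sizeof path, "/proc/%s/status", ent->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                       /* process exited or is unreadable */
        size_t n = fread(buf, 1, sizeof buf - 1, f);
        fclose(f);
        buf[n] = '\0';
        int mode, filters;
        parse_seccomp_fields(buf, &mode, &filters);
        if (mode > 0) {
            printf("pid %-6s Seccomp=%d filters=%d\n", ent->d_name, mode, filters);
            shown++;
        }
    }
    closedir(proc);
    return shown;
}
```

Calling scan_proc() from main() as root on a node lists the restricted processes; anything missing from the output is running with Seccomp: 0.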

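And systemd-resolve with 28 filters illustrates a core property: seccomp filters stack on top of each other and cannot be removed. Each sandboxing directive in a unit file that systemd implements with seccomp (SystemCallFilter= is the obvious one, but settings such as RestrictAddressFamilies= or MemoryDenyWriteExecute= also count) adds filters; once set, they stay until the process dies. This is by design: a sandbox that could relax itself would be meaningless.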
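
Stacking raises a natural question: what happens when stacked filters disagree? The sketch below models the rule the kernel applies: every attached filter runs, and the verdict with the most restrictive action wins. The function names are hypothetical, and this is a userspace model of the comparison, not kernel code. It assumes kernel headers recent enough to define SECCOMP_RET_KILL_PROCESS and SECCOMP_RET_ACTION_FULL.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Pick the winner of two verdicts. The action bits are compared as signed
 * 32-bit values and the smaller one wins, which gives the precedence
 * KILL_PROCESS > KILL_THREAD > TRAP > ERRNO > ... > LOG > ALLOW. */
static uint32_t combine_verdicts(uint32_t a, uint32_t b)
{
    int32_t act_a = (int32_t)(a & SECCOMP_RET_ACTION_FULL);
    int32_t act_b = (int32_t)(b & SECCOMP_RET_ACTION_FULL);
    return act_b < act_a ? b : a;
}

/* Fold the verdicts of all stacked filters, starting from ALLOW. */
static uint32_t resolve_stack(const uint32_t *verdicts, size_t n)
{
    uint32_t result = SECCOMP_RET_ALLOW;
    for (size_t i = 0; i < n; i++)
        result = combine_verdicts(result, verdicts[i]);
    return result;
}
```

The consequence is that adding a filter can only tighten the sandbox: an ALLOW from a later filter never overrides an ERRNO or KILL from an earlier one.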

Writing a cBPF filter ourselves

A seccomp filter is a cBPF program running on struct seccomp_data (syscall number nr, architecture arch, the args), returning a verdict. Write a filter that blocks mkdir/mkdirat with EPERM and allows every other syscall:

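#include <errno.h>          // EPERM
#include <stddef.h>         // offsetof
#include <linux/audit.h>    // AUDIT_ARCH_X86_64
#include <linux/filter.h>   // struct sock_filter, struct sock_fprog, BPF_STMT, BPF_JUMP
#include <linux/seccomp.h>  // struct seccomp_data, SECCOMP_RET_*
#include <sys/prctl.h>      // prctl
#include <sys/syscall.h>    // __NR_mkdir, __NR_mkdirat

struct sock_filter filter[] = {
    // A = arch; on x86_64 skip the next instruction, otherwise allow.
    // Demo simplification: a hardened filter would return SECCOMP_RET_KILL_PROCESS
    // here, since syscall numbers mean something else on another architecture.
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // A = nr (syscall number)
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir,   2, 0),  // match -> skip 2, land on RET ERRNO
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdirat, 1, 0),  // match -> skip 1, land on RET ERRNO
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),                       // other syscall: allow
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};
struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),   // 8 instructions
    .filter = filter,
};

// Inside main(); both calls return -1 on failure and should be checked in real code.
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);              // mandatory for an unprivileged process
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);   // install the filter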

This is bare cBPF: BPF_STMT/BPF_JUMP are macros that build individual instructions; BPF_LD|BPF_W|BPF_ABS loads a 32-bit field of seccomp_data into the A register; jumps use a relative offset (the number of instructions to skip, counted from the instruction after the jump, when the comparison is true/false). No maps, no helpers: completely unlike eBPF. Verdicts: SECCOMP_RET_ALLOW lets the syscall run, SECCOMP_RET_ERRNO | EPERM makes the syscall fail with EPERM without executing (other verdicts: KILL_PROCESS, TRAP sending SIGSYS, USER_NOTIF...). PR_SET_NO_NEW_PRIVS is mandatory for a process without CAP_SYS_ADMIN to install a seccomp filter.
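
To see how the relative jumps play out, the following sketch evaluates the same eight instructions in userspace. It is a toy evaluator covering only the three instruction forms this filter uses, not the kernel's interpreter. The names run_filter and verdict_for are hypothetical, and the syscall numbers are hard-coded x86_64 values (mkdir = 83, mkdirat = 258, write = 1) so that it builds on any architecture.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* x86_64 syscall numbers, hard-coded for this sketch */
#define DEMO_NR_WRITE   1
#define DEMO_NR_MKDIR   83
#define DEMO_NR_MKDIRAT 258

static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_NR_MKDIR,   2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_NR_MKDIRAT, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};

/* Toy evaluator: supports only LD|W|ABS, JMP|JEQ|K and RET|K. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *sd)
{
    uint32_t A = 0;                               /* the accumulator register */
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *ins = &prog[pc];
        switch (ins->code) {
        case BPF_LD | BPF_W | BPF_ABS:            /* A = 32-bit word at offset k */
            if (ins->k + sizeof A > sizeof *sd)
                return SECCOMP_RET_KILL_PROCESS;  /* out-of-bounds load */
            memcpy(&A, (const char *)sd + ins->k, sizeof A);
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:           /* skip jt or jf instructions */
            pc += (A == ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_RET | BPF_K:                     /* return the verdict */
            return ins->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;      /* opcode not modelled here */
        }
    }
    return SECCOMP_RET_KILL_PROCESS;              /* ran off the end */
}

/* Convenience wrapper: verdict for a given syscall number and architecture. */
static uint32_t verdict_for(int nr, uint32_t arch)
{
    struct seccomp_data sd = { .nr = nr, .arch = arch };
    return run_filter(demo_filter, sizeof demo_filter / sizeof demo_filter[0], &sd);
}
```

For mkdir on x86_64 the walk is: load arch, jump over the first ALLOW, load nr, match and skip two instructions, return ERRNO. With a non-x86_64 arch value the very first comparison falls through to ALLOW, which is the demo simplification a production filter would replace with a kill verdict.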

Running: mkdir blocked, printf still runs

filter has 8 cBPF instructions
before filter: mkdir /tmp/sc-a -> OK
seccomp filter installed (mode 2)
after filter:  mkdir /tmp/sc-b -> Operation not permitted
after filter:  this printf still works (write syscall allowed)

Before installing, mkdir runs normally. After installing the filter, mkdir gets exactly Operation not permitted (the EPERM the filter returns), while printf — which calls the write syscall, allowed by SECCOMP_RET_ALLOW — still prints normally. Filtering is precise at the per-syscall level: block this one, allow that one. This is exactly the mechanism a container seccomp profile uses to block dangerous syscalls (mount, kexec_load, ptrace...) while still letting the application run.
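
The same pattern generalizes from two syscalls to a list. The sketch below generates the instructions for an arbitrary denylist and computes the jump offsets instead of counting them by hand. build_denylist is a hypothetical helper, not part of the article's demo or of any library. It is a denylist purely to mirror the mkdir example; the default profiles shipped by container runtimes are typically allowlists. As a design choice it kills the process on an architecture mismatch rather than allowing.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Hypothetical helper: emit a filter that fails every syscall in deny[]
 * with EPERM and allows everything else. out needs room for n + 6
 * instructions. Returns the instruction count, or 0 when n is 0 or too
 * large for cBPF's 8-bit jump offsets. */
static size_t build_denylist(uint32_t arch, const int *deny, size_t n,
                             struct sock_filter *out)
{
    if (n == 0 || n > 255)
        return 0;
    size_t i = 0;

    /* Wrong architecture: kill instead of guessing what the number means. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, arch));
    out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, arch, 1, 0);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);

    /* A = syscall number */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                            offsetof(struct seccomp_data, nr));

    /* One compare per entry. On a match, skip the remaining compares
     * (n - 1 - j of them) plus the ALLOW, landing on the ERRNO return. */
    for (size_t j = 0; j < n; j++)
        out[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                                (uint32_t)deny[j],
                                                (unsigned char)(n - j), 0);

    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
                                            SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
    return i;
}
```

The returned count goes into the len field of a struct sock_fprog, with out as its filter pointer, and the result is installed with the same two prctl calls as the hand-written filter. With a two-entry list of mkdir and mkdirat, the generated jump offsets are 2 and 1, matching the hand-written version.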

seccomp versus LSM BPF

Two enforcement mechanisms at different layers:

|            | seccomp-bpf                     | LSM BPF (Article 14)                          |
|------------|---------------------------------|-----------------------------------------------|
| BPF        | cBPF (classic)                  | eBPF                                          |
| Filters by | syscall number + raw args       | semantic operation (open this file, mount...) |
| Sees       | seccomp_data (nr, arch, args)   | real kernel objects (struct file *...)        |
| Scope      | per process, inherited via fork | system-wide                                   |
| Removal    | no (stacks until death)         | yes (detach the link)                         |

seccomp is fast and simple but "blind" — it sees the syscall number and raw arguments, can't dereference a pointer (doesn't know which file openat is opening). LSM BPF sees the resolved kernel object. They complement each other: seccomp narrows the syscall surface, LSM BPF enforces semantic policy.

🧹 Cleanup

Nothing to clean up at the system level: the seccomp filter lives only inside the demo process; when the process exits the filter disappears (seccomp is attached per process, not globally). The /proc/*/status scan only reads, touching nothing. The node still has 140 eBPF programs. The source (seccomp_demo.c) is at github.com/nghiadaulau/ebpf-from-scratch, directory 15-seccomp-bpf.

Wrap-up

seccomp-bpf filters syscalls with classic BPF (cBPF), the original BPF that tcpdump uses, not eBPF (the kernel translates cBPF→eBPF to run it, but the API is cBPF). It's the most widespread BPF application: containerd applies a default profile to ordinary containers (pause and the CSI sidecars on the cluster have Seccomp: 2), and systemd tightens its services (systemd-resolve stacks 28 filters). But not every process has it: the privileged cilium-agent pod and host daemons such as kubelet run with Seccomp: 0, and Kubernetes by default applies no seccomp unless you set RuntimeDefault. A filter is a cBPF program over struct seccomp_data (nr/arch/args), installed with prctl(PR_SET_SECCOMP, ...) after PR_SET_NO_NEW_PRIVS, returning a verdict (ALLOW/ERRNO/KILL...). We wrote 8 cBPF instructions that block mkdir and mkdirat with EPERM and nothing else, while printf (write) still runs. The filter is inherited via fork/exec and cannot be removed (by design: a sandbox doesn't relax itself). Versus LSM BPF: seccomp filters raw syscalls and is blind to what pointers point at; LSM sees kernel objects. They are two complementary layers.

Article 16 closes Part V with how a real Kubernetes system does runtime security: Tetragon — observing with kprobe/tracepoint then enforcing (killing processes, blocking) right inside the kernel, and we'll dissect the mechanism it uses to turn observation into action.