seccomp-bpf: Classic BPF Filtering Syscalls in Every Container
Article 14 enforced policy with modern eBPF (LSM BPF). But there is a much older and far more widespread BPF-based filtering mechanism, one that runs in nearly every container you have ever used: seccomp-bpf. It uses classic BPF (cBPF), not eBPF. This article makes that difference clear, inspects real seccomp state on the cluster, then writes a filter by hand.
cBPF and eBPF: two generations
The original "BPF" (1992) was the Berkeley Packet Filter — a tiny virtual machine to filter packets for tcpdump. That is cBPF (classic BPF): 2 registers (A, X), a tiny instruction set, no maps, no helpers, no loops. eBPF (Article 1) is the extended version from 2014: 11 64-bit registers, maps, helpers, a powerful verifier — almost an entirely different thing.
What many people don't notice: seccomp still uses cBPF, not eBPF. When you install a seccomp filter, you load an array of cBPF instructions (struct sock_filter). The kernel translates cBPF to eBPF internally to run it, but the API userspace sees is pure cBPF. This is why seccomp belongs in an eBPF series: it's the old relative of eBPF, and the most widely deployed BPF application on the planet.
Real seccomp on the cluster: who's restricted, who isn't
/proc/<pid>/status tells you whether a process has seccomp: the Seccomp field (0=off, 1=strict, 2=filter) and Seccomp_filters (the number of filters attached). Scan the whole node:
pid 1161 pause filters=1 <- a pod's sandbox container
pid 1290 aws-ebs-csi-dri filters=1
pid 1429 cilium-operator filters=1
pid 2645 csi-provisioner filters=1
pid 450 systemd-resolve filters=28 <- a systemd service tightening itself
pid 123 systemd-journal filters=17
The scan shows the real picture. Ordinary containers such as pause (each pod's sandbox container), csi-* and cilium-operator run with Seccomp: 2 and filters=1: that is the default seccomp profile containerd applies when starting a container. Now check the privileged agent and the host daemons:
cilium-agent (pid 1857): Seccomp: 0 <- NOT restricted
kubelet (pid 817): Seccomp: 0
containerd (pid 592): Seccomp: 0
Seccomp: 0 means no seccomp at all. This is the point worth correcting: not every container has seccomp. Privileged pods such as cilium-agent run unconfined, and kubelet and containerd are host daemons rather than containers, so no container profile applies to them in the first place. Kubernetes by default does not apply seccomp to workloads unless you set securityContext.seccompProfile: RuntimeDefault (or enable the kubelet's SeccompDefault feature). Containerd does apply a default profile to non-privileged containers, which is why pause and the CSI sidecars have filters=1.
And systemd-resolve with 28 filters illustrates a core property: seccomp filters stack on top of each other and cannot be removed. systemd's seccomp-based sandboxing directives in a unit file (SystemCallFilter=, but also others such as RestrictAddressFamilies= and SystemCallArchitectures=) install filters of their own, and once a filter is set it stays until the process dies. This is by design: a sandbox that could relax itself would be meaningless.
Writing a cBPF filter ourselves
A seccomp filter is a cBPF program running on struct seccomp_data (syscall number nr, architecture arch, the args), returning a verdict. Write a filter that blocks mkdir/mkdirat with EPERM and allows every other syscall:
struct sock_filter filter[] = {
    // A = arch; if not x86_64, allow (demo shortcut: a real profile should deny foreign architectures)
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0), // x86_64 -> skip the next RET
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    // A = nr (syscall number)
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 2, 0),   // match -> jump to RET ERRNO
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdirat, 1, 0), // match -> jump to RET ERRNO
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),            // other syscall: allow
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
};
struct sock_fprog prog = { .len = sizeof(filter) / sizeof(filter[0]), .filter = filter }; // 8 instructions
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);            // mandatory for an unprivileged process
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog); // install the filter
This is bare cBPF: BPF_STMT/BPF_JUMP are macros that build individual instructions; the A register loads a field of seccomp_data via BPF_LD|BPF_ABS; jumps use a relative offset (the number of instructions to skip when true/false). No maps, no helpers — completely unlike eBPF. Verdicts: SECCOMP_RET_ALLOW lets the syscall run, SECCOMP_RET_ERRNO | EPERM makes the syscall return -EPERM without executing (other verdicts: KILL_PROCESS, TRAP sending SIGSYS, USER_NOTIF...). PR_SET_NO_NEW_PRIVS is a mandatory condition for an unprivileged process to be able to install seccomp.
Running: mkdir blocked, printf still runs
filter has 8 cBPF instructions
before filter: mkdir /tmp/sc-a -> OK
seccomp filter installed (mode 2)
after filter: mkdir /tmp/sc-b -> Operation not permitted
after filter: this printf still works (write syscall allowed)
Before installing, mkdir runs normally. After installing the filter, mkdir gets exactly Operation not permitted (the EPERM the filter returns), while printf — which calls the write syscall, allowed by SECCOMP_RET_ALLOW — still prints normally. Filtering is precise at the per-syscall level: block this one, allow that one. This is exactly the mechanism a container seccomp profile uses to block dangerous syscalls (mount, kexec_load, ptrace...) while still letting the application run.
seccomp versus LSM BPF
Two enforcement mechanisms at different layers:
| | seccomp-bpf | LSM BPF (Article 14) |
|---|---|---|
| BPF | cBPF (classic) | eBPF |
| Filters by | syscall number + raw args | semantic operation (open this file, mount...) |
| Sees | seccomp_data (nr, arch, args) | real kernel objects (struct file *...) |
| Scope | per process, inherited via fork and kept across exec | system-wide |
| Removal | no (stacks until death) | yes (detach the link) |
seccomp is fast and simple but "blind" — it sees the syscall number and raw arguments, can't dereference a pointer (doesn't know which file openat is opening). LSM BPF sees the resolved kernel object. They complement each other: seccomp narrows the syscall surface, LSM BPF enforces semantic policy.
🧹 Cleanup
Nothing to clean up at the system level: the seccomp filter lives only inside the demo process; when the process exits the filter disappears (seccomp is attached per process, not globally). The /proc/*/status scan only reads, touching nothing. The node still has 140 eBPF programs. The source (seccomp_demo.c) is at github.com/nghiadaulau/ebpf-from-scratch, directory 15-seccomp-bpf.
Wrap-up
seccomp-bpf filters syscalls with classic BPF (cBPF), the original BPF that tcpdump uses, not eBPF (the kernel translates cBPF to eBPF to run it, but the API is cBPF). It is the most widespread BPF application: containerd applies a default profile to ordinary containers (pause and the CSI sidecars on the cluster have Seccomp: 2), and systemd tightens its services (systemd-resolve stacks 28 filters). But not every process has it: the privileged cilium-agent pod and host daemons such as kubelet run with Seccomp: 0, and Kubernetes by default applies no seccomp unless you set RuntimeDefault. A filter is a cBPF program over struct seccomp_data (nr/arch/args), installed with prctl(PR_SET_SECCOMP, ...) after PR_SET_NO_NEW_PRIVS, returning a verdict (ALLOW/ERRNO/KILL...). We wrote 8 cBPF instructions that block mkdir and mkdirat with EPERM and nothing else, so printf (write) still runs. The filter is inherited via fork/exec and cannot be removed (by design: a sandbox doesn't relax itself). Versus LSM BPF: seccomp filters raw syscalls and is blind to pointers, while LSM BPF sees kernel objects. They are two complementary layers.
Article 16 closes Part V with how a real Kubernetes system does runtime security: Tetragon — observing with kprobe/tracepoint then enforcing (killing processes, blocking) right inside the kernel, and we'll dissect the mechanism it uses to turn observation into action.