Seccomp, AppArmor and Capabilities

Kai · 4 min read

Article 54 stopped at the policy level: restricted demands runAsNonRoot, drop: ["ALL"], seccompProfile, and allowPrivilegeEscalation: false. Apart from runAsNonRoot, which the kubelet checks before it starts the container, these fields aren't Kubernetes concepts: they're Linux kernel mechanisms that the container runtime turns on for each container. This article goes down a level to see what they actually do by comparing two pods side by side, one default and one hardened, and reading /proc/self/status directly to see the difference at the kernel layer.

Three layers of defense in securityContext

A pod/container securityContext controls three different kernel mechanisms:

   capabilities  ── splits root's powers into ~40 pieces; drop some so even root
                    can't do dangerous operations (mount, chown, bind low ports...)

   seccomp       ── filters syscalls: the runtime's filter blocks rarely-used but
                    dangerous syscalls (keyctl, ptrace into another process...)

   AppArmor      ── (on this Ubuntu node) a profile limiting the files/paths/operations
                    a process is allowed to touch, enforced at the kernel's LSM layer

These three layers are independent and additive. Create two pods to compare — one declaring nothing, one fully hardened:

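# hardened pod
apiVersion: v1
kind: Pod
metadata:
  name: hardened
  namespace: sec-demo
spec:
  securityContext:
    seccompProfile: {type: RuntimeDefault}    # the runtime's default syscall filter
    appArmorProfile: {type: RuntimeDefault}   # this field exists from Kubernetes v1.30
  containers:
  - name: c
    image: busybox:1.36
    command: ["sleep", "100000"]
    securityContext:
      allowPrivilegeEscalation: false         # sets the no_new_privs flag
      capabilities: {drop: ["ALL"]}           # empties the capability sets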
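Only the hardened manifest is shown here. As a minimal sketch of the other pod, assuming the pod name def and the namespace sec-demo that the kubectl exec commands in this article use (the file name def-pod.yaml is made up for illustration), the "declares nothing" pod could be written out like this:

```shell
# Write a manifest for the pod that declares no security settings at all.
# Assumptions: pod name "def" and namespace "sec-demo" are taken from the
# kubectl exec commands in this article; def-pod.yaml is a hypothetical file name.
cat > def-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: def
  namespace: sec-demo
spec:
  containers:
  - name: c
    image: busybox:1.36
    command: ["sleep", "100000"]
EOF
echo "wrote def-pod.yaml"
```

With a cluster at hand, kubectl create namespace sec-demo followed by kubectl apply -f def-pod.yaml should create it. Everything the kernel later reports for this pod is whatever the runtime applies on its own.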

Reading /proc/self/status of two pods

The kernel exposes a process's security state in /proc/self/status, and its AppArmor label in /proc/self/attr/current. Compare the default pod and the hardened pod:

kubectl -n sec-demo exec def      -- sh -c 'grep -E "^(CapEff|Seccomp|NoNewPrivs):" /proc/self/status; echo "AppArmor:   $(cat /proc/self/attr/current)"'
kubectl -n sec-demo exec hardened -- sh -c 'grep -E "^(CapEff|Seccomp|NoNewPrivs):" /proc/self/status; echo "AppArmor:   $(cat /proc/self/attr/current)"'
# pod def (default)
CapEff:     00000000a80425fb
NoNewPrivs: 0
Seccomp:    0
AppArmor:   cri-containerd.apparmor.d (enforce)

# pod hardened
CapEff:     0000000000000000
NoNewPrivs: 1
Seccomp:    2
AppArmor:   cri-containerd.apparmor.d (enforce)

Read line by line:

  • CapEff (effective capabilities): the default pod has a80425fb, a bitmask with 14 bits set. These are the capabilities containerd grants every container by default (chown, net_bind_service, setuid, ...): not the full power of root, but still plenty. The hardened pod has 0, meaning no capabilities at all.
  • Seccomp: 0 means disabled (every syscall goes through unfiltered), 2 means filter mode (a filter is installed, here the runtime's default). The default pod has no seccomp filter because Kubernetes leaves pods unconfined unless a profile is requested or the kubelet's seccompDefault setting is enabled; the hardened pod has RuntimeDefault on.
  • NoNewPrivs: 1 in the hardened pod comes from allowPrivilegeEscalation: false. With this flag set, execve can no longer grant a process more privilege than it already has, so setuid binaries and file capabilities stop working as an escalation path.
  • AppArmor: both show the same cri-containerd.apparmor.d (enforce). This is an easy point to misread: containerd applies a default AppArmor profile to every container on this Ubuntu node, including a pod that declares nothing. Declaring appArmorProfile: RuntimeDefault only states explicitly what's already happening.
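The CapEff value is a bitmask, so it can be decoded by hand. The following is a small sketch in plain POSIX shell, not a tool the article uses: decode_caps and seccomp_mode are hypothetical helper names, and the name list assumes the bit numbering from linux/capability.h (bit 0 is CAP_CHOWN, bit 40 is CAP_CHECKPOINT_RESTORE).

```shell
# Capability names in bit order (bit 0 = chown ... bit 40 = checkpoint_restore),
# following the numbering in linux/capability.h.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one capability name per bit set in the mask.
decode_caps() {
  mask=$((0x$1))
  bit=0
  for name in $CAP_NAMES; do
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      echo "cap_$name"
    fi
    bit=$((bit + 1))
  done
}

# seccomp_mode N: translate the Seccomp field of /proc/<pid>/status into words.
seccomp_mode() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

decode_caps 00000000a80425fb   # the default pod's CapEff
seccomp_mode 2                 # the hardened pod's Seccomp value
```

Run as written, this lists the 14 names behind a80425fb, from cap_chown to cap_setfcap, and then prints filter. Calling decode_caps 0000000000000000 prints nothing, which is what drop: ["ALL"] looks like.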

What a capability blocks: try chown

The number CapEff: 0 sounds abstract, so try a concrete operation. chown needs CAP_CHOWN. Both pods run as root (busybox defaults to uid 0), but the hardened pod has drop: ["ALL"]:

kubectl -n sec-demo exec def      -- sh -c 'touch /tmp/f && chown 1000 /tmp/f && echo OK; whoami'
kubectl -n sec-demo exec hardened -- sh -c 'touch /tmp/f && chown 1000 /tmp/f; id -u'
# pod def
OK
root

# pod hardened
chown: /tmp/f: Operation not permitted
0

The default pod can chown because it still has CAP_CHOWN. The hardened pod gets Operation not permitted even though id -u still prints 0: the process is still root, but with CAP_CHOWN dropped the kernel refuses the call. This is the core point of capabilities. They split root's powers into pieces, so a container that runs as root but has dropped all capabilities can no longer perform most dangerous operations. If it needs one specific power back, add exactly that one (capabilities.add: ["NET_BIND_SERVICE"] to bind ports below 1024) instead of opening everything.
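To check that a drop-all-then-add-back configuration did what was intended, it helps to know which CapEff value to expect. A minimal sketch, assuming the container still runs as uid 0 as both pods here do (so the effective set equals what the runtime grants); capeff_for is a hypothetical helper name and the bit numbers follow linux/capability.h:

```shell
# capeff_for BIT...: print the 16-digit hex mask with exactly those bits set.
# Bit numbers follow linux/capability.h (for example 0 = CAP_CHOWN,
# 10 = CAP_NET_BIND_SERVICE).
capeff_for() {
  mask=0
  for bit in "$@"; do
    mask=$((mask | (1 << bit)))
  done
  printf '%016x\n' "$mask"
}

capeff_for        # drop ALL, add nothing          -> 0000000000000000
capeff_for 10     # drop ALL, add NET_BIND_SERVICE -> 0000000000000400
```

If grep CapEff /proc/self/status inside such a pod shows anything other than the expected mask, more was granted than was asked for.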

🧹 Cleanup

kubectl delete namespace sec-demo

This article only creates two pods, touching no node or runtime configuration. Containerd's default AppArmor remains in place for every pod. Manifests are at github.com/nghiadaulau/kubernetes-from-scratch, directory 55-seccomp-apparmor-capabilities.

Wrap-up

The fields restricted demands in Article 54 map down to independent kernel mechanisms. Comparing /proc/self/status of two pods shows it clearly: the default pod has CapEff a80425fb (the 14 capabilities containerd grants by default), Seccomp 0 (off), NoNewPrivs 0; the hardened pod has CapEff 0 (drop ALL), Seccomp 2 (RuntimeDefault's filter), NoNewPrivs 1 (allowPrivilegeEscalation: false). For AppArmor, containerd has already applied the default cri-containerd.apparmor.d profile to both, so appArmorProfile: RuntimeDefault is just stating it explicitly. The real consequence of dropping a capability: chown gets Operation not permitted even when the container runs as root. Capabilities split root's powers, and dropping everything, then adding back only what you need, is the tidy way to tighten. This is the layer beneath the Pod Security policy from Article 54: PSA makes you declare, the kernel enforces.

Most of Part XI so far has tightened permissions and tightened the container. Article 56 closes it out with secrets and the holes still left: where a Secret lives, who can read it, the "whoever can create a pod can read the Secret" detour, and a table of cluster-hardening steps still missing in a self-built cluster.