What a Container Is Made Of: Namespaces, Cgroups and Union Filesystem

Kai

In Article 1 we saw that runc is the thing that actually creates a container, by using Linux kernel features. This article dissects exactly those features. This is the "magic" of containers, and the good news is there's no magic at all — just three mechanisms already in the Linux kernel, combined.

One sentence to keep in mind before going further: a container is not a kind of tiny virtual machine. It is an ordinary Linux process to which the kernel has given a limited, isolated view of the system. Three mechanisms produce that result:

  • Namespaces: decide what a container sees (isolation).
  • Cgroups: decide how much of each resource a container can use (limits).
  • Union filesystem: decides which filesystem a container sees (layers).

        An ordinary process  ──────────►  A "container"
                                  wrapped with:
        ┌───────────────────────────────────────────────┐
        │  namespaces   → what it sees (proc, net, mnt) │
        │  cgroups      → how much it can use (CPU, RAM)│
        │  union FS     → which fs it sees (layers)     │
        └───────────────────────────────────────────────┘

Let's take them one at a time.

Namespaces: isolating what you see

The Docker docs call namespaces "the first and most straightforward form of isolation." The idea: the Linux kernel can give each process its own "namespace" for each kind of resource, and a process can only see what's in its namespace.

Per the docs, "processes running within a container cannot see, and even less affect, processes running in another container," and "each container also has its own network stack."

There are several kinds of namespace, each isolating one thing:

  • PID: isolates the process tree. The first process in the container sees itself as PID 1 and cannot see processes outside the container.
  • NET: isolates networking, giving the container its own virtual network interfaces, IP addresses, and routing table.
  • MNT: isolates mount points (the filesystem tree the process sees).
  • UTS: isolates the hostname.
  • IPC: isolates inter-process communication (shared memory and similar).
  • USER: maps user and group IDs, so root in the container need not be root on the host. A default Docker installation does not enable this remapping; it is opt-in.
  • CGROUP: isolates the view of the cgroup hierarchy, which is why a cgroup entry appears in the listing below.

Verify it yourself. Each process has a /proc/<pid>/ns/ directory listing its namespaces. Run:

docker run --rm alpine ls -l /proc/self/ns/

Output (trimmed):

lrwxrwxrwx  cgroup -> cgroup:[4026532965]
lrwxrwxrwx  ipc -> ipc:[4026532963]
lrwxrwxrwx  mnt -> mnt:[4026532961]
lrwxrwxrwx  net -> net:[4026532966]
lrwxrwxrwx  pid -> pid:[...]
lrwxrwxrwx  uts -> uts:[...]

Each line is a namespace, identified by an inode number. The container has its own namespaces of these types, so the numbers differ from those of processes on the host.
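To compare namespaces from a script rather than by eye, one option is to pull the inode number out of each link target and compare the numbers. The sketch below is illustrative only: ns_inode and same_namespace are hypothetical helper names, not part of Docker or any standard tool.

```shell
# Print the inode number from a namespace link target such as "net:[4026532966]".
# Prints nothing if the argument does not have that shape.
ns_inode() {
  printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Succeed only when both link targets parse and carry the same inode number.
same_namespace() {
  [ -n "$(ns_inode "$1")" ] && [ "$(ns_inode "$1")" = "$(ns_inode "$2")" ]
}

# Possible usage on a Linux host with Docker installed:
#   host_net=$(readlink /proc/self/ns/net)
#   ctr_net=$(docker run --rm alpine readlink /proc/self/ns/net)
#   same_namespace "$host_net" "$ctr_net" || echo "different network namespaces"
```

With Docker's default networking, the last line should report different network namespaces; a container started with --network=host would share the host's.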

PID isolation is clearest through the ps command. Inside the container:

docker run --rm alpine ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 ps aux

There is only one process, and it is PID 1. The host machine has hundreds of processes, but the container sees none of them — because it's in its own PID namespace. This is exactly "isolation": the same Linux kernel, but each container believes it has a whole system of its own.

And precisely because it shares the kernel with the host, a container is lightweight and starts fast — it doesn't boot an operating system like a VM, it just creates a few namespaces for a process.

Cgroups: limiting resources

Namespaces handle what you see. But if one container consumes all of the machine's CPU or RAM, every other container on the host suffers with it. Preventing that is the job of control groups (cgroups).

The Docker docs: cgroups "implement resource accounting and limiting," ensuring "each container gets its fair share of memory, CPU, disk I/O," and preventing a single container from exhausting system resources.

Try limiting a container to 50 MiB of RAM and then have it read its own limit:

docker run --rm --memory=50m alpine cat /sys/fs/cgroup/memory.max
52428800

52428800 bytes is exactly 50 × 1024 × 1024, that is, 50 MiB (the m suffix is a binary multiple). The --memory=50m flag is translated into a cgroup memory limit, and inside the container the file /sys/fs/cgroup/memory.max reflects that exact number. If the container's processes try to use more than that, the kernel first tries to reclaim memory inside the cgroup; if it cannot, the OOM killer terminates a process in the container.
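The arithmetic can be reproduced in a few lines of shell. This is a minimal sketch, assuming single-letter b/k/m/g suffixes interpreted as binary multiples (which matches the 50m result above); to_bytes is a hypothetical helper, not a Docker command.

```shell
# Convert a size such as "50m" to bytes, treating k/m/g as powers of 1024.
# A bare number or a "b" suffix is returned unchanged.
to_bytes() {
  value=${1%[bBkKmMgG]}          # strip one trailing unit letter, if any
  case $1 in
    *[kK]) echo $((value * 1024)) ;;
    *[mM]) echo $((value * 1024 * 1024)) ;;
    *[gG]) echo $((value * 1024 * 1024 * 1024)) ;;
    *)     echo "$value" ;;
  esac
}

to_bytes 50m   # prints 52428800
```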

Similarly, --cpus caps CPU time, and --cpu-shares sets a relative weight that only matters when containers compete for CPU. The key point: the limit is enforced at the kernel level, and the container can't "sneak around" it.
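The CPU cap can be read back in much the same way as the memory limit. The sketch below rests on an assumption the article does not cover: on cgroup v2 the cap is exposed in /sys/fs/cgroup/cpu.max as two numbers, a quota and a period in microseconds, with the word max meaning "no cap". The helper name cpus_from_cpu_max is hypothetical.

```shell
# Turn the contents of a cgroup v2 cpu.max file ("QUOTA PERIOD" or "max PERIOD")
# into a CPU count with two decimals, or the word "unlimited".
cpus_from_cpu_max() {
  printf '%s\n' "$1" | awk '{
    if ($1 == "max") print "unlimited"
    else printf "%.2f\n", $1 / $2
  }'
}

cpus_from_cpu_max "150000 100000"   # prints 1.50
cpus_from_cpu_max "max 100000"      # prints unlimited

# Possible usage (assumes Docker on a cgroup v2 host):
#   cpus_from_cpu_max "$(docker run --rm --cpus=1.5 alpine cat /sys/fs/cgroup/cpu.max)"
```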

Note: the path /sys/fs/cgroup/memory.max is from cgroup v2 (common on newer systems). Older systems use cgroup v1 with a different path (/sys/fs/cgroup/memory/memory.limit_in_bytes). The mechanism is the same.
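A script that has to work on both kinds of system can simply try the two paths in turn. This is a rough sketch using the two paths named above; cgroup_memory_limit is a hypothetical helper, and the optional argument exists only so the function can be pointed at a different directory.

```shell
# Print the memory limit found under a cgroup mount point.
# Tries the cgroup v2 file first, then the cgroup v1 file.
# $1 (optional): mount point to inspect, default /sys/fs/cgroup.
cgroup_memory_limit() {
  root=${1:-/sys/fs/cgroup}
  if [ -r "$root/memory.max" ]; then
    cat "$root/memory.max"
  elif [ -r "$root/memory/memory.limit_in_bytes" ]; then
    cat "$root/memory/memory.limit_in_bytes"
  else
    echo "no memory limit file under $root" >&2
    return 1
  fi
}
```

Run inside a container started with --memory=50m, it would be expected to print 52428800 on either version. On cgroup v2, a container without a limit reports the word max instead of a number.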

Union filesystem: the layered filesystem

The third piece answers: where does the container get its filesystem (directories, files, commands)? The answer is the union filesystem — a mechanism that stacks multiple layers of files into one.

Per the Docker docs, "an image is built from a series of layers. Each layer corresponds to an instruction in the image's Dockerfile." These layers are read-only. When a container runs, "Docker adds a writable layer on top" to hold any changes.

   Running container
   ┌──────────────────────────────────────────┐
   │  Writable layer (the container's own)    │  ← writes go here
   ├──────────────────────────────────────────┤
   │  Layer 4 (read-only)  CMD/ENV...         │  ┐
   │  Layer 3 (read-only)  copy code          │  │  the image's
   │  Layer 2 (read-only)  install libs       │  │  layers
   │  Layer 1 (read-only)  base OS (alpine...)│  ┘  (shared)
   └──────────────────────────────────────────┘
        union mount → process sees 1 seamless file tree
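The "one seamless file tree" idea boils down to a top-down lookup: for any path, the topmost layer that has the file wins. The toy model below uses plain directories as stand-ins for layers. It is not how a real storage driver is implemented (it ignores deleted-file markers and directory merging), and resolve_in_layers is a hypothetical name.

```shell
# Print the location of a file as seen through a stack of layer directories.
# $1: path relative to the layer root; remaining args: layers, top first.
resolve_in_layers() {
  rel=$1
  shift
  for layer in "$@"; do
    if [ -e "$layer/$rel" ]; then
      printf '%s\n' "$layer/$rel"   # topmost match hides the ones below
      return 0
    fi
  done
  return 1                          # not present in any layer
}

# Possible usage with three directories standing in for layers:
#   resolve_in_layers etc/hostname ./writable ./layer2 ./layer1
```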

View an image's layers yourself:

docker pull nginx:alpine
docker image inspect nginx:alpine --format '{{range .RootFS.Layers}}{{println .}}{{end}}'

Each sha256:... line is a read-only layer. To find out which instruction created which layer and how large it is:

docker history nginx:alpine
CREATED BY                                      SIZE
RUN /bin/sh -c set -x  && apkArch="$(cat …      48.3MB
ENV ACME_VERSION=0.4.1                          0B
CMD ["nginx" "-g" "daemon off;"]                0B
EXPOSE map[80/tcp:{}]                           0B

Notice: the RUN instruction that installs software creates a heavy layer (48.3MB), while ENV, CMD, EXPOSE are only metadata, so they're 0B. Understanding this matters a lot for Article 5 (build cache) and Article 9 (image optimization).
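To list only the instructions that actually add file content, the history output can be filtered on its size column. A small sketch, assuming the input has the size as its first whitespace-separated field and that empty entries are printed as 0B, as in the output above; layers_with_content is a hypothetical helper.

```shell
# Read "SIZE CREATED_BY..." lines on stdin and keep those whose size is not 0B.
layers_with_content() {
  awk '$1 != "0B"'
}

# Possible usage (assumes Docker is installed):
#   docker history --format '{{.Size}} {{.CreatedBy}}' nginx:alpine | layers_with_content
```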

Why this approach saves space

The read-only layers are shared across many containers and many images. If you run 10 containers from the same image, they share those read-only layers; each container only adds a thin writable layer of its own. That's why containers start fast and are light on disk — there's no copying of the whole image for each container.
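A quick back-of-the-envelope calculation shows the difference. The sizes here are made up for illustration (a 48 MB image and 1 MB of writes per container); only the count of 10 containers comes from the text, and both function names are hypothetical.

```shell
# Arguments for both functions: image size (MB), container count, writable layer size (MB).

# Layers shared: one copy of the image plus one writable layer per container.
disk_with_sharing() {
  echo $(( $1 + $2 * $3 ))
}

# No sharing: every container carries a full copy of the image.
disk_with_copies() {
  echo $(( $2 * ($1 + $3) ))
}

disk_with_sharing 48 10 1   # prints 58
disk_with_copies 48 10 1    # prints 490
```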

Copy-on-write: when a container writes a file

Image layers are read-only, so how does a container modify a file in them? Through copy-on-write (CoW). Per the docs, when a container needs to modify a file sitting in a read-only layer below, the storage driver "searches through the image layers for the file," then "performs a copy_up operation, copying the file up to the container's writable layer." From there the container edits the copy in the writable layer; the original layer is unchanged.

   Writing to /etc/nginx/nginx.conf:
   ┌────────────────────────────┐
   │ Writable layer  [copy]    ◄┼── edited here
   ├────────────────────────────┤
   │ Image layer  [original]    │   (unchanged, read-only)
   └────────────────────────────┘
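The copy_up step can be imitated with two ordinary directories, one treated as the read-only image layer and one as the writable layer. This is a toy model for intuition only, not the storage driver's actual code, and cow_append is a hypothetical name.

```shell
# Append a line to a file as seen through a two-layer toy union.
# $1: lower (read-only) dir, $2: upper (writable) dir,
# $3: path relative to the layer root, $4: line to append.
cow_append() {
  lower=$1 upper=$2 rel=$3 line=$4
  mkdir -p "$(dirname "$upper/$rel")"
  if [ ! -e "$upper/$rel" ] && [ -e "$lower/$rel" ]; then
    cp -p "$lower/$rel" "$upper/$rel"      # copy_up: first write copies the file
  fi
  printf '%s\n' "$line" >> "$upper/$rel"   # the edit only ever touches the upper copy
}

# Possible usage:
#   cow_append ./image-layer ./writable etc/nginx/nginx.conf '# local tweak'
```

After the call, the file in the lower directory is byte-for-byte what it was before, and the modified copy exists only in the upper directory.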

The important consequence: every change in the container lives in the writable layer, and the writable layer is deleted when the container is removed. Data you want to keep (a database, uploaded files) should not live only in the container. That's why volumes exist, the topic of Article 6.

Putting it together: what a container is

Combining the three pieces, the definition of "container" becomes concrete: it's one (or a few) Linux processes that are given their own namespaces by the kernel (so they believe they have their own system), are limited by cgroups (so they don't wreck the machine), and run on a union filesystem stacked from read-only image layers plus a writable layer (so they're light and fast).

No guest operating system, no hardware virtualization. Just the host's Linux kernel, used cleverly. That's the entire "secret" of containers.

Wrap-up

  • Namespaces isolate the view (PID, NET, MNT, UTS, IPC, USER) — verify with /proc/self/ns/ and ps in a container.
  • Cgroups limit resources — verify with --memory and /sys/fs/cgroup/memory.max.
  • Union filesystem stacks read-only layers + one writable layer, using copy-on-write — view with docker history and docker image inspect.

These first two articles give you the foundation to understand everything that follows. From Article 3, we roll up our sleeves and get hands-on: installing Docker (if you haven't) and properly running and managing your first container's lifecycle.