The Kubernetes Network Model

Kai · 8 min read

The last three articles stood up the full set of components on the workers: containerd runs containers, kubelet orchestrates the node, kube-proxy implements Services. But the nodes are still NotReady, and the reason is always the same line: cni plugin not initialized. Pods have no networking. The next article wires that up by hand, but diving straight into commands risks doing it without understanding why. This article steps back to lay out the theory: what network model Kubernetes demands, and how the various solutions meet that demand.

The four foundational requirements

Kubernetes doesn't set up pod networking itself. It only lays out a contract and leaves the implementation to a third party (the CNI plugin). That contract, per the official docs, consists of these requirements:

  1. Each pod has a unique IP across the whole cluster. "Each pod in a cluster gets its own unique cluster-wide IP address." The IP is per pod, not per container.
  2. Every pod can talk to every other pod, with no NAT. "All pods can communicate with all other pods, whether they are on the same node or on different nodes ... without the use of proxies or address translation (NAT)." A pod on node A reaches a pod on node B at that pod's real IP, and the receiving pod sees the calling pod's real IP as the source.
  3. Agents on a node can talk to every pod on that node. "Agents on a node (such as system daemons, or kubelet) can communicate with all pods on that node." This is how kubelet runs health checks against pods.
  4. Containers within the same pod share a network namespace — they see one shared IP and talk to each other over localhost.

The notable phrase is no NAT. It's what distinguishes the Kubernetes model from Docker's default container networking, where each container sits behind a NAT bridge and has to map ports to the host to be reachable. In Kubernetes, the network is flat: a pod's IP is a real address that is routable from anywhere in the cluster, not hidden behind a host port. This flat model makes everything above it (Service, DNS, NetworkPolicy) far simpler to reason about, because the address a pod sees as its own is exactly the address other pods use to call it.

Pods share one network namespace

The fourth requirement is worth pausing on, because it explains what a "pod" actually is, network-wise. A pod is not a container — it's a group of containers sharing several Linux namespaces, including the network namespace. Specifically: the pause container (Article 10) is created first and holds the pod's network namespace; the application containers then join that same namespace instead of creating their own.

   ┌─────────────── Pod (one network namespace) ───────────────┐
   │   pod IP: 10.200.0.7                                       │
   │                                                            │
   │   ┌──────────┐   ┌──────────┐   ┌──────────────┐          │
   │   │ pause    │   │ app:8080 │   │ sidecar:9090 │          │
   │   └──────────┘   └────┬─────┘   └──────┬───────┘          │
   │   (holds netns)       │  localhost     │                  │
   │                       └────────────────┘                  │
   │              same IP, talk over 127.0.0.1                 │
   └────────────────────────────────────────────────────────────┘

The practical consequence: two containers in the same pod cannot listen on the same port (they share one IP), and they call each other over localhost rather than by service name. That's why pause matters: it's the anchor keeping the namespace alive even when an application container restarts, so the pod doesn't lose its IP mid-flight.
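The port clash is easy to reproduce outside Kubernetes, because any two processes sharing one network namespace behave the same way. The sketch below is only an illustration: two sockets in a single Python process stand in for the app and sidecar containers, and the helper name try_listen is made up for this example.

```python
import errno
import socket

def try_listen(port, host="127.0.0.1"):
    """Bind and listen on host:port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise

# "app" grabs a free port (port 0 asks the kernel to pick one).
app = try_listen(0)
port = app.getsockname()[1]

# "sidecar" shares the same network namespace, so that port is already taken.
sidecar = try_listen(port)
print("sidecar got the port:", sidecar is not None)

# The sidecar can still reach the app over localhost.
client = socket.create_connection(("127.0.0.1", port), timeout=2)
client.close()
app.close()
```

In a real pod the two listeners would be separate containers, but the kernel applies the same rule because they sit in one network namespace.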

The four kinds of communication

Put together, traffic in a cluster falls into four kinds, each handled by a different mechanism:

  • Container ↔ container in the same pod — over localhost, thanks to the shared network namespace. Nothing extra needed.
  • Pod ↔ pod on the same node — over a bridge on the node: two pods plug into the same virtual bridge and forward packets to each other at layer 2. The CNI plugin sets up this bridge.
  • Pod ↔ pod on a different node — the hard part, discussed below. You need a way for a packet to leave node A, reach the right node B, then enter the right pod.
  • Pod ↔ Service — via kube-proxy (Article 12): a virtual ClusterIP gets DNAT'd to a real pod endpoint. Note that kube-proxy only rewrites the destination of the packet; getting that packet to the destination pod still relies on the three kinds above.

The first three are CNI's business; the fourth is kube-proxy's, and that's done. So the rest of the networking part of the series comes down to: build a bridge on each node, and wire the bridges together across nodes.

Pod-to-pod across nodes: the hard part

Within a node, everything is simple: the pods all hang off one bridge, and the kernel forwards packets between them. The problem arises when a pod on worker-0 wants to call a pod on worker-1. The packet's destination is a pod IP (say 10.200.1.5), but that IP doesn't exist on the VPC's physical network; it only has meaning inside the pod network. The underlying network (here, the AWS VPC) has no idea where to send 10.200.1.5.

The way to solve this splits into two families:

Overlay (tunneling). Wrap the pod-to-pod packet inside another packet whose destination is the node IP. Node A wraps the packet 10.200.0.7 → 10.200.1.5 inside a UDP/VXLAN packet sent to node B's real IP (10.0.1.21); node B unwraps it and hands it to the pod. The underlying network only sees normal node-to-node traffic and doesn't need to know about the pod range at all. The tradeoff: every packet carries the extra outer header (encapsulation overhead) and the usable MTU shrinks. Flannel (VXLAN mode) and Calico (overlay mode) take this route. The big advantage: it works on almost any infrastructure, even when you can't control the routing table of the underlying network.
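To put a number on the MTU cost, here is a small arithmetic sketch for VXLAN over IPv4. The header sizes are the standard ones; the 1500-byte underlay MTU is just an assumed example value, not a measurement from this cluster.

```python
# Bytes that VXLAN over IPv4 places in front of the inner IP packet.
OUTER_IPV4 = 20      # outer IP header, no options
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the network identifier
INNER_ETHERNET = 14  # the pod's Ethernet header travels inside the tunnel

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET

def pod_mtu(underlay_mtu):
    """Largest pod IP packet that still fits in the underlay MTU once wrapped."""
    return underlay_mtu - VXLAN_OVERHEAD

print(VXLAN_OVERHEAD)  # 50
print(pod_mtu(1500))   # 1450
```

So on an underlay with a 1500-byte MTU, pods behind a VXLAN overlay would have to be configured with an MTU of 1450 or lower.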

Native routing (L3 routing). No wrapping at all; instead you teach the underlying network how to route the pod range. Each node gets its own subrange (e.g. worker-0 keeps 10.200.0.0/24, worker-1 keeps 10.200.1.0/24), then you add routes: "to reach 10.200.1.0/24, send to worker-1". The pod-to-pod packet travels intact, with no extra header, which makes it faster. The catch is that you must control the underlying network's routing, and the underlying network must be willing to forward packets whose source or destination IP is not the node's own.

  OVERLAY (VXLAN)                      NATIVE ROUTING (L3)
  pod A ─► [ pod packet ]              pod A ─► [ pod packet ]
          wrapped in                          goes straight,
          [ node→node packet ] ─► net         net has route
          node B unwraps ─► pod B              10.200.1.0/24 → worker-1 ─► pod B
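The native-routing side of the diagram comes down to a table lookup: find the per-node range that contains the destination pod IP and forward to that node. Here is a minimal sketch using Python's ipaddress module, with the example ranges from this article. The in-order assignment of ranges to nodes and the helper name next_hop are assumptions made for illustration.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("10.200.0.0/16")
NODES = ["worker-0", "worker-1"]

# Carve the cluster range into /24s and hand them out to the nodes in order.
routes = dict(zip(CLUSTER_CIDR.subnets(new_prefix=24), NODES))

def next_hop(pod_ip):
    """Return the node whose pod range contains pod_ip, or None if no route matches."""
    addr = ipaddress.ip_address(pod_ip)
    for subnet, node in routes.items():
        if addr in subnet:
            return node
    return None

for subnet, node in routes.items():
    print(f"{subnet} -> {node}")

print(next_hop("10.200.1.5"))  # worker-1
```

A real router does a longest-prefix match across all its routes. With non-overlapping /24s, the simple containment check above gives the same answer.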

Which route our cluster takes

This cluster sits entirely within one subnet of an AWS VPC (10.0.1.0/24), so we choose native routing, not overlay. The reasons, and the pieces already prepared:

  • Per-node pod ranges. The overall pod range is 10.200.0.0/16 (already declared in the controller-manager's --cluster-cidr in Article 8 and kube-proxy's clusterCIDR in Article 12). We split it: worker-0 takes 10.200.0.0/24, worker-1 takes 10.200.1.0/24. Right now kubectl get nodes -o jsonpath='{..podCIDR}' is still empty because no node has been assigned a range yet; Article 14 will assign them.
  • Routes between the two ranges. Since both nodes are in the same subnet, we add routes in the VPC's route table: 10.200.0.0/24 → worker-0, 10.200.1.0/24 → worker-1. Then pod-to-pod traffic across nodes is forwarded by the VPC router itself, no tunnel needed.
  • Disable source/destination check. By default AWS blocks an instance from sending/receiving packets with an IP that isn't its own — exactly what we need to violate, since a pod packet carries a pod IP, not a node IP. In Article 3 we already disabled this check; let's confirm it's off:
aws ec2 describe-instances --filters Name=tag:Name,Values=worker-0,worker-1 \
  --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value,SourceDestCheck]' --output text
worker-1   False
worker-0   False

False means the check is off, and the node is allowed to forward packets with foreign IPs. These three pieces (per-node pod ranges, VPC routes, source/dest check off) are the entire framework for native routing; Article 14 assembles them.

Where CNI fits

There's still a question: who actually creates the bridge, attaches the pod, assigns the pod an IP? Not kubelet directly, but a CNI plugin that kubelet calls through an intermediary. Recall the delegation chain from Articles 10–11, now extended one more notch:

  kubelet ──CRI──► containerd ──► create the network namespace for the pod
                        │
                        └──► call the CNI plugin (binary in /opt/cni/bin)
                               read the config in /etc/cni/net.d
                               ADD: assign IP, create veth, attach pod to bridge
                               DEL: reclaim when the pod dies

CNI is a surprisingly simple contract: a plugin is just an executable that receives a command such as ADD or DEL through environment variables and the network config as JSON on stdin, then sets up or tears down networking for a network namespace. When containerd creates a pod, it calls the plugin with ADD; the plugin assigns an IP from the node's range, creates a veth pair linking the pod namespace to the node's bridge, and returns the assigned IP. When the pod dies, DEL cleans up.
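To make the contract concrete, here is a toy sketch of the ADD/DEL flow in Python. It is not the bridge or host-local plugin. It only shows the shape of the exchange: the command arrives in the CNI_COMMAND environment variable, the config arrives as JSON, and ADD answers with a JSON result. The function name handle_cni, the in-memory allocated dict standing in for IPAM state, and the simplified "ipam" config are all assumptions for illustration.

```python
import ipaddress
import json

def handle_cni(env, stdin_text, allocated):
    """Toy CNI handler; `allocated` maps container ID -> IP (stand-in for IPAM state)."""
    command = env["CNI_COMMAND"]
    container_id = env["CNI_CONTAINERID"]
    conf = json.loads(stdin_text)
    subnet = ipaddress.ip_network(conf["ipam"]["subnet"])

    if command == "ADD":
        in_use = set(allocated.values())
        hosts = subnet.hosts()
        gateway = next(hosts)  # keep the first host address for the bridge
        ip = next(h for h in hosts if h not in in_use)
        allocated[container_id] = ip
        # A real plugin would now create the veth pair and attach it to the bridge.
        result = {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": f"{ip}/{subnet.prefixlen}",
                     "gateway": str(gateway), "interface": 0}],
        }
        return json.dumps(result)
    if command == "DEL":
        allocated.pop(container_id, None)  # release the IP; DEL should tolerate repeats
        return ""
    raise ValueError(f"unsupported CNI_COMMAND: {command}")

# Hypothetical inputs, shaped like what the runtime would pass to a plugin.
conf = json.dumps({"cniVersion": "1.0.0", "name": "demo", "type": "toy",
                   "ipam": {"subnet": "10.200.0.0/24"}})
env = {"CNI_COMMAND": "ADD", "CNI_CONTAINERID": "abc123",
       "CNI_NETNS": "/var/run/netns/demo", "CNI_IFNAME": "eth0"}
state = {}
print(handle_cni(env, conf, state))
```

A real plugin reads os.environ and sys.stdin and prints the result to stdout. Passing them in as arguments here just keeps the sketch easy to run and inspect.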

We already have one of the two halves from Article 10: the plugin binaries live in /opt/cni/bin (including bridge and host-local, which we'll use), but the config directory /etc/cni/net.d is still empty. That very emptiness is why crictl info reports cni plugin not initialized, and why kubelet keeps the node NotReady: kubelet refuses to accept pods until it has a way to give them networking. Writing that config file, along with adding the VPC routes, is the next article's job.
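For orientation, a config for the bridge and host-local plugins could look roughly like the output of the sketch below. This is an assumption about the general shape, not the file the next article writes. The network name, the bridge name cni0 and the cniVersion are placeholders, and only the per-node subnet comes from this article.

```python
import json

def bridge_conf(pod_cidr):
    """Build a bridge + host-local network config for one node's pod range."""
    return {
        "cniVersion": "1.0.0",      # placeholder; must match what the plugins support
        "name": "bridge",           # placeholder network name
        "type": "bridge",           # which binary in /opt/cni/bin to run
        "bridge": "cni0",           # placeholder name for the Linux bridge
        "isGateway": True,          # give the bridge an IP so it can act as the pods' gateway
        "ipam": {
            "type": "host-local",   # allocate pod IPs from a range local to this node
            "ranges": [[{"subnet": pod_cidr}]],
            "routes": [{"dst": "0.0.0.0/0"}],
        },
    }

# worker-0's range from this article; worker-1 would pass 10.200.1.0/24.
print(json.dumps(bridge_conf("10.200.0.0/24"), indent=2))
```

Only the subnet differs between the two workers, which is why the per-node pod ranges have to be settled first.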

Wrap-up

The Kubernetes network model boils down to one idea: a flat network, IP-per-pod, no NAT between pods. Every mechanism above relies on it. Of the four kinds of communication, three are CNI's job and one (Service) is already done with kube-proxy. The tricky part, pod-to-pod across nodes, has two solutions; our single-subnet cluster picks native routing for its simplicity and speed, with the three prepared pieces already in place.

Now we can type the commands, because we know what each one is for. Article 14 writes the CNI config for bridge + host-local on each worker, assigns a pod range to each node, adds the routes in the VPC, then creates a few real pods so we can watch them get IPs, ping each other across nodes, and finally see the two nodes flip to Ready.