Wiring Up Pod Networking by Hand: CNI bridge and VPC Routes

Kai · 8 min read

Article 13 laid out the theory: a flat network model, IP-per-pod, and the choice of native routing for our single-subnet cluster. Now we assemble. The work boils down to three pieces: write the CNI config so each node creates a bridge and assigns IPs for its own pods, add routes in the VPC so pod-to-pod packets find their way across nodes, and add a masquerade rule for pod traffic leaving the cluster. With those three done, the nodes flip to Ready on their own and pods actually run.

Step 1 — CNI config for each worker

Recall Articles 10–13: the plugin binaries already live in /opt/cni/bin, but /etc/cni/net.d is still empty, and that emptiness is why the node is NotReady. Now write two config files into it.

The main file uses the bridge plugin with host-local IPAM. The crucial point: each node declares a different subnet (worker-0 takes 10.200.0.0/24, worker-1 takes 10.200.1.0/24) so the two nodes never assign overlapping pod IPs.

# on worker-0 (worker-1 changes subnet to 10.200.1.0/24)
sudo mkdir -p /etc/cni/net.d
sudo tee /etc/cni/net.d/10-bridge.conf >/dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "ranges": [[{"subnet": "10.200.0.0/24"}]],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
EOF

sudo tee /etc/cni/net.d/99-loopback.conf >/dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "lo",
  "type": "loopback"
}
EOF

Field by field:

  • type: bridge, bridge: cni0 — every pod on the node hangs off a virtual bridge named cni0. Pods on the same node talk to each other over this bridge at layer 2.
  • isGateway: true — assigns cni0 the .1 address of the range (10.200.0.1) and makes it the default gateway for pods. Anything a pod sends out goes through here first.
  • ipMasq: false tells the plugin not to masquerade on its own. This is native routing's key choice: we want pod-to-pod packets to keep their real source IP (the "no NAT" model from Article 13). Masquerading for traffic leaving the cluster we handle separately, under control, in Step 3.
  • ipam: host-local with ranges being the node's range — the plugin manages sequential IP assignment within the range itself, keeping its ledger in /var/lib/cni/networks. routes: 0.0.0.0/0 writes into the pod a default route pointing back to cni0.
  • The 99-loopback file sets up the lo interface inside each pod. The 99 prefix makes it sort after 10-bridge.conf, so the bridge config is read first; the loopback plugin only handles 127.0.0.1 inside the pod.

containerd watches this directory and reloads immediately. Check that it has read the config:

sudo crictl info | grep -i lastCNILoadStatus
  "lastCNILoadStatus": "OK",

OK replaces the cni plugin not initialized message from earlier articles — the runtime now knows how to set up pod networking. Write the same thing for worker-1 with subnet 10.200.1.0/24.
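Since the two nodes' bridge configs differ only in the subnet, one way to avoid hand-editing is to render the file from the node index. The following is a minimal sketch, not part of the original scripts: the function name render_bridge_conf is hypothetical, and it assumes the convention used in this series that worker-N owns 10.200.N.0/24.

```shell
# Sketch: print the bridge CNI config for worker-N (N = node index).
# Assumption: worker-N owns the pod range 10.200.N.0/24, as in this series.
# render_bridge_conf is a hypothetical helper name.
render_bridge_conf() {
  local index="$1"
  local subnet="10.200.${index}.0/24"
  cat <<EOF
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "ranges": [[{"subnet": "${subnet}"}]],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
EOF
}

# Preview worker-1's config on stdout; nothing is written to disk.
render_bridge_conf 1
```

Once the preview looks right, it could be piped into place on the node with: render_bridge_conf 1 | sudo tee /etc/cni/net.d/10-bridge.conf >/dev/null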

Step 2 — Routes in the VPC for the two pod ranges

The config above is enough for pods on the same node to talk to each other over cni0. But a packet from a pod on worker-0 (10.200.0.x) sent to a pod on worker-1 (10.200.1.x) goes to cni0, out the node's eth0, and on to the VPC router. The router has no idea where 10.200.1.0/24 lives, because that is a pod range, not the subnet's range. We teach it with two static routes, each sending one pod range to the instance that holds it.

RTB=rtb-086c1b93e4ff0a50c   # route table of the subnet
aws ec2 create-route --route-table-id $RTB \
  --destination-cidr-block 10.200.0.0/24 --instance-id i-0f1ab7628507cb9cd  # worker-0
aws ec2 create-route --route-table-id $RTB \
  --destination-cidr-block 10.200.1.0/24 --instance-id i-0a33782c408f5bf09  # worker-1
{ "Return": true }
{ "Return": true }

Review the route table:

aws ec2 describe-route-tables --route-table-ids $RTB \
  --query 'RouteTables[].Routes[].[DestinationCidrBlock,InstanceId,GatewayId]' --output text
10.200.0.0/24   i-0f1ab7628507cb9cd   None
10.200.1.0/24   i-0a33782c408f5bf09   None
10.0.0.0/16     None                  local
0.0.0.0/0       None                  igw-0f956a0362900fb68

The first two lines are what we just added, and this is Article 13's "native routing" in concrete form: the underlying network (the VPC) routes the pod range itself, no tunnel needed. These routes work because in Article 3 we disabled source/destination check on both instances; otherwise AWS would block the node from forwarding packets carrying pod IPs (not the node's own IP).
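With more than two workers, typing each route by hand gets error-prone. As a rough sketch, the commands can be generated from a list of node index and instance ID pairs and reviewed before anything is sent to AWS. The helper name print_route_cmds is hypothetical, and it again assumes worker-N owns 10.200.N.0/24; it only prints commands and calls nothing.

```shell
# Sketch: print (do not run) one create-route command per worker.
# Input on stdin: lines of "<node-index> <instance-id>".
# Assumption: worker-N owns 10.200.N.0/24. print_route_cmds is a hypothetical name.
print_route_cmds() {
  local rtb="$1" index instance
  while read -r index instance; do
    # Skip blank lines.
    [ -n "$index" ] || continue
    printf 'aws ec2 create-route --route-table-id %s --destination-cidr-block 10.200.%s.0/24 --instance-id %s\n' \
      "$rtb" "$index" "$instance"
  done
}

# The two workers from this article.
print_route_cmds rtb-086c1b93e4ff0a50c <<'EOF'
0 i-0f1ab7628507cb9cd
1 i-0a33782c408f5bf09
EOF
```

After reviewing the printed lines, they could be piped to sh to run them. To confirm the source/destination check mentioned above is really off, aws ec2 describe-instance-attribute --instance-id <id> --attribute sourceDestCheck reports the current value for an instance.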

Step 3 — Masquerade for pod traffic leaving the cluster

Because we set ipMasq: false, pod-to-pod keeps the original IP. But when a pod wants to reach the Internet (or a service outside the cluster), the packet carries source IP 10.200.x.y, and that destination has no return path to the pod range. We need to SNAT packets leaving the pod network to the node IP, and only those, never touching pod-to-pod.

The condition states it concisely: source belongs to the node's pod range, destination does not belong to the overall pod range (10.200.0.0/16):

# on worker-0 (worker-1 changes -s to 10.200.1.0/24)
sudo iptables -t nat -A POSTROUTING \
  -s 10.200.0.0/24 ! -d 10.200.0.0/16 -j MASQUERADE \
  -m comment --comment "pod egress masq"

! -d 10.200.0.0/16 is the crux: every destination within the pod network is excluded from masquerade, so pod-to-pod (even across nodes) still sees each other's real IPs; only traffic going outside 10.200.0.0/16 gets SNAT'd to the node IP.
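To make the match condition concrete, here is a small sketch that replays the same logic in plain shell arithmetic for worker-0's rule. It does not touch iptables; the helper names (ip_to_int, in_cidr, masq_decision) are made up for illustration, and the CIDR values are the ones used in this article.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  # Unquoted on purpose: split the address on dots into $1..$4.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit 0) if address $1 lies inside CIDR $2, e.g. 10.200.0.0/16.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Mirror of worker-0's rule: -s 10.200.0.0/24 ! -d 10.200.0.0/16 -j MASQUERADE
masq_decision() {
  local src="$1" dst="$2"
  if in_cidr "$src" 10.200.0.0/24 && ! in_cidr "$dst" 10.200.0.0/16; then
    echo MASQUERADE
  else
    echo keep-source
  fi
}

masq_decision 10.200.0.2 10.200.1.2   # pod on worker-0 -> pod on worker-1: keep-source
masq_decision 10.200.0.2 8.8.8.8      # pod on worker-0 -> Internet: MASQUERADE
```

The first call prints keep-source because the destination is inside 10.200.0.0/16; the second prints MASQUERADE because it is not. Traffic sourced from the node's own address (10.0.1.20) never matches the -s test, so the rule leaves it alone.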

Step 4 — Nodes flip to Ready

No restart needed. kubelet periodically asks containerd for its network status; the moment CNI loads OK, the node's Ready condition flips on its own:

kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   CONTAINER-RUNTIME
worker-0   Ready    <none>   14m   v1.36.1   10.0.1.20     containerd://2.3.1
worker-1   Ready    <none>   13m   v1.36.1   10.0.1.21     containerd://2.3.1

Ready, both of them. This is the milestone: the first time the cluster is ready to accept and run real pods. The gap that stretched back to Article 11 has been filled exactly where we predicted: it was CNI, not a broken kubelet.
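If you would rather script this check than eyeball it, a tiny filter over the node listing is enough. This is only a sketch: not_ready_nodes is a hypothetical helper, it expects the output of kubectl get nodes --no-headers on stdin, and the sample input below is illustrative rather than captured from the cluster.

```shell
# Sketch: print the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also reported.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Illustrative input: worker-1 has not picked up its CNI config yet.
not_ready_nodes <<'EOF'
worker-0   Ready      <none>   14m   v1.36.1
worker-1   NotReady   <none>   13m   v1.36.1
EOF
```

Against the live cluster the call would be kubectl get nodes --no-headers | not_ready_nodes, and empty output means every node is Ready. For a blocking variant, kubectl wait --for=condition=Ready node --all --timeout=120s does the waiting for you.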

Step 5 — Create real pods on two nodes

Create two busybox pods, pinning each to a node with nodeName to be sure they sit on different nodes — exactly the situation we want to test:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  nodeName: worker-0
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  nodeName: worker-1
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF

kubectl wait --for=condition=Ready pod/pod-a pod/pod-b --timeout=90s
kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP           NODE
pod-a   1/1     Running   0          7s    10.200.0.2   worker-0
pod-b   1/1     Running   0          7s    10.200.1.2   worker-1

Both pods Running, and crucially the IPs: pod-a got 10.200.0.2 (from worker-0's range), pod-b got 10.200.1.2 (from worker-1's range). The host-local IPAM assigned each node's range correctly: .1 is cni0, so the first pod gets .2. From inside the pod, the routes match what we configured:

kubectl exec pod-a -- ip route
default via 10.200.0.1 dev eth0
10.200.0.0/24 dev eth0 scope link  src 10.200.0.2

The default gateway is 10.200.0.1, which is worker-0's cni0. Everything the pod sends outside its node's range goes through here.
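The "right range on the right node" check can also be automated. The sketch below assumes the worker-N / 10.200.N.0/24 convention from this series and reads three columns (pod name, pod IP, node name) on stdin; check_pod_ranges is a hypothetical name, and the second input line is a deliberately wrong example.

```shell
# Sketch: for each "name ip node" line, check that the pod IP is 10.200.N.x
# where N is the index in the node name worker-N.
# Assumption: worker-N owns 10.200.N.0/24. check_pod_ranges is a hypothetical name.
check_pod_ranges() {
  awk '{
    split($2, o, ".")              # o[1]..o[4] are the octets of the pod IP
    n = $3
    sub(/^worker-/, "", n)         # node index taken from the node name
    ok = (o[1] == "10" && o[2] == "200" && o[3] == n)
    print $1, (ok ? "ok" : "MISMATCH")
  }'
}

check_pod_ranges <<'EOF'
pod-a 10.200.0.2 worker-0
pod-x 10.200.0.9 worker-1
EOF
```

This prints "pod-a ok" and "pod-x MISMATCH". Live input in the same three-column shape can be produced with kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName.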

Step 6 — Ping across nodes

This is the test for all three networking articles. pod-a on worker-0 pings pod-b on worker-1:

kubectl exec pod-a -- ping -c 3 10.200.1.2
PING 10.200.1.2 (10.200.1.2): 56 data bytes
64 bytes from 10.200.1.2: seq=0 ttl=62 time=1.369 ms
64 bytes from 10.200.1.2: seq=1 ttl=62 time=0.347 ms
64 bytes from 10.200.1.2: seq=2 ttl=62 time=0.410 ms

--- 10.200.1.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss

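It gets through with no packet loss. ttl=62 (down 2 from the default 64) shows the reply crossed exactly two routing hops: worker-1's kernel forwarded it from cni0 to eth0, and worker-0's kernel forwarded it from eth0 to cni0. The VPC route table delivered it from one instance to the other, with no tunnel involved. The full path: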

   pod-a 10.200.0.2                          pod-b 10.200.1.2
       │ veth                                     veth │
   ┌───┴── cni0 10.200.0.1                cni0 10.200.1.1 ──┴───┐
   │   worker-0 eth0 10.0.1.20      worker-1 eth0 10.0.1.21     │
   └────────┬──────────────────────────────────┬───────────────┘
            └──────► VPC route table ◄──────────┘
                 10.200.1.0/24 → worker-1
                 10.200.0.0/24 → worker-0
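The TTL reading is simple subtraction: the starting TTL minus the TTL seen in the reply gives the number of routers that forwarded the packet. A minimal sketch, assuming the sender starts at 64 as stated above (the helper name hops_from_ping_line is hypothetical):

```shell
# Sketch: extract ttl=N from one line of ping output and print initial - N.
# Assumption: the sender's initial TTL is 64 unless a second argument says otherwise.
hops_from_ping_line() {
  local line="$1" initial="${2:-64}" ttl
  ttl=$(printf '%s\n' "$line" | sed -n 's/.*ttl=\([0-9][0-9]*\).*/\1/p')
  # Fail if the line carries no ttl= field.
  [ -n "$ttl" ] || return 1
  echo $(( initial - ttl ))
}

hops_from_ping_line '64 bytes from 10.200.1.2: seq=0 ttl=62 time=1.369 ms'   # prints 2
```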

Three more checks to round it out. The pod sees its own real IP (not hidden behind NAT):

kubectl exec pod-a -- ip -4 addr show eth0 | grep inet
    inet 10.200.0.2/24 brd 10.200.0.255 scope global eth0

From inside the pod, open TCP to the ClusterIP of the kubernetes Service; kube-proxy (Article 12) does its job even when the source is a real pod:

kubectl exec pod-b -- nc -w 3 -zv 10.32.0.1 443
10.32.0.1 (10.32.0.1:443) open

And the pod reaches the Internet, thanks to the masquerade rule in Step 3:

kubectl exec pod-b -- ping -c 2 8.8.8.8
2 packets transmitted, 2 packets received, 0% packet loss

The four kinds of communication from Article 13 now all run for real: containers within a pod (localhost), pod-to-pod on the same node (cni0), pod-to-pod across nodes (VPC routes), and pod-to-Service (kube-proxy).

🧹 Cleanup

Delete the two test pods; they were only for verification:

kubectl delete pod pod-a pod-b

The CNI config and VPC routes are permanent parts of the cluster, so leave them. A note on durability: the VPC routes survive a stop/start, but the iptables masquerade rule from Step 3 does not, because it lives in kernel memory and is lost whenever the instance reboots or is stopped and started. If you pause the cluster and bring it back, re-run the iptables ... MASQUERADE command on each worker (or put it in a oneshot systemd unit if you want it to truly persist). The kube-proxy and bridge rules rebuild themselves when their services start, so no worry there.
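For the oneshot-unit option, the following sketch prints a unit file for a given node range. Several things here are assumptions rather than part of the original scripts: the helper name render_masq_unit, the unit name pod-egress-masq.service, and the rule comment, which is written without spaces (pod-egress-masq) to keep the quoting inside ExecStart simple. The unit uses iptables -C to check for the rule first, so starting it twice does not add a duplicate.

```shell
# Sketch: print a oneshot systemd unit that re-adds the egress masquerade rule at boot.
# $1 = this node's pod range, $2 = overall pod range (defaults to 10.200.0.0/16).
# render_masq_unit and the comment text pod-egress-masq are illustrative choices.
render_masq_unit() {
  local pod_cidr="$1" cluster_cidr="${2:-10.200.0.0/16}"
  local rule="POSTROUTING -s ${pod_cidr} ! -d ${cluster_cidr} -m comment --comment pod-egress-masq -j MASQUERADE"
  cat <<EOF
[Unit]
Description=Masquerade pod egress traffic (${pod_cidr})
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'iptables -t nat -C ${rule} 2>/dev/null || iptables -t nat -A ${rule}'

[Install]
WantedBy=multi-user.target
EOF
}

# Preview the unit for worker-0; nothing is installed.
render_masq_unit 10.200.0.0/24
```

On worker-0 this could be installed with render_masq_unit 10.200.0.0/24 | sudo tee /etc/systemd/system/pod-egress-masq.service >/dev/null, followed by sudo systemctl daemon-reload and sudo systemctl enable --now pod-egress-masq.service. worker-1 would pass 10.200.1.0/24 instead.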

The full scripts (CNI config for both nodes, route commands, masquerade) are at github.com/nghiadaulau/kubernetes-from-scratch, directory 14-pod-network.

Wrap-up

Pod networking is wired end to end, built entirely by hand and from pieces we understand: cni0 for same-node pods, VPC routes for crossing nodes, selective masquerade for traffic leaving. No overlay, no extra headers: pod-to-pod packets travel intact because the cluster fits in one subnet and we control the underlying network's routing. Both nodes Ready, pods running for real and reachable in every direction.

One thing is still missing for the cluster to look usable: pods calling each other by name, not IP. A pod's IP changes every time the pod is recreated, so no one hard-codes them. Article 15 deploys CoreDNS as a Deployment in the cluster, creates a Service for it at exactly the 10.32.0.10 that kubelet has been pointing pods at (Article 11), and watches a pod resolve a Service name into a ClusterIP.