Wiring Up Pod Networking by Hand: CNI Bridge and VPC Routes
Article 13 laid out the theory: a flat network model, IP-per-pod, and the choice of native routing for our single-subnet cluster. Now we assemble. The work boils down to three pieces: write the CNI config so each node creates a bridge and assigns IPs for its own pods, add routes in the VPC so pod-to-pod packets find their way across nodes, and add a masquerade rule for pod traffic leaving the cluster. With those three done, the nodes flip to Ready on their own and pods actually run.
Step 1 — CNI config for each worker
Recall Articles 10–13: the plugin binaries already live in /opt/cni/bin, but /etc/cni/net.d is still empty, and that emptiness is why the node is NotReady. Now write two config files into it.
The main file uses the bridge plugin with host-local IPAM. The crucial point: each node declares a different subnet (worker-0 takes 10.200.0.0/24, worker-1 takes 10.200.1.0/24) so the two nodes never assign overlapping pod IPs.
# on worker-0 (worker-1 changes subnet to 10.200.1.0/24)
sudo mkdir -p /etc/cni/net.d
sudo tee /etc/cni/net.d/10-bridge.conf >/dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "ranges": [[{"subnet": "10.200.0.0/24"}]],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
EOF
sudo tee /etc/cni/net.d/99-loopback.conf >/dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "lo",
  "type": "loopback"
}
EOF
Field by field:
- type: bridge, bridge: cni0. Every pod on the node hangs off a virtual bridge named cni0. Pods on the same node talk to each other over this bridge at layer 2.
- isGateway: true. Assigns cni0 the .1 address of the range (10.200.0.1) and makes it the default gateway for pods. Anything a pod sends out goes through here first.
- ipMasq: false. Don't let the plugin masquerade on its own. This is native routing's key choice: we want pod-to-pod packets to keep their real source IP (the "no NAT" model from Article 13). Masquerading for traffic leaving the cluster we handle separately, under control, in Step 3.
- ipam: host-local, with ranges set to the node's range. The plugin manages sequential IP assignment within the range itself, keeping its ledger in /var/lib/cni/networks. routes: 0.0.0.0/0 writes into the pod a default route pointing back to cni0.
- The 99-loopback file sets up the lo interface inside each pod. The 99 number makes it load later; the loopback plugin only handles 127.0.0.1 inside the pod.
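Once pods exist, the host-local ledger mentioned above can be inspected directly. The following is a minimal sketch, not part of the original scripts: list_leases is a hypothetical helper, and it assumes the ledger for our network sits in /var/lib/cni/networks/bridge (the directory is named after the "name" field of the config) with one file per reserved IP whose first line is the container ID.

```shell
#!/bin/sh
# Sketch: print "IP container-id" for every lease host-local has recorded.
# list_leases is a hypothetical helper name; the default path assumes the
# CNI network is called "bridge", as in 10-bridge.conf.
list_leases() {
  dir="${1:-/var/lib/cni/networks/bridge}"
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    name="${f##*/}"
    # Lease files are named after the IP they reserve. Anything containing
    # other characters (for example "lock" or "last_reserved_ip.0") is
    # bookkeeping and is skipped.
    case "$name" in
      *[!0-9.]*) continue ;;
    esac
    # First line holds the container ID; strip a trailing carriage return.
    printf '%s %s\n' "$name" "$(head -n 1 "$f" | tr -d '\r')"
  done
}
```

On a worker, calling list_leases with no argument (as root if the directory is not world-readable) should show one line per running pod, which makes it easy to confirm that addresses are handed out from that node's /24 only.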
containerd watches this directory and reloads immediately. Check that it has read the config:
sudo crictl info | grep -i lastCNILoadStatus
"lastCNILoadStatus": "OK",
OK replaces the cni plugin not initialized message from earlier articles — the runtime now knows how to set up pod networking. Write the same thing for worker-1 with subnet 10.200.1.0/24.
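Since the two workers differ only in the third octet of the subnet, the config can also be generated from the worker's index instead of edited by hand. A minimal sketch, assuming the 10.200.N.0/24 convention used in this article; pod_cidr and render_bridge_conf are hypothetical helper names, not something the series defines.

```shell
#!/bin/sh
# Sketch: derive a worker's pod range from its index and render the
# bridge config for it. Helper names are illustrative assumptions.

# worker index N -> 10.200.N.0/24
pod_cidr() {
  echo "10.200.$1.0/24"
}

# Print the same JSON as 10-bridge.conf, with the subnet filled in.
render_bridge_conf() {
  subnet=$(pod_cidr "$1")
  cat <<EOF
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "ranges": [[{"subnet": "${subnet}"}]],
    "routes": [{"dst": "0.0.0.0/0"}]
  }
}
EOF
}

# Show what worker-1 would receive.
render_bridge_conf 1
```

On worker-1 the output could be piped into place with render_bridge_conf 1 | sudo tee /etc/cni/net.d/10-bridge.conf >/dev/null, which keeps the only per-node difference in a single argument.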
Step 2 — Routes in the VPC for the two pod ranges
The config above is enough for pods on the same node to talk to each other over cni0. But a packet from a pod on worker-0 (10.200.0.x) to a pod on worker-1 (10.200.1.x) goes to cni0, out the node's eth0, and on to the VPC router, and the router has no idea where 10.200.1.0/24 lives, because that is a pod range, not part of the subnet's range. We teach it with two static routes, each pointing one pod range at the instance that hosts it.
RTB=rtb-086c1b93e4ff0a50c # route table of the subnet
aws ec2 create-route --route-table-id $RTB \
--destination-cidr-block 10.200.0.0/24 --instance-id i-0f1ab7628507cb9cd # worker-0
aws ec2 create-route --route-table-id $RTB \
--destination-cidr-block 10.200.1.0/24 --instance-id i-0a33782c408f5bf09 # worker-1
{ "Return": true }
{ "Return": true }
Review the route table:
aws ec2 describe-route-tables --route-table-ids $RTB \
--query 'RouteTables[].Routes[].[DestinationCidrBlock,InstanceId,GatewayId]' --output text
10.200.0.0/24 i-0f1ab7628507cb9cd None
10.200.1.0/24 i-0a33782c408f5bf09 None
10.0.0.0/16 None local
0.0.0.0/0 None igw-0f956a0362900fb68
The first two lines are what we just added, and this is Article 13's "native routing" in concrete form: the underlying network (the VPC) routes the pod range itself, no tunnel needed. These routes work because in Article 3 we disabled source/destination check on both instances; otherwise AWS would block the node from forwarding packets carrying pod IPs (not the node's own IP).
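To see why the two extra routes matter, it can help to replay the router's decision by hand. The sketch below is a toy longest-prefix-match lookup over the route table shown above. It is an illustration only and makes no claim about how the VPC router is implemented; ip2int, in_cidr and lookup_route are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: longest-prefix-match over "CIDR TARGET" lines read from stdin.

# Dotted quad -> 32-bit integer.
ip2int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP CIDR: succeeds when IP falls inside CIDR.
in_cidr() {
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$1") & mask )) -eq $(( $(ip2int "${2%/*}") & mask )) ]
}

# lookup_route DEST: print the target of the most specific matching route.
lookup_route() {
  dest="$1"
  best_bits=-1
  best_target=unreachable
  while read -r cidr target; do
    [ -n "$cidr" ] || continue
    if in_cidr "$dest" "$cidr" && [ "${cidr#*/}" -gt "$best_bits" ]; then
      best_bits="${cidr#*/}"
      best_target="$target"
    fi
  done
  echo "$best_target"
}

# The table before and after Step 2 (targets shortened to readable names).
BASE_ROUTES='10.0.0.0/16 local
0.0.0.0/0 igw'
ROUTES="10.200.0.0/24 worker-0
10.200.1.0/24 worker-1
$BASE_ROUTES"

printf '%s\n' "$BASE_ROUTES" | lookup_route 10.200.1.2   # prints: igw
printf '%s\n' "$ROUTES" | lookup_route 10.200.1.2        # prints: worker-1
```

Before the pod routes exist, the only entry matching 10.200.1.2 is the default route, so the packet heads for the internet gateway and never reaches worker-1. With the /24 routes in place, the more specific entry wins.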
Step 3 — Masquerade for pod traffic leaving the cluster
Because we set ipMasq: false, pod-to-pod keeps the original IP. But when a pod wants to reach the Internet (or a service outside the cluster), the packet carries source IP 10.200.x.y, and that destination has no return path to the pod range. We need to SNAT packets leaving the pod network to the node IP, and only those, never touching pod-to-pod.
The condition states it concisely: source belongs to the node's pod range, destination does not belong to the overall pod range (10.200.0.0/16):
# on worker-0 (worker-1 changes -s to 10.200.1.0/24)
sudo iptables -t nat -A POSTROUTING \
-s 10.200.0.0/24 ! -d 10.200.0.0/16 -j MASQUERADE \
-m comment --comment "pod egress masq"
! -d 10.200.0.0/16 is the crux: every destination within the pod network is excluded from masquerade, so pod-to-pod (even across nodes) still sees each other's real IPs; only traffic going outside 10.200.0.0/16 gets SNAT'd to the node IP.
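The rule's two conditions can be tried against a few source/destination pairs without touching netfilter. This is a minimal sketch of the predicate only (source in the node's pod range, destination outside the cluster pod range), evaluated for worker-0; masq_decision and its helpers are hypothetical names, and the real decision is of course made by iptables.

```shell
#!/bin/sh
# Sketch: evaluate "-s NODE_CIDR ! -d CLUSTER_CIDR" for sample packets.

NODE_CIDR=10.200.0.0/24      # worker-0's pod range
CLUSTER_CIDR=10.200.0.0/16   # all pod ranges

# Dotted quad -> 32-bit integer.
ip2int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP CIDR: succeeds when IP falls inside CIDR.
in_cidr() {
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$1") & mask )) -eq $(( $(ip2int "${2%/*}") & mask )) ]
}

# masq_decision SRC DST: print what the rule would do with the packet.
masq_decision() {
  if in_cidr "$1" "$NODE_CIDR" && ! in_cidr "$2" "$CLUSTER_CIDR"; then
    echo MASQUERADE
  else
    echo keep
  fi
}

masq_decision 10.200.0.2 10.200.1.2   # pod to pod on another node: keep
masq_decision 10.200.0.2 8.8.8.8      # pod to Internet: MASQUERADE
masq_decision 10.200.0.2 10.0.1.21    # pod to another node's own IP: MASQUERADE
masq_decision 10.0.1.20 8.8.8.8       # the node's own traffic: keep
```

The third case is worth noticing: a destination in the VPC subnet but outside 10.200.0.0/16 also gets SNAT'd, because the exclusion covers only the pod network.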
Step 4 — Nodes flip to Ready
No restart needed. kubelet is still polling containerd about networking status periodically; the moment CNI loads OK, the node's Ready condition flips on its own:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP CONTAINER-RUNTIME
worker-0 Ready <none> 14m v1.36.1 10.0.1.20 containerd://2.3.1
worker-1 Ready <none> 13m v1.36.1 10.0.1.21 containerd://2.3.1
Ready, both of them. This is the milestone: the first time the cluster is ready to accept and run real pods. The gap that stretched back to Article 11 has been filled exactly where we predicted: it was CNI, not a broken kubelet.
Step 5 — Create real pods on two nodes
Create two busybox pods, pinning each to a node with nodeName to be sure they sit on different nodes — exactly the situation we want to test:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  nodeName: worker-0
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  nodeName: worker-1
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
kubectl wait --for=condition=Ready pod/pod-a pod/pod-b --timeout=90s
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
pod-a 1/1 Running 0 7s 10.200.0.2 worker-0
pod-b 1/1 Running 0 7s 10.200.1.2 worker-1
Both pods Running, and crucially the IPs: pod-a got 10.200.0.2 (from worker-0's range), pod-b got 10.200.1.2 (from worker-1's range). The host-local IPAM assigned each node's range correctly: .1 is cni0, so the first pod gets .2. From inside the pod, the routes match what we configured:
kubectl exec pod-a -- ip route
default via 10.200.0.1 dev eth0
10.200.0.0/24 dev eth0 scope link src 10.200.0.2
The default gateway is 10.200.0.1, which is worker-0's cni0. Everything the pod sends outside its node's range goes through here.
Step 6 — Ping across nodes
This is the test for all three networking articles. pod-a on worker-0 pings pod-b on worker-1:
kubectl exec pod-a -- ping -c 3 10.200.1.2
PING 10.200.1.2 (10.200.1.2): 56 data bytes
64 bytes from 10.200.1.2: seq=0 ttl=62 time=1.369 ms
64 bytes from 10.200.1.2: seq=1 ttl=62 time=0.347 ms
64 bytes from 10.200.1.2: seq=2 ttl=62 time=0.410 ms
--- 10.200.1.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
It gets through with no packet loss. ttl=62 (down 2 from the Linux default of 64) means the reply crossed two routing hops, which is consistent with each worker's kernel forwarding it once between cni0 and eth0. The packets travel unencapsulated over the VPC routes from Step 2; there is no tunnel on the path. The full path:
pod-a 10.200.0.2 pod-b 10.200.1.2
│ veth veth │
┌───┴── cni0 10.200.0.1 cni0 10.200.1.1 ──┴───┐
│ worker-0 eth0 10.0.1.20 worker-1 eth0 10.0.1.21 │
└────────┬──────────────────────────────────┬───────────────┘
└──────► VPC route table ◄──────────┘
10.200.1.0/24 → worker-1
10.200.0.0/24 → worker-0
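The TTL arithmetic from the ping output can be reproduced mechanically. A small sketch, assuming the sender starts at a TTL of 64 (the usual Linux default, and an assumption here rather than something the output states); ttl_of and hops_from_ttl are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: extract the TTL from a ping reply line and count routing hops.

# Read ping output on stdin, print the first ttl= value found.
ttl_of() {
  sed -n 's/.*ttl=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# hops_from_ttl OBSERVED [INITIAL]: hops = initial TTL minus observed TTL.
hops_from_ttl() {
  echo $(( ${2:-64} - $1 ))
}

line='64 bytes from 10.200.1.2: seq=0 ttl=62 time=1.369 ms'
ttl=$(printf '%s\n' "$line" | ttl_of)
echo "ttl=$ttl hops=$(hops_from_ttl "$ttl")"   # prints: ttl=62 hops=2
```

The same helpers can be fed live output, for example kubectl exec pod-a -- ping -c 1 10.200.1.2 | ttl_of, to compare same-node and cross-node paths.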
Three more checks to round it out. The pod sees its own real IP (not hidden behind NAT):
kubectl exec pod-a -- ip -4 addr show eth0 | grep inet
inet 10.200.0.2/24 brd 10.200.0.255 scope global eth0
From inside the pod, open TCP to the ClusterIP of the kubernetes Service; kube-proxy (Article 12) does its job even when the source is a real pod:
kubectl exec pod-b -- nc -w 3 -zv 10.32.0.1 443
10.32.0.1 (10.32.0.1:443) open
And the pod reaches the Internet, thanks to the masquerade rule in Step 3:
kubectl exec pod-b -- ping -c 2 8.8.8.8
2 packets transmitted, 2 packets received, 0% packet loss
The four kinds of communication from Article 13 now all run for real: containers within a pod (localhost), pod-to-pod on the same node (cni0), pod-to-pod across nodes (VPC routes), and pod-to-Service (kube-proxy).
🧹 Cleanup
Delete the two test pods; they were only for verification:
kubectl delete pod pod-a pod-b
The CNI config and VPC routes are permanent parts of the cluster, so leave them in place. A note on durability: the VPC routes survive a stop/start, but the iptables masquerade rule from Step 3 does not, because it lives in kernel memory and is lost whenever the instance reboots or is stopped and started. If you pause the cluster and bring it back, re-run the iptables ... MASQUERADE command on each worker (or put it in a oneshot systemd unit if you want it to truly persist). The kube-proxy and bridge rules rebuild themselves when their services start, so there is nothing to redo there.
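Because iptables -A appends a new copy every time it runs, a script that re-applies the rule after each boot is safer if it first checks whether the rule is already there. Below is a minimal sketch of that check-then-append pattern using iptables -C. ensure_pod_masq and the IPT variable are hypothetical names introduced here, and the function is meant to run as root.

```shell
#!/bin/sh
# Sketch: add the pod egress masquerade rule only if it is missing.
# IPT names the iptables command; it is overridable so the logic can be
# exercised without touching the real firewall (an assumption of this sketch).
IPT="${IPT:-iptables}"

# ensure_pod_masq NODE_CIDR [CLUSTER_CIDR]
ensure_pod_masq() {
  node_cidr="$1"
  cluster_cidr="${2:-10.200.0.0/16}"
  # Same rule specification as in Step 3, kept in the positional parameters.
  set -- POSTROUTING -s "$node_cidr" ! -d "$cluster_cidr" \
         -m comment --comment "pod egress masq" -j MASQUERADE
  # -C exits 0 when an identical rule already exists in the chain.
  if "$IPT" -t nat -C "$@" 2>/dev/null; then
    echo "already present"
  else
    "$IPT" -t nat -A "$@" && echo "added"
  fi
}
```

On worker-0 this would be called as ensure_pod_masq 10.200.0.0/24 (worker-1 passes 10.200.1.0/24). Running it twice leaves a single rule in POSTROUTING, which is the property a boot-time unit needs.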
The full scripts (CNI config for both nodes, route commands, masquerade) are at github.com/nghiadaulau/kubernetes-from-scratch, directory 14-pod-network.
Wrap-up
Pod networking is wired end to end, built entirely by hand and from pieces we understand: cni0 for same-node pods, VPC routes for crossing nodes, selective masquerade for traffic leaving. No overlay, no extra headers: pod-to-pod packets travel intact because the cluster fits in one subnet and we control the underlying network's routing. Both nodes Ready, pods running for real and reachable in every direction.
One thing is still missing for the cluster to look usable: pods calling each other by name, not IP. A pod's IP changes every time the pod is recreated, so nobody hard-codes it. Article 15 deploys CoreDNS as a Deployment in the cluster, creates a Service for it at exactly the 10.32.0.10 that kubelet has been pointing pods at (Article 11), and watches a pod resolve a Service name into a ClusterIP.