LB IPAM and Traffic Policy

Kai · 7 min read

Article 48 exposed Ingress via NodePort, Article 49 exposed the Gateway via NodePort as well, and both left the same loose end: the LoadBalancer-type Service hangs at EXTERNAL-IP <pending>, and the Gateway reports Programmed=False / AddressNotAssigned. Article 49 covered the reason: the self-built EC2 cluster has no cloud controller to hand out an address to a LoadBalancer Service. This article uses Cilium's LB IPAM to assign one, then looks at how externalTrafficPolicy changes the way inbound traffic reaches the pod.

A LoadBalancer Service stuck pending

Stand up a LoadBalancer-type Service to see that state again:

kubectl create namespace lb-demo
kubectl -n lb-demo create deployment web --image=hashicorp/http-echo:1.0 -- /http-echo -text=hello-lb -listen=:8080
kubectl -n lb-demo expose deployment web --type=LoadBalancer --port=80 --target-port=8080
kubectl -n lb-demo get svc web
NAME   TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
web    LoadBalancer   10.32.0.76   <pending>     80:32303/TCP   1s

The Service asks for an external address but nothing assigns one, so it waits forever. NodePort 32303 is still allocated automatically (by default, every LoadBalancer Service also gets a NodePort), which is why Articles 48–49 could still test via NodePort while EXTERNAL-IP stayed pending.

CiliumLoadBalancerIPPool assigns addresses

LB IPAM is always on but sits idle until the first pool exists. A CiliumLoadBalancerIPPool declares an IP range for Cilium to draw from and assign to LoadBalancer Services. Declare the range with cidr, or with start/stop:

apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: demo-pool
spec:
  blocks:
  - start: "192.0.2.10"
    stop: "192.0.2.50"

192.0.2.0/24 is a documentation range (TEST-NET-1, RFC 5737); all we need here is a range that doesn't collide with real IPs in the VPC. Apply the pool, and Service web gets an address right away:

kubectl apply -f demo-pool.yaml
kubectl get ciliumloadbalancerippool
kubectl -n lb-demo get svc web
kubectl -n lb-demo get svc web -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
NAME        DISABLED   CONFLICTING   IPS AVAILABLE   AGE
demo-pool   false      False         40              6s

NAME   TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
web    LoadBalancer   10.32.0.159   192.0.2.10    80:30648/TCP   6s

cilium.io/IPAMRequestSatisfied=True

The pool reports 40 IPs available (the range 192.0.2.10–192.0.2.50 holds 41 addresses, and one was just assigned to web), the Service flips from <pending> to 192.0.2.10, and the condition cilium.io/IPAMRequestSatisfied is True. The pool declares no serviceSelector, so it serves every LoadBalancer Service; adding serviceSelector.matchLabels restricts the pool to Services carrying the matching labels, which is useful when you want to split IP ranges by environment.
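The arithmetic and the selector rule above can be checked with a small toy model. This is only a sketch, not Cilium code: the ToyPool class, its method names, and the env=prod label are made up for illustration, and handing out the lowest free address is an assumption that happens to match the .10 then .11 order observed in this article.

```python
import ipaddress
from typing import Dict, Optional, Set


class ToyPool:
    """Toy model of a pool declared with start/stop (hypothetical, not Cilium code)."""

    def __init__(self, start: str, stop: str,
                 match_labels: Optional[Dict[str, str]] = None) -> None:
        self.start = ipaddress.IPv4Address(start)
        self.stop = ipaddress.IPv4Address(stop)
        self.match_labels = match_labels or {}
        self.used: Set[ipaddress.IPv4Address] = set()

    def size(self) -> int:
        # start and stop are both inclusive, hence the +1
        return int(self.stop) - int(self.start) + 1

    def available(self) -> int:
        return self.size() - len(self.used)

    def selects(self, service_labels: Dict[str, str]) -> bool:
        # An empty selector matches every Service; otherwise every
        # key/value pair in match_labels must be present on the Service.
        return all(service_labels.get(k) == v
                   for k, v in self.match_labels.items())

    def allocate(self, service_labels: Dict[str, str]) -> Optional[str]:
        if not self.selects(service_labels):
            return None  # this pool does not serve the Service
        # Assumption: hand out the lowest free address in the range.
        for n in range(int(self.start), int(self.stop) + 1):
            ip = ipaddress.IPv4Address(n)
            if ip not in self.used:
                self.used.add(ip)
                return str(ip)
        return None  # pool exhausted


pool = ToyPool("192.0.2.10", "192.0.2.50")
print(pool.size())                    # 41
print(pool.allocate({"app": "web"}))  # 192.0.2.10
print(pool.available())               # 40
```

A pool built with match_labels={"env": "prod"} would return None for a Service labeled env=dev, which is the toy equivalent of that Service staying <pending>.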

Article 49's Gateway flips to Programmed

The same mechanism resolves the Gateway's hang. Recreate a Gateway as in Article 49, this time with the pool in place:

kubectl -n lb-demo get gateway gw
NAME   CLASS    ADDRESS      PROGRAMMED   AGE
gw     cilium   192.0.2.11   True         7s

ADDRESS 192.0.2.11, PROGRAMMED True. The Gateway creates a LoadBalancer Service behind it (cilium-gateway-gw), that Service gets an IP from LB IPAM, and Article 49's AddressNotAssigned condition disappears. The same LB IPAM serves both regular Services and Gateways.

Assigning an IP is not advertising an IP

Now that there's an IP, try calling it directly from a node inside the cluster:

# from worker-1
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://192.0.2.10/
HTTP 200

It works, but it matters exactly why. Cilium in kube-proxy-less mode (Article 46) loads the LoadBalancer VIP into the eBPF datapath on every node; when a node inside the cluster sends a packet to 192.0.2.10, the eBPF program on that same node catches it and DNATs it to the backend before the packet can leave. Look at the eBPF service table:

kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg service list | grep 192.0.2
35   192.0.2.10:80/TCP   LoadBalancer   1 => 10.200.0.90:8080/TCP (active)
39   192.0.2.11:80/TCP   LoadBalancer

So the IP is usable from inside the cluster. From a client outside the cluster it is not yet reachable, and this is the line the docs stress: LB IPAM only assigns the address, it doesn't advertise it. A router outside doesn't know where 192.0.2.10 lives until something announces it. There are two options: L2 Announcement (CiliumL2AnnouncementPolicy, for internal L2 networks, where nodes answer ARP for the VIP) or BGP (Cilium peers with a router and advertises the route).

On AWS, the VPC network is software-defined, so ARP-based L2 announcement doesn't behave the way it does on-prem; the practical approach is BGP or using the AWS Load Balancer Controller directly. That's also why EKS doesn't use LB IPAM and instead attaches to an AWS NLB/ALB. This self-built cluster stops at assigning the IP and confirming that the eBPF datapath works; advertising it externally depends on the specific network infrastructure.
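To make the assign-versus-advertise gap concrete, here is a minimal sketch of what an outside router sees. It models only the BGP-style case as a route table lookup; L2 Announcement achieves a comparable effect at the ARP layer and is not modeled. The 10.0.1.0/24 node subnet and the next_hop helper are assumptions for illustration; 10.0.1.20 is worker-0's address from the tests later in this article.

```python
import ipaddress
from typing import Dict, Optional


def next_hop(routes: Dict[str, str], destination: str) -> Optional[str]:
    """Longest-prefix match over a {cidr: next_hop} table; None if no route."""
    dest = ipaddress.ip_address(destination)
    best = None
    for cidr, hop in routes.items():
        net = ipaddress.ip_network(cidr)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


# Hypothetical external router: it only knows the node subnet (assumed /24).
router = {"10.0.1.0/24": "directly-connected"}

# LB IPAM has assigned 192.0.2.10, but nothing has told the router about it.
before = next_hop(router, "192.0.2.10")

# An advertisement is what adds this entry: the VIP becomes reachable via a node.
router["192.0.2.10/32"] = "10.0.1.20"
after = next_hop(router, "192.0.2.10")

print(before, after)
```

Before the advertisement the lookup finds no route, so the packet never reaches any node's eBPF datapath; afterwards it is sent to a node, where the DNAT described above takes over.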

externalTrafficPolicy: Cluster or Local

A Service's externalTrafficPolicy field decides how inbound traffic (via NodePort or LoadBalancer) is handled. To see the difference, pin an echo pod on worker-0 (agnhost, whose /clientip endpoint prints the source address it observes) and expose it with a NodePort Service on port 31888. The client is node lb-0 (10.0.1.10), which has no pods on it.

With externalTrafficPolicy: Cluster (the default), call the NodePort of worker-1 — which has no echo pod:

# from lb-0 (10.0.1.10), call worker-1's NodePort
curl -s http://10.0.1.21:31888/clientip
10.0.1.21:39672

The pod sees the source as 10.0.1.21, which is worker-1's IP, not the real client IP 10.0.1.10. worker-1 receives the packet, finds no local pod, forwards it to worker-0, and SNATs the source address to its own IP. The real client is masked. Switch to Local:

kubectl -n lb-demo patch svc echo -p '{"spec":{"externalTrafficPolicy":"Local"}}'

Then call again from lb-0, to two different nodes:

curl -s -m5 -o /dev/null -w "HTTP %{http_code}\n" http://10.0.1.21:31888/clientip   # worker-1: no pod
curl -s http://10.0.1.20:31888/clientip                                            # worker-0: has pod
HTTP 000          # worker-1 has no local endpoint -> drop
10.0.1.10:53068   # worker-0 has a pod -> keeps the real client IP intact

Local only forwards to a pod on the same node that received the packet: a call to worker-1 (no endpoint) is dropped, while a call to worker-0 (has a pod) lets the pod see the correct client, 10.0.1.10. The tradeoff is clear: Cluster spreads across every node but adds a hop and masks the source IP; Local keeps the source IP and drops the extra hop but requires the external load balancer to know which nodes have an endpoint. When you need to log the user's real IP or apply policy by source IP, choose Local.
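The three results measured above can be summarized as a small decision function. This is a sketch under stated assumptions, not how the datapath is implemented: the deliver function and Outcome type are hypothetical, and the model simplifies Cluster mode by always using a local backend when one exists and applying SNAT only on the cross-node path (the case this article measured).

```python
from typing import NamedTuple, Optional, Set

# Node addresses taken from the tests in this article.
NODE_IPS = {"worker-0": "10.0.1.20", "worker-1": "10.0.1.21"}


class Outcome(NamedTuple):
    delivered: bool
    source_seen: Optional[str]  # source IP the pod observes, None if dropped


def deliver(policy: str, receiving_node: str, client_ip: str,
            endpoint_nodes: Set[str]) -> Outcome:
    """Toy model of externalTrafficPolicy for one inbound NodePort packet."""
    if policy not in ("Cluster", "Local"):
        raise ValueError(f"unknown policy: {policy}")
    if receiving_node in endpoint_nodes:
        # Simplification: a backend on the receiving node is used directly,
        # so there is no extra hop and the client IP is preserved.
        return Outcome(True, client_ip)
    if policy == "Local" or not endpoint_nodes:
        # Local never forwards to another node; no endpoints means no target.
        return Outcome(False, None)
    # Cluster: forward to a node that has an endpoint and SNAT the source
    # to the receiving node's own IP.
    return Outcome(True, NODE_IPS[receiving_node])


echo_nodes = {"worker-0"}  # the echo pod is pinned to worker-0
print(deliver("Cluster", "worker-1", "10.0.1.10", echo_nodes))
print(deliver("Local", "worker-1", "10.0.1.10", echo_nodes))
print(deliver("Local", "worker-0", "10.0.1.10", echo_nodes))
```

The three calls correspond, in order, to the SNATed 10.0.1.21 result, the HTTP 000 drop, and the preserved 10.0.1.10 result.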

A close relative is trafficDistribution: PreferClose on a Service — a hint to prefer sending to endpoints in the same zone to reduce cross-zone traffic (topology-aware routing). As for dual-stack (IPv4 + IPv6 side by side), this cluster doesn't enable it because it was built pure IPv4 from the start (pod CIDR 10.200.0.0/16); enabling dual-stack requires declaring an IPv6 CIDR from the control plane build stage.
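As a rough illustration of the PreferClose hint, the selection rule can be sketched as "same-zone endpoints if any exist, otherwise all of them". The prefer_close helper, the zone names, and the second endpoint address are hypothetical; whether and how a given datapath honors the hint is outside this sketch.

```python
from typing import Dict, List


def prefer_close(endpoints: Dict[str, str], client_zone: str) -> List[str]:
    """Return candidate endpoint IPs: same-zone ones if any exist, else all."""
    same_zone = sorted(ip for ip, zone in endpoints.items()
                       if zone == client_zone)
    # Fall back to every endpoint so traffic is never dropped for lack of
    # a same-zone backend.
    return same_zone or sorted(endpoints)


# Hypothetical endpoint-to-zone map (zone names and 10.200.1.15 are made up).
endpoints = {"10.200.0.90": "zone-a", "10.200.1.15": "zone-b"}

print(prefer_close(endpoints, "zone-a"))  # only the zone-a endpoint
print(prefer_close(endpoints, "zone-c"))  # no local endpoint: all candidates
```

Unlike externalTrafficPolicy: Local, this is a preference rather than a hard rule: a client in a zone with no endpoint still gets served.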

🧹 Cleanup

kubectl delete namespace lb-demo                          # web, echo, gateway gw
kubectl delete ciliumloadbalancerippool demo-pool

Delete the pool too, since no LoadBalancer Service needs an address anymore. Cilium keeps its config; LB IPAM goes back to idle until the next pool. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, folder 50-lb-ipam.

Wrap-up

LB IPAM fills the <pending> hang that Articles 48–49 left behind: a CiliumLoadBalancerIPPool declares an IP range (cidr or start/stop), Cilium assigns it to LoadBalancer Services (condition IPAMRequestSatisfied=True) and also to the Service behind a Gateway, which flips Article 49's Gateway to Programmed=True with a real address. The key boundary: LB IPAM assigns IPs, it doesn't advertise them — the IP is usable from inside the cluster because eBPF loads the VIP on every node, but for an outside client to reach it you need L2 Announcement (internal L2 network) or BGP, and on AWS the most practical options are BGP or the AWS Load Balancer Controller. externalTrafficPolicy decides the source address: Cluster (default) spreads across every node but SNATs over the client IP; Local only serves an endpoint on the same node, keeps the real client IP, at the cost of dropping on nodes with no endpoint. That closes Part X — from the flat network model of Articles 13–14, through Cilium eBPF, NetworkPolicy, Ingress, Gateway API, to assigning and routing inbound addresses.

Article 51 opens Part XI on security, starting from the path into the API server: which authentication stages a kubectl request passes through before it reaches data (certificate, token, service account), and where our self-built cluster configured each one.