73-part series

Kubernetes From Scratch

Build a complete Kubernetes cluster by hand — no kubeadm, no scripts — from the first certificate to a real HA cluster, then use that cluster as a lab to deep-dive every Kubernetes concept. Part one: PKI/TLS, etcd, the control plane, workers, pod networking, CoreDNS. Part two: Pods, workload controllers, scheduling, storage, advanced networking (Cilium eBPF), security, extending the API, operations. Each component is both explained from the inside and stood up/configured by hand. Tested for real on AWS EC2 with Kubernetes v1.36; manifests/scripts at github.com/nghiadaulau/kubernetes-from-scratch. Grounded in the official docs at kubernetes.io.

37

Topology spread, pod overhead, and scheduling readiness

Anti-affinity is rigid: one pod per node, anything extra hangs. Topology spread is softer: it keeps the pod count difference between topology domains within maxSkew while still allowing several pods per node. This article digs into three finer scheduling mechanisms: topologySpreadConstraints (flexible spreading), pod overhead (extra resource accounting for the sandbox runtime), and schedulingGates (hold a pod back from scheduling). All three are tested for real on the cluster.

Kai · 6 min read · DevOps, Kubernetes
38

Priority and preemption

A node is full and an important pod was just created. Does it hang Pending behind the junk pods, or does it get to kick out a less important pod to grab the spot? PriorityClass assigns a priority level; preemption lets a high-priority pod evict low-priority pods when needed. This article fills the cluster with low-priority pods, then drops in a high-priority one — watch it kick out the victim and take the spot, exactly the PostFilter step Article 34 called 'not helpful'.

Kai · 5 min read · DevOps, Kubernetes
39

Node-pressure eviction

The last three articles were about placing pods. This one is about evicting them, but not by preemption (the scheduler, for priority) or the OOM kill (the kernel, for exceeding a limit). This is the kubelet proactively killing pods when a node truly runs out of RAM or disk, by its own thresholds and ranking. This article creates real memory pressure on a worker by hand, then watches the kubelet evict exactly the hungriest pod, with an eviction message that says exactly why.

Kai · 6 min read · DevOps, Kubernetes
40

Metrics Server and HorizontalPodAutoscaler

Part VIII changes direction: instead of killing pods under load, we add pods. But to autoscale by CPU, the cluster has to know how much CPU each pod uses, and our hand-built cluster has nobody measuring it yet. This article installs the first add-on, Metrics Server, hits the classic Kubernetes the Hard Way (KTHW) trap (the control plane can't talk to a pod), fixes it, then stands up an HPA and burns real CPU to watch it multiply pods from 1 to 4.

Kai · 6 min read · DevOps, Autoscaling
41

Vertical Pod Autoscaler and resource managers

HPA adds pods as load rises. VPA does the opposite: keep the pod count fixed but dial in the exact CPU/RAM each pod needs — no more setting requests at random and then wasting or starving. This article installs VPA (an add-on, like Metrics Server), lets it observe a real workload and produce a recommendation, then crosses to the node side: CPU Manager static policy pins whole CPU cores to a Guaranteed pod — tested for real, watching a pod get exactly one exclusive CPU.

Kai · 6 min read · DevOps, Autoscaling
42

Volumes: ephemeral, hostPath, and projected

Files in a container vanish on restart, and two containers in one pod don't see each other's files. Volumes solve both. This article opens Part IX (storage) with volumes attached straight to a pod: emptyDir (a scratch area shared within the pod), hostPath (borrow a node directory), and projected (combine configMap/secret/downwardAPI/token into one place) — each tested for real, making clear which lives with the container, the pod, or the node.

Kai · 6 min read · DevOps, Storage
43

PersistentVolume and PersistentVolumeClaim

The volumes in Article 41 die with the pod. To make data outlive the pod, Kubernetes splits it in two: PersistentVolume is the real storage (admin creates), PersistentVolumeClaim is the storage request (user creates) — and a control loop binds them. This article traces who-creates-what, who-binds-what: admin builds a PV, user requests a PVC, the controller binds both ways, a pod uses the claim, delete the pod and data survives, delete the claim and the PV goes Released.

Kai · 6 min read · DevOps, Storage
44

StorageClass, dynamic provisioning, and EBS CSI

In Article 42 the admin had to create the PV by hand first. Nobody does that at real scale. StorageClass + CSI driver flip it around: the user creates only a PVC, the system spawns the PV — and even calls AWS to create a real EBS volume. This article installs the real EBS CSI driver (with IAM for the nodes), traces every link of who-calls-who from PVC to the moment an EBS volume is born, then deletes the PVC and watches the volume disappear.

Kai · 6 min read · DevOps, Storage
45

VolumeSnapshot and CSI snapshot

We have persistent volumes now, so how do we back them up? VolumeSnapshot takes a point-in-time snapshot of a PVC's contents, and with EBS CSI it creates a real EBS snapshot on AWS. This article closes Part IX: install the snapshot controller, snapshot a PVC, restore a new PVC from that snapshot, with a hard-won lesson on why the first restore came out as an empty file and why you must sync before snapshotting.

Kai · 5 min read · DevOps, Storage
46

Cilium and eBPF: why replace kube-proxy

In Part I we built pod networking with kube-proxy iptables and a hand-rolled bridge: enough to run, but the iptables rule set grows linearly with the number of Services. Part X upgrades: replace both kube-proxy and the bridge with eBPF-based Cilium. This article is the theory (what eBPF is, why it's faster than iptables, what Cilium does differently at the datapath), looking straight at the 74 iptables rules currently running to see what we're about to drop.

Kai · 5 min read · DevOps, Networking
47

Migrating to kube-proxy-less Cilium

Theory's done, now for real: replace Part I's kube-proxy + bridge with eBPF-based Cilium 1.19, remove kube-proxy entirely, enable Hubble. This article traces each migration step on a live cluster — install Cilium, disable kube-proxy, confirm Services still work with not a single kube-proxy iptables rule left — plus four real traps a self-built cluster hits (providerID, topology labels, IMDS hop limit, hostNetwork) and how to clear them.

Kai · 6 min read · DevOps, Networking
48

NetworkPolicy: a firewall by label

By default every pod in the cluster talks freely with every other — flat and open. This article uses NetworkPolicy to lock it down: deny all ingress to a pod, then allow only the right labels through, tested for real on a Cilium cluster. And because the cluster runs eBPF, we get to watch Hubble print DROPPED/FORWARDED verdicts per packet, with Cilium identity proving policy attaches to labels, not IPs.

Kai · 8 min read · DevOps, Security