Kubernetes From Scratch
Build a complete Kubernetes cluster by hand, with no kubeadm and no scripts, from the first certificate to a real HA cluster, then use that cluster as a lab to deep-dive every Kubernetes concept. The first half covers PKI/TLS, etcd, the control plane, workers, pod networking, and CoreDNS. The second half covers Pods, workload controllers, scheduling, storage, advanced networking (Cilium eBPF), security, extending the API, and operations. Each component is explained from the inside and then stood up and configured by hand. Everything was tested on AWS EC2 with Kubernetes v1.36; manifests and scripts are at github.com/nghiadaulau/kubernetes-from-scratch. Grounded in the official docs at kubernetes.io.
Topology spread, pod overhead, and scheduling readiness
Anti-affinity is rigid: with one pod per node required, any extra replica stays Pending. Topology spread is softer: it keeps the difference in pod count between topology domains (nodes, zones) within maxSkew, while still allowing several pods per node. This article digs into three finer scheduling mechanisms: topologySpreadConstraints (flexible spreading), pod overhead (charging the resources the pod sandbox itself consumes, on top of the containers' requests), and schedulingGates (holding a pod back from scheduling until every gate is removed). All three are tested on the live cluster.
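A minimal sketch of two of these calculations, assuming simplified inputs: a dict of matching-pod counts per topology domain, and CPU values in millicores. The spread check follows the documented rule that placing the pod must not push a domain more than maxSkew above the global minimum (whenUnsatisfiable: DoNotSchedule). The overhead function adds a RuntimeClass overhead to the pod's effective request and ignores sidecar-style init containers. Both function names are hypothetical, not scheduler code.

```python
def feasible_domains(counts: dict[str, int], max_skew: int) -> list[str]:
    """Domains where one more matching pod keeps skew <= max_skew.

    counts maps a topology domain (node, zone) to the number of pods
    already matching the constraint's labelSelector. Assumes the incoming
    pod matches its own selector, so it adds 1 to the chosen domain.
    """
    global_min = min(counts.values())
    return [d for d, n in counts.items() if n + 1 - global_min <= max_skew]


def effective_cpu_request_m(containers_m: list[int], init_containers_m: list[int],
                            overhead_m: int = 0) -> int:
    """Scheduler-side CPU request for a pod, in millicores.

    The effective request is the larger of the sum of the app containers
    and the largest init container; pod overhead is added on top.
    """
    base = max(sum(containers_m), max(init_containers_m, default=0))
    return base + overhead_m


# Three nodes with one replica each: a 4th replica still fits anywhere
# (skew 2 - 1 = 1), whereas one-per-node anti-affinity would leave it Pending.
print(feasible_domains({"node-1": 1, "node-2": 1, "node-3": 1}, max_skew=1))
# A pod with two 250m containers under a RuntimeClass with 120m overhead.
print(effective_cpu_request_m([250, 250], [100], overhead_m=120))  # → 620
```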
Priority and preemption
A node is full and an important pod has just been created. Does it sit Pending behind less important pods, or does it get to evict one of them and take the spot? PriorityClass assigns a priority level; preemption lets a high-priority pod evict lower-priority pods when no node has room. This article fills the cluster with low-priority pods, then drops in a high-priority one and watches it evict a victim and take the spot: exactly the PostFilter step Article 34 called 'not helpful'.
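The core of victim selection on one node can be sketched in the spirit of the default scheduler's preemption logic: tentatively remove every lower-priority pod, check whether the incoming pod fits, then reprieve as many of those pods as possible, highest priority first. This is a deliberately simplified model with hypothetical names: one resource (CPU in millicores), one node, and no PodDisruptionBudgets or node-ranking step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    priority: int
    cpu_m: int  # CPU request in millicores


def select_victims(capacity_m: int, running: list[Pod], incoming: Pod) -> Optional[list[Pod]]:
    """Return the pods to evict so `incoming` fits, or None if preemption cannot help."""
    lower = [p for p in running if p.priority < incoming.priority]
    used = sum(p.cpu_m for p in running if p.priority >= incoming.priority)
    # Even with every lower-priority pod gone, the incoming pod does not fit.
    if used + incoming.cpu_m > capacity_m:
        return None
    victims = []
    # Reprieve pods in descending priority order; evict only those that
    # would still block the incoming pod.
    for p in sorted(lower, key=lambda p: p.priority, reverse=True):
        if used + p.cpu_m + incoming.cpu_m <= capacity_m:
            used += p.cpu_m  # this pod keeps running
        else:
            victims.append(p)
    return victims


running = [Pod("low-a", 100, 800), Pod("low-b", 100, 800), Pod("mid", 500, 300)]
victims = select_victims(2000, running, Pod("critical", 1000, 900))
print([v.name for v in victims])  # → ['low-b']
```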
Node-pressure eviction
The last three articles were about placing pods. This one is about evicting them, but not through preemption (the scheduler, driven by priority) or the OOM kill (the kernel, for exceeding a memory limit). This is the kubelet proactively evicting pods when a node actually runs short of memory or disk, according to its own eviction thresholds and ranking. This article creates real memory pressure on a worker by hand, then watches the kubelet pick the right victim by that ranking, with an eviction message that says exactly why.
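The kubelet's documented ranking for memory-pressure eviction uses three keys in order: whether a pod's usage exceeds its request, the pod's priority, and its usage relative to its request. A minimal sketch of that ordering, with hypothetical field names and memory in MiB (a BestEffort pod is modeled as a request of 0):

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    request_mi: int  # memory request; 0 for BestEffort
    usage_mi: int    # current working-set usage


def eviction_order(pods: list[PodUsage]) -> list[str]:
    """First to evict first: over-request pods, then lower priority, then larger overage."""
    ranked = sorted(
        pods,
        key=lambda p: (
            p.usage_mi <= p.request_mi,    # False (over request) sorts first
            p.priority,                    # lower priority evicted earlier
            -(p.usage_mi - p.request_mi),  # biggest overage evicted earlier
        ),
    )
    return [p.name for p in ranked]


pods = [
    PodUsage("besteffort", 0, 0, 500),
    PodUsage("burstable-hog", 0, 256, 1000),
    PodUsage("under-request", 0, 512, 400),
    PodUsage("important", 1000, 100, 300),
]
print(eviction_order(pods))
# → ['burstable-hog', 'besteffort', 'important', 'under-request']
```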
Metrics Server and HorizontalPodAutoscaler
Part VIII changes direction: instead of killing pods under load, we add pods. But to autoscale on CPU, the cluster has to know how much CPU each pod uses, and our hand-built cluster has nothing measuring it yet. This article installs the first add-on, Metrics Server, hits the classic Kubernetes The Hard Way (KTHW) trap where the control plane cannot reach a pod, fixes it, then sets up an HPA and burns real CPU to watch it scale the workload from 1 to 4 pods.
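The HPA's core scaling rule, as documented on kubernetes.io, is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). Scaling is skipped when the ratio is within a tolerance (0.1 by default), and the result is clamped to minReplicas/maxReplicas. A minimal sketch that ignores stabilization windows, scaling policies, and not-yet-ready pods; the numbers in the usage lines are illustrative, not measured:

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula asks for.

    Utilization values are percentages of the pods' CPU requests,
    which is why both Metrics Server data and a CPU request are needed.
    """
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no change
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


print(desired_replicas(1, 200, 50))  # → 4
print(desired_replicas(4, 52, 50))   # → 4 (within tolerance)
```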
Vertical Pod Autoscaler and resource managers
HPA adds pods as load rises. VPA works on the other axis: it keeps the pod count fixed but dials in the CPU/RAM each pod actually needs, instead of guessing requests and then either wasting resources or starving the pod. This article installs VPA (an add-on, like Metrics Server), lets it observe a real workload and produce a recommendation, then moves to the node side: the CPU Manager static policy pins whole CPU cores to a container in a Guaranteed pod with an integer CPU request. Tested for real, watching a pod get exactly one exclusive CPU.
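Whether the static policy hands a container exclusive cores depends on two documented conditions: the pod's QoS class must be Guaranteed, and the container's CPU request must be a whole number of CPUs. A minimal sketch of both checks, assuming a hypothetical dict layout for containers, CPU in millicores and memory in MiB; it ignores init containers, the kubelet's reserved CPUs, and the actual core-selection logic:

```python
def qos_class(containers: list[dict]) -> str:
    """Return Guaranteed, Burstable, or BestEffort for a pod's app containers.

    Each container is {"requests": {...}, "limits": {...}}; a missing
    request defaults to the limit, as the API server does.
    """
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        for res in ("cpu", "memory"):
            if res not in lim or req.get(res, lim[res]) != lim[res]:
                return "Burstable"
    return "Guaranteed"


def exclusive_cpus(containers: list[dict]) -> list[int]:
    """Exclusive cores each container would get under the static policy."""
    if qos_class(containers) != "Guaranteed":
        return [0] * len(containers)
    return [c["limits"]["cpu"] // 1000 if c["limits"]["cpu"] % 1000 == 0 else 0
            for c in containers]


one_core = [{"requests": {"cpu": 1000, "memory": 256}, "limits": {"cpu": 1000, "memory": 256}}]
half_core = [{"limits": {"cpu": 500, "memory": 256}}]
print(qos_class(one_core), exclusive_cpus(one_core))    # → Guaranteed [1]
print(qos_class(half_core), exclusive_cpus(half_core))  # → Guaranteed [0]
```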
Volumes: ephemeral, hostPath, and projected
Files in a container vanish on restart, and two containers in one pod don't see each other's files. Volumes solve both. This article opens Part IX (storage) with volumes attached straight to a pod: emptyDir (a scratch area shared within the pod), hostPath (borrow a node directory), and projected (combine configMap/secret/downwardAPI/token into one place) — each tested for real, making clear which lives with the container, the pod, or the node.
PersistentVolume and PersistentVolumeClaim
The volumes in Article 41 are tied to a pod or a node. To make data outlive the pod, Kubernetes splits storage in two: a PersistentVolume is the actual storage (created by the admin), a PersistentVolumeClaim is a request for storage (created by the user), and a control loop binds them together. This article traces who creates what and who binds what: the admin creates a PV, the user requests a PVC, the controller binds them to each other, and a pod uses the claim; delete the pod and the data survives, delete the claim and the PV goes Released.
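The binding step can be sketched as a matching function: among Available PVs with the same storageClassName, whose access modes cover the claim's and whose capacity is at least the request, pick the smallest. This mirrors how the PV controller treats unbound volumes, but it is a simplified model with hypothetical names; it ignores label selectors, volumeMode, node affinity, and PVs pre-bound through claimRef.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PV:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str
    phase: str = "Available"


@dataclass
class PVC:
    name: str
    request_gi: int
    access_modes: frozenset
    storage_class: str


def find_best_match(pvc: PVC, pvs: list[PV]) -> Optional[PV]:
    """Smallest Available PV that satisfies the claim, or None (claim stays Pending)."""
    candidates = [
        pv for pv in pvs
        if pv.phase == "Available"
        and pv.storage_class == pvc.storage_class
        and pvc.access_modes <= pv.access_modes  # PV must offer every requested mode
        and pv.capacity_gi >= pvc.request_gi
    ]
    return min(candidates, key=lambda pv: pv.capacity_gi, default=None)


RWO, RWX = "ReadWriteOnce", "ReadWriteMany"
pvs = [
    PV("pv-1g", 1, frozenset({RWO}), "manual"),
    PV("pv-5g", 5, frozenset({RWO}), "manual"),
    PV("pv-10g", 10, frozenset({RWO, RWX}), "manual"),
]
print(find_best_match(PVC("data", 2, frozenset({RWO}), "manual"), pvs).name)  # → pv-5g
```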
StorageClass, dynamic provisioning, and EBS CSI
In Article 42 the admin had to create the PV by hand first, which nobody does at real scale. A StorageClass plus a CSI driver flips this around: the user creates only a PVC and the system provisions the PV, even calling AWS to create a real EBS volume. This article installs the real EBS CSI driver (granting the nodes the IAM permissions it needs), traces every link in the call chain from PVC to the moment an EBS volume is created, then deletes the PVC and watches the volume disappear.
VolumeSnapshot and CSI snapshot
We have persistent volumes now; how do we back them up? A VolumeSnapshot captures a point-in-time copy of a PVC's contents, and with EBS CSI it creates a real EBS snapshot on AWS. This article closes Part IX: install the snapshot controller, snapshot a PVC, and restore a new PVC from that snapshot, with a hard-won lesson on why the first restore came back with an empty file and why you must sync before snapshotting.
Cilium and eBPF: why replace kube-proxy
In Part I we built pod networking with kube-proxy in iptables mode and a hand-rolled bridge: enough to run, but the iptables rule set grows linearly with the number of Services. Part X upgrades: replace both kube-proxy and the bridge with eBPF-based Cilium. This article is the theory: what eBPF is, why it's faster than iptables, and what Cilium does differently in the datapath, looking directly at the 74 iptables rules currently running to see what we're about to drop.
Migrating to kube-proxy-less Cilium
Theory's done; now for real: replace Part I's kube-proxy and bridge with eBPF-based Cilium 1.19, remove kube-proxy entirely, and enable Hubble. This article traces each migration step on a live cluster (install Cilium, disable kube-proxy, confirm Services still work with not a single kube-proxy iptables rule left), plus four real traps a self-built cluster hits (providerID, topology labels, IMDS hop limit, hostNetwork) and how to clear them.
NetworkPolicy: a firewall by label
By default every pod in the cluster can talk freely to every other pod: flat and open. This article uses NetworkPolicy to lock that down: deny all ingress to a pod, then allow only the right labels through, tested for real on a Cilium cluster. Because the cluster runs eBPF, we also get to watch Hubble print DROPPED/FORWARDED verdicts per packet, with Cilium identities showing that policy attaches to labels, not IPs.
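The isolation rule behind this can be sketched in a few lines: a pod that no policy selects accepts all ingress; once any policy selects it, only sources matched by one of those policies' peers get through, and policies are additive. This is a simplified model with a hypothetical policy layout (matchLabels-style selectors, a single namespace, no ports or ipBlocks). `allowed_peers` is this sketch's own field, not the NetworkPolicy `from` field: in a real policy, an ingress rule with no `from` allows all sources, while deny-all is written as a policy with no ingress rules.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every key/value must be present; {} matches all pods."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(dst: dict, src: dict, policies: list[dict]) -> bool:
    """Whether a pod labelled `src` may connect to a pod labelled `dst`."""
    selecting = [p for p in policies if selector_matches(p["podSelector"], dst)]
    if not selecting:
        return True  # not isolated: default allow
    # Policies are additive: any peer in any selecting policy is enough.
    return any(selector_matches(peer, src)
               for p in selecting for peer in p["allowed_peers"])


deny_web = {"podSelector": {"app": "web"}, "allowed_peers": []}
allow_frontend = {"podSelector": {"app": "web"}, "allowed_peers": [{"role": "frontend"}]}

print(ingress_allowed({"app": "web"}, {"role": "frontend"}, [deny_web]))                  # → False
print(ingress_allowed({"app": "web"}, {"role": "frontend"}, [deny_web, allow_frontend]))  # → True
print(ingress_allowed({"app": "web"}, {"role": "batch"}, [deny_web, allow_frontend]))     # → False
```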