StatefulSet: stable identity and order
Article 24 showed that a Deployment treats pods like an anonymous school of fish: three pods named rollout-demo-7545d5669f-xxxxx with random suffixes, all interchangeable. Lose one and the ReplicaSet spawns another under a different name, no problem, because they're identical and hold nothing of their own. That model is perfect for stateless web apps. But a database, a message queue, or the very etcd we set up in Article 6 is different: each node has an identity (the etcd member controller-0 can't be replaced by a member with an arbitrary new name), starts up in a defined order, and keeps its own data. That's when you need a StatefulSet.
Four guarantees
The docs list exactly four things a StatefulSet guarantees that a Deployment doesn't:
"Stable, unique network identifiers. Stable, persistent storage. Ordered, graceful deployment and scaling. Ordered, automated rolling updates."
This article verifies three of the four with real tests: stable network identity, deploy/scale order, and (indirectly) ordered updates. The second one, persistent storage, needs a StorageClass and a dynamic provisioner (EBS CSI) we haven't set up yet; it's left for the Storage section, but at the end we'll spell out the mechanism.
One prerequisite first, per the docs: "StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service." A headless service is a Service with clusterIP: None — it doesn't hand out a single aggregated virtual IP like a normal Service (Article 16), but lets DNS return the IPs of the individual pods. A StatefulSet relies on it to give each pod its own DNS name.
apiVersion: v1
kind: Service
metadata: {name: web, labels: {app: web}}
spec:
  clusterIP: None        # <-- headless
  selector: {app: web}
  ports: [{port: 80, name: web}]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: {name: web}
spec:
  serviceName: web       # <-- points at the headless service above
  replicas: 3
  selector: {matchLabels: {app: web}}
  template:
    metadata: {labels: {app: web}}
    spec:
      containers:
      - name: app
        image: busybox:1.36
        command: ["sleep","3600"]
        readinessProbe:
          exec: {command: ["true"]}
          initialDelaySeconds: 2
          periodSeconds: 2
Identity: fixed names, not a random hash
The difference shows up immediately in the pod names. Docs: "Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal)." With replicas: 3 we get web-0, web-1, web-2 — numbered from 0, no random suffix. And the hostname inside the pod is exactly that name:
for p in web-0 web-1 web-2; do echo "$p -> hostname: $(kubectl exec $p -- hostname)"; done
web-0 -> hostname: web-0
web-1 -> hostname: web-1
web-2 -> hostname: web-2
Each pod knows who it is: web-0 is always web-0. Compared to a Deployment where the hostname is a meaningless hash string, this is the foundation for a stateful cluster to configure itself (for example, etcd's --name taken from the hostname).
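As a minimal sketch of how a stateful app might configure itself from that name, the shell functions below pull the ordinal out of a pod name and map it to a role. The function names and the "ordinal 0 is the seed" convention are assumptions made for illustration, not something the StatefulSet controller enforces.

```shell
#!/bin/sh
# ordinal_of NAME: print the trailing ordinal of a StatefulSet pod name.
ordinal_of() {
  # Strip everything up to and including the last "-".
  printf '%s\n' "${1##*-}"
}

# member_role NAME: hypothetical convention where ordinal 0 seeds the cluster
# and every other ordinal joins it.
member_role() {
  if [ "$(ordinal_of "$1")" -eq 0 ]; then
    echo seed
  else
    echo joiner
  fi
}

# Inside a pod you would pass "$(hostname)"; literal names are used here.
ordinal_of web-2      # prints 2
member_role web-0     # prints seed
member_role web-2     # prints joiner
```

In a container entrypoint the same call would be `member_role "$(hostname)"`, which works only because the hostname is stable across restarts.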
Per-pod DNS via the headless service
Identity isn't just a name; the name is also addressable. The headless service gives each pod a DNS name of the form $(podname).$(service).$(namespace).svc.cluster.local. Resolve it from inside web-0:
kubectl exec web-0 -- nslookup web-1.web.default.svc.cluster.local
Name: web-1.web.default.svc.cluster.local
Address: 10.200.0.25
web-1.web.default.svc.cluster.local resolves correctly to the IP of pod web-1. This is something a Deployment + Service normally does not have: in Article 16, a ClusterIP Service gave a single aggregated virtual IP and then load-balanced randomly to any pod — you couldn't address one pod by name. Headless is the opposite. Querying the service name directly (without a podname) returns all pod IPs:
kubectl exec web-0 -- nslookup web.default.svc.cluster.local
kubectl get svc web -o jsonpath='clusterIP={.spec.clusterIP}{"\n"}'
Address: 10.200.0.26
Address: 10.200.0.25
Address: 10.200.1.24
clusterIP=None
Three addresses match the three pods (cross-check: web-0=10.200.1.24, web-1=10.200.0.25, web-2=10.200.0.26). clusterIP: None confirms this is headless: no virtual IP, the client picks a pod by its DNS name. That's how a client of a stateful cluster finds exactly the node it needs (e.g. connecting to the right database primary).
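Because the per-pod DNS names follow a fixed pattern, a peer list can be computed without querying the cluster at all. The sketch below assumes the default cluster domain cluster.local; the helper names pod_fqdn and peer_list are hypothetical, chosen only for this example.

```shell
#!/bin/sh
# pod_fqdn POD SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Build the per-pod DNS name; the domain defaults to cluster.local (assumption).
pod_fqdn() {
  printf '%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "${4:-cluster.local}"
}

# peer_list STATEFULSET SERVICE NAMESPACE REPLICAS
# Print the comma-separated DNS names of pods 0..REPLICAS-1.
peer_list() {
  i=0
  out=""
  while [ "$i" -lt "$4" ]; do
    # Prepend a comma only when out is already non-empty.
    out="${out:+$out,}$(pod_fqdn "$1-$i" "$2" "$3")"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

peer_list web web default 3
```

A list like this is the kind of value a clustered application would take as its initial member list, which is why predictable names matter more than predictable IPs.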
Created in order, deleted in reverse order
The third guarantee is order. A StatefulSet creates pods sequentially {0..N-1}, and each pod is only created after the previous one is Running and Ready. Catch it red-handed by polling right after apply:
for i in $(seq 1 12); do
  echo "t=$i: $(kubectl get pods -l app=web --no-headers | awk '{print $1"="$3}' | tr '\n' ' ')"
  sleep 2
done
t=1: web-0=ContainerCreating
t=2: web-0=Running
t=3: web-0=Running web-1=Running
t=4: web-0=Running web-1=Running web-2=Running
web-0 appears alone first; only once it's Running and Ready does web-1 get created, and then web-2. A Deployment, by contrast, starts all three at once. The rule from the docs: "For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}" and "Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready." This matters for a cluster that needs a startup order (node 0 is the seed, later nodes join it).
Stable identity also means: delete a pod and it comes back with the same old name, not a new one like a Deployment:
kubectl delete pod web-1 --now
# poll:
t=1: web-0=Running web-1=ContainerCreating web-2=Running
t=2: web-0=Running web-1=Running web-2=Running
web-1 dies and comes back still as web-1, same name, same DNS name (even though the pod IP may change). An etcd member named web-1 is therefore still itself after the pod is rescheduled. When scaling down, the order reverses, deleting from the highest ordinal:
kubectl scale statefulset/web --replicas=2
# poll:
t=1: web-0=Running web-1=Running web-2=Terminating
web-2 (the highest ordinal) is deleted first, web-0/web-1 stay put. Docs: "When Pods are being deleted, they are terminated in reverse order, from {N-1..0}" and "Before a Pod is terminated, all of its successors must be completely shutdown." Shrinking a stateful cluster therefore removes the newest nodes first, so the seed node (ordinal 0) that the others joined is the last one to go.
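The two ordering rules quoted above can be written down as a small model. This is only an illustration of the ordinal arithmetic, not how the controller is implemented, and the function names are made up for this sketch.

```shell
#!/bin/sh
# create_order NAME REPLICAS
# Pod names in creation order: ordinal 0 first, REPLICAS-1 last.
create_order() {
  i=0
  while [ "$i" -lt "$2" ]; do
    echo "$1-$i"
    i=$((i + 1))
  done
}

# scale_down_order NAME CURRENT TARGET
# Pods terminated when shrinking from CURRENT to TARGET replicas,
# highest ordinal first.
scale_down_order() {
  i=$(($2 - 1))
  while [ "$i" -ge "$3" ]; do
    echo "$1-$i"
    i=$((i - 1))
  done
}

create_order web 3          # web-0, web-1, web-2 (one per line)
scale_down_order web 3 2    # web-2
scale_down_order web 3 0    # web-2, web-1, web-0 (one per line)
```

Note that the model leaves out the waiting: in the real controller each step also blocks until the predecessor is Running and Ready, or until the successor has fully shut down.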
Persistent storage: volumeClaimTemplates (preview)
The remaining guarantee, persistent storage, is the most important reason people choose a StatefulSet, but it needs storage infrastructure we haven't set up yet. The mechanism: instead of declaring a shared volume, a StatefulSet declares volumeClaimTemplates, and "For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim." Each pod is given its own PVC, named in the form <claim-name>-<pod-name> (e.g. www-web-0, www-web-1...).
volumeClaimTemplates:             # (added to the StatefulSet spec)
- metadata: {name: www}
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "..."       # needs a StorageClass + dynamic provisioner
    resources: {requests: {storage: 1Gi}}
The key point for data: this PVC sticks to the identity, not the pod lifecycle. web-0 dies and comes back still attached to the same old www-web-0, data intact. And the docs warn of a deliberate safety choice: "Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety." That is, deleting a StatefulSet does not delete the volumes — you have to clean up the PVCs yourself, guarding against accidental data loss.
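The naming rule and the retention behavior can be combined into a quick calculation of which PVCs would be left behind after a scale-down. This is a sketch of the arithmetic only, assuming the claim template is named www as in the fragment above; pvc_name and leftover_pvcs are hypothetical helper names.

```shell
#!/bin/sh
# pvc_name CLAIM POD
# PVC name for a pod, following the <claim-name>-<pod-name> pattern.
pvc_name() {
  printf '%s-%s\n' "$1" "$2"
}

# leftover_pvcs CLAIM STATEFULSET OLD_REPLICAS NEW_REPLICAS
# PVCs that still exist after scaling down, because scale-down removes
# the pods but not their claims.
leftover_pvcs() {
  i=$4
  while [ "$i" -lt "$3" ]; do
    pvc_name "$1" "$2-$i"
    i=$((i + 1))
  done
}

pvc_name www web-0           # prints www-web-0
leftover_pvcs www web 3 2    # prints www-web-2
leftover_pvcs www web 3 0    # prints www-web-0, www-web-1, www-web-2 (one per line)
```

The last call corresponds to scaling to zero: every claim is still there and has to be removed by hand once the data is no longer needed.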
We'll do this part for real in the Storage section (Article 43): set up a StorageClass with EBS CSI, create a StatefulSet with volumeClaimTemplates, write data into web-0, delete the pod, and see the data still intact. For now, just grasp this: stable identity + identity-bound PVC is the duo that makes a StatefulSet.
🧹 Cleanup
kubectl delete statefulset web
kubectl delete svc web
Since we never declared volumeClaimTemplates, there are no PVCs to clean up. (When you do have volumes, remember to kubectl delete pvc -l app=web separately — as warned above, they don't go away on their own.) The cluster returns to two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 25-statefulset.
Wrap-up
A StatefulSet is for stateful applications, where pods can't be swapped for one another arbitrarily. It needs a headless service (clusterIP: None), and this article grouped what it provides under four headings: stable names (web-0..N-1 numbered from 0, hostname = pod name); per-pod DNS (web-0.web.default.svc.cluster.local, addressable by name because headless returns each pod's IP rather than a single aggregated ClusterIP); order (created sequentially 0→N-1 waiting for the previous pod to be Ready, deleted in reverse N-1→0, and a deleted pod resurrects with the same old name); and persistent storage via volumeClaimTemplates, one PVC <claim>-<pod> per pod that sticks to the identity and isn't deleted on scale-down (tested for real in the Storage section). Compared to the "anonymous school of fish" of a Deployment, each StatefulSet pod has an identity and its own storage.
Article 26 moves to the DaemonSet, the controller for a third model: not "N replicas" but "exactly one replica on each node", used for log agents, CNI, node exporters... things that must be present on every machine.