etcd: Quorum, Raft, and Standing Up a Three-Node Cluster

K
Kai··8 min read

From this article on, we start turning on the real components, and the most logical starting point is etcd — because everything else writes its state into it. Without etcd, the api-server has nowhere to store data, and controllers have nothing to reconcile against. So we stand it up first, right on the three controllers, using the exact etcd and etcd-ca certs we created in Article 4.

What etcd is and what it stores

etcd is a distributed, strongly consistent key-value database. In Kubernetes it holds the entire cluster state: every object you've ever created — Pod, Deployment, Service, ConfigMap, Secret, even the Leases used for leader election — lives here as key-value pairs. As mentioned in Article 1, only the api-server talks to etcd; every other component reads/writes state through the api-server.

Because etcd holds everything, two of its properties determine the reliability of the whole cluster: data must be consistent (every node sees the same truth) and must stay alive even when one machine dies. Both come from the Raft consensus algorithm.

Raft, leader, and quorum

In an etcd cluster, the nodes elect a leader; every write goes through the leader, and the leader replicates it to the followers. A write is only considered successful when a majority of nodes have acknowledged it. This "majority" is called quorum, and understanding it is understanding why the number of nodes matters.

   Writing a key:
      client ──► LEADER ──┬──► follower 1  (ack)
                          └──► follower 2  (ack)
      Leader waits for a MAJORITY to ack (including itself) → reports success

The quorum of an N-node cluster is floor(N/2) + 1. The cluster keeps working as long as quorum is met; lose quorum and etcd stops accepting writes so it won't produce two conflicting versions of the truth. A table makes it clear:

   Nodes (N)     Quorum needed   Max tolerable loss
   ───────────   ─────────────   ──────────────────
        1             1               0
        2             2               0      ← worse than 1!
        3             2               1
        4             3               1      ← same as 3, but costs more
        5             3               2

This table explains why you always use an odd number of nodes. Two nodes need both to agree (quorum 2), so one node dying loses quorum — that's worse than a single node, and costs twice as much. Four nodes tolerate exactly one node death, just like three, but cost an extra machine without adding fault tolerance. So three is the smallest meaningful choice for HA (tolerating one node death), and that's the number we pick for this series.

Step 1 — Install the etcd binaries on the three controllers

etcd ships as a tarball with two binaries: etcd (server) and etcdctl (command-line client). We use v3.6.11 (the latest 3.6.x at the time of writing). On each controller, download and place them in /usr/local/bin, and create the directories etcd needs:

# run on EACH controller (controller-0, controller-1, controller-2)
cd /tmp
VER=v3.6.11
curl -sL -o etcd.tar.gz \
  https://github.com/etcd-io/etcd/releases/download/${VER}/etcd-${VER}-linux-amd64.tar.gz
tar -xzf etcd.tar.gz
sudo cp etcd-${VER}-linux-amd64/etcd etcd-${VER}-linux-amd64/etcdctl /usr/local/bin/
sudo mkdir -p /etc/etcd /var/lib/etcd
sudo chmod 700 /var/lib/etcd
etcd --version | head -1
etcd Version: 3.6.11

/var/lib/etcd is where etcd writes its real data; set it to 700 so only root can read it. In this series I run these commands one after another on all three controllers over SSH; you can do it by hand on each machine or use an ssh loop like I do.

Step 2 — Get the etcd certs onto each controller

etcd uses TLS on two fronts: with clients (the api-server calling port 2379) and between peers (port 2380). Both use the same etcd cert pair signed by etcd-ca in Article 4. Copy the three files from the workstation up to /etc/etcd on each controller:

# from the workstation, in the ~/k8s-scratch/pki directory
for h in controller-0 controller-1 controller-2; do
  scp etcd-ca.pem etcd.pem etcd-key.pem ${h}:/tmp/
  ssh $h 'sudo cp /tmp/etcd-*.pem /etc/etcd/ && rm /tmp/etcd-*.pem'
done

(I use the ssh_config file from Article 3 so I can type node names directly; you'd add -F ~/k8s-scratch/ssh_config or -i key.pem ubuntu@<ip> depending on how you set things up.)

Step 3 — Write the systemd unit for etcd

Each controller runs etcd as a systemd service. The unit file is nearly identical across the three machines, differing in only two places: the node name (--name) and its own IP. The most important thing is to understand the flags:

  • --listen-peer-urls / --initial-advertise-peer-urls (port 2380): the channel etcd instances use to talk to each other (Raft).
  • --listen-client-urls / --advertise-client-urls (port 2379): the channel for clients (the api-server). We have it listen on both the internal IP and 127.0.0.1 so local calls on the controller work too.
  • --initial-cluster: the list of all three members — this is how each etcd knows where the other two are so they can handshake into a cluster.
  • --initial-cluster-state new: signals this is a brand-new cluster (as opposed to adding a node to an existing one).
  • --client-cert-auth / --peer-client-cert-auth: requires both client and peer to present a valid cert — no cert, no entry.

Here's the unit for controller-0 (IP 10.0.1.11); the other two machines change --name and the corresponding IPs:

[Unit]
Description=etcd
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name controller-0 \
  --cert-file=/etc/etcd/etcd.pem \
  --key-file=/etc/etcd/etcd-key.pem \
  --peer-cert-file=/etc/etcd/etcd.pem \
  --peer-key-file=/etc/etcd/etcd-key.pem \
  --trusted-ca-file=/etc/etcd/etcd-ca.pem \
  --peer-trusted-ca-file=/etc/etcd/etcd-ca.pem \
  --client-cert-auth \
  --peer-client-cert-auth \
  --initial-advertise-peer-urls https://10.0.1.11:2380 \
  --listen-peer-urls https://10.0.1.11:2380 \
  --listen-client-urls https://10.0.1.11:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://10.0.1.11:2379 \
  --initial-cluster-token etcd-cluster-0 \
  --initial-cluster controller-0=https://10.0.1.11:2380,controller-1=https://10.0.1.12:2380,controller-2=https://10.0.1.13:2380 \
  --initial-cluster-state new \
  --data-dir=/var/lib/etcd
Restart=on-failure
RestartSec=5
LimitNOFILE=40000

[Install]
WantedBy=multi-user.target

Write this file to /etc/systemd/system/etcd.service on each machine (remember to change the name + IP), then reload systemd and enable the service to start on boot:

sudo systemctl daemon-reload
sudo systemctl enable etcd

Step 4 — Start and verify

There's one thing to watch when starting: with Type=notify, systemctl start etcd will wait until etcd reports ready — and etcd is only ready once quorum is met, i.e. once it has seen the other peers. If you start them serially and wait on each, the first machine will hang waiting for the other two. The clean way is to start all three almost simultaneously, using --no-block so the command returns immediately:

for h in controller-0 controller-1 controller-2; do
  ssh $h 'sudo systemctl start --no-block etcd'
done

After a few seconds, the three etcds find each other, elect a leader, and the cluster forms. Check the member list — run etcdctl with the TLS cert set (without a cert etcd refuses, since we enabled --client-cert-auth):

sudo etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/etcd-ca.pem \
  --cert=/etc/etcd/etcd.pem \
  --key=/etc/etcd/etcd-key.pem -w table
+------------------+---------+--------------+------------------------+------------------------+------------+
|        ID        | STATUS  |     NAME     |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+--------------+------------------------+------------------------+------------+
| 33eed1f752c2defa | started | controller-2 | https://10.0.1.13:2380 | https://10.0.1.13:2379 |      false |
| bbeedf10f5bbaa0c | started | controller-1 | https://10.0.1.12:2380 | https://10.0.1.12:2379 |      false |
| eecdfcb7e79fc5dd | started | controller-0 | https://10.0.1.11:2380 | https://10.0.1.11:2379 |      false |
+------------------+---------+--------------+------------------------+------------------------+------------+

Three members, all started. To avoid retyping the long cert set, put it in a variable, then check the health and status of the whole cluster:

E="--cacert=/etc/etcd/etcd-ca.pem --cert=/etc/etcd/etcd.pem --key=/etc/etcd/etcd-key.pem"
sudo etcdctl endpoint health --cluster $E -w table
+------------------------+--------+-------------+-------+
|        ENDPOINT        | HEALTH |    TOOK     | ERROR |
+------------------------+--------+-------------+-------+
| https://10.0.1.11:2379 |   true | 12.525445ms |       |
| https://10.0.1.12:2379 |   true | 17.632423ms |       |
| https://10.0.1.13:2379 |   true | 22.387437ms |       |
+------------------------+--------+-------------+-------+

endpoint status shows who's the leader (a few columns trimmed for readability):

sudo etcdctl endpoint status --cluster $E -w table
+------------------------+------------------+---------+---------+-----------+-----------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+------------------------+------------------+---------+---------+-----------+-----------+
| https://10.0.1.13:2379 | 33eed1f752c2defa |  3.6.11 |   20 kB |     false |         2 |
| https://10.0.1.12:2379 | bbeedf10f5bbaa0c |  3.6.11 |   20 kB |     false |         2 |
| https://10.0.1.11:2379 | eecdfcb7e79fc5dd |  3.6.11 |   20 kB |      true |         2 |
+------------------------+------------------+---------+---------+-----------+-----------+

controller-0 is the leader (IS LEADER true); the other two are followers. RAFT TERM 2 is the "term" number — each time a leader is re-elected, the term increments. If you now shut down controller-0, the cluster loses its leader for a moment, then the two remaining machines (still meeting quorum 2/3) elect a new leader, the term bumps to 3, and the cluster keeps accepting writes. That's exactly the one-node-failure tolerance the quorum table above promised — you can try it yourself at the end of the article.

🧹 Cleanup

etcd now runs permanently on the three controllers as part of the control plane, so we don't shut it down — the later articles need it. If you're taking a break from learning, just stop-instances the whole cluster like in Article 3; etcd will restart itself (it's systemctl enabled) and rejoin into a cluster when the machines come back up, because the data lives in /var/lib/etcd on the EBS volume and survives a stop.

Wrap-up

We now have a distributed state store, tolerant of one node failure, protected by mutual TLS. More important than the number "three", you understand why it's three: quorum and Raft drive every choice about etcd's node count and topology. This understanding goes a long way, because etcd is the thing that needs the most careful attention in any production cluster — and it's what we'll back up and restore in Article 21.

Now that we have a place to store state, Article 7 stands up the component that sits right in front of it and is the cluster's single entry point: kube-apiserver. We'll configure it to talk to the etcd cluster we just built (via the apiserver-etcd-client cert), enable Secret encryption with the encryption-config.yaml from Article 5, and dig deeper into the authn → authz → admission chain that every request must pass through.