Swarm: Service, Scale and Rolling Update

Kai · 4 min read

In Article 10 we set up the cluster and talked about desired state. This article makes it concrete: on Swarm you don't run individual containers with docker run, you declare a service — and Swarm handles the rest.

Service and task

Per the Docker docs, "a service is the definition of the tasks to execute on the manager or worker nodes", and "a task carries a Docker container and the commands to run inside the container. It is the atomic scheduling unit of swarm." The relationship:

   Service "web" (desired: 3 replicas)
        │  the manager splits it into tasks
        ├── task web.1 ──► nginx container (on node X)
        ├── task web.2 ──► nginx container (on node Y)
        └── task web.3 ──► nginx container (on node Z)

You declare that the service wants 3 replicas; the manager creates 3 tasks; each task runs one container, possibly on a different node. You manage the service as a whole, not each container.

There are two service modes:

  • replicated (default): runs exactly the number of replicas you set, and the manager distributes them across nodes. Used for most applications.
  • global: runs exactly one replica on each node in the cluster. Good for monitoring/log agents that need to be present on every machine.
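As a sketch of the second mode: a global service is created with --mode global instead of --replicas. The service name node-agent and the alpine image with an idle command are placeholders invented for illustration; a real setup would use an actual monitoring or log agent image.

```shell
# Hypothetical helper: create a global service (one task per node).
# "node-agent" and the idle alpine container stand in for a real agent.
create_global_agent() {
  docker service create \
    --name node-agent \
    --mode global \
    alpine:3.20 tail -f /dev/null
}

# Run on a manager node:
#   create_global_agent
#   docker service ps node-agent   # one task per node
```

Note that there is no replica count to set here: the number of tasks follows the number of nodes.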

Creating a service

Create a web service with 3 replicas, publishing a port:

docker service create --name webv --replicas 3 -p 9090:80 nginx:alpine
verify: Service converged

"Converged" means the actual state now matches the desired state (all 3 replicas running). View the service:

docker service ls
NAME   MODE         REPLICAS   IMAGE
webv   replicated   3/3        nginx:alpine

3/3 = 3 replicas desired, 3 running. View each task and where it sits:

docker service ps webv
NAME      CURRENT STATE
webv.1    Running
webv.2    Running
webv.3    Running

(On a multi-node cluster, the node column shows the tasks spread across different machines.)
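To see the spread at a glance, one option is to print only the node of each running task and count the lines per node. This is a small sketch, not part of the original walkthrough; tasks_per_node is a made-up helper name, and it relies on the .Node placeholder of docker service ps --format.

```shell
# Hypothetical helper: count running tasks of a service per node.
# Must be run on a manager node.
tasks_per_node() {
  docker service ps "$1" \
    --filter desired-state=running \
    --format '{{.Node}}' | sort | uniq -c
}

# Usage: tasks_per_node webv
```

On a single-node cluster this prints one line with the full replica count.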

Note: docker service ... commands only run on a manager node; workers can't issue orchestration commands (Article 10). Also keep the two views apart: docker ps only shows containers on the current node, while docker service ps shows a service's tasks across the whole cluster.
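If you are not sure which kind of node you are logged in to, docker info can tell you. The sketch below wraps that check in a helper; is_manager is a name chosen here for illustration, and it assumes the Swarm.ControlAvailable field reported by docker info.

```shell
# Hypothetical helper: succeeds only when the local node is a swarm manager.
is_manager() {
  [ "$(docker info --format '{{.Swarm.ControlAvailable}}')" = "true" ]
}

# Usage:
#   is_manager && docker service ls || echo "run this on a manager node"
```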

Scale: increase or decrease replicas

Change the replica count with a single command:

docker service scale webv=5
verify: Service converged

The manager creates 2 more tasks to reach 5 and places them on nodes with available capacity. Scale down, and it shuts the surplus tasks down. An equivalent way:

docker service update --replicas 5 webv

This is desired state in practice: you state the number, Swarm self-adjusts.
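The gap between "declared" and "actual" is visible in the REPLICAS column, so a script can compare the two numbers after a scale command. The helper below is a simplified sketch: replicas_state is a hypothetical name, and comparing running with desired is only an approximation of the stability check Docker performs before printing "converged".

```shell
# Hypothetical helper: compare running vs desired replicas of one service.
replicas_state() {
  # Match the name exactly, since the name filter can return several services.
  replicas=$(docker service ls --filter "name=$1" \
    --format '{{.Name}} {{.Replicas}}' | awk -v n="$1" '$1 == n { print $2 }')
  running=${replicas%%/*}   # part before the slash
  desired=${replicas##*/}   # part after the slash
  if [ -n "$replicas" ] && [ "$running" = "$desired" ]; then
    echo "converged ($replicas)"
  else
    echo "pending (${replicas:-service not found})"
  fi
}

# Usage: docker service scale webv=5 && replicas_state webv
```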

Self-healing

Because Swarm continuously reconciles the actual state with the desired state, if a task dies (a container crash, or a whole node failing), the manager detects the shortfall and creates a replacement task on an available node. You don't have to do anything.

On a multi-node cluster you can try: shut Docker down on a worker, then run docker service ps webv on the manager — you'll see the tasks on that node move to Shutdown/Failed and new tasks spring up on another node to make up for them. (On a single-node cluster you can't simulate this — you need multi-node as suggested in Article 10.)
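A gentler way to watch tasks being rescheduled, without stopping the Docker daemon, is to drain a node. This is a different mechanism from a crash (the manager moves the tasks away deliberately), so treat it as an approximation of the experiment rather than the same test. The helper names and the node name worker1 in the usage line are placeholders.

```shell
# Hypothetical helpers: drain a node, show where the tasks went, restore it.
drain_and_watch() {
  node=$1
  service=$2
  docker node update --availability drain "$node"
  docker service ps "$service" \
    --format '{{.Name}} {{.Node}} {{.CurrentState}}'
}

restore_node() {
  docker node update --availability active "$1"
}

# Usage (on a manager):
#   drain_and_watch worker1 webv
#   restore_node worker1
```

Setting the node back to active does not move existing tasks back onto it; it only makes the node eligible for new tasks again.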

Rolling update: change versions without downtime

When you need to update the image (deploy a new version), Swarm replaces tasks batch by batch instead of stopping them all at once; this is a rolling update. While one task is being replaced, the remaining replicas keep serving, so a service with more than one replica isn't interrupted.

docker service update --image nginx:1.27-alpine webv
verify: Service converged

By default Swarm updates one task at a time: it stops an old task, starts a new task with the new image, waits for it to reach the running state, then moves to the next task. Check that the image changed:

docker service inspect webv --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
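The same inspect command can also report how the last update went. The sketch below assumes the UpdateStatus field of the service object, which is absent until the service has been updated at least once, hence the if guard in the template; update_state is a hypothetical helper name.

```shell
# Hypothetical helper: print the state of the most recent update, if any.
update_state() {
  docker service inspect "$1" \
    --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{else}}no update recorded{{end}}'
}

# Usage: update_state webv
```

Typical values include updating, completed, and paused, plus rollback variants after a rollback.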

Control how the update runs with flags:

# Update 2 tasks at a time and wait 10s between batches.
# (Comments can't follow a trailing backslash, so they go up here.)
docker service update \
  --image nginx:1.27-alpine \
  --update-parallelism 2 \
  --update-delay 10s \
  webv

--update-parallelism decides how many tasks are replaced at once; --update-delay is the pause between batches so the new replica has time to stabilize before touching the next one.
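To get a feel for how the two flags interact, here is a rough back-of-the-envelope calculation. It counts only the pauses between batches and ignores container start-up and monitoring time, so it is a lower bound rather than a prediction. The function name is made up, and it assumes a parallelism of at least 1.

```shell
# Rough lower bound on the time spent pausing during a rolling update.
# batches = ceil(replicas / parallelism); there is one pause between
# each pair of consecutive batches.
update_pause_seconds() {
  replicas=$1
  parallelism=$2   # assumed >= 1
  delay=$3         # seconds
  batches=$(( (replicas + parallelism - 1) / parallelism ))
  echo $(( (batches - 1) * delay ))
}

update_pause_seconds 5 2 10   # 3 batches, 2 pauses: prints 20
```

With 5 replicas the update runs in batches of 2, 2, and 1, so raising the parallelism shortens the rollout at the cost of taking more replicas out of service at once.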

Rollback when an update goes wrong

If the new version has a problem, go back to the previous version with just:

docker service rollback webv

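Swarm stores the previous configuration of the service, which is what the rollback reverts to; like an update, the rollback is applied to tasks in a rolling fashion. You can also set an automatic-rollback policy for failed updates with --update-failure-action rollback, either at service creation time or later through docker service update.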
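Since webv already exists at this point, one way to try automatic rollback is to set the policy with docker service update, which accepts the same --update-failure-action flag. The monitor window and failure ratio below are illustrative values, not recommendations, and the helper name is hypothetical.

```shell
# Hypothetical helper: make failed updates of a service roll back on their own.
enable_auto_rollback() {
  docker service update \
    --update-failure-action rollback \
    --update-monitor 15s \
    --update-max-failure-ratio 0.2 \
    "$1"
}

# Usage: enable_auto_rollback webv
```

Here --update-monitor is how long Swarm watches each updated task for failure, and --update-max-failure-ratio is the share of failed tasks it tolerates before the failure action kicks in.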

🧹 Cleanup

docker service rm webv

Removing a service stops and deletes all its tasks/containers on every node. (We still keep the swarm for Articles 12–13; leaving the swarm is at the end of Article 13.)

Wrap-up

On Swarm, the unit of work is the service — a declaration of "I want N replicas of this image". The manager splits it into tasks spread across nodes, and continuously keeps the replica count met (self-healing). scale changes the replica count; a rolling update replaces the image batch by batch so there's no downtime; rollback returns to the previous version. It's all desired state: you declare, Swarm converges to it.

The service now runs many replicas across many nodes, but which network do they talk over, and how does a published port reach a replica that's sitting on another node? Article 12 answers that: overlay networks and the routing mesh.