Optimization and Execution Strategy

Against a single host, running Ansible is simple. Across hundreds of hosts, two questions arise: how to run fast (in parallel) and how to run safely (without taking down the whole fleet at once). This article covers the operational techniques that answer both.

forks: how many hosts run in parallel

By default Ansible processes 5 hosts in parallel (forks = 5). For a large fleet, increase it to go faster:

# ansible.cfg
[defaults]
forks = 50

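Or pass it on the command line: ansible-playbook site.yml -f 50. forks is the maximum number of hosts Ansible works on at the same time; each one costs a worker process and a connection on the control node. Setting it too high can choke the control node (CPU/RAM/network), so tune it to the machine's capacity; values between 20 and 100 are typical.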

strategy: linear vs free

The strategy plugin decides how Ansible orchestrates tasks across hosts:

   linear (default):  ALL hosts finish task N before the group moves to task N+1
                      → synchronized, easy to follow; but fast hosts wait on slow ones

   free:              each host runs all its tasks INDEPENDENTLY, no waiting
                      → faster when hosts are uneven; harder to follow

- hosts: web
  strategy: free        # each host runs its whole playbook without waiting

linear (default) is easy to reason about and fine for most cases. free helps when hosts have very different speeds and the tasks are independent.

serial: zero-downtime rolling updates

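This is the single most important technique for production. By default Ansible runs each task across every host in the play before moving on, so a change reaches the whole fleet in one pass. That is dangerous: if the change has a bug, the entire fleet goes down together. serial splits hosts into batches, and the play runs to completion on one batch before the next batch starts: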

- hosts: web
  serial: 2          # or "25%"
  tasks:
    - name: Deploy the new version
      ...

   6 hosts, serial: 2 →
   Batch 1: host1, host2   ── deploy + check ──┐
   Batch 2: host3, host4   ── runs ONLY IF batch 1 ok │ (rolling)
   Batch 3: host5, host6   ──                          ┘

If a batch fails, Ansible stops: the remaining hosts are not yet touched, and the service keeps running on them. By default "fails" means every host in the batch failed; set max_fail_percentage to abort the play as soon as the share of failed hosts in a batch exceeds that threshold. Combine serial with a load balancer (Networking series, Article 11): pull a host out of the LB → deploy → check → put it back in the LB, batch by batch → zero-downtime updates. This is how Ansible does rolling deploys.
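The LB → deploy → check → LB cycle described above can be sketched as a single play. This is a minimal sketch, not a definitive implementation: the deploy script path, the health URL and port, and the load_balancer_host variable are assumptions for illustration, and the haproxy module may need extra parameters (such as backend or socket) in a real setup.

```yaml
- hosts: web
  serial: 2
  max_fail_percentage: 0        # abort the play if any host in a batch fails
  pre_tasks:
    - name: Pull the host out of the load balancer
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
      delegate_to: "{{ load_balancer_host }}"   # assumed variable

  tasks:
    - name: Deploy the new version
      ansible.builtin.command: /opt/app/deploy.sh   # hypothetical script

    - name: Check that the app answers before re-enabling it
      ansible.builtin.uri:
        url: "http://localhost:8080/health"         # hypothetical endpoint
        status_code: 200
      register: health
      until: health is succeeded
      retries: 10
      delay: 3

  post_tasks:
    - name: Put the host back in the load balancer
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
      delegate_to: "{{ load_balancer_host }}"
```

If the health check never succeeds, the host fails before post_tasks, so it stays out of the load balancer and the remaining batches are not started.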

delegate_to and run_once

delegate_to runs a task on a different host than the one being processed — useful when an operation relates to the current host but must run elsewhere:

- name: Pull the host out of the load balancer before deploying
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
  delegate_to: "{{ load_balancer_host }}"     # runs ON the load balancer

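run_once: true runs a task only once (on the first host of the current batch) instead of on every host, and applies the result to all of them. Use it for things that only need to happen once, such as a database migration or creating a DNS record. Note that when the play also uses serial, a run_once task runs once per batch, not once per play; to run it exactly once regardless of batching, use a condition such as when: inventory_hostname == ansible_play_hosts_all[0] instead.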

- name: Run migration (once for the whole group)
  ansible.builtin.command: /opt/app/migrate.sh
  run_once: true

Combining delegate_to: localhost + run_once is a common pattern for running something on the control node once.
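A minimal sketch of that pattern, assuming a chat webhook that accepts a JSON POST (the URL and the message format are hypothetical):

```yaml
- name: Announce the deploy once, from the control node
  ansible.builtin.uri:
    url: "https://chat.example.com/hooks/deploy"   # hypothetical webhook
    method: POST
    body_format: json
    body:
      text: "Deploy started on {{ ansible_play_hosts | length }} hosts"
  delegate_to: localhost    # run on the control node, not on a managed host
  run_once: true            # once, not once per host in the play
```

Without run_once, the delegated task would still run on localhost, but once for every host in the play.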

async: long-running tasks

A long task (a system update, a build) can exceed the SSH timeout. async runs it in the background on the host while you poll its status:

- name: A long job
  ansible.builtin.command: /opt/long-job.sh
  async: 600        # allow up to 600s of runtime
  poll: 10          # check every 10s

poll: 0 is "fire and forget" (start it and move on, check later with the async_status module) — useful for kicking off many long jobs in parallel.
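A sketch of the fire-and-forget variant, reusing the /opt/long-job.sh path from the example above; the retries and delay values are arbitrary choices for illustration:

```yaml
- name: Start the long job in the background
  ansible.builtin.command: /opt/long-job.sh
  async: 600          # allow up to 600s of runtime
  poll: 0             # do not wait; move on immediately
  register: long_job

- name: Do other work while the job runs
  ansible.builtin.debug:
    msg: "Job {{ long_job.ansible_job_id }} is running in the background"

- name: Come back later and wait for the job to finish
  ansible.builtin.async_status:
    jid: "{{ long_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 60         # 60 checks x 10s = up to 600s of waiting
  delay: 10
```

The register on the first task matters: the job id it captures is the only handle async_status has for finding the background job.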

Speeding up: fact caching and pipelining

Two optimizations cut time significantly:

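Pipelining (recall Article 1): send the module over SSH's stdin instead of a separate sftp PUT, reducing the number of SSH round trips per task. When tasks use become with sudo, pipelining requires that requiretty is disabled in sudoers on the managed hosts. Enable it in ansible.cfg: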

[ssh_connection]
pipelining = True

Fact caching: Gathering Facts (Article 4) takes time on every run. Cache facts to reuse them between runs:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

Or, more simply: gather_facts: false on a play that doesn't need facts. Find slow tasks with the profile_tasks callback (Article 12):

ANSIBLE_CALLBACKS_ENABLED=profile_tasks ansible-playbook site.yml

Gathering Facts -------------------------- 3.49s
Install package --------------------------- 2.97s

→ you know exactly which task eats time, so you can optimize it.
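For a play that needs no facts at all, the gather_facts: false option mentioned above removes the Gathering Facts line from that profile entirely. A minimal sketch, assuming a play that only reloads nginx:

```yaml
- hosts: web
  gather_facts: false        # skip the implicit setup task for this play
  tasks:
    - name: Reload nginx (uses no facts)
      ansible.builtin.systemd:
        name: nginx
        state: reloaded
```

This only works when no task, template, or condition in the play references an ansible_* fact; otherwise the play fails on an undefined variable, unless cached facts are available.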

tags: running selectively

Attach tags to tasks to run part of a playbook (mentioned in Article 7):

    - name: Install package
      ansible.builtin.dnf: { name: nginx, state: present }
      tags: [install]
    - name: Configure
      ansible.builtin.copy: { ... }
      tags: [config]

ansible-playbook site.yml --tags config       # run only tasks tagged "config"
ansible-playbook site.yml --skip-tags install # run everything EXCEPT tag "install"
ansible-playbook site.yml --list-tasks        # see tasks + tags before running

With --tags config, only the tasks tagged config run and the install task is skipped. This is very handy when a playbook is long but you only want to update configuration, without re-running the install part.
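When several tasks belong to the same tag, tagging each one gets repetitive. Tags set on a block are inherited by every task inside it, as in this sketch (the template name and paths are hypothetical):

```yaml
- name: Configuration tasks
  tags: [config]             # inherited by every task in the block
  block:
    - name: Render the nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2               # hypothetical template
        dest: /etc/nginx/nginx.conf

    - name: Validate the configuration
      ansible.builtin.command: nginx -t
      changed_when: false                # a syntax check changes nothing
```

Both tasks run under --tags config and both are skipped under --skip-tags config.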

check mode: dry-run before production

Recall Article 5, which belongs here with the other safe-operations techniques: --check --diff shows what the playbook would change without applying it for real. Always run with --check before touching production, and add --diff to see how file contents would change. This is the most important safety net in operations.
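A dry run is only as accurate as each task's check-mode support. command and shell tasks are normally skipped under --check because Ansible cannot predict what they would change, which can leave later tasks without the registered values they depend on. The sketch below shows two task-level controls for that; the script paths are hypothetical:

```yaml
- name: Read the current app version (read-only, so run it even under --check)
  ansible.builtin.command: /opt/app/bin/app --version   # hypothetical binary
  register: app_version
  changed_when: false      # reading a version never changes the host
  check_mode: false        # always execute, even during a dry run

- name: Restart the app only on a real run
  ansible.builtin.command: /opt/app/restart.sh          # hypothetical script
  when: not ansible_check_mode
```

Reserve check_mode: false for tasks that are truly read-only, since it makes the task execute for real during a dry run.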

Wrap-up

Running Ansible at scale: forks controls parallelism; strategy chooses between synchronized (linear) and independent (free) execution; serial splits hosts into batches for zero-downtime rolling updates (combined with a load balancer); delegate_to runs a task on another host and run_once runs it once; async handles long tasks; pipelining, fact caching, and gather_facts: false speed things up (profile_tasks finds slow tasks); tags allow selective runs; --check --diff gives a safe dry-run before production. This is the toolkit that turns a playbook that "runs" into one that's "operable on a real fleet".

The playbook now runs fast and safely, but how do you know a role is correct before pushing it to hundreds of hosts? Article 14: rigorous role testing with Molecule.