Off-CPU and Scheduler Latency: Measuring the Time a Process Is NOT Running

K
Kai··5 min read

On-CPU profiling (Article 17) answers "what is the CPU busy doing". But a slow application is usually not slow because the CPU is burning — it's because it sits waiting: waiting for disk, waiting for the network, waiting for a lock, or waiting for its turn to be given a CPU. That time is off-CPU, and on-CPU profiling sees nothing (the process isn't running, so there's nothing to sample). This article measures exactly that gap, through the scheduler's tracepoints.

Two different off-CPU questions

"Off-CPU" lumps together two things, so let's separate them clearly:

  1. Run-queue latency — the task is already ready to run (it was woken up), but has to wait for a free CPU core. This is latency due to the scheduler/CPU contention, not because the task itself is waiting for anything.
  2. Off-CPU time — the task yields the CPU itself to wait for an event (I/O done, a lock released, a sleep finishing). This is the blocked time.

Both are read from the scheduler tracepoints: sched_wakeup (a task becomes ready) and sched_switch (the CPU switches from one task to another).

Run-queue latency: waiting for a CPU turn

Measure the interval from the wakeup (sched_wakeup) to when it actually runs (sched_switch picks it as next):

sudo bpftrace -e '
tracepoint:sched:sched_wakeup     { @qt[args.pid] = nsecs; }
tracepoint:sched:sched_wakeup_new { @qt[args.pid] = nsecs; }
tracepoint:sched:sched_switch /@qt[args.next_pid]/ {
  @runq_us = hist((nsecs - @qt[args.next_pid]) / 1000);   // microseconds
  delete(@qt[args.next_pid]);
}'

sched_wakeup records a timestamp keyed by pid (the task being woken up); when sched_switch picks that exact pid as the next task (next_pid), the difference between the two marks is the time waiting in the run queue. Create scheduling pressure (4 busy tasks on 2 CPUs) then measure:

@runq_us:
[1]                 1428 |@@@@@@@@@@@@                                        |
[2, 4)              1692 |@@@@@@@@@@@@@@@                                     |
[4, 8)              5787 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)              919 |@@@@@@@@                                            |
[16, 32)             355 |@@@                                                 |
...
[2K, 4K)              10 |                                                    |
[16K, 32K)             2 |                                                    |

The unit is microseconds. Most tasks get onto the CPU after 4-8µs — fast. But there's a tail reaching [16K, 32K): a few times a task had to wait 16-32ms before running. That is precisely the sign of CPU contention — more tasks are ready than there are cores, so a few have to queue for a long time. On a node overloaded on CPU, this tail is what makes an application stutter while on-CPU profiling never points it out. This is exactly BCC's runqlat tool, built in a few lines.

Off-CPU time: how long blocked

A different question: each time it leaves the CPU, how long does the task stay off? Record the mark when it is switched out (prev_pid), compute the difference when it comes back (next_pid):

sudo bpftrace -e '
tracepoint:sched:sched_switch {
  @off[args.prev_pid] = nsecs;                         // just left the CPU
  $t = @off[args.next_pid];
  if ($t) { @offcpu_ms = hist((nsecs - $t) / 1000000); delete(@off[args.next_pid]); }
}'
@offcpu_ms:
[0]                22137 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                 3005 |@@@@@@@                                             |
[8, 16)             1248 |@@                                                  |
[64, 128)            431 |@                                                   |
[256, 512)           291 |                                                    |
[512, 1K)            180 |                                                    |
[4K, 8K)               4 |                                                    |

The unit is milliseconds. Most off-CPU intervals are under 1ms (an ordinary context switch), but the tail stretches to [4K, 8K) — 4 to 8 seconds off-CPU. Those are tasks blocked for a long time: a daemon sleeping while waiting for an event, a process waiting for I/O, a sleep in a loop. This tail is the real time a user perceives as "slow/hung", and it's invisible to on-CPU profiling — the task isn't running, so there's nothing to sample.

A note on reading the numbers: the [0] bucket dominates because most context switches are instantaneous; the tail is the part worth inspecting. To know where in the kernel a task is blocked, switch the key from pid to kstack at the moment it's switched out — that gives off-CPU by stack, which combined with on-CPU forms the complete picture.

Why this can be measured in the kernel without cost

Both measurements attach only to scheduler tracepoints — kernel points that already run on every context switch. The eBPF program only records a timestamp into a map and adds to a histogram; it pushes no events out to userspace. The scheduler is an extremely hot path (hundreds of thousands of times/sec), but because the aggregation happens right in the kernel (Article 7), the added cost is negligible — so it can be measured even in production.

🧹 Cleanup

bpftrace unloads automatically on exit; the node returns to 140 programs. Commands at github.com/nghiadaulau/ebpf-from-scratch, directory 18-off-cpu.

Wrap-up

Most latency lies in the time a process is not running — something on-CPU profiling (Article 17) doesn't see. eBPF measures off-CPU through scheduler tracepoints. Run-queue latency (sched_wakeupsched_switch picks next_pid) is the time waiting for a CPU turn: under contention load, most are 4-8µs but the tail reaches 16-32ms — a sign of CPU shortage. Off-CPU time (sched_switch: mark when prev leaves, diff when next returns) is the blocked time: most under 1ms but the tail reaches several seconds for tasks waiting on I/O/lock/sleep. Both aggregate histograms in the kernel (BCC's runqlat/offcputime), with almost no perturbation since they only add work to scheduler tracepoints that already run. On-CPU (Article 17) + off-CPU (this article) = the complete performance picture: where the CPU burns, and where the wait time goes.

Article 19 closes Part VI with how the Kubernetes ecosystem turns all these eBPF events into something observable cluster-wide: Hubble — dissecting how Cilium pushes events from the eBPF datapath (via the cilium_events perf/ring buffer) up into a network flow picture labeled with pod, service, and policy verdict.