Off-CPU and Scheduler Latency: Measuring the Time a Process Is NOT Running
On-CPU profiling (Article 17) answers "what is the CPU busy doing". But a slow application is usually not slow because the CPU is burning — it's because it sits waiting: waiting for disk, waiting for the network, waiting for a lock, or waiting for its turn to be given a CPU. That time is off-CPU, and on-CPU profiling sees nothing (the process isn't running, so there's nothing to sample). This article measures exactly that gap, through the scheduler's tracepoints.
Two different off-CPU questions
"Off-CPU" lumps together two things, so let's separate them clearly:
- Run-queue latency — the task is already ready to run (it was woken up), but has to wait for a free CPU core. This is latency due to the scheduler/CPU contention, not because the task itself is waiting for anything.
- Off-CPU time — the task yields the CPU itself to wait for an event (I/O done, a lock released, a
sleepfinishing). This is the blocked time.
Both are read from the scheduler tracepoints: sched_wakeup (a task becomes ready) and sched_switch (the CPU switches from one task to another).
Run-queue latency: waiting for a CPU turn
Measure the interval from the wakeup (sched_wakeup) to when it actually runs (sched_switch picks it as next):
sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qt[args.pid] = nsecs; }
tracepoint:sched:sched_wakeup_new { @qt[args.pid] = nsecs; }
tracepoint:sched:sched_switch /@qt[args.next_pid]/ {
@runq_us = hist((nsecs - @qt[args.next_pid]) / 1000); // microseconds
delete(@qt[args.next_pid]);
}'
sched_wakeup records a timestamp keyed by pid (the task being woken up); when sched_switch picks that exact pid as the next task (next_pid), the difference between the two marks is the time waiting in the run queue. Create scheduling pressure (4 busy tasks on 2 CPUs) then measure:
@runq_us:
[1] 1428 |@@@@@@@@@@@@ |
[2, 4) 1692 |@@@@@@@@@@@@@@@ |
[4, 8) 5787 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 919 |@@@@@@@@ |
[16, 32) 355 |@@@ |
...
[2K, 4K) 10 | |
[16K, 32K) 2 | |
The unit is microseconds. Most tasks get onto the CPU after 4-8µs — fast. But there's a tail reaching [16K, 32K): a few times a task had to wait 16-32ms before running. That is precisely the sign of CPU contention — more tasks are ready than there are cores, so a few have to queue for a long time. On a node overloaded on CPU, this tail is what makes an application stutter while on-CPU profiling never points it out. This is exactly BCC's runqlat tool, built in a few lines.
Off-CPU time: how long blocked
A different question: each time it leaves the CPU, how long does the task stay off? Record the mark when it is switched out (prev_pid), compute the difference when it comes back (next_pid):
sudo bpftrace -e '
tracepoint:sched:sched_switch {
@off[args.prev_pid] = nsecs; // just left the CPU
$t = @off[args.next_pid];
if ($t) { @offcpu_ms = hist((nsecs - $t) / 1000000); delete(@off[args.next_pid]); }
}'
@offcpu_ms:
[0] 22137 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 3005 |@@@@@@@ |
[8, 16) 1248 |@@ |
[64, 128) 431 |@ |
[256, 512) 291 | |
[512, 1K) 180 | |
[4K, 8K) 4 | |
The unit is milliseconds. Most off-CPU intervals are under 1ms (an ordinary context switch), but the tail stretches to [4K, 8K) — 4 to 8 seconds off-CPU. Those are tasks blocked for a long time: a daemon sleeping while waiting for an event, a process waiting for I/O, a sleep in a loop. This tail is the real time a user perceives as "slow/hung", and it's invisible to on-CPU profiling — the task isn't running, so there's nothing to sample.
A note on reading the numbers: the
[0]bucket dominates because most context switches are instantaneous; the tail is the part worth inspecting. To know where in the kernel a task is blocked, switch the key frompidtokstackat the moment it's switched out — that gives off-CPU by stack, which combined with on-CPU forms the complete picture.
Why this can be measured in the kernel without cost
Both measurements attach only to scheduler tracepoints — kernel points that already run on every context switch. The eBPF program only records a timestamp into a map and adds to a histogram; it pushes no events out to userspace. The scheduler is an extremely hot path (hundreds of thousands of times/sec), but because the aggregation happens right in the kernel (Article 7), the added cost is negligible — so it can be measured even in production.
🧹 Cleanup
bpftrace unloads automatically on exit; the node returns to 140 programs. Commands at github.com/nghiadaulau/ebpf-from-scratch, directory 18-off-cpu.
Wrap-up
Most latency lies in the time a process is not running — something on-CPU profiling (Article 17) doesn't see. eBPF measures off-CPU through scheduler tracepoints. Run-queue latency (sched_wakeup → sched_switch picks next_pid) is the time waiting for a CPU turn: under contention load, most are 4-8µs but the tail reaches 16-32ms — a sign of CPU shortage. Off-CPU time (sched_switch: mark when prev leaves, diff when next returns) is the blocked time: most under 1ms but the tail reaches several seconds for tasks waiting on I/O/lock/sleep. Both aggregate histograms in the kernel (BCC's runqlat/offcputime), with almost no perturbation since they only add work to scheduler tracepoints that already run. On-CPU (Article 17) + off-CPU (this article) = the complete performance picture: where the CPU burns, and where the wait time goes.
Article 19 closes Part VI with how the Kubernetes ecosystem turns all these eBPF events into something observable cluster-wide: Hubble — dissecting how Cilium pushes events from the eBPF datapath (via the cilium_events perf/ring buffer) up into a network flow picture labeled with pod, service, and policy verdict.