CPU Profiling with perf_event: Sampling Stacks, the Foundation of Flame Graphs

Part V was enforcement. Part VI returns to observation, but at the performance layer — the question "what is the CPU busy doing, where does the time go". The first article is the most foundational technique: CPU profiling by sampling, through the perf_event eBPF program type.

Sampling: instead of counting everything, snapshot periodically

Article 7 measured latency with a kprobe on every function call: accurate, but expensive when events are dense. Profiling takes a different route: statistical sampling. A few dozen to a few hundred times per second, we interrupt each CPU and record the stack that was running at that instant. A function that shows up in many samples is one that consumes a lot of CPU time. There is no need to measure every call, only to collect enough samples for the proportions to be statistically reliable.
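To make "enough samples" concrete, here is a minimal sketch that simulates sampling a single CPU. It assumes a hypothetical function that truly occupies 30% of CPU time and treats each sample as an independent coin flip, which is a simplification of real workloads. The names estimate_share and standard_error are illustrative only.

```python
import random


def estimate_share(true_share, hz=99, seconds=30, seed=0):
    """Simulate sampling one CPU; return the fraction of samples that hit the function."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    samples = hz * seconds
    hits = sum(1 for _ in range(samples) if rng.random() < true_share)
    return hits / samples


def standard_error(true_share, samples):
    """Binomial standard error of the estimated share (independent-sample assumption)."""
    return (true_share * (1.0 - true_share) / samples) ** 0.5


# Longer runs collect more samples, so the estimate tightens around the true 30%.
for seconds in (1, 10, 60):
    n = 99 * seconds
    est = estimate_share(0.30, seconds=seconds)
    print(f"{seconds:>3}s  {n:>5} samples  estimate={est:.3f}  +/-{standard_error(0.30, n):.3f}")
```

The error shrinks with the square root of the sample count, which is why a few seconds at 99Hz is usually enough to rank the hot stacks.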

The mechanism is perf_event: the kernel has perf infrastructure that provides counters, including a "CPU clock" counter (PERF_COUNT_SW_CPU_CLOCK) that fires steadily at a preset frequency. eBPF attaches a perf_event program to that counter; each time it fires, the program runs on the CPU that was just interrupted, captures the current stack and aggregates it.

   perf counter (CPU clock) fires 99 times/sec, on EVERY CPU
            │
            ▼
   eBPF perf_event program runs (interrupt context)
            │
            ├── bpf_get_stackid() -> store the stack in a stackmap
            └── @[stack] = count()    (aggregate: identical stacks add up)
            │
            ▼
   userspace reads the map: the stack with the most samples = hottest
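The aggregation step in the diagram can be modelled in a few lines of userspace Python. This is a toy sketch, not kernel code: the StackMap class and on_sample function are hypothetical stand-ins, and the ids are handed out sequentially, whereas the real bpf_get_stackid derives an id from a hash of the stack.

```python
from collections import Counter


class StackMap:
    """Toy stand-in for a BPF stack-trace map: each distinct stack is stored once."""

    def __init__(self):
        self._ids = {}     # stack (tuple of frames) -> id
        self._stacks = []  # id -> stack

    def get_stackid(self, frames):
        key = tuple(frames)
        if key not in self._ids:
            self._ids[key] = len(self._stacks)
            self._stacks.append(key)
        return self._ids[key]

    def lookup(self, stack_id):
        return self._stacks[stack_id]


def on_sample(stackmap, counts, frames):
    """What the program does on each counter overflow: id the stack, bump its count."""
    counts[stackmap.get_stackid(frames)] += 1


stackmap = StackMap()
counts = Counter()
idle = ("pv_native_safe_halt", "arch_cpu_idle", "default_idle_call")
read = ("rep_stos_alternative", "vfs_read", "ksys_read")
for frames in (idle, read, idle):
    on_sample(stackmap, counts, frames)

# "Userspace" side: the stack with the most samples is the hottest.
hottest_id, hottest_count = counts.most_common(1)[0]
print(hottest_count, stackmap.lookup(hottest_id))
```

Identical stacks collapse to one id, so the map grows with the number of distinct stacks rather than with the number of samples.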

Profile a node at 99Hz

bpftrace wraps the mechanism above in the profile probe. Sample the kernel stack 99 times/sec on every CPU, aggregating by stack:

sudo bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

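Why 99 and not 100? To avoid lock-step. Many periodic activities run at round frequencies (a 100Hz timer tick and its multiples, for example), so sampling at exactly 100Hz can fall into rhythm with them and skew the result. 99Hz is deliberately just off that round number, so the samples drift across the period instead of always landing at the same phase. Note that 99 is not prime (it is 9 × 11); what matters is that it does not line up with a 100Hz cadence. (A Brendan Gregg trick.)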
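The lock-step effect is easy to reproduce on paper. The sketch below assumes a hypothetical periodic activity that starts every 10ms and runs for 1ms (so it truly uses 10% of the CPU), and uses exact fractions to avoid floating-point drift. The function name hit_ratio is illustrative.

```python
from fractions import Fraction


def hit_ratio(sample_hz, seconds=10,
              period=Fraction(1, 100), busy=Fraction(1, 1000)):
    """Fraction of samples that land inside the periodic activity.

    Hypothetical activity: starts every `period` seconds, runs for `busy` seconds.
    """
    total = sample_hz * seconds
    hits = 0
    for i in range(total):
        t = Fraction(i, sample_hz)  # exact time of the i-th sample
        if t % period < busy:       # position of the sample inside the period
            hits += 1
    return Fraction(hits, total)


print(float(hit_ratio(100)))  # 1.0: every sample hits the same phase
print(float(hit_ratio(99)))   # about 0.101: close to the true 10%
```

At 100Hz every sample lands on the same phase of the activity, so the profile reports 100% (or, with a different starting phase, 0%). At 99Hz the sampling point slides through the period and the estimate comes out near the true share.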

Put a little load on the node (dd if=/dev/zero of=/dev/null) then profile, and the hot kernel stacks appear:

@[
    rep_stos_alternative+75
    vfs_read+186
    ksys_read+113
    __x64_sys_read+25
    do_syscall_64+128
    entry_SYSCALL_64_after_hwframe+118
]: 95

@[
    pv_native_safe_halt+11
    arch_cpu_idle+9
    default_idle_call+48
    do_idle+127
    cpu_startup_entry+41
    start_secondary+296
]: 383

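Both stacks can be read directly, from the bottom line up. The first is dd reading /dev/zero: entry_SYSCALL_64_after_hwframe → do_syscall_64 → __x64_sys_read → ksys_read → vfs_read → rep_stos_alternative. That last frame is a kernel routine rather than a single instruction: it is the fallback used in place of the x86 rep stos instruction to zero-fill the user buffer. This is the kernel read path, with 95 samples at this one instruction offset alone; samples at other offsets in the same function are counted as separate stacks. The second stack is the idle loop: do_idle → default_idle_call → arch_cpu_idle → pv_native_safe_halt. The CPU is halted waiting for work, 383 samples. Profiling exposes both: where the CPU does real work and where it is idle.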

Aggregate by process: who's eating CPU

Switch the key from stack to process name to see who takes the CPU:

sudo bpftrace -e 'profile:hz:99 { @samples[comm] = count(); }'

@samples[kubelet]:       11
@samples[containerd-shim]: 9
@samples[cilium-agent]:   4
@samples[swapper/1]:    215
@samples[swapper/0]:    237
@samples[dd]:           479

dd dominates (479 samples), exactly the load we created. swapper/0 and swapper/1 are the idle tasks of CPU 0 and CPU 1 (each CPU has its own swapper); a sample landing in swapper means that CPU was idle at that instant. Together they account for 452 samples, the idle time of those two CPUs over the run. The rest is cluster background activity: kubelet, containerd-shim, cilium-agent. This is a whole-node picture of CPU distribution, captured in a few seconds, with very little perturbation of the system.
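Raw sample counts are easier to compare once converted to shares and CPU-seconds. A small sketch using the counts from the output above; the only assumption is that each sample stands for 1/99 of a second on one CPU, which follows from the 99Hz rate. The summarize helper is illustrative.

```python
SAMPLE_HZ = 99

# Counts copied from the bpftrace output above.
samples = {
    "kubelet": 11,
    "containerd-shim": 9,
    "cilium-agent": 4,
    "swapper/1": 215,
    "swapper/0": 237,
    "dd": 479,
}


def summarize(samples, hz):
    """Return (total, rows); each row is (comm, count, share, cpu_seconds), largest first."""
    total = sum(samples.values())
    rows = [
        (comm, n, n / total, n / hz)
        for comm, n in sorted(samples.items(), key=lambda kv: kv[1], reverse=True)
    ]
    return total, rows


total, rows = summarize(samples, SAMPLE_HZ)
for comm, n, share, cpu_s in rows:
    print(f"{comm:<16} {n:>4}  {share:6.1%}  {cpu_s:5.2f} CPU-s")

idle_share = sum(n for comm, n in samples.items() if comm.startswith("swapper")) / total
print(f"total={total}  idle={idle_share:.1%}")
```

With these numbers dd accounts for roughly half of all samples and the two swapper tasks for a little under half.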

This really is eBPF

Count the eBPF programs while the profiler runs:

during profiling: 141 progs; perf_event progs: 1
after:            140 progs

Exactly one perf_event program is loaded during profiling (140 → 141), then unloaded on exit (back to 140) — bpftrace profile is itself an eBPF perf_event program attached to the CPU clock counter, not a separate tool. (Compare Article 0: the node always returns to 140 of Cilium's background programs.)
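One way to produce such a count is to parse bpftool's JSON listing. The sketch below is an assumption about how the numbers could be gathered, not the exact command used for this article. It expects the text printed by `bpftool prog show --json`, a JSON array in which each program has a "type" field. The sample input is hypothetical and far shorter than a real node's list.

```python
import json
from collections import Counter


def count_prog_types(bpftool_json):
    """Return (total programs, Counter of program types) from bpftool's JSON output."""
    progs = json.loads(bpftool_json)
    return len(progs), Counter(p.get("type", "unknown") for p in progs)


# Hypothetical, shortened sample of what the JSON listing looks like.
sample = json.dumps([
    {"id": 1, "type": "sched_cls"},
    {"id": 2, "type": "cgroup_skb"},
    {"id": 3, "type": "perf_event"},
])

total, by_type = count_prog_types(sample)
print(f"{total} progs; perf_event progs: {by_type['perf_event']}")
```

On a real node the function would be fed the captured output of the bpftool command, once while the profiler runs and once after it exits.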

The foundation of flame graphs

The @[kstack] = count() data is the input to a flame graph. Each stack is "folded" into a line func1;func2;func3 <count>, then the flamegraph.pl tool draws it as a chart: the horizontal axis is the proportion of samples (width = CPU time), the vertical axis is stack depth. A function that is wide at the top is where the CPU is actually burning time. All the aggregation is done in the kernel (stackmap + count); userspace only receives the counted list of stacks, so you can profile even production systems without adding much load. This is how Parca, Pyroscope, and eBPF-enabled perf build continuous profilers for entire clusters.
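The folding step can be sketched in a few lines. This is a minimal illustration, assuming the bpftrace output format shown earlier (one frame per line, leaf first, with a +offset suffix). The fold function and its regular expression are illustrative, not part of any existing tool.

```python
import re

# One "@[ ... ]: N" block per stack, as printed by bpftrace.
BLOCK = re.compile(r"@\[\n(.*?)\n\]: (\d+)", re.S)

OUTPUT = """@[
    rep_stos_alternative+75
    vfs_read+186
    ksys_read+113
    __x64_sys_read+25
    do_syscall_64+128
    entry_SYSCALL_64_after_hwframe+118
]: 95
"""


def fold(bpftrace_output):
    """Turn bpftrace stack blocks into folded lines: root;...;leaf <count>."""
    lines = []
    for frames, count in BLOCK.findall(bpftrace_output):
        # Drop the +offset so samples in the same function merge in the graph.
        names = [re.sub(r"\+\d+$", "", f.strip()) for f in frames.splitlines() if f.strip()]
        names.reverse()  # bpftrace prints the leaf first; folded format wants the root first
        lines.append(";".join(names) + " " + count)
    return lines


for line in fold(OUTPUT):
    print(line)
```

Stripping the offsets is a design choice: it merges the separate per-offset stacks into one line per call path, which is what a flame graph needs.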

🧹 Cleanup

bpftrace automatically unloads the perf_event program + stackmap on exit; there's nothing to clean by hand, and the node returns to 140 programs. Commands at github.com/nghiadaulau/ebpf-from-scratch, directory 17-cpu-profiling.

Wrap-up

CPU profiling does not measure every call; it samples statistically. Attach an eBPF perf_event program to the kernel's CPU-clock counter, and each time it fires (99 times/sec per CPU) it captures the current stack with bpf_get_stackid into a stackmap and aggregates with count() right in the kernel. Profiling a real node shows dd reading /dev/zero through vfs_read → rep_stos_alternative and the idle CPUs sitting in the idle loop do_idle → pv_native_safe_halt; aggregating by process gives dd 479 samples and swapper/0 plus swapper/1 (the per-CPU idle tasks) 452. Use 99Hz, a rate chosen to sit just off 100Hz rather than for being prime (it is 9 × 11), to avoid lock-step with activity that runs on a 100Hz cadence. Counting the programs confirms this is eBPF (140 → 141, exactly 1 perf_event). This counted-stack data is exactly the input to a flame graph, and because the aggregation happens in the kernel you can profile production continuously, which is the foundation of Parca and Pyroscope.

Article 18 asks the opposite question: not "what is the CPU busy with" but "why is a process not running" — measuring off-CPU time (waiting for I/O, waiting for a lock, waiting for a CPU turn) via the scheduler tracepoints, to see the latency that on-CPU profiling never sees.