bpftrace: Maps, Counting and Histograms

Kai · 4 min read

Article 6 printed one line per event. That is fine when events are sparse, but tracing "every syscall" or "every disk read" floods the screen instantly. The real power of bpftrace (and of eBPF in general) is aggregating data inside the kernel: counting, building distributions on the spot, then returning only a summary. This is what makes eBPF usable for production observability: millions of events never have to be copied out to userspace.

Map: aggregate by key

In bpftrace, any variable whose name starts with @ is a map: the same BPF map from Article 3, created automatically by bpftrace. Writing @name[key] = aggregation accumulates a value per key. For example, to count which process calls openat the most:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @opens[comm] = count(); }'
@opens[bash]:             6
@opens[cilium-agent]:    30
@opens[containerd]:      43
@opens[kubelet]:         85
@opens[cat]:             93
@opens[ls]:             123
@opens[iptables]:       360

In @opens[comm] = count(), the key is the process name and the value is a counter that increments each time the probe fires. The counting happens inside the kernel; when you stop bpftrace (Ctrl-C), it prints the whole map once. On this cluster iptables opens the most files (Cilium/kube-proxy manipulating rules), followed by ls and cat. Change the key to str(args.filename) to see which file is opened the most (older bpftrace versions spell it args->filename); change count() to sum(...) to accumulate a quantity instead.
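To make the keyed aggregation concrete, here is a minimal Python model of what the map accumulates. It only illustrates the semantics, not how bpftrace implements them; the event list, the byte sizes, and the name `total` are made up for the example.

```python
from collections import defaultdict

# Hypothetical event stream: one (process name, bytes) pair per probe firing.
events = [("ls", 100), ("cat", 4096), ("ls", 200), ("kubelet", 50), ("ls", 300)]

opens = defaultdict(int)   # models @opens[comm] = count()
total = defaultdict(int)   # models a sum(...) aggregation with the same key

for comm, nbytes in events:
    opens[comm] += 1        # count(): +1 each time the probe fires
    total[comm] += nbytes   # sum(...): accumulate a quantity instead

# Print once at the end, smallest value first, like the output above.
for comm, n in sorted(opens.items(), key=lambda kv: kv[1]):
    print(f"@opens[{comm}]: {n}")
```

Only the final per-key totals are printed, however long the event stream is. That is the property that keeps the real tool cheap.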

Histogram: see the distribution, not just the average

hist() builds a power-of-2 histogram (lhist() uses linear intervals) right inside the kernel. A histogram shows what an average hides: an average of 1ms can mask a rare 100ms tail that is itself the source of an incident.
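The sketch below models power-of-2 bucketing in Python and shows a mean of about 1ms sitting on top of a 100ms outlier. It is a simplified model for positive values only, not bpftrace's implementation; the sample latencies and the helper names `bucket` and `label` are hypothetical.

```python
def bucket(ns):
    """Return the power-of-2 interval [lo, hi) containing a positive integer."""
    lo = 1 << (ns.bit_length() - 1)
    return lo, lo * 2

def label(n):
    """Format a bound with binary suffixes (K = 1024, M = 1024**2)."""
    if n >= 1 << 20:
        return f"{n >> 20}M"
    if n >= 1 << 10:
        return f"{n >> 10}K"
    return str(n)

# Hypothetical latencies in ns: 999 calls at 0.9 ms and one at 100 ms.
samples = [900_000] * 999 + [100_000_000]

mean_ns = sum(samples) / len(samples)
print(f"mean = {mean_ns / 1e6:.2f} ms")   # mean = 1.00 ms

hist = {}
for ns in samples:
    b = bucket(ns)
    hist[b] = hist.get(b, 0) + 1

for (lo, hi), n in sorted(hist.items()):
    print(f"[{label(lo)}, {label(hi)}) {n}")
# [512K, 1M) 999
# [64M, 128M) 1
```

The mean alone reads as "about 1ms", while the two buckets show the outlier directly.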

Measuring vfs_read latency

A common pattern: record a timestamp in a kprobe (when the function is entered), then compute the difference in a kretprobe (when it returns). Here it measures the latency of the kernel's vfs_read function:

sudo bpftrace -e '
  kprobe:vfs_read { @s[tid] = nsecs; }
  kretprobe:vfs_read /@s[tid]/ { @ns = hist(nsecs - @s[tid]); delete(@s[tid]); }'
@ns:
[256, 512)    973 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
[512, 1K)    1306 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)     1145 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[2K, 4K)     1273 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
[4K, 8K)      792 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[8K, 16K)     153 |@@@@@@                                            |
[16K, 32K)     78 |@@@                                               |
[64K, 128K)    29 |@                                                 |
[128K, 256K)    6 |                                                  |
[2M, 4M)        1 |                                                  |

Read the mechanism piece by piece:

  • kprobe:vfs_read { @s[tid] = nsecs; }: when vfs_read is entered, store the timestamp (nsecs) in map @s keyed by tid (thread id). Keying by tid is mandatory: many threads run vfs_read at the same time, and each needs its own timestamp.
  • kretprobe:vfs_read /@s[tid]/ { ... }: when the function returns, the filter /@s[tid]/ lets the action run only if a start timestamp exists for this thread. Without it, a call that was already in progress when tracing started would be measured against a missing (zero) start and produce a huge bogus latency. nsecs - @s[tid] is the latency.
  • @ns = hist(...): feed the latency into a histogram; delete(@s[tid]) frees the used entry so the map doesn't grow unbounded.
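The three steps above can be modelled in a few lines of Python. This is a sketch of the pairing logic only: the event list is a hypothetical trace with two interleaved threads, plus one return whose entry happened before tracing began.

```python
# Hypothetical trace: (probe, tid, timestamp in ns).
events = [
    ("return", 7, 50),     # call began before tracing: no stored start
    ("enter", 1, 100),
    ("enter", 2, 150),     # thread 2 enters while thread 1 is still inside
    ("return", 1, 900),
    ("return", 2, 5150),
]

starts = {}      # models @s[tid]
latencies = []   # values that would be fed to hist()

for probe, tid, now in events:
    if probe == "enter":
        starts[tid] = now                      # @s[tid] = nsecs
    elif starts.get(tid, 0):                   # filter /@s[tid]/
        latencies.append(now - starts[tid])    # nsecs - @s[tid]
        del starts[tid]                        # delete(@s[tid])

print(latencies)   # [800, 5000]
print(starts)      # {}
```

With a single shared start variable instead of a per-tid map, thread 2's entry at 150 would overwrite thread 1's start, and thread 1's return would report 750 instead of 800. The empty `starts` at the end shows the effect of the delete: no entries are left behind.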

The unit is nanoseconds. Most vfs_read calls take 512ns–4µs (reads served from the page cache, fast), but there is a tail reaching [2M, 4M): a single read taking 2–4ms. That one may have hit the actual disk, though vfs_read also covers pipes, sockets and terminals, where a read can simply block waiting for data. An average would make that tail disappear, while the histogram exposes it immediately. The measurement and the histogram construction both run inside the kernel; userspace receives only a few dozen lines of summary at most, even though vfs_read fires thousands of times per second.
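As a quick check on that reading, the shares can be recomputed from the printed counts. The bucket list below is copied from the @ns output above; the variable names are just for this sketch. Bounds are powers of two, so "4K" means 4096ns, slightly more than 4µs.

```python
# (lower bound in ns, count) copied from the @ns output above.
K, M = 1024, 1024 * 1024
buckets = [
    (256, 973), (512, 1306), (1 * K, 1145), (2 * K, 1273), (4 * K, 792),
    (8 * K, 153), (16 * K, 78), (64 * K, 29), (128 * K, 6), (2 * M, 1),
]

total = sum(n for _, n in buckets)
under_4k = sum(n for lo, n in buckets if lo < 4 * K)
mid = sum(n for lo, n in buckets if 512 <= lo < 4 * K)

print(f"total calls:    {total}")                   # 5756
print(f"512 ns to 4 us: {mid / total:.1%}")         # 64.7%
print(f"under 4 us:     {under_4k / total:.1%}")    # 81.6%
print(f"2 to 4 ms tail: 1 call in {total}")
```

Roughly two thirds of the calls fall in the 512ns–4µs range and about four in five finish under 4µs, while the slow read is a single call out of 5756.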

🧹 Cleanup

bpftrace detaches its programs and frees its maps on exit, so there is nothing to clean up by hand; the node goes back to 140 programs. The commands are at github.com/nghiadaulau/ebpf-from-scratch, directory 07-bpftrace-aggregation.

Wrap-up

bpftrace's strength is aggregating data inside the kernel and returning only a summary, rather than pumping every event out to userspace. The map form @name[key] = aggregation (built on the BPF map of Article 3) counts or sums by key: @opens[comm] = count() showed iptables opening the most files on the cluster. hist() builds a distribution right inside the kernel and prints it as ASCII bars, exposing the tail that an average hides. The common latency-measurement pattern: a kprobe stores nsecs keyed by tid, and a kretprobe computes the difference and feeds it to hist(). Measuring vfs_read this way showed most calls under 4µs, with the occasional one reaching 2–4ms. Key by tid to keep concurrent threads separate, and call delete() to keep the map from growing. Everything is aggregated inside the kernel, which stays efficient even with millions of events.

Article 8 extends bpftrace's reach beyond the kernel: uprobe and USDT — attaching to functions in userspace programs (including inside containers), and using bpftrace to inspect what a real pod on the cluster is doing.