Metrics, Traces, and API Priority and Fairness
Article 65 covered logs, which record discrete events. Observability also needs continuous measurements: how many requests are in flight, what latency looks like, whether a pod keeps restarting. Kubernetes components expose these through a /metrics endpoint. This article looks at those metrics, then digs into a mechanism that is both an observability signal and a form of self-defense: API Priority and Fairness (APF), which we glimpsed in the apiserver logs in Article 65.
The /metrics endpoint
Both the apiserver and the kubelet expose /metrics in Prometheus format — plain text, one measurement per line with labels:
kubectl get --raw /metrics | grep -E "^(apiserver_current_inflight_requests|apiserver_request_total)" | head
apiserver_current_inflight_requests{request_kind="mutating"} 0
apiserver_current_inflight_requests{request_kind="readOnly"} 1
apiserver_request_total{code="0",resource="pods",subresource="exec",verb="CONNECT",version="v1"} 18
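apiserver_current_inflight_requests is a gauge: how many requests are being processed, split into mutating and read-only (strictly, the highest count seen during the last second). apiserver_request_total is a counter: a cumulative total since the apiserver started, broken down by resource, verb, and response code. The kubelet has its own set: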
kubectl get --raw /api/v1/nodes/worker-0/proxy/metrics | grep kubelet_running_containers
kubelet_running_containers{container_state="created"} 1
kubelet_running_containers{container_state="exited"} 6
kubelet_running_containers{container_state="running"} 14
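Both outputs follow the same line grammar: a metric name, optional labels in braces, then a value. To make that concrete, here is a minimal Python sketch that turns sample lines into (name, labels, value) tuples. It assumes well-formed input and ignores parts of the format it doesn't need here (escaped characters inside label values, timestamps, exemplars); for real work, a proper parser such as text_string_to_metric_families from the prometheus_client package is the safer choice.

```python
import re

# One sample line: metric name, optional {labels}, then the value (timestamps ignored).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, e.g. verb="CONNECT" (no escape handling).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line in Prometheus text format."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, # HELP and # TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


kubelet_output = """\
kubelet_running_containers{container_state="created"} 1
kubelet_running_containers{container_state="exited"} 6
kubelet_running_containers{container_state="running"} 14
"""

by_state = {
    labels["container_state"]: value
    for name, labels, value in parse_samples(kubelet_output)
    if name == "kubelet_running_containers"
}
print(by_state)  # {'created': 1.0, 'exited': 6.0, 'running': 14.0}
```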
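This is raw data: each request to the endpoint returns only the current values. Prometheus (or an equivalent) scrapes the endpoints periodically and stores the results as time series for charting and alerting. Kubernetes itself stores no metrics; as with logs, it only exposes them, and collection is the job of an external system. (metrics-server from Article 39 is something else: it collects only CPU and memory usage from each kubelet and serves the latest values through the Metrics API for HPA and kubectl top. It keeps no history, so it is not a metrics store.)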
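Because a counter like apiserver_request_total only ever grows, its raw value says little; what a dashboard charts is its rate between scrapes. The Python sketch below shows the core idea behind PromQL's rate() for two consecutive scrapes, including the rule that a drop in value means the counter was reset (for example, the apiserver restarted), so the new value counts as the increase since zero. It is a simplification: Prometheus's actual rate() works over a range of samples and extrapolates to the window edges. The timestamps and values in the usage lines are made up for illustration.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Approximate per-second increase of a counter between two scrapes.

    Timestamps are in seconds. If the counter went down, assume it was reset
    to zero in between, so the whole current value counts as the increase.
    """
    if curr_ts <= prev_ts:
        raise ValueError("scrapes must be in chronological order")
    increase = curr_value - prev_value
    if increase < 0:  # counter reset, e.g. the process restarted
        increase = curr_value
    return increase / (curr_ts - prev_ts)


# Hypothetical scrapes of one apiserver_request_total series, 30 s apart.
print(per_second_rate(18, 0, 48, 30))   # 1.0  (30 new requests in 30 s)
print(per_second_rate(48, 30, 12, 60))  # 0.4  (reset, then 12 requests)
```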
API Priority and Fairness
The apiserver can only work on a limited number of requests at once (with APF, the sum of --max-requests-inflight and --max-mutating-requests-inflight, 600 by default). If a buggy controller spins in a loop and hammers the API, it can consume that entire budget and starve leader election and node heartbeats along with it, and the cluster loses stability. APF (GA since Kubernetes 1.29) prevents that: it classifies requests into priority levels, each with its own concurrency limit, so one noisy stream can't choke the others.
Two kinds of object control it. PriorityLevelConfiguration defines the levels and each level's share of the apiserver's concurrency:
kubectl get prioritylevelconfigurations
NAME              TYPE      NOMINALCONCURRENCYSHARES   QUEUES   HANDSIZE   QUEUELENGTHLIMIT
catch-all         Limited   5                          <none>   <none>     <none>
exempt            Exempt    <none>                     <none>   <none>     <none>
leader-election   Limited   10                         16       4          50
node-high         Limited   40                         64       6          50
system            Limited   30                         64       6          50
workload-high     Limited   40                         128      6          50
workload-low      Limited   100                        128      6          50
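The shares are relative weights, not seat counts. The apiserver's total concurrency (600 by default: --max-requests-inflight=400 plus --max-mutating-requests-inflight=200) is split among the Limited levels in proportion to their NominalConcurrencyShares, rounding up: NominalCL(i) = ceil(ServerCL * NCS(i) / sum of all NCS). The Python sketch below applies that formula to the shares in the table. It assumes the default 600 and that the Limited levels printed above are the only ones; any additional Limited level in a real cluster adds to the denominator and shrinks every result. It also ignores lending and borrowing between levels (lendablePercent, borrowingLimitPercent), which let an idle level's seats serve a busy one.

```python
# Nominal concurrency shares of the Limited levels from the table above
# (the Exempt level has no shares and is not limited).
SHARES = {
    "catch-all": 5,
    "leader-election": 10,
    "node-high": 40,
    "system": 30,
    "workload-high": 40,
    "workload-low": 100,
}

# Assumption: default --max-requests-inflight (400) + --max-mutating-requests-inflight (200).
SERVER_CONCURRENCY_LIMIT = 400 + 200


def nominal_limits(shares, server_limit):
    """Split server_limit across levels in proportion to their shares, rounding up."""
    total = sum(shares.values())
    # -(-a // b) is integer ceiling division, avoiding float rounding surprises.
    return {name: -(-server_limit * ncs // total) for name, ncs in shares.items()}


for level, seats in nominal_limits(SHARES, SERVER_CONCURRENCY_LIMIT).items():
    print(f"{level:16} {seats}")
```

With these inputs workload-low gets 267 seats, node-high and workload-high 107 each, system 80, leader-election 27 and catch-all 14. Because every result is rounded up, they add up to 602, slightly above 600.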
FlowSchema routes each request to a level based on who sent it and what it targets (verb, resource, namespace), and splits the requests it matches into flows, by user or by namespace:
kubectl get flowschemas
NAME                      PRIORITYLEVEL     MATCHINGPRECEDENCE   DISTINGUISHERMETHOD
exempt                    exempt            1                    <none>
probes                    exempt            2                    <none>
system-leader-election    leader-election   100                  ByUser
system-nodes              system            500                  ByUser
kube-controller-manager   workload-high     800                  ByNamespace
kube-scheduler            workload-high     800                  ByNamespace
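Read one row: a leader-election request (a lease renewal by kube-scheduler, say) matches FlowSchema system-leader-election, goes to the leader-election level, and is split into flows ByUser. FlowSchemas are evaluated in ascending matchingPrecedence order and the first match wins, so precedence 100 beats 800 even when both schemas would match. The exempt level is special: it bypasses all limits and is reserved for health checks (the probes FlowSchema, covering /healthz, /readyz and /livez) and for requests from the system:masters group (the exempt FlowSchema). The cluster needs no configuration: this set of FlowSchemas and PriorityLevelConfigurations is built in, and the apiserver creates it at startup.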
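Classification is a first-match search in precedence order, followed by building a flow key from the distinguisher. The Python sketch below models that with a few hypothetical, heavily simplified rules loosely based on the table; real FlowSchemas match on subjects (users, groups, service accounts) plus resource and non-resource rules, and the apiserver's matcher is far more complete. The Request fields and the match predicates are illustrative assumptions, not the Kubernetes API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    user: str
    groups: tuple
    namespace: Optional[str]
    resource: str


@dataclass
class FlowSchema:
    name: str
    priority_level: str
    matching_precedence: int
    distinguisher: Optional[str]  # "ByUser", "ByNamespace" or None
    matches: Callable[[Request], bool]  # simplified stand-in for the schema's rules


def classify(request, schemas):
    """Return (priority level, flow key) from the first matching schema."""
    for schema in sorted(schemas, key=lambda s: s.matching_precedence):
        if not schema.matches(request):
            continue
        if schema.distinguisher == "ByUser":
            flow = request.user
        elif schema.distinguisher == "ByNamespace":
            flow = request.namespace or ""
        else:
            flow = ""  # no distinguisher: the whole schema is a single flow
        return schema.priority_level, (schema.name, flow)
    raise LookupError("no FlowSchema matched")  # a real catch-all schema prevents this


# Hypothetical, heavily simplified versions of a few built-in schemas.
SCHEMAS = [
    FlowSchema("kube-scheduler", "workload-high", 800, "ByNamespace",
               lambda r: r.user == "system:kube-scheduler"),
    FlowSchema("system-leader-election", "leader-election", 100, "ByUser",
               lambda r: r.user == "system:kube-scheduler" and r.resource == "leases"),
    FlowSchema("exempt", "exempt", 1, None,
               lambda r: "system:masters" in r.groups),
    FlowSchema("catch-all", "catch-all", 10000, "ByUser", lambda r: True),
]

lease = Request("system:kube-scheduler", (), "kube-system", "leases")
print(classify(lease, SCHEMAS))
# ('leader-election', ('system-leader-election', 'system:kube-scheduler'))
```

The lease request also satisfies the kube-scheduler rule, but precedence 100 is checked before 800, so it lands in leader-election. A binding request from the same user, which the leader-election rule doesn't cover, falls through to kube-scheduler and becomes a flow keyed by its namespace.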
APF's live state
The apiserver exposes APF state through a debug endpoint — you see immediately which level is busy and whether any request is being rejected:
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
PriorityLevelName, ..., ExecutingRequests, DispatchedRequests, RejectedRequests, ...
catch-all, ..., 0, 7, 0,
exempt, ..., 1, 15451, 0,
leader-election, ..., 0, 39746, 0,
node-high, ..., 0, 2479, 0,
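DispatchedRequests shows the real load distribution: leader-election has handled 39746 requests, exempt 15451 (largely the continuous health checks). The column to watch is RejectedRequests, sitting at 0 on every level, so the cluster has not been overloaded. When all of a level's seats are busy, new requests wait in that level's queues; when the queue a request is assigned to is already full, the request is rejected with 429 Too Many Requests and RejectedRequests climbs at exactly that level (catch-all, which has no queues, rejects as soon as its seats are taken). That points straight at the flow applying pressure, instead of letting the whole apiserver slow down evenly. This is where APF both protects and lets you observe: it doesn't just block overload, it tells you where the overload is. The same counts are exported on /metrics as apiserver_flowcontrol_rejected_requests_total, labeled by flow_schema and priority_level, so Prometheus can alert on them.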
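The queueing behind that 429 is governed by the QUEUES, HANDSIZE and QUEUELENGTHLIMIT columns of the PriorityLevelConfiguration table. APF assigns each flow to queues by shuffle sharding: the flow is hashed to a fixed hand of HANDSIZE queues out of QUEUES, and a new request joins the shortest queue in that hand; if even that queue already holds QUEUELENGTHLIMIT requests, the request is rejected. The Python sketch below models only that enqueue decision while every seat in the level is busy. It assumes SHA-256 for dealing hands (the apiserver has its own hashing and dealing code) and leaves out dispatching, seat accounting, and timeouts; the class and function names are hypothetical.

```python
import hashlib


def deal_hand(flow_key, num_queues, hand_size):
    """Deterministically pick hand_size distinct queue indexes for a flow (sketch)."""
    value = int.from_bytes(hashlib.sha256(flow_key.encode()).digest(), "big")
    remaining = list(range(num_queues))
    hand = []
    for _ in range(hand_size):
        # Use successive "digits" of the hash to pick without replacement.
        value, pick = divmod(value, len(remaining))
        hand.append(remaining.pop(pick))
    return hand


class PriorityLevel:
    """Queues of one Limited priority level; dispatching to seats is omitted."""

    def __init__(self, num_queues, hand_size, queue_length_limit):
        self.queues = [[] for _ in range(num_queues)]
        self.hand_size = hand_size
        self.queue_length_limit = queue_length_limit
        self.rejected = 0  # analogous to RejectedRequests in the debug dump

    def enqueue(self, flow_key, request):
        """Queue the request on the shortest queue in its flow's hand, or reject it."""
        hand = deal_hand(flow_key, len(self.queues), self.hand_size)
        shortest = min(hand, key=lambda i: len(self.queues[i]))
        if len(self.queues[shortest]) >= self.queue_length_limit:
            self.rejected += 1
            return 429  # Too Many Requests
        self.queues[shortest].append(request)
        return None  # queued; it would wait here for a free seat


# Hypothetical load on a level shaped like leader-election (16 queues, hand 4, limit 50).
level = PriorityLevel(num_queues=16, hand_size=4, queue_length_limit=50)
for n in range(250):
    level.enqueue("noisy-controller", n)
print(level.rejected)  # 50: its 4 queues hold 200 requests, the rest are rejected
print(level.enqueue("quiet-controller", "lease renew"))  # None unless its hand is the same 4 queues
```

A noisy flow can only ever fill its own hand, so it exhausts 4 of the 16 queues and then gets 429s, while another flow keeps getting in as long as its hand includes at least one other queue. With 16 queues and hands of 4, two flows draw exactly the same hand with probability 1 in 1820.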
Traces
The third piece of observability is tracing: following one request across components to find where it's slow. The apiserver and kubelet can export traces over OpenTelemetry; the apiserver is configured with --tracing-config-file, the kubelet through the tracing section of its configuration file. This cluster doesn't have apiserver tracing on:
ssh controller-0 'grep -c tracing-config-file /etc/systemd/system/kube-apiserver.service'
0
When enabled, the apiserver sends spans to an OpenTelemetry collector, and you can see, for example, that a CREATE pod request spent its time in an admission webhook (Article 58) or in etcd. Traces complement metrics: metrics say that something is slow, traces say where. The from-scratch cluster leaves tracing off because it needs an external collector to receive the spans; enabling it means adding --tracing-config-file, pointing at a TracingConfiguration file whose endpoint is that collector's address.
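To see how a trace answers "where", look at each span's self time: its duration minus the time spent in its direct children. The Python sketch below computes that for a hypothetical CREATE pod trace. The span names, timings, and the simple parent-id structure are invented for illustration; they are not the apiserver's real span names, and a trace backend normally does this analysis for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span name -> duration not covered by its direct children.

    Assumes the children of a span do not overlap each other (sequential calls).
    """
    child_time = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0.0) + span.duration_ms
    return {s.name: s.duration_ms - child_time.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace of one CREATE pod request (names and timings are made up).
trace = [
    Span("a", None, "POST /api/v1/namespaces/default/pods", 0.0, 120.0),
    Span("b", "a", "admission webhook call", 5.0, 95.0),
    Span("c", "a", "etcd write", 100.0, 115.0),
]
times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # admission webhook call 90.0
```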
🧹 Cleanup
This article only reads the /metrics endpoint, the built-in APF objects, and the debug endpoint — it creates nothing. There's nothing to clean up. The commands used here are at github.com/nghiadaulau/kubernetes-from-scratch, directory 66-metrics-apf.
Wrap-up
Kubernetes observability has three pieces. Metrics: the apiserver and kubelet expose /metrics in Prometheus format (apiserver_request_total, kubelet_running_containers, etc.), and the cluster itself stores nothing — an external Prometheus scrapes and keeps the time series. API Priority and Fairness both protects the apiserver and is observable: PriorityLevelConfiguration splits the request budget into levels (system, leader-election, workload-high/low, catch-all, exempt), FlowSchema routes requests to a level by sender/resource and splits them into flows, and the /debug/api_priority_and_fairness endpoint shows each level's DispatchedRequests/RejectedRequests — a 429 at one level points straight at the flow causing the load. Traces (OTel, --tracing-config-file) follow a request across components to pinpoint where it's slow, off by default because they need an external collector. All three are exposed-at-the-cluster, collected-externally — exactly Kubernetes's philosophy on observability.
Article 67 closes Part XIII with the remaining operational mechanisms: leader election, which keeps a single active controller-manager and scheduler in an HA control plane; how addons are managed; and node autoscaling, where the cluster adds and removes nodes according to load.