Device Plugins and Extended Resources

Throughout the series, pods have requested only two resource types: CPU and memory (Article 22). But a node can have other things, such as a GPU, a high-speed NIC, or an FPGA, that kubelet doesn't know about by default. The device plugin framework lets a node advertise such hardware as an extended resource, so pods can request it and the scheduler can account for it much as it does CPU. This article stands up a real device plugin to capture the full gRPC flow between it and kubelet, then looks at the extended-resource mechanism it builds on.

The device plugin framework

A device plugin separates the hardware knowledge out of kubelet: the vendor writes their own plugin, kubelet needs no patching. The two sides talk over gRPC on a Unix socket in /var/lib/kubelet/device-plugins/, via two services:

   Registration service  (KUBELET serves it, at kubelet.sock)
     Register(RegisterRequest{Version, Endpoint, ResourceName, Options})

   DevicePlugin service   (PLUGIN serves it, at <name>.sock)
     GetDevicePluginOptions()  — declare optional features
     ListAndWatch() → stream    — report device list {ID, Health}, update on change
     Allocate(AllocateRequest)  — kubelet calls at container creation, returns envs/mounts/devices
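
To make the direction of each call concrete, here is a toy in-process model of the two services. It is only a sketch: there is no gRPC and no socket, the class names ToyKubelet and ToyPlugin are invented for illustration, and the resource name vendor.example/thing and device IDs dev-a/dev-b are hypothetical. The real ListAndWatch is a long-lived stream; this model yields a single snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlugin:
    """Stands in for the DevicePlugin service (served by the plugin)."""
    resource_name: str
    device_ids: list

    def list_and_watch(self):
        # Real RPC: a stream that re-sends the list on every change.
        # Toy version: one snapshot of {ID, Health} entries.
        yield [{"id": d, "health": "Healthy"} for d in self.device_ids]

    def allocate(self, device_ids):
        # Per-container settings; empty, like a plugin with no real hardware.
        return {"envs": {}, "mounts": [], "devices": []}


@dataclass
class ToyKubelet:
    """Stands in for the Registration service (served by kubelet)."""
    plugins: dict = field(default_factory=dict)
    capacity: dict = field(default_factory=dict)

    def register(self, plugin):
        # Register arrives from the plugin; kubelet then calls back into it.
        self.plugins[plugin.resource_name] = plugin
        for snapshot in plugin.list_and_watch():
            self.capacity[plugin.resource_name] = len(snapshot)

    def start_container(self, resource_name, count):
        # Kubelet picks the device IDs, then asks the plugin how to wire them in.
        plugin = self.plugins[resource_name]
        chosen = plugin.device_ids[:count]
        return chosen, plugin.allocate(chosen)


kubelet = ToyKubelet()
kubelet.register(ToyPlugin("vendor.example/thing", ["dev-a", "dev-b"]))
print(kubelet.capacity)  # {'vendor.example/thing': 2}
print(kubelet.start_container("vendor.example/thing", 1)[0])  # ['dev-a']
```

Note which side initiates what: the plugin calls Register once, and every later call (ListAndWatch, Allocate) goes from kubelet to the plugin.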

The handshake sequence has a mandatory order: (1) the plugin opens a gRPC server on its own socket; (2) the plugin calls Register on kubelet.sock, declaring its resource name and socket name; (3) kubelet turns around and calls ListAndWatch on the plugin, receiving the device list; (4) kubelet writes the device count into Node.status.capacity under the name vendor-domain/type; (5) when a pod requesting that resource is scheduled onto the node, kubelet calls Allocate so the plugin can configure the container. Look at the starting point — the kubelet.sock socket kubelet opens to receive registrations:

ssh worker-0 'sudo ls /var/lib/kubelet/device-plugins/'

kubelet.sock

That's all there is, because no plugin has registered yet. Now stand one up.

Standing up a real device plugin

You don't need a GPU to see the gRPC flow: Kubernetes ships a sample device plugin (used in its e2e tests) that advertises a fake resource, example.com/resource. Run it on worker-0 in the dp-demo namespace, mounting /var/lib/kubelet/device-plugins so it can create its own socket and reach kubelet.sock:

apiVersion: v1
kind: Pod
metadata: {name: sample-dp}
spec:
  nodeSelector: {kubernetes.io/hostname: worker-0}
  containers:
  - name: dp
    image: registry.k8s.io/e2e-test-images/sample-device-plugin:1.3
    securityContext: {privileged: true}
    env:
    - {name: PLUGIN_SOCK_DIR, value: /var/lib/kubelet/device-plugins}
    volumeMounts:
    - {name: device-plugin, mountPath: /var/lib/kubelet/device-plugins}
  volumes:
  - {name: device-plugin, hostPath: {path: /var/lib/kubelet/device-plugins}}

The plugin's logs print exactly steps (1)–(3) above:

kubectl -n dp-demo logs sample-dp

pluginSocksDir: /var/lib/kubelet/device-plugins
Starting to serve on /var/lib/kubelet/device-plugins/dp.1779583118   # (1) open own socket
Deprecation file not found. Invoke registration                      # (2) call Register on kubelet.sock
ListAndWatch                                                         # (3) kubelet calls ListAndWatch

The plugin's own socket appears next to kubelet.sock, and kubelet has written the resource into the node's capacity — that's step (4), and the number 2 comes from the plugin's ListAndWatch, not from anyone typing it by hand:

ssh worker-0 'sudo ls /var/lib/kubelet/device-plugins/ | grep -v kubelet'
kubectl get node worker-0 -o json | jq '.status.capacity["example.com/resource"]'

dp.1779583118        # plugin socket
"2"                  # example.com/resource = 2, reported by ListAndWatch

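How the number reaches the node can be sketched as a simple count over the ListAndWatch snapshot. This assumes kubelet counts every reported device toward capacity and only healthy ones toward allocatable; the function name summarize is invented here, and the second device ID Dev-2 is an assumption (only Dev-1 appears in the logs in this article).

```python
def summarize(devices):
    """Return (capacity, allocatable) for one resource from a device snapshot."""
    healthy = sum(1 for d in devices if d["health"] == "Healthy")
    return len(devices), healthy


snapshot = [
    {"id": "Dev-1", "health": "Healthy"},
    {"id": "Dev-2", "health": "Healthy"},  # assumed ID, not shown in the logs
]
print(summarize(snapshot))  # (2, 2)

# If the plugin later streams an update marking one device unhealthy,
# the device stays in capacity but is no longer allocatable.
degraded = [
    {"id": "Dev-1", "health": "Healthy"},
    {"id": "Dev-2", "health": "Unhealthy"},
]
print(summarize(degraded))  # (2, 1)
```
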
Now create a pod requesting example.com/resource: 1. Once the scheduler binds the pod to worker-0, kubelet performs step (5): it calls Allocate on the plugin, passing the chosen device ID:

# pod limits {example.com/resource: "1"}, pinned to worker-0
kubectl -n dp-demo logs sample-dp | grep Allocate

Allocate, &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{
  &ContainerAllocateRequest{DevicesIDs:[Dev-1],},},}

kubelet picked device Dev-1 for the pod and asked the plugin how to configure the container. The plugin returns an AllocateResponse containing envs/mounts/devices for kubelet to inject into the container. With a real GPU, this is where the plugin would request the device node /dev/nvidia0 and set driver environment variables. This sample plugin returns an empty response (no real hardware to attach), but the Allocate RPC is still called at the right moment during container creation. That's the entire lifecycle: the plugin registers and reports devices via ListAndWatch, kubelet advertises them on the node, and kubelet calls Allocate when a container that requests one is created.
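
For a sense of what a non-empty response might carry, here is a sketch of the per-container part of an AllocateResponse as plain Python dataclasses. The field names are meant to mirror the v1beta1 device plugin messages as understood here (check api.proto before relying on them); the allocate function, the /dev/example<N> paths, and the EXAMPLE_VISIBLE_DEVICES variable are hypothetical and do not belong to any real driver.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceSpec:
    container_path: str  # where the device node appears inside the container
    host_path: str       # device node on the host
    permissions: str     # cgroup device permissions, e.g. "rw"


@dataclass
class ContainerAllocateResponse:
    envs: dict = field(default_factory=dict)
    mounts: list = field(default_factory=list)
    devices: list = field(default_factory=list)


def allocate(device_ids):
    """Hypothetical plugin logic: expose one device node per chosen device ID."""
    resp = ContainerAllocateResponse()
    resp.envs["EXAMPLE_VISIBLE_DEVICES"] = ",".join(device_ids)
    for dev_id in device_ids:
        index = dev_id.rsplit("-", 1)[-1]  # "Dev-1" -> "1"
        path = f"/dev/example{index}"
        resp.devices.append(DeviceSpec(path, path, "rw"))
    return resp


resp = allocate(["Dev-1"])
print(resp.envs)                  # {'EXAMPLE_VISIBLE_DEVICES': 'Dev-1'}
print(resp.devices[0].host_path)  # /dev/example1
```

The plugin only describes the changes; kubelet and the container runtime apply them when the container is created.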

The mechanism underneath: extended resources

Boiled down, what the plugin does to the node is push a number into status.capacity (via ListAndWatch). For a logical resource not tied to hardware, you can do that step directly by hand by PATCHing the Node status, as the Kubernetes documentation on advertising extended resources describes. This is also a clean way to see how the scheduler treats an extended resource. Advertise kkloud.io/widget: 2 on worker-0:

kubectl patch node worker-0 --subresource=status --type=json \
  -p='[{"op":"add","path":"/status/capacity/kkloud.io~1widget","value":"2"}]'
kubectl get node worker-0 -o json | jq '.status.capacity["kkloud.io/widget"], .status.allocatable["kkloud.io/widget"]'

"2"                  # capacity
"2"                  # allocatable

(~1 is how JSON Pointer writes the / inside a resource name.) Kubelet pushes it from capacity into allocatable — just like when a plugin reports it, only the source differs.
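
The escaping rule comes from JSON Pointer (RFC 6901): inside a path segment, ~ becomes ~0 and / becomes ~1. A small sketch that builds the patch body for any resource name follows; the helper names escape_pointer_token and capacity_patch are invented for this example.

```python
import json


def escape_pointer_token(token: str) -> str:
    # Escape "~" first, so the "~1" produced for "/" is not escaped again.
    return token.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, count: int = 0, op: str = "add") -> str:
    """Build a JSON Patch body targeting status.capacity for one resource."""
    entry = {"op": op, "path": "/status/capacity/" + escape_pointer_token(resource)}
    if op != "remove":
        entry["value"] = str(count)  # quantities are sent as strings
    return json.dumps([entry])


print(capacity_patch("kkloud.io/widget", 2))
# [{"op": "add", "path": "/status/capacity/kkloud.io~1widget", "value": "2"}]
```

The printed string can be passed to kubectl patch with -p, in place of the hand-written JSON above.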

The scheduler splits it exactly like CPU

A pod requests an extended resource in resources.limits, and the scheduler accounts for it the way it does CPU/memory, with stricter rules: whole integers only, no over-commit (a request, if set, must equal the limit), and no sharing between containers. Create one pod requesting 1 widget and one requesting 2, both pinned to worker-0:

# w-one: limits {kkloud.io/widget: "1"} ; w-two: limits {kkloud.io/widget: "2"}
kubectl -n dev-demo get pods -o wide

NAME    READY   STATUS    NODE
w-one   1/1     Running   worker-0
w-two   0/1     Pending   <none>

w-one takes 1 of the 2 widgets and runs. w-two requests 2 but only 1 is left, so it's Pending. See why:

kubectl -n dev-demo get event --field-selector involvedObject.name=w-two | grep FailedScheduling

Warning  FailedScheduling  0/2 nodes are available: 1 Insufficient kkloud.io/widget,
  1 node(s) didn't match Pod's node affinity/selector ...

Insufficient kkloud.io/widget is the exact message the NodeResourcesFit plugin (Article 34) gives when CPU is short, now applied to widgets. The scheduler doesn't distinguish an extended resource from a core one: same allocatable count, same subtraction per scheduled pod, same rejection when there isn't enough. A device plugin only needs to push a number into capacity and configure the container via Allocate(); the scheduling part reuses the existing resource mechanism. (From v1.36, the allocatedResourcesStatus field in container status also reports the health of the allocated device — beta.)
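
The accounting behind that result is plain subtraction. The sketch below is a simplified stand-in for the fit check, not the actual NodeResourcesFit code: the function fits is invented here, and it considers a single node and a single resource.

```python
def fits(allocatable: int, already_requested: list, new_request: int, resource: str):
    """Return (ok, reason) for placing one more pod on a node."""
    free = allocatable - sum(already_requested)
    if new_request > free:
        return False, f"Insufficient {resource}"
    return True, ""


allocatable = 2   # kkloud.io/widget on worker-0
scheduled = []    # widget requests of pods already placed on the node
results = {}
for name, request in [("w-one", 1), ("w-two", 2)]:
    ok, reason = fits(allocatable, scheduled, request, "kkloud.io/widget")
    if ok:
        scheduled.append(request)
    results[name] = "Running" if ok else f"Pending ({reason})"
    print(name, results[name])
# w-one Running
# w-two Pending (Insufficient kkloud.io/widget)
```

Swap the order of the two pods and w-two would run while w-one stays Pending: the check only compares the request against what is still free.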

🧹 Cleanup

kubectl delete namespace dp-demo dev-demo
# remove the virtual extended resource + clean up the entry the plugin left behind (kubelet doesn't delete a resource it stopped managing)
kubectl patch node worker-0 --subresource=status --type=json \
  -p='[{"op":"remove","path":"/status/capacity/kkloud.io~1widget"}]'
kubectl patch node worker-0 --subresource=status --type=json \
  -p='[{"op":"remove","path":"/status/capacity/example.com~1resource"},
       {"op":"remove","path":"/status/allocatable/example.com~1resource"}]'

One operational point to remember: when a device plugin stops, kubelet sets the resource's allocatable to 0 but does not remove the entry from status.capacity (it remembers it in a checkpoint), so you have to PATCH it out by hand to get the node back clean. The manifests are at github.com/nghiadaulau/kubernetes-from-scratch, directory 61-device-plugins.

Wrap-up

A device plugin lets a node advertise hardware beyond CPU/memory as an extended resource vendor/type, over gRPC with kubelet: the plugin opens its own socket, calls Register on kubelet.sock, kubelet calls back ListAndWatch to receive the device list then writes it into Node.status.capacity, and calls Allocate when a pod is allocated so the plugin can configure the container (envs/mounts/devices). We stood up a real sample plugin and captured the full flow: logs Starting to serve → Invoke registration → ListAndWatch, the plugin socket appears next to kubelet.sock, the node shows example.com/resource: 2 (reported by ListAndWatch, not typed by hand), and when a pod requests it, Allocate, DevicesIDs:[Dev-1] is called. Underneath, the plugin just pushes a number into capacity — for a logical resource you can PATCH it by hand, and the scheduler splits an extended resource exactly like CPU: a pod requesting 1 runs, a pod requesting 2 (only 1 left) goes Pending with Insufficient kkloud.io/widget via NodeResourcesFit (Article 34).

Part XII closes here, with five ways to extend Kubernetes: CRD (new kind in etcd), admission webhook (intercept the write path), operator (CRD + controller), API aggregation (second server), and device plugin (hardware resources over gRPC). Part XIII turns to cluster operations: backing up and restoring etcd, version upgrades, garbage collection, then observability (logging, metrics, traces). Article 62 starts with the scariest thing to lose: backing up and restoring etcd, where all of the cluster's state lives.