Ephemeral containers and kubectl debug
In the last few articles we often ran kubectl exec ... -- sh to get into a pod and look around. But that approach rests on an assumption: that the container has a shell. In production that assumption is often wrong, and wrong on purpose. A distroless image (containing only the app and its libraries: no shell, no ps, no cat) is a common security recommendation because it cuts away most of the attack surface. The price you pay: when a pod has a problem, kubectl exec has nothing to run. The docs state the problem plainly:
"Since distroless images do not include a shell or any debugging utilities, it's difficult to troubleshoot distroless images using kubectl exec alone."
The answer is the ephemeral container: slip a tooling container (busybox, or a dedicated debug image) temporarily into a running pod, without restarting the pod, without modifying the image. This article digs into the semantics of ephemeral containers, then verifies all three modes of the command that wraps them: kubectl debug.
Why you can't just "add another container"
The first question: why not just declare a debug container in spec.containers and be done? Because the spec of a running pod is nearly immutable: changing spec.containers requires recreating the pod, and recreating the pod means destroying the scene we need to inspect. Ephemeral containers exist to work around exactly this. The docs define:
"Ephemeral containers differ from other containers in that they lack guarantees for resources or execution, and they will never be automatically restarted, so they are not appropriate for building applications."
And the purpose:
"a special type of container that runs temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You use ephemeral containers to inspect services rather than to build applications."
Because it's a debugging tool and not for running an application, several things are cut from it:
"Ephemeral containers may not have ports, so fields such as ports, livenessProbe, readinessProbe are disallowed." and "Pod resource allocations are immutable, so setting resources is disallowed."
Most important is how it's added, which is not through the usual spec:
"Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so it's not possible to add an ephemeral container using kubectl edit."
That is, there's a separate API subresource named ephemeralcontainers to attach it to a live pod. We won't call the raw API because kubectl debug handles that. And once attached, it can't be removed or modified ("you may not change or remove an ephemeral container after you have added it"); it lives until the pod dies.
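To make the subresource a little more concrete, here is a minimal sketch of where it lives in the API. The helper name ephemeral_subresource_path is hypothetical, and the commented usage assumes the subresource answers a GET with the Pod object, which has not been verified in this series.

```shell
# Hypothetical helper: build the API path of a pod's ephemeralcontainers subresource.
# Arguments: pod name, then namespace (defaults to "default").
ephemeral_subresource_path() {
  local pod="$1"
  local namespace="${2:-default}"
  printf '/api/v1/namespaces/%s/pods/%s/ephemeralcontainers\n' "$namespace" "$pod"
}

# Usage against a real cluster (assumption: GET is allowed on the subresource):
#   kubectl get --raw "$(ephemeral_subresource_path noshell)"
```

The path sits next to the pod itself rather than inside pod.spec, which is why an ordinary kubectl edit never reaches it.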
Mode 1: attach an ephemeral container to a running pod
Stand up a pod that deliberately has no shell to reproduce the distroless case. Kubernetes' pause image is the perfect example: it runs only a /pause binary that sleeps forever, with no shell or utilities:
apiVersion: v1
kind: Pod
metadata:
  name: noshell
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.10.2
Try to exec into it as usual:
kubectl exec -it noshell -- /bin/sh -c 'echo hi'
error: Internal error occurred: ... OCI runtime exec failed: exec failed:
unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory
No /bin/sh to run, exactly the distroless scenario. kubectl exec is stuck. Now use kubectl debug to attach a busybox ephemeral container, with the --target flag:
kubectl debug noshell -it --image=busybox:1.36 --target=app --container=debugger
# (here run non-interactively to capture output:)
kubectl debug noshell -it=false --image=busybox:1.36 --target=app --container=debugger -- ps -ef
Targeting container "app". If you don't see processes from this container it may be
because the container runtime doesn't support this feature.
PID USER TIME COMMAND
1 65535 0:00 /pause
14 root 0:00 ps -ef
The app container has no shell at all, but we can still run ps -ef, because the command runs in the debugger container (busybox, which has the tools), not in app. And note the line PID 1 /pause: we see the app container's process from inside the debugger. That's thanks to the --target=app flag.
--target enables process namespace sharing between the debugger and the target container: the debugger is joined into app's process namespace, so ps in the debugger lists app's /pause too. Without --target the debugger only sees its own processes. This is how you debug distroless: the target image has no ps/ls/cat, but from the debugger we can inspect its processes, and (if also shared) its filesystem via /proc/<pid>/root. (The "container runtime doesn't support this feature" warning is just a fallback; our containerd supports it, so we do see /pause.)
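The /proc/<pid>/root route can be sketched as a small wrapper. The helper name peek_target_root is hypothetical. It assumes the target's main process shows up as PID 1 in the shared namespace (as in the ps output above) and that the debugger is allowed to read that process's root directory. With the default debug settings that read may be denied, since it needs ptrace-level access; newer kubectl versions offer a --profile flag that may help, which is not tested here.

```shell
# Hypothetical helper: list a path inside the target container's root filesystem
# from a throwaway busybox ephemeral container.
# Assumptions: the target's main process is PID 1 in the shared process namespace,
# and the debugger is permitted to read /proc/1/root (this may be denied).
peek_target_root() {
  local pod="$1"
  local target="$2"
  local path="${3:-/}"
  # No --container flag: kubectl generates a unique name, because an
  # ephemeral container name cannot be reused within the same pod.
  kubectl debug "$pod" -it=false --image=busybox:1.36 --target="$target" \
    -- ls -l "/proc/1/root${path}"
}

# Usage against a real cluster:
#   peek_target_root noshell app /
```

Each call adds one more ephemeral container to the pod, and none of them can be removed afterwards, so use it sparingly on a pod you intend to keep.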
Where the ephemeral container shows up, and how it doesn't touch the app
After attaching, the pod records the ephemeral container in two separate places, not mixed into containers:
kubectl get pod noshell -o jsonpath='{range .spec.ephemeralContainers[*]}name={.name} image={.image} target={.targetContainerName}{"\n"}{end}'
kubectl get pod noshell -o jsonpath='{range .status.ephemeralContainerStatuses[*]}name={.name} state={.state}{"\n"}{end}'
name=debugger image=busybox:1.36 target=app
name=debugger state={"terminated":{"exitCode":0,"reason":"Completed",...}}
spec.ephemeralContainers holds the declaration (with targetContainerName: app), status.ephemeralContainerStatuses holds the state. The ps command finished, so the debugger is terminated/Completed, and this is the point to remember: an ephemeral container never restarts on its own. Unlike a regular container (Article 18) or a sidecar (Article 19), it runs once and stops. If you want a longer debug session, have it run sleep or an interactive shell (-it).
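A longer session along those lines might look like the sketch below. The helper name start_debug_session and the container name dbg-session are made up for illustration, and the usage assumes kubectl exec can address an ephemeral container by name with -c.

```shell
# Hypothetical helper: start a long-lived ephemeral container that just sleeps,
# so it stays Running and can be entered repeatedly.
start_debug_session() {
  local pod="$1"
  local target="$2"
  local name="${3:-dbg-session}"
  # --attach=false: create the container and return instead of waiting on sleep.
  kubectl debug "$pod" --image=busybox:1.36 --target="$target" \
    --container="$name" --attach=false -- sleep 3600
}

# Usage against a real cluster:
#   start_debug_session noshell app
#   kubectl exec -it noshell -c dbg-session -- sh
```

When the sleep ends the container goes to terminated/Completed like the ps run above, and a new session needs a new container name.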
What about the app container? Entirely undisturbed:
kubectl get pod noshell
kubectl get pod noshell -o jsonpath='app.restartCount={.status.containerStatuses[0].restartCount}{"\n"}'
NAME READY STATUS RESTARTS AGE
noshell 1/1 Running 0 28s
app.restartCount=0
1/1 Running, restartCount=0: attaching the debugger didn't restart the pod, didn't modify the image, didn't interrupt the app. Exactly what you need while investigating a live incident: keep the scene intact.
Mode 2: copy the pod with --copy-to
Sometimes you don't want to (or aren't allowed to) touch the original pod, or you need to change the command/image to see the app behave differently, which an ephemeral container can't do since it can't modify an existing container's command. In that case use --copy-to: kubectl debug creates a copy of the pod, you tweak the copy, and the original is left intact.
kubectl debug noshell -it=false --copy-to=noshell-dbg \
--image=busybox:1.36 --container=debugger --share-processes -- sleep 3600
NAME READY STATUS RESTARTS AGE
noshell-dbg 2/2 Running 0 8s
The copy noshell-dbg has two containers: the original app with its image unchanged, plus debugger:
kubectl get pod noshell-dbg -o jsonpath='{range .spec.containers[*]}container={.name} image={.image}{"\n"}{end}'
container=app image=registry.k8s.io/pause:3.10.2
container=debugger image=busybox:1.36
In copy mode there's no --target (the debug container here is a regular container, not an ephemeral one), so to see the app's processes we use the --share-processes flag, which enables shareProcessNamespace for the whole copied pod. Get into the debugger and verify:
kubectl exec noshell-dbg -c debugger -- ps -ef
PID USER TIME COMMAND
1 65535 0:00 /pause
7 65535 0:00 /pause
13 root 0:00 sleep 3600
19 root 0:00 ps -ef
The debugger sees the app container's /pause (PID 7) alongside its own sleep 3600 (PID 13), which is what sharing the process namespace across the whole pod means. PID 1 is a different /pause: with shareProcessNamespace enabled, the pod's infrastructure (sandbox) container takes PID 1. Since it's a copy, it is a separate pod with its own IP and doesn't affect the original; when debugging is done, deleting the copy cleans everything up. --copy-to suits a case where the original pod is in CrashLoopBackOff (Article 18): copy it out and change the command to sleep so the pod stays still for us to dissect instead of dying and respawning constantly.
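The CrashLoopBackOff case can be sketched as follows. The helper name copy_with_sleep and the pod name crashy are hypothetical. The sketch assumes the crashing container's image ships a sleep binary, which is not true of the pause image used in this article, and that naming an existing container with --container while omitting --image makes kubectl debug override that container's command in the copy.

```shell
# Hypothetical helper: copy a crash-looping pod and replace one container's
# command with a long sleep, so the copy stays up for inspection.
# Assumption: the container's image contains a `sleep` binary.
copy_with_sleep() {
  local pod="$1"
  local container="$2"
  kubectl debug "$pod" --copy-to="${pod}-dbg" --container="$container" \
    -- sleep 3600
}

# Usage against a real cluster (pod "crashy" with container "app" is made up):
#   copy_with_sleep crashy app
#   kubectl exec -it crashy-dbg -c app -- sh
```

The original pod keeps crashing untouched, while the copy holds still with the same image, volumes, and environment.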
Mode 3: debug a node directly with debug node/
The two modes above debug inside a pod. The third targets the node, for when the suspect is on the host: a full disk, kernel logs, kubelet config, CNI files (the things we built by hand in Part I). kubectl debug node/<node-name> creates a debug pod on that exact node, running in the host's IPC, network, and PID namespaces, and mounts the host's entire filesystem at /host:
kubectl debug node/worker-0 -it=false --image=busybox:1.36 --container=nodedbg -- ls /host
Creating debugging pod node-debugger-worker-0-nns48 with container nodedbg on node worker-0.
bin
boot
dev
etc
home
lib
lib64
lost+found
...
/host is exactly / of node worker-0. Prove it decisively by reading the host's hostname then cross-checking against the real hostname over SSH:
kubectl debug node/worker-0 -it=false --image=busybox:1.36 --container=nodedbg -- cat /host/etc/hostname
# worker-0
ssh worker-0 hostname
# worker-0
They match: the debug pod reads exactly /etc/hostname on the node's rootfs. From here you can view /host/var/log, /host/etc/kubernetes, inspect /host/etc/cni/net.d (the CNI files from Article 14)... all without an SSH key. You can also try chroot /host to use the host's own tools, though the Kubernetes docs note that the default node debug pod isn't privileged, so chroot /host and reading some process information may fail. This is how you debug a node "from inside the cluster," handy when there's no SSH route available but you have kubectl access. In exchange, this pod has broad access to the host, so delete it as soon as you're done (it's the node-debugger-* pod in the default namespace).
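If the host's own tools are needed, one possible route is a debug profile. The sketch below assumes a kubectl version that supports the --profile flag and that its sysadmin profile yields a privileged container; neither was exercised in this article, and the helper name node_shell is hypothetical.

```shell
# Hypothetical helper: open a shell on a node with the host's root as /.
# Assumptions: kubectl supports --profile, and the sysadmin profile grants
# enough privilege for chroot /host to succeed.
node_shell() {
  local node="$1"
  kubectl debug "node/${node}" -it --image=busybox:1.36 --profile=sysadmin \
    -- chroot /host sh
}

# Usage against a real cluster:
#   node_shell worker-0
```

A pod created this way is even more powerful than the default one, so the advice to delete it right after use applies doubly.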
The three modes side by side
kubectl debug <pod> --target=C ── attach an ephemeral container INTO a running pod
(no restart, no image change; --target = shared
process namespace to inspect container C's processes)
kubectl debug <pod> --copy-to=<copy> ── COPY the pod then tweak the copy
(change image/command, --share-processes; original intact)
kubectl debug node/<node> ── pod on the NODE in the host's namespaces, host fs mounted at /host
(debug the host: logs, config, disk)
🧹 Cleanup
kubectl delete pod noshell noshell-dbg --now
# delete the node debug pod (auto-created in the default namespace):
kubectl delete pod -l '!app' --field-selector=status.phase=Succeeded --now
kubectl get pods -o name | grep node-debugger | xargs -r kubectl delete
Note: the debugger ephemeral container can't be deleted on its own — it's bound tightly to the noshell pod, so deleting the pod is what does it. The cluster returns to just the two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 21-ephemeral-debug.
Wrap-up
When kubectl exec is powerless because the container has no shell or has crashed, the ephemeral container is the way in: a tooling container attached temporarily to a running pod via the ephemeralcontainers subresource, without restarting the pod, without modifying the image, never restarting on its own, and not removable after attaching. kubectl debug wraps three modes: attach an ephemeral container to a running pod (--target enables process namespace sharing to inspect the target container's processes; we saw the /pause of a shell-less image); copy a pod with --copy-to to change the image/command while keeping the original intact (--share-processes); and debug node/ to create a pod in the node's host namespaces that mounts the node's rootfs at /host to debug the host (we read exactly /host/etc/hostname = worker-0). The three modes cover three layers: inside the container, inside the pod, and below at the node.
With this, Part III has gone through the full lifecycle and how to observe/debug a pod. Article 22 moves on to resources: requests/limits, the three QoS classes (Guaranteed/Burstable/BestEffort) that decide which pod gets killed first when the node runs out of memory, and the Downward API for a pod to know information about itself.