Job, CronJob and TTL
The four controllers we've met so far share an implicit trait: they run forever. A Deployment keeps N pods alive indefinitely, and so do StatefulSet and DaemonSet: any pod that dies is replaced, because the goal is to always have pods running. But many real tasks have an end point: running a database migration, backing up a volume, processing a batch of data. For those, "run forever" is wrong; we need something that stops once the work is done. That's the Job. This article closes Part IV with the Job, its scheduled sibling the CronJob, and the TTL mechanism that automatically cleans up finished Jobs.
Job: run to completion, then stop
The docs state the distinction plainly: "Jobs represent one-off tasks that run to completion and then stop." They also describe the mechanism: "A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate ... When a specified number of successful completions is reached, the task (ie, Job) is complete." Note the phrase "successfully terminate": a Job counts pods that exit with code 0, not pods that are still running. A minimal Job:
apiVersion: batch/v1
kind: Job
metadata: {name: job-once}
spec:
  template:
    spec:
      restartPolicy: Never   # a Job only accepts Never or OnFailure
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo working; sleep 3; echo done; exit 0"]
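Note restartPolicy: Never. The docs require one of two values: "Only a RestartPolicy equal to Never or OnFailure is allowed." Always (the Pod default, Article 18) is rejected because it restarts the container even after a clean exit, so the pod could never complete, which contradicts the whole point of a Job. The two allowed values differ in where the retry happens: with Never, a failed container makes the Job create a new pod; with OnFailure, the kubelet restarts the container inside the same pod.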
kubectl get job job-once
kubectl get pods -l job-name=job-once
kubectl get job job-once -o jsonpath='succeeded={.status.succeeded} complete={.status.conditions[?(@.type=="Complete")].status}{"\n"}'
NAME       STATUS     COMPLETIONS   DURATION   AGE
job-once   Complete   1/1           7s         30s
NAME             READY   STATUS      ...
job-once-7rdhj   0/1     Completed   ...
succeeded=1 complete=True
STATUS is Complete, COMPLETIONS is 1/1, the pod shows Completed (not Running), succeeded=1, and the Complete condition is True. The Job finished its work and stopped; the pod is not recreated. This is the pod phase Succeeded from Article 18, seen at the controller level.
completions and parallelism
A Job may need to run multiple times, possibly in parallel. Two fields control this: completions (how many successful runs are needed) and parallelism (the max number of pods running at once).
apiVersion: batch/v1
kind: Job
metadata: {name: job-parallel}
spec:
  completions: 4   # need 4 completions
  parallelism: 2   # but at most 2 pods in parallel
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","sleep 3"]
kubectl get job job-parallel
kubectl get pods -l job-name=job-parallel --no-headers | wc -l
NAME           STATUS     COMPLETIONS   DURATION   AGE
job-parallel   Complete   4/4           11s        29s
4
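COMPLETIONS 4/4 is reached across 4 pods, but with parallelism: 2 the Job never runs more than 2 pods at once; as a pod succeeds, the controller starts the next one. Since all pods here take about the same time, that plays out as 2 waves of 2, so DURATION is 11s (roughly 2 waves × ~5s, each pod's 3s sleep plus start-up) rather than the ~5s it would take if all 4 ran in parallel. This is the mold for batch processing: split the work into N parts and cap the concurrent load. (In this plain form the pods are identical copies, so each one still needs a way to know its share; completionMode: Indexed, for example, gives every pod a JOB_COMPLETION_INDEX environment variable.)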
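To see why parallelism is a cap rather than a strict wave barrier, here is a small bash sketch that estimates how long a Job with a given parallelism takes to finish a list of pods. It assumes the controller refills a free slot the moment a pod succeeds and ignores pod start-up and scheduling latency; job_makespan is a hypothetical helper for illustration, not part of kubectl.

```shell
#!/usr/bin/env bash
# Estimate Job duration: at most PARALLELISM pods run at once, and a new pod
# starts in whichever slot frees up first (assumption: zero start-up latency).

# job_makespan PARALLELISM DURATION... -> seconds until every pod has succeeded
job_makespan() {
  local parallelism=$1; shift
  local -a free=()   # free[i] = time at which slot i becomes free
  local i d best finish max=0
  for ((i = 0; i < parallelism; i++)); do free[i]=0; done
  for d in "$@"; do
    # pick the slot that frees up earliest
    best=0
    for ((i = 1; i < parallelism; i++)); do
      if ((free[i] < free[best])); then best=$i; fi
    done
    finish=$((free[best] + d))
    free[best]=$finish
    if ((finish > max)); then max=$finish; fi
  done
  echo "$max"
}

job_makespan 2 5 5 5 5   # 10: equal pods look like two waves
job_makespan 2 3 8 3 3   # 9: the idle slot takes the short pods early
```

With equal durations the result matches the two-wave picture. With uneven ones, the slot that finishes early picks up the remaining pods while the long pod is still running, so the estimate is 9s instead of the 11s (8s + 3s) that strict waves would give.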
When a Job fails: backoffLimit
A Job retries when a pod fails, but not forever. backoffLimit sets how many retries are allowed before the Job gives up. Here is a Job that always fails, with backoffLimit: 2:
apiVersion: batch/v1
kind: Job
metadata: {name: job-fail}
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo will fail; exit 1"]
kubectl get job job-fail
kubectl get pods -l job-name=job-fail --no-headers | awk '{print $1,$3}'
kubectl get job job-fail -o jsonpath='failed={.status.failed} reason={.status.conditions[?(@.type=="Failed")].reason} msg={.status.conditions[?(@.type=="Failed")].message}{"\n"}'
NAME       STATUS   COMPLETIONS   DURATION   AGE
job-fail   Failed   0/1           60s        60s
job-fail-69ldt Error
job-fail-pdz4b Error
job-fail-wdlkd Error
failed=3 reason=BackoffLimitExceeded msg=Job has reached the specified backoff limit
Note: backoffLimit: 2, yet failed=3 (three pods failed). backoffLimit counts retries, so the total number of runs is backoffLimit + 1 (the first run plus 2 retries). Once that ceiling is hit, the Job goes to Failed with reason BackoffLimitExceeded. Between attempts the controller waits with an exponential back-off (10s, 20s, 40s, ... capped at six minutes, per the docs), in the same spirit as the CrashLoopBackOff of Article 18. backoffLimit defaults to 6. This is how a Job distinguishes "transient error, retry" from "really broken, stop and report".
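The retry arithmetic fits in a short bash sketch. It assumes restartPolicy: Never and the documented back-off of 10s doubling up to a 360s cap, and it ignores how long each pod actually runs, so the delays are only a lower bound on a failing Job's DURATION. Both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Retry arithmetic for a Job (assumes restartPolicy: Never).

# job_runs_total BACKOFF_LIMIT -> total pod runs before BackoffLimitExceeded
job_runs_total() {
  echo $(( $1 + 1 ))   # the first run plus BACKOFF_LIMIT retries
}

# job_retry_delays BACKOFF_LIMIT -> back-off before each retry
# (10s, doubling each time, capped at 360s = six minutes)
job_retry_delays() {
  local backoff_limit=$1 delay=10 i
  for ((i = 1; i <= backoff_limit; i++)); do
    echo "retry $i after ${delay}s"
    delay=$((delay * 2))
    if ((delay > 360)); then delay=360; fi
  done
}

job_runs_total 2     # 3, matching failed=3 above
job_retry_delays 2   # retry 1 after 10s, then retry 2 after 20s
```

With the default backoffLimit: 6, the waits alone add up to 10 + 20 + 40 + 80 + 160 + 320 = 630s, so a hopeless Job can spend more than ten minutes retrying before it is declared Failed.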
ttlSecondsAfterFinished: auto-clean finished Jobs
A finished Job doesn't vanish on its own: the Job object and its pods stay around so you can inspect logs and results. Accumulated over time, that's clutter. ttlSecondsAfterFinished lets a Job self-destruct a set number of seconds after it finishes, whether Complete or Failed:
apiVersion: batch/v1
kind: Job
metadata: {name: job-ttl}
spec:
  ttlSecondsAfterFinished: 20   # 20s after finishing, auto-delete
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo quick; exit 0"]
The Job completes almost instantly. Wait past 20 seconds, then look for it again:
kubectl get job job-ttl
Error from server (NotFound): jobs.batch "job-ttl" not found
The Job deleted itself and took its pod with it. No cron cleanup or external script needed. For Jobs created continuously (especially from a CronJob below), ttlSecondsAfterFinished is a tidy way to keep the cluster from drowning in old Jobs.
CronJob: a scheduled Job
Finally, the CronJob: "A CronJob creates Jobs on a repeating schedule." The docs offer an analogy: "One CronJob object is like one line of a crontab file on a Unix system." It uses the five-field cron syntax (minute, hour, day of month, month, day of week) and a jobTemplate that holds exactly the Job spec from above. The schedule * * * * * means every minute:
apiVersion: batch/v1
kind: CronJob
metadata: {name: cron-demo}
spec:
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 3   # keep the 3 most recent successful Jobs
  failedJobsHistoryLimit: 1       # keep the 1 most recent failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: w
            image: busybox:1.36
            command: ["sh","-c","date; echo hello from cronjob"]
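For comparison, a few other schedule strings, written as a fragment of a CronJob spec; the values are illustrative and not part of the demo. Schedules are evaluated in kube-controller-manager's local time zone unless the optional spec.timeZone field is set (stable since Kubernetes 1.27), which is worth pinning down when the cluster's clock and your own disagree.

```yaml
# Fields, left to right: minute  hour  day-of-month  month  day-of-week
spec:
  schedule: "0 2 * * *"        # 02:00 every day
  # schedule: "*/15 * * * *"   # every 15 minutes
  # schedule: "30 1 * * 0"     # 01:30 every Sunday (0 = Sunday)
  # schedule: "0 0 1 * *"      # midnight on the 1st of every month
  timeZone: "Etc/UTC"          # optional; without it, kube-controller-manager's local time zone applies
```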
I created cron-demo at 23:26:32 local time, waited past the next minute boundary (23:27:00), then looked:
kubectl get cronjob cron-demo
kubectl get jobs
kubectl logs job/cron-demo-29659227
NAME        SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cron-demo   * * * * *   False     0        30s             58s
NAME                 STATUS     COMPLETIONS   DURATION   AGE
cron-demo-29659227   Complete   1/1           3s         31s
Sat May 23 16:27:00 UTC 2026
hello from cronjob
LAST SCHEDULE is 30s ago: the CronJob fired exactly at the 23:27:00 boundary and spawned the Job cron-demo-29659227 (the suffix is a per-minute timestamp). The pod log prints 16:27:00 UTC, which is the same instant, since the local clock used above runs 7 hours ahead of UTC: the run started right at the top of the minute. This Job is owned by the CronJob:
kubectl get job cron-demo-29659227 -o jsonpath='ownerKind={.metadata.ownerReferences[0].kind} ownerName={.metadata.ownerReferences[0].name}{"\n"}'
# ownerKind=CronJob ownerName=cron-demo
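The suffix is not random. In the Kubernetes source, the CronJob controller names each Job after the CronJob plus the scheduled time in whole minutes since the Unix epoch (scheduledTime.Unix() / 60), which keeps the name deterministic and stops the same run from being created twice. A bash sketch to decode and encode it, assuming GNU date as found on most Linux systems (BSD/macOS date uses different flags); the two helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Decode/encode the numeric suffix of a CronJob-spawned Job name.
# Assumes GNU date (Linux coreutils).

# suffix_to_utc MINUTES -> scheduled time as ISO 8601 in UTC
suffix_to_utc() {
  date -u -d "@$(( $1 * 60 ))" +%Y-%m-%dT%H:%M:%SZ
}

# utc_to_suffix "YYYY-MM-DD HH:MM:SS" (UTC) -> suffix for that scheduled minute
utc_to_suffix() {
  local seconds
  seconds=$(date -u -d "$1 UTC" +%s)
  echo $(( seconds / 60 ))
}

suffix_to_utc 29659227               # 2026-05-23T16:27:00Z
utc_to_suffix "2026-05-23 16:27:00"  # 29659227
```

The decoded time matches the Sat May 23 16:27:00 UTC 2026 line in the pod log. Note that it is the scheduled time, not the moment the pod happened to start.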
The ownership chain is CronJob → Job → Pod. successfulJobsHistoryLimit: 3 (the default) keeps the 3 most recent successful Jobs and automatically deletes older ones; failedJobsHistoryLimit: 1 (also the default) keeps 1 failed Job. A few other fields are worth knowing. concurrencyPolicy decides what happens when a new run is due while the previous one is still going: Allow (the default) lets them overlap, Forbid skips the new run, and Replace stops the running Job and starts the new one in its place. suspend: true pauses the schedule without deleting the CronJob.
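Putting these fields together, a sketch of a CronJob you might run in production; the name nightly-report and the 02:00 schedule are hypothetical, chosen only for illustration. It forbids overlapping runs and also sets ttlSecondsAfterFinished inside jobTemplate, so each spawned Job cleans itself up just like job-ttl did.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata: {name: nightly-report}   # hypothetical name, for illustration only
spec:
  schedule: "0 2 * * *"            # 02:00 every day
  concurrencyPolicy: Forbid        # skip a run if the previous Job is still active
  suspend: false                   # set to true to pause the schedule
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 3600   # each spawned Job deletes itself an hour after finishing
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: w
            image: busybox:1.36
            command: ["sh","-c","date; echo nightly report"]
```

The history limits and the TTL prune independently, so whichever fires first removes a given Job. To pause the schedule on a live object, kubectl patch cronjob nightly-report -p '{"spec":{"suspend":true}}' flips the same suspend field.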
🧹 Cleanup
kubectl delete cronjob cron-demo
kubectl delete job --all
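Deleting the CronJob also deletes the Jobs and pods it spawned; kubectl delete job --all removes the remaining Jobs (job-once, job-parallel, job-fail) together with their pods; job-ttl already deleted itself. The cluster is back to its two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 27-job-cronjob.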
Wrap-up
A Job is a run-to-completion controller, the opposite of the run-forever Deployment, StatefulSet and DaemonSet. It creates pods until enough of them exit with code 0, steered by three fields: completions (how many successful runs are needed), parallelism (how many pods may run at once) and backoffLimit (how many retries before the Job goes Failed with BackoffLimitExceeded; total runs = backoffLimit + 1, default 6). restartPolicy must be Never or OnFailure. ttlSecondsAfterFinished lets a Job delete itself after finishing (we saw job-ttl vanish after 20s). The CronJob spawns Jobs on a cron schedule via a jobTemplate (we caught it firing exactly on the minute boundary, with the ownership chain CronJob → Job → Pod), and adds concurrencyPolicy, history limits (default 3 successful / 1 failed) and suspend. This also settles the promise from Article 19: a native sidecar doesn't prevent a Job from completing, whereas an old-style sidecar keeps the pod running, so the Job never completes.
That's the end of Part IV: we've covered all five families of controllers. Part V shifts from "what to run" to "organizing and querying objects". Article 28 opens with labels, selectors, namespaces, annotations and field selectors: the classification and filtering toolkit we've used here and there (like the -l job-name=... in this article), now studied properly.