Job, CronJob and TTL

Kai · 7 min read

The four controllers we've met so far share an implicit trait: they run forever. A Deployment keeps N pods alive indefinitely, StatefulSet and DaemonSet too — any pod that dies is rebuilt, the goal being to always have pods running. But many real tasks have an end point: running a database migration, backing up a volume, processing a batch of data. For those, "run forever" is wrong — we need something that stops once it's done. That's the Job. This article closes Part IV with the Job, its scheduled sibling the CronJob, and the TTL mechanism that auto-cleans finished Jobs.

Job: run to completion, then stop

The docs draw a clear distinction: "Jobs represent one-off tasks that run to completion and then stop." and the mechanism: "A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate ... When a specified number of successful completions is reached, the task (ie, Job) is complete." The phrase "successfully terminate" is worth noting: a Job cares about pods exiting with code 0, not pods still running. A minimal Job:

apiVersion: batch/v1
kind: Job
metadata: {name: job-once}
spec:
  template:
    spec:
      restartPolicy: Never        # a Job only accepts Never or OnFailure
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo working; sleep 3; echo done; exit 0"]

Note restartPolicy: Never. The docs require it: "Only a RestartPolicy equal to Never or OnFailure is allowed." Always (the Pod default, see Article 18) is rejected because it means "always restart", so the pod could never "complete", which contradicts the very nature of a Job.

kubectl get job job-once
kubectl get pods -l job-name=job-once
kubectl get job job-once -o jsonpath='succeeded={.status.succeeded} complete={.status.conditions[?(@.type=="Complete")].status}{"\n"}'
NAME       STATUS     COMPLETIONS   DURATION   AGE
job-once   Complete   1/1           7s         30s

NAME             READY   STATUS      ...
job-once-7rdhj   0/1     Completed   ...

succeeded=1 complete=True

STATUS is Complete, COMPLETIONS is 1/1, the pod is in the Completed state (not Running), succeeded=1, and the condition Complete=True. The Job finished its work and stopped; the pod isn't rebuilt. This is the Succeeded pod phase from Article 18, applied at the controller level.

completions and parallelism

A Job may need to run multiple times, possibly in parallel. Two fields control this: completions (how many successful runs are needed) and parallelism (the max number of pods running at once).

apiVersion: batch/v1
kind: Job
metadata: {name: job-parallel}
spec:
  completions: 4          # need 4 completions
  parallelism: 2          # but at most 2 pods in parallel
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","sleep 3"]

kubectl get job job-parallel
kubectl get pods -l job-name=job-parallel --no-headers | wc -l
NAME           STATUS     COMPLETIONS   DURATION   AGE
job-parallel   Complete   4/4           11s        29s

4

COMPLETIONS 4/4 is reached across 4 pods, but because of parallelism: 2 the Job keeps at most 2 pods running at a time, so the work proceeds in roughly two waves: DURATION is 11s (2 waves × ~5s) instead of the ~5s it would take with all 4 in parallel. This is the pattern for batch processing: split the work into N parts and cap the concurrent load.
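The wave arithmetic behind that DURATION can be sketched in a few lines of shell. This is only a rough estimate: it assumes every pod takes about the same time, and in practice the Job simply starts a new pod whenever a slot frees up rather than waiting for a whole wave. The function names job_waves and job_estimate_seconds are made up for illustration and are not part of kubectl.

```shell
#!/bin/sh
# Hypothetical helpers (not part of kubectl): estimate how a Job with
# `completions` and `parallelism` proceeds, assuming equal-length pods.

# Number of waves = ceil(completions / parallelism), using integer math.
# $1 = completions, $2 = parallelism
job_waves() {
  completions=$1
  parallelism=$2
  echo $(( (completions + parallelism - 1) / parallelism ))
}

# Rough wall-clock estimate: waves x seconds per pod (startup included).
# $1 = completions, $2 = parallelism, $3 = seconds per pod
job_estimate_seconds() {
  waves=$(job_waves "$1" "$2")
  echo $(( waves * $3 ))
}

job_waves 4 2                 # prints 2
job_estimate_seconds 4 2 5    # prints 10
job_estimate_seconds 4 4 5    # prints 5
```

With the numbers from job-parallel (4 completions, parallelism 2, about 5s per pod) the estimate is 10s, close to the 11s observed.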

When a Job fails: backoffLimit

A Job retries when a pod fails, but not forever. backoffLimit sets the number of retries before the Job gives up. Here is a Job that always fails, with backoffLimit: 2:

apiVersion: batch/v1
kind: Job
metadata: {name: job-fail}
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo will fail; exit 1"]

kubectl get job job-fail
kubectl get pods -l job-name=job-fail --no-headers | awk '{print $1,$3}'
kubectl get job job-fail -o jsonpath='failed={.status.failed} reason={.status.conditions[?(@.type=="Failed")].reason} msg={.status.conditions[?(@.type=="Failed")].message}{"\n"}'
NAME       STATUS   COMPLETIONS   DURATION   AGE
job-fail   Failed   0/1           60s        60s

job-fail-69ldt Error
job-fail-pdz4b Error
job-fail-wdlkd Error

failed=3 reason=BackoffLimitExceeded msg=Job has reached the specified backoff limit

Note: backoffLimit: 2 but failed=3, meaning three pods failed. backoffLimit counts retries, so the total number of runs is backoffLimit + 1 (the first run + 2 retries). On hitting that ceiling, the Job goes to Failed with reason BackoffLimitExceeded. (Between attempts, the Job waits with exponential backoff, in the same spirit as the CrashLoopBackOff of Article 18.) backoffLimit defaults to 6. This is how a Job distinguishes "transient error, retry" from "really broken, stop and report".
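The counting rule, and the shape of the waits between attempts, can be sketched as follows. The delay pattern (10s, doubling each time, capped at six minutes) is the one described in the Kubernetes Job documentation; timings observed on a real cluster may differ, so treat the numbers as an assumption. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers illustrating backoffLimit arithmetic.

# With restartPolicy: Never, each failed attempt is a separate pod, so the
# Job is marked Failed after backoffLimit + 1 failed pods.
# $1 = backoffLimit
job_max_failed_pods() {
  echo $(( $1 + 1 ))
}

# Delay before each retry, assuming the documented pattern: start at 10s,
# double each time, cap at 360s (six minutes).
# $1 = number of retries
job_retry_delays() {
  retries=$1
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$retries" ]; do
    out="$out${out:+ }$delay"        # append, space-separated
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 360 ]; then
      delay=360
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}

job_max_failed_pods 2   # prints 3
job_retry_delays 2      # prints 10 20
job_retry_delays 6      # prints 10 20 40 80 160 320
```

For job-fail, job_max_failed_pods 2 gives 3, matching the three Error pods listed above.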

ttlSecondsAfterFinished: auto-clean finished Jobs

A finished Job doesn't vanish on its own — the Job object and its Completed pods stay around so you can inspect logs/results. Accumulated over time, that's clutter. ttlSecondsAfterFinished lets a Job self-destruct a number of seconds after it finishes (whether Complete or Failed):

apiVersion: batch/v1
kind: Job
metadata: {name: job-ttl}
spec:
  ttlSecondsAfterFinished: 20      # 20s after finishing, auto-delete
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: w
        image: busybox:1.36
        command: ["sh","-c","echo quick; exit 0"]

The Job completes almost instantly. Wait past 20 seconds, then look for it again:

kubectl get job job-ttl
Error from server (NotFound): jobs.batch "job-ttl" not found

The Job deleted itself and took its pod with it. No cron cleanup or external script needed. For Jobs created continuously (especially from a CronJob below), ttlSecondsAfterFinished is a tidy way to keep the cluster from drowning in old Jobs.
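The TTL rule itself is simple enough to model: a finished Job becomes eligible for deletion once its finish time plus ttlSecondsAfterFinished has passed. The sketch below is a hypothetical helper working on epoch seconds, not anything kubectl provides; for a Complete Job the finish time could be taken from .status.completionTime, and the actual deletion may lag slightly behind the computed moment.

```shell
#!/bin/sh
# Hypothetical helper: is a finished Job past its TTL?
# $1 = finish time (epoch seconds)
# $2 = ttlSecondsAfterFinished
# $3 = current time (epoch seconds)
job_ttl_expired() {
  if [ $(( $1 + $2 )) -le "$3" ]; then
    echo "expired"    # eligible for automatic deletion
  else
    echo "kept"       # still within its TTL window
  fi
}

job_ttl_expired 1000 20 1010   # prints kept
job_ttl_expired 1000 20 1020   # prints expired
```

With ttlSecondsAfterFinished: 20, a Job that finished at second 1000 is still kept at 1010 and eligible for cleanup from 1020 onward.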

CronJob: a scheduled Job

Finally, the CronJob: "A CronJob creates Jobs on a repeating schedule." The docs offer an analogy: "One CronJob object is like one line of a crontab file on a Unix system." It uses the five-field cron syntax (minute, hour, day of month, month, day of week) and a jobTemplate that is exactly the Job spec from above. The schedule * * * * * means every minute:

apiVersion: batch/v1
kind: CronJob
metadata: {name: cron-demo}
spec:
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 3      # keep the 3 most recent successful Jobs
  failedJobsHistoryLimit: 1          # keep the 1 most recent failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: w
            image: busybox:1.36
            command: ["sh","-c","date; echo hello from cronjob"]

The CronJob was created at 23:26:32 local time. Wait past the next minute boundary (23:27:00), then look:

kubectl get cronjob cron-demo
kubectl get jobs
kubectl logs job/cron-demo-29659227
NAME        SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cron-demo   * * * * *   False     0        30s             58s

NAME                 STATUS     COMPLETIONS   DURATION   AGE
cron-demo-29659227   Complete   1/1           3s         31s

Sat May 23 16:27:00 UTC 2026
hello from cronjob

LAST SCHEDULE 30s: the CronJob fired at the 23:27:00 boundary (local time) and spawned the Job cron-demo-29659227, whose suffix is the scheduled time in minutes since the Unix epoch. The pod log prints 16:27:00 UTC, the same instant, so the pod ran right at the top of the minute. This Job is owned by the CronJob:

kubectl get job cron-demo-29659227 -o jsonpath='ownerKind={.metadata.ownerReferences[0].kind} ownerName={.metadata.ownerReferences[0].name}{"\n"}'
# ownerKind=CronJob ownerName=cron-demo
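The numeric suffix of cron-demo-29659227 can be decoded by hand, assuming it is the scheduled time expressed in minutes since the Unix epoch (which is how the CronJob controller names the Jobs it creates). The two helpers below are hypothetical and use only shell arithmetic.

```shell
#!/bin/sh
# Hypothetical helpers: decode the suffix of a CronJob-created Job name,
# assuming it is the scheduled time in minutes since the Unix epoch.

# Minutes since epoch -> seconds since epoch.
cron_suffix_to_epoch() {
  echo $(( $1 * 60 ))
}

# Minutes since epoch -> HH:MM in UTC (time of day only).
cron_suffix_to_utc_hhmm() {
  minutes_into_day=$(( $1 % 1440 ))       # 1440 minutes per day
  printf '%02d:%02d\n' $(( minutes_into_day / 60 )) $(( minutes_into_day % 60 ))
}

cron_suffix_to_epoch 29659227      # prints 1779553620
cron_suffix_to_utc_hhmm 29659227   # prints 16:27
```

The 16:27 matches the timestamp in the pod log. On systems with GNU date, date -u -d @1779553620 would print the full date as well.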

The ownership chain is CronJob → Job → Pod. successfulJobsHistoryLimit: 3 (the default) keeps the 3 most recent successful Jobs and auto-cleans older ones; failedJobsHistoryLimit: 1 (also the default) keeps 1 failed Job. A few other fields are worth knowing. concurrencyPolicy decides what happens when a new run comes due while the previous one isn't done: Allow (the default, permits overlap), Forbid (skips the new run), or Replace (replaces the old run with the new one). suspend: true pauses the schedule without deleting the CronJob.
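The three concurrencyPolicy values can be modeled as a small decision function. This is a simplified sketch of the behavior described above, not the controller's real code, and cron_decide is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical model of concurrencyPolicy, not the controller's real code.
# $1 = policy (Allow|Forbid|Replace)
# $2 = number of Jobs from this CronJob that are still active
cron_decide() {
  policy=$1
  active=$2
  if [ "$active" -eq 0 ]; then
    echo "start"                   # nothing running: every policy starts the Job
    return 0
  fi
  case "$policy" in
    Allow)   echo "start" ;;       # new Job overlaps with the running one
    Forbid)  echo "skip" ;;        # this scheduled run is dropped
    Replace) echo "replace" ;;     # running Job is replaced by the new one
    *)       echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}

cron_decide Allow 1     # prints start
cron_decide Forbid 1    # prints skip
cron_decide Replace 1   # prints replace
cron_decide Forbid 0    # prints start
```

Forbid is usually the safer pick for work that must not overlap, such as a backup, while Allow suits runs that are independent of each other.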

🧹 Cleanup

kubectl delete cronjob cron-demo
kubectl delete job --all

Deleting the CronJob takes the Jobs and pods it spawned with it; job-ttl already deleted itself. The cluster returns to two CoreDNS pods. Manifests at github.com/nghiadaulau/kubernetes-from-scratch, directory 27-job-cronjob.

Wrap-up

A Job is a run-to-completion controller, the opposite of the run-forever Deployment/StatefulSet/DaemonSet. It creates pods until enough exit with code 0: completions (how many runs needed), parallelism (how many pods in parallel), backoffLimit (how many retries before Failed with BackoffLimitExceeded, total runs = backoffLimit + 1, default 6); restartPolicy must be Never or OnFailure. ttlSecondsAfterFinished lets a Job delete itself after finishing (we saw job-ttl vanish after 20s). The CronJob spawns Jobs on a cron schedule via a jobTemplate (we caught it fire exactly on the minute boundary, ownership chain CronJob→Job→Pod), with concurrencyPolicy, history limits (default 3 successful / 1 failed), and suspend. With this, the promise from Article 19 also becomes clear: a native sidecar doesn't block a Job from completing, whereas an old-style sidecar hangs a Job forever.

That's the end of Part IV — we've covered all five families of controllers. Part V shifts from "what to run" to "organizing and querying objects": Article 28 opens with labels, selectors, namespaces, annotations and field selectors, the classification and filtering toolkit we've used here and there (the very -l job-name=... in this article) now studied properly.