---
layout: blog
title: "Kubernetes 1.28: Improved failure handling for Jobs"
date: 2023-08-21
slug: kubernetes-1-28-jobapi-update
---

**Authors:** Kevin Hannon (G-Research), Michał Woźniak (Google)

This blog discusses two new features in Kubernetes 1.28 to improve Jobs for batch
users: [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy)
and [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index).

These features continue the effort started by the
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
to improve the handling of Pod failures in a Job.
## Pod replacement policy {#pod-replacement-policy}

By default, when a pod enters a terminating state (for example, due to preemption or
eviction), Kubernetes immediately creates a replacement Pod. Therefore, both Pods are running
at the same time. In API terms, a pod is considered terminating when it has a
`deletionTimestamp` and its phase is `Pending` or `Running`.
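You can check both fields directly; for example, for a Pod named `my-pod`
(a hypothetical name), the following query prints the deletion timestamp and
the phase:

```shell
# An empty first column means the Pod is not terminating.
kubectl get pod my-pod -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.status.phase}{"\n"}'
```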
The scenario when two Pods are running at a given time is problematic for
some popular machine learning frameworks, such as
TensorFlow and [JAX](https://jax.readthedocs.io/en/latest/), which require at most one Pod running at a time
for a given index.
TensorFlow gives the following error if two Pods are running for a given index:

```
/job:worker/task:4: Duplicate task registration with task_name=/job:worker/replica:0/task:4
```

See more details in the [issue](https://github.com/kubernetes/kubernetes/issues/115844).

Creating the replacement Pod before the previous one fully terminates can also
cause problems in clusters with scarce resources or with tight budgets, such as:
* cluster resources can be difficult to obtain for Pods pending to be scheduled,
  as Kubernetes might take a long time to find available nodes until the existing
  Pods are fully terminated.
* if cluster autoscaler is enabled, the replacement Pods might produce undesired
  scale-ups.
### How can you use it? {#pod-replacement-policy-how-to-use}

This is an alpha feature, which you can enable by turning on the `JobPodReplacementPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) in
your cluster.
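One way to try this out locally is, for example, a
[kind](https://kind.sigs.k8s.io/) cluster with the gate enabled on all
components; a minimal sketch, assuming you have kind installed:

```yaml
# kind-config.yaml (illustrative file name)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobPodReplacementPolicy: true
```

```shell
kind create cluster --config kind-config.yaml
```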
Once the feature is enabled in your cluster, you can use it by creating a new Job that specifies a
`podReplacementPolicy` field as shown here:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
```
In that Job, the Pods are only replaced once they reach the `Failed` phase,
and not while they are terminating.
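For completeness, here is a minimal manifest you could apply as-is; the name,
image, and command are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-replacement-policy  # illustrative name
spec:
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "sleep 300"]
```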
Additionally, you can inspect the `.status.terminating` field of a Job. The value
of the field is the number of Pods owned by the Job that are currently terminating.

```shell
kubectl get jobs/myjob -o=jsonpath='{.status.terminating}'
```

```
3 # three Pods are terminating and have not yet reached the Failed phase
```
This can be particularly useful for external queueing controllers, such as
[Kueue](https://github.com/kubernetes-sigs/kueue), that track quota
from the running Pods of a Job until the resources are reclaimed from
the currently terminating Pods.
Note that `podReplacementPolicy: Failed` is the default when using a custom
[Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy).
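In other words, in a sketch like the following (illustrative names and values),
you do not need to set `podReplacementPolicy` at all; `Failed` is implied:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-pod-failure-policy  # illustrative name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob      # fail the whole Job on this exit code
      onExitCodes:
        operator: In
        values: [42]
  # podReplacementPolicy defaults to Failed because podFailurePolicy is set
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "exit 0"]
```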
## Backoff limit per index {#backoff-limit-per-index}

By default, Pod failures for [Indexed Jobs](/docs/concepts/workloads/controllers/job/#completion-mode)
are counted towards the global limit of retries, represented by `.spec.backoffLimit`.
This means that if there is a consistently failing index, it is restarted
repeatedly until it exhausts the limit. Once the limit is reached, the entire
Job is marked failed and some indexes may never even start.
This is problematic for use cases where you want to handle Pod failures for
every index independently. For example, if you use Indexed Jobs for running
integration tests where each index corresponds to a testing suite, you
may want to account for possible flaky tests by allowing 1 or 2 retries per
suite. There might be some buggy suites, making the corresponding
indexes fail consistently. In that case you may prefer to limit retries for
the buggy suites, yet allow other suites to complete.

The feature allows you to:
* complete execution of all indexes, despite some indexes failing.
* better utilize the computational resources by avoiding unnecessary retries of consistently failing indexes.
### How can you use it? {#backoff-limit-per-index-how-to-use}

This is an alpha feature, which you can enable by turning on the
`JobBackoffLimitPerIndex`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
Once the feature is enabled in your cluster, you can create an Indexed Job with the
`.spec.backoffLimitPerIndex` field specified.
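(If you are following the kind sketch from earlier, also add
`JobBackoffLimitPerIndex: true` under `featureGates`.)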
#### Example

The following example demonstrates how to use this feature to make sure the
Job executes all indexes (provided there is no other reason for the early Job
termination, such as reaching the `activeDeadlineSeconds` timeout, or being
manually deleted by the user), and the number of failures is controlled per index.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-execute-all
spec:
  completions: 8
  parallelism: 2
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example # this example container returns an error, and fails,
                      # when it is run as the second or third index in any Job
                      # (even after a retry)
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys, time
          id = int(os.environ.get("JOB_COMPLETION_INDEX"))
          if id == 1 or id == 2:
            sys.exit(1)
          time.sleep(1)
```
Now, inspect the Pods after the Job is finished:

```sh
kubectl get pods -l job-name=job-backoff-limit-per-index-execute-all
```
The output is similar to this:

```
NAME                                              READY   STATUS      RESTARTS   AGE
job-backoff-limit-per-index-execute-all-0-b26vc   0/1     Completed   0          49s
job-backoff-limit-per-index-execute-all-1-6j5gd   0/1     Error       0          49s
job-backoff-limit-per-index-execute-all-1-6wd82   0/1     Error       0          37s
job-backoff-limit-per-index-execute-all-2-c66hg   0/1     Error       0          32s
job-backoff-limit-per-index-execute-all-2-nf982   0/1     Error       0          43s
job-backoff-limit-per-index-execute-all-3-cxmhf   0/1     Completed   0          33s
job-backoff-limit-per-index-execute-all-4-9q6kq   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-5-z9hqf   0/1     Completed   0          28s
job-backoff-limit-per-index-execute-all-6-tbkr8   0/1     Completed   0          23s
job-backoff-limit-per-index-execute-all-7-hxjsq   0/1     Completed   0          22s
```
Additionally, you can take a look at the status for that Job:

```sh
kubectl get jobs job-backoff-limit-per-index-execute-all -o yaml
```
The output ends with a `status` similar to:

```yaml
status:
  completedIndexes: 0,3-7
  failedIndexes: 1,2
  succeeded: 6
  failed: 4
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
```
Here, indexes `1` and `2` were both retried once. After the second failure
of each index, the specified `.spec.backoffLimitPerIndex` was exceeded, so
the retries were stopped. For comparison, if the per-index backoff was disabled,
then the buggy indexes would have been retried until the global `backoffLimit` was
exceeded, and then the entire Job would have been marked failed before some of the
higher indexes were started.
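If you only need the list of failed indexes, you can also query the field directly:

```shell
kubectl get job job-backoff-limit-per-index-execute-all -o jsonpath='{.status.failedIndexes}{"\n"}'
```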
## How can you learn more?

- Read the user-facing documentation for [Pod replacement policy](/docs/concepts/workloads/controllers/job/#pod-replacement-policy),
  [Backoff limit per index](/docs/concepts/workloads/controllers/job/#backoff-limit-per-index), and
  [Pod failure policy](/docs/concepts/workloads/controllers/job/#pod-failure-policy)
- Read the KEPs for [Pod Replacement Policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3939-allow-replacement-when-fully-terminated),
  [Backoff limit per index](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs), and
  [Pod failure policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures).
## Getting Involved

These features were sponsored by [SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps). Batch use cases are actively
being improved for Kubernetes users in the
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch).
Working groups are relatively short-lived initiatives focused on specific goals.
The goal of the WG Batch is to improve the experience for batch workload users, offer support for
batch processing use cases, and enhance the
Job API for common use cases. If that interests you, please join the working
group either by subscribing to our
[mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).
## Acknowledgments

As with any Kubernetes feature, multiple people contributed to getting this
done, from testing and filing bugs to reviewing code.

We would not have been able to achieve either of these features without Aldo
Culquicondor (Google) providing excellent domain knowledge and expertise
throughout the Kubernetes ecosystem.