KEP-3998: Add JobSuccessPolicy Documentation

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
pull/45135/head
Yuki Iwai 2024-02-14 23:38:47 +09:00
parent d665f924d5
commit 92a00327bb
3 changed files with 96 additions and 0 deletions

@ -1050,6 +1050,63 @@ after the operation: the built-in Job controller and the external controller
indicated by the field value.
{{< /warning >}}
### Success policy {#success-policy}
{{< feature-state for_k8s_version="v1.30" state="alpha" >}}
{{< note >}}
You can only configure a success policy for an Indexed Job if you have the
`JobSuccessPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
{{< /note >}}
When you run an Indexed Job, a success policy, defined in the `.spec.successPolicy` field,
lets you declare the Job succeeded based on which Pods, or how many Pods, have succeeded.
In some situations, you may want finer control over when a Job succeeds than
`.spec.completions` provides.
Here are some example use cases:
* To reduce the cost of running workloads by not running Pods unnecessarily,
you can terminate a Job as soon as one of its Pods succeeds.
* In batch workloads such as MPI and PyTorch, you may want only the leader index
to determine whether the Job succeeds or fails.
You can configure a success policy in the `.spec.successPolicy` field
to cover these use cases. The policy decides Job success based on the
succeeded Pods. After the Job meets the success policy, the Job controller
terminates the lingering Pods.
When you specify only `.spec.successPolicy.rules[*].succeededIndexes`,
the Job is marked as succeeded once all indexes listed in `succeededIndexes` have succeeded.
Each index in `succeededIndexes` must be between 0 and `.spec.completions - 1`, and
the list must not contain duplicate indexes. The value is a comma-separated list in which
each element is either a single index or a range of consecutive indexes, written as the
first and last index of the range separated by a hyphen.
For example, to specify 1, 3, 4, 5 and 7, set `succeededIndexes` to `1,3-5,7`.
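The interval notation can be illustrated with a short Python sketch. It is not the Job controller's code; the function name `parse_succeeded_indexes` and its error messages are hypothetical, and it only models the constraints described above (range and no duplicates).

```python
def parse_succeeded_indexes(spec: str, completions: int) -> list[int]:
    """Expand a string such as "1,3-5,7" into the list of indexes it denotes.

    Hypothetical helper for illustration only.
    """
    indexes: list[int] = []
    for part in spec.split(","):
        # "3-5" splits into ("3", "-", "5"); "7" splits into ("7", "", "").
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start
        if start > end:
            raise ValueError(f"invalid interval {part!r}")
        indexes.extend(range(start, end + 1))
    # Every index must lie between 0 and completions - 1.
    if any(i < 0 or i >= completions for i in indexes):
        raise ValueError("index out of range")
    # Duplicate indexes are not allowed.
    if len(set(indexes)) != len(indexes):
        raise ValueError("duplicate indexes")
    return indexes


print(parse_succeeded_indexes("1,3-5,7", completions=10))
```

With `completions=10`, the string `1,3-5,7` expands to the indexes 1, 3, 4, 5 and 7.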
When you specify only `.spec.successPolicy.rules[*].succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes reaches `succeededCount`.
When you specify both `succeededIndexes` and `succeededCount`,
the Job is marked as succeeded once the number of succeeded indexes from the subset
listed in `succeededIndexes` reaches `succeededCount`.
Note that when you specify multiple rules in `.spec.successPolicy.rules`,
the rules are evaluated in order. Once the Job meets a rule, the remaining rules are ignored.
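The rule semantics described above can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the Job controller's implementation: the function `first_matching_rule` is hypothetical, rules are plain dictionaries, and `succeededIndexes` is assumed to be already expanded into a set of integers.

```python
from typing import Optional


def first_matching_rule(rules: list[dict], succeeded: set[int]) -> Optional[int]:
    """Return the position of the first rule the Job meets, or None.

    Hypothetical model: each rule may hold "succeededIndexes" (a set of
    ints) and/or "succeededCount" (an int).
    """
    for position, rule in enumerate(rules):
        indexes = rule.get("succeededIndexes")
        count = rule.get("succeededCount")
        # Only indexes named by the rule are considered; all of them if unset.
        candidates = succeeded if indexes is None else succeeded & indexes
        if count is not None:
            met = len(candidates) >= count
        elif indexes is not None:
            # Only succeededIndexes given: every listed index must succeed.
            met = indexes <= succeeded
        else:
            met = False
        if met:
            # Rules are evaluated in order; later rules are ignored.
            return position
    return None


rules = [{"succeededIndexes": {0, 1, 2}, "succeededCount": 1}]
print(first_matching_rule(rules, succeeded={1}))
```

In this sketch, a Job whose only succeeded index is 1 meets the first (and only) rule, because one index out of 0, 1, and 2 has succeeded.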
Here is a manifest for a Job with `successPolicy`:
{{% code_sample file="/controllers/job-success-policy-example.yaml" %}}
In the example above, the rule of the success policy specifies that
the Job is marked as succeeded, and its lingering Pods are terminated,
as soon as any one of the indexes 0, 1, and 2 succeeds.
{{< note >}}
When you specify both a success policy and terminating policies such as `.spec.backoffLimit` and `.spec.podFailurePolicy`,
and the Job meets both, the terminating policies are respected and the success policy is ignored.
{{< /note >}}
## Alternatives
### Bare Pods

@ -0,0 +1,14 @@
---
title: JobSuccessPolicy
content_type: feature_gate
_build:
  list: never
  render: false
stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.30"
---
Allow users to specify when a Job can be declared as succeeded based on the set of succeeded pods.

@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the feature
  successPolicy:
    rules:
      - succeededIndexes: 0-2
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command: # The Job succeeds because one index among
                 # indexes 0, 1, and 2 succeeds.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "1":
              sys.exit(0)
            else:
              sys.exit(1)