add suggestions

pull/41745/head
Kevin Hannon 2023-07-31 14:15:34 -04:00
parent cd0de2832a
commit 4afba1c609
2 changed files with 20 additions and 17 deletions

@ -458,9 +458,9 @@ ensures that deleted pods have their finalizers removed by the Job controller.
{{< /note >}}
{{< note >}}
Since Kubernetes v1.28, when pod failure policy is used, the Job controller recreates
terminating pods only once they reach the terminal `Failed` phase. This behavior is analogous
to when using `podRecreationPolicy: Failed`, see [pod replacement policy](#pod-replacement-policy) for more details.
Starting with Kubernetes v1.28, when Pod failure policy is used, the Job controller recreates
terminating Pods only once these Pods reach the terminal `Failed` phase. This behavior is similar
to `podReplacementPolicy: Failed`. For more information, see [Pod replacement policy](#pod-replacement-policy).
{{< /note >}}
## Job termination and cleanup
@ -873,7 +873,7 @@ is disabled, `.spec.completions` is immutable.
Use cases for elastic Indexed Jobs include batch workloads which require
scaling an indexed Job, such as MPI, Horovord, Ray, and PyTorch training jobs.
### Pod Replacement Policy
### Delayed creation of replacement pods
{{< feature-state for_k8s_version="v1.28" state="alpha" >}}
@ -881,12 +881,19 @@ scaling an indexed Job, such as MPI, Horovord, Ray, and PyTorch training jobs.
You can only set `podReplacementPolicy` on Jobs if you enable the `JobPodReplacementPolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/).
{{< /note >}}
By default, the Job controller recreates pods as soon they are either failed or terminating (have a deletion timestamp).
This means that, at a given time, the number of running Pods for the Jobs can be greater than `parallelism` or, if using Indexed Jobs, more than one running Pod per index, if some of the Pods are terminating.
By default, the Job controller recreates Pods as soon as they either fail or are terminating (have a deletion timestamp).
This means that, at a given time, when some of the Pods are terminating, the number of running Pods for a Job can be greater than `parallelism` or greater than one Pod per index (if using Indexed Jobs).
You may choose to create replacement pods only when the terminating pod is fully terminal (has `status.phase: Failed`). To do this, set the `.spec.podReplacementPolicy: Failed`.
This will only recreate pods once they are terminated.
The default policy is `FailedOrTerminating`, meaning that the control plane creates replacement Pods upon deletion (`DeletionTimestamp != nil`).
You may choose to create replacement Pods only when the terminating Pod is fully terminal (has `status.phase: Failed`). To do this, set `.spec.podReplacementPolicy: Failed`.
This will only recreate Pods once they are terminated.
The default replacement policy depends on whether the Job has a `podFailurePolicy` set.
With no Pod failure policy defined for a Job, omitting the `podReplacementPolicy` field selects the
`FailedOrTerminating` replacement policy:
the control plane creates replacement Pods immediately upon Pod deletion
(as soon as the control plane sees that a Pod for this Job has `deletionTimestamp` set).
For Jobs with a Pod failure policy set, the default `podReplacementPolicy` is `Failed`, and no other
value is permitted.
See [Pod failure policy](#pod-failure-policy) to learn more about Pod failure policies for Jobs.
```yaml
kind: Job
@ -898,8 +905,8 @@ spec:
...
```
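For reference, a complete manifest that opts in to the `Failed` replacement policy might look like the following. This is a minimal sketch rather than the example from the page itself: the name `myjob` is borrowed from the `kubectl` command in this section, while the `completions` and `parallelism` values, the container name, the image, and the command are illustrative assumptions.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob                     # borrowed from the kubectl example; any valid name works
spec:
  completions: 4                  # illustrative value
  parallelism: 2                  # illustrative value
  # Create a replacement only after the old Pod reaches the Failed phase.
  # Settable only when the JobPodReplacementPolicy feature gate is enabled.
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never        # a Job's Pod template must use Never or OnFailure
      containers:
      - name: worker              # hypothetical container name
        image: busybox:1.36       # assumed image, for illustration only
        command: ["sh", "-c", "sleep 30"]
```

With this setting, deleting one of the Job's Pods should not start a replacement until the deleted Pod has finished terminating.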
You can inspect a new field in the JobStatus called `terminating`.
This will report the number pods that are currently terminating and is easily viewable in the status.
Provided your cluster has the feature gate enabled, you can inspect the `.status.terminating` field of a Job.
The value of the field is the number of Pods owned by the Job that are currently terminating.
```shell
kubectl get jobs/myjob -o yaml
@ -910,13 +917,9 @@ apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
terminating: 1 # if pod is terminating
terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
```
When you use a [podFailurePolicy](#pod-failure-policy) in a Job, the Job will have a default `podReplacementPolicy` value of `Failed`, and this is the only policy allowed.
If `JobPodReplacementPolicy` is disabled and `podFailurePolicy` is enabled, a Job will wait for terminating pods to be fully terminated before marking the pod as failed.
In this case, you will not be able to inspect the `terminating` field.
## Alternatives
### Bare Pods

@ -552,7 +552,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
the pod template of [Job](/docs/concepts/workloads/controllers/job).
- `JobPodFailurePolicy`: Allow users to specify handling of pod failures based on container
exit codes and pod conditions.
- `JobPodReplacementPolicy`: Allows users to specify pod replacement for terminating pods in a [Job](/docs/concepts/workloads/controllers/job)
- `JobPodReplacementPolicy`: Allows you to specify pod replacement for terminating pods in a [Job](/docs/concepts/workloads/controllers/job).
- `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
[condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
The count of `Ready` pods is recorded in the