Merge pull request #35219 from mimowo/retriable-pod-failures-docs

Add docs for KEP-3329 Retriable and non-retriable Pod failures for Jobs

commit 61b69cfd38

@@ -695,6 +695,90 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

### Pod failure policy {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes v1.25.
{{< /note >}}

A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.

In some situations, you may want to have better control when handling Pod
failures than the control provided by the default policy, which is based on the
Job's [`.spec.backoffLimit`](#pod-backoff-failure-policy). These are some
examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
  that they don't count towards the `.spec.backoffLimit` limit of retries.

You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a `podFailurePolicy`:

{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}

In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with the 42 exit
code. The following are the rules for the `main` container specifically:

- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code represents that the container failed, and hence the entire
  Pod. The Pod will be re-created if the total number of restarts is
  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.
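The exit-code plumbing these rules rely on can be tried out locally with plain `bash`, outside of any cluster; this is just an illustration, not part of the manifest:

```shell
# Mirror the example container's command: print a message, then
# exit with code 42, which the first policy rule treats as FailJob.
bash -c 'echo "Hello world!" && exit 42'
echo "exit code: $?"   # → exit code: 42
```

The kubelet records this exit code in the Pod's container status, which is where the Job controller reads it when evaluating `onExitCodes` rules.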

{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}

The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with the `DisruptionTarget` condition, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.

{{< note >}}
If the Job fails, either because of the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}

These are some requirements and semantics of the API:
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
  are ignored. When no rule matches the Pod failure, the default
  handling applies.
- you may want to restrict a rule to a specific container by specifying its name
  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
  applies to all containers. When specified, it should match one of the container
  or `initContainer` names in the Pod template.
- you may specify the action taken when a Pod failure policy is matched by
  `spec.podFailurePolicy.rules[*].action`. Possible values are:
  - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
    all running Pods should be terminated.
  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
    should not be incremented and a replacement Pod should be created.
  - `Count`: use to indicate that the Pod should be handled in the default way.
    The counter towards the `.spec.backoffLimit` should be incremented.
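Putting those semantics together, a hypothetical policy (a sketch for illustration, not the example manifest above) could combine all three actions; because rules are evaluated in order, the `Ignore` rule must come before any catch-all:

```yaml
podFailurePolicy:
  rules:
  # First match wins; later rules are ignored.
  - action: FailJob          # a non-retriable software bug
    onExitCodes:
      containerName: main    # optional; must name a container or initContainer
      operator: In
      values: [42]
  - action: Ignore           # disruptions don't count towards backoffLimit
    onPodConditions:
    - type: DisruptionTarget
  - action: Count            # any other failing exit code: default handling
    onExitCodes:
      operator: NotIn
      values: [0]
```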

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.23" state="beta" >}}

@@ -783,3 +867,5 @@ object, but maintains complete control over what Pods are created and how work i
* Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
  can use to define a series of Jobs that will run based on a schedule, similar to
  the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable pod failures
  using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).

@@ -227,6 +227,44 @@ can happen, according to:
- the type of controller
- the cluster's resource capacity

## Pod disruption conditions {#pod-disruption-conditions}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
In order to use this behavior, you must enable the `PodDisruptionConditions`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
{{< /note >}}

When enabled, a dedicated Pod `DisruptionTarget` [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) is added to indicate
that the Pod is about to be deleted due to a {{<glossary_tooltip term_id="disruption" text="disruption">}}.
The `reason` field of the condition additionally
indicates one of the following reasons for the Pod termination:

`PreemptionByKubeScheduler`
: Pod has been {{<glossary_tooltip term_id="preemption" text="preempted">}} by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).

`DeletionByTaintManager`
: Pod is due to be deleted by the Taint Manager due to a `NoExecute` taint that the Pod does not tolerate; see {{<glossary_tooltip term_id="taint" text="taint">}}-based evictions.

`EvictionByEvictionAPI`
: Pod has been marked for {{<glossary_tooltip term_id="api-eviction" text="eviction using the Kubernetes API">}}.

`DeletionByPodGC`
: Pod, as an orphaned Pod, has been deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
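One way to check whether a given Pod carries this condition (sketched here with a hypothetical Pod name, `my-pod`) is a JSONPath query against its status:

```sh
# Print the reason of the DisruptionTarget condition, if present.
kubectl get pod my-pod \
  -o jsonpath='{.status.conditions[?(@.type=="DisruptionTarget")].reason}'
```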

{{< note >}}
A Pod disruption might be interrupted. The control plane might re-attempt to
continue the disruption of the same Pod, but it is not guaranteed. As a result,
the `DisruptionTarget` condition might be added to a Pod, but that Pod might then not actually be
deleted. In such a situation, after some time, the
Pod disruption condition will be cleared.
{{< /note >}}

When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).

## Separating Cluster Owner and Application Owner Roles

Often, it is useful to think of the Cluster Manager

@@ -377,6 +377,7 @@ different Kubernetes components.
| `IngressClassNamespacedParams` | `true` | GA | 1.23 | - |
| `Initializers` | `false` | Alpha | 1.7 | 1.13 |
| `Initializers` | - | Deprecated | 1.14 | - |
| `JobPodFailurePolicy` | `false` | Alpha | 1.25 | - |
| `KubeletConfigFile` | `false` | Alpha | 1.8 | 1.9 |
| `KubeletConfigFile` | - | Deprecated | 1.10 | - |
| `KubeletPluginsWatcher` | `false` | Alpha | 1.11 | 1.11 |
@@ -419,6 +420,7 @@ different Kubernetes components.
| `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
| `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
| `PodDisruptionBudget` | `true` | GA | 1.21 | - |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
| `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
| `PodOverhead` | `true` | GA | 1.24 | - |
@@ -950,6 +952,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
  support for IPv6.
- `JobMutableNodeSchedulingDirectives`: Allows updating node scheduling directives in
  the pod template of [Job](/docs/concepts/workloads/controllers/job).
- `JobPodFailurePolicy`: Allow users to specify handling of pod failures based on container exit codes and pod conditions.
- `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
  [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
  The count of `Ready` pods is recorded in the
@@ -1042,6 +1045,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
- `PodAndContainerStatsFromCRI`: Configure the kubelet to gather container and
  pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionBudget`: Enable the [PodDisruptionBudget](/docs/tasks/run-application/configure-pdb/) feature.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodOverhead`: Enable the [PodOverhead](/docs/concepts/scheduling-eviction/pod-overhead/)
  feature to account for pod overheads.

@@ -0,0 +1,139 @@
---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

<!-- overview -->

This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve control over the handling of container- or Pod-level failures
within a {{<glossary_tooltip text="Job" term_id="job">}}.

The definition of Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).

## {{% heading "prerequisites" %}}

You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

{{< note >}}
As the features are in Alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}
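How you enable the gates depends on how your cluster is provisioned. As one sketch, a local test cluster created with `minikube` (an assumption for illustration, not a requirement of this task) accepts them at startup via the standard comma-separated `--feature-gates` flag:

```sh
minikube start \
  --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true
```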

## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:

{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}

by running:

```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```

After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```

In the Job status, you can see a Job `Failed` condition with the field `reason`
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
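If you only want the condition's `reason` rather than the full YAML, a JSONPath query is one option (a sketch; the label selector matches the Job created above):

```sh
# Prints "PodFailurePolicy" once the Failed condition is set.
kubectl get jobs -l job-name=job-pod-failure-policy-failjob \
  -o jsonpath='{.items[0].status.conditions[?(@.type=="Failed")].reason}'
```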

For comparison, if the Pod failure policy were disabled, it would take 6 retries
of the Pod, taking at least 2 minutes.

### Clean up

Delete the Job you created:
```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```
The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions

With the following example, you can learn how to use Pod failure policy to
stop Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.

{{< caution >}}
Timing is important for this example, so you may want to read the steps before
executing them. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s of the Pod being scheduled).
{{< /caution >}}

1. Create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-ignore.yaml
   ```

2. Run this command to check the `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

3. Drain the node to evict the Pod before it completes (within 90s):
   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

4. Inspect the `.status.failed` field to check that the counter for the Job is not incremented:
   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

5. Uncordon the node:
   ```sh
   kubectl uncordon nodes/$nodeName
   ```

The Job resumes and succeeds.
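To double-check the outcome without reading the full YAML, you could query the success and failure counters directly (a sketch; `.status.failed` stays unset, rather than incremented, when disruptions are ignored):

```sh
kubectl get jobs -l job-name=job-pod-failure-policy-ignore \
  -o jsonpath='succeeded={.items[0].status.succeeded} failed={.items[0].status.failed}'
```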

For comparison, if the Pod failure policy were disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).

### Cleaning up

Delete the Job you created:
```sh
kubectl delete jobs/job-pod-failure-policy-ignore
```
The cluster automatically cleans up the Pods.

## Alternatives

You could rely solely on the
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a value for `.spec.backoffLimit`
low enough to avoid unnecessary Pod retries, yet high enough to make sure the Job would
not be terminated by Pod disruptions.

@@ -0,0 +1,28 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption
@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
@@ -0,0 +1,23 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
  backoffLimit: 0
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget