Merge pull request #35219 from mimowo/retriable-pod-failures-docs
Add docs for KEP-3329: Retriable and non-retriable Pod failures for Jobs
commit 61b69cfd38

@@ -695,6 +695,90 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

### Pod failure policy {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes v1.25.
{{< /note >}}

A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.

In some situations, you may want to have better control over how Pod failures
are handled than the control provided by the default policy, which is based on
the Job's [`.spec.backoffLimit`](#pod-backoff-failure-policy). These are some
examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
  that they don't count towards the `.spec.backoffLimit` limit of retries.

You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a `podFailurePolicy`:

{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}

In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with the 42 exit
code. The following are the rules for the `main` container specifically:

- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code represents that the container failed, and hence the entire
  Pod. The Pod will be re-created if the total number of restarts is
  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.

{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}

The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with condition `DisruptionTarget`, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.

{{< note >}}
If the Job fails, either due to the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}

These are some requirements and semantics of the API:
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
  are ignored. When no rule matches the Pod failure, the default
  handling applies.
- you may want to restrict a rule to a specific container by specifying its name
  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
  applies to all containers. When specified, it should match one of the container
  or `initContainer` names in the Pod template.
- you may specify the action taken when a Pod failure policy is matched by
  `spec.podFailurePolicy.rules[*].action` (see the sketch after this list). Possible values are:
  - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
    all running Pods should be terminated.
  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
    should not be incremented and a replacement Pod should be created.
  - `Count`: use to indicate that the Pod should be handled in the default way.
    The counter towards the `.spec.backoffLimit` should be incremented.
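
As an illustration of these semantics (not part of the example manifest above), here is a sketch of a
`.spec.podFailurePolicy` fragment that combines the three actions; the exit codes 42 and 1 and the
container name `main` are assumptions chosen for illustration only:

```yaml
# Illustrative fragment of a Job's .spec; exit codes and container name are
# assumed values, not ones required by the API.
podFailurePolicy:
  rules:
  - action: FailJob            # fail the whole Job on a known fatal exit code
    onExitCodes:
      containerName: main      # optional; restricts the rule to this container
      operator: In             # one of: In, NotIn
      values: [42]
  - action: Ignore             # don't count disruptions towards .spec.backoffLimit
    onPodConditions:
    - type: DisruptionTarget
  - action: Count              # treat this exit code as a regular, countable failure
    onExitCodes:
      operator: In
      values: [1]
```

Because rules are evaluated in order, the `Count` rule only applies to failures
that were not already matched by the first two rules.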

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.23" state="beta" >}}

@@ -783,3 +867,5 @@ object, but maintains complete control over what Pods are created and how work i

* Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
  can use to define a series of Jobs that will run based on a schedule, similar to
  the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable pod failures
  using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).

@@ -227,6 +227,44 @@ can happen, according to:

- the type of controller
- the cluster's resource capacity

## Pod disruption conditions {#pod-disruption-conditions}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
In order to use this behavior, you must enable the `PodDisruptionConditions`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
{{< /note >}}

When enabled, a dedicated Pod `DisruptionTarget` [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) is added to indicate
that the Pod is about to be deleted due to a {{<glossary_tooltip term_id="disruption" text="disruption">}}.
The `reason` field of the condition additionally
indicates one of the following reasons for the Pod termination:

`PreemptionByKubeScheduler`
: Pod has been {{<glossary_tooltip term_id="preemption" text="preempted">}} by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).

`DeletionByTaintManager`
: Pod is due to be deleted by Taint Manager due to a `NoExecute` taint that the Pod does not tolerate; see {{<glossary_tooltip term_id="taint" text="taint">}}-based evictions.

`EvictionByEvictionAPI`
: Pod has been marked for {{<glossary_tooltip term_id="api-eviction" text="eviction using the Kubernetes API">}}.

`DeletionByPodGC`
: An orphaned Pod deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
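
To see which, if any, disruption condition was added to a Pod, you can inspect the Pod's
status conditions; this is a minimal sketch in which `my-pod` is a placeholder name:

```sh
# Print the Pod's status conditions; look for type "DisruptionTarget"
# and check its "reason" field (my-pod is a placeholder).
kubectl get pod my-pod -o jsonpath='{.status.conditions}'
```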

{{< note >}}
A Pod disruption might be interrupted. The control plane might re-attempt to
continue the disruption of the same Pod, but it is not guaranteed. As a result,
the `DisruptionTarget` condition might be added to a Pod, but that Pod might then not actually be
deleted. In such a situation, after some time, the
Pod disruption condition will be cleared.
{{< /note >}}

When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).
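
For example, a Job could ignore disruption-related Pod failures so that they do not count towards
`.spec.backoffLimit`. The snippet below is only a sketch of the relevant fragment of a Job's `.spec`;
the rest of the Job definition is omitted:

```yaml
# Illustrative fragment of a Job's .spec (other fields omitted).
podFailurePolicy:
  rules:
  - action: Ignore             # do not count disrupted Pods as retries
    onPodConditions:
    - type: DisruptionTarget   # the Pod disruption condition described above
```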

## Separating Cluster Owner and Application Owner Roles

Often, it is useful to think of the Cluster Manager

@@ -377,6 +377,7 @@ different Kubernetes components.

| `IngressClassNamespacedParams` | `true` | GA | 1.23 | - |
| `Initializers` | `false` | Alpha | 1.7 | 1.13 |
| `Initializers` | - | Deprecated | 1.14 | - |
| `JobPodFailurePolicy` | `false` | Alpha | 1.25 | - |
| `KubeletConfigFile` | `false` | Alpha | 1.8 | 1.9 |
| `KubeletConfigFile` | - | Deprecated | 1.10 | - |
| `KubeletPluginsWatcher` | `false` | Alpha | 1.11 | 1.11 |

@@ -419,6 +420,7 @@ different Kubernetes components.

| `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
| `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
| `PodDisruptionBudget` | `true` | GA | 1.21 | - |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
| `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
| `PodOverhead` | `true` | GA | 1.24 | - |

@@ -950,6 +952,7 @@ Each feature gate is designed for enabling/disabling a specific feature:

  support for IPv6.
- `JobMutableNodeSchedulingDirectives`: Allows updating node scheduling directives in
  the pod template of [Job](/docs/concepts/workloads/controllers/job).
- `JobPodFailurePolicy`: Allow users to specify handling of pod failures based on container exit codes and pod conditions.
- `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
  [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
  The count of `Ready` pods is recorded in the

@@ -1042,6 +1045,7 @@ Each feature gate is designed for enabling/disabling a specific feature:

- `PodAndContainerStatsFromCRI`: Configure the kubelet to gather container and
  pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionBudget`: Enable the [PodDisruptionBudget](/docs/tasks/run-application/configure-pdb/) feature.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodOverhead`: Enable the [PodOverhead](/docs/concepts/scheduling-eviction/pod-overhead/)
  feature to account for pod overheads.

@@ -0,0 +1,139 @@
---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

<!-- overview -->

This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve the control over the handling of container- or Pod-level failure
within a {{<glossary_tooltip text="Job" term_id="job">}}.

The definition of Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).

## {{% heading "prerequisites" %}}

You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

{{< note >}}
As the features are in alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}
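
How you enable the feature gates depends on how your cluster was deployed. As a non-authoritative
sketch, if you run the control plane components directly (or edit their static Pod manifests, for
example under `/etc/kubernetes/manifests` on a kubeadm cluster), you would pass the `--feature-gates`
flag to both the kube-apiserver and the kube-controller-manager:

```sh
# Illustrative flags only; adapt to how your control plane is managed.
kube-apiserver --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
kube-controller-manager --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
```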

## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:

{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}

by running:

```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```

After around 30s the entire Job should be terminated. Inspect the status of the Job by running:

```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```

In the Job status, see a Job `Failed` condition with the field `reason`
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
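
If you prefer not to scan the full YAML output, you can print only the Job's conditions with a
JSONPath query; this is an optional variant of the command above:

```sh
# Print only the Job's status conditions; expect a condition with
# type "Failed" and reason "PodFailurePolicy".
kubectl get jobs -l job-name=job-pod-failure-policy-failjob \
  -o jsonpath='{.items[0].status.conditions}'
```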

For comparison, if the Pod failure policy were disabled, it would take 6 retries
of the Pod, taking at least 2 minutes.

### Clean up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```

The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions

With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.

{{< caution >}}
Timing is important for this example, so you may want to read the steps before
execution. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s after the Pod is scheduled).
{{< /caution >}}

1. Create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-ignore.yaml
   ```

2. Run this command to check the `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

3. Drain the node to evict the Pod before it completes (within 90s):

   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

4. Inspect the `.status.failed` field to check that the counter for the Job is not incremented:

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

5. Uncordon the node:

   ```sh
   kubectl uncordon nodes/$nodeName
   ```

The Job resumes and succeeds.
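
Optionally, you can confirm the outcome by printing the Job's `succeeded` counter; once the Job
finishes, it should equal the number of completions (4 in this example):

```sh
# Print the number of successfully completed Pods for the Job.
kubectl get jobs -l job-name=job-pod-failure-policy-ignore \
  -o jsonpath='{.items[0].status.succeeded}'
```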

For comparison, if the Pod failure policy were disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).

### Cleaning up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-ignore
```

The cluster automatically cleans up the Pods.

## Alternatives

You could rely solely on the
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a value for `.spec.backoffLimit` that is
low enough to avoid unnecessary Pod retries, yet high enough to make sure the Job would
not be terminated by Pod disruptions.
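
For comparison, the backoff-only alternative is a single field on the Job's `.spec`; the value
shown below is the API default and is used here only for illustration:

```yaml
# Illustrative fragment of a Job's .spec: every Pod failure, including
# failures caused by disruptions, counts towards this limit before the
# Job is marked Failed.
backoffLimit: 6
```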

@@ -0,0 +1,28 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption

@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]

@@ -0,0 +1,23 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
  backoffLimit: 0
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget