From c47a02571329a46ec1b359f463ae42ba9a694415 Mon Sep 17 00:00:00 2001
From: Michal Wozniak
Date: Mon, 15 Aug 2022 13:22:37 +0200
Subject: [PATCH 1/2] Add docs for KEP-3329 Retriable and non-retriable Pod
 failures for Jobs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Code review remarks and suggested commit updates are co-authored

Co-authored-by: Tim Bannister
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Co-authored-by: Paola Cortés <51036950+cortespao@users.noreply.github.com>

# Conflicts:
#	content/en/docs/reference/command-line-tools-reference/feature-gates.md
---
 .../concepts/workloads/controllers/job.md     |  86 +++++++++++
 .../concepts/workloads/pods/disruptions.md    |  29 ++++
 .../feature-gates.md                          |   4 +
 .../en/docs/tasks/job/pod-failure-policy.md   | 139 ++++++++++++++++++
 .../job-pod-failure-policy-example.yaml       |  28 ++++
 .../job-pod-failure-policy-failjob.yaml       |  25 ++++
 .../job-pod-failure-policy-ignore.yaml        |  23 +++
 7 files changed, 334 insertions(+)
 create mode 100644 content/en/docs/tasks/job/pod-failure-policy.md
 create mode 100644 content/en/examples/controllers/job-pod-failure-policy-example.yaml
 create mode 100644 content/en/examples/controllers/job-pod-failure-policy-failjob.yaml
 create mode 100644 content/en/examples/controllers/job-pod-failure-policy-ignore.yaml

diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
index cb1b72fd03..639a31136c 100644
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@@ -695,6 +695,90 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
 `manualSelector: true` tells the system that you know what you are doing and to allow this mismatch.
 
+### Pod failure policy {#pod-failure-policy}
+
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+{{< note >}}
+You can only configure a Pod failure policy for a Job if you have the
+`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled in your cluster. Additionally, it is recommended
+to enable the `PodDisruptionConditions` feature gate so that you can detect and handle
+Pod disruption conditions in the Pod failure policy (see also:
+[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
+available in Kubernetes v1.25.
+{{< /note >}}
+
+A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
+your cluster to handle Pod failures based on the container exit codes and the
+Pod conditions.
+
+In some situations, you may want better control over the handling of Pod
+failures than the control provided by the default policy, which is based on the
+Job's [`.spec.backoffLimit`](#pod-backoff-failure-policy). These are some
+examples of use cases:
+* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
+  you can terminate a Job as soon as one of its Pods fails with an exit code
+  indicating a software bug.
+* To guarantee that your Job finishes even if there are disruptions, you can
+  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
+  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
+  that they don't count towards the `.spec.backoffLimit` limit of retries.
+
+You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
+to meet the above use cases. This policy can handle Pod failures based on the
+container exit codes and the Pod conditions.
+
+Here is a manifest for a Job that defines a `podFailurePolicy`:
+
+{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
+
+In the example above, the first rule of the Pod failure policy specifies that
+the Job should be marked as failed if the `main` container fails with exit
+code 42. The following are the rules for the `main` container specifically:
+
+- an exit code of 0 means that the container succeeded
+- an exit code of 42 means that the **entire Job** failed
+- any other exit code means that the container, and hence the entire
+  Pod, failed. The Pod will be re-created if the total number of restarts is
+  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.
+
+{{< note >}}
+Because the Pod template specifies `restartPolicy: Never`,
+the kubelet does not restart the `main` container in that particular Pod.
+{{< /note >}}
+
+The second rule of the Pod failure policy, specifying the `Ignore` action for
+failed Pods with the condition `DisruptionTarget`, excludes Pod disruptions from
+being counted towards the `.spec.backoffLimit` limit of retries.
+
+{{< note >}}
+If the Job fails, either due to the Pod failure policy or the Pod backoff
+failure policy, and the Job is running multiple Pods, Kubernetes terminates all
+the Pods in that Job that are still Pending or Running.
+{{< /note >}}
+
+These are some requirements and semantics of the API (an illustrative sketch
+that combines them follows this list):
+- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
+  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
+- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
+  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
+  are ignored. When no rule matches the Pod failure, the default
+  handling applies.
+- you may want to restrict a rule to a specific container by specifying its name
+  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
+  applies to all containers. When specified, it should match one of the container
+  or `initContainer` names in the Pod template.
+- you may specify the action taken when a Pod failure policy is matched by
+  `spec.podFailurePolicy.rules[*].action`. Possible values are:
+  - `FailJob`: use to indicate that the Pod's Job should be marked as Failed and
+    all running Pods should be terminated.
+  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
+    should not be incremented and a replacement Pod should be created.
+  - `Count`: use to indicate that the Pod should be handled in the default way.
+    The counter towards the `.spec.backoffLimit` should be incremented.
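+
+For illustration only, the following sketch shows a Job that combines all three
+actions in one policy. The Job name, container name, image, and the exit codes
+other than 42 are hypothetical and are not taken from the manifests referenced
+on this page:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: pod-failure-policy-sketch    # hypothetical name, for illustration only
+spec:
+  backoffLimit: 6
+  template:
+    spec:
+      restartPolicy: Never           # required when .spec.podFailurePolicy is set
+      containers:
+      - name: main                   # hypothetical container name
+        image: docker.io/library/bash:5
+        command: ["bash", "-c", "exit 1"]   # placeholder workload
+  podFailurePolicy:
+    rules:
+    - action: FailJob                # exit code 42 marks the whole Job as failed
+      onExitCodes:
+        containerName: main          # optional; omit to match any container
+        operator: In
+        values: [42]
+    - action: Ignore                 # disruptions are not counted towards backoffLimit
+      onPodConditions:
+      - type: DisruptionTarget
+    - action: Count                  # these exit codes get the default handling
+      onExitCodes:
+        operator: In
+        values: [1, 2, 3]
+```
+
+Because the rules are evaluated in order, the `FailJob` rule takes precedence if
+a failed Pod matches both the exit code 42 rule and the `DisruptionTarget`
+condition rule.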
+
 ### Job tracking with finalizers
 
 {{< feature-state for_k8s_version="v1.23" state="beta" >}}
@@ -783,3 +867,5 @@ object, but maintains complete control over what Pods are created and how work is assigned to them.
 * Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
   can use to define a series of Jobs that will run based on a schedule, similar to
   the UNIX tool `cron`.
+* Practice how to configure handling of retriable and non-retriable pod failures
+  using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).
diff --git a/content/en/docs/concepts/workloads/pods/disruptions.md b/content/en/docs/concepts/workloads/pods/disruptions.md
index 055fc0a65d..a9e1a93e0d 100644
--- a/content/en/docs/concepts/workloads/pods/disruptions.md
+++ b/content/en/docs/concepts/workloads/pods/disruptions.md
@@ -227,6 +227,35 @@ can happen, according to:
 - the type of controller
 - the cluster's resource capacity
 
+## Pod disruption conditions {#pod-disruption-conditions}
+
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+{{< note >}}
+In order to use this behavior, you must enable `PodDisruptionConditions`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+in your cluster.
+{{< /note >}}
+
+When enabled, a dedicated Pod `DisruptionTarget` condition is added to indicate
+an imminent disruption of a Pod. The `reason` field of the condition additionally
+indicates one of the following reasons for the Pod termination:
+- `PreemptionByKubeScheduler`: Pod preempted by kube-scheduler to accommodate a Pod with higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).
+- `DeletionByTaintManager`: Pod deleted by taint manager due to NoExecute taint, see more [here](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
+- `EvictionByEvictionAPI`: Pod evicted by [Eviction API](/docs/concepts/scheduling-eviction/api-eviction/).
+- `DeletionByPodGC`: an orphaned Pod deleted by [PodGC](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
+
+{{< note >}}
+A Pod disruption might be interrupted. The control plane might re-attempt to
+continue the disruption of the same Pod, but it is not guaranteed. As a result,
+the `DisruptionTarget` condition might be added to Pod, but the Pod might not be
+deleted. In such a situation, after some time, the
+Pod disruption condition will be cleared.
+{{< /note >}}
+
+When using a Job, you may want to use these Pod disruption conditions you defined in your
+[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).
+
 ## Separating Cluster Owner and Application Owner Roles
 
 Often, it is useful to think of the Cluster Manager
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates.md b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
index dfcee748bd..e73cc6d51b 100644
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
@@ -380,6 +380,7 @@ different Kubernetes components.
 | `IngressClassNamespacedParams` | `true` | GA | 1.23 | - |
 | `Initializers` | `false` | Alpha | 1.7 | 1.13 |
 | `Initializers` | - | Deprecated | 1.14 | - |
+| `JobPodFailurePolicy` | `false` | Alpha | 1.25 | - |
 | `KubeletConfigFile` | `false` | Alpha | 1.8 | 1.9 |
 | `KubeletConfigFile` | - | Deprecated | 1.10 | - |
 | `KubeletPluginsWatcher` | `false` | Alpha | 1.11 | 1.11 |
@@ -416,6 +417,7 @@ different Kubernetes components.
 | `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
 | `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
 | `PodDisruptionBudget` | `true` | GA | 1.21 | - |
+| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
 | `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
 | `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
 | `PodOverhead` | `true` | GA | 1.24 | - |
@@ -947,6 +949,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
   support for IPv6.
 - `JobMutableNodeSchedulingDirectives`: Allows updating node scheduling directives in
   the pod template of [Job](/docs/concepts/workloads/controllers/job).
+- `JobPodFailurePolicy`: Allow users to specify handling of pod failures based on container exit codes and pod conditions.
 - `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
   [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
   The count of `Ready` pods is recorded in the
@@ -1039,6 +1042,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
 - `PodAndContainerStatsFromCRI`: Configure the kubelet to gather container and pod stats from the CRI container runtime rather than gathering them from cAdvisor.
 - `PodDisruptionBudget`: Enable the [PodDisruptionBudget](/docs/tasks/run-application/configure-pdb/) feature.
+- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
 - `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
 - `PodOverhead`: Enable the [PodOverhead](/docs/concepts/scheduling-eviction/pod-overhead/) feature
   to account for pod overheads.
diff --git a/content/en/docs/tasks/job/pod-failure-policy.md b/content/en/docs/tasks/job/pod-failure-policy.md
new file mode 100644
index 0000000000..3ba337fcea
--- /dev/null
+++ b/content/en/docs/tasks/job/pod-failure-policy.md
@@ -0,0 +1,139 @@
+---
+title: Handling retriable and non-retriable pod failures with Pod failure policy
+content_type: task
+min-kubernetes-server-version: v1.25
+weight: 60
+---
+
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+This document shows you how to use the
+[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
+in combination with the default
+[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
+to improve the control over the handling of container- or Pod-level failure
+within a {{< glossary_tooltip text="Job" term_id="job" >}}.
+
+The definition of Pod failure policy may help you to better utilize the computational
+resources by avoiding unnecessary Pod retries. This policy also lets you avoid Job
+failures due to Pod disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
+{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
+
+## {{% heading "prerequisites" %}}
+
+You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).
+
+{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
+
+{{< note >}}
+As the features are in Alpha, prepare the Kubernetes cluster with the two
+[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
+enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
+{{< /note >}}
+
+## Using Pod failure policy to avoid unnecessary Pod retries
+
+With the following example, you can learn how to use Pod failure policy to
+avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
+software bug.
+
+First, create a Job based on the config:
+
+{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}
+
+by running:
+
+```sh
+kubectl create -f job-pod-failure-policy-failjob.yaml
+```
+
+After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
+
+```sh
+kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
+```
+
+In the Job status, you can see a Job `Failed` condition with the `reason` field
+equal to `PodFailurePolicy`. Additionally, the `message` field contains
+more detailed information about the Job termination, such as:
+`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+
+For comparison, if the Pod failure policy were disabled, it would take 6 retries
+of the Pod, taking at least 2 minutes.
+
+### Cleaning up
+
+Delete the Job you created:
+
+```sh
+kubectl delete jobs/job-pod-failure-policy-failjob
+```
+
+The cluster automatically cleans up the Pods.
+
+## Using Pod failure policy to ignore Pod disruptions
+
+With the following example, you can learn how to use a Pod failure policy to
+prevent Pod disruptions from incrementing the Pod retry counter towards the
+`.spec.backoffLimit` limit.
+
+{{< caution >}}
+Timing is important for this example, so you may want to read the steps before
+execution. In order to trigger a Pod disruption, it is important to drain the
+node while the Pod is running on it (within 90s after the Pod is scheduled).
+{{< /caution >}}
+
+1. Create a Job based on the config:
+
+{{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}
+
+by running:
+
+```sh
+kubectl create -f job-pod-failure-policy-ignore.yaml
+```
+
+2. Run this command to check which node (`nodeName`) the Pod is scheduled to:
+
+```sh
+nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
+```
+
+3. Drain the node to evict the Pod before it completes (within 90s). The evicted
+   Pod gets a `DisruptionTarget` condition; see the illustrative sketch after the
+   clean-up steps.
+
+```sh
+kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
+```
+
+4. Inspect the `.status.failed` field to check that the counter for the Job was not incremented:
+
+```sh
+kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
+```
+
+5. Uncordon the node:
+
+```sh
+kubectl uncordon nodes/$nodeName
+```
+
+The Job resumes and succeeds.
+
+For comparison, if the Pod failure policy were disabled, the Pod disruption would
+result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
+
+### Cleaning up
+
+Delete the Job you created:
+
+```sh
+kubectl delete jobs/job-pod-failure-policy-ignore
+```
+
+The cluster automatically cleans up the Pods.
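+
+For reference, while the Pod evicted in step 3 is terminating, its status carries
+the `DisruptionTarget` condition that the `Ignore` rule matches. The excerpt below
+is an illustrative sketch only; the exact `reason`, `message`, and timestamp depend
+on how and when the disruption was triggered:
+
+```yaml
+# Hypothetical excerpt of `kubectl get pods <pod-name> -o yaml` for the evicted Pod
+status:
+  conditions:
+  - type: DisruptionTarget
+    status: "True"
+    reason: EvictionByEvictionAPI             # kubectl drain evicts via the Eviction API
+    message: "evicted by the Eviction API"    # placeholder message
+    lastTransitionTime: "2022-08-15T11:00:00Z"   # placeholder timestamp
+```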
+
+## Alternatives
+
+You could rely solely on the
+[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
+by specifying the Job's `.spec.backoffLimit` field. However, in many situations
+it is problematic to find a balance: `.spec.backoffLimit` needs to be low enough
+to avoid unnecessary Pod retries, yet high enough to make sure the Job is not
+terminated by Pod disruptions.
diff --git a/content/en/examples/controllers/job-pod-failure-policy-example.yaml b/content/en/examples/controllers/job-pod-failure-policy-example.yaml
new file mode 100644
index 0000000000..f75d4d6bb1
--- /dev/null
+++ b/content/en/examples/controllers/job-pod-failure-policy-example.yaml
@@ -0,0 +1,28 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: job-pod-failure-policy-example
+spec:
+  completions: 12
+  parallelism: 3
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: main
+        image: docker.io/library/bash:5
+        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
+        args:
+        - -c
+        - echo "Hello world!" && sleep 5 && exit 42
+  backoffLimit: 6
+  podFailurePolicy:
+    rules:
+    - action: FailJob
+      onExitCodes:
+        containerName: main      # optional
+        operator: In             # one of: In, NotIn
+        values: [42]
+    - action: Ignore             # one of: Ignore, FailJob, Count
+      onPodConditions:
+      - type: DisruptionTarget   # indicates Pod disruption
diff --git a/content/en/examples/controllers/job-pod-failure-policy-failjob.yaml b/content/en/examples/controllers/job-pod-failure-policy-failjob.yaml
new file mode 100644
index 0000000000..a83abe84c1
--- /dev/null
+++ b/content/en/examples/controllers/job-pod-failure-policy-failjob.yaml
@@ -0,0 +1,25 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: job-pod-failure-policy-failjob
+spec:
+  completions: 8
+  parallelism: 2
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: main
+        image: docker.io/library/bash:5
+        command: ["bash"]
+        args:
+        - -c
+        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
+  backoffLimit: 6
+  podFailurePolicy:
+    rules:
+    - action: FailJob
+      onExitCodes:
+        containerName: main
+        operator: In
+        values: [42]
diff --git a/content/en/examples/controllers/job-pod-failure-policy-ignore.yaml b/content/en/examples/controllers/job-pod-failure-policy-ignore.yaml
new file mode 100644
index 0000000000..9747644ff2
--- /dev/null
+++ b/content/en/examples/controllers/job-pod-failure-policy-ignore.yaml
@@ -0,0 +1,23 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: job-pod-failure-policy-ignore
+spec:
+  completions: 4
+  parallelism: 2
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: main
+        image: docker.io/library/bash:5
+        command: ["bash"]
+        args:
+        - -c
+        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
+  backoffLimit: 0
+  podFailurePolicy:
+    rules:
+    - action: Ignore
+      onPodConditions:
+      - type: DisruptionTarget

From 449ef99fe393b222e68504e19a9600f3e6a34e88 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Wo=C5=BAniak?=
Date: Fri, 12 Aug 2022 20:18:51 +0200
Subject: [PATCH 2/2] Update content/en/docs/tasks/job/pod-failure-policy.md

Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
Co-authored-by: Tim Bannister
---
 .../concepts/workloads/pods/disruptions.md    | 27 ++++++++++++-------
 .../en/docs/tasks/job/pod-failure-policy.md   | 10 +++----
 2 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/content/en/docs/concepts/workloads/pods/disruptions.md b/content/en/docs/concepts/workloads/pods/disruptions.md
index a9e1a93e0d..09122df8fa 100644
--- a/content/en/docs/concepts/workloads/pods/disruptions.md
+++ b/content/en/docs/concepts/workloads/pods/disruptions.md
@@ -232,28 +232,37 @@ can happen, according to:
 {{< feature-state for_k8s_version="v1.25" state="alpha" >}}
 
 {{< note >}}
-In order to use this behavior, you must enable `PodDisruptionConditions`
+In order to use this behavior, you must enable the `PodDisruptionConditions`
 [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
 in your cluster.
 {{< /note >}}
 
-When enabled, a dedicated Pod `DisruptionTarget` condition is added to indicate
-an imminent disruption of a Pod. The `reason` field of the condition additionally
+When enabled, a dedicated Pod `DisruptionTarget` [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) is added to indicate
+that the Pod is about to be deleted due to a {{< glossary_tooltip text="disruption" term_id="disruption" >}}.
+The `reason` field of the condition additionally
 indicates one of the following reasons for the Pod termination:
-- `PreemptionByKubeScheduler`: Pod preempted by kube-scheduler to accommodate a Pod with higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).
-- `DeletionByTaintManager`: Pod deleted by taint manager due to NoExecute taint, see more [here](/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions).
-- `EvictionByEvictionAPI`: Pod evicted by [Eviction API](/docs/concepts/scheduling-eviction/api-eviction/).
-- `DeletionByPodGC`: an orphaned Pod deleted by [PodGC](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
+
+`PreemptionByKubeScheduler`
+: Pod has been {{< glossary_tooltip text="preempted" term_id="preemption" >}} by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).
+
+`DeletionByTaintManager`
+: Pod is due to be deleted by Taint Manager due to a `NoExecute` taint that the Pod does not tolerate; see {{< glossary_tooltip text="taint" term_id="taint" >}}-based evictions.
+
+`EvictionByEvictionAPI`
+: Pod has been marked for {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}.
+
+`DeletionByPodGC`
+: An orphaned Pod deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
 
 {{< note >}}
 A Pod disruption might be interrupted. The control plane might re-attempt to
 continue the disruption of the same Pod, but it is not guaranteed. As a result,
-the `DisruptionTarget` condition might be added to Pod, but the Pod might not be
+the `DisruptionTarget` condition might be added to a Pod, but that Pod might then not actually be
 deleted. In such a situation, after some time, the
 Pod disruption condition will be cleared.
 {{< /note >}}
 
-When using a Job, you may want to use these Pod disruption conditions you defined in your
+When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's
 [Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).
 
 ## Separating Cluster Owner and Application Owner Roles
diff --git a/content/en/docs/tasks/job/pod-failure-policy.md b/content/en/docs/tasks/job/pod-failure-policy.md
index 3ba337fcea..f6243f73ef 100644
--- a/content/en/docs/tasks/job/pod-failure-policy.md
+++ b/content/en/docs/tasks/job/pod-failure-policy.md
@@ -16,11 +16,11 @@ in combination with the default
 to improve the control over the handling of container- or Pod-level failure
 within a {{< glossary_tooltip text="Job" term_id="job" >}}.
 
-The definition of Pod failure policy may help you to better utilize the computational
-resources by avoiding unnecessary Pod retries. This policy also lets you avoid Job
-failures due to Pod disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
-{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
-or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
+The definition of Pod failure policy may help you to:
+* better utilize the computational resources by avoiding unnecessary Pod retries.
+* avoid Job failures due to Pod disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
+{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
+or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
 
 ## {{% heading "prerequisites" %}}