Merge pull request #35219 from mimowo/retriable-pod-failures-docs

Add docs for KEP-3329 Retriable and non-retriable Pod failures for Jobs
Kubernetes Prow Robot 2022-08-15 19:17:07 -07:00 committed by GitHub
commit 61b69cfd38
7 changed files with 343 additions and 0 deletions


@@ -695,6 +695,90 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.
### Pod failure policy {#pod-failure-policy}
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes v1.25.
{{< /note >}}
A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.
In some situations, you may want to have better control when handling Pod
failures than the control provided by the default policy, which is based on the
Job's [`.spec.backoffLimit`](#pod-backoff-failure-policy). These are some
examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
you can terminate a Job as soon as one of its Pods fails with an exit code
indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
{{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
that they don't count towards the `.spec.backoffLimit` limit of retries.
You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.
Here is a manifest for a Job that defines a `podFailurePolicy`:
{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}
In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with exit code 42.
The following are the rules for the `main` container specifically:
- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code indicates that the container failed, and hence the entire
Pod. The Pod will be re-created if the total number of restarts is
below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.
{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}
The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with condition `DisruptionTarget`, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.
{{< note >}}
If the Job fails, either because of the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}
These are some requirements and semantics of the API:
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
are evaluated in order. Once a rule matches a Pod failure, the remaining rules
are ignored. When no rule matches the Pod failure, the default
handling applies.
- you may want to restrict a rule to a specific container by specifying its name
in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
applies to all containers. When specified, it should match one of the container
or `initContainer` names in the Pod template (see the sketch after this list).
- you may specify the action taken when a Pod failure policy is matched by
`spec.podFailurePolicy.rules[*].action`. Possible values are:
- `FailJob`: use to indicate that the Pod's job should be marked as Failed and
all running Pods should be terminated.
- `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
should not be incremented and a replacement Pod should be created.
- `Count`: use to indicate that the Pod should be handled in the default way.
The counter towards the `.spec.backoffLimit` should be incremented.
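To make the `containerName` field and the three actions concrete, here is a hypothetical fragment that could sit under a Job's `spec` (a sketch for illustration only, not the example manifest above; the exit codes other than 42 are arbitrary):
```yaml
podFailurePolicy:
  rules:
  - action: FailJob            # non-retriable failure: mark the whole Job as failed
    onExitCodes:
      containerName: main      # evaluate only the exit codes of the "main" container
      operator: In
      values: [42]
  - action: Ignore             # don't count disruptions towards .spec.backoffLimit
    onPodConditions:
    - type: DisruptionTarget
  - action: Count              # handle these failures in the default way
    onExitCodes:
      operator: In             # no containerName: the rule applies to all containers
      values: [1, 2]           # arbitrary exit codes for illustration
```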
### Job tracking with finalizers
{{< feature-state for_k8s_version="v1.23" state="beta" >}}
@@ -783,3 +867,5 @@ object, but maintains complete control over what Pods are created and how work i
* Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
can use to define a series of Jobs that will run based on a schedule, similar to
the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable pod failures
using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).


@@ -227,6 +227,44 @@ can happen, according to:
- the type of controller
- the cluster's resource capacity
## Pod disruption conditions {#pod-disruption-conditions}
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
{{< note >}}
In order to use this behavior, you must enable the `PodDisruptionConditions`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
{{< /note >}}
When enabled, a dedicated Pod `DisruptionTarget` [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) is added to indicate
that the Pod is about to be deleted due to a {{<glossary_tooltip term_id="disruption" text="disruption">}}.
The `reason` field of the condition additionally
indicates one of the following reasons for the Pod termination (an illustrative excerpt follows this list):
`PreemptionByKubeScheduler`
: Pod has been {{<glossary_tooltip term_id="preemption" text="preempted">}} by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).
`DeletionByTaintManager`
: Pod is due to be deleted by the Taint Manager due to a `NoExecute` taint that the Pod does not tolerate; see {{<glossary_tooltip term_id="taint" text="taint">}}-based evictions.
`EvictionByEvictionAPI`
: Pod has been marked for {{<glossary_tooltip term_id="api-eviction" text="eviction using the Kubernetes API">}}.
`DeletionByPodGC`
: Pod, which is orphaned, has been deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
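For illustration, a Pod that was preempted might report a condition similar to the following in its `status` (an assumed excerpt; the exact fields and message differ):
```yaml
status:
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: PreemptionByKubeScheduler                          # one of the reasons listed above
    message: preempted to accommodate a higher priority pod    # illustrative message
```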
{{< note >}}
A Pod disruption might be interrupted. The control plane might re-attempt to
continue the disruption of the same Pod, but it is not guaranteed. As a result,
the `DisruptionTarget` condition might be added to a Pod, but that Pod might then not actually be
deleted. In such a situation, after some time, the
Pod disruption condition will be cleared.
{{< /note >}}
When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).
## Separating Cluster Owner and Application Owner Roles
Often, it is useful to think of the Cluster Manager


@@ -377,6 +377,7 @@ different Kubernetes components.
| `IngressClassNamespacedParams` | `true` | GA | 1.23 | - |
| `Initializers` | `false` | Alpha | 1.7 | 1.13 |
| `Initializers` | - | Deprecated | 1.14 | - |
| `JobPodFailurePolicy` | `false` | Alpha | 1.25 | - |
| `KubeletConfigFile` | `false` | Alpha | 1.8 | 1.9 |
| `KubeletConfigFile` | - | Deprecated | 1.10 | - |
| `KubeletPluginsWatcher` | `false` | Alpha | 1.11 | 1.11 |
@@ -419,6 +420,7 @@ different Kubernetes components.
| `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
| `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
| `PodDisruptionBudget` | `true` | GA | 1.21 | - |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
| `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
| `PodOverhead` | `true` | GA | 1.24 | - |
@@ -950,6 +952,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
support for IPv6.
- `JobMutableNodeSchedulingDirectives`: Allows updating node scheduling directives in
the pod template of [Job](/docs/concepts/workloads/controllers/job).
- `JobPodFailurePolicy`: Allows users to specify handling of pod failures based on container exit codes and pod conditions.
- `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
[condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
The count of `Ready` pods is recorded in the
@@ -1042,6 +1045,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
- `PodAndContainerStatsFromCRI`: Configure the kubelet to gather container and
pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionBudget`: Enable the [PodDisruptionBudget](/docs/tasks/run-application/configure-pdb/) feature.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodOverhead`: Enable the [PodOverhead](/docs/concepts/scheduling-eviction/pod-overhead/)
feature to account for pod overheads.


@@ -0,0 +1,139 @@
---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
<!-- overview -->
This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve control over the handling of container- or Pod-level failures
within a {{<glossary_tooltip text="Job" term_id="job">}}.
The definition of Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
{{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).
## {{% heading "prerequisites" %}}
You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
<!-- steps -->
{{< note >}}
As the features are in Alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions` (see the example configuration after this note).
{{< /note >}}
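For example, if you use [kind](https://kind.sigs.k8s.io/) to create a test cluster, one illustrative way to enable both gates is a cluster configuration like the following (kind and the file name `kind-config.yaml` are assumptions for illustration, not requirements of this task; any other way of passing `--feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true` to the control plane components also works):
```yaml
# kind-config.yaml - illustrative kind configuration that enables the alpha gates
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobPodFailurePolicy: true
  PodDisruptionConditions: true
```
You would then create the cluster with `kind create cluster --config kind-config.yaml`.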
## Using Pod failure policy to avoid unnecessary Pod retries
With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.
First, create a Job based on the config:
{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}
by running:
```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```
After around 30s the entire Job should be terminated. Inspect the status of the Job by running:
```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```
In the Job status, you should see a Job `Failed` condition with the field `reason`
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
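The relevant part of the Job status might look similar to this (an illustrative excerpt; other fields omitted):
```yaml
status:
  conditions:
  - type: Failed
    status: "True"
    reason: PodFailurePolicy
    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0
```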
For comparison, if the Pod failure policy were disabled, it would take 6 retries
of the Pod, taking at least 2 minutes.
### Clean up
Delete the Job you created:
```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```
The cluster automatically cleans up the Pods.
## Using Pod failure policy to ignore Pod disruptions
With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions so that they do not increment the Pod retry counter
towards the `.spec.backoffLimit` limit.
{{< caution >}}
Timing is important for this example, so you may want to read the steps before
executing them. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s of the Pod being scheduled).
{{< /caution >}}
1. Create a Job based on the config:
{{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}
by running:
```sh
kubectl create -f job-pod-failure-policy-ignore.yaml
```
2. Run this command to check the `nodeName` the Pod is scheduled to:
```sh
nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
```
3. Drain the node to evict the Pod before it completes (within 90s):
```sh
kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
```
4. Inspect the Job's `.status.failed` field to check that the counter is not incremented:
```sh
kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
```
5. Uncordon the node:
```sh
kubectl uncordon nodes/$nodeName
```
The Job resumes and succeeds.
For comparison, if the Pod failure policy were disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).
### Cleaning up
Delete the Job you created:
```sh
kubectl delete jobs/job-pod-failure-policy-ignore
```
The cluster automatically cleans up the Pods.
## Alternatives
You could rely solely on the
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a low value for `.spec.backoffLimit`
to avoid unnecessary Pod retries, and setting it high enough to make sure the Job would
not be terminated by Pod disruptions.
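For illustration only, here is a sketch of that alternative, reusing the container from the examples above but relying solely on the backoff limit (the Job name and the `backoffLimit` value of 3 are arbitrary assumptions):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-only        # hypothetical name, not one of the example manifests
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  # The only knob: low enough to avoid pointless retries of the buggy container,
  # yet high enough that a few disruptions do not fail the Job.
  backoffLimit: 3
```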


@@ -0,0 +1,28 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption


@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]


@@ -0,0 +1,23 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
  backoffLimit: 0
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget