Merge pull request #35219 from mimowo/retriable-pod-failures-docs
Add docs for KEP-3329: Retriable and non-retriable Pod failures for Jobs
commit 61b69cfd38

@@ -695,6 +695,90 @@ The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010
`manualSelector: true` tells the system that you know what you are doing and to allow this
mismatch.

### Pod failure policy {#pod-failure-policy}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes v1.25.
{{< /note >}}

A Pod failure policy, defined with the `.spec.podFailurePolicy` field, enables
your cluster to handle Pod failures based on the container exit codes and the
Pod conditions.

In some situations, you may want to have better control over how Pod failures
are handled than the control provided by the default policy, which is based on
the Job's [`.spec.backoffLimit`](#pod-backoff-failure-policy). These are some
examples of use cases:
* To optimize costs of running workloads by avoiding unnecessary Pod restarts,
  you can terminate a Job as soon as one of its Pods fails with an exit code
  indicating a software bug.
* To guarantee that your Job finishes even if there are disruptions, you can
  ignore Pod failures caused by disruptions (such as {{< glossary_tooltip text="preemption" term_id="preemption" >}},
  {{< glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{< glossary_tooltip text="taint" term_id="taint" >}}-based eviction) so
  that they don't count towards the `.spec.backoffLimit` limit of retries.

You can configure a Pod failure policy, in the `.spec.podFailurePolicy` field,
to meet the above use cases. This policy can handle Pod failures based on the
container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a `podFailurePolicy`:

{{< codenew file="/controllers/job-pod-failure-policy-example.yaml" >}}

In the example above, the first rule of the Pod failure policy specifies that
the Job should be marked failed if the `main` container fails with the 42 exit
code. The following are the rules for the `main` container specifically:

- an exit code of 0 means that the container succeeded
- an exit code of 42 means that the **entire Job** failed
- any other exit code represents that the container failed, and hence the entire
  Pod. The Pod will be re-created if the total number of restarts is
  below `backoffLimit`. If the `backoffLimit` is reached, the **entire Job** fails.

{{< note >}}
Because the Pod template specifies a `restartPolicy: Never`,
the kubelet does not restart the `main` container in that particular Pod.
{{< /note >}}

The second rule of the Pod failure policy, specifying the `Ignore` action for
failed Pods with condition `DisruptionTarget`, excludes Pod disruptions from
being counted towards the `.spec.backoffLimit` limit of retries.

{{< note >}}
If the Job fails, either due to the Pod failure policy or the Pod backoff
failure policy, and the Job is running multiple Pods, Kubernetes terminates all
the Pods in that Job that are still Pending or Running.
{{< /note >}}

These are some requirements and semantics of the API:
- if you want to use a `.spec.podFailurePolicy` field for a Job, you must
  also define that Job's pod template with `.spec.restartPolicy` set to `Never`.
- the Pod failure policy rules you specify under `spec.podFailurePolicy.rules`
  are evaluated in order. Once a rule matches a Pod failure, the remaining rules
  are ignored. When no rule matches the Pod failure, the default
  handling applies.
- you may want to restrict a rule to a specific container by specifying its name
  in `spec.podFailurePolicy.rules[*].containerName`. When not specified, the rule
  applies to all containers. When specified, it should match one of the container
  or `initContainer` names in the Pod template.
- you may specify the action taken when a Pod failure policy is matched by
  `spec.podFailurePolicy.rules[*].action` (see the sketch after this list). Possible values are:
  - `FailJob`: use to indicate that the Pod's job should be marked as Failed and
    all running Pods should be terminated.
  - `Ignore`: use to indicate that the counter towards the `.spec.backoffLimit`
    should not be incremented and a replacement Pod should be created.
  - `Count`: use to indicate that the Pod should be handled in the default way.
    The counter towards the `.spec.backoffLimit` should be incremented.
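
As an illustration of these semantics (not part of the example manifest above), here is a sketch of a
`.spec.podFailurePolicy` fragment that combines the three actions; the exit codes 42 and 1 and the
container name `main` are assumptions chosen for illustration only:

```yaml
# Illustrative fragment of a Job's .spec; exit codes and container name are
# assumed values, not ones required by the API.
podFailurePolicy:
  rules:
  - action: FailJob            # fail the whole Job on a known fatal exit code
    onExitCodes:
      containerName: main      # optional; restricts the rule to this container
      operator: In             # one of: In, NotIn
      values: [42]
  - action: Ignore             # don't count disruptions towards .spec.backoffLimit
    onPodConditions:
    - type: DisruptionTarget
  - action: Count              # treat this exit code as a regular, countable failure
    onExitCodes:
      operator: In
      values: [1]
```

Because rules are evaluated in order, the `Count` rule only applies to failures
that were not already matched by the first two rules.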

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.23" state="beta" >}}

@@ -783,3 +867,5 @@ object, but maintains complete control over what Pods are created and how work i

* Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
  can use to define a series of Jobs that will run based on a schedule, similar to
  the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable pod failures
  using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).

@@ -227,6 +227,44 @@ can happen, according to:

- the type of controller
- the cluster's resource capacity

## Pod disruption conditions {#pod-disruption-conditions}

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

{{< note >}}
In order to use this behavior, you must enable the `PodDisruptionConditions`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in your cluster.
{{< /note >}}

When enabled, a dedicated Pod `DisruptionTarget` [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions) is added to indicate
that the Pod is about to be deleted due to a {{<glossary_tooltip term_id="disruption" text="disruption">}}.
The `reason` field of the condition additionally
indicates one of the following reasons for the Pod termination:

`PreemptionByKubeScheduler`
: Pod has been {{<glossary_tooltip term_id="preemption" text="preempted">}} by a scheduler in order to accommodate a new Pod with a higher priority. For more information, see [Pod priority preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/).

`DeletionByTaintManager`
: Pod is due to be deleted by Taint Manager due to a `NoExecute` taint that the Pod does not tolerate; see {{<glossary_tooltip term_id="taint" text="taint">}}-based evictions.

`EvictionByEvictionAPI`
: Pod has been marked for {{<glossary_tooltip term_id="api-eviction" text="eviction using the Kubernetes API">}}.

`DeletionByPodGC`
: An orphaned Pod deleted by [Pod garbage collection](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection).
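
To see which, if any, disruption condition was added to a Pod, you can inspect the Pod's
status conditions; this is a minimal sketch in which `my-pod` is a placeholder name:

```sh
# Print the Pod's status conditions; look for type "DisruptionTarget"
# and check its "reason" field (my-pod is a placeholder).
kubectl get pod my-pod -o jsonpath='{.status.conditions}'
```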

{{< note >}}
A Pod disruption might be interrupted. The control plane might re-attempt to
continue the disruption of the same Pod, but it is not guaranteed. As a result,
the `DisruptionTarget` condition might be added to a Pod, but that Pod might then not actually be
deleted. In such a situation, after some time, the
Pod disruption condition will be cleared.
{{< /note >}}

When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of your Job's
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy).
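
For example, a Job could ignore disruption-related Pod failures so that they do not count towards
`.spec.backoffLimit`. The snippet below is only a sketch of the relevant fragment of a Job's `.spec`;
the rest of the Job definition is omitted:

```yaml
# Illustrative fragment of a Job's .spec (other fields omitted).
podFailurePolicy:
  rules:
  - action: Ignore             # do not count disrupted Pods as retries
    onPodConditions:
    - type: DisruptionTarget   # the Pod disruption condition described above
```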

## Separating Cluster Owner and Application Owner Roles

Often, it is useful to think of the Cluster Manager

@@ -377,6 +377,7 @@ different Kubernetes components.

| `IngressClassNamespacedParams` | `true` | GA | 1.23 | - |
| `Initializers` | `false` | Alpha | 1.7 | 1.13 |
| `Initializers` | - | Deprecated | 1.14 | - |
| `JobPodFailurePolicy` | `false` | Alpha | 1.25 | - |
| `KubeletConfigFile` | `false` | Alpha | 1.8 | 1.9 |
| `KubeletConfigFile` | - | Deprecated | 1.10 | - |
| `KubeletPluginsWatcher` | `false` | Alpha | 1.11 | 1.11 |

@@ -419,6 +420,7 @@ different Kubernetes components.

| `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
| `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
| `PodDisruptionBudget` | `true` | GA | 1.21 | - |
| `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
| `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
| `PodOverhead` | `true` | GA | 1.24 | - |

@@ -950,6 +952,7 @@ Each feature gate is designed for enabling/disabling a specific feature:

  support for IPv6.
- `JobMutableNodeSchedulingDirectives`: Allows updating node scheduling directives in
  the pod template of [Job](/docs/concepts/workloads/controllers/job).
- `JobPodFailurePolicy`: Allow users to specify handling of pod failures based on container exit codes and pod conditions.
- `JobReadyPods`: Enables tracking the number of Pods that have a `Ready`
  [condition](/docs/concepts/workloads/pods/pod-lifecycle/#pod-conditions).
  The count of `Ready` pods is recorded in the

@@ -1042,6 +1045,7 @@ Each feature gate is designed for enabling/disabling a specific feature:

- `PodAndContainerStatsFromCRI`: Configure the kubelet to gather container and
  pod stats from the CRI container runtime rather than gathering them from cAdvisor.
- `PodDisruptionBudget`: Enable the [PodDisruptionBudget](/docs/tasks/run-application/configure-pdb/) feature.
- `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
- `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
- `PodOverhead`: Enable the [PodOverhead](/docs/concepts/scheduling-eviction/pod-overhead/)
  feature to account for pod overheads.

@@ -0,0 +1,139 @@
---
title: Handling retriable and non-retriable pod failures with Pod failure policy
content_type: task
min-kubernetes-server-version: v1.25
weight: 60
---

{{< feature-state for_k8s_version="v1.25" state="alpha" >}}

<!-- overview -->

This document shows you how to use the
[Pod failure policy](/docs/concepts/workloads/controllers/job#pod-failure-policy),
in combination with the default
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
to improve the control over the handling of container- or Pod-level failure
within a {{<glossary_tooltip text="Job" term_id="job">}}.

The definition of Pod failure policy may help you to:
* better utilize the computational resources by avoiding unnecessary Pod retries.
* avoid Job failures due to Pod disruptions (such as {{<glossary_tooltip text="preemption" term_id="preemption" >}},
  {{<glossary_tooltip text="API-initiated eviction" term_id="api-eviction" >}}
  or {{<glossary_tooltip text="taint" term_id="taint" >}}-based eviction).

## {{% heading "prerequisites" %}}

You should already be familiar with the basic use of [Job](/docs/concepts/workloads/controllers/job/).

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

{{< note >}}
As the features are in alpha, prepare the Kubernetes cluster with the two
[feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
enabled: `JobPodFailurePolicy` and `PodDisruptionConditions`.
{{< /note >}}
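
How you enable the feature gates depends on how your cluster was deployed. As a non-authoritative
sketch, if you run the control plane components directly (or edit their static Pod manifests, for
example under `/etc/kubernetes/manifests` on a kubeadm cluster), you would pass the `--feature-gates`
flag to both the kube-apiserver and the kube-controller-manager:

```sh
# Illustrative flags only; adapt to how your control plane is managed.
kube-apiserver --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
kube-controller-manager --feature-gates=JobPodFailurePolicy=true,PodDisruptionConditions=true ...
```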

## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts when a Pod failure indicates a non-retriable
software bug.

First, create a Job based on the config:

{{< codenew file="/controllers/job-pod-failure-policy-failjob.yaml" >}}

by running:

```sh
kubectl create -f job-pod-failure-policy-failjob.yaml
```

After around 30s the entire Job should be terminated. Inspect the status of the Job by running:

```sh
kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
```

In the Job status, see a Job `Failed` condition with the field `reason`
equal to `PodFailurePolicy`. Additionally, the `message` field contains
more detailed information about the Job termination, such as:
`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
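
If you prefer not to scan the full YAML output, you can print only the Job's conditions with a
JSONPath query; this is an optional variant of the command above:

```sh
# Print only the Job's status conditions; expect a condition with
# type "Failed" and reason "PodFailurePolicy".
kubectl get jobs -l job-name=job-pod-failure-policy-failjob \
  -o jsonpath='{.items[0].status.conditions}'
```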

For comparison, if the Pod failure policy were disabled, it would take 6 retries
of the Pod, taking at least 2 minutes.

### Clean up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-failjob
```

The cluster automatically cleans up the Pods.

## Using Pod failure policy to ignore Pod disruptions

With the following example, you can learn how to use Pod failure policy to
ignore Pod disruptions from incrementing the Pod retry counter towards the
`.spec.backoffLimit` limit.

{{< caution >}}
Timing is important for this example, so you may want to read the steps before
execution. In order to trigger a Pod disruption, it is important to drain the
node while the Pod is running on it (within 90s after the Pod is scheduled).
{{< /caution >}}

1. Create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-ignore.yaml" >}}

   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-ignore.yaml
   ```

2. Run this command to check the `nodeName` the Pod is scheduled to:

   ```sh
   nodeName=$(kubectl get pods -l job-name=job-pod-failure-policy-ignore -o jsonpath='{.items[0].spec.nodeName}')
   ```

3. Drain the node to evict the Pod before it completes (within 90s):

   ```sh
   kubectl drain nodes/$nodeName --ignore-daemonsets --grace-period=0
   ```

4. Inspect the `.status.failed` field to check that the counter for the Job is not incremented:

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-ignore -o yaml
   ```

5. Uncordon the node:

   ```sh
   kubectl uncordon nodes/$nodeName
   ```

The Job resumes and succeeds.
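
Optionally, you can confirm the outcome by printing the Job's `succeeded` counter; once the Job
finishes, it should equal the number of completions (4 in this example):

```sh
# Print the number of successfully completed Pods for the Job.
kubectl get jobs -l job-name=job-pod-failure-policy-ignore \
  -o jsonpath='{.items[0].status.succeeded}'
```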

For comparison, if the Pod failure policy were disabled, the Pod disruption would
result in terminating the entire Job (as the `.spec.backoffLimit` is set to 0).

### Cleaning up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-ignore
```

The cluster automatically cleans up the Pods.

## Alternatives

You could rely solely on the
[Pod backoff failure policy](/docs/concepts/workloads/controllers/job#pod-backoff-failure-policy),
by specifying the Job's `.spec.backoffLimit` field. However, in many situations
it is problematic to find a balance between setting a value for `.spec.backoffLimit` that is
low enough to avoid unnecessary Pod retries, yet high enough to make sure the Job would
not be terminated by Pod disruptions.
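
For comparison, the backoff-only alternative is a single field on the Job's `.spec`; the value
shown below is the API default and is used here only for illustration:

```yaml
# Illustrative fragment of a Job's .spec: every Pod failure, including
# failures caused by disruptions, counts towards this limit before the
# Job is marked Failed.
backoffLimit: 6
```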

@@ -0,0 +1,28 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption

@@ -0,0 +1,25 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-failjob
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 42 to simulate a software bug." && sleep 30 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]

@@ -0,0 +1,23 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-ignore
spec:
  completions: 4
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]
        args:
        - -c
        - echo "Hello world! I'm going to exit with 0 (success)." && sleep 90 && exit 0
  backoffLimit: 0
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget