Update for KEP-3329: "Retriable and non-retriable Pod failures for Jobs"

Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
parent e2526aa6c4, commit 801b556183 (pull/39809/head)
@@ -807,6 +807,17 @@ These are some requirements and semantics of the API:

- `Count`: use to indicate that the Pod should be handled in the default way.
  The counter towards the `.spec.backoffLimit` should be incremented.

{{< note >}}
When you use a `podFailurePolicy`, the Job controller matches only Pods in the
`Failed` phase. Pods with a deletion timestamp that are not in a terminal phase
(`Failed` or `Succeeded`) are considered still terminating. This implies that
terminating pods retain a [tracking finalizer](#job-tracking-with-finalizers)
until they reach a terminal phase.
Since Kubernetes 1.27, the kubelet transitions deleted pods to a terminal phase
(see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)). This
ensures that deleted pods have their finalizers removed by the Job controller.
{{< /note >}}
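As a sketch, a Job using the `Count` action could look like the following (the Job name, image, and exit code value are illustrative, not taken from this page):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job          # hypothetical name
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: example.com/example-image   # hypothetical image
  podFailurePolicy:
    rules:
    - action: Count          # matching failures count towards .spec.backoffLimit
      onExitCodes:
        containerName: main
        operator: In
        values: [1]          # hypothetical retriable exit code
```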

### Job tracking with finalizers

{{< feature-state for_k8s_version="v1.26" state="stable" >}}
@@ -231,11 +231,6 @@ can happen, according to:

{{< feature-state for_k8s_version="v1.26" state="beta" >}}

{{< note >}}
If you are using an older version of Kubernetes than {{< skew currentVersion >}}
please refer to the corresponding version of the documentation.
{{< /note >}}

{{< note >}}
In order to use this behavior, you must have the `PodDisruptionConditions`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
@@ -91,6 +91,12 @@ A Pod is granted a term to terminate gracefully, which defaults to 30 seconds.

You can use the flag `--force` to [terminate a Pod by force](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced).
{{< /note >}}

Since Kubernetes 1.27, the kubelet transitions deleted pods, except for
[static pods](/docs/tasks/configure-pod-container/static-pod/) and
[force-deleted pods](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced)
without a finalizer, to a terminal phase (`Failed` or `Succeeded` depending on
the exit statuses of the pod containers) before their deletion from the API server.

If a node dies or is disconnected from the rest of the cluster, Kubernetes
applies a policy for setting the `phase` of all Pods on the lost node to Failed.
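The phase transition described above can be observed on a live cluster; a sketch, assuming a Pod named `example-pod` (the name is hypothetical):

```sh
# Request graceful deletion without waiting for it to complete
kubectl delete pod example-pod --wait=false

# Check the pod's phase; since 1.27 it reaches Failed or Succeeded
# before the object disappears from the API server
kubectl get pod example-pod -o jsonpath='{.status.phase}'
```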
@@ -476,6 +482,8 @@ An example flow:

1. When the grace period expires, the kubelet triggers forcible shutdown. The container runtime sends
   `SIGKILL` to any processes still running in any container in the Pod.
   The kubelet also cleans up a hidden `pause` container if that container runtime uses one.
1. The kubelet transitions the pod into a terminal phase (`Failed` or `Succeeded` depending on
   the end state of its containers). This step is guaranteed since version 1.27.
1. The kubelet triggers forcible removal of the Pod object from the API server, by setting the grace period
   to 0 (immediate deletion).
1. The API server deletes the Pod's API object, which is then no longer visible from any client.
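From the client side, the same immediate deletion can be requested with a forced delete, as described in [terminate a Pod by force](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced); a sketch with a hypothetical pod name:

```sh
# Skip the grace period entirely; the Pod object is removed from the
# API server immediately, without waiting for the kubelet's confirmation
kubectl delete pod example-pod --grace-period=0 --force
```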
@@ -28,6 +28,9 @@ You should already be familiar with the basic use of [Job](/docs/concepts/worklo

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

Ensure that the [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
`PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled in your cluster.

## Using Pod failure policy to avoid unnecessary Pod retries

With the following example, you can learn how to use Pod failure policy to
@@ -129,6 +132,114 @@ kubectl delete jobs/job-pod-failure-policy-ignore

The cluster automatically cleans up the Pods.

## Using Pod failure policy to avoid unnecessary Pod retries based on custom Pod Conditions

With the following example, you can learn how to use Pod failure policy to
avoid unnecessary Pod restarts based on custom Pod Conditions.

{{< note >}}
The example below works since version 1.27, as it relies on transitioning
deleted pods, in the `Pending` phase, to a terminal phase
(see: [Pod Phase](/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase)).
{{< /note >}}

1. First, create a Job based on the config:

   {{< codenew file="/controllers/job-pod-failure-policy-config-issue.yaml" >}}

   by running:

   ```sh
   kubectl create -f job-pod-failure-policy-config-issue.yaml
   ```

   Note that the image is misconfigured, as it does not exist.

2. Inspect the status of the job's Pods by running:

   ```sh
   kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o yaml
   ```

   You will see output similar to this:

   ```yaml
   containerStatuses:
   - image: non-existing-repo/non-existing-image:example
     ...
     state:
       waiting:
         message: Back-off pulling image "non-existing-repo/non-existing-image:example"
         reason: ImagePullBackOff
   ...
   phase: Pending
   ```

   Note that the pod remains in the `Pending` phase, as it fails to pull the
   misconfigured image. This could, in principle, be a transient issue and the
   image could get pulled. However, in this case, the image does not exist, so
   we indicate this fact by a custom condition.

3. Add the custom condition. First prepare the patch by running:

   ```sh
   cat <<EOF > patch.yaml
   status:
     conditions:
     - type: ConfigIssue
       status: "True"
       reason: "NonExistingImage"
       lastTransitionTime: "$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
   EOF
   ```

   Second, select one of the pods created by the job by running:

   ```sh
   podName=$(kubectl get pods -l job-name=job-pod-failure-policy-config-issue -o jsonpath='{.items[0].metadata.name}')
   ```

   Then, apply the patch on one of the pods by running the following command:

   ```sh
   kubectl patch pod $podName --subresource=status --patch-file=patch.yaml
   ```

   If applied successfully, you will get a notification like this:

   ```sh
   pod/job-pod-failure-policy-config-issue-k6pvp patched
   ```

4. Delete the pod to transition it to the `Failed` phase, by running the command:

   ```sh
   kubectl delete pods/$podName
   ```

5. Inspect the status of the Job by running:

   ```sh
   kubectl get jobs -l job-name=job-pod-failure-policy-config-issue -o yaml
   ```

   In the Job status, you can see a job `Failed` condition with the field `reason`
   equal to `PodFailurePolicy`. Additionally, the `message` field contains
   more detailed information about the Job termination, such as:
   `Pod default/job-pod-failure-policy-config-issue-k6pvp has condition ConfigIssue matching FailJob rule at index 0`.

{{< note >}}
In a production environment, steps 3 and 4 should be automated by a
user-provided controller.
{{< /note >}}
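The heredoc in step 3 uses command substitution to stamp `lastTransitionTime` with the current time in the UTC / RFC 3339 format that the Kubernetes API expects for condition timestamps. A standalone sketch of just that piece:

```sh
# Produce a timestamp such as 2023-04-01T12:00:00Z (UTC, RFC 3339)
ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
echo "$ts"
```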

### Cleaning up

Delete the Job you created:

```sh
kubectl delete jobs/job-pod-failure-policy-config-issue
```

The cluster automatically cleans up the Pods.

## Alternatives

You could rely solely on the
@@ -0,0 +1,19 @@

apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-config-issue
spec:
  completions: 8
  parallelism: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: "non-existing-repo/non-existing-image:example"
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onPodConditions:
      - type: ConfigIssue
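For comparison, a pod failure policy rule can also match on container exit codes instead of pod conditions; a hypothetical variant of the rule above (the exit code value is illustrative):

```yaml
podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: main   # optional; restricts the rule to this container
      operator: In          # one of: In, NotIn
      values: [42]          # hypothetical non-retriable exit code
```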