From 7bf346a1f6c6a68ac46622b2f5319fa8b1c0ed76 Mon Sep 17 00:00:00 2001
From: Michal Wozniak
Date: Fri, 26 Jul 2024 10:36:55 +0200
Subject: [PATCH] Address review remarks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Co-authored-by: Filip Křepinský
Co-authored-by: Shannon Kularathna
Co-authored-by: Tim Bannister
---
 .../concepts/workloads/controllers/job.md   | 106 ++++++++++--------
 .../en/docs/tasks/job/pod-failure-policy.md |  13 ++-
 2 files changed, 69 insertions(+), 50 deletions(-)

diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
index 88379f0a64..7c314663f0 100644
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@@ -438,15 +438,21 @@ kubectl get -o yaml job job-backoff-limit-per-index-example
   succeeded: 5          # 1 succeeded pod for each of 5 succeeded indexes
   failed: 10            # 2 failed pods (1 retry) for each of 5 failed indexes
   conditions:
+  - message: Job has failed indexes
+    reason: FailedIndexes
+    status: "True"
+    type: FailureTarget
   - message: Job has failed indexes
     reason: FailedIndexes
     status: "True"
     type: Failed
 ```
 
-Note that, since v1.31, you will also observe in the status the `FailureTarget`
-Job condition, with the same `reason` and `message` as for the the `Failed`
-condition (see also [Job termination and cleanup](#job-termination-and-cleanup)).
+The Job controller adds the `FailureTarget` Job condition to trigger
+[Job termination and cleanup](#job-termination-and-cleanup). The
+`Failed` condition has the same `reason` and `message` values as the
+`FailureTarget` Job condition, but is added only after all of the Job's Pods
+have terminated; for details, see [Termination of Job pods](#termination-of-job-pods).
 
 Additionally, you may want to use the per-index backoff along with a
 [pod failure policy](#pod-failure-policy). When using
@@ -560,7 +566,7 @@ to `podReplacementPolicy: Failed`. For more information, see [Pod replacement po
 When you use the `podFailurePolicy`, and the Job fails due to the pod
 matching the rule with the `FailJob` action, then the Job controller triggers
 the Job termination process by adding the `FailureTarget` condition.
-See [Job termination and cleanup](#job-termination-and-cleanup) for more details.
+For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
 
 ## Success policy {#success-policy}
 
@@ -670,42 +676,64 @@ and `.spec.backoffLimit` result in a permanent Job failure that requires manual
 
 ### Terminal Job conditions
 
-A Job has two possible terminal states, it ends up either succeeded, or failed,
-and these states are reflected by the presence of the Job conditions `Complete`
-or `Failed`, respectively.
+A Job has two possible terminal states, each of which has a corresponding Job
+condition:
+* Succeeded: Job condition `Complete`
+* Failed: Job condition `Failed`
 
-The failure scenarios encompass:
-- the `.spec.backoffLimit`
-- the `.spec.activeDeadlineSeconds` is exceeded
-- the `.spec.backoffLimitPerIndex` is exceeded (see [Backoff limit per index](#backoff-limit-per-index))
-- the Pod matches the Job Pod Failure Policy rule with the `FailJob` action (see more [Pod failure policy](#pod-failure-policy))
+The possible reasons for a Job failure are:
+- The number of Pod failures exceeded the `.spec.backoffLimit` specified for the
+  Job. For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+- The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`.
+- An Indexed Job that uses `.spec.backoffLimitPerIndex` has failed indexes.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- The number of failed indexes in the Job exceeded the specified
+  `.spec.maxFailedIndexes`. For details, see [Backoff limit per index](#backoff-limit-per-index).
+- A failed Pod matched a rule in `.spec.podFailurePolicy` that has the `FailJob`
+  action. For details about how Pod failure policy rules might affect failure
+  evaluation, see [Pod failure policy](#pod-failure-policy).
 
-The success scenarios encompass:
-- the `.spec.completions` is reached
-- the criteria specified by the Job Success Policy are met (see more [Success policy](#success-policy))
+The possible reasons for a Job success are:
+- The number of succeeded Pods reached the specified `.spec.completions`.
+- The criteria specified in `.spec.successPolicy` are met. For details, see
+  [Success policy](#success-policy).
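+
+For illustration only, the status of a Job that ended in success contains a
+terminal condition similar to the following minimal sketch (the `succeeded`
+count is an example value, and other status fields are omitted):
+
+```yaml
+status:
+  succeeded: 3
+  conditions:
+  - type: Complete   # terminal condition of a Job that succeeded
+    status: "True"
+```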
+
+In Kubernetes v1.31 and later, the Job controller delays adding the terminal
+conditions, `Failed` or `Complete`, until all of the Job's Pods have terminated.
+
+{{< note >}}
+In Kubernetes v1.30 and earlier, the Job controller added the terminal conditions
+as soon as the Job termination process was triggered and all Pod finalizers were
+removed, even though some Pods could still be running or terminating at that
+point in time.
+
+This change in behavior takes effect when either the `JobManagedBy` or the
+`JobPodReplacementPolicy` (enabled by default)
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled.
+{{< /note >}}
 
 ### Termination of Job pods
 
-Prior to v1.31 the Job terminal conditions are added when the Job termination
-process is triggered, and all Pod finalizers are removed, but some pods may
-still remain running at that point in time.
+The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet`
+condition to the Job to trigger Pod termination once the Job meets either its
+failure or its success criteria.
 
-Since v1.31, when you enable either the `JobManagedBy` or
-`JobPodReplacementPolicy` (enabled by default)
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), the
-Job controller awaits for termination of all pods before adding a condition
-indicating that the Job is finished (either `Complete` or `Failed`).
+Factors like `terminationGracePeriodSeconds` might increase the amount of time
+from the moment that the Job controller adds the `FailureTarget` condition or the
+`SuccessCriteriaMet` condition to the moment that all of the Job's Pods terminate
+and the Job controller adds a [terminal condition](#terminal-job-conditions)
+(`Failed` or `Complete`).
 
-Note that, the process of terminating all pods may take a substantial amount
-of time, depending on a Pod's `terminationGracePeriodSeconds` (see
-[Pod termination](#docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)),
-and thus adding the terminal Job condition, even if the fate of the Job is
-already determined.
+You can use the `FailureTarget` or the `SuccessCriteriaMet` condition to evaluate
+whether the Job has failed or succeeded without having to wait for the controller
+to add a terminal condition.
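+
+As an illustrative sketch (not output from any specific Job), the status of a
+failed Job whose Pods are still shutting down might look similar to the
+following; the `terminating` count appears when the `JobPodReplacementPolicy`
+feature gate is enabled:
+
+```yaml
+status:
+  terminating: 1          # one Pod has not finished terminating yet
+  conditions:
+  - type: FailureTarget   # the Job's failure is already decided
+    status: "True"
+  # the Failed condition is added later, once all of the Job's Pods have terminated
+```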
-If you want to know the fate of the Job as soon as determined you can use,
-since v1.31, the `FailureTarget` and `SuccessCriteriaMet` conditions, which
-cover all scenarios in which Job controller triggers the Job termination process
-(see [Terminal Job conditions](#terminal-job-conditions)).
+{{< note >}}
+For example, you might use the `FailureTarget` condition to decide quickly
+whether to create a replacement Job. However, doing so can result in Pods from
+the failing Job and from its replacement running at the same time for a while.
+If your cluster capacity is limited, you might prefer to wait for the `Failed`
+condition before creating the replacement Job.
+{{< /note >}}
 
 ## Clean up finished jobs automatically
 
@@ -1111,13 +1139,6 @@ status:
   terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
 ```
 
-{{< note >}}
-Since v1.31, when you enable the `JobPodReplacementPolicy`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-(enabled by default), the Job controller awaits for termination of all pods
-before marking a Job as terminal (see [Termination of Job Pods](#termination-of-job-pods)).
-{{< /note >}}
-
 ### Delegation of managing a Job object to external controller
 
 {{< feature-state feature_gate_name="JobManagedBy" >}}
 
@@ -1162,13 +1183,6 @@ after the operation:
 the built-in Job controller and the external controller indicated by the field value.
 {{< /warning >}}
 
-{{< note >}}
-Since v1.31, when you enable the `JobManagedBy`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
-the Job controller awaits for termination of all pods before marking a Job as
-terminal (see [Termination of Job Pods](#termination-of-job-pods)).
-{{< /note >}}
-
 ## Alternatives
 
 ### Bare Pods
diff --git a/content/en/docs/tasks/job/pod-failure-policy.md b/content/en/docs/tasks/job/pod-failure-policy.md
index ee1bbb3ac8..d9dc183347 100644
--- a/content/en/docs/tasks/job/pod-failure-policy.md
+++ b/content/en/docs/tasks/job/pod-failure-policy.md
@@ -53,10 +53,15 @@ After around 30s the entire Job should be terminated. Inspect the status of the
 kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
 ```
 
-In the Job status, see a job `Failed` condition with the field `reason`
-equal `PodFailurePolicy`. Additionally, the `message` field contains a
-more detailed information about the Job termination, such as:
-`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+In the Job status, you can see the following conditions:
+- `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+  a `message` field with more information about the termination, such as
+  `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+  The Job controller adds this condition as soon as the Job is considered a failure.
+  For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+- `Failed` condition: same `reason` and `message` as the `FailureTarget`
+  condition. The Job controller adds this condition only after all of the Job's
+  Pods are terminated. Both conditions are shown in the sketch below.
 
 For comparison, if the Pod failure policy was disabled it would take 6 retries
 of the Pod, taking at least 2 minutes.
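+
+For illustration, the `conditions` section of the failed Job's status might look
+similar to the following sketch; the Pod name and the details in each `message`
+will differ in your cluster:
+
+```yaml
+status:
+  conditions:
+  - type: FailureTarget   # added as soon as the Job failure is decided
+    status: "True"
+    reason: PodFailurePolicy
+    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8
+      failed with exit code 42 matching FailJob rule at index 0
+  - type: Failed          # added once all of the Job's Pods have terminated
+    status: "True"
+    reason: PodFailurePolicy
+    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8
+      failed with exit code 42 matching FailJob rule at index 0
+```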