From 7bf346a1f6c6a68ac46622b2f5319fa8b1c0ed76 Mon Sep 17 00:00:00 2001
From: Michal Wozniak
Date: Fri, 26 Jul 2024 10:36:55 +0200
Subject: [PATCH] Address review remarks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Co-authored-by: Filip Křepinský
Co-authored-by: Shannon Kularathna
Co-authored-by: Tim Bannister
---
 .../concepts/workloads/controllers/job.md   | 106 ++++++++++--------
 .../en/docs/tasks/job/pod-failure-policy.md |  13 ++-
 2 files changed, 69 insertions(+), 50 deletions(-)

diff --git a/content/en/docs/concepts/workloads/controllers/job.md b/content/en/docs/concepts/workloads/controllers/job.md
index 88379f0a64..7c314663f0 100644
--- a/content/en/docs/concepts/workloads/controllers/job.md
+++ b/content/en/docs/concepts/workloads/controllers/job.md
@@ -438,15 +438,21 @@ kubectl get -o yaml job job-backoff-limit-per-index-example
   succeeded: 5          # 1 succeeded pod for each of 5 succeeded indexes
   failed: 10            # 2 failed pods (1 retry) for each of 5 failed indexes
   conditions:
+  - message: Job has failed indexes
+    reason: FailedIndexes
+    status: "True"
+    type: FailureTarget
   - message: Job has failed indexes
     reason: FailedIndexes
     status: "True"
     type: Failed
 ```
 
-Note that, since v1.31, you will also observe in the status the `FailureTarget`
-Job condition, with the same `reason` and `message` as for the the `Failed`
-condition (see also [Job termination and cleanup](#job-termination-and-cleanup)).
+The Job controller adds the `FailureTarget` Job condition to trigger
+[Job termination and cleanup](#job-termination-and-cleanup). The
+`Failed` condition has the same `reason` and `message` values as the
+`FailureTarget` Job condition, but is added only after all of the Job's Pods
+have terminated; for details, see [Termination of Job pods](#termination-of-job-pods).
 
 Additionally, you may want to use the per-index backoff along with a
 [pod failure policy](#pod-failure-policy). When using
@@ -560,7 +566,7 @@ to `podReplacementPolicy: Failed`. For more information, see [Pod replacement po
 When you use the `podFailurePolicy`, and the Job fails due to the pod
 matching the rule with the `FailJob` action, then the Job controller triggers
 the Job termination process by adding the `FailureTarget` condition.
-See [Job termination and cleanup](#job-termination-and-cleanup) for more details.
+For more details, see [Job termination and cleanup](#job-termination-and-cleanup).
 
 ## Success policy {#success-policy}
 
@@ -670,42 +676,64 @@ and `.spec.backoffLimit` result in a permanent Job failure that requires manual
 
 ### Terminal Job conditions
 
-A Job has two possible terminal states, it ends up either succeeded, or failed,
-and these states are reflected by the presence of the Job conditions `Complete`
-or `Failed`, respectively.
+A Job has two possible terminal states, each of which has a corresponding Job
+condition:
+* Succeeded: Job condition `Complete`
+* Failed: Job condition `Failed`
 
-The failure scenarios encompass:
-- the `.spec.backoffLimit`
-- the `.spec.activeDeadlineSeconds` is exceeded
-- the `.spec.backoffLimitPerIndex` is exceeded (see [Backoff limit per index](#backoff-limit-per-index))
-- the Pod matches the Job Pod Failure Policy rule with the `FailJob` action (see more [Pod failure policy](#pod-failure-policy))
+The possible reasons for a Job failure are:
+- The number of Pod failures exceeded the `.spec.backoffLimit` specified for the
+  Job. For details, see [Pod backoff failure policy](#pod-backoff-failure-policy).
+- The Job runtime exceeded the specified `.spec.activeDeadlineSeconds`.
+- An Indexed Job that uses `.spec.backoffLimitPerIndex` has failed indexes.
+  For details, see [Backoff limit per index](#backoff-limit-per-index).
+- The number of failed indexes in the Job exceeded the specified
+  `.spec.maxFailedIndexes`. For details, see [Backoff limit per index](#backoff-limit-per-index).
+- A failed Pod matched a rule in `.spec.podFailurePolicy` that has the `FailJob`
+  action. For details about how Pod failure policy rules might affect failure
+  evaluation, see [Pod failure policy](#pod-failure-policy).
 
-The success scenarios encompass:
-- the `.spec.completions` is reached
-- the criteria specified by the Job Success Policy are met (see more [Success policy](#success-policy))
+The possible reasons for a Job success are:
+- The number of succeeded Pods reached the specified `.spec.completions`.
+- The criteria specified in `.spec.successPolicy` are met. For details, see
+  [Success policy](#success-policy).
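+
+For illustration only, the status of a Job that ended in success contains a
+terminal condition similar to the following minimal sketch (the `succeeded`
+count is an example value, and other status fields are omitted):
+
+```yaml
+status:
+  succeeded: 3
+  conditions:
+  - type: Complete   # terminal condition of a Job that succeeded
+    status: "True"
+```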
+
+In Kubernetes v1.31 and later, the Job controller delays adding the terminal
+conditions, `Failed` or `Complete`, until all of the Job's Pods have terminated.
+
+{{< note >}}
+In Kubernetes v1.30 and earlier, the Job controller added the terminal conditions
+as soon as the Job termination process was triggered and all Pod finalizers were
+removed, even though some Pods could still be running or terminating at that
+point in time.
+
+This change in behavior takes effect when either the `JobManagedBy` or the
+`JobPodReplacementPolicy` (enabled by default)
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled.
+{{< /note >}}
 
 ### Termination of Job pods
 
-Prior to v1.31 the Job terminal conditions are added when the Job termination
-process is triggered, and all Pod finalizers are removed, but some pods may
-still remain running at that point in time.
+The Job controller adds the `FailureTarget` condition or the `SuccessCriteriaMet`
+condition to the Job to trigger Pod termination once the Job meets either its
+failure or its success criteria.
 
-Since v1.31, when you enable either the `JobManagedBy` or
-`JobPodReplacementPolicy` (enabled by default)
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/), the
-Job controller awaits for termination of all pods before adding a condition
-indicating that the Job is finished (either `Complete` or `Failed`).
+Factors like `terminationGracePeriodSeconds` might increase the amount of time
+from the moment that the Job controller adds the `FailureTarget` condition or the
+`SuccessCriteriaMet` condition to the moment that all of the Job's Pods terminate
+and the Job controller adds a [terminal condition](#terminal-job-conditions)
+(`Failed` or `Complete`).
 
-Note that, the process of terminating all pods may take a substantial amount
-of time, depending on a Pod's `terminationGracePeriodSeconds` (see
-[Pod termination](#docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)),
-and thus adding the terminal Job condition, even if the fate of the Job is
-already determined.
+You can use the `FailureTarget` or the `SuccessCriteriaMet` condition to evaluate
+whether the Job has failed or succeeded without having to wait for the controller
+to add a terminal condition.
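+
+As an illustrative sketch (not output from any specific Job), the status of a
+failed Job whose Pods are still shutting down might look similar to the
+following; the `terminating` count appears when the `JobPodReplacementPolicy`
+feature gate is enabled:
+
+```yaml
+status:
+  terminating: 1          # one Pod has not finished terminating yet
+  conditions:
+  - type: FailureTarget   # the Job's failure is already decided
+    status: "True"
+  # the Failed condition is added later, once all of the Job's Pods have terminated
+```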
-If you want to know the fate of the Job as soon as determined you can use,
-since v1.31, the `FailureTarget` and `SuccessCriteriaMet` conditions, which
-cover all scenarios in which Job controller triggers the Job termination process
-(see [Terminal Job conditions](#terminal-job-conditions)).
+{{< note >}}
+For example, you might use the `FailureTarget` condition to decide quickly
+whether to create a replacement Job. However, doing so can result in Pods from
+the failing Job and from its replacement running at the same time for a while.
+If your cluster capacity is limited, you might prefer to wait for the `Failed`
+condition before creating the replacement Job.
+{{< /note >}}
 
 ## Clean up finished jobs automatically
 
@@ -1111,13 +1139,6 @@ status:
   terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
 ```
 
-{{< note >}}
-Since v1.31, when you enable the `JobPodReplacementPolicy`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-(enabled by default), the Job controller awaits for termination of all pods
-before marking a Job as terminal (see [Termination of Job Pods](#termination-of-job-pods)).
-{{< /note >}}
-
 ### Delegation of managing a Job object to external controller
 
 {{< feature-state feature_gate_name="JobManagedBy" >}}
 
@@ -1162,13 +1183,6 @@ after the operation:
 the built-in Job controller and the external controller indicated by the field value.
 {{< /warning >}}
 
-{{< note >}}
-Since v1.31, when you enable the `JobManagedBy`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/),
-the Job controller awaits for termination of all pods before marking a Job as
-terminal (see [Termination of Job Pods](#termination-of-job-pods)).
-{{< /note >}}
-
 ## Alternatives
 
 ### Bare Pods
diff --git a/content/en/docs/tasks/job/pod-failure-policy.md b/content/en/docs/tasks/job/pod-failure-policy.md
index ee1bbb3ac8..d9dc183347 100644
--- a/content/en/docs/tasks/job/pod-failure-policy.md
+++ b/content/en/docs/tasks/job/pod-failure-policy.md
@@ -53,10 +53,15 @@ After around 30s the entire Job should be terminated. Inspect the status of the
 kubectl get jobs -l job-name=job-pod-failure-policy-failjob -o yaml
 ```
 
-In the Job status, see a job `Failed` condition with the field `reason`
-equal `PodFailurePolicy`. Additionally, the `message` field contains a
-more detailed information about the Job termination, such as:
-`Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+In the Job status, you can see the following conditions:
+- `FailureTarget` condition: has a `reason` field set to `PodFailurePolicy` and
+  a `message` field with more information about the termination, such as
+  `Container main for pod default/job-pod-failure-policy-failjob-8ckj8 failed with exit code 42 matching FailJob rule at index 0`.
+  The Job controller adds this condition as soon as the Job is considered a failure.
+  For details, see [Termination of Job Pods](/docs/concepts/workloads/controllers/job/#termination-of-job-pods).
+- `Failed` condition: same `reason` and `message` as the `FailureTarget`
+  condition. The Job controller adds this condition only after all of the Job's
+  Pods are terminated. Both conditions are shown in the sketch below.
 
 For comparison, if the Pod failure policy was disabled it would take 6 retries
 of the Pod, taking at least 2 minutes.
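+
+For illustration, the `conditions` section of the failed Job's status might look
+similar to the following sketch; the Pod name and the details in each `message`
+will differ in your cluster:
+
+```yaml
+status:
+  conditions:
+  - type: FailureTarget   # added as soon as the Job failure is decided
+    status: "True"
+    reason: PodFailurePolicy
+    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8
+      failed with exit code 42 matching FailJob rule at index 0
+  - type: Failed          # added once all of the Job's Pods have terminated
+    status: "True"
+    reason: PodFailurePolicy
+    message: Container main for pod default/job-pod-failure-policy-failjob-8ckj8
+      failed with exit code 42 matching FailJob rule at index 0
+```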