Merge pull request #38795 from windsonsea/jobyh

[zh] sync /controllers/job.md
pull/38836/head
Kubernetes Prow Robot 2023-01-08 09:11:30 -08:00 committed by GitHub
commit f76ecfa831
1 changed file with 221 additions and 102 deletions


@@ -9,6 +9,7 @@ weight: 50
---
<!--
reviewers:
- alculquicondor
- erictune
- soltysh
title: Jobs
@@ -21,6 +22,7 @@ weight: 50
-->
<!-- overview -->
<!--
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate.
As pods successfully complete, the Job tracks the successful completions. When a specified number
@@ -90,31 +92,23 @@ Check on the status of the Job with `kubectl`:
-->
Check on the status of the Job with `kubectl`:
```shell
kubectl describe jobs/pi
```
<!--
The output is similar to this:
-->
The output is similar to this:
```
{{< tabs name="Check status of Job" >}}
{{< tab name="kubectl describe job pi" codelang="bash" >}}
Name: pi
Namespace: default
Selector: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Selector: controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
Labels: controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
job-name=pi
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":4,"template":...
Annotations: batch.kubernetes.io/job-tracking:
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Completion Mode: NonIndexed
Start Time: Fri, 28 Oct 2022 13:05:18 +0530
Completed At: Fri, 28 Oct 2022 13:05:21 +0530
Duration: 3s
Pods Statuses: 0 Active / 1 Succeeded / 0 Failed
Pod Template:
Labels: controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels: controller-uid=0cd26dd5-88a2-4a5f-a203-ea19a1d5d578
job-name=pi
Containers:
pi:
@@ -132,8 +126,66 @@ Pod Template:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: pi-5rwd7
```
Normal SuccessfulCreate 21s job-controller Created pod: pi-xf9p4
Normal Completed 18s job-controller Job completed
{{< /tab >}}
{{< tab name="kubectl get job pi -o yaml" codelang="bash" >}}
apiVersion: batch/v1
kind: Job
metadata:
annotations:
batch.kubernetes.io/job-tracking: ""
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"pi","namespace":"default"},"spec":{"backoffLimit":4,"template":{"spec":{"containers":[{"command":["perl","-Mbignum=bpi","-wle","print bpi(2000)"],"image":"perl:5.34.0","name":"pi"}],"restartPolicy":"Never"}}}}
creationTimestamp: "2022-11-10T17:53:53Z"
generation: 1
labels:
controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
job-name: pi
name: pi
namespace: default
resourceVersion: "4751"
uid: 204fb678-040b-497f-9266-35ffa8716d14
spec:
backoffLimit: 4
completionMode: NonIndexed
completions: 1
parallelism: 1
selector:
matchLabels:
controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
suspend: false
template:
metadata:
creationTimestamp: null
labels:
controller-uid: 204fb678-040b-497f-9266-35ffa8716d14
job-name: pi
spec:
containers:
- command:
- perl
- -Mbignum=bpi
- -wle
- print bpi(2000)
image: perl:5.34.0
imagePullPolicy: IfNotPresent
name: pi
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
restartPolicy: Never
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
status:
active: 1
ready: 0
startTime: "2022-11-10T17:53:57Z"
uncountedTerminatedPods: {}
{{< /tab >}}
{{< /tabs >}}
<!--
To view completed Pods of a Job, use `kubectl get pods`.
@@ -186,14 +238,27 @@ The output is similar to this:
## Writing a Job spec
As with all other Kubernetes config, a Job needs `apiVersion`, `kind`, and `metadata` fields.
Its name must be a valid [DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).
When the control plane creates new Pods for a Job, the `.metadata.name` of the
Job is part of the basis for naming those Pods. The name of a Job must be a valid
[DNS subdomain](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names)
value, but this can produce unexpected results for the Pod hostnames. For best compatibility,
the name should follow the more restrictive rules for a
[DNS label](/docs/concepts/overview/working-with-objects/names#dns-label-names).
Even when the name is a DNS subdomain, the name must be no longer than 63
characters.
A Job also needs a [`.spec` section](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status).
-->
## Writing a Job spec {#writing-a-job-spec}
As with all other Kubernetes config, a Job needs `apiVersion`, `kind`, and `metadata` fields.
The name of a Job must be a valid [DNS subdomain](/zh-cn/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).
When the control plane creates new Pods for a Job, the `.metadata.name` of the Job is part of the basis for naming those Pods.
The name of a Job must be a valid [DNS subdomain](/zh-cn/docs/concepts/overview/working-with-objects/names#dns-subdomain-names) value,
but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a
[DNS label](/zh-cn/docs/concepts/overview/working-with-objects/names#dns-label-names).
Even when the name is a DNS subdomain, the name must be no longer than 63 characters.
A Job also needs a [`.spec` section](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status).
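As a minimal sketch (the `sample-job` name, `busybox` image, and `echo` command are placeholders, not values taken from this page), a Job whose name also satisfies the stricter DNS-label rules could look like this:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job        # lowercase alphanumerics and '-', 63 characters or fewer
spec:
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.36
        command: ["echo", "hello from a Job Pod"]
      restartPolicy: Never
```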
@@ -402,7 +467,6 @@ on the node, but the container is re-run. Therefore, your program needs to hand
restarted locally, or else specify `.spec.template.spec.restartPolicy = "Never"`.
See [pod lifecycle](/docs/concepts/workloads/pods/pod-lifecycle/#example-states) for more information on `restartPolicy`.
-->
## Handling Pod and container failures {#handling-pod-and-container-failures}
A container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit code,
@@ -429,6 +493,15 @@ caused by previous runs.
This means that your application needs to handle the case when it is restarted in a new Pod.
In particular, it needs to handle temporary files, locks, incomplete output and the like caused by previous runs.
<!--
By default, each pod failure is counted towards the `.spec.backoffLimit` limit,
see [pod backoff failure policy](#pod-backoff-failure-policy). However, you can
customize handling of pod failures by setting the Job's [pod failure policy](#pod-failure-policy).
-->
By default, each Pod failure is counted towards the `.spec.backoffLimit` limit;
see the [pod backoff failure policy](#pod-backoff-failure-policy).
However, you can customize handling of Pod failures by setting the Job's [pod failure policy](#pod-failure-policy).
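A minimal sketch of how these fields fit together (the `busybox` image and the deliberately failing command are hypothetical): with `restartPolicy: Never`, each failed Pod counts towards `backoffLimit` until the Job is marked failed.
```yaml
spec:
  backoffLimit: 4              # count each failed Pod towards this limit
  template:
    spec:
      restartPolicy: Never     # the Job controller creates replacement Pods
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "exit 1"]   # always fails, to illustrate retries
```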
<!--
Note that even if you specify `.spec.parallelism = 1` and `.spec.completions = 1` and
`.spec.template.spec.restartPolicy = "Never"`, the same program may
@@ -444,6 +517,29 @@ multiple pods running at once. Therefore, your pods must also be tolerant of co
then it is possible that there will sometimes be multiple Pods running at once.
Therefore, your Pods must also be tolerant of concurrency.
<!--
When the [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
`PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled,
and the `.spec.podFailurePolicy` field is set, the Job controller does not consider a terminating
Pod (a pod that has a `.metadata.deletionTimestamp` field set) as a failure until that Pod is
terminal (its `.status.phase` is `Failed` or `Succeeded`). However, the Job controller
creates a replacement Pod as soon as the termination becomes apparent. Once the
pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy`
for the relevant Job, taking this now-terminated Pod into consideration.
If either of these requirements is not satisfied, the Job controller counts
a terminating Pod as an immediate failure, even if that Pod later terminates
with `phase: "Succeeded"`.
-->
When the [feature gates](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
`PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled and the `.spec.podFailurePolicy` field is set,
the Job controller does not consider a terminating Pod (a Pod that has a `.metadata.deletionTimestamp` field set) as a failure
until that Pod is terminal (its `.status.phase` is `Failed` or `Succeeded`).
However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the Pod terminates,
the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the relevant Job, taking this now-terminated Pod into consideration.
If either of these requirements is not satisfied, the Job controller counts the terminating Pod as an immediate failure, even if that Pod later terminates with `phase: "Succeeded"`.
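For orientation, a sketch of a `.spec.podFailurePolicy` along those lines (the `main` container name and the exit code `42` are assumptions for illustration): it ignores disruption-related terminations and fails the Job outright on a non-retriable exit code.
```yaml
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: Ignore             # disruptions do not count towards backoffLimit
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob            # a non-retriable exit code fails the whole Job
      onExitCodes:
        containerName: main      # hypothetical container name
        operator: In
        values: [42]
```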
<!--
### Pod backoff failure policy
@@ -461,10 +557,6 @@ The number of retries is calculated in two ways:
If either of the calculations reaches the `.spec.backoffLimit`, the Job is
considered failed.
When the [`JobTrackingWithFinalizers`](#job-tracking-with-finalizers) feature is
disabled, the number of failed Pods is only based on Pods that are still present
in the API.
-->
### Pod backoff failure policy {#pod-backoff-failure-policy}
@@ -482,9 +574,6 @@ in the API.
If either of the calculations reaches the `.spec.backoffLimit`, the Job is considered failed.
When the [`JobTrackingWithFinalizers`](#job-tracking-with-finalizers) feature is disabled,
the number of failed Pods is only based on Pods that are still present in the API.
{{< note >}}
<!--
If your job has `restartPolicy = "OnFailure"`, keep in mind that your Pod running the Job
@@ -508,7 +597,7 @@ allows you to still view the logs of completed pods to check for errors, warning
The job object also remains after it is completed so that you can view its status. It is up to the user to delete
old jobs after noting their status. Delete the job with `kubectl` (e.g. `kubectl delete jobs/pi` or `kubectl delete -f ./job.yaml`). When you delete the job using `kubectl`, all the pods it created are deleted too.
-->
## Job termination and cleanup {#clean-up-finished-jobs-automatically}
## Job termination and cleanup {#job-termination-and-cleanup}
When a Job completes, no more Pods are created, but the Pods are [usually](#pod-backoff-failure-policy) not deleted either.
Keeping them around allows you to still view the logs of completed Pods to check for errors, warnings, or other diagnostic output.
@@ -658,6 +747,37 @@ The Job `pi-with-ttl` becomes eligible to be automatically deleted, 100 seconds after it finishes.
If the field is set to `0`, the Job becomes eligible to be automatically deleted immediately after it finishes.
If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.
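For reference, a sketch of what the `pi-with-ttl` Job referenced above might look like; the image and command mirror the `pi` example earlier on this page, and the manifest is illustrative rather than a verbatim copy of the example file.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100   # eligible for automatic deletion 100s after finishing
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```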
{{< note >}}
<!--
It is recommended to set the `ttlSecondsAfterFinished` field because unmanaged jobs
(Jobs that you created directly, and not indirectly through other workload APIs
such as CronJob) have a default deletion
policy of `orphanDependents` causing Pods created by an unmanaged Job to be left around
after that Job is fully deleted.
Even though the {{< glossary_tooltip text="control plane" term_id="control-plane" >}} eventually
[garbage collects](/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection)
the Pods from a deleted Job after they either fail or complete, sometimes those
lingering pods may cause cluster performance degradation or in worst case cause the
cluster to go offline due to this degradation.
-->
It is recommended to set the `ttlSecondsAfterFinished` field because unmanaged Jobs
(Jobs that you created directly, and not indirectly through other workload APIs such as CronJob)
have a default deletion policy of `orphanDependents`, causing Pods created by an unmanaged Job to be left around after that Job is fully deleted.
Even though the {{< glossary_tooltip text="control plane" term_id="control-plane" >}} eventually
[garbage collects](/zh-cn/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection) the Pods from a deleted Job after they either fail or complete,
sometimes those lingering Pods may cause cluster performance degradation or, in the worst case, cause the cluster to go offline due to this degradation.
<!--
You can use [LimitRanges](/docs/concepts/policy/limit-range/) and
[ResourceQuotas](/docs/concepts/policy/resource-quotas/) to place a
cap on the amount of resources that a particular namespace can
consume.
-->
You can use [LimitRanges](/zh-cn/docs/concepts/policy/limit-range/) and
[ResourceQuotas](/zh-cn/docs/concepts/policy/resource-quotas/)
to place a cap on the amount of resources that a particular namespace can consume.
{{< /note >}}
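As a hedged illustration of that approach (the `job-quota` name and `batch-jobs` namespace are hypothetical), a ResourceQuota can cap both the number of Job objects and the number of Pods in a namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-quota          # hypothetical name
  namespace: batch-jobs    # hypothetical namespace
spec:
  hard:
    count/jobs.batch: "20" # at most 20 Job objects in this namespace
    pods: "100"            # at most 100 Pods at any one time
```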
<!--
## Job patterns
@@ -711,6 +831,14 @@ The tradeoffs are:
<!--
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs.
The pattern names are also links to examples and more detailed description.
| Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? |
| ----------------------------------------------- |:-----------------:|:---------------------------:|:-------------------:|
| [Queue with Pod Per Work Item] | ✓ | | sometimes |
| [Queue with Variable Pod Count] | ✓ | ✓ | |
| [Indexed Job with Static Work Assignment] | ✓ | | ✓ |
| [Job Template Expansion] | | | ✓ |
| [Job with Pod-to-Pod Communication] | ✓ | sometimes | sometimes |
-->
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs.
The pattern names are also links to examples and more detailed description.
@@ -742,6 +870,15 @@ Here, `W` is the number of work items.
The table below shows the required settings for `.spec.parallelism` and `.spec.completions` for each pattern.
Here, `W` is the number of work items.
<!--
| Pattern | `.spec.completions` | `.spec.parallelism` |
| ----------------------------------------------- |:-------------------:|:--------------------:|
| [Queue with Pod Per Work Item] | W | any |
| [Queue with Variable Pod Count] | null | any |
| [Indexed Job with Static Work Assignment] | W | any |
| [Job Template Expansion] | 1 | should be 1 |
| [Job with Pod-to-Pod Communication] | W | W |
-->
| Pattern | `.spec.completions` | `.spec.parallelism` |
| ----- |:-------------------:|:--------------------:|
| [Queue with Pod Per Work Item](/zh-cn/docs/tasks/job/coarse-parallel-processing-work-queue/) | W | any |
@@ -1090,7 +1227,7 @@ mismatch.
-->
### Pod failure policy {#pod-failure-policy}
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
{{< feature-state for_k8s_version="v1.26" state="beta" >}}
{{< note >}}
<!--
@@ -1100,14 +1237,14 @@ enabled in your cluster. Additionally, it is recommended
to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle
Pod disruption conditions in the Pod failure policy (see also:
[Pod disruption conditions](/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)). Both feature gates are
available in Kubernetes v1.25.
available in Kubernetes {{< skew currentVersion >}}.
-->
You can only configure a Pod failure policy for a Job if you have the
`JobPodFailurePolicy` [feature gate](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
enabled in your cluster.
Additionally, it is recommended to enable the `PodDisruptionConditions` feature gate in order to be able to detect and handle Pod disruption conditions in the Pod failure policy
(see also: [Pod disruption conditions](/zh-cn/docs/concepts/workloads/pods/disruptions#pod-disruption-conditions)).
Both feature gates are available in Kubernetes v1.25.
Both feature gates are available in Kubernetes {{< skew currentVersion >}}.
{{< /note >}}
<!--
@@ -1250,75 +1387,53 @@ These are some requirements and semantics of the API:
-->
### Job tracking with finalizers {#job-tracking-with-finalizers}
{{< feature-state for_k8s_version="v1.23" state="beta" >}}
{{< feature-state for_k8s_version="v1.26" state="stable" >}}
{{< note >}}
<!--
In order to use this behavior, you must enable the `JobTrackingWithFinalizers`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/)
and the [controller manager](/docs/reference/command-line-tools-reference/kube-controller-manager/).
It is enabled by default.
The control plane doesn't track Jobs using finalizers, if the Jobs were created
when the feature gate `JobTrackingWithFinalizers` was disabled, even after you
upgrade the control plane to 1.26.
-->
In order to use this behavior, you must enable the `JobTrackingWithFinalizers`
[feature gate](/zh-cn/docs/reference/command-line-tools-reference/feature-gates/)
on the [API server](/zh-cn/docs/reference/command-line-tools-reference/kube-apiserver/)
and the [controller manager](/zh-cn/docs/reference/command-line-tools-reference/kube-controller-manager/).
It is enabled by default.
<!--
When enabled, the control plane tracks new Jobs using the behavior described
below. Jobs created before the feature was enabled are unaffected. As a user,
the only difference you would see is that the control plane tracking of Job
completion is more accurate.
-->
When enabled, the control plane tracks new Jobs using the behavior described below.
Jobs created before the feature was enabled are unaffected.
As a user, the only difference you would see is that the control plane tracking of Job completion is more accurate.
The control plane doesn't track Jobs using finalizers if the Jobs were created while the feature gate
`JobTrackingWithFinalizers` was disabled, even after you upgrade the control plane to 1.26.
{{< /note >}}
<!--
When this feature isn't enabled, the Job {{< glossary_tooltip term_id="controller" >}}
relies on counting the Pods that exist in the cluster to track the Job status,
that is, to keep the counters for `succeeded` and `failed` Pods.
However, Pods can be removed for a number of reasons, including:
- The garbage collector that removes orphan Pods when a Node goes down.
- The garbage collector that removes finished Pods (in `Succeeded` or `Failed`
phase) after a threshold.
- Human intervention to delete Pods belonging to a Job.
- An external controller (not provided as part of Kubernetes) that removes or
replaces Pods.
The control plane keeps track of the Pods that belong to any Job and notices if
any such Pod is removed from the API server. To do that, the Job controller
creates Pods with the finalizer `batch.kubernetes.io/job-tracking`. The
controller removes the finalizer only after the Pod has been accounted for in
the Job status, allowing the Pod to be removed by other controllers or users.
Jobs created before upgrading to Kubernetes 1.26 or before the feature gate
`JobTrackingWithFinalizers` is enabled are tracked without the use of Pod
finalizers.
The Job {{< glossary_tooltip term_id="controller" text="controller" >}} updates
the status counters for `succeeded` and `failed` Pods based only on the Pods
that exist in the cluster. The control plane can lose track of the progress of
the Job if Pods are deleted from the cluster.
-->
When this feature isn't enabled, the Job {{< glossary_tooltip term_id="controller" >}} relies on counting the Pods that exist in the cluster to track the Job status,
that is, to keep the counters for `succeeded` and `failed` Pods.
However, Pods can be removed for a number of reasons, including:
- The garbage collector that removes orphan Pods when a Node goes down.
- The garbage collector that removes finished Pods (in `Succeeded` or `Failed` phase) after a threshold.
- Human intervention to delete Pods belonging to a Job.
- An external controller (not provided as part of Kubernetes) that removes or replaces Pods.
The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is removed from the API server.
To do that, the Job controller creates Pods with the finalizer `batch.kubernetes.io/job-tracking`.
The controller removes the finalizer only after the Pod has been accounted for in the Job status, allowing the Pod to be removed by other controllers or users.
Jobs created before upgrading to Kubernetes 1.26, or before the feature gate `JobTrackingWithFinalizers` is enabled,
are tracked without the use of Pod finalizers.
The Job {{< glossary_tooltip term_id="controller" text="controller" >}} updates the status counters for `succeeded` and `failed` Pods
based only on the Pods that exist in the cluster. The control plane can lose track of the progress of the Job if Pods are deleted from the cluster.
<!--
If you enable the `JobTrackingWithFinalizers` feature for your cluster, the
control plane keeps track of the Pods that belong to any Job and notices if any
such Pod is removed from the API server. To do that, the Job controller creates Pods with
the finalizer `batch.kubernetes.io/job-tracking`. The controller removes the
finalizer only after the Pod has been accounted for in the Job status, allowing
the Pod to be removed by other controllers or users.
The Job controller uses the new algorithm for new Jobs only. Jobs created
before the feature is enabled are unaffected. You can determine if the Job
controller is tracking a Job using Pod finalizers by checking if the Job has the
annotation `batch.kubernetes.io/job-tracking`. You should **not** manually add
or remove this annotation from Jobs.
You can determine if the control plane is tracking a Job using Pod finalizers by
checking if the Job has the annotation
`batch.kubernetes.io/job-tracking`. You should **not** manually add or remove
this annotation from Jobs. Instead, you can recreate the Jobs to ensure they
are tracked using Pod finalizers.
-->
If you enable the `JobTrackingWithFinalizers` feature for your cluster, the control plane keeps track of the Pods that belong to any Job
and notices if any such Pod is removed from the API server.
To do that, the Job controller creates Pods with the finalizer `batch.kubernetes.io/job-tracking`.
The controller removes the finalizer only after the Pod has been accounted for in the Job status, allowing the Pod to be removed by other controllers or users.
The Job controller uses the new algorithm for new Jobs only. Jobs created before the feature is enabled are unaffected.
You can check whether a Job has the annotation `batch.kubernetes.io/job-tracking`
to determine if the Job controller is tracking the Job using Pod finalizers.
to determine if the control plane is tracking the Job using Pod finalizers.
You should **not** manually add or remove this annotation from Jobs.
Instead, you can recreate the Jobs to ensure they are tracked using Pod finalizers.
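For orientation, a sketch of what finalizer-based tracking looks like on the objects themselves; the Job annotation matches the `kubectl get job pi -o yaml` output earlier on this page, while the Pod excerpt is illustrative:
```yaml
# Job annotated by the control plane when it is tracked with Pod finalizers
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
  annotations:
    batch.kubernetes.io/job-tracking: ""   # managed by the control plane; do not edit
---
# Pods created for such a Job carry a matching finalizer until they are
# accounted for in the Job status
apiVersion: v1
kind: Pod
metadata:
  name: pi-xf9p4
  finalizers:
  - batch.kubernetes.io/job-tracking
```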
<!--
## Alternatives
@@ -1406,6 +1521,8 @@ object, but maintains complete control over what Pods are created and how work i
* Read about [`CronJob`](/docs/concepts/workloads/controllers/cron-jobs/), which you
can use to define a series of Jobs that will run based on a schedule, similar to
the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable pod failures
using `podFailurePolicy`, based on the step-by-step [examples](/docs/tasks/job/pod-failure-policy/).
-->
* Learn about [Pods](/zh-cn/docs/concepts/workloads/pods).
* Read about different ways of running Jobs:
@@ -1418,3 +1535,5 @@ object, but maintains complete control over what Pods are created and how work i
  object definition to understand the API for the resource.
* Read about [`CronJob`](/zh-cn/docs/concepts/workloads/controllers/cron-jobs/), which you
  can use to define a series of Jobs that will run based on a schedule, similar to the UNIX tool `cron`.
* Practice how to configure handling of retriable and non-retriable Pod failures using `podFailurePolicy`,
  based on the step-by-step [examples](/zh-cn/docs/tasks/job/pod-failure-policy/).