---
layout: blog
title: "Kubernetes 1.33: Job's SuccessPolicy Goes GA"
date: 2025-05-15T10:30:00-08:00
slug: kubernetes-1-33-jobs-success-policy-goes-ga
authors: >
  [Yuki Iwai](https://github.com/tenzen-y) (CyberAgent, Inc)
translator: >
  [Michael Yao](https://github.com/windsonsea) (DaoCloud)
---

On behalf of the Kubernetes project, I'm pleased to announce that Job _success policy_ has graduated to General Availability (GA) as part of the v1.33 release.

## About Job's Success Policy {#about-jobs-success-policy}

In batch workloads, you might want to use a leader-follower pattern such as
[MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface),
in which the leader controls the execution, including the followers' lifecycle.

In this case, you might want to mark the Job as succeeded even if some of the indexes failed.
Unfortunately, a leader-follower Kubernetes Job that doesn't use a success policy would, in most cases,
require **all** Pods to finish successfully before the Job reaches an overall succeeded state.

For Kubernetes Jobs, the API lets you specify early exit criteria using the `.spec.successPolicy` field
(you can only use this field for an [indexed Job](/docs/concepts/workloads/controllers/job/#completion-mode)).
The field describes a set of rules, each of which uses a list of succeeded indexes,
a minimum number of succeeded indexes, or both.

This newly stable field is especially valuable for scientific simulation, AI/ML, and High-Performance Computing (HPC) batch workloads.
Users in these areas often run numerous experiments and may need only a specific number of them to complete successfully, rather than all of them.
In the leader-follower case, failure of the leader index is the only relevant Job exit criterion, and the outcomes of individual follower Pods are handled
only indirectly, via the status of the leader index.
Moreover, followers do not know when they can terminate themselves.

After a Job meets any __success policy__ rule, the Job is marked as succeeded and all of its Pods are terminated, including the running ones.

## How it works {#how-it-works}

The following excerpt from a Job manifest, using `.successPolicy.rules[0].succeededCount`, shows an example of
a custom success policy:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededCount: 1
```

Here, the Job is marked as succeeded as soon as any one index succeeds, regardless of its number.
Additionally, you can constrain which index numbers count towards `succeededCount` by setting
`.successPolicy.rules[0].succeededIndexes`, as shown below:

```yaml
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
  rules:
  - succeededIndexes: "0" # index of the leader Pod (the field is a string)
    succeededCount: 1
```

This example shows that the Job is marked as succeeded once the Pod with a specific index (Pod index 0) has succeeded.
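
The rule semantics in the two excerpts above can be sketched in a few lines of Python. This is only an illustration, not the Job controller's implementation. The function names `parse_indexes`, `rule_met`, and `policy_met` are hypothetical. The sketch also assumes that a rule with only `succeededIndexes` requires every listed index to succeed, and that index lists use the comma-and-hyphen form (for example `0,2-4`).

```python
from typing import Optional, Set


def parse_indexes(spec: str) -> Set[int]:
    """Expand an index list such as "0,2-4" into {0, 2, 3, 4}."""
    indexes: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.update(range(int(first), int(last) + 1))
        else:
            indexes.add(int(part))
    return indexes


def rule_met(succeeded: Set[int],
             succeeded_indexes: Optional[str] = None,
             succeeded_count: Optional[int] = None) -> bool:
    """Return True when a single rule is satisfied (illustrative only)."""
    if succeeded_indexes is None and succeeded_count is None:
        # Assumption: a rule has to set at least one of the two fields.
        return False
    candidates = succeeded
    if succeeded_indexes is not None:
        listed = parse_indexes(succeeded_indexes)
        candidates = succeeded & listed
        if succeeded_count is None:
            # Only succeededIndexes: every listed index must have succeeded.
            return candidates == listed
    # succeededCount, optionally restricted to the listed indexes.
    return len(candidates) >= succeeded_count


def policy_met(succeeded: Set[int], rules: list) -> bool:
    """The policy is met as soon as any one rule is met."""
    return any(rule_met(succeeded, **rule) for rule in rules)
```

With these helpers, `rule_met({3}, succeeded_count=1)` holds for the first excerpt, while `rule_met({3}, succeeded_indexes="0", succeeded_count=1)` does not, because only the leader index 0 counts in the second excerpt.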
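
Once the Job either satisfies one of the `successPolicy` rules or achieves its `Complete` criteria based on `.spec.completions`,
the Job controller within kube-controller-manager adds the `SuccessCriteriaMet` condition to the Job status.
After that, the Job controller initiates cleanup and termination of Pods for Jobs that have the `SuccessCriteriaMet` condition.
Finally, the Job obtains the `Complete` condition once the Job controller has finished cleanup and termination.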
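
To make the ordering of the two conditions concrete, here is a minimal sketch that classifies a Job from its `.status` (for example, the `status` object in the output of `kubectl get job <name> -o json`, loaded as a Python dict). The function name `job_phase` and the three phase labels it returns are hypothetical and exist only for this illustration. The sketch looks only at the `SuccessCriteriaMet` and `Complete` conditions and ignores failure conditions.

```python
def job_phase(status: dict) -> str:
    """Classify a Job by its SuccessCriteriaMet and Complete conditions.

    The returned labels are made up for this example; they are not API values.
    """
    # Collect the types of all conditions whose status is "True".
    true_types = {
        cond["type"]
        for cond in status.get("conditions", [])
        if cond.get("status") == "True"
    }
    if "Complete" in true_types:
        # Cleanup and termination of Pods have finished.
        return "complete"
    if "SuccessCriteriaMet" in true_types:
        # Success is already decided; remaining Pods are being terminated.
        return "terminating-pods"
    return "running"
```

A Job that has `SuccessCriteriaMet` but not yet `Complete` is already decided as successful, so a client waiting only for the outcome could act on the earlier condition.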

## Learn more {#learn-more}

* Read the documentation for
  [success policy](/docs/concepts/workloads/controllers/job/#success-policy).
* Read the KEP for the [Job success/completion policy](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3998-job-success-completion-policy).

## Get involved {#get-involved}

This work was led by the Kubernetes
[batch working group](https://github.com/kubernetes/community/tree/master/wg-batch)
in close collaboration with the
[SIG Apps](https://github.com/kubernetes/community/tree/master/sig-apps) community.

If you are interested in working on new features in this space, I recommend
subscribing to our [Slack](https://kubernetes.slack.com/messages/wg-batch)
channel and attending the regular community meetings.