[zh] sync pod-scheduling-readiness.md

parent 1bb518d9e0
commit 8e89830e92
@@ -3,7 +3,6 @@ title: Pod 调度就绪态
 content_type: concept
 weight: 40
 ---

 <!--
 title: Pod Scheduling Readiness
 content_type: concept
@@ -27,7 +26,7 @@ to be considered for scheduling.
 Pod 一旦创建就被认为准备好进行调度。
 Kubernetes 调度程序尽职尽责地寻找节点来放置所有待处理的 Pod。
 然而,在实际环境中,会有一些 Pod 可能会长时间处于"缺少必要资源"状态。
-这些 Pod 实际上以一种不必要的方式扰乱了调度器(以及下游的集成方,如 Cluster AutoScaler)。
+这些 Pod 实际上以一种不必要的方式扰乱了调度器(以及 Cluster AutoScaler 这类下游的集成方)。

 通过指定或删除 Pod 的 `.spec.schedulingGates`,可以控制 Pod 何时准备好被纳入考量进行调度。

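For context on the hunk above: `.spec.schedulingGates` is declared directly in the Pod manifest. A minimal sketch of a gated Pod, with gate names matching the updated output later in this diff (the Pod name and `pause` image are illustrative placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/foo   # scheduler skips the Pod while this list is non-empty
  - name: example.com/bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
```

While the list is non-empty, the Pod reports the `SchedulingGated` status and is never considered for binding.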
@@ -47,7 +46,8 @@ each schedulingGate can be removed in arbitrary order, but addition of a new sch
 该字段只能在创建 Pod 时初始化(由客户端创建,或在准入期间更改)。
 创建后,每个 schedulingGate 可以按任意顺序删除,但不允许添加新的调度门控。

-{{< figure src="/docs/images/podSchedulingGates.svg" alt="pod-scheduling-gates-diagram" caption="<!--Figure. Pod SchedulingGates-->数字。Pod SchedulingGates" class="diagram-large" link="https://mermaid.live/edit#pako:eNplkktTwyAUhf8KgzuHWpukaYszutGlK3caFxQuCVMCGSDVTKf_XfKyPlhxz4HDB9wT5lYAptgHFuBRsdKxenFMClMYFIdfUdRYgbiD6ItJTEbR8wpEq5UpUfnDTf-5cbPoJjcbXdcaE61RVJIiqJvQ_Y30D-OCt-t3tFjcR5wZayiVnIGmkv4NiEfX9jijKTmmRH5jf0sRugOP0HyHUc1m6KGMFP27cM28fwSJDluPpNKaXqVJzmFNfHD2APRKSjnNFx9KhIpmzSfhVls3eHdTRrwG8QnxKfEZUUNeYTDBNbiaKRF_5dSfX-BQQQ0FpnEqQLJWhwIX5hyXsjbYl85wTINrgeC2EZd_xFQy7b_VJ6GCdd-itkxALE84dE3fAqXyIUZya6Qqe711OspVCI2ny2Vv35QqVO3-htt66ZWomEvVcZcv8yTfsiSFfJOydZoKvl_ttjLJVlJsblcJw-czwQ0zr9ZeqGDgeR77b2jD8xdtjtDn" >}}
+{{< figure src="/docs/images/podSchedulingGates.svg" alt="pod-scheduling-gates-diagram" caption="<!--Figure. Pod SchedulingGates-->图:Pod SchedulingGates" class="diagram-large" link="https://mermaid.live/edit#pako:eNplkktTwyAUhf8KgzuHWpukaYszutGlK3caFxQuCVMCGSDVTKf_XfKyPlhxz4HDB9wT5lYAptgHFuBRsdKxenFMClMYFIdfUdRYgbiD6ItJTEbR8wpEq5UpUfnDTf-5cbPoJjcbXdcaE61RVJIiqJvQ_Y30D-OCt-t3tFjcR5wZayiVnIGmkv4NiEfX9jijKTmmRH5jf0sRugOP0HyHUc1m6KGMFP27cM28fwSJDluPpNKaXqVJzmFNfHD2APRKSjnNFx9KhIpmzSfhVls3eHdTRrwG8QnxKfEZUUNeYTDBNbiaKRF_5dSfX-BQQQ0FpnEqQLJWhwIX5hyXsjbYl85wTINrgeC2EZd_xFQy7b_VJ6GCdd-itkxALE84dE3fAqXyIUZya6Qqe711OspVCI2ny2Vv35QqVO3-htt66ZWomEvVcZcv8yTfsiSFfJOydZoKvl_ttjLJVlJsblcJw-czwQ0zr9ZeqGDgeR77b2jD8xdtjtDn" >}}

 <!--
 ## Usage example

@@ -93,7 +93,7 @@ The output is:
 输出是:

 ```none
-[{"name":"foo"},{"name":"bar"}]
+[{"name":"example.com/foo"},{"name":"example.com/bar"}]
 ```

 <!--
@@ -126,7 +126,8 @@ kubectl get pod test-pod -o wide
 Given the test-pod doesn't request any CPU/memory resources, it's expected that this Pod's state get
 transited from previous `SchedulingGated` to `Running`:
 -->
-鉴于 test-pod 不请求任何 CPU/内存资源,预计此 Pod 的状态会从之前的 `SchedulingGated` 转变为 `Running`:
+鉴于 test-pod 不请求任何 CPU/内存资源,预计此 Pod 的状态会从之前的
+`SchedulingGated` 转变为 `Running`:

 ```none
 NAME       READY   STATUS    RESTARTS   AGE   IP         NODE
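The status transition this hunk describes is triggered by removing every entry in `schedulingGates`, for example by re-applying the manifest with the field dropped. A gate-free version of the earlier sketch (same illustrative name and image):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  # no schedulingGates: the scheduler can now pick a node and bind the Pod
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
```

Since gates can only be removed after creation, never added, this edit is one-way.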
@@ -146,9 +147,61 @@ scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the met
 以区分 Pod 是否已尝试调度但被宣称不可调度,或明确标记为未准备好调度。
 你可以使用 `scheduler_pending_pods{queue="gated"}` 来检查指标结果。

+<!--
+## Mutable Pod Scheduling Directives
+-->
+## 可变 Pod 调度指令 {#mutable-pod-scheduling-directives}
+
+{{< feature-state for_k8s_version="v1.27" state="beta" >}}
+
+<!--
+You can mutate scheduling directives of Pods while they have scheduling gates, with certain constraints.
+At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated
+directives would cause the Pods to only be able to be scheduled on a subset of the nodes that it would
+previously match. More concretely, the rules for updating a Pod's scheduling directives are as follows:
+-->
+当 Pod 具有调度门控时,你可以在某些约束条件下改变 Pod 的调度指令。
+在高层次上,你只能收紧 Pod 的调度指令。换句话说,更新后的指令将导致
+Pod 只能被调度到它之前匹配的节点子集上。
+更具体地说,更新 Pod 的调度指令的规则如下:
+
+<!--
+1. For `.spec.nodeSelector`, only additions are allowed. If absent, it will be allowed to be set.
+
+2. For `spec.affinity.nodeAffinity`, if nil, then setting anything is allowed.
+-->
+1. 对于 `.spec.nodeSelector`,只允许增加。如果原来未设置,则允许设置此字段。
+
+2. 对于 `spec.affinity.nodeAffinity`,如果当前值为 nil,则允许设置为任意值。
+
+<!--
+3. If `NodeSelectorTerms` was empty, it will be allowed to be set.
+   If not empty, then only additions of `NodeSelectorRequirements` to `matchExpressions`
+   or `fieldExpressions` are allowed, and no changes to existing `matchExpressions`
+   and `fieldExpressions` will be allowed. This is because the terms in
+   `.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms`, are ORed
+   while the expressions in `nodeSelectorTerms[].matchExpressions` and
+   `nodeSelectorTerms[].fieldExpressions` are ANDed.
+-->
+3. 如果 `NodeSelectorTerms` 之前为空,则允许设置该字段。
+   如果之前不为空,则仅允许增加 `NodeSelectorRequirements` 到 `matchExpressions`
+   或 `fieldExpressions`,且不允许更改当前的 `matchExpressions` 和 `fieldExpressions`。
+   这是因为 `.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms`
+   中的条目被执行逻辑或运算,而 `nodeSelectorTerms[].matchExpressions` 和
+   `nodeSelectorTerms[].fieldExpressions` 中的表达式被执行逻辑与运算。
+
+<!--
+4. For `.preferredDuringSchedulingIgnoredDuringExecution`, all updates are allowed.
+   This is because preferred terms are not authoritative, and so policy controllers
+   don't validate those terms.
+-->
+4. 对于 `.preferredDuringSchedulingIgnoredDuringExecution`,所有更新都被允许。
+   这是因为首选条目不具有权威性,因此策略控制器不会验证这些条目。
+
 ## {{% heading "whatsnext" %}}

 <!--
 * Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details
 -->
-* 阅读 [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) 了解更多详情
+* 阅读 [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness)
+  了解更多详情

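To make rules 1 to 4 in the hunk above concrete, here is a hypothetical update that the API server would accept while a gate is still present. All names and values below are invented for illustration; only `topology.kubernetes.io/zone` is a real well-known label:

```yaml
# Hypothetical Pod as created: gated, selecting a single zone.
apiVersion: v1
kind: Pod
metadata:
  name: gated-pod          # illustrative name
spec:
  schedulingGates:
  - name: example.com/foo
  nodeSelector:
    topology.kubernetes.io/zone: zone-a
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
---
# Permitted update while gated: adding a nodeSelector key (rule 1).
# The extra key only shrinks the set of matching nodes; removing or
# changing zone-a would widen or shift the set and be rejected.
apiVersion: v1
kind: Pod
metadata:
  name: gated-pod
spec:
  schedulingGates:
  - name: example.com/foo
  nodeSelector:
    topology.kubernetes.io/zone: zone-a
    disktype: ssd          # the added, tightening constraint
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
```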
@@ -54,7 +54,8 @@ For example,
 -->
 ## 概念 {#concepts}

-你可以使用命令 [kubectl taint](/docs/reference/generated/kubectl/kubectl-commands#taint) 给节点增加一个污点。比如,
+你可以使用命令 [kubectl taint](/docs/reference/generated/kubectl/kubectl-commands#taint)
+给节点增加一个污点。比如:

 ```shell
 kubectl taint nodes node1 key1=value1:NoSchedule
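For readers skimming this file's diff: the `kubectl taint` command shown above is equivalent to the following field appearing on the Node object, a sketch showing only the relevant part:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node1
spec:
  taints:
  - key: key1
    value: value1
    effect: NoSchedule   # Pods without a matching toleration are not scheduled here
```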
@@ -82,7 +83,7 @@ to schedule onto `node1`:
 -->
 你可以在 Pod 规约中为 Pod 设置容忍度。
 下面两个容忍度均与上面例子中使用 `kubectl taint` 命令创建的污点相匹配,
-因此如果一个 Pod 拥有其中的任何一个容忍度,都能够被调度到 `node1` :
+因此如果一个 Pod 拥有其中的任何一个容忍度,都能够被调度到 `node1`:

 ```yaml
 tolerations:
@@ -119,11 +120,10 @@ A toleration "matches" a taint if the keys are the same and the effects are the
 -->
 一个容忍度和一个污点相“匹配”是指它们有一样的键名和效果,并且:

-* 如果 `operator` 是 `Exists` (此时容忍度不能指定 `value`),或者
-* 如果 `operator` 是 `Equal` ,则它们的 `value` 应该相等
+* 如果 `operator` 是 `Exists`(此时容忍度不能指定 `value`),或者
+* 如果 `operator` 是 `Equal`,则它们的 `value` 应该相等。

 {{< note >}}

 <!--
 There are two special cases:

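Stepping back from this hunk: the two toleration shapes its bullets describe, reconstructed to match the `key1=value1:NoSchedule` taint used throughout the page, are these:

```yaml
tolerations:
# operator Equal: key, value, and effect must all match the taint
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
# operator Exists: key and effect match; value must be omitted
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
```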
@@ -182,7 +182,7 @@ scheduled onto the node (if it is not yet running on the node).
 <!--
 For example, imagine you taint a node like this
 -->
-例如,假设你给一个节点添加了如下污点
+例如,假设你给一个节点添加了如下污点:

 ```shell
 kubectl taint nodes node1 key1=value1:NoSchedule
@@ -279,7 +279,7 @@ onto nodes labeled with `dedicated=groupName`.
 很容易就能做到)。
 拥有上述容忍度的 Pod 就能够被调度到上述专用节点,同时也能够被调度到集群中的其它节点。
 如果你希望这些 Pod 只能被调度到上述专用节点,
-那么你还需要给这些专用节点另外添加一个和上述污点类似的 label (例如:`dedicated=groupName`),
+那么你还需要给这些专用节点另外添加一个和上述污点类似的 label(例如:`dedicated=groupName`),
 同时还要在上述准入控制器中给 Pod 增加节点亲和性要求,要求上述 Pod 只能被调度到添加了
 `dedicated=groupName` 标签的节点上。

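A sketch of what the admission controller described above might inject into each Pod for the dedicated-node setup. The `dedicated` key and `groupName` value come from the surrounding text; the overall shape is an assumption, not the page's verbatim example:

```yaml
spec:
  # Toleration: lets the Pod land on nodes tainted dedicated=groupName:NoSchedule.
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  # Node affinity: keeps the Pod off every node that lacks the matching label,
  # so it can only run on the dedicated group.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated
            operator: In
            values:
            - groupName
```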
@@ -310,7 +310,7 @@ manually add tolerations to your pods.
 我们希望不需要这类硬件的 Pod 不要被调度到这些特殊节点,以便为后继需要这类硬件的 Pod 保留资源。
 要达到这个目的,可以先给配备了特殊硬件的节点添加污点
 (例如 `kubectl taint nodes nodename special=true:NoSchedule` 或
 `kubectl taint nodes nodename special=true:PreferNoSchedule`),
 然后给使用了这类特殊硬件的 Pod 添加一个相匹配的容忍度。
 和专用节点的例子类似,添加这个容忍度的最简单的方法是使用自定义
 [准入控制器](/zh-cn/docs/reference/access-authn-authz/admission-controllers/)。
@@ -347,7 +347,7 @@ running on the node as follows
 * pods that tolerate the taint with a specified `tolerationSeconds` remain
   bound for the specified amount of time
 -->
-前文提到过污点的效果值 `NoExecute` 会影响已经在节点上运行的 Pod,如下
+前文提到过污点的效果值 `NoExecute` 会影响已经在节点上运行的如下 Pod:

 * 如果 Pod 不能忍受这类污点,Pod 会马上被驱逐。
 * 如果 Pod 能够忍受这类污点,但是在容忍度定义中没有指定 `tolerationSeconds`,
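The bullet list this hunk touches continues with a `tolerationSeconds` case. A toleration using that field, which is only meaningful together with `NoExecute`, looks like the following; the 3600-second value is indicative:

```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600   # Pod stays bound for one hour after the taint appears
```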
@@ -395,6 +395,16 @@ controller can remove the relevant taint(s).
 在节点被驱逐时,节点控制器或者 kubelet 会添加带有 `NoExecute` 效果的相关污点。
 如果异常状态恢复正常,kubelet 或节点控制器能够移除相关的污点。

+
+<!--
+In some cases when the node is unreachable, the API server is unable to communicate
+with the kubelet on the node. The decision to delete the pods cannot be communicated to
+the kubelet until communication with the API server is re-established. In the meantime,
+the pods that are scheduled for deletion may continue to run on the partitioned node.
+-->
+在某些情况下,当节点不可达时,API 服务器无法与节点上的 kubelet 进行通信。
+在与 API 服务器的通信被重新建立之前,删除 Pod 的决定无法传递到 kubelet。
+同时,被调度进行删除的那些 Pod 可能会继续运行在分区后的节点上。

 {{< note >}}
 <!--
 The control plane limits the rate of adding node new taints to nodes. This rate limiting
@@ -518,7 +528,6 @@ tolerations to all daemons, to prevent DaemonSets from breaking.
 * `node.kubernetes.io/unschedulable` (1.10 or later)
 * `node.kubernetes.io/network-unavailable` (*host network only*)
 -->

 DaemonSet 控制器自动为所有守护进程添加如下 `NoSchedule` 容忍度,以防 DaemonSet 崩溃:

 * `node.kubernetes.io/memory-pressure`
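Spelled out as manifest fields, the auto-added tolerations listed in this hunk take roughly the following shape. This is a sketch: the DaemonSet controller injects them into the Pod template, you do not write them by hand:

```yaml
tolerations:
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/network-unavailable   # host-network DaemonSets only
  operator: Exists
  effect: NoSchedule
```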
@@ -531,7 +540,6 @@ DaemonSet 控制器自动为所有守护进程添加如下 `NoSchedule` 容忍
 Adding these tolerations ensures backward compatibility. You can also add
 arbitrary tolerations to DaemonSets.
 -->

 添加上述容忍度确保了向后兼容,你也可以选择自由向 DaemonSet 添加容忍度。

 ## {{% heading "whatsnext" %}}
