Merge pull request #32671 from zaunist/docs/concepts

[zh] Resync device-plugins
pull/32698/head
Kubernetes Prow Robot 2022-03-31 17:40:38 -07:00 committed by GitHub
commit e94ce37cc2
1 changed file with 82 additions and 11 deletions


@@ -92,12 +92,14 @@ specification as they request other types of resources, with the following limit
* Extended resources are only supported as integer resources and cannot be overcommitted.
* Devices cannot be shared between containers.
### Example {#example-pod}
<!--
Suppose a Kubernetes cluster is running a device plugin that advertises resource `hardware-vendor.example/foo`
on certain nodes. Here is an example of a pod requesting this resource to run a demo workload:
-->
Suppose a Kubernetes cluster is running a device plugin that advertises resource `hardware-vendor.example/foo` on certain nodes.
Here is an example of a Pod requesting this resource to run a demo workload:
```yaml
---
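# Hedged sketch of the remainder of this manifest: the diff truncates the
# original Pod example here, so the Pod name, container name, and image below
# are illustrative assumptions; only the resource name comes from the text above.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: registry.k8s.io/pause:2.0
      resources:
        limits:
          # Request two of the advertised devices; extended resources only
          # accept integer amounts and cannot be overcommitted.
          hardware-vendor.example/foo: 2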
@@ -140,8 +142,12 @@ The general workflow of a device plugin includes the following steps:
a gRPC service that implements the following interface:
<!--
```gRPC
service DevicePlugin {
// GetDevicePluginOptions returns options to be communicated with Device Manager.
rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
// ListAndWatch returns a stream of List of Devices
// Whenever a Device state change or a Device disappears, ListAndWatch
// returns the new list
@@ -168,6 +174,9 @@ The general workflow of a device plugin includes the following steps:
-->
```gRPC
service DevicePlugin {
// GetDevicePluginOptions returns options to be communicated with the Device Manager.
rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
// ListAndWatch returns a stream of lists of Devices.
// Whenever a Device state changes or a Device disappears, ListAndWatch
// returns the new list.
@@ -331,6 +340,8 @@ service PodResourcesLister {
}
```
### `List` gRPC endpoint {#grpc-endpoint-list}
<!--
The `List` endpoint provides information on resources of running pods, with details such as the
id of exclusively allocated CPUs, device id as it was reported by device plugins and id of
@@ -387,6 +398,51 @@ message ContainerDevices {
}
```
<!--
{{< note >}}
cpu_ids in the `ContainerResources` in the `List` endpoint correspond to exclusive CPUs allocated
to a particular container. If the goal is to evaluate CPUs that belong to the shared pool, the `List`
endpoint needs to be used in conjunction with the `GetAllocatableResources` endpoint as explained
below:
1. Call `GetAllocatableResources` to get a list of all the allocatable CPUs
2. Call `GetCpuIds` on all `ContainerResources` in the system
3. Subtract out all of the CPUs from the `GetCpuIds` calls from the `GetAllocatableResources` call
{{< /note >}}
-->
{{< note >}}
The cpu_ids in the `ContainerResources` in the `List` endpoint correspond to exclusive CPUs allocated
to a particular container. If the goal is to evaluate CPUs that belong to the shared pool, the `List`
endpoint needs to be used in conjunction with the `GetAllocatableResources` endpoint, as explained below:
1. Call `GetAllocatableResources` to get a list of all the allocatable CPUs.
2. Call `GetCpuIds` on all `ContainerResources` in the system.
3. Subtract all of the CPUs returned by the `GetCpuIds` calls from the CPUs returned by the `GetAllocatableResources` call.
{{< /note >}}
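The three steps in the note above can be sketched as a small gRPC client. This is a minimal,
non-authoritative sketch: it assumes the kubelet pod-resources socket lives at
`/var/lib/kubelet/pod-resources/kubelet.sock` and that the generated Go client in
`k8s.io/kubelet/pkg/apis/podresources/v1` is used; neither of these, nor the generated getters
such as `GetCpuIds()`, is stated on this page.

```go
// Minimal sketch (not from this page): estimate the shared CPU pool by
// removing every exclusively allocated CPU reported by List() from the
// allocatable CPUs reported by GetAllocatableResources().
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Assumed default location of the kubelet pod-resources socket.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Step 1: all allocatable CPUs on the node.
	alloc, err := client.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
	if err != nil {
		panic(err)
	}
	shared := make(map[int64]bool, len(alloc.GetCpuIds()))
	for _, id := range alloc.GetCpuIds() {
		shared[id] = true
	}

	// Steps 2 and 3: collect the exclusive CPU ids of every container and
	// subtract them from the allocatable set.
	list, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	for _, pod := range list.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, id := range container.GetCpuIds() {
				delete(shared, id)
			}
		}
	}

	fmt.Println("CPUs left in the shared pool:", len(shared))
}
```

On a node with no exclusively pinned containers, the result should match the CPU count returned by
`GetAllocatableResources` itself.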
### `GetAllocatableResources` gRPC endpoint {#grpc-endpoint-getallocatableresources}
{{< feature-state state="beta" for_k8s_version="v1.23" >}}
<!--
{{< note >}}
`GetAllocatableResources` should only be used to evaluate [allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
resources on a node. If the goal is to evaluate free/unallocated resources it should be used in
conjunction with the List() endpoint. The result obtained by `GetAllocatableResources` would remain
the same unless the underlying resources exposed to kubelet change. This happens rarely but when
it does (for example: hotplug/hotunplug, device health changes), the client is expected to call the
`GetAllocatableResources` endpoint.
However, calling the `GetAllocatableResources` endpoint is not sufficient in the case of a CPU and/or
memory update; the kubelet needs to be restarted to reflect the correct resource capacity and allocatable resources.
{{< /note >}}
-->
{{< note >}}
`GetAllocatableResources` should only be used to evaluate [allocatable](/zh/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable)
resources on a node. If the goal is to evaluate free/unallocated resources, it should be used in
conjunction with the List() endpoint. The result obtained by `GetAllocatableResources` remains the same
unless the underlying resources exposed to the kubelet change. This happens rarely, but when it does
(for example: hotplug/hotunplug, device health changes), the client is expected to call the `GetAllocatableResources` endpoint.
However, calling the `GetAllocatableResources` endpoint is not sufficient in the case of a CPU and/or
memory update; the kubelet needs to be restarted to reflect the correct resource capacity and allocatable resources.
{{< /note >}}
<!--
GetAllocatableResources provides information on resources initially available on the worker node.
It provides more information than kubelet exports to APIServer.
@@ -394,7 +450,6 @@ It provides more information than kubelet exports to APIServer.
The `GetAllocatableResources` endpoint provides information on the resources initially available on the worker node.
It provides more information than what the kubelet exports to the API server.
```gRPC
// AllocatableResourcesResponse contains information about all the devices known to the kubelet
message AllocatableResourcesResponse {
@@ -405,6 +460,23 @@ message AllocatableResourcesResponse {
```
<!--
Starting from Kubernetes v1.23, the `GetAllocatableResources` is enabled by default.
You can disable it by turning off the
`KubeletPodResourcesGetAllocatable` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/).
Before Kubernetes v1.23, to enable this feature `kubelet` must be started with the following flag:
`--feature-gates=KubeletPodResourcesGetAllocatable=true`
-->
Starting from Kubernetes v1.23, `GetAllocatableResources` is enabled by default.
You can disable it by turning off the `KubeletPodResourcesGetAllocatable`
[feature gate](/zh/docs/reference/command-line-tools-reference/feature-gates/).
Before Kubernetes v1.23, to enable this feature `kubelet` must be started with the following flag:
`--feature-gates=KubeletPodResourcesGetAllocatable=true`
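As a quick illustration of what the endpoint returns once enabled, here is another short sketch under
the same assumptions as above (the socket path, the `k8s.io/kubelet/pkg/apis/podresources/v1` package,
and its generated getters are not stated on this page). It prints each group of allocatable devices
together with its NUMA affinity, which is discussed right below.

```go
// Minimal sketch (not from this page): list the allocatable devices reported
// by GetAllocatableResources, grouped by resource name and NUMA affinity.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
	if err != nil {
		panic(err)
	}

	// Each ContainerDevices entry groups the device IDs of one resource that
	// share the same NUMA affinity.
	for _, dev := range resp.GetDevices() {
		var numaNodes []int64
		if topo := dev.GetTopology(); topo != nil {
			for _, node := range topo.GetNodes() {
				numaNodes = append(numaNodes, node.GetID())
			}
		}
		fmt.Printf("%s: devices=%v numa=%v\n", dev.GetResourceName(), dev.GetDeviceIds(), numaNodes)
	}
	fmt.Println("allocatable CPUs:", resp.GetCpuIds())
}
```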
<!--
`ContainerDevices` do expose the topology information declaring to which NUMA cells the device is affine.
The NUMA cells are identified using an opaque integer ID, whose value is consistent with what device
@@ -515,6 +587,7 @@ Here are some examples of device plugin implementations:
* Requires [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) 2.0, which allows you to run GPU-enabled Docker containers.
* [NVIDIA GPU device plugin for Container-Optimized OS](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
* [RDMA device plugin](https://github.com/hustcat/k8s-rdma-device-plugin)
* [SocketCAN device plugin](https://github.com/collabora/k8s-socketcan)
* [Solarflare device plugin](https://github.com/vikaschoudhary16/sfc-device-plugin)
* [SR-IOV Network device plugin](https://github.com/intel/sriov-network-device-plugin)
* [Xilinx FPGA device plugin](https://github.com/Xilinx/FPGA_as_a_Service/tree/master/k8s-fpga-device-plugin)
@@ -531,5 +604,3 @@ Here are some examples of device plugin implementations:
* Learn how to [advertise extended resources on a node](/zh/docs/tasks/administer-cluster/extended-resource-node/)
* Read about using [hardware acceleration for TLS ingress](https://kubernetes.io/blog/2019/04/24/hardware-accelerated-ssl/tls-termination-in-ingress-controllers-using-kubernetes-device-plugins-and-runtimeclass/) in Kubernetes
* Learn about the [Topology Manager](/zh/docs/tasks/administer-cluster/topology-manager/)