Merge pull request #37121 from windsonsea/schgpu

[zh] Sync scheduling-gpus.md
Kubernetes Prow Robot 2022-10-07 00:15:53 -07:00 committed by GitHub
commit 0088ea5da0

{{< feature-state state="beta" for_k8s_version="v1.10" >}}
Kubernetes includes **experimental** support for managing GPUs
(graphical processing units) across several nodes.

This page describes how users can consume GPUs, and outlines
some of the limitations in the implementation.
<!-- body -->
## Using device plugins

Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
to let Pods access specialized hardware features such as GPUs.
{{% thirdparty-content %}}
As an administrator, you have to install GPU drivers from the corresponding
hardware vendor on the nodes and run the corresponding device plugin from the
GPU vendor:

* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
* [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)
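Once a plugin is running, a quick way to confirm that a node actually advertises the GPU resource is to query its allocatable capacity. A minimal sketch, assuming the NVIDIA plugin and a node named `node1` (both placeholders):

```shell
# Print the number of allocatable GPUs on node1 (empty output means the
# plugin has not registered the resource on that node yet).
kubectl get node node1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```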
Once you have installed the plugin, your cluster exposes a custom schedulable
resource such as `amd.com/gpu` or `nvidia.com/gpu`.

You can consume these GPUs from your containers by requesting
the custom GPU resource, the same way you request `cpu` or `memory`.
However, there are some limitations in how you specify the resource
requirements for custom devices.
GPUs are only supposed to be specified in the `limits` section, which means:

* You can specify GPU `limits` without specifying `requests`, because
  Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests`, but these two values
  must be equal (see the sketch after this list).
* You cannot specify GPU `requests` without specifying `limits`.
Here's an example manifest for a Pod that requests a GPU:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```
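To try the manifest out, save it to a file and submit it, then check which node the Pod was scheduled onto; the file name here is just an assumption:

```shell
# Create the Pod and see where the scheduler placed it.
kubectl apply -f example-vector-add.yaml   # assumed file name
kubectl get pod example-vector-add -o wide
```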
### Deploying AMD GPU device plugin

The [official AMD GPU device plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
has the following requirements:
- Kubernetes nodes have to be pre-installed with the AMD GPU Linux driver.

To deploy the AMD device plugin once your cluster is running and the above
requirements are satisfied:
```shell
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
```
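After the DaemonSet is deployed, a simple sanity check is to confirm that a GPU node now advertises the `amd.com/gpu` resource; the node name is a placeholder:

```shell
# The Capacity/Allocatable sections should list amd.com/gpu on GPU nodes.
kubectl describe node <gpu-node> | grep 'amd.com/gpu'
```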
You can report issues with this third-party device plugin by logging an issue in
[RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin).
### Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:
#### Official NVIDIA GPU device plugin

The [official NVIDIA GPU device plugin](https://github.com/NVIDIA/k8s-device-plugin)
has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker).
- Kubelet must use Docker as its container runtime.
- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
  for Docker, instead of runc.
- The version of the NVIDIA drivers must satisfy the constraint ~= 384.81.

To deploy the NVIDIA device plugin once your cluster is running and the above
requirements are satisfied:
```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
You can report issues with this third-party device plugin by logging an issue in
[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).
#### NVIDIA GPU device plugin used by GCE

The [NVIDIA GPU device plugin used by GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any container runtime
that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested
on [Container-Optimized OS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.
You can use the following commands to install the NVIDIA drivers and device plugin:
```shell
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml

# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```
You can report issues with using or deploying this third-party device plugin by logging an issue in
[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).

Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE.
## Clusters containing different types of GPUs

If different nodes in your cluster have different types of NVIDIA GPUs, then you
can use [node labels and node selectors](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)
to schedule Pods to appropriate nodes.

For example:
```shell
# Label your nodes with the accelerator type they have.
kubectl label nodes node1 accelerator=example-gpu-x100
kubectl label nodes node2 accelerator=other-gpu-k915
```
That label key `accelerator` is just an example; you can use
a different label key if you prefer.
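As a sketch of how such a label drives scheduling (the node affinity example at the end of this page is the more general form), a Pod can pin itself to one accelerator type with a `nodeSelector`; the label value reuses the example above, and the Pod name and image are placeholders:

```shell
# Validate a Pod that can only schedule onto nodes labeled
# accelerator=example-gpu-x100.
kubectl apply --dry-run=client -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-on-x100   # hypothetical name
spec:
  restartPolicy: OnFailure
  nodeSelector:
    accelerator: example-gpu-x100
  containers:
    - name: main
      image: registry.example/gpu-app:v1   # placeholder image
EOF
```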
## Automatic node labelling {#node-labeller}

If you're using AMD GPUs, you can deploy the
[Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller),
a {{< glossary_tooltip text="controller" term_id="controller" >}} that
automatically labels your nodes with GPU device properties.

At the moment, that controller can add labels for:

* Device ID (-device-id)
* VRAM Size (-vram)
* Number of SIMD (-simd-count)
* Number of Compute Unit (-cu-count)
* Firmware and Feature Versions (-firmware)
* GPU Family, in two letters acronym (-family)
  * SI - Southern Islands
  * CI - Sea Islands
  * KV - Kaveri
  * VI - Volcanic Islands
  * CZ - Carrizo
  * AI - Arctic Islands
  * RV - Raven

Example result:
```shell
kubectl describe node cluster-node-23
```
```
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        node.alpha.kubernetes.io/ttl: 0
```
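Those labels can then be used in ordinary label queries as well as in scheduling rules; for example, to list every node that the labeller marked as carrying an Arctic Islands family GPU:

```shell
# List nodes labeled with an Arctic Islands (AI) family GPU.
kubectl get nodes -l beta.amd.com/gpu.family.AI=1
```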
With the Node Labeller in use, you can specify the GPU type in the Pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: beta.amd.com/gpu.family.AI # Arctic Islands GPU family
                operator: Exists
```
This ensures that the Pod will be scheduled to a node that has the GPU type
you specified.
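If no node satisfies the affinity rule, the Pod stays Pending; the Events section of its description shows which scheduling requirement failed to match:

```shell
# Inspect scheduling events for the example Pod above.
kubectl describe pod cuda-vector-add
```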