{{< feature-state state="beta" for_k8s_version="v1.10" >}}

Kubernetes includes **experimental** support for managing GPUs
(graphical processing units) across several nodes.

This page describes how users can consume GPUs, and outlines
some of the limitations in the implementation.

<!-- body -->

## Using device plugins {#using-device-plugins}

Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
to let Pods access specialized hardware features such as GPUs.

{{% thirdparty-content %}}

As an administrator, you have to install GPU drivers from the corresponding
hardware vendor on the nodes and run the corresponding device plugin from the
GPU vendor:

* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
* [Intel](https://intel.github.io/intel-device-plugins-for-kubernetes/cmd/gpu_plugin/README.html)
* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)

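Each of those plugins is typically deployed cluster-wide as a DaemonSet with a
single `kubectl` command. As a sketch, the NVIDIA plugin can be deployed like
this (the pinned release in the URL is illustrative; check the project's README
for the current one):

```shell
# Deploy the NVIDIA device plugin as a DaemonSet.
# The version in this URL is illustrative and may be outdated.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
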
Once you have installed the plugin, your cluster exposes a custom schedulable
resource such as `amd.com/gpu` or `nvidia.com/gpu`.

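You can verify that the new resource is registered by inspecting a GPU node's
capacity; the node name `node1` below is a placeholder:

```shell
# The capacity map should now include the GPU resource,
# for example "nvidia.com/gpu": "1" alongside cpu and memory.
kubectl get node node1 -o jsonpath='{.status.capacity}'
```
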
You can consume these GPUs from your containers by requesting
the custom GPU resource, the same way you request `cpu` or `memory`.
However, there are some limitations in how you specify the resource
requirements for custom devices.

GPUs are only supposed to be specified in the `limits` section, which means:

* You can specify GPU `limits` without specifying `requests`, because
  Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
  must be equal.
* You cannot specify GPU `requests` without specifying `limits`.

Here's an example manifest for a Pod that requests a GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```

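Assuming you save that manifest locally (the file name `gpu-pod.yaml` is just an
example), you can submit it and watch the scheduler hold the Pod in `Pending`
until a node with a free GPU is available:

```shell
kubectl apply -f gpu-pod.yaml

# STATUS stays Pending if no node can satisfy the GPU request.
kubectl get pod example-vector-add
```
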
## Clusters containing different types of GPUs {#clusters-containing-different-types-of-gpus}

If different nodes in your cluster have different types of NVIDIA GPUs, then you
can use [node labels and node selectors](/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes/)
to schedule Pods to appropriate nodes.

For example:

```shell
# Label your nodes with the accelerator type they have.
kubectl label nodes node1 accelerator=example-gpu-x100
kubectl label nodes node2 accelerator=other-gpu-k915
```

That label key `accelerator` is just an example; you can use
a different label key if you prefer.

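With those labels in place, a Pod can target a specific GPU type by adding a
`nodeSelector` for the label; a minimal sketch reusing the example names above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1
  nodeSelector:
    accelerator: example-gpu-x100 # only nodes labelled earlier match
```
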
## Automatic node labelling {#node-labeller}

If you're using AMD GPUs, you can deploy
[Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller),
a controller that automatically labels your nodes with GPU properties.

At the moment, that controller can add labels for:

* Device ID (-device-id)
* VRAM Size (-vram)
* Number of SIMD (-simd-count)
* Number of Compute Unit (-cu-count)
* Firmware and Feature Versions (-firmware)
* GPU Family, in two letters acronym (-family)
  * SI - Southern Islands
  * CI - Sea Islands
  * KV - Kaveri
  * VI - Volcanic Islands
  * CZ - Carrizo
  * AI - Arctic Islands
  * RV - Raven

Example result:

```shell
kubectl describe node cluster-node-23
```

```
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    …
```

Specify the GPU type in the pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: beta.amd.com/gpu.family.AI # Arctic Islands GPU family
                operator: Exists
```

This ensures that the Pod will be scheduled to a node that has the GPU type
you specified.

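To confirm where the scheduler placed such a Pod, you can use the wide output
of `kubectl get`, which adds a NODE column (the Pod name below is the one used
in the examples on this page):

```shell
kubectl get pod example-vector-add -o wide
```
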