fix some layout

parent 104f34049d
commit b174bfed02
@@ -1,6 +1,6 @@
 ---
 content_type: concept
-title: 调度 GPUs
+title: 调度 GPU
 description: 配置和调度 GPU 成一类资源以供集群中节点使用。
 ---
 <!--
@@ -76,7 +76,7 @@ when using GPUs:
 - Each container can request one or more GPUs. It is not possible to request a
   fraction of a GPU.
 -->
-- GPUs 只能设置在 `limits` 部分,这意味着:
+- GPU 只能设置在 `limits` 部分,这意味着:
   * 你可以指定 GPU 的 `limits` 而不指定其 `requests`,Kubernetes 将使用限制
     值作为默认的请求值;
   * 你可以同时指定 `limits` 和 `requests`,不过这两个值必须相等。
@@ -87,6 +87,8 @@ when using GPUs:
 <!--
 Here's an example:
 -->
+这里是一个例子:
+
 ```yaml
 apiVersion: v1
 kind: Pod
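
The Pod manifest that this hunk truncates continues beyond `kind: Pod`. For orientation, a minimal sketch of such a manifest is shown below; the Pod name, container name, and image are placeholders and are not taken from the diff, and `nvidia.com/gpu` is just one example of a device-plugin resource name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-pod            # placeholder name, not from the diff
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-container
      image: registry.example.com/cuda-sample:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # GPUs go under limits; whole GPUs only, no fractions
```
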
@@ -111,27 +113,20 @@ has the following requirements:
 -->
 ### 部署 AMD GPU 设备插件 {#deploying-amd-gpu-device-plugin}
 
-[官方的 AMD GPU 设备插件](https://github.com/RadeonOpenCompute/k8s-device-plugin) 有以下要求:
+[官方的 AMD GPU 设备插件](https://github.com/RadeonOpenCompute/k8s-device-plugin)有以下要求:
 
 <!--
 - Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.
 
 To deploy the AMD device plugin once your cluster is running and the above
 requirements are satisfied:
-```
-# For Kubernetes v1.9
-kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.9/k8s-ds-amdgpu-dp.yaml
-
-# For Kubernetes v1.10
-kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
-```
 -->
 - Kubernetes 节点必须预先安装 AMD GPU 的 Linux 驱动。
 
 如果你的集群已经启动并且满足上述要求的话,可以这样部署 AMD 设备插件:
 
 ```shell
-kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
+kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
 ```
 
 <!--
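
Once the AMD device plugin is running, workloads consume the GPUs through the extended resource it registers. The sketch below is illustrative only (Pod name, container name, and image are placeholders; `amd.com/gpu` is the resource name the AMD plugin advertises):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-example            # placeholder name
spec:
  containers:
    - name: worker
      image: registry.example.com/rocm-workload:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1           # extended resource registered by the AMD device plugin
```
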
@@ -148,7 +143,7 @@ There are currently two device plugin implementations for NVIDIA GPUs:
 -->
 ### 部署 NVIDIA GPU 设备插件 {#deploying-nvidia-gpu-device-plugin}
 
-对于 NVIDIA GPUs,目前存在两种设备插件的实现:
+对于 NVIDIA GPU,目前存在两种设备插件的实现:
 
 <!--
 #### Official NVIDIA GPU device plugin
@@ -163,24 +158,32 @@ has the following requirements:
 <!--
 - Kubernetes nodes have to be pre-installed with NVIDIA drivers.
 - Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- nvidia-container-runtime must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
-  for docker instead of runc.
-- NVIDIA drivers ~= 361.93
+- Kubelet must use Docker as its container runtime
+- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
+  for Docker, instead of runc.
+- The version of the NVIDIA drivers must match the constraint ~= 384.81.
 
 To deploy the NVIDIA device plugin once your cluster is running and the above
 requirements are satisfied:
 -->
 - Kubernetes 的节点必须预先安装了 NVIDIA 驱动
 - Kubernetes 的节点必须预先安装 [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
-- Docker 的[默认运行时](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)必须设置为 nvidia-container-runtime,而不是 runc
-- NVIDIA 驱动版本 ~= 384.81
+- Kubelet 的容器运行时必须使用 Docker
+- Docker 的[默认运行时](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)必须设置为
+  `nvidia-container-runtime`,而不是 `runc`。
+- NVIDIA 驱动程序的版本必须匹配 ~= 361.93
 
 如果你的集群已经启动并且满足上述要求的话,可以这样部署 NVIDIA 设备插件:
 
 ```shell
 kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
 ```
-请到 [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)项目报告有关此设备插件的问题。
+
+<!--
+You can report issues with this third-party device plugin by logging an issue in
+[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).
+-->
+你可以通过在 [NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin) 中记录问题来报告此第三方设备插件的问题。
 
 <!--
 #### NVIDIA GPU device plugin used by GCE
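
After a device plugin is deployed on a node that meets these requirements, the node advertises the GPU as an extended resource in its `.status`. Roughly what the relevant fragment of `kubectl get node <node-name> -o yaml` might show (the node values here are made up for illustration):

```yaml
# Illustrative fragment of a Node object once the device plugin has registered GPUs
status:
  capacity:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: "2"      # reported by the NVIDIA device plugin
  allocatable:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: "2"
```
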
@@ -195,6 +198,9 @@ and has experimental code for Ubuntu from 1.9 onwards.
 
 [GCE 使用的 NVIDIA GPU 设备插件](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu) 并不要求使用 nvidia-docker,并且对于任何实现了 Kubernetes CRI 的容器运行时,都应该能够使用。这一实现已经在 [Container-Optimized OS](https://cloud.google.com/container-optimized-os/) 上进行了测试,并且在 1.9 版本之后会有对于 Ubuntu 的实验性代码。
 
+<!--
+You can use the following commands to install the NVIDIA drivers and device plugin:
+-->
 你可以使用下面的命令来安装 NVIDIA 驱动以及设备插件:
 
 ```
@@ -209,13 +215,15 @@ kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/releas
 ```
 
 <!--
-Report issues with this device plugin and installation method to [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).
+You can report issues with using or deploying this third-party device plugin by logging an issue in
+[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).
 
 Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE .
 -->
-请到 [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators) 报告有关此设备插件以及安装方法的问题。
+你可以通过在 [GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators)
+中记录问题来报告使用或部署此第三方设备插件的问题。
 
-关于如何在 GKE 上使用 NVIDIA GPUs,Google 也提供自己的[指令](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus)。
+关于如何在 GKE 上使用 NVIDIA GPU,Google 也提供自己的[指令](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus)。
 
 <!--
 ## Clusters containing different types of GPUs
@@ -249,14 +257,14 @@ kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
 If you're using AMD GPU devices, you can deploy
 [Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller).
 Node Labeller is a {{< glossary_tooltip text="controller" term_id="controller" >}} that automatically
-labels your nodes with GPU properties.
+labels your nodes with GPU device properties.
 
 At the moment, that controller can add labels for:
 -->
-如果你在使用 AMD GPUs,你可以部署
+如果你在使用 AMD GPU,你可以部署
 [Node Labeller](https://github.com/RadeonOpenCompute/k8s-device-plugin/tree/master/cmd/k8s-node-labeller),
 它是一个 {{< glossary_tooltip text="控制器" term_id="controller" >}},
-会自动给节点打上 GPU 属性标签。目前支持的属性:
+会自动给节点打上 GPU 设备属性标签。目前支持的属性:
 
 <!--
 * Device ID (-device-id)
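
The context line in the hunk header above shows nodes being labelled with an `accelerator` key (`kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100`). A Pod can then be steered onto a specific GPU model with a `nodeSelector`; a minimal sketch, with placeholder names and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: p100-example               # placeholder name
spec:
  nodeSelector:
    accelerator: nvidia-tesla-p100 # matches the label applied to the node
  containers:
    - name: trainer
      image: registry.example.com/training-job:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```
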
@@ -307,7 +315,7 @@ kubectl describe node cluster-node-23
 kubernetes.io/hostname=cluster-node-23
 Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
 node.alpha.kubernetes.io/ttl: 0
-......
+…
 ```
 
 <!--
|
|
Loading…
Reference in New Issue