Omit vendor plugin installation docs

For the GPU scheduling task, omit the docs about installing
vendor-specific device plugins. Instead, link to those pages.
pull/36985/head
Tim Bannister 2022-09-23 18:57:41 +01:00
parent 4d701622f0
commit 871ee860ee
1 changed files with 47 additions and 118 deletions

@@ -10,133 +10,57 @@ description: Configure and schedule GPUs for use as a resource by nodes in a clu
{{< feature-state state="beta" for_k8s_version="v1.10" >}}
Kubernetes includes **experimental** support for managing AMD and NVIDIA GPUs
Kubernetes includes **experimental** support for managing GPUs
(graphics processing units) across several nodes.
This page describes how users can consume GPUs across different Kubernetes versions
and the current limitations.
This page describes how users can consume GPUs, and outlines
some of the limitations in the implementation.
<!-- body -->
## Using device plugins
Kubernetes implements {{< glossary_tooltip text="Device Plugins" term_id="device-plugin" >}}
Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
to let Pods access specialized hardware features such as GPUs.
As an administrator, you have to install GPU drivers from the corresponding
hardware vendor on the nodes and run the corresponding device plugin from the
GPU vendor:
GPU vendor. Here are some links to vendors' instructions:
* [AMD](#deploying-amd-gpu-device-plugin)
* [NVIDIA](#deploying-nvidia-gpu-device-plugin)
* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)
When the above conditions are true, Kubernetes will expose `amd.com/gpu` or
`nvidia.com/gpu` as a schedulable resource.
Once you have installed the plugin, your cluster exposes a custom schedulable resource such as `amd.com/gpu` or `nvidia.com/gpu`.
You can consume these GPUs from your containers by requesting
`<vendor>.com/gpu` the same way you request `cpu` or `memory`.
However, there are some limitations in how you specify the resource requirements
when using GPUs:
the custom GPU resource, the same way you request `cpu` or `memory`.
However, there are some limitations in how you specify the resource
requirements for custom devices.
- GPUs are only supposed to be specified in the `limits` section, which means:
* You can specify GPU `limits` without specifying `requests` because
Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
must be equal.
* You cannot specify GPU `requests` without specifying `limits`.
- Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
- Each container can request one or more GPUs. It is not possible to request a
fraction of a GPU.
GPUs are only supposed to be specified in the `limits` section, which means:
* You can specify GPU `limits` without specifying `requests`, because
Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
must be equal.
* You cannot specify GPU `requests` without specifying `limits`.
Here's an example:
Here's an example manifest for a Pod that requests a GPU:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
          gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```
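The rule that GPU `requests`, when given, must equal `limits` can be illustrated with a second manifest. This is a minimal sketch, not part of the change itself: the Pod name is hypothetical, and the image and the `gpu-vendor.example/example-gpu` resource name are the same placeholders used in the example above rather than a real vendor's values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add-explicit  # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"  # placeholder image
      resources:
        requests:
          gpu-vendor.example/example-gpu: 1  # optional; if present, must equal the limit
        limits:
          gpu-vendor.example/example-gpu: 1  # required whenever a GPU is requested
```

Leaving out the `requests` stanza should behave the same way, because the limit is used as the request value by default.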
### Deploying AMD GPU device plugin
The [official AMD GPU device plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
has the following requirements:
- Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.
To deploy the AMD device plugin once your cluster is running and the above
requirements are satisfied:
```shell
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
```
You can report issues with this third-party device plugin by logging an issue in
[RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin).
### Deploying NVIDIA GPU device plugin
There are currently two device plugin implementations for NVIDIA GPUs:
#### Official NVIDIA GPU device plugin
The [official NVIDIA GPU device plugin](https://github.com/NVIDIA/k8s-device-plugin)
has the following requirements:
- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker)
- Kubelet must use Docker as its container runtime
- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
for Docker, instead of runc.
- The version of the NVIDIA drivers must match the constraint ~= 384.81.
To deploy the NVIDIA device plugin once your cluster is running and the above
requirements are satisfied:
```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
You can report issues with this third-party device plugin by logging an issue in
[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).
#### NVIDIA GPU device plugin used by GCE
The [NVIDIA GPU device plugin used by GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any container runtime
that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested
on [Container-Optimized OS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.
You can use the following commands to install the NVIDIA drivers and device plugin:
```shell
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.14/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```
You can report issues with using or deploying this third-party device plugin by logging an issue in
[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).
Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE.
## Clusters containing different types of GPUs
If different nodes in your cluster have different types of GPUs, then you
@@ -147,10 +71,13 @@ For example:
```shell
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
kubectl label nodes node1 accelerator=example-gpu-x100
kubectl label nodes node2 accelerator=other-gpu-k915
```
That label key `accelerator` is just an example; you can use
a different label key if you prefer.
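A Pod can then be steered to one of the labelled nodes with a `nodeSelector`. The following is a sketch under stated assumptions: it reuses the example label `accelerator=example-gpu-x100` from the commands above, and the Pod name, image, and resource name are placeholders, not values from a real vendor.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-x100-job  # hypothetical name
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"  # placeholder image
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1  # placeholder resource name
  nodeSelector:
    accelerator: example-gpu-x100  # must match the label applied with kubectl label
```

If you chose a different label key, use that key in `nodeSelector` instead of `accelerator`.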
## Automatic node labelling {#node-labeller}
If you're using AMD GPU devices, you can deploy
@@ -179,19 +106,18 @@ kubectl describe node cluster-node-23
```
```
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        node.alpha.kubernetes.io/ttl: 0
```
With the Node Labeller in use, you can specify the GPU type in the Pod spec:
@@ -210,11 +136,14 @@ spec:
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: beta.amd.com/gpu.family.RV # Raven GPU family
                operator: Exists
```
This will ensure that the Pod will be scheduled to a node that has the GPU type
This ensures that the Pod will be scheduled to a node that has the GPU type
you specified.