Omit vendor plugin installation docs
For the GPU scheduling task, omit the docs about installing vendor-specific device plugins. Instead, link to those pages.

Branch: pull/36985/head
Commit: 871ee860ee (parent 4d701622f0)

Configure and schedule GPUs for use as a resource by nodes in a cluster.

{{< feature-state state="beta" for_k8s_version="v1.10" >}}

Kubernetes includes **experimental** support for managing GPUs
(graphical processing units) across several nodes.

This page describes how users can consume GPUs, and outlines
some of the limitations in the implementation.

<!-- body -->

## Using device plugins

Kubernetes implements {{< glossary_tooltip text="device plugins" term_id="device-plugin" >}}
to let Pods access specialized hardware features such as GPUs.

As an administrator, you have to install GPU drivers from the corresponding
hardware vendor on the nodes and run the corresponding device plugin from the
GPU vendor. Here are some links to vendors' instructions:

* [AMD](https://github.com/RadeonOpenCompute/k8s-device-plugin#deployment)
* [NVIDIA](https://github.com/NVIDIA/k8s-device-plugin#quick-start)
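
Whichever vendor you choose, the device plugin normally runs as a DaemonSet. As a rough sanity check (a sketch only; the namespace and Pod names depend on the plugin and on how you deployed it), you can look for its Pods:

```shell
# Device plugins are usually deployed as a DaemonSet; list their Pods.
# The name to filter on depends on the vendor's manifest.
kubectl get pods --all-namespaces | grep -i device-plugin
```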
Once you have installed the plugin, your cluster exposes a custom schedulable resource such as `amd.com/gpu` or `nvidia.com/gpu`.
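
To confirm that the resource is actually being advertised, you can inspect what each node reports as allocatable. This sketch assumes the NVIDIA resource name; substitute `amd.com/gpu` or your vendor's resource name as needed:

```shell
# Show each node's allocatable count of the nvidia.com/gpu resource
# (replace with amd.com/gpu or another vendor resource name).
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```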

You can consume these GPUs from your containers by requesting
the custom GPU resource, the same way you request `cpu` or `memory`.
However, there are some limitations in how you specify the resource
requirements for custom devices.

GPUs are only supposed to be specified in the `limits` section, which means:

* You can specify GPU `limits` without specifying `requests`, because
  Kubernetes will use the limit as the request value by default.
* You can specify GPU in both `limits` and `requests` but these two values
  must be equal.
* You cannot specify GPU `requests` without specifying `limits`.

Containers (and Pods) do not share GPUs, and there is no overcommitting of GPUs.
Each container can request one or more GPUs; it is not possible to request a
fraction of a GPU.

Here's an example manifest for a Pod that requests a GPU:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: example-vector-add
      image: "registry.example/example-vector-add:v42"
      resources:
        limits:
          gpu-vendor.example/example-gpu: 1 # requesting 1 GPU
```
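
To try the manifest, save it to a file and create the Pod; `gpu-pod.yaml` is just a placeholder filename, and the image and resource names above are generic examples rather than a published image:

```shell
# Create the Pod from the manifest above (saved locally as gpu-pod.yaml).
kubectl apply -f gpu-pod.yaml

# The Pod stays Pending until a node can satisfy the GPU limit;
# -o wide shows the node it was scheduled to.
kubectl get pod example-vector-add -o wide
```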

### Deploying AMD GPU device plugin

The [official AMD GPU device plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin)
has the following requirements:

- Kubernetes nodes have to be pre-installed with the AMD GPU Linux driver.

To deploy the AMD device plugin once your cluster is running and the above
requirements are satisfied:

```shell
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
```
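
Once the plugin's DaemonSet is running, nodes with AMD GPUs should start advertising the resource. A quick, non-authoritative check:

```shell
# amd.com/gpu should now appear under Capacity and Allocatable
# in the description of every node that has an AMD GPU.
kubectl describe nodes | grep amd.com/gpu
```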

You can report issues with this third-party device plugin by logging an issue in
[RadeonOpenCompute/k8s-device-plugin](https://github.com/RadeonOpenCompute/k8s-device-plugin).

### Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:

#### Official NVIDIA GPU device plugin

The [official NVIDIA GPU device plugin](https://github.com/NVIDIA/k8s-device-plugin)
has the following requirements:

- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with [nvidia-docker 2.0](https://github.com/NVIDIA/nvidia-docker).
- Kubelet must use Docker as its container runtime.
- `nvidia-container-runtime` must be configured as the [default runtime](https://github.com/NVIDIA/k8s-device-plugin#preparing-your-gpu-nodes)
  for Docker, instead of runc.
- The version of the NVIDIA drivers must match the constraint ~= 384.81.

To deploy the NVIDIA device plugin once your cluster is running and the above
requirements are satisfied:

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```
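
You can then verify that the plugin is running and that the GPUs are advertised. The DaemonSet name below is taken from the 1.0.0-beta4 manifest referenced above and may differ in other releases:

```shell
# Check the device plugin DaemonSet (name as used in the 1.0.0-beta4 manifest).
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system

# GPU nodes should now report nvidia.com/gpu in their capacity.
kubectl describe nodes | grep nvidia.com/gpu
```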

You can report issues with this third-party device plugin by logging an issue in
[NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin).

#### NVIDIA GPU device plugin used by GCE

The [NVIDIA GPU device plugin used by GCE](https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu)
doesn't require using nvidia-docker and should work with any container runtime
that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested
on [Container-Optimized OS](https://cloud.google.com/container-optimized-os/)
and has experimental code for Ubuntu from 1.9 onwards.

You can use the following commands to install the NVIDIA drivers and device plugin:

```shell
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml

# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.14/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
```
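
As with the other plugins, you can confirm that the driver installer and device plugin DaemonSets came up. The exact names come from the manifests above, so listing and filtering is the simplest check:

```shell
# Both the driver installer and the device plugin run as DaemonSets in kube-system.
kubectl get daemonsets -n kube-system | grep -i nvidia
```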

You can report issues with using or deploying this third-party device plugin by logging an issue in
[GoogleCloudPlatform/container-engine-accelerators](https://github.com/GoogleCloudPlatform/container-engine-accelerators).

Google publishes its own [instructions](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus) for using NVIDIA GPUs on GKE.

## Clusters containing different types of GPUs

If different nodes in your cluster have different types of GPUs, then you
can use node labels and node selectors to schedule Pods onto the appropriate nodes.

For example:

```shell
# Label your nodes with the accelerator type they have.
kubectl label nodes node1 accelerator=example-gpu-x100
kubectl label nodes node2 accelerator=other-gpu-k915
```
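
With the labels in place, you can check which nodes match a given accelerator type before relying on it for scheduling; the label value here is one of the example values used above:

```shell
# List the nodes that carry a particular accelerator label.
kubectl get nodes -l accelerator=example-gpu-x100
```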

That label key `accelerator` is just an example; you can use
a different label key if you prefer.

## Automatic node labelling {#node-labeller}

If you're using AMD GPU devices, you can deploy the Node Labeller, a controller
that automatically labels your nodes with the GPU device properties it detects.
Once it has run, you can inspect a node to see the labels it applied:

```shell
kubectl describe node cluster-node-23
```

```
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    kubernetes.io/arch=amd64
                    kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        node.alpha.kubernetes.io/ttl: 0
…
```
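
Because these are ordinary node labels, you can use them directly with label selectors. For example, using one of the labels shown in the output above:

```shell
# List nodes whose AMD GPU is in the "AI" family (one of the labels above).
kubectl get nodes -l beta.amd.com/gpu.family.AI=1
```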

With the Node Labeller in use, you can specify the GPU type in the Pod spec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: beta.amd.com/gpu.family.RV # Raven GPU family
                operator: Exists
```

This ensures that the Pod will be scheduled to a node that has the GPU type
you specified.
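
To see the effect, you could create the Pod and check its placement; the Pod name matches the manifest above, and `-o wide` shows the node it landed on:

```shell
# After creating the Pod above, check which node it was scheduled to.
kubectl get pod cuda-vector-add -o wide
```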