Move GPU Support to Tasks (#3212)
* Move User Guides/GPU Support to Tasks/Managing GPUs/Scheduling GPUs
* fix typo in tasks.yml
* change discussion to steps
parent
0d40264361
commit
d2ff41b6a5
@@ -101,3 +101,8 @@ toc:
  - title: Managing Cluster Daemons
    section:
    - docs/tasks/manage-daemon/update-daemon-set.md
  - title: Managing GPUs
    section:
    - docs/tasks/manage-gpus/scheduling-gpus.md
@@ -0,0 +1,148 @@
---
assignees:
- vishh
title: Scheduling GPUs
---

{% capture overview %}

Kubernetes includes **experimental** support for managing NVIDIA GPUs spread across nodes.
This page describes how users can consume GPUs and the current limitations.

{% endcapture %}

{% capture prerequisites %}

1. Kubernetes nodes must have Nvidia drivers pre-installed; the kubelet will not detect Nvidia GPUs otherwise. If the kubelet fails to expose Nvidia GPUs as part of Node Capacity, try re-installing the drivers.
2. The **alpha** feature gate `Accelerators` has to be set to true across the system: `--feature-gates="Accelerators=true"`.
3. Nodes must use `docker engine` as the container runtime.
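For example, on installs that read kubelet flags from an environment file, the gate might be enabled like this (a sketch; `/etc/default/kubelet` is one common convention and may differ on your distribution):

```shell
# Append the alpha feature gate to the kubelet's flags.
# /etc/default/kubelet is an assumption; use whatever file your init
# system sources for kubelet options.
source /etc/default/kubelet
KUBELET_OPTS="$KUBELET_OPTS --feature-gates=Accelerators=true"
echo "KUBELET_OPTS=$KUBELET_OPTS" > /etc/default/kubelet
```

Restart the kubelet afterwards so the new flag takes effect.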
The nodes will automatically discover and expose all Nvidia GPUs as a schedulable resource.

{% endcapture %}

{% capture steps %}

## API

Nvidia GPUs can be consumed via container-level resource requirements using the resource name `alpha.kubernetes.io/nvidia-gpu`.

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: gpu-container-1
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2 # requesting 2 GPUs
  - name: gpu-container-2
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 3 # requesting 3 GPUs
```
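Since GPUs are not shared, the scheduler must find a node with at least the sum of all containers' limits; for the manifest above that is 2 + 3 = 5 GPUs. One quick way to total the GPU limits declared in a manifest (a sketch using standard tools; assumes the manifest is saved as `pod.yaml` with integer limits):

```shell
# Sum every nvidia-gpu limit declared in pod.yaml.
grep 'alpha.kubernetes.io/nvidia-gpu:' pod.yaml \
  | awk '{sum += $2} END {print sum}'   # prints 5 for the manifest above
```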
- GPUs can be specified in the `limits` section only.
- Containers (and pods) do not share GPUs.
- Each container can request one or more GPUs.
- It is not possible to request a fraction of a GPU.
- Nodes are expected to be homogeneous, i.e. to run the same GPU hardware.

If your nodes run different GPU models, use Node Labels and Node Selectors to schedule pods onto nodes with the appropriate GPUs. The following illustrates this workflow:

As part of your Node bootstrapping, identify the GPU hardware type on your nodes and expose it as a node label.

```shell
NVIDIA_GPU_NAME=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0)
source /etc/default/kubelet
KUBELET_OPTS="$KUBELET_OPTS --node-labels='alpha.kubernetes.io/nvidia-gpu-name=$NVIDIA_GPU_NAME'"
echo "KUBELET_OPTS=$KUBELET_OPTS" > /etc/default/kubelet
```

Specify the GPU types a pod can use via [Node Affinity](./node-selection) rules.

```yaml
kind: Pod
apiVersion: v1
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "alpha.kubernetes.io/nvidia-gpu-name",
                    "operator": "In",
                    "values": ["Tesla K80", "Tesla P100"]
                  }
                ]
              }
            ]
          }
        }
      }
spec:
  containers:
  - name: gpu-container-1
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
```

This ensures that the pod will be scheduled to a node that has a `Tesla K80` or a `Tesla P100` Nvidia GPU.
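If matching a single GPU type is enough, the same pinning can be sketched with a plain `nodeSelector` on the bootstrap label, avoiding the affinity annotation entirely (an illustrative fragment; the pod name is hypothetical, and unlike the `In` operator above, `nodeSelector` matches exactly one value per label):

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: gpu-pod-k80   # hypothetical name for illustration
spec:
  nodeSelector:
    # exact match on the label set during node bootstrapping
    alpha.kubernetes.io/nvidia-gpu-name: "Tesla K80"
  containers:
  - name: gpu-container-1
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 2
```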
### Warning

The API presented here **will change** in an upcoming release to better support GPUs, and hardware accelerators in general, in Kubernetes.

## Access to CUDA libraries

As of now, CUDA libraries are expected to be pre-installed on the nodes.

Pods can access the libraries using `hostPath` volumes.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container-1
    securityContext:
      privileged: true
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - mountPath: /usr/local/nvidia/bin
      name: bin
    - mountPath: /usr/lib/nvidia
      name: lib
  volumes:
  - hostPath:
      path: /usr/lib/nvidia-367/bin
    name: bin
  - hostPath:
      path: /usr/lib/nvidia-367
    name: lib
```
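The `nvidia-367` paths above are specific to driver version 367; a node bootstrap script might discover the directory instead of hardcoding it (a sketch, assuming the driver installs its libraries under `/usr/lib/nvidia-<version>`):

```shell
# Pick the first installed Nvidia driver library directory, if any.
NVIDIA_LIB_DIR=$(ls -d /usr/lib/nvidia-* 2>/dev/null | head -n 1)
echo "hostPath for 'lib': ${NVIDIA_LIB_DIR}"
echo "hostPath for 'bin': ${NVIDIA_LIB_DIR}/bin"
```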
## Future

- Support for hardware accelerators is in its early stages in Kubernetes.
- GPUs and other accelerators will soon become native compute resources across the system.
- Better APIs will be introduced to provision and consume accelerators in a scalable manner.
- Kubernetes will automatically ensure that applications consuming GPUs get the best possible performance.
- Key usability problems, such as access to CUDA libraries, will be addressed.

{% endcapture %}

{% include templates/task.md %}
@@ -4,133 +4,6 @@ assignees:
title: GPU Support
---

{% include user-guide-content-moved.md %}

[Scheduling GPUs](/docs/tasks/manage-gpus/scheduling-gpus/)