diff --git a/content/en/docs/concepts/scheduling-eviction/_index.md b/content/en/docs/concepts/scheduling-eviction/_index.md
index 91d77b01ec..4ceb552edc 100644
--- a/content/en/docs/concepts/scheduling-eviction/_index.md
+++ b/content/en/docs/concepts/scheduling-eviction/_index.md
@@ -26,6 +26,7 @@ of terminating one or more Pods on Nodes.
 * [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
 * [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/)
 * [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
+* [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
 * [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
 * [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
 * [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
diff --git a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
new file mode 100644
index 0000000000..e1c468f58f
--- /dev/null
+++ b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
@@ -0,0 +1,215 @@
+---
+reviewers:
+- klueska
+- pohly
+title: Dynamic Resource Allocation
+content_type: concept
+weight: 65
+---
+
+<!-- overview -->
+
+{{< feature-state for_k8s_version="v1.26" state="alpha" >}}
+
+Dynamic resource allocation is a new API for requesting and sharing resources
+between Pods and between containers inside a Pod. It is a generalization of the
+persistent volumes API for generic resources. Third-party resource drivers are
+responsible for tracking and allocating resources. Different kinds of
+resources support arbitrary parameters for defining requirements and
+initialization.
+
+## {{% heading "prerequisites" %}}
+
+Kubernetes v{{< skew currentVersion >}} includes cluster-level API support for
+dynamic resource allocation, but it [needs to be
+enabled](#enabling-dynamic-resource-allocation) explicitly. You also must
+install a resource driver for the specific resources that are meant to be managed
+using this API. If you are not running Kubernetes v{{< skew currentVersion >}},
+check the documentation for that version of Kubernetes.
+
+<!-- body -->
+
+## API
+
+The new `resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
+term_id="api-group" >}} provides four new types:
+
+ResourceClass
+: Defines which resource driver handles a certain kind of
+  resource and provides common parameters for it. ResourceClasses
+  are created by a cluster administrator when installing a resource
+  driver.
+
+ResourceClaim
+: Defines a particular resource instance that is required by a
+  workload. Created by a user (lifecycle managed manually, can be shared
+  between different Pods) or for individual Pods by the control plane based on
+  a ResourceClaimTemplate (automatic lifecycle, typically used by just one
+  Pod).
+
+ResourceClaimTemplate
+: Defines the spec and some metadata for creating
+  ResourceClaims. Created by a user when deploying a workload.
+
+PodScheduling
+: Used internally by the control plane and resource drivers
+  to coordinate pod scheduling when ResourceClaims need to be allocated
+  for a Pod.
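+
+As an illustration of a manually created ResourceClaim (which can therefore be
+shared between different Pods), here is a minimal sketch. It only references a
+ResourceClass and reuses the fictional `resource.example.com` class from the
+full example further below; the claim name is hypothetical:
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha1
+kind: ResourceClaim
+metadata:
+  name: shared-example-claim
+spec:
+  # Refers to the ResourceClass created by the administrator
+  # when installing the resource driver.
+  resourceClassName: resource.example.com
+```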
+
+Parameters for ResourceClass and ResourceClaim are stored in separate objects,
+typically using the type defined by a {{< glossary_tooltip
+term_id="CustomResourceDefinition" text="CRD" >}} that was created when
+installing a resource driver.
+
+The `core/v1` `PodSpec` defines the ResourceClaims that are needed for a Pod in a
+new `resourceClaims` field. Entries in that list reference either a ResourceClaim
+or a ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using
+this PodSpec (for example, inside a Deployment or StatefulSet) share the same
+ResourceClaim instance. When referencing a ResourceClaimTemplate, each Pod gets
+its own instance.
+
+The `resources.claims` list for container resources defines whether a container gets
+access to these resource instances, which makes it possible to share resources
+between containers inside a Pod.
+
+Here is an example of a fictional resource driver. Two ResourceClaim objects
+will be created for this Pod and each container gets access to one of them.
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha1
+kind: ResourceClass
+metadata:
+  name: resource.example.com
+driverName: resource-driver.example.com
+---
+apiVersion: cats.resource.example.com/v1
+kind: ClaimParameters
+metadata:
+  name: large-black-cat-claim-parameters
+spec:
+  color: black
+  size: large
+---
+apiVersion: resource.k8s.io/v1alpha1
+kind: ResourceClaimTemplate
+metadata:
+  name: large-black-cat-claim-template
+spec:
+  spec:
+    resourceClassName: resource.example.com
+    parametersRef:
+      apiGroup: cats.resource.example.com
+      kind: ClaimParameters
+      name: large-black-cat-claim-parameters
+---
+apiVersion: v1
+kind: Pod
+metadata:
+  name: pod-with-cats
+spec:
+  containers:
+  - name: container0
+    image: ubuntu:20.04
+    command: ["sleep", "9999"]
+    resources:
+      claims:
+      - name: cat-0
+  - name: container1
+    image: ubuntu:20.04
+    command: ["sleep", "9999"]
+    resources:
+      claims:
+      - name: cat-1
+  resourceClaims:
+  - name: cat-0
+    source:
+      resourceClaimTemplateName: large-black-cat-claim-template
+  - name: cat-1
+    source:
+      resourceClaimTemplateName: large-black-cat-claim-template
+```
+
+## Scheduling
+
+In contrast to native resources (CPU, RAM) and extended resources (managed by a
+device plugin, advertised by the kubelet), the scheduler has no knowledge of what
+dynamic resources are available in a cluster or how they could be split up to
+satisfy the requirements of a specific ResourceClaim. Resource drivers are
+responsible for that. They mark ResourceClaims as "allocated" once resources
+for them are reserved. This also tells the scheduler where in the cluster a
+ResourceClaim is available.
+
+ResourceClaims can get allocated as soon as they are created ("immediate
+allocation"), without considering which Pods will use them. The default is to
+delay allocation until a Pod that needs the ResourceClaim gets scheduled
+("wait for first consumer").
+
+In that mode, the scheduler checks all ResourceClaims needed by a Pod and
+creates a PodScheduling object in which it informs the resource drivers
+responsible for those ResourceClaims about the nodes that the scheduler considers
+suitable for the Pod. The resource drivers respond by excluding nodes that
+don't have enough of the driver's resources left. Once the scheduler has that
+information, it selects one node and stores that choice in the PodScheduling
+object. The resource drivers then allocate their ResourceClaims so that the
+resources will be available on that node. Once that is complete, the Pod
+gets scheduled.
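+
+A ResourceClaim can opt into immediate allocation through its spec. The
+following is only a sketch: it reuses the fictional claim parameters from the
+example above and assumes the v1alpha1 `allocationMode` field, whose default
+is "WaitForFirstConsumer":
+
+```yaml
+apiVersion: resource.k8s.io/v1alpha1
+kind: ResourceClaim
+metadata:
+  name: large-black-cat-claim
+spec:
+  resourceClassName: resource.example.com
+  parametersRef:
+    apiGroup: cats.resource.example.com
+    kind: ClaimParameters
+    name: large-black-cat-claim-parameters
+  # Allocate as soon as the claim is created instead of waiting
+  # for the first Pod that consumes it.
+  allocationMode: Immediate
+```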
+
+As part of this scheduling process, ResourceClaims also get reserved for the
+Pod. Currently, a ResourceClaim can either be used exclusively by a single Pod or
+by an unlimited number of Pods.
+
+One key feature is that Pods do not get scheduled to a node unless all of
+their resources are allocated and reserved. This avoids the scenario where a Pod
+gets scheduled onto one node and then cannot run there, which is bad because
+such a pending Pod also blocks all the other resources, like RAM or CPU, that
+were set aside for it.
+
+## Limitations
+
+The scheduler plugin must be involved in scheduling Pods that use
+ResourceClaims. Bypassing the scheduler by setting the `nodeName` field leads
+to Pods that the kubelet refuses to start because the ResourceClaims are not
+reserved or not even allocated. It may be possible to [remove this
+limitation](https://github.com/kubernetes/kubernetes/issues/114005) in the
+future.
+
+## Enabling dynamic resource allocation
+
+Dynamic resource allocation is an *alpha feature* and is only enabled when the
+`DynamicResourceAllocation` [feature
+gate](/docs/reference/command-line-tools-reference/feature-gates/) and the
+`resource.k8s.io/v1alpha1` {{< glossary_tooltip text="API group"
+term_id="api-group" >}} are enabled. For details on that, see the
+`--feature-gates` and `--runtime-config` [kube-apiserver
+parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
+kube-scheduler, kube-controller-manager, and the kubelet also need the feature
+gate enabled.
+
+A quick way to check whether a Kubernetes cluster supports the feature is to list
+ResourceClass objects with:
+
+```shell
+kubectl get resourceclasses
+```
+
+If your cluster supports dynamic resource allocation, the response is either a
+list of ResourceClass objects or:
+
+```
+No resources found
+```
+
+If it is not supported, this error is printed instead:
+
+```
+error: the server doesn't have a resource type "resourceclasses"
+```
+
+The default configuration of kube-scheduler enables the "DynamicResources"
+plugin if and only if the feature gate is enabled. Custom configurations may
+have to be modified to include it.
+
+In addition to enabling the feature in the cluster, a resource driver also has to
+be installed. Please refer to the driver's documentation for details.
+
+## {{% heading "whatsnext" %}}
+
+- For more information on the design, see the
+  [Dynamic Resource Allocation KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3063-dynamic-resource-allocation/README.md).
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates.md b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
index 2aceae2542..b7909b2ac4 100644
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
@@ -88,6 +88,7 @@ For a reference to old feature gates that are removed, please refer to
 | `DownwardAPIHugePages` | `false` | Alpha | 1.20 | 1.20 |
 | `DownwardAPIHugePages` | `false` | Beta | 1.21 | 1.21 |
 | `DownwardAPIHugePages` | `true` | Beta | 1.22 | |
+| `DynamicResourceAllocation` | `false` | Alpha | 1.26 | |
 | `EndpointSliceTerminatingCondition` | `false` | Alpha | 1.20 | 1.21 |
 | `EndpointSliceTerminatingCondition` | `true` | Beta | 1.22 | |
 | `ExpandedDNSConfig` | `false` | Alpha | 1.22 | |