Document using multiple scheduler profiles (#19172)

* Add instructions for using multiple scheduling profiles.

Signed-off-by: Aldo Culquicondor <acondor@google.com>

* Move scheduling policies and profiles to Reference

Signed-off-by: Aldo Culquicondor <acondor@google.com>

* Renames and nits

Signed-off-by: Aldo Culquicondor <acondor@google.com>

* Fix links and grammar

Signed-off-by: Aldo Culquicondor <acondor@google.com>

* Fix link and flag usage

Signed-off-by: Aldo Culquicondor <acondor@google.com>
Aldo Culquicondor 2020-03-16 13:06:43 -04:00 committed by GitHub
parent 615c7f619f
commit d1056364e2
5 changed files with 324 additions and 103 deletions


@@ -54,14 +54,12 @@ individual and collective resource requirements, hardware / software /
policy constraints, affinity and anti-affinity specifications, data
locality, inter-workload interference, and so on.
### Node selection in kube-scheduler {#kube-scheduler-implementation}
kube-scheduler selects a node for the pod in a 2-step operation:
1. Filtering
2. Scoring
The _filtering_ step finds the set of Nodes where it's feasible to
schedule the Pod. For example, the PodFitsResources filter checks whether a
@@ -78,105 +76,15 @@ Finally, kube-scheduler assigns the Pod to the Node with the highest ranking.
If there is more than one node with equal scores, kube-scheduler selects
one of these at random.
There are two supported ways to configure the filtering and scoring behavior
of the scheduler:
1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to
configure _Predicates_ for filtering and _Priorities_ for scoring.
1. [Scheduling Profiles](/docs/reference/scheduling/profiles) allow you to
configure Plugins that implement different scheduling stages, including:
`QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You
can also configure the kube-scheduler to run different profiles.
{{% /capture %}}
{{% capture whatsnext %}}


@@ -38,13 +38,15 @@ client libraries:
* [JSONPath](/docs/reference/kubectl/jsonpath/) - Syntax guide for using [JSONPath expressions](http://goessner.net/articles/JsonPath/) with kubectl.
* [kubeadm](/docs/reference/setup-tools/kubeadm/kubeadm/) - CLI tool to easily provision a secure Kubernetes cluster.
## Components Reference
* [kubelet](/docs/reference/command-line-tools-reference/kubelet/) - The primary *node agent* that runs on each node. The kubelet takes a set of PodSpecs and ensures that the described containers are running and healthy.
* [kube-apiserver](/docs/reference/command-line-tools-reference/kube-apiserver/) - REST API that validates and configures data for API objects such as pods, services, replication controllers.
* [kube-controller-manager](/docs/reference/command-line-tools-reference/kube-controller-manager/) - Daemon that embeds the core control loops shipped with Kubernetes.
* [kube-proxy](/docs/reference/command-line-tools-reference/kube-proxy/) - Can do simple TCP/UDP stream forwarding or round-robin TCP/UDP forwarding across a set of back-ends.
* [kube-scheduler](/docs/reference/command-line-tools-reference/kube-scheduler/) - Scheduler that manages availability, performance, and capacity.
* [kube-scheduler Policies](/docs/reference/scheduling/policies)
* [kube-scheduler Profiles](/docs/reference/scheduling/profiles)
## Design Docs


@@ -0,0 +1,5 @@
---
title: Scheduling
weight: 70
toc-hide: true
---


@@ -0,0 +1,125 @@
---
title: Scheduling Policies
content_template: templates/concept
weight: 10
---
{{% capture overview %}}
A scheduling Policy can be used to specify the *predicates* and *priorities*
that the {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}
runs to [filter and score nodes](/docs/concepts/scheduling/kube-scheduler/#kube-scheduler-implementation),
respectively.
You can set a scheduling policy by running
`kube-scheduler --policy-config-file <filename>` or
`kube-scheduler --policy-configmap <ConfigMap>`
and using the [Policy type](https://pkg.go.dev/k8s.io/kube-scheduler@v0.18.0/config/v1?tab=doc#Policy).
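For illustration, a minimal sketch of such a policy file follows; the specific predicates and priorities chosen here are arbitrary examples rather than a recommended set:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "PodToleratesNodeTaints"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1}
  ]
}
```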
{{% /capture %}}
{{% capture body %}}
## Predicates
The following *predicates* implement filtering:
- `PodFitsHostPorts`: Checks if a Node has free ports (the network protocol kind)
for the Pod ports the Pod is requesting.
- `PodFitsHost`: Checks if a Pod specifies a specific Node by its hostname.
- `PodFitsResources`: Checks if the Node has free resources (e.g., CPU and memory)
to meet the Pod's resource requirements.
- `PodMatchNodeSelector`: Checks if a Pod's Node {{< glossary_tooltip term_id="selector" >}}
matches the Node's {{< glossary_tooltip text="label(s)" term_id="label" >}}.
- `NoVolumeZoneConflict`: Evaluates if the {{< glossary_tooltip text="Volumes" term_id="volume" >}}
that a Pod requests are available on the Node, given the failure zone restrictions for
that storage.
- `NoDiskConflict`: Evaluates whether a Pod can fit on a Node given the volumes it
requests and those that are already mounted.
- `MaxCSIVolumeCount`: Decides how many {{< glossary_tooltip text="CSI" term_id="csi" >}}
volumes should be attached, and whether that's over a configured limit.
- `CheckNodeMemoryPressure`: If a Node is reporting memory pressure, and there's no
configured exception, the Pod won't be scheduled there.
- `CheckNodePIDPressure`: If a Node is reporting that process IDs are scarce, and
there's no configured exception, the Pod won't be scheduled there.
- `CheckNodeDiskPressure`: If a Node is reporting storage pressure (a filesystem that
is full or nearly full), and there's no configured exception, the Pod won't be
scheduled there.
- `CheckNodeCondition`: Nodes can report that they have a completely full filesystem,
that networking isn't available or that kubelet is otherwise not ready to run Pods.
If such a condition is set for a Node, and there's no configured exception, the Pod
won't be scheduled there.
- `PodToleratesNodeTaints`: Checks if a Pod's {{< glossary_tooltip text="tolerations" term_id="toleration" >}}
can tolerate the Node's {{< glossary_tooltip text="taints" term_id="taint" >}}.
- `CheckVolumeBinding`: Evaluates if a Pod can fit due to the volumes it requests.
This applies for both bound and unbound
{{< glossary_tooltip text="PVCs" term_id="persistent-volume-claim" >}}.
## Priorities
The following *priorities* implement scoring:
- `SelectorSpreadPriority`: Spreads Pods across hosts, considering Pods that
belong to the same {{< glossary_tooltip text="Service" term_id="service" >}},
{{< glossary_tooltip term_id="statefulset" >}} or
{{< glossary_tooltip term_id="replica-set" >}}.
- `InterPodAffinityPriority`: Implements preferred
[inter-Pod affinity and anti-affinity](/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity).
- `LeastRequestedPriority`: Favors nodes with fewer requested resources. In other
words, the more Pods that are placed on a Node, and the more resources those
Pods use, the lower the ranking this policy will give.
- `MostRequestedPriority`: Favors nodes with most requested resources. This policy
will fit the scheduled Pods onto the smallest number of Nodes needed to run your
overall set of workloads.
- `RequestedToCapacityRatioPriority`: Creates a requested-to-capacity-based `ResourceAllocationPriority` using a default resource scoring function shape.
- `BalancedResourceAllocation`: Favors nodes with balanced resource usage.
- `NodePreferAvoidPodsPriority`: Prioritizes nodes according to the node annotation
`scheduler.alpha.kubernetes.io/preferAvoidPods`. You can use this to hint that
two different Pods shouldn't run on the same Node.
- `NodeAffinityPriority`: Prioritizes nodes according to node affinity scheduling
preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution.
You can read more about this in [Assigning Pods to Nodes](/docs/concepts/configuration/assign-pod-node/).
- `TaintTolerationPriority`: Prepares a priority list for all the nodes, based on
the number of intolerable taints on each node; nodes with fewer intolerable
taints are ranked higher.
- `ImageLocalityPriority`: Favors nodes that already have the
{{< glossary_tooltip text="container images" term_id="image" >}} for that
Pod cached locally.
- `ServiceSpreadingPriority`: For a given Service, this policy aims to make sure that
the Pods for the Service run on different nodes. It favors scheduling onto nodes
that don't have Pods for the Service already assigned there. The overall outcome is
that the Service becomes more resilient to a single Node failure.
- `EqualPriority`: Gives an equal weight of one to all nodes.
- `EvenPodsSpreadPriority`: Implements preferred
[pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
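These priorities can also be given custom weights in a Policy loaded from a ConfigMap via `kube-scheduler --policy-configmap <name>`. A minimal sketch, assuming the defaults: the ConfigMap name is illustrative, the namespace defaults to `kube-system` (configurable with `--policy-configmap-namespace`), and the policy is read from the `policy.cfg` data key:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-policy    # illustrative name
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "predicates": [
        {"name": "PodFitsResources"}
      ],
      "priorities": [
        {"name": "MostRequestedPriority", "weight": 2}
      ]
    }
```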
{{% /capture %}}
{{% capture whatsnext %}}
* Learn about [scheduling](/docs/concepts/scheduling/kube-scheduler/)
* Learn about [kube-scheduler profiles](/docs/reference/scheduling/profiles/)
{{% /capture %}}


@@ -0,0 +1,181 @@
---
title: Scheduling Profiles
content_template: templates/concept
weight: 20
---
{{% capture overview %}}
{{< feature-state for_k8s_version="v1.18" state="alpha" >}}
A scheduling Profile allows you to configure the different stages of scheduling
in the {{< glossary_tooltip text="kube-scheduler" term_id="kube-scheduler" >}}.
Each stage is exposed through an extension point. Plugins provide scheduling behaviors
by implementing one or more of these extension points.
You can specify scheduling profiles by running `kube-scheduler --config <filename>`,
using the component config APIs
([`v1alpha1`](https://pkg.go.dev/k8s.io/kube-scheduler@{{< param "fullversion" >}}/config/v1alpha1?tab=doc#KubeSchedulerConfiguration)
or [`v1alpha2`](https://pkg.go.dev/k8s.io/kube-scheduler@{{< param "fullversion" >}}/config/v1alpha2?tab=doc#KubeSchedulerConfiguration)).
The `v1alpha2` API allows you to configure kube-scheduler to run
[multiple profiles](#multiple-profiles).
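A minimal sketch of a `v1alpha2` configuration file; the kubeconfig path is an assumption for a typical control plane host:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
clientConnection:
  # Illustrative path; point this at a kubeconfig the scheduler can use.
  kubeconfig: /etc/kubernetes/scheduler.conf
profiles:
  - schedulerName: default-scheduler
```

You would then start the scheduler with `kube-scheduler --config <filename>`.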
{{% /capture %}}
{{% capture body %}}
## Extension points
Scheduling happens in a series of stages that are exposed through the following
extension points:
1. `QueueSort`: These plugins provide an ordering function that is used to
sort pending Pods in the scheduling queue. Exactly one queue sort plugin
may be enabled at a time.
1. `PreFilter`: These plugins are used to pre-process or check information
about a Pod or the cluster before filtering.
1. `Filter`: These plugins are the equivalent of Predicates in a scheduling
Policy and are used to filter out nodes that cannot run the Pod. Filters
are called in the configured order.
1. `PreScore`: This is an informational extension point that can be used
for doing pre-scoring work.
1. `Score`: These plugins provide a score to each node that has passed the
filtering phase. The scheduler then selects the node with the highest
weighted score sum.
1. `Reserve`: This is an informational extension point that notifies plugins
when resources have been reserved for a given Pod.
1. `Permit`: These plugins can prevent or delay the binding of a Pod.
1. `PreBind`: These plugins perform any work required before a Pod is bound.
1. `Bind`: The plugins bind a Pod to a Node. Bind plugins are called in order
and once one has done the binding, the remaining plugins are skipped. At
least one bind plugin is required.
1. `PostBind`: This is an informational extension point that is called after
a Pod has been bound.
1. `UnReserve`: This is an informational extension point that is called if
a Pod is rejected after being reserved and put on hold by a `Permit` plugin.
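Within a profile, each of these extension points corresponds to a field under `plugins` (`queueSort`, `preFilter`, `filter`, `preScore`, `score`, `bind`, and so on) where individual plugins can be enabled or disabled. As a sketch, assuming the `v1alpha2` API, the following re-enables a default Score plugin with a custom weight by first disabling its default registration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        # Disable the default registration, then re-enable the plugin
        # with a non-default weight (the weight value is illustrative).
        disabled:
          - name: ImageLocality
        enabled:
          - name: ImageLocality
            weight: 5
```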
## Scheduling plugins
The following plugins, enabled by default, implement one or more of these
extension points:
- `DefaultTopologySpread`: Favors spreading across nodes for Pods that belong to
{{< glossary_tooltip text="Services" term_id="service" >}},
{{< glossary_tooltip text="ReplicaSets" term_id="replica-set" >}} and
{{< glossary_tooltip text="StatefulSets" term_id="statefulset" >}}
Extension points: `PreScore`, `Score`.
- `ImageLocality`: Favors nodes that already have the container images that the
Pod runs.
Extension points: `Score`.
- `TaintToleration`: Implements
[taints and tolerations](/docs/concepts/configuration/taint-and-toleration/).
Extension points: `Filter`, `PreScore`, `Score`.
- `NodeName`: Checks if a Pod spec node name matches the current node.
Extension points: `Filter`.
- `NodePorts`: Checks if a node has free ports for the requested Pod ports.
Extension points: `PreFilter`, `Filter`.
- `NodePreferAvoidPods`: Scores nodes according to the node
{{< glossary_tooltip text="annotation" term_id="annotation" >}}
`scheduler.alpha.kubernetes.io/preferAvoidPods`.
Extension points: `Score`.
- `NodeAffinity`: Implements
[node selectors](/docs/concepts/configuration/assign-pod-node/#nodeselector)
and [node affinity](/docs/concepts/configuration/assign-pod-node/#node-affinity).
Extension points: `Filter`, `Score`.
- `PodTopologySpread`: Implements
[Pod topology spread](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
Extension points: `PreFilter`, `Filter`, `PreScore`, `Score`.
- `NodeUnschedulable`: Filters out nodes that have `.spec.unschedulable` set to
true.
Extension points: `Filter`.
- `NodeResourcesFit`: Checks if the node has all the resources that the Pod is
requesting.
Extension points: `PreFilter`, `Filter`.
- `NodeResourcesBalancedAllocation`: Favors nodes that would obtain a more
balanced resource usage if the Pod is scheduled there.
Extension points: `Score`.
- `NodeResourcesLeastAllocated`: Favors nodes that have a low allocation of
resources.
Extension points: `Score`.
- `VolumeBinding`: Checks if the node has, or can bind, the requested
{{< glossary_tooltip text="volumes" term_id="volume" >}}.
Extension points: `Filter`.
- `VolumeRestrictions`: Checks that volumes mounted in the node satisfy
restrictions that are specific to the volume provider.
Extension points: `Filter`.
- `VolumeZone`: Checks that volumes requested satisfy any zone requirements they
might have.
Extension points: `Filter`.
- `NodeVolumeLimits`: Checks that CSI volume limits can be satisfied for the
node.
Extension points: `Filter`.
- `EBSLimits`: Checks that AWS EBS volume limits can be satisfied for the node.
Extension points: `Filter`.
- `GCEPDLimits`: Checks that GCP-PD volume limits can be satisfied for the node.
Extension points: `Filter`.
- `AzureDiskLimits`: Checks that Azure disk volume limits can be satisfied for
the node.
Extension points: `Filter`.
- `InterPodAffinity`: Implements
[inter-Pod affinity and anti-affinity](/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity).
Extension points: `PreFilter`, `Filter`, `PreScore`, `Score`.
- `PrioritySort`: Provides the default priority based sorting.
Extension points: `QueueSort`.
- `DefaultBinder`: Provides the default binding mechanism.
Extension points: `Bind`.
Through the component config APIs, you can also enable the following plugins,
which are not enabled by default (see the sketch after this list):
- `NodeResourcesMostAllocated`: Favors nodes that have a high allocation of
resources.
Extension points: `Score`.
- `RequestedToCapacityRatio`: Favors nodes according to a configured function of
the allocated resources.
Extension points: `Score`.
- `NodeResourceLimits`: Favors nodes that satisfy the Pod resource limits.
Extension points: `PreScore`, `Score`.
- `CinderVolume`: Checks that OpenStack Cinder volume limits can be satisfied
for the node.
Extension points: `Filter`.
- `NodeLabel`: Filters and/or scores a node according to configured
{{< glossary_tooltip text="label(s)" term_id="label" >}}.
Extension points: `Filter`, `Score`.
- `ServiceAffinity`: Checks that Pods that belong to a
{{< glossary_tooltip term_id="service" >}} fit in a set of nodes defined by
configured labels. This plugin also favors spreading the Pods belonging to a
Service across nodes.
Extension points: `PreFilter`, `Filter`, `Score`.
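For example, a sketch that favors bin-packing by swapping the default least-allocated scoring for most-allocated (assuming the `v1alpha2` API; the two plugins express opposite preferences, so the default one is disabled):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesLeastAllocated
        enabled:
          - name: NodeResourcesMostAllocated
```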
## Multiple profiles
When using the component config API v1alpha2, a scheduler can be configured to
run more than one profile. Each profile has an associated scheduler name.
Pods that want to be scheduled according to a specific profile can include
the corresponding scheduler name in their `.spec.schedulerName`.
By default, one profile with the scheduler name `default-scheduler` is created.
This profile includes the default plugins described above. When declaring more
than one profile, each of them needs a unique scheduler name.
If a Pod doesn't specify a scheduler name, kube-apiserver sets it to
`default-scheduler`. Therefore, a profile with this scheduler name should exist
for those Pods to be scheduled.
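As a sketch, the following configuration declares two profiles, and a Pod selects the second one through `.spec.schedulerName`; the `no-image-locality-scheduler` name and the Pod itself are illustrative:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
  - schedulerName: no-image-locality-scheduler    # illustrative name
    plugins:
      score:
        disabled:
          - name: ImageLocality
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod    # illustrative Pod
spec:
  schedulerName: no-image-locality-scheduler
  containers:
    - name: app
      image: nginx
```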
{{< note >}}
A Pod's scheduling events have its `.spec.schedulerName` as their ReportingController.
Events for leader election use the scheduler name of the first profile in the
list.
{{< /note >}}
{{< note >}}
All profiles must use the same plugin in the `QueueSort` extension point and have
the same configuration parameters (if applicable). This is because the scheduler
only has one pending pods queue.
{{< /note >}}
{{% /capture %}}
{{% capture whatsnext %}}
* Learn about [scheduling](/docs/concepts/scheduling/kube-scheduler/)
{{% /capture %}}