Merge pull request #33792 from sftim/20220510_assign_pod_node_affinity_updates

Revise scheduling-related docs
pull/35312/head
Kubernetes Prow Robot 2022-07-24 03:38:57 -07:00 committed by GitHub
commit 7078c38d3b
13 changed files with 616 additions and 446 deletions

View File

@ -67,7 +67,7 @@ Let's see an example of a cluster to understand this API.
As the feature name "PodTopologySpread" implies, the basic usage of this feature
is to run your workload in an absolutely even manner (maxSkew=1), or a relatively
even manner (maxSkew>=2). See the [official
document](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
document](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
for more details.
In addition to this basic usage, there are some advanced usage examples that

View File

@ -70,7 +70,7 @@ To correct the latter issue, we now employ a "hunt and peck" approach to removin
### 1. Upgrade to kubernetes 1.18 and make use of Pod Topology Spread Constraints
While this seems like it could have been the perfect solution, at the time of writing Kubernetes 1.18 was unavailable on the two most common managed Kubernetes services in public cloud, EKS and GKE.
Furthermore, [pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/) were still a [beta feature in 1.18](https://v1-18.docs.kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/) which meant that it [wasn't guaranteed to be available in managed clusters](https://cloud.google.com/kubernetes-engine/docs/concepts/types-of-clusters#kubernetes_feature_choices) even when v1.18 became available.
Furthermore, [pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/) were still a beta feature in 1.18 which meant that it [wasn't guaranteed to be available in managed clusters](https://cloud.google.com/kubernetes-engine/docs/concepts/types-of-clusters#kubernetes_feature_choices) even when v1.18 became available.
The entire endeavour was concerningly reminiscent of checking [caniuse.com](https://caniuse.com/) when Internet Explorer 8 was still around.
### 2. Deploy a statefulset _per zone_.

View File

@ -23,6 +23,7 @@ of terminating one or more Pods on Nodes.
* [Kubernetes Scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/)
* [Assigning Pods to Nodes](/docs/concepts/scheduling-eviction/assign-pod-node/)
* [Pod Overhead](/docs/concepts/scheduling-eviction/pod-overhead/)
* [Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
* [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/)
* [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)

View File

@ -11,24 +11,27 @@ weight: 20
<!-- overview -->
You can constrain a {{< glossary_tooltip text="Pod" term_id="pod" >}} so that it can only run on particular set of
{{< glossary_tooltip text="node(s)" term_id="node" >}}.
You can constrain a {{< glossary_tooltip text="Pod" term_id="pod" >}} so that it is
_restricted_ to run on particular {{< glossary_tooltip text="node(s)" term_id="node" >}},
or to _prefer_ to run on particular nodes.
There are several ways to do this and the recommended approaches all use
[label selectors](/docs/concepts/overview/working-with-objects/labels/) to facilitate the selection.
Generally such constraints are unnecessary, as the scheduler will automatically do a reasonable placement
Often, you do not need to set any such constraints; the
{{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} will automatically do a reasonable placement
(for example, spreading your Pods across nodes so as not to place Pods on a node with insufficient free resources).
However, there are some circumstances where you may want to control which node
the Pod deploys to, for example, to ensure that a Pod ends up on a node with an SSD attached to it, or to co-locate Pods from two different
services that communicate a lot into the same availability zone.
the Pod deploys to, for example, to ensure that a Pod ends up on a node with an SSD attached to it,
or to co-locate Pods from two different services that communicate a lot into the same availability zone.
<!-- body -->
You can use any of the following methods to choose where Kubernetes schedules
specific Pods:
specific Pods:
* [nodeSelector](#nodeselector) field matching against [node labels](#built-in-node-labels)
* [Affinity and anti-affinity](#affinity-and-anti-affinity)
* [nodeName](#nodename) field
* [Pod topology spread constraints](#pod-topology-spread-constraints)
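As a quick taste of the simplest of these, here is a minimal sketch of a Pod that uses `nodeSelector` (the `disktype: ssd` node label is a hypothetical example, not something Kubernetes sets for you):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # Only schedule onto nodes that carry this (hypothetical) label
  nodeSelector:
    disktype: ssd
```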
## Node labels {#built-in-node-labels}
@ -337,13 +340,15 @@ null `namespaceSelector` matches the namespace of the Pod where the rule is defi
Inter-pod affinity and anti-affinity can be even more useful when they are used with higher
level collections such as ReplicaSets, StatefulSets, Deployments, etc. These
rules allow you to configure that a set of workloads should
be co-located in the same defined topology, eg., the same node.
be co-located in the same defined topology; for example, preferring to place two related
Pods onto the same node.
Take, for example, a three-node cluster running a web application with an
in-memory cache like redis. You could use inter-pod affinity and anti-affinity
to co-locate the web servers with the cache as much as possible.
For example: imagine a three-node cluster. You use the cluster to run a web application
and also an in-memory cache (such as Redis). For this example, also assume that latency between
the web application and the memory cache should be as low as is practical. You could use inter-pod
affinity and anti-affinity to co-locate the web servers with the cache as much as possible.
In the following example Deployment for the redis cache, the replicas get the label `app=store`. The
In the following example Deployment for the Redis cache, the replicas get the label `app=store`. The
`podAntiAffinity` rule tells the scheduler to avoid placing multiple replicas
with the `app=store` label on a single node. This creates each cache in a
separate node.
@ -378,10 +383,10 @@ spec:
image: redis:3.2-alpine
```
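In outline, the `podAntiAffinity` stanza described above has roughly the following shape (a sketch of the relevant fields only; `app: store` and the per-node topology key mirror the description above):

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: store
        # treat each node (hostname) as its own topology domain
        topologyKey: "kubernetes.io/hostname"
```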
The following Deployment for the web servers creates replicas with the label `app=web-store`. The
Pod affinity rule tells the scheduler to place each replica on a node that has a
Pod with the label `app=store`. The Pod anti-affinity rule tells the scheduler
to avoid placing multiple `app=web-store` servers on a single node.
The following example Deployment for the web servers creates replicas with the label `app=web-store`.
The Pod affinity rule tells the scheduler to place each replica on a node that has a Pod
with the label `app=store`. The Pod anti-affinity rule tells the scheduler never to place
multiple `app=web-store` servers on a single node.
```yaml
apiVersion: apps/v1
@ -430,6 +435,10 @@ where each web server is co-located with a cache, on three separate nodes.
| *webserver-1* | *webserver-2* | *webserver-3* |
| *cache-1* | *cache-2* | *cache-3* |
The overall effect is that each cache instance is likely to be accessed by a single client that
is running on the same node. This approach aims to minimize both skew (imbalanced load) and latency.
You might have other reasons to use Pod anti-affinity.
See the [ZooKeeper tutorial](/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure)
for an example of a StatefulSet configured with anti-affinity for high
availability, using the same technique as this example.
@ -468,6 +477,16 @@ spec:
The above Pod will only run on the node `kube-01`.
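In essence, the manifest being referred to pins the Pod like this (a sketch; the container is illustrative, and the node name `kube-01` comes from the text above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  # nodeName bypasses the scheduler and binds the Pod directly to this node
  nodeName: kube-01
```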
## Pod topology spread constraints
You can use _topology spread constraints_ to control how {{< glossary_tooltip text="Pods" term_id="Pod" >}}
are spread across your cluster among failure-domains such as regions, zones, nodes, or among any other
topology domains that you define. You might do this to improve performance, expected availability, or
overall utilization.
Read [Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
to learn more about how these work.
## {{% heading "whatsnext" %}}
* Read more about [taints and tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/) .

View File

@ -83,7 +83,7 @@ of the scheduler:
## {{% heading "whatsnext" %}}
* Read about [scheduler performance tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
* Read about [Pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
* Read about [Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
* Read the [reference documentation](/docs/reference/command-line-tools-reference/kube-scheduler/) for kube-scheduler
* Read the [kube-scheduler config (v1beta3)](/docs/reference/config-api/kube-scheduler-config.v1beta3/) reference
* Learn about [configuring multiple schedulers](/docs/tasks/extend-kubernetes/configure-multiple-schedulers/)

View File

@ -0,0 +1,570 @@
---
title: Pod Topology Spread Constraints
content_type: concept
weight: 40
---
<!-- overview -->
You can use _topology spread constraints_ to control how
{{< glossary_tooltip text="Pods" term_id="Pod" >}} are spread across your cluster
among failure-domains such as regions, zones, nodes, and other user-defined topology
domains. This can help to achieve high availability as well as efficient resource
utilization.
You can set [cluster-level constraints](#cluster-level-default-constraints) as a default,
or configure topology spread constraints for individual workloads.
<!-- body -->
## Motivation
Imagine that you have a cluster of up to twenty nodes, and you want to run a
{{< glossary_tooltip text="workload" term_id="workload" >}}
that automatically scales how many replicas it uses. There could be as few as
two Pods or as many as fifteen.
When there are only two Pods, you'd prefer not to have both of those Pods run on the
same node: you would run the risk that a single node failure takes your workload
offline.
As you scale up and run more Pods, a different concern becomes important. Imagine
that you have three nodes running five Pods each. The nodes have enough capacity
to run that many replicas; however, the clients that interact with this workload
are split across three different datacenters (or infrastructure zones). Now you
have less concern about a single node failure, but you notice that latency is
higher than you'd like, and you are paying for network costs associated with
sending network traffic between the different zones.
You decide that under normal operation you'd prefer to have a similar number of replicas
[scheduled](/docs/concepts/scheduling-eviction/) into each infrastructure zone,
and you'd like the cluster to self-heal in the case that there is a problem.
Pod topology spread constraints offer you a declarative way to configure that.
## `topologySpreadConstraints` field
The Pod API includes a field, `spec.topologySpreadConstraints`. Here is an example:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
# Configure a topology spread constraint
topologySpreadConstraints:
- maxSkew: <integer>
minDomains: <integer> # optional; alpha since v1.24
topologyKey: <string>
whenUnsatisfiable: <string>
labelSelector: <object>
### other Pod fields go here
```
You can read more about this field by running `kubectl explain Pod.spec.topologySpreadConstraints`.
### Spread constraint definition
You can define one or multiple `topologySpreadConstraints` entries to instruct the
kube-scheduler how to place each incoming Pod in relation to the existing Pods across
your cluster. Those fields are:
- **maxSkew** describes the degree to which Pods may be unevenly distributed. You must
specify this field and the number must be greater than zero. Its semantics differ
according to the value of `whenUnsatisfiable`:
- if you select `whenUnsatisfiable: DoNotSchedule`, then `maxSkew` defines the
maximum permitted difference between the number of matching pods in the target
topology and the _global minimum_
(the minimum number of pods that match the label selector in a topology domain).
For example, if you have 3 zones with 2, 4 and 5 matching pods respectively,
then the global minimum is 2 and `maxSkew` is compared relative to that number.
- if you select `whenUnsatisfiable: ScheduleAnyway`, the scheduler gives higher
precedence to topologies that would help reduce the skew.
- **minDomains** indicates a minimum number of eligible domains. This field is optional.
A domain is a particular instance of a topology. An eligible domain is a domain whose
nodes match the node selector.
{{< note >}}
The `minDomains` field is an alpha field added in 1.24. You have to enable the
`MinDomainsInPodTopologySpread` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in order to use it.
{{< /note >}}
- The value of `minDomains` must be greater than 0, when specified.
You can only specify `minDomains` in conjunction with `whenUnsatisfiable: DoNotSchedule`.
- When the number of eligible domains with matching topology keys is less than `minDomains`,
Pod topology spread treats the global minimum as 0, and then the calculation of `skew` is performed.
The global minimum is the minimum number of matching Pods in an eligible domain,
or zero if the number of eligible domains is less than `minDomains`.
- When the number of eligible domains with matching topology keys equals or is greater than
`minDomains`, this value has no effect on scheduling.
- If you do not specify `minDomains`, the constraint behaves as if `minDomains` is 1.
- **topologyKey** is the key of [node labels](#node-labels). If two Nodes are labelled
with this key and have identical values for that label, the scheduler treats both
Nodes as being in the same topology. The scheduler tries to place a balanced number
of Pods into each topology domain.
- **whenUnsatisfiable** indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
- `DoNotSchedule` (default) tells the scheduler not to schedule it.
- `ScheduleAnyway` tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
- **labelSelector** is used to find matching Pods. Pods
that match this label selector are counted to determine the
number of Pods in their corresponding topology domain.
See [Label Selectors](/docs/concepts/overview/working-with-objects/labels/#label-selectors)
for more details.
When a Pod defines more than one `topologySpreadConstraint`, those constraints are
combined using a logical AND operation: the kube-scheduler looks for a node for the incoming Pod
that satisfies all the configured constraints.
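Putting these fields together, a single concrete constraint might look like the following sketch (it reuses the `zone` node label and the `foo: bar` Pod label that the examples later on this page also use):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1                       # zones may differ by at most one matching Pod
    topologyKey: zone                # one topology domain per distinct value of the "zone" node label
    whenUnsatisfiable: DoNotSchedule # keep the Pod pending rather than violate the constraint
    labelSelector:
      matchLabels:
        foo: bar                     # count Pods carrying this label
  containers:
  - name: nginx
    image: nginx
```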
### Node labels
Topology spread constraints rely on node labels to identify the topology
domain(s) that each {{< glossary_tooltip text="node" term_id="node" >}} is in.
For example, a node might have labels:
```yaml
region: us-east-1
zone: us-east-1a
```
{{< note >}}
For brevity, this example doesn't use the
[well-known](/docs/reference/labels-annotations-taints/) label keys
`topology.kubernetes.io/zone` and `topology.kubernetes.io/region`. However,
those registered label keys are nonetheless recommended rather than the private
(unqualified) label keys `region` and `zone` that are used here.
You can't make a reliable assumption about the meaning of a private label key
between different contexts.
{{< /note >}}
Suppose you have a 4-node cluster with the following labels:
```
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready <none> 4m26s v1.16.0 node=node1,zone=zoneA
node2 Ready <none> 3m58s v1.16.0 node=node2,zone=zoneA
node3 Ready <none> 3m17s v1.16.0 node=node3,zone=zoneB
node4 Ready <none> 2m43s v1.16.0 node=node4,zone=zoneB
```
Then the cluster is logically viewed as below:
{{<mermaid>}}
graph TB
subgraph "zoneB"
n3(Node3)
n4(Node4)
end
subgraph "zoneA"
n1(Node1)
n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
## Consistency
You should set the same Pod topology spread constraints on all pods in a group.
Usually, if you are using a workload controller such as a Deployment, the pod template
takes care of this for you. If you mix different spread constraints then Kubernetes
follows the API definition of the field; however, the behavior is more likely to become
confusing and troubleshooting is less straightforward.
You need a mechanism to ensure that all the nodes in a topology domain (such as a
cloud provider region) are labelled consistently.
To avoid needing to label nodes manually, most clusters automatically
populate well-known labels such as `kubernetes.io/hostname`. Check whether
your cluster supports this.
## Topology spread constraint examples
### Example: one topology spread constraint {#example-one-topologyspreadconstraint}
Suppose you have a 4-node cluster where 3 Pods labelled `foo: bar` are located on
node1, node2 and node3 respectively:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
If you want an incoming Pod to be evenly spread with existing Pods across zones, you
can use a manifest similar to:
{{< codenew file="pods/topology-spread-constraints/one-constraint.yaml" >}}
From that manifest, `topologyKey: zone` implies the even distribution will only be applied
to nodes that are labelled `zone: <any value>` (nodes that don't have a `zone` label
are skipped). The field `whenUnsatisfiable: DoNotSchedule` tells the scheduler to let the
incoming Pod stay pending if the scheduler can't find a way to satisfy the constraint.
If the scheduler placed this incoming Pod into zone `A`, the distribution of Pods would
become `[3, 1]`. That means the actual skew is then 2 (calculated as `3 - 1`), which
violates `maxSkew: 1`. To satisfy the constraint in this example, the
incoming Pod can only be placed onto a node in zone `B`:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
p4(mypod) --> n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
OR
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
p4(mypod) --> n3
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
You can tweak the Pod spec to meet various kinds of requirements:
- Change `maxSkew` to a bigger value - such as `2` - so that the incoming Pod can
be placed into zone `A` as well.
- Change `topologyKey` to `node` so as to distribute the Pods evenly across nodes
instead of zones. In the above example, if `maxSkew` remains `1`, the incoming
Pod can only be placed onto the node `node4`.
- Change `whenUnsatisfiable: DoNotSchedule` to `whenUnsatisfiable: ScheduleAnyway`
  to ensure that the incoming Pod is always schedulable (assuming other scheduling APIs
  are satisfied). However, the scheduler prefers to place it into the topology domain that
  has fewer matching Pods. (Be aware that this preference is jointly normalized
  with other internal scheduling priorities such as resource usage ratio.)
### Example: multiple topology spread constraints {#example-multiple-topologyspreadconstraints}
This builds upon the previous example. Suppose you have a 4-node cluster where 3
existing Pods labeled `foo: bar` are located on node1, node2 and node3 respectively:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
You can combine two topology spread constraints to control the spread of Pods both
by node and by zone:
{{< codenew file="pods/topology-spread-constraints/two-constraints.yaml" >}}
In this case, to match the first constraint, the incoming Pod can only be placed onto
nodes in zone `B`; to match the second constraint, the incoming Pod can only be
scheduled to the node `node4`. The scheduler only considers options that satisfy all
defined constraints, so the only valid placement is onto node `node4`.
### Example: conflicting topology spread constraints {#example-conflicting-topologyspreadconstraints}
Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p4(Pod) --> n3(Node3)
p5(Pod) --> n3
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n1
p3(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3,p4,p5 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
If you were to apply
[`two-constraints.yaml`](https://raw.githubusercontent.com/kubernetes/website/main/content/en/examples/pods/topology-spread-constraints/two-constraints.yaml)
(the manifest from the previous example)
to **this** cluster, you would see that the Pod `mypod` stays in the `Pending` state.
This happens because, to satisfy the first constraint, the Pod `mypod` can only
be placed into zone `B`; to satisfy the second constraint, the Pod `mypod`
can only be scheduled onto node `node2`. The intersection of the two constraints returns
an empty set, and the scheduler cannot place the Pod.
To overcome this situation, you can either increase the value of `maxSkew` or modify
one of the constraints to use `whenUnsatisfiable: ScheduleAnyway`. Depending on
circumstances, you might also decide to delete an existing Pod manually - for example,
if you are troubleshooting why a bug-fix rollout is not making progress.
#### Interaction with node affinity and node selectors
If the incoming Pod has `spec.nodeSelector` or `spec.affinity.nodeAffinity` defined,
the scheduler excludes non-matching nodes from the skew calculations.
### Example: topology spread constraints with node affinity {#example-topologyspreadconstraints-with-nodeaffinity}
Suppose you have a 5-node cluster ranging across zones A to C:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
{{<mermaid>}}
graph BT
subgraph "zoneC"
n5(Node5)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n5 k8s;
class zoneC cluster;
{{< /mermaid >}}
and you know that zone `C` must be excluded. In this case, you can compose a manifest
as below, so that Pod `mypod` will be placed into zone `B` instead of zone `C`.
Similarly, Kubernetes also respects `spec.nodeSelector`.
{{< codenew file="pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml" >}}
## Implicit conventions
There are some implicit conventions worth noting here:
- Only Pods in the same namespace as the incoming Pod can be matching candidates.
- The scheduler bypasses any nodes that don't have any `topologySpreadConstraints[*].topologyKey`
present. This implies that:
1. any Pods located on those bypassed nodes do not impact `maxSkew` calculation - in the
above example, suppose the node `node1` does not have a label "zone", then the 2 Pods will
be disregarded, hence the incoming Pod will be scheduled into zone `A`.
2. the incoming Pod has no chance of being scheduled onto such nodes -
in the above example, suppose a node `node5` has the **mistyped** label `zone-typo: zoneC`
(and no `zone` label set). After node `node5` joins the cluster, it will be bypassed and
Pods for this workload aren't scheduled there.
- Be aware of what will happen if the incoming Pod's
`topologySpreadConstraints[*].labelSelector` doesn't match its own labels. In the
above example, if you remove the incoming Pod's labels, it can still be placed onto
nodes in zone `B`, since the constraints are still satisfied. However, after that
placement, the degree of imbalance of the cluster remains unchanged - it's still zone `A`
having 2 Pods labelled as `foo: bar`, and zone `B` having 1 Pod labelled as
`foo: bar`. If this is not what you expect, update the workload's
`topologySpreadConstraints[*].labelSelector` to match the labels in the pod template.
## Cluster-level default constraints
It is possible to set default topology spread constraints for a cluster. Default
topology spread constraints are applied to a Pod if, and only if:
- It doesn't define any constraints in its `.spec.topologySpreadConstraints`.
- It belongs to a Service, ReplicaSet, StatefulSet or ReplicationController.
Default constraints can be set as part of the `PodTopologySpread` plugin
arguments in a [scheduling profile](/docs/reference/scheduling/config/#profiles).
The constraints are specified using the same API as the [`topologySpreadConstraints` field](#topologyspreadconstraints-field), except that
`labelSelector` must be empty. The selectors are calculated from the Services,
ReplicaSets, StatefulSets or ReplicationControllers that the Pod belongs to.
An example configuration might look like the following:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
```
{{< note >}}
The [`SelectorSpread` plugin](/docs/reference/scheduling/config/#scheduling-plugins)
is disabled by default. The Kubernetes project recommends using `PodTopologySpread`
to achieve similar behavior.
{{< /note >}}
### Built-in default constraints {#internal-default-constraints}
{{< feature-state for_k8s_version="v1.24" state="stable" >}}
If you don't configure any cluster-level default constraints for pod topology spreading,
then kube-scheduler acts as if you specified the following default topology constraints:
```yaml
defaultConstraints:
- maxSkew: 3
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: "topology.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
```
Also, the legacy `SelectorSpread` plugin, which provides an equivalent behavior,
is disabled by default.
{{< note >}}
The `PodTopologySpread` plugin does not score the nodes that don't have
the topology keys specified in the spreading constraints. This might result
in a different default behavior compared to the legacy `SelectorSpread` plugin when
using the default topology constraints.
If your nodes are not expected to have **both** `kubernetes.io/hostname` and
`topology.kubernetes.io/zone` labels set, define your own constraints
instead of using the Kubernetes defaults.
{{< /note >}}
If you don't want to use the default Pod spreading constraints for your cluster,
you can disable those defaults by setting `defaultingType` to `List` and leaving
`defaultConstraints` empty in the `PodTopologySpread` plugin configuration:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints: []
defaultingType: List
```
## Comparison with podAffinity and podAntiAffinity {#comparison-with-podaffinity-podantiaffinity}
In Kubernetes, [inter-Pod affinity and anti-affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity)
control how Pods are scheduled in relation to one another - either more packed
or more scattered.
`podAffinity`
: attracts Pods; you can try to pack any number of Pods into qualifying
topology domain(s)
`podAntiAffinity`
: repels Pods. If you set this to `requiredDuringSchedulingIgnoredDuringExecution` mode then
only a single Pod can be scheduled into a single topology domain; if you choose
`preferredDuringSchedulingIgnoredDuringExecution` then you lose the ability to enforce the
constraint.
For finer control, you can specify topology spread constraints to distribute
Pods across different topology domains - to achieve either high availability or
cost-saving. This can also help with rolling out updates to workloads and with scaling out
replicas smoothly.
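As a hedged illustration of that finer control, a "soft" constraint using `ScheduleAnyway` sits between the two anti-affinity modes: every Pod can still be scheduled, but the scheduler is steered toward keeping topology domains within `maxSkew` of each other:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway   # a preference, not a hard requirement
    labelSelector:
      matchLabels:
        foo: bar
```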
For more context, see the
[Motivation](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/895-pod-topology-spread#motivation)
section of the enhancement proposal about Pod topology spread constraints.
## Known limitations
- There's no guarantee that the constraints remain satisfied when Pods are removed. For
example, scaling down a Deployment may result in imbalanced Pods distribution.
You can use a tool such as the [Descheduler](https://github.com/kubernetes-sigs/descheduler)
to rebalance the Pods distribution.
- Pods matched on tainted nodes are respected.
See [Issue 80921](https://github.com/kubernetes/kubernetes/issues/80921).
- The scheduler doesn't have prior knowledge of all the zones or other topology
domains that a cluster has. They are determined from the existing nodes in the
cluster. This could lead to a problem in autoscaled clusters, when a node pool (or
node group) is scaled to zero nodes, and you're expecting the cluster to scale up,
because, in this case, those topology domains won't be considered until there is
at least one node in them.
You can work around this by using a cluster autoscaling tool that is aware of
Pod topology spread constraints and is also aware of the overall set of topology
domains.
## {{% heading "whatsnext" %}}
- The blog article [Introducing PodTopologySpread](/blog/2020/05/introducing-podtopologyspread/)
explains `maxSkew` in some detail, as well as covering some advanced usage examples.
- Read the [scheduling](/docs/reference/kubernetes-api/workload-resources/pod-v1/#scheduling) section of
the API reference for Pod.

View File

@ -320,12 +320,12 @@ in the Pod Lifecycle documentation.
* Learn about the [lifecycle of a Pod](/docs/concepts/workloads/pods/pod-lifecycle/).
* Learn about [RuntimeClass](/docs/concepts/containers/runtime-class/) and how you can use it to
configure different Pods with different container runtime configurations.
* Read about [Pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
* Read about [PodDisruptionBudget](/docs/concepts/workloads/pods/disruptions/) and how you can use it to manage application availability during disruptions.
* Pod is a top-level resource in the Kubernetes REST API.
The {{< api-reference page="workload-resources/pod-v1" >}}
object definition describes the object in detail.
* [The Distributed System Toolkit: Patterns for Composite Containers](/blog/2015/06/the-distributed-system-toolkit-patterns/) explains common layouts for Pods with more than one container.
* Read about [Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
To understand the context for why Kubernetes wraps a common Pod API in other resources (such as {{< glossary_tooltip text="StatefulSets" term_id="statefulset" >}} or {{< glossary_tooltip text="Deployments" term_id="deployment" >}}), you can read about the prior art, including:

View File

@ -1,421 +0,0 @@
---
title: Pod Topology Spread Constraints
content_type: concept
weight: 40
---
<!-- overview -->
You can use _topology spread constraints_ to control how {{< glossary_tooltip text="Pods" term_id="Pod" >}} are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
<!-- body -->
## Prerequisites
### Node Labels
Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in. For example, a Node might have labels: `node=node1,zone=us-east-1a,region=us-east-1`
Suppose you have a 4-node cluster with the following labels:
```
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready <none> 4m26s v1.16.0 node=node1,zone=zoneA
node2 Ready <none> 3m58s v1.16.0 node=node2,zone=zoneA
node3 Ready <none> 3m17s v1.16.0 node=node3,zone=zoneB
node4 Ready <none> 2m43s v1.16.0 node=node4,zone=zoneB
```
Then the cluster is logically viewed as below:
{{<mermaid>}}
graph TB
subgraph "zoneB"
n3(Node3)
n4(Node4)
end
subgraph "zoneA"
n1(Node1)
n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
Instead of manually applying labels, you can also reuse the [well-known labels](/docs/reference/labels-annotations-taints/) that are created and populated automatically on most clusters.
## Spread Constraints for Pods
### API
The API field `pod.spec.topologySpreadConstraints` is defined as below:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: mypod
spec:
topologySpreadConstraints:
- maxSkew: <integer>
minDomains: <integer>
topologyKey: <string>
whenUnsatisfiable: <string>
labelSelector: <object>
```
You can define one or multiple `topologySpreadConstraint` to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. The fields are:
- **maxSkew** describes the degree to which Pods may be unevenly distributed.
It must be greater than zero. Its semantics differs according to the value of `whenUnsatisfiable`:
- when `whenUnsatisfiable` equals to "DoNotSchedule", `maxSkew` is the maximum
permitted difference between the number of matching pods in the target
topology and the global minimum
(the minimum number of pods that match the label selector in a topology domain.
For example, if you have 3 zones with 0, 2 and 3 matching pods respectively,
The global minimum is 0).
- when `whenUnsatisfiable` equals to "ScheduleAnyway", scheduler gives higher
precedence to topologies that would help reduce the skew.
- **minDomains** indicates a minimum number of eligible domains.
A domain is a particular instance of a topology. An eligible domain is a domain whose
nodes match the node selector.
- The value of `minDomains` must be greater than 0, when specified.
- When the number of eligible domains with match topology keys is less than `minDomains`,
Pod topology spread treats "global minimum" as 0, and then the calculation of `skew` is performed.
The "global minimum" is the minimum number of matching Pods in an eligible domain,
or zero if the number of eligible domains is less than `minDomains`.
- When the number of eligible domains with matching topology keys equals or is greater than
`minDomains`, this value has no effect on scheduling.
- When `minDomains` is nil, the constraint behaves as if `minDomains` is 1.
- When `minDomains` is not nil, the value of `whenUnsatisfiable` must be "`DoNotSchedule`".
{{< note >}}
The `minDomains` field is an alpha field added in 1.24. You have to enable the
`MinDomainsInPodToplogySpread` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
in order to use it.
{{< /note >}}
- **topologyKey** is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
- **whenUnsatisfiable** indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
- `DoNotSchedule` (default) tells the scheduler not to schedule it.
- `ScheduleAnyway` tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
- **labelSelector** is used to find matching Pods. Pods that match this label selector are counted to determine the number of Pods in their corresponding topology domain. See [Label Selectors](/docs/concepts/overview/working-with-objects/labels/#label-selectors) for more details.
When a Pod defines more than one `topologySpreadConstraint`, those constraints are ANDed: The kube-scheduler looks for a node for the incoming Pod that satisfies all the constraints.
You can read more about this field by running `kubectl explain Pod.spec.topologySpreadConstraints`.
### Example: One TopologySpreadConstraint
Suppose you have a 4-node cluster where 3 Pods labeled `foo:bar` are located in node1, node2 and node3 respectively:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
If we want an incoming Pod to be evenly spread with existing Pods across zones, the spec can be given as:
{{< codenew file="pods/topology-spread-constraints/one-constraint.yaml" >}}
`topologyKey: zone` implies the even distribution will only be applied to the nodes which have label pair "zone:&lt;any value&gt;" present. `whenUnsatisfiable: DoNotSchedule` tells the scheduler to let it stay pending if the incoming Pod can't satisfy the constraint.
If the scheduler placed this incoming Pod into "zoneA", the Pods distribution would become [3, 1], hence the actual skew is 2 (3 - 1) - which violates `maxSkew: 1`. In this example, the incoming Pod can only be placed into "zoneB":
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
p4(mypod) --> n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
OR
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
p4(mypod) --> n3
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
You can tweak the Pod spec to meet various kinds of requirements:
- Change `maxSkew` to a bigger value like "2" so that the incoming Pod can be placed into "zoneA" as well.
- Change `topologyKey` to "node" so as to distribute the Pods evenly across nodes instead of zones. In the above example, if `maxSkew` remains "1", the incoming Pod can only be placed onto "node4".
- Change `whenUnsatisfiable: DoNotSchedule` to `whenUnsatisfiable: ScheduleAnyway` to ensure the incoming Pod to be always schedulable (suppose other scheduling APIs are satisfied). However, it's preferred to be placed onto the topology domain which has fewer matching Pods. (Be aware that this preferability is jointly normalized with other internal scheduling priorities like resource usage ratio, etc.)
### Example: Multiple TopologySpreadConstraints
This builds upon the previous example. Suppose you have a 4-node cluster where 3 Pods labeled `foo:bar` are located in node1, node2 and node3 respectively:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
You can use 2 TopologySpreadConstraints to control the Pods spreading on both zone and node:
{{< codenew file="pods/topology-spread-constraints/two-constraints.yaml" >}}
In this case, to match the first constraint, the incoming Pod can only be placed into "zoneB"; while in terms of the second constraint, the incoming Pod can only be placed onto "node4". Then the results of 2 constraints are ANDed, so the only viable option is to place on "node4".
Multiple constraints can lead to conflicts. Suppose you have a 3-node cluster across 2 zones:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p4(Pod) --> n3(Node3)
p5(Pod) --> n3
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n1
p3(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3,p4,p5 k8s;
class zoneA,zoneB cluster;
{{< /mermaid >}}
If you apply "two-constraints.yaml" to this cluster, you will notice "mypod" stays in `Pending` state. This is because: to satisfy the first constraint, "mypod" can only placed into "zoneB"; while in terms of the second constraint, "mypod" can only be placed onto "node2". Then a joint result of "zoneB" and "node2" returns nothing.
To overcome this situation, you can either increase the `maxSkew` or modify one of the constraints to use `whenUnsatisfiable: ScheduleAnyway`.
### Interaction With Node Affinity and Node Selectors
The scheduler will skip the non-matching nodes from the skew calculations if the incoming Pod has `spec.nodeSelector` or `spec.affinity.nodeAffinity` defined.
### Example: TopologySpreadConstraints with NodeAffinity
Suppose you have a 5-node cluster ranging from zoneA to zoneC:
{{<mermaid>}}
graph BT
subgraph "zoneB"
p3(Pod) --> n3(Node3)
n4(Node4)
end
subgraph "zoneA"
p1(Pod) --> n1(Node1)
p2(Pod) --> n2(Node2)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n1,n2,n3,n4,p1,p2,p3 k8s;
class p4 plain;
class zoneA,zoneB cluster;
{{< /mermaid >}}
{{<mermaid>}}
graph BT
subgraph "zoneC"
n5(Node5)
end
classDef plain fill:#ddd,stroke:#fff,stroke-width:4px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:4px,color:#fff;
classDef cluster fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
class n5 k8s;
class zoneC cluster;
{{< /mermaid >}}
and you know that "zoneC" must be excluded. In this case, you can compose the yaml as below, so that "mypod" will be placed into "zoneB" instead of "zoneC". Similarly `spec.nodeSelector` is also respected.
{{< codenew file="pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml" >}}
The scheduler doesn't have prior knowledge of all the zones or other topology domains that a cluster has. They are determined from the existing nodes in the cluster. This could lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to zero nodes and the user is expecting them to scale up, because, in this case, those topology domains won't be considered until there is at least one node in them.
### Other Noticeable Semantics
There are some implicit conventions worth noting here:
- Only the Pods holding the same namespace as the incoming Pod can be matching candidates.
- The scheduler will bypass the nodes without `topologySpreadConstraints[*].topologyKey` present. This implies that:
1. the Pods located on those nodes do not impact `maxSkew` calculation - in the above example, suppose "node1" does not have label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into "zoneA".
2. the incoming Pod has no chances to be scheduled onto such nodes - in the above example, suppose a "node5" carrying label `{zone-typo: zoneC}` joins the cluster, it will be bypassed due to the absence of label key "zone".
- Be aware of what will happen if the incoming Pod's `topologySpreadConstraints[*].labelSelector` doesn't match its own labels. In the above example, if we remove the incoming Pod's labels, it can still be placed into "zoneB" since the constraints are still satisfied. However, after the placement, the degree of imbalance of the cluster remains unchanged - it's still zoneA having 2 Pods which hold label {foo:bar}, and zoneB having 1 Pod which holds label {foo:bar}. So if this is not what you expect, we recommend the workload's `topologySpreadConstraints[*].labelSelector` to match its own labels.
### Cluster-level default constraints
It is possible to set default topology spread constraints for a cluster. Default
topology spread constraints are applied to a Pod if, and only if:
- It doesn't define any constraints in its `.spec.topologySpreadConstraints`.
- It belongs to a service, replication controller, replica set or stateful set.
Default constraints can be set as part of the `PodTopologySpread` plugin args
in a [scheduling profile](/docs/reference/scheduling/config/#profiles).
The constraints are specified with the same [API above](#api), except that
`labelSelector` must be empty. The selectors are calculated from the services,
replication controllers, replica sets or stateful sets that the Pod belongs to.
An example configuration might look like follows:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
```
{{< note >}}
[`SelectorSpread` plugin](/docs/reference/scheduling/config/#scheduling-plugins)
is disabled by default. It's recommended to use `PodTopologySpread` to achieve similar
behavior.
{{< /note >}}
#### Built-in default constraints {#internal-default-constraints}
{{< feature-state for_k8s_version="v1.24" state="stable" >}}
If you don't configure any cluster-level default constraints for pod topology spreading,
then kube-scheduler acts as if you specified the following default topology constraints:
```yaml
defaultConstraints:
- maxSkew: 3
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: "topology.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
```
Also, the legacy `SelectorSpread` plugin, which provides an equivalent behavior,
is disabled by default.
{{< note >}}
The `PodTopologySpread` plugin does not score the nodes that don't have
the topology keys specified in the spreading constraints. This might result
in a different default behavior compared to the legacy `SelectorSpread` plugin when
using the default topology constraints.
If your nodes are not expected to have **both** `kubernetes.io/hostname` and
`topology.kubernetes.io/zone` labels set, define your own constraints
instead of using the Kubernetes defaults.
{{< /note >}}
If you don't want to use the default Pod spreading constraints for your cluster,
you can disable those defaults by setting `defaultingType` to `List` and leaving
empty `defaultConstraints` in the `PodTopologySpread` plugin configuration:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints: []
defaultingType: List
```
## Comparison with PodAffinity/PodAntiAffinity
In Kubernetes, directives related to "Affinity" control how Pods are
scheduled - more packed or more scattered.
- For `PodAffinity`, you can try to pack any number of Pods into qualifying
topology domain(s)
- For `PodAntiAffinity`, only one Pod can be scheduled into a
single topology domain.
For finer control, you can specify topology spread constraints to distribute
Pods across different topology domains - to achieve either high availability or
cost-saving. This can also help on rolling update workloads and scaling out
replicas smoothly. See
[Motivation](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/895-pod-topology-spread#motivation)
for more details.
## Known Limitations
- There's no guarantee that the constraints remain satisfied when Pods are removed. For example, scaling down a Deployment may result in imbalanced Pods distribution.
You can use [Descheduler](https://github.com/kubernetes-sigs/descheduler) to rebalance the Pods distribution.
- Pods matched on tainted nodes are respected. See [Issue 80921](https://github.com/kubernetes/kubernetes/issues/80921)
## {{% heading "whatsnext" %}}
- [Blog: Introducing PodTopologySpread](/blog/2020/05/introducing-podtopologyspread/)
explains `maxSkew` in details, as well as bringing up some advanced usage examples.

View File

@ -438,7 +438,7 @@ Note that the live editor doesn't recognize Hugo shortcodes.
### Example 1 - Pod topology spread constraints
Figure 6 shows the diagram appearing in the
[Pod topology pread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/#node-labels)
[Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/#node-labels)
page.
{{< mermaid >}}

View File

@ -808,7 +808,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
availability during update per node.
See [Perform a Rolling Update on a DaemonSet](/docs/tasks/manage-daemon/update-daemon-set/).
- `DefaultPodTopologySpread`: Enables the use of `PodTopologySpread` scheduling plugin to do
[default spreading](/docs/concepts/workloads/pods/pod-topology-spread-constraints/#internal-default-constraints).
[default spreading](/docs/concepts/scheduling-eviction/topology-spread-constraints/#internal-default-constraints).
- `DelegateFSGroupToCSIDriver`: If supported by the CSI driver, delegates the
role of applying `fsGroup` from a Pod's `securityContext` to the driver by
passing `fsGroup` through the NodeStageVolume and NodePublishVolume CSI calls.
@ -854,7 +854,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
{{< glossary_tooltip text="ephemeral containers" term_id="ephemeral-container" >}}
to running pods.
- `EvenPodsSpread`: Enable pods to be scheduled evenly across topology domains. See
[Pod Topology Spread Constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
[Pod Topology Spread Constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/).
- `ExecProbeTimeout`: Ensure kubelet respects exec probe timeouts.
This feature gate exists in case any of your existing workloads depend on a
now-corrected fault where Kubernetes ignored exec probe timeouts. See
@ -995,7 +995,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
- `MemoryQoS`: Enable memory protection and usage throttle on pod / container using
cgroup v2 memory controller.
- `MinDomainsInPodTopologySpread`: Enable `minDomains` in Pod
[topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
[topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/).
- `MixedProtocolLBService`: Enable using different protocols in the same `LoadBalancer` type
Service instance.
- `MountContainers`: Enable using utility containers on host as the volume mounter.

View File

@ -123,7 +123,7 @@ extension points:
and [node affinity](/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity).
Extension points: `filter`, `score`.
- `PodTopologySpread`: Implements
[Pod topology spread](/docs/concepts/workloads/pods/pod-topology-spread-constraints/).
[Pod topology spread](/docs/concepts/scheduling-eviction/topology-spread-constraints/).
Extension points: `preFilter`, `filter`, `preScore`, `score`.
- `NodeUnschedulable`: Filters out nodes that have `.spec.unschedulable` set to
true.

View File

@ -63,7 +63,7 @@ These labels can include
If your cluster spans multiple zones or regions, you can use node labels
in conjunction with
[Pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
[Pod topology spread constraints](/docs/concepts/scheduling-eviction/topology-spread-constraints/)
to control how Pods are spread across your cluster among fault domains:
regions, zones, and even specific nodes.
These hints enable the

View File

@ -158,6 +158,7 @@
/docs/concepts/workloads/controllers/statefulset.md /docs/concepts/workloads/controllers/statefulset/ 301!
/docs/concepts/workloads/pods/pod/ /docs/concepts/workloads/pods/ 301
/docs/concepts/workloads/pods/pod-overview/ /docs/concepts/workloads/pods/ 301
/docs/concepts/workloads/pods/pod-topology-spread-constraints/ /docs/concepts/scheduling-eviction/topology-spread-constraints/ 301
/docs/concepts/workloads/pods/init-containers/Kubernetes/ /docs/concepts/workloads/pods/init-containers/ 301
/docs/concepts/policy/pod-security-policy/ /docs/concepts/security/pod-security-policy/ 301