---
title: "Introducing PodTopologySpread"
date: 2020-05-05
slug: introducing-podtopologyspread
url: /blog/2020/05/Introducing-PodTopologySpread
---

**Author:** Wei Huang (IBM), Aldo Culquicondor (Google)

Managing Pod distribution across a cluster is hard. The well-known Kubernetes
features for Pod affinity and anti-affinity allow some control of Pod placement
in different topologies. However, these features only address part of the Pod
distribution use cases: either place an unlimited number of Pods in a single
topology, or disallow two Pods from co-locating in the same topology. In between
these two extreme cases, there is a common need to distribute Pods evenly across
topologies, so as to achieve better cluster utilization and higher availability
of applications.

The PodTopologySpread scheduling plugin (originally proposed as EvenPodsSpread)
was designed to fill that gap. We promoted it to beta in 1.18.

## API changes

A new field `topologySpreadConstraints` is introduced in the Pod's spec API:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
```

As this API is embedded in the Pod's spec, you can use this feature in all the
high-level workload APIs, such as Deployment, DaemonSet, StatefulSet, etc.

Let's look at an example cluster to understand this API.

![API](/images/blog/2020-05-05-introducing-podtopologyspread/api.png)

- **labelSelector** is used to find matching Pods. For each topology, we count
  the number of Pods that match this label selector. In the above example, given
  the labelSelector as "app: foo", the matching number in "zone1" is 2, while
  the number in "zone2" is 0.
- **topologyKey** is the key that defines a topology in the Nodes' labels. In
  the above example, some Nodes are grouped into "zone1" because they carry the
  "zone=zone1" label, while the other ones are grouped into "zone2".
- **maxSkew** describes the maximum degree to which Pods can be unevenly
  distributed. In the above example:
  - if the incoming Pod is placed in "zone1", the skew on "zone1" becomes 3 (3
    Pods matched in "zone1"; global minimum of 0 Pods matched in "zone2"), which
    violates the "maxSkew: 1" constraint.
  - if the incoming Pod is placed in "zone2", the skew on "zone2" is 0 (1 Pod
    matched in "zone2"; global minimum of 1 Pod matched in "zone2" itself),
    which satisfies the "maxSkew: 1" constraint. Note that the skew is
    calculated per qualified Node, rather than as a global skew.
- **whenUnsatisfiable** specifies what action should be taken when "maxSkew"
  can't be satisfied:
  - `DoNotSchedule` (default) tells the scheduler not to schedule the Pod. It's
    a hard constraint.
  - `ScheduleAnyway` tells the scheduler to still schedule the Pod while
    prioritizing Nodes that reduce the skew. It's a soft constraint.

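Putting these fields together, a complete Pod manifest for the constraint
pictured above could look roughly like the following sketch. The `app: foo`
label and the `zone` topology key come from the example; the Pod name and
container image are placeholders.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod              # placeholder name
  labels:
    app: foo               # counted by the labelSelector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # tolerate at most 1 more matching Pod than the least-loaded zone
    topologyKey: zone                 # group Nodes by their "zone" label, as in the example
    whenUnsatisfiable: DoNotSchedule  # hard constraint: keep the Pod Pending rather than violate the skew
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2       # placeholder container image
```
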
## Advanced usage

As the feature name "PodTopologySpread" implies, the basic usage of this feature
is to run your workload in an absolutely even manner (maxSkew=1) or a relatively
even manner (maxSkew>=2). See the [official
documentation](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
for more details.

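For the "relatively even" case, only the `maxSkew` value changes. A minimal
sketch of such a constraint (a fragment of a Pod spec, reusing the assumed
`app: foo` label and `zone` key from above) would be:

```yaml
topologySpreadConstraints:
- maxSkew: 2                       # up to 2 more matching Pods than the least-loaded zone is acceptable
  topologyKey: zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: foo
```
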
In addition to this basic usage, there are some advanced usage examples that
enable your workloads to benefit from high availability and improved cluster
utilization.

### Usage along with NodeSelector / NodeAffinity

You may have noticed that there is no "topologyValues" field to limit which
topologies the Pods are going to be scheduled to. By default, the scheduler
searches all Nodes and groups them by "topologyKey". Sometimes this is not the
ideal case. For instance, suppose there is a cluster with Nodes labeled
"env=prod", "env=staging" and "env=qa", and you now want to evenly place Pods in
the "qa" environment across zones. Is that possible?

The answer is yes. You can leverage the NodeSelector or NodeAffinity API spec.
Under the hood, the PodTopologySpread feature will **honor** that and calculate
the spread constraints among the Nodes that satisfy the selectors.

![Advanced-Usage-1](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-1.png)

As illustrated above, you can specify `spec.affinity.nodeAffinity` to limit the
"searching scope" to the "qa" environment, and within that scope the Pod will be
scheduled to a zone that satisfies the topologySpreadConstraints. In this case,
that is "zone2".

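A sketch of such a Pod spec is shown below, assuming the Nodes carry an `env`
label and a `zone` label as in the figure; the `app: foo` label, the Pod name,
and the container image are again placeholders.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    app: foo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env              # limit the "searching scope" to qa Nodes
            operator: In
            values:
            - qa
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone             # spread evenly across zones within that scope
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```
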
### Multiple TopologySpreadConstraints

It's intuitive to understand how a single TopologySpreadConstraint works. What
about multiple TopologySpreadConstraints? Internally, each
TopologySpreadConstraint is calculated independently, and the result sets are
then intersected to generate the eventual result set - i.e., the suitable Nodes.

In the following example, we want to schedule a Pod onto a cluster while meeting
two requirements at the same time:

- place the Pod evenly with matching Pods across zones
- place the Pod evenly with matching Pods across Nodes

![Advanced-Usage-2](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-2.png)

For the first constraint, there are 3 Pods in zone1 and 2 Pods in zone2, so the
incoming Pod can only be placed in zone2 to satisfy the "maxSkew=1" constraint.
In other words, the result set is {nodeX, nodeY}.

For the second constraint, there are too many Pods on nodeB and nodeX, so the
incoming Pod can only be placed on nodeA or nodeY.

Now we can conclude that the only qualified Node is nodeY - the intersection of
the sets {nodeX, nodeY} (from the first constraint) and {nodeA, nodeY} (from the
second constraint).

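Expressed as a Pod spec, the two requirements correspond to two entries in
`topologySpreadConstraints`, roughly as sketched below. The `app: foo` label and
the `zone` key are carried over from the earlier examples, and
`kubernetes.io/hostname` is used here as the standard per-Node label; adjust the
keys to whatever labels your Nodes actually carry.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    app: foo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone                     # requirement 1: spread across zones
    whenUnsatisfiable: DoNotSchedule      # hard constraint
    labelSelector:
      matchLabels:
        app: foo
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # requirement 2: spread across individual Nodes
    whenUnsatisfiable: DoNotSchedule      # hard constraint
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```

Changing `whenUnsatisfiable` to `ScheduleAnyway` on either entry turns it into a
soft constraint, which is the kind of hard/soft mix discussed below.
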
Using multiple TopologySpreadConstraints is powerful, but be sure to understand
the difference from the preceding "NodeSelector/NodeAffinity" example: the
former calculates a result set for each constraint independently and then
intersects them, while the latter evaluates the topologySpreadConstraints only
on the Nodes that pass the node constraints.

Instead of using "hard" constraints for all topologySpreadConstraints, you can
also combine "hard" constraints and "soft" constraints to adapt to more diverse
cluster situations.

{{< note >}}
If two TopologySpreadConstraints are applied to the same {topologyKey,
whenUnsatisfiable} tuple, the Pod creation will be blocked with a validation
error.
{{< /note >}}

## PodTopologySpread defaults

PodTopologySpread is a Pod-level API. As such, to use the feature, workload
authors need to be aware of the underlying topology of the cluster, and then
specify proper `topologySpreadConstraints` in the Pod spec for every workload.
While the Pod-level API gives the most flexibility, it is also possible to
specify cluster-level defaults.

The default PodTopologySpread constraints allow you to specify spreading for all
the workloads in the cluster, tailored for its topology. The constraints can be
specified by an operator/admin as PodTopologySpread plugin arguments in the
[scheduling profile configuration
API](/docs/reference/scheduling/profiles/) when starting
kube-scheduler.

A sample configuration could look like this:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```

When configuring default constraints, label selectors must be left empty.
kube-scheduler will deduce the label selectors from the Pod's membership in
Services, ReplicationControllers, ReplicaSets or StatefulSets. Pods can
always override the default constraints by providing their own through the
PodSpec.

{{< note >}}
When using default PodTopologySpread constraints, it is recommended to disable
the old `DefaultPodTopologySpread` plugin.
{{< /note >}}

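One hedged way to do that, assuming the legacy plugin is registered under the
name `DefaultPodTopologySpread` at the score extension point of the v1alpha2
profile API, is to disable it in the same profile that carries the default
constraints; double-check the plugin name against your scheduler version.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - plugins:
      score:
        disabled:
          - name: DefaultPodTopologySpread  # assumed name of the legacy spreading plugin
    pluginConfig:                           # the defaultConstraints from the sample above go here
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```
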
## Wrap-up

PodTopologySpread allows you to define spreading constraints for your workloads
with a flexible and expressive Pod-level API. In the past, workload authors used
Pod AntiAffinity rules to force or hint the scheduler to run a single Pod per
topology domain. In contrast, the new PodTopologySpread constraints allow Pods
to specify skew levels that can be required (hard) or desired (soft). The
feature can be paired with Node selectors and Node affinity to limit the
spreading to specific domains. Pod spreading constraints can be defined for
different topologies such as hostnames, zones, regions, racks, etc.

Lastly, cluster operators can define default constraints to be applied to all
Pods. This way, Pods don't need to be aware of the underlying topology of the
cluster.