---
title: "Introducing PodTopologySpread"
date: 2020-05-05
slug: introducing-podtopologyspread
url: /blog/2020/05/Introducing-PodTopologySpread
---

**Author:** Wei Huang (IBM), Aldo Culquicondor (Google)

Managing Pod distribution across a cluster is hard. The well-known Kubernetes
features for Pod affinity and anti-affinity allow some control of Pod placement
in different topologies. However, these features only address part of the Pod
distribution use cases: either place an unlimited number of Pods in a single
topology, or disallow two Pods from co-locating in the same topology. In between
these two extreme cases, there is a common need to distribute Pods evenly across
topologies, so as to achieve better cluster utilization and higher availability
of applications.

The PodTopologySpread scheduling plugin (originally proposed as EvenPodsSpread)
was designed to fill that gap. We promoted it to beta in 1.18.

## API changes

A new field `topologySpreadConstraints` is introduced in the Pod's spec API:

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: <integer>
    topologyKey: <string>
    whenUnsatisfiable: <string>
    labelSelector: <object>
```

As this API is embedded in the Pod's spec, you can use this feature in all the
high-level workload APIs, such as Deployment, DaemonSet, StatefulSet, etc.

Let's look at an example cluster to understand this API.

![API](/images/blog/2020-05-05-introducing-podtopologyspread/api.png)

- **labelSelector** is used to find matching Pods. For each topology, we count
  the number of Pods that match this label selector. In the above example, given
  the labelSelector as "app: foo", the matching number in "zone1" is 2, while
  the number in "zone2" is 0.
- **topologyKey** is the key that defines a topology in the Nodes' labels. In
  the above example, some Nodes are grouped into "zone1" because they carry the
  "zone=zone1" label, while the other ones are grouped into "zone2".
- **maxSkew** describes the maximum degree to which Pods can be unevenly
  distributed. In the above example:
  - if the incoming Pod is placed in "zone1", the skew on "zone1" becomes 3 (3
    Pods matched in "zone1"; global minimum of 0 Pods matched in "zone2"), which
    violates the "maxSkew: 1" constraint.
  - if the incoming Pod is placed in "zone2", the skew on "zone2" is 0 (1 Pod
    matched in "zone2"; global minimum of 1 Pod matched in "zone2" itself),
    which satisfies the "maxSkew: 1" constraint. Note that the skew is
    calculated per qualified Node, rather than as a global skew.
- **whenUnsatisfiable** specifies what action should be taken when "maxSkew"
  can't be satisfied:
  - `DoNotSchedule` (default) tells the scheduler not to schedule the Pod. It's
    a hard constraint.
  - `ScheduleAnyway` tells the scheduler to still schedule the Pod while
    prioritizing Nodes that reduce the skew. It's a soft constraint.

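Putting these fields together, a complete Pod manifest for the constraint
pictured above could look roughly like the following sketch. The `app: foo`
label and the `zone` topology key come from the example; the Pod name and
container image are placeholders.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod              # placeholder name
  labels:
    app: foo               # counted by the labelSelector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # tolerate at most 1 more matching Pod than the least-loaded zone
    topologyKey: zone                 # group Nodes by their "zone" label, as in the example
    whenUnsatisfiable: DoNotSchedule  # hard constraint: keep the Pod Pending rather than violate the skew
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2       # placeholder container image
```
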
## Advanced usage

As the feature name "PodTopologySpread" implies, the basic usage of this feature
is to run your workload in an absolutely even manner (maxSkew=1) or a relatively
even manner (maxSkew>=2). See the [official
documentation](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
for more details.

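For the "relatively even" case, only the `maxSkew` value changes. A minimal
sketch of such a constraint (a fragment of a Pod spec, reusing the assumed
`app: foo` label and `zone` key from above) would be:

```yaml
topologySpreadConstraints:
- maxSkew: 2                       # up to 2 more matching Pods than the least-loaded zone is acceptable
  topologyKey: zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: foo
```
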
In addition to this basic usage, there are some advanced usage examples that
enable your workloads to benefit from high availability and improved cluster
utilization.

### Usage along with NodeSelector / NodeAffinity

You may have noticed that there is no "topologyValues" field to limit which
topologies the Pods are going to be scheduled to. By default, the scheduler
searches all Nodes and groups them by "topologyKey". Sometimes this is not the
ideal case. For instance, suppose there is a cluster with Nodes labeled
"env=prod", "env=staging" and "env=qa", and you now want to evenly place Pods in
the "qa" environment across zones. Is that possible?

The answer is yes. You can leverage the NodeSelector or NodeAffinity API spec.
Under the hood, the PodTopologySpread feature will **honor** that and calculate
the spread constraints among the Nodes that satisfy the selectors.

![Advanced-Usage-1](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-1.png)

As illustrated above, you can specify `spec.affinity.nodeAffinity` to limit the
"searching scope" to the "qa" environment, and within that scope the Pod will be
scheduled to a zone that satisfies the topologySpreadConstraints. In this case,
that is "zone2".

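A sketch of such a Pod spec is shown below, assuming the Nodes carry an `env`
label and a `zone` label as in the figure; the `app: foo` label, the Pod name,
and the container image are again placeholders.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    app: foo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env              # limit the "searching scope" to qa Nodes
            operator: In
            values:
            - qa
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone             # spread evenly across zones within that scope
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```
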
### Multiple TopologySpreadConstraints

It's intuitive to understand how a single TopologySpreadConstraint works. What
about multiple TopologySpreadConstraints? Internally, each
TopologySpreadConstraint is calculated independently, and the result sets are
then intersected to generate the eventual result set - i.e., the suitable Nodes.

In the following example, we want to schedule a Pod onto a cluster while meeting
two requirements at the same time:

- place the Pod evenly with matching Pods across zones
- place the Pod evenly with matching Pods across Nodes

![Advanced-Usage-2](/images/blog/2020-05-05-introducing-podtopologyspread/advanced-usage-2.png)

For the first constraint, there are 3 Pods in zone1 and 2 Pods in zone2, so the
incoming Pod can only be placed in zone2 to satisfy the "maxSkew=1" constraint.
In other words, the result set is {nodeX, nodeY}.

For the second constraint, there are too many Pods on nodeB and nodeX, so the
incoming Pod can only be placed on nodeA or nodeY.

Now we can conclude that the only qualified Node is nodeY - the intersection of
the sets {nodeX, nodeY} (from the first constraint) and {nodeA, nodeY} (from the
second constraint).

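Expressed as a Pod spec, the two requirements correspond to two entries in
`topologySpreadConstraints`, roughly as sketched below. The `app: foo` label and
the `zone` key are carried over from the earlier examples, and
`kubernetes.io/hostname` is used here as the standard per-Node label; adjust the
keys to whatever labels your Nodes actually carry.

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    app: foo
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone                     # requirement 1: spread across zones
    whenUnsatisfiable: DoNotSchedule      # hard constraint
    labelSelector:
      matchLabels:
        app: foo
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # requirement 2: spread across individual Nodes
    whenUnsatisfiable: DoNotSchedule      # hard constraint
    labelSelector:
      matchLabels:
        app: foo
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.2
```

Changing `whenUnsatisfiable` to `ScheduleAnyway` on either entry turns it into a
soft constraint, which is the kind of hard/soft mix discussed below.
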
Using multiple TopologySpreadConstraints is powerful, but be sure to understand
the difference from the preceding "NodeSelector/NodeAffinity" example: the
former calculates a result set for each constraint independently and then
intersects them, while the latter evaluates the topologySpreadConstraints only
on the Nodes that pass the node constraints.

Instead of using "hard" constraints for all topologySpreadConstraints, you can
also combine "hard" constraints and "soft" constraints to adapt to more diverse
cluster situations.

{{< note >}}
If two TopologySpreadConstraints are applied to the same {topologyKey,
whenUnsatisfiable} tuple, the Pod creation will be blocked with a validation
error.
{{< /note >}}

## PodTopologySpread defaults

PodTopologySpread is a Pod-level API. As such, to use the feature, workload
authors need to be aware of the underlying topology of the cluster, and then
specify proper `topologySpreadConstraints` in the Pod spec for every workload.
While the Pod-level API gives the most flexibility, it is also possible to
specify cluster-level defaults.

The default PodTopologySpread constraints allow you to specify spreading for all
the workloads in the cluster, tailored for its topology. The constraints can be
specified by an operator/admin as PodTopologySpread plugin arguments in the
[scheduling profile configuration
API](/docs/reference/scheduling/profiles/) when starting
kube-scheduler.

A sample configuration could look like this:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```

When configuring default constraints, label selectors must be left empty.
kube-scheduler will deduce the label selectors from the Pod's membership in
Services, ReplicationControllers, ReplicaSets or StatefulSets. Pods can
always override the default constraints by providing their own through the
PodSpec.

{{< note >}}
When using default PodTopologySpread constraints, it is recommended to disable
the old `DefaultPodTopologySpread` plugin.
{{< /note >}}

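One hedged way to do that, assuming the legacy plugin is registered under the
name `DefaultPodTopologySpread` at the score extension point of the v1alpha2
profile API, is to disable it in the same profile that carries the default
constraints; double-check the plugin name against your scheduler version.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  - plugins:
      score:
        disabled:
          - name: DefaultPodTopologySpread  # assumed name of the legacy spreading plugin
    pluginConfig:                           # the defaultConstraints from the sample above go here
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: example.com/rack
              whenUnsatisfiable: ScheduleAnyway
```
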
## Wrap-up

PodTopologySpread allows you to define spreading constraints for your workloads
with a flexible and expressive Pod-level API. In the past, workload authors used
Pod AntiAffinity rules to force or hint the scheduler to run a single Pod per
topology domain. In contrast, the new PodTopologySpread constraints allow Pods
to specify skew levels that can be required (hard) or desired (soft). The
feature can be paired with Node selectors and Node affinity to limit the
spreading to specific domains. Pod spreading constraints can be defined for
different topologies such as hostnames, zones, regions, racks, etc.

Lastly, cluster operators can define default constraints to be applied to all
Pods. This way, Pods don't need to be aware of the underlying topology of the
cluster.