Merge pull request #24783 from thockin/kep-1659-doc-topology-labels

Better docs for standard topology labels
Kubernetes Prow Robot 2020-11-04 07:38:04 -08:00 committed by GitHub
commit c779b2152b
8 changed files with 32 additions and 49 deletions

@@ -271,7 +271,7 @@ preempted. Here's an example:
* Pod P is being considered for Node N.
* Pod Q is running on another Node in the same Zone as Node N.
* Pod P has Zone-wide anti-affinity with Pod Q (`topologyKey:
failure-domain.beta.kubernetes.io/zone`).
topology.kubernetes.io/zone`).
* There are no other cases of anti-affinity between Pod P and other Pods in
the Zone.
* In order to schedule Pod P on Node N, Pod Q can be preempted, but scheduler
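For readers following along, a minimal sketch of what Pod P's anti-affinity term might look like with the updated label; the pod names and the `app: pod-q` label are illustrative assumptions, not part of this change:

```yaml
# Hypothetical Pod P with zone-wide anti-affinity against pods labelled app: pod-q
apiVersion: v1
kind: Pod
metadata:
  name: pod-p
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: pod-q          # assumed label carried by Pod Q
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```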

@@ -200,8 +200,8 @@ The affinity on this pod defines one pod affinity rule and one pod anti-affinity
while the `podAntiAffinity` is `preferredDuringSchedulingIgnoredDuringExecution`. The
pod affinity rule says that the pod can be scheduled onto a node only if that node is in the same zone
as at least one already-running pod that has a label with key "security" and value "S1". (More precisely, the pod is eligible to run
on node N if node N has a label with key `failure-domain.beta.kubernetes.io/zone` and some value V
such that there is at least one node in the cluster with key `failure-domain.beta.kubernetes.io/zone` and
on node N if node N has a label with key `topology.kubernetes.io/zone` and some value V
such that there is at least one node in the cluster with key `topology.kubernetes.io/zone` and
value V that is running a pod that has a label with key "security" and value "S1".) The pod anti-affinity
rule says that the pod cannot be scheduled onto a node if that node is in the same zone as a pod with
label having key "security" and value "S2". See the
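Condensed, the affinity stanza this paragraph describes looks roughly like the following (the full manifest is the example file updated at the end of this change):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S1
      topologyKey: topology.kubernetes.io/zone
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S2
        topologyKey: topology.kubernetes.io/zone
```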

@@ -209,7 +209,7 @@ parameters:
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
- key: failure-domain.beta.kubernetes.io/zone
- key: topology.kubernetes.io/zone
values:
- us-central1-a
- us-central1-b
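For context, a complete StorageClass using the updated label might look like the sketch below; the name and provisioner are assumptions for illustration, and only the `allowedTopologies` stanza comes from the hunk above:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-topology-aware    # illustrative name
provisioner: kubernetes.io/gce-pd # assumed provisioner
# parameters omitted; see the full example in the changed file
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-central1-a
    - us-central1-b
```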

@@ -449,7 +449,7 @@ spec:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
- key: topology.kubernetes.io/zone
operator: In
values:
- us-central1-a
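The fragment above is a volume node-affinity constraint; a self-contained sketch of a PersistentVolume using it, where the name, capacity, and volume source are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv              # illustrative
spec:
  capacity:
    storage: 100Gi              # illustrative
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:            # assumed volume source
    pdName: example-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a
```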

@@ -534,8 +534,8 @@ and kubelets will not be allowed to modify labels with that prefix.
* `kubernetes.io/os`
* `beta.kubernetes.io/instance-type`
* `node.kubernetes.io/instance-type`
* `failure-domain.beta.kubernetes.io/region`
* `failure-domain.beta.kubernetes.io/zone`
* `failure-domain.beta.kubernetes.io/region` (deprecated)
* `failure-domain.beta.kubernetes.io/zone` (deprecated)
* `topology.kubernetes.io/region`
* `topology.kubernetes.io/zone`
* `kubelet.kubernetes.io/`-prefixed labels

@@ -967,7 +967,7 @@ WindowsEndpointSliceProxying=true|false (ALPHA - default=false)<br/>
<td colspan="2">--node-labels mapStringString</td>
</tr>
<tr>
<td></td><td style="line-height: 130%; word-wrap: break-word;">&lt;Warning: Alpha feature&gt; Labels to add when registering the node in the cluster. Labels must be `key=value` pairs separated by `,`. Labels in the `kubernetes.io` namespace must begin with an allowed prefix (`kubelet.kubernetes.io`, `node.kubernetes.io`) or be in the specifically allowed set (`beta.kubernetes.io/arch`, `beta.kubernetes.io/instance-type`, `beta.kubernetes.io/os`, `failure-domain.beta.kubernetes.io/region`, `failure-domain.beta.kubernetes.io/zone`, `failure-domain.kubernetes.io/region`, `failure-domain.kubernetes.io/zone`, `kubernetes.io/arch`, `kubernetes.io/hostname`, `kubernetes.io/instance-type`, `kubernetes.io/os`)</td>
<td></td><td style="line-height: 130%; word-wrap: break-word;">&lt;Warning: Alpha feature&gt; Labels to add when registering the node in the cluster. Labels must be `key=value` pairs separated by `,`. Labels in the `kubernetes.io` namespace must begin with an allowed prefix (`kubelet.kubernetes.io`, `node.kubernetes.io`) or be in the specifically allowed set (`beta.kubernetes.io/arch`, `beta.kubernetes.io/instance-type`, `beta.kubernetes.io/os`, `failure-domain.beta.kubernetes.io/region`, `failure-domain.beta.kubernetes.io/zone`, `kubernetes.io/arch`, `kubernetes.io/hostname`, `kubernetes.io/os`, `node.kubernetes.io/instance-type`, `topology.kubernetes.io/region`, `topology.kubernetes.io/zone`)</td>
</tr>
<tr>

@@ -38,7 +38,7 @@ This label has been deprecated. Please use `kubernetes.io/arch` instead.
This label has been deprecated. Please use `kubernetes.io/os` instead.
## kubernetes.io/hostname
## kubernetes.io/hostname {#kubernetesiohostname}
Example: `kubernetes.io/hostname=ip-172-20-114-199.ec2.internal`
@@ -46,6 +46,8 @@ Used on: Node
The Kubelet populates this label with the hostname. Note that the hostname can be changed from the "actual" hostname by passing the `--hostname-override` flag to the `kubelet`.
This label is also used as part of the topology hierarchy. See [topology.kubernetes.io/zone](#topologykubernetesiozone) for more information.
## beta.kubernetes.io/instance-type (deprecated)
{{< note >}} Starting in v1.17, this label is deprecated in favor of [node.kubernetes.io/instance-type](#nodekubernetesioinstance-type). {{< /note >}}
@@ -63,71 +65,52 @@ to rely on the Kubernetes scheduler to perform resource-based scheduling. You sh
## failure-domain.beta.kubernetes.io/region (deprecated) {#failure-domainbetakubernetesioregion}
See [failure-domain.beta.kubernetes.io/zone](#failure-domainbetakubernetesiozone).
See [topology.kubernetes.io/region](#topologykubernetesioregion).
{{< note >}} Starting in v1.17, this label is deprecated in favor of [topology.kubernetes.io/region](#topologykubernetesioregion). {{< /note >}}
## failure-domain.beta.kubernetes.io/zone (deprecated) {#failure-domainbetakubernetesiozone}
Example:
`failure-domain.beta.kubernetes.io/region=us-east-1`
`failure-domain.beta.kubernetes.io/zone=us-east-1c`
Used on: Node, PersistentVolume
On the Node: The `kubelet` populates this with the zone information as defined by the `cloudprovider`.
This will be set only if you are using a `cloudprovider`. However, you should consider setting this
on the nodes if it makes sense in your topology.
On the PersistentVolume: The `PersistentVolumeLabel` admission controller will automatically add zone labels to PersistentVolumes, on GCE and AWS.
Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
_SelectorSpreadPriority_ is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
The scheduler (through the _VolumeZonePredicate_ predicate) also will ensure that Pods, that claim a given volume, are only placed into the same zone as that volume. Volumes cannot be attached across zones.
The actual values of zone and region don't matter. Nor is the node hierarchy rigidly defined.
The expectation is that failures of nodes in different zones should be uncorrelated unless the entire region has failed. For example, zones should typically avoid sharing a single network switch. The exact mapping depends on your particular infrastructure - a three rack installation will choose a very different setup to a multi-datacenter configuration.
If `PersistentVolumeLabel` does not support automatic labeling of your PersistentVolumes, you should consider
adding the labels manually (or adding support for `PersistentVolumeLabel`). With `PersistentVolumeLabel`, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.
See [topology.kubernetes.io/zone](#topologykubernetesiozone).
{{< note >}} Starting in v1.17, this label is deprecated in favor of [topology.kubernetes.io/zone](#topologykubernetesiozone). {{< /note >}}
## topology.kubernetes.io/region {#topologykubernetesioregion}
Example:
`topology.kubernetes.io/region=us-east-1`
See [topology.kubernetes.io/zone](#topologykubernetesiozone).
## topology.kubernetes.io/zone {#topologykubernetesiozone}
Example:
`topology.kubernetes.io/region=us-east-1`
`topology.kubernetes.io/zone=us-east-1c`
Used on: Node, PersistentVolume
On the Node: The `kubelet` populates this with the zone information as defined by the `cloudprovider`.
This will be set only if you are using a `cloudprovider`. However, you should consider setting this
on the nodes if it makes sense in your topology.
On Node: The `kubelet` or the external `cloud-controller-manager` populates this with the information as provided by the `cloudprovider`. This will be set only if you are using a `cloudprovider`. However, you should consider setting this on nodes if it makes sense in your topology.
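For illustration, this is roughly how the populated labels appear on a Node object; the node name and values are made up, following the examples above:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-172-20-114-199.ec2.internal    # illustrative
  labels:
    kubernetes.io/hostname: ip-172-20-114-199.ec2.internal
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1c
```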
On the PersistentVolume: The `PersistentVolumeLabel` admission controller will automatically add zone labels to PersistentVolumes, on GCE and AWS.
On PersistentVolume: topology-aware volume provisioners will automatically set node affinity constraints on `PersistentVolumes`.
Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
A zone represents a logical failure domain. It is common for Kubernetes clusters to span multiple zones for increased availability. While the exact definition of a zone is left to infrastructure implementations, common properties of a zone include very low network latency within a zone, no-cost network traffic within a zone, and failure independence from other zones. For example, nodes within a zone might share a network switch, but nodes in different zones should not.
A region represents a larger domain, made up of one or more zones. It is uncommon for Kubernetes clusters to span multiple regions. While the exact definition of a zone or region is left to infrastructure implementations, common properties of a region include higher network latency between regions than within them, non-zero cost for network traffic between regions, and failure independence from other zones or regions. For example, nodes within a region might share power infrastructure (e.g. a UPS or generator), but nodes in different regions typically would not.
Kubernetes makes a few assumptions about the structure of zones and regions:
1) regions and zones are hierarchical: zones are strict subsets of regions and no zone can be in 2 regions
2) zone names are unique across regions; for example region "africa-east-1" might be comprised of zones "africa-east-1a" and "africa-east-1b"
It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.
Kubernetes can use this information in various ways. For example, the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see [kubernetes.io/hostname](#kubernetesiohostname)). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
_SelectorSpreadPriority_ is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogeneous zones (same number and types of nodes) to reduce the probability of unequal spreading.
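If best-effort spreading is not enough, recent Kubernetes releases also let you use the same label as the `topologyKey` of an explicit topology spread constraint; a minimal sketch, where the `app: my-app` selector is an illustrative assumption:

```yaml
# Explicit zone spreading for pods labelled app: my-app (label is illustrative)
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
```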
The scheduler (through the _VolumeZonePredicate_ predicate) will also ensure that Pods that claim a given volume are only placed into the same zone as that volume. Volumes cannot be attached across zones.
The actual values of zone and region don't matter. Nor is the node hierarchy rigidly defined.
The expectation is that failures of nodes in different zones should be uncorrelated unless the entire region has failed. For example, zones should typically avoid sharing a single network switch. The exact mapping depends on your particular infrastructure - a three rack installation will choose a very different setup to a multi-datacenter configuration.
If `PersistentVolumeLabel` does not support automatic labeling of your PersistentVolumes, you should consider
adding the labels manually (or adding support for `PersistentVolumeLabel`). With `PersistentVolumeLabel`, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.

@@ -12,7 +12,7 @@ spec:
operator: In
values:
- S1
topologyKey: failure-domain.beta.kubernetes.io/zone
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
@@ -23,7 +23,7 @@ spec:
operator: In
values:
- S2
topologyKey: failure-domain.beta.kubernetes.io/zone
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: k8s.gcr.io/pause:2.0