Merge pull request #24783 from thockin/kep-1659-doc-topology-labels

Better docs for standard topology labels
Kubernetes Prow Robot 2020-11-04 07:38:04 -08:00 committed by GitHub
commit c779b2152b
8 changed files with 32 additions and 49 deletions

@@ -271,7 +271,7 @@ preempted. Here's an example:
* Pod P is being considered for Node N.
* Pod Q is running on another Node in the same Zone as Node N.
* Pod P has Zone-wide anti-affinity with Pod Q (`topologyKey:
failure-domain.beta.kubernetes.io/zone`).
topology.kubernetes.io/zone`).
* There are no other cases of anti-affinity between Pod P and other Pods in
the Zone.
* In order to schedule Pod P on Node N, Pod Q can be preempted, but scheduler
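For readers following along, a minimal sketch of what Pod P's anti-affinity term might look like with the updated label; the pod names and the `app: pod-q` label are illustrative assumptions, not part of this change:

```yaml
# Hypothetical Pod P with zone-wide anti-affinity against pods labelled app: pod-q
apiVersion: v1
kind: Pod
metadata:
  name: pod-p
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: pod-q          # assumed label carried by Pod Q
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```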

@@ -200,8 +200,8 @@ The affinity on this pod defines one pod affinity rule and one pod anti-affinity
while the `podAntiAffinity` is `preferredDuringSchedulingIgnoredDuringExecution`. The
pod affinity rule says that the pod can be scheduled onto a node only if that node is in the same zone
as at least one already-running pod that has a label with key "security" and value "S1". (More precisely, the pod is eligible to run
on node N if node N has a label with key `failure-domain.beta.kubernetes.io/zone` and some value V
such that there is at least one node in the cluster with key `failure-domain.beta.kubernetes.io/zone` and
on node N if node N has a label with key `topology.kubernetes.io/zone` and some value V
such that there is at least one node in the cluster with key `topology.kubernetes.io/zone` and
value V that is running a pod that has a label with key "security" and value "S1".) The pod anti-affinity
rule says that the pod cannot be scheduled onto a node if that node is in the same zone as a pod with
label having key "security" and value "S2". See the
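Condensed, the affinity stanza this paragraph describes looks roughly like the following (the full manifest is the example file updated at the end of this change):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S1
      topologyKey: topology.kubernetes.io/zone
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S2
        topologyKey: topology.kubernetes.io/zone
```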

@@ -209,7 +209,7 @@ parameters:
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
- key: failure-domain.beta.kubernetes.io/zone
- key: topology.kubernetes.io/zone
values:
- us-central1-a
- us-central1-b
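For context, a complete StorageClass using the updated label might look like the sketch below; the name and provisioner are assumptions for illustration, and only the `allowedTopologies` stanza comes from the hunk above:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-topology-aware    # illustrative name
provisioner: kubernetes.io/gce-pd # assumed provisioner
# parameters omitted; see the full example in the changed file
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:
    - us-central1-a
    - us-central1-b
```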

@@ -449,7 +449,7 @@ spec:
required:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
- key: topology.kubernetes.io/zone
operator: In
values:
- us-central1-a
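The fragment above is a volume node-affinity constraint; a self-contained sketch of a PersistentVolume using it, where the name, capacity, and volume source are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv              # illustrative
spec:
  capacity:
    storage: 100Gi              # illustrative
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:            # assumed volume source
    pdName: example-disk
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a
```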

@@ -534,8 +534,8 @@ and kubelets will not be allowed to modify labels with that prefix.
* `kubernetes.io/os`
* `beta.kubernetes.io/instance-type`
* `node.kubernetes.io/instance-type`
* `failure-domain.beta.kubernetes.io/region`
* `failure-domain.beta.kubernetes.io/zone`
* `failure-domain.beta.kubernetes.io/region` (deprecated)
* `failure-domain.beta.kubernetes.io/zone` (deprecated)
* `topology.kubernetes.io/region`
* `topology.kubernetes.io/zone`
* `kubelet.kubernetes.io/`-prefixed labels

@@ -967,7 +967,7 @@ WindowsEndpointSliceProxying=true|false (ALPHA - default=false)<br/>
<td colspan="2">--node-labels mapStringString</td>
</tr>
<tr>
<td></td><td style="line-height: 130%; word-wrap: break-word;">&lt;Warning: Alpha feature&gt; Labels to add when registering the node in the cluster. Labels must be `key=value` pairs separated by `,`. Labels in the `kubernetes.io` namespace must begin with an allowed prefix (`kubelet.kubernetes.io`, `node.kubernetes.io`) or be in the specifically allowed set (`beta.kubernetes.io/arch`, `beta.kubernetes.io/instance-type`, `beta.kubernetes.io/os`, `failure-domain.beta.kubernetes.io/region`, `failure-domain.beta.kubernetes.io/zone`, `failure-domain.kubernetes.io/region`, `failure-domain.kubernetes.io/zone`, `kubernetes.io/arch`, `kubernetes.io/hostname`, `kubernetes.io/instance-type`, `kubernetes.io/os`)</td>
<td></td><td style="line-height: 130%; word-wrap: break-word;">&lt;Warning: Alpha feature&gt; Labels to add when registering the node in the cluster. Labels must be `key=value` pairs separated by `,`. Labels in the `kubernetes.io` namespace must begin with an allowed prefix (`kubelet.kubernetes.io`, `node.kubernetes.io`) or be in the specifically allowed set (`beta.kubernetes.io/arch`, `beta.kubernetes.io/instance-type`, `beta.kubernetes.io/os`, `failure-domain.beta.kubernetes.io/region`, `failure-domain.beta.kubernetes.io/zone`, `kubernetes.io/arch`, `kubernetes.io/hostname`, `kubernetes.io/os`, `node.kubernetes.io/instance-type`, `topology.kubernetes.io/region`, `topology.kubernetes.io/zone`)</td>
</tr>
<tr>

@@ -38,7 +38,7 @@ This label has been deprecated. Please use `kubernetes.io/arch` instead.
This label has been deprecated. Please use `kubernetes.io/os` instead.
## kubernetes.io/hostname
## kubernetes.io/hostname {#kubernetesiohostname}
Example: `kubernetes.io/hostname=ip-172-20-114-199.ec2.internal`
@@ -46,6 +46,8 @@ Used on: Node
The Kubelet populates this label with the hostname. Note that the hostname can be changed from the "actual" hostname by passing the `--hostname-override` flag to the `kubelet`.
This label is also used as part of the topology hierarchy. See [topology.kubernetes.io/zone](#topologykubernetesiozone) for more information.
## beta.kubernetes.io/instance-type (deprecated)
{{< note >}} Starting in v1.17, this label is deprecated in favor of [node.kubernetes.io/instance-type](#nodekubernetesioinstance-type). {{< /note >}}
@@ -63,71 +65,52 @@ to rely on the Kubernetes scheduler to perform resource-based scheduling. You sh
## failure-domain.beta.kubernetes.io/region (deprecated) {#failure-domainbetakubernetesioregion}
See [failure-domain.beta.kubernetes.io/zone](#failure-domainbetakubernetesiozone).
See [topology.kubernetes.io/region](#topologykubernetesioregion).
{{< note >}} Starting in v1.17, this label is deprecated in favor of [topology.kubernetes.io/region](#topologykubernetesioregion). {{< /note >}}
## failure-domain.beta.kubernetes.io/zone (deprecated) {#failure-domainbetakubernetesiozone}
Example:
`failure-domain.beta.kubernetes.io/region=us-east-1`
`failure-domain.beta.kubernetes.io/zone=us-east-1c`
Used on: Node, PersistentVolume
On the Node: The `kubelet` populates this with the zone information as defined by the `cloudprovider`.
This will be set only if you are using a `cloudprovider`. However, you should consider setting this
on the nodes if it makes sense in your topology.
On the PersistentVolume: The `PersistentVolumeLabel` admission controller will automatically add zone labels to PersistentVolumes, on GCE and AWS.
Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
_SelectorSpreadPriority_ is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogenous zones (same number and types of nodes) to reduce the probability of unequal spreading.
The scheduler (through the _VolumeZonePredicate_ predicate) also will ensure that Pods, that claim a given volume, are only placed into the same zone as that volume. Volumes cannot be attached across zones.
The actual values of zone and region don't matter. Nor is the node hierarchy rigidly defined.
The expectation is that failures of nodes in different zones should be uncorrelated unless the entire region has failed. For example, zones should typically avoid sharing a single network switch. The exact mapping depends on your particular infrastructure - a three rack installation will choose a very different setup to a multi-datacenter configuration.
If `PersistentVolumeLabel` does not support automatic labeling of your PersistentVolumes, you should consider
adding the labels manually (or adding support for `PersistentVolumeLabel`). With `PersistentVolumeLabel`, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.
See [topology.kubernetes.io/zone](#topologykubernetesiozone).
{{< note >}} Starting in v1.17, this label is deprecated in favor of [topology.kubernetes.io/zone](#topologykubernetesiozone). {{< /note >}}
## topology.kubernetes.io/region {#topologykubernetesioregion}
Example:
`topology.kubernetes.io/region=us-east-1`
See [topology.kubernetes.io/zone](#topologykubernetesiozone).
## topology.kubernetes.io/zone {#topologykubernetesiozone}
Example:
`topology.kubernetes.io/region=us-east-1`
`topology.kubernetes.io/zone=us-east-1c`
Used on: Node, PersistentVolume
On the Node: The `kubelet` populates this with the zone information as defined by the `cloudprovider`.
This will be set only if you are using a `cloudprovider`. However, you should consider setting this
on the nodes if it makes sense in your topology.
On Node: The `kubelet` or the external `cloud-controller-manager` populates this with the information as provided by the `cloudprovider`. This will be set only if you are using a `cloudprovider`. However, you should consider setting this on nodes if it makes sense in your topology.
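For illustration, this is roughly how the populated labels appear on a Node object; the node name and values are made up, following the examples above:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: ip-172-20-114-199.ec2.internal    # illustrative
  labels:
    kubernetes.io/hostname: ip-172-20-114-199.ec2.internal
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1c
```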
On the PersistentVolume: The `PersistentVolumeLabel` admission controller will automatically add zone labels to PersistentVolumes, on GCE and AWS.
On PersistentVolume: topology-aware volume provisioners will automatically set node affinity constraints on `PersistentVolumes`.
Kubernetes will automatically spread the Pods in a replication controller or service across nodes in a single-zone cluster (to reduce the impact of failures). With multiple-zone clusters, this spreading behaviour is extended across zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
A zone represents a logical failure domain. It is common for Kubernetes clusters to span multiple zones for increased availability. While the exact definition of a zone is left to infrastructure implementations, common properties of a zone include very low network latency within a zone, no-cost network traffic within a zone, and failure independence from other zones. For example, nodes within a zone might share a network switch, but nodes in different zones should not.
A region represents a larger domain, made up of one or more zones. It is uncommon for Kubernetes clusters to span multiple regions. While the exact definition of a zone or region is left to infrastructure implementations, common properties of a region include higher network latency between regions than within them, non-zero cost for network traffic between regions, and failure independence from other zones or regions. For example, nodes within a region might share power infrastructure (e.g. a UPS or generator), but nodes in different regions typically would not.
Kubernetes makes a few assumptions about the structure of zones and regions:
1) regions and zones are hierarchical: zones are strict subsets of regions and no zone can be in 2 regions
2) zone names are unique across regions; for example region "africa-east-1" might be comprised of zones "africa-east-1a" and "africa-east-1b"
It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.
Kubernetes can use this information in various ways. For example, the scheduler automatically tries to spread the Pods in a ReplicaSet across nodes in a single-zone cluster (to reduce the impact of node failures, see [kubernetes.io/hostname](#kubernetesiohostname)). With multiple-zone clusters, this spreading behavior also applies to zones (to reduce the impact of zone failures). This is achieved via _SelectorSpreadPriority_.
_SelectorSpreadPriority_ is a best effort placement. If the zones in your cluster are heterogeneous (for example: different numbers of nodes, different types of nodes, or different pod resource requirements), this placement might prevent equal spreading of your Pods across zones. If desired, you can use homogeneous zones (same number and types of nodes) to reduce the probability of unequal spreading.
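If best-effort spreading is not enough, recent Kubernetes releases also let you use the same label as the `topologyKey` of an explicit topology spread constraint; a minimal sketch, where the `app: my-app` selector is an illustrative assumption:

```yaml
# Explicit zone spreading for pods labelled app: my-app (label is illustrative)
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
```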
The scheduler (through the _VolumeZonePredicate_ predicate) will also ensure that Pods that claim a given volume are only placed into the same zone as that volume. Volumes cannot be attached across zones.
The actual values of zone and region don't matter. Nor is the node hierarchy rigidly defined.
The expectation is that failures of nodes in different zones should be uncorrelated unless the entire region has failed. For example, zones should typically avoid sharing a single network switch. The exact mapping depends on your particular infrastructure - a three rack installation will choose a very different setup to a multi-datacenter configuration.
If `PersistentVolumeLabel` does not support automatic labeling of your PersistentVolumes, you should consider
adding the labels manually (or adding support for `PersistentVolumeLabel`). With `PersistentVolumeLabel`, the scheduler prevents Pods from mounting volumes in a different zone. If your infrastructure doesn't have this constraint, you don't need to add the zone labels to the volumes at all.

@@ -12,7 +12,7 @@ spec:
operator: In
values:
- S1
topologyKey: failure-domain.beta.kubernetes.io/zone
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
@@ -23,7 +23,7 @@ spec:
operator: In
values:
- S2
topologyKey: failure-domain.beta.kubernetes.io/zone
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: k8s.gcr.io/pause:2.0