Merge pull request #48458 from windsonsea/polom

Clean up topology-manager.md
pull/48486/head
Kubernetes Prow Robot 2024-10-22 01:32:51 +01:00 committed by GitHub
commit 05afa58db3
1 changed file with 50 additions and 51 deletions

@@ -1,13 +1,11 @@
---
title: Control Topology Management Policies on a node
reviewers:
- ConnorDoyle
- klueska
- lmdaly
- nolancon
- bg-chun
content_type: task
min-kubernetes-server-version: v1.18
weight: 150
@@ -26,7 +24,7 @@ In order to extract the best performance, optimizations related to CPU isolation
device locality are required. However, in Kubernetes, these optimizations are handled by a
disjoint set of components.
_Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations.
## {{% heading "prerequisites" %}}
@@ -38,24 +36,24 @@ responsible for these optimizations.
## How topology manager works
Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes make
resource allocation decisions independently of each other. This can result in undesirable
allocations on multiple-socketed systems, and performance/latency sensitive applications will suffer
due to these undesirable allocations. Undesirable in this case meaning, for example, CPUs and
devices being allocated from different NUMA Nodes, thus incurring additional latency.
The Topology Manager is a kubelet component, which acts as a source of truth so that other kubelet
components can make topology aligned resource allocation choices.
The Topology Manager provides an interface for components, called *Hint Providers*, to send and
receive topology information. The Topology Manager has a set of node level policies which are
explained below.
The Topology Manager receives topology information from the *Hint Providers* as a bitmask denoting
NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform
a set of operations on the hints provided and converge on the hint determined by the policy to
give the optimal result. If an undesirable hint is stored, the preferred field for the hint will be
set to false. In the current policies, preferred is the narrowest preferred mask.
The selected hint is stored as part of the Topology Manager. Depending on the policy configured,
the pod can be accepted or rejected from the node based on the selected hint.
The hint is then stored in the Topology Manager for use by the *Hint Providers* when making the
resource allocation decisions.
@@ -64,28 +62,28 @@ resource allocation decisions.
The Topology Manager currently:
- aligns Pods of all QoS classes.
- aligns the requested resources that Hint Providers provide topology hints for.
If these conditions are met, the Topology Manager will align the requested resources.
In order to customize how this alignment is carried out, the Topology Manager provides two
distinct options: `scope` and `policy`.
The `scope` defines the granularity at which you would like resource alignment to be performed,
for example, at the `pod` or `container` level. The `policy` defines the actual strategy used to
carry out the alignment, for example, `best-effort`, `restricted`, and `single-numa-node`.
Details on the various `scopes` and `policies` available today can be found below.
{{< note >}}
To align CPU resources with other requested resources in a Pod spec, the CPU Manager should be
enabled and proper CPU Manager policy should be configured on a Node.
See [Control CPU Management Policies on the Node](/docs/tasks/administer-cluster/cpu-management-policies/).
{{< /note >}}
{{< note >}}
To align memory (and hugepages) resources with other requested resources in a Pod spec, the Memory
Manager should be enabled and proper Memory Manager policy should be configured on a Node. Refer to
the [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation.
{{< /note >}}
@@ -116,7 +114,8 @@ scope, for example the `pod` scope.
### `pod` scope
To select the `pod` scope, set `topologyManagerScope` in the
[kubelet configuration file](/docs/tasks/administer-cluster/kubelet-config-file/) to `pod`.
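For instance, a minimal kubelet configuration sketch that selects the `pod` scope might look like
the following (only the relevant field is shown; merge it into your existing configuration):
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Group every container in a pod onto a common set of NUMA nodes.
topologyManagerScope: pod
```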
This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the
Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers)
@@ -127,8 +126,8 @@ alignments produced by the Topology Manager on different occasions:
* all containers can be and are allocated to a shared set of NUMA nodes.
The total amount of a particular resource demanded for the entire pod is calculated according to the
[effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resource-sharing-within-containers)
formula, and thus this total value is equal to the maximum of (a worked example follows this list):
* the sum of all app container requests,
* the maximum of init container requests,
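As a worked illustration of this formula, consider a hypothetical pod (the names and images below
are placeholders) with two app containers requesting 2 CPUs each and an init container requesting
3 CPUs; the effective pod request is max(2 + 2, 3) = 4 CPUs:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: effective-request-example   # hypothetical name, for illustration only
spec:
  initContainers:
  - name: init
    image: busybox                   # placeholder image
    resources:
      requests:
        cpu: "3"                     # maximum of init container requests = 3
  containers:
  - name: app-1
    image: nginx                     # placeholder image
    resources:
      requests:
        cpu: "2"
  - name: app-2
    image: nginx
    resources:
      requests:
        cpu: "2"                     # sum of app container requests = 2 + 2 = 4
# Effective pod CPU request under the `pod` scope: max(4, 3) = 4 CPUs.
```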
@@ -147,12 +146,12 @@ is present among possible allocations. Reconsider the example above:
* whereas a set containing more NUMA nodes results in pod rejection (because instead of one
NUMA node, two or more NUMA nodes are required to satisfy the allocation).
To recap, the Topology Manager first computes a set of NUMA nodes and then tests it against the Topology
Manager policy, which either leads to the rejection or admission of the pod.
## Topology manager policies
The Topology Manager supports four allocation policies. You can set a policy via the kubelet flag
`--topology-manager-policy` (a kubelet configuration sketch follows the list below). The supported policies are:
* `none` (default)
@@ -161,7 +160,7 @@ Topology Manager supports four allocation policies. You can set a policy via a K
* `single-numa-node`
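As an illustration, the same policy can also be set through the kubelet configuration file via the
`topologyManagerPolicy` field; a minimal sketch, assuming the `restricted` policy is wanted:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Equivalent to passing --topology-manager-policy=restricted on the kubelet command line.
topologyManagerPolicy: restricted
```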
{{< note >}}
If the Topology Manager is configured with the **pod** scope, the container, which is considered by
the policy, reflects the requirements of the entire pod, and thus each container from the pod
will result in **the same** topology alignment decision.
{{< /note >}}
@@ -175,7 +174,7 @@ This is the default policy and does not perform any topology alignment.
For each container in a Pod, the kubelet, with `best-effort` topology management policy, calls
each Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, the Topology Manager will store this and admit the pod to the node anyway.
The *Hint Providers* can then use this information when making the
resource allocation decision.
@@ -183,13 +182,13 @@ resource allocation decision.
### `restricted` policy {#policy-restricted}
For each container in a Pod, the kubelet, with `restricted` topology management policy, calls each
Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, the Topology Manager will reject this pod from the node. This will result in a pod entering a
`Terminated` state with a pod admission failure.
Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to
reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeployment of
the pod. An external control loop could also be implemented to trigger a redeployment of pods that
have the `Topology Affinity` error.
@@ -199,16 +198,16 @@ resource allocation decision.
### `single-numa-node` policy {#policy-single-numa-node}
For each container in a Pod, the kubelet, with `single-numa-node` topology management policy,
calls each Hint Provider to discover their resource availability. Using this information, the
Topology Manager determines if a single NUMA Node affinity is possible. If it is, the Topology
Manager will store this and the *Hint Providers* can then use this information when making the
resource allocation decision. If, however, this is not possible, then the Topology Manager will
reject the pod from the node. This will result in a pod entering a `Terminated` state with a pod
admission failure.
Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to
reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeployment of
the pod. An external control loop could also be implemented to trigger a redeployment of pods
that have the `Topology Affinity` error.
## Topology manager policy options
@@ -218,6 +217,7 @@ Support for the Topology Manager policy options requires `TopologyManagerPolicyO
(it is enabled by default).
You can toggle groups of options on and off based upon their maturity level using the following feature gates (a configuration sketch follows the list):
* `TopologyManagerPolicyBetaOptions` default enabled. Enable to show beta-level options.
* `TopologyManagerPolicyAlphaOptions` default disabled. Enable to show alpha-level options.
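For example, to make the alpha-level options visible, you could enable the corresponding feature
gate in the kubelet configuration; a minimal sketch:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Alpha-level Topology Manager policy options stay hidden unless this gate is enabled.
  TopologyManagerPolicyAlphaOptions: true
```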
@@ -230,34 +230,34 @@ this policy option is visible by default provided that the `TopologyManagerPolic
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.
The Topology Manager is not aware by default of NUMA distances, and does not take them into account when making
Pod admission decisions. This limitation surfaces in multi-socket, as well as single-socket multi NUMA systems,
and can cause significant performance degradation in latency-critical execution and high-throughput applications
if the Topology Manager decides to align resources on non-adjacent NUMA nodes.
If you specify the `prefer-closest-numa-nodes` policy option, the `best-effort` and `restricted`
policies favor sets of NUMA nodes with shorter distance between them when making admission decisions.
You can enable this option by adding `prefer-closest-numa-nodes=true` to the Topology Manager policy options.
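A minimal kubelet configuration sketch that turns this option on (the `best-effort` policy is shown
as an example; `restricted` works the same way):
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: best-effort
topologyManagerPolicyOptions:
  # Prefer sets of NUMA nodes with shorter distances between them.
  prefer-closest-numa-nodes: "true"
```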
By default (without this option), the Topology Manager aligns resources on either a single NUMA node or,
in the case where more than one NUMA node is required, using the minimum number of NUMA nodes.
### `max-allowable-numa-nodes` (beta) {#policy-option-max-allowable-numa-nodes}
The `max-allowable-numa-nodes` option is beta since Kubernetes 1.31. In Kubernetes {{< skew currentVersion >}},
this policy option is visible by default provided that the `TopologyManagerPolicyOptions` and
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.
The time to admit a pod is tied to the number of NUMA nodes on the physical machine.
By default, Kubernetes does not run a kubelet with the Topology Manager enabled on any (Kubernetes) node where
more than 8 NUMA nodes are detected.
{{< note >}}
If you select the `max-allowable-numa-nodes` policy option, nodes with more than 8 NUMA nodes can
be allowed to run with the Topology Manager enabled. The Kubernetes project only has limited data on the impact
of using the Topology Manager on (Kubernetes) nodes with more than 8 NUMA nodes. Because of that
lack of data, using this policy option with Kubernetes {{< skew currentVersion >}} is **not** recommended and is
at your own risk.
{{< /note >}}
@@ -265,7 +265,7 @@ at your own risk.
You can enable this option by adding `max-allowable-numa-nodes=true` to the Topology Manager policy options.
Setting a value of `max-allowable-numa-nodes` does not (in and of itself) affect the
latency of pod admission, but binding a Pod to a (Kubernetes) node with many NUMA nodes does have an impact.
Potential future improvements to Kubernetes may improve Pod admission performance and reduce the high
latency that occurs as the number of NUMA nodes increases.
@@ -296,10 +296,10 @@ spec:
This pod runs in the `Burstable` QoS class because requests are less than limits.
If the selected policy is anything other than `none`, the Topology Manager would consider these Pod
specifications. The Topology Manager would consult the Hint Providers to get topology hints.
In the case of the `static` CPU Manager policy, the default topology hint would be returned, because
these Pods do not explicitly request CPU resources.
```yaml
spec:
@@ -320,7 +320,6 @@ spec:
This pod with integer CPU request runs in the `Guaranteed` QoS class because `requests` are equal
to `limits`.
```yaml
spec:
  containers:
@@ -380,10 +379,10 @@ assignments.
## Known limitations
1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes,
there will be a state explosion when trying to enumerate the possible NUMA affinities and
generating their hints. See [`max-allowable-numa-nodes`](#policy-option-max-allowable-numa-nodes)
(beta) for more options.
1. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail
   on the node due to the Topology Manager.