Reword guidance about topology manager policy options

pull/47318/head
Tim Bannister 2024-07-30 16:10:35 +01:00
parent 2deb95c701
commit cca6aaf6aa
1 changed file with 45 additions and 26 deletions


@@ -223,37 +223,55 @@ You can toggle groups of options on and off based upon their maturity level usin
You will still have to enable each option using the `TopologyManagerPolicyOptions` kubelet option.

The following policy options exist:

### `prefer-closest-numa-nodes` (beta) {#policy-option-prefer-closest-numa-nodes}

The `prefer-closest-numa-nodes` option is beta since Kubernetes 1.28. In Kubernetes {{< skew currentVersion >}}
this policy option is visible by default provided that the `TopologyManagerPolicyOptions` and
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.
By default, the topology manager is not aware of NUMA distances and does not take them into account when making
Pod admission decisions. This limitation surfaces in multi-socket systems, as well as in single-socket multi-NUMA systems,
and can cause significant performance degradation in latency-critical execution and high-throughput applications
if the topology manager decides to align resources on non-adjacent NUMA nodes.
If you specify the `prefer-closest-numa-nodes` policy option, the `best-effort` and `restricted`
policies favor sets of NUMA nodes with shorter distance between them when making admission decisions.
You can enable this option by adding `prefer-closest-numa-nodes=true` to the Topology Manager policy options.
By default (without this option), Topology Manager aligns resources either on a single NUMA node or,
where more than one NUMA node is required, on the minimum number of NUMA nodes.
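
For illustration, enabling this option through the kubelet configuration file could look like the minimal
sketch below. The choice of the `best-effort` policy is an assumption made for the example; only the
`prefer-closest-numa-nodes` entry is specific to this option.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Both feature gates have to be enabled for beta policy options
  # (they are enabled by default in recent Kubernetes releases).
  TopologyManagerPolicyOptions: true
  TopologyManagerPolicyBetaOptions: true
# prefer-closest-numa-nodes only affects the best-effort and restricted policies
topologyManagerPolicy: best-effort
topologyManagerPolicyOptions:
  prefer-closest-numa-nodes: "true"
```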

### `max-allowable-numa-nodes` (beta) {#policy-option-max-allowable-numa-nodes}

The `max-allowable-numa-nodes` option is beta since Kubernetes 1.31. In Kubernetes {{< skew currentVersion >}}
this policy option is visible by default provided that the `TopologyManagerPolicyOptions` and
`TopologyManagerPolicyBetaOptions` [feature gates](/docs/reference/command-line-tools-reference/feature-gates/)
are enabled.

The time to admit a pod is tied to the number of NUMA nodes on the physical machine.
By default, Kubernetes does not run a kubelet with the topology manager enabled on any (Kubernetes) node where
more than 8 NUMA nodes are detected.

{{< note >}}
If you select the `max-allowable-numa-nodes` policy option, nodes with more than 8 NUMA nodes can
be allowed to run with the topology manager enabled. The Kubernetes project only has limited data on the impact
of using the topology manager on (Kubernetes) nodes with more than 8 NUMA nodes. Because of that
lack of data, using this policy option with Kubernetes {{< skew currentVersion >}} is **not** recommended and is
at your own risk.
{{< /note >}}

You can enable this option by adding `max-allowable-numa-nodes=true` to the Topology Manager policy options.

Setting a value of `max-allowable-numa-nodes` does not (in and of itself) affect the
latency of Pod admission, but binding a Pod to a (Kubernetes) node with many NUMA nodes does have an impact.
Potential future improvements to Kubernetes may improve Pod admission performance and reduce the high
latency that occurs as the number of NUMA nodes increases.
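
As a minimal sketch, the same kubelet configuration mechanism carries this option as well. The choice of the
`single-numa-node` policy is an assumption made for the example; the value mirrors the setting described above:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManagerPolicyOptions: true
  TopologyManagerPolicyBetaOptions: true
topologyManagerPolicy: single-numa-node   # assumed for illustration; other policies also work
topologyManagerPolicyOptions:
  max-allowable-numa-nodes: "true"   # the setting described in the text above
```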

## Pod interactions with topology manager policies

Consider the containers in the following Pod manifest:
```yaml
spec:

@@ -364,7 +382,8 @@ assignments.

1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes
   there will be a state explosion when trying to enumerate the possible NUMA affinities and
   generating their hints. See [`max-allowable-numa-nodes`](#policy-option-max-allowable-numa-nodes)
   (beta) for a way to relax this limit.
2. The scheduler is not topology-aware, so it is possible for a Pod to be scheduled on a node and then
   fail on that node due to the Topology Manager.