Try to improve 2023-05-05-memory-qos-cgroups-v2

commit 6f6b78d89c (parent 8662094c8e)

Kubernetes v1.27, released in April 2023, introduced changes to
Memory QoS (alpha) to improve memory management capabilities in Linux nodes.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some
[limitations](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos#reasons-for-changing-the-formula-of-memoryhigh-calculation-in-alpha-v127)
around the formula for calculating `memory.high` were identified. These limitations are
addressed in Kubernetes v1.27.

## Background

Kubernetes allows you to optionally specify how much of each resource a container needs
in the Pod specification. The most common resources to specify are CPU and Memory.

For example, a Pod manifest that defines container resource requirements could look like
this (the resource values below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```

A Pod specification lets you express the resource needs of each container in two ways:

* `spec.containers[].resources.requests`

  When you specify the resource request for containers in a Pod, the
  [Kubernetes scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler)
  uses this information to decide which node to place the Pod on. The scheduler
  ensures that for each resource type, the sum of the resource requests of the
  scheduled containers is less than the total allocatable resources on the node.

* `spec.containers[].resources.limits`

  When you specify the resource limit for containers in a Pod, the kubelet enforces
  those limits so that the running containers are not allowed to use more of those
  resources than the limits you set.

When the kubelet starts a container as a part of a Pod, kubelet passes the
container's requests and limits for CPU and memory to the container runtime.
The container runtime assigns both CPU request and CPU limit to a container.
Provided the system has free CPU time, the containers are guaranteed to be
allocated as much CPU as they request. Containers cannot use more CPU than
the configured limit, i.e. containers' CPU usage will be throttled if they
use more CPU than the specified limit within a given time slice.

Prior to the Memory QoS feature, the container runtime only used the memory
limit and discarded the memory `request` (requests were, and still are,
also used to influence [scheduling](/docs/concepts/scheduling-eviction/#scheduling)).
If a container uses more memory than the configured limit,
the Linux Out Of Memory (OOM) killer will be invoked.

Let's compare how the container runtime on Linux typically configures memory
request and limit in cgroups, with and without the Memory QoS feature:

* **Memory request**

  The memory request is mainly used by kube-scheduler during (Kubernetes) Pod
  scheduling. In cgroups v1, there are no controls to specify the minimum amount
  of memory the cgroups must always retain. Hence, the container runtime did not
  use the value of requested memory set in the Pod spec.

  cgroups v2 introduced a `memory.min` setting, used to specify the minimum
  amount of memory that should remain available to the processes within
  a given cgroup. If the memory usage of a cgroup is within its effective
  min boundary, the cgroup’s memory won’t be reclaimed under any conditions.
  If the kernel cannot maintain at least `memory.min` bytes of memory for the
  processes within the cgroup, the kernel invokes its OOM killer. In other words,
  the kernel guarantees at least this much memory is available or terminates
  processes (which may be outside the cgroup) in order to make memory more available.
  Memory QoS maps `memory.min` to `spec.containers[].resources.requests.memory`
  to ensure the availability of memory for containers in Kubernetes Pods.

* **Memory limit**

  The `memory.limit` specifies the memory limit, beyond which if the container tries
  to allocate more memory, the Linux kernel will terminate a process with an
  OOM (Out of Memory) kill. If the terminated process was the main (or only) process
  inside the container, the container may exit.

  In cgroups v2, `memory.max` is the interface used to
  specify the hard limit for memory usage. If the memory consumption goes above this
  level, the kernel invokes its OOM Killer.

  cgroups v2 also added the `memory.high` configuration. Memory QoS uses `memory.high`
  to set a memory usage throttle limit. If the `memory.high` limit is breached,
  the offending cgroups are throttled, and the kernel tries to reclaim memory,
  which may avoid an OOM kill.

### Cgroups v2 memory controller interfaces & Kubernetes container resources mapping

Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in
Kubernetes. The cgroups v2 interfaces that this feature uses are:

* `memory.max`
* `memory.min`
* `memory.high`

{{< figure src="/blog/2023/05/05/qos-memory-resources/memory-qos-cal.svg" title="Memory QoS Levels" alt="Memory QoS Levels" >}}

`memory.max` is mapped to `limits.memory` specified in the Pod spec. The kubelet and
the container runtime configure the limit in the respective cgroup. The kernel
enforces the limit to prevent the container from using more than the configured
resource limit. If a process in a container tries to consume more than the
specified limit, the kernel terminates the process(es) with an Out of Memory (OOM) error.

```formula
memory.max = pod.spec.containers[i].resources.limits[memory]
```
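
As a quick illustration with an assumed value (not taken from any manifest in this post), a container that sets `limits.memory: "128Mi"` ends up with the following hard cap in its container level cgroup:

```formula
memory.max = 128Mi = 128 * 1024 * 1024 = 134217728 bytes
```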

`memory.min` is mapped to `requests.memory`, which results in reservation of memory resources
that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of
memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM
killer is invoked to make more memory available.

```formula
memory.min = pod.spec.containers[i].resources.requests[memory]
```

For memory protection, in addition to the original way of limiting memory usage, Memory QoS
throttles workloads approaching their memory limit, ensuring that the system is not overwhelmed
by sporadic increases in memory usage. A new field, `memoryThrottlingFactor`, is available in
the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default.
`memory.high` is mapped to a throttling limit calculated by using `memoryThrottlingFactor`,
`requests.memory` and `limits.memory` as in the formula below, and rounding down the
value to the nearest page size:

```formula
memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
{(pod.spec.containers[i].resources.limits[memory] or NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
```
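
As a quick sanity check of the formula, take a container with assumed values of `requests.memory: 200Mi`, `limits.memory: 1000Mi`, and the default `memoryThrottlingFactor` of 0.9 (these numbers are only for illustration):

```formula
memory.high = 200Mi + 0.9 * (1000Mi - 200Mi)
            = 920Mi (then rounded down to the nearest page size)
```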

{{< note >}}
If a container has no memory limits specified, `limits.memory` is substituted for node allocatable memory.
{{< /note >}}
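
To get this behavior you need the `MemoryQoS` feature gate turned on for the kubelet. A minimal KubeletConfiguration sketch could look like the fragment below; treat it as an illustrative snippet rather than a complete, recommended configuration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true           # alpha feature gate, off by default
memoryThrottlingFactor: 0.9 # default value; lower it to start throttling earlier
```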

**Summary:**
<table>
  <tr>
<td>memory.max</td>
    <td><code>memory.max</code> specifies the maximum memory limit
     a container is allowed to use. If a process within the container
     tries to consume more memory than the configured limit,
     the kernel terminates the process with an Out of Memory (OOM) error.
     <br>
     <br>
      <i>It is mapped to the container's memory limit specified in Pod manifest.</i>
    </td>
</tr>
  <tr>
    <td>memory.min</td>
    <td><code>memory.min</code> specifies a minimum amount of memory
     the cgroups must always retain, i.e., memory that should never be
     reclaimed by the system.
     If there's no unprotected reclaimable memory available, OOM kill is invoked.
     <br>
     <br>
      <i>It is mapped to the container's memory request specified in Pod manifest.</i>
    </td>
</tr>
  <tr>
    <td>memory.high</td>
    <td><code>memory.high</code> specifies the memory usage throttle limit.
     This is the main mechanism to control a cgroup's memory use. If
     cgroups memory use goes over the high boundary specified here,
     the cgroups processes are throttled and put under heavy reclaim pressure.
     <br>
     <br>
      <i>It is mapped to the throttle limit calculated from the container's memory request, memory limit and the <code>memoryThrottlingFactor</code>.</i>
    </td>
  </tr>
</table>

{{< note >}}
`memory.high` is set only on container level cgroups while `memory.min` is set on
container, pod, and node level cgroups.
{{< /note >}}

### `memory.min` calculations for cgroups hierarchy

When container memory requests are made, kubelet passes `memory.min` to the back-end
CRI runtime (such as containerd or CRI-O) via the `Unified` field in CRI during
container creation. For every i<sup>th</sup> container in a pod, the `memory.min`
in container level cgroups will be set to:

```formula
memory.min = pod.spec.containers[i].resources.requests[memory]
```
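
As a rough sketch of what the kubelet hands over, the setting travels in the `unified` map of the CRI `LinuxContainerResources` message (shown here as an illustrative YAML fragment with assumed values, not a complete CRI payload):

```yaml
# container requesting 64Mi and limited to 128Mi of memory
memory_limit_in_bytes: 134217728   # limits.memory: 128Mi -> memory.max
unified:
  memory.min: "67108864"           # requests.memory: 64Mi -> memory.min
```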

Since the `memory.min` interface requires that the ancestor cgroups directories are all
set, the pod and node cgroups directories need to be set correctly.

Summed over every i<sup>th</sup> container in a pod, the `memory.min` in the pod level cgroup will be set to:

```formula
memory.min = \sum_{i=0}^{no. of containers}pod.spec.containers[i].resources.requests[memory]
```

Similarly, summed over every j<sup>th</sup> container in every i<sup>th</sup> pod on a node, the `memory.min` in the node level cgroup will be set to:

```formula
memory.min = \sum_{i}^{no. of pods}\sum_{j}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]
```
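
For a concrete, made-up example: a pod with two containers requesting 128Mi and 256Mi gets a pod level `memory.min` of 384Mi, and a node that runs only that pod plus a second pod whose containers together request 512Mi ends up with a node level `memory.min` of 896Mi:

```formula
pod level : memory.min = 128Mi + 256Mi = 384Mi
node level: memory.min = 384Mi + 512Mi = 896Mi
```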

Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups
directly using the libcontainer library (from the runc project), while container
cgroups limits are managed by the container runtime.
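
Put together, the resulting cgroup v2 hierarchy looks roughly like the sketch below. The paths assume the systemd cgroup driver, containerd, and a Burstable pod; the exact slice and scope names vary with the cgroup driver and container runtime in use:

```
/sys/fs/cgroup/kubepods.slice/                       # node level: memory.min = sum of requests of all pods
└── kubepods-burstable.slice/
    └── kubepods-burstable-pod<uid>.slice/           # pod level: memory.min = sum of the pod's container requests
        └── cri-containerd-<container-id>.scope/     # container level: memory.min and memory.high
```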

### Support for Pod QoS classes

Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like
to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling.
Therefore, in Kubernetes v1.27, Memory QoS also supports setting `memory.high` based on
the Quality of Service (QoS) class of the Pod. Following are the different cases for
`memory.high` as per QoS classes:

1. **Guaranteed pods** by their QoS definition require memory requests = memory limits and are
   not overcommitted. Hence the MemoryQoS feature is disabled on those pods by not setting
   `memory.high`. This ensures that Guaranteed pods can fully use their memory requests up
   to their set limit, and not hit any throttling.

1. **Burstable pods** by their QoS definition require at least one container in the Pod with
   a CPU or memory request or limit set.

   * When requests.memory and limits.memory are set, the formula is used as-is:

     ```formula
     memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
     {(pod.spec.containers[i].resources.limits[memory]) - pod.spec.containers[i].resources.requests[memory]}
     ```

   * When requests.memory is set and limits.memory is not set, limits.memory is substituted
     for node allocatable memory in the formula:

     ```formula
     memory.high = pod.spec.containers[i].resources.requests[memory] + MemoryThrottlingFactor *
     {(NodeAllocatableMemory) - pod.spec.containers[i].resources.requests[memory]}
     ```

1. **BestEffort pods** by their QoS definition do not require any memory or CPU limits or requests.
   For this case, Kubernetes sets requests.memory = 0 and substitutes limits.memory for node allocatable
   memory in the formula:

   ```formula
   memory.high = MemoryThrottlingFactor * NodeAllocatableMemory
   ```

**Summary**: Only Pods in Burstable and BestEffort QoS classes will set `memory.high`.
Guaranteed QoS pods do not set `memory.high` as their memory is guaranteed.
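
To put numbers on that summary, assume the default `memoryThrottlingFactor` of 0.9 and, purely for illustration, 10Gi of node allocatable memory:

```formula
Guaranteed (requests = limits)       : memory.high not set
Burstable, requests=1Gi, limits=2Gi  : memory.high = 1Gi + 0.9 * (2Gi - 1Gi)  = 1.9Gi
Burstable, requests=1Gi, no limit    : memory.high = 1Gi + 0.9 * (10Gi - 1Gi) = 9.1Gi
BestEffort                           : memory.high = 0.9 * 10Gi               = 9Gi
```

Each calculated value is then rounded down to the nearest page size before being written to the cgroup.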

The prerequisites for enabling the Memory QoS feature on your Linux node are:

1. Verify the [requirements](/docs/concepts/architecture/cgroups/#requirements)
   related to [Kubernetes support for cgroups v2](/docs/concepts/architecture/cgroups)
   are met.
1. Ensure the CRI Runtime supports Memory QoS. At the time of writing, only containerd
   and CRI-O provide support compatible with Memory QoS (alpha). This was implemented
   in the following PRs:
   * Containerd: [Feature: containerd-cri support LinuxContainerResources.Unified #5627](https://github.com/containerd/containerd/pull/5627).

Thanks to everyone who contributed to the implementation
and review of this feature, including:

* David Porter ([bobbypage](https://github.com/bobbypage))
* Mrunal Patel ([mrunalp](https://github.com/mrunalp))

For those interested in getting involved in future discussions on the Memory QoS feature,
you can reach out to SIG Node by several means:

- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)