Cleanup node-pressure-eviction.md
parent d6f6e084a8
commit 3aa5fe0cca

@@ -6,16 +6,16 @@ weight: 100

{{<glossary_definition term_id="node-pressure-eviction" length="short">}}<br/>

The {{<glossary_tooltip term_id="kubelet" text="kubelet">}} monitors resources
like memory, disk space, and filesystem inodes on your cluster's nodes.
When one or more of these resources reach specific consumption levels, the
kubelet can proactively fail one or more pods on the node to reclaim resources
and prevent starvation.

During a node-pressure eviction, the kubelet sets the `PodPhase` for the
selected pods to `Failed`. This terminates the pods.

Node-pressure eviction is not the same as
[API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/).

The kubelet does not respect your configured `PodDisruptionBudget` or the pod's

@@ -26,7 +26,7 @@ the kubelet respects your configured `eviction-max-pod-grace-period`. If you use

If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}}
resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}
or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that
replaces failed pods, the control plane or `kube-controller-manager` creates new
pods in place of the evicted pods.

{{<note>}}

@@ -37,16 +37,16 @@ images when disk resources are starved.

The kubelet uses various parameters to make eviction decisions, like the following:

- Eviction signals
- Eviction thresholds
- Monitoring intervals

### Eviction signals {#eviction-signals}

Eviction signals are the current state of a particular resource at a specific
point in time. The kubelet uses eviction signals to make eviction decisions by
comparing the signals to eviction thresholds, which are the minimum amount of
the resource that should be available on the node.

The kubelet uses the following eviction signals:

@@ -60,9 +60,9 @@ Kubelet uses the following eviction signals:
| `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` |

In this table, the `Description` column shows how the kubelet gets the value of the
signal. Each signal supports either a percentage or a literal value. The kubelet
calculates the percentage value relative to the total capacity associated with
the signal.

The value for `memory.available` is derived from the cgroupfs instead of tools
like `free -m`. This is important because `free -m` does not work in a

@@ -78,7 +78,7 @@ memory is reclaimable under pressure.

The kubelet supports the following filesystem partitions:

1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir,
   log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`.
1. `imagefs`: An optional filesystem that container runtimes use to store container
   images and container writable layers.

@@ -102,10 +102,10 @@ eviction decisions.

Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:

- `eviction-signal` is the [eviction signal](#eviction-signals) to use.
- `operator` is the [relational operator](https://en.wikipedia.org/wiki/Relational_operator#Standard_relational_operators)
  you want, such as `<` (less than).
- `quantity` is the eviction threshold amount, such as `1Gi`. The value of `quantity`
  must match the quantity representation used by Kubernetes. You can use either
  literal values or percentages (`%`).
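
For example, the following are both syntactically valid eviction thresholds
(the values are illustrative, not recommendations):

```
memory.available<10%
nodefs.available<1Gi
```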
|
@ -120,22 +120,22 @@ You can configure soft and hard eviction thresholds.
|
|||
A soft eviction threshold pairs an eviction threshold with a required
|
||||
administrator-specified grace period. The kubelet does not evict pods until the
|
||||
grace period is exceeded. The kubelet returns an error on startup if there is no
|
||||
specified grace period.
|
||||
specified grace period.
|
||||
|
||||
You can specify both a soft eviction threshold grace period and a maximum
|
||||
allowed pod termination grace period for kubelet to use during evictions. If you
|
||||
specify a maximum allowed grace period and the soft eviction threshold is met,
|
||||
specify a maximum allowed grace period and the soft eviction threshold is met,
|
||||
the kubelet uses the lesser of the two grace periods. If you do not specify a
|
||||
maximum allowed grace period, the kubelet kills evicted pods immediately without
|
||||
graceful termination.
|
||||
|
||||
You can use the following flags to configure soft eviction thresholds:
|
||||
|
||||
* `eviction-soft`: A set of eviction thresholds like `memory.available<1.5Gi`
|
||||
- `eviction-soft`: A set of eviction thresholds like `memory.available<1.5Gi`
|
||||
that can trigger pod eviction if held over the specified grace period.
|
||||
* `eviction-soft-grace-period`: A set of eviction grace periods like `memory.available=1m30s`
|
||||
- `eviction-soft-grace-period`: A set of eviction grace periods like `memory.available=1m30s`
|
||||
that define how long a soft eviction threshold must hold before triggering a Pod eviction.
|
||||
* `eviction-max-pod-grace-period`: The maximum allowed grace period (in seconds)
|
||||
- `eviction-max-pod-grace-period`: The maximum allowed grace period (in seconds)
|
||||
to use when terminating pods in response to a soft eviction threshold being met.
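
In a [kubelet config file](/docs/tasks/administer-cluster/kubelet-config-file/),
the equivalent fields are `evictionSoft`, `evictionSoftGracePeriod`, and
`evictionMaxPodGracePeriod`. A minimal sketch, with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"     # threshold quantity (the `<` operator is implied)
evictionSoftGracePeriod:
  memory.available: "1m30s"     # how long the threshold must hold before eviction
evictionMaxPodGracePeriod: 60   # cap on pod termination grace period, in seconds
```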

#### Hard eviction thresholds {#hard-eviction-thresholds}

@@ -144,20 +144,20 @@ A hard eviction threshold has no grace period. When a hard eviction threshold is
met, the kubelet kills pods immediately without graceful termination to reclaim
the starved resource.

You can use the `eviction-hard` flag to configure a set of hard eviction
thresholds like `memory.available<1Gi`.

The kubelet has the following default hard eviction thresholds:

- `memory.available<100Mi`
- `nodefs.available<10%`
- `imagefs.available<15%`
- `nodefs.inodesFree<5%` (Linux nodes)

These default hard eviction thresholds are applied only if none of them is
changed. If you change the value of any parameter, the values of the other
parameters are not inherited as defaults and are set to zero instead. To
provide custom values, you should provide all of the thresholds explicitly.
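
For example, to customize only the memory threshold while keeping the other
defaults in effect, a kubelet config sketch would restate all of them explicitly
(the `memory.available` value here is illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"  # custom value
  nodefs.available: "10%"    # restated default
  imagefs.available: "15%"   # restated default
  nodefs.inodesFree: "5%"    # restated default
```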

### Eviction monitoring interval

@@ -169,9 +169,9 @@ which defaults to `10s`.

The kubelet reports node conditions to reflect that the node is under pressure
because a hard or soft eviction threshold is met, independent of configured grace
periods.

The kubelet maps eviction signals to node conditions as follows:

| Node Condition | Eviction Signal | Description |
|----------------|-----------------|-------------|

@@ -179,7 +179,7 @@ The kubelet maps eviction signals to node conditions as follows:
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem have satisfied an eviction threshold |
| `PIDPressure` | `pid.available` | Available process identifiers on the (Linux) node have fallen below an eviction threshold |

The kubelet updates the node conditions based on the configured
`--node-status-update-frequency`, which defaults to `10s`.
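
If you need pressure conditions to propagate to the control plane faster, you
can lower this interval in the kubelet config file; a sketch (the `5s` value is
illustrative, and shorter intervals increase load on the API server):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "5s"  # default is 10s
```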

#### Node condition oscillation

@@ -197,17 +197,17 @@ condition to a different state. The transition period has a default value of `5m

The kubelet tries to reclaim node-level resources before it evicts end-user pods.

When a `DiskPressure` node condition is reported, the kubelet reclaims node-level
resources based on the filesystems on the node.

#### With `imagefs`

If the node has a dedicated `imagefs` filesystem for container runtimes to use,
the kubelet does the following:

- If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage collects
  dead pods and containers.
- If the `imagefs` filesystem meets the eviction thresholds, the kubelet
  deletes all unused images.

#### Without `imagefs`

@@ -220,7 +220,7 @@ the kubelet frees up disk space in the following order:

### Pod selection for kubelet eviction

If the kubelet's attempts to reclaim node-level resources don't bring the eviction
signal below the threshold, the kubelet begins to evict end-user pods.

The kubelet uses the following parameters to determine the pod eviction order:

@@ -238,7 +238,7 @@ As a result, kubelet ranks and evicts pods in the following order:

{{<note>}}
The kubelet does not use the pod's QoS class to determine the eviction order.
You can use the QoS class to estimate the most likely pod eviction order when
reclaiming resources like memory. QoS does not apply to EphemeralStorage requests,
so the above scenario will not apply if the node is, for example, under `DiskPressure`.
{{</note>}}

@@ -246,7 +246,7 @@ so the above scenario will not apply if the node is, for example, under `DiskPre

`Guaranteed` pods are guaranteed only when requests and limits are specified for
all the containers and they are equal. These pods will never be evicted because
of another pod's resource consumption. If a system daemon (such as `kubelet`
or `journald`) is consuming more resources than were reserved via
`system-reserved` or `kube-reserved` allocations, and the node only has
`Guaranteed` or `Burstable` pods using fewer resources than their requests left on it,
then the kubelet must choose to evict one of these pods to preserve node stability

@@ -277,14 +277,14 @@ disk usage (`local volumes + logs & writable layer of all containers`)

In some cases, pod eviction only reclaims a small amount of the starved resource.
This can lead to the kubelet repeatedly hitting the configured eviction thresholds
and triggering multiple evictions.

You can use the `--eviction-minimum-reclaim` flag or a [kubelet config file](/docs/tasks/administer-cluster/kubelet-config-file/)
to configure a minimum reclaim amount for each resource. When the kubelet notices
that a resource is starved, it continues to reclaim that resource until it
reclaims the quantity you specify.

For example, the following configuration sets minimum reclaim amounts:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1

@@ -302,10 +302,10 @@ evictionMinimumReclaim:

In this example, if the `nodefs.available` signal meets the eviction threshold,
the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`,
and then continues to reclaim the minimum amount of `500Mi`, until the signal
reaches `1.5Gi`.

Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available`
signal reaches `102Gi`.

The default `eviction-minimum-reclaim` is `0` for all resources.

@@ -336,7 +336,7 @@ for each container. It then kills the container with the highest score.

This means that containers in low QoS pods that consume a large amount of memory
relative to their scheduling requests are killed first.

Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it
based on its `RestartPolicy`.

### Best practices {#node-pressure-eviction-good-practices}

@@ -351,9 +351,9 @@ immediately induce memory pressure.

Consider the following scenario:

- Node memory capacity: `10Gi`
- Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
- Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.

For this to work, the kubelet is launched as follows:

@@ -363,18 +363,18 @@ For this to work, the kubelet is launched as follows:
```

In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory
for the system, which is `10% of the total memory + the eviction threshold amount`.

The node can reach the eviction threshold if a pod is using more than its request,
or if the system is using more than `1Gi` of memory, which makes the `memory.available`
signal fall below `500Mi` and triggers the threshold.
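
Expressed as a kubelet config file, the scenario maps to the following sketch
(values are derived from the arithmetic above; treat them as illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"  # evict at 95% utilization: 5% of 10Gi
systemReserved:
  memory: "1.5Gi"            # 10% of capacity (1Gi) + eviction threshold (500Mi)
```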

#### DaemonSet

Pod Priority is a major factor in making eviction decisions. If you do not want
the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high
enough `priorityClass` in the pod spec. You can also use a lower `priorityClass`
or the default to only allow `DaemonSet` pods to run when there are enough
resources.

### Known issues

@@ -386,7 +386,7 @@ The following sections describe known issues related to out of resource handling

By default, the kubelet polls `cAdvisor` to collect memory usage stats at a
regular interval. If memory usage increases rapidly within that window, the
kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked.

You can use the `--kernel-memcg-notification` flag to enable the `memcg`
notification API on the kubelet to get notified immediately when a threshold

@@ -394,29 +394,29 @@ is crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of
overcommit, a viable workaround for this issue is to use the `--kube-reserved`
and `--system-reserved` flags to allocate memory for the system.
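
In a kubelet config file, those flags correspond to the `kubeReserved` and
`systemReserved` fields; a minimal sketch with illustrative sizes:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "1Gi"    # headroom for Kubernetes system daemons (illustrative)
systemReserved:
  memory: "500Mi"  # headroom for OS system daemons (illustrative)
```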

#### active_file memory is not considered as available memory

On Linux, the kernel tracks the number of bytes of file-backed memory on the active
LRU list as the `active_file` statistic. The kubelet treats `active_file` memory
areas as not reclaimable. For workloads that make intensive use of block-backed
local storage, including ephemeral local storage, kernel-level caches of file
and block data mean that many recently accessed cache pages are likely to be
counted as `active_file`. If enough of these kernel block buffers are on the
active LRU list, the kubelet is liable to observe this as high resource use and
taint the node as experiencing memory pressure - triggering pod eviction.

For more details, see [https://github.com/kubernetes/kubernetes/issues/43916](https://github.com/kubernetes/kubernetes/issues/43916)

You can work around that behavior by setting the memory limit and memory request
the same for containers likely to perform intensive I/O activity. You will need
to estimate or measure an optimal memory limit value for that container.
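
For example, a Pod along these lines sets the memory request equal to the limit
(the name, image, and `2Gi` figure are placeholders; substitute a measured value):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-intensive-app              # placeholder name
spec:
  containers:
  - name: app
    image: registry.example/app:1.0   # placeholder image
    resources:
      requests:
        memory: "2Gi"   # request set equal to the limit
      limits:
        memory: "2Gi"   # estimated or measured value for this workload
```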

## {{% heading "whatsnext" %}}

- Learn about [API-initiated Eviction](/docs/concepts/scheduling-eviction/api-eviction/)
- Learn about [Pod Priority and Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
- Learn about [PodDisruptionBudgets](/docs/tasks/run-application/configure-pdb/)
- Learn about [Quality of Service](/docs/tasks/configure-pod-container/quality-service-pod/) (QoS)
- Check out the [Eviction API](/docs/reference/generated/kubernetes-api/{{<param "version">}}/#create-eviction-pod-v1-core)