---
title: Configure Out Of Resource Handling
---

* TOC
{:toc}
This page explains how to configure out of resource handling with `kubelet`.

The `kubelet` needs to preserve node stability when available compute resources
are low. This is especially important when dealing with incompressible
compute resources, such as memory or disk space. If such resources are exhausted,
nodes become unstable.

## Eviction Policy

The `kubelet` can proactively monitor for and prevent total starvation of a
compute resource. In those cases, the `kubelet` can reclaim the starved
resource by proactively failing one or more Pods. When the `kubelet` fails
a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`.

### Eviction Signals

The `kubelet` supports eviction decisions based on the signals described in the following
table. The value of each signal is described in the Description column, which is based on
the `kubelet` summary API.

| Eviction Signal | Description |
|----------------------------|-----------------------------------------------------------------------|
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

The value for `memory.available` is derived from the cgroupfs instead of tools
like `free -m`. This is important because `free -m` does not work in a
container, and if users use the [node
allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable) feature, out of resource decisions
are made local to the end user Pod part of the cgroup hierarchy as well as the
root node. This
[script](/docs/tasks/administer-cluster/out-of-resource/memory-available.sh)
reproduces the same set of steps that the `kubelet` performs to calculate
`memory.available`.
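
In outline, the calculation takes the node's memory capacity and subtracts the root
memory cgroup's working set (usage minus inactive file-backed memory, which is
considered reclaimable under pressure). A minimal sketch of that derivation,
assuming cgroup v1 mounted at `/sys/fs/cgroup`:

```
#!/bin/bash
# Sketch: derive memory.available the way the kubelet does,
# assuming cgroup v1 is mounted at /sys/fs/cgroup.
memory_capacity_in_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memory_capacity_in_bytes=$((memory_capacity_in_kb * 1024))
memory_usage_in_bytes=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
# Inactive file-backed memory is treated as reclaimable under pressure.
memory_total_inactive_file=$(grep total_inactive_file /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')
memory_working_set=$((memory_usage_in_bytes - memory_total_inactive_file))
memory_available_in_bytes=$((memory_capacity_in_bytes - memory_working_set))
echo "memory.available: $((memory_available_in_bytes / 1024)) Ki"
```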

of configurations are not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated `filesystem`.

In future releases, the `kubelet` will deprecate the existing [garbage
collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
support in favor of eviction in response to disk pressure.

### Eviction Thresholds

The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.

Each threshold has the following form:

`[eviction-signal][operator][quantity]`

where:

* `eviction-signal` is an eviction signal token as defined in the previous table.
* `operator` is the desired relational operator, such as `<` (less than).
* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must
match the quantity representation used by Kubernetes. An eviction threshold can also
be expressed as a percentage using the `%` token.

For example, if a node has `10Gi` of total memory and you want to trigger eviction if
the available memory falls below `1Gi`, you can define the eviction threshold as
either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.

#### Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet`
returns an error on startup.

In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed Pod termination grace period to use when evicting
Pods from the node. If specified, the `kubelet` uses the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` kills Pods immediately with no graceful
termination.

To configure soft eviction thresholds, the following flags are supported; a
combined example follows the list:

* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
corresponding grace period would trigger a Pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
Pods in response to a soft eviction threshold being met.
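
For instance, a minimal combined configuration (the threshold and grace values are illustrative):

```
--eviction-soft=memory.available<1.5Gi
--eviction-soft-grace-period=memory.available=1m30s
--eviction-max-pod-grace-period=30
```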

#### Hard Eviction Thresholds

A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` kills the Pod immediately
with no graceful termination.

To configure hard eviction thresholds, the following flag is supported (an
illustrative example follows):

* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
would trigger a Pod eviction.
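
For instance (the values are illustrative; multiple thresholds are comma-separated
and may mix absolute quantities with percentages):

```
--eviction-hard=memory.available<1Gi,nodefs.available<10%
```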
The `kubelet` has the following default hard eviction threshold:

* `memory.available<100Mi`

The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.

### Node Conditions

The `kubelet` maps one or more eviction signals to a corresponding node condition.

If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` reports a condition that
reflects the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal | Description |
|------------------|--------------------|-----------------------------------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` continues to report node status updates at the frequency specified by
`--node-status-update-frequency` which defaults to `10s`.
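
To inspect the conditions a node is currently reporting, you can use a command along
these lines (the node name is illustrative):

```
kubectl describe node node-1 | grep -A 3 'Conditions:'
```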

### Oscillation of node conditions

### Reclaiming node level resources

If an eviction threshold has been met and the grace period has passed,
the `kubelet` initiates the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.

The `kubelet` attempts to reclaim node level resources prior to evicting end-user Pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
machine has a dedicated `imagefs` configured for the container runtime.

#### With `imagefs`

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` frees up disk space by deleting the dead Pods and their containers.

If the `imagefs` filesystem has met eviction thresholds, the `kubelet` frees up disk space by deleting all unused images.

#### Without `imagefs`

If the `nodefs` filesystem has met eviction thresholds, the `kubelet` frees up disk space in the following order:

1. Delete dead Pods and their containers
1. Delete all unused images

### Evicting end-user Pods

If the `kubelet` is unable to reclaim sufficient resource on the node, the `kubelet` begins evicting Pods.

The `kubelet` ranks Pods for eviction first by their quality of service, and then by the consumption
of the starved compute resource relative to the Pods' scheduling requests.

As a result, the `kubelet` ranks and evicts Pods in the following order:

* `BestEffort` Pods that consume the most of the starved resource are failed first.
Local disk is a `BestEffort` resource.
* `Burstable` Pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no Pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` Pods are guaranteed only when requests and limits are specified
for all the containers and they are equal. A `Guaranteed` Pod is guaranteed to
never be evicted because of another Pod's resource consumption. If a system
daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources
than were reserved via `system-reserved` or `kube-reserved` allocations, and the
node only has `Guaranteed` Pods remaining, then the node must choose to evict a
`Guaranteed` Pod in order to preserve node stability and to limit the impact
of the unexpected consumption to other `Guaranteed` Pods. A quick way to check a
Pod's quality of service class is shown after this list.

If necessary, the `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
`inodes` by evicting Pods with the lowest quality of service first. If the `kubelet`
is responding to lack of available disk, it ranks Pods within a quality of service
class by disk consumption and kills the largest consumers first.

#### With `imagefs`

If `nodefs` is triggering evictions, the `kubelet` sorts Pods based on their usage of `nodefs`:
local volumes plus the logs of all their containers.

If `imagefs` is triggering evictions, the `kubelet` sorts Pods based on the writable layer usage of all their containers.

#### Without `imagefs`

If `nodefs` is triggering evictions, the `kubelet` sorts Pods based on their total disk usage:
local volumes plus the logs and writable layer of all their containers.

### Minimum eviction reclaim

In certain scenarios, eviction of Pods could result in reclamation of only a small amount of resources. This can result in
the `kubelet` hitting eviction thresholds in repeated successions. In addition, eviction of resources like `disk`
is time consuming.

To mitigate these issues, the `kubelet` can have a per-resource `minimum-reclaim`. Whenever the `kubelet` observes
resource pressure, the `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
the configured eviction threshold.

For example, with the following configuration:

```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

If an eviction threshold is triggered for `memory.available`, the `kubelet` works to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` works
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it
works to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.

The default `eviction-minimum-reclaim` is `0` for all resources.

### Scheduler

The node reports a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
Pods on the node.

| Node Condition   | Scheduler Behavior                                  |
| ---------------- | --------------------------------------------------- |
| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
| `DiskPressure`   | No new Pods are scheduled to the node.              |

## Node OOM Behavior

If the node experiences a system OOM (out of memory) event before the `kubelet` is able to reclaim memory,
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.

The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the Pod.

| Quality of Service | oom_score_adj |
|----------------------------|-----------------------------------------------------------------------|
| `Guaranteed` | -998 |
| `BestEffort` | 1000 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
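
As a hypothetical worked example of the `Burstable` formula, a container whose Pod
requests `4Gi` of memory on a node with `16Gi` of capacity gets an `oom_score_adj` of
`1000 - (1000 * 4) / 16 = 750`:

```
# Hypothetical values: a 4Gi memory request on a 16Gi node.
memory_request_bytes=$((4 * 1024 * 1024 * 1024))
machine_memory_capacity_bytes=$((16 * 1024 * 1024 * 1024))
echo $((1000 - (1000 * memory_request_bytes) / machine_memory_capacity_bytes))  # prints 750
```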

If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
an `oom_score` based on the percentage of memory each container is using on the node, adds the `oom_score_adj` to get an
effective `oom_score` for the container, and then kills the container with the highest score.

The intended behavior is that containers with the lowest quality of service that
are consuming the largest amount of memory relative to the scheduling request are killed first in order
to reclaim memory.

Unlike Pod eviction, if a Pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.

## Best Practices

The following sections describe best practices for out of resource handling.

### Schedulable resources and eviction policies

Consider the following scenario:

* Node memory capacity: `10Gi`
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
* Operator wants to evict Pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

To facilitate this scenario, the `kubelet` would be launched as follows:
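
```
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```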

Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
covered by the eviction threshold.

To reach that capacity, either some Pod is using more than its request, or the system is using more than `500Mi`.

This configuration ensures that the scheduler does not place Pods on a node that immediately induce memory pressure
and trigger eviction, assuming those Pods use less than their configured request.
### DaemonSet

It is never desired for a `kubelet` to evict a `DaemonSet` Pod, since the Pod is
immediately recreated and rescheduled back to the same node.

At the moment, the `kubelet` has no ability to distinguish a Pod created
from `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could proactively filter those Pods from the
candidate set of Pods provided to the eviction strategy.

In general, it is strongly recommended that `DaemonSet` not
create `BestEffort` Pods to avoid being identified as a candidate Pod
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` Pods.

## Deprecation of existing feature flags to reclaim disk

`kubelet` has been freeing up disk space on demand to keep the node stable.

As disk-based eviction matures, the following `kubelet` flags are marked for deprecation
in favor of the simpler configuration supported around eviction.

| Existing Flag | New Flag |
| ------------- | -------- |

## Known issues

The following sections describe known issues related to out of resource handling.

### kubelet may not observe memory pressure right away

The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us immediately when a threshold has been crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
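
For instance, on such a node that guidance might translate into a threshold like the
following (the percentage is illustrative):

```
--eviction-hard=memory.available<25%
```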

### kubelet may evict more Pods than needed

The Pod eviction may evict more Pods than needed due to a gap in stats collection timing. This can be mitigated by adding
the ability to get root container stats on an on-demand basis [(https://github.com/google/cadvisor/issues/1247)](https://github.com/google/cadvisor/issues/1247) in the future.

### How kubelet ranks Pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it evicts Pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per container inode consumption [(https://github.com/google/cadvisor/issues/1422)](https://github.com/google/cadvisor/issues/1422), which would allow us to rank Pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict that Pod over others.