adding docs for node allocatable

Signed-off-by: Vishnu kannan <vishnuk@google.com>

@@ -172,6 +172,7 @@ toc:
   - docs/admin/cluster-management.md
   - docs/admin/kubeadm.md
   - docs/admin/addons.md
+  - docs/admin/node-allocatable.md
   - docs/admin/audit.md
   - docs/admin/ha-master-gce.md
   - docs/admin/namespaces/index.md

@@ -0,0 +1,144 @@

---
assignees:
- vishh
- derekwaynecarr
- dashpole
title: Reserving Compute Resources for System Daemons
---

* TOC
{:toc}

Kubernetes nodes can be scheduled up to `capacity`.
Pods can consume all the available capacity on a node by default.
This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself.
Unless resources are set aside for these system daemons, pods and system daemons compete for resources, leading to resource starvation issues on the node.
The `kubelet` exposes a feature named `Node Allocatable` that helps reserve compute resources for system daemons.
Kubernetes recommends that cluster administrators configure `Node Allocatable` based on the workload density on each node.

## Node Allocatable

```text
      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available            |
|    for pods)            |
|                         |
|                         |
---------------------------
```

`Allocatable` on a Kubernetes node is defined as the amount of compute resources that are available for pods.
The scheduler does not oversubscribe `allocatable`.
`CPU` and `memory` are supported as of now.
Support for `storage` will be added in the future.

Node Allocatable is exposed as part of the `v1.Node` object in the API and as part of `kubectl describe node` in the CLI.
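
For example, a quick way to inspect a node's `capacity` and `allocatable` values (the node name `my-node` is illustrative):

```shell
# Print the allocatable resources recorded in the node's status
kubectl get node my-node -o jsonpath='{.status.allocatable}'

# Capacity and Allocatable also appear in the describe output
kubectl describe node my-node
```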

Resources can be reserved for two categories of system daemons in the `kubelet`.

### Kube Reserved

**Kubelet Flag**: `--kube-reserved=[cpu=100m][,][memory=100Mi]`
**Kubelet Flag**: `--kube-reserved-cgroup=/runtime.slice`

`kube-reserved` is meant to capture resource reservation for kubernetes system daemons like the `kubelet`, `container runtime`, `node problem detector`, etc.
It is not meant to reserve resources for system daemons that are run as pods.
`kube-reserved` is typically a function of `pod density` on the nodes.
[This performance dashboard](http://node-perf-dash.k8s.io/#/builds) exposes `cpu` and `memory` usage profiles of `kubelet` and `docker engine` at multiple levels of pod density.
[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard can be interpreted to come up with a suitable `kube-reserved` reservation.

It is recommended that the kubernetes system daemons are placed under a top-level control group (`system.slice` on systemd machines, for example).
Each system daemon should ideally run within its own child control group.
Refer to [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup) for more details on the recommended control group hierarchy.
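
As a sketch, the resulting hierarchy on a systemd machine might look like the following; `runtime.slice` is an illustrative name for the kube daemons' cgroup, and `kubepods` is assumed to be the cgroup under which the `kubelet` runs pods:

```text
/ (root cgroup)
├── system.slice     OS daemons such as sshd and udev
├── runtime.slice    kube daemons such as the kubelet and container runtime
└── kubepods         pods, sized to Allocatable
```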

To optionally enforce `kube-reserved` on system daemons, specify the parent control group for kube daemons as the value for the `--kube-reserved-cgroup` kubelet flag.
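
For instance, a sketch of the flags involved (the values and `/runtime.slice` are illustrative, not recommendations):

```shell
# Reserve 100m CPU and 100Mi memory for kube daemons, and enforce the
# reservation on the cgroup the daemons actually run under
kubelet --kube-reserved=cpu=100m,memory=100Mi \
        --kube-reserved-cgroup=/runtime.slice \
        --enforce-node-allocatable=pods,kube-reserved
```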

### System Reserved

**Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi]`
**Kubelet Flag**: `--system-reserved-cgroup=/system.slice`

`system-reserved` is meant to capture resource reservation for OS system daemons like `sshd`, `udev`, etc.
`system-reserved` should reserve `memory` for the `kernel` too, since `kernel` memory is not accounted to pods (yet) in Kubernetes.
Reserving resources for user login sessions is also recommended (`user.slice` in the systemd world).

To optionally enforce `system-reserved` on system daemons, specify the parent control group for OS system daemons as the value for the `--system-reserved-cgroup` kubelet flag.
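
Analogously, a sketch for OS daemons (values illustrative; see the caution in the guidelines below before enforcing this in production):

```shell
kubelet --system-reserved=cpu=500m,memory=1Gi \
        --system-reserved-cgroup=/system.slice \
        --enforce-node-allocatable=pods,system-reserved
```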

### Eviction Thresholds

**Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`

Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it.
Nodes can go offline temporarily until memory has been reclaimed.
To avoid (or to reduce the probability of) system OOMs, the `kubelet` provides [`Out of Resource`](./out-of-resource.md) management.
Evictions are supported for `memory` and `storage` only.
By reserving some memory via the `--eviction-hard` flag, the `kubelet` attempts to `evict` pods whenever memory availability on the node drops below the reserved value.
Hypothetically, if system daemons did not exist on a node, pods could not use more than `capacity - eviction-hard`.
For this reason, resources reserved for evictions are not available for pods.
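
A small sketch; note that the `<` needs quoting when the flag is passed on a shell command line:

```shell
# Evict pods once available node memory drops below 500Mi
kubelet --eviction-hard='memory.available<500Mi'
```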

### Enforcing Node Allocatable

**Kubelet Flag**: `--enforce-node-allocatable=[pods][,][system-reserved][,][kube-reserved]`

The scheduler will treat `Allocatable` as the available `capacity` for pods.

`kubelet` will enforce `Allocatable` across pods by default.
This enforcement is controlled by specifying the `pods` value to the kubelet flag `--enforce-node-allocatable`.

Optionally, `kubelet` can be made to enforce `kube-reserved` and `system-reserved` by specifying `kube-reserved` & `system-reserved` values in the same flag.
Note that to enforce `kube-reserved` or `system-reserved`, `--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified respectively.
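
Putting it together, a sketch of a kubelet invocation that enforces all three (values match the example scenario below; cgroup names illustrative):

```shell
kubelet --enforce-node-allocatable=pods,kube-reserved,system-reserved \
        --kube-reserved=cpu=1,memory=2Gi \
        --kube-reserved-cgroup=/runtime.slice \
        --system-reserved=cpu=500m,memory=1Gi \
        --system-reserved-cgroup=/system.slice \
        --eviction-hard='memory.available<500Mi'
```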

## General Guidelines

System daemons are expected to be treated similarly to `Guaranteed` pods.
System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of kubernetes deployments.
For example, `kubelet` should have its own control group and share `kube-reserved` resources with the container runtime.
However, the `kubelet` cannot burst and use up all available node resources if `kube-reserved` is enforced.

Be extra careful while enforcing `system-reserved` reservation, since it can lead to critical system services being CPU starved or OOM killed on the node.
The recommendation is to enforce `system-reserved` only if a user has profiled their nodes exhaustively to come up with precise estimates.

* To begin with, enforce `Allocatable` on `pods`.
* Once adequate monitoring and alerting is in place to track kube system daemons, attempt to enforce `kube-reserved` based on usage heuristics.
* If absolutely necessary, enforce `system-reserved` over time.

The resource requirements of kube system daemons will grow over time as more and more features are added.
Over time, kubernetes will attempt to bring down the utilization of node system daemons, but that is not a priority as of now.
So expect a drop in `Allocatable` capacity in future releases.

## Example Scenario

Here is an example to illustrate Node Allocatable computation:

* Node has `32Gi` of `memory` and `16 CPUs`
* `--kube-reserved` is set to `cpu=1,memory=2Gi`
* `--system-reserved` is set to `cpu=500m,memory=1Gi`
* `--eviction-hard` is set to `memory.available<500Mi`

Under this scenario, `Allocatable` will be `14.5 CPUs` & `28.5Gi` of memory.
The scheduler will ensure that the total memory `requests` across all pods on this node does not exceed `28.5Gi`.
The kubelet will evict pods whenever the overall memory usage across pods exceeds `28.5Gi`.
If all processes on the node consume as much CPU as they can, pods together cannot consume more than `14.5 CPUs`.
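
The arithmetic behind those numbers (the eviction threshold applies to `memory` only, so nothing is deducted from `cpu` for it):

```text
memory: 32Gi - 2Gi (kube-reserved) - 1Gi (system-reserved) - 500Mi (eviction-hard) ≈ 28.5Gi
cpu:    16   - 1   (kube-reserved) - 0.5 (system-reserved)                          = 14.5
```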

If `kube-reserved` and/or `system-reserved` is not enforced and system daemons exceed their reservation, the `kubelet` will evict pods whenever the overall node memory usage is higher than `31.5Gi` (`capacity - eviction-hard`).

## Feature Availability

Since `v1.2`, it has been possible to **optionally** specify `kube-reserved` and `system-reserved` reservations.
The scheduler switched to using `Allocatable` instead of `Capacity`, when available, in the same release.

Since `v1.6`, `eviction-thresholds` are taken into account when computing `Allocatable`.
To revert to the old behavior, set the `--experimental-allocatable-ignore-eviction` kubelet flag to `true`.

Since `v1.6`, `kubelet` will enforce `Allocatable` on pods using control groups.
To revert to the old behavior, unset the `--enforce-node-allocatable` kubelet flag.
Note that unless the `--kube-reserved`, `--system-reserved`, or `--eviction-hard` flags have non-default values, `Allocatable` enforcement will not affect existing deployments.