adding docs for node allocatable

Signed-off-by: Vishnu kannan <vishnuk@google.com>
2017-02-28 16:12:04 -08:00 · 2017-02-28 16:12:04 -08:00 · 0ecd5254d9
parent 73cb92a0a9
commit 0ecd5254d9
2 changed files with 145 additions and 0 deletions
--- a/_data/guides.yml
+++ b/_data/guides.yml
@@ -172,6 +172,7 @@ toc:
  - docs/admin/cluster-management.md
  - docs/admin/kubeadm.md
  - docs/admin/addons.md
+  - docs/admin/node-allocatable.md
  - docs/admin/audit.md
  - docs/admin/ha-master-gce.md
  - docs/admin/namespaces/index.md
--- a/docs/admin/node-allocatable.md
+++ b/docs/admin/node-allocatable.md
@@ -0,0 +1,144 @@
+---
+assignees:
+- vishh
+- derekwaynecarr
+- dashpole
+title: Reserving Compute Resources for System Daemons
+---
+
+* TOC
+{:toc}
+
+Kubernetes nodes can be scheduled to `capacity`.
+Pods can consume all the available capacity on a node by default.
+This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself.
+Unless resources are set aside for these system daemons, pods and system daemons compete for resources, which leads to resource starvation on the node.
+The `kubelet` exposes a feature named `Node Allocatable` that helps to reserve compute resources for system daemons.
+Kubernetes recommends that cluster administrators configure `Node Allocatable` based on the workload density of each node.
+
+## Node Allocatable
+
+       Node Capacity
+---------------------------
+|     kube-reserved       |
+|-------------------------|
+|     system-reserved     |
+|-------------------------|
+|    eviction-threshold   |
+|-------------------------|
+|                         |
+|       allocatable       |
+|        (available       |
+|        for pods)        |
+|                         |
+|                         |
+---------------------------
+
+`Allocatable` on a Kubernetes node is defined as the amount of compute resources that are available for pods.
+The scheduler does not oversubscribe `allocatable`.
+`CPU` and `memory` are supported as of now.
+Support for `storage` will be added in the future.
+
+Node Allocatable is exposed as part of the `v1.Node` object in the API and as part of `kubectl describe node` in the CLI.
+
+Resources can be reserved for two categories of system daemons in the `kubelet`.
+
+### Kube Reserved
+
+**Kubelet Flag**: `--kube-reserved=[cpu=100m][,][memory=100Mi]`
+**Kubelet Flag**: `--kube-reserved-cgroup=/runtime.slice`
+
+`kube-reserved` is meant to capture resource reservation for Kubernetes system daemons like the `kubelet`, `container runtime`, `node problem detector`, etc.
+It is not meant to reserve resources for system daemons that are run as pods.
+`kube-reserved` is typically a function of `pod density` on the nodes.
+[This performance dashboard](http://node-perf-dash.k8s.io/#/builds) exposes `cpu` and `memory` usage profiles of `kubelet` and `docker engine` at multiple levels of pod density.
+[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard can be interpreted to come up with a suitable `kube-reserved` reservation.
+
+It is recommended that the Kubernetes system daemons are placed under a top-level control group (`runtime.slice` on systemd machines, for example).
+Each system daemon should ideally run within its own child control group.
+Refer to [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup) for more details on recommended control group hierarchy.
+
+To optionally enforce `kube-reserved` on system daemons, specify the parent control group for kube daemons as the value of the `--kube-reserved-cgroup` kubelet flag.
+
+### System Reserved
+
+**Kubelet Flag**: `--system-reserved=[cpu=100m][,][memory=100Mi]`
+**Kubelet Flag**: `--system-reserved-cgroup=/system.slice`
+
+
+`system-reserved` is meant to capture resource reservation for OS system daemons like `sshd`, `udev`, etc.
+`system-reserved` should reserve `memory` for the `kernel` too since `kernel` memory is not accounted to pods (yet) in Kubernetes.
+Reserving resources for user login sessions is also recommended (`user.slice` in systemd world).
+
+To optionally enforce `system-reserved` on system daemons, specify the parent control group for OS system daemons as the value of the `--system-reserved-cgroup` kubelet flag.
+
+### Eviction Thresholds
+
+**Kubelet Flag**: `--eviction-hard=[memory.available<500Mi]`
+
+Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it.
+Nodes can go offline temporarily until memory has been reclaimed.
+To avoid system OOMs (or reduce their probability), the `kubelet` provides [`Out of Resource`](./out-of-resource.md) management.
+Evictions are supported for `memory` and `storage` only.
+By reserving some memory via the `--eviction-hard` flag, the `kubelet` attempts to `evict` pods whenever memory availability on the node drops below the reserved value.
+Hypothetically, even if system daemons did not exist on a node, pods could not use more than `capacity - eviction-hard`.
+For this reason, resources reserved for evictions will not be available for pods.
+
+### Enforcing Node Allocatable
+
+**Kubelet Flag**: `--enforce-node-allocatable=[pods][,][system-reserved][,][kube-reserved]`
+
+The scheduler will treat `Allocatable` as the available `capacity` for pods.
+
+`kubelet` will enforce `Allocatable` across pods by default.
+This enforcement is controlled by passing the `pods` value to the kubelet flag `--enforce-node-allocatable`.
+
+Optionally, `kubelet` can be made to enforce `kube-reserved` and `system-reserved` by specifying the `kube-reserved` and `system-reserved` values in the same flag.
+Note that to enforce `kube-reserved` or `system-reserved`, `--kube-reserved-cgroup` or `--system-reserved-cgroup` needs to be specified respectively.
+
+## General Guidelines
+
+System daemons are expected to be treated similarly to `Guaranteed` pods.
+System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of Kubernetes deployments.
+For example, `kubelet` should have its own control group and share `kube-reserved` resources with the container runtime.
+However, the `kubelet` cannot burst and use up all available node resources if `kube-reserved` is enforced.
+
+Be extra careful while enforcing `system-reserved` reservation since it can lead to critical system services being CPU starved or OOM killed on the node.
+The recommendation is to enforce `system-reserved` only if a user has profiled their nodes exhaustively to come up with precise estimates.
+
+* To begin with, enforce `Allocatable` on `pods`.
+* Once adequate monitoring and alerting is in place to track kube system daemons, attempt to enforce `kube-reserved` based on usage heuristics.
+* If absolutely necessary, enforce `system-reserved` over time.
+
+The resource requirements of kube system daemons will grow over time as more and more features are added.
+Over time, Kubernetes will attempt to bring down utilization of node system daemons, but that is not a priority as of now.
+So expect a drop in `Allocatable` capacity in future releases.
+
+## Example Scenario
+
+Here is an example to illustrate Node Allocatable computation:
+
+* Node has `32Gi` of `memory` and `16 CPUs`
+* `--kube-reserved` is set to `cpu=1,memory=2Gi`
+* `--system-reserved` is set to `cpu=500m,memory=1Gi`
+* `--eviction-hard` is set to `memory.available<500Mi`
+
+Under this scenario, `Allocatable` will be `14.5 CPUs` and `28.5Gi` of memory.
+The scheduler will ensure that the total `requests` across all pods on this node do not exceed `28.5Gi`.
+The `kubelet` will evict pods whenever the overall memory usage across pods exceeds `28.5Gi`.
+If all processes on the node consume as much CPU as they can, pods together cannot consume more than `14.5 CPUs`.
+
+If `kube-reserved` and/or `system-reserved` are not enforced and system daemons exceed their reservation, `kubelet` will evict pods whenever the overall node memory usage is higher than `31.5Gi`.
+
+## Feature Availability
+
+Since `v1.2`, it has been possible to **optionally** specify `kube-reserved` and `system-reserved` reservations.
+In the same release, the scheduler switched to using `Allocatable` instead of `Capacity` when available.
+
+Since `v1.6`, `eviction-thresholds` are taken into account when computing `Allocatable`.
+To revert to the old behavior, set the `--experimental-allocatable-ignore-eviction` kubelet flag to `true`.
+
+Since `v1.6`, `kubelet` will enforce `Allocatable` on pods using control groups.
+To revert to the old behavior, unset the `--enforce-node-allocatable` kubelet flag.
+Note that unless the `--kube-reserved`, `--system-reserved` or `--eviction-hard` flags have non-default values, `Allocatable` enforcement will not affect existing deployments.
+
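The arithmetic in the "Example Scenario" section of `node-allocatable.md` can be reproduced with a short Python sketch. It assumes `Allocatable = capacity - kube-reserved - system-reserved - eviction-hard`, as the diagram in the doc suggests. The helper names (`parse_cpu`, `parse_memory`, `parse_reservation`, `node_allocatable`) are hypothetical and only handle the quantity suffixes used in the example; this is not the kubelet's own parsing or accounting code. Note that `500Mi` is slightly less than `0.5Gi`, so the exact memory figure comes out near `28.51Gi`, which the doc rounds to `28.5Gi`.

```python
from fractions import Fraction

# Binary suffixes used by the memory quantities in the example (subset only).
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
GI = 2**30


def parse_cpu(quantity: str) -> Fraction:
    """Parse a CPU quantity such as '500m' or '16' into whole CPUs."""
    if quantity.endswith("m"):
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def parse_memory(quantity: str) -> int:
    """Parse a memory quantity such as '2Gi' or '500Mi' into bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def parse_reservation(flag_value: str) -> dict:
    """Parse a value like 'cpu=1,memory=2Gi'; missing resources default to 0."""
    parsed = {"cpu": Fraction(0), "memory": 0}
    for item in filter(None, flag_value.split(",")):
        name, quantity = item.split("=", 1)
        if name == "cpu":
            parsed["cpu"] = parse_cpu(quantity)
        elif name == "memory":
            parsed["memory"] = parse_memory(quantity)
        else:
            raise ValueError(f"unsupported resource: {name}")
    return parsed


def node_allocatable(capacity, kube_reserved, system_reserved, eviction_hard_memory):
    """Subtract both reservations (and the memory eviction threshold) from capacity."""
    cap = parse_reservation(capacity)
    kube = parse_reservation(kube_reserved)
    system = parse_reservation(system_reserved)
    return {
        "cpu": cap["cpu"] - kube["cpu"] - system["cpu"],
        "memory": cap["memory"] - kube["memory"] - system["memory"]
        - parse_memory(eviction_hard_memory),
    }


# Values taken from the example scenario in the doc.
alloc = node_allocatable(
    capacity="cpu=16,memory=32Gi",
    kube_reserved="cpu=1,memory=2Gi",
    system_reserved="cpu=500m,memory=1Gi",
    eviction_hard_memory="500Mi",
)
# Node-level eviction point when the reservations are not enforced.
node_eviction_point = parse_memory("32Gi") - parse_memory("500Mi")

print(f"allocatable cpu: {float(alloc['cpu'])}")
print(f"allocatable memory: {alloc['memory'] / GI:.2f}Gi")
print(f"node-level eviction point: {node_eviction_point / GI:.2f}Gi")
```

Using `Fraction` for CPU avoids floating-point drift when millicore values are subtracted; memory stays in integer bytes for the same reason.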