Kubernetes nodes can be scheduled to `Capacity`. By default, pods can consume all the available capacity on a node. This is a problem because nodes typically run several system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources, which leads to resource starvation on the node.
The `kubelet` exposes a feature named `Node Allocatable` that helps reserve compute resources for system daemons. Cluster administrators are advised to configure `Node Allocatable` based on the workload density on each node.
`kube-reserved` is meant to capture the resource reservation for Kubernetes system daemons such as the `kubelet`, `container runtime`, `node problem detector`, etc.
It is not meant to reserve resources for system daemons that are run as pods.
`kube-reserved` is typically a function of `pod density` on the nodes.
[This performance dashboard](http://node-perf-dash.k8s.io/#/builds) exposes `cpu` and `memory` usage profiles of `kubelet` and `docker engine` at multiple levels of pod density.
[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard can be interpreted to come up with a suitable `kube-reserved` reservation.
To optionally enforce `kube-reserved` on system daemons, specify the parent control group for kube daemons as the value of the `--kube-reserved-cgroup` kubelet flag.
It is recommended that the Kubernetes system daemons be placed under a top-level control group (for example, `runtime.slice` on systemd machines).
Each system daemon should ideally run within its own child control group.
Refer to [this doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup) for more details on recommended control group hierarchy.
`system-reserved` is meant to capture resource reservation for OS system daemons like `sshd`, `udev`, etc.
`system-reserved` should reserve `memory` for the `kernel` too since `kernel` memory is not accounted to pods (yet) in Kubernetes.
Reserving resources for user login sessions is also recommended (`user.slice` in systemd world).
To optionally enforce `system-reserved` on system daemons, specify the parent control group for OS system daemons as the value of the `--system-reserved-cgroup` kubelet flag.
Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it.
Nodes can go offline temporarily until memory has been reclaimed.
To avoid system OOMs (or reduce their probability), the `kubelet` provides [`Out of Resource`](./out-of-resource.md) management.
Evictions are supported for `memory` and `storage` only.
By reserving some memory via the `--eviction-hard` flag, the `kubelet` attempts to `evict` pods whenever memory availability on the node drops below the reserved value.
Hypothetically, if system daemons did not exist on a node, pods could not use more than `capacity - eviction-hard`.
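
The arithmetic behind these reservations can be sketched in a few lines of Python. The sketch assumes the commonly documented relationship `Allocatable = Capacity - kube-reserved - system-reserved - eviction-hard`. The function name and all sample figures are hypothetical and chosen only for illustration.

```python
def node_allocatable_mi(capacity_mi, kube_reserved_mi=0,
                        system_reserved_mi=0, eviction_hard_mi=0):
    """Return the memory (in Mi) left for pods after all reservations.

    Assumes Allocatable = Capacity - kube-reserved - system-reserved
    - eviction-hard. All arguments are integers in Mi.
    """
    allocatable = (capacity_mi - kube_reserved_mi
                   - system_reserved_mi - eviction_hard_mi)
    if allocatable < 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable


# Hypothetical 16Gi node with no daemon reservations: pods are bounded
# by capacity - eviction-hard, as described above.
no_daemons = node_allocatable_mi(16 * 1024, eviction_hard_mi=100)

# The same hypothetical node with reservations for both daemon groups.
with_reservations = node_allocatable_mi(
    16 * 1024,
    kube_reserved_mi=1024,
    system_reserved_mi=512,
    eviction_hard_mi=100,
)
```

With no daemon reservations the result collapses to `capacity - eviction-hard`. Each reservation that is added shrinks the amount left for pods by exactly its size.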
Enforcement of `Allocatable` across pods is controlled by specifying the `pods` value in the kubelet flag `--enforce-node-allocatable`.
Optionally, the `kubelet` can be made to enforce `kube-reserved` and `system-reserved` by specifying the `kube-reserved` and `system-reserved` values in the same flag.
Note that to enforce `kube-reserved` or `system-reserved`, `--kube-reserved-cgroup` or `--system-reserved-cgroup` must be specified, respectively.
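
The dependency between these flags can be checked before a node is configured. The following Python sketch is one possible way to do that. The helper `enforcement_flags` is hypothetical, it covers only the three enforcement values named above, and the cgroup path `/runtime.slice` is an illustrative assumption rather than a required name.

```python
ALLOWED_VALUES = ("pods", "kube-reserved", "system-reserved")


def enforcement_flags(enforce, kube_reserved_cgroup=None,
                      system_reserved_cgroup=None):
    """Build kubelet enforcement flags and check their prerequisites.

    `enforce` is a list of values for --enforce-node-allocatable.
    Raises ValueError if a reservation is enforced without its cgroup.
    """
    unknown = [value for value in enforce if value not in ALLOWED_VALUES]
    if unknown:
        raise ValueError("unsupported values: " + ", ".join(unknown))
    if "kube-reserved" in enforce and not kube_reserved_cgroup:
        raise ValueError("kube-reserved requires --kube-reserved-cgroup")
    if "system-reserved" in enforce and not system_reserved_cgroup:
        raise ValueError("system-reserved requires --system-reserved-cgroup")

    # The enforcement values are passed as one comma-separated list.
    flags = ["--enforce-node-allocatable=" + ",".join(enforce)]
    if kube_reserved_cgroup:
        flags.append("--kube-reserved-cgroup=" + kube_reserved_cgroup)
    if system_reserved_cgroup:
        flags.append("--system-reserved-cgroup=" + system_reserved_cgroup)
    return flags


# Hypothetical usage: enforce on pods and on the Kubernetes daemons.
flags = enforcement_flags(["pods", "kube-reserved"],
                          kube_reserved_cgroup="/runtime.slice")
```

Failing early keeps a missing cgroup setting from reaching the node, where it would be harder to diagnose.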
## General Guidelines
System daemons are expected to be treated similarly to `Guaranteed` pods.
System daemons can burst within their bounding control groups, and this behavior needs to be managed as part of Kubernetes deployments.
For example, the `kubelet` should have its own control group and share `kube-reserved` resources with the container runtime.
However, the `kubelet` cannot burst and use up all available node resources if `kube-reserved` is enforced.
Be extra careful while enforcing `system-reserved` reservation since it can lead to critical system services being CPU starved or OOM killed on the node.
The recommendation is to enforce `system-reserved` only if a user has profiled their nodes exhaustively to come up with precise estimates and is confident in their ability to recover if any process in that group is OOM killed.
If `kube-reserved` and/or `system-reserved` is not enforced and system daemons exceed their reservation, `kubelet` evicts pods whenever the overall node memory usage is higher than `31.5Gi`.
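
The two eviction triggers involved here can be contrasted with a small Python sketch. It assumes that pods are evicted when their combined usage exceeds `Allocatable`, and that the node-level trigger is `capacity - eviction-hard`. The node size (32Gi), the reservation (3Gi in total), and the 512Mi threshold are assumptions chosen so that the node-level limit works out to exactly `31.5Gi`. The function name is hypothetical.

```python
GI = 1024  # Mi per Gi


def eviction_reasons(pod_usage_mi, daemon_usage_mi, capacity_mi,
                     reserved_mi, eviction_hard_mi):
    """Return the reasons (if any) for which pods would be evicted."""
    # Pod-level limit: what is left after reservations and the threshold.
    allocatable = capacity_mi - reserved_mi - eviction_hard_mi
    # Node-level limit: applies even when reservations are not enforced.
    node_limit = capacity_mi - eviction_hard_mi

    reasons = []
    if pod_usage_mi > allocatable:
        reasons.append("pod usage exceeds allocatable")
    if pod_usage_mi + daemon_usage_mi > node_limit:
        reasons.append("node usage exceeds capacity - eviction-hard")
    return reasons


# Assumed scenario: pods stay within Allocatable (28Gi of 28.5Gi), but
# unenforced daemons use 4Gi instead of their 3Gi reservation.
reasons = eviction_reasons(
    pod_usage_mi=28 * GI,
    daemon_usage_mi=4 * GI,
    capacity_mi=32 * GI,
    reserved_mi=3 * GI,
    eviction_hard_mi=512,
)
```

In this assumed scenario only the node-level trigger fires: the pods are within their own limit, yet they are still evicted because the daemons pushed total node usage past `31.5Gi`.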
Note that unless the `--kube-reserved`, `--system-reserved`, or `--eviction-hard` flags have non-default values, `Allocatable` enforcement does not affect existing deployments.