---
title: Using sysctls in a Kubernetes Cluster
reviewers:
- sttts
content_template: templates/task
---
{{% capture overview %}}
{{< feature-state for_k8s_version="v1.12" state="beta" >}}
This document describes how to configure and use kernel parameters within a
Kubernetes cluster using the sysctl interface.
{{% /capture %}}
{{% capture prerequisites %}}
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
{{% /capture %}}
{{% capture steps %}}
## Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
process file system. The parameters cover various subsystems such as:
- kernel (common prefix: `kernel.`)
- networking (common prefix: `net.`)
- virtual memory (common prefix: `vm.`)
- MDADM (common prefix: `dev.`)
- More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README).
To get a list of all parameters, you can run:
```shell
sudo sysctl -a
```
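
Because each parameter is also exposed as a file under `/proc/sys/`, a dotted sysctl name can be translated into a path. The following is a minimal sketch; `sysctl_path` is a hypothetical helper, and it assumes that no path component itself contains a dot.

```shell
# Hypothetical helper: map a dotted sysctl name to its file under /proc/sys.
# Assumes that no single path component contains a dot.
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

sysctl_path net.ipv4.ip_local_port_range
# prints /proc/sys/net/ipv4/ip_local_port_range
```

On a Linux host, reading the resulting file with `cat` should show the same value that `sysctl -n` reports for that name.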
## Enabling Unsafe Sysctls
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
namespacing, a _safe_ sysctl must be properly _isolated_ between pods on the same
node. This means that setting a _safe_ sysctl for one pod:
- must not have any influence on any other pod on the node
- must not allow harming the node's health
- must not allow gaining CPU or memory resources outside of the resource limits
  of a pod.

Most of the _namespaced_ sysctls are not necessarily considered _safe_.
The following sysctls are supported in the _safe_ set:
- `kernel.shm_rmid_forced`,
- `net.ipv4.ip_local_port_range`,
- `net.ipv4.tcp_syncookies`.
{{< note >}}
The example `net.ipv4.tcp_syncookies` is not namespaced on Linux kernel version 4.4 or lower.
{{< /note >}}
This list will be extended in future Kubernetes versions when the kubelet
supports better isolation mechanisms.
All _safe_ sysctls are enabled by default.
All _unsafe_ sysctls are disabled by default and must be allowed manually by the
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
scheduled, but will fail to launch.
Keeping the risks in mind, the cluster admin can allow certain _unsafe_
sysctls for very special situations such as high-performance or real-time
application tuning. _Unsafe_ sysctls are enabled on a node-by-node basis with a
kubelet flag, for example:
```shell
kubelet --allowed-unsafe-sysctls \
'kernel.msg*,net.ipv4.route.min_pmtu' ...
```
For minikube, this can be done via the `extra-config` flag:
```shell
minikube start --extra-config="kubelet.allowed-unsafe-sysctls=kernel.msg*,net.ipv4.route.min_pmtu"...
```
Only _namespaced_ sysctls can be enabled this way.
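
To make the pattern syntax concrete, the sketch below checks a sysctl name against a comma-separated list like the one passed to `--allowed-unsafe-sysctls`. The function `sysctl_allowed` is a hypothetical illustration of trailing-`*` prefix matching, not the kubelet's actual implementation.

```shell
# Hypothetical helper (not kubelet code): succeed if NAME matches any entry
# in a comma-separated list. An entry ending in "*" is a prefix pattern.
# The body runs in a subshell so that IFS and "set -f" stay local.
sysctl_allowed() (
  name=$1
  IFS=,
  set -f                          # do not expand "*" against file names
  for pattern in $2; do
    case $pattern in
      *\*)                        # entry ends in a literal "*"
        prefix=${pattern%\*}
        case $name in
          "$prefix"*) exit 0 ;;
        esac
        ;;
      *)                          # plain name: exact match only
        [ "$name" = "$pattern" ] && exit 0
        ;;
    esac
  done
  exit 1
)

allowed='kernel.msg*,net.ipv4.route.min_pmtu'
sysctl_allowed kernel.msgmax "$allowed" && echo "kernel.msgmax is allowed"
sysctl_allowed kernel.sem "$allowed" || echo "kernel.sem is not allowed"
```

With the list from the kubelet example above, `kernel.msgmax` matches the `kernel.msg*` pattern, while `kernel.sem` matches nothing.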
## Setting Sysctls for a Pod
A number of sysctls are _namespaced_ in today's Linux kernels. This means that
they can be set independently for each pod on a node. Only namespaced sysctls
are configurable via the pod securityContext within Kubernetes.
The following sysctls are known to be namespaced. This list could change
in future versions of the Linux kernel.
- `kernel.shm*`,
- `kernel.msg*`,
- `kernel.sem`,
- `fs.mqueue.*`,
- `net.*`.
Sysctls with no namespace are called _node-level_ sysctls. If you need to set
them, you must manually configure them on each node's operating system, or by
using a DaemonSet with privileged containers.
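
As a rough sketch of the DaemonSet approach, the commands below write a manifest to a local file for review. Everything in it is an assumption made for illustration: the names, the `busybox` image (assumed to ship a `sysctl` applet), and the node-level parameter `vm.max_map_count` with its value. Adapt all of these to your own cluster before applying anything.

```shell
# Write a hypothetical DaemonSet manifest to a local file for review.
# All names, the image, and the sysctl value are illustrative assumptions.
cat > node-sysctl-daemonset.yaml <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl
spec:
  selector:
    matchLabels:
      app: node-sysctl
  template:
    metadata:
      labels:
        app: node-sysctl
    spec:
      containers:
      - name: set-sysctl
        image: busybox
        securityContext:
          privileged: true   # needed to write node-level sysctls
        command:
        - sh
        - -c
        - sysctl -w vm.max_map_count=262144 && while true; do sleep 3600; done
EOF

# After reviewing the file, it could be applied with:
#   kubectl apply -f node-sysctl-daemonset.yaml
```

The container sets the value once and then sleeps, so that the DaemonSet keeps one pod on every node, including nodes that join later.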
Use the pod securityContext to configure namespaced sysctls. The securityContext
applies to all containers in the same pod.
This example uses the pod securityContext to set a safe sysctl
`kernel.shm_rmid_forced` and two unsafe sysctls `net.ipv4.route.min_pmtu` and
`kernel.msgmax`. There is no distinction between _safe_ and _unsafe_ sysctls in
the specification.
{{< warning >}}
Only modify sysctl parameters after you understand their effects, to avoid
destabilizing your operating system.
{{< /warning >}}
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"
  ...
```
{{% /capture %}}
{{% capture discussion %}}
{{< warning >}}
Because they are _unsafe_, the use of _unsafe_ sysctls
is at your own risk and can lead to severe problems such as incorrect behavior of
containers, resource shortages, or complete breakage of a node.
{{< /warning >}}
It is good practice to consider nodes with special sysctl settings as
_tainted_ within a cluster, and to schedule onto them only the pods that need those
sysctl settings. Use the Kubernetes [_taints and toleration_
feature](/docs/reference/generated/kubectl/kubectl-commands/#taint) to implement this.

A pod that sets _unsafe_ sysctls, such as the example above, will fail to launch
on any node that has not explicitly enabled those _unsafe_ sysctls. As with
_node-level_ sysctls, it is recommended to use the
[_taints and toleration_ feature](/docs/reference/generated/kubectl/kubectl-commands/#taint) or
[taints on nodes](/docs/concepts/configuration/taint-and-toleration/)
to schedule those pods onto the right nodes.
## PodSecurityPolicy
You can further control which sysctls can be set in pods by specifying lists of
sysctls or sysctl patterns in the `forbiddenSysctls` and/or
`allowedUnsafeSysctls` fields of the PodSecurityPolicy. Each entry is either a
plain sysctl name or a sysctl pattern, which ends with a `*` character, such as
`kernel.*`. A `*` character on its own matches all sysctls.

By default, all safe sysctls are allowed.
The `forbiddenSysctls` field excludes specific sysctls. You can forbid a
combination of safe and unsafe sysctls in the list. To forbid setting any
sysctls, use `*` on its own.
If you specify any unsafe sysctl in the `allowedUnsafeSysctls` field and it is
not present in the `forbiddenSysctls` field, that sysctl can be used in Pods
using this PodSecurityPolicy. To allow all unsafe sysctls in the
PodSecurityPolicy to be set, use `*` on its own.
Do not configure these two fields such that there is overlap, meaning that a
given sysctl is both allowed and forbidden.
{{< warning >}}
If you allow unsafe sysctls via the `allowedUnsafeSysctls` field
in a PodSecurityPolicy, any pod using such a sysctl will fail to start
if the sysctl is not also allowed via the `--allowed-unsafe-sysctls` kubelet
flag on that node.
{{< /warning >}}
This example allows unsafe sysctls prefixed with `kernel.msg` to be set and
disallows setting of the `kernel.shm_rmid_forced` sysctl.
```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl-psp
spec:
  allowedUnsafeSysctls:
  - kernel.msg*
  forbiddenSysctls:
  - kernel.shm_rmid_forced
  ...
```
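
The way the two fields combine can be sketched as a small decision function. This is only a model of the rules described above, not the admission controller's code. `matches_any` and `psp_permits` are hypothetical helpers, and giving `forbiddenSysctls` precedence when the fields overlap is an assumption.

```shell
# Hypothetical helper: succeed if NAME matches an entry in a comma-separated
# list. An entry ending in "*" is a prefix pattern; "*" alone matches all.
matches_any() (
  name=$1
  IFS=,
  set -f                          # do not expand "*" against file names
  for pattern in $2; do
    case $pattern in
      *\*)
        prefix=${pattern%\*}
        case $name in
          "$prefix"*) exit 0 ;;
        esac
        ;;
      *)
        [ "$name" = "$pattern" ] && exit 0
        ;;
    esac
  done
  exit 1
)

# The safe set listed earlier on this page.
SAFE='kernel.shm_rmid_forced,net.ipv4.ip_local_port_range,net.ipv4.tcp_syncookies'

# Usage: psp_permits NAME ALLOWED_UNSAFE FORBIDDEN
psp_permits() {
  if matches_any "$1" "$3"; then return 1; fi      # forbidden: rejected (assumed precedence)
  if matches_any "$1" "$SAFE"; then return 0; fi   # safe: allowed by default
  matches_any "$1" "$2"                            # unsafe: only if explicitly allowed
}

# The policy from the example above:
psp_permits kernel.msgmax 'kernel.msg*' 'kernel.shm_rmid_forced' && echo "kernel.msgmax permitted"
psp_permits kernel.shm_rmid_forced 'kernel.msg*' 'kernel.shm_rmid_forced' || echo "kernel.shm_rmid_forced rejected"
```

Under this model, a sysctl permitted by the policy still has to pass the kubelet's own `--allowed-unsafe-sysctls` check on the node where the pod lands.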
{{% /capture %}}