Merge pull request #35235 from kinvolk/rata/userns
Add docs for user namespaces in pods - phase 1 (KEP-127)
pull/36053/head
commit
fae72f6cc0
|
@ -0,0 +1,164 @@
|
|||
---
|
||||
title: User Namespaces
|
||||
reviewers:
|
||||
content_type: concept
|
||||
weight: 160
|
||||
min-kubernetes-server-version: v1.25
|
||||
---
|
||||
|
||||
<!-- overview -->
|
||||
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
|
||||
|
||||
This page explains how user namespaces are used in Kubernetes pods. A user
|
||||
namespace isolates the user running inside the container from the one
|
||||
in the host.
|
||||
|
||||
A process running as root in a container can run as a different (non-root) user
|
||||
in the host; in other words, the process has full privileges for operations
|
||||
inside the user namespace, but is unprivileged for operations outside the
|
||||
namespace.
|
||||
|
||||
You can use this feature to reduce the damage a compromised container can do to
|
||||
the host or other pods in the same node. There are [several security
|
||||
vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
|
||||
exploitable when user namespaces are active. User namespaces are expected to
|
||||
mitigate some future vulnerabilities too.
|
||||
|
||||
[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
|
||||
|
||||
<!-- body -->
|
||||
## {{% heading "prerequisites" %}}
|
||||
|
||||
{{% thirdparty-content single="true" %}}
|
||||
<!-- if adding another runtime in the future, omit the single setting -->
|
||||
|
||||
This is a Linux-only feature. In addition, support is needed in the
|
||||
{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
|
||||
to use this feature with Kubernetes stateless pods:
|
||||
|
||||
* CRI-O: v1.25 has support for user namespaces.
|
||||
|
||||
* containerd: support is planned for the 1.7 release. See containerd
|
||||
issue [#7063][containerd-userns-issue] for more details.
|
||||
|
||||
Support for this in [cri-dockerd is not planned][CRI-dockerd-issue] yet.
|
||||
|
||||
[CRI-dockerd-issue]: https://github.com/Mirantis/cri-dockerd/issues/74
|
||||
[containerd-userns-issue]: https://github.com/containerd/containerd/issues/7063
|
||||
|
||||
## Introduction
|
||||
|
||||
User namespaces are a Linux feature that maps users in the container to
|
||||
different users in the host. Furthermore, the capabilities granted to a pod in
|
||||
a user namespace are valid only in the namespace and void outside of it.
|
||||
|
||||
A pod can opt in to using user namespaces by setting the `pod.spec.hostUsers` field
|
||||
to `false`.
|
||||
|
||||
The kubelet picks the host UIDs/GIDs a pod is mapped to, and does so in a way
that guarantees that no two stateless pods on the same node use the same mapping.
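
To make the idea of a mapping more concrete, here is a minimal sketch that
prints the ID ranges described by a `uid_map` or `gid_map` file. It relies on
the three-field format documented in `man 7 user_namespaces` (first ID inside
the namespace, first ID outside, range length). The helper name
`describe_id_map` is hypothetical, and the mapping you see in a pod depends on
what the kubelet picked.

```shell
#!/bin/sh
# Describe each line of a uid_map/gid_map file read from stdin.
# Fields per line: first ID inside the namespace, first ID outside the
# namespace, and the length of the range.
# describe_id_map is a hypothetical helper name, not part of any tool.
describe_id_map() {
    while read -r inside outside length; do
        # Skip empty lines
        [ -n "$inside" ] || continue
        echo "IDs $inside-$((inside + length - 1)) in the namespace map to $outside-$((outside + length - 1)) outside"
    done
}

# Show the mapping of the current process (Linux only).
if [ -r /proc/self/uid_map ]; then
    describe_id_map < /proc/self/uid_map
fi
```

Run inside a container, this reports the range the pod was given; run on the
host, it reports the host's own mapping.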
|
||||
|
||||
The `runAsUser`, `runAsGroup`, `fsGroup`, etc. fields in the `pod.spec` always
|
||||
refer to the user inside the container.
|
||||
|
||||
When this feature is enabled, the valid UIDs/GIDs are in the range 0-65535. This
|
||||
applies to files and processes (`runAsUser`, `runAsGroup`, etc.).
|
||||
|
||||
Files using a UID/GID outside this range will be seen as belonging to the
|
||||
overflow ID, usually 65534 (configured in `/proc/sys/kernel/overflowuid` and
|
||||
`/proc/sys/kernel/overflowgid`). However, it is not possible to modify those
|
||||
files, even by running as the 65534 user/group.
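
As a quick check, the overflow IDs mentioned above can be read directly from
procfs on a Linux node. This is a minimal sketch; the helper name
`read_overflow_id` is hypothetical, and the fallback of 65534 is only the usual
default, assumed here for systems where the file cannot be read.

```shell
#!/bin/sh
# Print the overflow UID and GID configured on this Linux host.
# read_overflow_id is a hypothetical helper name, not part of any tool.
read_overflow_id() {
    # $1 is either "uid" or "gid"
    file="/proc/sys/kernel/overflow$1"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # Assumption: fall back to the usual default when procfs is unavailable
        echo 65534
    fi
}

echo "overflow UID: $(read_overflow_id uid)"
echo "overflow GID: $(read_overflow_id gid)"
```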
|
||||
|
||||
Most applications that need to run as root but don't access other host
namespaces or resources should continue to run fine, without any changes,
when user namespaces are activated.
|
||||
|
||||
## Understanding user namespaces for stateless pods
|
||||
|
||||
Several container runtimes with their default configuration (like Docker Engine,
|
||||
containerd, CRI-O) use Linux namespaces for isolation. Other technologies exist
|
||||
and can be used with those runtimes too (e.g. Kata Containers uses VMs instead of
|
||||
Linux namespaces). This page is applicable for container runtimes using Linux
|
||||
namespaces for isolation.
|
||||
|
||||
When creating a pod, by default, several new namespaces are used for isolation:
|
||||
a network namespace to isolate the network of the container, a PID namespace to
|
||||
isolate the view of processes, etc. If a user namespace is used, this will
|
||||
isolate the users in the container from the users in the node.
|
||||
|
||||
This means containers can run as root and be mapped to a non-root user on the
|
||||
host. Inside the container the process will think it is running as root (and
|
||||
therefore tools like `apt`, `yum`, etc. work fine), while in reality the process
|
||||
doesn't have privileges on the host. You can verify this, for example, by
running `ps` on the host to check which user the container process is running
as. The user `ps` shows is not the same as the user you see if you run the
command `id` inside the container.
|
||||
|
||||
This abstraction limits what can happen, for example, if the container manages
|
||||
to escape to the host. Given that the container is running as a non-privileged
user on the host, what it can do to the host is limited.
|
||||
|
||||
Furthermore, as users on each pod are mapped to different non-overlapping
users in the host, what they can do to other pods is limited too.
|
||||
|
||||
Capabilities granted to a pod are also limited to the pod user namespace and
|
||||
mostly invalid outside of it; some are even completely void. Here are two examples:
|
||||
- `CAP_SYS_MODULE` does not have any effect if granted to a pod using user
  namespaces; the pod isn't able to load kernel modules.
- `CAP_SYS_ADMIN` is limited to the pod's user namespace and invalid outside
  of it.
|
||||
|
||||
Without a user namespace, a container running as root has, in the case of a
container breakout, root privileges on the node. And if some capabilities were
granted to the container, those capabilities are valid on the host too. None of
this is true when user namespaces are used.
|
||||
|
||||
If you want to know more details about what changes when user namespaces are in
|
||||
use, see `man 7 user_namespaces`.
|
||||
|
||||
## Set up a node to support user namespaces
|
||||
|
||||
It is recommended that the host's files and host's processes use UIDs/GIDs in
|
||||
the range of 0-65535.
|
||||
|
||||
The kubelet will assign UIDs/GIDs higher than that to pods. Therefore, to
|
||||
guarantee as much isolation as possible, the UIDs/GIDs used by the host's files
|
||||
and host's processes should be in the range 0-65535.
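
One way to check part of this recommendation is to look for local accounts
whose IDs fall outside the range. The sketch below covers only the account
database, not file ownership or running processes. The helper name
`find_high_ids` is hypothetical, and it assumes the standard colon-separated
passwd format.

```shell
#!/bin/sh
# List accounts whose UID or GID is above 65535, reading passwd-format
# lines (name:password:UID:GID:...) from stdin.
# find_high_ids is a hypothetical helper name, not part of any tool.
find_high_ids() {
    awk -F: '$3 > 65535 || $4 > 65535 { print $1 " uid=" $3 " gid=" $4 }'
}

# Check the local account database; no output means every entry is in range.
if [ -r /etc/passwd ]; then
    find_high_ids < /etc/passwd
fi
```

Accounts served by a directory service are not listed in `/etc/passwd`; on
such nodes you could pipe the output of `getent passwd` into the same function.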
|
||||
|
||||
Note that this recommendation is important to mitigate the impact of CVEs like
|
||||
[CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
|
||||
files on the host. If the UIDs/GIDs of the pod and the host don't overlap, what
a pod would be able to do is limited: the pod UID/GID won't match the
|
||||
host's file owner/group.
|
||||
|
||||
[CVE-2021-25741]: https://github.com/kubernetes/kubernetes/issues/104980
|
||||
|
||||
## Limitations
|
||||
|
||||
When using a user namespace for the pod, you cannot use other host
namespaces. In particular, if you set `hostUsers: false` then you are not
allowed to set any of:
|
||||
|
||||
* `hostNetwork: true`
|
||||
* `hostIPC: true`
|
||||
* `hostPID: true`
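
The restriction above can be caught before a manifest is submitted. The
following is a rough, text-based sketch for a manifest containing a single Pod.
It does not parse YAML, so treat it as a convenience check rather than real
validation. The helper name `check_host_namespaces` is hypothetical and is not
a kubectl feature.

```shell
#!/bin/sh
# Rough text-based check of a single-Pod manifest: report an error when
# hostUsers is false and a host namespace field is set to true.
# check_host_namespaces is a hypothetical helper, not a kubectl feature.
check_host_namespaces() {
    manifest="$1"
    if grep -Eq '^[[:space:]]*hostUsers:[[:space:]]*false' "$manifest" &&
       grep -Eq '^[[:space:]]*host(Network|IPC|PID):[[:space:]]*true' "$manifest"; then
        echo "invalid: hostUsers: false combined with a host namespace"
        return 1
    fi
    echo "ok"
}

# Example (pod.yaml is a placeholder path):
#   check_host_namespaces pod.yaml
```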
|
||||
|
||||
The pod is allowed to use no volumes at all or, if using volumes, only these
|
||||
volume types are allowed:
|
||||
|
||||
* configmap
|
||||
* secret
|
||||
* projected
|
||||
* downwardAPI
|
||||
* emptyDir
|
||||
|
||||
To guarantee that the pod can read the files of such volumes, volumes are
|
||||
created as if you specified `.spec.securityContext.fsGroup` as `0` for the Pod.
|
||||
If a different value is specified, that value is honored instead.
|
||||
|
||||
As a by-product of this, folders and files for these volumes will have
|
||||
permissions for the group, even if `defaultMode` or the `mode` of specific
items of the volumes was specified without group permissions. For example, it
is not possible to mount these volumes in a way that their files have
permissions only for the owner.
|
|
@ -0,0 +1,100 @@
|
|||
---
|
||||
title: Use a User Namespace With a Pod
|
||||
reviewers:
|
||||
content_type: task
|
||||
weight: 160
|
||||
min-kubernetes-server-version: v1.25
|
||||
---
|
||||
|
||||
<!-- overview -->
|
||||
{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
|
||||
|
||||
This page shows how to configure a user namespace for stateless pods. This
|
||||
allows you to isolate the user running inside the container from the one in the
|
||||
host.
|
||||
|
||||
A process running as root in a container can run as a different (non-root) user
|
||||
in the host; in other words, the process has full privileges for operations
|
||||
inside the user namespace, but is unprivileged for operations outside the
|
||||
namespace.
|
||||
|
||||
You can use this feature to reduce the damage a compromised container can do to
|
||||
the host or other pods in the same node. There are [several security
|
||||
vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
|
||||
exploitable when user namespaces are active. User namespaces are expected to
|
||||
mitigate some future vulnerabilities too.
|
||||
|
||||
Without a user namespace, a container running as root has, in the case of a
container breakout, root privileges on the node. And if some capabilities were
granted to the container, those capabilities are valid on the host too. None of
this is true when user namespaces are used.
|
||||
|
||||
[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
|
||||
|
||||
## {{% heading "prerequisites" %}}
|
||||
|
||||
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
|
||||
|
||||
{{% thirdparty-content single="true" %}}
|
||||
<!-- if adding another runtime in the future, omit the single setting -->
|
||||
|
||||
* The node OS needs to be Linux
|
||||
* You need to be able to run commands on the host
|
||||
* You need to be able to exec into pods
|
||||
* The feature gate `UserNamespacesStatelessPodsSupport` needs to be enabled.
|
||||
|
||||
In addition, support is needed in the
|
||||
{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
|
||||
to use this feature with Kubernetes stateless pods:
|
||||
|
||||
* CRI-O: v1.25 has support for user namespaces.
|
||||
|
||||
Please note that **if your container runtime doesn't support user namespaces, the
|
||||
new `pod.spec` field will be silently ignored and the pod will be created without
|
||||
user namespaces.**
|
||||
|
||||
<!-- steps -->
|
||||
|
||||
## Run a Pod that uses a user namespace {#create-pod}
|
||||
|
||||
A user namespace for a stateless pod is enabled by setting the `hostUsers` field of
|
||||
`.spec` to `false`. For example:
|
||||
|
||||
{{< codenew file="pods/user-namespaces-stateless.yaml" >}}
|
||||
|
||||
1. Create the pod on your cluster:
|
||||
|
||||
   ```shell
   kubectl apply -f https://k8s.io/examples/pods/user-namespaces-stateless.yaml
   ```
|
||||
|
||||
1. Open a shell in the container and run `readlink /proc/self/ns/user`:

   ```shell
   kubectl exec -it userns -- bash
   ```
|
||||
|
||||
And run the command. The output is similar to this:
|
||||
|
||||
   ```none
   readlink /proc/self/ns/user
   user:[4026531837]
   cat /proc/self/uid_map
   0 0 4294967295
   ```
|
||||
|
||||
Then, open a shell in the host and run the same command.
|
||||
|
||||
The output must be different. This means the host and the pod are using a
|
||||
different user namespace. When user namespaces are not enabled, the host and the
|
||||
pod use the same user namespace.
|
||||
|
||||
If you are running the kubelet inside a user namespace, you need to compare the
|
||||
output from running the command in the pod to the output of running in the host:
|
||||
|
||||
   ```none
   readlink /proc/$pid/ns/user
   user:[4026534732]
   ```

   replacing `$pid` with the kubelet PID.
|
|
@ -0,0 +1,10 @@
|
|||
apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian