[zh-cn] sync blog: 2023-09-13-userns-stateful-pods

Signed-off-by: xin.li <xin.li@daocloud.io>
xin.li 2023-09-13 23:30:40 +08:00
parent eb1ead1185
commit eacc6e43e8
1 changed files with 291 additions and 0 deletions


@ -0,0 +1,291 @@
---
layout: blog
title: "User Namespaces: Support for Running Stateful Pods Enters Alpha!"
date: 2023-09-13
slug: userns-alpha
---
<!--
layout: blog
title: "User Namespaces: Now Supports Running Stateful Pods in Alpha!"
date: 2023-09-13
slug: userns-alpha
-->
<!--
**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
-->
**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
**Translator:** Xin Li (DaoCloud)
<!--
Kubernetes v1.25 introduced support for user namespaces for only stateless
pods. Kubernetes 1.28 lifted that restriction, after some design changes were
done in 1.27.
-->
Kubernetes v1.25 introduced user namespace support, but only for stateless Pods.
Kubernetes 1.28 lifted that restriction, following design changes made in 1.27.
<!--
The beauty of this feature is that:
* it is trivial to adopt (you just need to set a bool in the pod spec)
* doesn't need any changes for **most** applications
* improves security by _drastically_ enhancing the isolation of containers and
mitigating CVEs rated HIGH and CRITICAL.
-->
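The beauty of this feature is that:
* it is trivial to adopt (you only need to set a bool in the Pod spec)
* **most** applications need no changes
* it improves security by **drastically** strengthening container isolation and mitigating CVEs rated HIGH and CRITICAL.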
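As a minimal sketch of the "set a bool" step: the boolean is the `hostUsers` field of the Pod spec, and setting it to `false` requests a user namespace for the Pod. The Pod name, container name, image, and command below are placeholders chosen for illustration, and the manifest is built as a Python dict only so that it can be printed as JSON (which `kubectl apply -f -` accepts).

```python
import json


def userns_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that opts in to a user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The single boolean: False means "do not share the host's
            # user namespace", so the Pod gets its own.
            "hostUsers": False,
            "containers": [
                # Placeholder container; any ordinary image would do.
                {"name": "shell", "image": image, "command": ["sleep", "infinity"]}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    print(json.dumps(userns_pod_manifest("userns-demo", "debian"), indent=2))
```

Nothing else in the manifest changes, which is why most applications can adopt the feature without modification.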
<!--
This post explains the basics of user namespaces and also shows:
* the changes that arrived in the recent Kubernetes v1.28 release
* a **demo of a vulnerability rated as HIGH** that is not exploitable with user namespaces
* the runtime requirements to use this feature
* what you can expect in future releases regarding user namespaces.
-->
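This post explains the basics of user namespaces and also shows:
* the changes that arrived in the recent Kubernetes v1.28 release
* a **demo of a vulnerability rated HIGH** that cannot be exploited when user namespaces are in use
* the runtime requirements for using this feature
* what to expect from user namespaces in future releases.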
<!--
## What is a user namespace?
A user namespace is a Linux feature that isolates the user and group identifiers
(UIDs and GIDs) of the containers from the ones on the host. The identifiers
in the container can be mapped to identifiers on the host in a way where the
host UID/GIDs used for different containers never overlap. Even more, the
identifiers can be mapped to *unprivileged* non-overlapping UIDs and GIDs on the
host. This basically means two things:
-->
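## What is a user namespace?
A user namespace is a Linux feature that isolates the user and group identifiers (UIDs and GIDs) of containers from those on the host.
Identifiers in the container can be mapped to identifiers on the host in such a way that the host UIDs/GIDs used for different containers never overlap.
Better still, the identifiers can be mapped to **unprivileged**, non-overlapping UIDs and GIDs on the host. This basically means two things: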
<!--
* As the UIDs and GIDs for different containers are mapped to different UIDs
and GIDs on the host, containers have a harder time to attack each other even
if they escape the container boundaries. For example, if container A is running
with different UIDs and GIDs on the host than container B, the operations it
can do on container B's files and process are limited: only read/write what a
file allows to others, as it will never have permission for the owner or
group (the UIDs/GIDs on the host are guaranteed to be different for
different containers).
-->
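* Because the UIDs and GIDs of different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other even if they escape the container boundary.
For example, if container A runs with different host UIDs and GIDs than container B, what it can do to container B's
files and processes is limited: it can only read or write what a file permits to "others",
because it will never have owner or group permissions (the host UIDs/GIDs are guaranteed to differ between containers).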
<!--
* As the UIDs and GIDs are mapped to unprivileged users on the host, if a
container escapes the container boundaries, even if it is running as root
inside the container, it has no privileges on the host. This greatly
protects what host files it can read/write, which process it can send signals
to, etc.
Furthermore, capabilities granted are only valid inside the user namespace and
not on the host.
-->
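* Because the UIDs and GIDs are mapped to unprivileged users on the host, a container that escapes the container boundary
has no privileges on the host, even if it runs as root inside the container.
This greatly restricts which host files it can read or write, which processes it can send signals to, and so on.
Furthermore, the capabilities granted are valid only inside the user namespace, not on the host.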
<!--
Without using a user namespace a container running as root, in the case of a
container breakout, has root privileges on the node. And if some capabilities
were granted to the container, the capabilities are valid on the host too. None
of this is true when using user namespaces (modulo bugs, of course 🙂).
-->
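Without a user namespace, a container running as root gains root privileges on the node
in the event of a container breakout. And if capabilities were granted to the container, those capabilities are valid on the host too.
None of this is true when user namespaces are in use (barring bugs, of course 🙂).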
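The mapping described above can be sketched in a few lines. The kernel exposes each mapping as a line of `/proc/<pid>/uid_map` with three numbers: the first ID inside the namespace, the first ID on the host, and the length of the range. The ranges used for `pod_a` and `pod_b` below are hypothetical values chosen for illustration; the ranges actually assigned on a node may differ.

```python
from typing import NamedTuple, Optional


class IdMapping(NamedTuple):
    """One uid_map line: first ID inside, first ID on the host, range length."""
    container_start: int
    host_start: int
    length: int


def to_host_id(mapping: IdMapping, container_id: int) -> Optional[int]:
    """Translate an ID seen inside the container to the ID the host sees."""
    offset = container_id - mapping.container_start
    if 0 <= offset < mapping.length:
        return mapping.host_start + offset
    return None  # IDs outside the range are not mapped to any host ID


def overlaps(a: IdMapping, b: IdMapping) -> bool:
    """Return True if the two mappings share at least one host ID."""
    return (a.host_start < b.host_start + b.length
            and b.host_start < a.host_start + a.length)


# Hypothetical assignment: each Pod gets its own block of 65536 host IDs.
pod_a = IdMapping(container_start=0, host_start=65536, length=65536)
pod_b = IdMapping(container_start=0, host_start=131072, length=65536)

if __name__ == "__main__":
    # root (UID 0) inside each Pod is a different, unprivileged host UID.
    print(to_host_id(pod_a, 0), to_host_id(pod_b, 0))
    print(overlaps(pod_a, pod_b))
```

With these assumed ranges, root in either Pod is an ordinary high-numbered user on the host, and no host ID is shared between the two Pods, so each can only reach the other's files through the "others" permission bits.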
<!--
## Changes in 1.28
As already mentioned, starting from 1.28, Kubernetes supports user namespaces
with stateful pods. This means that pods with user namespaces can use any type
of volume, they are no longer limited to only some volume types as before.
-->
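## Changes in 1.28
As mentioned above, starting with 1.28 Kubernetes supports user namespaces for stateful Pods.
This means Pods with user namespaces can use any type of volume; they are no longer limited to a subset of volume types.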
<!--
The feature gate to activate this feature was renamed, it is no longer
`UserNamespacesStatelessPodsSupport` but from 1.28 onwards you should use
`UserNamespacesSupport`. There were many changes done and the requirements on
the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to
reflect this.
-->
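Starting with 1.28, the feature gate that activates this feature was renamed: it is no longer `UserNamespacesStatelessPodsSupport`;
use `UserNamespacesSupport` instead. The feature went through many changes,
and the requirements on node hosts changed as well, so Kubernetes 1.28 renamed the feature gate to reflect this.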
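The rename can be summarized as a small lookup. This is only a sketch: the helper names are hypothetical, and it assumes the gate is switched on through a `featureGates` map of gate name to boolean, as in a kubelet configuration file.

```python
def userns_feature_gate(minor: int) -> str:
    """Return the user namespaces feature gate name for Kubernetes 1.<minor>."""
    if minor < 25:
        raise ValueError("user namespaces were introduced in Kubernetes 1.25")
    if minor >= 28:
        return "UserNamespacesSupport"
    # Name used before the 1.28 rename.
    return "UserNamespacesStatelessPodsSupport"


def feature_gates_fragment(minor: int) -> dict:
    """Build a `featureGates` mapping that enables the gate for that release."""
    return {userns_feature_gate(minor): True}


if __name__ == "__main__":
    for minor in (25, 27, 28):
        print(f"1.{minor}:", feature_gates_fragment(minor))
```

A cluster upgraded to 1.28 that still sets the old gate name would therefore need its configuration updated.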
<!--
## Demo
Rodrigo created a demo which exploits [CVE 2022-0492][cve-link] and shows how
the exploit can occur without user namespaces. He also shows how it is not
possible to use this exploit from a Pod where the containers are using this
feature.
-->
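## Demo
Rodrigo created a demo that exploits [CVE 2022-0492][cve-link]
and shows how the vulnerability can be exploited without user namespaces.
He also shows that the exploit does not work from a Pod whose containers use this feature.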
<!--
This vulnerability is rated **HIGH** and allows **a container with no special
privileges to read/write to any path on the host** and launch processes as root
on the host too.
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
-->
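This vulnerability is rated **HIGH** and allows **a container with no special privileges to read and write any path on the host** and to launch processes as root on the host.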
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
<!--
Most applications in containers run as root today, or as a semi-predictable
non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod
with containers using a userns, Kubernetes runs those containers as unprivileged
users, with no changes needed in your app.
-->
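Today, most applications in containers run as root or as a semi-predictable non-root
user (user ID 65534 is a fairly popular choice).
When you run a Pod whose containers use a user namespace (userns), Kubernetes
runs those containers as unprivileged users, with no changes needed in your application.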
<!--
This means two containers running as user 65534 will effectively be mapped to
different users on the host, limiting what they can do to each other in case of
an escape, and if they are running as root, the privileges on the host are
reduced to the one of an unprivileged user.
[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
-->
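This means two containers running as user 65534 are in fact mapped to different users on the host,
which limits what they can do to each other after an escape; and if they run as root,
their privileges on the host are reduced to those of an unprivileged user.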
[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
<!--
## Node system requirements
There are requirements on the Linux kernel version as well as the container
runtime to use this feature.
-->
## Node system requirements
Using this feature places requirements on the Linux kernel version and on the container runtime.
<!--
On Linux you need Linux 6.3 or greater. This is because the feature relies on a
kernel feature named idmap mounts, and support to use idmap mounts with tmpfs
was merged in Linux 6.3.
If you are using CRI-O with crun, this is [supported in CRI-O
1.28.1][CRIO-release] and crun 1.9 or greater. If you are using CRI-O with runc,
this is still not supported.
-->
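On Linux, you need Linux 6.3 or later. This is because the feature relies on a kernel feature called
idmap mounts, and support for using idmap mounts with tmpfs was merged in Linux 6.3.
If you use CRI-O with crun, this is supported in [CRI-O 1.28.1][CRIO-release] with crun 1.9 or later.
If you use CRI-O with runc, it is not supported yet.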
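A node's kernel can be checked against the 6.3 requirement with a short script. This is a rough sketch that compares only the leading `major.minor` of the release string; the function names are hypothetical, and a distribution kernel that backports features would not be detected this way.

```python
import platform
import re

MIN_KERNEL = (6, 3)  # idmap mounts on tmpfs were merged in Linux 6.3


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '6.3.0-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2)))


def kernel_meets_requirement(release: str) -> bool:
    """Return True if the release is at least MIN_KERNEL."""
    return parse_kernel_release(release) >= MIN_KERNEL


if __name__ == "__main__":
    release = platform.release()
    print(release, kernel_meets_requirement(release))
```

Comparing tuples of integers rather than strings matters here: as strings, "6.10" would sort before "6.3".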
<!--
containerd support is currently targeted for containerd 2.0; it is likely that
it won't matter if you use it with crun or runc.
Please note that containerd 1.7 added _experimental_ support for user
namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27
is not supported by containerd 1.7, therefore it only works, in terms of user
namespaces support, with Kubernetes 1.25 and 1.26.
-->
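containerd support is currently targeted for containerd 2.0; whether you use it with crun or runc will probably not matter.
Note that containerd 1.7 added **experimental** support for user namespaces as implemented in Kubernetes 1.25
and 1.26. The redesign done in 1.27 is not supported by containerd 1.7,
so its user namespace support works only with Kubernetes 1.25 and 1.26.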
<!--
One limitation present in containerd 1.7 is that it needs to change the
ownership of every file and directory inside the container image, during Pod
startup. This means it has a storage overhead and can significantly impact the
container startup latency. containerd 2.0 will probably include an implementation
that will eliminate the startup latency added and the storage overhead. Take
this into account if you plan to use containerd 1.7 with user namespaces in
production.
None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].
[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
-->
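One limitation of containerd 1.7 is that it must change the ownership of every file and directory in the container image during Pod startup.
This incurs a storage overhead and can significantly increase container startup latency. containerd 2.0
will probably include an implementation that eliminates both the added startup latency and the storage overhead. Take this into account
if you plan to use containerd 1.7 with user namespaces in production.
None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].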
[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
<!--
## What's next?
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For
the time being, the plan is to relax checks in PSS policies when user namespaces are
in use. This means that the fields `spec[.*].securityContext` `runAsUser`,
`runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities` will not trigger a
violation if user namespaces are in use. The behavior will probably be controlled by
utilizing an API Server feature gate, like `UserNamespacesPodSecurityStandards`
or similar.
-->
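## What's next?
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user namespaces into Pod Security Standards (PSS) and Pod Security Admission.
For now, the plan is to relax the checks in PSS policies when user namespaces are in use. This means the
`spec[.*].securityContext` fields `runAsUser`, `runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities`
will not trigger a violation if user namespaces are in use. The behavior will probably be controlled by an API Server feature gate, such as `UserNamespacesPodSecurityStandards` or similar.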
<!--
## How do I get involved?
You can reach SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha
-->
## How do I get involved?
You can reach SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha