[zh-cn] sync blog: 2023-09-13-userns-stateful-pods

Signed-off-by: xin.li <xin.li@daocloud.io>
2023-09-13 23:30:40 +08:00 · 2023-09-13 23:30:40 +08:00 · eacc6e43e8
parent eb1ead1185
commit eacc6e43e8
1 changed files with 291 additions and 0 deletions
--- a/content/zh-cn/blog/_posts/2023-09-13-userns-stateful-pods/index.md
+++ b/content/zh-cn/blog/_posts/2023-09-13-userns-stateful-pods/index.md
@@ -0,0 +1,291 @@
+---
+layout: blog
+title: "用户命名空间：对运行有状态 Pod 的支持进入 Alpha 阶段！"
+date: 2023-09-13
+slug: userns-alpha
+---
+
+<!--
+layout: blog
+title: "User Namespaces: Now Supports Running Stateful Pods in Alpha!"
+date: 2023-09-13
+slug: userns-alpha
+-->
+
+<!--
+**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
+-->
+**作者：** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
+
+**译者：** Xin Li (DaoCloud)
+
+<!--
+Kubernetes v1.25 introduced support for user namespaces for only stateless
+pods. Kubernetes 1.28 lifted that restriction, after some design changes were
+done in 1.27.
+-->
+Kubernetes v1.25 引入了对用户命名空间（User Namespace）的支持，但仅限于无状态（Stateless）Pod。
+在 1.27 中完成一些设计变更之后，Kubernetes 1.28 取消了这一限制。
+
+<!--
+The beauty of this feature is that:
+ * it is trivial to adopt (you just need to set a bool in the pod spec)
+ * doesn't need any changes for **most** applications
+ * improves security by _drastically_ enhancing the isolation of containers and
+   mitigating CVEs rated HIGH and CRITICAL.
+-->
+此特性的精妙之处在于：
+
+ * 使用起来很简单（只需在 Pod 规约（spec）中设置一个布尔值）
+ * **大多数**应用程序不需要任何更改
+ * 通过**大幅度**加强容器的隔离性，并缓解评级为高（HIGH）和关键（CRITICAL）的 CVE，从而提高安全性。
+
+<!--
+This post explains the basics of user namespaces and also shows:
+ * the changes that arrived in the recent Kubernetes v1.28 release
+ * a **demo of a vulnerability rated as HIGH** that is not exploitable with user namespaces
+ * the runtime requirements to use this feature
+ * what you can expect in future releases regarding user namespaces.
+-->
+这篇文章介绍了用户命名空间的基础知识，并展示了：
+
+* 最近的 Kubernetes v1.28 版本中出现的变化
+* **一个评级为高（HIGH）的漏洞的演示（Demo）**，启用用户命名空间后该漏洞无法被利用
+* 使用此特性的运行时要求
+* 关于用户命名空间的未来版本中可以期待的内容
+
+<!--
+## What is a user namespace?
+
+A user namespace is a Linux feature that isolates the user and group identifiers
+(UIDs and GIDs) of the containers from the ones on the host. The identifiers
+in the container can be mapped to identifiers on the host in a way where the
+host UID/GIDs used for different containers never overlap. Even more, the
+identifiers can be mapped to *unprivileged* non-overlapping UIDs and GIDs on the
+host. This basically means two things:
+-->
+## 用户命名空间是什么？
+
+用户命名空间是 Linux 的一项特性，它将容器的用户和组标识符（UID 和 GID）与宿主机上的标识符隔离开来。
+容器中的标识符可以映射到宿主机上的标识符，并且保证不同容器所用的宿主机 UID/GID 绝不重叠。
+不仅如此，这些标识符还可以映射到宿主机上**非特权**且互不重叠的 UID 和 GID。这基本上意味着两件事：
+
+<!--
+ * As the UIDs and GIDs for different containers are mapped to different UIDs
+   and GIDs on the host, containers have a harder time to attack each other even
+   if they escape the container boundaries. For example, if container A is running
+   with different UIDs and GIDs on the host than container B, the operations it
+   can do on container B's files and process are limited: only read/write what a
+   file allows to others, as it will never have permission for the owner or
+   group (the UIDs/GIDs on the host are guaranteed to be different for
+   different containers).
+-->
+ * 由于不同容器的 UID 和 GID 映射到宿主机上不同的 UID 和 GID，因此即使它们逃逸出了容器的边界，也很难相互攻击。
+   例如，如果容器 A 在宿主机上使用与容器 B 不同的 UID 和 GID 运行，则它可以对容器 B
+   的文件和进程执行的操作受到限制：只能按照文件授予其他用户（others）的权限进行读/写，
+   因为它永远不会拥有所有者或组的权限（宿主机上的 UID/GID 保证对于不同的容器是不同的）。
+
+<!--
+ * As the UIDs and GIDs are mapped to unprivileged users on the host, if a
+   container escapes the container boundaries, even if it is running as root
+   inside the container, it has no privileges on the host. This greatly
+   protects what host files it can read/write, which process it can send signals
+   to, etc.
+
+Furthermore, capabilities granted are only valid inside the user namespace and
+not on the host.
+-->
+ * 由于 UID 和 GID 映射到宿主机上的非特权用户，如果容器逃逸出了容器边界，
+   即使它在容器内以 root 身份运行，它在宿主机上也没有特权。
+   这极大地限制了它能够读/写哪些宿主机文件、能够向哪些进程发送信号等。
+
+此外，所授予的权能（Capability）仅在用户命名空间内有效，而在宿主机上无效。
+
+<!--
+Without using a user namespace a container running as root, in the case of a
+container breakout, has root privileges on the node. And if some capabilities
+were granted to the container, the capabilities are valid on the host too. None
+of this is true when using user namespaces (modulo bugs, of course 🙂).
+-->
+在不使用用户命名空间的情况下，以 root 身份运行的容器在发生逃逸的情况下会获得节点上的
+root 权限。如果某些权能被授予容器，那么这些权能在宿主机上也有效。
+当使用用户命名空间时，这些情况都会被避免（当然，除非存在漏洞 🙂）。
+
+<!--
+## Changes in 1.28
+
+As already mentioned, starting from 1.28, Kubernetes supports user namespaces
+with stateful pods. This means that pods with user namespaces can use any type
+of volume, they are no longer limited to only some volume types as before.
+-->
+## 1.28 版本的变化
+
+正如之前提到的，从 1.28 版本开始，Kubernetes 支持在有状态 Pod 中使用用户命名空间。
+这意味着具有用户命名空间的 Pod 可以使用任何类型的卷，不再仅限于以前的部分卷类型。
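+
+下面是一个最小示意清单，大致展示前文所说的“在 Pod 规约中设置一个布尔值”，并同时挂载一个持久卷。
+其中 `hostUsers: false` 是为 Pod 启用用户命名空间的字段；Pod 名称、镜像以及名为 `demo-data`
+的 PersistentVolumeClaim 均为本示例假设的占位值，并非原文内容，请按自己的环境替换。
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns-demo          # 假设的 Pod 名称
+spec:
+  hostUsers: false           # 为该 Pod 启用用户命名空间
+  containers:
+  - name: app
+    image: debian            # 假设的镜像，仅用于演示
+    command: ["sleep", "infinity"]
+    volumeMounts:
+    - name: data
+      mountPath: /data
+  volumes:
+  - name: data
+    persistentVolumeClaim:
+      claimName: demo-data   # 假设集群中已存在的 PVC
+```
+
+该清单仅为示意：只有在节点的内核与容器运行时满足要求、且相应特性门控已启用时，这一字段才会生效。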
+
+<!--
+The feature gate to activate this feature was renamed, it is no longer
+`UserNamespacesStatelessPodsSupport` but from 1.28 onwards you should use
+`UserNamespacesSupport`. There were many changes done and the requirements on
+the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to
+reflect this.
+-->
+从 1.28 版本开始，用于激活此特性的特性门控已被重命名，不再是 `UserNamespacesStatelessPodsSupport`，
+而应该使用 `UserNamespacesSupport`。此特性经历了许多更改，
+对节点主机的要求也发生了变化。因此，Kubernetes 1.28 版本将该特性门控重命名以反映这一变化。
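+
+作为参考，下面给出一个在 kubelet 配置文件中启用该特性门控的示意片段。
+这里假设你通过 `KubeletConfiguration` 文件管理 kubelet 配置；具体的文件路径和其余字段取决于你的部署方式，原文并未给出。
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+featureGates:
+  UserNamespacesSupport: true   # 1.28 及之后使用的特性门控名称
+```
+
+API 服务器一侧通常也需要启用同一门控，例如为 kube-apiserver 传入
+`--feature-gates=UserNamespacesSupport=true`；具体做法请以你所用的集群部署工具为准。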
+
+<!--
+## Demo
+
+Rodrigo created a demo which exploits [CVE 2022-0492][cve-link] and shows how
+the exploit can occur without user namespaces. He also shows how it is not
+possible to use this exploit from a Pod where the containers are using this
+feature.
+-->
+## 演示
+
+Rodrigo 创建了一个利用 [CVE 2022-0492][cve-link] 的演示，
+用以展现如何在没有用户命名空间的情况下利用该漏洞。
+他还展示了在容器使用了此特性的 Pod 中无法利用此漏洞的情况。
+
+<!--
+This vulnerability is rated **HIGH** and allows **a container with no special
+privileges to read/write to any path on the host** and launch processes as root
+on the host too.
+
+{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
+-->
+此漏洞被评为**高（HIGH）**，允许**一个没有特殊特权的容器读/写宿主机上的任何路径**，并在宿主机上以 root 身份启动进程。
+
+{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
+
+<!--
+Most applications in containers run as root today, or as a semi-predictable
+non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod
+with containers using a userns, Kubernetes runs those containers as unprivileged
+users, with no changes needed in your app.
+-->
+如今，容器中的大多数应用程序都以 root 身份运行，或者以半可预测的非 root
+用户身份运行（用户 ID 65534 是一个比较流行的选择）。
+当你运行某个 Pod，而其中带有使用用户命名空间（userns）的容器时，Kubernetes
+以非特权用户身份运行这些容器，无需在你的应用程序中进行任何更改。
+
+<!--
+This means two containers running as user 65534 will effectively be mapped to
+different users on the host, limiting what they can do to each other in case of
+an escape, and if they are running as root, the privileges on the host are
+reduced to the one of an unprivileged user.
+
+[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
+-->
+这意味着两个以用户 65534 身份运行的容器实际上会被映射到宿主机上的不同用户，
+从而限制了它们在发生逃逸的情况下能够对彼此执行的操作，如果它们以 root 身份运行，
+宿主机上的特权也会降低到非特权用户的权限。
+
+[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
+
+<!--
+## Node system requirements
+
+There are requirements on the Linux kernel version as well as the container
+runtime to use this feature.
+-->
+## 节点系统要求
+
+要使用此特性，对 Linux 内核版本以及容器运行时有一定要求。
+
+<!--
+On Linux you need Linux 6.3 or greater. This is because the feature relies on a
+kernel feature named idmap mounts, and support to use idmap mounts with tmpfs
+was merged in Linux 6.3.
+
+If you are using CRI-O with crun, this is [supported in CRI-O
+1.28.1][CRIO-release] and crun 1.9 or greater. If you are using CRI-O with runc,
+this is still not supported.
+-->
+在 Linux 上，你需要 Linux 6.3 或更高版本。这是因为该特性依赖于一个名为
+idmap mounts 的内核特性，而 Linux 6.3 中合并了针对 tmpfs 使用 idmap mounts 的支持。
+
+如果你使用 CRI-O 与 crun，这一特性在 [CRI-O 1.28.1][CRIO-release] 和 crun 1.9 或更高版本中受支持。
+如果你使用 CRI-O 与 runc，目前仍不受支持。
+
+<!--
+containerd support is currently targeted for containerd 2.0; it is likely that
+it won't matter if you use it with crun or runc.
+
+Please note that containerd 1.7 added _experimental_ support for user
+namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27
+is not supported by containerd 1.7, therefore it only works, in terms of user
+namespaces support, with Kubernetes 1.25 and 1.26.
+-->
+containerd 对此特性的支持目前计划在 containerd 2.0 中提供；届时无论搭配 crun 还是 runc 使用，很可能都没有差别。
+
+请注意，containerd 1.7 针对 Kubernetes 1.25 和 1.26 中实现的用户命名空间添加了**实验性**支持。
+1.27 版本中进行的重新设计不受 containerd 1.7 支持，
+因此就用户命名空间支持而言，它仅适用于 Kubernetes 1.25 和 1.26。
+
+<!--
+One limitation present in containerd 1.7 is that it needs to change the
+ownership of every file and directory inside the container image, during Pod
+startup. This means it has a storage overhead and can significantly impact the
+container startup latency. Containerd 2.0 will probably include an implementation
+that will eliminate the startup latency added and the storage overhead. Take
+this into account if you plan to use containerd 1.7 with user namespaces in
+production.
+
+None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].
+
+[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
+-->
+containerd 1.7 存在的一个限制是，在 Pod 启动期间需要更改容器镜像中每个文件和目录的所有权。
+这意味着它具有存储开销，并且可能会显著影响容器启动延迟。containerd 2.0
+可能会包括一个实现，可以消除增加的启动延迟和存储开销。如果计划在生产中使用
+containerd 1.7 与用户命名空间，请考虑这一点。
+
+这些 containerd 限制均不适用于 [CRI-O 1.28][CRIO-release]。
+
+[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
+
+<!--
+## What’s next?
+
+Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
+namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For
+the time being, the plan is to relax checks in PSS policies when user namespaces are
+in use. This means that the fields `spec[.*].securityContext` `runAsUser`,
+`runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities` will not trigger a
+violation if user namespaces are in use. The behavior will probably be controlled by
+utilizing an API Server feature gate, like `UserNamespacesPodSecurityStandards`
+or similar.
+-->
+## 接下来？
+
+展望 Kubernetes 1.29，计划是与 SIG Auth 合作，将用户命名空间集成到 Pod 安全标准（PSS）和 Pod 安全准入中。
+目前的计划是在使用用户命名空间时放宽 Pod 安全标准（PSS）策略中的检查。这意味着如果使用用户命名空间，那么
+`spec[.*].securityContext` 下的 `runAsUser`、`runAsNonRoot`、`allowPrivilegeEscalation` 和 `capabilities`
+字段将不会触发违规。此行为可能会通过 API Server 特性门控来控制，比如 `UserNamespacesPodSecurityStandards` 或类似名称的门控。
+
+<!--
+## How do I get involved?
+
+You can reach SIG Node by several means:
+- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
+- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
+- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
+
+You can also contact us directly:
+- GitHub: @rata @giuseppe @saschagrunert
+- Slack: @rata @giuseppe @sascha
+-->
+## 我该如何参与？
+
+你可以通过以下方式与 SIG Node 联系：
+
+- Slack：[#sig-node](https://kubernetes.slack.com/messages/sig-node)
+- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
+- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
+
+你还可以直接联系我们：
+
+- GitHub：@rata @giuseppe @saschagrunert
+- Slack：@rata @giuseppe @sascha