---
layout: blog
title: "kube-proxy 的 NFTables 模式"
date: 2025-02-28
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
translator: >
  Xin Li (DaoCloud)
---
<!--
layout: blog
title: "NFTables mode for kube-proxy"
date: 2025-02-28
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
-->

<!--
A new nftables mode for kube-proxy was introduced as an alpha feature
in Kubernetes 1.29. Currently in beta, it is expected to be GA as of
1.33. The new mode fixes long-standing performance problems with the
iptables mode and all users running on systems with reasonably-recent
kernels are encouraged to try it out. (For compatibility reasons, even
once nftables becomes GA, iptables will still be the _default_.)
-->
Kubernetes 1.29 引入了一种新的 Alpha 特性:kube-proxy 的 nftables 模式。
目前该模式处于 Beta 阶段,并预计将在 1.33 版本中达到一般可用(GA)状态。
新模式解决了 iptables 模式长期存在的性能问题,建议所有运行在较新内核版本系统上的用户尝试使用。
出于兼容性原因,即使 nftables 成为 GA 功能,iptables 仍将是**默认**模式。

<!--
## Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and
has problems scaling up to support Service proxying in a large
Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a
number of iptables rules proportional to the sum of the number of
Services and the total number of endpoints. In particular, at the top
level of the ruleset, there is one rule to test each possible Service
IP (and port) that a packet might be addressed to:
-->
## 为什么选择 nftables?第一部分:数据平面延迟

iptables API 的设计初衷是实现简单的防火墙,在扩展到为包含数万个 Service
的大型 Kubernetes 集群提供 Service 代理时存在问题。

通常,kube-proxy 在 iptables 模式下生成的规则集中的 iptables 规则数量与
Service 数量和总端点数量的总和成正比。
特别是,在规则集的顶层,针对数据包可能指向的每个可能的 Service IP(以及端口),
都有一条规则用于测试:

<!--
```
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```
-->
```
# 如果数据包的目标地址是 172.30.0.41:80,则跳转到 KUBE-SVC-XPGD46QRK7WJZT7O 链进行进一步处理
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# 如果数据包的目标地址是 172.30.0.42:443,则...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# 等等...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```

<!--
This means that when a packet comes in, the time it takes the kernel
to check it against all of the Service rules is **O(n)** in the number
of Services. As the number of Services increases, both the average and
the worst-case latency for the first packet of a new connection
increases (with the difference between best-case, average, and
worst-case being mostly determined by whether a given Service IP
address appears earlier or later in the `KUBE-SERVICES` chain).

{{< figure src="iptables-only.svg" alt="kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes" >}}

By contrast, with nftables, the normal way to write a ruleset like
this is to have a _single_ rule, using a "verdict map" to do the
dispatch:
-->
这意味着当数据包到达时,内核检查该数据包与所有 Service 规则所需的时间是 **O(n)**,
其中 n 为 Service 的数量。随着 Service 数量的增加,新连接的第一个数据包的平均延迟和最坏情况下的延迟都会增加
(最佳情况、平均情况和最坏情况之间的差异主要取决于某个 Service IP 地址在 `KUBE-SERVICES`
链中出现的顺序是靠前还是靠后)。

{{< figure src="iptables-only.svg" alt="kube-proxy iptables 在不同规模集群中各百分位数下的第一个数据包延迟" >}}

相比之下,使用 nftables 时,编写此类规则集的常规方法是只使用**一条**规则,
并通过“判决映射”(verdict map)来完成分发:

<!--
```
table ip kube-proxy {

    # The service-ips verdict map indicates the action to take for each matching packet.
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # Now we just need a single rule to process all packets matching an
    # element in the map. (This rule says, "construct a tuple from the
    # destination IP address, layer 4 protocol, and destination port; look
    # that tuple up in "service-ips"; and if there's a match, execute the
    # associated verdict.)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }

    ...
}
```
-->
```none
table ip kube-proxy {

    # service-ips 判决映射指示了对每个匹配数据包应采取的操作。
    map service-ips {
        type ipv4_addr . inet_proto . inet_service : verdict
        comment "ClusterIP、ExternalIP 和 LoadBalancer IP 流量"
        elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                     172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                     172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                     ... }
    }

    # 现在我们只需要一条规则来处理所有与映射中元素匹配的数据包。
    # (此规则表示:“根据目标 IP 地址、第 4 层协议和目标端口构建一个元组;
    # 在 'service-ips' 中查找该元组;如果找到匹配项,则执行与之关联的判定。”)
    chain services {
        ip daddr . meta l4proto . th dport vmap @service-ips
    }

    ...
}
```

<!--
Since there's only a single rule, with a roughly **O(1)** map lookup,
packet processing time is more or less constant regardless of cluster
size, and the best/average/worst cases are very similar:

{{< figure src="nftables-only.svg" alt="kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes" >}}
-->
由于只有一条规则,并且映射查找的时间复杂度大约为 **O(1)**,因此数据包处理时间几乎与集群规模无关,
并且最佳、平均和最坏情况下的表现非常接近:

{{< figure src="nftables-only.svg" alt="kube-proxy nftables 在不同规模集群中各百分位数下的第一个数据包延迟" >}}

<!--
But note the huge difference in the vertical scale between the
iptables and nftables graphs! In the clusters with 5000 and 10,000
Services, the p50 (average) latency for nftables is about the same as
the p01 (approximately best-case) latency for iptables. In the 30,000
Service cluster, the p99 (approximately worst-case) latency for
nftables manages to beat out the p01 latency for iptables by a few
microseconds! Here's both sets of data together, but you may have to
squint to see the nftables results!:

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}
-->
但请注意,iptables 图和 nftables 图的纵轴刻度差异巨大!
在包含 5000 和 10,000 个 Service 的集群中,nftables 的 p50(平均)延迟与 iptables
的 p01(接近最佳情况)延迟大致相同。
在包含 30,000 个 Service 的集群中,nftables 的 p99(接近最坏情况)延迟甚至比 iptables 的 p01 延迟还要快几微秒!
以下是两组数据的对比图,但你可能需要仔细观察才能看到 nftables 的结果!

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables 与 nftables 在不同规模集群中各百分位数下的第一个数据包延迟对比" >}}

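顺便一提,如果想查看 kube-proxy 在 nftables 模式下于某个节点上实际生成的规则集,
可以直接使用 `nft` 命令行工具。下面的命令仅作示意(假设你以 root 身份在节点上运行命令、
节点上安装了 1.0.0 或更新版本的 `nft`,且 kube-proxy 正以 nftables 模式运行;
kube-proxy 会在 `ip` 和 `ip6` 族中各创建一个名为 `kube-proxy` 的表):

```none
# 列出 kube-proxy 的整个 IPv4 表(IPv6 对应的表为 ip6 kube-proxy)
nft list table ip kube-proxy

# 只查看上文提到的 service-ips 判决映射
nft list map ip kube-proxy service-ips
```
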
<!--
## Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are
great, there's another problem with iptables kube-proxy that often
keeps users from even being able to grow their clusters to that size:
the time it takes kube-proxy to program new iptables rules when
Services and their endpoints change.
-->
## 为什么选择 nftables?第二部分:控制平面延迟

虽然在大型集群中数据平面延迟的改进非常显著,但 iptables 模式的 kube-proxy 还存在另一个问题,
这往往使得用户无法将集群扩展到较大规模:那就是当 Service 及其端点发生变化时,kube-proxy
更新 iptables 规则所需的时间。

<!--
With both iptables and nftables, the total size of the ruleset as a
whole (actual rules, plus associated data) is **O(n)** in the combined
number of Services and their endpoints. Originally, the iptables
backend would rewrite every rule on every update, and with tens of
thousands of Services, this could grow to be hundreds of thousands of
iptables rules. Starting in Kubernetes 1.26, we began improving
kube-proxy so that it could skip updating _most_ of the unchanged
rules in each update, but the limitations of `iptables-restore` as an
API meant that it was still always necessary to send an update that's
**O(n)** in the number of Services (though with a noticeably smaller
constant than it used to be). Even with those optimizations, it can
still be necessary to make use of kube-proxy's `minSyncPeriod` config
option to ensure that it doesn't spend every waking second trying to
push iptables updates.
-->
对于 iptables 和 nftables,规则集的整体大小(实际规则加上相关数据)与 Service
及其端点的总数呈 **O(n)** 关系。最初,iptables 后端在每次更新时都会重写所有规则,
当集群中存在数万个 Service 时,这可能导致规则数量增长至数十万条 iptables 规则。
从 Kubernetes 1.26 开始,我们开始优化 kube-proxy,使其能够在每次更新时跳过对大多数未更改规则的更新,
但由于 `iptables-restore` API 的限制,仍然需要发送与 Service 数量呈 **O(n)**
比例的更新(尽管常数因子比以前明显减小)。即使进行了这些优化,有时仍需使用 kube-proxy 的
`minSyncPeriod` 配置选项,以确保它不会把所有时间都花在尝试推送 iptables 更新上。

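下面是一个最小的 kube-proxy 配置文件片段,演示如何在 iptables 模式下设置
`minSyncPeriod`(仅为示意:字段来自 KubeProxyConfiguration API,具体取值需要根据集群规模自行权衡):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"
iptables:
  # 两次规则同步之间的最小间隔:调大可以减少大型集群中频繁的全量更新,
  # 代价是 Service 与端点的变更需要更长时间才能生效
  minSyncPeriod: 10s
```
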
<!--
The nftables APIs allow for doing much more incremental updates, and
when kube-proxy in nftables mode does an update, the size of the
update is only **O(n)** in the number of Services and endpoints that
have changed since the last sync, regardless of the total number of
Services and endpoints. The fact that the nftables API allows each
nftables-using component to have its own private table also means that
there is no global lock contention between components like with
iptables. As a result, kube-proxy's nftables updates can be done much
more efficiently than with iptables.

(Unfortunately I don't have cool graphs for this part.)
-->
nftables API 支持更为增量化的更新,当以 nftables 模式运行的 kube-proxy 执行更新时,
更新的规模仅与自上次同步以来发生变化的 Service 和端点数量呈 **O(n)** 关系,而与总的 Service 和端点数量无关。
此外,由于 nftables API 允许每个使用 nftables 的组件拥有自己的私有表,因此不会像 iptables
那样在组件之间产生全局锁竞争。结果是,kube-proxy 在 nftables 模式下的更新可以比 iptables 模式下高效得多。

(不幸的是,这部分我没有酷炫的图表。)

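为了直观说明“增量更新”的含义,下面给出一个示意性的 nftables 事务片段(其中的 IP 地址和链名均为虚构)。
kube-proxy 实际是以类似的事务形式(通过 `nft -f`)原子地应用变更:
只需添加或删除发生变化的映射元素,而不必像 `iptables-restore` 那样重新提交整条链:

```none
# 新增一个 Service:只向判决映射中添加一个元素(链名为虚构示例)
add element ip kube-proxy service-ips { 172.30.0.44 . tcp . 80 : goto service-ABCD1234-namespace4/service4/tcp/p80 }

# 删除一个 Service:按键移除对应元素,其余成千上万个元素不受影响
delete element ip kube-proxy service-ips { 172.30.0.43 . tcp . 80 }
```
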
<!--
## Why _not_ nftables? {#why-not-nftables}

All that said, there are a few reasons why you might not want to jump
right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit
tests, performs correctly in our CI system, and has now been used in
the real world by multiple users, it has not seen anything close to as
much real-world usage as the iptables backend has, so we can't promise
that it is as stable and bug-free.
-->
## 为什么不选择 nftables? {#why-not-nftables}

尽管如此,仍有几个原因可能让你目前还不想立即使用 nftables 后端。

首先,这部分代码仍然相对较新。虽然它拥有大量的单元测试,在我们的 CI 系统中表现正确,
并且已经在现实世界中被多个用户使用,但其实际使用量远远不及 iptables 后端,
因此我们无法保证它同样稳定且无缺陷。

<!--
Second, the nftables mode will not work on older Linux distributions;
currently it requires a 5.13 or newer kernel. Additionally, because of
bugs in early versions of the `nft` command line tool, you should not
run kube-proxy in nftables mode on nodes that have an old (earlier
than 1.0.0) version of `nft` in the host filesystem (or else
kube-proxy's use of nftables may interfere with other uses of nftables
on the system).
-->
其次,nftables 模式无法在较旧的 Linux 发行版上工作;目前它需要 5.13 或更高版本的内核。
此外,由于早期版本的 `nft` 命令行工具存在缺陷,如果节点的主机文件系统中安装的是旧版本
(早于 1.0.0)的 `nft`,则不应在这些节点上以 nftables 模式运行 kube-proxy
(否则 kube-proxy 对 nftables 的使用可能会干扰系统上其他程序对 nftables 的使用)。

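可以在节点上用下面的命令快速确认这两项要求(仅为示意,不同发行版中查看方式可能略有差异):

```none
# 内核版本需为 5.13 或更高
uname -r

# 主机文件系统中的 nft 版本需为 1.0.0 或更高
nft --version
```
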
<!--
Third, you may have other networking components in your cluster, such
as the pod network or NetworkPolicy implementation, that do not yet
support kube-proxy in nftables mode. You should consult the
documentation (or forums, bug tracker, etc.) for any such components
to see if they have problems with nftables mode. (In many cases they
will not; as long as they don't try to directly interact with or
override kube-proxy's iptables rules, they shouldn't care whether
kube-proxy is using iptables or nftables.) Additionally, observability
and monitoring tools that have not been updated may report less data
for kube-proxy in nftables mode than they do for kube-proxy in
iptables mode.
-->
第三,你的集群中可能还存在其他网络组件,例如 Pod 网络或 NetworkPolicy 实现,
这些组件可能尚不支持以 nftables 模式运行的 kube-proxy。你应查阅相关组件的文档
(或论坛、问题跟踪系统等),以确认它们是否与 nftables 模式存在兼容性问题。
(在许多情况下它们不会受到影响;只要它们不尝试直接操作或覆盖 kube-proxy 的 iptables 规则,
就不会在意 kube-proxy 使用的是 iptables 还是 nftables。)
此外,尚未更新的可观测性和监控工具在 nftables 模式下报告的 kube-proxy
数据可能会比在 iptables 模式下少。

<!--
Finally, kube-proxy in nftables mode is intentionally not 100%
compatible with kube-proxy in iptables mode. There are a few old
kube-proxy features whose default behaviors are less secure, less
performant, or less intuitive than we'd like, but where we felt that
changing the default would be a compatibility break. Since the
nftables mode is opt-in, this gave us a chance to fix those bad
defaults without breaking users who weren't expecting changes. (In
particular, with nftables mode, NodePort Services are now only
reachable on their nodes' default IPs, as opposed to being reachable
on all IPs, including `127.0.0.1`, with iptables mode.) The
[kube-proxy documentation] has more information about this, including
information about metrics you can look at to determine if you are
relying on any of the changed functionality, and what configuration
options are available to get more backward-compatible behavior.

[kube-proxy documentation]: https://kubernetes.io/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables
-->
最后,以 nftables 模式运行的 kube-proxy 有意不与以 iptables 模式运行的 kube-proxy 保持 100% 兼容。
有一些较旧的 kube-proxy 特性,其默认行为不如我们期望的那样安全、高效或直观,
但我们又认为直接更改默认值会破坏兼容性。由于 nftables 模式是可选启用的,
这给了我们一个机会,可以在不影响那些并不期望行为发生变化的用户的前提下,修复这些不合理的默认行为。
(特别是,在 nftables 模式下,NodePort 类型的 Service 现在仅在其节点的默认 IP 上可访问,
而在 iptables 模式下,它们在所有 IP(包括 `127.0.0.1`)上均可访问。)
[kube-proxy 文档] 提供了关于此问题的更多信息,
包括你可以查看哪些指标来判断自己是否依赖了任何已更改的功能,
以及有哪些配置选项可用于获得更向后兼容的行为。

[kube-proxy 文档]: https://kubernetes.io/zh-cn/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables

<!--
## Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to
pass `--proxy-mode nftables` to kube-proxy (or set `mode: nftables` in
your kube-proxy config file).

If you are using kubeadm to set up your cluster, the kubeadm
documentation explains [how to pass a `KubeProxyConfiguration` to
`kubeadm init`]. You can also [deploy nftables-based clusters with
`kind`].
-->
## 尝试使用 nftables 模式

准备尝试了吗?在 Kubernetes 1.31 及更高版本中,你只需将 `--proxy-mode nftables`
参数传递给 kube-proxy(或在 kube-proxy 配置文件中设置 `mode: nftables`)。

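例如,一个只指定代理模式的最小 kube-proxy 配置文件大致如下(仅为示意,省略了其他字段):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# 将代理模式从默认的 iptables 切换为 nftables
mode: "nftables"
```
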
如果你使用 kubeadm 部署集群,kubeadm 文档解释了[如何向 `kubeadm init` 传递 `KubeProxyConfiguration`]。
你还可以[通过 `kind` 部署基于 nftables 的集群]。

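例如,一个启用 nftables 模式的 kind 集群配置文件大致如下(仅为示意,
假设所用的 kind 版本已支持 nftables 代理模式,细节请以上面链接中的 kind 文档为准):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  # 让 kind 在创建集群时将 kube-proxy 配置为 nftables 模式
  kubeProxyMode: "nftables"
```
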
<!--
You can also convert existing clusters from iptables (or ipvs) mode to
nftables by updating the kube-proxy configuration and restarting the
kube-proxy pods. (You do not need to reboot the nodes: when restarting
in nftables mode, kube-proxy will delete any existing iptables or ipvs
rules, and likewise, if you later revert back to iptables or ipvs
mode, it will delete any existing nftables rules.)

[how to pass a `KubeProxyConfiguration` to `kubeadm init`]: /docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[deploy nftables-based clusters with `kind`]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode
-->
你还可以通过更新 kube-proxy 配置并重启 kube-proxy Pod,将现有集群从
iptables(或 ipvs)模式转换为 nftables 模式。(无需重启节点:
在以 nftables 模式重新启动时,kube-proxy 会删除现有的所有 iptables 或 ipvs 规则;
同样,如果你之后切换回 iptables 或 ipvs 模式,它将删除现有的所有 nftables 规则。)

[如何向 `kubeadm init` 传递 `KubeProxyConfiguration`]: /zh-cn/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[通过 `kind` 部署基于 nftables 的集群]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode

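以 kubeadm 部署的集群为例(假设 kube-proxy 以 DaemonSet 方式运行,
其配置保存在 kube-system 命名空间中名为 `kube-proxy` 的 ConfigMap 里),转换过程大致如下:

```none
# 编辑 kube-proxy 配置,将其中的 mode 字段改为 "nftables"
kubectl -n kube-system edit configmap kube-proxy

# 滚动重启 kube-proxy DaemonSet,使新配置生效
kubectl -n kube-system rollout restart daemonset kube-proxy
```
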
<!--
## Future plans

As mentioned above, while nftables is now the _best_ kube-proxy mode,
it is not the _default_, and we do not yet have a plan for changing
that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main
advantage over iptables was that it was faster, but certain aspects of
the IPVS architecture and APIs were awkward for kube-proxy's purposes
(for example, the fact that the `kube-ipvs0` device needs to have
_every_ Service IP address assigned to it), and some parts of
Kubernetes Service proxying semantics were difficult to implement
using IPVS (particularly the fact that some Services had to have
different endpoints depending on whether you connected to them from a
local or remote client). And now, the nftables mode has the same
performance as IPVS mode (actually, slightly better), without any of
the downsides:
-->
## 未来计划

如上所述,虽然 nftables 现在是 kube-proxy 的**最佳**模式,但它还不是**默认**模式,
我们目前也还没有更改这一默认值的计划。我们将继续长期支持 iptables 模式。

kube-proxy 的 IPVS 模式的未来则不太确定:它相对于 iptables 的主要优势在于速度更快,
但 IPVS 的架构和 API 在某些方面对 kube-proxy 来说不够理想(例如,`kube-ipvs0`
设备需要被分配**所有** Service IP 地址),
并且 Kubernetes Service 代理的部分语义使用 IPVS 难以实现(特别是某些
Service 根据连接的客户端是本地还是远程,需要有不同的端点)。
现在,nftables 模式的性能与 IPVS 模式相同(实际上还略胜一筹),而且没有上述任何缺点:

<!--
{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

(In theory the IPVS mode also has the advantage of being able to use
various other IPVS functionality, like alternative "schedulers" for
balancing endpoints. In practice, this ended up not being very useful,
because kube-proxy runs independently on every node, and the IPVS
schedulers on each node had no way of sharing their state with the
proxies on other nodes, thus thwarting the effort to balance traffic
more cleverly.)
-->
{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy IPVS 与 nftables 在不同规模集群中各百分位数下的第一个数据包延迟对比" >}}

(理论上,IPVS 模式还具有可以使用其他 IPVS 功能的优势,例如使用替代的“调度器”来平衡端点。
但实际上,这并不太有用,因为 kube-proxy 在每个节点上独立运行,每个节点上的 IPVS
调度器无法与其他节点上的代理共享状态,从而无法实现更智能的流量均衡。)

<!--
While the Kubernetes project does not have an immediate plan to drop
the IPVS backend, it is probably doomed in the long run, and people
who are currently using IPVS mode should try out the nftables mode
instead (and file bugs if you think there is missing functionality in
nftables mode that you can't work around).
-->
虽然 Kubernetes 项目目前没有立即放弃 IPVS 后端的计划,但从长远来看,IPVS 可能难逃被淘汰的命运。
目前使用 IPVS 模式的用户应尝试使用 nftables 模式(如果发现 nftables 模式中缺少某些无法绕过的功能,
请提交问题报告)。

<!--
## Learn more

- "[KEP-3866: Add an nftables-based kube-proxy backend]" has the
  history of the new feature.

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]",
  from KubeCon/CloudNativeCon North America 2024, talks about porting
  kube-proxy and Calico from iptables to nftables.

- "[From Observability to Performance]", from KubeCon/CloudNativeCon
  North America 2024. (This is where the kube-proxy latency data came
  from; the [raw data for the charts] is also available.)
-->
## 进一步了解

- "[KEP-3866: Add an nftables-based kube-proxy backend]" 记录了此新特性的历史。

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]",来自 2024 年
  KubeCon/CloudNativeCon 北美大会,讨论了将 kube-proxy 和 Calico 从 iptables 迁移到 nftables 的过程。

- "[From Observability to Performance]",同样来自 2024 年 KubeCon/CloudNativeCon 北美大会。
  (kube-proxy 的延迟数据即来源于此;[图表的原始数据][raw data for the charts]也可供查阅。)

[KEP-3866: Add an nftables-based kube-proxy backend]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md
[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]: https://youtu.be/yOGHb2HjslY?si=6O4PVJu7fGpReo1U
[From Observability to Performance]: https://youtu.be/uYo2O3jbJLk?si=py2AXzMJZ4PuhxNg
[raw data for the charts]: https://docs.google.com/spreadsheets/d/1-ryDNc6gZocnMHEXC7mNtqknKSOv5uhXFKDx8Hu3AYA/edit