---
layout: blog
title: "kube-proxy 的 NFTables 模式"
date: 2025-02-28
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
translator: Xin Li (Daocloud)
---
<!--
layout: blog
title: "NFTables mode for kube-proxy"
date: 2025-02-28
slug: nftables-kube-proxy
author: >
  Dan Winship (Red Hat)
-->

<!--
A new nftables mode for kube-proxy was introduced as an alpha feature
in Kubernetes 1.29. Currently in beta, it is expected to be GA as of
1.33. The new mode fixes long-standing performance problems with the
iptables mode and all users running on systems with reasonably-recent
kernels are encouraged to try it out. (For compatibility reasons, even
once nftables becomes GA, iptables will still be the _default_.)
-->
Kubernetes 1.29 引入了一种新的 Alpha 特性：kube-proxy 的 nftables 模式。
目前该模式处于 Beta 阶段，并预计将在 1.33 版本中达到一般可用（GA）状态。
新模式解决了 iptables 模式长期存在的性能问题，建议所有运行在较新内核版本系统上的用户尝试使用。
出于兼容性原因，即使 nftables 成为 GA 功能，iptables 仍将是**默认**模式。

<!--
## Why nftables? Part 1: data plane latency

The iptables API was designed for implementing simple firewalls, and
has problems scaling up to support Service proxying in a large
Kubernetes cluster with tens of thousands of Services.

In general, the ruleset generated by kube-proxy in iptables mode has a
number of iptables rules proportional to the sum of the number of
Services and the total number of endpoints. In particular, at the top
level of the ruleset, there is one rule to test each possible Service
IP (and port) that a packet might be addressed to:
-->
## 为什么选择 nftables？第一部分：数据平面延迟

iptables API 是为实现简单防火墙而设计的，在扩展到支持包含数万个 Service 的大型 Kubernetes
集群中的 Service 代理时存在问题。

通常，kube-proxy 在 iptables 模式下生成的规则集中的 iptables 规则数量与
Service 数量和总端点数量的总和成正比。
特别是，在规则集的顶层，针对数据包可能指向的每个可能的 Service IP（以及端口），
都有一条规则用于测试。

<!--
```
# If the packet is addressed to 172.30.0.41:80, then jump to the chain
# KUBE-SVC-XPGD46QRK7WJZT7O for further processing
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# If the packet is addressed to 172.30.0.42:443, then...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# etc...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```
-->
```
# 如果数据包的目标地址是 172.30.0.41:80，则跳转到 KUBE-SVC-XPGD46QRK7WJZT7O 链进行进一步处理
-A KUBE-SERVICES -m comment --comment "namespace1/service1:p80 cluster IP" -m tcp -p tcp -d 172.30.0.41 --dport 80 -j KUBE-SVC-XPGD46QRK7WJZT7O

# 如果数据包的目标地址是 172.30.0.42:443，则...
-A KUBE-SERVICES -m comment --comment "namespace2/service2:p443 cluster IP" -m tcp -p tcp -d 172.30.0.42 --dport 443 -j KUBE-SVC-GNZBNJ2PO5MGZ6GT

# 等等...
-A KUBE-SERVICES -m comment --comment "namespace3/service3:p80 cluster IP" -m tcp -p tcp -d 172.30.0.43 --dport 80 -j KUBE-SVC-X27LE4BHSL4DOUIK
```

<!--
This means that when a packet comes in, the time it takes the kernel
to check it against all of the Service rules is **O(n)** in the number
of Services. As the number of Services increases, both the average and
the worst-case latency for the first packet of a new connection
increases (with the difference between best-case, average, and
worst-case being mostly determined by whether a given Service IP
address appears earlier or later in the `KUBE-SERVICES` chain).

{{< figure src="iptables-only.svg" alt="kube-proxy iptables first packet latency, at various percentiles, in clusters of various sizes" >}}

By contrast, with nftables, the normal way to write a ruleset like
this is to have a _single_ rule, using a "verdict map" to do the
dispatch:
-->
这意味着当数据包到达时，内核将其与所有 Service 规则逐一比对所需的时间是 **O(n)**，
其中 n 为 Service 的数量。随着 Service 数量的增加，新连接的第一个数据包的平均延迟和最坏情况下的延迟都会增加
（最佳情况、平均情况和最坏情况之间的差异主要取决于某个 Service IP 地址在 `KUBE-SERVICES`
链中出现的顺序是靠前还是靠后）。

{{< figure src="iptables-only.svg" alt="kube-proxy iptables 在不同规模集群中各百分位数下的第一个数据包延迟" >}}

相比之下，使用 nftables，编写此类规则集的常规方法是使用一个单一规则，
并通过"判决映射"（verdict map）来完成分发：

<!--
```
table ip kube-proxy {

        # The service-ips verdict map indicates the action to take for each matching packet.
	map service-ips {
		type ipv4_addr . inet_proto . inet_service : verdict
		comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
		elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                             172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                             172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                             ... }
        }

        # Now we just need a single rule to process all packets matching an
        # element in the map. (This rule says, "construct a tuple from the
        # destination IP address, layer 4 protocol, and destination port; look
        # that tuple up in "service-ips"; and if there's a match, execute the
        # associated verdict.)
	chain services {
		ip daddr . meta l4proto . th dport vmap @service-ips
	}

        ...
}
```
-->
```none
table ip kube-proxy {

  # service-ips 判决映射指示了对每个匹配数据包应采取的操作。
  map service-ips {
    type ipv4_addr . inet_proto . inet_service : verdict
    comment "ClusterIP、ExternalIP 和 LoadBalancer IP 流量"
    elements = { 172.30.0.41 . tcp . 80 : goto service-ULMVA6XW-namespace1/service1/tcp/p80,
                 172.30.0.42 . tcp . 443 : goto service-42NFTM6N-namespace2/service2/tcp/p443,
                 172.30.0.43 . tcp . 80 : goto service-4AT6LBPK-namespace3/service3/tcp/p80,
                 ... }
  }

  # 现在我们只需要一条规则来处理所有与映射中元素匹配的数据包。
  # （此规则表示："根据目标 IP 地址、第 4 层协议和目标端口构建一个元组；
  # 在 'service-ips' 中查找该元组；如果找到匹配项，则执行与之关联的判定。"）
  chain services {
    ip daddr . meta l4proto . th dport vmap @service-ips
  }

  ...
}
```

<!--
Since there's only a single rule, with a roughly **O(1)** map lookup,
packet processing time is more or less constant regardless of cluster
size, and the best/average/worst cases are very similar:

{{< figure src="nftables-only.svg" alt="kube-proxy nftables first packet latency, at various percentiles, in clusters of various sizes" >}}
-->
由于只有一条规则，并且映射查找的时间复杂度大约为 **O(1)**，因此数据包处理时间几乎与集群规模无关，
并且最佳、平均和最坏情况下的表现非常接近：

{{< figure src="nftables-only.svg" alt="kube-proxy nftables 在不同规模集群中各百分位数下的第一个数据包延迟" >}}
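下面用一个极简的 Python 草图来对比这两种分发方式。它只是一个示意模型，并非 kube-proxy
或内核的实际实现：`build_services`、`linear_dispatch`、`map_dispatch`
等名称以及示例中的 Service 地址都是为说明而假设的，Python 的 `dict` 在这里仅用来类比判决映射。

```python
# 示意模型：对比"逐条规则匹配"与"判决映射查找"。
# 所有函数名和数据均为假设，仅用于说明复杂度差异。

def build_services(n):
    """生成 n 个假设的 (IP, 协议, 端口) -> 链名 条目。"""
    services = []
    for i in range(n):
        ip = f"172.30.{i // 256}.{i % 256}"
        services.append(((ip, "tcp", 80), f"service-chain-{i}"))
    return services


def linear_dispatch(rules, packet):
    """类似 iptables 的 KUBE-SERVICES 链：从头到尾逐条比较。

    返回 (匹配到的链名, 比较次数)。
    """
    comparisons = 0
    for key, chain in rules:
        comparisons += 1
        if key == packet:
            return chain, comparisons
    return None, comparisons


def map_dispatch(verdict_map, packet):
    """类似 nftables 的判决映射：一次哈希查找。"""
    return verdict_map.get(packet)


rules = build_services(10000)
verdict_map = dict(rules)

first = rules[0][0]   # 位于链首的 Service
last = rules[-1][0]   # 位于链尾的 Service

print(linear_dispatch(rules, first))
print(linear_dispatch(rules, last))
print(map_dispatch(verdict_map, last))
```

在这个模型中，链首的 Service 只需 1 次比较，而链尾的 Service 需要 10000 次比较，
这对应于 iptables 模式下最佳情况与最坏情况之间的差距；映射查找的开销则与条目所在位置无关。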

<!--
But note the huge difference in the vertical scale between the
iptables and nftables graphs! In the clusters with 5000 and 10,000
Services, the p50 (average) latency for nftables is about the same as
the p01 (approximately best-case) latency for iptables. In the 30,000
Service cluster, the p99 (approximately worst-case) latency for
nftables manages to beat out the p01 latency for iptables by a few
microseconds! Here's both sets of data together, but you may have to
squint to see the nftables results!:

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}
-->
但请注意图表中 iptables 和 nftables 之间在纵轴上的巨大差异！
在包含 5000 和 10,000 个 Service 的集群中，nftables 的 p50（平均）延迟与 iptables
的 p01（接近最佳情况）延迟大致相同。
在包含 30,000 个 Service 的集群中，nftables 的 p99（接近最坏情况）延迟甚至比 iptables 的 p01 延迟还低几微秒！
以下是两组数据的对比图，但你可能需要仔细观察才能看到 nftables 的结果！

{{< figure src="iptables-vs-nftables.svg" alt="kube-proxy iptables 与 nftables 在不同规模集群中各百分位数下的第一个数据包延迟对比" >}}

<!--
## Why nftables? Part 2: control plane latency

While the improvements to data plane latency in large clusters are
great, there's another problem with iptables kube-proxy that often
keeps users from even being able to grow their clusters to that size:
the time it takes kube-proxy to program new iptables rules when
Services and their endpoints change.
-->
## 为什么选择 nftables？第二部分：控制平面延迟

虽然在大型集群中数据平面延迟的改进非常显著，但 iptables 模式的 kube-proxy 还存在另一个问题，
这往往使得用户无法将集群扩展到较大规模：那就是当 Service 及其端点发生变化时，kube-proxy
更新 iptables 规则所需的时间。

<!--
With both iptables and nftables, the total size of the ruleset as a
whole (actual rules, plus associated data) is **O(n)** in the combined
number of Services and their endpoints. Originally, the iptables
backend would rewrite every rule on every update, and with tens of
thousands of Services, this could grow to be hundreds of thousands of
iptables rules. Starting in Kubernetes 1.26, we began improving
kube-proxy so that it could skip updating _most_ of the unchanged
rules in each update, but the limitations of `iptables-restore` as an
API meant that it was still always necessary to send an update that's
**O(n)** in the number of Services (though with a noticeably smaller
constant than it used to be). Even with those optimizations, it can
still be necessary to make use of kube-proxy's `minSyncPeriod` config
option to ensure that it doesn't spend every waking second trying to
push iptables updates.
-->
对于 iptables 和 nftables，规则集的整体大小（实际规则加上相关数据）与 Service
及其端点的总数呈 **O(n)** 关系。原来，iptables 后端在每次更新时都会重写所有规则，
当集群中存在数万个 Service 时，这可能导致规则数量增长至数十万条 iptables 规则。
从 Kubernetes 1.26 开始，我们开始优化 kube-proxy，使其能够在每次更新时跳过对大多数未更改规则的更新，
但由于 `iptables-restore` API 的限制，仍然需要发送与 Service 数量呈 **O(n)**
比例的更新（尽管常数因子比以前明显减小）。即使进行了这些优化，有时仍需使用 kube-proxy 的
`minSyncPeriod` 配置选项，以确保它不会把所有时间都花在推送 iptables 更新上。

<!--
The nftables APIs allow for doing much more incremental updates, and
when kube-proxy in nftables mode does an update, the size of the
update is only **O(n)** in the number of Services and endpoints that
have changed since the last sync, regardless of the total number of
Services and endpoints. The fact that the nftables API allows each
nftables-using component to have its own private table also means that
there is no global lock contention between components like with
iptables. As a result, kube-proxy's nftables updates can be done much
more efficiently than with iptables.

(Unfortunately I don't have cool graphs for this part.)
-->
nftables API 支持更为增量化的更新，当以 nftables 模式运行的 kube-proxy 执行更新时，
更新的规模仅与自上次同步以来发生变化的 Service 和端点数量呈 **O(n)** 关系，而与总的 Service 和端点数量无关。
此外，由于 nftables API 允许每个使用 nftables 的组件拥有自己的私有表，因此不会像 iptables
那样在组件之间产生全局锁竞争。结果是，kube-proxy 在 nftables 模式下的更新可以比 iptables 模式下高效得多。
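增量更新的思路可以用下面的 Python 草图来说明。这只是一个假设性的模型，
并非 kube-proxy 的实际代码：`diff_services` 函数、操作名称（`add`、`delete`、`replace`）
以及示例中的 Service 名称和端点地址都是为说明而虚构的。

```python
# 示意模型：只为发生变化的 Service 生成更新操作。
# 数据结构与操作名称均为假设，并非 kube-proxy 的实际代码。

def diff_services(old, new):
    """比较两次同步之间的状态，返回需要下发的操作列表。

    old/new: dict，键为 Service 名称，值为端点组成的元组。
    """
    ops = []
    for name in sorted(new.keys() - old.keys()):
        ops.append(("add", name, new[name]))
    for name in sorted(old.keys() - new.keys()):
        ops.append(("delete", name, old[name]))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            ops.append(("replace", name, new[name]))
    return ops


# 假设集群中有 10000 个 Service，每个只有一个端点
old_state = {f"ns/svc-{i}": ("10.0.0.1:8080",) for i in range(10000)}

new_state = dict(old_state)
new_state["ns/svc-42"] = ("10.0.0.1:8080", "10.0.0.2:8080")  # 端点变化
del new_state["ns/svc-7"]                                     # Service 被删除
new_state["ns/svc-new"] = ("10.0.0.3:8080",)                  # 新增 Service

ops = diff_services(old_state, new_state)
print(len(ops))
for op in ops:
    print(op)
```

尽管总共有约 10000 个 Service，这里需要下发的操作只有 3 个。
这个草图关注的是发送给内核的更新的大小，而不是计算差异本身的开销。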

（不幸的是，这部分我没有酷炫的图表。）

<!--
## Why _not_ nftables? {#why-not-nftables}

All that said, there are a few reasons why you might not want to jump
right into using the nftables backend for now.

First, the code is still fairly new. While it has plenty of unit
tests, performs correctly in our CI system, and has now been used in
the real world by multiple users, it has not seen anything close to as
much real-world usage as the iptables backend has, so we can't promise
that it is as stable and bug-free.
-->
## 不选择 nftables 的理由有哪些？  {#why-not-nftables}

尽管如此，仍有几个原因可能让你目前不希望立即使用 nftables 后端。

首先，该代码仍然相对较新。虽然它拥有大量的单元测试，在我们的 CI 系统中表现正确，
并且已经在现实世界中被多个用户使用，但其实际使用量远远不及 iptables 后端，
因此我们无法保证它同样稳定且无缺陷。

<!--
Second, the nftables mode will not work on older Linux distributions;
currently it requires a 5.13 or newer kernel. Additionally, because of
bugs in early versions of the `nft` command line tool, you should not
run kube-proxy in nftables mode on nodes that have an old (earlier
than 1.0.0) version of `nft` in the host filesystem (or else
kube-proxy's use of nftables may interfere with other uses of nftables
on the system).
-->
其次，nftables 模式无法在较旧的 Linux 发行版上工作；目前它需要 5.13 或更高版本的内核。
此外，由于早期版本的 `nft` 命令行工具存在缺陷，如果节点的主机文件系统中装有旧版本（早于 1.0.0）的
`nft`，则不应在该节点上以 nftables 模式运行 kube-proxy（否则 kube-proxy
对 nftables 的使用可能会干扰系统上其他程序对 nftables 的使用）。
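上述两项版本要求可以用类似下面的 Python 草图来检查。这只是一个示意：
函数名 `parse_version`、`kernel_ok`、`nft_ok` 是假设的，
并且假设内核版本字符串形如 `5.15.0-91-generic`、`nft --version` 的输出形如
`nftables v1.0.9 (...)`；示例中的版本字符串仅用于演示。

```python
# 示意脚本：检查版本字符串是否满足文中提到的要求。
# 假设：内核版本字符串形如 "5.15.0-91-generic"，
# `nft --version` 的输出形如 "nftables v1.0.9 (...)"。
import re

MIN_KERNEL = (5, 13)   # 文中要求：内核 5.13 或更高
MIN_NFT = (1, 0, 0)    # 文中要求：nft 不早于 1.0.0


def parse_version(text, parts):
    """从文本中提取前 parts 段数字版本号，失败时返回 None。"""
    pattern = r"(\d+)" + r"\.(\d+)" * (parts - 1)
    match = re.search(pattern, text)
    if match is None:
        return None
    return tuple(int(g) for g in match.groups())


def kernel_ok(release):
    """判断内核版本字符串是否不低于 MIN_KERNEL。"""
    version = parse_version(release, 2)
    return version is not None and version >= MIN_KERNEL


def nft_ok(version_output):
    """判断 nft 版本输出是否不低于 MIN_NFT。"""
    version = parse_version(version_output, 3)
    return version is not None and version >= MIN_NFT


# 以下版本字符串均为演示用的样例
print(kernel_ok("5.15.0-91-generic"))
print(kernel_ok("4.18.0-553.el8"))
print(nft_ok("nftables v1.0.9 (Old Doc Yak #3)"))
print(nft_ok("nftables v0.9.3 (Topsy)"))
```

在真实节点上，可以把 `platform.release()` 的返回值传给 `kernel_ok`，
并把在主机上运行 `nft --version` 得到的输出传给 `nft_ok`。
这种检查只比较版本号，无法识别发行版向旧内核回移植特性的情况。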

<!--
Third, you may have other networking components in your cluster, such
as the pod network or NetworkPolicy implementation, that do not yet
support kube-proxy in nftables mode. You should consult the
documentation (or forums, bug tracker, etc.) for any such components
to see if they have problems with nftables mode. (In many cases they
will not; as long as they don't try to directly interact with or
override kube-proxy's iptables rules, they shouldn't care whether
kube-proxy is using iptables or nftables.) Additionally, observability
and monitoring tools that have not been updated may report less data
for kube-proxy in nftables mode than they do for kube-proxy in
iptables mode.
-->
第三，你的集群中可能还存在其他网络组件，例如 Pod 网络或 NetworkPolicy 实现，
这些组件可能尚不支持以 nftables 模式运行的 kube-proxy。你应查阅相关组件的文档（或论坛、问题跟踪系统等），
以确认它们是否与 nftables 模式存在兼容性问题。（在许多情况下，它们并不会受到影响；
只要它们不尝试直接操作或覆盖 kube-proxy 的 iptables 规则，就不必关心 kube-proxy
使用的是 iptables 还是 nftables。）
此外，尚未更新的可观测性和监控工具为 nftables 模式的 kube-proxy 报告的数据，
可能少于为 iptables 模式的 kube-proxy 报告的数据。

<!--
Finally, kube-proxy in nftables mode is intentionally not 100%
compatible with kube-proxy in iptables mode. There are a few old
kube-proxy features whose default behaviors are less secure, less
performant, or less intuitive than we'd like, but where we felt that
changing the default would be a compatibility break. Since the
nftables mode is opt-in, this gave us a chance to fix those bad
defaults without breaking users who weren't expecting changes. (In
particular, with nftables mode, NodePort Services are now only
reachable on their nodes' default IPs, as opposed to being reachable
on all IPs, including `127.0.0.1`, with iptables mode.) The
[kube-proxy documentation] has more information about this, including
information about metrics you can look at to determine if you are
relying on any of the changed functionality, and what configuration
options are available to get more backward-compatible behavior.

[kube-proxy documentation]: https://kubernetes.io/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables
-->
最后，以 nftables 模式运行的 kube-proxy 有意不与以 iptables 模式运行的 kube-proxy 完全兼容。
有一些较旧的 kube-proxy 功能，默认行为不如我们期望的那样安全、高效或直观，但我们认为更改默认行为会导致兼容性问题。
由于 nftables 模式需要用户主动选择启用，这为我们提供了一个机会，在不影响那些未预期行为变化的用户的情况下修复这些不良默认设置。
（特别是，在 nftables 模式下，NodePort 类型的 Service 现在仅在其节点的默认 IP 上可访问，而在 iptables 模式下，
它们在所有 IP 上均可访问，包括 `127.0.0.1`。）[kube-proxy 文档] 提供了更多关于此方面的信息，
包括如何通过查看某些指标来判断你是否依赖于任何已更改的特性，以及有哪些配置选项可用于实现更向后兼容的行为。

[kube-proxy 文档]: https://kubernetes.io/zh-cn/docs/reference/networking/virtual-ips/#migrating-from-iptables-mode-to-nftables

<!--
## Trying out nftables mode

Ready to try it out? In Kubernetes 1.31 and later, you just need to
pass `--proxy-mode nftables` to kube-proxy (or set `mode: nftables` in
your kube-proxy config file).

If you are using kubeadm to set up your cluster, the kubeadm
documentation explains [how to pass a `KubeProxyConfiguration` to
`kubeadm init`]. You can also [deploy nftables-based clusters with
`kind`].
-->
## 尝试使用 nftables 模式

准备尝试了吗？在 Kubernetes 1.31 及更高版本中，你只需将 `--proxy-mode nftables`
参数传递给 kube-proxy（或在 kube-proxy 配置文件中设置 `mode: nftables`）。

如果你使用 kubeadm 部署集群，kubeadm 文档解释了[如何向 `kubeadm init` 传递 `KubeProxyConfiguration`]。
你还可以[通过 `kind` 部署基于 nftables 的集群]。
  

<!--
You can also convert existing clusters from iptables (or ipvs) mode to
nftables by updating the kube-proxy configuration and restarting the
kube-proxy pods. (You do not need to reboot the nodes: when restarting
in nftables mode, kube-proxy will delete any existing iptables or ipvs
rules, and likewise, if you later revert back to iptables or ipvs
mode, it will delete any existing nftables rules.)

[how to pass a `KubeProxyConfiguration` to `kubeadm init`]: /docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[deploy nftables-based clusters with `kind`]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode
-->
你还可以通过更新 kube-proxy 配置并重启 kube-proxy Pod，将现有集群从
iptables（或 ipvs）模式转换为 nftables 模式。（无需重启节点：
在以 nftables 模式重新启动时，kube-proxy 会删除现有的所有 iptables 或 ipvs 规则；
同样，如果你之后切换回 iptables 或 ipvs 模式，它将删除现有的所有 nftables 规则。）

[如何向 `kubeadm init` 传递 `KubeProxyConfiguration`]: /zh-cn/docs/setup/production-environment/tools/kubeadm/control-plane-flags/#customizing-kube-proxy
[通过 `kind` 部署基于 nftables 的集群]: https://kind.sigs.k8s.io/docs/user/configuration/#kube-proxy-mode

<!--
## Future plans

As mentioned above, while nftables is now the _best_ kube-proxy mode,
it is not the _default_, and we do not yet have a plan for changing
that. We will continue to support the iptables mode for a long time.

The future of the IPVS mode of kube-proxy is less certain: its main
advantage over iptables was that it was faster, but certain aspects of
the IPVS architecture and APIs were awkward for kube-proxy's purposes
(for example, the fact that the `kube-ipvs0` device needs to have
_every_ Service IP address assigned to it), and some parts of
Kubernetes Service proxying semantics were difficult to implement
using IPVS (particularly the fact that some Services had to have
different endpoints depending on whether you connected to them from a
local or remote client). And now, the nftables mode has the same
performance as IPVS mode (actually, slightly better), without any of
the downsides:
-->
## 未来计划

如上所述，虽然 nftables 现在是 kube-proxy 的**最佳**模式，但它还不是**默认**模式，
我们目前还没有更改这一设置的计划。我们将继续长期支持 iptables 模式。

kube-proxy 的 IPVS 模式的未来则不太确定：它相对于 iptables 的主要优势在于速度更快，
但 IPVS 的架构和 API 在某些方面对 kube-proxy 来说不够理想（例如，`kube-ipvs0`
设备需要被分配所有 Service IP 地址），
并且 Kubernetes Service 代理的部分语义使用 IPVS 难以实现（特别是某些
Service 根据连接的客户端是本地还是远程，需要有不同的端点）。
现在，nftables 模式的性能与 IPVS 模式相当（实际上略胜一筹），而且没有上述任何缺点：

<!--
{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy ipvs-vs-nftables first packet latency, at various percentiles, in clusters of various sizes" >}}

(In theory the IPVS mode also has the advantage of being able to use
various other IPVS functionality, like alternative "schedulers" for
balancing endpoints. In practice, this ended up not being very useful,
because kube-proxy runs independently on every node, and the IPVS
schedulers on each node had no way of sharing their state with the
proxies on other nodes, thus thwarting the effort to balance traffic
more cleverly.)
-->
{{< figure src="ipvs-vs-nftables.svg" alt="kube-proxy IPVS 与 nftables 在不同规模集群中各百分位数下的第一个数据包延迟对比" >}}

（理论上，IPVS 模式还具有可以使用其他 IPVS 功能的优势，例如使用替代的"调度器"来平衡端点。
但实际上，这并不太有用，因为 kube-proxy 在每个节点上独立运行，每个节点上的 IPVS
调度器无法与其他节点上的代理共享状态，从而无法实现更智能的流量均衡。）

<!--
While the Kubernetes project does not have an immediate plan to drop
the IPVS backend, it is probably doomed in the long run, and people
who are currently using IPVS mode should try out the nftables mode
instead (and file bugs if you think there is missing functionality in
nftables mode that you can't work around).
-->
虽然 Kubernetes 项目目前没有立即放弃 IPVS 后端的计划，但从长远来看，IPVS 可能难逃被淘汰的命运。
目前使用 IPVS 模式的用户应尝试使用 nftables 模式（如果发现 nftables 模式中缺少某些无法绕过的功能，
请提交问题报告）。

<!--
## Learn more

- "[KEP-3866: Add an nftables-based kube-proxy backend]" has the
  history of the new feature.

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]",
  from KubeCon/CloudNativeCon North America 2024, talks about porting
  kube-proxy and Calico from iptables to nftables.

- "[From Observability to Performance]", from KubeCon/CloudNativeCon
  North America 2024. (This is where the kube-proxy latency data came
  from; the [raw data for the charts] is also available.)
-->
## 进一步了解

- "[KEP-3866: Add an nftables-based kube-proxy backend]" 记录了此新特性的历史。

- "[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]"，来自 2024 年
  KubeCon/CloudNativeCon 北美大会，讨论了将 kube-proxy 和 Calico 从 iptables 迁移到 nftables 的过程。

- "[From Observability to Performance]"，同样来自 2024 年 KubeCon/CloudNativeCon 北美大会。
  （kube-proxy 延迟数据来源于此；[图表的原始数据][raw data for the charts]也可获取。）

[KEP-3866: Add an nftables-based kube-proxy backend]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md
[How the Tables Have Turned: Kubernetes Says Goodbye to IPTables]: https://youtu.be/yOGHb2HjslY?si=6O4PVJu7fGpReo1U
[From Observability to Performance]: https://youtu.be/uYo2O3jbJLk?si=py2AXzMJZ4PuhxNg
[raw data for the charts]: https://docs.google.com/spreadsheets/d/1-ryDNc6gZocnMHEXC7mNtqknKSOv5uhXFKDx8Hu3AYA/edit