---
title: 'kube-proxy Subtleties: Debugging an Intermittent Connection Reset'
date: 2019-03-29
---
**Author:** [Yongkun Gui](mailto:ygui@google.com), Google

I recently came across a bug that causes intermittent connection resets. After
some digging, I found it was caused by a subtle combination of several different
network subsystems. It helped me understand Kubernetes networking better, and I
think it's worthwhile to share with a wider audience interested in the same
topic.
## The symptom
We received a user report claiming they were getting connection resets while using a
Kubernetes service of type ClusterIP to serve large files to pods running in the
same cluster. Initial debugging of the cluster did not yield anything
interesting: network connectivity was fine and downloading the files did not hit
any issues. However, when we ran the workload in parallel across many clients,
we were able to reproduce the problem. Adding to the mystery was the fact that
the problem could not be reproduced when the workload was run using VMs without
Kubernetes. The problem, which could be easily reproduced by [a simple
app](https://github.com/tcarmet/k8s-connection-reset), clearly has something to
do with Kubernetes networking, but what?
## Kubernetes networking basics
Before digging into this problem, let's talk a little bit about some basics of
Kubernetes networking, as Kubernetes handles the network traffic from a pod very
differently depending on the destination.
### Pod-to-Pod
In Kubernetes, every pod has its own IP address. The benefit is that the
applications running inside pods can use their canonical ports, instead of
remapping to a different random port. Pods have L3 connectivity between each
other. They can ping each other, and send TCP or UDP packets to each other.
[CNI](https://github.com/containernetworking/cni) is the standard that solves
this problem for containers running on different hosts, and there are tons of
different plugins that support CNI.
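
For example, assuming two pods named `pod-a` and `pod-b` (hypothetical names in
the default namespace, with `ping` available in the image), you can verify this
flat L3 connectivity directly:

```bash
# Look up pod-b's IP and ping it from inside pod-a.
POD_B_IP=$(kubectl get pod pod-b -o jsonpath='{.status.podIP}')
kubectl exec pod-a -- ping -c 3 "$POD_B_IP"
```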
### Pod-to-external
For the traffic that goes from a pod to an external address, Kubernetes simply uses
[SNAT](https://en.wikipedia.org/wiki/Network_address_translation). What it does
is replace the pod's internal source IP:port with the host's IP:port. When
the return packet comes back to the host, it rewrites the pod's IP:port as the
destination and sends it back to the original pod. The whole process is transparent
to the original pod, which doesn't know about the address translation at all.
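
As a rough sketch (not the exact rule kube-proxy or your CNI plugin programs),
an SNAT rule for an assumed pod CIDR of 10.0.0.0/16 could look like this:

```bash
# Sketch only: masquerade traffic leaving the pod CIDR for destinations outside
# of it, so the source becomes the node's own IP on the outgoing interface.
# The 10.0.0.0/16 pod CIDR is an assumption.
iptables -t nat -A POSTROUTING -s 10.0.0.0/16 ! -d 10.0.0.0/16 -j MASQUERADE
```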
### Pod-to-Service
Pods are mortal. Most likely, people want reliable service. Otherwise, it's
pretty much useless. So Kubernetes has this concept called "service", which is
simply an L4 load balancer in front of pods. There are several different types of
services. The most basic type is called ClusterIP. A service of this type has a
unique VIP address that is only routable inside the cluster.

The component in Kubernetes that implements this feature is called kube-proxy.
It sits on every node, and programs complicated iptables rules to do all kinds
of filtering and NAT between pods and services. If you go to a Kubernetes node
and type `iptables-save`, you'll see the rules that are inserted by Kubernetes
or other programs. The most important chains are `KUBE-SERVICES`, `KUBE-SVC-*`
and `KUBE-SEP-*`.
- `KUBE-SERVICES` is the entry point for service packets. What it does is
match the destination IP:port and dispatch the packet to the corresponding
`KUBE-SVC-*` chain.
- `KUBE-SVC-*` acts as a load balancer, and distributes packets across its
`KUBE-SEP-*` chains equally. Every `KUBE-SVC-*` chain has the same number of
`KUBE-SEP-*` chains as the number of endpoints behind it.
- `KUBE-SEP-*` represents a Service EndPoint. It simply does DNAT,
replacing the service IP:port with the pod's endpoint IP:port (a trimmed,
illustrative excerpt of these rules follows this list).
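
Here is what those chains might look like for the ClusterIP service used in the
diagrams below (192.168.0.2:80 with a single endpoint at 10.0.1.2:80). This is a
simplified sketch: the real chain names end in a hash, kube-proxy adds comments
and more match extensions, and with multiple endpoints the `KUBE-SVC-*` chain
uses the `statistic` match to pick one at random.

```bash
# Simplified sketch of kube-proxy's NAT rules; chain suffixes are placeholders.
#
#   -A KUBE-SERVICES -d 192.168.0.2/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXXXXXXXXXXXXXX
#   -A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-AAAAAAAAAAAAAAAA
#   -A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m tcp -j DNAT --to-destination 10.0.1.2:80
#
# To inspect the real rules on a node:
iptables-save -t nat | grep -E 'KUBE-(SERVICES|SVC|SEP)'
```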
For DNAT, conntrack kicks in and tracks the connection state using a state
machine. The state is needed because it needs to remember the destination
address it changed to, and change it back when the return packet comes back.
Iptables can also rely on the conntrack state (ctstate) to decide the fate
of a packet. These four conntrack states are especially important:

- *NEW*: conntrack knows nothing about this packet, which happens when the SYN
packet is received.
- *ESTABLISHED*: conntrack knows the packet belongs to an established connection,
which happens after the handshake is complete.
- *RELATED*: The packet doesn't belong to any connection, but it is affiliated
with another connection, which is especially useful for protocols like FTP.
- *INVALID*: Something is wrong with the packet, and conntrack doesn't know how
to deal with it. This state plays a central role in this Kubernetes issue.
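
If you want to watch conntrack at work on a node, the `conntrack` tool from the
conntrack-tools package (assuming it is installed) can list the tracked
connections and show counters, including how many packets were considered
invalid:

```bash
# Requires conntrack-tools on the node.
conntrack -L -p tcp | head   # list tracked TCP connections
conntrack -S                 # per-CPU counters; watch the "invalid" column
```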
Here is a diagram of how a TCP connection works between a pod and a service. The
sequence of events is:

- The client pod on the left-hand side sends a packet to a
service: 192.168.0.2:80
- The packet goes through the iptables rules on the client
node and the destination is changed to the pod IP, 10.0.1.2:80
- The server pod handles the packet and sends back a packet with destination 10.0.0.2
- The packet goes back to the client node, conntrack recognizes the packet and rewrites the source
address back to 192.168.0.2:80
- The client pod receives the response packet
{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/good-packet-flow.png"
caption="Good packet flow">}}
## What caused the connection reset?
Enough of the background, so what really went wrong and caused the unexpected
connection reset?
As the diagram below shows, the problem is packet 3. When conntrack cannot
recognize a returning packet, it marks it as *INVALID*. The most common
reasons include: conntrack cannot keep track of a connection because it is out
of capacity, the packet itself falls outside of the TCP window, etc. For those packets
that have been marked as *INVALID* by conntrack, we don't have an
iptables rule to drop them, so they are forwarded to the client pod, with the source IP
address not rewritten (as shown in packet 4)! The client pod doesn't recognize this
packet because it has a different source IP, which is the pod IP, not the service IP. As
a result, the client pod says, "Wait a second, I don't recall this connection to
this IP ever existed, why does this dude keep sending this packet to me?" Basically,
what the client does is simply send a RST packet to the server pod IP, which
is packet 5. Unfortunately, this is a totally legit pod-to-pod packet, which can
be delivered to the server pod. The server pod doesn't know all the address translations
that happened on the client side. From its view, packet 5 is a totally legit
packet, like packets 2 and 3. All the server pod knows is, "Well, the client pod doesn't
want to talk to me, so let's close the connection!" Boom! Of course, in order
for all of this to happen, the RST packet has to be legit too, with the right TCP
sequence number, etc. But when it happens, both parties agree to close the
connection.
{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/connection-reset-packet-flow.png"
caption="Connection reset packet flow">}}
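
If you want to see this happening, a packet capture inside the client pod's
network namespace will show it: healthy responses arrive with the service IP as
the source, while the problematic packets arrive directly from the server pod's
IP (10.0.1.2 in the diagram) and get answered with a RST. A rough capture filter
(the interface name is an assumption) could be:

```bash
# Capture RSTs plus any packets arriving straight from the server pod's IP.
tcpdump -ni eth0 '(tcp[tcpflags] & tcp-rst) != 0 or src host 10.0.1.2'
```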
## How to address it?
Once we understand the root cause, the fix is not hard. There are at least 2
ways to address it:

- Make conntrack more liberal on packets, and don't mark the packets as
*INVALID*. In Linux, you can do this by `echo 1 >
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal`.
- Specifically add an iptables rule to drop the packets that are marked as
*INVALID*, so they won't reach the client pod and cause harm (see the sketch
after this list).
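
Expressed as a plain iptables command, the second option is roughly the rule
below. The actual kube-proxy fix wires an equivalent rule into its own
forwarding chain, so treat this as a sketch rather than the exact rule it
installs:

```bash
# Drop forwarded packets that conntrack has marked INVALID, before they can
# reach a pod and trigger a RST. Placement in FORWARD is illustrative.
iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
```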
The [fix](https://github.com/kubernetes/kubernetes/pull/74840) is available in v1.15+.
However, for users that are affected by this bug, there is a way to mitigate the
problem by applying the following DaemonSet, which sets the conntrack sysctl on
every node in your cluster.
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: startup-script
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
            echo done
```
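
To roll out the mitigation, save the manifest and apply it with `kubectl`; the
file name below is just an example. Note that on newer kernels the same sysctl
may be exposed as `/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal`.

```bash
# File name is illustrative.
kubectl apply -f startup-script-daemonset.yaml

# Spot-check on a node afterwards; this should print 1.
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
```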
## Summary
Obviously, the bug has existed almost forever. I am surprised that it
hasn't been noticed until recently. I believe the reasons could be: (1) this
happens more often on a congested server serving large payloads, which might not be a
common use case; (2) the application layer handles the retry and is tolerant of
this kind of reset. Anyway, regardless of how fast Kubernetes has been growing,
it's still a young project. There is no secret other than listening closely to
customers' feedback, not taking anything for granted, and digging deep; that is how
we can make it the best platform to run applications.
Special thanks to [bowei](https://github.com/bowei) for consulting on both the
debugging process and this blog post, and to [tcarmet](https://github.com/tcarmet) for
reporting the issue and providing a reproduction.