---
title: 'kube-proxy Subtleties: Debugging an Intermittent Connection Reset'
date: 2019-03-29
---

**Author:** [Yongkun Gui](ygui@google.com), Google

I recently came across a bug that causes intermittent connection resets. After
some digging, I found it was caused by a subtle combination of several different
network subsystems. It helped me understand Kubernetes networking better, and I
think it’s worth sharing with a wider audience interested in the same topic.

## The symptom

We received a user report claiming they were getting connection resets while using a
Kubernetes service of type ClusterIP to serve large files to pods running in the
same cluster. Initial debugging of the cluster did not yield anything
interesting: network connectivity was fine and downloading the files did not hit
any issues. However, when we ran the workload in parallel across many clients,
we were able to reproduce the problem. Adding to the mystery was the fact that
the problem could not be reproduced when the workload was run using VMs without
Kubernetes. The problem, which could be easily reproduced by [a simple
app](https://github.com/tcarmet/k8s-connection-reset), clearly has something to
do with Kubernetes networking, but what?

## Kubernetes networking basics

Before digging into this problem, let’s talk a little bit about some basics of
Kubernetes networking, as Kubernetes handles network traffic from a pod very
differently depending on the destination.

### Pod-to-Pod

In Kubernetes, every pod has its own IP address. The benefit is that the
applications running inside pods can use their canonical ports, instead of
remapping to different random ports. Pods have L3 connectivity to each
other: they can ping each other and send TCP or UDP packets to each other.
[CNI](https://github.com/containernetworking/cni) is the standard that solves
this problem for containers running on different hosts, and there are tons of
different plugins that support it.
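
A quick way to see this connectivity in action is to reach one pod directly from another by pod IP. The pod name and IP below are hypothetical placeholders, and the commands assume the container image ships `ping` and `curl`:

```bash
# From one pod, reach another pod directly by its pod IP; no port remapping
# happens in between.
kubectl exec client-pod -- ping -c 3 10.0.1.2
kubectl exec client-pod -- curl -s http://10.0.1.2:80/
```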

### Pod-to-external

For traffic that goes from a pod to an external address, Kubernetes simply uses
[SNAT](https://en.wikipedia.org/wiki/Network_address_translation): it replaces
the pod’s internal source IP:port with the host’s IP:port. When the return
packet comes back to the host, the host rewrites the destination back to the
pod’s IP:port and sends it on to the original pod. The whole process is
transparent to the original pod, which is unaware that any address translation
happened.
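
On a node this typically shows up as a masquerade rule in the nat table. The exact rule depends on your CNI plugin and cluster configuration; the following is only an illustrative sketch that assumes a pod CIDR of 10.0.0.0/16:

```bash
# Illustrative only: masquerade (SNAT) traffic leaving the pod network for
# external destinations, so replies return to the node and can be un-NATed.
iptables -t nat -A POSTROUTING -s 10.0.0.0/16 ! -d 10.0.0.0/16 -j MASQUERADE
```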

### Pod-to-Service

Pods are mortal, but most likely people want a reliable service; otherwise it’s
pretty much useless. So Kubernetes has a concept called "service", which is
simply an L4 load balancer in front of pods. There are several different types of
services. The most basic type is called ClusterIP. This type of service has a
unique VIP address that is only routable inside the cluster.

The component in Kubernetes that implements this feature is called kube-proxy.
It sits on every node, and programs complicated iptables rules to do all kinds
of filtering and NAT between pods and services. If you go to a Kubernetes node
and type `iptables-save`, you’ll see the rules that are inserted by Kubernetes
or other programs. The most important chains are `KUBE-SERVICES`, `KUBE-SVC-*`
and `KUBE-SEP-*` (an illustrative excerpt follows the list below).

- `KUBE-SERVICES` is the entry point for service packets. It matches the
  destination IP:port and dispatches the packet to the corresponding
  `KUBE-SVC-*` chain.
- A `KUBE-SVC-*` chain acts as a load balancer, and distributes packets equally
  across its `KUBE-SEP-*` chains. Every `KUBE-SVC-*` chain has the same number
  of `KUBE-SEP-*` chains as there are endpoints behind it.
- A `KUBE-SEP-*` chain represents a Service EndPoint. It simply does DNAT,
  replacing the service IP:port with the pod's endpoint IP:port.
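
Here is a trimmed sketch of what these chains can look like in `iptables-save` output, using the service and pod IPs from the diagrams below. The chain-name hash suffixes are placeholders, the second endpoint is made up, and real rules carry additional comment matches:

```
-A KUBE-SERVICES -d 192.168.0.2/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m tcp -j DNAT --to-destination 10.0.1.2:80
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -m tcp -j DNAT --to-destination 10.0.1.3:80
```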

For DNAT, conntrack kicks in and tracks the connection state using a state
machine. The state is needed because conntrack has to remember the destination
address it changed the packet to, so that it can change it back when the return
packet comes in. Iptables can also rely on the conntrack state (ctstate) to
decide the fate of a packet; an example of inspecting and matching on these
states follows the list below. These four conntrack states are especially
important:

- *NEW*: conntrack knows nothing about this packet, which happens when the SYN
  packet is received.
- *ESTABLISHED*: conntrack knows the packet belongs to an established connection,
  which happens after the handshake is complete.
- *RELATED*: the packet doesn’t belong to any connection, but it is affiliated
  with another connection, which is especially useful for protocols like FTP.
- *INVALID*: something is wrong with the packet, and conntrack doesn’t know how
  to deal with it. This state plays a central role in this Kubernetes issue.
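
On a node, the `conntrack` command-line tool (where installed) lists what is currently being tracked, and iptables rules can make decisions based on these states through the conntrack match. Both commands below are only illustrative and reuse the backend pod IP from the diagrams:

```bash
# List tracked connections whose original destination is the backend pod.
conntrack -L -d 10.0.1.2

# Example of making a filtering decision based on ctstate.
iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
```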

Here is a diagram of how a TCP connection works between a pod and a service. The
sequence of events is:

- The client pod on the left-hand side sends a packet to a
  service: 192.168.0.2:80
- The packet goes through the iptables rules on the client
  node, and the destination is changed to the pod IP, 10.0.1.2:80
- The server pod handles the packet and sends back a packet with destination 10.0.0.2
- The packet goes back to the client node; conntrack recognizes the packet and rewrites the source
  address back to 192.168.0.2:80
- The client pod receives the response packet

{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/good-packet-flow.png"
caption="Good packet flow">}}

## What caused the connection reset?

Enough of the background; so what really went wrong and caused the unexpected
connection reset?

As the diagram below shows, the problem is packet 3: it is a returning packet
that conntrack cannot recognize, so conntrack marks it as *INVALID*. The most
common reasons are that conntrack cannot keep track of a connection because it
is out of capacity, or that the packet itself falls outside the TCP window. For
packets that have been marked *INVALID* by conntrack, there is no iptables rule
to drop them, so they are forwarded to the client pod with the source IP
address not rewritten (as shown in packet 4)! The client pod doesn’t recognize this
packet because it has a different source IP, which is the pod IP, not the service IP. As
a result, the client pod says, "Wait a second, I don't recall a connection to
this IP ever existing, why does this dude keep sending packets to me?" Basically,
what the client does is simply send a RST packet to the server pod IP, which
is packet 5. Unfortunately, this is a totally legit pod-to-pod packet, which can
be delivered to the server pod. The server pod doesn’t know about all the address
translations that happened on the client side. From its view, packet 5 is a
totally legit packet, like packets 2 and 3. All the server pod knows is, "Well, the
client pod doesn’t want to talk to me, so let’s close the connection!" Boom! Of
course, for all of this to happen, the RST packet has to be legit too, with the
right TCP sequence number, etc. But when it happens, both parties agree to close
the connection.

{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/connection-reset-packet-flow.png"
caption="Connection reset packet flow">}}

## How to address it?

Once we understand the root cause, the fix is not hard. There are at least two
ways to address it:

- Make conntrack more liberal about such packets, and don’t mark them as
  *INVALID*. In Linux, you can do this by `echo 1 >
  /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal`.
- Specifically add an iptables rule to drop the packets that are marked as
  *INVALID*, so they never reach the client pod and cause harm (a hand-rolled
  version is sketched below).
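
The second option can be approximated by hand with a single rule. This is only an illustrative sketch; the rule that kube-proxy ends up programming may be scoped and placed differently:

```bash
# Drop forwarded packets that conntrack has marked INVALID, so they never reach
# the client pod with an un-rewritten source address.
iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
```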

The fix is drafted (https://github.com/kubernetes/kubernetes/pull/74840), but
unfortunately it didn’t make the v1.14 release window. However, for users
affected by this bug, there is a way to mitigate the problem by deploying the
following DaemonSet in your cluster, which sets the liberal conntrack flag (the
first option above) on every node.

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: startup-script
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
            echo done
```
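
To roll it out, save the manifest and apply it; the filename below is just a placeholder. You can then verify the flag on a node:

```bash
# From your workstation: create the DaemonSet.
kubectl apply -f startup-script-daemonset.yaml

# On any node: should print 1 once the pod has run there.
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
```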

## Summary

Obviously, this bug has existed almost forever. I am surprised that it
wasn’t noticed until recently. I believe the reasons could be: (1) this
happens more on a congested server serving large payloads, which might not be a
common use case; (2) the application layer handles retries and is therefore
tolerant of this kind of reset. Anyway, regardless of how fast Kubernetes has been
growing, it’s still a young project. There is no secret other than listening
closely to customers’ feedback, not taking anything for granted, and digging
deep; that is how we can make Kubernetes the best platform to run applications.

Special thanks to [bowei](https://github.com/bowei) for consulting on both the
debugging process and the blog post, and to [tcarmet](https://github.com/tcarmet)
for reporting the issue and providing a reproduction.