---
title: 'kube-proxy Subtleties: Debugging an Intermittent Connection Reset'
date: 2019-03-29
---

**Author:** [Yongkun Gui](ygui@google.com), Google

I recently came across a bug that causes intermittent connection resets. After
some digging, I found it was caused by a subtle combination of several different
network subsystems. It helped me understand Kubernetes networking better, and I
think it’s worth sharing with a wider audience interested in the same topic.

## The symptom

We received a user report claiming they were getting connection resets while using a
Kubernetes service of type ClusterIP to serve large files to pods running in the
same cluster. Initial debugging of the cluster did not yield anything
interesting: network connectivity was fine and downloading the files did not hit
any issues. However, when we ran the workload in parallel across many clients,
we were able to reproduce the problem. Adding to the mystery was the fact that
the problem could not be reproduced when the workload was run using VMs without
Kubernetes. The problem, which could be easily reproduced by [a simple
app](https://github.com/tcarmet/k8s-connection-reset), clearly has something to
do with Kubernetes networking, but what?

## Kubernetes networking basics

Before digging into this problem, let’s talk a little bit about some basics of
Kubernetes networking, as Kubernetes handles network traffic from a pod very
differently depending on the destination.

### Pod-to-Pod

In Kubernetes, every pod has its own IP address. The benefit is that the
applications running inside pods can use their canonical ports, instead of
remapping to different random ports. Pods have L3 connectivity to each
other: they can ping each other and send TCP or UDP packets to each other.
[CNI](https://github.com/containernetworking/cni) is the standard that solves
this problem for containers running on different hosts, and there are tons of
different plugins that support it.
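
A quick way to see this connectivity in action is to reach one pod directly from another by pod IP. The pod name and IP below are hypothetical placeholders, and the commands assume the container image ships `ping` and `curl`:

```bash
# From one pod, reach another pod directly by its pod IP; no port remapping
# happens in between.
kubectl exec client-pod -- ping -c 3 10.0.1.2
kubectl exec client-pod -- curl -s http://10.0.1.2:80/
```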

### Pod-to-external

For traffic that goes from a pod to an external address, Kubernetes simply uses
[SNAT](https://en.wikipedia.org/wiki/Network_address_translation): it replaces
the pod’s internal source IP:port with the host’s IP:port. When the return
packet comes back to the host, the host rewrites the destination back to the
pod’s IP:port and sends it on to the original pod. The whole process is
transparent to the original pod, which is unaware that any address translation
happened.
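
On a node this typically shows up as a masquerade rule in the nat table. The exact rule depends on your CNI plugin and cluster configuration; the following is only an illustrative sketch that assumes a pod CIDR of 10.0.0.0/16:

```bash
# Illustrative only: masquerade (SNAT) traffic leaving the pod network for
# external destinations, so replies return to the node and can be un-NATed.
iptables -t nat -A POSTROUTING -s 10.0.0.0/16 ! -d 10.0.0.0/16 -j MASQUERADE
```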

### Pod-to-Service

Pods are mortal, but most likely people want a reliable service; otherwise it’s
pretty much useless. So Kubernetes has a concept called "service", which is
simply an L4 load balancer in front of pods. There are several different types of
services. The most basic type is called ClusterIP. This type of service has a
unique VIP address that is only routable inside the cluster.

The component in Kubernetes that implements this feature is called kube-proxy.
It sits on every node, and programs complicated iptables rules to do all kinds
of filtering and NAT between pods and services. If you go to a Kubernetes node
and type `iptables-save`, you’ll see the rules that are inserted by Kubernetes
or other programs. The most important chains are `KUBE-SERVICES`, `KUBE-SVC-*`
and `KUBE-SEP-*` (an illustrative excerpt follows the list below).

- `KUBE-SERVICES` is the entry point for service packets. It matches the
  destination IP:port and dispatches the packet to the corresponding
  `KUBE-SVC-*` chain.
- A `KUBE-SVC-*` chain acts as a load balancer, and distributes packets equally
  across its `KUBE-SEP-*` chains. Every `KUBE-SVC-*` chain has the same number
  of `KUBE-SEP-*` chains as there are endpoints behind it.
- A `KUBE-SEP-*` chain represents a Service EndPoint. It simply does DNAT,
  replacing the service IP:port with the pod's endpoint IP:port.
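
Here is a trimmed sketch of what these chains can look like in `iptables-save` output, using the service and pod IPs from the diagrams below. The chain-name hash suffixes are placeholders, the second endpoint is made up, and real rules carry additional comment matches:

```
-A KUBE-SERVICES -d 192.168.0.2/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SVC-XXXXXXXXXXXXXXXX -j KUBE-SEP-BBBBBBBBBBBBBBBB
-A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m tcp -j DNAT --to-destination 10.0.1.2:80
-A KUBE-SEP-BBBBBBBBBBBBBBBB -p tcp -m tcp -j DNAT --to-destination 10.0.1.3:80
```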

For DNAT, conntrack kicks in and tracks the connection state using a state
machine. The state is needed because conntrack has to remember the destination
address it changed the packet to, so that it can change it back when the return
packet comes in. Iptables can also rely on the conntrack state (ctstate) to
decide the fate of a packet; an example of inspecting and matching on these
states follows the list below. These four conntrack states are especially
important:

- *NEW*: conntrack knows nothing about this packet, which happens when the SYN
  packet is received.
- *ESTABLISHED*: conntrack knows the packet belongs to an established connection,
  which happens after the handshake is complete.
- *RELATED*: the packet doesn’t belong to any connection, but it is affiliated
  with another connection, which is especially useful for protocols like FTP.
- *INVALID*: something is wrong with the packet, and conntrack doesn’t know how
  to deal with it. This state plays a central role in this Kubernetes issue.
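
On a node, the `conntrack` command-line tool (where installed) lists what is currently being tracked, and iptables rules can make decisions based on these states through the conntrack match. Both commands below are only illustrative and reuse the backend pod IP from the diagrams:

```bash
# List tracked connections whose original destination is the backend pod.
conntrack -L -d 10.0.1.2

# Example of making a filtering decision based on ctstate.
iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
```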

Here is a diagram of how a TCP connection works between a pod and a service. The
sequence of events is:

- The client pod on the left-hand side sends a packet to a
  service: 192.168.0.2:80
- The packet goes through the iptables rules on the client
  node, and the destination is changed to the pod IP, 10.0.1.2:80
- The server pod handles the packet and sends back a packet with destination 10.0.0.2
- The packet goes back to the client node; conntrack recognizes the packet and rewrites the source
  address back to 192.168.0.2:80
- The client pod receives the response packet

{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/good-packet-flow.png"
caption="Good packet flow">}}

## What caused the connection reset?

Enough of the background; so what really went wrong and caused the unexpected
connection reset?

As the diagram below shows, the problem is packet 3: it is a returning packet
that conntrack cannot recognize, so conntrack marks it as *INVALID*. The most
common reasons are that conntrack cannot keep track of a connection because it
is out of capacity, or that the packet itself falls outside the TCP window. For
packets that have been marked *INVALID* by conntrack, there is no iptables rule
to drop them, so they are forwarded to the client pod with the source IP
address not rewritten (as shown in packet 4)! The client pod doesn’t recognize this
packet because it has a different source IP, which is the pod IP, not the service IP. As
a result, the client pod says, "Wait a second, I don't recall a connection to
this IP ever existing, why does this dude keep sending packets to me?" Basically,
what the client does is simply send a RST packet to the server pod IP, which
is packet 5. Unfortunately, this is a totally legit pod-to-pod packet, which can
be delivered to the server pod. The server pod doesn’t know about all the address
translations that happened on the client side. From its view, packet 5 is a
totally legit packet, like packets 2 and 3. All the server pod knows is, "Well, the
client pod doesn’t want to talk to me, so let’s close the connection!" Boom! Of
course, for all of this to happen, the RST packet has to be legit too, with the
right TCP sequence number, etc. But when it happens, both parties agree to close
the connection.

{{<figure width="100%"
src="/images/blog/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets/connection-reset-packet-flow.png"
caption="Connection reset packet flow">}}

## How to address it?

Once we understand the root cause, the fix is not hard. There are at least two
ways to address it:

- Make conntrack more liberal about such packets, and don’t mark them as
  *INVALID*. In Linux, you can do this by `echo 1 >
  /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal`.
- Specifically add an iptables rule to drop the packets that are marked as
  *INVALID*, so they never reach the client pod and cause harm (a hand-rolled
  version is sketched below).
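
The second option can be approximated by hand with a single rule. This is only an illustrative sketch; the rule that kube-proxy ends up programming may be scoped and placed differently:

```bash
# Drop forwarded packets that conntrack has marked INVALID, so they never reach
# the client pod with an un-rewritten source address.
iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
```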

The fix is drafted (https://github.com/kubernetes/kubernetes/pull/74840), but
unfortunately it didn’t make the v1.14 release window. However, for users
affected by this bug, there is a way to mitigate the problem by deploying the
following DaemonSet in your cluster, which sets the liberal conntrack flag (the
first option above) on every node.

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: startup-script
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
            echo done
```
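
To roll it out, save the manifest and apply it; the filename below is just a placeholder. You can then verify the flag on a node:

```bash
# From your workstation: create the DaemonSet.
kubectl apply -f startup-script-daemonset.yaml

# On any node: should print 1 once the pod has run there.
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
```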

## Summary

Obviously, this bug has existed almost forever. I am surprised that it
wasn’t noticed until recently. I believe the reasons could be: (1) this
happens more on a congested server serving large payloads, which might not be a
common use case; (2) the application layer handles retries and is therefore
tolerant of this kind of reset. Anyway, regardless of how fast Kubernetes has been
growing, it’s still a young project. There is no secret other than listening
closely to customers’ feedback, not taking anything for granted, and digging
deep; that is how we can make Kubernetes the best platform to run applications.

Special thanks to [bowei](https://github.com/bowei) for consulting on both the
debugging process and the blog post, and to [tcarmet](https://github.com/tcarmet)
for reporting the issue and providing a reproduction.