
Feature Request: Set NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on masquerading rules #1004

Closed
ghost opened this issue Jun 15, 2018 · 4 comments

ghost commented Jun 15, 2018

Current Behavior

We are experiencing random 5-second timeouts on DNS lookups, database connections, and other traffic in our Kubernetes cluster.

Possible Solution

Use the iptables --random-fully flag when creating masquerade rules. I have the first step of this pending in go-iptables: coreos/go-iptables#48
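
For illustration, here is a minimal sketch of what this could look like from flannel's side, assuming the HasRandomFully capability check proposed in coreos/go-iptables#48 and a placeholder pod CIDR (the rules flannel actually writes will differ):

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("initializing iptables: %v", err)
	}

	// Hypothetical pod network CIDR; flannel derives the real one from its config.
	podCIDR := "10.244.0.0/16"

	// Masquerade traffic leaving the pod network.
	rule := []string{"-s", podCIDR, "!", "-d", podCIDR, "-j", "MASQUERADE"}

	// Only request fully-random port allocation when the host iptables supports it
	// (the capability check proposed in coreos/go-iptables#48).
	if ipt.HasRandomFully() {
		rule = append(rule, "--random-fully")
	}

	if err := ipt.AppendUnique("nat", "POSTROUTING", rule...); err != nil {
		log.Fatalf("appending masquerade rule: %v", err)
	}
}
```

With that in place, the masquerade entry in iptables-save should end in `-j MASQUERADE --random-fully`.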

Steps to Reproduce (for bugs)

It is reproducible by requesting just about any in-cluster service and observing that, periodically (in our case, about 1 out of every 50 to 100 requests), we get a 5-second delay. The delay always occurs during the DNS lookup.

Context

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-flannel CNI implementations, and is (ironically) not really a flannel issue at all. However, it becomes a flannel issue because the fix is to set a flag on the masquerading rules, and those rules are created and controlled only by flannel.
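
(A quick way to confirm you are hitting this race is to watch the conntrack statistics on the node with `conntrack -S`: the insert_failed counter increments when two connections race for the same NAT mapping and a packet gets dropped, which is the symptom the article above describes.)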

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that flannel sets up.

We searched for this issue and didn't see that anyone had asked for it before. We're also unaware of any existing setting that enables this flag; if that is already possible, please let us know.

Your Environment

  • Flannel version: v0.9.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version:
  • Kubernetes version (if used): 1.8.14
  • Operating System and version:
  • Link to your project (optional):

This issue was copied from weaveworks/weave#3287


IvanovOleg commented Jun 22, 2018

I have the same issue with Kubernetes 1.10.4 + Azure CNI 1.0.6


Quentin-M commented Jun 24, 2018

I just posted a little write-up about our journey troubleshooting the issue and how we worked around it in production: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/.

Implementing NF_NAT_RANGE_PROTO_RANDOM_FULLY may not be enough to fix the problem: to my understanding it only addresses the SNAT race, whereas the race also exists with DNAT.
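
(On the DNS side specifically, the glibc resolver options `single-request` (sequential A/AAAA lookups) and `single-request-reopen` (retry on a fresh socket) in resolv.conf are a common stopgap, though musl-based images such as Alpine ignore those options.)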


anton-johansson commented Sep 18, 2019

We've seen a lot of 5-second delays for DNS lookups and other requests as well. Lately, we've also seen a lot of 1-second connects.

I tried upgrading Flannel to 0.11.0 (from 0.10.0), but I'm still seeing some issues. Not sure if it's related.

But how do I confirm that flannel is actually "upgraded"? I upgraded the DaemonSet and made sure the pods are updated, but what does that mean for existing pods and networks? Do I need to re-create all the pods, or even do something at the host level?

EDIT: Turns out that iptables on the underlying OS (version 1.6.1) does not support --random-fully, so that's probably why it has no effect.
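
(For reference, and assuming I have the versions right: the userspace --random-fully flag first shipped in iptables 1.6.2, and the kernel also needs NF_NAT_RANGE_PROTO_RANDOM_FULLY support. A quick way to check whether it is actually applied on a node is `iptables-save -t nat | grep random-fully`.)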

@carlvine500

@anton-johansson I tried Flannel v0.12.0, k8s 1.6.2, iptables 1.6.2, but there is no --random-fully in the iptables-save output.
