
Feature Request: Set NF_NAT_RANGE_PROTO_RANDOM_FULLY flag on masquerading rules #1004

Closed
ghost opened this issue Jun 15, 2018 · 4 comments

ghost commented Jun 15, 2018

Current Behavior

We are experiencing random 5-second timeouts on DNS lookups, database connections, and other traffic in our Kubernetes cluster.

Possible Solution

Use the iptables --random-fully flag when creating masquerade rules. I have the first step of this pending in go-iptables: coreos/go-iptables#48
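
For illustration, here is a minimal sketch of what this could look like from flannel's side, assuming the HasRandomFully capability check proposed in coreos/go-iptables#48 and a placeholder pod CIDR (the rules flannel actually writes will differ):

```go
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("initializing iptables: %v", err)
	}

	// Hypothetical pod network CIDR; flannel derives the real one from its config.
	podCIDR := "10.244.0.0/16"

	// Masquerade traffic leaving the pod network.
	rule := []string{"-s", podCIDR, "!", "-d", podCIDR, "-j", "MASQUERADE"}

	// Only request fully-random port allocation when the host iptables supports it
	// (the capability check proposed in coreos/go-iptables#48).
	if ipt.HasRandomFully() {
		rule = append(rule, "--random-fully")
	}

	if err := ipt.AppendUnique("nat", "POSTROUTING", rule...); err != nil {
		log.Fatalf("appending masquerade rule: %v", err)
	}
}
```

With that in place, the masquerade entry in iptables-save should end in `-j MASQUERADE --random-fully`.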

Steps to Reproduce (for bugs)

It is reproducible by requesting just about any in-cluster service and observing that, periodically (in our case, about 1 out of every 50 to 100 requests), we get a 5-second delay. The delay always occurs during the DNS lookup.

Context

We believe this is a result of a kernel level SNAT race condition that is described quite well here:

https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

The problem also happens with non-flannel CNI implementations, and is (ironically) not really a flannel issue at all. However, it becomes a flannel issue because the fix is to set a flag on the masquerading rules, and those rules are created and controlled only by flannel.
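
(A quick way to confirm you are hitting this race is to watch the conntrack statistics on the node with `conntrack -S`: the insert_failed counter increments when two connections race for the same NAT mapping and a packet gets dropped, which is the symptom the article above describes.)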

What we need is the ability to apply the NF_NAT_RANGE_PROTO_RANDOM_FULLY flag to the masquerading rules that flannel sets up.

We searched for this issue and didn't see that anyone had asked for it before. We're also unaware of any existing setting that enables this flag; if that is already possible, please let us know.

Your Environment

  • Flannel version: v0.9.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version:
  • Kubernetes version (if used): 1.8.14
  • Operating System and version:
  • Link to your project (optional):

This issue was copied from weaveworks/weave#3287


IvanovOleg commented Jun 22, 2018

I have the same issue with Kubernetes 1.10.4 + Azure CNI 1.0.6


Quentin-M commented Jun 24, 2018

I just posted a little write-up about our journey troubleshooting the issue and how we worked around it in production: https://blog.quentin-machu.fr/2018/06/24/5-15s-dns-lookups-on-kubernetes/.

Implementing NF_NAT_RANGE_PROTO_RANDOM_FULLY may not be enough to fix the problem: to my understanding it only addresses the SNAT race, whereas the race also exists with DNAT.
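
(On the DNS side specifically, the glibc resolver options `single-request` (sequential A/AAAA lookups) and `single-request-reopen` (retry on a fresh socket) in resolv.conf are a common stopgap, though musl-based images such as Alpine ignore those options.)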


anton-johansson commented Sep 18, 2019

We've seen a lot of 5-second delays for DNS lookups and other requests as well. Lately, we've also seen a lot of 1-second connects.

I tried upgrading Flannel to 0.11.0 (from 0.10.0), but I'm still seeing some issues. Not sure if it's related.

But how do I confirm that flannel is actually "upgraded"? I upgraded the DaemonSet and made sure the pods are updated, but what does that mean for existing pods and networks? Do I need to re-create all the pods, or even do something at the host level?

EDIT: Turns out that iptables on the underlying OS (version 1.6.1) does not support --random-fully, so that's probably why it has no effect.
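
(For reference, and assuming I have the versions right: the userspace --random-fully flag first shipped in iptables 1.6.2, and the kernel also needs NF_NAT_RANGE_PROTO_RANDOM_FULLY support. A quick way to check whether it is actually applied on a node is `iptables-save -t nat | grep random-fully`.)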

@carlvine500

@anton-johansson I tried Flannel v0.12.0, k8s 1.6.2, iptables 1.6.2, but there is no --random-fully in the iptables-save output.
