Traffic from pods use the wrong egress interface - sometimes #1856

Closed
akunszt opened this issue Feb 9, 2022 · 18 comments

akunszt commented Feb 9, 2022

What happened:
We noticed that one of our pods, which had been running fine earlier, lost access to the apiserver. After evaluating the situation we were able to confirm that the traffic was DNAT-ed properly by kube-proxy from 10.2.0.1 to a real master node IP address, so that part looked correct.
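
One way to sanity-check that DNAT, assuming conntrack-tools is installed on the worker, is to list the tracked connections for the service IP and check that the reply direction shows the real master address:

conntrack -L -d 10.2.0.1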

The endpoints:
Master node: eth0: 10.31.1.18/20
Worker node: eth0: 10.31.72.129/20, eth1: 10.31.75.90/20
Pod: 10.31.75.206/20

The pod's IP is a secondary IP on the eth1 interface according to the AWS console.

The outgoing traffic uses the eth0 interface - and is presumably dropped by the VPC router as spoofed traffic:

[root@ip-10-31-72-129 ~]# tcpdump -nei any host 10.31.75.206 and host 10.31.1.18 and port 443
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:54:39.763604 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 80: 10.31.75.206.50904 > 10.31.1.18.https: Flags [S], seq 2611885982, win 62727, options [mss 8961,sackOK,TS val 3556171341 ecr 0,nop,wscale 7], length 0
12:54:40.053050 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 80: 10.31.75.206.50906 > 10.31.1.18.https: Flags [S], seq 2265375090, win 62727, options [mss 8961,sackOK,TS val 3556171631 ecr 0,nop,wscale 7], length 0
12:54:40.785543 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 80: 10.31.75.206.50904 > 10.31.1.18.https: Flags [S], seq 2611885982, win 62727, options [mss 8961,sackOK,TS val 3556172363 ecr 0,nop,wscale 7], length 0
12:54:41.105538 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 80: 10.31.75.206.50906 > 10.31.1.18.https: Flags [S], seq 2265375090, win 62727, options [mss 8961,sackOK,TS val 3556172683 ecr 0,nop,wscale 7], length 0
12:54:41.494171 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 80: 10.31.75.206.50920 > 10.31.1.18.https: Flags [S], seq 961347920, win 62727, options [mss 8961,sackOK,TS val 3556173072 ecr 0,nop,wscale 7], length 0

When I ping the pod's IP from the master node, it doesn't answer. The worker receives the ICMP echo-request on eth1 but sends back the echo-reply on eth0.

[root@ip-10-31-72-129 ~]# tcpdump -nei any icmp and host 10.31.1.18
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:56:20.457904 eth1  In  ifindex 9 02:1e:ad:00:32:9f ethertype IPv4 (0x0800), length 104: 10.31.1.18 > 10.31.75.206: ICMP echo request, id 59266, seq 1, length 64
12:56:20.457943 podabd1c79b5ca Out ifindex 12 36:fc:00:5a:49:87 ethertype IPv4 (0x0800), length 104: 10.31.1.18 > 10.31.75.206: ICMP echo request, id 59266, seq 1, length 64
12:56:20.457956 podabd1c79b5ca In  ifindex 12 02:6b:fe:e0:a1:b7 ethertype IPv4 (0x0800), length 104: 10.31.75.206 > 10.31.1.18: ICMP echo reply, id 59266, seq 1, length 64
12:56:20.457959 eth0  Out ifindex 2 02:72:26:2c:07:db ethertype IPv4 (0x0800), length 104: 10.31.75.206 > 10.31.1.18: ICMP echo reply, id 59266, seq 1, length 64
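
The kernel's choice of reply interface can presumably be checked on the worker with something like this (the pod interface name is taken from the capture above); on the broken node there is apparently no "from 10.31.75.206" rule, so the lookup falls through to the main table and goes out via eth0:

ip rule show
ip route get 10.31.1.18 from 10.31.75.206 iif podabd1c79b5ca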

I checked a newly started worker and another pod on it where it was working as expected:

[root@ip-10-31-5-93 ~]# tcpdump -nei any icmp and host 10.31.1.18
tcpdump: data link type LINUX_SLL2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:00:06.858661 eth1  In  ifindex 7 06:60:92:05:39:bd ethertype IPv4 (0x0800), length 104: 10.31.1.18 > 10.31.8.238: ICMP echo request, id 63665, seq 1, length 64
13:00:06.858700 pod26d82261723 Out ifindex 13 12:c1:28:9d:b4:b5 ethertype IPv4 (0x0800), length 104: 10.31.1.18 > 10.31.8.238: ICMP echo request, id 63665, seq 1, length 64
13:00:06.858712 pod26d82261723 In  ifindex 13 1e:a1:02:29:48:7b ethertype IPv4 (0x0800), length 104: 10.31.8.238 > 10.31.1.18: ICMP echo reply, id 63665, seq 1, length 64
13:00:06.858716 eth1  Out ifindex 7 06:fc:96:2b:01:17 ethertype IPv4 (0x0800), length 104: 10.31.8.238 > 10.31.1.18: ICMP echo reply, id 63665, seq 1, length 6

I tried switching AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER from false to true and AWS_VPC_K8S_CNI_EXTERNALSNAT from true to false as a test, but that didn't make any difference.
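
For reference, the toggles were applied on the aws-node DaemonSet; a sketch of the kind of command used, assuming the standard aws-node deployment in kube-system:

kubectl -n kube-system set env daemonset/aws-node \
    AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER=true \
    AWS_VPC_K8S_CNI_EXTERNALSNAT=false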

Attach logs
Done and will arrive shortly.

What you expected to happen:
Egress traffic uses the proper interface.

How to reproduce it (as minimally and precisely as possible):
As it doesn't happen every time, we can't reproduce it.

Anything else we need to know?:
We have been using this CNI version for a long time, but we just updated our AMIs to the latest Flatcar Stable version, which means a different kernel and Docker too. We didn't notice this behaviour on the old 2605 series AMIs running on kernel 5.4. This could be a race condition triggered by the newer kernel - maybe, maybe not.

Environment:

  • Kubernetes version (use kubectl version): 1.21.2
  • CNI Version: 1.7.10
  • OS (e.g: cat /etc/os-release):
ID=flatcar
ID_LIKE=coreos
VERSION=3033.2.0
VERSION_ID=3033.2.0
BUILD_ID=2021-12-10-1820
PRETTY_NAME="Flatcar Container Linux by Kinvolk 3033.2.0 (Oklo)"
ANSI_COLOR="38;5;75"
HOME_URL="https://flatcar-linux.org/"
BUG_REPORT_URL="https://issues.flatcar-linux.org"
FLATCAR_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
Linux ip-10-31-72-129.us-west-2.compute.internal 5.10.84-flatcar #1 SMP Fri Dec 10 17:33:00 -00 2021 x86_64 AMD EPYC 7R32 AuthenticAMD GNU/Linux
akunszt added the bug label Feb 9, 2022

akunszt commented Feb 9, 2022

Update: I can't attach the logs as the archive was bigger than the maximum allowed size. Is there another way to share it? I don't want to make it public as it might contain sensitive information.

achevuru commented Feb 9, 2022

@akunszt We do have a known issue around this for non-VPC traffic which was addressed in v1.8.0. Can you try v1.8.0 and let us know if that helps?

achevuru self-assigned this Feb 9, 2022

akunszt commented Feb 10, 2022

@achevuru Thanks for the information. I updated to 1.8.0. As the issue doesn't happen immediately, I'll have to wait; it could take a few hours or a few days.
The traffic I was talking about is actually local to the VPC, but it flows between endpoints in different subnets, so it goes first to the VPC router and then back to the target subnet.

akunszt commented Feb 11, 2022

It looks like this doesn't solve the issue at all. I experienced the same on a freshly started node with aws-vpc-cni:1.8.0. It seems the egress interface selection simply doesn't work: the pods that run fine are the ones using eth0, and the ones that should use eth1 fail immediately.

akunszt commented Feb 11, 2022

I did some more digging. I found a case where I had two pods "behind" the same ENI: one of them was working but the other wasn't.

eth1/eni-0cfc343e6d82b3909:
	10.31.65.52 - BAD
	10.31.64.23 - GOOD

I checked the logs and found only a few lines for 10.31.65.52:

{"level":"info","ts":"2022-02-11T12:11:11.813Z","caller":"ipamd/ipamd.go:830","msg":"Added ENI(eni-0cfc343e6d82b3909)'s IP 10.31.65.52 to datastore"}
{"level":"info","ts":"2022-02-11T12:11:11.864Z","caller":"datastore/data_store.go:360","msg":"AssignPodIPv4Address: Assign IP 10.31.65.52 to sandbox _migrated-from-cri/fd37ddee6e6e0e5c3cd0bb9548ab77c21241fe039a49d3cf315650209369f17a/unknown"}
{"level":"debug","ts":"2022-02-11T12:11:11.864Z","caller":"ipamd/ipamd.go:413","msg":"Recovered _migrated-from-cri/fd37ddee6e6e0e5c3cd0bb9548ab77c21241fe039a49d3cf315650209369f17a/unknown => eni-0cfc343e6d82b3909/10.31.65.52"}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.65.52 ffffffff}] "

But a lot more related to 10.31.64.23:

{"level":"info","ts":"2022-02-11T12:11:11.813Z","caller":"ipamd/ipamd.go:830","msg":"Added ENI(eni-0cfc343e6d82b3909)'s IP 10.31.64.23 to datastore"}
{"level":"info","ts":"2022-02-11T12:11:11.864Z","caller":"datastore/data_store.go:360","msg":"AssignPodIPv4Address: Assign IP 10.31.64.23 to sandbox _migrated-from-cri/9af8fce3cfe6bfc5a2f9a6922bc0459f8c1a1b70ad2260a14d1b7bff03525efa/unknown"}
{"level":"debug","ts":"2022-02-11T12:11:11.864Z","caller":"ipamd/ipamd.go:413","msg":"Recovered _migrated-from-cri/9af8fce3cfe6bfc5a2f9a6922bc0459f8c1a1b70ad2260a14d1b7bff03525efa/unknown => eni-0cfc343e6d82b3909/10.31.64.23"}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.65.215 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.75.68 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.68.60 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.69.56 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.74.206 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.865Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.75.73 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.65.240 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.69.129 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.76.111 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.64.23 ffffffff}] "}
{"level":"info","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Remove current list [[ip rule 1536: from 10.31.64.23/32 table 2]]"}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"UpdateRuleListBySrc: Successfully removed current rule [ip rule 1536: from 10.31.64.23/32 table 2] to "}
{"level":"info","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"UpdateRuleListBySrc: Successfully added pod rule[ip rule 1536: from 10.31.64.23/32 table 2]"}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.65.52 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.64.115 ffffffff}] "}
{"level":"debug","ts":"2022-02-11T12:11:11.866Z","caller":"ipamd/ipamd.go:483","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.64.23/32 table 2 ip rule 1536: from 10.31.65.240/32 table 3 ip rule 1536: from 10.31.69.129/32 table 3 ip rule 1536: from 10.31.64.115/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.71.33 ffffffff}] "}

The suspicious difference is that there is no "Successfully added pod rule" line in the non-working case.

I checked the routing rules:

ip-10-31-67-96 ~ # ip ru sh
0:      from all lookup local
512:    from all to 10.31.64.23 lookup main
512:    from all to 10.31.65.240 lookup main
512:    from all to 10.31.69.129 lookup main
512:    from all to 10.31.64.115 lookup main
1536:   from 10.31.65.240 lookup 3
1536:   from 10.31.69.129 lookup 3
1536:   from 10.31.64.23 lookup 2
1536:   from 10.31.64.115 lookup 2
32766:  from all lookup main
32767:  from all lookup default

ip-10-31-67-96 ~ # ip r sh t 2   
default via 10.31.64.1 dev eth1 
10.31.64.1 dev eth1 scope link

After I manually added the rules for the non-working pod, it started to work - but I don't want to do this by hand every time; that is what the CNI is for, I think:

ip ru add from all to 10.31.65.52/32 lookup main
ip ru add from 10.31.65.52/32 lookup 2
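
Judging by the ip ru sh output above, the CNI installs these rules with explicit priorities, so a closer equivalent - priorities taken from that output, not from the source code - would presumably be:

ip rule add from all to 10.31.65.52/32 pref 512 lookup main
ip rule add from 10.31.65.52/32 pref 1536 lookup 2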

I skimmed through the Changelog and didn't find an entry that could be related to this.

This looks like a race condition to me. Do you think upgrading to the latest version will resolve this? Were there any changes related to this part of the code?

akunszt commented Feb 14, 2022

@achevuru I continued digging, as this issue blocks us quite badly.
I updated the aws-vpc-cni to the latest version, which was 1.10.2 at the time of writing. I again have a working and a non-working node. I compared the debug logs.
The non-working node:

{"level":"debug","ts":"2022-02-14T09:24:18.121Z","caller":"ipamd/ipamd.go:605","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.2.78 ffffffff}] "}                                                                                                  
{"level":"info","ts":"2022-02-14T09:24:18.121Z","caller":"ipamd/ipamd.go:605","msg":"Remove current list [[]]"}
{"level":"debug","ts":"2022-02-14T09:24:18.121Z","caller":"ipamd/ipamd.go:605","msg":"UpdateRuleListBySrc: empty list, no need to update"}

While on the working node:

{"level":"debug","ts":"2022-02-14T09:25:34.989Z","caller":"ipamd/ipamd.go:605","msg":"Update Rule List[[ip rule -1: from <nil> table 255 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 512: from <nil> table 254 ip rule 1536: from 10.31.70.252/32 table 2 ip rule 32766: from <nil> table 254 ip rule 32767: from <nil> table 253]] for source[{10.31.70.252 ffffffff}] "}                                       
{"level":"info","ts":"2022-02-14T09:25:34.989Z","caller":"ipamd/ipamd.go:605","msg":"Remove current list [[ip rule 1536: from 10.31.70.252/32 table 2]]"}
{"level":"debug","ts":"2022-02-14T09:25:34.989Z","caller":"ipamd/ipamd.go:605","msg":"UpdateRuleListBySrc: Successfully removed current rule [ip rule 1536: from 10.31.70.252/32 table 2] to "}
{"level":"info","ts":"2022-02-14T09:25:34.989Z","caller":"ipamd/ipamd.go:605","msg":"UpdateRuleListBySrc: Successfully added pod rule[ip rule 1536: from 10.31.70.252/32 table 2]"}

My Go knowledge is very limited - so to speak - but I tried to follow the execution path for this.

func (n *linuxNetwork) UpdateRuleListBySrc(ruleList []netlink.Rule, src net.IPNet) error {

In this function the code fetches the current ip rule sh output - through netlink rather than the CLI - to check whether a rule is already set with the Pod's IP address as the source. In the non-working case the rule list is essentially empty, so it doesn't find anything. It then tries to remove this empty rule - why? - and in the next step it realizes the list is empty and leaves the function. In the working case it removes and then re-adds the same rule (sketched below).
What I don't understand:

  • Based on this behaviour something else must add the ip rule in the first place, as this code can't add a missing rule.
  • Deleting the empty rule and then returning from the function looks like a bug to me; it should bail out earlier.
  • If the rule is already set, it removes the proper rule and then adds it again. Wouldn't that cause a temporary hiccup in the traffic, or are you betting on the route cache for the interim period? Or is this a one-time action to prepare/fix the rules at startup? (I honestly can't recall whether ip rule decisions are cached; I think they aren't, but it has been a long time since I worked with them.)
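
Expressed with iproute2 instead of the netlink library, the cycle described above looks roughly like this (a sketch of my reading of the code, not the actual code path):

ip rule show | grep "from 10.31.64.23"              # look for an existing rule with the pod IP as source
ip rule del from 10.31.64.23/32 pref 1536 lookup 2  # remove the matching rule, if one was found
ip rule add from 10.31.64.23/32 pref 1536 lookup 2  # add the same rule back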

Thanks for any insights/information. If you think this is not the right place to have this discussion then we can open a Support ticket in AWS and continue there.

pa8mm commented Feb 14, 2022

Hello,

We stumbled upon the same issue - lack of connectivity from outside to random Pods on the Node.
Affected Pods have entries visible in ip rule show, and I can confirm there are log entries similar to @akunszt's about adding those rules.

What I also noticed is that traffic to the "working" Pods always comes in via the Primary ENI, while traffic to the non-working Pods arrives via a Secondary ENI.

AWS CNI version is v1.7.10

achevuru commented Feb 15, 2022

@akunszt The logs you shared above are for Pods that aws-node (the CNI pod) found running on the node during its bootup. The CNI pod reaches out to the local CRI socket to gather information about the pods that are active on the node and uses that information to build its internal database. So these are pods that were already running prior to the current bootup of the CNI pod. In this case the Pod IP rule for 10.31.65.52 should have been present, as there is a Pod with this IP assigned, but the CNI didn't find the rule. So we need logs from when this pod was initially assigned this IP.

If this worked for you prior to the AMI/kernel update, then some underlying issue in the kernel might be contributing to it. But that's just speculation, and we need the entire log bundle to debug the issue. Please feel free to open a support case and share the logs. I tried with a regular EKS AMI and I don't see any issue with pods running behind secondary ENIs. Also, can you try scheduling new pods behind secondary ENIs (instead of the reboot/update cases above) and see how they behave?

achevuru commented:

@PA8MM If you see correct entries for the affected pods in ip rule list, then the issue is different from the one being discussed here. Can you try v1.8.0+ as suggested above and let us know whether that addresses the issue you are experiencing?

akunszt commented Feb 15, 2022

@achevuru I opened an AWS support ticket and linked this issue. I bounced every worker and found a situation where two pods are using secondary IP addresses from the same ENI (assigned to eth1), and one is working while the other is not. I managed to run the aws-cni-support.sh script and upload the archive - it was quite small on a new instance.
The kernel netlink code changed between 5.8.105 - where aws-vpc-cni works reliably - and 5.10.84 - where it does not. That could be a contributing factor. I have a hunch that the root cause might be a race condition in the netlink handling that the newer kernel triggers.
How can you schedule a pod behind a specific ENI? Is there a way to force that?

achevuru commented:

@akunszt There is no specific knob available to force a Pod to be placed behind a secondary ENI, but you can schedule enough pods to exhaust the primary ENI and force any subsequent pods onto secondary ENIs (rough example below). Are you using a custom AMI? EKS AMIs are still on the 5.4.x train.
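
A rough way to do that - deployment name, image and replica count are only illustrative; pin the replicas to the node under test (e.g. with a nodeSelector) so they spill past the primary ENI's secondary IP capacity:

kubectl create deployment eni-filler --image=k8s.gcr.io/pause:3.5
kubectl scale deployment eni-filler --replicas=40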

akunszt commented Feb 16, 2022

@achevuru We are using the official Flatcar Linux AMIs: 3033.2.0 (Stable) is the one with the new kernel and 2605.22.1 (LTS) is the one with the old kernel. Flatcar is the successor of CoreOS, which was practically cancelled by IBM/RedHat after they bought it. It's a minimal OS that does only one thing - run containers - so it is ideal for Kubernetes.

akunszt commented Feb 18, 2022

@achevuru Can you point me in the right direction? I just can't find where the CNI adds the IP rules in the first place. I thought it should be in

func (s *server) AddNetwork(ctx context.Context, in *rpc.AddNetworkRequest) (*rpc.AddNetworkReply, error) {

But I didn't find anything that even remotely resembles the netlink.RuleAdd function.
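
For reference, this is roughly how I have been searching a local checkout for it (paths and results may differ between versions):

git clone https://github.com/aws/amazon-vpc-cni-k8s.git
grep -rnE "netlink\.RuleAdd|UpdateRuleListBySrc" amazon-vpc-cni-k8s/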

achevuru commented Feb 18, 2022

@akunszt Refer here.

Also, we haven't received your support ticket yet. Is anyone from support helping you with this? Can you ask them to forward it to the EKS service team?

akunszt commented Feb 21, 2022

@achevuru Thanks. We managed to figure it out just before I read your message. It was a bit misleading that we don't use the routed-eni-cni plugin name anywhere, and it wasn't obvious that the binary would be named aws-cni and used as the aws-cni plugin.

I'll let our TAM know that the support ticket hasn't reached its destination yet and ask him to escalate/forward/assign it to the EKS service team.

akunszt commented Feb 22, 2022

@achevuru I think I found the root cause and a solution. It turned out that it wasn't the kernel, it was systemd - specifically systemd-networkd. When a new interface/ENI is added to the system, systemd-networkd by default cleans up every foreign route and rule - it's a very stupid default setting, but hey, it's systemd...
I noticed this while running ip monitor rule: the CNI actually set the rule properly, and it was removed in the exact same second another interface was brought up.
The solution was to override the default systemd-networkd configuration in /etc/systemd/networkd.conf.d/10-routes.conf:

ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
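
For completeness: in networkd.conf these keys live under the [Network] section (the section header is presumably required), and systemd-networkd has to be restarted to pick up the drop-in - roughly:

mkdir -p /etc/systemd/networkd.conf.d
cat >/etc/systemd/networkd.conf.d/10-routes.conf <<'EOF'
[Network]
ManageForeignRoutingPolicyRules=no
ManageForeignRoutes=no
EOF
systemctl restart systemd-networkd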

It might be worth mentioning this in the documentation, as it is a very annoying and very hard to trace issue because it only happens at runtime. If someone tests with a few pods, everything can look fine, but when aws-cni attaches an extra interface in production - where there are usually more pods - systemd kills off the pod network.
I'm sorry for the noise; it turned out that this is not an issue with the amazon-k8s-vpc-cni. On the plus side, I learnt a lot about the internals of the CNI over the last few days. Maybe a bit too much, though.

akunszt closed this as completed Feb 22, 2022
github-actions commented:

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

achevuru commented:

@akunszt Good find! We do have old issue(s) tied to the same behavior in the queue. Refer to this comment in particular. We will update our docs to call this out.

#1514
#1600
