net regression: Connection to Zephyr server non-deterministically leads to client timeout, ENOTCONN on server side #34964
Comments
Just to show that the issue is fully reproducible.
I now tested with frdm_k64f, and cannot reproduce this issue with either request count. How it was before is that handling of retransmits and other uncommon conditions in the network stack was not ideal. QEMU serial emulation (as used for the SLIP networking connection) is also not ideal, leading to some transmitted-data loss and to rexmits/timeouts in the stack. So the fact that 100K requests could be handled in 2.5.0 with qemu_x86 (and even qemu_cortex_m3, which has even worse serial emulation problems) was a very good indication that rexmits etc. are handled well in the stack. So it's a bit unfortunate that this issue pops up again.
I just tried this and did not see any issue with native_posix. With qemu + e1000, zephyr occasionally printed a warning which indicates that the app has too few connections available. Could you check if this helps in your case?
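(The exact options suggested were not preserved in this copy of the thread. As a hedged illustration only, Zephyr's prj.conf exposes connection/context pool sizes along these lines; the symbols are real Kconfig options, but the values here are made up for this sketch:

```
# Illustrative prj.conf fragment -- values are examples, not the
# options from the original comment.
CONFIG_NET_MAX_CONN=16
CONFIG_NET_MAX_CONTEXTS=16
```
)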
I used the standard prj.conf for the sample and the test process as described in https://docs.zephyrproject.org/latest/guides/networking/qemu_setup.html?highlight=slip#basic-setup . Setting the above config options didn't help; I was still getting errors from ab. I'll be doing bisect now.
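(For reference, the bisect flow looks roughly like this. A sketch: v2.5.0 is the known-good tag from this report, and the bad endpoint is whatever tip exhibits the failure:

```
git bisect start
git bisect bad HEAD        # current tip, where ab -n1000 times out
git bisect good v2.5.0     # last tag known to serve 100K requests
# At each step: rebuild the sample, rerun ab -n1000, then mark it:
#   git bisect good    (run completed)
#   git bisect bad     (run timed out)
git bisect reset           # when finished
```
)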
Ok, finished bisect; it shows that the first problematic commit is dde03c6.
The issue seems to be related to the SLIP connection between zephyr and the host. If I use an Ethernet connection via the e1000 driver, I am not able to see any problems. Edit: or to be more precise, the issue presents itself with the SLIP connection, although the actual problem is not there. After some digging: the socket layer receives the HTTP req data in
Managed to get a log from the issue:
Here, the
Also this log looks weird:
In this log, the received_cb is called before the recv, but we still start to wait for data even though the receive queue should already have it.
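To make the race concrete, here is a minimal sketch of the lost-wakeup pattern being described, using Zephyr's k_condvar API. All names (rx_lock, rx_cond, rx_data_available, and both functions) are hypothetical and only illustrate the shape of the bug, not the actual zsock code:

```c
#include <zephyr/kernel.h>
#include <errno.h>
#include <stdbool.h>

K_MUTEX_DEFINE(rx_lock);
K_CONDVAR_DEFINE(rx_cond);
static bool rx_data_available;

/* Network-stack side: runs when data for the socket arrives. */
void received_cb_sketch(void)
{
	rx_data_available = true;
	/* BUG: signalling without holding rx_lock. If the reader has already
	 * checked rx_data_available (seeing false) but has not yet entered
	 * k_condvar_wait(), this signal is lost and the reader sleeps until
	 * its timeout even though data is queued. */
	k_condvar_signal(&rx_cond);
}

/* Application side: simplified blocking recv(). */
int recv_sketch(k_timeout_t timeout)
{
	int ret = 0;

	k_mutex_lock(&rx_lock, K_FOREVER);
	while (!rx_data_available) {
		/* The race window is between the check above and the wait
		 * below: a signal arriving in that window is simply dropped. */
		if (k_condvar_wait(&rx_cond, &rx_lock, timeout) == -EAGAIN) {
			ret = -EAGAIN; /* what the peer observes as a stall */
			break;
		}
	}
	if (ret == 0) {
		rx_data_available = false; /* consume the data */
	}
	k_mutex_unlock(&rx_lock);
	return ret;
}
```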
Fix a regression where the application is waiting for data but does not notice it because the socket layer is not woken up. This could happen because the application was about to wait on a condition variable, but the signal to wake the condvar came before the wait started. Normally, if there is a constant flow of incoming data to the socket, the signal would be given again later. But if the peer is waiting for Zephyr to reply, there might be a timeout at the peer. The solution is to add locking in the socket receive callback so that we only signal the condition variable after we have made sure that the condition variable is actually waiting for the data.

Fixes #34964

Signed-off-by: Jukka Rissanen <[email protected]>
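Continuing the previous sketch's hypothetical declarations, the fix described in this commit message amounts to the receive callback taking the reader's lock before signalling, which closes the window between the reader's check and its wait:

```c
/* Fixed network-stack side: signal only while holding the reader's lock. */
void received_cb_fixed_sketch(void)
{
	k_mutex_lock(&rx_lock, K_FOREVER);
	rx_data_available = true;
	/* Holding rx_lock guarantees the reader is either before its check
	 * (it will see the flag and not wait at all) or already parked inside
	 * k_condvar_wait() (this signal will wake it). No lost wakeup. */
	k_condvar_signal(&rx_cond);
	k_mutex_unlock(&rx_lock);
}
```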
Confirming that #35466 fixed this, tested with
Describe the bug
When running the dumb_http_server sample for qemu_x86 and connecting to it with `ab -n1000 http://192.0.2.1:8080/`, the `ab` tool aborts due to a timeout during handling of one of these 1000 requests. Previously (v2.5.0), 100K requests were served without any problem.

To Reproduce
Steps to reproduce the behavior:
ab -n1000 http://192.0.2.1:8080/
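Spelled out as commands, the flow is roughly the following (a sketch: board, sample path, and west targets as in current Zephyr trees; SLIP networking set up per the qemu_setup guide linked above in this thread):

```
# Terminal 1: build and run the sample under QEMU
west build -b qemu_x86 samples/net/sockets/dumb_http_server
west build -t run

# Terminal 2: once the server reports it is listening
ab -n1000 http://192.0.2.1:8080/
```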
The exact request number at which `ab` fails varies, but over 10 tries, none of the 1000-request runs completed successfully.

Expected behavior
Both `ab -n1000 http://192.0.2.1:8080/` and `ab -n100000 http://192.0.2.1:8080/` should complete successfully. v2.5.0 worked like that, and I verified that checking out that tag still works like that for me.

Impact
We used to have issue(s) like that previously (early 2.x releases). 2.4 and 2.5 passed this test. We're back to network stack instability with this issue.
Logs and console output
In addition to the `ab` output shown above, on the dumb_http_server sample side, the output is:

Environment (please complete the following information):