net regression: Connection to Zephyr server non-deterministically leads to client timeout, ENOTCONN on server side #34964
Comments
Just to show that the issue is fully reproducible.
I now tested with frdm_k64f, and cannot reproduce this issue with either request count. How it was before is that handling of retransmits and other uncommon conditions in the network stack was not ideal. QEMU serial emulation (as used for the SLIP networking connection) is also not ideal, leading to some transmitted-data loss and to rexmits/timeouts in the stack. So the fact that 100K requests could be handled in 2.5.0 with qemu_x86 (and even qemu_cortex_m3, which has even worse serial emulation problems) was a very good indication that rexmits etc. are handled well in the stack. So it's a bit unfortunate that this issue pops up again.
I just tried this and did not see any issue with native_posix. With qemu + e1000, zephyr occasionally printed a warning which indicates that the app has too few connections available. Could you check if this helps in your case?
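(The exact options suggested were not preserved in this copy of the thread. As a hedged illustration only, Zephyr's prj.conf exposes connection/context pool sizes along these lines; the symbols are real Kconfig options, but the values here are made up for this sketch:

```
# Illustrative prj.conf fragment -- values are examples, not the
# options from the original comment.
CONFIG_NET_MAX_CONN=16
CONFIG_NET_MAX_CONTEXTS=16
```
)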
I used the standard prj.conf for the sample and the test process as described in https://docs.zephyrproject.org/latest/guides/networking/qemu_setup.html?highlight=slip#basic-setup . Setting the above config options didn't help; I was still getting errors from ab. I'll be doing bisect now.
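(For reference, the bisect flow looks roughly like this. A sketch: v2.5.0 is the known-good tag from this report, and the bad endpoint is whatever tip exhibits the failure:

```
git bisect start
git bisect bad HEAD        # current tip, where ab -n1000 times out
git bisect good v2.5.0     # last tag known to serve 100K requests
# At each step: rebuild the sample, rerun ab -n1000, then mark it:
#   git bisect good    (run completed)
#   git bisect bad     (run timed out)
git bisect reset           # when finished
```
)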
Ok, finished bisect; it shows that the first problematic commit is dde03c6.
The issue seems to be related to the SLIP connection between zephyr and the host. If I use an Ethernet connection via the e1000 driver, I am not able to see any problems. Edit: or to be more precise, the issue presents itself with the SLIP connection, although the actual problem is not there. After some digging: the socket layer receives the HTTP req data in
Managed to get a log from the issue:
Here, the
Also this log looks weird:
In this log, the received_cb is called before the recv, but we still start to wait for data even though the receive queue should already have it.
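To make the race concrete, here is a minimal sketch of the lost-wakeup pattern being described, using Zephyr's k_condvar API. All names (rx_lock, rx_cond, rx_data_available, and both functions) are hypothetical and only illustrate the shape of the bug, not the actual zsock code:

```c
#include <zephyr/kernel.h>
#include <errno.h>
#include <stdbool.h>

K_MUTEX_DEFINE(rx_lock);
K_CONDVAR_DEFINE(rx_cond);
static bool rx_data_available;

/* Network-stack side: runs when data for the socket arrives. */
void received_cb_sketch(void)
{
	rx_data_available = true;
	/* BUG: signalling without holding rx_lock. If the reader has already
	 * checked rx_data_available (seeing false) but has not yet entered
	 * k_condvar_wait(), this signal is lost and the reader sleeps until
	 * its timeout even though data is queued. */
	k_condvar_signal(&rx_cond);
}

/* Application side: simplified blocking recv(). */
int recv_sketch(k_timeout_t timeout)
{
	int ret = 0;

	k_mutex_lock(&rx_lock, K_FOREVER);
	while (!rx_data_available) {
		/* The race window is between the check above and the wait
		 * below: a signal arriving in that window is simply dropped. */
		if (k_condvar_wait(&rx_cond, &rx_lock, timeout) == -EAGAIN) {
			ret = -EAGAIN; /* what the peer observes as a stall */
			break;
		}
	}
	if (ret == 0) {
		rx_data_available = false; /* consume the data */
	}
	k_mutex_unlock(&rx_lock);
	return ret;
}
```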
Fix a regression where the application is waiting for data but does not notice it because the socket layer is not woken up. This could happen because the application was about to wait on a condition variable, but the signal to wake the condvar came before the wait started. Normally, if there is a constant flow of incoming data to the socket, the signal would be given again later. But if the peer is waiting for Zephyr to reply, there might be a timeout at the peer. The solution is to add locking in the socket receive callback so that we only signal the condition variable after we have made sure that the condition variable is actually waiting for the data.

Fixes #34964

Signed-off-by: Jukka Rissanen <[email protected]>
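Continuing the previous sketch's hypothetical declarations, the fix described in this commit message amounts to the receive callback taking the reader's lock before signalling, which closes the window between the reader's check and its wait:

```c
/* Fixed network-stack side: signal only while holding the reader's lock. */
void received_cb_fixed_sketch(void)
{
	k_mutex_lock(&rx_lock, K_FOREVER);
	rx_data_available = true;
	/* Holding rx_lock guarantees the reader is either before its check
	 * (it will see the flag and not wait at all) or already parked inside
	 * k_condvar_wait() (this signal will wake it). No lost wakeup. */
	k_condvar_signal(&rx_cond);
	k_mutex_unlock(&rx_lock);
}
```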
Confirming that #35466 fixed this, tested with
Describe the bug
When running the dumb_http_server sample for qemu_x86 and connecting to it with `ab -n1000 http://192.0.2.1:8080/`, the `ab` tool aborts due to a timeout during handling of one of these 1000 requests. Previously (v2.5.0), 100K requests were served without any problem.

To Reproduce
Steps to reproduce the behavior:
ab -n1000 http://192.0.2.1:8080/
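Spelled out as commands, the flow is roughly the following (a sketch: board, sample path, and west targets as in current Zephyr trees; SLIP networking set up per the qemu_setup guide linked above in this thread):

```
# Terminal 1: build and run the sample under QEMU
west build -b qemu_x86 samples/net/sockets/dumb_http_server
west build -t run

# Terminal 2: once the server reports it is listening
ab -n1000 http://192.0.2.1:8080/
```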
The exact request number at which `ab` fails varies, but over 10 tries, none of the 1000-request runs completed successfully.

Expected behavior
Both `ab -n1000 http://192.0.2.1:8080/` and `ab -n100000 http://192.0.2.1:8080/` should complete successfully. v2.5.0 worked like that, and I verified that checking out that tag still works like that for me.

Impact
We used to have issue(s) like that previously (early 2.x releases). 2.4 and 2.5 passed this test. We're back to network stack instability with this issue.
Logs and console output
In addition to the `ab` output shown above, on the dumb_http_server sample side, the output is:

Environment (please complete the following information):