Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

Closed
emillindq opened this issue Mar 15, 2021 · 3 comments · Fixed by #33474
Closed

Poll on DTLS socket returns -EAGAIN if bind & receive any data. #33330

emillindq opened this issue Mar 15, 2021 · 3 comments · Fixed by #33474
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@emillindq
Copy link
Contributor

emillindq commented Mar 15, 2021

Describe the bug
I am running into a problem where poll() is called with timeout but returning 0 without waiting. This causes a busy-wait loop which is blocking lower-priority tasks (and essentially all tasks if time-slicing is not enabled). I am doing the following:

  1. A DTLS socket is created
  2. A call to bind() is done to open up a port.
  3. It then loops poll() with a timeout to see if any data has been received on the socket.
  4. If data is received, it will call recvfrom and handle the data.

The application is a CoAP server/client where a client call sendto starts the DTLS handshake. In other words, my end is a DTLS client.

It seems the problem is that bind() call will open up incoming data to be given to the socket FIFO without being given to mbedTLS.

if (net_context_get_type(ctx) == SOCK_DGRAM) {
SET_ERRNO(net_context_recv(ctx, zsock_received_cb, K_NO_WAIT,
ctx->user_data));
}

It seems like data is only given to mbedTLS when recv(from) or send(to) is called. So if any data comes in after bind() then this causes data to exist in the FIFO but it hasn't reached mbedTLS. Then from what I can see sockets_tls returns -EAGAIN in the update_pollin function because

  1. Data is available to read
  2. Handshake not completed
    ..but due to mentioned reasons, handshake hasn't even been started yet.

if (!is_handshake_complete(ctx)) {
goto again;
}

It will stay this way until the end of time since no handshake is started and will neither timeout nor ever finish.

I don't know if this is usage fault from my side, probably, but from what I can see even if my DTLS socket were to have the server role then if bind() is called and data is received then handshake would not be started and poll would still act this way because it seems like it's firstly when recvfrom is called that mbedTLS is fed the data and any data that comes in after bind() is called will just be put in the queue and not be given to mbedTLS. Please correct me if I'm wrong. Anyways in my opinion this kind of behavior is wrong and the case should be handled. Please comment and advice, and I can implement a fix if it's deemed necessary.

To Reproduce
Steps to reproduce the behavior:

  1. Create DTLS socket
  2. bind() to some port
  3. From extern send any data UDP data to said port
  4. --> Poll will return 0 without waiting even if timeout is given.

Expected behavior

  • If bind() is called and DTLS is set to be client, then if data is received and DTLS session is not established then the data should be discarded
  • If bind() is called and DTLS is set to be server, then if data is received then poll() should say data is available and recvfrom start DTLS handshaking.

Impact
Showstopper, kinda, since the external library is using bind()/poll() combination. I could change it to recvfrom with some non-blocking flag and then check if data was received and then k_msleep(1) but it seems like a better solution to fix the root problem.

Logs and console output
N/A

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Zephyr SDK
  • 89a9096

Additional context
Add any other context about the problem here.

@emillindq emillindq added the bug The issue is a bug, or the PR is fixing a bug label Mar 15, 2021
@rlubos
Copy link
Contributor

rlubos commented Mar 16, 2021

Thanks for reporting, indeed this seems to be a valid issue. I just wanted to clarify that we're on the same page first. In the first paragraph you wrote:

I am running into a problem where poll() is called with timeout but returning 0 without waiting. This causes a busy-wait loop

and the further in the analysis:

Then from what I can see sockets_tls returns -EAGAIN in the update_pollin function ... It will stay this way until the end of time since no handshake is started and will neither timeout nor ever finish.

Are you sure about the former statement? Because what I've observed was indeed a busy loop, but all inside a single poll() call. Setting a timeout for the poll() and sending some dummy data caused the poll() function to return 0 in a timely manner, as indicated by the timeout value, yet the poll() internals kept busy-looping for all that time. So for instance calling the poll() timeout to 1 second caused the poll() to return 0 after that 1 second, but immediately. But poll() kept busy-looping during that second, preventing other lower priority threads from running, just as you indicated.

Actually, we used to to handle this case properly (please have a look at some older revision:

if (net_context_get_type(ctx) == SOCK_DGRAM &&
), but a regression slipped in during the major TLS sockets rework (net_context decoupling). Nonetheless, I think it still could be handled in a similar way now.

So in my opinion, how poll() should work for DTLS:

In case of DTLS client:

  1. If the handshake is not complete, ztls_poll_prepare_ctx() funcion shall no call it's counterpart for the underlying socket, but should rather configure the k_poll_event to monitor the tls_established semaphore. This way, the poll() will wait either for timeout or for the handshake to complete w/o busy looping, in case the handshake takes place in some other thread.
  2. Once the handshake completes, and the timeout still hasn't expired, we should reconfigure the k_poll_event to monitor the underyling socket fifo instead in the ztls_poll_update_ctx() and return EAGAIN. Any data that arrives at this point will be fed to the poll()'s internal zsock_recv() and either consumed by mbedTLS or decrypted.

In case of DTLS server it's a bit different though:

  1. In this case the server shall be ready to handle the handshake from within the poll() function, since it's not the server who initiates this. This could be achieved in a relatively simple manner, by ignoring the tls_established value and monitoring the underlying socket at all times. In case any data arrives, it should be fed to the internal zsock_recv() which will handle the hanshake, or report an error in case some rubbish arrives.
  2. Once the handshake is complete and there's still no application data, return EAGAIN to monitor the fifo until the timeout.

What do you think, does the above make sense? I can work on a fix if we agree it's a sane approach.

@galak galak added the priority: low Low impact/importance bug label Mar 16, 2021
@emillindq
Copy link
Contributor Author

Thank you for looking into this @rlubos !

Are you sure about the former statement?

You are right, upon further investigation I concur with you, it busy waits within poll() but stays true to the timeout!

In case of DTLS client:

Yes this sounds like a good plan. Although it would seem easier to just poll mbedtls_ssl_get_bytes_avail somehow but of course that's not possible, and in addition we need to keep mbedtls module alive by calling zsock_recv() in update_pollin.

In case of DTLS server it's a bit different though:

I have nothing to add, it sounds like the cleanest solution available.

It would be great if you implement this, it feels a bit over my head at the moment. I have already two merge requests to update in my backlog as well 😄

Thanks!

@rlubos
Copy link
Contributor

rlubos commented Mar 17, 2021

I have nothing to add, it sounds like the cleanest solution available.

It would be great if you implement this, it feels a bit over my head at the moment.

Great, I'll work on the patch then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants