net: dns: update k_work API and fix misuse #33109
Cancelling an in-progress work item is not guaranteed to complete synchronously. Convert the DNS timeout to use the new delayable work structure, and use its handler to release the query slot for use in subsequent operations.
This delays the release of the query slot but avoids some race conditions.
Allocation of slots remains racy as the mixed identification of query slots through pointers and indexes, and the use of a null check for a function pointer (not compatible with the atomic_ptr API) as the in-use condition, makes it difficult to switch to a thread-safe request/release interface.
Signed-off-by: Peter Bigot <[email protected]>
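For illustration, a thread-safe request/release interface of the kind mentioned above might track the in-use condition with an explicit atomic flag rather than a null check on the callback pointer. The following is a minimal C11 sketch under that assumption. `query_slot`, `slot_request`, `slot_release`, and `NUM_SLOTS` are hypothetical names, not the resolver's actual structures, and the sketch identifies slots by index only.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 2 /* hypothetical pool size */

/* Hypothetical query slot: an explicit atomic flag marks it in use,
 * instead of testing the callback pointer for NULL. */
struct query_slot {
    atomic_bool in_use;
    void (*cb)(int status, void *user_data);
    void *user_data;
};

/* Static storage is zero-initialized, so every slot starts free. */
static struct query_slot slots[NUM_SLOTS];

/* Claim a free slot. Returns its index, or -EAGAIN if all are busy.
 * The compare-exchange ensures only one caller can win a given slot. */
int slot_request(void (*cb)(int, void *), void *user_data)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        bool expected = false;

        if (atomic_compare_exchange_strong(&slots[i].in_use,
                                           &expected, true)) {
            /* The slot is now owned by this caller. */
            slots[i].cb = cb;
            slots[i].user_data = user_data;
            return i;
        }
    }
    return -EAGAIN;
}

/* Return a slot to the pool. Clear the payload first, then publish
 * the slot as free so a new owner never sees stale fields. */
void slot_release(int idx)
{
    assert(idx >= 0 && idx < NUM_SLOTS);
    slots[idx].cb = NULL;
    slots[idx].user_data = NULL;
    atomic_store(&slots[idx].in_use, false);
}
```

With this shape, the timeout handler would be the one place that calls `slot_release()`, which matches the delayed release described above. The sketch does not cover waiting for an in-progress handler to finish before the slot is claimed again.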
LGTM
I don't know how to address the test failure; the test says:
but there is no
Looks good 👍
I'll test it with my application in the evening
There is a typo in the comment, the function name is
The failing unit test works like this
It turns out zsock_getaddrinfo() won't tolerate a delay between being notified that a query failed (timed out) and the release of the query structure where it could be re-used. If it isn't available, the second query attempt for IPv6 fails because there's no available slot (-EAGAIN), so only one of the two expected results is detected.
(Note that process_dns() is pretty fragile, because once the first result comes in the mutex has been released, and nothing can re-use it for another test. It passes currently only because test_getaddrinfo_ok() doesn't even attempt to check the mutex until both transactions have been attempted).
So the simple solution of having query_timeout() be responsible for releasing the query structure, which is why
Perhaps the subsystem should be willing to yield then retry when dns_resolve_name() returns
However the previous solution of blindly canceling the query timeout and releasing the state won't work on SMP systems, because we can't be sure the timeout isn't running the notify operation on another processor. This ties to the lack of mutex for allocating and releasing the query slots, so a fix is significantly more complicated, because the slot also can't be re-used until any in-progress timer operation completes.
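The yield-then-retry idea floated in this comment could look roughly like the sketch below. It assumes the busy condition is reported as -EAGAIN, as the comment describes for the no-free-slot case. `resolve_with_retry`, `resolve_fn`, `yield_fn`, and `fake_resolve` are hypothetical names introduced for illustration and are not part of the DNS resolver API. In Zephyr the yield callback would presumably wrap something like `k_yield()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical signatures; not part of the Zephyr DNS API. */
typedef int (*resolve_fn)(void *ctx);
typedef void (*yield_fn)(void);

/* Call resolve() up to max_attempts times, yielding between attempts
 * while it reports -EAGAIN (assumed to mean "no free query slot").
 * Returns the last result from resolve(). */
int resolve_with_retry(resolve_fn resolve, void *ctx,
                       yield_fn yield_cpu, int max_attempts)
{
    int ret = -EAGAIN;

    assert(resolve != NULL);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        ret = resolve(ctx);
        if (ret != -EAGAIN) {
            break; /* success, or an error that retrying won't fix */
        }
        if (yield_cpu != NULL) {
            yield_cpu(); /* give the timeout handler a chance to run */
        }
    }
    return ret;
}

/* Stand-in resolver for demonstration: reports "no slot" until the
 * counter pointed to by ctx reaches zero, then succeeds. */
int fake_resolve(void *ctx)
{
    int *busy_rounds = ctx;

    if (*busy_rounds > 0) {
        (*busy_rounds)--;
        return -EAGAIN;
    }
    return 0;
}
```

A bounded attempt count keeps the caller from spinning forever if the slot is never released. Retrying only hides the delayed release from the caller; it does not address the SMP hazard described above.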
I'm withdrawing this; it needs to be integrated with a solution to the thread safety problems with this module, and will require modifications of the failing unit test. I'll still try to do this, but it'll take a few days.
Thanks Peter, I'll try to fix the locking issues and the test so this can proceed.
See #33217 for locking support. I also tweaked the getaddrinfo tests a bit.
May help resolve #33101, though other race conditions are not addressed in this PR.