Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: dns: update k_work API and fix misuse #33109

Closed
wants to merge 1 commit into from

Conversation

pabigot
Copy link
Collaborator

@pabigot pabigot commented Mar 7, 2021

Cancelling an in-progress work item is not guaranteed to complete synchronously. Convert the DNS timeout to use the new delayable work structure, and use its handler to release the query slot for use in subsequent operations.

This delays the release of the query slot but avoids some race conditions.

Allocation of slots remains racy as the mixed identification of query slots through pointers and indexes, and the use of a null check for a function pointer (not compatible with the atomic_ptr API) as the in-use condition, makes it difficult to switch to a thread-safe request/release interface.

May help resolve #33101, though other race conditions are not addressed in this PR.

Cancelling an in-progress work item is not guaranteed to complete
synchronously.  Convert the DNS timeout to use the new delayable work
structure, and use its handler to release the query slot for use in
subsequent operations.

This delays the release of the query slot but avoids some race
conditions.

Allocation of slots remains racy as the mixed identification of query
slots through pointers and indexes, and the use of a null check for a
function pointer (not compatible with the atomic_ptr API) as the
in-use condition, makes it difficult to switch to a thread-safe
request/release interface.

Signed-off-by: Peter Bigot <[email protected]>
@github-actions github-actions bot added area: API Changes to public APIs area: Networking area: Tests Issues related to a particular existing or missing test labels Mar 7, 2021
Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pabigot
Copy link
Collaborator Author

pabigot commented Mar 8, 2021

I don't know how to address the test failure; the test says:

	/* This check simulates a local query that we will catch
	 * in dns_process() function. So we do not check the res variable ...

but there is no dns_process() function.

Copy link
Member

@hubertmis hubertmis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍
I'll test it with my application in the evening

@jukkar
Copy link
Member

jukkar commented Mar 9, 2021

but there is no dns_process() function.

There is a typo in the comment, the function name is process_dns() which is found in the same file.

@jukkar
Copy link
Member

jukkar commented Mar 9, 2021

The failing unit test works like this

  • call getaddrinfo() which will construct a DNS query and send it to server
  • this query should be caught by process_dns() function
  • as we do not want to implement a DNS server in this simple unit test, we do not expect to receive anything in the test_getaddrinfo_ok(), but instead expect that the process_dns() is able to parse the query and release the waiting mutex
  • the test will pass if that (query parsing) is done ok

@pabigot
Copy link
Collaborator Author

pabigot commented Mar 9, 2021

It turns out zsock_getaddrinfo() won't tolerate a delay between being notified that a query failed (timed out) and the release of the query structure where it could be re-used. If it isn't available, the second query attempt for IPV6 fails because there's no available slot (-EAGAIN), so only one of the two expected results is detected. (Note that process_dns() is pretty fragile, because once the first result comes in the mutex has been released, and nothing can re-use it for another test. It passes currently only because test_getaddrinfo_ok() doesn't even attempt to check the mutex until both transactions have been attempted).

So the simple solution of having query_timeout() be responsible for releasing the query structure, which is why -EAGAIN can be observed, doesn't work.

Perhaps the subsystem should be willing to yield then retry when dns_resolve_name() returns -EAGAIN.

However the previous solution of blindly canceling the query timeout and releasing the state won't work on SMP systems, because we can't be sure the timeout isn't running the notify operation on another processor.

This ties to the lack of mutex for allocating and releasing the query slots, so a fix is significantly more complicated, because the slot also can't be re-used until any in-progress timer operation completes.

@pabigot
Copy link
Collaborator Author

pabigot commented Mar 10, 2021

I'm withdrawing this; it needs to be integrated with a solution to the thread safety problems with this module, and will require modifications of the failing unit test. I'll still try to do this, but it'll take a few days.

@pabigot pabigot closed this Mar 10, 2021
@jukkar
Copy link
Member

jukkar commented Mar 10, 2021

Thanks Peter, I try to fix the locking issues and the test so this can proceed.

@jukkar
Copy link
Member

jukkar commented Mar 10, 2021

See #33217 for locking support. I also tweaked the getaddrinfo tests a bit.

@pabigot pabigot deleted the nordic/20210307b branch March 15, 2021 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Networking area: Tests Issues related to a particular existing or missing test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DNS resolver misbehaves if receiving response too late
4 participants