Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNTP fails to close the used socket #34165

Closed
johann2 opened this issue Apr 9, 2021 · 7 comments · Fixed by #34453
Closed

SNTP fails to close the used socket #34165

johann2 opened this issue Apr 9, 2021 · 7 comments · Fixed by #34453
Assignees
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@johann2
Copy link
Contributor

johann2 commented Apr 9, 2021

Describe the bug
Seems that the close function used in sntp_close will resolve to some close function not related to sockets instead of the one that calls zsock_close. As a result the socket remains open.

To Reproduce
I haven't been able to put together a minimal example to reproduce this.
The issue seems to happen only when CONFIG_NET_SOCKETS_POSIX_NAMES=n

Expected behaviour
sntp_close should close the socket

Impact
This is a blocking issue for us, because once enough sockets have been left open, no new sockets can be opened and the system loses connectivity.

Logs and console output
N/A

Environment (please complete the following information):

  • OS: Linux
  • Toolchain Zephyr SDK 0.11.3
  • Commit SHA 0a7ea88

Additional context
It seems to me that using the zsock_* functions as proposed in #24129 would be the easiest fix and would bypass the problem completely. I'd be happy to bring this PR up to date if needed.

However, this bug may be a symptom of something else wrong in the POSIX api. I hope someone who's more familiar with this part of the codebase could take a look. #31323 seems to be somehow related to this bug as well.

@johann2 johann2 added the bug The issue is a bug, or the PR is fixing a bug label Apr 9, 2021
@jukkar jukkar added area: Networking priority: medium Medium impact/importance bug labels Apr 9, 2021
@jukkar
Copy link
Member

jukkar commented Apr 9, 2021

@pfalcon can you take a look at this as this seems to be related to the Posix API issues we have been seeing occacionally.

@pfalcon
Copy link
Contributor

pfalcon commented Apr 9, 2021

@johann2:

Seems that the close function used in sntp_close will resolve to some close function
I haven't been able to put together a minimal example to reproduce this.

That doesn't sounds too reassuring. To fix the issue, we need to see the issue. So, please try to provide steps to reproduce (ideally, reproduction code + build files).

The issue seems to happen only when CONFIG_NET_SOCKETS_POSIX_NAMES=n

Then the likely reason is that SNTP should depend on CONFIG_NET_SOCKETS_POSIX_NAMES=y (or POSIX_API=y). I can prepare such a patch.

Toolchain Zephyr SDK 0.11.3

The latest SDK release is 0.12.4. 0.11.3 is 4-5 releases old. Please upgrade and retry.

It seems to me that using the zsock_* functions as proposed in #24129 would be the easiest fix and would bypass the problem completely. I'd be happy to bring this PR up to date if needed.

As best as I remember, SNTP is a completely "userspace" lib, and so intended to work with native POSIX API, not adhoc Zephyr wrappers (for example, there could be counter-proposal to remove zsock_* prefixes, though I don't think anybody came up with that yet). So, I personally can't recommend such a workaround. (But maybe somebody else will.)

However, this bug may be a symptom of something else wrong in the POSIX api.

Yes, POSIX API is experimental and needs more work. For that, we need tangible examples of problem cases, and seek to find solutions, not workarounds.

@johann2
Copy link
Contributor Author

johann2 commented Apr 13, 2021

An update:
The issue can be easily reproduced using the sntp_client example.
Set CONFIG_NET_SOCKETS_POSIX_NAMES=n and change the inet_pton function calls in the example to zsock_inet_pton
I've also updated the SDK, however it didn't affect anything.

As best as I remember, SNTP is a completely "userspace" lib, and so intended to work with native POSIX API, not adhoc Zephyr wrappers

I still don't understand why depending on POSIX api is desireable. It just makes tracking down things a lot harder. The API is not properly namespaced, which means it's prone to issues like these, and it pollutes the namespace everywhere sntp.h is included.

It could be argued that if all other functions are properly namespaced, this wouldn't be an issue, but unfortunately there's a lot of weird code out there, especially in the embedded world.

Is the zsock_* API unstable or an internal api?

@jukkar
Copy link
Member

jukkar commented Apr 13, 2021

Is the zsock_* API unstable or an internal api?

No, it is a public API. The BSD socket API calls the zsock_ functions.

@pfalcon
Copy link
Contributor

pfalcon commented Apr 13, 2021

@johann2 : Good timing. I also was looking into this stuff and posted #34244.

Set CONFIG_NET_SOCKETS_POSIX_NAMES=n

So, the question is why do you set that? Per the current state of affairs (not how it was 5 or so years ago), if you use networking in an application (== sockets), either CONFIG_POSIX_API or CONFIG_NET_SOCKETS_POSIX_NAMES should be set. Or, you're doing something very peculiar. What?

I've also updated the SDK, however it didn't affect anything.

Thanks. It at least ensures that we're on the same line, and there's less search space to wander thru. (And some time ago there were changes in the toolchain regarding handling stuff like close()).

I still don't understand why depending on POSIX api is desireable.

Well, I don't think that phrase conveys the full picture of it. BSD Socket API is a part of POSIX API, so that's how it gets into picture. But we exactly don't want to force use of the full POSIX subsys just to use sockets, that's why we have CONFIG_NET_SOCKETS_POSIX_NAMES. Finally, when Sockets API was initially developed ~5 years ago as an experimental project, there was also a requirement to namespace it, mostly from concerns of integrating with vendor network offloading engines, which themselves sometimes provide socket-like API. This last usecase is the least (if at all) used, and shouldn't be of concern for most users. #34244 tries to address that by making sure that either CONFIG_POSIX_API or CONFIG_NET_SOCKETS_POSIX_NAMES is enabled, and only users which have peculiar needs take explicit action (disabling it).

The API is not properly namespaced, which means it's prone to issues like these, and it pollutes the namespace everywhere sntp.h is included.

Which API exactly is not properly namespaced? BSD Socket API is the cross-OS de jure and de facto standard for networking connectivity. In that regard its more reasonable expect that SNTP (network protocol) depends on standard cross-platform networking API, than on adhoc Zephyr-specific API.

It could be argued that if all other functions are properly namespaced, this wouldn't be an issue, but unfortunately there's a lot of weird code out there, especially in the embedded world.

Yes, that's why using zsock_ prefixes was among the requirements for the initial implementation. Fast forward a few years, and it seems more like overly conservative, reinsuring constraint, which causes more confusion and maintenance work than real boon.

Is the zsock_* API unstable or an internal api?

It's an implementation detail of Zephyr RTOS, otherwise being on par with the standard cross-platform API. That's why I'm trying to understand why people aren't using the standard API (and even propose to switch standard API to Zephyr-specific in existing code).

pfalcon added a commit to pfalcon/zephyr that referenced this issue Apr 13, 2021
Zephyr socket subsystem has come a long way since initial experimental
alternative to internal Zephyr networking API. Its configuration also
mirrors the usual conservative approach, where a user needs to
explicitly enable options to get "more" features. And as an
experimental API, socket subsystem was initially developed as
namespaced API, where all functions/structures are prefixed with
"zsock_", and to get standard names, CONFIG_NET_SOCKETS_POSIX_NAMES
needs to be set (or alternatively, CONFIG_POSIX_API needs to be, which
enabled full POSIX subsys overall).

However, a few years later, sockets are the standard networking API,
and in majority of cases its used under the standard POSIX names.
Necessity to explicitly set an option to achieve this effects, and
confusion which results from it - are just unneeded chores for users.

So, switch CONFIG_NET_SOCKETS_POSIX_NAMES to be on by default (unless
CONFIG_POSIX_API is already defined). It still can be explicitly
disabled if needed (but usecases for that would be peculiar and rare).

Addresses zephyrproject-rtos#34165

Signed-off-by: Paul Sokolovsky <[email protected]>
@johann2
Copy link
Contributor Author

johann2 commented Apr 13, 2021

So, the question is why do you set that? Per the current state of affairs (not how it was 5 or so years ago), if you use networking in an application (== sockets), either CONFIG_POSIX_API or CONFIG_NET_SOCKETS_POSIX_NAMES should be set. Or, you're doing something very peculiar. What?

In our main codebase this option was implicitly set to n as I wasn't aware it's required for SNTP. I only set it to n to reproduce this using the sntp_client sample.

That's why I'm trying to understand why people aren't using the standard API (and even propose to switch standard API to Zephyr-specific in existing code).

The zsock_* API is in my opinion better. While not a global standard, zsock_close is more clear and less ambiguous than just close, especially for someone who hasn't done a lot of low-level unix C programming. Since the SNTP code is part of the zephyr code and isn't meant to be ported to other OSes, I don't see why public Zephyr API can't be used here, especially if it would be clearer and more readable, fix a bug and remove dependency on an experimental API.

In any case, I don't feel strongly enough about this question to push it further. There's a lot of ways to fix this bug and whatever you decide to do would work for me.

nashif pushed a commit that referenced this issue Apr 13, 2021
Zephyr socket subsystem has come a long way since initial experimental
alternative to internal Zephyr networking API. Its configuration also
mirrors the usual conservative approach, where a user needs to
explicitly enable options to get "more" features. And as an
experimental API, socket subsystem was initially developed as
namespaced API, where all functions/structures are prefixed with
"zsock_", and to get standard names, CONFIG_NET_SOCKETS_POSIX_NAMES
needs to be set (or alternatively, CONFIG_POSIX_API needs to be, which
enabled full POSIX subsys overall).

However, a few years later, sockets are the standard networking API,
and in majority of cases its used under the standard POSIX names.
Necessity to explicitly set an option to achieve this effects, and
confusion which results from it - are just unneeded chores for users.

So, switch CONFIG_NET_SOCKETS_POSIX_NAMES to be on by default (unless
CONFIG_POSIX_API is already defined). It still can be explicitly
disabled if needed (but usecases for that would be peculiar and rare).

Addresses #34165

Signed-off-by: Paul Sokolovsky <[email protected]>
pfalcon added a commit to pfalcon/zephyr that referenced this issue Apr 21, 2021
This library is coded with standard POSIX names for socket functions,
so make that requirement explicit.

Also, switch it from select'ing NET_SOCKETS, to depend'ing on it. This
follows the general approach of avoiding unneeded select's in Zephyr,
which lead to conflicting dependencies and make debugging dependencies
complex overall. In this particular case, it's fair (for a user) to
expect that "simple network time protocol" requires networking API,
namely sockets, and have that explicitly on in their app configuration,
giving better overview of their app config overall.

Fixes: zephyrproject-rtos#34165

Signed-off-by: Paul Sokolovsky <[email protected]>
@pfalcon
Copy link
Contributor

pfalcon commented Apr 21, 2021

@johann2:

In our main codebase this option was implicitly set to n as I wasn't aware it's required for SNTP. I only set it to n to reproduce this using the sntp_client sample.

That was my hunch, thanks for confirming it. That's why I went with #34244 , because that's the actual expectation how it should be for last few years.

The zsock_* API is in my opinion better. While not a global standard, zsock_close is more clear and less ambiguous than just close

Yes, "zsock_" is "better" as a thing-in-its-own-small-box. That's why I went the "zsock_" prefix in the initial prototype many years ago. But the goal was different - to support the standard BSD Sockets API. Using "zsock_" was just an interim implementation strategy I chose. It grew a lot since then, and we actually got more options to support on the "real BSD Sockets API" side, like full POSIX subsystem vs just sockets, protected kernel with syscalls, etc. Overall usage of "real BSD Sockets API" vs "adhoc shim" is 90% vs 10% (more likely 99% vs 1%). So, "zsock_" overall has become more of a nuisance, and as I mentioned, I won't be surprised if someone proposed to remove it (that won't be me, given all the effort I put into maintaining it over the years).

especially for someone who hasn't done a lot of low-level unix C programming.

It's not "low-level unix C programming", it's Portable Operating Systems Interface standard (abbreviated POSIX), the only open cross-industry OS API standard. The aim of Zephyr was to improve the situation with adhoc and proprietary APIs in the deeply embedded space (disclaimer: to the best of my knowledge, I don't set aims, and they change over time). Hence, standard API, known to majority of people. And for minority not familiar with it, it's again more beneficial to learn the standard API than yet another adhoc one.

Since the SNTP code is part of the zephyr code and isn't meant to be ported to other OSes

That's too generic question to even bring here IMHO. There was other RTOSes started in about the same time as Zephyr, let's call them colloquially "Zephyr competitors". What do you think - did they mean their code to be ported to other OSes? Parts of their code is now in Zephyr (and in other projects). So, generally, any open source project's code is meant for reuse in other projects. And it's definitely not "the code is not meant to be ported to other OSes and to emphasize that better, let's use proprietary API there" - unless someone explicitly sets that as the goal ;-).

In any case, I don't feel strongly enough about this question to push it further. There's a lot of ways to fix this bug and whatever you decide to do would work for me.

I tried to reply to your comments and explain the paradigmatic approach I follow. From my PoV, adding "zsock_" doesn't improve situation, but just preserves the (messy) situation and pushes it out of Zephyr tree into users' codebases (where they're expected to take existing socket code and it to just work). That's why I'm keen to keep specimen(s) of standard BSD Socket API usage in the tree, and look for ways to actually improve the situation around that usage. That's not easy, given the breadth and multitude of configurations Zephyr supports, but pretending that the problem doesn't exist (by retracting into the "zsock_" shell) is not an option for Zephyr developers.

jukkar pushed a commit that referenced this issue Apr 22, 2021
This library is coded with standard POSIX names for socket functions,
so make that requirement explicit.

Also, switch it from select'ing NET_SOCKETS, to depend'ing on it. This
follows the general approach of avoiding unneeded select's in Zephyr,
which lead to conflicting dependencies and make debugging dependencies
complex overall. In this particular case, it's fair (for a user) to
expect that "simple network time protocol" requires networking API,
namely sockets, and have that explicitly on in their app configuration,
giving better overview of their app config overall.

Fixes: #34165

Signed-off-by: Paul Sokolovsky <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Networking bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants