
net: Use k_fifo instead of k_work in RX and TX processing #34703

Merged

Conversation


@jukkar jukkar commented Apr 29, 2021

A k_work handler cannot manipulate the k_work item it is running from. This makes it hard to clean up a net_pkt, because the packet embeds a k_work. Because of this, use a k_fifo instead between
the RX thread and the network driver, and between the application and the TX
thread.

An echo-server/client run with IPv4 and UDP gave the following
results:

Using k_work
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	21922		5543071	103 us	[0->41->26->34=101 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (0)	0		0	-
[1] BE (0)	21925		6039151	97 us	[0->21->16->37->20=94 us]
Using k_fifo
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	15079		3811118	94 us	[0->36->23->32=91 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (1)	0		0	-
[1] BE (0)	15073		4150947	79 us	[0->17->12->32->14=75 us]

So using k_fifo gives about 10% better performance with the same
workload.

Fixes #34690

Signed-off-by: Jukka Rissanen [email protected]

@github-actions github-actions bot added area: API Changes to public APIs area: Networking labels Apr 29, 2021
@jukkar jukkar added this to the v2.6.0 milestone Apr 29, 2021

@pabigot pabigot left a comment


Using dedicated threads instead of a work queue thread that's only ever going to be asked to run one handler function makes a lot more sense, especially when you don't need delays or cancellation. If you need flush you can probably handle it better with the dedicated solution anyway.

Review thread on subsys/net/ip/net_tc.c (outdated, resolved)

jukkar commented Apr 30, 2021

@alexanderwachter I converted 6locan to not use k_work in net_pkt. Unfortunately I have no way to test the changes, so I am asking whether you would be able to review them and check that they work as expected.


jukkar commented Apr 30, 2021

@mniestroj @tsvehagen You seem to have been the most recent to modify the offloading ESP WiFi driver. The driver uses net_pkt for its internal use, but we want to get rid of the k_work in net_pkt. Unfortunately the driver looked somewhat convoluted, so I was not able to modify it. Any suggestions on how to solve this issue?

@jukkar jukkar requested a review from tsvehagen April 30, 2021 08:56
@tsvehagen

I think the use of k_work in net_pkt was a quite recent change. Not sure if it's easy to revert or if we should solve it some other way. What do you think @mniestroj?


jukkar commented May 5, 2021

Does anyone have time to review this one? It would be nice to get it merged by Friday.


@mnkp mnkp left a comment


The way the commits are split makes the changes easy to review and understand. However, if someone runs git bisect while investigating their ESP WiFi code and then happens to check out

  • net: Use k_fifo instead of k_work in RX and TX processing

they will get a compile time error. Maybe it would be better to squash the two commits

  • drivers: wifi: esp: Temporary fix to allow compilation
  • net: Use k_fifo instead of k_work in RX and TX processing

together?

Review thread on subsys/net/l2/canbus/6locan.c (resolved)

jukkar commented May 5, 2021

Maybe it would be better to squash the two commits

* drivers: wifi: esp: Temporary fix to allow compilation

* net: Use k_fifo instead of k_work in RX and TX processing

together?

Perhaps, or we could move the esp change before the main commit net: Use k_fifo instead of k_work in RX and TX processing. This way there should be no issue with bisect.

I wonder if we should change the name of the function, since _work_handler is typically associated with Workqueue Threads.

I did not want to make too many naming changes. The thread is basically doing "work" here, it just does not happen via k_work. Function naming always leads to interesting discussions, so if there are suggestions for what these changed functions should be called, I am all ears.

@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from 4c184b7 to 400bc50 Compare May 5, 2021 11:06
Review threads on subsys/net/l2/ppp/ppp_l2.c (outdated, resolved) and subsys/net/l2/ppp/Kconfig (resolved)
Comment on lines +134 to +141
/** TX queue */
struct k_fifo tx_queue;
/** RX error queue */
struct k_fifo rx_err_queue;
/** Queue handler thread */
struct k_thread queue_handler;
/** Queue handler thread stack */
K_KERNEL_STACK_MEMBER(queue_stack, 512);

It looks like "600 bytes * number of CAN contexts" more RAM is used after this change. I am fine with this change as is, but in the future we might want to use a single thread for CAN, same as for PPP.


It looks like "600 bytes * number of CAN contexts" more RAM is used after this change. I am fine with this change as is, but in the future we might want to use a single thread for CAN, same as for PPP.

Yeah, the 6locan driver almost supports multiple contexts/interfaces. It looks like the intent was to support multiple contexts, but some bits and pieces are missing to fully support this. I did not want to start tweaking this driver for now, but it can be done later if really needed. I will leave this to @alexanderwachter to decide.

Comment on lines 96 to 94
#if defined(CONFIG_WIFI_ESP)
/* TODO: The work is to be removed after the esp offloading
* wifi driver is modified to use its own private work struct.
*/
struct k_work esp_work;
#endif

I need to find some time to fix/change ESP driver, it won't happen this week. I wonder if there might be other downstream drivers that use pkt->work. net_pkt_work() API was kind of public, so I wonder if we should follow deprecation policy here for other drivers. So for example instead of introducing esp_work and removing work, add something like:

struct net_pkt {
	union {
#ifdef CONFIG_NET_PKT_WORK
		/** Internal variable that is used when packet is sent
		 * or received.
		 */
		struct k_work work;
#endif
...

with hidden Kconfig option:

config NET_PKT_WORK
	bool

which could be selected by ESP driver and any other downstream drivers or L2 layers.


The k_work is strictly for internal use so we are free to remove it if needed. Any downstream user using it will need to change its code.


Then what makes struct k_work work; different from struct net_pkt_cursor cursor;? They are both described as Internal variable... or Internal buffer..., which means that neither should be used by downstream code, right? But if someone wants to implement a driver like ESP-AT (but for a different chip), then net_pkt_cursor_init() still needs to be used to properly initialize the packet before pushing it to other networking layers. That, however, makes pkt->cursor no longer internal. Well, it all depends on what "internal" really means.

Maybe we should better describe which structure members are reserved by native networking stack? And prevent using them by networking drivers, like ESP-AT does right now with pkt->work?


Maybe we should better describe which structure members are reserved by native networking stack? And prevent using them by networking drivers, like ESP-AT does right now with pkt->work?

Yes, that would be a good thing to do. It is a bit difficult to prevent code from using the "internal" variables, as we need to have net_pkt out in the open. The access functions for the internal variables are not documented, so downstream code should not use these variables anyway. As for your cursor example, the cursor manipulation API is there and documented, so I would not consider it internal. Applications should never access the variables directly anyway.

@mniestroj

@mniestroj @tsvehagen You seem to have been the most recent to modify the offloading ESP WiFi driver. The driver uses net_pkt for its internal use, but we want to get rid of the k_work in net_pkt. Unfortunately the driver looked somewhat convoluted, so I was not able to modify it. Any suggestions on how to solve this issue?

@jukkar I will look into the ESP driver, but not this week. Unfortunately reverting to k_fifo (as was used before) is not an option, due to bugs that using a work object allowed us to solve. It will be major work for the ESP driver to get rid of pkt->work...


jukkar commented May 5, 2021

I will look into the ESP driver, but not this week. Unfortunately reverting to k_fifo (as was used before) is not an option, due to bugs that using a work object allowed us to solve. It will be major work for the ESP driver to get rid of pkt->work...

Yes, I understand. I looked at the driver, but it seemed like quite a lot of work, so I did not try to do it myself. I think we should merge this PR as is, and the ESP changes can come when they are ready.

@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from 400bc50 to ec308c3 Compare May 5, 2021 14:40
Review threads on subsys/net/ip/net_tc.c (outdated, resolved)
@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from ec308c3 to 966e987 Compare May 5, 2021 16:29

pfalcon commented May 5, 2021

I agree that this patchset leads to mixed feelings. On one hand, it definitely cleans up the structure and processing - it is quite natural, if we need to decouple IRQ-based and normal processing, to put packets in a queue. On the other hand, it leads to a proliferation of different queues and, most importantly, handler threads which are just duplicates of each other and differ only in which actual handler function they call.

It immediately suggests adding a handler callback to net_pkt, plus a common handler thread which would just call that callback. But that is apparently the reason why a workqueue was used in the first place! It offers exactly that functionality, except it comes with a stipulation to which the net_pkt handling does not adhere.

All in all, I guess it's a good change, but it is worth further optimization to somehow minimize the number of processing threads (and their stacks). (Not giving +1 right away, as I didn't test these changes yet.)

@mniestroj

The k_fifo vs k_work time measurements seem suspicious to me, and I honestly don't believe in a 10% improvement. There is only one step (maybe two with CAN, but I am not sure) in the networking TX packet path where this PR should make a difference. This is the net_tc.c code, when CONFIG_NET_TC_TX_COUNT >= 1. So having results like:

k_work
[0] BK (1)	21922		5543071	103 us	[0->41->26->34=101 us]

k_fifo
[0] BK (1)	15079		3811118	94 us	[0->36->23->32=91 us]

seems wrong. IMO just 26 vs 23 shows the difference between k_work and k_fifo. But then, something had to be wrong with the measurement, as the other numbers should theoretically be equal. Maybe the measurements were not averaged over multiple runs?

I made some measurements with samples/net/sockets/echo_server and qemu_x86 with overlay-e1000.conf. As a result I get:

k_work
Avg TX net_pkt (709) time 41 us [0->19->6->13=38 us]

k_fifo
Avg TX net_pkt (709) time 39 us [0->19->4->13=36 us]

when taking the average from multiple measurements in the TX path. In the case of RX, I see maybe 1 us difference out of a total of 25 us - too little to draw any conclusions.

@mniestroj

I do like the net: Use k_fifo instead of k_work in RX and TX processing commit, as it just changes from workqueue threads to specialized threads consuming a k_fifo. It also comes with some performance improvement.

I don't like the ppp and 6locan changes, however. Most importantly, new threads are added for no good (enough) reason, which results in higher RAM and FLASH usage. I tested PPP with an nRF52-based project, and it seems that RAM usage increases by 1216 bytes and FLASH usage by 276 bytes. I don't like it, as I believe 32 kB of RAM was enough to run PPP with some cellular modem and a simple app; 1216 bytes is a big difference in that case. On the other hand, the PPP change isn't that bad if we take into account that fixing the PPP implementation will (probably) allow us to drop this thread. The reason why it shouldn't be needed is that the following comment:

/* Do not call net_send_data() directly in order to make this
* thread run before the sending happens. If we call the
* net_send_data() from this thread, then in fast link (like
* when running inside QEMU) the reply might arrive before we
* have returned from this function. That is bad because the
* fsm would be in wrong state and the received pkt is dropped.
*/
k_work_init(net_pkt_work(pkt), ppp_pkt_send);
k_work_submit(net_pkt_work(pkt));

is simply wrong. Calling either k_work_submit or k_fifo_put can reschedule threads, so the original problem is still there, just possibly not reproduced as often.


jukkar commented May 6, 2021

The k_fifo vs k_work time measurements seem suspicious to me, and I honestly don't believe in a 10% improvement. There is only one step (maybe two with CAN, but I am not sure) in the networking TX packet path where this PR should make a difference. This is the net_tc.c code, when CONFIG_NET_TC_TX_COUNT >= 1. So having results like:

k_work
[0] BK (1)	21922		5543071	103 us	[0->41->26->34=101 us]

k_fifo
[0] BK (1)	15079		3811118	94 us	[0->36->23->32=91 us]

seems wrong. IMO just 26 vs 23 shows the difference between k_work and k_fifo. But then, something had to be wrong with the measurement, as the other numbers should theoretically be equal. Maybe the measurements were not averaged over multiple runs?

The number of packets is shown there: for k_work it was 21922 and for k_fifo 15079.

I made some measurements with samples/net/sockets/echo_server and qemu_x86 with overlay-e1000.conf. As a result I get:

k_work
Avg TX net_pkt (709) time 41 us [0->19->6->13=38 us]

k_fifo
Avg TX net_pkt (709) time 39 us [0->19->4->13=36 us]

when taking average from multiple measurements in TX path. In case of RX, I see maybe 1us difference from total of 25us - too little to make any conclusions.

You do not have any TC threads in your test (note that the default number of TX threads is now 0), so you will not see any difference between k_work and k_fifo.
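As a hedged illustration (the option name comes from this discussion; the exact sample setup is an assumption), a minimal prj.conf fragment that enables one TX traffic class, so that the k_work/k_fifo hand-off step is actually exercised:

```conf
# Enable one TX traffic class so packets go through the TX queue thread
CONFIG_NET_TC_TX_COUNT=1
```

Without this option the TX path bypasses the queue entirely, which is why no difference between k_work and k_fifo shows up in the measurements.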

@mniestroj

Maybe the measurements were not averaged over multiple runs?

The number of packets is shown there: for k_work it was 21922 and for k_fifo 15079.

Okay. I used around 700 (and sometimes 7000) packets, which is the equivalent of 100 (1000) runs of echo-client. But then I restarted the qemu_x86 machine, retried the whole test multiple times, and noticed that the results differ (easily by several %). Also, running 21922 vs 15079 packets would make a big difference by itself. So I tried to average over those multiple "cold-start" runs, always using the same number of echo-client runs. I also removed all logs, so they don't interfere.

You do not have any TC threads in your test (note that the default number of TX threads is now 0), so you will not see any difference between k_work and k_fifo.

I do have a TX TC thread. I forgot to mention that I specifically added CONFIG_NET_TC_TX_COUNT=1. Without that option there was no difference at all (which is expected), i.e. there was no step taking 6 us in the k_work case and 4 us in the k_fifo case.


@rlubos rlubos left a comment


The generic net changes LGTM.

@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from 966e987 to ac009f8 Compare May 7, 2021 07:03

jukkar commented May 7, 2021

Fixed merge conflict

@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from ac009f8 to 0fc857c Compare May 7, 2021 07:50

@tsvehagen tsvehagen left a comment


ESP changes LGTM

@carlescufi

@jukkar please rebase

jukkar added 3 commits May 7, 2021 14:14
The following commits will remove k_work from net_pkt, so convert
the 6locan L2 to use a k_fifo between the application and the TX thread,
and between the driver and the RX error handler.

Signed-off-by: Jukka Rissanen <[email protected]>
The following commits will remove k_work from net_pkt, so convert
the PPP L2 to use a k_fifo when sending PPP data.

Signed-off-by: Jukka Rissanen <[email protected]>
A k_work handler cannot manipulate the k_work item it is running from. This means
that it is hard to clean up a net_pkt, because the packet embeds
a k_work. Because of this, use a k_fifo instead between
the RX thread and the network driver, and between the application and the TX
thread.

An echo-server/client run with IPv4 and UDP gave the following
results:

Using k_work
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	21922		5543071	103 us	[0->41->26->34=101 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (0)	0		0	-
[1] BE (0)	21925		6039151	97 us	[0->21->16->37->20=94 us]

Using k_fifo
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	15079		3811118	94 us	[0->36->23->32=91 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (1)	0		0	-
[1] BE (0)	15073		4150947	79 us	[0->17->12->32->14=75 us]

So using k_fifo gives about 10% better performance with the same workload.

Fixes zephyrproject-rtos#34690

Signed-off-by: Jukka Rissanen <[email protected]>
@jukkar jukkar force-pushed the bug-34690-k_work-issue-in-net branch from 0fc857c to 2f435fc Compare May 7, 2021 11:14
@galak galak merged commit bcdc762 into zephyrproject-rtos:master May 7, 2021
@jukkar jukkar deleted the bug-34690-k_work-issue-in-net branch May 7, 2021 14:51

pfalcon commented May 7, 2021

For reference, I tested this (with master at f05ea67) on frdm_k64f with the dumb_http_server and big_http_download samples; both work well. In my testing, big_http_download had higher throughput without this patch, but that is not scientific, as I downloaded different amounts of data each time, and it is affected by the Internet connection anyway.

Trying to test on qemu_x86, I hit #34964 (not related to this patch, present in master too), so didn't look into it further.

Development

Successfully merging this pull request may close these issues.

net: process_rx_packet() work handler violates requirements of Workqueue Threads implementation
9 participants