net: process_rx_packet() work handler violates requirements of Workqueue Threads implementation #34690

Closed
mnkp opened this issue Apr 29, 2021 · 2 comments · Fixed by #34703
Labels: area: Networking · bug · priority: medium

mnkp (Member) commented Apr 29, 2021

Describe the bug

The Workqueue Threads implementation imposes the following requirement

A pending work item must not be altered until the item has been processed by the workqueue thread. This means a work item must not be re-initialized while it is busy. Furthermore, any additional information the work item’s handler function needs to perform its work must not be altered until the handler function has finished executing.
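To illustrate the rule (a minimal sketch of my own, not code from the Zephyr tree): once a k_work item has been submitted, neither the item itself nor the memory holding it may be touched until its handler has finished.

```c
#include <zephyr.h>
#include <string.h>

static void handler(struct k_work *work)
{
	ARG_UNUSED(work);
	/* ... do the deferred work ... */
}

static struct k_work my_work;

void submit_example(void)
{
	k_work_init(&my_work, handler);
	k_work_submit(&my_work);

	/* Both of these violate the requirement while my_work is still
	 * pending or being executed by the workqueue thread:
	 */
	k_work_init(&my_work, handler);        /* re-initializing a busy item */
	memset(&my_work, 0, sizeof(my_work));  /* overwriting/freeing its memory */
}
```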

This requirement is violated by the process_rx_packet(struct k_work *work) work handler function implemented in subsys/net/ip/net_core.c.

In the RX path, when data is received, e.g. from an Ethernet driver, the packet is submitted to a workqueue in interrupt context and then processed by the process_rx_packet(struct k_work *work) work handler function in thread context. On an error path (when the packet is not passed on to higher layers but needs to be discarded), process_rx_packet() frees the memory allocated for the packet, and with it the memory where the k_work item is stored; i.e. the k_work item is ultimately freed before process_rx_packet() returns. If the same memory block is then obtained and resubmitted by the Ethernet driver, the content of the still-busy k_work item could be overwritten.
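
A simplified sketch of the problematic pattern (my illustration, not the exact net_core.c code; it assumes the k_work member of struct net_pkt is named work, and handle_packet() is a hypothetical stand-in for the real upper-layer hand-off):

```c
#include <zephyr.h>
#include <net/net_pkt.h>

/* Hypothetical stand-in for the real upper-layer processing;
 * <0 means the packet must be discarded (the error path above).
 */
static int handle_packet(struct net_pkt *pkt)
{
	ARG_UNUSED(pkt);
	return -1;  /* stub: always take the error path in this sketch */
}

/* Runs in the RX workqueue thread. */
static void process_rx_packet(struct k_work *work)
{
	struct net_pkt *pkt = CONTAINER_OF(work, struct net_pkt, work);

	if (handle_packet(pkt) < 0) {
		/* Freeing the packet also frees pkt->work, i.e. the k_work
		 * item the workqueue thread is still busy with. If the
		 * Ethernet driver gets the same memory block back and
		 * resubmits it, the busy k_work gets overwritten.
		 */
		net_pkt_unref(pkt);
	}
}
```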

Impact
This is a race condition; it may lead to data loss or internal data corruption in the networking stack.

Environment (please complete the following information):

  • zephyr-v2.5.0
mnkp added the bug and area: Networking labels on Apr 29, 2021
jukkar added the priority: medium label on Apr 29, 2021
jukkar (Member) commented Apr 29, 2021

Thanks @mnkp.

This issue was already discussed in Slack, with the following summary:

  • Because of this workqueue restriction, it is not easy to free the net_pkt that contains the used k_work, as we cannot unref it from the work handler (that would overwrite the k_work, which is still in use)
  • One possible solution could be to use a k_fifo instead of k_work in the RX path, so that net_data_recv() would place the net_pkt (containing the received network data) into a k_fifo instead of calling the workqueue handler (see the sketch after this list)
  • This way we could remove the k_work (at least in the RX path) from net_pkt, saving some bytes
  • The changes would be quite minimal and mostly contained in net_core.c
  • The performance impact of this is unknown atm
  • The TX path has the same issue, so it would need tweaking too.
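
A minimal sketch of this direction (my illustration, not the actual code from PR #34703; rx_queue_pkt(), hand_off_to_stack() and the thread setup are assumed names): the driver side queues the net_pkt into a k_fifo and a dedicated RX thread pulls it out, so net_pkt no longer needs to embed a k_work.

```c
#include <zephyr.h>
#include <net/net_pkt.h>
#include <net/net_if.h>

K_FIFO_DEFINE(rx_fifo);

/* Hypothetical replacement for the k_work_submit() call in the RX path:
 * the driver-facing entry point just queues the packet. Note that
 * k_fifo_put() reserves the first word of the queued item for the
 * kernel's internal linked list, so net_pkt would have to account for
 * that.
 */
void rx_queue_pkt(struct net_if *iface, struct net_pkt *pkt)
{
	ARG_UNUSED(iface);
	k_fifo_put(&rx_fifo, pkt);
}

/* Hypothetical hand-off to the IP stack; <0 means drop the packet. */
static int hand_off_to_stack(struct net_pkt *pkt)
{
	ARG_UNUSED(pkt);
	return -1;  /* stub: always take the error path in this sketch */
}

static void rx_thread(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1);
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	while (1) {
		struct net_pkt *pkt = k_fifo_get(&rx_fifo, K_FOREVER);

		if (hand_off_to_stack(pkt) < 0) {
			/* Safe to free here: pkt no longer embeds a k_work
			 * owned by a workqueue thread.
			 */
			net_pkt_unref(pkt);
		}
	}
}

K_THREAD_DEFINE(rx_tid, 2048, rx_thread, NULL, NULL, NULL, 7, 0, 0);
```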

I will investigate whether we could still implement this for 2.6.

FYI cc: @rlubos @mniestroj

jukkar (Member) commented Apr 29, 2021

The changes were actually quite easy to do, and we get 10% better performance for both RX and TX. I think it is a good candidate for 2.6.

jukkar added a commit to jukkar/zephyr that referenced this issue May 7, 2021
The k_work handler cannot manipulate the k_work it was invoked
with. This means that it is not easy to clean up the net_pkt,
because it contains the k_work. Because of this, use a k_fifo
instead between the RX thread and the network driver, and between
the application and the TX thread.

An echo-server/client run with IPv4 and UDP gave the following
results:

Using k_work
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	21922		5543071	103 us	[0->41->26->34=101 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (0)	0		0	-
[1] BE (0)	21925		6039151	97 us	[0->21->16->37->20=94 us]

Using k_fifo
------------
TX traffic class statistics:
TC  Priority	Sent pkts	bytes	time
[0] BK (1)	15079		3811118	94 us	[0->36->23->32=91 us]
[1] BE (0)	0		0	-
RX traffic class statistics:
TC  Priority	Recv pkts	bytes	time
[0] BK (1)	0		0	-
[1] BE (0)	15073		4150947	79 us	[0->17->12->32->14=75 us]

So using k_fifo gives about 10% better performance with the same workload.

Fixes zephyrproject-rtos#34690

Signed-off-by: Jukka Rissanen <[email protected]>
galak pushed a commit that referenced this issue May 7, 2021