Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupt CBOR payloads in MCUMGR when sending multiple commands together #32579

Closed
jrhees-cae opened this issue Feb 23, 2021 · 16 comments · Fixed by #33780
Closed

Corrupt CBOR payloads in MCUMGR when sending multiple commands together #32579

jrhees-cae opened this issue Feb 23, 2021 · 16 comments · Fixed by #33780
Assignees
Labels
area: mcumgr bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@jrhees-cae
Copy link

When using MCUMGR SMP commands in quick succession, corrupt CBOR payloads are seen by the parser.

@jrhees-cae jrhees-cae added the bug The issue is a bug, or the PR is fixing a bug label Feb 23, 2021
@jrhees-cae
Copy link
Author

Suspect that the issue is related to incorrect fifo use for incoming netbuf's:

In mcumgr/smp.c, I replaced k_fifo_put and k_fifo_get with net_buf_put and net_buf_get and can no longer reproduce the issue.

Suggest correcting these calls and auditing other modules within Zephyr to ensure there are no other instances of k_fifo_put/get with net buf's.

For context, see: https://docs.zephyrproject.org/latest/reference/networking/net_buf.html

@mniestroj
Copy link
Member

@jrhees-cae Could you provide more context related to this issue? How to reproduce it? Which Zephyr version did you use, what type of transport protocol was used (BT, UART, SHELL, UDP), etc.? On which side do corrupted CBOR payloads appear (Zephyr or host)?

Looking quickly at net_buf_{get,put} vs k_fifo_{get,put}, I think that using k_fifo API is correct in current context - at least I don't see anything obviously wrong.

@jrhees-cae
Copy link
Author

According to the docs (and my reading of the buf source code), netbuf references should always use net_buf_put/get in case the netbuf has fragments.
I am using NCS 1.4 (not sure which Zephyr version it includes).
Transport protocol is BT. The corrupted CBOR payloads were on Zephyr side (e.g., parsing CBOR coming from the host).
When I added debug code to look at the netbuf payload contents before the fifo_put() call, the data from the host was intact. When I looked at the payload contents after retrieving from fifo_get, the netbuf payload was not longer intact.

This only would be seen when I was doing multiple SMP commands stacked (e.g. I was using multiple netbufs from the pool before they were freed).

@mniestroj
Copy link
Member

mniestroj commented Feb 23, 2021

According to the docs (and my reading of the buf source code), netbuf references should always use net_buf_put/get in case the netbuf has fragments.

My understanding of the code is that no net_buf fragments are used. Instead, just single net_buf is allocated and enqueued to FIFO at a time. Then each received net_buf is processes separately. So that is why I think net_buf_{put,get} API is not needed. In fact using net_buf_{put,get}, as you suggested, could result in dropping some net_bufs (when multiple net_bufs would be enqueued, then just the first one would be processed, rest would be freed).

@jrhees-cae
Copy link
Author

Hmm, interesting. Is there any sort of debugging info I could provide to help understand this issue better? Changing those function calls definitely cleared up the problem (on my setup anyway), but I completely understand that it might be a red herring.

@mniestroj
Copy link
Member

As a start, you could provide data based on which you have filed this bug. So dump of bytes before putting net_buf on FIFO and after getting from FIFO. I would like to see those "corrupted" packets and their data when they were "okay" before putting them into FIFO. Still, I would like to see what Zephyr version you are using. git describe in Zephyr repo should tell us enough.

@jrhees-cae
Copy link
Author

per git describe I see: v2.4.0-ncs1
I will work on getting a debug dump within the next day. Thanks.

@de-nordic
Copy link
Collaborator

Do you use the mcumgr cli (https:/apache/mynewt-mcumgr-cli) for the communication? I have not been able to find info on how you contact the mcumgr within Zephyr.
With mcumgr cli you may use -l "debug" to get comm dump

@jrhees-cae
Copy link
Author

To exercise the MCUMGR, I have been using the "nRF Connect Device Manager" application for Android. This application works OK as-is, and is based on (a Nordic-themed modified fork of) the mcumgr-android library here: https:/JuulLabs-OSS/mcumgr-android

However, the mcumgr-android library was updated with a new feature to speed up firmware uploads by sending multiple upload packets in a row before waiting for the response from the previous write. Attempting to use these new features causes the behavior I am describing.

@nashif nashif added the priority: medium Medium impact/importance bug label Mar 2, 2021
@khutchens
Copy link

I'm running into this issue as well and separately reached the same conclusion that the k_fifo_put and k_fifo_get calls are the root cause. struct net_buf uses the FIFO next pointer as it's fragment pointer and requires special put/get logic to make that work as far as I can tell. I added a bit of tracing to zephyr_smp_rx_req() and zephyr_smp_handle_reqs() to show the netbufs. Note the format for this tracing is nb: <net_buf addr> [<frag addr's>, ...]

With the unmodified code (k_fifo_ calls) I get the following behavior:

[00324842] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001cfc4 [(nil)]
[00324846] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cfc4 [(nil)]
[00422857] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001cff4 [(nil)]
[00524233] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001d00c [(nil)]
[00558597] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cff4 [0x2001d00c, (nil)]
[00558630] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001d00c [0x2001cff4, 0x2001cfc4, 0x2001cfdc, (nil)]
[00558648] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cff4 [0x2001cfc4, 0x2001cfdc, (nil)]
[00558664] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cfc4 [0x2001cfdc, (nil)]
[00558677] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cfdc [(nil)]
[00625056] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001cfc4 [(nil)]
[00625061] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cfc4 [(nil)]

Replacing k_fifo_put with net_buf_put and k_fifo_get with net_buf_get yields the following:

[00202095] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001cfc4 [(nil)]
[00202100] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cfc4 [(nil)]
[00303472] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001cff4 [(nil)]
[00402267] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001d00c [(nil)]
[00438736] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001cff4 [(nil)]
[00438763] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001d00c [(nil)]
[00440973] <inf> smp_subsys: zephyr_smp_rx_req(), put req:  nb: 0x2001d00c [(nil)]
[00440978] <inf> smp_subsys: zephyr_smp_handle_reqs(), get req: nb: 0x2001d00c [(nil)]

Which I believe matches the expected behavior. The failing case results in the net bufs being re-processed after they've been freed, which naturally leads to all sorts of mayhem.

This fails any time multiple requests are received between each run of the processing loop. This means it's pretty easy to repro by just blocking the work handler from running long enough to get two or more requests in. In my case I'm running into using the mcumgr CLI, and the error occurs while waiting for a flash page erase.

@jrhees-cae
Copy link
Author

I'm running into this issue as well and separately reached the same conclusion that the k_fifo_put and k_fifo_get calls are the root cause.

@khutchens Thanks for digging into this more deeply. I haven't had cycles to do it myself.

@mniestroj
Copy link
Member

Thanks for testing! I have looked once again into net_buf_put and net_buf_get and it seems that using them is the right thing to do. The problem with k_fifo_get() is that it returns a net_buf object (lets call it nb), which will have nb->frags != NULL whenever there was more than a single element in the queue (FIFO). We can solve that in two ways. One way is to clear nb->frags = NULL after k_fifo_get(). Another way is replacing k_fifo_* with net_buf_* API, as you suggested and tested above. The latter seems to be more future-proof of course.

@jrhees-cae
Copy link
Author

It may be worth an audit of other modules in Zephyr to see if this change (e.g. using k_fifo... api's with netbufs) needs to be applied elsewhere.

@mniestroj
Copy link
Member

It may be worth an audit of other modules in Zephyr to see if this change (e.g. using k_fifo... api's with netbufs) needs to be applied elsewhere.

I've taken a pretty quick look at other modules and didn't see issues there. But it would be good if others can confirm.

@khutchens
Copy link

I scrubbed through the codebase and only found one other case where a k_fifo_(put|get) call is made on a struct net_buf object, in canbus/isotp/isotp.c:pull_frags(). That function seems to be doing roughly what net_buf_get() would do so it may be safe. I admittedly didn't think hard enough about what that function is doing to conclude whether it's an acceptable use or not.

There are also many instances of using k_fifo_(put|get) with struct net_pkt. That seems to be correct usage though as struct net_pkt doesn't have the same specialized put/get requirement at struct net_buf as far as I can tell.

@mniestroj
Copy link
Member

@khutchens I got to the same conclusions as you.

@de-nordic de-nordic assigned mniestroj and khutchens and unassigned mniestroj and khutchens Mar 29, 2021
de-nordic added a commit to de-nordic/zephyr that referenced this issue Mar 29, 2021
Fixing problem where k_fifo functions have been used to get/put
data from/info net buf, where documentation has been strictly
forbidding so.

Found, reported and solution suggested by jrhees-cae.

Fixes: zephyrproject-rtos#32579

Signed-off-by: Dominik Ermel <[email protected]>
MaureenHelm pushed a commit that referenced this issue Mar 29, 2021
Fixing problem where k_fifo functions have been used to get/put
data from/info net buf, where documentation has been strictly
forbidding so.

Found, reported and solution suggested by jrhees-cae.

Fixes: #32579

Signed-off-by: Dominik Ermel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: mcumgr bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants