gnrc_sock_udp: Possible Race condition on copy in application buffer #10389
Comments
My check function in sock_udp_recv:
In the error case, memcmp returns a nonzero value and the printed buffers differ. I will try to reproduce this and post the corresponding output. |
Was able to reproduce this:
Returns:
Which is correct. Running the memcmp test as described in my last comment returns:
As you can see, the first 4 bytes, 0x0000000b, are the expected message type 11. Starting at the 3rd byte, things look incorrect in both the gnrc buffer and the returned buffer. I will take a closer look and post an update on this. |
I just fixed my buffer dump printing:
Now it is easy to see that the 2nd to 4th 4-byte words are corrupted. The rest looks fine. fe80 00000000 0000c0c6 is the beginning of a link-local address that is transferred in the payload: My best guess is that some kind of context switch happens inside the memcpy, or that there is some odd alignment error. I appreciate any help :) |
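A dump grouped into 4-byte words, as described in the comment above, could look roughly like the following. This is a minimal sketch, not the code used in the thread: `dump_words` is a hypothetical helper, and it assumes the words should be shown in the order the bytes sit in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not part of RIOT or the application): formats `len`
 * bytes of `buf` into `out` as space-separated 4-byte groups, in memory
 * order, e.g. "0000000b fe800000".
 * Returns the number of characters written, or -1 if `out` is too small. */
static int dump_words(const uint8_t *buf, size_t len, char *out, size_t out_len)
{
    assert(buf != NULL && out != NULL);
    size_t pos = 0;

    if (out_len == 0) {
        return -1;
    }
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        /* separator before every group of four bytes except the first */
        if (i > 0 && (i % 4) == 0) {
            if (pos + 1 >= out_len) {
                return -1;
            }
            out[pos++] = ' ';
            out[pos] = '\0';
        }
        /* two hex digits plus the terminating NUL must still fit */
        if (pos + 2 >= out_len) {
            return -1;
        }
        snprintf(&out[pos], out_len - pos, "%02x", buf[i]);
        pos += 2;
    }
    return (int)pos;
}
```

Formatting into a caller-supplied buffer first and printing it with a single call keeps the output of one dump together when several threads print at the same time.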
@crest42 Can you share the application code? Also, please rule out stack size issues, either with ps() or by increasing your application stack to something very large (8k). |
Sure, but the application code is rather large. One of the next things I will try is to reduce the application code to a few primitives, to narrow the problem down to a specific part of the code. Application code: https:/crest42/RIOT ./chord_test. You need to clone it with --recurse-submodules. I just made a debug commit, and the expected output when the fault happens is:
You can start a new network with "chord new" and join additional nodes with "chord join". The problem starts at about 16-18 nodes. A short introduction: I am trying to build a distributed storage based on a DHT (Chord), using an mtd device driver in RIOT. The Chord lib uses two threads. One thread is a simple event loop that handles incoming UDP messages, and the other is a periodic loop that implements the Chord state-keeping protocol. The fault happens in the wait_for_message thread, which does the incoming message handling. The other thread mostly crafts UDP messages and waits for an answer. I already refactored the network code a little to use htons and friends, but that should not make any difference, since I run everything on the same machine anyway. (In fact, the problem occurs with both methods.) I also tried zeroing the read buffer on every run of wait_for_message in network.c (the buffer that gets passed to the gnrc_udp_recv function). The stack sizes should be no problem, as I already made them rather large. Here is an example ps output with 18 nodes running. I tried to use static memory in the DHT implementation, so it should be constant:
Three additional remarks:
Regards, |
I just disabled interrupts before the copy and re-enabled them after it. Now the problem does not occur in a 5-minute run with 18 nodes, which is much longer than without the "fix". I am not sure whether this really has a direct effect, since printing just before the copy also makes the problem disappear, but it could be a hint in the right direction. |
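The experiment described above, masking interrupts around the copy, could be sketched as follows. This is a diagnostic sketch only, not the change made in the thread. `copy_irq_masked` is a hypothetical name, and the two `irq_*` functions defined here are host-side stand-ins so the example builds outside RIOT; on RIOT they would be replaced by `irq_disable()` and `irq_restore()` from `irq.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins so the sketch builds on a host. On RIOT, drop these and
 * include "irq.h", which provides irq_disable() and irq_restore(). */
static unsigned irq_depth;   /* only used by the stand-ins */
static unsigned irq_disable(void) { return irq_depth++; }
static void irq_restore(unsigned state) { irq_depth = state; }

/* Hypothetical helper: copy with interrupts masked, so that no ISR and no
 * ISR-triggered context switch can run between the first and last byte. */
static void copy_irq_masked(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    unsigned state = irq_disable();   /* remember the previous IRQ state */
    memcpy(dst, src, len);
    irq_restore(state);               /* restore it, do not blindly enable */
}
```

If the corruption disappears with such a wrapper, that suggests something running from interrupt context (or a thread woken by it) writes to one of the two buffers during the copy. It does not show which writer is responsible, and the changed timing alone can hide the problem, as the remark about printing indicates.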
I made additional observations while debugging the problem:
In 100% of the failure cases (more than 15 so far), "irq called" was printed. This also happens in non-failure cases, but much more rarely.
I will take a further look next week. |
At the moment I have a similar problem in this function:
Which sometimes fails with:
So basically the same thing happens as in the network stack: right after the memcpy, a memcmp of the values fails. From the output you can see that the addresses passed to memcpy are valid and that the memory regions do not overlap. So right now I am no longer sure whether the problem I first reported is related to gnrc at all, but I can't see how I could have messed this up in my application code. |
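The check described above (memcpy, then memcmp, with the overlap of the two regions ruled out) can be written down as a small helper. This is a sketch under assumptions, not the application's function: `regions_overlap` and `copy_and_verify` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if [a, a+len) and [b, b+len) share at least one byte. */
static bool regions_overlap(const void *a, const void *b, size_t len)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;

    return (pa < pb) ? (pb - pa < len) : (pa - pb < len);
}

/* Hypothetical helper: copies `len` bytes and immediately compares.
 * Returns  0 if the copy verified,
 *         -1 if the regions overlap (memcpy would be undefined behavior),
 *         -2 if dst differs from src right after the copy. */
static int copy_and_verify(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    if (regions_overlap(dst, src, len)) {
        return -1;
    }
    memcpy(dst, src, len);
    return (memcmp(dst, src, len) == 0) ? 0 : -2;
}
```

On a single thread with valid, non-overlapping buffers this cannot return -2. A -2 result therefore means that something else modified `src` or `dst` between the two calls, for example another thread, an interrupt handler, or a stack overflow into one of the buffers.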
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions. |
Description
When using gnrc_sock_udp in my application, once the network contains a certain number of hosts (and thus a single host sees a certain number of packets/s), the buffer content returned by sock_udp_recv is filled with garbage. My message format is defined as 4 bytes + 4 bytes + 4 bytes + 4 bytes (message type, app src id, app dst id, length of the remaining data).
As an example: A message of type 11 always sends a static sized buffer with size 32 and thus the message "header" would look like this: 0x0000000b 0x00000001 0x00000002 0x00000020. (Second and third 4 bytes can differ)
Occasionally, when many hosts are joining the network, a single host instead reads something like: 0x0000000b 0xfe80aaf0 0x00000000 0x00afbb20
Two interesting observations here:
The first 4 bytes are always correct.
The second 4 bytes always contain fe80something (maybe I am just paranoid, but this looks like the beginning of a link-local IPv6 address).
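The header layout and the two observations above can be checked mechanically. The following is a minimal sketch, not the application's parser: `msg_header_t`, `parse_header` and `starts_with_fe80` are hypothetical names, and it assumes the four fields are transmitted in network byte order (big-endian), which matches the hex values shown in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_LEN 16u   /* 4 fields of 4 bytes each */

/* Hypothetical in-memory form of the 16-byte message header. */
typedef struct {
    uint32_t type;     /* message type, e.g. 11 */
    uint32_t src_id;   /* application source id */
    uint32_t dst_id;   /* application destination id */
    uint32_t length;   /* length of the remaining data */
} msg_header_t;

/* Reads a 32-bit big-endian value byte by byte, so alignment of `p`
 * does not matter. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if fewer than HEADER_LEN bytes are available. */
static int parse_header(const uint8_t *buf, size_t len, msg_header_t *hdr)
{
    assert(buf != NULL && hdr != NULL);

    if (len < HEADER_LEN) {
        return -1;
    }
    hdr->type   = read_be32(buf);
    hdr->src_id = read_be32(buf + 4);
    hdr->dst_id = read_be32(buf + 8);
    hdr->length = read_be32(buf + 12);
    return 0;
}

/* True if the upper 16 bits of the word are 0xfe80, the pattern seen in
 * the corrupted second field. */
static bool starts_with_fe80(uint32_t word)
{
    return (word >> 16) == 0xfe80u;
}
```

For a type 11 message, a receiver could flag the corruption by checking `hdr.length != 32 || starts_with_fe80(hdr.src_id)` right after `sock_udp_recv` returns.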
Further debugging:
Steps to reproduce the issue
Read from a UDP socket while sending a large number (TBD) of messages, and test the read buffer for correctness. Sadly, there is no better way yet than to try and wait.
Expected results
Content of buffer is correct
Actual results
Buffer content is garbage
Versions
Operating System Environment
Installed compiler toolchains
riscv-none-embed-gcc: missing
clang: clang version 7.0.0 (tags/RELEASE_700/final)
Installed compiler libs
arm-none-eabi-newlib: "3.0.0"
mips-mti-elf-newlib: missing
riscv-none-embed-newlib: missing
avr-libc: "2.0.0" ("20150208")
Installed development tools