nrf5340: Random crashes when a lot of interrupts is triggered #44586
I was finally able to attach to the device when it got stuck (deadlocked), after compiling the project with optimization for debugging.
And the output of `info registers`:
As usual, the first thing to check is stack overflow. Can you please double all stacks (…)?
I have tried that, but let's repeat it to be sure.
These are the stacks I have doubled. With the new stack sizes I still see lockups and crashes. I am using this application to recreate the issue: I am using NCS 1.9.1, but I saw the same issue on NCS 1.7.0. I have not tested this application on an nRF5340 DK yet, so I am not sure if it will work there. The issue also seems to occur with very different frequency on different devices; we were just lucky and found a "golden sample" that hits it at a high rate.
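For reference, this is roughly how doubling the stacks looks in prj.conf. The exact set of relevant options depends on the Zephyr/NCS version and which subsystems are enabled, so the symbols below are the usual suspects rather than a complete list, and the sizes are illustrative:

```conf
# Doubled from typical defaults (values illustrative)
CONFIG_MAIN_STACK_SIZE=2048
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
CONFIG_ISR_STACK_SIZE=4096
CONFIG_IDLE_STACK_SIZE=640
CONFIG_LOG_PROCESS_THREAD_STACK_SIZE=1536
```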
I tried to reproduce the issue with the above application on an nRF5340 DK, using NCS 1.9.1, but I have not succeeded so far. The application seems to run just fine (I waited for about 30 minutes and did not see it crash), even when I shortened the delay for the work item submitted on the network core (I tried with e.g. 200 or 10 ms). Should I perform some additional action or change some configuration options to trigger the crash?
Sorry for the slow response, I have been on Easter vacation. Thanks for looking at my case. Yes, it is very hard to reproduce this issue in a controlled environment. As mentioned before, it seems to be hardware dependent: we have currently found two devices here at the office that show this behavior quickly, but all devices show it over time. How often the issue appears also seems to depend heavily on the timings in the code, but it will always appear at some point. If I change the timing as you have done, I can make the issue seem to disappear, but it will just happen less frequently. That is what makes this issue so hard: if I enable stack canaries or take other debugging measures, the issue may become less frequent.

In some cases I have had "luck" by enabling things that create interrupts; for instance, enabling async UART made our firmware crash all the time (I was trying to use the async UART backend in the logger). So you may want to try something like that. But it is very easy to move the frequency of the issue in both directions by changing the code.

I have not yet received the nRF5340 DK kits I ordered, so I cannot do any testing on a DK on my end. I am not sure if it would help if I made a recording of how the device fails, or if we could do a remote session so I could show you how the issue behaves?
After some more testing, it seems the bug may be located in, or related to, the logging system. In my setup I am using the UART backend. I ran the device that usually fails within 15-20 seconds with logging disabled (CONFIG_LOG=n), and it ran for 12 hours before I stopped it manually. Enabling logging again made it fail within 15-20 seconds as expected. I am also running a test now with two devices, one running my test application and one running our production firmware, both with logging set to minimal mode. They have now been running for 5.5 hours, which is far beyond what either of these devices has managed before.
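For completeness, the two logging configurations compared above look roughly like this in prj.conf (in the Zephyr 2.7 era that NCS 1.9.x is based on, the minimal-mode symbol was CONFIG_LOG_MINIMAL; newer Zephyr trees renamed it to CONFIG_LOG_MODE_MINIMAL, so check which one your tree uses):

```conf
# Variant 1: logging fully disabled -- ran for 12 h without failing
CONFIG_LOG=n

# Variant 2: logging kept, but in minimal (immediate, no processing
# thread) mode -- also stable in the long-running test
# CONFIG_LOG=y
# CONFIG_LOG_MINIMAL=y
```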
After some more investigation, it seems the issue we are seeing is the result of the EasyDMA in one of the UARTs and the CPU accessing the same memory. This originates in our own code, where we are trying to use one of the UARTs as a slave on a multi-device single-wire UART bus. I'll close this issue now, as it is not related to any Zephyr code.
I'm also facing the same situation: the application core crashes randomly after a while of running, while the network core keeps working.
I have an application with a few threads; it uses the work queue regularly and rpmsg for communication between the network core and the application core. This software is in development, and new features are added all the time. Lately we have seen random crashes or complete deadlocks on the application core. The network core does not seem to be affected by this issue.
I have a delayable work item running on the network core that sends a message to the application core every second. If I disable this work item, the issue gets less frequent.
After some investigation I ended up in this function -> https://github.com/zephyrproject-rtos/zephyr/blob/main/subsys/logging/log_output.c#L627. I saw that execution always got down to line 644, but never past line 645. Removing line 645 helped a lot, but it did not remove the issue, just made it less frequent. So I thought that we somehow spent too much time in there; in some cases (measured with k_uptime_get()) we spent 60 ms in this function. At that time we were running NCS 1.7.0. I saw that NCS 1.9.1 added async UART support to the UART logger backend, and I wanted to try that to maybe lower the number of CPU cycles spent in this function.
Upgrading the project was pretty straightforward, but when I enabled the async UART backend the problem got a lot worse. To me it seems related to the number of interrupts being fired: as before, if I disable the message that the network core sends every second, the issue gets less frequent.
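For reference, enabling the async UART log backend involved roughly these prj.conf options; I am quoting the symbols from memory, so verify them against your Zephyr/NCS tree (in particular the backend's async switch):

```conf
# Async UART log backend (symbol names may differ between versions)
CONFIG_LOG=y
CONFIG_LOG_BACKEND_UART=y
CONFIG_UART_ASYNC_API=y
CONFIG_LOG_BACKEND_UART_ASYNC=y
```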
If I enable CONFIG_ASSERT=y, I sometimes get this message when the application crashes.
Sometimes it freezes and I must use nrfjprog --recover to get access to the device again. Sometimes it just reboots without any error message. Sometimes I get a double bus fault.
Seems pretty random.
Unfortunately I don't currently have an application that reproduces the issue on an nRF5340 DK. I also managed to capture a few coredumps on NCS 1.7.0 (haven't tried on NCS 1.9.0 yet), but I am not sure if they make much sense without the ELF file.
To me it looks kind of similar to #44349 and #30074.
On NCS 1.7.0 it almost always failed in the ipm_work_q thread; on NCS 1.9.1 it usually lists either sysworkq or flash_id, which is one of my application threads. I use the system work queue to communicate with an LED driver over I2C regularly, and the flash_id thread loops: it checks a message queue for data (always empty in this case), checks for data on the I2S module, and then sleeps for 1 ms.