Occasional Spinlocks on zephyr 2.4.0 (ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92) #30074

Closed
caco3 opened this issue Nov 16, 2020 · 21 comments
Labels: bug, platform: nRF Nordic nRFx, priority: medium, Stale

@caco3 (Contributor) commented Nov 16, 2020

Describe the bug
We have an application based on a nrf52840. We see z_spin_lock_valid assertion every few hours: ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92.

I haven't found much information about the reasons for such an issue.
The code is used for a measurement device. We run an external ADC at 10 kSamples/s and read its data through SPI DMA transfers. After 30 samples (=> 333 Hz) an interrupt handler is triggered, which copies the data into another buffer.
Then we release a semaphore. A kernel thread is waiting for this semaphore so it can grab the data from the buffer.

Code snippets:

K_SEM_DEFINE(dataReadyToBeSend, 0, 1);
K_THREAD_DEFINE(DataStreaming, 10240, Thread_DataStreaming, NULL, NULL, NULL, 0, 0, 0);

void TIMER2_IRQHandler(nrf_timer_event_t event_type, void *p_context) {
    /* copy data from DMA buffer to another buffer ... */
    k_sem_give(&dataReadyToBeSend);
}

static void Thread_DataStreaming(void) {
    /* ... */
    while (1) {
        semRet = k_sem_take(&dataReadyToBeSend, K_MSEC(1000));
        if (semRet == 0) {
            /* ... */
        }
        else {
            /* ... */
        }
    }
}
}

Log:

[..]
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92

Recursive spinl*** Booting Zephyr OS build zephyr-v2.4.0  ***

What is the best way to trace this further down?

Environment:

  • OS: Ubuntu 20.04,
  • Toolchain: zephyr-sdk-0.11.4
  • Zephyr Project (Tag 2.4.0)
@caco3 caco3 added the bug The issue is a bug, or the PR is fixing a bug label Nov 16, 2020
@carlescufi carlescufi added the platform: nRF Nordic nRFx label Nov 17, 2020
@carlescufi (Member)

Perhaps @anangl can shed some light on the best way to go about debugging this.

@nashif nashif added the priority: medium Medium impact/importance bug label Nov 17, 2020
@anangl (Member) commented Nov 18, 2020

The assertion fails when the k_spinlock structure passed to k_spin_lock() has invalid content. It's not possible to tell from the provided code snippets when this could happen. @caco3 You can try to set a breakpoint in z_spin_lock_valid() to see where it is called from when it returns false, and what the contents of the pointed-to k_spinlock structure are at that point. Maybe that structure somehow gets accidentally overwritten?

@olofk (Contributor) commented Nov 26, 2020

I can't tell if this is related, but I have spent the past two weeks debugging why the philosophers sample application crashes on my RISC-V SoC with Zephyr 2.4.0, and I have several clues pointing at the spin lock. Sometimes the crashes trigger the assert
ASSERTION FAIL [z_spin_unlock_valid(l)] @ ../../../include/spinlock.h:132
PhilosopNot my spinlock 0x00010590E FORK
@ WEST_TOPDIR/zephyr/lib/os/assert.c:45s ]

I have also seen __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l); from spinlock.h:92,
and just recently I found that a value gets overwritten (which eventually causes a crash later on) inside the z_spin_lock_valid function.

@remy-luisant (Contributor)

Same issue on a custom board, ARMv7-A (sic), master branch.

I might be wrong on this one, but here's a sketch of the problem as I see it.

Below is the relevant, edited for brevity, code.

log_core.c:905

struct k_spinlock lock;
k_spinlock_key_t key;	
key = k_spin_lock(&lock);

The locking function:

static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)
{
	k_spinlock_key_t k;
	k.key = arch_irq_lock();
	__ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);

Next, in the validation code:

	uintptr_t thread_cpu = l->thread_cpu;

Who initialized thread_cpu? Not the user, not the locking procedure...

@olofk (Contributor) commented Nov 29, 2020

@remy-luisant Good find. Sounds like you're onto something. I would love to help debug, since this sounds related to what I've been chasing for three weeks now. One problem is that I can't really check whether thread_cpu is valid, because I have no idea what value it is supposed to hold. From the code it looks like the two least-significant bits are a CPU number (does Zephyr only support four CPUs?) and perhaps the rest is a memory address (which would then require 32-bit alignment), but I'm very unsure about all of this.
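
(For reference, and hedging heavily: the guess above matches what the CONFIG_SPIN_VALIDATE bookkeeping appears to store in thread_cpu in the Zephyr 2.4-era sources. The following is a paraphrased sketch for orientation, not an exact quote of the kernel code.)

/* Sketch of the lock-owner bookkeeping under CONFIG_SPIN_VALIDATE:
 * the low 2 bits carry the CPU id, the remaining bits the current
 * thread pointer (struct k_thread is word-aligned, so the low bits
 * of the pointer are free).
 */
static ALWAYS_INLINE void z_spin_lock_set_owner(struct k_spinlock *l)
{
    l->thread_cpu = _current_cpu->id | (uintptr_t)_current;
}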

@remy-luisant (Contributor)

does Zephyr only support four CPUs?

I imagine that people who need an RTOS that runs on multiple CPUs at GHz speeds and with tens of MB of RAM would not go with Zephyr. From their homepage: "The Zephyr Project strives to deliver the best-in-class RTOS for connected resource-constrained devices, built to be secure and safe." Apparently, I'm not exactly the target audience, but well...

Git points out this edit as the point of inclusion of the line I suspect to be problematic.
e124670 kernel/thread.c (Jim Shu 2019-09-17 13:39:17 +0800 867) uintptr_t thread_cpu = l->thread_cpu;

The relevant commit was testing on SMP, possibly not on uniprocessors. I may well be wrong on this, of course.

Jim, what do you think about this? @cwshu

@cwshu (Contributor) commented Nov 30, 2020

Hi @remy-luisant,
If the spinlock isn't held, thread_cpu should be cleared to 0 and z_spin_lock_valid() passes.
Otherwise, if the spinlock is held, thread_cpu is non-zero (because it contains the memory address of the owning thread),
and z_spin_lock_valid() checks the CPU ID to detect the same CPU acquiring the same lock twice (a recursive spinlock).
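
In code, the check described above looks roughly like this (a paraphrase of the CONFIG_SPIN_VALIDATE path around include/spinlock.h:92 in 2.4; a sketch for orientation, not an exact quote):

static ALWAYS_INLINE bool z_spin_lock_valid(struct k_spinlock *l)
{
    uintptr_t thread_cpu = l->thread_cpu;

    if (thread_cpu != 0) {
        /* Lock looks held: the low 2 bits are the owning CPU id.
         * If they match the current CPU, the same CPU is taking the
         * lock a second time, i.e. a recursive spinlock.
         */
        if ((thread_cpu & 3) == _current_cpu->id) {
            return false;
        }
    }
    return true; /* thread_cpu == 0 means "not held", always valid */
}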

Thus, I think the thread_cpu field of k_spinlock should be initialized to zero.
After checking the source code under kernel/, I find that all k_spinlock instances, whether local variables or struct members, are explicitly initialized to 0:

// kernel/sched.c
struct k_spinlock local_lock = {};
k_spinlock_key_t key = k_spin_lock(&local_lock);

// kernel/queue.c
void z_impl_k_queue_init(struct k_queue *queue)
{
    sys_sflist_init(&queue->data_q);
    queue->lock = (struct k_spinlock) {};

I think there is a bug in log_core.c: the local k_spinlock variable is never initialized.
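
If that is the case, the fix would presumably be just an explicit zero-initialization of the stack-allocated lock, along these lines (a hypothetical patch sketch, not the actual upstream fix):

/* Hypothetical fix sketch for log_core.c: zero-initialize the local
 * lock so thread_cpu starts out as 0, the "not held" state that
 * z_spin_lock_valid() accepts.
 */
struct k_spinlock lock = {};
k_spinlock_key_t key;

key = k_spin_lock(&lock);
/* ... critical section ... */
k_spin_unlock(&lock, key);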

Hi @olofk,
Currently, both Zephyr SMP and CONFIG_SPIN_VALIDATE support at most 4 CPUs.
There is also a build-time restriction under CONFIG_SPIN_VALIDATE:

// include/spinlock.h
#ifdef CONFIG_SPIN_VALIDATE
/* ... */
BUILD_ASSERT(CONFIG_MP_NUM_CPUS < 4, "Too many CPUs for mask");
#endif

@remy-luisant (Contributor) commented Nov 30, 2020

@cwshu
As a counter-argument to locks requiring initialization, I would point out all of the other places in the kernel where a lock is not initialized, most notably in tests/kernel/spinlock/src/main.c:

void test_spinlock_basic(void)
{
        k_spinlock_key_t key;
        static struct k_spinlock l;

        zassert_true(!l.locked, "Spinlock initialized to locked");

        key = k_spin_lock(&l);

        zassert_true(l.locked, "Spinlock failed to lock");

        k_spin_unlock(&l, key);

        zassert_true(!l.locked, "Spinlock failed to unlock");
}

Requiring initialization seems not to be the common practice of spinlock usage in Zephyr, unless there are many, many, many more bugs around.

Furthermore, and I might be wrong on this one, let's assume that the spinlock is indeed initialized to 0. At this point, any code taking a lock on CPU 0 will trip a recursive spinlock error with the current code.

@remy-luisant (Contributor)

Actually, a minor correction here, the validation code DOES check for the case of 0, so I am indeed happily wrong about that part. My apologies.

The comment about the common usage being on uninitialized locks stands.

@cwshu (Contributor) commented Dec 1, 2020

@remy-luisant
A global/static spinlock is placed in .bss and is cleared to 0 even without an initializer; most spinlock uses, including your example, are like that.
But it looks like there may be some bugs with uninitialized local spinlocks in the codebase.
tests/kernel/spinlock/src/main.c has one:

void test_spinlock_mutual_exclusion(void)
{
    k_spinlock_key_t key;
    struct k_spinlock lock_runtime;
    unsigned int irq_key;

    lock_runtime.locked = 0;

    key = k_spin_lock(&lock_runtime);

Also, I found that #20993 has the same problem, and @andyross commented there that 'Spinlocks are valid when zero-initialized.'

@remy-luisant (Contributor)

@cwshu
Yup, gotcha. It appears the policy is indeed that locks must be zero-initialized. I missed the implicit zero-initialization that the static keyword provides. Thank you very much for clarifying the matter.

I do personally have a problem with that implicit way of initializing variables through the use of static, but I can see it would work. My issue is that the examples in tests look very similar to a stack allocation and not at all similar to an initialization. This is a code style issue, however.

More worryingly, I do not see any documentation on the project website, so it appears that this aspect of spinlocks is still not documented. I see no mention in the source either, including in spinlock.h.

I consider the matter resolved for my needs, thank you.

@caco3 (Contributor, Author) commented Dec 1, 2020

While playing around I now get crashes immediately after startup:

ASSERTION FAIL [z_spin_lock_valid(lASSERTION FAIL [!arch_is_in_isr()] @ WEST_TOPDIR/zephyr/kernel/thread.c:627
        Threads may not be created in ISRs
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92
        Recursive spinlock 0x2000daa8

As mentioned in the first post, I have external interrupts configured. They continue to fire even while the MCU restarts. If I interpret the assertion above correctly, Zephyr tries to create a thread while it is in an ISR. I have now disabled all of my own thread creations, but it still happens. I guess Zephyr itself has some hidden threads of its own.
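
One way to rule out your own code as the source of the "Threads may not be created in ISRs" assertion is to guard thread creation explicitly. A minimal sketch, assuming Zephyr 2.4's k_thread_create() API; my_thread, my_stack and my_entry are placeholder names, not anything from the original code:

#include <kernel.h>

K_THREAD_STACK_DEFINE(my_stack, 1024);
static struct k_thread my_thread;

/* Placeholder thread entry point, for illustration only. */
static void my_entry(void *p1, void *p2, void *p3)
{
}

static void start_worker(void)
{
    /* k_thread_create() asserts !arch_is_in_isr(); checking first makes a
     * stray call from interrupt context visible before the kernel assert.
     */
    if (k_is_in_isr()) {
        return;
    }
    k_thread_create(&my_thread, my_stack, K_THREAD_STACK_SIZEOF(my_stack),
                    my_entry, NULL, NULL, NULL,
                    K_PRIO_PREEMPT(7), 0, K_NO_WAIT);
}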

@remy-luisant (Contributor) commented Dec 1, 2020

@caco3 Do you have OpenOCD running? It has a pretty great way of showing you all threads that Zephyr has. At a minimum, there is the idle thread. The logger and the shell also have their own threads.

I would also suggest first initializing the spinlocks that are failing, since that is the very first error you encounter; once errors start appearing, the system may already be corrupted. While I am not familiar with Cortex-M, you can add a "bkpt" instruction to the asserts to put a Cortex-A7 CPU into a debug state when an assert is hit and debug it from there; I imagine Cortex-M is similar. I would also advise against debugging solely by editing code and looking at print statements; a proper debugging setup is much better for finding issues.
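
As a concrete, hedged example of the "break into the debugger on assert" idea: if I read the 2.4 sources right, the assert path ends in a weak assert_post_action() hook (lib/os/assert.c, which also shows up in the log above) that an application can override, so something like the following should halt at the failing assert. The bkpt mnemonic is ARM-specific; adjust for other architectures.

#include <sys/__assert.h>

/* Sketch: override the weak assert hook so the CPU stops in a
 * debugger-visible state at the failing assert, instead of panicking
 * and rebooting.
 */
void assert_post_action(const char *file, unsigned int line)
{
    (void)file;
    (void)line;
#if defined(__arm__)
    __asm__ volatile("bkpt 0");   /* drop into the attached debugger */
#endif
    for (;;) {
        /* spin so the failure state stays inspectable */
    }
}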

@carlescufi (Member)

@caco3 @olofk @remy-luisant is this still an issue on your respective boards and environments?

@olofk (Contributor) commented Jan 26, 2021

@carlescufi Sorry. Forgot about this. I believe my root cause was an issue with the memory controller that occasionally dropped writes. Fixed in hw.

@remy-luisant (Contributor)

My root cause was the interrupt controller feeding garbage to the GIC, which was then mishandled by Zephyr. A separate issue was filed for that one. I do believe there still might be an issue with the logging system not initializing a lock, but I have no resources available to give you a firm answer.

@caco3 (Contributor, Author) commented Jan 27, 2021

I refactored my initialization code. This made the crashes which occurred immediately after startup go away.
It is now quite stable; however, I still see occasional spinlock crashes.
I will have to investigate this further once I have more time.

@carlescufi (Member)

@caco3 Do you have any spinlocks defined in your own code? If so, are they automatic variables or globals?

@caco3 (Contributor, Author) commented Feb 15, 2021

@carlescufi No, I have never used spinlocks.

I haven't seen such crashes for a while now. We improved our code and upgraded to the latest Zephyr releases. Further testing will tell.

As far as I am concerned we can close the issue; I would reopen it if I see new spinlock crashes.

@github-actions (bot)

This issue has been marked as stale because it has been open for more than 60 days with no activity. Remove the Stale label, or add a comment saying that you would like to keep it open; otherwise this issue will be closed automatically in 14 days. Note that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Apr 17, 2021
@caco3 (Contributor, Author) commented Apr 19, 2021

We migrated to Zephyr 2.5.0 and cleaned up our code and initializations. Since then I haven't seen this issue anymore for months, so I am going to close it. Thanks for your support!
