Occasional Spinlocks on zephyr 2.4.0 (ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92) #30074

Closed
caco3 opened this issue Nov 16, 2020 · 21 comments
Labels: bug, platform: nRF Nordic nRFx, priority: medium, Stale

@caco3 (Contributor) commented Nov 16, 2020

Describe the bug
We have an application based on a nrf52840. We see z_spin_lock_valid assertion every few hours: ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92.

I haven't found much information about the reasons for such an issue.
The code is used for a measurement device. We run an external ADC at 10 kSamples/s and read its data through SPI DMA transfers. After 30 samples (=> 333 Hz) an interrupt handler is triggered, which copies the data into another buffer.
Then we release a semaphore. A kernel thread is waiting for this semaphore so it can grab the data from the buffer.

Code snippets:

K_SEM_DEFINE(dataReadyToBeSend, 0, 1);
K_THREAD_DEFINE(DataStreaming, 10240, Thread_DataStreaming, NULL, NULL, NULL, 0, 0, 0);

void TIMER2_IRQHandler(nrf_timer_event_t event_type, void *p_context) {
    /* copy data from DMA buffer to another buffer ... */
    k_sem_give(&dataReadyToBeSend);
}

static void Thread_DataStreaming(void) {
    /* ... */
    while (1) {
        semRet = k_sem_take(&dataReadyToBeSend, K_MSEC(1000));
        if (semRet == 0) {
            /* ... */
        }
        else {
            /* ... */
        }
    }
}
}

Log:

[..]
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92

Recursive spinl*** Booting Zephyr OS build zephyr-v2.4.0  ***

What is the best way to trace this further down?

Environment:

  • OS: Ubuntu 20.04,
  • Toolchain: zephyr-sdk-0.11.4
  • Zephyr Project (Tag 2.4.0)
@caco3 caco3 added the bug The issue is a bug, or the PR is fixing a bug label Nov 16, 2020
@carlescufi carlescufi added the platform: nRF Nordic nRFx label Nov 17, 2020
@carlescufi (Member)

Perhaps @anangl can shed some light on the best way to go about debugging this.

@nashif nashif added the priority: medium Medium impact/importance bug label Nov 17, 2020
@anangl (Member) commented Nov 18, 2020

The assertion fails when the k_spinlock structure passed to k_spin_lock() has invalid content. It's not possible to tell from the provided code snippets when this could happen. @caco3 You can try to set a breakpoint in z_spin_lock_valid() to see where it is called from when it returns false, and what the contents of the pointed-to k_spinlock structure are at that point. Maybe that structure somehow gets accidentally overwritten?

@olofk (Contributor) commented Nov 26, 2020

I can't tell if this is related, but I have spent the past two weeks debugging why the philosophers sample application crashes on my RISC-V SoC with Zephyr 2.4.0, and I have several clues pointing at the spin lock. Sometimes the crashes trigger the assert
ASSERTION FAIL [z_spin_unlock_valid(l)] @ ../../../include/spinlock.h:132
PhilosopNot my spinlock 0x00010590E FORK
@ WEST_TOPDIR/zephyr/lib/os/assert.c:45s ]

I have also seen __ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l); from spinlock.h:92,
and just recently I found that a value gets overwritten (which eventually causes a crash later on) inside the z_spin_lock_valid function.

@remy-luisant (Contributor)

Same issue on a custom board, ARMv7-A (sic), master branch.

I might be wrong on this one, but here's a sketch of the problem as I see it.

Below is the relevant, edited for brevity, code.

log_core.c:905

struct k_spinlock lock;
k_spinlock_key_t key;	
key = k_spin_lock(&lock);

The locking function:

static ALWAYS_INLINE k_spinlock_key_t k_spin_lock(struct k_spinlock *l)
{
	k_spinlock_key_t k;
	k.key = arch_irq_lock();
	__ASSERT(z_spin_lock_valid(l), "Recursive spinlock %p", l);

Next, in the validation code:

	uintptr_t thread_cpu = l->thread_cpu;

Who initialized thread_cpu? Not the user, not the locking procedure...

@olofk (Contributor) commented Nov 29, 2020

@remy-luisant Good find. Sounds like you're onto something. I would love to help debug, since this sounds related to what I've been chasing for three weeks now. One problem is that I can't really check whether thread_cpu is valid, because I have no idea what value it is supposed to hold. From the code it looks like the two least-significant bits are a CPU number (does Zephyr only support four CPUs?) and perhaps the rest is a memory address (which would then require 32-bit alignment), but I'm very unsure about all of this.
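
(For reference, and hedging heavily: the guess above matches what the CONFIG_SPIN_VALIDATE bookkeeping appears to store in thread_cpu in the Zephyr 2.4-era sources. The following is a paraphrased sketch for orientation, not an exact quote of the kernel code.)

/* Sketch of the lock-owner bookkeeping under CONFIG_SPIN_VALIDATE:
 * the low 2 bits carry the CPU id, the remaining bits the current
 * thread pointer (struct k_thread is word-aligned, so the low bits
 * of the pointer are free).
 */
static ALWAYS_INLINE void z_spin_lock_set_owner(struct k_spinlock *l)
{
    l->thread_cpu = _current_cpu->id | (uintptr_t)_current;
}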

@remy-luisant (Contributor)

does Zephyr only support four CPUs?

I imagine that people who need an RTOS that runs on multiple CPUs at GHz speeds and with tens of MB of RAM would not go with Zephyr. From their homepage: "The Zephyr Project strives to deliver the best-in-class RTOS for connected resource-constrained devices, built to be secure and safe." Apparently, I'm not exactly the target audience, but well...

Git points out this edit as the point of inclusion of the line I suspect to be problematic.
e124670 kernel/thread.c (Jim Shu 2019-09-17 13:39:17 +0800 867) uintptr_t thread_cpu = l->thread_cpu;

The relevant commit was testing on SMP, possibly not on uniprocessors. I may well be wrong on this, of course.

Jim, what do you think about this? @cwshu

@cwshu (Contributor) commented Nov 30, 2020

Hi @remy-luisant,
If the spinlock isn't held, thread_cpu should be cleared to 0 and z_spin_lock_valid() passes.
Otherwise, if the spinlock is held, thread_cpu is non-zero (because it contains the memory address of the owning thread),
and z_spin_lock_valid() checks the CPU ID to detect the same CPU acquiring the same lock twice (a recursive spinlock).
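
In code, the check described above looks roughly like this (a paraphrase of the CONFIG_SPIN_VALIDATE path around include/spinlock.h:92 in 2.4; a sketch for orientation, not an exact quote):

static ALWAYS_INLINE bool z_spin_lock_valid(struct k_spinlock *l)
{
    uintptr_t thread_cpu = l->thread_cpu;

    if (thread_cpu != 0) {
        /* Lock looks held: the low 2 bits are the owning CPU id.
         * If they match the current CPU, the same CPU is taking the
         * lock a second time, i.e. a recursive spinlock.
         */
        if ((thread_cpu & 3) == _current_cpu->id) {
            return false;
        }
    }
    return true; /* thread_cpu == 0 means "not held", always valid */
}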

Thus, I think the thread_cpu field of k_spinlock should be initialized to zero.
After checking the source code under kernel/, I find that all k_spinlock instances, whether local variables or struct members, are explicitly initialized to 0:

// kernel/sched.c
struct k_spinlock local_lock = {};
k_spinlock_key_t key = k_spin_lock(&local_lock);

// kernel/queue.c
void z_impl_k_queue_init(struct k_queue *queue)
{
    sys_sflist_init(&queue->data_q);
    queue->lock = (struct k_spinlock) {};

I think there is a bug in log_core.c: the local k_spinlock variable is never initialized.
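
If that is the case, the fix would presumably be just an explicit zero-initialization of the stack-allocated lock, along these lines (a hypothetical patch sketch, not the actual upstream fix):

/* Hypothetical fix sketch for log_core.c: zero-initialize the local
 * lock so thread_cpu starts out as 0, the "not held" state that
 * z_spin_lock_valid() accepts.
 */
struct k_spinlock lock = {};
k_spinlock_key_t key;

key = k_spin_lock(&lock);
/* ... critical section ... */
k_spin_unlock(&lock, key);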

Hi @olofk,
Currently, both Zephyr SMP and CONFIG_SPIN_VALIDATE support at most 4 CPUs.
There is also a build-time restriction under CONFIG_SPIN_VALIDATE:

// include/spinlock.h
#ifdef CONFIG_SPIN_VALIDATE
/* ... */
BUILD_ASSERT(CONFIG_MP_NUM_CPUS < 4, "Too many CPUs for mask");
#endif

@remy-luisant (Contributor) commented Nov 30, 2020

@cwshu
As a counter-argument to locks requiring initialization, I would point out all of the other places in the kernel where a lock is not initialized, most notably in tests/kernel/spinlock/src/main.c:

void test_spinlock_basic(void)
{
        k_spinlock_key_t key;
        static struct k_spinlock l;

        zassert_true(!l.locked, "Spinlock initialized to locked");

        key = k_spin_lock(&l);

        zassert_true(l.locked, "Spinlock failed to lock");

        k_spin_unlock(&l, key);

        zassert_true(!l.locked, "Spinlock failed to unlock");
}

Requiring initialization seems not to be the common practice of spinlock usage in Zephyr, unless there are many, many, many more bugs around.

Furthermore, and I might be wrong on this one, let's assume that the spinlock is indeed initialized to 0. At this point, any code taking a lock on CPU 0 will trip a recursive spinlock error with the current code.

@remy-luisant (Contributor)

Actually, a minor correction here, the validation code DOES check for the case of 0, so I am indeed happily wrong about that part. My apologies.

The comment about the common usage being on uninitialized locks stands.

@cwshu (Contributor) commented Dec 1, 2020

@remy-luisant
A global/static spinlock is placed in .bss and is cleared to 0 even without an initializer; most spinlock uses, including your example, are like that.
But it looks like there may be some bugs with uninitialized local spinlocks in the codebase.
tests/kernel/spinlock/src/main.c has one:

void test_spinlock_mutual_exclusion(void)
{
    k_spinlock_key_t key;
    struct k_spinlock lock_runtime;
    unsigned int irq_key;

    lock_runtime.locked = 0;

    key = k_spin_lock(&lock_runtime);

Also, I found that #20993 has the same problem, and @andyross commented there that 'Spinlocks are valid when zero-initialized.'

@remy-luisant (Contributor)

@cwshu
Yup, gotcha. It appears the policy is indeed that locks must be zero-initialized. I missed the implicit zero-initialization that the static keyword provides. Thank you very much for clarifying the matter.

I do personally have a problem with that implicit way of initializing variables through the use of static, but I can see it would work. My issue is that the examples in tests look very similar to a stack allocation and not at all similar to an initialization. This is a code style issue, however.

More worryingly, I do not see any documentation on the project website, so it appears that this aspect of spinlocks is still not documented. I see no mention in the source either, including in spinlock.h.

I consider the matter resolved for my needs, thank you.

@caco3 (Contributor, Author) commented Dec 1, 2020

While playing around I now get crashes immediately after startup:

ASSERTION FAIL [z_spin_lock_valid(lASSERTION FAIL [!arch_is_in_isr()] @ WEST_TOPDIR/zephyr/kernel/thread.c:627
        Threads may not be created in ISRs
ASSERTION FAIL [z_spin_lock_valid(l)] @ WEST_TOPDIR/zephyr/include/spinlock.h:92
        Recursive spinlock 0x2000daa8

As mentioned in the first post, I have external interrupts configured. They continue to fire even while the MCU restarts. If I interpret the assertion above correctly, Zephyr tries to create a thread while it is in an ISR. I have now disabled all of my own thread creations, but it still happens. I guess Zephyr itself has some hidden threads of its own.
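
One way to rule out your own code as the source of the "Threads may not be created in ISRs" assertion is to guard thread creation explicitly. A minimal sketch, assuming Zephyr 2.4's k_thread_create() API; my_thread, my_stack and my_entry are placeholder names, not anything from the original code:

#include <kernel.h>

K_THREAD_STACK_DEFINE(my_stack, 1024);
static struct k_thread my_thread;

/* Placeholder thread entry point, for illustration only. */
static void my_entry(void *p1, void *p2, void *p3)
{
}

static void start_worker(void)
{
    /* k_thread_create() asserts !arch_is_in_isr(); checking first makes a
     * stray call from interrupt context visible before the kernel assert.
     */
    if (k_is_in_isr()) {
        return;
    }
    k_thread_create(&my_thread, my_stack, K_THREAD_STACK_SIZEOF(my_stack),
                    my_entry, NULL, NULL, NULL,
                    K_PRIO_PREEMPT(7), 0, K_NO_WAIT);
}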

@remy-luisant (Contributor) commented Dec 1, 2020

@caco3 Do you have OpenOCD running? It has a pretty great way of showing you all threads that Zephyr has. At a minimum, there is the idle thread. The logger and the shell also have their own threads.

I would also suggest first initializing the spinlocks that are failing, since that is the very first error you encounter; once errors start appearing, the system may already be corrupted. While I am not familiar with Cortex-M, you can add a "bkpt" instruction to the asserts to put a Cortex-A7 CPU into a debug state when an assert is hit and debug it from there; I imagine Cortex-M is similar. I would also advise against debugging solely by editing code and looking at print statements; a proper debugging setup is much better for finding issues.
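
As a concrete, hedged example of the "break into the debugger on assert" idea: if I read the 2.4 sources right, the assert path ends in a weak assert_post_action() hook (lib/os/assert.c, which also shows up in the log above) that an application can override, so something like the following should halt at the failing assert. The bkpt mnemonic is ARM-specific; adjust for other architectures.

#include <sys/__assert.h>

/* Sketch: override the weak assert hook so the CPU stops in a
 * debugger-visible state at the failing assert, instead of panicking
 * and rebooting.
 */
void assert_post_action(const char *file, unsigned int line)
{
    (void)file;
    (void)line;
#if defined(__arm__)
    __asm__ volatile("bkpt 0");   /* drop into the attached debugger */
#endif
    for (;;) {
        /* spin so the failure state stays inspectable */
    }
}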

@carlescufi (Member)

@caco3 @olofk @remy-luisant is this still an issue on your respective boards and environments?

@olofk (Contributor) commented Jan 26, 2021

@carlescufi Sorry. Forgot about this. I believe my root cause was an issue with the memory controller that occasionally dropped writes. Fixed in hw.

@remy-luisant (Contributor)

My root cause was the interrupt controller feeding garbage to the GIC, which was then mishandled by Zephyr. A separate issue was filed for that one. I do believe there still might be an issue with the logging system not initializing a lock, but I have no resources available to give you a firm answer.

@caco3 (Contributor, Author) commented Jan 27, 2021

I refactored my initialization code. This made the crashes which occurred immediately after startup go away.
It is now quite stable; however, I still see occasional spinlock crashes.
I will have to investigate this further once I have more time.

@carlescufi (Member)

@caco3 Do you have any spinlocks defined in your own code? If so, are they automatic variables or globals?

@caco3 (Contributor, Author) commented Feb 15, 2021

@carlescufi No, I have never used spinlocks.

I haven't seen such crashes for a while now. We improved our code and upgraded to the latest Zephyr releases. Further testing will tell.

As far as I am concerned we can close the issue; I would reopen it if I see new spinlock crashes.

@github-actions (bot)

This issue has been marked as stale because it has been open for more than 60 days with no activity. Remove the Stale label, or add a comment saying that you would like to keep it open; otherwise this issue will be closed automatically in 14 days. Note that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Apr 17, 2021
@caco3 (Contributor, Author) commented Apr 19, 2021

We migrated to Zephyr 2.5.0 and cleaned up our code and initializations. Since then I haven't seen this issue anymore for months, so I am going to close it. Thanks for your support!
