
smp atomic_t global_lock will never be cleared when a thread oopses with global_lock set #35202

Closed
IRISZZW opened this issue May 12, 2021 · 6 comments · Fixed by #35232
Labels
area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug

Comments

@IRISZZW
Contributor

IRISZZW commented May 12, 2021

Describe the bug

When CONFIG_SMP is enabled, calling irq_lock() invokes z_smp_global_lock(), which sets global_lock:

zephyr/kernel/smp.c

Lines 16 to 28 in 8e69daf

unsigned int z_smp_global_lock(void)
{
	unsigned int key = arch_irq_lock();

	if (!_current->base.global_lock_count) {
		while (!atomic_cas(&global_lock, 0, 1)) {
		}
	}

	_current->base.global_lock_count++;

	return key;
}

Calling irq_unlock() invokes z_smp_global_unlock(), which clears global_lock:

zephyr/kernel/smp.c

Lines 30 to 41 in 8e69daf

void z_smp_global_unlock(unsigned int key)
{
	if (_current->base.global_lock_count) {
		_current->base.global_lock_count--;

		if (!_current->base.global_lock_count) {
			atomic_clear(&global_lock);
		}
	}

	arch_irq_unlock(key);
}

If a thread calls irq_lock() and then oopses, global_lock will never be cleared.

I found this bug while debugging issue #35200.

In the test case test_fatal_on_smp, the thread entry_oops first calls irq_lock(), then oopses:

void entry_oops(void *p1, void *p2, void *p3)
{
	unsigned int key;

	ztest_set_fault_valid(true);

	key = irq_lock();
	k_oops();
	TC_ERROR("SHOULD NEVER SEE THIS\n");
	rv = TC_FAIL;
	irq_unlock(key);
}

Then in the test case test_smp_release_global_lock_irq, the thread t2_mutex_lock_with_irq calls irq_lock() again:

static void t2_mutex_lock_with_irq(void *p1, void *p2, void *p3)
{
	int key;

	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);

	key = irq_lock();
	k_mutex_lock((struct k_mutex *)p1, K_FOREVER);
	zassert_equal(_current->base.global_lock_count, 1,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);
	k_mutex_unlock((struct k_mutex *)p1);

	/**TESTPOINT: Though the irq has not been locked yet, but the
	 * global_lock_cnt has been decreased during context switch.
	 */
	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);

	irq_unlock(key);
	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);
}

Because global_lock is still set, irq_lock() spins forever.

Impact
#35200

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK
  • Commit: 8e69daf
@IRISZZW IRISZZW added bug The issue is a bug, or the PR is fixing a bug area: SMP Symmetric multiprocessing labels May 12, 2021
@IRISZZW
Contributor Author

IRISZZW commented May 12, 2021

@enjiamai @nashif I found this issue with your test cases, a9edb1f

@enjiamai
Collaborator

Hi @IRISZZW, thanks for your message and the debugging. I suspect the test case test_fatal_on_smp() from commit fa2724b is the trigger. Could you please help me try the change below on your hsdk board? I appreciate your help, thank you!

diff --git a/tests/kernel/smp/src/main.c b/tests/kernel/smp/src/main.c
index 342f9a79de..b5db76c0e9 100644
--- a/tests/kernel/smp/src/main.c
+++ b/tests/kernel/smp/src/main.c
@@ -1045,7 +1045,7 @@ void test_main(void)
                         ztest_unit_test(test_wakeup_threads),
                         ztest_unit_test(test_smp_ipi),
                         ztest_unit_test(test_get_cpu),
-                        ztest_unit_test(test_fatal_on_smp),
+                        /*ztest_unit_test(test_fatal_on_smp),*/
                         ztest_unit_test(test_workq_on_smp),
                         ztest_unit_test(test_smp_release_global_lock),
                         ztest_unit_test(test_smp_release_global_lock_irq),

@IRISZZW
Contributor Author

IRISZZW commented May 12, 2021

Without test_fatal_on_smp, it works:

*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***
Running test suite smp
===================================================================
START - test_smp_coop_threads
 PASS - test_smp_coop_threads in 0.563 seconds
===================================================================
START - test_cpu_id_threads
 PASS - test_cpu_id_threads in 1.11 seconds
===================================================================
START - test_coop_resched_threads
 PASS - test_coop_resched_threads in 0.51 seconds
===================================================================
START - test_preempt_resched_threads
 PASS - test_preempt_resched_threads in 0.531 seconds
===================================================================
START - test_yield_threads
 PASS - test_yield_threads in 0.301 seconds
===================================================================
START - test_sleep_threads
 PASS - test_sleep_threads in 1.11 seconds
===================================================================
START - test_wakeup_threads                                                                            
 PASS - test_wakeup_threads in 0.201 seconds                                                           
===================================================================                                    
START - test_smp_ipi                                                                                   
cpu num=4 PASS - test_smp_ipi in 0.331 seconds                                                         
===================================================================                                    
START - test_get_cpu                                                                                   
 PASS - test_get_cpu in 0.51 seconds                                                                   
===================================================================                                    
START - test_fatal_on_smp                                                                              
E: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 3                                                      
E: Current thread: 0x9000be58 (ztest_thread)                                                           
Caught system error -- reason 3 1                                                                      
Fatal error expected as part of test case.                                                             
 PASS - test_fatal_on_smp in 0.16 seconds                                                              
===================================================================                                    
START - test_workq_on_smp                                                                              
 PASS - test_workq_on_smp in 0.51 seconds                                                              
===================================================================                                    
START - test_smp_release_global_lock                                                                   
 PASS - test_smp_release_global_lock in 0.21 seconds                                                   
===================================================================                                    
START - test_smp_release_global_lock_irq                                                               
I: cy8c95xx_port1 init ok                                                                              
I: cy8c95xx_port0 init ok                                                                              
*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***                                           
Running test suite smp                                                                                 
===================================================================                                    
START - test_smp_coop_threads                                                                          
 PASS - test_smp_coop_threads in 0.563 seconds                                                         
I: cy8c95xx_port1 init ok
I: cy8c95xx_port0 init ok
*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***
Running test suite smp
===================================================================
START - test_smp_coop_threads
 PASS - test_smp_coop_threads in 0.563 seconds
===================================================================
START - test_cpu_id_threads
 PASS - test_cpu_id_threads in 1.11 seconds
===================================================================
START - test_coop_resched_threads
 PASS - test_coop_resched_threads in 0.51 seconds
===================================================================
START - test_preempt_resched_threads
 PASS - test_preempt_resched_threads in 0.531 seconds
===================================================================
START - test_yield_threads
 PASS - test_yield_threads in 0.301 seconds
===================================================================
START - test_sleep_threads
 PASS - test_sleep_threads in 1.11 seconds
===================================================================
START - test_wakeup_threads
 PASS - test_wakeup_threads in 0.201 seconds
===================================================================
START - test_smp_ipi
cpu num=4 PASS - test_smp_ipi in 0.331 seconds
===================================================================
START - test_get_cpu
 PASS - test_get_cpu in 0.51 seconds
===================================================================
START - test_workq_on_smp
 PASS - test_workq_on_smp in 0.51 seconds
===================================================================
START - test_smp_release_global_lock
 PASS - test_smp_release_global_lock in 0.21 seconds
===================================================================
START - test_smp_release_global_lock_irq
 PASS - test_smp_release_global_lock_irq in 0.21 seconds
===================================================================
START - test_inc_concurrency
type 0: cnt 30116, spend 5 ms
type 1: cnt 60000, spend 47 ms
type 2: cnt 60000, spend 1619 ms
type 3: cnt 60000, spend 1411 ms
 PASS - test_inc_concurrency in 3.92 seconds
===================================================================
Test suite smp succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

@enjiamai
Collaborator

Excuse me @IRISZZW, could you please help try this again on your hsdk board? Thank you so much!

diff --git a/tests/kernel/smp/src/main.c b/tests/kernel/smp/src/main.c
index 342f9a79de..7464455edb 100644
--- a/tests/kernel/smp/src/main.c
+++ b/tests/kernel/smp/src/main.c
@@ -33,7 +33,7 @@ volatile int sync_count = -1;
 static int main_thread_id;
 static int child_thread_id;
 volatile int rv;
-
+static unsigned int key;

 K_SEM_DEFINE(cpuid_sema, 0, 1);
 K_SEM_DEFINE(sema, 0, 1);
@@ -647,12 +647,12 @@ void ztest_post_fatal_error_hook(unsigned int reason, const z_arch_esf_t *pEsf)
        } else {
                child_thread_id = curr_cpu();
        }
+
+       irq_unlock(key);
 }

 void entry_oops(void *p1, void *p2, void *p3)
 {
-       unsigned int key;
-
        ztest_set_fault_valid(true);

        key = irq_lock();

@andyross
Contributor

This doesn't sound like a bug. The irq_lock()/irq_unlock() APIs are kernel APIs (they're also legacy things that aren't intended to be used in new code, FWIW). A kernel thread that has a fatal error is a system failure, that's not supposed to be a recoverable situation. Note that basically any API that has some kind of resource allocation/release structure is going to leak like this. It's true that the behavior of the locking under SMP is different from uniprocessors, because the latter will end up "forgetting" the lock state on the next context switch. I'm just not sure why that's a bug.

Broadly:

  1. Don't use irq_lock() in code that may be run in SMP contexts (precisely because this global lock thing is an expensive abstraction). Use spinlocks.
  2. Don't expect to recover the system from a kernel thread oops beyond verifying the execution of its fatal error handler.

@andyross
Contributor

Just to clarify: none of our other "lock-like" APIs have the behavior desired here. If you terminate a SMP thread that holds a spinlock, it stays locked forever. If you terminate a thread that holds a k_mutex, nothing releases the mutex. If you terminate a thread expected to k_sem_give() a semaphore, the give will never happen. Ditto sys_sem, sys_mutex, k_futex, etc...[1].

The reason this used to work is that the lowest level irq_lock() always worked (on most architectures) just by setting a single bit of data in processor state indicating "interrupts are masked", which isn't enough to survive a context switch. So it could get "released" by switching to a new thread. That was effectively a bug in the old API.

[1] Notable corollary: we have way too many locking abstractions.

enjiamai pushed a commit to enjiamai/zephyr that referenced this issue May 12, 2021
Update testcase test_fatal_on_smp(), and refine it and correct some
inappropriate usage such as unnecessary irq_lock(). This prevents
the error propagation to the later executing testcase.

Fixes zephyrproject-rtos#35200
Fixes zephyrproject-rtos#35202

Signed-off-by: Enjia Mai <[email protected]>
nashif pushed a commit that referenced this issue May 12, 2021