
smp atomic_t global_lock will never be cleared when a thread oopses with global_lock set #35202

Closed
IRISZZW opened this issue May 12, 2021 · 6 comments · Fixed by #35232
Labels
area: SMP Symmetric multiprocessing bug The issue is a bug, or the PR is fixing a bug

Comments

@IRISZZW
Contributor

IRISZZW commented May 12, 2021

Describe the bug

When CONFIG_SMP is enabled, calling irq_lock() invokes z_smp_global_lock(), which sets global_lock:

zephyr/kernel/smp.c

Lines 16 to 28 in 8e69daf

unsigned int z_smp_global_lock(void)
{
	unsigned int key = arch_irq_lock();

	if (!_current->base.global_lock_count) {
		while (!atomic_cas(&global_lock, 0, 1)) {
		}
	}

	_current->base.global_lock_count++;

	return key;
}

Calling irq_unlock() invokes z_smp_global_unlock(), which clears global_lock:

zephyr/kernel/smp.c

Lines 30 to 41 in 8e69daf

void z_smp_global_unlock(unsigned int key)
{
	if (_current->base.global_lock_count) {
		_current->base.global_lock_count--;

		if (!_current->base.global_lock_count) {
			atomic_clear(&global_lock);
		}
	}

	arch_irq_unlock(key);
}

If a thread calls irq_lock() and then oopses, global_lock will never be cleared.

I found this bug while debugging issue #35200.

In the test case test_fatal_on_smp, the thread entry_oops first calls irq_lock(), then oopses:

void entry_oops(void *p1, void *p2, void *p3)
{
	unsigned int key;

	ztest_set_fault_valid(true);

	key = irq_lock();
	k_oops();
	TC_ERROR("SHOULD NEVER SEE THIS\n");
	rv = TC_FAIL;
	irq_unlock(key);
}

Then in the test case test_smp_release_global_lock_irq, the thread t2_mutex_lock_with_irq calls irq_lock() again:

static void t2_mutex_lock_with_irq(void *p1, void *p2, void *p3)
{
	int key;

	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);

	key = irq_lock();
	k_mutex_lock((struct k_mutex *)p1, K_FOREVER);
	zassert_equal(_current->base.global_lock_count, 1,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);
	k_mutex_unlock((struct k_mutex *)p1);

	/**TESTPOINT: Though the irq has not been locked yet, but the
	 * global_lock_cnt has been decreased during context switch.
	 */
	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);

	irq_unlock(key);
	zassert_equal(_current->base.global_lock_count, 0,
		      "thread global lock cnt %d is incorrect",
		      _current->base.global_lock_count);
}

Because global_lock is still set, irq_lock() spins forever.

Impact
#35200

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK
  • Commit: 8e69daf
@IRISZZW IRISZZW added bug The issue is a bug, or the PR is fixing a bug area: SMP Symmetric multiprocessing labels May 12, 2021
@IRISZZW
Contributor Author

IRISZZW commented May 12, 2021

@enjiamai @nashif I found this issue with your test cases, a9edb1f

@enjiamai
Collaborator

Hi @IRISZZW, thanks for your message and the debugging. I suspect the test case test_fatal_on_smp() from commit fa2724b is the trigger. Could you please help me try the change below on your hsdk board? I appreciate your help, thank you!

diff --git a/tests/kernel/smp/src/main.c b/tests/kernel/smp/src/main.c
index 342f9a79de..b5db76c0e9 100644
--- a/tests/kernel/smp/src/main.c
+++ b/tests/kernel/smp/src/main.c
@@ -1045,7 +1045,7 @@ void test_main(void)
                         ztest_unit_test(test_wakeup_threads),
                         ztest_unit_test(test_smp_ipi),
                         ztest_unit_test(test_get_cpu),
-                        ztest_unit_test(test_fatal_on_smp),
+                        /*ztest_unit_test(test_fatal_on_smp),*/
                         ztest_unit_test(test_workq_on_smp),
                         ztest_unit_test(test_smp_release_global_lock),
                         ztest_unit_test(test_smp_release_global_lock_irq),

@IRISZZW
Contributor Author

IRISZZW commented May 12, 2021

Without test_fatal_on_smp, it works:

*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***
Running test suite smp
===================================================================
START - test_smp_coop_threads
 PASS - test_smp_coop_threads in 0.563 seconds
===================================================================
START - test_cpu_id_threads
 PASS - test_cpu_id_threads in 1.11 seconds
===================================================================
START - test_coop_resched_threads
 PASS - test_coop_resched_threads in 0.51 seconds
===================================================================
START - test_preempt_resched_threads
 PASS - test_preempt_resched_threads in 0.531 seconds
===================================================================
START - test_yield_threads
 PASS - test_yield_threads in 0.301 seconds
===================================================================
START - test_sleep_threads
 PASS - test_sleep_threads in 1.11 seconds
===================================================================
START - test_wakeup_threads                                                                            
 PASS - test_wakeup_threads in 0.201 seconds                                                           
===================================================================                                    
START - test_smp_ipi                                                                                   
cpu num=4 PASS - test_smp_ipi in 0.331 seconds                                                         
===================================================================                                    
START - test_get_cpu                                                                                   
 PASS - test_get_cpu in 0.51 seconds                                                                   
===================================================================                                    
START - test_fatal_on_smp                                                                              
E: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 3                                                      
E: Current thread: 0x9000be58 (ztest_thread)                                                           
Caught system error -- reason 3 1                                                                      
Fatal error expected as part of test case.                                                             
 PASS - test_fatal_on_smp in 0.16 seconds                                                              
===================================================================                                    
START - test_workq_on_smp                                                                              
 PASS - test_workq_on_smp in 0.51 seconds                                                              
===================================================================                                    
START - test_smp_release_global_lock                                                                   
 PASS - test_smp_release_global_lock in 0.21 seconds                                                   
===================================================================                                    
START - test_smp_release_global_lock_irq                                                               
I: cy8c95xx_port1 init ok                                                                              
I: cy8c95xx_port0 init ok                                                                              
*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***                                           
Running test suite smp                                                                                 
===================================================================                                    
START - test_smp_coop_threads                                                                          
 PASS - test_smp_coop_threads in 0.563 seconds                                                         
I: cy8c95xx_port1 init ok
I: cy8c95xx_port0 init ok
*** Booting Zephyr OS build v2.6.0-rc1-60-g41afd838aabd  ***
Running test suite smp
===================================================================
START - test_smp_coop_threads
 PASS - test_smp_coop_threads in 0.563 seconds
===================================================================
START - test_cpu_id_threads
 PASS - test_cpu_id_threads in 1.11 seconds
===================================================================
START - test_coop_resched_threads
 PASS - test_coop_resched_threads in 0.51 seconds
===================================================================
START - test_preempt_resched_threads
 PASS - test_preempt_resched_threads in 0.531 seconds
===================================================================
START - test_yield_threads
 PASS - test_yield_threads in 0.301 seconds
===================================================================
START - test_sleep_threads
 PASS - test_sleep_threads in 1.11 seconds
===================================================================
START - test_wakeup_threads
 PASS - test_wakeup_threads in 0.201 seconds
===================================================================
START - test_smp_ipi
cpu num=4 PASS - test_smp_ipi in 0.331 seconds
===================================================================
START - test_get_cpu
 PASS - test_get_cpu in 0.51 seconds
===================================================================
START - test_workq_on_smp
 PASS - test_workq_on_smp in 0.51 seconds
===================================================================
START - test_smp_release_global_lock
 PASS - test_smp_release_global_lock in 0.21 seconds
===================================================================
START - test_smp_release_global_lock_irq
 PASS - test_smp_release_global_lock_irq in 0.21 seconds
===================================================================
START - test_inc_concurrency
type 0: cnt 30116, spend 5 ms
type 1: cnt 60000, spend 47 ms
type 2: cnt 60000, spend 1619 ms
type 3: cnt 60000, spend 1411 ms
 PASS - test_inc_concurrency in 3.92 seconds
===================================================================
Test suite smp succeeded
===================================================================
PROJECT EXECUTION SUCCESSFUL

@enjiamai
Collaborator

Excuse me @IRISZZW, could you please help try this again on your hsdk board? Thank you so much!

diff --git a/tests/kernel/smp/src/main.c b/tests/kernel/smp/src/main.c
index 342f9a79de..7464455edb 100644
--- a/tests/kernel/smp/src/main.c
+++ b/tests/kernel/smp/src/main.c
@@ -33,7 +33,7 @@ volatile int sync_count = -1;
 static int main_thread_id;
 static int child_thread_id;
 volatile int rv;
-
+static unsigned int key;

 K_SEM_DEFINE(cpuid_sema, 0, 1);
 K_SEM_DEFINE(sema, 0, 1);
@@ -647,12 +647,12 @@ void ztest_post_fatal_error_hook(unsigned int reason, const z_arch_esf_t *pEsf)
        } else {
                child_thread_id = curr_cpu();
        }
+
+       irq_unlock(key);
 }

 void entry_oops(void *p1, void *p2, void *p3)
 {
-       unsigned int key;
-
        ztest_set_fault_valid(true);

        key = irq_lock();

@andyross
Contributor

This doesn't sound like a bug. The irq_lock()/irq_unlock() APIs are kernel APIs (they're also legacy things that aren't intended to be used in new code, FWIW). A kernel thread that has a fatal error is a system failure, that's not supposed to be a recoverable situation. Note that basically any API that has some kind of resource allocation/release structure is going to leak like this. It's true that the behavior of the locking under SMP is different from uniprocessors, because the latter will end up "forgetting" the lock state on the next context switch. I'm just not sure why that's a bug.

Broadly:

  1. Don't use irq_lock() in code that may be run in SMP contexts (precisely because this global lock thing is an expensive abstraction). Use spinlocks.
  2. Don't expect to recover the system from a kernel thread oops beyond verifying the execution of its fatal error handler.

@andyross
Contributor

Just to clarify: none of our other "lock-like" APIs have the behavior desired here. If you terminate a SMP thread that holds a spinlock, it stays locked forever. If you terminate a thread that holds a k_mutex, nothing releases the mutex. If you terminate a thread expected to k_sem_give() a semaphore, the give will never happen. Ditto sys_sem, sys_mutex, k_futex, etc...[1].

The reason this used to work is that the lowest level irq_lock() always worked (on most architectures) just by setting a single bit of data in processor state indicating "interrupts are masked", which isn't enough to survive a context switch. So it could get "released" by switching to a new thread. That was effectively a bug in the old API.

[1] Notable corollary: we have way too many locking abstractions.

enjiamai pushed a commit to enjiamai/zephyr that referenced this issue May 12, 2021
Update testcase test_fatal_on_smp(), and refine it and correct some
inappropriate usage such as unnecessary irq_lock(). This prevents
the error propagation to the later executing testcase.

Fixes zephyrproject-rtos#35200
Fixes zephyrproject-rtos#35202

Signed-off-by: Enjia Mai <[email protected]>
nashif pushed a commit that referenced this issue May 12, 2021