Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qemu_cortex_a53_smp: tests/ztest/error_hook failed after enabling the FPU context switching support #34844

Closed
enjiamai opened this issue May 5, 2021 · 17 comments · Fixed by #34846
Assignees
Labels
area: ARM64 ARM (64-bit) Architecture bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Milestone

Comments

@enjiamai
Copy link
Collaborator

enjiamai commented May 5, 2021

Describe the bug
The testcase tests/ztest/error_hook failed after the FPU context switching support. This failure is not 100% reproduced.
And after git bisect, we found the issue started after commit: a82fff0, or f1f63dd.

To Reproduce
Steps to reproduce the behavior:
Run command:
twister -T tests/ztest/error_hook/ -p qemu_cortex_a53_smp
or
west build tests/ztest/error_hook/ -b qemu_cortex_a53_smp -t run -p always -DCONFIG_TEST_USERSPACE=y

Expected behavior
A test case pass as expected, and the expected log should like this:

===================================================================
START - test_catch_z_oops
E: ELR_ELn: 0x000000004000258c
E: ESR_ELn: 0x0000000056000000
E:   EC:  0x15 (Unknown)
E:   IL:  0x1
E:   ISS: 0x0
E: TPIDRRO: 0x000000004002f5b0
E: x0:  0x00000000140003ff  x1:  0x0000000000000000
E: x2:  0x00000000400833ac  x3:  0x0000000000000002
E: x4:  0x0000000000000000  x5:  0x0000000000000000
E: x6:  0x0000000000000000  x7:  0x00000000644d5241
E: x8:  0x0000000000000000  x9:  0x0000000000000000
E: x10: 0x0000000000000000  x11: 0x0000000000000000
E: x12: 0x0000000000000000  x13: 0x0000000000000000
E: x14: 0x0000000000000000  x15: 0x0000000000000000
E: x16: 0x0000000000000000  x17: 0x0000000000000000
E: x18: 0x0000000000000000  x30: 0x0000000000000000
E: >>> ZEPHYR FATAL ERROR 3: Kernel oops on CPU 0
E: Current thread: 0x4002e700 (ztest_thread)
Caught system error -- reason 3 1
Fatal error expected as part of test case.
 PASS - test_catch_z_oops in 0.1 seconds

Impact
It blocked the CI where the submitted PR run testcase tests/ztest/error_hook.

Logs and console output
Here is error log shows in twister-out/qemu_cortex_a53_smp/tests/ztest/error_hook/testing.ztest.error_hook/handler.log:

=================================================================
START - test_catch_z_oops

or

=================================================================
START - test_catch_z_oops
E: ELR_ELn: 0x000000004001d078
E: ESR_ELn: 0x000000008600000f
E:   EC:  0x21 (Instruction Abort taken without a change in Exception level.)
E:   IL:  0x1
E:   ISS: 0xf
E: FAR_ELn: 0x000000004001d078
E: TPIDRRO: 0x010000004002f5e0
E: x0:  0x00000000fffffff5  x1:  0x0000000000000000
E: x2:  0x0000000000000000  x3:  0x0000000000000000
E: x4:  0x0000000040007e54  x5:  0x000000004000152c
E: x6:  0x000000050000000c  x7:  0x0000000000000000
E: x8:  0x0000000000000000  x9:  0x0000000000000005
E: x10: 0x000000050000000c  x11: 0x0000000000000000
E: x12: 0x0000000040080fb0  x13: 0x0000000040007e90
E: x14: 0x0000000040080fe0  x15: 0x00000000400025a8
E: x16: 0x000000004002e700  x17: 0x0000000000000000
E: x18: 0x0000000000000000  x30: 0x000000004001d078
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x4002e700 (ztest_thread)
Caught system error -- reason 0 0
Fatal error was unexpected, aborting...

Environment (please complete the following information):

  • OS: ubuntu 18.04
  • Toolchain Zephyr SDK 0.12.4
  • Commit 8685143 , or commit after a82fff0

Additional context
N/A

@enjiamai enjiamai added the bug The issue is a bug, or the PR is fixing a bug label May 5, 2021
enjiamai pushed a commit to enjiamai/zephyr that referenced this issue May 5, 2021
Put the testcase test_catch_assert_in_isr() to execute last, to prevent
it affects other test cases. Because when we caught an assert failure
in the ISR handler, it cannot be guaranteed that all the current
program status would be recovered.

Fixes zephyrproject-rtos#34844.

Signed-off-by: Enjia Mai <[email protected]>
@enjiamai enjiamai assigned nashif and unassigned npitre May 5, 2021
@enjiamai
Copy link
Collaborator Author

enjiamai commented May 5, 2021

Hi @nashif, I found this is due to the assertion fail from ISR in the previous executed test case, and it might not recover all the status of the program. So I put the test_catch_assert_in_isr() in the last to prevent it affects the following test case.

@galak
Copy link
Collaborator

galak commented May 5, 2021

Possibly whatever is happening here might impact #34838

@npitre
Copy link
Collaborator

npitre commented May 5, 2021 via email

@enjiamai
Copy link
Collaborator Author

enjiamai commented May 5, 2021

And in this case I'm unable to reproduce when building and running the test via west. It requires twister to fail. What's the difference?

Hi, @npitre , in this case west command needs a extra -DCONFIG_TEST_USERSPACE=y to keep in concert with twister.
After looking at #34838, I'm not very sure, but it seems like a different issue, @galak, thanks!

@npitre
Copy link
Collaborator

npitre commented May 6, 2021 via email

ioannisg pushed a commit that referenced this issue May 6, 2021
Put the testcase test_catch_assert_in_isr() to execute last, to prevent
it affects other test cases. Because when we caught an assert failure
in the ISR handler, it cannot be guaranteed that all the current
program status would be recovered.

Fixes #34844.

Signed-off-by: Enjia Mai <[email protected]>
@enjiamai
Copy link
Collaborator Author

enjiamai commented May 6, 2021

Hi @npitre , I can reproduce it on my side. I just use the command:

west build tests/ztest/error_hook/ -b qemu_cortex_a53_smp -p always -t run -DCONFIG_TEST_USERSPACE=y

Did you use the same command as mine? thanks!
By the way, could you please tell me which commit ID you use? I can also try it.

@npitre
Copy link
Collaborator

npitre commented May 6, 2021 via email

@enjiamai
Copy link
Collaborator Author

enjiamai commented May 6, 2021

Excuse me, @npitre, could you please tell me the commit ID you use, let me also have a try, thanks~

@npitre
Copy link
Collaborator

npitre commented May 6, 2021 via email

@enjiamai
Copy link
Collaborator Author

enjiamai commented May 6, 2021

Here are my trials of running by "west build tests/ztest/error_hook/ -b qemu_cortex_a53_smp -p always -t run -DCONFIG_TEST_USERSPACE=y"

f5e3d89 (ztest order workaround) west 20/20 pass, by twister pass
2b864ed (last before workaround) west 20/15 pass, by twister failed
5f423d4 (commit you are using) west 20/19 pass, by twister failed
f1f63dd (enable FPU context) west 20/9 pass, by twister failed
9136c40 (last before enable FPU) west 20/20 pass, by twister pass

It was sometimes to reproduce very easily, but sometimes not, by west. By twister, it's very easy to reproduce.
I also compare their .config between west and twister, but they are the same.

I think this needs some time to take look at it.

@npitre npitre reopened this May 6, 2021
@npitre
Copy link
Collaborator

npitre commented May 6, 2021

Let's keep this open until we can explain the root issue.

@npitre npitre added area: ARM64 ARM (64-bit) Architecture priority: low Low impact/importance bug labels May 6, 2021
@enjiamai
Copy link
Collaborator Author

enjiamai commented May 6, 2021

@npitre But I also want to highlight that when an assertion happened, especially in ISR, in a normal situation program should be terminated. Recover it here then keep executing the program only for testing purposes.

@npitre
Copy link
Collaborator

npitre commented May 6, 2021 via email

@enjiamai enjiamai assigned npitre and enjiamai and unassigned nashif May 8, 2021
@galak galak added this to the v2.6.0 milestone May 11, 2021
@enjiamai
Copy link
Collaborator Author

Hi @npitre, why the test case sometimes failed is because when we intentionally trigger an assertion in ISR, then we did not leave ISR context gracefully. Though we release some resources such as semaphore or spinlock before the test_main thread end, it is still in ISR context ztest thread until some point. Currently, our ztest error hook does not have a mechanism to return back from the ISR and recover the stack.

For most of the platforms, this might be just enough to pass the test case. But in the SMP environment, it is possible to cause a problem if the ISR context does not exit immediately and another fatal error trigger at the same time.

Though put this kind of testcase in the last, or make the testcase run with 1cpu can avoid this but the best solution here I think is, to remove this kind of usage unless we have a good recovery mechanism for the testcases.
So I submitted a #35353 to remove this testcase, thanks.

enjiamai pushed a commit to enjiamai/zephyr that referenced this issue May 17, 2021
This testcase shows triggering an assertion in ISR intentionally, for
verifying the assertion works in our code. But currently, the ztest
error hook doesn't have a mechanism to fully recover from ISR context
for different platforms. It needs to recover the resource being hold
and exit ISR context, otherwise, the program will not stable enough to
execute the following testcases. We already submitted a workaround
PR#34846 for it by putting these kinds of testcases executing last.

Anyway, we recommend not to use it this way unless our ztest error
hook mechanism can be refined to handle this recovery completely.
So we tend to remove this testcase at this moment.

Fixes zephyrproject-rtos#34844

Signed-off-by: Enjia Mai <[email protected]>
@andyross
Copy link
Contributor

Came here to comment that this sounds like almost exactly the issue detailed in #35307, but it looks like @npitre is already here and aware.

@enjiamai
Copy link
Collaborator Author

@andyross Thank you so much! Like you just mentioned in #35307 , they might be the same cause. I verified the related PR #35410 will fix this issue. @npitre .

@npitre
Copy link
Collaborator

npitre commented May 21, 2021

#35410 is now merged

@npitre npitre closed this as completed May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: ARM64 ARM (64-bit) Architecture bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
5 participants