sanitycheck --coverage: stack overflows on qemu_x86 and mps2_an385 #14500
Comments
Running a case manually (e.g. kernel.common) works fine, so a recent commit has probably caused this issue. I'm looking into it. |
I can reproduce the issue with the tests/kernel/workq/work_queue_api case when CONFIG_COVERAGE=y; with CONFIG_COVERAGE=n it does not happen. Checking the output against the zephyr.lst file, address 0x20018014 is in sys_work_q_stack and 0x20014c58 is the thread ID of k_sys_work_q.thread. |
@wentongwu I don't understand well what the compiler is doing when profiling is enabled. I saw your email earlier this morning but I haven't had a chance to analyze it yet. What code is at address 0x20fe0 which is accessing this data? |
Looking at the code there, 0x20fe4 - 0x20ff0 seems to be code added by the compiler for gcov, and the instruction at 0x20fe0 seems to be a stack push. I'm not sure whether this is a stack overflow. @andrewboie |
Can you backtrace to where in the source code this is coming from? |
@andrewboie I will do that. After I increased the system workqueue's stack size, the system workqueue thread no longer hits an MPU fault, but another thread now triggers one; I'm trying to increase that thread's stack size as well. Could you also give me some pointers on Zephyr's memory protection strategy? Thanks. |
I ran the same test on qemu_x86; there it fails with Stack Check Fail, as in #14499, instead of an MPU fault. On mps2_an385, after enlarging the thread stacks, some of the failing tests pass, but the others still hit MPU faults. |
Going back to the older version 9072d34, the test passes with SDK 0.9.5 but still hits an MPU fault with SDK 0.10.0. Both runs set -DCONFIG_COVERAGE=y. |
With the latest code and SDK 0.9.5, the test also passes with -DCONFIG_COVERAGE=y. Before the test, I applied the patch below: -set(REQUIRED_SDK_VER 0.10.0) |
@wentongwu Confirmed: after switching to SDK 0.9.5, sanitycheck --coverage works. Otherwise I get a lot of handler crashes and timeouts. |
Even if the errors are different between the two targets, I have a suspicion that this is a stack overflow in both cases. The new GCC may be using a lot more stack space when profiling is enabled in the 0.10 SDK. Some things to investigate:
Stack overflows are detected via MPU faults on ARM. The code is supposed to do some checks to see if an MPU fault is due to a stack overflow instead of some other reason, and report Stack Check Fail, but it might not be working correctly. |
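The check described above can be sketched roughly as follows. This is a minimal standalone illustration, not Zephyr's actual fault handler: `struct thread_stack_info`, `fault_is_stack_overflow` and the addresses used with it are hypothetical, and the sketch assumes a descending stack with a guard region at the lowest addresses of the stack object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one thread's stack object (not a Zephyr type). */
struct thread_stack_info {
	uintptr_t start;     /* lowest address of the stack object */
	uint32_t guard_size; /* bytes at 'start' protected by the MPU guard */
	uint32_t size;       /* total size of the object, guard included */
};

/*
 * Return true when the faulting data address lies inside the guard region.
 * A descending stack that overflows reaches this region first, so a fault
 * here is reported as a stack overflow rather than a generic MPU fault.
 */
bool fault_is_stack_overflow(const struct thread_stack_info *info,
			     uintptr_t fault_addr)
{
	assert(info != NULL);
	assert(info->guard_size <= info->size);

	return fault_addr >= info->start &&
	       fault_addr < info->start + info->guard_size;
}
```

A check of this kind only recognises accesses that land inside the guard. A function with a large stack frame can step over a small guard entirely, in which case the fault (if any) is reported as an ordinary memory fault.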
Thanks @andrewboie. By the way, as shown below, the new compiler consumes more stack space on a function call. Old compiler: |
----> Yes, these 32 bytes are allocated as part of the real stack; this is called the MPU guard region. The layout is as below:
----> This MPU region (MPU_GUARD_ALIGN_AND_SIZE = 32) is configured as read-only and non-executable (_K_MEM_PARTITION_P_RO_U_NA). |
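To make the arithmetic concrete, here is a small sketch of how a 32-byte guard carved out of the stack object reduces the space a thread can actually use. Only `MPU_GUARD_ALIGN_AND_SIZE = 32` comes from the comment above; `struct stack_layout`, `stack_layout_compute` and `stack_usable_bytes` are hypothetical helpers, and placing the guard at the lowest addresses of the object is an assumption based on a descending stack.

```c
#include <assert.h>
#include <stdint.h>

/* Guard size quoted in the thread. ARMv7-M MPU region sizes are powers of
 * two with a minimum of 32 bytes, and regions are aligned to their size. */
#define MPU_GUARD_ALIGN_AND_SIZE 32u

_Static_assert((MPU_GUARD_ALIGN_AND_SIZE & (MPU_GUARD_ALIGN_AND_SIZE - 1u)) == 0u,
	       "guard size must be a power of two");

/* Hypothetical layout record, for illustration only. */
struct stack_layout {
	uintptr_t guard_start;  /* first byte of the guard (inclusive) */
	uintptr_t usable_start; /* first writable byte, i.e. end of the guard */
	uintptr_t initial_sp;   /* one past the highest usable byte */
};

/* Split a stack object into guard and usable parts (assumed layout). */
struct stack_layout stack_layout_compute(uintptr_t obj_start, uint32_t obj_size)
{
	struct stack_layout l;

	assert(obj_size > MPU_GUARD_ALIGN_AND_SIZE);
	assert((obj_start % MPU_GUARD_ALIGN_AND_SIZE) == 0u);

	l.guard_start = obj_start;
	l.usable_start = obj_start + MPU_GUARD_ALIGN_AND_SIZE;
	l.initial_sp = obj_start + obj_size;
	return l;
}

/* Bytes the thread can use before its stack pointer reaches the guard. */
uint32_t stack_usable_bytes(const struct stack_layout *l)
{
	return (uint32_t)(l->initial_sp - l->usable_start);
}
```

Because the guard is taken out of the object itself, any extra stack consumed by coverage instrumentation has to fit in the remaining usable bytes, not in the full declared size.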
Yeah, that's a stack overflow then. We'll need to fix the ARM code that reports overflows, @ioannisg |
Yes, as in my previous comment, increasing the stack makes some cases pass, but a new error appears. Checking the code, it looks like a stack issue again: judging from the line "Stacking error (context area might be not valid)", the fault happens during exception stacking, after which the same check runs and the thread is aborted. Maybe we should increase the exception stack. Let me review the code of the failing case to see whether an exception or IRQ is involved. Running test suite workqueue_api starting test - test_user_workq_start_before_submit |
0x2002880c is located in .priv_stacks.noinit. |
mis-click |
Indeed, this is controlled by CONFIG_ISR_STACK_SIZE |
@andrewboie I ran the test; increasing the exception stack does not seem to help. The failing code tries to access priv_stack, and I don't understand this stack well yet... |
It works after increasing CONFIG_PRIVILEGED_STACK_SIZE. I will review the code and test further. |
After adjusting the various stack sizes, the timeout issue on mps2_an385 appears to be fixed. An overnight test is running, and I'm preparing the patch. Thanks. |
@wentongwu: How do I increase CONFIG_PRIVILEGED_STACK_SIZE, and to what value? Many thanks |
I will submit a PR for this issue today, but mps2_an385 will take some more time. I have already submitted a PR for qemu_x86. Thanks. |
one more #15148 |
With SDK 0.10.0, more stack is consumed when coverage is enabled on the qemu_x86 and mps2_an385 platforms. Adjust the stack size for most of the test cases; otherwise there will be stack overflows. Fixes: zephyrproject-rtos#14500. Signed-off-by: Wentong Wu <[email protected]>
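One way a test could size its stack so that coverage builds get extra room is sketched below. This is an illustration under stated assumptions, not the content of the patches referenced here: `BASE_STACK_SIZE`, `COVERAGE_MARGIN` and `test_stack_size` are hypothetical, the numeric values are placeholders, and whether the actual fix uses `CONFIG_TEST_EXTRA_STACKSIZE` is not stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback so the sketch builds outside a Zephyr tree. In a real build the
 * value would come from Kconfig; 0 here is an arbitrary placeholder. */
#ifndef CONFIG_TEST_EXTRA_STACKSIZE
#define CONFIG_TEST_EXTRA_STACKSIZE 0
#endif

/* Hypothetical base size and coverage margin, chosen only for illustration. */
#define BASE_STACK_SIZE 1024

#ifdef CONFIG_COVERAGE
#define COVERAGE_MARGIN 1024 /* extra room for gcov-instrumented code */
#else
#define COVERAGE_MARGIN 0
#endif

#define TEST_STACK_SIZE \
	(BASE_STACK_SIZE + COVERAGE_MARGIN + CONFIG_TEST_EXTRA_STACKSIZE)

/* Keep the total a multiple of 8 so the stack stays 8-byte aligned. */
_Static_assert((TEST_STACK_SIZE % 8) == 0, "stack size must be 8-byte aligned");

/* Expose the computed size so it can be inspected or logged at run time. */
size_t test_stack_size(void)
{
	return (size_t)TEST_STACK_SIZE;
}
```

Deriving the margin from the build configuration keeps non-coverage builds at their original footprint, so the extra RAM is only paid when instrumentation is actually enabled.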
Describe the bug
We have too many timeouts when running sanitycheck --coverage on mps2_an385. The handler.log files generated by sanitycheck are empty.
Impact
Blocking code coverage improvement work.
To Reproduce and screenshots
Environment (please complete the following information):