Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption for newlib-nano with float printf and disabled heap #30055

Closed
martinjaeger opened this issue Nov 16, 2020 · 25 comments · Fixed by #35799
Closed

Memory corruption for newlib-nano with float printf and disabled heap #30055

martinjaeger opened this issue Nov 16, 2020 · 25 comments · Fixed by #35799
Assignees
Labels
area: C Library C Standard Library bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Milestone

Comments

@martinjaeger
Copy link
Member

martinjaeger commented Nov 16, 2020

Describe the bug

We are facing a strange issue in our project based on STM32L072 with 20k of RAM. If certain features are enabled such that most of the RAM is consumed, float variables in printf statements (using newlib nano) get replaced by random junk characters. Printing of integers works fine. Also printk with recently added float support (cbprintf) works fine.

Example code:

printf("Junk: %.2f and maybe not: %d\n", 123.4F, 1234);

Results in:

Junk: <0xb0>@<break>
. <0x9d> and maybe not: 1234

To Reproduce

Haven't been able to generate a minimum working example to reproduce the issue, as it disappears if too much of the code is removed. However, it does not seem to be an issue in the application firmware itself. The issue happens in different threads and stack usage is still quite low (because I put all threads immediately into k_sleep(K_FOREVER) to exclude possible application firmware bugs):

I: Thread analyze:
I:  thread_analyzer     : unused 80 usage 432 / 512 (84 %)
I:  serial_thread_id    : unused 1160 usage 120 / 1280 (9 %)
I:  leds_thread         : unused 104 usage 152 / 256 (59 %)
I:  gsm                 : unused 1208 usage 216 / 1424 (15 %)
I:  control_thread_id   : unused 912 usage 112 / 1024 (10 %)
I:  idle 00             : unused 188 usage 68 / 256 (26 %)
I:  main                : unused 792 usage 232 / 1024 (22 %)

Possible root cause and workaround

Our application doesn't use the heap. Since PR #28486, the RAM reserved for the heap seems to be garbage-collected away in that case (independent of the value of CONFIG_HEAP_MEM_POOL_SIZE) and can be reused for the stack.

However, newlib requires malloc if printf is used with %f: http://www.nadler.com/embedded/newlibAndFreeRTOS.html

If I add a line void *mem_test = k_malloc(4); to the code, Zephyr compiles in the heap management again and the issue is gone.

I'm not 100% sure if the above is really the root cause or if it made the issue disappear by coincidence, but it looks plausible to me. Maybe someone with more insight into newlib internals can confirm.

This link posted by @pabigot on Slack might also be relevant: https://stackoverflow.com/questions/28746062/snprintf-prints-garbage-floats-with-newlib-nano

Ping @nashif @dcpleung @andrewboie @carlescufi as you were involved in mentioned PR.

@martinjaeger martinjaeger added the bug The issue is a bug, or the PR is fixing a bug label Nov 16, 2020
@nashif nashif added the priority: medium Medium impact/importance bug label Nov 17, 2020
@carlescufi carlescufi added the area: C Library C Standard Library label Nov 17, 2020
@andyross
Copy link
Contributor

IIRC the newlib heap and the Zephyr system heap aren't the same memory, right? Maybe the bug here is a linkage problem where the newlib heap memory is clobbering something adjacent?

@martinjaeger
Copy link
Member Author

Yes, that could be the case. Do you know how I can find out which memory location newlib uses as its heap?

@andyross
Copy link
Contributor

I dug around here a little, but obviously without an exerciser there's not a lot I can do. In fact yes, as I (vaguely) remember the newlib malloc() heap is an sbrk-style thing that is a physically separate region of memory from the Zephyr heap. So despite the fact that they both say "malloc" these regions shouldn't interact at all.

Can you post the zephyr.map files from your minimally-changed "working" and "failing" cases? I still strongly suspect this is going to turn out to be some kind of linker interaction, plausibly alignment of the heap that interacts with pointer math somewhere in the soft float implementation?

@martinjaeger
Copy link
Member Author

@andyross sorry for the delayed response. Since my upgrade to Zephyr SDK v0.12.0 I was not able to reproduce the issue anymore. Instead, I get the following error printed to the serial console now:

assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

A colleage compiled our firmware with exactly the same settings on Windows and could still reproduce the issue. So here is a bunch of map files: https://nextcloud.libre.solar/s/CjeXZZt7AYKjEBJ

  • zephyr_balloc_assert.map: Compiled with SDK v0.12 and leading to above error
  • zephyr_junk_char.map: Same firmware configuration compiled under Windows, leading to junk character output
  • zephyr_no_junk.map: Configuration slightly changed such that the floats are printed correctly (also compiled under Windows)

The issue seems to be independent of the Zephyr version. I get the same behavior with v2.4-branch and v2.5-RC1.

@martinjaeger
Copy link
Member Author

Some additional information: The newlib version of my colleague where the issue appeared is 3.1.0, where as Zephyr SDK v0.12 updated newlib to 3.3.0.

This seems to be relevant to the problem: https://census-labs.com/news/2020/01/31/multiple-null-pointer-dereference-vulnerabilities-in-newlib/

So maybe the remaining problem is only to create a linker error if not enough space is left for the newlib heap, if that makes sense?

@andyross
Copy link
Contributor

andyross commented Mar 8, 2021

Grooming bugs: sorry, can you confirm that a 3.3.0 newlib fixes the problem? Or you just suspect it might?

@andyross
Copy link
Contributor

andyross commented Mar 8, 2021

And serendipitously, I just ran into #33164 which very well might be the root cause if your application has multiple threads trying to use that heap.

@martinjaeger
Copy link
Member Author

Grooming bugs: sorry, can you confirm that a 3.3.0 newlib fixes the problem? Or you just suspect it might?

I would not say it fixes the problem completely. I can confirm that we don't see the junk characters anymore with newlib 3.3.0. But we get above assertion message instead. Ideally, the linker would claim sufficient memory for newlib such that we see at compile-time that we run out of space.

Regarding the potential race condition in newlib: I'm not sure if that's really the root cause. I suspended all threads except for one with k_sleep(K_FOREVER). So there should not be multiple threads calling printf at the same time. I'm happy to verify it once a fix for newlib synchronization is in place.

@github-actions
Copy link

github-actions bot commented May 9, 2021

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label May 9, 2021
@galak
Copy link
Collaborator

galak commented May 18, 2021

@martinjaeger Can you re-test this to see if #35227 might have helped.

@martinjaeger
Copy link
Member Author

@galak No, don't think this has brought the solution. I compiled our firmware with the main stack size increased such that 99% of the RAM is consumed. Now, a printf at the beginning of the code doesn't get printed at all. After sending a command to the device via the serial it prints exit and restarts (without any further fault indication). So there must be still some function that uses memory locations it's not supposed to use.

@martinjaeger
Copy link
Member Author

I've created a minimum non-working example with most recent Zephyr branch:
martinjaeger@3bf40cb

$ west build -b nucleo_l073rz samples/hello_world/
[3/9] Linking C executable zephyr/zephyr_prebuilt.elf

[9/9] Linking C executable zephyr/zephyr.elf
Memory region         Used Size  Region Size  %age Used
           FLASH:       33752 B       192 KB     17.17%
            SRAM:       19920 B        20 KB     97.27%
        IDT_LIST:          0 GB         2 KB      0.00%
$ west flash

Result on the serial console:

*** Booting Zephyr OS build v2.6.0-rc1-337-g3bf40cb11342  ***
Hello World! nucleo_l073rz assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

As soon as we decrease the main stack size by a few hundred bytes, everything works fine.

@galak
Copy link
Collaborator

galak commented May 25, 2021

@martinjaeger thanks for working up a test.

@carlescufi carlescufi assigned stephanosio and unassigned andyross May 25, 2021
@stephanosio
Copy link
Member

NOTE: The issue is reproducible with @martinjaeger's sample in qemu_cortex_m3 if you set CONFIG_MAIN_STACK_SIZE=61440.

@stephanosio stephanosio added this to the v2.6.0 milestone May 25, 2021
@martinjaeger
Copy link
Member Author

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

With CONFIG_MAIN_STACK_SIZE=16200:

*** Booting Zephyr OS build v2.6.0-rc1  ***
Hello World! nucleo_l073rz Ø.D. -2.

With CONFIG_MAIN_STACK_SIZE=1000:

*** Booting Zephyr OS build v2.6.0-rc1  ***
Hello World! nucleo_l073rz 3.140000

@stephanosio
Copy link
Member

stephanosio commented May 25, 2021

Analysis

By default, newlib heap starts at USED_RAM_END_ADDR = _end = _image_ram_end, which points to the memory address right after the end of the RAM region occupied by the Zephyr kernel, which includes the "main stack" area:

#define HEAP_BASE USED_RAM_END_ADDR

printf invokes various functions (e.g. dtoa), which internally call malloc and its friends (i.e. calloc) to allocate buffer memory, which in turn call _sbrk to increase the heap size as necessary:

Breakpoint 1, _sbrk (count=140) at /home/stephanos/Dev/zephyrproject/zephyr/lib/libc/newlib/libc-hooks.c:275
275             ptr = ((char *)HEAP_BASE) + heap_sz;
(gdb) bt
#0  _sbrk (count=140) at /home/stephanos/Dev/zephyrproject/zephyr/lib/libc/newlib/libc-hooks.c:275
#1  0x0000699c in _sbrk_r ()
#2  0x000066f2 in _malloc_r ()
#3  0x000090c0 in _calloc_r ()
#4  0x00006046 in _Balloc ()
#5  0x00006576 in __d2b ()
#6  0x00000e66 in _dtoa_r ()
#7  0x00008c68 in __cvt ()
#8  0x000058f0 in _printf_float ()
#9  0x00006948 in _vfprintf_r ()
#10 0x00005e9a in printf ()
#11 0x00000d62 in main () at ../src/main.c:13

_sbrk checks that the current heap size + requested expansion size (count) does not exceed MAX_HEAP_SIZE to ensure that the newlib heap region does not expand beyond the RAM area -- if it does, it will return -1, which in turn will cause malloc to fail and return NULL:

void *_sbrk(intptr_t count)
{
void *ret, *ptr;
ptr = ((char *)HEAP_BASE) + heap_sz;
if ((heap_sz + count) < MAX_HEAP_SIZE) {
heap_sz += count;
ret = ptr;
} else {
ret = (void *)-1;
}
return ret;
}

The default malloc algorithm implemented by newlib has a tendency to request large chunks through _sbrk and this fails when the memory available for the newlib heap is small (similar issue observed in #21167):

*** Booting Zephyr OS build v2.6.0-rc1-337-g3bf40cb11342  ***
_sbrk: SUCC: heap_sz = 0, count = 0
_sbrk: SUCC: heap_sz = 0, count = 420
_sbrk: FAIL: heap_sz = 420, count = 1032, MAX_HEAP_SIZE = 552
Hello World! qemu_cortex_m3 _sbrk: SUCC: heap_sz = 420, count = 24
_sbrk: FAIL: heap_sz = 444, count = 140, MAX_HEAP_SIZE = 552
assertion "Balloc succeeded" failed: file "/workdir/build/build_arm/.build/arm-zephyr-eabi/src/newlib/newlib/libc/stdlib/mprec.c", line 778
exit

The newlib internal __d2b function, which is called by printf, previously did not check the NULL return from malloc and this caused the function to access invalid memory address.

This problem was fixed in the recent commit:
zephyrproject-rtos/newlib-cygwin@f88aece

In summary, this was newlib's failure to NULL-check the pointer returned by malloc, which was fixed in the above commit, and not a problem with Zephyr per se.

@stephanosio
Copy link
Member

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

@martinjaeger
Copy link
Member Author

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

@stephanosio
Copy link
Member

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

Sorry, I misread. That commit is part of cygwin-3_1_0-release, not newlib-3.1.0, so seeing memory corruption with newlib 3.1.0 is to be expected.

@stephanosio
Copy link
Member

@martinjaeger Since this is not a Zephyr-side issue and it has been fixed in the newlib upstream (zephyrproject-rtos/newlib-cygwin@f88aece), shall we close this issue?

@azeemshatp
Copy link

Our colleague with using older newlib version 3.1.0 can also still confirm the junk characters.

@martinjaeger zephyrproject-rtos/newlib-cygwin@f88aece is included in newlib 3.1.0 and, if the above analysis is correct, you should be seeing the assert message instead. Can you tell me what toolchain/version you are using?

I'm using Zephyr SDK v0.12 with newlib 3.3.0. Above output with newlib 3.1.0 was from my colleague. @azeemshatp can you double-check that your newlib is 3.1.0?

I have the toolchain in Windows installed at C:\gnu_arm_embedded\arm-none-eabi. I can see the _newlib_version.h having the version info as _NEWLIB_VERSION "3.1.0"

@martinjaeger
Copy link
Member Author

@stephanosio I still think it's a Zephyr-related issue, as we don't consider the heap space required for newlib. Is the newlib MAX_HEAP_SIZE from your post above a newlib constant? Can we reduce the amount of RAM allowed to be consumed during linking by that value?

@stephanosio
Copy link
Member

Memory corruption in itself, which the reported issue is, is not a Zephyr issue and has been fixed upstream.

The default Zephyr-newlib integration scheme is designed to make use of whatever RAM is leftover (i.e. not explicitly allocated by the Zephyr kernel and apps) for the newlib heap region, and when this leftover space is insufficient for certain operations, such operations will fail.

I can see two potential enhancements we can make in this regard:

  1. Add a config that specifies the minimum amount of memory space that must be available for the newlib heap (with a default value that is deemed reasonable for common newlib operations). When MAX_HEAP_SIZE in libc-hooks.c is below the value specified in this config, a build-time error is output so that user is aware of potential problems and make adjustments as necessary.
  2. Make non-critical failures (e.g. printf) less disruptive (this will require submitting patches to the upstream newlib, so this should be deemed a long-term solution).

@stephanosio
Copy link
Member

Is the newlib MAX_HEAP_SIZE from your post above a newlib constant

#define MAX_HEAP_SIZE (KB(CONFIG_SRAM_SIZE) - (HEAP_BASE - \
CONFIG_SRAM_BASE_ADDRESS))

MAX_HEAP_SIZE in just the size of the RAM area not used (explicitly allocated) by the Zephyr kernel and apps.

@martinjaeger
Copy link
Member Author

Thanks for the clarification regarding newlib heap management in Zephyr.

I think option 1 makes sense. Not sure what a "reasonable" default would be, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: C Library C Standard Library bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants