
Callstack unwind during out-of-memory error handling may trigger another error #476

Closed
jsas opened this issue Dec 3, 2015 · 33 comments · Fixed by #2106


jsas commented Dec 3, 2015

When a custom allocator is used, there is no way to tell whether Duktape's internal handling_error flag is set, i.e. whether the requested allocation is needed to construct the error object itself. This makes "safe call" wrappers unable to catch allocation errors: if the allocator also rejects the error allocation, no error can be pushed onto the stack and the fatal_handler ends up being called.

Is it feasible to expose some properties of the current heap state, for example the internal handling_error flag? My custom allocator could then detect that Duktape is trying to allocate memory for an error to push onto the stack.
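For context, a minimal sketch of the kind of setup involved (the allocator and fatal handler signatures follow the Duktape 1.x API visible in the call stack below; the fixed quota policy, names like sandbox_alloc, and the omitted realloc/free bookkeeping are simplified illustrations rather than my actual allocator):

#include <stdio.h>
#include <stdlib.h>
#include "duktape.h"

/* Hypothetical fixed-quota allocator state; per-pointer bookkeeping for
 * realloc/free is omitted to keep the sketch short. */
typedef struct {
    size_t used;
    size_t limit;
} sandbox_state;

static void *sandbox_alloc(void *udata, duk_size_t size) {
    sandbox_state *st = (sandbox_state *) udata;
    if (st->used + size > st->limit) {
        /* Rejected - with no way to know whether Duktape is currently
         * trying to allocate the very error object it wants to throw. */
        return NULL;
    }
    st->used += size;
    return malloc(size);
}

static void *sandbox_realloc(void *udata, void *ptr, duk_size_t size) {
    (void) udata;  /* quota accounting elided for brevity */
    return realloc(ptr, size);
}

static void sandbox_free(void *udata, void *ptr) {
    (void) udata;
    free(ptr);
}

static void sandbox_fatal(duk_context *ctx, duk_errcode_t code, const char *msg) {
    (void) ctx;
    fprintf(stderr, "fatal error %d: %s\n", (int) code, msg ? msg : "(null)");
    abort();  /* fatal handler must not return */
}

int main(void) {
    sandbox_state state = { 0, 256 * 1024 };
    duk_context *ctx = duk_create_heap(sandbox_alloc, sandbox_realloc, sandbox_free,
                                       &state, sandbox_fatal);
    /* ... run sandboxed code via duk_safe_call() etc. ... */
    duk_destroy_heap(ctx);
    return 0;
}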

@svaarala
Copy link
Owner

svaarala commented Dec 3, 2015

Hmm, assuming there's space on the value stack (which is the normal situation at any point), Duktape should first try to push an error, fail that with out-of-memory, and fall back to pushing a preallocated DoubleError object onto the value stack. The double error doesn't require object allocations because the instance is created beforehand, so the only thing this needs is the ability to write a tagged value (duk_tval) to a preallocated value stack entry - which requires no allocations.

Do you have an example which reproduces the case? Using a pooled allocator I either get a "Error: alloc failed" or similar, or a "double error" no matter what I try manually:

DoubleError: error in error handling

However, I'm testing with Ecmascript code, so maybe duk_safe_call(), duk_pcall() etc. behave differently for some reason.

@jsas
Copy link
Author

jsas commented Dec 3, 2015

I'm trying to reproduce this outside my environment. In the meantime, here's the call stack:

::sandbox_fatal(duk_hthread * ctx, int code, const char * msg) Line 206 C++
duk_fatal(duk_hthread * ctx, int err_code, const char * err_msg) Line 4287  C
duk_err_longjmp(duk_hthread * thr) Line 26  C
duk_err_create_and_throw(duk_hthread * thr, int code, const char * msg, const char * filename, int line) Line 131   C
duk_err_handle_error(const char * filename, int line, duk_hthread * thr, int code, const char * fmt, ...) Line 19   C
duk_push_buffer_raw(duk_hthread * ctx, unsigned __int64 size, unsigned int flags) Line 4071 C
duk__realloc_props(duk_hthread * thr, duk_hobject * obj, unsigned int new_e_size, unsigned int new_a_size, unsigned int new_h_size, int abandon_array) Line 628 C
duk__grow_props_for_new_entry_item(duk_hthread * thr, duk_hobject * obj) Line 970   C
duk__alloc_entry_checked(duk_hthread * thr, duk_hobject * obj, duk_hstring * key) Line 1272 C
duk_hobject_define_property_internal(duk_hthread * thr, duk_hobject * obj, duk_hstring * key, unsigned int flags) Line 4541 C
duk_xdef_prop_stridx(duk_hthread * ctx, int obj_index, int stridx, unsigned int desc_flags) Line 300    C
duk_push_error_object_va_raw(duk_hthread * ctx, int err_code, const char * filename, int line, const char * fmt, char * ap) Line 3992   C
duk_push_error_object_raw(duk_hthread * ctx, int err_code, const char * filename, int line, const char * fmt, ...) Line 4025    C
duk_err_create_and_throw(duk_hthread * thr, int code, const char * msg, const char * filename, int line) Line 107   C
duk_err_handle_error(const char * filename, int line, duk_hthread * thr, int code, const char * fmt, ...) Line 19   C
duk_push_buffer_raw(duk_hthread * ctx, unsigned __int64 size, unsigned int flags) Line 4071 C
duk__realloc_props(duk_hthread * thr, duk_hobject * obj, unsigned int new_e_size, unsigned int new_a_size, unsigned int new_h_size, int abandon_array) Line 628 C
duk__grow_props_for_new_entry_item(duk_hthread * thr, duk_hobject * obj) Line 970   C
duk__alloc_entry_checked(duk_hthread * thr, duk_hobject * obj, duk_hstring * key) Line 1272 C
duk_hobject_define_property_internal(duk_hthread * thr, duk_hobject * obj, duk_hstring * key, unsigned int flags) Line 4541 C
duk_xdef_prop(duk_hthread * ctx, int obj_index, unsigned int desc_flags) Line 267   C
duk_js_close_environment_record(duk_hthread * thr, duk_hobject * env, duk_hobject * func, unsigned __int64 regbase) Line 693    C
duk_hthread_callstack_unwind(duk_hthread * thr, unsigned __int64 new_top) Line 229  C
duk_handle_safe_call(duk_hthread * thr, int (duk_hthread *) * func, int num_stack_args, int num_stack_rets) Line 1838   C
duk_safe_call(duk_hthread * ctx, int (duk_hthread *) * func, int nargs, int nrets) Line 221 C

svaarala added the bug label on Dec 4, 2015

svaarala commented Dec 4, 2015

Ok, I think the issue here is that the error is handled more or less as intended but the callstack unwind then closes an environment record which tries to allocate memory when it's copying over variables to a closure. Doing that is mandatory even when unwinding due to an error because function instances referencing the variables might have been created before the error occurred.

Not sure what the best fix would be, I'll have to think about this a bit.

One will necessarily have to sacrifice some scope-related semantics to be able to unwind even when scopes cannot be closed (or the scope objects would have to be allocated on function entry to ensure allocations are not needed during an unwind). This is probably not a major issue for sandboxing, where shutting down cleanly matters more.

Using another setjmp catchpoint for the unwind would catch the error. But that'd carry some overhead for all calls to handle a relatively rare case (an important case for sandboxing, of course).

It'd also be possible to rework the scope handling to use a lower level interface to manage the scope objects, so that throwing would be avoided. For instance, the scope object could be resized before copying any properties over; if the resize failed, variable copying would simply be skipped, and if it succeeded, the variables could be copied without risk of an error. This would also be a bit faster than the current, fairly simple explicit property handling.
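The general shape of that pattern - reserve everything up front so the copy step itself cannot fail - is easy to illustrate outside Duktape (a generic C sketch, not actual Duktape internals):

#include <stdlib.h>
#include <string.h>

/* Reserve first, then copy: the single up-front allocation is the only
 * thing that can fail, so the copy itself can never fail partway through. */
static int snapshot_values(double **out, const double *src, size_t n) {
    double *buf = (double *) malloc(n * sizeof(double));
    if (buf == NULL) {
        *out = NULL;
        return 0;  /* caller skips the snapshot but can still unwind/continue */
    }
    memcpy(buf, src, n * sizeof(double));  /* cannot fail */
    *out = buf;
    return 1;
}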


svaarala commented Dec 4, 2015

It might also be useful for an allocation error (which only happens when emergency garbage collection has failed) to propagate out as an uncatchable error similar to script timeout errors. It's not really safe to continue, and as described above it may not be possible to respect all required Ecmascript semantics after the error.


svaarala commented Dec 4, 2015

Going back to the original question, it'd of course be possible to expose some heap state to the application. But doing so creates a compatibility promise so it'd need to be something which can be guaranteed to exist and be relevant over time.

One possibility would be to communicate with the user application about the need for emergency GC. For example:

  • If a memory allocation fails even after a normal GC, indicate this to user code using a callback.
  • If the memory allocation still fails after an emergency GC, indicate this too using a callback. User code could then request a forced shutdown, and perhaps provide some additional memory to process the shutdown.
  • Retry the allocation once more after the user callback.

Requesting a forced shutdown could work similarly to how a script timeout propagates, i.e. as an "uncatchable" error which isn't catchable using Ecmascript code but respects Duktape/C protected calls to allow native resources to be released. See #474.

Even with this kind of integration in place, there's not necessarily an easily determined upper bound on how much memory is needed to shut down cleanly (to respect finalizer guarantees etc). It'd be relatively easy to shut down forcibly without respecting e.g. finalizer guarantees - but that's also problematic for sandboxing.
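To make the callback idea above concrete, the application side of such a protocol might look roughly like this (purely hypothetical - neither the callback type nor its registration exists in the current Duktape API, and the field names are made up):

#include <stddef.h>

/* Hypothetical application-side policy for an out-of-memory notification
 * callback; the return value convention is an illustration only. */
typedef struct {
    size_t quota_bump;        /* extra memory granted for a clean shutdown */
    int shutdown_requested;
} oom_policy;

static int sandbox_oom_notify(void *udata, size_t alloc_size, int after_emergency_gc) {
    oom_policy *p = (oom_policy *) udata;
    (void) alloc_size;
    if (!after_emergency_gc) {
        return 0;  /* let Duktape run emergency GC and retry on its own */
    }
    /* Emergency GC already failed: grant a one-time reserve so unwinding
     * and finalizers can run, and ask for a forced shutdown afterwards. */
    p->quota_bump = 64 * 1024;
    p->shutdown_requested = 1;
    return 1;  /* request one more retry of the allocation */
}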

@fatcerberus (Contributor)

Wait, you need to allocate memory to unwind? That seems backwards, although I admit I'm not familiar with all the semantics related to Ecmascript call handling (the whole spec is quite convoluted).


svaarala commented Dec 4, 2015

Yes, as things are now, unwinding creates an explicit scope object during unwinding. Unless that object were allocated on entry, memory is needed to unwind. This is not an Ecmascript requirement but a Duktape internal design issue. The memory could be allocated on entry instead which should fix this issue, but I'll need to check the unwind path to see if similar issues exist elsewhere.

Ecmascript does require e.g. execution of "finally" blocks which will definitely need memory. Duktape also executes finalizers, which needs memory. But either of these can fail without impacting the unwind process which is the issue here, because an error during unwind is not inside the same protection as the call itself. Not having enough memory to run finalizers may be a problem by itself, of course, because you don't necessarily get an actual chance to free native resources. Dealing with actual out-of-memory is quite hard if you simultaneously want some guarantees (usually one needs preallocation to get them).

@fatcerberus (Contributor)

That's all true. Although I've learned that if you're truly out of memory enough that malloc() starts to fail, you're pretty much dead in the water already; continuing with normal operation is not going to work so you may as well just clean up what you can and pack it in.

@fatcerberus (Contributor)

Anyway, what I said above was meant in the more general sense; it isn't (necessarily) relevant here because the allocations are sandboxed, so they may not "have to" fail.

If I understand the issue correctly, a custom allocator is being used and needs to know whether an important allocation is being requested so the sandbox environment can exempt it from some quota. In that case a workable solution that would avoid exposing internal heap state would be to allow an additional parameter for the custom allocator which indicated the relative "importance" of the allocation. Normal heap allocations and anything else related to script execution would be zero; higher values would be for more critical needs (unwinding, errors originating from Duktape, etc.). Doing this without breaking API compatibility for the purposes of semantic versioning is left as an exercise for the reader. :)
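A sketch of what such an allocator could look like (hypothetical - Duktape's real duk_alloc_function does not take an importance parameter, and the reserve policy here is just an illustration):

#include <stdlib.h>

/* Hypothetical quota allocator with an "importance" hint. */
typedef struct {
    size_t used;
    size_t limit;
    size_t reserve;  /* extra headroom for critical allocations */
} hinted_state;

static void *sandbox_alloc_hinted(void *udata, size_t size, int importance) {
    hinted_state *st = (hinted_state *) udata;
    size_t effective_limit = st->limit;
    if (importance > 0) {
        /* Critical allocation (unwinding, Duktape-originated errors, ...):
         * allow it to dip into a small reserve beyond the normal quota. */
        effective_limit += st->reserve;
    }
    if (st->used + size > effective_limit) {
        return NULL;
    }
    st->used += size;
    return malloc(size);
}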

@fatcerberus (Contributor)

Yet another solution would be to have two callbacks: one to allocate memory during normal bytecode execution (which should always be sandboxed), and another to allocate memory for internal Duktape structures (for which it might be more useful to allow some lenience). If the app needs both of these to be sandboxed, it can just use the same function for both; otherwise it can provide separate functions.


svaarala commented Dec 4, 2015

Yes, it's often viable to have some reserve. But how much? If the amount of memory needed for an unwind cannot be bounded, it's a workable solution for some cases, but not all. In particular, it will be easy to craft code which will exceed that reserve.

@fatcerberus (Contributor)

That's why I suggested still leaving it up to the allocation callback(s) - the app gets to decide how much to provide as "reserve".


svaarala commented Dec 4, 2015

How would one arrive at a safe value?

@fatcerberus (Contributor)

Well, the exact amount of reserve space is out of Duktape's scope, I'm just suggesting ways Duktape could communicate the relative "importance" of the request, with the application still ultimately in control of what to do with it.


svaarala commented Dec 4, 2015

I know. But what does that solve? If the sandbox is running potentially unknown code, that code can always contrive to exceed whatever reserves are in place. The application implementing the sandbox is then responsible for figuring out a reserve which works - which is in general not possible to know in advance.

Because there are solutions which work in all cases (and at least allow the heap to be terminated safely), I'd much rather do something like that.


svaarala commented Dec 4, 2015

At a high level the two basic approaches which have consistent behavior would probably be:

  • Ensure the current error handling and unwinding path does not rely on allocations succeeding. It was originally designed to do that, but the scope unwinding doesn't respect that policy now. This should be fixable one way or another.
  • Treat a persistent out-of-memory error (i.e. allocation fails even after emergency GC) as a hard error, so that Duktape wouldn't support resuming execution afterwards. Instead, the guarantee would be that the heap can still be freed without encountering fatal errors in the process (see the sketch below).
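From the embedding's point of view, the second option would boil down to something like the following (a sketch only; treating persistent out-of-memory as unrecoverable is a proposal here, not current Duktape behavior):

#include <stdio.h>
#include "duktape.h"

/* Hypothetical reaction to a persistent out-of-memory condition: don't try
 * to keep executing in the heap, just tear it down. duk_destroy_heap() is
 * an existing API and runs finalizers where possible. */
static void handle_persistent_oom(duk_context **ctx_ptr) {
    fprintf(stderr, "persistent out-of-memory, shutting down heap\n");
    duk_destroy_heap(*ctx_ptr);
    *ctx_ptr = NULL;  /* heap is gone; caller must not use it anymore */
}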

@fatcerberus (Contributor)

I was mostly going on @jsas's original feature request to expose the heap handling_error flag. Since that's internal state, this was just a way to achieve the same (or a similar) goal in a more general way.

Of course as touched on in your post above, if we can remove the requirement for allocations on unwind entirely, this whole point is moot. Even so, giving the user the ability to prioritize "infrastructure" allocations (even if the ultimate quota is the same) might not be a bad idea.


svaarala commented Dec 4, 2015

Like I said above, I think it's important that allocation errors are handled reliably in all cases, so that we don't just shift the problem from one place to another. But having said that, it might still be useful to know some context to an allocation.

Considering a "this is an infrastructure allocation" flag: in practice that would be quite difficult to implement.

For example, some helper code might be called in response to Duktape internals doing something quite far away from the helper, or as a result of a public API call calling into Duktape internals which eventually calls the helper. There may be a lot of call levels in between; unless that "originating request" is somehow carried all through the call layers, it wouldn't work in practice and you'd have something similar to priority inversion.

It would also be possible to set some heap level "unwinding" flag which would then be carried to all allocations while that is in progress. But it would then also apply to e.g. user finalizers which may do an arbitrary amount of work which probably shouldn't be flagged critical.

So while it'd maybe be useful, it's quite difficult for the "this is an infrastructure allocation" flag to come out right.

Providing a callback or other information about allocations which fail even after garbage collection would be possible and quite straightforward. This might be useful if an allocation error leads to a hard exit, so that some memory could be added to handle (most of) the unwinding cleanly (e.g. allow finalizers to run). However, this would still rely on an actual solution to make the unwinding safe even if the added memory wasn't enough for a clean unwind.

@fatcerberus (Contributor)

Yeah, that's true. I should really have known better about the helpers, having seen what the bytecode executor looks like firsthand. :)


svaarala commented Dec 4, 2015

I'll give the unwind path scope handling a little test once I have 1.4 stuff in a reasonable shape. It should be relatively simple to just preallocate the properties for the scope object on function entry.

svaarala changed the title from "Custom allocator failures cannot handle safe calls" to "Callstack unwind during out-of-memory error handling may trigger another error" on Dec 4, 2015

svaarala commented Dec 4, 2015

#477 provides a testcase which seems to reproduce the issue using the "ajduk" command line tool. This should allow testing of a possible fix.

@fatcerberus (Contributor)

What is ajduk, and how is it different from normal duk?

svaarala added a commit that referenced this issue Dec 5, 2015

svaarala commented Dec 5, 2015

It's the "duk" command line which uses the pooled allocator from AllJoyn.js. When compiled for x86 with pointer compression etc, it quite closely mimics very low memory targets.


svaarala commented Dec 5, 2015

Just as a quick test I disabled variable value copying in the scope handling done on callstack unwind: as a result "ajduk" now fails gracefully:

000.001 ajs_heap.c:215 AJS_Alloc of 33792 bytes failed
======= dump of 23 heap pools ======
heap[0] pool[8] used=10 free=0 high-water=10 min-alloc=0 max-alloc=8
heap[0] pool[12] used=10 free=0 high-water=10 min-alloc=0 max-alloc=12
heap[0] pool[16] used=200 free=0 high-water=200 min-alloc=0 max-alloc=16
heap[0] pool[20] used=400 free=0 high-water=400 min-alloc=0 max-alloc=20
heap[0] pool[24] used=400 free=0 high-water=400 min-alloc=0 max-alloc=24
heap[0] pool[28] used=79 free=121 high-water=101 min-alloc=25 max-alloc=28
heap[0] pool[32] used=2 free=198 high-water=2 min-alloc=30 max-alloc=30
heap[0] pool[40] used=107 free=93 high-water=108 min-alloc=35 max-alloc=40
heap[0] pool[48] used=1 free=49 high-water=5 min-alloc=42 max-alloc=47
heap[0] pool[52] used=16 free=34 high-water=50 min-alloc=52 max-alloc=52
heap[0] pool[56] used=1 free=49 high-water=9 min-alloc=52 max-alloc=54
heap[0] pool[60] used=0 free=50 high-water=4 min-alloc=60 max-alloc=60
heap[0] pool[64] unused
heap[0] pool[128] used=55 free=25 high-water=61 min-alloc=66 max-alloc=116
heap[0] pool[256] used=5 free=11 high-water=8 min-alloc=132 max-alloc=228
heap[0] pool[512] used=4 free=12 high-water=7 min-alloc=268 max-alloc=476
heap[0] pool[1024] used=4 free=2 high-water=6 min-alloc=516 max-alloc=1024
heap[0] pool[1360] used=0 free=1 high-water=1 min-alloc=1052 max-alloc=1356
heap[0] pool[2048] used=3 free=2 high-water=5 min-alloc=1196 max-alloc=2048
heap[0] pool[4096] used=1 free=2 high-water=3 min-alloc=1456 max-alloc=4096
heap[0] pool[8192] used=0 free=3 high-water=1 min-alloc=5120 max-alloc=8192
heap[0] pool[16384] used=0 free=1 high-water=1 min-alloc=9216 max-alloc=16384
heap[0] pool[32768] used=1 free=0 high-water=1 min-alloc=17408 max-alloc=32768
======= heap hwm = 132612 use = 85964 waste = 1239 ======
Error: failed to extend valstack
    duk_api_stack.c:707
    light_08082198_000f light strict preventsyield
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:343
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    recursive tests/ecmascript/test-dev-outofmemory-unwind-gh476.js:366
    [...]
error in executing file tests/ecmascript/test-dev-outofmemory-unwind-gh476.js

Before the change there was a fatal error, so perhaps removing any mandatory allocations from scope unwinding would be enough.


svaarala commented Dec 5, 2015

@jsas Here's a branch which disables the scope-related memory allocation on unwind:

It breaks compliance so it's just a test, but it'd be interesting to hear if it fixes your fatal error issue. If so, we'd know the root cause is not something else. Here's a prebuilt dist package if that's useful to you:


jsas commented Dec 5, 2015

That build did the trick (duk_safe_call returned 1 instead of calling the fatal error handler). Of course, there's no error on the stack to pick up at this point, so would it be possible to have the return value be an error code (like -53)?

@fatcerberus (Contributor)

@jsas What does your duk_safe_call() call site look like?

@fatcerberus (Contributor)

The reason I ask is because of this:

If an error occurs, the stack will still have nrets [the fourth parameter] values at "base index"; the first of such values is the error, and the remaining values are undefined. If nrets is zero, the error will not be present on the stack (the return stack top will equal the "base index"), so calling this function with nrets as 0 is not very useful.


svaarala commented Dec 6, 2015

@jsas That's good to hear - that branch is not usable as a solution because it just skips full scope handling in unwind, but it does identify the root cause and if this is the only concrete problem it should be relatively easy to fix. I'll need to review the unwind path fully to make sure though.

There should be an error object on top of the stack, but it'd simply be a generic Error in this case - either an "alloc error" or similar, or, if there wasn't memory to create an actual error, a "double error". Both are simple Error instances from an Ecmascript typing point of view. You can get the equivalent error code using http://duktape.org/api.html#duk_get_error_code, but that would just return DUK_ERR_ERROR.


svaarala commented Dec 6, 2015

As things are now, error objects are modeled after Ecmascript errors. They are not very verbose from a programmatic point of view: errors have a name, a message, and a stack trace, which are useful as human readable descriptions but programmatically you can basically rely on the error name and/or the error code. They don't provide a detailed distinction between error sources or error severity right now.

One thing which is not very nice now is that Duktape specific (non-standard) errors are not proper subclasses but direct instances of Error which means you can't really deal with them programmatically. I'd like to change it roughly as follows:

  • Reduce the number of specific custom errors; right now the custom errors are: unimplemented, unsupported, internal, alloc, assertion, API, uncaught. It should be enough to have one or two implementation specific errors, which is more in line with the quite sparse Ecmascript error "tradition". Some errors could map to standard Ecmascript errors, e.g. API errors could mostly map to TypeError.
  • Map the implementation specific errors to actual subtypes so that they have proper programmatic typing, and allow detection e.g. as err instanceof InternalError or duk_is_internal_error(ctx, -1).

It would also be possible to add some additional boolean flags to errors to indicate severity somehow. For example, error objects could have a .fatal boolean flag to indicate that the error is a severe one, such as an allocation error or an actual internal error after which it is not necessarily safe to continue. The pre-created DoubleError instance could also have this flag, so checking for .fatal would then identify the error quite intuitively.

Another related improvement would be to change the simple "success vs. error" return code indication of safe calls to provide a three level indication of "success vs. normal error vs. severe error". I'm not sure how good this change would be in practice: would two error levels be enough, would drawing the line between error types be clean, etc. So I'd perhaps favor having that information in the error object itself and provide good accessors for that.

Anyway, from a sandboxing point of view, you should treat the error on the value stack carefully so that you don't trigger new memory allocations when dealing with it. Because the value stack is preallocated, the following should be safe at least:

  • Reading the error code using duk_get_error_code()
  • Reading the error .name and .message

Coercing the error to a string is not generally safe because it involves a temporary string which needs to be interned. The .stack property is virtual and is generated on-the-fly so reading it is risky if you're potentially out of memory.
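Putting that together, a conservative way to inspect the error might look like this (a sketch, assuming the error object sits at the value stack top after a failed protected call):

#include <stdio.h>
#include "duktape.h"

static void report_error(duk_context *ctx) {
    /* Error object is assumed to be at the value stack top. */
    duk_errcode_t code = duk_get_error_code(ctx, -1);  /* e.g. DUK_ERR_ERROR */
    const char *name;
    const char *msg;

    /* .name and .message refer to already existing strings, so reading them
     * should not need new allocations; avoid duk_safe_to_string() and .stack,
     * which build new strings. */
    duk_get_prop_string(ctx, -1, "name");
    duk_get_prop_string(ctx, -2, "message");
    name = duk_get_string(ctx, -2);
    msg = duk_get_string(ctx, -1);
    fprintf(stderr, "error: %s: %s (code %d)\n",
            name ? name : "?", msg ? msg : "?", (int) code);
    duk_pop_2(ctx);  /* pop message and name, leave the error itself */
}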

@fatcerberus (Contributor)

The Error won't be on the stack if nrets = 0 for the safe call, which is why I asked what his duk_safe_call() looked like.


jsas commented Dec 6, 2015

@fatcerberus nrets is indeed 0 for that particular call.

@svaarala thanks for the tip about string interning when coercing the error - makes sense.

@fatcerberus (Contributor)

Yeah, you'll need to pass nrets == 1 if you want the error on the stack (you can pop the unneeded value off afterwards).
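For completeness, that pattern looks roughly like this (a sketch using the Duktape 1.x duk_safe_call() signature; do_work stands in for whatever the wrapper protects):

#include <stdio.h>
#include "duktape.h"

static duk_ret_t do_work(duk_context *ctx) {
    /* ... anything which may run out of memory under a quota allocator ... */
    duk_eval_string_noresult(ctx, "var a = []; while (true) { a.push({}); }");
    return 0;  /* no result values */
}

static void run_sandboxed(duk_context *ctx) {
    /* nrets == 1 so that on error the error value ends up on the stack. */
    duk_int_t rc = duk_safe_call(ctx, do_work, 0 /*nargs*/, 1 /*nrets*/);
    if (rc != DUK_EXEC_SUCCESS) {
        fprintf(stderr, "safe call failed, error code %d\n",
                (int) duk_get_error_code(ctx, -1));
    }
    duk_pop(ctx);  /* pop the single result or error value */
}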
