Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #9194 - Allow more control over crash dump #9196

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

DRSDavidSoft
Copy link
Contributor

@DRSDavidSoft DRSDavidSoft commented Aug 30, 2024

@mcspr This is a proof of concept PR for an alternative solution to #9193 and also closes #9194. Could you please verify? I would appreciate any input and feedback. Thanks!

Description

This PR adds more granular control over how the crash dump is delivered, and some more features:

  • custom_panic_callback(panic_file, panic_line, panic_func, panic_what): Allows the user to additionally listen for panics (useful to record them in flash memory)
  • ets_uart_putc1()crash_ets_uart_putc1(): renamed function that prints to the Serial port. This allows the function to be overridden, possibly applying transformations on the text (useful if transmitting over a Bus-like UART, such as RS-485), remove clutter or insert additional comments (such as uptime), or completely inhibit the dump (useful is uart is connected to a peripheral device)
  • NO_CUT_HERE: Introduce a macro that allows disabling the "CUT HERE" message and the divider lines. Can be used to minify the number of transmitted/recorded characters over the wire or storage. Can be used with the -D flag.
  • ets_println() instead of ets_putc('\n'): This makes the new lines use the "cooked" interface instead of the "raw" one, which is also used on the static ets_printf_P() function, which allows:
    • uniform line endings (always "\r\n", instead of just the ones used within printf_P(...)
    • making the line endings also silent when the crash_ets_uart_putc1() is overriden
  • rst_info.exccause used instead of local exccause for the 6 exception (divide by zero): unless I'm missing something, this allows the custom_crash_handler at the end of the postmortem function to also receive a correct exception number. Useful for projects such as EspCrashSave and EspCrashSaveSpiffs

Additional notes

The naming of some functions might not be the optimal value, if you would like to suggest a better one please either modify this PR or submit a comment.

@mcspr
Copy link
Collaborator

mcspr commented Aug 30, 2024

custom_panic_callback

Probably just one callback would be ok? We do have custom 'USER' reasons which are private now, but obviously can be moved to a public header. See rst_reason_sw. Suppose, this can be extended with something like esp_panic_info() returning a struct of those pointers?

NO_CUT_HERE

Any reason not to fully hide the whole internal printf with a similar flag? Or, at least hide the part of postmortem & funcs calling postmortem that print stuff, structuring it a bit different in the main func especially.
(or simply replacing such internal printf with an empty macro on demand?)

rst_info.exccause used instead of local exccause for the 6 exception (divide by zero): unless I'm missing something, this allows the custom_crash_handler at the end of the postmortem function to also receive a correct exception number. Useful for projects such as EspCrashSave and EspCrashSaveSpiffs

see #7496
This helps out exception decoder by pretending the hw does have instructions that would've triggered the exception. If only to be consistent with the output... I think the only issue is exc code would be different between rst info struct kept in memory between reboots and the one passed to the callback?

@DRSDavidSoft
Copy link
Contributor Author

Thank you for the feedback! I agree with your points. Let me elaborate a bit on the original intentions behind these changes while also addressing your suggestions.

custom_panic_callback

The idea behind introducing a custom panic callback was to provide users with more flexibility in handling panics, especially for scenarios where recording the crash information in flash memory is necessary. I agree that consolidating this into a single callback would be best as it simplifies the interface and reduces the addition of unnecessary callbacks.

I suppose the reason for adding this to the PR was to avoid modification of the existing custom_crash_callback's signature.

Probably just one callback would be ok? ... Suppose, this can be extended with something like esp_panic_info() returning a struct of those pointers

That's a great idea! I very much prefer to use something like esp_panic_info() instead of what I came with (custom_panic_callback).

Moving the rst_reason_sw to a public header and extending it with something like an esp_panic_info() function that returns a structured set of pointers would be an excellent approach. Do you have any specific ideas on what that struct should contain, or how the implementation might be extended best?

NO_CUT_HERE

The purpose of the NO_CUT_HERE macro was to provide users with an option to minimize the output, especially when the crash dump is being transmitted over limited bandwidth or recorded in constrained storage environments. I can see the benefit of hiding more of the internal printf calls, particularly those within the postmortem functions.

Any reason not to fully hide the whole internal printf with a similar flag, ... or simply replacing such internal printf with an empty macro on demand?

The second idea is exactly what I tried to pursue by introducing the new crash_ets_uart_putc1() func. Since it can be overridden by the user, it can also be used to point to a no-op function, or to simply not print anything to the serial. I also agree with using a macro instead, however I have found that the linkage approach works better. Alternatively, replacing the higher-level internal printf with a macro as needed could offer a more straightforward solution.

As to why the cut_here() function specifically, it's currently the only part that produces static content, and can be forgoed when saving or transmitting the crash dump. I just needed a way to signal to prevent it from being generated. I agree this is not the best approach and leads to clutter in code, but it's currently the best I could come up with 😅 I would appreciate other solutions as well!

Structuring the main function differently to conditionally include or exclude these outputs is also a good approach. Would you prefer the more granular control of individual printf calls, or a broader flag that suppresses all related output?

rst_info.exccause

Your point about using rst_info.exccause for the divide by zero exception is well taken. The intention here was to ensure consistency across the crash reports, especially for tools like EspCrashSave and EspCrashSaveSpiffs that rely on accurate exception information.

I think the only issue is exc code would be different between rst info struct kept in memory between reboots and the one passed to the callback

As for my personal opinion, I can accept that the consequent exception code wouldn't be the correct one (6), that is due to ESP8266's hardware design. The more important factor is that during the crash instance, the code is correctly reported (as is the current case when considering print to serial), and the code is also correctly saved, so that further in the line it can also be correctly reported over the network, or something else.

However, you raise a valid concern about the discrepancy between the rst_info struct retained between reboots and the one passed to the callback. This could potentially lead to confusion in diagnosing issues.

Currently, the exception is correctly only displayed when a crash happens. We don't care about the next reboot, because we have lost the context of the actual crash. This PR aims to also correctly save the exception code. This means that later, when we access the crash information, we will also see the correct code. This can either be another print to the uart (triggered by user), or in the case I'm interested, transmit it over the network on the next reboot. Which means the code that is available in next reboots will be ignored in favor of the saved one, received from the crash callback.

I'm open to suggestions on how to handle this more effectively. Sadly, it appears that getting the same exception code across both instances or providing additional context in the callback to clarify the source of the exception code are both unlikely to implement.

Once again, I appreciate your insights and would love to hear any further thoughts you have on these topics. Your expertise in this area is invaluable, and I'm keen to ensure that the final implementation meets both the flexibility and reliability requirements of our users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Don't print stack trace to Serial on crash
2 participants