Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Logger: add watchdog to boost its priority in case of a busy-looping task #9404

Merged
merged 7 commits into from
May 18, 2018

Conversation

bkueng
Copy link
Member

@bkueng bkueng commented May 3, 2018

In case a task with higher prio than the log file writer starts to busy-loop, the log just stops. And since the writer runs at very low prio, most tasks can cause it.
So this PR adds a watchdog that checks from a timer interrupt if the log writer task has been in ready state but has not been scheduled over the last second. This means that someone else is hogging the CPU and we boost the priority of the logger and writer to maximum.
In case the wachdog triggers, logger will also log the CPU load of all running tasks.

The watchdog timeout is fairly low at 1 second. It will only give a false positive if there is a long-running computation of another task without interruption (which we should avoid anyway). But even if it triggers, the system will continue to work fine.

Testing

  • tested by letting navigator busy-loop after 30 seconds after bootup. Log: https://logs.px4.io/plot_app?log=1b38110f-04e2-4f95-a344-b512b2a3de38
  • tested on a slow SD card, with high logging rate active: when the writer waits for an SD transfer, it is not in ready state (thus a long logging dropout will not trigger it)
  • tested with running logger and the writer at max prio with HIL: the vehicle still flies well.

This might help with debugging #9271, since the log stops in-air. There might be different causes for the different reports, and in some cases a busy-looping task could be the cause.

Careful review required.

bkueng added 6 commits May 3, 2018 11:42
…ked for >1s

This is to guard against busy-looping tasks, so that we at least have a
log. In case the watchdog is triggered, it also logs CPU load of all
tasks.
In case a log ends abruptly, we will know that we have everything up to the
last second.

A test showed that CPU load and the amount of logging drops are unaffected
by this.
Helps debugging busy-looping tasks.
avoids compile error on OSX, trying to cast pthread_t to pid_t
perf_name = "perf_top_preflight";
break;
case PrintLoadReason::Postflight:
perf_name = "perf_top_postflight";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@dagar
Copy link
Member

dagar commented May 3, 2018

I'm curious to see how often this is happens in a regular "good" flight.

@dagar
Copy link
Member

dagar commented May 3, 2018

Is anything else in the system intentionally scheduled at the max priority?

What about restoring the original priority after a relatively long period?

@dagar dagar requested a review from davids5 May 3, 2018 16:54
@bkueng
Copy link
Member Author

bkueng commented May 4, 2018

I'm curious to see how often this is happens in a regular "good" flight.

Well, it should not! :)

Is anything else in the system intentionally scheduled at the max priority?

Not that I know of.

What about restoring the original priority after a relatively long period?

I want to keep the logic as simple as possible, as it's something that is supposed to never happen.

Copy link
Member

@davids5 davids5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkueng This looks good.

I would be concerned about using non ISR safe system resources off the call back.

Also comment about the intentional lack of restore would be helpful.

static int counter = 0;

if (++counter > 300) {
PX4_ERR("max time in ready: %i ms", (int)max_time / 1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is printf called off of this in interrupt context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I thought this was safe?

Copy link
Member

@davids5 davids5 May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My recollection is that there is a possibility of a wait on semaphore in the code path of PX4_ERR. Is that not the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, printf calls lib_sem_initialize which uses a semaphore. What are you using for such cases?
In any case, thanks for the hint. I removed them now.

PX4_ERR calls printf, which calls lib_sem_initialize, which calls sem_wait
@bkueng
Copy link
Member Author

bkueng commented May 7, 2018

Regarding more frequent fsync calls: I ran a longer test (15min) and checked the dropouts, and I see pretty much the same pattern:

logging rate around 110kb/s
logger status after N seconds/minutes:

new behavior, fsync at least every second:
- 0s: dropouts: 3 (max len: 0.027 s)
- 60s: dropouts: 5 (max len: 0.650 s)
- 5m: dropouts: 36 (max len: 1.133 s)
- 10m:  dropouts: 48 (max len: 1.075 s)
- 15m: dropouts: 49 (max len: 1.150 s)
previous (fsync around 10s):
- 0s: dropouts: 4 (max len: 0.487 s)
- 60s: dropouts: 10 (max len: 0.595 s)
- 5m: dropouts: 36 (max len: 0.832 s)
- 10m: dropouts: 34 (max len: 0.921 s)
- 15m: dropouts: 54 (max len: 0.857 s)

@davids5
Copy link
Member

davids5 commented May 7, 2018

@bkueng what do you attribute the dropouts to?

@bkueng
Copy link
Member Author

bkueng commented May 7, 2018

@bkueng what do you attribute the dropouts to?

I enabled the high-rate logging profile, which considerably increases the logging data rate (around 110kb/s vs around 40kb/s). Plus I used a cheap SD card.

@davids5
Copy link
Member

davids5 commented May 7, 2018

I enabled the high-rate logging profile, which considerably increases the logging data rate (around 110kb/s vs around 40kb/s). Plus I used a cheap SD card.

Does this approach 0 with a high quality card at an adequate rate or is there something we need to have deeper look at?

@bkueng
Copy link
Member Author

bkueng commented May 7, 2018

Does this approach 0 with a high quality card at an adequate rate or is there something we need to have deeper look at?

Yes, I have much less drops with a Sandisk Extreme U3, even with the high-rate logging.

@bkueng
Copy link
Member Author

bkueng commented May 9, 2018

@PX4/testflights can you test this please? And check that logging works as usual.

@santiago3dr
Copy link

couple flights on a Pixracer (V4) using a typical class 4 SanDisk microSD card
https://logs.px4.io/plot_app?log=12fd5c60-aedf-4f55-84fb-d4f6873e2592
https://logs.px4.io/plot_app?log=84dcdb2c-99e1-4721-8fac-91c158bf8b80

Only issue i encountered was that in the first flight, there was a sudden pause after engaging Position Mode at minute 1:46
I zoom in and see there is an instance of flight mode change which i did not input through the controller

Second flight seemed normal, no issues were observed

@Avysuarez
Copy link

Flights with pixhawk 2.1 (V3) using SdCard sandisk edge, 8gb, C4.
https://review.px4.io/plot_app?log=8eddc379-e539-4aed-b50f-15249ed73211
https://review.px4.io/plot_app?log=b6cd30f2-3709-44fc-a9e2-7ff0b81a4be9
GOOD FLIGHTS, NO ISSUES!

@dannyfpv
Copy link

dannyfpv commented May 9, 2018

couple flights on a Pixhawkmini (V3) I used a Kingston SD card, class 4, 8GB.
https://logs.px4.io/plot_app?log=0a9c4d4f-9d89-4d4c-aa01-75e42767d29b
https://logs.px4.io/plot_app?log=cf97636d-00da-4c9b-bc1b-054a9aeb58ed
No issues found.

@SayanR1
Copy link

SayanR1 commented May 9, 2018

Couple flights on a Pixracer (V3) Pixhawk 2 using a no brand 8GB CLASS 4 SD CARD

https://logs.px4.io/plot_app?log=df7cc843-6302-4790-9903-619cab15a079

https://logs.px4.io/plot_app?log=7661a01f-3397-4eac-8543-774d6d0d888e

No Issues during flight.

@LorenzMeier
Copy link
Member

@bkueng Please review before merging.

@bkueng
Copy link
Member Author

bkueng commented May 11, 2018

I inspected all logs, the watchdog never got triggered.

Only issue i encountered was that in the first flight, there was a sudden pause after engaging Position Mode at minute 1:46

At 1:46, there was a very short commanded (RC stick input change) switch into mission mode, and then back to position. This explains the pause.

@bkueng
Copy link
Member Author

bkueng commented May 18, 2018

@LorenzMeier good to merge?

@LorenzMeier LorenzMeier merged commit 9b5df10 into master May 18, 2018
@LorenzMeier LorenzMeier deleted the logger_watchdog branch May 18, 2018 20:23
@LorenzMeier
Copy link
Member

Thanks, I reviewed the full state and its great, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants