[Alerting] Task Manager doesn't automatically recover if polling fails #74785

gmmorris · 2020-08-11T17:22:42Z

Task Manager doesn't have any built-in ability to recover if the polling cycle fails.
We have previously identified and fixed specific cases where the polling cycle broke, but ideally TM would recover on its own when such a case happens by restarting the broken poller.

For us to gain full confidence in mission-critical usage of alerting, a Nodemon-like ability to restart the internal poller seems paramount.
Alongside this change, we should expose metrics that can be collected on demand to aid in SDH support once we go GA.

elasticmachine · 2020-08-11T17:22:57Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris · 2020-08-13T08:32:38Z

After doing some research yesterday, I'm now a little more confident that I understand the failure case we encountered last week.
We already have extensive error catching in the poller, so it's not a case of "when does it break?". Rather, a bug in the bufferedTaskStore caused the update operation of a specific task to never resolve or reject, and the poller hung waiting on an operation that never ended.

This still doesn't address the potential case where the poller might break/hang for some other unknown reason, so I still think we need a nodemon-like solution, but at least this will reduce the chances of the poller restart being needed.
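To make the nodemon-like idea concrete, here is a minimal sketch of a heartbeat watchdog. It is only an illustration under assumptions: `PollerWatchdog`, `heartbeat`, `check`, and the `restartPoller` callback are hypothetical names, not Task Manager APIs. The poller reports each completed cycle, and a separate timer restarts it if it has been silent for too long, which covers both a crashed and a hung poller.

```typescript
// Hypothetical watchdog; none of these names come from Kibana.
class PollerWatchdog {
  private readonly restartPoller: () => void;
  private readonly maxSilenceMs: number;
  private lastHeartbeatMs: number;
  private restartCount = 0;

  constructor(restartPoller: () => void, maxSilenceMs: number, nowMs: number) {
    this.restartPoller = restartPoller;
    this.maxSilenceMs = maxSilenceMs;
    this.lastHeartbeatMs = nowMs;
  }

  // The poller calls this at the end of every completed polling cycle.
  heartbeat(nowMs: number): void {
    this.lastHeartbeatMs = nowMs;
  }

  // Called from a separate timer. Returns true if the poller was restarted.
  check(nowMs: number): boolean {
    if (nowMs - this.lastHeartbeatMs <= this.maxSilenceMs) {
      return false;
    }
    this.restartCount += 1;
    // Give the freshly started poller a full window before the next restart.
    this.lastHeartbeatMs = nowMs;
    this.restartPoller();
    return true;
  }

  // A simple counter that could be exposed as an on-demand metric.
  restarts(): number {
    return this.restartCount;
  }
}
```

In a real process the timestamps would come from `Date.now()` and `check` would be driven by `setInterval`. Passing the time in explicitly keeps the sketch easy to test.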

To address these issues, I'll progress with 3 separate PRs:

1. Fix the bug in the buffered task store so that two updates of the same ID work, by tracking the index along with the ID rather than just the ID.
2. Add a timeout on the work function in the poller (which marks tasks as running and kicks them off; it doesn't wait on task completion anyway), so that if it takes longer than a certain amount of time it rejects and is treated the same as if the operation had errored. This should make it impossible for another hanging promise to hang the poller itself.
3. Address the recovery from a failing poller.
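The timeout described in the second step can be sketched roughly as follows. This is an assumption-laden illustration, not the code from any of the PRs: `withTimeout` is a hypothetical helper name, and the error message format is made up. It wraps a promise so that the caller always sees a settlement within `ms` milliseconds, even if the wrapped promise never resolves or rejects.

```typescript
// Hypothetical helper; the name and error message are illustrative only.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // If the work has not settled after `ms`, reject on its behalf.
    const timer = setTimeout(() => {
      reject(new Error(`operation timed out after ${ms}ms`));
    }, ms);

    work.then(
      (value) => {
        clearTimeout(timer); // settled in time: cancel the pending timeout
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

Note that this only stops the caller from waiting. The underlying operation is not cancelled, so a late result is simply ignored.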

gmmorris · 2020-08-13T13:37:44Z

> Fix the bug in the buffered task store so that two updates of the same ID work by tracking the index along with the ID, rather than just the ID.

There's now a PR for the first step: #74943
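For readers who want a picture of what "tracking the index" means, here is a minimal sketch. It is not the code from the PR: `Waiter`, `BulkResult`, and `settleByIndex` are hypothetical names, and it assumes the bulk operation returns one result per submitted document, in the same order. Because each caller is matched by its position in the batch, two updates to the same ID each get their own result and neither promise is left pending.

```typescript
// Hypothetical types; assumes bulk results arrive in submission order.
type BulkResult<T> =
  | { status: "ok"; value: T }
  | { status: "error"; error: Error };

interface Waiter<T> {
  doc: T;
  resolve: (value: T) => void;
  reject: (error: Error) => void;
}

// Settle every waiter by its position in the batch rather than by document ID,
// so duplicate IDs in one batch cannot shadow each other.
function settleByIndex<T>(waiters: Array<Waiter<T>>, results: Array<BulkResult<T>>): void {
  waiters.forEach((waiter, index) => {
    const result = results[index];
    if (result === undefined) {
      // Never leave a waiter unsettled, even if the response is short.
      waiter.reject(new Error(`no bulk result at index ${index}`));
    } else if (result.status === "ok") {
      waiter.resolve(result.value);
    } else {
      waiter.reject(result.error);
    }
  });
}
```

The explicit rejection for a missing result matters as much as the index matching: every waiter is settled one way or the other, so nothing upstream can hang on it.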

gmmorris added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels Aug 11, 2020

gmmorris self-assigned this Aug 13, 2020

gmmorris closed this as completed in #75420 Aug 20, 2020

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022
