
[Alerting] Task Manager doesn't automatically recover if polling fails #74785

Closed
gmmorris opened this issue Aug 11, 2020 · 3 comments · Fixed by #75420

Comments

@gmmorris (Contributor)

Task Manager doesn't have any built-in ability to recover if the polling cycle fails.
We have identified and addressed failure cases in the past where the polling cycle broke, but ideally TM would recover on its own when such a case happens by restarting the broken poller.

In order for us to gain full confidence in mission-critical usage of alerting, a Nodemon-like ability to restart the internal poller seems paramount.
Alongside this change, we should expose metrics that can be collected on demand to aid in SDH support once we go GA.
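
A rough sketch of what such a Nodemon-like restart could look like (none of these names come from the Task Manager code; the heartbeat/watchdog shape is only an assumption for illustration):

```ts
// Hypothetical watchdog around a poller: if the poller stops completing
// cycles for too long, tear it down and start a fresh one.

interface Poller {
  start(): void;
  stop(): void;
  lastSuccessfulPollAt(): number; // epoch millis of the last completed cycle
}

function superviseRestarts(
  createPoller: () => Poller,
  maxSilenceMs: number,
  checkEveryMs: number
): () => void {
  let poller = createPoller();
  poller.start();

  const timer = setInterval(() => {
    const silentFor = Date.now() - poller.lastSuccessfulPollAt();
    if (silentFor > maxSilenceMs) {
      // the poller looks hung or broken: replace it with a fresh instance
      poller.stop();
      poller = createPoller();
      poller.start();
    }
  }, checkEveryMs);

  // returns a teardown function for shutdown
  return () => {
    clearInterval(timer);
    poller.stop();
  };
}
```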

@gmmorris added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels on Aug 11, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris (Contributor, Author)

Doing some research yesterday, I'm now a little more confident that I understand the failure case we encountered last week.
We already have extensive error catching in the poller, so the question isn't "when does it break?". What actually happened is that a bug in the bufferedTaskStore caused the update operation for a specific task to never resolve or reject, which hung the poller because the operation never ends.

This still doesn't address the potential case where the poller might break/hang for some other unknown reason, so I still think we need a nodemon-like solution, but at least this will reduce the chances of the poller restart being needed.
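
Not Task Manager code, just a minimal illustration of the hang mechanism described above: a single update promise that never settles stalls any poll loop that awaits the batch.

```ts
// If any promise in the batch never resolves or rejects, this await never
// completes, and every subsequent poll cycle is blocked behind it.
async function pollCycle(pendingUpdates: Array<Promise<void>>): Promise<void> {
  await Promise.all(pendingUpdates);
}

async function runPoller(collectUpdates: () => Array<Promise<void>>): Promise<void> {
  while (true) {
    // one hung update in one cycle blocks the loop forever
    await pollCycle(collectUpdates());
  }
}
```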

To address these issues, I'll progress with 3 separate PRs:

  1. Fix the bug in the buffered task store so that two updates of the same ID work, by tracking the index along with the ID rather than just the ID.
  2. Add a timeout on the work function in the poller (which marks tasks as running and kicks them off; it doesn't wait on task completion anyway) so that if it takes longer than a certain amount of time, it rejects and is treated the same as if the operation had errored (see the sketch after this list). This should make it impossible for another hanging promise to hang the poller itself.
  3. Address recovery from a failing poller.
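
A hedged sketch of the idea in step 2 (not the actual PR): a hung work function is converted into a rejection via a timeout, so the poller can treat it like any other failed cycle. `markTasksAsRunning` and `handlePollCycleError` below are made-up stand-ins.

```ts
// Race the work function against a timer; whichever settles first wins,
// and the timer is always cleared so it doesn't leak.
function withTimeout<T>(work: () => Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`work timed out after ${timeoutMs}ms`));
    }, timeoutMs);

    work().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// usage (hypothetical names):
// withTimeout(() => markTasksAsRunning(claimedTasks), 30_000)
//   .catch((error) => handlePollCycleError(error));
```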

@gmmorris (Contributor, Author)

> Fix the bug in the buffered task store so that two updates of the same ID work by tracking the index along with the ID, rather than just the ID.

There's now a PR for the first step: #74943
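
To illustrate the idea behind that first step (this is a sketch under my own assumptions, not the implementation in #74943): settle each buffered update by its position (index) in the flushed batch rather than by looking results up by ID, so two concurrent updates of the same task ID each resolve their own caller's promise.

```ts
interface TaskUpdate { id: string; attributes: Record<string, unknown> }
interface TaskResult { id: string; attributes: Record<string, unknown> }

interface Pending {
  update: TaskUpdate;
  resolve: (result: TaskResult) => void;
  reject: (error: Error) => void;
}

class BufferedUpdates {
  private buffer: Pending[] = [];

  // each caller gets its own promise, even for duplicate task IDs
  update(update: TaskUpdate): Promise<TaskResult> {
    return new Promise((resolve, reject) => {
      this.buffer.push({ update, resolve, reject });
    });
  }

  // assumes the bulk API returns results in request order
  async flush(bulkUpdate: (updates: TaskUpdate[]) => Promise<TaskResult[]>): Promise<void> {
    const pending = this.buffer;
    this.buffer = [];
    try {
      const results = await bulkUpdate(pending.map((p) => p.update));
      // settle by index, so duplicate IDs can't leave a promise hanging
      results.forEach((result, index) => pending[index].resolve(result));
    } catch (error) {
      pending.forEach((p) => p.reject(error as Error));
    }
  }
}
```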
