[Alerting] Task Manager doesn't automatically recover if polling fails #74785
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
After doing some research yesterday, I'm now a little more confident that I understand the failure case we encountered last week. This still doesn't address the potential case where the poller might break or hang for some other unknown reason, so I still think we need a nodemon-like solution, but at least this will reduce the chances of a poller restart being needed. To address these issues, I'll progress with 3 separate PRs:
There's now a PR for the first step: #74943
Task Manager doesn't have any built-in ability to recover if the polling cycle fails.
In the past we have identified failure cases where the polling cycle broke and addressed them, but ideally TM would recover on its own when such a case happens by restarting the broken poller.
For us to gain full confidence in mission-critical usage of alerting, a nodemon-like ability to restart the internal poller seems paramount.
Alongside this change, we should expose metrics that can be collected on demand to aid in SDH support once we go GA.
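To make the idea concrete, here is a rough sketch of what a nodemon-like watchdog with on-demand metrics could look like. This is not Task Manager's actual code: `Restartable`, `PollerWatchdog`, `heartbeat`, `check`, `getMetrics`, and `maxSilenceMs` are all hypothetical names, and it assumes the poller can report the end of each successful polling cycle.

```typescript
// Hypothetical sketch only; none of these names exist in Task Manager.
interface Restartable {
  start(): void;
  stop(): void;
}

interface WatchdogMetrics {
  restarts: number;
  msSinceLastHeartbeat: number;
}

class PollerWatchdog {
  private readonly poller: Restartable;
  private readonly maxSilenceMs: number;
  private readonly now: () => number;
  private lastHeartbeat: number;
  private restarts = 0;

  constructor(
    poller: Restartable,
    maxSilenceMs: number,
    now: () => number = () => Date.now() // injectable clock, for testing
  ) {
    this.poller = poller;
    this.maxSilenceMs = maxSilenceMs;
    this.now = now;
    this.lastHeartbeat = now();
  }

  // The poller would call this at the end of every successful polling cycle.
  heartbeat(): void {
    this.lastHeartbeat = this.now();
  }

  // Meant to be called periodically (for example from setInterval).
  // Restarts the poller if it has been silent for longer than maxSilenceMs.
  check(): boolean {
    if (this.now() - this.lastHeartbeat <= this.maxSilenceMs) {
      return false;
    }
    this.poller.stop();
    this.poller.start();
    this.restarts += 1;
    // Give the restarted poller a full grace period before the next restart.
    this.lastHeartbeat = this.now();
    return true;
  }

  // Metrics that can be collected on demand.
  getMetrics(): WatchdogMetrics {
    return {
      restarts: this.restarts,
      msSinceLastHeartbeat: this.now() - this.lastHeartbeat,
    };
  }
}
```

Keeping the watchdog separate from the poller means it doesn't need to know why the poller stopped, which covers the "unknown reason" case; the trade-off is that `maxSilenceMs` has to be comfortably larger than the slowest healthy polling cycle.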