Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to fire actions when an alert instance is resolved #49405

Closed
2 tasks done
mikecote opened this issue Oct 25, 2019 · 20 comments · Fixed by #82799
Closed
2 tasks done

Ability to fire actions when an alert instance is resolved #49405

mikecote opened this issue Oct 25, 2019 · 20 comments · Fixed by #82799
Assignees
Labels
enhancement New value added to drive a business result Feature:Alerting Meta needs_docs Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Copy link
Contributor

mikecote commented Oct 25, 2019

The has been a lot of requests for users wanting to be notified whenever an alert instance is no longer considered "active". The determination of this is done whenever the event log logs a resolved-instance event here.

This issue should add a static action group called resolved that is available to all alert types and allows alert actions to attach themselves to it. This action group should fire using the same logic used to log resolved-instance and should leverage the UI work done in #64077 to add a "Resolved" option to the group dropdown.

NOTE: Some UI work will be blocked until #64077 is finished, the server side work can be done in parallel.

Steps:

Original description

One option is to create some sort of "resolved" action group. Alert actions can assign themselves to this new action group. The new action group would fire when an alert instance stops firing (considered resolved when nothing to alert on anymore).

Another option may be to give the alert type executor the tools it needs to manually mark alert instances as resolved.

Note for docs

User guides will need to be updated for this new feature. Maybe the PagerDuty connector docs should also be updated to indicate how to "resolve" a PagerDuty incident when the alert has recovered.

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@pmuellr
Copy link
Member

pmuellr commented Apr 10, 2020

Seems like the easiest thing to do is to handle it as a a "built-in" action group. An alert executor wouldn't have to do anything explicit. When an instanceId which had scheduled actions on it's previous turn did not schedule actions on it's current turn, we would treat this instanceId as resolved, and schedule the actions for the "resolved" action group. Every alert would have this action group available, for free.

Raises the following questions:

  • what is the name of this action group? Let's say it's "resolved" for the subsequent questions
  • can an alert type define an action group with this name? What other "reserved" action group names do we think we might have in the future?
  • can an alert type specifically schedule actions for this group itself?
  • should there be more knobs and dials, like being able to specify how many alert executions must not have scheduled actions before firing the "resolved" action group? Or perhaps time-based?

@chrisronline
Copy link
Contributor

Stack Monitoring is doing this right now, but it's not the solution we want long term.

We are detecting two separate scenarios and firing actions for both:

  1. The alert is firing
  2. The alert is not firing, but was firing the last time the alert ran

We call scheduleActions with different data based on the above two scenarios.

Ideally, we don't have to worry about this and the alerting solution provides this for us.

It feels like something every solution will want, as it helps close the loop on the fixing the underlying issue causing the alert.

@Zacqary
Copy link
Contributor

Zacqary commented May 5, 2020

@chrisronline Did y'all figure out how to handle recovery alerts when a renotify period is specified? I'm trying to implement these for Metrics right now, and I can only get it to work without a renotify period. When I do specify one:

  • scheduleActions does not schedule any actions when it's called between renotify intervals
  • alertInstance.replaceState still behaves as expected, persisting the new state or clearing it based on whether scheduleActions is called
  • If The alert is not firing, but was firing the last time the alert ran occurs during the blackout period between renotifications, the user won't get the recovery alert
  • The next alert they receive will probably be another The alert is firing message, and we're back to square one of having no recovery alerts configured

I'm not sure how to get around this without full support for multiple action groups, which I believe is tracked in #64077

@Zacqary
Copy link
Contributor

Zacqary commented May 7, 2020

Spoke to @chrisronline and apparently they don't have a solution.

Right now a single action group has to do double-duty for both Fired and Recovered alerts, and the alerts can get swallowed if throttling is turned on.

We definitely need core support from the plugin for this.

@mikecote
Copy link
Contributor Author

mikecote commented May 7, 2020

Thanks for the input @Zacqary. We're aiming to have a built-in solution in 7.9, stay tuned!

@sgrodzicki sgrodzicki added this to the Metrics UI 7.9 milestone May 25, 2020
@Zacqary
Copy link
Contributor

Zacqary commented Jun 15, 2020

Any update on this? Still have 7.9 work in Metrics UI blocked by this so I'm wondering if y'all have a timetable

@mikecote
Copy link
Contributor Author

mikecote commented Jun 16, 2020

@Zacqary The best update I have is aiming for late 7.9 or 7.10 at this point.

@shahzad31
Copy link
Contributor

We have received lots of uptime alerts users requests for this, i appreciate this being prioritised.

@justinfiore
Copy link

Any update on this issue? Did it make 7.9? Or will it be 7.10? Or later.
An alerting system that can't fire actions on resolution is of limited utility to our company.
This is why we use Grafana alerts on our Elastic data (which is not ideal for many reasons).
If this feature was implemented, we would definitely switch over to Kibana Alerting.

@mikecote
Copy link
Contributor Author

Hi @justinfiore, this is planned to be part of the 7.11 release and to be worked on starting soon. Stay tuned!

This issue will rely on the UI work done in #64077 and add a "resolved" group to all types of alerts.

@mikecote
Copy link
Contributor Author

mikecote commented Nov 5, 2020

I'm wondering if some types of alerts would like to opt-out of this functionality if ever it doesn't make sense to them?

Example:
If the tracking alert uses two action groups like inside / contained and outside a polygon. Would there be a use case for the resolved action group? Or would it be better to use inside / contained and resolved (to represent outside a polygon)?

cc @YulNaumenko

@pmuellr
Copy link
Member

pmuellr commented Nov 5, 2020

response from some questions I asked here #49405 (comment), as it relates to an initial implementation:

  • can an alert type specifically schedule actions for this group itself?

I think, no, for now. I think we could prevent this, but if we can't, that's fine for now as well.

  • should there be more knobs and dials, like being able to specify how many alert executions must not have scheduled actions before firing the "resolved" action group? Or perhaps time-based?

Not now.

And a response to a question from @mikecote in #49405 (comment)

I'm wondering if some types of alerts would like to opt-out of this functionality if ever it doesn't make sense to them?

I raised a question in PR #82645 (comment), regarding what events should be generated when an alert "switches" action groups while it's active. One option, is that it might "resolve" with the old group and then new-instance on the new group. I think maybe we need to answer that before wondering whether opt-out is something to consider.

But in any case, it would be opt-out at the alert type, you're thinking, right? So seems like it would be easy to add this later if we needed to.

@pmuellr
Copy link
Member

pmuellr commented Nov 5, 2020

@YulNaumenko were just chatting about what "context" variables would be available for the resolved action group. Answer (today): none (I think).

I hope we can make all the other variables available, except the context ones; they're set here:

const variables = {
alertId,
alertName,
spaceId,
tags,
alertInstanceId,
context,
state,
params: alertParams,
};
return Mustache.render(value, variables);
});

I'd hate to have to do more GETs to get the variable values for the bits listed there (including the state), since we presumably have all that info at some point during the call to the alert type executor, but it might end up being difficult to make it available where we need it. A couple more GETs for a resolved action execution would probably be worth it though, for now, if it made the code simpler.

In theory, we could probably arrange for the alert executors to make the context variables available in such a way that we could get them for the resolved action execution, but it would mean a change to the way the context is set in the alert type, so I'm hesitant to make a change like that now. Eg, a new API in the services object passed in to set the context variables, independent of scheduling actions.

@mikecote
Copy link
Contributor Author

mikecote commented Nov 6, 2020

But in any case, it would be opt-out at the alert type, you're thinking, right? So seems like it would be easy to add this later if we needed to.

That's correct and agree it's something we can handle later. Looks like we can discuss it in #82792 🙂

I hope we can make all the other variables available, except the context ones; they're set here

I think I'm +1 on skipping context for now and we can revisit later.

@YulNaumenko
Copy link
Contributor

I have a Draft PR where introduced the solution without the possibility to use the alert execution context. If we are OK with the approach, then we need to handle available variables at least in the UI, because it can raise a bunch of questions and SDHs.
Still need to figure out how to be with the alert types like tracking alert, at least need to make it works as by design with the current changes.

@pmuellr
Copy link
Member

pmuellr commented Nov 6, 2020

I have a Draft PR where introduced the solution without the possibility to use the alert execution context. If we are OK with the approach, then we need to handle available variables at least in the UI, because it can raise a bunch of questions and SDHs.

Yeah, I'm fine with this approach for now, don't see how we can get the context vars in there.

In terms of the UI, we don't actually have anything that prevents people from entering invalid variable names now, right? But we do have the list of variables that pops up, that can be inserted via selecting them. Can we have the UI identify that it's editing an action in the "resolved" group, and not list those variables in the pop-up? Absolute worst-case, since I think the plan is to somehow provide a link to doc for the available mustache variables per alert, we could note in that doc that the context.* variables are not available in the resolved action group.

Still need to figure out how to be with the alert types like tracking alert, at least need to make it works as by design with the current changes.

The geo-tracking alert? My understanding is that it will NEVER be in a resolved state, it would always be in an "inside" or "outside" action group. But they could perhaps change the design so that the customer selects if they want to run actions on "inside" or "outside", and when the alert "crosses the line" from the customer selected "side" to the other one, we'd fire the resolved group. I'm not seeing any issues with this alert regarding the resolved action group, but maybe I'm missing something ...

@mukeshelastic
Copy link

@arisonl In order for the auto-closing a PD incident when an alert is resolved, do users needs to do something in rule configuration? If not, great. If so, is that documented?

@mikecote
Copy link
Contributor Author

mikecote commented Dec 18, 2020

@mukeshelastic users will need to do something in the rule configuration in order to make PD incidents close when an alert recovers. We've made the steps a few less clicks with this PR where the user just needs to add another PD action that "Runs When: Recovered" and the form settings will be populated with default values that resolve the incident.

The PR has a needs_docs label so we'll definitely clarify this in the documentation with the necessary steps.

@arisonl
Copy link
Contributor

arisonl commented Dec 18, 2020

@mukeshelastic As Mike mentions, users will need to specify on the UI that they want the PD incident to be resolved when the alert recovers. This will be documented but we also plan to go beyond documentation by releasing additional "how-to" content around actions and integrations.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New value added to drive a business result Feature:Alerting Meta needs_docs Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.