[Metrics] POC for Alerting on Metric Threshold #47165
Closed
Closes #46511
(DON'T MERGE THIS TO MASTER)
This is a proof of concept for a metric alerting system. Through three endpoints, you can execute CRUD operations (well, CRD in this proof of concept) to manage a set of metric threshold alerts for any part of your infrastructure.
Design Intent
There is no UI yet, but I designed this system with the expectation that a user would:
Out of the three basic types of search queries you might execute:
This proof of concept allows you to create alerts on 1. and 2. It should be easy to add functionality for 3. by borrowing code from the Snapshot query, but in the interest of timeboxing this POC I stopped short of doing that.
The endpoints in this POC allow you to input all the parameters of a metric query, plus:
(You can also have it send a server log to Kibana on alert, but this is primarily for testing. There's a parameter for email alerts as well but I didn't get those working yet.)
Testing
To test this, make sure to explicitly enable both required plugins in your Kibana config file:
You will also need to run Elasticsearch and Kibana with SSL enabled in order to use the Alerting APIs:
The shared Observability clusters don't seem to work with SSL enabled in Kibana. Try a Cloud account if you want to test this with more complexity than you can locally.
How It Works
groupBy queries are implemented by creating one alert for each possible group with a single API call. There's probably a more efficient way to do this using an Elasticsearch query and a single alert.
API Reference
POST /api/infra/alerts/metric_threshold - Create Alert
This API creates a new metric threshold alert. The /metric_threshold URL is a convention I'm using under the assumption that we might add more types of metric alerts besides simple threshold, such as rate of change alerts, anomaly or outlier detection, forecasting, etc.
Query parameters
metric :: (Required, string) The metric to measure and alert on, e.g. system.load.1
aggregator :: (Required, string) Valid options are avg, max, min, cardinality, rate, and count
comparator :: (Required, string) Valid options are >, <, >=, or <=
threshold :: (Required, number) An alert will fire when the metric is >, <, >=, or <= this value (as defined by the comparator)
interval :: (Required, string) Must be a valid calendar interval. This is how often to run the alert, and also the length of the time bucket that it will evaluate data over
searchField :: (Required, object) Takes a name and value param. This defines the field to retrieve metric data from.
  name :: (Required, string) e.g. host.hostname, agent.id, etc.
  value :: (Required, string) If this is a specific value, a single alert for that value (e.g. host.name: myHost) will be created. If this is *, a multi-alert will be created for every possible value of searchField.name. Essentially it will track every chart that you'd get back in a groupBy query on the Metrics Explorer.
indexPattern :: (Required, string) The index pattern to query for metric data, e.g. metricbeat-*
actions :: (Required, object) This can contain one or more of:
  slack :: (Optional, string) A webhook URL for a Slack channel. When this alert fires, it will send notifications to this channel.
  log :: (Optional, boolean) If true, this will log out a message to the Kibana server when the alert fires.
  email :: (Not yet implemented)
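As a rough illustration of the parameters above, here is a minimal TypeScript sketch of what a request payload and the threshold check could look like. The type names, the helper names (isThresholdBreached, expandSearchField), the example values, and the idea of passing in a precomputed list of group values are all assumptions made for illustration; they are not taken from the POC code.

```typescript
// Hypothetical types mirroring the documented parameters.
// The email action is omitted because it is not yet implemented.
type Aggregator = 'avg' | 'max' | 'min' | 'cardinality' | 'rate' | 'count';
type Comparator = '>' | '<' | '>=' | '<=';

interface MetricThresholdAlertParams {
  metric: string;
  aggregator: Aggregator;
  comparator: Comparator;
  threshold: number;
  interval: string;
  searchField: { name: string; value: string };
  indexPattern: string;
  actions: { slack?: string; log?: boolean };
}

// One comparison function per documented comparator.
const comparators: Record<Comparator, (value: number, threshold: number) => boolean> = {
  '>': (value, threshold) => value > threshold,
  '<': (value, threshold) => value < threshold,
  '>=': (value, threshold) => value >= threshold,
  '<=': (value, threshold) => value <= threshold,
};

// Returns true when the measured value satisfies the comparator against the threshold.
function isThresholdBreached(value: number, comparator: Comparator, threshold: number): boolean {
  return comparators[comparator](value, threshold);
}

// Assumed fan-out step: when searchField.value is "*", produce one set of
// params per group value; otherwise return the params unchanged.
function expandSearchField(
  params: MetricThresholdAlertParams,
  groupValues: string[]
): MetricThresholdAlertParams[] {
  if (params.searchField.value !== '*') {
    return [params];
  }
  return groupValues.map(value => ({
    ...params,
    searchField: { ...params.searchField, value },
  }));
}

// Example payload with made-up values.
const exampleParams: MetricThresholdAlertParams = {
  metric: 'system.load.1',
  aggregator: 'avg',
  comparator: '>',
  threshold: 1.5,
  interval: '1m',
  searchField: { name: 'host.hostname', value: '*' },
  indexPattern: 'metricbeat-*',
  actions: { log: true },
};
```

Calling expandSearchField(exampleParams, ['host-a', 'host-b']) would yield two parameter sets, one per host, which matches the "one alert per group" behavior described under How It Works.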
GET /api/infra/alerts/list - List Alerts
This API will return a JSON array of all the currently created metric alerts, plus their current alert states. The value of currentAlertState can be:
0 - The alert is in an "OK" state
1 - The alert is in an "ALERT" state
Included in the AlertStates enum, but not yet implemented, are:
2 - A "WARN" state
3 - A "NO DATA" state, for when the alert queries the metric and receives no data back
4 - A "SNOOZED" state, for when we don't want the alert to fire right now
For multi-alerts, this API will return a parent alert that lists the IDs of an individual child alert for each grouping. In this POC, you will need to refer to each child alert to determine the overall alert state. We can automate this in a later iteration.
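To make the state values concrete, here is a small TypeScript sketch of the enum and of one possible way to derive a parent alert's state from its children. The numeric values come from the list above; the member names and the rollUpChildStates helper are assumptions, and the rollup rule (any child in ALERT puts the parent in ALERT) is just one plausible choice, not something this POC implements.

```typescript
// Numeric values match the documented states; member names are assumed.
enum AlertStates {
  OK = 0,
  ALERT = 1,
  WARN = 2, // not yet implemented in the POC
  NO_DATA = 3, // not yet implemented in the POC
  SNOOZED = 4, // not yet implemented in the POC
}

// Hypothetical rollup: the parent is in ALERT if any child is in ALERT,
// otherwise it is OK. An empty list of children is treated as OK.
function rollUpChildStates(childStates: AlertStates[]): AlertStates {
  return childStates.some(state => state === AlertStates.ALERT)
    ? AlertStates.ALERT
    : AlertStates.OK;
}
```

A client of the List API could apply something like this to the child alerts referenced by a parent until the rollup is automated on the server side.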
DELETE /api/infra/alerts - Delete an Alert
This API will delete an alert that you've created. This is a wrapper for the Alerting API's delete system, with some additional features:
Query parameters
id :: (Required, string) The ID of the alert you'd like to delete
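For illustration, a client could build the delete request URL along these lines. The buildDeleteAlertUrl helper is hypothetical; only the /api/infra/alerts path and the id query parameter come from this description.

```typescript
// Hypothetical helper: builds the DELETE URL using the id query parameter.
// encodeURIComponent guards against IDs containing reserved characters.
function buildDeleteAlertUrl(id: string): string {
  return `/api/infra/alerts?id=${encodeURIComponent(id)}`;
}
```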
Known issue
The syntax for this query is /api/infra/alerts?id=<id>. I would prefer to do it like /api/infra/alerts/<id> but I don't actually know how to configure a path parameter using our routing system, so if someone could tell me how to do that, that would be great.
Feedback for the Alerting Team
API Limitations
Due to limitations in the way the Alerting API handles saved objects, our Create Alert API will:
infrastructure-alert in order to keep track of which alerts were created by the infrastructure app
The List Alerts API retrieves its alerts from the infrastructure-alert SavedObject collection. If there were a way to add tags to created alert instances so that we could retrieve them later, we might not have to maintain as much of a separate SavedObject database.
The Alerting API also doesn't allow you to retrieve the current state of an alert instance. Therefore, every time an alert evaluates, I have it update its infrastructure-alert SavedObject with its current state so that the List API can display it. I'd prefer to have an endpoint in the Alerting API that would allow me to do this.
#47379 is necessary for this POC to work. I cherry-picked the essential parts of it, but please definitely merge that.
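The state-tracking workaround described in this section can be pictured roughly as follows. This is a simplified sketch that uses an in-memory Map in place of the real infrastructure-alert SavedObject collection; the InMemoryAlertStore class, the onAlertEvaluated function, and the record shape are all hypothetical and do not use the actual saved objects client.

```typescript
// Hypothetical record shape standing in for an infrastructure-alert SavedObject.
interface InfrastructureAlertRecord {
  id: string;
  currentAlertState: number; // 0 = OK, 1 = ALERT
}

// In-memory stand-in for the SavedObject collection (assumption for illustration).
class InMemoryAlertStore {
  private records = new Map<string, InfrastructureAlertRecord>();

  // Creates the record if missing, otherwise overwrites its state.
  upsertState(id: string, currentAlertState: number): void {
    this.records.set(id, { id, currentAlertState });
  }

  // What a List API could read from to report current states.
  list(): InfrastructureAlertRecord[] {
    return Array.from(this.records.values());
  }
}

// Called each time an alert evaluates: persist the latest state so it can be listed later.
function onAlertEvaluated(store: InMemoryAlertStore, alertId: string, breached: boolean): void {
  store.upsertState(alertId, breached ? 1 : 0);
}
```

An Alerting API endpoint that exposed the current state of an alert instance would make this extra write on every evaluation unnecessary.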
Documentation
While I did end up figuring out what actionGroups were for, and I used them to differentiate between sending a fired notification and a recovered notification, I do agree with the note on #46547 that the documentation could be clearer about them.
It was difficult for me to figure out the {{{context}}} convention of templating alert messages. I copied this from the APM POC, but I'm still not sure where the docs actually explain that.