Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monitoring] Thread pool rejections alert #79433

Merged
merged 20 commits into from
Oct 30, 2020

Conversation

igoristic
Copy link
Contributor

Resolves #74822

The check calculates each data node to make sure that the thread pool rejections MAX within the last 5m range is below the implied threshold.

This is part of the "Additional Alerting" effort for Stack Monitoring
Screen Shot 2020-10-05 at 6 39 04 AM
Screen Shot 2020-10-05 at 6 42 05 AM
Screen Shot 2020-10-05 at 6 43 15 AM


Testing:

  1. Create a Stack Monitoring environment
  2. Through the regular Setup Mode > Alert Edit flow/UX, set search or write threshold to a value and make sure it's enabled eg:
    Screen Shot 2020-10-05 at 6 39 29 AM
  3. Modify field thread_pool.search.rejected in node_stats document (that's within the last 5 minutes) in the .monitoring-es* index

*Note: that it might take a couple of minutes for the notification to show up in the UI

@elasticmachine
Copy link
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome to see this already ready! Great work so far! I had a few comments so far and I'm going to keep testing but wanted a chance to start talking through the comments.

@jakommo
Copy link
Contributor

jakommo commented Oct 5, 2020

Nice work!

In the second screenshot there is a link "Tune thread pools" where does this link to? (Sorry, can't find it in the code).
We need to be cautious with this as it might give the wrong expectation that increasing the thread pool will fix the issue, but in almost all cases the solution is to leave the thread tools at default and fix the root cause for why they were exhausted. I.e. writing or querying too many shards or using very small bulk sizes etc.

@igoristic
Copy link
Contributor Author

@jakommo

In the second screenshot there is a link "Tune thread pools" where does this link to? ...

https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html

I agree 💯 that we should guide on how to treat the cause/problem and not the symptom. But, still wanted them to be aware of the thread pool API (incase it's applicable)

@jakommo
Copy link
Contributor

jakommo commented Oct 7, 2020

@igoristic thanks for calcifying.
I'm ok with keeping it to inform the user about it's existence, but I think we should move it to a less prominent spot (maybe last in the list?) and also change the name of the link. Maybe just call it "Thread pool settings" or so? The "Tune" makes it sound like increasing those would do any good 😁
cc @inqueue for input, because we talked about this a few weeks ago.

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, this is looking great!! Nice work separating these two out. I have a couple of suggestions and will take a deeper look in the code next.

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things I found in the code, but looking great!

Also, I wanted to bring up this code: https:/elastic/kibana/blob/master/x-pack/plugins/monitoring/server/alerts/missing_monitoring_data_alert.ts#L109. I wonder if we need to do this for all alerts that support a customizable duration (which I think they all do). We hard-code the query to fetch clusters to look 2m in the past, but that wasn't sufficient for the missing monitoring data alert, and I'm wondering if it's sufficient for the others too.

WDYT?

@igoristic igoristic requested a review from a team as a code owner October 26, 2020 22:12
@@ -54,7 +54,7 @@ pageLoadAssetSize:
mapsLegacy: 116817
mapsLegacyLicensing: 20214
ml: 82187
monitoring: 268612
monitoring: 288612
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our unofficial goal is to ultimately limit all page load asset sizes to 200kb in the next year or so, as teams continue to work reducing the size of their bundles. Is anyone from the monitoring team working to reduce the page load asset size of this bundle? a 20kb increase feels reasonable for now, but I'd like to ask that someone from the monitoring team take a look at the x-pack/plugins/monitoring/target/public/stats.json in one of the many webpack analyzers or visualizers after running node scripts/build_kibana_platform_plugins --focus monitoring --profile?

When I run it I see that the server code is being bundles in the page load bundle, which seems like the best way to fix the limit issue, rather than raising the limit.

image

PS, I'm working on unifying the docs for the best way to diagnose and deal with this stuff now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thank you @spalger 🙇

@@ -81,9 +81,9 @@
*
* @param {[type]} prov [description]
*/
import _ from 'lodash';
import { partial, uniqueId, isObject } from 'lodash';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, this has no impact on the size of the bundles since we always load and share a single, complete, lodash instance.

@igoristic
Copy link
Contributor Author

@chrisronline Re-requested your already approved review, since I had to remove all the /server/ imports which added a lot of weight to our bundle size. Also, loading some stuff async to improve initial load

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

@kbn/optimizer bundle module count

id before after diff
monitoring 633 607 -26

async chunk count

id before after diff
monitoring 1 7 +6

async chunks size

id before after diff
monitoring 892.7KB 964.4KB +71.7KB

distributable file count

id before after diff
default 48115 48137 +22

page load bundle size

id before after diff
monitoring 190.0KB 33.1KB -157.0KB

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Contributor

@spalger spalger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh snap, love the bundle limit decrease! 💯

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work!

@igoristic igoristic merged commit c1294f0 into elastic:master Oct 30, 2020
@igoristic igoristic deleted the threadpool_rejection_alert branch October 30, 2020 14:50
gmmorris added a commit to gmmorris/kibana that referenced this pull request Oct 30, 2020
* master: (71 commits)
  [Chrome] Extension to append an element to the last breadcrumb (elastic#82015)
  [Monitoring] Thread pool rejections alert (elastic#79433)
  [Actions] Fix actionType type on registerType function (elastic#82125)
  [Security Solution] Modal for saving timeline (elastic#81802)
  add tests for index pattern switching (elastic#81987)
  TS project references for share plugin (elastic#82051)
  [Graph] Fix problem with duplicate ids (elastic#82109)
  skip 'returns a single bucket if array has 1'.  related elastic#81460
  Add a link to documentation in the alerts and actions management UI (elastic#81909)
  [Fleet] fix duplicate ingest pipeline refs (elastic#82078)
  Context menu trigger for URL Drilldown (elastic#81158)
  SO management: fix legacy import index pattern selection being reset when switching page (elastic#81621)
  Fixed dead links (elastic#78696)
  [Search] Add "restore" to session service (elastic#81924)
  fix Lens heading structure (elastic#81752)
  [ML] Data Frame Analytics: Fix feature importance cell value and decision path chart (elastic#82011)
  Remove legacy app arch items from codeowners. (elastic#82084)
  [TSVB] Renamed 'positive rate' to 'counter rate' (elastic#80939)
  Expressions/migrations2 (elastic#81281)
  [Telemetry] [Schema] remove number type and support all es number types (elastic#81774)
  ...
igoristic added a commit that referenced this pull request Oct 30, 2020
* Thread pool rejections first draft

* Split search and write rejections to seperate alerts

* Code review feedback

* Optimized page loading and bundle size

* Increased monitoring bundle limit

* Removed server app import into the frontend

* Fixed tests and bundle size

Co-authored-by: Kibana Machine <[email protected]>
# Conflicts:
#	packages/kbn-optimizer/limits.yml
@igoristic
Copy link
Contributor Author

Backport:
7.x: 0163b50

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Monitoring][Additional-Alerting] Thread Pool Rejections
7 participants