Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] time out work when it overruns in poller #74980

Merged
merged 3 commits into from
Aug 18, 2020

Conversation

gmmorris
Copy link
Contributor

@gmmorris gmmorris commented Aug 13, 2020

Summary

This PR addresses a potential failure case discovered during the investigation of this issue: #74785

If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@gmmorris gmmorris requested a review from a team as a code owner August 13, 2020 18:00
@gmmorris gmmorris added Feature:Alerting Feature:Task Manager release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Aug 13, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Copy link
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gmmorris
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but left some questions, mainly about potential timing issues

// The `work` phase includes the prework needed *before* executing a task
// (such as polling for new work, marking tasks as running etc.) but does not
// include the time of actually running the task
workTimeout: opts.config.poll_interval * opts.config.max_poll_inactivity_cycles,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you used a "count" as the config bit here, instead of a duration - my guess is the "count" makes this setting "move" if the poll interval changes, which I think is a nice pattern - I'd been thinking some of our other explicit timeouts could be "relative" like this as well - eg, throttle based on alert interval (eg, throttle for 10 intervals, not 10 minutes, or probably allow both).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, Mike asked the same. :)
It's exactly that - to make it relative to the polling interval.
It also reduces the scenarios you need to support, such as a work duration that's lower than polling interval, and the potential complexity that introduces.

pollRequests$.next(some(three));

advance(workTimeout);
await sleep(workTimeout);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we account for a bit of slop here? Seems like it shouldn't be possible for this to resolve before the workTimeout, but you know ... node, time, etc, heh. Maybe just add 100ms or so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to start here and address if it does introduce flakiness.
In this case it's using fake timers, which I've found to be quite accurate, because it isn't actually time based, but rather is based on the queued timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I'm guessing even if we did see some timing flakiness, it would be fairly obvious that was the problem, given the context, so WORKSFORME

await promiseResult<H, Error>(
timeoutPromiseAfter<H, Error>(
work(...pullFromSet(set, getCapacity())),
workTimeout ?? pollInterval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the case for workTimeout being undefined? Tests? I worry that if there a path in production code to have this as undefined, the pollInterval value used would be too close to a valid cycle - should it at least be some kind of multiple of pollInterval, maybe like 2 or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm it was really just a default... but yeah, you're right that if someone sets the interval to be lower than the latency of talking to Elasticsearch this will behave badly 🤔
That said... same is true for the actual polling....

Not sure about the * 2 approach, as we have no idea what the interval will be and it might be set to something ridiculous like 30m, in which case, timeout will be 60m....
That said... if your poll is 30m you don't care about immediate results anyway, do you 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the default config is set to 10 I think we can leave it as is for now and see how it behaves.
Unless someone specifically changes it, it'll always be far longer than the interval itself, and as this is here as a safeguard against issues in marking tasks as running (this time does not include task execution) I feel comfortable waiting to see how it behaves in the real world.

@gmmorris gmmorris merged commit 773883f into elastic:master Aug 18, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 18, 2020
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 18, 2020
* master:
  Skip failing test in CI (elastic#75266)
  [Task Manager] time out work when it overruns in poller (elastic#74980)
  [Drilldowns] misc improvements & fixes (elastic#75276)
  Small README note on bumping memory for builds (elastic#75247)
  [Security Solution][Detections] Adds exception modal tests (elastic#74596)
  [Dashboard] Sample data link does not work (elastic#75262)
  [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905)
  [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166)
  convert processor labels to sentence case (elastic#75278)
  [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160)
  Clarify no documents error message when filtering by is_training (elastic#75227)
  [Lens] Fix crash when two layers xychart  switches to pie (elastic#75267)
  [Observability Homepage] Fix console error because of side effect (elastic#75258)
  [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146)
  [ML] Functional tests - re-activate DFA test suites (elastic#75257)
  GS providers improvements (elastic#75174)
  [Visualize] First version of by-value visualize editor (elastic#72256)
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 19, 2020
…emove-header

* saved-objects/version-on-create: (59 commits)
  remove version when loading sample data
  omit version from SO import/export
  Skip failing test in CI (elastic#75266)
  [Task Manager] time out work when it overruns in poller (elastic#74980)
  [Drilldowns] misc improvements & fixes (elastic#75276)
  Small README note on bumping memory for builds (elastic#75247)
  [Security Solution][Detections] Adds exception modal tests (elastic#74596)
  Revert "Revert "added missing core docs""
  Revert "Revert "added version to saved object bulk creation""
  [Dashboard] Sample data link does not work (elastic#75262)
  [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905)
  [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166)
  convert processor labels to sentence case (elastic#75278)
  [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160)
  Clarify no documents error message when filtering by is_training (elastic#75227)
  [Lens] Fix crash when two layers xychart  switches to pie (elastic#75267)
  [Observability Homepage] Fix console error because of side effect (elastic#75258)
  [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146)
  [ML] Functional tests - re-activate DFA test suites (elastic#75257)
  GS providers improvements (elastic#75174)
  ...
gmmorris added a commit that referenced this pull request Aug 19, 2020
)

If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 19, 2020
* master: (112 commits)
  [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364)
  Update Node.js to version 10.22.0 (elastic#75254)
  [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953)
  [Discover] Fix histogram cloud tests (elastic#75268)
  Uiactions to navigate to visualize or maps (elastic#74121)
  Use prefix search invis editor field/agg combo box (elastic#75290)
  Fix docs in trigger alerting UI (elastic#75363)
  [SIEM] Fixes search bar Cypress test (elastic#74833)
  Add libnss3.so to Dockerfile template (reporting) (elastic#75370)
  [Discover] Create field_button and add popovers to sidebar (elastic#73226)
  [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105)
  [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207)
  Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730)
  [src/dev/build] remove node-version from snapshots (elastic#75303)
  [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886)
  [Canvas] Remove dependency on legacy expressions APIs (elastic#74885)
  Skip failing test in CI (elastic#75266)
  [Task Manager] time out work when it overruns in poller (elastic#74980)
  [Drilldowns] misc improvements & fixes (elastic#75276)
  Small README note on bumping memory for builds (elastic#75247)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Alerting Feature:Task Manager release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants