
Eliminate the downtime between tasks completing and the next polling interval #65552

Closed
mikecote opened this issue May 6, 2020 · 13 comments

@mikecote
Contributor

mikecote commented May 6, 2020

From @kobelb

If we always kept a certain number of tasks claimed and ready for idle workers, we'd drastically reduce the idle time.
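For a rough sense of the latency involved (my numbers, assuming the default 3s poll interval, not figures from this issue): a worker slot that frees up between polls sits idle until the next claim cycle, so roughly half an interval on average.

```ts
// Back-of-the-envelope idle-time estimate; 3000 ms is, I believe, the default
// xpack.task_manager.poll_interval, and the rest is simple arithmetic.
const pollIntervalMs = 3000;
const avgIdlePerFreedSlotMs = pollIntervalMs / 2; // slot frees at a random point in the interval
const worstCaseIdleMs = pollIntervalMs;           // slot frees right after a poll completes

console.log({ avgIdlePerFreedSlotMs, worstCaseIdleMs }); // ~1500 ms average, 3000 ms worst case
```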

@mikecote mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels May 6, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jul 21, 2020

This needs more context.

My initial guess is that rather than waiting for the task manager interval to complete, we would like to read from the queue earlier if worker slots become available.

If so,

  • was there some threshold of empty worker slots at which we'd re-read the queue? Like, over 50%? Or would we just re-read the queue whenever a new worker slot becomes available?
  • does the existing interval basically become an idle timeout? E.g., once we do a queue read, we start the interval timer, so we'd only do another read at that interval if no workers became available in that time slice. Presumably if a worker (or some percentage of workers) became available, we'd read immediately, cancel the outstanding idle timeout, and start a new one once that read was done. (A rough sketch of this option follows the list.)
  • were we thinking that maybe we'd claim more tasks than we have workers for, and then run those as existing workers finished, so we could keep the worker slots busy without having to do another read?
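To make the second option concrete, here's a minimal sketch of an idle-timeout-style poller. This is my illustration, not Task Manager's actual polling code, and the function and callback names are made up:

```ts
// Illustrative only: an interval that behaves like an idle timeout. Every queue read
// restarts the timer; a freed worker slot triggers an immediate read and resets it.
type ClaimFn = () => Promise<void>;

function createIdleTimeoutPoller(claim: ClaimFn, idleTimeoutMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const scheduleIdleClaim = () => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => void runClaim(), idleTimeoutMs);
  };

  const runClaim = async () => {
    if (timer) clearTimeout(timer); // cancel the pending idle read
    await claim();                  // read the queue / claim tasks
    scheduleIdleClaim();            // restart the idle timeout after the read
  };

  return {
    start: scheduleIdleClaim,
    stop: () => timer && clearTimeout(timer),
    // call this whenever a worker slot (or some percentage of slots) frees up
    onWorkerAvailable: () => void runClaim(),
  };
}
```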

@gmmorris
Contributor

gmmorris commented Jul 21, 2020

IIRC the intention was to claim more tasks than we have workers for and work through them if a few workers finish before the next interval - so your last option.

@pmuellr
Member

pmuellr commented Jul 21, 2020

Ah cool, that seems pretty straightforward. Any thoughts on how many extra we'd ask for? 50%?

@gmmorris
Contributor

gmmorris commented Jul 21, 2020

I think we were talking about doubling it, but we can make it configurable 🤷

I'm working on this and think there might be some overlapping work here: #71441
Since we can't limit how many items of a certain type are returned by the query, we might need some smart queue that takes on more than needed and pulls in work by type... 🤔 I'm still playing around with ideas.
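To illustrate the "make it configurable" idea: a hypothetical sketch of the config schema growing an over-claim factor. `claim_factor` is a made-up setting name; the other settings and defaults reflect my understanding of the existing Task Manager config.

```ts
import { schema } from '@kbn/config-schema';

// Hypothetical extension of Task Manager's config schema; `claim_factor` is invented
// for illustration, and the defaults are assumed values.
export const configSchema = schema.object({
  max_workers: schema.number({ defaultValue: 10, min: 1 }),
  poll_interval: schema.number({ defaultValue: 3000, min: 100 }),
  claim_factor: schema.number({ defaultValue: 2, min: 1 }), // claim max_workers * claim_factor tasks
});
```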

@pmuellr
Member

pmuellr commented Jul 21, 2020

Guessing it will be painful to work on this and #71441 at the same time with two different people; it does feel like there's going to be some overlap.

@pmuellr
Member

pmuellr commented Jul 21, 2020

As we can't limit how many items are returned of a certain type in the query

One simple thing we could do: if we know we're already "at capacity" for type X, add a filter to the query so it doesn't return ANY X's.
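A minimal sketch of what that filter could look like (illustrative only; I'm assuming the task documents expose the type under a field like task.taskType, and the type id shown is hypothetical):

```ts
// Exclude task types that are already at capacity from the claim query (sketch only).
const typesAtCapacity = ['alerting:example.rule-type']; // hypothetical type id

const claimQueryFilter = {
  bool: {
    must_not: typesAtCapacity.map((taskType) => ({
      term: { 'task.taskType': taskType },
    })),
  },
};
```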

@gmmorris
Contributor

gmmorris commented Jul 21, 2020

Yeah, I had the same thoughts: #71441 (comment)

@pmuellr
Member

pmuellr commented Jul 21, 2020

I'm curious how this is going to work if tasks get claimed but the claim "times out" because the currently running tasks have prevented the claimed tasks from being run. We must have some kind of claiming timeout somewhere to handle the case of tasks getting claimed by a Kibana instance that then goes down. Presumably we'll be hitting those cases a lot more once we start asking for more tasks than we can run. And I guess we'd need to check some of these before we run them, to make sure some other Kibana instance hasn't claimed them in the meantime.

@gmmorris
Contributor

When Task Manager tries to mark a task as running, it'll either fail because the task has been claimed by someone else or it'll update the expiration, so it should work fine.
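A rough sketch of that check (not the actual Task Manager implementation; the TaskStore interface here is a stand-in): marking a task as running is a versioned update, so a task re-claimed by another Kibana in the meantime fails with a version conflict and is simply skipped.

```ts
interface ClaimedTask {
  id: string;
  version: string; // document version captured when the task was claimed
}

interface TaskStore {
  // performs a versioned update; rejects with { statusCode: 409 } on a version conflict
  update(task: ClaimedTask & { status: string; retryAt: Date }): Promise<void>;
}

async function tryMarkRunning(store: TaskStore, task: ClaimedTask, timeoutMs: number): Promise<boolean> {
  try {
    await store.update({
      ...task,
      status: 'running',
      retryAt: new Date(Date.now() + timeoutMs), // refresh the claim expiration
    });
    return true;
  } catch (err: any) {
    if (err?.statusCode === 409) {
      return false; // claimed or updated by another Kibana in the meantime: skip it
    }
    throw err;
  }
}
```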

@gmmorris gmmorris self-assigned this Jul 22, 2020
@pmuellr pmuellr self-assigned this Jul 22, 2020
@gmmorris gmmorris removed their assignment Jul 22, 2020
@mikecote
Contributor Author

Regarding #71441: that issue is purely research, and we plan to support limited concurrency in #54916 (~7.11) based on the research/findings.

pmuellr added a commit to pmuellr/kibana that referenced this issue Aug 19, 2020
resolves elastic#65552

Currently TaskManager attempts to claim exactly the number of tasks that
it has capacity for.

As an optimization, we're going to change it to request more tasks than it has
capacity for. This should improve latency: when tasks complete, some of these
excess tasks can be started right away, as they are already claimed (they still
need to be marked running).

All the plumbing already handles getting more tasks than we asked for; we were
just never asking for more than we needed previously.
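As a paraphrase of that change (my sketch, not the PR's code; MAX_WORKERS and CLAIM_FACTOR are assumed values), the claim size goes from the number of free workers to a multiple of it:

```ts
const MAX_WORKERS = 10;  // assumed worker pool size
const CLAIM_FACTOR = 2;  // "doubling it", per the discussion above

function tasksToClaim(busyWorkers: number): number {
  const availableWorkers = MAX_WORKERS - busyWorkers;
  // Before: claim exactly `availableWorkers`.
  // After: claim up to `availableWorkers * CLAIM_FACTOR`; the first `availableWorkers`
  // start immediately, the excess wait for workers to free up (and each excess task
  // still has to be successfully marked running before it executes).
  return availableWorkers * CLAIM_FACTOR;
}
```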
pmuellr added a commit to pmuellr/kibana that referenced this issue Aug 20, 2020
@pmuellr
Member

pmuellr commented Oct 21, 2020

I've done a little more thinking on the "fetch more tasks to run than available capacity" PR #75429 - which doesn't actually work right now anyway.

My main concern at this point is that I believe this will actually cause more 409 conflicts when claiming / marking running when there is more than one Kibana instance running. Presumably each Kibana will be getting even more tasks that conflict with other Kibanas than if it only claimed up to its actual capacity. And adding more Kibanas may make things worse.
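To make the concern concrete, here's a loose sketch (assuming the v7 elasticsearch-js client; this is not Task Manager's actual claim query or script, and the field names are simplified): claiming is effectively an update-by-query with conflicts: 'proceed', so losing a race to another Kibana shows up as a version_conflicts count, and over-claiming gives each instance more documents to lose races on.

```ts
import { Client } from '@elastic/elasticsearch';

// Loose sketch of a claim cycle; the query and script are simplified stand-ins.
async function claimCycle(es: Client, ownerId: string, claimSize: number) {
  const { body } = await es.updateByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // don't abort the whole claim when another Kibana wins a doc
    max_docs: claimSize,
    body: {
      query: { term: { 'task.status': 'idle' } },
      script: {
        source: "ctx._source.task.status = 'claiming'; ctx._source.task.ownerId = params.ownerId",
        params: { ownerId },
      },
    },
  });
  // body.updated: tasks this Kibana actually claimed.
  // body.version_conflicts: claims lost to competing instances; over-claiming inflates this.
  return { claimed: body.updated, conflicts: body.version_conflicts };
}
```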

It's not clear how bad this would be for the system, but you could certainly imagine degenerate cases where some Kibanas consistently get starved by other Kibanas.

I think we'll need a decent set of benchmarks in place before we could make a change like this and ensure it doesn't make things worse :-)

@mikecote
Copy link
Contributor Author

++ Also would add complexity when managing timeouts.

After chatting with @pmuellr and @kobelb, I think we're in a good state with Task Manager performance without this change, and we can park the issue until we see a need for it. Closing for now; we can re-open if that changes.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022