provide caps on task manager configuration settings #56573

Closed
pmuellr opened this issue Feb 1, 2020 · 4 comments · Fixed by #85574
Labels: Feature:Alerting, Team:ResponseOps

Comments

pmuellr (Member) commented Feb 1, 2020

Task manager currently exposes a number of configuration properties that could cause systemic problems if set to extreme values.

Here are the current config properties:

export const configSchema = schema.object({
  enabled: schema.boolean({ defaultValue: true }),
  /* The maximum number of times a task will be attempted before being abandoned as failed */
  max_attempts: schema.number({
    defaultValue: 3,
    min: 1,
  }),
  /* How often, in milliseconds, the task manager will look for more work. */
  poll_interval: schema.number({
    defaultValue: DEFAULT_POLL_INTERVAL,
    min: 100,
  }),
  /* How many requests can Task Manager buffer before it rejects new requests. */
  request_capacity: schema.number({
    // a nice round contrived number, feel free to change as we learn how it behaves
    defaultValue: 1000,
    min: 1,
  }),
  /* The name of the index used to store task information. */
  index: schema.string({
    defaultValue: '.kibana_task_manager',
    validate: val => {
      if (val.toLowerCase() === '.tasks') {
        return `"${val}" is an invalid Kibana Task Manager index, as it is already in use by the ElasticSearch Tasks Manager`;
      }
    },
  }),
  /* The maximum number of tasks that this Kibana instance will run simultaneously. */
  max_workers: schema.number({
    defaultValue: DEFAULT_MAX_WORKERS,
    // disable the task manager rather than trying to specify it with 0 workers
    min: 1,
  }),
});

Issue #56478 specifically calls out max_workers and poll_interval as properties to whitelist on cloud, so these are the ones we should consider capping first, to keep customers from shooting themselves in the foot.

Suggestions:

  • max_workers: should have max: 100
  • poll_interval: currently has min: 100; suggest raising it to 1000 (1 second); both caps are sketched below
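
For illustration, a minimal sketch of what the two suggested caps could look like in the config schema. The default constants are placeholders standing in for the real values referenced above, and this reflects the suggestion in this issue, not the schema that actually shipped:

import { schema } from '@kbn/config-schema';

// Placeholder defaults standing in for the real constants in the Task Manager config.
const DEFAULT_POLL_INTERVAL = 3000;
const DEFAULT_MAX_WORKERS = 10;

// Hypothetical: only the two properties the suggestions touch, with the proposed caps applied.
export const cappedConfigSchema = schema.object({
  /* How often, in milliseconds, the task manager will look for more work. */
  poll_interval: schema.number({
    defaultValue: DEFAULT_POLL_INTERVAL,
    min: 1000, // suggested: raise the floor from 100ms to 1 second
  }),
  /* The maximum number of tasks that this Kibana instance will run simultaneously. */
  max_workers: schema.number({
    defaultValue: DEFAULT_MAX_WORKERS,
    min: 1,
    max: 100, // suggested cap
  }),
});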
pmuellr added the Feature:Alerting and Team:ResponseOps labels Feb 1, 2020
elasticmachine (Contributor) commented

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

gmmorris (Contributor) commented Feb 3, 2020

Not sure why we'd want to raise the min of poll_interval to 1000.
When running perf tests I see a lot of dead time waiting for the interval, as many queries often take far less than a second to run, so allowing this number to be dialed down makes sense to me.

pmuellr (Member, Author) commented Feb 3, 2020

Running the polling 10x/second seems like overkill if it's running at that speed all the time.

Seems like we're eventually going to want some kind of adaptive story. It would be fine to run 10x/second if there's a ton of short-running tasks to process and they need to be processed in a timely fashion, but if you don't have anything scheduled for the next 10 minutes, that's a lot of ES calls all returning empty results.

I've used polling systems that adapt to "idle" situations by polling differently or less often. We could also optimistically take into account the time the next task actually has to run. If it's an hour from now, we could certainly scale back the amount of polling.

I think we'll get a better sense for whether we need something like this once we instrument TM a bit more (telemetry and/or logging).

It's probably also true that the built-in "lag" from the polling interval provides a bit of "spreading out" for a stampeding herd of new tasks.
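
As a rough illustration of that adaptive idea, here is a minimal sketch that scales the poll delay with the time until the next scheduled task. The function and constants are hypothetical, not part of the actual Task Manager code:

// Hypothetical adaptive poll-delay calculation, for illustration only.
const MIN_POLL_INTERVAL_MS = 1000;  // poll at most once per second when work is due
const MAX_POLL_INTERVAL_MS = 60000; // back off to once a minute when idle

function nextPollDelay(nextTaskRunAt: Date | undefined, now: Date = new Date()): number {
  if (nextTaskRunAt === undefined) {
    // nothing scheduled at all: back off as far as we allow
    return MAX_POLL_INTERVAL_MS;
  }
  const msUntilNextTask = nextTaskRunAt.getTime() - now.getTime();
  // poll soon if work is due (or overdue), otherwise wait roughly until it is,
  // clamped so we never poll faster than MIN or slower than MAX
  return Math.min(MAX_POLL_INTERVAL_MS, Math.max(MIN_POLL_INTERVAL_MS, msUntilNextTask));
}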

mikecote (Contributor) commented Dec 8, 2020

We discussed deprecating values above our limit and then removing support in 8.0: log some sort of warning until 8.0, and create a follow-up issue for the cleanup.
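
A minimal sketch of that "warn now, enforce in 8.0" idea, assuming a generic logger callback; this is not the actual deprecation machinery that was implemented:

// Hypothetical check: log a deprecation warning above the limit,
// but keep honoring the configured value until 8.0.
const MAX_WORKERS_LIMIT = 100;

function warnIfMaxWorkersExceedsLimit(maxWorkers: number, warn: (message: string) => void): number {
  if (maxWorkers > MAX_WORKERS_LIMIT) {
    warn(
      `setting xpack.task_manager.max_workers (${maxWorkers}) greater than ${MAX_WORKERS_LIMIT} ` +
        `is deprecated; the limit will be enforced in 8.0`
    );
  }
  return maxWorkers;
}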

pmuellr self-assigned this Dec 8, 2020
pmuellr added a commit to pmuellr/kibana that referenced this issue Dec 10, 2020
…limit

resolves elastic#56573

In this PR we create a new task manager limit on the config property
`xpack.task_manager.max_workers` of 100, but only log a deprecation
warning if that property exceeds the limit.  We'll enforce the limit
in 8.0.

The rationale is that it's unlikely to be useful to run with
more than some number of workers, due to the amount of simultaneous
work that would end up happening.  In practice, too many workers can
slow things down more than speed them up.

We're setting the limit to 100 for now, but may increase / decrease it
based on further research.
pmuellr added a commit to pmuellr/kibana that referenced this issue Dec 10, 2020
…limit

pmuellr added a commit that referenced this issue Dec 14, 2020
…limit (#85574)

pmuellr added a commit to pmuellr/kibana that referenced this issue Dec 14, 2020
…limit (elastic#85574)

pmuellr added a commit that referenced this issue Dec 14, 2020
…limit (#85574) (#85872)
kobelb added the needs-team label Jan 31, 2022
botelastic bot removed the needs-team label Jan 31, 2022