Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put the auto calculation of capacity behind a feature flag, for now #195390

Merged

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Oct 8, 2024

In this PR, I'm preparing for the 8.16 release where we'd like to start rolling out the mget task claiming strategy separately from the added concurrency. To accomplish this, we need to put the capacity calculation behind a feature flag that is default to false for now, until we do a second rollout with an increased concurrency. The increased concurrency can be calculated and adjusted based on experiments of clusters setting xpack.task_manager.capacity to a higher value and observe the resource usage.

PR to deploy to Cloud and verify that we always default to 10 normal tasks: #195392

@mikecote mikecote added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v9.0.0 v8.16.0 labels Oct 8, 2024
@mikecote mikecote self-assigned this Oct 8, 2024
@mikecote mikecote marked this pull request as ready for review October 8, 2024 15:18
@mikecote mikecote requested a review from a team as a code owner October 8, 2024 15:18
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote mikecote added the backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) label Oct 8, 2024
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one comment about adding the config to the logger

@@ -286,6 +286,7 @@ export class TaskManagerPlugin
const isServerless = this.initContext.env.packageInfo.buildFlavor === 'serverless';

const defaultCapacity = getDefaultCapacity({
autoCalculateDefaultEchCapacity: this.config.auto_calculate_default_ech_capacity,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we update the logger.info message below with this config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that would be useful, I added it to the log message in this commit: 1deaff2

@@ -204,6 +204,7 @@ export const configSchema = schema.object(
}),
claim_strategy: schema.string({ defaultValue: CLAIM_STRATEGY_UPDATE_BY_QUERY }),
request_timeouts: requestTimeoutsConfig,
auto_calculate_default_ech_capacity: schema.boolean({ defaultValue: false }),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add to the docker allowlist and the cloud allowlist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. The original thinking was to revert this PR by 8.18 and ensure we're happy with the HEAP_TO_CAPACITY_MAP config based on production experiments, and use xpack.task_manager.capacity as the opt-out strategy. But I can see where we could use this as an opt-out mechanism as well. I'll take note to think it through, I'll add it to the dockerfile anyway in this PR 1deaff2, leaving the cloud allow list that we'll need to add if ever we continue with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it in the cloud allow list, it will only be possible to set it with the operator overrides capability. That should be fine if we only need to deal with a few cases.

Also to keep in mind, I believe the cloud allow list stuff is only updated on releases, but not sure. Meaning we may need to wait for a point release to wait for it to go into effect.

Copy link
Contributor Author

@mikecote mikecote Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ I can expand a bit into the two options in 8.18 to rollback the auto calculated capacity:

  1. Customer sets xpack.task_manager.capacity to an explicit value, which will take precedence over the calculated default value. In this case, we can make them set 10 or something else.
  2. If we keep the feature flag, customers can set xpack.task_manager. auto_calculate_default_ech_capacity to false, which means we'll default to 10 normal tasks until they specify otherwise via xpack.task_manager.capacity. It's pretty much the same as asking them to put a capacity of 10 but with the added benefit that we can re-opt them into auto calculating when removing the auto_calculate_default_ech_capacity setting (breaking change).

It feels like option 1 is ok and we can remove this new auto_calculate_default_ech_capacity setting in 8.18 when we no longer need this functionality off by default. I was thinking of this approach as an alternate way of removing the code and adding it back in for 8.18

@mikecote mikecote requested a review from a team as a code owner October 8, 2024 15:42
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

cc @mikecote

@mikecote mikecote enabled auto-merge (squash) October 8, 2024 17:43
@mikecote mikecote merged commit 9c8f689 into elastic:main Oct 8, 2024
39 checks passed
@kibanamachine
Copy link
Contributor

Starting backport for target branches: 8.x

https:/elastic/kibana/actions/runs/11240892056

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 8, 2024
…lastic#195390)

In this PR, I'm preparing for the 8.16 release where we'd like to start
rolling out the `mget` task claiming strategy separately from the added
concurrency. To accomplish this, we need to put the capacity calculation
behind a feature flag that is default to false for now, until we do a
second rollout with an increased concurrency. The increased concurrency
can be calculated and adjusted based on experiments of clusters setting
`xpack.task_manager.capacity` to a higher value and observe the resource
usage.

PR to deploy to Cloud and verify that we always default to 10 normal
tasks: elastic#195392

(cherry picked from commit 9c8f689)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Oct 8, 2024
… now (#195390) (#195486)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Put the auto calculation of capacity behind a feature flag, for now
(#195390)](#195390)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https:/sqren/backport)

<!--BACKPORT [{"author":{"name":"Mike
Côté","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-10-08T17:48:07Z","message":"Put
the auto calculation of capacity behind a feature flag, for now
(#195390)\n\nIn this PR, I'm preparing for the 8.16 release where we'd
like to start\r\nrolling out the `mget` task claiming strategy
separately from the added\r\nconcurrency. To accomplish this, we need to
put the capacity calculation\r\nbehind a feature flag that is default to
false for now, until we do a\r\nsecond rollout with an increased
concurrency. The increased concurrency\r\ncan be calculated and adjusted
based on experiments of clusters
setting\r\n`xpack.task_manager.capacity` to a higher value and observe
the resource\r\nusage.\r\n\r\nPR to deploy to Cloud and verify that we
always default to 10 normal\r\ntasks:
https:/elastic/kibana/pull/195392","sha":"9c8f689aca23ed8b1f560c57a9a660d318375412","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task
Manager","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"Put
the auto calculation of capacity behind a feature flag, for
now","number":195390,"url":"https:/elastic/kibana/pull/195390","mergeCommit":{"message":"Put
the auto calculation of capacity behind a feature flag, for now
(#195390)\n\nIn this PR, I'm preparing for the 8.16 release where we'd
like to start\r\nrolling out the `mget` task claiming strategy
separately from the added\r\nconcurrency. To accomplish this, we need to
put the capacity calculation\r\nbehind a feature flag that is default to
false for now, until we do a\r\nsecond rollout with an increased
concurrency. The increased concurrency\r\ncan be calculated and adjusted
based on experiments of clusters
setting\r\n`xpack.task_manager.capacity` to a higher value and observe
the resource\r\nusage.\r\n\r\nPR to deploy to Cloud and verify that we
always default to 10 normal\r\ntasks:
https:/elastic/kibana/pull/195392","sha":"9c8f689aca23ed8b1f560c57a9a660d318375412"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https:/elastic/kibana/pull/195390","number":195390,"mergeCommit":{"message":"Put
the auto calculation of capacity behind a feature flag, for now
(#195390)\n\nIn this PR, I'm preparing for the 8.16 release where we'd
like to start\r\nrolling out the `mget` task claiming strategy
separately from the added\r\nconcurrency. To accomplish this, we need to
put the capacity calculation\r\nbehind a feature flag that is default to
false for now, until we do a\r\nsecond rollout with an increased
concurrency. The increased concurrency\r\ncan be calculated and adjusted
based on experiments of clusters
setting\r\n`xpack.task_manager.capacity` to a higher value and observe
the resource\r\nusage.\r\n\r\nPR to deploy to Cloud and verify that we
always default to 10 normal\r\ntasks:
https:/elastic/kibana/pull/195392","sha":"9c8f689aca23ed8b1f560c57a9a660d318375412"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Mike Côté <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants