
OCPEDGE-808: feat: add ep for cpu limits with workload partitioning #1546

Merged: 4 commits merged into openshift:master on Mar 21, 2024

Conversation

@eggfoobar (Contributor) commented Jan 24, 2024

Workload partitioning currently skips modifying pods that set limits.cpu. With this proposed change we are looking to add support for limits in CRI-O and the Management Admission Webhook, allowing the CPUQuota to be passed through the existing mechanism.

More Info:
OCPEDGE-57
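
As background for the mechanism described above, the sketch below shows how a limits.cpu value in milliCPU maps to a CFS CPUQuota, assuming the standard 100ms CFS period; the function and constant names are illustrative, not the actual webhook or CRI-O code.

```go
// Minimal sketch, not the actual CRI-O/webhook code: how a limits.cpu
// value (in milliCPU) maps to a CFS CPUQuota, assuming the standard
// 100ms CFS period. Names here are illustrative.
package main

import "fmt"

const defaultCFSPeriodUs int64 = 100000 // 100ms period, in microseconds

// milliCPUToQuota mirrors the usual Kubernetes conversion:
// quota(us) = milliCPU * period(us) / 1000.
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
	if milliCPU == 0 {
		return 0 // no limit set; quota left unset
	}
	return milliCPU * periodUs / 1000
}

func main() {
	// e.g. limits.cpu: 250m -> 25000us of runtime per 100000us period
	fmt.Println(milliCPUToQuota(250, defaultCFSPeriodUs))
}
```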

@openshift-ci-robot commented Jan 24, 2024

@eggfoobar: This pull request references OCPEDGE-808 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

> Workload partitioning currently skips modifying pods that set limits.cpu. With this proposed change we are looking to add support for limits in CRI-O and the Management Admission Webhook, allowing the CPUQuota to be passed through the existing mechanism.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added the jira/valid-reference label on Jan 24, 2024
@eggfoobar force-pushed the workload-partitioning-cpu-limits branch from 34248c1 to e8f8d0a on January 24, 2024 at 09:20
@openshift-ci-robot commented Jan 24, 2024

@eggfoobar: This pull request references OCPEDGE-808 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

> Workload partitioning currently skips modifying pods that set limits.cpu. With this proposed change we are looking to add support for limits in CRI-O and the Management Admission Webhook, allowing the CPUQuota to be passed through the existing mechanism.
>
> More Info:
> OCPEDGE-57

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Tal-or (Contributor) commented Jan 24, 2024

@eggfoobar how should this be incorporated with the "disable-cpu-quota" annotation? Or do we not expect infra containers to have this?

@eggfoobar (Contributor, Author) commented:

> @eggfoobar how should this be incorporated with the "disable-cpu-quota" annotation? Or do we not expect infra containers to have this?

That's a great callout; I'll update the doc to reference that. As it stands, I don't think we will have platform pods using that feature. @haircommander, please keep me honest, but to my knowledge the code that triggers it falls under high performance and requires Guaranteed QoS, and we will not alter Guaranteed QoS pods. We could also simply deny pods that try to combine workload partitioning with limits and disabled CPU quota, but I think we should only do that if the workload configs end up superseding the performance handler configs.

@browsell commented:

> @eggfoobar how should this be incorporated with the "disable-cpu-quota" annotation? Or do we not expect infra containers to have this?

> That's a great callout; I'll update the doc to reference that. As it stands, I don't think we will have platform pods using that feature. @haircommander, please keep me honest, but to my knowledge the code that triggers it falls under high performance and requires Guaranteed QoS, and we will not alter Guaranteed QoS pods. We could also simply deny pods that try to combine workload partitioning with limits and disabled CPU quota, but I think we should only do that if the workload configs end up superseding the performance handler configs.

The "disable-cpu-quota" is only relevant for guaranteed QoS pods which would be excluded by this feature.

- Update Mutating Spec call to modify CPU Quota
- Update Cgroup Manager to set the CPUQuota for workload partitioned containers
- Default value of 0 will be assumed for
A reviewer (Member) commented on the diff:

Should we replace the cpu quota disable annotation with this behavior? If not, how do we reconcile between the two?

@eggfoobar (Contributor, Author) replied:

Good question. As I understand it, the cpu quota annotation requires Guaranteed pods; as it stands, that's not something we alter, so it seems it can be kept and treated differently, unless I'm misunderstanding how that annotation gets used.
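
For reference, a rough sketch of what applying such a CPUQuota could look like on the runtime side with cgroup v2, in the spirit of the "Update Cgroup Manager" commit bullet above; this is not CRI-O's implementation, and treating a quota of 0 as unlimited is an assumption made for illustration.

```go
// Rough sketch only, not CRI-O's cgroup manager: applying a CPU quota
// for a workload-partitioned container by writing "quota period" to the
// cgroup v2 cpu.max file. Treating a quota of 0 as "no limit" is an
// assumption made for illustration.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const cfsPeriodUs = 100000 // 100ms CFS period, in microseconds

func applyCPUMax(cgroupDir string, quotaUs int64) error {
	val := "max" // cgroup v2 spelling for "unlimited"
	if quotaUs > 0 {
		val = fmt.Sprintf("%d", quotaUs)
	}
	entry := fmt.Sprintf("%s %d\n", val, cfsPeriodUs)
	return os.WriteFile(filepath.Join(cgroupDir, "cpu.max"), []byte(entry), 0o644)
}

func main() {
	// e.g. a 250m limit -> quota of 25000us per 100000us period
	if err := applyCPUMax("/sys/fs/cgroup/example.scope", 25000); err != nil {
		fmt.Println("write failed:", err)
	}
}
```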

@dhellmann (Contributor) commented:

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

updated approach to pass down CPU limit to the container runtime
updated goals to make mention of the crio disable-cpu-quota annotation

Signed-off-by: ehila <[email protected]>
@eggfoobar force-pushed the workload-partitioning-cpu-limits branch from e8f8d0a to cd8deac on February 28, 2024 at 06:39
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)
@rphillips (Contributor) commented on the diff:

Do we have an OpenShift FeatureGate defined yet? We probably need one.

@eggfoobar (Contributor, Author) replied:

Oh, apologies @rphillips, that must have come along with the re-templating I did. I don't think those are relevant for this feature, since this is extending the existing feature. The core feature is already GA; all we are doing in this EP is allowing it to correctly apply limits. What do you think?

openshift-ci bot commented Mar 12, 2024

@eggfoobar: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mrunalp (Member) commented Mar 21, 2024

We should continue to be judicious in any use of cpu limits. I am fine approving as long as we ensure that existing functionality does not regress and we add enough test coverage.

/lgtm
/approve

openshift-ci bot added the lgtm label on Mar 21, 2024
openshift-ci bot commented Mar 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label on Mar 21, 2024
@mrunalp merged commit bae1644 into openshift:master on Mar 21, 2024
2 checks passed
@eggfoobar (Contributor, Author) commented:

Agreed, @mrunalp. The origin tests will be updated to make sure this change and the previous feature are fully covered.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
8 participants