cluster-autoscaler does not support custom scheduling config #4518

Closed
ialidzhikov opened this issue Dec 13, 2021 · 9 comments
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ialidzhikov
Contributor

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Not applicable

What k8s version are you using (kubectl version)?:

v1.22.2

What environment is this in?:

Gardener

What did you expect to happen?:

cluster-autoscaler to support a configurable custom scheduling config (or to rework the existing mechanism built around the simulator and its hard-coded default scheduling config).

What happened instead?:

The default scheduling algorithm in kube-scheduler is to spread Pods across Nodes. To improve the utilization of our Nodes, we would like to run kube-scheduler with a custom configuration that improves Node utilization by selecting the most allocated Node.
However, with #4517 (and also from reading the code and existing issues) we see that cluster-autoscaler internally vendors the kube-scheduler packages and runs a simulation to determine whether a Pod can be scheduled. The simulator uses the default scheduling config, and currently there is no way to run the autoscaler with a custom scheduling config. As you may already guess, discrepancies and issues arise when kube-scheduler and cluster-autoscaler run with different scheduling configs - we can easily end up in a situation where a Pod is unschedulable according to kube-scheduler but schedulable according to cluster-autoscaler -> cluster-autoscaler refuses to scale up the Node count.
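
For reference, this is the kind of config we have in mind - a minimal sketch, assuming the v1beta2 KubeSchedulerConfiguration API available in Kubernetes 1.22, with the NodeResourcesFit plugin switched to the MostAllocated scoring strategy:

```yaml
# Sketch only: bias kube-scheduler towards the most allocated Node.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```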

How to reproduce it (as minimally and precisely as possible):

  1. Run kube-scheduler with a custom scheduling config

  2. Observe the issues described above, because cluster-autoscaler keeps using the default scheduling config

@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2021
@MaciekPytel
Contributor

The specific use-case (changing scheduler preferences regarding node utilization) should work fine with CA as is. Cluster Autoscaler only runs scheduler Filters in simulation; it completely ignores Scores. In other words, CA only simulates "hard" scheduling requirements (e.g. whether a node has enough resources, requiredDuringScheduling affinities) and completely ignores any preferences (e.g. more/less utilized nodes, preferredDuringScheduling affinities, pod topology spreading with ScheduleAnyway set).
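
To illustrate the distinction with a made-up Pod spec (not from this issue): the required rule below is enforced by a Filter, so CA simulates it; the preferred rule only contributes to Scores, so CA never evaluates it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Hard requirement - evaluated by the NodeAffinity Filter, so CA simulates it.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a"]
      # Soft preference - only affects Scores, so CA ignores it.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m5.xlarge"]
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
```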

The change you described above only changes scheduler preferences, which CA doesn't take into account anyway, so it shouldn't conflict with CA.

Supporting scheduler config is a fairly significant feature request. It's also unclear how useful it would be, given that:

  1. Only Filters config is relevant to CA.
  2. Adding any custom Filter that is not part of k8s codebase would require recompiling CA anyway.
  3. I'm not aware of any common use-case for tweaking config of default Filters.

Conceptually, this feature makes sense and we'd be happy to accept a contribution, but given the above I don't think it's a high priority for us.

Finally, nit: this should not be kind/bug. CA explicitly only supports the default scheduler and doesn't have any feature that would allow customizing the scheduler config. Lack of a feature is not a bug.

@MaciekPytel MaciekPytel added area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 13, 2021
@ialidzhikov
Contributor Author

Thanks for the reply @MaciekPytel. Happy to see that, in theory, a custom scheduling config regarding node utilization should work fine with cluster-autoscaler. Let me try it out.

@t0rr3sp3dr0

@MaciekPytel, would it be possible to make CA not ignore the scheduling preferences you listed? I'm expecting a problem related to it when using topologySpreadConstraints with ScheduleAnyway.

I have a cluster on AWS with node groups in two AZs, and my deployments use topologySpreadConstraints with ScheduleAnyway to keep a balanced number of replicas between AZs. It needs to be ScheduleAnyway so that, in case an AZ goes down, the pods get rescheduled to the other AZ, keeping the total number of replicas of the service.
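
Roughly this shape (an illustrative sketch; the names and replica count are made up):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service        # hypothetical
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          # Soft constraint: if one AZ is down, pods are still scheduled in the other.
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-service
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9   # placeholder image
```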

The problem is that CA scales down nodes without considering the pods' ScheduleAnyway constraints, so one AZ ends up with more nodes and service replicas than the other. Sometimes all replicas of a deployment end up in a single AZ due to this behavior. In case of an AZ outage, these services would have downtime.

@t0rr3sp3dr0

I've oversimplified the description of my setup in my last message, but I think it's enough to understand the problem. Anyway, I'll give you some extra detail here.

To ensure I always have space available on nodes in both AZs for the scheduler to assign my pods to, I have an overprovisioning cronjob that runs periodically and creates a pod that completes instantly but requests all the allocatable space of the node type in that node group. This effectively makes CA scale up the node group whenever I don't have an empty node in it. With that, pods with topologySpreadConstraints set to ScheduleAnyway can be assigned to the preferred AZ, up to the space available on that spare node.
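
Something along these lines (a hypothetical sketch; the schedule and the resource requests, which would be sized to the node type's allocatable capacity, are made-up values):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: overprovisioning   # hypothetical
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: reserve
              image: busybox:1.36
              command: ["true"]   # exits immediately; the large request alone forces a scale-up if no node has room
              resources:
                requests:
                  cpu: "3800m"    # ~allocatable CPU of the node type (assumption)
                  memory: "15Gi"  # ~allocatable memory of the node type (assumption)
```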

It's possible that the overprovisioning didn't run in time or the scale-up took too long, and replicas are now unbalanced between AZs. To fix that, I use kubernetes-sigs/descheduler to evict pods violating the topologySpreadConstraints, even the ones with ScheduleAnyway. Together with the overprovisioning cronjob, this eventually rebalances all replicas between AZs.
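
The descheduler policy I use looks roughly like this (sketch only; assumes the v1alpha1 DeschedulerPolicy API and its includeSoftConstraints parameter):

```yaml
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemovePodsViolatingTopologySpreadConstraint:
    enabled: true
    params:
      # Also evict violators of ScheduleAnyway (soft) constraints, not only DoNotSchedule.
      includeSoftConstraints: true
```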

Then comes the CA problem. It looks at the cluster state without considering scheduling preferences and concludes there are too many nodes in the cluster and that it can reallocate pods to reduce the total number of nodes. It performs the scale-down, and now the replicas are unbalanced again.

Now overprovisioning and descheduler start to fight against CA, causing an infinite loop of scale-ups and scale-downs in the cluster. A high number of evictions starts to happen, degrading the performance of services caught in this reallocation battle.

If we could somehow tell CA to consider the soft scheduling constraints for scale-downs, it would solve this problem.

@MaciekPytel
Contributor

Unfortunately, it's not a simple switch we can flip. We originally decided to only run Filters() because pod preferences just don't fit into the first-fit bin-packing algorithm CA uses for scale-up. Also, CA runs a lot of scheduler simulations, and it's hard enough to get CA to work in large clusters running just the Filters(); adding Scores() would significantly increase the amount of computation required.

I still don't know how to fix either of those issues, and even if I did, we'd have to rewrite a lot of CA to support Scores(). So I think it's very unlikely we'll ever do this. The best suggestion I have for your use-case is to implement your own version of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodes/types.go#L39. That interface is meant as an extension point for customizing CA behavior, and it allows you to choose the order in which nodes will be scaled down.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
