Cluster autoscaler improvements for AI workloads #5170

asm582 · 2022-09-06T15:19:34Z

Which component are you using?:
Cluster autoscaler component which scales Kubernetes cluster

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

The current cluster autoscaler (CA) reacts to pending pods, which may not work well for AI/HPC workloads. We outline the scenarios below:

CA reacts to pending pods. When a large number of pods are being co-scheduled, it can wait tens of minutes with pending pods before triggering scale-up, stressing the control plane.
When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.
- Example: a user submits 10 AI/HPC jobs (10 * N pods in total). Let's assume that CA, reacting to N pending pods, scales up to 10 nodes. In the worst case only the head nodes of all the jobs get scheduled, and such partially scheduled jobs mean that no user workload makes progress.
CA reactive scaling can hamper cluster scale-up when a user does not have sufficient quota to create pods in a namespace.
CA performs termination based on traditional compute (i.e. CPU and memory) utilization thresholds, which may not work for AI workloads.
- An AI workload that uses GPUs with large GPU memory may show low CPU and memory utilization on the node, which could lead to premature termination.
- Setting the termination threshold to a very low value will cause CA to not evict any nodes.
CA terminates nodes one at a time, so scaling down a cluster that uses tens or hundreds of nodes takes time.
Selection of Machineset(s) to scale cluster is random (or complex via expander) and periodically it causes over-provisioning.
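
The worst case in the gang scheduling example above can be made concrete with a small simulation. This is a hypothetical sketch, not CA or scheduler code: it assumes each node fits exactly one pod, a job only makes progress once all of its pods are running, and the names `placeRoundRobin`, `placeGang`, and `completeJobs` are invented for illustration.

```go
package main

import "fmt"

// placeRoundRobin hands out free node slots one pod at a time, cycling
// through the jobs. It models a scheduler that is unaware of gangs.
// It returns the number of running pods per job.
func placeRoundRobin(jobs, podsPerJob, slots int) []int {
	running := make([]int, jobs)
	for slots > 0 {
		placed := false
		for j := 0; j < jobs && slots > 0; j++ {
			if running[j] < podsPerJob {
				running[j]++
				slots--
				placed = true
			}
		}
		if !placed {
			break // every job is fully placed
		}
	}
	return running
}

// placeGang only hands out slots to a job if the whole job fits.
func placeGang(jobs, podsPerJob, slots int) []int {
	running := make([]int, jobs)
	for j := 0; j < jobs && slots >= podsPerJob; j++ {
		running[j] = podsPerJob
		slots -= podsPerJob
	}
	return running
}

// completeJobs counts the jobs that have all of their pods running.
func completeJobs(running []int, podsPerJob int) int {
	n := 0
	for _, r := range running {
		if r == podsPerJob {
			n++
		}
	}
	return n
}

func main() {
	// 10 jobs, N = 4 pods per job, 10 nodes available after scale-up.
	const jobs, podsPerJob, nodes = 10, 4, 10
	fmt.Println("jobs able to progress, pod-by-pod placement:",
		completeJobs(placeRoundRobin(jobs, podsPerJob, nodes), podsPerJob))
	fmt.Println("jobs able to progress, all-or-nothing placement:",
		completeJobs(placeGang(jobs, podsPerJob, nodes), podsPerJob))
}
```

Under these assumptions, pod-by-pod placement spreads the 10 nodes across all 10 jobs and none of them is complete, while all-or-nothing placement lets two jobs run.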

Describe the solution you'd like.:

We are looking for feedback on a better solution that provides holistic autoscaling for large AI/HPC workloads.

Describe any alternative solutions you've considered.:

We have not encountered any alternative solution yet. Please share if you have a solution to the above shortcomings.

Additional context.:

I had opened the same issue at kubernetes/community#6840, which will be closed.

kerthcet · 2022-09-06T15:27:42Z

/cc

Dingshujie · 2022-09-07T01:09:10Z

/cc

alculquicondor · 2022-09-29T14:27:12Z

/cc

kisieland · 2022-09-30T07:53:14Z

/cc

x13n · 2022-10-07T19:30:46Z

Thanks for creating this issue, I believe this is an important use case Cluster Autoscaler should support. Some of the pain points you're raising require fundamental changes to how CA operates, but some should already work properly with the right setup. Let me go through them one by one:

CA reacts to pending pods, When a large number of pods are being co-scheduled it can wait tens of minutes with pending pods before triggering scale-up, stressing the control plane.

Why does CA wait for tens of minutes in this scenario? CA runs iterations of the main loop using a fixed interval, so I guess this has something to do with the coscheduling plugin holding pods from reaching CA? This plugin isn't really compatible with CA, which is one of the reasons why it wasn't moved to in-tree plugins; see the discussion in kubernetes/kubernetes#105802.

When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.

I think this is one of the reasons behind kubernetes-sigs/kueue. We should make sure it works well with CA.

CA reactive scaling can hamper cluster scale-up when a user does not have sufficient quota to create pods in a namespace.

Yup, no way around that without making CA understand some pods should be grouped.

CA performs termination based on traditional compute (i.e. CPU and memory ) utilization threshold which may not work for AI workload(s).

I don't think this one is true. GPU nodes are evaluated based on GPU utilization, not cpu/memory utilization:

autoscaler/cluster-autoscaler/core/scaledown/eligibility/eligibility.go

Lines 158 to 168 in 4ff4903

    if gpu.NodeHasGpu(context.CloudProvider.GPULabel(), node) {
        threshold, err = c.thresholdGetter.GetScaleDownGpuUtilizationThreshold(context, nodeGroup)
        if err != nil {
            return false, err
        }
    } else {
        threshold, err = c.thresholdGetter.GetScaleDownUtilizationThreshold(context, nodeGroup)
        if err != nil {
            return false, err
        }
    }
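
To illustrate the effect of that branch, here is a minimal self-contained sketch. It is not the CA implementation: `nodeUsage`, `ratio`, and `scaleDownCandidate` are hypothetical names, the 0.5 thresholds are illustrative values, and it assumes that utilization is computed from requests, that a GPU node is judged on GPU utilization only, and that other nodes are judged on the larger of CPU and memory utilization.

```go
package main

import (
	"fmt"
	"math"
)

// nodeUsage is a hypothetical, simplified view of a node:
// requested vs. allocatable amounts per resource.
type nodeUsage struct {
	gpuRequested, gpuAllocatable float64
	cpuRequested, cpuAllocatable float64
	memRequested, memAllocatable float64
}

// ratio returns requested/allocatable, or 0 when nothing is allocatable.
func ratio(requested, allocatable float64) float64 {
	if allocatable <= 0 {
		return 0
	}
	return requested / allocatable
}

// scaleDownCandidate reports whether the node falls below the threshold
// that applies to it. GPU nodes use the GPU threshold and ignore CPU and
// memory (an assumption of this sketch).
func scaleDownCandidate(n nodeUsage, gpuThreshold, cpuMemThreshold float64) bool {
	if n.gpuAllocatable > 0 {
		return ratio(n.gpuRequested, n.gpuAllocatable) < gpuThreshold
	}
	util := math.Max(
		ratio(n.cpuRequested, n.cpuAllocatable),
		ratio(n.memRequested, n.memAllocatable),
	)
	return util < cpuMemThreshold
}

func main() {
	// All 8 GPUs requested, but very little CPU and memory.
	busyGPU := nodeUsage{
		gpuRequested: 8, gpuAllocatable: 8,
		cpuRequested: 2, cpuAllocatable: 64,
		memRequested: 16, memAllocatable: 512,
	}
	fmt.Println("GPU node is a scale-down candidate:",
		scaleDownCandidate(busyGPU, 0.5, 0.5))
}
```

In this sketch the fully requested GPU node is not a scale-down candidate even though its CPU and memory utilization are about 3%.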

CA terminates nodes one at a time, scaling down a cluster that uses tens or hundreds of nodes will take time.

CA drains nodes one at a time, but empty nodes (i.e. containing only daemonsets) are removed in bulk. Parallel drain is WIP though; this is tracked in #5079.

Selection of Machineset(s) to scale cluster is random (or complex via expander) and periodically it causes over-provisioning.

There are multiple expanders to choose from, so yes, this can be complex (except for a managed setting where CA flags are fine-tuned already). However, even the random expander shouldn't cause overprovisioning. If that happens, it could be due to a bug in the cloudprovider-specific code.

I think the most CA-friendly way of addressing the fundamental CA compatibility problem is through some new k8s API representing a group of pods. If CA understood such API, it could trigger a scale up for all of them in a single go (or error out, e.g. due to lack of quota). kubernetes/enhancements#3371 looks promising, but autoscaling support hasn't been fully fleshed out yet.
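
The "single go, or error out" behaviour described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: `podGroup`, `planScaleUp`, and the node budget are hypothetical and do not correspond to any existing Kubernetes or CA API, all pods in a group are assumed identical, and GPUs are assumed to be the only constrained resource.

```go
package main

import (
	"errors"
	"fmt"
)

// podGroup is a hypothetical description of pods that must run together.
type podGroup struct {
	name       string
	pods       int
	gpusPerPod int
}

var errInsufficientBudget = errors.New("node budget too small for the whole group")

// planScaleUp returns the number of nodes needed to fit the whole group,
// or an error if the group cannot be fully placed. It never returns a
// partial plan.
func planScaleUp(g podGroup, gpusPerNode, nodeBudget int) (int, error) {
	if g.pods <= 0 || g.gpusPerPod <= 0 {
		return 0, fmt.Errorf("group %q: pods and gpusPerPod must be positive", g.name)
	}
	podsPerNode := gpusPerNode / g.gpusPerPod
	if podsPerNode <= 0 {
		return 0, fmt.Errorf("group %q: a pod needs %d GPUs but a node offers %d",
			g.name, g.gpusPerPod, gpusPerNode)
	}
	nodes := (g.pods + podsPerNode - 1) / podsPerNode // ceiling division
	if nodes > nodeBudget {
		return 0, fmt.Errorf("group %q needs %d nodes, budget is %d: %w",
			g.name, nodes, nodeBudget, errInsufficientBudget)
	}
	return nodes, nil
}

func main() {
	job := podGroup{name: "training-job", pods: 10, gpusPerPod: 4}

	nodes, err := planScaleUp(job, 8, 10)
	fmt.Println(nodes, err)

	_, err = planScaleUp(job, 8, 3)
	fmt.Println(err)
}
```

The point of the design is that the caller either gets a plan for the entire group or a clear error up front, instead of discovering a partial scale-up after the fact.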

kerthcet · 2022-10-08T04:02:12Z

Some supplements:

Why does CA wait for tens of minutes in this scenario? CA runs iterations of the main loop using a fixed interval, so I guess this has something to do with coscheduling plugin holding pods from reaching CA? This plugin isn't really compatible with CA, which is one of the reasons why it wasn't moved to in-tree plugins, see discussion in kubernetes/kubernetes#105802

We just proposed a KEP, kubernetes/enhancements#3521, to make scheduling switchable; I think it would also help.

When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.

This also looks like a gang-scheduling capability. We have the coscheduling plugin, and a new pod group API is being designed; see kubernetes/enhancements#3371.

I think this is one of the reasons behind kubernetes-sigs/kueue. We should make sure it works well with CA.

And yes, k-sigs/kueue is also helpful for job queueing with limited resources. Close integration with autoscaling is one of our goals.

Luke-Smartnews · 2022-10-17T03:15:11Z

We had the same issue. In our case, users deploy more than 2k Spark pods on k8s at the same time, and it takes CA about 20 minutes to scale out.
I think the issue is that CA only uses the ownerReference UID to group pods, and in our case the Spark pods don't share the same owner UID.

As a workaround, we use another UID taken from the pod labels.
I hope CA can support custom labels that identify an extra ownerReference.
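
A minimal sketch of the label-based grouping idea follows. It is not the workaround from the linked fork and not CA code: the `pod` type, the functions `groupKey` and `groupPods`, and the label name `example.com/group-id` are all hypothetical, chosen only to show how a configurable label could take precedence over the owner UID when grouping pods.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down pod: just what grouping needs.
type pod struct {
	name     string
	ownerUID string
	labels   map[string]string
}

// groupKey picks the grouping key for a pod. If groupLabel is set and the
// pod carries it, the label value wins; otherwise the owner UID is used;
// pods with neither form a group of their own.
func groupKey(p pod, groupLabel string) string {
	if groupLabel != "" {
		if v := p.labels[groupLabel]; v != "" {
			return "label:" + v
		}
	}
	if p.ownerUID != "" {
		return "owner:" + p.ownerUID
	}
	return "pod:" + p.name
}

// groupPods buckets pod names by their grouping key.
func groupPods(pods []pod, groupLabel string) map[string][]string {
	groups := map[string][]string{}
	for _, p := range pods {
		k := groupKey(p, groupLabel)
		groups[k] = append(groups[k], p.name)
	}
	return groups
}

func main() {
	// Three pods with different owners but the same (hypothetical) group label.
	pods := []pod{
		{name: "exec-1", ownerUID: "uid-a", labels: map[string]string{"example.com/group-id": "batch-42"}},
		{name: "exec-2", ownerUID: "uid-b", labels: map[string]string{"example.com/group-id": "batch-42"}},
		{name: "exec-3", ownerUID: "uid-c", labels: map[string]string{"example.com/group-id": "batch-42"}},
	}
	fmt.Println("groups by owner UID only:", len(groupPods(pods, "")))
	fmt.Println("groups with label override:", len(groupPods(pods, "example.com/group-id")))
}
```

With owner UIDs alone the three pods land in three groups; with the label override they collapse into one, which is the effect the workaround is after.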

denkensk · 2023-01-13T09:40:38Z

/cc

x13n · 2023-01-16T14:58:36Z

I think instead of custom labels, CA should just use a hash of all relevant Pod fields. Today there's a comparison function:

autoscaler/cluster-autoscaler/utils/utils.go

Lines 64 to 101 in e8d3e9b

    func PodSpecSemanticallyEqual(p1 apiv1.PodSpec, p2 apiv1.PodSpec) bool {
        p1Spec := sanitizePodSpec(p1)
        p2Spec := sanitizePodSpec(p2)
        return apiequality.Semantic.DeepEqual(p1Spec, p2Spec)
    }

    func sanitizePodSpec(podSpec apiv1.PodSpec) apiv1.PodSpec {
        dropProjectedVolumesAndMounts(&podSpec)
        dropHostname(&podSpec)
        return podSpec
    }

    func dropProjectedVolumesAndMounts(podSpec *apiv1.PodSpec) {
        projectedVolumeNames := map[string]bool{}
        var volumes []apiv1.Volume
        for _, v := range podSpec.Volumes {
            if v.Projected == nil {
                volumes = append(volumes, v)
            } else {
                projectedVolumeNames[v.Name] = true
            }
        }
        podSpec.Volumes = volumes

        for i := range podSpec.Containers {
            var volumeMounts []apiv1.VolumeMount
            for _, mount := range podSpec.Containers[i].VolumeMounts {
                if ok := projectedVolumeNames[mount.Name]; !ok {
                    volumeMounts = append(volumeMounts, mount)
                }
            }
            podSpec.Containers[i].VolumeMounts = volumeMounts
        }
    }

    func dropHostname(podSpec *apiv1.PodSpec) {
        podSpec.Hostname = ""
    }

It should be quite straightforward to calculate a hash in a similar manner instead of relying on owner refs as hints for similarity. That'd also have an extra optimization benefit of considering identical pods as similar even if they belong to completely different controllers.
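
As a rough sketch of what such a hash could look like: the types below are hypothetical, heavily trimmed stand-ins for `apiv1.PodSpec` so that the example runs with the standard library only, and `sanitize` and `specHash` are invented names. The sanitization mirrors the comparison function above (drop the hostname, drop projected volumes and their mounts), and the hash is a SHA-256 over the JSON encoding of what remains.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Hypothetical stand-ins for the real pod spec types.
type volume struct {
	Name      string
	Projected bool
}

type container struct {
	Image        string
	CPUMilli     int64
	VolumeMounts []string // names of mounted volumes
}

type podSpec struct {
	Hostname     string
	NodeSelector map[string]string
	Volumes      []volume
	Containers   []container
}

// sanitize returns a copy without the per-pod fields: the hostname and
// any projected volumes together with their mounts.
func sanitize(s podSpec) podSpec {
	var out podSpec // Hostname intentionally left empty
	if len(s.NodeSelector) > 0 {
		out.NodeSelector = s.NodeSelector
	}
	projected := map[string]bool{}
	for _, v := range s.Volumes {
		if v.Projected {
			projected[v.Name] = true
		} else {
			out.Volumes = append(out.Volumes, v)
		}
	}
	for _, c := range s.Containers {
		kept := container{Image: c.Image, CPUMilli: c.CPUMilli}
		for _, m := range c.VolumeMounts {
			if !projected[m] {
				kept.VolumeMounts = append(kept.VolumeMounts, m)
			}
		}
		out.Containers = append(out.Containers, kept)
	}
	return out
}

// specHash hashes the sanitized spec. encoding/json writes map keys in
// sorted order, so equal sanitized specs yield equal hashes.
func specHash(s podSpec) (string, error) {
	b, err := json.Marshal(sanitize(s))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := podSpec{
		Hostname:   "exec-1",
		Volumes:    []volume{{Name: "token-abc", Projected: true}},
		Containers: []container{{Image: "worker:1", CPUMilli: 1000, VolumeMounts: []string{"token-abc"}}},
	}
	b := a
	b.Hostname = "exec-2"
	ha, _ := specHash(a)
	hb, _ := specHash(b)
	fmt.Println("same group:", ha == hb)
}
```

Pods could then be bucketed by this hash regardless of their controller. One caveat of hashing a serialized form: values that a semantic comparison treats as equal but that serialize differently would end up with different hashes unless they are normalized first.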

k8s-triage-robot · 2023-04-16T15:04:27Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alculquicondor · 2023-04-18T08:02:01Z

/remove-lifecycle stale

k8s-triage-robot · 2023-07-17T08:41:34Z

/lifecycle stale

kerthcet · 2023-07-17T10:15:11Z

/remove-lifecycle stale

k8s-triage-robot · 2024-01-24T16:03:08Z

/lifecycle stale

kerthcet · 2024-01-25T02:23:09Z

/remove-lifecycle stale

k8s-triage-robot · 2024-06-19T13:42:53Z

/lifecycle stale

k8s-triage-robot · 2024-07-19T14:35:42Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-08-18T14:40:19Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-08-18T14:40:24Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

asm582 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 6, 2022

asm582 mentioned this issue Sep 6, 2022

Cluster autoscaler improvements for AI workloads kubernetes/community#6840

Closed

jbartosik added the area/cluster-autoscaler label Sep 28, 2022

Luke-Smartnews mentioned this issue Oct 17, 2022

[workaround] add owner for spark jobs smartnews/k8s-autoscaler#3

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2023

qianlei90 mentioned this issue Jun 13, 2023

Introduce AEP with Provisioning Request CRD #5848

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2023

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2023

asm582 mentioned this issue Nov 21, 2023

REQUEST: New membership for asm582 kubernetes/org#4594

Closed

9 tasks

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024

towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 18, 2024
