Cluster autoscaler improvements for AI workloads #5170

Closed
asm582 opened this issue Sep 6, 2022 · 19 comments
Labels
area/cluster-autoscaler area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@asm582
Member

asm582 commented Sep 6, 2022

Which component are you using?:
Cluster autoscaler component which scales Kubernetes cluster

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

The current cluster autoscaler (CA) reacts to pending pods, which may not work well for AI/HPC workloads. We outline the scenarios below:

  • CA reacts to pending pods. When a large number of pods are being co-scheduled, it can wait tens of minutes with pending pods before triggering scale-up, stressing the control plane.
  • When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.
    • Example: a user submits 10 AI/HPC jobs (10 * N pods in total). Let's assume that CA, based on N pending pods, scales the cluster to 10 nodes, and that in the worst case only the head nodes of all the jobs get scheduled. Such partially scheduled jobs mean that no user workload makes progress.
  • CA reactive scaling can hamper cluster scale-up when a user does not have sufficient quota to create pods in a namespace.
  • CA performs termination based on traditional compute (i.e. CPU and memory) utilization thresholds, which may not work for AI workload(s).
    • An AI workload that uses GPUs with large GPU memory may show low CPU and memory utilization on the node, which could lead to premature termination.
    • Setting the termination threshold to a very low value will cause CA to not evict any nodes.
  • CA terminates nodes one at a time, so scaling down a cluster that uses tens or hundreds of nodes will take time.
  • Selection of Machineset(s) to scale cluster is random (or complex via expander) and periodically it causes over-provisioning.
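
The gang-scheduling scenario in the list above can be made concrete with a small simulation. This is only a sketch, not CA or scheduler code: `placeRoundRobin`, `placeGangAware` and `runnableJobs` are hypothetical helpers, and the sketch assumes that each pod takes exactly one node and that a job progresses only when all of its pods are placed.

```go
package main

import "fmt"

// placeRoundRobin hands out free slots one pod at a time across jobs,
// mimicking placement that has no notion of gangs (assumption for this sketch).
// It returns the number of pods placed per job.
func placeRoundRobin(jobs, podsPerJob, slots int) []int {
	placed := make([]int, jobs)
	for slots > 0 {
		progress := false
		for j := 0; j < jobs && slots > 0; j++ {
			if placed[j] < podsPerJob {
				placed[j]++
				slots--
				progress = true
			}
		}
		if !progress {
			break // every job is already fully placed
		}
	}
	return placed
}

// placeGangAware admits a job only if all of its pods fit in the remaining slots.
func placeGangAware(jobs, podsPerJob, slots int) []int {
	placed := make([]int, jobs)
	for j := 0; j < jobs && slots >= podsPerJob; j++ {
		placed[j] = podsPerJob
		slots -= podsPerJob
	}
	return placed
}

// runnableJobs counts the jobs whose pods are all placed.
func runnableJobs(placed []int, podsPerJob int) int {
	n := 0
	for _, p := range placed {
		if p == podsPerJob {
			n++
		}
	}
	return n
}

func main() {
	const jobs, podsPerJob, slots = 10, 5, 10
	rr := placeRoundRobin(jobs, podsPerJob, slots)
	gang := placeGangAware(jobs, podsPerJob, slots)
	fmt.Println("round-robin runnable jobs:", runnableJobs(rr, podsPerJob))
	fmt.Println("gang-aware runnable jobs:", runnableJobs(gang, podsPerJob))
}
```

With 10 jobs of 5 pods each and 10 single-pod nodes, the round-robin placement leaves every job with one pod and none runnable, while the gang-aware placement runs two complete jobs on the same capacity.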

Describe the solution you'd like.:

Looking for feedback on a better solution to have holistic autoscaling for large AI/HPC workloads

Describe any alternative solutions you've considered.:

We have not encountered any alternative solution yet. Please share if you have a solution to the above shortcomings.

Additional context.:

I had opened the same issue here: kubernetes/community#6840 which will be closed.

@asm582 asm582 added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 6, 2022
@kerthcet
Member

kerthcet commented Sep 6, 2022

/cc

@Dingshujie
Member

/cc

@alculquicondor
Member

/cc

@kisieland
Contributor

/cc

@x13n
Member

x13n commented Oct 7, 2022

Thanks for creating this issue, I believe this is an important use case Cluster Autoscaler should support. Some of the pain points you're raising require fundamental changes to how CA operates, but some should already work properly with the right setup. Let me go through them one by one:

  • CA reacts to pending pods. When a large number of pods are being co-scheduled, it can wait tens of minutes with pending pods before triggering scale-up, stressing the control plane.

Why does CA wait for tens of minutes in this scenario? CA runs iterations of the main loop using a fixed interval, so I guess this has something to do with coscheduling plugin holding pods from reaching CA? This plugin isn't really compatible with CA, which is one of the reasons why it wasn't moved to in-tree plugins, see discussion in kubernetes/kubernetes#105802

  • When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.

I think this is one of the reasons behind kubernetes-sigs/kueue. We should make sure it works well with CA.

  • CA reactive scaling can hamper cluster scale-up when a user does not have sufficient quota to create pods in a namespace.

Yup, no way around that without making CA understand some pods should be grouped.

  • CA performs termination based on traditional compute (i.e. CPU and memory) utilization thresholds, which may not work for AI workload(s).

I don't think this one is true. GPU nodes are evaluated based on GPU utilization, not cpu/memory utilization:

    if gpu.NodeHasGpu(context.CloudProvider.GPULabel(), node) {
        threshold, err = c.thresholdGetter.GetScaleDownGpuUtilizationThreshold(context, nodeGroup)
        if err != nil {
            return false, err
        }
    } else {
        threshold, err = c.thresholdGetter.GetScaleDownUtilizationThreshold(context, nodeGroup)
        if err != nil {
            return false, err
        }
    }

  • CA terminates nodes one at a time, so scaling down a cluster that uses tens or hundreds of nodes will take time.

CA drains nodes one at a time, but empty nodes (i.e. containing only daemonsets) are removed in bulk. Parallel drain is WIP, though; this is tracked in #5079.

  • Selection of Machineset(s) to scale cluster is random (or complex via expander) and periodically it causes over-provisioning.

There are multiple expanders to choose from, so yes, this can be complex (except for a managed setting where CA flags are fine-tuned already). However, even the random expander shouldn't cause overprovisioning. If that happens, it could be due to a bug in the cloudprovider-specific code.

I think the most CA-friendly way of addressing the fundamental CA compatibility problem is through some new k8s API representing a group of pods. If CA understood such an API, it could trigger a scale-up for all of them in one go (or error out, e.g. due to lack of quota). kubernetes/enhancements#3371 looks promising, but autoscaling support hasn't been fully fleshed out yet.
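
As a rough illustration of the "scale up for the whole group in one go, or error out" idea, the sketch below sizes a scale-up for an entire group at once. Everything here is hypothetical: `podGroup` is a stand-in for whatever a pod-group API would expose, `maxNewNodes` stands for any remaining budget (quota, node group max size), and a real implementation would have to run scheduling simulation rather than simple division.

```go
package main

import (
	"errors"
	"fmt"
)

// podGroup is a hypothetical stand-in for a pod-group API object.
type podGroup struct {
	name        string
	pods        int // number of pods that must run together
	podsPerNode int // how many of these pods fit on one new node
}

// errInsufficientCapacity is returned when the whole group cannot fit.
var errInsufficientCapacity = errors.New("group does not fit within the remaining node budget")

// nodesForGroup returns how many nodes to add so that the whole group fits,
// or an error if that would exceed maxNewNodes. It never proposes a partial scale-up.
func nodesForGroup(g podGroup, maxNewNodes int) (int, error) {
	if g.pods <= 0 || g.podsPerNode <= 0 {
		return 0, fmt.Errorf("invalid group %q", g.name)
	}
	needed := (g.pods + g.podsPerNode - 1) / g.podsPerNode // ceiling division
	if needed > maxNewNodes {
		return 0, errInsufficientCapacity
	}
	return needed, nil
}

func main() {
	g := podGroup{name: "training-job", pods: 10, podsPerNode: 4}
	for _, budget := range []int{5, 2} {
		n, err := nodesForGroup(g, budget)
		if err != nil {
			fmt.Printf("budget %d: no scale-up: %v\n", budget, err)
			continue
		}
		fmt.Printf("budget %d: add %d nodes\n", budget, n)
	}
}
```

The point of the design is that the decision is made per group: either all required nodes are requested together, or the caller gets an error it can surface to the user instead of a partial scale-up.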

@kerthcet
Member

kerthcet commented Oct 8, 2022

Some supplements:

Why does CA wait for tens of minutes in this scenario? CA runs iterations of the main loop using a fixed interval, so I guess this has something to do with coscheduling plugin holding pods from reaching CA? This plugin isn't really compatible with CA, which is one of the reasons why it wasn't moved to in-tree plugins, see discussion in kubernetes/kubernetes#105802

We just proposed a KEP kubernetes/enhancements#3521 to make scheduling switchable, I think it also helps.

When multiple AI/HPC workloads are submitted to the cluster, none of the jobs may progress even after scale-up by CA due to its inability to understand gang scheduling.

This also looks like a gang-scheduling capability. We have the coscheduling plugin, and a new pod group API is being designed; see kubernetes/enhancements#3371.

I think this is one of the reasons behind kubernetes-sigs/kueue. We should make sure it works well with CA.

And yes, k-sigs/kueue is also helpful for job queueing with limited resources. Close integration with autoscaling is one of our goals.

@Luke-Smartnews

We had the same issue. In our case, users deploy more than 2k Spark pods on k8s at the same time, and CA takes about 20 minutes to scale out.
I think the issue is that CA only uses the ownerReference UID to group pods. In our case the Spark pods don't have the same owner UID.

As a workaround, we use another UID taken from the pod labels.
I hope CA can support custom labels to identify an extra ownerReference.
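
The workaround described above might look roughly like the following. This is a sketch under assumptions, not the actual patch or CA code: `pod` is a simplified hypothetical type, and the label key `example.com/workload-uid` is invented for illustration.

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod representation.
type pod struct {
	name     string
	ownerUID string // empty when the pod has no controller owner reference
	labels   map[string]string
}

// groupKey returns the key used to bucket similar pods: the value of the
// configured label when present, otherwise the owner UID, otherwise the
// pod's own name (so the pod forms a group of one).
func groupKey(p pod, labelKey string) string {
	if v, ok := p.labels[labelKey]; ok && v != "" {
		return "label:" + v
	}
	if p.ownerUID != "" {
		return "owner:" + p.ownerUID
	}
	return "pod:" + p.name
}

// groupPods buckets pod names by their group key.
func groupPods(pods []pod, labelKey string) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pods {
		k := groupKey(p, labelKey)
		groups[k] = append(groups[k], p.name)
	}
	return groups
}

func main() {
	const key = "example.com/workload-uid" // hypothetical label key
	pods := []pod{
		{name: "exec-1", ownerUID: "uid-a", labels: map[string]string{key: "app-42"}},
		{name: "exec-2", ownerUID: "uid-b", labels: map[string]string{key: "app-42"}},
		{name: "exec-3", ownerUID: "uid-c", labels: map[string]string{key: "app-42"}},
	}
	fmt.Println("groups by owner UID only:", len(groupPods(pods, "")))
	fmt.Println("groups with label fallback:", len(groupPods(pods, key)))
}
```

With owner UIDs alone, the three pods land in three separate groups; with the label taken into account they collapse into one, which is the effect the workaround is after.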

@denkensk
Member

/cc

@x13n
Member

x13n commented Jan 16, 2023

I think instead of custom labels, CA should just use a hash of all relevant Pod fields. Today there's a comparison function:

    func PodSpecSemanticallyEqual(p1 apiv1.PodSpec, p2 apiv1.PodSpec) bool {
        p1Spec := sanitizePodSpec(p1)
        p2Spec := sanitizePodSpec(p2)
        return apiequality.Semantic.DeepEqual(p1Spec, p2Spec)
    }

    func sanitizePodSpec(podSpec apiv1.PodSpec) apiv1.PodSpec {
        dropProjectedVolumesAndMounts(&podSpec)
        dropHostname(&podSpec)
        return podSpec
    }

    func dropProjectedVolumesAndMounts(podSpec *apiv1.PodSpec) {
        projectedVolumeNames := map[string]bool{}
        var volumes []apiv1.Volume
        for _, v := range podSpec.Volumes {
            if v.Projected == nil {
                volumes = append(volumes, v)
            } else {
                projectedVolumeNames[v.Name] = true
            }
        }
        podSpec.Volumes = volumes
        for i := range podSpec.Containers {
            var volumeMounts []apiv1.VolumeMount
            for _, mount := range podSpec.Containers[i].VolumeMounts {
                if ok := projectedVolumeNames[mount.Name]; !ok {
                    volumeMounts = append(volumeMounts, mount)
                }
            }
            podSpec.Containers[i].VolumeMounts = volumeMounts
        }
    }

    func dropHostname(podSpec *apiv1.PodSpec) {
        podSpec.Hostname = ""
    }

It should be quite straightforward to calculate a hash in a similar manner instead of relying on owner refs as hints for similarity. That'd also have an extra optimization benefit of considering identical pods as similar even if they belong to completely different controllers.
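
A hash-based variant of the comparison could be sketched as follows. This is an assumption-laden illustration rather than a proposal for the actual CA code: to stay self-contained it uses a tiny hypothetical `podSpec` type instead of `apiv1.PodSpec`, and `sanitize` and `specHash` are made-up names. It hashes the JSON encoding of the sanitized spec with SHA-256.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// container and podSpec are hypothetical, heavily reduced stand-ins
// for the corresponding Kubernetes API types.
type container struct {
	Image      string
	CPURequest string
	GPURequest string
}

type podSpec struct {
	Hostname   string
	Containers []container
}

// sanitize clears fields that differ between otherwise identical pods,
// in the spirit of sanitizePodSpec above. It works on a copy of the spec.
func sanitize(s podSpec) podSpec {
	s.Hostname = ""
	return s
}

// specHash returns a hex-encoded SHA-256 of the JSON form of the sanitized spec.
func specHash(s podSpec) (string, error) {
	b, err := json.Marshal(sanitize(s))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	worker := []container{{Image: "trainer:v1", CPURequest: "4", GPURequest: "1"}}
	a := podSpec{Hostname: "worker-0", Containers: worker}
	b := podSpec{Hostname: "worker-1", Containers: worker}
	ha, _ := specHash(a)
	hb, _ := specHash(b)
	fmt.Println("same group:", ha == hb)
}
```

Pods could then be bucketed by hash in a map, regardless of which controller owns them. One caveat of this particular sketch: JSON encodes nil and empty slices differently, while `apiequality.Semantic.DeepEqual` treats them as equal, so a real implementation would need to normalize such cases before hashing.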

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2023
@alculquicondor
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2023

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2023
@kerthcet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2023

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@kerthcet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2024
@towca towca added the area/core-autoscaler Denotes an issue that is related to the core autoscaler and is not specific to any provider. label Mar 21, 2024

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 18, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

