
cluster-autoscaler cannot count migrated PVs (CSI enabled) and cannot scale up on exceed max volume count #4517

Closed
ialidzhikov opened this issue Dec 13, 2021 · 3 comments · Fixed by #4539
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@ialidzhikov
Contributor

ialidzhikov commented Dec 13, 2021

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.18.0

What k8s version are you using (kubectl version)?:

1.17 and 1.18

What environment is this in?:

Gardener

What did you expect to happen?:

cluster-autoscaler to properly count migrated PVs and to scale up appropriately on scheduling failures with reason exceed max volume count.

What happened instead?:

cluster-autoscaler cannot count migrated PVs when CSI is enabled and therefore cannot scale up on "exceed max volume count". The Pod(s) hang forever in Pending state.

How to reproduce it (as minimally and precisely as possible):

  1. Create a single-node cluster with a K8s version that does not have CSI enabled (for example, an AWS cluster with K8s 1.17)
    For the machine type, select one that allows 25 volume attachments - for example, m5.large

    Make sure that you have a single Node. Its allocatable volume attachments should be 25.

    $ k get csinode <node-name> -o yaml   # output abbreviated
    
    spec:
      drivers:
      - allocatable:
          count: 25
        name: ebs.csi.aws.com
    
  2. Create a dummy StatefulSet and scale it to 20 replicas

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: nginx
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      selector:
        matchLabels:
          app: nginx # has to match .spec.template.metadata.labels
      serviceName: "nginx"
      replicas: 3 # by default is 1
      template:
        metadata:
          labels:
            app: nginx # has to match .spec.selector.matchLabels
        spec:
          terminationGracePeriodSeconds: 10
          containers:
          - name: nginx
            image: k8s.gcr.io/nginx-slim:0.8
            ports:
            - containerPort: 80
              name: web
            volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
      volumeClaimTemplates:
      - metadata:
          name: www
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    $ k scale sts web --replicas=20
    

    This will create 20 Pods and PVs (note that the PVs are provisioned with the in-tree volume plugin).

  3. Update to a K8s version with CSI enabled (for example, an AWS cluster with K8s 1.18)

    After completing this step you should have 20 "migrated" PVs (PVs that were provisioned with the in-tree volume plugin and are now handled via CSI migration).

  4. Scale the StatefulSet to 26 replicas

    $ k scale sts web --replicas=26
    
  5. Make sure that the 26th replica (Pod web-25) fails to be scheduled (as expected) but cluster-autoscaler never triggers a scale up

    Events:
    Type     Reason            Age   From               Message
    ----     ------            ----  ----               -------
    Warning  FailedScheduling  54m   default-scheduler  0/1 nodes are available: 1 node(s) exceed max volume count.
    Warning  FailedScheduling  54m   default-scheduler  0/1 nodes are available: 1 node(s) exceed max volume count.
    

    Logs of cluster-autoscaler:

    I1124 14:44:02.817569       1 csi.go:178] Persistent volume had no name for claim default/www-web-25
    I1124 14:44:02.817584       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817592       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817599       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817606       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817612       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817615       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817629       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817634       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817641       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817644       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817649       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817652       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817659       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817662       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817668       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817671       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817678       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817685       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817691       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817694       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817700       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817705       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817713       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817721       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817729       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817732       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817738       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817741       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817751       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817759       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817766       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817769       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817774       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817778       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817787       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817791       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817797       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817800       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817806       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
    I1124 14:44:02.817809       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume
    I1124 14:44:02.817818       1 scheduler_binder.go:241] FindPodVolumes for pod "default/web-25", node "ip-10-250-28-64.eu-west-1.compute.internal"
    I1124 14:44:02.817842       1 scheduler_binder.go:819] No matching volumes for Pod "default/web-25", PVC "default/www-web-25" on node "ip-10-250-28-64.eu-west-1.compute.internal"
    I1124 14:44:02.817852       1 scheduler_binder.go:883] Provisioning for 1 claims of pod "default/web-25" that has no matching volumes on node "ip-10-250-28-64.eu-west-1.compute.internal" ...
    I1124 14:44:02.817868       1 filter_out_schedulable.go:118] Pod default.web-25 marked as unschedulable can be scheduled on node ip-10-250-28-64.eu-west-1.compute.internal (based on hinting). Ignoring in scale up.
    I1124 14:44:02.817878       1 filter_out_schedulable.go:132] Filtered out 1 pods using hints
    I1124 14:44:02.817884       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
    I1124 14:44:02.817888       1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
    I1124 14:44:02.817894       1 filter_out_schedulable.go:79] Schedulable pods present
    I1124 14:44:02.817913       1 static_autoscaler.go:402] No unschedulable pods
    I1124 14:44:02.817929       1 static_autoscaler.go:449] Calculating unneeded nodes
    

Anything else we need to know?:

From what I managed to track in the autoscaler repository, the autoscaler creates a new scheduler framework and "simulates" whether the Pod is really unschedulable (most probably using the default scheduling configuration).

I1124 14:44:02.817868       1 filter_out_schedulable.go:118] Pod default.web-25 marked as unschedulable can be scheduled on node ip-10-250-28-64.eu-west-1.compute.internal (based on hinting). Ignoring in scale up.

The above log entry makes it clear that the Pod is unschedulable according to the kube-scheduler (exceed max volume count), but the same Pod is schedulable according to the cluster-autoscaler. The difference comes from the NodeVolumeLimits filter in the scheduler: kube-scheduler obviously has the required CSI migration feature gates set and can correctly count migrated volumes, while cluster-autoscaler currently does not have any such config and hence cannot count migrated volumes when CSI migration is enabled:

I1124 14:44:02.817806       1 csi.go:205] CSI Migration of plugin kubernetes.io/aws-ebs is not enabled
I1124 14:44:02.817809       1 csi.go:158] Could not find a CSI driver name or volume handle, not counting volume

(Note that kube-scheduler has the required CSI migration feature gates set and CSI migration is enabled for AWS.)
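To illustrate the mechanism, here is a minimal sketch of my own (not the vendored code; the helper and the standalone program are made up for illustration): the filter only counts a volume against the node's limit if it can derive a CSI driver name and volume handle for it, and for an in-tree EBS PV that derivation only happens when the corresponding CSI migration feature gate is on - otherwise the volume is silently skipped, which is exactly what the log lines above show.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// csiSourceFor is a hypothetical helper: it returns the CSI driver name and
// volume handle for a PV, or empty strings if the PV is an in-tree EBS volume
// and CSI migration for AWS is not enabled.
func csiSourceFor(pv *v1.PersistentVolume, migrationAWSEnabled bool) (driver, handle string) {
	if csi := pv.Spec.CSI; csi != nil {
		// Natively provisioned CSI volume: always countable.
		return csi.Driver, csi.VolumeHandle
	}
	if pv.Spec.AWSElasticBlockStore != nil {
		if !migrationAWSEnabled {
			// Corresponds to "CSI Migration of plugin kubernetes.io/aws-ebs is not enabled".
			return "", ""
		}
		// With migration on, the in-tree source is translated to its CSI
		// equivalent: the EBS CSI driver name plus the EBS volume ID.
		return "ebs.csi.aws.com", pv.Spec.AWSElasticBlockStore.VolumeID
	}
	return "", ""
}

func main() {
	pv := &v1.PersistentVolume{
		Spec: v1.PersistentVolumeSpec{
			PersistentVolumeSource: v1.PersistentVolumeSource{
				AWSElasticBlockStore: &v1.AWSElasticBlockStoreVolumeSource{VolumeID: "vol-0abc"},
			},
		},
	}

	if d, h := csiSourceFor(pv, false); d == "" || h == "" {
		// Mirrors "Could not find a CSI driver name or volume handle, not counting volume".
		fmt.Println("not counting volume (migration disabled)")
	}
	if d, h := csiSourceFor(pv, true); d != "" && h != "" {
		fmt.Printf("counting volume against the limit of driver %s (handle %s)\n", d, h)
	}
}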

@ialidzhikov
Contributor Author

As discussed in the SIG meeting on Monday, I tried out the cluster-autoscaler with a manual hack that simulates feature gate enablement. I used the following diff:

diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
index c05b49cd8..f08ed5b6e 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go
@@ -92,12 +92,14 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v

 	// If the pod doesn't have any new CSI volumes, the predicate will always be true
 	if len(newVolumes) == 0 {
+		klog.V(5).Info("Early exit len(newVolumes) == 0")
 		return nil
 	}

 	// If the node doesn't have volume limits, the predicate will always be true
 	nodeVolumeLimits := getVolumeLimits(nodeInfo, csiNode)
 	if len(nodeVolumeLimits) == 0 {
+		klog.V(5).Info("Early exit len(nodeVolumeLimits) == 0")
 		return nil
 	}

@@ -125,6 +127,7 @@ func (pl *CSILimits) Filter(ctx context.Context, _ *framework.CycleState, pod *v
 		if ok {
 			currentVolumeCount := attachedVolumeCount[volumeLimitKey]
 			if currentVolumeCount+count > int(maxVolumeLimit) {
+				klog.V(5).Info("Pod is unschedulable.")
 				return framework.NewStatus(framework.Unschedulable, ErrReasonMaxVolumeCountExceeded)
 			}
 		}
diff --git a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
index 3fd98da14..9de43b175 100644
--- a/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
+++ b/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/utils.go
@@ -19,7 +19,7 @@ package nodevolumelimits
 import (
 	"strings"

-	"k8s.io/api/core/v1"
+	v1 "k8s.io/api/core/v1"
 	storagev1 "k8s.io/api/storage/v1"
 	"k8s.io/apimachinery/pkg/util/sets"
 	utilfeature "k8s.io/apiserver/pkg/util/feature"
@@ -44,9 +44,7 @@ func isCSIMigrationOn(csiNode *storagev1.CSINode, pluginName string) bool {

 	switch pluginName {
 	case csilibplugins.AWSEBSInTreePluginName:
-		if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationAWS) {
-			return false
-		}
+		return true
 	case csilibplugins.GCEPDInTreePluginName:
 		if !utilfeature.DefaultFeatureGate.Enabled(features.CSIMigrationGCE) {
 			return false

With this small hack, cluster-autoscaler was able to successfully scale up on "exceed max volume count". It seems that counting of migrated PVs works as expected. When a new node template is created, I see that the following early-exit logic in the NodeVolumeLimits filter is executed:

https://github.com/kubernetes/kubernetes/blob/ab69524f795c42094a6630298ff53f3c3ebab7f4/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L110-L114

(Note that csiNode is nil)

Logs:

I1216 17:20:30.158799       1 csi.go:95] Early exit len(newVolumes) == 0

I1216 17:20:30.158946       1 scale_up.go:574] Final scale-up plan: [{worker-nq600-z1 1->2 (max: 2)}]
I1216 17:20:30.158956       1 scale_up.go:663] Scale-up: setting group worker-nq600-z1 size to 2

[maciekpytel] - The issue here is that when the CA runs the simulation, it creates the in-memory node objects for the new nodes that would be created on scale-up, but the corresponding CSINode objects won't be created and thus won't be considered by the scale-up simulation. This is the difficulty of simulation with CSI: it depends a lot on informers and run-time decisions on node creation, etc.

@MaciekPytel to my understanding this is not a real concern, as the NodeVolumeLimits filter returns nil (which means that a Pod can be scheduled on the Node, right?) if there is no corresponding CSINode object and no capacity set on the Node itself. Feel free to correct me if I am wrong.
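To make that reasoning concrete, here is a self-contained paraphrase of the logic (my own condensed sketch, with made-up maps standing in for the per-driver counts; not the actual plugin code): on the real node the 26th volume pushes the count over the limit, while for a template node without a CSINode object no limits are known and the check passes.

package main

import "fmt"

// fitsVolumeLimits is a made-up stand-in for the real filter: it reports
// whether a pod's new CSI volumes fit on a node, given the node's per-driver
// limits and the number of volumes already attached per driver.
func fitsVolumeLimits(newPerDriver, attachedPerDriver, limitPerDriver map[string]int) (bool, string) {
	if len(newPerDriver) == 0 {
		return true, "early exit: len(newVolumes) == 0"
	}
	if len(limitPerDriver) == 0 {
		return true, "early exit: len(nodeVolumeLimits) == 0 (no CSINode / no limits)"
	}
	for driver, count := range newPerDriver {
		if limit, ok := limitPerDriver[driver]; ok && attachedPerDriver[driver]+count > limit {
			return false, "node(s) exceed max volume count"
		}
	}
	return true, "fits"
}

func main() {
	// Existing node: limit of 25 EBS attachments, 25 already attached, 1 new volume -> unschedulable.
	fmt.Println(fitsVolumeLimits(
		map[string]int{"ebs.csi.aws.com": 1},
		map[string]int{"ebs.csi.aws.com": 25},
		map[string]int{"ebs.csi.aws.com": 25},
	))

	// Template node from the scale-up simulation: no CSINode object, so no
	// limits are known and the pod is treated as schedulable there.
	fmt.Println(fitsVolumeLimits(
		map[string]int{"ebs.csi.aws.com": 1},
		map[string]int{},
		map[string]int{},
	))
}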

[maciekpytel] - The expectation would be that the CA will use the default options set in k/k due to the vendoring of upstream; therefore, if the CSI migration feature gates default to on in 1.23, the CA should pick this up.

With the above findings in mind, I guess this should be automatically fixed with the vendoring of K8s 1.23 in cluster-autoscaler. Unfortunately, I cannot easily verify this because we are using a fork of cluster-autoscaler that currently vendors K8s 1.18.

@ialidzhikov
Contributor Author

@MaciekPytel to tackle this issue for all K8s versions < 1.23, and for CSI migrations of providers that are still not enabled by default in K8s 1.23, would it be okay to introduce a --feature-gates flag to cluster-autoscaler for the CSI migration related feature gates? The cluster-autoscaler would set/pass these CSI migration related feature gates to the scheduler feature gates. Actually, all K8s control plane components already have such a flag - kube-apiserver, kube-controller-manager and kube-scheduler. In this way, a managed service that uses the CA could also configure the CA and make it CSI migration aware. WDYT?
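A minimal sketch of what such a flag could look like (my own illustration, not necessarily what #4539 or any real implementation does): the vendored NodeVolumeLimits code consults utilfeature.DefaultFeatureGate (see the diff above), so parsing the flag value into the shared mutable gate would be enough for the CSI migration checks to see it.

package main

import (
	"flag"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/klog/v2"

	// The init() of the vendored Kubernetes features package registers the
	// upstream feature gates (including the CSIMigration* ones) into
	// utilfeature.DefaultMutableFeatureGate; without that registration,
	// Set() below would reject them as unknown.
	_ "k8s.io/kubernetes/pkg/features"
)

var featureGates = flag.String("feature-gates", "",
	"A set of key=value pairs that describe feature gates, e.g. CSIMigration=true,CSIMigrationAWS=true")

func main() {
	flag.Parse()

	if *featureGates != "" {
		// DefaultMutableFeatureGate is the writable view of the gate that the
		// vendored scheduler plugins read through utilfeature.DefaultFeatureGate.
		if err := utilfeature.DefaultMutableFeatureGate.Set(*featureGates); err != nil {
			klog.Fatalf("failed to set feature gates %q: %v", *featureGates, err)
		}
	}

	// ... continue with the normal cluster-autoscaler startup.
}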

@ialidzhikov
Contributor Author

/cc @msau42 @jsafrane
