ServiceLB is broken when rke2-cloud-provider is on 1.29 but k8s version is <=1.28 #5882

Closed
xyzzyz opened this issue May 3, 2024 · 5 comments
@xyzzyz

xyzzyz commented May 3, 2024

Environmental Info:
RKE2 Version: v1.28.9+rke2r1

Node(s) CPU architecture, OS, and Version: Linux dev 4.18.0-552.el8.x86_64 #1 SMP Sun Apr 7 19:39:51 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux, CentOS 8

Cluster Configuration:
1 server, 0 agents

Describe the bug:
ServiceLB fails to come up

Steps To Reproduce:

  • Install RKE2 with the following in config.yaml:
disable-cloud-controller: true
disable:
  - rke2-ingress-nginx
  • Create any Service of type LoadBalancer

Expected behavior:
ServiceLB Load Balancer comes up

Actual behavior:
LoadBalancer Service is stuck pending, with the following in events:

  Warning  SyncLoadBalancerFailed  1s (x6 over 2m36s)     service-controller  Error syncing load balancer: failed to ensure load balancer: failed to create kube-system/svclb-contour-envoy-b1fbfa01 apps/v1, Kind=DaemonSet for  default/contour-envoy: DaemonSet.apps "svclb-contour-envoy-b1fbfa01" is invalid: [spec.template.spec.containers[0].env[4].valueFrom.fieldRef: Forbidden: may not be set when feature gate 'PodHostIPs' is not enabled, spec.template.spec.containers[1].env[4].valueFrom.fieldRef: Forbidden: may not be set when feature gate 'PodHostIPs' is not enabled]

Additional context / logs:
This is caused by a version mismatch between rke2-cloud-provider and the Kubernetes apiserver. rke2-cloud-provider decides whether to use the HostIPs fieldRef based on what is enabled by default in the Kubernetes version it was compiled against. If that version is 1.29, rke2-cloud-provider assumes PodHostIPs is available; but if the cluster is actually running 1.28, the gate is not enabled by default, so the apiserver rejects the svclb DaemonSet.
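
To make the mismatch concrete, here is a minimal Python sketch of the decision described above. It is not the actual rke2-cloud-provider code: the function names are hypothetical, and the field names `status.hostIPs` / `status.hostIP` and the "on by default from 1.29" rule are my assumptions about how the gate behaves.

```python
from typing import Optional


def gate_default(minor: int) -> bool:
    """Assumed default for PodHostIPs on Kubernetes 1.<minor>: off up to 1.28, on from 1.29."""
    return minor >= 29


def svclb_field_ref(ccm_built_against_minor: int) -> str:
    """Hypothetical mirror of the CCM's choice: it only consults its own compiled-in default."""
    if gate_default(ccm_built_against_minor):
        return "status.hostIPs"
    return "status.hostIP"


def apiserver_accepts(field_ref: str, apiserver_minor: int,
                      gate_override: Optional[bool] = None) -> bool:
    """The apiserver validates the fieldRef against *its* gate state, not the CCM's."""
    if field_ref != "status.hostIPs":
        return True
    if gate_override is not None:
        return gate_override
    return gate_default(apiserver_minor)


# CCM built against 1.29, cluster on 1.28: the DaemonSet is rejected.
print(apiserver_accepts(svclb_field_ref(29), apiserver_minor=28))
# CCM and cluster both on 1.28: the older fieldRef is used and accepted.
print(apiserver_accepts(svclb_field_ref(28), apiserver_minor=28))
```

The point of the sketch is that the two sides evaluate the same gate against different defaults, so nothing in the CCM notices the problem until the apiserver refuses the object.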

3 weeks ago, cloud-provider was bumped to 1.29 on 1.28 and 1.27 release branches, see e.g. this. Indeed:

$ kubectl version
Client Version: v1.28.9
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.9+rke2r1
$ kubectl get pods -n kube-system -A -o yaml | grep image: | grep 1.29
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412

To fix, rke2-cloud-provider should obtain the actual feature gate state, instead of whatever's default for the given k8s version.

@brandond
Member

brandond commented May 6, 2024

Yeah, the problem is that it's not set explicitly by RKE2, so it uses the default for the Kubernetes version the CCM is built against - and the K3s CCM is built against 1.29 (as you noted).

The best current work-around is to set this in your config.yaml:

kube-cloud-controller-manager-arg:
  - 'feature-gates=PodHostIPs=false'
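
As a rough illustration of why this works, the Python sketch below shows an explicit `feature-gates` argument taking precedence over a compiled-in default. The helper names are hypothetical and the parsing is simplified; it is not how the CCM actually processes its flags.

```python
def parse_feature_gates(args: list[str]) -> dict[str, bool]:
    """Collect gate states from args shaped like 'feature-gates=A=true,B=false'."""
    gates: dict[str, bool] = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or key.lstrip("-") != "feature-gates":
            continue  # not a feature-gates argument
        for pair in value.split(","):
            name, has_state, state = pair.partition("=")
            if has_state:
                gates[name.strip()] = state.strip().lower() == "true"
    return gates


def effective_gate(name: str, compiled_default: bool, args: list[str]) -> bool:
    """An explicitly set gate wins; otherwise fall back to the compiled-in default."""
    return parse_feature_gates(args).get(name, compiled_default)


ccm_args = ["feature-gates=PodHostIPs=false"]
# A CCM built against 1.29 defaults to True, but the explicit argument forces False.
print(effective_gate("PodHostIPs", compiled_default=True, args=ccm_args))
# Without the argument, the compiled-in default is what gets used.
print(effective_gate("PodHostIPs", compiled_default=True, args=[]))
```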

@brandond brandond self-assigned this May 6, 2024
@brandond brandond added this to the v1.30.1+rke2r1 milestone May 6, 2024
@rancher-max rancher-max self-assigned this Jun 11, 2024
@rancher-max
Contributor

I'm not sure this has changed at all in these releases. Is this meant to be in Working status still, not To Test? @brandond

I checked the release-1.28 branch at commit 2c90f3baa0dbd555d5972db542421f3d2cded7b5, and still see the following:

$ kubectl get pods -n kube-system -A -o yaml | grep image: | grep 1.29
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515
      image: docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240515

I'll also note that, apart from this, I'm not able to reproduce the issue exactly as described. For my steps, I need to include enable-servicelb: true in config.yaml and NOT disable the cloud-controller; otherwise the cluster either doesn't come up correctly or no svclb pod is created when creating a Service of type LoadBalancer.

@brandond
Member

No, sorry, I think I moved this over by accident. This can go back to next up until July.

@brandond
Member

brandond commented Oct 1, 2024

Note that the CCM versions are now in sync with the Kubernetes minor versions across all branches, which should address the mismatch in default feature-gate states.
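
For anyone who wants to confirm this on their own cluster, a small Python sketch along these lines compares the minor version in the CCM image tag with the server version. The function names are hypothetical, and the input strings are simply the kind of values that `kubectl version` and the pod image fields report in this thread.

```python
import re

# Matches the first vMAJOR.MINOR.PATCH in a string such as an image tag or a server version.
_VERSION = re.compile(r"v(\d+)\.(\d+)\.\d+")


def major_minor(version_or_image: str) -> tuple[int, int]:
    """Extract (major, minor) from e.g. 'v1.28.9+rke2r1' or '...:v1.29.3-build20240412'."""
    match = _VERSION.search(version_or_image)
    if match is None:
        raise ValueError(f"no version found in {version_or_image!r}")
    return int(match.group(1)), int(match.group(2))


def ccm_in_sync(ccm_image: str, server_version: str) -> bool:
    """True when the CCM image and the apiserver share the same Kubernetes minor version."""
    return major_minor(ccm_image) == major_minor(server_version)


# Values from the original report: CCM on 1.29, server on 1.28.
print(ccm_in_sync("docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412",
                  "v1.28.9+rke2r1"))
# Values from a later release: both on 1.28.
print(ccm_in_sync("docker.io/rancher/rke2-cloud-provider:v1.28.13-build20240910",
                  "v1.28.14+rke2r1"))
```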

@VestigeJ
Contributor

VestigeJ commented Oct 8, 2024

Current releases do not reproduce the original issue.

$ kg pod/cloud-controller-manager-ip -n kube-system -o yaml | grep -i image:

    image: index.docker.io/rancher/rke2-cloud-provider:v1.28.13-build20240910
    image: docker.io/rancher/rke2-cloud-provider:v1.28.13-build20240910

$ k version

Client Version: v1.28.14+rke2r1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.14+rke2r1

$ rke2 -v

rke2 version v1.28.14+rke2r1 (05928c524ec436f7d854c68dea34f3e3bf4d5287)
go version go1.22.6 X:boringcrypto

$ kg svc # note pending service external-ip is expected here

NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.43.0.1       <none>        443/TCP        15m
test-lb      LoadBalancer   10.43.179.246   3.3.3.42      80:30941/TCP   7m4s

$ kgp -n default

NAMESPACE     NAME                                                    READY   STATUS      RESTARTS      AGE
default       nginx-698447f456-lt2fc                                  1/1     Running     0             6m11s

$ get_figs

node-external-ip: 3.3.3.42
token: YOUR_TOKEN_HERE
write-kubeconfig-mode: 644
debug: true
cni: multus,cilium
embedded-registry: true
enable-servicelb: true
disable-cloud-controller: false
disable:
  - rke2-ingress-nginx

@VestigeJ VestigeJ closed this as completed Oct 8, 2024