K3s fails to start after running k3s certificate rotate-ca #11014

Closed
brandond opened this issue Oct 8, 2024 · 2 comments

@brandond

brandond commented Oct 8, 2024

Environmental Info:
K3s Version:
v1.31.1+k3s1

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
n/a

Describe the bug:
After generating updated CA certs and updating the datastore with the k3s certificate rotate-ca command, K3s fails to restart with the following error:
Oct 08 18:31:31 server-0 k3s[9638]: time="2024-10-08T18:31:31Z" level=fatal msg="/var/lib/rancher/k3s/server/cred/ipsec.psk, /var/lib/rancher/k3s/server/cred/passwd newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore."
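
When scripting a recovery, it can help to pull the offending paths out of that fatal message rather than copying them by hand. The following is a minimal sketch, assuming the message keeps the wording quoted above; `stale_files_from_log` is a hypothetical helper name, not part of K3s.

```python
import re

# Matches the fatal message quoted above: a comma-separated list of paths
# followed by the fixed phrase "newer than datastore".
# Assumption: the message wording stays as shown in this issue.
_NEWER_RE = re.compile(r'msg="(?P<paths>[^"]*?) newer than datastore')


def stale_files_from_log(line: str) -> list[str]:
    """Return the file paths the log line reports as newer than the datastore."""
    match = _NEWER_RE.search(line)
    if match is None:
        return []
    return [p.strip() for p in match.group("paths").split(",") if p.strip()]


line = (
    'Oct 08 18:31:31 server-0 k3s[9638]: time="2024-10-08T18:31:31Z" level=fatal '
    'msg="/var/lib/rancher/k3s/server/cred/ipsec.psk, '
    "/var/lib/rancher/k3s/server/cred/passwd newer than datastore and could "
    'cause a cluster outage. Remove the file(s) from disk and restart to be '
    'recreated from datastore."'
)
for path in stale_files_from_log(line):
    print(path)
```

The helper only reports paths; whether it is safe to remove them depends on the token caveat described next.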

If the token has not been manually specified in the config file and the files are removed, K3s will start successfully once, but subsequent restarts will fail because the token in the passwd file will have been regenerated and will no longer match the bootstrap data:
Oct 08 23:02:20 systemd-node-1 k3s[6631]: time="2024-10-08T23:02:20Z" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
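
Because the second failure only applies when no token is set in the config file, it may be worth checking for one before removing anything. This is a rough sketch, assuming the config is a flat YAML file such as /etc/rancher/k3s/config.yaml with top-level `token` or `token-file` keys; `has_explicit_token` is a hypothetical helper, it scans lines rather than parsing YAML, and it does not detect a token passed on the command line or through the environment.

```python
def has_explicit_token(config_text: str) -> bool:
    """Return True if a top-level `token` or `token-file` key has a value.

    Assumption: simple line scanning is enough for a flat K3s config file;
    this is not a full YAML parser.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop trailing comments
        if not line or line[0].isspace():
            continue  # skip blank lines, comment-only lines, and nested keys
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("token", "token-file") and value.strip():
            return True
    return False


# Example inputs; the first has no token key, so it would hit the second failure.
print(has_explicit_token("cluster-init: true\nwrite-kubeconfig-mode: 644\n"))
print(has_explicit_token("token: example-value\n"))
```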

Steps To Reproduce:

  1. Install K3s
  2. Rotate CA certificates
  3. Restart K3s

Expected behavior:
CA certs rotate successfully without causing problems

Actual behavior:
K3s fails to start after the CA rotation, with the fatal errors shown above.

Additional context / logs:
Regression introduced by #10710

This caused e2e tests to fail, but apparently we didn't check e2e results during last month's release cycle.

@endawkins

Validated on master using commit 054cec8 | version v1.31

Environment Details:

Node(s) CPU architecture, OS, and Version:

$ uname -a && cat /etc/os-release
Linux ip-172-31-33-119 6.8.0-1016-aws #17-Ubuntu SMP Mon Sep  2 13:48:07 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Cluster Configuration:

1 server (configuration does not matter)

Files:

  • config.yaml
cluster-init: true
write-kubeconfig-mode: 644

Steps:

  1. Install K3s
  2. Generate updated CA certificates using the rotate-default-ca-certs.sh script
  3. Rotate the CA certificates: k3s certificate rotate-ca
  4. Restart K3s: sudo systemctl restart k3s
  5. Check the status of K3s: sudo systemctl status k3s
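
The command-line part of these steps (3 to 5) can be written down as a dry-run sequence for reuse in a test harness. This is only a sketch: `reproduction_commands` is a hypothetical helper, the default path is the one printed by the script in the transcript below, and the commands are printed rather than executed.

```python
import shlex

# Assumption: the rotate-ca directory matches the path shown in the transcript.
ROTATE_CA_DIR = "/var/lib/rancher/k3s/server/rotate-ca"


def reproduction_commands(rotate_dir: str = ROTATE_CA_DIR) -> list[list[str]]:
    """Return steps 3 to 5 as argument lists, in the order they are run."""
    return [
        ["k3s", "certificate", "rotate-ca", f"--path={rotate_dir}"],
        ["sudo", "systemctl", "restart", "k3s"],
        ["sudo", "systemctl", "status", "k3s"],
    ]


# Dry run: print each command instead of executing it.
for cmd in reproduction_commands():
    print(shlex.join(cmd))
```

To actually run the steps, each argument list could be passed to `subprocess.run` on a disposable test node.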

Reproduction of the Issue:

- Observations:

$ k3s -v
k3s version v1.31.1+k3s1 (452dbbc1)
go version go1.22.6
$ kubectl get nodes,pods -A -o wide
NAME                    STATUS   ROLES                       AGE     VERSION        INTERNAL-IP     EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
node/ip-172-31-33-119   Ready    control-plane,etcd,master   5m50s   v1.31.1+k3s1   172.31.33.119   [REDACTED]       Ubuntu 24.04.1 LTS   6.8.0-1016-aws   containerd://1.7.21-k3s2

NAMESPACE     NAME                                          READY   STATUS      RESTARTS   AGE     IP          NODE               NOMINATED NODE   READINESS GATES
kube-system   pod/coredns-56f6fc8fd7-r97cx                  1/1     Running     0          5m44s   10.42.0.5   ip-172-31-33-119   <none>           <none>
kube-system   pod/helm-install-traefik-crd-lxps9            0/1     Completed   0          5m44s   10.42.0.3   ip-172-31-33-119   <none>           <none>
kube-system   pod/helm-install-traefik-zg7ct                0/1     Completed   1          5m44s   10.42.0.2   ip-172-31-33-119   <none>           <none>
kube-system   pod/local-path-provisioner-846b9dcb6c-d4r2c   1/1     Running     0          5m44s   10.42.0.6   ip-172-31-33-119   <none>           <none>
kube-system   pod/metrics-server-5985cbc9d7-9swqh           1/1     Running     0          5m44s   10.42.0.4   ip-172-31-33-119   <none>           <none>
kube-system   pod/svclb-traefik-578f5134-bbvjc              2/2     Running     0          5m32s   10.42.0.7   ip-172-31-33-119   <none>           <none>
kube-system   pod/traefik-8dc7cf49b-fnk8q                   1/1     Running     0          5m32s   10.42.0.8   ip-172-31-33-119   <none>           <none>
$ ./rotate-default-ca-certs.sh
To update certificates, you may now run:
    k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca

$ k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca
certificates saved to datastore
$ sudo systemctl restart k3s.service
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xeu k3s.service" for details.
Oct 14 23:01:19 ip-172-31-33-119 k3s[32802]: time="2024-10-14T23:01:19Z" level=fatal msg="/var/lib/rancher/k3s/server/cred/passwd, /var/lib/rancher/k3s/server/cred/ipsec.psk newer than datastore and could cause a cluster outage. Remove t>
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Failed with result 'exit-code'.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2700 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2743 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2764 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 3710 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 3788 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: Failed to start k3s.service - Lightweight Kubernetes.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 1287.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2700 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2743 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2764 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3710 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3788 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: Starting k3s.service - Lightweight Kubernetes...
Oct 14 23:01:25 ip-172-31-33-119 sh[32816]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2700 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2743 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2764 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3710 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3788 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
$ sudo systemctl status k3s.service
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2024-10-14 20:00:11 UTC; 2s ago
       Docs: https://k3s.io
    Process: 6799 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 6801 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 6804 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 6806 ExecStart=/usr/local/bin/k3s server --token=test (code=exited, status=1/FAILURE)
$ kubectl get nodes,pods -A -o wide
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

Validation of the Issue:

- Observations:

$ k3s -v
k3s version v1.31.1+k3s-054cec84 (054cec84)
go version go1.22.6
$ ./rotate-default-ca-certs.sh
To update certificates, you may now run:
    k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca

$ k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca
certificates saved to datastore
$ sudo systemctl restart k3s
$ sudo systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2024-10-14 21:36:14 UTC; 20s ago
       Docs: https://k3s.io
    Process: 3914 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 3916 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3917 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3918 (k3s-server)
$ kubectl get nodes,pods -A -o wide
NAME                                               STATUS   ROLES                       AGE    VERSION                INTERNAL-IP     EXTERNAL-IP      OS-IMAGE                              KERNEL-VERSION                 CONTAINER-RUNTIME
node/ip-172-31-14-236.us-east-2.compute.internal   Ready    control-plane,etcd,master   133m   v1.31.1+k3s-054cec84   172.31.14.236   [REDACTED]       SUSE Linux Enterprise Server 15 SP5   5.14.21-150500.55.44-default   containerd://1.7.22-k3s1

NAMESPACE           NAME                                              READY   STATUS      RESTARTS   AGE    IP           NODE                                          NOMINATED NODE   READINESS GATES
kube-system         pod/coredns-56f6fc8fd7-rws6r                      1/1     Running     0          133m   10.42.0.5    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/helm-install-traefik-5z8k9                    0/1     Completed   2          133m   <none>       ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/helm-install-traefik-crd-vvwl6                0/1     Completed   0          133m   <none>       ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/local-path-provisioner-5cf85fd84d-sfcxs       1/1     Running     0          133m   10.42.0.3    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/metrics-server-5985cbc9d7-79wb2               1/1     Running     0          133m   10.42.0.6    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/svclb-nginx-loadbalancer-svc-7efe6867-vpzfh   1/1     Running     0          128m   10.42.0.21   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/svclb-traefik-3241a63e-j5wx2                  2/2     Running     0          133m   10.42.0.7    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/traefik-57b79cf995-zdp52                      1/1     Running     0          133m   10.42.0.8    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-ingressroute   pod/whoami-86c8d79cf4-42scz                       1/1     Running     0          127m   10.42.0.25   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-ingressroute   pod/whoami-86c8d79cf4-jg75p                       1/1     Running     0          127m   10.42.0.24   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-loadbalancer   pod/test-loadbalancer-6c774b8bb9-chrwc            1/1     Running     0          128m   10.42.0.22   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-loadbalancer   pod/test-loadbalancer-6c774b8bb9-prjgc            1/1     Running     0          128m   10.42.0.23   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
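
The difference between the failing and passing runs is visible in the `Active:` line of the `systemctl status` output. A small sketch for checking that in an automated validation follows; `active_state` and `is_running` are hypothetical helper names, and the parsing assumes the `Active: <state> (<sub-state>)` layout seen in the transcripts above.

```python
import re
from typing import Optional, Tuple

# Assumption: the status output contains a line such as
#   "Active: active (running) since ..."
_ACTIVE_RE = re.compile(
    r"^\s*Active:\s+(?P<state>\S+)\s+\((?P<sub>[^)]*)\)", re.MULTILINE
)


def active_state(status_output: str) -> Optional[Tuple[str, str]]:
    """Return (state, sub_state) from systemctl status output, or None."""
    match = _ACTIVE_RE.search(status_output)
    if match is None:
        return None
    return match.group("state"), match.group("sub")


def is_running(status_output: str) -> bool:
    """True only when the unit is reported as active (running)."""
    return active_state(status_output) == ("active", "running")


failing = "     Active: activating (auto-restart) (Result: exit-code) since Mon 2024-10-14 20:00:11 UTC; 2s ago"
passing = "     Active: active (running) since Mon 2024-10-14 21:36:14 UTC; 20s ago"
print(active_state(failing), is_running(failing))
print(active_state(passing), is_running(passing))
```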

@pascaliske

Hi @brandond!

It seems that I've run into the issue you mentioned above. I tried to rotate the CA certificates and now get the error message mentioned above ("bootstrap data already found and encrypted with different token")...

Is there a way to recover from this situation? Like, forcing a new token and/or certificates?

I have tried all possible combinations of the multiple tls folders, token file values and cred/passwd file values...

The workloads still seem to be running fine, but my single-node K3s cluster cannot be started again. It would be awesome if K3s could be recovered without needing to recreate the complete cluster...

Thanks in advance for your reply!

BR, Pascal
