K3s fails to start after running `k3s certificate rotate-ca` #11014

brandond · 2024-10-08T23:05:05Z

Environmental Info:
K3s Version:
v1.31.1+k3s1

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
n/a

Describe the bug:
After generating updated CA certs and updating the datastore with the `k3s certificate rotate-ca` command, K3s fails to restart with the following error:
Oct 08 18:31:31 server-0 k3s[9638]: time="2024-10-08T18:31:31Z" level=fatal msg="/var/lib/rancher/k3s/server/cred/ipsec.psk, /var/lib/rancher/k3s/server/cred/passwd newer than datastore and could cause a cluster outage. Remove the file(s) from disk and restart to be recreated from datastore."

If the token has not been manually specified in the config file and the files are removed, K3s will start once successfully, but subsequent restarts will fail because the token in the passwd file will have been regenerated and no longer match the bootstrap data:
Oct 08 23:02:20 systemd-node-1 k3s[6631]: time="2024-10-08T23:02:20Z" level=fatal msg="starting kubernetes: preparing server: bootstrap data already found and encrypted with different token"
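
For triage, the two fatal messages above can be told apart with a simple pattern match. This is only a minimal sketch: the function name `classify_k3s_fatal` and the labels it prints are made up for illustration and are not part of K3s. It matches on the message text quoted in this issue, so it would need updating if the wording changes.

```shell
#!/bin/sh
# Hypothetical triage helper (not part of K3s): classify a K3s fatal log line
# into one of the two failure modes quoted in this issue.
classify_k3s_fatal() {
    case "$1" in
        *"newer than datastore"*)
            # First failure: on-disk cred files are newer than the datastore copy.
            echo "files-newer-than-datastore" ;;
        *"bootstrap data already found and encrypted with different token"*)
            # Second failure: the regenerated token no longer matches the bootstrap data.
            echo "token-mismatch" ;;
        *)
            echo "unknown" ;;
    esac
}

# Possible usage on a systemd host (left commented out, since it needs a live unit):
# journalctl -u k3s --no-pager -n 200 | while IFS= read -r line; do
#     classify_k3s_fatal "$line"
# done
```

Matching on a substring rather than the full line keeps the check working on truncated `systemctl status` output, as long as the truncation happens after the matched phrase.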

Steps To Reproduce:

Install K3s
Rotate CA certificates
Restart K3s

Expected behavior:
CA certs rotate successfully without causing problems

Actual behavior:
Problems

Additional context / logs:
Regression introduced by #10710

This caused e2e tests to fail, but apparently we didn't check e2e results during last month's release cycle.

endawkins · 2024-10-14T23:40:16Z

Validated on master using commit `054cec8` | version v1.31

Environment Details:

Node(s) CPU architecture, OS, and Version:

uname -a && cat /etc/os-release
Linux ip-172-31-33-119 6.8.0-1016-aws #17-Ubuntu SMP Mon Sep  2 13:48:07 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Cluster Configuration:

1 server (configuration does not matter)

Files:

config.yaml

cluster-init: true
write-kubeconfig-mode: 644

rotate-default-ca-certs.sh
https://github.com/k3s-io/k3s/blob/release-1.28/contrib/util/rotate-default-ca-certs.sh

Steps:

Install K3s
Update certificates using the script
Rotate CA certs: `k3s certificate rotate-ca`
Restart K3s: `sudo systemctl restart k3s`
Check the status of K3s: `sudo systemctl status k3s`
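
The steps above could be scripted roughly as follows. This is a sketch under assumptions: K3s is already installed, `rotate-default-ca-certs.sh` sits in the current directory, and the `--path` value is the one the script printed in the runs recorded in this comment. The `run` helper, the `reproduce_rotate_ca` function, and the `DRY_RUN` variable are hypothetical names for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps. DRY_RUN, run, and reproduce_rotate_ca are
# hypothetical names, not part of K3s.

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce_rotate_ca() {
    # Generate updated CA certs with the contrib script.
    run ./rotate-default-ca-certs.sh
    # Load the new certs into the datastore.
    run k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca
    # Restart and check whether the service came back.
    run sudo systemctl restart k3s
    run sudo systemctl status k3s
}

reproduce_rotate_ca
```

Setting `DRY_RUN=0` runs the commands for real, which on an affected version is expected to leave the service failing to start, so it is best tried on a disposable node.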

Reproduction of the Issue:

- Observations:

k3s -v
k3s version v1.31.1+k3s1 (452dbbc1)
go version go1.22.6

kubectl get nodes,pods -A -o wide
NAME                    STATUS   ROLES                       AGE     VERSION        INTERNAL-IP     EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
node/ip-172-31-33-119   Ready    control-plane,etcd,master   5m50s   v1.31.1+k3s1   172.31.33.119   [REDACTED]       Ubuntu 24.04.1 LTS   6.8.0-1016-aws   containerd://1.7.21-k3s2

NAMESPACE     NAME                                          READY   STATUS      RESTARTS   AGE     IP          NODE               NOMINATED NODE   READINESS GATES
kube-system   pod/coredns-56f6fc8fd7-r97cx                  1/1     Running     0          5m44s   10.42.0.5   ip-172-31-33-119   <none>           <none>
kube-system   pod/helm-install-traefik-crd-lxps9            0/1     Completed   0          5m44s   10.42.0.3   ip-172-31-33-119   <none>           <none>
kube-system   pod/helm-install-traefik-zg7ct                0/1     Completed   1          5m44s   10.42.0.2   ip-172-31-33-119   <none>           <none>
kube-system   pod/local-path-provisioner-846b9dcb6c-d4r2c   1/1     Running     0          5m44s   10.42.0.6   ip-172-31-33-119   <none>           <none>
kube-system   pod/metrics-server-5985cbc9d7-9swqh           1/1     Running     0          5m44s   10.42.0.4   ip-172-31-33-119   <none>           <none>
kube-system   pod/svclb-traefik-578f5134-bbvjc              2/2     Running     0          5m32s   10.42.0.7   ip-172-31-33-119   <none>           <none>
kube-system   pod/traefik-8dc7cf49b-fnk8q                   1/1     Running     0          5m32s   10.42.0.8   ip-172-31-33-119   <none>           <none>

$ ./rotate-default-ca-certs.sh
To update certificates, you may now run:
    k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca

$ k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca
certificates saved to datastore

$ sudo systemctl restart k3s.service
Job for k3s.service failed because the control process exited with error code.
See "systemctl status k3s.service" and "journalctl -xeu k3s.service" for details.

Oct 14 23:01:19 ip-172-31-33-119 k3s[32802]: time="2024-10-14T23:01:19Z" level=fatal msg="/var/lib/rancher/k3s/server/cred/passwd, /var/lib/rancher/k3s/server/cred/ipsec.psk newer than datastore and could cause a cluster outage. Remove t>
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Failed with result 'exit-code'.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2700 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2743 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 2764 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 3710 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: k3s.service: Unit process 3788 (containerd-shim) remains running after unit stopped.
Oct 14 23:01:19 ip-172-31-33-119 systemd[1]: Failed to start k3s.service - Lightweight Kubernetes.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 1287.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2700 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2743 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2764 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3710 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3788 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: Starting k3s.service - Lightweight Kubernetes...
Oct 14 23:01:25 ip-172-31-33-119 sh[32816]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2700 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2743 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 2764 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3710 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: Found left-over process 3788 (containerd-shim) in control group while starting unit. Ignoring.
Oct 14 23:01:25 ip-172-31-33-119 systemd[1]: k3s.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

$ sudo systemctl status k3s.service
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2024-10-14 20:00:11 UTC; 2s ago
       Docs: https://k3s.io
    Process: 6799 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 6801 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 6804 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 6806 ExecStart=/usr/local/bin/k3s server --token=test (code=exited, status=1/FAILURE)

$ kubectl get nodes,pods -A -o wide
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

Validation of the Issue:

- Observations:

k3s -v
k3s version v1.31.1+k3s-054cec84 (054cec84)
go version go1.22.6

$ ./rotate-default-ca-certs.sh
To update certificates, you may now run:
    k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca

$ k3s certificate rotate-ca --path=/var/lib/rancher/k3s/server/rotate-ca
certificates saved to datastore

$ sudo systemctl restart k3s
$ sudo systemctl status k3s
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2024-10-14 21:36:14 UTC; 20s ago
       Docs: https://k3s.io
    Process: 3914 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
    Process: 3916 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 3917 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 3918 (k3s-server)

$ kubectl get nodes,pods -A -o wide
NAME                                               STATUS   ROLES                       AGE    VERSION                INTERNAL-IP     EXTERNAL-IP      OS-IMAGE                              KERNEL-VERSION                 CONTAINER-RUNTIME
node/ip-172-31-14-236.us-east-2.compute.internal   Ready    control-plane,etcd,master   133m   v1.31.1+k3s-054cec84   172.31.14.236   [REDACTED]       SUSE Linux Enterprise Server 15 SP5   5.14.21-150500.55.44-default   containerd://1.7.22-k3s1

NAMESPACE           NAME                                              READY   STATUS      RESTARTS   AGE    IP           NODE                                          NOMINATED NODE   READINESS GATES
kube-system         pod/coredns-56f6fc8fd7-rws6r                      1/1     Running     0          133m   10.42.0.5    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/helm-install-traefik-5z8k9                    0/1     Completed   2          133m   <none>       ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/helm-install-traefik-crd-vvwl6                0/1     Completed   0          133m   <none>       ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/local-path-provisioner-5cf85fd84d-sfcxs       1/1     Running     0          133m   10.42.0.3    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/metrics-server-5985cbc9d7-79wb2               1/1     Running     0          133m   10.42.0.6    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/svclb-nginx-loadbalancer-svc-7efe6867-vpzfh   1/1     Running     0          128m   10.42.0.21   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/svclb-traefik-3241a63e-j5wx2                  2/2     Running     0          133m   10.42.0.7    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
kube-system         pod/traefik-57b79cf995-zdp52                      1/1     Running     0          133m   10.42.0.8    ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-ingressroute   pod/whoami-86c8d79cf4-42scz                       1/1     Running     0          127m   10.42.0.25   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-ingressroute   pod/whoami-86c8d79cf4-jg75p                       1/1     Running     0          127m   10.42.0.24   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-loadbalancer   pod/test-loadbalancer-6c774b8bb9-chrwc            1/1     Running     0          128m   10.42.0.22   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>
test-loadbalancer   pod/test-loadbalancer-6c774b8bb9-prjgc            1/1     Running     0          128m   10.42.0.23   ip-172-31-14-236.us-east-2.compute.internal   <none>           <none>

pascaliske · 2024-10-20T08:34:28Z

Hi @brandond!

It seems that I've been running into the issue you mentioned above. I tried to rotate CA certificates and now have the mentioned error message ("bootstrap data already found and encrypted with different token")...

Is there a way to recover from this situation? Like, forcing a new token and/or certificates?

I have tried all possible combinations of the multiple tls folders, token file values and cred/passwd file values...

The workloads still seem to be running fine but my single node k3s cluster can not be started again. And it would be awesome if K3s could be recovered without needing to recreate the complete cluster...

Thanks in advance for your reply!

BR, Pascal

brandond mentioned this issue Oct 8, 2024

Add ca-cert rotation integration test, and fix ca-cert rotation #11013

Merged

brandond self-assigned this Oct 8, 2024

brandond added this to the 2024-10 Release Cycle milestone Oct 8, 2024

aganesh-suse assigned endawkins Oct 10, 2024

endawkins closed this as completed Oct 14, 2024
