
dockerd fails to start - RULE_APPEND failed (No such file or directory): rule in chain DOCKER-ISOLATION-STAGE-1 #463

Closed
giannello opened this issue Dec 15, 2023 · 18 comments · Fixed by #465


@giannello

giannello commented Dec 15, 2023

After the merge of #461, the Docker containers we run as part of our CI jobs stopped working.
The containers fail to start with the following error:

failed to start daemon: Error initializing network controller: error obtaining controller instance: unable to add return rule in DOCKER-ISOLATION-STAGE-1 chain:  (iptables failed: iptables --wait -A DOCKER-ISOLATION-STAGE-1 -j RETURN: iptables v1.8.10 (nf_tables):  RULE_APPEND failed (No such file or directory): rule in chain DOCKER-ISOLATION-STAGE-1

The affected image is docker.io/library/docker@sha256:ae63bb7c7d3ae23884a2c5d206939640279f6d15730618192b58662a0619f182, while docker.io/library/docker@sha256:c90e58d30700470fc59bdaaf802340fd25c1db628756d7bf74e100c566ba9589 works fine. Both images are tagged as 24.0.7-dind
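To check which iptables backend an image ships, you can print the version string directly; the suffix in the output, (nf_tables) or (legacy), indicates the backend:

docker run --rm --entrypoint iptables docker:24.0.7-dind --version
# prints something like: iptables v1.8.10 (nf_tables)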

The environment is GKE 1.27 with Container-Optimized OS.

Workaround
Use docker:24.0.7-dind-alpine3.18, as it points at the previous version of the image that was overwritten.
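For example, in CI you can pin either the tag above or the known-good digest mentioned earlier:

# pin by tag (previous Alpine base)
docker pull docker:24.0.7-dind-alpine3.18
# or pin by the digest that still works
docker pull docker.io/library/docker@sha256:c90e58d30700470fc59bdaaf802340fd25c1db628756d7bf74e100c566ba9589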

@Syphon83

Hi everyone,
same problem here. Every dind pipeline stopped working.
Every GitLab runner fails with the error:
Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables -t nat -N DOCKER: iptables v1.8.10 (nf_tables): Could not fetch rule set generation id: Invalid argument

As a quick workaround, we had to switch back from the dind image docker:24.0.7-dind to docker:23.0.6-dind.
Environment is GKE 1.25.

@giannello (Author)

24.0.6-dind worked for us

@steve-mt

steve-mt commented Dec 15, 2023

We (GitLab.com) have a similar issue (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283) with 24.0.7 failing to start on Google Container Optimized OS.

As we can see, Alpine 3.19 changed the iptables version 👉 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1695929693

We've also tried multiple Google Container Optimized OS versions and all of them seem to fail 👉 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1696008058. We also tried some fixes in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1696085805, which didn't work.

The only thing that worked for us was changing the host image to Ubuntu 👉 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1696057259, but this is not a viable option for us.


At the moment I'm not sure what our (GitLab.com) next steps are since Alpine 3.19 seems to be incompatible with Google Container Optimized OS, and it also seems like other users are having the same problem.

cc @tianon

@paolomainardi

paolomainardi commented Dec 15, 2023

@stevexuereb for us, using 24.0.7 with Alpine 3.18 fixed the issue.

But of course this is just a temporary workaround.

@steve-mt

I've tried to use the legacy package in the Dockerfile instead, but still didn't have any luck 👉 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1696330857

@dgteixeira

dgteixeira commented Dec 15, 2023

This is also happening for us when using ARC for GitHub runners.

Since it used dind:latest, it broke our self-hosted runner pipelines.

We will try to force a previous version as mentioned above and will report back ASAP.

EDIT (27/12/2023):
This is how we temporarily solved it using the ARC Helm chart.

We added this to our Helm chart values:

image:
  repository: "summerwind/actions-runner-controller"
  actionsRunnerRepositoryAndTag: "summerwind/actions-runner:latest"
  dindSidecarRepositoryAndTag: "docker:24.0.7-dind-alpine3.18"
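If helpful, those values can then be applied with a standard Helm upgrade (the release, chart, and namespace names below are assumptions, not taken from this thread):

# assumed release/chart/namespace names; substitute your own
helm upgrade --install actions-runner-controller \
  actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system \
  -f values.yaml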

@jnoordsij (Contributor)

I've tried to use the legacy package in the Dockerfile instead, but still didn't have any luck 👉 https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17283#note_1696330857

Not exactly sure about all the moving parts involved here, but given the changeset in #461, you might want to revert the change to the dockerd-entrypoint.sh file in there as well, to ensure the modprobe loads ip_tables rather than nf_tables.
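A minimal sketch of that fallback (not the actual dockerd-entrypoint.sh code; module and binary names are taken from the discussion in this thread):

# prefer nf_tables, but fall back to the legacy backend if the module is missing
if ! modprobe nf_tables; then
    modprobe ip_tables || true
    ln -sf xtables-legacy-multi /sbin/iptables
fi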

@akerouanton (Contributor)

akerouanton commented Dec 15, 2023

@stevexuereb I'm not sure why your change doesn't fail to build (since that RUN command starts with set -eu), but your ln command is wrong. You need to pass it the -f flag to overwrite the existing symlink.

I'm gonna open a PR.
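For illustration, the corrected command would look something like this (link target assumed from the Alpine iptables packaging):

# -f overwrites the existing /sbin/iptables symlink instead of failing
ln -sf xtables-legacy-multi /sbin/iptables
ln -sf xtables-legacy-multi /sbin/ip6tables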

akerouanton added a commit to akerouanton/docker-doi that referenced this issue Dec 15, 2023
PR docker-library#461 updated Alpine to 3.19 and made a change to load the nf_tables
kernel module if needed. However, as demonstrated by docker-library#463 and docker-library#464, this
might break when the host system doesn't have the nf_tables module
available. In that case, we should still try to load the ip_tables
module and symlink /sbin/iptables to xtables-legacy-multi.

Signed-off-by: Albin Kerouanton <[email protected]>
@steve-mt

steve-mt commented Dec 15, 2023

@stevexuereb I'm not sure why your change doesn't fail to build (since that RUN command starts with set -eu), but your ln command is wrong. You need to pass it the -f flag to overwrite the existing symlink.

I'm gonna open a PR.

@akerouanton interesting, it doesn't seem to have existing symlinks; I wasn't installing the iptables apk.

I've tested your PR and it seems to be working:

$ gcloud compute instances create docker-test-cos-85 --image cos-85-13310-1498-7 --zone=us-east1-c

$ gcloud compute ssh docker-test-cos-85

steve@docker-test-cos-85 ~ $ git clone https://github.com/akerouanton/docker-doi.git

steve@docker-test-cos-85 ~/docker-doi $ git switch fix-nf_tables
steve@docker-test-cos-85 ~/docker-doi $ cd 24/dind
steve@docker-test-cos-85 ~/docker-doi/24/dind $ docker build -t sxuereb:dind .
steve@docker-test-cos-85 ~/docker-doi/24/dind $ docker run --rm --privileged sxuereb:dind
Certificate request self-signature ok
subject=CN = docker:dind server
/certs/server/cert.pem: OK
Certificate request self-signature ok
subject=CN = docker:dind client
/certs/client/cert.pem: OK
ip: can't find device 'nf_tables'
modprobe: can't change directory to '/lib/modules': No such file or directory
ip: can't find device 'ip_tables'
...
time="2023-12-15T13:49:31.883884128Z" level=info msg="Docker daemon" commit=311b9ff graphdriver=overlay2 version=24.0.7
time="2023-12-15T13:49:31.884533827Z" level=info msg="Daemon has completed initialization"
time="2023-12-15T13:49:31.931701127Z" level=info msg="API listen on /var/run/docker.sock"
time="2023-12-15T13:49:31.932209555Z" level=info msg="API listen on [::]:2376"

@tianon (Member)

tianon commented Dec 15, 2023

Ok, fix should be mostly deployed now. 👍

@steve-mt

Thank you @tianon and @yosifkit for fixing this problem, we appreciate it 🙇 🚀

@tianon (Member)

tianon commented Dec 18, 2023

@stevexuereb would you be able to test or help coordinate a test of #468 on GitLab to make sure I don't cause a regression again? 😅

(docker build --pull 'https://github.com/docker-library/docker.git#refs/pull/468/merge:24/dind', in case that's a helpful one-liner for you to get something running/tested)

@steve-mt

@stevexuereb would you be able to test or help coordinate a test of #468 on GitLab to make sure I don't cause a regression again? 😅

(docker build --pull 'https://github.com/docker-library/docker.git#refs/pull/468/merge:24/dind', in case that's a helpful one-liner for you to get something running/tested)

@tianon certainly, I left my testing results in #468 (comment); let me know if you need something different 🙇


@tianon I'd be curious: would it be possible to set up a test in GitHub Actions that builds the image in a PR, publishes it to some registry (GitHub, Docker Hub), and possibly triggers a pipeline on GitLab with that image (so there is no infra cost for you), so that we validate each PR going forward instead of running #468 (comment) manually? We could also keep it vendor-agnostic and trigger jobs in GitHub Actions as well, since ARC was affected.

I'd be happy to see if I can coordinate with the Runner team to see if they can contribute this to the project, but I was curious whether this would be a good idea for this project or not.


@tianon (Member)

tianon commented Dec 19, 2023

We do test on GitHub Actions already, so it's odd that someone saw a failure there, but in general figuring out a good way to automate testing on GitLab's infrastructure is a great idea and one I'm personally very open to. 😅

Edit: ah, "Kubernetes controller for GitHub Actions self-hosted runners" -- that probably failed because it's not GitHub's runners (so more similar to #466 / #467)

wcjordan added two commits to wcjordan/chalk that referenced this issue Dec 21, 2023
@yuvipanda

yuvipanda commented Jan 8, 2024

I'm still running into the same issue with docker:24.0.7-dind, with the same error message listed here. I made sure to pull the latest tagged version (docker@sha256:96e0ecc1b024d393519f4bb53ac68fcd3caf0025b61c2c110b71ff82f960aa6c at the time of this writing), with no luck. This is on GKE v1.27.4-gke.900, running Container-Optimized OS from Google with kernel version 5.15.120+, COS version 105.

I've temporarily moved us back to 24.0.6-dind, which does work. Any other debugging info I can help provide?

@tianon (Member)

tianon commented Jan 8, 2024

The change in #468 isn't actually pushed all the way to the published images yet: docker-library/official-images#16009

According to #468 (comment), COS 105 will probably need --env DOCKER_IPTABLES_LEGACY=1 even with that change.
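For example (a sketch based on the run command used earlier in this thread; adjust the tag to whatever you deploy):

docker run --rm --privileged --env DOCKER_IPTABLES_LEGACY=1 docker:24.0.7-dind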

@Chickenmarkus

According to #468 (comment), COS 105 will probably need --env DOCKER_IPTABLES_LEGACY=1 even with that change.

For documentation purposes only, since internet searches lead to this issue.

It is not an issue with COS 105 in general, but with build ID 17412.226.68 and below. Those builds already include the kernel module nf_tables, which is detected and used by the Docker startup scripts. However, the kernel module is not yet functional, so the startup scripts fail.
Starting with build ID 17412.294.10, the kernel module nf_tables has officially been announced and is functional.

As a result, our GKE 1.26.6-gke.1700 (uses cos-101-17162-210-48) was working before the automatic update in the REGULAR release channel but stopped working after the upgrade to 1.27.8-gke.1067004 (uses cos-105-17412-226-62).
We had to manually upgrade to the RAPID release channel (1.29.1-gke.1589017 with cos-109-17800-66-78) or pin the version to 1.27.11-gke.1062000 (uses cos-105-17412-294-29).
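To check which COS build a node is running, you can read /etc/os-release on the node (this assumes shell access, e.g. via SSH; BUILD_ID identifies the build):

# run on the node itself
grep -E '^(VERSION|BUILD_ID)=' /etc/os-release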
