Docker buildkit stuck with high CPU and unresponsive #4942

Closed
jogo-openai opened this issue May 21, 2024 · 8 comments · Fixed by #4976

@jogo-openai

Symptoms: Every so often Docker builds fail to complete, and on inspection most of the CPU on the system is being consumed by the docker process itself. If we wait long enough things recover, but that can take a while.

When running pprof (curl -o pprof --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60) we get the following, showing that docker is spending its time in buildkit/solver:

[image: pprof profile]
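
For anyone who wants to script the collection step rather than run curl by hand, here is a minimal sketch in Python. It assumes the Docker daemon exposes its debug endpoints on /var/run/docker.sock, as in the curl command above. The names UnixHTTPConnection and fetch_debug are made up for this example and are not part of any Docker tooling.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=120):
        # "localhost" is only used for the Host header; the socket path
        # decides where the request actually goes.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_debug(path, socket_path="/var/run/docker.sock", timeout=120):
    """GET a debug endpoint over the Unix socket and return the raw bytes.

    The timeout must exceed the profiling window, e.g. more than 60 seconds
    for /debug/pprof/profile?seconds=60.
    """
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status} for {path}")
        return resp.read()
    finally:
        conn.close()
```

Usage would look like fetch_debug("/debug/pprof/profile?seconds=60"), with the returned bytes written to a file and opened with go tool pprof.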

environment:

ii  docker-buildx-plugin          0.14.0-1~ubuntu.22.04~jammy             amd64
ii  docker-ce                     5:26.0.2-1~ubuntu.22.04~jammy           amd64
ii  docker-ce-cli                 5:26.0.2-1~ubuntu.22.04~jammy           amd64
ii  docker-compose-plugin         2.27.0-1~ubuntu.22.04~jammy             amd64

We run large build systems (1+ TB disk, 50+ cores) that are accessed as a remote Docker build host (as reported by docker context inspect -f '{{json .Endpoints.docker.Host}}'), so we have many concurrent builds.

@tonistiigi
Member

This looks similar to #4917 (comment). Do you have an example case or parameters for such builds? A reproducible case would help a lot. I assume the build is using remote cache export, as that is what is visible in the trace.

You can also follow https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock when it appears to be hanging.

@jogo-openai
Author

I don't have a way to reproduce this, but we do have some very large Dockerfiles (several hundred RUN commands, in a multi-stage build, so the manifest has fewer than 100 layers), so that could be related. Next time it happens I will follow the link you shared and update this ticket with what I gather.

@tonistiigi
Member

@jogo-openai And are you using --export-cache?

@jogo-openai
Author

Just checked; it doesn't look like we are. I checked against https://docs.docker.com/build/cache/backends/

@jogo-openai
Author

@tonistiigi hope this helps:

Attached are two dumps from running debug/pprof/goroutine?debug=2 as per https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock

dump-2.txt
dump.txt
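
In case it helps anyone triaging similar dumps, here is a rough sketch of how a goroutine?debug=2 dump could be summarized in Python. It assumes the standard Go text format, where each goroutine is a blank-line-separated block starting with a "goroutine N [state]:" header followed by the innermost frame. The helper name summarize_goroutines is made up for this example.

```python
import re
from collections import Counter

# Header of each goroutine block, e.g. "goroutine 42 [select, 5 minutes]:"
HEADER = re.compile(r"^goroutine (\d+) \[([^\]]*)\]:$")


def summarize_goroutines(dump_text, needle=None):
    """Count goroutines by innermost function.

    Returns (top_frames, hits), where top_frames is a Counter keyed by the
    function at the top of each stack and hits is the number of goroutines
    whose stack mentions `needle` anywhere (0 if needle is None).
    """
    top_frames = Counter()
    hits = 0
    for block in re.split(r"\n\s*\n", dump_text.strip()):
        lines = block.splitlines()
        if not lines or not HEADER.match(lines[0]):
            continue  # skip anything that is not a goroutine block
        if len(lines) > 1:
            # Drop the trailing argument list: "pkg.Func(0x1, 0x2)" -> "pkg.Func".
            # rsplit keeps receivers such as "pkg.(*T).Method" intact.
            top_frames[lines[1].rsplit("(", 1)[0].strip()] += 1
        if needle is not None and needle in block:
            hits += 1
    return top_frames, hits
```

Calling summarize_goroutines(open("dump.txt").read(), needle="addBacklinks") and then top_frames.most_common(10) gives a quick view of where goroutines are sitting and how many stacks involve a given function.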

thompson-shaun added this to the v0.next milestone May 23, 2024

@tonistiigi
Member

There seem to be multiple ongoing builds in the trace that are in the middle of creating provenance. This code reuses the cache export codepath (which is what confused me before) to find all the cache sources that have layer chains associated with them.

I improved the performance of this part in #4947, which makes quite a big difference in my measurements, but since your trace shows that the currently active function is addBacklinks, I'm not sure it will help in your case. For provenance creation we don't actually need to create new cache relationships (these would only be needed for an actual cache export), so I think we can fix your issue by skipping these calls. But I would like to get to the bottom of which case is causing so many of these requests. It seems to be some combination of the commands you run and how they are shared between parallel builds.

@jogo-openai
Author

Thank you @AkihiroSuda!

@jogo-openai
Author

Thank you for the fix. Unfortunately, we are still seeing the same issue with the latest release.

https://github.com/docker/buildx/releases/tag/v0.15.1 should include buildkit 0.14.1, and buildkit 0.14 has this fix:
https://github.com/moby/buildkit/releases/tag/v0.14.0

ii  docker-ce                     5:26.0.2-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                 5:26.0.2-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-compose-plugin         2.28.1-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.

Attached is the debug output from:
curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/goroutine?debug=2

log.txt
