Docker buildkit stuck with high CPU and unresponsive #4942
Comments
This looks similar to #4917 (comment). Do you have an example case or parameters for such builds? If you can provide a reproducible case, that would help a lot. I assume it is using remote cache export, as that is what is visible from the trace. You can also try https://github.com/moby/buildkit/blob/master/.github/issue_reporting_guide.md#reporting-deadlock when it looks to be hanging.
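For a daemon that appears hung, a goroutine stack dump is usually the most useful artifact to attach. A minimal sketch, assuming the builds run through dockerd's embedded BuildKit and that the daemon serves Go's standard net/http/pprof endpoints on its API socket; the dump_goroutines helper, the DOCKER_SOCK variable, and the default output file name are illustrative and not part of any tool:

```shell
# Hypothetical helper: save a full goroutine stack dump from the Docker daemon.
# Assumption: dockerd exposes /debug/pprof/* on its Unix socket.
dump_goroutines() {
  sock="${DOCKER_SOCK:-/var/run/docker.sock}"
  out="${1:-goroutines-$(date +%Y%m%dT%H%M%S).txt}"
  # debug=2 asks net/http/pprof for readable stacks of every goroutine.
  curl -sS --fail -o "$out" --unix-socket "$sock" \
    "http://localhost/debug/pprof/goroutine?debug=2" || return 1
  # Print the path of the file that was written.
  echo "$out"
}

# Usage (while a build is stuck):
#   dump_goroutines stuck-build-1.txt
```

Taking two dumps a few seconds apart makes it easier to tell goroutines that are genuinely blocked from ones that are busy but progressing.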
I don't have an example of how to reproduce it, but we do have some very large Dockerfiles (several hundred RUN commands, but in a multi-stage Docker build, so the manifest has fewer than 100 layers), so it could be related. Next time it happens I will follow the link you shared and update this ticket with what I gather.
@jogo-openai And you are using
Just checked; it doesn't look like we are. I checked based on https://docs.docker.com/build/cache/backends/
@tonistiigi hope this helps: Attached are two dumps from running
There seem to be multiple ongoing builds in the trace that are in the middle of creating provenance. This code reuses the cache export codepath (which confused me before) to find all the cache sources that have layer chains associated with them. I improved the performance of this part in #4947, which makes quite a big difference in my measurements, but as your trace shows, the current active function is
Thank you @AkihiroSuda!
Thank you for the fix. Unfortunately, we are still seeing the same issue with the latest release: https://github.com/docker/buildx/releases/tag/v0.15.1 should have buildkit 0.14.1, and buildkit 0.14 has this fix.
Attached is the debug output.
Symptoms: Every so often docker builds break (fail to complete), and upon further inspection most of the CPU on the system is consumed by the docker process itself. If we wait long enough things recover, but that can take a while.
When running pprof (
curl -o pprof --unix-socket /var/run/docker.sock http://./debug/pprof/profile?seconds=60
) we get the following, showing docker is spending its time in buildkit/solver:
Environment:
Large build systems (1+TB disk, 50+ cores) that are accessed using a remote docker build host as per
docker context inspect -f '{{json .Endpoints.docker.Host}}'
, so we have lots of concurrent builds, etc.
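The profile collected with the curl command above can be summarized as text, which is easier to paste into an issue than a screenshot. A rough sketch, assuming a Go toolchain is installed on the machine doing the analysis; the profile_dockerd helper, the DOCKER_SOCK variable, and the default file name are hypothetical names chosen for illustration:

```shell
# Hypothetical helper: capture a CPU profile from dockerd and list hot functions.
# Assumption: dockerd exposes /debug/pprof/* on its Unix socket.
profile_dockerd() {
  sock="${DOCKER_SOCK:-/var/run/docker.sock}"
  secs="${1:-60}"
  out="${2:-dockerd-cpu.pprof}"
  # Same endpoint as in the issue description; blocks for $secs seconds.
  curl -sS --fail -o "$out" --unix-socket "$sock" \
    "http://localhost/debug/pprof/profile?seconds=${secs}" || return 1
  # -top prints a text table of functions ordered by CPU time.
  go tool pprof -top "$out" | head -n 30
}

# Usage (while the daemon is consuming CPU):
#   profile_dockerd 60 dockerd-cpu.pprof
```

If the top entries fall under the buildkit solver package, that matches the symptom reported here.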