Missing CUDA libraries when using bitsandbyes with Dockerfile #1923

Closed
tobrien6 opened this issue Jan 5, 2023 · 3 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments


tobrien6 commented Jan 5, 2023

Describe the bug

I am trying to run diffusers in Docker. I am using the diffusers-pytorch-cuda Dockerfile provided in the repo. The only change I made is to add diffusers and bitsandbytes to the pip install list in the Dockerfile.

When I enable --use_8bit_adam for Dreambooth training, I get this error:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
ERROR: /opt/venv/bin/python3: undefined symbol: cudaRuntimeGetVersion
CUDA SETUP: libcudart.so path is None
CUDA SETUP: Is seems that your cuda installation is not in your path. See bitsandbytes-foundation/bitsandbytes#85 for more information.
CUDA SETUP: CUDA version lower than 11 are currently not supported for LLM.int8(). You will be only to use 8-bit optimizers and quantization routines!!
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 00
CUDA SETUP: Loading binary /opt/venv/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
Traceback (most recent call last):
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 828, in <module>
main(args)
File "/content/diffusers/examples/dreambooth/train_dreambooth.py", line 642, in main
lr_scheduler = get_scheduler(
TypeError: get_scheduler() got an unexpected keyword argument 'num_cycles'
Traceback (most recent call last):
File "/opt/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/opt/venv/lib/python3.8/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

I tried adding this line to the Dockerfile to get it to point to the CUDA library, but it did not work:
ENV LD_LIBRARY_PATH="/usr/local/cuda-11.7/targets/x86_64-linux/lib/:$LD_LIBRARY_PATH"
EDIT: I realized that a file named exactly libcudart.so does not exist anywhere in the filesystem of the Docker container. The path I added contains only libcudart.so.11.0, which is also present in the path bitsandbytes was already searching. I'm not sure what to do here.
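
To confirm which libcudart files are actually visible inside the container, a small diagnostic along these lines may help. This is only a sketch: the function name `find_libcudart` is hypothetical, and the directory list (entries from `LD_LIBRARY_PATH` plus `/usr/local/cuda/lib64`, the fallback named in the log above) is an assumption about where to look, not a description of how bitsandbytes searches.

```python
import os
from pathlib import Path


def find_libcudart(search_dirs):
    """Return every file named libcudart.so* directly inside the given directories."""
    found = []
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            # Skip entries that do not exist in this container.
            continue
        found.extend(sorted(root.glob("libcudart.so*")))
    return found


if __name__ == "__main__":
    # Assumed search locations: LD_LIBRARY_PATH entries plus the fallback from the log.
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    dirs.append("/usr/local/cuda/lib64")
    for path in find_libcudart(dirs):
        # Show where each candidate really points, in case it is a symlink.
        print(path, "->", path.resolve())
```

If the output lists only versioned names such as libcudart.so.11.0 and no plain libcudart.so, that matches the situation described in the EDIT above.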

Reproduction

Described above

Logs

No response

System Info

Using Debian 10 with a 16GB T4, 4 cores and 15GB system RAM on Google Compute Engine

@tobrien6 tobrien6 added the bug Something isn't working label Jan 5, 2023
Member

anton-l commented Jan 5, 2023

@tobrien6 one of the solutions in the bitsandbytes issue (linked in the error message) suggests adding a symlink for libcudart.so: bitsandbytes-foundation/bitsandbytes#85 (comment)

If that doesn't help, try posting your workflow in bitsandbytes-foundation/bitsandbytes#85
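
The symlink idea could be sketched as follows. The helper name `ensure_unversioned_libcudart` is hypothetical, and the default versioned file name `libcudart.so.11.0` is taken from the original report. Whether the link resolves the bitsandbytes error is not confirmed anywhere in this thread, so treat it as something to try rather than a known fix.

```python
from pathlib import Path


def ensure_unversioned_libcudart(lib_dir, versioned_name="libcudart.so.11.0"):
    """Create lib_dir/libcudart.so as a symlink to the versioned library, if missing."""
    lib_dir = Path(lib_dir)
    target = lib_dir / versioned_name
    link = lib_dir / "libcudart.so"

    if not target.exists():
        raise FileNotFoundError(f"{target} not found; nothing to link to")

    if link.exists() or link.is_symlink():
        # Leave any existing file or link untouched.
        return link

    # A relative link keeps working if the directory is moved or mounted elsewhere.
    link.symlink_to(versioned_name)
    return link
```

It would be called with the directory that holds libcudart.so.11.0 inside the container, and the process needs write permission there. In a Dockerfile the same effect is normally achieved with a build step that creates the link.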

Author

tobrien6 commented Jan 6, 2023

I was able to fix this by switching the NVIDIA Docker base image (pulled in at the top of the diffusers Dockerfile) from the CUDA 11.7 version to the CUDA 11.6 version.
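
After changing the base image, one way to check which CUDA runtime the dynamic loader can see is to call `cudaRuntimeGetVersion` directly through `ctypes`. This is a sketch, not what bitsandbytes does internally. The function names are hypothetical, and it assumes a library loadable as `libcudart.so`. The CUDA runtime reports its version as `1000 * major + 10 * minor`.

```python
import ctypes


def decode_cuda_version(raw):
    """Split the integer reported by the CUDA runtime into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def cuda_runtime_version(lib_name="libcudart.so"):
    """Return (major, minor) of the loadable CUDA runtime, or None if unavailable."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # The loader could not find or open the library.
        return None

    get_version = getattr(lib, "cudaRuntimeGetVersion", None)
    if get_version is None:
        # The library loaded but does not export the expected symbol.
        return None

    version = ctypes.c_int(0)
    status = get_version(ctypes.byref(version))
    if status != 0:
        # A non-zero status is a CUDA error code.
        return None
    return decode_cuda_version(version.value)


if __name__ == "__main__":
    print("CUDA runtime:", cuda_runtime_version())
```

A result of `None` means the runtime could not be loaded under that name, which is consistent with the "libcudart.so path is None" line in the log above.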

Now I'm getting an error "ValueError: Attempting to unscale FP16 gradients." which I see discussed elsewhere.


github-actions bot commented Feb 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 4, 2023