ValueError("8-bit operations on bitsandbytes are not supported under CPU!") #10
Comments
From my testing it seems the following happens when not enough memory is available on GPU:
Hi @aninrusimha @Tianwei-She,
thanks for the reply! I'm using an AWS g5.48xlarge instance, which has 192 GiB of GPU memory.
Actually I am a bit surprised it didn't fit on your GPUs. Since I don't have access to these machines, could you install transformers from source:

```
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"
```

And then add `print(device_map)` just before this line: https://github.com/huggingface/transformers/blob/6d175c1129538b27230be170fc1184e8490e95ef/src/transformers/modeling_utils.py#L2181

Also, could you point me to the exact commands you are using (or better, send me the full script)? Thanks
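The point of printing the `device_map` is to see whether any module landed on CPU, since that is what triggers the error in the issue title. A minimal sketch of that check (the `find_cpu_modules` helper is hypothetical, not part of transformers or accelerate):

```python
def find_cpu_modules(device_map):
    """Return the module names that were dispatched to CPU or disk.

    Any such module will trigger the "8-bit operations on bitsandbytes
    are not supported under CPU!" error, because int8 weights must live
    on a GPU.
    """
    return [name for name, device in device_map.items()
            if device in ("cpu", "disk")]

# Illustrative device_map, shaped like what print(device_map) shows
# for a BLOOM-style model (module names are examples):
example_map = {
    "transformer.word_embeddings": 0,
    "transformer.h.0": 0,
    "transformer.h.69": 7,
    "lm_head": "cpu",   # this placement would cause the ValueError
}
print(find_cpu_modules(example_map))  # → ['lm_head']
```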
I believe the main issue here is that you need to leave some GPU memory free. Either decrease mini-batch size and sequence length until it fits, or use a `max_memory` dictionary which leaves a couple of GB of memory free on each GPU. So if you have 24 GB of memory per GPU, you want to use 22-23 GB only. However, BLOOM-176B might not fit with 22 GB and you need slightly more, something like 22.5 GB, but I am not sure if floating point values are supported for the `max_memory` values.
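The headroom advice above can be sketched as follows; `build_max_memory` is a hypothetical helper, and the 22.5 GB figure comes from the comment, not from any library default:

```python
def build_max_memory(num_gpus, total_gib=24, headroom_gib=1.5):
    """Build a max_memory dict that reserves `headroom_gib` on each GPU
    for activations and CUDA overhead.

    max_memory values can be strings like "22.5GiB"; whether fractional
    amounts are accepted may depend on your accelerate version.
    """
    per_gpu = f"{total_gib - headroom_gib}GiB"
    return {i: per_gpu for i in range(num_gpus)}

max_memory = build_max_memory(8)
print(max_memory[0])  # → 22.5GiB

# Usage sketch (not runnable without the weights and 8 GPUs):
# model = AutoModelForCausalLM.from_pretrained(
#     "bigscience/bloom", device_map="auto",
#     load_in_8bit=True, max_memory=max_memory)
```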
Thanks for replying! @younesbelkada I printed out the `device_map`.
@TimDettmers
I understand this is most likely caused by insufficient GPU memory. However, I'm wondering how the BLOOM model was able to run on 8x RTX 3090 GPUs with 24 GB of memory each, as shown in the paper.
@TimDettmers btw I also tried tuning the threshold parameter.
It is as expected that thresholds 0 and 6 use close to the same memory with the current implementation; the difference should be on the order of a couple of megabytes. If you are still receiving an error, you can try to tweak the exact amounts of memory reserved for the model and the activations.

What is also important in this case is the maximum memory used for activations during inference. If your sequence dimension during inference is high, you might run out of memory at some point because the margins are so small. In that case, you need to retweak the `max_memory` values.
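The sequence-length concern above can be illustrated with a back-of-envelope estimate. This is a deliberate simplification (fp16 hidden states for a single layer only, ignoring attention scores and temporary buffers), and the helper name is hypothetical; the hidden size 14336 is BLOOM-176B's:

```python
def rough_activation_gib(batch, seq_len, hidden=14336, bytes_per_el=2):
    """Very rough fp16 memory for one layer's hidden-state activations.

    Real usage is higher (attention matrices, intermediate MLP widths),
    but the scaling with batch and sequence length is the point here.
    """
    return batch * seq_len * hidden * bytes_per_el / 1024**3

# Doubling the sequence length doubles this footprint, which is why
# a max_memory margin that works at short sequences can fail later.
print(rough_activation_gib(1, 2048))  # → 0.0546875 (GiB per layer)
```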
I am closing this, as the issue relates to a part of the model being placed on the CPU, which is currently managed by the accelerate library. If this is still relevant, please open an issue there. Regarding the BLOOM model, I will try to debug the situation and post examples to run BLOOM in a setup similar to yours.
Hi Tim,
Thanks for your awesome work!
I'm using your method to load the largest BLOOM model (the BLOOM model with 176b parameters) onto 1 node with 8 GPUs.
This line works for all the other smaller BLOOM models, e.g. bloom-7b1. However, when loading bloom (176b) I got the error "8-bit operations on bitsandbytes are not supported under CPU!". In my understanding, this is because some modules of the model are automatically loaded onto CPU, which didn't happen with the smaller models. Is there a way to force the model to be loaded onto GPU only? Or do you have any advice on how to bypass this error? Thanks!!
Tianwei