
ValueError("8-bit operations on bitsandbytes are not supported under CPU!") #10

Closed
Tianwei-She opened this issue Aug 15, 2022 · 9 comments
Labels: bug, documentation

Comments

@Tianwei-She

Hi Tim,

Thanks for your awesome work!

I'm using your method to load the largest BLOOM model (the 176B-parameter BLOOM) onto one node with 8 GPUs:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bloom",
    device_map="auto",
    load_in_8bit=True,
)

This works for all the other, smaller BLOOM models, e.g. bloom-7b1. However, when loading bloom (176B) I get the error "8-bit operations on bitsandbytes are not supported under CPU!":

File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2182, in from_pretrained
    raise ValueError("8-bit operations on `bitsandbytes` are not supported under CPU!")
ValueError: 8-bit operations on `bitsandbytes` are not supported under CPU!

In my understanding, this happens because some modules of the model are automatically loaded onto the CPU, which didn't happen with the smaller models. Is there a way to force the model to be loaded onto the GPUs only, or do you have any advice on how to get around this error? Thanks!!

Tianwei

@aninrusimha

From my testing, it seems the following happens when there is not enough memory available on the GPUs:
1. accelerate's automatic device selection sees device_map="auto" and puts some layers on the CPU;
2. this device map with CPU layers is passed onward;
3. the bitsandbytes code in transformers sees the CPU layers and raises this confusing error message.
My guess is that you don't have enough GPU memory for BLOOM.
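
A quick way to confirm this before downloading the full checkpoint is to build the device map on an empty (meta) model. The sketch below is not from this thread: it assumes accelerate and transformers are installed, uses accelerate's init_empty_weights and infer_auto_device_map, and sizing the weights as int8 only approximates what load_in_8bit=True will do; "bigscience/bloom" is used as the checkpoint name.

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton on the meta device; no weights are downloaded or allocated.
config = AutoConfig.from_pretrained("bigscience/bloom")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Ask accelerate where device_map="auto" would place each module.
device_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["BloomBlock"],  # keep each transformer block on one device
    dtype=torch.int8,                        # rough stand-in for the 8-bit weight size
)
offloaded = {name: dev for name, dev in device_map.items() if dev in ("cpu", "disk")}
print(offloaded or "all modules fit on the GPUs")

Any module that ends up on "cpu" or "disk" here is exactly what trips the ValueError once load_in_8bit=True is set.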

@younesbelkada (Collaborator) commented Aug 16, 2022

Hi @aninrusimha @Tianwei-She
I second what @aninrusimha said: this error is thrown when you don't have enough GPU RAM to fit the quantized model before it gets assigned to the correct GPU device.
Could you also tell us what type of GPU you are using?

@Tianwei-She (Author)

Thanks for the reply! I'm using an AWS g5.48xlarge instance, which has 192 GiB of GPU memory.

@younesbelkada (Collaborator)

Actually, I am a bit surprised it didn't fit on your GPUs. Since I don't have access to these machines, could you please try installing transformers from source (dev mode), i.e.:

git clone https://github.com/huggingface/transformers
cd transformers
pip install -e ".[dev]"

And then add

print(device_map)

Just before this line: https://github.com/huggingface/transformers/blob/6d175c1129538b27230be170fc1184e8490e95ef/src/transformers/modeling_utils.py#L2181

Also, could you point me to the exact commands you are using (or better, send me the full script)? Thanks!

@TimDettmers (Collaborator)

I believe the main issue here is that you need to pass a max_memory dictionary as an argument. By default, the automatic placement can allocate so much memory to the model that the mini-batch no longer fits onto the GPU. This then causes a CPU error.

Either decrease the mini-batch size and sequence length until it fits, or use a max_memory dictionary that leaves a couple of GB of memory free on each GPU. So if you have 24 GB of memory per GPU, you want to use only 22-23 GB. However, BLOOM-176B might not fit with 22 GB, and you may need slightly more, something like 22.5 GB, but I am not sure if floating-point values are supported for the max_memory dictionary. @younesbelkada do you know more?
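
For illustration (this exact snippet is not from the thread), the suggestion above translates into something like the following. The per-GPU budget is an assumed figure that would need tuning for the actual cards, and writing it in MiB sidesteps the question of whether fractional values such as "22.5GB" are accepted.

import torch
from transformers import AutoModelForCausalLM

# Assumed budget: 22 GiB per GPU for weights, leaving the rest for activations.
# A fractional budget such as 22.5 GiB can be written as "23040MiB" instead.
n_gpus = torch.cuda.device_count()
max_memory = {i: "22GiB" for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
    max_memory=max_memory,
)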

@Tianwei-She (Author) commented Aug 23, 2022

Thanks for replying!

@younesbelkada I printed out the device_map, and there are indeed some modules not on a GPU: 'transformer.h.69': 'disk' and 'transformer.ln_f': 'disk'.

{'transformer.word_embeddings': 0, 'lm_head': 0, 'transformer.word_embeddings_layernorm': 0, 'transformer.h.0': 0, 'transformer.h.1': 0, 'transformer.h.2': 0, 'transformer.h.3': 0, 'transformer.h.4': 0, 'transformer.h.5': 0, 'transformer.h.6': 1, 'transformer.h.7': 1, 'transformer.h.8': 1, 'transformer.h.9': 1, 'transformer.h.10': 1, 'transformer.h.11': 1, 'transformer.h.12': 1, 'transformer.h.13': 1, 'transformer.h.14': 1, 'transformer.h.15': 2, 'transformer.h.16': 2, 'transformer.h.17': 2, 'transformer.h.18': 2, 'transformer.h.19': 2, 'transformer.h.20': 2, 'transformer.h.21': 2, 'transformer.h.22': 2, 'transformer.h.23': 2, 'transformer.h.24': 3, 'transformer.h.25': 3, 'transformer.h.26': 3, 'transformer.h.27': 3, 'transformer.h.28': 3, 'transformer.h.29': 3, 'transformer.h.30': 3, 'transformer.h.31': 3, 'transformer.h.32': 3, 'transformer.h.33': 4, 'transformer.h.34': 4, 'transformer.h.35': 4, 'transformer.h.36': 4, 'transformer.h.37': 4, 'transformer.h.38': 4, 'transformer.h.39': 4, 'transformer.h.40': 4, 'transformer.h.41': 4, 'transformer.h.42': 5, 'transformer.h.43': 5, 'transformer.h.44': 5, 'transformer.h.45': 5, 'transformer.h.46': 5, 'transformer.h.47': 5, 'transformer.h.48': 5, 'transformer.h.49': 5, 'transformer.h.50': 5, 'transformer.h.51': 6, 'transformer.h.52': 6, 'transformer.h.53': 6, 'transformer.h.54': 6, 'transformer.h.55': 6, 'transformer.h.56': 6, 'transformer.h.57': 6, 'transformer.h.58': 6, 'transformer.h.59': 6, 'transformer.h.60': 7, 'transformer.h.61': 7, 'transformer.h.62': 7, 'transformer.h.63': 7, 'transformer.h.64': 7, 'transformer.h.65': 7, 'transformer.h.66': 7, 'transformer.h.67': 7, 'transformer.h.68': 7, 'transformer.h.69': 'disk', 'transformer.ln_f': 'disk'}

@TimDettmers
I've added max_memory as an argument; even with 23GB per GPU I'm still getting the error. The code I ran is:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Per-GPU memory budget passed alongside device_map="auto".
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
# max_memory = f'{free_in_GB-2}GB'
# max_memory = f'{free_in_GB}GB'
max_memory = '23GB'
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}
print(max_memory)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", load_in_8bit=True, max_memory=max_memory)

torch.cuda.mem_get_info()[0]/1024**3 is 21.5, and nvidia-smi shows each GPU has 23028 MiB of memory.

I understand this is most likely caused by insufficient GPU memory; however, I'm wondering how the BLOOM model was able to run on 8x RTX 3090s with 24 GB of memory each, as shown in the paper.

@Tianwei-She (Author)

@TimDettmers By the way, I also tried tuning the int8_threshold parameter; with int8_threshold = 0, the memory usage is the same as with the default int8_threshold = 6.0. Just wanted to confirm: is this expected? Thanks again for your help!
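
For reference (an aside, not something run in this thread): in current versions of transformers the threshold is configured through a BitsAndBytesConfig as llm_int8_threshold rather than as a bare int8_threshold argument, so a present-day version of this experiment would look roughly like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# llm_int8_threshold controls which outlier features are kept in fp16;
# 6.0 is the library default, and 0.0 disables the fp16 outlier path entirely.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    quantization_config=quant_config,
)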

@TimDettmers (Collaborator)

It is expected that thresholds 0 and 6 use close to the same memory with the current implementation. The difference should be on the order of a couple of megabytes.

If you are still receiving an error, you can try to tweak the exact amounts of memory reserved for the model and the activations. You might want to use a value between max_memory=22016MB (21.5 GB) and max_memory=22784MB (22.25 GB), which leaves the rest of the memory for the activations.

What is also important in this case is the maximum memory used for activations during inference. If your sequence dimension during inference is large, you might run out of memory at some point because the margins are so small.

In that case, you need to retweak the max_memory parameters. It could also help to remove the caching from the model, but I am not sure how to do that.
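
One concrete way to "remove the caching" mentioned above (my suggestion, not something confirmed in the thread) is to disable the key/value cache during generation. It trades speed for memory, since past attention states are recomputed at every step instead of being kept alive for the whole generation; model and tokenizer below are assumed to be the ones loaded in the script earlier in this issue.

# Disable the KV cache so per-layer key/value tensors are not retained across steps.
model.config.use_cache = False

# Inputs go to GPU 0, where the embedding layer sits in the device_map shown above.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
output_ids = model.generate(**inputs, max_new_tokens=20, use_cache=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))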

@TimDettmers added the bug and documentation labels on Sep 5, 2022
@TimDettmers (Collaborator)

I am closing this, as the issue is related to a part of the model being placed on the CPU, which is currently managed by the accelerate library. If this is still relevant, please open an issue there.

Regarding the BLOOM model, I will try to debug the situation and post examples to run BLOOM in a setup similar to yours.

techthiyanes pushed a commit to techthiyanes/bitsandbytes-1 that referenced this issue Jul 7, 2023
TNTran92 pushed a commit to TNTran92/bitsandbytes that referenced this issue Mar 24, 2024: Remove blocksize 64 for quant/dequant functions