I am curious about how to dispatch a large language model (LLM) into smaller pieces across GPUs using the vLLM library. For example, in the transformers library, adding `device_map="auto"` to `from_pretrained` splits the model across the available GPUs.
Does vLLM have a similar feature? What parameters should I add to the following code to enable dispatching the LLM across GPUs?
When I try it, however, after a few seconds each GPU ends up using too much memory (46974 / 49140 MB). Thank you for your help!
Replies: 1 comment
[Self response]
llm = LLM(model=model_id, tensor_parallel_size=4, gpu_memory_utilization=0.5)
This solved the issue (solution from #550).
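For context on the numbers above: vLLM pre-allocates roughly `gpu_memory_utilization` of each GPU's total memory for the model weights plus the KV cache, and its default is 0.9. A quick back-of-the-envelope check (using the 49140 MB per-GPU total from the question) shows why the default looked like a leak and why 0.5 fixes it:

```python
# Approximate per-GPU memory budget for vLLM.
# vLLM pre-allocates roughly `gpu_memory_utilization` of each GPU's
# total memory (weights + KV cache); the total below is from the question.

GPU_TOTAL_MB = 49140  # reported total memory per GPU

def budget_mb(gpu_memory_utilization: float) -> float:
    """Approximate memory vLLM will reserve on each GPU, in MB."""
    return GPU_TOTAL_MB * gpu_memory_utilization

print(budget_mb(0.9))  # default 0.9 -> 44226.0 MB, close to the 46974 MB observed
print(budget_mb(0.5))  # gpu_memory_utilization=0.5 -> 24570.0 MB per GPU
```

So the high usage is not a bug: vLLM reserves that fraction up front for the KV cache regardless of model size, and lowering `gpu_memory_utilization` shrinks the reservation.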