I am curious about how to dispatch a large language model (LLM) into smaller pieces across GPUs using the vLLM library. For example, in the transformers library, adding `device_map="auto"` to `from_pretrained` splits the model across the available GPUs.
Does vLLM have a similar feature? What parameters should I add to the following code to enable dispatching the LLM across GPUs?
When I try it, however, after a few seconds each GPU ends up using too much memory (46974 / 49140 MB). Thank you for your help!
Replies: 1 comment
[Self response]
llm = LLM(model=model_id, tensor_parallel_size=4, gpu_memory_utilization=0.5)
This solved the issue (solution from #550).
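For context on the numbers above: vLLM pre-allocates roughly `gpu_memory_utilization` of each GPU's total memory for the model weights plus the KV cache, and its default is 0.9. A quick back-of-the-envelope check (using the 49140 MB per-GPU total from the question) shows why the default looked like a leak and why 0.5 fixes it:

```python
# Approximate per-GPU memory budget for vLLM.
# vLLM pre-allocates roughly `gpu_memory_utilization` of each GPU's
# total memory (weights + KV cache); the total below is from the question.

GPU_TOTAL_MB = 49140  # reported total memory per GPU

def budget_mb(gpu_memory_utilization: float) -> float:
    """Approximate memory vLLM will reserve on each GPU, in MB."""
    return GPU_TOTAL_MB * gpu_memory_utilization

print(budget_mb(0.9))  # default 0.9 -> 44226.0 MB, close to the 46974 MB observed
print(budget_mb(0.5))  # gpu_memory_utilization=0.5 -> 24570.0 MB per GPU
```

So the high usage is not a bug: vLLM reserves that fraction up front for the KV cache regardless of model size, and lowering `gpu_memory_utilization` shrinks the reservation.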