Running model parallel inference #88
I was able to run 7B on two 1080 Tis (inference only). Next, I'll try 13B and 33B. It still needs refining, but it works! I forked LLaMA here: https://github.com/modular-ml/wrapyfi-examples_llama, and the README has instructions on how to do it:

LLaMA with Wrapyfi

Wrapyfi enables distributing LLaMA (inference only) across multiple GPUs/machines, each with less than 16 GB VRAM. It currently distributes over two cards only, using ZeroMQ, but will support flexible distribution soon! This approach has only been tested on the 7B model so far, on Ubuntu 20.04 with two 1080 Tis. Testing the 13B/30B models soon!

How to?
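For anyone curious how a split like this can work: the model's layers are partitioned between two processes, and the activations at the cut point are shipped between them over ZeroMQ. The sketch below is a minimal illustration of that idea using plain pyzmq rather than Wrapyfi's actual API; the layer split, endpoint, and helper names are all hypothetical.

```python
# Minimal sketch of pipeline-split inference over ZeroMQ (NOT Wrapyfi's API).
# Process A runs the first half of the layers on GPU 0 and sends the
# activations at the cut point; process B runs the second half on GPU 1
# and returns the logits. The endpoint and module names are hypothetical.
import io
import torch
import zmq

def send_tensor(sock, t):
    buf = io.BytesIO()
    torch.save(t.cpu(), buf)               # serialize the tensor to bytes
    sock.send(buf.getvalue())

def recv_tensor(sock):
    return torch.load(io.BytesIO(sock.recv()))

# --- process A (GPU 0, first half of the model) ---
def run_first_half(first_half, tokens):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("tcp://127.0.0.1:5555")   # hypothetical endpoint
    h = first_half(tokens.to("cuda:0"))    # hidden states at the cut point
    send_tensor(sock, h)
    return recv_tensor(sock)               # logits come back from process B

# --- process B (GPU 1, second half of the model) ---
def run_second_half(second_half):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://127.0.0.1:5555")
    while True:
        h = recv_tensor(sock).to("cuda:1")
        send_tensor(sock, second_half(h))  # return logits to process A
```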
This is something we should cover in the llama-recipes repo. Thanks for raising! cc @HamidShojanazeri
I am trying to run inference with the 7B parameter model on 4x 2080 Ti, but the default inference script gives me a CUDA OOM error. Is there a way to split the model across multiple GPUs and perform inference?
Thank you!
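If you can use the Hugging Face transformers port of LLaMA rather than the original repo's script, one common way to fit a model that overflows a single GPU is Accelerate's device_map="auto", which shards the weights across all visible GPUs at load time. A minimal sketch, assuming transformers and accelerate are installed; the checkpoint path is a placeholder for your own converted weights:

```python
# Sketch: sharding LLaMA-7B across several GPUs with Accelerate's
# device_map="auto" (requires `pip install transformers accelerate`).
# "path/to/llama-7b-hf" is a placeholder for a converted HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",
    torch_dtype=torch.float16,   # fp16 halves memory vs. fp32
    device_map="auto",           # spread layers over all visible GPUs
)

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```

With four 11 GB cards this gives roughly 44 GB of pooled VRAM, which is comfortably more than the ~14 GB the 7B model needs in fp16.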