Does NCCL support virtual machines? #575
Comments
I'm not familiar with Bitfusion, but it seems to be sharing a GPU between multiple VM instances. This is very likely incompatible with NCCL, given each NCCL rank needs to run on a different physical GPU.
Thank you for your reply. I got the information from this link (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/2.5/rn/vmware-vsphere-bitfusion-compatibility-interop.html). VMware states that Bitfusion supports NCCL versions 2.3, 2.4, 2.5, and 2.8 and later. What I used in my experiment is NCCL 2.7.8, which is not supported by Bitfusion. I cannot change the NCCL version, even if I reinstall the NCCL library.
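To make the version comparison concrete, here is a minimal sketch that checks a version string against the list quoted above (2.3, 2.4, 2.5, and 2.8 and later). The list is taken from this comment's reading of the VMware page, not verified independently, and the helper names `parse_version` and `listed_as_supported` are hypothetical.

```python
# Sketch only: the supported set below mirrors the versions quoted in this
# thread (2.3, 2.4, 2.5, and 2.8 and later); it is an assumption, not an
# authoritative compatibility matrix.

def parse_version(text):
    """Split 'major.minor[.patch]' into a 3-tuple of ints, padding with zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def listed_as_supported(version_text):
    """Return True if the major.minor pair is in the quoted list."""
    major, minor, _patch = parse_version(version_text)
    if (major, minor) in {(2, 3), (2, 4), (2, 5)}:
        return True
    # "2.8 and later": tuple comparison handles 2.10 > 2.8 correctly.
    return (major, minor) >= (2, 8)


if __name__ == "__main__":
    for v in ("2.4.8", "2.7.8", "2.8.3", "2.10.3"):
        print(v, listed_as_supported(v))
```

Comparing tuples rather than strings matters here, because a plain string comparison would rank "2.10" below "2.8".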
From that page:
I'm not sure what that means, but it sounds like what I meant before. Now, regarding the NCCL version: many framework builds come with an NCCL version baked in, so you can't replace NCCL with a different version without changing or rebuilding the framework.
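One way to see which NCCL version a PyTorch build carries is to ask PyTorch itself. The sketch below assumes `torch.cuda.nccl.version()` is available in the installed build; depending on the PyTorch release it may return either an integer code (such as 2708) or a tuple, so the hypothetical helper `normalize_nccl_version` handles both. The integer decoding follows the `NCCL_VERSION` macro convention as I understand it and should be treated as an assumption.

```python
# Sketch only: reports the NCCL version bundled with the installed PyTorch,
# if any. Helper names are illustrative, not part of PyTorch or NCCL.

def normalize_nccl_version(raw):
    """Turn an int code or a tuple into (major, minor, patch)."""
    if isinstance(raw, tuple):
        return tuple(raw[:3])
    # Assumed encoding: major*1000 + minor*100 + patch up to 2.8,
    # major*10000 + minor*100 + patch from 2.9 onward.
    if raw >= 10000:
        return (raw // 10000, (raw % 10000) // 100, raw % 100)
    return (raw // 1000, (raw % 1000) // 100, raw % 100)


def bundled_nccl_version():
    """Return the NCCL version PyTorch was built with, or None if unknown."""
    try:
        import torch
        return normalize_nccl_version(torch.cuda.nccl.version())
    except Exception:
        # PyTorch missing, or built without NCCL support.
        return None


if __name__ == "__main__":
    print("NCCL bundled with PyTorch:", bundled_nccl_version())
```

If this still prints the old version after reinstalling the system NCCL library, that is consistent with the framework using its own bundled copy rather than the system one.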
OK! Thank you for your reply!
Recently I got a VM with 2 A100 GPUs. My team uses the new Dell XE8545 server (https://infohub.delltechnologies.com/p/accelerating-hpc-workloads-with-nvidia-a100-nvlink-on-dell-poweredge-xe8545/) and Bitfusion (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/index.html) to create the virtual machine. I want to use this VM to run data-parallel training with PyTorch. However, I have run into several problems with the environment. I have run it successfully on my lab's server (2 TITAN GPUs) without Bitfusion. Does NCCL not support such virtual machines? I know that NCCL does not support WSL (#442 (comment)). I look forward to your reply.