Does NCCL support virtual machines? #575

Closed
ljz756245026 opened this issue Sep 27, 2021 · 4 comments

Comments

@ljz756245026

Recently I got a VM with 2 A100 GPUs. My team uses the new Dell XE8545 server (https://infohub.delltechnologies.com/p/accelerating-hpc-workloads-with-nvidia-a100-nvlink-on-dell-poweredge-xe8545/) and Bitfusion (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/index.html) to create the virtual machine. I want to use this VM to run data-parallel training with PyTorch, but I have run into several problems with the environment. The same setup works on my lab's server (2 TITAN GPUs) without Bitfusion. Does NCCL not support such virtual machines? I know that NCCL does not support WSL (#442 (comment)).

I am looking forward to your reply.

@sjeaugey
Member

I'm not familiar with bitfusion, but it seems to be sharing a GPU between multiple VM instances. This is very likely incompatible with NCCL, given each NCCL rank needs to run on a different physical GPU.

@ljz756245026
Author

Thank you for your reply. I got the information from this link (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/2.5/rn/vmware-vsphere-bitfusion-compatibility-interop.html). VMware states that Bitfusion supports NCCL versions 2.3, 2.4, 2.5, and 2.8 and later. My experiment uses NCCL 2.7.8, which Bitfusion does not support. I cannot change the NCCL version, even after reinstalling the NCCL library.
Do you have any suggestions on how to change the installed NCCL version? I tried reinstalling it and rebooted the machine, but the NCCL version did not change.
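
For reference, the version list quoted above can be turned into a small check. This is only a sketch: the supported set (2.3, 2.4, 2.5, and 2.8 and later) is copied from the description in this comment, not from any VMware API, and the function name `is_listed_by_bitfusion` is hypothetical.

```python
# Versions taken from the list quoted in this thread (an assumption, not
# an authoritative compatibility matrix).
LISTED_MAJOR_MINOR = {(2, 3), (2, 4), (2, 5)}
LISTED_FROM = (2, 8)  # "2.8 and later"


def is_listed_by_bitfusion(version):
    """Return True if (major, minor, ...) falls in the list quoted above."""
    major_minor = tuple(version[:2])
    return major_minor in LISTED_MAJOR_MINOR or major_minor >= LISTED_FROM


if __name__ == "__main__":
    # 2.7.8 is the version reported in this issue.
    print(is_listed_by_bitfusion((2, 7, 8)))
```

Under that list, 2.7.8 falls in the gap between 2.5 and 2.8, which matches the problem described here.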

@sjeaugey
Member

sjeaugey commented Sep 28, 2021

From that page:

> Using NCCL with multi-process applications that run on different vSphere Bitfusion clients is not supported.

I'm not sure what that means but it sounds like what I meant before.

Now, regarding the NCCL version: many framework builds come with an NCCL version baked in, so you can't replace NCCL with a different version without changing or rebuilding the framework.
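
To see which NCCL version a PyTorch build actually carries, something like the sketch below may help. It assumes PyTorch is the framework in use (as mentioned earlier in the thread) and relies on `torch.cuda.nccl.version()`, which returns either an integer code or a tuple depending on the PyTorch release. The helper names `decode_nccl_version` and `bundled_nccl_version` are made up for illustration.

```python
def decode_nccl_version(code):
    """Turn NCCL's integer version code into (major, minor, patch)."""
    # NCCL up to 2.8 encodes versions as major*1000 + minor*100 + patch
    # (2708 -> 2.7.8); 2.9 and later use major*10000 + minor*100 + patch
    # (21003 -> 2.10.3).
    if code < 10000:
        return (code // 1000, (code % 1000) // 100, code % 100)
    return (code // 10000, (code % 10000) // 100, code % 100)


def bundled_nccl_version():
    """Return the NCCL version PyTorch reports, or None if unavailable."""
    try:
        from torch.cuda import nccl
    except ImportError:
        return None  # PyTorch missing, or built without this module
    try:
        raw = nccl.version()
    except Exception:
        return None  # e.g. a CPU-only build without NCCL
    if isinstance(raw, int):
        return decode_nccl_version(raw)
    return tuple(raw)[:3]  # newer releases already return a tuple


if __name__ == "__main__":
    print(bundled_nccl_version())
```

If the reported version stays the same after reinstalling the system NCCL package, the framework is most likely using its own bundled copy.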

@ljz756245026
Author

OK! Thank you for your reply!
I see now that the problem is caused by the Bitfusion software, not by a bug in NCCL. Thank you for your patience!
