NCCL tests don't work on WSL #442
Comments
I have exactly the same problem... |
Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it. |
I think this is also why I cannot use multi-GPU training with PyTorch. When I use PyTorch DataParallel, it gives me a similar NCCL error. |
I also ran into the issue of NCCL simply not supporting WSL environments. It would have helped to have the lack of support documented here: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations . This thread might be the only place on the net where a dev has said anything on the topic. |
I may have the same error. I'm trying to use multi-GPU across two nodes, one of which is a WSL2 environment, but the NCCL communicator hangs and displays "cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error" on the WSL2 side only. Looking forward to the fix. |
Any update on this issue? I need NCCL support for WSL2 so that I can use Transfer Learning Toolkit 3 on my Windows desktop through WSL2. |
NCCL 2.10.3 was released last week and it should support WSL2 with a single GPU. Multi-GPU has not been validated yet. |
Still doesn't work with the latest upgrades to TAO on WSL2 with the newest driver, 510.06. The output follows:
|
From your log:
Note, NCCL might have been compiled statically with tensorflow, so upgrading NCCL might not be enough to use the newest version. |
The current status should be that NCCL isn't supported (on multiple GPUs) for WSL. |
Same issue here with WSL2 (Windows 11), driver 510.06 and torch 1.9.1.cu111 with 2x 2080 Super. |
NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version. |
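To make the version thresholds mentioned in this thread concrete, here is a minimal sketch that classifies an NCCL version against them. The thresholds (2.10.3 for single GPU, 2.11.4 for multi-GPU) come only from the maintainer comments above, not from official documentation, and the function name `wsl2_support` is hypothetical.

```python
from typing import Tuple

# Thresholds as stated in this thread by the maintainers (an assumption,
# not an official support matrix): 2.10.3 should work on WSL2 with a
# single GPU, and 2.11.4 has been tested with multiple GPUs on Win11.
SINGLE_GPU_MIN = (2, 10, 3)
MULTI_GPU_MIN = (2, 11, 4)


def wsl2_support(version: Tuple[int, int, int]) -> str:
    """Classify an NCCL (major, minor, patch) version against the thread's thresholds."""
    if version >= MULTI_GPU_MIN:
        return "multi-gpu"
    if version >= SINGLE_GPU_MIN:
        return "single-gpu"
    return "unsupported"


if __name__ == "__main__":
    # Versions that appear in this thread.
    for v in [(2, 7, 8), (2, 8, 3), (2, 10, 3), (2, 11, 4)]:
        print(".".join(map(str, v)), "->", wsl2_support(v))
```

Driver and OS requirements are not covered here, since the thread leaves them unspecified.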
@AddyLaddy Thanks for getting back to me. I checked and Torch 1.9.1.cu111 apparently uses NCCL 2.7.8. Will have to see what our options are now. |
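For anyone who wants to check which NCCL version their PyTorch build actually uses, a rough sketch follows. It assumes `torch.cuda.nccl.version()` is available in your build; depending on the PyTorch release it may return either a packed integer or a tuple, so both cases are handled. The helper names `decode_nccl_version` and `pytorch_nccl_version` are hypothetical.

```python
def decode_nccl_version(code: int) -> tuple:
    """Decode a packed NCCL version code into (major, minor, patch).

    nccl.h packs versions up to 2.8.x as major*1000 + minor*100 + patch
    and later versions as major*10000 + minor*100 + patch.
    """
    if code < 10000:
        major, rest = divmod(code, 1000)
    else:
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def pytorch_nccl_version():
    """Return the NCCL version PyTorch reports, or None if it cannot be queried."""
    try:
        import torch  # optional dependency; not needed for decode_nccl_version
    except ImportError:
        return None
    try:
        raw = torch.cuda.nccl.version()
    except Exception:
        # Builds without NCCL or without CUDA support may fail here.
        return None
    if isinstance(raw, int):
        return decode_nccl_version(raw)
    return tuple(raw)[:3]


if __name__ == "__main__":
    print("PyTorch reports NCCL:", pytorch_nccl_version())
```

If this reports an older version than the one installed system-wide, the PyTorch build is most likely carrying its own bundled copy of NCCL.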
@AddyLaddy How can I unlink the old NCCL from PyTorch and update it to NCCL 2.11.4? I have installed version 2.11.4 in WSL2 and it passes nccl-tests. However, when training the model, PyTorch 1.7.1 still calls NCCL 2.7.8. |
I'm not a PyTorch expert, but I believe you need to configure and rebuild it using the USE_SYSTEM_NCCL=1 option. Perhaps ask in a PyTorch forum for help? |
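As a rough illustration of the suggestion above, this sketch prepares a source build of PyTorch with `USE_SYSTEM_NCCL=1` set in the environment. It assumes a local PyTorch source checkout and the `python setup.py develop` build entry point, which may differ between PyTorch releases; the function name `system_nccl_build` and the checkout path are hypothetical, and nothing is executed unless you uncomment the last line.

```python
import os
import subprocess
import sys


def system_nccl_build(pytorch_src: str):
    """Return (command, environment, working dir) for a build against the system NCCL.

    Assumption: the checkout at `pytorch_src` is built with `setup.py develop`
    and honors the USE_SYSTEM_NCCL variable mentioned in this thread.
    """
    env = dict(os.environ)
    env["USE_SYSTEM_NCCL"] = "1"  # ask the build to link the installed NCCL
    cmd = [sys.executable, "setup.py", "develop"]
    return cmd, env, pytorch_src


if __name__ == "__main__":
    # "/path/to/pytorch" is a placeholder for your own checkout.
    cmd, env, cwd = system_nccl_build("/path/to/pytorch")
    print("Would run:", " ".join(cmd), "in", cwd, "with USE_SYSTEM_NCCL=" + env["USE_SYSTEM_NCCL"])
    # subprocess.run(cmd, cwd=cwd, env=env, check=True)  # uncomment to actually build
```

After such a rebuild, it is worth confirming the NCCL version PyTorch reports before retrying multi-GPU training.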
@AddyLaddy Thank you very much. I'll try to recompile PyTorch. |
Hi. I ran into the same issue recently. Did recompiling PyTorch work?
I've installed NCCL and its tests on WSL. When trying to run a test like this:
I get the following error message:
The debug log shows this:
Version of NCCL: version 2.8.3
Version of CUDA: 11.1
Windows: 10.0.20277
WSL: Ubuntu 20.04