
'RuntimeError: No rendezvous handler for env://' with multi-gpu #5358

Closed
costantinoai opened this issue Jan 5, 2021 · 16 comments · Fixed by #5402
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 1 (Medium priority task)

Comments

@costantinoai

🐛 Bug

I get an error
'RuntimeError: No rendezvous handler for env://'
when I run my model with multiple GPU.

Below the code and the traceback:

trainer = pl.Trainer(
    gpus=-1,
    accelerator='ddp',
    check_val_every_n_epoch=10,
    # precision=16,
    # auto_scale_batch_size='binsearch',
    callbacks=[checkpoint_callback],
    max_epochs=1,
)

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

trainer.fit(model)

initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Traceback (most recent call last):
  File "", line 1, in
    trainer.fit(model)
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 470, in fit
    results = self.accelerator_backend.train()
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 152, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\ddp_accelerator.py", line 252, in ddp_train
    self.init_ddp_connection(
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 153, in init_ddp_connection
    self.ddp_plugin.init_ddp_connection(
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\pytorch_lightning\plugins\ddp_plugin.py", line 90, in init_ddp_connection
    torch_distrib.init_process_group(
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
    rendezvous_iterator = rendezvous(
  File "C:\Users\45027900\Anaconda3\envs\PyTorch\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

The error is not present if I set

gpus = 1

Expected behavior

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Windows 10
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source): conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.0
  • GPU models and configuration: 2 * Quadro RTX 6000
  • Any other relevant information:
costantinoai added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Jan 5, 2021
@github-actions
Contributor

github-actions bot commented Jan 5, 2021

Hi! thanks for your contribution!, great first issue!

@costantinoai
Author

Also, I don't know if it is related, but when I check GPU performance during training (with gpus=1) in the Windows Task Manager, I see only 1-2% utilization on the GPU and 45-50% on the CPU. Is this normal behaviour?

Borda added the priority: 1 (Medium priority task) label on Jan 6, 2021
@Borda
Member

Borda commented Jan 6, 2021

@costantinoai mind sharing which PL version you are using? Also, do you have a full example to reproduce this?

@costantinoai
Author

costantinoai commented Jan 6, 2021

Hi @Borda ,
Thanks for your reply.

PL version is 1.1.2.

I do have an example of the full code on Colab, but I would rather not post it publicly.

How can I share it with you?

@awaelchli
Contributor

awaelchli commented Jan 7, 2021

Hi, you can ping me on Slack if you want. It's probably an issue with passing the argument gpus=-1 to the subprocess script. I bet if you set gpus=n, where n is the number of GPUs, it will work. We just have to support -1 for ddp.
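
A minimal sketch of that suggestion, reusing the names from the snippet above (checkpoint_callback and model are assumed to be defined as in the original code):

# Pass an explicit GPU count instead of -1 (illustrative workaround):
trainer = pl.Trainer(
    gpus=2,                  # number of GPUs on this machine, rather than -1
    accelerator='ddp',
    callbacks=[checkpoint_callback],
    max_epochs=1,
)
trainer.fit(model)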

@costantinoai
Author

costantinoai commented Jan 7, 2021 via email

@costantinoai
Author

@awaelchli I still get the same problem after setting gpus=2. I reached out to you on Twitter (I don't have a Slack account).

Thanks!

@awaelchli
Contributor

In summary, after a private conversation with @costantinoai:

  • ddp is not supported on the Windows platform (yet)
  • the script needs a guard around the entry point (if __name__ == "__main__"), as sketched below

If these requirements are not met, we see the 'No rendezvous handler for env://' error or similar exceptions.
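
For reference, a minimal sketch of a guarded entry point on a platform where ddp is available (the model class and the settings are illustrative, not taken from this issue):

import pytorch_lightning as pl

def main():
    model = MyLightningModule()  # hypothetical LightningModule
    trainer = pl.Trainer(gpus=2, accelerator='ddp', max_epochs=1)
    trainer.fit(model)

if __name__ == "__main__":
    # Without this guard, the worker processes spawned for ddp re-import
    # the script and re-execute any training code at module level.
    main()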

@BlockWaving

Hi, I also get this error when adding a second GPU to the machine:

RuntimeError: No rendezvous handler for env://

Please advise how to fix and/or work around this.
Thanks!

@awaelchli
Contributor

RuntimeError: No rendezvous handler for env://

That's not much information, but one possibility is that you are on Windows.
accelerator='ddp' will not work on Windows; you have to choose 'dp'.
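
A minimal sketch of that change, assuming a two-GPU machine and a model defined elsewhere:

# 'dp' (DataParallel) runs in a single process, so it does not need
# torch.distributed and therefore works on Windows (illustrative settings):
trainer = pl.Trainer(gpus=2, accelerator='dp', max_epochs=1)
trainer.fit(model)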

@mdja

mdja commented Feb 3, 2021

I am on Windows and saw this error. Changing the accelerator to 'dp' works.

@DavidRimel

I am on Windows and saw this error. Changing the accelerator to 'dp' works.

Windows 10 user here.. this worked for me

@carlomarxdk

carlomarxdk commented Mar 4, 2021

I am on PyTorch Lightning 1.2.1 and I still run into the issue on Windows, even with the accelerator set to "dp". I am training on 1 GPU.
I encounter this issue when I use the DeepSpeed plugin.

@awaelchli
Contributor

@carlomarxdk is DeepSpeed supported on Windows? I can't find any mention of it, so probably not.

@ibrahimishag

I ran into this issue on Windows 10.

@ibrahimishag

Changing the accelerator to dp on Windows 10 as suggested by @awaelchli and @mdja solved my issue.
Thank you.
