DataParallel takes too long #16
1ytic added a commit to 1ytic/FloWaveNet that referenced this issue on Apr 22, 2019:
Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture. List of changes made in train.py:
1. Determine local_rank and world_size for torch.distributed.init_process_group
2. Set the current device with torch.cuda.set_device
3. Wrap the dataset with torch.utils.data.distributed.DistributedSampler
4. Apply amp.scale_loss at each backward pass
5. Clip gradients with amp.master_params
6. Divide step_size by world_size (not sure if this is necessary)
7. Initialize the model and optimizer with amp.initialize
8. Wrap the model with apex.parallel.DistributedDataParallel
9. Handle evaluation and messages on the first node using args.local_rank

Resolves: ksw0306#13
See also: ksw0306#16
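Those changes follow the standard Apex amp recipe. Below is a minimal, self-contained sketch of that recipe, not the commit's actual code: the model, dataset, loss, and hyperparameters are placeholders, and it assumes the script is launched with torch.distributed.launch, which supplies --local_rank.

```python
# Sketch of the Apex distributed-training recipe listed above, assuming the
# standard apex.amp API; model, dataset and loss are placeholders, not the
# code from the referenced commit.
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

# 1-2. Join the process group and pin this process to its own GPU.
torch.distributed.init_process_group(backend='nccl', init_method='env://')
torch.cuda.set_device(args.local_rank)
world_size = torch.distributed.get_world_size()

# Placeholder data/model; the real train.py builds the LJ Speech loader and FloWaveNet here.
train_dataset = TensorDataset(torch.randn(64, 16))
model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 3. Shard the dataset so every process sees a disjoint slice of each epoch.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, batch_size=8, sampler=train_sampler)

# 7. Let amp patch the model/optimizer for mixed precision (step 6, dividing the
#    scheduler step_size by world_size, would also happen around here).
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 8. Wrap the model with Apex's DistributedDataParallel (one process per GPU).
model = DistributedDataParallel(model)

for (x,) in train_loader:
    optimizer.zero_grad()
    loss = model(x.cuda()).pow(2).mean()  # placeholder loss
    # 4. Scale the loss so fp16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # 5. Clip the fp32 master gradients that amp maintains.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 1.0)
    optimizer.step()
    # 9. Only the first process prints/evaluates.
    if args.local_rank == 0:
        print(loss.item())
```

Launched with, e.g., `python -m torch.distributed.launch --nproc_per_node=4 train.py ...`, each of the four GPUs then runs its own process instead of the single-process DataParallel path.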
Sorry for the late reply. The >1 hour hang is indeed strange and shouldn't happen (the default stdout logging interval is 100 steps).
Hello,
I am trying to run training on multiple GPUs (4 Tesla V100) using the command:
python train.py --model_name flowavenet --batch_size 8 --n_block 8 --n_flow 6 --n_layer 2 --block_per_split 4 --num_gpu 4
It runs without any errors and outputs:
num_gpu > 1 detected. converting the model to DataParallel...
It was frozen with this output for more than 1 hour. I checked GPU usage and all four GPUs showed activity, but I didn't see any progress. I have a couple of questions: is there a problem with my setup, or do I just need to wait longer for training to start? Will decreasing batch_size speed up the conversion to DataParallel?
Note: I am running training on the LJ Speech dataset.
Also, could you provide download links for the pretrained models? That would be very helpful.
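For context, the num_gpu > 1 path in train.py presumably does little more than wrap the model in torch.nn.DataParallel; a minimal sketch of that behaviour (placeholder model, not the repository's exact code) would be:

```python
import torch

num_gpu = 4
model = torch.nn.Linear(16, 16)  # placeholder for the FloWaveNet model

if num_gpu > 1:
    print('num_gpu > 1 detected. converting the model to DataParallel...')
    # Replicates the model onto every listed GPU and splits each batch along
    # dim 0 at forward time (batch_size 8 -> 2 samples per V100).
    model = torch.nn.DataParallel(model, device_ids=list(range(num_gpu)))
model = model.cuda()

out = model(torch.randn(8, 16).cuda())  # each GPU runs its slice of the batch
```

DataParallel keeps a single Python process and scatters each batch across the GPUs at every forward pass, so the per-GPU batch here is batch_size / num_gpu.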