
DataParallel takes too long #16

Open
Ersho opened this issue Mar 1, 2019 · 1 comment


Ersho commented Mar 1, 2019

Hello,

I am trying to run the training part on multiple GPUs (4 Tesla V100), using the command

python train.py --model_name flowavenet --batch_size 8 --n_block 8 --n_flow 6 --n_layer 2 --block_per_split 4 --num_gpu 4

It starts without any errors and outputs

num_gpu > 1 detected. converting the model to DataParallel...

It stayed frozen with this output for more than an hour. I checked GPU usage and all four GPUs were in use, but I didn't see any progress. I have a couple of questions: is there a problem with my setup, or do I just need to wait longer for training to start? Would decreasing batch_size speed up the conversion to DataParallel?
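For reference, a minimal sketch of what the DataParallel wrapping step presumably looks like (placeholder model, not the repo's actual code):

```python
import torch
from torch import nn

# Minimal sketch of the multi-GPU wrapping step (placeholder model, not the
# repo's actual code). nn.DataParallel splits each input batch across the
# visible GPUs and gathers the outputs on the default device; the wrap itself
# is cheap, so a long hang usually points at data loading or the first
# forward pass rather than the conversion message itself.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU())  # stand-in for FloWaveNet
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the module across GPUs each forward
model = model.to('cuda' if torch.cuda.is_available() else 'cpu')
```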

Note: I am running training on the LJ Speech dataset.

Also, could you share download links for the pretrained models? That would be very helpful.

1ytic added a commit to 1ytic/FloWaveNet that referenced this issue Apr 22, 2019
Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture.

List of changes made in train.py (a rough sketch of these steps follows after the commit notes below):
1. Determine local_rank and world_size for torch.distributed.init_process_group
2. Set the current device with torch.cuda.set_device
3. Wrap the dataset with torch.utils.data.distributed.DistributedSampler
4. Apply amp.scale_loss at each backward pass
5. Clip gradients with amp.master_params
6. Divide step_size by world_size (not sure if this is necessary)
7. Initialize the model and optimizer with amp.initialize
8. Wrap the model with apex.parallel.DistributedDataParallel
9. Handle evaluation and messages on the first node using args.local_rank

Resolves: ksw0306#13
See also: ksw0306#16
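
A rough, self-contained sketch of those steps under apex, using a placeholder model and dataset (the real train.py wires in FloWaveNet and its own DataLoader; the opt_level, loss, and hyperparameters here are assumptions, and step 6 is omitted):

```python
import argparse
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from apex import amp
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

# 1-2. One process per GPU: select the device, then join the process group.
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
world_size = torch.distributed.get_world_size()  # step 6 (step_size / world_size) omitted here

# 3. DistributedSampler gives each rank a disjoint shard of the dataset.
dataset = TensorDataset(torch.randn(256, 128), torch.randn(256, 128))  # placeholder data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

model = nn.Linear(128, 128).cuda()  # placeholder for the FloWaveNet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 7. amp.initialize patches the model and optimizer for mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# 8. apex DDP averages gradients across ranks during backward.
model = DistributedDataParallel(model)

for epoch in range(1):
    sampler.set_epoch(epoch)  # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = nn.functional.mse_loss(model(x), y)  # placeholder loss
        optimizer.zero_grad()
        # 4. Scale the loss so fp16 gradients do not underflow.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        # 5. Clip on the fp32 master copies of the parameters.
        torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), 1.0)
        optimizer.step()
        # 9. Log / evaluate only on the first rank.
        if args.local_rank == 0:
            print(f'loss {loss.item():.4f}')
```

Launched with something like `python -m torch.distributed.launch --nproc_per_node=4 train.py ...`, which starts one process per GPU and fills in --local_rank plus the rendezvous environment variables that init_process_group reads.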
L0SG (Collaborator) commented Apr 23, 2019

Sorry for the late reply. The >1 hour hang is indeed strange and shouldn't happen (the default stdout logging interval, display_step, is 100). Could you test again with display_step = 1 inside train()? Or could you verify whether the DistributedDataParallel version from @1ytic alleviates the problem?
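
For context, a hypothetical sketch of that logging check (the variable names in the repo's train() may differ):

```python
# With display_step = 1 every iteration prints, which makes it easy to tell
# a genuine hang from slow progress. The repo's default is 100.
display_step = 1

for step in range(1, 6):        # stand-in for the training loop
    loss = 1.0 / step           # stand-in for the computed loss
    if step % display_step == 0:
        print(f'[step {step}] loss: {loss:.4f}')
```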
