
torchrun --nproc_per_node=4 /data/zxb/DINO_FL/train/train.py #447

Open
Emibobo opened this issue Jul 25, 2024 · 3 comments

Comments


Emibobo commented Jul 25, 2024

[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
[rank3]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 3] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=1
[rank3]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 3] ProcessGroupNCCL preparing to dump debug info.
[rank3]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 3] [PG 0 Rank 3] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 1
W0725 08:32:26.101000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 651734 closing signal SIGTERM
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
E0725 08:32:28.020000 140686322804544 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 0 (pid: 651732) of binary: /home/zhuxiaobo/anaconda3/envs/my_pytorch_env/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhuxiaobo/anaconda3/envs/my_pytorch_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/zxb/DINO_FL/train/train_stage.py FAILED

Failures:
[1]:
time : 2024-07-25_08:32:26
host : user
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 651733)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 651733
[2]:
time : 2024-07-25_08:32:26
host : user
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 651735)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 651735

Root Cause (first observed failure):
[0]:
time : 2024-07-25_08:32:26
host : user
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 651732)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 651732

@SHAREN111

I encountered the same issue as you. May I ask if your problem has been resolved? I look forward to your reply.

@baldassarreFe

ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives.

One of the processes got stuck while waiting on a collective operation (all_gather, etc.). It's most likely a problem with your setup rather than with the code in this repo, and it's hard to debug without seeing your code and Python environment.

Troubleshooting tips:
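As a starting point, here is a minimal, illustrative sketch (not code from this repository; the init_distributed helper is hypothetical) that raises the timeouts named in the watchdog message and turns on verbose NCCL/distributed logging before the process group is created, so the logs show which rank and which collective never completes:

```python
# Minimal sketch, assuming a typical torchrun entry point (illustrative only).
# NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch settings;
# the two TORCH_NCCL_* variables are the ones named in the watchdog message above.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # verbose NCCL logging
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d consistency checks
os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "1800")  # raise the 600 s limit from the log
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"          # or disable the heartbeat monitor entirely


def init_distributed() -> int:
    """Initialize NCCL with a longer collective timeout and return the local rank."""
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)
    # A longer timeout only helps if a collective is merely slow; a true deadlock
    # (e.g. ranks calling different collectives, or one rank skipping a batch)
    # will still hang and has to be fixed in the training code itself.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))
    return local_rank
```

With NCCL_DEBUG=INFO the stuck rank usually logs the last collective it entered, which narrows the search; a longer timeout only masks the problem if the ranks have genuinely diverged (for example, one rank gets fewer batches and skips a collective the others are waiting on).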

@DarkJokers

Has this been resolved?
