
torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) pops up local_rank: 0 (pid: 290596) of binary #281

Open
xiaohuipoi opened this issue Oct 15, 2024 · 3 comments

@xiaohuipoi

Hello, I have a problem. When I train bevformer_small on the base dataset, the first epoch works fine and saves the results to the results JSON file, but when the second epoch of training completes, ERROR: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary pops up.
Is this due to a lack of video memory, a lack of storage space, or some other issue?

GPU: 4060 Ti 16 GB × 1; remaining storage space: 190 GB / 800 GB

xiaohuipoi (Author) commented Oct 15, 2024

2024-10-14 20:45:26,306 - mmdet - INFO - Epoch [2][28100/28130]	lr: 1.866e-04, eta: 2 days, 20:24:13, time: 2.191, data_time: 0.016, memory: 7344, loss_cls: 0.5252, loss_bbox: 0.6573, d0.loss_cls: 0.4906, d0.loss_bbox: 0.7357, d1.loss_cls: 0.5090, d1.loss_bbox: 0.6798, d2.loss_cls: 0.5115, d2.loss_bbox: 0.6681, d3.loss_cls: 0.5248, d3.loss_bbox: 0.6641, d4.loss_cls: 0.5259, d4.loss_bbox: 0.6617, loss: 7.1536, grad_norm: 158.5894
2024-10-14 20:46:32,026 - mmdet - INFO - Saving checkpoint at 2 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.6 task/s, elapsed: 2317s, ETA:     0s


Formating bboxes of pts_bbox
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 35.6 task/s, elapsed: 169s, ETA:     0s
Results writes to val/./work_dirs/bevformer_small/Sun_Oct_13_09_48_30_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary: /home/xiaohuipoi/anaconda3/envs/bev/bin/python
Traceback (most recent call last):
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
./tools/train.py FAILED
This is the output from my terminal run. Thanks!

dwc13 commented Oct 18, 2024

Ran into this issue last night as well:

2024-10-17 19:09:35,041 - mmdet - INFO - Epoch [3][28100/28130] lr: 1.866e-04, eta: 8 days, 8:40:33, time: 3.019, data
2024-10-17 19:11:05,205 - mmdet - INFO - Saving checkpoint at 3 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.1 task/s, elapsed: 2879s, ETA:   0s

Formating bboxes of pts_bbox
Start to convert detection format...                                                                                  
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 20.5 task/s, elapsed: 293s, ETA:   0s
Results writes to val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
...info...
======
Initializing nuScenes detection evaluation
Loaded results from val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json.
Found detections for 6019 samples.
Loading annotations for val split from nuScenes version: v1.0-trainval
Loaded ground truth annotations for 6019 samples.
Filtering predictions
Filtering ground truth annotations

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 365975) of binary: /home
Traceback (most recent call last):
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  return _run_code(code, main_globals, None,
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
  exec(code, run_globals)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
  main()
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
  launch(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
  run(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
  elastic_launch(
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __c
  return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in lau
  raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
**************************************************
  ./tools/train.py FAILED
==================================================
Root Cause:
[0]:
time: 2024-10-17_20:07:11
rank: 0 (local_rank: 0)
exitcode: -9 (pid: 365975)
error_file: <N/A>
msg: "Signal 9 (SIGKILL) received by PID 365975"
==================================================
Other Failures:
<NO_OTHER_FAILURES>
**************************************************

Currently I'm looking into what is being thrown, but I believe it stems from the runner going through MMDet and then later through MMCV when running the evaluation after an epoch. The initial cause of the error is not shown to us, since this error appears to be the main exit error that closes all PIDs associated with ./tools/dist_train.py, as seen in other issues. Therefore, some investigation is required to find the exact error causing the exit.
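
Since the launcher masks the root cause, one way to investigate (exitcode -9 means the child received SIGKILL, which on Linux is frequently the kernel OOM killer when host RAM runs out) is to log peak resident memory around the nuScenes evaluation step. A minimal sketch, assuming a Linux host; this helper is hypothetical and not part of BEVFormer:

# oom_probe.py -- hypothetical helper, not part of BEVFormer.
# Logs the peak resident memory of the current process so you can tell
# whether host RAM is exhausted while the nuScenes tables are loaded.
import resource

def log_peak_rss(tag: str) -> None:
    # ru_maxrss is reported in kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] peak RSS: {peak_kb / (1024 * 1024):.2f} GB")

log_peak_rss("before NuScenes table load")
# ... e.g. nusc = NuScenes(version='v1.0-trainval', dataroot='data/nuscenes') ...
log_peak_rss("after NuScenes table load")

If the peak RSS climbs toward the machine's total RAM, or the kernel log (dmesg) shows an "Out of memory" kill for the training PID, the SIGKILL is coming from the OOM killer rather than from the evaluation code itself.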

For the meantime, to keep my model training, I have adjusted the configuration.

./projects/configs/bevformer/bevformer_small.py:

evaluation = dict(interval=X, pipeline=test_pipeline)
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_X.pth'

checkpoint_config = dict(interval=1, create_symlink=False)

Notes (more information can be found in the links below; a filled-in example follows the list):

  • Adjust the X in the evaluation interval to a larger number.
  • To resume from the last or latest checkpoint, set resume_from to its relative path.
  • Make sure interval in checkpoint_config is set to 1 so a checkpoint .pth is saved after every epoch.
  • If you are working off an external drive rather than an internal one (like me), set create_symlink to False; refer to the MMDet Issue.
  • BEVFormer uses the MMDet 2.X configuration method, not MMDet 3.X.
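
For illustration only, here is one way the adjusted fields in ./projects/configs/bevformer/bevformer_small.py might look when resuming after epoch 2; the interval value of 24 and the epoch_2.pth path are assumed values, not taken from the comment above, and test_pipeline is defined elsewhere in the config:

# Illustrative values only -- adjust to your own schedule and checkpoints.

# Defer the post-epoch nuScenes evaluation that precedes the SIGKILL
# (24 is an assumed value; pick any interval larger than the failing epoch).
evaluation = dict(interval=24, pipeline=test_pipeline)

# Pretrained weights and the checkpoint to resume from (epoch 2 assumed).
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_2.pth'

# Save a checkpoint every epoch; disable symlinks when work_dirs is on an
# external drive.
checkpoint_config = dict(interval=1, create_symlink=False)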

General Documentation Sources:

@xiaohuipoi (Author)

Thanks bro! With your config changes, I can continue training epoch 3 on top of epoch_2.pth, which is very useful!
