
torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) pops up local_rank: 0 (pid: 290596) of binary #281

Open
xiaohuipoi opened this issue Oct 15, 2024 · 3 comments

@xiaohuipoi

Hello, I have a problem. When I train bevformer_small on the base dataset, the first epoch works fine and saves the results to the results JSON file, but when the second epoch of training completes, ERROR: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary pops up.
Is this due to a lack of video memory, a lack of storage space, or some other issue?

GPU: 4060 Ti 16 GB × 1; remaining storage space: 190 GB / 800 GB

xiaohuipoi (Author) commented Oct 15, 2024

2024-10-14 20:45:26,306 - mmdet - INFO - Epoch [2][28100/28130]	lr: 1.866e-04, eta: 2 days, 20:24:13, time: 2.191, data_time: 0.016, memory: 7344, loss_cls: 0.5252, loss_bbox: 0.6573, d0.loss_cls: 0.4906, d0.loss_bbox: 0.7357, d1.loss_cls: 0.5090, d1.loss_bbox: 0.6798, d2.loss_cls: 0.5115, d2.loss_bbox: 0.6681, d3.loss_cls: 0.5248, d3.loss_bbox: 0.6641, d4.loss_cls: 0.5259, d4.loss_bbox: 0.6617, loss: 7.1536, grad_norm: 158.5894
2024-10-14 20:46:32,026 - mmdet - INFO - Saving checkpoint at 2 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.6 task/s, elapsed: 2317s, ETA:     0s


Formating bboxes of pts_bbox
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 35.6 task/s, elapsed: 169s, ETA:     0s
Results writes to val/./work_dirs/bevformer_small/Sun_Oct_13_09_48_30_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 290596) of binary: /home/xiaohuipoi/anaconda3/envs/bev/bin/python
Traceback (most recent call last):
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xiaohuipoi/anaconda3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
./tools/train.py FAILED
This is the output from my terminal run. Thanks!

dwc13 commented Oct 18, 2024

Ran into this issue last night as well:

2024-10-17 19:09:35,041 - mmdet - INFO - Epoch [3][28100/28130] lr: 1.866e-04, eta: 8 days, 8:40:33, time: 3.019, data
2024-10-17 19:11:05,205 - mmdet - INFO - Saving checkpoint at 3 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.1 task/s, elapsed: 2879s, ETA:   0s

Formating bboxes of pts_bbox
Start to convert detection format...                                                                                  
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 20.5 task/s, elapsed: 293s, ETA:   0s
Results writes to val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
...info...
======
Initializing nuScenes detection evaluation
Loaded results from val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json.
Found detections for 6019 samples.
Loading annotations for val split from nuScenes version: v1.0-trainval
Loaded ground truth annotations for 6019 samples.
Filtering predictions
Filtering ground truth annotations

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 365975) of binary: /home
Traceback (most recent call last):
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
  return _run_code(code, main_globals, None,
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
  exec(code, run_globals)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
  main()
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
  launch(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
  run(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
  elastic_launch(
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __c
  return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in lau
  raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
**************************************************
  ./tools/train.py FAILED
==================================================
Root Cause:
[0]:
time: 2024-10-17_20:07:11
rank: 0 (local_rank: 0)
exitcode: -9 (pid: 365975)
error_file: <N/A>
msg: "Signal 9 (SIGKILL) received by PID 365975"
==================================================
Other Failures:
<NO_OTHER_FAILURES>
**************************************************

Currently I'm looking into what is being thrown, but I believe it stems from the runner going through MMDet and then later through MMCV when running the evaluation after an epoch. The initial cause of the error is not shown to us, since this error appears to be the main exit error that closes all PIDs associated with ./tools/dist_train.py, as seen in other issues. Therefore, some investigation is required to find the exact error causing the exit.
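
Since the launcher masks the root cause, one way to investigate (exitcode -9 means the child received SIGKILL, which on Linux is frequently the kernel OOM killer when host RAM runs out) is to log peak resident memory around the nuScenes evaluation step. A minimal sketch, assuming a Linux host; this helper is hypothetical and not part of BEVFormer:

# oom_probe.py -- hypothetical helper, not part of BEVFormer.
# Logs the peak resident memory of the current process so you can tell
# whether host RAM is exhausted while the nuScenes tables are loaded.
import resource

def log_peak_rss(tag: str) -> None:
    # ru_maxrss is reported in kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[{tag}] peak RSS: {peak_kb / (1024 * 1024):.2f} GB")

log_peak_rss("before NuScenes table load")
# ... e.g. nusc = NuScenes(version='v1.0-trainval', dataroot='data/nuscenes') ...
log_peak_rss("after NuScenes table load")

If the peak RSS climbs toward the machine's total RAM, or the kernel log (dmesg) shows an "Out of memory" kill for the training PID, the SIGKILL is coming from the OOM killer rather than from the evaluation code itself.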

For the meantime, to keep my model training, I have adjusted the configuration.

./projects/configs/bevformer/bevformer_small.py:

evaluation = dict(interval=X, pipeline=test_pipeline)
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_X.pth'

checkpoint_config = dict(interval=1, create_symlink=False)

Notes (more information can be found in the links below; a filled-in example follows the list):

  • Adjust the X in the evaluation interval to a larger number.
  • To resume from the last or latest checkpoint, set resume_from to its relative path.
  • Make sure interval in checkpoint_config is set to 1 so a checkpoint .pth is saved after every epoch.
  • If you are working off an external drive rather than an internal one (like me), set create_symlink to False; refer to the MMDet Issue.
  • BEVFormer uses the MMDet 2.X configuration method, not MMDet 3.X.
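
For illustration only, here is one way the adjusted fields in ./projects/configs/bevformer/bevformer_small.py might look when resuming after epoch 2; the interval value of 24 and the epoch_2.pth path are assumed values, not taken from the comment above, and test_pipeline is defined elsewhere in the config:

# Illustrative values only -- adjust to your own schedule and checkpoints.

# Defer the post-epoch nuScenes evaluation that precedes the SIGKILL
# (24 is an assumed value; pick any interval larger than the failing epoch).
evaluation = dict(interval=24, pipeline=test_pipeline)

# Pretrained weights and the checkpoint to resume from (epoch 2 assumed).
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_2.pth'

# Save a checkpoint every epoch; disable symlinks when work_dirs is on an
# external drive.
checkpoint_config = dict(interval=1, create_symlink=False)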

General Documentation Sources:

@xiaohuipoi (Author)

Thanks bro! With your config changes, I can continue training epoch 3 on top of epoch_2.pth, which is very useful!
