torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) pops up local_rank: 0 (pid: 290596) of binary #281
Ran into this issue last night as well:
2024-10-17 19:09:35,041 - mmdet - INFO - Epoch [3][28100/28130] lr: 1.866e-04, eta: 8 days, 8:40:33, time: 3.019, data
2024-10-17 19:11:05,205 - mmdet - INFO - Saving checkpoint at 3 epochs
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 2.1 task/s, elapsed: 2879s, ETA: 0s
Formating bboxes of pts_bbox
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 20.5 task/s, elapsed: 293s, ETA: 0s
Results writes to val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json
Evaluating bboxes of pts_bbox
======
Loading NuScenes tables for version v1.0-trainval...
...info...
======
Initializing nuScenes detection evaluation
Loaded results from val/./work_dirs/bevformer_small/Wed_Oct_16_20_53_19_2024/pts_bbox/results_nusc.json.
Found detections for 6019 samples.
Loading annotations for val split from nuScenes version: v1.0-trainval
Loaded ground truth annotations for 6019 samples.
Filtering predictions
Filtering ground truth annotations
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 365975) of binary: /home
Traceback (most recent call last):
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __c
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dwc13/miniforge3/envs/bev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in lau
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
**************************************************
./tools/train.py FAILED
==================================================
Root Cause:
[0]:
time: 2024-10-17_20:07:11
rank: 0 (local_rank: 0)
exitcode: -9 (pid: 365975)
error_file: <N/A>
msg: "Signal 9 (SIGKILL) received by PID 365975"
==================================================
Other Failures:
<NO_OTHER_FAILURES>
**************************************************

Currently I am looking into what is being thrown, but I believe it stems from the runner going through MMDet and then later through MMCV when running the evaluation after an epoch. The initial cause of the error is not reported to us, since this message looks like the top-level exit error that closes every PID associated with ./tools/dist_train.py, as seen in other issues. Some investigation is therefore required to find the exact error causing the exit. Notes: more information can be found in the links under General Documentation Sources at the end of this comment.

In the meantime, to let my model continue its initial training, I have adjusted the configuration in ./projects/configs/bevformer/bevformer_small.py (a filled-in example and two diagnostic sketches follow the fragment):

evaluation = dict(interval=X, pipeline=test_pipeline)
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_X.pth'
checkpoint_config = dict(interval=1, create_symlink=False)
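For reference, here is a filled-in version of the workaround. The concrete numbers are my own assumptions for illustration (a 24-epoch schedule with epoch 2 as the last saved checkpoint), not values from the run above:

# Sketch of the workaround with illustrative values. Setting the evaluation
# interval past the assumed total_epochs=24 means the post-epoch evaluation
# hook (where the SIGKILL appears to occur) never fires during training.
evaluation = dict(interval=25, pipeline=test_pipeline)  # 25 > total_epochs, so no in-training eval
load_from = 'ckpts/bevformer_small.pth'
resume_from = 'work_dirs/bevformer_small/epoch_2.pth'  # assumed last saved checkpoint
checkpoint_config = dict(interval=1, create_symlink=False)  # still save a checkpoint every epoch

You can then run a standalone evaluation of the saved checkpoints afterwards, once the memory behavior of the in-training evaluation is understood.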
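Since the launcher reports error_file: <N/A>, one way to surface the real child traceback is PyTorch's record decorator from torch.distributed.elastic. A minimal sketch, assuming ./tools/train.py defines a main() entrypoint (I have not checked how this repo names it):

# Sketch: have torch.distributed.elastic write each worker's traceback to an
# error file, so the launcher can report the underlying Python exception
# instead of only the top-level exit code. Caveat: a -9 (SIGKILL) delivered
# by the kernel, e.g. from the OOM killer, ends the process before Python
# can react, so @record only helps when the failure is a Python exception.
from torch.distributed.elastic.multiprocessing.errors import record

@record  # populates the launcher's error_file on unhandled exceptions
def main():
    ...  # existing argument parsing and training loop from tools/train.py

if __name__ == "__main__":
    main()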
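Because exitcode -9 means the process received SIGKILL, which on Linux is commonly the kernel OOM killer stepping in when host RAM runs out, it is also worth watching memory during the evaluation phase. A minimal sketch, assuming psutil is installed:

# Sketch: log host RAM from a background thread. If the OOM killer sends
# SIGKILL during evaluation, the last printed lines show how close the
# machine was to exhausting memory (dmesg would then confirm the kill).
import threading
import time

import psutil

def log_memory(interval_s: float = 10.0) -> None:
    while True:
        mem = psutil.virtual_memory()
        print(f"[mem] used={mem.used / 1e9:.1f} GB, "
              f"available={mem.available / 1e9:.1f} GB ({mem.percent:.0f}%)",
              flush=True)
        time.sleep(interval_s)

# Start before training/evaluation begins, e.g. near the top of tools/train.py.
threading.Thread(target=log_memory, daemon=True).start()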
General Documentation Sources:
thx bro, with your code I can continue to train epoch 3 on top of epoch_2.pth, which is very useful!
Hello, I have a problem. When I train bevformer_small on the base dataset, the first epoch works fine and saves the results to the results JSON file, but when the second epoch finishes training, ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) pops up for local_rank: 0 (pid: 290596) of binary.
Is it due to a lack of video memory, a lack of storage space, or some other issue?
GPU: 4060 Ti 16 GB * 1; remaining storage space: 190 GB / 800 GB