VFNET device error when run inference_detector #4146

Closed · Fixed by #4400

JonathanAndradeSilva opened this issue Nov 19, 2020 · 3 comments

@JonathanAndradeSilva

Hi everyone,

When I run inference_detector with the VFNet algorithm, I get this error message (only for VFNet; ATSS and others work fine):

/usr/local/lib/python3.6/dist-packages/mmcv/parallel/_functions.py in forward(target_gpus, input)
71 # Perform CPU to GPU copies in a background stream
72 streams = [_get_stream(device) for device in target_gpus]
---> 73
74 outputs = scatter(input, target_gpus, streams)
75 # Synchronize with the copy stream

/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py in _get_stream(device)
117 if _streams is None:
118 _streams = [None] * torch.cuda.device_count()
--> 119 if _streams[device] is None:
120 _streams[device] = torch.cuda.Stream(device)
121 return _streams[device]

TypeError: list indices must be integers or slices, not torch.device

The device parameter of init_detector is the default ('cuda:0') and distributed=False. Can you help me?
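
For context, a minimal reproduction along the lines of the report would look roughly like this (the checkpoint path is a placeholder, not taken from the issue):

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/vfnet/vfnet_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/vfnet_r50_fpn_1x_coco.pth'  # placeholder path

# device defaults to 'cuda:0' in init_detector
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# For VFNet this call raises:
# TypeError: list indices must be integers or slices, not torch.device
result = inference_detector(model, 'demo/demo.jpg')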

@ggalan87 commented Nov 26, 2020

I had the same problem with VFNet. I think it is a bug in mmcv itself rather than anything specific to the method, more specifically here:
https://github.com/open-mmlab/mmcv/blob/91a7fee03a3973a56cb5f687a6859ef0aaacf15e/mmcv/parallel/_functions.py#L72

However, torch's _get_stream indexes into a plain list, so device should be an integer rather than a torch.device: https://github.com/pytorch/pytorch/blob/18ae12a841bdc99c6cce65ac5c77cc1149dc8564/torch/nn/parallel/_functions.py#L111-L120

The fix is to pass device.index rather than device while calling _get_stream.

I don't know, however, why it normally works, e.g. with plain Faster R-CNN detectors. I think an earlier exception is silently swallowed in https://github.com/open-mmlab/mmcv/blob/91a7fee03a3973a56cb5f687a6859ef0aaacf15e/mmcv/parallel/scatter_gather.py#L44

EDIT: Just did a quick and dirty fix in my local mmcv code as suggested above, and inference with VFNet worked.
EDIT2: This bug (wrong parameter type) only occurs during inference, so the workaround is WRONG for training. Looking into the real cause of the problem.
EDIT3: The correct fix is to pass device.index rather than device in

data = scatter(data, [device])[0]
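
In other words, the scatter call in mmdetection's inference helper would change roughly as sketched below (this reflects the commenter's suggestion; the patch actually merged for #4400 may differ):

if next(model.parameters()).is_cuda:
    # scatter to the specified GPU; _get_stream indexes a plain list,
    # so pass the integer GPU id (device.index) instead of the torch.device
    data = scatter(data, [device.index])[0]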

@lfydegithub commented Dec 22, 2020

I have the same issue when I run the VFNet demo. How can it be solved?

if next(model.parameters()).is_cuda:
    # scatter to specified GPU
    # this line throws: TypeError: list indices must be integers or slices, not torch.device
    data = scatter(data, [device])[0]

@lfydegithub

@JonathanAndradeSilva @ggalan87 I have found the reason.
In the test_pipeline of vfnet_r50_fpn_1x_coco.py, change dict(type='DefaultFormatBundle') to dict(type='ImageToTensor', keys=['img']):

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            # dict(type='DefaultFormatBundle'),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
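
If you'd rather not edit the config file on disk, the same change can be applied in code before building the model. A rough sketch using mmcv's Config (the pipeline index and the checkpoint path below are assumptions based on the config shown above, not something stated in the thread):

from mmcv import Config
from mmdet.apis import init_detector, inference_detector

cfg = Config.fromfile('configs/vfnet/vfnet_r50_fpn_1x_coco.py')

# Swap DefaultFormatBundle for ImageToTensor inside MultiScaleFlipAug.
# pipeline[1] is assumed to be the MultiScaleFlipAug step; adjust if your
# config nests things differently.
transforms = cfg.data.test.pipeline[1]['transforms']
for i, t in enumerate(transforms):
    if t['type'] == 'DefaultFormatBundle':
        transforms[i] = dict(type='ImageToTensor', keys=['img'])

model = init_detector(cfg, 'checkpoints/vfnet_r50_fpn_1x_coco.pth',  # placeholder path
                      device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')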
