Use ConvLSTM on multiple GPUs #11

Open
LeonardKnuth opened this issue May 9, 2016 · 3 comments

@LeonardKnuth

Hi,

I successfully ran ConvLSTM with mini-batches on a single GPU, but it failed when I tried to run it on multiple GPUs. The error is as follows:

/lua/5.1/nn/CAddTable.lua:16: bad argument #2 to 'add' (sizes do not match at /tmp/luarocks_cutorch-scm-1-7971/cutorch/lib/THC/THCTensorMathPointwise.cu:121)

I carefully checked the sizes of the tensors going into CAddTable and found that they matched, so I am confused about what is happening.

Does anyone have an idea? Thanks a lot.

@viorik
Owner

viorik commented May 10, 2016

Hi @LeonardKnuth,
Could you post a simple model that you are trying to train? I haven't used it on multiple GPUs yet, but I would be interested to see what's happening.
Cheers.

@LeonardKnuth
Author

Hi @viorik ,

My model is tightly coupled with the data, so it's not easy to clean it up right now. However, the main idea is straightforward if we use nn.DataParallelTable (see https://github.com/torch/cunn/blob/master/doc/cunnmodules.md#nn.cunnmodules.dok for details).
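
Roughly, the wrapping I have in mind looks like this (just a sketch with placeholder layers and sizes, assuming the cunn package and two GPUs, not my actual network):

```lua
require 'cunn'

-- Placeholder network: a small convolutional model, not the real one.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())
model:cuda()

-- Split each mini-batch along dimension 1 and run the chunks on GPUs 1 and 2.
local dpt = nn.DataParallelTable(1)
dpt:add(model, {1, 2})

local input = torch.CudaTensor(8, 3, 32, 32):normal()  -- batch of 8 RGB 32x32 images
local output = dpt:forward(input)
```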

I've been thinking about how to use LSTM or ConvLSTM in parallel, and it seems impossible when the recurrent modules never call forget (e.g., with remember('both')), because the steps have to run in order (i.e., the current state depends on the previous state). Do you agree? Thanks.
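
To illustrate what I mean, here is a toy sketch with a plain LSTM from the rnn package rather than ConvLSTM (sizes and inputs are arbitrary):

```lua
require 'rnn'

local lstm = nn.Sequencer(nn.FastLSTM(10, 10))
lstm:remember('both')  -- never forget: hidden state carries across forward calls

-- With remember('both'), the second call depends on the state left by the first,
-- so consecutive mini-batches must be processed in order on the same replica.
local out1 = lstm:forward({torch.randn(4, 10)})
local out2 = lstm:forward({torch.randn(4, 10)})  -- uses the state from the first call

-- Splitting these calls across independent replicas (e.g. via nn.DataParallelTable)
-- gives each GPU its own copy of the state, so this ordering is lost; calling
-- lstm:forget() is what resets the state between independent sequences.
```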

@viorik
Owner

viorik commented May 24, 2016

Hi @LeonardKnuth
Apologies for the late reply. Any news?
I think I agree with what you said, and I can't think of a way to make this work. A colleague actually set up training on multiple GPUs; the code ran, but the network didn't seem to learn anything, and I suspect it's for the reason you describe.
