Multi-GPU operation and data / model Parallelism #876

Closed
shelhamer opened this issue Aug 7, 2014 · 33 comments

Comments

@shelhamer
Member

Multi-GPU operation and data / model / hybrid parallelism are planned and in development for Caffe. The purpose of the thread is to focus the conversation, since this has been asked here, there, and everywhere. There are several ways to approach parallelization, so feel free to discuss your own work to this end here.

Note that Caffe does work with multiple GPUs in a standalone fashion right now: you can train on one GPU while extracting features on another and so on.

@Yangqing
Member

Yangqing commented Aug 8, 2014

Note that training with multiple GPUs + data parallelism is also trivially possible with MPI; model parallelism is less trivial, though.

@kloudkl
Contributor

kloudkl commented Aug 8, 2014

Does data parallelism imply shared model parameters? If it doesn't, and the data is split beforehand, data parallelism can be implemented with a shell script.

@kloudkl
Contributor

kloudkl commented Aug 8, 2014

Unfortunately, the weights of the replicas of the same layer have to be synchronized, as described in Section 4.1 of [1].

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 [cs.NE]
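
To illustrate why the replicas have to be synchronized, here is a minimal sketch of synchronous data-parallel SGD on a toy one-parameter least-squares problem. It is not Caffe code and not the scheme from [1]; the function names (`shard_gradient`, `data_parallel_step`, `split`) and the toy data are assumptions made for the example. Each simulated worker computes a gradient on its own shard, and the gradients are averaged before the shared weight is updated.

```python
# Toy illustration (not Caffe code): synchronous data-parallel SGD on
# the 1-D least-squares loss  L(w) = mean((w * x - y) ** 2).

def shard_gradient(w, shard):
    """Mean gradient of the squared error over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, lr):
    """One synchronous step: every replica starts from the same w,
    computes a local gradient, and the averaged gradient updates w."""
    grads = [shard_gradient(w, shard) for shard in shards]  # one per "GPU"
    avg_grad = sum(grads) / len(grads)                      # the sync point
    return w - lr * avg_grad


def split(data, n_workers):
    """Split data into n_workers equally sized contiguous shards."""
    size = len(data) // n_workers
    return [data[i * size:(i + 1) * size] for i in range(n_workers)]


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # y = 3x, 8 samples
shards = split(data, 2)                             # two simulated GPUs

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so every replica applies the same update and the weights stay identical. Without the averaging step, each replica would drift toward the optimum of its own shard.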

@kloudkl
Contributor

kloudkl commented Aug 8, 2014

There are tens of thousands of lines in the diff between cuda-convnet2/cudaconvnet and cuda-convnet. It may not be feasible to reproduce all of Alex's work in a short time. Is it acceptable to just wrap it, as @soumith did in cuda-convnet2.torch? The members of the Pylearn2 community have successfully wrapped cuda-convnet and are planning to upgrade to cuda-convnet2.

@visionscaper

cc me

@palmforest

I hope this becomes a reality in Caffe soon.

@bug-fixed
Contributor

I have a thought that I am posting here for discussion.
When the training data is spread across a distributed environment, is it possible, or even necessary, to develop a dispatch server that works together with a parameter database server?
Each independent client machine has multiple computing resources (such as NVIDIA GPUs, AMD devices, or others), and client programs run on these machines. The clients communicate with the dispatch server or the parameter database, and the dispatch server is responsible only for the overall procedure of updating the parameter database. Each client should expose interfaces for completing independent tasks on its own part of the data, and the client programs should adapt to the available computing resources for best performance.
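
One way to picture this proposal is the rough sketch below. It is an illustration under stated assumptions, not an existing Caffe component: the dispatch server and parameter database are collapsed into a single in-process `ParameterServer` class, threads stand in for client machines, and the model is a toy one-parameter least-squares fit. All names here are hypothetical.

```python
import threading


class ParameterServer:
    """Holds the shared weights; clients pull a copy and push gradients."""

    def __init__(self, weights, lr):
        self._weights = list(weights)
        self._lr = lr
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return list(self._weights)

    def push(self, grads):
        # Apply one gradient step atomically so updates never interleave.
        with self._lock:
            self._weights = [w - self._lr * g
                             for w, g in zip(self._weights, grads)]


def client(server, shard, steps):
    """One client: repeatedly pull weights, compute a gradient on its own
    data shard, and push the gradient back without waiting for others."""
    for _ in range(steps):
        (w,) = server.pull()
        grad = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
        server.push([grad])


def train(shards, steps=300, lr=0.01):
    """Run one client thread per shard and return the final weights."""
    server = ParameterServer([0.0], lr)
    threads = [threading.Thread(target=client, args=(server, s, steps))
               for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull()


shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # y = 2x
weights = train(shards)
```

The clients here update asynchronously, so a pushed gradient may have been computed from slightly stale weights. That trades exact equivalence with single-machine SGD for not having to wait on the slowest client.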

@kloudkl
Contributor

kloudkl commented Aug 16, 2014

The training can be done with cuda-convnet2. It's only necessary to convert the model into Caffe's format.

@kloudkl
Contributor

kloudkl commented Aug 30, 2014

@Yangqing, could you give us some hints on how to integrate Caffe and CUDA MPS with MPI? Does the solver of each process have to communicate with the others? Or do they share the same CUDA context, which automatically combines the memory of multiple GPUs into a single virtual address space?

@bhack
Contributor

bhack commented Aug 30, 2014

I don't know whether, on multiple hosts, we could explore something with Spark and the Caffe Python bindings. There are also already some deep network experiments on Spark.

@kloudkl
Contributor

kloudkl commented Aug 31, 2014

Mixing multiple languages together is not a good idea.

@bhack
Contributor

bhack commented Aug 31, 2014

@kloudkl I'm talking about using the Python bindings already available in Caffe from PySpark.

@kloudkl
Contributor

kloudkl commented Aug 31, 2014

@madisonmay

With the recent addition of cuDNN to the dev branch of Caffe, are multiple GPUs now supported? The recent article on cuDNN indicates that cuDNN supports parallelism across GPUs, but doesn't mention whether this support is present in the Caffe wrapper.

@shelhamer
Member Author

Caffe and cuDNN alike are single-GPU libraries at the moment, but they can be run on multiple GPUs simultaneously in a standalone way.

Multi-GPU parallelism is still in development in Caffe.


@madisonmay

In other words, multiple GPUs can be used for tasks like hyperparameter selection, but not for more efficient training of a single model?

@bhack
Contributor

bhack commented Nov 22, 2014

There is something interesting in TBB flow graph parallelization. This is a feature detector example.

@bhack bhack mentioned this issue Apr 20, 2015
@futurely

futurely commented Jul 1, 2015

While #2114 and a series of related PRs #1535 (comment) have solved the data parallelism, there isn't yet a pull request dedicated to model parallelism. Here is Facebook's implementation just for reference.

@PhoenixDai

It looks like NVIDIA has figured out how to train models in Caffe with multiple GPUs. This page, https://developer.nvidia.com/digits, shows a performance comparison of training with different numbers of GPUs. The video on the page shows how to use multiple GPUs with DIGITS. Is there any plan to have this feature in Caffe?

@choosehappy

@PhoenixDai's comment is related to the DIGITS GitHub issue NVIDIA/DIGITS#92. It seems they have forked their own Caffe version that supports multiple GPUs?

@thatguymike
Contributor

The Nvidia branch uses #2114

@shaibagon
Member

Linking to a related SO question.

@futurely

The master branch supports multi-GPU training. Please refer to the latest documents.
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
https://github.com/BVLC/caffe/blob/master/docs/tutorial/interfaces.md

@xiaoxiongli

xiaoxiongli commented Jun 22, 2016

Thank you for your great work! I am now training GoogLeNet on my K80. As you know, the K80 has two GPUs, and I enabled both with "-gpu 0,1"; training is faster!

I know that cuda-convnet2 uses the method introduced in "Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks." Is that the method Caffe uses for multi-GPU parallelism?

@tuonion

tuonion commented Aug 19, 2016

Hi @shelhamer,

Does Caffe support the model / hybrid parallelism you mentioned above?

@Lisandro79

Hi,
I am working on a project that requires the use of 4 GPUs on a server to analyze images. I would like to do it in Caffe (I prefer it over Torch or TensorFlow), but it seems that multi-GPU support is still not available for test / inference.

Is there any estimated date for a version update of caffe that will allow using multiple GPUs for test / inference?

Thanks a lot

@cypof
Member

cypof commented Dec 7, 2016 via email

@Lisandro79

Lisandro79 commented Dec 7, 2016

Hi Cypof,

Thanks for your fast response. I am not sure I completely understand your suggestions. Let me first provide you with more details about the task.

We have a web server that receives requests (image files) from users. This server has a queue of requests at any given time. Our workstation has 4 Titan X GPUs on one motherboard, so I need to use all 4 GPUs to speed up the processing of my queue (ideally by a factor of four). The requests will be handled as follows:

request 1 GPU:0;
request 2 GPU:1;
request 3 GPU:2;
request 4 GPU:3;
request 5 GPU:0;
request 6 GPU:1;
...
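
The assignment pattern listed above is plain round-robin. As a small sketch (the helper name `assign_round_robin` and the request labels are hypothetical, and no GPU work is done here), it could be expressed as:

```python
from itertools import cycle


def assign_round_robin(requests, gpu_ids):
    """Pair each request with a GPU id, cycling through gpu_ids in order."""
    return list(zip(requests, cycle(gpu_ids)))


schedule = assign_round_robin(
    ["request 1", "request 2", "request 3",
     "request 4", "request 5", "request 6"],
    gpu_ids=[0, 1, 2, 3],
)
for request, gpu in schedule:
    print(f"{request} GPU:{gpu}")
```

This only decides which GPU a request should go to; actually running requests on different GPUs at the same time is a separate concern.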

I am using Caffe with the Python API. The problem comes with the selection of the GPU. If I select GPU 0 for the first request in the queue:

import caffe
caffe.set_mode_gpu()
caffe.set_device(0)
# run inference

then I cannot select GPU 1 for the next request. Even if I load the Caffe model and a model in Torch, Torch cannot use GPU 1 after Caffe has set the device to '0', because Caffe "locks" all other GPUs.

So regarding your suggestions

1- "You can split your dataset and test each part independently".
I think this solution does not apply to this case (please correct me if I am wrong)

2- "Distribute items to multiple nets as you go"
Is this similar to what I described above (using one network in Caffe and another network in Torch on different GPUs), but with multiple nets in Caffe? Could you please elaborate a bit more on this?

Thank you very much for your help

@pythonanonuser

@Lisandro79 did you figure out a solution to this issue? I have a similar problem

@Lisandro79

Hi @pythonanonuser,

Unfortunately, I have to say that my solution will be to use TensorFlow. I could not find a solution to my problem in Caffe.

In my opinion, the lack of parallelism for testing of Caffe is a major disadvantage for deployment of web applications. I would like to know what other members of the Caffe community have to say about this.

I really like Caffe and I would prefer to use it over other libraries, but at the moment I find that the parallel capabilities of Tensorflow plus the use of tensorboard do make a difference for production.

Best

@cypof
Member

cypof commented Dec 26, 2016

I meant to prototype something but haven't got to it. I think the easiest way would be to use multiprocessing: either a Queue with one Process per GPU, or maybe a Pool so that you can call map() on your inputs and directly get your outputs. It also depends on where you want to store the results etc.
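
A rough sketch of the Queue-and-one-Process-per-GPU variant follows. It is untested against real GPUs and makes several assumptions: `load_model` is a hypothetical placeholder that returns a dummy function so the example runs without Caffe, and the function names, sentinel, and file names are illustrative only. With pycaffe, each process would select its device once inside `load_model`, using the `caffe.set_mode_gpu()` and `caffe.set_device()` calls quoted earlier in this thread.

```python
import multiprocessing as mp

STOP = None  # sentinel telling a worker to exit


def load_model(gpu_id):
    """Hypothetical placeholder. With pycaffe, this is where each process
    would call caffe.set_mode_gpu() and caffe.set_device(gpu_id) and then
    build its own net, so every process owns exactly one GPU."""
    return lambda image: f"result({image}) on GPU:{gpu_id}"


def worker(gpu_id, tasks, results):
    """Pull (index, image) items from the shared queue until STOP arrives."""
    model = load_model(gpu_id)
    while True:
        item = tasks.get()
        if item is STOP:
            break
        index, image = item
        results.put((index, model(image)))


def run_inference(images, gpu_ids):
    """Start one process per GPU and return results in input order."""
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(g, tasks, results))
             for g in gpu_ids]
    for p in procs:
        p.start()
    for item in enumerate(images):
        tasks.put(item)
    for _ in procs:
        tasks.put(STOP)  # one sentinel per worker
    # Drain the results before joining so no worker blocks on a full queue.
    out = dict(results.get() for _ in images)
    for p in procs:
        p.join()
    return [out[i] for i in range(len(images))]
```

A call such as `run_inference(["a.jpg", "b.jpg", "c.jpg"], gpu_ids=[0, 1])` returns one result per image in input order; which GPU handles which image depends on scheduling. On platforms that spawn rather than fork new processes, the call should sit under an `if __name__ == "__main__":` guard. Because the workers share one task queue, an idle GPU picks up the next request on its own, so no explicit round-robin assignment is needed.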

@shelhamer
Member Author

Closing as NCCL + pycaffe #4563 is an effective approach to data parallel training of any kind of Caffe net. More involved forms of parallelism can be left to further efforts.
