Multi-GPU operation and data / model Parallelism #876

shelhamer · 2014-08-07T07:04:35Z

Multi-GPU operation and data / model / hybrid parallelism are planned and in development for Caffe. The purpose of the thread is to focus the conversation, since this has been asked here, there, and everywhere. There are several ways to approach parallelization, so feel free to discuss your own work to this end here.

Note that Caffe does work with multiple GPUs in a standalone fashion right now: you can train on one GPU while extracting features on another and so on.

Yangqing · 2014-08-08T06:07:32Z

Note that training with multiple GPUs + data parallelism is also trivially possible with MPI. Model parallelism is less trivial, though.

kloudkl · 2014-08-08T07:45:17Z

Does data parallelism imply shared model parameters? If it doesn't, and the data is split beforehand, data parallelism can be implemented with a shell script.

kloudkl · 2014-08-08T08:14:15Z

Unfortunately, the weights of the replicas of the same layer have to be synchronized as described in the section 4.1 of [1].

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997 [cs.NE]
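
As a toy illustration of the synchronization requirement mentioned above, here is a minimal pure-Python sketch of synchronous data parallelism. It is not Caffe or cuda-convnet2 code: the one-parameter least-squares model, the names `gradient` and `data_parallel_sgd`, and the learning rate are all assumptions made up for the example. Each replica computes a gradient on its own shard, the gradients are averaged (the role an all-reduce would play across GPUs), and every replica applies the same update so the copies of the weight never diverge.

```python
def gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over the (x, y) pairs in shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_sgd(data, num_replicas, lr=0.1, steps=50, w0=0.0):
    """Toy synchronous data-parallel SGD on a single scalar weight.

    Assumes len(data) is divisible by num_replicas, so that averaging the
    per-shard mean gradients equals the full-batch mean gradient.
    """
    # One shard per replica (one replica would live on each GPU).
    shards = [data[i::num_replicas] for i in range(num_replicas)]
    weights = [w0] * num_replicas
    for _ in range(steps):
        # Each replica computes a gradient on its own shard only.
        grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
        # Synchronization step: average the gradients across replicas.
        avg_grad = sum(grads) / num_replicas
        # Every replica applies the identical update, keeping weights in sync.
        weights = [w - lr * avg_grad for w in weights]
    return weights
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the replicas follow the same trajectory as single-device training. Without the averaging step, each replica would drift toward a fit of its own shard only.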

kloudkl · 2014-08-08T09:07:23Z

There are tens of thousands of lines in the diff between cuda-convnet2/cudaconvnet and cuda-convnet. It may not be plausible to reproduce all of Alex's work in a short time. Is it acceptable to just wrap it as @soumith did in cuda-convnet2.torch? The members of the Pylearn2 community have successfully wrapped cuda-convnet and are planning to upgrade to cuda-convnet2.

visionscaper · 2014-08-11T16:26:17Z

cc me

palmforest · 2014-08-12T08:19:54Z

Hope this becomes real in Caffe soon.

bug-fixed · 2014-08-12T09:17:34Z

I have a thought that I am posting here for discussion.
When the training data is spread across a distributed environment, is it possible or necessary to develop a dispatch server that works together with a parameter database server?
Each independent client has its own computing resources (such as NVIDIA GPUs, AMD GPUs, or others) and runs a client program. The clients communicate with the dispatch server or the parameter database, and the dispatch server is responsible only for coordinating updates to the parameter database. Each client should expose interfaces for completing independent tasks on its own part of the data, and the client programs should adapt to the available computing resources for best performance.

kloudkl · 2014-08-16T23:21:01Z

The training can be done by cuda-convnet2. It's only necessary to convert the model into Caffe's format.

kloudkl · 2014-08-30T17:17:47Z

The "trivially possible data parallelism" can be implemented with CUDA Multi Process Service (MPS).

kloudkl · 2014-08-30T17:40:30Z

@Yangqing, could you give us some hints on how to integrate Caffe and CUDA MPS with MPI? Do the solvers of the processes have to communicate with each other? Or do they share the same CUDA context, which automatically combines the memory of multiple GPUs into a single virtual address space?

bhack · 2014-08-30T18:12:52Z

I wonder whether, across multiple hosts, we could explore something with Spark and the Caffe Python bindings. There are also already some deep network experiments on Spark.

kloudkl · 2014-08-31T14:15:50Z

Mixing multiple languages together is not a good idea.

bhack · 2014-08-31T14:26:14Z

@kloudkl I'm talking about using the Python bindings already available in Caffe from PySpark.

kloudkl · 2014-08-31T14:50:49Z

The remarks of Evan Sparks @etrain posted by @sergeyk prove that it is plausible to "use cuda-convnet2 to train the models offline, but use Caffe to parse/analyze them and apply the models to new input images".

madisonmay · 2014-09-08T05:38:53Z

With the recent addition of cuDNN to the dev branch of Caffe, are multiple GPUs now supported? The recent article on cuDNN indicates that cuDNN supports parallelism across GPUs, but doesn't mention whether this support is present in the Caffe wrapper.

shelhamer · 2014-09-08T06:30:23Z

Caffe and cuDNN alike are single-GPU libraries at the moment, but they can be run on multiple GPUs simultaneously in a standalone way.

Multi-GPU parallelism is still in development in Caffe.

madisonmay · 2014-09-08T17:46:41Z

In other words, multiple gpus can be used for tasks like hyperparameter selection but not to allow more efficient training of a single model?

bhack · 2014-11-22T23:17:53Z

There is something interesting in TBB flow graph parallelization. This is a feature detector example.

futurely · 2015-07-01T02:30:23Z

While #2114 and a series of related PRs #1535 (comment) have solved the data parallelism, there isn't yet a pull request dedicated to model parallelism. Here is Facebook's implementation just for reference.

PhoenixDai · 2015-07-28T15:09:29Z

It looks like Nvidia has figured out how to train models in Caffe with multiple GPUs. This page https://developer.nvidia.com/digits shows a performance comparison of training with different numbers of GPUs. The video on the page shows how to use multiple GPUs with DIGITS. Is there any plan to have this feature in Caffe?

choosehappy · 2015-07-28T15:32:50Z

@PhoenixDai 's comment is related to the digits github issue: NVIDIA/DIGITS#92. It seems they have forked their own caffe version which supports multiple gpus?

thatguymike · 2015-07-28T16:02:30Z

The Nvidia branch uses #2114

shaibagon · 2015-11-16T07:03:42Z

Linking to a related SO question.

futurely · 2015-11-16T18:32:53Z

The master branch supports multi-GPU training. Please refer to the latest documents.
https:/BVLC/caffe/blob/master/docs/multigpu.md
https:/BVLC/caffe/blob/master/docs/tutorial/interfaces.md

xiaoxiongli · 2016-06-22T08:40:21Z

Thank you for your great work! I am now training GoogLeNet on my K80. As you know, the K80 has 2 cores, and I enable both of them with "-gpu 0,1". The training speed is faster!

I know that cuda-convnet2 uses the method introduced in "Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks." Is that the method Caffe uses for multi-GPU parallelism?

tuonion · 2016-08-19T01:56:52Z

Hi @shelhamer,

Does Caffe support 'model / hybrid parallelism' as you mentioned above?

Lisandro79 · 2016-12-07T03:07:11Z

Hi,
I am working on a project that requires the use of 4 GPUs on a server to analyze images. I would like to do it in Caffe (I prefer it over Torch or TensorFlow), but it seems that multi-GPU support is still not available for test / inference.

Is there any estimated date for a version update of caffe that will allow using multiple GPUs for test / inference?

Thanks a lot

cypof · 2016-12-07T06:08:58Z

We don't have a plan to add that right now but I would be happy to help. You can split your dataset and test each part independently, or distribute items to multiple nets as you go. Are you using the Python API?
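
For the first option (splitting the dataset and testing each part independently), a minimal sketch could look like the following. The function name `split_dataset` is hypothetical and nothing here is part of Caffe; it only shows one way to cut a list of inputs into nearly equal contiguous parts, each of which could then be handed to a separate process pinned to one GPU.

```python
def split_dataset(items, num_parts):
    """Split items into num_parts contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), num_parts)
    parts = []
    start = 0
    for i in range(num_parts):
        # The first `extra` parts take one additional item each.
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts
```

This suits offline batches where all inputs are known up front. For a live request queue, distributing items as they arrive is probably the better fit.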

Lisandro79 · 2016-12-07T11:11:40Z

Hi Cypof,

Thanks for your fast response. I am not sure I completely understand your suggestions. Let me first provide you with more details about the task.

We have a web server that receives requests (image files) from users. This server has a queue of requests at any given time. Our workstation has 4 Titan X GPUs on a single motherboard, so I need to use all 4 GPUs to cut the processing time of my queue (ideally by a factor of four). The requests will be handled as follows:

request 1 GPU:0;
request 2 GPU:1;
request 3 GPU:2;
request 4 GPU:3;
request 5 GPU:0;
request 6 GPU:1;
...
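
The round-robin assignment listed above can be sketched in a few lines of Python. This is only an illustration of the scheduling pattern: `assign_round_robin` is a hypothetical helper, and it does not touch Caffe or any GPU.

```python
from itertools import cycle


def assign_round_robin(requests, num_gpus):
    """Pair each request with a GPU index, cycling 0, 1, ..., num_gpus - 1."""
    gpu_ids = cycle(range(num_gpus))
    return [(request, next(gpu_ids)) for request in requests]
```

For six requests and four GPUs this gives GPU indices 0, 1, 2, 3, 0, 1, matching the list above.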

I am using Caffe with the Python API. The problem comes with the selection of the GPU. If I select GPU 0 for the first request in the queue

caffe.set_mode_gpu()
caffe.set_device(0)
# run inference

then I cannot select GPU 1 for the next request. Even if I load one model in Caffe and another in Torch, Torch cannot use GPU 1 after Caffe has set the device to 0, because Caffe "locks" all other GPUs.

So regarding your suggestions

1- "You can split your dataset and test each part independently".
I think this solution does not apply to this case (please correct me if I am wrong)

2- "Distribute items to multiple nets as you go"
Is this similar to what I described above (using one network in Caffe and another network in Torch on different GPUs) but with multiple nets in Caffe? Could you please elaborate a bit more on this?

Thank you very much for your help

pythonanonuser · 2016-12-24T08:07:50Z

@Lisandro79 did you figure out a solution to this issue? I have a similar problem

Lisandro79 · 2016-12-24T10:01:09Z

Hi @pythonanonuser,

Unfortunately, I have to say that my solution will be to use TensorFlow. I could not find a solution to my problem in Caffe.

In my opinion, the lack of parallelism for testing of Caffe is a major disadvantage for deployment of web applications. I would like to know what other members of the Caffe community have to say about this.

I really like Caffe and I would prefer to use it over other libraries, but at the moment I find that the parallel capabilities of TensorFlow plus the use of TensorBoard do make a difference for production.

Best

cypof · 2016-12-26T21:48:25Z

I meant to prototype something but haven't got to it. I think the easiest way would be to use multiprocessing: either a Queue with one Process per GPU, or maybe a Pool so that you can call map() on your inputs and directly get your outputs. It also depends on where you want to store the results, etc.
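
A rough, untested-on-GPU sketch of the Queue-plus-one-Process-per-GPU variant might look like this. Everything except the standard `multiprocessing` calls is an assumption: `run_inference` is a placeholder standing in for a real forward pass, and `worker` / `run_on_gpus` are hypothetical names. In a real worker, the comment marks where `caffe.set_mode_gpu()` and `caffe.set_device(gpu_id)` would be called once, inside the child process, before loading the net.

```python
import multiprocessing as mp


def run_inference(gpu_id, item):
    """Placeholder for a real forward pass; just tags the item with its GPU."""
    return (item, gpu_id)


def worker(gpu_id, tasks, results):
    """Consume (index, item) jobs until a None sentinel arrives."""
    # Assumption: a real worker would import caffe here, call
    # caffe.set_mode_gpu() and caffe.set_device(gpu_id), and load its net once.
    while True:
        job = tasks.get()
        if job is None:
            break
        index, item = job
        results.put((index, run_inference(gpu_id, item)))


def run_on_gpus(items, num_gpus):
    """Process items with one worker process per GPU; outputs keep input order."""
    items = list(items)
    tasks = mp.Queue()
    results = mp.Queue()
    procs = [mp.Process(target=worker, args=(gpu_id, tasks, results))
             for gpu_id in range(num_gpus)]
    for proc in procs:
        proc.start()
    for job in enumerate(items):
        tasks.put(job)
    for _ in procs:
        tasks.put(None)  # one stop sentinel per worker
    outputs = [None] * len(items)
    for _ in items:
        index, output = results.get()
        outputs[index] = output
    for proc in procs:
        proc.join()
    return outputs
```

`run_on_gpus(items, num_gpus=4)` should be called from under an `if __name__ == "__main__":` guard, as multiprocessing requires on platforms that spawn fresh interpreters. The idea is that each process selects its device exactly once, so no single process ever needs to switch GPUs. Whether this fully avoids the device-locking behaviour described earlier in the thread is something to verify on real hardware.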

shelhamer · 2017-03-23T07:13:51Z

Closing as NCCL + pycaffe #4563 is an effective approach to data parallel training of any kind of Caffe net. More involved forms of parallelism can be left to further efforts.

shelhamer added the enhancement label Aug 7, 2014

This was referenced Aug 7, 2014

Multi-GPU Parallelism / Distributed Computation in Caffe? #653

Closed

Is there any plan to make caffe support data and model parallelism? #875

Closed

Try to extract Convolution code from cuda-convnet2 #830

Closed

kloudkl mentioned this issue Aug 8, 2014

(WIP) Overlap the CUDA data transfers with computations with cudaMemcpyAsync #884

Closed

bhack mentioned this issue Nov 22, 2014

Conditional layer #1448

Closed

bhack mentioned this issue Apr 20, 2015

OpenCL Backend #2195

Closed

shelhamer closed this as completed Mar 23, 2017
