Multi-GPU #2114

Closed
wants to merge 15 commits into from

Conversation

cypof
Member

@cypof cypof commented Mar 12, 2015

Uses CUDA peer-to-peer for communication, and parts of #1148. SGD is now synchronous instead of asynchronous, as @longjon showed that bandwidth on one box is actually high enough. We haven’t really benchmarked yet, but it seems to work great. It also gets rid of the momentum coordination problem.

The synchronization code needs to hook into the solver, so it is a bit more invasive than before, but still pretty isolated. I refactored solver.cpp to separate the regularization and gradient compute phases so that they can be invoked at different times by the parallel solver.
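As a rough illustration of the kind of split described above (method names are hypothetical, not necessarily the PR's exact structure), separating the two phases lets a parallel solver synchronize gradients in between:

  // Sketch only: weight decay and the learning-rate/momentum update are
  // applied in separate steps, so a parallel solver can reduce the diffs
  // across GPUs between the gradient computation and the update.
  template <typename Dtype>
  void SGDSolver<Dtype>::ApplyUpdate() {
    Dtype rate = GetLearningRate();
    for (int param_id = 0; param_id < this->net_->params().size(); ++param_id) {
      Regularize(param_id);                // add weight decay to the gradient
      ComputeUpdateValue(param_id, rate);  // momentum and learning rate
    }
    this->net_->Update();                  // apply the accumulated update
  }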

One thing still missing is the way to compute the actual number of iterations. For now each solver runs as if it were by itself, so the run takes as long as it would without parallelism. I guess we could adapt the solvers to run 1/N of the steps instead. The batch size should also be experimented with, as it is now effectively N times larger. On that note, would it be more convenient to measure progress in number of images instead of iterations, to be independent of batch size?
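To make the bookkeeping concrete (a hypothetical helper, not part of the PR), measuring progress in images makes the single- and multi-GPU setups directly comparable:

  // With N solvers each consuming batch_size images per step, progress in
  // images is independent of how the work is split across GPUs.
  int images_processed(int iterations, int batch_size, int num_solvers) {
    return iterations * batch_size * num_solvers;
  }
  // e.g. running iterations / N steps on N GPUs covers the same number of
  // images as the single-GPU run.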

To try it, run the samples in examples/parallel/.

@wkal

wkal commented Mar 14, 2015

Hi! I tried this branch, but when I ran gpus.bin on MNIST it failed. The command I used:
../../build/examples/parallel/gpus.bin ../../examples/mnist/lenet_solver.prototxt 0:1

The error output is:
F0314 14:28:16.295174 34820 p2p_sync.cpp:139] Check failed: error == cudaSuccess (51 vs. 0) peer access has not been enabled
*** Check failure stack trace: ***
@ 0x7f7ae5377daa (unknown)
@ 0x7f7ae5377ce4 (unknown)
@ 0x7f7ae53776e6 (unknown)
@ 0x7f7ae537a687 (unknown)
@ 0x7f7ae57122b2 caffe::P2PSync<>::GPU::~GPU()
@ 0x7f7ae57124a9 caffe::P2PSync<>::GPU::~GPU()
@ 0x40db4e boost::detail::sp_counted_base::release()
@ 0x40dbd5 caffe::P2PSync<>::~P2PSync()
@ 0x40b6c2 main
@ 0x7f7ae4686ec5 (unknown)
@ 0x40b96e (unknown)
@ (nil) (unknown)
Aborted (core dumped)

@cypof
Member Author

cypof commented Mar 15, 2015

That's strange, it seems to be during shutdown when destructors get called. Did it train at all before reaching the end?

@wkal

wkal commented Mar 15, 2015

It didn't print any log output (network configuration, training/testing loss, testing accuracy) to the screen. But when I ran nvidia-smi after starting gpus.bin, I could see that it really was using GPU 0 and GPU 1. The GPU monitor shows:
0 28026 ../../build/examples/parallel/gpus.bin 122MiB
1 28026 ../../build/examples/parallel/gpus.bin 112MiB

@wkal

wkal commented Mar 15, 2015

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:02:00.0     Off |                  Off |
|  23%   44C    P0  106W / 235W |   148MiB / 12287MiB  |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:03:00.0     Off |                    0 |
|  23%   46C    P0  104W / 235W |   137MiB / 11519MiB  |     72%      Default |

@wkal

wkal commented Mar 15, 2015

So from the monitor output above, I guess training did start, but I don't know why the error occurred or how to fix it. I hope you can give me some suggestions. Thanks!

@wkal

wkal commented Mar 15, 2015

When I add debug code to see what happens before this error:
LOG(INFO) << "access: " << access;
LOG(INFO) << "device: " << device;
LOG(INFO) << "peer: " << peer;
the error becomes:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
what(): boost: mutex lock failed in pthread_mutex_lock: Invalid argument
*** Aborted at 1426398733 (unix time) try "date -d @1426398733" if you are using GNU date ***
PC: @ 0x7fc456963bb9 (unknown)
*** SIGABRT (@0x3e800001e78) received by PID 7800 (TID 0x7dc41ded3700) from PID 7800; stack trace: ***
@ 0x7fc456d03340 (unknown)
@ 0x7fc456963bb9 (unknown)
@ 0x7fc456966fc8 (unknown)
@ 0x7fc4571876b5 (unknown)
@ 0x7fc457185836 (unknown)
@ 0x7fc457185863 (unknown)
@ 0x7fc457185aa2 (unknown)
@ 0x7fc4579a93f1 boost::throw_exception<>()
@ 0x7fc4579a96a5 boost::unique_lock<>::lock()
@ 0x7fc4579a9f48 boost::condition_variable::wait()
@ 0x7fc4579ab238 caffe::blocking_queue<>::pop()
@ 0x7fc457a72f8d caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7fc44f14fa4a (unknown)
@ 0x7fc456cfb182 start_thread
@ 0x7fc456a27fbd (unknown)
@ 0x0 (unknown)
Aborted (core dumped)

The changed code:

template <typename Dtype>
P2PSync<Dtype>::GPU::~GPU() {
  CUDA_CHECK(cudaStreamDestroy(stream_));
  for (int i = 0; i < sync_.params().size(); ++i) {
    if (i != index_) {
      CUDA_CHECK(cudaFree(targets_[i]));

      const int device = params_.device();
      const int peer = sync_.params()[i]->device();
      int access;
      CUDA_CHECK(cudaDeviceCanAccessPeer(&access, device, peer));
      LOG(INFO) << "access: " << access;
      LOG(INFO) << "device: " << device;
      LOG(INFO) << "peer: " << peer;
      if (access) {
        CUDA_CHECK(cudaDeviceDisablePeerAccess(peer));
      }
    }
  }
}
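For what it's worth, CUDA error 51 is cudaErrorPeerAccessNotEnabled, i.e. cudaDeviceDisablePeerAccess() is being called for a peer that was never successfully enabled (or was already disabled). A hedged sketch of one way to make teardown robust, with hypothetical member names and not the PR's actual fix, is to record which peers were actually enabled:

  // Sketch only.
  std::vector<int> enabled_peers_;  // peers for which enable succeeded

  void EnablePeer(int peer) {
    int device;
    CUDA_CHECK(cudaGetDevice(&device));
    int access = 0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&access, device, peer));
    if (access) {
      cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
      if (err == cudaSuccess) {
        enabled_peers_.push_back(peer);
      } else if (err != cudaErrorPeerAccessAlreadyEnabled) {
        CUDA_CHECK(err);
      }
    }
  }

  void DisablePeers() {
    // Only disable what was recorded as enabled, so the destructor cannot
    // trip over cudaErrorPeerAccessNotEnabled.
    for (size_t i = 0; i < enabled_peers_.size(); ++i) {
      CUDA_CHECK(cudaDeviceDisablePeerAccess(enabled_peers_[i]));
    }
    enabled_peers_.clear();
  }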

@cypof
Member Author

cypof commented Mar 16, 2015

If it does not print any log output, it might be the glog config. Try:
export GLOG_logtostderr=1

@mlcv-ai

mlcv-ai commented Mar 16, 2015

It crashes if I try more than 4 cards at a time. Any idea?

*** SIGSEGV (@0x30) received by PID 4086 (TID 0x7ffb40f6f720) from PID 48; stack trace: ***
@ 0x3d80c0f710 (unknown)
@ 0x40d534 SolverContext::run()
@ 0x408db2 main
@ 0x3d8081ed5d (unknown)
@ 0x408329 (unknown)
Segmentation fault (core dumped)

To provide more information: on the server, I have 8 GPUs. The training ran fine with GPUs 0:1:2:3. However, if I used 0:x for x from 4 to 7, I got this error message: "Check failed: access GPU 0 cannot access GPU x". It crashed if I used any GPU from 4 to 7 for training.

@mlcv-ai

mlcv-ai commented Mar 16, 2015

I seem to have figured out the problem. In line 69 of gpus.cpp, "params[gpu[0]]" should be "params[0]". Similarly, in line 76 of gpus.cpp, "params[device]" should be "params[i]". The segmentation fault was caused by the wrong indexing of the solvers.

@wkal

wkal commented Mar 17, 2015

@hhuytho
I followed your fix, but the error I reported still exists.
The error:
F0317 11:50:21.572767 29807 p2p_sync.cpp:142] Check failed: error == cudaSuccess (51 vs. 0) peer access has not been enabled
*** Check failure stack trace: ***
@ 0x7fdbcffa9daa (unknown)
@ 0x7fdbcffa9ce4 (unknown)
@ 0x7fdbcffa96e6 (unknown)
@ 0x7fdbcffac687 (unknown)
@ 0x7fdbd0344d79 caffe::P2PSync<>::GPU::~GPU()
@ 0x7fdbd0344fc9 caffe::P2PSync<>::GPU::~GPU()
@ 0x40db4e boost::detail::sp_counted_base::release()
@ 0x40dbd5 caffe::P2PSync<>::~P2PSync()
@ 0x40b6c2 main
@ 0x7fdbcf2b8ec5 (unknown)
@ 0x40b96e (unknown)
@ (nil) (unknown)
Aborted (core dumped)
From this error information, I think the error occurs in p2p_sync.cpp. Maybe we used different versions of Caffe? The version I used is the multi_gpu_db_fixes branch (https://github.com/cypof/caffe/tree/multi_gpu_db_fixes). Thanks for sharing!

@mlcv-ai

mlcv-ai commented Mar 18, 2015

I used the "multi_gpu" branch from cypof. Your problem doesn't seem to relate to the bug I listed. Maybe it has something to do with the peer to peer communication between the GPUs?

@GoGoStudio

@cypof Thanks for this PR! I tried it on 2 K80s (4 GPUs) with the multi_gpu branch, and it failed to converge on MNIST. It runs fine, no crashes, and caffe.bin from the same branch converges, so I think there is something wrong in the gpus.cpp code. BTW, I used MNIST on LMDB with cuDNN R1 enabled. Am I missing anything?

@cypof
Member Author

cypof commented Apr 3, 2015

New version; it should be more stable and faster. The gradients are now combined using a map-reduce scheme: first between GPUs on the same board, then within the same CPU socket, then to the root. It has not been tested a lot, but seems to work OK.

The batch size and gradients are scaled so that multi-GPU should now be exactly equivalent to running on a single GPU. I tried comparing the weights, and they are equal at least to 10e-5; the difference must be due to numerical approximation. To make runs reproducible it is necessary to set PREFETCH_COUNT to 1 in data_layer.hpp, otherwise examples might be reordered.
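A minimal sketch of the equivalence being claimed (host-side pseudocode with assumed names, not the PR's GPU reduction): if each of the N solvers computes its gradient on 1/N of the original batch, averaging the per-solver diffs reproduces the single-GPU gradient up to floating-point error.

  // Average per-GPU gradients so the update matches a single-GPU step on the
  // full batch (each GPU used batch_size / num_gpus examples).
  void AverageDiffs(const std::vector<const float*>& gpu_diffs,
                    float* averaged, size_t count) {
    const float scale = 1.0f / gpu_diffs.size();
    for (size_t i = 0; i < count; ++i) {
      float sum = 0.0f;
      for (size_t g = 0; g < gpu_diffs.size(); ++g) {
        sum += gpu_diffs[g][i];
      }
      averaged[i] = sum * scale;
    }
  }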

I also added a bunch of benchmarking logs that can be enabled through constants in Makefile.config. The code is in caffe.cpp instead of the samples, and there is no parallel solver class anymore. Instead it adds a callback on the solver created by caffe.cpp, so it should work with advanced algorithms like AdaGrad etc., although we haven't tried them. I also did some refactoring in solver.cpp, but didn't change anything functionally.
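A rough sketch of the callback idea (the interface names here are an assumption, not necessarily what the PR uses): the synchronization code registers itself with whatever solver caffe.cpp builds, so no solver subclass is needed.

  #include <vector>

  // Hypothetical hook points: sync weights before a step, reduce the
  // gradients once all diffs for the step are ready.
  class Solver {
   public:
    class Callback {
     public:
      virtual ~Callback() {}
      virtual void on_start() = 0;            // e.g. broadcast weights from root
      virtual void on_gradients_ready() = 0;  // e.g. reduce diffs across GPUs
    };
    void add_callback(Callback* callback) { callbacks_.push_back(callback); }
   protected:
    std::vector<Callback*> callbacks_;
  };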

We haven't run all the way on ImageNet yet, but early benchmarks with the default batch sizes and settings show:

AlexNet
2 GPUs: 1.8x
4 GPUs: 2.9x
8 GPUs: 3.4x

GoogLeNet
2 GPUs: 1.9x
4 GPUs: 3.2x
8 GPUs: 4.5x
8 GPUs: 6.6x with a batch of 128

I don't know if we can get much more through data parallelism. 16-bit float transfers might help a bit, and I hope hybrid models using #2219 can go further. Larger batches get better GPU utilization but slower SGD convergence. On AlexNet the batch is already 256, and increasing it seems to hurt convergence a lot. On GoogLeNet going from 32 to 128 should be OK, but we have no numbers yet.

@cypof
Member Author

cypof commented Apr 3, 2015

If you get an error about nvidia-ml, install the GDK: https://developer.nvidia.com/gpu-deployment-kit. Also, this currently only works on CUDA 7.

@thatguymike
Contributor

I think you can get around needing the GDK for your specific case of detecting multi-GPU boards by using direct device queries through the runtime. Something like:

  cudaDeviceProp properties;

  // Query the first GPU.
  if (cudaGetDeviceProperties(&properties, whichGPUs[0]) != cudaSuccess) {
      return false;  // handle the error
  }
  bool isMulti0 = properties.isMultiGpuBoard;
  int boardId0 = properties.multiGpuBoardGroupID;

  // Query the second GPU.
  if (cudaGetDeviceProperties(&properties, whichGPUs[1]) != cudaSuccess) {
      return false;  // handle the error
  }
  bool isMulti1 = properties.isMultiGpuBoard;
  int boardId1 = properties.multiGpuBoardGroupID;

  if (isMulti0 && isMulti1 && boardId0 == boardId1) {
      // same board
  }

}
#ifdef BENCHMARK_SOLVER
LOG(INFO)<< timing.str();
#endif

// Save a snapshot if needed.
if (param_.snapshot() && (iter_ + 1) % param_.snapshot() == 0) {

you might need to add "&& Caffe::root_solver()" here
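For context, the suggested guard would look something like this (a sketch; Caffe::root_solver() is the check referenced in the comment above to distinguish the root from the worker solvers):

  // Only the root solver writes snapshots; worker solvers on the other GPUs
  // skip this block.
  if (param_.snapshot() && (iter_ + 1) % param_.snapshot() == 0
      && Caffe::root_solver()) {
    Snapshot();
  }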

@sontran

sontran commented Apr 10, 2015

Thanks @cypof for this great PR! I've tried it on 2- and 4-GPU machines, local and EC2, with and without P2P access, on different nets, etc. It seems to be pretty stable so far, and the speed-up factors also seem to match your numbers.

@cypof
Member Author

cypof commented Apr 10, 2015

Thanks @thatguymike @sontran for checking this out. I tried the built-in CUDA API before switching to the GDK, but it seemed to return 0 even if the GPUs are on the same board. Maybe I was mistaken, I will look again. I hope the GDK will be merged into CUDA, there is lots of useful stuff in there. In terms of architecture, it seems that on some machines doing an all-to-all is faster than map-reduce. I think it is possible to write simple code that does both, always reducing on dual-GPU cards first, then all-to-all. I will probably stop there for single node; we are unlikely to gain much more this way.

@bhack
Contributor

bhack commented Apr 10, 2015

@cypof how strictly is this tied to CUDA? Do you think it will be portable in the future to other backend efforts like #2195?

@cypof
Member Author

cypof commented Apr 10, 2015

@bhack as long as you can memcpy between two GPUs it should work. For performance you also want a memcpyAsync. Not sure what OpenCL provides.
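For reference, a minimal sketch of the two CUDA primitives mentioned above (pointers and sizes are placeholders); an OpenCL backend would need equivalents of both the peer copy and its stream-ordered asynchronous form.

  #include <cuda_runtime.h>

  // Blocking peer-to-peer copy between devices.
  void copy_between_gpus(void* dst, int dst_device,
                         const void* src, int src_device, size_t bytes) {
    cudaMemcpyPeer(dst, dst_device, src, src_device, bytes);
  }

  // Asynchronous variant, ordered on `stream`, so communication can overlap
  // with compute on the GPUs.
  void copy_between_gpus_async(void* dst, int dst_device,
                               const void* src, int src_device, size_t bytes,
                               cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, dst_device, src, src_device, bytes, stream);
  }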

@thatguymike
Contributor

Getting scaling performance really requires optimal data transfers. The GDK provides methods to get the system topology and information on which GPUs can do direct DMAs (I think we can replace those with standard CUDA calls; working on that now). lspci/hwloc might be usable to get the system topology, but I don't believe there is a standard way in OpenCL to set up and manage P2P DMA transfers between devices.
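As an illustration of the "standard CUDA calls" route (a sketch, not the PR's topology code): the runtime can at least report which device pairs support direct P2P DMA, even though it does not expose the full PCIe topology the way the GDK/NVML does.

  #include <cstdio>
  #include <cuda_runtime.h>

  // Print which device pairs can do direct P2P DMA.
  void print_p2p_matrix() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    for (int a = 0; a < num_devices; ++a) {
      for (int b = 0; b < num_devices; ++b) {
        if (a == b) continue;
        int access = 0;
        cudaDeviceCanAccessPeer(&access, a, b);
        std::printf("GPU %d -> GPU %d: %s\n", a, b,
                    access ? "peer access" : "via host");
      }
    }
  }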

@thatguymike
Contributor

With respect to needing the GDK, here is a work-alike for seeing whether two ASICs are on the same board. Also, one can parse the output of 'nvidia-smi topo -m' to get more topology information:

  // Group GPUs by board
  for (int i = 0; i < remaining.size(); ++i) {
      cudaDeviceProp prop_a;
      CUDA_CHECK(cudaGetDeviceProperties(&prop_a, remaining[i]));
      for (int j = i + 1; j < remaining.size(); ++j) {
          cudaDeviceProp prop_b;
          CUDA_CHECK(cudaGetDeviceProperties(&prop_b, remaining[j]));
          int same_board = 0;
          if(prop_a.isMultiGpuBoard && prop_b.isMultiGpuBoard)
              same_board = prop_a.multiGpuBoardGroupID == prop_b.multiGpuBoardGroupID ? 1 : 0;
          if (same_board) {
              pairs->push_back(DevicePair(remaining[i], remaining[j]));
              DLOG(INFO) << "\n\nSame board: " << remaining[i] << ":" << remaining[j] << " - Board ID:" << prop_a.multiGpuBoardGroupID;
              remaining.erase(remaining.begin() + j);
          }
      }
  }

@bhack
Contributor

bhack commented Apr 11, 2015

I think that P2P will be covered by the HSA initiative of the HSA Foundation chipmakers.

@brenton2015

@cypof Could you show the GPU topology on 4-GPU machines?

@thatguymike
Contributor

This depends on the general system topology. The code currently tries to align the tree-reduction strategy with the system. For example, if you have a dual-GPU board, the code attempts to do the first reduction on that board, and then one of those ASICs communicates with one ASIC on the other dual-GPU board.

If you have 4 independent GPUs, it tries to find P2P pairs, but the current code doesn't go much deeper into the topology, e.g. looking for host PCIe bridges (like PLX). It will, however, attempt to avoid crossing PCIe root complexes, where P2P doesn't work, until the top of the reduction tree.

Basically, the code tries to optimize bandwidth use at each level of the reduction tree. It should do the right thing for most 4-GPU systems. For more complex server systems with PLX bridges or multiple root complexes, there is more tuning to be done.

However, the primary issue in the code at the moment is that there is a race somewhere, and even with the incoming random seed fixed we get different results from run to run. This can lead to things failing to converge.

@wkal

wkal commented Apr 19, 2015

Sorry, my previous comment was not clear, so I am adding this one to correct it. I have 4 GPUs to run the current multi-GPU branch; however, I can only run on GPUs 0 and 1 successfully. If I try to run on GPUs 2 and 3, Caffe fails with:
*** SIGSEGV (@0x0) received by PID 1503 (TID 0x7f40e8d7e9c0) from PID 0; stack trace: ***
@ 0x7f40e79ce340 (unknown)
@ 0x41267b SolverContext::run()
@ 0x40b520 main
@ 0x7f40e7619ec5 (unknown)
@ 0x40b96e (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)
The command is: ../../build/examples/parallel/gpus.bin lenet_solver.prototxt 2:3.

This is unexpected, because I can run GPU 2 and GPU 3 each in single-GPU mode, and I can run on GPUs 0 and 1 at the same time, so I see no reason why GPUs 2 and 3 cannot run together.

@sontran
My system also has 4 GPUs; why can't I run on GPU 2 and GPU 3? Do I need to adjust the topology of the GPUs? Could you help me?

@cypof
Member Author

cypof commented Apr 19, 2015

@wkal I see you are running the gpus.bin executable. The code is now in caffe.bin itself, so you might be running an old version. Try a make clean and run caffe train instead.

bhack mentioned this pull request Apr 20, 2015
@bhack
Contributor

bhack commented Jul 28, 2015

NVIDIA=CUDA?

@naibaf7
Member

naibaf7 commented Jul 28, 2015

@bhack @cypof
I think he meant that they just have an NVIDIA GPU setup right now and thus don't have any practical interest themselves in doing the port to OpenCL, which is fine.
I'd just like to see how the two backends would compare, and whether multi-GPU could even work across backends (OpenCL to CUDA). It's just for the thrill of it, because flexibility is always a good thing.

@thatguymike
Contributor

The same techniques should generally work. However, there are going to be some issues with getting good performance, notably no standard way to do P2P DMA transfers between devices and no standard way to walk the topology of the PCIe setup. Without P2P DMA you become communication bound VERY quickly, since everything has to be copied through the host. Walking the topology to figure out who can talk to whom at high speed is also part of this. Multi-device support in OpenCL is generally a little trickier to deal with, at least if you need tight communication and synchronization between devices.

@bhack
Contributor

bhack commented Jul 28, 2015

@thatguymike
Contributor

Yes, I worked on the AMD extension many years ago and was a member of the committee for many years. ;-) My point was that there is no standard way in OpenCL to do this. The memory model is a little tight, and you can see some of the issues crop up when using the extension, specifically around map/unmap and synchronization on buffers between multiple devices and the host.

Again, all of the work @cypof and I did on this PR should functionally port to OpenCL.

@bhack
Contributor

bhack commented Jul 28, 2015

@thatguymike It is easy to create universal features on hardware where you are the only vendor ;)

@naibaf7
Member

naibaf7 commented Jul 28, 2015

@thatguymike
Thanks for the input. Of course I'll start with a functional version first and then try to see which memory transfers can be done with P2P DMA. I have plenty of GPUs here to test with, so I'll see.

@shelhamer
Member

This is very close to merge, and our only reservation is how to refactor 346e262 for clarity, since the use of member variables and the naming of InternalThreadEntry() and InternalThread::entry() are not immediately obvious. We should have an alternative commit ready soon.

@naibaf7
Member

naibaf7 commented Aug 3, 2015

Most of the code seems compatible with #2610, except for device initialization and the synced memory. Those will have to be refactored to the DeviceContext abstraction of #2610. I'll do that after the merge. After that, it should be relatively easy to also support OpenCL devices.

@ghost

ghost commented Aug 4, 2015

Does anyone else have the issue of the loss blowing up after a few thousand iterations? I trained on ImageNet with 2 GeForce 980 GPUs; initially the losses were normal, decreasing from 7.xx, then suddenly the loss blew up to ~87 after ~20k iterations. The batch size was 128, and the learning rate, momentum, stepsize, etc. were the same as in the bvlc_reference_caffenet solver. I had to resume from the 10k snapshot, and it trained better again. This is not a one-time thing; it has happened to me at least 4-5 times, and it has been reproduced on another machine as well (with Titan Xs).

@cdoersch
Contributor

cdoersch commented Aug 4, 2015

@sri-25 I haven't seen an issue like this. Just to be sure (since you didn't mention it in your post), the network you're training works on a single GPU with twice the batch size?

shelhamer mentioned this pull request Aug 6, 2015
@shelhamer
Member

See #2870 for my packaging of this PR for merge -- the history preserves authorship and I only addressed my concerns about InternalThread, a minor TODO, and style fixes.

@ghost

ghost commented Aug 6, 2015

@cdoersch Yeah, it trains well on a single GPU (with 2x batch size). FYI, I'm training on ImageNet with bvlc_reference_caffenet solver/train_val.

@@ -82,7 +96,7 @@ class GradientBasedSolverTest : public MultiDeviceTest<TypeParam> {
" value: 1.0 "
" } "
" data_filler { "
" type: 'gaussian' "
" type: '" << (run.const_data ? "constant" : "gaussian") << "' "

The targets top needs to always be Gaussian, or else the test is ineffective. The constant vs. gaussian condition should apply to the filler for the data. Switching this reveals that the CheckAccumulation test is broken since it doesn't set constant data. More ominously, it reveals that all the TestLeastSquares* tests fail for multi-GPU. Let's follow up on #2870.

@shelhamer
Member

Closing since this is continued in #2870.

@loulansuiye

@thatguymike I replaced TensorRT 2.x with version 3.x and cuDNN 7.x, but the Caffe Makefile.config cannot find cuDNN now. Did you run into a similar issue, or do you have any other suggestions? Thank you.
