Multi-GPU #2114

Closed
wants to merge 15 commits into from

Conversation

cypof
Member

@cypof cypof commented Mar 12, 2015

Uses CUDA peer-to-peer for communication, and parts of #1148. SGD is now synchronous instead of asynchronous, as @longjon showed that bandwidth on one box is actually high enough. We haven’t really benchmarked yet, but it seems to work great. It also gets rid of the momentum coordination problem.

The synchronization code needs to hook into the solver, so it is a bit more invasive than before, but still pretty isolated. I refactored solver.cpp to separate the regularization and gradient compute phases so that they can be invoked at different times by the parallel solver.
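As a rough illustration of the kind of split described above (method names are hypothetical, not necessarily the PR's exact structure), separating the two phases lets a parallel solver synchronize gradients in between:

  // Sketch only: weight decay and the learning-rate/momentum update are
  // applied in separate steps, so a parallel solver can reduce the diffs
  // across GPUs between the gradient computation and the update.
  template <typename Dtype>
  void SGDSolver<Dtype>::ApplyUpdate() {
    Dtype rate = GetLearningRate();
    for (int param_id = 0; param_id < this->net_->params().size(); ++param_id) {
      Regularize(param_id);                // add weight decay to the gradient
      ComputeUpdateValue(param_id, rate);  // momentum and learning rate
    }
    this->net_->Update();                  // apply the accumulated update
  }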

One thing still missing is the way to compute the actual number of iterations. For now each solver runs as if it were by itself, so the run takes as long as it would without parallelism. I guess we could adapt the solvers to run 1/N of the steps instead. The batch size should also be experimented with, as it is now effectively N times larger. On that note, would it be more convenient to measure progress in number of images instead of iterations, to be independent of batch size?
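To make the bookkeeping concrete (a hypothetical helper, not part of the PR), measuring progress in images makes the single- and multi-GPU setups directly comparable:

  // With N solvers each consuming batch_size images per step, progress in
  // images is independent of how the work is split across GPUs.
  int images_processed(int iterations, int batch_size, int num_solvers) {
    return iterations * batch_size * num_solvers;
  }
  // e.g. running iterations / N steps on N GPUs covers the same number of
  // images as the single-GPU run.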

To try it, run the samples in examples/parallel/.

@wkal

wkal commented Mar 14, 2015

Hi! I tried this branch, but when I ran gpus.bin on MNIST it failed. The command I used:
../../build/examples/parallel/gpus.bin ../../examples/mnist/lenet_solver.prototxt 0:1

The error output is:
F0314 14:28:16.295174 34820 p2p_sync.cpp:139] Check failed: error == cudaSuccess (51 vs. 0) peer access has not been enabled
*** Check failure stack trace: ***
@ 0x7f7ae5377daa (unknown)
@ 0x7f7ae5377ce4 (unknown)
@ 0x7f7ae53776e6 (unknown)
@ 0x7f7ae537a687 (unknown)
@ 0x7f7ae57122b2 caffe::P2PSync<>::GPU::~GPU()
@ 0x7f7ae57124a9 caffe::P2PSync<>::GPU::~GPU()
@ 0x40db4e boost::detail::sp_counted_base::release()
@ 0x40dbd5 caffe::P2PSync<>::~P2PSync()
@ 0x40b6c2 main
@ 0x7f7ae4686ec5 (unknown)
@ 0x40b96e (unknown)
@ (nil) (unknown)
Aborted (core dumped)

@cypof
Member Author

cypof commented Mar 15, 2015

That's strange, it seems to be during shutdown when destructors get called. Did it train at all before reaching the end?

@wkal

wkal commented Mar 15, 2015

It didn't print any log output (network configuration, training/testing loss, testing accuracy) to the screen. But when I ran nvidia-smi after starting gpus.bin, I could see that it really was using GPU 0 and GPU 1. The GPU monitor shows:
0 28026 ../../build/examples/parallel/gpus.bin 122MiB
1 28026 ../../build/examples/parallel/gpus.bin 112MiB

@wkal

wkal commented Mar 15, 2015

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:02:00.0     Off |                  Off |
|  23%   44C    P0  106W / 235W |   148MiB / 12287MiB  |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 0000:03:00.0     Off |                    0 |
|  23%   46C    P0  104W / 235W |   137MiB / 11519MiB  |     72%      Default |

@wkal

wkal commented Mar 15, 2015

So from the monitor output above, I guess training did start, but I don't know why the error occurred or how to fix it. I hope you can give me some suggestions. Thanks!

@wkal

wkal commented Mar 15, 2015

When I add debug code to see what happens before this error:
LOG(INFO) << "access: " << access;
LOG(INFO) << "device: " << device;
LOG(INFO) << "peer: " << peer;
the error becomes:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
what(): boost: mutex lock failed in pthread_mutex_lock: Invalid argument
*** Aborted at 1426398733 (unix time) try "date -d @1426398733" if you are using GNU date ***
PC: @ 0x7fc456963bb9 (unknown)
*** SIGABRT (@0x3e800001e78) received by PID 7800 (TID 0x7dc41ded3700) from PID 7800; stack trace: ***
@ 0x7fc456d03340 (unknown)
@ 0x7fc456963bb9 (unknown)
@ 0x7fc456966fc8 (unknown)
@ 0x7fc4571876b5 (unknown)
@ 0x7fc457185836 (unknown)
@ 0x7fc457185863 (unknown)
@ 0x7fc457185aa2 (unknown)
@ 0x7fc4579a93f1 boost::throw_exception<>()
@ 0x7fc4579a96a5 boost::unique_lock<>::lock()
@ 0x7fc4579a9f48 boost::condition_variable::wait()
@ 0x7fc4579ab238 caffe::blocking_queue<>::pop()
@ 0x7fc457a72f8d caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
@ 0x7fc44f14fa4a (unknown)
@ 0x7fc456cfb182 start_thread
@ 0x7fc456a27fbd (unknown)
@ 0x0 (unknown)
Aborted (core dumped)

The changed code:

template <typename Dtype>
P2PSync<Dtype>::GPU::~GPU() {
  CUDA_CHECK(cudaStreamDestroy(stream_));
  for (int i = 0; i < sync_.params().size(); ++i) {
    if (i != index_) {
      CUDA_CHECK(cudaFree(targets_[i]));

      const int device = params_.device();
      const int peer = sync_.params()[i]->device();
      int access;
      CUDA_CHECK(cudaDeviceCanAccessPeer(&access, device, peer));
      LOG(INFO) << "access: " << access;
      LOG(INFO) << "device: " << device;
      LOG(INFO) << "peer: " << peer;
      if (access) {
        CUDA_CHECK(cudaDeviceDisablePeerAccess(peer));
      }
    }
  }
}
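For what it's worth, CUDA error 51 is cudaErrorPeerAccessNotEnabled, i.e. cudaDeviceDisablePeerAccess() is being called for a peer that was never successfully enabled (or was already disabled). A hedged sketch of one way to make teardown robust, with hypothetical member names and not the PR's actual fix, is to record which peers were actually enabled:

  // Sketch only.
  std::vector<int> enabled_peers_;  // peers for which enable succeeded

  void EnablePeer(int peer) {
    int device;
    CUDA_CHECK(cudaGetDevice(&device));
    int access = 0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&access, device, peer));
    if (access) {
      cudaError_t err = cudaDeviceEnablePeerAccess(peer, 0);
      if (err == cudaSuccess) {
        enabled_peers_.push_back(peer);
      } else if (err != cudaErrorPeerAccessAlreadyEnabled) {
        CUDA_CHECK(err);
      }
    }
  }

  void DisablePeers() {
    // Only disable what was recorded as enabled, so the destructor cannot
    // trip over cudaErrorPeerAccessNotEnabled.
    for (size_t i = 0; i < enabled_peers_.size(); ++i) {
      CUDA_CHECK(cudaDeviceDisablePeerAccess(enabled_peers_[i]));
    }
    enabled_peers_.clear();
  }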

@cypof
Member Author

cypof commented Mar 16, 2015

If it does not print any log output, it might be the glog config. Try:
export GLOG_logtostderr=1

@mlcv-ai

mlcv-ai commented Mar 16, 2015

It crashes if I try more than 4 cards at a time. Any idea?

*** SIGSEGV (@0x30) received by PID 4086 (TID 0x7ffb40f6f720) from PID 48; stack trace: ***
@ 0x3d80c0f710 (unknown)
@ 0x40d534 SolverContext::run()
@ 0x408db2 main
@ 0x3d8081ed5d (unknown)
@ 0x408329 (unknown)
Segmentation fault (core dumped)

To provide more information: on the server, I have 8 GPUs. The training ran fine with GPUs 0:1:2:3. However, if I used 0:x for x from 4 to 7, I got this error message: "Check failed: access GPU 0 cannot access GPU x". It crashed if I used any GPU from 4 to 7 for training.

@mlcv-ai

mlcv-ai commented Mar 16, 2015

I seem to have figured out the problem. In line 69 of gpus.cpp, "params[gpu[0]]" should be "params[0]". Similarly, in line 76 of gpus.cpp, "params[device]" should be "params[i]". The segmentation fault was caused by the wrong indexing of the solvers.

@wkal

wkal commented Mar 17, 2015

@hhuytho
I followed your fix, but the error I reported still exists.
The error:
F0317 11:50:21.572767 29807 p2p_sync.cpp:142] Check failed: error == cudaSuccess (51 vs. 0) peer access has not been enabled
*** Check failure stack trace: ***
@ 0x7fdbcffa9daa (unknown)
@ 0x7fdbcffa9ce4 (unknown)
@ 0x7fdbcffa96e6 (unknown)
@ 0x7fdbcffac687 (unknown)
@ 0x7fdbd0344d79 caffe::P2PSync<>::GPU::~GPU()
@ 0x7fdbd0344fc9 caffe::P2PSync<>::GPU::~GPU()
@ 0x40db4e boost::detail::sp_counted_base::release()
@ 0x40dbd5 caffe::P2PSync<>::~P2PSync()
@ 0x40b6c2 main
@ 0x7fdbcf2b8ec5 (unknown)
@ 0x40b96e (unknown)
@ (nil) (unknown)
Aborted (core dumped)
From this error information, I think the error occurs in p2p_sync.cpp. Maybe we used different versions of Caffe? The version I used is the multi_gpu_db_fixes branch (https://github.com/cypof/caffe/tree/multi_gpu_db_fixes). Thanks for sharing!

@mlcv-ai

mlcv-ai commented Mar 18, 2015

I used the "multi_gpu" branch from cypof. Your problem doesn't seem to relate to the bug I listed. Maybe it has something to do with the peer to peer communication between the GPUs?

@GoGoStudio

@cypof Thanks for this PR! I tried it on 2 K80s (4 GPUs) with the multi_gpu branch, and it failed to converge on MNIST. It runs fine, no crashes, and caffe.bin from the same branch converges, so I think there is something wrong in the gpus.cpp code. BTW, I used MNIST on LMDB with cuDNN R1 enabled. Am I missing anything?

@cypof
Member Author

cypof commented Apr 3, 2015

New version; it should be more stable and faster. The gradients are now combined using a map-reduce scheme: first between GPUs on the same board, then within the same CPU socket, then to the root. It has not been tested a lot, but seems to work OK.

The batch size and gradients are scaled so that multi-GPU should now be exactly equivalent to running on a single GPU. I tried comparing the weights, and they are equal at least to 10e-5; the difference must be due to numerical approximation. To make runs reproducible it is necessary to set PREFETCH_COUNT to 1 in data_layer.hpp, otherwise examples might be reordered.
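A minimal sketch of the equivalence being claimed (host-side pseudocode with assumed names, not the PR's GPU reduction): if each of the N solvers computes its gradient on 1/N of the original batch, averaging the per-solver diffs reproduces the single-GPU gradient up to floating-point error.

  // Average per-GPU gradients so the update matches a single-GPU step on the
  // full batch (each GPU used batch_size / num_gpus examples).
  void AverageDiffs(const std::vector<const float*>& gpu_diffs,
                    float* averaged, size_t count) {
    const float scale = 1.0f / gpu_diffs.size();
    for (size_t i = 0; i < count; ++i) {
      float sum = 0.0f;
      for (size_t g = 0; g < gpu_diffs.size(); ++g) {
        sum += gpu_diffs[g][i];
      }
      averaged[i] = sum * scale;
    }
  }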

I also added a bunch of benchmarking logs that can be enabled through constants in Makefile.config. The code is in caffe.cpp instead of the samples, and there is no parallel solver class anymore. Instead it adds a callback on the solver created by caffe.cpp, so it should work with advanced algorithms like AdaGrad etc., although we haven't tried them. I also did some refactoring in solver.cpp, but didn't change anything functionally.
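A rough sketch of the callback idea (the interface names here are an assumption, not necessarily what the PR uses): the synchronization code registers itself with whatever solver caffe.cpp builds, so no solver subclass is needed.

  #include <vector>

  // Hypothetical hook points: sync weights before a step, reduce the
  // gradients once all diffs for the step are ready.
  class Solver {
   public:
    class Callback {
     public:
      virtual ~Callback() {}
      virtual void on_start() = 0;            // e.g. broadcast weights from root
      virtual void on_gradients_ready() = 0;  // e.g. reduce diffs across GPUs
    };
    void add_callback(Callback* callback) { callbacks_.push_back(callback); }
   protected:
    std::vector<Callback*> callbacks_;
  };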

We haven't run all the way on ImageNet yet, but early benchmarks with the default batch sizes and settings show:

AlexNet
2 GPUs: 1.8x
4 GPUs: 2.9x
8 GPUs: 3.4x

GoogLeNet
2 GPUs: 1.9x
4 GPUs: 3.2x
8 GPUs: 4.5x
8 GPUs: 6.6x with a batch of 128

I don't know if we can get much more through data parallelism. 16-bit float transfers might help a bit, and I hope hybrid models using #2219 can go further. Larger batches get better GPU utilization but slower SGD convergence. On AlexNet the batch is already 256, and increasing it seems to hurt convergence a lot. On GoogLeNet going from 32 to 128 should be OK, but we have no numbers yet.

@cypof
Member Author

cypof commented Apr 3, 2015

If you get an error about nvidia-ml, install the GDK: https://developer.nvidia.com/gpu-deployment-kit. Also, this currently only works on CUDA 7.

@thatguymike
Contributor

I think you can get around needing the GDK for your specific case of detecting multi-GPU boards by using direct device queries through the runtime. Something like:

  cudaDeviceProp properties;

  // Query the first GPU.
  if (cudaGetDeviceProperties(&properties, whichGPUs[0]) != cudaSuccess) {
      return false;  // handle the error
  }
  bool isMulti0 = properties.isMultiGpuBoard;
  int boardId0 = properties.multiGpuBoardGroupID;

  // Query the second GPU.
  if (cudaGetDeviceProperties(&properties, whichGPUs[1]) != cudaSuccess) {
      return false;  // handle the error
  }
  bool isMulti1 = properties.isMultiGpuBoard;
  int boardId1 = properties.multiGpuBoardGroupID;

  if (isMulti0 && isMulti1 && boardId0 == boardId1) {
      // same board
  }

}
#ifdef BENCHMARK_SOLVER
LOG(INFO)<< timing.str();
#endif

// Save a snapshot if needed.
if (param_.snapshot() && (iter_ + 1) % param_.snapshot() == 0) {

you might need to add "&& Caffe::root_solver()" here
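For context, the suggested guard would look something like this (a sketch; Caffe::root_solver() is the check referenced in the comment above to distinguish the root from the worker solvers):

  // Only the root solver writes snapshots; worker solvers on the other GPUs
  // skip this block.
  if (param_.snapshot() && (iter_ + 1) % param_.snapshot() == 0
      && Caffe::root_solver()) {
    Snapshot();
  }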

@sontran

sontran commented Apr 10, 2015

Thanks @cypof for this great PR! I've tried it on 2- and 4-GPU machines, local and EC2, with and without P2P access, on different nets, etc. It seems to be pretty stable so far, and the speed-up factors also seem to match your numbers.

@cypof
Member Author

cypof commented Apr 10, 2015

Thanks @thatguymike @sontran for checking this out. I tried the built-in CUDA API before switching to the GDK, but it seemed to return 0 even if the GPUs are on the same board. Maybe I was mistaken, I will look again. I hope the GDK will be merged into CUDA, there is lots of useful stuff in there. In terms of architecture, it seems that on some machines doing an all-to-all is faster than map-reduce. I think it is possible to write simple code that does both, always reducing on dual-GPU cards first, then all-to-all. I will probably stop there for single node; we are unlikely to gain much more this way.

@bhack
Contributor

bhack commented Apr 10, 2015

@cypof how strictly is this tied to CUDA? Do you think it will be portable in the future to other backend efforts like #2195?

@cypof
Member Author

cypof commented Apr 10, 2015

@bhack as long as you can memcpy between two GPUs it should work. For performance you also want a memcpyAsync. Not sure what OpenCL provides.
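For reference, a minimal sketch of the two CUDA primitives mentioned above (pointers and sizes are placeholders); an OpenCL backend would need equivalents of both the peer copy and its stream-ordered asynchronous form.

  #include <cuda_runtime.h>

  // Blocking peer-to-peer copy between devices.
  void copy_between_gpus(void* dst, int dst_device,
                         const void* src, int src_device, size_t bytes) {
    cudaMemcpyPeer(dst, dst_device, src, src_device, bytes);
  }

  // Asynchronous variant, ordered on `stream`, so communication can overlap
  // with compute on the GPUs.
  void copy_between_gpus_async(void* dst, int dst_device,
                               const void* src, int src_device, size_t bytes,
                               cudaStream_t stream) {
    cudaMemcpyPeerAsync(dst, dst_device, src, src_device, bytes, stream);
  }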

@thatguymike
Contributor

Getting scaling performance really requires optimal data transfers. The GDK provides methods to get the system topology and information on which GPUs can do direct DMAs (I think we can replace those with standard CUDA calls; working on that now). lspci/hwloc might be usable to get the system topology, but I don't believe there is a standard way in OpenCL to set up and manage P2P DMA transfers between devices.
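As an illustration of the "standard CUDA calls" route (a sketch, not the PR's topology code): the runtime can at least report which device pairs support direct P2P DMA, even though it does not expose the full PCIe topology the way the GDK/NVML does.

  #include <cstdio>
  #include <cuda_runtime.h>

  // Print which device pairs can do direct P2P DMA.
  void print_p2p_matrix() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    for (int a = 0; a < num_devices; ++a) {
      for (int b = 0; b < num_devices; ++b) {
        if (a == b) continue;
        int access = 0;
        cudaDeviceCanAccessPeer(&access, a, b);
        std::printf("GPU %d -> GPU %d: %s\n", a, b,
                    access ? "peer access" : "via host");
      }
    }
  }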

@thatguymike
Contributor

With respect to needing the GDK, here is a work-alike for seeing whether two ASICs are on the same board. Also, one can parse the output of 'nvidia-smi topo -m' to get more topology information:

  // Group GPUs by board
  for (int i = 0; i < remaining.size(); ++i) {
      cudaDeviceProp prop_a;
      CUDA_CHECK(cudaGetDeviceProperties(&prop_a, remaining[i]));
      for (int j = i + 1; j < remaining.size(); ++j) {
          cudaDeviceProp prop_b;
          CUDA_CHECK(cudaGetDeviceProperties(&prop_b, remaining[j]));
          int same_board = 0;
          if(prop_a.isMultiGpuBoard && prop_b.isMultiGpuBoard)
              same_board = prop_a.multiGpuBoardGroupID == prop_b.multiGpuBoardGroupID ? 1 : 0;
          if (same_board) {
              pairs->push_back(DevicePair(remaining[i], remaining[j]));
              DLOG(INFO) << "\n\nSame board: " << remaining[i] << ":" << remaining[j] << " - Board ID:" << prop_a.multiGpuBoardGroupID;
              remaining.erase(remaining.begin() + j);
          }
      }
  }

@bhack
Contributor

bhack commented Apr 11, 2015

I think that P2P will be covered by the HSA initiative of the HSA Foundation chipmakers.

@brenton2015

@cypof Could you show the GPU topology on 4-GPU machines?

@thatguymike
Contributor

This depends on the general system topology. The code currently tries to align the tree-reduction strategy with the system. For example, if you have a dual-GPU board, the code attempts to do the first reduction on that board, and then one of those ASICs communicates with one ASIC on the other dual-GPU board.

If you have 4 independent GPUs, it tries to find P2P pairs, but the current code doesn't go much deeper into the topology, e.g. looking for host PCIe bridges (like PLX). It will, however, attempt to avoid crossing PCIe root complexes, where P2P doesn't work, until the top of the reduction tree.

Basically, the code tries to optimize bandwidth use at each level of the reduction tree. It should do the right thing for most 4-GPU systems. For more complex server systems with PLX bridges or multiple root complexes, there is more tuning to be done.

However, the primary issue in the code at the moment is that there is a race somewhere, and even with the incoming random seed fixed we get different results from run to run. This can lead to things failing to converge.

@wkal

wkal commented Apr 19, 2015

Sorry, my previous comment was not clear, so I am adding this one to correct it. I have 4 GPUs to run the current multi-GPU branch; however, I can only run on GPUs 0 and 1 successfully. If I try to run on GPUs 2 and 3, Caffe fails with:
*** SIGSEGV (@0x0) received by PID 1503 (TID 0x7f40e8d7e9c0) from PID 0; stack trace: ***
@ 0x7f40e79ce340 (unknown)
@ 0x41267b SolverContext::run()
@ 0x40b520 main
@ 0x7f40e7619ec5 (unknown)
@ 0x40b96e (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)
The command is: ../../build/examples/parallel/gpus.bin lenet_solver.prototxt 2:3.

This is unexpected, because I can run GPU 2 and GPU 3 each in single-GPU mode, and I can run on GPUs 0 and 1 at the same time, so I see no reason why GPUs 2 and 3 cannot run together.

@sontran
My system also has 4 GPUs; why can't I run on GPU 2 and GPU 3? Do I need to adjust the topology of the GPUs? Could you help me?

@cypof
Member Author

cypof commented Apr 19, 2015

@wkal I see you are running the gpus.bin executable. The code is now in caffe.bin itself, so you might be running an old version. Try a make clean and run caffe train instead.

bhack mentioned this pull request Apr 20, 2015
@bhack
Contributor

bhack commented Jul 28, 2015

NVIDIA=CUDA?

@naibaf7
Member

naibaf7 commented Jul 28, 2015

@bhack @cypof
I think he meant that they just have an NVIDIA GPU setup right now and thus don't have any practical interest themselves in doing the port to OpenCL, which is fine.
I'd just like to see how the two backends would compare, and whether multi-GPU could even work across backends (OpenCL to CUDA). It's just for the thrill of it, because flexibility is always a good thing.

@thatguymike
Contributor

The same techniques should generally work. However, there are going to be some issues with getting good performance, notably no standard way to do P2P DMA transfers between devices and no standard way to walk the topology of the PCIe setup. Without P2P DMA you become communication bound VERY quickly, since everything has to be copied through the host. Walking the topology to figure out who can talk to whom at high speed is also part of this. Multi-device support in OpenCL is generally a little trickier to deal with, at least if you need tight communication and synchronization between devices.

@bhack
Contributor

bhack commented Jul 28, 2015

@thatguymike
Contributor

Yes, I worked on the AMD extension many years ago and was a member of the committee for many years. ;-) My point was that there is no standard way in OpenCL to do this. The memory model is a little tight, and you can see some of the issues crop up when using the extension, specifically around map/unmap and synchronization on buffers between multiple devices and the host.

Again, all of the work @cypof and I did on this PR should functionally port to OpenCL.

@bhack
Contributor

bhack commented Jul 28, 2015

@thatguymike It is easy to create universal features on hardware where you are the only vendor ;)

@naibaf7
Member

naibaf7 commented Jul 28, 2015

@thatguymike
Thanks for the input. Of course I'll start with a functional version first and then try to see which memory transfers can be done with P2P DMA. I have plenty of GPUs here to test with, so I'll see.

@shelhamer
Member

This is very close to merge, and our only reservation is how to refactor 346e262 for clarity, since the use of member variables and the naming of InternalThreadEntry() and InternalThread::entry() are not immediately obvious. We should have an alternative commit ready soon.

@naibaf7
Member

naibaf7 commented Aug 3, 2015

Most of the code seems compatible with #2610, except for device initialization and the synced memory. Those will have to be refactored to the DeviceContext abstraction of #2610. I'll do that after the merge. After that, it should be relatively easy to also support OpenCL devices.

@ghost

ghost commented Aug 4, 2015

Does anyone else have the issue of the loss blowing up after a few thousand iterations? I trained on ImageNet with 2 GeForce 980 GPUs; initially the losses were normal, decreasing from 7.xx, then suddenly the loss blew up to ~87 after ~20k iterations. The batch size was 128, and the learning rate, momentum, stepsize, etc. were the same as in the bvlc_reference_caffenet solver. I had to resume from the 10k snapshot, and it trained better again. This is not a one-time thing; it has happened to me at least 4-5 times, and it has been reproduced on another machine as well (with Titan Xs).

@cdoersch
Contributor

cdoersch commented Aug 4, 2015

@sri-25 I haven't seen an issue like this. Just to be sure (since you didn't mention it in your post), the network you're training works on a single GPU with twice the batch size?

shelhamer mentioned this pull request Aug 6, 2015
@shelhamer
Member

See #2870 for my packaging of this PR for merge -- the history preserves authorship and I only addressed my concerns about InternalThread, a minor TODO, and style fixes.

@ghost

ghost commented Aug 6, 2015

@cdoersch Yeah, it trains well on a single GPU (with 2x batch size). FYI, I'm training on ImageNet with bvlc_reference_caffenet solver/train_val.

@@ -82,7 +96,7 @@ class GradientBasedSolverTest : public MultiDeviceTest<TypeParam> {
" value: 1.0 "
" } "
" data_filler { "
" type: 'gaussian' "
" type: '" << (run.const_data ? "constant" : "gaussian") << "' "

The targets top needs to always be Gaussian, or else the test is ineffective. The constant vs. gaussian condition should apply to the filler for the data. Switching this reveals that the CheckAccumulation test is broken since it doesn't set constant data. More ominously, it reveals that all the TestLeastSquares* tests fail for multi-GPU. Let's follow up on #2870.

@shelhamer
Member

Closing since this is continued in #2870.

@loulansuiye

@thatguymike I replaced TensorRT 2.x with version 3.x and cuDNN 7.x, but the Caffe Makefile.config cannot find cuDNN now. Did you run into a similar issue, or do you have any other suggestions? Thank you.
