Shared pinned buffers #120

bengioe · 2024-02-23T22:50:29Z

This PR implements a better way of sharing torch tensors between processes: (large enough) shared tensors are created once and then reused as a transfer mechanism. On the fragment environment (seh_frag.py) I'm getting a 30% wall time improvement for simple settings, with batch size 64 (I'm sure we could have fun maxing that out and see how far we can take GPU utilization).
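To illustrate the "allocate once, reuse for every transfer" idea, here is a minimal sketch. It is not the code in this PR: it swaps the shared torch tensors for a standard-library shared memory block carrying raw bytes, and `TransferBuffer` and `roundtrip` are hypothetical names made up for the example.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<Q")  # 8-byte little-endian payload length


class TransferBuffer:
    """Hypothetical fixed-size buffer, allocated once and reused for every message."""

    def __init__(self, size=None, name=None):
        # Create a new block when no name is given, otherwise attach to an existing one.
        create = name is None
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size or 0)

    def write(self, payload: bytes) -> None:
        if HEADER.size + len(payload) > self.shm.size:
            raise ValueError("payload does not fit in the preallocated buffer")
        HEADER.pack_into(self.shm.buf, 0, len(payload))
        self.shm.buf[HEADER.size:HEADER.size + len(payload)] = payload

    def read(self) -> bytes:
        (n,) = HEADER.unpack_from(self.shm.buf, 0)
        return bytes(self.shm.buf[HEADER.size:HEADER.size + n])

    def close(self, unlink=False):
        self.shm.close()
        if unlink:
            self.shm.unlink()


def roundtrip(messages):
    writer = TransferBuffer(size=1024)             # allocated once
    reader = TransferBuffer(name=writer.shm.name)  # what a second process would do
    try:
        received = []
        for message in messages:
            writer.write(message)                  # same block reused each time
            received.append(reader.read())
        return received
    finally:
        reader.close()
        writer.close(unlink=True)
```

In a real multi-process setup the reader would live in another process and the two sides would need some synchronization before each read, which this sketch leaves out.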

Some notes:

The effect is mostly felt when sampling (which is where most time is spent in the first place); sending Batch and GraphActionCategoricals through shared buffers reduces wall time
Passing batches to the training loop (which are much bigger and "rarer") doesn't seem to give a significant speedup, but I've implemented it nonetheless for future-proofing

Other changes:

Removed local grad clipping, which is not quite correct; the difference is minimal but relevant, and there's also a nice speedup
Made all algorithms inherit from GFNAlgorithm
global_cfg is set for all algorithms
cond_info is now folded into the batch object rather than being passed as an argument everywhere
Fixed GraphActionCategorical.entropy when masks are used; previously, gradients wrt logits would be NaN.
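As a rough illustration of why masked entries need care in an entropy computation, here is a plain-Python sketch. It is not the GraphActionCategorical code, and it only shows the forward-value version of the problem (a masked entry contributes 0 * -inf); the assumption is that the NaN gradients mentioned above come from the same kind of term under autograd. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over a flat list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def entropy_naive(logits):
    logp = log_softmax(logits)
    # A masked entry has logp = -inf and p = 0, so p * logp is 0 * -inf = nan.
    return -sum(math.exp(lp) * lp for lp in logp)


def entropy_masked(logits, mask):
    # Set masked logits to -inf for normalization, then skip those terms entirely.
    masked = [l if keep else -math.inf for l, keep in zip(logits, mask)]
    logp = log_softmax(masked)
    return -sum(math.exp(lp) * lp for lp, keep in zip(logp, mask) if keep)
```

With two equally likely valid actions and one masked action, `entropy_masked` gives log 2, while `entropy_naive` applied to logits where the masked entry is already -inf gives NaN.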

Note, EnvelopeQL is still in a broken state, will fix in #127

bengioe · 2024-03-01T01:02:47Z

I'm of a mind to merge this actually. It's not the cleanest implementation possible but there are significant gains here (as mentioned, a 30% speedup with the default settings on seh_frag.py). Will test across tasks and report back.

bengioe · 2024-03-11T18:09:21Z

Made significant simplifications to the method by subclassing Pickler/Unpickler, and found some very tricky bugs (I was misusing pinned CUDA buffers and ended up with rare race conditions). Speedups remain (might even be a bit faster).
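For readers unfamiliar with the pickling approach, here is a minimal sketch of how Pickler/Unpickler subclasses can route large payloads through a preallocated buffer. It is an assumption about the general shape, not the code in this PR: it uses the standard `persistent_id` / `persistent_load` hooks, a `bytearray` stands in for a tensor, and `BufferPickler`, `BufferUnpickler`, `send` and `receive` are hypothetical names.

```python
import io
import pickle


class BufferPickler(pickle.Pickler):
    """Hypothetical pickler that copies bytearray payloads into one preallocated buffer."""

    def __init__(self, file, buffer: bytearray):
        super().__init__(file, protocol=pickle.HIGHEST_PROTOCOL)
        self.buffer = buffer
        self.offset = 0

    def persistent_id(self, obj):
        # Only binary payloads are redirected; everything else pickles normally.
        if isinstance(obj, bytearray):
            start, end = self.offset, self.offset + len(obj)
            if end > len(self.buffer):
                raise ValueError("preallocated buffer is too small")
            self.buffer[start:end] = obj
            self.offset = end
            return ("buf", start, end)  # small handle stored in the pickle stream
        return None


class BufferUnpickler(pickle.Unpickler):
    """Hypothetical unpickler that resolves handles against the same buffer."""

    def __init__(self, file, buffer: bytearray):
        super().__init__(file)
        self.buffer = buffer

    def persistent_load(self, pid):
        tag, start, end = pid
        if tag != "buf":
            raise pickle.UnpicklingError("unknown persistent id")
        return bytearray(self.buffer[start:end])


def send(obj, buffer):
    stream = io.BytesIO()
    BufferPickler(stream, buffer).dump(obj)
    return stream.getvalue()  # small message: structure plus handles only


def receive(data, buffer):
    return BufferUnpickler(io.BytesIO(data), buffer).load()
```

The appeal of this design is that arbitrary nested objects keep working through the normal pickle machinery, and only the bulky payloads take the buffer path.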

bengioe · 2024-05-09T15:06:35Z

Merged with trunk + made a few fixes. Pretty happy with this now!

bengioe added 5 commits February 28, 2024 15:11

first throw at refactoring SamplingIterator

7dbca12

Merge branch 'trunk' into bengioe-better-iterators

939cb56

changed all iterators to DataSource

dfba1ca

lots of little fixes, tested all tasks, better device management

e5239fb

style

43dfc2b

bengioe added 14 commits March 7, 2024 08:25

change batch size hyperparameters + fix nested dataclasses

279ecfc

Merge branch 'trunk' into bengioe-better-iterators

2ba251a

move things around & prevent circular import

282bbfb

tox

c3bc6d0

fix imports

b1c5630

replace device references with get_worker_device

a64a639

little fixes

28bcc59

a few more stragglers

4811e7c

proof of concept of using shared pinned buffers

7d32ac1

32mb buffer

d4a2a7d

add to DataSource

27dfc23

various fixes

e9f1dc1

major simplification by reusing pickling mechanisms

c048e77

memory copy + fixes and doc

acfe070

bengioe force-pushed the bengioe-mp-with-batch-buffers branch from 648961f to acfe070 Compare March 11, 2024 16:00

Merge branch 'trunk' into bengioe-mp-with-batch-buffers

9454da8

bengioe mentioned this pull request Apr 4, 2024

Add MP Tensor buffers aanjaa/gflownet#8

Merged

bengioe added 4 commits May 8, 2024 16:57

Merge branch 'trunk' into bengioe-mp-with-batch-buffers

2b9da70

fix global_cfg + opt_Z when there's no Z

907ffcd

fix entropy when masks are used

60722a7

small fixes

f859640

bengioe marked this pull request as ready for review May 9, 2024 14:55

removing timing prints

d536233

bengioe changed the title ~~Proof of concept of using shared pinned buffers~~ Shared pinned buffers May 9, 2024

bengioe requested a review from julienroyd May 9, 2024 15:05

bengioe mentioned this pull request Oct 8, 2024

Better MP, multi-gpu, atom graphs, replay, GPS, LS-GFN, and fixes #141

Open
