
Fixes GPU memory overflow when using replay buffer #130

Merged
julienroyd merged 8 commits into trunk from julien-fix-gpu-mem-bust on Apr 4, 2024

Conversation

julienroyd
Contributor

With config.replay.use = True, we get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 63.25 MiB is free. Including non-PyTorch memory, this process has 79.08 GiB memory in use. Of the allocated memory 78.38 GiB is allocated by PyTorch, and 196.30 MiB is reserved by PyTorch but unallocated.

This was likely introduced in PR#122, as it does not occur when reverting to commit a8768ee (PR#121), and it only occurs when the replay buffer is used. This PR fixes the issue by detaching every tensor pushed into the replay buffer and moving it to the CPU.
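
For context, a minimal sketch of the kind of fix described above (the helper below is hypothetical and not the actual PR diff): tensors are detached and moved to the CPU before being stored, so the replay buffer no longer pins GPU memory or keeps the autograd graph alive.

import torch

def to_cpu_detached(x):
    # Recursively detach any tensor found in x and move it to CPU (hypothetical helper).
    if isinstance(x, torch.Tensor):
        return x.detach().cpu()
    if isinstance(x, (list, tuple)):
        return type(x)(to_cpu_detached(v) for v in x)
    if isinstance(x, dict):
        return {k: to_cpu_detached(v) for k, v in x.items()}
    return x

# A replay buffer push would then store only CPU tensors, e.g.:
# buffer.append(tuple(to_cpu_detached(t) for t in transition))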

To reproduce the error:

# Assumed setup (not included in the original comment): imports and config
# instantiation from the gflownet package; other required fields such as the
# log directory are omitted here.
from gflownet.config import Config
from gflownet.tasks.seh_frag_moo import SEHMOOFragTrainer

config = Config()

REPLAY = True
N_STEPS = 80_000
OBJECTIVES = ["seh", "qed", "sa", "mw"]
FOCUS_TYPE = "dirichlet"
PREF_TYPE = None

config.num_workers = 0
config.seed = 0
config.print_every = 100
config.validate_every = 100
config.num_final_gen_steps = 100
config.num_training_steps = N_STEPS
config.pickle_mp_messages = True
config.overwrite_existing_exp = True
config.opt.learning_rate = 1e-4
config.opt.lr_decay = 20_000
config.algo.method = "TB"
config.algo.sampling_tau = 0.95
config.algo.train_random_action_prob = 0.01
config.algo.tb.Z_learning_rate = 1e-3
config.algo.tb.Z_lr_decay = 50_000
config.model.num_layers = 2
config.model.num_emb = 256
config.task.seh_moo.objectives = OBJECTIVES
config.task.seh_moo.n_valid = 15
config.task.seh_moo.n_valid_repeats = 128
config.cond.temperature.sample_dist = "constant"
config.cond.temperature.dist_params = [60.0]
config.cond.weighted_prefs.preference_type = PREF_TYPE
config.cond.focus_region.focus_type = FOCUS_TYPE
config.cond.focus_region.focus_cosim = 0.98
config.cond.focus_region.focus_limit_coef = 0.20
config.cond.focus_region.focus_model_training_limits = (0.25, 0.75)
config.cond.focus_region.focus_model_state_space_res = 30
config.cond.focus_region.max_train_it = 5_000
config.replay.use = REPLAY
config.replay.warmup = 1000
config.replay.hindsight_ratio = 0.3
config.replay.capacity = 100_000

trial = SEHMOOFragTrainer(config)
trial.run()

@julienroyd julienroyd requested a review from bengioe as a code owner April 1, 2024 21:04
Collaborator

@bengioe bengioe left a comment


LGTM, just had one comment.
I need to test with num_workers=0 more often 😅

@julienroyd
Contributor Author

@bengioe FYI, I've slipped another small change in there to prevent the user from creating new attributes outside the config class definitions.

I've had issues when comparing runs across different commits: configs that I thought were set correctly were in fact using the default value, because that particular config option had a different name or had moved.
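
For illustration, a rough sketch of what such a guard could look like (the names below are illustrative, not necessarily the actual change): assigning to an attribute that is not declared on the dataclass raises an error instead of silently creating a new field.

from dataclasses import dataclass, fields

class StrictDataClass:
    # Reject assignments to attributes that were not declared as dataclass fields.
    def __setattr__(self, name, value):
        if name not in {f.name for f in fields(self)}:
            raise AttributeError(
                f"{type(self).__name__} has no config attribute '{name}'; "
                "it may have been renamed or moved."
            )
        super().__setattr__(name, value)

@dataclass
class ReplayConfig(StrictDataClass):
    use: bool = False
    capacity: int = 10_000

cfg = ReplayConfig()
cfg.use = True           # fine
# cfg.capcity = 100_000  # raises AttributeError instead of silently creating a typo'd attribute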

@julienroyd julienroyd merged commit 2a24bb0 into trunk Apr 4, 2024
4 checks passed
@julienroyd julienroyd deleted the julien-fix-gpu-mem-bust branch April 4, 2024 18:16