
Fixes GPU memory overflow when using replay buffer #130

Merged
julienroyd merged 8 commits into trunk from julien-fix-gpu-mem-bust on Apr 4, 2024

Conversation

julienroyd
Contributor

With config.replay.use = True, we get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 63.25 MiB is free. Including non-PyTorch memory, this process has 79.08 GiB memory in use. Of the allocated memory 78.38 GiB is allocated by PyTorch, and 196.30 MiB is reserved by PyTorch but unallocated.

This was likely introduced in PR#122, as it does not occur when reverting to commit a8768ee (PR#121), and it only occurs when the replay buffer is used. This PR fixes the issue by detaching every tensor pushed into the replay buffer and moving it to the CPU.
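
For context, a minimal sketch of the kind of fix described above (the helper below is hypothetical and not the actual PR diff): tensors are detached and moved to the CPU before being stored, so the replay buffer no longer pins GPU memory or keeps the autograd graph alive.

import torch

def to_cpu_detached(x):
    # Recursively detach any tensor found in x and move it to CPU (hypothetical helper).
    if isinstance(x, torch.Tensor):
        return x.detach().cpu()
    if isinstance(x, (list, tuple)):
        return type(x)(to_cpu_detached(v) for v in x)
    if isinstance(x, dict):
        return {k: to_cpu_detached(v) for k, v in x.items()}
    return x

# A replay buffer push would then store only CPU tensors, e.g.:
# buffer.append(tuple(to_cpu_detached(t) for t in transition))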

To reproduce the error:

# Assumed setup (not included in the original comment): imports and config
# instantiation from the gflownet package; other required fields such as the
# log directory are omitted here.
from gflownet.config import Config
from gflownet.tasks.seh_frag_moo import SEHMOOFragTrainer

config = Config()

REPLAY = True
N_STEPS = 80_000
OBJECTIVES = ["seh", "qed", "sa", "mw"]
FOCUS_TYPE = "dirichlet"
PREF_TYPE = None

config.num_workers = 0
config.seed = 0
config.print_every = 100
config.validate_every = 100
config.num_final_gen_steps = 100
config.num_training_steps = N_STEPS
config.pickle_mp_messages = True
config.overwrite_existing_exp = True
config.opt.learning_rate = 1e-4
config.opt.lr_decay = 20_000
config.algo.method = "TB"
config.algo.sampling_tau = 0.95
config.algo.train_random_action_prob = 0.01
config.algo.tb.Z_learning_rate = 1e-3
config.algo.tb.Z_lr_decay = 50_000
config.model.num_layers = 2
config.model.num_emb = 256
config.task.seh_moo.objectives = OBJECTIVES
config.task.seh_moo.n_valid = 15
config.task.seh_moo.n_valid_repeats = 128
config.cond.temperature.sample_dist = "constant"
config.cond.temperature.dist_params = [60.0]
config.cond.weighted_prefs.preference_type = PREF_TYPE
config.cond.focus_region.focus_type = FOCUS_TYPE
config.cond.focus_region.focus_cosim = 0.98
config.cond.focus_region.focus_limit_coef = 0.20
config.cond.focus_region.focus_model_training_limits = (0.25, 0.75)
config.cond.focus_region.focus_model_state_space_res = 30
config.cond.focus_region.max_train_it = 5_000
config.replay.use = REPLAY
config.replay.warmup = 1000
config.replay.hindsight_ratio = 0.3
config.replay.capacity = 100_000

trial = SEHMOOFragTrainer(config)
trial.run()

@julienroyd julienroyd requested a review from bengioe as a code owner April 1, 2024 21:04
Collaborator

@bengioe bengioe left a comment


LGTM, just had one comment.
I need to test with num_workers=0 more often 😅

@julienroyd
Contributor Author

@bengioe FYI, I've slipped another small change in there to prevent the user from creating new attributes outside the config class definitions.

I've had issues when comparing runs across different commits: configs that I thought were set correctly were in fact using the default value, because that particular config option had a different name or had moved.
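
For illustration, a rough sketch of what such a guard could look like (the names below are illustrative, not necessarily the actual change): assigning to an attribute that is not declared on the dataclass raises an error instead of silently creating a new field.

from dataclasses import dataclass, fields

class StrictDataClass:
    # Reject assignments to attributes that were not declared as dataclass fields.
    def __setattr__(self, name, value):
        if name not in {f.name for f in fields(self)}:
            raise AttributeError(
                f"{type(self).__name__} has no config attribute '{name}'; "
                "it may have been renamed or moved."
            )
        super().__setattr__(name, value)

@dataclass
class ReplayConfig(StrictDataClass):
    use: bool = False
    capacity: int = 10_000

cfg = ReplayConfig()
cfg.use = True           # fine
# cfg.capcity = 100_000  # raises AttributeError instead of silently creating a typo'd attribute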

@julienroyd julienroyd merged commit 2a24bb0 into trunk Apr 4, 2024
4 checks passed
@julienroyd julienroyd deleted the julien-fix-gpu-mem-bust branch April 4, 2024 18:16