Hi, I'm getting a CUDA out-of-memory error when I run the chat_assistant training's run_fsdp.sh script on a 34B model. Changing the model from 7B to 34B is the only change I made.
Local edits
I only edited chat_assistant/training/run_fsdp.sh to replace the 7B model with a 34B model. Screenshot:
Stack trace
Traceback (most recent call last):
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 252, in <module>
    main(args)
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 223, in main
    trainer.train()
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1986, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 1 has a total capacty of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
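Side note: the error message itself suggests trying max_split_size_mb when reserved-but-unallocated memory is large. That option is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable and has to be set before the first CUDA allocation. A minimal sketch of what I could try (the value 128 is an arbitrary starting point of mine, not something from the log):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# i.e. before importing torch (or at least before any tensor touches the GPU).
# max_split_size_mb caps the size of cached blocks the allocator will split,
# which can reduce fragmentation at some throughput cost.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# ...then import torch, build the Trainer, and train as before.
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Equivalently, this could be an export line at the top of run_fsdp.sh before torchrun is invoked. That said, with only ~1.24 GiB reserved-but-unallocated here, fragmentation looks like a minor factor compared to the overall memory footprint.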
Hardware / GPU info
I was running this job on a machine with 8 H100 GPUs. Below is a screenshot of nvidia-smi -l 1 output; you can see that in the span of 5 seconds the GPU memory usage spiked from around 42 GiB on all 8 devices to 80 GiB, and then the process crashed.
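For context, here is my own back-of-the-envelope arithmetic (not taken from the logs) for why a 34B model is tight on 8×80 GB even with full FSDP sharding. With bf16 parameters and gradients plus fp32 master weights and Adam states, mixed-precision training needs roughly 16 bytes per parameter; sharding that across 8 GPUs still leaves ~63 GiB per device for model and optimizer state alone, before activations and temporary buffers:

```python
# Rough per-GPU memory estimate for fully sharded (FULL_SHARD) mixed-precision
# training with Adam. The byte counts are the usual rule of thumb, not numbers
# measured from this run.
params = 34e9          # 34B parameters
bytes_per_param = (
    2     # bf16 parameters
    + 2   # bf16 gradients
    + 4   # fp32 master copy of parameters
    + 4   # Adam first moment (fp32)
    + 4   # Adam second moment (fp32)
)         # = 16 bytes per parameter in total
num_gpus = 8

per_gpu_gib = params * bytes_per_param / num_gpus / 2**30
print(f"~{per_gpu_gib:.1f} GiB per GPU before activations")
```

If that estimate is in the right ballpark, activations for the first backward pass would push each 80 GiB card over the edge, which would match the spike from ~42 GiB to 80 GiB seen in nvidia-smi.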
Full logs:
full_logs_dhs.txt
Let me know if there's any additional information I can provide. Thanks in advance!