Support Optimizer-in-the-backward #1737
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1737       +/-   ##
===========================================
- Coverage   69.33%   25.44%   -43.89%
===========================================
  Files         305      305
  Lines       15892    16089     +197
===========================================
- Hits        11018     4094    -6924
- Misses       4874    11995    +7121

☔ View full report in Codecov by Sentry.
Could we draw loss curves in Weights & Biases to show that the numerics are the same with and without optimizer-in-the-backward?

The loss curves have been added in the comments section of the third column of the table on the right-hand side.
recipes/full_finetune_distributed.py
Outdated
optimizer,
opt_state_dict,
self._device,

if not optimizer_in_bwd:
This is a nit, but let's switch the order of this if/else so that it's if optimizer_in_bwd. Then it's closer to what's in the single-device recipe and easier to compare between the two.
We have the initial zero-out in the single-device recipe at:

torchtune/recipes/full_finetune_single_device.py
Lines 591 to 593 in 665ab3f

# zero out the gradients before starting training
if not self._optimizer_in_bwd:
    self._optimizer.zero_grad()
recipes/full_finetune_distributed.py
Outdated
raise NotImplementedError(
    "Gradient clipping is not supported after optimizer-in-the-backward."
)
Is this actually due to optimizer-in-backward, or something else? I don't think we have such a check in the single-device recipe.
It's due to optimizer-in-backward.
Optimizer-in-backward calls .step() and .zero_grad() during loss.backward(), so the grads are already None by the time torch.nn.utils.clip_grad_norm_ would be called, and gradient clipping cannot be applied.
The single-device recipe has the same issue when optimizer_in_bwd=True and clip_grad_norm=True.
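For intuition, here is a minimal, self-contained sketch of the optimizer-in-backward mechanism (an illustration only, not torchtune's actual implementation): one optimizer per parameter, stepped from a post-accumulate-grad hook, which is why no gradients remain for clip_grad_norm_ afterwards.

import torch

# Minimal sketch of optimizer-in-backward: one optimizer per parameter,
# stepped from a post-accumulate-grad hook so the gradient is consumed and
# freed inside loss.backward() itself.
model = torch.nn.Linear(8, 8)
optimizers = {p: torch.optim.SGD([p], lr=0.1) for p in model.parameters()}

def _step_and_clear(param: torch.Tensor) -> None:
    optimizers[param].step()
    optimizers[param].zero_grad()  # set_to_none=True by default, so param.grad becomes None

for p in model.parameters():
    p.register_post_accumulate_grad_hook(_step_and_clear)

loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Every .grad is already None once backward() returns, so
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) would have nothing to clip.
print([p.grad for p in model.parameters()])  # [None, None]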
recipes/full_finetune_distributed.py
Outdated
@@ -722,6 +775,27 @@ def train(self) -> None:

        self._profiler.stop()

    def get_lr_scheduler(self) -> float:
If we are gonna have this as a utility I would take it out of this recipe since it's equally applicable to the single-device case. Also I wouldn't call it get_lr_scheduler
, since we're really getting the current lr, not the scheduler itself.
The single-device recipe gets the LR under the assumption that all optimizers have the same LR.
Shall we apply the logic here to the single-device recipe as well, rename it to get_lr, and move it to /torchtune/utils?
Or should we just adopt the same assumption as the single-device recipe?
Technically I think this was also previously the assumption in this recipe, right? Since we just log the LR from the first param group. So we should be able to maintain the same logic for both recipes.
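For concreteness, a minimal sketch of what such a shared get_lr utility could look like; the signature and the handling of a dict of per-parameter optimizers are assumptions for illustration, not the PR's actual code.

from typing import Dict, Union

import torch

def get_lr(optimizer: Union[torch.optim.Optimizer, Dict[str, torch.optim.Optimizer]]) -> float:
    """Return the current LR, assuming every param group shares the same value.

    Accepts either a single optimizer or a dict of per-parameter optimizers
    (as used by optimizer-in-backward).
    """
    if isinstance(optimizer, torch.optim.Optimizer):
        param_groups = optimizer.param_groups
    else:
        param_groups = [pg for opt in optimizer.values() for pg in opt.param_groups]
    lrs = {pg["lr"] for pg in param_groups}
    if len(lrs) != 1:
        raise RuntimeError(f"Expected a single learning rate across param groups, got {lrs}")
    return lrs.pop()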
version.parse(torch.__version__).base_version < "2.5.0",
reason="torch >= 2.5 required",
)
def test_optimizer_in_backward(self):
I'm wondering why we create a whole new test in here rather than adding a distributed test case in the existing TestOptimInBackward test case (or if we want to inherit from FSDPTest
, a new distributed version of that class in the same file). Because testing optimizer-in-backward as part of a class that is otherwise meant to test our fully_shard + state dict save and load logic feels a bit strange to me.
The test here aims to prove that models trained with optimizer-in-backward achieve the same performance as models trained without it.
The state_dict saving and loading parts test optimizer-in-backward's wrapper, which is a bit different from a traditional optimizer.
Neither is covered in TestOptimInBackward.
If we want to do such an end-to-end test, maybe we can add a recipe test case instead? See e.g. test_full_finetune_distributed.py here: with this test case you can run the full end-to-end recipe on a small test model and set optimizer_in_bwd=True directly from the config.
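A rough sketch of what such a recipe-level test case might look like, following the pattern of torchtune's existing recipe tests; TUNE_PATH, the integration_test marker, and the specific config overrides are assumptions for illustration, not the test actually landed in this PR.

import runpy
import sys

import pytest

from tests.common import TUNE_PATH  # assumed helper from torchtune's existing recipe tests

class TestFullFinetuneDistributedOptimInBwd:
    @pytest.mark.integration_test
    def test_loss_with_optimizer_in_bwd(self, tmpdir, monkeypatch):
        # Run the full recipe end-to-end, enabling optimizer-in-backward
        # directly from the command line.
        cmd = f"""
        tune run --nnodes 1 --nproc_per_node 2 full_finetune_distributed
            --config llama2/7B_full
            output_dir={tmpdir}
            optimizer_in_bwd=True
            max_steps_per_epoch=2
        """.split()
        monkeypatch.setattr(sys, "argv", cmd)
        runpy.run_path(TUNE_PATH, run_name="__main__")
        # Compare the logged losses against a baseline run with optimizer_in_bwd=False.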
Context
What is the purpose of this PR?
Enable optimizer-in-the-backward for full_finetune_distributed.
Changelog
Add the _optimizer_in_bwd config.
Test plan
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 optimizer_in_bwd=True
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False max_steps_per_epoch=2 epochs=10 optimizer_in_bwd=True resume_from_checkpoint=True checkpointer.recipe_checkpoint=/tmp/Llama-2-7b-hf/recipe_state.pt checkpointer.checkpoint_files=[hf_model_0001_1.pt,hf_model_0002_1.pt]
pytest tests/torchtune/training/test_distributed.py -k test_optimizer_in_backward
Memory cost analysis:
With each layer's gradients costing 193 MB of memory, the original (left) case hits its peak memory at the 31st layer, having accumulated 193 MB × 30 ≈ 5.8 GB of gradients.
The case on the right with optimizer-in-the-backward frees this memory during the backward pass and therefore reaches a lower peak memory.
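As a side note, a rough sketch (assuming a CUDA device and a toy model standing in for the recipe's model) of how such peak-memory numbers can be measured with PyTorch's allocator statistics:

import torch

device = "cuda"
# Toy stand-in for the recipe's model; sizes chosen only for illustration.
model = torch.nn.Sequential(*[torch.nn.Linear(2048, 2048) for _ in range(16)]).to(device)

torch.cuda.reset_peak_memory_stats()
loss = model(torch.randn(32, 2048, device=device)).sum()
loss.backward()  # with optimizer-in-backward, each grad would be freed inside this call
torch.cuda.synchronize()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")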
Training time and loss analysis: